I'm working on an ETL process to extract notes from Evernote's ENML format and I'm trying to figure out if I should use Beautiful Soup (BS4) or stick with Python's built-in xml.etree.ElementTree. I've heard that Beautiful Soup is easier to use, but I've also read that the standard library can be faster. Considering these points, is there any reason I should lean towards BS4 instead of using the standard library?
3 Answers
xml.etree.ElementTree is actually quite nice for XML parsing, and it has decent filtering functionality through find(), findall(), and iter(), including a limited XPath subset. It doesn't give you typed objects (attributes and text come back as plain strings), but it's still effective for parsing structured XML like ENML. From what I understand, Beautiful Soup is geared more towards scraping messy HTML and might be overkill for your situation.
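To make that concrete, here's a minimal sketch of pulling text and checkbox state out of a note with ElementTree. The sample note and the extract_note helper are made up for illustration; the en-note and en-todo element names are from ENML, but real notes may contain markup this doesn't handle (for example HTML entities that the parser doesn't know about).

```python
import xml.etree.ElementTree as ET

# Hypothetical ENML sample, hand-written for this example.
ENML_SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">
<en-note>
  <div>Buy milk</div>
  <div><en-todo checked="true"/>Call Alice</div>
  <div><en-todo checked="false"/>Email Bob</div>
</en-note>"""


def extract_note(enml: str) -> dict:
    """Return the note's plain text and the state of each en-todo checkbox."""
    root = ET.fromstring(enml)
    # itertext() walks all text and tail nodes; split/join collapses whitespace.
    text = " ".join("".join(root.itertext()).split())
    # iter() filters descendants by tag name; attributes are plain strings.
    todos = [todo.get("checked") == "true" for todo in root.iter("en-todo")]
    return {"text": text, "todos": todos}


note = extract_note(ENML_SAMPLE)
print(note["text"])
print(note["todos"])
```

The same root.iter() / root.findall() pattern should extend to other ENML elements you care about, such as en-media.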
I've used xml.etree.ElementTree for various XML data sources, and it works perfectly fine for large datasets. If performance is a priority, I'd agree that sticking with the standard library could be your best bet for parsing ENML.
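For large inputs, one approach is ET.iterparse, which streams the document instead of loading the whole tree. This is only a sketch: the export/note/title layout below is a hypothetical wrapper I invented for the example, not something from your data, so adjust the tag names to whatever your files actually contain.

```python
import io
import xml.etree.ElementTree as ET


def iter_titles(source):
    """Yield the title of each <note> element as soon as it has been fully parsed."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "note":  # hypothetical tag name
            yield elem.findtext("title", default="")
            # Drop the element's children so memory use stays flat on big files.
            elem.clear()


# Hypothetical in-memory stand-in for a large file on disk.
sample = io.BytesIO(
    b"<export>"
    b"<note><title>A</title></note>"
    b"<note><title>B</title></note>"
    b"</export>"
)
titles = list(iter_titles(sample))
print(titles)
```

iterparse also accepts a filename, so the same function should work on a path instead of the BytesIO object.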
Yeah, I think you're right about BS4 mainly being for HTML. One caution if you do go that route: lxml's HTML parser doesn't fully replicate real browser behavior, which can lead to misparsing. However, if you just need to extract data from structured XML, ElementTree should suit your needs well.
That sounds good! I will likely go with xml.etree.ElementTree. ENML is a variant of XML, so sticking with the standard library makes sense.