While writing a script to merge feeds with Universal Feed Parser, I couldn't ignore how slow the parsing was. This wasn't the first time I had observed the slowness directly; I ran a speed test on it four years ago. At the time, BeautifulSoup was a big factor in the slowdown.
I only have 5 feeds to parse, so honestly it really isn't too bad, since I wouldn't be running the script interactively. Still, I wanted to know whether any faster solution existed, and then I found speedparser, which describes itself as:
feedparser but faster and worse.
I agreed with that completely.
1 Environment
This performance test is a very simple one; a few bullet points:
- First 100 feeds from speedparser's tests/feeds.tar.bz2, that is, 0001.dat through 0100.dat.
- Timed with the timeit module, averaged over 3 runs.
- Python 2.7.9 only; speedparser does not yet support Python 3.
- feedparser 5.2.1 (2015-07-24), which no longer depends on BeautifulSoup as of this version.
- speedparser 0.2.0 (2014-08-16) with chardet 2.2.1 and lxml 3.4.1.
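The timing setup above can be sketched roughly as follows. This is only an illustration of the timeit approach, not the actual testing script; parse_feed and the in-memory feed list are stand-ins for calling feedparser.parse (or speedparser.parse) on each NNNN.dat file.

```python
import timeit

# Stand-in for feedparser.parse(data) / speedparser.parse(data);
# the real benchmark parses the bytes of each NNNN.dat file.
def parse_feed(data):
    return len(data)

# Stand-in for the 100 feed files from tests/feeds.tar.bz2.
feeds = [b'<rss/>'] * 100

def parse_all():
    for data in feeds:
        parse_feed(data)

# 3 timeit runs, each parsing all feeds once, then averaged.
runs = timeit.repeat(parse_all, repeat=3, number=1)
avg = sum(runs) / len(runs)
print('%d feeds, %.3f seconds' % (len(feeds), avg))
```

timeit.repeat is used instead of a manual loop so each run gets the module's usual precautions (e.g. garbage collection disabled during timing).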
3 Result
The output of the testing script, with the 5.1.3 result inserted:
Benchmarking
100 feeds
3 timeit runs
Results
feedparser (5.1.3): 233.125 seconds => 0.429 feeds / second
feedparser (5.2.1): 54.638 seconds => 1.830 feeds / second
speedparser (0.2.0): 12.175 seconds => 8.214 feeds / second
feedparser 5.2.1 is 326.57% faster than 5.1.3, and speedparser 0.2.0 is 348.85% faster than feedparser 5.2.1.
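The percentages follow from the feeds-per-second rates above; a quick check:

```python
# Relative speedup, computed from the rounded feeds/second rates
# reported by the benchmark output above.
def pct_faster(new_rate, old_rate):
    return (new_rate / old_rate - 1) * 100

print('%.2f%%' % pct_faster(1.830, 0.429))  # feedparser 5.2.1 vs 5.1.3
print('%.2f%%' % pct_faster(8.214, 1.830))  # speedparser 0.2.0 vs feedparser 5.2.1
```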
Here is the official result from its README:
feedparser 2.5 feeds/sec
speedparser 65 feeds/sec with HTML cleaning on
speedparser 200 feeds/sec with HTML cleaning off
4 Thoughts
Although speedparser is faster, it is, as it says, also worse. I had fewer problems merging those five feeds with feedparser; it would take more work with speedparser. A 300%+ speedup isn't really a big factor for me, especially since speedparser requires two additional libraries.
speedparser says:
If you are writing an application that does not ingest many feeds, or where CPU is not a problem, you should use feedparser as it is flexible with bad or malformed data and has a much better test suite.
I think a good design for parsing many feeds is to split the work: one process for parsing and another for presenting the results. For example, a background process goes through each feed, parses it, and generates data that the presenter process can use for display.
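That split design might look something like this minimal sketch. All names here (parse_feeds, write_snapshot, snapshot.json) are assumptions for illustration; the point is that the presenter only ever touches pre-parsed plain data, so slow parsing never blocks display.

```python
import json

def parse_feeds(feeds):
    # Stand-in for running feedparser.parse over each feed and
    # reducing the result to plain data the presenter needs.
    return [{'title': title, 'entries': entries} for title, entries in feeds]

def write_snapshot(parsed, path):
    # Background process: persist the parsed results.
    with open(path, 'w') as f:
        json.dump(parsed, f)

def read_snapshot(path):
    # Presenter process: load pre-parsed data, no feed parsing involved.
    with open(path) as f:
        return json.load(f)

# Background side (hypothetical feed data):
write_snapshot(parse_feeds([('Example Feed', ['post 1', 'post 2'])]),
               'snapshot.json')

# Presenter side:
for feed in read_snapshot('snapshot.json'):
    print(feed['title'], len(feed['entries']))
```

A JSON file is just the simplest handoff; a database or message queue would serve the same decoupling role.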