While writing a script to merge feeds using Universal Feed Parser, I couldn't ignore how slow the parsing is. This wasn't the first time I had observed the slowness directly; I did a speed test on it four years ago. At the time, BeautifulSoup was a big part of what slowed it down.

I have 5 feeds to parse, which honestly isn't too bad since I wouldn't be running the script interactively, but I still wanted to know whether a faster solution exists. That's when I found speedparser, which describes itself as:

feedparser but faster and worse.

I agreed with that completely.

1   Environment

This performance test is a very simple one; a few bullet points:

  • The first 100 feeds from speedparser’s tests/feeds.tar.bz2, that is, 0001.dat to 0100.dat.
  • Timed with the timeit module, averaged over 3 runs.
  • Python 2.7.9 only; speedparser does not yet support Python 3.
  • feedparser 5.2.1 (2015-07-24), which no longer depends on BeautifulSoup as of this version.
  • speedparser 0.2.0 (2014-08-16) with chardet 2.2.1 and lxml 3.4.1.
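To show the methodology, here is a minimal sketch of such a timing loop. The helper name benchmark and its signature are my own illustration, not taken from the actual script; in practice the parse argument would be something like feedparser.parse:

```python
import timeit


def benchmark(parse, paths, repeats=3):
    """Time parse() over all paths, averaged over `repeats` runs.

    Returns (average seconds, feeds per second).
    """
    def run():
        for path in paths:
            parse(path)

    # timeit.repeat returns one total time per run; number=1 means
    # each run walks the full feed list exactly once.
    times = timeit.repeat(run, number=1, repeat=repeats)
    avg = sum(times) / len(times)
    return avg, len(paths) / avg
```

Calling benchmark(feedparser.parse, paths) or benchmark(speedparser.parse, paths) over the same 100 .dat files would then yield directly comparable numbers.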

2   Script

You can find the script on Gist.

3   Result

The output of the testing script, with the feedparser 5.1.3 result inserted:


Benchmarking
100 feeds
3 timeit runs

Results
feedparser (5.1.3): 233.125 seconds => 0.429 feeds / second
feedparser (5.2.1): 54.638 seconds => 1.830 feeds / second
speedparser (0.2.0): 12.175 seconds => 8.214 feeds / second

feedparser 5.2.1 is 326.57% faster than 5.1.3, and speedparser 0.2.0 is 348.85% faster than feedparser 5.2.1.
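These percentages fall out of the feeds-per-second rates above; as a quick check:

```python
def percent_faster(new_rate, old_rate):
    # "X% faster" here means the relative increase in throughput.
    return (new_rate - old_rate) / old_rate * 100.0


# feedparser 5.2.1 vs 5.1.3, in feeds / second:
print('%.2f%%' % percent_faster(1.830, 0.429))  # 326.57%
# speedparser 0.2.0 vs feedparser 5.2.1:
print('%.2f%%' % percent_faster(8.214, 1.830))  # 348.85%
```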

Here is the official result from its README:


feedparser 2.5 feeds/sec
speedparser 65 feeds/sec with HTML cleaning on
speedparser 200 feeds/sec with HTML cleaning off

4   Thoughts

Although speedparser is faster, it is, as its own description says, also worse. I had fewer problems merging those five feeds with feedparser; it would take more work with speedparser. Being 300%+ faster isn’t a big factor for me, especially since it requires two additional libraries.

speedparser says:

If you are writing an application that does not ingest many feeds, or where CPU is not a problem, you should use feedparser as it is flexible with bad or malformed data and has a much better test suite.

I think a good design for parsing many feeds is to split the work: one process for parsing and another for presenting the results. For example, a background process goes through each feed, parses it, and generates data that the presenter process can use for display.
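As a rough sketch of that split (the function names and the flat JSON schema here are my own illustration, not from any existing script), the parser side could reduce each feedparser-style result to plain data and hand it to the presenter as JSON:

```python
import json


def extract_entries(parsed):
    """Reduce a feedparser-style result to the plain dicts the presenter needs.

    `parsed` is assumed to be a mapping with an 'entries' list, as
    feedparser.parse() returns; only presentation fields are kept.
    """
    return [
        {
            'title': entry.get('title', ''),
            'link': entry.get('link', ''),
            'updated': entry.get('updated', ''),
        }
        for entry in parsed.get('entries', [])
    ]


def save_entries(entries, path):
    # The background process writes plain JSON, so the presenter
    # needs no parser library at all -- it only reads this file.
    with open(path, 'w') as f:
        json.dump(entries, f)


def load_entries(path):
    with open(path) as f:
        return json.load(f)
```

With this split, parsing speed only affects how fresh the background data is; the presenter stays fast regardless of which parser produced the file.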