While writing a script to merge feeds with Universal Feed Parser, I couldn't ignore how slow the parsing was. This wasn't the first time I had observed the slowness directly; I ran a speed test on it four years ago. At the time, BeautifulSoup was a big factor in the slowdown.
I only have 5 feeds to parse, so honestly it really isn't too bad, since I wouldn't be running the script interactively. Still, I wanted to know whether any faster solution existed, and then I found speedparser, which describes itself as:
feedparser but faster and worse.
I agreed with that completely.
1 Environment
This performance test is a very simple one; a few bullet points:
- First 100 feeds from speedparser's tests/feeds.tar.bz2, that is, 0001.dat through 0100.dat.
- Timed with the timeit module, averaged over 3 runs.
- Python 2.7.9 only; speedparser does not yet support Python 3.
- feedparser 5.2.1 (2015-07-24), which no longer depends on BeautifulSoup as of this version.
- speedparser 0.2.0 (2014-08-16) with chardet 2.2.1 and lxml 3.4.1.
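The timing setup above can be sketched roughly as follows. This is only an illustration of the timeit approach, not the actual testing script; parse_feed and the in-memory feed list are stand-ins for calling feedparser.parse (or speedparser.parse) on each NNNN.dat file.

```python
import timeit

# Stand-in for feedparser.parse(data) / speedparser.parse(data);
# the real benchmark parses the bytes of each NNNN.dat file.
def parse_feed(data):
    return len(data)

# Stand-in for the 100 feed files from tests/feeds.tar.bz2.
feeds = [b'<rss/>'] * 100

def parse_all():
    for data in feeds:
        parse_feed(data)

# 3 timeit runs, each parsing all feeds once, then averaged.
runs = timeit.repeat(parse_all, repeat=3, number=1)
avg = sum(runs) / len(runs)
print('%d feeds, %.3f seconds' % (len(feeds), avg))
```

timeit.repeat is used instead of a manual loop so each run gets the module's usual precautions (e.g. garbage collection disabled during timing).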
3 Result
The output of the testing script, with the 5.1.3 result inserted:
Benchmarking
100 feeds
3 timeit runs
Results
feedparser (5.1.3): 233.125 seconds => 0.429 feeds / second
feedparser (5.2.1): 54.638 seconds => 1.830 feeds / second
speedparser (0.2.0): 12.175 seconds => 8.214 feeds / second
feedparser 5.2.1 is 326.57% faster than 5.1.3, and speedparser 0.2.0 is 348.85% faster than feedparser 5.2.1.
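The percentages follow from the feeds-per-second rates above; a quick check:

```python
# Relative speedup, computed from the rounded feeds/second rates
# reported by the benchmark output above.
def pct_faster(new_rate, old_rate):
    return (new_rate / old_rate - 1) * 100

print('%.2f%%' % pct_faster(1.830, 0.429))  # feedparser 5.2.1 vs 5.1.3
print('%.2f%%' % pct_faster(8.214, 1.830))  # speedparser 0.2.0 vs feedparser 5.2.1
```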
Here is the official result from its README:
feedparser 2.5 feeds/sec
speedparser 65 feeds/sec with HTML cleaning on
speedparser 200 feeds/sec with HTML cleaning off
4 Thoughts
Although speedparser is faster, it is, as it says, also worse. I had fewer problems merging those five feeds with feedparser; it would take more work with speedparser. A 300%+ speedup isn't really a big factor for me, especially since speedparser requires two additional libraries.
speedparser says:
If you are writing an application that does not ingest many feeds, or where CPU is not a problem, you should use feedparser as it is flexible with bad or malformed data and has a much better test suite.
I think a good design for parsing many feeds is to split the work: one process for parsing and another for presenting the results. For example, a background process goes through each feed, parses it, and generates data that the presenter process can use for display.
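That split design might look something like this minimal sketch. All names here (parse_feeds, write_snapshot, snapshot.json) are assumptions for illustration; the point is that the presenter only ever touches pre-parsed plain data, so slow parsing never blocks display.

```python
import json

def parse_feeds(feeds):
    # Stand-in for running feedparser.parse over each feed and
    # reducing the result to plain data the presenter needs.
    return [{'title': title, 'entries': entries} for title, entries in feeds]

def write_snapshot(parsed, path):
    # Background process: persist the parsed results.
    with open(path, 'w') as f:
        json.dump(parsed, f)

def read_snapshot(path):
    # Presenter process: load pre-parsed data, no feed parsing involved.
    with open(path) as f:
        return json.load(f)

# Background side (hypothetical feed data):
write_snapshot(parse_feeds([('Example Feed', ['post 1', 'post 2'])]),
               'snapshot.json')

# Presenter side:
for feed in read_snapshot('snapshot.json'):
    print(feed['title'], len(feed['entries']))
```

A JSON file is just the simplest handoff; a database or message queue would serve the same decoupling role.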