Last time, I tested simplejson 2.1.3. That post was really groundwork for this one: I want to show you that parsing with feedparser takes quite some time.
Last month, feedparser 5.0 was released, and just a few days ago 5.0.1 followed. Before version 5, I was using 4.1, which I found slow. But version 5.0 is even slower in Python 2.5, and much slower still in Python 2.6. 5.0 claims to support Python 3, but Python 3 says the module has invalid syntax, so I can’t tell whether it runs faster in Python 3.
Here is how I tested. I downloaded 500 entries of the Google Blog through the Blogger API in Atom and RSS formats, and I also downloaded the JSON for comparison:
http://www.blogger.com/feeds/10861780/posts/default?max-results=500&alt=atom
http://www.blogger.com/feeds/10861780/posts/default?max-results=500&alt=rss
http://www.blogger.com/feeds/10861780/posts/default?max-results=500&alt=json
Because the execution/parsing time is really long, I simply used Bash’s time built-in to measure it. The commands are as follows:
time python -c 'import feedparser as fp; fp.parse("test.atom")'
time python -c 'import feedparser as fp; fp.parse("test.rss")'
time python -c 'import simplejson as json; json.load(open("test.json"))'
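As a side note, Bash’s time measures the whole process, including interpreter start-up and the import of the library. If only the parse call is of interest, it could be timed in-process with something like the sketch below. It is written for Python 3, `time_call` is a hypothetical helper name, and the standard-library `json` module with a generated sample stands in for feedparser, simplejson and the downloaded files, so the snippet runs without any of them.

```python
import json
import time


def time_call(func, *args, **kwargs):
    """Call func once and return (elapsed_seconds, result)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return time.perf_counter() - start, result


# Stand-in workload (an assumption, not the real test data):
# a generated JSON document with 500 small entries.
sample = json.dumps(
    {"feed": {"entry": [{"title": "post %d" % i} for i in range(500)]}}
)

elapsed, data = time_call(json.loads, sample)
print("parsed %d entries in %.6fs" % (len(data["feed"]["entry"]), elapsed))
```

To time the real thing, the same helper would be called as `time_call(feedparser.parse, "test.atom")`, assuming feedparser is installed and the file exists.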
Here are the results of the time commands:
Library | feedparser 5.0.1 | feedparser 5.0.1 | feedparser 4.1 | feedparser 4.1 | simplejson 2.1.3 + C | simplejson 2.1.3 + C |
---|---|---|---|---|---|---|
Python | 2.5 | 2.6 | 2.5 | 2.6 | 2.5 | 2.6 |
Atom | 14.515s | 54.676s | 5.939s | 5.929s | | |
RSS | 14.205s | 53.185s | 5.249s | 5.286s | | |
JSON | | | | | 1.383s | 1.372s |
I don’t think I need to explain the numbers; they are very clear. I listed simplejson because I also did this:
simplejson + YQL for 100 entries: 4.445s
Using a YQL statement like:
select * from atom where url="http://www.blogger.com/feeds/10861780/posts/default?max-results=100&start-index=1&alt=atom"
and:
time python -c 'import simplejson as json; from urllib2 import urlopen; json.load(urlopen("http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20atom%20where%20url%3D%22http%3A%2F%2Fwww.blogger.com%2Ffeeds%2F10861780%2Fposts%2Fdefault%3Fmax-results%3D100%26start-index%3D1%26alt%3Datom%22&format=json&callback="))'
I found that if I asked for 500 entries, YQL returned an empty result. 250 seems to be the limit, but simplejson reports an error in the JSON. I don’t know whether YQL causes that or not, but 100 entries per request with 5 requests seems fine. Each request takes around 1 to 5 seconds, so 5 requests are estimated to take up to 25 seconds in Python 2.6. Most of that time is spent waiting for the response from YQL, which doesn’t use much CPU.
YQL is a great tool. You can load a cross-domain XML file or get the HTTP error code for cross-domain JSONP, and you can convert any format it supports to JSON.
feedparser is a great tool, too, but it’s really slow. I noticed because I have a script which loads many feeds using feedparser every ten minutes, and it is always the top CPU time user. After this test, I don’t think I will use 5.0, and I will probably go back to Python 2.5 to run the script. I might even switch to processing JSON with YQL’s help: since the data is downloaded anyway, it would be just a URL change.
I know it’s unfair to say feedparser is slow. If you only look at the time without considering the formats or the implementation, then it is slow, and speed is the only thing I care about. If there were a library implemented in C, I am sure the time would improve hugely. I searched for one, but I couldn’t find any.
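The five-request approach could be sketched roughly as follows. This only builds the request URLs and does not contact YQL. It is written for Python 3, `yql_page_urls` is a hypothetical helper name, and the endpoint and parameters are the ones from the command above. Note that `urlencode` with `quote` escapes a few more characters (such as `*`) than the hand-written URL does, which should be equivalent.

```python
from urllib.parse import quote, urlencode

YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql"
FEED_URL = "http://www.blogger.com/feeds/10861780/posts/default"


def yql_page_urls(total=500, page_size=100):
    """Build one YQL request URL per page of the Blogger Atom feed."""
    urls = []
    # Blogger's start-index is 1-based: 1, 101, 201, 301, 401.
    for start in range(1, total + 1, page_size):
        feed = "%s?max-results=%d&start-index=%d&alt=atom" % (
            FEED_URL, page_size, start)
        statement = 'select * from atom where url="%s"' % feed
        # quote_via=quote encodes spaces as %20 instead of "+".
        query = urlencode(
            {"q": statement, "format": "json", "callback": ""},
            quote_via=quote)
        urls.append(YQL_ENDPOINT + "?" + query)
    return urls


for url in yql_page_urls():
    print(url)
```

Each URL would then be fetched and passed to `json.load`, and the entries from the five responses concatenated.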
1 Regarding BeautifulSoup
Added at 2011-09-16T17:17:30Z
Note
While checking a faster alternative, speedparser, I found out that since version 5.2.1 (2015-07-24), FeedParser no longer depends on BeautifulSoup and runs 326.57% faster than 5.1.3. (2015-08-21T23:44:43Z)
A search keyword in my visitor report got me to Google it, and I found this issue [1], which explains why FeedParser 5.0.1 is much slower in my Python 2.6: the reason is BeautifulSoup. At the time I tested, BeautifulSoup was installed as a dependency of lxml on my Gentoo. I didn’t notice and had no idea it would cause such a performance slump.
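Since the slowdown depends on whether BeautifulSoup happens to be importable, a quick check before benchmarking might have caught this. Below is a minimal sketch; `is_importable` is a hypothetical helper name, and it assumes the BeautifulSoup 3 package is imported under the module name `BeautifulSoup`. It should be run with the same interpreter that runs feedparser.

```python
def is_importable(module_name):
    """Return True if the named module can be imported by this interpreter.

    Note: this really imports the module, so any import side effects happen.
    """
    try:
        __import__(module_name)
    except ImportError:
        return False
    return True


# "BeautifulSoup" is the assumed module name of BeautifulSoup 3.
print("BeautifulSoup importable: %s" % is_importable("BeautifulSoup"))
```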
Now, here is an updated test result for FeedParser 5.0.1:
Python | 2.5.4 | 2.5.4 | 2.6.6 | 2.6.6 | 2.7.1 | 2.7.1 |
---|---|---|---|---|---|---|
BeautifulSoup 3.2.0 | Atom | RSS | Atom | RSS | Atom | RSS |
With | 23.026s | 23.785s | 48.466s | 50.220s | 20.864s | 21.373s |
Without | 14.425s | 14.658s | 11.556s | 12.001s | 12.983s | 13.275s |
5.0.1 is still slower than 4.1, but if you don’t have BeautifulSoup installed, it is actually slightly faster with Python 2.6 than with Python 2.5 or Python 2.7. So my judgment about Python 2.6 in the original test was wrong.
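To put a number on the BeautifulSoup penalty, the timings from the table above can be divided row by row. This is just a sketch of that arithmetic: the dictionary layout and the `slowdown` helper are my own, and only the timings come from the test.

```python
# Timings in seconds for FeedParser 5.0.1, copied from the table above.
timings = {
    "2.5.4": {"with": {"Atom": 23.026, "RSS": 23.785},
              "without": {"Atom": 14.425, "RSS": 14.658}},
    "2.6.6": {"with": {"Atom": 48.466, "RSS": 50.220},
              "without": {"Atom": 11.556, "RSS": 12.001}},
    "2.7.1": {"with": {"Atom": 20.864, "RSS": 21.373},
              "without": {"Atom": 12.983, "RSS": 13.275}},
}


def slowdown(table):
    """Return with/without ratios per Python version and feed format."""
    ratios = {}
    for version, runs in table.items():
        ratios[version] = {
            fmt: runs["with"][fmt] / runs["without"][fmt]
            for fmt in runs["with"]
        }
    return ratios


for version, per_format in sorted(slowdown(timings).items()):
    for fmt, ratio in sorted(per_format.items()):
        print("Python %s, %s: %.2fx slower with BeautifulSoup"
              % (version, fmt, ratio))
```

With BeautifulSoup installed, parsing takes roughly 1.6 times as long on Python 2.5 and 2.7, and a little over 4 times as long on Python 2.6.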
[1] http://code.google.com/p/feedparser/issues/detail?id=300 is gone.