pyplaintext (GitHub) is a very simple library for converting HTML to plain text. Consider the following code:
    #!/usr/bin/env python
    import urllib2

    from pyplaintext import converter

    URL = 'http://example.com'

    # Fetch the page (urllib2 is Python 2 only).
    f = urllib2.urlopen(URL)
    html = f.read()
    f.close()

    # Convert the HTML to plain text and print it.
    parser = converter.HTML2PlainParser()
    result = parser.html_to_plain_text(html)
    print(result)
It produces:
    Example Domain

    This domain is established to be used for illustrative examples in documents. You may use this domain in examples without prior coordination or asking for permission.

    More information...[1]

    [1]: http://www.iana.org/domains/example
The hyperlink is listed as well.
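To illustrate the footnote-style link listing seen in the output, here is a minimal sketch built only on the standard library's `html.parser` (Python 3). It is not pyplaintext's implementation; the names `LinkFootnoteParser` and `html_to_text_with_links` are made up for this example, and the sketch ignores block layout and does not skip `<script>` or `<style>` content.

```python
from html.parser import HTMLParser


class LinkFootnoteParser(HTMLParser):
    """Hypothetical helper: collect visible text and number each <a href>."""

    def __init__(self):
        super().__init__()
        self.chunks = []   # pieces of visible text, in document order
        self.links = []    # href values, in order of appearance
        self._href = None  # href of the <a> that is currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(self._href)
            marker = "[%d]" % len(self.links)
            # Attach the marker directly to the link text, e.g. "text[1]".
            if self.chunks:
                self.chunks[-1] += marker
            else:
                self.chunks.append(marker)
            self._href = None

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text:
            self.chunks.append(text)


def html_to_text_with_links(html):
    """Return the text of `html` followed by a numbered list of its links."""
    parser = LinkFootnoteParser()
    parser.feed(html)
    parser.close()
    body = " ".join(parser.chunks)
    if not parser.links:
        return body
    refs = "\n".join(
        "[%d]: %s" % (number, href)
        for number, href in enumerate(parser.links, start=1)
    )
    return body + "\n\n" + refs
```

Calling `html_to_text_with_links(html)` on a page's source returns the text with `[n]` markers after each link, followed by the matching `[n]: URL` lines.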
It does not provide a command-line interface at the moment. Hopefully it will someday, but for now you can only use it as a library. It could be useful for quickly grabbing data from the web in a script.
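Until a command-line interface exists, a small wrapper script can fill the gap. The following is only a sketch: the `main` function and its parameters are my own invention, not part of pyplaintext, and it relies solely on the `HTML2PlainParser().html_to_plain_text()` call shown earlier. The library is imported lazily, so the wrapper can be tried with a stand-in converter.

```python
import sys


def main(convert=None, stdin=None, stdout=None):
    """Read HTML from stdin and write its plain-text conversion to stdout.

    `convert` is a hypothetical hook for swapping in another converter;
    by default the pyplaintext parser from the example above is used.
    """
    if convert is None:
        # Imported here so the wrapper can be exercised without pyplaintext.
        from pyplaintext import converter
        convert = converter.HTML2PlainParser().html_to_plain_text
    if stdin is None:
        stdin = sys.stdin
    if stdout is None:
        stdout = sys.stdout
    stdout.write(convert(stdin.read()))
    stdout.write("\n")
    return 0
```

Saved in a script that ends with `sys.exit(main())`, this could be used as a filter, reading a page on standard input and printing the text.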
pyplaintext is written by Martin Brochhaus for Python 2 under the MIT License. The current version is 0.1 (2014-04-07).