YJL: summarize.py using NLTK and BeautifulSoup 4

summarize.py is a tool for generating a summary of a web page. It uses NLTK to process the text and BeautifulSoup 4 to process the HTML.
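
To give a rough idea of how such a tool can combine the two libraries, here is a minimal sketch of a frequency-based extractive summarizer. This is an assumption for illustration, not the actual algorithm in summarize.py: the names `summarize` and `summarize_html`, the `limit` parameter, and the scoring rule are all hypothetical. Only the use of `find_all('p')` on the parsed page is taken from the traceback further down.

```python
from collections import Counter


def summarize(text, split_sentences, split_words, stopwords, limit=3):
    """Return up to `limit` highest-scoring sentences, in original order.

    Hypothetical scoring rule: a sentence scores the sum of the
    document-wide frequencies of its non-stopword words.
    """
    sentences = split_sentences(text)
    # Count content words only: alphabetic tokens that are not stopwords.
    freq = Counter(
        w.lower()
        for w in split_words(text)
        if w.isalpha() and w.lower() not in stopwords
    )

    def score(sentence):
        # Counter returns 0 for stopwords and punctuation tokens.
        return sum(freq[w.lower()] for w in split_words(sentence))

    # Rank sentence indices by score, keep the best, restore page order.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:limit])]


def summarize_html(html, limit=3):
    """Wire the sketch to BeautifulSoup 4 and NLTK (imported lazily)."""
    from bs4 import BeautifulSoup
    from nltk.corpus import stopwords
    from nltk.tokenize import sent_tokenize, word_tokenize

    soup = BeautifulSoup(html, "html.parser")
    text = " ".join(p.text for p in soup.find_all("p"))
    return summarize(text, sent_tokenize, word_tokenize,
                     set(stopwords.words("english")), limit)
```

Passing the tokenizers in as arguments keeps the scoring logic independent of NLTK. Note that summing frequencies favors long sentences, so a real tool would probably normalize the score somehow.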

For NLTK, you will need to download a couple of data sets using:

$ python -c 'import nltk; nltk.download()'

This starts the NLTK downloader. You will need the following two data sets:

stopwords: Stopwords Corpus
punkt: Punkt Tokenizer Models
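
If you would rather skip the interactive downloader, the two data sets can also be fetched by name, since `nltk.download` accepts a package identifier. The sketch below assumes NLTK is installed and the machine has network access. `REQUIRED_DATA` and `download_required` are hypothetical names used only for illustration.

```python
# Identifiers of the two data sets listed above.
REQUIRED_DATA = ("stopwords", "punkt")


def download_required(download=None):
    """Fetch each required NLTK data set by name.

    `download` defaults to nltk.download; it is a parameter only so the
    function can be exercised without network access.
    """
    if download is None:
        import nltk  # imported lazily, only needed for the real download
        download = nltk.download
    for name in REQUIRED_DATA:
        download(name)
```

Calling `download_required()` once before running summarize.py should be enough; data sets that are already present and up to date are not downloaded again.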

Here is sample output from running it on PEP 8:

$ ./summarize.py http://legacy.python.org/dev/peps/pep-0008/
PEP 8 -- Style Guide for Python Code - http://legacy.python.org/dev/peps/pep-0008/

Limit all lines to a maximum of 79 characters.
For flowing long blocks of text with fewer structural restrictions (docstrings or comments), the line length should be limited to 72 characters.
Limiting the required editor window width makes it possible to have several files open side-by-side, and works well when using code review tools that present the two versions in adjacent columns.
Some web based tools may not offer dynamic line wrapping at all.
Some teams strongly prefer a longer line length.
The Python standard library is conservative and requires limiting lines to 79 characters (and docstrings/comments to 72).
The preferred way of wrapping long lines is by using Python's implied line continuation inside parentheses, brackets and braces.
Backslashes may still be appropriate at times.
Another such case is with assert statements.
Make sure to indent the continued line appropriately.

It doesn’t run very well with Python 2.7, and I am not sure about Python 3. I kept running into UnicodeEncodeError on almost every web page I tried. Here is one of the tracebacks:

$ ./summarize.py http://docs.python.org/3/whatsnew/3.4.html
Traceback (most recent call last):
  File "./summarize.py", line 129, in <module>
    print(summarize_page(sys.argv[1]))
  File "./summarize.py", line 119, in summarize_page
    summaries = summarize_blocks(map(lambda p: p.text, b.find_all('p')))
  File "./summarize.py", line 110, in summarize_blocks
    return [u(re.sub('\s+', ' ', summary.strip())) for summary in summaries if any(c.lower() in string.ascii_lowercase for c in summary)]
  File "./summarize.py", line 30, in u
    return codecs.unicode_escape_decode(s)[0]
UnicodeEncodeError: 'ascii' codec can't encode characters in position 27-28: ordinal not in range(128)
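
One plausible reading of this traceback (an assumption, since the rest of summarize.py is not shown here): the text coming out of BeautifulSoup is already Unicode, and on Python 2 handing such a string to `codecs.unicode_escape_decode` makes the interpreter first encode it with the ASCII codec, which fails on any non-ASCII character. A minimal Python 3 sketch of the same cleanup step without that decode call might look like this. `clean_summaries` is a hypothetical name, not a function from summarize.py.

```python
import re
import string


def clean_summaries(summaries):
    """Collapse runs of whitespace and drop blocks without ASCII letters.

    Mirrors the list comprehension in the traceback, minus the u() call:
    Python 3 strings are already Unicode, so nothing needs decoding.
    """
    return [
        re.sub(r"\s+", " ", summary.strip())
        for summary in summaries
        if any(c.lower() in string.ascii_lowercase for c in summary)
    ]
```

For example, `clean_summaries(["  caf\u00e9   au\nlait ", "1234"])` keeps the first entry with its whitespace normalized and drops the second, and the accented character passes through untouched.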

Hopefully, the issue will be fixed in future releases.
