I am trying to add some readability indexes to a script. For starters, ARI (Automated Readability Index) seems like a good choice: it only needs character, word, and sentence counts.
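
For reference, the formula itself is tiny. Written as a small helper, using the same constants that show up in the print statement near the end of the script:

def ari(chars, words, sents):
    # Automated Readability Index
    return 4.71 * chars / words + 0.5 * words / sents - 21.43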

At first thought, regular expressions seem like the logical direction, but after digging around, NLTK looks like a good tool to use. So, this is the code I have:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re
import nltk.data
from nltk import wordpunct_tokenize

text = '''There are two ways of constructing a software design:
One way is to make it so simple that there are obviously no deficiencies and
the other way is to make it so complicated that there are no obvious deficiencies.'''
# — C.A.R. Hoare, The 1980 ACM Turing Award Lecture

# split the text into tokens at punctuation boundaries,
# then drop pure-punctuation tokens and tokens made only of '-'
RE = re.compile('[0-9a-z-]', re.I)
words = filter(lambda w: RE.search(w) and w.replace('-', ''), wordpunct_tokenize(text))

wordc = len(words)
charc = sum(len(w) for w in words)

sent = nltk.data.load('tokenizers/punkt/english.pickle')

sents = sent.tokenize(text)
sentc = len(sents)

print words
print charc, wordc, sentc
print 4.71 * charc / wordc + 0.5 * wordc / sentc - 21.43

The output:

['There', 'are', 'two', 'ways', 'of', 'constructing', 'a', 'software', 'design', 'One', 'way', 'is', 'to', 'make', 'it', 'so', 'simple', 'that', 'there', 'are', 'obviously', 'no', 'deficiencies', 'and', 'the', 'other', 'way', 'is', 'to', 'make', 'it', 'so', 'complicated', 'that', 'there', 'are', 'no', 'obvious', 'deficiencies']
173 39 1
18.9630769231

It uses training data to tokenize sentences[1]. If you see an error like:

Traceback (most recent call last):
  File "./test.py", line 13, in <module>
    sent_tokenize = nltk.data.load('tokenizers/punkt/english.pickle')
  File "/usr/lib64/python2.7/site-packages/nltk/data.py", line 594, in load
    resource_val = pickle.load(_open(resource_url))
  File "/usr/lib64/python2.7/site-packages/nltk/data.py", line 673, in _open
    return find(path).open()
  File "/usr/lib64/python2.7/site-packages/nltk/data.py", line 455, in find
    raise LookupError(resource_not_found)
LookupError:
**********************************************************************
  Resource 'tokenizers/punkt/english.pickle' not found.  Please
  use the NLTK Downloader to obtain the resource: >>>
  nltk.download().
  Searched in:
    - '/home/livibetter/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************

Just run and type:

python
import nltk
nltk.download()
d punkt

to download the data file.
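
Alternatively, the download can be scripted, since nltk.download() also accepts a package name directly:

import nltk
nltk.download('punkt')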

The result isn’t consistent with other calculators you can find on the Internet. In fact, each calculator most likely produces a different index value, because they disagree on what counts as a sentence divider and on all the other small details of how things are counted.
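
To see how much those small details matter, plug the counts from the run above into the formula with different sentence counts. Punkt sees the quote as a single sentence; a splitter that also breaks on the colon would see two, and the score drops by almost ten points:

chars, words = 173, 39
for sents in (1, 2):
    print('%d sentence(s): %.2f' % (sents, 4.71 * chars / words + 0.5 * words / sents - 21.43))
# 1 sentence(s): 18.96
# 2 sentence(s): 9.21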

There is actually some old code contributed to NLTK which probably has everything you need: all the different indexes and calculation methods. I don’t see it in the current code yet; the latest code will raise NotImplementedError.

Unfortunately, NLTK doesn’t support Python 3, and the script I am writing will be Python 3 only; I don’t plan to make it compatible with Python 2.X. There seemed to be a branch for it, but it is gone. So, after all this, I might have to fall back to regular expressions. Well, that’s not too bad, actually.
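
For what it’s worth, here is a rough sketch of what a regex-only, Python 3 version might look like. The word pattern reuses the character class from the script above; splitting sentences on '.', '!', and '?' is a naive assumption and will not always agree with Punkt:

import re

WORD_RE = re.compile(r'[0-9a-z-]+', re.I)  # same character class as the script above
SENT_RE = re.compile(r'[.!?]+')            # naive sentence breaks; a colon is not one here

def ari_score(text):
    # keep tokens with letters, digits, or hyphens; drop tokens that are nothing but '-'
    words = [w for w in WORD_RE.findall(text) if w.replace('-', '')]
    sents = [s for s in SENT_RE.split(text) if s.strip()]
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sents) - 21.43

For the Hoare quote this gives the same counts (173 characters, 39 words, 1 sentence) and the same score.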

[1] http://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html is gone.