YJL: Soundex in Bash

I recently read this article Using Fuzzy Matching to Search by Sound with Python, which uses Fuzzy library. I didn’t have any need to use this but I just found this interesting. So, I wrote this quick script:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import fuzzy

dmeta = fuzzy.DMetaphone()
fmt_dmeta = lambda s: '%4s, %4s' % tuple(dmeta(s))

funcs = [fuzzy.Soundex(8), fmt_dmeta, fuzzy.nysiis]
names = [
    [
      ['en',    'Tom Hanks',       'Tom Hanks'],
      ['en',    'Thomas Hanks',    'Thomas Hanks'],
      ['zh_TW', 'Tangmu hankesi',  '湯姆·漢克斯'],
      ['jp',    'Tomu hankusu',    'トム・ハンクス'],
      ['kr',    'tom haengkeuseu', '톰 행크스'],
    ],
    [
      ['en',    'Barack Obama',    'Barack Obama'],
      ['zh_TW', 'Balake Oubama',   '巴拉克·歐巴馬'],
      ['jp',    'Baraku Obama',    'バラク・オバマ'],
      ['ar',    'barak awbama',    'باراك أوباما',],
    ],
    [
      ['en',    'Rose Kennedy',    'Rose Kennedy'],
      ['zh_TW', 'Luosi Gannaidi',  '蘿絲·甘迺迪'],
      ['jp',    'Rozu kenedi',     'ローズ・ケネディ'],
    ],
  ]


for name in names:
  for lang in name:
    results = '%8s / %10s / %-10s' % tuple(f(lang[1]) for f in funcs)
    print '%-16s %s %s' % (lang[1], results, lang[2])
  print '-'*78

Tom Hanks        T5200000 / TMNK, None / TANANC     Tom Hanks
Thomas Hanks     T5252000 / TMSN, None / TANASANC   Thomas Hanks
Tangmu hankesi   T5252000 / TNKM, None / TANGNANCAS 湯姆·漢克斯
Tomu hankusu     T5200000 / TMNK, None / TANANCAS   トム・ハンクス
tom haengkeuseu  T5200000 / TMNK, None / TANANGCAS  톰 행크스
------------------------------------------------------------------------------
Barack Obama     B6215000 / PRKP, None / BARACABAN  Barack Obama
Balake Oubama    B4215000 / PLKP, None / BALACABAN  巴拉克·歐巴馬
Baraku Obama     B6215000 / PRKP, None / BARACABAN  バラク・オバマ
barak awbama     B6215000 / PRKP, None / BARACABAN  باراك أوباما
------------------------------------------------------------------------------
Rose Kennedy     R2530000 / RSKN, None / RASACANADY Rose Kennedy
Luosi Gannaidi   L2530000 / LSKN, None / LASAGANAD  蘿絲·甘迺迪
Rozu kenedi      R2530000 / RSKN, RTSK / RASACANAD  ローズ・ケネディ
------------------------------------------------------------------------------

I later found out I shouldn’t use full name, the algorithm doesn’t seem to be designed for that. But since the script has written, so I just dumped results here.

I was planning to write same library but in Bash and in JavaScript. So far, I finished Soundex in Bash, and I would not continue, you will read the reason at the end. Soundex is only one part of Fuzzy, it also supports Double Metaphone and NYSIIS.

My implementation can be found on Gist. It’s not fast, but you can’t ask more from Bash. Here is some numbers on my computer, runtime per call of Soundex function with fuzzy as name:

500ns - Fuzzy python library (which is C extension, written in PyX).
50us - Python implementation from Rosetta Code.
600us - my Bash implementation.
50ms - Bash implementation from Advanced Bash-Scripting Guide.

Honestly, it’s really not too bad, only 12x slower then pure Python implementation. But only if it does output correct result, although I added 38 tests, I still think it may go wrong with some names. Here is the test list:

PASS: Robert       -> R163
PASS: Rupert       -> R163
PASS: Rubin        -> R150
PASS: Ashcraft     -> A261
PASS: Ashcroft     -> A261
PASS: Tymczak      -> T522
PASS: Pfister      -> P236
PASS: Smith        -> S530
PASS: Smythe       -> S530
PASS: Harrison     -> H625
PASS: Hargison     -> H622
PASS: Harriman     -> H655
PASS: Soundex      -> S532
PASS: Example      -> E251
PASS: Sownteks     -> S532
PASS: Ekzampul     -> E251
PASS: Euler        -> E460
PASS: Gauss        -> G200
PASS: Hilbert      -> H416
PASS: Knuth        -> K530
PASS: Lloyd        -> L300
PASS: Lukasiewicz  -> L222
PASS: Ellery       -> E460
PASS: Ghosh        -> G200
PASS: Heilbronn    -> H416
PASS: Kant         -> K530
PASS: Ladd         -> L300
PASS: Lissajous    -> L222
PASS: Wheaton      -> W350
PASS: Burroughs    -> B620
PASS: Burrows      -> B620
PASS: O'Hara       -> O600
PASS: Washington   -> W252
PASS: Lee          -> L000
PASS: Gutierrez    -> G362
PASS: Jackson      -> J250
PASS: VanDeusen    -> V532
PASS: Tomahawk     -> T520
Tests passed: 38 / 38
603.000 us per call

If you don’t know how complicated it is, please prepare your pen and a piece of paper, read the rules in Soundex page, try to do Jackson. That Rules is written so confusing, especially the Rule #3. They are also unclear what to do when encounter non-alphabets.

As I mentioned earlier, I was planning to write all three algorithms. But just check the Metaphone, 19 rules; NYSIIS, is that page even written in English?

And for the Double Metaphone, let me quote something for you:

for example, it tests for approximately 100 different contexts of the use of the letter C alone.

I must be crazy if I am going to write this which I don’t even use. Whoever invented this, must be nuts! Just kidding. Anyway, I want to keep my brain cells alive.

YJL

Soundex in Bash

0 comments:

Post a Comment