I recently read this article Using Fuzzy Matching to Search by Sound with Python, which uses Fuzzy library. I didn’t have any need to use this but I just found this interesting. So, I wrote this quick script:
#!/usr/bin/env python # -*- coding: utf-8 -*- import fuzzy dmeta = fuzzy.DMetaphone() fmt_dmeta = lambda s: '%4s, %4s' % tuple(dmeta(s)) funcs = [fuzzy.Soundex(8), fmt_dmeta, fuzzy.nysiis] names = [ [ ['en', 'Tom Hanks', 'Tom Hanks'], ['en', 'Thomas Hanks', 'Thomas Hanks'], ['zh_TW', 'Tangmu hankesi', '湯姆·漢克斯'], ['jp', 'Tomu hankusu', 'トム・ハンクス'], ['kr', 'tom haengkeuseu', '톰 행크스'], ], [ ['en', 'Barack Obama', 'Barack Obama'], ['zh_TW', 'Balake Oubama', '巴拉克·歐巴馬'], ['jp', 'Baraku Obama', 'バラク・オバマ'], ['ar', 'barak awbama', 'باراك أوباما',], ], [ ['en', 'Rose Kennedy', 'Rose Kennedy'], ['zh_TW', 'Luosi Gannaidi', '蘿絲·甘迺迪'], ['jp', 'Rozu kenedi', 'ローズ・ケネディ'], ], ] for name in names: for lang in name: results = '%8s / %10s / %-10s' % tuple(f(lang[1]) for f in funcs) print '%-16s %s %s' % (lang[1], results, lang[2]) print '-'*78
Tom Hanks T5200000 / TMNK, None / TANANC Tom Hanks Thomas Hanks T5252000 / TMSN, None / TANASANC Thomas Hanks Tangmu hankesi T5252000 / TNKM, None / TANGNANCAS 湯姆·漢克斯 Tomu hankusu T5200000 / TMNK, None / TANANCAS トム・ハンクス tom haengkeuseu T5200000 / TMNK, None / TANANGCAS 톰 행크스 ------------------------------------------------------------------------------ Barack Obama B6215000 / PRKP, None / BARACABAN Barack Obama Balake Oubama B4215000 / PLKP, None / BALACABAN 巴拉克·歐巴馬 Baraku Obama B6215000 / PRKP, None / BARACABAN バラク・オバマ barak awbama B6215000 / PRKP, None / BARACABAN باراك أوباما ------------------------------------------------------------------------------ Rose Kennedy R2530000 / RSKN, None / RASACANADY Rose Kennedy Luosi Gannaidi L2530000 / LSKN, None / LASAGANAD 蘿絲·甘迺迪 Rozu kenedi R2530000 / RSKN, RTSK / RASACANAD ローズ・ケネディ ------------------------------------------------------------------------------
I later found out I shouldn’t use full name, the algorithm doesn’t seem to be designed for that. But since the script has written, so I just dumped results here.
I was planning to write same library but in Bash and in JavaScript. So far, I finished Soundex in Bash, and I would not continue, you will read the reason at the end. Soundex is only one part of Fuzzy, it also supports Double Metaphone and NYSIIS.
My implementation can be found on Gist. It’s not fast, but you can’t ask more from Bash. Here is some numbers on my computer, runtime per call of Soundex function with fuzzy as name:
- 500ns - Fuzzy python library (which is C extension, written in PyX).
- 50us - Python implementation from Rosetta Code.
- 600us - my Bash implementation.
- 50ms - Bash implementation from Advanced Bash-Scripting Guide.
Honestly, it’s really not too bad, only 12x slower then pure Python implementation. But only if it does output correct result, although I added 38 tests, I still think it may go wrong with some names. Here is the test list:
PASS: Robert -> R163 PASS: Rupert -> R163 PASS: Rubin -> R150 PASS: Ashcraft -> A261 PASS: Ashcroft -> A261 PASS: Tymczak -> T522 PASS: Pfister -> P236 PASS: Smith -> S530 PASS: Smythe -> S530 PASS: Harrison -> H625 PASS: Hargison -> H622 PASS: Harriman -> H655 PASS: Soundex -> S532 PASS: Example -> E251 PASS: Sownteks -> S532 PASS: Ekzampul -> E251 PASS: Euler -> E460 PASS: Gauss -> G200 PASS: Hilbert -> H416 PASS: Knuth -> K530 PASS: Lloyd -> L300 PASS: Lukasiewicz -> L222 PASS: Ellery -> E460 PASS: Ghosh -> G200 PASS: Heilbronn -> H416 PASS: Kant -> K530 PASS: Ladd -> L300 PASS: Lissajous -> L222 PASS: Wheaton -> W350 PASS: Burroughs -> B620 PASS: Burrows -> B620 PASS: O'Hara -> O600 PASS: Washington -> W252 PASS: Lee -> L000 PASS: Gutierrez -> G362 PASS: Jackson -> J250 PASS: VanDeusen -> V532 PASS: Tomahawk -> T520 Tests passed: 38 / 38 603.000 us per call
If you don’t know how complicated it is, please prepare your pen and a piece of paper, read the rules in Soundex page, try to do Jackson. That Rules is written so confusing, especially the Rule #3. They are also unclear what to do when encounter non-alphabets.
As I mentioned earlier, I was planning to write all three algorithms. But just check the Metaphone, 19 rules; NYSIIS, is that page even written in English?
And for the Double Metaphone, let me quote something for you:
for example, it tests for approximately 100 different contexts of the use of the letter C alone.
I must be crazy if I am going to write this which I don’t even use. Whoever invented this, must be nuts! Just kidding. Anyway, I want to keep my brain cells alive.
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.