YJL: hanzidentifier, Chinese characters identifier

hanzidentifier (GitHub) is a Python library that identifies whether a string contains Chinese characters and, if so, whether they are Simplified or Traditional. Some examples, modified from the README:

>>> import hanzidentifier as hi
>>> hi.has_chinese('Hello my name is John.')
False
>>> hi.is_simplified('John说：你好！')
True
>>> hi.is_traditional('John說：你好！')
True
>>> hi.has_chinese('Country in Simplified: 国家. '
...                'Country in Traditional: 國家.')
True
>>> hi.identify('Hello my name is Thomas.') is hi.UNKNOWN
True
>>> hi.identify('Thomas 说：你好！') is hi.SIMPLIFIED
True
>>> hi.identify('Thomas 說：你好！') is hi.TRADITIONAL
True
>>> hi.identify('你好！') is hi.BOTH
True
>>> hi.identify('Country in Simplified: 国家. '
...             'Country in Traditional: 國家.'
... ) is hi.MIXED
True

identify() returns one of five constants, depending on the string:

hanzidentifier.UNKNOWN: no recognized Chinese characters
hanzidentifier.BOTH: Chinese characters that are valid in both Simplified and Traditional Chinese
hanzidentifier.TRADITIONAL: Traditional Chinese characters
hanzidentifier.SIMPLIFIED: Simplified Chinese characters
hanzidentifier.MIXED: a mix of characters that are solely Traditional and characters that are solely Simplified
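
One way to picture the five results is as a set comparison. The following is only a rough sketch of that idea, not the library's actual implementation: the two character sets are tiny and hand-made for illustration, and the classify() function and the string labels are hypothetical names. The real library takes its character data from CC-CEDICT.

```python
# Tiny hand-made character sets, for illustration only (an assumption,
# not the library's data). 你, 好 and 家 are written the same way in both.
TRADITIONAL_CHARS = set("你好說國家")
SIMPLIFIED_CHARS = set("你好说国家")


def classify(text):
    """Return a label for text, mirroring the five categories above."""
    # Keep only the characters that appear in either set;
    # Latin letters and punctuation are ignored.
    chars = {c for c in text
             if c in TRADITIONAL_CHARS or c in SIMPLIFIED_CHARS}
    if not chars:
        return "UNKNOWN"
    fits_traditional = chars <= TRADITIONAL_CHARS
    fits_simplified = chars <= SIMPLIFIED_CHARS
    if fits_traditional and fits_simplified:
        return "BOTH"
    if fits_traditional:
        return "TRADITIONAL"
    if fits_simplified:
        return "SIMPLIFIED"
    # Some characters are solely Traditional, others solely Simplified.
    return "MIXED"


print(classify("Hello my name is John."))
print(classify("John说：你好！"))
print(classify("John說：你好！"))
print(classify("你好！"))
print(classify("国家 and 國家"))
```

With these toy sets, the five calls should print UNKNOWN, SIMPLIFIED, TRADITIONAL, BOTH and MIXED in that order, matching the identify() examples earlier in the post.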

This library depends on Zhon, a Chinese text processing library that provides the CC-CEDICT data used to identify the characters. hanzidentifier is written by Thomas Roten under the MIT License, for Python 2 and 3, currently version 1.0.1 (2014-04-14).
