hanzidentifier (GitHub) is a Python library for identifying whether a string contains Chinese characters and, if so, whether they are Simplified or Traditional. Some examples, modified from the README:
>>> import hanzidentifier as hi
>>> hi.has_chinese('Hello my name is John.')
False
>>> hi.is_simplified('John说:你好!')
True
>>> hi.is_traditional('John說:你好!')
True
>>> hi.has_chinese('Country in Simplified: 国家. '
...                'Country in Traditional: 國家.')
True
>>> hi.identify('Hello my name is Thomas.') is hi.UNKNOWN
True
>>> hi.identify('Thomas 说:你好!') is hi.SIMPLIFIED
True
>>> hi.identify('Thomas 說:你好!') is hi.TRADITIONAL
True
>>> hi.identify('你好!') is hi.BOTH
True
>>> hi.identify('Country in Simplified: 国家. '
...             'Country in Traditional: 國家.'
...             ) is hi.MIXED
True
It can recognize five types of strings:
- hanzidentifier.UNKNOWN: no recognized Chinese characters
- hanzidentifier.BOTH: Chinese characters that are valid in both Simplified and Traditional Chinese, such as 你好
- hanzidentifier.TRADITIONAL: Chinese characters that include Traditional-only forms and no Simplified-only forms
- hanzidentifier.SIMPLIFIED: Chinese characters that include Simplified-only forms and no Traditional-only forms
- hanzidentifier.MIXED: a mix of Traditional-only and Simplified-only characters, such as 國家 together with 国家
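To make the five categories concrete, here is a minimal sketch of how such a classification could work using plain set operations. This is not hanzidentifier's actual implementation: the function name classify and the two tiny character sets are hypothetical and hand-picked for illustration, whereas the real library relies on much larger sets derived from CC-CEDICT data.

```python
# Toy character sets, hand-picked for illustration only (an assumption,
# not the library's data). A character valid in both scripts appears in both.
TRADITIONAL_CHARS = set("說國家你好")
SIMPLIFIED_CHARS = set("说国家你好")


def classify(text):
    """Return the name of one of the five categories for `text`."""
    # Keep only characters that appear in at least one of the two sets.
    chinese = {ch for ch in text
               if ch in TRADITIONAL_CHARS or ch in SIMPLIFIED_CHARS}
    if not chinese:
        return "UNKNOWN"

    only_traditional = chinese - SIMPLIFIED_CHARS  # e.g. 說, 國
    only_simplified = chinese - TRADITIONAL_CHARS  # e.g. 说, 国

    if only_traditional and only_simplified:
        return "MIXED"
    if only_traditional:
        return "TRADITIONAL"
    if only_simplified:
        return "SIMPLIFIED"
    # Every Chinese character found is valid in both scripts.
    return "BOTH"


if __name__ == "__main__":
    for sample in ["Hello", "你好!", "Thomas 说:你好!",
                   "Thomas 說:你好!", "国家 國家"]:
        print(repr(sample), classify(sample))
```

The key point is that shared characters such as 你, 好 and 家 never decide the result on their own; only characters that exist in just one of the two sets push a string toward SIMPLIFIED, TRADITIONAL or MIXED.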
This library depends on Zhon, a Chinese text processing library that provides the CC-CEDICT data used to identify the characters. hanzidentifier is written by Thomas Roten, released under the MIT License, and supports Python 2 and 3. The current version is 1.0.1 (2014-04-14).