YJL: i and I in Turkish

Do you know what the following code outputs?

#!/usr/bin/env python2
import locale
locale.setlocale(locale.LC_CTYPE, 'tr_TR.utf8')
print 'i'.upper()

The answer is i, not I.

It began with this bug report, where I commented:

This is so bizarre: scancode: 42 name:’KEY_SHiFT_L’

Why is it small ‘i’?

I didn’t understand why that key symbol had a small ‘i’. Later I learned that ‘i’ in Turkish is a totally different story from the ‘i’ we know in English: Turkish actually has four different letters of i.

Contents

1 i, İ, ı, and I
2 String Normalization with Turkish locale
3 Conclusion

1 i, İ, ı, and I

(The header title is actually: i, İ, ı, and I.)

So, a quick lesson on switching case for ‘i’ in Turkish (hopefully I am not making mistakes ;):

dotted i and İ
- [upper] i (ASCII i) to İ (U+0130)
- [lower] İ (U+0130) to i (ASCII i)
dotless ı and I
- [upper] ı (U+0131) to I (ASCII I)
- [lower] I (ASCII I) to ı (U+0131)

Two of these four are the same characters we have in ASCII.
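For reference, the four letters can be listed with their code points and official Unicode names. This is a small Python 3 sketch using the standard unicodedata module; the variable names are only for illustration.

```python
import unicodedata

# The four letters of "i" used in Turkish, in the order of the header above.
letters = ["i", "\u0130", "\u0131", "I"]

# Look up the official Unicode name of each letter.
names = [unicodedata.name(ch) for ch in letters]

for ch, name in zip(letters, names):
    # Print only ASCII (code point and name) so any terminal can show it.
    print("U+%04X %s" % (ord(ch), name))
```

The two letters below U+0080 are the plain ASCII ones; the other two exist only for the dotted capital and the dotless small letter.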

Did you notice something wasn’t right? As I said at the beginning, the upper case of ‘i’ is still ‘i’, but the wiki says the upper case of ‘i’ is İ (U+0130), the dotted capital I. I believe Python can’t get it right even when the locale has been set, but Python is not the only one. From the wiki:

Dotless i (and dotted capital I) is handled problematically in the Turkish locales of several software packages, including Oracle DBMS, Java,[1] and Unixware 7, where implicit capitalization of names of keywords, variables, and tables has effects not foreseen by the application developers. The C or US English locales do not have these problems.

However, if you set the locale correctly (with the right charset), there is no problem:

import locale
locale.setlocale(locale.LC_CTYPE, 'tr_TR.iso88599')

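lower_i = '\xfd i'  # '\xfd' is the dotless small i (U+0131) in ISO-8859-9
upper_I = 'I \xdd'  # '\xdd' is the dotted capital I (U+0130) in ISO-8859-9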
print 'lower_i', lower_i.decode('iso8859-9').encode('utf-8')
print '2_upper', lower_i.upper().decode('iso8859-9').encode('utf-8')
print
print 'upper_I', upper_I.decode('iso8859-9').encode('utf-8')
print '2_lower', upper_I.lower().decode('iso8859-9').encode('utf-8')

lower_i ı i
2_upper I İ

upper_I I İ
2_lower ı i

They are correct.

But! (Here comes my favorite word.) If you use a unicode string, you get an unexpected result:

import locale
locale.setlocale(locale.LC_CTYPE, 'tr_TR.utf8')

lower_i = u'\u0131 i'
upper_I = u'I \u0130'
print 'lower_i', lower_i.encode('utf-8')
print '2_upper', lower_i.upper().encode('utf-8')
print
print 'upper_I', upper_I.encode('utf-8')
print '2_lower', upper_I.lower().encode('utf-8')

lower_i ı i
2_upper I I

upper_I I İ
2_lower i i

The dotless small ı and the dotted capital İ get the correct results; the other two do not. However, if you are not really dealing with locale-specific text, i.e. Turkish, this might be what you want; see the next section.

The only way I know to deal with this is to replace the characters manually.

import locale
locale.setlocale(locale.LC_CTYPE, 'tr_TR.utf8')

lower_i = u'\u0131 i'
upper_I = u'I \u0130'
print 'lower_i', lower_i.encode('utf-8')
print '2_upper', lower_i.replace(u'i', u'\u0130').upper().encode('utf-8')

lower_i ı i
2_upper I İ
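The same manual replacement can be wrapped in a pair of helpers that cover both directions. This is only a minimal sketch for Python 3, where every str is Unicode; turkish_upper and turkish_lower are hypothetical names, not part of any library, and they handle nothing but the four i letters.

```python
def turkish_upper(text):
    # Hypothetical helper: map dotted i to İ (U+0130) first, because the
    # default upper() would turn it into the ASCII I.
    return text.replace("i", "\u0130").upper()


def turkish_lower(text):
    # Hypothetical helper: map İ to i and ASCII I to ı (U+0131) first,
    # because the default lower() would not follow the Turkish rules.
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


upper_result = turkish_upper("\u0131 i")
lower_result = turkish_lower("I \u0130")
```

With these, the dotless and dotted pairs stay separate in both directions: upper_result is 'I İ' and lower_result is 'ı i'.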

2 String Normalization with Turkish locale

In the bug of that project, the key symbols are all switched to upper case and then compared against values from a predefined table. The data in the table is all caps, and that is the problem: under the Turkish locale we can never find a match, since ‘i’ isn’t being switched to ‘I’.

This is just one case. When coding, metadata is most likely just [a-z0-9-_]+, which is always ASCII. You might sanitize it to make sure, e.g. for a blog post slug. Say a post title is ‘This Is A Post’; a typical slug would be this-is-a-post. If you only use a str string under the Turkish locale, you end up with thIs-Is-a-post.
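One way to keep a slug independent of the locale is to lowercase only the ASCII letters through an explicit translation table. The following is a rough Python 3 sketch; slugify and _ASCII_LOWER are made-up names for illustration, and the character class is the one mentioned above.

```python
import re
import string

# Explicit A-Z to a-z table, so the result never depends on the locale.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def slugify(title):
    # Hypothetical helper: lowercase ASCII letters only, then collapse every
    # run of characters outside [a-z0-9_-] into a single hyphen.
    lowered = title.translate(_ASCII_LOWER)
    return re.sub(r"[^a-z0-9_-]+", "-", lowered).strip("-")


slug = slugify("This Is A Post")
print(slug)
```

Non-ASCII letters such as İ are simply replaced by a hyphen here; whether that is acceptable depends on what the slug is used for.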

A quick fix is to convert the string to Unicode, and that would be fine. If you are using Python 3, where strings are Unicode by default, you won’t even be aware of this.
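To illustrate the Python 3 remark, here is a small sketch of the table lookup from the bug. KNOWN_KEYS is a hypothetical table modeled on the key symbol quoted earlier, and the Turkish locale name is assumed to be installed; if it is not, the snippet just skips setting it.

```python
import locale

# Hypothetical all-caps lookup table, modeled on the key symbol in the bug report.
KNOWN_KEYS = {"KEY_SHIFT_L": 42}

previous = locale.setlocale(locale.LC_CTYPE)  # remember the current setting
try:
    locale.setlocale(locale.LC_CTYPE, "tr_TR.utf8")
except locale.Error:
    pass  # Turkish locale not installed; the lookup below behaves the same

# In Python 3, str.upper() follows Unicode rules and ignores the C locale.
scancode = KNOWN_KEYS.get("key_shift_l".upper())

locale.setlocale(locale.LC_CTYPE, previous)  # restore the original setting
print(scancode)
```

The lookup finds the entry because ‘i’ becomes the ASCII ‘I’ no matter which locale is active.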

Another way is to set the locale, which I did for that bug at first.

3 Conclusion

Locale is as painful as key stuff. And no, I can’t speak Turkish, and yes, I only read that wiki page. (Okay, okay, half of it.)

While I was reading that wiki page, I was shocked to read that the lack of the dotless ı on a phone system caused deaths.
