Wednesday, December 07, 2005

New Tamil Unicode encoding proposal - My opinion

Recently the Tamil Nadu Government website posted a proposal for a new Tamil Unicode standard (named TUNE) and asked for comments.

The details are in this URL:

The problem statement given by the TUNE team is: "As a result, Tamil Language, the script of which is NOT a "Complex Script" unlike other Indian Languages had also been considered as a "Complex Script" and the Unicode encoding developed employing level-2 implementation (instead of level-1) resulting in an unwieldy coding scheme for Tamil. "

My understanding is that the above statement makes the following two points:
1. Tamil Unicode has been made unnecessarily complex. [My opinion: so what?]

2. The Unicode encoding was developed using a level-2 implementation (instead of level-1), resulting in an unwieldy coding scheme for Tamil. [My opinion: This sounds more like an opinion of the TUNE creators. The statement has no factual backing.]

The problem statement itself appears very weak. I don't see a need to propose a new Tamil Unicode standard.

I also find that the paper gives the following incorrect information:

"Tamil Unicode support is yet to be available in other operating systems like MacOS, Linux , etc"

[4th sentence under the heading Other shortcomings in this URL: ]

Tamil Unicode is very well supported on Linux operating systems. Linux distros like FC4 come with out-of-the-box Tamil Unicode support.

The paper also says that another problem with the existing Unicode standard is that vendors like Adobe are yet to support it for Tamil.

This is absurd. The authors of this new proposal should ask vendors like Adobe to support Tamil Unicode in their applications, instead of trying to change the Tamil Unicode standard because a vendor does not support it.

The current status of the Tamil Unicode standard is as follows:

1. Unicode Tamil is supported very well on Linux, Windows & Mac platforms.

2. Most applications now have Tamil Unicode support (e.g. Tamil Mozilla & Tamil OpenOffice are already available).

3. Most Tamil websites are now in Unicode.

4. There are over 1000 Tamil bloggers writing in Unicode Tamil.

[ There are about 846 Tamil bloggers listed in alone, and there are a lot more listed in Technorati and not in ]

The existing Tamil Unicode standard works fine in every aspect, and it is gaining acceptance among vendors and users very fast.

There is absolutely no need to change the existing Tamil Unicode. Efforts like TUNE should not be encouraged at all. I find such efforts a mere waste of time and energy.

At the same time, I am not religious about supporting only the current Tamil Unicode.
If by chance the Unicode Consortium accepts the new Tamil Unicode proposal (which is a distant possibility), then I too will start using and supporting the new Tamil Unicode standard.

Until then I will continue to support and popularize the existing Tamil Unicode standard.


Anonymous said...

//Existing tamil unicode standard works fine in every aspect.// Not really, e.g., sorting.

Anonymous said...

Well said Mugunth.

As usual, uninformed people (usually TUNE supporters) seem to be intent on spreading FUD about the current Unicode.

In,, Pari writes:

> Sorting in current Tamil Unicode is horrible. So is parsing.
> They better get it right this time, or goto hell.

This is a typical example of an ill-informed, emotional statement about Tamil Unicode, made with little understanding of the facts.

First, sorting has nothing to do with the Unicode encoding. Unicode merely defines a mapping from each abstract character in a script to a unique number (its code point), which is then stored in 8-bit, 16-bit, or 32-bit units. Sorting, on the other hand, is a highly locale-specific issue, one that depends on linguistics, country/region, and the type of text being sorted. Therefore, sorting needs to be highly customizable and not tied to the character order in any script.
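To make that distinction concrete, here is a minimal Python sketch (the variable names are only illustrative) that lists the code points Unicode assigns to the word தமிழ் and the number of bytes two common encoding forms use to store it. Nothing in this mapping says anything about sort order.

```python
# The word தமிழ் written as explicit code points so the mapping is visible.
word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"  # த ம ி ழ ்

# Each character is just a number (a code point) in the Tamil block.
code_points = [f"U+{ord(ch):04X}" for ch in word]
print(code_points)  # ['U+0BA4', 'U+0BAE', 'U+0BBF', 'U+0BB4', 'U+0BCD']

# The same code points can be stored in different encoding forms.
utf8_length = len(word.encode("utf-8"))       # 3 bytes per Tamil code point
utf16_length = len(word.encode("utf-16-be"))  # 2 bytes per Tamil code point
print(utf8_length, utf16_length)  # 15 10
```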

The proponents of TUNE seem to hold the simplistic, but mistaken, view that character order in Unicode equals sorting order, or at least that it should. This may work for the Tamil script, as there is only one dominant language that uses it, but for other scripts such an assumption will create all kinds of problems.

Thus, correct sorting requires additional knowledge beyond the mere character order and is typically implemented on a per-locale basis by the system libraries in each operating system. (For example, we can have different sort orders for ta_IN and ta_MY.)
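As a rough illustration of sorting with knowledge beyond code point order, the Python sketch below ranks the 18 Tamil consonants in their traditional alphabet order and uses that rank as a sort key. The table and the function name `traditional_key` are assumptions made for this example. A real locale implementation also has to handle vowels, vowel signs, the pulli, and Grantha letters, which this sketch ignores.

```python
# Traditional order of the 18 Tamil consonants (illustrative table only).
TRADITIONAL_CONSONANTS = "கஙசஞடணதநபமயரலவழளறன"
RANK = {ch: i for i, ch in enumerate(TRADITIONAL_CONSONANTS)}

def traditional_key(text):
    # Characters missing from the table sort after all ranked ones,
    # falling back to code point order among themselves.
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in text]

letters = ["ன", "ற", "ள", "ழ"]

# Plain code point ("binary") order: U+0BA9, U+0BB1, U+0BB3, U+0BB4.
binary_order = sorted(letters)
print(binary_order)  # ['ன', 'ற', 'ள', 'ழ']

# Table-driven order follows the traditional alphabet instead.
traditional_order = sorted(letters, key=traditional_key)
print(traditional_order)  # ['ழ', 'ள', 'ற', 'ன']
```

The encoding stays the same in both cases; only the comparison rule changes.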

It is well known that the current *Windows* implementation of Tamil sorting has some flaws; i.e., Windows does not sort in the order that a native Tamil speaker would expect. However, that doesn't mean the rest of the computer world has to get it wrong too: the GNU C library implements Tamil sorting according to the Madras Tamil Lexicon order. Any system that uses the GNU C library, such as Linux, can sort Tamil text perfectly well. So it is not the fault of the Unicode encoding, nor of the Unicode Consortium, if a particular system library does not implement sorting correctly.

Second, where is the evidence that parsing Tamil Unicode is hard? Most major computer languages, including Java, C/C++, and Python, have excellent Unicode handling support that obviates the need for ever hand-coding a parser.

The TUNE proposal, a work of questionable scholarship based on flimsy evidence, will never be accepted by the Unicode Consortium on technical grounds. Fortunately Tamil Unicode is here to stay, despite the shenanigans of the TUNE proponents, thanks to its excellent multi-platform support and its daily use by Tamil bloggers and Tamil websites with "clue".
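As one hedged example of what that library support looks like, the Python sketch below splits Tamil text into written letters by attaching each combining mark (vowel sign or pulli) to the base character before it, using the standard `unicodedata` module. The function name `tamil_letters` is made up for this example, and the rule is a simplification of full Unicode text segmentation: it does not handle joiners or multi-consonant conjuncts.

```python
import unicodedata

def tamil_letters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # General categories starting with "M" are marks
        # (vowel signs, pulli); they belong to the previous base.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

print(tamil_letters("தமிழ்"))  # ['த', 'மி', 'ழ்']
```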

CAPitalZ said...

Please read at:

Please read in order up to P-9 [so far]


CAPitalZ said...

Read this:

Conventionally, when character data is stored, the sort sequence is based on the numeric values of the characters defined by the character encoding scheme. This is called a binary sort.

Binary sorts are the fastest type of sort, and produce reasonable results for the English alphabet because the ASCII and EBCDIC standards define the letters A to Z in ascending numeric value.

[Please note the point, the BINARY sorting is only possible because the letters are in order]

[Here is how our Tamil may be sorted]
A linguistic sort operates by replacing characters with numeric values that reflect each character's proper linguistic order. These numeric values are found in a table containing major and minor values.
All the methods described below are for [level-2] languages like Tamil. See how much additional processing is needed.

Over time, this time gap will shrink considerably, but the need for these steps is permanent.

Using linguistic indices you can provide the sophisticated sorting capabilities of a multilingual sort while achieving sorting performance nearly as good as a binary sort (which offers the best performance).

Binary sort is NOT possible for present-day Tamil Unicode! If anybody can prove that they can do a binary sort for present-day Tamil Unicode, I'll clean their shoes by licking.


Anonymous said...

> Binary sorts are the fastest type
> of sort, and produce reasonable
> results for the English alphabet
> because the ASCII and EBCDIC standards
> define the letters A to Z in ascending
> numeric value.

This is only true if your data is all upper case (or all lower case). Since most English data is _not_ all the same case, most such data cannot be sorted correctly with a binary sort.
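A short Python sketch shows the effect. The word list and the `two_level_key` function are invented for this example. The key uses the case-folded word as the major value and the original spelling as the minor tie-breaker, in the spirit of the major/minor table described in the text quoted earlier in this thread.

```python
words = ["banana", "Apple", "cherry", "apple", "Banana"]

# Binary sort: every upper-case ASCII letter (65-90) sorts before
# every lower-case letter (97-122), so "Banana" lands before "apple".
binary_order = sorted(words)
print(binary_order)
# ['Apple', 'Banana', 'apple', 'banana', 'cherry']

def two_level_key(word):
    # Major value: the word with case differences removed.
    # Minor value: the original spelling, used only to break ties.
    return (word.casefold(), word)

linguistic_order = sorted(words, key=two_level_key)
print(linguistic_order)
# ['Apple', 'apple', 'Banana', 'banana', 'cherry']
```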

Anonymous said...

The above page of mine should shed some light on this problem.


Current Unicode is based on Tamil Grammar, excluding the natural sort order.

Well, as far as sorting is concerned, we can touch the nose by taking the arm around the back of the neck!!!

TUNE might make sorting easier, but nothing else about it is technically correct.

Like Mugunth said, I think the current Unicode is good, but if TUNE gets accepted we can use that too, including for ezuththuch chiirmai.

Another thing to note: Unicode in general does not work in certain important instances, and that will be the same whether it is the current Unicode or TUNE. Unicode in general is still crawling; e.g., email, graphics, etc. are painfully broken. This is where a temporary Tamil ISO standard would still help while Unicode grows into a real one in the future. Maybe, instead of spending time on TUNE, working on a temporary non-hacked Tamil ISO would be a good move.

Sinnathurai Srivas