Once again: no excuses to ignore i18n in XML

Uche Ogbuji
Aug. 17, 2004 09:32 PM
URL: http://www.javareport.com/article.asp?id=9797...
I think the most pervasive problem in XML adoption is ignorance and even wilful sabotage of the international foundation on which XML is built. In several recent incidents, both in my consulting work and in my OSS/community work, I have come across systems that ignore or break XML's Unicode character model. I've almost grown tired of saying it, but it is worth saying until I've worked through my very last nerve: the single most important aspect of XML is its character model. Ditch XML and use something else before you mess with that. A tremendous amount of damage is done by people who can't see past the pointy brackets as the point of XML.
Yes, Unicode is hard. There is nothing to be done about this. We have a myriad of languages, writing systems and local conventions, and they complicate just about everything. That's our wacky, wondrous world for you. Nevertheless, as a software professional in this age, there is no excuse not to buckle down and learn the rigors of i18n. I don't mean to be a pedant about this: I know a lot less about i18n than I wish I did, and I fall short of good i18n in much of my code. However, I respect the problem, and I strive to work on my skills in the area and on my discipline in applying them in software development.
If you use XML in your work, please read "The skew.org XML Tutorial. A reintroduction to XML with an emphasis on character encoding", by Mike Brown (a truly brilliant article). You might also want to check out my article "Proper XML Output in Python". Even if you're not a Python programmer, you might find some use in its discussion of common character problems when generating XML.
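To give a flavor of the character problems that come up when generating XML, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It is not code from either article; the helper name `to_xml_bytes` and the sample string are made up for illustration. The idea is to let the serializer handle markup escaping and the encoding declaration, and to fall back to numeric character references when the output encoding cannot represent a character.

```python
import io
import xml.etree.ElementTree as ET


def to_xml_bytes(text, encoding):
    """Serialize a one-element document, declaring the encoding that is used."""
    root = ET.Element("note")
    root.text = text  # the serializer escapes &, < and > for us
    buf = io.BytesIO()
    ET.ElementTree(root).write(buf, encoding=encoding, xml_declaration=True)
    return buf.getvalue()


sample = "Ngọzi & café <ok>"

# UTF-8 can represent every character directly.
utf8_doc = to_xml_bytes(sample, "utf-8")

# US-ASCII cannot, so non-ASCII characters are written as character
# references such as &#233; instead of being dropped or corrupted.
ascii_doc = to_xml_bytes(sample, "us-ascii")

# Both documents parse back to exactly the same Unicode text.
print(ET.fromstring(utf8_doc).text == sample)
print(ET.fromstring(ascii_doc).text == sample)
```

The point of the sketch is that the two byte streams differ, yet both carry the same characters, because each one honestly declares its encoding and escapes what it cannot encode.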
Uche Ogbuji is a Partner at Zepheira, LLC, a solutions firm specializing in the next generation of Web technologies.
Showing messages 1 through 2 of 2.
-
Unicode is easy, bad unicode support is bothersome
2004-08-18 12:31:48 tcowan
-
Unicode is easy, bad unicode support is bothersome
2004-08-18 13:05:38 Uche Ogbuji
Sure, the abstract concept of Unicode is easy, but Unicode is more than just that. Unicode includes the transfer formats (standard encodings), which is generally where it gets hairy. Unfortunately, you can't just ignore the transfer formats when dealing with XML, because that is how the data gets to the XML processor.
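A small sketch in Python of why the transfer format matters, using only the standard library. The word "naïve" and the helper `parse_declared` are illustrative choices, not anything from a real system. The same five characters become different byte sequences in different encodings, and the XML processor sees only the bytes plus whatever the XML declaration claims about them.

```python
import xml.etree.ElementTree as ET

text = "naïve"

# One abstract string, three different byte sequences.
utf8 = text.encode("utf-8")         # ï takes two bytes
latin1 = text.encode("iso-8859-1")  # ï takes one byte
utf16 = text.encode("utf-16")       # byte order mark plus two bytes per character

# Reading UTF-8 bytes as if they were ISO-8859-1 silently garbles the text.
garbled = utf8.decode("iso-8859-1")


def parse_declared(declared, actual):
    """Build a document that declares one encoding but is written in another."""
    doc = "<?xml version='1.0' encoding='%s'?><w>%s</w>" % (declared, text)
    return ET.fromstring(doc.encode(actual)).text


# Honest declaration: the parser recovers the original characters.
ok = parse_declared("iso-8859-1", "iso-8859-1")

# Declaration says ISO-8859-1 but the bytes are UTF-8: no error, wrong text.
wrong = parse_declared("iso-8859-1", "utf-8")

print(len(utf8), len(latin1), len(utf16))
print(ok == text, wrong == garbled)
```

The reverse mismatch, declaring UTF-8 while writing ISO-8859-1 bytes, makes the parser reject the document outright, which is arguably the kinder of the two failures.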
Even with the brilliant minds behind Java, many mistakes were made in its implementations of Unicode-related technologies. I'm more familiar with the case of Python, where lessons from Java and Perl were kept in mind, and some of the biggest brains on the planet hammered out solid Unicode facilities. Even with all these stars aligned, things get rough in patches.
I think all these facts, as well as my plentiful experience working with Unicode, both on my own and with other developers, prove that Unicode is not as easy as a two-sentence description would imply.
And there is simply no comparing Unicode with ASCII. Unicode is necessarily much more complex.
--Uche
Weblog authors are solely responsible for the content and accuracy of their weblogs, including opinions they express, and O'Reilly Media, Inc., disclaims any and all liability for that content, its accuracy, and opinions it may contain.
This work is licensed under a
Creative Commons License.






