Automating Localization

Dr. Dobb's Journal February, 2005

Regular expressions make a tough job easier

By Hew Wolff

Hew is a senior engineer for Art & Logic. He can be contacted at hewwolff@sonic.net.

Web interfaces for configuration and management have been showing up in all kinds of networked devices—cable modems, routers, and printers—as an efficient way to provide a GUI to any client with a browser. You can think of the embedded web interface as a specialized, small-footprint web application. We've been building these interfaces for years. In one recent short and simple project, we translated our embedded web interface into Japanese.

The funny thing is, we basically did this same project a year ago. All this work is part of the web interface for our client's family of home-networking devices, which we've been developing since early 2002. The last part of this work was a major rewrite: We added a flexible XML interface between the user interface and the device's back end, and we threw in the localization feature, too. But the project had been dormant for months, and the client still had not built our latest code into any actual products.

Then (as in the end of the movie Spinal Tap), the project was suddenly revived by interest from Japan, where the client wanted to sell an older model of the same device. But the older model was running the pre-XML web interface code, which had no localization built in. If we returned to this obsolete codebase, how long would it take us to add localization? We had an excellent estimate based on our previous experience, but could we cut it down this time?

I didn't think we could, because the localization process is pretty straightforward. By "localization," I mean the same thing as "globalization" (oddly) or "internationalization." You go through the files looking for English text strings, and pull them into a big "language table," assigning each one a unique key. The web server loads this language table into its memory, and whenever it needs to display a text string, it looks up the key in this table. Translate your language table into a new language, and you've translated the whole web interface. Okay, you also have images that display text, so you need to prepare and serve a new set of images for each language, but the language table is the main thing. For the previous localization effort, our hapless web designer had to go through a hundred HTML files and pick out all the text strings. Tedious, certainly, but not complicated.
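
To make the scheme concrete, here is a minimal sketch of what a language table amounts to. The class, the one-key-per-line file format, and the fallback behavior are all illustrative assumptions; the real table and lookup belong to the client's web server.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: a language table as a flat file of "KEY=value"
// lines, for example "TITLE_MAIN=My Favorite Device".
public class LanguageTable {
    private final Map<String, String> strings = new HashMap<String, String>();

    public LanguageTable(String path) throws IOException {
        BufferedReader in = new BufferedReader(
            new InputStreamReader(new FileInputStream(path), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            int eq = line.indexOf('=');
            if (eq > 0) {
                strings.put(line.substring(0, eq), line.substring(eq + 1));
            }
        }
        in.close();
    }

    // Look up a key, falling back to the key itself so that missing
    // entries are easy to spot in the rendered pages.
    public String lookup(String key) {
        String value = strings.get(key);
        return value != null ? value : key;
    }
}

Switching the whole interface to another language is then a matter of pointing the loader at the translated file.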

Fortunately, my boss had an idea: Could we automate some of this process? Surely we could tell a program how to pick out a text string from a bunch of HTML code, such as:

<title>My Favorite Device</title>

The more I thought about this, the more I liked it. We'd be done sooner, and no one would have to read through all that HTML, so I could stick to writing nice code and our web designer could stick to designing beautiful images. As someone said, "I'd rather write code to write code than write code."

There were still definite risks, of course. On the positive side, we were familiar with the codebase. On the negative side, the automated localizer had to cope with a fairly complex codebase: HTML, lots of client-side JavaScript, and some server-side JavaScript code as well, not to mention some text strings that were still embedded in the C code of the web server. And Japanese might be hard to work with: I worried about finding the right character encoding, what the right browser platform would be, and whether I could even test the code on my system. The client provided someone to do the translation, but he was in Japan and we hadn't worked with him before.

We estimated four programmer-weeks for the localization itself (two weeks less than our previous experience) and another couple of weeks for other tasks such as preparing the Japanese images.

Localization

The localization would be mainly text processing, so I needed some kind of scripting language with good support for regular expressions (REs). Python or Perl would be an obvious choice, but I was more familiar with Java, so I started with that. I used the Apache Regexp package and tried a simple test. It was an AI problem, like spam filtering: A person can easily recognize text strings just by looking at the HTML code, and I wanted a program to approximate that intuitive filtering process. For example, here's an RE that picks out plain text between HTML tags, such as "My Favorite Device" in the aforementioned HTML code:

[^%]>\s*([A-Z][A-Za-z0-9 .,:/\(\)\-'\"&;!]+)\s*<[^%]

REs form their own language, one that is very useful but hard to read and document. Here's how this RE works. At the beginning, it says there must be the end of an HTML tag and that server-side code ending with "%>" doesn't count. Then there may be some whitespace, followed by our text (enclosed in parentheses so that we can pull it out and do something with it), and finally more possible whitespace and the beginning of another HTML tag. The text itself must begin with a capital letter and may contain letters, numbers, and reasonable punctuation.
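
To see the RE in action, here is a small Java sketch that runs it over the earlier example and prints whatever falls into the parentheses. It uses the standard java.util.regex package rather than the Apache Regexp package mentioned above, and the class name is made up.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractText {
    // The text-between-tags RE from above, escaped as a Java string literal.
    private static final Pattern BETWEEN_TAGS = Pattern.compile(
        "[^%]>\\s*([A-Z][A-Za-z0-9 .,:/\\(\\)\\-'\\\"&;!]+)\\s*<[^%]");

    public static void main(String[] args) {
        String html = "<title>My Favorite Device</title>";
        Matcher m = BETWEEN_TAGS.matcher(html);
        while (m.find()) {
            // Group 1 is the candidate text string: "My Favorite Device".
            System.out.println(m.group(1));
        }
    }
}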

The first regular expressions that I tried did well. Encouraged by these early results, I started adding more REs to cover more cases. The HTML, server-side JavaScript, client-side JavaScript, and C code all contained different kinds of text strings. So, for example, I wrote an RE to grab all the client-side JavaScript for further processing by other REs. The localizer was taking shape.

The localizer filter made two kinds of mistakes: It might grab some code as a text string (a false positive), or it might ignore a text string as code (a false negative). False positives are easier to clean up than false negatives: You're removing a key from the language table, rather than inventing a new key and inserting it into the language table in the right place. False positives are also easier to find. To test the filter by hand, I tweaked the localizer to replace each text string with a "marked" version in brackets, such as "{{My Favorite Device}}." Then I ran the web interface and looked carefully at the text strings. A false negative showed up as an unmarked text string. But a false positive usually showed up dramatically; for example, as a blank page when the JavaScript code choked on a marked string. So in my testing, I was more lenient about false positives, and I wound up with about 50 false positives and only two false negatives. The false negatives began with "11b" and "11g" (as in the 802.11 wireless standards), which look more like code than text. Converter.java, the localizer code, is available electronically (see "Resource Center," page 5).
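
For what it's worth, the marking trick itself is only a few lines. The sketch below is hypothetical and much cruder than Converter.java, but it shows the idea: wrap whatever the filter captures in braces and leave the surrounding code alone.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical marking pass: brace every string the given RE captures so that
// unmarked (missed) strings stand out when you browse the web interface.
public class MarkStrings {
    public static String mark(String source, Pattern textPattern) {
        Matcher m = textPattern.matcher(source);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Keep the characters around the captured group, but mark the text.
            String marked = m.group(0).replace(m.group(1), "{{" + m.group(1) + "}}");
            m.appendReplacement(out, Matcher.quoteReplacement(marked));
        }
        m.appendTail(out);
        return out.toString();
    }
}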

Testing

Very soon I realized that this project was ideal for test-driven development (also known as "test-first development"). In this approach, you maintain a test suite together with the application. To add some functionality, you add a test, check that the test fails with the current code, and—only then—add some code to make the test pass. Although I've often included automated testing in my work informally, this was my first serious attempt to follow this approach, which is one of the core practices of Extreme Programming. I was inspired by Roger Lipscombe's Roman numeral example (see http://www.differentpla.net/node/view/58), in which some beautiful code grows organically before your eyes. The schedule on this project was somewhat relaxed, which allowed me to experiment a bit. (Or maybe it was the success of the experiments that kept the schedule relaxed; it's hard to tell.)

Here the test suite was a small set of web source files, together with localized versions that I prepared by hand. I periodically ran the localizer in a special mode, which processed the test files and compared the current output with the correct output. Whenever this comparison choked, I fixed the problem, rewriting the test or the code depending on which looked right. The test suite became a distilled version of the web codebase, containing all the interesting cases.
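
The comparison itself was nothing fancy; in spirit it was a golden-file check along the lines of the following sketch (the class name and the choice to report the first mismatched line are illustrative, not the project's actual harness).

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

// Illustrative golden-file check: compare the localizer's output for a test
// file against a hand-prepared "expected" version, line by line.
public class GoldenFileCheck {
    // Returns -1 if the files match, otherwise the 1-based line number
    // of the first difference.
    public static int firstDifference(String actualPath, String expectedPath)
            throws IOException {
        BufferedReader actual = open(actualPath);
        BufferedReader expected = open(expectedPath);
        try {
            int lineNumber = 1;
            while (true) {
                String a = actual.readLine();
                String e = expected.readLine();
                if (a == null && e == null) {
                    return -1;                // both files ended: they match
                }
                if (a == null || e == null || !a.equals(e)) {
                    return lineNumber;        // mismatch, or one file ran out
                }
                lineNumber++;
            }
        } finally {
            actual.close();
            expected.close();
        }
    }

    private static BufferedReader open(String path) throws IOException {
        return new BufferedReader(
            new InputStreamReader(new FileInputStream(path), "UTF-8"));
    }
}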

The localizer turned out to be a strange application. First of all, it didn't have to be perfect. If it could make the right decision 99 percent of the time, I could clean up the other cases by hand in a day or so. I figured there would be a cleanup stage. For example, I was assuming that brand names such as "AppleTalk" should not be translated, but the client might later decide that they should be. On the other hand, although perfection was not a goal, performance was important. The localizer was designed to run just once for real. Although I ran it many times to test it during development, once the official localization was done and I started editing the web files by hand, running the localizer again would wipe out that work. To justify writing it, it had to perform well on its single run. In fact, that's how I decided when the localizer was done: when refining it further would take longer than the cleanup.

But performance in terms of running time was not very important in this application. It ran the test suite in about eight seconds and chewed through the whole 2-MB codebase in about 15 seconds, which was a bit slow, but acceptable.

Sometimes I combed through the localized output looking for bugs to add to the test suite. Sometimes a small change to an RE would trigger an unexpected problem, and I would thrash around changing REs until they worked again. Sometimes my code started to look ugly and I stopped to refactor it, relying on the test suite and the compiler to tell me if I'd broken anything.

Sometimes I cheated on the test-driven process. I would find myself writing the code before the test, especially if both were trivial. Or I would fix a localizer bug by rewriting the web code to avoid some exceptional case. It might seem easier just to leave those exceptions to the cleanup, but doing even a small amount of localization by hand is tedious. Also, when I was writing the simple testing framework, I didn't bother adding a test for the framework itself. Actually, the step in the process where you confirm that the new test fails against the existing code, although it may seem pointless, provides a good check on the integrity of the test suite and framework. It guarantees that the test can catch bugs and hasn't been accidentally disabled.

But in the end, I was confident that the localizer was reliable—and written in nice clean code, too. It should be easy to reuse large parts of the localizer code. I doubt this will happen, though, because I imagine the requirements will be too different next time. We'll probably just throw away the test suite along with the code. This is a shame because a test suite really shines in maintenance. But the general approach, as documented in the code and here, is still useful.

The localization ended up taking about three programmer-weeks. This was well within our budget, but still seemed slow to me, although it was fun. I had hoped that test-driven development would magically turn me into a fast developer, which did not happen. But it fit well with an evolutionary style of development, in which you stay close to running code all the time and add features gradually. I find this philosophy comfortable, and it's perfect for a small project such as this one.

Translation

The translation got off to a shaky start. I had only a narrow channel to communicate with the Japanese translator because of the time difference: I could post a message for him and get a terse response the next day. At first he was unclear about his part in the project, but I explained the project's background in detail, and that seemed to help.

I handed over the language table, which had about 1500 entries, and the translation came back surprisingly soon. However, it had the wrong number of lines. Although I could only read the keys, it seemed clear that some lines had been duplicated, merged, and generally mangled. This was not too surprising for such a long file. I wrote a little script to compare the keys in the English and Japanese versions, and that made it easy to fix the problems.
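
That checking script is easy to sketch. Assuming one "KEY=value" entry per line (the same illustrative format as the language-table sketch earlier), a hypothetical version looks like this:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical key check: report keys that appear in one language table
// but not the other. Run as: java CompareKeys english.txt japanese.txt
public class CompareKeys {
    public static void main(String[] args) throws IOException {
        Set<String> english = readKeys(args[0]);
        Set<String> japanese = readKeys(args[1]);
        for (String key : english) {
            if (!japanese.contains(key)) {
                System.out.println("Missing from translation: " + key);
            }
        }
        for (String key : japanese) {
            if (!english.contains(key)) {
                System.out.println("Unexpected key in translation: " + key);
            }
        }
    }

    private static Set<String> readKeys(String path) throws IOException {
        Set<String> keys = new LinkedHashSet<String>();
        BufferedReader in = new BufferedReader(
            new InputStreamReader(new FileInputStream(path), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            int eq = line.indexOf('=');
            if (eq > 0) {
                keys.add(line.substring(0, eq));
            }
        }
        in.close();
        return keys;
    }
}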

What I was slow to realize was that I could keep this script around as part of the build process. It caught more mistakes as I continued to tweak the language tables. I was still absorbing a key idea of test-driven development: Automate anything you can. Programs are much better at repetitive checking than you are, so use them.

The obvious character encoding to try was UTF-8, the 8-bit Unicode Transformation Format created by Rob Pike and Ken Thompson (http://www.ietf.org/rfc/rfc3629.txt/). Among other things, it's designed to work well with C code, and it's well supported by web browsers. In short, it worked beautifully. The Japanese strings passed successfully through the web server C code and the client-side JavaScript, and—at least on my machine—were correctly displayed by the browser using the character set already installed in Windows. I was not able to type Japanese text into the input controls, but I copied and pasted some from my browser, and that seemed to work fine, too.

At one point, I noticed some suspicious characters in the Japanese text in the browser—something like "?A?" Looking at the language table in a hex editor and using Markus Kuhn's reference (http://www.cl.cam.ac.uk/~mgk25/unicode.html), there did seem to be some bad UTF-8 characters. I found out that I had corrupted the language file text strings somehow while editing. (What do you expect when different people are using Word and Emacs to edit the same large file?) Because the keys were not affected, the checking script I had written did not catch the problem.

I decided to integrate some UTF-8 verification into the language build. First, I looked around for existing Java support but was disappointed. Java supports UTF-8, but apparently does not support debugging UTF-8 text. For example, if b is a byte array, you can convert it into a string using either

new String(b, "UTF-8");

or

(new DataInputStream(new ByteArrayInputStream(b))).readUTF();

But neither of these techniques identifies the location of a bad character, and in any case, Java has its own, slightly incompatible version of UTF-8. So I wrote my own UTF-8 verifier, taking this as an opportunity to try out the popular JUnit testing framework (http://www.junit.org/).
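
For reference, here is the general shape such a verifier can take. This is a minimal sketch, not the Utf8.java mentioned in the next paragraph: it validates lead and continuation bytes and reports the offset of the first bad byte, but it does not reject every overlong or surrogate encoding, which a thorough checker should.

// Minimal illustrative UTF-8 structure check (not the article's Utf8.java).
public class Utf8Check {
    // Returns the offset of the first invalid byte, or -1 if the array
    // is structurally well-formed UTF-8.
    public static int firstBadByte(byte[] b) {
        int i = 0;
        while (i < b.length) {
            int lead = b[i] & 0xFF;
            int extra;                          // continuation bytes expected
            if (lead < 0x80) {
                extra = 0;                      // ASCII
            } else if (lead >= 0xC2 && lead <= 0xDF) {
                extra = 1;                      // two-byte sequence
            } else if (lead >= 0xE0 && lead <= 0xEF) {
                extra = 2;                      // three-byte sequence
            } else if (lead >= 0xF0 && lead <= 0xF4) {
                extra = 3;                      // four-byte sequence
            } else {
                return i;                       // 0x80-0xC1, 0xF5-0xFF: never a lead byte
            }
            for (int k = 1; k <= extra; k++) {
                if (i + k >= b.length) {
                    return i;                   // sequence truncated at end of input
                }
                if ((b[i + k] & 0xC0) != 0x80) {
                    return i + k;               // expected a continuation byte (10xxxxxx)
                }
            }
            i += extra + 1;
        }
        return -1;
    }
}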

JUnit was lightweight—I hate a giant download—and easy to incorporate. My favorite feature was the assert methods built into the TestCase class, which were handy for expressing what's supposed to be true. Also, it was reassuring to see the brief report of tests passed and time elapsed. JUnit has a bunch of features that only kick in for larger projects, such as a nice way to group test cases into suites, but already I could see the appeal. (Check Utf8.java; the UTF-8 checker is available electronically.)
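
To give the flavor of that style (the old TestCase idiom with its built-in asserts), a couple of hypothetical tests for the checker sketched above might read:

import junit.framework.TestCase;

// Hypothetical JUnit tests for the Utf8Check sketch above.
public class Utf8CheckTest extends TestCase {
    public void testAsciiIsValid() {
        byte[] ok = "My Favorite Device".getBytes();
        assertEquals(-1, Utf8Check.firstBadByte(ok));
    }

    public void testLoneContinuationByteIsReported() {
        byte[] bad = { 'O', 'K', (byte) 0x80 };
        assertEquals(2, Utf8Check.firstBadByte(bad));
    }
}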

Again, I found that the test-driven approach produced satisfying, clean code—slowly. The basic algorithm came out in three short methods. I spent much more time on the test code, putting in thorough coverage of the various kinds of characters.

Conclusion

Next time, I might try different tools because mine had some problems. For example, for efficiency I had to use little scripts to build and run the code, which might have been easier with an interactive scripting language like Python. A formal build tool such as Ant might also have helped. Sun's free Java SDK is nice, but at one point the Java Virtual Machine began crashing with weird exceptions, which I fixed by upgrading. The Java code wound up at around 600 lines, which seems long, and as I mentioned, it was kind of slow. The Apache Regexp documentation was sketchy (what's the precedence of the various operators?). But overall, I was happy with my choices. I would try something new mostly for my own education.

I would definitely try the test-driven approach again. I won't use it all the time (for example, I'm still frustrated by the difficulties of writing automated tests for GUI applications), but I'll use it whenever I can. And I'll recommend UTF-8 to all my friends.

Finally, when you're facing weeks of tedious coding, consider automating it somehow. Programming should be interesting; if it's repetitive, something's wrong. A quick experiment could save you a lot of time.

DDJ