Archive for the ‘L10n’ tag
Do we need to look for new software ?
In an unguarded moment of misguided enthusiasm (and, there is no other way to put it) I volunteered to translate a couple of my favorite TED talks. The idea was simple – challenging myself enough to learn the literary side of translating whole pieces of text would allow me to get to the innards of the language that is my mother tongue and that I use for conversation. Turns out that there was an area that I never factored in.
Talks have transcripts and, they are whole blocks of dialogue which have a different feel when undergoing translation than the User Interface artifacts that make up the components of the software I translate. In some kind of confusion I turned to the person who does this so often that she’s real good at poking holes in any theory I propound. In reality, it was my turn to be shocked. When she does translations of documents, Runa faces problems far deeper than what I faced during the translation of transcripts. And, her current toolset is woefully inadequate because it is tuned to the software translation way of doing things rather than document/transcript/pieces of text translation.
In a nutshell, the problem relates to the breaking of text into chunks that are malleable for translation. More often than not, if the complete text is a paragraph or, at least a couple of sentences – the underlying grammar and the construction are built to project a particular line of thought – a single idea. Chunking causes that seamless thread to be broken. Additionally, when using our standard tools viz. Lokalize/KBabel, Virtaal, Lotte, Pootle, such chunks of text make coherent translation more difficult because of the need to fit things within tags.
Here’s an example from the TED talk by Alan Kay. It is not representative, but would suffice to provide an idea. If you consider it as a complete paragraph expressing a single idea, you could look at something like:
“So let's take a look now at how we might use the computer for some of this. And, so the first idea here is just to show you the kind of things that children can do. I am using the software that we're putting on the 100 dollar laptop. So, I'd like to draw a little car here. I'll just do this very quickly. And put a big tire on him. And I get a little object here, and I can look inside this object. I'll call it a car. And here's a little behavior car forward. Each time I click it, car turn. If I want to make a little script to do this over and over again, I just drag these guys out and set them going.”
Do you see what is happening ? If you read the entire text as a block, and, if you are grasping the idea, the context based translation that can present the same thing lucidly in your target language starts taking shape.
Now, check what happens if we chunk it in the way TED does it for translation.
So let's take a look now at how we might use the computer for some of this.
And, so the first idea here is
just to show you the kind of things that children can do.
I am using the software that we're putting on the 100 dollar laptop.
So, I'd like to draw a little car here.
I'll just do this very quickly. And put a big tire on him.
And I get a little object here, and I can look inside this object.
I'll call it a car. And here's a little behavior car forward.
Each time I click it, car turn.
If I want to make a little script to do this over and over again,
I just drag these guys out and set them going.
Get them out of context and, it does make threading the idea together somewhat difficult. At least, it seems difficult for me. So, what’s the deal here ? How do other languages deal with similar issues ? I am assuming you will not just be considering the entire paragraph, translating accordingly and then slicing and dicing according to the chunks. That is difficult, isn’t it ?
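One way to make the chunks a little less painful is to regroup them into whole sentences before translating, while remembering which chunks each sentence came from. Here is a minimal Python sketch of that idea. It assumes a chunk that ends in a full stop, question mark or exclamation mark closes a sentence. The function name `group_into_sentences` is my own invention and has nothing to do with how TED actually stores or serves its transcripts.

```python
import re

# A few chunks copied from the Alan Kay example above.
chunks = [
    "So let's take a look now at how we might use the computer for some of this.",
    "So, I'd like to draw a little car here.",
    "If I want to make a little script to do this over and over again,",
    "I just drag these guys out and set them going.",
]

# Assumption: a chunk ending in . ! or ? (optionally followed by a quote) ends a sentence.
SENTENCE_END = re.compile(r'[.!?]["”\']?$')


def group_into_sentences(chunks):
    """Merge consecutive chunks until the running text ends a sentence.

    Returns a list of (sentence_text, chunk_indexes) pairs, so a translator
    can work on the full sentence and still know which chunks it must be
    sliced back into.
    """
    groups = []
    buffer, indexes = [], []
    for i, chunk in enumerate(chunks):
        text = chunk.strip()
        buffer.append(text)
        indexes.append(i)
        if SENTENCE_END.search(text):
            groups.append((" ".join(buffer), indexes))
            buffer, indexes = [], []
    if buffer:
        # Trailing chunks that never reached a sentence end are kept as-is.
        groups.append((" ".join(buffer), indexes))
    return groups


groups = group_into_sentences(chunks)
for sentence, indexes in groups:
    print(indexes, sentence)
```

This only solves the easy half of the problem. Slicing the translated sentence back into the original chunk boundaries still needs a human, because word order in the target language rarely follows the English one.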
On a side note, the TED folks could start looking at an easier interface to allow translation. I could not figure out how one could translate and save as draft, and, return again to pick up from where one left off. It looks like it mandates a single session sitdown-deliver mode of work. That isn’t how I am used to doing translations in the FOSS world, which makes it awkward. Integrating translation memories, which would be helpful for languages with substantial work, and auto translation tools would be sweet too. Plus, they need to create a forum to ask questions – the email address seems to be unresponsive at best.
In the company of a ninja
It looks like watching the Ninja Assassin hasn’t done Shreyank any good. Else, he would have figured out that it is easy-peasy for a Founder and Chief Ninja like Dimitris Glezos (who is also known as DeltaGamma) to be at Bangalore and, elsewhere. Dimitris paid a surprise visit to Pune yesterday and it was fun. It isn’t always that you get the CEO of a startup to provide you with an in-person repeat of his keynote with added wisecracks and side-talks that are too scandalous for a “keynote”.
And, that too, at a fairly crowded Barista. It was awesome.
In fact I wanted to talk with him about how massive the momentum built up by Transifex has been. Just two years ago, in 2007, Tx was a GSoC project within The Fedora Project aimed at looking at managing translations from a developer’s perspective. Today, it is a start-up which is hiring employees, relocating to newer offices, has a foot-print across a significant portion of upstream community projects and, most importantly, has clients willing to pay for customization services and, developer services. Tx isn’t only helping translation communities by allowing them to craft their work in peace – it is keeping developer sanity with the fire-n-forget model of the architecture. I hear that PulseAudio, PackageKit developers are strong supporters of Tx. That is tremendous news. The provocative nature of Tx is also based on the charm that it has been bootstrapped. That should provide hope to developers thinking along the “product” route.
I would say that these two years have done Dimitris good. His focus on the road Tx should take has become more vivid and, he has a deeper insight into the changes he wants to bring about via Indifex. There’s nothing more exciting than keeping a close watch on his team and his company for news that would come up soon. Tx is coming up with a killer set of features in the upcoming releases. That should get the attention of a couple of clients too.
Throughout the afternoon we ended up talking about getting youngsters up to speed to think beyond patches as contributions and, start tuning their thoughts to products. Dimitris opines that patches are excellent jump-off points but in order to become a valuable contributor, one must start thinking about “architecture”, “design”, “roadmap”, “milestones” and all such issues that form part of the theory classes but never see implementation in real-life scenarios. In addition, there is also the need to inculcate the “CC thinking” in everyday work of creativity – be it code or, content or even be it hardware and standards (the “CC thinking” is a fancy short-hand for thinking about Open Standards, Open Protocols and so forth. In a somewhat twitter-ish way, we compressed it to a meta-statement we both could relate to and agree with).
Dinner and post-dinner with a couple of us was another story. Having a bunch of hard-core “Fedora” folks in the room creates a passion. Sitting back to savor the flames of discussions and, interjecting with a leading viewpoint to keep the debate flowing is the best way to get action items resolved. Nothing was left untouched – from the way to get the best out of *SCos to mundane stuff like getting feature requests into Tx, OLPC and Sugar, or, talking about the general issues within the IT development community in Greece. And of course, the frequent checks on Wikipedia to validate various points in the argument. We could have done with an offline Wiki Reader yesterday.
I think I finally went to sleep at something around 0200 today – which is impossibly past my standard time. There are photos aplenty, though I don’t know who will be uploading them. There was food, there was coffee, cakes, and, there were friends – in short, a nice day.
Pleasant experiences and project loyalty
As a general case, my experience with most of the FOSS projects whose products I consume or, contribute to, has been very pleasant. Feedback has generally been well received, requests listened to. So, what I am going to write is not very special. But, the instances are striking by themselves.
Some time ago, I was shopping for an off-line translation tool. I was fed up with Lokalize’s issues and, the fact that it wasn’t letting me do what I wanted to do at that point in time – translate. Additionally, I wasn’t in the mood to actually install a translation content management system to do stuff. Face it, I am an individual translator and, calling in the heavy shots to get the job done was a bit silly. So, I turned to Virtaal. Actually, I think I was goaded into giving it a try by Runa.
Virtaal was, at that point in time, not really a good tool.
And, you can figure from the blog link above that I wasn’t interested in it too much. However, since I ended up giving it a chance (you cannot simply ignore a recommendation from her) I ended up running into two issues. One was predominantly more annoying than the other and, in effect was what was putting me off the tool. However, the developers took interest to get it fixed and, in the latest release have resolved it.
The other bug was resolved in an even more interesting way – over IRC with hand-holding to obtain the appropriate debug information and, then on to editing the file to put in the fix. At the end, the fix might be trivial. But the level of interest and care taken by the team to listen to their users is what makes me happy. In this aspect, the other development crew I can mention is Transifex. I haven’t met most of them and yet they keep taking suggestions, reports via every communication channel they are on – blogs, micro-blogs, IMs, IRC and trac. That makes them visible, gets them into the shoes of the users and, I am sure it earns them invaluable karma points.
Yesterday, while helping (I just did the file editing while Walter did all the brain muscling) to close the other bug, I felt incredibly happy to be part of a system where it isn’t important who you are or, where you are from. What is important is that you have a real desire to develop better software and, make useful artifacts for all.
As it goes – “Your mother was right, it is better to share” link to video.
The post is brought to you by lekhonee v0.8
Context, subtext and inter-text
There are two points with which I’d like to begin:
- One, in their Credits to Contributors section, Mozilla (for both Firefox and Thunderbird) state that “We would like to thank our contributors, whose efforts make this software what it is. These people have helped by writing code and documentation, and by testing. They have created and maintained this product, its associated development kits, our build tools and our web sites.” (Open Firefox, go to Help -> About Mozilla Firefox -> Credits, and click on the Contributors hyperlink)
- Two, whether with design or, with inadvertent serendipity, projects using Transifex tend to end up defining their portals as “translate.<insert_project_name>.domain_name”. Translation, as an aesthetic requirement is squarely in the forefront. And, in addition to the enmeshed meaning with localization, the mere usage of the word translation provides an elevated meaning to the action and, the end result.
A quick use of the Dictionary applet in GNOME provides the following definition of the word ‘translation’:
The act of rendering into another language; interpretation; as, the translation of idioms is difficult. [1913 Webster]
With each passing day innovative software is released under the umbrella of various Free and Open Source Software (FOSS) projects. For software that is to be consumed as a desktop application, the ability to be localized into various languages makes the difference in wide adoption and usage. Localization (or, translation) projects form important and integral sub-projects of various upstream software development projects.
In somewhat trivial off-the-cuff remarks which make translation appear easier than it actually is, it is often said that translation is the act of rendering into a target language the content available in the source language. However, localization and translation are not merely replacing the appropriate words or phrases from one language (mostly English) with those of another language. They require an understanding of the context, the form, the function and most importantly the idiom of the target language i.e. the local language. And yet, in addition to this, there is the fine requirement of the localized interface being usable, while being able to appropriately communicate the message to users of the software – technical and non-technical alike.
There are multiple areas that were briefly touched upon in the above paragraph. The most important of them is the interplay of context, subtext and inter-text. Translation, by all accounts, provides a referential equivalence. This is because languages and, word forms evolve separately. And, in spite of adoption and assimilation of words from other languages, the core framework of a language remains remarkably unique. Add to this mix the extent to which various themes (technology, knowledge, education, social studies, religion) organically evolve and, there is a distinct chance that idioms and meta-data of words and phrases which are so commonplace in a source language, may not be relevant or, present at all in the target language.
This brings about two different problems. The first, whether to stay true to the source language or, whether to adapt the form to the target language. And, the second, as to how far losses in translation would be acceptable. The second is somewhat unique – translations, by their very nature have the capacity to add/augment to the content, to take away/subtract from the content thereby creating a ‘loss’ or, they can adjust and hence provide an arbitrary measure of compensation. The amount of improvement or, comprehension a translated term can bring forward is completely dependent on the strength of the local language and, the grasp over the idiomatic usage of the same that the translator brings to the task at hand. More importantly, it becomes a paramount necessity that the translator be very well versed in the idioms of the source language in addition to being colloquially fluent in the target language.
The first problem is somewhat more delicate – it differs when doing translations for content as opposed to when translating strings of the UI. Additionally, it can differ when doing translations for a desktop environment like, for example, Sugar. The known user model of such a desktop provides a reference, a context that can be used easily when thinking through the context of words/strings that need to be translated. A trivial example is the need to stress terms that are more prevalent or, commonly used. A pit-fall is of course that it might make the desktop “colloquial”. And yet, that would perhaps be what makes it more user-friendly. This paradox of whether to be source-centric or, target-friendly is amplified when it comes to terms which are yet to evolve their local equivalents in common usage. Terms like “Emulator” or, “Tooltip” or, “Iconify” are some of the trivial and quick examples.
I can pick up the recent example of “Unmove” from PDFMod to illustrate a need to appreciate the evolution of English as a language and, to point to the need for the developers to listen to the translators and localization communities. The currently available tools and, processes do not allow a proper elaboration of the context of the word. In English, within the context of an action word “move” it is fairly easy to take a guess at what “Unmove” would mean. In languages where the usage of the action word “move” in the context of an operation on a computer desktop (here’s a quirk – the desktop is a metaphor that is being adopted to be used within the context of a computation device) is evolving, Unmove itself would not lend itself well to translation. Such “absent contexts” are the ones which create a “loss in translation”.
The singularity here is that the source language strings can evolve beautifully if feedback is obtained from the translated language in terms of what does improve the software. The trick is perhaps how best to document the context of the words and phrases to enable a much richer and useful translated UI. And, work on tooling that can include and incorporate such feedback. For example, there are enormous enhancements that can be trivially (and sometimes non-trivially) made to translation memory or, machine translation software so as to enable a much sharper equivalence.
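As a small illustration of what "documenting the context" could look like in practice: the gettext PO format already has room for a note to translators (lines beginning with `#.`) and for a disambiguating context (`msgctxt`). The Python sketch below just formats such an entry. The context string "Edit menu" and the note are made up for this example. I have not checked how PDFMod actually marks up this string, so treat it as a sketch and not as their source.

```python
def po_escape(text):
    """Escape backslashes, double quotes and newlines for a PO string."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def format_po_entry(msgid, context=None, note=None):
    """Build one untranslated PO entry with optional context and note.

    `note` becomes '#.' comment lines (notes meant for translators) and
    `context` becomes a msgctxt line. Both are hypothetical here.
    """
    lines = []
    if note:
        for note_line in note.splitlines():
            lines.append(f"#. {note_line}")
    if context:
        lines.append(f'msgctxt "{po_escape(context)}"')
    lines.append(f'msgid "{po_escape(msgid)}"')
    lines.append('msgstr ""')
    return "\n".join(lines)


entry = format_po_entry(
    "Unmove",
    context="Edit menu",  # hypothetical context
    note="Reverses the last page move; shown next to Undo.",  # hypothetical note
)
print(entry)
```

With even one line of such a note, a translator no longer has to guess whether "Unmove" is a verb, a label or a status message.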
Looking forward to some improvements
I have been using Transifex based systems for a couple of days/weeks now. And, in line with what I did mention on my micro-blog, Transifex and Lotte make things really easy. The coolest devel crew makes that happen. And, since they lurk online and engage with their users, every little tweak or, improvement that is suggested and considered makes the consumers feel part of the good work they are doing. Good karma and awesome excitement all around.
At some point in time during the week, I’d put them in the tickets as feature enhancements. However, for the time being, here’s a couple:
- Lotte should allow me to click on a file that is not yet translated for my language and, add it to the collection. If I recall correctly, the current way to add it is to download the .pot, convert to the appropriate .po and, upload it with comments etc
- Lotte needs to allow “Copy from Source”. This should accelerate translation by removing the extra step of having to actually select, copy and paste. This comes in handy when translating strings within tags or, brands/trademarks and so forth
- Handling and using translation memory could be built into Lotte. For a particular file in a specific language within a project, it could perhaps provide suggestions of translated words. In the future, allowing teams to add their glossaries would make it a more powerful tool too. Having said that, I’ve always wondered what happens when team glossaries are created from files across various projects – is there a license compatibility soup problem that could crop up ?
- A Transifex installation could provide notifications of new files or, updated files for the language. This could be limited to the files for which the last translator is the person receiving the notices or, ideally, could be for the language itself.
- Statistics – providing each language a visual representation of commits over time or, per contributor commits would also be a nice addition
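To make the translation memory wish a bit more concrete, here is a rough Python sketch of the kind of suggestion lookup I have in mind. It is not how Lotte or Transifex work internally. The `memory` dictionary, the placeholder translations and the `suggest` function are all hypothetical, and the fuzzy matching simply leans on `difflib` from the Python standard library.

```python
import difflib

# Hypothetical translation memory: source string -> earlier translation.
# The values are placeholders, not real translations.
memory = {
    "Disconnect from VPN...": "<earlier translation of 'Disconnect from VPN...'>",
    "Connect to VPN...": "<earlier translation of 'Connect to VPN...'>",
    "Open File": "<earlier translation of 'Open File'>",
}


def suggest(source, memory, limit=3, cutoff=0.6):
    """Return up to `limit` (source, translation) pairs similar to `source`.

    `cutoff` is the minimum similarity (0.0 to 1.0) used by difflib;
    0.6 is difflib's own default and only a starting point.
    """
    matches = difflib.get_close_matches(source, list(memory), n=limit, cutoff=cutoff)
    return [(match, memory[match]) for match in matches]


for original, translation in suggest("Disconnect from the VPN...", memory):
    print(original, "->", translation)
```

A real implementation would also have to keep track of which project each remembered string came from, which is exactly where the license compatibility question above would start to bite.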
So much for Transifex. In fact, I need to write out all of that in a nicer way so as to allow the possibility of these turning into GSoC projects within Transifex.
Coming to Virtaal. Lokalize has been unbearably useless for me (it adds garbled text or whitespaces into files when using the stock F11 supplied one). And, before it is commented upon: no, I haven’t filed a bug yet, since getting the files done was a bit more important at that specific point. So, mea culpa. But I do check with every yum update and, it is still the same. The specific issue with Virtaal is that each time one gets a new string loaded for translation, the text input area loses the input method details. Which means that it is a constant game of switching back and forth between the inputs. Sadly enough, this is the only software that currently works for me (I don’t want to set up a local Pootle/Transifex instance and, do web based translation).
Lost in translation ?
From a recent mail on the Foundation list, here’s an interesting quote:
Collaboration among advisory board members: Now that we have a sys admin team in place would like to find ways that we can collaborate better. Mentioned an article by J5 that talked about that RH, Novell and others are less involved because of the maintenance burden. They spend time on money on things like translations. No process to get them upstream and so they do it all over again next year.
It is the last line that I find a bit off-key and, out of context.
The post is brought to you by lekhonee v0.7
For the win !
“What does it take to be good at something at which failure is so easy, so effortless?” : a quote from Better: A Surgeon’s Notes on Performance by Atul Gawande which is a highly recommended reading for those who have not read it yet (that’s a link to the flipkart.com entry for those who are local).
Last evening over dinner, among other things, Runa and I got talking about translations and, translation quality. That is one of our favorite shop-talk items and, since the morning blog had bits about my performance with spellings, it was a bit more significant. It is a somewhat known issue that most translation teams measure the length of the sprint, that is, how many strings were completed or, the percentage of the coverage for a particular project. Some projects attach badges like “supported” / “unsupported”, “main” / “beta” to the coverage and thus make the rush to the tape more important. At some point in time, it is important for the teams to sit down, understand and make notes about the quality of translations. Left to itself, the phrase “quality of translations” doesn’t mean anything, does it ? For example, if the phrase was “Disconnect from VPN…” and, you were required to translate it – how wrong can you go ?
It seems you can go wrong, and, most often do.
- One of the reasons that I have observed is that translating strings in an application and, translating content like documentation/release_notes/guides require different kinds of mind patterns.
- The second reason is the lack of fluency in the source language. So, if you are a translator/reviewer for any language, if you are using English source files (as most of us do), you need to be extremely proficient in the language. The way the sentences, phrases and sub-phrases arrange themselves in English may or may not lend themselves to direct translations.
- The third reason is that most translators do not take time out to first use the application in English (or, read the documentation completely in English) and, use it again (or, read it again) after translation. That is a recipe for disaster. English is a funny language and, sometimes, due to the structure of the source files, the context of the content is lost. What does look like a simple word might have a funny implication if the comprehension about how it is placed within the UI or, the user-interaction flow is not made a note of.
Now that most projects have some kind of “localization steering committees” it would be a good small project to observe which locales are coming up with the highest quality of translations and, attempt to understand what they are doing. Asking the language teams about the reasons that inhibit them from maintaining a high quality would also enable deeper understanding of how a project can help itself become a better one (in a somewhat strange loop way). Such discussions would enable coming up with Guidelines for Quality which are important to have. I firmly believe that all developers desire that their applications be consumed by the largest audience possible and, at heart, they are willing to sit down and listen to constructive suggestions about how best they can help the localization teams make it happen. That is the sweet spot the “LSCo” folks need to converge on and get going. In fact, for projects like OLPC, where a lot of new paradigms are being created, understanding translation processes and, chipping away at improving translation quality is highly requested.
Translation is still an activity that requires a fanatical attention to detail and, that little bit of ingenuity. There is something not right about committing a translation that smacks of a “letting go of the disciplined focus on detail” and, does not contain anything new. The job is made somewhat harder when it comes to documentation. One cannot (and, perhaps should not) go beyond what the author has written and yet, it has to be made available in the local language after “stepping into the shoes” (or, “getting into the mind”) of the original author while making it aligned with the natural flow of the target language. This is also the place where the “translator memory”, as opposed to the “Translation Memory”, becomes important. The mind should be supple enough to recall how similar idioms were translated earlier or, if an error that was already reported has cropped up again. Translators have a significant bit to contribute towards making the translation source files better, cleaner, well-maintained and, well documented. And, they have to do it right every time.
All this would come together to produce high quality translations and, wider usage of applications and documentation. Collaboration for the win !
The post is brought to you by lekhonee v0.6
Digital Content in Local Languages: Technology Challenges
I was reading through an article of the same name by Vasudeva Varma. Barring a whopper of a statement, the author does a reasonable job of pointing out some of the areas that need to be worked on. To begin with however, let’s take that statement:
For example, Hindi is rendered properly only on Windows XP and beyond. Though there are some efforts to create Indic versions of the Linux, largely there is very little support for Indian languages.
It is a bit out of context but nevertheless it is worth pointing out that one would have expected a bit more accuracy from the author. Especially because the availability of Indian languages and their ease of use on Linux distributions have improved significantly. And, folks who use the Indian language Linux desktop on a regular basis for their usual workflow are somewhat unanimous that “things do work”. In fact, it would have been nicer if the author had taken the time to test out a few Linux distributions in the native language mode to identify the weak points. Most of the upstream projects do have very active native language projects with a significant quantum of participants from Indian language communities. For example, translate.fedoraproject.org, l10n.gnome.org, l10n.kde.org etc are the ones that come to mind immediately.
At a larger level, I would wholeheartedly agree with the author that there exist gaps which need to be filled up. For example, with the desktop and applications getting localized, there is an urgent need to have “Cookbook” like documentation in native languages primarily for desktop applications. There is a greater need to improve existing work on the following:
- spell checkers
- dictionaries
- OCR
for the various Indic languages so as to enable a more wholesome usage of desktop applications. Sadly enough, a large bulk of the work around the above three bits is still “in captivity” at the various R&D initiatives across institutes in India with not much hope of being made available under an appropriate license allowing integration into FOSS applications.
The other part of the equation is the folks who create content or, collate content i.e. the writers and the publishers. To a large extent, there is a dearth of large volumes of local language content on the Internet. And while it could have been said that the difficulty with Linux and Indian languages was a show stopper, it isn’t really so any more. “Better search” has been a buzzword that has been around for a while, but till the time a quantification of better does happen, it isn’t impossible to get along with what is available right now. The primary barriers to input methods, display/rendering and printing have been largely overcome and, the tools that allow content to be created in Indian languages are somewhat more encoding aware than before. With projects like Firefox taking an active interest in getting things going around Indic, I would hazard a guess that things would get better.
Which brings us to the Desktop Publishing folks. I have talked about them and the need to figure out their requirements a lot of times. Suffice to state, the DTP tools need to be able to handle Indic stuff far better than they do now. And, probably we do have the work cut out there.
Global <-> local
In all the years that I have been interacting with the various upstream FOSS projects, reasoning and convincing various groups to have a ‘local’ view of issues that complements the global strategy has been an uphill task. Sometimes it is just that interpersonal relations have been able to overcome the curve. At other times, it has just been a constant pegging away with facts, data points and a regular representation of issues that validate the need to approach and integrate local issues within the fold of the greater goals of the project. Either way, it makes me happy to see another project realize the need to align the views and inputs of the local participants and, figure out ways and means to respect their inputs and listen to their feedback.
The Regional Groups aspect of OpenOffice.org has gone a bit unnoticed and somewhat unloved (and, it has been my fault since I do not recall talking too much on this). This would be one area where it would be good to have a few folks stand up and take ownership as a steward.
In other news of the day, I have an @gnome.org alias for myself (thanks SysAdmins). Sadly, it has the usual pain of making a botched job of my actual name and, by now, I am so used to folks chopping up my first name every way they feel that I am more amused and less bewildered at the lack of appreciation of names.
2009 would be the Year of …
Og Maciel writes about the possibility of 2009 being the Year of Translations. With the coming-out of awesome tools like Transifex, Damned Lies, Vertimus etc, it sure feels good to be even marginally involved in the process of translations.
Infrastructural pieces coming together ensure that a translation workflow that appeals to all and is easy for the end-user can be put in place with much ease. And, it would also mean a disruptive playing field for startups like Indifex. Making wide open spaces for innovation in translation workflow and infrastructure is an area that is bound to be welcomed by the folks who spend countless hours making applications, desktops and operating systems available in their local languages/locales. They don’t get appreciated often. They get recognized during release times in release notes and the like, but they do keep the engines running and the lights on. This is going to be their year.
I would venture so far as to state that in a trend of “2009 would be the Year of <insert_your_favorite_prediction>” it would be a Year of Content. Free and Open content un-encumbered by restrictive rights and legalese that would be re-distributable, would be informative, would be educational and would be able to bring about a change. Over the last 24 months, methods and tools that enable content creation on Linux desktops have become simpler. Especially when it comes to Indian languages. So, there are fonts available (some of them quite elegant), there are keyboard layouts, on-screen keyboards (like Indic OnScreen Keyboard or iok and even Quillpad), input methods, word-lists and similar bits that complete the user-experience when using a Linux desktop to compose content. In short, the traditional problems in the fields of input-display-printing have been substantially addressed to bring the end-user experience to a level where it should be easy to just plug-and-create.
There is a wealth of content in Indian languages, starting right from folk-tales that are part of the oral tradition to commercially generated content which needs to start moving into the UTF-8 encoding space. Projects like the OLPC can benefit from the availability of such forms. Work on Indic OCR still needs to move forward at a much more aggressive pace than at present, but there are signs of good things coming out of it. Digitizing data would also enable a lot of content to be archived and made available for consumption.
This is the year that should see a large part of such things happening. The marriage of content creators with the infrastructure developers is something that needs to happen as well. And, this needs to include folks from fields of comparative literature, media studies and the like. Anyone who really does generate content, should be met with and talked to regarding the need to exert themselves to become part of the process. Content already takes in a large chunk of investment outlay for the mobile players and with the availability of easy means of generating content, it would not be far to start thinking about a need to consolidate, find patterns, predict trends.
The convergence of the computing and application prowess of mobile devices, content creation workflows and upswing in the production of Indic language content for the webspace promises to make 2009 an interesting year of innovations.
Season’s Greetings to all.





