Notes on L10n and Language Technology recommendations

The Localisation and Language Technology Standards Recommendation for eGovernance had been put up for public review here. I have not seen anyone post observations on it, so I thought it would be useful to collect my small notes in one place.

My notes follow each quoted point, after the dash.
The document states that the existing standards and resources for Indian language computing are not all complete. Some of the gaps are in:

  • Keyboard layouts and character formations (of conjunct characters) – as I see it, these are actually two separate issues: [i] standardisation of keyboard layouts for Indic languages for conjunct characters and [ii] standardisation of character formations for conjunct characters
  • Terminologies for Indian languages (both technical and non-technical) – at this point it would be useful to make the existing work on standard/government-accepted terminology available for public download. A substantial amount of work has been completed on providing acceptable localisation terms, and having them available for download and use (both commercial and non-commercial) would be of great help
  • Unicode points for some Indian scripts (such as Santhali and Kashmiri) – these, I believe, would have to be pushed through the Unicode Consortium process and would thus require the involvement of the Ministry of Information Technology (and thus TDIL). It would be good to have the status of all such Unicode-related issues that are currently being handled collated at one of the sites. Additionally, the C-DAC GIST unit in Pune has linguistic experience in dealing with languages that are “new”; a way to get status updates on that work would go a long way.
  • Transliteration for Indian names – again, it would be good to know the accepted recommendations for geographical names, including standardisation of their localised forms
  • A small group of experts shall be constituted for each of the 22 Official languages which will make a thorough study of the current status of all aspects of technology support (including character encoding schemes, input methods, OS and browser support, interconversion between different formats such as PDF and PostScript, search and processing etc) for the concerned language script, identify gap areas and suggest necessary action plan for bridging gaps quickly. The study may be completed within a time frame of 3 months – this is a very large chunk of a very big pie. It would be good to get this study/assessment done [i] in a transparent fashion and [ii] in a way that lets its output be tracked for accuracy and relevance
  • A small group of experts shall be constituted for each of the 22 Official Languages which will make a thorough study of the current status of all aspects of lexical resources (including corpora, dictionaries, morphological analyzers, thesauri and wordnets, spellcheckers etc) for the concerned language/script, identify gap areas and suggest necessary action plan for bridging the gaps quickly. The study may be completed within a time frame of 3 months – again, this is a very large piece. Is the “may be” in the time frame indicative of possible slippage? For over half a decade now, research institutes in the field of language technology have been tracking all these things while trying to push the envelope of Machine Translation forward. It is possible that they already have such assessment reports in place. Could these be made publicly available, so that an inclusive process can hasten the study?
  • A pilot study in the localisation of a selected G2C e-Governance application shall be carried out within 6 months. This will help formulate guidelines and priorities for further research and development in relevant areas – [i] 6 months from when? [ii] Are details available about the application selected, or about what is desired of the application for the pilot? [iii] What are the acceptance criteria for the pilot?
  • Local language support may reduce the language barrier to some extent but using keyboard-mouse-screen interface is still too complex and cumbersome for most people. Future lies in speech technologies. Speech technologies can be used for input, as well for output taking technology directly to the people. Emphasis may be laid on relevant R&D in this direction – is an assessment of the work done so far, including the relevant OPEN tasks, available to the general public?
  • You can buy any computer from any vendor anywhere in India and expect to be able to type in a letter, save it, print it and do all such basic operations in English without having to buy or install any specialized hardware or software or font. The case is not so with Indian scripts. Localization and specialized solutions are explicitly called for – this reads like a gross generalisation that blurs de facto and de jure standards. For those on reasonably modern Linux distributions, input, storage, printing and display do not really require explicitly calling for specialised solutions
  • Specification for non INSCRIPT keyboard layouts should be made available by either TDIL/CDAC – the relevant question is “why” this suggestion is being made. The specification should have been available to the general public for a long time, and it has not been.
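
On the first point above, about conjunct characters: a minimal sketch (my own illustration, not something from the recommendation) of why keyboard layouts and character formation are separate issues. A conjunct has no code point of its own; it is stored as a sequence of code points, which an input method has to produce and a renderer has to shape. The helper name `describe` is hypothetical, and the Devanagari conjunct used here is just one example.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# The Devanagari conjunct "ksha", written as three separate code points:
# KA, then VIRAMA (which suppresses the inherent vowel), then SSA.
conjunct = "\u0915\u094D\u0937"

for code_point, name in describe(conjunct):
    print(code_point, name)
```

What the user types to get this sequence is a keyboard layout question; how the three code points are drawn as one glyph is a font and rendering question.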

Additionally, it would be nice to track how ICU is dealing with the OPEN Indic issues, if any. Does anyone have pointers to that?