Using the simple language and locale codes
How to choose the proper language and locale codes when localizing?
If you localize only for the macro-language, use the macro-language code. If you have only one English translation, use just the “en” code. If you have more than one, “en” will mean “en-US” and you will have to add more specific codes, like “en-CA”, for the others. The Unicode website gives more details in its article on picking the right language code.
The table below shows the language codes that I recommend. The principle is simple: use the simplest code that does not generate confusion. For languages not listed here it is safe to consult this article. If you have questions please email me; I will investigate and extend the table below with other cases.
Language | Recommended code | Explanation | Other valid codes |
---|---|---|---|
Chinese (Simplified) | zh | This is a macrolanguage. It maps to Chinese (Simplified) (sometimes referred to as Mandarin) because this is the predominant form. ref1 | zh-hans, zh-cn, zh-sg, zh-hans-cn, zh-hans-sg |
Chinese (Traditional) | zh-hant | It is a good idea not to specify the region: Traditional Chinese is used in several regions, so specifying only the script is better. | zh-tw, zh-hk, zh-mo, zh-hant-tw, zh-hant-hk, zh-hant-mo |
English (US) | en | You should use this code if you have only one English translation or if this is American English. American English is the predominant language so the “en” code will auto-map to en-us. | en-us |
English (UK) | en-gb | | en-uk (just for compatibility) |
Portuguese (Brazilian) generic | pt | Use pt instead of pt-br to enable easy fallback when you do not have a **pt-pt** translation. | pt-br |
Portuguese (Portugal) | pt-pt | Do not use just “pt” if you have translations for both Brazilian Portuguese and Portugal Portuguese. | |
Romanian | ro | Romanian has an ISO 639-1 code and there is no need to use a more complex code like the ones specified in ISO 639-2 or 639-3. | ro-ro, ro-latn-ro |
Spanish (Spain) | es | es-es is considered the predominant variant, so “es” maps to it. | es-es |
French (France) | fr | fr-fr is considered the predominant variant, so “fr” maps to it. | fr-fr |
For most languages it is safe to use the two-letter code. This works without problems for Arabic (ar), Czech (cs), Danish (da), German (de), Greek (el), Finnish (fi), Hebrew (he), Hungarian (hu), Italian (it), Japanese (ja), Korean (ko), Norwegian (nb), Dutch (nl), Polish (pl), Romanian (ro), Russian (ru), Swedish (sv), Turkish (tr), Ukrainian (uk).
Matching language codes
You have an application localized in a number of languages, and the system (OS or browser) reports one or more language codes that do not exactly match your list. How do you make an optimal selection?
I think it would be wise to reuse the same language tags as the HTML specification.
If you are building a browser application, you will get more detailed information (see RFC 2616) about the user's language preferences: “en, es, de, ja, zh-TW”, or even a list with preference factors (0.0-1.0): “en, es;q=0.8, de;q=0.7, ja;q=0.3, zh-TW;q=0.1”.
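As an illustration of how such a header can be turned into an ordered list, here is a minimal Python sketch. The function name `parse_accept_language` is hypothetical, and the handling of malformed `q` values (treating them as not acceptable) is my assumption, not something the RFC or this article prescribes.

```python
def parse_accept_language(header):
    """Return the language tags of an Accept-Language value, most preferred first."""
    entries = []
    for position, item in enumerate(header.split(",")):
        parts = [part.strip() for part in item.split(";")]
        tag = parts[0].lower()
        if not tag:
            continue  # skip empty items such as a trailing comma
        quality = 1.0  # RFC 2616: a missing q parameter means q=1
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # assumption: a malformed q means "not acceptable"
        if quality > 0:
            # Sort by descending quality; ties keep the order used in the header.
            entries.append((-quality, position, tag))
    entries.sort()
    return [tag for _, _, tag in entries]


print(parse_accept_language("en, es;q=0.8, de;q=0.7, ja;q=0.3, zh-TW;q=0.1"))
# ['en', 'es', 'de', 'ja', 'zh-tw']
```

Tags are lowercased here only to make later comparisons simpler; language tags are case-insensitive.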
The only remaining problem is matching what you have against what the system reports. The matching is not trivial, because you usually do not know the exact form the system will report. Codes like “zh-TW” should map to “zh-hant”, while “zh-CN” and “zh-hans” should map to “zh” (Simplified Chinese). Mapping “zh-TW” to “zh” is not allowed, even if you have only one Chinese translation available.
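The rules above can be sketched in Python as follows. This is only one possible approach under my own assumptions: the `ALIASES` table and the names `normalize` and `best_match` are hypothetical, the aliases are taken from the recommendations in this article, and the fallback simply drops the last subtag until something matches.

```python
# Hypothetical alias table: reported forms mapped to the codes recommended above.
ALIASES = {
    "zh-hans": "zh", "zh-cn": "zh", "zh-sg": "zh",
    "zh-hans-cn": "zh", "zh-hans-sg": "zh",
    "zh-tw": "zh-hant", "zh-hk": "zh-hant", "zh-mo": "zh-hant",
    "zh-hant-tw": "zh-hant", "zh-hant-hk": "zh-hant", "zh-hant-mo": "zh-hant",
    "en-us": "en", "en-uk": "en-gb",
    "pt-br": "pt", "es-es": "es", "fr-fr": "fr",
}


def normalize(tag):
    """Lowercase a tag, accept '_' as separator, and apply the alias table."""
    tag = tag.strip().lower().replace("_", "-")
    return ALIASES.get(tag, tag)


def best_match(reported, available, default=None):
    """Return the first available code matching the reported tags, in order."""
    supported = {code.lower() for code in available}
    for tag in reported:
        candidate = normalize(tag)
        while candidate:
            if candidate in supported:
                return candidate
            if candidate == "zh-hant":
                break  # never fall back from Traditional Chinese to plain "zh"
            # Drop the last subtag ("en-ca" -> "en") and try again.
            candidate = normalize(candidate.rpartition("-")[0])
    return default


available = ["en", "en-gb", "zh", "zh-hant", "pt"]
print(best_match(["zh-TW"], available))               # zh-hant
print(best_match(["zh-CN"], available))               # zh
print(best_match(["zh-TW", "en-CA"], ["zh", "en"]))   # en
```

In the last call the Traditional Chinese request is skipped rather than served with the Simplified translation, so the user's second preference wins.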
Soon, I will complete this article with a matching algorithm implemented in Python, so anyone can port it to their own programming language.
Resources
- IETF language tags – defined by BCP-47 (used by HTTP, HTML, XML,…)
- Language tags in HTML and XML (W3C)
- Language Subtag Lookup (W3C)
- IANA listings of codes for languages, scripts, regions, variants and redundant codes in text, xml and SQL format (langtag.net)
- PHP code that detects and selects proper languages based on Accept-Language header
- Ethnologue.com – encyclopedic reference of languages.