locale files - Addendum

And since we may be close to a change, there is no reason not to change it
so that it can deal with a fully general locale file. The last issue left
here is languages that can be written in more than one script. Sometimes
the different scripts are used in different countries and the string la-CO
(language-Country) completely defines the locale, but sometimes NOT. Azeri
is an example: it is written in Arabic script in Iran and in both Cyrillic
and Latin script in Azerbaijan, making it impossible to identify the locale
with a country string alone.

For those cases, the common string to define a locale is la-Scrp-CO, and the
most commonly used locales that require the script to be specified are:

az_Cyrl_AZ
az_Latn_AZ

ha_Arab_NG
ha_Latn_NG

mn_Cyrl_MN
mn_Mong_CN

sr_Cyrl_BA
sr_Latn_BA

sr_Cyrl_CS
sr_Latn_CS

sr_Cyrl_ME
sr_Latn_ME

sr_Cyrl_RS
sr_Latn_RS

uz_Cyrl_UZ
uz_Latn_UZ

zh_Hans_HK
zh_Hant_HK

zh_Hans_MO
zh_Hant_MO

In the case of CSL, the change would be just the ability to accept this
more general string as the name of a locale file, and of the corresponding
language file (-sr-Cyrl.xml and -sr-Latn.xml).
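To make the la-Scrp-CO shape concrete, here is a minimal sketch of how a tool might split such a string into its parts. The names parse_locale and locale_filename are hypothetical, and the "locales-" file prefix is an assumption for illustration, not something settled in this thread.

```python
import re
from typing import NamedTuple, Optional


class LocaleTag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]


# language (2-3 letters), optional 4-letter script, optional 2-letter region.
# Both "-" and "_" are accepted as separators.
_TAG_RE = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:[-_](?P<script>[A-Za-z]{4}))?"
    r"(?:[-_](?P<region>[A-Za-z]{2}))?$"
)


def parse_locale(tag: str) -> LocaleTag:
    """Split a la[-Scrp][-CO] string into normalised components."""
    match = _TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"not a la[-Scrp][-CO] string: {tag!r}")
    script = match.group("script")
    region = match.group("region")
    return LocaleTag(
        language=match.group("language").lower(),
        script=script.title() if script else None,
        region=region.upper() if region else None,
    )


def locale_filename(tag: str) -> str:
    """Hypothetical file name: assumed prefix plus hyphen-joined subtags."""
    parts = [part for part in parse_locale(tag) if part]
    return "locales-" + "-".join(parts) + ".xml"
```

With this sketch, parse_locale("az_Cyrl_AZ") gives language "az", script "Cyrl" and region "AZ", while a plain "zh-CN" simply has no script component.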

That is definitely my last posting on the subject for a while. I’ll come
back with the tool that lets someone create and save a locale, and with a
way to produce all the factored common language files, depending on the
adoption of the proposals.

Paulo Ney

This actually has come up once, but I never accepted the pull request
because I didn’t know how to deal with this issue. See
https://github.com/citation-style-language/locales/pull/46#issuecomment-10812215
(somebody contributed a Latin version of Serbian, while we already have
Cyrillic).

In all these cases, should we always include the script in the file name
(and “xml:lang” attribute within the locale)? Or should we designate one
script (e.g. Serbian Cyrillic) as the “primary” script and assign it “sr”
instead of “sr-cyrl”?

Rintze

On Mon, Oct 7, 2013 at 7:45 AM, Paulo Ney de Souza <@Paulo_Ney_de_Souza> wrote:

And since we may be close to a change, there is no reason not to change
it so that it can deal with a fully general locale file. The last issue
left here is languages that can be written in more than one script.
Sometimes the different scripts are used in different countries and the
string la-CO (language-Country) completely defines the locale, but
sometimes NOT. Azeri is an example: it is written in Arabic script in Iran
and in both Cyrillic and Latin script in Azerbaijan, making it impossible
to identify the locale with a country string alone.

In the case of CSL, the change would be just the ability to accept this
more general string as the name of a locale file, and of the corresponding
language file (-sr-Cyrl.xml and -sr-Latn.xml).

So the idea is to follow RFC 5646?

    http://tools.ietf.org/html/rfc5646

This actually has come up once, but I never accepted the pull request
because I didn’t know how to deal with this issue. See
https://github.com/citation-style-language/locales/pull/46#issuecomment-10812215
(somebody contributed a Latin version of Serbian, while we already have
Cyrillic).

In all these cases, should we always include the script in the file name
(and “xml:lang” attribute within the locale)? Or should we designate one
script (e.g. Serbian Cyrillic) as the “primary” script and assign it "sr"
instead of “sr-cyrl”?

That’s specified in RFC 5646. Languages can be associated with a
primary script, in which case it is omitted from the tag (or in this
case, the filename). It’s in section 2.2.3, para 4 (reference to
‘Suppress-Script’).

Following the RFC would probably be best, but we would have to think
about how to specify the fallback behaviour.

Frank
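As a rough illustration of the Suppress-Script idea, the sketch below drops the script from a tag when it is the one registered as the language's default. The table is a tiny illustrative subset (the authoritative values live in the IANA Language Subtag Registry), the function name file_tag is hypothetical, and Serbian is left out of the table on the assumption that it has no suppressed script.

```python
from typing import Optional

# Illustrative entries only; consult the IANA Language Subtag Registry
# for the real Suppress-Script values.
SUPPRESS_SCRIPT = {
    "en": "Latn",
    "ru": "Cyrl",
}


def file_tag(language: str,
             script: Optional[str] = None,
             region: Optional[str] = None) -> str:
    """Build a tag, omitting the script when it is the suppressed one."""
    language = language.lower()
    parts = [language]
    if script and SUPPRESS_SCRIPT.get(language) != script.title():
        # The script carries information, so keep it in the tag.
        parts.append(script.title())
    if region:
        parts.append(region.upper())
    return "-".join(parts)
```

Under these assumptions a Russian Cyrillic locale keeps the short name ru-RU, while both Serbian variants keep their script (sr-Cyrl-RS, sr-Latn-RS).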

I would vote to include the script in the filename because that is what
Linux, Python, PHP, and other open source projects do. It is not a standard
yet, but it is fast becoming one …

Paulo Ney

On Mon, Oct 7, 2013 at 10:21 AM, Rintze Zelle <@Rintze_Zelle> wrote:

So the idea is to follow RFC 5646?

    http://tools.ietf.org/html/rfc5646

We should try to follow the RFC as closely as possible; it is the best
guidance we have.

In all these cases, should we always include the script in the file name
(and “xml:lang” attribute within the locale)? Or should we designate one
script (e.g. Serbian Cyrillic) as the “primary” script and assign it "sr"
instead of “sr-cyrl”?

That’s specified in RFC 5646. Languages can be associated with a
primary script, in which case it is omitted from the tag (or in this
case, the filename). It’s in section 2.2.3, para 4 (reference to
‘Suppress-Script’).

My reading of the RFC is that one MAY omit the Script subtag or not,
depending on a judgement of whether “it adds no distinguishing value to the
tag”. Some language-region combinations have a preferred script, examples
being:

Han Traditional in Taiwan

Han Simplified in Mainland China

Some others, like Serbian, obviously depend on the region, and there is NO
preferred script that the region tag alone could specify. Since one would
have to come up with a uniform scheme (to make it easier for all), the way
it is being done is to INCLUDE the tag. That is how it is done in virtually
all Linux distributions, Python, Perl, PHP and several other Open Source
projects with large i18n efforts.

Paulo Ney

On Mon, Oct 7, 2013 at 10:45 AM, Frank Bennett <@Frank_Bennett> wrote:

I don’t have a strong view on file naming. We’ll need to work out how
fallback works, though.

It would make sense NOT to change from one script to another while doing a
fall-back (unless it is the ultimate English substitution), so the scheme
of

zh-Hans-CN  -->  zh-Hans  --> en-US

seems the most appropriate, or even better:

zh-Hans-CN  -->  zh-Hans  --> en

after the English general file is created.
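The truncation scheme above can be sketched as a small function. This is only an illustration under the stated rule (never cross scripts, end at English); the name fallback_chain and the `final` parameter are hypothetical.

```python
from typing import List


def fallback_chain(tag: str, final: str = "en") -> List[str]:
    """Drop subtags from the right without changing script, then end at `final`."""
    subtags = tag.replace("_", "-").split("-")
    chain: List[str] = []
    while subtags:
        chain.append("-".join(subtags))
        last = subtags[-1]
        # Once the last subtag is a 4-letter script, stop truncating so the
        # chain never reaches a bare language that may imply another script.
        if len(subtags) >= 2 and len(last) == 4 and last.isalpha():
            break
        subtags = subtags[:-1]
    if final not in chain:
        chain.append(final)
    return chain
```

For a tag without a script, such as pt-BR, the same function falls through the bare language before reaching the final English locale.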

PN

On Mon, Oct 7, 2013 at 11:25 AM, Frank Bennett <@Frank_Bennett> wrote:

To elaborate on my concern, for languages with multiple scripts, we have to
figure out what happens if a user specifies “zh-CN”, and we only have
"zh-hans-cn" and “zh-hant-cn” files. What then?

Rintze

On Mon, Oct 7, 2013 at 10:46 AM, Paulo Ney de Souza <@Paulo_Ney_de_Souza> wrote:

That is the most important reason why it is better for the fall-back file
to be defined inside the XML file. In this particular case we would have a
small file

zh-CN

that would define a fall-back to zh-Hans-CN (the most natural choice in
this case), or directly to the language file zh-Hans.xml.
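A minimal sketch of what a processor would have to do if each file declared its own fall-back: follow the declarations and refuse to loop. The dict below stands in for whatever would be declared inside the XML files; its contents, and the name resolve, are hypothetical.

```python
from typing import Dict, List

# Hypothetical declarations: each locale names the locale it defers to.
DECLARED_FALLBACK: Dict[str, str] = {
    "zh-CN": "zh-Hans-CN",
    "zh-Hans-CN": "zh-Hans",
    "zh-Hans": "en",
}


def resolve(tag: str, declared: Dict[str, str]) -> List[str]:
    """Follow declared fall-backs starting at `tag`, raising on a cycle."""
    seen: List[str] = []
    current = tag
    while True:
        if current in seen:
            path = " -> ".join(seen + [current])
            raise ValueError(f"fallback cycle: {path}")
        seen.append(current)
        if current not in declared:
            # No further declaration: this is the end of the chain.
            return seen
        current = declared[current]
```

The cycle check is the part that adds implementation burden: hand-written declarations can point at each other, which pure truncation cannot.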

Paulo Ney

On Mon, Oct 7, 2013 at 12:05 PM, Rintze Zelle <@Rintze_Zelle> wrote:

Similar to dependent CSL styles? I’m a bit worried about the burden to
implement that.

Rintze

On Mon, Oct 7, 2013 at 11:46 AM, Paulo Ney de Souza <@Paulo_Ney_de_Souza> wrote:

I’ll write the files and send them to you. Once a schema is decided I can
change a few lines in a script that will write all of them!

Paulo Ney

I meant the burden for implementing the scheme in CSL processors. Making
the files and changing the schema are the easy parts.

Rintze

On Mon, Oct 7, 2013 at 12:14 PM, Paulo Ney de Souza <@Paulo_Ney_de_Souza> wrote:

This all sounds like we are trying to implement something parallel to RFC
5646 / BCP 47. This is a standard that we should embrace as far as
possible. It has some implicit ideas of partial matching and fallback, but
we’d be adding substantial ad-hoc semantics by trying to define a new set
of fallback relationships.

I’ve been lurking on the IETF Languages mailing list for several years
(working group for RFC 5646 language subtags and stomping grounds of
language tagging experts), since Frank and I worked on getting some new
subtags approved by that body, and it strikes me that we are doing
something wrong if we are coming up with things like zh-Hans -> zh-Hant ->
en.

I know that a substantial body of useful localization data is encoded in
CLDR (http://cldr.unicode.org/), but I don’t know if we’d find this there.

The basic logic of substituting strings from successively fuzzier matches
makes sense on the surface, but in practice it’s going to be technically
difficult and probably hard to debug. I’d recommend not supporting
fallback, except perhaps to the guaranteed-complete locale of English.

Some logical fallbacks could be hard to detect for a user debugging strange
translations, or potentially offensive. There are cases where languages are
similar and we could save duplicated strings by having pt-PT and pt-BR both
inherit from a common pt, or one from the other, but that would make it
difficult to determine, for example, how complete the Brazilian Portuguese
translation is. Other failovers between scripts could be politically
fraught-- Hans and Hant, or all instances of Latn/Cyrl.

RFC 5646 also brings up other questions about failover – do we want to be
handling macrolanguages?
The example of zh here has thus far ignored the fact that zh isn’t even a
correct subtag for these locale files-- we should be using cmn, as zh is a
macrolanguage encompassing several subtags
(http://people.w3.org/rishida/utils/subtags/index.php?lookup=zh&submit=Look+up).

I think that the answer across the board is that no, we don’t want to
handle failover. Our goal should be fuzzy matching to get the best single
locale file for the user’s desired locale. We should use scarce engineering
resources to make sure that that component does work, so that we match the
style- or user-specified locale of ru-alalc97 to ru-Latn-alalc97 if
available, or to best-matches like ru-RU if not, so that we match cmn-Hant
to whatever zh we have available. That’s the level we have to get right.
String-level failover is out of spec and bound to be extremely confusing
for implementers, localizers and users.
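To illustrate the "best single locale file" idea, here is a rough sketch that scores the available locales against the requested tag and returns one winner. The scoring rule, the alias table mapping cmn to zh, and the name best_match are all assumptions made for this example; it is not the matching algorithm of RFC 4647 or of any CSL processor.

```python
from typing import Iterable, Optional, Set, Tuple

# Hypothetical alias so that a request for "cmn" can match "zh" files.
LANGUAGE_ALIASES = {"cmn": "zh"}


def _split(tag: str) -> Tuple[str, Set[str]]:
    """Return (language, set of remaining subtags), case-folded."""
    parts = tag.replace("_", "-").lower().split("-")
    language = LANGUAGE_ALIASES.get(parts[0], parts[0])
    return language, set(parts[1:])


def best_match(requested: str, available: Iterable[str]) -> Optional[str]:
    """Pick the one available tag that best matches `requested`, or None."""
    want_lang, want_rest = _split(requested)
    best, best_score = None, None
    for candidate in available:
        have_lang, have_rest = _split(candidate)
        if have_lang != want_lang:
            continue  # never cross languages
        # Prefer more shared subtags, then fewer subtags nobody asked for.
        score = (len(want_rest & have_rest), -len(have_rest - want_rest))
        if best_score is None or score > best_score:
            best, best_score = candidate, score
    return best
```

With this scoring, a request for ru-alalc97 picks ru-Latn-alalc97 when it exists and ru-RU otherwise, and cmn-Hant resolves to whichever zh file is on hand. Returning None when no file shares the language leaves the choice of a default (for example English) to the caller.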

String-level fallback is required by the CSL spec:

http://citationstyles.org/downloads/specification.html#locale

If a script specifier is added to the mix, we will need to work out
how fallback works with it, if only for the limited case of term
overrides embedded in a style.
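For the term-override case, string-level fallback amounts to asking an ordered list of sources for a term and taking the first hit. The sketch below assumes a particular ordering (style override first, then locale files) purely for illustration; the actual order is defined by the CSL specification. The names lookup_term and SOURCES and the sample terms are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Terms = Dict[str, str]

# Illustrative data: a style-embedded override, then two locale files.
SOURCES: List[Tuple[str, Terms]] = [
    ("style override (sr-Latn)", {"and": "i"}),
    ("locale file sr-Latn", {"editor": "urednik"}),
    ("locale file en-US", {"and": "and", "editor": "editor",
                           "translator": "translator"}),
]


def lookup_term(name: str,
                sources: List[Tuple[str, Terms]]) -> Optional[Tuple[str, str]]:
    """Return (source label, text) from the first source defining `name`."""
    for label, terms in sources:
        if name in terms:
            return label, terms[name]
    return None
```

Returning the source label along with the text is a small concession to the debugging concern raised earlier in the thread: it shows which file a surprising string came from.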