locale files - Part II

2- Fallback scheme defined inside the XML files themselves: I understand
from Rintze that the fall-back scheme from one locale to another is defined
and executed from within the code, and that the sequence

 de-AT  --> de-DE  --> en-US

is hardcoded and will vary for other languages. That makes the creation
of a new locale file very hard - one has to request that a fall-back scheme
be created and introduced in the code.

The natural place for the fall-back language to be would be inside the XML
files themselves. So one could create -ar-DZ for Arabic-Algeria and have it
fall-back to the language -ar and be in business in a minute! That would
make testing easier and lower the bar on the creation of new contributions.

Care would have to be exercised to prevent a user from creating an
infinite loop in the fall-back. The best way to handle both organization
and this technical problem would be to allow a locale file to fall back to
a language file only, and then to the English file. In the previous example
the fall-back would be:

 -de-AT.xml  -->  -de.xml  -->  -en.xml
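
The loop concern can be made concrete with a small check. The Python below is only a sketch: it assumes each locale file declares at most one fall-back target, modelled here as a plain dictionary (`declared` is a hypothetical stand-in for whatever declaration the XML files would carry), and it raises an error if following the declarations ever revisits a file.

```python
def fallback_chain(locale, declared):
    """Follow declared fall-backs from `locale`; raise ValueError on a loop.

    `declared` maps a locale code to its fall-back target (a hypothetical
    stand-in for a declaration inside each XML file). A locale without an
    entry ends the chain.
    """
    chain = [locale]
    seen = {locale}
    while chain[-1] in declared:
        nxt = declared[chain[-1]]
        if nxt in seen:
            raise ValueError("fall-back loop: " + " --> ".join(chain + [nxt]))
        chain.append(nxt)
        seen.add(nxt)
    return chain


# The restricted scheme (locale --> language --> English) cannot loop.
restricted = {"de-AT": "de", "de": "en"}
print(fallback_chain("de-AT", restricted))

# An unrestricted declaration can, and the check reports it.
looping = {"de-AT": "de-CH", "de-CH": "de-AT"}
try:
    fallback_chain("de-AT", looping)
except ValueError as err:
    print(err)
```

With the restriction to "locale, then language, then English" the chain has at most three entries, so a check like this would only be needed if arbitrary targets were allowed.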

In this new scheme, there would be 5 new language-specific files:
-de, -en, -fr, -pt and -zh, containing most of the language
material, and a sequence of very small files:

de-AT
de-CH
de-DE

en-GB
en-US

fr-CA
fr-FR

pt-BR
pt-PT

zh-CN
zh-TW

that would contain the real locale-specific settings. None of these files
will need to be created by hand - explanation coming up in Part III.

Paulo Ney

To me this sounds reasonable. I’ll say that creating a new locale file when
a similar one exists is not really “very hard”, since you can just copy
over anything that you want to maintain - e.g. for the fr-CA locale, folks
simply took fr-FR and changed the filename & one or two terms, which took a
total of 5 mins. That said, allowing fallbacks to be specified is certainly
more elegant & cleaner. The other concern would be implementers. I’m not sure
how well locales are implemented in the various citeproc versions and we’d
want to make sure that implementers are happy with this solution - e.g.
even the current locale set-up doesn’t work everywhere and that’s a
problem for CSL, so if citeproc maintainers or ref managers object to this,
imho those objections would override concerns for elegance.

Sebastian

On Sun, Oct 6, 2013 at 6:01 AM, Paulo Ney de Souza <@Paulo_Ney_de_Souza> wrote:

True! Creating a new locale from a nearby one is never hard, but
maintaining it later becomes hard. Because tokens are no longer
single-sourced, and people end up modifying one file and leaving the other
behind - this is clearly what happened to the pt-PT / pt-BR pair.

I’ll explain it a bit better in Parts III and IV, but I need more than a
phone to write that message!

Paulo Ney

As far as I know de-AT -> de-DE -> en-US isn’t hardcoded anywhere, at
least for Zotero, unless citeproc-js does something I’m not aware of. At
least in Zotero’s pre-citeproc-js cite code, it just looked for a full
code matching the current locale (e.g., ‘de-AT’), and if that didn’t
exist it looked for a matching language code (‘de’, but expanded to
’de-DE’ for historical reasons), and if that didn’t exist it used
’en-US’. I would guess that citeproc-js still does the same. I don’t
think there’s a reason anything needs to be hardcoded, in the code or in
the XML files, or requested.
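
The three-step lookup described above can be sketched in Python roughly as follows. This is not the actual Zotero or citeproc-js code; the function name `pick_locale` is made up for illustration, and expanding the language code by doubling it ('de' to 'de-DE') is an assumption about how the expansion works.

```python
def pick_locale(requested, available, default="en-US"):
    """Return the single locale whose file should be loaded."""
    # Step 1: exact match on the full code, e.g. 'de-AT'.
    if requested in available:
        return requested
    # Step 2: matching language code, expanded by doubling ('de' -> 'de-DE').
    # The doubling rule is an assumption made for this sketch.
    language = requested.split("-")[0]
    expanded = f"{language}-{language.upper()}"
    if expanded in available:
        return expanded
    # Step 3: general fall-back.
    return default


available = {"de-AT", "de-DE", "en-US", "fr-FR"}
print(pick_locale("de-AT", available))  # de-AT
print(pick_locale("de-CH", available))  # de-DE
print(pick_locale("ja-JP", available))  # en-US
```

Nothing in this lookup depends on a per-language table; it only looks at which files exist.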

Looking for ‘de-DE’ instead of ‘de’, though not technically accurate,
doesn’t strike me as having much of a downside, but with appropriate
warning I imagine the various implementers would be OK with changing
those to the two-character codes. (Some implementations might support
two-character codes already, given the presence of locales-eu.xml,
though it’s also possible that one just doesn’t work in some
implementations.)

If I’m understanding you, this doesn’t address your larger request, to
fall back on individual missing terms. If Frank and others want to
implement that, I guess they can, but I’m not convinced it’s worth the
effort. I’ve actually wanted en-US fallback in Zotero proper (not for
CSLs) for a long time, to avoid errors when new English strings are
added but other locale files haven’t been updated, but unlike the Zotero
strings, which change whenever we add or change features, the CSL
locales seem largely fixed, with relatively little maintenance involved
after the initial creation. It doesn’t really seem beyond the scope of
copy and paste. But, again, it’s really up to the implementers.

Here is what Papers does, implementation-wise. If a term is not available for the requested locale, it first falls back on the “parent” language locale. This parent language is hardcoded in our implementation, so for instance de-DE is the parent language for all the de-XX locales. In other words, each family of yy-XX locales has a designated parent yy-??. If the term is not found there either, it falls back on the en-US locale (again, the fact that this locale is the grandmother of all is hardcoded).
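
For readers who want to see the shape of that term-level lookup, here is a minimal Python sketch. It is not the Papers code: the `PARENT` table, the `lookup_term` function and the sample term values are all hypothetical and only illustrate the "requested locale, then parent, then en-US" order.

```python
# Hypothetical hardcoded parent table: each yy-XX family has one parent.
PARENT = {"de-AT": "de-DE", "de-CH": "de-DE"}
ROOT = "en-US"  # hardcoded last resort


def lookup_term(term, locale, terms_by_locale):
    """Return the term from the locale, its parent, or en-US, in that order."""
    for candidate in (locale, PARENT.get(locale), ROOT):
        if candidate is None:
            continue  # this locale has no designated parent
        terms = terms_by_locale.get(candidate, {})
        if term in terms:
            return terms[term]
    return None  # term is missing everywhere


# Illustrative data only, not copied from the real locale files.
terms_by_locale = {
    "de-AT": {"month-01": "Jänner"},
    "de-DE": {"month-01": "Januar", "and": "und"},
    "en-US": {"month-01": "January", "and": "and", "accessed": "accessed"},
}
print(lookup_term("month-01", "de-AT", terms_by_locale))  # Jänner
print(lookup_term("and", "de-AT", terms_by_locale))       # und
print(lookup_term("accessed", "de-AT", terms_by_locale))  # accessed
```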

The proposal to formalize that process makes sense.

Charles

Currently the spec dictates

de-AT > de-DE > en-US

for fallback behavior among the locale files, and the spec provides a list
of “primary” locales like “de-DE” (see
http://citationstyles.org/downloads/specification.html#locale-fallback ).
It also says “If the chosen output locale is a language (e.g. ‘de’), the
(primary) dialect is used in step 1 (e.g. ‘de-DE’).” Any CSL processor that
supports this will have to hardcode the primary locales, right?

To avoid hardcoding, we could rename “de-DE” and all other primary locales
to their 2-letter language tag (i.e. “de”), so the locale file fallback
simply becomes:

de-AT > de > en-US
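
With two-letter tags for the primary locales, the chain can be derived from the requested tag alone. A minimal Python sketch, assuming tags of the form "language-REGION" and a hypothetical helper name `locale_file_chain`:

```python
def locale_file_chain(requested, default="en-US"):
    """Build the locale file fall-back order from the tag itself."""
    if requested == default:
        return [default]
    chain = [requested]
    language = requested.split("-")[0]
    if language != requested:
        chain.append(language)  # e.g. 'de-AT' -> 'de'
    chain.append(default)
    return chain


print(locale_file_chain("de-AT"))  # ['de-AT', 'de', 'en-US']
print(locale_file_chain("de"))     # ['de', 'en-US']
```

No list of primary locales appears anywhere in this sketch, which is the point of the renaming.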

I’m not sure if that’s sufficient or in agreement with what Avram or Paulo
wish for, though. Avram seems to argue against locale file fallback
altogether in the Appendum thread?

Rintze

This was something I considered, too. So as long as we only have one
version for a language we just use the two-letter language code. After that we
just add locales with countries.

To me this has the advantage of being both elegant and working out of the
box with current citeproc behavior.

The biggest problem I see with that is different scripts. As both Avram and
Paulo point out, this would create very confusing (and, per Avram,
potentially offensive) substitutions.
I’m less concerned about Avram’s other concern - i.e. “how complete is a
locale” - since those interested can just check the locale files. I’m not
sure how big the first issue is and whether it may have a simple solution. I
disagree with Avram about not supporting any fallbacks (except to en-US to
prevent errors), since Paulo is correct that this can possibly make
supporting multiple locales for one language rather tedious. Not that
that’s a current problem, but it might become one long term.

Sebastian

Yes. I think we should focus on being good at matching users to the best
matching single locale file for their specified locale, but we shouldn’t
attempt to do it on the level of terms.

Currently the spec dictates

de-AT > de-DE > en-US

for fallback behavior among the locale files, and the spec provides a
list of “primary” locales like “de-DE” (see
http://citationstyles.org/downloads/specification.html#locale-fallback ).
It also says “If the chosen output locale is a language (e.g. ‘de’),
the (primary) dialect is used in step 1 (e.g. ‘de-DE’).” Any CSL
processor that supports this will have to hardcode the primary
locales, right?

I overlooked ‘zh-CN’, but that’s the only one of that list that needs to
be hardcoded, along with the general fallback of en-US. ‘de-DE’ (or
‘de’) and ‘pt-PT’ could just be rule based: if the user’s full locale
(e.g., ‘de-CH’) doesn’t exist, try the root (‘de’ or ‘de-DE’).
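
A quick way to see which primary locales a rule could derive is sketched below in Python. It assumes that "rule based" means doubling the language code ('de' becomes 'de-DE'), which is an assumption made for this sketch, and it only uses locales named in this thread and not the full list from the spec.

```python
def doubled(language):
    """Expand a language code by doubling it, e.g. 'de' -> 'de-DE'."""
    return f"{language}-{language.upper()}"


def needs_hardcoding(primary_locales):
    """Return the primary locales that the doubling rule cannot produce."""
    return [
        locale
        for locale in primary_locales
        if doubled(locale.split("-")[0]) != locale
    ]


print(needs_hardcoding(["de-DE", "pt-PT", "zh-CN"]))  # ['zh-CN']
```

en-US would also fail the doubling rule, but it is already covered as the general fallback.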

To avoid hardcoding, we could rename “de-DE” and all other primary
locales to their 2-letter language tag (i.e. “de”), so the locale file
fallback simply becomes:

de-AT > de > en-US

Sure, but our ‘de-DE’ was really just ‘de’ expanded for simplicity, and
I think the code — at least our code, but I imagine citeproc-js too —
was designed accordingly.

And to be clear, I’m talking solely about choosing a single file here,
as Avram (I think) is arguing for. I think that term-level substitution
is extreme overengineering for our purposes, with all the problems that
Avram points out. But if someone ends up with Standard German instead of
Austrian German, it’s pretty clear what happened.

The specification requires us to implement term-level substitution in
the processors for overrides embedded in the style file, with the same
selection logic as for full standalone locale selection. Continuing to
require full coverage in individual standalone locales might be the
best choice for clarity, as Avram and Dan suggest, but just in case,
going the other way wouldn’t be an exceptional burden for the
implementations.
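
As a rough illustration of what term-level substitution for style-embedded overrides amounts to, here is a Python sketch using `collections.ChainMap` from the standard library. The term values are illustrative and not copied from the actual locale files, and the layering order (style override first, standalone locale file second) is an assumption of this sketch.

```python
from collections import ChainMap

# Terms from a standalone locale file (illustrative values only).
locale_file_terms = {"and": "und", "et-al": "u. a."}

# Terms overridden inside a style file (illustrative values only).
style_override_terms = {"et-al": "et al."}

# Lookups search the maps from left to right, so the override wins
# and anything it does not define comes from the locale file.
terms = ChainMap(style_override_terms, locale_file_terms)

print(terms["et-al"])  # et al.
print(terms["and"])    # und
```

Extending the same idea to standalone locale files would mean adding more maps to the chain, one per fall-back locale.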