Proposal: test condition for "language"

Languages such as Japanese and Khmer require distinct punctuation and
spacing conventions. At the same time, citation styles in these
languages must support Western citation styles with Western
punctuation and spacing conventions.

As a tentative proposal for discussion, adding “language” as a
variable in CSL, with a test condition for “language”, would allow
such styles to apply the appropriate locale inside the tested block,
when the test is “any” or “all” and the test returns true. For
example:

The values in test would be limited to a two-character language code
from RFC 5646, followed optionally by a hyphen and a two-character
region code, also as per RFC 5646. The test would return true if the
"language" variable begins with a matching primary language tag. The
locale applied inside the condition would be that of the language plus
the optional region tag, or the base locale of the language where no
region tag is given in the test condition.
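
To make the matching rule above concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function names and the BASE_LOCALES table are hypothetical, and a real processor would presumably take base locales from its own locale files.

```python
# Hypothetical table of base locales; the entries are assumptions
# made for this illustration only.
BASE_LOCALES = {"de": "de-DE", "pt": "pt-PT", "ja": "ja-JP"}


def primary_tag(tag: str) -> str:
    """Return the lowercased primary language subtag of a tag like 'de-CH'."""
    return tag.split("-", 1)[0].lower()


def language_matches(test_value: str, item_language: str) -> bool:
    """True if the item's language begins with the test's primary language tag."""
    if not item_language:
        return False
    return primary_tag(item_language) == primary_tag(test_value)


def locale_for_test(test_value: str) -> str:
    """Locale applied inside the condition: language plus region if a region
    is given in the test, otherwise the base locale of the language."""
    if "-" in test_value:
        return test_value
    return BASE_LOCALES.get(test_value.lower(), test_value)
```

Under this reading, a test value of “de” matches an item tagged “de-CH”, and the tested block is rendered with the base locale for German.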

This arrangement would require that a choice be made for the
inter-cite joins within the citation. The processor could apply the
citation-level joins (defined in the locale) appropriate to the last
language that tests true, to the citation as a whole. This would
assume that local-language and Western cites will not be mixed inside
a single citation.
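
A minimal Python sketch of that choice follows. The delimiter values and the names JOINS and citation_join are hypothetical, chosen only to illustrate “the last language that tests true wins”; nothing here reflects an existing processor API.

```python
from typing import List, Optional

# Hypothetical citation-level delimiters, as they might be defined per locale.
JOINS = {"en-US": "; ", "ja-JP": "、"}


def citation_join(matched_locales: List[Optional[str]], default_locale: str) -> str:
    """Pick the inter-cite delimiter for a whole citation.

    matched_locales holds, for each cite in order, the locale set by a
    language test that returned true, or None if no test matched.
    """
    chosen = default_locale
    for locale in matched_locales:
        if locale is not None:
            chosen = locale  # the last language that tests true wins
    return JOINS[chosen]
```

If local-language and Western cites were mixed in one citation, this rule would silently apply the joins of whichever matching cite came last, which is why the assumption above matters.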

Frank

Languages such as Japanese and Khmer require distinct punctuation and
spacing conventions. At the same time, citation styles in these
languages must support Western citation styles with Western
punctuation and spacing conventions.

So you’re saying conventions for punctuation might vary within a
document when it mixes certain languages; that if you include
Japanese content in an English document, for example, that content
should get special treatment with respect to punctuation?

As a tentative proposal for discussion, adding “language” as a
variable in CSL, with a test condition for “language”, would allow
such styles to apply the appropriate locale inside the tested block,
when the test is “any” or “all” and the test returns true. For
example:

The values in test would be limited to a two-character language code
from RFC 5646, followed optionally by a hyphen and a two-character
region code, also as per RFC 5646. The test would return true if the
“language” variable begins with a matching primary language tag. The
locale applied inside the condition would be that of the language plus
the optional region tag, or the base locale of the language where no
region tag is given in the test condition.

This seems to be the tricky part: how to determine the language.

Does your draft proposal then suggest that the input model for CSL can
no longer be limited to strings, but must instead accommodate an
RDF-like object model, something like a tuple (value, language,
data-type)?

Bruce

Languages such as Japanese and Khmer require distinct punctuation and
spacing conventions. At the same time, citation styles in these
languages must support Western citation styles with Western
punctuation and spacing conventions.

So you’re saying conventions for punctuation might vary within a
document when it mixes certain languages; that if you include
Japanese content in an English document, for example, that content
should get special treatment with respect to punctuation?

Yes, that’s it. In scholarship in non-Western realms, Western sources
are generally cited according to their native conventions, but
local-language cites are not forced into that mold. Khmer, for
example, uses zero-width spaces everywhere; the ASCII space (0x20) is
never used. Japanese normally contains no spaces, and uses different
characters for comma (、), period (。), etc.

As a tentative proposal for discussion, adding “language” as a
variable in CSL, with a test condition for “language”, would allow
such styles to apply the appropriate locale inside the tested block,
when the test is “any” or “all” and the test returns true. For
example:

The values in test would be limited to a two-character language code
from RFC 5646, followed optionally by a hyphen and a two-character
region code, also as per RFC 5646. The test would return true if the
“language” variable begins with a matching primary language tag. The
locale applied inside the condition would be that of the language plus
the optional region tag, or the base locale of the language where no
region tag is given in the test condition.

This seems to be the tricky part: how to determine the language.

Does your draft proposal then suggest that the input model for CSL can
no longer be limited to strings, but must instead accommodate an
RDF-like object model, something like a tuple (value, language,
data-type)?

That would be preferable. To cope with legacy data, it might be
better to say that applications should provide data as a tuple, but to
accept and parse out string input as well.

OK, so we understand the issues. I’ll just throw it out for wider
comment from the different implementors here.

Bruce

The values in test would be limited to a two-character language code
from RFC 5646, followed optionally by a hyphen and a two-character
region code, also as per RFC 5646. The test would return true if the
“language” variable begins with a matching primary language tag. The
locale applied inside the condition would be that of the language plus
the optional region tag, or the base locale of the language where no
region tag is given in the test condition.

Wouldn’t it be clearer if the test only tests “true” if also the region code
matches? I would find it a bit counter-intuitive if

would test “true” for “de-CH”. Related to this, is it expected that there
will be cases where you’d want to apply region-specific output (e.g. “de-DE”
versus “de-CH”)?

Perhaps, if it would be desirable to have the ability to test against either
the language or the region, we could introduce a second condition, e.g.:

(tests “true” for “de-DE”, tests “true” for “de-CH”,
tests “false” for “nl-NL”)
(tests “true” for “de-DE”, tests “false” for
“de-CH”, tests “false” for “nl-NL”)

This arrangement would require that a choice be made for the
inter-cite joins within the citation. The processor could apply the
citation-level joins (defined in the locale) appropriate to the last
language that tests true, to the citation as a whole. This would
assume that local-language and Western cites will not be mixed inside
a single citation.

So do we need changes to the CSL locale files as well? (to indicate the
citation-level joins)

Rintze

On Thu, Nov 25, 2010 at 5:22 PM, Frank Bennett <@Frank_Bennett> wrote:

The values in test would be limited to a two-character language code
from RFC 5646, followed optionally by a hyphen and a two-character
region code, also as per RFC 5646. The test would return true if the
“language” variable begins with a matching primary language tag. The
locale applied inside the condition would be that of the language plus
the optional region tag, or the base locale of the language where no
region tag is given in the test condition.

Wouldn’t it be clearer if the test only tests “true” if also the region code
matches? I would find it a bit counter-intuitive if

would test “true” for “de-CH”. Related to this, is it expected that there
will be cases where you’d want to apply region-specific output (e.g. “de-DE”
versus “de-CH”)?

Perhaps, if it would be desirable to have the ability to test against either
the language or the region, we could introduce a second condition, e.g.:

(tests “true” for “de-DE”, tests “true” for “de-CH”,
tests “false” for “nl-NL”)
(tests “true” for “de-DE”, tests “false” for
“de-CH”, tests “false” for “nl-NL”)

I don’t think so. If you want to test for generic German, then you
should test for simply “de”; right?

The bigger issue here isn’t these CSL syntax details, it’s what
implementing this would mean for the input model. Effectively, we
would need to change input JSON from:

“title”: “The Title”

… to:

“title”: { “value”: “The Title”, “lang”: “en” }
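
For what it is worth, a processor could accept both shapes during a transition. The Python sketch below is only an illustration: normalize_field is a hypothetical helper, and the only names taken from the thread are the “value” and “lang” keys shown above.

```python
from typing import Optional, Tuple, Union


def normalize_field(
    raw: Union[str, dict], default_lang: Optional[str] = None
) -> Tuple[str, Optional[str]]:
    """Return (value, lang) for either a plain string or a {value, lang} object."""
    if isinstance(raw, str):
        # Legacy input: a bare string carries no language information.
        return raw, default_lang
    if isinstance(raw, dict) and "value" in raw:
        return raw["value"], raw.get("lang", default_lang)
    raise TypeError("field must be a string or an object with a 'value' key")
```

With this, "The Title" and { "value": "The Title", "lang": "en" } both yield a usable pair, at the cost of every consumer having to handle a missing language.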

The values in test would be limited to a two-character language code
from RFC 5646, followed optionally by a hyphen and a two-character
region code, also as per RFC 5646. The test would return true if the
“language” variable begins with a matching primary language tag. The
locale applied inside the condition would be that of the language plus
the optional region tag, or the base locale of the language where no
region tag is given in the test condition.

Wouldn’t it be clearer if the test only tests “true” if also the region code
matches? I would find it a bit counter-intuitive if

would test “true” for “de-CH”. Related to this, is it expected that there
will be cases where you’d want to apply region-specific output (e.g. “de-DE”
versus “de-CH”)?

Perhaps, if it would be desirable to have the ability to test against either
the language or the region, we could introduce a second condition, e.g.:

(tests “true” for “de-DE”, tests “true” for “de-CH”,
tests “false” for “nl-NL”)
(tests “true” for “de-DE”, tests “false” for
“de-CH”, tests “false” for “nl-NL”)

I don’t think so. If you want to test for generic German, then you
should test for simply “de”; right?

Perhaps. But in Frank’s proposal, the test value also represents the
language applied within the scope of the condition, so things aren’t that
clear cut.

The bigger issue here isn’t these CSL syntax details, it’s what
implementing this would mean for the input model. Effectively, we
would need to change input JSON from:

“title”: “The Title”

… to:

“title”: { “value”: “The Title”, “lang”: “en” }

The “language” variable would be a property of the item, right, instead of
the individual variables? I.e.

{ “title”: “The Title”, “lang”: “en” }

instead of

{ “title”: { “value”: “The Title”, “lang”: “en” } }

Rintze

The bigger issue here isn’t these CSL syntax details, it’s what
implementing this would mean for the input model. Effectively, we
would need to change input JSON from:

“title”: “The Title”

… to:

“title”: { “value”: “The Title”, “lang”: “en” }

The “language” variable would be a property of the item, right, instead of
the individual variables? I.e.

{ “title”: “The Title”, “lang”: “en” }

instead of

{ “title”: { “value”: “The Title”, “lang”: “en” } }

The former isn’t valid.

Either the value of the key (in this case, the title) is a string (as
it is now), or it’s a hash/dictionary or list (as it would have to be
to support this proposal).

This would have pretty big implications, as no client I’m aware of
supports this kind of modeling, and the only data formats I’m aware of
that could carry it (aside from a revised input CSL JSON) are BIBO RDF,
or maybe MODS.

In short, this would add a significant amount of complexity all around.

Bruce

Isn’t Frank proposing to just add a “language” variable, independent of the
"title" variable? E.g.:

[
{
“id”: “ITEM-1”,
“language”: “de-DE”,
“title”: “Some title”,
“type”: “book”
}
]

Rintze

I don’t know. That wasn’t the impression I got (mainly from his
response to my question about the model), but I could be wrong.

In any case, that approach would only work if you assumed all key
values were of the same language, which might not work in some cases
(like if you have a translation but also include data from the
original language; probably need to know what language that original
data is in).

Bruce

Languages such as Japanese and Khmer require distinct punctuation and
spacing conventions. At the same time, citation styles in these
languages must support Western citation styles with Western
punctuation and spacing conventions.

So you’re saying conventions for punctuation might vary within a
document when it mixes certain languages; that if you include
Japanese content in an English document, for example, that content
should get special treatment with respect to punctuation?

Yes, that’s it. In scholarship in non-Western realms, Western sources
are generally cited according to their native conventions, but
local-language cites are not forced into that mold. Khmer, for
example, uses zero-width spaces everywhere; the ASCII space (0x20) is
never used. Japanese normally contains no spaces, and uses different
characters for comma (、), period (。), etc.

As a tentative proposal for discussion, adding “language” as a
variable in CSL, with a test condition for “language”, would allow
such styles to apply the appropriate locale inside the tested block,
when the test is “any” or “all” and the test returns true. For
example:

The values in test would be limited to a two-character language code
from RFC 5646, followed optionally by a hyphen and a two-character
region code, also as per RFC 5646. The test would return true if the
“language” variable begins with a matching primary language tag. The
locale applied inside the condition would be that of the language plus
the optional region tag, or the base locale of the language where no
region tag is given in the test condition.

This seems to be the tricky part: how to determine the language.

Does your draft proposal then suggest that the input model for CSL can
no longer be limited to strings, but must instead accommodate an
RDF-like object model, something like a tuple (value, language,
data-type)?

That would be preferable. To cope with legacy data, it might be
better to say that applications should provide data as a tuple, but to
accept and parse out string input as well.

My bad. I was responding in haste, and this was misleading. I
thought you were referring to the language variable, but clearly you
meant item field data generally. This proposal has no implications
for item field input. I’ll explain in a separate post down-thread.

The bigger issue here isn’t these CSL syntax details, it’s what
implementing this would mean for the input model. Effectively, we
would need to change input JSON from:

“title”: “The Title”

… to:

“title”: { “value”: “The Title”, “lang”: “en” }

The “language” variable would be a property of the item, right, instead of
the individual variables? I.e.

{ “title”: “The Title”, “lang”: “en” }

instead of

{ “title”: { “value”: “The Title”, “lang”: “en” } }

The former isn’t valid.

Either the value of the key (in this case, the title) is a string (as
it is now), or it’s a hash/dictionary or list (as it would have to be
to support this proposal).

Isn’t Frank proposing to just add a “language” variable, independent of the
“title” variable? E.g.:

[
{
“id”: “ITEM-1”,
“language”: “de-DE”,
“title”: “Some title”,
“type”: “book”
}
]

I don’t know. That wasn’t the impression I got (mainly from his
response to my question about the model), but I could be wrong.

In any case, that approach would only work if you assumed all key
values were of the same language, which might not work in some cases
(like if you have a translation but also include data from the
original language; probably need to know what language that original
data is in).

Sorry for the confusion. It’s been a hectic two weeks, and I threw
the discussion off track with that ill-considered response earlier.
Stepping back …

For multilingual processing, there are two cases to consider: (a)
styles that require foreign language cites to be transliterated
(possibly with supplementary translations) and composed in the native
form of the document’s own language; and (b) styles that require
(certain) foreign language cites to be composed in their own, foreign,
native format.

An example of (a) might look like this:

[1] Maruyama, Hiro, “Koen to tochi shuyo (5)” [Parks and Eminent
Domain, Part 5], Nihon zoen gakkai zasshi 50, no. 5 (March 1987):
42-47.
[2] Mauro, Paolo. “Corruption and Growth.” The Quarterly Journal of
Economics 110, no. 3 (August 1995): 681-712.

An example of (b) might look like this:

[1] 丸山宏、「公園と土地収用」、日本造園学会雑誌、1987年3月50巻5号42-47頁。
[2] Mauro, Paolo. “Corruption and Growth.” The Quarterly Journal of
Economics 110, no. 3 (August 1995): 681-712.

In (a), the locale is the same throughout (en-US), and the metadata
content is mangled into a script that is familiar to readers and plays
nice with surrounding punctuation.

In (b), the metadata content is left untouched, but the locale is
adapted to the script used in the cite.

The proposal here relates to (b). Mixed-locale styles of this kind
are the norm in countries with languages that use non-roman scripts.

The use of a conditional to set the locale seems to work out nicely,
but the details of the match condition and selection of the locale to
be applied are still in flux.

Apologies again for the confusion.

Frank

Here is some possible language for the specification. The @-numbers
give the likely place where the text blocks might be inserted in the
current document version:

On Sat, Dec 11, 2010 at 6:53 AM, Frank Bennett <@Frank_Bennett> wrote:

On Sat, Dec 11, 2010 at 2:18 AM, Bruce D’Arcus <@Bruce_D_Arcus1> wrote:

On Fri, Dec 10, 2010 at 12:14 PM, Rintze Zelle <@Rintze_Zelle> wrote:

On Fri, Dec 10, 2010 at 12:01 PM, Bruce D’Arcus <@Bruce_D_Arcus1> wrote:

On Fri, Dec 10, 2010 at 11:54 AM, Rintze Zelle <@Rintze_Zelle> wrote:

On Fri, Dec 10, 2010 at 11:40 AM, Bruce D’Arcus <@Bruce_D_Arcus1> wrote:

@1385

language
    Tests whether the language field of the item to be
    rendered matches any of the given locales. The list of
    values is treated as a single test for purposes of
    the match attribute (“none”, “any”, or “all”).
    When the test succeeds, the locale is changed to the
    tested value for all child nodes called via the test.
    See `The language test attribute`_ below for details
    on the locale selection.

@1398

The language test attribute
^^^^^^^^^^^^^^^^^^^^^^^^^^^

The language test attribute accepts a space-separated list of locale
specifiers. Specifiers may be either a two-character
language code (e.g. “en”, for English), or a two-character language code
and a two-character region code separated by a hyphen
(e.g. “de-CH”, for the variant of German used in Switzerland).
Testing and locale assignment are performed as follows:

  1. The test succeeds if any of the given language specifiers match
    against the language field of the item to be rendered;

  2. Two-character language specifiers (those without a region code)
    match any item language value for that language, regardless
    of the region tag;

  3. The first language specifier in the list determines the
    locale to be set on children of the condition statement.

The following examples illustrate this behavior:

.. sourcecode:: xml

    <choose>
      <if language="pt">
        <text macro="cite"/>
      </if>
      <else>
        <text macro="cite"/>
      </else>
    </choose>

In the example above, the “cite” macro will be executed with the
base locale of Portuguese (“pt-PT”) for any item with a language
field value of Portuguese (e.g. “pt”, “pt-BR”, or “pt-PT”). For
all other items, the “cite” macro will be executed with the default
locale of the style.

.. sourcecode:: xml

    <choose>
      <if language="zh-CN">
        <text macro="cite"/>
      </if>
      <else-if language="zh-TW">
        <text macro="cite"/>
      </else-if>
      <else>
        <text macro="cite"/>
      </else>
    </choose>

In the example above, the “cite” macro will be executed with
the mainland (simplified) Chinese locale for items that have the
specifier “zh-CN” set in the language field. Items whose language
field is set to “zh-TW”, the specifier for the traditional
Chinese used in Taiwan, will be rendered with that locale.
All other items will be rendered with
the default locale of the style.

.. sourcecode:: xml

    <choose>
      <if language="de-AT de">
        <text macro="cite"/>
      </if>
      <else>
        <text macro="cite"/>
      </else>
    </choose>

In the example above, the “cite” macro will be executed
with the Austrian locale (“de-AT”) for all items that have
German set in the item language field, regardless of
region code.
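
The three rules above can be sketched in Python roughly as follows. This is a non-authoritative illustration: the function names are hypothetical, the requirement that a region-qualified specifier match the item's region exactly is an assumption read off the “zh-TW” example, and resolving a bare language code to its base locale is left out.

```python
from typing import Optional


def specifier_matches(spec: str, item_language: str) -> bool:
    """Match one locale specifier against an item's language field."""
    spec_lang, _, spec_region = spec.partition("-")
    item_lang, _, item_region = item_language.partition("-")
    if spec_lang.lower() != item_lang.lower():
        return False
    if not spec_region:
        # Rule 2: a bare language code matches any region of that language.
        return True
    # Assumption: a region-qualified specifier requires the same region.
    return spec_region.upper() == item_region.upper()


def evaluate_language_test(attr: str, item_language: str) -> Optional[str]:
    """Return the locale specifier to apply to child nodes, or None on failure."""
    specs = attr.split()
    if not specs or not item_language:
        return None
    # Rule 1: the test succeeds if any specifier matches.
    if any(specifier_matches(s, item_language) for s in specs):
        # Rule 3: the first specifier in the list determines the locale.
        return specs[0]
    return None
```

For the last XML example, evaluate_language_test("de-AT de", "de-CH") succeeds through the bare “de” specifier and yields “de-AT”, the first specifier in the list.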


Frank

Here is one more test for the experimental functionality, which shows
how inter-cite joins might be made to work:

http://bitbucket.org/fbennett/citeproc-js/src/tip/tests/fixtures/local/language_InterCiteJoins.txt

Frank

The use of a conditional to set the locale seems to work out nicely,
but the details of the match condition and selection of the locale to
be applied are still in flux.

But to me, the key question here is what I keep focusing on: that this
appears to require a fairly large change to the input expectations of
a CSL processor. E.g. there’s no reliable way to know the languages in
question unless they’re explicitly flagged as such, first by the user,
and then carried in the data format.

As I said to Rintze, I am thinking in particular of translated books
and such. So a Japanese scholar, for example, is citing a Japanese
translation of an English-language text, and they need to print the
title in both kanji and English.

Is my understanding correct?

Bruce

Or followup to last note …

Here is some possible language for the specification. The @-numbers
give the likely place where the text blocks might be inserted in the
current document version:


@1385

language
Tests whether the language field of the item to be
rendered matches any of the given locales.

… I’m thinking this wouldn’t work for the use case I just identified:

“… a Japanese scholar, for example, is citing a Japanese
translation of an English language text, and they need to print the
title in both Kanji, and English.”

Reason: the language for the “item” (BTW, I wonder if we should avoid
vague zotero-based language like this for spec language?) is not
sufficient to know the language of the “original-title” variable, and
you need both.

Bruce

Or followup to last note …

Here is some possible language for the specification. The @-numbers
give the likely place where the text blocks might be inserted in the
current document version:


@1385

language
Tests whether the language field of the item to be
rendered matches any of the given locales.

… I’m thinking this wouldn’t work for the use case I just identified:

“… a Japanese scholar, for example, is citing a Japanese
translation of an English language text, and they need to print the
title in both Kanji, and English.”

Reason: the language for the “item” (BTW, I wonder if we should avoid
vague zotero-based language like this for spec language?)

If there is a preferred or more precise term, I’ll use that instead.

is not
sufficient to know the language of the “original-title” variable, and
you need both.

The formatting conventions applied to a reference are locale-specific,
and apply to the reference as a whole. As you say up-thread, the
locale of a reference (or item) would need to be stated in a field.
It’s a pretty common piece of metadata (field 041, in MARC 21).

Am out the door, but …

Or followup to last note …

Here is some possible language for the specification. The @-numbers
give the likely place where the text blocks might be inserted in the
current document version:


@1385

language
Tests whether the language field of the item to be
rendered matches any of the given locales.

… I’m thinking this wouldn’t work for the use case I just identified:

“… a Japanese scholar, for example, is citing a Japanese
translation of an English language text, and they need to print the
title in both Kanji, and English.”

Reason: the language for the “item” (BTW, I wonder if we should avoid
vague zotero-based language like this for spec language?)

If there is a preferred or more precise term, I’ll use that instead.

What do we use in the spec currently?

I’d prefer an obvious distinction between “reference” data and
“citations” (and/or “citation references”) I guess.

is not
sufficient to know the language of the “original-title” variable, and
you need both.

The formatting conventions applied to a reference are locale-specific,
and apply to the reference as a whole. As you say up-thread, the
locale of a reference (or item) would need to be stated in a field.
It’s a pretty common piece of metadata (field 041, in MARC 21).

Can we back up, because I’m still not understanding the use case I guess.

I thought the basis of this proposal was the observation that
punctuation varies depending on the language/locale of the printed
bibliography (the global output language/locale, if you will), and the
locale information of the input data. But I’m pointing out here that
the input data may include multiple languages, so that we cannot rely
on a single language value.

So what am I missing?

Bruce

Am out the door, but …

Or followup to last note …

Here is some possible language for the specification. The @-numbers
give the likely place where the text blocks might be inserted in the
current document version:


@1385

language
Tests whether the language field of the item to be
rendered matches any of the given locales.

… I’m thinking this wouldn’t work for the use case I just identified:

“… a Japanese scholar, for example, is citing a Japanese
translation of an English language text, and they need to print the
title in both Kanji, and English.”

Reason: the language for the “item” (BTW, I wonder if we should avoid
vague zotero-based language like this for spec language?)

If there is a preferred or more precise term, I’ll use that instead.

What do we use in the spec currently?

I’d prefer an obvious distinction between “reference” data and
“citations” (and/or “citation references”) I guess.

Will do.

is not
sufficient to know the language of the “original-title” variable, and
you need both.

The formatting conventions applied to a reference are locale-specific,
and apply to the reference as a whole. As you say up-thread, the
locale of a reference (or item) would need to be stated in a field.
It’s a pretty common piece of metadata (field 041, in MARC 21).

Can we back up, because I’m still not understanding the use case I guess.

I thought the basis of this proposal was the observation that
punctuation varies depending on the language/locale of the printed
bibliography (the global output language/locale, if you will), and the
locale information of the input data. But I’m pointing out here that
the input data may include multiple languages, so that we cannot rely
on a single language value.

So what am I missing?

I’m not sure how common it is to mix languages within a cite in Asian
publishing. It’s certainly one to watch for (and I’ll keep an eye out
for examples), but in the case you mention – a translation of an
English work – there would be two distinct cites, one to the
translation, and another to the original work, each rendered in the
appropriate locale, without any transliteration or translation in
either (see below).

Citation conventions in the English realm and in Asian languages
differ in this respect, in part because of different audience
expectations. Cites in English-language scholarship appeal to a wide
audience, many of whom cannot be expected to know the target language
of cites to foreign works. Accordingly, styles require that Asian (or
Cyrillic) scripts be transliterated to roman, and possibly that
translations be appended to some fields (such as the title). That is
my case (a) up-thread, and isn’t a target of this proposal.

In Asian scholarly writing, when foreign works are cited directly (not
to a translation), the target audience is assumed to be versed in that
language, so transliterations and translations are not used there.
Conversely, when translations are cited, they are treated as ordinary
Japanese-language works, and the cite is expressed entirely in
Japanese. Here is a citation to one of the translations of Adam
Smith’s Wealth of Nations:

アダム・スミス『諸国民の富(一)〜(五)』(大内兵衛・松川七郎訳、岩波文庫)、岩波書店、1959〜1966年

I’m pretty confident that applying locale formatting at the cite level
is the right thing to do here, but I would agree that we don’t want to
rush it. We’ve applied for a grant here that includes some
multilingual Zotero development work. If that comes through (probably
a 30% chance I figure, we’ll know in March), we’ll have a small amount
of additional manpower for gathering and analyzing use cases.

About the CSL syntax I proposed …

After reviewing the thread and chatting with Rintze, it’s pretty clear
that cs:choose is the wrong vehicle for this, for a bunch of reasons.

A cleaner strategy might look like this:

... ...

The advantages over using a conditional would be:

  • Validation can easily assure that the locale “condition” occurs only
    at the top nesting level within cs:citation or cs:bibliography.

  • The adjustment to joins is expressed in a straightforward way.

  • The syntax is more compact.

Frank