Citation Style Language

En dash substitution in `text`

Yes, I think there are two things that I need to work out in citeproc-js for this. Currently, numeric variables (including locator, as @cormacrelf notes) are forced to use cs:number in the CSL-M schema. Under the hood, though, the numeric rendering code is applied even if cs:text is used (in vanilla CSL) to render a “numeric” variable (i.e. numerics + locator). That makes sense for locator and (in CSL-M styles) section, but as @bwiernik and @cormacrelf say, it both diverges from the spec, and creates practical problems.

I’m open to suggestions about what to do in the citeproc-js implementation. To more closely approach the CSL 1.0.1 spec, am I right in thinking that numeric variables (other than the locator special case) should just be passed through verbatim if rendered with cs:text? I can do that, but there is a possibility that CSL maintainers will start seeing forum complaints about field content that is no longer normalized.

A less aggressive alternative (from the standpoint of status quo in the hands of users) would be to continue rendering using the existing internal (numeric) method on these fields, but to forego transformation of hyphen to en-dash if the method is called from cs:text. This latter change would be more work to implement, but probably carries less risk of support hassle.

FWIW, cs:text is used with number variables 9891 times in 99% of styles in the main styles repo. Per variable, that’s:

⌁  g/c/styles ╍ (master) rg '<text.+variable="(chapter-number|collection-number|edition|issue|number|number-of-pages|number-of-volumes|volume)"' -r '$1' -o --no-filename | sort | uniq -c | sort -nr
3753 volume
2023 edition
1719 number
1250 issue
 599 collection-number
 357 number-of-pages
 167 number-of-volumes
  23 chapter-number

This is the breakdown for <number variable="x" />, with 3452 instances over 85% of styles:

⌁  g/c/styles ╍ (master) rg '<number.+variable="(chapter-number|collection-number|edition|issue|number|number-of-pages|number-of-volumes|volume)"' -r '$1' -o --no-filename | sort | uniq -c | sort -nr
1815 edition
 968 volume
 411 number-of-volumes
 173 issue
  36 number
  22 collection-number
  18 number-of-pages
   9 chapter-number

How many authors using cs:text do you think actually wanted verbatim text? Should we just change the spec to match what people mean? I personally think it’s still valuable to be able to produce it. You could also change all usages if you wanted.

# like find and sed because I can't be bothered reading their manpages these days
$ brew install fd sd 
# replace all cs:text + number variables with cs:number
$ fd -e csl | xargs sd -i '<text(.+)variable="(chapter-number|collection-number|edition|issue|number|number-of-pages|number-of-volumes|volume)"' '<number${1}variable="$2"'
$ git diff
diff --git a/academy-of-management-review.csl b/academy-of-management-review.csl
index 3ed58effc..71a6359a8 100644
--- a/academy-of-management-review.csl
+++ b/academy-of-management-review.csl
@@ -126,7 +126,7 @@
         <group delimiter=", ">
           <group>
             <text variable="genre" text-case="capitalize-first"/>
-            <text variable="number" prefix=" no. "/>
+            <number variable="number" prefix=" no. "/>
           </group>
           <group delimiter=": ">
             <text variable="publisher-place"/>
@@ -217,7 +217,7 @@
         </group>
       </if>
       <else>
-        <text variable="edition"/>
+        <number variable="edition"/>
       </else>

(etc.)

I also don’t think we should just disable one of or some of the cs:number features (namely en-dashes) when using cs:text. That’s just too difficult to explain to people. And what about ampersands? What about commas, and various locales like fr-FR which use commas as decimal separators? @bwiernik can’t be the only person with a problem of this nature.

I’m happy to shift to passthrough on cs:text. I only posed the question because a change will affect the behavior of deployed clients, and I don’t want to spring any unwelcome surprises on people who also spend more time handling user queries than I do (@Sebastian_Karcher, @Rintze_Zelle, others).

(In the numeric method, citeproc-js passes through a comma immediately followed by a numeral without modification. I know that’s just one of many possible gotchas, and that failure to insert a space after the comma violates the CSL spec, but it’s current citeproc-js behavior.)

I remembered something about how citeproc-js arrived at applying a uniform method to numeric variables, regardless of whether rendered via cs:number or cs:text. The CSL spec requires that labels rendered with plural="contextual" adapt to their eponymous variable. By the spec, arbitration is based on “the variable content,” without reference to the element through which it is rendered.

Before I adopted the current approach, cs:text and cs:number used separate parsing schemes to identify plurality. With incremental changes on both sides in response to user feedback, the burden of addressing queries grew to a point where unified behavior made sense. That’s not an argument for keeping things as they are, but it’s how things ended up in their current state chez citeproc-js.

I’m open to changing this. WRT the actual styles, most styles test for numeric on edition to produce ordinals, and otherwise render it using text, so definitely expecting pass-through. The other variables that Cormac lists all should be fine with literal pass through.

This leaves the two issues identified here:

  1. Locators: this is a tricky one since we probably do want to allow them to be non-numeric but do want to treat them as numeric with things like en-dash substitution. I’m not keen on re-writing all existing styles with choose logic.
  2. Labels as Frank notes. I wonder whether it’d be possible to keep the labelling behavior while changing the pass-through behavior?

Sniffing plural state with the numeric parsing logic while rendering verbatim is certainly an option, and exposes a couple of underlying issues around ambiguity in the intended meaning of hyphens and commas in the input.

Regarding commas, the spec states:

If a number variable is rendered with cs:number and only contains numeric content (as determined by the rules for is-numeric … the number(s) are extracted.
[…]
Numbers separated by a comma receive one space after the comma (“2,3” and “2 , 3” become “2, 3”), while numbers separated by an ampersand receive one space before and one after the ampersand (“2&3” becomes “2 & 3”).

As @cormacrelf notes, some languages use a comma as a decimal identifier, but this rule would treat 3,1415 as two separate numbers, and therefore plural. The ambiguity could be resolved by removing the language about expanding comma to comma-space.
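To make the ambiguity concrete, here is a rough Python sketch, assuming a simplified reading of the quoted comma rule. The function names `expand_commas` and `looks_plural` are hypothetical and do not come from citeproc-js or any other processor.

```python
import re


def expand_commas(value: str) -> str:
    """Apply the quoted rule: exactly one space after a comma between numbers."""
    return re.sub(r"\s*,\s*", ", ", value)


def looks_plural(value: str) -> bool:
    """Naive plural test (an assumption): any comma, ampersand or hyphen
    is taken to mean that the field holds more than one number."""
    return re.search(r"[,&-]", value) is not None


print(expand_commas("2,3"))     # 2, 3
print(expand_commas("2 , 3"))   # 2, 3
print(expand_commas("3,1415"))  # 3, 1415  (a decimal in fr-FR, now split in two)
print(looks_plural("3,1415"))   # True
```

Under this reading a decimal such as 3,1415 is both reformatted and counted as plural, which is the problem described above.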

Hyphens are a harder problem. A value of 10-12 could mean “number 10-12” or “numbers 10 through 12.” The former may be more common, but the number variable is used in many different contexts, and without discriminating markup in the input, styles can’t be guaranteed to guess correctly in all cases. Not sure what to do there, but if a mostly-right solution is the best among available options, it would be good to fix that in the specification.

(Edit: In exceptional cases where a range is intended outside of locator, maybe adopting a convention of double-dash [--] would suffice?)

On the other hand, a plain hyphen can be the leaf-number to a section or page in the page or locator field, in which case some markup is inevitably needed. There might be a case for keeping that markup consistent across fields, to avoid confusion.

but the number variable is used in many different contexts

These are the uses of number in Zotero (and Juris-M):
Report — Report Number

Patent — Patent Number
Bill — Bill Number
Statute — Public Law Number
Case — Docket Number
Hearing — Docket Number
(Gazette — Public Law Number)
(Regulation — Public Law Number)
(Standard — Number)

Podcast — Episode Number
Radio Broadcast — Episode Number
TV Broadcast — Episode Number

Mendeley also has:
Generic — Number
Working Paper — Number

(These should all probably be mapped to version instead)
Book — Revision Number
Book Section — Revision Number
Case — Revision Number
Computer Program — Version
Conference Proceedings — Revision Number
Encyclopedia Article — Revision Number
Hearing — Revision Number
Journal Article — Revision Number
Magazine Article — Revision Number
Newspaper Article — Revision Number
Motion Picture — Revision Number
Television Broadcast — Revision Number
Thesis — Revision Number
Webpage — Revision Number

Paperpile also has:
Conference Paper — Number
Preprint Manuscript — Number
Manual — Number
Personal Communication — Number
Unpublished — Number
Miscellaneous — Number
Figure — Number
Audio — Number
Artwork — Number
Musical Score — Number
Grant — Number
Standard — Number
Map — Number
Letter — Number
Interview — Number

Of these, there seems to me to be a limited number of cases where number might conceivably most commonly be a range (broadcasts, but even there 20-1 could mean “Season 20, Episode 1”, though my understanding is that 20.1 is more standard notation). The others would primarily be using hyphens as part of the number name, not a range (unless I am mistaken about the legal items).

Given that, I think that applying the rule that hyphens are converted to en-dash for locator, but not for number by default would be appropriate. Adopting \- as the standard to suppress conversion of hyphens and -- to force conversion to en-dash seems appropriate.
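A minimal sketch of that proposal, assuming that only number keeps plain hyphens by default and that a single hyphen counts as a range marker only when it sits directly between digits. The name `render_hyphens` and the `NO_RANGE_VARIABLES` set are hypothetical, for illustration only.

```python
import re

EN_DASH = "\u2013"
# Assumption: variables whose single hyphens are left alone by default.
NO_RANGE_VARIABLES = {"number"}


def render_hyphens(value: str, variable: str) -> str:
    placeholder = "\x00"
    text = value.replace("\\-", placeholder)  # protect escaped hyphens
    text = text.replace("--", EN_DASH)         # "--" always forces an en-dash
    if variable not in NO_RANGE_VARIABLES:
        # a single hyphen directly between two digits is read as a range
        text = re.sub(r"(?<=\d)-(?=\d)", EN_DASH, text)
    return text.replace(placeholder, "-")


print(render_hyphens("10-12", "locator"))    # 10–12
print(render_hyphens("10-12", "number"))     # 10-12
print(render_hyphens("10--12", "number"))    # 10–12
print(render_hyphens("10\\-12", "locator"))  # 10-12
```

The override markup behaves the same way for every variable; only the default for a bare hyphen differs.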

Besides locator and number, there are other variables that would need to be treated in one of these two ways. I think archive_location, call-number, and edition should be treated like number, without conversion. chapter-number, collection-number, volume, and issue could go either way?

That sounds right. Having a consistent override markup across fields will simplify documentation and support, even if default parsing behavior varies. With teaching term on the horizon and a couple of articles wanting revision, I’ll need to hold off work on this in citeproc-js until the summer. Meanwhile, a few things will be good to have before touching the code:

  • A specification of the parsing/output rules for each category of field in cs:text and cs:number;
  • Category lists covering all numeric fields;
  • Test fixtures that exercise and illustrate the specified behavior.

The third item is particularly important. There are a lot of parameters involved in formatting and pluralization, and burning behavior into test fixtures will save a lot of back-and-forth discussion over requirements. The watch mode of the new citeproc-js test runner makes it easy to build a category of tests for a particular style under an arbitrary name. If you name them something like x-apa-numeric-dev, that will give us a solid base for discussion and coding.

Circling back to this.

How about these rules:

  1. Comma-parsing behavior is locale dependent. If the bibliography is rendered in a locale that uses a comma as a decimal point, then don’t do the comma-space substitution. Otherwise, do the substitution.
  2. Don’t do hyphen substitution on number or version. Do on other numeric variables, even if rendered with cs:text.
  3. Specify \- as an override to prevent hyphen substitution, -- to always substitute to en dash, and --- to always substitute em dash.
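
Rules 1 and 3 could be sketched roughly as follows. The locale set is a small hand-picked sample (an assumption; a real processor would presumably read this from locale data), and both function names are hypothetical.

```python
import re

EN_DASH = "\u2013"
EM_DASH = "\u2014"
# Assumption: a sample of locales that write decimals with a comma.
DECIMAL_COMMA_LOCALES = {"fr-FR", "de-DE", "es-ES"}


def space_commas(value: str, locale: str) -> str:
    """Rule 1: skip the comma-space substitution in decimal-comma locales."""
    if locale in DECIMAL_COMMA_LOCALES:
        return value
    return re.sub(r"\s*,\s*", ", ", value)


def apply_dash_overrides(value: str) -> str:
    """Rule 3: "\\-" stays a hyphen, "--" is an en dash, "---" an em dash."""
    text = value.replace("\\-", "\x00")  # protect escaped hyphens
    text = text.replace("---", EM_DASH)   # longest marker first
    text = text.replace("--", EN_DASH)
    return text.replace("\x00", "-")


print(space_commas("2,3", "en-US"))     # 2, 3
print(space_commas("3,1415", "fr-FR"))  # 3,1415
print(apply_dash_overrides("1--10"))    # 1–10
print(apply_dash_overrides("1\\-10"))   # 1-10
```

The three-hyphen marker has to be replaced before the two-hyphen one, otherwise `---` would come out as an en dash followed by a hyphen.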

If we agree on this, I can write Frank some tests and propose spec language.

Can I revive this to see if it is agreed, because at the moment this is something of a hot mess in the tests, and the spec.

I won’t go into the gory details, but it would be good to confirm the correct intended behaviour:

  1. Single hyphens between numeric content, spaced or unspaced, become unspaced en-dashes, except when rendering in the context of number or version:
  • 1-10 -> 1–10
  • 1 - 10 -> 1–10
  • A-11 -> A-11 (no change)
  2. Double hyphens, even if they are not between numeric material, become unspaced en-dashes (even in the context of number and version). (Reasonable because this is not a typographically acceptable mark, in its own right, ever.)
  • 1--10 -> 1–10
  • A--11 -> A–11
  3. Escaped hyphens always remain as hyphens. One might also say (should one?) that any numerals before them are treated as prefixes, not as numbers:
  • 1\-10 -> 1-10
  • 1\-10-1\-11 -> 1-10–11 (if we are doing page-range adjustment)?
  4. Commas are going to pick up something from locale, but it’s not clear to me how!

I think the first rules could probably be implemented without breaking anything on 1.0.

For hyphens, I think those are close. I propose these:

  1. Single hyphens between numeric content, spaced or unspaced, become the page-range-delimiter term, except when rendering in the context of number or version.
    1. Per discussion here: https://github.com/citation-style-language/schema/issues/122#issuecomment-654694437
    2. The page-range-delimiter term might best be renamed number-range-delimiter.
  2. Double hyphens, even if they are not between numeric material, become unspaced en-dashes (even in the context of number and version). (Reasonable because this is not a typographically acceptable mark, in its own right, ever.)
  3. Escaped hyphens (i.e., \-) always remain as hyphens.

One might also say (should one?) that any numerals before them are treated as prefixes not as numbers.

So, you are saying, if string contains \-, then it should always test as is-numeric="false"?

We may as well support the TeX convention of three-dashes to em-dash; right?

If we’re doing the two-dash.

I’m not sure, but I think citeproc-js and pandoc even do this already. (--- to em-dash)

So, you are saying, if string contains \-, then it should always test as is-numeric="false"?

No.

Suppose one has A11b. That gets treated as a number with three parts: a prefix (A) a number part (11) and a suffix (b). That in turn affects how a page-range would be applied. If one had A11b-A18c that would range, on a minimal scheme to A11b-18c: in other words one deletes any common prefix and ranges the numbers.

But normally it’s hard to range purely numeric material. 11-20 - 11-28 is ambiguous. Is the 11 a prefix, or do we have two ranges themselves separated by a hyphen, or …

One could do all sorts of things to try to work this out, but in the end there’s tough ambiguity. However it would be reasonable to say that one should treat as a prefix anything that contains any non-numeral and non-range-marking text, and is then followed by a numeral (without whitespace). So if one knew that \- was not a range-marker, one could parse 11\-20-11\-28 unambiguously as:

  • Prefix: 11-
  • Numeral: 20
  • Range-marker
  • Prefix: 11-
  • Numeral: 28

Which would in turn allow one to range this to 11-20–28.
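
A rough sketch of that parse, assuming exactly one unescaped hyphen and no suffixes. The helpers `split_number` and `range_with_shared_prefix` are hypothetical names, and the sketch only drops a shared prefix; it does not collapse digits.

```python
import re

EN_DASH = "\u2013"


def split_number(part: str) -> tuple[str, str]:
    """Split one side of a range into (prefix, numeral).

    An escaped hyphen becomes a literal hyphen inside the prefix.
    """
    literal = part.replace("\\-", "-")
    match = re.fullmatch(r"(.*?)(\d+)", literal)
    if match is None:
        raise ValueError(f"no trailing numeral in {part!r}")
    return match.group(1), match.group(2)


def range_with_shared_prefix(value: str) -> str:
    # Only a hyphen that is NOT preceded by a backslash is a range marker.
    sides = re.split(r"(?<!\\)-", value)
    if len(sides) != 2:
        raise ValueError("expected exactly one range marker")
    (p1, n1), (p2, n2) = (split_number(s.strip()) for s in sides)
    second = n2 if p1 == p2 else p2 + n2  # drop the prefix only when shared
    return f"{p1}{n1}{EN_DASH}{second}"


print(split_number(r"11\-20"))                     # ('11-', '20')
print(range_with_shared_prefix(r"11\-20-11\-28"))  # 11-20–28
```

Without the escape, the same input would contain three candidate range markers, and the split could not be made without guessing.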

Does that make sense?

(Of course, one could have other conventions e.g. that if -- is used as a range-marker, -, even if unescaped, is interpreted as a prefix-character. But I wouldn’t love that because it requires backtracking and would be hard to specify: for instance, would it then follow for the whole string that - is an “ordinary character” not a range-marker? How would one interpret 11-20--11-28, 11-18? As [11-|20|][--][11-|28|][,][ ][|11|][--][|18|] or [11-|20|][--][11-|28|][,][ ][11-|18|]?)

Unescaped hyphen has to be treated as a range delimiter. That is how most page ranges are stored.

I see what you mean now. Can you suggest a concise documentation sentence for the prefix treatment to include in the spec?

I guess the only reason not to is if it would never be used in this context.

This turns out (as usual!) to be surprisingly complex, and it’s hard to be concise without being cryptic. It’s something like this:

What is a “number”

CSL regards a “number” as any set of numerals with or without a prefix or a suffix. A prefix or suffix is any set of characters not including whitespace, an ampersand (&), a comma (,) or any of the characters that can be used to mark a range (- and --), and not purely consisting of numbers. It may include an “escaped hyphen” (\-), which will be converted on output to a hyphen.

So:

  • 123 is a number (123), without any prefix or suffix
  • A123 is a number (123), with a prefix (A)
  • 123B is a number (123), with a suffix (B)
  • A\-123B is a number (123), with a prefix (A-) and a suffix (B).
  • 1-2 is two numbers (1, and 2) separated by a range marker (-).
  • 1\-2 is one number (2) with a prefix (1-): see further below for numerals in prefixes and suffixes.

Numerals can appear in a prefix so long as they precede some non-digit:

  • 123A456 is a number (456) with a prefix (123A)
  • 123\-456 is a number (456) with a prefix (123-)

Numerals can appear in a suffix so long as they are preceded by some non-digit:

  • A123b23 is a number (123) with a prefix (A) and a suffix (b23).

In some cases that makes a number ambiguous. Take 123A456. Is that a prefix (123A) followed by a number (456), or a number (123) followed by a suffix (A456)? That ambiguity is resolved by favouring prefixes over suffixes, so in that case the number would be interpreted as a prefix (123A) followed by a number. The same can apply where there are both prefixes and suffixes, but here a suffix is preferred over none if the string allows it, and the prefix is made as long as it can be consistently with having at least some suffix. Take A123B456C789, which could be interpreted in several different ways:

Prefix: A; Number: 123; Suffix B456C789
Prefix: A123B; Number: 456; Suffix C789 [preferred interpretation]
Prefix: A123B456C; Number 789; No suffix

Here the middle interpretation would be chosen, giving the longest prefix that is consistent with there being some suffix.
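
The preference rules above can be sketched as a ranking over the digit runs in a token. This is one possible reading, not a settled algorithm; `parse_number` is a hypothetical name, and the token is assumed to be a single candidate number with separators already split off.

```python
import re
from typing import Optional

ESCAPED = "\x00"  # stand-in for an escaped hyphen while parsing


def parse_number(token: str) -> Optional[tuple[str, str, str]]:
    """Return (prefix, number, suffix), or None if the token is not one number."""
    work = token.replace("\\-", ESCAPED)
    if not work or re.search(r"[\s,&-]", work):
        return None  # separators and range markers cannot sit inside a number
    runs = [(m.start(), m.end()) for m in re.finditer(r"\d+", work)]
    if not runs:
        return None

    def rank(run):
        start, end = run
        has_prefix, has_suffix = start > 0, end < len(work)
        # Prefix plus suffix beats prefix only, which beats suffix only;
        # among equals, the longest prefix (latest start) wins.
        return (has_prefix and has_suffix, has_prefix, start)

    start, end = max(runs, key=rank)

    def restore(text: str) -> str:
        return text.replace(ESCAPED, "-")

    return restore(work[:start]), work[start:end], restore(work[end:])


print(parse_number("123A456"))       # ('123A', '456', '')
print(parse_number("A123b23"))       # ('A', '123', 'b23')
print(parse_number("A123B456C789"))  # ('A123B', '456', 'C789')
print(parse_number("1\\-2"))         # ('1-', '2', '')
```

Each maximal run of digits is a candidate for the number part, so the choice reduces to picking one run.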

What is “numeric” input

An entire input string is treated as numeric if it consists only of numbers, in the sense described above, whitespace, and commas, range marks, and ampersands. So A123B, 23 is numeric. So is 13th because it can be interpreted as a single number (13 with suffix th). But 13th edition is not numeric, because edition is not capable of being understood as a number.
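
As a loose sketch of that test, assuming a token counts as a number whenever it contains at least one digit (which is what the definition above amounts to). `is_numeric` here is a hypothetical helper, not the citeproc-js implementation, and it does not check that separators are well placed.

```python
import re


def is_numeric(value: str) -> bool:
    # Escaped hyphens belong to a prefix or suffix, so hide them before splitting.
    work = value.replace("\\-", "\x00")
    # Split on whitespace, commas, ampersands and unescaped hyphens.
    tokens = [t for t in re.split(r"[\s,&-]+", work) if t]
    # Every remaining token must contain a digit to count as a number.
    return bool(tokens) and all(re.search(r"\d", t) for t in tokens)


print(is_numeric("A123B, 23"))     # True
print(is_numeric("13th"))          # True
print(is_numeric("13th edition"))  # False
```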

Basic normalization

Basic normalization applies to most numeric strings. The numbers are extracted and: range-markers are replaced by the localised page-range-delimiter (e.g. –), commas receive one space after the comma, and ampersands are spaced. So 12,13-14&17 becomes “12, 13–14 & 17”. This is applied to most numeric material, but not to number or version variables, which are simply printed as they have been entered.
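
A minimal sketch of basic normalization, assuming the input has already passed the numeric test and that the localised delimiter is an en dash. The function name `normalize` and the `VERBATIM_VARIABLES` set are hypothetical.

```python
import re

EN_DASH = "\u2013"  # assumption: the locale's page-range-delimiter
VERBATIM_VARIABLES = {"number", "version"}


def normalize(value: str, variable: str, delimiter: str = EN_DASH) -> str:
    if variable in VERBATIM_VARIABLES:
        return value  # printed exactly as entered
    text = re.sub(r"\s*,\s*", ", ", value)             # one space after commas
    text = re.sub(r"\s*&\s*", " & ", text)             # spaced ampersands
    text = re.sub(r"\s*(?<!\\)-\s*", delimiter, text)  # unescaped range markers
    return text.replace("\\-", "-")                    # escaped hyphens survive


print(normalize("12,13-14&17", "page"))    # 12, 13–14 & 17
print(normalize("12,13-14&17", "number"))  # 12,13-14&17
print(normalize("1 - 10", "volume"))       # 1–10
```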

Page-range adjustment

For locator and page fields, additional reformatting is carried out to ranges, depending on the setting of page-range-format on cs:style:

  • Minimal. Ranges are collapsed as far as possible, so 101–103 becomes 101–3. In addition, identical prefixes are removed: A101–A103 becomes A101–3 (suffixes are always preserved).
  • Minimal-two. Ranges are collapsed but preserving two digits, so 101–103 becomes 101–03. Identical prefixes are removed: A101-A121 becomes A101-21 (suffixes are always preserved).
  • Expanded. Ranges are expanded, so 101-3 becomes 101-103. Identical prefixes are preserved. A101-A103 remains as it is. [Question: should one really expand here by adding prefixes: I think not, because their absence can be presumed to be deliberate. The user is always in control of the data if that’s wrong.]
  • Chicago. [Set out the rather complicated Chicago rules. It’s not clear to me how Chicago wants prefixes dealt with, but I think identical prefixes get removed. Would someone who knows about Chicago care to comment?]
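
The first three formats can be sketched as below, following the examples in this list rather than any existing implementation. Chicago is left out, suffixes are not handled, and `format_range` is a hypothetical helper that takes the two ends of a range already split apart.

```python
import re

EN_DASH = "\u2013"


def _split(part: str) -> tuple[str, str]:
    """Split 'A101' into ('A', '101'); prefixes here contain no digits."""
    match = re.fullmatch(r"(\D*)(\d+)", part)
    if match is None:
        raise ValueError(f"cannot parse {part!r}")
    return match.group(1), match.group(2)


def format_range(first: str, last: str, fmt: str) -> str:
    p1, n1 = _split(first)
    p2, n2 = _split(last)
    if len(n2) < len(n1):
        # an abbreviated end such as "3" borrows the leading digits of the start
        n2 = n1[: len(n1) - len(n2)] + n2
    if fmt == "expanded":
        return f"{p1}{n1}{EN_DASH}{p2}{n2}"  # prefixes are kept as entered
    keep = 1 if fmt == "minimal" else 2      # "minimal-two" keeps two digits
    if p1 == p2 and len(n1) == len(n2):
        shared = 0
        while shared < len(n2) - keep and n1[shared] == n2[shared]:
            shared += 1
        return f"{p1}{n1}{EN_DASH}{n2[shared:]}"  # shared prefix dropped
    return f"{p1}{n1}{EN_DASH}{p2}{n2}"


print(format_range("101", "103", "minimal"))        # 101–3
print(format_range("A101", "A121", "minimal-two"))  # A101–21
print(format_range("101", "3", "expanded"))         # 101–103
```

In the expanded case this sketch does not add a missing prefix to the second number, in line with the bracketed question above.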

Notes

Obviously the rules about commas are troublesome if commas are being used as a decimal marker: that needs to be dealt with in the locale file. Why not semicolons as well?

Allowing any numerals in prefixes and suffixes causes inevitable ambiguity. The rules given above seem reasonable to me, but there is no perfect solution.

I haven’t treated an en-dash as a range marker, because it really isn’t. It seems to me changing --- to an em-dash is desirable, but a different point, because one would want to do that on all fields, not just numeric ones.

See the thread “Is-numeric behavior, new is-numberlike test?” for a proposal on simplifying the is-numeric test behavior and making it more coherent.