En dash substitution in `text`

Currently, citeproc-js substitutes an en dash for hyphens for text that appears to be a number range, whether that text is called using CSL number or CSL text. This seems suboptimal. For example, when I try to cite this report from the APA manual:

McDaniel, J. E., & Miskel, C. G. (2002). The effect of groups and individuals on national decisionmaking: Influence and domination in the reading policymaking environment (CIERA Report No. 3-025). University of Michigan, Center for Improvement of Early Reading Achievement. Retrieved from http://www.ciera.org/library/reports/inquiry-3/3-025/3-025.pdf

the hyphen in the report number (3-025) is replaced with an en dash:

McDaniel, J. E., & Miskel, C. G. (2002). The effect of groups and individuals on national decisionmaking: Influence and domination in the reading policymaking environment (CIERA Report No. 3–025). University of Michigan, Center for Improvement of Early Reading Achievement. Retrieved from http://www.ciera.org/library/reports/inquiry-3/3-025/3-025.pdf

This number isn’t a range, and the en dash substitution may make retrieval more difficult. I suggest that the number range delimiter substitution should only occur for CSL number and not CSL text.

I think that is the correct interpretation of the spec, specifically:

If a number variable is rendered with cs:number and only contains numeric content (as determined by the rules for is-numeric, see Choose), the number(s) are extracted.

Implemented for citeproc-rs in this commit.

The only difficulty is with locators, which are technically not number variables in CSL 1.0.1, but are in CSL-M. They have to be special cased in 1.0.1 because of this:

The “locator” variable is always rendered with an en-dash replacing any hyphens. For the “page” variable, this replacement is only performed if the page-range-format attribute is set on cs:style

Hence this.

Yes, I think there are two things that I need to work out in citeproc-js for this. Currently, numeric variables (including locator, as @cormacrelf notes) are forced to use cs:number in the CSL-M schema. Under the hood, though, the numeric rendering code is applied even if cs:text is used (in vanilla CSL) to render a “numeric” variable (i.e. numerics + locator). That makes sense for locator and (in CSL-M styles) section, but as @bwiernik and @cormacrelf say, it both diverges from the spec, and creates practical problems.

I’m open to suggestions about what to do in the citeproc-js implementation. To more closely approach the CSL 1.0.1 spec, am I right in thinking that numeric variables (other than the locator special case) should just be passed through verbatim if rendered with cs:text? I can do that, but there is a possibility that CSL maintainers will start seeing forum complaints about field content that is no longer normalized.
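
To make the pass-through option concrete, here is a minimal Python sketch. It is not citeproc-js code: the function name `render` and its `element`/`variable` arguments are made up for illustration, and it ignores the is-numeric check and the page-range-format handling for `page`.

```python
import re

EN_DASH = "\u2013"

# A hyphen sitting between two digits, with optional spaces around it.
RANGE_HYPHEN = re.compile(r"(?<=\d)\s*-\s*(?=\d)")


def render(value: str, element: str, variable: str) -> str:
    """Render a variable via cs:text ("text") or cs:number ("number").

    Hypothetical helper: the names and arguments are assumptions for this sketch.
    """
    # CSL 1.0.1 special case: locator hyphens always become en dashes.
    if variable == "locator" or element == "number":
        return RANGE_HYPHEN.sub(EN_DASH, value)
    # Any other variable rendered with cs:text is passed through verbatim.
    return value
```

Under this sketch, `render("3-025", "text", "number")` keeps the hyphen, while the same value rendered with `element="number"` gets the en dash.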

A less aggressive alternative (from the standpoint of status quo in the hands of users) would be to continue rendering using the existing internal (numeric) method on these fields, but to forego transformation of hyphen to en-dash if the method is called from cs:text. This latter change would be more work to implement, but probably carries less risk of support hassle.

FWIW, cs:text is used with number variables 9891 times in 99% of styles in the main styles repo. Per variable, that’s:

⌁  g/c/styles ╍ (master) rg '<text.+variable="(chapter-number|collection-number|edition|issue|number|number-of-pages|number-of-volumes|volume)"' -r '$1' -o --no-filename | sort | uniq -c | sort -nr
3753 volume
2023 edition
1719 number
1250 issue
 599 collection-number
 357 number-of-pages
 167 number-of-volumes
  23 chapter-number

This is the breakdown for <number variable="x" />, with 3452 instances over 85% of styles:

⌁  g/c/styles ╍ (master) rg '<number.+variable="(chapter-number|collection-number|edition|issue|number|number-of-pages|number-of-volumes|volume)"' -r '$1' -o --no-filename | sort | uniq -c | sort -nr
1815 edition
 968 volume
 411 number-of-volumes
 173 issue
  36 number
  22 collection-number
  18 number-of-pages
   9 chapter-number

How many authors using cs:text do you think actually wanted verbatim text? Should we just change the spec to match what people mean? I personally think it’s still valuable to be able to produce it. You could also change all usages if you wanted.

# like find and sed because I can't be bothered reading their manpages these days
$ brew install fd sd 
# replace all cs:text + number variables with cs:number
$ fd -e csl | xargs sd -i '<text(.+)variable="(chapter-number|collection-number|edition|issue|number|number-of-pages|number-of-volumes|volume)"' '<number${1}variable="$2"'
$ git diff
diff --git a/academy-of-management-review.csl b/academy-of-management-review.csl
index 3ed58effc..71a6359a8 100644
--- a/academy-of-management-review.csl
+++ b/academy-of-management-review.csl
@@ -126,7 +126,7 @@
         <group delimiter=", ">
           <group>
             <text variable="genre" text-case="capitalize-first"/>
-            <text variable="number" prefix=" no. "/>
+            <number variable="number" prefix=" no. "/>
           </group>
           <group delimiter=": ">
             <text variable="publisher-place"/>
@@ -217,7 +217,7 @@
         </group>
       </if>
       <else>
-        <text variable="edition"/>
+        <number variable="edition"/>
       </else>

(etc.)

I also don’t think we should just disable one of or some of the cs:number features (namely en-dashes) when using cs:text. That’s just too difficult to explain to people. And what about ampersands? What about commas, and various locales like fr-FR which use commas as decimal separators? @bwiernik can’t be the only person with a problem of this nature.

I’m happy to shift to passthrough on cs:text. I only posed the question because a change will affect the behavior of deployed clients, and I don’t want to spring any unwelcome surprises on people who also spend more time handling user queries than I do (@Sebastian_Karcher, @Rintze_Zelle, others).

(In the numeric method, citeproc-js passes through a comma immediately followed by a numeral without modification. I know that’s just one of many possible gotchas, and that failure to insert a space after the comma violates the CSL spec, but it’s current citeproc-js behavior.)

I remembered something about how citeproc-js arrived at applying a uniform method to numeric variables, regardless of whether rendered via cs:number or cs:text. The CSL spec requires that labels rendered with plural="contextual" adapt to their eponymous variable. By the spec, arbitration is based on “the variable content,” without reference to the element through which it is rendered.

Before I adopted the current approach, cs:text and cs:number used separate parsing schemes to identify plurality. With incremental changes on both sides in response to user feedback, the burden of addressing queries grew to a point where unified behavior made sense. That’s not an argument for keeping things as they are, but it’s how things ended up in their current state chez citeproc-js.

I’m open to changing this. WRT the actual styles, most styles test for numeric on edition to produce ordinals, and otherwise render it using text, so definitely expecting pass-through. The other variables that Cormac lists all should be fine with literal pass through.

This leaves the two issues identified here:

  1. Locators: this is a tricky one since we probably do want to allow them to be non-numeric but do want to treat them as numeric with things like en-dash substitution. I’m not keen on re-writing all existing styles with choose logic.
  2. Labels as Frank notes. I wonder whether it’d be possible to keep the labelling behavior while changing the pass-through behavior?

Sniffing plural state with the numeric parsing logic while rendering verbatim is certainly an option, and exposes a couple of underlying issues around ambiguity in the intended meaning of hyphens and commas in the input.
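
A rough Python sketch of that idea, assuming a deliberately crude plural test (two numerals joined by a hyphen, en dash, comma, or ampersand). The names `is_plural` and `render_with_label` are invented for the example and do not correspond to anything in citeproc-js.

```python
import re

# Two numerals joined by a hyphen, en dash, comma, or ampersand hint at plural content.
PLURAL_HINT = re.compile(r"\d\s*(?:-|\u2013|,|&)\s*\d")


def is_plural(value: str) -> bool:
    """Guess plurality from the content alone, as plural="contextual" requires."""
    return PLURAL_HINT.search(value) is not None


def render_with_label(value: str, singular: str, plural: str) -> str:
    """Choose the label from the sniffed plural state, but emit the value verbatim."""
    label = plural if is_plural(value) else singular
    return f"{label} {value}"
```

The ambiguity shows up immediately: `render_with_label("3-025", "no.", "nos.")` yields `nos. 3-025`, because the sniffing step reads the report number as a range even though the value itself is left untouched.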

Regarding commas, the spec states:

If a number variable is rendered with cs:number and only contains numeric content (as determined by the rules for is-numeric …), the number(s) are extracted.
[…]
Numbers separated by a comma receive one space after the comma (“2,3” and “2 , 3” become “2, 3”), while numbers separated by an ampersand receive one space before and one after the ampersand (“2&3” becomes “2 & 3”).

As @cormacrelf notes, some languages use a comma as a decimal identifier, but this rule would treat 3,1415 as two separate numbers, and therefore plural. The ambiguity could be resolved by removing the language about expanding comma to comma-space.
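
The effect of the quoted rule on a decimal comma can be sketched in a few lines of Python. The function `normalize_separators` is illustrative only, and the `decimal_comma` flag is a hypothetical switch that appears nowhere in the spec; it is included to show what skipping the comma expansion would look like.

```python
import re


def normalize_separators(value: str, decimal_comma: bool = False) -> str:
    """Apply the spec's spacing rules for ampersands and commas between numbers.

    `decimal_comma` is an assumption of this sketch, not a CSL option.
    """
    # "2&3" becomes "2 & 3"
    value = re.sub(r"(?<=\d)\s*&\s*(?=\d)", " & ", value)
    if not decimal_comma:
        # "2,3" and "2 , 3" become "2, 3"
        value = re.sub(r"(?<=\d)\s*,\s*(?=\d)", ", ", value)
    return value
```

With the default, `normalize_separators("3,1415")` gives `3, 1415`, i.e. two numbers; with `decimal_comma=True` the value comes back unchanged.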

Hyphens are a harder problem. A value of 10-12 could mean “number 10-12” or “numbers 10 through 12.” The former may be more common, but the number variable is used in many different contexts, and without discriminating markup in the input, styles can’t be guaranteed to guess correctly in all cases. Not sure what to do there, but if a mostly-right solution is the best among available options, it would be good to fix that in the specification.

(Edit: In exceptional cases where a range is intended outside of locator, maybe adopting a convention of double-dash [--] would suffice?)

On the other hand, a plain hyphen can be the leaf-number to a section or page in the page or locator field, in which case some markup is inevitably needed. There might be a case for keeping that markup consistent across fields, to avoid confusion.

but the number variable is used in many different contexts

These are the uses of number in Zotero (and Juris-M):
Report — Report Number

Patent — Patent Number
Bill — Bill Number
Statute — Public Law Number
Case — Docket Number
Hearing — Docket Number
(Gazette — Public Law Number)
(Regulation — Public Law Number)
(Standard — Number)

Podcast — Episode Number
Radio Broadcast — Episode Number
TV Broadcast — Episode Number

Mendeley also has:
Generic — Number
Working Paper — Number

(These should all probably be mapped to version instead)
Book — Revision Number
Book Section — Revision Number
Case — Revision Number
Computer Program — Version
Conference Proceedings — Revision Number
Encyclopedia Article — Revision Number
Hearing — Revision Number
Journal Article — Revision Number
Magazine Article — Revision Number
Newspaper Article — Revision Number
Motion Picture — Revision Number
Television Broadcast — Revision Number
Thesis — Revision Number
Webpage — Revision Number

Paperpile also has:
Conference Paper — Number
Preprint Manuscript — Number
Manual — Number
Personal Communication — Number
Unpublished — Number
Miscellaneous — Number
Figure — Number
Audio — Number
Artwork — Number
Musical Score — Number
Grant — Number
Standard — Number
Map — Number
Letter — Number
Interview — Number

Of these, there seems to me to be a limited number of cases where number might conceivably most often be a range (broadcasts, but even there 20-1 could mean “Season 20, Episode 1”, though my understanding is that 20.1 is more standard notation). The others would primarily be using hyphens as part of the number name, not a range (unless I am mistaken about the legal items).

Given that, I think that applying the rule that hyphens are converted to en-dash for locator, but not for number by default would be appropriate. Adopting \- as the standard to suppress conversion of hyphens and -- to force conversion to en-dash seems appropriate.

Besides locator and number, there are other variables that would need to be treated in one of these two ways. I think archive_location, call-number, and edition should be treated like number, without conversion. chapter-number, collection-number, volume, and issue could go either way?

That sounds right. Having a consistent override markup across fields will simplify documentation and support, even if default parsing behavior varies. With teaching term on the horizon and a couple of articles wanting revision, I’ll need to hold off work on this in citeproc-js until the summer. Meanwhile, a few things will be good to have before touching the code:

  • A specification of the parsing/output rules for each category of field in cs:text and cs:number;
  • Category lists covering all numeric fields;
  • Test fixtures that exercise and illustrate the specified behavior.

The third item is particularly important. There are a lot of parameters involved in formatting and pluralization, and burning behavior into test fixtures will save a lot of back-and-forth discussion over requirements. The watch mode of the new citeproc-js test runner makes it easy to build a category of tests for a particular style under an arbitrary name. If you name them something like x-apa-numeric-dev, that will give us a solid base for discussion and coding.

Circling back to this.

How about these rules:

  1. Comma-parsing behavior is locale dependent. If the bibliography is rendered in a locale that uses a comma as a decimal point, then don’t do the comma-space substitution. Otherwise, do the substitution.
  2. Don’t do hyphen substitution on number or version. Do on other numeric variables, even if rendered with cs:text.
  3. Specify \- as an override to prevent hyphen substitution, -- to always substitute to en dash, and --- to always substitute em dash.

If we agree on this, I can write Frank some tests and propose spec language.

Can I revive this to see if it is agreed, because at the moment this is something of a hot mess in the tests, and the spec.

I won’t go into the gory details, but it would be good to confirm the correct intended behaviour:

  1. Single hyphens between numeric content, spaced or unspaced, become unspaced en-dashes, except when rendering in the context of number or version:
    • 1-10 -> 1–10
    • 1 - 10 -> 1–10
    • A-11 -> A-11 (no change)
  2. Double hyphens, even if they are not between numeric material, become unspaced en-dashes (even in the context of number and version). (Reasonable because this is not a typographically acceptable mark, in its own right, ever.)
    • 1--10 -> 1–10
    • A--11 -> A–11
  3. Escaped hyphens always remain as hyphens. One might also say (should one?) that any numerals before them are treated as prefixes, not as numbers:
    • 1\-10 -> 1-10
    • 1\-10-1\-11 -> 1-10–11 (if we are doing page-range adjustment)?

  • Commas are going to pick up something from locale, but it’s not clear to me how!

I think the first rules could probably be implemented without breaking anything on 1.0.
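
For discussion purposes, the hyphen rules above could be sketched in Python roughly as follows. This is an assumption-laden illustration, not the citeproc-js implementation: `convert_hyphens` and `NO_RANGE_VARIABLES` are invented names, the three-hyphen case comes from the earlier proposal in this thread, and no page-range collapsing is attempted.

```python
import re

EN_DASH, EM_DASH = "\u2013", "\u2014"

# Variables where a single hyphen is left alone (per the proposal in this thread).
NO_RANGE_VARIABLES = {"number", "version"}

# Alternatives, in priority order: escaped hyphen, triple hyphen,
# double hyphen (optionally spaced), single hyphen between digits.
TOKEN = re.compile(r"\\-|---|\s*--(?!-)\s*|(?<=\d)\s*-\s*(?=\d)")


def convert_hyphens(value: str, variable: str) -> str:
    """Apply the proposed hyphen rules to one field value (hypothetical helper)."""

    def replace(match: re.Match) -> str:
        token = match.group(0).strip()
        if token == "\\-":
            return "-"       # escaped: always a plain hyphen
        if token == "---":
            return EM_DASH   # forced em dash
        if token == "--":
            return EN_DASH   # forced, unspaced en dash
        # Single hyphen between numerals: a range unless the variable opts out.
        if variable in NO_RANGE_VARIABLES:
            return match.group(0)
        return EN_DASH

    return TOKEN.sub(replace, value)
```

Without range collapsing, `1\-10-1\-11` comes out as `1-10–1-11`; getting to `1-10–11` would need the prefix handling discussed further down.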

For hyphens, I think those are close. I propose these:

  1. Single hyphens between numeric content, spaced or unspaced, become the page-range-delimiter term, except when rendering in the context of number or version.
    1. Per discussion here: Allow range formatting for all numbers (esp. dates) · Issue #122 · citation-style-language/schema · GitHub
    2. The page-range-delimiter term might best be renamed number-range-delimiter.
  2. Double hyphens, even if they are not between numeric material, become unspaced en-dashes (even in the context of number and version). (Reasonable because this is not a typographically acceptable mark, in its own right, ever.)
  3. Escaped hyphens (i.e., \-) always remain as hyphens.

One might also say (should one?) that any numerals before them are treated as prefixes not as numbers.

So, you are saying, if string contains \-, then it should always test as is-numeric="false"?

We may as well support the TeX convention of three-dashes to em-dash; right?

If we’re doing the two-dash.

I’m not sure, but I think citeproc-js and pandoc even do this already. (--- to em-dash)

So, you are saying, if string contains \-, then it should always test as is-numeric="false"?

No.

Suppose one has A11b. That gets treated as a number with three parts: a prefix (A) a number part (11) and a suffix (b). That in turn affects how a page-range would be applied. If one had A11b-A18c that would range, on a minimal scheme to A11b-18c: in other words one deletes any common prefix and ranges the numbers.

But normally it’s hard to range purely numeric material. 11-20 - 11-28 is ambiguous. Is the 11 a prefix, or do we have two ranges themselves separated by a hyphen, or …

One could do all sorts of things to try to work this out, but in the end there’s tough ambiguity. However it would be reasonable to say that one should treat as a prefix anything that contains any non-numeral and non-range-marking text, and is then followed by a numeral (without whitespace). So if one knew that \- was not a range-marker, one could parse 11\-20-11\-28 unambiguously as:

  • Prefix: 11-
  • Numeral: 20
  • Range-marker
  • Prefix: 11-
  • Numeral: 28

Which would in turn allow one to range this to 11-20–28.

Does that make sense?
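
A small Python sketch of the prefix idea, under simplifying assumptions: a range marker is a single unescaped hyphen or an en dash, only two-sided ranges are handled, `--` is not treated specially, and only a shared prefix is dropped (no trimming of common leading digits). `Part`, `parse_part`, and `collapse_range` are names invented for this example.

```python
import re
from typing import NamedTuple, Optional

EN_DASH = "\u2013"


class Part(NamedTuple):
    prefix: str
    number: str
    suffix: str


def parse_part(text: str) -> Optional[Part]:
    """Split one side of a range into prefix, numeral, and suffix."""
    unescaped = text.replace("\\-", "-")
    # Lazy prefix, then the last digit run, then any trailing non-digits.
    match = re.fullmatch(r"(.*?)(\d+)(\D*)", unescaped)
    return Part(*match.groups()) if match else None


def collapse_range(value: str) -> str:
    """Drop a shared prefix from the second side of a two-sided range."""
    # Split only on unescaped hyphens or en dashes.
    sides = re.split(r"(?<!\\)-|\u2013", value)
    parts = [parse_part(side.strip()) for side in sides]
    if len(parts) != 2 or None in parts:
        # Ambiguous or unparseable: just unescape and leave the rest alone.
        return value.replace("\\-", "-")
    first, second = parts
    if first.prefix == second.prefix:
        end = second.number + second.suffix
    else:
        end = "".join(second)
    return "".join(first) + EN_DASH + end
```

Here `11\-20-11\-28` parses as prefix `11-` plus numerals 20 and 28 and collapses to `11-20–28`, and `A11b-A18c` becomes `A11b–18c`. The unescaped `11-20 - 11-28` splits into four sides and is returned unchanged, which is the ambiguity described above.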

(Of course, one could have other conventions e.g. that if -- is used as a range-marker, -, even if unescaped, is interpreted as a prefix-character. But I wouldn’t love that because it requires backtracking and would be hard to specify: for instance, would it then follow for the whole string that - is an “ordinary character” not a range-marker? How would one interpret 11-20--11-28, 11-18? As [11-|20|][--][11-|28|][,][ ][|11|][--][|18|] or [11-|20|][--][11-|28|][,][ ][11-|18|]?)

Unescaped hyphen has to be treated as a range delimiter. That is how most page ranges are stored.

I see what you mean now. Can you suggest a concise documentation sentence for the prefix treatment to include in the spec?

I guess the only reason not to is if it would never be used in this context.