En dash substitution in `text`

This turns out (as usual!) to be surprisingly complex, and it’s hard to be concise without being cryptic. It’s something like this:

What is a “number”

CSL regards a “number” as any sequence of numerals with or without a prefix or a suffix. A prefix or suffix is any sequence of characters that does not include whitespace, an ampersand (&), a comma (,) or any of the characters that can be used to mark a range (-, and ), and that does not consist purely of numerals. It may include an “escaped hyphen” (\-), which will be converted on output to a hyphen.

So:

  • 123 is a number (123), without any prefix or suffix
  • A123 is a number (123), with a prefix (A)
  • 123B is a number (123), with a suffix (B)
  • A\-123B is a number (123), with a prefix (A-) and a suffix (B).
  • 1-2 is two numbers (1 and 2) separated by a range marker (-).
  • 1\-2 is one number (2) with a prefix (1-): see further below for numerals in prefixes and suffixes.

Numerals can appear in a prefix so long as they precede some non-digit:

  • 123A456 is a number (456) with a prefix (123A)
  • 123\-456 is a number (456) with a prefix (123-)

Numerals can appear in a suffix so long as they are preceded by some non-digit:

  • A123b23 is a number (123) with a prefix (A) and a suffix (b23).

In some cases that makes a number ambiguous. Take 123A456. Is that a prefix (123A) followed by a number (456), or a number (123) followed by a suffix (A456)? That ambiguity is resolved by favouring prefixes over suffixes, so here the string is interpreted as a prefix (123A) followed by a number (456). The same can happen where both a prefix and a suffix are possible. In that case having a suffix is preferred over having none, if the string allows it, and the prefix is then made as long as it can be while still leaving at least some suffix. Take A123B456C789, which could be interpreted in several different ways:

Prefix: A; Number: 123; Suffix: B456C789
Prefix: A123B; Number: 456; Suffix: C789 [preferred interpretation]
Prefix: A123B456C; Number: 789; No suffix

Here the middle interpretation would be chosen, giving the longest prefix that is consistent with there being some suffix.
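The disambiguation rules above can be sketched in a few lines of Python. This is only an illustration of one way to read them, not a reference implementation: the function name `parse_number` is hypothetical, only the ASCII hyphen is treated as a range marker, and a NUL character is used internally as a placeholder for the escaped hyphen.

```python
import re

def parse_number(token):
    """Split one token into (prefix, number, suffix); return None if it is not a number."""
    # Protect escaped hyphens so they are not mistaken for range markers.
    protected = token.replace("\\-", "\x00")
    # Assumption: whitespace, ampersand, comma and ASCII hyphen end a number.
    if any(ch.isspace() or ch in "&,-" for ch in protected):
        return None
    runs = list(re.finditer(r"[0-9]+", protected))
    if not runs:
        return None
    # Every maximal run of digits is a candidate for "the number".
    candidates = [
        (protected[:m.start()], m.group(), protected[m.end():]) for m in runs
    ]
    # Prefer readings that have both a prefix and a suffix ...
    both = [c for c in candidates if c[0] and c[2]]
    # ... and among the remaining readings take the longest prefix.
    prefix, number, suffix = max(both or candidates, key=lambda c: len(c[0]))
    return prefix.replace("\x00", "-"), number, suffix.replace("\x00", "-")

print(parse_number("A123B456C789"))
```

Under these assumptions the call above yields the middle interpretation, with prefix A123B, number 456 and suffix C789, while 123A456 falls back to the longest-prefix reading because no reading of it has both a prefix and a suffix.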

What is “numeric” input

An entire input string is treated as numeric if it consists only of numbers (in the sense described above), whitespace, commas, range markers and ampersands. So A123B, 23 is numeric. So is 13th, because it can be interpreted as a single number (13 with suffix th). But 13th edition is not numeric, because edition cannot be understood as a number.
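As a rough sketch of this test: on one reading of the rules, any token that contains at least one numeral and none of the separator characters can be interpreted as a number, so the check reduces to splitting on separators and looking for a digit in every piece. The name `is_numeric` is hypothetical, and only the unescaped ASCII hyphen is assumed to be a range marker.

```python
import re

# Separators: runs of whitespace, commas and ampersands, or a hyphen
# that is not escaped with a backslash (assumed to be the only range marker).
SEPARATORS = re.compile(r"[\s,&]+|(?<!\\)-")

def is_numeric(text):
    """Return True if every token between separators can be read as a number."""
    tokens = [t for t in SEPARATORS.split(text) if t]
    # A token qualifies as soon as it contains one ASCII digit.
    return bool(tokens) and all(re.search(r"[0-9]", t) for t in tokens)

for sample in ["A123B, 23", "13th", "13th edition"]:
    print(sample, is_numeric(sample))
```

With these assumptions the first two samples are numeric and the third is not, matching the examples in the text.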

Basic normalization

Basic normalization applies to most numeric strings. The numbers are extracted and: range markers are replaced by the localised page-range-delimiter (e.g. an en dash, –), commas receive one space after the comma, and ampersands are spaced on both sides. So 12,13-14&17 becomes “12, 13–14 & 17”. The exceptions are the number and version variables, which are simply printed as they have been entered.
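A minimal sketch of this normalization, assuming the caller has already established that the string is numeric and that the variable is not number or version. The function name `normalize` and its `range_delimiter` parameter are hypothetical, and only the unescaped ASCII hyphen is treated as a range marker.

```python
import re

def normalize(text, range_delimiter="\u2013"):
    """Apply basic spacing and range-delimiter normalization to a numeric string."""
    # One space after each comma, none before it.
    text = re.sub(r"\s*,\s*", ", ", text)
    # One space on each side of an ampersand.
    text = re.sub(r"\s*&\s*", " & ", text)
    # Unescaped hyphens become the localised range delimiter.
    text = re.sub(r"\s*(?<!\\)-\s*", range_delimiter, text)
    # Escaped hyphens are printed as plain hyphens.
    text = text.replace("\\-", "-")
    return text.strip()

print(normalize("12,13-14&17"))
```

The delimiter is passed in as a parameter so that a locale can supply its own value in place of the en dash.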

Page-range adjustment

For locator and page fields, ranges are reformatted further, depending on the setting of page-range-format on cs:style:

  • Minimal. Ranges are collapsed as far as possible, so 101–103 becomes 101–3. In addition, identical prefixes are removed: A101–A103 becomes A101–3 (suffixes are always preserved).
  • Minimal-two. Ranges are collapsed but preserving two digits, so 101–103 becomes 101–03. Identical prefixes are removed: A101–A121 becomes A101–21 (suffixes are always preserved).
  • Expanded. Ranges are expanded, so 101–3 becomes 101–103. Identical prefixes are preserved: A101–A103 remains as it is. [Question: should one really expand here by adding prefixes? I think not, because their absence can be presumed to be deliberate. The user is always in control of the data if that’s wrong.]
  • Chicago. [Set out the rather complicated Chicago rules. It’s not clear to me how Chicago wants prefixes dealt with, but I think identical prefixes get removed. Would someone who knows about Chicago care to comment?]
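The three formats that are fully described above could be sketched as follows. This assumes each end of the range has already been parsed into a (prefix, number, suffix) tuple, and that digits are collapsed only when both prefixes are identical, which the text does not spell out. The function name `format_range` is hypothetical, and the Chicago format is deliberately left unimplemented because its rules are still open.

```python
def format_range(first, last, fmt, delimiter="\u2013"):
    """Join two parsed numbers into a page range.

    `first` and `last` are (prefix, number, suffix) tuples of strings.
    """
    p1, n1, s1 = first
    p2, n2, s2 = last
    if fmt == "expanded":
        # Restore leading digits omitted from an abbreviated end number;
        # prefixes are left exactly as entered.
        if len(n2) < len(n1):
            n2 = n1[: len(n1) - len(n2)] + n2
    elif fmt in ("minimal", "minimal-two"):
        keep = 1 if fmt == "minimal" else 2
        # Assumption: collapse only when both prefixes are identical.
        if p1 == p2:
            p2 = ""
            if len(n1) == len(n2):
                # Drop shared leading digits, keeping at least `keep` digits.
                i = 0
                while i < len(n2) - keep and n1[i] == n2[i]:
                    i += 1
                n2 = n2[i:]
    else:
        raise ValueError(f"unsupported page-range-format: {fmt}")
    # Suffixes are always preserved on both ends.
    return f"{p1}{n1}{s1}{delimiter}{p2}{n2}{s2}"

print(format_range(("A", "101", ""), ("A", "121", ""), "minimal-two"))
```

Keeping the suffixes untouched in every branch reflects the statement that suffixes are always preserved.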

Notes

Obviously the rules about commas are troublesome if commas are being used as a decimal marker: that needs to be dealt with in the locale file. Why not semicolons as well?

Allowing any numerals in prefixes and suffixes causes inevitable ambiguity. The rules given above seem reasonable to me, but there is no perfect solution.

I haven’t treated an en-dash as a range marker, because it really isn’t. It seems to me changing --- to is desirable, but a different point, because one would want to do that on all fields, not just numeric ones.

See “Is-numeric behavior, new is-numberlike test?” for a proposal on simplifying the is-numeric test behavior and making it more coherent.