This turns out (as usual!) to be surprisingly complex, and it’s hard to be concise without being cryptic. It’s something like this:
What is a “number”
CSL regards a “number” as any set of numerals with or without a prefix or a suffix. A prefix or suffix is any set of characters that does not include whitespace, an ampersand (&), a comma (,) or any of the characters that can be used to mark a range (- and –), and that does not consist purely of numerals. It may include an “escaped hyphen” (\-), which will be converted on output to a hyphen.
So:
- 123 is a number (123), without any prefix or suffix.
- A123 is a number (123), with a prefix (A).
- 123B is a number (123), with a suffix (B).
- A\-123B is a number (123), with a prefix (A-) and a suffix (B).
- 1-2 is two numbers (1 and 2) separated by a range marker (-).
- 1\-2 is one number (2) with a prefix (1-): see further below for numerals in prefixes and suffixes.
Numerals can appear in a prefix so long as they precede some non-digit:
- 123A456 is a number (456) with a prefix (123A).
- 123\-456 is a number (456) with a prefix (123-).
Numerals can appear in a suffix so long as they are preceded by some non-digit:
- A123b23 is a number (123) with a prefix (A) and a suffix (b23).
In some cases that makes a number ambiguous. Take 123A456. Is that a prefix (123A) followed by a number (456), or a number (123) followed by a suffix (A456)? That ambiguity is resolved by favouring prefixes over suffixes, so in that case the string would be interpreted as a prefix (123A) followed by a number (456). The same can apply where there are both prefixes and suffixes. Here a suffix is preferred over none if the string allows it, and the prefix is made as long as it can be, consistently with having at least some suffix. Take A123B456C789, which could be interpreted in several different ways:
- Prefix: A; Number: 123; Suffix: B456C789
- Prefix: A123B; Number: 456; Suffix: C789 [preferred interpretation]
- Prefix: A123B456C; Number: 789; No suffix
Here the middle interpretation would be chosen, giving the longest prefix that is consistent with there being some suffix.
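To make the disambiguation rule concrete, here is a minimal Python sketch of one way to split a single token into prefix, number and suffix. It is only an illustration of the rules as described above, not code from any CSL processor; the names `parse_number` and `ParsedNumber` are hypothetical, and "numerals" is assumed to mean the ASCII digits 0-9.

```python
import re
from typing import NamedTuple, Optional


class ParsedNumber(NamedTuple):
    prefix: str
    number: str
    suffix: str


# Characters that end a number: whitespace, ampersand, comma, hyphen, en-dash.
_SEPARATOR = re.compile(r"[\s&,\-\u2013]")
_ESCAPED_HYPHEN = "\\-"  # the two characters backslash + hyphen
_PLACEHOLDER = "\x00"    # stands in for an escaped hyphen while parsing


def parse_number(token: str) -> Optional[ParsedNumber]:
    """Split one token into (prefix, number, suffix), or return None."""
    # Hide escaped hyphens so they are not mistaken for range markers.
    hidden = token.replace(_ESCAPED_HYPHEN, _PLACEHOLDER)
    if _SEPARATOR.search(hidden):
        return None  # contains a separator, so it is not a single number
    # Every maximal run of digits is a candidate for "the number".
    candidates = [
        (hidden[: m.start()], m.group(), hidden[m.end():])
        for m in re.finditer(r"[0-9]+", hidden)
    ]
    if not candidates:
        return None  # no numerals at all
    # Prefer the longest prefix that still leaves some suffix; if no
    # candidate has both, fall back to the longest prefix overall.
    with_both = [c for c in candidates if c[0] and c[2]]
    prefix, number, suffix = with_both[-1] if with_both else candidates[-1]
    return ParsedNumber(
        prefix.replace(_PLACEHOLDER, "-"),
        number,
        suffix.replace(_PLACEHOLDER, "-"),
    )
```

Because candidates are collected left to right, the last entry in each list is the one with the longest prefix, which is why taking `[-1]` is enough to apply the preference.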
What is “numeric” input
An entire input string is treated as numeric if it consists only of numbers (in the sense described above), whitespace, commas, range markers and ampersands. So “A123B, 23” is numeric. So is “13th”, because it can be interpreted as a single number (13 with suffix th). But “13th edition” is not numeric, because “edition” is not capable of being understood as a number.
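The numeric test can be sketched in a few lines of Python. This is an illustrative reading of the definition above, not an existing implementation: the function name `is_numeric` is hypothetical, and it relies on the observation that, under the rules given earlier, any separator-free token containing at least one ASCII digit can be read as a number.

```python
import re

# Split on whitespace, commas, ampersands, en-dashes and unescaped hyphens.
# A hyphen preceded by a backslash ("\-") is an escaped hyphen and is kept.
_SEPARATORS = re.compile(r"(?<!\\)-|[\s,&\u2013]")


def is_numeric(text: str) -> bool:
    """Return True if every token between separators can be read as a number."""
    tokens = [t for t in _SEPARATORS.split(text) if t]
    # A token is a number if it contains at least one numeral; the rest of
    # the token is then absorbed into its prefix and/or suffix.
    return bool(tokens) and all(re.search(r"[0-9]", t) for t in tokens)
```

With this sketch, `is_numeric("13th")` is true while `is_numeric("13th edition")` is false, since the token "edition" contains no numeral.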
Basic normalization
Basic normalization applies to most numeric strings. The numbers are extracted and: range markers are replaced by the localised page-range-delimiter (e.g. –), commas receive one space after the comma, and ampersands are spaced. So 12,13-14&17 becomes “12, 13–14 & 17”. This is applied to most numeric material, but not to the number or version variables, which are simply printed as they have been entered.
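A rough Python sketch of basic normalization follows. It assumes the input has already been judged numeric, uses an en-dash as a stand-in for the localised delimiter, and drops any whitespace around separators before re-spacing them; the function name `normalize` and its `range_delimiter` parameter are hypothetical.

```python
import re


def normalize(text: str, range_delimiter: str = "\u2013") -> str:
    """Apply the basic spacing and range-marker rules to a numeric string."""
    # Range markers (unescaped hyphen or en-dash) become the delimiter.
    text = re.sub(
        r"\s*(?:(?<!\\)-|\u2013)\s*", lambda _: range_delimiter, text
    )
    # Commas get exactly one following space and none before.
    text = re.sub(r"\s*,\s*", ", ", text)
    # Ampersands get one space on each side.
    text = re.sub(r"\s*&\s*", " & ", text)
    # Escaped hyphens are only now turned into real hyphens, so they are
    # never mistaken for range markers.
    text = text.replace("\\-", "-")
    # Collapse any remaining runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()
```

A caller would skip this step for the number and version variables and print them as entered.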
Page-range adjustment
For locator and page fields, additional reformatting is carried out on ranges, depending on the setting of page-range-format on cs:style:
- Minimal. Ranges are collapsed as far as possible, so 101–103 becomes 101–3. In addition, identical prefixes are removed: A101–A103 becomes A101–3 (suffixes are always preserved).
- Minimal-two. Ranges are collapsed but preserving two digits, so 101–103 becomes 101–03. Identical prefixes are removed: A101–A121 becomes A101–21 (suffixes are always preserved).
- Expanded. Ranges are expanded, so 101–3 becomes 101–103. Identical prefixes are preserved: A101–A103 remains as it is. [Question: should one really expand here by adding prefixes? I think not, because their absence can be presumed to be deliberate. The user is always in control of the data if that’s wrong.]
- Chicago. [Set out the rather complicated Chicago rules. It’s not clear to me how Chicago wants prefixes dealt with, but I think identical prefixes get removed. Would someone who knows about Chicago care to comment?]
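The minimal, minimal-two and expanded settings can be sketched as below. This is a simplified illustration only: the function name `adjust_range` is hypothetical, prefixes and suffixes are assumed to contain no digits, ranges whose two prefixes differ are left untouched (an assumption, since the rules above do not cover that case), and the Chicago setting is omitted because its rules are not set out here.

```python
import re

# Simplified shape of one end of a range: non-digit prefix, digits, non-digit suffix.
_PART = re.compile(r"^([^0-9]*)([0-9]+)([^0-9]*)$")


def adjust_range(first: str, last: str, mode: str, delimiter: str = "\u2013") -> str:
    """Format a range according to a page-range-format style setting."""
    if mode not in ("minimal", "minimal-two", "expanded"):
        raise ValueError("unsupported page-range-format: " + mode)
    m1, m2 = _PART.match(first), _PART.match(last)
    if not (m1 and m2):
        return first + delimiter + last  # unexpected shape: leave as entered
    p1, n1, s1 = m1.groups()
    p2, n2, s2 = m2.groups()
    if p2 and p2 != p1:
        return first + delimiter + last  # differing prefixes: leave as entered
    # Undo any abbreviation already present in the input (101-3 -> 101-103).
    if len(n2) < len(n1):
        n2 = n1[: len(n1) - len(n2)] + n2
    if mode == "expanded":
        # Prefixes are kept exactly as entered; none are added.
        return p1 + n1 + s1 + delimiter + p2 + n2 + s2
    keep = 1 if mode == "minimal" else 2  # digits that must survive
    if len(n1) == len(n2):
        shared = 0
        while shared < len(n1) - keep and n1[shared] == n2[shared]:
            shared += 1
        n2 = n2[shared:]
    # The identical (or absent) second prefix is dropped; suffixes are preserved.
    return p1 + n1 + s1 + delimiter + n2 + s2
```

For example, `adjust_range("A101", "A121", "minimal-two")` gives A101–21, and `adjust_range("101", "3", "expanded")` gives 101–103.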
Notes
Obviously the rules about commas are troublesome if commas are being used as a decimal marker: that needs to be dealt with in the locale file. Why not semicolons as well?
Allowing any numerals in prefixes and suffixes causes inevitable ambiguity. The rules given above seem reasonable to me, but there is no perfect solution.
I haven’t treated an em-dash as a range marker, because it really isn’t. It seems to me changing --- to — is desirable, but that is a different point, because one would want to do that on all fields, not just numeric ones.