CSL spec and test cases

Hello,

I’ve been doing some work on my citeproc-py
(brechtm/citeproc-py on GitHub, “Yet another Python CSL Processor”) and have
written down some questions/remarks about some of the tests and the CSL spec.
Note that I could simply be misunderstanding/misinterpreting things for some
of these.

  • the CSL spec is contradictory about number detection

Tests whether the given variables contain numeric content.
versus
Content is considered numeric if it solely consists of numbers.
For example, “2nd” tests “true” whereas “second” and “2nd edition”
test “false”.
does not seem to agree with condition_IsNumeric

  • Chicago page range format: what to do with five or more digits?

  • Which values are allowed for the “page” input field? I see multiple
    ranges can also be specified. I think the CSL spec should, in general,
    also define the format of the input fields. Personally, I would opt for a
    structured format (like the date fields) as opposed to a string-format
    (the page field). Individual CSL processors can still convert a
    string-formatted field to the structured data. This would require changes
    to the tests.

  • Shouldn’t “page-first” be a number variable? It is used with number in
    page_NumberPageFirst

  • The spec doesn’t say anything about the nested groups special case.
    variables_TitleShortOnShortTitleNoTitleCondition seems to disagree with
    the CSL spec:

cs:group and its child elements are suppressed if a) at least one
rendering element in cs:group calls a variable (either directly or via
a macro), and b) all variables that are called are empty.
In the group in the else section only the title variable is called. For
ITEM-3, this variable is empty, so the group should be suppressed, but it
isn’t.
Should a nested group always act as if it’s (successfully) calling a
variable? If so, the spec should mention this.

  • I seem to remember citeproc-js postprocesses its output to remove
    duplicate affixes. The CSL spec doesn’t say anything about this, AFAIK.
    What’s the official stance on this? I would personally avoid doing this,
    unless the spec includes an unambiguous definition on how this should work.

  • locale_TitleCaseGarbageLangEnglishLocale: is “en” a valid locale? If so,
    and default-locale=“en”, which locale should we use?

  • textcase_SkipNameParticlesInTitleCase (1): I believe this behavior is
    not part of the CSL spec, is it?

  • textcase_SkipNameParticlesInTitleCase (2): the result doesn’t seem to
    follow the CSL spec. The ‘a’ after the colon should be capitalized:

In both cases, stop words are lowercased, unless they are the first or
last word in the string, or follow a colon.

  • date_VariousInvalidDates: why is ‘Spring’ in the output?

  • page_Chicago: is the example S input data correct? It strikes me as a
    confusing way of representing a page range (in addition to saving only a
    single digit).

  • A large number of tests test functionality that is not in the CSL spec,
    but is provided by citeproc-js (raw dates, static ordering, literal names,
    …). I think these should be indicated as such, or perhaps moved to a
    separate directory. This would make it easier to check the other CSL
    processor’s compatibility.

I hope you can find the time to answer these.

Thanks,
Brecht

  • the CSL spec is contradictory about number detection

Tests whether the given variables contain numeric content.
versus
Content is considered numeric if it solely consists of numbers.
For example, “2nd” tests “true” whereas “second” and “2nd edition”
test “false”.
does not seem to agree with condition_IsNumeric

The behavior of “is-numeric” changed in CSL 1.0.1. See
http://citationstyles.org/downloads/release-notes-csl101.html#numbers

I can see how the current description in the specification might be
somewhat confusing, but it is meant to agree with
https://bitbucket.org/bdarcus/citeproc-test/src/tip/processor-tests/humans/condition_IsNumeric.txt.
In “Tests whether the given variables contain numeric content.”, I
mean to say that the test is against the entire string contents of
each variable. In a string like “2nd edition”, the “edition” substring
means that the entire string is non-numeric.
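
A rough sketch of a whole-string numeric test along these lines follows. The helper name `is_numeric` is made up for illustration and is not taken from citeproc-py or citeproc-js. It also assumes that a number may carry a letter prefix or suffix and that several numbers may be joined by a comma, hyphen, or ampersand, which is how I read the CSL 1.0.1 wording; check the specification before relying on that.

```python
import re

# One "number": optional letter prefix, digits, optional letter suffix
# (e.g. "2nd", "D2"). Assumption based on my reading of CSL 1.0.1.
_NUMBER = r"[A-Za-z]*\d+[A-Za-z]*"
# Numbers may be joined by a comma, hyphen, or ampersand, with optional spaces.
_SEPARATOR = r"\s*[,&-]\s*"
_NUMERIC = re.compile(rf"{_NUMBER}(?:{_SEPARATOR}{_NUMBER})*")


def is_numeric(value: str) -> bool:
    """Hypothetical helper: True only if the *entire* string is numeric."""
    return _NUMERIC.fullmatch(value.strip()) is not None
```

With this sketch, “2nd” tests true, while “second” (no digits) and “2nd edition” (a trailing non-numeric word) test false, matching the examples quoted from the spec.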

  • Chicago page range format: what to do with five or more digits?

The specification currently links to
http://www.aahn.org/guidelines.html, but it seems like the content we
relied on moved to http://www.aahn.org/stylesheet.html . The latter
page shows an excerpt from CMoS that we almost copied verbatim.
Sebastian, could you check if CMoS 16th edition gives any guidance on
number ranges of 5 or more digits?

  • Which values are allowed for the “page” input field? I see multiple
    ranges can also be specified. I think the CSL spec should, in general,
    also define the format of the input fields. Personally, I would opt for a
    structured format (like the date fields) as opposed to a string-format
    (the page field). Individual CSL processors can still convert a
    string-formatted field to the structured data. This would require changes
    to the tests.

This would presumably involve describing the JSON format used by
citeproc-js in more detail. See
http://blog.martinfenner.org/2013/08/08/csl-is-more-than-citation-styles/
for a relevant discussion on this topic.

  • Shouldn’t “page-first” be a number variable? It is used with number in
    page_NumberPageFirst

See proposal: allow more variables to be rendered with cs:number · Issue #9 · citation-style-language/schema · GitHub. I
think Frank prefers to render “page” and “page-first” with cs:number,
but that’s currently not kosher CSL.

  • The spec doesn’t say anything about the nested groups special case.
    variables_TitleShortOnShortTitleNoTitleCondition seems to disagree with
    the CSL spec:

cs:group and its child elements are suppressed if a) at least one
rendering element in cs:group calls a variable (either directly or via
a macro), and b) all variables that are called are empty.
In the group in the else section only the title variable is called. For
ITEM-3, this variable is empty, so the group should be suppressed, but it
isn’t.
Should a nested group always act as if it’s (successfully) calling a
variable? If so, the spec should mention this.

I think Frank already has an opinion on this, but I can’t find the
discussion. I think the test
(https://bitbucket.org/bdarcus/citeproc-test/src/tip/processor-tests/humans/variables_TitleShortOnShortTitleNoTitleCondition.txt)
describes the desired behavior, in which case the specification should
indeed be amended. This is somewhat related to the open issue
Clarify the output rules for the `<substitute>` element · Issue #100 · citation-style-language/documentation · GitHub
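
For reference, the suppression rule quoted above, read literally, can be sketched as follows. The function name and its input (a flat list of the values of all variables called inside the group) are hypothetical and do not come from any existing processor. How a nested group should count is exactly the open question here, so the sketch does not try to settle it.

```python
from typing import Iterable, Optional


def group_is_suppressed(called_values: Iterable[Optional[str]]) -> bool:
    """Literal reading of the quoted rule (hypothetical helper).

    called_values holds the value of every variable called inside the
    group, directly or via a macro; None or "" means the variable is empty.
    """
    values = list(called_values)
    calls_a_variable = len(values) > 0          # condition a)
    all_empty = all(not v for v in values)      # condition b)
    return calls_a_variable and all_empty
```

Under this reading, a group that only calls an empty title, as described for ITEM-3, gives `group_is_suppressed([None])` and is suppressed, while a group that calls no variables at all is always rendered.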

  • I seem to remember citeproc-js postprocesses its output to remove
    duplicate affixes. The CSL spec doesn’t say anything about this, AFAIK.
    What’s the official stance on this? I would personally avoid doing this,
    unless the spec includes an unambiguous definition on how this should work.

I’m convinced that CSL processors need to do some suppression of
duplicated punctuation. Frank just prepared some tests that describe
the current behavior in citeproc-js, and I hope to write up some
requirements for the specification in the next few weeks based on
those. See

https://bitbucket.org/bdarcus/citeproc-test/src/tip/processor-tests/humans/punctuation_FullMontyPlain.txt
https://bitbucket.org/bdarcus/citeproc-test/src/tip/processor-tests/humans/punctuation_FullMontyQuotesIn.txt
https://bitbucket.org/bdarcus/citeproc-test/src/tip/processor-tests/humans/punctuation_FullMontyQuotesOut.txt
https://bitbucket.org/bdarcus/citeproc-test/src/tip/processor-tests/humans/punctuation_FullMontyField.txt

  • locale_TitleCaseGarbageLangEnglishLocale: is “en” a valid locale? If so,
    and default-locale=“en”, which locale should we use?

http://citationstyles.org/downloads/specification.html#locale-fallback
discusses this: “If the chosen output locale is a language (e.g.
“de”), the (primary) dialect is used in step 1 (e.g. “de-DE”).”

The table above that line mentions that “en-US” is the primary dialect for “en”.
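
As a minimal sketch of that step, a processor could expand a bare language code to its primary dialect before looking up a locale. The table below only contains the two pairs mentioned in this thread, and `resolve_locale` is a made-up name.

```python
# Primary dialects, limited to the two pairs mentioned in this thread.
PRIMARY_DIALECT = {"en": "en-US", "de": "de-DE"}


def resolve_locale(requested: str) -> str:
    """Hypothetical helper: expand a bare language code to its primary dialect."""
    if "-" in requested:
        # Already a dialect such as "en-GB"; use it unchanged.
        return requested
    # Unknown languages are returned unchanged in this sketch.
    return PRIMARY_DIALECT.get(requested, requested)
```

So default-locale=“en” would resolve to “en-US” in step 1 of the fallback.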

  • textcase_SkipNameParticlesInTitleCase (1): I believe this behavior is
    not part of the CSL spec, is it?

https://bitbucket.org/bdarcus/citeproc-test/src/tip/processor-tests/humans/textcase_SkipNameParticlesInTitleCase.txt

Correct.

  • textcase_SkipNameParticlesInTitleCase (2): the result doesn’t seem to
    follow the CSL spec. The ‘a’ after the colon should be capitalized:

In both cases, stop words are lowercased, unless they are the first or
last word in the string, or follow a colon.

It seems like it should.
Frank?

  • date_VariousInvalidDates: why is ‘Spring’ in the output?

https://bitbucket.org/bdarcus/citeproc-test/src/tip/processor-tests/humans/date_VariousInvalidDates.txt

Don’t know. I think you can ignore this unit test. Frank?

  • page_Chicago: is the example S input data correct? It strikes me as a
    confusing way of representing a page range (in addition to saving only a
    single digit).

https://bitbucket.org/bdarcus/citeproc-test/src/tip/processor-tests/humans/page_Chicago.txt

Looks unambiguous to me.

  • A large number of tests test functionality that is not in the CSL spec,
    but is provided by citeproc-js (raw dates, static ordering, literal names,
    …). I think these should be indicated as such, or perhaps moved to a
    separate directory. This would make it easier to check the other CSL
    processor’s compatibility.

Sylvester Keil proposed using a Cucumber format for unit tests, which
would allow tests to be tagged.

If somebody else helps with the technical infrastructure, I’d be happy
to help reclassifying the existing unit tests.

Rintze

Re: Page range

The front matter in books and some journals is usually paginated in Roman
numerals. Is this observation relevant?

David

On Thu, Aug 8, 2013 at 12:27 PM, Rintze Zelle <@Rintze_Zelle> wrote:

What exactly led you to this remark? The discussion about the “is-numeric” test?

My guess is that citeproc-js doesn’t currently parse roman numerals in
its input data, and just treats it as text, which should work
reasonably well.

re roman numerals: treating them as text works for CMoS which always wants
full page ranges for roman numbers.

CMoS page range specs don’t change for ranges with more than 4 digits, i.e.
"Use two digits unless more are needed to include all changed parts"
12345-46
12345-678
12345-6789

and the different rules for multiples of one hundred and the first nine
numbers thereafter remain,
i.e. cite all digits when dealing with multiples of one hundred
12300-12345
and only the changed digit(s) for the first nine thereafter
12301-8

On Thu, Aug 8, 2013 at 2:05 PM, David Lawrence <@David_Lawrence> wrote:

But according to “If numbers are four digits long and three digits
change, use all digits”, you would have:

1234-46
1234-1678

So I’d expect

12345-46
12345-12678
12345-16789

In the specification,
shouldn’t we write something like
“If numbers are four or more digits long and three or more digits
change, use all digits” ?

Rintze

Hi,

In the specification,
shouldn’t we write something like
“If numbers are four or more digits long and three or more digits
change, use all digits” ?

Yes, this is the reason why I asked in the first place. I should
probably have mentioned that.

For now, in citeproc-py, if the number of common digits between the start
and end page numbers is less than two, it uses the expanded form.

12345-468
12345-13576
123456-5614

I’m not sure which I prefer. As long as it’s clearly defined, I’m happy :)

Cheers,
Brecht

Hello,

Thank you, Rintze, for the clarifications and pointers to relevant
information. These should prove helpful.

Brecht

Sorry, I was on vacation.

In the specification,
shouldn’t we write something like
“If numbers are four or more digits long and three or more digits
change, use all digits” ?

Yes, this is the reason why I asked in the first place. I should
probably have mentioned that.

For now, in citeproc-py, if the number of common digits between the start
and end page numbers is less than two, it uses the expanded form.

12345-468
12345-13576
123456-5614

I’m not sure which I prefer. As long as it’s clearly defined, I’m happy :)

The current CSL spec (and thus Brecht’s implementation in -py) is incorrect
according to the current CMoS.
Here are three examples from the relevant chapter (9.60):
1496–500
11564–615
12991–3001

i.e. even when only one digit stays the same, only the changing digits are
displayed after the en-dash.
Since we call this rule “Chicago” we should change this in the specs (and
implementers should change this accordingly).
According to the manual, these rules have never changed, so we must have
gotten that wrong at some point. Sorry for never catching that.
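
To make the rules discussed in this thread concrete, here is a minimal sketch of the abbreviation as I read it from the examples above: all digits when the first number is a multiple of one hundred, only the changed part for the first nine numbers after a multiple of one hundred, and otherwise two digits unless more are needed to cover every changed digit. The function name `chicago_page_range` is hypothetical, this is not the citeproc-py or citeproc-js implementation, and the handling of first numbers below 100 and of ranges whose ends differ in length (always written in full) is an assumption.

```python
EN_DASH = "\u2013"


def chicago_page_range(first: int, last: int) -> str:
    """Hypothetical sketch of a Chicago-style page range abbreviation."""
    a, b = str(first), str(last)
    full = f"{a}{EN_DASH}{b}"
    # Assumption: write the range in full for degenerate ranges, first
    # numbers below 100, multiples of 100, and ends of different length.
    if last <= first or first < 100 or first % 100 == 0 or len(a) != len(b):
        return full
    # Length of the shared leading digits; the rest of `b` is the changed part.
    common = 0
    while a[common] == b[common]:
        common += 1
    changed = len(b) - common
    if first % 100 < 10:
        keep = changed            # e.g. 12301-8: changed part only
    else:
        keep = max(2, changed)    # at least two digits, more if needed
    return f"{a}{EN_DASH}{b[-keep:]}"
```

For example, `chicago_page_range(11564, 11615)` keeps only the three changing digits after the en dash, in line with the 11564–615 example cited from the manual.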