CSL spec and test cases

Brecht_Machiels1 · August 8, 2013, 5:12pm

Hello,

I’ve been doing some work on my citeproc-py
(brechtm/citeproc-py on GitHub: “Yet another Python CSL Processor”) and have
written down some questions/remarks about some of the tests and the CSL spec.
Note that I could simply be misunderstanding/misinterpreting things for some
of these.

the CSL spec is contradictory about number detection

Tests whether the given variables contain numeric content.
versus
Content is considered numeric if it solely consists of numbers.
For example, “2nd” tests “true” whereas “second” and “2nd edition”
test “false”.
does not seem to agree with condition_IsNumeric

Chicago page range format: what to do with five or more digits?
Which values are allowed for the “page” input field? I see multiple
ranges can also be specified. I think the CSL spec should, in general,
also define the format of the input fields. Personally, I would opt for a
structured format (like the date fields) as opposed to a string-format
(the page field). Individual CSL processors can still convert a
string-formatted field to the structured data. This would require changes
to the tests.
Shouldn’t “page-first” be a number variable? It is used with number in
page_NumberPageFirst
The spec doesn’t say anything about the nested groups special case.
variables_TitleShortOnShortTitleNoTitleCondition seems to disagree with
the CSL spec:

cs:group and its child elements are suppressed if a) at least one
rendering element in cs:group calls a variable (either directly or via
a macro), and b) all variables that are called are empty.
In the group in the else section only the title variable is called. For
ITEM-3, this variable is empty, so the group should be suppressed, but it
isn’t.
Should a nested group always act as if it’s (successfully) calling a
variable? If so, the spec should mention this.

I seem to remember citeproc-js postprocesses its output to remove
duplicate affixes. The CSL spec doesn’t say anything about this, AFAIK.
What’s the official stance on this? I would personally avoid doing this,
unless the spec includes an unambiguous definition on how this should work.
locale_TitleCaseGarbageLangEnglishLocale: is “en” a valid locale? If so,
and default-locale=“en”, which locale should we use?
textcase_SkipNameParticlesInTitleCase (1): I believe this behavior is
not part of the CSL spec, is it?
textcase_SkipNameParticlesInTitleCase (2): the result doesn’t seem to
follow the CSL spec. The ‘a’ after the colon should be capitalized:

In both cases, stop words are lowercased, unless they are the first or
last word in the string, or follow a colon.

date_VariousInvalidDates: why is ‘Spring’ in the output?
page_Chicago: is the example S input data correct? It strikes me as a
confusing way of representing a page range (in addition to saving only a
single digit).
A large number of tests test functionality that is not in the CSL spec,
but is provided by citeproc-js (raw dates, static ordering, literal names,
…). I think these should be indicated as such, or perhaps moved to a
separate directory. This would make it easier to check other CSL
processors’ compatibility.

I hope you can find the time to answer these.

Thanks,
Brecht

Rintze_Zelle · August 8, 2013, 7:27pm

the CSL spec is contradictory about number detection

Tests whether the given variables contain numeric content.
versus
Content is considered numeric if it solely consists of numbers.
For example, “2nd” tests “true” whereas “second” and “2nd edition”
test “false”.
does not seem to agree with condition_IsNumeric

The behavior of “is-numeric” changed in CSL 1.0.1. See
http://citationstyles.org/downloads/release-notes-csl101.html#numbers

I can see how the current description in the specification might be
somewhat confusing, but it is meant to agree with
https://bitbucket.org/bdarcus/citeproc-test/src/tip/processor-tests/humans/condition_IsNumeric.txt.
In “Tests whether the given variables contain numeric content.”, I
mean to say that the test is against the entire string contents of
each variable. In a string like “2nd edition”, the “edition” substring
means that the entire string is non-numeric.
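The whole-string reading described above can be sketched roughly as follows. This is only an illustration, not the spec's full definition or any processor's actual code: the function name `is_numeric` and the assumption that a number is a run of digits with optional letters directly attached are mine.

```python
import re

# Assumption for illustration: a "number" is a run of digits with optional
# letters attached directly before or after it (e.g. "2nd"), and no spaces.
_NUMBER = re.compile(r"[A-Za-z]*\d+[A-Za-z]*")


def is_numeric(value) -> bool:
    """Test the ENTIRE string content of a variable, not just a substring."""
    return _NUMBER.fullmatch(str(value).strip()) is not None


print(is_numeric("2nd"))          # True
print(is_numeric("second"))       # False
print(is_numeric("2nd edition"))  # False
```

Because `fullmatch` is used, the extra word in “2nd edition” makes the whole value non-numeric, which is the behavior the quoted sentence is meant to describe.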

Chicago page range format: what to do with five or more digits?

The specification currently links to
http://www.aahn.org/guidelines.html, but it seems like the content we
relied on moved to http://www.aahn.org/stylesheet.html . The latter
page shows an excerpt from CMoS that we almost copied verbatim.
Sebastian, could you check if CMoS 16th edition gives any guidance on
number ranges of 5 or more digits?

Which values are allowed for the “page” input field? I see multiple
ranges can also be specified. I think the CSL spec should, in general,
also define the format of the input fields. Personally, I would opt for a
structured format (like the date fields) as opposed to a string-format
(the page field). Individual CSL processors can still convert a
string-formatted field to the structured data. This would require changes
to the tests.

This would presumably involve describing the JSON format used by
citeproc-js in more detail. See
http://blog.martinfenner.org/2013/08/08/csl-is-more-than-citation-styles/
for a relevant discussion on this topic.

Shouldn’t “page-first” be a number variable? It is used with number in
page_NumberPageFirst

See “proposal: allow more variables to be rendered with cs:number”
(Issue #9, citation-style-language/schema, GitHub). I
think Frank prefers to render “page” and “page-first” with cs:number,
but that’s currently not kosher CSL.

The spec doesn’t say anything about the nested groups special case.
variables_TitleShortOnShortTitleNoTitleCondition seems to disagree with
the CSL spec:

cs:group and its child elements are suppressed if a) at least one
rendering element in cs:group calls a variable (either directly or via
a macro), and b) all variables that are called are empty.
In the group in the else section only the title variable is called. For
ITEM-3, this variable is empty, so the group should be suppressed, but it
isn’t.
Should a nested group always act as if it’s (successfully) calling a
variable? If so, the spec should mention this.

I think Frank already has an opinion on this, but I can’t find the
discussion. I think the test
(https://bitbucket.org/bdarcus/citeproc-test/src/tip/processor-tests/humans/variables_TitleShortOnShortTitleNoTitleCondition.txt)
describes the desired behavior, in which case the specification should
indeed be amended. This is somewhat related to the open issue
“Clarify the output rules for the `<substitute>` element” (Issue #100, citation-style-language/documentation, GitHub).
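For reference, a literal reading of the quoted cs:group rule could be sketched like this. The function name and the flat list of called variable names are hypothetical simplifications. Nested groups are deliberately not modeled, since how they should count is exactly the open question.

```python
def group_is_suppressed(called_variables, item) -> bool:
    """Literal reading of the quoted rule for cs:group.

    called_variables: names of the variables called inside the group
                      (directly or via a macro); hypothetical flat list.
    item:             dict of input data for one reference.
    """
    # a) at least one rendering element must call a variable
    if not called_variables:
        return False
    # b) all variables that are called are empty (missing counts as empty)
    return all(not item.get(name) for name in called_variables)


# An item without a title, with a group that only calls "title":
print(group_is_suppressed(["title"], {"id": "ITEM-3"}))  # True
```

Under this literal reading the group for an item with no title is suppressed, which is the discrepancy with the test that Brecht points out.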

I seem to remember citeproc-js postprocesses its output to remove
duplicate affixes. The CSL spec doesn’t say anything about this, AFAIK.
What’s the official stance on this? I would personally avoid doing this,
unless the spec includes an unambiguous definition on how this should work.

I’m convinced that CSL processors need to do some suppression of
duplicated punctuation. Frank just prepared some tests that describe
the current behavior in citeproc-js, and I hope to write up some
requirements for the specification in the next few weeks based on
those. See

https://bitbucket.org/bdarcus/citeproc-test/src/tip/processor-tests/humans/punctuation_FullMontyPlain.txt
https://bitbucket.org/bdarcus/citeproc-test/src/tip/processor-tests/humans/punctuation_FullMontyQuotesIn.txt
https://bitbucket.org/bdarcus/citeproc-test/src/tip/processor-tests/humans/punctuation_FullMontyQuotesOut.txt
https://bitbucket.org/bdarcus/citeproc-test/src/tip/processor-tests/humans/punctuation_FullMontyField.txt

locale_TitleCaseGarbageLangEnglishLocale: is “en” a valid locale? If so,
and default-locale=“en”, which locale should we use?

http://citationstyles.org/downloads/specification.html#locale-fallback
discusses this: “If the chosen output locale is a language (e.g.
“de”), the (primary) dialect is used in step 1 (e.g. “de-DE”).”

The table above that line mentions that “en-US” is the primary dialect for “en”.
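As a rough sketch of that fallback step, assuming a lookup table that contains only the two primary dialects mentioned here (the spec's table has more entries, and the names `PRIMARY_DIALECTS` and `resolve_locale` are made up for this example):

```python
# Only the two mappings mentioned in this thread; the spec's table is longer.
PRIMARY_DIALECTS = {
    "en": "en-US",
    "de": "de-DE",
}


def resolve_locale(default_locale: str) -> str:
    """Map a bare language code to its primary dialect; leave dialects alone."""
    if "-" in default_locale:
        return default_locale
    # Unknown languages are returned unchanged in this sketch.
    return PRIMARY_DIALECTS.get(default_locale, default_locale)


print(resolve_locale("en"))     # en-US
print(resolve_locale("en-GB"))  # en-GB
```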

textcase_SkipNameParticlesInTitleCase (1): I believe this behavior is
not part of the CSL spec, is it?

https://bitbucket.org/bdarcus/citeproc-test/src/tip/processor-tests/humans/textcase_SkipNameParticlesInTitleCase.txt

Correct.

textcase_SkipNameParticlesInTitleCase (2): the result doesn’t seem to
follow the CSL spec. The ‘a’ after the colon should be capitalized:

In both cases, stop words are lowercased, unless they are the first or
last word in the string, or follow a colon.

It seems like it should.
Frank?

date_VariousInvalidDates: why is ‘Spring’ in the output?

https://bitbucket.org/bdarcus/citeproc-test/src/tip/processor-tests/humans/date_VariousInvalidDates.txt

Don’t know. I think you can ignore this unit test. Frank?

page_Chicago: is the example S input data correct? It strikes me as a
confusing way of representing a page range (in addition to saving only a
single digit).

https://bitbucket.org/bdarcus/citeproc-test/src/tip/processor-tests/humans/page_Chicago.txt

Looks unambiguous to me.

A large number of tests test functionality that is not in the CSL spec,
but is provided by citeproc-js (raw dates, static ordering, literal names,
…). I think these should be indicated as such, or perhaps moved to a
separate directory. This would make it easier to check other CSL
processors’ compatibility.

Sylvester Keil proposed using a Cucumber format for unit tests, which
would allow tests to be tagged:

https://github.com/inukshuk/citeproc-ruby/blob/1c420de0f7a86b7c35782dee86ce62cbebb47ab9/features/condition/is_numeric.feature

Feature: Condition is numeric

  @citeproc-test @citation @v1.0
  Scenario: is numeric
    Given a CSL processor
    And the following items
    """
    [
        {
            "id": "ITEM-1", 
            "title": "Work 1", 
            "volume": "Volume 2",
            "type": "book"
        }, 
        {
            "id": "ITEM-2", 
            "title": "Work 2", 
            "volume": "2nd volume",
            "type": "book"
        },

This file has been truncated.

If somebody else helps with the technical infrastructure, I’d be happy
to help reclassifying the existing unit tests.

Rintze

David_Lawrence · August 8, 2013, 8:05pm

Re: Page range

The fore-matter in books and some journals is usually in Roman numerals. Is
this observation relevant?

David

Rintze_Zelle · August 8, 2013, 8:45pm

What exactly led you to this remark? The discussion about the “is-numeric” test?

My guess is that citeproc-js doesn’t currently parse roman numerals in
its input data, and just treats it as text, which should work
reasonably well.

Sebastian_Karcher · August 8, 2013, 8:47pm

re roman numerals: treating them as text works for CMoS which always wants
full page ranges for roman numbers.

Sebastian_Karcher · August 8, 2013, 8:46pm

CMoS page range specs don’t change for ranges with more than 4 digits, i.e.
"Use two digits unless more are needed to include all changed parts":
12345-46
12345-678
12345-6789

The different rules for multiples of one hundred and the first nine numbers
thereafter remain as well,
i.e. cite all digits when dealing with multiples of one hundred:
12300-12345
and only the changed digit(s) for the first nine numbers thereafter:
12301-8

Rintze_Zelle · August 8, 2013, 9:39pm

But according to “If numbers are four digits long and three digits
change, use all digits”, you would have:

1234-46
1234-1678

So I’d expect

12345-46
12345-12678
12345-16789

In the specification,
shouldn’t we write something like
“If numbers are four or more digits long and three or more digits
change, use all digits” ?

Rintze

Brecht_Machiels1 · August 11, 2013, 10:27am

Hi,

In the specification,
shouldn’t we write something like
“If numbers are four or more digits long and three or more digits
change, use all digits” ?

Yes, this is the reason why I asked in the first place. I should
probably have mentioned that.

For now, in citeproc-py, if the number of common digits between the start
and end page numbers is less than two, it uses the expanded form.

12345-468
12345-13576
123456-5614

I’m not sure which I prefer. As long as it’s clearly defined, I’m happy
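
The rule as described above can be sketched as follows. This is an illustration of the description only, not the actual citeproc-py code. The function name is made up, and the inputs in the usage lines (12345-12468, 12345-13576, 123456-125614) are inferred from the three example outputs.

```python
def abbreviate_page_range(first: int, last: int, min_common: int = 2) -> str:
    """Drop the shared leading digits of `last`, unless fewer than
    `min_common` leading digits are shared (then use the expanded form)."""
    a, b = str(first), str(last)
    if a == b:
        return a  # not a range at all
    if len(a) != len(b):
        return f"{a}-{b}"  # different lengths: nothing to abbreviate
    common = 0
    while a[common] == b[common]:  # terminates because a != b
        common += 1
    if common < min_common:
        return f"{a}-{b}"  # expanded form
    return f"{a}-{b[common:]}"


print(abbreviate_page_range(12345, 12468))    # 12345-468
print(abbreviate_page_range(12345, 13576))    # 12345-13576
print(abbreviate_page_range(123456, 125614))  # 123456-5614
```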

Cheers,
Brecht

Brecht_Machiels1 · August 11, 2013, 11:58am

Hello,

Thank you, Rintze, for the clarifications and pointers to relevant
information. These should prove helpful.

Brecht

Sebastian_Karcher · August 17, 2013, 7:17pm

Sorry, I was on vacation.

In the specification,
shouldn’t we write something like
“If numbers are four or more digits long and three or more digits
change, use all digits” ?

Yes, this is the reason why I asked in the first place. I should
probably have mentioned that.

For now, in citeproc-py, if the number of common digits between the start
and end page numbers is less than two, it uses the expanded form.

12345-468
12345-13576
123456-5614

I’m not sure which I prefer. As long as it’s clearly defined, I’m happy

The current CSL spec (and thus Brecht’s implementation in -py) is incorrect
according to the current CMoS.
Here are three examples from the relevant chapter (9.60)
1496–500
11564–615
12991–3001

i.e. even when only one digit stays the same, only the changing digits are
displayed after the en-dash.
Since we call this rule “Chicago” we should change this in the specs (and
implementers should change this accordingly).
According to the manual, these rules have never changed, so we must have
gotten that wrong at some point. Sorry for never catching that.
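
A minimal sketch of the behavior described in this thread, for anyone who wants to check an implementation against the examples above. It is not taken from any CSL processor, and the function name is made up. It combines the three CMoS examples with the rules mentioned earlier (two digits unless more are needed, all digits for multiples of one hundred, changed part only for the first nine numbers after a multiple of one hundred). Writing numbers below 100 in full is the CMoS convention as I understand it and was not discussed here.

```python
EN_DASH = "\u2013"


def chicago_page_range(first: int, last: int) -> str:
    a, b = str(first), str(last)
    # Written in full: numbers below 100 (assumption, not discussed in the
    # thread), multiples of 100, and ends with more digits than the start.
    if first < 100 or first % 100 == 0 or len(a) != len(b):
        return f"{a}{EN_DASH}{b}"
    # Count the leading digits shared by both numbers.
    common = 0
    while common < len(a) and a[common] == b[common]:
        common += 1
    changed = b[common:]
    # x01 through x09: changed part only; otherwise at least two digits.
    min_digits = 1 if first % 100 < 10 else 2
    keep = max(len(changed), min_digits)
    return f"{a}{EN_DASH}{b[-keep:]}"


print(chicago_page_range(1496, 1500))    # 1496–500
print(chicago_page_range(11564, 11615))  # 11564–615
print(chicago_page_range(12991, 13001))  # 12991–3001
```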
