Test suite input format

I’m about to turn to numbers and pinpoints in citeproc-js. As Simon
pointed out earlier, pinpoints (and other client-supplied details)
should be carried on a separate input object. To set that up, I think
I need to add another section to the test data for citation-type
tests, with something like this:

=== CITES ===>>
[
    {
        "source": "ITEM-1",
        "locator": "1, 2, 3",
        "locatorType": "page"
    },
    {
        "source": "ITEM-2",
        "locator": "27",
        "locatorType": "section"
    }
]
<<=== CITES ===<<

This input would drive all citation-type tests, relying on items in
the INPUT section for bibliographic data. Refactoring would be
required in the test suite, and the test frameworks of clients would
need to be adjusted. It’s a tedious change to implement, but better
to do it now than later, I think.
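As a sketch of how a test harness might pull this section out of a fixture file (the helper name and marker regex here are illustrative, not part of citeproc-js):

```javascript
// Hypothetical sketch: extract and parse the proposed CITES section
// from a test fixture string, using the markers shown above.
function parseCitesSection(fixture) {
    // Capture everything between the opening and closing markers.
    var match = fixture.match(/=== CITES ===>>([\s\S]*?)<<=== CITES ===/);
    if (!match) {
        return null; // fixture has no CITES section
    }
    return JSON.parse(match[1]);
}

var fixture = [
    "=== CITES ===>>",
    "[",
    "  { \"source\": \"ITEM-1\", \"locator\": \"1, 2, 3\", \"locatorType\": \"page\" }",
    "]",
    "<<=== CITES ===<<"
].join("\n");

var cites = parseCitesSection(fixture);
```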

I won’t check anything in on this until the weekend, to allow time for
comments. I’ll also hold off on moving the test suite to ./csl until
the format has been settled, and the tests have been adjusted.

Frank

Doesn’t it make more sense to do “page”:“23”? Seems simpler, and more
flexible.

How would text declarations find the correct input field in that case?

<text variable="locator"/>

Just write the mapping? Am just proposing a shortcut that uses the
locator type as the key. Your code can be responsible for knowing
they’re all locators.

The problem with your approach is it doesn’t leave room for multiple locators.
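Such a shortcut could be sketched like this (the list of locator terms and the helper name are hypothetical):

```javascript
// Hypothetical sketch of the mapping shortcut: any key drawn from a
// known set of locator terms is treated as the "locator" variable,
// with the key itself supplying the locator type.
var LOCATOR_TERMS = ["page", "section", "chapter", "paragraph", "note"];

function normalizeCite(cite) {
    var normalized = { source: cite.source };
    for (var key in cite) {
        if (LOCATOR_TERMS.indexOf(key) > -1) {
            normalized.locator = cite[key];
            normalized.locatorType = key;
        } else if (key !== "source") {
            normalized[key] = cite[key]; // pass other metadata through
        }
    }
    return normalized;
}

var flat = normalizeCite({ "source": "ITEM-2", "section": "27" });
// flat → { source: "ITEM-2", locator: "27", locatorType: "section" }
```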

Bruce

… which might be handy. The cite object will also need to carry
cite and note sequence numbers. To avoid stepping through all of the
possibilities to sort out what is a locator and what is other
metadata, could we do this:

{ "id": "ITEM-1",
  "locator": [
    { "page": "23" }
  ],
  "note-number": "7"
}

?

Sure (but I’d use the plural on the locators key, since it’s an array).

Bruce

I gave this some more thought during the commute, and I’m unsure
whether it’s a good idea to break the data down in this way. In
principle it’s the right thing, but it might create as many problems
as it solves. Two factors lie behind that thought.

First, a simple array of objects won’t be enough to capture the full
structure of complex locators. There will be things like this:

pp. 23 n. 4, 15-20.
p. 101, sec. 10 p. 9.

To get the joins right in rendering, you would need (at least) a
double-nested array.
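For instance, a double-nested shape for “pp. 23 n. 4, 15-20” might look like this (a hypothetical layout, not a settled format), with the outer array holding the comma-joined groups and each inner array holding a locator plus its sub-locators:

```javascript
// Hypothetical double-nested structure for "pp. 23 n. 4, 15-20".
var locator = [
    [ { "page": "23" }, { "note": "4" } ],  // "23 n. 4"
    [ { "page": "15-20" } ]                 // "15-20"
];
```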

That leads to the thought that, possibly, structured delivery of
locator details may be too much to demand of the calling application.
Some, perhaps all applications will have the user entering this data
as a text string. If that’s the case, the parsing will be
complicated, and if the parsing is reimplemented across different
applications, they will start coming up with different results, out of
sight of the processor. If we’re moving toward a world in which users
of Zotero and Mendeley, using different CSL processors, collaborate in
the editing of shared documents, it seems like that situation could
very quickly lead to user complaints, finger pointing and FUD.

I would certainly prefer to receive the locator in a structured form,
with all of the hints needed to render it correctly. But I think we
need to hear from the UI and delivery side (Simon, Dan, Andrea,
Steve?) before settling on a cross-processor input format for this
part of the data. (I’ve been afflicted by needless doubts before, and
I would be delighted to be told that this is one of them; but I’m
sending this mail just in case.)

Frank

I understand your concern, but the locators have to be structured or
there’s no way to test them, and hence ensure consistency across
implementations.

If users are inputting this data as dumb strings, they’ll need to be
parsed somewhere in the chain. All I’m suggesting is that the parsing
could be tested in the processors, to reduce the amount of code
replication. If processors require structured input of locators, the
problem is just pushed back to the calling application (including web
server deployments developed ad hoc). Many applications will have
less rigorous testing frameworks than the processors do.

If it’s to be structured in JSON, we’ll need to specify how to
represent ranges and sub-locators. I’ve attached a sample, but I
still think we should have feedback from dependent projects that would
be affected by such a scheme. You could accomplish the same thing by
adopting Chicago Manual abbreviations as a standard input format, and
parsing out the strings, failing over to literal presentation of the
dumb string if the user gets it wrong. I think that would be
adequate, since this is not core bibliographic data, and I think it
might make it easier to keep everyone on the same page, as it were. A
thought, anyway.
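The fallback scheme could be sketched roughly as follows (the abbreviation table is a small illustrative subset, and the helper name is invented):

```javascript
// Hypothetical sketch of the fallback scheme: try to parse a
// Chicago-style locator string into structured parts, and fail over
// to carrying the dumb string literally if parsing fails.
var ABBREVS = { "p.": "page", "pp.": "page", "sec.": "section", "n.": "note" };

function parseLocatorString(raw) {
    var parts = [];
    var tokens = raw.split(/\s+/);
    for (var i = 0; i < tokens.length; i += 2) {
        var type = ABBREVS[tokens[i]];
        var value = tokens[i + 1];
        if (!type || !value) {
            return { literal: raw }; // fail over to the dumb string
        }
        var entry = {};
        entry[type] = value.replace(/,$/, ""); // strip trailing comma
        parts.push(entry);
    }
    return { locator: parts };
}

parseLocatorString("p. 101, sec. 10");
// → { locator: [ { page: "101" }, { section: "10" } ] }
parseLocatorString("sig. A4v");
// → { literal: "sig. A4v" } (unknown abbreviation: render as-is)
```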

(In the sample I’ve kept “locator” as the field name, for consistency
with the CSL variable name.)

Frank

Oops. Attachment.

locators.txt (713 Bytes)

But I’m saying a) it’d be a (really) bad decision to have users
inputting these data as dumb strings, and b) if that’s a developer’s
choice, so be it; I just don’t want to encourage or validate that.

Bruce

I agree with Bruce that semantically meaningful locators are
desirable, but I would also note that it’s absolutely essential to be
able to pass a dumb string. When you’re dealing with early modern
sources, for example, “pagination” is often extremely idiosyncratic.
Not that this should be a problem at all.

Sean

We can accept both, and the input scheme should certainly be amenable
to structured input. Any thoughts on the JSON sample? (attached again
for convenience)

locators.txt (713 Bytes)

I like it, except that this may be overkill:

{
    "source": "ITEM-3",
    "note-number": "300",
    "locator" : [
        [
            { "page" : [ "10", "15" ] }
        ]
    ]
},

The above becomes: “pp. 10-15”

I’d say just treat it as a string, and print as is.

{
    "source": "ITEM-3",
    "note-number": "300",
    "locator" : [
        [
            { "page" : "10-15" }
        ]
    ]
},

The only processing that might be nice-to-have (though hardly
necessary) for some users is to replace the hyphen with an en-dash.
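That nicety is nearly a one-liner; a minimal sketch, assuming the en-dash should only replace hyphens that sit between digits:

```javascript
// Replace a hyphen between two digits with an en-dash (U+2013),
// leaving other hyphens (e.g. in "A-4" style signatures) alone.
function enDashRange(value) {
    return value.replace(/(\d)\s*-\s*(\d)/g, "$1\u2013$2");
}

enDashRange("10-15");  // → "10–15"
enDashRange("A-4");    // unchanged: not a digit range
```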

I have another question about whether the “locator” (“locators”?) key
should be an array. This presumes that a processor prints it as it
gets it. Do we want that, or is it better that we define the order in
the spec?

Am agnostic I suppose.

Bruce

Music to my ears. After sending that, I realized I hadn’t
distinguished ranges from sequence elements, and the thought of
setting ranges as a hash with “start” and “end” elements seemed
really too much.

As for ordering: the hierarchy of things like “chapter” and “section”
might vary from source to source, so it might be safer to just render
things in sequence.

I’m going to have a go at implementing ranged collapses for
citation-number cites and Rintze’s “Jones 2008a-c” example. Once
that’s working, I’ll pick up any further comments, set up locator
input in the test files, and move them to csl.
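For reference, a ranged collapse of the citation-number kind can be sketched generically like this (an illustration of the technique, not the citeproc-js implementation):

```javascript
// Generic sketch of a ranged collapse for citation-number cites:
// consecutive runs of three or more numbers become "first-last",
// shorter runs are printed individually.
function collapseNumbers(nums) {
    var out = [];
    var i = 0;
    while (i < nums.length) {
        var j = i;
        // Extend j to the end of the consecutive run starting at i.
        while (j + 1 < nums.length && nums[j + 1] === nums[j] + 1) {
            j += 1;
        }
        if (j - i >= 2) {
            out.push(nums[i] + "-" + nums[j]); // run of 3+: collapse
            i = j + 1;
        } else {
            out.push(String(nums[i]));
            i += 1;
        }
    }
    return out.join(", ");
}

collapseNumbers([1, 2, 3, 5, 6, 8]);  // → "1-3, 5, 6, 8"
```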