Sample layout for sample data

As a follow-up to Bruce’s suggestion of standard test suites, I’m
attaching a proposal for a file hierarchy and layouts for item data
and attribute/element tests. In the sample test files, I’ve meant to
model the item keys on what’s used by citeproc-hs, but I haven’t used
the tool itself, so please correct me if I’ve gotten anything wrong.

A processor undergoing testing would need to parse these files and
compose them into its own test-suite syntax. People setting tests for
the specification shouldn’t need to worry about any of that, so I’ve
tried to keep the syntax to a minimum.

It should be obvious how the files would be parsed. If it’s not, the
layout can be changed in any way people feel comfortable with. I’m
hoping this will be easily accessible; contributions from Bruce and
Elena and others active in the design of the language would be of
tremendous value.

Frank

This is a good idea (and I believe we’ve discussed it before, but
never implemented it), but perhaps we’d be better off with JSON or
XML? Any CSL parser must have some XML processing abilities anyway,
and a JSON implementation exists for nearly all modern programming
languages (see http://www.json.org/). Personally, I’d go for JSON,
since it is a very simple spec, it is extremely easy to parse into a
data structure with existing libraries, and it is designed to
structure precisely the kind of data you have here.

Simon

This is a good idea (and I believe we’ve discussed it before, but
never implemented it), but perhaps we’d be better off with JSON or
XML? Any CSL parser must have some XML processing abilities anyway,
and a JSON implementation exists for nearly all modern programming
languages (see http://www.json.org/). Personally, I’d go for JSON,
since it is a very simple spec, it is extremely easy to parse into a
data structure with existing libraries, and it is designed to
structure precisely the kind of data you have here.

JSON requires less typing. I’m easy, but I guess on balance I’d
favour that. I’ll wait a couple of days for guidance before doing
anything. Would the Haskell/Pandoc syntax for the keys be alright, or
should that also be in JSON? With multiple keys, the locator label,
and the locator text, it could get kind of verbose.
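
If it did all go into JSON, I suppose a cite with locators would come
out something like this (the key names here are made up purely for
illustration, not a proposal):

"citation": [
  { "id": "ITEM-1", "label": "page", "locator": "23-24" },
  { "id": "ITEM-2", "label": "chapter", "locator": "3" }
]

Verbose, but at least uniform.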

I’ve attached a proposed JSON-ized version of your test data
(hopefully I haven’t made any typos). As you can see, the formatting
is very similar to what you have in mind, and the only vaguely
annoying thing is the escaping of the quotes. If we go with JSON, I’d
prefer using JSON instead of Haskell/Pandoc syntax for locators. Not
quite sure what the difference between a citation and a cluster is
here; can you elucidate?

Simon

README.txt (1.06 KB)

I’ve attached a proposed JSON-ized version of your test data (hopefully I
haven’t made any typos). As you can see, the formatting is very similar to
what you have in mind, and the only vaguely annoying thing is the escaping
of the quotes. If we go with JSON, I’d prefer using JSON instead of
Haskell/Pandoc syntax for locators. Not quite sure what the difference
between a citation and a cluster is here; can you elucidate?

That was a missed edit; I started with cluster, then went with
citation. Citation is fine. It threw me on my first trip through the
Zotero sources, because I kept thinking it must refer to a single
item. But in these JSON objects, the meaning is obvious.

I could definitely deal with this! That lurking fear that the light
at the end of the tunnel might resolve into the headlight of an
oncoming train has at last proven unfounded.

Frank

I’ve not really read this thread (yet), but I suggest JSON as well.
Liam and I already did some work on this; FYI:

@~/xbiblio# ls citeproc-rb/test/fixtures/
bibo_test_data.n3         bibo_test_zotero.rdf  locales
bibo_test_data.yaml       csl_test_data.json    styles
bibo_test_how_to_cook.n3  csl_test_data.yaml

As you’ll note, it’s a simple model closer to CSL’s, and similar to
what one of you (Simon?) posted (though contributors need to be
ordered, and hence are an array):

"c": {
  "type": "book",
  "dateIssued": "1994",
  "title": "The social organization of sexuality: Sexual practices in the United States",
  "publisher": "University of Chicago Press",
  "publisherPlace": "Chicago",
  "authors": [
    { "name": "Laumann, Edward O." },
    { "name": "Gagnon, John H." },
    { "name": "Michael, Robert T." },
    { "name": "Michaels, Stuart" }
  ]
}

Bruce

I’ve attached a proposed JSON-ized version of your test data (hopefully I
haven’t made any typos). As you can see, the formatting is very similar to
what you have in mind, and the only vaguely annoying thing is the escaping
of the quotes. If we go with JSON, I’d prefer using JSON instead of
Haskell/Pandoc syntax for locators. Not quite sure what the difference
between a citation and a cluster is here; can you elucidate?

Mmm, I would suggest a couple of small changes. The result field
should maybe be a single string rather than a list, in order to test
that cite joins are working as they ought (text attached; I’ve deleted
the “cluster” entry to avoid confusion).

Also, a snippet might be directed to the bibliography machinery, to
test that its options have the desired effect. I’ve added a “testof”
declaration to give an explicit indication of which CSL wrapper code
to use for the test. Most of the attributes and elements can be
tested against the citation engine, so that could be the default if
the declaration is left out.
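
To give a rough idea of the shape I have in mind, here is a sketch of
such an entry (the field names and content are illustrative only, not
lifted from the attachment):

{
  "testof": "bibliography",
  "items": [ "ITEM-1", "ITEM-2" ],
  "result": "Doe, Jane. 1994. First Book. Chicago.\nRoe, Richard. 2001. Second Book. London."
}

An entry with no “testof” key would run against the citation engine.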

Frank

README-3.txt (1.02 KB)

Looks fine to me, although I think the first name and last name should
be separated in order to test name parsing (also because otherwise
citations can’t be formatted properly). Handling names and et al.
correctly is pretty difficult, so it makes sense to have real test
cases for this. Additionally, I think it would make more sense to use
the same variable names as in the schema (i.e., “issued” instead of
"dateIssued", “publisher-place” instead of “publisherPlace”).

Simon

Looks fine to me, although I think the first name and last name should
be separated in order to test name parsing (also because otherwise
citations can’t be formatted properly). Handling names and et al.
correctly is pretty difficult, so it makes sense to have real test
cases for this.

That’d be fine, but then how would you deal with encoding non-Western
names (with different sort orders), organizational names, and such?

Additionally, I think it would make more sense to use
the same variable names as in the schema (i.e., “issued” instead of
“dateIssued”, “publisher-place” instead of “publisherPlace”).

Yes, I was thinking that as well.

Bruce

Looks fine to me, although I think the first name and last name should
be separated in order to test name parsing (also because otherwise
citations can’t be formatted properly). Handling names and et al.
correctly is pretty difficult, so it makes sense to have real test
cases for this.

That’d be fine, but then how would you deal with encoding non-Western
names (with different sort orders), organizational names, and such?

Does citeproc-hs expect different packaging for names from that
currently used in Zotero? If so, a middle-ground arrangement that can
be remangled into the input form used by either will do. Otherwise,
the testing samples should just track the field structure that the
tools will normally see. It’s like an exam; the more real-world tasks
the tools have to perform to pass the tests, the stronger they will be
when they’re deployed.

Samples and tests for non-Western names would be great to have.
Zotero doesn’t have a means of providing phonetic hints currently, so
tests of Chinese, Japanese and similar languages will just fail.
That defines the task: we’ll be looking for the least painful way of
getting them to pass eventually.

Additionally, I think it would make more sense to use
the same variable names as in the schema (i.e., “issued” instead of
“dateIssued”, “publisher-place” instead of “publisherPlace”).

Yes, I was thinking that as well.

The variable name and name formatting bits aside, I must say that the
sample item entries look great. The content of these five will be a
start: I can build test infrastructure for citeproc-js around them,
and then extend the code as the data is extended and test sets are
added. When you’re ready, let me know where they will be housed, and
I’ll pull them in from there.

Frank

You’re just referring to regular sorting in the bibliographic output? If
so, no, as far as I know, straight replacements won’t be correct, and
you really need a collation library. Mozilla provides one, and that’s
what we use in Zotero. (It’s currently based on the application locale
rather than bibliography locale, but there may only be a collation
available in Firefox for the locale of the current build. But even if
the collation isn’t for the same locale as the bibliography, it’s
probably still correct most of the time.)

Short of bundling a collation library with citeproc-js, though, I guess
you could go with string replacement, and implementers could swap in
locale-aware sorting as their environments allowed.

Dan

Looks fine to me, although I think the first name and last name should
be separated in order to test name parsing (also because otherwise
citations can’t be formatted properly). Handling names and et al.
correctly is pretty difficult, so it makes sense to have real test
cases for this.

That’d be fine, but then how would you deal with encoding non-Western
names (with different sort orders), organizational names, and such?

Does citeproc-hs expect different packaging for names from that
currently used in Zotero?

Let’s put it this way: while CSL is deliberately vague on these
details, I do not think the way things currently work in Zotero is
ideal.

If so, a middle-ground arrangement that can
be remangled into the input form used by either will do. Otherwise,
the testing samples should just track the field structure that the
tools will normally see. It’s like an exam; the more real-world tasks
the tools have to perform to pass the tests, the stronger they will be
when they’re deployed.

True. So we should have examples like “Mao Zedong” and “Jane Doe, III”
and “ACME, Inc.” and “Prince.”
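
Something like this, maybe (the field names are only a strawman; the
point is that each case needs its own treatment in both formatting and
sorting):

"authors": [
  { "last-name": "Mao", "first-name": "Zedong", "sort-order": "family-first" },
  { "last-name": "Doe", "first-name": "Jane", "suffix": "III" },
  { "literal": "ACME, Inc." },
  { "literal": "Prince" }
]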

Samples and tests for non-Western names would be great to have.
Zotero doesn’t have a means of providing phonetic hints currently, so
tests of Chinese, Japanese and similar languages will just fail.
That defines the task: we’ll be looking for the least painful way of
getting them to pass eventually.

Right.

Additionally, I think it would make more sense to use
the same variable names as in the schema (i.e., “issued” instead of
“dateIssued”, “publisher-place” instead of “publisherPlace”).

Yes, I was thinking that as well.

The variable name and name formatting bits aside, I must say that the
sample item entries look great. The content of these five will be a
start: I can build test infrastructure for citeproc-js around them,
and then extend the code as the data is extended and test sets are
added. When you’re ready, let me know where they will be housed, and
I’ll pull them in from there.

Well, I wonder if the fixtures (the data examples) might live in a
separate top-level directory so that it’s clear any implementation can
and should use that for testing? Am not sure.

Bruce

http://michael.susens-schurter.com/blog/2008/07/14/javascript-collation-fail/

Bruce

http://michael.susens-schurter.com/blog/2008/07/14/javascript-collation-fail/

Thanks for this. I just get a database connection error from that
link, but I guess from the URL that this is a report that JS collation
doesn’t work correctly? If so, I’ll just leave this and let someone
with more knowledge and experience pick it up further down the line.

Frank

No, you were still right: we should use localeCompare(). I’d never heard
of it, and it seems not to function properly in all browsers (e.g.,
Opera, which, as that page indicates, defaults to bitwise comparison),
but it’s in the ECMAScript 3 spec. The host environment provides the
sorting, so results would vary, but that’s better than nothing.

Unfortunately, Unicode collating is broken in Firefox on OS X
(https://bugzilla.mozilla.org/show_bug.cgi?id=255192), which means that
pasting this into the Firefox address bar:

javascript: var list = ["Arg", "Ärgerlich", "Arm", "Assistant", "Aßlar",
"Assoziation"]; function localesort(a, b) { return a.localeCompare(b); }
alert(list.sort(localesort))

returns “Arg,Arm,Assistant,Assoziation,Aßlar,Ärgerlich”, which is wrong.
This is a case, incidentally, where straight replacement would actually
be correct. It’s also something that the Unicode Collation Algorithm,
which provides rules for sorting Unicode characters in the absence of a
locale, would handle properly, since, according to Wikipedia, the UCA
"first looks only at letters stripped of any modifications or
diacritical marks".

So at least in Zotero we should probably do some replacements ourselves
for now on OS X, but localeCompare() should be a suitable default in
general.
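
For what it’s worth, the stopgap I have in mind would be something
roughly like this (a sketch only; the mapping is illustrative and would
need to be filled out for the characters we actually care about):

// Rough sketch of a replacement-based comparator for environments where
// localeCompare() misbehaves. The mapping is deliberately incomplete.
var replacements = { "Ä": "A", "ä": "a", "Ö": "O", "ö": "o",
                     "Ü": "U", "ü": "u", "ß": "ss" };
function stripForSort(str) {
  return str.replace(/[ÄäÖöÜüß]/g, function (ch) {
    return replacements[ch];
  });
}
function fallbackCompare(a, b) {
  var x = stripForSort(a), y = stripForSort(b);
  return x < y ? -1 : (x > y ? 1 : 0);
}
// e.g. ["Aßlar", "Arm", "Ärgerlich"].sort(fallbackCompare)

The real code would of course try localeCompare() first and only fall
back to something like this when the environment is known to get it
wrong.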