Continuing development of citeproc-rb

All,
As I’ve already discussed with Bruce and Frank, I’m going to continue the
development of citeproc-rb. Having a good citeproc implementation in Ruby is
important to me as I’m working on a Ruby-based citation management tool, and
having perfectly accurate citations is a must. It’s the next version of
SourceAid’s citation builder. The current version is at
http://sourceaid.com/citationbuilder (not a very nice version, I know), and
the next version will be hosted on http://cite.me (for a preview, go to
http://draft.cite.me, a much more polished update).

But I’m not here to plug my own work; I’m here to talk about making a CSL
processor. There are a few challenges/questions/thoughts I have:

1. Documentation on CSL itself

Most of the documentation I’ve seen for CSL is on the Zotero website:
http://www.zotero.org/support/dev/csl_syntax_summary

Where is the most up-to-date spec for CSL? Is the schema really the
definitive spec?

Most importantly, understanding what citation types (book, chapter, article,
etc) CSL expects, as well as the field names, is crucial to getting this
thing working in any capacity. I got citeproc-rb’s mapping from Zotero a
couple months ago, but I’m not sure how accurate it was, so hopefully this
is covered by some spec.
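
A minimal sketch of what such a mapping might look like, assuming Zotero-style names on the application side. The type and field names on both sides are illustrative assumptions, not taken from any spec, and would need checking against whatever turns out to be authoritative.

```python
# Hypothetical mapping from an application's item types and field names
# to CSL types and variables. All names here are illustrative and should
# be verified against the CSL schema before being relied on.
TYPE_MAP = {
    "journalArticle": "article-journal",
    "bookSection": "chapter",
    "book": "book",
}

FIELD_MAP = {
    "title": "title",
    "publicationTitle": "container-title",
    "publisher": "publisher",
    "place": "publisher-place",
    "pages": "page",
}


def to_csl_item(record):
    """Translate one application record into a CSL-style dict.

    Unknown types raise an error; unmapped fields are returned in a
    sorted list so a mapping gap is visible rather than silently dropped.
    """
    source_type = record["type"]
    if source_type not in TYPE_MAP:
        raise ValueError(f"no CSL type mapped for {source_type!r}")
    item = {"type": TYPE_MAP[source_type]}
    unmapped = []
    for field, value in record.items():
        if field == "type":
            continue
        if field in FIELD_MAP:
            item[FIELD_MAP[field]] = value
        else:
            unmapped.append(field)
    return item, sorted(unmapped)
```

Returning the unmapped fields alongside the item makes it easier to tell a formatting bug in a style apart from data that never reached the processor.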

2. Testing for CSL

The tests included in citeproc-js are consumable by any CSL processor, which
is great. Does anyone have any idea how complete they are?

3. CSL file correctness

Are there any known issues with the CSL files that are provided in xbiblio,
and the citeproc implementations? Taking mla.csl and comparing it against
the most recent MLA edition, there are a ton of formatting issues and errors
(I tested this out with citeproc-js). I’m not sure if I’m feeding citeproc
an incorrect set of types/fields, so again info on that would help.

4. Testing for the CSL files themselves

Along the same lines of CSL correctness … while there is a substantial
amount of tests for a CSL processor, there don’t seem to be any tests for the
individual CSL files that are part of xbiblio. How do we know the mla.csl
file conforms with the MLA handbook? I have tests for MLA, APA, CSE, and
CMS, from cite.me’s current style formatter; MLA is the most complete and
I’m finishing up the other three. They are style-formatter-implementation
agnostic, as they are just JSON for the data + HTML for what the style
output is matched against. Is there any interest in having tests for the CSL
files themselves?
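
A rough sketch of how such implementation-agnostic fixtures could be run, assuming each fixture is a JSON document with the input data and the expected HTML. The key names ("style", "items", "expected") and the formatter signature are hypothetical, chosen only for illustration.

```python
import json


def run_fixture(fixture_json, format_bibliography):
    """Run one fixture against any formatter callable.

    `fixture_json` is a JSON string with the hypothetical keys "style",
    "items" and "expected". `format_bibliography(style, items)` is whatever
    processor is under test and must return an HTML string.
    """
    fixture = json.loads(fixture_json)
    actual = format_bibliography(fixture["style"], fixture["items"])

    # Collapse whitespace runs so line wrapping inside the fixture
    # does not cause false failures.
    def normalize(html):
        return " ".join(html.split())

    passed = normalize(actual) == normalize(fixture["expected"])
    return passed, actual
```

Because the formatter is passed in as a plain callable, the same fixtures could in principle be pointed at any processor that can be wrapped in a function.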

5. Is it OK to have so many differing implementations of citeproc?

More of a philosophical question, but my initial reaction to needing a Ruby
CSL processor was to find the most compatible one and try to use it from
Ruby. This could have very well worked with citeproc-js, if I knew how to
use it (basically the whole fields/types mapping problem). Should the Python
and Ruby versions just be direct ports of the JavaScript version, so we can
ensure compatibility? Sure, having the same test suite is a step in the
right direction, but compatibility won’t be acceptable for the three
citeproc implementations for a while.

I’ve actually played around with using citeproc-js as just an in-the-browser
formatter for the website, which makes it really simple to use, but I also
need bibliography generation to be available from a web-service, so having
some way to call it directly from Ruby is a must as well.

So do we spread ourselves thin working on multiple implementations in
different languages, or all focus on one implementation, but make it really
easy to use from the command-line so people can shell out to it? Ports to
different languages can happen later if needed. Obviously I think the latter
is a good idea, but without a good Ruby->JavaScript bridge this isn’t ideal
… though it would work.
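
To make the shell-out idea concrete, here is a minimal sketch of calling an external processor from another language by passing JSON on stdin and reading the formatted output from stdout. It is written in Python purely for illustration; the command-line wrapper and its JSON payload shape are assumptions, since no such tool exists yet.

```python
import json
import subprocess


def format_via_cli(command, style, items, timeout=30):
    """Send a style name and items to an external processor; return its stdout.

    `command` is the argument list for a hypothetical command-line wrapper
    around a CSL processor. The payload keys are likewise hypothetical.
    """
    payload = json.dumps({"style": style, "items": items})
    result = subprocess.run(
        command,
        input=payload,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

The cost of this approach is one process start per call, so a web service would probably want to batch items or keep a long-running worker instead.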

Anyway, that’s what is on my mind at the moment, but I’m sure I’ll be asking
more questions very soon.

~Jimmy

Jimmy,

This is all really very interesting. Many of your questions are best
responded to by Bruce or Rintze Zelle. I’ll just respond below to a
few items as far as they relate to citeproc-js.

  2. Testing for CSL

The tests included in citeproc-js are consumable by any CSL processor, which
is great. Does anyone have any idea how complete they are?

It’s hard to say exactly. After the first push to settle the test
format and get the test harness working, I tried to cover at least the
obvious potential gotchas of new functionality as it was built. But
there hasn’t been a systematic effort to cover all edge cases of every
attribute and element, and many of the tests are more complex and less
well-focused than they might be. We could probably do with double or
triple the number of fixtures.

That said, the existing fixtures provide more than a head start of
coverage for basic functionality. Names and disambiguation are
covered pretty well, I think, and the tests at least touch on most
elements and options. The specific markup for some of the special
effects options on bibliographies hasn’t been settled yet, but that
will be sorted out in short order as soon as citeproc-js starts moving
toward deployment.

  3. CSL file correctness
    Are there any known issues with the CSL files that are provided in xbiblio,
    and the citeproc implementations? Taking mla.csl and comparing it against
    the most recent MLA edition, there are a ton of formatting issues and errors
    (I tested this out with citeproc-js). I’m not sure if I’m feeding citeproc
    an incorrect set of types/fields, so again info on that would help.

That’s not super surprising, since citeproc-js hasn’t been used in
production yet. It should be getting close to serviceable though. I’m
keen to iron out any kinks, and I would love to get access to your
test data.

  4. Testing for the CSL files themselves
    Along the same lines of CSL correctness … while there is a substantial
    amount of tests for a CSL processor, there don’t seem to be any tests for the
    individual CSL files that are part of xbiblio. How do we know the mla.csl
    file conforms with the MLA handbook? I have tests for MLA, APA, CSE, and
    CMS, from cite.me’s current style formatter; MLA is the most complete and
    I’m finishing up the other three. They are style-formatter-implementation
    agnostic, as they are just JSON for the data + HTML for what the style
    output is matched against. Is there any interest in having tests for the CSL
    files themselves?

This would be extremely interesting and helpful, if you are able and
willing to share the test data. I’ll be happy to help out with
mapping issues and the like, to the extent that I can make a useful
contribution.

  5. Is it OK to have so many differing implementations of citeproc?
    More of a philosophical question, but my initial reaction to needing a Ruby
    CSL processor was to find the most compatible one and try to use it from
    Ruby. This could have very well worked with citeproc-js, if I knew how to
    use it (basically the whole fields/types mapping problem). Should the Python
    and Ruby versions just be direct ports of the JavaScript version, so we can
    ensure compatibility?

Bruce and I talked about this once. Basically citeproc-js works in
the stages that you would expect: (1) slurp data; (2) compile runnable
version of the style; (3) compose pool of bib references; (4) accept
data and spit out structured data with formatting hints; (5) grind
structured data into a pretty string. Stage (2) in citeproc-js is, I
now think, an ugly piece of work that will one day be rewritten. It
works just fine, but it unnecessarily transforms the XML nodes into a
flat list of executable tokens, and that complexity is a burden that
you probably wouldn’t want to port through to another language.
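
A toy sketch of stages (2), (4) and (5) as described above, to make the shape of the pipeline concrete. This is not citeproc-js code: the function names, the (variable, format) template and the two format names are all invented for illustration, and the style is "compiled" from a plain list rather than from CSL XML.

```python
def compile_style(template):
    """Stage 2 (toy): turn (variable, format) pairs into a flat list of callables.

    Each callable takes an item dict and returns (text, format), or None
    when the item has no value for that variable.
    """
    def make_token(variable, fmt):
        def token(item):
            value = item.get(variable)
            return (value, fmt) if value else None
        return token

    return [make_token(variable, fmt) for variable, fmt in template]


def build_output(tokens, item):
    """Stage 4 (toy): produce structured output with formatting hints."""
    chunks = (token(item) for token in tokens)
    return [chunk for chunk in chunks if chunk is not None]


def render_html(chunks, delimiter=". "):
    """Stage 5 (toy): grind the structured output into a string."""
    wrappers = {"italic": "<i>{}</i>", "plain": "{}"}
    return delimiter.join(wrappers[fmt].format(text) for text, fmt in chunks)


tokens = compile_style([("author", "plain"), ("title", "italic"), ("publisher", "plain")])
chunks = build_output(tokens, {"author": "Doe, Jane", "title": "A Title"})
print(render_html(chunks))  # Doe, Jane. <i>A Title</i>
```

Keeping the formatting hints separate from the final string is what would let the same structured output be rendered to HTML, plain text or another target in the last stage.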

Of the other bits, the “registry” and calculation of disambiguation
rules is not the prettiest code ever written, but it is very solid
(touch wood) and reasonably compact. Certainly something there to
work from, and a wheel that you don’t want to be reinventing. The
output queue stuff is also pretty good for structure, I think. My
coding habits may have you reaching for the airsick bag, but it
shouldn’t be impossible to follow (he said).

Sure, having the same test suite is a step in the
right direction, but compatibility won’t be acceptable for the three
citeproc implementations for a while.
I’ve actually played around with using citeproc-js as just an in-the-browser
formatter for the website, which makes it really simple to use, but I also
need bibliography generation to be available from a web-service, so having
some way to call it directly from Ruby is a must as well.
So do we spread ourselves thin working on multiple implementations in
different languages, or all focus on one implementation, but make it really
easy to use from the command-line so people can shell out to it? Ports to
different languages can happen later if needed. Obviously I think the latter
is a good idea, but without a good Ruby->JavaScript bridge this isn’t ideal
… though it would work.

I’ve come to think of citeproc-js as a Swiss watch, in which the
burden of adding each new piece to the mechanism rises sharply as one
begins to approach (what one thought was) completion. Large sections
of the code have been rewritten several times. Andrea Rossato, who’s
written the Haskell implementation, has had similar experiences.
There’s very little room for design error, and the cost in duplicated
effort is very real.

There is virtue in having at least some variety in the stable, though.
JS is very suitable for running in the browser, but it’s not the
fastest language on the block at present. Also, at this early phase
in the development of CSL, there are still big gains to be had from
wholesale reimplementation. Reasonable minds might differ on the
question of whether multiple implementations are sustainable, but if
they are, they can get a little extra lift from cross-pollination.

Frank

1. Documentation on CSL itself

Most of the documentation I’ve seen for CSL is on the Zotero website:
http://www.zotero.org/support/dev/csl_syntax_summary

Where is the most up-to-date spec for CSL? Is the schema really the
definitive spec?

The documentation from the Zotero page has been used as the basis for what
should become the official CSL specification, which can be found at
http://xbiblio.svn.sourceforge.net/viewvc/xbiblio/csl/doc/specification.mdml?view=markup.
I already tried to fill in some gaps, and the spec should be reasonably
complete for CSL 0.8 (
http://xbiblio.svn.sourceforge.net/viewvc/xbiblio/csl/schema/tags/0.8/,
version 0.8 is close to what’s currently supported by Zotero). When the last
few tickets are closed (Bruce hopes to push for CSL 1.0 this month), the
spec can be updated (and receive some polish).

Most importantly, understanding what citation types (book, chapter, article,
etc) CSL expects, as well as the field names, is crucial to getting this
thing working in any capacity. I got citeproc-rb’s mapping from Zotero a
couple months ago, but I’m not sure how accurate it was, so hopefully this
is covered by some spec.

The most current version of the schema can be found at
http://xbiblio.svn.sourceforge.net/viewvc/xbiblio/csl/schema/branches/split/
There, csl-core.rnc contains most of the schema logic; csl-types.rnc and
csl-variables.rnc are the other two parts of the schema and might contain
what you’re looking for.

3. CSL file correctness

Are there any known issues with the CSL files that are provided in xbiblio,
and the citeproc implementations? Taking mla.csl and comparing it against
the most recent MLA edition, there are a ton of formatting issues and errors
(I tested this out with citeproc-js). I’m not sure if I’m feeding citeproc
an incorrect set of types/fields, so again info on that would help.

The Zotero Style Repository contains the most up-to-date styles (
www.zotero.org/styles). I wouldn’t trust the xbiblio-styles (they’re
probably quite out of date).

Rintze

I’ve actually played around with using citeproc-js as just an in-the-browser
formatter for the website, which makes it really simple to use, but I also
need bibliography generation to be available from a web-service, so having
some way to call it directly from Ruby is a must as well.

I hope they don’t mind me saying this, but Zotero is working on
wrapping citeproc-js into a web service. I’ve thought about doing the
same thing using Rhino on GAE (for use possibly in Wave), but doubt
I’ll have the time or skill to do this anytime soon.

But …

So do we spread ourselves thin working on multiple implementations in
different languages, or all focus on one implementation, but make it really
easy to use from the command-line so people can shell out to it? Ports to
different languages can happen later if needed. Obviously I think the latter
is a good idea, but without a good Ruby->JavaScript bridge this isn’t ideal
… though it would work.

As Frank says, CSL is pretty young still, and citation processing is
deceptively complex.

So I think it makes sense to experiment with different approaches and
languages. In Python efforts, for example, we’ve gone down a few
different design paths, with my current approach quite different than
Johan’s (and Frank’s).

Ideally we end up with at least one that is really fast,
feature-complete, easy-to-extend and debug, and accessible from
different languages.

But that will take a while, and I do think it makes sense to wrap
Frank’s work in a web service ASAP to give people options now.

Bruce

Ideally we end up with at least one that is really fast,
feature-complete, easy-to-extend and debug, and accessible from
different languages.

BTW, don’t forget Andrea’s Haskell-based version. It might present
some issues with deployment, but it’s fairly complete, and really fast
(Haskell has performance that comes close to C).

Bruce