citeproc-rb status

Hi,

I’m interested in using the citeproc-rb module mentioned here:

http://wiki.services.openoffice.org/wiki/Bibliographic_Project’s_Developer_Page#Formating_Engine

and here:

http://sourceforge.net/mailarchive/message.php?msg_id=a5a941250d1b046b30ca9ef3d8632d9e%40gmail.com

My use case is exactly that mentioned in Bruce’s email, i.e. generating
formatted citations via CSL in a Ruby on Rails application. This module
looks a bit out-of-date, though it doesn’t look too difficult to bring in line
with the (more recent?) CSL schema from citeproc-0.7.1. I’m willing to look
at updating this, but I would appreciate some pointers if at all possible.
For example the citeproc-0.7.1/styles/apa-en.csl differs from
csl/styles/apa.csl quite markedly, and I’m not sure which represents the
newer or more definitive CSL schema. And I’m not sure if more recent work
has been underway on a Ruby port of citeproc… In any case, any suggestions
would be greatly appreciated.

Regards,

Liam.

Hi Liam,

My use case is exactly that mentioned in Bruce’s email, i.e. generating
formatted citations via CSL in a Ruby on Rails application. This module
looks a bit out-of-date, though it doesn’t look too difficult to bring in line
with the (more recent?) CSL schema from citeproc-0.7.1. I’m willing to look
at updating this, but I would appreciate some pointers if at all possible.

Sure thing.

For example the citeproc-0.7.1/styles/apa-en.csl differs from
csl/styles/apa.csl quite markedly, and I’m not sure which represents the
newer or more definitive CSL schema. And I’m not sure if more recent work
has been underway on a Ruby port of citeproc…

No, I’ve not touched it in a while. But I imagine not much fundamental
would change given its design. You’re just changing the CSL parsing
stuff to create the objects, and then finishing all the work I hadn’t
done :wink:

The current schema and apa example is here:

http://xbiblio.svn.sourceforge.net/viewvc/xbiblio/csl/schema/branches/

We really need to merge them back with the trunk and update the rest
of the styles.

So feel free to have a go at it and ask any questions as you go along.
I’d really love to have a nice Ruby library!

BTW, I’m not much of a hacker, but I did try to be pretty good about
writing unit tests.

Bruce

Oh, one other suggestion: I understand Ruby will finally be getting
proper unicode support sometime soon-ish. It’d be good to be able to
easily upgrade this library for that.

Bruce

Hmm … looking again at csl.rb, perhaps it’d need some rework.

Right now, the classes are:

Style
Definition
Template
Field

At a meta level, this isn’t quite wrong, but to make the link to CSL
more clear, it’d probably be better to think in terms of something
like:

Style
Context
Citation
Bibliography
Template
Macro
Field
TextField
NameField
DateField

This is just brainstorming; not rigorously thought out.
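
A rough Ruby skeleton of the brainstormed list might look like the following. Only the class names come from the list above; the nesting (Citation and Bibliography as kinds of Context, the three field types as kinds of Field) and all attributes and constructors are assumptions made for illustration, not the actual csl.rb design.

```ruby
# Hypothetical skeleton: class names follow the brainstormed list,
# while the inheritance and attributes are assumed for illustration.
module CSL
  # A single rendered element of a template.
  class Field
    attr_reader :name

    def initialize(name)
      @name = name
    end
  end

  class TextField < Field; end
  class NameField < Field; end
  class DateField < Field; end

  # A named, reusable group of fields.
  class Macro
    attr_reader :name, :fields

    def initialize(name, fields = [])
      @name = name
      @fields = fields
    end
  end

  # An ordered list of fields (or macros) to render.
  class Template
    attr_reader :fields

    def initialize(fields = [])
      @fields = fields
    end
  end

  # Shared behaviour for the two rendering contexts.
  class Context
    attr_reader :template

    def initialize(template)
      @template = template
    end
  end

  class Citation < Context; end
  class Bibliography < Context; end

  # Top-level object built from one style file.
  class Style
    attr_reader :citation, :bibliography, :macros

    def initialize(citation:, bibliography:, macros: [])
      @citation = citation
      @bibliography = bibliography
      @macros = macros
    end
  end
end
```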

Anyone else have any thoughts?

Bruce

Thanks, I’ll have a more detailed look at this soon. I’m new to citeproc,
but what I was thinking was:

  1. Restructuring csl.rb along the lines you suggest (i.e. build a class
    model which reflects the latest csl.rnc)
  2. Refactor citeproc.rb to parse CSL files and build the new model
  3. Update the reference metadata classes (not sure if this is necessary, but
    in running some tests these don’t look like they carry all the attributes of
    the updated CSL model)
  4. Optionally: provide some bridge between various likely input formats
    (BibTeX and DocBook?) and the internal metadata classes (something similar
    to the role the in-driver.xsl plays in citeproc proper)
  5. Optionally: generate a citeproc-rb gem for simplified deployment
  6. Update the test cases

This would be useful for the use cases I’m developing for internally. I’m
trying to load up the csl schema to get a better understanding of this,
after which I will probably have more questions… Any further suggestions
would be greatly appreciated.

Regards,

Liam.

Thanks, I’ll have a more detailed look at this soon. I’m new to citeproc,
but what I was thinking was:

  1. Restructuring csl.rb along the lines you suggest (i.e. build a class
    model which reflects the latest csl.rnc)
  2. Refactor citeproc.rb to parse CSL files and build the new model
  3. Update the reference metadata classes (not sure if this is necessary, but
    in running some tests these don’t look like they carry all the attributes of
    the updated CSL model)
  4. Optionally: provide some bridge between various likely input formats
    (BibTeX and DocBook?) and the internal metadata classes (something similar
    to the role the in-driver.xsl plays in citeproc proper)
  5. Optionally: generate a citeproc-rb gem for simplified deployment
  6. Update the test cases

Sounds like a great plan, though you’d probably want to update the
test cases a little earlier, right?

This would be useful for the use cases I’m developing for internally. I’m
trying to load up the csl schema to get a better understanding of this,
after which I will probably have more questions… Any further suggestions
would be greatly appreciated.

Simon has the most experience implementing a full non-XSLT processor.
You can always look at his JS code over at the Zotero SVN if curious.

BTW, if you’re interested in keeping the code here and need SVN
access, contact me off-list with your SF user name.

Bruce

I’ve started work on a basic Ruby parser of the newer CSL format, and now
have some further questions:

  1. The original Ruby library has a group of classes in citeproc.rb which
    look like they correspond to what in the citeproc XSL version is called an
    ‘intermediate representation’: classes like Reference, ReferenceList and so
    on. Would this be correct? If so, is there a standard reference for this
    representation of references, in RelaxNG or some other format?
  2. The CSL schema seems to have evolved considerably since the original Ruby
    library was developed for it. It seems like a parser would need to build a
    series of rules which are then applied by a formatting process, similar to
    the XSL version. This begs the question as to whether the XSL version should
    perhaps have a Ruby wrapper instead, although I don’t know how mature the
    Ruby XSLT libraries are. Has anyone on this list experimented with this
    approach?
  3. In line with how the XSL version works, I thought a Ruby version, if it
    went ahead, could include:

InputFilter (for MODS, DocBook and BibTeX) - converts to an internal
reference model
Parser - builds an internal representation of the CSL graph
Formatter - applies CSL rules to the reference model
OutputFilter (for XHTML and perhaps other formats) - outputs the format
results
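
To make the four stages concrete, here is a minimal sketch in plain Ruby. Everything in it is assumed for illustration: the module names, the Reference and Rule structures, and in particular the "parser", which returns a hard-coded rule list rather than reading a real CSL file. The BibTeX input is likewise assumed to be an already-tokenised hash.

```ruby
# Internal reference model (hypothetical, deliberately tiny).
Reference = Struct.new(:author, :year, :title, keyword_init: true)

# InputFilter: converts source-specific data to the internal model.
module BibtexInputFilter
  def self.call(fields)
    Reference.new(author: fields["author"], year: fields["year"], title: fields["title"])
  end
end

# Parser stand-in: a real one would read a CSL file; this returns a
# hard-coded rule list so the example stays self-contained.
module StubParser
  Rule = Struct.new(:field, :prefix, :suffix, :italic, keyword_init: true)

  def self.call
    [
      Rule.new(field: :author, prefix: "", suffix: " ", italic: false),
      Rule.new(field: :year, prefix: "(", suffix: "). ", italic: false),
      Rule.new(field: :title, prefix: "", suffix: ".", italic: true)
    ]
  end
end

# Formatter: applies the rules to a reference and produces
# format-neutral tokens, skipping fields with no value.
module Formatter
  Token = Struct.new(:text, :prefix, :suffix, :italic, keyword_init: true)

  def self.call(rules, reference)
    rules.filter_map do |rule|
      value = reference[rule.field]
      next if value.nil? || value.empty?

      Token.new(text: value, prefix: rule.prefix, suffix: rule.suffix, italic: rule.italic)
    end
  end
end

# OutputFilter: serialises the tokens as an XHTML fragment.
module XhtmlOutputFilter
  ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

  def self.escape(text)
    text.gsub(/[&<>"]/, ESCAPES)
  end

  def self.call(tokens)
    tokens.map do |t|
      body = escape(t.text)
      body = "<i>#{body}</i>" if t.italic
      "#{escape(t.prefix)}#{body}#{escape(t.suffix)}"
    end.join
  end
end

# Wires the four stages together.
def format_reference(fields)
  reference = BibtexInputFilter.call(fields)
  rules = StubParser.call
  XhtmlOutputFilter.call(Formatter.call(rules, reference))
end

puts format_reference("author" => "Doe, J.", "year" => "2007", "title" => "Styles & Citations")
# prints: Doe, J. (2007). <i>Styles &amp; Citations</i>.
```

Keeping the Formatter's output as neutral tokens means another output filter could be added without touching the rule-application step.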

Ruby idioms make the filters and parsers reasonably straightforward to
implement, but the Formatter probably would take some work - and this is
where XSL makes the most sense. I’m happy to look at building a ‘pure’
(non-XSL) Ruby version, but it is more work (as always) than I thought
initially - and probably beyond the scope of my selfish requirements. Is
there substantial justification for building a Ruby library, or would it be
preferable to contribute some help in other ways?

Regards,

Liam.

Liam Magee wrote:

I’ve started work on a basic Ruby parser of the newer CSL format, and
now have some further questions:

  1. The original Ruby library has a group of classes in citeproc.rb which
    look like they correspond to what in the citeproc XSL version is called
    an ‘intermediate representation’: classes like Reference, ReferenceList
    and so on. Would this be correct? If so, is there a standard reference
    for this representation of references, in RelaxNG or some other format?

No.

  2. The CSL schema seems to have evolved considerably since the original
    Ruby library was developed for it. It seems like a parser would need to
    build a series of rules which are then applied by a formatting process,
    similar to the XSL version. This begs the question as to whether the XSL
    version should perhaps have a Ruby wrapper instead, although I don’t
    know how mature the Ruby XSLT libraries are. Has anyone on this list
    experimented with this approach?

I’ve thought it should indeed be possible to do some of the heavy
lifting in Ruby (sorting/grouping, etc.), and then to perhaps pass some
parameters into an XSLT process.

The problems with that are:

  1. the current XSLT code is outdated, and is written in XSLT 2.0, so
    there would need to be work on that code

  2. one is then stuck with using XML as the data representation.

  3. another dependency (not pure Ruby)

  3. In line with how the XSL version works, I thought a Ruby version, if
    it went ahead, could include:

InputFilter (for MODS, DocBook and BibTeX) - converts to an internal
reference model
Parser - builds an internal representation of the CSL graph
Formatter - applies CSL rules to the reference model
OutputFilter (for XHTML and perhaps other formats) - outputs the format
results

Right.

Ruby idioms make the filters and parsers reasonably straightforward to
implement, but the Formatter probably would take some work - and this is
where XSL makes the most sense.

Have you taken a look at how Simon implemented the equivalent in JS?

https://www.zotero.org/trac/browser/extension/branches/1.0/chrome/content/zotero/xpcom/cite.js

I’m happy to look at building a ‘pure’
(non-XSL) Ruby version, but it is more work (as always) than I thought
initially - and probably beyond the scope of my selfish requirements. Is
there substantial justification for building a Ruby library, or would it
be preferable to contribute some help in other ways?

Well, up to you of course. See what you think about the above (probably
particularly Simon’s code).

Bruce

Thanks for the response.

I’ve looked at Simon’s code. This is an amazing bit of work in JavaScript.
There are some difficulties with adapting this for a Ruby library: the CSL
declarations look like they have been normalised in a relational database;
and the outputs are hard-coded to specific HTML and RTF tags. But this would
provide a very useful starting and reference point for a different
implementation.

I’ll continue down the path of an XSL-less Ruby variant a little further, to
the point of having a simplified CSL parser and formatter at least. What do
you recommend as a basis for a citations model - I take it both DocBook and
MODS supply a richer model than BibTeX?

Regards,

Liam.

I’ve looked at Simon’s code. This is an amazing bit of work in JavaScript.
There are some difficulties with adapting this for a Ruby library: the CSL
declarations look like they have been normalised in a relational database;
and the outputs are hard-coded to specific HTML and RTF tags.

Right, that’s not how I was doing it when I was working on
citeproc-rb. I prefer the idea of having methods like “to_xhtml” and
“to_rtf”.
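
As a rough illustration of that idea, the formatted result could be held as format-neutral runs, with each output format provided by its own method. The class names and the run structure below are assumptions for the sake of the example, not the original citeproc-rb code, and the RTF escaping only covers backslashes and braces (non-ASCII text is not handled).

```ruby
# One run of text with optional italic styling (hypothetical structure).
class FormattedRun
  attr_reader :text, :italic

  def initialize(text, italic: false)
    @text = text
    @italic = italic
  end

  # XHTML fragment for this run, with markup characters escaped.
  def to_xhtml
    escaped = text.gsub(/[&<>]/, "&" => "&amp;", "<" => "&lt;", ">" => "&gt;")
    italic ? "<i>#{escaped}</i>" : escaped
  end

  # RTF fragment for this run; backslashes and braces are escaped.
  def to_rtf
    escaped = text.gsub(/[\\{}]/) { |c| "\\#{c}" }
    italic ? "{\\i #{escaped}}" : escaped
  end
end

# A formatted citation is just an ordered list of runs.
class FormattedCitation
  def initialize(runs)
    @runs = runs
  end

  def to_xhtml
    @runs.map(&:to_xhtml).join
  end

  def to_rtf
    @runs.map(&:to_rtf).join
  end
end

citation = FormattedCitation.new([
  FormattedRun.new("Doe (2007). "),
  FormattedRun.new("A {Title}", italic: true)
])
puts citation.to_xhtml
puts citation.to_rtf
```

Adding a further output format then means adding one more method per class, rather than hard-coding tags into the formatting logic.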

But this would provide a very useful starting and reference point for a different
implementation.

Right, because he’s had to figure out all the tricky details that I
know are a PITA in XSLT!

I’ll continue down the path of an XSL-less Ruby variant a little further, to
the point of having a simplified CSL parser and formatter at least. What do
you recommend as a basis for a citations model - I take it both DocBook and
MODS supply a richer model than BibTeX?

You mean, you’re looking for good source data to work with? Or are you
asking how to model the class in your code?

Bruce

I’ll continue down the path of an XSL-less Ruby variant a little further, to
the point of having a simplified CSL parser and formatter at least. What do
you recommend as a basis for a citations model - I take it both DocBook and
MODS supply a richer model than BibTeX?

You mean, you’re looking for good source data to work with? Or are you
asking how to model the class in your code?

The latter. What I imagine is some reasonably complete generic citation Ruby
object model which supports at least most of the data structures of DocBook,
MODS and BibTeX citations. These could then be parsed into such an object
model, which would then be traversed to generate a formatted output.
However I’m not particularly familiar with the details of these (although I
have done some work with DocBook and BibTeX).

Regards,

Liam.

You mean, you’re looking for good source data to work with? Or are you
asking how to model the class in your code?

The latter. What I imagine is some reasonably complete generic citation Ruby
object model which supports at least most of the data structures of DocBook,
MODS and BibTeX citations. These could then be parsed into such an object
model, which would then be traversed to generate a formatted output.

Right, that’s the way to do it.

However I’m not particularly familiar with the details of these (although I
have done some work with DocBook and BibTeX).

I see two approaches on which to base it:

  1. the CSL model (which is more output oriented)

  2. the RDF ontology that some of us have been working on, and which
    has inspired a bit of the current class design.

http://bibliontology.com

We ought to be releasing a draft of that soon-ish, BTW.

Each approach has its advantages and disadvantages. What’s your preference?

I’m, I suppose, at this point rather famously NOT a fan of BibTeX; it’s
too limited for anything but the sciences.

MODS is much more flexible, but arguably too much so. I never thought
much of the DocBook model, and even when authoring my manuscripts in
DocBook, I always bypassed its bib support.

Per above, the current version of the XSLT code uses a particular
serialization of an RDF representation as its internal model.

Bruce

However I’m not particularly familiar with the details of these (although I
have done some work with DocBook and BibTeX).

I see two approaches on which to base it:

  1. the CSL model (which is more output oriented)

  2. the RDF ontology that some of us have been working on, and which
    has inspired a bit of the current class design.

http://bibliontology.com

We ought to be releasing a draft of that soon-ish, BTW.

Each approach has its advantages and disadvantages. What’s your
preference?

I like the idea of using the bibliontology - presumably by ‘disadvantages’
you mean the scope is somewhat larger than using the CSL model? This looks
like potentially two, related projects: one to model bibliontology in Ruby,
the other to apply a formatting language like CSL to it. What I was
describing as an InputFilter would then be something like a loss-less
(ideally) conversion utility from BibTeX, DocBook and other schemas to
bibliontology. This is probably non-trivial to establish a proper set of
test cases for, but worth considering (you get a fully fledged, largely
interoperable and semi-standardised(?) domain model for the trouble).
Another consideration would be using a Ruby RDF library to load and store
the model as RDF - though this is probably getting off-topic.

In reviewing CSL more thoroughly, another approach is to use ERB to develop
a CSL-like language (something like RCSL?). This by-passes the need for
parsing and emulating XSL-like transformations in code - but at the cost of
expressivity and compatibility with the existing CSL files (which it looks
like a lot of work has gone into). I’ve done a quick test of this, with a
simplified citation and ERB file - this is certainly a much easier path to
go down.
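
For reference, a quick test of that kind can be sketched with ERB from the Ruby standard library. The template text, the field names, and the Citation structure here are made up for illustration; this is not the actual test file, and it is not CSL.

```ruby
require "erb"

# Simplified citation record (hypothetical fields).
Citation = Struct.new(:authors, :year, :title, keyword_init: true)

# An ERB template standing in for a style definition.
STYLE = ERB.new(<<~TEMPLATE)
  <%= authors.join(", ") %> (<%= year %>). <em><%= ERB::Util.html_escape(title) %></em>.
TEMPLATE

# Renders one citation with the template above.
def render(citation)
  STYLE.result_with_hash(
    authors: citation.authors,
    year: citation.year,
    title: citation.title
  ).strip
end

citation = Citation.new(authors: ["Doe, J.", "Roe, R."], year: 2007, title: "Styles & Citations")
puts render(citation)
# prints: Doe, J., Roe, R. (2007). <em>Styles &amp; Citations</em>.
```

The style logic lives directly in Ruby template code, which is what makes this quick to prototype and also what ties the style to one language.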

In any case, I’ll keep looking into this - I am looking at refining a
lightweight citation management system, and both interoperability and
variable formatting are key requirements.

Regards,

Liam.

Liam Magee wrote:

However I’m not particularly familiar with the details of these (although I
have done some work with DocBook and BibTeX).

I see two approaches on which to base it:

1) the CSL model (which is more output oriented)

2) the RDF ontology that some of us have been working on, and which
has inspired a bit of the current class design.

http://bibliontology.com

We ought to be releasing a draft of that soon-ish, BTW.

Each approach has its advantages and disadvantages. What's your
preference? 

I like the idea of using the bibliontology - presumably by
‘disadvantages’ you mean the scope is somewhat larger than using the CSL
model?

Correct.

This looks like potentially two, related projects: one to model
bibliontology in Ruby, the other to apply a formatting language like CSL
to it.

Right.

What I was describing as an InputFilter would then be something
like a loss-less (ideally) conversion utility from BibTeX, DocBook and
other schemas to bibliontology. This is probably non-trivial to
establish a proper set of test cases for, but worth considering (you get
a fully fledged, largely interoperable and semi-standardised(?) domain
model for the trouble). Another consideration would be using a Ruby RDF
library to load and store the model as RDF - though this is probably
getting off-topic.

Maybe, though certainly worth keeping in mind.

When I was working on the model, I was trying to think of it being used
in potentially a variety of different kinds of contexts, where the
backend might be the typical Rails-ORM-RDBMS, or text files, or an RDF
store (since I personally am increasingly interested in citations and
bibliographic management as a perfect use case for semantic web ideas;
e.g. there’s still tons of room left for really useful innovation).

In reviewing CSL more thoroughly, another approach is to use ERB to
develop a CSL-like language (something like RCSL?). This by-passes the
need for parsing and emulating XSL-like transformations in code - but at
the cost of expressivity and compatibility with the existing CSL files
(which it looks like a lot of work has gone into). I’ve done a quick
test of this, with a simplified citation and ERB file - this is
certainly a much easier path to go down.

It may be easier in the short run, but probably not in the long term.

The intent behind CSL has always been that language-independence is
crucial. Right now, there aren’t that many styles, but as we start to
build them up, and as we start to see GUI editors (or wizards) for the
styles, this is going to change. It would suck to have to recreate the
styles in two different formats.

In any case, I’ll keep looking into this - I am looking at refining a
lightweight citation management system, and both interoperability and
variable formatting are key requirements.

What would be the backend?

Bruce

What I was describing as an InputFilter would then be something
like a loss-less (ideally) conversion utility from BibTeX, DocBook and
other schemas to bibliontology. This is probably non-trivial to
establish a proper set of test cases for, but worth considering (you get
a fully fledged, largely interoperable and semi-standardised(?) domain
model for the trouble). Another consideration would be using a Ruby RDF
library to load and store the model as RDF - though this is probably
getting off-topic.

Maybe, though certainly worth keeping in mind.

When I was working on the model, I was trying to think of it being used
in potentially a variety of different kinds of contexts, where the
backend might be the typical Rails-ORM-RDBMS, or text files, or an RDF
store (since I personally am increasingly interested in citations and
bibliographic management as a perfect use case for semantic web ideas;
e.g. there’s still tons of room left for really useful innovation).

This is a good reason for tackling bibliontology as part of this, I think -
it is more likely people will invest in conversion tools and ontology
matching techniques for something like this than, say, the implicit model
referenced as part of CSL. One potential use case would be issuing SPARQL
queries to a federated RDF store, and post-processing the results with
differential formatting based on CSL. More practically - have there been any
attempts to convert bibliontology to an object model to date?

In reviewing CSL more thoroughly, another approach is to use ERB to
develop a CSL-like language (something like RCSL?). This by-passes the
need for parsing and emulating XSL-like transformations in code - but at
the cost of expressivity and compatibility with the existing CSL files
(which it looks like a lot of work has gone into). I’ve done a quick
test of this, with a simplified citation and ERB file - this is
certainly a much easier path to go down.

It may be easier in the short run, but probably not in the long term.

The intent behind CSL has always been that language-independence is
crucial. Right now, there aren’t that many styles, but as we start to
build them up, and as we start to see GUI editors (or wizards) for the
styles, this is going to change. It would suck to have to recreate the
styles in two different formats.

I see your point. Some inlined mark-up might be useful somewhere in a system
like this - it is certainly easier to prototype. It may be possible to
derive it from the CSL file in some way (i.e. be one of the wizards you
mention), or to use it for the ‘out-driver.xsl’ role. But no, two formats
would not be good.

In any case, I’ll keep looking into this - I am looking at refining a
lightweight citation management system, and both interoperability and
variable formatting are key requirements.

What would be the backend?

It’s built around Rails and Postgres (using a model that is basically BibTeX
+ a few other properties at the moment).

Regards,

Liam.

Liam Magee wrote:

This is a good reason for tackling bibliontology as part of this, I
think - it is more likely people will invest in conversion tools and
ontology matching techniques for something like this than, say, the
implicit model referenced as part of CSL. One potential use case would
be issuing SPARQL queries to a federated RDF store, and post-processing
the results with differential formatting based on CSL. More practically
- have there been any attempts to convert bibliontology to an object
model to date?

Not exactly, though what you see in the current citeproc-rb file
reflects thinking more-or-less consistent with what you see in the ontology.

For example …

It’s built around Rails and Postgres (using a model that is basically
BibTeX + a few other properties at the moment).

… the ontology has a Contribution class, as did a Rails prototype I
was playing with (really just the model) a while back.
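
A plain-Ruby sketch of what such a model could look like, as opposed to a flat BibTeX-style record, is given below. It assumes that a Contribution ties one contributor to one reference in a given role; that reading, along with every name and attribute here, is an assumption for illustration and is not taken from the ontology draft or the Rails prototype.

```ruby
# A person or organisation (hypothetical).
Agent = Struct.new(:name)

# Assumed meaning: one agent's involvement in one reference, in one role,
# with a position to preserve ordering within that role.
Contribution = Struct.new(:agent, :role, :position, keyword_init: true)

class Reference
  attr_reader :title, :contributions

  def initialize(title)
    @title = title
    @contributions = []
  end

  # Appends a contribution, numbering it within its role.
  def add_contribution(agent, role)
    position = contributions.count { |c| c.role == role } + 1
    contributions << Contribution.new(agent: agent, role: role, position: position)
    self
  end

  # Agents for a given role, in contribution order.
  def contributors(role)
    contributions.select { |c| c.role == role }.sort_by(&:position).map(&:agent)
  end
end

ref = Reference.new("Styles and Citations")
ref.add_contribution(Agent.new("Doe, J."), :author)
   .add_contribution(Agent.new("Roe, R."), :editor)
   .add_contribution(Agent.new("Poe, E."), :author)
puts ref.contributors(:author).map(&:name).join("; ")
```

Compared with separate author and editor string fields, this keeps roles open-ended, which a formatter could then query per role.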

Bruce