RPC & citeproc-js

With a little fiddling around, I’ve managed to set up a little demo
RPC server that runs citeproc-js. The server is based on Python code
gleaned from the Net, communicating with the processor across the
python-spidermonkey bridge that I wrote about earlier. The demo is an
extremely primitive thing that uses wget to send a few hard-coded
requests, spewing the JSON responses to the terminal. Not very pretty
or useful, but it does demonstrate the four commands in the API, and
shows that the thing actually works.

http://xbiblio.svn.sourceforge.net/viewvc/xbiblio/citeproc-js/branches/fbennett/

The RPC stuff is (of course) in the ./rpc-stuff directory.

Spidermonkey is streets faster than Rhino, and with this setup the
formatter only needs to be recompiled when the style changes, so
rendering transactions should be reasonably fast. No idea what the
exact performance profile will look like, or how much slower this will
be than the Haskell, but at least now there’s something to take out
for a test drive.

Frank

Spidermonkey is streets faster than Rhino, and with this setup the
formatter only needs to be recompiled when the style changes, so
rendering transactions should be reasonably fast.

I ran across this issue when working on the XSLT code, which was
dependent on the Java-based Saxon.

While I’m far from an expert in Java, I think you may be
misunderstanding where the bottleneck is with Rhino. Java (well, the
JVM) is actually really, really fast. The problem is that when you run
it from the commandline, every time you run it, you incur a large
startup cost for the JVM. It’s probably that which slows things down
even more than having to recompile the style.

If you were to stick Rhino in a servlet container like Tomcat (or
use it from Java or some other JVM-compatible language like Scala or
Groovy), you’d get the same benefits you’re seeing with this approach.

No idea what the
exact performance profile will look like, or how much slower this will
be than the Haskell, but at least now there’s something to take out
for a test drive.

Yeah, nice.

So should we all talk about what an API should look like, since both
you and Christian are working on this?

XML-RPC is rather more heavy-weight than what I’m used to dealing
with, but stripping out implementation details, what’s the request,
and what’s the response?

Bruce

While I’m far from an expert in Java, I think you may be
misunderstanding where the bottleneck is with Rhino. Java (well, the
JVM) is actually really, really fast. The problem is that when you run
it from the commandline, every time you run it, you incur a large
startup cost for the JVM. It’s probably that which slows things down
even more than having to recompile the style.

That must be right.

If you were to stick Rhino in a servlet container like Tomcat (or
use it from Java or some other JVM-compatible language like Scala or
Groovy), you’d get the same benefits you’re seeing with this approach.

Together with a team of technicians to keep them running. (By way of
fair disclosure, I guess I should say that I’m not exactly a fan of
Java. Still can’t make head or tail of those error traces, and
they’ve been streaming in front of my nose for a couple of months.)

XML-RPC is rather more heavy-weight than what I’m used to dealing
with, but stripping out implementation details, what’s the request,
and what’s the response?

Request
Id (for sessioned systems)
Method (string)
Parameter (string or object list)

Response
Id (for sessioned systems)
Response (string)

At least that’s what the little demo uses.
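
As a rough illustration of the envelope described above, here is a minimal Python sketch. The JSON key names (`id`, `method`, `params`, `result`) and the two helper functions are assumptions made for illustration; the demo's actual wire format may well differ.

```python
import json

def make_request(session_id, method, params):
    """Serialize a request envelope: id, method name, and parameter."""
    return json.dumps({"id": session_id, "method": method, "params": params})

def make_response(session_id, result):
    """Serialize a response envelope: the same id plus a string result."""
    return json.dumps({"id": session_id, "result": str(result)})

# Hypothetical exchange for a single call.
request = make_request("session-1", "makeBibliography", [])
response = make_response("session-1", "formatted string data")
```

Carrying the id in both directions is what would let a sessioned server match a reply to the processor instance that produced it.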

So should we all talk about what an API should look like, since both
you and Christian are working on this?

I should probably add that I’m not very keen on getting involved with
the web app end, beyond this little bit of code to get the ball
rolling. Someone actively working on integration in a server
environment (like Christian) will be able to provide better input.

Together with a team of technicians to keep them running. (By way of
fair disclosure, I guess I should say that I’m not exactly a fan of
Java. …

Me neither. Even newer languages like Scala, while nice, are a PITA to
deal with in terms of building, classpaths, etc., etc.

XML-RPC is rather more heavy-weight than what I’m used to dealing
with, but stripping out implementation details, what’s the request,
and what’s the response?

Request
Id (for sessioned systems)
Method (string)
Parameter (string or object list)

Response
Id (for sessioned systems)
Response (string)

At least that’s what the little demo uses.

I guess to the degree that I’ve done any (limited) server stuff, it’s
of the more RESTful (Django, web.py, etc.) kind, where you might have
a URI like this:

http://someservice.org/citeproc/bibliography

… which gets bound to a method like “run_bibliography”, which takes
a series of parameters, and returns a response.
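
To make the binding idea concrete, here is a minimal, framework-free Python sketch. The `ROUTES` table, the `dispatch` helper and the body of `run_bibliography` are all hypothetical; Django or web.py would each do this mapping in their own way.

```python
def run_bibliography(params):
    """Hypothetical handler; a real one would call the processor."""
    return "bibliography for %d item(s)" % len(params.get("items", []))

# Map URI paths to handler functions.
ROUTES = {"/citeproc/bibliography": run_bibliography}

def dispatch(path, params):
    """Look up the handler bound to a path and call it with the parameters."""
    handler = ROUTES.get(path)
    if handler is None:
        raise KeyError("no handler bound to %s" % path)
    return handler(params)

result = dispatch("/citeproc/bibliography", {"items": [{"id": "a"}, {"id": "b"}]})
```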

So what I meant by my question was, what should be the …

a) methods
b) parameters
c) response

…?

I hear your followup that you’re not interested in working on the
service part much, but these are of course generic questions that
relate to the API question.

I think the answer partly depends on the context. For many use cases,
it might make sense to send the entire document and get it back
reformatted. For other cases (like integration with a WP), it might
make sense instead to send back (at least for citations) a list of
more structured data (well, certainly formatted, but maybe not so
low-level as the bibliography, which can just be a rendered document
fragment). Not sure whether the answer changes somewhat if the
formatter is integrated more closely with the WP application, though.

Bruce

So what I meant by my question was, what should be the …

a) methods
b) parameters
c) response

…?

Ah. There are four commands at the moment:

method: setStyle
parameter: serialized XML of CSL style
response: “Set style OK”

method: insertItems
parameter: list of data items
response: “Insert items OK”

method: makeCitationCluster
parameter: list of data items
response: formatted string data

method: makeBibliography
parameter: none
response: formatted string data

There will be at least two more methods, “removeItems” (for removing
items from the processor’s persistent store and adjusting sort
sequence and disambiguation configs) and “itemInfo”, for a list of all
item variables used by the style – this will help the application be
more efficient about DB retrievals if it wants to be.

The API for data items basically follows csl.rnc, plus a breakdown of
names to smaller elements.
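
For readers who want to see the shape of the command set, here is a minimal Python sketch that dispatches the four commands listed above by name. `StubProcessor` is a stand-in invented for illustration: it does no real CSL processing, and its "formatting" is just placeholder string joining.

```python
class StubProcessor:
    """Stand-in for the real processor; it only records what it is given."""

    def __init__(self):
        self.style = None
        self.items = []

    def setStyle(self, style_xml):
        self.style = style_xml
        return "Set style OK"

    def insertItems(self, items):
        self.items.extend(items)
        return "Insert items OK"

    def makeCitationCluster(self, items):
        # Placeholder formatting: join the item titles.
        return "(" + "; ".join(item["title"] for item in items) + ")"

    def makeBibliography(self):
        # Placeholder formatting: one title per line, sorted.
        return "\n".join(sorted(item["title"] for item in self.items))

COMMANDS = ("setStyle", "insertItems", "makeCitationCluster", "makeBibliography")

def call(processor, method, parameter=None):
    """Dispatch one of the four commands by name."""
    if method not in COMMANDS:
        raise ValueError("unknown method: %s" % method)
    handler = getattr(processor, method)
    return handler() if parameter is None else handler(parameter)

proc = StubProcessor()
items = [{"title": "Beta"}, {"title": "Alpha"}]
replies = [
    call(proc, "setStyle", "<style/>"),
    call(proc, "insertItems", items),
    call(proc, "makeCitationCluster", items),
    call(proc, "makeBibliography"),
]
```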

Ah. There are four commands at the moment:

method: setStyle
parameter: serialized XML of CSL style
response: “Set style OK”

As a general rule, I don’t think I like the idea of a service being
required to send a full CSL style. Would rather have the parameter at
least optionally (though also probably preferred) be a style URI.

method: insertItems
parameter: list of data items
response: “Insert items OK”

OK. Might make sense to allow different types of data as well.

method: makeCitationCluster
parameter: list of data items
response: formatted string data

How do these “data items” differ from the “data items” above? It seems
to me these are merely references to the data above, plus some optional
parameters.

method: makeBibliography
parameter: none
response: formatted string data

A string, or a typed output (XHTML, XHTML+RDFa, RTF, etc.)?

So while I was assuming makeBibliography might take the style URI, you
are breaking that out into a separate method (well, variable).

There will be at least two more methods, “removeItems” (for removing
items from the processor’s persistent store and adjusting sort
sequence and disambiguation configs) and “itemInfo”, for a list of all
item variables used by the style – this will help the application be
more efficient about DB retrievals if it wants to be.

The API for data items basically follows csl.rnc, plus a breakdown of
names to smaller elements.

OK.

Christian, how are you doing this?

Bruce

Very cool! Have it running here on OS X. -Sean

This sounds very similar to our Zotero API, so it should fit in
nicely. One problem (and I haven’t looked at the code yet, so this may
actually be resolved somewhere): how does one specify page numbers (or
other citation-specific information) for different citations in
makeCitationCluster? Our current API has a separate CitationItem class
to account for this, which includes a reference to the item object as
well as prefix, suffix, locator, and suppress author parameters.

Thanks,
Simon
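
A sketch of the kind of per-citation wrapper described above, in Python. This is not Zotero's actual class; the field names simply mirror the parameters listed (item reference, prefix, suffix, locator, suppress author), and the sample data is made up.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CitationItem:
    """Pairs a shared source item with citation-specific details."""
    item: dict                      # reference to the source data, not a copy
    prefix: str = ""
    suffix: str = ""
    locator: Optional[str] = None   # e.g. a page range
    suppress_author: bool = False

source = {"id": "doe2008", "title": "An Example"}

# Two citations of the same source with different local details.
first = CitationItem(item=source, locator="12-15")
second = CitationItem(item=source, prefix="see ", suppress_author=True)
```

Because both wrappers point at the same `source` object, the citation-specific details never have to be written onto (or cloned along with) the underlying data item.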

Christian, how are you doing this?

So far, I use a VERY simple POST request/response that supplies the arguments for my citeproc PHP class, which, in turn, feed that into a call to pandoc via the shell. But I can see that this does not get us anywhere - we need a more sophisticated API.

I am still working on a script that installs the whole pandoc/citeproc-hs/hs-bibutils/ghc stuff on a vanilla Linux server. If this isn’t easy enough, people just won’t use it. Haskell is just very weird stuff for the average user. Once I get this done, I’ll look at Frank’s API. I could also use a JSONRPC server that comes with the qooxdoo framework (http://qooxdoo.org/documentation/0.8/rpc).

Christian

Ah. There are four commands at the moment:

method: setStyle
parameter: serialized XML of CSL style
response: “Set style OK”

As a general rule, I don’t think I like the idea of a service being
required to send a full CSL style. Would rather have the parameter at
least optionally (though also probably preferred) be a style URI.

Sounds like a good idea. For citeproc-js, it would have to be done by
the application (browser or server process), though, since the JS
implementations don’t have that i/o capability. So it’s kind of
beyond my jurisdiction. :-)
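
One way the application layer could accept either form is sketched below in Python. Everything here is an assumption for illustration: the `resolve_style` helper, the "starts with `<`" test for inline XML, and the injected `fetch` callable that stands in for whatever i/o the host application provides.

```python
def resolve_style(parameter, fetch):
    """Return CSL XML, fetching it first when given a URI rather than XML."""
    text = parameter.strip()
    if text.startswith("<"):
        # Already serialized XML: pass it straight through.
        return text
    # Otherwise treat the parameter as a style URI and let the host fetch it.
    return fetch(text)

# A dictionary lookup stands in for a real network or file fetch.
fake_store = {"http://example.org/styles/demo": "<style/>"}
inline = resolve_style("<style class='note'/>", fake_store.__getitem__)
by_uri = resolve_style("http://example.org/styles/demo", fake_store.__getitem__)
```

The processor itself would still only ever see serialized XML, which keeps the fetching on the application side.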

method: insertItems
parameter: list of data items
response: “Insert items OK”

OK. Might make sense to allow different types of data as well.

method: makeCitationCluster
parameter: list of data items
response: formatted string data

How do these “data items” differ from the “data items” above? It seems
to me these are merely references to the data above, plus some optional
parameters.

In the demo, they’re just JSON representations of the base source
data, in fields named following the CSL schema. As Simon noted, I
don’t have anything in there yet (in the processor or the demo app) to
handle tweaks and special additions like locators. At the moment it’s
just missing.

method: makeBibliography
parameter: none
response: formatted string data

A string, or a typed output (XHTML, XHTML+RDFa, RTF, etc.)?

True, there needs to be a further command to set the output mode. So
that would be six, so far.

So while I was assuming makeBibliography might take the style URI, you
are breaking that out into a separate method (well, variable).

Yes, one style per session of the processor. Changing styles alters
sorting and disambiguation rules, so in this implementation you need
to start over with a freshly configured instance, and generate a fresh
session registry by reloading all of the data items in the set.

As you suggested earlier, this engine is aimed at a different
deployment environment from citeproc-hs. It has to support item-level
transactions and persistence, to keep overhead down during
incremental edits in a connected word processor or web UI, and that
requirement is reflected in the command set.
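
The "start over with a fresh instance" behaviour can be sketched in a few lines of Python. The `Session` class and `change_style` function are hypothetical names, and the registry here is a plain dictionary rather than the processor's real sort and disambiguation state.

```python
class Session:
    """Hypothetical wrapper: one style per processor instance."""

    def __init__(self, style_xml, items=()):
        self.style = style_xml
        self.registry = {}
        for item in items:
            self.insert(item)

    def insert(self, item):
        # Item-level transaction: register a single data item.
        self.registry[item["id"]] = item

def change_style(old_session, new_style_xml):
    """Start over with a fresh instance and reload every data item."""
    return Session(new_style_xml, old_session.registry.values())

a = Session("<style id='one'/>", [{"id": "x"}, {"id": "y"}])
b = change_style(a, "<style id='two'/>")
```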

In the demo, they’re just JSON representations of the base source
data, in fields named following the CSL schema. As Simon noted, I
don’t have anything in there yet (in the processor or the demo app) to
handle tweaks and special additions like locators. At the moment it’s
just missing.

For comparison, here’s something from citeproc-hs:

processCitations :: Style -> [Reference] -> [[(String, String)]] ->
    [[FormattedOutput]]

The natural language documentation is then:

“With a Style, a list of References and the list of citation groups
(the list of citations with their locator), produce the
FormattedOutput for each citation group.”

I think that’s probably right, except that the tuple needs to include
prefix, suffix, and also some sort of local styling class (what now
might include “suppress author” but which might, apropos of the citet
citep stuff, (better?) be other things).

Yes, one style per session of the processor. Changing styles alters
sorting and disambiguation rules, so in this implementation you need
to start over with a freshly configured instance, and generate a fresh
session registry by reloading all of the data items in the set.

As you suggested earlier, this engine is aimed at a different
deployment environment from citeproc-hs. It has to support item-level
transactions and persistence, to keep overhead down during
incremental edits in a connected word processor or web UI, and that
requirement is reflected in the command set.

But I think all implementations probably need to consider both cases.

Bruce

This sounds very similar to our Zotero API, so it should fit in nicely. One
problem (and I haven’t looked at the code yet, so this may actually be
resolved somewhere): how does one specify page numbers (or other
citation-specific information) for different citations in
makeCitationCluster? Our current API has a separate CitationItem class to
account for this, which includes a reference to the item object as well as
prefix, suffix, locator, and suppress author parameters.

This still needs to be stirred in. At the start, I just blithely
thought I would bang those details onto the individual data items,
since within the processor they’re treated as disposable data. If the
data items have a life elsewhere, though, that would mean added cost
for cloning everything, which now seems like not such a good idea. It
might be better to extend the internal interfaces, and treat them as a
paired package throughout the code.

I haven’t gotten to this yet in part because locators require page
numbers, and numbers require range awareness, and implementing range
awareness affects the output queue machinery, which was based on the
happy and reckless assumption that everything could be string-ified as
soon as it was queued for rendering. I’m now in the process of fixing
that (it won’t be pretty, but the ugliness will all be in one place).
Once that’s out of the way, I’ll come up against locators and the
interface problem. At decision time, I’ll give a shout.

Frank

In the demo, they’re just JSON representations of the base source
data, in fields named following the CSL schema. As Simon noted, I
don’t have anything in there yet (in the processor or the demo app) to
handle tweaks and special additions like locators. At the moment it’s
just missing.

For comparison, here’s something from citeproc-hs:

processCitations :: Style -> [Reference] -> [[(String, String)]] ->
    [[FormattedOutput]]

The natural language documentation is then:

“With a Style, a list of References and the list of citation groups
(the list of citations with their locator), produce the
FormattedOutput for each citation group.”

Any preferences I had would probably not be sensible, for lack of
experience and lack of knowledge. The processor can certainly be
driven behind an interface like that, though.

This still needs to be stirred in. At the start, I just blithely
thought I would bang those details onto the individual data items,
since within the processor they’re treated as disposable data. If the
data items have a life elsewhere, though, that would mean added cost
for cloning everything, which now seems like not such a good idea. It
might be better to extend the internal interfaces, and treat them as a
paired package throughout the code.

I think your impulse is absolutely correct. I think Simon made a
mistake (though perhaps a necessary one at the time) with the current
implementation in tying the citation to the internal database of a
Zotero instance. This is largely why the documents are so NOT
portable.

To wit, the processor should be totally ignorant of anything internal
to Zotero. It should just take in some standardized data, and spit out
the formatted results.

But this does tie into the larger question of identifiers and in-document data.

Bruce

But this does tie into the larger question of identifiers and in-document data.

Zotero seems to have duplicates awareness now, which would be a step
toward safely slurping metadata out of a document. If you can do
that, fresh items added by a co-researcher could be imported to Zotero
(and vice-versa), and Z can carry on as it does now, using its
internal ID for referencing and database retrieval. Seems like it
would work that way, anyway.

That’s what we’re planning for handling document sharing in a future
release. If changes to the database are going to be reflected in the
document, there has to be some link between the two. Our current plan
is to handle sharing by grabbing metadata from the document, linking
items to items in the Zotero DB or creating new items if no equivalent
items exist, and then saving all collaborators’ IDs in the document
along with the metadata.

Simon
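
A toy Python sketch of the linking step described above. It is only an illustration of the idea: the `link_items` function, the data layout, the `local-N` ID scheme and the exact (title, year) match are all assumptions, and real duplicate detection would need to be far more forgiving.

```python
def link_items(document_items, database, collaborator):
    """Link in-document metadata to local items, creating missing ones.

    `database` maps a local ID to item metadata. Matching on an exact
    (title, year) pair is a placeholder for real duplicate detection.
    """
    index = {(m["title"], m["year"]): local_id for local_id, m in database.items()}
    for entry in document_items:
        meta = entry["metadata"]
        key = (meta["title"], meta["year"])
        if key not in index:
            # No equivalent item exists locally, so create one.
            local_id = "local-%d" % (len(database) + 1)
            database[local_id] = dict(meta)
            index[key] = local_id
        # Keep every collaborator's ID alongside the metadata.
        entry.setdefault("ids", {})[collaborator] = index[key]
    return document_items

db = {"local-1": {"title": "Known Work", "year": 2007}}
doc = [
    {"metadata": {"title": "Known Work", "year": 2007}, "ids": {"alice": "a-9"}},
    {"metadata": {"title": "New Work", "year": 2008}},
]
link_items(doc, db, "bob")
```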

That’s what we’re planning for handling document sharing in a future
release. If changes to the database are going to be reflected in the
document, there has to be some link between the two. Our current plan is to
handle sharing by grabbing metadata from the document, linking items to
items in the Zotero DB or creating new items if no equivalent items exist,
and then saving all collaborators’ IDs in the document along with the
metadata.

All I care about is that whatever approach you take leaves room for
collaboration beyond Zotero. I should, for example, be able to start a
draft using pandoc + citeproc-hs, convert it to ODF, and send it to a
collaborator who uses OOo and Zotero, who in turn sends it to another
who uses Mendeley and Word, and for the citations to continue to
remain “live” through that whole cycle.

To me, this is the critical use case that must drive any particular design.

It’s hard to tell from your description whether this would be feasible
(or even partially feasible), since the devil is in the details on all
of this metadata (how do you define a “collaborator ID”? how does it
get stored? how does it get associated with what kind of in-document
metadata, encoded how? etc., etc.).

Bruce

Our current plan is to encode the collaborator ID as a URI, as RDF,
with the bibliographic ontology. This should satisfy your use case,
although it might be hard to figure out how to store bibliographic
metadata in doc/docx…

Simon

Our current plan is to encode the collaborator ID as a URI, as RDF, with the
bibliographic ontology.

Ah good :-)

This should satisfy your use case, although it might
be hard to figure out how to store bibliographic metadata in doc/docx…

Yes, but I have a feeling there are ways to do this; certainly in docx.

An interesting development in the past six months or so is that MS has
joined the ODF TC at OASIS, so there’s at least some effort at working
on interoperability between them.

Bruce