Embedded CSL fields in Mendeley Word & OO documents

Hey guys,

I’m a software engineer at Mendeley, developing our Word and OpenOffice
plugins along with Carles Pina.

I’m altering the format we use to store references in documents. Currently
we just store our own document UUIDs, but we want to store all the document
metadata necessary for CSL formatting, which would make sharing documents
between our users easier, and potentially between users of other reference
managers that can read this format.

The plan is to embed JSON in a format readable by Frank Bennett’s
citeproc-js, along with some optional Mendeley-specific fields. The JSON would
be stored in the Word field codes, or in OpenOffice bookmarks. An example
field code would look like this:

{Mendeley Citation{5756f170-e97d-4c32-8279-b2039884c21b};{a8315e66-67a2-4693-b51a-e0741f556d6a} CslCitation:<JSON-DATA>}

where <JSON-DATA> will look like this (but without the whitespace):

{
  "ITEM-1": {
    <CITEPROC-CSL-FIELDS>,
    "mendeley": {
      "account": "@Steve_Ridout",
      "server": "www.mendeley.com",
      "uuid": "5756f170-e97d-4c32-8279-b2039884c21b"
    }
  },
  "ITEM-2": {
    <CITEPROC-CSL-FIELDS>,
    "mendeley": {
      "account": "@Steve_Ridout",
      "server": "www.mendeley.com",
      "uuid": "a8315e66-67a2-4693-b51a-e0741f556d6a"
    }
  },
  "ITEM-3": {
    <CITEPROC-CSL-FIELDS>,
    "mendeley": {
      "group": "14217",
      "server": "www.mendeley.com",
      "uuid": "ae405489-9d99-4c05-bc56-788ba48fd16b"
    }
  },
  "mendeley": {
    "previousFormattedCitation": "(Ahn & Schmidt, 1995; Al-shehbaz & O’kane, 2002; Alcaraz & Donaire, 2004)"
  },
  "version": "1"
}

Notes:

  • We need the original “Mendeley Citation{}” at the start for compatibility
    with old plugin versions, but it’s optional.
  • The JSON “mendeley” elements are optional, and if anyone else (e.g.
    Zotero) wants to they can add their own.
  • The “version” element represents the version of this JSON schema, in case
    we add to it or change it in future.
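
A minimal sketch, in Python, of how a consumer might read the object proposed above: it checks the “version” element, treats every “ITEM-n” key as a reference, and treats the “mendeley” elements as optional. The function name split_citation, the "_vendor" key, and the sample "title" field (standing in for <CITEPROC-CSL-FIELDS>) are illustrative assumptions, not part of the proposal.

```python
import json

# Assumption: only schema version "1" exists so far.
SUPPORTED_VERSIONS = {"1"}


def split_citation(raw):
    """Return (items, extras) parsed from the proposed JSON text.

    items maps each "ITEM-n" key to its CSL fields plus any optional
    vendor block; extras is the optional top-level "mendeley" object.
    """
    data = json.loads(raw)
    version = data.get("version")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported schema version: {version!r}")

    items = {}
    for key, value in data.items():
        if not key.startswith("ITEM-"):
            continue  # skip "version" and top-level vendor data
        csl_fields = {k: v for k, v in value.items() if k != "mendeley"}
        items[key] = {"csl": csl_fields, "_vendor": value.get("mendeley", {})}

    return items, data.get("mendeley", {})


example = json.dumps({
    "ITEM-1": {
        "title": "My paper",  # stand-in for <CITEPROC-CSL-FIELDS>
        "mendeley": {"uuid": "5756f170-e97d-4c32-8279-b2039884c21b"},
    },
    "mendeley": {"previousFormattedCitation": "(Ahn & Schmidt, 1995)"},
    "version": "1",
})
items, extras = split_citation(example)
print(items["ITEM-1"]["csl"]["title"])
print(extras["previousFormattedCitation"])
```

Because the vendor blocks are read with a default of an empty object, a document written by a tool that omits them would still parse.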

Does this sound sensible to you?

Steve
--
View this message in context: http://xbiblio-devel.2463403.n2.nabble.com/Embedded-CSL-fields-in-Mendeley-Word-OO-documents-tp6096952p6096952.html
Sent from the xbiblio-devel mailing list archive at Nabble.com.

I think providing a URL/URI for the JSON schema would be more useful.

Rintze

On Mon, Mar 7, 2011 at 6:36 AM, Steve Ridout <@Steve_Ridout> wrote:

Hi Steve,

This might be an opportunity, given that Zotero has also talked about
embedding JSON.

So let me respond with two questions:

  1. How would your proposal address the following, critical, problem?

http://community.muohio.edu/blogs/darcusb/archives/2009/03/01/the-babel-of-citations

  2. Where are you intending to store the source data?

Bruce

So, Zotero currently formats its field codes like this:

{ ADDIN ZOTERO_ITEM {"citationID":"12rsus7rlj","citationItems":[{"uri":["http://zotero.org/users/331/items/CT7UITEM"]}]} }

I can tweak things to handle Mendeley codes as well fairly easily. We can serialize citation metadata in the “citationItems” element in the near future. This leads me to the following conclusions:

  1. Where is the citationID coming from in your current implementation? Are you just re-generating them each time you load the document? Is there a reason not to be saving them?
  2. Having an array of URIs may be preferable to having a “mendeley” object. Adding an array of URIs and letting the implementation pick up the first that matches an account it knows about allows the same citation to be linked to multiple accounts simultaneously without having to re-match data each time. It’s also more general.
  3. There are better formats for carrying metadata than citeproc-js JSON, e.g., Bibliontology RDF, which could be serialized to JSON. Do we want to standardize on one of these instead of citeproc-js JSON?

Simon
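
The URI-array idea in point 2 above can be sketched roughly as follows. This is only an illustration in Python: the function name pick_known_uri, the prefix-matching rule, and the Mendeley URI shown are assumptions, since the thread does not say how an implementation decides that a URI belongs to an account it knows about.

```python
def pick_known_uri(uris, known_prefixes):
    """Return the first URI that starts with a known account prefix.

    Returns None when no URI matches, in which case a caller would have
    to fall back to some other way of finding the item.
    """
    for uri in uris:
        if any(uri.startswith(prefix) for prefix in known_prefixes):
            return uri
    return None


uris = [
    "http://www.mendeley.com/documents/example-a",  # hypothetical URI
    "http://zotero.org/users/331/items/CT7UITEM",
]
known = ["http://zotero.org/users/331/"]
print(pick_known_uri(uris, known))
```

Taking the first match keeps the order of the array meaningful, so a writer could list the most authoritative account first.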

I think providing a URL/URI for the JSON schema would be more useful.

That’s a good idea, it could also point to a web page with more information.

  1. How would your proposal address the following, critical, problem?

If we accept the general problem is
"How can we provide compatibility of citations between different users of
different word processors using different reference managers?"

This proposal would embed all the metadata necessary for formatting a
citation. So it is very feasible for authors of other plugins to use this
data, particularly if they are using CSL processors which accept JSON in the
format citeproc-js expects.

The method of embedding this data is not ideal: the Word field codes aren’t
recognised by OpenOffice, so we provide an “Export” macro to save the
data in bookmarks instead if the user wants to move between Word and OpenOffice.

  2. Where are you intending to store the source data?

All the necessary data (title, authors, publication, etc…) will be
included where I’ve written <CITEPROC-CSL-FIELDS>

In addition, the data will be stored in the user’s local Mendeley database
and on our servers if the user is syncing his database. The data on our
servers will only be available to the user who uploaded it unless it’s in a
"shared group".

Steve

If we accept the general problem is
"How can we provide compatibility of citations between different users of
different word processors using different reference managers?"

Yup; that’s it.

This proposal would embed all the metadata necessary for formatting a
citation. So it is very feasible for authors of other plugins to use this
data, particularly if they are using CSL processors which accept JSON in the
format citeproc-js expects.

OK. So in other words, if Mendeley and Zotero store these data in
a compatible way, we can slowly solve this problem?

The method of embedding this data is not ideal: the Word field codes aren’t
recognised by OpenOffice, so we provide an “Export” macro to save the
data in bookmarks instead if the user wants to move between Word and OpenOffice.

  2. Where are you intending to store the source data?

All the necessary data (title, authors, publication, etc…) will be
included where I’ve written <CITEPROC-CSL-FIELDS>

So if one has fifty references to the same source (not unreasonable in
some fields, in a book), then the data is repeated fifty times?

In addition, the data will be stored in the user’s local Mendeley database
and on our servers if the user is syncing his database. The data on our
servers will only be available to the user who uploaded it unless it’s in a
"shared group".

So the application-specific component is an additional help, but not required.

I still wonder, and so am just throwing the idea out there, if it’s
not better to decouple the following:

  • item metadata
  • user
  • service

For the sake of argument, what if you identified a source as
"issn:doi:23298392892" but also stored the user info such that you
can, if needed, search the user library first, but fall back to other
options?

E.g. recognize a URI is just an identifier, and that getting metadata
for that thing is a separate action.

Bruce

Simon Kornblith wrote:

I can tweak things to handle Mendeley codes as well fairly easily. We can
serialize citation metadata in the “citationItems” element in the near
future. This leads me to the following conclusions:

Sounds good. It would be great for Zotero and Mendeley users to be able to
share documents.

  1. Where is the citationID coming from in your current implementation? Are
    you just re-generating them each time you load the document? Is there a
    reason not to be saving them?

Yes, they are regenerated every time we run citeproc to generate the
formatted citations. I’m not sure of the benefit of embedding the ID inside
the document since it’s possible that the user could copy and paste a
citation and then edit it resulting in two citation clusters with the same
citationID but different references. Is there a good reason we should store
them?

  2. Having an array of URIs may be preferable to having a "mendeley"
    object. Adding an array of URIs and letting the implementation pick up the
    first that matches an account it knows about allows the same citation to
    be linked to multiple accounts simultaneously without having to re-match
    data each time. It’s also more general.

This could also allow linking to multiple Mendeley or Zotero users’ accounts,
which would be nice. It may be wise to restrict the number of accounts added,
though, in case widely circulated documents end up full of account URI
clutter.

  3. There are better formats for carrying metadata than citeproc-js JSON,
    e.g., Bibliontology RDF, which could be serialized to JSON. Do we want to
    standardize on one of these instead of citeproc-js JSON?

JSON is very easy for us to deal with as it can be passed straight to
citeproc, and adding support for citeproc features which Mendeley doesn’t
currently support is easy. Is there a compelling reason to switch to RDF?

Steve

Simon Kornblith wrote:

I can tweak things to handle Mendeley codes as well fairly easily. We can
serialize citation metadata in the “citationItems” element in the near
future. This leads me to the following conclusions:

Sounds good. It would be great for Zotero and Mendeley users to be able to
share documents.

Agreed.

  1. Where is the citationID coming from in your current implementation? Are
    you just re-generating them each time you load the document? Is there a
    reason not to be saving them?

Yes, they are regenerated every time we run citeproc to generate the
formatted citations. I’m not sure of the benefit of embedding the ID inside
the document since it’s possible that the user could copy and paste a
citation and then edit it resulting in two citation clusters with the same
citationID but different references. Is there a good reason we should store
them?

I think it depends on how you have implemented citeproc-js; we use the citationID for tracking within the same session. It shouldn’t be a problem for Zotero to handle a citation with a missing citationID, since we will automatically regenerate it anyway.

Whether or not you include the citationID, it would be nice to standardize on a single JSON format. Is there a reason not to extend the structure of the existing Zotero format with additional keys as necessary?

  2. Having an array of URIs may be preferable to having a "mendeley"
    object. Adding an array of URIs and letting the implementation pick up the
    first that matches an account it knows about allows the same citation to
    be linked to multiple accounts simultaneously without having to re-match
    data each time. It’s also more general.

This could also allow linking to multiple Mendeley or Zotero users’ accounts,
which would be nice. It may be wise to restrict the number of accounts added,
though, in case widely circulated documents end up full of account URI
clutter.

If we need a limit here, I suggest a large one, since a URI doesn’t take up too much space. Maybe 25-50?

  3. There are better formats for carrying metadata than citeproc-js JSON,
    e.g., Bibliontology RDF, which could be serialized to JSON. Do we want to
    standardize on one of these instead of citeproc-js JSON?

JSON is very easy for us to deal with as it can be passed straight to
citeproc, and adding support for citeproc features which Mendeley doesn’t
currently support is easy. Is there a compelling reason to switch to RDF?

CSL JSON is definitely simpler to code, but Bibliontology RDF is more versatile in terms of field support and extensibility. My opinion on this is not very strong.

Simon

If we accept the general problem is
"How can we provide compatibility of citations between different users of
different word processors using different reference managers?"

Yup; that’s it.

This proposal would embed all the metadata necessary for formatting a
citation. So it is very feasible for authors of other plugins to use this
data, particularly if they are using CSL processors which accept JSON in the
format citeproc-js expects.

OK. So in other words, if Mendeley and Zotero store these data in
a compatible way, we can slowly solve this problem?

The method of embedding this data is not ideal: the Word field codes aren’t
recognised by OpenOffice, so we provide an “Export” macro to save the
data in bookmarks instead if the user wants to move between Word and OpenOffice.

  2. Where are you intending to store the source data?

All the necessary data (title, authors, publication, etc…) will be
included where I’ve written <CITEPROC-CSL-FIELDS>

So if one has fifty references to the same source (not unreasonable in
some fields, in a book), then the data is repeated fifty times?

We could potentially embed the data in the first citation only, but we could also embed it in each citation and rely on odt/docx compression to take care of it.
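
As a rough illustration of the compression point: docx and odt files are ZIP packages, which normally use DEFLATE, so repeated copies of the same JSON should shrink considerably. The sketch below uses Python’s zlib (also DEFLATE) on a made-up item. The item fields are assumptions, and a real document has other markup between citations, so actual savings will differ.

```python
import json
import zlib

# Hypothetical item data, standing in for the embedded CSL fields.
item = {
    "id": "ITEM-1",
    "title": "My paper",
    "issued": {"date-parts": [["2007"]]},
}
one_copy = json.dumps(item).encode("utf-8")
fifty_copies = one_copy * 50  # the same source cited fifty times

compressed = zlib.compress(fifty_copies)

# Sizes in bytes: one copy, fifty copies raw, fifty copies compressed.
print(len(one_copy), len(fifty_copies), len(compressed))
```

This only speaks to file size on disk. The uncompressed field codes still have to be held and parsed by the plugin when the document is open.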

In addition, the data will be stored in the user’s local Mendeley database
and on our servers if the user is syncing his database. The data on our
servers will only be available to the user who uploaded it unless it’s in a
"shared group".

So the application-specific component is an additional help, but not required.

I still wonder, and so am just throwing the idea out there, if it’s
not better to decouple the following:

  • item metadata
  • user
  • service

For the sake of argument, what if you identified a source as
"issn:doi:23298392892" but also stored the user info such that you
can, if needed, search the user library first, but fall back to other
options?

E.g. recognize a URI is just an identifier, and that getting metadata
for that thing is a separate action.

There are a bunch of problems with this from a usability perspective:

  1. Inability to store metadata for items with no ISBN, DOI, or PMID.
  2. Incorrect/incomplete metadata in public repository.
  3. User has modified item to add additional data. (There are legitimate reasons to do this, e.g., to add a short title.)
  4. Speed of metadata retrieval, if retrieving citations for hundreds of items.

If we use URI arrays, we can easily include this metadata, but I’m not sure it’s reliable enough to use as anything but a last resort.

Simon

JSON is very easy for us to deal with as it can be passed straight to
citeproc, and adding support for citeproc features which Mendeley doesn’t
currently support is easy. Is there a compelling reason to switch to RDF?

CSL JSON is definitely simpler to code, but Bibliontology RDF is more versatile in terms of field support and extensibility. My opinion on this is not very strong.

And what did you have in mind in terms of a JSON representation of
BIBO? Using a generic RDF-as-JSON, which will be pretty verbose, or
something more idiomatic to JSON?

Pulling back, I’ve been gravitating towards thinking of two kinds of
representation that ought to be able to be more-or-less round-tripped:

  1. a CSL JSON which is very close to the CSL model, and so easy to
    process from that standpoint
  2. a richer, more extensible, more rigorous and remixable, BIBO RDF

The second has an additional benefit, which is that it can be
serialized in different ways, including as RDFa embedded in HTML
output, which is a medium term goal I’d like to push on: the idea that
the output CSL implementations produce is not just dumb text, but can
also be extracted as structured data.

It can also be embedded as RDF/XML in ODF documents in standard ways
consistent with that spec, and so is accessible to the OOo/LO metadata
API (though MS Office has no such thing, so that leaves the question
of how to deal with that).

But there’s no doubt that all of this has some additional costs.

There’s also no denying that dumping JSON in fields is a bit of an
abuse of the formats.

In any case, I don’t have a strong opinion either; my main goal is
something that “just works.”

Bruce

Simon Kornblith wrote:

I can tweak things to handle Mendeley codes as well fairly easily. We can
serialize citation metadata in the “citationItems” element in the near
future. This leads me to the following conclusions:

Sounds good. It would be great for Zotero and Mendeley users to be able to
share documents.

  1. Where is the citationID coming from in your current implementation? Are
    you just re-generating them each time you load the document? Is there a
    reason not to be saving them?

Yes, they are regenerated every time we run citeproc to generate the
formatted citations. I’m not sure of the benefit of embedding the ID inside
the document since it’s possible that the user could copy and paste a
citation and then edit it resulting in two citation clusters with the same
citationID but different references. Is there a good reason we should store
them?

They allow transactions between the processor and the calling
application to be optimized.

The processCitationCluster() method is called with the data of the
target citation, identified by its own citationID (if known), plus
lists of predecessor and successor citationIDs. By comparing the
citationID sequence against the contents of its internal registry, the
processor can determine which specific citation clusters in the
document require an update, and return the necessary data to the
calling application.

The processor doesn’t currently account for the case of duplicate
citationIDs in a single call to processCitationCluster(), but it
should do, and it would be easy to fix up. Citation clusters returned
by the processor are identified to the document interface by sequence
number, not by ID, so it’s safe to change them on the fly.
processCitationCluster() can just scan the ID list arguments before
doing its thing, and force any duplicates to null. The processor would
then generate a fresh ID, return it to the calling application
identified by sequence number, and you’re ready for the next editing
cycle.
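
The duplicate-handling step described above (scan the ID list and force any duplicates to null) might look roughly like this. It is a Python sketch of the idea only, not citeproc-js code: the function name null_duplicate_ids is hypothetical, and keeping the first occurrence of each ID is an assumption the message does not spell out.

```python
def null_duplicate_ids(citation_ids):
    """Return a copy of the list with repeated citationIDs set to None.

    Positions are preserved, so results can still be matched back to the
    document by sequence number. The first occurrence keeps its ID
    (an assumption); later occurrences become None so that a fresh ID
    can be generated for them.
    """
    seen = set()
    result = []
    for cid in citation_ids:
        if cid is None or cid in seen:
            result.append(None)
        else:
            seen.add(cid)
            result.append(cid)
    return result


print(null_duplicate_ids(["a1", "b2", "a1", None, "c3"]))
# prints ['a1', 'b2', None, None, 'c3']
```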

(Slight amendment: for duplicates forced to false, we would need to
make a separate call to the document for the data at that position, so
that the registry can be updated; but it should be doable. My idea
with embedding this logic in the processor is to lower the barrier to
the creation of new word processor plugins as far as possible. I have
Abiword in the back of my mind there; their shared document model
combined with robust citation support would be very attractive for
collaborative projects.)

Whether or not you include the citationID, it would be nice to standardize
on a single JSON format. Is there a reason not to extend the structure of the
existing Zotero format with additional keys as necessary?

Not sure I understand. Do you mean adding the CSL metadata within ZOTERO
ADDIN{}?

I’d prefer not to name it Mendeley or Zotero, so my suggestion was to put it
in a separate CslCitation block, with other blocks being optional. For
example, we would support field codes like:

{CslCitation:{}}

or

{ZOTERO ADDIN{} AnythingYouLike{}[][] CslCitation:{} OtherStuff…}

or

{Mendeley Citation{} CslCitation:{}}

as long as it has CslCitation:{} somewhere

(Currently we will be adding “Mendeley Citation{}” to the start for
compatibility with old versions but we can drop this at some point in
future.)

I was suggesting that we use the same basic structure for the JSON object itself (potentially with Zotero- and Mendeley-specific extensions). Our current JSON doesn’t say anything about Zotero anywhere, and very closely resembles a citeproc-js citation object.

I agree that the field code preceding the JSON should be implementation-agnostic. I’m fine with CslCitation:{}, but I suggest that we specify that it must come at the end of the field. Otherwise, it’s hard to parse where the JSON ends with a regexp.

Simon
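
The parsing concern above (finding where the JSON ends) can be sketched as follows, in Python. The sketch assumes that "CslCitation:" plus its JSON is the last thing in the field, as suggested, and that the string passed in does not include the outer field braces shown in the examples. The function name extract_csl_citation is hypothetical.

```python
import json
import re

# Assumption: "CslCitation:" and its JSON object end the field code text,
# so the JSON runs from the first "{" after the marker to the final "}".
CSL_RE = re.compile(r"CslCitation:\s*(\{.*\})\s*$", re.DOTALL)


def extract_csl_citation(field_code):
    """Return the parsed CslCitation object, or None if the marker is absent."""
    match = CSL_RE.search(field_code)
    if match is None:
        return None
    return json.loads(match.group(1))


code = (
    "Mendeley Citation{5756f170-e97d-4c32-8279-b2039884c21b} "
    'CslCitation:{"citationID":"12rsus7rlj","citationItems":[]}'
)
print(extract_csl_citation(code)["citationID"])
```

If other blocks were allowed after the JSON, the greedy match would swallow them and json.loads would fail, which is the reason for requiring CslCitation to come last.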

I was suggesting that we use the same basic structure for the JSON object
itself (potentially with Zotero- and Mendeley-specific extensions). Our current
JSON doesn’t say anything about Zotero anywhere, and very closely resembles a
citeproc-js citation object.

How does the Zotero JSON differ from that required by citeproc-js?

I agree that the field code preceding the JSON should be
implementation-agnostic. I’m fine with CslCitation:{}, but I suggest that we
specify that it must come at the end of the field. Otherwise, it’s hard to
parse where the JSON ends with a regexp.

Agreed.

We use the aforementioned uri array instead of an id on each citationItem, but otherwise it’s identical. We would need to extend it to put the content of the CSL fields into the citationItem.

Simon

I was suggesting that we use the same basic structure for the JSON object
itself (potentially with Zotero- and Mendeley-specific extensions). Our current
JSON doesn’t say anything about Zotero anywhere, and very closely resembles a
citeproc-js citation object.

How does the Zotero JSON differ from that required by citeproc-js?

We use the aforementioned uri array instead of an id on each citationItem,
but otherwise it’s identical. We would need to extend it to put the content
of the CSL fields into the citationItem.

How about a format like the following example:

CslCitation:
{
  "citationID": "12rsus7rlj",
  "citationItems": [
    {
      "id": "ITEM-1",
      "itemData": {
        "author": [],
        "editor": [],
        "id": "ITEM-1",
        "issued": { "date-parts": [ [ "2007" ] ] },
        "title": "My paper"
      },
      "locator": "21",
      "label": "page",
      "uris": [
        "www.mendeley.com/uniqueDocumentIdForUserA",
        "www.mendeley.com/uniqueDocumentIdForUserB",
        "www.zotero.org/uniqueDocumentIdForUserC"
      ]
    }
  ],
  "properties": {
    "noteIndex": 1
  }
}

It’s the same structure as a citeproc “minimal citation data object”, except
there’s an added “itemData” element containing the full item data as
returned by sys.retrieveItem(), and an extra “uris” array which can contain
any number of Mendeley / Zotero / other unique identifiers.
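
For illustration, here is a small Python sketch that assembles an object of this shape and serializes it without whitespace, as it would be stored in a field. The function name build_csl_citation and its parameters are hypothetical, and the URIs are the placeholders from the example above.

```python
import json


def build_csl_citation(item_data, uris, note_index, citation_id=None,
                       locator=None, label=None):
    """Return 'CslCitation:' followed by compact JSON for one cited item."""
    item = {
        "id": item_data["id"],
        "itemData": item_data,  # full item data, as sys.retrieveItem() would return
        "uris": list(uris),     # any number of unique identifiers
    }
    if locator is not None:
        item["locator"] = locator
    if label is not None:
        item["label"] = label

    citation = {
        "citationItems": [item],
        "properties": {"noteIndex": note_index},
    }
    if citation_id is not None:  # citationID may be omitted
        citation["citationID"] = citation_id

    # separators=(",", ":") drops the whitespace json.dumps adds by default.
    return "CslCitation:" + json.dumps(citation, separators=(",", ":"))


item_data = {
    "author": [],
    "editor": [],
    "id": "ITEM-1",
    "issued": {"date-parts": [["2007"]]},
    "title": "My paper",
}
text = build_csl_citation(
    item_data,
    ["www.mendeley.com/uniqueDocumentIdForUserA"],
    note_index=1,
    citation_id="12rsus7rlj",
    locator="21",
    label="page",
)
print(text)
```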

Do you think this would be OK?

Sounds good to me.

Simon

That’s great. We’re almost done implementing this, and it will be in our
next development preview release (not yet stable).

One thing I forgot to put in my last example was a schema URI. @Bruce
and @Rintze: maybe you could suggest a good URI to use for the schema
version, perhaps starting with http://citationstyles.org/