Columbia Libraries, Mendeley Collaboration on prototype CSL editor

Hi everyone,

I wanted to write the list today to introduce myself and also to let you
all know about a project that has recently been funded by the Sloan
Foundation to develop a prototype CSL editor.

My name is Jeffrey Lancaster and I’m the Emerging Technologies
Coordinator at the Science & Engineering Library at Columbia University. As
we figure out what the heck that job title even means, one of the things
I’ve been able to do is to pursue opportunities to collaborate with
developers (both internal and external) to develop technologies that we
think will be beneficial to our university community, the larger academic
community, and the public in general.

I’m writing today to let you know that we recently received a grant from
the Sloan Foundation to collaborate with Mendeley in order to develop a
prototype visual CSL editor. While many of the specifics of the prototype
are still up in the air as we begin development, I wanted to solicit input
from the CSL community throughout the process so that the product of
our collaboration is useful to you. Your input will be especially valuable since
Mendeley has already attempted such an editor before and is looking forward
to improving upon that previous effort. The code that we develop will be
deposited into an open-source repository throughout the project, so please
feel free to follow along and submit suggestions if you’re so inclined. While
we may not necessarily be able to adopt all suggestions, it’s important to
us that the process is inclusive so the product will be useful to you all.

For this project, I’m coordinating the outreach and assessment
components while Ian Mulvany at Mendeley has taken the lead on the
development. Please feel free to send me email directly (
@Jeffrey_Lancaster) with suggestions that we can include in our
development effort. As a heads up, toward the end of the prototype effort,
I’ll again be in touch to ask for your help in evaluating and assessing the
CSL editor in order to gather information that can be used to further
development either in a subsequent grant or by independent developers. We’ll
also soon send out a link to a webpage where you can follow the progress of
the project, submit feedback, etc.

This is something we’re very excited about pursuing, and I look forward
to hearing from the list about what you might like to see in a visual,
wysiwyg-ish CSL editor.

Thanks!

–Jeffrey

p.s. Press releases such as this one (
http://www.prnewswire.co.uk/cgi/news/release?id=345494) describing the
project have also recently been published and will be linked to from a
forthcoming webpage.

--
Jeffrey Lancaster
Emerging Technologies Coordinator
404 Northwest Corner Building (http://www.columbia.edu/about_columbia/map/northwest.html),
Science and Engineering Library (http://library.columbia.edu/indiv/sciencelib.html),
Columbia University (http://www.columbia.edu/)
mailcode: 4899
phone: 212.851.7138

Hi Jeffrey,

That’s great news!

I imagine some of the Mendeley people have already mentioned there’s
been a fair bit of discussion, and indeed debate, on this list about
different approaches to this issue. I’d encourage you to look through
the archive for that, as I think collectively it represents some
really deep thought on all of the issues around this.

A number of us (and I in particular) strongly believe that a
traditional WYSIWYG approach to the problem, one that attempts to model
every detail of CSL in a UI, is probably doomed to failure. Or at
least, it’ll be quite difficult to do well, and will be more difficult
for users to work with than another approach.

The primary “alternative” approach evolved out of list discussions,
and centered on a higher-level approach that was focused on two
things:

  1. piecing together formatting by assembling CSL macros (rather than
    starting with lower-level details); my original idea on this was a sort of
    MakeBST for CSL.

  2. using some simple AI-like code to match output to relevant CSL
    macro code. E.g. the user scenario would be: user pastes formatted
    bibliography into some input area, and the tool assembles a finished,
    or more-or-less finished, style. Simon Kornblith actually coded up a
    proof-of-concept, and Sylvester Keil posted a later followup about a
    cool new Ruby library that might facilitate this.
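As a rough illustration of the first idea (assembling a style from macros rather than from low-level details), here is a minimal Ruby sketch. The macro library, the component names, and the assemble helper are all hypothetical, and the output is only a fragment of a style, not a complete or validated CSL file.

```ruby
# Hypothetical macro library, keyed by the component each macro renders.
MACRO_LIBRARY = {
  author: '<macro name="author"><names variable="author"><name/></names></macro>',
  year:   '<macro name="year"><date variable="issued"><date-part name="year"/></date></macro>',
  title:  '<macro name="title"><text variable="title"/></macro>'
}.freeze

# Builds a bibliography fragment that calls the chosen macros in order.
# The result is a fragment for illustration, not a complete CSL style.
def assemble(components, delimiter: ". ")
  unknown = components - MACRO_LIBRARY.keys
  raise ArgumentError, "no macro for: #{unknown.join(', ')}" unless unknown.empty?

  macros = components.map { |c| "  #{MACRO_LIBRARY.fetch(c)}" }
  calls  = components.map { |c| %(        <text macro="#{c}"/>) }
  [
    *macros,
    "  <bibliography>",
    "    <layout>",
    %(      <group delimiter="#{delimiter}">),
    *calls,
    "      </group>",
    "    </layout>",
    "  </bibliography>"
  ].join("\n")
end

puts assemble(%i[author year title])
```

The point of the sketch is that the user-facing choice is the list of components and their order; the XML detail stays inside the macro library.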

To put it differently, I’d encourage you to publicly document some
specific user scenarios (with example users from different fields;
please don’t make the mistake of only considering the needs of the
sciences), and also the sorts of benchmarks you might use to designate
success, before getting to any kind of prototyping. I’m happy to
comment on any of that.

Bruce

Two links:

  1. using some simple AI-like code to match output to relevant CSL
    macro code. E.g. the user scenario would be: user pastes formatted
    bibliography into some input area, and the tool assembles a finished,
    or more-or-less finished, style. Simon Kornblith actually coded up a
    proof-of-concept, and Sylvester Keil posted a later followup about a
    cool new Ruby library that might facilitate this.

Code for these is at, respectively:

https://github.com/simonster/csl-inference
https://github.com/inukshuk/anystyle-parser

Bruce

First of all - yay!

Bruce is much better on the higher-level conceptual questions, but as
one of the couple of people who have so far coded most of the existing
styles, allow me some input.
A lot of the work in writing a good CSL style actually is not spent on
the coding, but on thinking about the structure of the citation style:
How does it work systematically? What happens when an item type misses
one or multiple fields? How about item types that I may not have
addressed specifically, how should those work?
Figuring that type of stuff out and writing high quality (i.e.
reliable and robust) citation styles isn’t trivial, most certainly not
for someone who has never thought about such technicalities (I know -
my first CSL styles were pretty bad in that respect).

What that means is that essentially GUIfying the CSL logic is very
likely going to lead to a bunch of poorly coded styles, even if the
editor algorithm behind the GUI is good.

This is all basically a rambling form of strong support for Bruce’s
suggestion of rethinking how a style editor should work.

Doing something along the lines suggested by Bruce may also mean that
incorporating Frank Bennett’s Feedback Gadget -
http://citationstylist.org/tools/ - would come naturally, and we could
easily implement test suites for styles that keep them working well
during/after revisions.
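A style test suite of this kind could be as simple as a set of fixtures, each pairing item data with the output the style guide expects, including items with missing fields. The Ruby sketch below is only an assumption about how that might look: the fixtures are made up, and render is a stand-in so the example runs on its own, where a real suite would call a CSL processor with the style under test.

```ruby
# Hypothetical fixtures: item data plus the output the style guide expects.
FIXTURES = [
  { name: "book with all fields",
    item: { author: "Harrison, Lowell H.", year: 1975, title: "The Civil War in Kentucky" },
    expected: "Harrison, Lowell H. (1975). The Civil War in Kentucky." },
  { name: "book without a year",
    item: { author: "Harrison, Lowell H.", title: "The Civil War in Kentucky" },
    expected: "Harrison, Lowell H. (n.d.). The Civil War in Kentucky." }
].freeze

# Stand-in renderer (an assumption, not a real processor call).
def render(item)
  year = item.fetch(:year, "n.d.")
  "#{item[:author]} (#{year}). #{item[:title]}."
end

# Returns the fixtures whose rendered output differs from the expectation.
def failures(fixtures, renderer = method(:render))
  fixtures.reject { |f| renderer.call(f[:item]) == f[:expected] }
end

failed = failures(FIXTURES)
puts failed.empty? ? "all fixtures pass" : "failing: #{failed.map { |f| f[:name] }.join(', ')}"
```

Re-running such fixtures after every revision is what would catch a change that fixes one item type while breaking another.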

Best,
Sebastian

Hi All,

I’m really happy about this grant too. It will allow us to dedicate
some people full time to working on this issue.

I’m very aware of the pitfalls of a WYSIWYG only approach, and I feel
that we need to have a really good handle on the use cases. I strongly
feel that participation from this community will be vital to the
project.

From the perspective of Mendeley the main use case we are getting is
to support styles that users don’t have access to. We feel that there
is scope for improving tools for authoring CSL, and if we lower the
barrier to the creation of valid independent styles, this will help
everyone. That’s where I would like to see this project make a
contribution.

- Ian

From the perspective of Mendeley the main use case we are getting is
to support styles that users don’t have access to.

To be more specific on the use case, I take the “users don’t have
access to” phrase to mean:

User needs a style for Journal X, but there is no such style
available. They need such a style.

Is that what we’re talking about?

The details are really important, because there are a couple of
nuances here, each suggesting potentially different paths to an
outcome. In user-oriented language:

  1. “is there an existing style that looks like what I need, but is
    called something else?” A variation is “what style is closest to what
    I need, and how can I edit it to do what I need?”

  2. “The journal style guide gives me these examples. How can I get a
    style that produces these results?” (which is really the crux of the
    matter for most users, it seems to me)

I think it would help to get into this kind of detail on use cases, as
it will help designers, and also make it easier to assess which
approach is likely to work best.

You might say Simon’s proof-of-concept, for example, is designed for #2.

Bruce

Let me echo Sebastian’s and Bruce’s sentiments. I also think that coding
high-quality styles with a GUI CSL editor that supports the full scope of
CSL is unlikely to be that much easier than hand-editing XML. Some
strengths of CSL (conditionals, macros, groups) are difficult to implement
without visualizing the hierarchical structure of styles, and properly
using these features takes a bit of know-how. So I strongly support Bruce’s
advice of taking a bit of distance and identifying the best way(s) to solve
the problem.

Furthermore, I think that a tool that takes in formatted bibliographies and
finds matching styles is relatively low-hanging fruit. Such a tool would
make it easier for users and style coders to identify styles that already
give output close to what’s needed, and can already be used without a
full-blown CSL editor.
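One way such a matching tool could rank styles is to render the same sample item with every style ahead of time and compare the user's pasted reference against each result. The Ruby sketch below assumes that setup: the style names and pre-rendered strings are invented, and character-bigram Dice similarity is just one plausible measure, not something any existing tool is known to use.

```ruby
# Hypothetical pre-rendered output: the same sample item formatted by each
# style. In a real tool these strings would come from a CSL processor.
RENDERED = {
  "style-a" => "Craig, B. F. (1979). Henry C. Burnett. The Register, 77, 266-274.",
  "style-b" => "Craig BF. Henry C. Burnett. The Register. 1979;77:266-274.",
  "style-c" => "Berry F. Craig, \"Henry C. Burnett,\" The Register 77 (1979): 266-274."
}.freeze

# Character bigrams keep punctuation and ordering, which is what
# distinguishes one citation style from another.
def bigrams(text)
  text.downcase.each_char.each_cons(2).map(&:join)
end

# Dice coefficient over bigram counts: 1.0 for identical strings,
# 0.0 when no bigram is shared.
def similarity(a, b)
  x = bigrams(a).tally
  y = bigrams(b).tally
  total = x.values.sum + y.values.sum
  return 0.0 if total.zero?

  shared = x.sum { |gram, count| [count, y.fetch(gram, 0)].min }
  2.0 * shared / total
end

# Returns [style name, score] pairs, best match first.
def closest_styles(target, rendered = RENDERED)
  rendered
    .map { |name, output| [name, similarity(target, output)] }
    .sort_by { |_, score| -score }
end

pasted = "Craig BF. Henry C. Burnett. The Register. 1979;77:266-274."
closest_styles(pasted).each { |name, score| puts format("%-8s %.2f", name, score) }
```

This only works when the pasted reference describes the same item as the pre-rendered samples, which is why the thread also discusses letting users edit predefined sample data.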

Rintze

I also agree with the overall sentiment that sticking too close to the XML specs of CSL will not lead to a very useful tool. It would clearly be nice to be able to magically create a CSL style from analyzing an existing bibliography, and it looks like it could work quite well, which I am impressed by. But there will also always be the need to refine things more, so you would still need to provide more user interface elements to edit the details.

I also like the idea of automatically scanning existing styles for similar output. This is a great way either to be done before you even start, or at least to get a good starting point.

Charles

Yes, good point here. It’s just a question whether you start from the
general and move to the particular (the approach we’re advocating
here), or vice versa (often the first impulse).

Bruce

Exactly right: great summary, and totally agreed!

Agree on the following:

  • We need really well defined use cases. We have a long list of
    support requests and feature requests at Mendeley, so I can start to
    break those down by use case type. Charles, I guess you guys have
    similar for Papers?

  • +1 on looking to integrate a test suite.

  • I like the point about deciding on the direction of approach. As
    many of you know, we tried before to go bottom up and it didn’t work
    because we failed to capture the hierarchical structure of the CSL
    specification.

Our first step in this project is going to be to work on these use
cases before we think about cutting code.

I want to spend time examining the excellent set of tools that are out
there already. I’m aware of the following tools:

Dev resources listed here:
http://citationstyles.org/citation-style-language/development/

Processors listed here:
http://citationstyles.org/citation-style-language/processors/

Prototype lookup service:
http://steveridout.com/cslEditor/cslFinder/

Prototype CSL code viewer:
http://steveridout.com/cslEditor/

Naive WYSIWYG implementation (the one that didn’t work out):
http://csleditor.quist.de/csleditor/show/2/another-example-citation-style

What else should I be looking at?

- Ian

Tools to parse formatted bibliographies:
http://www.zotero.org/support/kb/importing_formatted_bibliographies
and
http://www.crossref.org/guestquery/

Another CSL editor (CSL 0.8.1):
http://www.somwhere.org/csl

Rintze

(In reply to Ian Mulvany’s message of Fri, Jan 20, 2012, 5:18 AM.)

I wrote anystyle-parser as a freecite replacement; my idea, going forward, was to turn it into a web service, like freecite, too. The ML model and the feature dictionary were optimized for my use cases, but could be easily improved. David Shorthouse has written a web service that combines the parser with discovery:

http://refparser.shorthouse.net/

This is a very smart approach to improve the quality of the parsed references. You can find the parser itself at:

Or simply get it with 'gem install anystyle-parser'.

Also, in rewriting citeproc-ruby I have started to extract all the CSL functionality into a separate multi-purpose CSL API. This could be extremely useful for a style editor, obviously, but it’s far from finished.

Best,

Sylvester

So just to clarify, the relevance here is that in this approach, we’d need
a really smart parser that would allow us to deconstruct a formatted
bibliographic entry into its component parts, and then to match those
against CSL macro fragments, to piece together a new style.

This library can provide that.

Also, in rewriting citeproc-ruby I have started to extract all the CSL functionality into a separate multi-purpose CSL API. This could be extremely useful for a style editor, obviously, but it’s far from finished.

GitHub - inukshuk/csl-ruby: Citation Style Language (CSL) API for Ruby

I was wondering about that. So what’s the relationship between the
rewritten citeproc-ruby and csl-ruby?

Bruce

With the caveat that formatted bibliographic entries are often lossy, so it
might be desirable to parse the entry, use the component parts to identify
the item (e.g. via CrossRef’s lookup tools), retrieve more complete
bibliographic data for the item (e.g., once you know the DOI, you could
resolve it and scrape that page; this could be done with server-side Zotero
translators:
http://forums.zotero.org/discussion/19458/translators-server-side/ ), and
run that more complete bibliographic data through the CSL processor for all
styles.

Rintze

So just to clarify, the relevance here is that in this approach, we’d need
a really smart parser that would allow us to deconstruct a formatted
bibliographic entry into its component parts, and then to match those
against CSL macro fragments, to piece together a new style.

This library can provide that.

Basically. The parser is not really smart, but based on a machine learning model. It is currently trained mostly on a bibliography that I had to parse, for which it yielded very good results. Because it is extremely hard to achieve perfection, I wanted it to be really easy for everyone to train the model. (The model itself could be further improved, too, as could the feature extraction algorithms.)

Anyway, here’s a quick example:

Anystyle.parse "Harrison, Lowell H. (1975). The Civil War in Kentucky. The University Press of Kentucky. pp. 20, 22. ISBN 0-8131-1419-5."

Returns:

=> [{:author=>"Harrison, Lowell H.", :title=>"The Civil War in Kentucky", :publisher=>"The University Press of Kentucky", :pages=>["pp.", "22."], :volume=>20, :isbn=>"0-8131-1419-5", :year=>1975, "unmatched-pages"=>"22.", :type=>:book}]

So this is pretty close, but volume 20 is wrong.

Anystyle.parse 'Craig, Berry F. (August 1979). "Henry C. Burnett: Champion of Southern Rights". The Register of the Kentucky Historical Society 77: pp. 266–274.'

This one is spot on:

=> [{:author=>"Craig, Berry F.", :title=>"Henry C. Burnett: Champion of Southern Rights", :journal=>"The Register of the Kentucky Historical Society", :volume=>77, :pages=>"266–274", :month=>8, :year=>1979, :type=>:article}]

But:

Anystyle.parse 'Craig, Berry F. (Autumn 2001). "The Jackson Purchase Considers Secession: The 1861 Mayfield Convention". The Register of the Kentucky Historical Society 99 (4): pp. 339–361.'

Returns:

=> [{:author=>"Craig, Berry F.", :date=>"(Autumn", :title=>"2001). \"The Jackson Purchase Considers Secession: The 1861 Mayfield Convention\"", :journal=>"The Register of the Kentucky Historical Society", :volume=>99, :pages=>"339–361", :number=>4, :type=>:article}]

So here the year wasn’t picked up. What you need to do is train the model to become smarter at recognizing a season-year combination like this:

Anystyle.parser.train ' Craig, Berry F. (August 1979). "Henry C. Burnett: Champion of Southern Rights". The Register of the Kentucky Historical Society 77: pp. 266–274. '

Now, the results are improved for this entry (obviously), but more importantly, also for similarly formatted entries.

The parser is well suited for reference parsing (especially when combined with discovery). If you want truly perfect results, the machine learning approach is probably not the best.
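Where one specific pattern keeps being missed, a small rule-based post-processing step is one alternative to retraining the model. The following Ruby sketch is not part of anystyle-parser; the season_year helper and its regular expression are assumptions made up for illustration, and it only recognizes English season names followed by a four-digit year.

```ruby
SEASONS = %w[spring summer autumn fall winter].freeze
SEASON_YEAR = /\(?\b(#{SEASONS.join('|')})\s+(\d{4})\)?/i

# Returns { season:, year: } for strings like "(Autumn 2001).",
# or nil when no season-year pair is present.
def season_year(text)
  match = SEASON_YEAR.match(text)
  return nil unless match

  { season: match[1].downcase, year: match[2].to_i }
end

p season_year("Craig, Berry F. (Autumn 2001). The Jackson Purchase Considers Secession")
p season_year("Craig, Berry F. (August 1979). Henry C. Burnett")
```

A rule like this covers only the cases someone has thought to write down, which is the trade-off against the trained model discussed above.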

I was wondering about that. So what’s the relationship between the
rewritten citeproc-ruby and csl-ruby?

citeproc-ruby became really difficult to maintain and refactor, because I originally added a lot of functionality for managing the JSON format on the one hand and CSL elements on the other. For example, you could use the CSL locale classes to ordinalize numbers (with gender support) etc.

So now my approach is to have a separate processor API (which contains all the JSON functionality, like date parsing etc.), a processor (the Ruby processor, which needs to be rewritten, or citeproc-js embedded into Ruby), and the CSL API. The CSL API basically allows you to parse, create and interact with the individual CSL elements.

Absolutely. I don’t think it’s feasible to parse bibliographies ‘perfectly’. To have a suitably good parser and combine it with discovery tools is a really good idea.

However, it doesn’t help if the data you need to parse isn’t really available online.

A similar approach, that doesn’t rely on free text parsing, is what
Dan Stillman originally suggested: have predefined formatted data, and
allow users to modify it. Of course, there are a range of ways to
approach this as well (pure free text, pop-up tokens that allow one to
select from options, etc.).

Bruce

It looks like Steve Ridout implemented this idea, as linked by Ian above:

http://steveridout.com/cslEditor/cslFinder/

I’ve been meaning to incorporate this sort of functionality into the
Zotero styles page (and was actually planning to sit down and finally
try it today), so it’s great to see a working PoC. As you say, there are
a bunch of ways this could work—free-form, tokens, fixed, customizable,
client-side, server-side—but I still think this is one of the easiest
ways to meet a lot of people’s needs and allow more people to benefit
from the authoring work done by Sebastian and others.

I’d chatted with Steve about the idea, but didn’t realize he actually
implemented it. Awesome!

Yup; it’s probably “easiest” both for developers and for users, as
well as potentially style repository managers (e.g. those who have to
maintain the styles; gets to Sebastian and Rintze’s point about style
quality).

Bruce