Parsing bibliographies / style predictor

Dear all,

I’ve recently completed a long-term project of mine by writing a web
application that exposes the AnyStyle parser library for ML-powered
parsing of bibliographies. The web application (and API) is available at
http://anystyle.io (SSL available too) and is very exciting (if I may
say so myself) for mainly two reasons:

  1. The parsing process is split into two steps, showing you the output
    of the ML-driven step in an editor that allows you to make changes to
    the parse result.

  2. These changes can be recorded and used directly to train the ML
    model.

This is exciting, because so far it required a lot of effort and
know-how to prepare training data. Now there is a single public model
that everyone can help improve. Obviously, this part is still very
experimental — it will be interesting to see if the model starts to
deteriorate at some point if fed too much training data. Meanwhile, we
now have a publicly available parser that should be fairly easy to train
to recognize, for example, new styles or languages. Please do take a look
if you’re interested! I imagine most of you will be interested in the
‘CiteProc’ output format (the ‘JSON’ format is less interesting, because
it does not apply as much post-processing to individual fields).

The parser is also accessible via a JSON API; I wrote a very quick
prototype for a style-predictor (Rintze’s idea!) similar to the one in
the CSL editor. You can give the predictor a reference, the reference
will be parsed and the parsed result rendered in all independent CSL
styles; these formatted references are then compared with the original
one using the Levenshtein distance and the best matches reported. It’s
just a quick prototype; you can take a look at it here:

The parsing is very fast, but the rendering using citeproc-ruby takes
quite some time :) But since the parsing API is so simple, it should be
very easy to recast this example in JavaScript, Haskell or Python.
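
As a rough illustration of the comparison step described above, here is a
minimal Python sketch. It assumes the reference has already been parsed and
rendered in each candidate style; the rank_styles helper, the style names and
the rendered strings are hypothetical placeholders, not the prototype's actual
code or real CSL output.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def rank_styles(original, rendered, top=3):
    """Return the `top` (style, distance) pairs closest to `original`.

    `rendered` maps a style name to the reference as formatted in that
    style. Producing those strings (the slow CSL rendering step) is
    assumed to happen elsewhere.
    """
    scored = sorted(
        (levenshtein(original, text), style) for style, text in rendered.items()
    )
    return [(style, distance) for distance, style in scored[:top]]


# Hypothetical input: one pasted reference and three made-up renderings.
original = "Doe, J. (2014). A title. Journal, 1(2), 3-4."
rendered = {
    "style-a": "Doe, J. (2014). A title. Journal, 1(2), 3-4.",
    "style-b": "J. Doe, A title, Journal 1 (2014) 3-4.",
    "style-c": "Doe J. A title. Journal. 2014;1(2):3-4.",
}
for style, distance in rank_styles(original, rendered):
    print(style, distance)
```

The ranking itself is cheap; in a port like this, nearly all of the time would
still go into producing the rendered strings for every style.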

I thought this might be of interest to some of you on this list. Just
let me know if you have any questions!

Sylvester

Cool!

So does this get us one step closer to the magic style finder and generator?

Bruce

A more short-term goal would be to incorporate this into Mendeley’s
CSL editor. It’s probably more user-friendly if users no longer have
to reformat a fixed predefined set of item metadata (as currently is
required for http://editor.citationstyles.org/searchByExample/ ), but
instead can just copy and paste some references that already exist in
the desired format, and have the tool show CSL styles that give
similar output.

Rintze

Hi,

On 17 May 2014 21:06, Sylvester Keil <@Sylvester_Keil> wrote:

Dear all,

I’ve recently completed a long-term project of mine by writing a web
application that exposes the AnyStyle parser library for ML-powered
parsing of bibliographies. The web application (and API) is available at
http://anystyle.io (SSL available too) and is very exciting (if I may
say so myself) for mainly two reasons:

As Rintze said: this is very cool! Thank you very much for doing it!

I’ll send you some feedback off-list.

The CSL Editor would be an obvious point of improvement; I’ll also
forward this to some colleagues who might be interested.

Regards,

So does this get us one step closer to the magic style finder and generator?

I think Rintze is more on the mark about making it easy to find an
existing style. Reducing the overall hassle for users I think is more
about reducing the unnecessary proliferation of styles. Within Elsevier
one of the outcomes of trying to ensure that CSL styles are available
for all their journals is the consolidation of existing styles into only
a handful - which becomes much easier when you have better visibility
(a journal name → style URI mapping) on how many different styles
there really are.
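
To make the mapping idea concrete, here is a small Python sketch that inverts
a journal name → style URI mapping to show how many journals share each
style. The journal names and URIs are made-up placeholders, not real data.

```python
from collections import defaultdict


def journals_per_style(journal_to_style):
    """Group journal names by the style URI they map to."""
    by_style = defaultdict(list)
    for journal, style_uri in journal_to_style.items():
        by_style[style_uri].append(journal)
    return {uri: sorted(names) for uri, names in by_style.items()}


# Hypothetical mapping: three journals, but only two distinct styles.
journal_to_style = {
    "Journal of Examples": "http://example.org/styles/numeric",
    "Annals of Placeholders": "http://example.org/styles/numeric",
    "Sample Letters": "http://example.org/styles/author-date",
}

groups = journals_per_style(journal_to_style)
print(len(journal_to_style), "journals,", len(groups), "distinct styles")
for uri, names in sorted(groups.items()):
    print(uri, "->", ", ".join(names))
```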

Regards,
Rob.

but instead can just copy and paste some references that already exist in the desired format, and have the tool show CSL styles that give similar output.

I talked to Steve about this back when he wrote the tool. I suggested,
at a more basic level, allowing items to be supplied as citation data
in CSL-JSON or RIS, but the main bottleneck isn’t actually getting the
data in (the task that Sylvester’s tool would make very simple). The
problem is that it takes significant time (5-10 minutes?) to generate
citations in all styles for new data, and those new citations are
needed to generate the set of closest matches.
It may still be possible to use a similar approach to do this,
especially if Sylvester’s tool could be extended to identify similar
styles more quickly, but that’s going to be a lot more advanced than
simply plugging it into the existing CSL editor, unfortunately.

Reducing the overall hassle for users I think is more about reducing the unnecessary proliferation of styles.
On an ideological level I couldn’t agree more.

I’m wondering how citeproc-ruby, citeproc-js, and citeproc-hs compare
performance-wise. And even if a user needs to wait a few minutes, that
might still be a better experience than the current setup.

Rintze

We can and should definitely lobby for fewer styles - but realistically
I wouldn’t expect much to change on this front in the medium term.

I don’t think it is so much a matter of “lobbying” as maximizing the
convenience to librarians, departments or whoever else makes the
decisions of picking an existing style.

But yes, I take your point.

It’s off-topic, but getting Wiley on board would help quite a bit with
trimming down the number of independent styles. I never got any
further in their bureaucracy than a single dysfunctional reply:

“Unfortunately, I regret to inform you that we already have our own
way of standardizing and handling journal citation/reference styles.
Nonetheless, we would like to thank you for your offer.”

when I asked them whether they use a limited set of citation formats
for their journals. If any of you have good contacts at Wiley, let me
know.

Rintze