Style analysis, UI?

So we’ve periodically talked about two things:

  1. a higher-level style editing/creation wizard, with one variant relying on
    modifying formatted output to get the appropriate CSL

  2. using existing style data to help us do this

Do we have any actual data that would give us a concrete sense of how
flexible a UI would need to be?

I just had in my mind’s eye a UI that is a formatted bib, but where one
could interact with the variable tokens. Just wondering if we have any
evidence whether that’s feasible.

Bruce

So we’ve periodically talked about two things:

  1. a higher-level style editing/creation wizard, with one variant relying on
    modifying formatted output to get the appropriate CSL

  2. using existing style data to help us do this

Do we have any actual data that would give us a concrete sense of how
flexible a UI would need to be?

There’s this: https://bitbucket.org/fbennett/csl-lib/overview

I haven’t touched it in the past year, but the scripts should still
work. The data dumps are idiosyncratic, but if someone wanted to clean
it up, the logic is there.

I just had in my mind’s eye a UI that is a formatted bib, but where one
could interact with the variable tokens. Just wondering if we have any
evidence whether that’s feasible.

Bruce

I’ve spoken with Steve Rideout (IIRC) about the possibility of tagging
output with links or hints that lead back to the portion(s) of CSL
from which it is derived, to allow this kind of reverse-direction
editing. It’s a very tough problem due to the complexity of
inheritance and cross-node interaction in much of the formatting
logic. It’s a tempting concept, but there would be a risk of finding
in the end that it wouldn’t work well enough to be useful. I don’t
expect to do anything on it in the foreseeable future myself.

So we’ve periodically talked about two things:

  1. a higher-level style editing/creation wizard, with one variant relying on
    modifying formatted output to get the appropriate CSL

  2. using existing style data to help us do this

Do we have any actual data that would give us a concrete sense of how
flexible a UI would need to be?

There’s this: https://bitbucket.org/fbennett/csl-lib/overview

I haven’t touched it in the past year, but the scripts should still
work. The data dumps are idiosyncratic, but if someone wanted to clean
it up, the logic is there.

Cool; thanks.

I just had in my mind’s eye a UI that is a formatted bib, but where one
could interact with the variable tokens. Just wondering if we have any
evidence whether that’s feasible.

Bruce

I’ve spoken with Steve Rideout (IIRC) about the possibility of tagging
output with links or hints that lead back to the portion(s) of CSL
from which it is derived, to allow this kind of reverse-direction
editing. It’s a very tough problem due to the complexity of
inheritance and cross-node interaction in much of the formatting
logic. It’s a tempting concept, but there would be a risk of finding
in the end that it wouldn’t work well enough to be useful. I don’t
expect to do anything on it in the foreseeable future myself.

What I’m contemplating here is actually simpler, and is a riff on my
earlier ideas about a makebst-like wizard interface. So rather than
there being some direct mapping of CSL to HTML, there’d just be
tokens that were tied to a pre-selected list of macro options, where
one would just click the token to get the list, click on what looked
right, etc.

But come to think of it, this may not be any better than my original
idea; not sure.

Bruce

So we’ve periodically talked about two things:

  1. a higher-level style editing/creation wizard, with one variant relying on
    modifying formatted output to get the appropriate CSL

  2. using existing style data to help us do this

Do we have any actual data that would give us a concrete sense of how
flexible a UI would need to be?

There’s this: https://bitbucket.org/fbennett/csl-lib/overview

I haven’t touched it in the past year, but the scripts should still
work. The data dumps are idiosyncratic, but if someone wanted to clean
it up, the logic is there.

Cool; thanks.

I just had in my mind’s eye a UI that is a formatted bib, but where one
could interact with the variable tokens. Just wondering if we have any
evidence whether that’s feasible.

Bruce

I’ve spoken with Steve Rideout (IIRC) about the possibility of tagging
output with links or hints that lead back to the portion(s) of CSL
from which it is derived, to allow this kind of reverse-direction
editing. It’s a very tough problem due to the complexity of
inheritance and cross-node interaction in much of the formatting
logic. It’s a tempting concept, but there would be a risk of finding
in the end that it wouldn’t work well enough to be useful. I don’t
expect to do anything on it in the foreseeable future myself.

What I’m contemplating here is actually simpler, and is a riff on my
earlier ideas about a makebst-like wizard interface. So rather than
there being some direct mapping of CSL to HTML, there’d just be
tokens that were tied to a pre-selected list of macro options, where
one would just click the token to get the list, click on what looked
right, etc.

But come to think of it, this may not be any better than my original
idea; not sure.

Bruce

That was what I had in mind with csl-lib as well. The hard part (or at
least the part that no one has tried to do yet) would be to work out a
typology of style structures (since not every style breaks things down
in the same way). With that in hand, you should be able to build a
style by choosing a structure, and then selecting from a subset of
macros for each slot in the structure. Working out the typology would
require back-and-forth between analyzing styles to identify top-level
patterns, and then looking at the styles to see if the number of
top-level patterns can be reduced. Some styles would have to be left
out, as “bespoke CSL”, but many of the author-date and numeric styles,
in particular, might be amenable.
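
If it helps to picture it, the slot-filling idea might be encoded with a structure like the following sketch (every name below is a hypothetical illustration, not anything from csl-lib):

```python
# Hypothetical encoding of the typology idea: a style structure is an
# ordered list of slots, and building a style means choosing one macro
# (by name) for each slot from a per-slot candidate set. All names
# below are invented; nothing here comes from csl-lib.
AUTHOR_DATE_STRUCTURE = [
    ("contributors", ["author-short", "author-full"]),
    ("date",         ["year-parens", "year-bare"]),
    ("title",        ["title-plain", "title-italic"]),
    ("container",    ["journal-abbrev", "journal-full"]),
]

def build_style(structure, choices):
    """Return the ordered macro names for a style, validating each
    choice against the slot's allowed candidates; unspecified slots
    default to the first candidate."""
    selected = []
    for slot, candidates in structure:
        macro = choices.get(slot, candidates[0])
        if macro not in candidates:
            raise ValueError("macro %r not valid for slot %r" % (macro, slot))
        selected.append(macro)
    return selected
```

Working out which structures and candidate sets actually exist in the repository is, of course, exactly the typology work described above.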

So, I have a crazy idea of how to shift as much of the complexity of generating CSL away from the user as possible. Essentially, I want to be able to copy and paste bibliography entries from a journal’s reference list into a box and end up with a formatted style.

As far as the implementation goes, we would need to:

  1. Convert the bibliography entries to a series of labeled fields using a parser such as FreeCite.
  2. Where possible, string together macros from existing styles to generate the output.
  3. If the output contains a substring that cannot be generated using existing macros, generate a new macro to generate only that substring and use existing macros for the rest. In order to avoid generating macros that work for only a limited set of references (e.g., “(” as a prefix on one element and “)” as a suffix on a different element), this would need to be done either using a statistical model based on the distribution of prefixes, suffixes, and group delimiters in the CSL repository and choosing the most likely macro, or by using a set of heuristics.

As far as (3) goes, I made a naive implementation of the former in Scheme/MIT Church (https://github.com/simonster/csl-inference) that mostly works. MIT Church is really nice in some ways, but the inference is imperfect (samples are not actually independent). Heuristics would undoubtedly be faster, and might work better.
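
A minimal, heuristic sketch of steps (2) and (3) might look like this in Python (the macro “library” and its renderers are invented stand-ins; a real implementation would render macros with a CSL processor and do the inference in step (3) properly):

```python
# Hypothetical sketch of steps (2) and (3): greedily consume the
# formatted entry, preferring macros from existing styles, and flag
# any leftover substrings for macro inference. Step (1), field
# labeling, is assumed done by an external parser such as FreeCite.

# Toy stand-ins for macros harvested from existing styles.
MACRO_LIBRARY = {
    "author": lambda fields: fields["author"],
    "issued": lambda fields: "(%s)" % fields["year"],
}

def match_macros(entry_text, fields):
    """Return a plan: ("macro", name) where an existing macro
    reproduces a prefix of the remaining text, ("infer", chunk)
    where a new macro would have to be inferred (step 3)."""
    remaining, plan = entry_text, []
    while remaining:
        for name, render in MACRO_LIBRARY.items():
            out = render(fields)
            if out and remaining.startswith(out):
                plan.append(("macro", name))
                remaining = remaining[len(out):]
                break
        else:
            # No macro matched: grow the trailing "infer" chunk.
            if plan and plan[-1][0] == "infer":
                plan[-1] = ("infer", plan[-1][1] + remaining[0])
            else:
                plan.append(("infer", remaining[0]))
            remaining = remaining[1:]
    return plan

fields = {"author": "Doe, J.", "year": "2011"}
plan = match_macros("Doe, J. (2011)", fields)
# → [("macro", "author"), ("infer", " "), ("macro", "issued")]
```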

Implementing this might end up being a lot of work, but I think it’s possible in principle. The UI is very simple if it can be made to work well enough; the difficulty is in programming it. I won’t have any time to do this for quite a while, but it could be a fun project.

Simon

So, I have a crazy idea of how to shift as much of the complexity of
generating CSL away from the user as possible. Essentially, I want to be
able to copy and paste bibliography entries from a journal’s reference list
into a box and end up with a formatted style.

Indeed, this would probably be the ideal, with one caveat: most of
the time, the examples aren’t extensive enough to cover everything
authors actually need, so the code should allow for that if it can.

As far as the implementation goes, we would need to:

  1. Convert the bibliography entries to a series of labeled fields using a
    parser such as FreeCite.
  2. Where possible, string together macros from existing styles to generate
    the output.
  3. If the output contains a substring that cannot be generated using
    existing macros, generate a new macro to generate only that substring and
    use existing macros for the rest. In order to avoid generating macros that
    work for only a limited set of references (e.g., “(” as a prefix on one
    element and “)” as a suffix on a different element), this would need to be
    done either using a statistical model based on the distribution of prefixes,
    suffixes, and group delimiters in the CSL repository and choosing the most
    likely macro, or by using a set of heuristics.

As far as (3) goes, I made a naive implementation of the former in
Scheme/MIT Church (https://github.com/simonster/csl-inference) that
mostly works. MIT Church is really nice in some ways, but the
inference is imperfect (samples are not actually independent).
Heuristics would undoubtedly be faster, and might work better.

Why MIT Church, and not, say, Python? Just something you’d been
playing with, or is there some other reason?

Implementing this might end up being a lot of work, but I think it’s
possible in principle. The UI is very simple if it can be made to work well
enough; the difficulty is in programming it. I won’t have any time to do
this for quite a while, but it could be a fun project.

Cool; thanks for putting it up on github!

If you get a chance, do you think you could convert the README to
markdown, so that it will render correctly (complete with
syntax-highlighting) in the browser?

If the source is LaTeX, pandoc will convert it for you, except maybe
for the syntax highlighting. For that, see this source:

https://raw.github.com/seancribbs/ripple/6b62eee9301b654d937b0f85706e6cc72ad88352/README.markdown

… which will render like:

https://github.com/seancribbs/ripple/blob/master/README.markdown

Hence, XML highlighting with this:

<foo>bar</foo>

Bruce

So, I have a crazy idea of how to shift as much of the complexity of
generating CSL away from the user as possible. Essentially, I want to be
able to copy and paste bibliography entries from a journal’s reference list
into a box and end up with a formatted style.

Indeed, this would probably be the ideal, with one caveat: most of
the time, the examples aren’t extensive enough to cover everything
authors actually need, so the code should allow for that if it can.

That’s the rationale behind using existing macros when they fit, instead of trying to infer everything, but there may still be some issues with this.

As far as the implementation goes, we would need to:

  1. Convert the bibliography entries to a series of labeled fields using a
    parser such as FreeCite.
  2. Where possible, string together macros from existing styles to generate
    the output.
  3. If the output contains a substring that cannot be generated using
    existing macros, generate a new macro to generate only that substring and
    use existing macros for the rest. In order to avoid generating macros that
    work for only a limited set of references (e.g., “(” as a prefix on one
    element and “)” as a suffix on a different element), this would need to be
    done either using a statistical model based on the distribution of prefixes,
    suffixes, and group delimiters in the CSL repository and choosing the most
    likely macro, or by using a set of heuristics.

As far as (3) goes, I made a naive implementation of the former in
Scheme/MIT Church (https://github.com/simonster/csl-inference) that
mostly works. MIT Church is really nice in some ways, but the
inference is imperfect (samples are not actually independent).
Heuristics would undoubtedly be faster, and might work better.

Why MIT Church, and not, say, Python? Just something you’d been
playing with, or is there some other reason?

MIT Church has a lot of rough edges, but it makes performing this kind of inference very simple. Essentially, you can write code to generate a random sample from some distribution (a generative model), and it will find samples that match a given set of parameters, even when drawing a sample with those parameters by chance is highly improbable. That code contains a routine to generate a random CSL substring from a distribution defined by the prefixes, suffixes, and group delimiters in the CSL repository, which is very large. Church’s mh-query function takes that function and samples that very large distribution of substrings for a CSL substring that matches the given output. Since the CSL generating routine is more likely to give samples that more closely resemble the repository, CSL substrings are more likely to resemble those in the repository than not. Church is intended to make writing code to perform this kind of inference very easy.
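
For anyone unfamiliar with Church, the generative-model-plus-mh-query pattern described here can be imitated in plain Python; the sketch below is a toy re-creation with made-up affix counts, not the actual csl-inference code:

```python
import random

# Toy re-creation of the mh-query idea: affix counts stand in for
# frequencies harvested from the CSL repository (numbers are made up).
AFFIX_COUNTS = {"": 50, "(": 20, ")": 20, ", ": 30, ". ": 25}
AFFIXES = list(AFFIX_COUNTS)
TOTAL = sum(AFFIX_COUNTS.values())

def prior(affix):
    # Repository frequency of the affix (the generative model's bias).
    return AFFIX_COUNTS[affix] / TOTAL

def likelihood(candidate, observed):
    # Crude match score: exact match vs. anything else.
    return 1.0 if candidate == observed else 0.01

def mh_mode(observed, steps=2000, burn=200, seed=0):
    """Metropolis-Hastings over affixes: propose uniformly, accept
    in proportion to repository frequency times match quality, and
    return the most-visited state after burn-in."""
    rng = random.Random(seed)
    current = rng.choice(AFFIXES)
    visits = {}
    for i in range(steps):
        proposal = rng.choice(AFFIXES)
        ratio = (prior(proposal) * likelihood(proposal, observed)) \
              / (prior(current) * likelihood(current, observed))
        if rng.random() < min(1.0, ratio):
            current = proposal
        if i >= burn:
            visits[current] = visits.get(current, 0) + 1
    return max(visits, key=visits.get)
```

Because the generative bias favors frequent repository affixes, ambiguous outputs resolve toward common patterns, which is exactly the property wanted in step (3) above.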

Unfortunately, Church is very computationally intensive, and, judging by the results, the algorithm it uses for inference (Metropolis-Hastings) might be suboptimal for this kind of problem, so I’m not sure this code has much of a future beyond serving as a proof of concept.

Implementing this might end up being a lot of work, but I think it’s
possible in principle. The UI is very simple if it can be made to work well
enough; the difficulty is in programming it. I won’t have any time to do
this for quite a while, but it could be a fun project.

Cool; thanks for putting it up on github!

If you get a chance, do you think you could convert the README to
markdown, so that it will render correctly (complete with
syntax-highlighting) in the browser?

If the source is LaTeX, pandoc will convert it for you, except maybe
for the syntax highlighting. For that, see this source:

https://raw.github.com/seancribbs/ripple/6b62eee9301b654d937b0f85706e6cc72ad88352/README.markdown

… which will render like:

https://github.com/seancribbs/ripple/blob/master/README.markdown

Hence, XML highlighting with this:

<foo>bar</foo>

Thanks. The original file is in fact LaTeX, so I’ll give this a try.

Simon

Do you have some thoughts on a possibly more appropriate algorithm,
should someone want to explore alternatives?

[…snip…]

Bruce

Oh, sorry …

On Tue, Jul 26, 2011 at 4:18 PM, Bruce D’Arcus wrote:

Do you have some thoughts on a possibly more appropriate algorithm,
should someone want to explore alternatives?

You wrote in the first message:

“Heuristics would undoubtedly be faster, and might work better.”

Bruce

I just spent some time getting FreeCite running locally. The project
has been largely dormant for two years or so, but there’s someone
who’s been committing to a fork on Github lately, and I was able to
get it to work on my machine pretty quickly, once I remembered my
Rails mambo. It works somewhat better than the current hosted version
at Brown: it at least recognizes post-1999 dates. If we could build in
some capability for the user to override the tags (an interactive
review), then I think it’d make a reasonable platform.

I think one of the issues that FreeCite struggles with is limited
training data: we should be able to provide strong data on things
like author names, place names, publishers and the like (from the data
stores of Zotero and perhaps Mendeley), that might make the tagging
more accurate. We can also produce tagged training data using
citeproc-js and known inputs to give good, comprehensive descriptions
of major patterns in citation formatting.
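
In outline, the tagged-training-data idea might look like this (a toy formatter stands in for citeproc-js; the segment layout and labels are invented for the example):

```python
# Toy formatter standing in for citeproc-js: formats known fields and,
# in parallel, emits (token, label) pairs suitable for training a
# sequence tagger. A real version would drive the processor with an
# actual CSL style instead of this hard-coded segment list.
def format_and_tag(item):
    segments = [
        ("author", item["author"]),
        ("other",  " ("),
        ("year",   item["year"]),
        ("other",  "). "),
        ("title",  item["title"]),
        ("other",  "."),
    ]
    formatted = "".join(text for _, text in segments)
    labeled = [(token, label)
               for label, text in segments
               for token in text.split()]
    return formatted, labeled

item = {"author": "Doe, J.", "year": "2011", "title": "On parsing"}
formatted, labeled = format_and_tag(item)
# formatted == "Doe, J. (2011). On parsing."
```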

Avram

Dear Avram,

I’m returning to this thread to shamelessly plug the citation parser I wrote in the last couple of weeks:

https://github.com/inukshuk/anystyle-parser

I had to parse about 8000 references and was not satisfied with the results I got using ParsCit and FreeCite. The parser follows the same general approach, but I’ve extended and (I hope) improved much of the feature elicitation; also, I’m using wapiti instead of libcrf++, which, IMO, has a much cleaner codebase, and because I personally preferred a C implementation over a C++ one. In any case, wapiti is extremely fast, and my models produced very encouraging results for my data once I had trained on about 30 references (in addition to the CORA dataset).

Picking up on your idea, it would be extremely easy to adapt CSL styles to generate tagged output. Thus, we could automate the process of producing valid training data, as you suggest.

Anyway, I thought I’d let you (and anyone interested in parsing citation references) know about the project. If you want to try out the parser but encounter any problems, don’t hesitate to contact me for help. A word of caution: if your results are not accurate right away, try to tag one or two references and train the parser – I tried to make training the parser with new references very easy.

/end shameless plug

Best,
Sylvester

Dear Avram,

I’m returning to this thread to shamelessly plug the citation parser I wrote in the last couple of weeks:

https://github.com/inukshuk/anystyle-parser

Cool!

I had to parse about 8000 references and was not satisfied with the results I got using ParsCit and FreeCite. The parser follows the same general approach, but I’ve extended and (I hope) improved much of the feature elicitation; also, I’m using wapiti instead of libcrf++, which, IMO, has a much cleaner codebase, and because I personally preferred a C implementation over a C++ one. In any case, wapiti is extremely fast, and my models produced very encouraging results for my data once I had trained on about 30 references (in addition to the CORA dataset).

Picking up on your idea, it would be extremely easy to adapt CSL styles to generate tagged output. Thus, we could automate the process of producing valid training data, as you suggest.

So just to understand, are you volunteering to work up a
proof-of-concept of Simon’s idea with your new tool? :-)

Bruce

Hi,

On 8 September 2011 13:08, Sylvester Keil wrote:

Picking up on your idea, it would be extremely easy to adapt CSL styles to generate tagged output. Thus, we could automate the process of producing valid training data, as you suggest.

I have a very dirty patch on top of citeproc-js to tag the output.
It was not always working. I’ll send it to you by mail (since it
needs some magic to work properly, sadly). It’s not suitable for
production straight away, and it was for a very old citeproc-js… so
don’t expect anything too good.

It is on my (ever-growing) list of ideas to try out, yes. :-) However, for the time being, I get satisfying results by just tagging a few representative references. Carles just made the suggestion to have the citation processor produce tagged output (instead of altering the CSL style), which, now that I think of it, would be a better approach, because it would not involve changing individual styles.

I remember that there is at least one feature in CSL which involves tracking the currently processed item and monitoring which of its attributes are being requested. Perhaps it would be possible to use a similar approach during processing and then inject tags at the end.

Having said that, however, I am not convinced that this is really necessary. In my (brief) experience, it is more efficient to improve the feature elicitation and/or the statistical model than to have a well-nigh unlimited supply of well-formed training data. In fact, it may be that badly formatted data makes for more valuable input.

Sylvester