So, I have a crazy idea of how to shift as much of the complexity of
generating CSL away from the user as possible. Essentially, I want to be
able to copy and paste bibliography entries from a journal’s reference list
into a box and end up with a formatted style.
Indeed, this would probably be the ideal. One caveat: most of the time,
the examples in a reference list aren’t extensive enough to cover
everything authors need, so the code should account for that if it can.
That’s the rationale behind using existing macros when they fit, instead of trying to infer everything, but there may still be some issues with this.
As far as the implementation goes, we would need to:
1. Convert the bibliography entries to a series of labeled fields using a
parser such as FreeCite.
2. Where possible, string together macros from existing styles to generate
the output.
3. If the output contains a substring that cannot be generated using
existing macros, create a new macro that produces only that substring and
use existing macros for the rest. To avoid creating macros that work for
only a limited set of references (e.g., “(” as a prefix on one element and
“)” as a suffix on a different element), the new macro would need to be
chosen either with a statistical model based on the distribution of
prefixes, suffixes, and group delimiters in the CSL repository, picking the
most likely macro, or with a set of heuristics.
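To make the “substring that existing macros cannot generate” concrete, here is a minimal Python sketch. It assumes the parser has already returned the fields in the order they appear in the reference; the function name `split_literals` and the example reference are made up for illustration and are not part of FreeCite or csl-inference.

```python
def split_literals(reference, fields):
    """Return the literal text before, between, and after the field values.

    `fields` is a list of (name, value) pairs, assumed to be in the same
    order as the values appear in `reference` (a simplifying assumption).
    """
    literals = []
    cursor = 0
    for name, value in fields:
        pos = reference.find(value, cursor)
        if pos == -1:
            raise ValueError(f"field {name!r} not found after position {cursor}")
        # Everything between the previous field and this one is literal text
        # that some prefix, suffix, or delimiter has to produce.
        literals.append(reference[cursor:pos])
        cursor = pos + len(value)
    # Trailing text after the last field.
    literals.append(reference[cursor:])
    return literals


# Hypothetical parser output for a hypothetical reference.
reference = "Doe, J. (2010). A title. Journal, 5, 1-10."
fields = [
    ("author", "Doe, J."),
    ("year", "2010"),
    ("title", "A title"),
    ("journal", "Journal"),
    ("volume", "5"),
    ("pages", "1-10"),
]

for literal in split_literals(reference, fields):
    print(repr(literal))
```

The literals around the year come out as `' ('` and `'). '`. Deciding how to divide those characters among a prefix, a suffix, and a group delimiter is the ambiguous part that step 3 has to resolve.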
As far as (3) goes, I made a naive implementation of the former (the
statistical model) in Scheme/MIT Church (simonster/csl-inference on GitHub)
that mostly works. MIT Church is really nice in some ways, but the
inference is imperfect (samples are not actually independent). Heuristics
would undoubtedly be faster, and might work better.
Why MIT Church, and not, say, Python? Just something you’d been
playing with, or is there some other reason?
MIT Church has a lot of rough edges, but it makes performing this kind of inference very simple. Essentially, you write code that generates a random sample from some distribution (a generative model), and Church finds samples that satisfy a given condition, even when drawing such a sample by chance is highly improbable. My code contains a routine that generates a random CSL substring from a distribution defined by the prefixes, suffixes, and group delimiters in the CSL repository; the space of possible substrings is very large. Church’s mh-query function takes that routine and searches that space for a CSL substring that produces the given output. Since the generating routine favors prefixes, suffixes, and delimiters that are common in the repository, the CSL substrings it returns tend to resemble those in the repository. Church is intended to make writing code for this kind of inference very easy.
Unfortunately, Church is very computationally intensive, and judging by the results, the algorithm it uses for inference (Metropolis-Hastings) might be suboptimal for this kind of problem, so I’m not sure this code has much of a future beyond serving as a proof of concept.
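For readers who don’t know Church, the idea can be sketched in Python with a toy Metropolis-Hastings sampler. This is not how csl-inference or mh-query is actually implemented: the hypothesis space is reduced to a (suffix, delimiter) pair, the counts are invented rather than taken from the CSL repository, and the function name `mh_query` is only borrowed for readability.

```python
import random
from collections import Counter

# Invented frequency counts; real ones would come from the CSL repository.
SUFFIX_COUNTS = {"": 50, ")": 20, ").": 5, ".": 30}
DELIM_COUNTS = {". ": 40, " ": 30, ", ": 25, "": 5}


def draw(counts, rng):
    """Draw one key with probability proportional to its count."""
    items = list(counts)
    return rng.choices(items, weights=[counts[i] for i in items])[0]


def propose(rng):
    """Generative model: sample a (suffix, delimiter) pair from the prior."""
    return (draw(SUFFIX_COUNTS, rng), draw(DELIM_COUNTS, rng))


def likelihood(hypothesis, observed):
    """1 if the hypothesis reproduces the observed text exactly, else 0."""
    suffix, delimiter = hypothesis
    return 1.0 if suffix + delimiter == observed else 0.0


def mh_query(observed, steps, rng):
    # Initialise by rejection so the chain starts in a consistent state.
    state = propose(rng)
    while likelihood(state, observed) == 0.0:
        state = propose(rng)
    samples = []
    for _ in range(steps):
        candidate = propose(rng)
        # Proposals come from the prior, so the acceptance ratio reduces
        # to the likelihood ratio (here either 0 or 1).
        if rng.random() < likelihood(candidate, observed) / likelihood(state, observed):
            state = candidate
        samples.append(state)
    return samples


samples = mh_query("). ", 2000, random.Random(0))
print(Counter(samples).most_common())
```

Two hypotheses reproduce `"). "` here: suffix `")"` with delimiter `". "`, and suffix `")."` with delimiter `" "`. The sampler visits the first far more often because both of its parts are more common under the assumed counts. Consecutive samples repeat whenever a proposal is rejected, which is one way to see why samples from this kind of chain are not independent.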
Implementing this might end up being a lot of work, but I think it’s
possible in principle. The UI is very simple if it can be made to work well
enough; the difficulty is in programming it. I won’t have any time to do
this for quite a while, but it could be a fun project.
Cool; thanks for putting it up on GitHub!
If you get a chance, do you think you could convert the README to
markdown, so that it will render correctly (complete with
syntax-highlighting) in the browser?
If the source is LaTeX, pandoc will convert it for you, except maybe
for the syntax highlighting. For that, see this source:
https://raw.github.com/seancribbs/ripple/6b62eee9301b654d937b0f85706e6cc72ad88352/README.markdown
… which will render like:
https://github.com/seancribbs/ripple/blob/master/README.markdown
Hence, XML highlighting with this:
<foo>bar</foo>
Thanks. The original file is in fact LaTeX, so I’ll give this a try.
Simon