CSL Funds & Projects

As an update, @retorquere has done an amazing job updating Sheldon so that we now get previews of changes in PRs. He’s put a lot of work into this and still offered to do it at the low end of our suggested rate above, i.e., for US$1,000.

This is already making our work reviewing style PRs easier, so Rintze and I would be more than happy to pay this out. We’ll wait a week for any concerns raised here and then, absent objections, pay Emiliano.

We’re still looking for someone who wants to take on the csl-editor update. Please post here and/or to a separate thread.

These previews are pretty amazing!

I’m happy they help.

In what sense are you guys looking to modernize the CSL editor? Just an update to ES6? Of the demo site or of the cslEditorLib? I could do that but that seems like a marginal win. Do you mean streamlining the deployment of the demo site (stuff like webpacking for example)?
Or do you mean something like React-ifying the editor?

@retorquere

  1. Seeing no objections to releasing the pay – are you able to send me an invoice for US$1,000? Can be informal and by email.
  2. For the web editor, we’re honestly not 100% sure. Here are some general thoughts:
    a) cslEditorLib relies on a bunch of dependencies, some of them outdated. It’d be nice to make sure this is all in shape, and more generally, that the JS used is up to date (including ES6)
    b) the whole deploy process is tedious: in cslEditor, update submodules, update citeproc separately, then regenerate the example citations, then go over to the demo site, update the processor, then deploy. There has got to be a way to make that simpler or even automate it
    c) The generation of sample citations is ugly – it writes one single massive (and increasingly massive) JSON file using absurd amounts of memory (IIRC that’s a common problem in writing JSON in JavaScript but there are solutions). Speeding that up and making it more robust (i.e., less memory intensive) would be great.
    d) While a-c would just be updates, the real vision would be to fundamentally change the basic functionality: instead of requiring specific citations, we would use something like the citation parser behind anystyle.io and allow people to just paste any citation (e.g. the samples for the author guidelines). Sylvester seemed to think that was totally doable and if it were that’d be a total game changer.
  1. I’ve not sent an invoice in my life. If you can give me the text that would work for you I’ll gladly bounce that back to you.

2.a AFAICT, all dependencies are git submodules, not npm packages, so the best that can be achieved there is a pull from their head. Codemirror is available as an npm package, as is jstree, but jstree for example hasn’t been updated in 7 years, so I’m not sure using package versioning would help there. I’m not familiar with the module system used in the editor (I see a “define” for example that looks like it brings in libraries but not sure where that comes from – maybe we should bring in Steve Ridout as the lead on this stuff and I could just assist him where it’s helpful).

2.b should be doable using Travis

2.c why does it write one massive (and growing) JSON file? generateExampleCitations.js doesn’t currently run for me so it’s not easy for me to establish what it does exactly.

2.d I don’t yet understand what the specific citations (from 2.c?) do, and how they are used in the CSL editor. Also, anystyle.io is written in Ruby, so I don’t yet grok how it would be incorporated in the JS-based CSL editor – this seems best left to Sylvester if you ask me.

From what I’ve seen, anystyle.io only determines which parts of the citation correspond to which properties, not the style. If that’s going to be used, it’d still have to generate citations in multiple styles to match the inputted citation to the styles. So, the big JSON file would probably not be needed (unless we want to do both) and we would have to decide whether to host an API for citation generation or have it in the browser. The generation is probably pretty costly, so I’m not sure about an approach there.

2a) It’d be great to get Steve on this, but from our last communications (a couple of years back) I’m not optimistic. All dependencies (except for citeproc-js) are git submodules, yes.
2c) The JSON contains the sample citations in all ~2k styles – it’s growing because we add more styles and it’s enormous for obvious reasons
2d) So the editor’s search by example function does a match of the entered example against the examples (in the JSON from 2c). The idea would be to allow any example and parse that. As Lars says, that does require on-the-fly generation of the styles. The preference would be to run this in the browser so we don’t have to deal with server, user data, and security. I’m not sure if this is possible, but as I said, that’d be a total gamechanger. If this is possible but requires more work than our small budget allows, this one I’d also try to find additional money for.
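
To make the matching idea in 2d) concrete, here is a minimal sketch of how an entered example could be ranked against pregenerated examples. This is not the csl-editor implementation: the function names, the style IDs `style-a` and `style-b`, and the flat `{ styleId: renderedString }` shape are all assumptions for illustration. It uses character-level edit distance so that punctuation and ordering, which are what distinguish styles, affect the score.

```javascript
// Hypothetical sketch, NOT the csl-editor code: rank pregenerated examples
// by character-level similarity to the example the user typed in.

// Classic Levenshtein edit distance, keeping only two rows in memory.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Similarity in [0, 1]; 1 means the strings are identical.
function similarity(a, b) {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}

// `examples` maps an (assumed) style ID to its pregenerated rendering.
function rankStyles(input, examples) {
  return Object.entries(examples)
    .map(([styleId, rendered]) => ({ styleId, score: similarity(input, rendered) }))
    .sort((x, y) => y.score - x.score);
}

// Made-up data for illustration only.
const examples = {
  "style-a": "Hall, H. K. (1957). Correlation of the Base Strengths of Amines.",
  "style-b": "H. K. Hall, Correlation of the Base Strengths of Amines, 1957.",
};
const ranked = rankStyles(
  "Hall, H. K. (1957). Correlation of the base strengths of amines.",
  examples
);
console.log(ranked[0].styleId);
```

Because the score degrades gradually, slightly different capitalization or spacing in the input still ranks the closest style first, which is what arbitrary pasted examples would need.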

FWIW – I had thought of this as a natural project for Lars, but happy to have you collaborate, have one take the lead and have the other advise, whatever works. Just please agree on any monetary components beforehand so that there’s no unnecessary conflict.

For pricing, I’d say we’d be happy to offer $2,000 for modernization along the lines of a-c (within the realms of what’s possible) and another $2,000 if d) is possible.

Let me know what you think.

Back-of-the-envelope numbers based on the trials summarized in “Performance testing?” suggest about three to five minutes to generate a full set of samples from arbitrary input. There is some overhead to instantiate the processor for each style. In the test rig:

Cell 83ms
Chicago Fullnote 133ms
APA 257ms

Taking 90ms as an optimistic average build time, and 15ms as rendering time for three additional cites (citeproc-ruby clocks at about 20ms, this is again on the optimistic side for citeproc-js) yields 4.4 minutes as a low-end estimate:

bash> echo 4k 1964 d 0.090* r 0.015 3**+ 60/p|dc
4.4190
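
For anyone who does not read dc, the same estimate can be written out in JavaScript as a cross-check. The constant and function names are made up for illustration; the figures are the ones in the dc expression (1964 is presumably the number of styles, though the post does not say so explicitly).

```javascript
// Low-end estimate of the time needed to generate a full sample set.
// Figures come from the dc one-liner above; names are illustrative only.
const STYLE_COUNT = 1964;    // count used in the dc expression (assumed to be the style count)
const BUILD_SECONDS = 0.090; // optimistic average processor build time per style
const CITE_SECONDS = 0.015;  // optimistic rendering time per additional cite
const EXTRA_CITES = 3;       // additional cites rendered per style

function estimateMinutes(styles, build, cite, cites) {
  // Per-style cost is one build plus the rendering of the extra cites.
  return (styles * (build + cite * cites)) / 60;
}

const minutes = estimateMinutes(STYLE_COUNT, BUILD_SECONDS, CITE_SECONDS, EXTRA_CITES);
console.log(minutes.toFixed(4)); // 4.4190
```

Writing it this way makes it easy to plug in less optimistic build times, such as the 257ms measured for APA.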

I don’t need to be lead on this. I figured since there was no uptake yet I could at least get some small things off the ground just to kick off, but it’s slow going for me.

It seems to me in the current implementation, the samples only need to be updated when there’s a change on the styles repo. Even if generation takes long, that shouldn’t matter; Travis could take care of this. But even for this constrained case, I’m still not grokking the basics; citeproc-js is available as a node module, so I’m still trying to figure out why pregeneration uses requirejs, or why there’s a JSDOM and jQuery dependency (it runs in Node, yeah? Not the browser?). And this is just a minor and relatively isolated part of the stack.

All that goes to show that I’m currently unqualified to take the lead here. I’m not even sure I want to be paid for the minor work I could do on this as things stand currently. If Lars wants to take point on this, I’m a-OK with this. If Lars gets the full funds and I can help here and there, I’m also a-OK with this. I declare here and now and definitively that if Lars gets involved, any budding monetary dispute is pre-solved because even if we both do half the work (which seems unlikely), I’m OK with Lars getting the full sum. If Lars joins, I’ll be happy to reiterate this, or, if that makes people uncomfortable, I will take a small, pre-determined amount.

WRT anystyle, it may be possible to bodge something together analogously to:

  1. run a citation through anystyle (e.g. Putnam, Hilary [1985] A comparison of something with something else, New Literary History, vol. 17, pp. 61–79.)
  2. We get back [{"author":[{"family":"Putnam","given":"Hilary"}],"title":"A comparison of something with something else","container-title":"New Literary History","volume":"17","page":"61–79","language":"en","issued":{"date-parts":[[1985]]},"id":"putnam1985a","type":"article-journal"}]
  3. We take a pre-generated sample citation of type article-journal and in its formattedBibliography replace the author, title, year etc by a naive text match (this is the bodge part)
  4. Run the search as usual.

But that does depend on the existing search being fuzzy, not crisp, because the odds of getting crisp input this way are minimal.
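
A minimal sketch of step 3, assuming the pregenerated sample keeps its source item alongside the formatted string. The function name, the choice of fields, and the sample data are hypothetical and do not come from csl-editor or anystyle; only the Putnam item is taken from step 2.

```javascript
// Hypothetical sketch of the "bodge": replace the sample item's field values
// in the formatted output with the values anystyle parsed from the input.
function substituteFields(formatted, sampleItem, parsedItem) {
  const familyOf = (item) =>
    item.author && item.author[0] ? item.author[0].family : undefined;
  const yearOf = (item) =>
    item.issued && item.issued["date-parts"]
      ? String(item.issued["date-parts"][0][0])
      : undefined;

  // [value in the sample, value from the parsed citation]
  const pairs = [
    [sampleItem.title, parsedItem.title],
    [sampleItem["container-title"], parsedItem["container-title"]],
    [familyOf(sampleItem), familyOf(parsedItem)],
    [yearOf(sampleItem), yearOf(parsedItem)],
  ];

  let result = formatted;
  for (const [from, to] of pairs) {
    // split/join does a literal replace-all, so no regex escaping is needed.
    if (from && to) result = result.split(from).join(to);
  }
  return result;
}

// Made-up sample entry for illustration.
const sampleItem = {
  author: [{ family: "Hall", given: "H. K." }],
  title: "Correlation of the Base Strengths of Amines",
  "container-title": "Journal of the American Chemical Society",
  issued: { "date-parts": [[1957]] },
};
const sampleFormatted =
  "Hall, H. K. (1957). Correlation of the Base Strengths of Amines. " +
  "Journal of the American Chemical Society, 79(20).";

// The item anystyle returned in step 2 (abridged).
const parsedItem = {
  author: [{ family: "Putnam", given: "Hilary" }],
  title: "A comparison of something with something else",
  "container-title": "New Literary History",
  issued: { "date-parts": [[1985]] },
};

const bodged = substituteFields(sampleFormatted, sampleItem, parsedItem);
console.log(bodged);
```

The result still carries the sample's initials and volume ("H. K.", "79(20)"), which illustrates why the search downstream would have to tolerate inexact input.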

In principle it should be possible to extend anystyle to do style detection, but I’m not sure where we’d get a sufficient corpus to train it on.

Specifically about citeproc-js, it looks like the code is attempting to use xmldom.js to instantiate styles and locales for consumption by the processor. That’s no longer necessary. In a modern JS environment, citeproc-js runs faster against serialized XML straight from the disk.

There is also code in there for running the processor in Rhino. That should be removed. As Steve notes in the log, it’s 10x slower than node, so there’s no reason to keep it around.

There should be opportunities for simplification in pregeneration; JS and node have come a long way since the editor was written. A rewrite could be tough for the person who takes it on, though. One problem would be to figure out what is going on with build.js, a generated file that seems central to pregeneration. Code comments point at a file example.build.js for hints on its structure and configuration, but no file of that name exists in the source. I guess you would need to reverse engineer the spec from the code and its output.

Thanks Frank, that’s super helpful. I was wondering why the style generation was so miserably slow, blaming it mainly on how the JSON is written, but this explains a lot.

While the pregenerated JSON file is large (7MB), it loads to memory in about 80ms. The larger cause of lag-time would be network latency. A couple of things could be done to improve things there. Simplest would be to normalize the entries (as in a database), assigning IDs to strings, then using the ID everywhere to avoid replication. Here’s what the gains from normalization of citation and bibliography output would look like:

total examples 21461
unique citations 3274 (potential reduction of 18187 strings, or -84.0%)
unique bibentries 16463 (potential reduction of 4998 strings, or -23.0%)

Together with shortened versions of some of the keys in the JSON (such as the style IDs), normalization could significantly reduce the bulk of data shipped to the browser.
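
A minimal sketch of that kind of normalization, assuming a simplified input shape of `{ style, citation, bibliography }` records (the real examples file is structured differently, and the names here are hypothetical). Each distinct string is stored once and referenced by its index; for brevity this uses one shared table rather than separate ones for citations and bibliography entries.

```javascript
// Hypothetical sketch: store each distinct output string once and refer to it
// by numeric ID, as one would in a normalized database table.
function normalize(examples) {
  const strings = [];    // unique strings; the array index is the ID
  const ids = new Map(); // string -> ID
  const intern = (s) => {
    if (!ids.has(s)) {
      ids.set(s, strings.length);
      strings.push(s);
    }
    return ids.get(s);
  };
  const entries = examples.map((ex) => ({
    style: ex.style,
    citation: intern(ex.citation),
    bibliography: intern(ex.bibliography),
  }));
  return { strings, entries };
}

// Inverse operation, run in the browser after download.
function denormalize({ strings, entries }) {
  return entries.map((e) => ({
    style: e.style,
    citation: strings[e.citation],
    bibliography: strings[e.bibliography],
  }));
}

// Made-up data: two styles share the same in-text citation.
const sampleExamples = [
  { style: "a", citation: "(Hall, 1957)", bibliography: "Hall, H. K. (1957). Title." },
  { style: "b", citation: "(Hall, 1957)", bibliography: "Hall, H.K. 1957. Title." },
  { style: "c", citation: "[1]", bibliography: "1. Hall HK. Title. 1957." },
];
const normalized = normalize(sampleExamples);
console.log(normalized.strings.length); // 5 unique strings instead of 6
```

Since citations repeat far more often than bibliography entries in the figures above, most of the saving would come from the citation strings.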

Another optimization that would be more intrusive would be to classify the styles, and download data only for styles of a selected family (author-date, numeric, note, etc).

Edit: On a quick trial, it looks like normalizing IDs and shipping compact JSON would give you about a 34% reduction in size for the examples, if coupled with removal of the statusMessage element. That’s currently showing an error for only one style (“Environmental Chemistry”, probably because it uses an unknown locale, “en-AU”).

Edit2: Normalizing the style IDs (i.e. by registering a numeric ID in cslStyles.json and using it in the examples file) brings the size of the file down to 3.4MB, less than half the current size.

I started a rewrite two years ago, but got stuck on names (especially the substitute suppression part): larsgw/csl-js (Lightweight CSL Engine) on GitHub. It’s missing a lot of other stuff too, but the performance seems good so far:

csl-js

  • 12ms: importing
  • 38ms: parsing & loading (which can all be pre-generated and saved quite easily, currently ~93KB)
  • 0.5ms: engine instantiation
  • 3.5ms: bibliography formatting (one item)
  • 0.5ms: citation formatting (one item)

(1957). Correlation of the Base Strengths of Amines 1 () . Journal of the American Chemical Society, 79(20), 5441-5444. https://doi.org/10.1021/ja01577a030


(1957)

citeproc-js

  • 4ms: importing (I think)
  • 32ms: parsing
  • 341ms + 45ms: engine instantiation & loading (main culprit) — not sure what parts of that can be saved
  • 19ms: bibliography formatting (one item)

Hall, H. K. (1957). Correlation of the Base Strengths of Amines 1. Journal of the American Chemical Society, 79(20), 5441–5444. https://doi.org/10.1021/ja01577a030


(Hall, 1957)

Nice. It will be good one day to have cleaner, more modern, better documented alternatives to citeproc-js. In my case, building the processor was a grudge match (put more nicely, a “labor of love”), but implementing CSL is a long slog, and processor developers of the next iteration really need to be well rewarded for the sake of the ecosystem as a whole.

Apart from other issues, @Denis_Maier and I just came across an issue with the use of cs:choose in the context of cs:substitute. I’ve opened an issue in the styles repo that goes over the details, but the short version for the Visual Editor is that “Conditional” should not be offered as an option immediately under “Substitute.”

Unfortunately, a quick scan of the csl-editor code suggests that this will be difficult to do. There is no mention of substitute outside of the processor code. I think this means that the option panel is just displaying all of the rendering elements there, including cs:choose, so there isn’t a ready-made configuration setting where a restriction could be imposed. Looks hard from here, anyway.

If things are left as they are, styles emerging from the Visual Editor might just be post-processed to wrap cs:choose in a group in that context. That would sometimes produce unexpected results, but at least the cause of breakage would be relatively easy to track down, and all processors would produce identical output.
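
As a rough illustration of that post-processing idea, here is a sketch that works on a simplified element tree of `{ name, attrs, children }` objects rather than on real XML. The tree shape and the function name are assumptions for illustration; an actual pass would operate on the style's XML with whatever parser the editor uses.

```javascript
// Hypothetical post-processing pass: wrap every cs:choose that sits directly
// under cs:substitute in a cs:group. Returns a new tree; the input is untouched.
function wrapChooseInSubstitute(node) {
  const children = (node.children || []).map(wrapChooseInSubstitute);
  if (node.name !== "substitute") {
    return { ...node, children };
  }
  return {
    ...node,
    children: children.map((child) =>
      child.name === "choose"
        ? { name: "group", attrs: {}, children: [child] }
        : child
    ),
  };
}

// Made-up fragment: names > substitute > (choose, text).
const namesNode = {
  name: "names",
  attrs: { variable: "author" },
  children: [
    { name: "name", attrs: {}, children: [] },
    {
      name: "substitute",
      attrs: {},
      children: [
        { name: "choose", attrs: {}, children: [] },
        { name: "text", attrs: { variable: "title" }, children: [] },
      ],
    },
  ],
};

const fixed = wrapChooseInSubstitute(namesNode);
console.log(JSON.stringify(fixed.children[1].children.map((c) => c.name)));
// ["group","text"]
```

A cs:choose anywhere else in the tree is left alone, so the pass only changes the one context under discussion.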

Hi all, it’s been a while but I’m back. The big news is that I have an arrangement with CDS/Zotero now, to work on citeproc-rs, aiming at including it in a Zotero release. I will be putting some serious time into it this year.

I don’t have structured HTML output on the roadmap so suitability for the Visual Editor is perhaps a while away. But performance-wise, it’s on good footing. I don’t know what style(s) you’ve chosen to benchmark, but here are some numbers on an old power-saving i7-2677M CPU @ 1.80GHz, compiled to WebAssembly in release mode:

  • Parse apa.csl and instantiate = 6ms in Firefox, 14ms Chrome (where about 7ms is consistently waiting for V8, so this might improve with time).
  • Build a single cite = somewhere under 1ms to build the cite below (not APA, it doesn’t do bibliographies). Firefox is capped at 1ms resolution and often doesn’t show up, Chrome says 0.6-0.9ms. If we really cared about this I could use a better benchmark than eyeballing flame graphs.

Where The Vile Things Are, Kurt Camembert, 09/08/1999, 56.

  • Add ~1ms for fetching & parsing a locale, but I can’t measure that directly. Importing a reference library is not included in my figures; are the other citeprocs doing that (and doing pre-computation for it)? In my case this is another ms or two. For me ‘instantiation’ means turning a generic parsed XML tree into a purposeful Style struct.
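
For sub-millisecond steps like building a single cite, one simple way around a timer capped at 1ms resolution is to time many iterations and divide. The sketch below is a generic helper, not part of any of the processors discussed; `meanMs` and the placeholder workload are made up for illustration.

```javascript
// Generic timing helper: runs fn many times and returns the mean time per
// call in milliseconds. Batching works around coarse timer resolution.
function meanMs(fn, iterations = 1000) {
  const start = performance.now();
  for (let i = 0; i < iterations; i++) {
    fn();
  }
  return (performance.now() - start) / iterations;
}

// Placeholder workload; swap in the processor call being measured.
const perCall = meanMs(() => JSON.parse('{"type":"article-journal","volume":"17"}'));
console.log(`mean per call: ${perCall.toFixed(4)} ms`);
```

Note that a mean over many runs reflects warmed-up code, so it would understate the cost of a cold first call such as the initial style instantiation.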

Congratulations! Very good to hear you will be receiving support for your work.

Beggars can’t be choosers, but I hope CSL-M support is still in your development plans. There have been some recent changes as we’ve adapted Jurism to the requirements of a European case law project, and I’ll be updating the documentation soon. Do let me know if you have any questions or preferences at that end.

Not sure what you mean by importing a reference library. If it has to do with items that might not be cited, citeproc-js only carries records of items referenced in citations or the bibliography.