Bundled citation styles and processor?

Hey,

While dealing with the question of whether to store JSON or XML, it suddenly hit me that storing these files in our database may not be the best way to do it at all.

We have had the citation style XML files for a few select styles in our database for a few years now. Having them in the database is not really the best solution, and it also limits our users to just those styles that are locally installed.

So I have an idea: I could create a module for npmjs that bundles all 2000 independent styles (converted to JSON) from the style repository together with citeproc-js. To keep the download size reasonable, it would use dynamic imports to load a given style and locale before running citeproc-js on it. The package would be a wrapper around citeproc-js and would provide an API that is just about the same, with the addition that one can specify styles by their short title, like so:

new CSL.Engine(citeprocSys, 'AMR')

This would then first use a dynamic import to load the AMR style file and then initialize citeproc-js with it. The syntax can be slightly different so that developers don't get confused about whether they are instantiating citeproc-js or the wrapper.

The advantages of this approach would be:

  • No need to deal with headaches concerning JSON/XML in the database or with caching, etc., as it's all just part of the JavaScript packaging. Nor does one have to consider situations in which users don't have any style installed at all.

  • The JSON version of the styles will not be user facing at all. Currently I cannot rule out that someone running their own server is looking at the JSON in the database.

  • Possible collaboration with other projects to improve the bundling over time.

But before I start on this, I wanted to ask whether there is anything fundamentally wrong with this approach, or whether it has perhaps been done before. Of course, all the various licenses have to be respected, so there will need to be some kind of large disclaimer about where it all comes from, who wrote the various parts, etc. Beyond that, it would be good to know whether you would prefer such a package to be named something close to citeproc-js, so people associate it with that, or something very different, so that there is no association.

It would also be good to hear if anyone has been working on compressing the styles further in JavaScript. By making some basic modifications to the JSON, I've gotten it down to about 20 MB for all styles combined. I have seen claims of compressing JSON to around 20% of its original size; 4 MB altogether for all the styles should be quite acceptable even for browsers these days.

There are a few things you haven’t thought of.

First, you can't mirror the citeproc-js API. There is no such thing as an asynchronous constructor, so new Engine(..., 'STY') can't trigger a download and return a promise of an engine later. The dynamic imports would have to be explicit: import('whatever/apa.csl').then(apa => new Engine(..., apa)).
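
To illustrate, a rough sketch of what that explicit form looks like wrapped in an async factory (the package path and factory name here are made up):

async function createEngine(citeprocSys, styleSlug) {
    // The dynamic import has to finish before construction;
    // the engine constructor itself stays synchronous.
    let style = await import(`csl-style-bundle/${styleSlug}.js`);
    return new CSL.Engine(citeprocSys, style.default);
}

// Usage: the caller gets a promise, not an engine.
createEngine(citeprocSys, 'apa').then(engine => { /* use engine */ });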

Second, if it's an NPM package, it has to be updated by everyone who depends on it. The pace at which the styles repo evolves would mean a very significant amount of churn; I'm talking daily updates. Your users would be waiting for you to push out new releases whenever changes hit the styles repo. Even if the package were rebuilt on each push to styles, you would still have to pull the latest version of it into your product.

Third, it's just too big. Even a mythical 4MB after compression is too big. True, some web images are that big, and they are also too big. But more importantly, if you're dynamically importing styles anyway, it should be one at a time. This is the crux: the way dynamic imports work at the moment is statically analysed code splitting via Webpack or similar, so this would only work if you knew which styles people needed at compile time, which rather defeats the point. You could of course have 2000 different JavaScript files that are essentially export default { /* style JSON here */ }, but I'm guessing all this will achieve is dramatically slowing down build times for applications, since these have to be compiled by the normal code splitter. I'm not sure you can even get the standard Webpack code splitter to include them in the output directory if they are not referenced statically. Sure, dynamic imports are actually dynamic web requests, but you still need to get all those split-up styles into individual JS files so that you can request them individually. In any case, the path forward is very complicated and I don't see how it can work at the moment.
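
For concreteness, each of those 2000 generated files would be little more than this (the file name and the useStyle callback are made up):

// styles/apa.js: hypothetical example of one of the ~2000 generated modules
export default { /* converted style JSON here */ };

// Only an import with a statically analysable path can be split into its own chunk:
import('./styles/apa.js').then(mod => useStyle(mod.default));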

Fourth, your idea actually requires delivering JavaScript code, not JSON. This will negate all the (negligible) speed advantages of JSON, and then some, because now it's full-on JavaScript, which is slowly becoming the world's most syntax- and feature-rich language, and whose parsing is notoriously the slowest part of a page load by a long margin. 20MB of JSON still parses in an eye blink, but 20MB of JavaScript will be on the order of a few seconds of parsing, maybe more. That differential holds even if you manage to split it up, i.e. it's still much slower in the small.

Name-wise, although I don’t think the idea is viable, it would ideally reflect the fact that it is only compatible with citeproc-js due to the choice of delivery format.

I think we’d be better off creating an NPM or a crates.io for styles, where there is a registry, versions and a CDN. Kinda like the thing inside Zotero now. This would help in a number of ways that a big lump of published JS wouldn’t:

  • actually splittable so you can download 5kB rather than many megabytes
  • CDN hosting for styles and locales so fast and cacheable
  • can include non-styles-repo styles
  • gathers usage info
  • would help if we start modularising styles like Juris-M, most importantly with matching versions (!), e.g. apa-base@^2.0.4
  • could offer a search API
  • could lock your style version to prevent buggy updated styles from ruining a document until you've checked them

———

Edit: the way to get 2000 individual JavaScript files with object literals in them as described above is with

import(`templateLiteral-${styleSlug}.json`)

Sure, a minor adjustment to make it async, but I wouldn't deal with CSL files; something like compressed JSON would be the actual format.

I wasn't aware of the frequency of updates, or that there are any updates to well-established styles at all. Right now we are updating about once every three years and there have been no complaints about that. So adding this would also have the benefit of making sure that every installation produces the same output.

Users are currently installing something like 12MB of JavaScript code on their local machine for our editor altogether.

I am not sure how best to package them, which is why I was asking for experiences here, but one way to possibly get the file size and number of files down to an acceptable level would be to have one file for each first letter: one file for all the styles that start with “a”, one for all that start with “b”, etc. Given that they only need to be split once when building the repo and not when the main app is being rebuilt, I wonder if there is an option in webpack to just copy those files directly.
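
A minimal sketch of that bucketing idea, assuming one JSON file per initial letter is served next to the app (the paths and bucket format are made up):

let bucketCache = {};
let loadStyle = async styleName => {
    let bucket = styleName[0].toLowerCase();
    if (!bucketCache[bucket]) {
        // One file per first letter, e.g. /csl-styles/bucket-a.json,
        // mapping style names to their converted JSON.
        let res = await fetch(`/csl-styles/bucket-${bucket}.json`);
        bucketCache[bucket] = await res.json();
    }
    return bucketCache[bucket][styleName];
};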

Another option would be to include a resources directory that those using the package need to serve as well; that directory would simply hold all the JSON files, while the main JS file contains just an array of all the files that are available.

It’s something to experiment with, but certainly doesn’t sound like a show stopper.

Yeah, no user actually needs all the styles, so this is a good argument for not putting everything into one and the same file.

If this service provided a version number covering all the styles, that could be useful for packaging this thing. We would still need to cache the styles locally in case the service goes down, so there would still be the database, etc. So this probably wouldn't work as a replacement.

Considering it all, I think this approach is probably the simplest. There is one complication compared to using dynamic imports and automatic code splitting, which is having to serve that resources directory. But overall it seems like this avoids most of the issues related to size, number of files, build times, etc.

So basically this would be an npm repository that would contain:

  1. A little bit of code to initialize citeproc-js and to fetch style files from the resources directory if they haven't been downloaded already (see the sketch after this list).

  2. A script that pulls all the styles from GitHub, converts them to JSON files in the resources directory (moving the licensing information into one common license file), and creates a list of all the style names, which is included in the JavaScript from item 1.

  3. A README.md explaining how to initialize the package and how to additionally serve the resources directory.
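
To make item 1 concrete, here is a rough sketch of the glue code (the names and the resources path are made up, and it assumes citeproc-js accepts the converted JSON directly):

import availableStyles from './style-list.js'; // generated by the script in item 2

let styleCache = {};
async function getEngine(citeprocSys, styleName, lang = 'en-US') {
    if (!availableStyles.includes(styleName)) {
        throw new Error(`Unknown style: ${styleName}`);
    }
    if (!styleCache[styleName]) {
        // Fetch the converted style from the separately served resources directory.
        let res = await fetch(`/csl-resources/${styleName}.json`);
        styleCache[styleName] = await res.json();
    }
    return new CSL.Engine(citeprocSys, styleCache[styleName], lang);
}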

Does that sound better?

As an aside, an earlier discussion about publishing styles and locales on npm (although it may not be entirely relevant anymore):

That’s fine for a desktop application. For web – no. A lot of citeproc consumers are more along the lines of ZoteroBib, so what works for you may not be generally useful.

That all sounds like a ton of work for both you and everyone else who would use it.

Can we just throw the repo up on a CDN that's free for open source projects instead? (KeyCDN pops up on Google, and I seem to recall CloudFlare doing this on a few occasions.) You don't need to reinvent the wheel for something this simple. Just set the cache headers properly (use ETags to detect changes, etc.) and then you don't need to create a whole NPM library just to make a really simple HTTP request. Web apps (and Electron apps!) can generally run a worker that makes sure all the required files are cached if they want to be sure they will continue to work offline. Having this in place would save a lot of people a lot of effort and wouldn't be limited to your exact use case. You could spend your time making an asynchronous CSL Engine API instead. (You may wish to follow citeproc-rs' own async locale-fetching scheme.)

let fetchCdn = async file => {
    // Ask for the raw XML and return it as text.
    let headers = { 'Accept': 'application/xml' };
    let res = await fetch(`https://cdn.citationstyles.org/${file}`, { headers });
    return await res.text();
};
let fetchStyle = sty => fetchCdn(`style/${sty}.csl`);
let fetchLocale = loc => fetchCdn(`locale/locale-${loc}.xml`);
let sys = { fetchLocale, /* ... */ };

let apa = await fetchStyle('apa');
let engine = new AsyncCslEngine(sys, apa, 'en-US');
engine.setReferences(...);
await engine.fetchAllLocales();

If you write the script that updates the CDN, you can even do your own thing with the JSON you are so keen on as well! Just convert the files and make the API: replace .csl with .json to get the conversion to citeproc-js's JSON format done already.
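
Under that scheme, the sketch above would only need one extra helper (assuming the converted files are published next to the originals):

let fetchStyleJson = async sty => JSON.parse(await fetchCdn(`style/${sty}.json`));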

That version number would not help at all; a git commit hash already does that job. I meant version numbers for every individual style. This is a pipe dream, remember, not an actual thing we're all going to go off and build.

Also, with a good CDN, the likelihood that it has lower availability than whatever you scrape together out of conversion scripts and host on your own website is incredibly small. A web worker for caching is the right solution.
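
For what it's worth, a minimal sketch of that caching worker using the Cache API (the cache name and CDN host are assumptions):

// service-worker.js: serve CDN requests from a local cache, falling back to the network
self.addEventListener('fetch', event => {
    let url = new URL(event.request.url);
    if (url.hostname === 'cdn.citationstyles.org') {
        event.respondWith(caches.open('csl-cdn').then(async cache => {
            let cached = await cache.match(event.request);
            if (cached) return cached;
            let res = await fetch(event.request);
            await cache.put(event.request, res.clone());
            return res;
        }));
    }
});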

There are several other packages that require this. Vivliostyle just stopped requiring such a resource directory a few days ago, but we still have scripts that do this for fontawesome, mathlive and prosemirror. It’s mostly to make CSS and other resource files available. It’s no more than 1-2 lines of code in the package.json file. But then there is also another way of doing it:

Thinking further, this can probably also be done directly with webpack by making it treat the .json or .json.zip files the way it treats images or other resource files it cannot convert to JavaScript directly. I am guessing it then just uses a fetch request to get them and keeps track of where they are located itself.
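
If webpack can be made to do that, the configuration might be as small as this (a sketch using webpack 5's asset modules; older webpack versions would need file-loader instead, and I haven't checked the zipped variant):

// webpack.config.js (sketch): emit matching files into the output
// directory and resolve imports of them to URLs instead of inlining them.
module.exports = {
    module: {
        rules: [{
            test: /\.csl\.json$/,
            type: 'asset/resource',
        }],
    },
};

// Application code then imports a URL and fetches it on demand:
// import apaUrl from './styles/apa.csl.json';
// fetch(apaUrl).then(res => res.json());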

Nothing against CDNs, but we stopped using them around 2013, basically because they allow third parties to track our users, and because we cannot be 100% certain that their service is up at all times. My understanding is that this is what most projects have done: moved to installing things from npmjs rather than relying on a CDN to deliver jQuery & co. the way it was common to do around 2009. A CDN could work for many other things, but probably not for this exact use case.

Oh, I am well aware; I maintain a project where people forgetting to include the resources directory is the most frequent issue filed on GitHub. I had to make people tick boxes in the issue template certifying they had read the documentation to solve that. It's a pattern to avoid.

If it were easy, you would have finished it already. And even if you did, you would still have problems at every turn. I have a lot of experience with Webpack, and I can't imagine how you're going to keep your npm package's webpack configuration from being bundled back together into a 20MB JS blob by downstream users' own configurations, without making them add webpack plugin entries manually and redo all your work. And then those people whose webpack configs aren't in a config file but are managed by something like @angular/cli or create-react-app will complain that they can't do it. Does this sound like a nightmare to you? It does to me.

“Most projects” stopped including jQuery with a CDN script tag because they started using NPM instead. But “most projects” with this problem also started rolling their own CDNs. That is, they started buying CDN service by the transferred gigabyte from CloudFront/CloudFlare/etc. and using it for web assets. This is how virtually all images and JS/CSS assets on major platforms are served. That's the kind I mean. CDNs are well adapted to serving big bunches of files on the web. Pretty much every package manager has one: NPM has one, Yarn has (a better) one, and crates.io recently got one to make downloads more reliably fast and to work in China. Don't reject the idea out of hand because you (correctly) followed a trend in 2013.

In this case, the “third parties” would be the CSL community, who would control the CDN. You can trust us! These CDNs are often used for private content. They do not track end users, except to offer features (like abuse prevention) that their customers purchase and enable.

And you can be 100% certain in your own website’s availability? To match CloudFront or CloudFlare in availability, you would have to put your own website behind a CDN. Think about that for a minute.

You should probably have your assets behind a CDN anyway if there’s 12MB of JS to download. This is just a best practice in 2019.

I am looking at rollup-plugin-rebase (https://github.com/sebastian-software/rollup-plugin-rebase), which copies static assets referenced from your JavaScript code to the destination folder and adjusts the references to point to the new location. It seems to solve this at least for some cases. I don't like having this extra dir either, which is why I lobbied Vivliostyle.js into getting rid of it, but this is one of the few cases where I can see an advantage of keeping those files as resource files that don't need to be touched any further.

As for the arguments about configuration being harder and there being “problems at every turn”: that all seems resolvable. Having style sheets in a database has not been without problems either, and still we have managed over the years.

I would absolutely welcome if you were to turn the style repo into a CDN so that there is a better way of accessing it. It may also be a better way of getting it into our own repo.

I'm just saying that I have to consider the GDPR and our users' privacy in general, and making users download things directly from third-party sites in a way that we cannot control is not an option for us. We have users with their own installations of Fidus Writer that I only know exist because I get occasional bug reports about them, but they keep things so secret that they won't tell me the URL, much less give me access to the server. And I get it. People write their book manuscripts and don't want me to know anything about them. For the server we are hosting ourselves, we use CloudFront for the static assets. AWS has signed the GDPR docs, so there is at least a little more protection there, but I totally understand that not everyone with a private installation wants to give that kind of access to third parties. For all I know, they may be running it entirely on an intranet that is not connected to the rest of the internet.

No, but when our website (or the user’s own installation of Fidus Writer) is not available, they cannot get to their documents anyway. So then it makes no difference.

I don’t know why you seem so keen on shutting this idea down, but even if it appears that I don’t agree with you on everything here, you certainly made me reevaluate how to do some things. You are quite right that dynamic imports probably would not have been a good idea. Thanks for that!

I seem “keen on shutting this idea down” because you have touched on a good problem to solve and I think it needs fixing. I want to actually get something useful out there, because I recognise some of the difficulties you’re alluding to. But I think we disagree on what the problem is.

As far as I can tell, although you mention storing styles in a database as a problem, I think you’ve focused on fetching styles. You don’t see updating styles as one of the sub-problems and are happy to put them in a static folder that “don’t need to be touched any further”. (Probably because you’ve only ever had a subset in Fidus.) True, it all might be more difficult if you choose to do the unnecessary JSON optimisation, but that’s really a problem unique to you, and unlikely to be worth publishing as an NPM package. But: managing these static folders is actually much more annoying than fetching files from them, and it’s a problem that everyone shares.

Here is my problem statement. As a dev I don’t want to have to:

  1. create the static folder of styles or locales; or
  2. pull them from upstream; or
  3. remember to pull them on dev machines; or
  4. regularly keep them in sync with upstream; or
  5. add 20MB to deployment bundles which would otherwise be pretty small; or
  6. serve them with the right MIME types (.csl won't be a known filetype on any web host, ever); or
  7. serve them with the right cache headers; or
  8. deploy anything again when styles get outdated or (alternatively) write and debug code to pull styles & locales repos on user machines.

Most of these are everyday drudgery for every CSL-enabled tool out there. Writing code to handle this is the price of admission when you get started. I have done it three times now. It gets old. If you’re wrangling styles but not hitting any of these points, whatever you’re building is not worth it.

Does an NPM package bundling all the styles make it better? You asked if it was feasible; I said not really, and generally advised against it because it would be harder to use, and slower and more fragile in a number of ways. Even if it did work, updated styles would require deploying a new application with the bumped package. I have enough package bumping to do as a JS dev; I've had enough of drowning under the firehose of dependabot pull requests. If it existed (big if), such a hypothetical package would hit points 1, 3 and 6, but make the rest of them worse.

Does a cool script to download all the styles and convert them to JSON make it better? Arguably not any better or worse, but it has more moving parts than adding cd styles && git pull to a CI script and calling it a day. I sense it will be problematic to set up, given you're likely putting a hardcoded resource path in the JS glue code. Such a script would hit none of the points; maybe 6 (MIME types) for converted JSON, but only for you, and that was never a huge problem.

Neither approach meaningfully attacks this list. You can build it if you like but it won’t be of any great use to the community.

On the other hand, does a CDN solve it? Yes. It hits every item on that list. (You are correct that any CDN may engage the GDPR because of the IP addresses it collects and discards. But this isn't the huge blocker you made it out to be. I'm sure if we built it we could get some compliance text drafted; it would be a fine entrant in a GDPR-golf brevity competition.)

Ok, maybe we just have different ideas of what is useful to the community. I really don't know who all is in this community, so you may be right that yours is more useful for a lot of projects. In our case, adding a CDN to the mix for end users is an absolute no-go, because the program could then no longer be used by security-aware users. That's not something I will be able to convince anyone to compromise on.

I think what I am proposing is fairly simple, so I'll just go ahead with a proof of concept. If it is then not used by anyone except us, so be it. It may also show that it's not actually possible to do it the way I hoped. But separating everything CSL-related out into a different package would in itself be a big advance for us, and it will likely make things much more manageable in the long run. If in a few years' time the world starts to relax a bit more about CDNs, and it has been proven that they are not actually used to spy on users, this may all change, and it may even be possible for us to switch to your CDN some day.

Again, thanks for pointing to some of the flaws in my initial plan. I’ll let you guys know when I have something to try out and then you’re welcome to again tell me how this is a bad idea that will never work. :smiley:

Try harder, man! I have never heard of these ridiculous people. In my experience, users are either security-conscious enough to know that CDNs are not harmful, or they know nothing about them at all. I don't see a CDN as something even worth explicitly telling users about. It's just downloading additional styles. Anyway.

There is a reason that scientists from around the world are hesitant to write their research in Google Docs and thereby give the US government and possibly also other governments and organizations access to their data. Some of those users are among those that feel the need to run their own Fidus Writer instance. It’s not just that I will find it hard to convince them not to use some random CDN - I think they are completely right in not accessing anything from the outside.

Again, I can see you propose a solution. It's just a solution for a different problem and a different use case than the one I am dealing with.

Beyond getting some CDN provider to provide this project a free CDN, how much work would be involved in setting it up?

While it would be nice if you two could agree, is there any reason two proofs of concept aren't a reasonable alternative?

Yeah, I think two proofs of concept sound like a good plan. I may be underestimating it, but right now I don't think there is much work involved in mine.

Thanks both for good discussion. I have little to add on the technical aspects, but did want to briefly flag this:

I wasn't aware of the frequency of updates, or that there are any updates to well-established styles at all. Right now we are updating about once every three years and there have been no complaints about that.

To the extent that there is a “CSL brand”, that's not ideal for us. We put a lot of effort into fixing, improving, and updating styles, and we take great pride in this. I'm completely on board with not grabbing every single update to the repository for a packaged version of the styles, but if there is going to be a packaged version of CSL styles, NPM or otherwise, I'd really want to see at least monthly updates. That's partly for image reasons, but also for maintainability: if someone complains about a CSL style being incorrect, it's a hassle to check which version they're running and whether the same problem exists in the current version.

Well, I mean, you guys kind of have to decide. It's open source, so you cannot force people to update it. At the same time, there are things in your licences that require one to show text saying that the system is using CSL or citeproc (I cannot remember exactly which one). So for that reason alone you'll have that branding shown to end users.

We all have a lot on our hands. There is also some value to me in not receiving error reports about a style that I cannot reproduce because a different version of the CSL file is in use than the one on my system.

I don't even know if we had an update three years ago; it may also have been back in 2012 or so. If you decide that you don't want CSL branding unless there is an update every few weeks, then change the license terms so that the branding isn't required.

He didn’t say anything about requiring updating, but I guess the obvious question is who benefits from NOT having up-to-date styles? Certainly not your users?

Right, this is a request, not a requirement. I just think it’ll make everyone involved happier.

We currently have two releases per year. Once we declare the editor more or less feature complete, we'll probably have releases annually or every two years. So more frequent updates are probably not an option for us. I would think some other projects have even less frequent updates.

But this effort should at least help us pick up CSL updates whenever we do a release.