Styles/locales as XML vs. JSON

Hey,
the documentation mentions converting styles/locales to JSON as one option in environments that don’t have access to a dom, such as a web worker.

I am wondering: in browsers that do have dom support, is there any advantage to keeping the XML version around? Reading/processing a dom tends to be slower than processing a JSON object, so I’m wondering if it’s not just better to convert to JSON for all purposes.

I can’t speak for other processors, but in citeproc-js, the processor will convert serialized XML to JSON internally as of a few years ago. If the style is provided as pre-converted JSON, the processor will work with that directly, so if you’re instantiating styles and tearing them down frequently, there should be a small performance advantage to caching the JSON version externally.
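A minimal sketch of that external caching idea, for illustration only. The helper name `makeStyleCache` and the `convert` callback are hypothetical; `convert` stands in for whatever XML-to-object converter you use and is not a citeproc-js call.

```javascript
// Hypothetical helper: memoize converted styles under an id of our choosing.
// `convert` is a stand-in for the XML-to-object converter in use; it is
// passed in so that no particular library API is assumed here.
function makeStyleCache(convert) {
  const cache = new Map();
  return function getStyle(styleId, xmlString) {
    if (!cache.has(styleId)) {
      // Convert only on the first request for this style id.
      cache.set(styleId, convert(xmlString));
    }
    return cache.get(styleId);
  };
}

// Usage with a stand-in converter that counts how often it runs.
let conversions = 0;
const getStyle = makeStyleCache((xml) => {
  conversions += 1;
  return { name: "style", attrs: {}, children: [], sourceLength: xml.length };
});

getStyle("apa", "<style/>");
getStyle("apa", "<style/>");
console.log(conversions); // 1
```

Note that the cache hands out the same object every time, so a caller that modifies the style should work on a copy.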

(As a historical note, when I started work on citeproc-js, JavaScript was an interpreted language - no JIT compilation. I had zero experience with JS or with programming in the browser, and initially set out to string-parse the XML input because I didn’t know any better. Others disabused me of that approach because DOM or Mozilla’s (short-lived) E4X were much faster; but times change, and systems change.)

Ok, understood. So I’ll just add the converter to JSON to run on any new incoming XML but only store the JSON version in the database. Using JSON also has the advantage that I can easily manipulate the object before handing it to the CSL processor.

So for example to adjust the delimiter/prefix/suffix of citations, I can adjust the object like this:

const citationLayout = citationStyle.children.find(
    section => section.name === 'citation'
).children.find(
    section => section.name === 'layout'
).attrs

citationLayout.delimiter = '; '
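A slightly more defensive variant of the same idea, as a sketch. The helper names `findChild` and `setCitationLayoutAttrs` are made up for this example, and the sample object is a trimmed-down stand-in for a converted style, assuming the name/attrs/children layout.

```javascript
// Find the first direct child element with the given name, or undefined.
function findChild(node, name) {
  return (node.children || []).find(
    (child) => typeof child === "object" && child !== null && child.name === name
  );
}

// Set attributes on citation > layout. Throws if the path does not exist,
// instead of failing with a TypeError somewhere inside a chained find().
function setCitationLayoutAttrs(style, attrs) {
  const citation = findChild(style, "citation");
  const layout = citation && findChild(citation, "layout");
  if (!layout) {
    throw new Error("style has no citation/layout element");
  }
  Object.assign(layout.attrs, attrs);
  return layout.attrs;
}

// Trimmed-down stand-in for a converted style object (an assumption).
const sampleStyle = {
  name: "style",
  attrs: {},
  children: [
    { name: "info", attrs: {}, children: [] },
    {
      name: "citation",
      attrs: {},
      children: [{ name: "layout", attrs: { delimiter: ", " }, children: [] }],
    },
  ],
};

setCitationLayoutAttrs(sampleStyle, { delimiter: "; ", prefix: "(", suffix: ")" });
console.log(sampleStyle.children[1].children[0].attrs);
```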

Bear in mind that there isn’t (afaik) a standard way to represent XML as JSON, the .children is just a convention. So even if citeproc-rs did have a DOM, the type-safe internal representation in citeproc-rs wouldn’t line up with that convention. The best possibility is that an API similar to https://github.com/RazrFalcon/svgdom is offered, which would read and write XML. (DOM means an API for exploring and manipulating a document, it is not XML specific.)

In general, I would advise storing XML because it has a known, portable schema and you can see what went wrong by looking at it (whereas the JSON representation is not human friendly). That way it even survives things like citeproc-js choosing a new JSON converter.

[…] you can see what went wrong by looking at it (whereas the JSON representation is not human friendly).

I think that depends a lot on what you are used to as a developer. I am seeing a lot more JSON than XML these days and so a properly indented JSON looks much more readable to me than most XML-files. I feel more comfortable manipulating it and more secure that the browser will work correctly. Apparently it’s not an issue for CSL, but in general I found especially in Safari there are several XML-related bugs that haven’t been fixed for about a decade.

I could have the browser convert the serialized XML to a real XML-dom node, then serialize it back to a string, then hand it to citeproc-js which then converts it into a JavaScript object. All those conversions would take place potentially several times a second on the end user’s mobile phone or laptop. But I don’t see the advantage.

That way it even survives things like citeproc-js choosing a new JSON converter.

We don’t create our own styles - we just copy and convert the XML definitions from the styles repo. If there is a change in citeproc-js, then we would need to copy them again and adjust the code that relies on the json structure being the way it is. In this case it seems like it’s worth it if it takes some of the burden of what needs to be converted on the computers of end users.

I would emphasise three things:

  • You do not need to rely on the browser’s HTML DOM (which is very heavy and oriented towards correcting errors as it goes) to parse or write XML. Try xml2js or fast-xml-parser.
  • A full parse cycle should not take more than a few milliseconds for the kinds of file sizes written by hand in CSL. If XML is actually noticeably slower, then something’s wrong.
  • You don’t have to do full XML parse cycles every time your user wants a change; I only recommended not storing an undefined format in a database.

@cormacrelf A few years ago I was sent to present at several XML conferences. I knew what XML was but like most other developers I had not used it more than a bare minimum after around 2005 or so. I thought this would be a good place to talk also about things that weren’t XML. But what I quickly learned was that the subject of XML vs. JSON or HTML is almost like having a discussion on what religion is better - it’s impossible to reach a conclusion because it’s something that people feel very strongly about.

I noticed you are the citeproc-rs developer, and if the idea is that we all switch to citeproc-rs in a few years and you prefer XML, then clearly that’s a good argument not to change the pipeline to exclude XML too much as we’ll probably have to go back to deal with XML when that happens.

We currently store almost everything in JSON - documents, our bibliography database, etc. The only XML thing we stored until now were the CSL definitions, and that was mainly because I didn’t realize that citeproc-js also accepts JSON until I looked through the manual today to try to find a way of manipulating the engine. Also, we didn’t really need to touch the engine much until now.

So thanks for the warning, and I’ll make sure not to walk too far away from XML so that we can switch back to it if we need to for citeproc-rs, but the step of saving and serving a JSON conversion is still within the realm of what I think makes sense in our case, at least for now.

Hey, I don’t care about the XML-JSON debate in the slightest. Writing either by hand is distinctly unpleasant. JSON is (imo) significantly worse for this particular use case, although for most other purposes it’s significantly better. This is beyond unimportant, however. For CSL the only things that matter are portability between implementations and backwards compatibility with the huge library of styles. I guess if you’re happy to maintain both manually to save a millisecond from time to time, then that’s up to you.

I have tried to debug citeproc-js every now and then, but I must admit I generally deal with it like a black magic box that just does its thing. So you may absolutely be right - I really don’t know which one would be better to do the job, but it sounds like it’s the JSON version that is currently being used.

My initial question came up because I had noticed in other situations that dealing with native DOMs in browsers is significantly slower than dealing with JSON objects, and that even various virtual DOMs require more resources. So when I read today about the possibility of using a JSON object instead, I suspected that it might be at least as fast (and more convenient for me to deal with). The answer @Frank_Bennett provided meant that at least there is no way the JSON version can be slower, so there seems to be no performance reason to keep the XML version around, given the way citeproc-js works right now.

The way I deal with it is this:

The interface to add another style is basically that you have to paste the style definition into a text box in an administration interface. It is assumed people get it there by copying it out of the style repository.

The check I added today trims the content of that text field before saving. If the first character is a “<” and not a “{”, it assumes it just received XML and runs it through the XML-to-JSON converter before storing it.
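Roughly, that check could look like the sketch below. The function name `prepareStyleForStorage` and the `xmlToObject` parameter are hypothetical; the converter is passed in rather than assumed to be a specific library call, and JSON input is assumed to start with `{`.

```javascript
// Sketch of the "is this XML or JSON?" check on pasted style text.
// `xmlToObject` is a hypothetical stand-in for the XML-to-JSON converter.
function prepareStyleForStorage(rawText, xmlToObject) {
  const text = rawText.trim();
  if (text.startsWith("<")) {
    // Looks like XML: convert it before storing.
    return xmlToObject(text);
  }
  if (text.startsWith("{")) {
    // Looks like an already converted style: make sure it parses.
    return JSON.parse(text);
  }
  throw new Error("Input is neither XML nor a JSON object");
}

// Usage with a fake converter, for demonstration only.
const fakeConvert = (xml) => ({ name: "style", attrs: {}, children: [], from: xml });
const stored = JSON.stringify(prepareStyleForStorage("  <style/>\n", fakeConvert));
console.log(stored);
```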

It’s a really useless text box given the amount of XML/JSON so it’s really not usable for editing the style. Pasting the entire thing is really the only thing that makes sense. In that sense the style sheets in the database in a Fidus Writer installation really just act like a local cache and therefore it shouldn’t matter to you guys what format it is stored in.

I don’t need to run this JSON in any other language version of citeproc, so my situation is different than that of CSL as an organization.

Thank you very much for the detailed explanation!

Mate, I just don’t think the supposed cleanliness and joy of storing JSON for everything is worth all the trouble. I don’t imagine I’m going to convince you at this point, but I’ll address a few things:

This, and Frank’s “will convert serialized XML to JSON internally”, is inaccurate. JSON != JavaScript objects. citeproc-js does not convert XML to JSON; it just parses XML into its own in-memory representation. It just so happens that many (not all) objects in JavaScript can be serialized to and deserialized from JSON strings. So you can, by pure coincidence, replace “XML string --parser--> in-memory” with “JSON string --parser--> in-memory”. But you would be the first.
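To illustrate the distinction with plain JavaScript (nothing here is specific to citeproc-js; the object is made up): JSON is a string format, and only part of what an in-memory object can hold survives a round trip through it.

```javascript
// A made-up in-memory object holding things JSON cannot represent.
const inMemory = {
  name: "text",
  attrs: { variable: "locator" },
  children: [],
  created: new Date(0),               // becomes an ISO string
  render() { return "<text/>"; },     // functions are dropped
  note: undefined,                    // undefined values are dropped
};

const serialized = JSON.stringify(inMemory); // JSON: a string
const roundTripped = JSON.parse(serialized); // a new JavaScript object

console.log(typeof serialized);            // "string"
console.log("render" in roundTripped);     // false
console.log("note" in roundTripped);       // false
console.log(typeof roundTripped.created);  // "string"
console.log(roundTripped.attrs.variable);  // "locator"
```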

If by this you mean people can paste JSON in, I would mark my opposition to allowing people to paste or ever see it. Any supposed JSON interpretation is not in use by people who write CSL styles. There is nowhere on the web where one may obtain styles in this posited format. The actual layout of fields and node children is arbitrary. There is no standard, and there will likely never be a standard. If there were one, it wouldn’t be the one used by citeproc-js, which is verbose, and only any good for an in-memory representation. You can depend on citeproc-js’ interpretation all you like, but please don’t make it user-facing.

To answer your original question, one really good reason to store XML is so that when people inevitably ask you ‘can I copy that style I’m using to use somewhere else?’, you can give them the original XML and not this useless-elsewhere JSON representation. The way you’re describing it, you would have to convert back to XML, losing any comments and whitespace in the original. Also, you’d have to write that yourself. The list of troubles this isn’t worth just keeps growing. I can think of a few more, but I’m sure you’ll encounter them along the way. Good luck.
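For a sense of what “write that yourself” involves, here is a rough sketch of a serializer from the name/attrs/children layout back to XML. It is an illustration, not an existing tool: the function names are made up, string children are assumed to be text content, and comments and the original whitespace cannot be recovered because the object never contained them.

```javascript
// Escape characters that are special in XML text and attribute values.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Serialize a { name, attrs, children } node to indented XML.
// Assumption: string entries in `children` are text content.
function toXml(node, indent = "") {
  if (typeof node === "string") {
    return indent + escapeXml(node);
  }
  const attrs = Object.entries(node.attrs || {})
    .map(([key, value]) => ` ${key}="${escapeXml(value)}"`)
    .join("");
  const children = node.children || [];
  if (children.length === 0) {
    return `${indent}<${node.name}${attrs}/>`;
  }
  if (children.every((child) => typeof child === "string")) {
    // Keep text-only content inline so no whitespace is added to it.
    const text = children.map(escapeXml).join("");
    return `${indent}<${node.name}${attrs}>${text}</${node.name}>`;
  }
  const inner = children.map((child) => toXml(child, indent + "  ")).join("\n");
  return `${indent}<${node.name}${attrs}>\n${inner}\n${indent}</${node.name}>`;
}

const macro = {
  name: "macro",
  attrs: { name: "AMacro" },
  children: [{ name: "text", attrs: { variable: "locator" }, children: [] }],
};
console.log(toXml(macro));
```

Mixed content (text and elements side by side) would pick up extra whitespace with this approach, which is one more detail a real converter would have to get right.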

I am not sure I would be the first. There seem to be at least two implementations (Python and JavaScript) of the converter to JSON. The Python converter has been there for 6.5 years and the format hasn’t changed [1]. That’s more stable than a lot of standards, and two independent implementations is usually enough to define a new standard :wink: . Also, it’s mentioned in a few places in the documentation as the way to do it in web workers, etc. So I assume people who run it in nodejs generally do it this way. My question here was just to figure out if it also makes sense even if there is access to a DOM.

I promise that if anyone asks, I’ll tell them they have to go back to the style repository and download it from there. Even though this isn’t what it’s meant for, if a user still decides to change the syntax, in 2019 there is likely a higher chance that they’ll feel comfortable editing the JSON than the XML. Should it become an actual problem that users have edited the JSON and now want to use it on other sites that do not use citeproc-js, we’ll just need to write a converter to create the XML from the JSON.

[1] History for tools/makejson.py - Juris-M/citeproc-js · GitHub

Frank’s “will convert serialized XML to JSON internally”, is inaccurate.
Indeed. I should have written “will convert to an internal JavaScript object that the processor will also accept as input if it is serialized to JSON.”

Back when JSON input was implemented, @fcheslack ran tests to compare the performance of DOM versus pre-processed JSON input for this particular processor, and found that the latter was significantly faster. I would guess that some such caching mechanism is used on Zotero Style Repository. Whether it involves stringifying to JSON I have no idea.

Be that as it may, it should be sufficient to indicate that citeproc-rs, when complete, will expect the validated XML form of CSL styles and locales as input. Developers that will rely on it can figure out the rest.

Citeproc-js is the one implementation that has access to the browser’s DOM engine, which one would imagine would be perfect conditions for dealing with an XML DOM. If even under those conditions it is faster to use the JSON, I wonder what it will be like in other implementations. Has this been tried out? I’m guessing citeproc-rs will lose access to the browser’s DOM, so then that could also become a relevant question.

Rust is a very different environment. I’m sure @cormacrelf has performance issues there well under control.

Today:

JS benches are using jsbench.me, which I can’t save for sharing as the setup code is too long. Rust is with criterion. Same Xeon E5620 with its fairly modest single-core perf for 2019.

| Parser | Time for 29kB CSL (ms) |
|---|---|
| XML via CSL.parseXml | 5.33 |
| XML via DOMParser | 2.52 |
| JSON.parse (the JSON equivalent without whitespace, 35kB) | 0.859 |
| Rust XML via roxmltree | 0.528 |
| Rust XML parse + completely validate + form proper AST | 0.949 |

So, JSON is 3x faster to parse than DOMParser, but inflates the download size. Theoretically, if your app includes downloading, XML is still faster for network connections with speeds of less than 30Mbps.
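The 30Mbps figure can be checked with a little arithmetic on the numbers above. This sketch assumes 1kB = 1000 bytes and an uncompressed transfer, and it ignores latency; the break-even point is where the time to download the extra 6kB equals the parse time saved.

```javascript
// Numbers taken from the benchmark above.
const xmlBytes = 29000;
const jsonBytes = 35000;
const domParserMs = 2.52;
const jsonParseMs = 0.859;

// Extra data to download when serving JSON instead of XML, in bits.
const extraBits = (jsonBytes - xmlBytes) * 8;

// Parse time saved by JSON.parse over DOMParser, in seconds.
const savedSeconds = (domParserMs - jsonParseMs) / 1000;

// Below this bandwidth the extra download costs more than the parse saves.
const breakEvenMbps = extraBits / savedSeconds / 1e6;
console.log(breakEvenMbps.toFixed(1)); // 28.9
```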

None of these times warrant any effort on caching, especially for an operation that generally happens once per session. For that visual-editor-batch-preview-generation thread or huge repo-wide test runners, mileage may vary. Otherwise, for your average document editor, it’s a completely unnecessary optimisation. Don’t bother.

I know your original question is framed around the viability of JSON as an alternative behind-the-scenes interchange format, but it also seems like you either actually want or are indifferent to the creation of a competing standard for end users to consider. From my perspective it seems like you’ve based this on your experience of editing JSON config files and/or general discomfort around XML/old-news tech. I am not attributing a serious ecosystem-wide proposal to your remarks. Nevertheless, I hope to squash any fanciful notions of creating a new CSL syntax, so people know what they’re getting into if they propose one by accident.

Here’s a CSL fragment:

<!-- XML has comments. JSON does not. I could stop talking here. -->
<macro name="AMacro">
  <text variable="locator" />
</macro>

Syntax tree from CSL.parseXml:

{
  "name": "macro",
  "attrs": {
    "name": "AMacro"
  },
  "children": [
    {
      "name": "text",
      "attrs": {
        "variable": "locator"
      },
      "children": []
    }
  ]
}

The CSL is quite readable. citeproc-js’ output is awful to edit. I know this because I just typed them both out. This (among many reasons) is why YAML, TOML, HTML, and hundreds of other markup languages still get airtime even now the great saviour JSON exists. A good human-friendly language should be harder to parse for machines; the more work they have to do, the less we have to think.

But a much more important reason the CSL snippet is so readable is that the CSL designers (Rintze et al) put significant effort into making it human-friendly. This involves lots of choices to exclude the numerous ways XML could represent the same concept. Clearly, another gargantuan effort would be required to make some JSON schema or any other language come even close. Of course citeproc-js’ implementation hasn’t changed in six years: nobody is willing to put in the time and effort to create a human-friendly syntax based on a machine-optimised language that was never fit for that purpose. As it stands now, the so-called JSON ‘standard’ is not a contender; it is simply an XML abstract syntax tree that is about 5x as laborious to type as the real thing.

If you do want to seriously propose a new host language and final syntax to write CSL in, you probably want to pick a better horse to race than citeproc-js’ in-memory representation. Not even a good proposal is remotely likely to surmount the obstacles to changing the entire ecosystem. I would again encourage you not to expose this JSON representation to your users, and to generally ignore it except to solve the one WebWorkers problem it was designed for.

Hey @cormacrelf,

thanks for the numbers.

In our case speed is a lot more important than download size (the total size is a few MB which are downloaded on the first visit and are then cached in the browser), but also remember that the JSON parsing only has to happen once on the computer while the XML parsing has to happen every time I create a new CSL instance because citeproc only supports either a JavaScript object (that I deserialized from JSON myself earlier) or a text string that needs to be deserialized to XML by citeproc-js.

Additionally, the time issues with DOMs in browsers are not just the parsing of it - it’s also reading and writing things to dom elements (see below).

I have been maintaining a few XML/DOM related open source tools over the last few years and besides XML conferences where I met people who felt extremely strongly about XML and a Twitter fight about whether CSS or XSL-FO is better that I witnessed, my experience comes from tools.

I think the one experience that influenced me the most is that of diffDOM [1] - a library to diff two DOM elements and apply the resulting patch, which I have maintained for a few years. Initially the way it worked was that it would create a copy of the first dom element it was given and then find the first difference between the copy of 1 and dom element 2. It would then apply the diff, and start over, finding the next diff. In the end it then returned a list of patches that needed to be applied to make dom element 1 be like dom element 2.

To try to speed it up, one of the things I experimented with was what would happen if instead I first made “virtual” dom copies (JavaScript objects that contain all relevant values and attributes of the original dom element) and then made the diff finding process work on the JavaScript objects instead. In comparison to the old method, this did involve an extra step of walking through both elements and their trees and making these copies. So I was quite surprised when it turned out that this new way was a lot faster than using the dom elements, even though the dom elements were not descendants of window.document and so could not cause anything to be redrawn.
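The idea of such a copy can be sketched as follows. This is not diffDOM’s actual object format; `toVirtual` and the shape of its output are invented for illustration. It relies only on standard DOM node properties (nodeType, nodeName, data, attributes, childNodes).

```javascript
// Copy a DOM node (and its subtree) into plain JavaScript objects.
// Illustrative sketch only; the output shape is made up.
function toVirtual(node) {
  // nodeType 1 is an element; text (3) and comment (8) nodes carry `data`.
  if (node.nodeType !== 1) {
    return { nodeName: node.nodeName, data: node.data };
  }
  const attributes = {};
  for (const attr of Array.from(node.attributes || [])) {
    attributes[attr.name] = attr.value;
  }
  return {
    nodeName: node.nodeName,
    attributes,
    childNodes: Array.from(node.childNodes || []).map(toVirtual),
  };
}

// Stand-in for a real element so the sketch also runs outside a browser.
const fakeElement = {
  nodeType: 1,
  nodeName: "P",
  attributes: [{ name: "class", value: "note" }],
  childNodes: [{ nodeType: 3, nodeName: "#text", data: "Hello" }],
};
console.log(JSON.stringify(toVirtual(fakeElement)));
```

After this one walk, all further comparisons touch only plain objects, so the diffing code no longer crosses into the browser’s DOM implementation.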

The explanation I was given by browser developers when I discussed this with them was that by accessing dom elements in JavaScript, it cannot optimize the code as much because it needs to switch between C++ and JavaScript.

This worked for a few years - the “virtual dom” was only used internally and was recreated every time another diff/patch operation had to take place. A few months ago we started seeing reports that Fidus Writer (an online collaborative word processor) turned slow with very large documents. The main culprit seemed to be that with every letter the user typed, a number of boxes on the right hand side (showing tracked changes, comments on text parts, etc.) needed to be recalculated. diffDOM had already minimized the number of dom updates, but just reading the dom was still slowing it down. In the end we resolved it by giving access to the internal JavaScript object, storing it between updates, and also by adding code that turns the html-string directly into our JavaScript object format without first going through the browser’s dom system.

I did notice that you mentioned virtual dom systems above. And yes, they are probably a solution to some things. I have been thinking for a while to use jsdom on Safari to create DOCX/ODT files rather than the native dom mechanism to get around the restrictions of that browser. The main reason I haven’t yet done it is lack of time.

But also jsdom & co create an overhead because they add dom related methods, etc. that we don’t really need in diffDOM, so the best solution so far seems to be this custom JavaScript object.

Actually, the reason I chose to store this JSON and not some other custom more readable variant is precisely that it has been supported for 6.5 years and it does correspond quite directly with the XML so I don’t really see it as a competing standard. It’s more a different way to express the same thing and a converter could be written to turn it back into XML. This way of representing it just turns out to be faster in browsers and at least in Fidus Writer we have so much going on that every microsecond counts.

I really did not start this conversation to start a new format. The question was just how could I do what I needed to do which was to change some parameters of the citation style before running citeproc on it. Creating an XML dom, changing some attributes (which probably would break on Safari) and then serializing it back to a text string and running citeproc seemed a bit excessive. So I went into the documentation and found something that could not only solve that problem but which is also likely to speed up the process in general.

But now with your responses here, I start to wonder - maybe there are actually enough technical reasons to move away from XML. The one reason that I can see to keep XML as the main file format is that it has been established already. But of course if using XML is always three times slower than using JSON, then you’ll find yourself in the situation of having to battle JSON alternatives all the time. At any rate, when we moved from our custom solution to citeproc back in around 2012/13, it was precisely to have one less thing to worry about, to let others take care of it, and to become more compatible with/similar to other tools. So I can assure you that I will not be launching any kind of citation style standard any time soon. I’m quite happy with citeproc(-js) as it is.

Right, and that’s a significant shortcoming I can see for example with the package.json file that is being used so much these days. I hope that eventually one of the initiatives to add comments is successful [2]. However, it seems like the world is more JSON-positive in 2019 than XML-positive.

Look, maybe we got off on the wrong foot on this. My understanding of this conversation was that I had found a way of running citeproc-js in the manual that I, as a random nobody on the internet, hadn’t been aware of hitherto, and it wasn’t clear to me whether this would be faster or slower than running it with XML. Frank then explained that it would likely be faster and under no circumstances slower. I then quickly analyzed where in our workflow it would be the easiest to do the conversion and decided that doing it on saving to the database would be the simplest to implement. I committed the code and added a single comment for others who might be in a similar situation on how that helped me. But then suddenly you came with a long list of reasons why XML is better than JSON. And you’re not wrong, and if I were the maintainer of some heavily used citation manager, then it would probably have made sense to have this conversation with me. But I’m just not that person.

Citations are one of many things Fidus Writer does, and it doesn’t have any interface to edit citation styles, so I would be very surprised if anyone were to go in and manually edit those. An administrator could theoretically also go in and edit documents, which are also a large amount of JSON, but I would be highly surprised if anyone did that.

[1] GitHub - fiduswriter/diffDOM: A diff for DOM elements, as client-side JavaScript code. Gets all modifications, insertions and removals between two DOM fragments.

[2] https://json5.org/

FWIW, when I originally created CSL somewhere around 2005, I based it on XML for the following reasons:

  1. a huge infrastructure of well-designed associated tools (NXML, xmllint, etc., etc.) and standards (RNG/C) for editing, validating, and manipulating files.
  2. per @cormacrelf, it was not hard to create relatively user-friendly and compact styles.

I still think it was a good decision in 2019, but do recognize the larger programming landscape has changed.

I do just want to add to this lest I leave someone in the future with the impression that I’m open to the idea of a json schema version of CSL styles: I’m not, at all.

XML turned out to be a good solution for this use case. And providing the same thing in json would be impossible.

… providing the same thing in json would be impossible.

Among other reasons, because JSON Schema is pretty terrible and limited in comparison to RELAX NG XML, which is probably the most elegant technology I’ve ever worked with.

I’ve tried to do pretty basic things with JSON Schema in the past months that are either impossible, difficult, or simply don’t work from a validation standpoint.

One of the nice things about RNG definitions is they very precisely specify what we intend, and RNG tools consistently and accurately validate that logic.

Hey,

I don’t think it is necessary to change now, at least as long as XML is still used somewhat. But how the files are cached is a different story. Citeproc-plus [1] internally keeps a cache of the entire citation style repository and it caches those in a json format that citeproc-js can read directly. End users never have to deal with those json files at all and that solution has worked out pretty well for us for a while now.

[1] GitHub - fiduswriter/citeproc-plus: Citeproc-js + citation styles bundled