Hey @cormacrelf,
thanks for the numbers.
In our case speed matters a lot more than download size (the total is a few MB, downloaded on the first visit and then cached by the browser). But also remember that the JSON parsing only has to happen once per machine, while the XML parsing has to happen every time I create a new CSL instance, because citeproc-js only accepts either a JavaScript object (which I deserialized from JSON myself earlier) or a text string that citeproc-js then has to parse as XML.
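To make the difference concrete, here is a rough sketch (the two `sys` methods are the standard citeproc-js hooks; `locales`, `items`, `styleXML` and `styleJSON` are placeholders for data loaded elsewhere):

```js
// The sys object citeproc-js expects, with the two required hooks.
const sys = {
    retrieveLocale: lang => locales[lang], // preloaded locale strings
    retrieveItem: id => items[id]          // bibliographic data by id
};

// XML path: the style text has to be parsed again for every new engine.
let engine = new CSL.Engine(sys, styleXML);

// JSON path: deserialize once, then reuse the object for every engine.
const styleObj = JSON.parse(styleJSON); // happens once
engine = new CSL.Engine(sys, styleObj); // no XML parsing here
```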
Additionally, the performance issues with DOMs in browsers are not limited to parsing - reading from and writing to DOM elements is slow as well (see below).
I have been maintaining a few XML/DOM related open source tools over the last few years. Apart from XML conferences where I met people who felt extremely strongly about XML, and a Twitter fight I witnessed about whether CSS or XSL-FO is better, my experience comes from those tools.
I think the experience that influenced me the most is that of diffDOM [1], a library to diff two DOM elements and apply the resulting patch, which I have maintained for a few years. Initially it worked by creating a copy of the first DOM element it was given and then finding the first difference between that copy and DOM element 2. It would then apply the diff and start over, finding the next difference. In the end it returned a list of patches that needed to be applied to make DOM element 1 equal to DOM element 2. A rough sketch of that loop follows below.
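This is not the actual diffDOM source; `findFirstDiff` and `applyDiff` stand in for the real internals:

```js
function diffLoop(element1, element2) {
    const copy = element1.cloneNode(true); // deep copy of DOM element 1
    const patches = [];
    let diff;
    // Find and apply the first remaining difference, over and over,
    // until the copy has become equal to element 2.
    while ((diff = findFirstDiff(copy, element2))) {
        applyDiff(copy, diff);
        patches.push(diff);
    }
    return patches; // applying these to element 1 makes it equal element 2
}
```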
To try to speed it up, one of the things I experimented with was first making “virtual DOM” copies (plain JavaScript objects containing all relevant values and attributes of the original DOM elements) and then running the diff-finding process on those JavaScript objects instead. Compared to the old method, this involved the extra step of walking through both trees and making these copies. So I was quite surprised when this new approach turned out to be a lot faster than working on the DOM elements directly, even though those elements were not descendants of window.document and so could not cause anything to be redrawn.
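A minimal sketch of that conversion step (the real diffDOM representation records more detail, but this is the idea):

```js
// Turn a DOM node into a plain JavaScript object that carries all the
// values the diff needs, so the diff never has to touch the DOM again.
function toVirtual(node) {
    if (node.nodeType === Node.TEXT_NODE) {
        return { nodeName: "#text", data: node.data };
    }
    const obj = { nodeName: node.nodeName, attributes: {}, childNodes: [] };
    for (const attr of node.attributes) {
        obj.attributes[attr.name] = attr.value;
    }
    for (const child of node.childNodes) {
        obj.childNodes.push(toVirtual(child));
    }
    return obj;
}
```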
The explanation browser developers gave me when I discussed this with them was that when JavaScript code accesses DOM elements, the engine cannot optimize it as much, because every access has to cross the boundary between JavaScript and the browser’s C++ internals.
This worked for a few years - the “virtual DOM” was only used internally and was recreated every time another diff/patch operation took place. A few months ago we started seeing reports that Fidus Writer (an online collaborative word processor) turned slow with very large documents. The main culprit seemed to be that with every letter the user typed, a number of boxes on the right-hand side (showing tracked changes, comments on text parts, etc.) needed to be recalculated. diffDOM had already minimized the number of DOM updates, but just reading the DOM was still slowing things down. In the end we resolved it by giving access to the internal JavaScript object, storing it between updates, and adding code that turns the HTML string directly into our JavaScript object format without going through the browser’s DOM system first. The resulting pattern looks roughly like the sketch below.
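Current diffDOM exposes helpers along these lines (see [1] for the exact API); `targetElement` and the HTML strings are placeholders:

```js
import { DiffDOM, stringToObj } from "diff-dom";

const dd = new DiffDOM();

// Parse the HTML string straight into the plain-object format once,
// bypassing the browser DOM, and keep the result between updates.
let cachedObj = stringToObj(initialHTML);

function update(newHTML) {
    const newObj = stringToObj(newHTML);
    const diff = dd.diff(cachedObj, newObj); // pure JavaScript, no DOM reads
    dd.apply(targetElement, diff);           // the only DOM writes
    cachedObj = newObj;                      // reuse on the next keystroke
}
```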
I did notice that you mentioned virtual DOM systems above. And yes, they are probably a solution to some of this. I have been thinking for a while about using jsdom on Safari to create DOCX/ODT files rather than the native DOM mechanism, to get around that browser’s restrictions. The main reason I haven’t done it yet is lack of time.
But also jsdom & co create an overhead because they add dom related methods, etc. that we don’t really need in diffDOM, so the best solution so far seems to be this custom JavaScript object.
Actually, the reason I chose to store this JSON rather than some other, more readable custom variant is precisely that it has been supported for 6.5 years and corresponds quite directly to the XML, so I don’t really see it as a competing standard. It’s more a different way to express the same thing, and a converter could be written to turn it back into XML. This representation just happens to be faster in browsers, and at least in Fidus Writer we have so much going on that every microsecond counts. To give an idea of how direct the correspondence is, see the example below.
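Roughly (simplified, and the exact keys may differ a bit from what citeproc-js uses), a style fragment like

```xml
<citation et-al-min="3" et-al-use-first="1">
    <layout delimiter="; "/>
</citation>
```

maps to

```json
{
    "name": "citation",
    "attrs": { "et-al-min": "3", "et-al-use-first": "1" },
    "children": [
        { "name": "layout", "attrs": { "delimiter": "; " }, "children": [] }
    ]
}
```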
I really did not start this conversation to launch a new format. The question was simply how I could do what I needed to do, which was to change some parameters of the citation style before running citeproc on it. Creating an XML DOM, changing some attributes (which would probably break on Safari) and then serializing it back to a text string before running citeproc seemed a bit excessive. So I went into the documentation and found something that could not only solve that problem but is also likely to speed up the process in general.
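For illustration, something along these lines (the `setAttr` walker is my own helper, not part of citeproc-js, and the attribute change is just an example):

```js
// Flip one attribute on the already-deserialized style object and hand it
// straight to citeproc - no XML DOM, no re-serialization.
function setAttr(node, elementName, attrName, value) {
    if (node.name === elementName && node.attrs) {
        node.attrs[attrName] = value;
    }
    (node.children || []).forEach(child => {
        // children can contain plain text strings; only recurse into elements
        if (typeof child === "object") {
            setAttr(child, elementName, attrName, value);
        }
    });
}

setAttr(styleObj, "citation", "et-al-min", "4"); // example parameter change
const engine = new CSL.Engine(sys, styleObj);
```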
But now, with your responses here, I am starting to wonder - maybe there actually are enough technical reasons to move away from XML. The one reason I can see to keep XML as the main file format is that it is already established. But of course, if using XML is always three times slower than using JSON, then you will find yourself having to fend off JSON alternatives all the time. At any rate, when we moved from our custom solution to citeproc back around 2012/13, it was precisely to have one less thing to worry about, to let others take care of it, and to become more compatible with and similar to other tools. So I can assure you that I will not be launching any kind of citation style standard any time soon. I’m quite happy with citeproc(-js) as it is.
Right, and that’s a significant shortcoming, which I can see for example with the package.json file that is used so much these days. I hope that eventually one of the initiatives to add comments is successful [2]. However, it seems the world is more JSON-positive than XML-positive in 2019.
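For reference, this is the kind of thing JSON5 would permit in a package.json-style file:

```json5
{
    // comments are legal in JSON5, and so are trailing commas
    name: "example-package", // unquoted keys work too
    private: true,
}
```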
Look, maybe we got off on the wrong foot here. My understanding of this conversation was that I had found, in the manual, a way of running citeproc-js that I, as a random nobody on the internet, hadn’t been aware of until now, and it wasn’t clear to me whether it would be faster or slower than running it with XML. Frank then explained that it would likely be faster and under no circumstances slower. I then quickly analyzed where in our workflow the conversion would be easiest to do and decided that doing it when saving to the database would be the simplest to implement. I committed the code and added a single comment, for others who might be in a similar situation, on how that helped me. But then suddenly you came with a long list of reasons why XML is better than JSON. And you’re not wrong, and if I were the maintainer of some heavily used citation manager, it would probably have made sense to have this conversation with me. But I’m just not that person.
Citations are one of the many things Fidus Writer does, and it doesn’t have any interface for editing citation styles, so I would be very surprised if anyone went in and manually edited those. An administrator could theoretically also go in and edit documents, which are likewise large amounts of JSON, but I would be highly surprised if anyone did that.
[1] https://github.com/fiduswriter/diffDOM - diffDOM: A diff for DOM elements, as client-side JavaScript code. Gets all modifications, insertions and removals between two DOM fragments.
[2] https://json5.org/