RFC: Rich Text for CSL JSON input format

Bruce_D_Arcus1 · July 14, 2020, 11:10pm

I have merged experimental rich text input support to provide a superset of substring formatting functionality included in citeproc-js (though not defined in the CSL spec), but here implemented using native JSON data structures.

Here’s the schema file that defines the model (aside: it’s YAML because I hate hand-writing JSON, but it’s trivial to convert to JSON; it’s just a YAML representation of a JSON Schema, and some tools support YAML directly for this).

In short, this would offer an alternative to pseudo HTML, and also add support for math.

I have designated it experimental because we need developers to vet and implement it, and provide suggestions for the spec.

So a request to the developers here: please do just this, and let us know what you think.

Should we include it in the main JSON schema for 1.1?

Will you commit to supporting it if we do?

By “commit,” I mean if you now support the citeproc-js substring markup, can you also support this, so we can define this as what compliant implementations should or must support going forward? If you don’t already have rich text support, can you add this, or some subset of it?

And on that note, for the spec, what should we say “support” means? How should processors deal with the math content, for example?

Is mathml or tex even the right solution for math here, or should we have a single option to use a unicode math representation? Looking around quickly, it seems there are libraries to convert TeX to unicode, so I am now second-guessing the initial decision here.

cc @PaulStanley @asimonyi @Frank_Bennett

Frank_Bennett · July 15, 2020, 6:18pm

Is there a JS library out there that can convert math-things to RTF? That would be a minimum threshold for the things citeproc-js is used for.

Bruce_D_Arcus1 · July 15, 2020, 6:25pm

Does it have to be RTF output, or could it just be unicode?

Frank_Bennett · July 15, 2020, 6:26pm

Also, is the idea that the processor should be able to handle string field input expressed as a nested object that conforms to this schema, or is this an abstract specification of parsing capabilities? If the former, I wouldn’t work with it directly: it would need to be pre-processed into string input that the processor is already capable of handling. That, and in either case support for the new features (verbatim, strikeout) would have to wait for time available.

Frank_Bennett · July 15, 2020, 6:28pm

I guess Unicode would do, if that can be wrapped in RTF for insertion by a word processor. I don’t know anything about the math stuff. There would have to be a library, and it would have to be easy to slot in.
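
As a rough illustration of what "Unicode wrapped in RTF" could involve: RTF represents non-ASCII characters as `\uN` control words, where N is a signed 16-bit decimal UTF-16 code unit followed by a fallback character. The sketch below is only a minimal escape routine under that assumption; the function name `rtf_escape` is hypothetical, it is not part of citeproc-js, and it does not handle newlines or other control characters.

```python
def rtf_escape(text: str) -> str:
    """Escape a Unicode string for use in RTF body text (illustrative sketch)."""
    out = []
    for ch in text:
        if ch in "\\{}":
            # Backslash and braces are RTF syntax and must be escaped.
            out.append("\\" + ch)
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # Emit one \uN control word per UTF-16 code unit, so characters
            # outside the BMP become a surrogate pair of two control words.
            data = ch.encode("utf-16-be")
            for i in range(0, len(data), 2):
                unit = int.from_bytes(data[i:i + 2], "big")
                if unit > 32767:
                    unit -= 65536  # RTF expects a signed 16-bit value
                out.append(f"\\u{unit}?")  # '?' is the fallback for old readers
    return "".join(out)


if __name__ == "__main__":
    print(rtf_escape("x² + {y}"))
```

Whether a word processor then renders the result acceptably as math is a separate question; this only shows that the wrapping step itself is small.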

Bruce_D_Arcus1 · July 15, 2020, 6:33pm

The former. The idea is to completely bypass any requirement to parse some custom syntax, which gets complicated no matter which option you choose.

I thought you said you already had an internal representation that I presume is similar; wouldn’t it be easier to just transform this to that?

We do need more exploration on the math front, which I also don’t understand well at all.

The mathml and tex support came as suggestions from people with more knowledge than I.

@bwiernik - you’re more knowledgeable about this; when you can find some time, perhaps you could figure this out? I am guessing, for example, that mathml can be inserted into open/libreoffice and word without modification by a processor, but am unsure.

Hence “experimental.”

Frank_Bennett · July 15, 2020, 6:47pm

It could be done, but to be honest, in the case of citeproc-js it would be simpler to flatten the nested structure into the string representation that the processor already digests. Support for the existing string syntax can’t be dropped in any case, because a universe of projects of unknown size is lurking out there that depends on it.

Bruce_D_Arcus1 · July 15, 2020, 7:01pm

That would be no problem. This is forward looking, and designed in particular for implementations that don’t yet have such support.

Frank_Bennett · July 15, 2020, 7:09pm

Okay. Any move would probably be driven by the lead consumer, since my work on Jurism is dependent on theirs. If Zotero opt to begin parsing fields into nested structures, I would ask them to flatten it for delivery to the processor. So there would be no impact at my end, beyond possible implementation of the extensions, time permitting. That would be the story at citeproc-js, anyway.

Bruce_D_Arcus1 · July 15, 2020, 7:14pm

Gotcha.

And in fact, it occurs to me that Zotero could indeed just continue with the status quo: feed citeproc-js a plain string with the markup you use, and only change to the new input model if and when they (and you) add support for the other features, in particular math.

bwiernik · July 15, 2020, 8:32pm

Zotero intends to adopt an actual rich text editor at some point. My impression is that they would be amenable to pretty much any reasonable unambiguous structured markup format.

@Frank_Bennett correct me if I am wrong, but supporting JSON object-structured input could be done fairly simply by converting it to the existing flat string syntax.

The discussion about RTF is mostly because that is what Word fields and similar mechanisms support for display. For math, the likely approach would be to display an RTF-compatible string (e.g., asciimath or UnicodeMath text), then to convert to embedded MathML when Zotero field codes are removed. That’s all on the application side, though. Zotero could just as easily pass math through as literal text and leave it to users to deal with that in post-processing. Basically, what needs to exist is the application telling the processor what to display/insert.

Frank_Bennett · July 15, 2020, 11:49pm

Yes. Structured markup and the flat-string syntax are different expressions of the same thing. The only addition here is new elements (strikeout, code, etc), which can be accommodated by extending the existing flat syntax (for this specific processor). From your description, the same would go for math: the processor can just treat it as a string blob set off with a specific markup that is passed through literally for possible post-processing: "My amazing \[ math-jazz \] formula."

What the text input looks like to the user will be a matter for the calling application to work out. Converting it to a form digestible by a processor should be trivial, so long as caller and callee share the same assumptions about the elements of which the field can be composed.
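
The flattening described above can be sketched roughly as follows. The node shape used here (`{"type": ..., "content": ...}`, with plain strings and lists as children) is a hypothetical stand-in and is not taken from the experimental schema file; the output tags are the flat markup citeproc-js already accepts (`<i>`, `<b>`, `<sup>`, `<sub>`, small-caps and nocase spans), and math is passed through as a literal `\[ ... \]` blob as in the example above.

```python
from typing import Union

# Hypothetical node shape: a str, a list of nodes, or
# {"type": <format name>, "content": <node>}. Not the real schema.
Node = Union[str, list, dict]

# Flat markup that citeproc-js already digests.
TAGS = {
    "italic": ("<i>", "</i>"),
    "bold": ("<b>", "</b>"),
    "superscript": ("<sup>", "</sup>"),
    "subscript": ("<sub>", "</sub>"),
    "small-caps": ('<span style="font-variant:small-caps;">', "</span>"),
    "nocase": ('<span class="nocase">', "</span>"),
}


def flatten(node: Node) -> str:
    """Flatten a nested rich-text value into a flat markup string."""
    if isinstance(node, str):
        return node
    if isinstance(node, list):
        return "".join(flatten(child) for child in node)
    kind = node["type"]
    if kind == "math":
        # Pass math through literally, set off for possible post-processing.
        return "\\[ " + node["content"] + " \\]"
    if kind not in TAGS:
        # New elements (strikeout, code, ...) would need the flat syntax extended.
        raise ValueError(f"no flat markup defined for {kind!r}")
    open_tag, close_tag = TAGS[kind]
    return open_tag + flatten(node["content"]) + close_tag


if __name__ == "__main__":
    title = ["My amazing ", {"type": "math", "content": "math-jazz"}, " formula."]
    print(flatten(title))
```

Note that this sketch does not escape literal `<` characters in the text, which a real converter would have to consider.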

Bruce_D_Arcus1 · July 20, 2020, 6:41pm

Any feedback on this question of mine?

Bruce_D_Arcus1 · February 9, 2021, 4:13pm

Just to update on this, the schema file was merged as a separate file in the repo.

I don’t imagine we’ll integrate it in the main schema file until we see some implementations.

Bruce_D_Arcus1 · November 18, 2021, 7:25pm

Do you know the status of this at this point @bwiernik?
