Sub-field parsing

We almost certainly wouldn’t support Markdown in Zotero and wouldn’t want Markdown parsed by the processor. We’ve always planned to just use a simple rich-text editor, as @Sebastian_Karcher says. It’s not hard, and Markdown isn’t something most of our users would know or expect.

The point here, I would think, is to have a limited, defined set of supported formatting options that all clients know they can use. HTML-emitting rich-text editors make that easy. If a citation processor supported Markdown as input, that would make it much less clear what output you would get for given input. If there’s a > at the beginning of a title, is that going to turn into a blockquote? Is it going to be stripped? Passed through as a literal? If I use arbitrary HTML, is that going to remain as raw HTML because it’s passed through by the citation processor’s Markdown processor? Will consumers of the generated HTML need to sanitize it because they won’t know what tags it might contain?
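To make those questions concrete, here’s what a typical CommonMark implementation does with title-like input (a sketch using the marked library; the exact output strings vary by version and options):

```typescript
import { marked } from "marked";

// A leading ">" in a title becomes a blockquote under CommonMark rules.
marked.parse("> 90% Solutions: Lessons from Industry");
// → "<blockquote>\n<p>90% Solutions: Lessons from Industry</p>\n</blockquote>\n"

// Raw HTML is passed through untouched by default, so consumers of the
// generated HTML would have to sanitize it themselves.
marked.parse('A <span onclick="alert(1)">Title</span>');
// → '<p>A <span onclick="alert(1)">Title</span></p>\n'
```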

I think if a client wants to support Markdown, it should do so itself and pass on the HTML subset to the processor. Note that this also lets clients decide what output format they want to support, because they can convert the processor’s HTML output to Markdown themselves. If the processor took Markdown as input, that decision would be left to the processor, forcing the client to escape any Markdown it wanted to preserve that might otherwise be stripped by the processor’s implementation. And since clients likely want to display rendered values themselves separate from citation processing — e.g., in the items list in Zotero — they would need to bundle a Markdown processor anyway, and they might handle it differently from the citation processor.
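The client-side conversion step could be quite small. A sketch, assuming a deliberately limited Markdown subset (the function name and the regex-based approach are made up for illustration; a real client would need to handle nesting and escaping properly):

```typescript
// Hypothetical client-side step: convert a small Markdown subset to the
// processor's HTML subset before calling the processor. Anything outside
// the subset is left as literal text, so nothing is stripped behind the
// user's back.
function markdownSubsetToHtml(field: string): string {
  return field
    .replace(/\*\*(.+?)\*\*/g, "<b>$1</b>") // bold first, so ** isn't eaten as two italics
    .replace(/\*(.+?)\*/g, "<i>$1</i>");
}

markdownSubsetToHtml("The *Drosophila* genome: a **draft** sequence");
// → "The <i>Drosophila</i> genome: a <b>draft</b> sequence"
```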

Math support is an occasional request by Zotero users, and that, too, should be handled by the client, using whatever interface it thinks would work best for its users, and handling any rendering itself. That’s not for the processor to dictate.

I will note that citeproc-js’s micro-HTML approach is a bit unorthodox, in that it treats other angle-bracket tags as raw text, not HTML markup. But the benefit of that approach is that it avoids all the problems of sanitization I mention above. Nothing is stripped unexpectedly — including unpredictable input from all sorts of data sources — and you can trust that the output HTML is safe to display, because the only unencoded output is generated by the processor itself. (It does mean that when Zotero switches to a rich-text editor for titles, we’ll need to look for those defined tags and convert everything else to encoded text before rendering as HTML internally, but fortunately that’s easy.)
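That tag-whitelisting step might look something like this (a rough sketch, not citeproc-js’s or Zotero’s actual code; the tag list is abbreviated and real entity handling is more involved):

```typescript
// Sketch: keep only the defined formatting tags; entity-encode every other
// "<" so stray angle brackets in the data can't become live HTML.
const ALLOWED = /^<\/?(i|b|sup|sub)>/; // abbreviated; the full set also covers the small-caps and nocase spans

function toSafeHtml(field: string): string {
  let out = "";
  for (let i = 0; i < field.length; i++) {
    const ch = field[i];
    if (ch === "<") {
      const m = ALLOWED.exec(field.slice(i));
      if (m) { out += m[0]; i += m[0].length - 1; continue; }
      out += "&lt;";
    } else if (ch === "&") {
      out += "&amp;"; // simplification: re-encodes pre-existing entities
    } else {
      out += ch;
    }
  }
  return out;
}

toSafeHtml("TNF-<i>α</i> levels in patients <65 years");
// → "TNF-<i>α</i> levels in patients &lt;65 years"
```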

Is this really a syntax issue, or broader?

For the sake of argument, if we said all CSL processors should support the following subfield formatting, in both HTML and Markdown:

  • italic
  • bold
  • superscript
  • subscript

… and use the pandoc syntax for the last two in Markdown.
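Concretely, the equivalences would presumably be (using pandoc’s extension syntax for superscript and subscript):

  • italic: <i>…</i> / *…*
  • bold: <b>…</b> / **…**
  • superscript: <sup>…</sup> / ^…^
  • subscript: <sub>…</sub> / ~…~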

Why is Markdown a problem there and the current sorta-HTML not?

Just trying to understand.

I think Markdown is much more prone to false positives than HTML-like syntax, especially for *. I wouldn’t want text to be parsed as Markdown unless explicitly intended as such.

I understand this concern, though I don’t really share it. In my experience, Markdown syntax creates very few problems in this respect. To begin with, a title would have to contain at least two asterisks (or underscores) before problems could even theoretically start.
If anyone is aware of a way to get a random sample of titles containing at least two asterisks out of some corpus, say Crossref, we might get a somewhat clearer picture.

Another reason to favor HTML (or HTML-like-thing) is that the syntax is extensible. In the original discussion over in-field markup, the possibility of semantic markup of titles was one of the concepts on the table. At some point, for whatever reason, someone might want to stir semantic hints into CSL records. HTML syntax would allow that (with class names, say), whereas Markdown would need some sort of non-standard workaround to shoehorn the information into the field and parse it back out again. (I realize that the no-dependencies homebrew parser in citeproc-js is stiff and limited, but there are lots of HTML parsing libraries around that someone could use instead.)
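citeproc-js’s existing <span class="nocase">…</span> is already an instance of this pattern; a hypothetical semantic extension along the same lines might look like <i class="taxon">Drosophila melanogaster</i>, with the class name carrying the semantic hint and the tag still specifying the fallback formatting.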

@njbart Probably a better example is links. Square brackets are fairly common in title data—those should generally not be parsed as links.

Fair enough. pandoc does indeed parse strings inside square brackets as links if the bracketed string happens to match, e.g., a section title, and if the biblio data are provided inside a YAML header block. This might be relatively rare, but I feel it’s not ideal; John MacFarlane asserts it’s not a bug. See https://github.com/jgm/pandoc-citeproc/issues/457.

Well, that’s not actually Markdown, the same way that the sorta-HTML isn’t actually HTML. Sorta-Markdown would be conceptually identical to sorta-HTML, in that it’s just specific tags the processor knows how to turn into specific formatting, with all other Markdown ignored. The main difference would be that sorta-Markdown would be a bit more likely to unexpectedly turn into formatted text, whereas sorta-HTML is unambiguous.

My read of the above thread is that we were discussing actual Markdown, which implies both a broader syntax and also arbitrary HTML, and that creates all the problems I discuss above.

But I also just don’t really understand the impetus here. Maybe there’s some part of a pandoc workflow that I’m missing, but as I see it this markup exists solely for the calling application to communicate rich-text formatting to the processor. I don’t see what’s gained by supporting a more ambiguous input format.

We shouldn’t get overly distracted by this syntax being exposed in apps like Zotero, which is only the case because this was implemented as a hack in citeproc-js. Sorta-HTML does have the important advantage that it can be exposed to users without their being exposed to all the problems that full HTML (or full Markdown) would imply in terms of encoding, stripping, and sanitization, but an application could just as easily present 1) a WYSIWYG editor or 2) a Markdown editor, and then pass the supported, unambiguous, sorta-HTML tags to the processor. It would be the calling application’s responsibility, not the processor’s, to decide what happened to any unsupported formatting that it allowed to be entered. (A proper WYSIWYG editor just wouldn’t allow anything but the supported formatting.)

So what’s the point of this?

The point of the discussion, I think, is that pandoc supports full Markdown in its CSL YAML entry format. If a user is curating a library with a program like Zotero, this creates some potential data incompatibility if they want to work with their library both in Word, via Zotero’s integration, and in pandoc, via CSL YAML. This is a fairly common case, for example, for R users writing papers with RMarkdown.

Based on this discussion, I think probably the best approach would be for CSL-JSON to formalize the HTML-like syntax of citeproc-js because of its unambiguity. CSL YAML should include an element specifying the markup syntax used. When, for example, Better BibTeX generates CSL YAML with Markdown markup specified, it should convert the HTML-like markup to corresponding Markdown syntax.

With this in mind, Zotero users would mark up their fields using the HTML-like syntax (currently) or a rich text editor (in the future), whereas pandoc users who curate YAML files by hand could use the more human-readable Markdown syntax supported by pandoc.
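A sketch of that conversion direction, assuming the four basic tags and pandoc’s ^…^/~…~ syntax (a hypothetical helper, not Better BibTeX’s actual code; a real exporter would also need to handle nesting and escaping):

```typescript
// Sketch: translate the HTML-like subset to pandoc Markdown when emitting
// CSL YAML.
const TAG_TO_MD: Record<string, string> = {
  i: "*",
  b: "**",
  sup: "^",
  sub: "~",
};

function sortaHtmlToMarkdown(field: string): string {
  return field.replace(/<\/?(i|b|sup|sub)>/g, (_m, tag) => TAG_TO_MD[tag]);
}

sortaHtmlToMarkdown("H<sub>2</sub>O and <i>E. coli</i>");
// → "H~2~O and *E. coli*"
```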

Can I add my 2 cents?

Every processor is likely to start its work by taking input in whatever form it is given and either refuse to deal with it at all or convert it to some sort of “internal” representation, which is quite unlikely to be JSON, YAML, HTML, or any variant thereof, and almost certain to be a sort of lightly marked-up, list-like or s-expression-like structure.

One then simply layers on parsers which take any external representation and convert it to the internal one.

The critical question from a processor design point of view is only this: what types of input do I need to recognise and distinguish? As I understand it, that is (as things stand) just these: text, numbers, dates, names. Within the text class, it must recognise and distinguish between “plain” text, italic text, bold text, superscript text, subscript text, smallcaps, and text protected against changes of case.
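In TypeScript for concreteness, a sketch of one possible internal shape (names made up, not any particular processor’s):

```typescript
// A sketch of one possible internal representation: the input classes the
// processor must distinguish, with text carried as a flat list of spans.
type TextFormat =
  | "plain" | "italic" | "bold" | "superscript"
  | "subscript" | "smallcaps" | "nocase";

interface Span {
  format: TextFormat;
  text: string;
}

type FieldValue =
  | { kind: "text"; spans: Span[] }
  | { kind: "number"; value: number }
  | { kind: "date"; raw: string }       // date parsing elided
  | { kind: "names"; names: string[] }; // heavily simplified

// Whatever the external syntax, "The <i>Origin</i> of Species" ends up as:
const title: FieldValue = {
  kind: "text",
  spans: [
    { format: "plain", text: "The " },
    { format: "italic", text: "Origin" },
    { format: "plain", text: " of Species" },
  ],
};
```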

Exactly how those are marked in any type of input a processor may choose to accept is a detail that is less important, so long as I know how it is going to be done for any type of input that I am willing to process.

It may make sense to specify a “canonical” representation for any given type of input, e.g. (for JSON, use <i></i> <b></b> ..., for YAML use ..., for BibTeX use \emph{} \textbf{} ...), and at this point it’s actually fine to let a thousand flowers bloom: not every processor will or needs to process multiple different input forms. But anyone who chooses to process YAML will know what to expect etc. (Strictly speaking, the CSL spec doesn’t even need to specify what form the input takes at all, so long as it can be coerced respectably into an internal representation that respects the CSL “types”.)

HOWEVER, it would be a really annoying thing to specify multiple versions of markup for a given form of input. If I’m getting JSON, I want to know how I have to parse it. I don’t want to have to “sniff” it to see if I think someone’s used markdown or whatever. A frontend should be responsible for coercing its input into one definite canonical form. A processor should always expect to be told what form its input is going to be in, and to know with certainty what sort of markup that input is allowed to contain. And a processor must never be expected to do anything other than handle that markup correctly. Spending time trying multiple different markups in an attempt to brute force one that makes sense is a big waste of energy.

So please don’t give me JSON and then leave me to “figure out” whether it’s html-ish JSON or markdown-ish JSON.

The history of standards suggests that being “generous in what you accept” is a nice-sounding idea that leads to exponential trouble down the line.

For my own part, I would specify that for JSON input (and every processor should at least deal with that) the “html-ish” tags are correct, and the only allowable tags. Those and only those tags will be recognised as valid/special. As an API design, admittedly, they are horrible (especially nocase and small-caps), but sometimes we have to live with the mess we have.
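For reference, the html-ish set as citeproc-js currently defines it is, if memory serves:

  • <i>…</i>
  • <b>…</b>
  • <sup>…</sup>
  • <sub>…</sub>
  • <span style="font-variant:small-caps;">…</span>
  • <span class="nocase">…</span>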

Everything else is then left to the frontend. If someone wants to write a database frontend that manages its data and allows rich-text entry in any form it likes, that’s OK, but if it is going to output JSON that it expects a processor to deal with, it is then its responsibility to make any conversions.

Of course!

This is pretty much where we are now as well. I’m hoping to post a PR on the documentation repo soon, but in the language I’m working on I was saying “well-formed HTML tags,” with a small list of them.

We also discussed the idea of having an optional markup or similar property where one could note other formats; for example, in pandoc using YAML, markdown is likely, but so are org and rst. So one could do this to tell a processor what to expect:

```yaml
title: Some title with *markup*
markup: rst
```

By way of update, we settled on this approach, at least experimentally. It has the virtue of using native JSON data structures.
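Purely as an illustration of the flavor (not necessarily the exact schema adopted), a native-JSON rich-text value might look something like this:

```json
{
  "title": [
    "The ",
    { "italic": ["Origin"] },
    " of Species"
  ]
}
```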

This proposal did not attempt to deal with semantics (well, aside from quote, and maybe code), per my comments at the top of this thread, which in retrospect seemed like overkill.