# Sub-field parsing

Well, more practically, my point is that we can’t know what solution
is workable (generally) without assessing both. Presentation-only
might work for you, but I am not sure (i.e. I really don’t know) if it
works across fields.

But do you know of any cases where semantic markup is really required? I
can’t think of any.

So let’s itemize what we know.

Here’s the stuff I deal with personally in my corner of the social
sciences/humanities:

1. titles within titles; in some cases they are placed in single
quotes, in some cases double-quotes, and in other cases italicized (or
flipped).

2. quotes within titles

3. foreign phrases; not sure off the top of my head how these are
handled; I think it varies

I gather that you also deal with:

1. species names

2. chemicals?

What else is there?

Parties to lawsuits that are deemed “common litigants” in the view of
the editor of the journal concerned? When citing the case in
subsequent references, the case should be referred to by one party
only, and the party that is not a common litigant should be used.

There is no generally accepted list of common litigants, and no
generally agreed unambiguous standard for identifying one.

Ship names

If one had a species name that was not italicized, you could not set
it to output the correct styling (say, italicized).

Maybe I was unable to make myself entirely clear in my example. In my
mind the most important principle here is that when citing, one should
always copy the cited titles verbatim, including errors or non-standard
markup. As such, the markup that should be present in the cited titles
(e.g. gene and species markup), is not a function of the output style of
the manuscript but purely related to the cited sources.

Yes, I understand you, but just disagree. If an article lists my name
as “B D’Arcus” but I have an output style that demands the full given
name, there’s a problem.

That might indeed be an exception to the rule (though full names are never
used in my field).

Rintze

Still fishing for a compromise between ease of entry and semantic
markup, here’s another shot at the topic.

Having sub-field semantic markup would be useful, but it may not be
necessary for the semantic details to be known to CSL. As Rintze’s
data is set up, he knows for a given set of entries what the visual
markup represents. At the application level (Zotero, say), that data
could, for a given set of entries, be exported with the visual markup
mapped to appropriate inline semantic tags. For CSL, though, if there
are very few cases that cannot be covered by a wiki flip-flop scheme
(that’s my impression from this thread), it’s not very painful to just
leave the edge cases for manual touch-up.
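
For anyone unfamiliar with the flip-flop idea, here is a minimal sketch in Python. It assumes, purely for illustration, that inline italics are entered as `<i>…</i>` and that the output is HTML; the function name and the output tags are my own choices, not anything defined by CSL or by an existing processor. An italic request inside an already italic context flips back to roman.

```python
import re

def flip_flop(title: str, outer_italic: bool) -> str:
    """Render <i>...</i> so that italics inside an italic context flip to roman.

    `outer_italic` says whether the style italicizes the whole title.
    Hypothetical helper for illustration; assumes balanced <i> tags.
    """
    out = []
    depth = 1 if outer_italic else 0  # how many italic requests are active
    for token in re.split(r"(</?i>)", title):
        if token == "<i>":
            depth += 1
            # Odd depth means italic, even depth means flipped back to roman.
            out.append("<em>" if depth % 2 == 1
                       else '<span style="font-style:normal;">')
        elif token == "</i>":
            out.append("</em>" if depth % 2 == 1 else "</span>")
            depth -= 1
        else:
            out.append(token)
    return "".join(out)
```

With `outer_italic=False` the marked phrase comes out in `<em>`; with `outer_italic=True` the same phrase is wrapped in a `font-style:normal` span, which is the "flipped" case from the list above.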

If fully fledged semantic markup is left as a problem for another
application to deal with, this would encourage people to adopt
consistent, or at least congruent conventions for inline markup, and
that would I think speed rather than hinder the emergence of data
stores containing semantic hints. I think everybody wins.

A thought, anyway.

Frank

2009/3/24 Rintze Zelle <@Rintze_Zelle>:

Resurrecting this ancient discussion to see if usage might have already become established. Since this occurred, citeproc-js and Zotero have adopted support for a limited set of HTML-like markup for inline formatting: https://www.zotero.org/support/kb/rich_text_bibliography

I am wondering whether it would still be a good idea to formalize this feature in CSL, either using the syntax adopted by Zotero or perhaps Markdown or CommonMark syntax (changing existing data to a simpler syntax would probably be doable for the clients that currently support such markup).

I’d be inclined to go with html (which obviously could be hidden by GUI tools like Zotero or Mendeley) because markdown syntax may not be unique enough (as in, you could see * or _ in regular titles).

The only reservation to putting this in the specs is that we currently don’t say much about the data model at all there. Not sure if that should matter.

My major thought is that we probably want consistent behavior across citeprocs, so listing the tags that must be supported seems like a good idea.

That’s not really a problem. CommonMark goes to great lengths to preserve many punctuation marks, especially mid_word ones and those without matching marks like regular asterisk use.* You would have to have a really weird title to get accidental formatting for any of the markdown features. I would much prefer Markdown as an implementor, and in fact I have an implementation going already. Since Markdown implementations often include smart quotes, this also saves a bunch of quote parsing. And it makes preserving backwards compat with citeproc-js’ micro-HTML (as I have come to call it) very easy: simply recognise those raw HTML tokens as they appear as open tag, then contents, then close tag in the standardised token stream. I will share my implementation shortly.

Of course, these considerations are minuscule compared to the difficulty of inputting raw HTML tokens into your reference library; improving that is even more of a win. I would add finally that it’s a lot less work to support user-friendly formatting in a reference manager with Markdown than by “hiding the tags” with some rich text editor that probably won’t give you the right strict HTML subset to work out of the box.

*Like this.

The Markdown family doesn’t seem to have syntax for superscript, subscript, or small-caps. The first two are heavily used in titles to articles in chemistry and biology.

Obviously as Markdown is HTML-oriented, you can still write superscript/subscript with <sup> and <sub>. So this isn’t a loss. There are some proposed extensions to support some of these (e.g. https://talk.commonmark.org/t/why-there-is-no-syntax-for-subscript-and-supscript/586), but nothing we have to actually wait on. For small caps, you simply get a pair of tokens HtmlTag("<span style=\"font-variant: small-caps;\">") and HtmlTag("</span>"), with Markdown in between, that you can recognise instead of escaping it as &lt;span&gt; etc.

Would these titles benefit from Pandoc-like math syntax in $ signs? Biology I’m guessing no, but chemists might like a “pass this straight through to LaTeX please” where they get to \usepackage{mhchem} and not get bogged down in subscript soup. That’s not what math is, but same kind of thing at least.

We almost certainly wouldn’t support Markdown in Zotero and wouldn’t want Markdown parsed by the processor. We’ve always planned to just use a simple rich-text editor, as @Sebastian_Karcher says. It’s not hard, and Markdown isn’t something most of our users would know or expect.

The point here, I would think, is to have a limited, defined set of supported formatting options that all clients know they can use. HTML-emitting rich-text editors make that easy. If a citation processor supported Markdown as input, that would make it much less clear what output you would get for given input. If there’s a > at the beginning of a title, is that going to turn into a blockquote? Is it going to be stripped? Passed through as a literal? If I use arbitrary HTML, is that going to remain as raw HTML because it’s passed through by the citation processor’s Markdown processor? Will consumers of the generated HTML need to sanitize it because they won’t know what tags it might contain?

I think if a client wants to support Markdown, it should do so itself and pass on the HTML subset to the processor. Note that this actually allows clients to decide what output they want to support, because they can parse the processor’s HTML output as Markdown. If the processor took Markdown as input, that decision would be left to the processor, forcing the client to escape any Markdown it wanted to preserve that might otherwise be stripped by the processor’s implementation. And since clients likely want to display rendered values themselves separate from citation processing — e.g., in the items list in Zotero — they would need to bundle a Markdown processor anyway, and they might handle it differently from the processor.

Math support is an occasional request by Zotero users, and that, too, should be handled by the client, using whatever interface it thinks would work best for its users, and handling any rendering itself. That’s not for the processor to dictate.

I will note that citeproc-js’s micro-HTML approach is a bit unorthodox, in that it treats other angled-bracket tags as raw text, not HTML markup. But the benefit of that approach is that it avoids all the problems of sanitization I mention above. Nothing is stripped unexpectedly — including unpredictable input from all sorts of data sources — and you can trust that the output HTML is safe to display, because the only unencoded output is generated by the processor itself. (It does mean that when Zotero switches to a rich-text editor for titles, we’ll need to look for those defined tags and convert everything else to encoded text before rendering as HTML internally, but fortunately that’s easy.)
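
To make the escaping behaviour concrete, here is a rough sketch in Python of a whitelist approach of this kind. It is not the citeproc-js code; the tag set (`i`, `b`, `sup`, `sub`) and the function name are assumptions chosen for illustration, and the sketch does not check that tags are balanced.

```python
import html
import re

# Hypothetical whitelist for illustration; a real processor defines its own set.
ALLOWED = {"i", "b", "sup", "sub"}
SIMPLE_TAG = re.compile(r"</?([a-z]+)>")

def render_micro_html(field: str) -> str:
    """Pass whitelisted tags through and HTML-escape everything else."""
    out = []
    pos = 0
    for m in SIMPLE_TAG.finditer(field):
        # Text between tags is always escaped.
        out.append(html.escape(field[pos:m.start()]))
        if m.group(1) in ALLOWED:
            out.append(m.group(0))               # known formatting tag
        else:
            out.append(html.escape(m.group(0)))  # unknown tag becomes literal text
        pos = m.end()
    out.append(html.escape(field[pos:]))
    return "".join(out)
```

Because nothing outside the whitelist is ever emitted unescaped, a consumer of the output does not need a separate sanitization pass, which is the property described above.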

Is this really a syntax issue, or broader?

For the sake of argument, suppose we said all CSL processors should support the following subfield formatting, in both HTML and Markdown:

• italic
• bold
• superscript
• subscript

… and use the pandoc syntax for the last two in markdown.

Why is markdown a problem there and the current sorta-html not?

Just trying to understand.

I think Markdown is much more prone to false positives than HTML-like syntax, especially for *. I wouldn’t want text to be parsed as Markdown unless explicitly intended as such.

I understand this concern, though I don’t really share it. In my experience, markdown syntax creates very few problems in this respect. To begin with, a title would have to contain at least two asterisks (or underscores) before problems could even theoretically start.
If anyone is aware of a way to get a random sample of titles containing at least two asterisks out of some corpus, say Crossref, we might get a somewhat clearer picture.
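
Given a dump of titles from some corpus, a coarse pre-filter along these lines could pull out the candidates. This is only a sketch in Python: the function name is hypothetical and the sample titles are made up for illustration, not taken from any real corpus. Having two delimiters is a necessary condition only; whether a real Markdown parser would actually produce emphasis depends on its delimiter rules.

```python
from typing import List

def could_trigger_emphasis(title: str) -> bool:
    """True if the title has at least two '*' or at least two '_' characters."""
    return title.count("*") >= 2 or title.count("_") >= 2

def suspects(titles: List[str]) -> List[str]:
    """Titles worth checking by hand or with a real Markdown parser."""
    return [t for t in titles if could_trigger_emphasis(t)]

# Made-up sample titles, for illustration only.
sample = [
    "p* and q* in fixed-point theory",
    "A plain title",
    "snake_case versus camelCase naming",
    "Why * matters",
]
```

Running `suspects(sample)` keeps only the first title; the ones with a single asterisk or a single underscore are dropped.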

Another reason to favor HTML (or HTML-like-thing) is that the syntax is extensible. In the original discussion over in-field markup, the possibility of semantic markup of titles was one of the concepts on the table. At some point, for whatever reason, someone might want to stir semantic hints into CSL records. HTML syntax would allow that (with class names, say), whereas Markdown would need some sort of non-standard workaround to shoehorn the information into the field and parse it back out again. (I realize that the no-dependencies homebrew parser in citeproc-js is stiff and limited, but there are lots of HTML parsing libraries around that someone could use instead.)

@njbart Probably a better example is links. Square brackets are fairly common in title data—those should generally not be parsed as links.

Fair enough. pandoc does indeed parse strings inside square brackets as links – if the string inside square brackets happens to match, e.g., a section title, and if the biblio data are provided inside a YAML header block. Though this might be relatively rare, I feel this is not ideal, but John MacFarlane asserts this is not a bug. See https://github.com/jgm/pandoc-citeproc/issues/457.

Well, that’s not actually Markdown, the same way that the sorta-HTML isn’t actually HTML. Sorta-Markdown would be conceptually identical to sorta-HTML, in that it’s just specific tags the processor knows how to turn into specific formatting, with all other Markdown ignored. The main difference would be that sorta-Markdown would be a bit more likely to unexpectedly turn into formatted text, whereas sorta-HTML is unambiguous.

My read of the above thread is that we were discussing actual Markdown, which implies both a broader syntax and also arbitrary HTML, and that creates all the problems I discuss above.

But I also just don’t really understand the impetus here. Maybe there’s some part of a pandoc workflow that I’m missing, but as I see it this markup exists solely for the calling application to communicate rich-text formatting to the processor. I don’t see what’s gained by supporting a more ambiguous input format.

We shouldn’t get overly distracted by this syntax being exposed in apps like Zotero, which is only the case because this was implemented as a hack in citeproc-js. Sorta-HTML does have the important advantage that it can be exposed to users without their being exposed to all the problems that full HTML (or full Markdown) would imply in terms of encoding, stripping, and sanitization, but an application could just as easily present 1) a WYSIWYG editor or 2) a Markdown editor, and then pass the supported, unambiguous, sorta-HTML tags to the processor. It would be the calling application’s responsibility, not the processor’s, to decide what happened to any unsupported formatting that it allowed to be entered. (A proper WYSIWYG editor just wouldn’t allow anything but the supported formatting.)

So what’s the point of this?

The point of the discussion I think is that pandoc supports full Markdown in its CSL YAML entry format. If a user is curating a library with a program like Zotero, then this creates some potential data incompatibility if they want to work with their library in both Word with Zotero’s integration and in pandoc via CSL YAML. This is a fairly common case, for example, for R users writing papers with RMarkdown.

Based on this discussion, I think probably the best approach would be for CSL-JSON to formalize the HTML-like syntax of citeproc-js because of its unambiguity. CSL YAML should include an element specifying the markup syntax used. When, for example, Better BibTeX generates CSL YAML with Markdown markup specified, it should convert the HTML-like markup to corresponding Markdown syntax.
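
As a rough idea of what such a conversion could look like, here is a sketch in Python. It is not how Better BibTeX does it; the function name is hypothetical, only four tags are handled, and the sketch assumes the field contains no literal `*`, `^` or `~` that would need escaping. The `^…^` and `~…~` delimiters are pandoc's superscript and subscript syntax.

```python
import re

# HTML-like tag name -> pandoc Markdown delimiter (same for open and close).
DELIMITERS = {"i": "*", "b": "**", "sup": "^", "sub": "~"}

def html_like_to_markdown(field: str) -> str:
    """Swap the supported HTML-like tags for pandoc Markdown delimiters.

    Illustrative only: does not escape Markdown-special characters that
    already occur in the text, and leaves any other tags untouched.
    """
    return re.sub(r"</?(i|b|sup|sub)>",
                  lambda m: DELIMITERS[m.group(1)],
                  field)
```

For example, `<i>Homo sapiens</i> and CO<sub>2</sub>` becomes `*Homo sapiens* and CO~2~`.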

With this in mind, Zotero users would mark up their fields using the HTML-like syntax (currently) or using a rich text editor (in the future), whereas pandoc users who curate YAML files by hand could use the more human-readable Markdown syntax supported by pandoc.

Can I add my 2 cents?

Every processor is likely to start its work by taking input in whatever form it is given and either refusing to deal with it at all or converting it to some sort of “internal” representation, which is quite unlikely to be JSON, YAML, HTML, or any variant thereof and almost certain to be a sort of lightly marked-up, list-like or s-expression-like structure.

One then simply layers on parsers which take any external representation and convert it to the internal one.

The critical question from a processor design point of view is only this: what types of input do I need to recognise and distinguish? As I understand it that is (as things stand) just these: text, numbers, dates, names. Within the text class, it must recognise and distinguish between “plain” text, italic text, bold text, superscript text, subscript text, smallcaps and text protected against changes of case.
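
To illustrate what such an internal representation might look like for the text class, here is a sketch in Python. The `Run` type, the style names and the parser are all hypothetical, and only four of the styles listed above are covered; small caps and case protection are left out to keep it short.

```python
import re
from dataclasses import dataclass
from typing import FrozenSet, List

# Hypothetical mapping from html-ish tag names to internal style names.
TAG_STYLES = {"i": "italic", "b": "bold", "sup": "superscript", "sub": "subscript"}
TOKEN = re.compile(r"(</?(?:i|b|sup|sub)>)")

@dataclass(frozen=True)
class Run:
    """A stretch of text together with the styles that apply to it."""
    text: str
    styles: FrozenSet[str]

def parse_html_ish(field: str) -> List[Run]:
    """Turn html-ish input into a flat list of styled runs."""
    runs: List[Run] = []
    active: List[str] = []  # styles currently switched on
    for token in TOKEN.split(field):
        if not token:
            continue
        m = re.fullmatch(r"<(/?)(i|b|sup|sub)>", token)
        if m is None:
            runs.append(Run(token, frozenset(active)))
            continue
        style = TAG_STYLES[m.group(2)]
        if m.group(1):            # closing tag
            if style in active:
                active.remove(style)
        else:                     # opening tag
            active.append(style)
    return runs
```

A parser for any other external form (YAML with Markdown, BibTeX, and so on) would produce the same list of `Run` values, so the rest of the processor never sees the input syntax.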

Exactly how those are marked in any type of input a processor may choose to accept is a detail that is less important, so long as I know how it is going to be done for any type of input that I am willing to process.

It may make sense to specify a “canonical” representation for any given type of input, e.g. (for JSON, use <i></i> <b></b> ..., for YAML use ..., for BibTeX use \emph{} \textbf{} ...), and at this point it’s actually fine to let a thousand flowers bloom: not every processor will or needs to process multiple different input forms. But anyone who chooses to process YAML will know what to expect etc. (Strictly speaking, the CSL spec doesn’t even need to specify what form the input takes at all, so long as it can be coerced respectably into an internal representation that respects the CSL “types”.)

HOWEVER, it would be a really annoying thing to specify multiple versions of markup for a given form of input. If I’m getting JSON, I want to know how I have to parse it. I don’t want to have to “sniff” it to see if I think someone’s used markdown or whatever. A frontend should be responsible for coercing its input into one definite canonical form. A processor should always expect to be told what form its input is going to be in, and to know with certainty what sort of markup that input is allowed to contain. And a processor must never be expected to do anything other than handle that markup correctly. Spending time trying multiple different markups in an attempt to brute force one that makes sense is a big waste of energy.

So please don’t give me JSON and then leave me to “figure out” whether it’s html-ish JSON or markdown-ish JSON.

The history of standards suggests that being “generous in what you accept” is a nice idea, which leads to exponential trouble down the line.

For my own part, I would specify that for JSON input (and every processor should at least deal with that) the “html-ish” tags are correct, and the only allowable tags. Those and only those tags will be recognised as valid/special. As an API design, admittedly, they are horrible (especially nocase and small-caps), but sometimes we have to live with the mess we have.

Everything else is then left to the frontend. If someone wants to write a database frontend that will manage its data and allow rich-text entry in any form it likes, that’s OK, but if it is going to output JSON that it expects a processor to deal with, it is then its responsibility to make any conversions.

Of course!

This is pretty much where we are now as well. I’m hoping to post a PR on the documentation repo soon, but in the language I’m working on I was saying “well-formed HTML tags,” with a small list of them.

We also discussed the idea to have an optional markup or similar property where one could note other formats; for example, in pandoc using YAML, markdown is likely, but so is org or rst. So one could do this, to tell a processor what to expect:

title: Some title with *markup*
markup: rst
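
On the processor side, a declared `markup` property means the parser can be chosen by lookup instead of by sniffing. The following Python sketch shows the idea under some assumptions of mine: the default value `"html-ish"`, the registry and the stub parsers are all hypothetical, and the stubs only tag the text with the format name where a real parser would build the internal representation.

```python
from typing import Callable, Dict, Tuple

Parsed = Tuple[str, str]  # (markup name, raw text); stand-in for a real parse result

def _html_ish_stub(text: str) -> Parsed:
    return ("html-ish", text)   # placeholder for a real html-ish parser

def _rst_stub(text: str) -> Parsed:
    return ("rst", text)        # placeholder for a real reStructuredText parser

# Hypothetical registry of the markups this processor agrees to handle.
PARSERS: Dict[str, Callable[[str], Parsed]] = {
    "html-ish": _html_ish_stub,
    "rst": _rst_stub,
}

def parse_title(entry: dict) -> Parsed:
    """Pick the parser from the declared markup; never guess from the content."""
    markup = entry.get("markup", "html-ish")  # assumed default, not from any spec
    if markup not in PARSERS:
        raise ValueError(f"unsupported markup: {markup!r}")
    return PARSERS[markup](entry["title"])
```

An entry that declares a markup the processor does not know is rejected outright, which matches the "refuse to deal with it at all" option mentioned earlier in the thread.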