Design Principles for CSL JSON

As @Denis_Maier, @bwiernik, and I have been working through the large backlog of issues for 1.0.2 and 1.1, it occurs to me we might want to have a larger discussion about principles we should use to guide these changes.

Let’s just focus on the CSL JSON.

There are a few features that are worth improving, some of which are supported in citeproc-js, but not in CSL proper:

  1. dates
  2. titles (subtitle vs main title formatting, for example)
  3. sub-string, mostly title, formatting (bold, italics, math, etc.)

@PaulStanley has noted that citeproc-js is probably too liberal on the dates it will accept, and we don’t want to require that of all CSL implementers, for a variety of reasons.

Settling on EDTF and aligning the structured date representation with it is one step towards addressing that.
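For those less familiar with EDTF: strings like “2021-04-12” (a full date), “1984?” (an uncertain year), “2004-06~” (approximately June 2004), and “1964/2008” (an interval) are all valid EDTF. How such a string would actually attach to a CSL JSON date field is part of what needs settling; the edtf key below is just a placeholder to show the general shape, not a decided field name:

```json
{
  "type": "book",
  "title": "An Example",
  "issued": { "edtf": "2004-06~" }
}
```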

But let’s focus on the last two, since they’re the most important, and are important mostly for titles.

The solution that citeproc-js adopted for 3 was to invent some pseudo-HTML markup, which it would parse and modify.

Last night, John MacFarlane (from pandoc; currently working on a new processor) suggested a completely different approach to this, which is to instead do something like the pandoc json representation of its internal model: have rich text variables optionally represented as nested JSON arrays.

So the principle here would be: CSL processors shouldn’t need to parse fields to divine structure; it should be provided pre-parsed in the JSON.
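As a rough sketch of the shape (the element names here are illustrative, not the schema that was eventually merged), a title with an italicized span might arrive pre-parsed as a nested array rather than a marked-up string:

```json
{
  "title": [
    "On the wings of ",
    { "italic": "Drosophila melanogaster" },
    " in the lab"
  ]
}
```

A processor can then walk the array and emit bold, italics, quotes, etc. in whatever output format the style targets, without ever parsing pseudo-HTML itself.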

Edit: this has since been merged to the 1.1 branch as experimental. More here.

Now, another case we struggled with: subtitles and such.

Rather than add tons of new variables to support this on all the different titles we have, we settled on an idea that seems practical enough: ask processors to split titles based on some rules (including configuration in the style), OR with user-provided double bars, so Title|| Subtitle.

The principle here is thus the opposite of at least John’s suggestion above: CSL processors should/must parse certain fields to divine structure.

Is this inconsistency a problem for anyone?

It occurred to me, for example, in part based on the discussion John sparked, we could also allow titles to be objects, and so to also expect pre-parsed content.

Any suggestion on a broader set of principles we should follow when unsure?

Should we, for example, expect structure and not worry about verbosity or complexity on the input end?

I think as a general principle, expecting applications to provide structured data is a good thing. Especially when the things needing parsing would be entered by users, rather than existing in bibliographic metadata in the wild (e.g., within-field markup). That said, the guiding principle of any decision should be that bibliographic metadata in the wild should produce correct citations by processors without requiring user input or a high degree of application preprocessing. Applications like Cite This for Me, Open Science Framework, etc. depend heavily on processors being able to generate correct citations from typical-quality, usually flat, metadata.

With that in mind, I think markup and parsing of titles are different enough beasts to warrant somewhat different solutions. I see parsing of titles as similar in complexity and demand to testing for is-numeric. Some degree of parsing based on punctuation seems definitely needed for the title/subtitle structure to be practically useful; very little metadata exists pre-parsed in the wild, and many styles require some subtitle formatting.

Beyond that, it is a good idea to have a specified delimiter for manually overriding the automatic parsing to permit consistent handling of data across applications or processors. I think it is reasonable to not expect applications to provide separate title and subtitle fields, and for plain text processors, establishing a consistent splitting syntax would be useful. I support adopting || as a delimiter we use generally for such purposes. In addition to title/subtitle, it could also be used with multiple locators in locator (e.g., Chapter 4, page 12 or this parsing of string locators problem) or similarly with giving multiple types of locators in page. || could be less prone to false positives than comma-delimited or semicolon-delimited lists.
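To make the kind of rule I have in mind concrete, here is a minimal sketch of punctuation-based splitting with a || override; the exact punctuation trigger and trimming behavior are assumptions for illustration, not proposed spec text:

```typescript
// Sketch: split a flat title string into main/sub parts.
// An explicit "||" always wins; otherwise fall back to the first ": ".
function splitTitle(title: string): { main: string; sub?: string } {
  const explicit = title.split("||");
  if (explicit.length > 1) {
    return {
      main: explicit[0].trim(),
      sub: explicit.slice(1).join("||").trim() || undefined,
    };
  }
  const idx = title.indexOf(": ");
  if (idx === -1) return { main: title };
  return { main: title.slice(0, idx), sub: title.slice(idx + 2) };
}

// splitTitle("Main Title: Subtitle") -> { main: "Main Title", sub: "Subtitle" }
// splitTitle("Title|| Subtitle")     -> { main: "Title", sub: "Subtitle" }
```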

There is often tension between an input form that is “easy for the user”, and an input form that is open to only one interpretation.

Generally, Citeproc-JS has taken the laudable but troublesome approach of trying to “do what you mean”, a/k/a guess what the user wants. Mostly that works, but there are corner cases even in quite simple areas, e.g. initials. If I give the name as Philippe, can the processor “guess” that the initials should be “Ph.” not “P.”? How? Should it accept “ME” as initials, or “M.E.” or “M.E” or “M. E.” or “M E” or some or all of those? An “ideal” approach would be to require precision: specify initials separately:

{ "given": "Paul",
  "family": "Stanley",
  "initials": ["P", "M"], …
}

One can see why, for user input, that is avoided. OTOH, I would suggest that (a) something like that should be permitted (so a user can make it clear if necessary); (b) the range of input that is accepted should be limited to clear cases: e.g. accept “P. M. Stanley” but not “P M Stanley”.

We see it in spades in relation to quotes, where we end up having to parse fragments with quotations to try to convert them to some sort of structure we can work with. And literally that is sometimes impossible to do reliably:

The ‘90s were James’ best years

The poor old computer cannot be blamed for turning that into The “90s were James” best years. And what is it to make of

The ‘roaring twenties’ were the speakeasies’ heyday

Those, with straight single quotes, are strictly ambiguous unless you graft on actual semantic analysis. So where does one strike the balance?

I think one has to accept that it is reasonable to be somewhat picky about input, or at least to offer no guarantees where input is ambiguous. But one must offer at least some reliable and unambiguous way of getting the right output.

In my view, doing things like trying to parse titles, or introducing recherché markup, is heading in the same confused direction. If there is a subtitle and a title, let the user decide, and require explicit markup. I’d favour the following principles:

(A) In general, explicit markup should always be available (so, e.g., I’d allow explicit identification of initials or subtitles if required, using title and sub-title). Other forms of markup, if available, should be “sugar” for the canonical and completely clear form. It should always be possible for semantic elements to be specified explicitly in input, preferably in JSON. Fix on JSON as the normative form.

(B) For common cases and to facilitate direct user entry, unambiguous sugared form should be available (so "given": "P. M." and "given": "Paul Matthew" work to specify initials). But only for common cases, and without struggling to accommodate every possible variation. So, for instance, I wouldn’t (as Citeproc-JS does) attempt to parse a family name of “di Angelo” into a non-dropping particle. If the user wants to specify a non-dropping particle, they can/should do that. Frontends can attempt such parsing if they want to, but a processor that is told “this is a family name” is entitled to assume that it is just that!
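To show how narrow I would keep the sugar in (B), a sketch along these lines (the accepted patterns are mine, for illustration only, not proposed spec language):

```typescript
// Sketch: derive initials only from clearly punctuated or fully spelled-out
// given names; anything ambiguous is left to explicit input.
// "P. M." -> ["P", "M"]; "Paul Matthew" -> ["P", "M"]; "P M" -> null.
function extractInitials(given: string): string[] | null {
  if (/^(?:\p{Lu}\p{Ll}*\.\s*)+$/u.test(given)) {
    return given.split(".").map(s => s.trim()).filter(Boolean);
  }
  const words = given.split(/\s+/).filter(Boolean);
  if (words.length > 0 && words.every(w => /^\p{Lu}\p{L}+$/u.test(w))) {
    return words.map(w => w[0]);
  }
  return null; // e.g. "P M": reject rather than guess
}
```

Note that an explicitly entered “Ph.” comes through as ["Ph"], which is exactly the sort of thing only the user can decide.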

(C) As a corollary of (B) it’s OK to reject or mangle “reasonable” but non-compliant input, so long as there is a readily available compliant form. A user can always correct it. If you enter The 'roaring twenties' and it comes out as The ’roaring twenties’ you can easily correct the input.

(D) It’s fine (good!) to make sugared forms available in human-readable forms, but keep the interchange format unpolluted. It would save a heap of time if a processor knew it could expect

 { "title": [ "The ", { "quoted": "roaring twenties" } ],
   "subtitle": "an investigation",
   "author": [ { "name-parts": { "family": "Vinci",
                                  "given": "Leonardo",
                                  "non-dropping-particle": "da",
                                  "initials": [ "L" ] } } ] }

and wouldn’t have to deal with The 'roaring twenties'||An investigation etc.

That isn’t at all incompatible with encouraging the development of software which allows a user to enter those details in other forms and have them parsed out, or to parse data “from the wild” in the hope of extracting the right stuff. But it’s much tidier to separate the quite distinct tasks of interpreting (often ambiguous) input and processing (hopefully unambiguous) data, and as far as possible the parsing phase should be kept separate. Encourage, in other words, a separation of the overall ecosystem into specialised layers, recognising “turning sloppy human-readable text into hard-edged structured data” and “turning hard-edged data into properly formatted citations” as equally valuable but fundamentally different specialisms.

(E) Even where processors do adopt heuristic parsing/sniffing methods (e.g. to detect that a name is not “Latin” from the characters), there should nearly always be some way for the user to make things explicit and override the machine’s guess.

In the particular case of markup, I’m agnostic, because I think markup should be left mostly to the style, with the possible exception of allowing for emphasis in titles. But it should certainly be very rare, and I’m therefore happy to allow even a rather cumbersome convention so long as it is completely unambiguous. FWIW I’d be quite surprised if, internally, processors didn’t hold text in a tree/s-expression-like form along the lines that John MacFarlane’s JSON represents, and I prefer it (for machine consumption anyway) to the ugly pseudo-HTML, which is especially objectionable because it makes the most common legitimate case (preserving capitalization) rather cumbersome: <span class="nocase"></span> is nearly 30 characters to do what BibTeX does in {T}wo.


I certainly agree that parsing should have an accessible override. \ is a common syntax to indicate that \‘roaring twenties’ should not have the quotes converted, for example. The || isn’t a suggestion to require data to have || in it, but to provide a way to accommodate cases analogous to your roaring twenties example. For example, in “Review of ‘The Whale: And Other Stories’: A Great Read”, to prevent interpreting the first colon as a subtitle split, a user might add ||.

Both of these are meant to address edge cases of reasonable parsing rules that work in >90% of cases and yield the most accurate results possible.

Your example of name particle parsing is exactly the sort of case where parsing is really needed. Name data just doesn’t exist in 5-field formats. If a processor doesn’t do particle parsing, then for practical purposes that means CSL doesn’t do dropping versus non-dropping particles or suffixes. That is going to yield inaccurate citations most of the time for any style requiring handling of particles. That’s not acceptable.
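For instance, a processor (or a pre-processing layer) could cover the common flat case with something as simple as this sketch; the particle list and rule are mine for illustration and are not the citeproc-js algorithm:

```typescript
// Sketch: pull leading lowercase particles out of a flat family-name string.
// "di Angelo"    -> { "non-dropping-particle": "di", family: "Angelo" }
// "van der Berg" -> { "non-dropping-particle": "van der", family: "Berg" }
const PARTICLES = new Set(["di", "de", "da", "der", "van", "von", "del", "della", "la", "le"]);

function splitFamily(name: string): { family: string; "non-dropping-particle"?: string } {
  const words = name.split(/\s+/).filter(Boolean);
  let i = 0;
  while (i < words.length - 1 && PARTICLES.has(words[i])) i++;
  if (i === 0) return { family: name };
  return {
    "non-dropping-particle": words.slice(0, i).join(" "),
    family: words.slice(i).join(" "),
  };
}
```

Without something like this somewhere in the chain, flat name data from the wild simply never gets particles.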

One of the amazing successes of CSL so far has been the flexibility of handling real-world bibliographic data that processors, especially citeproc-js, have strived for. It certainly increases complexity of the processors, but that is the price of being able to generate accurate results without major human intervention. Services like Cite This For Me would be vastly inferior without these features. This is one of the key reasons why I prefer CSL over BibTeX.

Thanks @PaulStanley.

So you’re in favor of John’s suggestion. FYI, I just committed this, for discussion. Feel free to weigh in on the PR if you have further thoughts.

What about the subtitle issue, which it occurs to me now overlaps with the rich text issue?

In styles, in 1.1 we need to be able to distinguish full titles from “sub” and “main” forms.

Adding a ton of new variables to accommodate that seemed the wrong approach, as it was getting unwieldy.

History:

The original design of CSL, which we still have, has titles and short titles.

The intention of that was always that title is the full title, and short title could represent the main title.

In practice, short titles are used for other things too.

So I see three options:

  1. ask processors to split, as we’ve been planning
  2. allow titles to be objects, as I mentioned above
  3. redefine title to be main title, and add subtitle variables (as in your example)

3 would be a reasonable option too, that we haven’t discussed.

If the initial title/short idea doesn’t in fact work for our purposes, then maybe we should throw out the idea that title = full title?

So 1 would be like this (as in status quo; nothing changes) for this simple case:

title: 'Main Title: Subtitle'

… 2 would be:

title:
   main: Main Title
   sub: Subtitle
   short: Main

… and 3 would be:

title: Main Title
subtitle: Subtitle

I’d approach the design thus:

  1. What parts need to be accessible (separately) to CSL styles? This is a question for CSL style writers mostly. Let’s suppose it’s (up to) 3: main-title, sub-title and short-title, where a full-title would consist of the main-title and sub-title (perhaps separated by a “:”), and short-title is taken to be an alternative form suitable for standalone use in subsequent cites. (I’m not saying that is right, just that it’s conceivable.)

  2. Now make variables for those parts separately available in the input CSL, perhaps allowing “title” to be a synonym for “main-title” on input.

  3. Then specify rational default processor behaviour, e.g.

  • If asked for “title”, a processor should construct it using the main-title plus a term (e.g. subtitle-separator) and the sub-title, if available. I might consider adding a localised title-macro construction to deal with casing the subtitle, if that’s needed, along the lines of localised dates. In theory that could all be left to a macro in the style, but it might be common enough to be worth adding.
  • If asked for subtitle, a processor will provide only the sub-title.
  • If asked for main-title, a processor will provide only the main title.
  • (Together those two behaviours allow any style writer to “work around” the automated construction of titles: you just never call on “title” alone.)
  • If asked for “title-short”, the processor will provide it if available, and will otherwise provide only the main title. (A rough code sketch of these defaults follows after this list.)
  4. Then consider whether any pre-processing markup is required. Is this a sufficiently common case to justify special markup? To my mind, I’d think not, at the processor level. Up to frontends whether they want to do any magic (“Maintitle: A subtitle” -> “Maintitle”, “A subtitle”) or explicit (“Maintitle|A subtitle”) processing. I would reserve pre-processing for very common cases like initials in given names. The most common “error” will be users who just enter the whole damn thing as “title”, but it probably won’t do enough harm to be worth catching. If I did allow for processor-level parsing, I’d do it ONLY if the relevant text was in the “title” input, and there is no “subtitle” input, i.e. if it “looks” like the user had not bothered to do the processing herself.
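Here is the rough sketch of those default behaviours promised above; the field names and the subtitle-separator term are placeholders, not spec language:

```typescript
// Sketch of the fallback behaviour for title variables described in step 3.
interface TitleParts { main?: string; sub?: string; short?: string }

type TitleVariable = "title" | "main-title" | "subtitle" | "title-short";

function renderTitle(t: TitleParts, variable: TitleVariable, subtitleSeparator = ": "): string | undefined {
  switch (variable) {
    case "main-title":
      return t.main;
    case "subtitle":
      return t.sub;
    case "title-short":
      // fall back to the main title if no short form was supplied
      return t.short ?? t.main;
    case "title":
      // full title = main + separator term + sub, when both parts exist
      if (t.main && t.sub) return t.main + subtitleSeparator + t.sub;
      return t.main ?? t.sub;
  }
}
```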

FYI @PaulStanley we’ve had some long discussions about titles and subtitles before https://github.com/citation-style-language/schema/pull/203

One additional requirement is that some styles render titles with whichever punctuation they had in the original.

I can see that makes it more difficult, and might justify requiring some processor-side parsing. The proposal you link to “meets my requirements” in the sense that it is absolutely clear what a processor is required to do. Although it’s complicated, that isn’t an objection, if it’s required.

The problem with all these proposals is that they have to deal with and satisfy a lot of different and conflicting requirements: ease of use for users, clarity for implementers, legacy data, data in the wild, unsupervised processing, maximum compliance with a given style… which doesn’t make things easier.

I think you summarize this nicely, but this piece has maybe held us back: how users enter data in Zotero et al. should not be something we worry about, and worrying about it can be counterproductive.

Aside: CSL precedes Zotero, but the current iteration was designed in conjunction with it. And citeproc-js was designed to be used beyond Zotero, but primarily focused on it and Juris-M.

So not surprising this implicit requirement has been there, and guided certain solutions.

But I think we should design CSL so developers can easily understand how they would adjust their apps to generate the data that a CSL processor needs.

For sake of argument, we could publish our parsing rules for client apps to generate structured titles.

Finally, in this post I’m not endorsing one solution; I’m just pushing us to settle on the best solution, to clearly articulate sound principles that brought us to it, and to ensure decisions are consistent (for example rich text and sub/main) so they don’t needlessly annoy developers.

I agree that we generally don’t need to be too concerned about data models in something like Zotero, but I think we really need to prioritize handling of data as delivered from publisher websites and bibliographic databases.


Why is that the responsibility of a CSL processor though, as opposed to the publishing source, or even a CSL pre-processor?

Like, imagine @PaulStanley is working on his processor, and he wants to be able to support that funky incoming data.

Should he be required to include that in his processor, by us, or should he instead have the freedom to decide whether or not he wants to pre-process the incoming data?

I recognize the answer may not be straightforward, but it seems worth asking.

One pragmatic point in this discussion is that we aren’t at Day 0 on this problem. We are 10 years into folks like Frank identifying and solving these cases.

Frank wrote citeproc-js based on one key constraint that we do not have: he assumed (indeed, had to) an effectively static input model, so Juris-M and Zotero could be compatible.

So some of his solutions may not be the best solutions for CSL going forward, even if the use case and the logic he used to solve them can inform how they might show up in CSL itself.

One argument pro title parsing by citeprocs: different styles have different assumptions about what constitutes a title and where a subtitle begins. So apps would have to deliver different data based on the selected style. Not impossible, but still…

I agree, but otoh there’s always the risk of developing a standard that’s nowhere really supported…


So, yes, there are two issues.

One is a design issue: where do you draw the line between the tasks of (a) massaging the sort of messy data people use “in the wild” into structured data a processor can use and (b) taking structured data and turning it into accurate citations. The boundary is not always obvious. For my part, I’d sooner not move too much of task (a) into processors for reasons of hygiene, and because I see these as different tasks which should be kept separate. But ultimately it’s a matter of judgment.

The second is a specification issue. If you do expect processors to transform input, you MUST specify how they should do it, if it matters. There may be very sound reasons for needing it (the subtitles issue looks like it is one), but if it’s required it needs to be specified.

As things stand there are certainly things where the spec is silent, but the test suite assumes behaviour (such as extracting particles from given or family names, or dates from “raw” dates) which is not mentioned in the spec. That shouldn’t happen. There are plenty of other examples (e.g. extracting initials, normalizing spacing and punctuation, dealing with quotations) where in practice there is expected behaviour which has never been properly specified and which we have to guess from the test suite. This is not the best.

By minimizing the amount of fiddling that you expect from processors, you make it more likely that the spec will be right, because you have clearly structured input data, and clearly specified output. But always subject to pragmatic considerations, of course.


As our discussions have proceeded over the past couple of weeks, attempting to solve a few tricky issues, I’m thinking the road to a 1.1 release will be longer than I’d initially anticipated.

I think, in short, we should make some aggressive decisions, move quickly towards a v1.1-rfc.1 or v1.1-pre.1 tagged pre-release, and then give implementers plenty of time (30 days, or more) to provide feedback, so they can not just review the changes, but also experiment with the code changes needed to support them.

As part of those “aggressive decisions,” I think we should drop ideas for now that are seeming too complicated or potentially controversial. My decision to mark the new “rich text” functionality I just merged as “experimental” is such an example; a kind of middle ground.

Sounds good. But we should also identify those issues that are not too complicated and close as many of these as possible.


I won’t trouble the list with a sample, but I’m happy to confirm that citeproc-js builds an unambiguous nested structure from parsed content, in an internal representation that is then flattened into some output format or other.

If I understand correctly, the discussion here seems to turn on unease with the fact that potentially ambiguous string input is magically turned into cleanly formatted output by a black box that performs various ill-specified restructuring operations on the former to spit out the latter. That is understandably frustrating when viewed from a specification-to-code perspective.

In an ideal world, mass-distributed metadata would be cast according to strict rules that are amenable to unambiguous parsing. Unfortunately there are ambiguities, and given a massive volume of data and a large population of users, removing them through small, documented markup extensions such as @bwiernik suggests is probably the way to go. I can’t see a user dealing with a UI with separate fields each for dropping and non-dropping particles, name suffixes, given names, and family names. Likewise, a structured WYSIWYG editor that requires quotation marks be applied as nested markup in similar fashion to italics or boldface is probably going to lose out to keyboard entry with most users.

If you will indulge a slight digression, I can offer an example of the risks of structural ambition from personal experience. Jurism was originally inspired by a multilingual RDF schema published by Japan’s National Institute of Informatics (NII) and used to publish metadata from their CiNII aggregator service. I built the initial version of Jurism (then “Multilingual Zotero”) to digest that RDF. It worked a treat on sample input: names parsed out nicely, and Japanese and English fields imported correctly and were set with their respective language tags.

Unfortunately, my hopes of solving the productivity problem for international scholarship on Japan were dashed when I discovered that metadata on the service is riddled with irregularities: multiple languages banged into the Japanese title field with ad hoc syntax like “吾輩は猫である=I Am a Cat”; elements of non-Japanese names in arbitrary ordering, and in arbitrary scripts; records with only partial provision of English; completely separate records for Japanese and English citation forms. The CiNII RDF specification is a brilliant piece of engineering, but the content is mostly garbage.

I had an opportunity to speak with one of the NII developers at a conference, and he said that they just pipeline whatever they receive from publishers directly to the database, with no curation and no possibility of revision. As he put it, they follow the “Yoshinoya Principle” (from the CEO of a Japanese fast-food chain) under which, among “Fast,” “Cheap,” and “Delicious,” you can only succeed at two. CiNII does fast-and-cheap; and they are not alone.

In light of experiences like mine with CiNII, I don’t think it’s reasonable to expect that unambiguous well-structured markup will be delivered at scale for general consumption. The costs of curation are just too great. Tools that digest data from large repositories need to deal with what’s out there. Some of it will need to be tweaked by users. Successful tools will make the tweaking as painless as possible.

That said, parsing from plain text and juggling metadata fields are two very different tasks, and they could be separately specified. Just as EDTF provides a separate standalone specification of the syntax of date elements, it should be possible to cast a text parsing specification that sets out the structures that should be generate-able from parsing operations. That would then guide the development of syntax, whether kinda-sorta-HTML or something else. It is the markup capabilities that matter, not the HTML-ness or Markdown-ness of the markup, nor the JSON-ness or YAML-ness of structured output.

Such a syntax spec would not affect what an existing tool like citeproc-js does directly. It has an expected syntax for input that is capable of expressing the structures needed for formatted string field output. If data is to be fed to it that is in another markup language, so long as that markup shares the same capabilities, it can be converted into the markup that citeproc-js expects.

On the implementation-specific issue of the admittedly ugly and cumbersome <span class="nocase"></span> thing, that’s just a matter of tweaking the implementation to recognize different markup elements: it doesn’t affect the capabilities of the “field markup language,” only its specific expression in citeproc-js.