Design Principles for CSL JSON

In the experimental solution we settled on, the result is a native JSON data structure: the unambiguous target for any parsing.

Aside: these days, YAML and JSON are effectively interchangeable. Our JSON schemas can validate a YAML variant.

So despite the syntax differences, we’re saying that for CSL processors, rich text is a nested array of strings and formatted objects.

In YAML:

title:
  - A title with tex math
  - math-tex: x=y^2

This is also valid though:

title: A title
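To make the two shapes concrete, here is a minimal sketch of how a processor might accept either form. It assumes a formatted object is a one-key mapping whose value is a string or a further nested list; the function name `title_to_plain_text` is hypothetical and not part of any schema or processor.

```python
def title_to_plain_text(title):
    """Flatten a title that is either a plain string or a nested list
    of strings and formatted objects (assumed: one-key dicts)."""
    if isinstance(title, str):
        return title
    parts = []
    for node in title:
        if isinstance(node, str):
            parts.append(node)
        else:
            # Assumed shape: {"math-tex": "x=y^2"}; the value may itself
            # be a string or another nested list.
            for value in node.values():
                parts.append(title_to_plain_text(value))
    return "".join(parts)


plain = title_to_plain_text("A title")
rich = title_to_plain_text(["A title with tex math: ", {"math-tex": "x=y^2"}])
```

A real processor would render each formatted object for its output format rather than drop the markup; this only shows that both input shapes can go through one code path.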

But, there’s a reason I marked it “experimental”: we need feedback.

So concerning these three options regarding titles and subtitles:

@Frank_Bennett what would be your perspective on these options? Given your comment above and on the rich text issue, I’d assume that your preference could be option 1, is that correct?

I guess I’m curious whether this is an abstract specification (titles may [or must] be parsable into structures that validate against this schema) or something more concrete (titles may be flat strings for verbatim rendering, or structures that validate directly against this schema).

The latter, I think.

We’re adding the ability to separately specify formatting for main titles and subtitles in styles, because some styles require this.

How should a processor access those parts?

Do we need to change the input schema so a processor accesses them directly:

print(ref['subtitle'])

… or:

print(ref['title']['sub'])

… or do we require processors to parse title strings to access these:

print(parse_title(ref['title'], 'sub'))

On the input end, the last option yields no change in the input model, but would result in things like this to accommodate non-standard data for which the algorithms break:

title: "A title || a subtitle, non-standard delimiter"

Keep in mind parsing would also apply in the rich text model; I’m not sure how that would work, actually (say for a new processor targeting 1.1), because the “title string” would no longer only be a string.

It may be worth noting that the current proposal to split titles is based on current citeproc-js functionality.

The logistics do get complicated. But setting aside the rich-text issue, it would obviously be easier in the processor to just receive title and subtitle as separate fields. That’s not the shape of data in the wild, though, so parsing would have to happen somewhere. The burden would just fall on the calling application, which will adopt various solutions or not.

The “redefine title to be main title, and add subtitle variables” option would align us with biblatex.

I’ve not seen the title-as-object option elsewhere (except in MODS).

The parsing rules are based on the existing citeproc-js parsing rules that are used for uppercase subtitles. The additions are (1) a style-setting to specify a set of delimiters from a list of discrete options and (2) a specified character string || to override automatic parsing (similar to the existing full/short comparison, but not requiring multiple fields).
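For a plain string title, those two additions could look roughly like the sketch below. This is not the citeproc-js algorithm: the function name `split_title`, the default delimiter set, and the choice to split at the earliest delimiter are all assumptions made for illustration.

```python
def split_title(title, delimiters=(": ",), override="||"):
    """Split a plain title string into (main, sub).

    `delimiters` stands in for the style-selected delimiter set and
    `override` for the explicit marker; both defaults are assumptions.
    Returns (title, None) when no split point is found.
    """
    # The explicit override wins over automatic parsing.
    if override in title:
        main, sub = title.split(override, 1)
        return main.strip(), sub.strip()
    # Otherwise split at the earliest delimiter from the chosen set.
    best = None
    for delimiter in delimiters:
        index = title.find(delimiter)
        if index != -1 and (best is None or index < best[0]):
            best = (index, delimiter)
    if best is None:
        return title, None
    index, delimiter = best
    return title[:index].strip(), title[index + len(delimiter):].strip()
```

Note that this splits only once, so a second subtitle stays inside the returned subtitle string; handling it separately would need another rule.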

With respect to how this intersects with rich text—I think a simple rule that parsing doesn’t cross markup boundaries would work.
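A rough sketch of that rule, assuming the rich-text title is a flat list of strings and one-key formatted objects: only top-level plain strings are searched for the delimiter, so a delimiter inside markup never triggers a split. The name `split_rich_title` and the single-delimiter default are hypothetical.

```python
def split_rich_title(nodes, delimiter=": "):
    """Split a rich-text title (list of strings and formatted objects)
    into (main_nodes, sub_nodes) without crossing markup boundaries."""
    for index, node in enumerate(nodes):
        # Formatted objects are skipped entirely, so a delimiter inside
        # markup is never treated as a split point.
        if isinstance(node, str) and delimiter in node:
            before, after = node.split(delimiter, 1)
            main = list(nodes[:index]) + ([before] if before else [])
            sub = ([after] if after else []) + list(nodes[index + 1:])
            return main, sub
    # No split point found in any plain-string node.
    return list(nodes), []


main, sub = split_rich_title(
    ["Energy of ", {"math-tex": "x: y"}, ": a study"]
)
```

Here the `: ` inside the `math-tex` object is ignored and the split happens in the trailing plain string, leaving the formatted object intact in the main title.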

That would be a good solution for my use cases. But, as has been said before, that won’t be without problems either.

(Also, even this might require some parsing in the processor: how would a second subtitle be handled in such a solution?)

It’s not clear to me we should support that.

Why not?
At least with the current proposal it’s possible without adding too much complexity.

Because the current proposal adds complexity upfront.

How about let’s leave this question open until end of day (extending this a bit) Sunday, July 19.

If someone wants to argue for a change from the current plan, please state your case, and which option you prefer, here.

Otherwise, we’ll go the parsing titles route.


I mean, if I were a developer, I would want to literally know how to do it.

All languages have string splitting functions, so with a string, that’s straightforward.

Are we saying splitting only happens, for example, on the formatted string, after primary processing (if dealing with rich text, one would need to format the sub-strings for output to RTF, HTML, LaTeX, or whatever, after all)?

So the parsing route is your favourite?

No. :slight_smile:

I’d prefer not, but I want to keep this moving, so …

I do think we need an answer to my latest question though before we actually do this.

And I’m curious what @PaulStanley thinks as a relative newcomer.

Like not at all or simply not parsing?

You asked about parsing, so that’s all I meant.

Ok.
But then which one of those three options would be your favorite? (But you don’t want to argue for a change of plans?)

No doubt, feedback from other implementers would be useful. This is based on current citeproc-js behaviour, but what do @PaulStanley, @cormacrelf, @John_MacFarlane think about this? Also @asimonyi

I don’t have a strong preference. I can see arguments either way.

But you don’t want to argue for a change of plans?

I think we should ideally base the decision on what works for styles and style authors, and for CSL developers. I have concerns about the parsing for the latter, but developers can speak for themselves.

I may change my mind based on subsequent conversation though :wink:

Just to clarify the main vs short title distinction, this is something that is a pretty annoying limitation for BibTeX users in fields like law that regularly use short titles that aren’t just the main title.
