Design Principles for CSL JSON

bwiernik · July 18, 2020, 1:57pm

I think this thread has veered off-topic, so I suggest we move this elsewhere.

Denis and I went through several design iterations on title splitting before coming to consensus with Bruce at the current proposal with a flat field and || as an explicit delimiter to override automatic parsing.

There are several reasons for this approach:

It isn’t sufficient to have just an object with main and sub; you would also need ‘full’, to indicate the original punctuation for styles that don’t normalize, like APA.
Processors will need to parse titles anyway. There is no way around that. Almost all bibliographic data comes with a flat title field, so without parsing, features like uppercase subtitle casing, the ABNT style, and even the updated title case spec based on subtitles won’t work in a lot of places where CSL is used. That being the case, it’s very little extra burden to parse an explicit sub delimiter from a field.
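As a rough illustration of the second point, here is a minimal Python sketch of a flat title field in which || overrides automatic parsing. The function name split_title is hypothetical, and the "first colon followed by a space" rule is only a stand-in assumption for the automatic parsing rules, which are not spelled out in this thread.

```python
from typing import Optional, Tuple

EXPLICIT_DELIMITER = "||"  # the explicit delimiter from the current proposal


def split_title(title: str) -> Tuple[str, Optional[str]]:
    """Return (main, sub); sub is None when no split point is found."""
    # An explicit delimiter always wins over automatic parsing.
    if EXPLICIT_DELIMITER in title:
        main, sub = title.split(EXPLICIT_DELIMITER, 1)
        return main.strip(), sub.strip()
    # Assumption: stand-in automatic rule, split at the first ": ".
    main, sep, sub = title.partition(": ")
    if sep:
        return main.strip(), sub.strip()
    return title.strip(), None


print(split_title("A Title with a || non-standard subtitle"))
```

Because the explicit delimiter is checked first, any colon elsewhere in the title is left alone once a || is present.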

Re: short titles and abbreviations:

title-short and title form="short" are explicitly intended to mean the same thing. title-short/shortTitle and container-title-short/journalAbbreviation are simply the means available for providing short titles in the data model.

Short titles are used in two places in CSL styles:

Most commonly as part of in-text citations or footnotes, such as a substitute for author (APA) or in subsequent footnotes (Chicago full note). This always relies on a short form being specified in the data. For articles or books, the short title will often be the same as the main title, but sometimes not (e.g., “On the Origin of Species”). For some other types, such as legal cases, the short title is generally not the same as the main title (cases don’t have main and sub titles). You may not have thought of these cases when you originally designed CSL, but they are entirely consistent with the original definition of the short title form.
When container-title form=“short”, citeproc-js and pandoc-citeproc accept an abbreviations list to automatically generate abbreviated titles for articles. Otherwise, it uses the short form supplied in the data. This is extra-spec behavior, but one we’ve discussed formalizing in the last few days.

In all of these cases, short title is doing exactly one thing—giving the short form of the title.
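The lookup order described for container-title form="short" could be sketched in Python as follows. This is not citeproc-js or pandoc-citeproc code: the function name, the shape of the abbreviations mapping, and the sample journal are all assumptions made for illustration. The final fallback to the long form follows the CSL rule that an unavailable short form renders as the long form.

```python
from typing import Dict, Optional


def short_container_title(item: Dict[str, str],
                          abbreviations: Optional[Dict[str, str]] = None) -> Optional[str]:
    """Pick the short form of container-title for one item."""
    full = item.get("container-title")
    # 1. An abbreviations list, when supplied, is consulted first.
    if abbreviations and full in abbreviations:
        return abbreviations[full]
    # 2. Otherwise use the short form supplied in the data,
    # 3. and fall back to the long form if there is none.
    return item.get("container-title-short") or full


# Hypothetical data, for illustration only.
item = {"container-title": "Journal of Examples", "container-title-short": "J. Exampl."}
print(short_container_title(item, {"Journal of Examples": "J. Ex."}))
print(short_container_title(item))
```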

The behavior that gduffner links to is a modification in CSLm. I don’t think it has bearing on what we are discussing. I really don’t think we should also add form=“abbreviation”. That is what form=“short” is, with citation-label filling a related role in rare cases (like citation of classics). @gduffner if you are saying that you want to have automatic abbreviation of legal case names but also be able to specify a specific short form for some cases, that level of legal citation complexity seems beyond the scope of CSL’s aims, and I suggest you pursue that as part of CSLm, which aims to more comprehensively handle legal citation.

Bruce_D_Arcus1 · July 18, 2020, 2:24pm

Regardless, I think we need an answer to this before we make a final decision.

asimonyi · July 18, 2020, 2:48pm

Dear All,

as this is my first post on this site, a bit of background: I’m the developer of citeproc-el, a CSL citation processor for the GNU Emacs text editor, written in Emacs Lisp. I also wrote a package to make the processor usable for formatting citations in exported Org mode documents. Thanks to Bruce for inviting me to participate in the discussion. Since the thread has touched upon a lot of issues, this is just a very short (and I’m afraid rather superficial) reaction to a couple of them.

On the level of abstract principles, as far as I can see, I’m in full agreement with @PaulStanley’s view that a fully explicit, unambiguous markup should be available, even the main priority, maybe with added (but unambiguous) sugar. I don’t have much to add by way of arguments: I think the reasons were very well articulated by @PaulStanley and others: it’s important to have a well defined boundary between the unambiguous structured input for the processor and preprocessors/applications providing this by parsing the semistructured, noisy “real world” data, and the standard (by its very nature) should concentrate on the former.

As for the problem of representing titles, I like the idea of handling them analogously to names with titles as objects (this would be option 2 I think) but could live with (3) as well. Just would rather avoid parsing/splitting based on some ad-hoc rules/notations to get to the structure because of the considerations mentioned in the previous paragraph.

bwiernik · July 18, 2020, 3:07pm

Are we saying splitting only happens, for example, on the formatted string, after primary processing (if dealing with rich text, one would need to format the sub-strings for output to RTF, HTML, LaTeX, or whatever, after all)?

My proposition was to do splitting before doing formatting of rich text. So, for example:

title: "This is a ", {italic: "title with a colon in italics: is "}, "that a subtitle? "

Would have no split because the colon is inside a rich text element. This simplifies the parsing rules: any rich-text entity can be treated atomically when doing title splitting/punctuation normalization, etc. If that were supposed to be a subtitle, it would be entered by the user as:

title: "This is a ", {italic: "title with a colon in italics"}, ": ", "is that a subtitle? "

or:

title: "This is a ", {italic: "title with a colon in italics"}, ": ", {italic: "is "}, "that a subtitle? "
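A minimal Python sketch of this rule, assuming the title arrives as a list whose items are either plain strings or single-key dicts such as {"italic": "..."} (that representation and the name split_rich_title are assumptions, not part of any spec). Only plain-string segments are searched for the delimiter, so rich-text nodes are treated atomically.

```python
from typing import Dict, List, Optional, Tuple, Union

Segment = Union[str, Dict[str, str]]  # plain text or a rich-text node


def split_rich_title(title: List[Segment], delimiter: str = ": "
                     ) -> Tuple[List[Segment], Optional[List[Segment]]]:
    """Split at the first delimiter found in a plain-text segment."""
    for index, segment in enumerate(title):
        # Rich-text nodes (dicts) are skipped: they are atomic.
        if isinstance(segment, str) and delimiter in segment:
            before, after = segment.split(delimiter, 1)
            main = title[:index] + ([before] if before else [])
            sub = ([after] if after else []) + title[index + 1:]
            return main, sub
    return title, None


no_split = ["This is a ", {"italic": "title with a colon in italics: is "}, "that a subtitle? "]
with_split = ["This is a ", {"italic": "title with a colon in italics"}, ": ", "is that a subtitle? "]
print(split_rich_title(no_split))
print(split_rich_title(with_split))
```

One limitation of this sketch: a delimiter that straddles two adjacent plain segments (a colon ending one string, the space starting the next) is not detected.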

Bruce_D_Arcus1 · July 18, 2020, 3:18pm

Thanks, but what’s behind my question is this (in Python):

>>> title = ['A Title with a || non-standard ', {'quote': 'subtitle'}]
>>> title.split('||')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'split'

So where would a developer go from there?

Bruce_D_Arcus1 · July 18, 2020, 3:34pm

Welcome!

What might be an example of your last point here? What kind of “sugar”?

bwiernik · July 18, 2020, 3:40pm

For example, in R, something like this:

library(stringr)

title = list('A Title with a ||non-standard ', list(italic = 'subtitle'))
# is.list(title) is TRUE for the whole list, so test each element instead
isRich = sapply(title, is.list)
richText = title[isRich]
names(richText) = paste0("$s", seq_along(richText))
titlePlaceholders = title
titlePlaceholders[isRich] = names(richText)
titleFlat = paste(titlePlaceholders, collapse = "")
# 'A Title with a ||non-standard $s1'
titleSplit = str_split(titleFlat, fixed("||"))[[1]]
for (i in seq_along(titleSplit)) {
  for (x in names(richText)) {
    # convert richText to the processor or output preferred syntax (HTML italics here)
    html = paste0("<i>", richText[[x]]$italic, "</i>")
    titleSplit[i] = str_replace(titleSplit[i], fixed(x), html)
  }
}
# c('A Title with a ', 'non-standard <i>subtitle</i>')

The logic here being:

Replace any rich text entities with placeholders
Join the title list into a string.
Split the title string into main and subtitle(s).
Replace the placeholders with the rich text strings
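The same four steps can be sketched in Python, the language of the traceback earlier in the thread. This is an illustrative port rather than reference code: the function name, the private-use placeholder characters, and the choice to return lists of segments instead of formatted strings are all assumptions.

```python
import re
from typing import Dict, List, Union

Segment = Union[str, Dict[str, str]]

# Assumption: private-use characters that should not occur in real titles.
OPEN, CLOSE = "\uE000", "\uE001"
PLACEHOLDER = re.compile(f"({OPEN}\\d+{CLOSE})")


def split_with_placeholders(title: List[Segment], delimiter: str = "||") -> List[List[Segment]]:
    rich: Dict[str, Dict[str, str]] = {}
    flat_parts: List[str] = []
    # 1. Replace rich-text entities with placeholders.
    for segment in title:
        if isinstance(segment, dict):
            key = f"{OPEN}{len(rich)}{CLOSE}"
            rich[key] = segment
            flat_parts.append(key)
        else:
            flat_parts.append(segment)
    # 2. Join into one string, 3. split into main and subtitle(s).
    pieces = "".join(flat_parts).split(delimiter)
    # 4. Put the rich-text entities back in place of the placeholders.
    result = []
    for piece in pieces:
        chunks = [c for c in PLACEHOLDER.split(piece) if c]
        result.append([rich.get(c, c) for c in chunks])
    return result


title = ['A Title with a || non-standard ', {'quote': 'subtitle'}]
print(split_with_placeholders(title))
```

Since a delimiter inside a rich-text entity is hidden behind its placeholder during step 3, it never triggers a split.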

Bruce_D_Arcus1 · July 18, 2020, 3:47pm

So on this …

This is what I meant by the ontological question yesterday. I don’t know the implications yet, but it just seems strange. I also don’t know if this is a real issue in the wild; whether any copy editor would force an author to “fix” such a title, even if it conflicted with authorial intentions. But I agree, this is a live issue.
Is this not addressable, regardless of the solution?
I think this is an orthogonal issue. Option 2 could support it by adding a “sub-sub” input key and style form, if we wanted to do that. I’m not convinced we do initially. But if and when we did, the advantage is that it then becomes very explicit what’s going on, both on input and in styling. As an aside, maybe: how does this relate to biblatex’s titleaddon?

As a general rule, CSL development has been guided by trying to get the details right for the vast majority of cases, but not to bend over backwards (sorry for the US English idiom) to accommodate every last possibility. This could be one of the situations.

bwiernik · July 18, 2020, 3:52pm

Could we please move the title splitting discussion back to the GitHub PR so we can discuss specifics? We’ve hashed through most of these questions multiple times there. If there are specific critiques of the proposal, let’s discuss those, not try to engage with them in an abstract first-principles way.

Bruce_D_Arcus1 · July 18, 2020, 4:04pm

But we haven’t settled which solution; we’re comparing, it seems, two, and the decision goes to first principles.

How about let’s just pause the discussion for us, and see if any other developers weigh in over the weekend, and then we can decide, and move subsequent discussion back to GH?

Denis_Maier · July 18, 2020, 6:05pm

Yes, it is, but you’ll need to provide information about the original delimiters somehow. Perhaps as per @bwiernik’s suggestion:

title:
  full: A title. With a subtitle
  main: A title
  sub: with a subtitle

Or perhaps like that:

title:
  main: A title.
  sub: with a subtitle

Denis_Maier · July 18, 2020, 6:49pm

Biblatex’s titleaddon is not a second subtitle. The documentation says: “An annex to the title, to be printed in a different font.” On the field eventtitle it says: “Things like “Proceedings of the Fifth XYZ Conference” go into the titleaddon or booktitleaddon field, respectively.”

MLA gives this as an example for two subtitles: “Finis Coronat Opus: A Curious Reciprocity: Shelley’s ‘When the Lamp Is Shattered’”

SBL, however, treats information about a Festschrift simply as a subtitle: “Campbell, Joan Cecelia, and P. J. Hartin, eds. Exploring Biblical Kinship: Festschrift in Honor of John J. Pilch. CBQMS 55. Washington, DC: The Catholic Biblical Association of America, 2016.”

I think CMoS would just omit the information about the Festschrift, and I think this would fit perfectly into titleaddon.

Seems to be slippery terrain.

gduffner · July 18, 2020, 7:50pm

Ok, I thought it was the other way around, that it takes the short form from the data and if it doesn’t exist, it looks for an abbreviation in the list. My bad.

I’m sorry if I wasn’t clear enough and caused a misunderstanding here. I don’t want to have this in CSL. Those very special legal requirements are well served in CSLm. On the contrary, I want the current behavior not to get lost. I just added that link as an indication that the behaviour wasn’t defined in the spec but only in the citeproc-js implementation itself. I obviously missed the discussion about formalising it.

Bruce_D_Arcus1 · July 19, 2020, 8:16pm

Summary

This is a long post, so let me start by summarizing the path I think we should take upfront, which I think (mostly) gives us the best of both options, and flexibility:

Adopt the object title alternative (2)
Add relevant pieces of the formatting rules to the spec
Publish the string parsing rules that Denis and Brenton are working on as a non-spec addendum to facilitate conversion of title strings to objects.
Consider using style-based rules for reassembling titles, rather than splitting them. In my view, this tends to be how style guides are written when they discuss this.

Explanation

I posted this thread because I thought we needed feedback from the developers that would actually implement these features, particularly in light of recent decisions on rich text support. I was concerned they’d balk at the parsing approach we were headed towards.

More generally, I’ve heard complaints from developers much like those Paul has articulated here:

That CSL is in general too complex.
That the test suite, which is crucial, includes too much undocumented behavior, and odd parsing and string manipulation rules.

So it seems to me the least we could do, if we’re going to add new features like independent formatting of title parts, is to not also add complex parsing requirements on top if not absolutely necessary.

Titles as Objects

We heard from three of these developers, each using different languages, and targeting different use cases and environments.

Each of them clearly said they want the input format model that most closely matches the styling language, and requested not to force them to massage input data more than necessary to get it in a form for them to cleanly process.

It would be one thing if opinion was split (no pun intended) among the developers. But it’s not.

So I think we need to go with the object-based option, and figure out how best to make it work.

Include Relevant Details in the Spec

That would necessarily include some rules on trailing punctuation, to accommodate both main titles that end in things like question or exclamation marks, and also the examples that Denis and Brenton have pointed out.

These rules can be derived from the work they already did on the title parsing approach.

Publish the Splitting Rules Outside the Spec

We can also, per Frank, publish these splitting rules as a non-standard addendum to the spec, so that developers can easily generate or convert these data from string titles.

We could even publish a few of these parsing rules (including for names), maybe even with some reference code in Python or JS, on a dedicated repository.

If those developers want to write a pre-parsing function (including using the rules we publish) to handle splitting a title string into its constituent parts, that’s their responsibility, and we make it easier for them.

Related aside: we should also, per Paul’s suggestion, add initials to names, based on the same logic.

Style-based Title Reassembly?

One wrinkle Denis pointed out: in some cases, different styles have different rules on this.

Idea: maybe we could turn the idea of “style-based splitting rules” around into “style-based title-assembly rules”, which could then be accommodated in the object-based input model?
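To make the idea concrete, here is a small Python sketch of assembling a title from an object. It uses the main/sub/full keys from Denis's example above; the function name, the normalize flag, and the specific trailing-punctuation rule are assumptions for illustration, not proposed spec text.

```python
from typing import Dict


def assemble_title(title: Dict[str, str], delimiter: str = ": ",
                   normalize: bool = True) -> str:
    """Join main and sub according to a style's delimiter."""
    main = title["main"]
    sub = title.get("sub")
    if not sub:
        return main
    # Styles that do not normalize reuse the original punctuation, if recorded.
    if not normalize and "full" in title:
        return title["full"]
    # Assumed rule: a main title ending in "?" or "!" keeps its own punctuation.
    if main.endswith(("?", "!")):
        return f"{main} {sub}"
    # Otherwise drop any trailing delimiter-like punctuation and add the style's own.
    return f"{main.rstrip('.:;, ')}{delimiter}{sub}"


obj = {"full": "A title. With a subtitle", "main": "A title", "sub": "with a subtitle"}
print(assemble_title(obj))
print(assemble_title(obj, normalize=False))
```

In this shape, the style chooses the delimiter and whether to normalize, and the input data never needs to be split at processing time.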

PR is here.

asimonyi · July 19, 2020, 9:11pm

Well I didn’t have anything concrete in mind but perhaps a trivial example would be to have an alternative way of specifying names as one “full-name” string in the form “[family], [given]” which would be stipulated (by the standard) to be parsed into the ‘family’ and ‘given’ fields by splitting at the (first) ", " occurrence. (Not that I would consider this a good idea… ) This would be to a certain extent similar to the way the ‘given’ field can contain suffix etc. in the current standard, but with a simple and unambiguous mapping to the fundamental representation by the standard itself.
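For what it's worth, that hypothetical mapping is a one-liner in Python. The "full-name" input and the function name are illustrative only, and nothing like this is in the current standard.

```python
from typing import Dict


def parse_full_name(full_name: str) -> Dict[str, str]:
    """Split "[family], [given]" at the first ", " occurrence."""
    family, sep, given = full_name.partition(", ")
    if not sep:
        # No delimiter: treat the whole string as the family name.
        return {"family": full_name}
    return {"family": family, "given": given}


print(parse_full_name("Campbell, Joan Cecelia"))
```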

bwiernik · July 20, 2020, 7:41am

citeproc-js currently accepts the syntax Family || Given for names.

Bruce_D_Arcus1 · July 20, 2020, 9:47pm

A related question:

Should we add more structure to the csl-citation.json schema?

I’m wondering, in particular, about locators.

PR is here. Comments would be appreciated.
