There is often tension between an input form that is “easy for the user”, and an input form that is open to only one interpretation.
Generally, Citeproc-JS has taken the laudable but troublesome approach of trying to “do what you mean”, a/k/a guess what the user wants. Mostly that works, but there are corner cases even in quite simple areas, e.g. initials. If I give the name as Philippe, can the processor “guess” that the initials should be “Ph.” not “P.”? How? Should it accept “ME” as initials, or “M.E.” or “M.E” or “M. E.” or “M E”, or some or all of those? An “ideal” approach would be to require precision and have initials specified separately:
{ "given": "Paul",
  "family": "Stanley",
  "initials": ["P", "M"] …
}
One can see why that is avoided for user input. OTOH, I would suggest that (a) something like that should be permitted (so a user can make it clear if necessary), and (b) the range of input that is accepted should be limited to clear cases: e.g. accept “P. M. Stanley” but not “P M Stanley”.
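To make (a) and (b) concrete, here is a minimal Python sketch. The function names and the exact acceptance rule (each initial is one capital letter, optionally followed by one lowercase letter, then a full stop, with single spaces between initials) are assumptions made for illustration; this is not a description of what Citeproc-JS does.

```python
def parse_initials(text):
    """Return a list of initials for clearly marked input, else None.

    Hypothetical rule: "P. M." and "Ph." are clear; "P M", "ME" and
    "M.E." are rejected rather than guessed at.
    """
    initials = []
    for part in text.split(" "):
        body = part[:-1]  # the part without its trailing full stop
        clear = (
            part.endswith(".")
            and 1 <= len(body) <= 2
            and body.isalpha()
            and body[0].isupper()
            and body[1:] == body[1:].lower()
        )
        if not clear:
            return None
        initials.append(body)
    return initials


def initials_for(name):
    """Prefer an explicit "initials" list; otherwise try the strict sugar."""
    if "initials" in name:
        return list(name["initials"])
    return parse_initials(name.get("given", ""))
```

With this rule, a given name of "P. M." yields the initials P and M, "P M" yields None (leaving the caller to reject it or fall back to a documented default), and an explicit "initials" entry always wins, so "Philippe" can be given "Ph" without the processor having to guess.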
We see it in spades in relation to quotes, where we end up having to parse fragments with quotations to try to convert them to some sort of structure we can work with. And literally that is sometimes impossible to do reliably:
The '90s were James' best years
The poor old computer cannot be blamed for turning that into The “90s were James” best years. And what is it to make of
The 'roaring twenties' were the speakeasies' heyday
Those, with straight single quotes, are strictly ambiguous unless you graft on actual semantic analysis. So where does one strike the balance?
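The ambiguity can be shown mechanically. The following Python sketch enumerates every balanced reading of the straight single quotes in a string; the positional rules (an opening quote follows whitespace, a closing quote follows a non-space character and is not followed by a letter or digit, and any mark may be an apostrophe) are simplifying assumptions of mine, not anything a real processor is known to implement.

```python
from itertools import product


def readings(text):
    """List every balanced reading of the straight single quotes in text."""
    options = []
    for i, ch in enumerate(text):
        if ch != "'":
            continue
        before = text[i - 1] if i > 0 else " "
        after = text[i + 1] if i + 1 < len(text) else " "
        roles = ["apostrophe"]  # always possible, as in '90s or James'
        if before.isspace() and not after.isspace():
            roles.append("open")
        if not before.isspace() and not after.isalnum():
            roles.append("close")
        options.append(roles)

    balanced = []
    for combo in product(*options):
        depth = 0
        valid = True
        for role in combo:
            if role == "open":
                depth += 1
            elif role == "close":
                depth -= 1
            if depth < 0 or depth > 1:  # no unmatched or nested quotes
                valid = False
                break
        if valid and depth == 0:
            balanced.append(combo)
    return balanced


print(len(readings("The '90s were James' best years")))
print(len(readings("The 'roaring twenties' were the speakeasies' heyday")))
```

Under these assumptions the first sentence has two consistent readings (two apostrophes, or one quoted phrase) and the second has three, so nothing short of understanding the sentence can pick the right one.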
I think one has to accept that it is reasonable to be somewhat picky about input, or at least to offer no guarantees where input is ambiguous. But one must offer at least some reliable and unambiguous way of getting the right output.
In my view, trying to parse titles, or introducing recherché markup, is heading in the same confused direction. If there is a title and a subtitle, let the user decide, and require explicit markup. I’d favour the following principles:
(A) In general, an explicit markup should always be available (so, e.g., I’d allow explicit identification of initials or subtitles if required, using title and sub-title). Other forms of markup, if available, should be “sugar” for the canonical and completely clear form. It should always be possible for semantic elements to be specified explicitly in input, preferably in JSON. Fix on JSON as the normative form.
(B) For common cases and to facilitate direct user entry, an unambiguous sugared form should be available (so "given": "P. M." and "given": "Paul Matthew" work to specify initials). But only for common cases, and without struggling to accommodate every possible variation. So, for instance, I wouldn’t (as Citeproc-JS does) attempt to parse a family name of “di Angelo” into a non-dropping particle. If the user wants to specify a non-dropping particle, they can/should do that. Frontends can attempt such parsing if they want to, but a processor that is told “this is a family name” is entitled to assume that it is just that!
(C) As a corollary of (B), it’s OK to reject or mangle “reasonable” but non-compliant input, so long as there is a readily available compliant form. A user can always correct it. If you enter The 'roaring twenties' and it comes out as The ’roaring twenties’ you can easily correct the input.
(D) It’s fine (good!) to make sugared forms available in human-readable forms, but keep the interchange format unpolluted. It would save a heap of time if a processor knew it could expect
{ "title": [ "The ", { "quoted": "roaring twenties" } ],
"subtitle": "an investigation",
"author": [ { "name-parts": { "family": "Vinci",
"given": "Leonardo",
"non-dropping-particle": "da",
"initials": [ "L" ] } } ] }
and wouldn’t have to deal with The 'roaring twenties'||An investigation etc.
That isn’t at all incompatible with encouraging the development of software which allows a user to enter those details in other forms and have them parsed out, or to parse data “from the wild” in the hope of extracting the right stuff. But it’s much tidier to separate the quite distinct tasks of interpreting (often ambiguous) input and processing (hopefully unambiguous) data. As far as possible the parsing phase should be kept quite separate. Encourage, in other words, a separation of the overall ecosystem into specialised layers, recognising “turning sloppy human-readable text into hard-edged structured data” and “turning hard-edged data into properly formatted citations” as equally valuable but fundamentally different tasks.
(E) Even where processors adopt heuristic parsing/sniffing methods (e.g. to detect from the characters that a name is not “Latin”), there should nearly always be some way for the user to state things explicitly and override the machine’s guess.
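The layering in (D) can be sketched in a few lines of Python. Everything here is illustrative: the function names are made up, the "||" separator and the "quoted" node are taken from the examples above, and the rule for what counts as a clear quotation is my own assumption. The point is only that the guessing lives in the frontend and the processor sees nothing but canonical data.

```python
import re


def desugar_title(raw):
    """Frontend step: turn "Title||Subtitle" sugar into canonical parts."""
    title, sep, subtitle = raw.partition("||")
    parts = []
    pos = 0
    # Assumed "clear case": '...' with no word character touching it outside.
    for m in re.finditer(r"(?<!\w)'([^']+)'(?!\w)", title):
        if m.start() > pos:
            parts.append(title[pos:m.start()])
        parts.append({"quoted": m.group(1)})
        pos = m.end()
    if pos < len(title):
        parts.append(title[pos:])
    record = {"title": parts}
    if sep:
        record["subtitle"] = subtitle
    return record


def render_title(record, open_q="\u201c", close_q="\u201d"):
    """Processor step: format canonical data only, with no guessing."""
    out = []
    for part in record["title"]:
        if isinstance(part, str):
            out.append(part)  # plain text is passed through untouched
        else:
            out.append(open_q + part["quoted"] + close_q)
    text = "".join(out)
    if "subtitle" in record:
        text += ": " + record["subtitle"]
    return text


record = desugar_title("The 'roaring twenties'||An investigation")
print(render_title(record))
```

The frontend can still misread something like The '90s were James' best years, but the user then has a guaranteed way out: supply the canonical record directly, e.g. {"title": ["The '90s were James' best years"]}, and the processor leaves the apostrophes alone.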
In the particular case of markup, I’m agnostic, because I think markup should be left mostly to the style, with the possible exception of allowing for emphasis in titles. But it should certainly be very rare, and I’m therefore happy to allow even a rather cumbersome convention so long as it is completely unambiguous. FWIW I’d be quite surprised if, internally, processors didn’t hold text in a tree/s-expression-like form along the lines that John MacFarlane’s JSON represents, and I prefer it (for machine consumption anyway) to the ugly pseudo-HTML, which is especially objectionable because it makes the most common legitimate case (preserving capitalization) rather cumbersome: <span class="nocase"></span> is nearly 30 characters to do what BibTeX does in {T}wo.
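For what it’s worth, a tree form makes the capitalization case almost trivial to handle. The Python sketch below converts single-level BibTeX-style braces into a list of plain strings and protected nodes, then lowercases only the unprotected text. The "nocase" key and both function names are hypothetical and chosen for this example; they are not taken from John MacFarlane’s proposal or from any processor.

```python
import re


def from_braces(text):
    """Convert single-level {protected} runs into a list of nodes."""
    nodes = []
    # re.split with one capture group alternates plain text (even indices)
    # and brace contents (odd indices).
    for i, chunk in enumerate(re.split(r"\{([^{}]*)\}", text)):
        if i % 2 == 1:
            nodes.append({"nocase": chunk})
        elif chunk:
            nodes.append(chunk)
    return nodes


def lowercase_unprotected(nodes):
    """Lowercase plain strings and leave protected nodes untouched."""
    return "".join(
        node.lower() if isinstance(node, str) else node["nocase"]
        for node in nodes
    )


nodes = from_braces("A History of {NASA}")
print(nodes)
print(lowercase_unprotected(nodes))
```

A frontend could accept the terse brace form (or the pseudo-HTML) from users and hand the processor only the tree, which keeps the cumbersome but unambiguous representation out of sight of anyone typing a title by hand.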