I agree, but on the other hand there’s always the risk of developing a standard that’s not really supported anywhere…
So, yes, there are two issues.
One is a design issue: where do you draw the line between the tasks of (a) massaging the sort of messy data people use “in the wild” into structured data a processor can use and (b) taking structured data and turning it into accurate citations. The boundary is not always obvious. For my part, I’d sooner not move too much of task (a) into processors for reasons of hygiene, and because I see these as different tasks which should be kept separate. But ultimately it’s a matter of judgment.
The second is a specification issue. If you do expect processors to transform input, you MUST specify how they should do it, if it matters. There may be very sound reasons for needing it (the subtitles issue looks like it is one), but if it’s required it needs to be specified.
As things stand there are certainly things where the spec is silent, but the test suite assumes behaviour (such as extracting particles from given or family names, or dates from “raw” dates) which is not mentioned in the spec. That shouldn’t happen. There are plenty of other examples (e.g. extracting initials, normalizing spacing and punctuation, dealing with quotations) where in practice there is expected behaviour which has never been properly specified and which we have to guess from the test suite. This is not the best.
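To make the gap concrete, here is a sketch (in JavaScript, the language of citeproc-js) of the sort of particle extraction the test suite appears to assume. The particle list and the splitting rule here are illustrative assumptions of mine, not the actual citeproc-js logic and not anything the spec describes:

```javascript
// Sketch of the kind of transform the test suite assumes but the spec
// never mentions: peeling a lowercase particle off the front of a
// family-name string. Particle list and rules are illustrative only.
const PARTICLES = ["van", "von", "de", "der", "den", "la", "le"];

function extractParticle(family) {
  const parts = family.split(" ");
  const particles = [];
  // Peel off leading tokens that look like particles, but always
  // leave at least one token behind as the family name proper.
  while (parts.length > 1 && PARTICLES.includes(parts[0])) {
    particles.push(parts.shift());
  }
  return {
    "non-dropping-particle": particles.join(" ") || undefined,
    family: parts.join(" "),
  };
}
```

For example, `extractParticle("van der Berg")` would yield a family name of `Berg` with a non-dropping particle of `van der` — exactly the kind of behaviour a processor currently has to reverse-engineer from the test suite.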
By minimizing the amount of fiddling that you expect from processors, you make it more likely that the spec will be right, because you have clearly structured input data, and clearly specified output. But always subject to pragmatic considerations, of course.
As our discussions have proceeded over the past couple of weeks, attempting to solve a few tricky issues, I’m thinking the road to a 1.1 release will be longer than I’d initially anticipated.
I think, in short, we should make some aggressive decisions, move quickly towards a v1.1-pre.1 tagged pre-release, and then give implementers plenty of time (30 days or more) to provide feedback, so they can not just review the changes, but also experiment with the code changes needed to support them.
As part of those “aggressive decisions,” I think we should for now drop ideas that seem too complicated or potentially controversial. My decision to mark the new “rich text” functionality I just merged as “experimental” is one example; a kind of middle ground.
Sounds good. But we should also identify those issues that are not too complicated and close as many of these as possible.
I won’t trouble the list with a sample, but I’m happy to confirm that citeproc-js builds an unambiguous nested structure from parsed content, in an internal representation that is then flattened into some output format or other.
If I understand correctly, the discussion here seems to turn on unease with the fact that potentially ambiguous string input is magically transformed into cleanly formatted output by a black box that performs various ill-specified restructuring operations on the former to produce the latter. It’s understandably frustrating when viewed from a specification-to-code perspective.
In an ideal world, mass-distributed metadata would be cast according to strict rules that are amenable to unambiguous parsing. Unfortunately there are ambiguities, and given a massive volume of data and a large population of users, removing them through small, documented markup extensions such as @bwiernik suggests is probably the way to go. I can’t see a user dealing with a UI with separate fields each for dropping and non-dropping particles, name suffixes, given names, and family names. Likewise, a structured WYSIWYG editor that requires quotation marks be applied as nested markup in similar fashion to italics or boldface is probably going to lose out to keyboard entry with most users.
If you will indulge a slight digression, I can offer an example of the risks of structural ambition from personal experience. Jurism was originally inspired by a multilingual RDF schema that is published by the National Institute of Informatics (NII) in Japan, and used to publish metadata from their CiNII aggregator service. I built the initial version of Jurism (then “Multilingual Zotero”) to digest that RDF. It worked a treat on sample input: names parsed out nicely, and Japanese and English fields imported correctly and were set with their respective language tags. Unfortunately, my hopes of solving the productivity problem for international scholarship on Japan were dashed when I discovered that metadata on the service is riddled with irregularities. Multiple languages banged into the Japanese title field with ad hoc syntax like “吾輩は猫である＝I Am a Cat.” Elements of non-Japanese names in arbitrary ordering, and in arbitrary scripts. Records with only partial provision of English. Completely separate records for Japanese and English citation forms. The CiNII RDF specification is a brilliant piece of engineering, but the content is mostly garbage. I had an opportunity to speak with one of the NII developers at a conference, and he said that they just pipeline whatever they receive from publishers direct to the database, with no curation, and no possibility of revision. As he put it, they follow the “Yoshinoya Principle” (from the CEO of a Japanese fast-food chain) under which, among “Fast,” “Cheap,” and “Delicious,” you can only succeed at two. CiNII does fast-and-cheap; and they are not alone.
In light of experiences like mine with CiNII, I don’t think it’s reasonable to expect that unambiguous well-structured markup will be delivered at scale for general consumption. The costs of curation are just too great. Tools that digest data from large repositories need to deal with what’s out there. Some of it will need to be tweaked by users. Successful tools will make the tweaking as painless as possible.
That said, parsing from plain text and juggling metadata fields are two very different tasks, and they could be separately specified. Just as EDTF provides a separate standalone specification of the syntax of date elements, it should be possible to cast a text parsing specification that sets out the structures that should be generate-able from parsing operations. That would then guide the development of syntax, whether kinda-sorta-HTML or something else. It is the markup capabilities that matter, not the HTML-ness or Markdown-ness of the markup, nor the JSON-ness or YAML-ness of structured output.
Such a syntax spec would not directly affect what an existing tool like citeproc-js does. It has an expected input syntax that is capable of expressing the structures needed for formatted string field output. If data in another markup language is to be fed to it, then so long as that markup shares the same capabilities, it can be converted into the markup the tool expects.
On the implementation-specific issue of the admittedly ugly and cumbersome `<span class="no-case"></span>` thing, that’s just a matter of tweaking the implementation to recognize different markup elements: it doesn’t affect the capabilities of the “field markup language,” only its specific expression in the input.
In the experimental solution we settled on, the result is a native JSON data structure: the unambiguous target for any parsing.
Aside: these days, YAML and JSON are effectively interchangeable. Our JSON schemas can validate a YAML variant.
So despite the syntax differences, we’re saying that for CSL processors, rich text is a nested array of strings and formatted objects.
```yaml
title:
  - "A title with tex math "
  - math-tex: "x=y^2"
```
This is also valid though:
```yaml
title: A title
```
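To make the model concrete, here is a minimal sketch of how a processor might flatten such a nested structure to HTML. The `renderRich` function and the specific HTML mapping are assumptions of mine, not specified behaviour; only the input shape (strings and single-key formatted objects, per the examples above) comes from the discussion:

```javascript
// Minimal sketch: flatten the experimental rich-text model to HTML.
// Key names ("italic", "math-tex") follow the examples above; the
// HTML produced for each key is illustrative, not specified.
function renderRich(content) {
  if (typeof content === "string") return content;
  if (Array.isArray(content)) return content.map(renderRich).join("");
  // A formatted object has a single key naming the formatting.
  const [kind, inner] = Object.entries(content)[0];
  const body = renderRich(inner);
  switch (kind) {
    case "italic":
      return `<i>${body}</i>`;
    case "math-tex":
      return `<span class="math">\\(${body}\\)</span>`;
    default:
      return body;
  }
}
```

A plain string title passes through unchanged, which is why `title: A title` remains valid input under the same model.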
But, there’s a reason I marked it “experimental”: we need feedback.
So concerning these three options regarding titles and subtitles:
@Frank_Bennett what would be your perspective on these options? Given your comment above and on the rich text issue, I’d assume that your preference could be option 1, is that correct?
I guess I’m curious whether this is an abstract specification (titles may [or must] be parsable into structures that validate against this schema) or something more concrete (titles may be flat strings for verbatim rendering, or structures that validate directly against this schema).
The latter, I think.
We’re adding the ability to separately specify formatting for main titles and subtitles in styles, because some styles require this.
How should a processor access those parts?
Do we need to change the input schema so a processor accesses them directly:
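For example, something along these lines (the field names `main` and `sub` here are purely illustrative; the actual keys are not settled):

```yaml
# Illustrative sketch only — field names are not settled:
title:
  main: "A title"
  sub: "A subtitle"
```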
… or do we require processors to parse title strings to access these:
On the input end, the last option yields no change in the input model, but would result in things like this to accommodate non-standard data for which the algorithms break:
```yaml
title: "A title || a subtitle, non-standard delimiter"
```
Keep in mind parsing would also apply in the rich text model; I’m not sure how that would work, actually (say, for a new processor targeting 1.1), because the “title string” would no longer be only a string.
It may be worth noting that the current proposal to split titles is based on current citeproc-js functionality.
The logistics do get complicated. But setting aside the rich-text issue, it would obviously be easier in the processor to just receive title and subtitle as separate fields. That’s not the shape of data in the wild, though, so parsing would have to happen somewhere. The burden would just fall on the calling application, which will adopt various solutions or not.
The “redefine title to be main title, and add subtitle variables” option would align us with biblatex.
The title as object I’ve not seen elsewhere (except in MODS).
The parsing rules are based on the existing citeproc-js parsing rules that are used for uppercase subtitles. The additions are (1) a style setting to specify a set of delimiters from a list of discrete options and (2) a specified character string, `||`, to override automatic parsing (similar to the existing full/short comparison, but not requiring multiple fields).
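A sketch of how a processor might implement those two rules, assuming a style-supplied delimiter and `||` as the override marker; the function name and defaults are mine, not part of the proposal:

```javascript
// Sketch of the proposed title parsing: split on the style-selected
// delimiter, unless the string contains the "||" override marker, in
// which case "||" itself marks the split point. Illustrative only.
function splitTitle(title, delimiter = ": ") {
  const OVERRIDE = "||";
  if (title.includes(OVERRIDE)) {
    const [main, ...rest] = title.split(OVERRIDE);
    return { main: main.trim(), sub: rest.join(OVERRIDE).trim() };
  }
  const idx = title.indexOf(delimiter);
  if (idx === -1) return { main: title, sub: null };
  return {
    main: title.slice(0, idx),
    sub: title.slice(idx + delimiter.length),
  };
}
```

So `splitTitle("A title || a subtitle, non-standard delimiter")` would yield the subtitle intact, where automatic delimiter detection would otherwise fail.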
With respect to how this intersects with rich text—I think a simple rule that parsing doesn’t cross markup boundaries would work.
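As a sketch of that rule, assuming the rich-text model where a title is an array of strings and formatted objects, the split would only be attempted inside top-level plain-string segments; everything here beyond that rule is an illustrative assumption:

```javascript
// Sketch of "parsing doesn't cross markup boundaries": only plain
// string segments at the top level of the rich-text array are
// eligible for the subtitle split; formatted objects stay intact.
function splitRichTitle(content, delimiter = ": ") {
  for (let i = 0; i < content.length; i++) {
    const seg = content[i];
    if (typeof seg !== "string") continue; // never split inside markup
    const idx = seg.indexOf(delimiter);
    if (idx === -1) continue;
    return {
      main: [...content.slice(0, i), seg.slice(0, idx)],
      sub: [seg.slice(idx + delimiter.length), ...content.slice(i + 1)],
    };
  }
  return { main: content, sub: null };
}
```

A delimiter occurring inside, say, an italic span would simply never trigger a split under this rule.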
That would be a good solution for my use cases. But, as has been said before, that won’t be without problems either.
(Also, even this might require some parsing in the processor: how would a second subtitle be handled in such a solution?)
It’s not clear to me we should support that.
At least with the current proposal it’s possible without adding too much complexity.
Because the current proposal adds complexity upfront.
How about we leave this question open until end of day (extending this a bit) Sunday, July 19.
If someone wants to argue for a change from the current plan, please state your case, and which option you prefer, here.
Otherwise, we’ll go the parsing titles route.
I mean, if I were a developer, I would want to literally know how to do it.
All languages have string splitting functions, so with a string, that’s straightforward.
Are we saying splitting only happens, for example, on the formatted string, after primary processing (if dealing with rich text, one would need to format the sub-strings for output to RTF, HTML, LaTeX, or whatever, after all)?
So the parsing route is your preference?
I’d prefer not, but I want to keep this moving, so …
I do think we need an answer to my latest question though before we actually do this.
And I’m curious what @PaulStanley thinks as a relative newcomer.