Raw dates vs date parts

Could someone explain the workings of date_RawSeasonRange1.txt?

testscripts/date_RawSeasonRange1.txt FAILED
-------- EXPECTED --------
Spring 1999–Summer 2001
----------- GOT -----------
June 1, 1965

We have both a raw date and date-parts, which are incompatible. The CSL calls for

<date variable="issued" date-parts="year-month-day" form="text"/>

The input data contains what is needed to construct just what is asked for, but the test result calls for the raw form to be preferred, ignoring the date-parts. I suppose this is not, strictly speaking, a CSL issue, but I had assumed that the sane thing to do was to use date-parts if they are available, so I’d coded this to use them and to parse the raw date only as a fallback. Is that wrong?
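For concreteness, the behaviour described above (use date-parts when present, parse raw only as a fallback) might look like the following Python sketch. The key names date-parts and raw are the ones used in the fixture; parse_raw and resolve_date are hypothetical names, and the toy parser only understands YYYY, YYYY-MM and YYYY-MM-DD, so it is a stand-in and not how citeproc-js actually parses raw dates.

```python
import re

# Assumption: a deliberately tiny raw-date parser, for illustration only.
_ISO_LIKE = re.compile(r"^(\d{4})(?:-(\d{1,2}))?(?:-(\d{1,2}))?$")


def parse_raw(raw):
    """Return a date-parts style list, or None if raw is not understood."""
    match = _ISO_LIKE.match(raw.strip())
    if match is None:
        return None
    return [[int(part) for part in match.groups() if part is not None]]


def resolve_date(date):
    """Prefer explicit date-parts; parse raw only when they are absent."""
    parts = date.get("date-parts")
    if parts:
        return parts
    raw = date.get("raw")
    if raw is not None:
        return parse_raw(raw)
    return None
```

With this ordering, an item that carries both keys is rendered from its date-parts and the raw string is never consulted.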

Preferring the parsed date there must be right. This raises another question, though. The raw date key currently figures in 12 fixtures in test-suite, and it looks to be another candidate for housecleaning.

On a quick look, two of the tests are specifically meant to exercise raw date parsing, which is a citeproc-js local thing. One of the others (date_String) should probably be doctored to test only rendering of a dumb string via raw. The remaining 9 tests just use raw for date input because I was being lazy, and can be recast to use pre-parsed dates without losing anything.

Shall I make those changes in test-suite?

Fine by me, of course. I really think this is a citeproc-js issue, not a CSL issue at all: a processor could totally refuse to parse raw dates and still be compliant, as I understand it. I found them helpful, as it happens, because I need to parse raw dates anyway (and I rather think that any processor is going to want to do something along those lines). So, thank you for having them there; but if I can give myself a pass on one more test, that gets me to the nearly magic number of 489 passing tests.

I’m starting to toy with the idea of somehow bundling Sylvester Keil’s EDTF parser with citeproc-js. There will be details to sort out, and I probably won’t touch it until the summer, but it would be good to at least open a pathway to standardized date handling. Would EDTF parsing be an option in your project?

Yes. I’m targeting LaTeX, which means I should really accept dates in any format biblatex accepts, which now includes EDTF. Unfortunately, I’m necessarily also targeting it via Lua, which means … I think I’d have to roll my own anyway (also because I need to keep things self-contained as far as possible, though I’m using libraries for JSON – LuaTeX has its own XML and unicode libraries).

Given the target, I have to parse dates which are not already date-parted. In principle, it seems to me, parsing should be left to some front end (i.e. occur as part of taking raw data and producing JSON), and the processor should handle date-parts and literals alone. FWIW, when I put it together, that’s what I’ll probably do: parse the dates as part of parsing the bibtex, before it ever hits the processor as such. Still, I have to parse it somewhere. The same goes for things like names: obviously I must parse names entered in the standard bibtex convention into their various parts, but I don’t intend to accept JSON input like “family”: “von German” and split that. I’ll likely take the same approach to dates.

ETA: that said, I’m not inclined to be too generous. So long as one does parse dates in a format which is sufficiently expressive to handle anything that a user might require, I am rather unfazed at the idea that someone, somewhere will come up with a format which won’t parse, because the user can always recast it into a form that does. This is a perpetual problem with bibtex, where there is a lot of “bad stuff” (i.e. sloppy bibtex) out in the wild.

That makes sense. If I move toward EDTF support in citeproc-js it will be introduced as an option for pre-processing.

I’ve pushed changes to eliminate the expectation of raw string parsing from the test-suite fixtures. Implementations that parse raw can supplement with local tests for their specific behavior.

I was going to write an EDTF parser in Rust. I also think there are zero objections, which is great. A couple of obvious things that may have been mentioned before:

  1. I don’t think the citeproc-js “raw” input is a perfect subset of EDTF. Prove me wrong but I think that’s the reason it has to go in an “edtf” key in date-parts (and raw should be preserved). Seasons, etc, have more than one notation in raw, so you’d have to modify an edtf parser; so don’t.
  2. Only a subset of edtf can actually be represented in CSL and output. For example, we can’t process time (yet). I would probably build any new features to support more edtf inputs into the same feature flag. But it’s important to standardise around full parsing but only some rendering, rather than parsing a subset of EDTF.

I’m not sure about adding an additional category as well as “raw”. As I understand it, CSL is agnostic as to how dates are presented, simply requiring a processor to render day, month (with season as a substitute if not rendering day) and year.

As things stand the input format is not really specified. It happens (if I understand it right) that citeproc-js envisages an array form (date-parts), a raw form (raw), a literal form (literal), and also season and circa flags. But none of that is required by CSL, which is agnostic as to data input format. (In fact I’m not sure that the treatment of literal dates and fallback behaviour with raw dates, sensible as it is, is even sanctioned …)

Can one not say that raw effectively means “This thing here is a date which I understand: see if you can make sense of it. If you can, treat it as a date”? At that point it seems to me that it’s open to any processor to decide whether it will handle raw, and if so how generously or not. It could have a fantastically indulgent parser which would understand, say, AD XIII Kal Mai MMDCCLXXI AUC if it wanted, and produce [ [2019, 4, 19] ] … or not.

In which case, there’s no particular reason to have a separate field for EDTF, any more than one needs a separate field for, say, “11 August 2018” or “August 11, 2018”. All one needs is an offer to recognize certain formats, and an order of priority for parsing ambiguous formats like “4/4/2019”. I’m not sure one even needs to modify an EDTF parser, does one? Can’t one just say “If this parses as valid EDTF, then it will be treated as EDTF”, if it doesn’t, it may (or may not) parse in some other way. After all parsing a date is not usually an expensive thing, so one can fail and fail repeatedly. The only thing one needs is a clear set of rules about what preference will be given to ambiguous forms.
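A rough Python sketch of that “try EDTF first, then fall through an explicit priority list” idea follows. Everything here is an assumption made for illustration: the function names are hypothetical, the EDTF check covers only the plain YYYY, YYYY-MM and YYYY-MM-DD forms and not the full specification, and the month names rely on strptime running under an English locale. The point is only that the preference between ambiguous forms is fixed by the order of the list.

```python
import re
from datetime import datetime


def try_edtf_subset(text):
    # Assumption: only YYYY, YYYY-MM and YYYY-MM-DD, not full EDTF.
    match = re.fullmatch(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?", text)
    if match is None:
        return None
    parts = [int(g) for g in match.groups() if g is not None]
    try:
        # Pad with 1s so that datetime can validate month and day ranges.
        datetime(*(parts + [1, 1])[:3])
    except ValueError:
        return None
    return parts


def make_strptime_parser(fmt):
    def parse(text):
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            return None
        return [parsed.year, parsed.month, parsed.day]
    return parse


# Order is the rule of preference: earlier entries win for ambiguous input.
PARSERS = [
    ("edtf", try_edtf_subset),
    ("day month year", make_strptime_parser("%d %B %Y")),
    ("month day, year", make_strptime_parser("%B %d, %Y")),
    ("day/month/year", make_strptime_parser("%d/%m/%Y")),
    ("month/day/year", make_strptime_parser("%m/%d/%Y")),
]


def parse_date(text):
    """Return (format name, [year, month, day...]) or None if nothing matches."""
    text = text.strip()
    for name, parser in PARSERS:
        parts = parser(text)
        if parts is not None:
            return name, parts
    return None
```

Under this ordering “3/4/2019” is read day-first, while “4/13/2019” can only be month-first and falls through to the last parser.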

Nothing to do with CSL. This is CSL-JSON, which is a data format, and actually aims to enable unambiguous input of dates and other things. It is a bad idea now that we finally have a definite format for dates to ruin it with “but also you can input basically anything and we’ll give it a go” because then none of the upstream tools like Zotero (or test suites, or any of the other zillion CSL targeters) will know if they’re messing up the rigid date format. This is valuable because nobody is going to notice a date being slightly off in the output, and when your EDTF misrepresents a time zone that might cross date lines etc, saying “we’ll give it a go as the old raw format” means messing up the output and more manual scrutiny.

The correct place to do “it’s not quite edtf” is in Zotero’s CSL-JSON exporter, where it should fall back to raw or literal.

Basically, please just give us this clean slate. Please!

Just linking in this more recent discussion.

https://discourse.citationstyles.org/t/edtf-and-date-representation/1663/13

My preference would be to revise the current date model to be a perfect representation of EDTF (which we just added as an option), and remove raw and literal, perhaps with open optional properties for apps to put that stuff if they want. But I may be missing some nuance on that last part.

The linked thread shows how we could do that; seems like minor changes.

We could then deprecate and remove it in time without pain; or keep it if it made sense.

I can see an argument for retaining literal, because a user might have something that they want output in some peculiar form, and literal unambiguously says: “treat this as a string, and just pass it through unprocessed: I will take responsibility for the result”. Raw, I agree, should go, because it offers a promise that a processor will parse the input, which is ultimately a dangerous promise to make: since the input format is ill-defined, so is the result.

I agree that raw should go (it is essentially filling the same role as the new string/edtf field). In terms of compatibility with existing data, I’d say that processors should be instructed to treat raw, if encountered, as though it were entered in string/edtf format.

literal should stay. There are various user use cases for it. For example, it’s common for users to enter things like “in preparation” or “in press” into the Date field in applications (even if better practice would be to use status).

The downside of keeping literal is it maintains an incompatibility between it and EDTF, which has no such support.

Would be nice if we could have a good example to include in an annotation. I don’t consider “in press” to be such an example.

But maybe we can work on addressing that so it’s unneeded longer term. Can always drop it later?

I think that a reasonable instruction is: parse the string field as EDTF. If that fails, render the string literally. literal just instructs to bypass that parsing step.

Why exactly do you find “in press” to not be a good example? It is ridiculously common, as is “in prep” or “in preparation”.

Another example might be something like “Han dynasty” or a date formatted in a calendar system that isn’t formally supported by CSL, such as Japanese Imperial system (common in Frank’s legal citation work) or the Arabic calendar.

Because it’s a status; not a date.

The annotations and documentation need to state the intended best practice, for clarity.

I’ll include the Han Dynasty example; thanks.

But would that fit for citation data?

A historian might cite that for an artifact for which that’s the only known date information.

I think that a reasonable instruction is: parse the string field as EDTF. If that fails, render the string literally. literal just instructs to bypass that parsing step.

Please, no. If a user enters what s/he thinks is valid EDTF, and it isn’t, the right response is an error or warning, not a silent rendition of the invalid EDTF as if it were a string. Much better to keep a separate literal field: that way one can distinguish clearly between “an attempt to enter an EDTF value which was invalid” and “a decision to enter a string rather than EDTF”.
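To make that distinction concrete, here is a minimal Python sketch under stated assumptions: the keys edtf and literal are taken from this discussion and are not a settled schema, classify_date and InvalidDateWarning are hypothetical names, and the validator accepts only YYYY, YYYY-MM and YYYY-MM-DD, not full EDTF. A literal is passed through untouched, while an invalid edtf value produces a warning and is never quietly rendered as a string.

```python
import re
import warnings

# Assumption: a small EDTF subset (YYYY, YYYY-MM, YYYY-MM-DD) for illustration.
_EDTF_SUBSET = re.compile(
    r"\d{4}(?:-(?:0[1-9]|1[0-2])(?:-(?:0[1-9]|[12]\d|3[01]))?)?"
)


class InvalidDateWarning(UserWarning):
    """Hypothetical warning category for dates the processor rejects."""


def classify_date(date):
    """Return ("literal", s) or ("edtf", s); warn and return None otherwise."""
    if "literal" in date:
        # Explicit opt-out of parsing: the user takes responsibility.
        return ("literal", date["literal"])
    value = date.get("edtf")
    if value is not None and _EDTF_SUBSET.fullmatch(value):
        return ("edtf", value)
    # Invalid or missing EDTF is reported, never silently shown as text.
    warnings.warn(f"invalid EDTF date: {value!r}", InvalidDateWarning)
    return None
```

So {"literal": "Han dynasty"} passes straight through, whereas {"edtf": "11 August 2018"} triggers the warning and yields nothing to render.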

That’s fair. Determining whether something is valid EDTF, and so whether to place it in edtf or literal, should be done by the client application serving the data, rather than by the citation processor.

Yes.

In general, as a matter of “clean design”, processors should (a) be given very clear indications of what they can expect and must handle and (b) be free to reject and warn about invalid input. Client applications can do whatever they like to try to “fix” bad input, but unless the specification is clear about it, processors need not.

Historically, to be fair, this hasn’t always happened. Citeproc-JS does all sorts of things that the spec does not require, which enables it to handle a variety of malformed input gracefully. That’s fine, but from a specification point of view it’s “undefined behaviour”, not “required behaviour”. (I’d make the case that it’s better for the user to reject dubious input with a clear warning rather than to “do what you think the user may have meant”, but that’s really a design preference. The important thing from a specification point of view is that what a processor must do is clearly laid out.)

One way in which the next version of the spec would benefit from a polish is in working out which of those things should be mandatory and which not. To my taste, it’s reasonable and right to include in the spec bits of behaviour whose absence will lead a significant number of styles to break. But it is really better to leave unspecified things which at best enable Citeproc-JS to handle malformed data input without complaint. So, e.g., punctuation and whitespace cleanup should be in spec, but “raw date parsing” shouldn’t.

(There are some tricky areas like “how do we decide what counts as an initial?” which I think are connected to user input, and should be in spec. In an ideal world, this would be explicit in the input too, and not left for parsing in the processor, but there are good reasons why that isn’t so, in which case the spec should be crystal clear about how a given name will be turned into initials.)
