Citation Style Language

Raw dates vs date parts

Could someone explain the workings of date_RawSeasonRange1.txt?

testscripts/date_RawSeasonRange1.txt FAILED
-------- EXPECTED --------
Spring 1999–Summer 2001
----------- GOT -----------
June 1, 1965

We have both a raw date and date-parts, which are incompatible. The CSL calls for

<date variable="issued" date-parts="year-month-day" form="text"/>

The input data contains everything needed to construct exactly what is asked for, but the test result calls for the raw form to be preferred, ignoring the date-parts. I suppose this is not, strictly speaking, a CSL issue, but I had assumed that the sane thing to do was to use date-parts if they are available, so I’d coded this to use them and to parse the raw date only as a fallback. Is that wrong?
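For concreteness, the "date-parts first, raw only as a fallback" order described above might look roughly like this. It is a minimal sketch, not what any existing processor does: `resolve_date` and `parse_raw` are hypothetical names, and the stand-in raw parser only understands `YYYY`, `YYYY-MM` and `YYYY-MM-DD`.

```python
import re
from typing import List, Optional

DateParts = List[List[int]]

# Stand-in pattern (assumption): year, optional month, optional day.
_ISO_LIKE = re.compile(r"(\d{4})(?:-(0[1-9]|1[0-2]))?(?:-(0[1-9]|[12]\d|3[01]))?")


def parse_raw(raw: str) -> Optional[DateParts]:
    """Hypothetical raw parser; returns None for anything it cannot read."""
    match = _ISO_LIKE.fullmatch(raw.strip())
    if match is None:
        return None
    return [[int(group) for group in match.groups() if group is not None]]


def resolve_date(date: dict) -> Optional[DateParts]:
    """Prefer pre-parsed date-parts; consult the raw string only if they are absent."""
    parts = date.get("date-parts")
    if parts:
        return parts
    raw = date.get("raw")
    if raw is not None:
        return parse_raw(raw)
    return None
```

With input carrying both keys, as in the failing fixture, the raw string is never consulted and the date-parts win.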

Preferring the parsed date there must be right. This raises another question, though. The raw date key currently figures in 12 fixtures in test-suite, and it looks to be another candidate for housecleaning.

On a quick look, two of the tests are specifically meant to exercise raw date parsing, which is a citeproc-js local thing. One of the others (date_String) should probably be doctored to test only rendering of a dumb string via raw. The remaining 9 tests just use raw for date input because I was being lazy, and can be recast to use pre-parsed dates without losing anything.

Shall I make those changes in test-suite?

Fine by me, of course. I really think this is a citeproc-js issue, not a CSL issue at all: a processor could refuse to parse raw dates entirely and still be compliant, as I understand it. I found them helpful, as it happens, because I need to parse raw dates anyway (and I rather think that any processor is going to want to do something along those lines). So, thank you for having them there; but if I can give myself a pass on one more test, that gets me to the nearly magic number of 489 passing tests.

I’m starting to toy with the idea of somehow bundling Sylvester Keil’s EDTF parser with citeproc-js. There will be details to sort out, and I probably won’t touch it until the summer, but it would be good to at least open a pathway to standardized date handling. Would EDTF parsing be an option in your project?

Yes. I’m targeting LaTeX, which means I should really accept dates in any format biblatex accepts, which now includes EDTF. Unfortunately, I’m necessarily also targeting it via Lua, which means … I think I’d have to roll my own anyway (also because I need to keep things self-contained as far as possible, though I’m using libraries for JSON – LuaTeX has its own XML and unicode libraries).

Given the target, I have to parse dates which are not already date-parted. In principle, it seems to me, parsing should be left to some front end (i.e. occur as part of taking raw data and producing JSON), and the processor should handle date-parts and literals alone. FWIW, when I put it together, that’s what I’ll probably do: parse the dates as part of parsing the bibtex, before it ever hits the processor as such. Still, I have to parse it somewhere. The same goes for things like names: obviously I must parse names entered in the standard bibtex convention into their various parts, but I don’t intend to accept JSON input like “family”: “von German” and split that. I’ll likely take the same approach to dates.

ETA: that said, I’m not inclined to be too generous. So long as one does parse dates in a format which is sufficiently expressive to handle anything that a user might require, I am rather unfazed at the idea that someone, somewhere will come up with a format which won’t parse, because the user can always recast it into a form that does. This is a perpetual problem with bibtex, where there is a lot of “bad stuff” (i.e. sloppy bibtex) out in the wild.

That makes sense. If I move toward EDTF support in citeproc-js it will be introduced as an option for pre-processing.

I’ve pushed changes to eliminate the expectation of raw string parsing from the test-suite fixtures. Implementations that parse raw can supplement with local tests for their specific behavior.

I was going to write an EDTF parser in Rust. I also think there are zero objections, which is great. A couple of obvious things that may have been mentioned before:

  1. I don’t think the citeproc-js “raw” input is a perfect subset of EDTF. Prove me wrong, but I think that’s the reason it has to go in an “edtf” key in date-parts (and raw should be preserved). Seasons, etc., have more than one notation in raw, so you’d have to modify an EDTF parser to accept them; so don’t.
  2. Only a subset of EDTF can actually be represented in CSL and output. For example, we can’t process time (yet). I would probably build any new features to support more EDTF inputs into the same feature flag. But it’s important to standardise around full parsing but only some rendering, rather than parsing a subset of EDTF.
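To illustrate the second point ("full parsing but only some rendering"), here is a small sketch in Python. It is not an EDTF parser: the standard library's `datetime.fromisoformat` stands in for one, and `ParsedDate`, `parse_datetime` and `to_date_parts` are hypothetical names. The idea is only that the time component is parsed and retained even though nothing renders it yet.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import List, Optional


@dataclass
class ParsedDate:
    year: int
    month: int
    day: int
    # Parsed and kept, but not rendered, since CSL cannot output times (yet).
    time_of_day: Optional[time] = None


def parse_datetime(text: str) -> ParsedDate:
    """Parse an ISO 8601 date or date-time (a stand-in for a full EDTF parser)."""
    dt = datetime.fromisoformat(text)
    has_time = "T" in text
    return ParsedDate(dt.year, dt.month, dt.day, dt.time() if has_time else None)


def to_date_parts(date: ParsedDate) -> List[List[int]]:
    """Hand over only what can currently be rendered: year, month, day."""
    return [[date.year, date.month, date.day]]
```

Because the time survives parsing, support for rendering it could be added later without changing what the parser accepts.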

I’m not sure about adding an additional category as well as “raw”. As I understand it, CSL is agnostic as to how dates are presented, simply requiring a processor to render day, month (with season as a substitute if not rendering day) and year.

As things stand the input format is not really specified. It happens (if I understand it right) that citeproc-js envisages an array form (date-parts), a raw form (raw), a literal form (literal), and also season and circa flags. But none of that is required by CSL, which is agnostic as to data input format. (In fact I’m not sure that the treatment of literal dates and fallback behaviour with raw dates, sensible as it is, is even sanctioned …)
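For anyone following along, the input forms just listed look roughly like this, written as Python dicts that serialise to JSON. The values are made up for illustration, and the reading of season 1 as spring is an assumption on my part.

```python
import json

# Pre-parsed array form: one inner list per date, [year, month, day].
date_parts_form = {"date-parts": [[1965, 6, 1]]}

# Raw form: a string the processor may or may not choose to parse.
raw_form = {"raw": "June 1, 1965"}

# Literal form: a string to be printed as-is, never interpreted.
literal_form = {"literal": "sometime after the war"}

# Flags that sit alongside date-parts (season 1 taken to mean spring here).
season_form = {"date-parts": [[1999]], "season": 1}
circa_form = {"date-parts": [[1965]], "circa": True}

# A hypothetical item using the array form for its "issued" variable.
item = {"id": "example", "type": "article-journal", "issued": date_parts_form}
print(json.dumps(item))
```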

Can one not say that raw effectively means “This thing here is a date which I understand: see if you can make sense of it. If you can, treat it as a date”? At that point it seems to me that it’s open to any processor to decide whether it will handle raw, and if so how generously or not. It could have a fantastically indulgent parser which would understand, say, AD XIII Kal Mai MMDCCLXXI AUC if it wanted, and produce [ [2019, 4, 19] ] … or not.

In which case, there’s no particular reason to have a separate field for EDTF, any more than one needs a separate field for, say, “11 August 2018” or “August 11, 2018”. All one needs is an offer to recognize certain formats, and an order of priority for parsing ambiguous formats like “4/4/2019”. I’m not sure one even needs to modify an EDTF parser, does one? Can’t one just say “If this parses as valid EDTF, then it will be treated as EDTF”? If it doesn’t, it may (or may not) parse in some other way. After all, parsing a date is not usually an expensive thing, so one can fail and fail repeatedly. The only thing one needs is a clear set of rules about what preference will be given to ambiguous forms.
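A rough sketch of that "try each format in a fixed order" idea, in Python. Everything here is an assumption for illustration: `parse_edtf_like` is a stand-in that accepts only `YYYY`, `YYYY-MM` and `YYYY-MM-DD` rather than real EDTF, the day-first preference for slash dates is an arbitrary choice, and the month names rely on the default English locale of `strptime`.

```python
import re
from datetime import datetime
from typing import Callable, List, Optional

DateParts = List[List[int]]
Parser = Callable[[str], Optional[DateParts]]

_EDTF_LIKE = re.compile(r"(\d{4})(?:-(0[1-9]|1[0-2]))?(?:-(0[1-9]|[12]\d|3[01]))?")


def parse_edtf_like(text: str) -> Optional[DateParts]:
    """Stand-in for a strict EDTF parser (tiny subset only)."""
    match = _EDTF_LIKE.fullmatch(text)
    if match is None:
        return None
    return [[int(group) for group in match.groups() if group is not None]]


def parse_with_format(fmt: str) -> Parser:
    """Build a parser for one strptime format; it returns None on failure."""
    def parser(text: str) -> Optional[DateParts]:
        try:
            dt = datetime.strptime(text, fmt)
        except ValueError:
            return None
        return [[dt.year, dt.month, dt.day]]
    return parser


# The order of this list IS the preference rule for ambiguous input.
PARSERS: List[Parser] = [
    parse_edtf_like,
    parse_with_format("%d %B %Y"),   # 11 August 2018
    parse_with_format("%B %d, %Y"),  # August 11, 2018
    parse_with_format("%d/%m/%Y"),   # day-first wins for 3/4/2019 (assumed)
    parse_with_format("%m/%d/%Y"),   # month-first only if day-first fails
]


def parse_any(text: str) -> Optional[DateParts]:
    """Return the result of the first parser that succeeds, else None."""
    candidate = text.strip()
    for parser in PARSERS:
        parts = parser(candidate)
        if parts is not None:
            return parts
    return None
```

Here "3/4/2019" comes out as 3 April because the day-first parser is tried first, while "4/13/2019" can only succeed month-first.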

Nothing to do with CSL. This is CSL-JSON, which is a data format, and actually aims to enable unambiguous input of dates and other things. It is a bad idea now that we finally have a definite format for dates to ruin it with “but also you can input basically anything and we’ll give it a go” because then none of the upstream tools like Zotero (or test suites, or any of the other zillion CSL targeters) will know if they’re messing up the rigid date format. This is valuable because nobody is going to notice a date being slightly off in the output, and when your EDTF misrepresents a time zone that might cross date lines etc, saying “we’ll give it a go as the old raw format” means messing up the output and more manual scrutiny.

The correct place to do “it’s not quite EDTF” is in Zotero’s CSL-JSON exporter, where it should fall back to raw or literal.

Basically, please just give us this clean slate. Please!