how to code a CSL processor

So followup on citeproc-py question, just thought I’d ask Frank and
Andrea in particular:

Having learned what you’ve learned, what’s the best sequence in which
to code the CSL functionality?

For example, disambiguation is one of the more complex ones; when
should one tackle this? Sorting? Substitution?

Bruce

My suggestion is to put together the very very basic elements (terms
and macros, the basics of dates and names, conditionals and the text
element).

Then implement Frank’s testsuite. This is going to save you a HUGE
amount of time when digging into the details of the schema and the
specification.

Substitution and sorting are not tough. Disambiguation and collapsing
were the most challenging for me, but that depends on the overall
design of the processor. I’d leave their implementation for a later
stage of development.

I’d suggest following citeproc-js for representing the input data
(citations and references). Keep an eye on the citeproc-js
documentation. The code can also be useful as a last resort… :wink:
I’m just kidding: I entirely lost my familiarity with javascript in my
years of Haskell coding, so I use citeproc-js code for the small
details and not for the overall design.

You could also have a look at the Haskell implementation. The problem
with my code is that some Haskell literacy is needed, since I used
some nice tricks which keep the number of lines of code down
(citeproc-hs is under 3200 lines now), but at the price of some
obfuscation.

Andrea

I certainly made my share of mistakes along the way, so take anything
I say with a grain of salt. But step one would probably be to get the
basic rendering nodes working, followed by cs:group, followed by the
conditional nodes. From there, I guess you’d move up the food chain,
to progressively higher levels of
everything-related-to-everything-else-ness.
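
To make that order a bit more concrete, here is a minimal Python sketch of the first three steps (basic rendering nodes, then cs:group, then the conditional nodes). It is only an illustration under assumptions: style nodes are plain dicts, an item is a dict of variables, and every name in it (render, RENDERERS, render_group and so on) is hypothetical rather than taken from citeproc-js or citeproc-py. The group rule sketched is the one from the spec: a group is suppressed when it calls variables and all of them are empty.

```python
# Hypothetical sketch: one function per node type, looked up by name.
# Each renderer returns (text, called, filled): the output plus how many
# variables were called and how many of those were non-empty.

def render_text(node, item):
    if "value" in node:                      # literal text, no variable
        return node["value"], 0, 0
    value = str(item.get(node["variable"], ""))
    return value, 1, 1 if value else 0

def render_children(children, item, delimiter=""):
    parts, called, filled = [], 0, 0
    for child in children:
        text, c, f = render(child, item)
        called += c
        filled += f
        if text:
            parts.append(text)
    return delimiter.join(parts), called, filled

def render_group(node, item):
    text, called, filled = render_children(
        node["children"], item, node.get("delimiter", ""))
    # Suppress the group if it called variables and all were empty.
    if called and not filled:
        return "", called, filled
    return text, called, filled

def render_choose(node, item):
    # Only a 'type' test is sketched; a branch without one acts as else.
    for branch in node["branches"]:
        wanted = branch.get("type")
        if wanted is None or item.get("type") == wanted:
            return render_children(branch["children"], item)
    return "", 0, 0

RENDERERS = {"text": render_text, "group": render_group,
             "choose": render_choose}

def render(node, item):
    text, called, filled = RENDERERS[node["name"]](node, item)
    if text:                                 # affixes only on real output
        text = node.get("prefix", "") + text + node.get("suffix", "")
    return text, called, filled
```

Adding a node type then just means writing one more function and registering it in the table, which is what lets names, dates and the rest be bolted on later.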

Maybe cs:substitute would be next, since it’s pretty well isolated in
the cs:names node (the only additional worry is the need to identify
and suppress subsequent use of substituted vars). Most of the sorting
stuff can be isolated from other code (with the exception of dates
embedded in macros, which are … interesting).
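
For what it's worth, the "identify and suppress" part of substitution can be sketched in a few lines of Python. This is a rough sketch only, not how citeproc-js does it: it assumes an item is a dict, that name variables hold lists of dicts with a "family" key, and that a per-cite set (called suppressed here) is passed around. All function names are hypothetical.

```python
# Hypothetical sketch of cs:substitute inside a names node.
# 'suppressed' is a per-cite set shared by the whole rendering pass.

def format_names(names):
    return ", ".join(n["family"] for n in names)

def render_names(variables, substitutes, item, suppressed):
    """Render the first non-empty name variable, else try substitutes."""
    for var in variables:
        if item.get(var) and var not in suppressed:
            return format_names(item[var])
    for var in substitutes:
        if item.get(var) and var not in suppressed:
            # Remember the substituted variable so later nodes skip it.
            suppressed.add(var)
            value = item[var]
            if isinstance(value, list):
                return format_names(value)
            return str(value)
    return ""

def render_variable(var, item, suppressed):
    """Render a plain variable unless a substitution already used it."""
    if var in suppressed:
        return ""
    return str(item.get(var, ""))
```

The point of the shared set is that the names node never needs to know where else the substituted variable is used; every other renderer just checks the set before producing output.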

Next maybe is setting up a mechanism for macros.
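
A macro mechanism can start out very small. The Python sketch below is an assumption-laden illustration, not anyone's actual design: macros are kept in a dict keyed by name, nodes are dicts with exactly one of "value", "variable" or "macro", and the function name expand is hypothetical. The only subtlety it covers is guarding against a macro that ends up calling itself.

```python
# Hypothetical sketch: macros stored by name, expanded on demand.

def expand(node, macros, item, active=()):
    if "value" in node:                      # literal text
        return node["value"]
    if "variable" in node:                   # plain variable lookup
        return str(item.get(node["variable"], ""))
    name = node["macro"]
    if name in active:                       # guard against macro cycles
        raise ValueError(f"macro {name!r} calls itself")
    return "".join(expand(child, macros, item, active + (name,))
                   for child in macros[name])
```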

Disambiguation can be broken down into several sub-parts. Names-based
disambiguation (what Zotero currently does) is relatively
straightforward. Not sure how much use it would be as a reference,
but the citeproc-js code for this is all contained in
disambig_names.js, in about 230 lines of code. The
disambiguate="true" condition and per-cite disambiguation with
progressive name transformations are just very hard. The citeproc-js
code ended up as a kind of pastiche of finger-painting. I’m confident
that it works, but only because it’s backed up by tests.
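
To show why the names-based part is the tractable one, here is a minimal Python sketch of the general idea: render every name in its shortest form and expand only the ones that collide, first to initials and then to the full given name. It is not the algorithm in disambig_names.js or in Zotero. It assumes the input is a list of distinct persons as dicts with "family" and "given" keys, and the function names are hypothetical.

```python
from collections import defaultdict

def name_forms(name):
    """Progressively more explicit forms of one name."""
    family, given = name["family"], name.get("given", "")
    forms = [family]
    if given:
        initials = " ".join(part[0] + "." for part in given.split())
        forms.append(f"{initials} {family}")
        forms.append(f"{given} {family}")
    return forms

def disambiguate_names(names):
    """Return one rendered form per name, expanding only on collisions."""
    all_forms = [name_forms(n) for n in names]
    levels = [0] * len(names)
    while True:
        rendered = [forms[lv] for forms, lv in zip(all_forms, levels)]
        clashes = defaultdict(list)
        for index, text in enumerate(rendered):
            clashes[text].append(index)
        changed = False
        for indexes in clashes.values():
            if len(indexes) < 2:
                continue                     # already unambiguous
            for i in indexes:
                if levels[i] < len(all_forms[i]) - 1:
                    levels[i] += 1           # try the next, longer form
                    changed = True
        if not changed:
            return rendered
```

Names that never collide stay in their short form, and the loop stops as soon as nothing can be expanded further, so names that remain identical even in full are simply left ambiguous. None of the hard cases above (the disambiguate="true" condition, per-cite transformations) are touched by this.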

The most stubbornly integrated things are cite collapsing,
internationalized dates, dates embedded in macros, and everything
related to position awareness and backreferencing. Each of those cuts
across several parts of the system, and can lead to significant
rewriting or extension of other code.

So followup on citeproc-py question, just thought I’d ask Frank and
Andrea in particular:

Having learned what you’ve learned, what’s the best sequence in which
to code the CSL functionality?

For example, disambiguation is one of the more complex ones; when
should one tackle this? Sorting? Substitution?

My suggestion is to put together the very very basic elements (terms
and macros, the basics of dates and names, conditionals and the text
element).

Yeah, that’s what I’ve done, along with cs:substitute (well, and date
isn’t done). I’m basically just creating different functions to handle
the different node types.

Then implement Frank’s testsuite. This is going to save you a HUGE
amount of time when digging into the details of the schema and the
specification.

I’m just worried about this one given my previously expressed
concerns. E.g. my internal model is an object representation of
enhanced HTML, and so bears no relation (except ultimately in viewed
output rendering) to the output in the test suite. So that means I
have to convert my HTML to the stuff used in the test suite?
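
If it does come to that, the conversion might not need to be much more than a small serializer over the object model. The Python sketch below is only a guess at the shape of it: the node layout (a str, or a dict with "style" and "children") is a hypothetical stand-in for my internal model, and the tag mapping in ASSUMED_TAGS is an assumption that would have to be matched against what the test fixtures actually contain.

```python
# Hypothetical internal model: a node is either a str or a dict with
# "style" (a key of the tag map, or None) and "children".

ASSUMED_TAGS = {"italic": ("<i>", "</i>"), "bold": ("<b>", "</b>")}

def serialize(node, tags=ASSUMED_TAGS):
    """Flatten the object tree into one output string."""
    if isinstance(node, str):
        return node
    opening, closing = tags.get(node.get("style"), ("", ""))
    inner = "".join(serialize(child, tags) for child in node["children"])
    return opening + inner + closing

def check(tree, expected, tags=ASSUMED_TAGS):
    """Compare serialized output with a fixture's expected string."""
    actual = serialize(tree, tags)
    return actual == expected, actual
```

Keeping the tag map as a parameter would at least mean the same tree could be flattened one way for the tests and another way for real output.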

Substitution and sorting are not tough. Disambiguation and collapsing
were the most challenging for me, but that depends on the overall
design of the processor. I’d leave their implementation for a later
stage of development.

I’d suggest following citeproc-js for representing the input data
(citations and references). Keep an eye on the citeproc-js
documentation. The code can also be useful as a last resort… :wink:

I had a bit of a hard time reading it myself, last I checked.

I’m just kidding: I entirely lost my familiarity with javascript in my
years of Haskell coding, so I use citeproc-js code for the small
details and not for the overall design.

You could also have a look at the Haskell implementation. The problem
with my code is that some Haskell literacy is needed, since I used
some nice tricks which keep the number of lines of code down
(citeproc-hs is under 3200 lines now), but at the price of some
obfuscation.

I’ve periodically looked at your code, as I have an interest in
Haskell in general. But I’m guessing not everyone is so inclined. :wink:

Was mostly just starting this thread to see if it suggests any ways
forward for Petr.

Bruce