Towards a simpler and extensible CSL 2.0; or What can we learn from citeproc (hs) and djot?

Background

At a high level CSL is an XML DSL that sets some context-dependent parameters and provides templates for inline and block formatting of lists (citations and bibliographies respectively).

But it has no method for extension, so (almost) any change in behavior requires changes in the XML model, and often, by extension, the styles, AND the input schema.

Example: we want to add support for independent formatting of main and subtitles. In the CSL v1.1 branch, implementing that required new, incompatible, XML nodes, and changes in the input format (or new CSL attributes to specify parsing rules for string titles).

Given the diversity of software implementations and thousands of styles, the lack of extensibility enforces a large degree of inertia.

Might there be another way?

CiteProc (hs) and djot: what can we learn?

Citeproc (hs) includes an optional CLI that offers a JSON server that operates via stdin, and returns an AST that looks like this:

[
  "———. 1983b. “The Concept of Truth in Formalized Languages.” In ",
  {
    "format": "italics",
    "contents": [
      "Logic, Semantics, Metamathematics"
    ]
  },
  ", edited by John Corcoran, 152–278. Indianapolis: Hackett."
]
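
To make the appeal of that output concrete, here is a minimal sketch of how a consumer might turn such an AST into HTML. The `Inline` and `Formatted` types and the `toHtml` function are hypothetical names inferred only from the sample above; they are not part of citeproc, and only the "italics" format shown in the sample is handled.

```typescript
// Types inferred only from the sample output above (an assumption, not
// citeproc's documented schema).
type Formatted = { format: string; contents: Inline[] };
type Inline = string | Formatted;

// Render to HTML. Only "italics" is handled; any other format name is
// passed through unwrapped rather than guessed at. No HTML escaping is done.
function toHtml(nodes: Inline[]): string {
  return nodes
    .map((n) => {
      if (typeof n === "string") return n;
      const inner = toHtml(n.contents);
      return n.format === "italics" ? `<i>${inner}</i>` : inner;
    })
    .join("");
}

const entry: Inline[] = [
  "In ",
  { format: "italics", contents: ["Logic, Semantics, Metamathematics"] },
  ", edited by John Corcoran.",
];
const entryHtml = toHtml(entry);
console.log(entryHtml);
```

The point is that the consumer needs no knowledge of CSL at all: a few lines of recursion over plain JSON are enough.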

Djot is a new, better Markdown, designed from the beginning to be extensible. Its JavaScript reference implementation also has a JSON AST, and filters that can transform it. Implementations in other languages already have, or will have, their own compatible JSON ASTs.

    {
      "tag": "para",
      "children": [
        {
          "tag": "str",
          "text": "A title with "
        },
        {
          "tag": "emph",
          "children": [
            {
              "tag": "str",
              "text": "emphasis"
            }
          ]
        },
        {
          "tag": "str",
          "text": " and "
        },
        {
          "tag": "strong",
          "children": [
            {
              "tag": "str",
              "text": "strong"
            }
          ]
        }
      ]
    }
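
As a rough illustration of what "filters that can transform it" means for a tree of this shape, here is a small sketch. It is not djot.js's actual filter API; `DjotNode`, `mapNodes`, and `plainText` are hypothetical names, and the node shape is assumed from the sample above.

```typescript
// Node shape assumed from the sample above: leaves carry `text`,
// containers carry `children`.
interface DjotNode {
  tag: string;
  text?: string;
  children?: DjotNode[];
}

// Rebuild the tree bottom-up, applying `f` to every node.
function mapNodes(node: DjotNode, f: (n: DjotNode) => DjotNode): DjotNode {
  const children = node.children?.map((c) => mapNodes(c, f));
  return f(children ? { ...node, children } : node);
}

// Concatenate all leaf text, ignoring formatting.
function plainText(node: DjotNode): string {
  return node.text ?? (node.children ?? []).map(plainText).join("");
}

const para: DjotNode = {
  tag: "para",
  children: [
    { tag: "str", text: "A title with " },
    { tag: "emph", children: [{ tag: "str", text: "emphasis" }] },
    { tag: "str", text: " and " },
    { tag: "strong", children: [{ tag: "str", text: "strong" }] },
  ],
};

// A "filter" that turns every emph node into a strong node.
const swapped = mapNodes(para, (n) => (n.tag === "emph" ? { ...n, tag: "strong" } : n));
console.log(plainText(para));
```

Because every node has the same generic shape, a transformation like this does not need to change when new tags are added.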

While @Denis_Maier and I have discussed some specific possible solutions, given that @John_MacFarlane designed both of these impressive projects, I thought it best to just ask him:

What can CSL learn from these projects?

Of course, if you’re someone other than John, feel free to chime in too!

Here is a high-level sketch of an idea I was contemplating, to provide something more concrete to consider. It’s inspired more than a little by djot.js.

It’s currently just a TypeScript style model, which auto-converts to a JSON schema. Details are in the README, including that help is welcome, particularly from people more comfortable with JavaScript and TypeScript.

Here’s an example of the current state of an APA-like citation, with configuration for both the default parenthetical style (called “non-integral” here) and the in-text/narrative style (“integral”). Besides the model being, I think, simpler for users and developers alike, integral citations are also a feature not supported in CSL 1.0.

{
  placement: "inline",
  sort: [ { key: "author", order: "ascending" }, { key: "year", order: "descending" } ],
  groupBy: [ "author", "year" ],
  groupAffix: "parentheses",
  delimiter: "semi-colon",
  integral: { groupAffixLevel: "secondary", andAs: "symbol", delimiter: "comma" },
  nonIntegral: { groupAffixLevel: "primary" },
  format: [ { template: "author-apa" }, { template: "year" }, { template: "locators-apa" } ]
}
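
To show how a declarative parameter like `sort` could drive behavior without any template logic, here is a minimal sketch. `SortSpec`, `SortRef`, and `sortRefs` are hypothetical names, the flat `author`/`year` strings are a simplification, and plain string comparison stands in for the locale-aware collation a real processor would need.

```typescript
// Mirrors the shape of the `sort` entries in the example above.
type SortSpec = { key: "author" | "year"; order: "ascending" | "descending" };
// Simplified reference: real data would have structured names and dates.
type SortRef = { author: string; year: string };

function sortRefs(refs: SortRef[], specs: SortSpec[]): SortRef[] {
  // Copy first so the input array is left untouched.
  return refs.slice().sort((a, b) => {
    for (const { key, order } of specs) {
      const cmp = a[key] < b[key] ? -1 : a[key] > b[key] ? 1 : 0;
      // The first key that differs decides; later keys only break ties.
      if (cmp !== 0) return order === "ascending" ? cmp : -cmp;
    }
    return 0;
  });
}

const refs: SortRef[] = [
  { author: "Smith", year: "2020" },
  { author: "Doe", year: "2021" },
  { author: "Doe", year: "2023" },
];
const sorted = sortRefs(refs, [
  { key: "author", order: "ascending" },
  { key: "year", order: "descending" },
]);
console.log(sorted.map((r) => `${r.author} ${r.year}`).join("; "));
```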

The commented source is YAML.

The basic idea is a simpler and more flexible template model, with more of the logic moved into parameters. At least in some places, those parameters are extensible: new ones should be able to be added, much like XML attributes, without breaking things.

This is somewhat, BTW, inspired by how biblatex works, but really just takes some of the core design ideas in CSL 1.0, and purifies them.

So in CSL 1.0 we have parameters for some complex logic, like page range formatting, and use lower-level XML template structures for other things. We also have implicit ideas that some things are lists. Here it’s explicit and consistent, with heavier weight on the parameters.

We have thousands of styles that can confirm what those parameters, and their values, should be (for example, for sorting), which should give us confidence that this would work.

The simplified template language, then, leaves the door open for innovations elsewhere (style creation and editing UIs, less weight put on unpaid volunteers to manage styles, etc.).

PS - here I’m explicitly using the language of “grouping”, which is actually what’s happening (and which I made explicit in my original XSLT implementation, BTW). An “author-year” citation or bibliography is just a two-level group; in SQL, for example:

group by Author, Year
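
The same two-level grouping can be sketched outside SQL. This is only an illustration under simplifying assumptions: `GroupRef`, `groupBy`, and `groupByAuthorYear` are hypothetical names, and authors and years are treated as plain strings.

```typescript
type GroupRef = { author: string; year: string; title: string };

// Generic one-level grouping by a string key.
function groupBy<T>(items: T[], keyOf: (item: T) => string): Record<string, T[]> {
  const groups: Record<string, T[]> = {};
  for (const item of items) {
    const k = keyOf(item);
    const existing = groups[k];
    if (existing) existing.push(item);
    else groups[k] = [item];
  }
  return groups;
}

// Two levels: first by author, then by year within each author.
function groupByAuthorYear(refs: GroupRef[]): Record<string, Record<string, GroupRef[]>> {
  const nested: Record<string, Record<string, GroupRef[]>> = {};
  const byAuthor = groupBy(refs, (r) => r.author);
  for (const author of Object.keys(byAuthor)) {
    nested[author] = groupBy(byAuthor[author] ?? [], (r) => r.year);
  }
  return nested;
}

const library: GroupRef[] = [
  { author: "Doe, Jane", year: "2023", title: "A" },
  { author: "Doe, Jane", year: "2023", title: "B" },
  { author: "Doe, Jane", year: "2021", title: "C" },
  { author: "Smith, Sam", year: "2020", title: "D" },
];
const nested = groupByAuthorYear(library);
console.log(JSON.stringify(nested, null, 2));
```

In this sample, the "Doe, Jane" / "2023" group has two members, which is exactly the case where an author-year style needs a year suffix.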

I don’t think I’m able yet to contribute meaningfully to strategy. There are years of background information I haven’t digested. Given that caveat, I have a few impressions I’ll share on the off chance they trigger something or serve as a useful nudge.

While working on a Swift processor, it seemed natural to consolidate formatting indicated by CSL attributes and HTML-like tags within field values into Swift AttributedString values.

The debug output of an AttributedString equivalent to the example Citeproc (hs) AST would be

"———. 1983b. “The Concept of Truth in Formalized Languages.” In " {},
"Logic, Semantics, Metamathematics" { Italics },
", edited by John Corcoran, 152–278. Indianapolis: Hackett." {}

In the Apple ecosystem, these AttributedStrings are core structures that can be easily merged, modified and assigned directly to view components as rich text.

What is the lesson in this? I guess that I also found a more flexible, abstract expression to be an easier working format than trying to model CSL directly. Like the other processors, I expect, I deal with CSL semantics only long enough to turn it into something more useful.

I will need to look more closely at the djot.js inspired JSON schema you suggest. Off-hand, I definitely prefer the readability of and modern tooling for JSON over XML. How that makes it more extensible is less clear to me but that’s where I need to dig in.

Thanks @Jason-Abbott; I had never heard of AttributedString.

The idea is to simplify the template model so we are less dependent on it for adding new functionality, and put more logic in those “option groups”, which we open to extension, if for no other reason than to make CSL iteration easier going forward.

In the XML context, it would be like simplifying and freezing the element model, and saying foreign attributes are allowed in X, Y, Z places, and processors can’t break if they see them.
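
As a sketch of what "processors can’t break if they see them" could mean in practice: a processor reads the options it knows and sets aside everything else instead of failing. The option names `delimiter` and `andAs` come from the example earlier in the thread; `KnownOptions`, `readOptions`, and the `x-experimental` key are hypothetical.

```typescript
// The options this (imaginary) processor version understands.
type KnownOptions = { delimiter?: string; andAs?: string };

function readOptions(raw: Record<string, unknown>): { known: KnownOptions; foreign: string[] } {
  const known: KnownOptions = {};
  const foreign: string[] = [];
  for (const key of Object.keys(raw)) {
    const value = raw[key];
    if (key === "delimiter" && typeof value === "string") known.delimiter = value;
    else if (key === "andAs" && typeof value === "string") known.andAs = value;
    // Unknown (or wrongly typed) option: remember it, but never throw.
    else foreign.push(key);
  }
  return { known, foreign };
}

const parsed = readOptions({
  delimiter: "comma",
  andAs: "symbol",
  "x-experimental": true, // an extension this processor has never heard of
});
console.log(parsed.known, parsed.foreign);
```

An older processor would simply ignore `x-experimental`, while a newer one could start honoring it, without any change to the core model.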

But the simpler template model may yield other benefits:

  1. easier for users (if simple enough, maybe could leave room for string templates?) AND more featureful
  2. easier style and template distribution and evolution
  3. easier for developers in general, but also to create better UIs for creating and editing styles and templates

PS - this idea is clearly the clean break Denis was mentioning earlier.

Also, if you clone the repo, try make docs to get an easy-to-browse view of the model documentation.

This was originally an email to Bruce, but continuing this here for more eyes & transparency:
Here’s my initial reaction, without having fully understood all aspects of the question.

Sociological concerns

If I’m understanding it correctly, I’m worried about the general idea of individual extensibility. Enforcing a single, centralized standard has significant advantages in keeping styles working (and the “CSL brand” coherent, if you want).

Higher level concerns

I guess my biggest high-level concern is that I think technical debt is called debt for a reason: switching costs to a completely new system are significant. In other words, it can both be true that XML-based CSL would very much not be how we/you would design CSL today and that the switching costs outweigh the benefits of an updated approach.

Off the top of my head, switching would require updates or rewrites of, at least:

  • all citeprocs
  • our CI infrastructure
  • the visual style editor
  • the validator
  • all schemas
  • a large chunk of the specification
  • any guidance related to changing or writing CSL styles

There are also socio-technical costs: e.g. people like Patrick and me, who have looked at thousands of styles, are very proficient with the existing syntax and can read and modify it at high speed. Even with a technically improved system, it would almost certainly take a good amount of time to get to a similar level of fluency.

Even with the very limited infrastructure changes required for the CSL 1.0.1–> 1.0.2 switches, it turned out to be a fair amount of work to make the switch even after the schema and specs (the main chunk of the work) were done. CSL has become vital scientific infrastructure. Who will be responsible for planning and running a much more complex transition? Can that even be done without someone working on it full time?

On the other side of the equation, at least going by the quite limited uptake of the CSL 1.0.2 update over the last ~17 months, I think the case for the importance of CSL schema updates is at least not obviously strong (even though there are certainly things, esp. multilingual, I’d love to support better).

Specific concerns/comments

  • I like the ‘scope’ approach to language
  • currently it appears that the separation between elements in the bibliography relies on affixes exclusively. We’ve found that delimiters work more reliably in most instances; how would that look here? We also have some styles with quite deep nesting. If that remains necessary, how would it look? Personally, I find YAML quite tricky to work with at that depth (and don’t particularly enjoy deeply nested JSON either). Edit: I believe @bdarcus demos that here: (Confirm delimiter is available in the right places · Issue #36 · bdarcus/csl-next.js · GitHub)
  • grouping: I’m not sure grouping and disambiguation are as tightly related as your comments imply. I don’t see, e.g., why a style would need to group by author if it disambiguates by author (even though many do).
  • weird/complex stuff: in my experience, it’s not a good idea to expect too much coherence in style manuals. E.g., your comments imply a pretty simple model for in-text, author-year citations. But check, for example, Chicago (author-date), where delimiters for in-text citations depend on whether n.d. is used or not: styles/chicago-author-date.csl at master · citation-style-language/styles · GitHub

Thanks @Sebastian_Karcher! I edited out a couple of details you added to your post.

I’m worried about the general idea of individual extensibility.

The extensibility is more on the development side, to make it easier to evolve core CSL.

It would be like allowing foreign nodes in certain places in CSL 1.0, and so letting developers and users experiment with alternatives without breaking compatibility.

I guess my biggest high-level concern is that I think technical debt is called debt for a reason: switching costs to a completely new system are significant.

There’s no denying this, and the issues you raise are the ones that concern me.

I’m also concerned about how much responsibility under the status quo is placed in the hands of busy people who are donating their time. I don’t think that’s likely sustainable.

There are also socio-technical costs: e.g. people like Patrick and me, who have looked at thousands of styles are very proficient with the existing syntax and can read and modify it at high speed. Even with a technically improved system, it would almost certainly take a good amount of time to get to a similar level of fluency.

Part of my idea here, which could actually be implemented in a tweak of 1.0 to allow standalone macro files, is that if maintainers could spend more time on curated macro files, styles themselves would likely become much simpler, and so easier to create and maintain.

But I also think we need tools to automate a lot more of this, including tools that average users can use, so they are not dependent on the knowledge of a small circle of experts.

BTW, I think doing server side validation of these files using GitHub Actions would be trivial. I may even integrate it into the repo just to show/test.

Even with the very limited infrastructure changes required for the CSL 1.0.1–> 1.0.2 switches, it turned out to be a fair amount of work to make the switch even after the schema and specs (the main chunk of the work) were done. CSL has become vital scientific infrastructure. Who will be responsible for planning and running a much more complex transition? Can that even be done without someone working on it full time?

These are absolutely the right questions, and the very fact that it required much work at all signals a problem in my view.

But FWIW, I think it would likely mean new repos, and rethinking how we have done things for the past decade and a half.

In my experiment, the schemas and developer docs are generated from the commented code, and so can be automated by CI.

And if we can improve tools per my point above for users too, maybe that lessens the burden of central repository maintenance, or even scales back its purpose?

On the other side of the equation, at least going by the quite limited uptake of the CSL 1.0.2 update over the last ~17 months, I think the case for the importance of CSL schema updates is at least not obviously strong (even though there are certainly things, esp. multilingual, I’d love to support better).

That’s not really surprising, nor does it show there’s no desire for change. The changes in 1.0.2 were fairly trivial. Even then, they broke citeproc-rs!

(Though, as an aside, I don’t believe the changes were particularly disruptive to other processors, so suspect there’s something unique about citeproc-rs.)

OTOH, multilingual, proper support for APA integral/narrative citations, independent formatting of titles and subtitles, richer date and time support, etc. are bigger deal features, which we can’t support without changing the XML model (new elements and attributes), which means parsers break, style updating will be a nightmare, etc.

So we go back to our questions from last Summer:

Do we have a plan to evolve CSL, that considers all of these socio-technical issues, or do we just freeze the schemas?

Not answering the questions, in my view, is freezing by default.

grouping: I’m not sure grouping and disambiguation are as tightly related as your comments imply. I don’t see, e.g., why a style would need to group by author if it disambiguates by author (even though many do).

I’ve gone back-and-forth on this one.

I still think it’s related to grouping, and I can explain why in the author group case.

I’m struggling a little with the year one though.

EDIT: per post below, I figured out in implementation: an author-year group is just that, where you group references based on a combined string: author:year. The disambiguation rules are then just about whether you add a suffix to the year, and whether you drop the author and year in the output.
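
Here is a minimal sketch of that idea, with heavy simplifications: `HintRef`, `ProcHints`, `computeHints`, and `yearLabel` are hypothetical names, authors and years are plain strings, and the letter suffix only covers groups of up to 26 members.

```typescript
type HintRef = { citekey: string; author: string; year: string };
type ProcHints = { groupKey: string; groupIndex: number; groupLength: number };

function computeHints(refs: HintRef[]): Record<string, ProcHints> {
  // First pass: count how many references share each "author:year" key.
  const counts: Record<string, number> = {};
  for (const r of refs) {
    const k = `${r.author}:${r.year}`;
    counts[k] = (counts[k] ?? 0) + 1;
  }
  // Second pass: record each reference's 1-based position within its group.
  const seen: Record<string, number> = {};
  const hints: Record<string, ProcHints> = {};
  for (const r of refs) {
    const k = `${r.author}:${r.year}`;
    const index = (seen[k] ?? 0) + 1;
    seen[k] = index;
    hints[r.citekey] = { groupKey: k, groupIndex: index, groupLength: counts[k] ?? 1 };
  }
  return hints;
}

// Add a letter suffix (a, b, c, ...) only when the group has several members.
function yearLabel(year: string, h: ProcHints): string {
  return h.groupLength > 1 ? year + String.fromCharCode(96 + h.groupIndex) : year;
}

const cited: HintRef[] = [
  { citekey: "doe1", author: "Doe, Jane", year: "2023" },
  { citekey: "doe2", author: "Doe, Jane", year: "2023" },
  { citekey: "smith1", author: "Smith, Sam", year: "2020" },
];
const hints = computeHints(cited);
for (const r of cited) {
  const h = hints[r.citekey];
  if (h) console.log(r.citekey, yearLabel(r.year, h));
}
```

The input is assumed to be already sorted, so the order within each group decides which entry gets which suffix.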

PS - I guess in the end, the only way to know if this will work is to finish the proof-of-concept prototype! That’ll take a while; I’m having to learn a lot to implement the idea, and am a mediocre programmer anyway.

I was just playing a bit with quicktype, which converts JSON schemas to types in different languages: Rust or Swift structs, Haskell types, etc.

It even includes serialization and deserialization code!

With the Rust output of the style schema, the following compiles without error.

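use std::fs;

// `Style` is the struct quicktype generated from the style schema; import it from wherever that output was saved.
fn main() {
    // Read the style file; panic with a message if it cannot be read.
    let json = fs::read_to_string("src/style.csl.json")
        .expect("Unable to read file");
    // Deserialization fails here if the file does not match the generated types.
    let style: Style = serde_json::from_str(&json).unwrap();
    println!("{}", serde_json::to_string(&style.title).unwrap());
}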

I haven’t looked carefully to see how well the output matches the input model in typescript, but it looks good on first glance; the compiled binary even fails if the style file is not valid.

So I’ve included a section on this in the README.

@Jason-Abbott if you have a chance anytime soon, can you check if the Swift code it generates actually works? I tried this in a Swift playground, and it ran without error at least …

import Foundation

let jsonData = "{\"title\": \"Test style\"}"
let style = try? JSONDecoder().decode(Style.self, from: Data(jsonData.utf8))

But that’s about as far as I go with Swift.

If I can get this example to be more useful, I’ll link it from the README as demonstration of the code generation.

Update:

I merged sorting and grouping logic this weekend, and I’ve integrated it into a processing AST.

The Internal model and AST

The grouping logic is added to the procHints property on an intermediate ProcReference class, which includes formatting methods. That property I currently only use for disambiguation …

  ProcReference {
    data: {
      type: "book",
      title: "The Title",
      author: [ { name: "Doe, Jane" } ],
      issued: "2023",
      citekey: "doe2"
    },
    procHints: { groupIndex: 2, groupKey: "Doe, Jane:2023", groupLength: 2 }
  }

… and then incorporate in the rendering AST, which is just the input templates with an added procValue property:

  [
    [ { contributors: "author", procValue: "Doe, Jane" } ],
    {
      date: "issued",
      format: "year",
      wrap: "parentheses",
      procValue: "2023b"
    },
    [ { title: "title", procValue: "The Title" } ],
    undefined,
    undefined
  ]
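
To illustrate why this shape is convenient, here is a rough sketch of a final rendering pass over such an AST. The field names are taken from the sample above, but `Rendered`, `AstItem`, and `renderAst` are hypothetical, and the handling of `wrap` and the space delimiter are assumptions rather than what the prototype does.

```typescript
// A template component after processing; field names follow the sample above.
type Rendered = {
  contributors?: string;
  date?: string;
  title?: string;
  format?: string;
  wrap?: string;
  procValue: string;
};
// Items can be components, nested lists, or undefined (no data matched).
type AstItem = Rendered | AstItem[] | undefined;

function renderAst(items: AstItem[], delimiter = " "): string {
  const parts: string[] = [];
  for (const item of items) {
    if (item === undefined) continue; // skip templates that produced nothing
    if (Array.isArray(item)) {
      const inner = renderAst(item, delimiter);
      if (inner !== "") parts.push(inner);
    } else {
      parts.push(item.wrap === "parentheses" ? `(${item.procValue})` : item.procValue);
    }
  }
  return parts.join(delimiter);
}

const ast: AstItem[] = [
  [{ contributors: "author", procValue: "Doe, Jane" }],
  { date: "issued", format: "year", wrap: "parentheses", procValue: "2023b" },
  [{ title: "title", procValue: "The Title" }],
  undefined,
  undefined,
];
const line = renderAst(ast);
console.log(line);
```

All the hard work (sorting, grouping, disambiguation) is already baked into `procValue`, so the last step stays a simple tree walk.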

FWIW, this processing model is conceptually similar to the design I used for the very first CSL implementation, in XSLT 1.0, but of course using native and close-to-native data structures (as in, input format very closely maps to and from internal data structures).

Code generation example

A little demo repo to demonstrate the code generation:

If you run the generate.sh script, it will create the Rust module files, and build them with the tiny main.rs source file, which demonstrates that the generated code correctly deserializes and serializes the Style model, in this case translated to a Rust struct.

❯ time target/debug/csln-rs style.csl.yaml bibliography.yaml
The name of the style is: "APA"
The number of entries in the bibliography is: 5

________________________________________________________
Executed in    2.74 millis    fish           external
   usr time    1.16 millis  406.00 micros    0.75 millis
   sys time    1.62 millis  115.00 micros    1.51 millis

The Plan, milestones

Here’s how I imagine the milestones to build this out to a formal release. I’m already ahead of schedule on my own, so I’m optimistic about accomplishing the plan with help.

Aside: I’ve partly set the milestones up because there’s a possibility some comp sci students may work on pieces of this later this year and next.

Going further down the rabbit hole, I decided to just re-implement it in pure Rust, and see how far I could get.

Good news: a lot of the modeling and serialization and deserialization code is actually easier (and MUCH faster) than typescript alternatives.

Bad news: rumors of a very strict compiler are definitely warranted!

Anyway, I’m almost at the same place as the typescript code, and the process has prompted me to rethink some design decisions.

If you happen to be curious to try it and have a Rust setup installed, clone the repo and run cargo build from the root.

Then:

❯ target/debug/csln-schemas
Wrote style schema to schemas/style.json
Wrote bibliography schema to schemas/bibliography.json
❯ target/debug/csln processor/examples/style.csl.yaml processor/examples/ex1.bib.yaml
{
  "doe1": {
    "disamb-condition": false,
    "group-index": 1,
    "group-length": 1,
    "group-key": "Doe, Jane:2023-10"
  }
...

The current output of the CLI shows an internal HashMap that only includes the processing hints I will then use in the rendering step (next).

EDIT: basic rendering now implemented, which is what the CLI is now outputting.

Next step, hopefully: localized EDTF date-time rendering.