Citation Style Language

Proposal: Make style ids immutable

As I’ve noted in the past, it’s really inconvenient that style ids can change, and dealing with it creates extra complexity everywhere styles are handled, leading to bugs that cause unpleasant experiences for users. It’s also redundant to have a user-friendly id when there’s already a human-friendly title. The id should be for computers, and the title should be for humans.

It’d be much easier if this worked more like Zotero translator IDs, where ids are immutable, dependencies use those ids, and filenames are irrelevant. Then styles on GitHub could be renamed as desired without requiring downstream code to do anything to handle those changes.

As things are, to properly fix the issue in the above thread, the Zotero client would have to download and store updated renamed-style data from the repo on every style update check (instead of bundling it with version upgrades as it does now). That’s more development, complexity, and potential for bugs to solve a problem that doesn’t need to exist.

So here’s what I’d propose:

  1. Freeze all existing style ids as they are now. In an ideal world they’d all be changed to whatever new format we decided on (e.g., a UUID), but we don’t want to break existing clients or require massive changes across the whole CSL ecosystem. Fortunately this isn’t a problem, because the id is for computers, not people. http://zotero.org/styles/apa is just a string, no different from 90320768-db08-4d22-917c-4b7714273ff4. If a journal with an existing style changes its name, too bad. The title and filename can be updated, but the id has to stay the same.
  2. Make sure all implementers are treating ids and independent-parent as opaque strings, not dereferenceable URLs, and not interpreting the suffix as a meaningful short name. No assumption should be made that the filename on GitHub will remain consistent. Implementers can name local styles however they like (e.g., a filename-safe version of the title).
  3. Recommend UUIDs for new styles.
  4. Leave renamed-styles.json in the repo for the time being, or have existing implementers mirror a static copy of it to deal with old dependencies and then delete it. The advantage of the latter is that new implementers wouldn’t think they had to handle these mappings.
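A minimal sketch of what points 1 and 2 could mean for a client, assuming a hypothetical in-memory `StyleRegistry` (not taken from Zotero or any other implementation): the id is only ever compared as an exact string, and the local filename comes from the title.

```python
import re


def filename_from_title(title: str) -> str:
    """Derive a filename-safe local name from the human-readable title."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return (slug or "style") + ".csl"


class StyleRegistry:
    """Hypothetical store of installed styles, keyed by the opaque id."""

    def __init__(self):
        self._styles = {}  # id (opaque string) -> {"title", "filename"}

    def install(self, style_id: str, title: str) -> None:
        # The id is never parsed or dereferenced; it is only a dictionary key.
        self._styles[style_id] = {
            "title": title,
            "filename": filename_from_title(title),
        }

    def get(self, style_id: str):
        return self._styles.get(style_id)

    def __len__(self) -> int:
        return len(self._styles)
```

With this shape, a style whose title changes upstream stays a single installed style, because only the id decides identity.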

A note about independent-parent: there’s currently an ambiguity regarding whether the URI in independent-parent is meant to be understood as an id or a download location. In Zotero — and perhaps other tools — it’s both: it’s used as an id to identify a parent style that’s already installed, but it’s treated as a URI to download a missing style. That really shouldn’t be the case. (I’m not sure what happens if the id of the style that’s downloaded doesn’t match the independent-parent value, but probably nothing good.) If we switch to always considering the id an opaque string, I think we’d have to say that independent-parent had to exist in the official CSL repository (or an implementer’s mirror of it) so that it could be retrieved by id if necessary. While that’s a small bit of centralization, it also seems acceptable for dependent styles, since the whole point is that they allow for styles to be named after journals while tracking one of the more common styles. If you really need your style to track some style that isn’t in the main repo, you can just copy it and change id and title (or automate that) rather than making a dependent style.

We also use renamed-styles.json to offer redirection for styles we delete from the repository, e.g. for journals that stop publishing, or when organizations reduce the number of styles they use (ACS, for example, standardized on a single citation format earlier this year). Immutable IDs wouldn’t help with that.

P.S. Of course, we don’t really need to provide an update path for deleted styles, and we can also suggest that implementations fall back on, e.g., APA if an existing style is no longer available.

And for tracking independent parent styles, we could just add a second element to separate the ID from the retrieval link, e.g.:

<title>Nature Biotechnology</title>
<id>90320768-db08-4d22-917c-4b7714273ff4</id>
<link href="http://www.zotero.org/styles/nature-biotechnology" rel="self"/>
<parent-id>73f1d413-9249-4464-8f40-4e77de7f95b0</parent-id>
<link href="http://www.zotero.org/styles/nature" rel="independent-parent"/>

instead of

<title>Nature Biotechnology</title>
<id>http://www.zotero.org/styles/nature-biotechnology</id>
<link href="http://www.zotero.org/styles/nature-biotechnology" rel="self"/>
<link href="http://www.zotero.org/styles/nature" rel="independent-parent"/>

Yeah, I don’t think redirection for deleted styles is necessary — it’s much more straightforward for the client to just have the user reselect a valid style.

We could, and then the <link> would remain a dereferenceable URL, as you sort of expect it to be. If we did that, I think we’d have to say that an independent-parent in the absence of a <parent-id> was also an id so that existing styles without <parent-id> continued to work.
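That fallback rule could be sketched roughly as follows. This assumes the proposed `<parent-id>` element (which does not exist in CSL today) would sit inside `<info>` in the usual CSL namespace; the function name `parent_reference` and the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

CSL_NS = {"csl": "http://purl.org/net/xbiblio/csl"}

# Sample with the proposed (hypothetical) <parent-id> element.
NEW_STYLE = """<style xmlns="http://purl.org/net/xbiblio/csl">
  <info>
    <title>Nature Biotechnology</title>
    <id>90320768-db08-4d22-917c-4b7714273ff4</id>
    <parent-id>73f1d413-9249-4464-8f40-4e77de7f95b0</parent-id>
    <link href="http://www.zotero.org/styles/nature" rel="independent-parent"/>
  </info>
</style>"""

# Existing-style sample: only the independent-parent link is present.
LEGACY_STYLE = """<style xmlns="http://purl.org/net/xbiblio/csl">
  <info>
    <title>Nature Biotechnology</title>
    <id>http://www.zotero.org/styles/nature-biotechnology</id>
    <link href="http://www.zotero.org/styles/nature" rel="independent-parent"/>
  </info>
</style>"""


def parent_reference(style_xml: str):
    """Return (parent_id, download_url) for a dependent style, else None."""
    info = ET.fromstring(style_xml).find("csl:info", CSL_NS)
    if info is None:
        return None
    link = info.find("csl:link[@rel='independent-parent']", CSL_NS)
    href = link.get("href") if link is not None else None
    parent_id = info.findtext("csl:parent-id", default=None, namespaces=CSL_NS)
    if parent_id is None:
        # No <parent-id>: the link value doubles as the id, as it does today.
        parent_id = href
    if parent_id is None:
        return None  # not a dependent style
    return parent_id.strip(), href
```

For `NEW_STYLE` the id and the download location differ; for `LEGACY_STYLE` both values are the same string, which keeps existing dependent styles working.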

There’s also the related issue of the rel="self" link. I don’t think we use that in Zotero: we update central styles by id from our own repo, and we never implemented updating of externally hosted styles. But if any clients do rely on those, they’d become more important without ids that were also dereferenceable URLs. For central styles, new links would need to be in the form https://www.zotero.org/styles/90320768-db08-4d22-917c-4b7714273ff4, and we’d add a fallback to the Zotero repo for legacy ids so that https://www.zotero.org/styles/apa continued to work rather than it needing to be https://www.zotero.org/styles/http%3A%2F%2Fwww.zotero.org%2Fstyles%2Fapa (as logically it should be). Pre-UUID styles that were renamed would still retain their old URLs based on the legacy ids. (If we really didn’t like that, we could discuss alternatives, but accepting frozen ids and being content with updated titles is sort of the point here.)
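As a rough illustration of that URL scheme (not the Zotero repo’s actual code), the mapping from id to repository URL might look like this. `repo_url`, `REPO_BASE`, and `LEGACY_PREFIX` are hypothetical names, and the legacy prefix is assumed from the ids quoted in this thread.

```python
from urllib.parse import quote

REPO_BASE = "https://www.zotero.org/styles/"
# Assumed prefix of legacy ids, based on the examples in this thread.
LEGACY_PREFIX = "http://www.zotero.org/styles/"


def repo_url(style_id: str) -> str:
    """Map an opaque style id to a repository URL."""
    if style_id.startswith(LEGACY_PREFIX):
        # Fallback for legacy ids: keep the short, familiar URL.
        return REPO_BASE + style_id[len(LEGACY_PREFIX):]
    # Any other id (e.g., a UUID) is percent-encoded and appended as-is.
    return REPO_BASE + quote(style_id, safe="")
```

Without the legacy branch, percent-encoding the full legacy id would produce the long `http%3A%2F%2F...` form mentioned above.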

Following up on this, given recent discussion of adding a <uuid> field to solve the problem of ids that get changed.

Above, I proposed simply freezing existing ids and using UUIDs for new ones, and updating the spec to clarify that the id field is an opaque string rather than a URI.

Adding a <uuid> field has some advantages:

  • People creating new styles wouldn’t be confused by seeing two different <id> formats in existing styles and not knowing which to use.

  • Implementations that treat the id as anything other than an opaque string now might behave unexpectedly if UUIDs started appearing in <id>. Hopefully most such code is resilient enough to handle unexpected values when, say, deriving a filename from the id (e.g., to simply use the full string + .csl if there are no slashes, which it should, because technically “a URI” could also mean a URN), but it’s not impossible that there could be unexpected behavior.
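A sketch of the resilient filename derivation described in the second point, assuming a hypothetical helper `filename_from_id`: take whatever follows the last slash, or the whole string if there is none, and replace characters that are awkward in filenames.

```python
import re


def filename_from_id(style_id: str) -> str:
    """Derive a local filename from an id without assuming it is a URL."""
    # Text after the last slash, or the full string when there are no slashes.
    tail = style_id.rstrip("/").rsplit("/", 1)[-1]
    # Replace anything outside a conservative filename-safe set (e.g., ':').
    safe = re.sub(r"[^A-Za-z0-9._-]+", "-", tail).strip("-")
    return (safe or "style") + ".csl"
```

The same function then handles a legacy URL id, a bare UUID, and a URN without special cases.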

But there are also some downsides to having two fields:

  • We’d have to update every single style. That’s easily automated, but it would mean a ton of churn and a loss of meaningful last-update times.

  • Authors would have to populate two id fields in new styles, which would continue the confusion about what exactly was supposed to go in <id>.

  • Implementations would still need to deal with styles without UUIDs as long as they supported older CSL formats, which would mean continuing to treat <id> as an identifier of some sort. (That could create some weird situations if, say, you installed an old version of the APA style with just an id but had a newer version with both the same id and also a UUID. Are those two styles treated as equivalent, because the id is the same, or is an id in a UUID-less style treated as the same sort of opaque identifier as a UUID and therefore treated as a different style?) This would be more complicated than just making sure <id> was treated as an opaque string.

  • I’m not sure we would ever actually want to remove <id>, because if we did then anyone who still had a pre-UUID style and installed a newer version without an id would end up with a totally separate style. So we’d likely end up keeping this superfluous, confusing field forever.

So while I understand the appeal of transitioning to a new, consistent format, for simplicity I’d probably stick to a single field that was simply defined as opaque.

With UUIDs (in any field), how would things like detecting a style beginning with apa- work?

I’m more worried that style authors do not know what a UUID is or how to generate one. Version 1 or version 4?

Not even Microsoft’s developer products make people actually generate them. You want a new .NET Assembly, Visual Studio will generate an identifier for you. This only really makes sense if we have a GUI tool that everyone uses and we can control.

The worst and most likely outcome is that people who duplicate styles will reuse the existing UUID, not knowing that you are meant to change it, causing a great many more collisions than currently occur. I would guess the number of times people have created a CSL style that isn’t originally a duplicate of an existing style, in the history of time, could be counted on two hands.

At least with strings that have a slug of the style name in them, people feel like they should change it.

Looking for a <title> beginning with “American Psychological Association” would work for all existing apa- styles in the repo.

  1. We can point to https://uuidgen.org/v/4 from the documentation.
  2. We can add a button to Zotero’s CSL code editor to insert a new UUID.
  3. The visual editor can do the same.
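For anyone scripting this instead of using a button or website, a version 4 (random) UUID is one standard-library call in Python. The wrapper name `new_style_id` is only illustrative.

```python
import uuid


def new_style_id() -> str:
    """Return a fresh random (version 4) UUID as a lowercase string."""
    return str(uuid.uuid4())


if __name__ == "__main__":
    print(new_style_id())
```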

For official styles, we can easily check for duplicate ids via CI. (Maybe we already do?)
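Such a duplicate check could be as small as the sketch below. It assumes the CI job has already extracted a mapping of filename to `<id>` value; `find_duplicate_ids` is a hypothetical name, not an existing repository script.

```python
from collections import defaultdict


def find_duplicate_ids(ids_by_file: dict) -> dict:
    """Return {id: [filenames]} for every id used by more than one file."""
    files_by_id = defaultdict(list)
    for filename, style_id in sorted(ids_by_file.items()):
        files_by_id[style_id].append(filename)
    return {sid: files for sid, files in files_by_id.items() if len(files) > 1}
```

A CI step would fail the build whenever the returned dictionary is non-empty.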

For unofficial styles, when installing a style not through an official repo, clients can just warn before replacing an existing style if the title is different. (“Do you want to replace “American Psychological Association 6th Edition” with “My Custom APA Style”?”) Even with the current human-readable URIs, it’s fairly common for people to 1) overwrite their existing official style and then 2) not understand why they lost their modifications on the next update, so it’s something we should try to prevent anyway.
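The client-side decision could be sketched like this, assuming a hypothetical `install_action` helper that sees the installed styles as a mapping of id to title. The three return values are illustrative labels, not an existing client API.

```python
def install_action(installed: dict, incoming_id: str, incoming_title: str) -> str:
    """Decide what to do when a style is installed from outside the repo.

    Returns "install" for a new id, "update" when id and title both match,
    and "confirm" when the id matches but the title differs.
    """
    current_title = installed.get(incoming_id)
    if current_title is None:
        return "install"
    if current_title.strip().casefold() == incoming_title.strip().casefold():
        return "update"
    # Same id, different title: likely a modified copy that kept the old id.
    return "confirm"
```

On "confirm", the client would show the replacement prompt rather than silently overwriting the installed style.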

For the unofficial style workflow, you still have to understand what goes into a UUID before you can tweak it. With opaque text, someone can just append -cormacrelf-edits-2019 or whatever. I can’t do that to a UUID, so this raises the barrier to entry for style editing. You can do the same “warn before replacing an existing style with a different title” with opaque text, so I don’t see what UUID adds to that. I don’t see what UUID adds to the official repo either if you still have to CI-check them. All it gives you is the ability to autogenerate, which you could always do by appending whatever you like, including a UUID, to the opaque text blobs.

I don’t really have a problem with using UUID as a random string generator, not validated. I don’t care about handholding implementations that have relied on exactly what is returned by a ‘dereferenceable’ uri, as far as I’m concerned that’s way outside spec, which would have had to describe valid HTTP responses. It would be a breaking change and require CSL 2.0.0 to require UUIDs. I don’t think there is any value in adding an extra uuid field. So I guess I’m saying go for it, autogenerate them to avoid collisions and match user expectations, even though 1.0.1 says it’s “undesirable”.

I think the value of static IDs is really big. If there is concern about users not knowing how to change the UUID, perhaps Zotero or other clients could offer to generate a new ID if a duplicate is detected, rather than just asking for confirmation of install?

With respect to making the switch, I think minimizing confusion with existing IDs by changing them all to UUIDs may be valuable, especially for future new contributors who I would predict would otherwise be prone to editing the id field if a journal title changed, etc. That consistency would be worth the churn and bump in viewing modification dates.

Currently only implicitly (since id = filename and we check for that) but yes, that part wouldn’t be an issue.

For @bwiernik’s question:

With UUIDs (in any field), how would things like detecting a style beginning with apa- work?

I also assume we’ll keep filenames & naming conventions as they are, which has proven very useful for human-readability – but not having to worry too much about changing them will be nice.

So can I summarize what I see as an emerging consensus:

  1. Don’t introduce a separate UUID element
  2. Do convert all style IDs in the repository to UUIDs
  3. Do check for duplicate IDs using CI for the repository
  4. Don’t require UUIDs as part of validation, but do recommend them in the spec and do require and test for them in the repository.

Open questions from me:

  1. Any concern about putting this into a minor release?
  2. What do we do with the rel="self" link?
  3. I assume we’ll do the same with locales?

Right, I don’t think there’s any particular reason for clients to validate them. The official repo could, along with duplicate detection, but that’s just for cleanliness. This is really just to have a convention for random ids that someone won’t be tempted to change when the style is renamed. That’s the problem we’re solving for. ama would also be a fine id, until someone decides it should be american-medical-association for consistency and disambiguation, and then things break.

Well, not quite. It could just be an assumption that it’s something that looks like a URL, such that there’s a slug at the end (e.g., apa) that could be used as a filename or a lookup id from a repo, and at least for official styles that would always be true. And that’s actually an embedded assumption in renamed-styles.json, which only uses short names.* I agree that clients should at least be gracefully handling styles where it’s not the case, so I don’t see this as a blocker for changing the recommendation in a minor release, but clients will need to check their id-handling code.

* If we continued using renamed-styles.json for deleted styles, clients would need to handle both legacy short names and UUIDs on each side.

Yeah, a “Replace”/“Keep Both” might be nice (as is done for file replacement in recent macOS versions).

No, we can’t change existing ids — that would break everything. The whole goal of this is to prevent ids from changing. The “churn” I was referring to was if we added a second field and needed to update every style to add a UUID, not changing existing ids. We can use CI to warn if the <id> is changed on a file modification (as opposed to in a new file, where they’ll just be duplicate-checked).
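A sketch of that CI warning, assuming the job can build a filename-to-id mapping for both the base revision and the proposed change. `changed_ids` is a hypothetical helper, and a file renamed in the same commit would need to be matched up separately.

```python
def changed_ids(before: dict, after: dict) -> dict:
    """Return {filename: (old_id, new_id)} for modified files whose id changed.

    Files present on only one side (added or deleted) are ignored here;
    new files are covered by the separate duplicate check.
    """
    return {
        name: (before[name], after[name])
        for name in sorted(before.keys() & after.keys())
        if before[name] != after[name]
    }
```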

Super-no on 2. Otherwise yes.

Fine with me.

The spec should say “URL” instead of “URI”, since that’s the actual URL that clients should be using to check for updates (at least for ids not found in an official repo or mirror), and it’s up to the server to provide redirects as appropriate. As I say above, in the Zotero repo we’d make new UUID-based styles available as https://www.zotero.org/styles/90320768-db08-4d22-917c-4b7714273ff4 so that filename changes didn’t break them and we didn’t have to track redirects.

rel="template" should also be understood as a URL, not an id (but I think that’s just for human authors anyway).

We should also add parent-id, as Rintze and I discuss above, to remove the ambiguity with independent-parent. The former would be a static id. The latter would be a download location.

I’m not sure what you mean there. The only id in locales is xml:lang, no?