Proposal: Make style ids immutable

As I’ve noted in the past, it’s really inconvenient that style ids can change, and dealing with it creates extra complexity everywhere styles are handled, leading to bugs that cause unpleasant experiences for users. It’s also redundant to have a user-friendly id when there’s already a human-friendly title. The id should be for computers, and the title should be for humans.

It’d be much easier if this worked more like Zotero translator IDs, where ids are immutable, dependencies use those ids, and filenames are irrelevant. Then styles on GitHub could be renamed as desired without requiring downstream code to do anything to handle those changes.

As things are, to properly fix the issue in the above thread, the Zotero client would have to download and store updated renamed-style data from the repo on every style update check (instead of bundling it with version upgrades as it does now). That’s more development, complexity, and potential for bugs to solve a problem that doesn’t need to exist.

So here’s what I’d propose:

  1. Freeze all existing style ids as they are now. In an ideal world they’d all be changed to whatever new format we decided on (e.g., a UUID), but we don’t want to break existing clients or require massive changes across the whole CSL ecosystem. Fortunately this isn’t a problem, because the id is for computers, not people. http://zotero.org/styles/apa is just a string, no different from 90320768-db08-4d22-917c-4b7714273ff4. If a journal with an existing style changes its name, too bad. The title and filename can be updated, but the id has to stay the same.
  2. Make sure all implementers are treating ids and independent-parent as opaque strings, not dereferenceable URLs, and not interpreting the suffix as a meaningful short name. No assumption should be made that the filename on GitHub will remain consistent. Implementers can name local styles however they like (e.g., a filename-safe version of the title).
  3. Recommend UUIDs for new styles.
  4. Leave renamed-styles.json in the repo for the time being, or have existing implementers mirror a static copy of it to deal with old dependencies and then delete it. The advantage of the latter is that new implementers wouldn’t think they had to handle these mappings.
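To make points 2 and 3 concrete, here is a rough sketch of what "opaque id, filename from the title" could look like in a client. The function names and the `installed` dict are made up for illustration; nothing here is existing Zotero or CSL code.

```python
import re
import uuid


def new_style_id() -> str:
    """Return a random (version 4) UUID to use as the id of a new style."""
    return str(uuid.uuid4())


def local_filename(title: str) -> str:
    """Build a filename-safe name from the human-readable title, not the id."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return (slug or "style") + ".csl"


# Hypothetical local registry: installed styles are keyed by the id exactly
# as written in the file. The id is never parsed or split.
installed: dict[str, str] = {}
installed["http://www.zotero.org/styles/apa"] = local_filename(
    "American Psychological Association 7th edition"
)
installed[new_style_id()] = local_filename("Nature Biotechnology")
```

A legacy URL-shaped id and a new UUID sit side by side in the same lookup table, and renaming a style only ever touches the title-derived filename.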

A note about independent-parent: there’s currently an ambiguity regarding whether the URI in independent-parent is meant to be understood as an id or a download location. In Zotero — and perhaps other tools — it’s both: it’s used as an id to identify a parent style that’s already installed, but it’s treated as a URI to download a missing style. That really shouldn’t be the case. (I’m not sure what happens if the id of the style that’s downloaded doesn’t match the independent-parent value, but probably nothing good.) If we switch to always considering the id an opaque string, I think we’d have to say that independent-parent had to exist in the official CSL repository (or an implementer’s mirror of it) so that it could be retrieved by id if necessary. While that’s a small bit of centralization, it also seems acceptable for dependent styles, since the whole point is that they allow for styles to be named after journals while tracking one of the more common styles. If you really need your style to track some style that isn’t in the main repo, you can just copy it and change id and title (or automate that) rather than making a dependent style.

We also use renamed-styles.json to offer redirection for styles we delete from the repository, e.g. for journals that stop publishing, or when organizations reduce the number of styles they use (ACS, for example, standardized on a single citation format earlier this year). Immutable IDs wouldn’t help with that.

P.S. Of course, we don’t really need to provide an update path for deleted styles, and we can also suggest that implementations fall back on, e.g., APA if an existing style is no longer available.

And for tracking independent parent styles, we could just add a second element to separate the ID from the retrieval link, e.g.:

<title>Nature Biotechnology</title>
<id>90320768-db08-4d22-917c-4b7714273ff4</id>
<link href="http://www.zotero.org/styles/nature-biotechnology" rel="self"/>
<parent-id>73f1d413-9249-4464-8f40-4e77de7f95b0</parent-id>
<link href="http://www.zotero.org/styles/nature" rel="independent-parent"/>

instead of

<title>Nature Biotechnology</title>
<id>http://www.zotero.org/styles/nature-biotechnology</id>
<link href="http://www.zotero.org/styles/nature-biotechnology" rel="self"/>
<link href="http://www.zotero.org/styles/nature" rel="independent-parent"/>

Yeah, I don’t think redirection for deleted styles is necessary — it’s much more straightforward for the client to just have the user reselect a valid style.

We could, and then the <link> would remain a dereferenceable URL, as you sort of expect it to be. If we did that, I think we’d have to say that an independent-parent in the absence of a <parent-id> was also an id so that existing styles without <parent-id> continued to work.
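A minimal sketch of that fallback rule, assuming the proposed <parent-id> element ends up inside <info> next to the links. The element does not exist in CSL today, and the `parent_id` function and the sample strings are only illustrative.

```python
import xml.etree.ElementTree as ET

CSL_NS = "{http://purl.org/net/xbiblio/csl}"


def parent_id(style_xml: str):
    """Return the id of the parent style, or None for an independent style."""
    info = ET.fromstring(style_xml).find(f"{CSL_NS}info")
    if info is None:
        return None
    # Proposed (hypothetical) element: a static identifier for the parent.
    explicit = info.find(f"{CSL_NS}parent-id")
    if explicit is not None and explicit.text:
        return explicit.text.strip()
    # Legacy styles: the independent-parent link doubles as the id.
    for link in info.findall(f"{CSL_NS}link"):
        if link.get("rel") == "independent-parent":
            return link.get("href")
    return None


NEW_STYLE = """<style xmlns="http://purl.org/net/xbiblio/csl"><info>
<id>90320768-db08-4d22-917c-4b7714273ff4</id>
<parent-id>73f1d413-9249-4464-8f40-4e77de7f95b0</parent-id>
<link href="http://www.zotero.org/styles/nature" rel="independent-parent"/>
</info></style>"""

OLD_STYLE = """<style xmlns="http://purl.org/net/xbiblio/csl"><info>
<id>http://www.zotero.org/styles/nature-biotechnology</id>
<link href="http://www.zotero.org/styles/nature" rel="independent-parent"/>
</info></style>"""
```

With <parent-id> present, the link is only ever a download location; without it, the link href keeps its current double role.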

There’s also the related issue of the rel="self" link. I don’t think we use that in Zotero: we update central styles by id from our own repo, and we never implemented updating of externally hosted styles. But if any clients do rely on those, they’d become more important without ids that were also dereferenceable URLs. For central styles, new links would need to be in the form https://www.zotero.org/styles/90320768-db08-4d22-917c-4b7714273ff4, and we’d add a fallback to the Zotero repo for legacy ids so that https://www.zotero.org/styles/apa continued to work rather than it needing to be https://www.zotero.org/styles/http%3A%2F%2Fwww.zotero.org%2Fstyles%2Fapa (as logically it should be). Pre-UUID styles that were renamed would still retain their old URLs based on the legacy ids. (If we really didn’t like that, we could discuss alternatives, but accepting frozen ids and being content with updated titles is sort of the point here.)
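The URL scheme described here could be sketched roughly like this. `repo_url` and the two constants are hypothetical names, and this is not how the Zotero repo is actually implemented; it only shows the legacy fallback next to the "logical" percent-encoded form.

```python
from urllib.parse import quote

REPO_BASE = "https://www.zotero.org/styles/"
LEGACY_PREFIX = "http://www.zotero.org/styles/"


def repo_url(style_id: str) -> str:
    """Map a style id to the URL the repo would serve it from."""
    if style_id.startswith(LEGACY_PREFIX):
        # Legacy fallback: keep the short, already-working URL.
        return REPO_BASE + style_id[len(LEGACY_PREFIX):]
    # Any other id is treated as an opaque string and percent-encoded.
    return REPO_BASE + quote(style_id, safe="")
```

A UUID contains only characters that need no escaping, so it passes through unchanged, while a legacy id keeps its short URL instead of the fully encoded one.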

Following up on this, given recent discussion of adding a <uuid> field to solve the problem of ids that get changed.

Above, I proposed simply freezing existing ids and using UUIDs for new ones, and updating the spec to clarify that the id field is an opaque string rather than a URI.

Adding a <uuid> field has some advantages:

  • People creating new styles wouldn’t be confused by seeing two different <id> formats in existing styles and not knowing which to use.

  • Implementations that treat the id as anything other than an opaque string now might behave unexpectedly if UUIDs started appearing in <id>. Hopefully most such code is resilient enough to handle unexpected values when, say, deriving a filename from the id (e.g., to simply use the full string + .csl if there are no slashes, which it should, because technically “a URI” could also mean a URN), but it’s not impossible that there could be unexpected behavior.
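For the filename case in the second bullet, a resilient version might look something like the sketch below. `filename_from_id` is a made-up helper, and the character whitelist is just one reasonable choice, not anything the spec prescribes.

```python
def filename_from_id(style_id: str) -> str:
    """Derive a local filename from an id without assuming it is a URL."""
    if "/" in style_id:
        # Legacy URL-shaped id: use the last non-empty path segment.
        name = style_id.rstrip("/").rsplit("/", 1)[-1]
    else:
        # No slashes (e.g. a bare UUID or a URN): use the full string.
        name = style_id
    # Replace characters that are awkward in filenames (e.g. ':' in a URN).
    safe = "".join(c if c.isalnum() or c in "-_." else "_" for c in name)
    return (safe or "style") + ".csl"
```

Code written this way keeps producing apa.csl for the legacy APA id and simply uses the whole string for a UUID or URN.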

But there are also some downsides to having two fields:

  • We’d have to update every single style. That’s easily automated, but it would mean a ton of churn and a loss of meaningful last-update times.

  • Authors would have to populate two id fields in new styles, which would continue the confusion about what exactly was supposed to go in <id>.

  • Implementations would still need to deal with styles without UUIDs as long as they supported older CSL formats, which would mean continuing to treat <id> as an identifier of some sort. (That could create some weird situations if, say, you installed an old version of the APA style with just an id but had a newer version with both the same id and also a UUID. Are those two styles treated as equivalent, because the id is the same, or is an id in a UUID-less style treated as the same sort of opaque identifier as a UUID and therefore treated as a different style?) This would be more complicated than just making sure <id> was treated as an opaque string.

  • I’m not sure we would ever actually want to remove <id>, because if we did then anyone who still had a pre-UUID style and installed a newer version without an id would end up with a totally separate style. So we’d likely end up keeping this superfluous, confusing field forever.

So while I understand the appeal of transitioning to a new, consistent format, for simplicity I’d probably stick to a single field that was simply defined as opaque.

With UUIDs (in any field), how would things like detecting a style beginning with apa- work?

I’m more worried that style authors do not know what a UUID is or how to generate one. Version 1 or version 4?

Not even Microsoft’s developer products make people actually generate them. You want a new .NET Assembly, Visual Studio will generate an identifier for you. This only really makes sense if we have a GUI tool that everyone uses and we can control.

The worst and most likely outcome is that people who duplicate styles will reuse the existing UUID, not knowing that you are meant to change it, causing a great many more collisions than currently occur. I would guess the number of times people have created a CSL style that isn’t originally a duplicate of an existing style, in the history of time, could be counted on two hands.

At least with strings that have a slug of the style name in them, people feel like they should change it.

Looking for a <title> beginning with “American Psychological Association” would work for all existing apa- styles in the repo.

  1. We can point to UUID Generator ⚡ from the documentation.
  2. We can add a button to Zotero’s CSL code editor to insert a new UUID.
  3. The visual editor can do the same.

For official styles, we can easily check for duplicate ids via CI. (Maybe we already do?)
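A duplicate-id check for CI could be quite small. This is only a sketch: `read_style_id` and `find_duplicate_ids` are hypothetical helpers, not part of the existing repository tests, and it assumes every style has <info><id>.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from pathlib import Path

CSL_NS = "{http://purl.org/net/xbiblio/csl}"


def read_style_id(path: Path) -> str:
    """Return the text of <info><id> from a CSL file."""
    node = ET.parse(str(path)).getroot().find(f"{CSL_NS}info/{CSL_NS}id")
    if node is None or not node.text:
        raise ValueError(f"{path} has no <id>")
    return node.text.strip()


def find_duplicate_ids(paths):
    """Map each id used by more than one file to the files that use it."""
    by_id = defaultdict(list)
    for path in paths:
        by_id[read_style_id(path)].append(path.name)
    return {sid: names for sid, names in by_id.items() if len(names) > 1}
```

CI would run it over something like `sorted(Path("styles").glob("*.csl"))` and fail the build if the returned dict is non-empty.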

For unofficial styles, when installing a style not through an official repo, clients can just warn before replacing an existing style if the title is different. (“Do you want to replace “American Psychological Association 6th Edition” with “My Custom APA Style”?”) Even with the current human-readable URIs, it’s fairly common for people to 1) overwrite their existing official style and then 2) not understand why they lost their modifications on the next update, so it’s something we should try to prevent anyway.
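The install-time decision could be as simple as the following sketch. `install_action` and its return values are invented for illustration, and `installed` stands in for whatever id-to-title lookup a client already has.

```python
def install_action(installed: dict, new_id: str, new_title: str) -> str:
    """Decide what to do when a style is installed outside an official repo."""
    existing_title = installed.get(new_id)
    if existing_title is None:
        return "install"   # No style with this id yet.
    if existing_title == new_title:
        return "update"    # Same id and title: treat as a newer version.
    return "confirm"       # Same id, different title: ask the user first.


installed = {
    "http://www.zotero.org/styles/apa":
        "American Psychological Association 6th Edition",
}
```

The check only compares id and title, so it behaves the same whether the id is a legacy URI or a UUID.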

For the unofficial style workflow, you still have to understand what goes into a UUID before you can tweak it. With opaque text, someone can just append -cormacrelf-edits-2019 or whatever. I can’t do that to a UUID, so this raises the barrier to entry for style editing. You can do the same “warn before replacing an existing style with a different title” with opaque text; I don’t see what a UUID adds to that. I don’t see what a UUID adds to the official repo either if you still have to CI-check them. All it gives you is the ability to autogenerate, which you could always do by appending whatever you like, including a UUID, to the opaque text blobs.

I don’t really have a problem with using UUID as a random string generator, not validated. I don’t care about handholding implementations that have relied on exactly what is returned by a ‘dereferenceable’ URI; as far as I’m concerned, that’s way outside the spec, which would have had to describe valid HTTP responses. It would be a breaking change and require CSL 2.0.0 to require UUIDs. I don’t think there is any value in adding an extra uuid field. So I guess I’m saying go for it, autogenerate them to avoid collisions and match user expectations, even though 1.0.1 says it’s “undesirable”.

I think the value of static IDs is really big. If there is concern about users not knowing how to change the UUID, perhaps Zotero or other clients could offer to generate a new ID if a duplicate is detected, rather than just asking for confirmation of install?

With respect to making the switch, I think minimizing confusion with existing IDs by changing them all to UUIDs may be valuable, especially for future new contributors who I would predict would otherwise be prone to editing the id field if a journal title changed, etc. That consistency would be worth the churn and bump in viewing modification dates.

Currently only implicitly (since id = filename and we check for that) but yes, that part wouldn’t be an issue.

For @bwiernik’s question:

With UUIDs (in any field), how would things like detecting a style beginning with apa- work?

I also assume we’ll keep filenames & naming conventions as they are, which has proven very useful for human-readability – but not having to worry too much about changing them will be nice.

So can I summarize what I see as an emerging consensus:

  1. Don’t introduce a separate UUID element
  2. Do convert all style IDs in the repository to UUIDs
  3. Do check for duplicate IDs using CI for the repository
  4. Don’t require UUIDs as part of validation, but do recommend them in the spec and do require and test for them in the repository.

Open questions from me:

  1. Any concern about putting this into a minor release?
  2. What do we do with the rel="self" link?
  3. I assume we’ll do the same with locales?

Right, I don’t think there’s any particular reason for clients to validate them. The official repo could, along with duplicate detection, but that’s just for cleanliness. This is really just to have a convention for random ids that someone won’t be tempted to change when the style is renamed. That’s the problem we’re solving for. ama would also be a fine id, until someone decides it should be american-medical-association for consistency and disambiguation, and then things break.

Well, not quite. It could just be an assumption that it’s something that looks like a URL, such that there’s a slug at the end (e.g., apa) that could be used as a filename or a lookup id from a repo, and at least for official styles that would always be true. And that’s actually an embedded assumption in renamed-styles.json, which only uses short names.* I agree that clients should at least be gracefully handling styles where it’s not the case, so I don’t see this as a blocker for changing the recommendation in a minor release, but clients will need to check their id-handling code.

* If we continued using renamed-styles.json for deleted styles, clients would need to handle both legacy short names and UUIDs on each side.

Yeah, a “Replace”/“Keep Both” might be nice (as is done for file replacement in recent macOS versions).

No, we can’t change existing ids — that would break everything. The whole goal of this is to prevent ids from changing. The “churn” I was referring to was if we added a second field and needed to update every style to add a UUID, not changing existing ids. We can use CI to warn if the <id> is changed on a file modification (as opposed to in a new file, where they’ll just be duplicate-checked).
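The CI warning mentioned at the end could be a straight comparison of ids before and after a change. In this sketch the two snapshots are assumed to be available already as filename-to-id dicts (how they are pulled out of git is left out), and `changed_ids` is a hypothetical helper.

```python
def changed_ids(before: dict, after: dict) -> list:
    """List files present in both snapshots whose <id> differs."""
    return sorted(
        name
        for name in before.keys() & after.keys()
        if before[name] != after[name]
    )


before = {"apa.csl": "http://www.zotero.org/styles/apa"}
after = {
    # Modified file whose id was edited: should be flagged.
    "apa.csl": "http://www.zotero.org/styles/apa-7th-edition",
    # New file: not flagged here, only duplicate-checked elsewhere.
    "nature-biotechnology.csl": "90320768-db08-4d22-917c-4b7714273ff4",
}
```

Here `changed_ids(before, after)` flags only apa.csl; the new file is ignored because it has no earlier id to compare against.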

Super-no on 2. Otherwise yes.

Fine with me.

The spec should say “URL” instead of “URI”, since that’s the actual URL that clients should be using to check for updates (at least for ids not found in an official repo or mirror), and it’s up to the server to provide redirects as appropriate. As I say above, in the Zotero repo we’d make new UUID-based styles available as https://www.zotero.org/styles/90320768-db08-4d22-917c-4b7714273ff4 so that filename changes didn’t break them and we didn’t have to track redirects.

rel="template" should also be understood as a URL, not an id (but I think that’s just for human authors anyway).

We should also add parent-id, as Rintze and I discuss above, to remove the ambiguity with independent-parent. The former would be a static id. The latter would be a download location.

I’m not sure what you mean there. The only id in locales is xml:lang, no?

This might be a very bad idea, but how about converting all existing style IDs to (MD5) hashes of the original ID? That way there is a very simple way to compare old and new IDs without a lookup table, and it would allow us to get rid of the https://www.zotero.org/styles/ style ID format in one fell swoop, which otherwise would haunt us forever. We can then use a proper UUID generator for new IDs.

P.S. although I guess this would be a one-way street. E.g. if you have an old style ID, you know what the corresponding hash will be, but going the opposite direction would be difficult.

I really would like to get rid of the https://www.zotero.org/styles/ format. If hashing an ID that starts with https://www.zotero.org/styles/ is not too much effort to implement, that seems like a good way to go to me.

So regarding style IDs, we have two issues, right:

One is that, per @Dan_Stillman’s original post, style IDs have not been treated as immutable in the past. The second is that the current ID format (http://www.zotero.org/styles/<filename-sans-extension>) is not ideal, with at least these drawbacks:

  • HTTP is outdated. So far we’ve refrained from updating the “self” links to HTTPS to keep them identical to the HTTP IDs, as having HTTPS “self” links and HTTP IDs that are otherwise the same would probably lead to many submission errors. Submitters occasionally try to use HTTPS in style IDs, which we currently don’t allow.
  • The Zotero domain in the IDs became outdated in 2011 when the CSL project started maintaining the central CSL style repository on GitHub, and it confuses the branding of CSL (e.g. some people think CSL is a Zotero technology).

Like @bwiernik, I’d really like a solution that tackles both, especially since we have such a large number of legacy IDs (almost 10,000!). By itself, we could address the second issue with a simple prefix substitution (from http://www.zotero.org/styles/apa to e.g. https://styles.citationstyles.org/apa, https://citationstyles.org/styles/apa, csl-repository/apa, or something like that).

@Dan_Stillman, does one of these solutions seem reasonable/feasible to you:

a) convert all legacy style IDs in the next CSL release using a documented string-substitution (e.g. from http://www.zotero.org/styles/apa to https://styles.citationstyles.org/apa), and use UUIDs for new IDs.
b) per my previous post, convert all legacy style IDs into something that is no longer human-readable via a hashing function. Per https://docs.python.org/3.8/library/uuid.html and https://stackoverflow.com/questions/10867405/generating-v5-uuid-what-is-name-and-namespace, it looks like we could generate Type 3 UUIDs using MD5 hashes of the original style ID, so we would even be able to get proper UUIDs instead of raw hashes. We would use Type 4 random-seeded UUIDs for new IDs.

The last option would be my preference, but maybe I’m overlooking a scenario where being able to reproducibly generate a hash-seeded UUID from a legacy style ID isn’t enough. Are there any expected lookups where we only have the new ID and need to know the old one? The main scenarios I could think of:

Style with legacy ID: “Is there a newer version of me?”
Style with legacy ID/“self” link: “Where do I find the style belonging to this ID”
Style with legacy “independent-parent” link: “Where/who is my parent?”
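For reference, option (b) is a one-liner with Python’s standard uuid module. This is just a sketch: using uuid.NAMESPACE_URL as the namespace is my assumption (any fixed, documented namespace UUID would do), and `legacy_id_to_uuid` is a made-up name.

```python
import uuid


def legacy_id_to_uuid(legacy_id: str) -> str:
    """Derive a deterministic version 3 (MD5-based) UUID from a legacy id."""
    # Assumption: the predefined URL namespace is used for all legacy ids.
    return str(uuid.uuid3(uuid.NAMESPACE_URL, legacy_id))


def new_style_uuid() -> str:
    """Return a random version 4 UUID for a brand-new style."""
    return str(uuid.uuid4())


apa_uuid = legacy_id_to_uuid("http://www.zotero.org/styles/apa")
```

The same legacy id always yields the same UUID, which covers all three scenarios above, but as noted earlier it is one-way: there is no way to get back from `apa_uuid` to the original string without a lookup table.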

And regarding other things that have been discussed in this thread:

Since we use <link href="..." rel="independent-parent"/>, it would be better to add independent-parent-id instead of parent-id for consistency. Or change independent-parent to just parent.

We use “template” links mainly for attribution (I tend to strip authors/contributors from styles if they were inherited from the “template” style) and to determine style ancestry, so I think it should reference an ID. I can’t think of any case where we actually rely on these links being dereferenceable.

It seems a bit redundant to repeat the style UUID in the “self” link. The only thing we really need to indicate is whether the style (identified through its UUID) is present in an online repository. If we only support the central repository, a boolean would be enough for that. If we supported multiple repositories, we’d need no more than the URL to the repo (e.g. <link href="https://www.zotero.org/styles/" rel="repository"/> instead of <link href="https://www.zotero.org/styles/90320768-db08-4d22-917c-4b7714273ff4" rel="self"/>).

I understand the impulse here, but to reiterate what I said above: “we can’t change existing ids — that would break everything. The whole goal of this is to prevent ids from changing.”

It’s not reasonable to ask 50-odd implementations of CSL to spend time updating and testing every bit of code that deals with style ids to change things for essentially aesthetic purposes.

Could we update Zotero to replace an existing style if uuid3(md5(oldID)) == newID instead of duplicating the style? And make it so that an existing word processor document used the new style instead of throwing an error? And update the Zotero API so that requests that used old ids used the new styles rather than returning 400s? And update the Zotero styles page so old style links redirected and weren’t 404s? And update our style repo to serve new styles as updates for old ids? And cut off style updating for all Zotero versions before this change was made so they didn’t end up with duplicates of every installed style? And update anything else I’m not thinking of in the Zotero ecosystem to handle this properly? And could every other implementer do the same?

I mean…yes, but it would be a terrible use of everyone’s time, and it would risk breakage across the CSL ecosystem.

And after all that, we wouldn’t have made the problem go away — we’d simply have pushed it out, times 50 or so, to every implementation, which would need to keep this migration code in place in perpetuity.

HTTP being outdated makes no difference, because these aren’t web addresses — they’re identifiers.

rel="self" can be whatever it needs to be to point to the current version of a style. There’s no reason it needs to be the same as the id.

The filename doesn’t need to match the id either.

Submission problems can be dealt with by authoring tools and CI.

So it’s a historical artifact. HTTP Referer is forever misspelled. Linked-data URIs and XML namespaces will forever begin with http:// despite sites like schema.org now using HTTPS. User-Agent strings begin with Mozilla, and will be frozen as such even once they’re deprecated.

A tiny number of people who view raw CSL code getting the wrong impression about CSL’s history — thinking that Zotero created CSL instead of merely contributing to the first version and writing the first implementation — is not grounds for risking disruption of a critical technology used by millions of people.

Re: other things:

I don’t think the inconsistency particularly matters — they’re different elements, and I think they can be different levels of descriptive — but independent-parent-id is annoyingly verbose, so I wouldn’t do that regardless. Changing independent-parent to parent would be cleaner, but it would break things, so it’d have to wait for a major version.

This doesn’t really make sense. CSL doesn’t have some unified repository specification with a defined URL scheme. Zotero has its own repo, and other tools have theirs. Zotero will continue to check its own repo for installed styles, and the id will continue to be sufficient for retrieving those.

The point of rel="self" is just to provide a URL at which a tool can download a style outside the context of a repo, aided by normal HTTP mechanics such as redirects and caching. A style provided by a university department could be updated that way. There’s no reason the filename or URL on their server should need to have any relation to the id in the file.

For official styles, authoring tools and CI could verify that id matched rel="self".

(As for the URL for official styles, the benefit to continuing to use https://www.zotero.org/styles/[id] would be that 1) it already works and 2) it would continue to work at least as long as Zotero continued to exist, without needing to set up some other mirror based on id (whereas existing GitHub URLs can change as files are renamed). But it would technically be possible to set up something that used https://styles.citationstyles.org/[id], updated by styles-distribution.)

My impulse is that the rel="self" link should not be formatted around the style ID, but should be human readable based on the current file name. These are widely used by users to call styles (e.g., when creating documents with pandoc).