I have been thinking we really need some code that can extract, and analyze, macros.
As a tiny step towards that, I did a bit of ripgrep and sd work, and came up with this file with all the macro names in the repository; there are close to 40,000 of them!
Suffice it to say, there are a lot of common names.
As a next step, I’d like to:
1. group them by name, to find out what the unique names are, and which are most common.
2. find a way to programmatically analyze the output to find out which ones actually create unique output.
EDIT: on 1, I managed at least to extract all of the unique names into a YAML array, and removed the other file since it was HUGE.
I wrote a Python script for a similar purpose several months ago and I can share it here. The CSL styles are parsed with Python’s xml.etree.ElementTree to make further analysis easier.
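For anyone following along, here is a minimal sketch of that kind of parsing (not the script mentioned above). It assumes each style is available as a string of CSL XML; `macro_names`, `count_macro_names` and the two sample styles are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# CSL elements live in this XML namespace.
CSL_NS = "{http://purl.org/net/xbiblio/csl}"


def macro_names(csl_text):
    """Return the name attribute of every <macro> in one CSL style."""
    root = ET.fromstring(csl_text)
    return [macro.get("name") for macro in root.iter(CSL_NS + "macro")]


def count_macro_names(styles):
    """Count how often each macro name occurs across several styles."""
    counts = Counter()
    for text in styles:
        counts.update(macro_names(text))
    return counts


# Hypothetical sample styles, reduced to the macro definitions.
SAMPLE_A = """<style xmlns="http://purl.org/net/xbiblio/csl" version="1.0">
  <macro name="author"><names variable="author"/></macro>
  <macro name="title"><text variable="title"/></macro>
</style>"""
SAMPLE_B = """<style xmlns="http://purl.org/net/xbiblio/csl" version="1.0">
  <macro name="author"><names variable="author"/></macro>
</style>"""

counts = count_macro_names([SAMPLE_A, SAMPLE_B])
print(counts.most_common())  # → [('author', 2), ('title', 1)]
```

Reading the files from a checkout of the styles repository and passing their contents to `count_macro_names` would give the "unique names, most common first" list.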
It’s not trivial to determine whether macros produce the same output directly from their code, and I haven’t finished this feature. First, many CSL attributes have default values (e.g., <text variable="title" form="long"> and <text variable="title"> produce the same output). However, some of them are context-dependent (e.g., <group delimiter=""> inside a <group delimiter=", "> is not the same as <group>). Second, the condition attributes of <if> and <else-if> accept multiple variables, and their order doesn’t affect the output.
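The second point (order inside condition attributes) is the easier one to normalize. A rough sketch, assuming it is enough to sort the space-separated values before comparing; the helper name and the `CONDITION_ATTRS` selection are illustrative choices, not taken from the script mentioned above:

```python
import xml.etree.ElementTree as ET

CSL_NS = "{http://purl.org/net/xbiblio/csl}"

# Condition attributes that take a space-separated list of values.
# (Assumption: this selection is enough for a first pass.)
CONDITION_ATTRS = (
    "type", "variable", "is-numeric",
    "is-uncertain-date", "locator", "position",
)


def sort_condition_values(root):
    """Sort list-valued condition attributes on <if>/<else-if> in place."""
    for elem in root.iter():
        if elem.tag in (CSL_NS + "if", CSL_NS + "else-if"):
            for attr in CONDITION_ATTRS:
                if attr in elem.attrib:
                    values = sorted(elem.get(attr).split())
                    elem.set(attr, " ".join(values))
    return root


a = ET.fromstring(
    '<if xmlns="http://purl.org/net/xbiblio/csl" variable="volume issue"/>'
)
b = ET.fromstring(
    '<if xmlns="http://purl.org/net/xbiblio/csl" variable="issue volume"/>'
)
same = ET.tostring(sort_condition_values(a)) == ET.tostring(sort_condition_values(b))
print(same)  # → True
```

The context-dependent defaults (such as the nested `delimiter` case) are not handled here and would need knowledge of the parent element.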
Yeah, I’d start by just literal matching: are they the exact same macro – since a lot of styles are derived from each other, that’s going to take you pretty far.
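A sketch of what literal matching could look like, assuming styles are given as a dict of (hypothetical) style id to CSL text. It uses `ET.canonicalize` (Python 3.8+) so that indentation and attribute order don't count as differences; whether the macro name should be part of the comparison is a judgment call, and here it is.

```python
import hashlib
import xml.etree.ElementTree as ET
from collections import defaultdict

CSL_NS = "{http://purl.org/net/xbiblio/csl}"


def fingerprint(macro):
    """Hash a macro element after C14N canonicalization."""
    raw = ET.tostring(macro, encoding="unicode").strip()
    # strip_text=True drops whitespace-only text such as indentation;
    # canonicalization also puts attributes into a fixed order.
    canonical = ET.canonicalize(raw, strip_text=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def group_identical(styles):
    """Map fingerprint -> list of (style id, macro name) sharing it."""
    groups = defaultdict(list)
    for style_id, text in styles.items():
        for macro in ET.fromstring(text).iter(CSL_NS + "macro"):
            groups[fingerprint(macro)].append((style_id, macro.get("name")))
    return groups


# Two made-up styles whose macro differs only in layout and attribute order.
STYLE_A = (
    '<style xmlns="http://purl.org/net/xbiblio/csl">'
    '<macro name="title"><text variable="title" font-style="italic"/></macro>'
    "</style>"
)
STYLE_B = """<style xmlns="http://purl.org/net/xbiblio/csl">
  <macro name="title">
    <text font-style="italic" variable="title"/>
  </macro>
</style>"""

groups = group_identical({"a": STYLE_A, "b": STYLE_B})
for members in groups.values():
    print(members)
```

Groups with more than one member are the macros that are copied verbatim between styles.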
The next low-hanging fruit would be to strip stuff that’s always irrelevant, i.e. form="long", vertical-align="baseline", and various font-...="normal" (I think that’s about it).
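As a sketch, that stripping step could look like the following. The attribute list is just the one named above (with the `font-...` ones spelled out as `font-style`, `font-variant` and `font-weight`, which is an assumption about what was meant); the function name is made up.

```python
import xml.etree.ElementTree as ET

# Attribute values assumed never to change the output (per the list above).
NOOP_ATTRS = {
    "form": "long",
    "vertical-align": "baseline",
    "font-style": "normal",
    "font-variant": "normal",
    "font-weight": "normal",
}


def strip_noop_attrs(root):
    """Remove attributes whose value is listed in NOOP_ATTRS, in place."""
    for elem in root.iter():
        for attr, noop_value in NOOP_ATTRS.items():
            if elem.get(attr) == noop_value:
                del elem.attrib[attr]
    return root


long_form = ET.fromstring('<text variable="title" form="long" font-style="normal"/>')
short_form = ET.fromstring('<text variable="title" form="short"/>')

print(strip_noop_attrs(long_form).attrib)   # → {'variable': 'title'}
print(strip_noop_attrs(short_form).attrib)  # → {'variable': 'title', 'form': 'short'}
```

Running this before any literal comparison should merge macros that differ only in these explicit defaults.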
Everything else is going to be quite complicated.
I agree that is elegant, but I’m worried that we’re developing past actual user needs. What’s the exact user story here? I’m just not convinced that we have a ton of use cases where this would help: most changes people need to make are – thanks to the style matching by the visual editor – quite small, and for folks who struggle with making those style changes, swapping around macros is also going to be a challenge.
For sure there are details to sort out with the idea, but I think forcing duplication across styles probably is too high a cost to pay for any benefit, particularly if we do adopt a new model?
Consider that the best, most widely-used, CSL 1.0 styles are mostly macro definitions.
This, in any case, is the simplest change I’m making, and easy-to-reverse if it is a bad idea (it’s just a few lines of code in the model definition, and independent of the rest of it), or add to CSL 1.0 if it’s a good idea.
In the new model, the difference between in-style and external templates.
… which means on the development end it’s trivial to collect the templates across multiple contexts, including (per my point above) serving them from a database.
But again: the development aspects should also enable user-facing innovations; they’re not at odds.
@zepinglee forgive this probably dumb question, but what does the “most-common” property indicate?
PS - I decided to look at some of the very common “author” macros. Seems they’re there pretty much only to configure author substitution. The default substitutions are also extremely common across all the styles.
Hence, I decided to do this in the new model, which is the default value.
On @Sebastian_Karcher’s suggestion, it might be enough to get the list of all child element names, and assume if those are the same, they are effectively the same macro?
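A quick sketch of that heuristic, assuming "child element names" means all descendant element names in document order (direct children only would be an even coarser variant). The function names and the two sample macros are made up.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree puts on tags, if any."""
    return tag.rsplit("}", 1)[-1]


def signature(macro):
    """Tuple of descendant element names, ignoring all attributes."""
    return tuple(local_name(e.tag) for e in macro.iter() if e is not macro)


# Two hypothetical macros that differ only in the group delimiter.
colon = ET.fromstring(
    '<macro name="publisher"><group delimiter=": ">'
    '<text variable="publisher-place"/><text variable="publisher"/>'
    "</group></macro>"
)
comma = ET.fromstring(
    '<macro name="publisher"><group delimiter=", ">'
    '<text variable="publisher-place"/><text variable="publisher"/>'
    "</group></macro>"
)

print(signature(colon))                      # → ('group', 'text', 'text')
print(signature(colon) == signature(comma))  # → True
```

Because attributes are ignored, macros that differ only in a delimiter or a variable name get the same signature, so this is better suited to finding candidates than to proving two macros are equivalent.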
There are also 231 publishers in the following form with a different delimiter. Both forms have the same child element names, but they are likely to be treated as the same macros.