Extracting, analyzing, macros

I have been thinking we really need some code that can extract, and analyze, macros.

As a tiny step towards that, I did a bit of ripgrep and sd work, and came up with this file with all the macro names in the repository; there are close to 40,000 of them!

Suffice it to say, there are a lot of common names.

As a next step, I’d like to:

  1. group them by name, to find out what the unique names are, and which are most common.
  2. find a way to programmatically analyze the output to find out which ones actually create unique output.
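
A minimal sketch of step 1 in Python, in case it helps. It assumes the styles are available as XML strings in the standard CSL namespace; `macro_names`, `count_names` and the two tiny sample styles are made up for illustration and are not the ripgrep/sd pipeline mentioned above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace used by CSL style files, in ElementTree's "{uri}tag" notation.
CSL_NS = "{http://purl.org/net/xbiblio/csl}"


def macro_names(style_xml: str) -> list[str]:
    """Return the name of every <macro> element in one CSL style."""
    root = ET.fromstring(style_xml)
    return [m.get("name", "") for m in root.iter(f"{CSL_NS}macro")]


def count_names(styles: list[str]) -> Counter:
    """Count how often each macro name occurs across all given styles."""
    counts = Counter()
    for xml_text in styles:
        counts.update(macro_names(xml_text))
    return counts


# Two made-up styles, only to show the shape of the result.
STYLE_A = """<style xmlns="http://purl.org/net/xbiblio/csl">
  <macro name="author"/><macro name="publisher"/>
</style>"""
STYLE_B = """<style xmlns="http://purl.org/net/xbiblio/csl">
  <macro name="author"/>
</style>"""

counts = count_names([STYLE_A, STYLE_B])
unique_names = sorted(counts)          # the unique macro names
most_common = counts.most_common(1)    # the most frequent name and its count
```

For the real repository, the list of strings would come from reading each `.csl` file; `counts.most_common()` then gives the names ordered by frequency.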

EDIT: on 1, I managed at least to extract all of the unique names into a YAML array, and removed the other file since it was HUGE.

This one is only a bit under 1400 items.

I wrote a Python script for a similar purpose several months ago and I can share it here. The CSL styles are parsed with Python’s xml.etree.ElementTree to make further analysis easier.

It’s not trivial to determine directly from their code whether macros create the same output, and I’ve not finished this feature. First, many CSL attributes have default values (e.g., <text variable="title" form="long"> and <text variable="title"> produce the same output). However, some of them are context-dependent (e.g., <group delimiter=""> inside a <group delimiter=", "> is not the same as <group>). Second, the condition attributes of <if> and <else-if> accept multiple variables, and their order doesn’t affect the output.
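
To illustrate the second point, here is a rough sketch of one possible normalization step: sorting the space-separated values of condition attributes so that order no longer matters when comparing. This is not the script mentioned above; `LIST_CONDITIONS` and `sort_condition_values` are hypothetical names, and the attribute list is my reading of the CSL 1.0 condition attributes.

```python
import xml.etree.ElementTree as ET

# Condition attributes that take a space-separated list of values.
# Assumption: this tuple follows the CSL 1.0 condition attributes.
LIST_CONDITIONS = (
    "type", "variable", "is-numeric",
    "is-uncertain-date", "locator", "position",
)


def sort_condition_values(root: ET.Element) -> None:
    """Sort list-valued condition attributes on <if>/<else-if> in place."""
    for el in root.iter():
        tag = el.tag.rsplit("}", 1)[-1]  # drop a "{namespace}" prefix if present
        if tag not in ("if", "else-if"):
            continue
        for attr in LIST_CONDITIONS:
            if attr in el.attrib:
                el.set(attr, " ".join(sorted(el.get(attr).split())))


# Same condition, variables listed in a different order.
a = ET.fromstring('<choose><if variable="editor author" match="any"/></choose>')
b = ET.fromstring('<choose><if variable="author editor" match="any"/></choose>')
sort_condition_values(a)
sort_condition_values(b)
same = ET.tostring(a) == ET.tostring(b)
```

The context-dependent defaults from the first point are not handled here; they would need knowledge of the enclosing elements.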

Oh, very cool!!

Yes, using a real language with good XML parser is necessary to do it right.

It sounds like you think the next step is hard, but doable?

If yes, I wonder if there might be value in setting up a “tools” repo for this on the CSL GitHub org?

If you’re interested in that, let me know.

PS - this could be useful for a couple of the experimental ideas I’ve been thinking about or actively working on.

One of my ideas in csl-next is extracting macros into collections, that get maintained and distributed independently of styles.

If we can also convert them to the new model, that would also allow loading them into databases that support JSON values.

Which of course also leaves room for serving them from such a database, integrating them more easily into UIs, etc.

The deno runtime I’m using has such a KV database built-in 🙂

Yeah, I’d start by just literal matching: are they the exact same macro – since a lot of styles are derived from each other, that’s going to take you pretty far.

The next low-hanging fruit would be to strip stuff that’s always irrelevant, i.e. form="long", vertical-align="baseline", and various font-...="normal" (I think that’s about it).
Everything else is going to be quite complicated.
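
A hedged sketch of those two steps together: drop the attribute/value pairs listed above, then compare the canonical XML. `NOOP_ATTRS` and `normalized` are made-up names, and treating every pair as safe to remove in every context is an assumption taken from the list above, not something the code verifies.

```python
import xml.etree.ElementTree as ET

# Attribute/value pairs treated as no-ops, taken from the list above.
# Assumption: each pair is safe to drop wherever it appears.
NOOP_ATTRS = {
    ("form", "long"),
    ("vertical-align", "baseline"),
    ("font-style", "normal"),
    ("font-variant", "normal"),
    ("font-weight", "normal"),
}


def normalized(macro_xml: str) -> str:
    """Strip no-op attributes and return a canonical string for comparison."""
    root = ET.fromstring(macro_xml)
    for el in root.iter():
        for name, value in list(el.attrib.items()):
            if (name, value) in NOOP_ATTRS:
                del el.attrib[name]
    # canonicalize() (Python 3.8+) sorts attributes; strip_text=True
    # removes the whitespace that only comes from indentation.
    return ET.canonicalize(ET.tostring(root, encoding="unicode"), strip_text=True)


a = '<macro name="title"><text variable="title" form="long"/></macro>'
b = '<macro name="title">\n  <text variable="title"/>\n</macro>'
c = '<macro name="title"><text variable="title" form="short"/></macro>'

a_equals_b = normalized(a) == normalized(b)  # differ only by form="long" and whitespace
a_equals_c = normalized(a) == normalized(c)  # form="short" is kept, so these differ
```

Skipping the attribute loop gives plain literal matching, which is the first step suggested above.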

I agree that is elegant, but I’m worried that we’re developing past actual user needs. What’s the exact user story here? I’m just not convinced that we have a ton of use cases where this would help: most changes people need to make are – thanks to the style matching by the visual editor – quite small, and for folks who struggle with making those style changes, swapping around macros is also going to be a challenge.

Ultimately, the user story is that users don’t need to edit or create styles directly in the language at all.

As in, it makes it easier to create and use UIs for this.

For sure there are details to sort out with the idea, but I think forcing duplication across styles probably is too high a cost to pay for any benefit, particularly if we do adopt a new model?

Consider that the best, most widely-used, CSL 1.0 styles are mostly macro definitions.

This, in any case, is the simplest change I’m making, and easy-to-reverse if it is a bad idea (it’s just a few lines of code in the model definition, and independent of the rest of it), or add to CSL 1.0 if it’s a good idea.

In the new model, in-style and external templates are written the same way:

---
title: Template File
templates:
  author-long-apa:
  ...
---
title: Style File
templates:
  author-long-apa:
  ...

… which means on the development end it’s trivial to collect the templates across multiple contexts, including (per my point above) serving them from a database.

But again: the development aspects should also enable user-facing innovations; they’re not at odds.

@zepinglee forgive this probably dumb question, but what does the “most-common” property indicate?

PS - I decided to look at some of the very common “author” macros. Seems they’re there pretty much only to configure author substitution. The default substitutions are also extremely common across all the styles.

Hence, I decided to make this the default value in the new model:

substitution:
  author: ["editor", "translator", "title"]

Yes. I’ll try to implement it this weekend.

It’s the count of the most common macro pattern among the macros with the same name. The patterns are compared after ElementTree.canonicalize().
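
As a rough illustration of that description (a sketch of my reading of it, not the actual script): group macros by name, canonicalize each one, and report how many macros share the name alongside the count of the most frequent pattern. `summarize` and the sample macros are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def summarize(macros: list[str]) -> dict[str, tuple[int, int]]:
    """Map each macro name to (macros with that name, count of the top pattern)."""
    patterns: dict[str, Counter] = defaultdict(Counter)
    for xml_text in macros:
        name = ET.fromstring(xml_text).get("name", "")
        # Canonical form (Python 3.8+), ignoring indentation-only whitespace.
        canonical = ET.canonicalize(xml_text, strip_text=True)
        patterns[name][canonical] += 1
    return {
        name: (sum(c.values()), c.most_common(1)[0][1])
        for name, c in patterns.items()
    }


# Made-up macros: the first two differ only in whitespace.
SAMPLE = [
    '<macro name="publisher"><text variable="publisher"/></macro>',
    '<macro name="publisher">\n  <text variable="publisher"/>\n</macro>',
    '<macro name="publisher"><text variable="publisher-place"/></macro>',
]
summary = summarize(SAMPLE)
```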

And then “total”?

On @Sebastian_Karcher’s suggestion, it might be enough to get the list of all child element names, and assume if those are the same, they are effectively the same macro?

I’m afraid not. For example, there are 2255 <macro name="publisher">s in total. 379 of them are in this form (the most common one).

  <macro name="publisher">
    <group delimiter=": ">
      <text variable="publisher-place"/>
      <text variable="publisher"/>
    </group>
  </macro>

There are also 231 publishers in the following form, with a different delimiter and the two children swapped. Both forms have the same child element names, so that heuristic would treat them as the same macro even though their output differs.

  <macro name="publisher">
    <group delimiter=", ">
      <text variable="publisher"/>
      <text variable="publisher-place"/>
    </group>
  </macro>
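
A small check of this point, as a sketch: the two forms above yield the same list of element names but different canonical XML. `tag_names`, `FORM_A` and `FORM_B` are names chosen for illustration only.

```python
import xml.etree.ElementTree as ET

FORM_A = """<macro name="publisher">
  <group delimiter=": ">
    <text variable="publisher-place"/>
    <text variable="publisher"/>
  </group>
</macro>"""

FORM_B = """<macro name="publisher">
  <group delimiter=", ">
    <text variable="publisher"/>
    <text variable="publisher-place"/>
  </group>
</macro>"""


def tag_names(xml_text: str) -> list[str]:
    """Element names in document order, ignoring attributes."""
    return [el.tag for el in ET.fromstring(xml_text).iter()]


# Comparing element names alone cannot tell the two forms apart...
same_tags = tag_names(FORM_A) == tag_names(FORM_B)

# ...while the canonical XML (Python 3.8+) keeps attributes and order.
same_canonical = (
    ET.canonicalize(FORM_A, strip_text=True)
    == ET.canonicalize(FORM_B, strip_text=True)
)
```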

Right; not sure what I was thinking!

I also used yq to convert apa.csl to YAML, since it’s easier to visualize.

Rough lines for each portion:

  • locale terms: 400
  • macros: 1900
  • citation AND bibliography: 100!