Extracting, analyzing, macros

Bruce_D_Arcus1 · May 10, 2023, 11:46pm

I have been thinking we really need some code that can extract, and analyze, macros.

As a tiny step towards that, I did a bit of ripgrep and sd work, and came up with this file with all the macro names in the repository; there are close to 40,000 of them!

gist.github.com

https://gist.github.com/bdarcus/25e6bc119df007737ce8837c97463544

macro-names.txt

/csl/styles/zwitscher-maschine.csl
author-short
author-short-bibliography
author
editor-translator
container-title
edition
journal-details
pages
title

(File truncated here; see the gist for the full contents.)

Suffice to say, there are a lot of common names.

As a next step, I’d like to:

1. group them by name, to find out what the unique names are, and which are most common.
2. find a way to programmatically analyze the output to find out which ones actually create unique output.
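
For the first step, a minimal sketch of the grouping, assuming the styles are available as XML strings. The helper names and the two tiny sample styles are made up for illustration; only the CSL namespace URI comes from the real schema.

```python
from collections import Counter
import xml.etree.ElementTree as ET

CSL_NS = "{http://purl.org/net/xbiblio/csl}"

def macro_names(style_xml):
    """Return the name of every <macro> element in one CSL style."""
    root = ET.fromstring(style_xml)
    return [macro.get("name") for macro in root.iter(f"{CSL_NS}macro")]

def count_macro_names(styles):
    """Count how often each macro name occurs across all given styles."""
    counts = Counter()
    for style_xml in styles:
        counts.update(macro_names(style_xml))
    return counts

# Two tiny, made-up styles (not taken from the repository).
STYLE_A = ('<style xmlns="http://purl.org/net/xbiblio/csl">'
           '<macro name="author"/><macro name="title"/></style>')
STYLE_B = ('<style xmlns="http://purl.org/net/xbiblio/csl">'
           '<macro name="author"/></style>')

counts = count_macro_names([STYLE_A, STYLE_B])
print(sorted(counts))          # the unique names
print(counts.most_common(1))   # the most common name with its count
```

In practice the strings would be read from the `.csl` files in the styles repository.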

EDIT: on 1, I managed at least to extract all of the unique names into a YAML array, and removed the other file since it was HUGE.

This one is only a bit under 1400 items.

zepinglee · May 11, 2023, 6:09am

I wrote a Python script for a similar purpose several months ago and I can share it here. The CSL styles are parsed with Python’s xml.etree.ElementTree to make further analysis easier.

gist.github.com

https://gist.github.com/zepinglee/2752c6cc9669d7f5d3f31ac1901c0c77

analyze-macros.py

from collections import Counter
import glob
import os
import xml.etree.ElementTree as ET

# from lxml import etree as ET
import yaml

CSL_STYLES_DIR = '../styles'
NSMAP = {'cs': 'http://purl.org/net/xbiblio/csl'}

(File truncated here; see the gist for the full script.)

unique-macro-names.yaml

publisher:
  total: 2255
  most-common: 379
  is-unique: false
author:
  total: 2158
  most-common: 74
  is-unique: false
title:
  total: 2017

(File truncated here; see the gist for the full output.)

It’s not trivial to determine directly from their code whether macros create the same output, and I’ve not finished this feature. First, many CSL attributes have default values (e.g., <text variable="title" form="long"> and <text variable="title"> produce the same output). However, some of them are context-dependent (e.g., <group delimiter=""> inside a <group delimiter=", "> is not the same as <group>). Second, the condition attributes of <if> and <else-if> accept multiple variables, and their order doesn’t affect the output.
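
The second point could be handled by sorting the values of the condition attributes before comparing. This is only a rough sketch, not part of the script above: `sort_condition_values` is a hypothetical helper, and the attribute list is an assumption about which CSL conditions take space-separated values.

```python
import xml.etree.ElementTree as ET

# Condition attributes assumed to accept several space-separated values.
CONDITION_ATTRS = ("type", "variable", "is-numeric",
                   "is-uncertain-date", "locator", "position")

def sort_condition_values(elem):
    """Sort the values of condition attributes on <if> and <else-if> in place."""
    for node in elem.iter():
        tag = node.tag.rsplit("}", 1)[-1]  # drop a namespace prefix, if any
        if tag in ("if", "else-if"):
            for attr in CONDITION_ATTRS:
                if attr in node.attrib:
                    node.set(attr, " ".join(sorted(node.get(attr).split())))
    return elem

a = ET.fromstring('<choose><if variable="volume issue">'
                  '<text variable="volume"/></if></choose>')
b = ET.fromstring('<choose><if variable="issue volume">'
                  '<text variable="volume"/></if></choose>')

same_before = ET.tostring(a) == ET.tostring(b)
same_after = (ET.tostring(sort_condition_values(a))
              == ET.tostring(sort_condition_values(b)))
print(same_before, same_after)
```

The context-dependent defaults from the first point are not addressed here; they would need knowledge of the enclosing elements.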

Bruce_D_Arcus1 · May 11, 2023, 8:55am

Oh, very cool!!

Yes, using a real language with a good XML parser is necessary to do it right.

It sounds like you think the next step is hard, but doable?

If yes, I wonder if there might be value in setting up a “tools” repo for this on the CSL GitHub org?

If you’re interested in that, let me know.

PS - this could be useful for a couple of the experimental ideas I’ve been thinking about or actively working on.

Bruce_D_Arcus1 · May 11, 2023, 9:31am

One of my ideas in csl-next is extracting macros into collections, that get maintained and distributed independently of styles.

If we can also convert them to the new model, that would also allow loading them into databases that support JSON values.

Which of course also leaves room for serving them from such a database, integrating them more easily into UIs, etc.

The deno runtime I’m using has such a KV database built in.

Sebastian_Karcher · May 11, 2023, 3:00pm

Yeah, I’d start by just literal matching: are they the exact same macro – since a lot of styles are derived from each other, that’s going to take you pretty far.

The next low-hanging fruit would be to strip stuff that’s always irrelevant, i.e. form="long", vertical-align="baseline", and various font-...="normal" (I think that’s about it).
Everything else is going to be quite complicated.
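
A sketch of that stripping step, for illustration. The attribute list follows the post above, with "various font-..." read as font-style, font-variant and font-weight; whether each pair is really a no-op in every context is an assumption this code does not check. `strip_noop_attrs` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Attribute/value pairs taken from the post above.
# Assumption: dropping them never changes the output.
NOOP_ATTRS = {
    ("form", "long"),
    ("vertical-align", "baseline"),
    ("font-style", "normal"),
    ("font-variant", "normal"),
    ("font-weight", "normal"),
}

def strip_noop_attrs(macro_xml):
    """Drop the no-op attributes, then return a canonical string for comparison."""
    root = ET.fromstring(macro_xml)
    for node in root.iter():
        for name, value in list(node.attrib.items()):
            if (name, value) in NOOP_ATTRS:
                del node.attrib[name]
    return ET.canonicalize(ET.tostring(root, encoding="unicode"), strip_text=True)

explicit = ('<macro name="title">'
            '<text variable="title" form="long" font-style="normal"/></macro>')
implicit = '<macro name="title"><text variable="title"/></macro>'

same = strip_noop_attrs(explicit) == strip_noop_attrs(implicit)
print(same)
```

Literal matching is then just string equality on the returned values.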

Sebastian_Karcher · May 11, 2023, 3:03pm

I agree that is elegant, but I’m worried that we’re developing past actual user needs. What’s the exact user story here? I’m just not convinced that we have a ton of use cases where this would help: most changes people need to make are – thanks to the style matching by the visual editor – quite small, and for folks who struggle with making those styles, swapping around macros is also going to be a challenge.

Bruce_D_Arcus1 · May 11, 2023, 3:43pm

Ultimately, that users don’t need to edit or create styles directly in the language at all.

As in, it makes it easier to create and use UIs for this.

For sure there are details to sort out with the idea, but I think forcing duplication across styles probably is too high a cost to pay for any benefit, particularly if we do adopt a new model?

Consider that the best, most widely-used, CSL 1.0 styles are mostly macro definitions.

This, in any case, is the simplest change I’m making, and easy-to-reverse if it is a bad idea (it’s just a few lines of code in the model definition, and independent of the rest of it), or add to CSL 1.0 if it’s a good idea.

In the new model, this is the difference between in-style and external templates:

---
title: Template File
templates:
  author-long-apa:
  ...

---
title: Style File
templates:
  author-long-apa:
  ...

… which means on the development end it’s trivial to collect the templates across multiple contexts, including (per my point above) serving them from a database.
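
Collecting templates across files could look roughly like this. It is a minimal sketch, assuming the YAML documents have already been loaded into Python dicts; `collect_templates` is a hypothetical helper, and the template bodies are placeholders because the post elides them.

```python
def collect_templates(documents):
    """Group template definitions by name across style and template files."""
    collected = {}
    for doc in documents:
        for name, body in doc.get("templates", {}).items():
            collected.setdefault(name, []).append((doc["title"], body))
    return collected

# Stand-ins for the two YAML documents above; the "..." bodies are placeholders.
template_file = {"title": "Template File", "templates": {"author-long-apa": "..."}}
style_file = {"title": "Style File", "templates": {"author-long-apa": "..."}}

collected = collect_templates([template_file, style_file])
sources = [title for title, _ in collected["author-long-apa"]]
print(sources)
```

Because both kinds of file keep templates under the same key, the same loop works for either source.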

But again: the development aspects should also enable user-facing innovations; they’re not at odds.

Bruce_D_Arcus1 · May 11, 2023, 10:38pm

@zepinglee forgive this probably dumb question, but what does the “most-common” property indicate?

PS - I decided to look at some of the very common “author” macros. Seems they’re there pretty much only to configure author substitution. The default substitutions are also extremely common across all the styles.

Hence, I decided to do this in the new model, with the following as the default value:

substitution:
  author: ["editor", "translator", "title"]

zepinglee · May 12, 2023, 4:23am

Yes. I’ll try to implement it this weekend.

zepinglee · May 12, 2023, 4:27am

It’s the count of the most common macro pattern among the macros with the same name. The patterns are compared after ElementTree.canonicalize().
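
For illustration, here is roughly how such a count can be computed. This is a sketch, not the actual script: `macro_stats` and the sample macros are made up, and "total" is taken to mean the number of macros sharing a name.

```python
from collections import Counter, defaultdict
import xml.etree.ElementTree as ET

def macro_stats(macros):
    """Per macro name: how many macros there are, and how many share the top pattern."""
    patterns = defaultdict(Counter)
    for macro_xml in macros:
        name = ET.fromstring(macro_xml).get("name")
        # Canonical form, so indentation differences do not count as different.
        canonical = ET.canonicalize(macro_xml, strip_text=True)
        patterns[name][canonical] += 1
    return {
        name: {
            "total": sum(counter.values()),
            "most-common": counter.most_common(1)[0][1],
        }
        for name, counter in patterns.items()
    }

# Made-up sample: the first two differ only in whitespace.
SAMPLE = [
    '<macro name="title"><text variable="title"/></macro>',
    '<macro name="title">\n  <text variable="title"/>\n</macro>',
    '<macro name="title"><text variable="title" font-style="italic"/></macro>',
]

stats = macro_stats(SAMPLE)
print(stats)
```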

Bruce_D_Arcus1 · May 12, 2023, 7:42am

And then “total”?

On @Sebastian_Karcher’s suggestion, it might be enough to get the list of all child element names, and assume if those are the same, they are effectively the same macro?

zepinglee · May 12, 2023, 8:31am

I’m afraid not. For example, there are 2255 <macro name="publisher">s in total. 379 of them are in this form (most common).

  <macro name="publisher">
    <group delimiter=": ">
      <text variable="publisher-place"/>
      <text variable="publisher"/>
    </group>
  </macro>

There are also 231 publisher macros in the following form, with a different delimiter. Both forms have the same child element names, so comparing only element names would likely treat them as the same macro.

  <macro name="publisher">
    <group delimiter=", ">
      <text variable="publisher"/>
      <text variable="publisher-place"/>
    </group>
  </macro>
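
A quick check of this with the two forms above, as a sketch. `tag_signature` is a hypothetical helper that keeps only the element names.

```python
import xml.etree.ElementTree as ET

def tag_signature(macro_xml):
    """Element names in document order, ignoring all attributes."""
    return tuple(node.tag for node in ET.fromstring(macro_xml).iter())

COLON_FORM = """
<macro name="publisher">
  <group delimiter=": ">
    <text variable="publisher-place"/>
    <text variable="publisher"/>
  </group>
</macro>
"""

COMMA_FORM = """
<macro name="publisher">
  <group delimiter=", ">
    <text variable="publisher"/>
    <text variable="publisher-place"/>
  </group>
</macro>
"""

# Same element names, so a tag-only comparison cannot tell them apart.
same_tags = tag_signature(COLON_FORM) == tag_signature(COMMA_FORM)

# The canonical forms differ in the delimiter and in the variable order.
same_macro = (ET.canonicalize(COLON_FORM, strip_text=True)
              == ET.canonicalize(COMMA_FORM, strip_text=True))

print(same_tags, same_macro)
```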

Bruce_D_Arcus1 · May 12, 2023, 9:18am

Right; not sure what I was thinking!

Bruce_D_Arcus1 · May 24, 2023, 11:55pm

I also used yq to convert apa.csl to YAML, since it’s easier to visualize.

Rough line counts for each portion:

locale terms: 400
macros: 1900
citation AND bibliography: 100!
