proposed changes to CSL to permit AGU-style reference sorting

Hi,

I have been working a bit with the Zotero project to improve their AGU
CSL style so that it matches the AGU style found here:

http://www.agu.org/pubs/AuthorRefSheet.pdf

For those of you who are not familiar, the American Geophysical Union is
a large scientific organization of physicists, materials scientists,
geologists, oceanographers, etc. It publishes roughly a half-dozen to a
dozen journals. Recent annual meetings have had more than 10,000
participants. Furthermore, I am fairly certain similar styles are used
by a number of other journals. Therefore, I think that CSL should be
able to support their format.

I believe that the current CSL implementation cannot support some
elements of the AGU style and therefore would like to propose some
modifications to CSL. AGU uses what on the surface looks like an
unusual ordering of references in the bibliography:

  1. References are sorted globally by first author.
  2. Among references that have the same first author, references with
    just one author come first, ordered by year.
  3. Then come references with two authors ordered by second author and
    then year.
  4. After this, come references with 3 or more authors. These are sorted
    by just the year of publication, ignoring all but the first author for
    sorting.

This sorting appears strange at first, but it makes some sense as
citations for articles with more than 2 authors appear in the main text
as, e.g., [Kaplan et al., 2006]. The only info a reader has to locate the
article is the last name, the fact that it has more than 2 authors, and
the year. Therefore, it is useful to group publications with several
authors (same first author) together and order them by year.
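
As a rough illustration of rules 1 to 4 above, here is a minimal Python sketch that expresses the ordering as a single tuple key. The `Ref` type, the `agu_sort_key` helper, and the sample names are made up for this example; nothing here is CSL syntax or an existing implementation.

```python
from typing import NamedTuple, Tuple


class Ref(NamedTuple):
    authors: Tuple[str, ...]  # family names, in publication order
    year: int


def agu_sort_key(ref: Ref) -> Tuple[str, int, str, int]:
    """Hypothetical key for the four rules; assumes at least one author."""
    first = ref.authors[0].lower()
    # Rules 2-4: group by 1, 2, or "3 or more" authors.
    group = min(len(ref.authors), 3)
    # Only two-author references are ordered by their second author.
    second = ref.authors[1].lower() if group == 2 else ""
    return (first, group, second, ref.year)


refs = [
    Ref(("Kaplan", "Smith", "Jones"), 2006),
    Ref(("Kaplan", "Young"), 2001),
    Ref(("Kaplan",), 2005),
    Ref(("Kaplan", "Adams", "Zhou", "Lee"), 2003),
    Ref(("Kaplan", "Brown"), 2004),
    Ref(("Adams", "Kaplan"), 2007),
]

ordered = sorted(refs, key=agu_sort_key)
for ref in ordered:
    print(", ".join(ref.authors), ref.year)
```

Because every reference yields a key of the same shape, the references can be compared field by field without any per-reference change of rules.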

Current sorting options don’t really support this as there is no way to
specify a sorting that depends on the number of authors. We have come
up with two basic ways that this could be supported, one based on adding
variables to test for the number of authors, the other based on adding
options for this style. I personally believe that the first option is a
much more robust and general approach, but will present both here.

The first option consists of adding a variable “author-count” that can
be tested for in clauses and adding a way to access just one
particular author from a list of authors, selecting by number. I am not
an XML expert, so the code below is just meant to give an idea of how
this might work, not to be syntactically correct:

3

where the macro “first-author” (and similarly for second author) might
be something like:

...

The other approach to solving the problem would be with additional
options. For example, one might add:

<option name="sort-et-al-min" value="3"/>
<option name="sort-et-al-use-first" value="1"/>

While this solution appears more expedient, I think the other approach
is more robust because:

  1. I can imagine and believe there exist slight modifications of this
    format that you will not be able to support without additional options.
  2. Adding author-count and the ability to get at particular authors
    would allow you to completely remove several options in favor of macros,
    which are more flexible (e.g., options “et-al-min” and “et-al-use-first”
    could be coded through macros and choose statements (though having “if
    greater than” and “if less than” tests would be very useful, as some
    bibliographic styles only use et al. after a large number of authors;
    e.g., AGU style uses 10)).
  3. As option names become more specific, they also become quite obscure
    (related also to the AGU format, we had a discussion about what
    “name-as-sort-order” really means that was quite confusing and is still
    not completely resolved in my mind). Building “options” from macros is
    much easier to understand as it builds complex ideas from simpler ones.
  4. Additional options of all types are almost certainly going to creep
    into the format over time as more and more exceptions to the basic
    formats are found. Trying to keep these as limited as possible to just
    those things which cannot be implemented nicely through other approaches
    seems like a good thing.

Cheers,
David–


David M. Kaplan
Charge de Recherche 1
Institut de Recherche pour le Developpement
Centre de Recherche Halieutique Mediterraneenne et Tropicale
av. Jean Monnet
B.P. 171
34203 Sete cedex
France

Phone: +33 (0)4 99 57 32 27
Fax: +33 (0)4 99 57 32 95
http://www.ur097.ird.fr/team/dkaplan/index.html


Hi David,

Thanks for the thorough explanation.

Everyone: please read the proposal and tell us what you think. If we
make this change, I think people need to implement it, or we’ll have
problems.

My quick response …

> Building “options” from macros is
> much easier to understand as it builds complex ideas from simpler ones.

True. CSL started without macros, and the current schema reflects some
of that legacy. With ways to condition formatting based on
author-count, the et-al options wouldn’t be needed.

One thing, though: we’d probably need to specify exactly what
“author-count” means. Does it include any primary creators? If we have
an edited book with one author and two editors (an issue that has come
up), is the count 1, or is it 3?

Bruce

Hi Bruce,

Regarding what author-count means, you bring up a good point and the
best thing would be to put as much flexibility as possible in now. In
some cases, I think author-count will depend on context. For example,
in AGU style, author-count will be whatever authors appear at the
beginning of the bibliographic reference. This might be primary
authors, might be editors, depending on the type of contribution. I
would suggest somehow developing a way to get at author-count by type of
author. There should be an author-count for primary authors, an
author-count for editors, etc., and an author-count summing all of the
above. We could have a general mechanism at getting at each of these
and then create macros that would produce the correct result depending
on context (using substitute and if’s just like in and
). As I am not an XML expert, can someone else suggest a
good way to code this?
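
To make the question concrete, here is a small Python sketch of the counts discussed above: a count per contributor role, a count summed over several roles, and a context-dependent count that falls back from authors to editors. The dictionary layout and the function names are assumptions made for illustration only.

```python
from typing import Dict, List, Sequence

Reference = Dict[str, List[str]]  # role name -> list of family names


def name_count(ref: Reference, roles: Sequence[str]) -> int:
    """Total number of names across the given roles."""
    return sum(len(ref.get(role, [])) for role in roles)


def leading_count(ref: Reference, order: Sequence[str] = ("author", "editor")) -> int:
    """Count of whichever role would lead the entry (first non-empty role)."""
    for role in order:
        names = ref.get(role, [])
        if names:
            return len(names)
    return 0


# Bruce's case: an edited book with one author and two editors.
edited_book = {"author": ["Doe"], "editor": ["Smith", "Jones"]}
# A volume with editors only.
edited_volume = {"editor": ["Smith", "Jones"]}

print(name_count(edited_book, ["author"]))            # authors only
print(name_count(edited_book, ["author", "editor"]))  # summed over both roles
print(leading_count(edited_book))                     # authors lead the entry
print(leading_count(edited_volume))                   # editors substitute
```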

Cheers,
David–




I’m wondering if a simple approach that builds off existing structures
might not be best; something like:

<cs:if variable-count="author editor">…</cs:if>

I don’t think I have that precisely right, but you get the idea.

Simon? Johan? Ron? Liam?

Need feedback, and a commitment to implement the solution. Otherwise,
there’s not much point in adding it.

Bruce

Sorry. A bit of a busy day. I’ll take a look at it tomorrow. I’ve only
skimmed this discussion, so I am not up to speed as to what exactly is
needed and how this would and/or could be implemented.

Johan

Hello,

I’ve at last taken a look at this proposal. I think that this might
indeed be a good addition to CSL, one that lets many other
oddities be dealt with properly, not just AGU. There is, though, a
big problem with it that we need to think about carefully.

I like the proposal that adds a variable to each reference best. Maybe
we should even extend this beyond just author-count. It would
just mean that during processing we gather this information for each
reference, and then it can be used in macros for further customization.
Besides author-count we could perhaps have a count for all other types
of contributors: editor-count, translator-count, etc. Another good
thing to put in a variable might be the number of times the reference is
cited in the text, the order-number of the first citation, etc.

However, and I think that herein lies the biggest problem, the
proposed solution changes the sort keys based on the reference being
sorted. But how do I choose which of two references goes
first if their sort-keys are different? There is nothing that ensures
that the results of both comparisons are the same.

An example:

Ref A: John Doe, 1995, Title A
Ref B: Mike Jameson, Charles Dickens, 1990, Title B

sort keys ref A:
by 1st author

sort keys ref B:
by year

which comes from this CSL, which would be valid if we go by this
proposal:

....

So which comes first? Ref A or ref B? It is impossible to tell. The
sort-keys should not change based on one single reference, but on each
pair of references being compared. I cannot write a sorting method
that adheres to two different sorting-rules at the same time (with
exception for rules that have the same output, but that is just mere
coincidence in those cases).
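
The ambiguity in this example can be reproduced in a few lines of Python. This is only a sketch of the problem as described: the `rule_for` and `comes_first` helpers are hypothetical, and the rule "one author sorts by first author, otherwise by year" is taken from the example above.

```python
ref_a = {"authors": ["Doe"], "year": 1995}
ref_b = {"authors": ["Jameson", "Dickens"], "year": 1990}


def rule_for(ref):
    """Per-reference rule: one author -> by first author, otherwise by year."""
    return "first-author" if len(ref["authors"]) == 1 else "year"


def comes_first(x, y, rule):
    """True if x sorts before y under the given rule."""
    if rule == "first-author":
        return x["authors"][0] < y["authors"][0]
    return x["year"] < y["year"]


# Ask "does A come before B?" twice, once with each reference's own rule.
a_first_by_a_rule = comes_first(ref_a, ref_b, rule_for(ref_a))  # compares names
a_first_by_b_rule = comes_first(ref_a, ref_b, rule_for(ref_b))  # compares years

print(a_first_by_a_rule, a_first_by_b_rule)
```

The two answers disagree, so a sort routine has no consistent way to place the pair.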

The CSL should say something like: if the author-count of one of the
references is 1, this reference goes first; if both are one, then sort
by first author. Otherwise, if the author-count of one of the
references is 2, etc.

I am not sure how that can be said in a good way in CSL. Thoughts? Do
you agree with my problem with the sort-keys?

Johan

> I like the proposal that adds a variable to each reference best. Maybe
> we should even extend this beyond just author-count. It would
> just mean that during processing we gather this information for each
> reference, and then it can be used in macros for further customization.
> Besides author-count we could perhaps have a count for all other types
> of contributors: editor-count, translator-count, etc. Another good
> thing to put in a variable might be the number of times the reference is
> cited in the text, the order-number of the first citation, etc.

Yes, though I think some of that is independent of CSL. And we do now
have a variable for the citation number.

> However, and I think that herein lies the biggest problem, the
> proposed solution changes the sort keys based on the reference being
> sorted. But how do I choose which of two references goes
> first if their sort-keys are different? There is nothing that ensures
> that the results of both comparisons are the same.
>
> An example:
>
> Ref A: John Doe, 1995, Title A
> Ref B: Mike Jameson, Charles Dickens, 1990, Title B
>
> sort keys ref A:
> by 1st author
>
> sort keys ref B:
> by year
>
> which comes from this CSL, which would be valid if we go by this
> proposal:
>
> ....
>
> So which comes first? Ref A or ref B? It is impossible to tell. The
> sort-keys should not change based on one single reference, but on each
> pair of references being compared. I cannot write a sorting method
> that adheres to two different sorting-rules at the same time (with
> exception for rules that have the same output, but that is just mere
> coincidence in those cases).

I’m not entirely following you here Johan.

> The CSL should say something like: if the author-count of one of the
> references is 1, this reference goes first; if both are one, then sort
> by first author. Otherwise, if the author-count of one of the
> references is 2, etc.
>
> I am not sure how that can be said in a good way in CSL. Thoughts? Do
> you agree with my problem with the sort-keys?

I just don’t understand it ATM :wink:

Bruce

> I just don’t understand it ATM :wink:

It is not possible to sort two items when the rules on how to sort are
changed based on each item being sorted. If I have two items that need
to be sorted on different rules, it can become impossible to say which
one comes first. You can only change the sorting rules when you define
rules on every possible combination of items.

3

For the above snippet, say I have two references that I need to sort.
Which author-count do I use to determine which rules apply? Say one
reference has 1 author and the other 2. In that case I don’t know
which set of rules apply. I need to have one set of rules to compare
each combination of references.

Does this clear it up somewhat?

Johan

OK, but step back: it seems you have a problem with the conditional
within the sort. But given that macro calls are already allowed there,
is this a new problem?

Just trying to clarify before getting into details …

Bruce

> OK, but step back: it seems you have a problem with the conditional
> within the sort. But given that macro calls are already allowed there,
> is this a new problem?

Yes and no. I had actually only just started looking at the sorting
routines. It is not impossible to have a macro within the sort, but
then I would use the outcome of the macro resolved for each reference
and compare those outcomes. I had been thinking to do a better job at
sorting by not simply resolving the macros but going to look inside
them for sorting based on the macro content directly. This is for
example needed when sorting on a macro for author names, or with
dates. I had indeed been struggling somewhat with the possible
conditionals there, and this thread now makes it clear to me as to
why. Conditionals based on a single reference are not possible within
sort, and so a macro can only be used by simply using its output for
sorting. I think that this might indeed be a rather confusing thing
for people writing CSL styles.

Conditionals as are now used in macros are not possible in sort.
Furthermore, we might want to reconsider using macros as sort keys. I
think it is much better to actually say that a macro used as a sort
key, means that the text for that macro gets resolved for each
reference, which is then sorted alphabetically.

results in:

Jan, 1990
Feb, 2000
Apr, 2000

but

<macro name="month-year">




results in:

Apr, 2000
Feb, 2000
Jan, 1990
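
The difference between the two result lists can be reproduced with a short Python sketch: sorting on the underlying date values versus sorting on the text a macro would render. The `month_year` function stands in for the macro and is an assumption for illustration, not actual CSL processing.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# (year, month) pairs for three references.
items = [(2000, 4), (1990, 1), (2000, 2)]


def month_year(item):
    """Render a (year, month) pair the way a month-year macro might."""
    year, month = item
    return f"{MONTHS[month - 1]}, {year}"


# Sorting on the date values themselves gives chronological order.
by_date = sorted(items)
# Sorting on the rendered text gives alphabetical order of the strings.
by_text = sorted(items, key=month_year)

print([month_year(i) for i in by_date])
print([month_year(i) for i in by_text])
```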

Does Zotero handle macros in the sort already? How? The other
possibility is to resolve a macro correctly until a conditional is
encountered and from that point use textual comparison (basing the
sort on the outcome as text).

I hope my point is clear enough. It is not very easy to put this in
words so that it is very clear. I hope that at least my problem with
the conditionals in the sort keys is understandable at this point.

Johan

> > OK, but step back: it seems you have a problem with the conditional
> > within the sort. But given that macro calls are already allowed there,
> > is this a new problem?
>
> Yes and no. I had actually only just started looking at the sorting
> routines. It is not impossible to have a macro within the sort, but
> then I would use the outcome of the macro resolved for each reference
> and compare those outcomes. I had been thinking to do a better job at
> sorting by not simply resolving the macros but going to look inside
> them for sorting based on the macro content directly. This is for
> example needed when sorting on a macro for author names, or with
> dates. I had indeed been struggling somewhat with the possible
> conditionals there, and this thread now makes it clear to me as to
> why. Conditionals based on a single reference are not possible within
> sort, and so a macro can only be used by simply using its output for
> sorting.

Correct; if using a macro for a key, you are using the string
generated by the macro as the key. I don’t see any other reasonable
way.

> I think that this might indeed be a rather confusing thing
> for people writing CSL styles.
>
> Conditionals as are now used in macros are not possible in sort.
> Furthermore, we might want to reconsider using macros as sort keys. I
> think it is much better to actually say that a macro used as a sort
> key, means that the text for that macro gets resolved for each
> reference, which is then sorted alphabetically.
>
> results in:
>
> Jan, 1990
> Feb, 2000
> Apr, 2000
>
> but
>
> <macro name="month-year">
>
> results in:
>
> Apr, 2000
> Feb, 2000
> Jan, 1990
>
> Does Zotero handle macros in the sort already? How? The other
> possibility is to resolve a macro correctly until a conditional is
> encountered and from that point use textual comparison (basing the
> sort on the outcome as text).
>
> I hope my point is clear enough. It is not very easy to put this in
> words so that it is very clear. I hope that at least my problem with
> the conditionals in the sort keys is understandable at this point.

So does my answer above help resolve this?

Bruce

> Correct; if using a macro for a key, you are using the string
> generated by the macro as the key. I don’t see any other reasonable
> way.

Good. That’s indeed how I think it should be.

> > I hope that at least my problem with
> > the conditionals in the sort keys is understandable at this point.
>
> So does my answer above help resolve this?

Well, your answer means that macro can be used as a key in sort, but
this does mean also that we still cannot have conditionals in the
sort. The conditionals in the macro are flattened against each
reference, and so generate a set of sort rules that is the same for
each reference. Conditionals directly in the sort still are an
impossibility. If we want that, we need to come up with a conditional
that is based on a pair of references.

Something like this, (but then properly thought out):

etc.

The purpose of sort should be that if given two references, it can say
which one goes first.
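
One way to picture a rule that looks at a pair of references is a pairwise comparator, sketched here in Python with `functools.cmp_to_key`. The `compare_refs` function and the sample data are hypothetical; the ordering it encodes is the AGU one discussed in this thread (first author, then 1 / 2 / 3-or-more authors, then second author for two-author works, then year).

```python
from functools import cmp_to_key


def cmp(a, b):
    """Three-way comparison: -1, 0, or 1."""
    return (a > b) - (a < b)


def compare_refs(x, y):
    """Negative if x goes first, positive if y goes first, 0 if tied."""
    result = cmp(x["authors"][0], y["authors"][0])
    if result:
        return result
    # Clamp the author count so that 3, 4, 5, ... all fall in one group.
    gx, gy = min(len(x["authors"]), 3), min(len(y["authors"]), 3)
    if gx != gy:
        return cmp(gx, gy)
    if gx == 2:
        result = cmp(x["authors"][1], y["authors"][1])
        if result:
            return result
    return cmp(x["year"], y["year"])


refs = [
    {"authors": ["Doe", "Gates"], "year": 1997},
    {"authors": ["Doe"], "year": 1999},
    {"authors": ["Doe", "Adams", "Baker"], "year": 1997},
    {"authors": ["Doe", "Brown"], "year": 2004},
    {"authors": ["Doe"], "year": 1997},
]

ordered = sorted(refs, key=cmp_to_key(compare_refs))
for ref in ordered:
    print(ref["authors"], ref["year"])
```

Since the same rule is applied to both sides of every comparison, swapping the arguments always flips the sign of the result, which is the property the per-reference conditionals lacked.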

Johan

Hello,

I think this could work. We need to add a way to specify the range of
names to be used for sorting using new attributes start and end. If
not defined, start=1 and end=infinity. If no names exist at the
position, it can just use an empty string (which should sort before
non-empty strings). The macro is used only for the sorting, and is
resolved to its text value for each reference being sorted.

That would give this sorting (I use a semicolon to separate values,
“”=empty string). This shows the keys some fictional references would
have and how this sort would turn out:

Doe;1;“”;1997
Doe;1;“”;1999
Doe;2;Brown;2004
Doe;2;Gates;1997
Doe;2;Gates;2007
Doe;3;“”;1997
Doe;3;“”;2004 (this reference actually has 12 authors!!)
Doe;3;“”;2006

This is all typed in mail, so I might have gotten some of the CSL wrong, but I
think this principle works. The tough bit left to solve is how to
deal with the condition in the if-tag. I am not sure if putting the
condition fully inside the condition attribute is best way to do so in
xml.
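
A side note on keys written with a separator, as in the list above: comparing the fields one by one and comparing a single joined string can give different orders. The Python sketch below uses an invented hyphenated name, "Doe-Smith", to show this; the key layout (first author, clamped count, second author, year) follows the list, and everything else is an assumption.

```python
# Keys as (first author, clamped author count, second author, year).
keys = [
    ("Doe-Smith", 1, "", 1990),
    ("Doe", 2, "Gates", 2007),
    ("Doe", 1, "", 1997),
]


def joined(key):
    """Join the fields into one text string with ';' as separator."""
    return ";".join(str(field) for field in key)


# Field-by-field comparison: "Doe" sorts before "Doe-Smith".
by_fields = sorted(keys)
# Single-string comparison: '-' sorts before ';', so "Doe-Smith;..." comes first.
by_text = sorted(joined(key) for key in keys)

print([joined(key) for key in by_fields])
print(by_text)
```

This is one reason to sort on each key separately rather than on one concatenated string.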

Johan

On 2 Jun 2008, at 12:26, David M. Kaplan wrote:

> Hi,
>
> I am at a conference at the moment, so I can’t read this too
> extensively (back next week). I have thought about this problem a bit.
> That is why I think that adding an “author-count” (maxing out at 3)
> explicitly to the sort just after the first author would be useful - this
> should order references correctly. You would also have to sort by each
> key separately, not just one sort based on a single large text string.
>
> I also think that perhaps allowing choose directly in the sort could
> be a problem as there is no guarantee to have the same number of keys,
> which could freak out programs. A better option would be to move the
> choose to macros (this is not how I framed my initial suggestion, but I
> think this is better).
>
> Another problem that I see is context dependent “author-count” - i.e.
> you want an author count that will follow clauses so
> that it will be the number of editors when there are no authors, etc.
> Framing this correctly seems a bit complicated to me - thoughts?
>
> Cheers,
> David

[…]


Hello,

I forgot to get back to you on this, and I’ll need to put it off until
probably tomorrow sometime, but on this …

> This is all typed in mail, so I might have gotten some of the CSL wrong, but I
> think this principle works. The tough bit left to solve is how to
> deal with the condition in the if-tag. I am not sure if putting the
> condition fully inside the condition attribute is best way to do so in
> xml.

No, it’s not ideal; earlier I’d suggested splitting the variable from
the condition (“count”) something like:

Bruce

Never mind about that, we are all doing this in our spare time anyway,
so I understand fully.

Ok, so then if we do the condition like this:

--> 3,4,5,6,…
--> 0,1,2,3
--> 3
--> 3,4,5,6

we end up with this:

On 2 Jun 2008, at 23:43, Bruce D'Arcus wrote:

Hi,

I have been away at a conference, so I haven’t been able to participate
in the discussion. I just want to weigh in on a couple of issues:

  1. I think that conditionals in sort are a bad idea (this may already be
    consensus, but just wanted to add my voice). I would keep conditionals
    in macros and then sort on the macro response as text.

  2. At some point there was a suggestion to have conditionals of the
    form:

I think this is a bad idea as it requires parsing the condition text,
which is a way of going around XML (i.e., one should be able to express this
appropriately in XML so that we can use the XML parser to tell us what
the condition should look like).

More recently, it has been suggested to use something like:

This is better, but I still think this could have problems. One is that
I am not sure how substitution would work with this (for example, if
instead of authors, the publication had editors, how would it know
to use editors instead of authors?). You could probably work this out
with lots of if’s, but it is a bit awkward.

The other problem is that I think it would be more flexible to have a
general way of expressing basic conditions in CSL and then apply this
appropriately. Someone must have already drafted how such statements
could be done in general XML, and we should probably take advantage of
this knowledge. But, it could go something like this:

3 ....

This would be equivalent to “if (author-count > 3) {}” (in essence this
is using XML to pre-parse the conditional statement).
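
As a sketch of what “using XML to pre-parse the conditional” could mean in practice, the Python below evaluates a condition whose operator and operands are XML elements, so no condition text has to be parsed. The element names (`greater-than`, `less-than`, `variable`, `value`) are invented for this example and are not part of CSL.

```python
import operator
import xml.etree.ElementTree as ET

# Hypothetical markup: the comparison is an element, its operands are children.
CONDITION = """
<if>
  <greater-than>
    <variable name="author-count"/>
    <value>3</value>
  </greater-than>
</if>
"""

OPERATORS = {"greater-than": operator.gt, "less-than": operator.lt}


def evaluate(test, variables):
    """Evaluate one comparison element against a dict of variable values."""
    left = variables[test.find("variable").get("name")]
    right = int(test.find("value").text)
    return OPERATORS[test.tag](left, right)


# The first child of <if> is the comparison element.
comparison = ET.fromstring(CONDITION.strip())[0]

print(evaluate(comparison, {"author-count": 12}))
print(evaluate(comparison, {"author-count": 2}))
```

The XML parser does all the syntactic work here; the processor only needs a lookup from element name to comparison.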

The macro would be something like:

0

This is just a guess as to the best way to do this, but the basic idea
of using XML to express the condition I think is the right direction.

Note, if you did this, you could get rid of in favor of
appropriate if statements, but it is not clear this would be particularly more
effective than what is already in CSL.

Cheers,
David–




Hi,

I have been consulting with some friends that know a thing or two about
XML. They basically pointed me to XSL and wondered why CSL was
developed as a separate language and didn’t just use XSL instead? I
imagine from looking a bit at XSL that it was like using a bulldozer to
crack an egg open.

XSL has conditionals, but they do exactly what I recommended not doing
by placing the condition in a separate language called XPath:

http://www.w3.org/TR/xslt#section-Conditional-Processing-with-xsl:if

I am not sure this is the direction we should go in, but XSL is a
standard. However, our needs seem much more limited than general XSL
and it would seem that developing a simple condition like I
suggested is a good idea. But I defer to someone who has a more global
vision of how Zotero works with CSL and the logic behind the structure
of CSL.

Cheers,
David

> If you know of any “prior art” on this, I’d sure like to know. I haven’t

Exactly!

There are a lot of reasons why not XSLT. First, take a look at the APA
style that Microsoft implements in Word 2007/2008 and compare it to
ours. There’s something like an order-of-magnitude difference in size
(and complexity), and ours better implements the style.

Second, XSLT is designed for XML input formats; CSL is intended to be
agnostic about that.

Bruce

> More recently, it has been suggested to use something like:
>
> This is better, but I still think this could have problems. One is that
> I am not sure how substitution would work with this (for example, if
> instead of authors, the publication had editors, how would it know
> to use editors instead of authors?). You could probably work this out
> with lots of if’s, but it is a bit awkward.

The way we do it now is the content of the conditional “type”
attribute is a list; so one or more.

> The other problem is that I think it would be more flexible to have a
> general way of expressing basic conditions in CSL and then apply this
> appropriately. Someone must have already drafted how such statements
> could be done in general XML, and we should probably take advantage of
> this knowledge. But, it could go something like this:
>
> 3 ....
>
> This would be equivalent to “if (author-count > 3) {}” (in essence this
> is using XML to pre-parse the conditional statement).
>
> The macro would be something like:
>
> 0

Hmm … WRT the conditional, that’s not even necessary now. Our
default (implicit) match condition is to test whether a variable is
present. So there’s no need to do the > 0 test.

The trick here is really how best to grab the count.

Maybe:

The “author-count” thing is maybe a bit ugly though.

Bruce

Oops; which is implemented in XSLT.

Bruce