finishing with CSL?

So I’ve asked for feedback on a number of occasions over the past month
or so on the following questions:

  1. when can we call CSL 1.0 and be done with that?
  2. when is Zotero going to implement CSL as I intend (styles loaded by
    URI rather than stored upfront in the database, publicly available,
    etc.)?

I have not gotten a single response (aside from Simon telling me he’s
busy with other things on Zotero). I’m at least as busy as everyone
else here, and I need help at times to keep this moving forward. I
really don’t want to leave this hanging for another year.

The first question is important for obvious reasons. The second is
important because I’m tired of keeping track of who has what styles and
which is most up-to-date (also not good for users as we start to build
up the number of styles). It’s easy enough to define simple conventions
for distributed repositories, so easier to get this out of the way
sooner rather than later.

Also, Simon mentioned some interns are working on styles using the new
schema? If this is true, where is this work? How are they progressing?
Can they perhaps make sure to update us on how things are going so we
can fix any bugs in the schema, and perhaps start to create real
documentation?

Should I just simply declare an end-of-August freeze on the schema, and
call it 1.0 on September 1?

Bruce

Bruce,

I thought I’d research these more before asking, because these may come just
from my ignorance, but since you’re trawling for feedback, here you go.

  1. small thing: When disambiguating, should one really add title, or should
    one add shortened title, if it exists? Is that implicit or something that
    CSL should define? (But “short-title” isn’t in the list of defined
    variables. Hmmm.)

  2. small thing: Does CSL define how works without authors are handled? How
    should they be cited in an author-date style? In the bibliography, should
    they be tagged and sorted as if the author were “Anonymous”? Or should they
    be sorted using the title? Or do all styles handle this the same so there is
    no reason to specify it?

  3. middle-sized thing: I see a pre-defined list of variables, but I don’t
    see a pre-defined list of types of contributor. Should there be?

  4. big thing: If, as discussed earlier, it isn’t CSL-compliant if it uses
    variables other than those specified in the schema, the list of variables
    seems much too short. (Word 2007 offers about fifty.) I can’t see how you
    could define styles for all the cs-types-optional without more variables. Or
    are the variables just meant to cover cs-types-required? But then back to
    the earlier question. If someone adds variables to handle cs-types-optional,
    is it CSL-compliant?

John

I thought I’d research these more before asking, because these may come just
from my ignorance, but since you’re trawling for feedback, here you go.

  1. small thing: When disambiguating, should one really add title, or should
    one add shortened title, if it exists? Is that implicit or something that
    CSL should define? (But “short-title” isn’t in the list of defined
    variables. Hmmm.)

To get a short-title you use the “form” attribute with the “short” value.

  2. small thing: Does CSL define how works without authors are handled? How
    should they be cited in an author-date style? In the bibliography, should
    they be tagged and sorted as if the author were “Anonymous”? Or should they
    be sorted using the title? Or do all styles handle this the same so there is
    no reason to specify it?

It does handle this through the “substitution” element, and this is
really critical stuff. So typically you’d do a “creator” or “author”
macro and define all this there (and try to tie conditions to the
attributes of a resource rather than the type).

The tricky thing that I’ve not tested (because I have no time to code)
is how well this works with the new approach. I’m a little worried
how a processor will know to suppress a title, for example, if it’s
been substituted for the author.

Simon, I presume there’s no problem?

  3. middle-sized thing: I see a pre-defined list of variables, but I don’t
    see a pre-defined list of types of contributor. Should there be?

They should be there related to the name structure.

  4. big thing: If, as discussed earlier, it isn’t CSL-compliant if it uses
    variables other than those specified in the schema, the list of variables
    seems much too short. (Word 2007 offers about fifty.) I can’t see how you
    could define styles for all the cs-types-optional without more variables. Or
    are the variables just meant to cover cs-types-required? But then back to
    the earlier question. If someone adds variables to handle cs-types-optional,
    is it CSL-compliant?

I don’t want to end up with a tower of babel, so I’d say no, they’re
not compliant.

Now, about the list of variables:

Feel free to suggest additions, but one thing to keep in mind is that
the way that a CSL processor and a CSL style works should be as
generic as possible. So, for example, we have “container-title” rather
than “journal-title” and “book-title” and so forth. Same with “event”.

Likewise, you’d never use a variable like “report-number”; you’d use
simply “number” (a shorthand for “document-number” in fact).

So the upshot is before you have the urge to add a new variable, ask
yourself if there’s a more generic one that would work.
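
One way to picture the “prefer the generic variable” rule is as a renaming step in the application, before data reaches a CSL processor. This is only a sketch: the names on the left (journal_title, book_title, report_number, conference_name) and the function to_csl_variables are hypothetical, and only “container-title”, “number”, and “event” come from the message above.

```python
# Hypothetical application field names mapped to generic CSL variables.
# Only the right-hand names ("container-title", "number", "event") come
# from the thread; the left-hand names are invented for illustration.
GENERIC_VARIABLE = {
    "journal_title": "container-title",
    "book_title": "container-title",
    "report_number": "number",
    "conference_name": "event",
}


def to_csl_variables(record):
    """Rename application-specific fields to their generic CSL variables."""
    result = {}
    for field, value in record.items():
        # Fields with no generic equivalent pass through unchanged.
        result[GENERIC_VARIABLE.get(field, field)] = value
    return result


# Made-up sample records.
article = {"title": "An Article", "journal_title": "Example Journal"}
report = {"title": "A Report", "report_number": "TR-12"}
print(to_csl_variables(article))
print(to_csl_variables(report))
```

With a table like this, a style only ever has to test for “container-title” or “number”, whatever the application calls the field in its own interface.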

All of this needs to be documented at some point of course :slight_smile:

Bruce

John P. McCaskey wrote:

To get a short-title you use the “form” attribute with the “short” value.

Should the processor always disambiguate with a short title if one is there?

I don’t have a strong opinion, but my hunch is yes.

Is that your position as well? Any other opinions? Simon?

So the upshot is before you have the urge to add a new variable, ask
yourself if there’s a more generic one that would work.

And if there is, how do I use it for my particular purpose?

If I’m writing the style for a map (one of the cs-types-optional), I need a
cartographer. I think, “That’s like an author, that is, I’d like the
bibliography to sort with cartographers and authors together.” Now what do I
do?

Do I add to my author macro, like this

Or do I somewhere in my formatting application, outside the CSL styles, map
what the user sees as “Cartographer” to what CSL calls author?

I think the latter. Yes?

Yes. It might even make sense for us to ditch “author” and use instead
(or also) the more generic “creator.”

An editor or translator, for example, is NOT a creator; they are
secondary roles that substitute for creator/author when not present.

But we need to document the mapping in that case.

As I see it, drawing from recent discussions of data modeling borrowing
from FRBR, we really have three levels of roles:

primary: creator (author, writer, cartographer)
secondary: realizer (editor, translator)
tertiary: producer (publisher, probably distributor, etc.)

Some things are difficult though. Director (realizer)? Conductor (also
realizer)? Writer (easy, creator, except maybe if talking about a
screenplay?).
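
As a rough sketch, the three levels can be written as a lookup table that an application consults when deciding whose names stand in the author position. The function primary_names is hypothetical, the placement of “director” and “conductor” follows the tentative guesses above, and letting a producer stand in when no creator or realizer exists is an assumption, not something settled in this thread.

```python
# Role levels as proposed in the message above. The placement of
# "director" and "conductor" follows the tentative guesses in the text.
ROLE_LEVEL = {
    "author": "creator",
    "writer": "creator",
    "cartographer": "creator",
    "editor": "realizer",
    "translator": "realizer",
    "director": "realizer",
    "conductor": "realizer",
    "publisher": "producer",
    "distributor": "producer",
}

# Order in which levels may stand in for a missing creator.
# Including "producer" here is an assumption made for the sketch.
LEVEL_ORDER = ["creator", "realizer", "producer"]


def primary_names(contributors):
    """Return the names at the highest-priority level that is present.

    `contributors` is a list of (role, name) pairs. Roles that are not
    in ROLE_LEVEL are ignored.
    """
    for level in LEVEL_ORDER:
        names = [name for role, name in contributors
                 if ROLE_LEVEL.get(role) == level]
        if names:
            return names
    return []


print(primary_names([("editor", "Doe"), ("cartographer", "Roe")]))
print(primary_names([("translator", "Poe"), ("publisher", "Acme")]))
```

The first call returns the cartographer because a creator is present; the second falls back to the translator because no creator is.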

If so, then in my end-user application, I need to presume that whoever wrote
the style for a movie is assuming what the user enters as “Director” will
end up in the author variable and “Producer” in the editor variable (or
whatever way that is supposed to be), or for newspapers, whether the locator
will be B and the page 13 or the locator will be B13. What, if anything,
tells the user interface writer the assumptions that the CSL style writer
made? What tells the UI designer “For newspapers, ask the user for Section
and Page Number together, and give that to the CSL formatter as one value in
the ‘locator’ variable?” Is that just the documentation for the CSL style,
which was presumably written with guidance from CSL documentation?

I think we first of all make it clear in the schema where we can through
annotations (the stuff prepended with the “##”; gets converted to formal
annotation elements in the XML schema when converted by Trang). Then
obviously we also clarify it in the documentation (hopefully mostly
automatically-generated from the schema).

As we see and as expected, I have more “how do I use this thing” questions
than any valuable feedback that should hold up CSL 1.0.

Yes, but these questions are still helpful.

Bruce

I think we first of all make it clear in the schema where we can through
annotations (the stuff prepended with the “##”; gets converted to formal
annotation elements in the XML schema when converted by Trang). Then
obviously we also clarify it in the documentation (hopefully mostly
automatically-generated from the schema).

Is it a design goal of CSL that a software maker could create a word
processor and formatter that, just from reading a CSL file, could tell a
user what fields are needed for a particular style and document type and
then format a citation and bibliography accordingly?

In other words, if a user had a CSL-using word processor, and a new CSL file
came out that included support for some previously unsupported source type,
could that word processor (in theory at least) support the new document type
without new software, config files, etc. for the word processor’s author?

I had earlier assumed this would be possible, but now I don’t think it is.
There is still another level of mapping needed. No?

Or will there be enough in the ## annotations to tell the word processor
what to do?

John

We’ll do this, but it won’t happen until 1.5. We need to feature freeze
Zotero for 1.0 this week and aren’t going to be able to implement this
in the client in time.

  1. small thing: When disambiguating, should one really add title, or should
    one add shortened title, if it exists? Is that implicit or something that
    CSL should define? (But “short-title” isn’t in the list of defined
    variables. Hmmm.)

To get a short-title you use the “form” attribute with the “short” value.

Yes, although for this kind of disambiguation at the moment, we use
“disambiguate-add-title,” which may not be quite enough. This option
also does not fully specify how to format titles (with quotes for
articles, with what delimiter between title and year, etc.), and some
of this might change from style to style. Maybe use
tags, or for this kind of thing, to allow
for greater flexibility?

  2. small thing: Does CSL define how works without authors are handled? How
    should they be cited in an author-date style? In the bibliography, should
    they be tagged and sorted as if the author were “Anonymous”? Or should they
    be sorted using the title? Or do all styles handle this the same so there is
    no reason to specify it?

It does handle this through the “substitution” element, and this is
really critical stuff. So typically you’d do a “creator” or “author”
macro and define all this there (and try to tie conditions to the
attributes of a resource rather than the type).

The tricky thing that I’ve not tested (because I have no time to code)
is how well this works with the new approach. I’m a little worried
how a processor will know to suppress a title, for example, if it’s
been substituted for the author.

Simon, I presume there’s no problem?

It all seems to be working in the latest version of Zotero from the
dev branch.

Simon

Simon Kornblith wrote:

  1. small thing: When disambiguating, should one really add title, or should
    one add shortened title, if it exists? Is that implicit or something that
    CSL should define? (But “short-title” isn’t in the list of defined
    variables. Hmmm.)

To get a short-title you use the “form” attribute with the “short” value.

Yes, although for this kind of disambiguation at the moment, we use
“disambiguate-add-title,” which may not be quite enough. This option
also does not fully specify how to format titles (with quotes for
articles, with what delimiter between title and year, etc.), and some
of this might change from style to style. Maybe use
tags, or for this kind of thing, to allow
for greater flexibility?

Ugh …

OK, so this is for situations in which you want to disambiguate
citations. E.g. what to do when you have (Doe; Doe) where the two Does
are different people.

We have flags for this basically, where one option does (J. Doe; S.
Doe), and the other (Doe, Some Title; Doe, Some Other Title).

Right?

You, Simon, are floating the possibility that this not be configured in
an attribute like this, but rather with some element (perhaps using the
conditional).

Do you want to do that?

If yes, then perhaps we ought to make sure it’s consistent with how
substitute works? E.g. either we add cs:disambiguate, or we remove
cs:substitute and do or some such.

Otherwise, we could just change to “disambiguate-add-short-title”.

Bruce

Dan Stillman wrote:

On 8/12/07 3:05 PM, Bruce D’Arcus wrote:

  2. when is Zotero going to implement CSL as I intend (styles loaded by
    URI rather than stored upfront in the database, publicly available,
    etc.)?

We’ll do this, but it won’t happen until 1.5. We need to feature freeze
Zotero for 1.0 this week and aren’t going to be able to implement this
in the client in time. On the other hand, our aim was to have Zotero 1.0
support the CSL 1.0 spec, and I think we’ll be able to deliver on that
at least in the processor–Simon can confirm.

OK, but that then leaves the question about what to do now. I don’t see
any new styles in my repo. Do you have people working on them as Simon
suggested?

Bruce

Otherwise, we could just change to “disambiguate-add-short-title”.

That was my expectation when I brought this up.

John

Should there be terms for winter, spring, summer, fall (autumn? probably
both) for periodicals dated that way?

John P. McCaskey wrote:

Should there be terms for winter, spring, summer, fall (autumn? probably
both) for periodicals dated that way?

Yes, but we have one little problem: no standard way to represent this
in data!

I’m still not sure how we should do this in the RDF, for example, and
I’m sure most applications store this as a dumb string.

That said, I would like to find a good solution for this. Maybe this
group will figure it out …

http://dublincore.org/groups/date/

Bruce

Ugh …

OK, so this is for situations in which you want to disambiguate
citations. E.g. what to do when you have (Doe; Doe) where the two Does
are different people.

We have flags for this basically, where one option does (J. Doe; S.
Doe), and the other (Doe, Some Title; Doe, Some Other Title).

Right?

You, Simon, are floating the possibility that this not be configured in
an attribute like this, but rather with some element (perhaps using the
conditional).

Do you want to do that?

I’m pretty sure, yes. There are all sorts of strange issues
(capitalization, placement, quotation marks, etc.) that are style-
specific. For example, in MLA, this wouldn’t be (Doe, Some Title),
but (Doe, “Some Title”).

If yes, then perhaps we ought to make sure it’s consistent with how
substitute works? E.g. either we add cs:disambiguate, or we remove
cs:substitute and do or some such.

This would be a child element of layout. It would be the citation
processor’s responsibility to ensure there are not two items with the
same citation, and if there are, it would add whatever is in or to the citation in order to
disambiguate it. This would be exceedingly simple for the CSL parser
to implement, and it really is just an (albeit one that depends
on the other citations in the bibliography), although I wouldn’t be
unhappy using . has a different meaning,
and should definitely stay a separate tag.
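
A minimal sketch of the processor-side behaviour described above: render every citation, detect collisions, and only then append whatever the style supplies for disambiguation. The function render_citations and its base and extra parameters are hypothetical stand-ins for the style's normal layout and its disambiguating addition; the sample items are made up, and the sketch assumes each item appears only once in the list.

```python
from collections import Counter


def render_citations(items, base, extra):
    """Render one citation per item, adding `extra` only on collisions.

    `base` and `extra` are callables standing in for the style's normal
    layout and for whatever the style supplies for disambiguation.
    """
    plain = [base(item) for item in items]
    counts = Counter(plain)
    rendered = []
    for item, text in zip(items, plain):
        if counts[text] > 1:
            # Two or more items produced the same citation text.
            text = f"{text}, {extra(item)}"
        rendered.append(text)
    return rendered


items = [
    {"author": "Doe", "title": "Some Title"},
    {"author": "Doe", "title": "Some Other Title"},
    {"author": "Roe", "title": "A Third Title"},
]
# MLA-like variant: the added title is wrapped in quotation marks.
print(render_citations(
    items,
    base=lambda i: i["author"],
    extra=lambda i: f'"{i["title"]}"',
))
```

Because the addition is a callable supplied by the style, the quotation marks, capitalization, and delimiter stay style-specific while the collision check lives in the processor.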

Simon

Simon Kornblith wrote:

On Aug 13, 2007, at 5:05 AM, Bruce D’Arcus wrote:

If yes, then perhaps we ought to make sure it’s consistent with how
substitute works? E.g. either we add cs:disambiguate, or we remove
cs:substitute and do or some such.

This would be a child element of layout. It would be the citation
processor’s responsibility to ensure there are not two items with the
same citation, and if there are, it would add whatever is in or to the citation in order to
disambiguate it. This would be exceedingly simple for the CSL parser
to implement, and it really is just an (albeit one that depends
on the other citations in the bibliography), although I wouldn’t be
unhappy using . has a different meaning,
and should definitely stay a separate tag.

Been a long day, and I’m really busy, so this has gone past me. Any
specific proposal, such that it and substitute work consistently (or
does this matter?)?

Bruce

For what it’s worth, the whole approach to disambiguation always seemed
wrong to me. It seemed like it was working at the wrong level of
abstraction, that it was assuming disambiguation strategies were more
similar than they’ll turn out to be, and that eventually there will be need
for more and more flags controlling disambiguation.

I’ve thought of it this way:

A citation is a concatenation of fields, some that are specific to the
instance of the citation (such as page number) and others that are specific
to the item being cited (title, author, etc.) The first are pulled from what
the user entered for the citation, the second is pulled from the
bibliographic records for the cited items. (In SQL-think, a citation is
created with a select clause, joining the citation record to the biblio
records, with the appropriate group bys and order bys.)

In some styles, however, this is not enough. The most common (but not only)
example is where the citation needs a field that depends not just on the
citation info and the joined biblio record, but on other unjoined biblio
records. The citation needs, for example, 1986b. Why? Because it needs a
unique key by which the reader will cross-reference to the bibliography and
year (1986) isn’t unique. If the bibliographic record just had 1986a, then
the regular join machinery would have all it needs.

What I really want is just a new bibliographic field, let’s call it ‘key’.
Before the citations are formatted, the formatter looks at the bibliography,
assigns keys according to some format definition, then lets the citation
formatter run. The citation formatter knows nothing about how that key was
created or why. It’s just been told to join citation record with biblio
record and concatenate "( & key & ", " & page & “)”.

By this, disambiguation is not something defined for citations. It’s
something defined for a bibliography. And there are no disambiguation flags.
There is just, if needed for the style, a key field defined by the style
such that it gets filled with 1986a or 1986b or 2001 “Aardvarks” or 1987[a]
or 2001-i or whatever. It’s the style’s author who defined that field to be
used for disambiguating, but neither the citation nor the bibliography are
told the purpose.
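
To make the idea concrete, here is a small sketch of a pre-pass that fills a ‘key’ field before any citation is formatted. The function names assign_keys and cite are hypothetical, the year-plus-letter rule is just one possible key definition, and ordering the letters by title is an assumption; the sketch also handles at most 26 items per author and year.

```python
from collections import defaultdict
from string import ascii_lowercase


def assign_keys(bibliography):
    """Add a 'key' field to each record before citations are formatted.

    Records sharing author and year get the year plus a letter suffix
    (1986a, 1986b, ...); all others get the bare year.
    """
    groups = defaultdict(list)
    for record in bibliography:
        groups[(record["author"], record["year"])].append(record)
    for (_, year), records in groups.items():
        if len(records) == 1:
            records[0]["key"] = str(year)
            continue
        # Letter order follows title order within the group (an assumption).
        ordered = sorted(records, key=lambda r: r["title"])
        for letter, record in zip(ascii_lowercase, ordered):
            record["key"] = f"{year}{letter}"
    return bibliography


def cite(record, page):
    """The citation formatter only concatenates; it never sees the rule."""
    return f"({record['key']}, {page})"


bib = assign_keys([
    {"author": "Doe", "year": 1986, "title": "B"},
    {"author": "Doe", "year": 1986, "title": "A"},
    {"author": "Roe", "year": 2001, "title": "C"},
])
print([record["key"] for record in bib])
print(cite(bib[0], 12))
```

Swapping in a different key definition (a bare letter, a short title, a bracketed suffix) changes only assign_keys; cite stays the same.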

The key might just be a, b, c, not 1986a, 1986b, 1986c. As long as there is
a field now in the biblio record that the citation query can join to and can
use to concatenate what it needs, all is good. Doing it this way makes
collapsing within a citation very simple and flexible. Instead of defining
the collapsing strategies a priori as (none), year or suffix, and hoping you
covered all styles, you just let the sort by and group by in the citation
query automatically do the work.

A style might need more than one key. The style might need it to get the
citation just right. But the keys (just fields in bibliographic records
whose value is determined at formatting time) can serve other purposes. In
Bluebook style, a bibliographic record should contain the number of the
footnote in which the item was first cited, for creating citations such as
See Smith and Jones, supra note 234, at 176. Just add another key. Call it
note-where-first-cited. (Real SQL would burp on this but a programmatic loop
would handle it just fine.)
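
The programmatic loop mentioned above could look something like this sketch. The names first_cited_notes and short_cite, and the shape of the footnotes input, are hypothetical; only the “supra note” wording comes from the example in the text.

```python
def first_cited_notes(footnotes):
    """Map each item id to the number of the footnote that first cites it.

    `footnotes` is a list of lists: footnotes[n - 1] holds the item ids
    cited in footnote n.
    """
    first = {}
    for number, cited_ids in enumerate(footnotes, start=1):
        for item_id in cited_ids:
            # setdefault keeps the earliest footnote number seen.
            first.setdefault(item_id, number)
    return first


def short_cite(item_id, authors, page, first):
    """Format a later reference in the 'supra note' pattern quoted above."""
    return f"See {authors}, supra note {first[item_id]}, at {page}."


notes = [["smith"], ["jones", "smith"], ["smith"]]
first = first_cited_notes(notes)
print(first)
print(short_cite("smith", "Smith and Jones", 176, first))
```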

I would remove all mention of disambiguation in CSL. I would add the concept
of bibliographic variables that a style’s author defines, that get set at
format time, and then are used as variables in citations. If the formatter
sees it has any of these, it runs through the bibliography, populates the
variables, and then formats the citations without heed to how or why those
fields were created and even to whether they came from a biblio database or
had just been set.

At least that’s how I think about disambiguation.

John

To repeat, any attempt to make this consistent with substitute will
be confusing, since it operates in a different context and has a
completely different meaning.

Simon

John P. McCaskey wrote:

For what it’s worth, the whole approach to disambiguation always seemed
wrong to me. It seemed like it was working at the wrong level of
abstraction, that it was assuming disambiguation strategies were more
similar than they’ll turn out to be, and that eventually there will be need
for more and more flags controlling disambiguation.

John,

You bring the freshest eyes to this, so am happy to consider your
suggestions. But as I said, I’m a bit overwhelmed with other things
these days, so can you try to put this in the form of a concise
suggestion on what the XML should look like?

I know XML isn’t really your thing, but ideal would be an actual RNC
schema fragment.

Absent that, some example XML would work too.

Bruce

Also, on this …

John P. McCaskey wrote:

I would add the concept of bibliographic variables that a style’s author defines, that get set at
format time, and then are used as variables in citations. If the formatter
sees it has any of these, it runs through the bibliography, populates the
variables, and then formats the citations without heed to how or why those
fields were created and even to whether they came from a biblio database or
had just been set.

I think we’ve got the basics of that: it’s the new “macro” element.

Branch schema and example are here just in case:

http://xbiblio.svn.sourceforge.net/viewvc/xbiblio/csl/schema/branches/

Bruce

I think we’ve got the basics of that: it’s the new “macro” element.

Bruce

Yes, I think macro is the core. I think we just need a way to
(1) include in a macro some variables that are automatically altered as a bibliography is sorted, scanned, or referenced,
(2) give the macro-writer some predefined variables, and
(3) allow a macro’s results to look to a citation like variables in a bibliographic record.

Starting with (3):

While or after sorting, number each bibliographic item. Make the result a predefined variable, say, sort-order-number. The style’s author writes a simple macro

This sets his bib-item-number to the pre-defined sort-order-number. Then the citation is

Out comes [8:34; 4:12-14; 4:45-47].

The style writer decides he wants these sorted by key1:

He gets [4:12-14; 4:45-47; 8:34].

Now add a groupby parameter

to get [4:12-14, 45-47; 8:34]

(Note that ordering and grouping is defined in the citation.)
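
The three outputs above can be reproduced with a short sketch in which sorting and grouping are plain parameters of the citation formatter. The function format_cites and its sort and group flags are hypothetical stand-ins for the sort and groupby parameters; the input pairs are taken from the example outputs.

```python
from itertools import groupby


def format_cites(cites, sort=False, group=False):
    """Format (item_number, locator) pairs as a bracketed citation."""
    if sort:
        # sorted() is stable, so locators keep their entry order per item.
        cites = sorted(cites, key=lambda c: c[0])
    if group:
        parts = []
        # groupby merges only adjacent pairs with the same item number.
        for number, run in groupby(cites, key=lambda c: c[0]):
            locators = ", ".join(locator for _, locator in run)
            parts.append(f"{number}:{locators}")
    else:
        parts = [f"{number}:{locator}" for number, locator in cites]
    return "[" + "; ".join(parts) + "]"


cites = [(8, "34"), (4, "12-14"), (4, "45-47")]
print(format_cites(cites))                         # [8:34; 4:12-14; 4:45-47]
print(format_cites(cites, sort=True))              # [4:12-14; 4:45-47; 8:34]
print(format_cites(cites, sort=True, group=True))  # [4:12-14, 45-47; 8:34]
```

Grouping without sorting would only merge items that happen to be adjacent, which is why the two parameters interact.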

Now maybe for a particular style, the key is more complicated.

I think an APA key can always be determined by looking at the fields in a biblio item and the same fields in the previous and next biblio items (previous and next in the sort order). For now let me assume this is true of all styles. So then we need to give the macro-writer access to all fields for the two surrounding items. He writes a macro like this:

Maybe that’s done differently, but the result is a variable, disambiguating-letter, that the citation can now use.

yields (Jones 2004; Jones 2005; Doe 1987a; Doe 1987b) assuming that’s the order it was entered. Now change the sort order and the grouping:

yields (Doe 1987a, b; Jones 2004, 2005).

(I might have the spaces and prefixes wrong on that.)
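
For the author-year case, a sketch of the sort-then-group step might look like the following. The function collapse is hypothetical, and it assumes keys of the form year plus optional lowercase letter (such as 1987a) have already been assigned to each item.

```python
import re
from itertools import groupby


def collapse(cites):
    """Collapse (author, key) pairs, e.g. keys '1987a' and '1987b'."""
    parts = []
    for author, run in groupby(sorted(cites), key=lambda c: c[0]):
        shown = []
        previous_year = None
        for _, key in run:
            year, suffix = re.fullmatch(r"(\d+)([a-z]?)", key).groups()
            # Repeat only the suffix when the year was just printed.
            if year == previous_year and suffix:
                shown.append(suffix)
            else:
                shown.append(key)
            previous_year = year
        parts.append(f"{author} {', '.join(shown)}")
    return "(" + "; ".join(parts) + ")"


entered = [("Jones", "2004"), ("Jones", "2005"),
           ("Doe", "1987a"), ("Doe", "1987b")]
print(collapse(entered))  # (Doe 1987a, b; Jones 2004, 2005)
```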

You can jigger the sortbys and groupbys to get any combination you need. The macro decided whether it would come out 1987A, 1987i, 1987-a, 1987 with a small-caps A, etc.

Some styles might require two or three macros, maybe disambiguated-author, disambiguated-short-title, and disambiguating-letter. Some would require none.

The only other built-in variable I can think of right now is note-where-first-cited, though I’d probably throw in number-of-times-cited, just in case.

John

I think we’ve got the basics of that: it’s the new “macro” element.

Bruce

Yes, I think macro is the core. I think we just need a way to
(1) include in a macro some variables that are automatically
altered as a bibliography is sorted, scanned, or referenced,
(2) give the macro-writer some predefined variables, and
(3) allow a macro’s results to look to a citation like variables in
a bibliographic record.

Starting with (3):

While or after sorting, number each bibliographic item. Make the
result a predefined variable, say, sort-order-number. The style’s
author writes a simple macro

Why create a separate variable here? What does this accomplish beyond
the current element?

[…]

Now maybe for a particular style, the key is more complicated.

I think an APA key can always be determined by looking at the
fields in a biblio item and the same fields in the previous and
next biblio items (previous and next in the sort order). For now
let me assume this is true of all styles.

As far as I know, this is a safe assumption.

So then we need to give the macro-writer access to all fields for
the two surrounding items. He writes a macro like this:

Maybe that’s done differently, but the result is a variable,
disambiguating-letter, that the citation can now use.

There doesn’t seem to be much of a point to explicitly coding a
disambiguating-letter macro. We will need specialized rules to do it
(e.g., your increment=“abc” attribute) no matter what approach we take.

yields (Jones 2004; Jones 2005; Doe 1987a; Doe 1987b) assuming
that’s the order it was entered. Now change the sort order and the
grouping:

yields (Doe 1987a, b; Jones 2004, 2005).

(I might have the spaces and prefixes wrong on that.)

XML doesn’t allow multiple attributes with the same name on an element,
for one thing, and doesn’t care about attribute order, for another. You’d
want to model this as:

...

or something of that sort. However, this allows users to do some
complicated things, which a parser would have to support, but which
might never get used (e.g., sort-by order different from group-by
order).

You can jigger the sortbys and groupbys to get any combination you
need. The macro decided whether it would come out 1987A, 1987i,
1987-a, 1987 with a small-caps A, etc.

If you’re worried about the above cases, I’d suggest be a built-in variable. However, to
my knowledge, there are no styles that use something besides 2001a/b/
c for disambiguation of years, so the old option would probably be fine.

Some styles might require two or three macros, maybe disambiguated-
author, disambiguated-short-title, and disambiguating-letter. Some
would require none.

disambiguated-author is a different animal. I don’t know whether
there’s any need for extensibility here (my hunch is no), and I don’t
know whether it’s possible to create an extensible way of creating
this disambiguated-author macro that doesn’t require a host of new
variables.

My main complaints about the scheme you propose are as follows:

  1. There’s more logic than I feel is necessary here. We might handle
    disambiguation better in some strange fringe cases, but the current
    approach would work for almost everything. How many cases will there
    be where: 1) you’re using an obscure bibliographic style that
    handles disambiguation in some strange way, 2) this new syntax
    handles disambiguation, but the old syntax does not, and 3) you are
    using multiple sources by the same author published in the same year?

We provide the same disambiguation power EndNote does. Only BibTeX
might do it better, and that’s because its styles are actual
programming code (and you’ll probably have to run the style
formatting script 5 times to get it right). Word 2007 doesn’t even
support disambiguation, or didn’t as of the beta. More powerful
disambiguation is probably unimportant to 99% of our prospective user
base.

  2. Style authors might simply avoid this logic, because it’s
    confusing to code, and because it’s very easy to complete your style
    and not realize these things don’t work. Try adding this to apa.csl
    and see how many lines you need to do it. We shouldn’t require that
    all author-date styles replicate some complex series of sorting/
    grouping rules. We should just implement these rules in the parser so
    styles can easily enable/disable them.

You’re right, however, that we might want to provide more powerful
syntax for grouping. The two cases I can think of are:

(Doe 1987a; Doe 1987b) → (Doe 1987a, b) (your second example)
[1; 2; 3] → [1-3]

Right now the first is handled by disambiguate-year-suffix-collapse,
which you dislike. I wouldn’t mind replacing it with something more
extensible, but the approach you describe here is, to me, excessively
complicated. Preferably, no one should have to define new macros
simply to handle disambiguation. Besides, it’s not intuitively clear
to me how to handle the latter case with your approach. We could
add a “group-by” option to replace disambiguate-year-suffix-collapse,
but then we’d either have to hard-code the options (perhaps author-
year, author, and cited-number) or come up with some other way of
specifying the syntax.
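
The second case above, [1; 2; 3] → [1-3], is the kind of rule that is easy to keep in the processor. A minimal sketch follows; the function collapse_numbers is hypothetical, and the choice to collapse only runs of three or more numbers is an assumption, since nothing in this thread fixes the threshold.

```python
def collapse_numbers(numbers):
    """Collapse runs of three or more consecutive numbers into ranges."""
    numbers = sorted(set(numbers))
    parts = []
    start = 0
    while start < len(numbers):
        end = start
        # Extend the run while the next number is consecutive.
        while end + 1 < len(numbers) and numbers[end + 1] == numbers[end] + 1:
            end += 1
        if end - start >= 2:
            parts.append(f"{numbers[start]}-{numbers[end]}")
        else:
            parts.extend(str(n) for n in numbers[start:end + 1])
        start = end + 1
    return "[" + "; ".join(parts) + "]"


print(collapse_numbers([1, 2, 3]))     # [1-3]
print(collapse_numbers([3, 1, 2, 7]))  # [1-3; 7]
print(collapse_numbers([1, 2, 5]))     # [1; 2; 5]
```

A style would then only need to switch this behaviour on or off, in line with the argument for keeping the logic in the parser.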

At least in Zotero, your first example:

[4:12-14; 4:45-47; 8:34] → [4:12-14, 45-47; 8:34]

is irrelevant, since to do this, you’d put “12-14, 45-47” into the
locator field. I’ve never seen a style that requires the author’s
name twice when specifying two page ranges, so this is probably safe.

I would not be opposed to changing to
a more general , where the value is the
name of a macro that describes the sort order, e.g.:

...

That’s five more lines of code, but eliminates the dirty “magic”
author macro.

Ultimately, there’s no question that disambiguation is a tough
problem to deal with. However, it’s easier for the programmer to
implement and harder for the style author to f**k up when all the
logic is in the parser and all the author has to do is set a few
options to “true”.

Simon