CSL Questions

Hi Bruce (and others on this list),

I’ve started implementing CSL in Scholar for Firefox, but I have a few
questions regarding some of the elements:

  1. I’m not sure I completely understand the “initialize-with” attribute on
    the tag.

  2. What exactly are the and elements?

  3. I assume is the label for a given contributor, locator, etc., but
    I’m still slightly unclear about the meaning of the type attribute in the
    following construct:

  4. What is the tag?

  5. What is the “author-shorten-with” attribute on the tag?

  6. Should I assume that the bibliography is delimited with periods, or is
    there a field that specifies this that I’m overlooking?

I will likely have more questions as I make more progress, but this should
allow me to get started.

Thanks,
Simon

Hi Simon,

  1. I’m not sure I completely understand the “initialize-with” attribute on
    the tag.

If one uses the attribute, then it switches initialization of (given)
names on. The value simply says how. E.g.:

". " --> "J. B."
" " --> "J B"
"." --> "J.B."
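
A minimal sketch of that behavior in Python, assuming a hypothetical helper `initialize_given_names` (it is not part of citeproc or any CSL processor): each given name is reduced to its first letter followed by the attribute value, and whitespace left after the last initial is trimmed.

```python
def initialize_given_names(given, initialize_with):
    """Reduce each given name to an initial followed by `initialize_with`.

    Hypothetical helper for illustration; not taken from any CSL processor.
    """
    initials = "".join(part[0] + initialize_with for part in given.split())
    # Trim the whitespace left behind after the last initial.
    return initials.rstrip()


print(initialize_given_names("John Brian", ". "))  # J. B.
print(initialize_given_names("John Brian", " "))   # J B
print(initialize_given_names("John Brian", "."))   # J.B.
```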

  2. What exactly are the and elements?

Blah, this is the part I’ve been struggling with, and so am open to
alternative suggestions.

Let’s just talk about the output example.

You cite a famous text from Karl Marx, but the version published in
English in 1995.

The citation will often include information about the original. It
might be simple, like after the year, the original year gets printed;
e.g. “(1995 [1874])”.
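
The simple case could be sketched like this in Python. The function name `format_year` and its parameters are assumptions made for illustration, not anything defined by CSL.

```python
def format_year(year, original_year=None):
    """Render a publication year, adding the original year in brackets.

    Hypothetical helper; the bracket convention follows the example above.
    """
    if original_year is None or original_year == year:
        return f"({year})"
    return f"({year} [{original_year}])"


print(format_year(1995, 1874))  # (1995 [1874])
print(format_year(1995))        # (1995)
```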

Sometimes it’s more elaborate, like a string at the end that says
"originally published as [original language title, publisher, etc.]"

The original-script bit is even more difficult, and I’ve been talking
to some historians that deal with this stuff.

Chicago says something like this:

If you are citing a text from another language and script (let’s say
Kanji), then use romanized names and titles, but also include the
original script.

The reason for this is somewhat obscure, but it has to do with the
fact that romanization isn’t always clear, and so it’s sometimes
easier to include the original title (and name).

See here for an example:

http://www.nanzan-u.ac.jp/SHUBUNKEN/publications/jjrs/pdf/729.pdf

So the design problem is, how to make this possible for users that
need it, but to not in any way make things more complicated for those
that don’t.

The other alternative is to have explicit translated title elements,
and maybe a language conditional on them; to not have this defined in
the global area.

  3. I assume is the label for a given contributor, locator, etc., but
    I’m still slightly unclear about the meaning of the type attribute in the
    following construct:

This is another new, rather experimental, feature.

Basically, Matthias suggested that styles not be language-specific. So
I ripped out the notion of using strings directly within the files. In
the XSLT, for example, I now have a list of variables to deal with
these strings.

In general, this is pretty easy. But one tricky thing is this: how do
you distinguish:

Ed. Jane Doe
Edited By Jane Doe
Jane Doe (Ed)

… and so forth?

This is the solution I came up with. The first two above would be the
verb form, with the first one abbreviated. The last would be the noun form.

Again, this is something I’m not 100% confident about, so am open to
suggestions.
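
To make the verb/noun distinction concrete, here is a rough Python sketch. The form names (`verb`, `verb-short`, `noun`), the `TERMS` table and the placement rule are all assumptions for illustration; in a real style the placement and parentheses would presumably come from prefix/suffix attributes rather than being hard-coded.

```python
# Hypothetical term table: role -> form -> string.
TERMS = {
    "editor": {
        "verb": "Edited By",
        "verb-short": "Ed.",
        "noun": "Ed",
    }
}


def render_contributor(name, role, form):
    """Verb forms precede the name; the noun form follows it in parentheses."""
    label = TERMS[role][form]
    if form.startswith("verb"):
        return f"{label} {name}"
    return f"{name} ({label})"


print(render_contributor("Jane Doe", "editor", "verb-short"))  # Ed. Jane Doe
print(render_contributor("Jane Doe", "editor", "verb"))        # Edited By Jane Doe
print(render_contributor("Jane Doe", "editor", "noun"))        # Jane Doe (Ed)
```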

  4. What is the tag?

Most commonly, for page numbers, but could be for other similar things. E.g.:

(Doe, 1999, p103)

… “103” is the point locator, and “p” its label.

  5. What is the “author-shorten-with” attribute on the tag?

A lot of styles say that if you have more than one reference from the
same author, you should replace the author on all subsequent entries with
two or three em-dashes. I couldn’t think of a better term for that.
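
As a sketch of what a processor might do with that attribute (in Python; the function name, the tuple layout and the sample entries are made up for illustration), assuming the bibliography is already sorted so that entries by the same author are adjacent:

```python
def shorten_repeated_authors(entries, shorten_with="———."):
    """entries: list of (author, rest) tuples, already in sorted order.

    Hypothetical helper: repeats of the previous author are replaced
    by the `shorten_with` string.
    """
    lines = []
    previous_author = None
    for author, rest in entries:
        shown = shorten_with if author == previous_author else author
        lines.append(f"{shown} {rest}")
        previous_author = author
    return lines


# Made-up sample entries, for illustration only.
sample = [
    ("Doe, Jane.", "1999. First title."),
    ("Doe, Jane.", "2001. Second title."),
    ("Roe, Richard.", "2000. Another title."),
]
for line in shorten_repeated_authors(sample):
    print(line)
```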

  6. Should I assume that the bibliography is delimited with periods, or is
    there a field that specifies this that I’m overlooking?

Right now, I am assuming just that, but it might be worth talking
about whether that all ought to be specified explicitly in CSL.
Thoughts?

I will likely have more questions as I make more progress, but this should
allow me to get started.

Sure thing. Keep 'em coming. And if you think of better ideas of how
to solve some of these issues, let me know.

Bruce

My gut instinct is that if you allow specification of a delimiter for the
citation, you should allow specification for the bibliography as well. It
accounts for the style guides that say you should have two spaces after each
period, and it makes things consistent.

I also wonder about footnote citations. Is there some mechanism for dealing
with these? It doesn’t matter for Scholar at the moment, but it should
probably be there.

Here are the problems I’m looking at right now:

  1. The main problem seems to be delimiters. Why is it that the separator for
    the /citationstyle/general/publishers is listed as ":" rather than ": ",
    while the separator for access is listed as ", "? Why is it that the suffix
    for /citationstyle/bibliography/reftype/year is ") " and not ")"? Why is it
    that the book title has "." listed as the suffix, while other titles don’t?
    It’s possible I’m missing something, but consistency would be appreciated.

  2. How do you specify whether the author should have their entire first name
    or only their initials printed in the bibliography? How do you handle MLA
    style, where the first author is in “Last, First” format, but subsequent
    authors are “First Last”?

  3. Is there a file with all of the CSL terms localized into English? It
    might be useful if all CSL-based projects could use the same format. One
    possibility would be to use a DTD and put XML entities in the style file (as
    Mozilla/XUL does things). This would allow additional versatility if style
    authors absolutely had to use something not in the standard DTD, but might
    make things harder for me, depending on how good E4X is at handling entity
    declarations.

  4. What is the purpose of the label attribute on the element?

  5. What do you do with dates like “Summer 2006”?

  6. (This doesn’t really matter for us yet, because we don’t have a
    translator type yet, although I’ve filed a bug report for one.) Why is a
    tag specified in the schema, but not used in the APA CSL file?

Again, I will probably have more questions, but with these answered I might
be able to get basic APA style working.

Thanks,
Simon

  6. Should I assume that the bibliography is delimited with periods, or is
    there a field that specifies this that I’m overlooking?

Right now, I am assuming just that, but it might be worth talking
about whether that all ought to be specified explicitly in CSL.
Thoughts?

My gut instinct is that if you allow specification of a delimiter for the
citation, you should allow specification for the bibliography as well. It
accounts for the style guides that say you should have two spaces after each
period, and it makes things consistent.

Agreed. In that case, the simple solution is to put the period on a
suffix attribute on the item-layout element.

I also wonder about footnote citations. Is there some mechanism for dealing
with these? It doesn’t matter for Scholar at the moment, but it should
probably be there.

It’s totally there! The chicago-a.csl file is an example of one. Output
example:

<http://xbiblio.sourceforge.net/citeproc/examples/chicago-note-a-en.html>

Here are the problems I’m looking at right now:

  1. The main problem seems to be delimiters. Why is it that the separator for
    the /citationstyle/general/publishers is listed as ":" rather than ": ",
    while the separator for access is listed as ", "?

You mean in the apa example?

New York:Routledge
http://ex.net/1, accessed on January 3, 2004

Why is it that the suffix for /citationstyle/bibliography/reftype/year
is ") " and not ")"? Why is it that the book title has "." listed as
the suffix, while other titles don’t? It’s possible I’m missing
something, but consistency would be appreciated.

Two things:

First, it may be that the apa style has picked up some bugs that need
to be fixed.

Second, if they did, these are trivial to fix. In the end, all that
matters is that the output is correct. And when you are figuring out how to
handle padding between fields (prefix/suffix), you have to consider the
fact that not all fields might be present.

  2. How do you specify whether the author should have their entire first name
    or only their initials printed in the bibliography?

Names are configured globally (in general/names), but can be overridden
locally, like this in the apa citation definition.

      <author form="short" suffix=", "/>

How do you handle MLA style, where the first author is in “Last, First”
format, but subsequent authors are “First Last”?

The first attribute here:

<bibliography author-as-sort-order="first-author"
              author-shorten-with="———."
              sort-order="author-date">

  3. Is there a file with all of the CSL terms localized into English?

It’s in citeproc-xsl; the strings.xsl file.

It might be useful if all CSL-based projects could use the same format. One
possibility would be to use a DTD and put XML entities in the style file (as
Mozilla/XUL does things). This would allow additional versatility if style
authors absolutely had to use something not in the standard DTD, but might
make things harder for me, depending on how good E4X is at handling entity
declarations.

I’m rather averse to using entities. One reason is that they assume XML
processing tools, don’t they? I mean, out of the box, Ruby or Python or PHP
aren’t likely to support them.

Also, XML tools are quite finicky about them. For example, if you have
a file that includes entities whose declarations are for some reason
not present, XML tools choke. Your CSL files are then no longer
self-contained (for validation, processing, etc.).

I’m just imagining the strings handled via pretty simple data
structures; maybe something like (in Ruby-ish code):

STRINGS = {
  "en" => {
    "in" => "in",
    "and" => "and"
  }
}

Hell, I bet we could even put this in a JSON/YAML file. If that sounds
good, just let me know what data structure makes most sense (and
Matthias, tell us what you think), and I can put it together.
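
For comparison, here is how such a strings file might be loaded from JSON in Python. This is only a sketch: the JSON layout mirrors the hash above, and the `term` helper with its fall-back-to-the-key behavior is an assumption, not an agreed design.

```python
import json

# Assumed layout: language code -> term key -> string.
STRINGS_JSON = """
{
  "en": {
    "in": "in",
    "and": "and"
  }
}
"""

strings = json.loads(STRINGS_JSON)


def term(key, language="en"):
    """Look up a term; fall back to the key itself if it is missing."""
    return strings.get(language, {}).get(key, key)


print(term("and"))        # and
print(term("in", "en"))   # in
```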

  4. What is the purpose of the label attribute on the element?

Let’s take this example:

         <group label="in">
            <editor/>
            <title type="container" font-style="italic" prefix=" " suffix="."/>
         </group>


It’ll print “In” before that group of references.

  5. What do you do with dates like “Summer 2006”?

In the XSLT, it would just print it. Date normalization in the code
assumes standard YYYY-MM-DD dates.
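
A small Python sketch of that pass-through behavior, assuming a hypothetical `format_date` helper and an output format chosen only for illustration (month names depend on the locale; the default C locale gives English names):

```python
from datetime import datetime


def format_date(raw):
    """Format a YYYY-MM-DD date; pass anything else through unchanged."""
    try:
        parsed = datetime.strptime(raw, "%Y-%m-%d")
    except ValueError:
        # Not a standard date (e.g. a season and a year): print it as given.
        return raw
    return f"{parsed.strftime('%B')} {parsed.day}, {parsed.year}"


print(format_date("2004-01-03"))
print(format_date("Summer 2006"))
```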

  6. (This doesn’t really matter for us yet, because we don’t have a
    translator type yet, although I’ve filed a bug report for one.) Why is a
    tag specified in the schema, but not used in the APA CSL file?

No reason. I’ll add it.

Again, I will probably have more questions, but with these answered I might
be able to get basic APA style working.

OK, I’ll try to add more comments to the schema.

Bruce

Hi Bruce & Simon,

Why is it that the suffix for /citationstyle/bibliography/reftype/year
is ") " and not “)”?

I’ve raised the same issue some time ago. Delimiting whitespace should
be handled consistently, i.e., it should either always occur in the
prefix OR in the suffix. But it should not occur in the prefix for some
elements and in the suffix for other elements. This will likely cause
problems in case some elements are missing.

IMHO, the whitespace should always occur in the prefix of the same
element, not in the suffix of the preceding element (which may not be
present!).

Things get even trickier if multiple consecutive elements are missing
and we won’t be able to account for all cases. But the processor should
be as smart as possible.
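
A rough Python sketch of the prefix-only convention, with a made-up layout and field names: a missing field contributes neither its value nor its padding, and the processor trims whatever leading whitespace is left when the first field is absent.

```python
# Hypothetical layout: (field name, prefix, suffix). Whitespace lives in
# the prefix of each element, never in the suffix of the preceding one.
LAYOUT = [
    ("author", "", ""),
    ("year", " (", ")"),
    ("title", " ", "."),
]


def render_fields(layout, record):
    parts = []
    for name, prefix, suffix in layout:
        value = record.get(name)
        if not value:
            # A missing field contributes neither value nor padding.
            continue
        parts.append(f"{prefix}{value}{suffix}")
    # Drop leading whitespace left over when the first field is missing.
    return "".join(parts).lstrip()


print(render_fields(LAYOUT, {"author": "Doe", "year": "1999", "title": "Title"}))
print(render_fields(LAYOUT, {"author": "Doe", "title": "Title"}))
print(render_fields(LAYOUT, {"year": "1999", "title": "Title"}))
```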

  3. Is there a file with all of the CSL terms localized into English?

It might be useful if all CSL-based projects could use the same
format. One possibility would be to use a DTD and put XML entities
in the style file (as Mozilla/XUL does things).

I’m rather averse to using entities. One reason is that they assume XML
processing tools, don’t they? I mean, out of the box, Ruby or Python or PHP
aren’t likely to support them.

Yep, I agree. While entities seem nice they make things a lot harder
when processing them in a scripting language such as PHP et al.

I’m just imagining the strings handled via pretty simple data
structures; maybe something like (in Ruby-ish code):

STRINGS = {
  "en" => {
    "in" => "in",
    "and" => "and"
  }
}

Hell, I bet we could even put this in a JSON/YAML file. If that sounds
good, just let me know what data structure makes most sense (and
Matthias, tell us what you think), and I can put it together.

From a developer point of view, the best option would be a simple text
file with a unique data format that can be parsed easily. In refbase,
the files containing our locales look like this:

"CallNumber"=>"Call Number",
"Conference"=>"Conference",
"CreationDate"=>"Date Created",
"CreationTime"=>"Time Created",
"year"=>"year",
"Year"=>"Year",

Such a structure is simple and very easy to parse into an array. In
fact, it is already in array format and I just need to add the enclosing
braces and have PHP evaluate it as code.

For each language, there’s a separate file. It would be a bit more
complicated but still not too difficult if all languages are stored
within the same file.
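
For a language that cannot simply evaluate those lines as code, a small parser is enough. The following Python sketch assumes straight double quotes and no escaped quotes inside keys or values; the function name `parse_locale` is made up for illustration.

```python
import re

# One entry per line: "Key"=>"Value", (trailing comma optional).
LOCALE_LINE = re.compile(r'^\s*"([^"]*)"\s*=>\s*"([^"]*)"\s*,?\s*$')


def parse_locale(text):
    """Parse lines of the form "Key"=>"Value", into a dict."""
    entries = {}
    for line in text.splitlines():
        match = LOCALE_LINE.match(line)
        if match:
            entries[match.group(1)] = match.group(2)
    return entries


sample = '''
"CallNumber"=>"Call Number",
"Conference"=>"Conference",
"year"=>"year",
"Year"=>"Year",
'''
print(parse_locale(sample))
```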

Bruce, could you give us an example for JSON/YAML?

Matthias

IMHO, the whitespace should always occur in the prefix of the same
element

Agreed. I’ll update the files and put this in some documentation
somewhere.

Bruce, could you give us an example for JSON/YAML?

en:
  edited: "Edited By"
  editor: "Editor"

Most languages can parse this into native data structures (not sure
about PHP though). In the above, you’d get nested hashes.

Bruce

Bruce, could you give us an example for JSON/YAML?

en:
  edited: "Edited By"
  editor: "Editor"

The nice thing about the above format is that it’s easier to the human
eye than my example:

"edited"=>"Edited By",
"editor"=>"Editor",

Also, it’s nice to have a fully programming-language agnostic format.

Most languages can parse this into native data structures (not sure
about PHP though). In the above, you’d get nested hashes.

If there’s no native support in a given programming language, there’s
always the option to transform this into another format by use of some
quick regular expressions.

That said, I’d be fine with JSON/YAML.

Matthias

Checked in updates to apa.csl.

Bruce

This issue seems very complex. For example, how do you model something like:

John and Jane Doe, eds.

You need to specify whether the label should be capitalized, the prefixes
and suffixes, and the behavior when the label is pluralized (unless we
simply assume multiple editors are always eds., which may be a safe
assumption).

Perhaps all elements should have both prefixes and suffixes. For
locators, too, you could have:

p103
p. 103

And there’s a similar issue with pluralization. Does p. become pp. when
there are multiple numbers? Should it be pp103-105 or p103-105? Do all style
formats do this the same way?

I must admit I’m slightly ambivalent on the localization issue, because I
feel that localization may make the styles harder to design in the first
place, because you can’t necessarily tell how things behave simply by
looking at them. I suppose a good GUI CSL editor (I haven’t looked at the
one in SVN yet) may be able to take care of this problem.

I also think there should definitely be a way to override the default
localized labels, because for some styles, it may be the only way. Perhaps
we should think of these localizations as something similar to CSS, where
the parser has a default localized YAML file, but you could embed YAML at
the top of the CSL file if you needed to override the default.
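
The CSS-like cascade could be as simple as the following Python sketch. The term keys, the override value and the function name are assumptions for illustration; the point is only that style-level strings win over the processor’s defaults.

```python
# Processor-wide defaults (hypothetical keys and values).
DEFAULT_TERMS = {"edited": "Edited By", "editor": "Editor"}


def resolve_terms(defaults, overrides):
    """Return the defaults with any style-specific overrides applied."""
    merged = dict(defaults)  # copy, so the defaults stay untouched
    merged.update(overrides)
    return merged


# Strings a style might embed to override the defaults.
style_overrides = {"editor": "Ed."}
terms = resolve_terms(DEFAULT_TERMS, style_overrides)
print(terms["editor"])  # Ed.
print(terms["edited"])  # Edited By
```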

Simon

  2. How do you specify whether the author should have their entire first name
    or only their initials printed in the bibliography?

Names are configured globally (in general/names), but can be overridden
locally, like this in the apa citation definition.

      <author form="short" suffix=", "/>

What parameter controls this in /general/names, and how do you choose if you
want the “Doe, John R.” “Doe, J.R.” or just “Doe”? Is this what
"initialize-with" is for?

Also, I’m not sure I’m completely understanding the sort-separator. What do
you mean by “sort order differs from display order”? Can you provide an
example?

How do you handle MLA style, where the first author is in “Last, First”
format, but subsequent authors are “First Last”?

The first attribute here:

<bibliography author-as-sort-order="first-author"
              author-shorten-with="———."
              sort-order="author-date">

Let me just verify I have this right, then:

all = Doe, John, Doe, Jane, and Doe, Fred
first-author = Doe, John, Jane Doe, and Fred Doe
none = John Doe, Jane Doe, and Fred Doe

If this is correct, shouldn’t APA use “all” and not “first-author”?

Once the specification is finalized, it might be useful to create a set of
CSL stress tests that would ensure that all CSL parsers will present the
proper output even in more unlikely circumstances, in the same way the Acid2
test ensures proper CSS support in web browsers.

Thanks,
Simon

This issue seems very complex.

Welcome to the wonderful world of citation formatting!

For example, how do you model something like:

John and Jane Doe, eds.

You need to specify whether the label should be capitalized,

I was thinking that the labels should be capitalized by default in the
YAML file, but that an attribute can indicate to lowercase. That’s easy
to do in all programming languages.

the prefixes
and suffixes, and the behavior when the label is pluralized

Correct.

Hi Simon,

And there’s a similar issue with pluralization. Does p. become pp.
when there are multiple numbers? Should it be pp103-105 or p103-105?
Do all style formats do this the same way?

No, the format (and plural style) of the page indicator depends on the
style and can differ heavily between styles.

I must admit I’m slightly ambivalent on the localization issue,
because I feel that localization may make the styles harder to design
in the first place, because you can’t necessarily tell how things
behave simply by looking at them.

In private communication with Bruce, I proposed that the English default
strings would be included within the CSL file. This way, a CSL file would
only require the additional global localization file if you want to use
something other than the default English strings.

I also think there should definitely be a way to override the default
localized labels, because for some styles, it may be the only way. Perhaps
we should think of these localizations as something similar to CSS, where
the parser has a default localized YAML file, but you could embed YAML at
the top of the CSL file if you needed to override the default.

Exactly. You could simply edit the default strings at the top of the CSL
file to customize some strings for a particular style.

Matthias

  2. How do you specify whether the author should have their entire first name
    or only their initials printed in the bibliography?

Names are configured globally (in general/names), but can be overridden
locally, like this in the apa citation definition.

      <author form="short" suffix=", "/>

What parameter controls this in /general/names, and how do you choose if you
want the “Doe, John R.” “Doe, J.R.” or just “Doe”? Is this what
"initialize-with" is for?

Yup. The order is configured in the local elements (citation,
bibliography).

Note, however, CSL has no notion of, for example, middle names. One
either initializes (given) names, or not.

Also, I’m not sure I’m completely understanding the sort-separator. What do
you mean by “sort order differs from display order”? Can you provide an
example?

Yeah: Asian names. To sort “Mao Zedong” you sort on Mao, which is the
family name. In this case, sort order = display order. Western names
might even be the exception in the grand scheme of things.

I’m trying to leave room to get this right for international users.
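
A Python sketch of the distinction, using made-up name records with a hypothetical `family_first` flag: sorting always uses the family name, and a sort separator is only needed when the display order puts the given name first.

```python
# Hypothetical name records; `family_first` marks names displayed
# family name first.
names = [
    {"family": "Mao", "given": "Zedong", "family_first": True},
    {"family": "Doe", "given": "Jane", "family_first": False},
    {"family": "Marx", "given": "Karl", "family_first": False},
]


def display(name):
    if name["family_first"]:
        return f"{name['family']} {name['given']}"
    return f"{name['given']} {name['family']}"


def sort_form(name, sort_separator=", "):
    """Family name first; the separator only appears when the sort
    order differs from the display order."""
    if name["family_first"]:
        return display(name)
    return f"{name['family']}{sort_separator}{name['given']}"


ordered = sorted(names, key=lambda n: (n["family"], n["given"]))
for name in ordered:
    print(sort_form(name), "|", display(name))
```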

How do you handle MLA style, where the first author is in “Last, First”
format, but subsequent authors are “First Last”?

The first attribute here:

<bibliography author-as-sort-order="first-author"
              author-shorten-with="———."
              sort-order="author-date">

Let me just verify I have this right, then:

all = Doe, John, Doe, Jane, and Doe, Fred
first-author = Doe, John, Jane Doe, and Fred Doe
none = John Doe, Jane Doe, and Fred Doe

Right.
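
The three values could be sketched in Python as follows. The function name and the tuple layout are assumptions, and the joining of names (comma-delimited with a final "and") is deliberately simplified to match the three-author example.

```python
def format_author_list(authors, author_as_sort_order="none"):
    """authors: list of (given, family) tuples.

    Hypothetical helper: inverts all names, only the first, or none.
    """
    rendered = []
    for index, (given, family) in enumerate(authors):
        inverted = (
            author_as_sort_order == "all"
            or (author_as_sort_order == "first-author" and index == 0)
        )
        rendered.append(f"{family}, {given}" if inverted else f"{given} {family}")
    if len(rendered) == 1:
        return rendered[0]
    return ", ".join(rendered[:-1]) + ", and " + rendered[-1]


authors = [("John", "Doe"), ("Jane", "Doe"), ("Fred", "Doe")]
for mode in ("all", "first-author", "none"):
    print(mode, "=", format_author_list(authors, mode))
```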

If this is correct, shouldn’t APA use “all” and not “first-author”?

Oops; yes! And names should be initialized with “.”. Fixed and checked
in!

Once the specification is finalized, it might be useful to create a set of
CSL stress tests that would ensure that all CSL parsers will present the
proper output even in more unlikely circumstances, in the same way the Acid2
test ensures proper CSS support in web browsers.

Agreed.

Bruce

No, the format (and plural style) of the page indicator depends on the
style and can differ heavily between styles.

Examples? I mean, yes, punctuation can vary, but for pages, I’ve only
ever seen “p”/“pp” and “page”/“pages.”

In private communication with Bruce, I proposed that the English default
strings would be included within the CSL file. This way, a CSL file would
only require the additional global localization file if you want to use
something other than the default English strings.

And this is a modification of that, saying that software implements the
string substitution, but that the CSL file can include exceptions.

I want styles to be self-contained.

Bruce

No, the format (and plural style) of the page indicator depends on
the style and can differ heavily between styles.

Examples? I mean, yes, punctuation can vary, but for pages, I’ve only
ever seen “p”/“pp” and “page”/“pages.”

Here’s the common case (single page: p, multiple pages: pp),
examples are given for a book chapter and a whole book using a common
Springer style:

Arrigo KR (2003) Primary production in sea ice. In: Thomas DN,
Dieckmann GS (eds) Sea ice - an introduction to its physics, chemistry,
biology and geology. Blackwell Science Ltd, Oxford, pp 143-183

Clarke KR, Warwick RM (1994) Change in marine communities: An approach
to statistical analysis and interpretation. Plymouth Marine Laboratory,
Plymouth, 144 pp

And here’s a rather weird case (from publisher “Inter-Research” that
publishes e.g. the highly ranked journal “Marine Ecology Progress
Series”, www.int-res.com) where, for book chapters, ‘p’ is used to
indicate a page range:

Arrigo KR (2003) Primary production in sea ice. In: Thomas DN,
Dieckmann GS (eds) Sea ice - an introduction to its physics, chemistry,
biology and geology. Blackwell Science Ltd, Oxford, p 143-183

Clarke KR, Warwick RM (1994) Change in marine communities: An approach
to statistical analysis and interpretation. Plymouth Marine Laboratory,
Plymouth, 144 pp

For journal articles, both styles use page ranges without any prefix,
such as:

Assur A (1958) Composition of sea ice and its tensile strength. Nat Res
Council Publ 598:106-138
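
One way to model this is to let each style state its labels for a single page and for a page range explicitly. The Python sketch below uses made-up style keys and only covers the label-before-number cases shown above; the "144 pp" total-page form for whole books is not handled.

```python
# Hypothetical per-style settings: (single-page label, page-range label).
PAGE_LABELS = {
    "springer-chapter": ("p", "pp"),
    "inter-research-chapter": ("p", "p"),
    "journal-article": ("", ""),
}


def format_pages(pages, style):
    """Prefix the page locator with the label the style asks for."""
    single, plural = PAGE_LABELS[style]
    label = plural if "-" in pages else single
    # strip() removes the separating space when the label is empty.
    return f"{label} {pages}".strip()


print(format_pages("143-183", "springer-chapter"))        # pp 143-183
print(format_pages("143-183", "inter-research-chapter"))  # p 143-183
print(format_pages("106-138", "journal-article"))         # 106-138
```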

In private communication with Bruce, I proposed that the English
default strings would be included within the CSL file. This way, a
CSL file would only require the additional global localization file
if you want to use something other than the default English strings.

And this is a modification of that, saying that software implements the
string substitution, but that the CSL file can include exceptions.

I want styles to be self-contained.

Yes, I agree that this is very important.

Matthias

Ah f**king hell. There are some traditions in styles that are
bullshit, and this is a perfect example! I can tell you that if CSL has
to support this sort of stuff, I’ll have to make it more verbose and
complex. It will mean that effectively every label has to be explicit
throughout the templates.

Actually, if I go to the author instructions here:

http://www.int-res.com/journals/meps/guidelines-for-meps-authors/

… there’s nothing specific about how to indicate page numbers, and
there’s only a single example with that messed up “p”. I wonder if
that’s essentially a typo?

Care to ask the publisher, explaining the problem?

Bruce

And here’s a rather weird case (from publisher “Inter-Research” that
publishes e.g. the highly ranked journal “Marine Ecology Progress
Series”, www.int-res.com) where, for book chapters, ‘p’ is used to
indicate a page range:

Arrigo KR (2003) Primary production in sea ice. In: Thomas DN,
Dieckmann GS (eds) Sea ice - an introduction to its physics,
chemistry, biology and geology. Blackwell Science Ltd, Oxford, p
143-183

Ah f**king hell. There are some traditions in styles that are
bullshit, and this is a perfect example!

I agree.

Actually, if I go to the author instructions here:

http://www.int-res.com/journals/meps/guidelines-for-meps-authors/

… there’s nothing specific about how to indicate page numbers, and
there’s only a single example with that messed up “p”. I wonder if
that’s essentially a typo?

No, unfortunately not. I just checked some citations in a handful of
MEPS papers and all did use a single ‘p’ for book chapters.

Care to ask the publisher, explaining the problem?

I don’t think this would help. And, as you know, it’s only one of
many weird cases that exist with citation styles.

It’s agencies such as CrossRef that can unify this stuff but I don’t
think that I’ll be able to do anything about it.

Matthias

I think for that style, it would mean something like:

But then that suggests requiring number regardless. Likewise for
contributors:

One of the problems of these styles, I think, is they were invented
back when people did this stuff by hand! It strangely enough becomes
more difficult to automate, despite the fact it makes NO sense!

Bruce

Hello all!

because I feel that localization may make the styles harder to design in
the first place, because you can’t necessarily tell how things behave simply
by looking at them. I suppose a good GUI CSL editor (I haven’t looked at
the one in SVN yet) may be able to take care of this problem.

Johan, you there? Any opinions?

Yup, I’m still lurking around on this list. Haven’t been able to do
much for this lately, except keeping up-to-date with the latest
things. Too bad the PyULike stuff didn’t work out…

I do feel indeed that adding localization does complicate matters
quite a bit, without adding very much benefit. Especially when for
example two styles translate differently into a similar language, e.g.
if one Dutch style wants to translate “author” to “auteur” but another
wants to use “schrijver”.

A publication is pretty much always in one language, so placing
localized strings in a style makes sense, as after all in the end the
ideal is that a journal supplies the CSL file.

I would argue against adding localization.

I recently picked up a bit of Python, and I was thinking that it might
be useful to perhaps use the PyObjC bridge for the Cocoa CSL editor,
as it might allow sharing code with the Python implementation of
CiteProc, plus it would take away the dependency on having an XSLT 2
processor present (which isn’t there on a default OS X machine).

Would you grant me SVN rights to that part of the SVN tree Bruce? I’d
like to play around with that a bit.

As for the CSL editor, I do of course need the CSL to be in a
reasonably stable state. Let me know when you feel such is the case.

Greetings,

Johan
http://www.johankool.nl/

I would argue against adding localization.

:-) Nothing like democracy.

It’s easy enough to add back the strings and remove localization, so
let’s revisit this:

Simon and I are ambivalent
Matthias says yes
Johan says no

From the standpoint of software, it doesn’t make that much difference.
If I create a citation style object like

CitationStyle.new(name="apa", language="en")

… it would work the same in either approach. One way would look for a
different file, while another way looks up some strings.

Localization makes styles potentially more complex, but no localization
means the necessity of having styles for each language (IF there are
more than one).

From a user perspective, I also don’t think there’d be that much
difference, except that in the style-per-language approach, you might end
up in a situation where some language has no style for it. The software
could always default to English if needed.
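
In the style-per-language approach, that fallback might look like this Python sketch. The `name-language.csl` file naming scheme, the function name and the sample file set are all assumptions for illustration.

```python
def pick_style_file(name, language, available, default_language="en"):
    """Return the style file for the requested language, else the default.

    Assumes a hypothetical "<name>-<language>.csl" naming scheme.
    """
    wanted = f"{name}-{language}.csl"
    if wanted in available:
        return wanted
    return f"{name}-{default_language}.csl"


# Made-up set of installed style files.
available = {"apa-en.csl", "apa-de.csl"}
print(pick_style_file("apa", "de", available))  # apa-de.csl
print(pick_style_file("apa", "nl", available))  # apa-en.csl
```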

I recently picked up a bit of Python, and I was thinking that it might
be useful to perhaps use the PyObjC bridge for the Cocoa CSL editor,
as it might allow sharing code with the Python implementation of
CiteProc, plus it would take away the dependency on having an XSLT 2
processor present (which isn’t there on a default OS X machine).

Makes sense. I’ve heard good things about the Python bridge.

Would you grant me SVN rights to that part of the SVN tree Bruce? I’d
like to play around with that a bit.

I thought you had rights? Or are you saying you need rights to the
python-py area, and do not now?

Can you confirm before I go and look?

Bruce