Chapters, volumes and editions

I’m still struggling trying to format these for some of the styles (3rd
edition, edition 3, 3rd ed., 3 etc), so here are some ideas on a way
forward.

I think we need to be able to treat them either as text or numbers. Text for
backwards compatibility and where you don’t care about the format too much.

The most radical idea is to introduce a new formatting construct, that
emulates text to a degree.
So where you can have

you could instead do it with something like

to get something like 3rd
The form should probably also include roman (i, ii, iii), numeric (1, 2, 3),
ordinal and possibly others.

In Zotero, this would pick out the number component of the edition field
with a regexp, and format that.
So from “3rd ed.” or “edition 3” it would match 3, and then you can format
that how you like. It might be possible to detect roman numerals too for
extra credit.
This could be tested with to test if there was a
number in there, and so allow just text as an alternative.

The same can apply to the chapter and volume numbers and any other fields
that can possibly be purely numeric.
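
A minimal Python sketch of the regexp idea, for illustration only. The function names (extract_number, format_number and so on) are hypothetical and are not part of CSL or Zotero; the sketch simply takes the first run of digits in the field and formats it as numeric, ordinal (English suffixes assumed) or roman.

```python
import re

# Hypothetical helpers for illustration; not CSL or Zotero API.
_DIGITS = re.compile(r"\d+")

_ROMAN = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
          (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
          (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]


def extract_number(field):
    """Return the first run of digits in the field as an int, or None."""
    match = _DIGITS.search(field)
    return int(match.group()) if match else None


def to_ordinal(n):
    """English ordinal: 1st, 2nd, 3rd, 4th, 11th, 12th, 13th, 21st, ..."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def to_roman(n):
    """Lower-case roman numeral for 1 <= n < 4000."""
    if not 1 <= n < 4000:
        raise ValueError("roman numerals only cover 1..3999")
    parts = []
    for value, glyph in _ROMAN:
        count, n = divmod(n, value)
        parts.append(glyph * count)
    return "".join(parts)


def format_number(field, form="numeric"):
    """Format the number found in the field; empty string if there is none."""
    n = extract_number(field)
    if n is None:
        return ""
    if form == "ordinal":
        return to_ordinal(n)
    if form == "roman":
        return to_roman(n)
    return str(n)
```

With this sketch, both “3rd ed.” and “edition 3” give 3, 3rd or iii depending on the form, while a field with no digits gives an empty string, which is what a style could test for.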

Alternatives would be to overload the current text with some more of these
formatting constructs, but it feels slightly cleaner to have a separate
number as then you can also specify which variables the number directive and
formatting are reasonable to apply to (volume, number, chapter,
no-of-volumes, edition etc).

A possible more complex way would be to structure like names and dates, with
something like




which would allow appropriate labels to be attached to things. That might be
a step too far though.

Thoughts?
Julian.

I’m still struggling trying to format these for some of the styles
(3rd edition, edition 3, 3rd ed., 3 etc), so here are some ideas on
a way forward.

I think we need to be able to treat them either as text or numbers.
Text for backwards compatibility and where you don’t care about the
format too much.

The most radical idea is to introduce a new formatting construct,
that emulates text to a degree.
So where you can have

you could instead do it with something like

to get something like 3rd
The form should probably also include roman (i, ii, iii),
numeric (1, 2, 3), ordinal and possibly others.

In Zotero, this would pick out the number component of the edition
field with a regexp, and format that.
So from “3rd ed.” or “edition 3” it would match 3, and then you can
format that how you like. It might be possible to detect roman
numerals too for extra credit.
This could be tested with to test if there
was a number in there, and so allow just text as an alternative.

The same can apply to the chapter and volume numbers and any other
fields that can possibly be purely numeric.

This sounds reasonable to me. It might be a little while before I have
time to implement it in Zotero, but if there are no objections I can
implement this in the schema.

Alternatives would be to overload the current text with some more of
these formatting constructs, but it feels slightly cleaner to have a
separate number as then you can also specify which variables the
number directive and formatting are reasonable to apply to (volume,
number, chapter, no-of-volumes, edition etc).

A possible more complex way would be to structure like names and
dates, with something like




which would allow appropriate labels to be attached to things. That
might be a step too far though.

Unless there’s a case where this approach can model a style but the
approach above cannot, I think I like the other approach more. Keeping
CSL as “flat” as possible simplifies the syntax and schema and makes
parsers easier to write.

Simon

Simon Kornblith wrote:

This sounds reasonable to me. It might be a little while before I have
time to implement it in Zotero, but if there are no objections I can
implement this in the schema.

Sure, go for it.

But before doing that, I’d just like to clarify something:

Might there be some ambiguity in allowing the same variable in
both cs:text and cs:number? And what happens if you have non-numeric
editions?

Unless there’s a case where this approach can model a style but the
approach above cannot, I think I like the other approach more. Keeping
CSL as “flat” as possible simplifies the syntax and schema and makes
parsers easier to write.

I agree. This is another case where we’ve been down that road but
rejected it.

Bruce

Simon Kornblith wrote:

This sounds reasonable to me. It might be a little while before I have
time to implement it in Zotero, but if there are no objections I can
implement this in the schema.

Sure, go for it.

But before doing that, I’d just like to clarify something:

Might there be some ambiguity in allowing the same variable in
both cs:text and cs:number?

Not as far as I can tell. It should, after all, represent the same data.

And what happens if you have non-numeric
editions?

By Julian’s approach, it sounds like
wouldn’t display anything, but would, and
you could use to determine if a number existed.
Alternatively, I think that simply displaying the text representation
instead of the numeric representation would be fine (and, in the
interest of compactness, perhaps preferable, since I can’t think of
any situations where this wouldn’t be the desired behavior), as long
as the parser stripped “edition,” “ed,” etc. from the text.

Simon

And what happens if you have non-numeric
editions?

By Julian’s approach, it sounds like
wouldn’t display anything, but would, and
you could use to determine if a number existed.

Yes - those were my initial thoughts. The text version would always display
what was in the field verbatim, subject to the normal format modifications.
The number version would only work if there is a detectable number present. The
advantages I see are that scrapers and humans can put in what they like
within reason. So a “student edition” could be present there, or “revised
edition”. A “student edition” would not produce any output under the number
directive. A “3rd edition” would produce a number “3” unless formatting was
applied.
The only issue I see is where the number is present, but not relevant, such
as “$3.99 edition” or “10th anniversary edition” or some such weirdness. I
don’t think these are common enough to worry about.

You could probably apply the number directive and the included formats to the
following variables:

  • “page” - may not be useful as page already has its own processing,
    and page ranges make this tricky.
  • “locator” - likewise.
  • “version” - not sure quite how this is used, so it might be a
    candidate.
  • “volume” - yes
  • “number-of-volumes” - yes
  • “issue” - yes
  • “chapter-number” - yes
  • “edition” - yes
  • “number” - yes
  • “ISBN” - maybe, but probably not?
  • “citation-number” - maybe.

Some of these might just be for completeness; I can’t think of a good reason
why you would do it with citation-number for instance, unless you wanted
roman numerals for citations.
ISBN is also probably highly dubious, as it often gets spaces and hyphens in it.

Alternatively, I think that simply displaying the text representation
instead of the numeric representation would be fine (and, in the
interest of compactness, perhaps preferable, since I can’t think of
any situations where this wouldn’t be the desired behavior), as long
as the parser stripped “edition,” “ed,” etc. from the text.

I’m not quite following you here.

If you update the schema, I think we need “edition” adding to the terms.
Also useful would be “Internet” “cited” “available at” “letter” “interview”
“thesis” from a quick browse through the current csl <text value= list.
I think chapter, anon, volume are currently catered for?

Julian.

On Nov 30, 2007, at 12:27 PM, Julian Onions wrote:

Alternatively, I think that simply displaying the text representation
instead of the numeric representation would be fine (and, in the
interest of compactness, perhaps preferable, since I can’t think of
any situations where this wouldn’t be the desired behavior), as long
as the parser stripped “edition,” “ed,” etc. from the text.

I’m not quite following you here.

Instead of requiring some kind of conditional to display the text
variable if no number exists within it, why not just present it? For
example, if I have:

“3rd ed”
“Revised edition”

The other approach would require:

This would produce:

-> “3rd edition”
-> “Revised edition”

Alternatively, we could simply use:

with the understanding that, if no number exists, the text will be
printed instead, sans “edition,” “ed,” or anything else of the like.
The output would then be the same as above, but with less XML to be
repeated between styles. I can’t think of any situations where this
would be systematically unsatisfactory off the top of my head, but it
might be a little confusing.
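
The implicit fallback described here can be sketched in Python as follows. This is an assumption-laden illustration, not the Zotero implementation: the label list in _LABEL, the English ordinal suffixes and the function names are all hypothetical.

```python
import re

# Assumed label list; a real parser would need locale-aware terms.
_LABEL = re.compile(r"\b(?:edition|edn|ed)\b\.?", re.IGNORECASE)
_DIGITS = re.compile(r"\d+")


def _ordinal(n):
    """English ordinal suffix, handling 11th, 12th and 13th."""
    if 10 <= n % 100 <= 20:
        return f"{n}th"
    return f"{n}{ {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th') }"


def edition_value(field):
    """Ordinal if the field holds a number, else the text minus any label."""
    match = _DIGITS.search(field)
    if match:
        return _ordinal(int(match.group()))
    # No number: pass the text through, stripping "edition", "ed." etc.
    return " ".join(_LABEL.sub("", field).split())


def render_edition(field, term="edition"):
    """Append the style's own term to whatever edition_value produced."""
    value = edition_value(field)
    return f"{value} {term}" if value else ""
```

Under these assumptions render_edition("3rd ed") and render_edition("Revised edition") come out as “3rd edition” and “Revised edition” with no conditional in the style. A field such as “10th anniversary edition” would still be reduced to “10th edition”, which is the kind of case that makes the implicit rule potentially confusing.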

Simon

I’ve been trying it out - I’ve got a basic implementation in Zotero now.
It seems to do what I want.

The fragment





produces

EDITION SHORT=2nd Edition LONG=2nd Edition
ROMAN:ii ROMAN:II ORDINAL:2nd NUMERIC:2

Miss out the 2 from 2nd Edition, and the last line disappears.

There is the case to finish still.

Julian.

Julian Onions wrote:

I’ve been trying it out - I’ve got a basic implementation in Zotero now.
It seems to do what I want.

I don’t have time to decipher your example. Can you present how you
would propose an average style writer would implement the feature?

I agree with Simon that having to add funky conditionals would not be good.

Bruce

I don’t have time to decipher your example. Can you present how you
would propose an average style writer would implement the feature?

I agree with Simon that having to add funky conditionals would not be
good.

I’ve now finished the Zotero implementation, and with a few additions to the
terms (to add edition, and a plural form of the short form volume), I can
get more or less what I want. This is from chicago-author-date on a few of the
examples I have.

Herbert, Martin, and Karen Harper-Dorton. 2002. Working with Children,
Adolescents and Their Families. 3rd ed. BPS Blackwell.

Weber, M. , M. de Burlet, and O. Abel. 1928. Die Säugetiere. 2nd ed. 2
vols. Jena:
Gustav Fischer.

Fussner, F. Smith. 1967. The Historical Revolution: English Historical
Writing and Thought, 1580-1640. New Ed. London: Routledge and Paul.

Kirk, J, and R. J. Munday. 1988. Narrative analysis. 3rd ed. Bloomington:
Indiana University Press.

Aronson, Jeffrey K., and M N G Dukes, eds. 2006. Meyler’s Side Effects of
Drugs, Fifteenth Edition:The International Encyclopedia of Adverse Drug
Reactions and Interactions 6 Volume Series. 15th ed. Elsevier Science.

Chadwick, H. Munro, and N. Kershaw Chadwick. 1986. The Growth of Literature.
Reprint. 3 vols. Cambridge: Cambridge University Press.

The new bits are the number of volumes and the edition. Both are done with
macros












The number directive is also useful as it smooths out what’s been scraped
or entered. The above examples had in the entries things like “2nd ed.”, “3rd
Rev Ed.”, “edition 2” etc., all of which can be reformatted into a consistent
style.

Julian.

Julian Onions wrote:

I could live with this, but the distinction between text and number
still seems a little odd to me, as does the need for the conditional.

So I can see three options to dealing with this. Let’s call the above
option 1.

Option 2:

Only have “edition” be a “number”, and have a rule that if the datatype
is in fact not a number, then the content gets passed through as is. But
then that runs into the problem of spurious “edition” and such labels,
which we’d need a rule for as well.

Option 3:

I guess edition is perhaps the only variable whose datatype is
ambiguous. Still another option is to not allow that ambiguity (e.g.
two different variables: “edition_number” and “edition_description” or
some such).

Any votes? I’d like to wrap this up ASAP, since we have some other
issues to attend to.

Bruce

Julian Onions wrote:

I could live with this, but the distinction between text and number
still seems a little odd to me, as does the need for the conditional.

It does feel a little unwieldy, but on the other hand it does give full
control.

So I can see three options to dealing with this. Let’s call the above
option 1.

Option 2:

Only have “edition” be a “number”, and have a rule that if the datatype
is in fact not a number, then the content gets passed through as is. But
then that runs into the problem of spurious “edition” and such labels,
which we’d need a rule for as well.

Indeed - I prefer explicit rules rather than implicit ones as I can see
what’s happening.

Option 3:

I guess edition is perhaps the only variable whose datatype is
ambiguous. Still another option is to not allow that ambiguity (e.g.
two different variables: “edition_number” and “edition_description” or
some such).

This might work.

I ran into the same thing with volumes. The field may have in it “3”, “3rd
vol”, “volume 3” or “vol. 3”, whereas the citation format needs “vol 3”. In
this case you could force the volume variable to be purely numeric without
losing much I think, but that would rely on some tighter constraints on the
inputs. Volume tends to refer to two different things really: volumes of
books, which are not the norm, and volumes of journals, which are normally
mandatory. It’s the former that is usually the awkward one. Anyway, it tends
to end up with the question of whether you should append a “vol.” to the
output, or assume it’s part of the field.

Anyone know what BibTeX or EndNote do in these sorts of cases?

I’m wondering if a variant on the option-1 could be used for dates. We need
some way of detecting if there is a regular date or a more free-form type.
Something like

// June, 1990



// c.1873-1874
to output the verbatim date.

Julian.

Julian Onions wrote:

I ran into the same thing with volumes. The field may have in it “3”,
“3rd vol”, “volume 3” or “vol. 3”, whereas the citation format needs
“vol 3”.

This is not CSL’s problem. It’s a bug with the user and/or the
application that allows them to enter data this way.

It’s for this reason that perhaps CSL ought to be explicit about the
kind of datatypes it expects.

In this case you could force the volume variable to be purely numeric
without losing much I think, but that would rely on some tighter
constraints on the inputs. Volume tends to refer to two different things
really: volumes of books, which are not the norm, and volumes of
journals, which are normally mandatory.

There are a number of ambiguous data fields in Zotero like this. The
"pages" one is another.

It’s the former that is usually
the awkward one. Anyway, it tends to end up with the question of whether
you should append a “vol.” to the output, or assume it’s part of the field.

A field like “volume” or “issue” should enforce integers in my view, and
CSL should always expect that as input.

Anyone know what BibTeX or EndNote do in these sorts of cases?

I don’t. James, you there?

I’m wondering if a variant on the option-1 could be used for dates. We
need some way of detecting if there is a regular date or a more
free-form type. Something like

// June, 1990



// c.1873-1874
to output the verbatim date.

I really don’t want to go down this path. In my view, this becomes a
hack, and reinforces my worry about the solution for numbers.

Bruce

A field like “volume” or “issue” should enforce integers in my view, and
CSL should always expect that as input.

Anyone know what BibTeX or EndNote do in these sorts of cases?

I don’t. James, you there?

Not sure about EndNote, but BibTeX, AFAICS, doesn’t enforce any
datatypes. I just ran an Article with people’s names in every ‘numerical’
field and it processed all the way through without any warnings. I
think this encourages the user to ‘hack’ regular styles, rather than
encouraging the writing of correct styles.

FWIW, I’m +1 on enforcing datatypes in the regular fields, that will
strongly encourage people to do the right thing (opinionated software)
and +1 on having separate alternative free-text fields, which are
passed through verbatim, for each field. So that’s Option 3. It’s a
bit more complex, especially in the UI and it adds logic to the
styles, but at least it isn’t hidden.

(volume-numerical and volume-verbatim?)

–J

I’m wondering if a variant on the option-1 could be used for dates. We
need some way of detecting if there is a regular date or a more
free-form type. Something like

// June, 1990



// c.1873-1874
to output the verbatim date.

I really don’t want to go down this path. In my view, this becomes a
hack, and reinforces my worry about the solution for numbers.

Bruce

I thought Zotero already had an elegant way to represent dates: ISO 8601
followed by optional free-form text. (I think that’s what Zotero uses. It’s
also what I’m using for a non-Zotero project that will use CSL.)

Even if there is free-form text, you’d like a standardized date for sorting.

Newspaper: 1999-12-01
Book: 1999-00-00 <- Sort before a Jan 1, 1999 newspaper.
Monthly magazine: 1985-01-00 <- Sort as if published at beginning of Jan 1985.
Quarterly magazine: 2002-06-00 <- Sort as if published at beginning of Jun 2002.
Quarterly journal that uses seasonal names for issues: 2002-06-00 Summer 2002 <- Sort as June 2002, but display the alternate text.
Imputed date: 1584-00-00 [1584] <- Sort as 1584.
Uncertain date: 1492-00-00 1492? <- Sort as 1492.
Date range: 1870-00-00 1870-1873 <- Display whole range, but sort as 1870.
Named issue: 2002-12-00 Holiday Issue 2002 <- Sort as Dec 2002.
Approximate range: 1643-00-00 c. 1643-1645 <- Sort as 1643.
Approximate range: 1645-00-00 c. 1643-1645 <- Sort as 1645.
Approximate range: 1640-00-00 early 1640s <- Sort as 1640.

Zotero works a little differently, but I’m providing a UI that lets the user
set both the standard date for sorting and the display date. If the user
enters just the display date, a sort date is guessed. The user can override
the guess.
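
As a rough Python sketch of this layout (a sort date followed by optional free-form text), assuming the two parts are stored in one string separated by whitespace: the StoredDate type and both function names are hypothetical. Since Python’s datetime.date rejects a month or day of 0, the sort key is kept as a plain tuple.

```python
import re
from typing import NamedTuple


class StoredDate(NamedTuple):
    sort_key: tuple  # (year, month, day); 0 means "not specified"
    display: str     # free-form display text, "" if none was given


# YYYY-MM-DD, optionally followed by display text.
_PATTERN = re.compile(r"(\d{4})-(\d{2})-(\d{2})(?:\s+(.*))?$")


def parse_stored_date(value):
    """Split 'YYYY-MM-DD optional text' into a sort key and display text."""
    match = _PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"no leading sort date in {value!r}")
    year, month, day = (int(match.group(i)) for i in (1, 2, 3))
    return StoredDate((year, month, day), match.group(4) or "")


def sort_entries(values):
    """Sort stored date strings by their sort key only."""
    return sorted(values, key=lambda v: parse_stored_date(v).sort_key)
```

Because 0 sorts before any real month or day, a book stored as 1999-00-00 lands ahead of a newspaper stored as 1999-01-01, and the display text never influences the order.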

Because I saw this basic design in Zotero, I assumed it was what CSL was
using. I hope it is.

– John

John P. McCaskey wrote:

Zotero works a little differently, but I’m providing a UI that lets the user
set both the standard date for sorting and the display date. If the user
enters just the display date, a sort date is guessed. The user can override
the guess.

Because I saw this basic design in Zotero, I assumed it was what CSL was
using. I hope it is.

I see this as an implementation issue that shouldn’t be particularly
relevant to CSL. [It is, however, to the RDF stuff we’re working on.*]

I’m not sure it’s a good idea to be designing CSL to account for all of
this. I don’t want to be having all sorts of complicated conditionals
all over the place to account for essentially bad data (where "bad"
means using natural language strings to indicate structured information).

It’s among the reasons why CSL also doesn’t get into personal names,
preferring to leave it to implementations to know how to deal with
sorting and displaying them (and ideally recognizing that the whole
world doesn’t use U.S. naming traditions).

BTW, all of this is related to an issue you’ve brought up before, John:
string substitution.

Bruce

* For the RDF data, I guess I’d prefer two literal properties: one that
    can take standard xsd date datatypes (and perhaps an optional one that
    indicates a range; say YYYY-YYYY), and another – maybe called
    bibo:otherDate or some such – that can take a string. Approximate dates
    probably ought to be indicated with a datatype I suppose:

<http://ex.net/1> dc:date "-0555"^^bibo:approximate_date .

The above would then get rendered (in English) as perhaps “c555 BC”.

In any case, this stuff is a PITA!

I’m not sure it’s a good idea to be designing CSL to account for all of
this. I don’t want to be having all sorts of complicated conditionals
all over the place to account for essentially bad data (where "bad"
means using natural language strings to indicate structured information).

It’s among the reasons why CSL also doesn’t get into personal names,
preferring to leave it to implementations to know how to deal with
sorting and displaying them (and ideally recognizing that the whole
world doesn’t use U.S. naming traditions).

BTW, all of this is related to an issue you’ve brought up before, John:
string substitution.

Bruce

* For the RDF data, I guess I’d prefer two literal properties: one that
    can take standard xsd date datatypes (and perhaps an optional one that
    indicates a range; say YYYY-YYYY), and another – maybe called
    bibo:otherDate or some such – that can take a string. Approximate dates
    probably ought to be indicated with a datatype I suppose:

<http://ex.net/1> dc:date "-0555"^^bibo:approximate_date .

The above would then get rendered (in English) as perhaps “c555 BC”.

In any case, this stuff is a PITA!

I like the old PERL mantra, “Easy things should be easy and difficult things should be possible” (even if it hasn’t been true of PERL since version 4!).

If the user really needs “[c. 1475? - 1476/7]” or “Dec, 2005, second holiday issue”, it should be possible, even if getting it means the user cannot switch to another format and expect those to get automatically converted to “[c1475?-1476/77]” and “12/2005, 2nd holiday issue”.

I started thinking dates needed all sorts of semantic coding – support for range, flags for approximate, imputed, uncertain, split, etc. – but to add enough of that to fully support all the difficult cases, easy cases would no longer be easy. So now I like just two components, an 8601-encoded date (which automatically handles variable resolution) and a display string. For display, the second, if it exists, overrides a formatted version of the first. For sorting, the first always prevails.

Easy things stay easy. Difficult things come at the cost of a semantic loss, but the loss is bound to a known and well defined space.
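
The display rule can be sketched in a few lines of Python. The output layout (“1 December 1999”, “January 1985”, “1999”) is only an assumed stand-in for whatever a style would specify, and display_date is a hypothetical name.

```python
_MONTHS = ["January", "February", "March", "April", "May", "June",
           "July", "August", "September", "October", "November", "December"]


def display_date(sort_date, display_text=""):
    """Return the display string if present, else format the sort date.

    sort_date is 'YYYY-MM-DD' where a month or day of 00 means that
    level of resolution is not known.
    """
    if display_text:
        # The free-form string always wins for display.
        return display_text
    year, month, day = (int(part) for part in sort_date.split("-"))
    if month == 0:
        return str(year)
    if day == 0:
        return f"{_MONTHS[month - 1]} {year}"
    return f"{day} {_MONTHS[month - 1]} {year}"
```

The easy cases need nothing but the encoded date; the difficult ones carry their own text and are passed through untouched, which is where the semantic loss is confined.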

– John

John P. McCaskey wrote:

I like the old PERL mantra, “Easy things should be easy and difficult
things should be possible” (even if it hasn’t been true of PERL since
version 4!).

I like it too. But there’s another one which trumps it for me …

If the user really needs …

… “why do we do manually what computers should be doing for us?”* E.g.
in a few years, I don’t want to ever have to edit data or worry about
correct citation styles. I don’t want to think about citations at all;
just the data and content I’m working with.

And for that to happen requires some care to these issues, even while
keeping the Perl maxim in mind. You might be right about the two
different kinds of properties.

Bruce

It’s not really structured information when you get into things like
"Summer 2007" and “circa 1754”, though, is it? (Obviously if it were
seen as such, CSL could correctly translate “Summer”, but that
definitely seems like something we don’t want to get into, and those are
probably relatively easy examples anyway.) Zotero doesn’t (or shouldn’t)
pass through natural language strings to indicate structured data, only
when the data doesn’t conform to the expected structure–but in those
cases there’s still a pretty good likelihood that the user intended that
data to represent that field in that place in the citation.

I’m mostly just an observer in this discussion, but the implicit
substitution idea–passing through the field if it isn’t numeric (or
otherwise structured)–seems fine to me. FWIW, Zotero wouldn’t, for
example, include two separate fields in the UI–it might make things a
little clearer for CSL authors, but there’s certainly no need to trouble
the user with it. We would just parse the field and include a status
indicator like we do for dates. If there were two CSL fields, we’d
populate those based on whether or not we had parsed an integer, but I
don’t see much of a compelling reason to have those in CSL either–if
"spurious ‘edition’ and such labels" get through and disable
numeric/structured parsing of a field, that’s the app’s fault, not CSL’s.
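
For illustration, populating two fields from one input field might look like the Python sketch below. The “-numerical” and “-verbatim” suffixes follow the names floated earlier in the thread; the function itself is hypothetical and is not how Zotero works.

```python
import re

_INTEGER = re.compile(r"[0-9]+")


def split_field(name, raw):
    """Fill '<name>-numerical' if the whole field is an integer,
    otherwise '<name>-verbatim'. Nothing is discarded either way."""
    text = raw.strip()
    if not text:
        return {}
    if _INTEGER.fullmatch(text):
        return {f"{name}-numerical": int(text)}
    return {f"{name}-verbatim": text}
```

Here split_field("volume", "3") fills only the numerical field, while “vol. 3” stays verbatim because a label got through, so cleaning up such labels remains the application’s job.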

I’ve said this before, but a brief reiteration of why we don’t enforce
data types in Zotero: since Zotero is designed to easily grab data from
disparate and often idiosyncratic sources, we’d have to either discard
data at input time or prompt the user each and every time non-conforming
data came in, neither of which is a very good option. And, as in the
case of dates, our ideas of what is structured data are quite likely not
universal. Parsing logic can be improved, but discarded data can’t be
restored. (We may eventually have optional data normalization features
in Zotero, but that’s a separate issue.)

This doesn’t mean we shouldn’t support additional structures–standard
date ranges seem like something Zotero should definitely parse and CSL
should be able to format, and there are surely other examples.

Dan Stillman wrote:

It’s not really structured information when you get into things like
"Summer 2007" and “circa 1754”, though, is it?

Well …

issue_date: { "year": "2007", "season": "2" }
issue_date: { "year": "1754", "type": "approximate" }

The problem is nobody wants to hassle with that. So we fall back on a
kind of loose implicit structure.

I’m mostly just an observer in this discussion, but the implicit
substitution idea–passing through the field if it isn’t numeric (or
otherwise structured)–seems fine to me. FWIW, Zotero wouldn’t, for
example, include two separate fields in the UI–it might make things a
little clearer for CSL authors, but there’s certainly no need to trouble
the user with it. We would just parse the field and include a status
indicator like we do for dates.

Right.

If there were two CSL fields, we’d
populate those based on whether or not we had parsed an integer, but I
don’t see much of a compelling reason to have those in CSL either–if
"spurious ‘edition’ and such labels" get through and disable
numeric/structured parsing of a field, that’s the app’s fault, not CSL’s.

So which option are you voting for? Julian’s example was 1, and then there was:

Option 2:

Only have “edition” be a “number”, and have a rule that if the datatype
is in fact not a number, then the content gets passed through as is. But
then that runs into the problem of spurious “edition” and such labels,
which we’d need a rule for as well.

Option 3:

I guess edition is perhaps the only variable whose datatype is
ambiguous. Still another option is to not allow that ambiguity (e.g.
two different variables: “edition_number” and “edition_description” or
some such).

I’ve said this before, but a brief reiteration of why we don’t enforce
data types in Zotero: since Zotero is designed to easily grab data from
disparate and often idiosyncratic sources, we’d have to either discard
data at input time or prompt the user each and every time non-conforming
data came in, neither of which is a very good option.

Right. I guess the trick is to figure out how to incrementally enhance
the quality of publicly available data, rather than succumb to entropy.

Bruce

I don’t have enough experience with writing CSLs to vote between 1 and
2–I was just taking issue with the notion that spurious "edition"
labels getting passed through would be CSL’s problem.

My uninformed opinion would be that, if we don’t have any reason to
think that styles would need different formatting around the
passed-through element than they would put around the structured one,
option 2 would be fine. The pass-through would essentially just be
passing the entire formatting issue up to the user. The full control and
explicitness of Option 1 is appealing, but maybe not if it’s just
creating more work for CSL authors without any practical benefits.

We should probably come up with some Option 1 and Option 2 examples for
funkier dates–handling at least date ranges as additional structured
options–before deciding. My sense is that if the date doesn’t parse
into supported semantic fields without a remainder, it probably needs to
be passed through in its entirety without structure, replacing the whole
element, and there wouldn’t be too much point in using extra
conditionals. That would avoid issues like the one Sean mentioned in the
Zotero forums where the range part of a date range was silently discarded.
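
A small Python sketch of the “no remainder” rule, assuming only two supported structures (a four-digit year and a year range); the dictionary keys and function names are made up for the example.

```python
import re

_YEAR = re.compile(r"\d{4}")
_YEAR_RANGE = re.compile(r"(\d{4})\s*-\s*(\d{4})")


def parse_date(raw):
    """Parse into structured fields only if nothing is left over."""
    text = raw.strip()
    match = _YEAR_RANGE.fullmatch(text)
    if match:
        return {"start": int(match.group(1)), "end": int(match.group(2))}
    if _YEAR.fullmatch(text):
        return {"start": int(text)}
    # Any remainder at all: pass the whole string through unstructured.
    return {"verbatim": text}


def render(parsed):
    """Render a parsed date; verbatim text replaces the whole element."""
    if "verbatim" in parsed:
        return parsed["verbatim"]
    if "end" in parsed:
        return f"{parsed['start']}-{parsed['end']}"
    return str(parsed["start"])
```

Using fullmatch rather than search is the point of the sketch: “c. 1873-1874” is never half-parsed into 1873 with the rest thrown away, it is kept whole.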

But I may be overlooking things, so you and others can probably comment
on this better than I can.