CSL Questions

A publication is pretty much always in one language, so placing
localized strings in a style makes sense; after all, the ideal is
that a journal supplies the CSL file.

I would argue against adding localization.

I think that omitting localization may pose problems in the future.
Regarding exactly this issue (i.e. people thinking that publication
styles are always language specific), here’s a quote from an email I
wrote to Bruce earlier this month:

Actually, come to think of it, aren’t the vast majority of styles that
people use – journal styles – by definition language specific?

Yes, that is definitely true. :slight_smile:

In that case, localization at the level of the file is appropriate.

I agree that in most cases the default version is all that’s needed.

That was my original thinking in making the decision … and I’m
thinking it was the right one. That said, I’ll see about tweaking
things a bit.

Yes. However, it just seems smarter to allow for global language (or
context) specific string substitution. This would allow the use of
arbitrary styles in combination with arbitrary languages - without any
duplication & editing of existing styles.

There are quite a few cases where you may be more or less free to choose
among different styles (e.g. when doing your thesis or when writing a
grant proposal). As an example, a German proposal for the German
research foundation would require German-language strings within my
reference list. With a global language file containing German
translations of strings, I’d be free to choose whichever citation style
would be appropriate – and I could easily change the style later on
while maintaining my German strings. Without a global language file, I’d
need to make a copy of the style I’d like to use and edit all of its
English strings. And, if it turns out that I need to use another
style for the final proposal, I’d need to again duplicate that other
style and edit all the contained strings… I’d be frustrated quickly. :wink:

But generally, you’re correct in that when submitting your paper to a
journal, there’s always only one language required.

I’m pretty sure that the need WILL arise for some people to have a
particular style available in a non-standard language. Do you want these
folks to edit each style to match their language? Even if styles already
existed for your language, users would have to find & download
them. Now, how many people would do so (or be able to do so)?

IMHO, Bruce’s reply sums this up quite well:

From the standpoint of software, it doesn’t make that much
difference. If I create a citation style object like

CitationStyle.new(name="apa", language="en")

… it would work the same in either approach. One way would look
for a different file, while another way looks up some strings.

Localization makes styles potentially more complex, but no
localization means the necessity of having styles for each language
(IF there are more than one).

From a user perspective, I also don’t think there’d be that much
difference, except that in the style-per-language approach, you might
end up in a situation where some language has no style for it. The
software could always default to English if needed.
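
A minimal Python sketch of the style-per-language lookup with an English fallback, as described in the quote above. The file naming scheme (`apa-en.csl`), the set of available files, and the function name are all assumptions made for illustration; nothing here is taken from an actual CiteProc implementation.

```python
# Hypothetical set of style files that happen to be installed.
AVAILABLE_STYLE_FILES = {"apa-en.csl", "apa-de.csl", "chicago-en.csl"}


def resolve_style_file(name, language, available=AVAILABLE_STYLE_FILES,
                       default_language="en"):
    """Return the style file for (name, language), falling back to English."""
    candidate = f"{name}-{language}.csl"
    if candidate in available:
        return candidate
    # No style exists for this language: default to the English one.
    fallback = f"{name}-{default_language}.csl"
    if fallback in available:
        return fallback
    raise LookupError(f"no style file found for {name!r}")


print(resolve_style_file("apa", "de"))      # a German APA file exists
print(resolve_style_file("chicago", "de"))  # falls back to English
```

With the string-lookup approach the call would look the same from the outside; only the body of the function would change.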

I don’t see a problem if (English) default strings are included with
each CSL style file. A dumb CSL processor wouldn’t need to know anything
about localization; it would just use the default strings and that’s it.
A smarter CSL processor could use any language-specific information to
use a non-default language for citation formatting.

Ignoring language-specific issues (even if minor ones) will cause
problems in the future, that’s for sure. IMHO, we should try to make CSL
files as generic as possible. This will be the key to success.

Matthias

Would you grant me SVN rights to that part of the SVN tree, Bruce? I’d
like to play around with that a bit.

I thought you had rights? Or are you saying you need rights to the
python-py area, and do not now?

Can you confirm before I go and look?

That’s what I thought too. Went to look for it some more, seemed that
my SF password expired. I just reset it to a new one. I still get
authorization failed errors, but that might be because it could take a
little time to get the new password to the svn servers.

I’ll let you know if I still can’t get in later on.

Johan

That’s what I thought too. Went to look for it some more, seemed that
my SF password expired. I just reset it to a new one.

Yeah, I had that problem too.

I still get authorization failed errors, but that might be because it
could take a little time to get the new password to the svn servers.

That’s probably right. Let me know …

Bruce

Actually, I think that adding localization to the level you describe is
going to be quite a challenge to manage. I think it’s important to
contain within a CSL file exactly how strings are to be spelled. If
the file with translations comes from the citeproc implementation, it
can’t be guaranteed to be the same as what the original author of the
CSL file had in mind. Including the translations within each CSL file
sounds very difficult to keep in sync. One mistranslation then
has to be added to every CSL file.

Then, if two German styles are based on APA, but translate a word
differently, which one goes in the official translation?

For the cases you describe it would perhaps be better to write a
simple tool that does a rough translation based on a user supplied
dictionary.

I’d say, translate only once, when the CSL file is created, and not
every time it’s used. The latter is very dependent on the dictionary
supplied and hence more prone to give different results on different
computers.

Just my thoughts…

Johan

OK, I just checked in a change that allows the formatting config
attributes on the cs:item-layout element. So, for apa:

   <item-layout suffix=".">
      <reftype name="book">

Will that work?

Bruce

Yeah, that’s an interesting idea.

I’ve had in mind that it’d be nice to find a Javascript/Web 2.0 guru to
implement a CSL editor. The way I envision that for a user is that they
would always base any new style (including a new language) on an existing
one. So, they choose a “my style looks like this” option, etc.

It might be possible to tweak CSL to make it easier to automate this
sort of thing. E.g., this:

<label single="Ed" multiple="Eds"/>

… might work. So a tool can then know “I need to prompt user to edit
all label content.”
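
As a rough illustration of how a tool could find every label it needs to prompt for, here is a small Python sketch. Only the `label` element with `single` and `multiple` attributes comes from the example above; the surrounding `style`, `names`, and `locator` elements and the function name are made up for the demonstration.

```python
import xml.etree.ElementTree as ET

# Hypothetical style fragment; only the label element mirrors the example.
STYLE_XML = """
<style>
  <names>
    <label single="Ed" multiple="Eds"/>
  </names>
  <locator>
    <label single="p" multiple="pp"/>
  </locator>
</style>
""".strip()


def editable_labels(xml_source):
    """Return (single, multiple) pairs for every label element in the style."""
    root = ET.fromstring(xml_source)
    return [(label.get("single"), label.get("multiple"))
            for label in root.iter("label")]


for single, multiple in editable_labels(STYLE_XML):
    print(f"translate: {single!r} / {multiple!r}")
```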

In the current version of the schema, I actually exclude word content
from the prefix and suffix elements using a regular expression (and
also constrain the values for uris). Ah, RELAX NG is nice!

Bruce

Johan, what do these two terms mean exactly?

Bruce

Actually, I think that adding localization to the level you describe is
going to be quite a challenge to manage.

Why? It’s actually less of a challenge (IMHO). With my scenario, there
would be only one single CSL file for the APA style (containing the
default strings) plus one global language file that is optional.

With my setup, if you need to update the APA CSL file, you’ll only need
to update a single file. With multiple APA files (one per language),
you’ll need to update every single language incarnation of the APA
style, which may well lead to a big update mess.

I think it’s important to contain within a CSL file exactly how
strings are to be spelled.

But this is exactly what I’m suggesting. My suggestion is simply that
all (default) strings are contained at the top of the file instead of
being dispersed between individual XML elements deep inside the CSL
structure.

Including the translations within each CSL file sounds very
difficult to keep in sync. One mistranslation then has to be added to
every CSL file.

No, the translation would be within a single global localization file.
The translation strings would NOT be contained within individual CSL
style files (only the default strings). There’d be one translation per
language, not one translation per style file. If then users would like
to make language-specific customizations for a particular style, they
could easily do so by editing the default strings within the CSL file to
suit their needs.

I’d say, translate only once, when the CSL file is created, and not
every time it’s used.

This is what I’m suggesting. There’s one single translation that will
work for all styles. If this is not appropriate for some individual
styles, just edit the default strings within that style. If necessary, a
user can still duplicate a style and edit this new version without
messing with the default file.

Matthias

I guess that ‘auteur’ means ‘author’ and ‘schrijver’ means literally
‘writer’ (same as ‘author’?), one being French, one being Dutch?

If you want to adapt the language of a particular style you can always
do so by editing the string variables contained at the top of that
particular style file. I.e., it is not a problem at all that one Dutch
style has different language strings than others. There’d just need to
be a mechanism that says: “prefer style-specific strings over global
strings”.

To reiterate: I just propose to add another optional layer on top of
the existing layer (the existing layer being the default language
strings contained inside of the CSL files, the optional layer being a
single global language file). Everything would still work without the
additional language file and CSL files would still be self-contained.

The only change is really that language-specific strings are moved from
within the CSL structure to a separate section at the top of the CSL
file.
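
The layering proposed here could be sketched in Python roughly as follows. This is only one possible reading of “prefer style-specific strings over global strings”: it assumes each style declares the language of its own default strings, and the data layout, term ids, and German strings are invented for the example.

```python
# Hypothetical style record: default strings live in one block at the top.
APA = {
    "language": "en",
    "strings": {"editor": "Ed.", "available-from": "available from"},
}

# Hypothetical optional global language file, one entry per language.
GLOBAL_STRINGS = {
    "de": {"editor": "Hrsg.", "available-from": "verfügbar unter"},
}


def resolve_string(style, term_id, language, global_strings=None):
    """Pick the string for term_id in the requested language."""
    style_strings = style["strings"]
    # Style-specific strings win when the style is already in that language.
    if language == style["language"]:
        return style_strings[term_id]
    # Otherwise try the optional global layer.
    if global_strings:
        translated = global_strings.get(language, {}).get(term_id)
        if translated is not None:
            return translated
    # No global file, or no entry in it: use the style's own defaults.
    return style_strings[term_id]


print(resolve_string(APA, "editor", "de", GLOBAL_STRINGS))
print(resolve_string(APA, "editor", "de"))  # still works without the file
```

Without the global file the style stays self-contained and simply renders its default strings.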

(sorry, I feel I’m repeating myself quite a bit, I’ll stop now… :wink:)

Matthias

I’m still a bit puzzled as to how delimiting should work. Does a suffix on a
formatting element override the suffix on the item-layout? Does a prefix on
a formatting element override the implicit " "?

What I’m outputting right now (with those assumptions) is at
http://simonster.com/csl/ (you’ll need to use Firefox 1.5 or later to get
it to work). I still have a few questions, but it may make more sense to
wait on them until we make a decision on localization. No validation is
done, and there are probably plenty of bugs, but both the item attributes
and CSL are live, so any changes you make will be reflected as soon as you
press “Show.”

Simon

OK, I just checked in a change that allows the formatting config
attributes on the cs:item-layout element. So, for apa:

   <item-layout suffix=".">
      <reftype name="book">

Will that work?

I’m still a bit puzzled as to how delimiting should work. Does a
suffix on a formatting element override the suffix on the
item-layout?

Can you give an example of a “formatting element” that you could see
overriding that on the item-layout? I could see it if I allowed the
formatting attributes on the reftype element, but not on its children.

Does a prefix on a formatting element override the implicit " "?

Implicit " "?

What I’m outputting right now (with those assumptions) is at
http://simonster.com/csl/ (you’ll need to use Firefox 1.5 or later
to get it to work).

Sweet; that was quick!

I should probably mention something about the design of it all, which
is the type and fallback system.

There are three required types: article, book, and chapter. There is no
“generic” type, because these three in fact serve as generic fallbacks
also.

If you’re designing the code, then, you can think of a rule like:

if pages*, then
	if volume then article
	else chapter
else book
  * in my RDF, I just test for the presence of a dcterms:isPartOf
    element, but you don’t have that structure, so need to rely on field
    names. Unfortunately, pages won’t work well for online articles.
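
The fallback rule above translates almost directly into Python. The sketch assumes an item is a plain dict keyed by field name, which is an illustration only; the field names `pages` and `volume` are the ones used in the rule.

```python
def fallback_type(item):
    """Map an item to one of the three required types by its fields."""
    if item.get("pages"):
        # Something with pages is part of a larger work.
        return "article" if item.get("volume") else "chapter"
    return "book"


print(fallback_type({"pages": "105-107", "volume": "2"}))
print(fallback_type({"pages": "12-30"}))
print(fallback_type({"title": "Some Monograph"}))
```

As noted, an online article without page numbers would come out as a book under this rule, so a real implementation would likely need additional hints.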

I still have a few questions, but it may make more sense to
wait on them until we make a decision on localization.

Make sure to give your thoughts on the conversation so I have an easier
time deciding!

Bruce

I’m still a bit puzzled as to how delimiting should work. Does a
suffix on a formatting element override the suffix on the
item-layout?

Can you give an example of a “formatting element” that you could see
overriding that on the item-layout? I could see it if I allowed the
formatting attributes on the reftype element, but not on its children.

Does a prefix on a formatting element override the implicit " "?

Implicit " "?

What I’m outputting right now (with those assumptions) is at
http://simonster.com/csl/ (you’ll need to use Firefox 1.5 or later
to get it to work).

With my current code, I’m taking the prefix and suffix attributes on each
, , etc. element and attaching them before and after the
data. If there’s no prefix, I attach " ". If there’s no suffix, I attach the
suffix specified in the item-layout element.

So, for example, there’s the year element:

The suffix on the year element overrides the suffix on the item-layout
element, so you only have one period:
Kornblith, S. (2006). CSL in JavaScript Proof of Concept.

There’s also the pages element:

The prefix means no space is inserted before the element, and so you end up
with:
(2), pp105-107.

I have no idea if I’m doing this entirely correctly, but it seems to be
working mostly correctly. How do you do it in CiteProc?
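
For reference, the defaulting rule described in this message could be written in Python as below. The attribute values used in the calls (`" ("`, `")."`, `", pp"`) are guesses reconstructed from the two sample outputs, since the original XML is not shown; the function name is hypothetical.

```python
def render_field(value, prefix=None, suffix=None, layout_suffix=""):
    """Wrap one field value using the defaulting rule described above."""
    # No prefix attribute: fall back to a single space.
    left = " " if prefix is None else prefix
    # No suffix attribute: fall back to the item-layout suffix.
    right = layout_suffix if suffix is None else suffix
    return f"{left}{value}{right}"


# Year with its own prefix and suffix; the layout suffix is not added again.
print(repr(render_field("2006", prefix=" (", suffix=").", layout_suffix=".")))
# Pages with an explicit prefix, so no implicit space is inserted.
print(repr(render_field("105-107", prefix=", pp", layout_suffix=".")))
```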

I still have a few questions, but it may make more sense to
wait on them until we make a decision on localization.

Make sure to give your thoughts on the conversation so I have an easier
time deciding!

I think Matthias’s suggestion of an optional layer over terms embedded in a
CSL file sounds reasonable, although I still believe our top two priorities
should be versatility and ease of modification of styles, and localization
should not come at the expense of either of these.

I certainly don’t think we should leave things as they are now, where the
software, rather than the style, is left to define all of the strings. It
seems to me that there are too many possible abbreviations to cover with a
simple form=“short” attribute.

My original proposal of XML entities was based on the assertion that, with
entities, the relationship between a given term and a given element is
clearer. It’s obvious what text is supposed to go where. I now know that I
can’t use E4X with entities, and, while I could easily switch to the DOM
parser (since my parsing and generation code are completely separate), I
realize this approach might cause problems for those working with other
programming languages. If at all possible, however, I’d still like to make
sure that what text applies to which label is clear.

I think one option is simply to provide a standard set of IDs to assign to
given terms, but still allow authors to specify text in-place. This isn’t
actually much different from the way CiteProc 0.7.1 did things with the
element. Localization files could then override these IDs, assuming
everyone uses the same set. In fact, this is the basis of the W3C’s
International Tag Set (http://www.w3.org/TR/2006/WD-its-20060518/) and
their corresponding Best Practices for XML Internationalization
(http://www.w3.org/TR/xml-i18n-bp/). (We could even be hip and use ITS,
but that’s not really necessary.)

For example:
available from:

For encoding locators, contributors, etc. we could use the same basic syntax
as was in CiteProc 0.7.1 files, with IDs attached to each term. The major
downside in my mind is we’d lose the compact attribute-oriented syntax.

What do you think?

Simon

Hi Simon,

I think Matthias’s suggestion of an optional layer over terms embedded
in a CSL file sounds reasonable, although I still believe our top two
priorities should be versatility and ease of modification of styles,
and localization should not come at the expense of either of these.

Yes, I fully agree. And I understand people’s concern that this
localization thing complicates the CSL structure too much.

It’s a good idea to concentrate on the default (English) CSL first. If
this is working, then think about edge cases such as localization. But
this strategy requires that the CSL structure is designed such that it
allows for localization in the future (as you propose below).

I certainly don’t think we should leave things as they are now, where
the software, rather than the style, is left to define all of the
strings.

Right, it should be possible that each style has its own specific
language definitions.

If at all possible, however, I’d still like to make sure that what
text applies to which label is clear.

Yes, this is currently a problem and I’m having trouble with this as
well when trying to figure out what goes where. And if we already have
trouble with it then I guess it’s certainly too complicated for a
regular user wanting to edit a style file. :wink:

I think one option is simply to provide a standard set of IDs to
assign to given terms, but still allow authors to specify text
in-place.

Sounds good.

This isn’t actually much different from the way CiteProc 0.7.1 did
things with the element. Localization files could then
override these IDs, assuming everyone uses the same set

Yes. So this would mean that the strings are specified in place (as they
were before) but the standard strings just gain an ‘id’ attribute which
allows language-aware processors to override the standard string with a
localized one, is this what you’re suggesting?
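
A small Python sketch of that reading: the default text stays in place, and a language-aware processor swaps it out by id. The `suffix` wrapper, the id value `available-from`, and the German string are assumptions made for the example; only the idea of a label carrying an id plus in-place text comes from the discussion.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the label carries its default text and an id.
SUFFIX_XML = '<suffix><label id="available-from">available from</label>: </suffix>'

# Hypothetical localization table keyed by language, then by label id.
LOCALES = {"de": {"available-from": "verfügbar unter"}}


def render_affix(xml_source, language=None):
    """Concatenate an affix, swapping in localized label text when known."""
    element = ET.fromstring(xml_source)
    overrides = LOCALES.get(language, {})
    parts = [element.text or ""]
    for child in element:
        if child.tag == "label":
            default_text = child.text or ""
            parts.append(overrides.get(child.get("id"), default_text))
        # Literal text that follows a child element is stored as its tail.
        parts.append(child.tail or "")
    return "".join(parts)


print(repr(render_affix(SUFFIX_XML)))        # default string, no locale
print(repr(render_affix(SUFFIX_XML, "de")))  # overridden by id
```

A processor that knows nothing about localization would only ever take the first path and print the in-place text.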

For example:
available from:

I like this very much. That way you can read a CSL style file straight
through and you don’t need to figure out the correct strings from the
top of the file (or wherever). I agree that this is far easier to
comprehend.

The major downside in my mind is we’d lose the compact
attribute-oriented syntax.

Yes, but personally I don’t see this as a big problem. I think it’s more
important that people can easily figure out how the CSL bits form a
style format. And this is definitely easier if everything is defined
in place.

Matthias

With my current code, I’m taking the prefix and suffix attributes on
each , , etc. element and attaching them before and after the
data. If there’s no prefix, I attach " ". If there’s no suffix, I
attach the suffix specified in the item-layout element.

[…]

I have no idea if I’m doing this entirely correctly, but it seems to be
working mostly correctly. How do you do it in CiteProc?

Typically, I don’t assume any default. If the processor hits, say, a
cs:title element, it grabs the prefix and suffix content, passes them
as parameters to the dc:title (or whatever) template, and they then get
printed (if there’s anything there of course). In the Ruby port I
started, it’s similar: when a CitationStyle object gets created, there
are formatting field objects that have prefix and suffix attributes,
each of which hold an object.

So I don’t think you should need to assume a prefix that isn’t there.

I still have a few questions, but it may make more sense to
wait on them until we make a decision on localization.

Make sure to give your thoughts on the conversation so I have an
easier time deciding!

I think Matthias’s suggestion of an optional layer over terms embedded
in a CSL file sounds reasonable, although I still believe our top two
priorities should be versatility and ease of modification of styles,
and localization should not come at the expense of either of these.

Yes.

I certainly don’t think we should leave things as they are now, where
the software, rather than the style, is left to define all of the
strings. It seems to me that there are too many possible abbreviations
to cover with a simple form=“short” attribute.

The form attribute is really just for names and titles.

I think one option is simply to provide a standard set of IDs to
assign to given terms, but still allow authors to specify text
in-place.

This is what I was thinking.

This isn’t actually much different from the way CiteProc 0.7.1 did
things with the element.

Correct.

For example:
available from:

For encoding locators, contributors, etc. we could use the same basic
syntax as was in CiteProc 0.7.1 files, with IDs attached to each term.
The major downside in my mind is we’d lose the compact
attribute-oriented syntax.

What do you think?

I’ve gone back-and-forth on the attributes vs. elements issue here.
Part of my design approach to this is that CSL should not only be easy
to work with for people editing the raw XML, but that it also ought to
be suitable for creating GUIs, and for implementation in OO languages.

I think your example above is probably the technically correct solution
that balances these needs, but there is one little issue: mixed
content. Not a problem for XSLT, but for other languages?

If prefix/suffix content is simple strings, printing them is a simple
method call:

print field.prefix

This returns a string. If they are elements (complete with formatting
attributes), the prefix/suffix methods would return objects.

But if we have mixed content (label element plus strings), then what
would the content of those returned objects be?

Bruce

I missed that the label element contained content.

A problem with that is you have duplication, and room for error.

Bruce

Hi Bruce,

For example:
available from:

I like this very much. That way you can read a CSL style file straight
through and you don’t need to figure out the correct strings
from the top of the file (or wherever). I agree that this is far
easier to comprehend.

I missed that the label element contained content.

A problem with that is you have duplication, and room for error.

But as soon as there’s an ID, there’s some kind of duplication, isn’t
there? Or what kind of duplication do you mean?

I fear that the error (or confusion) may be higher, if the relationship
between strings and elements/attributes is opaque and not
straight-forward.

What I like about the above example is that it’s very easy to
comprehend, also for a regular user who hasn’t touched a CSL style file
before.

Besides a personal preference for elements, I also have a gut feeling
that elements may be more flexible in the long run (compared to
attributes), but I’m certainly no XML expert…

I’m sure that the CSL language will evolve with time and that there
might be issues which we cannot foresee today. Elements may be more
flexible to address any future issues.

I understand your point about the mixed content, though. It may be odd,
but wouldn’t it help if the punctuation were wrapped in its own
sub-element?

available from :

Well, as I said, I’m no XML expert and I’m sure you can justify this
better.

Matthias

I understand your point about the mixed content, though. It may be odd,
but wouldn’t it help if the punctuation were wrapped in its own
sub-element?

available from :

Well, as I said, I’m no XML expert and I’m sure you can justify this
better.

Yes, this is the more correct approach for OO-friendliness. It could be
called “text” or something else, but it removes the mixed-content
situation. A prefix or suffix object would basically be an array of
other objects (text and/or labels).
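
To make the “array of other objects” idea concrete, here is a hedged Python sketch. The class names `Text` and `Label`, the `term_id` field, and the term table are placeholders chosen for illustration, not names from any CiteProc code.

```python
from dataclasses import dataclass


@dataclass
class Text:
    """Literal text or punctuation inside a prefix or suffix."""
    value: str


@dataclass
class Label:
    """Reference to a term from the controlled vocabulary."""
    term_id: str


def join_affix(parts, terms):
    """Render a prefix/suffix given as a list of Text and Label objects."""
    rendered = []
    for part in parts:
        if isinstance(part, Label):
            rendered.append(terms[part.term_id])
        else:
            rendered.append(part.value)
    return "".join(rendered)


# A suffix made of one label followed by punctuation; no mixed content.
suffix = [Label("available-from"), Text(": ")]
print(repr(join_affix(suffix, {"available-from": "available from"})))
```

Because every child is an object of a known type, the prefix and suffix methods can return a plain list and the caller never has to deal with mixed content.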

I fear that the error (or confusion) may be higher, if the
relationship between strings and elements/attributes is opaque and
not straight-forward.

Note that this comment was meant regarding things like this:

...

where some label text (such as “(eds)” or “edited by”) would be added
in addition to the indicated prefix/suffix – but the XML structure
doesn’t give the user a clue that it does so.

This problem goes away if the XML structure contains a label with an
‘id’ attribute as a kinda placeholder:

,

We have three options if we accept the notion of embedding the label
(which I think is a good idea):

available from :
:

In both of the above, the id values would be controlled (though I guess
extensible).

Yes. While the first option seems to be most straight-forward, I see
that the second is less cluttered.

Also, if a string is to be used in multiple places, the first option is
less optimal since you’d have to edit the same string in several places.
Maybe this was what you meant with “duplication”? :slight_smile:

  3. back to non-localized
available from :

As you may guess I wouldn’t favour that option. :wink:

I guess if we went this way, I’d prefer option 2. I see no value in
duplicating the label content in each place, and I think it makes
things more complicated not less.

Ok, I have to agree now.

So question 1 is, do we go this way, with a label and text* as
children of prefix and suffix elements?

I actually like this a lot better than the attribute-oriented style.
But, again, this may be a personal preference. What I like about it is
that I can read it by following along the XML hierarchy, and it’s clear
which element (or text) comes after another one.

Question 2: which option?

Taking your points into account, I’d also favour option 2 now.

  * come to think of it, we’d need to think about whether it ought to be
    text, or something more constrained (non-word characters), as it is
    now.

I think people should be allowed to use it for insertion of plain-text
word characters, even if we’re currently using it for punctuation only.
This may help to cater for strange cases that we couldn’t foresee now
and where there is no controlled vocabulary available. So I’d keep it as
‘’.

Matthias

Quick follow-up:

:
  * come to think of it, we’d need to think about whether it ought to
    be text, or something more constrained (non-word characters), as it
    is now.

I think people should be allowed to use it for insertion of plain-text
word characters, even if we’re currently using it for punctuation
only. This may help to cater for strange cases that we couldn’t
foresee now and where there is no controlled vocabulary available. So
I’d keep it as ‘’.

Can a occur multiple times within a prefix or suffix? I assume,
at least can? If the content in is text (and not only
punctuation), couldn’t the be also called just :

:

So, basically, there’d be strings from a controlled vocabulary (
with an ‘id’ attribute: ) and arbitrary strings or
punctuation ( without an ‘id’ attribute).

Matthias

Sure. That might be better.

Bruce