After a sorry disaster about a year ago, during which I managed to lose everything, I have been rewriting my processor from scratch, with all the advantages of having thrown away the first attempt. But I’ve hit an unexpected wall.
I realised that my processor was not passing leading or trailing whitespace (or pure whitespace) from attributes such as `delimiter`, `prefix` and `suffix`. So if the style file had, e.g., `delimiter=", "`, I was getting a delimiter of “,”.
At first I assumed a straightforward programming error on my part, but I think it may not be so. A bit of digging in the XML library I was using showed that it is removing leading and trailing whitespace from attributes. This turns out not to be an accident. The XML standards contain rules for attribute normalization, which state (the critical requirement is the paragraph beginning “If the attribute type is not CDATA”):
Before the value of an attribute is passed to the application or checked for validity, the XML processor must normalize the attribute value by applying the algorithm below, or by using some other method such that the value passed to the application is the same as that produced by the algorithm.
1. All line breaks must have been normalized on input to #xA as described in 2.11 End-of-Line Handling, so the rest of this algorithm operates on text normalized in this way.
2. Begin with a normalized value consisting of the empty string.
3. For each character, entity reference, or character reference in the unnormalized attribute value, beginning with the first and continuing to the last, do the following:
   - For a character reference, append the referenced character to the normalized value.
   - For an entity reference, recursively apply step 3 of this algorithm to the replacement text of the entity.
   - For a white space character (#x20, #xD, #xA, #x9), append a space character (#x20) to the normalized value.
   - For another character, append the character to the normalized value.
If the attribute type is not CDATA, then the XML processor must further process the normalized attribute value by discarding any leading and trailing space (#x20) characters, and by replacing sequences of space (#x20) characters by a single space (#x20) character.
Note that if the unnormalized attribute value contains a character reference to a white space character other than space (#x20), the normalized value contains the referenced character itself (#xD, #xA or #x9). This contrasts with the case where the unnormalized value contains a white space character (not a reference), which is replaced with a space character (#x20) in the normalized value and also contrasts with the case where the unnormalized value contains an entity reference whose replacement text contains a white space character; being recursively processed, the white space character is replaced with a space character (#x20) in the normalized value.
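To check my reading, here is a minimal sketch of the quoted algorithm in OCaml, using only the standard library. The `token` type and all function names are my own hypothetical model, not the API of any real XML library: it assumes the attribute value has already been split into literal characters, character references and entity references (with replacement text already tokenized), and that line breaks have already been normalized as in step 1. It also takes the attribute type as given rather than working out whether an attribute counts as CDATA.

```ocaml
(* Hypothetical model of an unnormalized attribute value. *)
type token =
  | Literal of char            (* a character written directly in the value *)
  | Char_ref of char           (* a character reference, e.g. &#x9; *)
  | Entity_ref of token list   (* an entity reference and its replacement text *)

(* The four white space characters named in the quoted rules. *)
let is_xml_space c = c = ' ' || c = '\r' || c = '\n' || c = '\t'

(* Step 3: walk the tokens in order, appending to the normalized value. *)
let rec append_tokens buf tokens =
  List.iter
    (function
      | Char_ref c -> Buffer.add_char buf c                    (* kept as-is *)
      | Entity_ref replacement -> append_tokens buf replacement (* recurse *)
      | Literal c when is_xml_space c -> Buffer.add_char buf ' '
      | Literal c -> Buffer.add_char buf c)
    tokens

(* Steps 2 and 3: the whole normalization for a CDATA attribute. *)
let normalize_cdata tokens =
  let buf = Buffer.create 16 in
  append_tokens buf tokens;
  Buffer.contents buf

(* Extra step for non-CDATA attributes: drop leading and trailing #x20
   and collapse runs of #x20 into one. Only #x20 is affected. *)
let normalize_non_cdata tokens =
  normalize_cdata tokens
  |> String.split_on_char ' '
  |> List.filter (fun s -> s <> "")
  |> String.concat " "

(* Convenience: treat every character of a string as a literal. *)
let tokens_of_string s = List.init (String.length s) (fun i -> Literal s.[i])

let () =
  (* Under this sketch a CDATA value keeps its trailing space ... *)
  assert (normalize_cdata (tokens_of_string ", ") = ", ");
  (* ... and the same value treated as non-CDATA loses it. *)
  assert (normalize_non_cdata (tokens_of_string ", ") = ",")
```

In this model the outcome for `", "` depends entirely on which of the two functions applies, that is, on whether the attribute is treated as CDATA.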
As I read this, it follows that the library I am using is correct to strip trailing whitespace from `", "`: an attribute value of `", "` and one of `","` are identical after normalization. One might argue about whether an attribute value that consists only of a space is “trailing”, I suppose (what does it trail?).
Needless to say, pure whitespace attributes and attributes with trailing whitespace are absolutely endemic in style files when defining delimiters and the like.
This raises two questions:
- Is the library I am using misinterpreting the standard? Or does CSL depend on style files being processed in a way that departs from the XML standard, without the normalization rules being applied? If so, is any normalization applied (e.g. to collapse multiple whitespace), or none? I can’t find an explicit reference to this departure from the XML standard in the CSL documentation, but I may well have missed it.
- Is the only reliable workaround to modify the XML processor so that it does not normalize in accordance with the standard? At the moment I can’t think of any other workaround that would be safe, because I don’t think one could assume that `delimiter=","` and `delimiter=", "` should produce the same output. I approach the idea of modifying a library with a somewhat heavy heart, and I’m working in a language (OCaml) with only rather limited options.