XML Normalization

After a sorry disaster about a year ago, during which I managed to lose everything, I have started again on my CSL processor, with all the advantages of having thrown away the first attempt. But I’ve hit an unexpected wall.

I realised that my processor was not passing leading or trailing whitespace (or pure whitespace) from attributes such as delimiter, prefix and suffix. So if the style file had, e.g. delimiter=", ", I was getting a delimiter of “,”.

At first I assumed a straightforward programming error on my part, but I think it may not be so. A bit of digging in the XML library I was using showed that it is removing leading or trailing whitespace from attributes. This turns out not to be an accident. The XML standards contain rules for attribute normalization, which state (I’ve put the critical requirement in bold):

Before the value of an attribute is passed to the application or checked for validity, the XML processor must normalize the attribute value by applying the algorithm below, or by using some other method such that the value passed to the application is the same as that produced by the algorithm.

  1. All line breaks must have been normalized on input to #xA as described in 2.11 End-of-Line Handling, so the rest of this algorithm operates on text normalized in this way.
  2. Begin with a normalized value consisting of the empty string.
  3. For each character, entity reference, or character reference in the unnormalized attribute value, beginning with the first and continuing to the last, do the following:
  • For a character reference, append the referenced character to the normalized value.
  • For an entity reference, recursively apply step 3 of this algorithm to the replacement text of the entity.
  • For a white space character (#x20, #xD, #xA, #x9), append a space character (#x20) to the normalized value.
  • For another character, append the character to the normalized value.

If the attribute type is not CDATA, then the XML processor must further process the normalized attribute value by discarding any leading and trailing space (#x20) characters, and by replacing sequences of space (#x20) characters by a single space (#x20) character.

Note that if the unnormalized attribute value contains a character reference to a white space character other than space (#x20), the normalized value contains the referenced character itself (#xD, #xA or #x9). This contrasts with the case where the unnormalized value contains a white space character (not a reference), which is replaced with a space character (#x20) in the normalized value and also contrasts with the case where the unnormalized value contains an entity reference whose replacement text contains a white space character; being recursively processed, the white space character is replaced with a space character (#x20) in the normalized value.
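
To make the quoted algorithm concrete, here is a rough Python sketch of it. It is only an illustration written for this thread, not code from xmlm or any other library: the names normalize_attribute, PREDEFINED and the cdata flag are invented here, and it only knows the five predefined entities (a real parser would also expand DTD-declared entities recursively).

```python
import re

# Only the five predefined entities are handled in this sketch.
PREDEFINED = {"lt": "<", "gt": ">", "amp": "&", "apos": "'", "quot": '"'}

# One token per match: hex char ref, decimal char ref, entity ref, or any single character.
_TOKEN = re.compile(
    r"&#x([0-9A-Fa-f]+);|&#([0-9]+);|&([A-Za-z_:][\w.:-]*);|(.)", re.DOTALL
)

def normalize_attribute(raw, cdata=True):
    """Apply the normalization steps quoted above to a raw attribute value."""
    # Step 1: line breaks are normalized to #xA first.
    raw = raw.replace("\r\n", "\n").replace("\r", "\n")
    out = []  # Step 2: start from the empty string.
    for match in _TOKEN.finditer(raw):  # Step 3: walk the value left to right.
        hex_ref, dec_ref, entity, char = match.groups()
        if hex_ref is not None:
            out.append(chr(int(hex_ref, 16)))  # character reference: kept as-is
        elif dec_ref is not None:
            out.append(chr(int(dec_ref)))      # character reference: kept as-is
        elif entity is not None:
            out.append(PREDEFINED[entity])     # assumption: predefined entities only
        elif char in " \t\n\r":
            out.append(" ")                    # literal white space becomes #x20
        else:
            out.append(char)
    value = "".join(out)
    if not cdata:
        # Non-CDATA types: collapse runs of spaces, drop leading and trailing ones.
        value = re.sub(" +", " ", value).strip(" ")
    return value
```

With cdata=False the value ", " comes out as ",", which matches what I was seeing; with cdata=True the trailing space survives.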

As I read this, it follows that the library I am using is correct to strip trailing whitespace from ", ": an attribute of ", " and "," are identical after normalization. One might argue about whether the whitespace in an attribute value that consists only of a space counts as “trailing”, I suppose (what does it trail?).

Needless to say, pure whitespace attributes and attributes with trailing whitespace are absolutely endemic in style files when defining delimiters and the like.

This raises two questions:

  1. Is the library I am using misinterpreting the standard? Or does CSL depend on style files being processed in a way that does not accord with the XML standard, that is, without the normalization rules being applied? If so, is any normalization applied (e.g. to strip multiple whitespace), or none? I can’t find explicit reference to this departure from the XML standard in the CSL documentation, but I may well have missed it.

  2. Is the only reliable workaround to modify the XML processor so that it does not normalize in accordance with the standard? At the moment I can’t think of any other way round which would be safe, because I don’t think one could assume that delimiter="," and delimiter=", " should produce the same output. I approach the idea of modifying a library with a somewhat heavy heart, and I’m working in a language (OCaml) with only rather limited options.

Hah, that’s a pretty interesting question we probably need to answer.

Either it’s not a problem, or we need to say something about it in the spec.

Beyond the XML spec, it would be nice to know how a few widely used libraries deal with this.

I can do a little digging later.

Flagging this for @bwiernik and @Denis_Maier as we work on revisions.

I’m still a bit unsure.

The Relax NG book has some discussion of whitespace normalization, but as it relates to validation.

From this perspective, when we use text in a pattern, we’re saying “preserve all whitespace.”

When we use "foo", that’s shorthand for token, which in this case means “normalize whitespace”, so that I think "this that" and "this that " are equivalent.

But you’re talking parsing, where the concern is a parser may throw out the whitespace before you see it.

If I just run xmllint on a sample file, it preserves whitespace.

❯ xmllint --format test.xml
<?xml version="1.0"?>
<style>
  <text foo="test, " bar="one    "/>
  <text foo="test,   " bar="one "/>
</style>
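
The same check can be made from code. As one more data point, here is a small sketch using Python's standard-library xml.etree.ElementTree on the same sample; the variable names are just for illustration, and it says nothing about how other parsers behave.

```python
import xml.etree.ElementTree as ET

# Same sample as the xmllint run above; no DTD, so no attribute types are declared.
SAMPLE = """<style>
  <text foo="test, " bar="one    "/>
  <text foo="test,   " bar="one "/>
</style>"""

root = ET.fromstring(SAMPLE)

# Collect the attribute values exactly as the parser hands them over.
values = [(node.get("foo"), node.get("bar")) for node in root.iter("text")]

for foo, bar in values:
    # repr() makes trailing spaces visible in the output.
    print(repr(foo), repr(bar))
```

Here too the trailing and repeated spaces should come through untouched.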

What library are you using? Is there not some option to modify whitespace processing, as here?

I’m using a library called xmlm, in OCaml, which is more or less the standard. It has options to control the whitespacing in data elements, but it’s magnificently insistent on stripping it from attribute values because, as it points out, that is what the spec says must be done, so … I was surprised, but on first blush it looks correct. I haven’t checked whether the other commonly used XML library is equally insistent.

I don’t think the XML with whitespace is invalid XML. It’s just (for our purposes) that text=" ," is not only just as valid as text="," but is supposed to be treated as the same by processors. If that’s right, one needs a non-compliant processor to draw a distinction which CSL needs.

I was just curious, so checked the style repo:

❯ rg -l '","' | wc -l 
1259

❯ rg -l '", "' | wc -l
2196

❯ rg -l '",  "' | wc -l
22

❯ rg -l '",   "' | wc -l
4

So the majority have one significant trailing space, but if I search beyond that, there are other cases with significant whitespace in attribute values.

Not sure what to say. I don’t recall this coming up before in the more than 10 years since CSL has been available.

And fixing it seems like it would be awkward, particularly in terms of converting existing styles.

In the meantime, could it be possible for you to pre-process the input file to replace the whitespace with tokens, before running it through the XML parser?
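
Roughly what I have in mind, as a Python sketch rather than a tested solution. Everything here is hypothetical: the sentinel character, the attribute list, and the protect/restore names. It is regex-based, so it only handles double-quoted attributes and would also rewrite look-alike text outside of tags.

```python
import re

# Private-use character standing in for a space; assumed never to occur in a style file.
SENTINEL = "\ue000"

# Attributes whose whitespace must survive parsing (illustrative selection only).
PROTECTED = ("prefix", "suffix", "delimiter")

# Matches name="value" for the protected names, double-quoted values only.
_ATTR = re.compile(r'\b(%s)="([^"]*)"' % "|".join(PROTECTED))

def protect(xml_text):
    """Swap spaces inside protected attribute values for the sentinel before parsing."""
    def swap(match):
        name, value = match.group(1), match.group(2)
        return '%s="%s"' % (name, value.replace(" ", SENTINEL))
    return _ATTR.sub(swap, xml_text)

def restore(value):
    """Turn sentinels back into spaces once the parser has handed over the value."""
    return value.replace(SENTINEL, " ")
```

The idea is to call protect() on the raw file, parse as usual, and call restore() on each attribute value you read. A plain &#32; reference would not do the job, since a referenced space is still a space when leading and trailing spaces are discarded.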

If you have some clever suggestion for us, let us know.

I think that would be overkill: parsing paired delimiters is relatively heavyweight work, and there’s plenty of places where one wants whitespace eaten. But I’m sure I can find a way. The critical thing is that you have confirmed my intuition that the behaviour is common and I don’t think I should do anything arbitrary like assuming commas always have spaces, or that nobody ever uses empty attributes on delimiters.

I think I can probably dig around in the library, find whatever function is doing the normalization, and … denormalize it. It’s abundantly clear that the behaviour (attributes preserve whitespace) is rationally critical to CSL, but it’s sort of curious that it seems to depend on non-standard XML processing, if it does, and I agree that if that is the case it would be a good idea to document it explicitly. Not the first time, to be sure, that a standard has said one thing and practice has diverged.

@Bruce_D_Arcus1 See also this discussion from 2016 https://github.com/citation-style-language/schema/issues/135

I see that, but it’s a different issue. I’m not yet sure what I think about it, or rintze’s proposed solution.

But it’s an easy change if we want to make it.

Just to report back on what I did: modify the library I was using so that it doesn’t normalize the attribute value in

  • prefix
  • suffix
  • delimiter
  • range-delimiter
  • initialize-with

But continues otherwise to normalize: so disambiguation=" true " will be as good as "true", but prefix=", " is not the same as prefix=",". That seems at least sane. If anyone can think of additional attributes that need to be passed through un-normalized, shout.
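
The actual change lives inside the OCaml library, but the logic amounts to something like this Python sketch. The names PASS_THROUGH and attribute_value are made up for the illustration, and the set holds only the five attributes listed above, so it is easy to extend.

```python
import re

# Attributes handed to the application untouched (the five listed above).
PASS_THROUGH = {
    "prefix",
    "suffix",
    "delimiter",
    "range-delimiter",
    "initialize-with",
}

def attribute_value(name, raw):
    """Return the value as written for pass-through attributes, normalized otherwise."""
    if name in PASS_THROUGH:
        return raw
    # Everything else: collapse runs of spaces and drop leading and trailing ones.
    return re.sub(" +", " ", raw).strip(" ")
```

So attribute_value("disambiguation", " true ") gives "true", while attribute_value("prefix", ", ") keeps its trailing space.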

That doesn’t deal with multiple whitespace, but I think processors are expected (though it’s not clear from the documents, IIRC …) to collapse multiple whitespace in any case. What rules there should be for doing that is a whole 'nother discussion.

value= should be passed through. Constructions like <text value="Review of "/> are common and reasonable.

Good point. sort-separator as well, I think.

Also sort-separator, name-delimiter, names-delimiter, cite-group-delimiter, year-suffix-delimiter, after-collapse-delimiter

I’m going through the spec to find any others

That looks like everything to me.

Clarification now merged to master.

Thanks!