XML Normalization

PaulStanley · June 22, 2020, 8:41am

After a sorry disaster about a year ago, during which I managed to lose everything, I have been rewriting my CSL processor, with all the advantages of having thrown away the first attempt. But I’ve hit an unexpected wall.

I realised that my processor was not passing leading or trailing whitespace (or pure whitespace) from attributes such as delimiter, prefix and suffix. So if the style file had, e.g. delimiter=", ", I was getting a delimiter of “,”.

At first I assumed a straightforward programming error on my part, but I think it may not be so. A bit of digging in the XML library I was using showed that it is removing leading or trailing whitespace from attributes. This turns out not to be an accident. The XML standards contain rules for attribute normalization, which state (I’ve put the critical requirement in bold):

Before the value of an attribute is passed to the application or checked for validity, the XML processor must normalize the attribute value by applying the algorithm below, or by using some other method such that the value passed to the application is the same as that produced by the algorithm.

1. All line breaks must have been normalized on input to #xA as described in 2.11 End-of-Line Handling, so the rest of this algorithm operates on text normalized in this way.

2. Begin with a normalized value consisting of the empty string.

3. For each character, entity reference, or character reference in the unnormalized attribute value, beginning with the first and continuing to the last, do the following:

   - For a character reference, append the referenced character to the normalized value.

   - For an entity reference, recursively apply step 3 of this algorithm to the replacement text of the entity.

   - For a white space character (#x20, #xD, #xA, #x9), append a space character (#x20) to the normalized value.

   - For another character, append the character to the normalized value.

If the attribute type is not CDATA, then the XML processor must further process the normalized attribute value by discarding any leading and trailing space (#x20) characters, and by replacing sequences of space (#x20) characters by a single space (#x20) character.

Note that if the unnormalized attribute value contains a character reference to a white space character other than space (#x20), the normalized value contains the referenced character itself (#xD, #xA or #x9). This contrasts with the case where the unnormalized value contains a white space character (not a reference), which is replaced with a space character (#x20) in the normalized value and also contrasts with the case where the unnormalized value contains an entity reference whose replacement text contains a white space character; being recursively processed, the white space character is replaced with a space character (#x20) in the normalized value.
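To make the quoted steps concrete, here is a minimal Python sketch of them. It is an illustration only, not code from any XML library: the function name and the `is_cdata` flag are hypothetical, it assumes line breaks are already normalized, and it handles numeric character references but no other entity references. The flag makes explicit that the trimming step is conditional on the attribute type.

```python
import re

# Matches a numeric character reference such as &#9; or &#x20;.
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def normalize_attribute(raw, is_cdata=True):
    """Sketch of the quoted algorithm (hypothetical helper).

    Entity references other than numeric character references are
    not handled here.
    """
    out = []
    pos = 0
    while pos < len(raw):
        m = CHAR_REF.match(raw, pos)
        if m:
            # Character reference: append the referenced character unchanged.
            code = int(m.group(1), 16) if m.group(1) else int(m.group(2))
            out.append(chr(code))
            pos = m.end()
        elif raw[pos] in " \r\n\t":
            # Literal white space character: append a space (#x20).
            out.append(" ")
            pos += 1
        else:
            # Any other character is appended as-is.
            out.append(raw[pos])
            pos += 1
    value = "".join(out)
    if not is_cdata:
        # Extra step for non-CDATA types: drop leading/trailing #x20
        # and collapse runs of #x20 to a single space.
        value = re.sub(" +", " ", value.strip(" "))
    return value
```

Under this sketch, `normalize_attribute(", ")` keeps the trailing space, while `normalize_attribute(", ", is_cdata=False)` returns `","`, so everything turns on whether the attribute is treated as CDATA.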

As I read this, it follows that the library I am using is correct to strip trailing whitespace from ", ": the attribute values ", " and "," are identical after normalization. One might argue about whether an attribute value that consists only of a space is “trailing”, I suppose (what does it trail?).

Needless to say, pure whitespace attributes and attributes with trailing whitespace are absolutely endemic in style files when defining delimiters and the like.

This raises two questions:

1. Is the library I am using misinterpreting the standard? Or does CSL depend on style files being processed without the normalization rules being applied, i.e. not in accordance with the XML standard? If so, is any normalization applied (e.g. to collapse multiple whitespace), or none? I can’t find an explicit reference to this departure from the XML standard in the CSL documentation, but I may well have missed it.
2. Is the only reliable workaround to modify the XML processor so that it does not normalize in accordance with the standard? At the moment I can’t think of any other safe way round, because I don’t think one could assume that delimiter="," and delimiter=", " should produce the same output. I approach the idea of modifying a library with a somewhat heavy heart, and I’m working in a language (OCaml) with only rather limited options.

Bruce_D_Arcus1 · June 23, 2020, 2:52pm

Hah, that’s a pretty interesting question we probably need to answer.

Either it’s not a problem, or we need to say something about it in the spec.

Beyond the XML spec, it would be nice to know how a couple/few widely used libraries deal with this?

I can do a little digging later.

Flagging this for @bwiernik and @Denis_Maier as we work on revisions.

Bruce_D_Arcus1 · June 23, 2020, 5:57pm

I’m still a bit unsure.

The Relax NG book has some discussion of whitespace normalization, but as it relates to validation.

From this perspective, when we use text in a pattern, we’re saying “preserve all whitespace.”

When we use "foo", that’s shorthand for token, which in this case means “normalize whitespace”, so that I think "this that" and "this that " are equivalent.

But you’re talking parsing, where the concern is a parser may throw out the whitespace before you see it.

If I just run xmllint on a sample file, it preserves whitespace.

❯ xmllint --format test.xml
<?xml version="1.0"?>
<style>
  <text foo="test, " bar="one    "/>
  <text foo="test,   " bar="one "/>
</style>

What library are you using? Is there not some option to modify whitespace processing, as here?
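As one more data point on how widely used libraries behave, here is a small Python check using the standard library's xml.etree.ElementTree (which sits on top of expat). It is only a sketch: the sample document is the one from the xmllint run above, and the helper name `attribute_values` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Same sample as the xmllint test file above.
SAMPLE = """<style>
  <text foo="test, " bar="one    "/>
  <text foo="test,   " bar="one "/>
</style>"""


def attribute_values(xml_text):
    """Return the attribute dict of every <text> element, in document order."""
    root = ET.fromstring(xml_text)
    return [dict(el.attrib) for el in root.iter("text")]


# repr() makes trailing spaces visible in the output.
for attrs in attribute_values(SAMPLE):
    print({name: repr(value) for name, value in attrs.items()})
```

With no DTD in play, this parser hands back the values with their trailing spaces intact, which matches what xmllint shows and is consistent with the attributes being treated as CDATA.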

PaulStanley · June 23, 2020, 6:09pm

I’m using a library called xmlm, in OCaml, which is more or less the standard. It has options to control whitespace handling in data elements, but it’s magnificently insistent on stripping it from attribute values because, as it points out, that is what the spec says must be done, so … I was surprised, but at first blush it looks correct. I haven’t checked whether the other commonly used XML library is equally insistent.

I don’t think the xml with whitespace is invalid xml. It’s just (for our purposes) that text=" ," is not only just as valid as text="," but is supposed to be treated as the same by processors. If that’s right, one needs a non-compliant processor to draw a distinction which CSL needs.

Bruce_D_Arcus1 · June 23, 2020, 6:27pm

I was just curious, so checked the style repo:

❯ rg -l '","' | wc -l 
1259

❯ rg -l '", "' | wc -l
2196

❯ rg -l '",  "' | wc -l
22

❯ rg -l '",   "' | wc -l
4

So the majority have one significant trailing space, but if I search beyond that, there are other cases with significant whitespace in attribute values.

Not sure what to say. I don’t recall this coming up before in the more than 10 years since CSL has been available.

And fixing it seems like it would be awkward, particularly in terms of converting existing styles.

Bruce_D_Arcus1 · June 23, 2020, 6:35pm

In the meantime, could it be possible for you to pre-process the input file to replace the whitespace with tokens, before running it through the XML parser?

If you have some clever suggestion for us, let us know.
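A rough Python sketch of what such a pre-processing pass could look like. Everything here is an assumption for illustration: the placeholder character, the three attribute names, and the helper names `protect` and `restore`. The regex is deliberately naive and would also match look-alike text outside tags, so treat it as a starting point rather than a robust solution.

```python
import re

# Private-use character, assumed never to occur in a real style file.
PLACEHOLDER = "\ue000"
# Illustrative subset of attributes whose spaces should survive parsing.
PROTECTED = ("prefix", "suffix", "delimiter")

# Attribute name, '=', then a double- or single-quoted value.
# The lookbehind stops "delimiter" matching inside a longer attribute name.
ATTR = re.compile(
    r"""(?<![\w:-])(%s)(\s*=\s*)(["'])(.*?)\3""" % "|".join(PROTECTED),
    re.DOTALL,
)


def protect(xml_text):
    """Swap spaces in protected attribute values for the placeholder."""
    def swap(m):
        name, eq, quote, value = m.groups()
        return name + eq + quote + value.replace(" ", PLACEHOLDER) + quote
    return ATTR.sub(swap, xml_text)


def restore(value):
    """Undo protect() on an attribute value handed back by the parser."""
    return value.replace(PLACEHOLDER, " ")
```

The idea would be to call `protect` on the style text before parsing and `restore` on each protected attribute value afterwards, so a parser that trims spaces never sees them.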

Bruce_D_Arcus1 · June 23, 2020, 6:47pm

PaulStanley · June 23, 2020, 7:18pm

I think that would be overkill: parsing paired delimiters is relatively heavyweight work, and there’s plenty of places where one wants whitespace eaten. But I’m sure I can find a way. The critical thing is that you have confirmed my intuition that the behaviour is common and I don’t think I should do anything arbitrary like assuming commas always have spaces, or that nobody ever uses empty attributes on delimiters.

I think I can probably dig around in the library, find whatever function is doing the normalization, and … denormalize it. It’s abundantly clear that the behaviour (attributes preserve whitespace) is critical to CSL, but it’s sort of curious that it seems to depend on non-standard XML processing, if it does, and I agree that if that is the case it would be a good idea to document it explicitly. Not the first time, to be sure, that a standard has said one thing and practice has diverged.

bwiernik · June 24, 2020, 2:14pm

@Bruce_D_Arcus1 See also this discussion from 2016 https://github.com/citation-style-language/schema/issues/135

Bruce_D_Arcus1 · June 24, 2020, 2:51pm

I see that, but it’s a different issue. I’m not yet sure what I think about it, or rintze’s proposed solution.

But it’s an easy change if we want to make it.

PaulStanley · June 24, 2020, 8:49pm

Just to report back on what I did: I modified the library I was using so that it doesn’t normalize the attribute value in

prefix
suffix
delimiter
range-delimiter
initialize-with

But it otherwise continues to normalize: so disambiguation=" true " will be as good as "true", but prefix=", " is not the same as prefix=",". That seems at least sane. If anyone can think of additional attributes that need to be passed through un-normalized, shout.

That doesn’t deal with multiple whitespace, but I think processors are expected (though it’s not clear from the documents, IIRC …) to collapse multiple whitespace in any case. What rules there should be for doing that is a whole 'nother discussion.
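For anyone wanting the same behaviour outside the XML library, the rule can be sketched in a few lines of Python. This is not the actual patch (which lives inside the OCaml library); the names `VERBATIM` and `attribute_value` are hypothetical, and "normalize" is assumed here to mean trimming and collapsing space characters.

```python
import re

# Attributes listed above whose values should reach the processor untouched.
VERBATIM = {"prefix", "suffix", "delimiter", "range-delimiter", "initialize-with"}


def attribute_value(name, raw, verbatim=VERBATIM):
    """Return raw unchanged for listed attributes; otherwise trim
    leading/trailing spaces and collapse runs of spaces."""
    if name in verbatim:
        return raw
    return re.sub(" +", " ", raw.strip(" "))
```

So `attribute_value("disambiguation", " true ")` gives `"true"`, while `attribute_value("prefix", ", ")` keeps its trailing space. The set is a parameter so that further attribute names can be added without touching the function.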

bwiernik · June 25, 2020, 7:14am

value= should be passed through. Constructions like <text value="Review of "/> are common and reasonable.

PaulStanley · June 25, 2020, 8:10am

Good point. sort-separator as well, I think.

bwiernik · June 25, 2020, 8:18am

Also sort-separator, name-delimiter, names-delimiter, cite-group-delimiter, year-suffix-delimiter, after-collapse-delimiter

bwiernik · June 25, 2020, 8:18am

I’m going through the spec to find any others

bwiernik · June 25, 2020, 8:41am

That looks like everything to me.

Bruce_D_Arcus1 · June 25, 2020, 3:37pm

Clarification now merged to master.

Thanks!
