How bugged may a style be?

Hi,

this is from ieee.csl that comes with the test-suite:

 <group delimiter=", ">
    <text macro="title"/>
    <text variable="container-title" font-style="italic"/>
    <text variable="volume" prefix=" vol. "/>

[…]

There’s a group delimiter with a final space and a member’s prefix
starting with a space.

Now, citeproc-js will remove the extra space, thus encouraging this
sort of stupid bug. I’ve already objected to such an approach: bugged
styles should be treated as bugged styles. Period.

I coded a “clean” function to get rid of this kind of error myself.
But I’m now planning to remove it, because it makes me waste time.
I do not want a buggy style to just work, because I’ll never know if a
bug is my fault, the style’s fault, or my fault for not being tolerant
enough of poor style coders.

Andrea
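
To make the problem concrete, here is a minimal Python sketch of what a processor that concatenates affixes verbatim would emit for the group quoted above. The `render_group` helper and the sample field values are hypothetical illustrations, not code or data from citeproc-js, citeproc-hs, or the test suite.

```python
def render_group(children, delimiter):
    """Join rendered children with the delimiter, applying each prefix verbatim.

    `children` is a list of (prefix, value) pairs; empty values are skipped,
    as a group would skip variables that have no data.
    """
    parts = [prefix + value for prefix, value in children if value]
    return delimiter.join(parts)


# Made-up data mirroring the group: title, container-title, volume.
children = [("", "A Title"), ("", "Some Journal"), (" vol. ", "12")]
output = render_group(children, ", ")
print(repr(output))  # 'A Title, Some Journal,  vol. 12'
```

The delimiter's trailing space and the prefix's leading space meet, so two spaces appear before "vol." unless something removes one of them.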

I think the discussions we’ve previously had about the analog of your
“clean” function were limited to punctuation, and were based on cases
where it may not be so obvious whether it’s a “poor” style. But I don’t
recall the details, so I’m curious what Rintze and Frank have to say about it.

Bruce

I totally agree that punctuation must be treated specially (since it
interacts with attributes, etc.).

I’m just talking about extra spaces or bugs like this. Shall we have a
common rule according to which ieee is bugged? I wouldn’t be able to
put it into words, but I’d know how to apply it in the present case…

Andrea

I forgot to mention, sorry: the relevant test is
bugreports_IeeePunctuation.

Andrea

Sounds like we need to either add a rule that extra spaces are suppressed, or citeproc implementations need to ensure that they show up when rendered as HTML (e.g., by alternating space characters and &nbsp; entities) so that we can be sure that style authors will notice them. The former rule would be more tolerant of mistakes in styles, but if I remember correctly some (old-fashioned) style guides specifically state that there should be two spaces after a period. To conform to these style guides exactly, we’d need the latter rule.

Simon
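
The second option could look roughly like the Python sketch below. It assumes the output is HTML, where a browser collapses runs of ordinary spaces; the function name `show_extra_spaces` is made up for this illustration.

```python
import re

NBSP_ENTITY = "&nbsp;"


def show_extra_spaces(text):
    """Rewrite each run of two or more spaces as alternating ' ' and '&nbsp;'.

    A browser collapses consecutive ordinary spaces, but it cannot collapse
    a space and a no-break space into one, so the full run stays visible.
    """
    def alternate(match):
        run_length = len(match.group(0))
        return "".join(
            " " if i % 2 == 0 else NBSP_ENTITY for i in range(run_length)
        )

    return re.sub(r" {2,}", alternate, text)


print(show_extra_spaces("Some Journal,  vol. 12"))
# Some Journal, &nbsp;vol. 12
```

Single spaces are left alone, so only doubled (or longer) runs change in the rendered page.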

The current citeproc-js implementation suppresses multiple spaces only
where they arise from the combination of neighboring affixes (prefix,
suffix or delimiter). If two spaces are explicitly set within a
single affix, they will be passed through as they stand.
(Confirmatory test checked in a few minutes ago.)
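
As a rough illustration of that behaviour (not the citeproc-js code, which has to step around per-node formatting), the Python sketch below works on a flat list of strings. The name `join_affixes` and the sample pieces are assumptions made for the example.

```python
def join_affixes(pieces):
    """Concatenate affixes and values in order.

    If the output so far already ends with a space, leading spaces on the
    next piece are dropped. Spaces inside a single piece are never touched.
    """
    out = ""
    for piece in pieces:
        if out.endswith(" "):
            piece = piece.lstrip(" ")
        out += piece
    return out


# Delimiter ", " followed by prefix " vol. ": the boundary space is merged.
print(repr(join_affixes(["Some Journal", ", ", " vol. ", "12"])))
# 'Some Journal, vol. 12'

# Two spaces written inside one affix pass through unchanged.
print(repr(join_affixes(["End.", "  ", "Next"])))
# 'End.  Next'
```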

While I agree with Andrea that suppressing extraneous spaces in the
processor is a burden – I was surprised to discover, in
post-deployment feedback, the variety of situations in which they can
arise – there are a couple of reasons for going the extra mile. If
the processor is forgiving in this case (as well as with duplicate
punctuation), that makes it easier to recombine macros. There is also
an argument for making things easy for style authors and maintainers,
since their time is a scarce resource in the ecosystem.

We also lack a testing framework and test cases for the styles
themselves. Without a means of catching misformatting before
deployment, passing spaces more strictly would probably result in more
list traffic, with glitches turning up against novel data combinations
in the hands of users.

So although coding for the suppression of extra spaces is a headache,
there were some reasons for doing it.

Frank

Can we easily document the suppression algorithm independent of
particular implementations?

It’s a very hard case, since the suppression needs to step around
formatting that is applied to each node as the citation is flattened
for output. How you go about it depends on the form in which the
citation is held before rendering.

The citeproc-js code was recently refactored to isolate most of the
work in a couple of functions, amounting to 300 lines of code. It’s
here, for what it’s worth (from line 559):

http://bitbucket.org/fbennett/citeproc-js/src/31cefc16cfd5/src/queue.js

Frank

For sake of argument, though, couldn’t we just have a rule something
like the whitespace handling rule in HTML, except just expanded to
include commas?

http://www.w3.org/TR/html401/struct/text.html#h-9.1

E.g. “… user agents should collapse input white space sequences when
producing output inter-word space.”

As Simon indicated, explicit double-spacing should be allowed.

Am not sure I agree with that, however, so we should consider that a
separate issue (whether to, in effect, work around HTML whitespace
rules). The question remains whether we can specify the rule at a
high level like this.

Bruce

If a rule is included in the specification, certainly it should be
phrased at a high level, as you suggest; my pointer to the citeproc-js
source was not really responsive.

Whether to preserve double whitespace under some conditions is not
specific to HTML, of course. The question is whether and under what
conditions repeated space characters should be deleted from output.
If double spaces are to be permitted under some conditions (such as
when they are written into a single affix), that should be made
explicit in the specification.

True. My vote is that we have more or less the same rule as HTML, which is
that duplicate whitespace gets removed. But note: whitespace would not
include the non-breaking space Unicode character.

Bruce
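
A minimal Python sketch of such a rule follows, assuming "whitespace" means the space, tab, line feed, carriage return and form feed characters. That character set and the function name `collapse_whitespace` are assumptions for illustration; the thread has not settled on an exact definition.

```python
import re

# Only these characters collapse. U+00A0 (no-break space) is deliberately
# left out, so it survives as an explicit way to request extra spacing.
# Note that Python's \s would also match U+00A0, so it is not used here.
_COLLAPSIBLE = re.compile(r"[ \t\n\r\f]+")


def collapse_whitespace(text):
    """Replace every run of collapsible whitespace with a single space."""
    return _COLLAPSIBLE.sub(" ", text)


print(repr(collapse_whitespace("Some Journal,  vol. 12")))
# 'Some Journal, vol. 12'
print(repr(collapse_whitespace("End.\u00a0\u00a0Next")))
# 'End.\xa0\xa0Next'
```

Under this rule a style that really wants two spaces after a period would have to write no-break spaces, and an accidental double space from neighbouring affixes disappears.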

That should work.

The current citeproc-js implementation suppresses multiple spaces only
where they arise from the combination of neighboring affixes (prefix,
suffix or delimiter). If two spaces are explicitly set within a
single affix, they will be passed through as they stand.
(Confirmatory test checked in a few minutes ago.)

While I agree with Andrea that suppressing extraneous spaces in the
processor is a burden – I was surprised to discover, in
post-deployment feedback, the variety of situations in which they can
arise – there are a couple of reasons for going the extra mile. If
the processor is forgiving in this case (as well as with duplicate
punctuation), that makes it easier to recombine macros. There is also
an argument for making things easy for style authors and maintainers,
since their time is a scarce resource in the ecosystem.

Well, I did not say it is a burden. Actually here it is just a matter
of changing this function:

  isPunct = and . map (flip elem ".,;:!?") $ headInline is

into this one:

  isPunct = and . map (flip elem ".,;:!? ") $ headInline is

a single character would do the job.

So I removed that character and saw what happened. citeproc-hs was
passing 401 tests and now it passes 392. Of the 9 new failures, 3 happen
to be bugs that the character was hiding.

Shall we have a look at the rest?

bugreports_DuplicateSpaces: this is clearly a style bug (related to
the one that opened this thread, actually):

the macro “publisher-place” has a space prefix when publisher-place is
set, but the macro is used inside a group with a delimiter:

           <group delimiter=", ">
              <text macro="title"/>
              <text macro="publisher"/>
           </group>

The prefix is not needed.

bugreports_DuplicateSpaces3: another style bug you are trying to hide.

      <group prefix=" ">
        <text term="in" text-case="capitalize-first" suffix=": "/>
        <text macro="editor"/>
        <text variable="container-title" font-style="italic" prefix=" " suffix="."/>
        <text variable="volume" prefix="Vol " suffix="."/>
        <text macro="edition" prefix=" "/>

the “container-title” has a prefix regardless of whether an editor
is present. The correct solution should be:

      <group prefix=" ">
        <text term="in" text-case="capitalize-first" suffix=": "/>
        <text macro="editor" suffix=" "/>
        <text variable="container-title" font-style="italic" suffix="."/>
        <text variable="volume" prefix="Vol " suffix="."/>
        <text macro="edition" prefix=" "/>

bugreports_LabelsOutOfPlace hides a bug in chicago-full: a macro
starting with a prefix=" "

...

used inside a group with a delimiter ending with a space:

  <group delimiter=". ">
    <text macro="contributors" />
    <text macro="title" />
    <text macro="description" />
    <text macro="secondary-contributors" />

bugreports_parenthesis and bugreports_DuplicateSpaces2 are quite
interesting tests since they are both testing the same group in the
same style, mhra-x.csl

This is tough, and could indeed be a case where

“If the processor is forgiving in this case (as well as with
duplicate punctuation), that makes it easier to recombine macros.”

We are not talking about macros here, though. Still:

      <group prefix=" " suffix="">
        <text variable="container-title" font-style="italic" prefix=" "/>
        <text variable="volume" prefix=" "/>
        <text variable="issue" prefix=", no. "/>
        <date variable="issued" prefix=" (" suffix=")">
          <date-part name="month" suffix=" "/>
          <date-part name="day" suffix=", "/>
          <date-part name="year"/>
        </date>
        <text variable="page" prefix=": "/>
      </group>

We would like to use a group delimiter set to a space, but if there is
an “issue” the delimiter should not be applied. That could be a use
case for your solution, if and only if CSL is not expressive enough
to handle this kind of case.

But what about:













I prepared a test here:
http://gorgias.mine.nu/citeproc/haskell_ExtraSpacesMhra-x.txt

I do not need to suppress any space to pass it.

collapse_TrailingDelimiter is specifically bugged as far as my
understanding of CSL goes. Shouldn’t “et-al” be subject to the
delimiter?

    <names variable="author">
      <name and="symbol" delimiter=", " form="short" />
      <et-al prefix=" " />
    </names>

simplespace_case1 is another case of a style bug you are trying to
hide: harvard1.

      <group prefix=" " delimiter=" ">
        <text term="in" text-case="capitalize-first"/>
        <text macro="editor"/>
        <text variable="container-title" font-style="italic" suffix="."/>
        <text variable="collection-title" suffix="."/>
        <group suffix="." delimiter=", ">
          <text macro="publisher" prefix=" "/>
          <text macro="pages"/>
        </group>
      </group>

You have a group with a delimiter (" ") and a member with a prefix set
to " ".

We also lack a testing framework and test cases for the styles
themselves. Without a means of catching misformatting before
deployment, passing spaces more strictly would probably result in more
list traffic, with glitches turning up against novel data combinations
in the hands of users.

In the short run you are right, bugged styles will survive because
their bugs will be hidden by your 300 lines of code.

So although coding for the suppression of extra spaces is a headache,
there were some reasons for doing it.

It is not a headache. It hides bugs away, pretending there are none. I
think using bugged styles to check whether the processor hides them well
is a faulty approach to creating a robust standard language for
formatting citations.

I think I gave you enough evidence to ask you to be more
specific about the advantages you think extra space elimination would
bring to the clarity and consistency of CSL.

Sorry I was so long.
Andrea

I think I was wrong about the three bugs. I believe these are three
style bugs hidden by citeproc-js too:

plural_NameLabelNever
plural_NameLabelContextualPlural
plural_NameLabelDefaultPlural

They all have:



If you were to set the suffix of to “***”, I think you agree
we should have:

Doe*** and Roe***ed

So, if you want

Doe and Roe ed

you need to write:

  <names variable="editor">
    <name and="text" form="short" />
    <label prefix=" " form="short" plural="never" strip-periods="true" />
  </names>

Am I right?

Andrea

Andrea,

I’ll happily adjust to do whatever the specification calls for. The
CSL behind the tests you raise in your mail all could be written to
avoid the need for duplicate suppression. So no dispute there.

I’ve laid out the reasoning behind space suppression in my previous
mail. The trade-off for eliminating this behavior would be breakage in
a number of existing styles. To protect against that, both at this
point and in future development, we would need a test framework, with
a good foundation of test cases for all extant styles. I don’t think
anyone is proposing to build that infrastructure, so the breakage
would mostly emerge from user feedback. That would mean a lot of
back-and-forth correspondence to debug the styles, often under
severe time pressure at both ends.

Duplicate space suppression may be inelegant, but it avoids this
problem. As far as I can tell, it doesn’t cause any others.
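
Short of a full test framework, the particular pattern raised in this thread can at least be flagged statically. The Python sketch below is a rough, hypothetical check and not an existing tool: `find_space_clashes` and the abbreviated, namespace-free fragment are made up for the example. It looks only at direct children after the first, and cannot know which variables will be empty at render time.

```python
import xml.etree.ElementTree as ET


def find_space_clashes(xml_text):
    """Report children whose prefix starts with a space inside an element
    whose delimiter ends with a space.

    The first child is skipped because no delimiter is emitted before it.
    Returns a list of (delimiter, child tag, prefix) tuples.
    """
    root = ET.fromstring(xml_text)
    clashes = []
    for parent in root.iter():
        delimiter = parent.get("delimiter", "")
        if not delimiter.endswith(" "):
            continue
        for child in list(parent)[1:]:
            prefix = child.get("prefix", "")
            if prefix.startswith(" "):
                clashes.append((delimiter, child.tag, prefix))
    return clashes


# Abbreviated fragment modelled on the ieee.csl group quoted in this thread.
fragment = """
<group delimiter=", ">
  <text macro="title"/>
  <text variable="container-title" font-style="italic"/>
  <text variable="volume" prefix=" vol. "/>
</group>
"""
clashes = find_space_clashes(fragment)
print(clashes)  # [(', ', 'text', ' vol. ')]
```

A check like this would not catch clashes that cross macro or nested-group boundaries, which is where several of the cases discussed above arise.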

Well, I did not say it is a burden.

“Waste of my time” was the phrase. Sorry if I misinterpreted that.

In the short run you are right, bugged styles will survive because
their bugs will be hidden by your 300 lines of code.

I certainly wouldn’t claim that the citeproc-js code is compact or elegant. :-)

I think I gave you enough evidence to ask you to be more
specific about the advantages you think extra space elimination would
bring to the clarity and consistency of CSL.

There would be costs associated with eliminating this behavior. I
think the question runs the other way. What is lost through the
suppression of duplicate spaces?

Can we isolate the types of conditions which are likely to lead to
these problems? Do we know of some examples, beyond the ones in the
test suite that Andrea identified?

Bruce

I completely agree with your analysis. But my opinion is that we have
to enforce a stricter policy now. I think the cost will, in the long
run, be repaid by better styles.

I know it is a trade-off we are facing and I understand your point.

Andrea