Page range delimiter replacement

A quick question regarding page range delimiters.

I noticed that citeproc-js seems to replace the delimiter even when the
style does not define a page-range-format — at least that’s my
inference, from the test suite, e.g.: the input for test
fullstyles_ABdNT is pages “159-181” and the expected result is “159–
181”; the style being tested (associacao-brasileira-de-normas-tecnicas)
does not have a page-range-format option.

The spec says that “If the attribute is not set, page ranges are
rendered without reformatting.” My understanding of this was that the page
variable should not be touched at all, but the test case seems to
suggest that the delimiter should still be replaced.

Did I miss something here? If not (and if replacement is indeed the
preferred approach) I think this should be stated explicitly in the
spec.

Sylvester

Good catch. Frank made hyphen-replacement independent of the use of
the “page-range-format” attribute in early 2012, and I don’t think
there have been any complaints on the Zotero forums since. We probably
should adopt this, and mention it clearly in the spec.

See also https://github.com/citation-style-language/schema/issues/56#issuecomment-4436206
and https://forums.zotero.org/discussion/21831/citeproc-endashhyphen-in-bibliography-regression/

New ticket: https://github.com/citation-style-language/documentation/issues/32

Rintze

Sounds good to me; I guess it would be sufficient to add one more
sentence after the one I quoted. To make it easier for implementors we
could also enumerate exactly which characters should be replaced (my
take would be: -, en-dash, em-dash, making sure to catch things like
‘--’, too).

Sylvester

(I’m for changing the spec - I had always assumed this to work the way it
does now and we’d need major style updates if it doesn’t.)

On Wed, Jan 22, 2014 at 8:52 AM, Rintze Zelle <@Rintze_Zelle> wrote:

Sebastian, Sylvester, does either of you (or anybody else) ever
encounter item metadata that uses anything other than hyphens or
en-dashes in page ranges? Do we really have to substitute repeated
hyphens and em-dashes as well?

Rintze

I don’t know about em-dashes. Repeated hyphens are common in bibtex, so
they may make it into metadata.

On Thu, Jan 23, 2014 at 2:46 PM, Rintze Zelle <@Rintze_Zelle> wrote:

And in the TeX world, triple-dashes get replaced by em-dashes.

I haven’t followed this discussion, but looking back on it, I’m a little
confused by what you seem to be proposing.

Generally accepted “typographically correct” page range delimiter is an
en-dash. If you want a rule, I’d say replace a dash or double-dash with an
en-dash?

Sorry if I’m missing something; been a long day.

This is about which symbols the page-range-delimiter replaces and when. It
already defaults to en-dash, that part isn’t controversial.

So the controversial question is whether that list includes em-dashes (and
perhaps by extension triple dashes)?

And therefore a modest change would just add double-dashes?

I’d suggest the modest one as general approach if unsure; always easier to
add things like this than remove them.

OTOH, I can’t imagine why anyone would use an em-dash in a page range
except to indicate the delimiter.

I think most people don’t care about en-dashes and won’t see the difference, so replacing any version of a dash with an en-dash makes sense: they do the right thing without ever knowing it was a problem in the first place. The ones who do care will want an en-dash and will be delighted to see you do the right thing. The ones who are really crazy will have very weird ideas about page ranges and dashes, and they should be ignored.

This is what we do in Papers: replace any occurrence of any of these types of dashes with en-dash:

    /* hyphen */     @"\u2010"
    /* nbr hyphen */ @"\u2011"
    /* fig dash */   @"\u2012"
    /* en-dash */    @"\u2013"
    /* em-dash */    @"\u2014"

I like the idea of doing the same with multiple dashes, to get it down to just one en-dash. I had not realized this could be used, but I think it makes sense.
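
For illustration only, here is a rough Python sketch of the replacement described above. It is not code from Papers; the names `DASH_LIKE` and `to_en_dash` are made up for this example. Note that the list as posted does not contain the plain ASCII hyphen-minus (U+002D), so this sketch leaves "159-181" alone.

```python
# Hypothetical helper; names are illustrative, not taken from Papers.
EN_DASH = "\u2013"

# The five code points listed above: hyphen, non-breaking hyphen,
# figure dash, en-dash, em-dash.
DASH_LIKE = "\u2010\u2011\u2012\u2013\u2014"

# Translation table mapping each listed character to an en-dash.
_TABLE = str.maketrans({ch: EN_DASH for ch in DASH_LIKE})


def to_en_dash(pages: str) -> str:
    """Replace every listed dash-like character with an en-dash."""
    return pages.translate(_TABLE)


print(to_en_dash("159\u2014181"))  # the em-dash becomes an en-dash
print(to_en_dash("159-181"))       # ASCII hyphen-minus is not in the list, so unchanged
```

Collapsing repeated dashes into a single delimiter would need an extra step on top of this character-by-character mapping.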

Charles

just to be clear - we have “page-range-delimiter” as an attribute of
cs:style. This defaults to en-dash, but due to multiple requests from users
mainly in Southern Europe, which had style guides that very explicitly
required a hyphen or a non-breaking hyphen we made this customizable.
Sylvester asked about this, because the specifications currently say not
to do this unless page-range-format is set. They do so in two places:
Unambiguously here:

The “locator” variable is always rendered with an en-dash replacing any
hyphens.
For the “page” variable, this replacement is only performed if the
page-range-format attribute is set on cs:style

and somewhat more ambiguously here:

If the attribute [page-range-format] is not set, page ranges are
rendered without reformatting.

That was the original question, and I think everyone agrees we should
replace by the page-range-delimiter more generally.
The only remaining question is how broadly to do that: @Rintze - any reason
why we would not want to replace em-dash etc.?

On Fri, Jan 24, 2014 at 1:30 AM, Charles Parnot <@Charles_Parnot> wrote:

Thanks for the clarification. One clarification from me on Papers as well: the delimiter is indeed en-dash by default, or whatever is set by the CSL style, so my email was really meant to wholeheartedly agree with the idea of going after anything that looks like a dash and replacing it with the proper page delimiter.

Not really. Can we agree on what should be substituted? Based on
http://en.wikipedia.org/wiki/Dash#Common_dashes,
http://en.wikipedia.org/wiki/Hyphen#Unicode, and
http://en.wikipedia.org/wiki/Minus_sign#Character_codes , we might
want to cover:

  • dashes: figure dash, en-dash, em-dash, horizontal bar
  • hyphens: hyphen, hyphen-minus, soft-hyphen (maybe?), non-breaking
    hyphen, hyphen bullet
  • minus: minus

(substituting any occurrence of one or more of the same character)
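
To make the proposal concrete, a minimal Python sketch of "substitute any run of one or more of the same character" could look like the following. The code points are the standard Unicode values for the characters named in the list; the function name `normalize_page_range` and its `delimiter` parameter (standing in for the style's page-range-delimiter) are assumptions for this example, not anything defined by CSL or citeproc-js.

```python
import re

# Code points for the characters named in the list above.
CANDIDATES = (
    "\u2012"  # figure dash
    "\u2013"  # en-dash
    "\u2014"  # em-dash
    "\u2015"  # horizontal bar
    "\u2010"  # hyphen
    "\u002d"  # hyphen-minus
    "\u00ad"  # soft hyphen (marked "maybe?" in the list)
    "\u2011"  # non-breaking hyphen
    "\u2043"  # hyphen bullet
    "\u2212"  # minus sign
)

# One candidate character followed by zero or more repeats of that same character.
RUN = re.compile("([" + re.escape(CANDIDATES) + r"])\1*")


def normalize_page_range(pages: str, delimiter: str = "\u2013") -> str:
    """Replace each run of a single dash-like character with the delimiter."""
    # A function is used as the replacement so the delimiter is inserted literally.
    return RUN.sub(lambda _match: delimiter, pages)


for raw in ("159-181", "159--181", "159\u2014181", "5-6, 10---12"):
    print(normalize_page_range(raw))
```

Because the backreference only matches repeats of the same character, a mixed sequence such as a hyphen followed directly by an en-dash would yield two delimiters in this sketch; whether that is desirable is a separate question.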

Rintze

I just came across an interesting note on the description page of Wikipedia’s citation bot (https://en.wikipedia.org/wiki/User:Citation_bot#Page_numbers_with_hyphens), and thought it might be relevant here:

The bot replaces hyphens with en dash in page number ranges. On rare occasions when a hyphen is 
right and an en dash is wrong (hyphen in the page number itself), manually use the hyphen HTML 
code &#8209; instead of the dash/hyphen.

So, according to this, there are edge cases where the substitution would be wrong. I don’t have any examples, but if you want to pursue it, you could leave a note on the bot talk page (https://en.wikipedia.org/wiki/User_talk:Citation_bot), or contact the author (https://en.wikipedia.org/wiki/User_talk:Smith609).
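
As a sketch of how such an escape hatch could work in a processor: the HTML code &#8209; is the non-breaking hyphen (U+2011), so a substitution rule that leaves U+2011 alone lets users mark a hyphen that belongs to the page number itself. This borrows the bot's convention purely for illustration; it is not something CSL defines, the function name is hypothetical, and the page value in the example is made up.

```python
import html
import re

NB_HYPHEN = html.unescape("&#8209;")  # decodes to U+2011, non-breaking hyphen
EN_DASH = "\u2013"

# Assumption: only runs of hyphen-minus or en-dash are treated as range
# delimiters here; the non-breaking hyphen is deliberately left out.
DELIMITER_RUN = re.compile("([\\-\u2013])\\1*")


def replace_delimiter(pages: str, delimiter: str = EN_DASH) -> str:
    """Replace delimiter-like runs but keep non-breaking hyphens as typed."""
    return DELIMITER_RUN.sub(lambda _match: delimiter, pages)


# Made-up input: pages "3-1" to "3-5", with the inner hyphens entered as U+2011.
pages = "3" + NB_HYPHEN + "1-3" + NB_HYPHEN + "5"
print(replace_delimiter(pages))
```

The trade-off is that U+2011 can then no longer be part of the set of characters that get replaced, which conflicts with the broader lists suggested earlier in the thread.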

Chris Maloney
NIH/NLM/NCBI (Contractor)
Building 45, 5AN.24D-22
301-594-2842

Thanks for this Chris!

In my opinion this demonstrates that we must tread very carefully when
touching user input. A similar issue arises with single and double
quotes — IIRC citeproc-js applies conversions there as well (not quotes
added by the processor, but quotes present in the original input). I
have yet to look at this in more detail, but I imagine this is
potentially even more controversial than hyphens in page numbers. Is
there an equally strong consensus for converting quotes as there is for
the page range delimiter?

The problems with quotes and en-dashes are slightly different.

For en-dashes, the chance that auto-replace is doing something undesirable
is quite small. I’m sure hyphens in page numbers exist somewhere, but I
have never seen one. On the other hand, technically we wouldn’t need to do
this, since users could input en-dashes themselves in the data. However,
so few people are even aware of en-dashes, and even fewer know how to type
one on their computer that I think that would be a bad, bad idea (I believe
bibtex requires the en-dash, usually as --, in the data).

For quotation marks, there is a higher chance of problems - and they do
come up occasionally - but it is also absolutely crucial to getting correct
citations from the same data. Take an article title like: “Ain’t i a
woman?”: Towards an intersectional approach to person perception and
group-based harms
The above is how it’s printed in the journal and I imagine most people
would input it like that. Now if you cite this in APA, you want:
Goff, P. A., Thomas, M. A., & Jackson, M. C. (2008). “Ain’t i a woman?”:
Towards an intersectional approach to person perception and group-based
harms. Sex Roles, 59(5-6), 392–403. doi:10.1007/s11199-008-9505-4

I.e. double quotes as in the original. But in Chicago style, you want

Goff, Phillip Atiba, Margaret A. Thomas, and Matthew Christian Jackson.
“‘Ain’t I a Woman?’: Towards an Intersectional Approach to Person
Perception and Group-Based Harms.” Sex Roles 59, no. 5–6 (September 1,
2008): 392–403. doi:10.1007/s11199-008-9505-4.

i.e. converted single quotes. There is absolutely no alternative to having
the processor do this, so yes, I do think it’s necessary and should be
uncontroversial.
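
A minimal Python sketch of the quote handling in the Chicago example, assuming en-US style quotation marks: when the processor wraps a title in double quotes, double quotes already inside the title are demoted to single quotes. The function name is hypothetical and this is not how citeproc-js is implemented; it ignores straight quotes, other locales, and deeper nesting.

```python
# Hypothetical helper; quote characters assume an en-US style locale.
OUTER_OPEN, OUTER_CLOSE = "\u201c", "\u201d"  # curly double quotes
INNER_OPEN, INNER_CLOSE = "\u2018", "\u2019"  # curly single quotes


def quote_title(title: str) -> str:
    """Wrap a title in outer quotes, demoting outer quotes already inside it."""
    inner = title.replace(OUTER_OPEN, INNER_OPEN).replace(OUTER_CLOSE, INNER_CLOSE)
    return OUTER_OPEN + inner + OUTER_CLOSE


title = "\u201cAin\u2019t I a Woman?\u201d: Towards an Intersectional Approach"
print(quote_title(title))
# “‘Ain’t I a Woman?’: Towards an Intersectional Approach”
```

The apostrophe in "Ain’t" is the same character as the closing single quote, so it passes through unchanged here. A style that does not add quotes around the title, as in the APA example, would simply not call this step.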

So while, as a general issue, I agree it’s tricky to auto-anything with
user content, I do think we’re doing the right thing here in both cases.

On Sun, Jan 26, 2014 at 9:30 AM, Sylvester Keil <@Sylvester_Keil> wrote:

Sebastian, I fully agree that replacement is often crucial; especially
when quotes are added by the processor, like in your example, it is
important to then substitute quotes of the same kind — but how far
should the processor go?

Your example is the easy case: you are adding “ and ” and you replace
all occurrences of the same quotation marks by the inner-quote variants.
I agree that it makes sense for CSL to demand such behavior.

But would you replace " also (as in: “Ain’t I a woman?”)? More
importantly, how do you distinguish between opening and closing "? Or
what do you do if your current locale defines quotes as » and « — do you
still replace all occurrences of " quotes? More interestingly, do you
replace “ and ” too?

The next question is: do you replace ’ with single quotes? Again, do you
do that always, or only if your locale’s single quotes are ‘ and ’?

Sylvester