What is the use case and meaning of rich-text's "span" elements?

Hi,

The definition of “rich text” in CSL input format seems too fuzzy with
respect to elements.

In CSL’s data schema there are the following two kinds of spans [1]:

Does the first have any use case at all? In citeproc-js, there are two
other kinds of spans [2]:


Is there a difference between the two?

As far as I understand, spans are needed to mark up text whose case
should not be transformed, as with proper nouns. But this is one use
case, which should result in one (!) well-defined markup element. CSL
rich text already defines one custom non-HTML markup element ( for small
caps), so why not something like ? I don’t mind which markup element is
used, but it should be clear which one is standard and what the element
means. I bet that in contrast to and , nesting spans does not alter the
meaning, so … can be normalized to …, right? If you allow any
element, we end up with incompatible, custom data.

Cheers
Jakob

[1] https://bitbucket.org/bdarcus/csl-schema/src/tip/csl-data.rnc
[2] https://bitbucket.org/fbennett/citeproc-js/src/tip/src/util_flipflop.js
--
Verbundzentrale des GBV (VZG)
Digitale Bibliothek - Jakob Voß
Platz der Goettinger Sieben 1
37073 Goettingen - Germany
+49 (0)551 39-10242
http://www.gbv.de
@Jakob_Voss

The definition of “rich text” in CSL input format seems too fuzzy with
respect to elements.

Agreed; thanks for raising this.

In CSL’s data schema there are the following two kinds of spans [1]:

Not seeing anything in your note, but the RNC definition for the rich text is:

rich-variable-pattern =
  (text
   | element abbr { text }
   | element b { text }
   | element cite {
       ## cited title which is a part (like an article), and so
       ## typically rendered in quotes, rather than italics
       attribute class { "part" }?,
       text
     }
   | element i { text }
   | element sc { text }
   | element span {
       ## text whose case should not be transformed (as with proper nouns)
       attribute class { "protect" }?,
       text
     }
   | element sup { text }
   | element sub { text })+

}
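
As a rough illustration of what the pattern above permits, here is a minimal JavaScript sketch that lists the element names in a rich-text string that fall outside the schema's set. The name `unknownElements` and the regex-based tag scan are assumptions made for this example; neither comes from the schema or from citeproc-js, and a real validator would parse the markup properly.

```javascript
// Element names taken from the rich-variable-pattern quoted above.
const ALLOWED_ELEMENTS = new Set(["abbr", "b", "cite", "i", "sc", "span", "sup", "sub"]);

// Hypothetical helper: return the names of tags not allowed by the pattern.
// It only looks at tag names; it does not check nesting or attributes.
function unknownElements(richText) {
  const unknown = new Set();
  for (const match of richText.matchAll(/<\/?([A-Za-z][\w-]*)/g)) {
    const name = match[1].toLowerCase();
    if (!ALLOWED_ELEMENTS.has(name)) unknown.add(name);
  }
  return [...unknown];
}
```

A converter could run such a check before handing data to a processor, so that custom elements are reported instead of being silently passed through.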

Does the first have any use case at all? In citeproc-js, there are two
other kinds of spans [2]:

I can’t access BitBucket ATM, so can’t compare.

Is there a difference between the two?

As far as I understand, spans are needed to mark up text whose case
should not be transformed, as with proper nouns. But this is one use
case, which should result in one (!) well-defined markup element. CSL
rich text already defines one custom non-HTML markup element ( for small
caps), so why not something like ? I don’t mind which markup element is
used, but it should be clear which one is standard and what the element
means. I bet that in contrast to and , nesting spans does not alter the
meaning, so … can be normalized to …, right? If you allow any
element, we end up with incompatible, custom data.

My inclusion of the sc element is a bug. I’d really prefer to stick to
a strict subset of HTML, since it can do what we need, and there are
tons of tools and software that can deal with it.

The use cases I see are:

  • preserve case (proper nouns and acronyms, though the latter can be
    handled programmatically)
  • titles-within-titles
  • species names
  • foreign language terms

The last two can probably reasonably be handled without any semantic
markup: if the field is output in italics, switch in-field italics to normal.
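
The presentational rule just described can be sketched in a few lines of JavaScript. This is only an illustration under stated assumptions: the function name `renderItalics`, the inline-style output, and the restriction to non-nested `<i>` runs are choices made for the example, not how any CSL processor does it.

```javascript
// Hypothetical helper: render <i>…</i> runs inside a field, switching them
// to roman type when the surrounding field is itself set in italics.
// Handles only flat (non-nested) <i> runs.
function renderItalics(text, fieldIsItalic) {
  // Inside an italic field, in-field italics become "normal"; otherwise "italic".
  const style = fieldIsItalic ? "normal" : "italic";
  return text.replace(/<i>(.*?)<\/i>/g, (_, inner) =>
    `<span style="font-style:${style};">${inner}</span>`);
}
```

With this approach a species name or foreign term marked up as italic stays visually distinct whether or not the style italicizes the whole title.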

The second is tricky. While I hate to admit it, maybe it can be
handled through presentation rules similar to the above? But then we
get into awkwardness like, are the following all equivalent from a
processing perspective: simple quotation marks, “smart” quotes that
use the right- and left-hand quote marks, the LaTeX double '' and ``
characters, not to mention all the international wrinkles?

So I came to the conclusion it’s cleaner to require wrapping titles.

But if we do that, we need to distinguish between two inline titles:
regular (normally italicized in English), and parts (normally in
quotes).

The first needs some wrapper.

Bruce

Damn, the webmailer scrubbed all tags, sorry!

The RNC definition for rich text [1] implies the following rich text
elements:

abbr, b, cite, cite+class=part, i, sc, span, span+class=protect, sup,
sub

The FlipFlop parser of citeproc-js [2] supports these rich text
elements:

i, b, sup, sub, sc, span+class=nodecor, span+class=nocase, ", '

My inclusion of the sc element is a bug. I’d really prefer to stick to
a strict subset of HTML, since it can do what we need, and there are
tons of tools and software that can deal with it.

At least ‘sc’ for small-caps is well defined; you can map it to HTML and
vice versa. The elements i, b, sup, and sub are also clear, so I focus
on the rest. You wrote about use cases:

  • preserve case (proper nouns and acronyms, though the latter can be
    handled programmatically)

Why keep ‘abbr’? It is semantic markup, not supported by citeproc-js
anyway.

  • species names
  • foreign language terms

The last two can probably reasonably be handled without any semantic
markup: if field output as italic, switch in-field italics to normal.

Yes, this is semantic markup that we do not have in real data. In
practice we only have presentational markup like italic, which could
indicate species names, foreign language terms, or anything else beyond
the scope of CSL.

  • titles-within-titles

[this] is tricky. While I hate to admit it, maybe it can be
handled through presentation rules similar to the above? But then we
get into awkwardness like, are the following all equivalent from a
processing perspective: simple quotation marks, “smart” quotes that
use the right- and left-hand quote marks, the LaTeX double '' and ``
characters, not to mention all the international wrinkles?

It’s not only about titles within titles but also about other words set
in quotes in a title. I thought citeproc-js handles these by parsing
quotes (" and ') in rich text. International quotes and LaTeX characters
are out of the scope of the CSL schema; they must be recognized and
converted in a pre-parsing stage. I thought that the elements ‘cite’ and
‘cite+class=part’ correspond to two levels of quotes in citeproc-js.

But if we do that, we need to distinguish between two inline titles:
regular (normally italicized in English), and parts (normally in
quotes).

I have never heard of these “normally” rules before; they seem to apply
to English only. We only have italicized parts and parts set in quotes,
whatever they mean.

My question was about the remaining markup element:

span or span+class=protect in CSL schema
and
span+class=nodecor or span+class=nocase in citeproc-js

There could be one element for case-preserving (which is semantic
markup), but two?

Jakob

[1] https://bitbucket.org/bdarcus/csl-schema/src/tip/csl-data.rnc
[2] https://bitbucket.org/fbennett/citeproc-js/src/tip/src/util_flipflop.js
which contains the following definition:

["", "", "italics", "@font-style", ["italic", "normal"], true],
["", "", "bold", "@font-weight", ["bold", "normal"], true],
["", "", "superscript", "@vertical-align", ["sup", "sup"], true],
["", "", "subscript", "@vertical-align", ["sub", "sub"], true],
["", "", "smallcaps", "@font-variant", ["small-caps", "small-caps"], true],
["", "", "passthrough", "@passthrough", ["true", "true"], true],
["", "", "passthrough", "@passthrough", ["true", "true"], true],
['"', '"', "quotes", "@quotes", ["true", "inner"], "'"],
[" '", "'", "quotes", "@quotes", ["inner", "true"], '"']

  • preserve case (proper nouns and acronyms, though the latter can be
    handled programmatically)

Why keep ‘abbr’? It is semantic markup, not supported by citeproc-js
anyway.

I see no reason why citeproc-js can’t support it. The main issue is
whether there’s value in defining acronyms in markup (including in
output, BTW).

  • species names
  • foreign language terms

The last two can probably reasonably be handled without any semantic
markup: if field output as italic, switch in-field italics to normal.

Yes, this is semantic markup that we do not have in real data. In
practice we only have presentational markup like italic, which could
indicate species names, foreign language terms, or anything else beyond
the scope of CSL.

But this is partly forward-looking; no reason Zotero, Mendeley, et al
can’t allow for tagging text in this way.

But we have to have balance here obviously.

  • titles-within-titles

[this] is tricky. While I hate to admit it, maybe it can be
handled through presentation rules similar to the above? But then we
get into awkwardness like, are the following all equivalent from a
processing perspective: simple quotation marks, “smart” quotes that
use the right- and left-hand quote marks, the LaTeX double '' and ``
characters, not to mention all the international wrinkles?

It’s not only about titles within titles but also about other words set
in quotes in a title. I thought citeproc-js handles these by parsing
quotes (" and ') in rich text. International quotes and LaTeX characters
are out of the scope of the CSL schema; they must be recognized and
converted in a pre-parsing stage.

I’m not sure I like this idea. If we’re going to expect pre-parsing,
then let’s do the input right?

I thought that the elements ‘cite’ and ‘cite+class=part’ correspond
to two levels of quotes in citeproc-js.

Probably, but Frank and I haven’t discussed it.

But if we do that, we need to distinguish between two inline titles:
regular (normally italicized in English), and parts (normally in
quotes).

I have never heard of these “normally” rules before; they seem to apply
to English only. We only have italicized parts and parts set in quotes,
whatever they mean.

OK.

My question was about the remaining markup element:

span or span+class=protect in CSL schema
and
span+class=nodecor or span+class=nocase in citeproc-js

There could be one element for case-preserving (which is semantic
markup), but two?

I have no clue what either ‘nodecor’ or ‘nocase’ mean.

Bruce

Bruce wrote:

  • titles-within-titles

[this] is tricky. While I hate to admit it, maybe it can be
handled through presentation rules similar to the above? But then we
get into awkwardness like, are the following all equivalent from a
processing perspective: simple quotation marks, “smart” quotes that
use the right- and left-hand quote marks, the LaTeX double '' and ``
characters, not to mention all the international wrinkles?

It’s not only about titles within titles but also about other words set
in quotes in a title. I thought citeproc-js handles these by parsing
quotes (" and ') in rich text. International quotes and LaTeX characters
are out of the scope of the CSL schema; they must be recognized and
converted in a pre-parsing stage.

I’m not sure I like this idea. If we’re going to expect pre-parsing,
then let’s do the input right?

By pre-parsing I meant converting from BibTeX, MARC21, or any other
bibliographic format to the CSL input format. This conversion should
also cover the special markup of these formats. If BibTeX and all its
special character sequences must be converted to Unicode anyway, why not
also convert the LaTeX quotes? By the way, Biber (a reimplementation
of BibTeX) contains some BibTeX to Unicode conversion:
http://biblatex-biber.sourceforge.net/doc.html
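
To make the pre-parsing idea concrete, here is a minimal JavaScript sketch of one such conversion step: mapping LaTeX quote sequences to the ASCII quotes that citeproc-js parses. The function name `latexQuotesToAscii` is hypothetical, and the sketch deliberately ignores ambiguous input such as three apostrophes in a row; it is not taken from Biber or any existing converter.

```javascript
// Hypothetical pre-parsing step for a BibTeX-to-CSL converter:
// turn LaTeX quote sequences into plain ASCII quote marks.
function latexQuotesToAscii(s) {
  return s
    .replace(/``|''/g, '"') // doubled marks first: `` and '' become "
    .replace(/`/g, "'");    // any remaining backtick is a single opening quote
}
```

For example, the BibTeX title A ``fine'' man would reach the processor as A "fine" man, so the processor only ever has to recognize one quote convention.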

I thought that the elements ‘cite’ and ‘cite+class=part’ correspond
to two levels of quotes in citeproc-js.

Probably, but Frank and I haven’t discussed it.

Ok, maybe Frank can help out here.

Jakob

It’s great to see this discussion.

  • preserve case (proper nouns and acronyms, though the latter can be
    handled programmatically)

Why keep ‘abbr’? It is semantic markup, not supported by citeproc-js
anyway.

I see no reason why citeproc-js can’t support it. The main issue is
whether there’s value in defining acronyms in markup (including in
output, BTW).

The rich text processing in citeproc-js is purely presentational at
the moment. Semantic markup can certainly be supported, but it’s an
additional layer that would need some kind of explicit configuration
mechanism in CSL itself, providing per-style mappings between semantic
and presentational elements. Getting a set of presentational elements
into place that covers known use cases can be seen as a first step
toward that goal.

  • species names
  • foreign language terms

The last two can probably reasonably be handled without any semantic
markup: if field output as italic, switch in-field italics to normal.

Yes, this is semantic markup that we do not have in real data. In
practice we only have presentational markup like italic, which could
indicate species names, foreign language terms, or anything else beyond
the scope of CSL.

But this is partly forward-looking; no reason Zotero, Mendeley, et al
can’t allow for tagging text in this way.

But we have to have balance here obviously.

  • titles-within-titles

[this] is tricky. While I hate to admit it, maybe it can be
handled through presentation rules similar to the above? But then we
get into awkwardness like, are the following all equivalent from a
processing perspective: simple quotation marks, “smart” quotes that
use the right- and left-hand quote marks, the LaTeX double '' and ``
characters, not to mention all the international wrinkles?

It’s not only about titles within titles but also about other words set
in quotes in a title. I thought citeproc-js handles these by parsing
quotes (" and ') in rich text. International quotes and LaTeX characters
are out of the scope of the CSL schema; they must be recognized and
converted in a pre-parsing stage.

I’m not sure I like this idea. If we’re going to expect pre-parsing,
then let’s do the input right?

I’m not sure I follow the discussion here, but I can describe what
citeproc-js does with quotation marks. We can call quote marks that
flip-flop “hot”, and those that don’t “cold”. The ASCII single- and
double-quote marks are always hot. In addition, the single- and
double-quote marks specified in the locale of the current CSL style
are also hot. So …

  • a title that contains French-style double-quote marks, when cited in
    an English style that puts quotes around the title will come through
    cold (verbatim), as French-style double-quote marks; and

  • the same title containing French-style double-quote marks, when
    cited in a French style that puts (French-style) quotes around the
    title will come through hot, flip-flopping to French-style
    single-quote marks.
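
The hot/cold distinction above can be sketched roughly as follows. This is not the citeproc-js implementation: the `LOCALE_QUOTES` table is illustrative data (real CSL locale files define the actual quote terms), and `isHot` and `flipOuterToInner` are hypothetical names introduced for this example.

```javascript
// Illustrative locale data only; real CSL locale files define these terms.
const LOCALE_QUOTES = {
  "en-US": { open: "\u201C", close: "\u201D", openInner: "\u2018", closeInner: "\u2019" },
  "fr-FR": { open: "\u00AB", close: "\u00BB", openInner: "\u201C", closeInner: "\u201D" },
};

// A mark is "hot" if it is an ASCII quote or one of the style locale's marks.
function isHot(mark, locale) {
  const q = LOCALE_QUOTES[locale];
  return mark === '"' || mark === "'" || Object.values(q).includes(mark);
}

// Flip the locale's outer marks to its inner marks, as would happen when the
// whole title is itself wrapped in quotes. Marks from other locales stay cold.
// ASCII quotes are left alone here to keep the sketch short.
function flipOuterToInner(title, locale) {
  const q = LOCALE_QUOTES[locale];
  return title.split(q.open).join(q.openInner).split(q.close).join(q.closeInner);
}
```

Under these assumptions, a title containing guillemets passes through unchanged for the en-US locale and has its guillemets replaced by the inner marks for fr-FR, which mirrors the two bullet points above.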

I thought that the elements ‘cite’ and ‘cite+class=part’ correspond
to two levels of quotes in citeproc-js.

Probably, but Frank and I haven’t discussed it.

If this is a semantic element, the comment above would also apply here.

But if we do that, we need to distinguish between two inline titles:
regular (normally italicized in English), and parts (normally in
quotes).

I have never heard of these “normally” rules before; they seem to apply
to English only. We only have italicized parts and parts set in quotes,
whatever they mean.

OK.

My question was about the remaining markup element:

span or span+class=protect in CSL schema
and
span+class=nodecor or span+class=nocase in citeproc-js

There could be one element for case-preserving (which is semantic
markup), but two?

I have no clue what either ‘nodecor’ or ‘nocase’ mean.

There are discrete use cases for these. I wasn’t able to work out a
better way of handling them, but that doesn’t mean there isn’t one.
At the very least, they could probably do with renaming.

The ‘nocase’ element prevents changes to the case of the item only.
“Decor” such as italics, small-caps and boldface will be applied in
the normal way. This is normally what you would use for proper names
and the like.

The ‘nodecor’ element prevents changes both to case and to
“decorations” such as font style changes. It is used for the “v.” in
Anglo-American legal case names, which needs always to be set in roman
type, regardless of whether the content on either side of it is set in
italics. It’s a real issue, unfortunately, since the two leading
styles in North America split on this: the ALWD style italicizes party
names, and Bluebook does not.
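
The difference between the two span classes can be illustrated with a small JavaScript sketch. It is a simplification under stated assumptions, not the citeproc-js code: `renderTitle` is a hypothetical function, the only case change modelled is uppercasing, the only decoration modelled is wrapping in `<i>`, and spans are assumed not to nest.

```javascript
// Hypothetical renderer: apply a case change and italics to a title while
// honouring <span class="nocase"> and <span class="nodecor">.
function renderTitle(text, { uppercase = false, italic = false } = {}) {
  const spanPattern = /<span class="(nocase|nodecor)">(.*?)<\/span>/g;
  const decorate = (s) => (italic && s ? `<i>${s}</i>` : s);
  const recase = (s) => (uppercase ? s.toUpperCase() : s);
  let out = "";
  let last = 0;
  for (const m of text.matchAll(spanPattern)) {
    // Unprotected text before the span gets both the case change and the decoration.
    out += decorate(recase(text.slice(last, m.index)));
    // nocase: keep case, still decorate. nodecor: keep case and skip decoration.
    out += m[1] === "nocase" ? decorate(m[2]) : m[2];
    last = m.index + m[0].length;
  }
  return out + decorate(recase(text.slice(last)));
}
```

In this sketch, rendering `Roe <span class="nodecor">v.</span> Wade` with italics on leaves the "v." outside the italic runs, which is the legal-citation case described above, while a nocase span would be italicized along with the rest.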

I struggled with various possibilities, such as registering the
parties as separate “creators”, but I concluded that splitting out the
parties would be excessively complex, and wouldn’t do much to improve
the quality of the metadata – the ordering of the party names (which
one comes first) isn’t firmly related to their roles (plaintiff,
defendant) in the original hearings, so it would be purely
presentational anyway. Plus, there are cases with no party names to
deal with (e.g. “In re twenty containers of imported molasses”), which
can’t be detected without doing string parsing guesswork when the
metadata is captured. So I settled on this as a way of getting the
presentational details right in the title field when necessary.

Frank

Hi Frank,

Thanks for the clarification. I think that the elements “i”, “b”, “sup”,
“sub”, and also “sc” (although not HTML) are quite clear. There is
only one property not yet mentioned: “i”, “b”, and “sc” are boolean
flags, meaning something is either bold or not, while you can have
different levels of subscript and superscript. You wrote:

Why keep ‘abbr’? It is semantic markup, not supported by citeproc-js
anyway.

I see no reason why citeproc-js can’t support it. The main issue is
whether there’s value in defining acronyms in markup (including in
output, BTW).

The rich text processing in citeproc-js is purely presentational at
the moment. Semantic markup can certainly be supported, but it’s an
additional layer that would need some kind of explicit configuration
mechanism in CSL itself, providing per-style mappings between semantic
and presentational elements. Getting a set of presentational elements
into place that covers known use cases can be seen as a first step
toward that goal.

I am still against “semantic” markup because it blurs so many aspects,
is rarely found in existing data, means something different to
everyone, and in the end the only thing you can agree on is how it
is rendered, which makes it presentational markup. We should not mix
presentational markup, like bold and case-preserving, with semantic data
fields like title, authors, and date.

It’s not only about titles within titles but also about other words set
in quotes in a title. I thought citeproc-js handles these by parsing
quotes (" and ') in rich text. International quotes and LaTeX characters
are out of the scope of the CSL schema; they must be recognized and
converted in a pre-parsing stage.

I’m not sure I like this idea. If we’re going to expect pre-parsing,
then let’s do the input right?

I’m not sure I follow the discussion here, but I can describe what
citeproc-js does with quotation marks. We can call quote marks that
flip-flop “hot”, and those that don’t “cold”. The ASCII single- and
double-quote marks are always hot. In addition, the single- and
double-quote marks specified in the locale of the current CSL style
are also hot.

In other words: “hot” quote marks are treated as markup, and “cold”
quote marks are literal text. I thought that the elements “cite” and
“cite” with class=part in the CSL schema serve the same purpose as “hot”
quote marks, and that there are only “cold” quote marks if you use these
elements.

Another open question about quotation marks: Is there a difference
between the titles (enclosed in curly brackets to avoid quotation
marks) {A “fine” man} and {A ‘fine’ man}? And between {Review of “A
‘fine’ man”} and {Review of ‘A “fine” man’}? I think that we just have
two possible levels of quotation for “hot” quote marks, but it does not
matter which one you use. However, with locale quotation marks, can you
nest even more levels?

I thought that the elements ‘cite’ and ‘cite+class=part’
correspond to two levels of quotes in citeproc-js.

Probably, but Frank and I haven’t discussed it.

If this is a semantic element, the comment above would also apply
here.

I still do not understand their meaning. Either they result in specific
presentational treatment or in further data analysis, which is beyond
the scope of CSL. Either way, we have problems if one processor knows
these tags and other processors keep them literally.

My question was about the remaining markup element:

span or span+class=protect in CSL schema
and
span+class=nodecor or span+class=nocase in citeproc-js

There could be one element for case-preserving (which is semantic
markup), but two?

I have no clue what either ‘nodecor’ or ‘nocase’ mean.

There are discrete use cases for these. I wasn’t able to work out a
better way of handling them, but that doesn’t mean there isn’t one.
At the very least, they could probably do with renaming.

The ‘nocase’ element prevents changes to the case of the item only.
“Decor” such as italics, small-caps and boldface will be applied in
the normal way. This is normally what you would use for proper names
and the like.

The ‘nodecor’ element prevents changes both to case and to
“decorations” such as font style changes. It is used for the “v.” in
Anglo-American legal case names, which needs always to be set in roman
type, regardless of whether the content on either side of it is set in
italics. It’s a real issue, unfortunately, since the two leading
styles in North America split on this: the ALWD style italicizes party
names, and Bluebook does not.

I struggled with various possibilities, such as registering the
parties as separate “creators”, but I concluded that splitting out the
parties would be excessively complex, and wouldn’t do much to improve
the quality of the metadata – the ordering of the party names (which
one comes first) isn’t firmly related to their roles (plaintiff,
defendant) in the original hearings, so it would be purely
presentational anyway. Plus, there are cases with no party names to
deal with (e.g. “In re twenty containers of imported molasses”), which
can’t be detected without doing string parsing guesswork when the
metadata is captured. So I settled on this as a way of getting the
presentational details right in the title field when necessary.

Ok, so the CSL schema should be changed to allow span elements with one
of the two class attributes “nodecor” or “nocase” instead of an optional
class=“protect” attribute. By the way: if I understand it right, the
impact of “nodecor” is a superset of the impact of “nocase”, so it makes
no sense to nest them (you can do it, but any CSL processor would just
apply the innermost element only).

Jakob

At the risk of some confusion, I’m going to trim some of this …

I’m not sure I like this idea. If we’re going to expect pre-parsing,
then let’s do the input right?

I’m not sure I follow the discussion here, but I can describe what
citeproc-js does with quotation marks. We can call quote marks that
flip-flop “hot”, and those that don’t “cold”. The ASCII single- and
double-quote marks are always hot. In addition, the single- and
double-quote marks specified in the locale of the current CSL style
are also hot. So …

  • a title that contains French-style double-quote marks, when cited in
    an English style that puts quotes around the title will come through
    cold (verbatim), as French-style double-quote marks; and

  • the same title containing French-style double-quote marks, when
    cited in a French style that puts (French-style) quotes around the
    title will come through hot, flip-flopping to French-style
    single-quote marks.

But how does this work? And what happens with Arabic (to choose an
extreme, but practical, example; one I know nothing about)?

I have no clue what either ‘nodecor’ or ‘nocase’ mean.

There are discrete use cases for these. I wasn’t able to work out a
better way of handling them, but that doesn’t mean there isn’t one.
At the very least, they could probably do with renaming.

Yes, I believe the “nodecor” name in particular is misleading (I
believe decoration in the HTML/CSS world is really about details like
underlining).

The ‘nocase’ element prevents changes to the case of the item only.
“Decor” such as italics, small-caps and boldface will be applied in
the normal way. This is normally what you would use for proper names
and the like.

The ‘nodecor’ element prevents changes both to case and to
“decorations” such as font style changes. It is used for the “v.” in
Anglo-American legal case names, which needs always to be set in roman
type, regardless of whether the content on either side of it is set in
italics. It’s a real issue, unfortunately, since the two leading
styles in North America split on this: the ALWD style italicizes party
names, and Bluebook does not.

Does this rule apply across all possible legal styles? Just seems to
me this is really a style issue, and so shouldn’t belong in the data.
But maybe I’m misunderstanding.

I struggled with various possibilities, such as registering the
parties as separate “creators”, but I concluded that splitting out the
parties would be excessively complex, and wouldn’t do much to improve
the quality of the metadata – the ordering of the party names (which
one comes first) isn’t firmly related to their roles (plaintiff,
defendant) in the original hearings, so it would be purely
presentational anyway. Plus, there are cases with no party names to
deal with (e.g. “In re twenty containers of imported molasses”), which
can’t be detected without doing string parsing guesswork when the
metadata is captured. So I settled on this as a way of getting the
presentational details right in the title field when necessary.

Frank

Bruce