Sub-field parsing

Here’s an item that has been on my mind for a while. Inside a title, a
legal case name or the title of another work is often set off with
different font attributes. I would like to handle these with a simple
wiki-style markup, but I’m not sure whether that is within the scope
of CSL. Basically, underscores, asterisks and quotes would be
"flip-flops" that distinguish the enclosed text on the basis of
italicization, boldface and single- or double-quotes respectively.
So:

italic title: My Work Concerning _Other Work_
becomes: My Work Concerning Other Work (inner title in roman)

roman title: My Work Concerning _Other Work_
becomes: My Work Concerning Other Work (inner title in italics)

quoted title: My Work Concerning "Other Work"
becomes: “My Work Concerning ‘Other Work’”

… and so forth.
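
To make the flip-flop idea concrete, here is a rough sketch in Python. It is only an illustration, not anything defined by CSL: the function names `italic_runs` and `quote_title` are made up, and it assumes underscores are the italics delimiter and straight double quotes mark an inner title.

```python
from typing import List, Tuple


def italic_runs(title: str, base_italic: bool) -> List[Tuple[str, bool]]:
    """Split a title on underscores; each underscore flips the italic state.

    Returns (text, is_italic) runs. Hypothetical helper, not part of CSL.
    """
    runs = []
    italic = base_italic
    for i, chunk in enumerate(title.split("_")):
        if i > 0:
            italic = not italic  # every delimiter toggles the current state
        if chunk:
            runs.append((chunk, italic))
    return runs


def quote_title(title: str) -> str:
    """Wrap a title in curly double quotes and demote inner straight
    double quotes to curly single quotes (assumes the quotes are balanced)."""
    out = []
    opening = True
    for ch in title:
        if ch == '"':
            out.append("\u2018" if opening else "\u2019")
            opening = not opening
        else:
            out.append(ch)
    return "\u201c" + "".join(out) + "\u201d"


# An italic title flips the inner title to roman, and vice versa.
print(italic_runs("My Work Concerning _Other Work_", base_italic=True))
print(italic_runs("My Work Concerning _Other Work_", base_italic=False))
print(quote_title('My Work Concerning "Other Work"'))
```

The same input string serves both the italic and the roman case; only the base style passed in by the caller differs, which is the whole point of a flip-flop.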

I’m wondering whether this kind of sub-field mangling is something to
be controlled by the CSL specification, or whether it can be done
independently by the application. If the former, I wonder whether it
might be possible to reach agreement to allow this kind of handling
through a variant (of which the syntax described above would be one),
since there is a demand for the functionality, but systems running CSL
may have differing capabilities for representing sub-field decorations
or structure.

There is a related issue concerning Bluebook case names (subsequent
references use only one party name), but that’s a bit more
complicated, so I’ll hold it for later unless there is interest or
need.

Frank

Here’s an item that has been on my mind for a while.

Yeah, me too.

Inside a title, a
legal case name or the title of another work is often set off with
different font attributes. I would like to handle these with a simple
wiki-style markup, but I’m not sure whether that is within the scope
of CSL.

I’m not sure either.

I’m wondering whether this kind of sub-field mangling is something to
be controlled by the CSL specification, or whether it can be done
independently by the application.

Well, the problem is this also touches on data transportability
issues. So standardization would be valuable.

If the former, I wonder whether it
might be possible to reach agreement to allow this kind of handling
through a variant (of which the syntax described above would be one),
since there is a demand for the functionality, but systems running CSL
may have differing capabilities for representing sub-field decorations
or structure.

There are two issues:

  1. encoding syntax. Your wiki syntax is one option. But there are
    others (embedded XHTML, for example). If wiki, I’d opt for as simple
    as possible, so probably only support italics/underlines and quotes.

  2. does CSL need to know about this? I really don’t know. It certainly
    does if we want to attach some kind of semantics to different kinds of
    inline markup: titles, species names, etc. If all we care about is
    presentation, then it might not matter.

Bruce

My thoughts on this, which I hope make some sense:

Unless there are cases where presentation really depends on semantics (e.g.
a (fictitious?) rule that a “flip-flop” should occur, except when the
italicized text represents a species name), I think support for semantic
markup shouldn’t hold up an implementation of support for inline markup.
Some people are now doing crazy stuff to get their bibliographies right (
http://forums.zotero.org/discussion/3875/rich-text-in-titles/#Item_11) ;),
and I really think support for inline markup should be part of the
core-functionality of any bibliography tool (it’s the one reason I’m not yet
advocating Zotero with my colleagues). Also, if the first assumption holds
(that semantics aren’t required), I don’t see the point in adding anything
to CSL, but I guess it would ease things down the road if CSL processors are
able to digest standardized (e.g. wiki) inline markup. Finally, isn’t there
room to support multiple types of syntax (which would mean that there
isn’t a need for a final decision on this)?

Rintze

I agree this should be dealt with in some way, and that we should do so
for the 0.9 release. Have you added it to the tracker?

Bruce (replying to Rintze Zelle, 2009/3/21)

Just as a followup:

This is a difficult issue, and I know of no existing solution that I
consider satisfactory.

The reason this is difficult is that it touches on a lot of disparate
issues: data modeling and encoding, output formatting, etc. in a
format that tries to be as agnostic as possible about these details.

I see a couple/few options.

  1. We say CSL implementations should understand a handful of stuff
    supported in, say, markdown: italics/emphasis, bold/strong, quotes,
    superscripts and subscripts. We say nothing else except that
    implementers should know what to do in different cases; if you have a
    book title like “Some *Inner Title*” and the title is in italics, then the
    inner title flips to normal. This is what Frank mentioned.

The only useful thing about this approach is that it works for getting
proper presentation, and it’s simple (no changes to the schema).

But for people like me who really value structured and semi-structured
data, this is really bad, because I can’t see it leaving any room for
attaching semantics to that micro-content (except for quotes). You
simply have no way to glean from the output what an italicized string
inside a roman title means.

  2. a variant of the above is to have a handful of standardized
    concepts, and some standard rules that can be overridden with some CSL
    fragments (I don’t know what they might be though). I think this is
    more to my liking, but perhaps difficult to do well.
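
For what it's worth, the second option might look something like the sketch below. Everything in it is an assumption made for illustration: the `Span` type, the concept names ("title", "species"), the rule names ("flip", "italic", "roman") and the `overrides` dictionary, which stands in for whatever a CSL style fragment would eventually supply.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    text: str
    kind: Optional[str] = None  # e.g. "title" or "species"; None for plain text


# Hypothetical defaults: how each concept is formatted unless a style overrides it.
DEFAULT_RULES: Dict[str, str] = {"title": "flip", "species": "italic"}


def render(spans: List[Span], base_italic: bool = False,
           overrides: Optional[Dict[str, str]] = None) -> str:
    """Render spans to simple <i> markup (no HTML escaping in this sketch)."""
    rules = {**DEFAULT_RULES, **(overrides or {})}
    parts = []
    for span in spans:
        rule = rules.get(span.kind, "inherit") if span.kind else "inherit"
        if rule == "flip":
            italic = not base_italic  # opposite of the surrounding style
        elif rule == "italic":
            italic = True
        elif rule == "roman":
            italic = False
        else:
            italic = base_italic      # plain text follows the field's style
        parts.append(f"<i>{span.text}</i>" if italic else span.text)
    return "".join(parts)


spans = [Span("A Study of "), Span("Homo sapiens", "species")]
print(render(spans, base_italic=True))                                   # default rule
print(render(spans, base_italic=True, overrides={"species": "flip"}))    # style override
```

The data keeps the meaning of the inner string, and the decision about how to show it moves into a small rule table that a style could replace.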

Bruce

But: a) is support for semantic markup incompatible with support for e.g.
wiki markup? Can’t you just support an additional set of tags? Supporting
wiki tags has the benefit that some content providers already
mark up their titles with presentational encoding (italics, sub/superscript).
And b), in some cases (simple) semantic markup just isn’t going to cut it.
When citing papers, one should generally copy the metadata of that paper as
closely as possible, even if it goes against style conventions. E.g. if a
cited paper mentions a gene in the title, but it isn’t italicized (as it
should be), you could annotate it with semantic markup as a gene but then
you would need a way to indicate that it shouldn’t be italicized like the
gene names in other (correctly written) titles. Or you could decide not to
add semantic markup to that particular gene, but that would go against the
semantic spirit.

Rintze

But: a) is support for semantic markup incompatible with support for e.g.
wiki markup? Can’t you just support an additional set of tags?

Yes. But it can get difficult if you have too many of them. And this
is still just syntax. It might be better to say what we want to be
able to represent semantically, what implications that can or may have
for formatting, and then go from there.

Supporting wiki tags has the benefit of the fact that some content providers already
mark up their titles in presentations encoding (italics, sub/superscript).
And b), in some cases (simple) semantic markup just isn’t going to cut it.
When citing papers, one should generally copy the metadata of that paper as
closely as possible, even if it goes against style conventions. E.g. if a
cited paper mentions a gene in the title, but it isn’t italicized (as it
should be), you could annotate it with semantic markup as a gene but then
you would need a way to indicate that it shouldn’t be italicized like the
gene names in other (correctly written) titles.

Hmm. I agree with you up to the end. How these things should be handled on
output is a function of the output style, isn’t it?

A related example is this:

British journals have different title-casing rules than U.S. journals.
If I cite an article from a U.K. journal in a U.S. journal, I really
think the title should be presented using the U.S. rules.

Or you could decide not to add semantic markup to that particular gene, but that would go against the
semantic spirit.

And would yield incorrect output if my position above is correct.

Bruce (replying to Rintze Zelle, 2009/3/22)

Yes. But it can get difficult if you have too many of them. And this
is still just syntax. It might be better to say what we want to be
able to represent semantically, what implications that can or may have
for formatting, and then go from there.

I personally can live a perfectly happy life without semantic markup. But
that’s probably not the answer you are looking for.

Hmm. I agree with you up to the end. How these things should be handled on
output is a function of the output style, isn’t it?

Well, in the case of gene and species names, there isn’t any variability
between styles. The formatting is standardized by convention, so here I
don’t see a benefit of semantic markup in obtaining correct presentation in
the output.

A related example is this:

British journals have different title-casing rules than U.S. journals.
If I cite an article from a U.K. journal in a U.S. journal, I really
think the title should be presented using the U.S. rules.

Do you need inline markup for this?

And would yield incorrect output if my position above is correct.

This I don’t follow.

Rintze

I personally can live a perfectly happy life without semantic markup. But
that’s probably not the answer you are looking for.

Well, more practically, my point is that we can’t know what solution
is workable (generally) without assessing both. Presentation-only
might work for you, but I’m not sure (i.e. I really don’t know) if it
works across fields.

Well, in the case of gene and species names, there isn’t any variability
between styles. The formatting is standardized by convention, so here I
don’t see a benefit of semantic markup in obtaining correct presentation in
the output.

So your hypothetical example is only hypothetical?

For sake of argument, what happens if you have a species name in a
book title that is otherwise italicized?

Do you need inline markup for this?

No; was just addressing your distinction between source data encoding,
and output styling.

This I don’t follow.

If one had a species name that was not italicized, you could not set
it to output the correct styling (say, italicized).

Bruce (replying to Rintze Zelle, 2009/3/22)

Well, more practically, my point is that we can’t know what solution
is workable (generally) without assessing both. Presentation-only
might work for you, but I’m not sure (i.e. I really don’t know) if it
works across fields.

But do you know of any cases where semantic markup is really required? I
can’t think of any.

So your hypothetical example is only hypothetical?

For sake of argument, what happens if you have a species name in a
book title that is otherwise italicized?

My gut tells me it should flip-flop (i.e. become non-italicized), but some
examples I found prove me wrong:
e.g. for the following reference, found at
http://genome.cshlp.org/content/13/2/244.full, Saccharomyces is italicized
along with the rest of the title: Strathern J.N., Jones E.W., Broach J.R.
(1982) The molecular biology of the yeast Saccharomyces— Metabolism and gene
expression. (Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY).
The same goes for some papers in the Journal of Biological Chemistry and
Nature I checked. I’m not entirely sure however whether this markup is the
result of a deliberate choice, or of limitations in the software used to
format these entries.

If one had a species name that was not italicized, you could not set
it to output the correct styling (say, italicized).

Maybe I was unable to make myself entirely clear in my example. In my mind
the most important principle here is that when citing, one should always
copy the cited titles verbatim, including errors or non-standard markup. As
such, the markup that should be present in the cited titles (e.g. gene and
species markup), is not a function of the output style of the manuscript but
purely related to the cited sources.

Rintze

But do you know of any cases where semantic markup is really required? I
can’t think of any.

So let’s itemize what we know.

Here’s the stuff I deal with personally in my corner of the social
sciences/humanities:

  1. titles within titles; in some cases they are placed in single
    quotes, in some cases double-quotes, and in other cases italicized (or
    flipped).

  2. quotes within titles

  3. foreign phrases; not sure off the top of my head how these are
    handled; I think it varies

I gather that you also deal with:

  1. species names

  2. chemicals?

What else is there?

Maybe I was unable to make myself entirely clear in my example. In my mind
the most important principle here is that when citing, one should always
copy the cited titles verbatim, including errors or non-standard markup. As
such, the markup that should be present in the cited titles (e.g. gene and
species markup), is not a function of the output style of the manuscript but
purely related to the cited sources.

Yes, I understand you, but just disagree. If an article lists my name
as “B D’Arcus” but I have an output style that demands the full given
name, there’s a problem.

Bruce (replying to Rintze Zelle, 2009/3/22)

What else is there?

Parties to lawsuits that are deemed “common litigants” in the view of
the editor of the journal concerned? When citing the case in
subsequent references, the case should be referred to by one party
only, and the party that is not a common litigant should be used.

There is no generally accepted list of common litigants, and no
generally agreed unambiguous standard for identifying one.

What else is there?

Ship names :)

Yes, I understand you, but just disagree. If an article lists my name
as “B D’Arcus” but I have an output style that demands the full given
name, there’s a problem.

That might indeed be an exception to the rule (though full names are never
used in my field).

Rintze

Ship names :)
Need to italicize part a title - Zotero Forums

Still fishing for a compromise between ease of entry and semantic
markup, here’s another shot at the topic.

Having sub-field semantic markup would be useful, but it may not be
necessary for the semantic details to be known to CSL. As Rintze’s
data is set up, he knows for a given set of entries what the visual
markup represents. At the application level (Zotero, say), that data
could, for a given set of entries, be exported with the visual markup
mapped to appropriate inline semantic tags. For CSL, though, if there
are very few cases that cannot be covered by a wiki flip-flop scheme
(that’s my impression from this thread), it’s not very painful to just
leave the edge cases for manual touch-up.

If fully fledged semantic markup is left as a problem for another
application to deal with, this would encourage people to adopt
consistent, or at least congruent conventions for inline markup, and
that would I think speed rather than hinder the emergence of data
stores containing semantic hints. I think everybody wins.

A thought, anyway.

Frank (replying to Rintze Zelle, 2009/3/24)

Resurrecting this ancient discussion to see if usage might have already become established. Since this occurred, citeproc-js and Zotero have adopted support for a limited set of HTML-like markup for inline formatting: https://www.zotero.org/support/kb/rich_text_bibliography

I am wondering whether it would still be a good idea to formalize this feature in CSL. Either using the syntax adopted by Zotero and/or perhaps Markdown or CommonMark syntax (changing existing data to a simpler syntax would probably be doable for the clients that currently support such markup).

I’d be inclined to go with html (which obviously could be hidden by GUI tools like Zotero or Mendeley) because markdown syntax may not be unique enough (as in, you could see * or _ in regular titles).

The only reservation to putting this in the specs is that we currently don’t say much about the data model at all there. Not sure if that should matter.

My major thought is that we probably want consistent behavior across citeprocs, so listing the tags that must be supported seems like a good idea.

That’s not really a problem. CommonMark goes to great lengths to preserve many punctuation marks, especially mid_word ones and those without matching marks like regular asterisk use.* You would have to have a really weird title to get accidental formatting for any of the markdown features. I would much prefer Markdown as an implementor, and in fact I have an implementation going already. Since Markdown implementations often include smart quotes, this also saves a bunch of quote parsing. And it makes preserving backwards compat with citeproc-js’ micro-HTML (as I have come to call it) very easy: simply recognise those raw HTML tokens as they appear as open tag, then contents, then close tag in the standardised token stream. I will share my implementation shortly.

Of course, these considerations are minuscule compared to the difficulty of inputting raw HTML tokens into your reference library; improving that is even more of a win. I would add finally that it’s a lot less work to support user-friendly formatting in a reference manager with Markdown than by “hiding the tags” with some rich text editor that probably won’t give you the right strict HTML subset to work out of the box.

*Like this.
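
To illustrate the mid_word point, here is a deliberately crude approximation in Python. It is not the CommonMark delimiter-run algorithm, just a single regular expression that refuses to treat an underscore as emphasis when it sits between ASCII letters or digits; the function name `emphasize_underscores` is made up for this sketch, and non-ASCII letters are not handled.

```python
import re

# An underscore opens emphasis only if it is not preceded by a letter/digit
# and is followed by non-space; it closes only if preceded by non-space and
# not followed by a letter/digit. Intraword underscores therefore stay literal.
EMPHASIS = re.compile(r"(?<![A-Za-z0-9])_(?=\S)(.+?)(?<=\S)_(?![A-Za-z0-9])")


def emphasize_underscores(title: str) -> str:
    """Convert _emphasis_ to <i>...</i>, leaving mid_word underscores alone."""
    return EMPHASIS.sub(r"<i>\1</i>", title)


print(emphasize_underscores("My Work Concerning _Other Work_"))
print(emphasize_underscores("A note on mid_word and snake_case_name"))
```

A real implementation would lean on an existing CommonMark parser rather than a regex, but the sketch shows why an ordinary title is unlikely to pick up accidental formatting.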

The Markdown family doesn’t seem to have syntax for superscript, subscript, or small-caps. The first two are heavily used in titles to articles in chemistry and biology.

Obviously as Markdown is HTML-oriented, you can still write superscript/subscript with <sup> and <sub>. So this isn’t a loss. There are some proposed extensions to support some of these (e.g. https://talk.commonmark.org/t/why-there-is-no-syntax-for-subscript-and-supscript/586), but nothing we have to actually wait on. For small caps, you simply get a pair of tokens HtmlTag("<span style=\"font-variant: small-caps;\">") and HtmlTag("</span>"), with Markdown in between, that you can recognise instead of escaping it as &lt;span&gt; etc.
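
As a rough sketch of what "recognise instead of escaping" could mean, the Python below passes a small whitelist of tags through and escapes everything else. The whitelist is an assumption loosely based on the tags mentioned in this thread, not a statement of what citeproc-js actually accepts, and the names `ALLOWED`, `tokenize` and `to_html` are invented here. It also does not check that tags are balanced.

```python
import html
import re
from typing import List, Tuple

# Assumed whitelist: <i>, <b>, <sup>, <sub> and the small-caps span.
ALLOWED = re.compile(
    r'</?(?:i|b|sup|sub)>|<span style="font-variant: ?small-caps;">|</span>'
)


def tokenize(title: str) -> List[Tuple[str, str]]:
    """Split a title into ("tag", s) and ("text", s) tokens.

    Anything that is not a whitelisted tag ends up inside a text token.
    """
    tokens = []
    pos = 0
    for m in ALLOWED.finditer(title):
        if m.start() > pos:
            tokens.append(("text", title[pos:m.start()]))
        tokens.append(("tag", m.group()))
        pos = m.end()
    if pos < len(title):
        tokens.append(("text", title[pos:]))
    return tokens


def to_html(title: str) -> str:
    """Pass whitelisted tags through unchanged and escape all other text."""
    return "".join(
        s if kind == "tag" else html.escape(s, quote=False)
        for kind, s in tokenize(title)
    )


print(tokenize("H<sub>2</sub>O"))
print(to_html("H<sub>2</sub>O & <script>"))
```

In a Markdown-based pipeline the parser would already hand over raw HTML tokens, so only the whitelist check would be needed; the regex here just stands in for that step.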

Would these titles benefit from Pandoc-like math syntax in $ signs? Biology I’m guessing no, but chemists might like a “pass this straight through to LaTeX please” where they get to \usepackage{mhchem} and not get bogged down in subscript soup. That’s not what math is, but same kind of thing at least.