Citeproc text-case tests

Dear all,

I have a few questions about the expected behaviour of cite processors as regards text-case formatting. Many of the unit test inputs come with ‘nocase’ span elements, but I have seen no mention anywhere in the specification that item fields are expected to be HTML fragments; is it a general rule that items may contain HTML markup? and are there any special classes other than ‘nocase’?

More importantly, though, what is the exact meaning of the ‘nocase’ class? I was assuming it directs the processor to ignore the contents of the span when applying a given text-case format; however, this is not what the unit tests seem to imply. Consider the following examples:

textcase_Lowercase.txt:
input: "This is a pen that is a <span class="nocase">Smith</span> pencil"
expected result (using ‘lowercase’): “this is a pen that is a smith pencil”

So in this case, the processor strips the span tag from the input and applies the formatting rules regardless.

textcase_TitleCapitalization.txt
input: "This IS a pen that is a <span class="nocase">smith</span> pencil"
expected result (using title-case): “This IS a Pen That Is a smith Pencil”

Now the processor seems to strip away the span tag but does not apply the format to its contents.

Furthermore, would it not be sensible to turn ‘IS’ into ‘Is’ when applying title-case?

I would greatly appreciate if anyone could help clarify these issues or point me to the document where these formatting rules are specified in more detail.

Thanks!

Sylvester

Sylvester Keil <@Sylvester_Keil> writes:

Dear all,

I have a few questions about the expected behaviour of cite processors
as regards text-case formatting. Many of the unit test inputs come
with ‘nocase’ span elements, but I have seen no mention anywhere in
the specification that item fields are expected to be HTML fragments;
is it a general rule that items may contain HTML markup? and are there
any special classes other than ‘nocase’?

There is no such rule in the specification. The only reference to
rich text markup is where it prescribes removing it from the sort key.
citeproc-js (later followed by citeproc-hs) supports a set of tags for
formatting some of a reference’s variables. Have a look at the citeproc-js
source (src/util_flipflop.js).

More importantly, though, what is the exact meaning of the 'nocase’
class? I was assuming it directs the processor to ignore the contents
of the span when applying a given text-case format; however, this is
not what the unit tests seem to imply. Consider the following
examples:

textcase_Lowercase.txt:
input: "This is a pen that is a <span class="nocase">Smith</span>
pencil"
expected result (using ‘lowercase’): “this is a pen that is a smith
pencil”

So in this case, the processor strips the span tag from the input and
applies the formatting rules regardless.

textcase_TitleCapitalization.txt
input: "This IS a pen that is a <span class="nocase">smith</span>
pencil"
expected result (using title-case): “This IS a Pen That Is a smith
Pencil”

Now the processor seems to strip away the span tag but does not
apply the format to its contents.

Furthermore, would it not be sensible to turn ‘IS’ into ‘Is’ when
applying title-case?

I would greatly appreciate if anyone could help clarify these issues
or point me to the document where these formatting rules are specified
in more detail.

Frank was the first to implement rich text markup, so I followed his
code and his tests (I do not remember raising this issue): when the
textcase is set to lowercase or uppercase, nocase will have no effect.
When textcase is set to something else, nocase will prevent the
processor from modifying the case of the span’s contents.

re: textcase_TitleCapitalization.txt. This is a different case: smith
will not be touched, and neither will be “IS”, since title
capitalization only changes the first character of a word. So, it
would change “is” into “Is”, or “iS” into “IS”.

Keep in mind that this is just my understanding of the way textcase
works.

andrea rossato
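
For anyone implementing this, the behaviour Andrea describes can be sketched roughly as follows. This is only an illustration in Python, not code taken from citeproc-js or citeproc-hs: the function names and the stop-word list are assumptions made for the example, and title-casing is reduced to "upper-case the first character of each word and leave the rest alone".

```python
import re

# The nocase markup as it appears in the test inputs.
NOCASE = re.compile(r'<span class="nocase">(.*?)</span>')

# Hypothetical stop-word list; CSL does not spell one out in detail.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "or", "to"}


def title_case_segment(segment, at_start):
    """Upper-case the first character of each word; leave the rest as is."""
    first = at_start

    def fix(match):
        nonlocal first
        word = match.group(0)
        is_first, first = first, False
        if not is_first and word in STOP_WORDS:
            return word
        return word[:1].upper() + word[1:]

    return re.sub(r"\S+", fix, segment)


def apply_text_case(text, text_case):
    if text_case in ("lowercase", "uppercase"):
        # nocase has no effect here: drop the tags and convert everything.
        plain = NOCASE.sub(r"\1", text)
        return plain.lower() if text_case == "lowercase" else plain.upper()
    if text_case == "title":
        # With one capture group, re.split puts text outside the spans at
        # even indices and the protected span contents at odd indices.
        parts = NOCASE.split(text)
        return "".join(
            part if i % 2 else title_case_segment(part, at_start=(i == 0))
            for i, part in enumerate(parts)
        )
    raise ValueError("unsupported text-case: " + text_case)


print(apply_text_case(
    'This IS a pen that is a <span class="nocase">smith</span> pencil', "title"))
print(apply_text_case(
    'This is a pen that is a <span class="nocase">Smith</span> pencil', "lowercase"))
```

Under these assumptions the two calls reproduce the expected results of the two fixtures quoted above, including "IS" staying as it is, because only the first character of a word is ever changed.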

Andrea,

thanks a lot for this! I was stumped by the distinction between lower- and uppercase and the other text-cases (incidentally, what is the rationale behind this?). Also, util_flipflop.js helps quite a bit, thanks!

Now, if I understand this correctly, it is the processor’s prerogative to handle formatting a bit differently than citeproc-js while still being compliant with the CSL specs? In this case, I think I may adapt some of the expected test results for my own testing. In particular, it would seem useful to add a ‘format’ directive to each test: this way, when using a processor that supports multiple output formats it would be possible to define the format of the expected result.


There are a few examples of things that ultimately should be in the
(or a) spec that are effectively now experimental. This is one of them
(others include multiple sectional bibliographies, multilingual
behavior, and an API).

Oh, and to your suggestion …

Now, if I understand this correctly, it is the processor’s prerogative to handle formatting a bit differently than citeproc-js while still being compliant with the CSL specs? In this case, I think I may adapt some of the expected test results for my own testing. In particular, it would seem useful to add a ‘format’ directive to each test: this way, when using a processor that supports multiple output formats it would be possible to define the format of the expected result.

You mean for RTF, LaTeX, etc.?

This is a tricky issue I’ve raised before: CSL is intended to be
format-agnostic along the lines of pandoc, where you’re mapping (as an
output driver/filter) certain logical structures to different
output formats. So on the rich text markup, we really want to test
whether the basic logic is right. But that’s hard to do; seems to me
you either need to effectively define an object model (a la pandoc’s)
or rely on more concrete syntax (what’s in the tests now) and write
code to map your implementation particulars to that.

To go back to your suggestion, exactly how would that change the test
suite? Would there be multiple output blocks?

Bruce

Oh, and to your suggestion …

Now, if I understand this correctly, it is the processor’s prerogative to handle formatting a bit differently than citeproc-js while still being compliant with the CSL specs? In this case, I think I may adapt some of the expected test results for my own testing. In particular, it would seem useful to add a ‘format’ directive to each test: this way, when using a processor that supports multiple output formats it would be possible to define the format of the expected result.

You mean for RTF, LaTeX, etc.?

Yes, indeed. Personally, I am mostly interested in producing plain text, HTML, and LaTeX, but as far as the processor proper is concerned I believe it is best for it to remain, as far as possible, format agnostic, too (by delegating formatting to a plugin). Of course that would complicate testing somewhat: as you say below, you could either test against multiple results or against a separate meta-model. Even if you were to pick a given format to test against, it would still be very complicated (for example, one might argue whether to use dedicated formatting tags or spans with CSS styles proper when using HTML, etc.).

So, to keep things simple, what I was thinking of is to add a single ‘format’ tag to each test case to make explicit how the expected result is formatted. That way, when running the tests on a given processor it would be straightforward to either switch the processor into a mode that produces a result in the matching format, or, if the given processor does not support the format, to ignore or otherwise automatically patch the result to make meaningful comparisons possible.

Would that make sense?
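
To make the proposal concrete, here is a minimal sketch of how a test runner might use such a directive. Everything in it is hypothetical: the test suite currently has no ‘format’ field, and the dictionary keys, the `render` callback and the tag-stripping fallback are assumptions chosen only to illustrate the idea.

```python
import re


def strip_html(text):
    """Crude patch step: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)


def check(test, render, supported):
    """Return 'pass', 'fail' or 'skip' for one test case.

    test      -- dict with hypothetical keys 'format', 'input', 'result'
    render    -- callable (input, format) -> str supplied by the processor
    supported -- set of format names the processor can produce
    """
    fmt = test.get("format", "html")
    if fmt in supported:
        ok = render(test["input"], fmt) == test["result"]
        return "pass" if ok else "fail"
    if fmt == "html" and "text" in supported:
        # Patch the expected result so a plain-text processor can be compared.
        ok = render(test["input"], "text") == strip_html(test["result"])
        return "pass" if ok else "fail"
    return "skip"


def plain_only(value, fmt):
    # A toy processor that only produces plain text.
    return value


print(check({"format": "html", "input": "foo", "result": "<i>foo</i>"},
            plain_only, {"text"}))
print(check({"format": "rtf", "input": "foo", "result": "{\\i foo}"},
            plain_only, {"text"}))
```

The point of the sketch is only that an explicit format name gives the runner enough information to choose between a direct comparison, a patched comparison, and skipping the test.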


You mean for RTF, LaTeX, etc.?

Yes, indeed. Personally, I am mostly interested in producing plain text, HTML, and LaTeX, but as far as the processor proper is concerned I believe it is best for it to remain, as far as possible, format agnostic, too (by delegating formatting to a plugin). Of course that would complicate testing somewhat: as you say below, you could either test against multiple results or against a separate meta-model. Even if you were to pick a given format to test against, it would still be very complicated (for example, one might argue whether to use dedicated formatting tags or spans with CSS styles proper when using HTML, etc.).

+1 for spans with CSS styles, this is what I use in CiteProc.php and
in my opinion it is easier to deal with than multiple HTML tags
wrapped around some text, because you only have to deal with
one closing tag.

And while we are on the subject of tests…

I wonder if someone could explain the rationale for having formatting
directives in the input data…

"static-ordering": false
"comma_suffix": true

…shouldn’t these be in the CSL spec itself?

Also, I’ve run into CSL tags that don’t seem to be in the specification…

…am I missing something?

You mean for RTF, LaTeX, etc.?

Yes, indeed. Personally, I am mostly interested in producing plain text, HTML, and LaTeX, but as far as the processor proper is concerned I believe it is best for it to remain, as far as possible, format agnostic, too (by delegating formatting to a plugin). Of course that would complicate testing somewhat: as you say below, you could either test against multiple results or against a separate meta-model. Even if you were to pick a given format to test against, it would still be very complicated (for example, one might argue whether to use dedicated formatting tags or spans with CSS styles proper when using HTML, etc.).

+1 for spans with CSS styles, this is what I use in CiteProc.php and
in my opinion it is easier to deal with than multiple HTML tags
wrapped around some text, because you only have to deal with
one closing tag.

So you guys are suggesting instead:

foo

… or:

foo

…?

In any case, this seems reasonable (particularly if we can move
towards adding at least some RDFa markup so the output isn’t just dumb
text).

And while we are on the subject of tests…

I wonder if someone could explain the rationale for having formatting
directives in the input data…

"static-ordering": false
"comma_suffix": true

…shouldn’t these be in the CSL spec itself?

Yeah, I find those a bit opaque.

Ideally, I think we should define a CSL JSON input spec, and reference
that in the CSL main spec.

Also, I’ve run into CSL tags that don’t seem to be in the specification…

…am I missing something?

A guess: I think that was Frank getting ahead of things :-)

https://bitbucket.org/bdarcus/csl-schema/issue/16/institution-names

Bruce

BTW, Frank …

A guess: I think that was Frank getting ahead of things :-)

https://bitbucket.org/bdarcus/csl-schema/issue/16/institution-names

You might look to move these files somewhere else (bitbucket or github
wiki?) given the prominent warning from Google at the top:

http://groups.google.com/group/zotero-legal/web/proposal-institution-names

Bruce

Oh, and to your suggestion …

Now, if I understand this correctly, it is the processor’s prerogative to handle formatting a bit differently than citeproc-js while still being compliant with the CSL specs? In this case, I think I may adapt some of the expected test results for my own testing. In particular, it would seem useful to add a ‘format’ directive to each test: this way, when using a processor that supports multiple output formats it would be possible to define the format of the expected result.

You mean for RTF, LaTeX, etc.?

Yes, indeed. Personally, I am mostly interested in producing plain text, HTML, and LaTeX, but as far as the processor proper is concerned I believe it is best for it to remain, as far as possible, format agnostic, too (by delegating formatting to a plugin). Of course that would complicate testing somewhat: as you say below, you could either test against multiple results or against a separate meta-model. Even if you were to pick a given format to test against, it would still be very complicated (for example, one might argue whether to use dedicated formatting tags or spans with CSS styles proper when using HTML, etc.).

So, to keep things simple, what I was thinking of is to add a single ‘format’ tag to each test case to make explicit how the expected result is formatted. That way, when running the tests on a given processor it would be straightforward to either switch the processor into a mode that produces a result in the matching format, or, if the given processor does not support the format, to ignore or otherwise automatically patch the result to make meaningful comparisons possible.

Would that make sense?

That sounds good to me, and it would be easy to add the directive to
the existing tests. Let’s wait to hear from Andrea; if there is
agreement, I’ll be happy to do this, and to adjust the test running
for citeproc-js to comply.

Frank

Dear all,

I have a few questions about the expected behaviour of cite processors as regards text-case formatting. Many of the unit test inputs come with ‘nocase’ span elements, but I have seen no mention anywhere in the specification that item fields are expected to be HTML fragments; is it a general rule that items may contain HTML markup? and are there any special classes other than ‘nocase’?

More importantly, though, what is the exact meaning of the ‘nocase’ class? I was assuming it directs the processor to ignore the contents of the span when applying a given text-case format; however, this is not what the unit tests seem to imply. Consider the following examples:

textcase_Lowercase.txt:
input: "This is a pen that is a <span class="nocase">Smith</span> pencil"
expected result (using ‘lowercase’): “this is a pen that is a smith pencil”

So in this case, the processor strips the span tag from the input and applies the formatting rules regardless.

textcase_TitleCapitalization.txt
input: "This IS a pen that is a <span class="nocase">smith</span> pencil"
expected result (using title-case): “This IS a Pen That Is a smith Pencil”

Now the processor seems to strip away the span tag but does not apply the format to its contents.

The idea behind including it was to cope with proper nouns when
performing transformations in the processor. But it has implications
for data exchange, and hasn’t been discussed or agreed anywhere; as
Bruce says, I just stuck it in there as an experiment.

The processor was built for Zotero in the first instance, and I
figured that the tag would get the necessary attention when the time
came to hook up a WYSIWYG wrapper on the title field.

The relevant tests should probably either be moved to an experimental
zone, or be distinguished by setting a special value on the cs:style
@version attribute on the CSL in the relevant fixtures, so that
processors exercising the test suite have the option of ignoring them.

Furthermore, would it not be sensible to turn ‘IS’ into ‘Is’ when applying title-case?

If I remember correctly, I just adapted the title-case code from the
Zotero 2.0 processor. I think it’s more aggressive about lower-casing
words of less than three letters. As Andrea notes, this behavior is
not specified in detail in CSL.

You mean for RTF, LaTeX, etc.?

Yes, indeed. Personally, I am mostly interested in producing plain text, HTML, and LaTeX, but as far as the processor proper is concerned I believe it is best for it to remain, as far as possible, format agnostic, too (by delegating formatting to a plugin). Of course that would complicate testing somewhat: as you say below, you could either test against multiple results or against a separate meta-model. Even if you were to pick a given format to test against, it would still be very complicated (for example, one might argue whether to use dedicated formatting tags or spans with CSS styles proper when using HTML, etc.).

+1 for spans with CSS styles, this is what I use in CiteProc.php and
in my opinion it is easier to deal with than multiple HTML tags
wrapped around some text, because you only have to deal with
one closing tag.

I can see how that is more elegant in that particular output format,
but you would still need to express output in nested form for LaTeX,
RTF, or plain text markup formats, no? If there are strong feelings on
this, maybe the different approaches to XHTML output could be handled
via an output mode directive, as suggested by Sylvester?

And while we are on the subject of tests…

I wonder if someone could explain the rationale for having formatting
directives in the input data…

"static-ordering": false
"comma_suffix": true

…shouldn’t these be in the CSL spec itself?

The presence of “static-ordering: false” everywhere is a hangover from
early work; it’s a no-op and could be removed. “static-ordering: true”
and “comma_suffix: true” are characteristics specific to particular
names that do need to be conveyed to the processor in the input
somehow. The names assigned to the flags are not necessarily the most
expressive, though (and the irregular use of “-” and “_” is not
pretty).

Also, I’ve run into CSL tags that don’t seem to be in the specification…

…am I missing something?

As Bruce notes, the element is off the CSL map at the
moment. It should not appear in any of the tests in the standard
suite; I’ve moved the fixtures that use it to the local citeproc-js
test bundle, and pushed changes that should have deleted them from the
standard suite.

There may be other things in the standard test that reflect
experimental extensions; if you come across things that are not
reflected in the specification, let me know and I’ll move them out as
well.

Frank Bennett <@Frank_Bennett> writes:

So, to keep things simple, what I was thinking of is to add a single
’format’ tag to each test case to make explicit how the expected
result is formatted. That way, when running the tests on a given
processor it would be straightforward to either switch the processor
into a mode that produces a result in the matching format, or—if
the given processor does not support the format—to ignore or
otherwise automatically patch the result to make meaningful
comparisons possible.

Would that make sense?

That sounds good to me, and it would be easy to add the directive to
the existing tests. Let’s wait to hear from Andrea; if there is
agreement, I’ll be happy to do this, and to adjust the test running
for citeproc-js to comply.

I have no objections.

As for my experience, citeproc-hs is mostly intended for generating a
Pandoc data structure; pandoc will then translate it into all
supported export formats. Still, pandoc’s HTML writer generates output
that is quite different from the HTML used for the expected results in
the test suite. So I had to write a specific export function to produce
the specific HTML used in the test suite.

That was trivial on my side (it may be a difficult task for other
implementations). Nonetheless, what I’m suggesting is that instead of
adopting a common HTML dialect, which would be implied by the presence
of a ‘format’ tag, it is just easier to write an export function that
will produce Frank’s markup, as if there were a specific test-suite
format, so there would be no need for the tag.
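
The export-function approach can be pictured as one format-neutral tree with several small writers. The sketch below is an assumption-laden illustration, not the citeproc-hs, pandoc or citeproc-js data model: the `Styled` type, the style names and the exact tags emitted are invented for the example, and no escaping of special characters is attempted.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Styled:
    style: str                 # "italic" or "bold" in this sketch
    children: List["Node"]


Node = Union[str, Styled]

HTML_TAGS = {"italic": "i", "bold": "b"}
CSS_RULES = {"italic": "font-style:italic", "bold": "font-weight:bold"}
LATEX_CMDS = {"italic": "\\textit", "bold": "\\textbf"}


def render(node: Node, wrap: Callable[[str, str], str]) -> str:
    """Walk the tree and let `wrap` decide how a style is written out."""
    if isinstance(node, str):
        return node
    inner = "".join(render(child, wrap) for child in node.children)
    return wrap(node.style, inner)


def html_tags(style, inner):
    tag = HTML_TAGS[style]
    return "<{0}>{1}</{0}>".format(tag, inner)


def html_spans(style, inner):
    return '<span style="{0}">{1}</span>'.format(CSS_RULES[style], inner)


def latex(style, inner):
    return "{0}{{{1}}}".format(LATEX_CMDS[style], inner)


def plain(style, inner):
    return inner


tree = Styled("bold", [Styled("italic", ["some text"])])
for writer in (html_tags, html_spans, latex, plain):
    print(render(tree, writer))
```

Seen this way, matching the markup used in the test suite is just one more writer, which is why a dedicated export function can stand in for a ‘format’ tag.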

I agree with Andrea that it is not a big deal to write simple filters, especially because it is implicitly clear to me (now) that all tests use an HTML or plain text format. Needless to say, it would not hurt to make this fact explicit, although that would not mitigate all problems, because even using the same format there may still exist several possible renditions (e.g., nested formatting tags vs. spans with CSS styles). Personally, I was just confused by the HTML tags in the test results because I had not anticipated them; what I find more pressing, though, is the need for format directives on input data. It is inevitable that such directives are required, because there will always be the odd exception to the rule when formatting titles etc. It is not pretty, but I suppose it is not possible to separate this kind of information from the data itself, so it has to travel with the data, as Frank has done with the ‘nocase’ spans. Ideally this kind of mark-up should be as unobtrusive as possible and easy to strip away.

As an alternative, one could implement a kind of exception mechanism to case-formatting akin to how the abbreviations are handled, i.e. to keep a directory of style changes which may be exported and imported in order to override the default. That way there would be no need for explicit mark-up within item values.
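
A rough sketch of what such an exception directory could look like, assuming a plain word-to-spelling mapping that is stored as JSON. None of this exists in CSL or in any processor mentioned here; the function name, the mapping format and the very simple word pattern are all made up for illustration.

```python
import json
import re

# Simplistic word pattern (letters, digits, apostrophes); an assumption.
WORD = re.compile(r"[\w'’]+")


def title_case_with_exceptions(text, exceptions):
    """Capitalize each word unless the directory overrides its spelling.

    exceptions maps a lower-cased word to the exact form to output.
    """
    def fix(match):
        word = match.group(0)
        override = exceptions.get(word.lower())
        if override is not None:
            return override
        return word[:1].upper() + word[1:]

    return WORD.sub(fix, text)


# The directory can be exported and imported like any other JSON data.
exceptions = {"smith": "smith", "dna": "DNA"}
exported = json.dumps(exceptions, sort_keys=True)
imported = json.loads(exported)

print(title_case_with_exceptions("a smith pencil and dna", imported))
```

The item values stay free of markup in this scheme, at the cost of applying each override to every occurrence of the word rather than to one particular title.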

Do you have any other ideas?

Sylvester


And while we are on the subject of tests…

I wonder if someone could explain the rationale for having formatting
directives in the input data…

"static-ordering": false
"comma_suffix": true

…shouldn’t these be in the CSL spec itself?

The presence of “static-ordering: false” everywhere is a hangover from
early work; it’s a no-op and could be removed. “static-ordering: true”
and “comma_suffix: true” are characteristics specific to particular
names that do need to be conveyed to the processor in the input
somehow. The names assigned to the flags are not necessarily the most
expressive, though (and the irregular use of “-” and “_” is not
pretty).

As I mentioned, I think it might be valuable to standardize, and
document, a simple JSON representation of the CSL model (which I
started to do on some wiki somewhere, but forget where!). As I recall,
the hairy details involve names (this one) and dates, and everything
else is pretty straightforward.

Am I correct that “static-ordering” is used to denote, for example,
Asian names that sort exactly how they display (family given), and
that therefore a “false” (default) value relates to common Western
names?

And what about “comma_suffix”?

Bruce

Am I correct that “static-ordering” is used to denote, for example,
Asian names that sort exactly how they display (family given), and
that therefore a “false” (default) value relates to common Western
names?

Yes, that’s how I understood it, too.

And what about “comma_suffix”?

The option ‘comma-suffix’ can be set on a name, indicating (when true) that the suffix in question is separated by a comma from the rest of the name. This helps differentiate between names such as:

Edward III and Plinius, Sr.

and works quite well, I think, although one might consider distinguishing between two types of suffixes; this would more closely resemble the way particles are treated.
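
As a small illustration of how the two flags discussed above might affect display, here is a sketch that formats a single name in non-inverted order. The flag spellings follow the test inputs quoted in this thread; the function itself and its handling of missing parts are assumptions, not the behaviour of any particular processor.

```python
def format_name(name):
    """Render one name dict for display (non-inverted form only)."""
    family = name.get("family", "")
    given = name.get("given", "")
    if name.get("static-ordering", False):
        # Names that display exactly as they sort: family name first.
        base = "{0} {1}".format(family, given).strip()
    else:
        base = "{0} {1}".format(given, family).strip()
    suffix = name.get("suffix")
    if not suffix:
        return base
    separator = ", " if name.get("comma_suffix", False) else " "
    return base + separator + suffix


print(format_name({"given": "Edward", "suffix": "III"}))
print(format_name({"given": "Plinius", "suffix": "Sr.", "comma_suffix": True}))
print(format_name({"family": "Mao", "given": "Zedong", "static-ordering": True}))
```

Because both flags describe a particular name rather than a style, they have to arrive with the input data, which is the point Frank makes above.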

Sylvester
