Developing a CSL processor

Hi

I decided to develop a simple CSL processor to convert Zotero JSON strings to APA citations. The code will be used in ZotPad, and once the processor works, I will publish it as a separate project on GitHub.

I am using JSON strings from the Zotero server as data and validating the output against formatted citations from the Zotero server. The citations are formatted using the APA style from https://github.com/citation-style-language/styles/blob/master/apa.csl, following the CSL 1.0.1 specification.

I am using the following bibliography item as my test data:

Cadogan, J. W., & Lee, N. (Forthcoming). Improper Use of Endogenous Formative Variables. Journal of Business Research.

There is one thing that I do not understand. In the APA style (lines 429-434) there is a group

    <group delimiter=". ">
      <text macro="author"/>
      <text macro="issued"/>
      <text macro="title" prefix=" "/>
      <text macro="container"/>
    </group>

The macro “author” has a names element with initialize-with=". " and the macro “issued” contains a group with prefix " (". Now to my understanding, this means that

  • The “author” macro will end with ". " [Cadogan, J. W., & Lee, N.]
  • The “issued” macro will start with " (" [ (Forthcoming)]
  • The macros are delimited with ". "

This results in a bibliographic item that starts with

Cadogan, J. W., & Lee, N..  (Forthcoming).

This is obviously not correct. There should not be a double period followed by a double space, but I do not understand which part of the formatting logic is incorrect.

Mikko

Where are you trying this out? In Zotero?

The data are from the Zotero server using the read API, and I am running it on my Mac (Xcode console, to be exact).

On 20.9.2012, at 23.52, Sebastian Karcher wrote:

where are you trying this out? In Zotero?


xbiblio-devel mailing list
xbiblio-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xbiblio-devel

Is the output with the extra period and space what you get back from
the API call?

Sorry, I still feel like I’m missing the question: what produces the
outcome? Is that your implementation, citeproc.js or citeproc-node? (Well,
I just see I’m not the only one; that makes me feel better.)

There are obviously two periods and two spaces in the logic - one each in
the delimiter and in the initialization - but I’m pretty sure citeproc
deals with that elegantly.

Mikko, the processor you write will have to deal with duplicated punctuation. It seems this is the issue here?

Mikko,

Below I’ve assumed that the output is from your project code. If I
have it backwards, let me know.

You have the logic right. That’s the literal result you will get from
flattening the structure without anything more:

[author ending in "."] + ". "{delimiter} + " ("{prefix} + [issued]
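To make the literal flattening concrete, here is a minimal Python sketch. The rendered macro values and the helper name `flatten_group` are assumptions made up for illustration; they are not taken from any CSL processor.

```python
# Hypothetical rendered macro output for the test item in this thread.
author = "Cadogan, J. W., & Lee, N."   # already ends with a period
issued = " (Forthcoming)"              # carries the " (" prefix from the macro
title = " Improper Use of Endogenous Formative Variables"  # prefix=" "
container = "Journal of Business Research"


def flatten_group(parts, delimiter):
    """Join the non-empty parts with the delimiter, with no cleanup at all."""
    return delimiter.join(part for part in parts if part)


naive = flatten_group([author, issued, title, container], ". ")
print(naive)
```

The result begins with a doubled period followed by a doubled space, which is the symptom described in the original question.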

Double punctuation needs to be culled by the processor. It’s a little
tricky, since formatting (italics etc) might lie between the two
periods, depending on the style. There is also potential interaction
with quote marks, depending on whether or not the style has
punctuation-in-quotes set true or false. For those reasons, the cull
function can’t work on the output string: it needs to analyse the
nested structure before collapsing to identify “adjacent” punctuation.
With content strings, delimiters and affixes in the mix, it’s pretty
hair-raising. The citeproc-js code for this is heavily tested and
seems to work quite well, but I would be hard-pressed to explain
exactly how it works.
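As a rough illustration of why the cull has to happen before the structure is collapsed, here is a small Python sketch. It is not the citeproc-js algorithm: the `Node` class, `fragments`, `cull_periods` and `render_html` are hypothetical names, only periods are handled, and quote marks and punctuation-in-quotes are ignored.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical output node: either a text leaf or a group of children."""
    text: str = ""
    children: List["Node"] = field(default_factory=list)
    prefix: str = ""
    suffix: str = ""
    delimiter: str = ""
    italic: bool = False


Fragment = Tuple[str, bool]  # (text, is_italic)


def fragments(node: Node, inherited_italic: bool = False) -> List[Fragment]:
    """Walk the tree into ordered fragments, keeping formatting separate from text."""
    italic = inherited_italic or node.italic
    body: List[Fragment] = []
    if node.children:
        rendered = [fragments(child, italic) for child in node.children]
        for frags in (r for r in rendered if r):
            if body and node.delimiter:
                body.append((node.delimiter, italic))
            body.extend(frags)
    elif node.text:
        body.append((node.text, italic))
    if not body:
        return []
    # Affixes sit outside the formatting of their own element.
    if node.prefix:
        body.insert(0, (node.prefix, inherited_italic))
    if node.suffix:
        body.append((node.suffix, inherited_italic))
    return body


def cull_periods(frags: List[Fragment]) -> List[Fragment]:
    """Drop leading periods of a fragment when the previous fragment ended in one."""
    out: List[Fragment] = []
    last_char = ""
    for text, italic in frags:
        while text and last_char == "." and text[0] == ".":
            text = text[1:]
        if text:
            out.append((text, italic))
            last_char = text[-1]
    return out


def render_html(frags: List[Fragment]) -> str:
    """Collapse fragments into an HTML string."""
    return "".join(f"<i>{text}</i>" if italic else text for text, italic in frags)


# Hypothetical entry: the italic field content itself ends in a period.
entry = Node(
    delimiter=". ",
    suffix=".",
    children=[
        Node(text="Cadogan, J. W., & Lee, N."),
        Node(text="Journal of Business Research.", italic=True),
    ],
)

print(render_html(fragments(entry)))
print(render_html(cull_periods(fragments(entry))))
```

In the unculled output the second pair of periods is separated by a closing `</i>` tag, so a search for `..` on the final string would miss it. Comparing fragment boundaries before rendering catches both pairs.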

Concerning spaces, there was a long discussion a couple of years back
concerning whether extraneous spaces added by affixes should be
considered style bugs:

http://xbiblio-devel.2463403.n2.nabble.com/how-much-bugged-a-style-may-be-tt5784767.html#none

That thread does not reflect well on me, I’m afraid. The point made by
Andrea (and, I think, Bruce) is perfectly valid: double-space issues
can be eliminated by more careful construction of CSL code, and
should be. It is also true that masking double spaces in the processor
gives a green light to sloppy coding. That said, the amount of work
required to eliminate all potential extra spaces from the CSL
repository would be pretty staggering. At the end of the day, we’re
kind of stuck with this problem.

Double spaces are hard to catch in the processor for the same reason:
you have to work on the nested structure before it is flattened into
an output string. It’s a little simpler because you can assume input
strings will not have leading or trailing spaces; but tracking spaces
across affix and delimiter attributes across multiple nested layers is
still a challenge.
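A minimal Python sketch of the same idea for spaces follows. It assumes the nested structure has already been walked into an ordered list of (text, italic) fragments; the function names and the sample fragments are hypothetical, and content strings are assumed to carry no leading or trailing spaces of their own.

```python
from typing import List, Tuple

Fragment = Tuple[str, bool]  # (text, is_italic)


def collapse_spaces(frags: List[Fragment]) -> List[Fragment]:
    """Strip leading spaces from a fragment when the previous one ended in a space."""
    out: List[Fragment] = []
    prev_ends_with_space = False
    for text, italic in frags:
        if prev_ends_with_space:
            text = text.lstrip(" ")
        if text:
            out.append((text, italic))
            prev_ends_with_space = text.endswith(" ")
    return out


def render_html(frags: List[Fragment]) -> str:
    """Collapse fragments into an HTML string."""
    return "".join(f"<i>{text}</i>" if italic else text for text, italic in frags)


# Hypothetical fragments: spaces contributed by a delimiter, a prefix and a suffix.
sample = [
    ("Improper Use of Endogenous Formative Variables. ", False),
    (" ", False),
    ("Journal of Business Research ", True),
    (" (Forthcoming)", False),
]

print(render_html(collapse_spaces(sample)))
```

The second doubled space straddles the end of the italic span, which is why the check is done on fragments rather than on the rendered string.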

If you are only going to process one style in one output format and a
single locale, you may be able to fix things up by running a regular
expression over the output string. That wouldn’t work as a general
solution, though.
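For the single-style, plain-text case, such a regular-expression pass might look like the following Python sketch. The function name `tidy` is an assumption, and the patterns are deliberately crude: they would also collapse a legitimate ellipsis written as three periods, and they do not see through markup.

```python
import re


def tidy(output: str) -> str:
    """Collapse runs of periods and runs of spaces in a plain-text entry."""
    output = re.sub(r"\.{2,}", ".", output)  # ".." -> "."
    output = re.sub(r" {2,}", " ", output)   # "  " -> " "
    return output


raw = ("Cadogan, J. W., & Lee, N..  (Forthcoming).  Improper Use of "
       "Endogenous Formative Variables. Journal of Business Research.")
print(tidy(raw))
```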

Sorry for the long response. Hope it helps!

Frank

Hi

Thanks for the response.

You are correct: the output is from my project code.

The problem was that my implementation produces incorrect bibliography items even though it follows the CSL specification (or rather a subset of it that is sufficient to produce bibliography items in the APA style). I did not know that strictly following the specification does not result in correct formatting, and that the processor needs to “be smart” about spaces and punctuation. I could not find this in the documentation. But now that I know this, it should not be difficult to fix.

I posted my code to GitHub: mronkko/CSLProcessor

At this point the goal is to format single citations and single bibliography items using the APA style. In the future I may make it more generic.

Mikko

This may be more distraction than you need at this point, but just in case …

There is a set of test fixtures covering space-suppression in the
citeproc-js sources (scroll down to the fixtures prefixed with
“spaces_”):

https://bitbucket.org/fbennett/citeproc-js/src/5cc7cff350ee/tests/fixtures/local

I didn’t put the tests into the main test suite, because the
discussion I linked above was inconclusive about whether it would be
appropriate to recognise space-suppression in the official
specification. The main test suite is here:

https://bitbucket.org/bdarcus/citeproc-test

The future of CSL processor testing probably lies in work by Sylvester
Keil, which is here:

citation-style-language/test-suite (on GitHub)

(The repository above hasn’t been updated in a while, but Sylvester
recently indicated that there will be activity there once he has
reached a milestone in his current work on citeproc-ruby.)

Frank

For those reasons, the cull function can’t work on the output string:
it needs to analyse the nested structure before collapsing to identify “adjacent”
punctuation. With content strings, delimiters and affixes in the mix,
it’s pretty hair-raising.

W3C specifications often include pseudo-algorithms that
implementations should follow.
Perhaps it would make sense to try and do the same in the CSL spec?

Regards,
Rob.

Hi Rob,

While the idea of pseudo-algorithms is attractive, I very much like the idea of fixtures being the specification instead. In the case of bibliographic software, this seems like a really good fit, as it’s easy to discuss the output and compare it to what’s actually in books and articles (or to common sense). The fixtures can be read by people who are not programmers, and this is a big plus as well: you can show them to, and discuss them with, non-technical people who know the field. Discussing the “algorithms” to get to the actual results is not as useful IMO. Or at least it should not come first, and should only be formalized when enough examples of the issue at hand have been produced. The disambiguation process is another one of the hair-raising issues.

My 2 cents :)
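To illustrate what "fixtures as the specification" could look like in practice, here is a small Python sketch. The fixture layout, `toy_render` and `run_fixtures` are hypothetical and are not the format used by the citeproc test suite; the expected string is the test item from the start of this thread.

```python
# Each fixture pairs input data with the exact output a processor must produce.
FIXTURES = [
    {
        "name": "apa_forthcoming_article",
        "item": {
            "authors": "Cadogan, J. W., & Lee, N.",
            "issued": "Forthcoming",
            "title": "Improper Use of Endogenous Formative Variables",
            "container": "Journal of Business Research",
        },
        "expected": (
            "Cadogan, J. W., & Lee, N. (Forthcoming). Improper Use of "
            "Endogenous Formative Variables. Journal of Business Research."
        ),
    },
]


def toy_render(item):
    """Stand-in for a real processor; formats one hard-coded APA-like pattern."""
    return f"{item['authors']} ({item['issued']}). {item['title']}. {item['container']}."


def run_fixtures(render, fixtures):
    """Return (name, expected, actual) for every fixture the renderer gets wrong."""
    failures = []
    for fixture in fixtures:
        actual = render(fixture["item"])
        if actual != fixture["expected"]:
            failures.append((fixture["name"], fixture["expected"], actual))
    return failures


print(run_fixtures(toy_render, FIXTURES))
```

The expected strings can be reviewed by someone who knows the citation style but does not read code, while any implementation can be checked against them mechanically.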

Charles

Same goes for parsing of raw dates or unstructured names. These kinds
of things mostly evolve based on user feedback, so I agree with
Charles that test fixtures should play an important role here. Based
on those, we indeed might be able to extract some standardized rules.

Rintze

Are these mutually exclusive though?

Nope, but I still feel the fixtures and use cases should come first, and should guide the initial client implementation. Some logic may hopefully come out of it, and that helps writing better documentation, and yes, why not, some pseudo algorithms, which can be very useful for new implementations. Edge cases abound, however, and some of the logic is really convoluted. For CSL, I feel like fixtures should be holding the truth.

In the case of HTML or CSS, the specifications are written so that they are as unambiguous as possible, and don’t have to follow rules set by crazy librarians 30 years ago ;)

Charles

Frank recently added some tests to catalog the current citeproc-js
behavior when it comes to punctuation suppression:

https://bitbucket.org/bdarcus/citeproc-test/src/tip/processor-tests/humans/punctuation_FullMontyPlain.txt
https://bitbucket.org/bdarcus/citeproc-test/src/tip/processor-tests/humans/punctuation_FullMontyQuotesIn.txt
https://bitbucket.org/bdarcus/citeproc-test/src/tip/processor-tests/humans/punctuation_FullMontyQuotesOut.txt
https://bitbucket.org/bdarcus/citeproc-test/src/tip/processor-tests/humans/punctuation_FullMontyField.txt

It doesn’t cover suppression of duplicated spaces (as discussed below,
there are already older “space_…” unit tests), and it only covers
punctuation added by prefixes, suffixes and punctuation that is part
of the variable field content (e.g. punctuation added as group
delimiters isn’t tested). I tried to pull these new results together
in a spreadsheet:

With this as a starting point, I hope we can agree on specific rules
for punctuation suppression so that we can include some guidance on
this topic in the CSL specification. These rules will likely have to
be very precise, and take into account the origin of the punctuation
(variable field content, affixes, group delimiters, group affixes,
etc.).

Sebastian, could you remind me whether CMoS has any clear rules on
punctuation suppression?

Rintze

"A period (aside from an abbreviating period; see
6.117, http://www.chicagomanualofstyle.org/16/ch06/ch06_sec117.html)
never accompanies a question mark or an exclamation point. The latter two
marks, being stronger, take precedence over the period. This principle
continues to apply when the question mark or exclamation point is part of
the title of a work, as in the final example"
CMoS 6.118
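The precedence rule quoted above could be sketched in Python as follows. The function name `append_period` is hypothetical, and the sketch covers only the plain-text case: it ignores abbreviating periods, closing quote marks and any formatting around the title.

```python
def append_period(text: str) -> str:
    """Add a closing period unless the text already ends in '?' or '!'."""
    if text.endswith(("?", "!")):
        # The stronger mark takes precedence, so the period is omitted.
        return text
    return text + "."


print(append_period("Improper Use of Endogenous Formative Variables"))
print(append_period("Who Needs Formative Variables?"))
```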

"When a title ending with a question mark or an exclamation mark would
normally be followed by a period, the period is omitted; see"
CMoS 14.105

On Fri, Aug 23, 2013 at 7:01 PM, Rintze Zelle wrote: