Updates for citeproc-hs

Hi,

I’ve just pushed a few big patches for the citeproc-hs effort. There
is now a CSL file parser, which means we can start looking at the
actual evaluation of real styles.

While the parser is complete - it should generate a complete
representation of the style (there may be quite a few bugs,
though…:wink:) - the formatting is obviously far from complete.

I’ve also overhauled the test script, which can now be compiled into a
binary that supports a few options: you can test a csl file, alone or
together with a locale file (if the csl file doesn’t contain a
’terms’ element, you will not see terms and labels expanded), and you
can dump the internal style representation.

To give it a try:
cd test
ghc --make -i…/src test.hs

and then run:
./test -h

A few examples: to process harvard.xml with locales-it-IT.xml run:
./test -l locales-it-IT.xml harvard.xml

if you want to dump the internal representation:
./test -d -l locales-it-IT.xml harvard.xml

In the next few days I’ll start working on the formatting engine.

Cheers,
Andrea

PS: a few words about the parser: it is not meant to be used for
debugging citation styles, since it won’t report errors (it will
silently fail on bad files). It’s been implemented with the pickler
combinator library of the Haskell XML Toolkit, which is a required
dependency, now. The dump of the internal representation of the style
requires the package “haskell-src” which should come by default in a
standard GHC installation.

To give it a try:
cd test
ghc --make -i…/src test.hs

and then run:
./test -h

Hmm …

ghc --make -i…/src test.hs

…/src/Text/CSL/CSParser.hs:29:17:
Could not find module `Text.XML.HXT.DOM.XmlNode’:

I have a macports version of GHC:

ghc @6.8.2_2+darwin_9_i386

PS: a few words about the parser: it is not meant to be used for
debugging citation styles, since it won’t report errors (it will
silently fail on bad files).

What do you consider a “bad file”? Implementations probably should
check to see if they’re valid against the schema.

Bruce

ghc --make -i…/src test.hs

…/src/Text/CSL/CSParser.hs:29:17:
Could not find module `Text.XML.HXT.DOM.XmlNode’:

I have a macports version of GHC:

ghc @6.8.2_2+darwin_9_i386

It seems you need to upgrade your HXT package. I thought version 7.2
or higher was fine; instead you need hxt-8.0 or higher. I hope this is
not going to be a problem for Mac users.

PS: a few words about the parser: it is not meant to be used for
debugging citation styles, since it won’t report errors (it will
silently fail on bad files).

What do you consider a “bad file”? Implementations probably should
check to see if they’re valid against the schema.

So far I don’t check for validity and I indeed wanted to ask you if
the implementation should check for validity.

But what I meant was that the parser will not report the error: it
will just fail. For instance, if no element is found you’ll
get an “error while reading file”, without further explanation.
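
To illustrate the difference between a parser that just fails and one that says why, here is a minimal sketch in plain Haskell. It does not use HXT or its picklers; the Element type, the two lookup functions and the sample data are hypothetical names made up for this example.

```haskell
-- A toy XML-like tree; NOT the HXT representation.
data Element = Element
  { elName     :: String
  , elChildren :: [Element]
  } deriving (Show, Eq)

-- Silent style: a missing child just yields Nothing, so the caller can
-- only report something generic like "error while reading file".
findChildQuiet :: String -> Element -> Maybe Element
findChildQuiet name parent =
  case filter ((== name) . elName) (elChildren parent) of
    (c:_) -> Just c
    []    -> Nothing

-- Verbose style: the failure carries a message naming what was missing.
findChildVerbose :: String -> Element -> Either String Element
findChildVerbose name parent =
  maybe (Left msg) Right (findChildQuiet name parent)
  where
    msg = "missing <" ++ name ++ "> inside <" ++ elName parent ++ ">"

-- Hypothetical sample: a style with an <info> child but no <citation>.
sampleStyle :: Element
sampleStyle = Element "style" [Element "info" []]
```

With the second function, looking up "citation" in sampleStyle gives a Left value that names the missing element, which is the kind of feedback the current parser cannot give.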

Andrea

ghc --make -i…/src test.hs

…/src/Text/CSL/CSParser.hs:29:17:
Could not find module `Text.XML.HXT.DOM.XmlNode’:

I have a macports version of GHC:

ghc @6.8.2_2+darwin_9_i386

It seems you need to upgrade your HXT package. I thought version 7.2
or higher was fine; instead you need hxt-8.0 or higher. I hope this is
not going to be a problem for Mac users.

It seems to be a problem now. I managed to compile a number of
dependencies for hxt (still a PITA though), but compiling hxt itself
failed at the very end:

make install_local_hxt
ld -r --whole-archive -o HShxt.o libHShxt.a
ld: unknown option: --whole-archive
make[2]: *** [HShxt.o] Error 1
make[1]: *** [all] Error 2
make: *** [all] Error 2

Bummer.

In the short-run for me, any ideas how to fix this?

Longer term, any ideas of when 8.1 gets into GHC?

PS: a few words about the parser: it is not meant to be used for
debugging citation styles, since it won’t report errors (it will
silently fail on bad files).

What do you consider a “bad file”? Implementations probably should
check to see if they’re valid against the schema.

So far I don’t check for validity and I indeed wanted to ask you if
the implementation should check for validity.

I would; hxt supports RNG out-of-box, so it should be easy.

Bruce

It seems to be a problem now. I managed to compile a number of
dependencies for hxt (still a PITA though), but compiling hxt itself
failed at the very end:

make install_local_hxt
ld -r --whole-archive -o HShxt.o libHShxt.a
ld: unknown option: --whole-archive
make[2]: *** [HShxt.o] Error 1
make[1]: *** [all] Error 2
make: *** [all] Error 2

Bummer.

In the short-run for me, any ideas how to fix this?

The failure seems related to the Makefile you are using (not a regular
one, as far as I can see from the hxt package distribution).

Did you try with the cabal way?

    runhaskell Setup configure --user --prefix=$HOME
    runhaskell Setup build
    runhaskell Setup install --user

The Makefile you are using is passing "--whole-archive" to the linker,
but the linker (ld) doesn’t have that option.

Hope this helps.

I need to check, but I could define the needed functions and so
require an older version of hxt.

I would; hxt supports RNG out-of-box, so it should be easy.

Ok, I’ll put it on my TODO list.

Andrea

Did you try with the cabal way?

   runhaskell Setup configure --user --prefix=$HOME
   runhaskell Setup build
   runhaskell Setup install --user

The Makefile you are using is passing "--whole-archive" to the linker,
but the linker (ld) doesn’t have that option.

Hope this helps.

Yes, that works; I got your test to compile now.

I would; hxt supports RNG out-of-box, so it should be easy.

Ok, I’ll put it on my TODO list.

Another thing to keep in mind: CSL files are IDed by URI. We’ve never
settled the UI question, but it seems to me I ought to be able to
specify the URI and have the tool grab (and cache) the file as needed.
Perhaps there’s a way to associate a short label with the URI (it
probably should be specified in the CSL)…

Bruce

Yes, that works; I got your test to compile now.

Cool.

Another thing to keep in mind: CSL files are IDed by URI. We’ve never
settled the UI question, but it seems to me I ought to be able to
specify the URI and have the tool grab (and cache) the file as needed.
Perhaps there’s a way to associate a short label with the URI (it
probably should be specified in the CSL)…

This should be working out of the box: the filename you pass as a
command line option is passed to the hxt function readDocument, which
is going to use it as a URI. Still:

./test -d http://www.zotero.org/styles/chicago-author-date/install

is going to download the style but then cause a parsing error that I
don’t fully understand yet. I can use some options to get the file
parsed anyhow, but that’s a bad hack and I want to solve the real
problem.

About the validation: I would need a RELAX NG schema in XML syntax,
and I’ve not been able to find a working converter so far. Any help?

Andrea

Trang is your tool. IIRC, this should do it:

java -jar trang.jar csl.rnc csl.rng

When we finally settle on 1.0, I’ll package a version up with both.

Bruce

Couple more quick things:

A few examples: to process harvard.xml with locales-it-IT.xml run:
./test -l locales-it-IT.xml harvard.xml

if you want to dump the internal representation:
./test -d -l locales-it-IT.xml harvard.xml

I see this:

./test -d ~/zotero/csl/apa.csl
test: error while reading file /Users/darcusb/zotero/csl/apa.csl

So question: why is this failing?

Second, related to the previous note on URIs and such, I wonder if the
-l option isn’t too concrete? It seems to me the tool ought to be
caring about the locale, not any particular representation of it. So,
e.g., it ought to pick up the default locale without user input, and
allow an override that is the locale in question. So we end up with
something like:

citeproc-hs -l ‘en-gb’ -s ‘apa’ [… not sure how this will get
integrated in processes …]

Bruce

./test -d http://www.zotero.org/styles/chicago-author-date/install

BTW, the above URI is not actually the URI for the style. That would
be the one sans the “/install” bit.

Essentially what they do is serve the file at the main URI as xml, and
at the second with a csl mimetype. The latter makes it easy to do
one-click installation (in, for example, Zotero).

curl -I http://www.zotero.org/styles/chicago-author-date
HTTP/1.1 200 OK
Server: Apache/2.2.3 (CentOS)
X-Powered-By: PHP/5.2.4
Content-Type: text/xml

curl -I http://www.zotero.org/styles/chicago-author-date/install
HTTP/1.1 200 OK
Server: Apache/2.2.3 (CentOS)
X-Powered-By: PHP/5.2.4
Content-Type: text/x-csl

Bruce

On Sun, Jun 29, 2008 at 3:18 PM, Andrea Rossato <@Andrea_Rossato1> wrote:
Date: Sun, 29 Jun 2008 22:05:23 GMT
Date: Sun, 29 Jun 2008 22:05:41 GMT

Yes, I noticed that. Sorry, but the ‘/install’ was there just because
I copied the incorrect link…:wink:

BTW, I found out why

./test -d http://www.zotero.org/styles/chicago-author-date

doesn’t work: the function is getting the HTTP headers and treating
them as part of the xml file. Since the first header is:

HTTP/1.1 200 OK

hence the error:
error: “http://www.zotero.org/styles/chicago-author-date” (line 1, column 1):
unexpected “H”

Andrea
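
One way to picture the problem: the raw response is a status line, some headers, a blank line, and only then the XML body. The sketch below, in plain Haskell, drops everything up to the first blank line. It is only an illustration of the idea, not a fix for HXT, and stripHttpHeaders and dropCR are hypothetical helpers.

```haskell
import Data.List (isPrefixOf)

-- Remove a trailing carriage return, since HTTP lines end in "\r\n".
dropCR :: String -> String
dropCR s
  | not (null s) && last s == '\r' = init s
  | otherwise                      = s

-- If the input starts with an HTTP status line, drop the header block
-- (everything up to and including the first blank line) and return the
-- remaining lines. Input without a status line is returned unchanged.
stripHttpHeaders :: String -> String
stripHttpHeaders raw
  | "HTTP/" `isPrefixOf` raw = unlines (drop 1 (dropWhile (not . null) ls))
  | otherwise                = raw
  where
    ls = map dropCR (lines raw)
```

Feeding the result to the XML parser instead of the raw response would avoid the unexpected “H” at line 1, column 1, assuming the headers really are what reaches the parser.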

Couple more quick things:

./test -d ~/zotero/csl/apa.csl
test: error while reading file /Users/darcusb/zotero/csl/apa.csl

So question: why is this failing?

The one from here:
http://www.zotero.org/styles/apa

is working perfectly fine here. Would you please send it to me?

While pickler combinators are cool (they make it very easy to
serialize and deserialize data in xml formats), they do not
provide feedback in case of a failure, so I need to manually check the
problem. I hope to find a better way to handle errors, though.

Second, related to the previous note on URIs and such, I wonder if the
-l option isn’t too concrete? It seems to me the tool ought to be
caring about the locale, not any particular representation of it. So,
e.g., it ought to pick up the default locale without user input, and
allow an override that is the locale in question. So we end up with
something like:

citeproc-hs -l ‘en-gb’ -s ‘apa’ [… not sure how this will get
integrated in processes …]

keep in mind the test.hs is not going to become the citeproc-hs
binary. It is just a test script I use to test the code.

I’m not even sure if citeproc-hs will provide a binary, or just a
library, with a separated utility providing a user interface.

Anyway, once the core is working I will start thinking about a
user interface, using the other implementation as a guide.

Obviously the locale option ‘-l’ is there just because I haven’t yet
thought about how to access locales. Once the implementation knows
where the locales are located, the appropriate one will be loaded
according to the ‘locale’ and ‘xml:lang’ attributes of the style
element.
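
A minimal sketch of that lookup, assuming the locale files keep the "locales-<tag>.xml" naming used in this thread. The function name localeFile and the fallback to en-US are assumptions of this example, not something the implementation does today.

```haskell
import Data.Maybe (fromMaybe)

-- Build the locale file name from a language tag such as "it-IT".
-- The "locales-<tag>.xml" pattern follows the file names used in this
-- thread; the fallback to "en-US" is an assumption of this sketch.
localeFile :: Maybe String -> FilePath
localeFile lang = "locales-" ++ fromMaybe "en-US" lang ++ ".xml"
```

The Maybe argument stands for the language attribute of the style element, which may be absent.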

This is my roadmap:

  1. get the style formatter working;
  2. write the parser for bibliographic information (MODS, Bibo, etc.);
  3. pandoc integration;
  4. a user interface?

What’s (more or less) done (but still to be refined) is just the csl
parser, the internal data representation of the style, and the shape
of the evaluation routine.

As I previously said, this is going to take some time. Moreover I have
some more issues and questions to ask (I need the time to write them
down, and now it’s bed time…).

Andrea

Couple more quick things:

./test -d ~/zotero/csl/apa.csl
test: error while reading file /Users/darcusb/zotero/csl/apa.csl

So question: why is this failing?

The one from here:
http://www.zotero.org/styles/apa

is working perfectly fine here. Would you please send it to me?

It’s just the SVN version of that same file. The only difference is
the empty updated element. If I change it to be non-null,it works
fine.

Note: updated is required to be there and have a valid value, so it’s
technically invalid in the SVN. They add the date-times by script.

While pickler combinators are cool (they make it very easy to
serialize and deserialize data in xml formats), they do not
provide feedback in case of a failure, so I need to manually check the
problem. I hope to find a better way to handle errors, though.

I presume you’re just expecting a date-time to be in that element?

Bruce

The only reason the parser may fail is because I require a mandatory
value, or element or attribute when the value, element or attribute
is optional. Which means, the parser may fail only if there is a bug,
which means that I did not correctly map the schema.

Non well-formed xml documents do have proper error handling by the xml
parser.

I hope to have the relax ng validator working soon: I had trang
convert the schema, but I seem to have problems in understanding how
to have hxt validate the schema and, with the schema, validate the csl
file.

With a working validation system - with its built-in error reporting -
obscure parser failures should be ruled out, bugs excluded obviously.
Anyway, I hope to clean up the parser code soon. It’s been written
very quickly - probably too quickly!

Andrea

Hi,

I’ve just pushed a few patches with an almost complete new formatting
algorithm.

This is just a sketch and a lot remains to be done: string formatting (text
case, capitalize first, quotes, etc.), name formatting (et-al, but
also stuff like delimiter-precedes-last or name-as-sort-order),
options, sorting, but the basic algorithm should be there.

I’ve also added some reference data to the test suite: the files Liam
and Bruce have been sending to the list in JSON format, which have
been translated into Haskell (see test/RefData.hs).

This is the output of running:

runghc -i./src -i./test test/test.hs -l /path/to/locales-en-US.xml /path/to/apa.csl

(Rossato, Uggino, Gazzoni, 2006)
A. Rossato, A. Uggino, L. Gazzoni. (2006). Diritto e architettura nello spazio digitale (P. Locatin, ed., R. Caso, F. Uggino, trans.) (I ed.). Bologna: Cedam. retrieved 1 16, 2006, from http://some.url.com.
(Some Title, 2007)
(Doniger, 2000)
(Laumann, Gagnon, Michael, 1994)
(Smith, 1998)
(Doe, Smith, 2000)
Some Title. . (2007, 1 1).Journal News, A5. retrieved 11 12, 2007, from http://ex.net/1.
W. Doniger. (2000). Splitting the Difference. Chicago: University of Chicago Press.
E. Laumann, J. Gagnon, R. Michael. (1994). The social organization of sexuality: Sexual practices in the United States. Chicago: University of Chicago Press.
J. Smith. (1998). The origin of altruism. Nature, (393), 639-640.
J. Doe, J. Smith. (2000). Introduction: A Chapter Title. in J. Doe, J. Smith (eds.), Edited Book Title, Series Title… New York: ABC Books.

There are many bugs, I believe, but the style evaluation should be
complete. Sorry if I kept bothering about that “group” element
problem: as you may see if you can read Haskell, the style evaluation
is done with a single recursive traversal of the style with the
reference data and implementing the “group” rule breaks this design.

Thanks for your kind attention, patience and help.

Andrea

This is just a sketch and a lot remains to be done: string formatting (text
case, capitalize first, quotes, etc.), name formatting (et-al, but
also stuff like delimiter-precedes-last or name-as-sort-order),
options, sorting, but the basic algorithm should be there.

Great.

I’ve also added some reference data to the test suite: the files Liam
and Bruce have been sending to the list in JSON format, which have
been translated into Haskell (see test/RefData.hs).

This reminds me: we need data that can test proper sorting and suffix
generation. E.g. three references from the same year and author, but
say, one of them with an additional co-author. So you’d get (Doe,
2000a, 2000b) and (Doe and Smith, 2000).
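
As a rough sketch of what such test data would exercise, the following plain Haskell assigns year suffixes to references that share authors and year. The Ref type and yearSuffixes are hypothetical and heavily simplified; real CSL disambiguation has more steps than this.

```haskell
import qualified Data.Map as Map

-- A cut-down reference: author short form and year only (hypothetical type).
data Ref = Ref { refAuthors :: String, refYear :: Int }
  deriving (Show, Eq)

-- Label each reference as "Authors, Year", appending a, b, c ... when
-- several references share the same authors and year. References are
-- assumed to be already sorted; suffixes follow input order. More than
-- 26 clashes would run past 'z', which this sketch does not handle.
yearSuffixes :: [Ref] -> [String]
yearSuffixes refs = go Map.empty refs
  where
    -- How many references share each (authors, year) key.
    totals = Map.fromListWith (+)
               [((refAuthors r, refYear r), 1 :: Int) | r <- refs]

    go _ [] = []
    go seen (r:rs) =
      let key  = (refAuthors r, refYear r)
          n    = Map.findWithDefault 0 key seen   -- clashes seen so far
          base = refAuthors r ++ ", " ++ show (refYear r)
          label
            | Map.findWithDefault 0 key totals > 1 =
                base ++ [toEnum (fromEnum 'a' + n)]
            | otherwise = base
      in label : go (Map.insert key (n + 1) seen) rs
```

Two references by Doe from 2000 and one by Doe and Smith from 2000 come out as "Doe, 2000a", "Doe, 2000b" and "Doe and Smith, 2000".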

There are many bugs, I believe, but the style evaluation should be
complete. Sorry if I kept bothering about that “group” element
problem: as you may see if you can read Haskell, the style evaluation
is done with a single recursive traversal of the style with the
reference data and implementing the “group” rule breaks this design.

So you mean you first parse the CSL into some internal Haskell
structure (I presume something like nested lists), and then you run
through that?

If yes, what if on parsing the style you had a little function that
flattens the group into its conditional equivalent?

Just a thought …

Bruce

Seems you still have that issue with fetching styles over HTTP? Local
style works …

runghc -i./src -i./test test/test.hs -l ~/xbiblio/csl/locales/locales-en-US.xml http://www.zotero.org/styles/apa

./src/Text/CSL/Eval.hs:42:8:
Warning: Defined but not used: `options’

./src/Text/CSL/Eval.hs:63:23: Warning: Defined but not used: `ns’

./src/Text/CSL/Eval.hs:66:6: Warning: Defined but not used: `deb’
-- (1) readDocument: start processing document http://www.zotero.org/styles/apa
-- (1) getXmlContents: content read and decoded for http://www.zotero.org/styles/apa
-- (1) readDocument: "http://www.zotero.org/styles/apa" (mime type: "text/xml") will be processed

error: "http://www.zotero.org/styles/apa" (line 1, column 1):
unexpected "H"
expecting xml declaration, comment, processing instruction, "<!DOCTYPE" or element

-- (1) readDocument: "http://www.zotero.org/styles/apa" processed
test.hs: error while reading file http://www.zotero.org/styles/apa

Bruce

Seems you still have that issue with fetching styles over HTTP? Local
style works …

runghc -i./src -i./test test/test.hs -l ~/xbiblio/csl/locales/locales-en-US.xml http://www.zotero.org/styles/apa
[…]
error: “http://www.zotero.org/styles/apa” (line 1, column 1):
unexpected “H”

yeah, I didn’t give it very high priority, I must confess - I had more
difficult stuff to deal with…:wink:

I think it’s a bug in the HXT library, because the xml parser is fed
with the HTTP headers and it is correctly complaining, and I’m using a
HXT function to feed it.

I’ll take care of that.

Andrea

There are many bugs, I believe, but the style evaluation should be
complete. Sorry if I kept bothering about that “group” element
problem: as you may see if you can read Haskell, the style evaluation
is done with a single recursive traversal of the style with the
reference data and implementing the “group” rule breaks this design.

So you mean you first parse the CSL into some internal Haskell
structure (I presume something like nested lists), and then you run
through that?

Yes, indeed. The style is parsed into a recursive Haskell data type.

If yes, what if on parsing the style you had a little function that
flattens the group into its conditional equivalent?

Just a thought …

That would be a neat solution indeed, but, in terms of lines of code,
probably more expensive than the solution I found: trying the “group”
element inside an isolated environment:

tryGroup l = get >>= \s -> evalElements (checkGroup l) >>= \r -> put s >> return r

which is just one line of (quite nasty) code, after all.
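
For readers less used to monadic one-liners, here is a self-contained sketch of the same save, evaluate, restore pattern, using the State monad from the transformers package. EvalState and this evalElements are stand-ins invented for the example, and checkGroup is left out; the real code in Text/CSL/Eval.hs will differ.

```haskell
import Control.Monad.Trans.State (State, get, put, modify, runState)

-- Hypothetical evaluation state: the variables consumed so far.
type EvalState = [String]

-- Stand-in for the real evaluation: records each variable as used and
-- returns the rendered text (here simply the names joined by spaces).
evalElements :: [String] -> State EvalState String
evalElements vars = do
  modify (++ vars)
  return (unwords vars)

-- Evaluate a group, then restore the state saved beforehand, so the
-- trial run leaves no trace in the surrounding evaluation.
tryGroup :: [String] -> State EvalState String
tryGroup vars = do
  saved  <- get
  result <- evalElements vars
  put saved
  return result
```

Running runState (tryGroup ["title", "issued"]) ["author"] returns the rendered text together with the untouched state ["author"], whereas evalElements alone would leave the extra variables in the state.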

Andrea

some more updates: yesterday I was able to hack the beginning of a
MODS parser and I was able to read a collection of reference data I
converted from bibtex to MODS using bibutils.

Here you can find a sample:
http://gorgias.mine.nu/tmp/mods_test.xml

So now it is possible to run something like:

runghc -i./src -i./test test/test.hs -l locales-it-IT.xml -c mods_test.xml apsa.csl

all the entries in the collection will be formatted with the style. If
you want to parse a mods file (a file with a single entry and
as the root element) use the ‘-m filepath’ option.

Let me know the issues of the MODS parser if you try it out, please.
This way I’ll be able to debug it more quickly.

I’ve also started working on the formatting: quotes and
capitalization are more or less supported.

I have some issues with name formatting, mostly related to periods, etc.
I don’t think I have a clear idea of what some options and directives
really mean, and I’ll have to ask for some information about that. I
will also have to dig into the other thread but, more or less, here’s
the problem, from the apa.csl style:

...

with this bibliography layout:




If I understand it correctly:

  • “name-as-sort-order” means: Doe, John, instead of John Doe;
  • “sort-separator” is the comma after Doe in “Doe, John”
  • “initialize-with” means "J. " instead of “John”

If we put everything together we end up with:
Doe, J. .(start macro=“issued”)…and the rest

Is that right?
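
To make that reading concrete, here is a small sketch in plain Haskell of how the three options could combine. The Name and NameOpts types and the formatName function are hypothetical, and they encode only the interpretation given in the list above, which may not match the CSL specification.

```haskell
-- Hypothetical, cut-down name record and options.
data Name = Name { given :: String, family :: String }

data NameOpts = NameOpts
  { nameAsSortOrder :: Bool          -- family name first?
  , sortSeparator   :: String        -- e.g. ", "
  , initializeWith  :: Maybe String  -- e.g. Just ". "
  }

-- Replace each given name by its initial followed by the suffix.
initials :: String -> String -> String
initials suffix = concatMap (\w -> take 1 w ++ suffix) . words

formatName :: NameOpts -> Name -> String
formatName opts (Name g f)
  | nameAsSortOrder opts = f ++ sortSeparator opts ++ g'
  | otherwise            = g' ++ sep ++ f
  where
    g'  = maybe g (`initials` g) (initializeWith opts)
    -- After initials the suffix already supplies the spacing.
    sep = maybe " " (const "") (initializeWith opts)
```

With name-as-sort-order on, ", " as sort-separator and ". " as initialize-with, John Doe becomes "Doe, J. ", so a following "." suffix would indeed produce the "Doe, J. ." described above.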

Cheers,
Andrea