CSL processor news

A couple of items. I’ve opened work on a javascript CSL processor.
There is a test suite in place and the code is slowly growing.

@Bruce: I’d like to put the sources in the xbiblio SVN. If you grant
access to my Sourceforge account (bierce), I’ll put the project up.

@Andrea: If you have time, I’d be grateful if you could take a quick
look at the sources when I get them in place. I took a look through
citeproc-hs, and I think I’ve set things up on similar lines (but
there are, of course, more … lines). There isn’t much code yet, but
I’ve blocked out placeholders for the top-level classes with comments
that show how I intend the thing to work when it’s complete.

Frank

Done. I’d put it under a top-level “citeproc-js” directory, for
consistency’s sake.

BTW, license? I’d suggest a liberal BSD, so long as it doesn’t
conflict with Zotero’s (in which case maybe a dual license?).

Bruce

Done. I’d put it under a top-level “citeproc-js” directory, for
consistency’s sake.

BTW, license? I’d suggest a liberal BSD, so long as it doesn’t
conflict with Zotero’s (in which case maybe a dual license?).

Great stuff. The test suite uses a dual license, BSD and academic. I’ll
just copy that into the root.

Here it comes …

Frank

Quick point: you use the word “blob” in various places, including a
file name. Is this really the right word for it? I think of blob as
unstructured data, while what you’re doing here is structured.

Bruce

Quick point: you use the word “blob” in various places, including a
file name. Is this really the right word for it? I think of blob as
unstructured data, while what you’re doing here is structured.

That’s a final unit of output enclosed in an environment. They can be
nested, so both a citation and a comma qualify. Didn’t know what else
to call it, so I used blob. It will play a role similar to
“formattedString” in current Zotero. That’s more descriptive, and
using the same name is probably a good idea, but it takes four times
as much typing. Once it settles down, I’ll do a find and replace.
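
A minimal sketch of the nesting idea, for readers following along. The names makeBlob and renderBlob, and the prefix/suffix fields, are hypothetical and are not taken from the citeproc-js sources; the sketch only assumes that a blob holds either a string or a list of child blobs.

```javascript
// Hypothetical sketch: a blob holds either a plain string or an array of
// child blobs, plus optional prefix/suffix decorations (the "environment").
function makeBlob(content, decorations) {
  const d = decorations || {};
  return { content: content, prefix: d.prefix || "", suffix: d.suffix || "" };
}

// Render a blob by rendering its children first, then wrapping the result.
function renderBlob(blob) {
  const inner = Array.isArray(blob.content)
    ? blob.content.map(renderBlob).join("")
    : blob.content;
  return blob.prefix + inner + blob.suffix;
}

// Both the whole citation and the comma inside it are blobs.
const citation = makeBlob(
  [makeBlob("Doe"), makeBlob(", "), makeBlob("2009")],
  { prefix: "(", suffix: ")" }
);

console.log(renderBlob(citation)); // (Doe, 2009)
```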

Frank

OK.

And why the Dojo dependency?

Bruce

OK.

And why the Dojo dependency?

That’s the development test suite, there’s no dependency in the
software itself. See the attached screenshot.

It’s also explained in the README, come to think of it.

The screenshot was trapped by the list mailer for oversize, this one
might sneak through.

If you take a look in src/output.js, you’ll see that output formats
will be simple to define. I don’t know anything about RTF; if someone
could add a format spec for that, it would be very helpful.

It would also be possible to output in the native format of a word
processor, if that would help the performance of the plugins. It’s
all bright vistas in the early stages, but when this is finished I’m
thinking that it should be very fast, at least in comparison to the
existing module. He said.

Frank

One practical issue that I’d like to see addressed is better
integration with document styling. See, as just one example (though
there are others, and more complex ones):

http://forums.zotero.org/discussion/5656/change-hangingindent

At minimum I think this means attaching a CSS class wherever possible,
including passing on the cs:group/@name value as a class.
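
As a rough illustration of what attaching a class could look like in HTML output: the sketch below is an assumption, not existing citeproc-js or Zotero code. The helper names wrapWithClass and escapeHtml and the "csl-" class prefix are hypothetical.

```javascript
// Escape the characters that are unsafe in HTML text and attribute values.
function escapeHtml(s) {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: wrap rendered text in a span whose classes name the
// CSL element and, when present, the value of the group's name attribute.
function wrapWithClass(text, elementName, groupName) {
  const classes = ["csl-" + elementName];
  if (groupName) {
    classes.push(groupName);
  }
  return '<span class="' + escapeHtml(classes.join(" ")) + '">' +
    escapeHtml(text) + "</span>";
}

console.log(wrapWithClass("Doe & Roe", "group", "authors"));
// <span class="csl-group authors">Doe &amp; Roe</span>
```

With markup along these lines, a document stylesheet could adjust things like the hanging indent without touching the citation style itself.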

Somewhat more ambitious, and a little less important, would be to
somehow pass styling information to the styles and remove it from the
inline encoding.

Bruce

Hi Frank,

great to hear you are working on a new implementation. At the present
time I do not have much time for hacking but I’m keeping an eye on
your code and I’ll start studying it carefully in the next few weeks.

I’m pleased to know you had a look at my code: any suggestions, comments
or whatever are highly appreciated.

Andrea

Hi, Andrea,

In citeproc-js, the choose tag will just be a no-op that passes
through the result of the condition without asking any questions. Is
this also the case with citeproc-hs? If so, there would be an
argument for eliminating the element from CSL. Would save a bit of
typing.

Frank

There are other reasons why it probably makes sense, like the schema
validation design. Having the container means I can ensure that only
cs:if, etc. children are valid. That’s not to say it’s impossible
otherwise, but I’m not sure it’s worth the bother to change things now.

Bruce

I’m easy either way. Just thought I’d mention the possibility.

Frank

I’m looking at the handling of conditions for citeproc-js now, and I
was seriously wrong about this. The choose tag is needed to identify
which condition level an else tag applies to. Oops.
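
To make the point concrete, here is a small sketch under stated assumptions: the style arrives as a flat stream of open/close tokens (as a streaming parser might deliver it), and each choose pushes a frame that records whether a branch at that level has already matched. The token shapes and the evaluate function are hypothetical and do not reflect the actual citeproc-js internals.

```javascript
// Evaluate a flat token stream against an item. Each open choose gets its
// own frame, so an "else" can consult the innermost frame to learn whether
// a sibling branch at the same level already matched.
function evaluate(tokens, item) {
  const frames = [];      // one { matched } record per open choose
  const active = [true];  // whether output is enabled at each nesting depth
  const out = [];
  for (const t of tokens) {
    const on = active[active.length - 1];
    switch (t.tag) {
      case "choose":
        frames.push({ matched: false });
        break;
      case "/choose":
        frames.pop();
        break;
      case "if": {
        const frame = frames[frames.length - 1];
        const hit = on && !frame.matched && t.test(item);
        if (hit) frame.matched = true;
        active.push(hit);
        break;
      }
      case "else": {
        const frame = frames[frames.length - 1];
        const hit = on && !frame.matched;
        if (hit) frame.matched = true;
        active.push(hit);
        break;
      }
      case "/if":
      case "/else":
        active.pop();
        break;
      case "text":
        if (on) out.push(t.value);
        break;
    }
  }
  return out.join("");
}

// A nested condition: the inner else belongs to the inner choose only.
const style = [
  { tag: "choose" },
  { tag: "if", test: (it) => it.type === "book" },
  { tag: "choose" },
  { tag: "if", test: (it) => Boolean(it.editor) },
  { tag: "text", value: "edited book" },
  { tag: "/if" },
  { tag: "else" },
  { tag: "text", value: "book" },
  { tag: "/else" },
  { tag: "/choose" },
  { tag: "/if" },
  { tag: "else" },
  { tag: "text", value: "other" },
  { tag: "/else" },
  { tag: "/choose" }
];

console.log(evaluate(style, { type: "book" }));    // book
console.log(evaluate(style, { type: "article" })); // other
```

Without the per-choose frame, the outer else in this example would have no way to know that the outer if had already matched.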

Frank

Hi Frank,

I’ve been doing some work on a new CSL processor (citeproc-js), and
have made some progress. The next academic term is approaching,
though, and I’ll be battening down the work on citeproc-js over the
next few days. I’ll be pretty much leaving the code alone until
sometime during the summer,

Whose “summer”; the one down south, or up north? So June, or December?

Also, just a couple of more questions …

but I don’t claim ownership of it, and any
work by others on the project will be very welcome as far as I’m
concerned. (In fact, it’s probably better for the long term if I’m
not the primary maintainer. I’m a hobbyist, my skill level is not
that high, and my ability to focus on programming issues varies with
the season.) Before downing tools, I’ll go through the code to update
the comments and bring them into line with the state of the code. The
test suites don’t show much organization, but I’ll probably leave
those alone for the present.

It’s been an exciting ride over the past month. There’s a lot still
to do, but most of the seriously worrisome issues have been cleared.

Can you estimate what percentage is complete?

Here are some of the highlights:

With test-driven development, you write the tests first, then write
the code until they pass.

But given that all tests pass, and yet you’ve said there’s still “a lot to
do”, I take it that’s not exactly the approach you’ve taken.

So would it be fair to say that the next step really ought to be to
sort out the remaining tests?

If yes, do you have some input on what they should be?

Or, if you can find some remaining time, can you imagine starting to
put those in place so that others can enter and figure out how to make
them pass?

BTW, I converted your TODO.pdf to a text file in the repo for editing purposes.

Bruce

He teaches in Japan, so the one north… June, I believe.

Frank, thanks for your contribution, I’m looking forward to seeing you
back here hacking again. It’s been a pleasure getting to know I’m not
the only comparative lawyer wasting her time and jeopardizing an
academic career just for the joy of coding. :wink:

Andrea

Hi Frank,

I’ve been doing some work on a new CSL processor (citeproc-js), and
have made some progress. The next academic term is approaching,
though, and I’ll be battening down the work on citeproc-js over the
next few days. I’ll be pretty much leaving the code alone until
sometime during the summer,

Whose “summer”; the one down south, or up north? So June, or December?

Why, my summer, of course! Things will quieten down here again in July/August.

Also, just a couple of more questions …

but I don’t claim ownership of it, and any
work by others on the project will be very welcome as far as I’m
concerned. (In fact, it’s probably better for the long term if I’m
not the primary maintainer. I’m a hobbyist, my skill level is not
that high, and my ability to focus on programming issues varies with
the season.) Before downing tools, I’ll go through the code to update
the comments and bring them into line with the state of the code. The
test suites don’t show much organization, but I’ll probably leave
those alone for the present.

It’s been an exciting ride over the past month. There’s a lot still
to do, but most of the seriously worrisome issues have been cleared.

Can you estimate what percentage is complete?

I tend to be over-optimistic. But in terms of time, I’d say it’s
maybe 40% done in the coding. It’s about 1600 lines at the moment,
half the size of the csl.js in Zotero. I’d expect it to swell
significantly over the current implementation in total size, because
the definitions of individual attributes in citeproc-js are more
verbose at the compiler level (although the runtime it will generate
will be much more spartan).

Here are some of the highlights:

With test-driven development, you write the tests first, then write
the code until they pass.

But given that all tests pass, and yet you’ve said there’s still “a lot to
do”, I take it that’s not exactly the approach you’ve taken.

Oh, darn, I messed up again! Sorry about that, I’ll try to do
better in the future. :slight_smile:

But seriously, I felt my way in a spiral, with bits of code, then
tests, then rewriting of the code to make it more readable. Some
parts of the code have been rewritten three or four times as new
issues came up. I’ve watched XP teams work, it’s been a similar
process, except that I didn’t have a programming partner and I was a
neophyte in the language when I started writing – and I wrote a lot
more verbal commentary as I went along because I’m chatty by nature.
You gets what you pays for.

So would it be fair to say that the next step really ought to be to
sort out the remaining tests?

If yes, do you have some input on what they should be?

Yep, absolutely. The only big piece of infrastructure still to be
built is the disambiguation/sorting registry. I can certainly provide
internal unit tests for that, if there’s need.

For the CSL language, anyone building an engine would want to have at
least one test for each element, attribute and option, and test suites
for known hard cases (like et al., disambiguation, and sorting). It
would be great if you as the language designer could provide the
hard-case items, to be sure behaviour is defined as you intend.

Or, if you can find some remaining time, can you imagine starting to
put those in place so that others can enter and figure out how to make
them pass?

Sure thing. Dividing the work between test-writing and coding is
ideal. There is a proposed generic test layout from Simon (with a
couple of tiny changes by me) at data/README-3.txt in the archive. If
the layout can be agreed and a file hierarchy set up somewhere, I’ll
be happy to chip in.

BTW, I converted your TODO.pdf to a text file in the repo for editing purposes.

Thanks, that was a hasty addition, to be sure I didn’t lose the message.

For the CSL language, anyone building an engine would want to have at
least one test for each element, attribute and option, and test suites
for known hard cases (like et al., disambiguation, and sorting). It
would be great if you as the language designer could provide the
hard-case items, to be sure behaviour is defined as you intend.

I could help with this, though I already did some of this way back
with the python and ruby versions. Some of that may be out-of-date,
but there may still be cases of value that could be repurposed.

Or, if you can find some remaining time, can you imagine starting to
put those in place so that others can enter and figure out how to make
them pass?

Sure thing. Dividing the work between test-writing and coding is
ideal. There is a proposed generic test layout from Simon (with a
couple of tiny changes by me) at data/README-3.txt in the archive. If
the layout can be agreed and a file hierarchy set up somewhere, I’ll
be happy to chip in.

Yeah, so we should probably settle that.

So what Simon proposes is a set of test fixtures written in JSON, and
so language-agnostic.

I like the idea. We just need to settle:

  1. the metadata representation. I think we agree it ought to be as
    close to CSL’s model as possible, and that contributors need to be
    ordered arrays. What we did not agree on is how to represent the
    contributors. Simon wants something like:

[
  { "family_name": "Doe", "given_name": "Jane" }
]

But if we do that, then we also need to accommodate organizational and
non-Western names, as well as prefix, suffix and articular pieces.

Andrea has this in his code, which may be a good model (except that
the Person has no way to store different display rules):

data Agent
    = Entity String
    | Person
        { namePrefix :: String
        , givenName  :: [String]
        , initials   :: String
        , articular  :: String
        , familyName :: String
        , nameSuffix :: String
        }
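
For comparison, one way the JSON side could cover the same ground is sketched below. Only family_name and given_name come from Simon's sample; the articular, prefix, suffix and name fields, and the displayName function, are hypothetical and offered only to show the shape of the problem. Display order for non-Western names is deliberately left out.

```javascript
// An ordered array of contributors. A person uses name parts; an
// organization uses a single literal "name" field (an assumption).
const contributors = [
  { family_name: "Doe", given_name: "Jane" },
  { family_name: "Beethoven", given_name: "Ludwig", articular: "van" },
  { name: "World Health Organization" }
];

// Build a simple Western-order display string from whichever parts exist.
function displayName(c) {
  if (c.name) {
    return c.name; // organizational name, used as-is
  }
  const parts = [c.prefix, c.given_name, c.articular, c.family_name];
  const base = parts.filter(Boolean).join(" ");
  return c.suffix ? base + ", " + c.suffix : base;
}

console.log(contributors.map(displayName).join("; "));
// Jane Doe; Ludwig van Beethoven; World Health Organization
```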

  2. the citation stuff; Simon suggested:

font-style/test001.txt

{
  "testof": "citation",
  "csl": "<text font-style=\"italic\" variable=\"title\"/>",
  "citation": [
    {"source": "book-a", "locators": {"page": "10"}},
    {"source": "book-b"}
  ],
  "result": "My Book; Your Book"
}

The only concern I have about the above is that the result is awfully
low-level.

Bruce

  1. the metadata representation. I think we agree it ought to be as
    close to CSL’s model as possible, and that contributors need to be
    ordered arrays.

That’s right. I’ve adapted a sample from your earlier test data and
put it into that section of the README-3 doc to make that clear, and
changed the attribute names as you and Simon agreed.

The only concern I have about the above is that the result is awfully
low-level.

It’s just a sample, you can write tests for larger blocks of CSL with
the same layout. Fine-grained tests like this of individual elements
and attributes are helpful to a coder, though, to provide a target for
specific functionality, and to identify specific areas of misbehaviour
when later changes to a program mess things up.
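
As a sketch of how a processor in any language might consume such a fixture, here is a tiny runner. The fixture mirrors the sample layout quoted in this thread; the source titles, the stubProcessor function and the runFixture function are assumptions made up for illustration, and the stub ignores the CSL entirely.

```javascript
// The fixture, serialized as it would be stored in a test file.
const fixtureText = JSON.stringify({
  testof: "citation",
  csl: '<text font-style="italic" variable="title"/>',
  citation: [
    { source: "book-a", locators: { page: "10" } },
    { source: "book-b" }
  ],
  result: "My Book; Your Book"
});

// Hypothetical source data keyed by the ids used in the fixture.
const sources = {
  "book-a": { title: "My Book" },
  "book-b": { title: "Your Book" }
};

// Stand-in for a real processor: ignores the CSL and joins the titles.
function stubProcessor(csl, citation, data) {
  return citation.map((c) => data[c.source].title).join("; ");
}

// Parse a fixture, run the processor, and compare against the expected result.
function runFixture(text, processor, data) {
  const fixture = JSON.parse(text);
  const actual = processor(fixture.csl, fixture.citation, data);
  return {
    pass: actual === fixture.result,
    expected: fixture.result,
    actual: actual
  };
}

console.log(runFixture(fixtureText, stubProcessor, sources).pass); // true
```

Because the fixture is plain JSON, the same file could be read by the JavaScript, Haskell, Python or Ruby implementations, each supplying its own processor function.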

I’ll leave the other issues you raise to people who have a better
understanding of that end of things.

Frank