CSL processor status

Hi all,

The CSL processor is pretty much in order. We have tested code in
place for rendering text elements, names, dates and terms.
Conditional branching, macros, multi-tiered sorts and collapsing all
work. Formatting decorations apply correctly, and the infrastructure
for inline markup is in place.

I feel that this is very close to completion, but I need guidance on a
few items before proceeding further. Here’s the list:

  • Inline Markup
    The code in flipflop.js will handle pretty much any markup scheme that
    emerges from discussion, but I need to know what specific
    configuration values to fix in the machinery, and how specific
    instances of inline markup should render in particular styles.

  • Plugin API
    The processor maintains state info for disambiguation and bibliography
    sort order internally. When finished, it will similarly maintain
    state info needed for generating subsequent citations and
    back-references. Before the position registry that holds this info is
    added, I will need to know what API the plugins expect to see, and
    what capabilities the plugins have for delivering data out of the
    target document.

  • Locators
    There has been some discussion of the API for delivering locator
    information to the processor, but the data model does not seem to line
    up with the current capabilities of CSL markup. I need specific
    guidance on what the processor will receive from the application, and
    what markup will be used to control its rendering.

I’m not in any special rush, but I thought I should make it clear that
the decisions needed on these items are outside my bailiwick. These
are matters that touch Zotero (and other apps) and CSL core, and while
I look forward to their resolution, it isn’t my place to lobby for one
solution or another. So I’ll just be holding off on any further coding
until I receive specs to work from on these items. When I have that
info to hand, I can finish things up.

I’m cross-posting this on xbiblio-devel and zotero-dev. Apologies for
the additional traffic, but I want to be sure to catch everyone with
an interest in the project.

Here’s to the completion of this thing!
Cheers,
Frank

  • Inline Markup
    The code in flipflop.js will handle pretty much any markup scheme that
    emerges from discussion, but I need to know what specific
    configuration values to fix in the machinery, and how specific
    instances of inline markup should render in particular styles.

Let’s continue this discussion on the XBib list. I’ve already let you
know my opinions there, but I am fairly open on this.

I don’t have any views, really. I just need to know how to
set things up for testing.

  • Plugin API
    The processor maintains state info for disambiguation and bibliography
    sort order internally. When finished, it will similarly maintain
    state info needed for generating subsequent citations and
    back-references. Before the position registry that holds this info is
    added, I will need to know what API the plugins expect to see, and
    what capabilities the plugins have for delivering data out of the
    target document.

The plug-ins deliver a list of citation fields, with the item ID and
locator information, and the current position information about the
citation. Disambiguation and bibliography sort order are handled
internally in the existing CSL processor as well, so that’s not a
problem. If you want to handle subsequent/ibid/ibid-with-locator in
the CSL processor, that’s fine, but integration.js
needs to know if the current position information on a given field is
wrong so that it can send the updated citation to the word processor.
Additionally, when a new item is added, integration.js needs to know
whether any other items need to be updated as a result (e.g., in a
numeric style, or when a year suffix needs to be added), and when a
citation is added, integration.js needs to know if the bibliography
has changed. Finally, there needs to be a way to add uncited items to
a document bibliography as well. This all requires a document-level
session class, which is currently independent from the CSL processor,
but I wouldn’t mind if it were integrated.
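
Roughly, the data the plug-ins deliver looks something like this (the
property names below are for illustration only; they’re not the actual
integration.js structures):

  // One entry per citation field found in the document, in document
  // order. Property names are illustrative only.
  var citationFields = [
    {
      fieldIndex: 0,            // position of the field in the document
      noteIndex: 1,             // footnote number, if the cite is in a note
      position: "subsequent",   // what the plug-in currently believes
      items: [
        {
          id: "zotero-item-123",  // item ID known to the application
          locator: "23-24",
          label: "page",
          suppressAuthor: false,
          prefix: "see ",
          suffix: ""
        }
      ]
    }
  ];

  console.log(JSON.stringify(citationFields, null, 2));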

It’s the transactions that give effect to all those operations that
I’m curious about. Maybe I should confess that on one thing I do have
a desire, if not an opinion: if possible, I’d like to set things up so
that updates scale well on large documents.

Absolutely a really important requirement.

To that end, I’ve been thinking about how to reduce the volume of data
that needs to be sent across the plugin connection in order to ensure
that state is preserved.

One idea I’ve toyed with is assigning a unique ID to each cite
cluster. If the processor is presented with a list of cluster IDs, and
if the order matches current state, then we know that there have been
no deletions or moves, and the update can go forward immediately. If
not, there would be a staged series of further checks, with queries
and answers back and forth to the plugin, until either a limited set
of updates is determined, or everything is updated from scratch. For
most updates you would get good performance, with long pauses only for
those preceded by aggressive editing of the document.
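
In code, the fast-path check would be cheap; something along these
lines (just a sketch, with made-up names):

  // registryIds: cluster IDs in the order the processor last saw them
  // documentIds: cluster IDs as currently reported by the plugin
  function updatePlan(registryIds, documentIds) {
    var sameOrder = registryIds.length === documentIds.length &&
      registryIds.every(function (id, i) {
        return id === documentIds[i];
      });
    if (sameOrder) {
      // no deletions or moves: the update can go forward immediately
      return "fast";
    }
    // otherwise fall back to further queries, or a refresh from scratch
    return "negotiate";
  }

  updatePlan(["c1", "c2", "c3"], ["c1", "c2", "c3"]); // "fast"
  updatePlan(["c1", "c2", "c3"], ["c1", "c3"]);       // "negotiate"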

For the sake of argument/comparison, couldn’t some or all of this
contextual logic be moved to the plugin, so that citeproc-js would
only be responsible for passing over the differently formatted
options?

If yes, what would likely be the trade-offs between the two approaches?

Bruce

That would mean my end of the work is basically done, which would be
nice. Don’t see any other advantages, though.

I’ve been over this before. Position evaluation is tricky, it can
break, and it’s currently broken in Zotero:

https://www.zotero.org/trac/ticket/1298

The position variable is part of CSL, and if it is unreliable, then
the complaint will be, “CSL is nice, but its position evaluation is
unreliable”. To avoid that situation, the relevant logic should be
tested in the CSL processor, so that it is known to work correctly
before deployment. That also makes for less code replication.

But if the consensus is to keep things as they are, that’s fine with
me. I’m not here to argue, I’m just waiting for guidance on how to
get to the finish line.

I really have no opinion; was just asking.

Bruce

I thought this was being moved to xbib.

(This is definitely the bottleneck for the MacWord plug-
in, which I’ve profiled; I don’t know about WinWord or OOo.) We can’t
optimize pulling fields while maintaining the current feature set, but
it’s possible that we could create a faster mode that inserts a new
citation without updating others and assumes no citations have been
moved. To do this, we’d need to be able to insert a new citation into
the middle of the existing list at some index.

A citeproc-js registry object works both as a list and as a hash for
read access, and supports efficient inserts to a sequence. No
problem there.
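
For what it’s worth, the structure I have in mind is no more
complicated than this (a toy sketch, not the actual registry code):

  function Registry() {
    this.sequence = [];  // ordered list of citation entries
    this.byId = {};      // hash for direct read access by ID
  }

  Registry.prototype.insertAt = function (index, entry) {
    this.sequence.splice(index, 0, entry);  // insert mid-list at an index
    this.byId[entry.id] = entry;            // and register for hash lookup
  };

  Registry.prototype.get = function (id) {
    return this.byId[id];
  };

  var reg = new Registry();
  reg.insertAt(0, { id: "c1" });
  reg.insertAt(1, { id: "c3" });
  reg.insertAt(1, { id: "c2" });  // new citation slotted in at index 1
  // reg.sequence is now c1, c2, c3; reg.get("c2") finds the new entry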

I don’t know the details of the code bases in question, but thinking
along these lines, it seems to me the list of reference objects might
well be in a separate process, which could include functions to load
data from, and dump it to, the document?

So the list is kept constantly up-to-date (by the plug-in), and
citeproc gets run (I guess) when it changes.

That could solve two problems with one stone: much more efficient
processing, and document portability (being able to load from and
dump to the document).

But going back to my question yesterday, what happens from a
processing standpoint if you have a long document with 200 citations,
and a change is made in one of them that only amounts to fixing a
typo?

Bruce

Unfortunately, the problem here is that accurate position evaluation
requires accurately determining field order in the document. OOo does
not provide the list of fields to the plug-in in order, so we have to
sort them (using OOo-specific APIs) to correct this, and our sort is
apparently not taking into account position within the footnote
properly at present. Moving position evaluation to the CSL processor
cannot actually solve this problem, as I assume you still need a
correctly ordered set of citation indices, which is what the plug-in
is not providing in this case.
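
Stated abstractly, the ordering the sort needs to produce is something
like the following (property names are invented for illustration, and
in-text citations are left out of the sketch):

  function compareFields(a, b) {
    if (a.noteIndex !== b.noteIndex) {
      return a.noteIndex - b.noteIndex;     // order by containing note
    }
    return a.offsetInNote - b.offsetInNote; // then by position in the note
  }

  var fields = [
    { id: "f2", noteIndex: 3, offsetInNote: 40 },
    { id: "f1", noteIndex: 3, offsetInNote: 10 },
    { id: "f0", noteIndex: 1, offsetInNote: 0 }
  ];
  fields.sort(compareFields);  // f0, f1, f2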

Simon

Ach. Exploring the possibilities … I wonder … can the footnote
number (for citations in footnotes) and the preceding citation (for
in-text citations, or for footnotes containing multiple citations) be
acquired at reasonable cost by the plugin, working from an individual
citation?

If that’s possible … if citations are assigned UIDs by the
processor, and the UID is stored in the citation, an update from the
plugin can reference the state info for the citation directly (with
the UID as a hash key), and the note number and predecessor provided
by the plugin with the update request can be used to check that state
(probably) hasn’t changed. For a newly created citation, the note
number and predecessor UID could be used to slot the new item into the
correct sequence position in the processor registry (the processor
wouldn’t need to know the overall index position of the item).
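
To pin the idea down, the check might look something like this (a
rough sketch; the names and structures are made up):

  // state: hash keyed by citation UID -> { noteNumber, prevUid }
  // order: array of UIDs in registry sequence
  function checkAndApply(state, order, request) {
    var known = state[request.uid];
    if (known) {
      // Update to an existing citation: the note number and predecessor
      // supplied by the plugin only confirm that (probably) nothing moved.
      if (known.noteNumber === request.noteNumber &&
          known.prevUid === request.prevUid) {
        return "ok";
      }
      return "request_refresh";
    }
    // Newly created citation: slot it in right after its predecessor,
    // without knowing its overall index position in the document.
    var at = order.indexOf(request.prevUid) + 1;
    order.splice(at, 0, request.uid);
    state[request.uid] = {
      noteNumber: request.noteNumber,
      prevUid: request.prevUid
    };
    return "ok";
  }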

I’ve been thinking a little more about this today, and it seems to me
that things could be set up to work pretty smoothly if these details
(preceding citation + note number) can be supplied with update and
insert requests. Where these details match what the processor
registry has on file, update/insert transactions would process
normally, and would be safe (there could be some anomalies – see
below – but by and large the result would be correct). Where a
discrepancy is identified, the processor can report the error, and a
popup asking the user to refresh citations can be issued.

The tree walking required to collect the two items of data would add a
bit to the cost of each individual transaction, but the marginal cost
would be nearly constant for all document and bibliography sizes.

Update machinery built this way could be fooled where a user deletes a
citation, and then uses cut and paste to clone another to replace it,
or where preceding areas of the document are rearranged; an update to
a later citation would not pick up that there had been changes. In
some cases this will produce erroneous cite forms, and misformatted
cites might remain in the document until the next general refresh. I
think this would be tolerable to users, and that a caution to refresh
citations before finalizing a document would be sufficient.

A fix to the sorting code for OOo would still be needed, to produce
correct citations upon refresh. But sorting would not be required for
ordinary updates and inserts, so you would not need to incur that
overhead in ordinary cases.

Apart from the suggestions above … maybe I’m not reading the signals
correctly. Is it preferable to keep things as they are, and rely on
positioning hints delivered by integration.js? If so, just say the
word and I’ll finish out an API that does that. I’ve said that I
think it would be preferable to place the positioning code where it
can be tested in the CSL processor before deployment, but it’s not a
grand principle or anything. Whatever works.

Simon,

Boiling this down, here are a couple of possibilities:

(1) Leave positioning evaluation in integration.js. For a citation,
the CSL processor receives an array of objects composed of Item,
locator, author-suppress flag, prefix, suffix, note number, position
hint, and note-count distance to most recent previous cite to the same
source. It returns a rendered string for the citation.

(2) Move positioning evaluation to the CSL processor. For each
citation, the processor receives a note number (if any), a UID (if
known), the note number (if any) and UID of the preceding citation
within the same note or within the text (as appropriate), and an array
of objects composed of Item, locator, author-suppress flag, prefix and
suffix. It returns a result code of “ok” or “request_refresh” and an
array of objects for citations requiring update, composed of citation
UID and rendered string.
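
To put the two shapes side by side in code (field names for
illustration only, not a spec):

  // Option (1): positioning stays in integration.js. One call per
  // citation; everything is pre-computed, and a string comes back.
  var option1Request = [
    { Item: { id: "item-1" }, locator: "23", suppressAuthor: false,
      prefix: "", suffix: "", noteNumber: 4, position: "subsequent",
      nearNoteDistance: 2 }
  ];
  // -> "rendered citation string"

  // Option (2): positioning moves into the processor, which keeps the
  // registry, so the call carries only the local context.
  var option2Request = {
    uid: "c42",
    noteNumber: 4,
    prev: { uid: "c41", noteNumber: 3 },
    items: [
      { Item: { id: "item-1" }, locator: "23", suppressAuthor: false,
        prefix: "", suffix: "" }
    ]
  };
  // -> { status: "ok" | "request_refresh",
  //      updates: [ { uid: "c42", string: "rendered citation string" } ] }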

I don’t know if the plugins can be made to supply the details required
for (2), and whether they can efficiently address citations by a UID
key. If the answer to both is yes, this would avoid the need to send
all citation data for every insert and update transaction, so it would
scale. If the text is reshuffled in the document there could be
discrepancies, but the user could resolve these with a manual refresh.

Let me know which option plays better on the Zotero side.

Frank

I’d opt for option (1), since the existing code is working fine and it
would presumably require less coding on your side, while also leaving
the door open to possible API changes in the plug-ins (which are
likely as we move them to XPCOM). I’ll look into implementing the UID-
based features when we make these changes, but it seems like
interaction with the word processor can and should be implemented as a
module independent of the CSL processor.

Simon

Sounds good. I’ll set up some simple docs and working examples
for inspection and review and post again when it’s ready.

Frank

FWIW, in the new metadata support in ODF 1.2, citation groups will get
encoded using the new metadata fields, each of which gets described
using RDF, and hence identified by a URI.

Bruce