Disambiguation questions

It looks like disambiguation is the next thing that needs to happen in
citeproc-js. Before writing anything, I’d like to check a few details
about how it’s meant to work. Here’s the csl.rnc comment:

defines parameters relating to disambiguation, followed in the order given
below until a citation is disambiguated

disambiguate-add-names: add additional names, disregarding
the “et-al” setting, to disambiguate the citations

disambiguate-add-givenname: add a given name to a citation
to disambiguate it (e.g., John Doe, 2005 vs. Doe, 2005)

disambiguate-add-year-suffix: add a suffix to the year (e.g.,
2007a) when there are two works by the same author published in
the same year included in one bibliography

I’m assuming that disambiguation is uniform across citation and
bibliography. (I guess that’s obvious, but I just want to confirm
that I’m not missing something.)

I’d like to confirm that the priority order of disambiguation is fixed
(in the order given in the comment), but that the steps included will
depend on the options given in the style – so if no options are
given, no disambiguation is performed. I think that’s what the
comment says, but again I’d just like to confirm that I’m not
misunderstanding.

For disambiguate-add-names, would it be right to add one name at a
time until disambiguation is achieved, and use et al. if the
disambiguated form does not include all the names in a cite? I think
that’s right, but again, just confirming.

For disambiguate-add-givenname, should given names be added for all
names in one go? Or should they only be added where it makes a
difference (where given names differ)? Or should these also be added
one at a time until disambiguation is achieved?

With disambiguate-add-year-suffix, is there a shadow sort order (by
title, say) that should be used to cast the suffixes? Or is it safe
to apply them in the order in which they appear in the stream (their
order in the document)?

Sorry for all the questions, but I want to be sure I understand how
this should work before writing anything. It doesn’t look like
something that can be safely worked out in increments along the way
(it scares the living daylights out of me, actually, but I’ll keep
that to myself :).

Frank

I don’t think it’s safe. The guide at
http://www.lib.monash.edu.au/tutorials/citing/apa.html, which claims to
follow APA, describes a shadow sort by title for year-suffixes:

“If there is more than one reference by an author in the same year, suffixes
(a, b, c, etc.) are added to the year. … Allocation of the suffixes is
determined by the order of the references in the reference list. Suffixes
are also included in the reference list, and these references are listed
alphabetically by title.”

Rintze

(In reply to Frank Bennett <@Frank_Bennett>, Thu, Mar 12, 2009 at 12:34 AM.)

I’ll try to help here, but just want to point out that a) Simon wrote
the disambiguation part of the schema, and b) I’ve not implemented it
myself (at least not in a really long time). E.g. Simon or Andrea may
be more helpful.

Either way, clearly the schema documentation is lacking and we should fix this.

Also, you might read through Andrea’s docs for this before even
reading my reply below:

http://code.haskell.org/citeproc-hs/docs/Text-CSL-Proc-Disamb.html

It looks like disambiguation is the next thing that needs to happen in
citeproc-js. Before writing anything, I’d like to check a few details
about how it’s meant to work. Here’s the csl.rnc comment:

defines parameters relating to disambiguation, followed in the order given
below until a citation is disambiguated

disambiguate-add-names: add additional names, disregarding
the “et-al” setting, to disambiguate the citations

disambiguate-add-givenname: add a given name to a citation
to disambiguate it (e.g., John Doe, 2005 vs. Doe, 2005)

disambiguate-add-year-suffix: add a suffix to the year (e.g.,
2007a) when there are two works by the same author published in
the same year included in one bibliography

I’m assuming that disambiguation is uniform across citation and
bibliography. (I guess that’s obvious, but I just want to confirm
that I’m not missing something.)

If I understand your question correctly, the answer is no to the first
two, yes to the last.

I’d like to confirm that the priority order of disambiguation is fixed
(in the order given in the comment), but that the steps included will
depend on the options given in the style – so if no options are
given, no disambiguation is performed. I think that’s what the
comment says, but again I’d just like to confirm that I’m not
misunderstanding.

Simon wrote this part, so maybe he should explain.

For disambiguate-add-names, would it be right to add one name at a
time until disambiguation is achieved, and use et al. if the
disambiguated form does not include all the names in a cite? I think
that’s right, but again, just confirming.

I think so too, but am not 100% sure.

For disambiguate-add-givenname, should given names be added for all
names in one go? Or should they only be added where it makes a
difference (where given names differ)? Or should these also be added
one at a time until disambiguation is achieved?

I believe the last.

With disambiguate-add-year-suffix, is there a shadow sort order (by
title, say) that should be used to cast the suffixes? Or is it safe
to apply them in the order in which they appear in the stream (their
order in the document)?

As Rintze notes, no; you need to sort the entire reference list and
generate the suffix key (for both citation and bib) from that.

Bruce

BTW, going back to testing; perhaps when these sorts of questions come
up, we ought to organize the discussion around writing the tests? That
might allow a more precise conversation, and also yield a useful
product that can fill out the tests.

Not exactly sure how to do that, and I don’t have time ATM to figure
it all out, but maybe we ensure there are a few records in the json
test that require disambiguation, and we say:

{
  "cite": [ { "id": "ref-1" }, { "id": "ref-2" } ],
  "out": "whatever" /* maybe some kind of array */
}

Bruce

I’ll try to help here, but just want to point out that a) Simon wrote
the disambiguation part of the schema, and b) I’ve not implemented it
myself (at least not in a really long time). E.g. Simon or Andrea may
be more helpful.

Either way, clearly the schema documentation is lacking and we should fix this.

Also, you might read through Andrea’s docs for this before even
reading my reply below:

http://code.haskell.org/citeproc-hs/docs/Text-CSL-Proc-Disamb.html

Unfortunately the documentation is not complete: I need to describe
the disambiguation process.

BTW, disambiguation has been a major challenge. The toughest one, I’d
say.

Below are my answers, which are based upon my understanding of the issue.

It looks like disambiguation is the next thing that needs to happen in
citeproc-js. Before writing anything, I’d like to check a few details
about how it’s meant to work. Here’s the csl.rnc comment:

defines parameters relating to disambiguation, followed in the order given
below until a citation is disambiguated

disambiguate-add-names: add additional names, disregarding
the “et-al” setting, to disambiguate the citations

disambiguate-add-givenname: add a given name to a citation
to disambiguate it (e.g., John Doe, 2005 vs. Doe, 2005)

disambiguate-add-year-suffix: add a suffix to the year (e.g.,
2007a) when there are two works by the same author published in
the same year included in one bibliography

I’m assuming that disambiguation is uniform across citation and
bibliography. (I guess that’s obvious, but I just want to confirm
that I’m not missing something.)

If I understand your question correctly, the answer is no to the first
two, yes to the last.

I agree: only adding the year suffix requires uniformity across
citations and the bibliography.

I’d like to confirm that the priority order of disambiguation is fixed
(in the order given in the comment), but that the steps included will
depend on the options given in the style – so if no options are
given, no disambiguation is performed. I think that’s what the
comment says, but again I’d just like to confirm that I’m not
misunderstanding.

Simon wrote this part, so maybe he should explain.

This is the order I implemented: if both add-names and add-given-names
are present, first the et-al option is overridden, by adding more
names. If the citations are not disambiguated then we start adding
given-names. Then we add the year suffix (if the option is set). If no
disambiguation is achieved we try re-evaluating the style with the
disambiguate conditional set to true.
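
A rough sketch of that ordering, for discussion only. Everything here is hypothetical: the function names, the `steps` object and the `render` callback are made up for illustration and are not taken from citeproc-hs or citeproc-js. The idea is simply that the priority order is fixed, each step runs only if the style enables it, and collisions are checked on the rendered output.

```javascript
// Hypothetical sketch; none of these names come from citeproc-hs or citeproc-js.
// `render` turns a cite into the string that is compared for collisions.
function isAmbiguous(cites, render) {
  const seen = new Set();
  for (const cite of cites) {
    const text = render(cite);
    if (seen.has(text)) return true;
    seen.add(text);
  }
  return false;
}

function disambiguate(cites, options, steps, render) {
  // Fixed priority order; a step runs only if the style sets its option.
  const order = [
    ["disambiguate-add-names", steps.addNames],
    ["disambiguate-add-givenname", steps.addGivenNames],
    ["disambiguate-add-year-suffix", steps.addYearSuffix],
  ];
  let current = cites;
  for (const [option, step] of order) {
    if (!isAmbiguous(current, render)) return current;
    if (options[option]) current = step(current);
  }
  // Last resort: re-evaluate with the "disambiguate" condition set to true.
  if (isAmbiguous(current, render)) {
    current = steps.setDisambiguateCondition(current);
  }
  return current;
}

// Toy usage: two cites that collide as "Doe 2007".
const render = (c) => `${c.author} ${c.year}${c.suffix || ""}`;
const steps = {
  addNames: (cs) => cs,            // placeholder, does nothing here
  addGivenNames: (cs) => cs,       // placeholder, does nothing here
  addYearSuffix: (cs) =>
    cs.map((c, i) => ({ ...c, suffix: String.fromCharCode(97 + i) })),
  setDisambiguateCondition: (cs) => cs,
};
const cites = [
  { author: "Doe", year: 2007 },
  { author: "Doe", year: 2007 },
];
const result = disambiguate(
  cites, { "disambiguate-add-year-suffix": true }, steps, render);
console.log(result.map(render)); // [ 'Doe 2007a', 'Doe 2007b' ]
```

With an empty options object the same call leaves both cites as "Doe 2007", since only the (here no-op) last-resort step is tried.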

For disambiguate-add-names, would it be right to add one name at a
time until disambiguation is achieved, and use et al. if the
disambiguated form does not include all the names in a cite? I think
that’s right, but again, just confirming.

I think so too, but am not 100% sure.

Yes, this is the way I implemented it.
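
To make the "one name at a time" reading concrete, here is a minimal sketch. It assumes each cite is just an array of family names and that `etAlUseFirst` is the number of names the style would normally show; `renderNames` and `addNames` are hypothetical helpers, not functions from any existing processor.

```javascript
// Hypothetical helper: show the first `shown` names, then "et al." if
// some names are still hidden.
function renderNames(names, shown) {
  const visible = names.slice(0, shown).join(", ");
  return shown < names.length ? `${visible} et al.` : visible;
}

// For a group of colliding cites, reveal one more name at a time until
// all renderings differ. Returns null if names alone cannot separate them.
function addNames(cites, etAlUseFirst) {
  const maxLen = Math.max(...cites.map((names) => names.length));
  for (let shown = etAlUseFirst; shown <= maxLen; shown++) {
    const rendered = cites.map((names) => renderNames(names, shown));
    if (new Set(rendered).size === rendered.length) return rendered;
  }
  return null;
}

console.log(addNames([
  ["Doe", "Roe", "Brown", "Smith"],
  ["Doe", "Roe", "Jones", "Smith"],
], 1));
// [ 'Doe, Roe, Brown et al.', 'Doe, Roe, Jones et al.' ]
```

The cites stop growing at the third name, and "et al." stays because the fourth name is still hidden. When the name lists are identical the function returns null, which is where the next disambiguation step would take over.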

For disambiguate-add-givenname, should given names be added for all
names in one go? Or should they only be added where it makes a
difference (where given names differ)? Or should these also be added
one at a time until disambiguation is achieved?

I believe the last.

I would say the second (we should add given-names only for
contributors whose family name is the same but they have different
given names). If the name form is short (or there is the
initialize-with attribute), then we start adding the initials first.
Still, in my implementation there’s a bug I’m trying to fix (I
discovered it while writing this message).

With disambiguate-add-year-suffix, is there a shadow sort order (by
title, say) that should be used to cast the suffixes? Or is it safe
to apply them in the order in which they appear in the stream (their
order in the document)?

As Rintze notes, no; you need to sort the entire reference list and
generate the suffix key (for both citation and bib) from that.

Right.
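
A small sketch of what "generate the suffix key from the sorted list" might look like. The data shape (`id`, `author`, `year`, `title`) and the sort keys (author, then year, then title) are assumptions for illustration; in a real processor the order would come from the style's bibliography sort. Suffixes past "z" are not handled.

```javascript
// Hypothetical reference shape: { id, author, year, title }.
function assignYearSuffixes(refs) {
  // Sort a copy of the whole reference list (assumed keys: author, year, title).
  const sorted = [...refs].sort((a, b) =>
    a.author.localeCompare(b.author) ||
    a.year - b.year ||
    a.title.localeCompare(b.title));

  // Group references sharing author and year, keeping sorted order.
  const groups = new Map();
  for (const ref of sorted) {
    const key = `${ref.author}|${ref.year}`;
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(ref);
  }

  // Only groups with a collision get suffixes: a, b, c, ...
  const suffixes = new Map();
  for (const group of groups.values()) {
    if (group.length < 2) continue;
    group.forEach((ref, i) =>
      suffixes.set(ref.id, String.fromCharCode(97 + i)));
  }
  return suffixes;
}

const suffixes = assignYearSuffixes([
  { id: "ref-1", author: "Doe", year: 2007, title: "Zebras" },
  { id: "ref-2", author: "Doe", year: 2007, title: "Apples" },
  { id: "ref-3", author: "Roe", year: 2007, title: "Maps" },
]);
console.log(suffixes.get("ref-2"), suffixes.get("ref-1")); // a b
```

Because the result is keyed by reference id rather than by position in the document, the same map can be consulted when rendering both the citations and the bibliography.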

I’ve put together some data and some tests here:
http://gorgias.mine.nu/csl/disambig/

Hope this helps.

Andrea

PS: the Zotero implementation is incomplete, since it will try to
disambiguate the citations by analyzing the bibliographic data, and
not the output of the style application.

More info here:

http://sourceforge.net/mailarchive/forum.php?thread_name=9F3DCE8A-28A6-40DF-A52D-47CDC69F8EFF%40gmail.com&forum_name=xbiblio-devel

and specifically here:

http://sourceforge.net/mailarchive/message.php?msg_name=20080801105812.GE14668%40Andrea.Nowhere.net

BTW, going back to testing; perhaps when these sorts of questions come
up, we ought to organize the discussion around writing the tests? That
might allow a more precise conversation, and also yield a useful
product that can fill out the tests.

Not exactly sure how to do that, and I don’t have time ATM to figure
it all out, but maybe we ensure there are a few records in the json
test that require disambiguation, and we say:

{
  "cite": [ { "id": "ref-1" }, { "id": "ref-2" } ],
  "out": "whatever" /* maybe some kind of array */
}

Bruce

+2. I’ll work on this. I want to read through Andrea’s very thorough
work on this, and then think about how I might tackle it in
citeproc-js. Then I’ll try to build some test JSON and set up some
hooks to run them. It will take a while, but when I get something in
place I’ll post a note.

I’m sooo glad I took my foot off the gas in the coding when I did. I
think this can be solved, though.

Frank

Disambiguation is coming together quickly in citeproc-js, which is a
big relief. I probably would have thrown up my hands if Andrea and
Simon hadn’t demonstrated that it could be done.

Add-names is working, and add-givennames is ready to have its switch
turned on – but something just occurred to me. Logically, adding a
“significant” givenname at any position will disambiguate two
citations, even if nothing else changes. When name one fails, then,
should it be left expanded (which I think may be the natural reading
of the spec), or should we shrink it back when we try the next one?
Here are two examples:

J. Doe, J. Roe & R. Brown
J. Doe, J. Roe & R. Brown

becomes:

John Doe, Jane Roe & R. Brown
John Doe, Janet Roe & R. Brown

or:

J. Doe, Jane Roe & R. Brown
J. Doe, Janet Roe & R. Brown

The second one looks right to me, but it opens up a nasty problem.
You could have a case where the relation chains across multiple
citations, and requires multiple names to be expanded. For example:

J. Doe, Jane Roe & Richard Brown
J. Doe, Jane Roe & Robert Brown
J. Doe, Janet Roe & Richard Brown

(Assume that John Doe is the first author in all cases.) That also
looks to me like the elegant resolution, although it’s a little scary
to contemplate. I think it might be possible to code it. Possible,
but a considerable headache, so I thought I’d better ask first. Are
the latter two examples correct, or should the first names be left
expanded?

It’s really a style question. If any editors are listening, I’d be
grateful for thoughts and observations.

Frank
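
For what it's worth, the "expand only where it makes a difference" option can be sketched fairly compactly if we assume the colliding cites all have the same number of names and the same family names in each position. This is only one possible reading, not a settled algorithm, and the name shape and helper names below are made up for the example.

```javascript
// Hypothetical name shape: { given: "Jane", family: "Roe" }.
const initial = (name) => `${name.given[0]}. ${name.family}`;
const full = (name) => `${name.given} ${name.family}`;

function joinNames(parts) {
  if (parts.length < 2) return parts.join("");
  return `${parts.slice(0, -1).join(", ")} & ${parts[parts.length - 1]}`;
}

// Expand the given name at a position only if the colliding cites
// disagree on the given name there; all other positions keep initials.
function expandDifferingGivenNames(cites) {
  const positions = cites[0].length;
  const significant = [];
  for (let i = 0; i < positions; i++) {
    const givenNames = new Set(cites.map((names) => names[i].given));
    significant.push(givenNames.size > 1);
  }
  return cites.map((names) =>
    joinNames(names.map((n, i) => (significant[i] ? full(n) : initial(n)))));
}

const doe = { given: "John", family: "Doe" };
const expanded = expandDifferingGivenNames([
  [doe, { given: "Jane", family: "Roe" }, { given: "Richard", family: "Brown" }],
  [doe, { given: "Jane", family: "Roe" }, { given: "Robert", family: "Brown" }],
  [doe, { given: "Janet", family: "Roe" }, { given: "Richard", family: "Brown" }],
]);
console.log(expanded.join("\n"));
// J. Doe, Jane Roe & Richard Brown
// J. Doe, Jane Roe & Robert Brown
// J. Doe, Janet Roe & Richard Brown
```

This handles the chained case by treating the whole colliding group at once, so a position is expanded for every cite in the group as soon as any two of them differ there.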

Here are two examples:

J. Doe, J. Roe & R. Brown
J. Doe, J. Roe & R. Brown

becomes:

John Doe, Jane Roe & R. Brown
John Doe, Janet Roe & R. Brown

or:

J. Doe, Jane Roe & R. Brown
J. Doe, Janet Roe & R. Brown

The second one looks right to me, but it opens up a nasty problem.
You could have a case where the relation chains across multiple
citations, and requires multiple names to be expanded. For example:

J. Doe, Jane Roe & Richard Brown
J. Doe, Jane Roe & Robert Brown
J. Doe, Janet Roe & Richard Brown

(Assume that John Doe is the first author in all cases.) That also
looks to me like the elegant resolution, although it’s a little scary
to contemplate. I think it might be possible to code it. Possible,
but a considerable headache, so I thought I’d better ask first. Are
the latter two examples correct, or should the first names be left
expanded?

It’s really a style question. If any editors are listening, I’d be
grateful for thoughts and observations.

I would go with this one:

J. Doe, Jane Roe & Richard Brown
J. Doe, Jane Roe & Robert Brown
J. Doe, Janet Roe & Richard Brown

but if there’s an agreement on:

John Doe, Janet Roe & Richard Brown

that would be fine too for me.

Let’s wait for Bruce.

Andrea

I would go with this one:

J. Doe, Jane Roe & Richard Brown
J. Doe, Jane Roe & Robert Brown
J. Doe, Janet Roe & Richard Brown

I agree, but am not 100% sure.

Bruce

It is a headache, indeed. Uselessly complicated, maybe. Now that I’m
thinking about an algorithm to deal with something like this I’m
asking myself if it is worth the effort.

So I hadn’t really followed this closely; what is the issue?

Bruce

With “disambiguate-add-givenname”, must given names be added
sequentially until you find a non-colliding citation, or should you
add only those given names that actually differ?

Andrea