disambiguate-add-date option?

This case came up in Zotero forums:

First time interview is cited in text: Rose Hill, interviewed by Mary
Davis, transcript, May 13, 1928, University of Virginia Special
Collections
Next citation:
Rose Hill, interview
Problem:
No date to distinguish between:
First cite:
Rose Hill, interviewed by Mary Davis, transcript, June 4, 1928,
University of Virginia Special Collections
Second cite:
Rose Hill, interview

To distinguish between the two, the citation manager should add the
date:

Rose Hill, interview, May 13, 1928

Since the date should only be added if there are multiple interviews
with the same subject and interviewer, there should be something like
a disambiguate-add-date option in CSL. It would also be relevant to
letters with the same recipient and sender but different dates. Would it
be possible to add this option?

Thanks!
Elena
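
A minimal sketch of the rule requested above: the date is appended to the short form only when more than one item shares the same interviewee and interviewer. The record layout, the function name and the third (John Doe) item are assumptions made for the illustration; nothing here is existing CSL or Zotero behaviour.

```python
from collections import Counter

# Hypothetical item records; the field names are illustrative, not CSL variables.
interviews = [
    {"interviewee": "Rose Hill", "interviewer": "Mary Davis", "date": "May 13, 1928"},
    {"interviewee": "Rose Hill", "interviewer": "Mary Davis", "date": "June 4, 1928"},
    {"interviewee": "John Doe", "interviewer": "Mary Davis", "date": "July 1, 1930"},
]


def short_cites(items):
    """Render short-form cites, adding the date only where the pair repeats."""
    pairs = Counter((i["interviewee"], i["interviewer"]) for i in items)
    cites = []
    for i in items:
        cite = f'{i["interviewee"]}, interview'
        if pairs[(i["interviewee"], i["interviewer"])] > 1:
            # More than one interview with the same people: the date is needed.
            cite += f', {i["date"]}'
        cites.append(cite)
    return cites


for line in short_cites(interviews):
    print(line)
```

The two Rose Hill interviews get their dates, while the single John Doe interview keeps the bare short form.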

This makes sense, but there’s a little problem related to the previous
discussion with Andrea: how would we specify the formatting of the
date?

I’m not really clear: would the disambiguate prefix on the conditional
work? I wasn’t the one who added the following, and so I’m not really
sure:

If text inside an block can be used to differentiate two otherwise
identical citations, it will be added. If the citations remain identical
after its addition, it will not be added.

Bruce

I would suggest to adopt a very careful approach with this.

I’ve been struggling in search of a nice and efficient way of
implementing disambiguation options and I’m coming to think there is
none I’m able to think about.

So I went to have a closer look at what Zotero does, and, as far as I
can understand, it does not implement disambiguation. Or at least it
does not implement it correctly. So I actually wonder whether your
proposal would be trivial to implement in Zotero or not. Could you please
elaborate a bit on this?

Why am I saying that Zotero does not seem to implement
disambiguation correctly? Because all it does is make a guess,
given a reference group and some style elements, as to whether two
citations will need to be disambiguated. This guess is good most of the
time, but it is very easy to create perfectly valid styles for which
Zotero will fail to disambiguate citations, because the method
it uses to guess cannot provide it with enough information to
decide in every situation.

Just to give you a few examples, take this data set:
http://gorgias.mine.nu/csl/zoteroItems.rdf

You should be able to import it in Zotero.

Here you’ll find:

a) a book I (fictionally… :wink:) edited with a friend and a book I
authored with my wife in 2007. I’m the first editor of the first and
the first author of the second.

b) 2 books by two brothers: Giovanni and Giuseppe Pascuzzi, same year,
different titles;

c) 2 books with the same author and 2 different editors, same year.

If we apply the following style to a):
http://gorgias.mine.nu/csl/testAddName.xml

we will get:
(Andrea Rossato et al., 2007), (Andrea Rossato et al., 2007)

you can install this style in Zotero from here:
http://gorgias.mine.nu/csl/install/testAddName.csl

What happened? The style requires author and date. If no author is
present the editor will be used instead. While these 2 citations
produce the same output, and so add-name should be applied, Zotero
thinks their output is different (as far as I understand because their
author field is different).
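
A small illustration of the difference between guessing from the data and checking what the style produces. This is not Zotero's code: the record layout, the toy `render` function and both comparison helpers are assumptions made for the example.

```python
# Hypothetical records for case a); the field names are illustrative.
edited = {"author": [], "editor": ["Andrea Rossato", "Roberto Caso"], "year": 2007}
authored = {"author": ["Andrea Rossato", "Paola Locatin"], "editor": [], "year": 2007}


def render(item):
    """Toy author-date rendering: fall back to editors, truncate to 'et al.'."""
    names = item["author"] or item["editor"]
    label = names[0] + (" et al." if len(names) > 1 else "")
    return f"({label}, {item['year']})"


def collide_by_field(a, b):
    # The guess: two items collide if their author field and year match.
    return a["author"] == b["author"] and a["year"] == b["year"]


def collide_by_output(a, b):
    # The check on what the style actually prints.
    return render(a) == render(b)


print(render(edited), render(authored))
print(collide_by_field(edited, authored), collide_by_output(edited, authored))
```

The field-based guess reports no collision because one author list is empty, yet both items render to the same string, which is the situation described for a).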

Now, if you apply this style to b):
http://gorgias.mine.nu/csl/testAddGivenName.xml

install it from here:
http://gorgias.mine.nu/csl/install/testAddGivenName.csl

you’ll get:
(G. Pascuzzi, 2004),(G. Pascuzzi, 2004)

I’m not sure what the issue is here. But the result is definitely
wrong: add-given-name should be used, and it is not. But, as I said,
Zotero does not take into account what the style actually produces. It
is just guessing.

And you can see that especially in this 3rd issue.

If you apply this style to c):
http://gorgias.mine.nu/csl/testYearSuffix.xml

install it from here:
http://gorgias.mine.nu/csl/install/testYearSuffix.csl

you’ll get:
(Roberto Caso, no date),(Roberto Caso, no date)

Now, this style is not using the date for books. But the items have
the date set. Since the items have the date set and the style has the
option disambiguate-add-year-suffix Zotero is guessing every citation
gets the date, and so to disambiguate it is sufficient to add the year
suffix. In this case this is false.

But Zotero will think that the citation has been disambiguated, and it
will not apply the disambiguate-condition.

You can test it by applying to c) the following style:
http://gorgias.mine.nu/csl/testYearSuffix-2.xml

install here:
http://gorgias.mine.nu/csl/install/testYearSuffix-2.csl

This will produce:
(Roberto Caso, Giovanni Pascuzzi, no date),(Roberto Caso, Andrea Rossato, no date)

Since disambiguate-add-year-suffix is not set, Zotero will not
consider the citations disambiguated, and it will try the disambiguate
condition.

Now, you may think those are bugs that can be easily solved. Maybe you
are right, but the information I have, and the fact that I discovered
those issues by reading the code, make me think this is a design
flaw. I don’t see any easy solutions.

Which brings me to the reason why I wrote such a long message: when
you decided to add those disambiguation options did you have in mind a
way of implementing them? Would you please share it with me?

Thanks,
Andrea

Good analysis Andrea :slight_smile:

Which brings me to the reason why I wrote such a long message: when
you decided to add those disambiguation options did you have in mind a
way of implementing them? Would you please share it with me?

I’m not sure how Zotero is implementing it, but it might be good to
isolate what needs to be done, and what the specific problems are? In
fact, maybe we ought to define a formal test based on your examples?

a) a book I (fictionally… :wink:) edited with a friend and a book I
authored with my wife in 2007. I’m the first editor of the first and
the first author of the second.

b) 2 books by two brothers: Giovanni and Giuseppe Pascuzzi, same year,
different titles;

c) 2 books with the same author and 2 different editors, same year.

So what would we expect of these examples?

a)

add-suffix = True
add-names = False, unless et al is set to not print second name
add-title = True

b)

add-suffix = True
add-givennames = True
add-title = True

c)

I’m unclear on this one.

Is the above correct (except for c)? Can we write some tests for it?

Bruce

Good analysis Andrea :slight_smile:

Which brings me to the reason why I wrote such a long message: when
you decided to add those disambiguation options did you have in mind a
way of implementing them? Would you please share it with me?

I’m not sure how Zotero is implementing it, but it might be good to
isolate what needs to be done, and what the specific problems are? In
fact, maybe we ought to define a formal test based on your examples?

a) a book I (fictionally… :wink:) edited with a friend and a book I
authored with my wife in 2007. I’m the first editor of the first and
the first author of the second.

b) 2 books by two brothers: Giovanni and Giuseppe Pascuzzi, same year,
different titles;

c) 2 books with the same author and 2 different editors, same year.

So what would we expect of these examples?

a)

add-suffix = True
add-names = False, unless et al is set to not print second name

and so this is True since the test style I proposed has:

b)

add-suffix = True
add-givennames = True
add-title = True

c)

I’m unclear on this one.

according to the test I proposed here:
http://gorgias.mine.nu/csl/testYearSuffix.xml

this should get evaluated:

Is the above correct (except for c)? Can we write some tests for it?

I don’t know if I don’t get something, but you can find all relevant
tests here:

http://gorgias.mine.nu/csl/

Do you need/mean something different?

Andrea

I mean a unit test that will report failure (or success) when run.

See, for example:

http://xbiblio.svn.sourceforge.net/viewvc/xbiblio/citeproc-rb/test/test_csl.rb?view=markup

Bruce

so far only Zotero supports those features, and I wouldn’t know how to
test it (I just imported the RDF file, installed the styles I linked
before, and opened a document with all the 6 citations).

we have the data though.

The first test:

This is the expected output:

(Andrea Rossato & Paola Locatin, 2007)
(Andrea Rossato & Roberto Caso, 2007)

instead we get:
(Andrea Rossato et al., 2007)
(Andrea Rossato et al., 2007)

Second, we should get
(Giuseppe Pascuzzi, 2004)
(Giovanni Pascuzzi, 2004)

instead we get:
(G. Pascuzzi, 2004)
(G. Pascuzzi, 2004)

Third, we should get:
(Roberto Caso, Giovanni Pascuzzi, no date)
(Roberto Caso, Andrea Rossato, no date)

instead we get:
(Roberto Caso, no date)
(Roberto Caso, no date)

Andrea

Do you need/mean something different?

I mean a unit test that will report failure (or success) when run.

so far only Zotero supports those features, and I wouldn’t know how to
test it (I just imported the RDF file, installed the styles I linked
before, and opened a document with all the 6 citations).

Right, but this is where test driven development makes some sense: you
define the expectations formally up front, and then code until it
passes. It allows us to, as you said earlier, be “careful” about these
sorts of features (and documentation probably).
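
As a rough sketch of what such a test could look like, the second expected result quoted earlier in the thread (the two Pascuzzi brothers) can be written as a unittest case. `render_all` is a toy stand-in invented here so that the test has something to run against; a real test would call the processor under test instead.

```python
import unittest


def render_all(items):
    """Toy stand-in for a CSL processor (an assumption, not a real API):
    initials by default, full given name when two cites would be identical."""
    def cite(item, full):
        given = item["given"] if full else item["given"][0] + "."
        return f"({given} {item['family']}, {item['year']})"

    short = [cite(i, full=False) for i in items]
    # Expand the given name only for cites whose short form is not unique.
    return [cite(i, full=short.count(s) > 1) for i, s in zip(items, short)]


class AddGivenNameTest(unittest.TestCase):
    def test_brothers_are_distinguished(self):
        items = [
            {"given": "Giuseppe", "family": "Pascuzzi", "year": 2004},
            {"given": "Giovanni", "family": "Pascuzzi", "year": 2004},
        ]
        self.assertEqual(
            render_all(items),
            ["(Giuseppe Pascuzzi, 2004)", "(Giovanni Pascuzzi, 2004)"],
        )
```

Run with `python -m unittest`, this reports success or failure without anyone having to open a document and inspect the citations by eye.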

we have the data though.

That’s helpful, and could be factored into the tests. Thanks.

Would love to hear from Simon on the bugs you’re reporting here :slight_smile:

Bruce

Right, but this is where test driven development makes some sense: you
define the expectations formally up front, and then code until it
passes. It allows us to, as you said earlier, be “careful” about these
sorts of features (and documentation probably).

Actually in Haskell test driven development is the rule (I spend a lot
of my time preparing tests and data, but I didn’t write any proof of
the properties of my functions… so far :wink:).

Just to emphasize my agreement.

Would love to hear from Simon on the bugs you’re reporting here :slight_smile:

As I said, I don’t think we are actually talking about bugs, if by bugs
we mean unintended errors.

I’ve been thinking about it quite a lot, now, and I think that there
is only one way out, if you do not want to blindly evaluate every
possible combination of a citation until you find the non-colliding one
(which could mean a lot of space: just think of a huge document with a
huge number of colliding citations of collaborative works, with a
strict et-al setup, not uncommon in many disciplines).

The way out is to encode into the formatted output disambiguating
information to be used for post-processing. This, on my side, means
increasing the complexity of the output and either writing an
abstraction layer for the end user output filter (pandoc), or increasing
the complexity of the rendering functions. Still I’m not sure this is
going to work. But it’s probably worth a try.

I’d love to hear from the other developers how they plan to deal with
this issue. I’d love to hear if someone has an idea for a possible
algorithm, basically.

Andrea

While we’re waiting on Simon, I tested the items at http://gorgias.mine.nu/csl/zoteroItems.rdf
with APA style on dev xpi and got this:

Caso, R. (2005a). Same Author Different Editors First (A. Rossato, Ed.).
Caso, R. (2005b). Same Author Different Editors Second (G. Pascuzzi,
Ed.).
Pascuzzi, G. (2004). The Brother’s Book.
Pascuzzi, G. (2004). First Book.
Rossato, A., & Caso, R. (Eds.). (2007). Edited Book.
Rossato, A., & Locatin, P. (2007). Authored Book.

(Caso, 2005a)
(Caso, 2005b)
(Giovanni Pascuzzi, 2004)
(Giuseppe Pascuzzi, 2004)
(Rossato & Caso, 2007)
(Rossato & Locatin, 2007)

Which seems like expected behavior, since editors of authored works
shouldn’t matter for disambiguation. No?

There is an open ticket on subsequent-author-substitute problem which
may be related:
https://www.zotero.org/trac/ticket/933

But I think most of the problems with existing disambiguations have
been solved.

On the disambiguate-add-date option, could we just specify the name
for the macro? Then the particular style will be able to format the
date as necessary as long as the name is standard. I know we haven’t
done it before, but it was discussed at one point as a possibility. For
this option, you wouldn’t have to compare the dates, just the names
for interviewee/interviewer and letter author/recipient.

Sorry, you’d still compare the dates, but presumably it wouldn’t matter
since it’s the dates in the implementer’s db that will be compared and
not the macros. So I think specifying standard names for key macros
may work for this.

E

sorry Elena, maybe I was not clear on this point, but you need to test
the data with the test styles I provided here:

http://gorgias.mine.nu/csl/install/

I know that with APA (and probably with most or even all of the other
styles) everything is fine.

The problem is that the disambiguation routine is designed to guess,
and to guess right in the great majority of cases. Still it is
not a general implementation, as my test cases point out.

Each style points out a single problem with the Zotero implementation.
But you need those styles to actually see it.

I hope I clarified the point.

Best,
Andrea

PS: I said that I may be wrong. All my styles are valid, according to
jing.

The way out is to encode into the formatted output disambiguating
information to be used for post-processing.

In general, this is the approach I took in the XSLT code. So, I take
the raw source data, process it into an intermediate representation
that includes additional flags (like year suffix*), and then use that
for output processing. In an OO design, you’d just set those
parameters on each object.
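
A minimal sketch of that idea for the year-suffix case, assuming a simple record type: the suffix is set on each object before any output is produced, and the output step only reads it. The `Ref` class and both functions are illustrative names, not taken from the XSLT code.

```python
from dataclasses import dataclass
from itertools import groupby
from string import ascii_lowercase


@dataclass
class Ref:
    author: str
    year: int
    title: str
    year_suffix: str = ""  # flag filled in before any output is produced


def assign_year_suffixes(refs):
    """Set year_suffix (a, b, c, ...) on refs that share author and year.

    Sketch only: zip() stops after 26 letters, which a real processor
    would have to handle.
    """
    ordered = sorted(refs, key=lambda r: (r.author, r.year, r.title))
    for _, group in groupby(ordered, key=lambda r: (r.author, r.year)):
        group = list(group)
        if len(group) > 1:
            for ref, letter in zip(group, ascii_lowercase):
                ref.year_suffix = letter
    return ordered


def cite(ref):
    # The output step only reads the flag; it does no disambiguation itself.
    return f"({ref.author}, {ref.year}{ref.year_suffix})"


refs = [
    Ref("Caso", 2005, "Same Author Different Editors Second"),
    Ref("Caso", 2005, "Same Author Different Editors First"),
    Ref("Rossato", 2007, "Edited Book"),
]
print([cite(r) for r in assign_year_suffixes(refs)])
```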

I guess the trick is still exactly how best to do that. The above
worked really well for suffix generation, for example, but at least
some of the other disambiguation options have a different logic that
depends in part on the output style (as you showed).

Bruce

Hi,

I just joined this list and was reading the interesting discussion on
disambiguation options in the Sourceforge list archive.

Andrea wrote:

I’ve been thinking about it quite a lot, now, and I think that there
is only one way out, if you do not want to blindly evaluate every
possible combination of a citation until you find the non-colliding one
(which could mean a lot of space: just think of a huge document with a
huge number of colliding citations of collaborative works, with a
strict et-al setup, not uncommon in many disciplines).

It’s possible that I’m missing something important, but I don’t really
see the difficulty. A simple algorithm would collect all the citations
from the document, format them according to the default style, check
for duplicates in the generated text references and then try out the
disambiguation options for each set of duplicates until all citations
are unambiguous. Even in the worst case the set of citations that
generate a given identical text reference probably isn’t larger than a
few dozen, so I don’t think a simple iterative/recursive approach will
be problematic with regard to space or time.
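
The loop described above might be sketched as follows. The `render(item, enabled)` callback, the option names and the toy renderer are assumptions for the illustration, and the sketch simply switches an option on for every colliding item without first checking that the option helps.

```python
from collections import defaultdict


def disambiguate(items, render, options):
    """Render all items, then switch on options, in order, for colliding groups."""
    enabled = [frozenset() for _ in items]
    for option in options:
        groups = defaultdict(list)
        for idx, item in enumerate(items):
            groups[render(item, enabled[idx])].append(idx)
        clashes = [g for g in groups.values() if len(g) > 1]
        if not clashes:
            break  # every rendered citation is already unique
        for group in clashes:
            for idx in group:
                enabled[idx] = enabled[idx] | {option}
    return [render(item, enabled[idx]) for idx, item in enumerate(items)]


def toy_render(item, enabled):
    """Hypothetical renderer: initials unless 'add-givenname' is enabled."""
    given = item["given"] if "add-givenname" in enabled else item["given"][0] + "."
    cite = f"{given} {item['family']}, {item['year']}"
    if "add-title" in enabled:
        cite += f", {item['title']}"
    return f"({cite})"


items = [  # titles are placeholders
    {"given": "Giovanni", "family": "Pascuzzi", "year": 2004, "title": "Book One"},
    {"given": "Giuseppe", "family": "Pascuzzi", "year": 2004, "title": "Book Two"},
    {"given": "Roberto", "family": "Caso", "year": 2005, "title": "Book Three"},
]
print(disambiguate(items, toy_render, ["add-givenname", "add-title"]))
```

Here the Pascuzzi pair is resolved after the first option, so the title option is never applied and the unrelated Caso cite keeps its short form.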

Best regards,
Stephan

Suppose a style with

and a few citations like:

  1. A. Rossato, B. Caso, C. Sempronio, D. Tizio, E. Doe, F. Smith, G.
    Scarpa, Some Title, 2001.
  2. A. Rossato, B. Caso, C. Sempronio, D. Tizio, E. Doe, F. Smith, H.
    Scarpa, Some Other Title, 2001.

All will be evaluated to:
Rossato, et al., 2001.

When you look at the evaluated citations you cannot say why and when
you’ll be able to find two non-colliding versions, so you start adding
a name, you evaluate and look at the output, and both have:

Rossato, Caso, et al., 2001.

And so you add a second name, and so on. Then you switch to adding
given names. How many times do you have to evaluate the same citations?
You cannot know. And if you have a book with hundreds of
citations, you cannot predict how many times a citation will be
evaluated. I had some previous experience with the WIKINDX citation
style. I wanted to use it in a wiki I was developing (UniWakka), in
order to have a (more) easily extensible citation style support. But
my laptop was not able to render a chapter of the book I was writing
at that time in less than a few minutes. Sometimes, performance and
space usage can be critical. Moreover I would really like to prevent
possible future bug reports. :wink:

Now I’m working on a version of the Haskell implementation that will
disambiguate every citation in a maximum of two evaluations (maybe
even just one, I still don’t know). That would be a great improvement.
This way I could also gather some idea on how to implement the
"collapse" option. Any idea on this?

Cheers,
Andrea

Andrea Rossato wrote:

Suppose a style with

and a few citations like:

  1. A. Rossato, B. Caso, C. Sempronio, D. Tizio, E. Doe, F. Smith, G.
    Scarpa, Some Title, 2001.
  2. A. Rossato, B. Caso, C. Sempronio, D. Tizio, E. Doe, F. Smith, H.
    Scarpa, Some Other Title, 2001.

All will be evaluated to:
Rossato, et al., 2001.

When you look at the evaluated citations you cannot say why and when
you’ll be able to find two non-colliding versions, so you start adding
a name, you evaluate and look at the output, and both have:

Rossato, Caso, et al., 2001.

And so you add a second name, and so on. Then you switch to adding
given names. How many times do you have to evaluate the same citations?
You cannot know.

The important thing is that the number of evaluations per citation is
bounded by a low constant determined by the maximum number of authors
and the number of disambiguation options.

And if you have a book with hundreds of
citations, you cannot predict how many times a citation will be
evaluated.

A smart algorithm would memoize the final formatting of a given citation
in a given formatting context (ibid options etc), so that it doesn’t
matter performance-wise how often the same work is cited.
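
A minimal sketch of such memoization using Python's `functools.lru_cache`, keyed on the work and its formatting context. The `WORKS` table, the context labels and the function itself are assumptions for the example; a real cache would also have to be invalidated when the set of cited works (and hence the disambiguation) changes.

```python
from functools import lru_cache

# Hypothetical store of cited works, keyed by an invented identifier.
WORKS = {"hill1928": ("Rose Hill", "interview", "May 13, 1928")}


@lru_cache(maxsize=None)
def format_citation(work_id, context):
    """Format one work in one context ('first' or 'subsequent')."""
    name, kind, date = WORKS[work_id]
    if context == "subsequent":
        return f"{name}, {kind}"
    return f"{name}, {kind}, {date}"


# Citing the same work three times in the same context formats it once.
for _ in range(3):
    format_citation("hill1928", "subsequent")
print(format_citation.cache_info())
```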

I had some previous experience with the WIKINDX citation
style. I wanted to use it in a wiki I was developing (UniWakka), in
order to have a (more) easily extensible citation style support. But
my laptop was not able to render a chapter of the book I was writing
at that time in less than a few minutes. Sometimes, performance and
space usage can be critical.

That implementation seems to be a bit inefficient…

Now I’m working on a version of the Haskell implementation that will
disambiguate every citation in a maximum of two evaluations (maybe
even just one, I still don’t know). That would be a great improvement.

I’m not sure the improvement would be that great. Depending on how much
you cache intermediate formatting results you would mainly save on some
string concatenations, I’d guess.

This way I could also gather some idea on how to implement the
“collapse” option. Any idea on this?

Whether or not multiple work citations are collapsed seems to be
independent of disambiguation, though the formatting depends on it.
Hence, for disambiguation one should probably treat multiple work
citations as multiple independent citations, then collapse if authors
are identical using the longest name and the disambiguated years (which
means one needs to keep track of applied disambiguations). Coding this
up may be a bit hairy, though.

Stephan

The important thing is that the number of evaluations per citation is
bounded by a low constant determined by the maximum number of authors
and the number of disambiguation options.

yes, and that number is unknown. I don’t know if I understand your
point, but the issue, for me, stems from the fact that you do not
know how to manipulate the output of an evaluated citation.

That is to say, when you read:

Smith, et al., 2004

you don’t know that Smith is a name and that, if you want to
disambiguate that citation, you need to insert another name from the
author field of the reference data right after it.

Instead you must change the evaluation environment and re-evaluate the
style with the reference data, and generate another output.

A smart algorithm would memoize the final formatting of a given citation
in a given formatting context (ibid options etc), so that it doesn’t
matter performance-wise how often the same work is cited.

the context is not important, and it’s easy to generate just by
looking at the citations list. But this is more or less what I’m
trying to do, which is made a bit complex by the recursive nature of
the style evaluation.

I’m not sure the improvement would be that great. Depending on how much
you cache intermediate formatting results you would mainly save on some
string concatenations, I’d guess.

Well, as I said, instead of regenerating the output you can transform
it with simple post-processing functions. Using some generic
programming techniques, querying and transforming portions of the
output tree should be easy and far less resource intensive than
regenerating it.
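
One way to picture this, as a sketch under assumed names: if the formatted output is kept as a list of tagged nodes rather than a flat string, adding a name is a local transformation and the style does not have to be evaluated again. The `Node` type, the node kinds and the extra names Jones and Brown are invented for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    kind: str  # e.g. "name", "et-al", "date"
    text: str


def add_name(nodes, all_names):
    """Insert the next unprinted name after the last printed name node."""
    positions = [i for i, n in enumerate(nodes) if n.kind == "name"]
    if not positions or len(positions) >= len(all_names):
        return nodes  # nothing left to add
    at = positions[-1] + 1
    out = nodes[:at] + [Node("name", all_names[len(positions)])] + nodes[at:]
    if len(positions) + 1 == len(all_names):
        # Every name is now printed, so "et al." no longer applies.
        out = [n for n in out if n.kind != "et-al"]
    return out


def flatten(nodes):
    return ", ".join(n.text for n in nodes)


cite = [Node("name", "Smith"), Node("et-al", "et al."), Node("date", "2004")]
names = ["Smith", "Jones", "Brown"]
print(flatten(cite))
print(flatten(add_name(cite, names)))
```

Because each node records what it is, the post-processing step knows that Smith is a name and where the next one belongs, which a plain string would not tell it.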

Whether or not multiple work citations are collapsed seems to be
independent of disambiguation, though the formatting depends on it.
Hence, for disambiguation one should probably treat multiple work
citations as multiple independent citations, then collapse if authors
are identical using the longest name and the disambiguated years (which
means one needs to keep track of applied disambiguations). Coding this
up may be a bit hairy, though.

The problem is that you don’t know what a given string is. If we were
talking of structured data everything would be easy. Instead we are
talking of the output of macros, conditionals, etc.: strings. Am I
missing something?

Andrea

I don’t think you’re comparing the output, I think you’re comparing
the original data. Then there is less ambiguity, right? I agree that
maximum number of authors can’t be determined, but couldn’t you just
run a loop to find out how many there are?

You can look here to see how Simon does it in Zotero:
https://www.zotero.org/trac/browser/extension/branches/1.0/chrome/content/zotero/xpcom/cite.js

It’s true that your example styles don’t work in Zotero, but I suspect
they could be fixed to work–I just don’t have time to play with them
right now–maybe during the week.

E

I don’t think you’re comparing the output, I think you’re comparing
the original data.

Not exactly. In examples using et al, the ambiguity is in fact
introduced by the et al rules. So you have to know what those rules
say to drop so that you can know what to add back. Blah …

Then there is less ambiguity, right? I agree that
maximum number of authors can’t be determined, but couldn’t you just
run a loop to find out how many there are?

You can look here to see how Simon does it in Zotero:
https://www.zotero.org/trac/browser/extension/branches/1.0/chrome/content/zotero/xpcom/cite.js

I think Andrea’s point is that Simon’s code on this is less-than-robust.

Bruce

Andrea Rossato wrote:

The important thing is that the number of evaluations per citation is
bounded by a low constant determined by the maximum number of authors
and the number of disambiguation options.

yes, and that number is unknown. I don’t know if I understand your
point, but the issue, for me, stems from the fact that you do not
know how to manipulate the output of an evaluated citation.

My point was that the number is low and only grows with the number of
authors of a work, which always is a low number. So I don’t see a reason
to worry about performance here.

That is to say, when you read:

Smith, et al., 2004

you don’t know that Smith is a name and that, if you want to
disambiguate that citation, you need to insert another name from the
author field of the reference data right after it.

Right, but you certainly know the number of authors in the referenced
work and you can keep record of how many names previously were printed
before “et al”. So you just try until the references are disambiguated
or there are no more unprinted names.
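
A sketch of that bounded loop, using the family names from the seven-author example earlier in the thread. The function names and the rendering format are assumptions; the point is only that the number of evaluations cannot exceed the number of authors.

```python
def render(authors, shown, year):
    """Print the first `shown` family names, then 'et al.' if any are left."""
    names = ", ".join(authors[:shown])
    tail = ", et al." if shown < len(authors) else ""
    return f"{names}{tail}, {year}."


def add_names(a, b, year, start=1):
    """Add one name at a time until the two cites differ or names run out.

    Returns (cite_a, cite_b, evaluations), where evaluations counts how
    often each work was rendered.
    """
    limit = max(len(a), len(b))
    shown, evaluations = start, 0
    while True:
        cite_a, cite_b = render(a, shown, year), render(b, shown, year)
        evaluations += 1
        if cite_a != cite_b or shown >= limit:
            return cite_a, cite_b, evaluations
        shown += 1


family = ["Rossato", "Caso", "Sempronio", "Tizio", "Doe", "Smith", "Scarpa"]
cite_a, cite_b, n = add_names(family, family, 2001)
print(n, cite_a == cite_b)
```

With this data the loop stops after seven evaluations and the two cites are still identical, because G. Scarpa and H. Scarpa share a family name; that is the point at which the processor would have to move on to given names.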

I’m not sure the improvement would be that great. Depending on how much
you cache intermediate formatting results you would mainly save on some
string concatenations, I’d guess.

Well, as I said, instead of regenerating the output you can transform
it with simple post-processing functions. Using some generic
programming techniques, querying and transforming portions of the
output tree should be easy and far less resource intensive than
regenerating it.

My feeling is that in the end your querying and transformation will
almost do the same as the straightforward algorithm, except for pasting
together the output, but maybe I’m overlooking something.

Whether or not multiple work citations are collapsed seems to be
independent of disambiguation, though the formatting depends on it.
Hence, for disambiguation one should probably treat multiple work
citations as multiple independent citations, then collapse if authors
are identical using the longest name and the disambiguated years (which
means one needs to keep track of applied disambiguations). Coding this
up may be a bit hairy, though.

The problem is that you don’t know what a given string is. If we were
talking of structured data everything would be easy. Instead we are
talking of the output of macros, conditionals, etc.: strings. Am I
missing something?

But you know whether the content of a certain variable has become part
of the output, at least if you keep record. Author-year collapsing can
only be reasonably done if for each cited work in a multiple work
citation the same variables are printed and the printed variables, with
the exception of the year, are identical. Additionally, one should
require that the individual citations would generate the same output if
the years were identical. Under these circumstances one can collapse the
citations by printing any one of the citations and substituting the year
for a comma-separated list of years.
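
The collapsing rule above can be sketched like this, assuming each cite has already been reduced to a pair of everything printed except the year and the (disambiguated) year. The pair representation, the function name and the "; " delimiter between groups are assumptions for the example.

```python
from itertools import groupby


def collapse(cites):
    """Collapse adjacent (label, year) cites that share the same label."""
    parts = []
    for label, group in groupby(cites, key=lambda c: c[0]):
        # One label, followed by a comma-separated list of its years.
        years = ", ".join(year for _, year in group)
        parts.append(f"{label}, {years}")
    return "(" + "; ".join(parts) + ")"


print(collapse([
    ("Caso", "2005a"),
    ("Caso", "2005b"),
    ("Rossato & Locatin", "2007"),
]))
```

Keeping the applied year suffixes in the pairs is what lets the collapsed form stay unambiguous.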

BTW: I can imagine that coding up the complicated formatting logic in a
purely functional way (i.e. Haskell-style) could be a major pain.

Stephan