[TEST] disambiguation

Frank,

I see you’re putting in some test-related data. As a next step, how
about we write a few tests that fail?

E.g. how should we formalize the discussion of disambiguation in tests?

We probably need to test:

add_given_names:

add_title:

add_year_suffix:

… and maybe one or two that mix them.

Bruce

I’ll try to check in a little framework code today that attempts to
read and run a test in the standard format.

It’s an implementation thing, but for the moment I’m stuck at some
threshold issues. The test data is stored (as is right) in an unordered
form, so some means of imposing a sequence will be needed, in order
to read in a name. Thinking about that issue, I can also see that
whether a name part is rendered as initials or in full form is not
firmly tied either to its sequence position (cf. Japan or China, where
abbreviation should not be used) or to the syntactic meaning of the
name part (cf. Mongolia, where given name rather than family name
or patronymic is the core term). Ordinarily, I would look for some small
piece of the problem that can be solved atomically, get that in place,
and use it as a building block for the next increment of logic. But with
several interrelated layers of ontology (if that’s the right word) and
formatting, and with the disambiguation rules waiting in the wings, I’m
kind of puzzled over where to start.

To boil that down to something a little more concise, I think I need
to figure out how to get names in a correct, un-disambiguated form,
before going on to the higher math of disambiguation and sorting.

Even more concisely, “What’s in a name?” :-)

But I’ll try to check in an integration test, at least, to provide
a target to work toward.

Frank

Right. But once you get that set so it works for Asian names and such
too (the hard part), then we just break it into little pieces. E.g.:

Given a citation with:

  • a reference from “Jane Doe” and another from “John Doe”
  • our style says:
    • names should be in family-name-only form, and …
    • “disambiguate-add-givenname” is true

Then names need to be disambiguated.
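
The scenario above could be pinned down as a small test. What follows is only a minimal sketch in Python: `render_names` and the (given, family) tuples are hypothetical, invented for illustration, and are not taken from CSL or from any existing processor. It also assumes the simplest possible rule, namely that the full given name is added only where the family-only form collides.

```python
from collections import Counter


def render_names(names, add_givenname=False):
    """Render (given, family) pairs in family-name-only form.

    Hypothetical helper: when add_givenname is true, the given name is
    added only to names whose family-only form collides with another.
    """
    family_counts = Counter(family for _given, family in names)
    rendered = []
    for given, family in names:
        if add_givenname and family_counts[family] > 1:
            rendered.append(f"{given} {family}")
        else:
            rendered.append(family)
    return rendered


names = [("Jane", "Doe"), ("John", "Doe"), ("Richard", "Roe")]
ambiguous = render_names(names)                      # the two Does collide
resolved = render_names(names, add_givenname=True)   # only the Does change
```

A test written against this would first assert the collision (the failing case) and then assert that it goes away once the option is switched on.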

Yes. One thing that has me stuck (well, stalled), though, is the
possibility of interaction between a type-Asian name (not sure yet
if that’s a sensible thing to call it) and style rules. If all styles will
use the name TAKEDA Shingen in that order, without abbreviation,
then ordering, non-abbreviation (no initials), and use of the full name is
a property of the name. But if some styles will insist on listing this
as Shingen TAKEDA, or as S. TAKEDA, then some more complex
layering of metadata will be required. So I’m not sure what the
object to be manipulated at the disambiguation stage should look like,
what structure or elements it will need to have.

It’s a puzzle that will have to simmer for a while, I think, but I’ll
keep puzzling over it. It would be very good to know more about
the possible demands that styles might place on the formatter.
That would help to dispel some of the fear, uncertainty and
doubt that creeps in when I try to think about this stuff. Maybe
Elena can cast some light on that?

Frank

Some do:
http://www.journalarchive.jst.go.jp/english/jnlabstract_en.php?cdjournal=bbb1961&cdvol=49&noissue=6&startpage=1633
Despite being a Japanese journal, the bibliography lists all the (mostly
Asian) names in a Western abbreviated style. I also can’t recall ever having
seen a journal in my field that doesn’t do the same thing (i.e. overrule the
Asian-type name listing).

Rintze

On Sun, Mar 15, 2009 at 11:49 PM, Frank Bennett <@Frank_Bennett> wrote:

I’m sure that’s the most common approach, come to think of it. Here’s
a counter-example, for what it’s worth (mostly non-Japanese authors,
but there’s a review by “Anno Tadashi” in there, and that one is
definitely in family-given order):

http://monumenta.cc.sophia.ac.jp/

So I guess it’s our fate to write interesting computer code, so to speak. :-)

Frank

2009/3/16 Rintze Zelle <@Rintze_Zelle>:

Here’s another fun one, both for names, as well as other stuff.

http://www.nanzan-u.ac.jp/SHUBUNKEN/publications/jjrs/jjrsMain.htm

Example:

De Bary, Wm. Theodore De Bary, et al, eds.,
1969 The Buddhist Tradition in India, China, and Japan. New York: The Modern Library.
Ishii Shūdō 石井修道
1987 Sōdai zenshūshi no kenkyū 宋代禅宗史の研究. Tokyo: Daitō Shuppansha.
1988 Chūgoku zenshūshi wa: “Mana Shōbōgenzō” ni manabu 中国全集史話「漢字正法眼蔵」に学ぶ. Kyoto: Zen Bunka Kenkyūjō.

So Kanji AND transliterated roman names, as well as standard Western
names. Each is treated differently.

Note also the grouping issue that a lot of anthropologists desperately
need (technically, CSL supports this, but Zotero doesn’t).

Bruce

Okay, here’s one to try on for size. For the input format for testing
of names handling:

“Doe, John”
“van Doe, James”
“Doe, Jacques, III”
“Takeda, Shingen !”
“Ministry of Education, Culture, Sports, Science, and Technology !!”

The names can be used as-is (maybe with locale tweaks) as units in a
set of sort keys. For display, the first element of every name is to
be listed in full in all renderings. A third element, where present,
is understood to be a suffix. A single exclamation point ending the
entry says “prefer to present in this order with no abbreviations,
using appropriate touchups (i.e. lose the comma) but if you insist on
Western formatting, I give up”.

But if we allow formatting of nameparts, that may be a problem?
Also, what if the “van” above is dropped for sorting?

Two exclamation points say, “forget about all those commas, just use
this string literally”.

This is just a format for internal testing. Implementations can of
course ship information in any shape they like, so long as the same
formatting hints can be extracted from it.
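
To make the proposed test markup concrete, here is a rough sketch of a parser for it, in Python. The names `ParsedName` and `parse_test_name` are hypothetical, and the handling is only one reading of the description above: comma-separated elements, an optional third element taken as a suffix, a trailing "!" as a keep-this-order hint, and a trailing "!!" as a literal string. It leaves the question of particles such as “van” open; the particle simply stays attached to the first element.

```python
from dataclasses import dataclass


@dataclass
class ParsedName:
    primary: str              # first element, always shown in full
    secondary: str = ""       # second element, may be abbreviated by a style
    suffix: str = ""          # third element, e.g. "III"
    keep_order: bool = False  # single "!": prefer this order, no initials
    literal: bool = False     # double "!!": use the string as-is


def parse_test_name(raw):
    """Parse one entry of the (assumed) internal test markup."""
    text = raw.strip()
    if text.endswith("!!"):
        # Literal name: the commas carry no structure at all.
        return ParsedName(primary=text[:-2].strip(), literal=True)
    keep_order = text.endswith("!")
    if keep_order:
        text = text[:-1].strip()
    parts = [part.strip() for part in text.split(",")]
    parts += [""] * (3 - len(parts))  # pad missing elements with ""
    # Elements beyond the third are ignored in this sketch.
    return ParsedName(parts[0], parts[1], parts[2], keep_order=keep_order)


samples = [
    "Doe, John",
    "van Doe, James",
    "Doe, Jacques, III",
    "Takeda, Shingen !",
    "Ministry of Education, Culture, Sports, Science, and Technology !!",
]
parsed = [parse_test_name(s) for s in samples]
```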

Well, let’s put aside the syntax stuff, and just say we have two kinds of names:

  1. (display) names

  2. structured personal names

#2 is the hard part, because of the different sort-display rules. I just
remembered this:

http://sourceforge.net/mailarchive/message.php?msg_id=C812EDFE-1E0A-4933-84C9-A1C0094C051C%40highwire.stanford.edu

To quote:

"This discussion may be interested in @name-style in that DTD:

Value: Meaning
eastern: The name will both be displayed and sorted/inverted with the
family name preceding the given name.
islensk: The name will both be displayed and sorted/inverted with the
given name preceding the family name.
western: The name will be displayed with the given name preceding the
family name but will be sorted/inverted with the family name preceding
the given name.
Default value: western"

So a name-style value on the name may work, where we allow this to be
overridden locally.
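
As a rough illustration of the three quoted values, the display and sort orders could be derived from a name-style flag along these lines. This is a sketch only: the function names and the tuple-based sort key are assumptions made for the example, not part of that DTD or of CSL.

```python
NAME_STYLES = ("eastern", "islensk", "western")


def display_form(given, family, name_style="western"):
    """Display order: family first only for 'eastern'."""
    if name_style not in NAME_STYLES:
        raise ValueError(f"unknown name-style: {name_style!r}")
    if name_style == "eastern":
        return f"{family} {given}"
    return f"{given} {family}"


def sort_key(given, family, name_style="western"):
    """Sort order: given first only for 'islensk'."""
    if name_style not in NAME_STYLES:
        raise ValueError(f"unknown name-style: {name_style!r}")
    if name_style == "islensk":
        return (given, family)
    return (family, given)


shown = display_form("Shingen", "Takeda", "eastern")
key = sort_key("John", "Doe")  # default value is "western"
```

A local override would then amount to the style passing its own `name_style` argument instead of the one stored on the name.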

Bruce

Could do. What I have in place now for testing will work for all
of these cases, so I’ll run with it for the moment. The syntax
stuff is just an arbitrary markup for testing, to save typing.
Initial top-level tests coming soon, I think.

Small world – the Nanzan campus is 5 minutes’ walk from my office!

Our local version of Zotero is hacked (definitely the right word for it) to
render kanji names correctly, just based on the presence of high-end
unicode in the string. Dodgy, but seems to work. A better solution,
and dual-lingo presentation like this, would need some sort of multi-lingual
layering in the client, à la LinguaPlone. Needed eventually, but don’t even
want to think about that one at the moment. :-)
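
For the record, a heuristic of that general kind might look like the sketch below. It is not the actual Zotero hack, whose details are not given here; `looks_cjk` and `display_name` are hypothetical names, and the check covers only a few common Unicode blocks (kana and the main CJK ideograph ranges).

```python
# Code point ranges checked by this sketch; deliberately incomplete.
CJK_RANGES = (
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
)


def looks_cjk(text):
    """Return True if any character falls in one of the ranges above."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in CJK_RANGES)


def display_name(family, given):
    """Family-first with no separator for CJK-script names, else Western order."""
    if looks_cjk(family + given):
        return family + given
    return f"{given} {family}"


kanji = display_name("石井", "修道")
roman = display_name("Doe", "John")
```

Note that a romanized name such as “Ishii Shūdō” is not caught by this check, since its macron vowels sit well below these ranges, so it would still need a hint on the name itself.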

If grouping only happens when the authors are exactly identical,
that should be very doable, once the other (tsunami of) details
have been sorted.
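
A minimal sketch of that kind of grouping, assuming the entries are already sorted so that identical author lists sit next to each other. The dict layout and `group_by_authors` are made up for the example; the sample data is the JJRS listing quoted earlier in the thread.

```python
from itertools import groupby


def group_by_authors(entries):
    """Group consecutive entries whose author lists are exactly identical."""
    return [
        (authors, [(entry["year"], entry["title"]) for entry in items])
        for authors, items in groupby(entries, key=lambda e: tuple(e["authors"]))
    ]


entries = [
    {"authors": ["De Bary, Wm. Theodore"], "year": 1969,
     "title": "The Buddhist Tradition in India, China, and Japan"},
    {"authors": ["Ishii Shūdō"], "year": 1987,
     "title": "Sōdai zenshūshi no kenkyū"},
    {"authors": ["Ishii Shūdō"], "year": 1988,
     "title": "Chūgoku zenshūshi wa"},
]
grouped = group_by_authors(entries)
```

Each group can then be rendered as one author heading followed by its year-and-title lines.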

Frank