Dropping and non-dropping particles

Hi all,

It’s not a subject we get many complaints about, but it does come up on occasion. In Papers2, we have hard-coded a number of name particles and have tried to decide which rule to apply to each (dropping or non-dropping) based on usage. I realize the rule can change for the same particle, since some particles are spelled the same in different languages and, even worse, the rules can differ between countries. In any case, I was curious to hear your feedback on the topic. Please let me know if it has been beaten to death in a previous thread. I found a few threads when searching the mailing list, but no extensive discussion.

I am listing here all the particles Papers2 detects. Each particle is decomposed into a dropping part and a non-dropping part (either can be empty, of course). Note that we also correct the capitalization.

I think we have ‘al’ and ‘el’ wrong.

// spain(??) / arabic
al al
dos dos
el el
de las de Las
lo lo
les les

// italy(??)
il il
del del
dela dela
della della
dello dello
di Di
da Da
do Do
des Des
lou Lou
pietro Pietro

// france
de de
de la de La
du du
d’ d’
le Le
la La
l’ L’
saint Saint
sainte Sainte
st. Saint
ste. Sainte

// holland
van van
van de vande
van der vander
van den vanden
vander vander
v.d. vander
vd vander
van het van het
ver ver
ten ten
ter ter
te te
op de op de
in de in de
in 't in 't
in het in het
uit de uit de
uit den uit den

// germany / austria
von von
von der von der
von dem von dem
von zu von zu
v. von
v von
vom vom
das das
zum zum
zur zur
den den
der der
des des
auf den auf den

// scotland(?)
mac Mac

// arabic
ben Ben
bin Bin
sen sen

// what to do with these??
// mc Mc
// o’ O’
// au
// af
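
To make the detection step concrete, here is a minimal sketch of a longest-match lookup over a few entries copied from the list above. It is not the Papers2 code: the `PARTICLES` table and the `detect_particle` helper are hypothetical names, and the sketch only normalizes spelling and capitalization. It does not attempt the dropping/non-dropping split.

```python
from typing import Tuple

# A few entries from the list above: detected spelling -> corrected spelling.
PARTICLES = {
    "van": "van",
    "van der": "vander",
    "v.d.": "vander",
    "von": "von",
    "v.": "von",
    "de": "de",
    "de la": "de La",
    "st.": "Saint",
}


def detect_particle(surname: str) -> Tuple[str, str]:
    """Return (normalized particle, remaining name); particle is "" if none matches."""
    words = surname.split()
    # Try the longest prefix first so "van der" wins over "van",
    # and always leave at least one word for the name itself.
    for n in range(len(words) - 1, 0, -1):
        key = " ".join(words[:n]).lower()
        if key in PARTICLES:
            return PARTICLES[key], " ".join(words[n:])
    return "", surname


print(detect_particle("van der Waals"))
print(detect_particle("V.D. Berg"))
```

Matching is case-insensitive on the way in, so the table is also where the capitalization gets corrected.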

citeproc-js relies on its input for the semantic dropping/non-dropping
distinction. With two-field input, a particle that precedes the
"family" name element is non-dropping, and one that is attached to the
"given" name with a comma is dropping. Some parsing clutter is used to
cover special cases, such as name suffixes (Jr, III), and particles
that form a fixed part of the family name, and a few cases that have
come up where a particle is capitalized in the input. Apart from those
bits, which are essentially workarounds, we don’t try to interpret
what a given fragment means in its own right.
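
A literal reading of the two-field rule described above can be sketched as follows. This is only an illustration under stated assumptions, not the citeproc-js parser: `split_two_field_name` is a hypothetical helper, a "particle" in the family field is assumed to be any run of leading lowercase words, and the special cases mentioned (suffixes such as Jr or III, fixed or capitalized particles) are ignored.

```python
def split_two_field_name(family: str, given: str) -> dict:
    """Split a two-field name into CSL-style name parts (illustrative only)."""
    result = {}

    # Lowercase words that precede the family name are taken as non-dropping.
    words = family.split()
    particle = []
    while len(words) > 1 and words[0].islower():
        particle.append(words.pop(0))
    if particle:
        result["non-dropping-particle"] = " ".join(particle)
    result["family"] = " ".join(words)

    # Anything attached to the given name after a comma is taken as dropping.
    head, sep, tail = given.partition(",")
    result["given"] = head.strip()
    if sep and tail.strip():
        result["dropping-particle"] = tail.strip()
    return result


print(split_two_field_name("van Gogh", "Vincent"))
print(split_two_field_name("Humboldt", "Alexander, von"))
```

The point of the sketch is that the classification comes entirely from where the particle sits in the input, not from what the particle is.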

Thanks, that also makes sense. When you talk about ‘input’, do you mean both user input and input from repositories? For example, does it also cover the data returned by PubMed or Google Scholar?

I meant just the input that the processor sees in the incoming JSON.
With really proper input, all of these elements are broken out into
separate JSON keys, but two-field name systems are common, so the
processor has a layer that converts the two-field name format
delivered straight from a calling application into the internal form
used for processing.
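
For illustration, here is roughly what names look like once every element has its own key, and why the distinction matters when only the surname is shown. The key names follow the CSL-JSON name fields (`family`, `given`, `dropping-particle`, `non-dropping-particle`); the `short_form` helper is a hypothetical stand-in, not a citeproc-js function.

```python
import json

# Two names with the particle stored under different keys.
humboldt = {"family": "Humboldt", "given": "Alexander", "dropping-particle": "von"}
van_gogh = {"family": "Gogh", "given": "Vincent", "non-dropping-particle": "van"}


def short_form(name: dict) -> str:
    """Surname-only form: the non-dropping particle stays, the dropping one goes."""
    parts = [name.get("non-dropping-particle", ""), name["family"]]
    return " ".join(p for p in parts if p)


print(json.dumps(humboldt))
print(short_form(humboldt))
print(short_form(van_gogh))
```

With input in this shape the processor has nothing to guess: the client has already decided which kind of particle it is.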

I’m not closely involved in the translator layer, but judging from
work on the CiNii site (to get ready for my own return to the world of
actual research, and because it’s one of the few sites with
multilingual metadata to feed MLZ), names can get pretty messy on the
server side. In the best case translators will be able to remangle
names into the form expected by the processor, but where that fails
the user will have to touch things up in their database.

Ah, OK, that makes sense. Papers already splits names into all the fields the processor needs. Indeed, particle detection belongs in the client, not the processor. Interesting that you still handle two-part names as well.

charles