non-dropping particles

This is to proceed with a discussion started on


.

While the CSL schema in its current form seems adequate for dealing with
non-dropping particles in European and Arabic names, I feel some aspects of
interpretation need to be reviewed:

In a nutshell, I argue that “van den”, “al-” and friends are genuine
non-dropping particles, but “La” and possibly a few others are not and are
best seen as parts of a single multipart last name (just like “Van” in
Belgian or American names, e.g., “Van Rompuy”).

The following is copied from


:

Certain names start with non-dropping particles, where “non-dropping” means
these particles have to appear in in-text citations (“van den Keere”,
“al-Hakim”) but may or may not be dropped in a bibliography for sorting
(“al-Hakim, Tawfiq” [sort under “H”], “van den Keere, Pieter” [sort under
“K”]), or sorting and display (“Hakim, Tawfiq al-”, “Keere, Pieter van
den”).

The Chicago Manual clearly recommends the sort-and-display variant (16e:
8.10, 8.14, 16.71, 16.76); that’s why I would argue that all CSL Chicago
styles should switch to demote-non-dropping-particle="display-and-sort".

By contrast, any last name that does not function this way, i.e., where
elements are never removed from the front for purposes of sorting or
display, or in other words, where the last name is always used in one and
the same form only throughout a document, both in text and in a
bibliography, should be parsed as one multipart last name.

For example, I would argue that “La Fontaine” should be understood, contra
the examples given in
http://docs.citationstyles.org/en/stable/specification.html, as one single
multipart last name, since “Fontaine” never seems to be used alone, either
for sorting or for display (I’ve sometimes seen “Fontaine” used as a
cross-reference pointing to “La Fontaine”, but that’s not currently
implemented in CSL anyway).

Parsing such “immutable” last names as multipart last names will most
likely take care of all “potential objections to demoting the particle when
demote-non-dropping-particle=“display-and-sort” is applied for European
name formatting” [fbennett] referred to earlier in this thread.

If this seems acceptable so far, it would also mean that some of
citeproc-js’s parsing rules need to be reviewed, e.g., the one on “La”.
Protecting such names by wrapping them in double quotation marks would
serve as a workaround, of course.

On the other hand, if a genuine need is felt to have more flexibility,
e.g., allowing different settings for demoting individual groups of
non-dropping particles (e.g., “al-” vs. “van den” vs. “La”), we’d have to
discuss an extension of the CSL schema – but currently I don’t really think
that’s necessary.

I searched around a bit, and I agree that “Jean de La Fontaine” might
not be the best example. Better examples might be “Ludwig van
Beethoven” (dropping particle) and “Vincent van Gogh” (non-dropping
particle). Then we get:

Display order with “demote-non-dropping-particle” set to “never” or “sort-only”:
“Beethoven, Ludwig van”
“van Gogh, Vincent”

Display order with “demote-non-dropping-particle” set to “display-and-sort”:
“Beethoven, Ludwig van”
“Gogh, Vincent van”
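
To make the three settings concrete, here is a minimal Python sketch of how a
processor might derive the family-first display form and a sort key from fully
structured name data. The `Name` class and the `attach`, `inverted` and
`sort_key` helpers are hypothetical names made up for illustration; this is
not citeproc-js code, and real sort keys are more involved than a case-folded
string.

```python
from dataclasses import dataclass


@dataclass
class Name:
    given: str
    family: str
    dropping: str = ""       # e.g. "van" in "Ludwig van Beethoven"
    non_dropping: str = ""   # e.g. "van" in "Vincent van Gogh"


def attach(particle: str, family: str) -> str:
    """Prefix a particle to a family name ("van Gogh", "al-Hakim")."""
    if not particle:
        return family
    # Particles ending in a hyphen or apostrophe attach without a space.
    sep = "" if particle.endswith(("-", "'", "’")) else " "
    return particle + sep + family


def inverted(name: Name, demote: str) -> str:
    """Family-first (bibliography) form for a given demote setting."""
    if demote == "display-and-sort":
        head = name.family
        tail = [name.given, name.dropping, name.non_dropping]
    else:  # "never" and "sort-only" display the name the same way
        head = attach(name.non_dropping, name.family)
        tail = [name.given, name.dropping]
    return head + ", " + " ".join(part for part in tail if part)


def sort_key(name: Name, demote: str) -> str:
    """Simplified sort key: only "never" sorts under the particle."""
    if demote == "never":
        return attach(name.non_dropping, name.family).casefold()
    return name.family.casefold()


gogh = Name(given="Vincent", non_dropping="van", family="Gogh")
for setting in ("never", "sort-only", "display-and-sort"):
    print(setting, "|", inverted(gogh, setting), "|", sort_key(gogh, setting))
```

In this sketch “never” and “sort-only” differ only in the sort key, while
“display-and-sort” also moves the non-dropping particle behind the given name.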

As the example above shows, “van” has an ambiguous particle type and
we thus cannot rely on automatic parsing of two-field name fields
(given and family name) like those used in the Zotero UI to identify
particles and assign them as dropping or non-dropping. The CSL spec
currently doesn’t discuss this type of parsing, since it assumes fully
structured metadata. But it’s clear that the particle parsing process
is by far the most opaque aspect of Zotero/CSL’s particle treatment.
I’m really not a fan of protecting names in double quotation marks. I
think the best option would be for the Zotero UI to be more explicit
about particles, e.g. by offering a multi-part name field (given,
dropping particle, non-dropping particle, family, and suffix).

Rintze

Though the dropping particle in Rintze’s example can already be defined
explicitly via the first name field, so it doesn’t undergo any parsing anyway.

I agree with Rintze about a more explicit UI and that may come in the
future (probably not for 5.0). I would still like to have automatic parsing
and have that work correctly 99% of the time. The explicit UI would only be
necessary where automatic parsing fails.

How is a regular Zotero user going to discover that that’s possible, though?

Rintze

For what it’s worth (and it’s not a point that I would press hard in
the face of strong opposition), I’m not a fan of adding fields to the
UI for particle-purposes. I think it would make manual entry a real
pain, and code maintenance would not be fun.

I wonder if a workaround would be to have some toggle switch that would
turn off the parsing for a specific name?

That seems like something we should move to Zotero forums. In any case,
like I said before, the automatic parsing is still useful, so improving
that would be great!

It could be documented >ducks<. Or you could have first-run guidance.
It’s a pretty straightforward distinction, easy to remember once
you’ve been exposed to it.

Things will be a lot easier to document now that the parsing is driven
by a proper per-particle specification. The behavior is much better
defined than it was previously.

Well, other tools are bound to use a two-field setup as well, so there
is some merit in discussing it here.

Some kind of color-coding and/or a tooltip in the name fields showing
how a name is parsed could help as well. Or Zotero could offer a popup
box to help the user format individual names (e.g. accessible via a
right-click menu option). That would create a lot more space to
provide feedback. E.g. it could explain the role of the single-field
and two-field switch, and provide assistance in formatting the
particles (with examples and live preview of particle parsing).

Rintze

Before inviting feedback on a number of questions, here’s my reasoning
again: According to the Chicago Manual of Style, 16e, 8.10, 16.71, “Pieter
van den Keere” needs to appear in the text (leaving capitalisation issues
aside) as “van den Keere” and in the bibliography as “Keere, Pieter van
den”. The same applies for “Tawfiq al-Hakim”: “al-Hakim” and “Hakim, Tawfiq
al-” (CMS 8.14, 16.76). This requires “van den” and “al-” to be entered or
parsed as non-dropping particles, and “demote-non-dropping-particle” to
be set to “display-and-sort”. This in turn requires names such as “La
Fontaine” to be entered/parsed as one multi-part family name rather than
what the CSL specs used to suggest, “La” as non-dropping-particle and
“Fontaine” as family name, or else we’d end up with the incorrect
“Fontaine, Jean de La”. (Parsing “La Fontaine” as one multi-part family
name seems appropriate anyway, since to the best of my knowledge the two
elements of “La Fontaine” are never separated in any circumstances.) This
again requires adjusting citeproc-js’s (and hopefully soon, Zotero’s) name
parsing algorithm.

So my proposal is (1) to set “demote-non-dropping-particle” to
“display-and-sort” in all Chicago styles (and, most likely, other styles,
too), (2) to remove “La” and other strings that aren’t genuine non-dropping
particles from the CSL specs and the list citeproc-js uses for parsing, and
(3) to make citeproc-js’s name parsing algorithm not only field- but also
case-specific. Field-specific means parsing ambiguous strings according to
whether they are found at the front of the family field (-> non-dropping)
or at the end of the given field (-> dropping); citeproc-js can do this.
Case-specific means distinguishing, e.g., “Van” and “van”, and parsing,
e.g., “Van Rompuy” as one multi-part family name, but splitting “van Gogh”
into a non-dropping-particle “van” and a (root) family name “Gogh”. Since I
haven’t been able to find any upper-case elements that would still count
as dropping or non-dropping particles in this scheme, we might even be able
to simplify the parsing algorithm to “lower-case strings at the front of
the family field are parsed as non-dropping particles, lower-case strings
at the end of the given field are parsed as dropping particles”.

Note that even with field- and case-sensitive particle identification there
are still a few strings that are ambiguous, and thus in some cases a name
in the family field still needs to be protected for correct parsing (i.e.,
wrapped in quotes; this is an existing citeproc-js feature):

  • A French “Paul de Man” (“de” = dropping particle) is entered as [Man]
    [Paul de];
  • a Dutch “Paul de Man” (“de” = non-dropping particle) as [de Man] [Paul];
  • but for an American(ised) “Paul de Man” (CMS 8.5, “de” = part of family
    name), the family name will still have to be wrapped in quotes, [“de Man”]
    [Paul], in order to be parsed correctly as one multi-part family name.
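
The simplified rule proposed above, combined with quote protection, could be
sketched roughly as follows in Python. This is only an illustration of the
proposal, not citeproc-js’s actual algorithm: `ParsedName`, `split_family`,
`split_given` and `parse_name` are hypothetical names, the sketch assumes a
two-field (family, given) input, and it only recognises particles separated
by whitespace or a hyphen (so “d’Alembert” is left whole).

```python
import re
from typing import NamedTuple, Tuple


class ParsedName(NamedTuple):
    given: str
    dropping: str
    non_dropping: str
    family: str


# Opening quote -> closing quote accepted for protecting a family name.
QUOTE_PAIRS = {'"': '"', "“": "”"}


def split_family(field: str) -> Tuple[str, str]:
    """Peel lower-case words (or a hyphenated prefix) off the front."""
    particles = []
    rest = field
    while True:
        # A run of letters followed by whitespace or a hyphen, with more
        # text after it, so the family name itself is never consumed.
        m = re.match(r"([^\W\d_]+)(\s+|-)(?=\S)", rest)
        if not m or not m.group(1).islower():
            break
        particles.append(m.group(1) + ("-" if m.group(2) == "-" else ""))
        rest = rest[m.end():]
    return " ".join(particles), rest


def split_given(field: str) -> Tuple[str, str]:
    """Peel lower-case words off the end, keeping at least one word."""
    words = field.split()
    cut = len(words)
    while cut > 1 and words[cut - 1].islower():
        cut -= 1
    return " ".join(words[:cut]), " ".join(words[cut:])


def parse_name(family_field: str, given_field: str) -> ParsedName:
    family_field = family_field.strip()
    closing = QUOTE_PAIRS.get(family_field[:1])
    if closing and len(family_field) > 2 and family_field.endswith(closing):
        # Quoted family names are protected: no particle is split off.
        non_dropping, family = "", family_field[1:-1]
    else:
        non_dropping, family = split_family(family_field)
    given, dropping = split_given(given_field)
    return ParsedName(given, dropping, non_dropping, family)


# The three "Paul de Man" entries from the list above:
print(parse_name("Man", "Paul de"))
print(parse_name("de Man", "Paul"))
print(parse_name('"de Man"', "Paul"))
```

Under these assumptions “Van Rompuy” and “La Fontaine” stay intact without any
quoting, because their first element is upper-case; only lower-case multi-part
family names such as the American “de Man” need protection.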

Now, the questions:

  • Is there anything wrong with this reasoning?
  • Is there anything problematic about these proposals?

And, more specifically:

  • Is anyone aware of style guides or other authoritative sources that would
    call for treating particles, especially non-dropping ones, differently from
    what CMS recommends? (In particular, anything that could not be solved by
    setting “demote-non-dropping-particle” to “sort-only” or “never”? – Would a
    Dutch publication prefer “sort-only”?)
  • Is anyone aware of upper-case name elements that are genuine
    non-dropping particles, i.e., would have to appear as “Bla Doe” in the
    text but as “Doe, Paul Bla” in the bibliography? (All non-dropping
    particles I’ve come across so far are lower-case.)
    • Regarding Arabic names, would anyone ever want to display “Tawfiq
      Al-Hakim” as “Al-Hakim” and “Hakim, Tawfiq Al-”? Or would the use of upper
      case typically indicate that “Al-”/“El-” should be seen as part of the
      family name rather than as a particle, and thus sorted under “A” or “E”?
  • Is anyone aware of upper-case name elements that are genuine dropping
    particles? (All dropping particles I’ve come across so far are lower-case.)
  • Thus, is the rule “Unless it’s part of a family name (and thus wrapped in
    quotes), any lower-case string must be a particle” sound?
  • Is anyone aware of style guides or other authoritative sources that would
    ever call for separating the elements of multi-part family names such as
    “La Fontaine” or “Van Rompuy” for sorting or display? (If there were, I
    fear we’d have to discuss reviewing the CSL specs …)
  • Can anyone provide an example of a real name with both dropping and
    non-dropping particles? (“Jean de La Fontaine” no longer qualifies; “Jean
    de van Gogh”, if it existed, might.)
  • As far as I see, all names with genuine non-dropping particles are of
    Dutch or Arabic origin. Is anyone aware of others?
  • What are your views on allowing the use of non-breaking spaces, like
    [de·Man] [Paul], for protecting multi-part family names from being parsed?
    (Prettier than quotes, but less obvious, and we’d still need the quotes for
    “d’Alembert” or “al-Hakim”, if these were ever found to need protection.)

Finally, though we need a good parsing solution now, of course none of this
should keep us from working on a better UI that could eliminate the need
for this awkward parsing of name fields altogether – though the algorithm
might still be useful for parsing data upon import in the future.