dates

I’m at a bit of a crossroads on a design decision regarding date
handling:

I’ve just recorded (darcs language for committed) some changes to date
handling, which exposed the following problem/issue:

The structure of MODS and of CSL assumes that the publication date for
a journal article is basically a property of the journal, rather than
the article. In fact, I’d need to jump through some technical hoops to
change this, with unclear implications.

So, three options:

  1. stay with existing practice, in which case article dates won’t get
    formatted correctly unless the MODS records encode the data in the
    relatedItem host.

  2. add logic to pull the date from the main level, but otherwise keep
    everything (e.g. CSL) the same

  3. reconsider the logic (e.g. the structure of CSL)

3 is tricky, because even among metadata experts, the issue of which
level the date belongs to is ambiguous. It also may negatively impact
more complex data-handling, such as for original publication and such
(I don’t know; I have yet to think about how to handle that).

2 is a little hackish, and leaves MODS coding for this stuff ambiguous.

Thoughts?

Bruce

Is this a journal-specific issue, or does it apply to all containers
(i.e., the date is the date of the container, rather than the item)?

On May 14, 2005, at 10:49 AM, Bruce D’Arcus wrote:

I’m at a bit of a crossroads on a design decision regarding date
handling:

I’ve just recorded (darcs language for committed) some changes to date
handling, which exposed the following problem/issue:

The structure of MODS and of CSL assumes that the publication date for
a journal article is basically a property of the journal, rather than
the article. In fact, I’d need to jump through some technical hoops
to change this, with unclear implications.

So, three options:

  1. stay with existing practice, in which case article dates won’t get
    formatted correctly unless the MODS records encode the data in the
    relatedItem host.

  2. add logic to pull the date from the main level, but otherwise keep
    everything (e.g. CSL) the same

  3. reconsider the logic (e.g. the structure of CSL)

3 is tricky, because even among metadata experts, the issue of which
level the date belongs to is ambiguous. It also may negatively impact
more complex data-handling, such as for original publication and such
(I don’t know; I have yet to think about how to handle that).

2 is a little hackish, and leaves MODS coding for this stuff ambiguous.

Thoughts?

Bruce


–James
+1 315 395 4056
Details: http://freelancepropaganda.com/jameshowison.vcf

I guess all. It just becomes more apparent, somehow, when dealing with
serials (it seems obvious to code a dateIssued with the book rather
than the chapter).

Bruce

So it gets hairy when we’re talking about something like a web
publication, where the container doesn’t really have a publication date
but the articles do.
Physical production meant that the article necessarily had the same
date as the container; that assumption no longer holds.

What about articles reprinted later in other containers (i.e., collected
volumes of classic papers)? Those would need two dates, wouldn’t they?
(Frankly, I think I’d probably just cite the earlier publication and let
the user figure out that it was more accessible in a reprint. Gosh,
that’s mean!)

–J

That’s a good point. Maybe I ought to add an example or two like that;
say a weblog post.

I guess we’d end up with something like this (not including url, which
I’m too lazy to code!):

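<mods ID="darcusb2005">
   <titleInfo>
      <title>Innovation and Problems of Metadata Modeling</title>
   </titleInfo>
   <name type="personal">
      <namePart type="given">Bruce</namePart>
      <namePart type="family">D'Arcus</namePart>
      <role>
         <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
   </name>
   <typeOfResource>text</typeOfResource>
   <relatedItem type="host">
      <titleInfo>
         <title>darcusblog</title>
      </titleInfo>
      <originInfo>
         <issuance>continuing</issuance>
      </originInfo>
      <genre>weblog</genre>
      <part>
         <date>2005-05-01</date>
      </part>
   </relatedItem>
</mods>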


… or:

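<mods ID="darcusb2005">
   <titleInfo>
      <title>Innovation and Problems of Metadata Modeling</title>
   </titleInfo>
   <name type="personal">
      <namePart type="given">Bruce</namePart>
      <namePart type="family">D'Arcus</namePart>
      <role>
         <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
   </name>
   <originInfo>
      <dateIssued>2005-05-01</dateIssued>
   </originInfo>
   <typeOfResource>text</typeOfResource>
   <relatedItem type="host">
      <titleInfo>
         <title>darcusblog</title>
      </titleInfo>
      <originInfo>
         <issuance>continuing</issuance>
      </originInfo>
      <genre>weblog</genre>
   </relatedItem>
</mods>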

Am not sure which I prefer.

Bruce

The structure of MODS and of CSL assumes that the publication date
for a journal article is basically a property of the journal,
rather than the article.

Does valid MODS assume or require this? I.e., are both of your
examples valid MODS:

Both are valid, and discussion on the MODS list a while back ended up
with no consensus.

Generally, I would prefer the following rule:

All information that’s unique for a given resource should be on that
resource’s own level while information that’s only unique to the
related container item should go into the container item.

IMHO, this assumption should make it a lot easier for developers to
find and associate things within the MODS hierarchy.

Yes, I understand this. It’s just that this is one of the funny places
in bibliographic metadata and citation practice where the proper
“level” is just not clear.

Yes. Having multiple dates on several levels should actually be a
feature of MODS so that it can handle such complexity. Using the
above assumption, the date on top level would describe the date when
the original article was published, while the date in the ‘related >
host’ item is the date of the book volume.

Actually, it gets a little more complicated. IIRC (I’ve not looked at
it in a while) MODS has a way to code a relatedItem that is the original
publication. So you could have still another level.

If no date on top level is given for an article, this would mean that
the article was originally published together with the book/journal.
In other words, if there’s no top level date the top level item would
inherit the date of the container item.

This is sort of what I’m thinking.

I would prefer:

top level:
   title: Snow cover effects on Antarctic sea ice thickness
   name: Ackley S (author)
   [...]
   pages: 16-21
   related > host:
      title: Sea ice properties and processes
      name: Weeks W (editor)
      [...]
      book
      total pages: 299

Note that in the latter example, the book editor is given within
the ‘related > host’ item and not on top level (which is far more
intuitive, IMHO).

Yes, you’ve got it right.

Plus, the top level contains the page range of the
individual article while the ‘related > host’ item contains the total
number of pages for the book.

This is off, though. You’re right to note that this is another awkward
area. Pages are in mods:relatedItem[@type='host']/mods:part.

In truth, this sort of “locator” information is somewhere between the
main level and the container level, and is difficult to represent
cleanly. In CSL, I follow MODS convention, in part because I think it
better reflects formatting practice. I’ve considered changing it,
though.
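
To make the locator convention concrete, here is a minimal Python sketch that reads the page range from the host-level part. It is only an illustration: the MODS v3 namespace declaration, the `extent`/`start`/`end` children, and the function name `page_range` are assumptions that do not appear in the thread, and the sample record just reuses the page numbers from the example above.

```python
import xml.etree.ElementTree as ET

# Assumption: records declare the MODS v3 namespace.
NS = {"m": "http://www.loc.gov/mods/v3"}


def page_range(mods):
    """Return (start, end) from relatedItem[@type='host']/part, or None."""
    extent = mods.find(
        "m:relatedItem[@type='host']/m:part/m:extent[@unit='page']", NS
    )
    if extent is None:
        return None
    start = extent.findtext("m:start", default=None, namespaces=NS)
    end = extent.findtext("m:end", default=None, namespaces=NS)
    return (start, end)


# Hypothetical record: the chapter's pages sit in the host's part element.
chapter = ET.fromstring("""
<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Snow cover effects on Antarctic sea ice thickness</title></titleInfo>
  <relatedItem type="host">
    <titleInfo><title>Sea ice properties and processes</title></titleInfo>
    <part>
      <extent unit="page"><start>16</start><end>21</end></extent>
    </part>
  </relatedItem>
</mods>
""")

print(page_range(chapter))  # ('16', '21')
```

A record that keeps pages on the main level would return None here, which is one way a converter could flag the ambiguity instead of guessing.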

Bruce

The structure of MODS and of CSL assumes that the publication
date for a journal article is basically a property of the
journal, rather than the article.

Does valid MODS assume or require this?

Both are valid, and discussion on the MODS list a while back ended up
with no consensus.

Ok, thanks for the clarification. For refbase we ended up outputting
the date on both levels, main and host.

Having multiple dates on several levels should actually be a
feature of MODS so that it can handle such complexity. Using
the above assumption, the date on top level would describe the
date when the original article was published, while the date in
the ‘related > host’ item is the date of the book volume.

Actually, it gets a little more complicated. IIRC (I’ve not looked
at it in a while) MODS has a way to code a relatedItem that is the
original publication. So you could have still another level.

Hmm, good point. I see that my above assumption may get confusing and
that a proper way of encoding that sort of information (using a
relatedItem for the original publication) may be clearer since it
avoids any ambiguities.

Btw, I see that you used <originInfo><dateIssued> for the top level
but used <part><date> instead for the host level:

<mods ID="darcusb2005">

[…]

   <relatedItem type="host">

[…]

      <part>
         <date>2005-05-01</date>
      </part>
   </relatedItem>
</mods>

… or:

<mods ID="darcusb2005">

[…]

     <originInfo>
        <dateIssued>2005-05-01</dateIssued>
     </originInfo>

[…]

   <relatedItem type="host">

[…]

   </relatedItem>
</mods>

For refbase we use <originInfo><dateIssued> in both cases and use <part>
only for volume/issue/pages information. I assume that is
equally valid? Re-reading the description of MODS elements on the
MODS project page I don’t really understand if there’s any difference
between the two.

I would prefer:

top level:
   title: Snow cover effects on Antarctic sea ice thickness
   name: Ackley S (author)
   [...]
   pages: 16-21
   related > host:
      title: Sea ice properties and processes
      name: Weeks W (editor)
      [...]
      book
      total pages: 299

Note that in the latter example, the book editor is given within
the ‘related > host’ item and not on top level (which is far more
intuitive, IMHO).

Yes, you’ve got it right.

refbase does currently output any book editor(s) on top level (and
bibutils seems to recognize it just fine) but I think we should
modify output similar to the above example.

Plus, the top level contains the page range of the individual
article while the ‘related > host’ item contains the total number
of pages for the book.

This is off, though. You’re right to note that this is another awkward
area. Pages are in mods:relatedItem[@type='host']/mods:part.

Yes, and that’s how refbase currently outputs it. I can live with the
current MODS implementation. Still, specifying page information (which
is unique for a given article) on the top level would be more intuitive
to me.

In truth, this sort of “locator” information is somewhere between
the main level and the container level, and is difficult to
represent cleanly.

Hmm yes, that’s a valid point and I agree that valid arguments can
be made for both levels.

Thanks for your clarifications,

Matthias

Ok, thanks for the clarification. For refbase we ended up outputting
the date on both levels, main and host.

That’s reasonable.

FYI, the XSLT logic for date handling in citeproc is to take the first
present date among a list:

dateIssued (main level)
dateIssued (host level)
date (host level)
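
The same first-match rule can be sketched outside XSLT. The Python below is only an illustration, not the citeproc code: the MODS v3 namespace declaration and the names `DATE_PATHS` and `first_date` are assumptions, while the three lookup paths follow the list above and the two MODS variants discussed earlier in the thread.

```python
import xml.etree.ElementTree as ET

# Assumption: records declare the MODS v3 namespace.
NS = {"m": "http://www.loc.gov/mods/v3"}

# Candidate locations in priority order, mirroring the list above.
DATE_PATHS = [
    "m:originInfo/m:dateIssued",                              # dateIssued (main level)
    "m:relatedItem[@type='host']/m:originInfo/m:dateIssued",  # dateIssued (host level)
    "m:relatedItem[@type='host']/m:part/m:date",              # date (host level)
]


def first_date(mods):
    """Return the text of the first date found, or None if there is none."""
    for path in DATE_PATHS:
        node = mods.find(path, NS)
        if node is not None and node.text and node.text.strip():
            return node.text.strip()
    return None


# Trimmed-down version of the weblog record with the date in the host's part.
record = ET.fromstring("""
<mods xmlns="http://www.loc.gov/mods/v3" ID="darcusb2005">
  <relatedItem type="host">
    <part><date>2005-05-01</date></part>
  </relatedItem>
</mods>
""")

print(first_date(record))  # 2005-05-01
```

With this ordering a main-level date always wins, so a record that carries dates on both levels (as refbase outputs them) formats with the article's own date.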

Btw, I see that you used <originInfo><dateIssued> for the top level
but used <part><date> instead for the host level:

Yes. Again, usage is not quite clear, but I was working with the idea
that the date is specific to the issue of the journal (which is what
part is designed to cover), rather than the journal per se.

For refbase we use <originInfo><dateIssued> in both cases and use <part>
only for volume/issue/pages information. I assume that is
equally valid? Re-reading the description of MODS elements on the
MODS project page I don’t really understand if there’s any difference
between the two.

Your reading of it all is reasonable.

refbase does currently output any book editor(s) on top level (and
bibutils seems to recognize it just fine) but I think we should
modify output similar to the above example.

Yes, dates are not always clear, but I’d say your current practice on
editors is unambiguously wrong :-)

Plus, the top level contains the page range of the individual
article while the ‘related > host’ item contains the total number
of pages for the book.

This is off, though. You’re right to note that this is another awkward
area. Pages are in mods:relatedItem[@type='host']/mods:part.

Yes, and that’s how refbase currently outputs it. I can live with the
current MODS implementation. Still, specifying page information (which
is unique for a given article) on the top level would be more intuitive
to me.

To me it’s awkward in both cases. But if you look at citation practice,
page numbers are grouped with volume and issue numbers, which are
similar to document numbers. So there is some logic to it.

Thanks for your clarifications,

Sure thing!

Bruce

On Mon, 16 May 2005 15:05:18 +0200, “Matthias Steffens” said:

Question:

We all know that bib styles define date formatting. But can we say that
such date formatting is in fact defined by the document format? Put
differently, is it ever the case that a document style mandates that
dates are formatted one way (let’s say “January 12, 2002”), but its
bibliographic entries formatted another (say abbreviated; “Jan. 12,
2002”)? If yes, please provide urls.

This is an important question if I want to consider integrating CSL
logic into OD (and I do!), as OD already has support for general
date internationalization/configuration. It would be nice to be able to
leave that out of bib styling.

Bruce

I have looked up my reference work, Kate L. Turabian, “A Manual for Writers of
Term Papers, Theses, and Dissertations: Sixth Edition”, which is based on the
‘Chicago Manual of Style’, 14th Edition.

The book gives a definite prescription for date formatting:

Section 2.49, Date, Month and Year
"One of two permissible Styles for expressing day, month, and year should be
followed consistently throughout a paper. The first, which omits punctuation,
is preferred:

Have been working again on the RDF schema and have hit on at least a
strategy for handling dates. Example:

sbo:date a owl:DatatypeProperty ;
    rdfs:label "date"@en ;
    rdfs:isDefinedBy sbo: ;
    owl:equivalentProperty dc:date ;
    rdfs:range [ owl:unionOf (xs:dateTime xs:date xs:gYear
                              xs:gYearMonth) ] .

sbo:dateAnnotation a owl:DatatypeProperty ;
    rdfs:label "date annotation"@en ;
    rdfs:isDefinedBy sbo: ;
    rdfs:range xs:string ;
    rdfs:comment "A plain text date string to handle non-normalized dates ('Spring', 'Second Quarter', etc.)."@en .

Another possibility is to define a data type, maybe drawn from RIS. So
we’d have a regular expression that says one can use an optional
extension to standard dates, such that one could do:

<date>2000/Spring</date>

The bit delimited by the slash would basically represent “other date
part.”

Otherwise the dates would use standard xsd datatypes.
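
As a rough illustration of what such a data type implies for consumers, here is a small Python sketch of the slash-delimited form. The pattern, the `BibDate` tuple and `parse_date` are hypothetical names, not part of any schema discussed here, and the sketch only accepts year, year-month and full-date values before the slash (no xs:dateTime).

```python
import re
from typing import NamedTuple, Optional


class BibDate(NamedTuple):
    normalized: str       # e.g. "2000", "2000-05" or "2000-05-01"
    other: Optional[str]  # the "other date part", e.g. "Spring"


# Assumption: a standard date, optionally followed by "/" and free text.
DATE_RE = re.compile(
    r"^(?P<normalized>\d{4}(?:-\d{2}(?:-\d{2})?)?)"
    r"(?:/(?P<other>[^/]+))?$"
)


def parse_date(value):
    """Split a date string into its normalized part and optional extension."""
    match = DATE_RE.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognized date: {value!r}")
    return BibDate(match.group("normalized"), match.group("other"))


print(parse_date("2000/Spring"))
print(parse_date("2005-05-01"))
```

The trade-off is visible in the code: every consumer has to do this bit of string processing, whereas two separate properties would keep the normalized value directly usable.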

As I think about this, I guess I lean towards the latter approach.

Thoughts?

Bruce

I lean toward the former, because I think that we should try to avoid the
need for string processing (it is, after all, XML), but my opinion is not
especially strong.

With the majority of the more critical issues for our next beta solved, I’m
ready to start implementing the Biblio ontology. Is sbo.n3 the normative
version?

Simon

I lean toward the former, because I think that we should try to avoid the
need for string processing (it is, after all, XML), but my opinion is not
especially strong.

I guess to turn it around; how would you – ideally – store it in the
DB? As a single field, or as two (one normalized and one not)? I’ve
seen both approaches used in bib apps.

The modeling problem introduced with two fields is, how do you
associate them? Do we assume issued, accessed, updated are always
normalized, but that date alone can be annotated with a plain text
extension?

The alternative is to have a full date class; e.g.:

2004 02 Spring

… feels like overkill.

With the majority of the more critical issues for our next beta solved, I’m
ready to start implementing the Biblio ontology. Is sbo.n3 the normative
version?

Yeah, though I’ll emphasize that – as with CSL – comments are
welcome. Also, I’ve not yet included the contribution stuff. Not sure
how much we want or need a full – in a single namespace – schema,
but will probably add that back, and then bring the RELAX NG schema in
line.

I mentioned to Dan that the hardest part of the schema is designing a
smart but nicely extensible class model for the references. The
properties are easier.

BTW, I’ve been struggling with trying to find a catchy name and
acronym for the thing. Current experiment is “Description of Citation
Sources”, or DOCS. E.g. docs:title and such.

Ugh … am really not good with naming stuff.

Bruce

I lean toward the former, because I think that we should try to avoid the
need for string processing (it is, after all, XML), but my opinion is not
especially strong.

I guess to turn it around; how would you – ideally – store it in the
DB? As a single field, or as two (one normalized and one not)? I’ve
seen both approaches used in bib apps.

Well, in Zotero, we store it completely un-normalized, but we can extract
data on the fly very easily. I realize that, while this makes sense for us,
it probably isn’t a good idea in a schema.

The modeling problem introduced with two fields is, how do you
associate them? Do we assume issued, accessed, updated are always
normalized, but that date alone can be annotated with a plain text
extension?

The alternative is to have a full date class; e.g.:

2004 02 Spring

… feels like overkill.

Ugh. Yes, this is certainly a problem. I suppose that’s a good reason to go
with the RIS style.

With the majority of the more critical issues for our next beta solved, I’m
ready to start implementing the Biblio ontology. Is sbo.n3 the normative
version?

Yeah, though I’ll emphasize that – as with CSL – comments are
welcome. Also, I’ve not yet included the contribution stuff. Not sure
how much we want or need a full – in a single namespace – schema,
but will probably add that back, and then bring the RELAX NG schema in
line.

I mentioned to Dan that the hardest part of the schema is designing a
smart but nicely extensible class model for the references. The
properties are easier.

Yes. We had this question, too. We finally decided that custom item types
could have their own sets of properties, but, for export and citation
purposes, they need to be based off of some existing class. I encourage you
to model a similar structure in Biblio, as it will make things easier for
applications that have to figure out how to handle classes not defined in
the core schema.

I mentioned to Dan that the hardest part of the schema is designing a
smart but nicely extensible class model for the references. The
properties are easier.

Yes. We had this question, too. We finally decided that custom item
types could have their own sets of properties, but, for export and
citation purposes, they need to be based off of some existing class.

Exactly.

I encourage you to model a similar structure in Biblio, as it will
make things easier for applications that have to figure out how to
handle classes not defined in the core schema.

One thing to pay attention to is that the class structure is (and has
always been) hierarchical. So:

Document
	Book
		EditedBook
	Article
		JournalArticle
	Image
		Diagram

… and such.

One of the reasons for this design is that it gives that nice balance
of flexibility and structure.

I submitted a report for Zotero in which I wanted to store the press
release for Bush’s latest speech the other day. Having that “Document”
fallback allows me to capture that, and it’s not a big deal that I
don’t have a specific “PressRelease” subclass. It’s actually not
uncommon that I cite press releases, but I’m not really sure they
should have their own owl:Class, because then we might end up with a
hundred of them!
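
The fallback idea can be sketched in a few lines of Python. This is only an illustration of walking up the hierarchy until a class the application knows is found; the `PARENTS` table copies the classes listed above, while `nearest_supported` and the application-defined "PressRelease" type are hypothetical.

```python
# Child -> parent links for the hierarchy listed above.
PARENTS = {
    "Document": None,
    "Book": "Document",
    "EditedBook": "Book",
    "Article": "Document",
    "JournalArticle": "Article",
    "Image": "Document",
    "Diagram": "Image",
}


def nearest_supported(type_name, supported, parents=PARENTS):
    """Walk up from type_name and return the first class in `supported`."""
    current = type_name
    while current is not None:
        if current in supported:
            return current
        current = parents.get(current)
    return None


# Hypothetical custom type that an application bases on Document.
custom = dict(PARENTS, PressRelease="Document")

# A formatter that only knows how to cite these two classes:
known = {"Document", "Book"}

print(nearest_supported("PressRelease", known, custom))  # Document
print(nearest_supported("EditedBook", known))            # Book
```

An exporter or citation formatter can then treat any unknown subclass as its nearest known ancestor, which is what keeps the core schema small.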

So the trick is in designing this hierarchy right. Most of the reference
classes in fact descend from “Document” (which itself references
foaf:Document), but there are some that are tricky (Interview, etc.).

Bruce