Embedding citation-specific metadata in PDF files

digi-libris Reader (http://digi-libris.com) can export all metadata of an
object, including CSL-specific variables (those that cannot be mapped 1:1 to
Dublin Core Terms, e.g. pageRange, event, genre etc.), as an XMP sidecar file
which can be imported into PDF files (with Acrobat.exe) and which other
software might be able to read.

Until now we have stored these CSL variables as attribute/value pairs under
custom entries which appear in Acrobat.exe under /File >> Properties >>
Additional Metadata >> Advanced/ and are stored in the XMP file as

<rdf:Description rdf:about="" xmlns:pdfx="http://ns.adobe.com/pdfx/1.3/">
<pdfx:citation_pageRange>7-9</pdfx:citation_pageRange>
</rdf:Description>

I am now considering changing this to include a proper CSL namespace, which
could look like

<rdf:Description rdf:about="" xmlns:cs="http://purl.org/net/xbiblio/csl/">
<cs:pageRange>7-9</cs:pageRange>
</rdf:Description>

but unfortunately this URL returns a 404 error or automatically re-directs
you to http://citationstyles.org. No way to see a list of variables.

What do you recommend?

1. stay with pdfx
2. change to xbiblio, even though the latter does not reveal a valid
namespace
3. register a new domain with purl.org (under purl.org/digi/csl/ or
similar)
4. as no. 3 above, but use a proprietary prefix (e.g. digicita: or similar)
View this message in context: http://xbiblio-devel.2463403.n2.nabble.com/Embedding-citation-specific-metadata-in-PDF-files-tp7579100.html
Sent from the xbiblio-devel mailing list archive at Nabble.com.

Since nobody is responding, my two cents: I would pick option 4 for now.

Rintze

You could do option 5: use bibo?

http://bibliontology.com/

Where are these CSL-specific variables coming from? I don’t see any CSL
spec (neither CSL documentation, nor CSL JSON format) defining pageRange.

Namespaces are not required to resolve to a valid page (I agree that it may
be useful though). For all intents and purposes they’re just some
globally-unique string.
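
To illustrate the point, here is a minimal Python sketch using the standard library's ElementTree. The XMP fragment is a made-up sample built around the cs:pageRange example from the question; the parser never fetches the namespace URL, it only compares the URI string.

```python
import xml.etree.ElementTree as ET

# Made-up sidecar fragment for illustration; the namespace URI is the one
# from the question and is never fetched by the parser.
XMP = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="" xmlns:cs="http://purl.org/net/xbiblio/csl/">
    <cs:pageRange>7-9</cs:pageRange>
  </rdf:Description>
</rdf:RDF>"""

CSL_NS = "http://purl.org/net/xbiblio/csl/"

root = ET.fromstring(XMP)
# ElementTree expands prefixes to "{namespace-uri}localname", so the lookup
# depends only on the URI string, not on the prefix or on the URL resolving.
page = root.find(f".//{{{CSL_NS}}}pageRange")
print(page.text)  # prints 7-9
```

Any other software reading the file would match on the same URI string, which is why the choice of URI matters more than whether it resolves.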

Have you considered mapping to PRISM as well? [1] That fills in a
number of gaps in Dublin Core and is already in use by several
publishers. Mendeley will read PRISM metadata from PDFs in addition to
Dublin Core and I think Papers does as well. I’m not sure if Zotero
can?

[1] http://www.prismstandard.org/specifications/2.1/PRISM_prism_namespace_2.1.pdf

PRISM 3.0
http://www.idealliance.org/specifications/prism-metadata-initiative/prism/specifications/prism-30-spec
has been published as well, though Zotero (I can’t speak for other
managers) does not yet recognize the new spec/namespace, but we’ll get
there soon. In either case, Zotero does not read metadata directly from
PDFs, because, from what we’ve seen, the metadata is very unreliable
(though this may change in the future).

Thanks Rintze for relaunching the debate

I have adopted option four as follows:
Citation-relevant variables will be stored on export in XMP sidecar files as

<rdf:Description rdf:about="" xmlns:cs="http://purl.org/digilib/cita/">
<citation:title>Embedded Metadata add Value to Scientific
Publications</citation:title>
<citation:number-of-pages>3</citation:number-of-pages>
<citation:original-publisher-place>Geneva</citation:original-publisher-place>

</rdf:Description>

and those which cannot be mapped to DC will also be carried under pdfx as
custom attribute/value pairs with the ‘citation_’ prefix, same as already in
use in many HTML files.

to aurimas: You are absolutely right, ‘pagerange’ is not in the CSL
specification. It is a convenience variable I have used, but it exports as
‘page’ and not as ‘pagerange’. My fault, sorry for the misleading typo.

to robert: I'd be happy to include a bridge for PRISM variables if this is a
widely used standard. Just show me a mapping list and the purl.org entry to
use.

I’m not sure if there is an existing purl.org entry. The example at
http://www.prismstandard.org/resources/mod_prism.html uses a
prismstandard.org URL for the namespace. There is a PURL
‘/rss/1.0/modules/prism/’ which points to the aforementioned
mod_prism.html resource, but do you want one which points to the
namespace?

I don’t have a list of PRISM -> CSL mappings directly to hand, but the
fields that we recognize, and which I believe map straightforwardly to CSL
in most cases, are:

“prism:aggregationType”, “prism:copyright”, “prism:doi”, “prism:edition”,
“prism:endingPage”, “prism:genre”, “prism:issn”, “prism:issueIdentifier”,
“prism:issueName”, “prism:keyword”, “prism:location”, “prism:number”,
“prism:organization”, “prism:pageRange”, “prism:publicationDate”,
“prism:publicationName”, “prism:section”, “prism:startingPage”,
“prism:volume”, “prism:url”
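
As a starting point for such a bridge, a lookup table along these lines might work. This is only a sketch: the pairings below are my own assumptions about which CSL variable each PRISM field corresponds to, not a mapping published by PRISM, CSL or Mendeley, and the sample values are made up.

```python
# Hypothetical PRISM -> CSL lookup table; these pairings are assumptions
# for illustration and should be checked against both specifications.
PRISM_TO_CSL = {
    "prism:doi": "DOI",
    "prism:issn": "ISSN",
    "prism:url": "URL",
    "prism:volume": "volume",
    "prism:number": "issue",
    "prism:edition": "edition",
    "prism:genre": "genre",
    "prism:section": "section",
    "prism:keyword": "keyword",
    "prism:pageRange": "page",
    "prism:publicationName": "container-title",
}

def prism_to_csl(prism_fields):
    """Split PRISM fields into mapped CSL variables and unmapped leftovers."""
    mapped, unmapped = {}, {}
    for name, value in prism_fields.items():
        if name in PRISM_TO_CSL:
            mapped[PRISM_TO_CSL[name]] = value
        else:
            unmapped[name] = value
    return mapped, unmapped

# Made-up sample record; the unmapped field would become a custom variable.
mapped, unmapped = prism_to_csl({
    "prism:doi": "10.1000/example",
    "prism:pageRange": "7-9",
    "prism:aggregationType": "journal",
})
```

Fields without an entry in the table are returned separately, so they can be handled as custom variables instead of being dropped.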

In either case, Zotero does not read metadata directly from PDFs, because, from what we’ve seen,
the metadata is very unreliable (though this may change in the future).

The main problem we observed was that the same Dublin Core fields that
are used for article metadata are also filled in by PDF generation
software using generic defaults - for example the filename of the
source document (Word, LaTeX etc.) as dc:title and the name of the
software that created the PDF as dc:creator.

In Mendeley we apply some simple heuristics based on a comparison of
the metadata with the actual content of the first few pages of the PDF
to decide whether or not to use that metadata.

The presence of PRISM fields is also a useful indicator since they are
more domain specific and less likely to be populated with other data
than the DC fields.
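
A very rough sketch of that kind of check, in Python. This is not Mendeley's actual heuristic; the function names, the three-word threshold and the plain substring test are all assumptions made for illustration.

```python
import re

def normalize(text):
    """Lowercase and collapse every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def title_looks_trustworthy(dc_title, first_pages_text, min_words=3):
    """Accept dc:title only if it is reasonably long and occurs in the text.

    Hypothetical rule: a title shorter than min_words cannot be verified
    (e.g. a bare filename), and a longer one must appear in the extracted
    text of the first pages after normalization.
    """
    title = normalize(dc_title)
    if len(title.split()) < min_words:
        return False
    return title in normalize(first_pages_text)

# Made-up first-page text; the line break inside the title is deliberate.
first_pages = "Embedded Metadata add Value to\nScientific Publications\n\nAbstract"

ok = title_looks_trustworthy(
    "Embedded Metadata add Value to Scientific Publications", first_pages)
generic = title_looks_trustworthy("Microsoft Word - draft_v2.doc", first_pages)
```

Normalizing both sides first means line breaks and punctuation in the extracted page text do not prevent a match.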

Probably just a typo, but your namespace declaration doesn’t match the
prefix you are using.
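
This is more than cosmetic: an XML parser will reject an element whose prefix has no matching declaration. A small Python check with the standard library's ElementTree shows the difference; the fragment is a cut-down version of the snippet in question, and the helper name is made up.

```python
import xml.etree.ElementTree as ET

RDF = 'xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"'
NS = "http://purl.org/digilib/cita/"

def parses(declared_prefix):
    """Return True if a fragment using the citation: prefix is well-formed."""
    xml = (
        f'<rdf:Description {RDF} rdf:about="" xmlns:{declared_prefix}="{NS}">'
        "<citation:number-of-pages>3</citation:number-of-pages>"
        "</rdf:Description>"
    )
    try:
        ET.fromstring(xml)
        return True
    except ET.ParseError:
        # The parser reports an unbound prefix for citation:
        return False

declared_cs = parses("cs")              # declaration does not bind citation:
declared_citation = parses("citation")  # declaration matches the prefix in use
```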

More importantly, if the idea behind using this namespace URI is to offer
interoperability between software, then I’m not sure this is helpful. I’ve
never seen the http://purl.org/digilib/cita/ namespace used in this context
(though it may be) and I can’t find any documentation for it. Can anyone
point to a reference?

I think your best choice for interoperability would be to use a common,
rich vocabulary, like PRISM, with a namespace URI that is official/widely
used (e.g. http://prismstandard.org/namespaces/basic/2.1/ or a different
version).

If you don’t care about interoperability, then I guess it doesn’t matter at
all.

Aurimas

(In reply to the message from “johnmie” of Jul 14, 2014, 4:21 AM.)


Actually, it now reads <rdf:Description rdf:about=""
xmlns:citation="http://purl.org/digilib/cita/">; cs came from a previous
test version.

The purl.org/digilib/cita/ link is new and the updated version of
digi-libris reader has not yet been uploaded. This is why you have not yet
seen it anywhere. Remember I had asked the original question only a few days
ago and did not get any meaningful suggestions until Rintze re-launched the
debate.

If you click on our purl link you will be directed to our Citation Variables
appendix which documents all the variables we use and how CSL variables are
mapped.

The prismstandard link points to an errata page which in turn returns 404.
But from what I have seen on another prism page, many of the variables do
not map 1:1 to either CSL or DC and will be treated as
custom variables.

john m.

OK, so you’re basically establishing a new schema, which is highly parallel
to the variables used in CSL documentation (those are effectively not
under the http://purl.org/net/xbiblio/csl namespace, because they are never
used in QNames in that context). The only thing that is not entirely clear
to me from the documentation is which verbs (in the RDF sense) fall under
the “http://purl.org/digilib/cita/” namespace. Is it all of the listed terms
under “Variables used in CSL styles”, except for the ones marked with * and
** (since those are mapped to the dc/dcterms namespaces)? Or is it all of the
terms on that page? IMO, since you’re establishing a new namespace anyway,
it would make sense to add all of the CSL variables to this namespace with
no exceptions (you’re not forced to use this in digibib export anyway). I
would also go with a namespace URI that makes this relationship clear (e.g.
“http://purl.org/net/xbiblio/csl-vars#”,
“http://purl.org/net/digibib/csl-vars#” or something similar).

digi-libris distinguishes between 5 types of variables:
1. those that can be mapped to dc or dcterms,
2. those that can be mapped to CSL and dc/dcterms,
3. those that can only be mapped to CSL,
4. those imported and re-mappable such as ris, bib, MARC21, Prism
(planned) etc.,
5. those that cannot be mapped to any of the above (which become
custom attribute/value pairs).

On export, all CSL variables are included under the digilib namespace.
In addition, those also available as dc or dcterms are duplicated under
the respective namespace; all other CSL variables are duplicated as custom
attribute/value pairs under pdfx with the ‘citation_’ prefix.
Thus they are visible even to individuals who may not have
adequate software to read all of the embedded metadata.
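
The export rule described above could be sketched roughly as follows in Python. This is not digi-libris code: the function name is made up, and the two-entry CSL-to-dc table is a placeholder assumption standing in for the full list in the Citation Variables appendix.

```python
# Placeholder subset of CSL variables with a dc/dcterms equivalent; an
# assumption for illustration, not the actual digi-libris mapping.
CSL_TO_DC = {"title": "dc:title", "publisher": "dc:publisher"}

def export_fields(csl_vars):
    """Return (qualified_name, value) pairs following the rule above."""
    out = []
    for name, value in csl_vars.items():
        # Every CSL variable is written under the digilib namespace.
        out.append((f"citation:{name}", value))
        if name in CSL_TO_DC:
            # Variables with a dc/dcterms equivalent are duplicated there.
            out.append((CSL_TO_DC[name], value))
        else:
            # Everything else is duplicated as a pdfx custom pair.
            out.append((f"pdfx:citation_{name}", value))
    return out

fields = export_fields({"title": "T", "number-of-pages": "3"})
```

Each variable therefore appears exactly twice in the output: once under the citation: prefix and once under either dc/dcterms or pdfx.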