As I’ve worked on MODS Tight (the MODS subset schema), I’ve come across
an unfortunate restriction in RELAX NG.* The practical consequence is
that I cannot both enforce the presence of certain elements (and how
many of them there are) and leave their order flexible.
So if we take a simple example – mods:originInfo and mods:genre – I
can offer two choices:
- I can enforce exactly one occurrence of each, but I must constrain
the order. In RNC, this looks something like:
Genre,
Origins
- I can leave order flexible, but lose all other control. In RNC:
(Genre|Origins)*
In practice, the design choice means that under example 1, the
following would be invalid:
<originInfo>
....
</originInfo>
<genre>whatever</genre>
In example 2, this would be valid:
<originInfo>
....
</originInfo>
<originInfo>
....
</originInfo>
<originInfo>
....
</originInfo>
<originInfo>
....
</originInfo>
Note missing genre element and spurious duplicate originInfo elements.
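To make the difference concrete, here is a minimal Python sketch (not part of the schema) that models the two content models as regular expressions over child element names. The helper names `names` and `valid` are made up for illustration, and real RELAX NG validation does far more than this.

```python
import re

def names(children):
    """Flatten a list of child element names into one space-terminated string."""
    return "".join(name + " " for name in children)

# Model of "Genre, Origins": exactly one of each, in a fixed order.
ORDERED = re.compile(r"genre originInfo ")

# Model of "(Genre|Origins)*": any number of either, in any order.
ANY_ORDER = re.compile(r"(?:genre |originInfo )*")

def valid(model, children):
    """True if the whole sequence of child names matches the model."""
    return model.fullmatch(names(children)) is not None

# The reversed pair from example 1, and the four originInfo elements from example 2.
reversed_pair = ["originInfo", "genre"]
four_origins = ["originInfo"] * 4

print(valid(ORDERED, reversed_pair))    # rejected by the ordered model
print(valid(ANY_ORDER, reversed_pair))  # accepted by the flexible model
print(valid(ANY_ORDER, four_origins))   # also accepted, despite the missing genre
```

The flexible model accepts the reversed pair, but it also accepts the sequence with no genre at all, which is exactly the loss of control described above.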
I could also do something more hybrid, like:
Genre,
(Name, Title)+,
Origins,
Container
...
Thoughts?
Bruce
- Well, it’s not as if any other schema language supports this
functionality. It’s just that when you have a lot of power, you notice
the limits! The limitation is partly in the design of MODS, not just a
restriction in RNG. If the LoC had not relied on attributes so much, I
could do what I want.
RNG supports a feature called interleave. However, you cannot
interleave elements with the same name. Because MODS encodes a lot (too
much!) of its semantics in attributes, the same element name recurs
with different attribute values, and this becomes the problem.
- I can leave order flexible, but lose all other control. In RNC:
(Genre|Origins)*
So there is no general quantifier syntax? Like (Genre|Origins){0,1}
That’s a shame and an odd oversight. I wonder why.
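In practice, the design choice means that under example 1, the
following would be invalid:
<originInfo>
....
</originInfo>
<genre>whatever</genre>
In example 2, this would be valid:
<originInfo>
....
</originInfo>
<originInfo>
....
</originInfo>
<originInfo>
....
</originInfo>
<originInfo>
....
</originInfo>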
Note missing genre element and spurious duplicate originInfo elements.
I think the first is better. It is definitely better from a recovery
point of view. If someone, or more likely a broken piece of software,
churns out the first, then the error, which is only one of order, can
be fixed without ambiguity (just change the order), but the second
can’t be fixed without ambiguity: which one is the right one?
I could also do something more hybrid, like:
Genre,
(Name, Title)+,
What does the + mean as opposed to the * ?
–James
- I can leave order flexible, but lose all other control. In RNC:
(Genre|Origins)*
So there is no general quantifier syntax? Like (Genre|Origins){0,1}
That’s a shame and an odd oversight. I wonder why.
Don’t worry; it’s there. That convention (which predates RNG) is a way
to say that order is unimportant, but it’s a hack because the only way
to do that is to say you can have zero or more of a choice of x
patterns.
RNG supports interleave, which would be:
Genre & Origins
There I’m saying there must be one of each of these two patterns, but
their order is unimportant.
It’s just that the algorithms to validate interleave can get
complicated (which, I guess, is why XML Schema doesn’t support it), so
they added a restriction (one that they supposedly hope to remove
later).
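As a rough illustration of what an interleave such as Genre & Origins accepts, here is a small Python sketch. It only covers the simple case where each operand is a single element, and the function name `interleave_ok` is invented for this example; it is not how a RELAX NG validator is implemented.

```python
from itertools import permutations

def interleave_ok(required, children):
    """True if children is some ordering of the required names, each exactly once.

    This models only an interleave of single-element patterns, e.g. A & B.
    """
    return tuple(children) in set(permutations(required))

required = ["genre", "originInfo"]

print(interleave_ok(required, ["genre", "originInfo"]))  # either order is fine
print(interleave_ok(required, ["originInfo", "genre"]))
print(interleave_ok(required, ["originInfo"] * 4))       # duplicates and missing genre fail
```

Unlike the zero-or-more choice, this keeps the "exactly one of each" constraint while dropping the ordering constraint.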
I think the first is better. It is definitely better from a recovery
point of view. If someone, or more likely a broken piece of software,
churns out the first, then the error, which is only one of order, can
be fixed without ambiguity (just change the order), but the second
can’t be fixed without ambiguity.
That’s my preference.
I could also do something more hybrid, like:
Genre,
(Name, Title)+,
What does the + mean as opposed to the * ?
It’s standard regular expression syntax:
* = zero or more
+ = one or more
? = zero or one
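The same three quantifiers exist in Python's `re` module, so a quick sketch can show how they differ on a repeated (Name, Title) pair. Encoding each child element name as a space-terminated token is just a convention chosen for this example.

```python
import re

# Each child element name is written as a token followed by a space.
PAIR_STAR = re.compile(r"(?:name title )*")  # zero or more pairs
PAIR_PLUS = re.compile(r"(?:name title )+")  # one or more pairs
PAIR_OPT = re.compile(r"(?:name title )?")   # zero or one pair

def matches(model, text):
    """True if the whole string matches the model."""
    return model.fullmatch(text) is not None

no_pairs = ""
two_pairs = "name title name title "

for label, model in [("*", PAIR_STAR), ("+", PAIR_PLUS), ("?", PAIR_OPT)]:
    print(label, matches(model, no_pairs), matches(model, two_pairs))
```

Only `+` rejects the empty sequence, and only `?` rejects the doubled pair.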
Bruce
Followup:
I’ve been working on the schema some more. Initially, I was enforcing
genre terms on each level, placing those terms in the xbib authority,
and using it all to drive validation.
I’ve just made changes that keep the rigorous structure intact, but
which use MARC genre terms and the authority attribute where relevant.
It results in an example like this:
article
2000-03
periodical
Some Magazine
continuing
23
The schema then looks like this:
genre-part-inSerial =
  element genre {
    Genres-part-inSerial-xbib | Genres-part-inSerial-marc
  }
Genres-part-inSerial-marc = attribute authority { "marc" }?,
  ("legislation"
   | "patent"
   | "reporting")
Genres-part-inSerial-xbib = attribute authority { "xbib" }?,
  ("abstract"
   | "article"
   | "editorial"
   | "legal article"
   | "legal case"
   | "letter to the editor"
   | "patent"
   | "review")
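For anyone who wants to check values by hand, here is a rough Python sketch of what the genre pattern above accepts. It assumes the value lists are alternatives and that the authority attribute is optional in both branches; the function name `genre_ok` is invented for this example, and the sketch ignores the whitespace normalization a real validator would apply.

```python
MARC_TERMS = {"legislation", "patent", "reporting"}
XBIB_TERMS = {
    "abstract", "article", "editorial", "legal article",
    "legal case", "letter to the editor", "patent", "review",
}

def genre_ok(value, authority=None):
    """True if a genre value is allowed for the given authority attribute.

    With no authority attribute, either branch of the choice may match.
    """
    if authority is None:
        return value in MARC_TERMS or value in XBIB_TERMS
    if authority == "marc":
        return value in MARC_TERMS
    if authority == "xbib":
        return value in XBIB_TERMS
    return False  # any other authority value is not part of the pattern

print(genre_ok("article", "xbib"))  # article is an xbib term
print(genre_ok("article", "marc"))  # but not a marc term
print(genre_ok("patent", "marc"))   # patent appears in both lists
```

One thing the sketch makes visible is that "patent" is valid under either authority, so the authority attribute is the only thing that distinguishes the two.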
Thoughts?
Note: in a couple places I trivially modified the marc term and moved
it to xbib. For example, “legal case and case notes” is too long, so I
shortened it to just “legal case.”
Bruce