SRU XSLT support

OK, am working on SRU support on my end (Matthias having been working
on it on his). Here’s what I’m looking at:

  <xsl:when test="$bibdb='sru'">
    <xsl:copy-of
      select="doc(concat($server_url,
      'version=1.1&amp;query=bib.citekey+any+',
      $citekeys,
      '&amp;operation=searchRetrieve&amp;recordSchema=mods&amp;recordPacking=xml&amp;startRecord=1&amp;maximumRecords=9999',
      $authentication))"/>
  </xsl:when>

So the citekeys variable is the same as the existing one, and I’m
thinking I need to add two more: server_url (for the base url) and
authentication.

Any thoughts? I want to keep things flexible, but also easy to handle
(and so simple).

Bruce

As we’ve discussed earlier, I think that individual cite keys must be
enclosed by anchors to provide for exact field matches if the 'any’
relation is used:

 "^Smith1992a^ ^Smith1992b^ ^Mitchell1995a^"

Matthias

I don’t think it makes any difference, actually.

any, all and adjacency are word relations. If all you have in the field
is a single word, then it will act like an exact equality relation.

If you have a cite key with a space, you’ll still end up with potentially
incorrect results from:

 foo all "^the first key^  ^a second key^"

which means:

 foo = ^the and foo = first and foo = key^ and foo = ^a and foo =
 second and foo = key^

Smith1992a will not match Smith1992abcdef. The word anchors only say
’this word must be at the beginning or end of the field’

Rob

   ,'/:.          Dr Robert Sanderson (@Dr_Robert_Sanderson)
 ,'-/::::.        http://www.csc.liv.ac.uk/~azaroth/

,’–/::(@)::. Dept. of Computer Science, Room 805
,’—/::::::::::. University of Liverpool
____/:::::::::::::.
I L L U M I N A T I Cheshire3 IR System: http://www.cheshire3.org/
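
The expansion Rob describes above can be sketched in a few lines of Python. This is only an illustration of the word-relation idea, assuming the term is split on whitespace; `expand_word_relation` is a hypothetical helper, not part of any CQL library.

```python
def expand_word_relation(index, relation, term):
    """Rewrite `index any/all "w1 w2 ..."` as explicit single-word clauses."""
    # 'any' joins the per-word clauses with 'or', 'all' joins them with 'and'.
    boolean = {"any": "or", "all": "and"}[relation]
    # Assumption: words are whatever whitespace splitting yields, so the
    # anchors (^) stay attached to the first and last word of each key.
    words = term.split()
    return f" {boolean} ".join(f"{index} = {word}" for word in words)


print(expand_word_relation("foo", "all", "^the first key^  ^a second key^"))
print(expand_word_relation("bib.citekey", "any", "Smith1992 Mitchell1995"))
```

The first call reproduces the six-clause expansion quoted above, which shows why anchors cannot keep a multi-word cite key together inside a word relation.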

>> As we’ve discussed earlier, I think that individual cite keys must be
>> enclosed by anchors to provide for exact field matches if the ‘any’
>> relation is used:
>>
>>  "^Smith1992a^ ^Smith1992b^ ^Mitchell1995a^"
>
> I don’t think it makes any difference, actually.
>
> any, all and adjacency are word relations. If all you have in the field
> is a single word, then it will act like an exact equality relation.

True, but what if your query contains cite keys that would match
multiple keys in your database, like:

any “Smith1992 Mitchell1995”

and you have the following cite keys in your database:

“Smith1992”
“Smith1992Univariate”
“Smith1992Multiple”
“Mitchell1995”
“JeffriesMitchell1995”

Without the anchors, the query keys won’t be unique. That’s a serious
problem since we can’t (and shouldn’t) make assumptions about other
people’s cite key syntax.

> If you have a cite key with a space, you’ll still end up with
> potentially incorrect results from:
>
>  foo all "^the first key^  ^a second key^"
>
> which means:
>
>  foo = ^the and foo = first and foo = key^ and foo = ^a and
>  foo = second and foo = key^

Ok, I see. Btw, for me it’s one of the most confusing things in CQL
that spaces mean completely different things depending on context
and on which relation is used. That concept is pretty different
from all other search languages I’ve come across so far, and I find
it hard to grasp. Of course, that doesn’t mean it’s bad ;-), it’s
just easy to get trapped by that.

I think that when querying for multiple cite keys we should not use
the ‘any’ relation then but multiple ‘exact’ statements connected
with ‘and’ instead. (which isn’t as smart as using ‘any “…”’ since
it gets pretty wordy :-/)

This problem was also the reason why I was asking for an ‘anyexact’
relation which would ease things for us quite a bit, IMHO.

Matthias

Maybe I’m not understanding it correctly, here. From the CQL
information on the SRW web site I learned that:

bib.citekey any “Smith1992 Mitchell1995”

would resolve to:

bib.citekey=“Smith1992” and bib.citekey=“Mitchell1995”

and that the equals sign means “contains”. That, in turn, means that
other cite keys which contain the search term would be matched as
well.

If search terms used with the ‘any’ relation only match whole
words, then the explanations on the web site are somewhat misleading,
IMHO.

And how does CQL define a word? What about international characters
or a hyphen (’-’)?

Thanks, Matthias

> Maybe I’m not understanding it correctly, here. From the CQL
> information on the SRW web site I learned that:
>
> bib.citekey any “Smith1992 Mitchell1995”
>
> would resolve to:
>
> bib.citekey=“Smith1992” and bib.citekey=“Mitchell1995”

Not quite - try:

bib.citekey = “Smith1992” or bib.citekey = “Mitchell1995”

> and that the equals sign means “contains”. That, in turn, means that
> other cite keys which contain the search term would be matched as
> well.

I don’t think so, from the SRW/CQL pages:

“= is used:
For word adjacency, when the term is a list of words. That is to say
that the words appear in that order with no others intervening.
Otherwise, for exact equality of value”

So Smith1992aabb would not be a match for

bib.citekey = “Smith1992” or bib.citekey = “Mitchell1995”

Or

bib.citekey any “Smith1992 Mitchell1995”

For Smith1992aabb to be a match you need the stem modifier

So

bib.citekey =/stem “Smith1992” or bib.citekey =/stem “Mitchell1995”

Or

bib.citekey any/stem “Smith1992 Mitchell1995”

> If search terms used with the ‘any’ relation only match whole
> words, then the explanations on the web site are somewhat misleading,

Could you point out where the misleading bits are? As far as we are
aware it is fairly clear, but we are always open to
corrections/amendments/clarifications etc!

Matthew Dovey
(Technical Editor - SRW)
Oxford University

Matthias – are you still on digest? If yes, it might be good to
change that for these kinds of discussions.

Anyway, if someone could settle the syntax I should be using, that’d be
great. I was so far assuming:

bib:citekey+any+"^Smith1992a^+^Smith1992b^"

The anchors are trivial to add of course.

So, any issues with my other decisions?

Bruce

Date: Mon, 30 May 2005 12:38:50 +0200
From: Matthias Steffens <@Matthias_Steffens>

>>> I think that individual cite keys must be enclosed by anchors to
>>> provide for exact field matches if the ‘any’ relation is used:
>>> “^Smith1992a^ ^Smith1992b^ ^Mitchell1995a^”
>>
>> any, all and adjacency are word relations. If all you have in the
>> field is a single word, then it will act like an exact equality
>> relation.
>
> Maybe I’m not understanding it correctly, here. From the CQL
> information on the SRW web site I learned that:
>
> bib.citekey any “Smith1992 Mitchell1995”
>
> would resolve to:
>
> bib.citekey=“Smith1992” and bib.citekey=“Mitchell1995”

(That “and” should be “or” – presumably a typo?)

> and that the equals sign means “contains”.

Yes.

> That, in turn, means that other cite keys which contain the search
> term would be matched as well.

No. “=” is doing word matching, not substring matching.

bib.citekey = Smith1992

will find “Smith1992” but not “Smith1992a”. If you want to find
those, you’ll need to use a wildcard:

bib.citekey = Smith1992*

/| ___________________________________________________________________
/o ) / Mike Taylor <@Mike_Taylor> http://www.miketaylor.org.uk
)v_/\ Join the ASCII ribbon campaign against HTML mail -
http://arc.pasp.de/--
Listen to free demos of soundtrack music for film, TV and radio
http://www.pipedreaming.org.uk/soundtrack/
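
The difference between word matching, a trailing wildcard and plain substring matching can be tried out with a small Python sketch. It uses the example cite keys from earlier in the thread and assumes that each cite key field holds a single word and that `*` is the only masking character; the function names are made up for illustration and do not come from any CQL implementation.

```python
import re

CITEKEYS = [
    "Smith1992",
    "Smith1992Univariate",
    "Smith1992Multiple",
    "Mitchell1995",
    "JeffriesMitchell1995",
]


def word_match(term, keys):
    """Match whole words; '*' in the term stands for any run of characters."""
    # Escape everything except '*', which becomes the regex '.*'.
    pattern = ".*".join(re.escape(part) for part in term.split("*"))
    return [key for key in keys if re.fullmatch(pattern, key)]


def substring_match(term, keys):
    """The 'contains' reading that the thread rules out for '='."""
    return [key for key in keys if term in key]


print(word_match("Smith1992", CITEKEYS))
print(word_match("Smith1992*", CITEKEYS))
print(substring_match("Mitchell1995", CITEKEYS))
```

Under these assumptions the unmasked term only finds the identical key, which is why the anchors turn out to be unnecessary.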

>> bib.citekey any “Smith1992 Mitchell1995”
>> would resolve to:
>> bib.citekey=“Smith1992” and bib.citekey=“Mitchell1995”
>
> (That “and” should be “or” – presumably a typo?)

Oops, yes that was a typo.

>> and that the equals sign means “contains”.
>>
>> That, in turn, means that other cite keys which contain the
>> search term would be matched as well.
>
> No. “=” is doing word matching, not substring matching.

Ah, ok. I somehow got this wrong when reading the CQL documentation.
Sorry for the confusion. Then the anchors aren’t necessary, of course.

Thanks, Matthias

Yes, you’re correct and I should change that. (especially since the
digest sometimes seems to get the chronological order of postings
wrong which makes it impossible to follow a conversation)

Matthias

"^Smith1992a^ ^Smith1992b^ ^Mitchell1995a^"

>> I don’t think it makes any difference, actually.
>> any, all and adjacency are word relations. If all you have in the field
>> is a single word, then it will act like an exact equality relation.
>
> True, but what if your query contains cite keys that would match
> multiple keys in your database, like:
> any “Smith1992 Mitchell1995”
>
> and you have the following cite keys in your database:
> “Smith1992Univariate”
> “Smith1992Multiple”

You would match them with Smith1992* using the default masking characters.
You could also use a regular expression (for example) to match them if you
specified a different masking algorithm.

>> If you have a cite key with a space, you’ll still end up with
>> potentially incorrect results from:
>>
>>  foo all "^the first key^  ^a second key^"
>>
>> which means:
>>
>>  foo = ^the and foo = first and foo = key^ and foo = ^a and
>>  foo = second and foo = key^
>
> Ok, I see. Btw, for me it’s one of the most confusing things in CQL
> that spaces mean completely different things depending on context
> and which relation is used.

Yes, the distinction is primarily if the field is to be treated as a
single string or a list of words

Rob


> bib.citekey any “Smith1992 Mitchell1995”
> would resolve to:
> bib.citekey=“Smith1992” and bib.citekey=“Mitchell1995”
>
> and that the equals sign means “contains”.

= means (currently) word adjacency when applied to strings. So = with one
term is the same as any or all with one term – the field contains the
word given.

> And how does CQL define a word? What about international characters
> or a hyphen (‘-’)?

It doesn’t. It’s up to the search engine to determine the best way to
turn a field into a list of words.

Rob
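
Since word splitting is left to the search engine, the same cite key can yield different word lists on different servers. The Python sketch below compares three tokenizers chosen purely for illustration; none of them is claimed to be what any particular SRU server does. `NW2000-0207` is a key from this thread, while `Müller2001` is a made-up key with a non-ASCII letter.

```python
import re


def whitespace_words(field):
    """Split only on whitespace, so punctuation stays inside a word."""
    return field.split()


def unicode_words(field):
    """Treat any run of Unicode letters, digits or underscores as a word."""
    return re.findall(r"\w+", field)


def ascii_words(field):
    """As above, but only ASCII letters, digits and underscores count."""
    return re.findall(r"\w+", field, flags=re.ASCII)


for key in ("NW2000-0207", "Müller2001"):
    print(key, whitespace_words(key), unicode_words(key), ascii_words(key))
```

A server that splits on hyphens would treat `NW2000-0207` as two words, so a word relation on that key may behave differently than it does for a key without punctuation.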


> For Smith1992aabb to be a match you need the stem modifier
> So
> bib.citekey =/stem “Smith1992” or bib.citekey =/stem “Mitchell1995”
> Or
> bib.citekey any/stem “Smith1992 Mitchell1995”

Or more appropriately, * on the end as stem is used for linguistic
stemming (ala the Porter algorithm)

Rob


> Anyway, if someone could settle the syntax I should be using, that’d be
> great. I was so far assuming:
> bib:citekey+any+“^Smith1992a^+^Smith1992b^”

Assuming that the context set has a short name of ‘bib’:

  bib.citekey any "Smith1992a Smith1992b"

Plus escaping on all non URL okay characters such as space and "

Rob
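
As a rough illustration of the escaping Rob mentions, the following Python sketch assembles a searchRetrieve URL and percent-encodes every parameter value. The parameter names and values are the ones used in this thread; `build_sru_url`, the example server address and the example token are hypothetical.

```python
from urllib.parse import quote


def build_sru_url(server_url, citekeys, auth_token=None):
    """Return a searchRetrieve URL for the given cite keys (illustrative only)."""
    # CQL query in the form suggested above: space-separated keys, 'any' relation.
    cql = 'bib.citekey any "{}"'.format(" ".join(citekeys))
    params = [
        ("version", "1.1"),
        ("operation", "searchRetrieve"),
        ("query", cql),
        ("recordSchema", "mods"),
        ("recordPacking", "xml"),
        ("startRecord", "1"),
        ("maximumRecords", "9999"),
    ]
    if auth_token is not None:
        params.append(("x-info-2-auth1.0-authenticationToken", auth_token))
    # quote(..., safe="") percent-encodes spaces, double quotes and other
    # reserved characters; letters, digits and '-', '.', '_', '~' pass through.
    query_string = "&".join(
        "{}={}".format(quote(name, safe=""), quote(value, safe=""))
        for name, value in params
    )
    return "{}?{}".format(server_url, query_string)


# Hypothetical server address, used only to show the shape of the result.
print(build_sru_url("http://example.org/sru.php", ["Smith1992a", "Smith1992b"]))
```

Encoding each value separately keeps the `&` separators between parameters intact while still escaping the spaces and double quotes inside the CQL query.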


Here’s what I’m currently outputting:

version=1.1&query=bib.citekey%20any%20"Tilly2000a,%20Thrift1990a,%20Tilly2002a,%20Veer1996a,%20Tremblay2001a,%20NW2000-0207,%20NW2000-0424a"&operation=searchRetrieve"recordSchema=mods&recordPacking=xml&startRecord=1&maximumRecords=9999&x-info-2-auth1.0-authenticationToken=

I still need to finish the authentication support, and a server to
test against. Matthias, are you able to support these queries yet?

Bruce

>>> Anyway, if someone could settle the syntax I should be using, that’d be
>>> great. I was so far assuming:
>>> bib:citekey+any+“^Smith1992a^+^Smith1992b^”
>>
>> Assuming that the context set has a short name of ‘bib’:
>>
>>   bib.citekey any "Smith1992a Smith1992b"
>>
>> Plus escaping on all non URL okay characters such as space and "
>
> Here’s what I’m currently outputting:
>
> version=1.1&query=bib.citekey%20any%20"Tilly2000a,%20Thrift1990a,%20Tilly2002a,%20Veer1996a,%20Tremblay2001a,%20NW2000-0207,%20NW2000-0424a"&operation=searchRetrieve"recordSchema=mods&recordPacking=xml&startRecord=1&maximumRecords=9999&x-info-2-auth1.0-authenticationToken=

I assume the commas shouldn’t be in the above search term?

> I still need to finish the authentication support, and a server to
> test against. Matthias, are you able to support these queries yet?

Almost. :-) I haven’t yet found time to correct my incorrect
interpretation of the equals relation (i.e. ‘=’ matches only full
words but not sub-strings). The authentication token isn’t supported
either but I suppose this is easy to implement. I hope to finish
these things over the weekend.

Ultimately, I should rewrite my simple CQL parser as suggested by Rob
in an earlier email.

Matthias

> I assume the commas shouldn’t be in the above search term?

Right.

> I hope to finish these things over the weekend.

OK, I’ll do the same. Let me know when you’re ready and we can do a
test/demo. I want to announce this project formally next week if
possible.

Bruce
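
On the comma question settled above, one way to get from a comma-separated cite key list to the agreed query form is sketched below in Python. It assumes the keys arrive as one string separated by commas and/or whitespace; `citekeys_to_cql` is a hypothetical helper, not code from either project.

```python
import re


def citekeys_to_cql(raw, index="bib.citekey"):
    """Turn 'Key1, Key2, Key3' into: index any "Key1 Key2 Key3"."""
    # Split on any run of commas and whitespace, dropping empty pieces.
    keys = [key for key in re.split(r"[,\s]+", raw) if key]
    return '{} any "{}"'.format(index, " ".join(keys))


print(citekeys_to_cql("Tilly2000a, Thrift1990a, NW2000-0207"))
```

The result still has to be URL-escaped before it is placed in the `query` parameter.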

>> I hope to finish these things over the weekend.
>
> OK, I’ll do the same. Let me know when you’re ready and we can do
> a test/demo.

Will do.

> I want to announce this project formally next week if possible.

Great!

[Btw: I got an “Undelivered Mail” error with my last email that I
cc-ed to you directly:

“<@Bruce_D_Arcus>: host
/var/imap/socket/lmtpprox[/var/imap/socket/lmtpprox] said: 552 5.2.2
Over quota (reported by server2.internal in RCPT TO) (in reply to end
of DATA command)”]

Matthias

I don’t know what the deal is with fastmail. Matthew and I have both
seen the same.

GMail is a bit safer I think.

Bruce

Hi,

In the hope of integrating with xbib, I’m trying to implement support
for the ‘x-…-authenticationToken’ parameter in an SRU query:

sru.php?version=1.1&query=bib.citekey=Mock2003Diss
&x-info-2-auth1.0-authenticationToken=email=@Matthias_Steffens

The problem is that none of the Mac OS X browsers I’ve tried (Safari,
Firefox, Mozilla, Camino, Opera) seems to pass the token correctly:

x-info-2-auth1.0-authenticationToken

The dot seems to be the culprit. If I remove the dot everything works
as expected. Or could this be a problem with PHP/Apache? I’m using
Apache/1.3.33 (Darwin) PHP/5.0.4.

I appreciate any hints.

Thanks, Matthias
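
One possible explanation, offered as an assumption rather than anything confirmed in this thread: PHP's documentation states that dots and spaces in incoming variable names are converted to underscores, so the browsers may be sending the parameter correctly while PHP exposes it under a rewritten name. The Python sketch below imitates that rewriting and shows that the raw query string still carries the dotted name; `php_style_name` and `get_param` are hypothetical helpers, and the e-mail address is a placeholder.

```python
from urllib.parse import parse_qs

TOKEN_NAME = "x-info-2-auth1.0-authenticationToken"


def php_style_name(name):
    """Imitate PHP's rewriting of dots and spaces in request variable names."""
    return name.replace(".", "_").replace(" ", "_")


def get_param(raw_query, name):
    """Look the parameter up in the raw query string, dot intact."""
    values = parse_qs(raw_query).get(name)
    return values[0] if values else None


raw = (
    "version=1.1&query=bib.citekey=Mock2003Diss"
    "&x-info-2-auth1.0-authenticationToken=email=user@example.org"
)
print(php_style_name(TOKEN_NAME))
print(get_param(raw, TOKEN_NAME))
```

If this is the cause, the token would be available in PHP under the underscore form of the name, or could be recovered by parsing the raw query string directly.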