lxml

So I just looked at lxml:

http://codespeak.net/lxml/

I’m not so sure I’d recommend it, since it’s a binding to libxml2 and
libxslt rather than a standalone Python library. That’s a pretty
heavyweight dependency.

I guess the question would be, what is it about ET that is insufficient?

Bruce

I guess the question would be, what is it about ET that is insufficient?

ElementTree has only limited support for XPath: it can’t search on
attributes. If I understand correctly, lxml is capable of doing that.

Johan

ElementTree has only limited support for XPath: can’t search on
attributes. If I understand correctly lxml is capable of doing that.

OK, but where do you need to search for attributes (as opposed to just
grabbing their values*)? We might consider tweaking the XML to make it
easier. I don’t think CSL should require a full xpath engine to handle.

Or are you talking about more the input drivers for, say, MODS (which
relies far too much on attributes for logic)?

Bruce

* this bit is easy of course:

>>> from xml.etree.ElementTree import Element
>>> elem = Element("tag", first="1", second="2")
>>> elem.get("first")
'1'

Actually, I see one obvious place where you need to find by attribute
value: the type templates.
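
For what it's worth, finding an element by attribute value can be done by hand even without XPath predicates. The following is only a sketch: the `choose`/`type`/`name` fragment is a made-up stand-in for the real templates, and the second form relies on the attribute predicates that ElementTree 1.3 and later (Python 2.7+/3.x) accept.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; element and attribute names are illustrative only.
root = ET.fromstring(
    '<choose>'
    '<type name="book"/>'
    '<type name="article"/>'
    '</choose>'
)

# Portable approach: collect the candidates and compare the attribute by hand.
by_hand = [e for e in root.findall("type") if e.get("name") == "article"]

# ElementTree 1.3+ also understands a simple attribute predicate.
by_predicate = root.findall("type[@name='article']")

print(len(by_hand), len(by_predicate))
```

Either way the lookup stays a one-liner, so it may not be worth pulling in a full XPath engine just for this.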

Bruce

Okay, I’ll use ElementTree instead then. Anyway, since lxml and
ElementTree share the same API, switching between them isn’t very
difficult. Switching from xml.dom.minidom to ElementTree is going to
be a bit more work. Not being able to search on attributes is indeed
quite annoying for MODS. For CSL it would be handy at times too, but
it is less essential there.

Johan
http://www.johankool.nl/

I guess an option is just to use ET to load a dedicated object, so that
you’re not relying on the details of the (generic) XML API?

So if you want to find the template for an article, just do:

style.bibliography.item("article")

… where you give the source type as parameter, and let the method
figure out which definition to use.
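
A minimal sketch of what such a dedicated object might look like in Python, assuming ElementTree does the loading. The class names, the `q()` helper, and the cut-down sample style are all hypothetical; only the `style.bibliography.item("article")` call shape comes from the suggestion above.

```python
import xml.etree.ElementTree as ET

CSL_NS = "http://purl.org/net/xbiblio/csl"

# Hypothetical, heavily simplified style document for illustration only.
SAMPLE = """
<style xmlns="http://purl.org/net/xbiblio/csl">
  <bibliography>
    <layout>
      <item>
        <choose>
          <type name="book"><title/></type>
          <type name="article"><title/><container/></type>
        </choose>
      </item>
    </layout>
  </bibliography>
</style>
"""


def q(tag):
    """Return the namespace-qualified form of a CSL tag name."""
    return "{%s}%s" % (CSL_NS, tag)


class Bibliography(object):
    def __init__(self, element):
        # Index the type templates by their name attribute once, up front,
        # so callers never touch the generic XML API.
        self._templates = {}
        for template in element.iter(q("type")):
            self._templates[template.get("name")] = template

    def item(self, source_type):
        """Return the template element for a source type, or None."""
        return self._templates.get(source_type)


class Style(object):
    def __init__(self, root):
        self.bibliography = Bibliography(root.find(q("bibliography")))


style = Style(ET.fromstring(SAMPLE))
template = style.bibliography.item("article")
print([child.tag.split("}")[1] for child in template])
```

The namespace handling and the attribute lookup both happen once, inside the loader, which is the whole point of the approach.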

Just one way to get around the problem, of course.

Bruce

Hello again,

I’ve been trying to switch over to ElementTree, but there were some
issues with namespaces that make me doubt whether this was such a
good idea after all. Having to spell out the namespace for every
step of a path is doable, but somewhat annoying. E.g.
csl.find('{%s}citation/{%s}layout/{%s}item' % (CSL_NS, CSL_NS, CSL_NS))
(CSL_NS is defined as 'http://purl.org/net/xbiblio/csl')
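
The repetition can be hidden behind a small helper. This is just a sketch: `ns_path` is a hypothetical name, it only handles plain tag steps (no `.`, `*` or predicates), and the one-line document is made up. The second lookup uses the prefix-to-URI mapping that `find()` accepts in ElementTree 1.3 and later.

```python
import xml.etree.ElementTree as ET

CSL_NS = "http://purl.org/net/xbiblio/csl"


def ns_path(path, ns=CSL_NS):
    """Qualify every step of a simple slash-separated path with {ns}."""
    return "/".join("{%s}%s" % (ns, step) for step in path.split("/"))


# Hypothetical minimal document, for illustration only.
csl = ET.fromstring(
    '<style xmlns="%s"><citation><layout><item/></layout></citation></style>'
    % CSL_NS
)

# Helper-based lookup; works with any ElementTree version.
item_a = csl.find(ns_path("citation/layout/item"))

# ElementTree 1.3+ lookup with a prefix map passed to find().
item_b = csl.find("c:citation/c:layout/c:item", {"c": CSL_NS})

print(item_a.tag)
```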

What I feel is more of a show-stopper is this behaviour of
ElementTree: http://www.xml.com/pub/a/2003/02/12/py-xml.html?page=2
I.e. it changes the names of the namespaces, and although this is
officially allowed in XML, I’d rather not have that happen when I
merge the results into the incoming XML document. I can already
foresee that this might cause troubles with e.g. MS Word reading in
its own XML files.

I’ve no idea yet how lxml does this. The requirement of libxml/libxslt
is for me not a big deal (it comes standard with OS X), but I can
understand it isn’t a very handy requirement.

Johan

Just to be clear, you are talking about namespace prefixes here, not
namespaces per se.

Ideally, an API allows you to register a namespace prefix, just as you
do in an XML document. Not sure if ET forces you to write custom code
to do this.
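
As a point of reference, later ElementTree releases (Python 2.7 and 3.2 onward) do provide this as `register_namespace()`. The sketch below assumes Python 3; the `cs` prefix and the tiny documents are arbitrary choices for illustration. Note that the registry is global to the module and only affects serialization.

```python
import xml.etree.ElementTree as ET

CSL_NS = "http://purl.org/net/xbiblio/csl"

# Ask the serializer to use "cs" for the CSL namespace (global setting).
ET.register_namespace("cs", CSL_NS)

root = ET.fromstring('<cs:style xmlns:cs="%s"><cs:info/></cs:style>' % CSL_NS)
registered = ET.tostring(root, encoding="unicode")

# A namespace that was never registered gets a generated prefix instead.
other = ET.fromstring('<x:a xmlns:x="http://example.org/unregistered"/>')
unregistered = ET.tostring(other, encoding="unicode")

print(registered)
print(unregistered)
```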

But this again raises the question: why not just use dedicated objects
for access? Then you know precisely how to get what you want. Just
iterate through the XML document, load it into objects, and then do
your work.

Here’s an example from my Ruby code (using REXML; which, BTW, doesn’t
handle namespaces correctly!):

 # Creates a CSL metadata object from a CSL file.
 # Assumes `csl` is a REXML::Document and that CSLInfo.new takes
 # (title, short_title, date_created) as positional arguments.
 def info
   config = {}
   csl.elements.each("/citationstyle/info/*") do |e|
     # e.text is nil for an empty element
     config[e.name] = e.text
   end
   CSLInfo.new(config["title"],
               config["title-short"],
               config["dateCreated"])
 end
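
Roughly the same idea in Python with ElementTree might look like the sketch below. The `CSLInfo` namedtuple, the `load_info` name, and the sample document are assumptions for illustration; in particular the short title and date are placeholder values, not taken from a real style file.

```python
import xml.etree.ElementTree as ET
from collections import namedtuple

# Hypothetical metadata container mirroring the Ruby CSLInfo above.
CSLInfo = namedtuple("CSLInfo", ["title", "short_title", "date_created"])

# Made-up sample; the short title and date are placeholders.
SAMPLE = """
<citationstyle>
  <info>
    <title>American Psychological Association</title>
    <title-short>APA</title-short>
    <dateCreated>2006-08-10</dateCreated>
  </info>
</citationstyle>
"""


def load_info(root):
    """Collect the children of <info> into a CSLInfo object."""
    config = {child.tag: child.text for child in root.find("info")}
    return CSLInfo(
        title=config.get("title"),
        short_title=config.get("title-short"),
        date_created=config.get("dateCreated"),
    )


info = load_info(ET.fromstring(SAMPLE))
print(info.title)
```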

Also, see:

http://www.oreillynet.com/onlamp/blog/2005/01/thoughts_on_xpath_xml_python.html
http://uche.ogbuji.net/tech/4suite/amara/
http://www.xml.com/pub/a/2005/01/19/amara.html

An example:

===

from amara import binderytools
csl = binderytools.bind_file('apa.csl')
print csl.style.info.title
American Psychological Association
===

Nice, eh?

It also has xpath support, and namespace prefix binding.

I didn’t quite work out how to iterate over child elements though.

Bruce

What I feel is more of a show-stopper is this behaviour of
ElementTree: http://www.xml.com/pub/a/2003/02/12/py-xml.html?page=2
I.e. it changes the names of the namespaces, and although this is
officially allowed in XML, I’d rather not have that happen when I
merge the results into the incoming XML document.

Just to be clear, you are talking about namespace prefixes here, not
namespaces per se.

Indeed.

Ideally, an API allows you to register a namespace prefix, just as you
do in an XML document. Not sure if ET forces you to write custom code
to do this.

It sure seems to be that way. It’ll be a very annoying thing to do. I
can’t stop thinking this ought to be doable in a more sensible way.

But this again raises the question: why not just use dedicated objects
for access? Then you know precisely how to get what you want. Just
iterate through the XML document, load it into objects, and then do
your work.

That doesn’t change anything really, only the point where I read in
the XML file. Whether I obtain info from an Element or a custom
object doesn’t change very much.

Nice, eh?

It also has xpath support, and namespace prefix binding.

I’ll have a further look at Amara. Judging from the docs, it seems
less painful to use than ElementTree.

Johan

Ideally, an API allows you to register a namespace prefix, just as you
do in an XML document. Not sure if ET forces you to write custom code
to do this.

It sure seems to be that way. It’ll be a very annoying thing to do. I
can’t stop thinking this ought to be doable in a more sensible way.

Yes. You could always ask the developer?

But this again raises the question: why not just use dedicated objects
for access? Then you know precisely how to get what you want. Just
iterate through the XML document, load it into objects, and then do
your work.

That doesn’t change anything really, only the point where I read in
the XML file. Whether I obtain info from an Element or a custom
object doesn’t change very much.

I’m just thinking it will get rid of some of the namespace and
attribute trickiness. You don’t have to worry about how a generic API
deals with the XML; you just use it to get the object you want.

But Amara is pretty nice. For example:

print csl.style.lang
en

So elements and attributes are treated the same. That means you can do
this too:

csl.style.bibliography.layout.item.choose.type.name
u'book'

It does choke on this, probably because of the attribute name:

 print csl.style.class
                     ^

SyntaxError: invalid syntax

Bruce

I’ve just started with Amara, but this is really the way to go! It’s a
very cool and easy way to get to the data, and with such easy, Pythonic
access there is really no need to create many custom objects.

It does choke on this, probably because of the attribute name:

 print csl.style.class
                     ^

SyntaxError: invalid syntax

I ran into that for the “and” attribute too. The solution is to use
xpath: print csl.style.xml_xpath("class").

Johan

This goes for both you and Simon: if there are trivial changes we can
make in the XML to make these sorts of bindings easier, then let me
know … soon.

Simon is using a similar kind of XML extension for Javascript.

Bruce

“type”, “class” and “and” are the ones I’ve seen so far. A dash “-” in
a tag or attribute name is not very handy either, because it can’t be
used in a Python identifier. Removing them might decrease readability,
but would make life easier for me.
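
To check which names are affected, something like the following sketch could be run over the tag and attribute names (Python 3). Both function names and the underscore-based renaming are hypothetical conventions for illustration, not what Amara does. As far as Python's syntax goes, `type` is a builtin rather than a reserved word, so it passes this check.

```python
import keyword


def attribute_safe(name):
    """True if `name` can be written after a dot in Python source."""
    return name.isidentifier() and not keyword.iskeyword(name)


def safe_name(name):
    """Hypothetical mangling: dashes become underscores and reserved
    words get a trailing underscore."""
    candidate = name.replace("-", "_")
    if keyword.iskeyword(candidate):
        candidate += "_"
    return candidate


for name in ["type", "class", "and", "title-short", "lang"]:
    print(name, attribute_safe(name), safe_name(name))
```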

Johan