YJL: Selecting elements with namespaces in pyquery

If you have an XML document like the one below and parse it with the XML parser in pyquery 1.1.1, there is no quick way to select the elements that have namespaces:

<?xml version="1.0" encoding="UTF-8" ?>
<foo xmlns:bar="http://example.com/bar" xmlns:bar2="http://example.com/bar2">
  <bar:blah>What</bar:blah>
  <bar2:blah>Woot</bar2:blah>
  <idiot>123</idiot>
</foo>

The simplest workaround is to load it with the HTML parser, but then you get two indistinguishable <blah/> elements. The more proper way is to build your own CSS selector:

from pyquery import PyQuery as pq
from lxml.cssselect import CSSSelector

# xml is a string holding the document shown above
d = pq(xml, parser='xml')

# d[0] is the root lxml element; the selector returns a list of matching elements
sel = CSSSelector('bar|blah', namespaces={'bar': 'http://example.com/bar'})
print sel(d[0])[0].text, pq(sel(d[0])).text()
sel = CSSSelector('bar2|blah', namespaces={'bar2': 'http://example.com/bar2'})
print sel(d[0])[0].text, pq(sel(d[0])).text()

In lxml, a valid CSS selector for a namespaced element uses | to separate the namespace prefix from the tag name. Besides the correct syntax, you also need to supply the namespaces as a dict that maps each prefix to its URI, as shown above. The result, a list of elements, can be fed back into pyquery, which you can then use to carry out further selection.
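To see what the namespaces dict really does, here is a small sketch that uses the standard library's xml.etree.ElementTree instead of pyquery or lxml (a swapped-in tool, chosen only because it needs nothing installed). The prefixes a and b are made-up aliases that do not appear in the document; the point is that only the URIs have to match, and each element is identified by its URI plus local name.

```python
import xml.etree.ElementTree as ET

# The sample document from the top of the post, as bytes so the
# parser can honour the encoding declaration.
XML = b"""<?xml version="1.0" encoding="UTF-8" ?>
<foo xmlns:bar="http://example.com/bar" xmlns:bar2="http://example.com/bar2">
  <bar:blah>What</bar:blah>
  <bar2:blah>Woot</bar2:blah>
  <idiot>123</idiot>
</foo>
"""

root = ET.fromstring(XML)

# Hypothetical prefixes "a" and "b": local aliases for this lookup only.
# They differ from the document's "bar" and "bar2" on purpose.
ns = {'a': 'http://example.com/bar', 'b': 'http://example.com/bar2'}

first = root.find('a:blah', ns)
second = root.find('b:blah', ns)

# Each tag is stored as "{URI}localname", which is what keeps the two
# blah elements distinct.
print('%s %s' % (first.tag, first.text))
print('%s %s' % (second.tag, second.text))
```

The same holds for the dict passed to CSSSelector: the keys are just names you pick for the selector, and the URIs do the actual matching.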

I made a patch for pyquery; selecting will be easier if it gets pulled. The code would then look like:

namespaces = {'bar': 'http://example.com/bar',
              'bar2': 'http://example.com/bar2'}
print pq('bar|blah', xml, parser='xml', namespaces=namespaces)
print d('bar2|blah', namespaces=namespaces).text()
