《Python Cookbook（第3版）中文版》——6.7　用命名空间来解析XML文档-阿里云开发者社区

《Python Cookbook（第3版）中文版》——6.7　用命名空间来解析XML文档

2017-05-02 1583

版权

本文内容由阿里云实名注册用户自发贡献，版权归原作者所有，阿里云开发者社区不拥有其著作权，亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容，填写侵权投诉表单进行举报，一经查实，本社区将立刻删除涉嫌侵权内容。

简介：

本节书摘来自异步社区《Python Cookbook（第3版）中文版》一书中的第6章，第6.7节，作者[美]David Beazley , Brian K.Jones，陈舸译，更多章节内容可以访问云栖社区“异步社区”公众号查看。

6.7　用命名空间来解析XML文档

6.7.1　问题

我们要解析一个XML文档，但是需要使用XML命名空间来完成。

6.7.2　解决方案

考虑使用了命名空间的如下XML文档：

<?xml version="1.0" encoding="utf-8"?>
<top>
  <author>David Beazley</author>
  <content>
      <html xmlns="http://www.w3.org/1999/xhtml">
          <head>
              <title>Hello World</title>
          </head>
          <body>
              <h1>Hello World!</h1>
          </body>
      </html>
  </content>
</top>

如果解析这个文档并尝试执行普通的查询操作，就会发现没那么容易实现，因为所有的东西都变得特别冗长啰嗦：

>>> # Some queries that work
>>> doc.findtext('author')
'David Beazley'
>>> doc.find('content')
<Element 'content' at 0x100776ec0>
>>> # A query involving a namespace (doesn't work)
>>> doc.find('content/html')
>>> # Works if fully qualified
>>> doc.find('content/{http://www.w3.org/1999/xhtml}html')
<Element '{http://www.w3.org/1999/xhtml}html' at 0x1007767e0>
>>> # Doesn't work
>>> doc.findtext('content/{http://www.w3.org/1999/xhtml}html/head/title')
>>> # Fully qualified
>>> doc.findtext('content/{http://www.w3.org/1999/xhtml}html/'
... '{http://www.w3.org/1999/xhtml}head/{http://www.w3.org/1999/xhtml}title')
'Hello World'
>>>

通常可以将命名空间的处理包装到一个通用的类中，这样可以省去一些麻烦：

class XMLNamespaces:
    def __init__(self, **kwargs):
        self.namespaces = {}
        for name, uri in kwargs.items():
            self.register(name, uri)
    def register(self, name, uri):
        self.namespaces[name] = '{'+uri+'}'
    def __call__(self, path):
        return path.format_map(self.namespaces)

要使用这个类，可以按照下面的方式进行：

>>> ns = XMLNamespaces(html='http://www.w3.org/1999/xhtml')
>>> doc.find(ns('content/{html}html'))
<Element '{http://www.w3.org/1999/xhtml}html' at 0x1007767e0>
>>> doc.findtext(ns('content/{html}html/{html}head/{html}title'))
'Hello World'
>>>

6.7.3　讨论

对包含有命名空间的XML文档进行解析会非常繁琐。XMLNamespaces类的功能只是用来稍微简化一下这个过程，它允许在后序的操作中使用缩短的命名空间名称，而不必去使用完全限定的URI。

不幸的是，在基本的ElementTree解析器中不存在什么机制能获得有关命名空间的进一步信息。但是如果愿意使用iterparse()函数的话，还是可以获得一些有关正在处理的命名空间范围的信息。示例如下：

>>> from xml.etree.ElementTree import iterparse
>>> for evt, elem in iterparse('ns2.xml', ('end', 'start-ns', 'end-ns')):
...     print(evt, elem)
...
end <Element 'author' at 0x10110de10>
start-ns ('', 'http://www.w3.org/1999/xhtml')
end <Element '{http://www.w3.org/1999/xhtml}title' at 0x1011131b0>
end <Element '{http://www.w3.org/1999/xhtml}head' at 0x1011130a8>
end <Element '{http://www.w3.org/1999/xhtml}h1' at 0x101113310>
end <Element '{http://www.w3.org/1999/xhtml}body' at 0x101113260>
end <Element '{http://www.w3.org/1999/xhtml}html' at 0x10110df70>
end-ns None
end <Element 'content' at 0x10110de68>
end <Element 'top' at 0x10110dd60>
>>> elem # This is the topmost element
<Element 'top' at 0x10110dd60>
>>>

最后要提到的是，如果正在解析的文本用到了除命名空间之外的其他高级XML特性，那么最好还是使用lxml库。比方说，lxml对文档的DTD验证、更加完整的XPath支持和其他的高级XML特性提供了更好的支持。本节提到的技术只是为解析操作做了一点修改，使得这个过程能够稍微简单一些。

《Python Cookbook（第3版）中文版》——6.7　用命名空间来解析XML文档

6.7　用命名空间来解析XML文档

6.7.1　问题

6.7.2　解决方案

6.7.3　讨论

热门文章

最新文章

相关课程

相关电子书

相关实验场景

推荐镜像

《Python Cookbook（第3版）中文版》——6.7 用命名空间来解析XML文档

6.7 用命名空间来解析XML文档

6.7.1 问题

6.7.2 解决方案

6.7.3 讨论

热门文章

最新文章

相关课程

相关电子书

相关实验场景

推荐镜像

《Python Cookbook（第3版）中文版》——6.7　用命名空间来解析XML文档

6.7　用命名空间来解析XML文档

6.7.1　问题

6.7.2　解决方案

6.7.3　讨论