Extracting Specific Elements from XML with Python

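1. What is XML

XML (Extensible Markup Language) is a widely used data-interchange format. It describes data with tags, and the meaning of those tags is left to the application that consumes the data.

XML resembles HTML in that both use tags to mark up content. Unlike HTML, however, XML has no predefined tags; users define their own, which makes XML more flexible and better suited to data exchange.

In Python, XML data can be parsed with the ElementTree module (xml.etree.ElementTree in the standard library).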
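
As a minimal sketch of what parsing looks like, the snippet below reads XML from a string rather than a file. The `note` document and its tag names are hypothetical and exist only for illustration; `ET.fromstring` returns the root element directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline document, used only for illustration.
xml_text = "<note><to>Alice</to><body>Hello</body></note>"

# fromstring parses a string and returns the root Element.
root = ET.fromstring(xml_text)

print(root.tag)  # tag name of the root element
for child in root:
    # Each child exposes its tag name and its text content.
    print(child.tag, child.text)
```

For a file on disk, `ET.parse(path).getroot()` gives the same kind of root element.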

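2. Locating elements

When parsing an XML file, we usually need to locate specific elements according to some rule. The ElementTree module offers two ways to do this:

By tag name

Note: the tag name does not include the angle brackets.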


    import xml.etree.ElementTree as ET
    tree = ET.parse('example.xml')
    root = tree.getroot()
    for child in root:
        if child.tag == 'country':
            print(child.attrib['name'])
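
The explicit loop with a tag comparison can usually be shortened. The sketch below, which inlines a trimmed copy of the article's country data so that it runs without a file, shows two standard alternatives: `findall` with a bare tag name (direct children only) and `iter` (the whole subtree). The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the article's sample data, inlined for a self-contained run.
xml_text = """<data>
   <country name="Liechtenstein"><rank>1</rank></country>
   <country name="Singapore"><rank>4</rank></country>
   <country name="Panama"><rank>68</rank></country>
</data>"""
root = ET.fromstring(xml_text)

# findall('country') matches only direct children of root with that tag.
direct_names = [c.get("name") for c in root.findall("country")]

# iter('rank') walks the entire subtree, at any depth.
all_ranks = [r.text for r in root.iter("rank")]

print(direct_names)
print(all_ranks)
```

`Element.get` returns `None` for a missing attribute, whereas `attrib['name']` raises `KeyError`; which behavior is preferable depends on how strict the input is expected to be.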

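Using XPath

XPath is a language for locating information in an XML document; it can select elements by name, attribute, and other criteria. No extra module is needed: ElementTree ships with support for a limited subset of XPath, so you simply call findall on an Element (or ElementTree) object and pass it the XPath expression.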


    import xml.etree.ElementTree as ET
    tree = ET.parse('example.xml')
    root = tree.getroot()
    for country in root.findall("./country[@name='Singapore']"):
        rank = country.find('rank').text
        year = country.find('year').text
        gdppc = country.find('gdppc').text
        print(rank, year, gdppc)
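
To give a feel for the XPath subset that ElementTree understands, here is a short sketch of a few other expression forms: a descendant search, a positional predicate, an attribute predicate followed by a child step, and a child-text predicate. The data is the article's sample, inlined so the snippet runs on its own; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The article's sample data, inlined for a self-contained run.
xml_text = """<data>
   <country name="Liechtenstein">
      <rank>1</rank><year>2008</year><gdppc>141100</gdppc>
   </country>
   <country name="Singapore">
      <rank>4</rank><year>2011</year><gdppc>59900</gdppc>
   </country>
   <country name="Panama">
      <rank>68</rank><year>2011</year><gdppc>13600</gdppc>
   </country>
</data>"""
root = ET.fromstring(xml_text)

# './/year' selects year elements at any depth below root.
years = [y.text for y in root.findall(".//year")]

# 'country[2]' selects the second country child (positions are 1-based).
second = root.find("country[2]").get("name")

# An attribute predicate can be followed by a further path step.
panama_gdppc = root.findtext("country[@name='Panama']/gdppc")

# 'country[year='2011']' keeps countries whose year child has that exact text.
in_2011 = [c.get("name") for c in root.findall("country[year='2011']")]

print(years, second, panama_gdppc, in_2011)
```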

3. Example

Suppose we have an XML file named example.xml:


    <data>
       <country name="Liechtenstein">
          <rank>1</rank>
          <year>2008</year>
          <gdppc>141100</gdppc>
       </country>
       <country name="Singapore">
          <rank>4</rank>
          <year>2011</year>
          <gdppc>59900</gdppc>
       </country>
       <country name="Panama">
          <rank>68</rank>
          <year>2011</year>
          <gdppc>13600</gdppc>
       </country>
    </data>

We want to extract all of the country tags and print each country's name and rank:


    import xml.etree.ElementTree as ET
    tree = ET.parse('example.xml')
    root = tree.getroot()
    for child in root:
        if child.tag == 'country':
            print(child.attrib['name'], child.find('rank').text)

Output:

    Liechtenstein 1
    Singapore 4
    Panama 68

We can also filter the country tags by the value of rank:


    import xml.etree.ElementTree as ET
    tree = ET.parse('example.xml')
    root = tree.getroot()
    for country in root.findall("./country[rank='1']"):
        print(country.attrib['name'])

Output:

    Liechtenstein
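
The predicate `[rank='1']` is an exact text match. As far as the documented XPath subset goes, ElementTree does not offer numeric comparisons such as "rank below 10", so a range filter is probably best done in Python after selecting the candidates. A minimal sketch, with the sample data inlined and an arbitrary threshold of 10 chosen for illustration:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the article's sample data, inlined for a self-contained run.
xml_text = """<data>
   <country name="Liechtenstein"><rank>1</rank></country>
   <country name="Singapore"><rank>4</rank></country>
   <country name="Panama"><rank>68</rank></country>
</data>"""
root = ET.fromstring(xml_text)

# Select all countries with XPath, then compare rank numerically in Python.
# The threshold (10) is an arbitrary value picked for this example.
top_ten = [
    country.get("name")
    for country in root.findall("country")
    if int(country.findtext("rank")) <= 10
]

print(top_ten)
```

Converting with `int` assumes every country has a numeric rank; input that may lack one would need a guard before the conversion.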

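4. Summary

The ElementTree module provides solid support for parsing XML files, and locating elements by tag name or by XPath is quick and convenient, which is especially useful in large XML files. Note that XPath needs no separate import: xml.etree.ElementTree itself supports a limited subset of XPath expressions.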
