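Analyzing Large XML Datasets with Python

1. Introduction

In the era of big data, data processing and analysis are increasingly popular topics, and XML remains a widely used way to represent and store data because it is easy to read and work with. Python is easy to learn, productive to develop in, and well equipped for data processing. This article therefore describes how to analyze large XML datasets with Python.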

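2. Introduction to XML

XML (Extensible Markup Language) is a markup language whose main purpose is to describe and transport data so that it can be passed between applications and shared. Its syntax resembles HTML, but XML emphasizes the structure and semantics of data, which makes it better suited to expressing what the data itself means.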

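2.1 Basic Structure of XML

An XML document is built from tags, which describe the type and meaning of the elements in the document. A complete XML document usually consists of the following parts:

XML declaration: specifies basic information such as the XML version and the character encoding

Document type definition (DTD): defines the structure of elements and attributes, including attribute values and default values

Root element: the single outermost element that marks the start and end of the document content; every other element is nested inside it

Element: consists of a start tag and an end tag; attributes of the element can be given in the start tag

Comment: explains the role or purpose of a part of the document to the reader

Processing instruction: supplies information that the XML processor needs when handling the document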
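
To make the parts listed above concrete, the following sketch embeds a small XML document in a Python string and uses xml.dom.minidom to report which top-level parts it contains. The document content (a movie collection, a stylesheet name) is made up for illustration, and the helper name describe_top_level is hypothetical.

```python
import xml.dom.minidom

# Hypothetical sample document containing each part described above.
SAMPLE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE collection [
  <!ELEMENT collection (movie*)>
  <!ELEMENT movie (type)>
  <!ATTLIST movie title CDATA #REQUIRED>
  <!ELEMENT type (#PCDATA)>
]>
<?xml-stylesheet type="text/xsl" href="movies.xsl"?>
<!-- A tiny movie collection -->
<collection>
  <movie title="Example Movie">
    <type>Drama</type>
  </movie>
</collection>
"""


def describe_top_level(xml_text):
    """Return (kind, name) pairs for the top-level nodes of the document."""
    doc = xml.dom.minidom.parseString(xml_text)
    parts = []
    for node in doc.childNodes:
        if node.nodeType == node.DOCUMENT_TYPE_NODE:
            parts.append(("doctype", node.name))
        elif node.nodeType == node.PROCESSING_INSTRUCTION_NODE:
            parts.append(("processing instruction", node.target))
        elif node.nodeType == node.COMMENT_NODE:
            parts.append(("comment", node.data.strip()))
        elif node.nodeType == node.ELEMENT_NODE:
            parts.append(("root element", node.tagName))
    return parts
```

Calling describe_top_level(SAMPLE_XML) lists the document type definition, the processing instruction, the comment, and the root element. The XML declaration is consumed by the parser and does not appear as a node.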

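2.2 Reading and Parsing XML

Python offers several modules for reading and parsing XML. The most commonly used are the standard library's xml package, including xml.dom and xml.sax, and the third-party lxml library. Two common approaches are described below.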

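2.2.1 DOM Parsing

DOM (Document Object Model) is a tree-based approach to XML parsing. It loads the entire document into memory, builds a DOM tree, and then works on the nodes of that tree. This approach suits small XML documents; for large documents, memory use grows with document size and performance suffers.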

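import xml.dom.minidom

# Parse the XML file into a DOM tree
DOMTree = xml.dom.minidom.parse("movies.xml")

# Get the root element
collection = DOMTree.documentElement

# Get all movie elements
movies = collection.getElementsByTagName("movie")

# Loop over all movies
for movie in movies:
    # Read the title attribute of the movie
    title = movie.getAttribute("title")
    # Get the first <type> child element
    # (named type_node so that it does not shadow the built-in type)
    type_node = movie.getElementsByTagName("type")[0]
    # Print the text content of the element
    print("Title: %s, Type: %s" % (title, type_node.childNodes[0].data))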
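
The listing above expects a movies.xml file that is not shown. As a minimal sketch, assuming the file consists of movie elements with a title attribute and a type child element, the same DOM calls can be tried on an in-memory document. The sample data and the function name list_movies are invented for illustration.

```python
import xml.dom.minidom

# Assumed layout of movies.xml; titles and types are invented sample data.
MOVIES_XML = """<collection>
  <movie title="First Example">
    <type>Drama</type>
  </movie>
  <movie title="Second Example">
    <type>Comedy</type>
  </movie>
</collection>"""


def list_movies(xml_text):
    """Parse the whole document into a DOM tree and return (title, type) pairs."""
    dom_tree = xml.dom.minidom.parseString(xml_text)
    result = []
    for movie in dom_tree.documentElement.getElementsByTagName("movie"):
        title = movie.getAttribute("title")
        type_nodes = movie.getElementsByTagName("type")
        # Guard against a missing or empty <type> element instead of indexing blindly.
        if type_nodes and type_nodes[0].firstChild is not None:
            movie_type = type_nodes[0].firstChild.data.strip()
        else:
            movie_type = ""
        result.append((title, movie_type))
    return result
```

Because parseString builds the full tree before anything is returned, this style is convenient for small inputs but keeps the entire document in memory.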

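2.2.2 SAX Parsing

SAX (Simple API for XML) is an event-driven approach to XML parsing. It does not load the whole document into memory. Instead, the parser reads the document sequentially and calls handler methods as it encounters start tags, end tags, and character data. This keeps memory use low, which makes SAX well suited to large XML documents.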

import xml.sax

class MovieHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.CurrentData = ""
        self.title = ""
        self.type = ""

    # Called when an element starts
    def startElement(self, tag, attributes):
        self.CurrentData = tag
        if tag == "movie":
            print("*****Movie*****")
            self.title = attributes["title"]
            print("Title:", self.title)
        elif tag == "type":
            self.type = ""

    # Called when an element ends
    def endElement(self, tag):
        if tag == "type":
            print("Type:", self.type)
        self.CurrentData = ""

    # Called for character data; may fire several times for one element,
    # so the text is appended rather than overwritten
    def characters(self, content):
        if self.CurrentData == "type":
            self.type += content

# Create an XMLReader
parser = xml.sax.make_parser()

# Replace the default ContentHandler with our own
Handler = MovieHandler()
parser.setContentHandler(Handler)

parser.parse("movies.xml")
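
Because a SAX handler only sees one event at a time, any result has to be accumulated in the handler's own state. The sketch below illustrates this by counting movies per type on an in-memory document. The sample data, the class name TypeCounter, and the movie/type layout are assumptions made for the example.

```python
import xml.sax
from collections import Counter


class TypeCounter(xml.sax.ContentHandler):
    """Counts how many <type> elements carry each text value."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self._in_type = False
        self._chunks = []

    def startElement(self, name, attrs):
        if name == "type":
            self._in_type = True
            self._chunks = []

    def characters(self, content):
        # The parser may deliver the text of one element in several chunks.
        if self._in_type:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "type":
            self.counts["".join(self._chunks).strip()] += 1
            self._in_type = False


def count_types(xml_bytes):
    """Run the handler over a bytes document and return the counts."""
    handler = TypeCounter()
    xml.sax.parseString(xml_bytes, handler)
    return handler.counts


# Invented sample data in the assumed movies.xml layout.
SAMPLE = (b"<collection>"
          b"<movie title='A'><type>Drama</type></movie>"
          b"<movie title='B'><type>Comedy</type></movie>"
          b"<movie title='C'><type>Drama</type></movie>"
          b"</collection>")
```

Only the counter and the text of the current type element are held in memory, regardless of how many movie elements the document contains.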

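3. Analyzing a Large XML Dataset with Python

In practice, large XML datasets usually need further processing and analysis. Using campaign contribution data from the US Federal Election Commission as an example, this section walks through the methods and workflow for analyzing a large XML dataset with Python.

3.1 About the Data

The Federal Election Commission (FEC) has published a dataset of US campaign contributions covering nearly 20 years. It contains more than one million contribution records totaling over 2.1 billion US dollars. Each record includes the contributor's name, address, contribution date, contribution amount, and other details. It is a typical large XML dataset in which each contributor's contribution information is wrapped in its own XML element. For illustration, here is an excerpt from the dataset: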

FEC file sample

ELECTRON KIN | MARK BLEDSOE TRADESHOWS ETC. | SPRING | TX | 77373 | 409712-S8 | 58T1JOY8G051214MN0117252 | 1 | 1995 | 2006-04-07T00:00:00+00:00 | 9E166D6990B8C4A8 | 2500.00 | 15

HARRINGTON | CHARLES | ROCHESTER | NY | 14618 | 998710-S6 | 05H7MI8V781100306N | 1 | 2002 | 2003-03-14T00:00:00+00:00 | 5BFA637AB00A5657 | 2000.00 | 15

GEIGER | WARNER | PIERMONT | NY | 10968 | 1243679-S4 | PA52IJDO8G051053OK0154693 | 1 | 2002 | 2003-03-14T00:00:00+00:00 | 5BFAD1C6B00A5657 | 500.00 | 15

3.2 Reading the Data

First, we read and parse the data with Python. Because the dataset is very large, we use SAX parsing. The code is as follows:

import xml.sax

class FECDataHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.CurrentData = ""
        self.rec_id = ""
        self.reset_fields()

    # Clear the fields of the current record
    def reset_fields(self):
        self.amount = ""
        self.timestamp = ""
        self.contributor_name = ""
        self.zipcode = ""
        self.contributor_state = ""
        self.contributor_occupation = ""
        self.contributor_employer = ""
        self.report_year = ""

    # Replace "&" with "and" and drop |, <, >, and quote characters
    @staticmethod
    def clean_text(text):
        text = text.strip().replace("&", "and")
        for ch in "|<>'\"":
            text = text.replace(ch, "")
        return text

    # Element start event
    def startElement(self, tag, attributes):
        self.CurrentData = tag
        if tag == "contributor":
            # A new record begins: remember its id and clear the old fields
            self.rec_id = attributes["rec-id"]
            self.reset_fields()

    # Element end event: normalize the field that has just been closed.
    # The tag argument is checked here; CurrentData has already been reset
    # by the time the enclosing contributor element ends.
    def endElement(self, tag):
        if tag == "amount":
            self.amount = self.amount.strip()
        elif tag == "timestamp":
            # Keep only the date part (YYYY-MM-DD)
            self.timestamp = self.timestamp.strip()[:10]
        elif tag == "first-name" or tag == "last-name":
            # Trailing space separates the first name from the last name
            self.contributor_name = self.contributor_name.strip() + " "
        elif tag == "zipcode":
            self.zipcode = self.zipcode.strip().zfill(5)
        elif tag == "state":
            self.contributor_state = self.contributor_state.strip().upper()
        elif tag == "occupation":
            self.contributor_occupation = self.clean_text(self.contributor_occupation)
        elif tag == "employer":
            self.contributor_employer = self.clean_text(self.contributor_employer)
        elif tag == "report-year":
            self.report_year = self.report_year.strip()
        elif tag == "contributor":
            print("Recordid: %s, Name: %s, State: %s, Zip: %s, Amount: %s, Time: %s"
                  % (self.rec_id, self.contributor_name.strip(), self.contributor_state,
                     self.zipcode, self.amount, self.timestamp))
        self.CurrentData = ""

    # Character data event: may fire several times per element, so always append
    def characters(self, content):
        if self.CurrentData == "amount":
            self.amount += content
        elif self.CurrentData == "timestamp":
            self.timestamp += content
        elif self.CurrentData == "first-name" or self.CurrentData == "last-name":
            self.contributor_name += content
        elif self.CurrentData == "zipcode":
            self.zipcode += content
        elif self.CurrentData == "state":
            self.contributor_state += content
        elif self.CurrentData == "occupation":
            self.contributor_occupation += content
        elif self.CurrentData == "employer":
            self.contributor_employer += content
        elif self.CurrentData == "report-year":
            self.report_year += content

# Create an XMLReader
parser = xml.sax.make_parser()

# Replace the default ContentHandler with our own
Handler = FECDataHandler()
parser.setContentHandler(Handler)

# Parse the downloaded XML file
parser.parse("FEC.xml")
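
The analysis code later in this article reads a file named FEC.csv, while the handler above only prints each record. One way to bridge that gap, sketched here under assumptions, is a handler that writes a CSV row whenever a contributor element ends. The element names follow the handler above; the column selection, the class name ContributorCsvWriter, and the function name xml_to_csv are hypothetical.

```python
import csv
import io
import xml.sax

# Maps child element names (as used by the handler above) to CSV column names.
FIELDS = {
    "amount": "amount",
    "timestamp": "timestamp",
    "state": "contributor_state",
    "zipcode": "zipcode",
}
COLUMNS = ["rec_id"] + list(FIELDS.values())


class ContributorCsvWriter(xml.sax.ContentHandler):
    """Writes one CSV row per <contributor> element."""

    def __init__(self, out_file):
        super().__init__()
        self.writer = csv.DictWriter(out_file, fieldnames=COLUMNS)
        self.writer.writeheader()
        self.record = None   # dict for the record being read, or None
        self.current = None  # column receiving character data, or None

    def startElement(self, name, attrs):
        if name == "contributor":
            self.record = {column: "" for column in COLUMNS}
            self.record["rec_id"] = attrs.get("rec-id", "")
        elif self.record is not None and name in FIELDS:
            self.current = FIELDS[name]

    def characters(self, content):
        # Text may arrive in several chunks, so append instead of assigning.
        if self.current is not None:
            self.record[self.current] += content

    def endElement(self, name):
        if name == "contributor" and self.record is not None:
            self.writer.writerow({k: v.strip() for k, v in self.record.items()})
            self.record = None
        self.current = None


def xml_to_csv(xml_source, out_file):
    """xml_source may be a file name or a binary file object."""
    xml.sax.parse(xml_source, ContributorCsvWriter(out_file))
```

When writing to a real file, open it with newline='' as the csv module documentation recommends, for example open('FEC.csv', 'w', newline='', encoding='ISO-8859-1'), so that the encoding matches what the analysis code expects.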

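3.3 Analyzing the Data

Next, we use Python to analyze the dataset and answer some basic questions about campaign contributions:

3.3.1 Question 1: How are contribution amounts distributed?

Counting the contributions that fall into each amount range, together with the overall total and average, gives a picture of how contribution amounts are distributed. The code below reads the records from FEC.csv, assumed here to be a CSV export of the parsed records with an amount column: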

import csv
import matplotlib.pyplot as plt

def plot_donors_by_amount():
    # Exclusive upper bound and label of each amount range, in display order
    buckets = [(10, '<10'), (20, '10-20'), (50, '20-50'), (100, '50-100'),
               (500, '100-500'), (1000, '500-1000'), (5000, '1000-5000')]
    labels = [label for _, label in buckets] + ['>=5000']
    # Pre-fill every range so the bars keep a fixed, sorted order
    amount_map = dict.fromkeys(labels, 0)
    amount_total = 0
    count_total = 0
    with open('FEC.csv', encoding='ISO-8859-1', mode='r') as f:
        reader = csv.DictReader(f)
        for row in reader:
            try:
                amount = float(row['amount'])
            except ValueError:
                # Skip rows whose amount is not a number
                continue
            amount_total += amount
            count_total += 1
            for upper, label in buckets:
                if amount < upper:
                    amount_map[label] += 1
                    break
            else:
                amount_map['>=5000'] += 1
    # Draw the bar chart
    plt.bar(range(len(labels)), [amount_map[label] for label in labels], align='center')
    plt.xticks(range(len(labels)), labels)
    plt.title('Donation Amount Distribution')
    plt.xlabel('Amount Range')
    plt.ylabel('Number of Donations')
    plt.show()
    # Print the total and average contribution amounts
    print('Total amount of donations:', amount_total)
    print('Total number of donations:', count_total)
    if count_total:
        print('Average donation amount:', amount_total / count_total)

plot_donors_by_amount()
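
The text above mentions both the number of contributions and the total amount in each range, while the listing only counts contributions per range. As a hedged sketch, both figures can be collected in one pass with the standard library's bisect module. The range edges mirror the listing above; the function name summarize_by_range and the label of the last range (written as >=5000 because that range includes 5000 itself) are choices made for this example.

```python
from bisect import bisect_right

# Exclusive upper edges of the ranges used in the listing above.
EDGES = [10, 20, 50, 100, 500, 1000, 5000]
LABELS = ["<10", "10-20", "20-50", "50-100", "100-500",
          "500-1000", "1000-5000", ">=5000"]


def summarize_by_range(amounts):
    """Return {label: (count, total)} for an iterable of numeric amounts."""
    counts = [0] * len(LABELS)
    totals = [0.0] * len(LABELS)
    for amount in amounts:
        # bisect_right gives the number of edges <= amount, which is the
        # index of the range the amount falls into.
        index = bisect_right(EDGES, amount)
        counts[index] += 1
        totals[index] += amount
    return {label: (counts[i], totals[i]) for i, label in enumerate(LABELS)}
```

The function takes plain numbers, so it can be fed from any source, for example the float values parsed from the amount column of the CSV file.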

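3.3.2 Question 2: Contributor statistics by state

Counting the contributions and summing the contribution amounts for each state shows how giving varies across states. The code is as follows: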

import csv
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.basemap import Basemap

def plot_donors_by_state():
    state_map_donors = {}
    state_map_amount = {}
    with open('FEC.csv', encoding='ISO-8859-1', mode='r') as f:
        reader = csv.DictReader(f)
        for row in reader:
            state = row['contributor_state']
            if state == "":
                continue
            try:
                amount = float(row['amount'])
            except ValueError:
                # Skip rows whose amount is not a number
                continue
            state_map_donors[state] = state_map_donors.get(state, 0) + 1
            state_map_amount[state] = state_map_amount.get(state, 0) + amount
    # Collect the per-state values for the donor map
    state_names = []
    state_donors = []
    state_amounts = []
    for state in state_map_donors.keys():
        state_names.append(state)
        state_donors.append(state_map_donors[state])
        state_amounts.append(state_map_amount[state])
    fig = plt.figure(figsize=(16, 8))
    ax1 = fig.add_subplot(121)
    # Mercator map of the contiguous United States
    # (named m so that it does not shadow the built-in map)
    m = Basemap(projection='merc', lat_0=39.5, lon_0=-99, resolution='i', area_thresh=0.1,
                llcrnrlon=-124.8, llcrnrlat=25, urcrnrlon=-66.9, urcrnrlat=49.5)
    m.drawcoastlines()
    m.drawcountries()
    m.drawstates()
    # Convert longitude/latitude pairs to map coordinates and mark them
    x, y = m(np.array([-122, -75]), np.array([47, 28]))
    ax1.plot(x, y, 'bo', markersize=3)
    x, y = m(np.array([-104, -87]), np.array([38, 26]))
    ax1.plot(x, y, 'bo', markersize=3)
    x, y = m(np.array([-121]), np.array([37]))
    ax1.plot(x, y, 'go', markersize=5)
    ax1.text(x[0] + 200000, y[0] + 100000, 'San Francisco', fontsize=10)
    ax1.text(x[0] - 800000, y[0] + 100000, 'Los Angeles', fontsize=10)
    ax1.text
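
Independently of the map drawing, the per-state aggregation described in this section can be sketched on its own. The example below is an illustration under assumptions rather than the article's method: it takes rows shaped like those produced by csv.DictReader, with the contributor_state and amount columns used above, and returns the states with the largest totals. The function name top_states and the parameter n are hypothetical.

```python
from collections import defaultdict


def top_states(rows, n=5):
    """Return (state, contribution_count, total_amount) for the n largest totals.

    rows is an iterable of dicts with 'contributor_state' and 'amount' keys.
    """
    counts = defaultdict(int)
    totals = defaultdict(float)
    for row in rows:
        state = (row.get("contributor_state") or "").strip().upper()
        if not state:
            continue  # skip records without a state
        try:
            amount = float(row.get("amount", ""))
        except (TypeError, ValueError):
            continue  # skip records whose amount is not a number
        counts[state] += 1
        totals[state] += amount
    ranked = sorted(totals, key=lambda s: totals[s], reverse=True)
    return [(s, counts[s], totals[s]) for s in ranked[:n]]
```

Passing csv.DictReader(f) for an open FEC.csv file as rows would yield the same per-state counts and sums that the listing above stores in state_map_donors and state_map_amount, apart from the state codes being upper-cased here.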