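1. Introduction
In the era of big data, data processing and analysis have become increasingly important topics, and XML remains a widely used format for representing and storing structured data because it is self-describing and readable by both people and programs. Python is easy to learn, productive to develop in, and well equipped for data processing. This article therefore describes how to analyze large XML datasets with Python.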
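2. Introduction to XML
XML (Extensible Markup Language) is a markup language whose main purpose is to describe and transport data so that it can be exchanged and shared between applications. Its syntax resembles that of HTML, but whereas HTML focuses on how content is displayed, XML emphasizes the structure and semantics of the data, which makes it better suited to expressing what the data means.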
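2.1 Basic structure of XML
An XML document is built from tags, which describe the type and meaning of the elements in the document. A complete XML document usually consists of the following parts:
XML declaration: specifies basic information such as the XML version and the character encoding
Document type definition (DTD): optional; defines which elements and attributes are allowed, along with attribute types and default values
Root element: the single top-level element; every other element is nested inside it
Element: consists of a start tag and an end tag (or a single empty-element tag); the start tag can carry attributes
Comment: explains the purpose of a part of the document to human readers
Processing instruction: passes information to the application that processes the document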
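The parts listed above can be seen together in one small document. The sketch below is illustrative only: the collection/movie document, its DTD, and the stylesheet name are made up for this example, and xml.dom.minidom is used simply because it exposes each top-level part as a node.

```python
import xml.dom.minidom
from xml.dom import Node

# Hypothetical sample document; all names and values are invented for illustration.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="style.xsl"?>
<!DOCTYPE collection [
  <!ELEMENT collection (movie*)>
  <!ELEMENT movie (type)>
  <!ATTLIST movie title CDATA #REQUIRED>
  <!ELEMENT type (#PCDATA)>
]>
<!-- A comment for human readers -->
<collection>
  <movie title="Example Movie">
    <type>Drama</type>
  </movie>
</collection>
"""

# Human-readable names for the node types that can appear at document level
KIND = {
    Node.PROCESSING_INSTRUCTION_NODE: "processing instruction",
    Node.DOCUMENT_TYPE_NODE: "document type definition",
    Node.COMMENT_NODE: "comment",
    Node.ELEMENT_NODE: "root element",
}

doc = xml.dom.minidom.parseString(SAMPLE.encode("utf-8"))

# The XML declaration is not a node; it is exposed as document properties
print("XML declaration: version", doc.version, "encoding", doc.encoding)

# List the top-level parts in document order
parts = [KIND[node.nodeType] for node in doc.childNodes]
for part in parts:
    print(part)

# Elements and attributes live below the root element
movie = doc.documentElement.getElementsByTagName("movie")[0]
print("movie title attribute:", movie.getAttribute("title"))
```
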
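2.2 Reading and parsing XML
Python offers several modules for reading and parsing XML. The standard library's xml package includes xml.dom, xml.sax, and xml.etree.ElementTree, and lxml is a widely used third-party library. Two common approaches are described below.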
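2.2.1 DOM parsing
DOM (Document Object Model) is a tree-based way of parsing XML: the parser loads the entire document into memory, builds a DOM tree, and the program then works with the nodes of that tree. This approach suits small XML documents; for large documents, memory use and parse time grow with the size of the file.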
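import xml.dom.minidom

# Parse the XML file into a DOM tree
DOMTree = xml.dom.minidom.parse("movies.xml")
# Get the root element
collection = DOMTree.documentElement
# Get all movie elements
movies = collection.getElementsByTagName("movie")
# Loop over the movies
for movie in movies:
    # Read the title attribute
    title = movie.getAttribute("title")
    # Get the first type child element (named movie_type to avoid shadowing the built-in type)
    movie_type = movie.getElementsByTagName('type')[0]
    # Print the text content
    print("Title: %s, Type: %s" % (title, movie_type.childNodes[0].data))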
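The listing above depends on a movies.xml file that is not shown here. As a self-contained variant, the following sketch parses an inline string instead. The sample document and the list_movies helper are assumptions made for illustration; only the movie element, its title attribute, and the type child come from the listing.

```python
import xml.dom.minidom

# Hypothetical stand-in for movies.xml
MOVIES_XML = """<collection>
  <movie title="Movie A"><type>Drama</type></movie>
  <movie title="Movie B"><type>Comedy</type></movie>
</collection>"""


def list_movies(xml_text):
    """Return a list of (title, type) pairs from a movie collection document."""
    dom_tree = xml.dom.minidom.parseString(xml_text)
    result = []
    for movie in dom_tree.documentElement.getElementsByTagName("movie"):
        title = movie.getAttribute("title")
        type_nodes = movie.getElementsByTagName("type")
        # Guard against a missing or empty type element
        if type_nodes and type_nodes[0].firstChild is not None:
            movie_type = type_nodes[0].firstChild.data
        else:
            movie_type = ""
        result.append((title, movie_type))
    return result


movies = list_movies(MOVIES_XML)
for title, movie_type in movies:
    print("Title: %s, Type: %s" % (title, movie_type))
```

The whole tree is held in memory while list_movies runs, which is what limits DOM parsing to small inputs.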
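2.2.2 SAX parsing
SAX (Simple API for XML) is an event-driven way of parsing XML. Instead of loading the whole document into memory, the parser reads the input as a stream and calls handler methods as it encounters start tags, end tags, and character data. Memory use stays roughly constant regardless of file size, which makes this approach suitable for large XML documents.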
import xml.sax

class MovieHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.CurrentData = ""
        self.title = ""
        self.type = ""
    # Called at each start tag
    def startElement(self, tag, attributes):
        self.CurrentData = tag
        if tag == "movie":
            print("*****Movie*****")
            title = attributes["title"]
            print("Title:", title)
        elif tag == "type":
            # Start collecting the text of a new type element
            self.type = ""
    # Called at each end tag
    def endElement(self, tag):
        if tag == "type":
            print("Type:", self.type)
        self.CurrentData = ""
    # Called for character data; the text of one element may arrive in several chunks, so append
    def characters(self, content):
        if self.CurrentData == "type":
            self.type += content

# Create an XMLReader
parser = xml.sax.make_parser()
# Replace the default ContentHandler
Handler = MovieHandler()
parser.setContentHandler(Handler)
parser.parse("movies.xml")
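A handler that prints as it goes is hard to reuse or test. The sketch below shows one alternative, under stated assumptions: it collects results in a list and buffers character data until the end tag, because a SAX parser is allowed to deliver the text of a single element in several characters() calls. The MovieCollector class and the inline sample document are hypothetical.

```python
import xml.sax

# Hypothetical stand-in for movies.xml; the entity makes chunked delivery likely
MOVIES_XML = b"""<collection>
  <movie title="Movie A"><type>Science &amp; Fiction</type></movie>
  <movie title="Movie B"><type>Comedy</type></movie>
</collection>"""


class MovieCollector(xml.sax.ContentHandler):
    """Collect (title, type) pairs instead of printing them."""

    def __init__(self):
        super().__init__()
        self.movies = []
        self._title = ""
        self._buffer = None  # a list while inside a type element, else None

    def startElement(self, tag, attributes):
        if tag == "movie":
            self._title = attributes.get("title", "")
        elif tag == "type":
            self._buffer = []

    def characters(self, content):
        # Only keep text that belongs to the element being collected
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, tag):
        if tag == "type":
            # Join all chunks received for this element
            self.movies.append((self._title, "".join(self._buffer).strip()))
            self._buffer = None


handler = MovieCollector()
xml.sax.parseString(MOVIES_XML, handler)
print(handler.movies)
```

For a file on disk, xml.sax.parse("movies.xml", handler) would replace the parseString call.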
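3. Analyzing large XML datasets with Python
In practice, large XML datasets usually need further processing and analysis. This section uses campaign contribution data from the US Federal Election Commission as an example to walk through the method and workflow for analyzing a large XML dataset with Python.
3.1 About the data
The Federal Election Commission (FEC) publishes US campaign contribution data covering roughly the last 20 years. The dataset contains more than one million contribution records totaling over 2.1 billion US dollars, and records each contributor's name, address, contribution date, contribution amount, and related information. It is a typical large XML dataset in which each contribution is wrapped in its own XML element. To make this concrete, here is an excerpt from the dataset: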
Sample from the FEC file (three records, values only):
ELECTRON KIN
MARK BLEDSOE TRADESHOWS ETC.
SPRING
TX
77373
409712-S8
58T1JOY8G051214MN0117252
1
1995
2006-04-07T00:00:00+00:00
9E166D6990B8C4A8
2500.00
15

HARRINGTON
CHARLES
ROCHESTER
NY
14618
998710-S6
05H7MI8V781100306N
1
2002
2003-03-14T00:00:00+00:00
5BFA637AB00A5657
2000.00
15

GEIGER
WARNER
PIERMONT
NY
10968
1243679-S4
PA52IJDO8G051053OK0154693
1
2002
2003-03-14T00:00:00+00:00
5BFAD1C6B00A5657
500.00
15
3.2 Reading the data
First, the data has to be read and parsed with Python. Because the dataset is very large, SAX parsing is used. The code is as follows:
import xml.sax

class FECDataHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.CurrentData = ""
        self.rec_id = ""
        self.amount = ""
        self.timestamp = ""
        self.contributor_name = ""
        self.zipcode = ""
        self.contributor_state = ""
        self.contributor_occupation = ""
        self.contributor_employer = ""
        self.report_year = ""
    # Handle element start events
    def startElement(self, tag, attributes):
        self.CurrentData = tag
        if tag == "contributor":
            # A new record starts: store its id and reset all fields
            self.rec_id = attributes.get("rec-id", "")
            self.amount = ""
            self.timestamp = ""
            self.contributor_name = ""
            self.zipcode = ""
            self.contributor_state = ""
            self.contributor_occupation = ""
            self.contributor_employer = ""
            self.report_year = ""
    # Handle element end events
    def endElement(self, tag):
        # Compare against the closing tag itself: CurrentData is already
        # empty by the time the closing contributor tag is reached
        if tag == "timestamp":
            # Keep only the date part (YYYY-MM-DD)
            self.timestamp = self.timestamp[:10]
        elif tag == "first-name" or tag == "last-name":
            # Separate the parts of the name with a single space
            self.contributor_name = self.contributor_name.strip() + " "
        elif tag == "zipcode":
            self.zipcode = self.zipcode.zfill(5)
        elif tag == "state":
            self.contributor_state = self.contributor_state.upper()
        elif tag == "contributor":
            print("Recordid: %s, Name: %s, State: %s, Zip: %s, Amount: %s, Time: %s"
                  % (self.rec_id, self.contributor_name.strip(), self.contributor_state,
                     self.zipcode, self.amount, self.timestamp))
        self.CurrentData = ""
    # Handle character data; one element's text may arrive in several chunks, so always append
    def characters(self, content):
        if self.CurrentData == "amount":
            self.amount += content.strip()
        elif self.CurrentData == "timestamp":
            self.timestamp += content.strip()
        elif self.CurrentData == "first-name" or self.CurrentData == "last-name":
            self.contributor_name += content
        elif self.CurrentData == "zipcode":
            self.zipcode += content.strip()
        elif self.CurrentData == "state":
            self.contributor_state += content.strip().upper()
        elif self.CurrentData == "occupation" or self.CurrentData == "employer":
            # Remove characters that would cause trouble in later processing
            text = content.strip()
            text = text.replace("&", "and")
            text = text.replace("|", "")
            text = text.replace("<", "")
            text = text.replace(">", "")
            text = text.replace("'", "")
            text = text.replace('"', "")
            if self.CurrentData == "occupation":
                self.contributor_occupation += text
            else:
                self.contributor_employer += text
        elif self.CurrentData == "report-year":
            self.report_year += content.strip()

# Create an XMLReader
parser = xml.sax.make_parser()
# Replace the default ContentHandler
Handler = FECDataHandler()
parser.setContentHandler(Handler)
# Parse the XML file
parser.parse("FEC.xml")
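The analysis in section 3.3 reads a file named FEC.csv, while the handler above only prints each record. One possible bridge, sketched here under assumptions, is a SAX handler that writes each record as a CSV row. The element names (contributor, amount, timestamp, state, zipcode) and the rec-id attribute are taken from the handler above; the CSV column names follow the analysis code where it names them (amount, contributor_state), and the rest, along with the CsvExportHandler class, the xml_to_csv function, and the inline sample records, are hypothetical.

```python
import csv
import io
import xml.sax

FIELDS = ["rec_id", "amount", "timestamp", "contributor_state", "zipcode"]
# XML element name (assumed) -> CSV column name
TAG_TO_COLUMN = {
    "amount": "amount",
    "timestamp": "timestamp",
    "state": "contributor_state",
    "zipcode": "zipcode",
}


class CsvExportHandler(xml.sax.ContentHandler):
    """Write one CSV row per contributor element."""

    def __init__(self, writer):
        super().__init__()
        self.writer = writer
        self.row = {}
        self.buffer = None  # a list while inside a tracked element

    def startElement(self, tag, attributes):
        if tag == "contributor":
            self.row = {"rec_id": attributes.get("rec-id", "")}
        elif tag in TAG_TO_COLUMN:
            self.buffer = []

    def characters(self, content):
        if self.buffer is not None:
            self.buffer.append(content)

    def endElement(self, tag):
        if tag in TAG_TO_COLUMN and self.buffer is not None:
            value = "".join(self.buffer).strip()
            if tag == "timestamp":
                value = value[:10]  # keep only the date part
            self.row[TAG_TO_COLUMN[tag]] = value
            self.buffer = None
        elif tag == "contributor":
            # The record is complete; missing columns are written as ""
            self.writer.writerow(self.row)


def xml_to_csv(xml_source, csv_file):
    """Stream xml_source (a file name or binary file object) into csv_file."""
    writer = csv.DictWriter(csv_file, fieldnames=FIELDS, restval="")
    writer.writeheader()
    xml.sax.parse(xml_source, CsvExportHandler(writer))


# Hypothetical two-record input standing in for FEC.xml
SAMPLE = b"""<records>
  <contributor rec-id="1"><state>TX</state><zipcode>77373</zipcode>
    <timestamp>2006-04-07T00:00:00+00:00</timestamp><amount>2500.00</amount></contributor>
  <contributor rec-id="2"><state>NY</state><zipcode>14618</zipcode>
    <timestamp>2003-03-14T00:00:00+00:00</timestamp><amount>2000.00</amount></contributor>
</records>"""

out = io.StringIO(newline="")
xml_to_csv(io.BytesIO(SAMPLE), out)
rows = list(csv.DictReader(io.StringIO(out.getvalue())))
print(rows)
```

For the real file, the same function could be called with a file name and an output file opened with newline="" (for example open("FEC.csv", "w", newline="", encoding="ISO-8859-1")), so that only one record is held in memory at a time.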
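3.3 Data analysis
Next, the dataset is analyzed with Python to answer some basic questions about campaign contributions. The analysis code reads the records from a CSV file, FEC.csv, that has amount and contributor_state columns.
3.3.1 Question 1: Distribution of contribution amounts
Counting the contributions that fall into each amount range, together with the overall total and average, shows how contribution amounts are distributed. The code is as follows: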
import csv
import matplotlib.pyplot as plt

def plot_donors_by_amount():
    # Pre-fill every range so the bars always appear in ascending order
    labels = ['<10', '10-20', '20-50', '50-100', '100-500', '500-1000', '1000-5000', '>=5000']
    amount_map = {label: 0 for label in labels}
    amount_total = 0
    count_total = 0
    with open('FEC.csv', encoding='ISO-8859-1', mode='r', newline='') as f:
        reader = csv.DictReader(f)
        for row in reader:
            try:
                amount = float(row['amount'])
            except (TypeError, ValueError):
                # Skip rows whose amount is missing or not a number
                continue
            amount_total += amount
            count_total += 1
            if amount < 10:
                amount_map['<10'] += 1
            elif amount < 20:
                amount_map['10-20'] += 1
            elif amount < 50:
                amount_map['20-50'] += 1
            elif amount < 100:
                amount_map['50-100'] += 1
            elif amount < 500:
                amount_map['100-500'] += 1
            elif amount < 1000:
                amount_map['500-1000'] += 1
            elif amount < 5000:
                amount_map['1000-5000'] += 1
            else:
                amount_map['>=5000'] += 1
    # Draw a bar chart of the number of donations per range
    plt.bar(range(len(amount_map)), list(amount_map.values()), align='center')
    plt.xticks(range(len(amount_map)), list(amount_map.keys()))
    plt.title('Donation Amount Distribution')
    plt.xlabel('Amount Range')
    plt.ylabel('Number of Donations')
    plt.show()
    # Print the total amount, the number of donations, and the average donation
    print('Total amount of donations:', amount_total)
    print('Total number of donations:', count_total)
    if count_total:
        print('Average donation amount:', amount_total / count_total)

plot_donors_by_amount()
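The range assignment in the function above can also be expressed as a lookup against a sorted list of boundaries, which keeps the boundaries in one place and can be tested without a CSV file or a plot. This is a minimal sketch, not the article's method: the names BIN_EDGES, BIN_LABELS, amount_bin, and count_by_bin are made up here, and the three amounts come from the sample records in section 3.1.

```python
from bisect import bisect_right

# Upper boundaries of the ranges, in ascending order
BIN_EDGES = [10, 20, 50, 100, 500, 1000, 5000]
# One label per range; there is always one more label than boundaries
BIN_LABELS = ['<10', '10-20', '20-50', '50-100',
              '100-500', '500-1000', '1000-5000', '>=5000']


def amount_bin(amount):
    """Return the label of the range that contains amount.

    bisect_right counts how many boundaries are <= amount, which is
    exactly the index of the matching label (a boundary value itself
    falls into the higher range, as in the if/elif chain).
    """
    return BIN_LABELS[bisect_right(BIN_EDGES, amount)]


def count_by_bin(amounts):
    """Count how many amounts fall into each range, keeping label order."""
    counts = dict.fromkeys(BIN_LABELS, 0)
    for amount in amounts:
        counts[amount_bin(amount)] += 1
    return counts


# Amounts of the three sample records shown in section 3.1
counts = count_by_bin([2500.00, 2000.00, 500.00])
print(counts)
```

Negative amounts (refunds, if the data contains any) land in the '<10' range, just as they do in the if/elif version.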
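3.3.2 Question 2: Contributions by state
Counting the number of contributions and the total amount contributed for each state shows how giving varies across states. The code is as follows: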
import csv
import numpy as np
import matplotlib.pyplot as plt
# Basemap is distributed separately from matplotlib and must be installed on its own
from mpl_toolkits.basemap import Basemap

def plot_donors_by_state():
    state_map_donors = {}
    state_map_amount = {}
    with open('FEC.csv', encoding='ISO-8859-1', mode='r') as f:
        reader = csv.DictReader(f)
        for row in reader:
            state = row['contributor_state']
            if state == "":
                continue
            try:
                amount = float(row['amount'])
            except ValueError:
                continue
            # Count contributions and sum amounts per state
            state_map_donors[state] = state_map_donors.get(state, 0) + 1
            state_map_amount[state] = state_map_amount.get(state, 0) + amount
    # Collect the per-state values for the donor map
    state_names = []
    state_donors = []
    state_amounts = []
    for state in state_map_donors.keys():
        state_names.append(state)
        state_donors.append(state_map_donors[state])
        state_amounts.append(state_map_amount[state])
    fig = plt.figure(figsize=(16, 8))
    ax1 = fig.add_subplot(121)
    # Mercator map of the contiguous United States (note that the name map shadows the built-in)
    map = Basemap(projection='merc', lat_0=39.5, lon_0=-99, resolution='i', area_thresh=0.1,
                  llcrnrlon=-124.8, llcrnrlat=25, urcrnrlon=-66.9, urcrnrlat=49.5)
    map.drawcoastlines()
    map.drawcountries()
    map.drawstates()
    # Convert longitude/latitude pairs to map coordinates and plot them
    x, y = map(np.array([-122, -75]), np.array([47, 28]))
    ax1.plot(x, y, 'bo', markersize=3)
    x, y = map(np.array([-104, -87]), np.array([38, 26]))
    ax1.plot(x, y, 'bo', markersize=3)
    x, y = map(np.array([-121]), np.array([37]))
    ax1.plot(x, y, 'go', markersize=5)
    ax1.text(x[0] + 200000, y[0] + 100000, 'San Francisco', fontsize=10)
    ax1.text(x[0] - 800000, y[0] + 100000, 'Los Angeles', fontsize=10)
    ax1.text