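1. Introduction to BS4
Beautiful Soup is a Python library that parses HTML and XML documents into a tree of tags. It is used to extract specific data from an HTML file, such as article titles or paragraph text. Beautiful Soup only parses markup; it does not download pages itself, so crawlers usually pair it with an HTTP client library.
A typical BS4 workflow starts like this: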
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
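Here html_doc holds the HTML to parse, and the second argument names the parser. Beautiful Soup supports the HTML parser in Python's standard library as well as third-party parsers such as lxml and html5lib. The standard-library parser, 'html.parser', is a convenient default because it needs no extra dependencies; lxml is faster if it is installed.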
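The parser choice above can be wrapped in a small helper. This is a minimal sketch, not part of Beautiful Soup: the function name make_soup and the sample markup are hypothetical, and it assumes that falling back to 'html.parser' is acceptable when the preferred parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised when no installed parser matches the requested name.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello, <b>world</b></p>")
print(soup.b.string)
```

Different parsers can build slightly different trees from invalid markup, so it is worth pinning one parser explicitly when results must be reproducible.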
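2. Tag selectors
2.1 Selecting by tag name
Use soup.tagname or soup.find_all('tagname') to select specific tags in an HTML document. soup.tagname returns only the first matching tag, while find_all returns all of them. For example: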
from bs4 import BeautifulSoup
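html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''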
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.title)
print(soup.title.name)
print(soup.title.string)
print(soup.title.parent.name)
print(soup.p)
print(soup.find_all('p'))
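In the code above, soup.title selects the document's title tag; soup.title.name returns the tag's name, 'title'; soup.title.string returns its string content, The Dormouse's story; soup.title.parent.name returns the name of the parent tag, 'head'; soup.p selects the first p tag in the document; and soup.find_all('p') selects every p tag in the document.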
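The difference between the single-result and multi-result lookups can be sketched as follows. The HTML here is a hypothetical stand-in chosen to keep the example short; find() is the method form of the soup.tagname shortcut.

```python
from bs4 import BeautifulSoup

# Hypothetical document, used only for this illustration.
html = "<html><head><title>Demo</title></head><body><p>one</p><p>two</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")      # first <p> only, same element as soup.p
all_p = soup.find_all("p")    # list-like result holding every <p>
missing = soup.find("table")  # no <table> in the document, so this is None

print(first_p == soup.p)
print([p.string for p in all_p])
print(missing)
```

Because a missing tag yields None, chained access such as soup.table.name raises AttributeError; checking for None first avoids that.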
2.2 Selecting by attribute
Use soup.find_all('tagname', attrs={'attrname': 'attrvalue'}) or soup.select('tagname[attrname="attrvalue"]') to select HTML tags that carry a specific attribute value. For example:
from bs4 import BeautifulSoup
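html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''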
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all('a', attrs={'class': 'sister'}))
print(soup.select('a.sister'))
In the code above, soup.find_all('a', attrs={'class': 'sister'}) selects every a tag whose class attribute includes 'sister'. soup.select('a.sister') expresses the same query as a CSS selector and returns the same tags.
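The equivalent forms of an attribute query can be compared side by side. This is a sketch on a hypothetical snippet; the href, class, and id values are made up for the illustration. It also shows that a class query matches tags that carry additional classes.

```python
from bs4 import BeautifulSoup

# Hypothetical links, used only for this illustration.
html = '''
<a href="/a" class="sister" id="link1">Elsie</a>
<a href="/b" class="sister featured" id="link2">Lacie</a>
<a href="/c" class="other" id="link3">Tillie</a>
'''
soup = BeautifulSoup(html, "html.parser")

by_attrs = soup.find_all("a", attrs={"class": "sister"})
by_kwarg = soup.find_all("a", class_="sister")  # class_ avoids the Python keyword
by_css = soup.select("a.sister")                # CSS class selector
by_id = soup.select('a[id="link2"]')            # CSS attribute selector

print([a["id"] for a in by_attrs])
print([a["id"] for a in by_css])
print(by_id[0].string)
```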
2.3 Selecting by content
Use soup.find_all('tagname', string='content') to select HTML tags by their text content. For example:
from bs4 import BeautifulSoup
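html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''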
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all('a', string='Lacie'))
print(soup.find_all(string=['Elsie', 'Lacie', 'Tillie']))
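In the code above, soup.find_all('a', string='Lacie') selects every a tag whose string is exactly 'Lacie'. soup.find_all(string=['Elsie', 'Lacie', 'Tillie']) names no tag, so it returns the matching strings themselves rather than tags: every string in the document that equals 'Elsie', 'Lacie', or 'Tillie'.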
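Since a plain string matches only exactly, a partial match needs a compiled regular expression. The following sketch uses a hypothetical snippet to contrast the two; the names in it are illustrative.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical snippet, used only for this illustration.
html = "<p><a>Elsie</a>, <a>Lacie</a> and <a>Tillie Smith</a></p>"
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all("a", string="Tillie")                # exact match: nothing found
partial = soup.find_all("a", string=re.compile("Tillie"))  # regex search: one tag found
strings = soup.find_all(string=["Elsie", "Lacie"])         # returns strings, not tags

print(exact)
print(partial)
print([str(s) for s in strings])
```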
3. Working with tag attributes
3.1 Getting an attribute value
Use tag['attrname'] or tag.get('attrname') to read the value of a specific attribute. For example:
from bs4 import BeautifulSoup
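html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''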
soup = BeautifulSoup(html_doc, 'html.parser')
link = soup.select('a')[0]
print(link['href'])
print(link.get('href'))
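In the code above, link['href'] and link.get('href') both return the href attribute value of the a tag. They differ only when the attribute is missing: link['href'] raises KeyError, while link.get('href') returns None.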
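The two access styles behave differently for a missing attribute, which the following sketch shows on a hypothetical link. It also shows that a multi-valued attribute such as class comes back as a list.

```python
from bs4 import BeautifulSoup

# Hypothetical link, used only for this illustration.
soup = BeautifulSoup('<a href="/lacie" class="sister big">Lacie</a>', "html.parser")
link = soup.a

print(link["href"])              # /lacie
print(link.get("title"))         # None, because there is no title attribute
print(link.get("title", "n/a"))  # a default can be supplied, as with dict.get
print(link["class"])             # ['sister', 'big']

try:
    link["title"]
except KeyError:
    print("no title attribute")
```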
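3.2 Modifying and adding attribute values
A tag's attributes behave like a dictionary, so modifying and adding use the same assignment syntax:
To change the value of an existing attribute, assign tag['attrname'] = 'attrvalue' or tag.attrs['attrname'] = 'attrvalue'.
To add a new attribute, assign tag['newattr'] = 'newvalue' or tag.attrs['newattr'] = 'newvalue'.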
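The assignments described above can be sketched on a hypothetical link. Deleting an attribute with del is not covered in the text; it is included here only as the natural counterpart of the dictionary-style access.

```python
from bs4 import BeautifulSoup

# Hypothetical link, used only for this illustration.
soup = BeautifulSoup('<a href="/old" id="link1">Lacie</a>', "html.parser")
link = soup.a

link["href"] = "/new"             # modify an existing attribute
link.attrs["target"] = "_blank"   # add a new attribute through .attrs
del link["id"]                    # remove an attribute

print(link)  # <a href="/new" target="_blank">Lacie</a>
```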
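4. Traversing the document tree
4.1 Child nodes
Use tag.contents or tag.children to get the direct children of an HTML tag. tag.contents is a list, while tag.children iterates over the same nodes. For example: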
from bs4 import BeautifulSoup
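html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
for child in soup.body.children:
    print(child)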
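In the code above, the loop over soup.body.children visits every direct child of the body tag, including any text nodes (such as line breaks) that sit between tags.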
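The relationship between .contents and .children can be sketched on a hypothetical snippet with no whitespace between tags, so that every child is a tag.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for this illustration.
soup = BeautifulSoup("<body><p>one</p><p>two <b>bold</b></p></body>", "html.parser")
body = soup.body

print(isinstance(body.contents, list))  # .contents is a plain list
print(len(body.contents))               # 2: only the direct children are counted

# .children yields the same nodes one at a time.
names = [child.name for child in body.children]
print(names)  # ['p', 'p']
```

The b tag does not appear because it is a grandchild of body, not a direct child.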
4.2 Descendant nodes
Use tag.descendants to iterate over all descendants of an HTML tag: its children, their children, and so on. For example:
from bs4 import BeautifulSoup
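html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
for child in soup.body.descendants:
    print(child)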
In the code above, the loop over soup.body.descendants visits every descendant of the body tag, at any depth.
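To make the contrast with direct children concrete, the sketch below counts both on a hypothetical snippet. Text nodes count as descendants too, so the Tag class is used to filter them out.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet, used only for this illustration.
soup = BeautifulSoup("<div><p>one <b>bold</b></p></div>", "html.parser")
div = soup.div

children = list(div.children)        # just the <p>
descendants = list(div.descendants)  # <p>, 'one ', <b>, 'bold' in document order

print(len(children), len(descendants))  # 1 4
print([d.name for d in descendants if isinstance(d, Tag)])  # ['p', 'b']
```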
4.3 Parent nodes
Use tag.parent to get the direct parent of an HTML tag, and tag.parents to iterate over all of its ancestors. For example:
from bs4 import BeautifulSoup
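html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.title.parent)
for parent in soup.a.parents:
    # The last ancestor is the BeautifulSoup object itself, named '[document]'
    print(parent.name)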
In the code above, soup.title.parent returns the direct parent of the title tag, which is the head tag, and soup.a.parents yields every ancestor of the first a tag in the document.
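The ancestor chain can be collected into a list to see its order. This sketch uses a hypothetical snippet; find_parent() is shown as a shortcut for jumping to a specific ancestor by name.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for this illustration.
soup = BeautifulSoup("<html><body><p><a>Lacie</a></p></body></html>", "html.parser")
link = soup.a

print(link.parent.name)                # p
print([p.name for p in link.parents])  # ['p', 'body', 'html', '[document]']
print(link.find_parent("body").name)   # body
```

The final entry, '[document]', is the name of the BeautifulSoup object at the root of the tree.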
4.4 Sibling nodes
Use tag.next_sibling to get the next sibling of an HTML tag and tag.previous_sibling to get the previous one. For example:
from bs4 import BeautifulSoup
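html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''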
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.a.next_sibling)
print(soup.a.previous_sibling)
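In the code above, soup.a.next_sibling returns the next sibling of the first a tag in the document, and soup.a.previous_sibling returns its previous sibling. A sibling may be a text node rather than a tag.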
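Because the text between two tags is itself a sibling, next_sibling often returns a string instead of the next tag. The sketch below, on a hypothetical snippet, contrasts it with find_next_sibling(), which skips ahead to the next matching tag.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for this illustration.
html = '<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

print(repr(first.next_sibling))            # the comma and newline between the links
print(first.find_next_sibling("a")["id"])  # link2
print(first.previous_sibling)              # None: nothing precedes the first link
```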
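5. Common operations
5.1 Formatting output
soup.prettify() lays the document out with one tag or string per line, indented by nesting depth, which makes it easier to read. Its formatter argument controls how strings and attribute values are written, not the layout: with soup.prettify(formatter=lambda s: s.strip()), each string is passed through the function before output, but the result is still indented with one node per line. To get the markup without the added line breaks and indentation, use str(soup). For example: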
from bs4 import BeautifulSoup
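html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''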
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
print(soup.prettify(formatter=lambda s: s.strip()))
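In the code above, the first print outputs the HTML with the line breaks and indentation that prettify() adds. The second print uses the same layout but writes each string and attribute value as returned by the function, stripped of surrounding whitespace.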
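The difference between the indented and the unindented output can be sketched as follows, on a hypothetical snippet. str(soup) serializes the tree without adding any layout of its own.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for this illustration.
soup = BeautifulSoup("<p>Tom &amp; <b>Jerry</b></p>", "html.parser")

compact = str(soup)       # markup as a single string, no added layout
pretty = soup.prettify()  # one tag or string per line, indented

print(compact)
print(pretty)
print(len(pretty.splitlines()) > len(compact.splitlines()))
```

Note that str(soup) keeps whatever whitespace the source document already contained; it only avoids adding more.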
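5.2 Getting and modifying tag content
Use tag.string to get the string content of an HTML tag, and tag.string.replace_with(newstring) to change it. tag.string is None when the tag has more than one child, so to collect all the text inside a tag, use tag.get_text(). For example: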
from bs4 import BeautifulSoup
html_doc = '''
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Lacie and