Common Beautiful Soup 4 (BS4) Syntax Snippets for Python 3

1. Introduction to BS4

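Beautiful Soup is a Python library that parses HTML and XML documents into a tree of tags. It can be used to parse fairly large HTML files and extract specific data from them, such as article titles or paragraph text. Beautiful Soup does not download pages itself; in a web scraper it is usually paired with an HTTP library that fetches the markup.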

The typical BS4 workflow looks like this:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

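Here html_doc is the HTML to parse and the second argument is the name of the parser. Beautiful Soup supports the HTML parser in the Python standard library as well as third-party parsers such as lxml and html5lib. The standard-library parser is recommended here because it does not depend on any other library.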
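As a small illustration of the call above, the following sketch parses a short fragment with the standard-library parser. The markup string and the variable names are made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
html_doc = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# 'html.parser' ships with Python, so nothing besides bs4 has to be installed.
soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string   # text of the <title> tag
first_para = soup.p.get_text()   # text of the first <p> tag
print(title_text, first_para)
```

Passing the parser name explicitly keeps the result the same on machines where other parsers happen to be installed.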

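2. Tag selectors

2.1 Selecting by tag name

Use soup.tagname to select the first tag with that name, or soup.find_all('tagname') to select every such tag in the HTML document. For example: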

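from bs4 import BeautifulSoup

# Sample document, as used in the Beautiful Soup documentation
html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.title)              # the <title> tag
print(soup.title.name)         # 'title'
print(soup.title.string)       # The Dormouse's story
print(soup.title.parent.name)  # 'head'
print(soup.p)                  # the first <p> tag
print(soup.find_all('p'))      # every <p> tag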

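In the code above, soup.title selects the document's title tag; soup.title.name returns the tag's name, 'title'; soup.title.string returns the tag's string content, The Dormouse's story; soup.title.parent.name returns the name of the parent tag, 'head'; soup.p selects the first p tag in the document; and soup.find_all('p') selects all p tags in the document.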
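The difference between the two forms is easy to miss, so here is a minimal sketch of it. The two-paragraph markup is an assumption chosen to keep the example short.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two <p> tags and no <table>.
html_doc = "<div><p>first</p><p>second</p></div>"
soup = BeautifulSoup(html_doc, "html.parser")

first_p = soup.p                 # only the first <p>
all_p = soup.find_all("p")       # every <p>, in document order
missing = soup.table             # no such tag, so this is None

texts = [p.string for p in all_p]
print(first_p.string, texts, missing)
```

Because a missing tag yields None, chained access such as soup.table.name raises AttributeError; checking for None first avoids that.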

2.2 Selecting by attribute

Use soup.find_all('tagname', attrs={'attrname': 'attrvalue'}) or soup.select('tagname[attrname="attrvalue"]') to select HTML tags that have a particular attribute value. For example:

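from bs4 import BeautifulSoup

# Sample document, as used in the Beautiful Soup documentation
html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.find_all('a', attrs={'class': 'sister'}))  # filter on the class attribute
print(soup.select('a.sister'))                        # same tags via a CSS selector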

In the code above, soup.find_all('a', attrs={'class': 'sister'}) selects every a tag whose class attribute includes 'sister', and soup.select('a.sister') selects the same tags with a CSS selector.
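The sketch below puts several attribute-based lookups side by side. The markup, including the extra 'featured' class, is hypothetical and exists only to show how multi-valued class attributes are matched.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second link carries two classes.
html_doc = '''
<a class="sister" id="link1" href="/elsie">Elsie</a>
<a class="sister featured" id="link2" href="/lacie">Lacie</a>
<a class="other" id="link3" href="/tillie">Tillie</a>
'''
soup = BeautifulSoup(html_doc, "html.parser")

by_attrs = soup.find_all("a", attrs={"class": "sister"})  # dict of attribute filters
by_kw = soup.find_all("a", class_="sister")               # class_ avoids the reserved word
by_css = soup.select("a.sister")                          # CSS class selector
by_attr_css = soup.select('a[id="link2"]')                # CSS attribute selector

ids_attrs = [a["id"] for a in by_attrs]
ids_kw = [a["id"] for a in by_kw]
ids_css = [a["id"] for a in by_css]
ids_attr_css = [a["id"] for a in by_attr_css]
print(ids_attrs, ids_kw, ids_css, ids_attr_css)
```

A class filter matches any one of a tag's classes, which is why the link with two classes is included.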

2.3 Selecting by content

Use soup.find_all('tagname', string='content') to select HTML tags whose text is a particular string. For example:

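from bs4 import BeautifulSoup

# Sample document, as used in the Beautiful Soup documentation
html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.find_all('a', string='Lacie'))                  # <a> tags whose text is 'Lacie'
print(soup.find_all(string=['Elsie', 'Lacie', 'Tillie']))  # matching strings, not tags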

In the code above, soup.find_all('a', string='Lacie') selects every a tag whose string is exactly 'Lacie'. soup.find_all(string=['Elsie', 'Lacie', 'Tillie']) returns the strings inside tags (not the tags themselves) that equal 'Elsie', 'Lacie', or 'Tillie'.
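A plain string passed as string= has to match the whole text, which the following sketch contrasts with a regular expression. The markup and the pattern are assumptions made for the example.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup with three short links.
html_doc = "<p><a>Elsie</a>, <a>Lacie</a> and <a>Tillie</a></p>"
soup = BeautifulSoup(html_doc, "html.parser")

exact = soup.find_all("a", string="Lacie")                  # whole-string match
partial = soup.find_all("a", string="Lac")                  # substring: no match
by_regex = soup.find_all("a", string=re.compile("^[EL]"))   # regex search
only_strings = soup.find_all(string=["Elsie", "Tillie"])    # returns strings, not tags

exact_texts = [str(a.string) for a in exact]
regex_texts = [str(a.string) for a in by_regex]
string_values = [str(s) for s in only_strings]
print(exact_texts, partial, regex_texts, string_values)
```

A compiled pattern is the usual way to express "contains" or "starts with" when an exact match is too strict.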

3. Working with tag attributes

3.1 Reading attribute values

Use tag['attrname'] or tag.get('attrname') to read the value of a particular attribute of a tag. For example:

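from bs4 import BeautifulSoup

# Sample document, as used in the Beautiful Soup documentation
html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

link = soup.select('a')[0]  # the first <a> tag
print(link['href'])         # dictionary-style access
print(link.get('href'))     # same value via get()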

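In the code above, link['href'] and link.get('href') both return the value of the a tag's href attribute. They differ only when the attribute is missing: link['href'] raises KeyError, while link.get('href') returns None.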
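The sketch below shows how the two access styles behave when an attribute is absent, and what a multi-valued attribute such as class looks like. The markup and the 'untitled' fallback value are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical link with two classes and no title attribute.
html_doc = '<a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a>'
soup = BeautifulSoup(html_doc, "html.parser")
link = soup.a

href = link["href"]                              # plain string
classes = link["class"]                          # class is multi-valued, so a list
title = link.get("title")                        # missing attribute: None
title_default = link.get("title", "untitled")    # missing attribute: fallback value
has_id = link.has_attr("id")                     # explicit existence check

try:
    link["title"]                                # missing attribute: KeyError
    raised = False
except KeyError:
    raised = True

print(href, classes, title, title_default, has_id, raised)
```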

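3.2 Modifying or adding attribute values

A tag's attributes can be modified or added in the following two ways:

Assign tag['attrname'] = 'attrvalue' or tag.attrs['attrname'] = 'attrvalue' to change the value of an existing attribute.

Assign tag['newattr'] = 'newvalue' or tag.attrs['newattr'] = 'newvalue' to add a new attribute.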
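The two approaches above can be sketched as follows. The markup and the attribute names and values are made up for the illustration, and the deletion at the end is an extra step not covered by the text.

```python
from bs4 import BeautifulSoup

# Hypothetical link used only for this illustration.
html_doc = '<a href="http://example.com/elsie" id="link1">Elsie</a>'
soup = BeautifulSoup(html_doc, "html.parser")
link = soup.a

link["href"] = "http://example.com/elsie-new"   # modify an existing attribute
link["title"] = "Elsie"                         # add a new attribute
link.attrs["target"] = "_blank"                 # add via the attrs dictionary
del link["id"]                                  # remove an attribute (extra step)

print(link)
```

Both forms write to the same underlying dictionary, so they can be mixed freely.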

4. Navigating the document tree

4.1 Child nodes

Use tag.contents (a list) or tag.children (an iterator) to get all direct children of an HTML tag. For example:

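from bs4 import BeautifulSoup

# Sample document, as used in the Beautiful Soup documentation
html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

# Iterate over the direct children of <body>
for child in soup.body.children:
    print(child)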

In the code above, soup.body.children iterates over all direct children of the document's body tag, including the text nodes between the tags.
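Since children include whitespace text nodes, it is often useful to keep only the tags. The following is a minimal sketch of that filter, assuming a small hypothetical body with two paragraphs.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup; the newlines become text nodes between the tags.
html_doc = "<body>\n<p>one</p>\n<p>two</p>\n</body>"
soup = BeautifulSoup(html_doc, "html.parser")
body = soup.body

all_children = body.contents                                     # list of tags and strings
tag_children = [c for c in body.children if isinstance(c, Tag)]  # tags only

tag_texts = [t.string for t in tag_children]
print(len(all_children), tag_texts)
```

The isinstance check is one way to do this; body.find_all(recursive=False) is another that returns only the direct child tags.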

4.2 Descendant nodes

Use tag.descendants to get all descendants of an HTML tag, at every nesting level. For example:

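from bs4 import BeautifulSoup

# Sample document, as used in the Beautiful Soup documentation
html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

# Iterate over every node nested inside <body>, at any depth
for child in soup.body.descendants:
    print(child)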

In the code above, soup.body.descendants iterates over all descendants of the document's body tag: its children, their children, and so on.
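To make the contrast with direct children concrete, the sketch below lists both for the same tag. The nested markup is a hypothetical example.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical markup with two levels of nesting inside <div>.
html_doc = "<div><p>Hello <b>world</b></p></div>"
soup = BeautifulSoup(html_doc, "html.parser")
div = soup.div

direct = [c.name for c in div.children]                                        # one level down
nested_tags = [d.name for d in div.descendants if isinstance(d, Tag)]          # all levels
nested_text = [str(d) for d in div.descendants if isinstance(d, NavigableString)]

print(direct, nested_tags, nested_text)
```

Descendants are visited in document order, so a tag always appears before the nodes it contains.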

4.3 Parent nodes

Use tag.parent to get the direct parent of an HTML tag, and tag.parents to iterate over all of its ancestors. For example:

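from bs4 import BeautifulSoup

# Sample document, as used in the Beautiful Soup documentation
html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.title.parent)  # the <head> tag

# Walk upward from the first <a> tag; parents never yields None,
# and the last item is the BeautifulSoup object, named '[document]'
for parent in soup.a.parents:
    print(parent.name)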

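In the code above, soup.title.parent returns the head tag, the direct parent of the document's title tag. soup.a.parents yields every ancestor of the first a tag in the document, from its nearest parent up to the BeautifulSoup object itself, whose name is '[document]'.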
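The ancestor chain can be collected into a list to see its order. This is a minimal sketch on hypothetical markup; find_parent is shown as a shortcut for locating one specific ancestor.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <a> nested three levels deep.
html_doc = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

direct_parent = soup.a.parent.name                   # nearest ancestor
chain = [parent.name for parent in soup.a.parents]   # nearest first, root last
nearest_body = soup.a.find_parent("body")            # first ancestor named 'body'
root_parent = soup.parent                            # the root has no parent

print(direct_parent, chain, nearest_body.name, root_parent)
```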

4.4 Sibling nodes

Use tag.next_sibling to get the next sibling of an HTML tag, and tag.previous_sibling to get the previous one. For example:

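from bs4 import BeautifulSoup

# Sample document, as used in the Beautiful Soup documentation
html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.a.next_sibling)      # node right after the first <a>
print(soup.a.previous_sibling)  # node right before the first <a>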

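In the code above, soup.a.next_sibling returns the node immediately after the first a tag in the document, and soup.a.previous_sibling returns the node immediately before it. A sibling can be a text node as well as a tag.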
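Because the adjacent node is often a piece of text, the sketch below contrasts next_sibling with find_next_sibling, which skips ahead to the next matching tag. The markup is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two links separated by a comma and a space.
html_doc = '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html_doc, "html.parser")
first = soup.a

nxt = first.next_sibling                  # the text node between the links
prev = first.previous_sibling             # first child of <p>, so None
next_tag = first.find_next_sibling("a")   # next sibling that is an <a> tag

print(repr(nxt), prev, next_tag["id"])
```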

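5. Common operations

5.1 Compact output

soup.prettify() lays the markup out with line breaks and indentation, which makes the page easier to read. To post-process the text on output, pass a formatter, for example soup.prettify(formatter=lambda s: s.strip()); the function is called on each string and attribute value, here stripping their leading and trailing whitespace. For example: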

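from bs4 import BeautifulSoup

# Sample document, as used in the Beautiful Soup documentation
html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())                                # indented, one node per line
print(soup.prettify(formatter=lambda s: s.strip()))   # strings stripped by the formatter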

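In the code above, the first print outputs the HTML with the line breaks and indentation that prettify() adds. The second print strips surrounding whitespace from every string and attribute value, but prettify() still adds its own line breaks and indentation; for output without them, str(soup) is the usual choice.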
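The following sketch compares the two output forms on a one-line fragment, assuming that "compact" here simply means markup without added line breaks or indentation. The markup is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line markup.
markup = "<p>Hello <b>world</b></p>"
soup = BeautifulSoup(markup, "html.parser")

pretty = soup.prettify()   # adds line breaks and indentation
compact = str(soup)        # serializes the tree without reformatting it

print(pretty)
print(compact)
```

str(soup) keeps whatever whitespace the source markup had, so it is compact only when the input was.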

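5.2 Reading and modifying tag content

Use tag.string to get the string content of an HTML tag (it is None when the tag contains more than one child node), and tag.string.replace_with(newstring) to replace that string. To get all the text inside a tag, including the text of nested tags, use tag.get_text(). For example: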

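from bs4 import BeautifulSoup

# Sample document, as used in the Beautiful Soup documentation
html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and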