Common Syntax Snippets for Working with BS4 in Python 3

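1. Introduction to BS4

Beautiful Soup is a Python library that parses HTML and XML documents into a tree of tags. It can be used to parse fairly large HTML files and extract specific data from them, such as article titles or paragraph text. Beautiful Soup only parses markup; it does not download pages itself, so crawlers usually pair it with an HTTP client.

A typical BS4 workflow looks like this: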

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

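Here html_doc is the HTML source to parse, and the second argument names the parser. Beautiful Soup supports the HTML parser in the Python standard library as well as third-party parsers such as lxml and html5lib. The standard-library parser is a reasonable default because it needs no extra dependencies; lxml is generally faster, and html5lib is more lenient with malformed markup.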
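The workflow above can be tried end to end with a small document. This is a minimal sketch: the sample markup and the variable names title_text and first_paragraph are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration.
html_doc = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# 'html.parser' ships with Python; 'lxml' or 'html5lib' could be passed
# instead if those packages are installed.
soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string      # text of the <title> tag
first_paragraph = soup.p.get_text() # text of the first <p> tag
print(title_text)
print(first_paragraph)
```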

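2. Tag selectors

2.1 Selecting by tag name

Use soup.tagname to get the first tag with a given name, or soup.find_all('tagname') to get every matching tag. For example:

from bs4 import BeautifulSoup
html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''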
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.title)
print(soup.title.name)
print(soup.title.string)
print(soup.title.parent.name)
print(soup.p)
print(soup.find_all('p'))

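In the code above, soup.title selects the document's title tag, soup.title.name returns the tag name 'title', soup.title.string returns the tag's text, The Dormouse's story, and soup.title.parent.name returns the name of the parent tag, 'head'. soup.p selects the first p tag in the document, while soup.find_all('p') returns all p tags.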
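One detail worth seeing in code is what happens when a tag is absent. The sketch below uses a made-up two-paragraph document and shows that attribute-style access returns None for a missing tag, while find_all returns every match.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
html_doc = "<html><body><p>first</p><p>second</p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

first_p = soup.find("p")     # first <p>, the same tag soup.p refers to
all_p = soup.find_all("p")   # list-like result with every <p>
missing = soup.table         # the document has no <table>, so this is None

print(first_p.get_text())
print(len(all_p))
print(missing)
```

Because a missing tag yields None, chained access such as soup.table.name would raise AttributeError; checking for None first avoids that.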

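2.2 Selecting by attribute

Use soup.find_all('tagname', attrs={'attrname': 'attrvalue'}) or the CSS selector form soup.select('tagname[attrname="attrvalue"]') to select tags that carry a particular attribute. For example:

from bs4 import BeautifulSoup
html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''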
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all('a', attrs={'class': 'sister'}))
print(soup.select('a.sister'))

In the code above, soup.find_all('a', attrs={'class': 'sister'}) and soup.select('a.sister') both return every a tag whose class attribute includes 'sister'; a.sister is the CSS shorthand for a class match.
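A few other attribute filters are often handy alongside the two forms above. The following sketch uses hypothetical markup and shows the class_ keyword, the id keyword, and a CSS attribute selector.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
html_doc = '''
<a href="/a" class="sister" id="link1">A</a>
<a href="/b" class="sister big" id="link2">B</a>
<a href="/c" class="other" id="link3">C</a>
'''
soup = BeautifulSoup(html_doc, "html.parser")

# class_ (with a trailing underscore) matches any tag that has this class,
# even when the tag carries several classes.
by_class = soup.find_all("a", class_="sister")
ids_by_class = [a["id"] for a in by_class]

# Other attributes can be passed directly as keyword arguments.
by_id = soup.find(id="link3")

# CSS attribute selector, equivalent in spirit to attrs={'href': '/b'}.
by_css = soup.select('a[href="/b"]')

print(ids_by_class)
print(by_id.get_text())
print(by_css[0]["id"])
```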

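2.3 Selecting by content

Use soup.find_all('tagname', string='content') to select tags by their text content. For example:

from bs4 import BeautifulSoup
html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''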
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all('a', string='Lacie'))
print(soup.find_all(string=['Elsie', 'Lacie', 'Tillie']))

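In the code above, soup.find_all('a', string='Lacie') returns every a tag whose .string is exactly 'Lacie'; a plain string is matched as a whole, not as a substring. soup.find_all(string=['Elsie', 'Lacie', 'Tillie']) is called without a tag name, so it returns the matching text nodes themselves, that is, every string in the document equal to 'Elsie', 'Lacie', or 'Tillie'.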
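When a substring match is wanted instead of an exact one, a compiled regular expression can be passed as the string argument. A minimal sketch with made-up markup:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
html_doc = "<p><a>Lacie</a><a>Lacie Smith</a><a>Tillie</a></p>"
soup = BeautifulSoup(html_doc, "html.parser")

# A plain string must equal the tag's text exactly.
exact = soup.find_all("a", string="Lacie")

# A compiled pattern is applied with search(), so it matches substrings.
partial = soup.find_all("a", string=re.compile("Lacie"))
partial_texts = [a.get_text() for a in partial]

print(len(exact))
print(partial_texts)
```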

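3. Working with tag attributes

3.1 Reading attribute values

Use tag['attrname'] or tag.get('attrname') to read the value of an attribute. For example:

from bs4 import BeautifulSoup
html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''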
soup = BeautifulSoup(html_doc, 'html.parser')
link = soup.select('a')[0]
print(link['href'])
print(link.get('href'))

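In the code above, link['href'] and link.get('href') both return the href value of the first a tag. They differ when the attribute is missing: link['href'] raises KeyError, while link.get('href') returns None.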
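The difference between the two access styles shows up when an attribute is absent. The sketch below, on a made-up tag, also shows that class comes back as a list because it is a multi-valued attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
html_doc = '<a href="http://example.com/elsie" class="sister big">Elsie</a>'
soup = BeautifulSoup(html_doc, "html.parser")
link = soup.a

href = link.get("href")
title = link.get("title")                        # missing attribute: None
title_or_default = link.get("title", "no title") # fallback value
classes = link["class"]                          # multi-valued: a list

# Square-bracket access raises KeyError for a missing attribute.
try:
    link["title"]
    raised = False
except KeyError:
    raised = True

print(href, title, title_or_default, classes, raised)
```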

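3.2 Modifying and adding attribute values

Attributes are modified and added in the same way, by assignment:

To change the value of an existing attribute, assign tag['attrname'] = 'attrvalue' or tag.attrs['attrname'] = 'attrvalue'.

To add a new attribute, assign tag['newattr'] = 'newvalue' or tag.attrs['newattr'] = 'newvalue'.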
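The two assignment forms can be sketched as follows. The markup and attribute values are made up for illustration, and the del statement is shown as a related operation that the text above does not cover.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
html_doc = '<a href="http://example.com/elsie" class="sister">Elsie</a>'
soup = BeautifulSoup(html_doc, "html.parser")
link = soup.a

# Modify an existing attribute.
link["href"] = "http://example.com/new"

# Add a new attribute through the attrs dictionary.
link.attrs["target"] = "_blank"

# Remove an attribute (not covered above, shown for completeness).
del link["class"]

print(link)
```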

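4. Traversing the document tree

4.1 Child nodes

Use tag.contents or tag.children to get the direct children of a tag. tag.contents is a list, while tag.children is an iterator over the same nodes. For example:

from bs4 import BeautifulSoup
html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''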
soup = BeautifulSoup(html_doc, 'html.parser')
for child in soup.body.children:
    print(child)

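In the code above, the loop over soup.body.children visits every direct child of the body tag, including the text nodes (such as line breaks) that sit between tags.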
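Because whitespace between tags counts as a child node, it is common to filter the children down to tags only. A minimal sketch, assuming a made-up body with two paragraphs separated by a newline:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup for illustration.
html_doc = "<body><p>one</p>\n<p>two</p></body>"
soup = BeautifulSoup(html_doc, "html.parser")
body = soup.body

contents = body.contents         # a list: [<p>one</p>, '\n', <p>two</p>]
children = list(body.children)   # the same nodes, produced by an iterator

# Keep only real tags, dropping the text node between them.
tag_children = [c for c in body.children if isinstance(c, Tag)]
tag_names = [t.name for t in tag_children]

print(len(contents), tag_names)
```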

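4.2 Descendant nodes

Use tag.descendants to iterate over all descendants of a tag, that is, its children, their children, and so on. For example:

from bs4 import BeautifulSoup
html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''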
soup = BeautifulSoup(html_doc, 'html.parser')
for child in soup.body.descendants:
    print(child)

In the code above, the loop over soup.body.descendants visits every node nested anywhere inside the body tag, both tags and text nodes.
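The difference between children and descendants is easiest to see on a small nested fragment. The markup below is made up for illustration; text nodes have no tag name, so the sketch falls back to their string value.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
html_doc = "<div><p>Hello <b>world</b></p></div>"
soup = BeautifulSoup(html_doc, "html.parser")
div = soup.div

# Direct children only: just the <p> tag.
child_names = [c.name for c in div.children]

# All descendants in document order: tags and the strings inside them.
descendant_labels = [d.name if d.name else str(d) for d in div.descendants]

print(child_names)
print(descendant_labels)
```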

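4.3 Parent nodes

Use tag.parent to get the direct parent of a tag, and tag.parents to iterate over all of its ancestors. For example:

from bs4 import BeautifulSoup
html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.title.parent)
# parents never yields None; the last ancestor is the BeautifulSoup
# object itself, whose name is '[document]'
for parent in soup.a.parents:
    print(parent.name)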

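In the code above, soup.title.parent returns the direct parent of the title tag, which is the head tag, and soup.a.parents yields every ancestor of the first a tag, from the enclosing p tag up to the BeautifulSoup object itself, whose name is '[document]'.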
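Collecting the ancestor names into a list makes the order of traversal explicit. A minimal sketch on a made-up document:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
html_doc = "<html><body><p><a href='#'>x</a></p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

direct_parent = soup.a.parent.name

# Ancestors are yielded from the nearest outward; the final one is the
# BeautifulSoup object, which reports the name '[document]'.
ancestor_names = [parent.name for parent in soup.a.parents]

# The BeautifulSoup object is the root, so it has no parent.
root_parent = soup.parent

print(direct_parent)
print(ancestor_names)
print(root_parent)
```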

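4.4 Sibling nodes

Use tag.next_sibling to get the node that follows a tag at the same level, and tag.previous_sibling to get the node that precedes it. For example:

from bs4 import BeautifulSoup
html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''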
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.a.next_sibling)
print(soup.a.previous_sibling)

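In the code above, soup.a.next_sibling returns the node right after the first a tag and soup.a.previous_sibling returns the node right before it. Siblings include the text between tags, so these are often text nodes rather than tags.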
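To skip over text nodes and reach the next tag, find_next_sibling can be used instead of next_sibling. A minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
html_doc = '<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html_doc, "html.parser")
first = soup.a

raw_next = first.next_sibling            # the text node ',\n', not a tag
next_tag = first.find_next_sibling("a")  # the next <a> tag, skipping text
prev = first.previous_sibling            # nothing precedes the first <a>

print(repr(raw_next))
print(next_tag["id"])
print(prev)
```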

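5. Common operations

5.1 Pretty-printed and compact output

soup.prettify() lays the document out with line breaks and indentation, which makes the markup easier to read. prettify() also accepts a formatter argument; a callable such as lambda s: s.strip() is applied to each string and attribute value on output. For example:

from bs4 import BeautifulSoup
html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''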
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
print(soup.prettify(formatter=lambda s: s.strip()))

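In the code above, the first print outputs the indented, line-by-line form of the document. The second print applies the callable formatter to every string and attribute value, but the result still contains the line breaks and indentation that prettify() itself adds. For output without that added whitespace, convert the tree with str(soup) instead.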
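The contrast between indented and compact output can be sketched on a one-line fragment. The markup is made up for illustration; exact prettify() layout may vary between Beautiful Soup versions, so the sketch only compares line counts.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
html_doc = "<p>Hello <b>world</b></p>"
soup = BeautifulSoup(html_doc, "html.parser")

compact = str(soup)       # serializes the tree without adding whitespace
pretty = soup.prettify()  # adds line breaks and indentation

print(compact)
print(pretty)
print(pretty.count("\n") > compact.count("\n"))
```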

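5.2 Getting and modifying tag content

Use tag.string to get the text of a tag that contains a single string, and tag.string.replace_with(newstring) to change it. To get all the text inside a tag, including text in nested tags, use tag.get_text(). For example:

from bs4 import BeautifulSoup
html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.title.string)
print(soup.p.get_text())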
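The three calls described in this section behave differently on nested content, which the following sketch illustrates. The markup and the replacement text are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
html_doc = "<div><h1>Old title</h1><p>Hello <b>world</b></p></div>"
soup = BeautifulSoup(html_doc, "html.parser")

h1_text = soup.h1.string     # 'Old title': <h1> holds a single string
p_string = soup.p.string     # None: <p> has more than one child node
p_text = soup.p.get_text()   # text of <p> and everything nested in it

# Replace the string inside <h1> in place.
soup.h1.string.replace_with("New title")
new_h1 = soup.h1.string

# separator and strip control how the collected strings are joined.
all_text = soup.div.get_text(" ", strip=True)

print(h1_text, p_string, p_text)
print(new_h1)
print(all_text)
```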


            