Using a Python Web Crawler to See Which Movies Are Now Showing in Theaters

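1. Choosing the Website to Scrape

First, we need to decide which website to scrape for movie information. Here we use Douban Movie (豆瓣电影), a large movie information site that provides fairly detailed data for each film, such as title, director, cast, release date, and rating. Its listings are also fairly comprehensive and cover essentially all movies currently in theaters.

1.1 Importing the Required Libraries

Before writing the crawler, we import the libraries it needs: requests, BeautifulSoup, pandas, and re. requests sends HTTP requests and retrieves page content, BeautifulSoup parses the HTML, pandas handles the tabular data, and re provides regular-expression matching.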

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

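1.2 Sending the Request and Fetching the Page

With requests, the get method sends a request to the site and returns the page content. We can also pass request headers such as User-Agent, Referer, and Accept-Encoding.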

# Douban "now playing" page for Shenzhen
url = 'https://movie.douban.com/cinema/nowplaying/shenzhen/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 Edge/16.16299',
    'Referer': 'https://movie.douban.com/',
    # 'br' is left out: requests only decodes Brotli when a brotli package is installed
    'Accept-Encoding': 'gzip, deflate'
}
response = requests.get(url, headers=headers)
response.encoding = 'utf-8'  # decode the body as UTF-8
page_text = response.text
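
The request step can also be wrapped in a small helper that fails loudly instead of passing an error page on to the parser. The following is a minimal sketch, not part of the original code: the name fetch_page, the 10-second timeout, and the injectable get argument are all assumptions made for illustration.

```python
def fetch_page(url, headers=None, timeout=10, get=None):
    """Return the decoded HTML of `url`, raising an exception on HTTP errors."""
    if get is None:
        # Imported lazily so the helper can be exercised with a stub `get`.
        import requests
        get = requests.get
    response = get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of returning them.
    response.raise_for_status()
    response.encoding = 'utf-8'
    return response.text
```

Calling fetch_page(url, headers=headers) would then stand in for the three lines that build response and page_text. The timeout keeps the script from hanging indefinitely if the server never answers.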

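2. Parsing the Page

Once we have the page's HTML, we use BeautifulSoup to parse it and extract information. BeautifulSoup lets us look up elements by tag name, class name, or attribute.

2.1 Parsing the HTML Page

Before we can query the page, we pass the fetched content to BeautifulSoup for parsing.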

soup = BeautifulSoup(page_text, 'html.parser')

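2.2 Finding Page Elements

After parsing, we need to locate elements on the page. For this crawler we want the title, rating, director, cast, and release date of each movie currently showing, all of which are wrapped in div tags on the page.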

titles = soup.find_all('div', {'class': 'info'})
print(titles)

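The code above uses find_all to locate every div tag whose class attribute is info. It returns a list-like ResultSet with one entry per movie, so we can iterate over the tag objects and pull out the fields we need.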
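
To see how find_all behaves without hitting the network, it can be run against a small inline snippet. This is only a sketch: SAMPLE_HTML is made-up markup that mimics the structure this article assumes, and the real Douban page may be laid out differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; not copied from the real page.
SAMPLE_HTML = """
<div class="info"><a href="/m/1">Movie A</a><span class="rating_num">8.1</span></div>
<div class="info"><a href="/m/2">Movie B</a><span class="rating_num">7.4</span></div>
<div class="other"><a href="/x">Not a movie</a></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Only the div tags whose class is "info" are returned.
blocks = soup.find_all('div', {'class': 'info'})

# The result can be iterated like a list; each item is a Tag object.
names = [block.find('a').text.strip() for block in blocks]
print(len(blocks), names)
```

The div with class other is skipped, so only the two movie blocks are collected.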

2.3 Extracting Movie Information

With a tag object for each movie, we extract the fields we need: title, director, cast, rating, and release date.

movies = []
for title in titles:
    # Text of the p tag with an empty class, which holds director, cast, and release date
    info_text = title.find('p', {'class': ''}).text.strip()
    movie = {}
    movie['title'] = title.find('a').text.strip()
    movie['director'] = re.findall(r'导演: (.*?) ', info_text)[0]
    movie['actors'] = re.findall(r'主演: (.*?) ', info_text)[0]
    movie['score'] = float(title.find('span', {'class': 'rating_num'}).text.strip())
    # Raw string so that \( is passed to the regex engine as an escaped parenthesis
    movie['release_date'] = re.findall(r'上映日期: (.*?)\(', info_text)[0]
    movies.append(movie)
print(movies)

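In the code above, the title and rating are read directly from the a and span tags, while re.findall extracts the director, cast, and release date with regular expressions. Each movie's fields are stored in a dictionary named movie, which is then appended to the movies list.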
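
Indexing the result of re.findall with [0] raises an IndexError whenever a field is missing from a movie's text. One way around that is a small helper that returns a default instead. This is a sketch under assumptions: extract_field is a hypothetical name, the sample string is invented to match the layout the patterns expect, and the 片长 (runtime) field is included only to show the fallback.

```python
import re

def extract_field(pattern, text, default=''):
    """Return the first capture group of `pattern` in `text`, or `default`."""
    match = re.search(pattern, text)
    return match.group(1).strip() if match else default

# Hypothetical info text, not taken from the real page.
sample = '导演: 张三 主演: 李四 / 王五 上映日期: 2023-05-01(中国大陆)'

director = extract_field(r'导演: (.*?) ', sample)
actors = extract_field(r'主演: (.*?) ', sample)
release_date = extract_field(r'上映日期: (.*?)\(', sample)
# This field is absent from the sample, so the default is returned.
runtime = extract_field(r'片长: (.*?)分钟', sample, default='unknown')

print(director, actors, release_date, runtime)
```

Note that the lazy pattern (.*?) followed by a space stops at the first space, so on this sample the cast comes back as only the first name. If the cast list is separated by " / ", the pattern would need to be adjusted to capture all of it.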

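3. Saving the Movie Information

Once all the movie information is collected, we save it to a local file for later processing. Here we use pandas to write the list of dictionaries to a CSV file.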

movies_df = pd.DataFrame(movies)
movies_df.to_csv('movie_info.csv', index=False)

The code first converts the list of dictionaries into a DataFrame, then calls to_csv to write it to movie_info.csv. Setting index=False keeps the row index out of the file.
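
The effect of index=False is easiest to see by comparing the two outputs side by side. The sketch below uses made-up records in the same shape as the movies list and relies on to_csv returning the CSV text when no file path is given.

```python
import pandas as pd

# Hypothetical records for illustration only.
sample_movies = [
    {'title': 'Movie A', 'score': 8.1},
    {'title': 'Movie B', 'score': 7.4},
]
df = pd.DataFrame(sample_movies)

# Without a path argument, to_csv returns a string instead of writing a file.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index.splitlines()[0])     # header begins with an empty index column
print(without_index.splitlines()[0])  # header contains only the data columns
```

If the file contains Chinese text and will be opened in Excel, passing encoding='utf-8-sig' to to_csv is a common choice, since the byte order mark helps Excel detect the encoding.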

4. Complete Code

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

# Douban "now playing" page for Shenzhen
url = 'https://movie.douban.com/cinema/nowplaying/shenzhen/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 Edge/16.16299',
    'Referer': 'https://movie.douban.com/',
    # 'br' is left out: requests only decodes Brotli when a brotli package is installed
    'Accept-Encoding': 'gzip, deflate'
}

# Fetch the page and decode it as UTF-8
response = requests.get(url, headers=headers)
response.encoding = 'utf-8'
page_text = response.text

# Parse the HTML and collect one div.info block per movie
soup = BeautifulSoup(page_text, 'html.parser')
titles = soup.find_all('div', {'class': 'info'})

movies = []
for title in titles:
    # Text of the p tag with an empty class, which holds director, cast, and release date
    info_text = title.find('p', {'class': ''}).text.strip()
    movie = {}
    movie['title'] = title.find('a').text.strip()
    movie['director'] = re.findall(r'导演: (.*?) ', info_text)[0]
    movie['actors'] = re.findall(r'主演: (.*?) ', info_text)[0]
    movie['score'] = float(title.find('span', {'class': 'rating_num'}).text.strip())
    movie['release_date'] = re.findall(r'上映日期: (.*?)\(', info_text)[0]
    movies.append(movie)

# Save the results to a CSV file without the row index
movies_df = pd.DataFrame(movies)
movies_df.to_csv('movie_info.csv', index=False)

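5. Conclusion

Running the code above scrapes the movies listed as currently showing in Shenzhen theaters and saves them to a local file. Along the way, we used requests to send the request and fetch the page, BeautifulSoup to parse the page and find elements, re to extract fields with regular expressions, and pandas to save the list of dictionaries as a CSV file.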
