Implementing a Web Crawler in PHP - 猿码集

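1. What is a crawler

A crawler is a program that automatically browses the web and extracts data. Crawlers typically power services such as search engine indexing, price comparison sites, and news aggregators. A search engine's crawlers visit pages ahead of time and build an index; when a user enters a keyword, the engine answers from that index and returns the matching pages, rather than crawling the web at query time.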

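2. How a PHP crawler works

A PHP crawler works in three steps:

2.1. Send an HTTP request and fetch the HTML

A PHP crawler uses the cURL extension to send an HTTP GET request to a given URL. The response to such a request is usually an HTML document. The extension is a wrapper around libcurl, which lets a PHP program retrieve data from servers directly over protocols such as HTTP and FTP.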


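// Fetch the page body as a string instead of printing it directly.
$url = 'https://www.example.com';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch); // returns false on failure
if ($html === false) {
    exit('cURL error: ' . curl_error($ch));
}
curl_close($ch);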
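The request above sets a single option. A crawler that visits arbitrary sites usually also wants timeouts, redirect handling, and a status check. The following is a minimal sketch of that idea, not a definitive implementation: the function name fetchHtml, the user agent string, and the specific limits are assumptions chosen for illustration.

```php
<?php
// Hypothetical helper (not part of the original article):
// returns the response body, or null if the request fails or the
// final HTTP status is not 2xx.
function fetchHtml(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,                 // return the body as a string
        CURLOPT_FOLLOWLOCATION => true,                 // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,                    // assumed redirect limit
        CURLOPT_CONNECTTIMEOUT => 5,                    // seconds to wait for the connection
        CURLOPT_TIMEOUT        => $timeoutSeconds,      // seconds for the whole transfer
        CURLOPT_USERAGENT      => 'ExampleCrawler/0.1', // assumed, illustrative name
    ]);

    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
    // No explicit close: the handle is released when $ch goes out of scope.

    if ($body === false || $status < 200 || $status >= 300) {
        return null;
    }
    return $body;
}
```

Calling fetchHtml('https://www.example.com') would then yield either the page's HTML or null, so the caller can skip failed pages without inspecting cURL state itself.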

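2.2. Parse the HTML and extract the information you need

A PHP crawler parses the HTML document with PHP's built-in DOMDocument class. Methods such as getElementsByTagName (on the document) and getAttribute (on each element) give access to the data you want.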


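// Collect libxml warnings about malformed HTML instead of silencing them with @.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
// Print the href of every <a> element, one per line.
$links = $doc->getElementsByTagName('a');
foreach ($links as $link) {
    $href = $link->getAttribute('href');
    echo $href, PHP_EOL;
}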
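getElementsByTagName('a') also returns anchors that have no href attribute, for which getAttribute yields an empty string. One possible alternative is to select only the relevant elements with DOMXPath. The sketch below assumes a hypothetical helper named extractLinks and an inline sample document, so it runs without any network access.

```php
<?php
// Hypothetical helper (not part of the original article):
// returns the href and visible text of every <a> element that has an href.
function extractLinks(string $html): array
{
    if ($html === '') {
        return []; // DOMDocument::loadHTML() rejects an empty string in PHP 8
    }

    // Collect parse warnings internally, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query('//a[@href]') as $a) {
        $links[] = [
            'href' => $a->getAttribute('href'),
            'text' => trim($a->textContent),
        ];
    }
    return $links;
}

// Sample document: two real links and one anchor without href.
$sample = '<html><body>'
    . '<a href="/about">About</a>'
    . '<a name="top">no href</a>'
    . '<a href="https://www.example.com/">Home</a>'
    . '</body></html>';

$links = extractLinks($sample);
foreach ($links as $link) {
    echo $link['href'], ' => ', $link['text'], PHP_EOL;
}
```

Note that hrefs such as /about are relative; a crawler that follows them would need to resolve them against the page URL first, which this sketch does not attempt.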

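2.3. Store the extracted information

The extracted information can be stored in a database, in a file, or in memory for later use. A database is the most common choice, accessed through either PDO or mysqli.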


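// Connection details are placeholders; a links table with a url column is assumed to exist.
$db = new PDO('mysql:host=localhost;dbname=test', 'username', 'password');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION); // throw on SQL errors
// Prepare once, then execute once per link.
$stmt = $db->prepare("INSERT INTO links (url) VALUES (:url)");
foreach ($links as $link) {
    $url = $link->getAttribute('href');
    if ($url !== '') {
        $stmt->execute([':url' => $url]);
    }
}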
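The loop above inserts every link it is given, so a URL that appears twice on a page is stored twice. One way to avoid that is a UNIQUE constraint combined with an insert that ignores conflicts, wrapped in a transaction. The sketch below is an illustration under stated assumptions: it uses an in-memory SQLite database so it runs without a server, the table layout and the storeLinks helper are hypothetical, and the INSERT OR IGNORE syntax is SQLite's (MySQL spells it INSERT IGNORE).

```php
<?php
// Hypothetical helper (not part of the original article):
// stores URLs, skipping duplicates, and returns how many rows were added.
function storeLinks(PDO $db, array $urls): int
{
    // SQLite syntax; on MySQL the equivalent statement is INSERT IGNORE.
    $stmt = $db->prepare('INSERT OR IGNORE INTO links (url) VALUES (:url)');
    $inserted = 0;

    $db->beginTransaction();
    try {
        foreach ($urls as $url) {
            $stmt->execute([':url' => $url]);
            $inserted += $stmt->rowCount(); // 0 when the URL already existed
        }
        $db->commit();
    } catch (Throwable $e) {
        $db->rollBack(); // keep the table unchanged if any insert fails
        throw $e;
    }
    return $inserted;
}

// Assumed setup: in-memory SQLite instead of the article's MySQL server.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE links (id INTEGER PRIMARY KEY, url TEXT NOT NULL UNIQUE)');

storeLinks($db, [
    'https://www.example.com/a',
    'https://www.example.com/b',
    'https://www.example.com/a', // duplicate, skipped by the UNIQUE constraint
]);
$total = (int) $db->query('SELECT COUNT(*) FROM links')->fetchColumn();
echo "Rows stored: $total", PHP_EOL;
```

Running all inserts in one transaction also tends to be faster than committing each row separately, since the database flushes to disk once rather than per statement.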

3. Implementing a simple PHP crawler

Here is a simple PHP crawler example:


class Spider {
    private $url;
    private $content;

    public function __construct($url) {
        $this->url = $url;
        $this->fetch();
    }

    // Download the page; $content stays an empty string if the request fails.
    private function fetch() {
        $ch = curl_init($this->url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $result = curl_exec($ch);
        curl_close($ch);
        $this->content = ($result === false) ? '' : $result;
    }

    // Return the non-empty href of every <a> element in the page.
    public function parse() {
        $links = array();
        if ($this->content === '') {
            return $links; // nothing to parse
        }
        libxml_use_internal_errors(true); // tolerate malformed HTML
        $doc = new DOMDocument();
        $doc->loadHTML($this->content);
        libxml_clear_errors();
        foreach ($doc->getElementsByTagName('a') as $element) {
            $href = $element->getAttribute('href');
            if (!empty($href)) {
                $links[] = $href;
            }
        }
        return $links;
    }

    // Insert each link into the links table (connection details are placeholders).
    public function store($links) {
        $db = new PDO('mysql:host=localhost;dbname=test', 'username', 'password');
        $stmt = $db->prepare("INSERT INTO links (url) VALUES (:url)");
        foreach ($links as $link) {
            $stmt->execute([':url' => $link]);
        }
    }
}

// Fetch one page, extract its links, and save them.
$url = 'https://www.example.com';
$spider = new Spider($url);
$links = $spider->parse();
$spider->store($links);

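In the example above, the Spider class has three methods: fetch, parse, and store. fetch is private and is called by the constructor to retrieve the HTML from the given URL; parse parses that HTML and extracts the href of each a tag; store saves the extracted links to the database.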
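The Spider class handles a single page. A crawler in the sense of section 1, one that browses automatically, would also follow the links it finds. The sketch below illustrates that loop with a queue of pending URLs and a set of visited ones. It is only an illustration under stated assumptions: the crawl function and the $pages map are hypothetical, and the map stands in for real fetching and parsing so that the example runs offline.

```php
<?php
// Hypothetical stand-in for the network: URL => links found on that page.
$pages = [
    'https://www.example.com/'  => ['https://www.example.com/a', 'https://www.example.com/b'],
    'https://www.example.com/a' => ['https://www.example.com/b', 'https://www.example.com/'],
    'https://www.example.com/b' => ['https://www.example.com/c'],
    'https://www.example.com/c' => [],
];

// Breadth-first crawl: visits each URL at most once, up to $maxPages pages.
// $getLinks is any callable that maps a URL to the links on that page.
function crawl(string $startUrl, callable $getLinks, int $maxPages = 100): array
{
    $queue = new SplQueue();
    $queue->enqueue($startUrl);
    $visited = [$startUrl => true]; // URLs already queued or processed
    $order = [];                    // URLs in the order they were processed

    while (!$queue->isEmpty() && count($order) < $maxPages) {
        $url = $queue->dequeue();
        $order[] = $url;
        foreach ($getLinks($url) as $link) {
            if (!isset($visited[$link])) {
                $visited[$link] = true;
                $queue->enqueue($link);
            }
        }
    }
    return $order;
}

$order = crawl(
    'https://www.example.com/',
    fn(string $url): array => $pages[$url] ?? []
);
print_r($order);
```

In a real crawler the callable would fetch the page and parse its links, for instance by constructing a Spider and calling parse; the visited set is what keeps pages that link to each other from causing an endless loop.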
