5. Python Web Scraping


Web Crawlers

How Crawlers Work

  • A crawler, also known as a web crawler or spider, is a program that automatically fetches web page content.
  • It imitates how a person browses the web: it sends HTTP requests, receives the page source, and then parses and extracts the data it needs, as in the minimal sketch below.
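
A minimal sketch of that fetch-then-parse loop, assuming the requests and beautifulsoup4 packages are installed and using https://example.com purely as a placeholder URL:

import requests
from bs4 import BeautifulSoup

# Step 1: send an HTTP GET request for the page
response = requests.get('https://example.com', timeout=10)
response.raise_for_status()  # stop early on 4xx/5xx responses

# Step 2: parse the returned HTML and extract the data we want (here, the page title)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title.get_text())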

HTTP Request and Response Process

  • The crawler first sends an HTTP request to the target website. The request can carry several parameters, such as the URL, the request method (GET or POST), and the request headers.
  • After receiving the request, the server returns an HTTP response, which consists of a status code, response headers, and a response body (the page content); the sketch below shows how each part can be inspected.
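
The three parts of the response are easy to examine with requests (https://httpbin.org/get is used here only as a convenient test endpoint; any reachable URL would do):

import requests

# Build the request: URL, method (GET), and headers
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('https://httpbin.org/get', headers=headers, timeout=10)

# Inspect the three parts of the response
print(response.status_code)              # status code, e.g. 200
print(response.headers['Content-Type'])  # one of the response headers
print(response.text[:200])               # the start of the response body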

Commonly Used Crawling Libraries

Category            Examples
Request libraries   requests, aiohttp, etc.
Parsing libraries   BeautifulSoup, lxml, PyQuery, etc.
Storage             pandas, SQLite, etc.
Async support       asyncio, aiohttp, etc.
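
Among these, the async libraries deserve a short illustration: asyncio drives the event loop and aiohttp issues the HTTP requests. A minimal sketch that fetches several pages concurrently (the URLs are placeholders):

import asyncio
import aiohttp

async def fetch(session, url):
    # Issue one GET request and return the response body as text
    async with session.get(url) as response:
        return await response.text()

async def main():
    urls = ['https://example.com', 'https://example.org']
    async with aiohttp.ClientSession() as session:
        # Schedule all requests at once and wait for them all to finish
        pages = await asyncio.gather(*(fetch(session, u) for u in urls))
    for url, html in zip(urls, pages):
        print(url, len(html))

if __name__ == '__main__':
    asyncio.run(main())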

Hands-on Examples

  1. Scrape the Douban Movie Top 250
import requests
from bs4 import BeautifulSoup
import csv

# Request headers: a browser User-Agent makes the requests look like ordinary page views
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

# Parse one page of results and write each movie as a CSV row
def parse_html(html, writer):
    soup = BeautifulSoup(html, 'lxml')
    movie_list = soup.find('ol', class_='grid_view').find_all('li')
    for movie in movie_list:
        title = movie.find('div', class_='hd').find('span', class_='title').get_text()
        rating_num = movie.find('div', class_='star').find('span', class_='rating_num').get_text()
        comment_num = movie.find('div', class_='star').find_all('span')[-1].get_text()
        writer.writerow([title, rating_num, comment_num])

# Fetch all 10 pages (25 movies per page) and save them to a CSV file
def save_data():
    with open('douban_movie_top250.csv', 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        writer.writerow(['电影名称', '评分', '评价人数'])  # title, rating, number of reviews
        for i in range(10):
            url = 'https://movie.douban.com/top250?start=' + str(i * 25) + '&filter='
            response = requests.get(url, headers=headers)
            parse_html(response.text, writer)

if __name__ == '__main__':
    save_data()
  2. Scrape book listings from Dangdang
import requests
from lxml import etree
import csv

# Search results page for the keyword "Python" on Dangdang
url = 'http://search.dangdang.com/?key=Python&act=input'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}


def first(items, default):
    # XPath queries return lists; take the first match or fall back to a default value
    return items[0] if items else default


def parse_html(html):
    # Walk the search result list and yield one record per book
    selector = etree.HTML(html)
    book_list = selector.xpath('//*[@id="search_nature_rg"]/ul/li')
    for book in book_list:
        title = first(book.xpath('a/@title'), "未知书名")
        link = first(book.xpath('a/@href'), "未知链接")
        price = first(book.xpath('p[@class="price"]/span[@class="search_now_price"]/text()'), "未知价格")
        author = first(book.xpath('p[@class="search_book_author"]/span[1]/a/@title'), "未知作者")
        publish_date = first(book.xpath('p[@class="search_book_author"]/span[2]/text()'), "未知出版日期")
        publisher = first(book.xpath('p[@class="search_book_author"]/span[3]/a/@title'), "未知出版社")

        yield {
            '书名': title,
            '链接': link,
            '价格': price,
            '作者': author,
            '出版日期': publish_date,
            '出版社': publisher
        }


def save_data():
    # Fetch the search page and write every parsed record to a CSV file
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        with open('dangdang_books.csv', 'w', newline='', encoding='utf-8-sig') as f:
            writer = csv.writer(f)
            writer.writerow(['书名', '链接', '价格', '作者', '出版日期', '出版社'])
            for item in parse_html(response.text):
                writer.writerow([item['书名'], item['链接'], item['价格'], item['作者'], item['出版日期'], item['出版社']])
    else:
        print(f"Request failed with status code {response.status_code}")


if __name__ == '__main__':
    save_data()
