selenium学习实战【Python爬虫】-EW帮帮网

selenium学习实战【Python爬虫】

在这里插入图片描述

文章目录

selenium学习实战【Python爬虫】

一、声明

本爬虫项目仅用于学习用途。严禁将本项目用于任何非法目的，包括但不限于恶意攻击网站、窃取用户隐私数据、破坏网站正常运营、商业侵权等行为。

二、学习目标

1.爬取网站链接：https://report.iresearch.cn；
2.爬取不付费的报告信息，标题、行业、作者、摘要和报告原件；

三、安装依赖

3.1 安装selenium库

只介绍主要的，别的库百度自行安装。
打开vscode终端，运行下面命令：

pip install -i https://pypi.douban.com/simple selenium

3.2 安装浏览器驱动

针对不同的浏览器，需要安装不同的驱动。下面列举了常见的浏览器与对应的驱动程序下载链接，部分网址需要 “科学上网” 才能打开哦（dddd）。

Firefox 浏览器驱动：https://github.com/mozilla/geckodriver/releases
Chrome 浏览器驱动：https://chromedriver.storage.googleapis.com/index.html
IE 浏览器驱动：http://selenium-release.storage.googleapis.com/index.html
Edge 浏览器驱动：https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
PhantomJS 浏览器驱动：https://phantomjs.org/
Opera 浏览器驱动：https://github.com/operasoftware/operachromiumdriver/releases

我用的时Edge浏览器，所以安装Edge驱动

3.2.1 查看Edge版本

点击三个点，再点击设置
在这里插入图片描述
点击左侧最下面关于Edge

3.2.2 驱动安装

下载好双击安装就行。

四、代码讲解

4.1 配置浏览器

# 创建 EdgeOptions 对象，用于配置浏览器选项
edge_options = Options()
# 添加参数忽略证书错误
edge_options.add_argument('--ignore-certificate-errors')
# 添加参数禁用扩展
edge_options.add_argument('--disable-extensions')
# 添加参数禁用沙盒模式
edge_options.add_argument('--no-sandbox')
# 添加参数禁用 GPU 加速
edge_options.add_argument('--disable-gpu')

# 创建 Edge 浏览器驱动实例
driver = webdriver.Edge(options=edge_options)

4.2 加载更多

打开艾瑞网的报告页面，我们会发现加载更多这个按钮，我们可以用代码点击按钮加载，直到自己需要的数量，我这里设置加载10次
在这里插入图片描述

鼠标右键点击页面–>点击检查，加载更多按钮定义在这里

在这里插入图片描述

# 打开指定 URL 的网页
driver.get(url)

try:
    # 用于保存找到的元素信息
    found_elements_info = []
    found_links = []
    
    wait = WebDriverWait(driver, 10)
    # 1. 点击"加载更多"按钮10次
    load_count = 0
    while load_count < 10:
        try:
            # 定位"加载更多"按钮
            load_more_btn = wait.until(
                EC.element_to_be_clickable(
                    (By.CSS_SELECTOR, 'button#loadbtn')  # 根据实际页面调整选择器
                )
            )
            
            # 点击按钮
            load_more_btn.click()
            load_count += 1
            print(f"已点击加载更多 {load_count}/10 次")
            
            # 等待新内容加载（根据页面加载速度调整）
            time.sleep(2)
            
        except Exception as e:
            print(f"点击加载更多失败: {e}")
            break
    
    # 2. 等待所有内容加载完成
    print("等待内容加载完成...")
    time.sleep(3)  # 额外等待确保所有内容加载完成

4.3 寻找内容

在这里插入图片描述

4.4 完整代码

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.edge.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

# 创建 EdgeOptions 对象，用于配置浏览器选项
edge_options = Options()
# 添加参数忽略证书错误
edge_options.add_argument('--ignore-certificate-errors')
# 添加参数禁用扩展
edge_options.add_argument('--disable-extensions')
# 添加参数禁用沙盒模式
edge_options.add_argument('--no-sandbox')
# 添加参数禁用 GPU 加速
edge_options.add_argument('--disable-gpu')

# 创建 Edge 浏览器驱动实例
driver = webdriver.Edge(options=edge_options)

url = 'https://report.iresearch.cn/'

# 打开指定 URL 的网页
driver.get(url)

try:
    # 用于保存找到的元素信息
    found_elements_info = []
    found_links = []
    
    wait = WebDriverWait(driver, 10)
    # 1. 点击"加载更多"按钮10次
    load_count = 0
    while load_count < 10:
        try:
            # 定位"加载更多"按钮
            load_more_btn = wait.until(
                EC.element_to_be_clickable(
                    (By.CSS_SELECTOR, 'button#loadbtn')  # 根据实际页面调整选择器
                )
            )
            
            # 点击按钮
            load_more_btn.click()
            load_count += 1
            print(f"已点击加载更多 {load_count}/10 次")
            
            # 等待新内容加载（根据页面加载速度调整）
            time.sleep(2)
            
        except Exception as e:
            print(f"点击加载更多失败: {e}")
            break
    
    # 2. 等待所有内容加载完成
    print("等待内容加载完成...")
    time.sleep(3)  # 额外等待确保所有内容加载完成
    
    # 3. 遍历所有找到的元素（包括初始加载和点击加载的）
    elements = driver.find_elements(By.CSS_SELECTOR, 'li[id^="freport."]')
    print(f"共找到 {len(elements)} 个元素")
    
    if elements:
        for elem in elements:
            element_text = elem.text
            
            # 提取链接（处理可能的异常）
            try:
                a_element = elem.find_element(By.CSS_SELECTOR, 'a')
                element_link = a_element.get_attribute('href')
                found_links.append(element_link)
            except:
                element_link = "未找到链接"
            
            # 保存信息
            found_elements_info.append(f"文本内容: {element_text}, 链接: {element_link}")
            print("找到元素：", element_text[:30] + "...")  # 只打印前30个字符避免刷屏
        
        # 4. 将找到的元素信息保存到文件
        with open('found_elements.txt', 'w', encoding='utf-8') as f:
            for info in found_elements_info:
                f.write(info + '\n')
        print("已将找到的元素信息保存到 found_elements.txt")
        
        # 5. 将找到的链接信息保存到文件
        with open('found_links.txt', 'w', encoding='utf-8') as f:
            for link in found_links:
                f.write(link + '\n')
        print("已将找到的链接信息保存到 found_links.txt")
    else:
        print("未找到符合条件的元素。")

except Exception as e:
    print("定位元素失败：", e)
    # 获取页面源代码，用于调试
    page_source = driver.page_source
    with open('page_source.html', 'w', encoding='utf-8') as f:
        f.write(page_source)
    print("已将页面源代码保存到 page_source.html，请查看分析")
finally:
    # 确保浏览器最终会被关闭
    time.sleep(2)  # 可根据需要调整等待时间，方便观察页面
    driver.quit()

五、报告文件爬取

已经修改完成，晚上续写

selenium学习实战【Python爬虫】