A Complete Guide to Getting Past Bot Verification in Selenium Automated Testing: From Principles to Practice

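In web automation we often run into a site's bot verification. These mechanisms are designed to tell human users apart from automated programs, but they also get in the way of Selenium automation tasks. This article looks at how bot verification works and how to configure Selenium so that automated scripts are less likely to be flagged.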

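How Bot Verification Works

Before solving the problem, we need to understand how websites detect bots. Modern sites mainly rely on the following signals to identify automated programs: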

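Browser fingerprinting: every browser has a distinctive fingerprint made up of the User-Agent, WebGL rendering output, font list, time zone, and similar information. Automated programs usually present a fixed fingerprint, which makes them easy to recognize.
WebDriver feature detection: automation tools such as Selenium expose WebDriver-specific traits, most notably the navigator.webdriver property, which is true in an automated browser and is one of the main markers detectors look for.
Behavior pattern analysis: human browsing is irregular in scroll speed, click position, and dwell time. Automated programs tend to behave too regularly and are easy to detect.
Environment anomaly detection: an automated environment may lack features or properties that a real browser has, such as media device access or particular browser extensions.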
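
To make these four signals concrete, here is a minimal sketch of how a detector might combine them. The fingerprint dictionary and its keys (webdriver, user_agent, plugin_count, click_intervals) and the 0.05 second threshold are hypothetical names and values chosen for illustration; real detection services use many more signals and do not publish their rules.

```python
from statistics import pstdev


def automation_signals(fingerprint: dict) -> list[str]:
    """Return the names of the (hypothetical) checks this fingerprint fails."""
    signals = []

    # WebDriver feature detection: navigator.webdriver is true under automation.
    if fingerprint.get("webdriver") is True:
        signals.append("webdriver_flag")

    # Fingerprinting: a headless browser can announce itself in the User-Agent.
    if "HeadlessChrome" in fingerprint.get("user_agent", ""):
        signals.append("headless_user_agent")

    # Environment anomaly: a desktop browser reporting no plugins is unusual.
    if fingerprint.get("plugin_count", 0) == 0:
        signals.append("no_plugins")

    # Behavior analysis: near-identical gaps between clicks look scripted.
    intervals = fingerprint.get("click_intervals", [])
    if len(intervals) >= 3 and pstdev(intervals) < 0.05:
        signals.append("regular_timing")

    return signals


scripted = {
    "webdriver": True,
    "user_agent": "Mozilla/5.0 HeadlessChrome/115.0.0.0",
    "plugin_count": 0,
    "click_intervals": [1.0, 1.0, 1.0],
}
print(automation_signals(scripted))
```

Each countermeasure later in the article targets one of these categories, so a table like this is a handy checklist when a script starts getting flagged.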

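With these detection mechanisms in mind, we can design targeted countermeasures.

A Complete Selenium Approach to Getting Past Bot Verification

The following Selenium script combines several techniques for getting past bot verification: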

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import random
import os
import json


def open_website_with_anti_detection():
    try:
        # 1. Basic configuration: browser options
        chrome_options = Options()

        # Use a dedicated user data directory to keep the browser fingerprint and login state
        user_data_dir = r"D:\python_project\anti_bot\UserData"
        os.makedirs(user_data_dir, exist_ok=True)
        chrome_options.add_argument(f"--user-data-dir={user_data_dir}")

        # 2. Core anti-detection settings: hide WebDriver traits
        chrome_options.add_argument("--disable-blink-features=AutomationControlled")  # stop Blink from flagging navigator.webdriver
        chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])  # drop the --enable-automation switch
        chrome_options.add_experimental_option('useAutomationExtension', False)  # do not load the automation extension

        # 3. Imitate a real browser environment
        # Set a recent User-Agent; keep its Chrome version in step with the installed browser
        user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"
        chrome_options.add_argument(f"--user-agent={user_agent}")

        # Path to the Chrome binary
        chrome_binary_path = r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"
        if os.path.exists(chrome_binary_path):
            chrome_options.binary_location = chrome_binary_path

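        # 4. Browser environment tuning
        chrome_options.add_argument("--disable-gpu")  # disable GPU acceleration; this changes WebGL output, so verify the effect on your target
        chrome_options.add_argument("--disable-features=IsolateOrigins,site-per-process")  # disable site isolation
        # Random window size, so that runs do not all share one fixed size
        chrome_options.add_argument(f"--window-size={random.randint(1366, 1920)},{random.randint(768, 1080)}")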

        # 5. Driver configuration
        chrome_driver_path = r"D:\chromedriver\chromedriver.exe"
        service = Service(chrome_driver_path)

        # 6. Create the browser driver
        driver = webdriver.Chrome(service=service, options=chrome_options)

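        # 7. Inject JavaScript before each page loads to hide WebDriver traits (the key anti-detection step)
        driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
            "source": """
                // Hide the WebDriver flag
                Object.defineProperty(navigator, 'webdriver', {
                    get: () => undefined
                })
                // Real Chrome exposes window.chrome (not navigator.chrome); add it only if missing
                if (!window.chrome) {
                    window.chrome = { runtime: {} }
                }
                // navigator.mediaDevices is read-only, so a plain assignment is ignored.
                // Define a stub only when the page has none (for example, on insecure origins).
                if (!navigator.mediaDevices) {
                    Object.defineProperty(navigator, 'mediaDevices', {
                        get: () => ({ enumerateDevices: () => Promise.resolve([]) })
                    })
                }
            """
        })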

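        # 8. Open the target site; a fingerprint detection page is used as the example here
        driver.get("https://fingerprintjs.github.io/BotD/main/")
        print("Fingerprint detection page opened; check the result on the page")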

        # 9. Imitate human behavior: scrolling and pauses
        wait = WebDriverWait(driver, 10)
        for _ in range(3):
            scroll_height = driver.execute_script("return document.body.scrollHeight")
            # Scroll to a random position on the page
            driver.execute_script(f"window.scrollTo(0, {random.randint(0, scroll_height)})")
            # Random pause to imitate a human pace
            time.sleep(random.uniform(1, 3))

        # 10. Print the detection result shown on the page
        try:
            result_element = wait.until(EC.presence_of_element_located((By.ID, 'result')))
            print("Page detection result:", result_element.text)
        except Exception:
            print("Could not find the detection result element")

        # Keep the window open so the result can be inspected manually
        input("Press Enter to close the browser...")

    except Exception as e:
        print(f"An error occurred: {e}")
    finally:
        if 'driver' in locals():
            driver.quit()
            print("Browser closed")


if __name__ == "__main__":
    open_website_with_anti_detection()

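Core Anti-Detection Techniques Explained

1. Hiding WebDriver Traits

The trait that gives Selenium away most readily is the WebDriver flag. We hide it with the following measures:

--disable-blink-features=AutomationControlled: a Chrome command-line switch (a browser feature, not tied to a particular Selenium release) that turns off Blink's AutomationControlled feature, so that navigator.webdriver no longer reports true.
Excluding the automation switch: chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"]) removes the --enable-automation switch that chromedriver passes to Chrome by default.
Injecting JavaScript: the script overrides navigator.webdriver and supplies the window.chrome object that a regular Chrome exposes. This step matters because many sites read these properties directly to decide whether the visitor is automated.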
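
The three measures above can be collected into plain data before any browser is started, which makes them easy to review and unit-test. This is a minimal sketch: the function names build_stealth_settings and cdp_payload are hypothetical helpers, while the switch, option names, and CDP command are the ones used in the script earlier in this article.

```python
def build_stealth_settings() -> dict:
    """Collect the WebDriver-hiding settings as plain data (no browser needed)."""
    return {
        # Passed to chrome_options.add_argument(...)
        "arguments": ["--disable-blink-features=AutomationControlled"],
        # Passed to chrome_options.add_experimental_option(name, value)
        "experimental_options": {
            "excludeSwitches": ["enable-automation"],
            "useAutomationExtension": False,
        },
        # JavaScript evaluated before any page script runs
        "init_script": (
            "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});\n"
            "if (!window.chrome) { window.chrome = {runtime: {}}; }"
        ),
    }


def cdp_payload(settings: dict) -> dict:
    """Build the parameters for the CDP command Page.addScriptToEvaluateOnNewDocument."""
    return {"source": settings["init_script"]}


settings = build_stealth_settings()
for argument in settings["arguments"]:
    print("argument:", argument)
for name, value in settings["experimental_options"].items():
    print("option:", name, value)
```

With a live driver, the payload would be handed to driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", cdp_payload(settings)) right after the driver is created and before the first driver.get call.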

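2. Imitating a Real Browser Fingerprint

The browser fingerprint is a major input to bot detection. We can bring it closer to a real one in the following ways:

User-Agent: use the User-Agent of a current Chrome release and avoid old versions (such as Chrome 91), since an outdated UA is easily flagged as a bot. Keep the version in the UA consistent with the Chrome build that is actually running, because a mismatch is itself a detectable inconsistency.
Random window size: generate a different window size on each run instead of a fixed value. Real users visit from different devices, so their window sizes vary.
User data directory: the --user-data-dir option points Chrome at a persistent profile directory, which keeps the browser fingerprint and login state between runs and makes later visits look more like a returning user.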
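
The window size and User-Agent settings can be generated by small helpers such as the ones below. This is a sketch under stated assumptions: it picks from a short list of common screen resolutions instead of drawing width and height independently as the main script does, since an arbitrary pair such as 1423x911 is itself unusual. That substitution, the resolution list, and the helper names are assumptions of this sketch, not part of the original script.

```python
import random

# Hypothetical list of common desktop resolutions (an assumption of this sketch).
COMMON_RESOLUTIONS = [(1366, 768), (1440, 900), (1536, 864), (1600, 900), (1920, 1080)]


def pick_window_size(rng: random.Random) -> tuple[int, int]:
    """Choose one (width, height) pair from the common resolutions."""
    return rng.choice(COMMON_RESOLUTIONS)


def window_size_argument(size: tuple[int, int]) -> str:
    """Format a size as the Chrome --window-size switch."""
    width, height = size
    return f"--window-size={width},{height}"


def build_user_agent(chrome_major: int) -> str:
    """Build a Windows Chrome User-Agent for the given major version."""
    return (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        f"(KHTML, like Gecko) Chrome/{chrome_major}.0.0.0 Safari/537.36"
    )


rng = random.Random()
print(window_size_argument(pick_window_size(rng)))
print(build_user_agent(115))
```

Passing the major version of the installed Chrome to build_user_agent keeps the UA string and the real browser in agreement, which is the consistency point made in the list above.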

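3. Imitating Human Behavior

Behavior patterns are an important way of telling humans from bots. We can imitate real user behavior in the following ways:

Random scrolling: once the page has loaded, scroll to random positions, the way a person browsing the page would.
Random delays: add random delays between actions. Acting at a fixed frequency is a typical bot trait.
Explicit waits: use WebDriverWait to wait until an element is present before reading or interacting with it, much as a person waits for the page to respond. Its main job is synchronization, so treat it as a complement to the random delays rather than a disguise in its own right.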
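
The scrolling and delay steps can be planned ahead of time as data and then replayed against the browser. The sketch below is one possible variation, not the script's exact behavior: it sorts the random positions so the page is read from top to bottom, whereas the main script jumps to unrelated offsets. The function name scroll_plan and the 1 to 3 second pause range (taken from the main script) are the only parameters.

```python
import random


def scroll_plan(scroll_height: int, steps: int, rng: random.Random) -> list[tuple[int, float]]:
    """Return (y_position, pause_seconds) pairs that move down the page in uneven steps."""
    if steps <= 0 or scroll_height <= 0:
        return []
    # Increasing positions: the page is read top to bottom with irregular step sizes.
    positions = sorted(rng.randint(0, scroll_height) for _ in range(steps))
    # A random pause of 1 to 3 seconds follows each scroll.
    return [(y, rng.uniform(1.0, 3.0)) for y in positions]


for y, pause in scroll_plan(4000, 3, random.Random()):
    print(f"scroll to {y}px, then pause {pause:.2f}s")
```

With a live driver, each pair would be replayed as driver.execute_script("window.scrollTo(0, arguments[0])", y) followed by time.sleep(pause).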

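4. Environment Tuning

Disabling GPU acceleration: some anti-scraping systems inspect GPU rendering characteristics. Note that --disable-gpu changes those characteristics rather than hiding them, since WebGL may fall back to software rendering or become unavailable, so check with the detection pages listed later whether it helps or hurts for your target.
Disabling site isolation: --disable-features=IsolateOrigins,site-per-process turns off Chrome's site isolation. This also weakens the browser's security boundary between sites, so use it only in a dedicated automation profile.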

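Advanced Anti-Detection Techniques

1. Using Anti-Fingerprinting Browser Extensions

Anti-fingerprinting extensions can randomize the browser fingerprint further, for example:

Chameleon: randomizes the browser fingerprint, including User-Agent, time zone, and language. Check that the build you load is packaged for Chrome, since Chameleon is primarily published as a Firefox add-on.
Random User-Agent: switches to a random User-Agent for each browsing session.

To install an extension in Selenium: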

chrome_options.add_extension("chameleon.crx")

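2. Configuring a Proxy IP

Using proxy IPs avoids triggering anti-scraping limits through frequent requests from a single IP:

chrome_options.add_argument("--proxy-server=http://127.0.0.1:8080")  # replace with your actual proxy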
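
When several proxies are available, the switch can be built from a pool instead of being hard-coded. This is a minimal sketch: the helper names and the pool addresses are hypothetical, and URLs that carry credentials are rejected because handling authenticated proxies is outside the scope of this example.

```python
import random
from urllib.parse import urlsplit


def proxy_argument(proxy_url: str) -> str:
    """Validate a proxy URL and format it as the Chrome --proxy-server switch."""
    parts = urlsplit(proxy_url)
    if parts.scheme not in {"http", "https", "socks5"}:
        raise ValueError(f"unsupported proxy scheme in {proxy_url!r}")
    if not parts.hostname or parts.port is None:
        raise ValueError(f"proxy URL needs a host and a port: {proxy_url!r}")
    if parts.username or parts.password:
        # Authenticated proxies are not handled in this sketch.
        raise ValueError("credentials in the proxy URL are not supported here")
    return f"--proxy-server={parts.scheme}://{parts.hostname}:{parts.port}"


def pick_proxy(pool: list[str], rng: random.Random) -> str:
    """Choose one proxy URL from a non-empty pool."""
    if not pool:
        raise ValueError("proxy pool is empty")
    return rng.choice(pool)


# Placeholder addresses; replace them with real proxies.
pool = ["http://127.0.0.1:8080", "http://127.0.0.1:8081"]
print(proxy_argument(pick_proxy(pool, random.Random())))
```

The returned string is passed to chrome_options.add_argument once per browser session, so rotating proxies means starting a new session for each one.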

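3. Rounding Out the Language Settings

A lone language setting in the detection result (such as en-US only) can look suspicious. Add Chinese support, or whichever language matches the region your traffic appears to come from: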

chrome_options.add_argument("--lang=zh-CN")
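
A browser usually advertises an ordered list of languages rather than a single one. The sketch below builds both the --lang switch and a language list for Chrome's preferences. The helper name language_settings is hypothetical, and the use of the intl.accept_languages preference key is an assumption of this sketch that should be checked against your Chrome version.

```python
def language_settings(primary: str, fallbacks: list[str]) -> dict:
    """Build the --lang switch and an ordered language preference list."""
    ordered = [primary]
    for tag in fallbacks:
        # Keep the first occurrence of each tag so the list has no duplicates.
        if tag not in ordered:
            ordered.append(tag)
    return {
        "argument": f"--lang={primary}",
        # Assumed Chrome preference key for the advertised language list.
        "prefs": {"intl.accept_languages": ",".join(ordered)},
    }


settings = language_settings("zh-CN", ["zh", "en-US"])
print(settings["argument"])
print(settings["prefs"])
```

The argument goes to chrome_options.add_argument, and the prefs dictionary would be passed as chrome_options.add_experimental_option("prefs", settings["prefs"]).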

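4. Simulating Media Devices

# Inject JavaScript that reports a camera and a microphone
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": """
        // enumerateDevices() is the standard method; navigator.mediaDevices is read-only,
        // so it has to be redefined with Object.defineProperty rather than assigned
        Object.defineProperty(navigator, 'mediaDevices', {
            get: () => ({
                enumerateDevices: () => Promise.resolve([
                    {kind: 'videoinput', label: 'Webcam'},
                    {kind: 'audioinput', label: 'Microphone'}
                ])
            })
        })
    """
})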

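Verifying the Anti-Detection Setup

Several sites are useful for checking how well your anti-detection settings work:

FingerprintJS BotD demo page: https://fingerprintjs.github.io/BotD/main/
- Ideally it shows: Bot: false, Detected bot kind: undefined
Sannysoft bot detection: https://bot.sannysoft.com/
- Tests for bot traits along several dimensions and gives a detailed report.
AmIACrawler: https://www.amia-crawler.com/
- A site dedicated to detecting automated programs, with an overall bot detection assessment.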

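Notes and Best Practices

Match driver versions: make sure the chromedriver version matches your Chrome version, otherwise the script may fail to start. Selenium 4.6 and later can also resolve a matching driver automatically through Selenium Manager when no driver path is given.
Update the User-Agent regularly: browsers release new versions frequently, so refresh the User-Agent to stay in line with the current browser.
Avoid excessive requests: even with anti-detection techniques in place, do not send excessive requests to the target site, or you may trigger other anti-scraping mechanisms.
Compliance first: when automating web interactions, make sure your activity complies with the target site's terms of use and with applicable laws and regulations.
Adjust strategies over time: anti-scraping technology keeps changing, so test your automation scripts regularly and adjust the anti-detection strategy according to the results.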
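
The advice on avoiding excessive requests can be enforced in code with a small throttle that guarantees a minimum gap between page loads. This is a minimal sketch: the class name MinIntervalThrottle and the 2 second interval are hypothetical, and the clock and sleep functions are injectable so that the logic can be tested without actually waiting.

```python
import time


class MinIntervalThrottle:
    """Ensure at least min_interval seconds pass between consecutive calls to wait()."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous request was released

    def wait(self) -> float:
        """Block until the interval has elapsed; return the number of seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay


throttle = MinIntervalThrottle(min_interval=2.0)
first = throttle.wait()  # the first call never sleeps
print(f"slept {first:.1f}s before the first request")
```

Calling throttle.wait() immediately before each driver.get keeps the request rate bounded no matter how quickly the rest of the script runs.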

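Summary

Getting past bot verification is a process of continual refinement: as anti-scraping technology advances, anti-detection methods have to be updated as well. The approach in this article combines several anti-detection techniques and can lower the probability of being identified as a bot. The key point is to reproduce a real user's browser environment and behavior patterns so that the automated program comes as close as possible to how a human operates.

With continued learning and practice, you can make your Selenium automation scripts less conspicuous and more efficient, and better prepared for the bot verification mechanisms you encounter.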
