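Python Web Scraping (Chapter 3 of 3: Driving the Browser Window, Locating Page Elements, Simulating User Interaction (Text Input, Clicks, File Upload), Switching Browser Windows, and Scraping and Saving in a Loop)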

The complete learning path for Python web scraping:

(Chapter 1)

Python Web Scraping (Chapter 1 of 3: Scraping Libraries, robots.txt Rules (Staying Within the Law), Viewing and Fetching Page Source) - CSDN Blog
https://blog.csdn.net/2302_78022640/article/details/149428719?sharetype=blogdetail&sharerId=149428719&sharerefer=PC&sharesource=2302_78022640&spm=1011.2480.3001.8118

(Chapter 2)

Python Web Scraping (Chapter 2 of 3: Installing the Browser Driver, Driving the Browser to Load Pages, Batch-Downloading Resources) - CSDN Blog
https://blog.csdn.net/2302_78022640/article/details/149431071?sharetype=blogdetail&sharerId=149431071&sharerefer=PC&sharesource=2302_78022640&spm=1011.2480.3001.8118

(Chapter 3 is this article)

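Locating page elements: By

from selenium.webdriver.common.by import By

This is the core import for element-location strategies in Selenium automation. It gives you a standardized way to state how a page element should be located, which keeps the locating logic clear and maintainable.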

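What it does in detail:

A unified set of locator strategies
The By class defines a named constant for every locator strategy Selenium supports. It is a plain class of string constants rather than a true enum: By.ID is the string 'id' and By.CLASS_NAME is the string 'class name'. Writing By.XXX states the strategy explicitly and avoids the typos and ambiguity that come with typing those strings by hand.
Supported locator strategies
The commonly used strategies are:
- By.ID: locate by the element's id attribute (the most precise, because an id is meant to be unique on the page).
- By.CLASS_NAME: locate by the element's class attribute (suited to extracting a batch of similar elements).
- By.TAG_NAME: locate by HTML tag name (such as input, a, title).
- By.NAME: locate by the element's name attribute (common for form elements).
- By.LINK_TEXT / By.PARTIAL_LINK_TEXT: locate a link by its full text or by part of its text.
- By.XPATH: locate by an XPath expression (flexible, can reach elements in complex structures).
- By.CSS_SELECTOR: locate by a CSS selector (concise, and can combine tag, id, class and attribute conditions).
Used together with the locating methods
By is passed to find_element() or find_elements(), for example: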

# Locate a single element by ID
element = driver.find_element(By.ID, "username")
# Locate multiple elements by class name
elements = driver.find_elements(By.CLASS_NAME, "book-item")
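
The two methods behave differently when nothing matches: find_element() raises NoSuchElementException, while find_elements() returns an empty list. The sketch below uses that difference to look up an element that may or may not exist. The helper name find_or_none is hypothetical (it is not part of Selenium), and driver is assumed to be any object that offers Selenium's find_elements(by, value) method.

```python
def find_or_none(driver, by, value):
    """Return the first element matching (by, value), or None if nothing matches.

    find_or_none is an illustrative helper, not a Selenium API. It relies on
    find_elements() returning an empty list when there is no match, so no
    try/except around NoSuchElementException is needed.
    """
    matches = driver.find_elements(by, value)
    return matches[0] if matches else None
```

With a real driver this could be called as find_or_none(driver, By.ID, "username"), followed by a check for None before using the element.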

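Preparation before writing code:

In the browser's "Inspect" (developer tools) panel, press Ctrl+F to open the search box at the bottom of the panel. A counter such as "1 of 43" means there are 43 matches and the first one is currently selected.

If you don't know where an element used in the code below sits on the page, search for it here. Clicking a match scrolls the original page (on the left) to the corresponding position.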

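Driving the browser window, locating an element, and reading it

Code

from selenium import webdriver
from selenium.webdriver.edge.options import Options
from selenium.webdriver.common.by import By

edge_options = Options()
# Path to the Edge executable; adjust it to match your installation
edge_options.binary_location = r"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe"
driver = webdriver.Edge(options=edge_options)
driver.get('https://www.ptpress.com.cn/')
# Locate the <title> tag inside the page's <head>
ele = driver.find_element(By.TAG_NAME, "title")
# <title> is never rendered, so .text is empty; read textContent instead
print(ele.tag_name, ele.get_attribute("textContent"))
input('')  # Keep the browser open until Enter is pressed

Result: launches Edge, opens the Posts & Telecom Press (人民邮电出版社) website, prints the tag name ("title") followed by the page title text, then waits for user input before exiting.
Analysis: By.TAG_NAME locates the page's <title> tag, and tag_name gives the tag's name. Because .text only returns text that is visible on the page, and <title> is not rendered, the title string is read with get_attribute("textContent"). This demonstrates basic element location and information extraction.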

Typing into a search box on the page

Code

from selenium import webdriver
from selenium.webdriver.edge.options import Options
from selenium.webdriver.common.by import By

edge_options = Options()
edge_options.binary_location = r"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe"
driver = webdriver.Edge(options=edge_options)
driver.get('https://www.ptpress.com.cn/')
# Locate the first <input> on the page and type "Python" into it
driver.find_element(By.TAG_NAME, "input").send_keys("Python")
input("")  # Keep the browser open until Enter is pressed

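Result: launches Edge, opens the Posts & Telecom Press website, automatically types "Python" into the page's input box, then waits for user input before exiting.
Analysis: By.TAG_NAME locates the page's <input> tag (find_element returns the first match), and send_keys passes text into the input box, simulating a user typing.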

Same as above, then pressing Enter to jump to the results

Code

from selenium import webdriver
from selenium.webdriver.edge.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

edge_options = Options()
edge_options.binary_location = r"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe"
driver = webdriver.Edge(options=edge_options)
driver.get('https://www.ptpress.com.cn/')
# Type "Python" into the first <input>, then send the Enter key to submit
driver.find_element(By.TAG_NAME, "input").send_keys("Python" + Keys.RETURN)
input("")  # Keep the browser open until Enter is pressed

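Result: launches Edge, opens the Posts & Telecom Press website, types "Python" into the input box and presses Enter automatically to submit the search, then waits for user input before exiting.
Analysis: combining send_keys with Keys.RETURN (the Enter key) simulates the complete action of a user typing a keyword and submitting the search.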

Uploading an image on a page

Code

from selenium import webdriver
from selenium.webdriver.edge.options import Options
from selenium.webdriver.common.by import By

edge_options = Options()
edge_options.binary_location = r"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe"
driver = webdriver.Edge(options=edge_options)
driver.get('https://graph.baidu.com/pcpage/index?tpl_from=pc')
# Locate the file-upload input by its name attribute
a = driver.find_element(By.NAME, 'file')
# Sending a local file path to a file input uploads that file
a.send_keys(r"D:\ALL_project\Python_project\learn\abc.jpg")
input("")  # Keep the browser open until Enter is pressed

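Result: launches Edge, opens Baidu's image recognition page, locates the file-upload input and uploads the image "abc.jpg" from the given path, then waits for user input before exiting.
Analysis: for file uploads, By.NAME locates the upload input, and send_keys passes the local file path straight to it. That is enough to trigger the upload, with no need to simulate clicks in the operating system's file-picker dialog.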
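
Because send_keys receives only a string, a mistyped or relative path tends to surface as a confusing browser-side error instead of a clear Python one. The sketch below checks the path first and always sends an absolute one. The helper name upload_file is hypothetical, and file_input is assumed to be the element returned by find_element for the upload input.

```python
from pathlib import Path


def upload_file(file_input, path):
    """Send a verified, absolute file path to an <input type="file"> element.

    upload_file is an illustrative helper, not a Selenium API.
    Raises FileNotFoundError before touching the browser if the file is missing.
    """
    resolved = Path(path).expanduser().resolve()
    if not resolved.is_file():
        raise FileNotFoundError(f"No such file to upload: {resolved}")
    # send_keys on a file input takes the local path as plain text
    file_input.send_keys(str(resolved))
    return resolved
```

In the example above, the last two lines before input("") could then become upload_file(a, r"D:\ALL_project\Python_project\learn\abc.jpg").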

Clicking on a page

Code

from selenium import webdriver
from selenium.webdriver.edge.options import Options
from selenium.webdriver.common.by import By
import time

edge_options = Options()
edge_options.binary_location = r"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe"
driver = webdriver.Edge(options=edge_options)
driver.get('https://www.ptpress.com.cn/periodical')
# Collect every element whose class is "item"
elements = driver.find_elements(By.CLASS_NAME, "item")
# Print each element's index and visible text
for i, element in enumerate(elements):
    print(f"Item {i}: {element.text}")
time.sleep(3)
elements[3].click()  # Click the fourth element (index 3)
input("")  # Keep the browser open until Enter is pressed

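Result: launches Edge, opens the periodicals page of the Posts & Telecom Press website, collects every element with class "item" and prints each one's index and text, clicks the fourth element (index 3) after 3 seconds, then waits for user input before exiting.
Analysis: find_elements (the plural form) returns all elements that match the condition. After looping over them to print their details, the click method simulates a click on the fourth element, which is how the script interacts with the page.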

Operating the page

Code

from selenium import webdriver
from selenium.webdriver.edge.options import Options
from selenium.webdriver.common.by import By
import time

edge_options = Options()
edge_options.binary_location = r"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe"
driver = webdriver.Edge(options=edge_options)
driver.get('https://www.ptpress.com.cn/')
elements = driver.find_elements(By.CLASS_NAME, "item")
time.sleep(5)
elements[3].click()     # Click "工作动态" (Work Updates)
time.sleep(5)
driver.back()           # Go back one page
time.sleep(5)
driver.forward()        # Go forward one page
time.sleep(5)
driver.refresh()        # Reload the page
time.sleep(5)
driver.close()          # Close the current window
time.sleep(5)
driver.quit()           # Quit the browser

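Result: launches Edge, opens the Posts & Telecom Press website, and after 5 seconds clicks the fourth "item" element (identified in the code comment as "工作动态"). It then goes back, goes forward, and refreshes, with 5 seconds between each step, and finally closes the current window and quits the browser.
Analysis: this demonstrates the browser's core navigation operations, back, forward and refresh, plus window control: close shuts the current window and quit shuts down the whole browser. time.sleep is used to wait for pages to load.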
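
A fixed time.sleep either wastes time when the page loads quickly or fails when it loads slowly. A common alternative is to poll until a condition holds, which is the idea behind Selenium's own WebDriverWait class in selenium.webdriver.support.ui. The sketch below shows that polling idea in plain Python so the mechanism is visible. The function name wait_until and its default timings are assumptions made for illustration.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    wait_until is an illustrative helper, not a Selenium API.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

With a real driver, wait_until(lambda: driver.find_elements(By.CLASS_NAME, "item")) would return the element list as soon as it is non-empty, since find_elements returns an empty list while nothing matches.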

Enabling headless mode

Code

from selenium import webdriver
from selenium.webdriver.edge.options import Options
from selenium.webdriver.common.by import By

edge_options = Options()
# Run the browser without showing a window
edge_options.add_argument('--headless')
edge_options.binary_location = r"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe"
driver = webdriver.Edge(options=edge_options)
driver.get('https://www.ptpress.com.cn/')
# Collect every <a> element and print its visible text
elements = driver.find_elements(By.TAG_NAME, "a")
for element in elements:
    print(element.text)
driver.quit()

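Result: launches Edge in headless mode (no browser window is shown), opens the Posts & Telecom Press website, collects every <a> element and prints its text, then quits the browser.
Analysis: add_argument('--headless') enables headless mode, which suits scripts that run in the background and reduces resource usage. find_elements(By.TAG_NAME, "a") returns all link elements, whose text is then printed.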
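
A headless browser has no real window, so its viewport may be smaller than on a desktop, and a responsive page may lay out or hide elements differently as a result. The sketch below sets an explicit window size along with the headless flag. The helper name apply_headless and the 1920x1080 default are assumptions for illustration. --window-size is a Chromium command-line switch, and whether it matters depends on the page being scraped.

```python
def apply_headless(options, width=1920, height=1080):
    """Add headless-mode arguments to a Selenium Options object and return it.

    apply_headless is an illustrative helper, not a Selenium API; `options`
    only needs an add_argument(str) method, as Edge's Options has.
    """
    options.add_argument("--headless")
    # Fix the viewport size so the layout does not depend on browser defaults
    options.add_argument(f"--window-size={width},{height}")
    return options
```

In the example above this would replace the single add_argument('--headless') line with apply_headless(edge_options).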

Session cookies

Code

from selenium import webdriver
from selenium.webdriver.edge.options import Options
from selenium.webdriver.common.by import By

edge_options = Options()
edge_options.binary_location = r"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe"
driver = webdriver.Edge(options=edge_options)
driver.get('http://www.taobao.com')  # Request http://www.taobao.com
# Print the cookies the site has set so far
print(driver.get_cookies())
# Add a custom cookie, then print the list again to confirm it is there
driver.add_cookie({'name': 'zhangsan', 'value': '98'})
print(driver.get_cookies())

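Result: launches Edge, opens the Taobao home page, prints all cookies currently set for the page, adds a cookie named "zhangsan" with the value "98", and then prints the full cookie list again, now including the new cookie.
Analysis: get_cookies returns all cookies visible to the current page, and add_cookie adds a custom cookie for the current domain. This is commonly used to simulate a logged-in state or to pass specific information.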
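
To reuse a logged-in state across runs, the cookie list can be written to disk and added back later. The sketch below shows one way to do that with JSON. The function names save_cookies and load_cookies and the file name used in the usage note are hypothetical, and driver is assumed to offer Selenium's get_cookies() and add_cookie() methods.

```python
import json


def save_cookies(driver, path):
    """Write the driver's current cookies to a JSON file; return how many were saved."""
    cookies = driver.get_cookies()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cookies, f, ensure_ascii=False, indent=2)
    return len(cookies)


def load_cookies(driver, path):
    """Add every cookie stored in a JSON file to the driver; return how many were added.

    The browser should already be on a page of the cookies' domain,
    because add_cookie applies to the currently loaded site.
    """
    with open(path, encoding="utf-8") as f:
        cookies = json.load(f)
    for cookie in cookies:
        driver.add_cookie(cookie)
    return len(cookies)
```

A typical flow would be save_cookies(driver, "cookies.json") after logging in by hand, then in a later run driver.get(...) on the same site, load_cookies(driver, "cookies.json"), and driver.refresh().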

Mini project: automatic search and navigation, paging, and looping over every page to collect all the books

Code

from selenium import webdriver
from selenium.webdriver.edge.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

def getbook(driver):
    time.sleep(5)  # Wait for the list page to finish loading
    books = driver.find_elements(By.CLASS_NAME, 'book_item')
    # Append to the file so each page's books are saved once the page is done
    with open('books.txt', 'a', encoding='utf-8') as file:
        # Visit every book on this page and write its details
        for book in books:
            book.click()   # Opens a fourth window: the book's detail page
            handles = driver.window_handles
            driver.switch_to.window(handles[3])
            time.sleep(5)
            name = driver.find_element(By.CLASS_NAME, 'book-name').text
            price = driver.find_element(By.CLASS_NAME, 'price').text
            author = driver.find_element(By.CLASS_NAME, 'book-author').text
            file.write(f'Title: {name}, Price: {price}, Author: {author}\n')
            driver.close()   # Close the fourth window
            # Return to the third window so the next iteration can click the next book
            handles = driver.window_handles
            driver.switch_to.window(handles[2])
    # Leaving the with block closes the file and saves its contents

edge_options = Options()
edge_options.binary_location = r"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe"
# Uncomment to hide the browser window
# edge_options.add_argument('--headless')
driver = webdriver.Edge(options=edge_options)
driver.get('https://www.ptpress.com.cn/')
# Search for "excel" and press Enter
driver.find_element(By.TAG_NAME, "input").send_keys("excel" + Keys.RETURN)
# Switch to the second window and click "More"
handles = driver.window_handles
driver.switch_to.window(handles[1])
driver.find_element(By.ID, "booksMore").click()
# Switch to the third window and collect every book on the first page
handles = driver.window_handles
driver.switch_to.window(handles[2])
getbook(driver)
# Keep clicking "next page" and reading the books on each page
while True:
    driver.find_element(By.CLASS_NAME, 'ivu-page-next').click()
    getbook(driver)

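Result: launches Edge, opens the Posts & Telecom Press website, searches for "excel" to reach the book results, and clicks "More" to open the full list. It then opens each book's detail page in turn, extracts the title, price and author, and writes them to "books.txt", clicking "next page" automatically to keep scraping. The loop is infinite, so the script has to be stopped manually.
Analysis:

The getbook function extracts the details of every book on one page. It uses window_handles to switch windows, and after extracting a book's details it closes the detail page and returns to the list page.
Main flow: simulate the search, switch windows to reach the full list page, call getbook for the first page, then click the "next page" button in a loop to scrape multiple pages.
To run in headless mode, uncomment edge_options.add_argument('--headless').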
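
The project relies on fixed window indices (handles[1], handles[2], handles[3]) and on a loop that never ends. The sketch below shows two small helpers that could loosen both assumptions: one switches to whichever window has just opened, the other reports whether the "next page" button is still usable. Both helper names are hypothetical. The class name "ivu-page-disabled" is an assumption about how the site's pager marks a disabled button and should be checked in the developer tools before relying on it.

```python
def switch_to_new_window(driver, known_handles):
    """Switch to the single window handle that is not in known_handles; return it.

    Illustrative helper, not a Selenium API. Call it right after a click that
    opens a new window, passing the handles recorded before the click.
    """
    new_handles = [h for h in driver.window_handles if h not in known_handles]
    if len(new_handles) != 1:
        raise RuntimeError(f"expected exactly one new window, found {len(new_handles)}")
    driver.switch_to.window(new_handles[0])
    return new_handles[0]


def has_next_page(next_button, disabled_class="ivu-page-disabled"):
    """Return False when the pager's next button carries the disabled class.

    disabled_class is an assumed class name; verify it against the real page.
    """
    classes = (next_button.get_attribute("class") or "").split()
    return disabled_class not in classes
```

Inside getbook, this would mean recording before = driver.window_handles ahead of book.click() and then calling switch_to_new_window(driver, before). The while True loop could then break as soon as has_next_page returns False for the 'ivu-page-next' element.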
