Getting to Know Web Crawlers: XPath Extraction

Published: 2025-08-05
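Crawler workflow:
  1. Fetch the HTML page: the crawler (the client) sends an HTTP request to the web server and receives the page in the response
  2. Extract data according to rules: xpath, bs4, re (regular expressions)
  3. Store the data: excel, txt, csv, sql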
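
Step 3 is only named in the list above. As a rough illustration of the CSV option, the sketch below writes a few rows with the standard-library csv module and reads them back. The rows, the example.com links, and the file name results.csv are made-up placeholders, not data from this article.

```python
import csv
import os
import tempfile

# Made-up rows standing in for data extracted in step 2.
rows = [
    {"title": "番茄 - 百度百科", "link": "https://example.com/a"},
    {"title": "番茄炒蛋的做法", "link": "https://example.com/b"},
]

# Hypothetical output location in the system temp directory.
path = os.path.join(tempfile.gettempdir(), "results.csv")

# newline="" avoids blank lines on Windows; utf-8-sig helps Excel detect the encoding.
with open(path, "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "link"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to confirm what was stored.
with open(path, newline="", encoding="utf-8-sig") as f:
    stored = list(csv.DictReader(f))

print(stored)
```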
I. How to fetch the HTML page

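Python has two commonly used libraries for this: urllib and requests

urllib: ships with Python, so it can simulate HTTP requests with no extra installation

requests: not a built-in library; it must be installed separately

Installing the requests library

In Anaconda Prompt, run pip install requests

Calling requests.get directly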

import requests
url = "https://www.baidu.com"
response = requests.get(url)
print(response)

>>> <Response [200]>

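Using response.text

This returns the HTML as a string, but the text may be garbled. response.text decodes the body with an encoding that requests guesses from the response headers; when the Content-Type header is a text type that declares no charset, requests falls back to ISO-8859-1. That appears to be what happens here: baidu.com is written in UTF-8, so the guessed encoding produces mojibake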

import requests
url = "https://www.baidu.com"
response = requests.get(url).text
print(response)

Using response.content

This returns the raw bytes of the response body; we take the bytes and do the decoding ourselves

Note: a page must be decoded with the same encoding it was written in

import requests
url = "https://www.baidu.com"
response = requests.get(url)
content = response.content.decode('utf8')
print(content)
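
The encoding note can be checked without any network access. The sketch below is only an illustration: the HTML string is a made-up stand-in for the bytes a server would send, decoded once with the wrong codec and once with the right one.

```python
# Bytes as they might arrive over the network (a UTF-8 encoded HTML fragment).
raw = "<title>百度一下,你就知道</title>".encode("utf-8")

# Wrong codec: no exception is raised, but the Chinese text turns into mojibake.
wrong = raw.decode("iso-8859-1")

# Same codec the page was written in: the text comes back intact.
right = raw.decode("utf-8")

print(wrong)
print(right)
```

ISO-8859-1 can decode any byte sequence, which is why the wrong guess fails silently instead of raising an error.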

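Further notes:

response.encoding

response.text decodes the body with the encoding requests guessed; response.encoding shows which encoding that is, and assigning a new value to it changes how response.text decodes

response.status_code

The HTTP status code of the response, for example 200 for success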
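
One way to combine these two attributes is sketched below. fetch_text is a hypothetical helper name, and the utf-8 default and 10-second timeout are assumptions chosen for illustration; the sketch checks the status code and then overrides the guessed encoding before reading response.text.

```python
import requests


def fetch_text(url, encoding="utf-8", timeout=10):
    """Hypothetical helper: GET a page, check the status code, decode explicitly."""
    response = requests.get(url, timeout=timeout)
    # Stop early if the server did not answer with 200 OK.
    if response.status_code != 200:
        raise RuntimeError(f"unexpected status code: {response.status_code}")
    # Overriding response.encoding changes how response.text decodes the bytes.
    response.encoding = encoding
    return response.text
```

Calling fetch_text("https://www.baidu.com") should then return the page decoded as UTF-8, provided the request succeeds.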

import requests
url = "https://www.baidu.com/s?wd=番茄"
response = requests.get(url)
content = response.content.decode('utf8')
print(content)

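Here the search keyword 番茄 ("tomato") is written straight into the URL. URLs travel over the network as ASCII bytes, so non-ASCII characters have to be percent-encoded first. Some clients (urllib.request, for example) raise an error on a URL like this; requests percent-encodes it automatically, but it is cleaner to pass the query as a dict through params and let requests build the URL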

import requests
url = "https://www.baidu.com/s"
keyword={
            "wd" : "番茄"
        }
response = requests.get(url,params = keyword)
content = response.content.decode('utf8')
print(content)
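
To see roughly what the params dict turns into, the standard library can build the same kind of query string. This is a sketch of the percent-encoding step only, not of what requests does internally.

```python
from urllib.parse import urlencode, unquote

base_url = "https://www.baidu.com/s"
keyword = {"wd": "番茄"}

# urlencode percent-encodes the UTF-8 bytes of each value.
query = urlencode(keyword)
full_url = f"{base_url}?{query}"

print(full_url)        # https://www.baidu.com/s?wd=%E7%95%AA%E8%8C%84
print(unquote(query))  # wd=番茄
```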

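The response contains very little data, most likely because the server has recognized the request as coming from a crawler. We need to disguise the request as a browser visit by setting the request headers (Request Headers)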

import requests
url = "https://www.baidu.com/s"
keyword = {
            "wd" : "番茄"
        }
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36 Edg/138.0.0.0",
    # Paste the Cookie string copied from your own browser session, on a single line:
    # requests rejects header values that contain a line break, and a Cookie is a
    # login credential that should not be published.
    "Cookie": "PASTE_YOUR_OWN_COOKIE_STRING_HERE",
}
response = requests.get(url,params = keyword,headers=headers)
content = response.content.decode('utf8')
print(content)
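
Because the same headers are needed for every request, one option is to set them once on a requests.Session. The sketch below only prepares the request and inspects it, so it runs without network access; the Cookie header is left out here and would be added the same way.

```python
import requests

session = requests.Session()
# Headers set on the session are sent with every request made through it.
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36 Edg/138.0.0.0",
})

# Build the request without sending it, to see what would go over the wire.
request = requests.Request("GET", "https://www.baidu.com/s", params={"wd": "番茄"})
prepared = session.prepare_request(request)

print(prepared.url)
print(prepared.headers["User-Agent"])

# To actually send it: response = session.send(prepared, timeout=10)
```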

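This completes step one

II. Extracting data with XPath

What we have at the end of the previous step is just a string full of HTML tags, so we first need to turn that string into an HTML tree (XPath is a query language designed for extracting data from the tree structure of XML/HTML documents)

The lxml library (installed with pip install lxml) parses an HTML string into a tree structure (a DOM tree), which supports XPath, for locating nodes by path, and CSS selectors, for locating nodes by class name, ID and so on

etree.HTML

Parses an HTML string into an XML/HTML tree that you can work with, so that data can be extracted with XPath or CSS selectors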

import requests
from lxml import etree
url = "https://www.baidu.com/s"
keyword = {
            "wd" : "番茄"
        }
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36 Edg/138.0.0.0",
    # Paste the Cookie string copied from your own browser session, on a single line:
    # requests rejects header values that contain a line break, and a Cookie is a
    # login credential that should not be published.
    "Cookie": "PASTE_YOUR_OWN_COOKIE_STRING_HERE",
}
response = requests.get(url,params = keyword,headers=headers)
content = response.content.decode('utf8')
tree = etree.HTML(content)
title = tree.xpath("//h3/a[contains(text(),'百度百科')]/text()")[0]

print(title)
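
Search result pages change often, so it can help to try the XPath expressions on a small fixed snippet first. The HTML below is made-up sample data, not Baidu's real markup; the sketch shows selecting text, selecting attributes, and guarding against an empty result before taking [0].

```python
from lxml import etree

# Made-up snippet imitating two search results.
html = """
<div>
  <h3><a href="https://example.com/a">番茄 - 百度百科</a></h3>
  <h3><a href="https://example.com/b">番茄炒蛋的做法</a></h3>
</div>
"""

tree = etree.HTML(html)

# text() selects the text nodes, @href selects the attribute values.
titles = tree.xpath("//h3/a/text()")
links = tree.xpath("//h3/a/@href")

# xpath() returns a list; check that it is non-empty before indexing.
baike = tree.xpath("//h3/a[contains(text(), '百度百科')]/text()")
first = baike[0] if baike else None

print(titles)
print(links)
print(first)
```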

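etree.tostring

Serializes a node (and its subtree) to bytes

etree.tostring(html, encoding='utf8').decode('utf8')

encoding='utf8': sets the encoding of the resulting bytes explicitly

.decode('utf8'): turns the bytes into a human-readable string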

import requests
from lxml import etree
url = "https://www.baidu.com/s"
keyword = {
            "wd" : "番茄"
        }
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36 Edg/138.0.0.0",
    # Paste the Cookie string copied from your own browser session, on a single line:
    # requests rejects header values that contain a line break, and a Cookie is a
    # login credential that should not be published.
    "Cookie": "PASTE_YOUR_OWN_COOKIE_STRING_HERE",
}
response = requests.get(url,params = keyword,headers=headers)
content = response.content.decode('utf8')
tree = etree.HTML(content)
title = tree.xpath("//h3/a[contains(text(),'百度百科')]")[0]
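print(title)

Printing the element itself only shows the Element object (something like <Element a at 0x...>), not its markup. Passing it through etree.tostring gives the HTML of the node: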

import requests
from lxml import etree
url = "https://www.baidu.com/s"
keyword = {
            "wd" : "番茄"
        }
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36 Edg/138.0.0.0",
    # Paste the Cookie string copied from your own browser session, on a single line:
    # requests rejects header values that contain a line break, and a Cookie is a
    # login credential that should not be published.
    "Cookie": "PASTE_YOUR_OWN_COOKIE_STRING_HERE",
}
response = requests.get(url,params = keyword,headers=headers)
content = response.content.decode('utf8')
tree = etree.HTML(content)
title = tree.xpath("//h3/a[contains(text(),'百度百科')]")[0]
result = etree.tostring(title,encoding='utf8').decode('utf8')
print(result)
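
The effect of the encoding argument is easier to see on a tiny made-up snippet. In this sketch the default call escapes non-ASCII characters as numeric character references, encoding="utf8" keeps them as UTF-8 bytes, and encoding="unicode" returns a str directly so no decode step is needed.

```python
from lxml import etree

# Made-up snippet; the href value is a placeholder.
tree = etree.HTML("<div><h3><a href='/item'>番茄</a></h3></div>")
link = tree.xpath("//h3/a")[0]

# Default: ASCII bytes, with Chinese characters written as &#...; references.
default_bytes = etree.tostring(link)

# UTF-8 bytes, which still need .decode('utf8') to become a str.
utf8_bytes = etree.tostring(link, encoding="utf8")

# encoding="unicode" returns a str instead of bytes.
text = etree.tostring(link, encoding="unicode")

print(default_bytes)
print(utf8_bytes.decode("utf8"))
print(text)
```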

 

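Further notes:

etree.parse

Parses XML/HTML from a file or file-like object and returns an ElementTree object (an ElementTree represents the whole document tree and supports document-level XPath queries). It uses the XML parser by default, so pass etree.HTMLParser() as the parser when reading HTML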
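
A minimal sketch of etree.parse on HTML follows. An in-memory io.BytesIO object stands in for a saved .html file, and the page content is made up for illustration.

```python
import io
from lxml import etree

# Made-up page, encoded to bytes as it would be when stored on disk.
data = "<html><head><title>番茄</title></head><body><p>hello</p></body></html>".encode("utf-8")

# Use the HTML parser and state the encoding of the bytes explicitly.
parser = etree.HTMLParser(encoding="utf-8")
doc = etree.parse(io.BytesIO(data), parser)

root = doc.getroot()                     # the <html> element
title = doc.xpath("//title/text()")      # document-level XPath query

print(root.tag)
print(title)
```

For a real file, a path such as a saved page's filename can be passed in place of the BytesIO object.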

