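Web scraping workflow:
- Fetch the HTML page: the client (our script) sends an HTTP request to the server
- Extract data according to rules: XPath, bs4, or re (regular expressions)
- Store the data: Excel, txt, csv, or SQL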
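The three steps above can be sketched end to end without touching the network. This is only a minimal illustration: a hard-coded HTML string stands in for a real response, and the sample markup, the `extract_links` helper, and the regular expression are all made up for the example.

```python
import csv
import io
import re

# Step 1 (stand-in): a hard-coded page instead of a real HTTP response.
html = '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>'


def extract_links(page):
    """Step 2: pull (href, text) pairs out of the page with a regular expression."""
    return re.findall(r'<a href="([^"]+)">([^<]+)</a>', page)


rows = extract_links(html)

# Step 3: store the rows as CSV (an in-memory buffer here; a file works the same way).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["href", "text"])
writer.writerows(rows)

print(buffer.getvalue())
```

Swapping the hard-coded string for a downloaded page and the buffer for an open file turns the sketch into a real, if very small, scraper.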
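I. How to fetch the HTML page
Python offers two commonly used libraries for this: urllib and requests
urllib: ships with Python and can simulate HTTP requests with no extra installation
requests: not a built-in library; it must be installed separately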
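For comparison, here is a small sketch of how a request is described with the built-in urllib. It only builds the request object and does not send it; the commented-out `urlopen` call is what would actually contact the server. The variable names are illustrative.

```python
from urllib import request

url = "https://www.baidu.com"

# Build a GET request object; nothing is sent over the network yet.
req = request.Request(url, method="GET")
print(req.get_method(), req.full_url)

# To actually send it (needs network access):
# with request.urlopen(req, timeout=10) as resp:
#     html = resp.read().decode("utf8")
```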
Installing the requests library
In Anaconda Prompt, run: pip install requests
Calling requests.get directly
import requests
url = "https://www.baidu.com"
response = requests.get(url)
print(response)
>>> <Response [200]>
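Using response.text
This returns the HTML, but the text may be garbled: response.text decodes the body with an encoding that requests guesses, and the guess can be wrong. Baidu's page, for example, is encoded in utf8, which is not necessarily the encoding that was guessed.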
import requests
url = "https://www.baidu.com"
response = requests.get(url).text
print(response)
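The garbling can be reproduced offline. The sketch below assumes the wrong guess is ISO-8859-1, a common fallback when the response headers declare no charset; the sample text is arbitrary.

```python
original = "百度一下,你就知道"

# The server sends bytes; here we simulate a UTF-8 encoded body.
raw = original.encode("utf-8")

# Decoding with the wrong encoding produces mojibake instead of raising an error.
garbled = raw.decode("iso-8859-1")

# Decoding with the encoding the page was written in recovers the text.
recovered = raw.decode("utf-8")

print(garbled)
print(recovered)
```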
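Using response.content
This returns the raw bytes of the response; we take those bytes and decode them ourselves.
Note: a page must be decoded with the same encoding the site used when it was written.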
import requests
url = "https://www.baidu.com"
response = requests.get(url)
content = response.content.decode('utf8')
print(content)
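When the encoding of a site is not known in advance, one option is to try a few candidates in order. This is a sketch, not part of requests: `decode_body` is a hypothetical helper, and the candidate list (utf-8 first, then gbk) is an assumption that suits many Chinese sites.

```python
def decode_body(raw, encodings=("utf-8", "gbk")):
    """Try each candidate encoding in order and return the first clean decode."""
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    # Last resort: decode anyway, replacing undecodable bytes with U+FFFD.
    return raw.decode(encodings[0], errors="replace")


print(decode_body("番茄".encode("utf-8")))
print(decode_body("番茄".encode("gbk")))
```

utf-8 is tried first because it rejects most byte sequences that are not valid UTF-8, whereas gbk would often decode UTF-8 bytes into the wrong characters without raising.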
Further notes:
response.encoding
response.text decodes the body using the encoding that requests guessed; response.encoding shows which encoding that was
response.status_code
The HTTP status code the server returned for the requested URL
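The two attributes can be combined into a small check before decoding. The sketch below uses a stand-in object rather than a real response so that it runs offline; `body_text` is a hypothetical helper, and the stand-in only mimics the three attributes it reads.

```python
from types import SimpleNamespace


def body_text(response, fallback="utf-8"):
    """Return the decoded body, or raise if the request did not succeed."""
    if response.status_code != 200:
        raise RuntimeError(f"unexpected status code: {response.status_code}")
    # Use the encoding reported on the response, or the fallback when it is missing.
    return response.content.decode(response.encoding or fallback)


# Stand-in with the same three attributes as a real requests response.
fake = SimpleNamespace(status_code=200, encoding=None, content="番茄".encode("utf-8"))
print(body_text(fake))
```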
import requests
url = "https://www.baidu.com/s?wd=番茄"
response = requests.get(url)
content = response.content.decode('utf8')
print(content)
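This can raise an error: data travels over the network as bytes, so the non-ASCII keyword in the URL has to be encoded first. Instead, we pass the query parameters as a dictionary and let requests build the URL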
import requests
url = "https://www.baidu.com/s"
keyword={
"wd" : "番茄"
}
response = requests.get(url,params = keyword)
content = response.content.decode('utf8')
print(content)
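What happens to the dictionary can be shown with the standard library alone. The sketch below uses `urllib.parse.urlencode` to build the same kind of query string; it illustrates the percent-encoding and is not a claim about how requests is implemented internally.

```python
from urllib.parse import urlencode

base_url = "https://www.baidu.com/s"
keyword = {"wd": "番茄"}

# urlencode percent-encodes the UTF-8 bytes of each non-ASCII character.
query = urlencode(keyword)
full_url = f"{base_url}?{query}"

print(query)     # wd=%E7%95%AA%E8%8C%84
print(full_url)  # https://www.baidu.com/s?wd=%E7%95%AA%E8%8C%84
```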
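Very little data comes back, because the server has recognized the request as coming from a crawler. We need to disguise our request as a browser visit by setting the request headers (Request Headers)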
import requests
url = "https://www.baidu.com/s"
keyword = {
"wd" : "番茄"
}
headers = {
"User-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36 Edg/138.0.0.0",
# Paste the Cookie value from your own browser session (developer tools, Network tab) as a single line; a line break inside a header value makes requests raise an error.
"Cookie":"<your Cookie string>"
}
response = requests.get(url,params = keyword,headers=headers)
content = response.content.decode('utf8')
print(content)
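To see exactly what would be sent, the request can be prepared without sending it. This sketch uses `requests.Request(...).prepare()`; the shortened User-Agent string is a placeholder for illustration, not a value required by the server.

```python
import requests

url = "https://www.baidu.com/s"
keyword = {"wd": "番茄"}
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# Build and prepare the request; nothing is sent over the network.
prepared = requests.Request("GET", url, params=keyword, headers=headers).prepare()

print(prepared.url)                    # final URL with the encoded query string
print(prepared.headers["User-Agent"])  # header as it would be sent
```

Inspecting the prepared request is a quick way to confirm that the headers and parameters look the way a browser's would before spending requests on the live site.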
This completes the first step
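II. Extracting data with XPath
At the end of the previous step all we have is a string full of HTML tags, so we first need to convert it into an HTML tree (XPath is designed specifically for extracting data from the tree structure of XML/HTML)
lxml library: parses an HTML string into a tree (DOM tree), which enables XPath (locating nodes by path) and CSS selectors (locating nodes by class name, ID, and so on)
etree.HTML
Parses an HTML string into an operable XML/HTML tree, making it easy to extract data with XPath or CSS selectors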
import requests
from lxml import etree
url = "https://www.baidu.com/s"
keyword = {
"wd" : "番茄"
}
headers = {
"User-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36 Edg/138.0.0.0",
# Paste the Cookie value from your own browser session (developer tools, Network tab) as a single line; a line break inside a header value makes requests raise an error.
"Cookie":"<your Cookie string>"
}
response = requests.get(url,params = keyword,headers=headers)
content = response.content.decode('utf8')
tree = etree.HTML(content)
title = tree.xpath("//h3/a[contains(text(),'百度百科')]/text()")[0]
print(title)
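The same XPath can be tried offline against a small hand-written page, which makes it easier to see what each expression returns. The markup and the example.com links below are made up for illustration; the live Baidu page may be structured differently.

```python
from lxml import etree

# Hypothetical sample markup standing in for a downloaded page.
html = """
<html><body>
  <h3><a href="https://example.com/tomato">番茄 - 百度百科</a></h3>
  <h3><a href="https://example.com/other">Another result</a></h3>
</body></html>
"""

tree = etree.HTML(html)

# A node-set query like this returns a list, so check it before indexing.
titles = tree.xpath("//h3/a[contains(text(),'百度百科')]/text()")
title = titles[0] if titles else None

# Attributes are selected with @name.
links = tree.xpath("//h3/a/@href")

print(title)
print(links)
```

Guarding the `[0]` matters in practice: when the page layout changes or the crawler is blocked, the list is empty and a bare index raises IndexError.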
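etree.tostring
Serializes a node tree into a byte string
etree.tostring(html,encoding='utf8').decode('utf8')
encoding='utf8': specifies the encoding of the byte string
.decode('utf8'): converts the byte string into a human-readable string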
import requests
from lxml import etree
url = "https://www.baidu.com/s"
keyword = {
"wd" : "番茄"
}
headers = {
"User-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36 Edg/138.0.0.0",
# Paste the Cookie value from your own browser session (developer tools, Network tab) as a single line; a line break inside a header value makes requests raise an error.
"Cookie":"<your Cookie string>"
}
response = requests.get(url,params = keyword,headers=headers)
content = response.content.decode('utf8')
tree = etree.HTML(content)
title = tree.xpath("//h3/a[contains(text(),'百度百科')]")[0]
print(title)
Printing the node directly only shows an Element object such as <Element a at 0x...>; etree.tostring shows its markup:
import requests
from lxml import etree
url = "https://www.baidu.com/s"
keyword = {
"wd" : "番茄"
}
headers = {
"User-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36 Edg/138.0.0.0",
# Paste the Cookie value from your own browser session (developer tools, Network tab) as a single line; a line break inside a header value makes requests raise an error.
"Cookie":"<your Cookie string>"
}
response = requests.get(url,params = keyword,headers=headers)
content = response.content.decode('utf8')
tree = etree.HTML(content)
title = tree.xpath("//h3/a[contains(text(),'百度百科')]")[0]
result = etree.tostring(title,encoding='utf8').decode('utf8')
print(result)
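The serialization step can also be tried offline. The sketch below uses made-up markup and shows the bytes-then-decode route from the notes next to `encoding="unicode"`, which asks lxml for a str directly.

```python
from lxml import etree

# Hypothetical sample markup.
tree = etree.HTML('<h3><a href="https://example.com/tomato">番茄 - 百度百科</a></h3>')
node = tree.xpath("//h3/a")[0]

# Route used in the notes: serialize to bytes, then decode to str.
as_bytes = etree.tostring(node, encoding="utf8")
as_text = as_bytes.decode("utf8")

# Alternative: encoding="unicode" makes lxml return a str directly.
direct = etree.tostring(node, encoding="unicode")

print(as_text)
print(direct)
```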
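Further notes:
etree.parse
Parses XML/HTML from a file or file object and returns an ElementTree object (an ElementTree represents the whole document tree and supports document-level XPath queries)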
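A minimal sketch of etree.parse, using an in-memory file object so that it runs without any file on disk. The markup is made up for illustration; passing `etree.HTMLParser()` is a choice made here because the default parser expects well-formed XML.

```python
import io

from lxml import etree

# Hypothetical sample markup; a real file path or open file object works the same way.
html = '<html><body><h3><a href="https://example.com/tomato">番茄 - 百度百科</a></h3></body></html>'

# HTMLParser tolerates real-world HTML.
doc = etree.parse(io.StringIO(html), etree.HTMLParser())

root = doc.getroot()                 # the root element of the document tree
titles = doc.xpath("//h3/a/text()")  # document-level XPath query

print(root.tag)
print(titles)
```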