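I'm currently working through Yupi's (鱼皮) "Intelligent Collaborative Cloud Image Library" (《智能协同云图库》) project, which involves reverse image search and image scraping. I have scraped images before, but always with someone else's ready-made code and without really understanding why each step was needed; this time I'm trying to understand every step. My fundamentals are very weak: I'm the type who jumped straight into a project without learning any basics, so keeping up with the course is a struggle. Fortunately GPT is very helpful, and I can just about follow along. Here is a summary of the approach.
Baidu's reverse image search accepts an image URL as input. I chose the image at this URL: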
https://i2.hdslb.com/bfs/archive/ad698e40cc6dd3d03ae5d0ab7bfa50faf368bd9b.jpg
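Searching with it brings up the results page.
Next, open Safari's Web Inspector (in other browsers, this should be the developer tools).
Filter to the XHR type only, so that only API requests are shown.
Remember to enable Preserve Log, because the upload request flashes by quickly. On other sites the request may have a different name, such as pcsearch.
Enter the image URL and run the search again; the upload request then shows up in the list.
Next, look at the contents of the Headers tab.
Expanding the request data shows the form fields that were submitted.
sdkParams is usually a signature parameter generated by Baidu's official SDK; it may contain a timestamp, a signature, a key hash, and so on. It can be ignored here.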
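To make the captured request more concrete, here is a minimal JDK-only sketch of what the form body could look like on the wire. It assumes the fields are sent as a URL-encoded form (the browser may well use multipart instead), and the class name `FormBodySketch`, its helper methods, and the example.com image URL are all made up for illustration. The four field names and values are the ones used in the project code.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class FormBodySketch {

    // Encode key/value pairs as an application/x-www-form-urlencoded body.
    public static String encodeForm(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // The four fields sent with the upload request (hypothetical helper).
    public static Map<String, String> uploadFields(String imageUrl) {
        Map<String, String> fields = new LinkedHashMap<>(); // keeps insertion order
        fields.put("image", imageUrl);
        fields.put("tn", "pc");
        fields.put("from", "pc");
        fields.put("image_source", "PC_UPLOAD_URL");
        return fields;
    }

    public static void main(String[] args) {
        // Placeholder image URL, not a real search target
        System.out.println(encodeForm(uploadFields("https://example.com/a.jpg")));
    }
}
```

The point of the sketch is only that the image URL itself gets percent-encoded inside the form body, which is why it looks different in the inspector than in the address bar.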
package com.bxt.picturebackend.imageSearch.sub;

import cn.hutool.core.util.StrUtil;
import cn.hutool.core.util.URLUtil;
import cn.hutool.http.HttpRequest;
import cn.hutool.http.HttpResponse;
import cn.hutool.json.JSONUtil;
import com.bxt.picturebackend.exception.BusinessException;
import com.bxt.picturebackend.exception.ErrorCode;
import lombok.extern.slf4j.Slf4j;

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

@Slf4j
public class GetImagePageUrlApi {

    /**
     * Submits the image URL to Baidu's upload endpoint and returns the URL of the search results page.
     */
    @SuppressWarnings("unchecked")
    public static String getImagePageUrl(String imageUrl) {
        // Form fields taken from the captured upload request
        Map<String, Object> formData = new HashMap<>();
        formData.put("image", imageUrl);
        formData.put("tn", "pc");
        formData.put("from", "pc");
        formData.put("image_source", "PC_UPLOAD_URL");
        // The endpoint takes the current timestamp as the uptime query parameter
        long upTime = System.currentTimeMillis();
        String postUrl = "https://graph.baidu.com/upload?uptime=" + upTime;
        // Value of the Acs-Token request header, hard-coded here
        String acsToken = "jmM4zyI8OUixvSuWh0sCy4xWbsttVMZb9qcRTmn6SuNWg0vCO7N0s6Lffec+IY5yuqHujHmCctF9BVCGYGH0H5SH/H3VPFUl4O4CP1jp8GoAzuslb8kkQQ4a21Tebge8yhviopaiK66K6hNKGPlWt78xyyJxTteFdXYLvoO6raqhz2yNv50vk4/41peIwba4lc0hzoxdHxo3OBerHP2rfHwLWdpjcI9xeu2nJlGPgKB42rYYVW50+AJ3tQEBEROlg/UNLNxY+6200B/s6Ryz+n7xUptHFHi4d8Vp8q7mJ26yms+44i8tyiFluaZAr66/+wW/KMzOhqhXCNgckoGPX1SSYwueWZtllIchRdsvCZQ8tFJymKDjCf3yI/Lw1oig9OKZCAEtiLTeKE9/CY+Crp8DHa8Tpvlk2/i825E3LuTF8EQfzjcGpVnR00Lb4/8A";
        try {
            HttpResponse httpResponse = HttpRequest.post(postUrl)
                    .form(formData)
                    .timeout(10000)
                    .header("Acs-Token", acsToken)
                    .execute();
            if (httpResponse.getStatus() != 200) {
                log.error("Failed to get the image search page URL, status code: {}", httpResponse.getStatus());
                throw new RuntimeException("Failed to get the image search page URL, please try again later");
            }
            String body = httpResponse.body();
            System.out.println("body = " + body);
            Map<String, Object> responseMap = JSONUtil.toBean(body, Map.class);
            System.out.println("responseMap = " + responseMap);
            if (responseMap == null) {
                log.error("Failed to get the image search page URL, response body: {}", body);
                throw new RuntimeException("Failed to get the image search page URL, please try again later");
            }
            Map<String, Object> data = (Map<String, Object>) responseMap.get("data");
            System.out.println("data = " + data);
            // Check for a missing data object or url before decoding, to avoid a NullPointerException
            if (data == null || StrUtil.isBlank((String) data.get("url"))) {
                throw new BusinessException(ErrorCode.OPERATION_ERROR, "No valid result returned");
            }
            String rawUrl = (String) data.get("url");
            // URL-decode the result (for example %5B becomes [)
            return URLUtil.decode(rawUrl, StandardCharsets.UTF_8);
        } catch (BusinessException e) {
            // Keep the specific business error instead of wrapping it in a generic one
            throw e;
        } catch (Exception e) {
            log.error("Failed to get the image search page URL", e);
            throw new RuntimeException("Failed to get the image search page URL, please try again later", e);
        }
    }
}
Test it with a unit test class:
package com.bxt.picturebackend.imageSearch.sub;

import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertNotNull;

class GetImagePageUrlApiTest {

    // Calls the live Baidu endpoint, so this test needs network access
    @Test
    void testGetImagePageUrl() {
        String testImageUrl = "https://i2.hdslb.com/bfs/archive/ad698e40cc6dd3d03ae5d0ab7bfa50faf368bd9b.jpg";
        String response = GetImagePageUrlApi.getImagePageUrl(testImageUrl);
        System.out.println(response);
        // A usable result is a non-empty URL string
        assertNotNull(response);
        assertFalse(response.isBlank());
    }
}
The output is:
body = {"status":0,"msg":"Success","data":{"url":"https://graph.baidu.com/s?card_key=\u0026entrance=GENERAL\u0026extUiData%5BisLogoShow%5D=1\u0026f=all\u0026isLogoShow=1\u0026session_id=13377293787626920489\u0026sign=1260533cc766d268eaf8401755063018\u0026tpl_from=pc","sign":"1260533cc766d268eaf8401755063018"}}
responseMap = {status=0, msg=Success, data={"url":"https://graph.baidu.com/s?card_key=&entrance=GENERAL&extUiData%5BisLogoShow%5D=1&f=all&isLogoShow=1&session_id=13377293787626920489&sign=1260533cc766d268eaf8401755063018&tpl_from=pc","sign":"1260533cc766d268eaf8401755063018"}}
data = {"url":"https://graph.baidu.com/s?card_key=&entrance=GENERAL&extUiData%5BisLogoShow%5D=1&f=all&isLogoShow=1&session_id=13377293787626920489&sign=1260533cc766d268eaf8401755063018&tpl_from=pc","sign":"1260533cc766d268eaf8401755063018"}
https://graph.baidu.com/s?card_key=&entrance=GENERAL&extUiData[isLogoShow]=1&f=all&isLogoShow=1&session_id=13377293787626920489&sign=1260533cc766d268eaf8401755063018&tpl_from=pc
Process finished with exit code 0
The URL obtained here is the search results page.
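Two things change between the raw body and the final URL in the output above: the JSON parser turns `\u0026` into `&`, and the URL-decoding step turns `%5B` and `%5D` into `[` and `]`. The second step can be reproduced with the JDK alone. This is a minimal sketch; the class name `UrlDecodeSketch` is made up, and the sample string is a shortened copy of the URL in the output.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class UrlDecodeSketch {

    // Percent-decode a string: %5B becomes [ and %5D becomes ].
    // Note that URLDecoder also turns '+' into a space.
    public static String percentDecode(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Shortened copy of the URL after JSON parsing
        String parsed = "https://graph.baidu.com/s?card_key=&entrance=GENERAL&extUiData%5BisLogoShow%5D=1&f=all";
        System.out.println(percentDecode(parsed));
    }
}
```

Whether the decoded or the still-encoded form is sent in the next request is a design choice; the project code sends the decoded one.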
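Next, analyze this page.
Filter to the Document type only to get the page's HTML.
The images we need sit under the "Similar images" (相似图片) heading, so search the HTML around that text.
firstUrl looks useful.
Copy the string that follows it: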
https:\/\/graph.baidu.com\/ajax\/pcsimi?carousel=503&entrance=GENERAL&extUiData%5BisLogoShow%5D=1&inspire=general_pc&limit=30&next=2&render_type=card&session_id=13377293787626920489&sign=1260533cc766d268eaf8401755063018&tk=2e59f&tpl_from=pc
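It needs a small change: each backslash \ here is JSON's escape for the forward slash / inside a string. It belongs to the JSON encoding, not to the URL itself.
Removing every backslash "\" gives the following URL: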
https://graph.baidu.com/ajax/pcsimi?carousel=503&entrance=GENERAL&extUiData%5BisLogoShow%5D=1&inspire=general_pc&limit=30&next=2&render_type=card&session_id=13377293787626920489&sign=1260533cc766d268eaf8401755063018&tk=2e59f&tpl_from=pc
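The same cleanup can be done in code. The sketch below removes only the backslashes that sit directly in front of a slash, which is a little safer than deleting every backslash in case the string carries other escapes. The class and method names are made up for illustration, and the sample is a shortened copy of the firstUrl value.

```java
public class SlashUnescapeSketch {

    // In JSON, "\/" is an optional escape for "/"; dropping the backslash restores the URL.
    public static String unescapeSlashes(String jsonEscaped) {
        return jsonEscaped.replace("\\/", "/");
    }

    public static void main(String[] args) {
        // Shortened copy of the escaped firstUrl value
        String raw = "https:\\/\\/graph.baidu.com\\/ajax\\/pcsimi?carousel=503&next=2";
        System.out.println(unescapeSlashes(raw));
    }
}
```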
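Opening this URL returns the following response:
The string that follows thumbUrl is what we need.
However, pasting it straight into the browser fails.
The cause is the escape sequences it still contains:
The URL contains \u0026, which is the JSON Unicode escape for the character &. A URL cannot carry the literal text \u0026; its parameters have to be joined with a plain &. The same applies to the \u0026h=500 at the end.
After replacing each \u0026 with &, the URL looks like this: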
http://mms1.baidu.com/it/u=771534300,3396233686&fm=253&app=138&f=JPEG?w=800&h=500
and the image now displays correctly.
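Replacing only the one escape for & is enough for these URLs, but a more general option is to decode every four-digit Unicode escape. The following is a sketch, not the approach used in the project code; the class name is made up, and the input string is reconstructed by re-escaping the corrected URL above, so treat it as an assumption about the raw form.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeUnescapeSketch {

    // Matches a backslash, the letter u, and exactly four hex digits.
    private static final Pattern UNICODE_ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Replace every such escape with the character it encodes.
    public static String unescapeUnicode(String s) {
        Matcher m = UNICODE_ESCAPE.matcher(s);
        // quoteReplacement guards against decoded characters such as '$' or '\'
        return m.replaceAll(r -> Matcher.quoteReplacement(
                String.valueOf((char) Integer.parseInt(r.group(1), 16))));
    }

    public static void main(String[] args) {
        // Assumed raw form of the first thumbUrl
        String raw = "http://mms1.baidu.com/it/u=771534300,3396233686\\u0026fm=253"
                + "\\u0026app=138\\u0026f=JPEG?w=800\\u0026h=500";
        System.out.println(unescapeUnicode(raw));
    }
}
```

A full JSON parser would handle all of this automatically; the regex route is shown because the rest of the post extracts values with regular expressions.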
Extending the earlier code gives the complete version below; calling getUrlList returns the URLs of the similar images.
package com.bxt.picturebackend.imageSearch.sub;

import cn.hutool.core.util.StrUtil;
import cn.hutool.core.util.URLUtil;
import cn.hutool.http.HttpRequest;
import cn.hutool.http.HttpResponse;
import cn.hutool.json.JSONUtil;
import com.bxt.picturebackend.exception.BusinessException;
import com.bxt.picturebackend.exception.ErrorCode;
import lombok.extern.slf4j.Slf4j;

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

@Slf4j
public class GetImagePageUrlApi {

    // Value of the Acs-Token request header, hard-coded and shared by all three requests
    private static final String ACS_TOKEN = "jmM4zyI8OUixvSuWh0sCy4xWbsttVMZb9qcRTmn6SuNWg0vCO7N0s6Lffec+IY5yuqHujHmCctF9BVCGYGH0H5SH/H3VPFUl4O4CP1jp8GoAzuslb8kkQQ4a21Tebge8yhviopaiK66K6hNKGPlWt78xyyJxTteFdXYLvoO6raqhz2yNv50vk4/41peIwba4lc0hzoxdHxo3OBerHP2rfHwLWdpjcI9xeu2nJlGPgKB42rYYVW50+AJ3tQEBEROlg/UNLNxY+6200B/s6Ryz+n7xUptHFHi4d8Vp8q7mJ26yms+44i8tyiFluaZAr66/+wW/KMzOhqhXCNgckoGPX1SSYwueWZtllIchRdsvCZQ8tFJymKDjCf3yI/Lw1oig9OKZCAEtiLTeKE9/CY+Crp8DHa8Tpvlk2/i825E3LuTF8EQfzjcGpVnR00Lb4/8A";
    private static final int TIMEOUT_MS = 10000;
    // Non-greedy capture of the quoted value after each key
    private static final Pattern FIRST_URL_PATTERN = Pattern.compile("\"firstUrl\"\\s*:\\s*\"(.*?)\"");
    private static final Pattern THUMB_URL_PATTERN = Pattern.compile("\"thumbUrl\"\\s*:\\s*\"(.*?)\"");

    /**
     * Returns the URLs of images similar to the one at imageUrl.
     */
    public static List<String> getUrlList(String imageUrl) {
        // Step 1: get the URL of the search results page
        String imagePageUrl = getImagePageUrl(imageUrl);
        if (StrUtil.isBlank(imagePageUrl)) {
            throw new BusinessException(ErrorCode.OPERATION_ERROR, "No valid result returned");
        }
        // Step 2: fetch the results page and extract firstUrl from its HTML
        HttpResponse pageResponse = HttpRequest.get(imagePageUrl)
                .timeout(TIMEOUT_MS)
                .header("Acs-Token", ACS_TOKEN)
                .execute();
        if (pageResponse.getStatus() != 200) {
            log.error("Failed to fetch the image search page, status code: {}", pageResponse.getStatus());
            throw new RuntimeException("Failed to fetch the image search page, please try again later");
        }
        Matcher matcher = FIRST_URL_PATTERN.matcher(pageResponse.body());
        if (!matcher.find()) {
            throw new BusinessException(ErrorCode.OPERATION_ERROR, "No valid result returned");
        }
        // Undo the JSON escaping of slashes: \/ becomes /
        String firstUrl = matcher.group(1).replace("\\/", "/");
        log.debug("firstUrl = {}", firstUrl);
        // Step 3: fetch the similar-image list and collect every thumbUrl
        HttpResponse listResponse = HttpRequest.get(firstUrl)
                .timeout(TIMEOUT_MS)
                .header("Acs-Token", ACS_TOKEN)
                .execute();
        if (listResponse.getStatus() != 200) {
            log.error("Failed to fetch the similar image list, status code: {}", listResponse.getStatus());
            throw new RuntimeException("Failed to fetch the similar image list, please try again later");
        }
        List<String> urlList = new ArrayList<>();
        matcher = THUMB_URL_PATTERN.matcher(listResponse.body());
        while (matcher.find()) {
            // Turn the JSON escape for '&' back into a literal '&'
            urlList.add(matcher.group(1).replace("\\u0026", "&"));
        }
        return urlList;
    }

    /**
     * Submits the image URL to Baidu's upload endpoint and returns the URL of the search results page.
     */
    @SuppressWarnings("unchecked")
    public static String getImagePageUrl(String imageUrl) {
        // Form fields taken from the captured upload request
        Map<String, Object> formData = new HashMap<>();
        formData.put("image", imageUrl);
        formData.put("tn", "pc");
        formData.put("from", "pc");
        formData.put("image_source", "PC_UPLOAD_URL");
        // The endpoint takes the current timestamp as the uptime query parameter
        long upTime = System.currentTimeMillis();
        String postUrl = "https://graph.baidu.com/upload?uptime=" + upTime;
        try {
            HttpResponse httpResponse = HttpRequest.post(postUrl)
                    .form(formData)
                    .timeout(TIMEOUT_MS)
                    .header("Acs-Token", ACS_TOKEN)
                    .execute();
            if (httpResponse.getStatus() != 200) {
                log.error("Failed to get the image search page URL, status code: {}", httpResponse.getStatus());
                throw new RuntimeException("Failed to get the image search page URL, please try again later");
            }
            String body = httpResponse.body();
            System.out.println("body = " + body);
            Map<String, Object> responseMap = JSONUtil.toBean(body, Map.class);
            System.out.println("responseMap = " + responseMap);
            if (responseMap == null) {
                log.error("Failed to get the image search page URL, response body: {}", body);
                throw new RuntimeException("Failed to get the image search page URL, please try again later");
            }
            Map<String, Object> data = (Map<String, Object>) responseMap.get("data");
            System.out.println("data = " + data);
            // Check for a missing data object or url before decoding, to avoid a NullPointerException
            if (data == null || StrUtil.isBlank((String) data.get("url"))) {
                throw new BusinessException(ErrorCode.OPERATION_ERROR, "No valid result returned");
            }
            String rawUrl = (String) data.get("url");
            // URL-decode the result (for example %5B becomes [)
            return URLUtil.decode(rawUrl, StandardCharsets.UTF_8);
        } catch (BusinessException e) {
            // Keep the specific business error instead of wrapping it in a generic one
            throw e;
        } catch (Exception e) {
            log.error("Failed to get the image search page URL", e);
            throw new RuntimeException("Failed to get the image search page URL, please try again later", e);
        }
    }
}
Printing the final list gives:
[http://mms1.baidu.com/it/u=771534300,3396233686&fm=253&app=138&f=JPEG?w=800&h=500, http://mms0.baidu.com/it/u=4161103281,1829674203&fm=253&app=138&f=JPEG?w=749&h=580, http://mms2.baidu.com/it/u=2706284301,789398194&fm=253&app=120&f=JPEG?w=800&h=500, http://mms1.baidu.com/it/u=1667096992,1485299432&fm=253&app=138&f=JPEG?w=800&h=500, http://mms0.baidu.com/it/u=2502213264,439196765&fm=253&app=120&f=JPEG?w=800&h=500, http://mms0.baidu.com/it/u=4000521229,3982402882&fm=253&app=120&f=JPEG?w=655&h=446, http://mms2.baidu.com/it/u=640527677,1986438968&fm=253&app=138&f=JPEG?w=455&h=256, http://mms2.baidu.com/it/u=156995109,2192672339&fm=253&app=120&f=JPEG?w=801&h=500, http://mms0.baidu.com/it/u=48011703,2549638517&fm=253&app=138&f=JPEG?w=800&h=500, http://mms2.baidu.com/it/u=1316957924,1711619045&fm=253&app=120&f=JPEG?w=800&h=500, http://mms0.baidu.com/it/u=2192255561,2552189568&fm=253&app=138&f=JPEG?w=634&h=356, http://mms0.baidu.com/it/u=2868092005,3149855400&fm=253&app=138&f=JPEG?w=500&h=500, http://mms0.baidu.com/it/u=2173262737,1364469520&fm=253&app=138&f=JPEG?w=500&h=500, http://mms0.baidu.com/it/u=896380067,3285805132&fm=253&app=138&f=JPEG?w=1053&h=800, http://mms0.baidu.com/it/u=184083361,1291046512&fm=253&app=138&f=JPEG?w=500&h=500, http://mms0.baidu.com/it/u=2147020713,3191068967&fm=253&app=138&f=JPEG?w=867&h=500, http://mms0.baidu.com/it/u=864737700,3400231159&fm=253&app=120&f=JPEG?w=800&h=500, http://mms1.baidu.com/it/u=153299186,2018689789&fm=253&app=120&f=JPEG?w=480&h=270, http://mms0.baidu.com/it/u=2253215478,3249860676&fm=253&app=120&f=JPEG?w=800&h=500, http://mms2.baidu.com/it/u=3522373714,3342355003&fm=253&app=120&f=JPEG?w=800&h=500]
Every one of them is Kunkun (坤坤).
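To try the thumbUrl extraction without calling Baidu, the regular expression can be run against a small made-up fragment. This is only a sketch: the class name, the a.example URLs, and the shape of the sample JSON are hypothetical, and the real response may be structured differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ThumbUrlRegexSketch {

    // Same shape as the pattern in getUrlList: key, optional spaces, colon, quoted value.
    private static final Pattern THUMB_URL = Pattern.compile("\"thumbUrl\"\\s*:\\s*\"(.*?)\"");

    // Collect every thumbUrl value and restore the escaped '&' characters.
    public static List<String> extractThumbUrls(String json) {
        List<String> urls = new ArrayList<>();
        Matcher m = THUMB_URL.matcher(json);
        while (m.find()) {
            urls.add(m.group(1).replace("\\u0026", "&"));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical response fragment with two entries
        String sample = "{\"list\":[{\"thumbUrl\":\"http://a.example/1?w=1\\u0026h=2\"},"
                + "{\"thumbUrl\": \"http://a.example/2?w=3\\u0026h=4\"}]}";
        System.out.println(extractThumbUrls(sample));
    }
}
```

The non-greedy `(.*?)` matters here: a greedy `(.*)` would run from the first opening quote to the last closing quote in the whole response and return one long, useless match.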