Crawl4AI: A Hands-On Guide to the Web Crawler and Data Extraction Tool Designed for Large Language Models (LLMs) and AI Applications

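Table of Contents

I. Crawl4AI Overview and Features
- 1. Overview
- 2. Features
II. Project Links
III. Environment Setup
IV. Applying for LLM API Access
V. Code Examples
- 1. Generating Markdown
- 2. Structured Data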

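I. Crawl4AI Overview and Features

1. Overview

Crawl4AI is an open-source web crawler and data scraping tool written in Python. Its main purpose is to collect and process data for large language models (LLMs) and AI applications.

2. Features

**Free and open source:** Released under the Apache-2.0 license, so developers can use, modify, and distribute the source code at no cost.
**Designed for LLMs:** Cleans and converts web page data into formats that LLMs can consume directly, such as JSON, cleaned HTML, and Markdown, which makes the output easy to feed into model training.
**Efficient:** Supports processing multiple URLs in parallel, so many pages can be fetched and processed at the same time, which shortens large-scale data collection.
**Broad extraction support:** Extracts text, media tags (images, audio, video), metadata, and internal and external links from a page, and can also take page screenshots.
**Highly customizable:** Users can customize authentication, request headers, pre-crawl page modifications, the user agent, and JavaScript execution, and can tune crawl depth, frequency, and extraction rules to fit different page structures and data types.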
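
The parallel-processing feature can be illustrated with a small sketch. It assumes only the `arun(url=...)` call and the `result.markdown` attribute that appear in the examples later in this article; the helper name `crawl_many` is hypothetical, and whether a single crawler instance handles many concurrent calls efficiently is an assumption worth checking against the library documentation, which may also offer a dedicated batch method.

```python
import asyncio


async def crawl_many(crawler, urls):
    """Fetch several URLs concurrently and return {url: markdown}.

    `crawler` is assumed to expose `await crawler.arun(url=...)`, as
    AsyncWebCrawler does in this article's examples.
    """
    # Each arun() call is a coroutine; gather() runs them concurrently
    # and returns the results in the same order as `urls`.
    results = await asyncio.gather(*(crawler.arun(url=u) for u in urls))
    return {url: result.markdown for url, result in zip(urls, results)}
```

Inside an `async with AsyncWebCrawler() as crawler:` block, this would be called as `pages = await crawl_many(crawler, ["https://example.com", ...])`.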

II. Project Links

GitHub repository: https://github.com/unclecode/crawl4ai

Crawl4AI official website: https://crawl4ai.com/

III. Environment Setup

Python: 3.7+
Windows: 8+
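
Before running the examples, it can help to confirm that the interpreter and the package are in place. The following is a minimal sketch: the minimum Python version is the one stated above, the `check_environment` helper is hypothetical, and the install hint assumes the package is installed from PyPI under the same name it is imported by (`crawl4ai`). Newer releases of the library may require a more recent Python than listed here.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 7)  # minimum version as stated in this article


def check_environment(min_python=MIN_PYTHON, package="crawl4ai"):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info[:2] < tuple(min_python):
        found = ".".join(map(str, sys.version_info[:2]))
        wanted = ".".join(map(str, min_python))
        problems.append(f"Python {wanted}+ required, found {found}")
    # find_spec() returns None when a top-level package cannot be imported.
    if importlib.util.find_spec(package) is None:
        problems.append(f"package '{package}' not found (try: pip install {package})")
    return problems
```

Calling `check_environment()` and printing each returned message gives a quick checklist before the first crawl.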

IV. Applying for LLM API Access

Moonshot AI / Kimi Chat

API key application: https://platform.moonshot.cn/console/api-keys
API documentation: https://platform.moonshot.cn/docs
API pricing: https://platform.moonshot.cn/docs/price/chat

Baidu / ERNIE Bot (Wenxin Yiyan)

API application: https://console.bce.baidu.com/qianfan/ais/console/applicationConsole/application
API documentation: https://cloud.baidu.com/doc/WENXINWORKSHOP/s/flfmc9do2
API pricing: https://cloud.baidu.com/doc/WENXINWORKSHOP/s/Blfmc9dlf

Zhipu AI / GLM

API key application: https://bigmodel.cn/usercenter/apikeys
API documentation: https://bigmodel.cn/dev/api
API pricing: https://open.bigmodel.cn/pricing

MiniMax

API key application: https://platform.minimaxi.com/user-center/basic-information/interface-key
API documentation: https://platform.minimaxi.com/document/notice
API pricing: https://platform.minimaxi.com/document/price

Alibaba / Tongyi Qianwen (Qwen)

API key application: https://dashscope.console.aliyun.com/apiKey
API documentation: https://help.aliyun.com/zh/dashscope/developer-reference
API pricing: https://dashscope.console.aliyun.com/billing

iFLYTEK / Spark

API key application: https://console.xfyun.cn/services/cbm
API documentation: https://www.xfyun.cn/doc/spark/Web.html
API pricing: https://xinghuo.xfyun.cn/sparkapi

DeepSeek

API key application: https://platform.deepseek.com/api_keys
API documentation: https://platform.deepseek.com/api-docs/zh-cn/
API pricing: https://platform.deepseek.com/api-docs/zh-cn/pricing
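
Once a key has been issued, it is usually safer to read it from the environment than to paste it into source code. The sketch below shows one way this might look for Moonshot; the provider string and base URL are the ones used in this article's structured-data example, while the `load_llm_settings` helper and the `MOONSHOT_API_KEY` variable name are hypothetical choices made for illustration. Other providers would need their own endpoint and model name from their API documentation.

```python
import os

# The only endpoint given in this article; other providers differ.
MOONSHOT_BASE_URL = "https://api.moonshot.cn/v1"


def load_llm_settings(env=None, key_var="MOONSHOT_API_KEY"):
    """Build keyword arguments for an LLM extraction strategy.

    `key_var` is a hypothetical environment variable name; pick any name
    you like and export your own key under it.
    """
    env = os.environ if env is None else env
    api_key = env.get(key_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {key_var} environment variable to your API key")
    return {
        "provider": "openai/moonshot-v1-8k",
        "api_token": api_key,
        "base_url": MOONSHOT_BASE_URL,
    }
```

The returned dictionary could then be unpacked into the strategy constructor, for example `LLMExtractionStrategy(instruction=..., **load_llm_settings())`.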

V. Code Examples

1. Generating Markdown

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com"
        )
        print(result.markdown)  # Print clean markdown content

if __name__ == "__main__":
    asyncio.run(main())

Running the script prints the content of the page as clean Markdown.

2. Structured Data

import asyncio
import json

from crawl4ai import LLMExtractionStrategy, AsyncWebCrawler
from tenacity import retry, stop_after_attempt, wait_exponential


class LLMExtractionError(Exception):
    """Raised when crawling or parsing the LLM output fails."""


# Retry up to 3 times with exponential backoff. reraise=True makes tenacity
# re-raise the last LLMExtractionError instead of wrapping it in RetryError,
# so the except clause in main() can actually catch it.
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10), reraise=True)
async def extract_with_retry(crawler, url, extraction_strategy):
    try:
        result = await crawler.arun(url=url, extraction_strategy=extraction_strategy, bypass_cache=True)
        print(result.extracted_content)  # raw JSON string returned by the LLM
        return json.loads(result.extracted_content)
    except Exception as e:
        raise LLMExtractionError(f"Failed to extract content: {e}") from e


async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        try:
            content = await extract_with_retry(
                crawler,
                "https://shop.health-100.cn/goods",
                LLMExtractionStrategy(
                    provider="openai/moonshot-v1-8k",
                    api_token="YOUR_API_KEY",  # placeholder: apply for your own key
                    instruction="Return the name and price of each product on the current page, in JSON format",
                    base_url="https://api.moonshot.cn/v1",
                ),
            )
            print("Extracted content:", content)
        except LLMExtractionError as e:
            print(f"Extraction failed after retries: {e}")


if __name__ == "__main__":
    asyncio.run(main())

Running the script prints the content extracted by the LLM, parsed from its JSON response.
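
The retry-with-backoff idea used in this example can also be sketched without any third-party dependency. The version below is an illustration only: the `retry_async` helper is hypothetical, and its delay formula (doubling from `base_delay`, clamped to a minimum and maximum) is an assumption that mirrors the parameters in the example rather than a guaranteed match for tenacity's exact schedule.

```python
import asyncio


async def retry_async(func, attempts=3, base_delay=1.0, min_delay=4.0,
                      max_delay=10.0, sleep=asyncio.sleep):
    """Await `func()` until it succeeds or `attempts` calls have failed."""
    for attempt in range(1, attempts + 1):
        try:
            return await func()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error unchanged
            # Exponential backoff, clamped to [min_delay, max_delay].
            delay = min(max(base_delay * 2 ** (attempt - 1), min_delay), max_delay)
            await sleep(delay)
```

The `sleep` parameter exists so that tests can swap in a fake and avoid real waiting; in normal use it can be left at its default.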

The examples above show Crawl4AI in practice: it first converts an arbitrary web page into Markdown, and an LLM then structures that Markdown into JSON.
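
Because the JSON comes from an LLM, its exact shape is not guaranteed, so a small normalization step before using the data can be useful. The sketch below assumes the model returns either one object or a list of objects with `name` and `price` keys; those key names and the `parse_products` helper are hypothetical, since the instruction in the example does not fix a schema.

```python
import json


def parse_products(extracted_content):
    """Normalize the extractor's JSON string into a list of product dicts.

    Assumes (hypothetically) that each item carries "name" and "price";
    items that are not objects or lack either key are skipped.
    """
    data = json.loads(extracted_content)
    if isinstance(data, dict):
        data = [data]  # a single object is treated as a one-item list
    products = []
    for item in data:
        if not isinstance(item, dict):
            continue
        name, price = item.get("name"), item.get("price")
        if name is None or price is None:
            continue
        products.append({"name": str(name).strip(), "price": str(price).strip()})
    return products
```

Prices are kept as strings here because an LLM may return them with currency symbols or units; converting them to numbers would be a separate, site-specific step.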
