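This project is a full-stack application that lets users upload PDF files. The backend, built with Flask, stores each original PDF in a MinIO bucket and indexes its extracted text in OpenSearch. The frontend is a simple React app for uploading files.
Code: https://github.com/zhouruiliangxian/Awesome-demo/tree/main/Fullstack/pdf_search_app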
Project structure
pdf_search_app/
├── backend/                # Flask backend
│   ├── .env                # Backend environment variables
│   ├── app.py              # Main Flask application logic
│   └── requirements.txt    # Python dependencies
├── frontend/               # React frontend
│   ├── public/
│   ├── src/
│   │   ├── App.css         # Frontend styles
│   │   └── App.js          # Main React component
│   └── package.json
└── docker-compose.yml      # Docker Compose file that runs the services
How to run the application
Follow the steps below to get the whole application up and running.
Step 1: Start the infrastructure services
version: '3.8'
services:
  opensearch-node:
    image: opensearchproject/opensearch:2.19.1
    container_name: opensearch-node-pdf
    environment:
      - cluster.name=opensearch-cluster
      - node.name=opensearch-node
      - discovery.type=single-node
      - bootstrap.memory_lock=true
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m"
      - "DISABLE_SECURITY_PLUGIN=true"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    volumes:
      - opensearch-data:/usr/share/opensearch/data
    ports:
      - "9200:9200"
      - "9600:9600"
    networks:
      - app-network
  opensearch-dashboards:
    image: opensearchproject/opensearch-dashboards:2.19.1
    container_name: opensearch-dashboards-pdf
    ports:
      - "5601:5601"
    environment:
      OPENSEARCH_HOSTS: '["http://opensearch-node:9200"]'
      DISABLE_SECURITY_DASHBOARDS_PLUGIN: "true"
    networks:
      - app-network
    depends_on:
      - opensearch-node
  minio:
    image: minio/minio:latest
    container_name: minio
    ports:
      - "9000:9000" # API Port
      - "9001:9001" # Console Port
    volumes:
      - minio-data:/data
    environment:
      - MINIO_ROOT_USER=minioadmin # Change for production
      - MINIO_ROOT_PASSWORD=minioadmin # Change for production
    command: server /data --address ":9000" --console-address ":9001"
    networks:
      - app-network
volumes:
  opensearch-data:
  minio-data:
networks:
  app-network:
    driver: bridge
The docker-compose.yml file above starts OpenSearch, OpenSearch Dashboards, and MinIO.
From the pdf_search_app root directory, run:
docker-compose up -d
Once the containers are running, the services are available at:
- OpenSearch Dashboards: http://localhost:5601
- MinIO console: http://localhost:9001 (log in with minioadmin / minioadmin, the credentials configured in docker-compose.yml)
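Before moving on, it can help to confirm that the containers are actually answering. The following is a minimal sketch, not part of the project: it assumes the port mappings from docker-compose.yml, uses the OpenSearch root endpoint and MinIO's /minio/health/live liveness endpoint, and the helper names is_up and check_all are made up for illustration.

```python
import urllib.request

# Assumed local URLs, derived from the port mappings in docker-compose.yml.
SERVICES = {
    "OpenSearch": "http://localhost:9200",
    "MinIO": "http://localhost:9000/minio/health/live",
}


def is_up(url, opener=urllib.request.urlopen, timeout=3):
    """Return True if the URL answers with a 2xx status, False otherwise."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError and HTTPError are subclasses of OSError, so refused
        # connections, timeouts, and 4xx/5xx answers all end up here.
        return False


def check_all(services=None, opener=urllib.request.urlopen):
    """Map each service name to True/False depending on reachability."""
    services = SERVICES if services is None else services
    return {name: is_up(url, opener) for name, url in services.items()}


# Example (needs the containers from step 1 to be running):
#   for name, ok in check_all().items():
#       print(f"{name}: {'up' if ok else 'not reachable yet'}")
```

OpenSearch can take some time to become ready after docker-compose up -d returns, so a False result right after startup does not necessarily mean something is misconfigured.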
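Step 2: Run the Flask backend
Go to the backend directory:
cd backend
Create a virtual environment and install the dependencies:
# Create a virtual environment (uv creates it in .venv by default)
uv venv
# Activate it (Windows)
.venv\Scripts\activate
# Activate it (macOS/Linux)
# source .venv/bin/activate
# Install the dependencies
uv pip install -r requirements.txt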
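Start the Flask server:
Important: start the server with uv run app.py so that the initialization code (such as creating the MinIO bucket) runs. That code sits under the if __name__ == '__main__': guard in app.py, so it is skipped when the app is launched another way, for example with flask run.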
The app.py file:
# -*- coding: utf-8 -*-
import io
import os
import time

import PyPDF2
from dotenv import load_dotenv
from flask import Flask, request, jsonify
from flask_cors import CORS
from minio import Minio
from opensearchpy import OpenSearch

# --- Initialization ---
load_dotenv()
app = Flask(__name__)
# Enable CORS for React frontend (adjust in production)
CORS(app, resources={r"/api/*": {"origins": "http://localhost:3000"}})

# --- Client Connections ---
# OpenSearch Client
opensearch_client = OpenSearch(
    hosts=[{'host': os.getenv('OPENSEARCH_HOST'), 'port': int(os.getenv('OPENSEARCH_PORT'))}],
    http_auth=None,
    use_ssl=False,
    verify_certs=False,
    ssl_assert_hostname=False,
    ssl_show_warn=False,
)

# MinIO Client
minio_client = Minio(
    os.getenv('MINIO_ENDPOINT'),
    access_key=os.getenv('MINIO_ACCESS_KEY'),
    secret_key=os.getenv('MINIO_SECRET_KEY'),
    secure=False  # Set to True if using HTTPS
)


# --- Helper Functions ---
def setup_minio_and_opensearch():
    """Ensure MinIO bucket and OpenSearch index exist, with retries."""
    max_retries = 5
    retry_delay = 3  # seconds

    # Setup MinIO
    for i in range(max_retries):
        try:
            bucket_name = os.getenv('MINIO_BUCKET')
            found = minio_client.bucket_exists(bucket_name)
            if not found:
                minio_client.make_bucket(bucket_name)
                print(f"MinIO bucket '{bucket_name}' created.")
            else:
                print(f"MinIO bucket '{bucket_name}' already exists.")
            break  # Success, exit loop
        except Exception as e:
            print(f"MinIO setup failed (attempt {i+1}/{max_retries}): {e}")
            if i + 1 == max_retries:
                raise
            print(f"Retrying in {retry_delay} seconds...")
            time.sleep(retry_delay)

    # Setup OpenSearch (can also have a retry loop if needed)
    index_name = os.getenv('OPENSEARCH_INDEX')
    if not opensearch_client.indices.exists(index=index_name):
        opensearch_client.indices.create(index=index_name)
        print(f"OpenSearch index '{index_name}' created.")
    else:
        print(f"OpenSearch index '{index_name}' already exists.")


def extract_text_from_pdf(pdf_file):
    """Extracts text content from a PDF file stream."""
    text = ""
    try:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        for page in pdf_reader.pages:
            text += page.extract_text() or ""
    except Exception as e:
        print(f"Error extracting PDF text: {e}")
        return None
    return text


# --- API Routes ---
@app.route('/api/upload', methods=['POST'])
def upload_pdf():
    if 'file' not in request.files:
        return jsonify({"error": "No file part"}), 400

    file = request.files['file']
    if file.filename == '' or not file.filename.lower().endswith('.pdf'):
        return jsonify({"error": "Invalid file, please upload a PDF"}), 400

    try:
        # Read file into memory
        pdf_bytes = file.read()
        pdf_stream = io.BytesIO(pdf_bytes)
        file_length = len(pdf_bytes)
        file_name = file.filename

        # 1. Upload original PDF to MinIO
        minio_bucket = os.getenv('MINIO_BUCKET')
        minio_client.put_object(
            minio_bucket,
            file_name,
            pdf_stream,
            length=file_length,
            content_type='application/pdf'
        )
        print(f"Successfully uploaded '{file_name}' to MinIO bucket '{minio_bucket}'.")

        # 2. Extract text from PDF
        pdf_stream.seek(0)  # Reset stream position after upload
        extracted_text = extract_text_from_pdf(pdf_stream)
        if extracted_text is None:
            return jsonify({"error": "Could not extract text from PDF"}), 500

        # 3. Index metadata and text into OpenSearch
        document = {
            'file_name': file_name,
            'minio_path': f"/{minio_bucket}/{file_name}",
            'content': extracted_text,
            'size_bytes': file_length
        }
        opensearch_index = os.getenv('OPENSEARCH_INDEX')
        opensearch_client.index(
            index=opensearch_index,
            body=document,
            refresh=True  # Make it immediately searchable
        )
        print(f"Successfully indexed metadata for '{file_name}' in OpenSearch.")

        return jsonify({
            "message": "File uploaded and indexed successfully!",
            "file_name": file_name,
            "minio_path": document['minio_path']
        }), 201
    except Exception as e:
        print(f"An error occurred: {e}")
        return jsonify({"error": "An internal error occurred"}), 500


# --- Main Execution ---
if __name__ == '__main__':
    with app.app_context():
        setup_minio_and_opensearch()
    app.run(host='0.0.0.0', port=5001, debug=True)
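app.py reads seven environment variables through os.getenv, and a missing one only surfaces later as a confusing error (for example int(None) for the port). The sketch below shows one way to validate them up front. The variable names come from app.py; the Settings class, the load_settings helper, and the example values are assumptions for illustration, since the project's actual .env is not shown here.

```python
import os
from dataclasses import dataclass

# Names of the variables that app.py reads with os.getenv().
REQUIRED_VARS = (
    "OPENSEARCH_HOST", "OPENSEARCH_PORT", "OPENSEARCH_INDEX",
    "MINIO_ENDPOINT", "MINIO_ACCESS_KEY", "MINIO_SECRET_KEY", "MINIO_BUCKET",
)


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the backend configuration."""
    opensearch_host: str
    opensearch_port: int
    opensearch_index: str
    minio_endpoint: str
    minio_access_key: str
    minio_secret_key: str
    minio_bucket: str


def load_settings(env=None):
    """Read the settings from a mapping (os.environ by default).

    Raises RuntimeError listing every missing or empty variable.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(
        opensearch_host=env["OPENSEARCH_HOST"],
        opensearch_port=int(env["OPENSEARCH_PORT"]),
        opensearch_index=env["OPENSEARCH_INDEX"],
        minio_endpoint=env["MINIO_ENDPOINT"],
        minio_access_key=env["MINIO_ACCESS_KEY"],
        minio_secret_key=env["MINIO_SECRET_KEY"],
        minio_bucket=env["MINIO_BUCKET"],
    )


# Assumed values for a local setup: ports and credentials follow
# docker-compose.yml, bucket and index names follow this article.
EXAMPLE_ENV = {
    "OPENSEARCH_HOST": "localhost",
    "OPENSEARCH_PORT": "9200",
    "OPENSEARCH_INDEX": "pdf_documents",
    "MINIO_ENDPOINT": "localhost:9000",
    "MINIO_ACCESS_KEY": "minioadmin",
    "MINIO_SECRET_KEY": "minioadmin",
    "MINIO_BUCKET": "pdfs",
}
```

Calling load_settings() after load_dotenv() would fail fast with the full list of missing names instead of stopping at the first unset variable.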
uv run app.py
The backend server starts at http://localhost:5001. On first run it automatically creates the required MinIO bucket (pdfs) and OpenSearch index (pdf_documents).
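Note that app.py creates the index without a request body, so OpenSearch infers the field types dynamically from the first document. If explicit types are preferred, a mapping can be supplied at creation time. The sketch below is one possible mapping for the four fields the backend indexes; the type choices and the helper names build_index_body and ensure_index are assumptions, not something the project defines.

```python
def build_index_body():
    """Return an index body with an explicit mapping for the indexed fields."""
    return {
        "mappings": {
            "properties": {
                # Exact-match fields: no full-text analysis.
                "file_name": {"type": "keyword"},
                "minio_path": {"type": "keyword"},
                # Extracted PDF text, analyzed for full-text search.
                "content": {"type": "text"},
                "size_bytes": {"type": "long"},
            }
        }
    }


def ensure_index(client, index_name):
    """Create the index with the explicit mapping unless it already exists.

    `client` is expected to be an opensearchpy.OpenSearch instance.
    Returns True if the index was created, False if it was already there.
    """
    if client.indices.exists(index=index_name):
        return False
    client.indices.create(index=index_name, body=build_index_body())
    return True
```

With file_name mapped as keyword, searches on it match the whole name only; keep the dynamic mapping or use a text field if partial file-name matches matter. The type of an already mapped field cannot be changed in place, so an index created by an earlier run has to be deleted and recreated for this mapping to apply.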
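Step 3: Run the React frontend
Open a new terminal.
Go to the frontend directory:
cd frontend
Install the dependencies and start the development server:
npm install
npm start
Your browser should open http://localhost:3000 automatically, showing the PDF upload page.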
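The backend can also be exercised without the browser, which is handy when debugging. The sketch below posts a PDF to /api/upload using only the Python standard library. It relies on two things visible in app.py: the server listens on port 5001 and reads the upload from the multipart field named file. The helper names build_multipart and upload_pdf are made up for this example.

```python
import json
import os
import urllib.request
import uuid

UPLOAD_URL = "http://localhost:5001/api/upload"  # port taken from app.run() in app.py


def build_multipart(filename, data, field="file"):
    """Encode one file as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def upload_pdf(path, url=UPLOAD_URL):
    """POST the PDF at `path` to the backend and return the decoded JSON reply."""
    with open(path, "rb") as handle:
        data = handle.read()
    body, content_type = build_multipart(os.path.basename(path), data)
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    # urlopen raises urllib.error.HTTPError for the 400/500 answers of the API.
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (needs the backend from step 2 to be running):
#   print(upload_pdf("sample.pdf"))
```

On success the backend answers with status 201 and a JSON object containing message, file_name, and minio_path.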
Testing the result
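How it works
- Upload: You select a PDF file in the React frontend and click "Upload".
- API call: The frontend sends the file to the Flask backend's /api/upload endpoint.
- Processing: The Flask server does the following:
  a. Uploads the original PDF file directly to the pdfs bucket in MinIO.
  b. Extracts all text from the PDF with the PyPDF2 library.
  c. Builds a JSON document containing the file name, its path in MinIO, the extracted text, and the file size in bytes.
  d. Indexes this JSON document in OpenSearch.
- Result: You can now open OpenSearch Dashboards (http://localhost:5601), look at the pdf_documents index, and search the content of the PDFs you uploaded.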
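Besides the Dashboards UI, the indexed text can be queried directly over the OpenSearch REST API. The following sketch sends a match query against the content field and lists the matching files. It assumes the pdf_documents index name and the unauthenticated HTTP port 9200 from the setup above; build_query, summarize_hits, and search are illustrative helper names.

```python
import json
import urllib.request

# Assumed: index name from this article, port from docker-compose.yml,
# no authentication because the security plugin is disabled.
SEARCH_URL = "http://localhost:9200/pdf_documents/_search"


def build_query(text, size=5):
    """Build a full-text match query on the extracted PDF content."""
    return {
        "size": size,
        # Skip the potentially large `content` field in the response.
        "_source": ["file_name", "minio_path", "size_bytes"],
        "query": {"match": {"content": text}},
    }


def summarize_hits(result):
    """Reduce a raw _search response to (file_name, score) pairs."""
    return [
        (hit["_source"].get("file_name"), hit["_score"])
        for hit in result["hits"]["hits"]
    ]


def search(text, url=SEARCH_URL):
    """Run the query against OpenSearch and return (file_name, score) pairs."""
    payload = json.dumps(build_query(text)).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return summarize_hits(json.loads(response.read().decode("utf-8")))


# Example (needs OpenSearch running and at least one uploaded PDF):
#   for name, score in search("invoice"):
#       print(f"{score:.2f}  {name}")
```

Because the backend indexes with refresh=True, a document should be searchable as soon as the upload request returns.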