[RAG] Converting ppt to pptx on Linux and parsing text from pptx files

Published: 2024-05-16

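Background

Retrieval-augmented generation (RAG) supplies a generative model with reliable, up-to-date external knowledge, which helps across a wide range of tasks. In AI content generation, the retrieval step gives the model additional knowledge so that it can produce higher-quality output. Large language models (LLMs) have shown breakthrough language abilities, but they are still limited by hallucination and by internal knowledge that goes out of date. Retrieval-augmented LLMs address this by drawing on authoritative external knowledge bases instead of relying on internal knowledge alone, which improves generation quality.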

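The problem

Parsing techniques for pptx documents have existed for a long time, but files in the legacy ppt format cannot be parsed directly. I could not find any material on converting ppt to pptx on a Linux server, although the conversion can be done on Windows.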

Solution

Install system dependencies

apt-get install unoconv
apt-get install libreoffice
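After installation it can be useful to confirm that a LibreOffice launcher is actually on PATH before running any conversion. The following is a minimal sketch using only the standard library. The helper name `find_office_binary` is hypothetical, and the candidate names `libreoffice` and `soffice` are assumptions about how the launcher is named on a given distribution.

```python
import shutil


def find_office_binary(candidates=("libreoffice", "soffice")):
    """Return the full path of the first candidate found on PATH, or None.

    The candidate names are assumptions; adjust them for your distribution.
    """
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None
```

Calling `find_office_binary()` and checking for `None` gives an early, readable failure instead of an error from the conversion step.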

Install Python package dependencies

pip install unoconv
pip install pyuno
pip install weaviate-client
pip install "unstructured[all-docs]==0.13.3"
pip install python-dotenv
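Because the unstructured package is pinned to a specific version, a quick check of what is actually installed can save some debugging. This is a small sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


EXPECTED = "0.13.3"  # the version pinned in the pip command above
found = installed_version("unstructured")
if found is None:
    print("unstructured is not installed")
elif found != EXPECTED:
    print(f"unstructured {found} is installed; this post pins {EXPECTED}")
```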

Code demo

import glob
import os
import subprocess
import weaviate
import weaviate.classes as wvc
from dotenv import load_dotenv
from unstructured.chunking.title import chunk_by_title
from unstructured.documents.elements import CompositeElement, Table
from unstructured.partition.pptx import partition_pptx
from weaviate.config import AdditionalConfig

load_dotenv()
os.environ['UNO_PATH'] = '/usr/lib/libreoffice'
os.environ['PATH'] += ':/usr/lib/libreoffice/program'

file_path = "/your/ppt_path/case_1.ppt"

def extract_text(file_name: str):
    elements = partition_pptx(
        filename=file_name,
        multipage_sections=True,
        infer_table_structure=True,
        include_page_breaks=False,
    )

    chunks = chunk_by_title(
        elements=elements,
        multipage_sections=True,
        combine_text_under_n_chars=0,
        new_after_n_chars=None,
        max_characters=4096,
    )

    text_list = []

    for chunk in chunks:
        if isinstance(chunk, CompositeElement):
            text_list.append(chunk.text)
        elif isinstance(chunk, Table):
            # Use the table's HTML when available, otherwise its plain text
            table_html = chunk.metadata.text_as_html or chunk.text
            if text_list:
                # Attach the table to the text chunk that precedes it
                text_list[-1] = text_list[-1] + "\n" + table_html
            else:
                text_list.append(table_html)
    # Group text by title; chunks without a title go under "Untitled"
    result_dict = {"Untitled": []}
    for text in text_list:
        # The first blank line separates the title from the body
        split_text = text.split("\n\n", 1)
        if len(split_text) == 2:
            title, body = split_text
            result_dict.setdefault(title, []).append(body)
        else:
            result_dict["Untitled"].append(text)
    return result_dict


def split_chunks(text_list: list, source: str):
    chunks = []
    for text in text_list:
        for key, value in text.items():
            chunks.append({"question": key, "answer": value, "source": source})
    return chunks

def convert_ppt_to_pptx(ppt_file_path):
    # Define the command to run LibreOffice in headless mode
    command = [
        'libreoffice',
        '--headless',
        '--convert-to', 'pptx',
        # abspath keeps --outdir non-empty when a bare file name is passed
        '--outdir', os.path.dirname(os.path.abspath(ppt_file_path)),
        ppt_file_path
    ]
    
    # Run the command
    result = subprocess.run(command, capture_output=True, text=True)
    
    if result.returncode != 0:
        raise RuntimeError(f"Failed to convert '{ppt_file_path}' to PPTX.\nError: {result.stderr}")
    
    # Swap only the extension; str.replace would also rewrite '.ppt' elsewhere in the path
    return os.path.splitext(ppt_file_path)[0] + '.pptx'

pptx_file_path = convert_ppt_to_pptx(file_path)
print("convert ppt to pptx done")
contents = extract_text(pptx_file_path)
for k,v in contents.items():
    print(k,v)
    print("__"*30)
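The dictionary returned by the extraction step maps each title to a list of text blocks. Before loading it into a vector store, it may be convenient to flatten it into one record per text block. The sketch below is one possible way to do that and is not the post's own method: the helper name `to_records` and the sample data are hypothetical, and the field names simply mirror the `question` / `answer` / `source` keys used in `split_chunks`.

```python
def to_records(sections: dict, source: str) -> list:
    """Flatten {title: [text, ...]} into one record per text block.

    Hypothetical helper; titles with no text blocks produce no records.
    """
    records = []
    for title, texts in sections.items():
        for text in texts:
            records.append({"question": title, "answer": text, "source": source})
    return records


# Made-up sample data, shaped like the dictionary built by the extraction step
sample = {"Roadmap": ["Q1: design", "Q2: launch"], "Team": []}
records = to_records(sample, source="case_1.pptx")
for record in records:
    print(record)
```

Keeping one text block per record means each block can be embedded and retrieved on its own, at the cost of repeating the title in every record.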
