Teaching AI to Read PPT Charts: Taking a RAG System from 60 Points to 95, a Hands-On Walkthrough of LlamaParse + Multimodal Models

Published 2025-4-28 09:30


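Background

Frontline staff recently reported that our Q&A system gives poor answers on PPT files. On investigation, we found that the PPTs users upload mostly mix text and graphics, with many images and statistical charts, and that users' questions focus mainly on the chart data and the relationships within it.

I picked a PPT from the web more or less at random to try this out. One of its slides contains the following chart: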

[Image: the chart slide from the sample PPT]

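Question: "Did the number of big data developers increase or decrease in 2017 compared with 2016, and by exactly how many?"

The system returned:

According to the provided resource data, the number of people hired for big data development positions increased substantially in 2017. Specifically, 5,667 people were hired for big data development in 2016, rising to 41,831 in 2017. Big data development positions therefore increased by 36,164 people in 2017 compared with 2016 (41,831 - 5,667 = 36,164).

Comparing this with the chart shows that the answer is clearly wrong.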

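Problem Analysis

The LangChain UnstructuredPowerPointLoader we currently use has the following shortcomings when parsing PPT files:

Weak handling of mixed text-and-image content
Inaccurate extraction of chart data
Severe loss of semantic information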

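Characteristics of PPT Documents

Unstructured layout: no fixed format; text, images, and tables are mixed together
Visual expression: information is conveyed largely through charts rather than plain text
Natural chunking: each slide forms an independent unit of knowledge

The traditional text extraction + RAG pipeline loses the semantic information carried by visual elements, and that is the main reason the current system performs poorly.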

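Solution

Multimodal large models keep getting stronger, so we can use one (referred to below as an LVM) to tackle this kind of problem. Since answers based only on text recognized from the image are not accurate enough, why not send both the text and the original image to the LVM and let it answer?

First, we use the LVM to parse the PPT, which yields the text and the image for each slide. We embed the text and use it for recall; at retrieval time, the retrieved text and its associated image are sent together to the large model for generation.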
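
The flow described above can be sketched in a few lines of plain Python. This is an illustration only: the `SlideChunk` type, the bag-of-words similarity (a toy stand-in for a real embedding model), and the sample slides are all assumptions, not part of the original system. The point is that ranking uses the text alone, while each hit carries its slide image along for the generation step.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class SlideChunk:
    page_number: int
    text: str        # Markdown parsed from the slide, used for retrieval
    image_path: str  # rendered slide image, passed to the model at answer time


def _bag_of_words(text: str) -> Counter:
    # Toy stand-in for an embedding model: lowercase word counts.
    return Counter(re.findall(r"\w+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[SlideChunk], k: int = 2) -> list[SlideChunk]:
    """Rank slides by text similarity only; images ride along with the hits."""
    q = _bag_of_words(question)
    ranked = sorted(chunks, key=lambda c: _cosine(q, _bag_of_words(c.text)), reverse=True)
    return ranked[:k]


# Made-up slides for illustration.
chunks = [
    SlideChunk(1, "Company overview and mission", "data_images/page_1.jpg"),
    SlideChunk(2, "Big data developer hiring by year 2016 2017", "data_images/page_2.jpg"),
    SlideChunk(3, "Contact information", "data_images/page_3.jpg"),
]
hits = retrieve("How did big data developer hiring change from 2016 to 2017?", chunks, k=1)

# What would be handed to the multimodal model: text plus the matching image.
generation_input = [(h.text, h.image_path) for h in hits]
```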


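LlamaParse

To keep the demo simple, I use LlamaParse to parse the PPT into text and images. Free users get a quota of 1k pages per day, which is enough for everyday testing.

First, register an account and log in at https://cloud.llamaindex.ai, then enable the LVM feature.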

[Image: screenshot of the LVM setting]

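Next, request an API key for the project, and you are ready to test. Through the corresponding API we can fetch the text content of each slide and download each page's image to local disk.

LlamaParse is part of LlamaCloud. It is a GenAI-native document parser that can parse complex document data for any downstream LLM use case (RAG, agents).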

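import os

from llama_parse import LlamaParse

os.environ["LLAMA_CLOUD_API_KEY"] = "xxx"

# file_path is the path to the PPT file to parse; it is defined outside this snippet.
parser = LlamaParse(
    result_type="json",
    use_vendor_multimodal_model=True,   # parse each page with a multimodal model
    vendor_multimodal_model_name="gemini-2.0-flash-001",
    language="ch_sim",                  # Simplified Chinese
)
md_result = parser.get_json_result(file_path)
doc_id = md_result[0]["job_id"]
pages = md_result[0]["pages"]
# Download the images to a local directory
parser.get_images(md_result, download_path="data_images")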
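
Before indexing, it can help to flatten the parse result into one record per slide and drop pages whose Markdown came back empty, since an empty chunk has nothing to embed. The sketch below relies only on the keys the snippet above reads (`job_id`, `pages`, `page`, `md`); the `page_records` name and the sample data are made up for illustration and are not real LlamaParse output.

```python
def page_records(json_result: list[dict]) -> list[dict]:
    """Flatten the parse result into one record per slide, skipping empty pages."""
    records = []
    for job in json_result:
        for page in job.get("pages", []):
            md = (page.get("md") or "").strip()
            if not md:
                continue  # nothing to embed for this slide
            records.append({
                "job_id": job.get("job_id"),
                "page_number": page["page"],
                "md": md,
            })
    return records


# Made-up sample in the shape the snippet above reads from.
sample = [{
    "job_id": "job-123",
    "pages": [
        {"page": 1, "md": "# Title slide"},
        {"page": 2, "md": "   "},  # whitespace only, will be skipped
        {"page": 3, "md": "| Year | Count |\n|---|---|"},
    ],
}]
records = page_records(sample)
kept_pages = [r["page_number"] for r in records]
```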

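How to Index

Convert each page of the original PPT into an image, and use the multimodal model to parse each page into Markdown text
Treat each page's Markdown text as one chunk and associate it with the page image by page number (stored as base64, a cloud path, or a local path). This way, at retrieval time the image can be found from the text chunk
Embed these text chunks and store them in the vector store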

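import base64

from langchain_core.documents import Document
# Import path assumed; the original does not show which Milvus integration it uses.
from langchain_community.vectorstores import Milvus

# `embeddings` (the embedding model) and `find_image_by_page` (a helper returning
# the local path of a page's image) are defined outside this snippet.
dataset = []
docs = []
base64_map = {}  # page number -> base64-encoded page image
for page in pages:
    md = page["md"]
    page_number = page["page"]

    # Find the image for this page and base64-encode it
    local_image_path = find_image_by_page(
        "data_images", doc_id, page_number)
    with open(local_image_path, "rb") as image_file:
        image_base64 = base64.b64encode(image_file.read()).decode('utf-8')

    # Add to dataset
    dataset.append({
        'content': md,
        'image_base64': image_base64,
        'page_number': page_number
    })
    # One Document per slide; the page number links the chunk back to its image
    docs.append(
        Document(
            page_content=md,
            metadata={
                "page_number": page_number,
            })
    )
    base64_map[page_number] = image_base64

vectorstore = Milvus.from_documents(
    documents=docs,
    embedding=embeddings,
    connection_args={"host": "127.0.0.1", "port": "19530"},
    drop_old=True,  # Drop the old Milvus collection if it exists
    collection_name="collection_ppt",
)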
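
The indexing code calls `find_image_by_page`, which the article does not show. Below is a minimal sketch of what such a helper might look like. It assumes the downloaded files carry the job id in their name and end in `page_<n>.jpg` or `page_<n>.png`; that naming pattern is an assumption, so check the actual contents of `data_images` and adjust the regular expression before relying on it.

```python
import re
import tempfile
from pathlib import Path


def find_image_by_page(image_dir: str, doc_id: str, page_number: int) -> str:
    """Return the path of the image for `page_number`, or raise if none matches.

    Assumed naming: the file name contains the job id and ends in
    page_<n>.<ext>. This is a guess at the download layout, not a documented one.
    """
    pattern = re.compile(rf"page[_-]?{page_number}\.(jpe?g|png)$", re.IGNORECASE)
    for path in sorted(Path(image_dir).iterdir()):
        if doc_id in path.name and pattern.search(path.name):
            return str(path)
    raise FileNotFoundError(
        f"no image for page {page_number} of {doc_id} in {image_dir}")


# Usage with empty placeholder files; page 2 must not match page 12.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("job-123-page_1.jpg", "job-123-page_2.jpg", "job-123-page_12.jpg"):
        (Path(tmp) / name).write_bytes(b"")
    found = Path(find_image_by_page(tmp, "job-123", 2)).name
```

Raising on a missing image, rather than returning `None`, makes a mismatch between the parsed pages and the downloaded files visible at indexing time instead of at query time.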

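Retrieval and Generation

Retrieve the relevant chunks from the vector store, that is, the generated text for each PPT page described earlier
Use the metadata in these chunks to find the base64 of the corresponding page image
Assemble the text chunks into a prompt, send it to the multimodal model together with the base64 of the images found, and wait for the response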

import os

import requests

# `question` holds the user's query string.
dat = vectorstore.similarity_search(query=question, k=5)

image_base64_list = []
chunk = []
for doc in dat:
    # Use the page number in the metadata to look up the page image
    page_number = doc.metadata["page_number"]
    print(page_number)
    image_base64 = base64_map.get(page_number)
    if image_base64:
        image_base64_list.append(image_base64)
        chunk.append(doc.page_content)

openai_api_key = os.environ["OPENAI_API_KEY"]  # set this to your OpenAI API key
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {openai_api_key}"
}

# Build the message content: the text part first
messages_content = [
    {"type": "text", "text": """The following is the Markdown text and image information parsed in the slide. Markdown text has attempted to convert the relevant charts into tables. Give priority to using picture information to answer questions. Use Markdown text information only when the image cannot be understood.
 
Here is the context:
---------
{context}
---------

Question: {question}
""".format(context="\n".join(chunk), question=question)}
]

if image_base64_list:
    # Append every retrieved image
    for img_base64 in image_base64_list:
        messages_content.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{img_base64}"
            }
        })

payload = {
    "model": "gpt-4.1",
    "messages": [
        {
            "role": "user",
            "content": messages_content
        }
    ],
    "temperature": 0.1,
    "max_tokens": 1000  # adjust as needed
}

# Send the request to the OpenAI API
try:
    response = requests.post(
        "https://api.openai-proxy.com/v1/chat/completions",
        headers=headers,
        json=payload
    )
    response.raise_for_status()  # raise if the request did not succeed

    # Parse and print the result
    result = response.json()
    print("OpenAI analysis result:")
    print(result["choices"][0]["message"]["content"])

except requests.exceptions.RequestException as e:
    print(f"Request to the OpenAI API failed: {e}")
    # Compare against None: a Response with a 4xx/5xx status is falsy,
    # so a plain truthiness check would skip the error details.
    if e.response is not None:
        print(f"Error details: {e.response.text}")
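
The request above labels every image as `image/jpeg`. If the downloaded page images turn out to be PNG files, that label would not match the data. One way to keep them consistent, sketched here as an assumption rather than something the original pipeline does, is to pick the MIME type from the file's leading magic bytes. The helper names `image_data_url` and `build_content` are hypothetical, and the byte strings in the usage are headers only, not real images.

```python
import base64


def image_data_url(image_bytes: bytes) -> str:
    """Build a data URL whose MIME type matches the image's magic bytes."""
    if image_bytes.startswith(b"\x89PNG\r\n\x1a\n"):
        mime = "image/png"
    elif image_bytes.startswith(b"\xff\xd8\xff"):
        mime = "image/jpeg"
    else:
        raise ValueError("unsupported image format (expected PNG or JPEG)")
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime};base64,{encoded}"


def build_content(prompt: str, images: list[bytes]) -> list[dict]:
    """Text part first, then one image_url part per retrieved slide."""
    parts = [{"type": "text", "text": prompt}]
    for raw in images:
        parts.append({"type": "image_url", "image_url": {"url": image_data_url(raw)}})
    return parts


fake_png = b"\x89PNG\r\n\x1a\n" + b"\x00" * 4  # signature bytes only
fake_jpg = b"\xff\xd8\xff\xe0" + b"\x00" * 4   # signature bytes only
content = build_content("Question: ...", [fake_png, fake_jpg])
```

With this approach the index would store raw image bytes (or decode the stored base64) instead of assuming a single format for every page.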

When we run the query:

"Did the number of big data developers increase or decrease in 2017 compared with 2016, and by exactly how many?"

the returned result is now correct:

[Image: screenshot of the returned answer]