Implementing Automatic Document Translation and Storage Using Artificial Intelligence

Roger · June 3

Background and Related Technologies

- Briefly introduce OpenAI and Llama-Parse.
- Explain how to use Google Colab and Google Drive to read and save files.
Project Setup and Environment Configuration
- Detail how to set up a project on Google Colab.
- Install the necessary Python libraries.
- Mount Google Drive and configure API keys.
PDF File Loading and Parsing
- Use Llama-Parse to load and parse PDF files.
- Convert the parsed results into Markdown text.
Text Translation
- Use OpenAI's ChatGPT model for text translation.
- Explain in detail how to write the translation function.
Saving Translation Results
- Save the translated text as a PDF file.
- Introduce how to use the python-docx and fpdf libraries to achieve this.
Project Optimization and User Experience Improvement
- Simplify user input using Google Colab forms.
- Provide code optimization suggestions and considerations.
Conclusion
- Summarize the implementation process and results of the project.
- Look ahead to possible improvements and extensions.

Code example used in the video (tested and working!): https://colab.research.google.com/github/ywchiu/largitdata/blob/master/code/Course_242.ipynb

The steps to set up a new project on Google Colab and translate a file are as follows:

Step 1: Create a new Colab notebook

Open Google Colab.
Click the File menu in the top-left corner and select New notebook.

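To give the notebook a friendlier interface, you can use Colab forms to let users enter the required parameters. The optimized code below includes those form inputs along with every value that has to be adjusted by hand.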

Step 1: Install the required libraries

!pip install llama-index llama-index-core llama-index-embeddings-openai llama-parse openai python-docx fpdf google-auth google-auth-oauthlib google-auth-httplib2 google-api-python-client

Step 2: Access Google Drive

import os
from google.colab import drive

# Set the working directory on Google Drive
THESIS_LOC = '/content/drive/MyDrive/PDFtranslate/'  # @param {type:"string"}
drive.mount('/content/drive')
os.chdir(THESIS_LOC)
os.listdir()
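
Since `os.listdir()` returns everything in the folder, it can help to narrow the listing to PDF files before choosing one to translate. The following is a minimal sketch; `list_pdfs` is a hypothetical helper name, not part of the original notebook.

```python
import os

def list_pdfs(folder):
    """Return the sorted names of PDF files located directly inside `folder`."""
    return sorted(
        name
        for name in os.listdir(folder)
        # Match the extension case-insensitively and skip directories.
        if name.lower().endswith(".pdf")
        and os.path.isfile(os.path.join(folder, name))
    )
```

In the notebook this could be called as `list_pdfs(THESIS_LOC)` after the Drive has been mounted.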

Step 3: Set the Llama-Parse and OpenAI API keys

import nest_asyncio
nest_asyncio.apply()

LLAMA_PARSER_KEY = 'your-llama-parse-key'  # @param {type:"string"}
OPENAI_KEY = 'your-openai-key'  # @param {type:"string"}
os.environ["LLAMA_CLOUD_API_KEY"] = LLAMA_PARSER_KEY
os.environ["OPENAI_API_KEY"] = OPENAI_KEY
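
Values typed into a Colab form are stored in the notebook as plain text, so sharing the notebook also shares the keys. One possible alternative, sketched below, is to prompt for a key only when the environment variable is not already set. `load_key` is a hypothetical helper introduced here for illustration; the `prompt` argument exists only so the function can be exercised without a terminal.

```python
import os
from getpass import getpass

def load_key(env_name, prompt=getpass):
    """Return the key from the environment, prompting only if it is missing."""
    key = os.environ.get(env_name, "").strip()
    if not key:
        # getpass hides the typed value, so it does not end up in cell output.
        key = prompt(f"Enter {env_name}: ").strip()
        if not key:
            raise ValueError(f"{env_name} is required")
        os.environ[env_name] = key
    return key
```

Calling `load_key("LLAMA_CLOUD_API_KEY")` and `load_key("OPENAI_API_KEY")` would then replace the two form fields.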

Step 4: Set the LLM model used by Llama-Parse

from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

LLM_MODEL = "gpt-3.5-turbo-0125"  # @param {type:"string"}
llm = OpenAI(model=LLM_MODEL)
Settings.llm = llm

Step 5: Load the PDF

from llama_parse import LlamaParse
from IPython.display import display, Markdown, Latex

PDF_FILE = "Case-23_cr_00118_Transcript-31-05-2024.pdf"  # @param {type:"string"}
parser = LlamaParse(result_type="markdown")

md_documents = parser.load_data(
    file_path=PDF_FILE
)

print(md_documents[0].text)
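
The `print` call above shows only the first parsed document. If the parser returns several documents, a small helper can join them and keep a Markdown copy on disk for later inspection. This is a sketch under the assumption that each item exposes a `.text` attribute, as `md_documents[0].text` does; `join_documents` and `save_markdown` are hypothetical names.

```python
def join_documents(documents, separator="\n\n"):
    """Concatenate the text of every parsed document, skipping blank ones."""
    return separator.join(
        d.text for d in documents if d.text and d.text.strip()
    )

def save_markdown(documents, path):
    """Write the joined Markdown to `path` and return the number of characters written."""
    text = join_documents(documents)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return len(text)
```

For example, `save_markdown(md_documents, "parsed.md")` would store the full parse result next to the source PDF.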

Step 6: Parse the Markdown file with GPT

from llama_index.core.node_parser import MarkdownElementNodeParser

node_parser = MarkdownElementNodeParser(
    llm=OpenAI(model=LLM_MODEL), num_workers=8
)
nodes = node_parser.get_nodes_from_documents(md_documents)
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)
display(Markdown('\n\n'.join([t.text for t in nodes])))
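
Some nodes may be too long to translate comfortably in a single request. A rough way to guard against that is to split long text on paragraph boundaries before it is sent to the model. The sketch below uses character count as a crude stand-in for tokens; the name `split_into_chunks` and the 3000-character default are assumptions, not values from the original notebook.

```python
def split_into_chunks(text, max_chars=3000):
    """Split text on blank lines into chunks of at most `max_chars` characters.

    A single paragraph longer than `max_chars` is cut at the limit.
    """
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    chunks, current = [], ""
    for para in text.split("\n\n"):
        # Hard-split any paragraph that cannot fit in one chunk by itself.
        while len(para) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be translated separately and the results joined with blank lines.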

Step 7: Translate with ChatGPT

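from openai import OpenAI as OpenAIClient

# openai>=1.0 removed openai.ChatCompletion; requests now go through a client object.
# The alias avoids shadowing the llama_index OpenAI class imported in Step 4.
client = OpenAIClient(api_key=os.getenv('OPENAI_API_KEY'))

TRANSLATE_MODEL = "gpt-3.5-turbo-0125"
# The prompt asks the model to act as a translation assistant and to render
# the following court document in Simplified Chinese.
SYSTEM_PROMPT = '请你成为文章翻译的小帮手，请协助翻译以下法庭文件，以简体中文输出'

def translate_text(text):
    """Translate one block of text with the chat completions API."""
    completion = client.chat.completions.create(
        model=TRANSLATE_MODEL,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ]
    )
    return completion.choices[0].message.content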

translated_text = []
for node in nodes:
    translated_text.append({
        'original': node.text,
        'translated': translate_text(node.text)
    })

# Display the translation results
display(Markdown('\n\n'.join([f"**Original:**\n{t['original']}\n\n**Translated:**\n{t['translated']}" for t in translated_text])))
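
The loop above makes one API request per node, so a single rate-limit or network error stops the whole run. One common mitigation is to retry failed calls with exponential backoff. The sketch below is generic and does not depend on the OpenAI library; `with_retries` and its default values are assumptions made for illustration.

```python
import time

def with_retries(func, *args, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(*args), retrying on any exception with exponential backoff.

    Waits base_delay, then 2x, then 4x seconds between attempts and
    re-raises the last error once `retries` retries have been used up.
    """
    for attempt in range(retries + 1):
        try:
            return func(*args)
        except Exception:
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)
```

Inside the loop, `translate_text(node.text)` could then be written as `with_retries(translate_text, node.text)`. Catching every exception is a simplification; a real run would probably retry only on transient errors.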

Step 8: Save the translation as bilingual (English and Chinese) Word and PDF files

import os
from docx import Document
from fpdf import FPDF

DONE_LOC = os.path.join(THESIS_LOC, 'done')

if not os.path.exists(DONE_LOC):
    os.makedirs(DONE_LOC)

# Save as a bilingual Word file
doc = Document()
for t in translated_text:
    doc.add_paragraph("Original:\n" + t['original'])
    doc.add_paragraph("\nTranslated:\n" + t['translated'])
    doc.add_paragraph("\n" + "-"*40 + "\n")
doc_path = os.path.join(THESIS_LOC, PDF_FILE.replace('.pdf', '_translated.docx'))
doc.save(doc_path)

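# Save as a bilingual PDF file
# The built-in Arial font cannot encode Chinese text, so a Unicode TrueType font is registered for the body.
# Assumption: FONT_PATH is a hypothetical location; point it at a .ttf file with Chinese glyphs that you have uploaded.
FONT_PATH = '/content/drive/MyDrive/PDFtranslate/NotoSansSC-Regular.ttf'  # @param {type:"string"}

class PDF(FPDF):
    def header(self):
        self.set_font('Arial', 'B', 12)
        self.cell(0, 10, 'Translated Document', 0, 1, 'C')

    def footer(self):
        self.set_y(-15)
        self.set_font('Arial', 'I', 8)
        self.cell(0, 10, f'Page {self.page_no()}', 0, 0, 'C')

pdf = PDF()
pdf.add_font('CJK', '', FONT_PATH, uni=True)
pdf.add_page()
pdf.set_auto_page_break(auto=True, margin=15)
pdf.set_font('CJK', '', 12)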

for t in translated_text:
    pdf.multi_cell(0, 10, "Original:\n" + t['original'])
    pdf.multi_cell(0, 10, "\nTranslated:\n" + t['translated'])
    pdf.multi_cell(0, 10, "\n" + "-"*40 + "\n")

pdf_path = os.path.join(THESIS_LOC, PDF_FILE.replace('.pdf', '_translated.pdf'))
pdf.output(pdf_path)

# Move the original PDF file into the done directory
os.rename(os.path.join(THESIS_LOC, PDF_FILE), os.path.join(DONE_LOC, PDF_FILE))

# List the files in the storage directory
os.listdir(THESIS_LOC)
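
One detail worth noting: `PDF_FILE.replace('.pdf', ...)` changes every occurrence of `.pdf` in the name and leaves the name untouched when the extension is written as `.PDF`, in which case the output path would be the same as the source file. A more defensive way to derive the output names is to split off the extension explicitly. The sketch below assumes a hypothetical helper `build_output_paths`; the `_translated` suffix follows the original code.

```python
import os

def build_output_paths(pdf_name, out_dir):
    """Return (docx_path, pdf_path) for the translated copies of `pdf_name`."""
    # splitext removes only the final extension, whatever its letter case.
    stem, _ext = os.path.splitext(os.path.basename(pdf_name))
    return (
        os.path.join(out_dir, f"{stem}_translated.docx"),
        os.path.join(out_dir, f"{stem}_translated.pdf"),
    )
```

With this helper, `doc_path, pdf_path = build_output_paths(PDF_FILE, THESIS_LOC)` would stand in for the two `replace` calls.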