pdf转txt的python刮削器对比

对比python的pdf刮削器:pypdf2、pdfplumber和pymupdf。三者都能提取pdf文件的信息并转换成txt文件,可用于ai分析和后续脚本处理。

准备条件

首先安装三个pdf刮削器。

pip install PyPDF2 pdfplumber PyMuPDF

然后准备一个pdf论文[2],和一个简短的提取python脚本如下:

import PyPDF2                                                                                                                                                                                                      
import pdfplumber                                                                                                                                                                                                  
import fitz  # PyMuPDF                                                                                   
import argparse                                     
                                                    
def extract_with_pypdf2(pdf_path):                  
    text = ""
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)                                                                  
        for page in reader.pages:                   
            text += page.extract_text() + "\n"
    return text                             

def extract_with_pdfplumber(pdf_path):                                                                   
    text = ""                                     
    with pdfplumber.open(pdf_path) as pdf:          
        for page in pdf.pages:
            text += page.extract_text() + "\n"
    return text                                     
                                                    
def extract_with_pymupdf(pdf_path):
    text = ""                          
    doc = fitz.open(pdf_path)
    for page in doc:
        text += page.get_text() + "\n"
    doc.close()
    return text

def extract_pdf_text(pdf_path, method='pypdf2'):
    methods = {
        'pypdf2': extract_with_pypdf2,
        'pdfplumber': extract_with_pdfplumber,
        'pymupdf': extract_with_pymupdf
    }
    if method not in methods:
        raise ValueError(f"Method must be one of: {', '.join(methods.keys())}")
    return methods[method](pdf_path)

def save_text_to_file(text, output_path):
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(text)

def main():
    # Set up argument parser
    parser = argparse.ArgumentParser(description='Extract text from PDF files')
    parser.add_argument('-i', '--input', required=True, help='Input PDF file path')
    parser.add_argument('-o', '--output', default='output.txt', help='Output text file path')
    parser.add_argument('-m', '--method',
                       choices=['pypdf2', 'pdfplumber', 'pymupdf'],
                       default='pymupdf',
                       help='Method to use for extraction (default: pymupdf)')

    args = parser.parse_args()

    try:
        text = extract_pdf_text(args.input, args.method)

        save_text_to_file(text, args.output)

        print(f"Successfully extracted text from {args.input} to {args.output} using {args.method}")
        print(f"Extracted {len(text)} characters")

    except Exception as e:
        print(f"Error: {str(e)}")

if __name__ == "__main__":
    main()

使用下列命令来生成对应pypdf2、pdfplumber、pymupdf三种方式产生的文本。

python pdf-text-extraction.py -i xxx.pdf -m pypdf2 -o pypdf.txt
python pdf-text-extraction.py -i xxx.pdf -m pdfplumber -o pdfplumber.txt
python pdf-text-extraction.py -i xxx.pdf -m pymupdf -o pymupdf.txt

对比结果

选取其中一段,包括公式和讲解的部分。

目前只有pymupdf正确识别出来了双列的论文,并且没有混淆和连字,可以方便给AI进行进一步分析,建议选用该方法。

参考文献
  1. S. Saarelma, J. Connor and B. , Density pedestal prediction model for tokamak plasmas, IOP Publishing, Nuclear Fusion vol. 64, 2024. no. 7, 076025.
  2. S. Saarelma, J. Connor and B. , Density pedestal prediction model for tokamak plasmas, IOP Publishing, Nuclear Fusion vol. 64, 2024. no. 7, 076025.
文章链接:https://sunwaybits.tech/best-python-pdf-extractors-pypdf2-vs-pdfplumber-vs-pymupdf/
文章标题:pdf转txt的python刮削器对比
文章作者:MyronH
暂无评论

发送评论 编辑评论


				
|´・ω・)ノ
ヾ(≧∇≦*)ゝ
(☆ω☆)
(╯‵□′)╯︵┴─┴
 ̄﹃ ̄
(/ω\)
∠( ᐛ 」∠)_
(๑•̀ㅁ•́ฅ)
→_→
୧(๑•̀⌄•́๑)૭
٩(ˊᗜˋ*)و
(ノ°ο°)ノ
(´இ皿இ`)
⌇●﹏●⌇
(ฅ´ω`ฅ)
(╯°A°)╯︵○○○
φ( ̄∇ ̄o)
ヾ(´・ ・`。)ノ"
( ง ᵒ̌皿ᵒ̌)ง⁼³₌₃
(ó﹏ò。)
Σ(っ °Д °;)っ
( ,,´・ω・)ノ"(´っω・`。)
╮(╯▽╰)╭
o(*////▽////*)q
>﹏<
( ๑´•ω•) "(ㆆᴗㆆ)
😂
😀
😅
😊
🙂
🙃
😌
😍
😘
😜
😝
😏
😒
🙄
😳
😡
😔
😫
😱
😭
💩
👻
🙌
🖕
👍
👫
👬
👭
🌚
🌝
🙈
💊
😶
🙏
🍦
🍉
😣
Source: github.com/k4yt3x/flowerhd
颜文字
Emoji
小恐龙
花!
上一篇
下一篇