数据小站 (Data Station)
A path to growing in data science

Reading PDF content with Python 3

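pdfminer.six, a PDF parsing library

To parse PDF documents under Python 3, use the pdfminer.six library, installed with pip install pdfminer.six.

Two libraries currently exist: pdfminer and pdfminer.six. According to the pdfminer GitHub repository, pdfminer is no longer being updated. It still works, but pdfminer.six is the recommended choice.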

Using pdfminer.six

Once pdfminer.six is installed, it is imported under the package name pdfminer. Text extraction is done mainly with the extract_text function in the high_level module, which also provides extract_text_to_fp and extract_pages.

Importing the module:
from pdfminer import high_level

Example of reading PDF content:

from pdfminer import high_level

if __name__ == "__main__":
    file = "2019.5p.pdf"
    with open(file, 'rb') as reader:
        # extract_text returns the text as a str; its second positional
        # parameter is the password, so no StringIO buffer is passed here
        dt = high_level.extract_text(reader)
        print(dt)

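Download link for the PDF document: http://www.miit.gov.cn/n1146285/n1146352/n3054355/n3057709/n3057716/c7641595/part/7641601.pdf

The content of this PDF consists mainly of two-dimensional tables.

Figure: layout of the PDF document

Printing the parsed PDF content directly with print:

Figure: the PDF content as printed by print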

The printed output shows that the PDF content has been parsed, but the result is not in the most convenient format to work with.
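The other two functions named earlier, extract_text_to_fp and extract_pages, can be used along the following lines. This is a minimal sketch rather than the author's code: pdf_to_string, iter_text and pdf_text_blocks are hypothetical helper names, and iter_text looks for a get_text method by duck typing instead of checking against pdfminer's layout classes.

```python
from io import StringIO


def pdf_to_string(path, laparams=None):
    """Write the text of a PDF into an in-memory buffer and return it."""
    # Imported inside the function so this file loads without pdfminer.six.
    from pdfminer.high_level import extract_text_to_fp

    buffer = StringIO()
    with open(path, "rb") as pdf_file:
        # Passing an LAParams object here turns on layout analysis.
        extract_text_to_fp(pdf_file, buffer, laparams=laparams)
    return buffer.getvalue()


def iter_text(pages):
    """Yield the text of every layout element that exposes get_text()."""
    for page in pages:
        for element in page:
            get_text = getattr(element, "get_text", None)
            if callable(get_text):
                yield get_text()


def pdf_text_blocks(path):
    """Return the text of each text-bearing element, page by page."""
    from pdfminer.high_level import extract_pages

    return list(iter_text(extract_pages(path)))
```

With pdfminer.six installed, print(pdf_to_string("2019.5p.pdf")) prints the whole document, while pdf_text_blocks("2019.5p.pdf") returns one string per text element, which is easier to filter.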

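Tuning the pdfminer.layout parameters

The LAParams object in pdfminer.layout provides a set of layout-analysis parameters, used mainly to control how the spacing between characters is interpreted when the text is parsed. Building on the example above, we add an LAParams object with char_margin set to 100 and look at the result again: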
from pdfminer import high_level
from pdfminer.layout import LAParams

if __name__ == "__main__":
    # a large char_margin keeps widely spaced characters on the same line
    laparams = LAParams(char_margin=100)
    file = "2019.5p.pdf"
    with open(file, 'rb') as reader:
        # pass the layout parameters by keyword; the text is returned as a str
        dt = high_level.extract_text(reader, laparams=laparams)
        print(dt)
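Figure: the parsed output after tuning LAParams

The PDF content is now parsed line by line, a clear improvement over the result obtained with the default parameters.

pdfminer.layout thus provides a way to adjust the layout analysis to the format of the PDF at hand.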
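Once the text comes out one table row per line, it can be turned into rows of cells with plain string handling. The sketch below is an illustration only: text_to_rows is a hypothetical helper, the sample string stands in for the value returned by extract_text, and it assumes that no cell contains internal whitespace.

```python
def text_to_rows(text):
    """Split extracted text into rows of whitespace-separated cells."""
    rows = []
    for line in text.splitlines():
        cells = line.split()
        if cells:  # skip blank lines between text boxes
            rows.append(cells)
    return rows


# Hypothetical sample standing in for the string returned by extract_text.
sample = "No. Company Model\n\n1 Foo X100\n2 Bar Y200\n"
for row in text_to_rows(sample):
    print(row)
```

If cells can contain spaces, this split is too coarse and the positions of the text elements would have to be used instead.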

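Parameters of LAParams

The LAParams source code documents its parameters: line_overlap, char_margin, line_margin, word_margin, boxes_flow and others control how the page layout of a PDF is analyzed.
Any parameter that is not set explicitly keeps its default value.
char_margin is described as: If two characters are closer together than this margin they are considered part of the same line. The margin is specified relative to the width of the character.
In other words, the value of char_margin controls how far apart two characters can be and still belong to the same line; characters farther apart than that are placed on separate lines. With char_margin=100, the widely spaced cells of a table row still fall within the margin, which is why the output above came out row by row.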
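The char_margin rule can be tried out on a toy model. The sketch below is not pdfminer's implementation: group_by_char_margin is a hypothetical function, characters are reduced to (text, x0, x1) tuples on a single baseline, and comparing the gap against the wider of the two characters is an assumption made for the illustration.

```python
def group_by_char_margin(chars, char_margin):
    """Split characters on one baseline into runs, char_margin style.

    Each character is a (text, x0, x1) tuple with its left and right edge.
    """
    runs = []
    prev = None
    for char in sorted(chars, key=lambda c: c[1]):
        text, x0, x1 = char
        if prev is not None:
            gap = x0 - prev[2]
            widest = max(x1 - x0, prev[2] - prev[1])
            if gap < char_margin * widest:
                # Close enough: the character joins the current run.
                runs[-1] += text
                prev = char
                continue
        # First character, or too far away: start a new run.
        runs.append(text)
        prev = char
    return runs


# Two table cells on the same row, 50 units apart; every character is 10 wide.
row = [("A", 0, 10), ("B", 10, 20), ("C", 70, 80), ("D", 80, 90)]
print(group_by_char_margin(row, char_margin=2.0))    # ['AB', 'CD']
print(group_by_char_margin(row, char_margin=100.0))  # ['ABCD']
```

With the default of 2.0 the two cells end up in separate runs, while 100 keeps the whole row together.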
The remaining parameters are described in the docstring below.
class LAParams:
    """Parameters for layout analysis
    :param line_overlap: If two characters have more overlap than this they
        are considered to be on the same line. The overlap is specified
        relative to the minimum height of both characters.
    :param char_margin: If two characters are closer together than this
        margin they are considered part of the same line. The margin is
        specified relative to the width of the character.
    :param word_margin: If two characters on the same line are further apart
        than this margin then they are considered to be two separate words, and
        an intermediate space will be added for readability. The margin is
        specified relative to the width of the character.
    :param line_margin: If two lines are are close together they are
        considered to be part of the same paragraph. The margin is
        specified relative to the height of a line.
    :param boxes_flow: Specifies how much a horizontal and vertical position
        of a text matters when determining the order of text boxes. The value
        should be within the range of -1.0 (only horizontal position
        matters) to +1.0 (only vertical position matters). You can also pass
        `None` to disable advanced layout analysis, and instead return text
        based on the position of the bottom left corner of the text box.
    :param detect_vertical: If vertical text should be considered during
        layout analysis
    :param all_texts: If layout analysis should be performed on text in
        figures.
    """

    def __init__(self,
                 line_overlap=0.5,
                 char_margin=2.0,
                 line_margin=0.5,
                 word_margin=0.1,
                 boxes_flow=0.5,
                 detect_vertical=False,
                 all_texts=False):
.....
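The word_margin entry in the docstring above says that a space is inserted between two characters on the same line when they are further apart than the margin. A toy version of that rule, under stated assumptions: join_with_word_margin is a hypothetical function, characters are (text, x0, x1) tuples, and the gap is measured against the width of the character that follows it, which simplifies whatever pdfminer does internally.

```python
def join_with_word_margin(chars, word_margin):
    """Join characters of one line, adding a space across wide gaps."""
    pieces = []
    prev_x1 = None
    for text, x0, x1 in sorted(chars, key=lambda c: c[1]):
        width = x1 - x0
        if prev_x1 is not None and x0 - prev_x1 > word_margin * width:
            # The gap exceeds the margin: treat it as a word boundary.
            pieces.append(" ")
        pieces.append(text)
        prev_x1 = x1
    return "".join(pieces)


# One line with a 50-unit gap in the middle; every character is 10 wide.
line = [("A", 0, 10), ("B", 10, 20), ("C", 70, 80), ("D", 80, 90)]
print(join_with_word_margin(line, word_margin=0.1))   # AB CD
print(join_with_word_margin(line, word_margin=10.0))  # ABCD
```

This is why cells merged into one line by a large char_margin can still come out separated by spaces when word_margin stays at its small default.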