pdfminer.six: a PDF document parsing library
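Under Python 3, PDF documents can be parsed with the pdfminer.six library, which is installed with pip install pdfminer.six.
Two libraries currently exist: pdfminer and pdfminer.six. According to its GitHub repository, pdfminer is no longer being updated. It still works, but pdfminer.six is the recommended choice.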
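Both distributions install a top-level package named pdfminer, so a successful import pdfminer does not by itself show which of the two is present. The following is a minimal sketch for checking this from Python, assuming Python 3.8 or later (for importlib.metadata); the helper name installed_version is made up for this illustration and is not part of either library.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Both names are looked up as distribution names (what pip installs),
    # not as import names.
    for name in ("pdfminer.six", "pdfminer"):
        print(name, "->", installed_version(name) or "not installed")
```

If only pdfminer shows a version, installing pdfminer.six in a clean environment is probably the safer way to follow the rest of this post.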

Using pdfminer.six
Once pdfminer.six is installed, it is imported under the package name pdfminer. Text extraction is done mainly with the extract_text function of the high_level module; the same module also provides extract_text_to_fp and extract_pages.
Importing the module: from pdfminer import high_level
Example of reading the content of a PDF:
    from pdfminer import high_level

    if __name__ == "__main__":
        file = "2019.5p.pdf"
        # extract_text returns the extracted text as a str. It takes no output
        # buffer: its second positional parameter is the PDF password.
        with open(file, "rb") as reader:
            dt = high_level.extract_text(reader)
        print(dt)
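For comparison, extract_text returns the text as a string, whereas extract_text_to_fp writes it into a file-like object such as a StringIO buffer. The sketch below shows that second style under a few assumptions: the wrapper name pdf_to_text is made up for this illustration, the file name is the one used in this post, and the pdfminer imports sit inside the function only so that loading the file does not itself require pdfminer.six.

```python
from io import StringIO


def pdf_to_text(path, char_margin=2.0):
    """Extract the text of the PDF at `path` through an in-memory buffer."""
    # Imported here so that merely loading this file does not require
    # pdfminer.six to be installed.
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    buffer = StringIO()
    with open(path, "rb") as reader:
        # An LAParams object is passed explicitly; with the default
        # laparams=None the text may not be laid out correctly.
        extract_text_to_fp(reader, buffer, laparams=LAParams(char_margin=char_margin))
    return buffer.getvalue()


if __name__ == "__main__":
    # "2019.5p.pdf" is the sample file name used in this post.
    print(pdf_to_text("2019.5p.pdf"))
```

Writing into a buffer is mostly useful when the output should go straight to a file or stream; for a quick look at the text, extract_text is shorter.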
PDF document download URL: http://www.miit.gov.cn/n1146285/n1146352/n3054355/n3057709/n3057716/c7641595/part/7641601.pdf
The content of this PDF consists mainly of tables, that is, two-dimensional data.

Printing the parsed PDF content directly with print:

The printed output shows that the PDF content has been parsed, but the result is not in the most convenient format to work with.
Tuning the pdfminer.layout parameters
The LAParams object in pdfminer.layout provides a set of layout-analysis settings; it mainly controls how the spacing between pieces of text is interpreted during parsing. Building on the parsing step above, we add an LAParams object with its char_margin parameter set to 100 and look at the result again:
    from pdfminer import high_level
    from pdfminer.layout import LAParams

    if __name__ == "__main__":
        # A large char_margin lets widely spaced characters join the same line.
        laparams = LAParams(char_margin=100)
        file = "2019.5p.pdf"
        with open(file, "rb") as reader:
            dt = high_level.extract_text(reader, laparams=laparams)
        print(dt)

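The PDF content is now parsed line by line, a clear improvement over the result without LAParams.
pdfminer.layout provides a way to configure the layout analysis as needed, according to the format of the PDF.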
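To build some intuition for why a large char_margin has this effect: the LAParams docstring describes char_margin as a margin, relative to character width, within which two characters are considered part of the same line. The sketch below is a simplified one-dimensional model of that rule, not pdfminer's actual implementation. The function name, the (text, x0, x1) tuples and the sample coordinates are all assumptions made for illustration.

```python
def group_into_lines(chars, char_margin=2.0):
    """Group characters into runs using a char_margin-style threshold.

    `chars` is a list of (text, x0, x1) tuples sorted by x0, all assumed to
    sit on the same baseline. A horizontal gap of at least
    char_margin * (the wider of the two characters) starts a new run.
    """
    runs = []
    current = []
    prev = None
    for text, x0, x1 in chars:
        if prev is not None:
            gap = x0 - prev[2]
            width = max(prev[2] - prev[1], x1 - x0)
            if gap >= char_margin * width:
                runs.append("".join(current))
                current = []
        current.append(text)
        prev = (text, x0, x1)
    if current:
        runs.append("".join(current))
    return runs


if __name__ == "__main__":
    # Hypothetical row: two adjacent characters, then one in a distant column.
    row = [("A", 0, 10), ("B", 12, 22), ("C", 100, 110)]
    print(group_into_lines(row, char_margin=2.0))   # the distant cell splits off
    print(group_into_lines(row, char_margin=100))   # everything stays together
```

In a table, cells of the same row are often separated by gaps many character widths wide, so a small margin breaks each row into separate pieces while a large one keeps the row together.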
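Once the text comes out with one table row per line, a natural next step is to turn it into a list of rows. The following is only a sketch: it assumes that cells are separated by whitespace and contain no whitespace themselves, which may not hold for the PDF used here. The function name text_to_rows and the sample string are invented for illustration.

```python
def text_to_rows(text):
    """Split each non-empty line of `text` into cells on runs of whitespace."""
    return [line.split() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    # Hypothetical extracted text, not taken from the real PDF.
    sample = "item   2018   2019\n\nalpha  1      2\n"
    for row in text_to_rows(sample):
        print(row)
```

Cells that contain spaces would need a different rule, for example splitting only on runs of two or more spaces.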
The LAParams parameters explained
Tracing into the source of LAParams shows the documentation of its parameters. LAParams provides line_overlap, char_margin, line_margin, word_margin, boxes_flow and other parameters for configuring how the page layout of a PDF is analysed. Any parameter that is not set explicitly keeps its default value.
The char_margin parameter is described as: If two characters are closer together than this margin they are considered part of the same line. The margin is specified relative to the width of the character.
The value of char_margin therefore controls how far apart two characters may be and still count as part of the same line; characters further apart than that are treated as belonging to a separate line.
The other parameters are described in the docstring of the class:
    class LAParams:
        """Parameters for layout analysis

        :param line_overlap: If two characters have more overlap than this they
            are considered to be on the same line. The overlap is specified
            relative to the minimum height of both characters.
        :param char_margin: If two characters are closer together than this
            margin they are considered part of the same line. The margin is
            specified relative to the width of the character.
        :param word_margin: If two characters on the same line are further apart
            than this margin then they are considered to be two separate words,
            and an intermediate space will be added for readability. The margin
            is specified relative to the width of the character.
        :param line_margin: If two lines are close together they are
            considered to be part of the same paragraph. The margin is
            specified relative to the height of a line.
        :param boxes_flow: Specifies how much a horizontal and vertical position
            of a text matters when determining the order of text boxes. The value
            should be within the range of -1.0 (only horizontal position
            matters) to +1.0 (only vertical position matters). You can also pass
            `None` to disable advanced layout analysis, and instead return text
            based on the position of the bottom left corner of the text box.
        :param detect_vertical: If vertical text should be considered during
            layout analysis
        :param all_texts: If layout analysis should be performed on text in
            figures.
        """

        def __init__(self,
                     line_overlap=0.5,
                     char_margin=2.0,
                     line_margin=0.5,
                     word_margin=0.1,
                     boxes_flow=0.5,
                     detect_vertical=False,
                     all_texts=False):
            ...  # remainder of the source omitted in the original listing