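fuzzywuzzy provides a simple interface for fuzzy string matching. It scores the similarity between strings using edit distance.
The main strengths of fuzzywuzzy are that it is lightweight and easy to use. Given two strings, it computes the minimum number of edits needed to turn one into the other and uses that count to measure how similar they are. This makes it a good fit for tasks such as spelling correction. It is largely unsuitable for semantic analysis of Chinese text or for recognizing what a piece of text is about.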
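To make "minimum number of edits" concrete, here is a minimal pure-Python sketch of the classic Levenshtein distance. The function names `levenshtein` and `similarity` are made up for this illustration, and the 0-100 normalization in `similarity` is an assumption chosen for simplicity, not the formula fuzzywuzzy itself uses.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> int:
    """Hypothetical 0-100 score: edit distance normalized by the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (1 - levenshtein(a, b) / longest))


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))  # 3
    print(levenshtein("mach", "much"))       # 1
    print(similarity("mach", "much"))        # 75
```

A single wrong letter in a short word costs one edit, which is why this kind of score works well for spelling correction and poorly for judging meaning.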
pip install fuzzywuzzy
pip install python-Levenshtein
from fuzzywuzzy import fuzz , process
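The fuzz module provides several commonly used methods for comparing the similarity of two strings:
ratio() compares the two whole strings in order.
partial_ratio() matches the shorter string against the best-matching part of the longer one.
token_sort_ratio() ignores word order.
token_set_ratio() ignores duplicate words.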
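The differences between these four methods are easier to see in code. The following is a rough sketch built only on the standard library's `difflib.SequenceMatcher`. It illustrates the ideas (whole-string comparison, sliding a window over the longer string, sorting tokens, comparing token sets) and is not fuzzywuzzy's actual implementation, so its scores may differ from the library's. In particular, the brute-force window in `partial_ratio` is a simplification assumed for this example.

```python
from difflib import SequenceMatcher


def ratio(s1: str, s2: str) -> int:
    """Whole-string similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, s1, s2).ratio())


def partial_ratio(s1: str, s2: str) -> int:
    """Best score of the shorter string against any same-length
    window of the longer string (brute-force simplification)."""
    shorter, longer = sorted((s1, s2), key=len)
    if not shorter:
        return 0
    window = len(shorter)
    return max(
        ratio(shorter, longer[i:i + window])
        for i in range(len(longer) - window + 1)
    )


def token_sort_ratio(s1: str, s2: str) -> int:
    """Sort the words first, so word order no longer matters."""
    a = " ".join(sorted(s1.lower().split()))
    b = " ".join(sorted(s2.lower().split()))
    return ratio(a, b)


def token_set_ratio(s1: str, s2: str) -> int:
    """Compare sets of words, so repeated words no longer matter."""
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    common = " ".join(sorted(t1 & t2))
    combined1 = (common + " " + " ".join(sorted(t1 - t2))).strip()
    combined2 = (common + " " + " ".join(sorted(t2 - t1))).strip()
    return max(
        ratio(common, combined1),
        ratio(common, combined2),
        ratio(combined1, combined2),
    )


if __name__ == "__main__":
    print(partial_ratio("you", "thank you very much"))                    # 100
    print(token_sort_ratio("new york mets", "mets york new"))             # 100
    print(token_set_ratio("fuzzy fuzzy was a bear", "fuzzy was a bear"))  # 100
```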
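from fuzzywuzzy import fuzz, process

if __name__ == '__main__':
    # The misspelling 'mach' is part of the test input.
    st1 = 'thank you very mach'
    st2 = 'thanks you '
    xs1 = fuzz.ratio(st1, st2)             # whole-string comparison
    xs2 = fuzz.partial_ratio(st1, st2)     # best partial match
    xs3 = fuzz.token_sort_ratio(st1, st2)  # word order ignored
    xs4 = fuzz.token_set_ratio(st1, st2)   # duplicate words ignored
    print(xs1, xs2, xs3, xs4)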
The output, in order, is: 67 91 62 62
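The process module finds the entries in a collection of strings that best match a target string. Commonly used methods:
extract() returns the top few matches as a list.
extractOne() returns the single best match.
Parameters of extract and extractOne:
query: the target string
choices: the list or dictionary of strings to match against
Only query and choices are required. The remaining optional parameters control how the matching is done: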
def extract(query, choices, processor=default_processor, scorer=default_scorer, limit=5):
    """Select the best match in a list or dictionary of choices.

    Find best matches in a list or dictionary of choices, return a
    list of tuples containing the match and its score. If a dictionary
    is used, also returns the key for each match.
    """
def extractOne(query, choices, processor=default_processor, scorer=default_scorer, score_cutoff=0):
    """Find the single best match above a score in a list of choices.

    This is a convenience method which returns the single best choice.
    See extract() for the full arguments list.

    Args:
        query: A string to match against
        choices: A list or dictionary of choices, suitable for use with
            extract().
        processor: Optional function for transforming choices before matching.
            See extract().
        scorer: Scoring function for extract().
        score_cutoff: Optional argument for score threshold. If the best
            match is found, but it is not greater than this number, then
            return None anyway ("not a good enough match"). Defaults to 0.

    Returns:
        A tuple containing a single match and its score, if a match
        was found that was above score_cutoff. Otherwise, returns None.
    """
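To show how the parameters in the signatures above fit together, here is a small standard-library sketch of the same idea: score every choice, sort by score, then apply limit or score_cutoff. The names `simple_scorer`, `extract_sketch`, and `extract_one_sketch` are hypothetical and are not part of fuzzywuzzy. The processor argument is left out, and the cutoff comparison follows the wording of the docstring ("not greater than"), so the real library's behaviour at the exact boundary may differ.

```python
from difflib import SequenceMatcher


def simple_scorer(a: str, b: str) -> int:
    """Stand-in scorer (assumption): case-insensitive 0-100 similarity."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def extract_sketch(query, choices, scorer=simple_scorer, limit=5):
    """Return the best matches as (choice, score) tuples, or
    (choice, score, key) tuples when choices is a dictionary."""
    if isinstance(choices, dict):
        scored = [(value, scorer(query, value), key)
                  for key, value in choices.items()]
    else:
        scored = [(choice, scorer(query, choice)) for choice in choices]
    scored.sort(key=lambda item: item[1], reverse=True)  # highest score first
    return scored if limit is None else scored[:limit]


def extract_one_sketch(query, choices, scorer=simple_scorer, score_cutoff=0):
    """Return the single best match, or None if its score
    does not exceed score_cutoff."""
    best = extract_sketch(query, choices, scorer=scorer, limit=1)
    if not best or best[0][1] <= score_cutoff:
        return None
    return best[0]


if __name__ == "__main__":
    desserts = ["apple pie", "banana split", "apple tart"]
    print(extract_sketch("apple pie", desserts, limit=2))
    print(extract_one_sketch("apple pie", desserts))             # ('apple pie', 100)
    print(extract_one_sketch("zzz", desserts, score_cutoff=90))  # None
```

Passing a different scorer, for example one that ignores word order, changes how candidates are ranked without touching the selection logic. That separation is the reason scorer is a parameter.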