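Using the common sklearn interfaces - technical document sharing

Data interface

The datasets that ship with sklearn are all in the sklearn.datasets package and are obtained through either load or fetch functions. Both return a dictionary-like object (a Bunch, which inherits from dict).

The load functions read small datasets bundled with the library, so they can be used directly without downloading anything; the fetch functions have to download their datasets first.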

from sklearn import datasets
iris = datasets.load_iris()
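
To make the dictionary-like return type concrete, the sketch below inspects the object returned by load_iris. The variable names are illustrative only, and no fetch function is called because those download data.

```python
from sklearn import datasets

# load_* functions read small datasets bundled with scikit-learn.
iris = datasets.load_iris()

# The result is a Bunch: a dict subclass whose keys are also attributes.
print(isinstance(iris, dict))   # True
print(sorted(iris.keys()))      # available fields, e.g. 'data' and 'target'
print(iris.data.shape)          # (150, 4)

# return_X_y=True skips the Bunch and returns the two arrays directly.
X, y = datasets.load_iris(return_X_y=True)
print(X.shape, y.shape)         # (150, 4) (150,)
```

Attribute access (iris.data) and key access (iris['data']) refer to the same array.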

For more of the datasets bundled with sklearn, see the sklearn dataset API.

Model selection

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit
 
iris = datasets.load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 0)

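The most commonly used function in model selection is train_test_split, which splits a dataset into training and test sets. Setting random_state makes the split identical on every run, which removes differences in the data as a factor when comparing the performance of different algorithms.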
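
As a small check of the reproducibility claim, the following sketch splits the iris data twice with the same random_state and compares the results. The names split_a and split_b are made up for the example.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# Two calls with the same random_state produce the same partition.
split_a = train_test_split(X, y, test_size=0.4, random_state=0)
split_b = train_test_split(X, y, test_size=0.4, random_state=0)
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)  # True

# 40% of the 150 samples go to the test set.
X_train, X_test, y_train, y_test = split_a
print(X_train.shape, X_test.shape)  # (90, 4) (60, 4)
```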

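Feature extraction

Feature extraction lives in the sklearn.feature_extraction package, which provides dictionary feature extraction (DictVectorizer), hashed feature extraction (FeatureHasher), text feature extraction, and image feature extraction.

Dictionary feature extraction: DictVectorizer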

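from sklearn.feature_extraction import DictVectorizer

v = DictVectorizer(sparse=False)  # sparse=False returns a dense numpy array
D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
X = v.fit_transform(D)
print(X)
# [[2. 0. 1.]
#  [0. 1. 3.]]
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(v.get_feature_names_out())
# ['bar' 'baz' 'foo']
print(type(X))
# <class 'numpy.ndarray'>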

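How to use DictVectorizer: import the class, create an instance, then call fit_transform on a list of dicts. String-valued features are one-hot encoded and numeric values are kept as they are. The result is a two-dimensional array: a numpy.ndarray when sparse=False, as in the code above, or a SciPy sparse matrix (CSR) otherwise.

Constructor parameter of DictVectorizer: sparse defaults to True, so a sparse matrix is returned by default.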
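
The following sketch illustrates both points with a made-up pair of records (the keys city and temp are hypothetical): the default sparse output, and the one-hot encoding applied to string values.

```python
from scipy import sparse
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records: one string-valued and one numeric feature.
records = [{'city': 'Beijing', 'temp': 30},
           {'city': 'Shanghai', 'temp': 25}]

vec = DictVectorizer()                 # sparse=True by default
X_sparse = vec.fit_transform(records)
print(sparse.issparse(X_sparse))       # True

# The string feature becomes one column per value; temp is passed through.
print(vec.feature_names_)
# ['city=Beijing', 'city=Shanghai', 'temp']
print(X_sparse.toarray().tolist())
# [[1.0, 0.0, 30.0], [0.0, 1.0, 25.0]]
```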

Methods of the instance:

`fit`(self, X[, y])	Learn a list of feature name -> indices mappings.
`fit_transform`(self, X[, y])	Learn a list of feature name -> indices mappings and transform X.
`get_feature_names_out`(self[, input_features])	Get output feature names for transformation (replaces `get_feature_names`, which was removed in scikit-learn 1.2).
`get_params`(self[, deep])	Get parameters for this estimator.
`inverse_transform`(self, X[, dict_type])	Transform array or sparse matrix X back to feature mappings.
`restrict`(self, support[, indices])	Restrict the features to those in support using feature selection.
`set_params`(self, **params)	Set the parameters of this estimator.
`transform`(self, X)	Transform feature->value dicts to array or sparse matrix.

Text feature extraction: text

Text feature extraction lives in the sklearn.feature_extraction.text package, which provides the following classes:

`feature_extraction.text.CountVectorizer`([…])	Convert a collection of text documents to a matrix of token counts
`feature_extraction.text.HashingVectorizer`([…])	Convert a collection of text documents to a matrix of token occurrences
`feature_extraction.text.TfidfTransformer`([…])	Transform a count matrix to a normalized tf or tf-idf representation
`feature_extraction.text.TfidfVectorizer`([…])	Convert a collection of raw documents to a matrix of TF-IDF features.

Using CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer

c = CountVectorizer()
corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']
X = c.fit_transform(corpus)  # sparse matrix of token counts
print(X.toarray())
# [[0 1 1 1 0 0 1 0 1]
#  [0 2 0 1 0 1 1 0 1]
#  [1 0 0 1 1 0 1 1 1]
#  [0 1 1 1 0 0 1 0 1]]

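Unlike DictVectorizer, CountVectorizer() has no sparse parameter and always returns a sparse matrix; call the toarray() method of the returned matrix to convert it to a dense two-dimensional array.

Constructor parameter of CountVectorizer: stop_words, the words to exclude from the vocabulary.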
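
A minimal sketch of stop_words on the same corpus follows. The stop-word list is invented for the example; a real project would choose its own.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

# Illustrative stop-word list, not a recommendation.
c = CountVectorizer(stop_words=['is', 'the', 'this'])
X = c.fit_transform(corpus)

# vocabulary_ maps each kept token to its column index.
print(sorted(c.vocabulary_))
# ['and', 'document', 'first', 'one', 'second', 'third']
print(X.toarray()[0].tolist())
# [0, 1, 1, 0, 0, 0]
```

The vocabulary shrinks from nine columns to six because the three stop words no longer get a column.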

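By default CountVectorizer splits text at word boundaries (spaces and punctuation) and ignores single-character tokens. Chinese text has no spaces between words, so it must first be segmented with a library such as jieba before features can be extracted.

Chinese word segmentation with jieba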

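import jieba

wenben = '今天天气好晴朗'            # wenben: the text to segment
data = jieba.cut(wenben)            # generator that yields the segmented words
data_new = ' '.join(list(data))     # join the words with spaces
print(data_new)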

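jieba.cut() segments the Chinese text and returns a generator. Joining its items with str.join (the list() conversion is optional, since join accepts any iterable) gives a space-separated string, which CountVectorizer can then use for feature extraction.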
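
The sketch below shows why segmentation matters. To avoid depending on jieba's exact output, the spaces in the segmented sentences are typed by hand as a stand-in for what a segmenter would produce; the sentences themselves are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Unsegmented Chinese: the whole sentence becomes a single token.
raw = CountVectorizer().fit(['今天天气晴朗'])
print(sorted(raw.vocabulary_))
# ['今天天气晴朗']

# Hand-segmented text standing in for ' '.join(jieba.cut(...)).
segmented = ['今天 天气 晴朗', '明天 天气 不错']
c = CountVectorizer()
X = c.fit_transform(segmented)
print(sorted(c.vocabulary_))
# ['不错', '今天', '天气', '明天', '晴朗']
print(X.toarray().tolist())
# [[0, 1, 1, 0, 1], [1, 0, 1, 1, 0]]
```

Note that single-character words would be dropped by the default token pattern, which only keeps tokens of two or more characters.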

TF-IDF (term frequency-inverse document frequency) text feature extraction: TfidfVectorizer

TfidfVectorizer is used in the same way as CountVectorizer.
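
A short sketch, assuming the same four-sentence corpus used for CountVectorizer, shows the shared interface and how the weights differ from raw counts.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

t = TfidfVectorizer()
X = t.fit_transform(corpus)   # sparse matrix, as with CountVectorizer
dense = X.toarray()
print(dense.shape)            # (4, 9)

# With the default norm='l2', every row has unit length.
print(np.allclose(np.linalg.norm(dense, axis=1), 1.0))  # True

# 'the' occurs in every document and 'second' in only one, so in the
# second document 'second' receives the larger weight.
the_idx = t.vocabulary_['the']
second_idx = t.vocabulary_['second']
print(dense[1, second_idx] > dense[1, the_idx])  # True
```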

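For more on turning text into word vectors, see how feature_extraction is used for feature engineering in sklearn.

Feature processing and preprocessing: preprocessing

Numeric features are made dimensionless, most commonly by standardization or normalization; the tools are in the sklearn.preprocessing package.

Normalization: sklearn.preprocessing.MinMaxScaler()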

import numpy as np
from sklearn.preprocessing import minmax_scale
data = np.random.randint(-10,10,size=(5,5))
print(data)
newdt = minmax_scale(data)
print(newdt)

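By default minmax_scale rescales each column of a NumPy array to the range 0 to 1.

Normalization maps each value with (x - min) / (max - min), so it is strongly affected by outliers: if the maximum or minimum is an outlier, every mapped value is distorted. Its robustness is poor.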
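
The outlier sensitivity can be seen in a small sketch with MinMaxScaler, the class counterpart of minmax_scale. The data, including the outlier 100, is invented for the example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One feature with a made-up outlier (100) as its maximum.
data = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = MinMaxScaler().fit_transform(data)

# The same mapping written out by hand: (x - min) / (max - min).
manual = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))
print(np.allclose(scaled, manual))   # True

# The outlier maps to 1 and squeezes the other values close to 0.
print(scaled.ravel().round(3).tolist())
# [0.0, 0.01, 0.02, 0.03, 1.0]
```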

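Standardization: sklearn.preprocessing.StandardScaler()

Rescales each feature to mean 0 and variance 1; it is used in the same way as MinMaxScaler.

Formula: (x - mean) / std. It suits large datasets, where a few outliers have little effect on the mean and standard deviation.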
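
A minimal sketch on made-up data compares StandardScaler with the formula written out by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales.
data = np.array([[1.0, 200.0],
                 [2.0, 400.0],
                 [3.0, 600.0],
                 [4.0, 800.0]])

scaled = StandardScaler().fit_transform(data)

# Same result by hand; np.std uses the population formula (ddof=0),
# which is what StandardScaler uses as well.
manual = (data - data.mean(axis=0)) / data.std(axis=0)
print(np.allclose(scaled, manual))   # True

# Each column now has mean 0 and standard deviation 1.
print(np.allclose(scaled.mean(axis=0), 0.0), np.allclose(scaled.std(axis=0), 1.0))
# True True
```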

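Dimensionality reduction

There are many ways to reduce the number of features, and they are usually combined. Before applying PCA, features can first be filtered by their own statistics, for example by dropping features with low variance or features that are highly correlated with another feature.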
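
For the correlation-based filter, one option is the Pearson correlation coefficient from scipy.stats.pearsonr. The sketch below uses synthetic features, and the 0.9 cut-off is an assumption chosen for the example, not a rule.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 2 * a + rng.normal(scale=0.1, size=200)   # nearly a linear copy of a
c = rng.normal(size=200)                       # generated independently of a

# pearsonr returns the coefficient and a p-value.
r_ab, _ = pearsonr(a, b)
r_ac, _ = pearsonr(a, c)
print(round(float(r_ab), 3), round(float(r_ac), 3))

# Illustrative rule: treat b as redundant if |r| exceeds the cut-off.
CUTOFF = 0.9
print(abs(r_ab) > CUTOFF, abs(r_ac) > CUTOFF)
```

When two features are this strongly correlated, keeping only one of them loses little information.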

Feature selection: sklearn.feature_selection

sklearn.feature_selection.VarianceThreshold() filters out low-variance features

import numpy as np
from sklearn.feature_selection import VarianceThreshold

v = VarianceThreshold(threshold=50)  # drop features whose variance is below 50
np.random.seed(0)
data = np.random.randint(-10, 30, size=(5, 6))
print(data)
print(data.var(axis=0))  # per-feature variance, same as data.std(axis=0)**2
newdt = v.fit_transform(data)
print(newdt)

Features whose variance falls below the threshold given when the object was created are filtered out.
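
Because the random data above makes the effect hard to follow, here is a sketch on a hand-made array whose column variances are easy to compute; get_support() shows which columns survive.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hand-made columns with variances 0, 25 and 2500.
data = np.array([[1.0, 0.0, 0.0],
                 [1.0, 10.0, 100.0],
                 [1.0, 0.0, 0.0],
                 [1.0, 10.0, 100.0]])
print(data.var(axis=0).tolist())    # [0.0, 25.0, 2500.0]

v = VarianceThreshold(threshold=50)
kept = v.fit_transform(data)

# Only the last column has a variance above the threshold.
print(v.get_support().tolist())     # [False, False, True]
print(kept.ravel().tolist())        # [0.0, 100.0, 0.0, 100.0]
```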

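Principal component analysis (PCA): sklearn.decomposition.PCA

PCA reduces high-dimensional data to fewer dimensions by discarding part of the original information and constructing new features. Because the features it produces are new, models built on them are harder to interpret.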

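import numpy as np
from sklearn.decomposition import PCA

p = PCA(n_components=0.9)  # keep enough components to explain 90% of the variance
x = np.random.random(size=(5, 5))
y = p.fit_transform(x)     # y is the reduced data here, not a label vector
print(y)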

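When a PCA instance is created, a fractional n_components (between 0 and 1) sets how much of the information (variance) to retain, while an integer n_components sets how many features (components) to keep.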
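
The two meanings of n_components can be compared side by side. The data in this sketch is synthetic: five features built from two underlying signals plus a little noise, an assumption made so that a few components capture most of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
mixed = base @ rng.normal(size=(2, 3))          # three features derived from base
x = np.hstack([base, mixed]) + 0.01 * rng.normal(size=(100, 5))

# Integer: keep exactly this many components.
p_int = PCA(n_components=2)
reduced_int = p_int.fit_transform(x)
print(reduced_int.shape)                         # (100, 2)

# Fraction: keep as many components as needed to reach that share of variance.
p_ratio = PCA(n_components=0.9)
reduced_ratio = p_ratio.fit_transform(x)
print(p_ratio.n_components_, p_ratio.explained_variance_ratio_.sum())
```

The fitted attribute n_components_ reports how many components the fractional setting ended up keeping.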
