>>> from sklearn import preprocessing
>>> import numpy as np
>>> X = np.array([[ 1., -1.,  2.],
...               [ 2.,  0.,  0.],
...               [ 0.,  1., -1.]])
>>> X_scaled = preprocessing.scale(X)
>>> print(X_scaled)
scale
After scaling, each column has zero mean and unit variance:
>>> X_scaled.mean(axis=0)
array([ 0., 0., 0.])
>>> X_scaled.std(axis=0)
array([ 1., 1., 1.])
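As a cross-check, the same result can be reproduced by hand with NumPy. This is a minimal sketch, assuming that scale standardizes each column with the population standard deviation (ddof=0); the names col_mean, col_std and X_manual are made up for illustration and are not part of scikit-learn.

```python
import numpy as np

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Per-column statistics (axis=0 means "down the rows").
col_mean = X.mean(axis=0)
col_std = X.std(axis=0)  # NumPy's default ddof=0, i.e. the population std

# Standardize: subtract the column mean, divide by the column std.
X_manual = (X - col_mean) / col_std

print(X_manual.mean(axis=0))  # numerically zero in every column
print(X_manual.std(axis=0))   # numerically one in every column
```

Note that a constant column would have a standard deviation of zero; this sketch does not handle that case.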
StandardScaler computes the mean and standard deviation on a training set so that the same transformation can later be reapplied to a test set:
>>> scaler = preprocessing.StandardScaler().fit(X)
>>> scaler
StandardScaler(copy=True, with_mean=True, with_std=True)
>>> scaler.mean_
array([ 1. ..., 0. ..., 0.33...])
>>> scaler.scale_
array([ 0.81..., 0.81..., 1.24...])
>>> scaler.transform(X)
array([[ 0. ..., -1.22..., 1.33...],
       [ 1.22..., 0. ..., -0.26...],
       [-1.22..., 1.22..., -1.06...]])
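Likewise, the same transformation can be applied to the test set:
>>> scaler.transform([[-1., 1., 0.]])
array([[-2.44..., 1.22..., -0.26...]])
You can also set some of StandardScaler's parameters explicitly, for example: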
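>>> scaler = preprocessing.StandardScaler(copy=True, with_mean=True, with_std=True).fit(X)
The values shown here are the defaults; passing with_mean=False or with_std=False turns off centering or scaling, respectively.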
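The fit-on-train, transform-on-test pattern can be sketched as follows. The arrays are the ones used in this article; the variable names manual and centered_only are illustrative. The sketch also shows one non-default setting, with_std=False, which only centers the data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
X_test = np.array([[-1., 1., 0.]])

# Learn the statistics from the training data only.
scaler = StandardScaler().fit(X_train)

# Reuse the training statistics on unseen data.
X_test_scaled = scaler.transform(X_test)

# The same computation written out with the fitted attributes.
manual = (X_test - scaler.mean_) / scaler.scale_

# with_std=False: subtract the mean but leave the variance untouched.
centered_only = StandardScaler(with_std=False).fit_transform(X_train)

print(X_test_scaled)
print(centered_only)
```

Fitting on the training set alone keeps information from the test set out of the learned mean and scale.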
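MinMaxScaler rescales each feature to a given range, (0, 1) by default:
>>> X_train = np.array([[ 1., -1.,  2.],
...                     [ 2.,  0.,  0.],
...                     [ 0.,  1., -1.]])
...
>>> min_max_scaler = preprocessing.MinMaxScaler(copy=True, feature_range=(0, 1))
>>> X_train_minmax = min_max_scaler.fit_transform(X_train)
>>> X_train_minmax
array([[ 0.5 , 0. , 1. ],
       [ 1. , 0.5 , 0.33333333],
       [ 0. , 1. , 0. ]])
Similarly, the fitted scaler can be applied directly to later test data: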
>>> X_test = np.array([[-3., -1., 4.]])
>>> X_test_minmax = min_max_scaler.transform(X_test)
>>> X_test_minmax
array([[-1.5 , 0. , 1.66666667]])
When MinMaxScaler is given a feature_range of (min, max), the transformation is:
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
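The formula above can be written out directly in NumPy. The helper min_max_scale below is a hypothetical function introduced only for illustration (it is not a scikit-learn API); it takes the per-column minimum and maximum from the training data and applies them to any other array.

```python
import numpy as np

def min_max_scale(X_fit, X_new, feature_range=(0, 1)):
    """Scale X_new using the per-column min/max learned from X_fit."""
    lo, hi = feature_range
    col_min = X_fit.min(axis=0)
    col_max = X_fit.max(axis=0)
    # Note: a constant column would give a zero denominator here;
    # this sketch does not handle that case.
    X_std = (X_new - col_min) / (col_max - col_min)
    return X_std * (hi - lo) + lo

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
X_test = np.array([[-3., -1., 4.]])

train_scaled = min_max_scale(X_train, X_train)
test_scaled = min_max_scale(X_train, X_test)
train_pm1 = min_max_scale(X_train, X_train, feature_range=(-1, 1))

print(test_scaled)
```

Because the test row contains values outside the training range, its scaled values fall outside [0, 1], exactly as in the MinMaxScaler output shown earlier.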
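MaxAbsScaler works much like the scalers above. The difference is that, rather than mapping features onto a specified range, it divides each feature by its maximum absolute value, so data that is already centered at zero, or sparse data, is not shifted. Centering sparse data would destroy its sparsity structure and is therefore rarely a sensible thing to do. Scaling sparse inputs can still make sense, however, especially when the features are on different scales.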
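A small sketch of this behaviour on a sparse matrix follows. It reuses the 3x3 example array from this article, stored as a scipy.sparse CSR matrix; the variable names are illustrative.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import MaxAbsScaler

X_sparse = sp.csr_matrix([[1., -1., 2.],
                          [2., 0., 0.],
                          [0., 1., -1.]])

scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X_sparse)

# Each column is divided by its largest absolute value.
print(scaler.max_abs_)
# Zeros stay zeros, so the number of stored entries is unchanged.
print(X_sparse.nnz, X_scaled.nnz)
print(X_scaled.toarray())
```

Since no offset is subtracted, zero entries remain zero and the result stays sparse.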
Normalization scales each sample (row) to unit norm. Two norms are available, l1 or l2:
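>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> X_normalized = preprocessing.normalize(X, norm='l2')
>>> X_normalized
array([[ 0.40..., -0.40..., 0.81...],
       [ 1. ..., 0. ..., 0. ...],
       [ 0. ..., 0.70..., -0.70...]])
The same operation is also available through the Normalizer class, which is used in the same way as the scalers above: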
>>> normalizer = preprocessing.Normalizer().fit(X)  # fit does nothing
>>> normalizer.transform(X)
array([[ 0.40..., -0.40..., 0.81...],
       [ 1. ..., 0. ..., 0. ...],
       [ 0. ..., 0.70..., -0.70...]])
>>> normalizer.transform([[-1., 1., 0.]])
array([[-0.70..., 0.70..., 0. ...]])
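What the two norms do to each row can be shown with plain NumPy. This is a sketch of the arithmetic only, not scikit-learn's implementation, and the variable names are illustrative.

```python
import numpy as np

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# l2: divide each row by the square root of its sum of squares.
l2_norms = np.sqrt((X ** 2).sum(axis=1, keepdims=True))
X_l2 = X / l2_norms

# l1: divide each row by its sum of absolute values.
l1_norms = np.abs(X).sum(axis=1, keepdims=True)
X_l1 = X / l1_norms

# Note: an all-zero row would give a zero norm; this sketch does not handle that.
print(X_l2)
print(X_l1)
```

Unlike the scalers, which work column by column, normalization works row by row, so a single new sample can be normalized without any fitted statistics. This is why fit does nothing for Normalizer.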
Sparse input
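normalize and Normalizer accept both dense array-like data and scipy.sparse matrices as input.
Binarizer thresholds numerical features to obtain boolean (0/1) values:
>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> binarizer = preprocessing.Binarizer().fit(X)  # fit does nothing
>>> binarizer
Binarizer(copy=True, threshold=0.0)
>>> binarizer.transform(X)
array([[ 1., 0., 1.],
       [ 1., 0., 0.],
       [ 0., 1., 0.]])
After adjusting the threshold: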
>>> binarizer = preprocessing.Binarizer(threshold=1.1)
>>> binarizer.transform(X)
array([[ 0., 0., 1.],
       [ 1., 0., 0.],
       [ 0., 0., 0.]])
Binarizer also supports sparse input.
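As a brief illustration of the sparse case, the sketch below feeds a scipy.sparse matrix to Binarizer and compares the result with the plain rule "1 where the value is greater than the threshold, otherwise 0". The variable names are illustrative, and fit_transform is used here simply to stay on the safe side across scikit-learn versions.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import Binarizer

X_dense = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
X_sparse = sp.csr_matrix(X_dense)

binarizer = Binarizer(threshold=1.1)
X_bin_sparse = binarizer.fit_transform(X_sparse)

# The same rule written directly in NumPy.
X_bin_manual = (X_dense > 1.1).astype(float)

print(X_bin_sparse.toarray())
print(X_bin_manual)
```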
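OneHotEncoder encodes integer categorical features as one-hot vectors:
>>> enc = preprocessing.OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
OneHotEncoder(categorical_features='all', dtype=<... 'float'>, handle_unknown='error', n_values='auto', sparse=True)
>>> enc.transform([[0, 1, 3]]).toarray()
array([[ 1., 0., 0., 1., 0., 0., 0., 0., 1.]])
The three input columns take 2, 3 and 4 distinct values, so the encoded row has 2 + 3 + 4 = 9 columns.
The same idea can also be implemented by loading features from dicts: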
>>> measurements = [
...     {'city': 'Dubai', 'temperature': 33.},
...     {'city': 'London', 'temperature': 12.},
...     {'city': 'San Fransisco', 'temperature': 18.},
... ]
>>> from sklearn.feature_extraction import DictVectorizer
>>> vec = DictVectorizer()
>>> vec.fit_transform(measurements).toarray()
array([[ 1., 0., 0., 33.],
       [ 0., 1., 0., 12.],
       [ 0., 0., 1., 18.]])
>>> vec.get_feature_names()  # get_feature_names_out() in scikit-learn 1.0 and later
['city=Dubai', 'city=London', 'city=San Fransisco', 'temperature']
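Returning to the OneHotEncoder example, the layout of its nine output columns can be inspected through the fitted encoder. This sketch assumes scikit-learn 0.20 or later, where the learned categories are exposed as the categories_ attribute; the repr printed in the article comes from an older release with different constructor parameters. The names X_fit and sizes are illustrative.

```python
from sklearn.preprocessing import OneHotEncoder

X_fit = [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]
enc = OneHotEncoder().fit(X_fit)

# One array of categories per input column.
sizes = [len(c) for c in enc.categories_]
print(sizes)

# Each input value switches on exactly one column inside its own block.
encoded = enc.transform([[0, 1, 3]]).toarray()
print(encoded)
```

Reading the encoded row block by block: value 0 selects the first of two columns, value 1 the second of three, and value 3 the last of four.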
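The Imputer class provides basic strategies for filling in missing values, for example with the mean or the median of each column. Many other methods exist as well.
>>> import numpy as np
>>> from sklearn.preprocessing import Imputer
>>> imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
>>> imp.fit([[1, 2], [np.nan, 3], [7, 6]])
Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)
>>> X = [[np.nan, 2], [6, np.nan], [7, 6]]
>>> print(imp.transform(X))
[[ 4. 2. ]
 [ 6. 3.666...]
 [ 7. 6. ]]
Missing values can also be encoded as 0, for example in a sparse matrix: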
>>> import scipy.sparse as sp
>>> X = sp.csc_matrix([[1, 2], [0, 3], [7, 6]])
>>> imp = Imputer(missing_values=0, strategy='mean', axis=0)
>>> imp.fit(X)
Imputer(axis=0, copy=True, missing_values=0, strategy='mean', verbose=0)
>>> X_test = sp.csc_matrix([[0, 2], [6, 0], [7, 6]])
>>> print(imp.transform(X_test))
[[ 4. 2. ]
 [ 6. 3.666...]
 [ 7. 6. ]]
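The Imputer class used above belongs to older scikit-learn releases; it was removed in version 0.22. The sketch below repeats the first (dense, NaN) example with its replacement, sklearn.impute.SimpleImputer, assuming scikit-learn 0.20 or later. SimpleImputer always works column by column, so it has no axis argument.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Learn the per-column means, ignoring missing entries.
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])

X = [[np.nan, 2], [6, np.nan], [7, 6]]
X_filled = imp.transform(X)

# Column means learned at fit time: (1 + 7) / 2 and (2 + 3 + 6) / 3.
print(imp.statistics_)
print(X_filled)
```

Changing strategy to 'median' fills with the per-column median instead.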
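PolynomialFeatures generates polynomial and interaction terms from the input features:
>>> import numpy as np
>>> from sklearn.preprocessing import PolynomialFeatures
>>> X = np.arange(6).reshape(3, 2)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5]])
>>> poly = PolynomialFeatures(2)
>>> poly.fit_transform(X)
array([[ 1., 0., 1., 0., 0., 1.],
       [ 1., 2., 3., 4., 6., 9.],
       [ 1., 4., 5., 16., 20., 25.]])
The features of X have been transformed from (X1, X2) to (1, X1, X2, X1^2, X1*X2, X2^2).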
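When only the interaction terms are needed, set interaction_only=True:
>>> X = np.arange(9).reshape(3, 3)
>>> X
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
>>> poly = PolynomialFeatures(degree=3, interaction_only=True)
>>> poly.fit_transform(X)
array([[ 1., 0., 1., 2., 0., 0., 2., 0.],
       [ 1., 3., 4., 5., 12., 15., 20., 60.],
       [ 1., 6., 7., 8., 42., 48., 56., 336.]])
The features of X have been transformed from (X1, X2, X3) to (1, X1, X2, X3, X1*X2, X1*X3, X2*X3, X1*X2*X3).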
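The column order in both outputs can be reproduced with a few lines of standard-library Python. The helper poly_features is a hypothetical function written for illustration; it is not how scikit-learn implements PolynomialFeatures, but it yields the same rows for the two examples above.

```python
from itertools import combinations, combinations_with_replacement
from math import prod

def poly_features(row, degree, interaction_only=False):
    """Return the polynomial terms of one sample, grouped by increasing degree."""
    # interaction_only forbids repeating a feature inside a single term.
    pick = combinations if interaction_only else combinations_with_replacement
    features = []
    for d in range(degree + 1):
        for idx in pick(range(len(row)), d):
            # The empty index tuple (d == 0) gives the bias term 1.
            features.append(prod(row[i] for i in idx))
    return features

print(poly_features([2, 3], degree=2))
print(poly_features([3, 4, 5], degree=3, interaction_only=True))
```

With interaction_only=True each feature appears at most once per term, which is why squares such as X1^2 are absent from the second output.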
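FunctionTransformer wraps an arbitrary function as a transformer, for example to apply a log transformation:
>>> import numpy as np
>>> from sklearn.preprocessing import FunctionTransformer
>>> transformer = FunctionTransformer(np.log1p)
>>> X = np.array([[0, 1], [2, 3]])
>>> transformer.transform(X)
array([[ 0. , 0.69314718],
       [ 1.09861229, 1.38629436]])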
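If the wrapped function has an inverse, it can be supplied as well so that the transformation can be undone. The sketch below pairs np.log1p with np.expm1 through the inverse_func parameter; the variable names X_log and X_back are illustrative.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# log1p(x) = log(1 + x); expm1(y) = exp(y) - 1 is its inverse.
transformer = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)

X = np.array([[0., 1.], [2., 3.]])
X_log = transformer.fit_transform(X)
X_back = transformer.inverse_transform(X_log)

print(X_log)
print(X_back)
```

Because it follows the usual fit/transform interface, a FunctionTransformer can be placed in a pipeline alongside the other preprocessing steps.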
Reposted from: http://vxaji.baihongyu.com/