sklearn
spyder
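scale: zero mean and unit variance

Code:

    from sklearn import preprocessing
    import numpy as np

    # raw data
    X = np.array([[1., -1., 2.],
                  [2., 0., 0.],
                  [0., 1., -1.]])
    X_scaled = preprocessing.scale(X)

Output:

    X_scaled
    Out[33]:
    array([[ 0.        , -1.22474487,  1.33630621],
           [ 1.22474487,  0.        , -0.26726124],
           [-1.22474487,  1.22474487, -1.06904497]])

    X_scaled.mean()
    Out[34]: 4.9343245538895844e-17

    X_scaled.std(axis=0)
    Out[35]: array([1., 1., 1.])

    X_scaled.std(axis=1)
    Out[36]: array([1.04587533, 0.64957343, 1.11980724])

    X_scaled.mean(axis=0)
    Out[37]: array([0., 0., 0.])

scale standardizes each column (axis=0), so the column means are 0 and the column standard deviations are 1. The overall mean is zero up to floating-point error, and the row-wise standard deviations (axis=1) are not normalized.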
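To make the transformation explicit, the same result can be reproduced by hand with NumPy. This is a minimal sketch, assuming the population standard deviation (ddof=0), which is the convention preprocessing.scale follows; the names mean, std and X_manual are only illustrative.

```python
import numpy as np

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Per-column mean and population standard deviation (ddof=0)
mean = X.mean(axis=0)
std = X.std(axis=0)

# Subtract the column mean, then divide by the column standard deviation
X_manual = (X - mean) / std
print(X_manual)
```

Each column of X_manual then has mean 0 and standard deviation 1, which matches what X_scaled shows above.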
StandardScaler: computes the mean and standard deviation of the training set so that the same transformation can later be applied to the test set

Code:

    from sklearn import preprocessing
    import numpy as np

    # raw data
    X = np.array([[1., -1., 2.],
                  [2., 0., 0.],
                  [0., 1., -1.]])
    scaler = preprocessing.StandardScaler().fit(X)
    y = scaler.transform(X)
    print(y)

Output:

    [[ 0.         -1.22474487  1.33630621]
     [ 1.22474487  0.         -0.26726124]
     [-1.22474487  1.22474487 -1.06904497]]
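The point of fitting a scaler object is reuse on unseen data. The sketch below illustrates this under one assumption: X_test is a made-up sample that does not appear in the original notes. The fitted statistics are stored in the mean_ and scale_ attributes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
# Hypothetical test sample, used only for illustration
X_test = np.array([[-1., 1., 0.]])

# Learn the column means and standard deviations from the training set only
scaler = StandardScaler().fit(X_train)
print(scaler.mean_)   # per-column mean of X_train
print(scaler.scale_)  # per-column standard deviation of X_train

# Apply the training statistics to the test sample
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled)
```

The test sample is shifted and scaled with the training-set statistics, not with its own, so training and test data end up on the same scale.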
MinMaxScaler (min-max scaling)

Formula:

    X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    X_scaled = X_std * (max - min) + min

Here min and max are the bounds of the target range (the feature_range parameter, (0, 1) by default).

Code:

    import numpy as np
    from sklearn import preprocessing

    X_train = np.array([[1., -1., 2.],
                        [2., 0., 0.],
                        [0., 1., -1.]])
    min_max_scaler = preprocessing.MinMaxScaler()
    X_train_minmax = min_max_scaler.fit_transform(X_train)

Output:

    array([[0.5       , 0.        , 1.        ],
           [1.        , 0.5       , 0.33333333],
           [0.        , 1.        , 0.        ]])
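The min-max formula can be checked directly with NumPy. The helper min_max_scale below is a hypothetical function written for this illustration, not part of sklearn; it assumes no column is constant (otherwise the denominator would be zero). Its feature_range argument plays the role of (min, max) in the formula.

```python
import numpy as np

def min_max_scale(X, feature_range=(0.0, 1.0)):
    """Hypothetical helper: column-wise min-max scaling to feature_range."""
    lo, hi = feature_range
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    # Map each column to [0, 1]
    X_std = (X - col_min) / (col_max - col_min)
    # Stretch and shift to the target range [lo, hi]
    return X_std * (hi - lo) + lo

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

X_01 = min_max_scale(X_train)                             # default range (0, 1)
X_pm1 = min_max_scale(X_train, feature_range=(-1.0, 1.0)) # range (-1, 1)
print(X_01)
print(X_pm1)
```

With the default range the result equals X_std itself; with (-1, 1) every column runs from -1 to 1 instead.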
Imputing missing values

Code:

    import numpy as np
    from sklearn.preprocessing import Imputer  # available in scikit-learn < 0.22

    # Fill missing values with the column mean
    imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
    imp.fit([[1, 2], [np.nan, 3], [7, 6]])
    # fit returns: Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)
    X = [[np.nan, 2], [6, np.nan], [7, 6]]
    print(imp.transform(X))

    # Impute missing values in a sparse matrix (0 marks a missing entry)
    import scipy.sparse as sp
    X = sp.csc_matrix([[1, 2], [0, 3], [7, 6]])
    imp = Imputer(missing_values=0, strategy='mean', axis=0)
    imp.fit(X)
    # fit returns: Imputer(axis=0, copy=True, missing_values=0, strategy='mean', verbose=0)
    X_test = sp.csc_matrix([[0, 2], [6, 0], [7, 6]])
    print(imp.transform(X_test))

Output:

    [[4.         2.        ]
     [6.         3.66666667]
     [7.         6.        ]]
    [[4.         2.        ]
     [6.         3.66666667]
     [7.         6.        ]]

In both cases the column means learned during fit are 4 and 3.66666667 (that is, 11/3), and they replace the missing entries.
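On newer scikit-learn versions, sklearn.preprocessing.Imputer is no longer available (it was removed in 0.22). A minimal sketch of the dense example using sklearn.impute.SimpleImputer, assuming scikit-learn 0.20 or later; SimpleImputer always works column-wise, so there is no axis argument, and the sparse example with missing_values=0 is not covered here.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Fill missing values (np.nan) with the column mean
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])

X = [[np.nan, 2], [6, np.nan], [7, 6]]
X_filled = imp.transform(X)
print(X_filled)
```

The learned column means are stored in imp.statistics_ and are the same values the Imputer example uses.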