Statistical Learning for Data Processing (scikit-learn Tutorial)

Published: 2020-12-26 04:04:51 | Category: Big Data | Source: compiled from the web

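Summary: Data Mining: Getting Started and Practice (WeChat public account: datadw). Scikit-learn is a Python module that integrates classic machine learning algorithms and is tightly coupled with the Python scientific computing libraries (NumPy, SciPy, matplotlib). 1. Statistical learning: the setting and the estimator object in scikit-learn. (1) Datasets: scikit-learn, from two-dimensional arrays, ...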

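Divisive: a top-down approach. All observations start in a single cluster, which is split iteratively as one moves down the hierarchy. When a large number of clusters is expected, this approach is both slow (all observations start as one cluster that is split recursively) and statistically ill-posed.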
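The top-down idea can be illustrated with a minimal sketch. It assumes a simple bisecting strategy (repeatedly splitting the largest cluster in two with 2-means), which is a stand-in chosen for illustration and not a method taken from the tutorial. The function name `divisive_clustering` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans


def divisive_clustering(X, n_clusters, random_state=0):
    """Hypothetical top-down sketch: start from one cluster and
    repeatedly split the largest cluster in two with 2-means."""
    labels = np.zeros(len(X), dtype=int)  # every sample starts in cluster 0
    for new_label in range(1, n_clusters):
        # Pick the currently largest cluster as the one to split
        target = np.argmax(np.bincount(labels))
        idx = np.flatnonzero(labels == target)
        split = KMeans(n_clusters=2, n_init=10,
                       random_state=random_state).fit_predict(X[idx])
        # One half keeps the old label, the other half gets a new one
        labels[idx[split == 1]] = new_label
    return labels


# Synthetic data (assumption): three well-separated 2D blobs
rng = np.random.RandomState(0)
centers = ((0, 0), (5, 5), (0, 5))
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in centers])

labels = divisive_clustering(X, n_clusters=3)
print(np.bincount(labels))  # cluster sizes
```

Each split refits a model on one cluster, so the cost adds up quickly when many clusters are requested, which is the slowness the text describes.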

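Connectivity-constrained clustering
With agglomerative clustering, a connectivity graph can be supplied to specify which samples may be merged together. Graphs in scikit-learn are represented by their adjacency matrix, which is usually a sparse matrix. This is useful, for example, to retrieve connected regions (sometimes also called connected components) when clustering an image: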

import time

import numpy as np
import scipy as sp
import scipy.misc
from sklearn.feature_extraction.image import grid_to_graph
from sklearn.cluster import AgglomerativeClustering

###############################################################################
# Generate data
# Note: sp.misc.lena() only exists in old SciPy releases; with a recent SciPy,
# substitute any 2D grayscale array here.
lena = sp.misc.lena()
# Downsample the image by a factor of 4 (sum each 2x2 block of pixels)
lena = lena[::2, ::2] + lena[1::2, ::2] + lena[::2, 1::2] + lena[1::2, 1::2]
X = np.reshape(lena, (-1, 1))

###############################################################################
# Define the structure A of the data. Pixels connected to their neighbors.
connectivity = grid_to_graph(*lena.shape)

###############################################################################
# Compute clustering
print("Compute structured hierarchical clustering...")
st = time.time()
n_clusters = 15  # number of regions
ward = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward',
                               connectivity=connectivity).fit(X)
label = np.reshape(ward.labels_, lena.shape)
print("Elapsed time: ", time.time() - st)
print("Number of pixels: ", label.size)
print("Number of clusters: ", np.unique(label).size)
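Because the Lena image is no longer shipped with recent SciPy releases, the same steps can be tried on a small synthetic image. The sketch below is an illustration under that assumption: the 20x20 two-region image and its noise level are made up, while `grid_to_graph` and `AgglomerativeClustering` are used as in the example above.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.image import grid_to_graph

# Synthetic image (assumption): dark left half, bright right half, small noise
rng = np.random.RandomState(0)
image = np.zeros((20, 20))
image[:, 10:] = 1.0
image += 0.05 * rng.normal(size=image.shape)

# One sample per pixel, one feature (the gray level)
X = image.reshape(-1, 1)

# Sparse adjacency matrix: each pixel is connected to its grid neighbors
connectivity = grid_to_graph(*image.shape)

# Ward clustering restricted to merges between connected pixels
ward = AgglomerativeClustering(n_clusters=2, linkage="ward",
                               connectivity=connectivity).fit(X)
label = ward.labels_.reshape(image.shape)

print("Adjacency matrix shape:", connectivity.shape)
print("Number of clusters:", np.unique(label).size)
```

The adjacency matrix has one row and one column per pixel, which is why it is stored as a sparse matrix.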

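Feature agglomeration:
We have seen that sparsity can mitigate the curse of dimensionality, i.e. the situation where the number of observations is too small relative to the number of features. Another approach is to merge similar features: feature agglomeration. This is done by clustering in the feature direction, which can also be understood as clustering the transposed data.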

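import numpy as np
from sklearn import cluster, datasets
from sklearn.feature_extraction.image import grid_to_graph

digits = datasets.load_digits()
images = digits.images
# Flatten each 8x8 image into a row of 64 features
X = np.reshape(images, (len(images), -1))
# Pixels of one image are connected to their grid neighbors
connectivity = grid_to_graph(*images[0].shape)
agglo = cluster.FeatureAgglomeration(connectivity=connectivity,
                                     n_clusters=32)
agglo.fit(X)
# Merge the 64 pixel features into 32 clusters of features
X_reduced = agglo.transform(X)
# Map the reduced data back to the original 64-feature space
X_approx = agglo.inverse_transform(X_reduced)
images_approx = np.reshape(X_approx, images.shape)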

The transform and inverse_transform methods
Some estimators expose a transform method, for instance to reduce the dimensionality of the dataset.
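As a hedged illustration of what these two methods do for feature agglomeration, the sketch below checks the array shapes and, assuming the default pooling function (the mean), verifies that each reduced feature is the average of the original features in its cluster. The variable names are illustrative only.

```python
import numpy as np
from sklearn import cluster, datasets
from sklearn.feature_extraction.image import grid_to_graph

digits = datasets.load_digits()
images = digits.images
X = images.reshape(len(images), -1)          # (n_samples, 64)
connectivity = grid_to_graph(*images[0].shape)

agglo = cluster.FeatureAgglomeration(connectivity=connectivity, n_clusters=32)
agglo.fit(X)

X_reduced = agglo.transform(X)                 # 64 features -> 32 features
X_approx = agglo.inverse_transform(X_reduced)  # back to 64 features

print(X.shape, X_reduced.shape, X_approx.shape)

# With the default pooling (mean), reduced feature k is the mean of the
# original features assigned to cluster k
k = 0
manual = X[:, agglo.labels_ == k].mean(axis=1)
print(np.allclose(X_reduced[:, k], manual))

# inverse_transform copies each cluster value back to all of its features
print(np.allclose(X_approx[:, 5], X_reduced[:, agglo.labels_[5]]))
```

The round trip is lossy: features merged into the same cluster all receive the same value after inverse_transform.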

(2) Decompositions: from a signal to components and loadings

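Components and loadings:
If X is our multivariate data, the problem we are trying to solve is to rewrite it on a different observational basis: we want to learn loadings L and a set of components C such that X = LC. Different criteria exist for choosing the components.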
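The factorization X = LC can be made concrete with PCA as one possible choice of criterion. In this sketch, mapping C to `pca.components_` and L to the output of `pca.transform` is an assumed reading of the notation above; since PCA centers the data first, the identity holds for X minus its column means. The random data is made up.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (assumption): 100 observations, 3 features
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))

pca = PCA().fit(X)       # keep all 3 components
C = pca.components_      # components, shape (3, 3)
L = pca.transform(X)     # loadings of each observation, shape (100, 3)

# PCA works on centered data, so the mean is added back after L @ C
X_rebuilt = L @ C + pca.mean_
print(np.allclose(X, X_rebuilt))
```

When all components are kept, the reconstruction is exact; keeping fewer components yields an approximation of X.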

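Principal component analysis: PCA
Principal component analysis (PCA) selects the successive components that explain the maximum variance in the signal.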
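Stated as a formula, this is the standard variance-maximization definition of PCA. The symbols are introduced here for illustration: $\Sigma$ is the sample covariance matrix of the centered data and $w_k$ is the $k$-th component.

```latex
w_1 = \arg\max_{\|w\| = 1} \; w^\top \Sigma \, w ,
\qquad
w_k = \arg\max_{\substack{\|w\| = 1 \\ w \perp w_1, \dots, w_{k-1}}} \; w^\top \Sigma \, w .
```

Each $w_k$ is an eigenvector of $\Sigma$, and the variance it explains is the corresponding eigenvalue.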

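The point cloud spanned by the observations is very flat in one direction: one of the three univariate features can be computed almost exactly from the other two. PCA finds the directions in which the data is not flat.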

When used to transform data, PCA can reduce the dimensionality of the data by projecting it onto a principal subspace:

import numpy as np
from sklearn import decomposition

# Create a signal with only 2 useful dimensions
x1 = np.random.normal(size=100)
x2 = np.random.normal(size=100)
x3 = x1 + x2
X = np.c_[x1, x2, x3]

pca = decomposition.PCA()
pca.fit(X)
print(pca.explained_variance_)

# As we can see, only the 2 first components are useful
pca.n_components = 2
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # (100, 2)

