A Case Study of the K-Means Algorithm

Published: 2024-11-26 Author: 千家信息网 editor

Last updated by 千家信息网: 2024-11-26

This article walks through a case study of the K-Means algorithm. Many readers may not be familiar with it, so the key points are summarized below.

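Background

K-means is an unsupervised algorithm that solves clustering problems. It follows a simple procedure for partitioning a given data set into a fixed number of clusters (say, k clusters). Data points within a cluster are homogeneous, and heterogeneous with respect to the other clusters.

Remember finding shapes in ink blots? K-means is somewhat similar: you look at the shapes and how they spread out to work out how many distinct clusters or populations are present.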

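How K-means forms clusters:

1. K-means picks k points, one for each cluster, known as centroids.
2. Each data point joins the cluster of its closest centroid, which gives k clusters.
3. The centroid of each cluster is recomputed from its current members. This gives new centroids.
4. With the new centroids, repeat steps 2 and 3: find the closest new centroid for each data point and assign the point to that cluster. Repeat until convergence, i.e. until the centroids no longer change.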
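The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the scikit-learn implementation: the function name `kmeans_sketch`, the random choice of initial centroids, the handling of empty clusters, and the six sample points are all assumptions made for the example.

```python
import numpy as np


def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Plain K-Means following the four steps above (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


# Two well-separated groups of 2-D points (made-up data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans_sketch(X, k=2)
print(labels)
print(centroids)
```

The first three points should end up in one cluster and the last three in the other. Which group receives label 0 depends on the random initialization.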

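How to determine the value of K:

In K-means, every cluster has its own centroid. The sum of the squared differences between the centroid and the data points in a cluster is that cluster's within-cluster sum of squares. Adding these values over all clusters gives the total within-cluster sum of squares for the clustering solution.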
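To make the definition concrete, here is a small sketch that computes the per-cluster sums of squares and their total. The helper name `within_cluster_ss` and the four sample points are hypothetical and chosen so the arithmetic is easy to check by hand.

```python
import numpy as np


def within_cluster_ss(X, labels, centroids):
    """Return the per-cluster sums of squares and their total."""
    per_cluster = np.array([
        np.sum((X[labels == j] - centroids[j]) ** 2)
        for j in range(len(centroids))
    ])
    return per_cluster, per_cluster.sum()


# Made-up 1-D example: two clusters with centroids 1.0 and 11.0.
X = np.array([[0.0], [2.0], [10.0], [12.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.0], [11.0]])

per_cluster, total = within_cluster_ss(X, labels, centroids)
print(per_cluster)  # each cluster: 1^2 + 1^2 = 2
print(total)        # 2 + 2 = 4
```

A fitted scikit-learn `KMeans` model exposes this total as its `inertia_` attribute.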

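As the number of clusters increases, this total keeps decreasing. If you plot it against k, however, you will typically see it drop sharply up to some value of k and only gradually after that. That is where we can find the optimal number of clusters.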
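One way to look for that point is to fit a model for each candidate k and record the total within-cluster sum of squares, which scikit-learn stores in `inertia_`. The sketch below assumes synthetic data from `make_blobs` with three groups. The range of k, `n_init=10`, and `random_state=42` are illustrative choices, not values from the article.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 groups (an assumption for illustration).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

inertias = []
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)

# Print k next to its total within-cluster sum of squares.
for k, value in zip(range(1, 9), inertias):
    print(k, round(value, 1))
```

On data like this, the values should fall steeply up to k = 3 and flatten afterwards. Plotting `inertias` against k makes the bend easier to see.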

Here is an example implemented in Python:

'''
The following code is for the K-Means
Created by - ANALYTICS VIDHYA
'''

# importing required libraries
import pandas as pd
from sklearn.cluster import KMeans

# read the train and test dataset
train_data = pd.read_csv('train-data.csv')
test_data = pd.read_csv('test-data.csv')

# shape of the dataset
print('Shape of training data :', train_data.shape)
print('Shape of testing data :', test_data.shape)

# Now, we need to divide the training data into different clusters
# and predict in which cluster a particular data point belongs.

'''
Create the object of the K-Means model
You can also add other parameters and test your code here
Some parameters are : n_clusters and max_iter
Documentation of sklearn KMeans:
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
'''
model = KMeans()

# fit the model with the training data
model.fit(train_data)

# Number of Clusters
print('\nDefault number of Clusters : ', model.n_clusters)

# predict the clusters on the train dataset
predict_train = model.predict(train_data)
print('\nCLusters on train data', predict_train)

# predict the clusters on the test dataset
predict_test = model.predict(test_data)
print('Clusters on test data', predict_test)

# Now, we will train a model with n_clusters = 3
model_n3 = KMeans(n_clusters=3)

# fit the model with the training data
model_n3.fit(train_data)

# Number of Clusters
print('\nNumber of Clusters : ', model_n3.n_clusters)

# predict the clusters on the train dataset
predict_train_3 = model_n3.predict(train_data)
print('\nCLusters on train data', predict_train_3)

# predict the clusters on the test dataset
predict_test_3 = model_n3.predict(test_data)
print('Clusters on test data', predict_test_3)

Output:

Shape of training data : (100, 5)
Shape of testing data : (100, 5)

Default number of Clusters :  8

CLusters on train data [6 7 0 7 6 5 5 7 7 3 1 1 3 0 7 1 0 4 5 6
 4 3 3 0 4 0 1 1 0 3 4 3 3 0 0 1 2 1 4 3
 0 2 1 1 0 3 3 0 7 1 3 0 5 1 0 1 5 4 6 4
 3 6 5 0 3 0 4 3 3 1 5 1 6 5 7 7 6 3 5 3
 5 3 1 5 2 5 0 3 2 3 4 7 1 0 1 5 3 6 1 6]
Clusters on test data [3 6 2 0 5 6 0 3 5 2 3 4 5 5 5 3 3 5 5 7
 0 0 5 5 3 5 0 6 5 0 1 6 3 5 6 0 1 7 3 0
 0 6 2 0 5 3 5 7 3 3 4 6 3 1 6 3 1 3 3 2
 3 3 5 1 7 5 1 5 3 3 5 2 0 1 5 0 3 0 3 6
 3 5 4 0 2 6 3 5 6 0 6 4 3 5 0 6 6 6 1 0]

Number of Clusters :  3

CLusters on train data [2 0 1 0 2 1 2 0 0 2 0 0 2 1 0 0 1 2 2 2
 2 2 2 1 2 1 0 0 1 2 2 2 2 1 1 0 2 0 2 2
 1 2 0 0 1 2 2 1 0 0 2 1 2 0 1 0 2 2 2 2
 2 2 2 1 2 1 2 2 2 0 1 0 2 2 0 0 0 2 0 2
 2 2 0 2 2 2 1 2 2 2 2 0 0 1 0 2 2 2 0 2]
Clusters on test data [2 2 2 1 2 2 1 2 2 2 2 2 2 1 1 2 2 2 2 0
 1 1 2 2 2 2 1 2 2 1 0 2 2 2 2 1 0 0 2 1
 1 2 2 1 2 2 2 0 2 2 2 2 2 0 2 2 0 2 2 2
 2 2 2 0 0 2 0 2 2 2 0 2 1 0 2 1 2 1 2 0
 2 2 2 1 2 2 2 2 2 1 2 2 2 2 1 2 2 2 0 1]
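The article does not include the CSV files, so the listing cannot be run as is. The sketch below is a stand-in that uses random 100 x 5 arrays in place of `train-data.csv` and `test-data.csv`; the random data and the `random_state` and `n_init` settings are assumptions. It also shows one way to summarize the predicted labels by counting members per cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

# Random stand-in for the article's 100 x 5 data sets (not the real data).
rng = np.random.default_rng(0)
train = rng.normal(size=(100, 5))
test = rng.normal(size=(100, 5))

# Fixing random_state makes the cluster assignment repeatable across runs.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(train)
train_labels = model.predict(train)
test_labels = model.predict(test)

# Cluster sizes; minlength keeps a slot for every cluster id.
print(np.bincount(train_labels, minlength=3))
print(np.bincount(test_labels, minlength=3))
print(model.cluster_centers_.shape)  # (3, 5): one centroid per cluster
```

Cluster ids are arbitrary labels, so without a fixed `random_state` the same group may receive a different number from one run to the next.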