


kMeansClustering


Unit: SDL_math2
Class: none
Declaration: [1] function kMeansClustering (InMat: TMatrix; RowLo, RowHi: integer; NumClusters: integer; var Clusters: TMatrix; var ClassVec: TIntVector): double;
[2] function kMeansClustering (InMat: TMatrix; RowLo, RowHi: integer; NumClusters: integer; InitSeed: integer; var Clusters: TMatrix; var ClassVec: TIntVector; var ClassCnt: TIntVector): double;

The function kMeansClustering performs a cluster analysis using the well-known k-means algorithm (see www.statistics4u.info for more details).

The parameter InMat contains the data to be clustered (the rows contain the objects, the columns contain the variables). The objects actually used for the clustering are determined by the parameters RowLo and RowHi. The parameter NumClusters specifies the number of clusters to be found.

The detected cluster centers are stored in the variable parameter Clusters, and the class assignment of the objects is returned in the variable parameter ClassVec (a vector of length InMat.NrOfRows). Class numbers are assigned according to class size: the largest class receives the class number 1, the smallest class the class number NumClusters. The function returns the sum of squared intra-class distances.
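The numbering convention described above can be illustrated with a small stand-alone sketch. It does not use the SDL units and is not the library's implementation; the names RenumberBySize, TLabels and TCounts are hypothetical, and the handling of equally sized classes (they keep their original relative order) is an assumption, since the behaviour for ties is not specified here.

```pascal
program ClassRenumberSketch;

const
  NumObjects  = 6;
  NumClusters = 3;

type
  TLabels = array[1..NumObjects] of integer;
  TCounts = array[1..NumClusters] of integer;

{ Renumbers the class labels so that class 1 is the largest class and
  class NumClusters the smallest. Counts receives the class sizes in
  the new numbering. Hypothetical helper, not part of SDL. }
procedure RenumberBySize(var Labels: TLabels; var Counts: TCounts);
var
  OldCnt, Order, NewLabel: TCounts;
  i, j, tmp: integer;
begin
  for j := 1 to NumClusters do
  begin
    OldCnt[j] := 0;
    Order[j] := j;
  end;
  { count the members of each class under the old numbering }
  for i := 1 to NumObjects do
    Inc(OldCnt[Labels[i]]);
  { insertion sort of the class indices by descending size;
    assumption: equally sized classes keep their original order }
  for i := 2 to NumClusters do
  begin
    j := i;
    while (j > 1) and (OldCnt[Order[j]] > OldCnt[Order[j-1]]) do
    begin
      tmp := Order[j];
      Order[j] := Order[j-1];
      Order[j-1] := tmp;
      Dec(j);
    end;
  end;
  { Order[j] is the old number of the class that becomes class j }
  for j := 1 to NumClusters do
  begin
    NewLabel[Order[j]] := j;
    Counts[j] := OldCnt[Order[j]];
  end;
  for i := 1 to NumObjects do
    Labels[i] := NewLabel[Labels[i]];
end;

var
  Labels: TLabels = (1, 2, 2, 3, 2, 3);
  Counts: TCounts;
  i: integer;

begin
  RenumberBySize(Labels, Counts);
  for i := 1 to NumObjects do
    Write(Labels[i], ' ');
  WriteLn;
  for i := 1 to NumClusters do
    WriteLn('class ', i, ': ', Counts[i], ' objects');
end.
```

In this sketch the old class 2 (three members) becomes class 1, the old class 3 (two members) becomes class 2, and the old class 1 (one member) becomes class 3, so the labels 1 2 2 3 2 3 turn into 3 1 1 2 1 2.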

The two overloaded versions of kMeansClustering differ in how the initial cluster centers are assigned. In version [1] the initial cluster centers are always the first NumClusters rows of the data matrix InMat. In version [2] you can control the assignment of the initial centers by means of the parameter InitSeed, the seed of the random number generator: if InitSeed is less than or equal to zero, the initial cluster centers are the first NumClusters rows of InMat; if InitSeed is greater than zero, the initial cluster centers are randomly selected from the data matrix using InitSeed as the seed of the random number generator. In addition, version [2] returns the sizes of the classes in the vector ClassCnt.
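Because the result of a k-means run depends on the initial centers, one common way to use version [2] is to try several seeds and keep the solution with the smallest return value. The following unit is a minimal sketch of that idea, not a tested recipe. Only the signature of kMeansClustering and the property NrOfRows are taken from this page; the helper ClusterWithBestSeed, the unit names SDL_matrix and SDL_vector, the use of 1 as the first row index, and the expectation that the same seed reproduces the same result are assumptions. The caller is assumed to have created Clusters, ClassVec and ClassCnt with whatever dimensions the function requires.

```pascal
unit KMeansSeedSketch;

interface

uses
  SDL_matrix, SDL_vector, SDL_math2;  // SDL_matrix/SDL_vector: assumed homes of TMatrix/TIntVector

// Hypothetical helper (not part of SDL): runs kMeansClustering with the
// seeds 1..NumTrials and leaves the best solution in the output objects.
function ClusterWithBestSeed(Data: TMatrix; NumClusters, NumTrials: integer;
  Clusters: TMatrix; ClassVec, ClassCnt: TIntVector): double;

implementation

function ClusterWithBestSeed(Data: TMatrix; NumClusters, NumTrials: integer;
  Clusters: TMatrix; ClassVec, ClassCnt: TIntVector): double;
var
  Seed, BestSeed: integer;
  SumSq, BestSumSq: double;
begin
  // first trial; rows 1..NrOfRows assumes that row indices start at 1
  BestSeed := 1;
  BestSumSq := kMeansClustering(Data, 1, Data.NrOfRows, NumClusters, BestSeed,
                                Clusters, ClassVec, ClassCnt);
  // remaining trials: remember the seed with the smallest sum of squares
  for Seed := 2 to NumTrials do
  begin
    SumSq := kMeansClustering(Data, 1, Data.NrOfRows, NumClusters, Seed,
                              Clusters, ClassVec, ClassCnt);
    if SumSq < BestSumSq then
    begin
      BestSumSq := SumSq;
      BestSeed := Seed;
    end;
  end;
  // rerun with the best seed so that the outputs hold that solution;
  // assumption: the same seed yields the same clustering again
  Result := kMeansClustering(Data, 1, Data.NrOfRows, NumClusters, BestSeed,
                             Clusters, ClassVec, ClassCnt);
end;

end.
```

Rerunning the best seed avoids having to copy matrices and vectors between trials, at the cost of one extra clustering run.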

kMeansClustering increments the global variable ProcStat and calls the feedback routine MathFeedBackProc in order to allow feedback to the user during time-consuming calculations. The maximum value of ProcStat depends on several factors and cannot be specified exactly. To estimate this maximum value, the function kMeansEstimatedSteps, which is based on a regression model, can be used.

Example: This function is used in the following example program (see http://www.lohninger.com/examples.html to download the code): findcent



Last Update: 2023-Feb-06