K-means clustering calculator

K-means clustering

The k-means clustering is a centroid cluster (cluster centers). The idea behind the k-means cluster analysis is simple, minimize the accumulated squared distance from the center (SSE). This algorithm can be used in different ways.

1. he post office example. Where to locate two post office stations, and how to assign each household to the stations.
2. Create groups of a list of items based on the characteristics of each item. For example, create learning groups in a class based on student's characteristics like Mathematical level, writing skills, communications skills.
3. Image processing.
The following definition are similar to the one way ANOVA.
Average vector - the vector of averages of each dimension.
Example: for 3 vectors [1,10], [2,10], [3,11] the average vector is [2,10.33].
(1 + 2 + 3)/3 = 2
(10 + 10 + 11)/3 = 10.33

Following the sum of square (SS) distances accumulated over all the points:
SSE (within groups)- from the points to the centers. (i=1 to n).
SSG (between groups) - from the centers to the average vector. (i=1 to n).
SST (total) - from the points to the average vector. (i=1 to n).
SST = SSE + SSG.
n - number of points.

K-means clustering algorithm

The cluster analysis calculator use the k-means algorithm:
The users chooses k, the number of clusters
1. Choose randomly k centers from the list.
2. Assign each point to the closest center.
3. Calculate the center of each cluster, as the average of all the points in the cluster.
4. Calculate SSE.
5. Repeat this process until SSE doesn't go down, or exceeding a predefined maximum (Max Loops) Since we start from a random centers, sometimes the algorithm won't get the best result, hence we repeat the above algorithm a predefined times, and choose the lowest accumulate square distances.

K-means online

K - the number of required clusters.
Max Loops - maximum loops when starting from one random centers, we don't expect to get to this number.
Repeats - the number of times we repeat the algorithm starting each time starting from random centers
Scaling - None - no scaling.
Standardized - for every x_i value use (x_i - x̄)/S.
S is the sample standard deviation.

Outliers

Calculate outlier - when switched on, that calculator will calculate the Silhouette score for every point and assign -1 to any outlier, and assign different shape for the outlier. It will not recalculate the clusters
Remove outlier - filter out the outliers and recalculate the clusters based on the new data. It will not try to calculate outliers on the filtered data
Outlier threshold - the default is zero, but you may use a different value.Every Silhouette score that is smaller than the threshold, will be an outlier

Silhouette method

The tool use the Silhouette method to identify outliers.
The Silhouette score measures how close each point is to his cluster and how far it is from the closest cluster.
For each point i:
a_i - the average distance of point i from all his cluster's points.
b_i - the average distance of point i from all the points in the closest cluster.

S_i =	b_i - a_i
	Max(a_i, b_i)

The Silhouette range is [-1,1].
High Silhouette value - the point is close to his cluster, and far from other clusters.
Zero Silhouette value - moving the point to the closest cluster will not have a big change on the SSE.
Negative Silhouette value - the point may be an outlier, or was assigned to the wrong cluster.

Why is it important to scale the data?

If all the dimensions have the same units, there is no need to scale the data before performing the clustering process, unless you choose to have different weights for the dimensions.
But if you use dimensions with different units of measures, you must scale the dimensions. Otherwise, the weights of the dimensions will spread unintentionally between the dimensions, depend on the units.
For example, you want to classify football players base on their weight and height. If you use kilograms and centimeters, or kilograms and millimeters, you get different results.
When you choose the millimeter unit, mainly the height will influence the result, and the result will probably be the same as clustering based on only the height dimension.

Ratio of variance explained

The ratio of variance explained is the ratio between the sum of squares between the group and the total sum of squares.

Ratio of variance explained =	SSG
	SST

The ratio range is [0,1]. Zero if you have only one cluster, and one when each cluster contains only one point.
Bigger ratio is better but it comes with the price of more clusters.
For example, if we have three police stations, we prefer most of distances will be between the stations and less between the houses and the closest station.

Elbow curve

If you don't know the number of clusters (k), you may use the elbow curve.
The elbow method calculates the ratio of variance explained per each k, and draw a chart of the ratio per k.
The "elbow" is the point when increasing k by one will not increase much the ratio of variance explained.

There is not always a clear "elbow". When k=0, the calculator draws the elbow curve, and chooses k as the smallest point that reaches the ratio of at least 0.9.

Histogram maker

Cluster analysis

Results