Clustering visualization

TeeChart Pro includes classes and components to perform “clustering” on your data, and optionally visualize the results using a chart “Tool” component.

Clustering is the process of grouping data automatically, according to how closely related the individual items are.

As an unsupervised technique, it is widely used in data-mining, machine-learning and B.I. (Business Intelligence) applications.

For more information on clustering visit the following Wikipedia link: http://en.wikipedia.org/wiki/Cluster_analysis

An executable example:

http://www.steema.us/files/public/teechart/vcl/demos/clustering/TeeChart_Clustering.zip

The clustering algorithms can also be run on custom data, not only on TeeChart “Series” data.

Classes and Units

The TeeClustering.pas unit, for both VCL and Firemonkey, contains abstract “engine” classes that perform the clustering algorithms.

Three different clustering methods are provided:

• TKMeansClustering

• THierarchicalClustering

• TQTClustering (Quality Threshold)

These classes derive from a common abstract class: TBaseClustering.

Each clustering method has its own properties that determine how the clusters are calculated. After calculating, you can access the Clusters property, which is a TList of TCluster objects.

A TCluster contains child clusters (Items[ ]), so you can check which input data items belong to which cluster, or in the case of the Hierarchical type, access the tree structure (clusters and sub-clusters).
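
The tree idea can be illustrated with a small self-contained sketch. It does not use the TeeChart classes: TSketchCluster, PointIndex, Item and LeafCount are hypothetical names invented for this example, and the real TCluster may be organised differently. It only shows how a cluster that holds child clusters can be walked recursively to list which input items end up under which node.

```pascal
program ClusterTreeSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // Hypothetical stand-in for a cluster node; NOT the TeeChart TCluster class.
  TSketchCluster = class
  private
    FItems: array of TSketchCluster;
  public
    PointIndex: Integer; // index of the input item for a leaf, -1 otherwise
    constructor Create(APointIndex: Integer);
    destructor Destroy; override;
    function Add(Child: TSketchCluster): TSketchCluster;
    function Count: Integer;                       // direct children
    function Item(Index: Integer): TSketchCluster; // child by position
    function LeafCount: Integer;                   // input items below this node
  end;

constructor TSketchCluster.Create(APointIndex: Integer);
begin
  inherited Create;
  PointIndex := APointIndex;
end;

destructor TSketchCluster.Destroy;
var
  i: Integer;
begin
  // A node owns its children, so freeing the root frees the whole tree.
  for i := 0 to High(FItems) do
    FItems[i].Free;
  inherited Destroy;
end;

function TSketchCluster.Add(Child: TSketchCluster): TSketchCluster;
begin
  SetLength(FItems, Length(FItems) + 1);
  FItems[High(FItems)] := Child;
  Result := Self; // returning Self allows chained Add calls
end;

function TSketchCluster.Count: Integer;
begin
  Result := Length(FItems);
end;

function TSketchCluster.Item(Index: Integer): TSketchCluster;
begin
  Result := FItems[Index];
end;

function TSketchCluster.LeafCount: Integer;
var
  i: Integer;
begin
  if Count = 0 then
  begin
    Result := 1; // a leaf represents exactly one input item
    Exit;
  end;
  Result := 0;
  for i := 0 to Count - 1 do
    Result := Result + FItems[i].LeafCount;
end;

// Prints the tree, indenting two spaces per nesting level.
procedure PrintTree(Node: TSketchCluster; Level: Integer);
var
  i: Integer;
begin
  if Node.Count = 0 then
    Writeln(StringOfChar(' ', Level * 2), 'item ', Node.PointIndex)
  else
  begin
    Writeln(StringOfChar(' ', Level * 2), 'cluster with ', Node.LeafCount, ' items');
    for i := 0 to Node.Count - 1 do
      PrintTree(Node.Item(i), Level + 1);
  end;
end;

var
  Root, Sub: TSketchCluster;
begin
  Sub := TSketchCluster.Create(-1);
  Sub.Add(TSketchCluster.Create(0)).Add(TSketchCluster.Create(1));
  Root := TSketchCluster.Create(-1);
  Root.Add(Sub).Add(TSketchCluster.Create(2));
  try
    PrintTree(Root, 0);
  finally
    Root.Free;
  end;
end.
```

With the three items above, the sketch prints a root cluster of 3 items that contains a sub-cluster of 2 items (items 0 and 1) plus item 2 on its own.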

The input data (your data) is not contained by the above classes.

Data is passed to the clustering engine through a “provider” class. There is currently one kind of data provider (TSeriesProvider) to cluster XY or XYZ Series points.

This class is implemented in the TeeClusteringTool.pas unit, together with a charting Tool class (TClusteringTool) to make things easier and more automatic.

Basic Example

Example runtime code (this can also be done at design time, without coding):

uses TeeClusteringTool;

var tool : TClusteringTool;

tool:=TClusteringTool.Create(Self);

tool.ParentChart:=Chart1;

tool.Series:=Series1; // your series

tool.Method:=cmKMeans;

tool.KMeans.NumClusters:=5;

tool.Execute;

After execution, you can loop on the resulting output clusters, for example:

var t : Integer;

for t:=0 to tool.Clusters.Count-1 do
  Memo1.Lines.Add( 'Cluster: '+IntToStr(t)+' contains:  '+
    IntToStr(tool.Clusters[t].Count)+' points' );

TClusteringTool

This tool automatically performs clustering using the chosen method and parameters. It can optionally paint each source series point with a color indicating which cluster it belongs to, and/or draw a polygon around each cluster’s items, among other things.

Properties:

ClusteringTool1.Method := cmHierarchical;

ClusteringTool1.ColorEach := True; // paint Series with one color per cluster

ClusteringTool1.ShowBounds := True; // draws a convex polygon bounding each cluster's points

ClusteringTool1.Centers.Visible := True; // shows cluster centers

ClusteringTool1.Centroids.Visible := True; // shows cluster centroids

Other properties include Brush, Pen and Transparency, used when drawing cluster polygon boundaries.

Methods:

Several helper methods are provided:

// Obtain cluster's center and centroid XY points in Series scales:

var P : TPointFloat;

P:=ClusteringTool1.GetClusterCenter( ClusteringTool1.Clusters[3] );

P:=ClusteringTool1.GetClusterCentroid( ClusteringTool1.Clusters[2] );

// Obtain an array of XY points (in screen pixel coordinates) that belong to a cluster:

var PP : TPointArray;

ClusteringTool1.GetClusterPoints( ClusteringTool1.Clusters[4], PP);

// ...

PP:=nil;

// Get cluster statistics:

var S : TClusterStats;

S:=ClusteringTool1.GetStats( ClusteringTool1.Clusters[0] );

Calculation parameters

Each clustering algorithm needs different parameters:

K-Means:

ClusteringTool1.KMeans.NumClusters := 10; // Number of minimum clusters ("K")

ClusteringTool1.KMeans.MaxIterations := 1000; // Maximum number of iterations before stopping
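
To show what these two parameters control, here is a minimal textbook K-Means loop written as a standalone program. It is only a sketch under stated assumptions: it does not reproduce the TeeClustering.pas implementation, the constants NumClusters and MaxIterations merely mirror the property names above, and TSketchPoint, Data, Owner and the naive seeding (the first K points) are choices made for this example.

```pascal
program KMeansSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

const
  NumPoints = 6;
  NumClusters = 2;     // "K": how many clusters to look for
  MaxIterations = 100; // upper limit in case assignments keep changing

type
  // Hypothetical point type for this sketch only.
  TSketchPoint = record
    X, Y: Double;
  end;

const
  Data: array[0..NumPoints - 1] of TSketchPoint = (
    (X: 1.0; Y: 1.0), (X: 1.5; Y: 2.0), (X: 1.0; Y: 0.5),
    (X: 8.0; Y: 8.0), (X: 9.0; Y: 9.5), (X: 8.5; Y: 9.0));

// Squared Euclidean distance; enough for "which center is nearest".
function SqrDist(const A, B: TSketchPoint): Double;
begin
  Result := Sqr(A.X - B.X) + Sqr(A.Y - B.Y);
end;

var
  Centers: array[0..NumClusters - 1] of TSketchPoint;
  SumX, SumY: array[0..NumClusters - 1] of Double;
  Members: array[0..NumClusters - 1] of Integer;
  Owner: array[0..NumPoints - 1] of Integer; // cluster index of each point
  Iter, i, c, Best: Integer;
  D, BestD: Double;
  Changed: Boolean;
begin
  // Naive seeding (an assumption of this sketch): use the first K points.
  for c := 0 to NumClusters - 1 do
    Centers[c] := Data[c];
  for i := 0 to NumPoints - 1 do
    Owner[i] := -1;

  for Iter := 1 to MaxIterations do
  begin
    // Assignment step: attach every point to its nearest center.
    Changed := False;
    for i := 0 to NumPoints - 1 do
    begin
      Best := 0;
      BestD := SqrDist(Data[i], Centers[0]);
      for c := 1 to NumClusters - 1 do
      begin
        D := SqrDist(Data[i], Centers[c]);
        if D < BestD then
        begin
          BestD := D;
          Best := c;
        end;
      end;
      if Owner[i] <> Best then
      begin
        Owner[i] := Best;
        Changed := True;
      end;
    end;

    // Converged: no point changed cluster during this pass.
    if not Changed then
      Break;

    // Update step: move each center to the mean of its members.
    for c := 0 to NumClusters - 1 do
    begin
      SumX[c] := 0;
      SumY[c] := 0;
      Members[c] := 0;
    end;
    for i := 0 to NumPoints - 1 do
    begin
      SumX[Owner[i]] := SumX[Owner[i]] + Data[i].X;
      SumY[Owner[i]] := SumY[Owner[i]] + Data[i].Y;
      Inc(Members[Owner[i]]);
    end;
    for c := 0 to NumClusters - 1 do
      if Members[c] > 0 then // an empty cluster keeps its previous center
      begin
        Centers[c].X := SumX[c] / Members[c];
        Centers[c].Y := SumY[c] / Members[c];
      end;
  end;

  for i := 0 to NumPoints - 1 do
    Writeln('Point ', i, ' -> cluster ', Owner[i]);
end.
```

On this tiny data set the loop settles after a few passes, with the first three points in one cluster and the last three in the other. The iteration limit only matters when the assignments never stabilise.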

Hierarchical:

ClusteringTool1.Hierarchical.NumClusters := 8; // Number of tree root clusters

QT:

ClusteringTool1.QTClustering.MinCount := 30; // Minimum number of points to form a cluster

ClusteringTool1.QTClustering.MaxDiameter := 100; // Maximum "diameter" a cluster can grow

Common parameters:

Distance

Cluster calculation is based on the “distance” between a data item and the other data items. There are several ways to calculate the “distance” between items.

The algorithms are agnostic about how distance is measured: they call the provider (i.e. the Series provider) to obtain the distances.

For example, on an XY scatter plot the distance between two points can be the hypotenuse (Pythagoras’ theorem), that is, the simple Euclidean distance between one XY point and another.

Distance calculations implemented:

dmEuclidean
dmSquaredEuclidean
dmManhattan
dmMinkowski
dmSorensen
dmChebyshev

Example:

ClusteringTool1.Distance := dmMinkowski;

ClusteringTool1.MinkowskiLambda := 4;
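
For reference, most of the distance measures listed above have standard textbook definitions, sketched below for two XY points. This is a standalone illustration, not the library's code: TSketchPoint, MakePoint and the function names are invented here, and the Lambda parameter is only assumed to play the same role as MinkowskiLambda. dmSorensen is left out because the article does not say which formula the library uses for it.

```pascal
program DistanceSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Math;

type
  // Hypothetical point type for this sketch; not a TeeChart type.
  TSketchPoint = record
    X, Y: Double;
  end;

function MakePoint(X, Y: Double): TSketchPoint;
begin
  Result.X := X;
  Result.Y := Y;
end;

// Straight-line distance (the hypotenuse).
function Euclidean(const A, B: TSketchPoint): Double;
begin
  Result := Sqrt(Sqr(A.X - B.X) + Sqr(A.Y - B.Y));
end;

// Same ordering as Euclidean, without the square root.
function SquaredEuclidean(const A, B: TSketchPoint): Double;
begin
  Result := Sqr(A.X - B.X) + Sqr(A.Y - B.Y);
end;

// Sum of the absolute differences along each axis.
function Manhattan(const A, B: TSketchPoint): Double;
begin
  Result := Abs(A.X - B.X) + Abs(A.Y - B.Y);
end;

// Largest absolute difference along any single axis.
function Chebyshev(const A, B: TSketchPoint): Double;
begin
  Result := Max(Abs(A.X - B.X), Abs(A.Y - B.Y));
end;

// Generalised form: Lambda = 1 gives Manhattan, Lambda = 2 gives Euclidean.
// Lambda is expected to be >= 1.
function Minkowski(const A, B: TSketchPoint; Lambda: Double): Double;
begin
  Result := Power(Power(Abs(A.X - B.X), Lambda) +
                  Power(Abs(A.Y - B.Y), Lambda), 1 / Lambda);
end;

var
  A, B: TSketchPoint;
begin
  A := MakePoint(0, 0);
  B := MakePoint(3, 4);
  Writeln('Euclidean:          ', Euclidean(A, B):0:4);        // 5.0000
  Writeln('Squared Euclidean:  ', SquaredEuclidean(A, B):0:4); // 25.0000
  Writeln('Manhattan:          ', Manhattan(A, B):0:4);        // 7.0000
  Writeln('Chebyshev:          ', Chebyshev(A, B):0:4);        // 4.0000
  Writeln('Minkowski (4):      ', Minkowski(A, B, 4):0:4);
end.
```

As Lambda grows, the Minkowski result moves from the Manhattan value towards the Chebyshev value, which is why a single parameter can cover a family of measures.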

Linkage

There are several ways to calculate the “distance” between clusters when one or both clusters have more than one item.

This is called “linkage”.

The simplest way is to use each cluster’s “center” (this means no linkage occurs).
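
As an illustration of the difference, the sketch below compares the center-to-center distance with three common textbook linkage rules (single, complete and average). The article does not list which linkage options the library offers, so every name here (TSketchPoint, SingleLinkage, CompleteLinkage, AverageLinkage, CenterDistance) is hypothetical, and the "center" is assumed to be the mean of a cluster's points.

```pascal
program LinkageSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Math;

type
  // Hypothetical point type for this sketch only.
  TSketchPoint = record
    X, Y: Double;
  end;

const
  ClusterA: array[0..1] of TSketchPoint = ((X: 0.0; Y: 0.0), (X: 1.0; Y: 0.0));
  ClusterB: array[0..1] of TSketchPoint = ((X: 4.0; Y: 0.0), (X: 4.0; Y: 3.0));

function Dist(const P, Q: TSketchPoint): Double;
begin
  Result := Sqrt(Sqr(P.X - Q.X) + Sqr(P.Y - Q.Y));
end;

// Single linkage: distance between the two closest members.
function SingleLinkage(const A, B: array of TSketchPoint): Double;
var
  i, j: Integer;
begin
  Result := Dist(A[0], B[0]);
  for i := 0 to High(A) do
    for j := 0 to High(B) do
      Result := Min(Result, Dist(A[i], B[j]));
end;

// Complete linkage: distance between the two farthest members.
function CompleteLinkage(const A, B: array of TSketchPoint): Double;
var
  i, j: Integer;
begin
  Result := Dist(A[0], B[0]);
  for i := 0 to High(A) do
    for j := 0 to High(B) do
      Result := Max(Result, Dist(A[i], B[j]));
end;

// Average linkage: mean of all pairwise distances.
function AverageLinkage(const A, B: array of TSketchPoint): Double;
var
  i, j: Integer;
begin
  Result := 0;
  for i := 0 to High(A) do
    for j := 0 to High(B) do
      Result := Result + Dist(A[i], B[j]);
  Result := Result / (Length(A) * Length(B));
end;

// Mean of a cluster's points (assumed meaning of "center" in this sketch).
function Center(const P: array of TSketchPoint): TSketchPoint;
var
  i: Integer;
begin
  Result.X := 0;
  Result.Y := 0;
  for i := 0 to High(P) do
  begin
    Result.X := Result.X + P[i].X;
    Result.Y := Result.Y + P[i].Y;
  end;
  Result.X := Result.X / Length(P);
  Result.Y := Result.Y / Length(P);
end;

// No linkage: compare the two cluster centers directly.
function CenterDistance(const A, B: array of TSketchPoint): Double;
begin
  Result := Dist(Center(A), Center(B));
end;

begin
  Writeln('Center distance:  ', CenterDistance(ClusterA, ClusterB):0:4);
  Writeln('Single linkage:   ', SingleLinkage(ClusterA, ClusterB):0:4);
  Writeln('Complete linkage: ', CompleteLinkage(ClusterA, ClusterB):0:4);
  Writeln('Average linkage:  ', AverageLinkage(ClusterA, ClusterB):0:4);
end.
```

The same two clusters give four different distances, so the rule in use can change which clusters are considered closest and therefore which ones are merged first.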

4 thoughts to “Clustering visualization”

Jan Lutgert says:

August 24, 2015 at 10:02 am

Very nice implementation of this very useful multivariate data analysis technique. With its use of XY (2D) or XYZ (3D) data the current implementation is great for the analysis of spatial / GIS type problems. However with the amount of work already done it would be relative easy to transform the current implementation into a truly multivariate analysis system. I would guess that currently 90-95% of the work required is already done. Distance calculations would need to be extended to a generic N-dimensional approach. And to be able to compare measurements that are of (vastly) different scales one or more normalization methods (i.e. z-scores) would be required. In the past I wrote a similar K-Means clustering program in Delphi for geo-scientific applications. If people are interested I am willing to share what I know (including code & executable) on the subject.

david berneda says:

August 24, 2015 at 1:32 pm

Hi Jan, thanks a lot for your comments. We have a new version in the works, that works with N-dimensions. The XY and XYZ limit is because the current code was oriented to display the data. We can send you the full code and demos if you want to take a look. Normalization is also a planned feature.
regards !
david

Marc Meumann says:

January 23, 2020 at 1:47 pm

The Clustering demo code is included with the TeeChart Feature Demo:

https://github.com/Steema/TeeChart-VCL-samples/tree/master/TeeNew

Robin Rai says:

June 9, 2021 at 7:28 am

Impressive article on clustering visualization… was of great help keep posting


Algorithm      Single CPU   Multiple CPU
K-Means                47             31
Hierarchical         4328           4156
QT                   2859            703