Big files: XML or JSON ? TeeBI !!

TeeBI Dashboards

 

Introduction

XML and JSON are very typical text formats used to store data, designed to be more comfortable than plain old “csv” text and allowing hierarchical (parent -> child) relationships.

However, even if there are many wonderful standard libraries to process them, there is still a speed problem when loading big quantities of data (say, hundreds or thousands of megabytes).

Content has to be parsed and stored into memory structures representing the “tree” nature of data nodes and attributes, which can be very fast (milliseconds) for small files, but terribly slow (minutes !) for big ones.

TeeBI core base class (TDataItem) is an “agnostic” memory structure providing parent -> child connections, using simple arrays to store data (one array per field or column).

Pseudo-code:


TDataItem = class
Name : String;
Items : Array of TDataItem;   // <--- Children
Kind : TDataKind;  // <-- Integer, String, Float, DateTime, Boolean, or "List of TDataItem"
Data : Array of ...    //  <--  one array for each Kind: "Int32Data : Array of Int32"
end

 

With a TDataItem, loading and saving big quantities of data is insanely fast (0.2 seconds to load or save 1 million rows with 4 columns on a normal PC).

The arrays are saved / loaded to / from Streams directly in one Write / Read operation.

That means we can import data from XML or JSON (or any other format like database datasets, server connections, Excel, etc, etc) into a TDataItem and then save it to a binary TeeBI file for later reuse.


Data := TDataItemPersistence.Load( 'my data.bi ')

 

Once a TDataItem is created or loaded, we can use it in many ways:

  • Search and modify data, re-structure columns
  • Sort data by multiple fields, and by custom expressions
  • Run ultra-fast SQL-like queries and summaries against TDataItems
  • Set master -> detail relationships between different TDataItems
  • Filter rows by code or using expressions (as strings or as TExpression classes)
  • Create calculated columns (using code or expressions)
  • Merge several TDataItems
  • Compare the structure and / or full data of TDataItems to obtain difference sets
  • Present TDataItems using Grids, Charts, Dashboards and PDF output
  • Connect TDataItems to a super-fast TBIDataset (a normal memory TDataset class)
  • Export to any other format (for example XML to JSON and vice-versa)
  • Access remote TDataItems from web servers transparently
  • Apply machine-learning algorithms using R or Python Scikit-learn
  • Access basic statistics of any TDataItem or child item

 

Note to TeeChart developers:

TeeBI includes a new TBIChart control (derived from TChart) that is capable of automatically creating new chart series and fill them from any TDataItem.

BIChart1.Data := MyDataItem;

A planned new feature is to integrate the Data property editor dialog inside the normal TeeChart editor, for design-time support (zero code charting !)

 

TeeBI library is available for download at the following link:

https://github.com/Steema/BI

Supported development environments:

  • Embarcadero Studio (Delphi, C++) from XE4 version and up
  • Lazarus FreePascal
  • …and soon for Microsoft Visual Studio .NET

Several 3rd party products can be optionally used with TeeBI:

https://github.com/Steema/BI/wiki/3rd-party-supported-products

 

For more information:

Please visit the TeeBI community at Google+ and the TeeBI home website for more information and technical details.

 

 

TeeBI Web Server returning a summary chart

TeeBI teaser

Small teaser of TeeBI.

This is the custom web server calculating a query and returning a chart. The only dependencies are TeeChart and Indy http web server component.

TeeBI Web Server returning a summary chart

 

Data is in column-oriented, memory-based custom format for speed reasons, no database or FireDac, etc are necessary. Data is imported from many sources (excel, text files, firedac or any other datasets, etc) to the custom format, which is then transparent to the user apps.

The web server is optional if data is not remote, and it can also return binary streams to normal desktop (vcl or fmx), or mobile apps (fmx).

A table of 2000×5 cells takes 180msec via web. For visual display, an fmx or vcl dbgrid easily handles 1 or more million rows with a custom “TBIDataset” linked to an in-memory data, allowing grouping, filtering and sorting using complex expressions involving columns from the same table or foreign-key(s) fields.

This speed is the basis for a next-coming set of machine-learning and data-mining algorithms, currently under development.

 

Clustering visualization

TeeChart Pro includes classes and components to perform “clustering” on your data, and optionally visualize the results using a chart “Tool” component.

Clustering is the process of grouping data automatically, according to how well related are the individual items.

As an unsupervised algorithm, its widely used in data-mining / machine-learning / B.I. (Business Intelligence) applications.

 

teechart_clustering

For more information on clustering visit the following Wikipedia link: http://en.wikipedia.org/wiki/Cluster_analysis

An executable example:

http://www.steema.us/files/public/teechart/vcl/demos/clustering/TeeChart_Clustering.zip

The clustering algorithm can be processed on custom data, not necessarily on TeeChart “Series” data.

 

Classes and Units

The TeeClustering.pas unit, for both VCL and Firemonkey, contains abstract “engine” classes that perform the clustering algorithms.

Three different clustering methods are provided:

• TKMeansClustering

• THierarchicalClustering

• TQTClustering (Quality Threshold)

These classes derive from a common abstract class: TBaseClustering.

Each clustering method has its own properties that determine how will the clusters be calculated. After calculating, you can access the Clusters property, which is a TList of TCluster objects.

A TCluster contains child clusters (Items[ ]), so you can check which input data items belong to which cluster, or in the case of the Hierarchical type, access the tree structure (clusters and sub-clusters).

The input data (your data) is not contained by the above classes.

Data is passed to the clustering engine through a “provider” class. There is currently one kind of data provider (TSeriesProvider) to cluster XY or XYZ Series points.

This class is implemented in the TeeClusteringTool.pas unit, together with a charting Tool class (TClusteringTool) to make things easier and more automatic.

 

Basic Example

Example runtime code (it can be done at design-time too, without coding) :

uses TeeClusteringTool;

var tool : TClusteringTool;

tool:=TClusteringTool.Create(Self);

tool.ParentChart:=Chart1;

tool.Series:=Series1; // your series

tool.Method:=cmKMeans;

tool.KMeans.NumClusters:=5;

tool.Execute;

 

After execution, you can loop on the resulting output clusters, for example:

var t : Integer;

for t:=0 to tool.Clusters.Count-1 do
Memo1.Lines.Add( ‘Cluster: ‘+IntToStr(t)+’ contains:  ‘+
IntToStr(tool.Clusters[t].Count)+’ points’ );

 

 

TClusteringTool

This tool automatically performs clustering using the choosen method and parameters, and optionally paints each source series point with a different color indicating which cluster they belong to, and/or draws polygons around each group of cluster’s items, among other things.

Properties:

ClusteringTool1.Method := cmHierarchical;

ClusteringTool1.ColorEach := True; // paint Series with one color per cluster

ClusteringTool1.ShowBounds := True; // draws convex polygons bounding each cluster points

ClusteringTool1.Centers.Visible := True; // shows cluster centers

ClusteringTool1.Centroids.Visible := True; // shows cluster centroids

Other properties include Brush, Pen and Transparency, used when drawing cluster polygon boundaries.

Methods:

Several helper methods are provided:

// Obtain cluster’s center and centroid XY points in Series scales:

var P : TPointFloat;

P:=ClusteringTool1.GetClusterCenter( ClusteringTool1.Clusters[3] );

P:=ClusteringTool1.GetClusterCentroid( ClusteringTool1.Clusters[2] );

// Obtain an array of XY points (in screen pixel coordinates), that belong to cluster:

var PP : TPointArray;

ClusteringTool1.GetClusterPoints( ClusteringTool1.Clusters[4], PP);

PP:=nil;

// Get cluster statistics:

var S : TClusterStats;

S:=ClusteringTool1.GetStats( ClusteringTool1.Clusters[0] );

 

Calculation parameters

Each clustering algorithm needs different parameters:

K-Means:

ClusteringTool1.KMeans.NumClusters := 10; // Number of minimum clusters (“K”)

ClusteringTool1.KMeans.MaxIterations := 1000; // Maximum number of iterations before stopping

Hierarchical:

ClusteringTool1.Hierarchical.NumClusters := 8; // Number of tree root clusters

QT:

ClusteringTool1.QTClustering.MinCount := 30; // Minimum number of points to form a cluster

ClusteringTool1.QTClustering.MaxDiameter := 100; // Maximum “diameter” a cluster can grow

 

Common parameters:

Distance

Cluster calculation is based on the “distance” between a data item and the other data items. There are several ways to calculate the “distance” between items.

The algorithms are agnostic, they call the Provider (ie: Series provider) to obtain the distances.

For example, on a XY scatter plot, the distance between points can be for example the hypotenuse (Pythagoras’ theorem), that is, the simple Euclidean distance between a point XY and another XY.

Distance calculations implemented:

dmEuclidean
dmSquaredEuclidean
dmManhattan
dmMinkowski
dmSorensen
dmChebyshev

 

Example:

ClusteringTool1.Distance := dmMinkowski;

ClusteringTool1.MinkowskiLambda := 4;

 

Linkage

There are several ways to calculate the “distance” between clusters when one or the two clusters have more than one item.

This is called “linkage”.

The most simple way is using each cluster “center” (this means no linkage occurs).

Other linkage styles implemented:

lmSingle

Also called “minimum”.

The distance between cluster A and B is the minimum distance between all items in cluster A and all items in cluster B.

lmComplete

Also called “maximum”.

The distance between cluster A and B is the maximum distance between all items in cluster A and all items in cluster B.

lmAverage

The distance between cluster A and B is the average distance between all items in cluster A and all items in cluster B.

lmWard

The result is the increase on “error sum of squares” when adding cluster B items to cluster A.

 

Calculation speed

Clustering is a slow process by nature. Each clustering method has different performance bottlenecks, proportional to the number of input data items.

The TeeClustering.pas unit has been greatly fine-tuned to optimize the speed of each algorithm, although much work is needed to find more advanced techniques that require less CPU cycles.

The QT Threshold algorithm benefits of parallelism, when multiple CPUs can be used together.

Speed examples (revisited):

(Time in milliseconds, Windows 8.1 x64 on Intel i7 4770 @ 3.4Ghz)

IDE XE8 Delphi, Win32, 5000 data points

Algorithm      Single CPU   Multiple CPU

K-Means              47                     31
Hierarchical    4328                4156
QT                     2859                  703

Notes:

x64 bit executables are a little bit faster than 32bit.

Speed is also very dependant on the “distance” calculation method that is used to compare data.  The default Euclidean calculation has a quite big CPU cost as it calculates the Hypotenuse between two data XY value pairs.