Iclust

Background
Input format
Setting the parameters
Output
Run time and complexity issues
Source code

Background

Iclust is an information-theoretic cluster analysis framework, originally published in Slonim et al., PNAS, 2005. It defines a general, model-independent clustering procedure that can be applied to various types of data. The Iclust implementation on this web page is an efficient version of the algorithm described in Slonim et al., 2005. The code was written by Olivier Elemento and Noam Slonim. The web implementation and design were done by Mark Schroeder, John Matese, Gasper Tkacik, and Noam Slonim.

Input format

The input file should contain "raw" data patterns (e.g., gene expression profiles associated with genes) delimited by tabs or by one or more spaces, where each row holds the pattern associated with one data item. The input may begin with any number of rows containing labels, and each row may begin with any number of identifiers; however, the data patterns to cluster must begin at the same column in every row. A data pattern is expressed as a list of numbers, with the literal "NAN" representing missing values.

The data items to cluster are extracted from the input file by first determining the most likely delimiter. Leading rows and columns that appear to contain identifiers are then excluded, and the identifiers are recorded for later inclusion in the results. If a non-numeric, non-empty field is encountered where a data value is expected, the entire row or column is assumed not to contain valid data patterns and is typically excluded. Empty fields are replaced with NAN.
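The extraction steps above can be approximated in a few lines of Python. This is only a sketch under simplifying assumptions, not the code used by the web tool: the delimiter rule (tabs if any are present, otherwise runs of spaces), the label-row rule, and the function names `to_number`, `leading_identifiers`, and `extract_patterns` are all hypothetical.

```python
import math

def to_number(field):
    """Float for a numeric field, NaN for an empty or "NAN" field, None otherwise."""
    field = field.strip()
    if field == "":
        return math.nan
    try:
        return float(field)  # float("NAN") is NaN in Python
    except ValueError:
        return None

def leading_identifiers(row):
    """Count the fields at the start of a row that are neither numeric nor empty."""
    count = 0
    for field in row:
        if to_number(field) is not None:
            break
        count += 1
    return count

def extract_patterns(text):
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Assumed delimiter rule: tabs if any line contains one, otherwise runs of spaces.
    use_tabs = any("\t" in ln for ln in lines)
    rows = [ln.split("\t") if use_tabs else ln.split() for ln in lines]

    # Assumed label-row rule: skip leading rows that hold no numeric value.
    def has_number(row):
        values = (to_number(f) for f in row)
        return any(v is not None and not math.isnan(v) for v in values)

    first = next((i for i, r in enumerate(rows) if has_number(r)), len(rows))
    body = rows[first:]

    # Patterns begin at the same column in every row: use the widest identifier prefix.
    start = max((leading_identifiers(r) for r in body), default=0)

    identifiers, patterns = [], []
    for row in body:
        values = [to_number(f) for f in row[start:]]
        if not values or any(v is None for v in values):
            continue  # non-numeric field inside the pattern: drop the row
        identifiers.append(row[:start])
        patterns.append(values)
    return identifiers, patterns
```

For a small tab-delimited table with one label row and two identifier columns, `extract_patterns` returns the identifiers and the numeric patterns separately, with empty fields and `NAN` both mapped to NaN.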

In this manner most common data formats (raw, pcl, cdt, etc.) are supported. Some known input formats, identified by the input file extension, will trigger additional filtering (such as excluding the GWEIGHT column and the EWEIGHT row).

The input may contain, instead of raw data patterns, pre-evaluated pairwise similarity relations. In this case, the EstS step (which estimates the pairwise relations from raw patterns) may be skipped through the advanced menu. The data must then comprise a square, symmetric matrix that contains only the numeric relations, where diagonal entries are assumed to be maximal in their respective rows/columns and missing values are not allowed.
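Before uploading a precomputed similarity matrix, it may help to check these requirements locally. The following is a minimal sketch; the function name `check_similarity_matrix` and the tolerance `tol` are assumptions for illustration and are not part of Iclust.

```python
import math

def check_similarity_matrix(s, tol=1e-9):
    """Return a list of problems found in a similarity matrix given as a list of rows."""
    n = len(s)
    if n == 0 or any(len(row) != n for row in s):
        return ["matrix is not square"]
    problems = []
    for i in range(n):
        for j in range(n):
            if math.isnan(s[i][j]):
                problems.append(f"missing value at ({i}, {j})")
            elif j > i and abs(s[i][j] - s[j][i]) > tol:
                problems.append(f"asymmetric entries at ({i}, {j})")
        # With a symmetric matrix, a row-wise check also covers the column.
        if not math.isnan(s[i][i]) and any(v > s[i][i] + tol for v in s[i]):
            problems.append(f"diagonal entry {i} is not maximal in its row")
    return problems
```

An empty list means the matrix satisfies the four conditions stated above (square, symmetric, no missing values, maximal diagonal) up to the chosen tolerance.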

Setting the parameters

The simplest option is to run Iclust directly over a file that includes "raw" data patterns; a gene expression file in pcl format is a good example. In this case, Iclust will cluster the data based on pairwise similarity relations that are automatically evaluated as part of the process. By default, we use the (empirical) mutual-information relations, multiplied by the sign of the Pearson correlation.
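To make the default relation concrete, here is a rough sketch of a signed mutual-information score between two patterns. It uses a naive plug-in estimate over equal-population bins, which is an assumption made for illustration and not the estimator used by Iclust; the bin count `n_bins` and all function names are hypothetical, and the patterns are assumed to contain no missing values.

```python
import math
from collections import Counter

def rank_bins(values, n_bins):
    """Assign each value to one of n_bins roughly equal-population bins by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

def plugin_mi(x, y, n_bins=3):
    """Naive plug-in mutual information estimate between x and y, in bits."""
    n = len(x)
    bx, by = rank_bins(x, n_bins), rank_bins(y, n_bins)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    return sum((c / n) * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())

def pearson_sign(x, y):
    """Sign of the Pearson correlation, which equals the sign of the covariance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return (cov > 0) - (cov < 0)

def signed_mi(x, y, n_bins=3):
    """Mutual information estimate multiplied by the sign of the Pearson correlation."""
    return pearson_sign(x, y) * plugin_mi(x, y, n_bins)
```

The sign lets the relation distinguish correlated from anti-correlated patterns, which mutual information alone cannot do: two patterns that are exact mirror images get the same magnitude as two identical ones, but with a negative sign.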

However, you can specify other similarity options using the advanced menu. If your input file already includes pairwise similarity relations, this needs to be specified in the advanced menu, and in that case Iclust is applied directly over these relations.

Many defaults may be overridden in the advanced menu. For example, the number of clusters defaults to the square root of the number of data items: if your data consist of 10,000 genes, the resulting partition will include 100 clusters.
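The default cluster count can be reproduced as follows. How Iclust handles item counts that are not perfect squares is not stated here, so rounding to the nearest integer is an assumption, as is the function name.

```python
import math

def default_num_clusters(n_items):
    """Square root of the number of data items, rounded to the nearest integer (assumed)."""
    return max(1, round(math.sqrt(n_items)))

print(default_num_clusters(10_000))  # 100, matching the example in the text
```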

Output

Except where affected by changes in the advanced options, the resulting output includes:

Pairwise relations (affinity) matrix produced by EstS.
Clustering partition of the rows, sorted by the clusters.
Clustering partition of the rows, sorted by the rows.
Clustering statistics.
Source data organized by partition with headers, row numbers, and partition numbers, and an associated heatmap.
Centroids (arithmetic means) of source data within cluster partitions organized by partition with headers, row numbers, and partition numbers, with an associated heatmap.
Various miscellaneous files (such as full operating parameters and raw data) generated during the data extraction, estimation, and clustering processes.

These files can be used by many other tools, for example to analyze the association between the obtained clusters and outside annotations (like Gene Ontology).
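As an illustration of how the centroid output relates to the partition, the sketch below computes the arithmetic mean of the source patterns within each cluster. Skipping missing values when averaging is an assumption, and `cluster_centroids` is a hypothetical name rather than part of the Iclust code.

```python
import math
from collections import defaultdict

def cluster_centroids(patterns, labels):
    """Map each cluster label to the column-wise mean of its member patterns."""
    members = defaultdict(list)
    for row, label in zip(patterns, labels):
        members[label].append(row)

    centroids = {}
    for label, rows in members.items():
        centroid = []
        for column in zip(*rows):
            # Assumption: missing values (NaN) are ignored when averaging.
            present = [v for v in column if not math.isnan(v)]
            centroid.append(sum(present) / len(present) if present else math.nan)
        centroids[label] = centroid
    return centroids
```

Here `patterns` is a list of equal-length numeric rows and `labels` gives the cluster number of each row, as in the partition files listed above.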

Run time and complexity issues

The current implementation will run over matrices that contain between 3 and 20,000 data items (rows), and at least 5 features (columns). The product of the square of the number of rows and the number of columns must be less than 10e10.
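A quick pre-submission check of these limits might look like the sketch below. The bound is taken literally as written, `10e10` (that is, 10^11); if 10^10 was intended, pass a different `max_work`. The function and parameter names are hypothetical.

```python
def within_limits(n_rows, n_cols, max_work=10e10):
    """Check the stated size limits: 3 to 20,000 rows, at least 5 columns,
    and rows^2 * columns strictly below max_work."""
    return (3 <= n_rows <= 20_000
            and n_cols >= 5
            and n_rows ** 2 * n_cols < max_work)

print(within_limits(10_000, 100))  # True: 10,000^2 * 100 = 1e10
```

The product of the squared row count and the column count reflects the cost of evaluating a relation for every pair of rows across all columns, which is why wide matrices reduce the number of rows that can be processed.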