Iclust

Background

Iclust is an information-theoretic cluster analysis framework, originally published in Slonim et al., PNAS, 2005. It defines a general, model-independent clustering procedure that can be applied to various types of data. The Iclust implementation on this web page is an efficient version of the algorithm described in that paper. The relevant code was written by Olivier Elemento and Noam Slonim. The web implementation and design were done by Mark Schroeder, John Matese, Gasper Tkacik, and Noam Slonim.

Input format

The input file should contain "raw" data patterns (e.g., gene expression profiles associated with genes) delimited by tabs or by one or more spaces, where each row holds the pattern associated with one data item. The input may begin with any number of rows containing labels, and each row may begin with any number of identifiers. However, the data patterns to cluster must begin at the same column in every row. A data pattern is expressed as a list of numbers, with the literal "NAN" representing a missing value.

The data items to cluster are extracted from the input file by first determining the most likely delimiter. Then leading rows and columns that appear to contain identifiers are excluded, with the identifiers recorded for later inclusion in the results. If a non-numeric, non-empty field is encountered where a data item is expected, the entire row or column is assumed to not contain valid data patterns and is typically excluded. Empty fields are replaced with NAN.
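The extraction steps described above can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the extractor used by the web service: the function names (split_line, extract_patterns) are hypothetical, the delimiter guess is reduced to "tabs if present, otherwise spaces", and the first data column is guessed from the last row alone.

```python
import math

def is_number(field):
    """True if the field parses as a float; the literal NAN parses as well."""
    try:
        float(field)
        return True
    except ValueError:
        return False

def split_line(line):
    """Guess the delimiter: tabs if the line has any, otherwise runs of spaces."""
    if "\t" in line:
        return [f.strip() for f in line.split("\t")]
    return line.split()

def extract_patterns(text):
    """Return (label_rows, identifiers, data) from raw input text."""
    rows = [split_line(line) for line in text.splitlines() if line.strip()]
    width = max(len(r) for r in rows)
    rows = [r + [""] * (width - len(r)) for r in rows]  # pad short rows

    def is_data(field):
        return field == "" or is_number(field)

    # Assumption: the last row is a data row, so its trailing run of
    # numeric/empty fields tells us where the data patterns begin.
    first_col = width
    while first_col > 0 and is_data(rows[-1][first_col - 1]):
        first_col -= 1

    labels, ids, data = [], [], []
    for row in rows:
        fields = row[first_col:]
        if not all(is_data(f) for f in fields):
            if not data:
                labels.append(row)  # leading label row, kept for the results
            continue                # any other invalid row is excluded
        ids.append(row[:first_col])
        # Empty fields are replaced with NAN.
        data.append([float(f) if f else math.nan for f in fields])
    return labels, ids, data

sample = "ID\tNAME\tt1\tt2\tt3\ng1\tfoo\t1.0\t\t3.5\ng2\tbar\tNAN\t2.0\t4.0\n"
labels, ids, data = extract_patterns(sample)
```

In this example the first row is recorded as a label row, the first two columns are recorded as identifiers, and the empty field in the g1 row becomes NAN.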

In this manner most common data formats (raw, pcl, cdt, etc.) are supported. Some known input formats, identified by the input file extension, will trigger additional filtering (such as excluding the GWEIGHT column and the EWEIGHT row).

The input may contain, instead of raw data patterns, pre-evaluated pairwise similarity relations. In this case, the EstS step may be skipped through the advanced menu. The data must then comprise a square, symmetric matrix that contains only the numeric relations, where diagonal entries are assumed to be maximal in their respective rows/columns and missing values are not allowed.
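Before uploading pre-evaluated similarities, it may help to check the matrix against the requirements listed above. The following is a minimal sketch; the function name check_similarity_matrix and the tolerance parameter are illustrative choices, not part of Iclust.

```python
import math

def check_similarity_matrix(s, tol=1e-9):
    """Return a list of problems found in a pairwise similarity matrix.

    s is a list of rows (lists of floats). An empty list means the matrix
    is square, symmetric, free of missing values, and has a maximal
    diagonal entry in every row.
    """
    n = len(s)
    if any(len(row) != n for row in s):
        return ["matrix is not square"]

    problems = []
    for i in range(n):
        for j in range(n):
            if math.isnan(s[i][j]):
                problems.append(f"missing value at ({i}, {j})")
    if problems:
        return problems  # the remaining checks are meaningless with NaN present

    for i in range(n):
        for j in range(i + 1, n):
            if abs(s[i][j] - s[j][i]) > tol:
                problems.append(f"not symmetric at ({i}, {j})")
        if s[i][i] < max(s[i]) - tol:
            problems.append(f"diagonal entry {i} is not maximal in its row")
    return problems

good = [[1.0, 0.2, 0.1],
        [0.2, 1.0, 0.4],
        [0.1, 0.4, 1.0]]
bad = [[1.0, 0.2],
       [0.3, 0.1]]
good_report = check_similarity_matrix(good)
bad_report = check_similarity_matrix(bad)
```

Because the matrix is required to be symmetric, checking each diagonal entry against its row also covers its column.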

Setting the parameters

The simplest option is to run Iclust directly over a file that includes "raw" data patterns; a gene expression file in pcl format is a good example. In this case, Iclust will cluster the data based on pairwise similarity relations that are automatically evaluated as part of the process. By default, we use the (empirical) mutual-information relations, multiplied by the sign of the Pearson correlation.
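To make the default relation concrete, here is a small sketch of a mutual-information value multiplied by the sign of the Pearson correlation. The estimator is an assumption made for illustration: it uses simple equal-width binning with a hypothetical bins parameter, which is not necessarily how Iclust estimates mutual information, and it does not handle NAN entries.

```python
import math
from collections import Counter

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def discretize(x, bins):
    """Map each value to an equal-width bin index in 0..bins-1."""
    lo, hi = min(x), max(x)
    if hi == lo:
        return [0] * len(x)
    return [min(int((v - lo) / (hi - lo) * bins), bins - 1) for v in x]

def mutual_information(x, y, bins=4):
    """Empirical mutual information (in bits) of the binned patterns."""
    n = len(x)
    bx, by = discretize(x, bins), discretize(y, bins)
    pxy, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum((c / n) * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def signed_similarity(x, y, bins=4):
    """Mutual information multiplied by the sign of the Pearson correlation."""
    r = pearson(x, y)
    sign = (r > 0) - (r < 0)
    return sign * mutual_information(x, y, bins)

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [-v for v in x]
same = signed_similarity(x, x)      # positive: identical patterns
opposite = signed_similarity(x, y)  # negative: anti-correlated patterns
```

The sign factor lets the relation separate correlated from anti-correlated patterns, which mutual information alone cannot do since it is never negative.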

However, you can specify other similarity options using the advanced menu. If your input file already includes pairwise similarity relations, this needs to be specified in the advanced menu, and in that case Iclust is applied directly over these relations.

Many defaults may be overridden in the advanced menu. For example, the number of clusters defaults to the square root of the number of data items: if your data consist of 10,000 genes, the resulting partition will include 100 clusters.
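As a quick illustration of the default cluster count, the sketch below computes the square root of the number of data items. How Iclust treats item counts that are not perfect squares is not stated here, so rounding to the nearest integer is an assumption, and the function name default_num_clusters is hypothetical.

```python
import math

def default_num_clusters(num_items):
    """Square root of the item count; rounding to nearest is an assumption."""
    return max(1, round(math.sqrt(num_items)))

k = default_num_clusters(10_000)
```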

Output

Except where affected by changes in the advanced options, the resulting output includes:

These files can be used by many other tools, for example to analyze the association between the obtained clusters and outside annotations (like Gene Ontology).

Run time and complexity issues

The current implementation will run over matrices that contain between 3 and 20,000 data items (rows), and at least 5 features (columns). The product of the square of the number of rows and the number of columns must be less than 10e10.
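A pre-flight check of these limits might look like the following sketch. The function name fits_limits is hypothetical, and the bound is taken literally as written, i.e. 10e10 in floating-point notation; if the intended bound differs, adjust the constant.

```python
MIN_ROWS, MAX_ROWS = 3, 20_000
MIN_COLS = 5
MAX_WORK = 10e10  # bound on rows^2 * columns, read literally from the text above

def fits_limits(rows, cols):
    """True if a rows x cols data matrix is within the stated limits."""
    return (MIN_ROWS <= rows <= MAX_ROWS
            and cols >= MIN_COLS
            and rows ** 2 * cols < MAX_WORK)

small_ok = fits_limits(6_000, 80)    # 6,000 items with 80 features
too_big = fits_limits(20_000, 250)   # row count is allowed, but the product is not
```

Note that the product bound can reject a matrix even when its row count alone is within range, as in the second call.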

The data extraction mechanism works well in most cases. However, if your data contain binary characters, characters that are neither alphanumeric nor whitespace, or other unusual features, the extraction may not produce the expected results. To help with such cases, any messages from the extractor are provided through the advanced menu interface.