1. Flowchart
This is a schematic flowchart of the Array2BIO analysis.

2. Microarray data analysis
2.1. Background correction
Array2BIO follows the original Affymetrix background correction procedure. The probe array is divided into 16 zones (a 4x4 grid).
Raw intensities in each zone are ranked, and the background level of the zone is taken from the lowest 2% of its intensities. The distances from
each probe to the zone centers are then used to estimate the background level at the probe location, which is subtracted from the raw
probe intensity.
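The zone-based procedure can be sketched as follows. This is a minimal illustration, not the Array2BIO implementation: averaging the lowest 2% of intensities, the inverse-squared-distance weights, and the `smooth` constant are assumptions borrowed from commonly published descriptions of the Affymetrix algorithm, and the function names are hypothetical.

```python
def zone_backgrounds(grid, zones=4, fraction=0.02):
    """Return a list of (center_row, center_col, background), one per zone.

    grid is a list of rows of raw intensities; it needs at least
    `zones` rows and `zones` columns.
    """
    rows, cols = len(grid), len(grid[0])
    result = []
    for zr in range(zones):
        for zc in range(zones):
            r0, r1 = zr * rows // zones, (zr + 1) * rows // zones
            c0, c1 = zc * cols // zones, (zc + 1) * cols // zones
            values = sorted(grid[r][c] for r in range(r0, r1) for c in range(c0, c1))
            # Assumption: background = mean of the lowest 2% of zone intensities.
            n_low = max(1, int(len(values) * fraction))
            background = sum(values[:n_low]) / n_low
            result.append(((r0 + r1 - 1) / 2, (c0 + c1 - 1) / 2, background))
    return result


def correct_background(grid, zones=4, fraction=0.02, smooth=100.0):
    """Subtract a distance-weighted background estimate from every probe."""
    zone_info = zone_backgrounds(grid, zones, fraction)
    corrected = []
    for r, row in enumerate(grid):
        new_row = []
        for c, value in enumerate(row):
            # Assumption: zones closer to the probe get larger weights.
            weights = [1.0 / ((r - zr) ** 2 + (c - zc) ** 2 + smooth)
                       for zr, zc, _ in zone_info]
            local_bg = sum(w * bg for w, (_, _, bg) in zip(weights, zone_info)) / sum(weights)
            new_row.append(value - local_bg)
        corrected.append(new_row)
    return corrected
```

On a perfectly uniform array every zone has the same background, so the corrected intensities are all zero.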
2.2. Filtering out non-specific hybridization
Each probe is measured as a pair of intensities: a perfect match (PM) intensity and a mismatch (MM) intensity. The MM intensity estimates cross-reactivity
with other genes. Array2BIO excludes all probes whose PM intensity is less than 1.25*MM intensity, and it reports the fraction of probes with specific hybridization that survive this filtering. For the surviving probes, the MM intensity is subtracted from the PM intensity, so the raw intensity is measured as the relative (PM-MM) intensity.
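As a minimal sketch of this filter (the function name and the list-of-pairs input format are illustrative assumptions):

```python
def filter_specific(probes, ratio=1.25):
    """probes: list of (pm, mm) intensity pairs.

    Returns the (PM-MM) intensities of the surviving probes and the
    fraction of probes that survived.
    """
    # A probe is excluded when PM < ratio * MM, so it is kept otherwise.
    kept = [pm - mm for pm, mm in probes if pm >= ratio * mm]
    fraction = len(kept) / len(probes) if probes else 0.0
    return kept, fraction
```

For example, of the pairs (200, 100), (120, 100), (125, 100), and (50, 80), only the first and third pass the 1.25 threshold, so half of the probes survive.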
2.3. Normalization and Log2 transformation
The median (PM-MM) array intensity (medArray) is calculated over the probes that remain after filtering. Each individual (PM-MM) probe intensity (probInt) is then
normalized by this median and log2-transformed:
probInt_new = log2(probInt / medArray)
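A short sketch of this transformation, assuming the input is a list of positive (PM-MM) intensities from one array (the function name is hypothetical):

```python
import math
from statistics import median


def normalize_log2(intensities):
    """Divide each (PM-MM) intensity by the array median, then take log2."""
    med_array = median(intensities)
    return [math.log2(prob_int / med_array) for prob_int in intensities]
```

A probe at the array median maps to 0, a probe at twice the median to 1, and a probe at half the median to -1.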
2.4. Probe (.CEL) to tag (.TAG) mapping
Affymetrix .CDF files are used to map individual probe intensities (each at an X,Y location on the array) onto gene tags (identifiers such as 1415671_at). Usually each tag gets ~10 good probes that span the corresponding gene transcript.
2.5. Averaging experiment replicas
Several experiment replicas (multiple .TAG files) can be averaged in comparative analysis to reliably estimate signal and background
gene expression levels.
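A minimal sketch of replica averaging. It assumes each .TAG file has already been parsed into a dictionary of tag identifier to log2 expression; restricting the result to tags present in every replica is an illustrative choice, not documented Array2BIO behavior.

```python
from statistics import mean


def average_replicas(replicas):
    """replicas: non-empty list of {tag_id: log2_expression} dictionaries.

    Returns the mean expression for each tag found in all replicas.
    """
    shared = set(replicas[0]).intersection(*replicas[1:])
    return {tag: mean(replica[tag] for replica in replicas) for tag in shared}
```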
2.6. Filtering out probe expression outliers
It is common for the expression level of a few probes of a gene to differ significantly from the median expression level of
the transcript (medLevel). To filter out such outliers, Array2BIO excludes probes whose expression differs from
medLevel by more than a user-specified number of standard deviations. By default, strict filtering (1 standard deviation)
is used for the comparative analysis and medium-stringency filtering (2 standard deviations) for the clustering analysis.
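The filter can be sketched as below. The use of the median for medLevel and of the population standard deviation of the probe levels are assumptions; the function name is hypothetical.

```python
from statistics import median, pstdev


def filter_probe_outliers(levels, n_sd=1.0):
    """Keep probe expression levels within n_sd standard deviations of medLevel."""
    med_level = median(levels)
    sd = pstdev(levels)  # assumption: spread of all probes of this transcript
    return [x for x in levels if abs(x - med_level) <= n_sd * sd]
```

With probe levels 1.0, 1.1, 0.9, 1.0, and 5.0, the last probe lies far more than one standard deviation from the median and is dropped.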
3. Statistical methods (comparative analysis)
3.1. Balance analysis of low- and high-expressors
The significance of a fold-difference in expression varies drastically between low- and high-expressor genes,
simply because dividing one small number by another small number can produce a very large ratio. Array2BIO uses local mean normalization
and local variance correction across intensities to handle low- and high-expressors reliably and to define intensity-dependent fold-difference
thresholds. The Array2BIO approach is very similar to the previously described SNOMAD methodology [Colantuoni C, Henry G, Zeger S, Pevsner J., Bioinformatics (2002)].
Briefly, the fold-expressions of all Affy tags are ordered by the average expression of the signal and control tags (this average represents the overall level
of tag expression). They are then divided into 100 groups by average expression level, and the distribution of fold-expressions is calculated
for each group. Each tag is assigned a Z-value based on the mean and standard deviation of the fold-expressions in its group. Tags with a
Z-value greater than 2 are selected for further analysis (orange dots on the plot).
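The grouping and Z-value assignment can be sketched as follows. The sketch assumes log2 expression values, so that the fold-expression is the difference between signal and control, and it splits the ordered tags into groups of nearly equal size; both choices and all names are illustrative rather than confirmed details of Array2BIO.

```python
from statistics import mean, pstdev


def balance_z_values(tags, n_groups=100):
    """tags: list of (tag_id, signal, control) log2 expression values.

    Returns {tag_id: Z-value of the tag's fold-expression within its group}.
    """
    # Order tags by the average of signal and control expression.
    ordered = sorted(tags, key=lambda t: (t[1] + t[2]) / 2.0)
    n_groups = min(n_groups, len(ordered))
    z_values = {}
    for g in range(n_groups):
        start = g * len(ordered) // n_groups
        stop = (g + 1) * len(ordered) // n_groups
        group = ordered[start:stop]
        folds = [signal - control for _, signal, control in group]
        mu, sd = mean(folds), pstdev(folds)
        for (tag_id, _, _), fold in zip(group, folds):
            z_values[tag_id] = (fold - mu) / sd if sd > 0 else 0.0
    return z_values


def select_tags(tags, n_groups=100, z_cutoff=2.0):
    """Return the tag ids whose Z-value exceeds the cutoff."""
    z_values = balance_z_values(tags, n_groups)
    return [tag_id for tag_id, z in z_values.items() if z > z_cutoff]
```

Because each Z-value is computed only against tags of similar average expression, the same fold-difference can be significant among high-expressors and insignificant among low-expressors.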
3.2. Welch's t-test of differential expression significance
Signal and control tag expressions that survive the balance analysis of low- and high-expressors are then tested for
a difference in expression relative to the uncertainty in their measured expression levels. A Welch's t-test is performed on the average signal
and control tag expressions using the standard deviations of their probe expression distributions. A p-value is assigned to every differentially
expressed tag, and tags with a p-value below 0.05 are passed on to the multiple testing correction analysis.
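For reference, the Welch statistic and its approximate degrees of freedom (Welch-Satterthwaite) can be computed from summary statistics as sketched below. Treating the number of probes per tag as the sample size is an assumption, and the function name is hypothetical.

```python
import math


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Return (t, df) for Welch's unequal-variance t-test.

    sd_a and sd_b are sample standard deviations; n_a and n_b must be >= 2.
    """
    var_a = sd_a ** 2 / n_a  # squared standard error of mean_a
    var_b = sd_b ** 2 / n_b  # squared standard error of mean_b
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df
```

Converting t and df into a p-value requires the Student t distribution, which the Python standard library does not provide. If SciPy is available, `scipy.stats.ttest_ind_from_stats` with `equal_var=False` performs the whole test from the same summary statistics.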
3.3. Mapping Affy tags onto UCSC known genes
Array2BIO first identifies a set of unique (non-overlapping) genes in the genome that matches the original .CEL files, using the 'known genes' annotation provided by the UCSC Genome Browser database. The Affy tags are then
mapped to (and grouped by) their corresponding 'known genes'. The accession numbers of the corresponding mRNA sequences and the genomic location are retrieved
for each gene during the mapping process. These are later used to dynamically link genes to the NCBI database and to the ECR Browser of evolutionary
conservation.
3.4. Gene Ontology (GO) and KEGG analyses of biological functions and gene interactions
Array2BIO uses locally installed instances of the GO and KEGG databases to compare the distribution of differentially expressed genes across
functional categories with the genome-wide population of these categories. Observed and expected category population values are
compared, and the statistical enrichment (or depletion) of a category is quantified using the hypergeometric distribution.
Functional categories with p-values smaller than 0.05 are selected for the subsequent multiple testing correction analysis.
The Gene Ontology database provides a biological classification of gene functions, represented by membership in functional categories
that relate to a particular biological process, molecular function, or cellular component.
The KEGG database combines information on gene interactions, which are grouped
into (1) metabolism, (2) genetic information processing, (3) environmental information processing, (4) cellular processes, and (5) human diseases
categories.
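The hypergeometric test for a single category can be sketched as follows. The parameter names are illustrative: N is the number of annotated genes in the genome, K the number of those in the category, n the number of differentially expressed genes, and k the number of differentially expressed genes in the category. The use of one-sided tail probabilities is an assumption.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """Probability of exactly k category genes among n genes drawn from N."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def enrichment_p(k, N, K, n):
    """P(X >= k): chance of seeing at least k category genes."""
    return sum(hypergeom_pmf(i, N, K, n) for i in range(k, min(K, n) + 1))


def depletion_p(k, N, K, n):
    """P(X <= k): chance of seeing at most k category genes."""
    lowest = max(0, n - (N - K))  # smallest possible overlap
    return sum(hypergeom_pmf(i, N, K, n) for i in range(lowest, k + 1))
```

For instance, with N=10, K=4, and n=3, observing k=2 or more category genes has probability 1/3, so such a category would not be reported as enriched.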
3.5. Correction for multiple testing
Array2BIO corrects for multiple testing to exclude false positive predictions that arise because the statistical test
of differential tag expression, or of enrichment/depletion in GO and KEGG categories, is performed many times: the probability of
observing a p<0.05 outcome purely by chance increases with the number of tests performed. Array2BIO provides two statistical methods
to correct for multiple testing (and select only reliable results), and it also allows the user to skip the correction
entirely. The default method is the Benjamini and Hochberg correction, a medium-stringency
method that usually works well. The Bonferroni correction is provided as an alternative. It is
one of the most stringent multiple testing correction methods and should be used only to select the most outstanding overexpressor genes or
enriched/depleted functional categories.
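Both corrections can be expressed as adjusted p-values, as in the sketch below. These are the textbook forms of the two methods; whether Array2BIO reports adjusted p-values or applies an adjusted threshold is not stated here, and the function names are hypothetical.

```python
def bonferroni(p_values):
    """Multiply every p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

For the p-values 0.01, 0.04, 0.03, and 0.005, all four remain below 0.05 after the Benjamini and Hochberg adjustment, whereas only 0.01 and 0.005 do after the Bonferroni adjustment.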
4. Clustering analysis
4.1. Microarray data clustering
Array2BIO uses the Cluster tool developed by
Mike Eisen [Eisen et al. (1998) PNAS 95:14863] of the Lawrence Berkeley National Lab, in the
version adapted to the Unix environment by Michiel de Hoon of the University of Tokyo. Hierarchical cluster analysis is implemented in Array2BIO;
it allows clustering of genes and/or conditions and offers 9 distance measures and 4 clustering methods. Because of Cluster limitations, Array2BIO
clusters at most 2000 genes. Genes are ranked by the deviation in their expression across
conditions, and those with the largest deviation are selected for clustering.
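The gene selection step can be sketched as below, assuming that "deviation" means the standard deviation of a gene's expression across conditions (the exact measure is not specified here) and using a hypothetical function name.

```python
from statistics import pstdev


def select_most_variable(expression, max_genes=2000):
    """expression: {gene: [expression value in each condition]}.

    Returns up to max_genes gene names, most variable first.
    """
    ranked = sorted(expression, key=lambda gene: pstdev(expression[gene]), reverse=True)
    return ranked[:max_genes]
```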
4.2. Interactive tree visualization
Array2BIO provides an interactive web utility for visualizing clustering results. Clustered gene expression values across multiple conditions
are displayed in a matrix format, with the tree of clustering relationships shown to the left of the gene expression image. Clicking on a tree branch
generates a zoomed-in image of that branch and a detailed description of the related genes (including gene names, accession
numbers, corresponding Affy tags, and genomic locations).
5. Interconnection with external tools
5.1. ECR Browser - evolutionary conservation analysis
ECR Browser is a dynamic whole-genome navigation tool for visualizing and studying evolutionary relationships between vertebrate and non-vertebrate genomes. Evolutionary Conserved Regions (ECRs) that have been mapped within alignments of the genomes are presented in this graphical browser, which depicts and color-codes ECRs in relation to known genes that have been annotated in the base genome.
5.2. DiRE - identification of clusters of transcription factor binding sites in promoters and distant regulatory elements
DiRE relies on a database of putative transcription factor binding sites that have been carefully annotated across the human genome using evolutionary conservation with the mouse and rat genomes. An efficient search algorithm is applied to this data set to identify combinations of transcription factors, whose binding sites tend to co-occur in close proximity within the promoter regions of the input gene set. These combinations are statistically evaluated, and significant combinations are reported and visualized.
5.3. NCBI - detailed sequence information
Detailed mRNA transcript information, including nucleotide and protein sequences, related publications, gene annotation, etc.