Computational epigenetics: CpG island mapping by epigenome prediction

We discovered a striking parallelism between certain DNA characteristics of CpG islands and their epigenetic signature, and used it to score all CpG islands according to their CpG island strength (high CpG island strength is indicated by absence of DNA methylation, frequent promoter activity and open chromatin structure). This web page provides CpG island strength predictions and maps of predicted bona fide CpG islands for the human genome. A detailed analysis and a description of the methods are available in the corresponding paper, which is in press at PLoS Computational Biology. Questions should be addressed to Christoph Bock (http://www.mpi-inf.mpg.de/~cbock).


Genome-wide prediction of CpG island strengths / likelihood of being bona fide CpG islands

The combined epigenetic score is calculated as the unweighted average of the predictions for absence of DNA methylation, presence of promoter activity and open chromatin structure, and therefore takes values between 0 and 1. A value of zero corresponds to a completely silenced, inactive and inaccessible CpG island, while a value of one corresponds to a fully unmethylated and highly accessible CpG island with significant promoter activity. Between these two extremes, a value of one third (~0.33) corresponds to a CpG island with high confidence for at least one of the three indicators of bona fide CpG islands, or to an equivalent sum of several lower-confidence scores. A value of two thirds (~0.67) corresponds to high confidence for at least two of the three indicators (or equivalent), and a value of 0.5 corresponds to a CpG island that is equally likely to be a bona fide CpG island or a false positive. For technical reasons, all scores are multiplied by 1000 in the UCSC Genome Browser tracks.
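As a minimal sketch of the arithmetic described above, the following Python snippet averages three individual predictions and converts the result to the 0 to 1000 scale used in the browser tracks. The function and parameter names are illustrative assumptions, not part of the published method, and rounding to an integer is likewise an assumption made here for display purposes.

```python
def combined_epigenetic_score(p_unmethylated, p_promoter, p_open_chromatin):
    """Unweighted average of three predictions, each expected in [0, 1].

    Parameter names are hypothetical; the page only states that the three
    predictions are averaged without weights.
    """
    predictions = (p_unmethylated, p_promoter, p_open_chromatin)
    for p in predictions:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each prediction must lie between 0 and 1")
    return sum(predictions) / len(predictions)


def to_track_score(score):
    """Scale a score in [0, 1] by 1000 (rounding is an assumption)."""
    return round(score * 1000)


def from_track_score(track_score):
    """Convert a browser track score back to the 0-1 scale."""
    return track_score / 1000


# High confidence for exactly one indicator gives one third,
# for two indicators two thirds.
one_indicator = combined_epigenetic_score(1.0, 0.0, 0.0)
two_indicators = combined_epigenetic_score(1.0, 1.0, 0.0)
print(to_track_score(one_indicator), to_track_score(two_indicators))
```

When comparing browser track values against the thresholds discussed on this page, the track values would first need to be divided by 1000, as in `from_track_score`.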

The optimized score combines the epigenetic predictions with a sequence-based scoring of CpG island strength. At the cost of biological interpretability, it achieves higher prediction performance when evaluated against large-scale DNA methylation and promoter activity data. The optimized score is not discussed in the paper and should only be used if the goal is highly accurate epigenetic state prediction and prioritization of candidate regions for experimental follow-up. Statistical analyses on the optimized score are not recommended, and the thresholds derived in the paper do not apply to it.

CpG island strength predictions | Results as calculated for the previous genome assembly (hg17 / NCBI35) | Results mapped to the current genome assembly (hg18 / NCBI36)
Strength predicted for all CpG islands* genome-wide using the combined epigenetic score | View within the UCSC Genome Browser | View within the UCSC Genome Browser
Strength predicted for all CpG islands* genome-wide using the optimized score | View within the UCSC Genome Browser | View within the UCSC Genome Browser
All predictions and maps in a single file (long loading time, please be patient) | View within the UCSC Genome Browser | View within the UCSC Genome Browser

* CpG islands according to the Gardiner-Garden & Frommer (1987) sequence criteria: (i) GC content above 50%, (ii) ratio of observed versus expected number of CpG dinucleotides above 0.6, and (iii) more than 200 basepairs in length.
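The three sequence criteria in the footnote above can be checked for a single candidate region roughly as follows. This is a sketch under stated assumptions: the observed/expected ratio uses the commonly cited formulation (number of CpG × length) / (number of C × number of G), which the page itself does not spell out, and the function names are hypothetical. It tests one given sequence only and does not perform the genome-wide scan needed to locate islands.

```python
def cpg_statistics(sequence):
    """Return (length, GC content, observed/expected CpG ratio) for a DNA string."""
    seq = sequence.upper()
    length = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_content = (c_count + g_count) / length if length else 0.0
    # Assumed formulation of the ratio; defined as 0 when no C or no G is present.
    if c_count and g_count:
        obs_exp = (cpg_count * length) / (c_count * g_count)
    else:
        obs_exp = 0.0
    return length, gc_content, obs_exp


def meets_sequence_criteria(sequence):
    """Apply the three thresholds quoted in the footnote to one candidate region."""
    length, gc_content, obs_exp = cpg_statistics(sequence)
    return length > 200 and gc_content > 0.5 and obs_exp > 0.6


# A synthetic, CpG-dense toy sequence (not real genomic data).
print(meets_sequence_criteria("CG" * 150))
```

Regions passing this test are only candidates; the strength predictions on this page are what distinguish bona fide CpG islands from false positives among them.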


Maps of predicted bona fide CpG islands

Since the combined epigenetic score is directly linked to an epigenetic interpretation, it is possible to select biologically meaningful threshold parameters and thereby derive maps of bona fide CpG islands. For each of the three thresholds motivated above (0.33, 0.5 and 0.67), we constructed a map of predicted bona fide CpG islands exceeding this threshold. These three maps make different trade-offs between sensitivity (i.e. not missing any bona fide CpG islands) and specificity (i.e. minimizing the number of false positives) and are thus tailored to different types of applications.

Maps of predicted bona fide CpG islands | Results as calculated for the previous genome assembly (hg17 / NCBI35) | Results mapped to the current genome assembly (hg18 / NCBI36)
Map of CpG islands with the highest strength / likelihood to be bona fide CpG islands (combined epigenetic score > 0.67) - for applications that require high specificity | View within the UCSC Genome Browser | View within the UCSC Genome Browser
Map of CpG islands with (at least) high strength / likelihood to be bona fide CpG islands (combined epigenetic score > 0.5) - recommended for most users | View within the UCSC Genome Browser | View within the UCSC Genome Browser
Map of CpG islands with (at least) moderate strength / likelihood to be bona fide CpG islands (combined epigenetic score > 0.33) - for applications that require high sensitivity and for genome annotation | View within the UCSC Genome Browser | View within the UCSC Genome Browser
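The three maps listed above can be thought of as nested subsets of the scored CpG islands, each obtained by keeping the islands whose combined epigenetic score exceeds one threshold. The sketch below illustrates that selection step; the map labels, the `(name, score)` input layout and the function name are assumptions made for illustration only.

```python
# Thresholds as stated on this page; the dictionary keys are hypothetical labels.
THRESHOLDS = {
    "moderate_strength": 0.33,  # high sensitivity, genome annotation
    "high_strength": 0.5,       # recommended for most users
    "highest_strength": 0.67,   # high specificity
}


def build_maps(scored_islands):
    """Group island names by the thresholds their score strictly exceeds.

    scored_islands: iterable of (name, combined_epigenetic_score) pairs,
    with scores on the 0-1 scale (not the 0-1000 browser scale).
    """
    scored = list(scored_islands)
    return {
        label: [name for name, score in scored if score > threshold]
        for label, threshold in THRESHOLDS.items()
    }


# Toy scores invented for illustration, not taken from the predictions.
toy = [("island_a", 0.20), ("island_b", 0.40), ("island_c", 0.60), ("island_d", 0.90)]
for label, names in build_maps(toy).items():
    print(label, names)
```

Because the comparison is strict and the thresholds are ordered, every island in the high-specificity map also appears in the two more permissive maps.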


Downloading the data

In addition to the UCSC Genome Browser tracks provided above, all predictions and maps can be downloaded as a single tab-separated text file. The genome coordinates in this text file refer to the genome assembly on which all predictions were originally calculated (hg17 / NCBI35). A version of the same file remapped to the latest genome assembly (hg18 / NCBI36) is also available for download as a tab-separated text file.
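A tab-separated file of this kind could be read with Python's standard `csv` module along the following lines. This is only a sketch: it assumes the file starts with a header row, which this page does not state, and the file name and column name in the usage comment are hypothetical placeholders that would need to be replaced with the actual ones.

```python
import csv


def read_predictions(lines, score_column):
    """Parse tab-separated lines with a header row into a list of dicts.

    lines: an open text file or any iterable of text lines.
    score_column: name of the column holding the score to convert to float
                  (the real column name is not given on this page).
    """
    reader = csv.DictReader(lines, delimiter="\t")
    rows = []
    for row in reader:
        row[score_column] = float(row[score_column])
        rows.append(row)
    return rows


# Usage sketch (file name and column name are hypothetical):
# with open("cpg_island_predictions_hg18.txt", newline="") as handle:
#     rows = read_predictions(handle, "combined_epigenetic_score")
```

Whichever file is used, coordinates from the hg17 and hg18 versions should not be mixed, since they refer to different genome assemblies.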


Further information

A detailed account of the methodology used to derive these predictions of CpG island strength and maps of bona fide CpG islands is provided in the corresponding paper:
Bock, C., J. Walter, M. Paulsen and T. Lengauer (2007). "CpG island mapping by epigenome prediction." PLoS Computational Biology. In press. doi:10.1371/journal.pcbi.0030110.eor (view paper)

Please address any questions to Christoph Bock (http://www.mpi-inf.mpg.de/~cbock).