Introduction to Comethyl

Comethyl

Comethyl is a systems biology method for multivariate analysis of whole genome bisulfite sequencing (WGBS) data. Comethyl can be used to construct a weighted region comethylation network from user-defined regions, identify comethylation modules and the genomic regions that make up those modules, analyze functional enrichments, and investigate correlations between comethylation modules and sample traits of interest.

Glossary of Concepts

Term	Definition
Comethylation network	A weighted network in which nodes correspond to genomic regions and edges correspond to correlations between region DNA methylation.
Module	A group of genomic regions with correlated DNA methylation.
Module Eigennode	The first principal component of a module, can be thought of as a weighted average of methylation values.
Connectivity	The sum of the correlations between a region and all other regions in the comethylation network.
Fit	The R-squared value assessing how well the network meets the criteria for scale-free topology.
Soft Power Threshold	The power to which all correlations are raised when constructing the weighted network.
Hub Region	The region in a module whose methylation is most highly correlated with the module eigennode.

Inputs

Bismark Cytosine Reports: Following WGBS of your samples, the raw fastq files should be processed into CpG count matrices where biases have been removed. One available pipeline is CpG_Me, which performs read alignment and quality control of WGBS raw fastq files, and then generates the Bismark Cytosine Reports used by Comethyl.

Sample Trait Table ("sample_info.xlsx"): This is an excel table read in as a data.frame with samples (specifying CpG reports) as rows and sample traits of interest as columns. All values in the table must be numeric, though the data can be either categorical or continuous for any given trait. Sample traits can include all available information about potential variables of interest as well as potential confounding variables. In WGBS datasets, potential confounding variables include cell type proportions as well as technical variables including coverage, read duplication, read trimming, and global cytosine methylation levels. For human populations, metadata should include clinical, diagnostic, and demographic data, as well as sample collection characteristics, such as gestational age and birthweight for cord blood. For experimental studies in animal models or cell cultures, experimental variables should be included in the metadata for exploring module-trait relationships. The comethylation modules identified from the Bismark Cytosine Reports will be correlated with the traits in this table.

Annotation

One unique feature of Comethyl is the ability to define regions based on functional annotations, such as CpG islands, gene bodies, enhancers, or a custom annotation. This allows the user to focus on a specific portion of the genome and integrate with other genomic data sets. Since gene body methylation can correlate positively with expression, gene bodies were selected as alternative regions to explore in the cord blood data set, in addition to the approach of calling genomic regions by CpG location.

Use Case

The example detailed in the vignettes is from a dataset of 74 male cord blood samples from newborns who were later diagnosed with autism spectrum disorder (ASD) and those with typical development (TD). Comethylation modules were associated with 49 sample characteristics including diagnosis, cell types, sample sequencing information such as percent CpG methylation, and demographic data such as home ownership. Raw data is available on GEO (GSE140730), see the previous publication for more details.

Installation

You can install Comethyl from this repository and load it into your R session with the code below.

install.packages(c("BiocManager", "remotes"))
BiocManager::install("cemordaunt/comethyl")
library(comethyl)

Memory Usage

Comethyl performs a large number of correlation calculations when constructing the network, and these can take up a large amount of memory for an extended period of time. Memory usage and time should be taken into account when running getSoftPower() and getModules().

Because getSoftPower() uses an exponentially increasing amount of RAM as the number of regions increase, care should be taken to filter the number of regions so that only the most informative regions are considered. For reference, up to 250,000 regions are typically able to run on a large node with 500GB of RAM. Smaller region sets typically require much less memory to run. It’s also not recommended to use multiple threads with large region sets.

With default system BLAS libraries, getModules() can take multiple days to run, but this can be sped up considerably if R is configured with a fast BLAS library such as OpenBLAS. In testing, getModules() took 85 hours to run on a set of 250,000 regions with a default BLAS, but only 6 hours with OpenBLAS. getSoftPower() was also sped up from 3 hours to 2 hours with OpenBLAS. More information on configuring R with OpenBLAS and similar libraries can be found in the R Installation Manual.

Charles Mordaunt and Julia Mouat