1 Introduction

Chronic lymphocytic leukemia (CLL) is the most frequent leukemia in the western countries and presents with a broad spectrum of clinical behaviors. This can be partially captured by the presence of two biological subtypes distinguished by the extent of somatic mutations in the heavy chain variable region of immunoglobulin genes (IGHV). These groups are unmutated (U) and mutated (M) CLL, with poorer and better clinical outcome, respectively. Nonethless, Kulis et al., 2012 found that CLL can be actually classified in 3 groups or epitypes based on different DNA methylation imprints of pre- and post- germinal center experienced B cells. These epitypes, which add further clinical information beyond IGHV subgroups, were named n-CLL (formed mainly by U-CLL), m-CLL (formed mainly by M-CLL) and i-CLL ( formed by U-CLL and M-CLL). Here, I present all necessary steps to find the CLL epitypes in RRBS data. The code provided here should work for the majority of the cases. More advanced users may check the underlying code for all the functions here. Please, note that these epitypes can be alternatively found by different approaches, including pyrosequencing using few CpGs (Queriós AC, 2015 and Duran-Ferrer M, 2020), 450k and EPIC Illumina DNA methylation arrays Duran-Ferrer M, 2020 and Me-iPLEX Giacopelli 2019. If you have pyrosequencing, 450k or EPIC data you can check this comprehensive tutorial to predict CLL epitypes.

In the CLL1100 study, the CLL epitypes were calculated for 1,023 samples including Illumina 450k array and RRBS data. In the case of Illumina 450k data, we used this algorithm. For RRBS data, we used CpGs covered by at least 5 reads across all samples for both RRBS-SE and RRBS-PE separately. First, CLL patients with 100% and ≤95% IGHV identities were selected to perform differential DNA methylation analysis with mean methylation differences between groups of at least 0.5. These stringent cutoffs for both IGHV and DNA methylation differences allowed a better delineation of borderline cases compared with the traditional 98% IGHV and 0.25 methylation difference cutoffs. This filtering criteria translated into clearer signatures consisting of 32 and 153 differentially methylated CpGs for RRBS-SE and RRBS-PE data, respectively. We then used these CpGs to perform consensus clustering with ConsensusClusterPlus R package v.1.52.0110 with 10,000 permutations allowing from K=2 to K=7 groups, which robustly identified 3 consensus groups in both RRBS data types. Each sample was assigned a probability to belong to each of the groups (using the calcICL function). Samples where the maximum probability was below 0.5 or where 2 epitypes had a probability above 0.35 were considered as unclassified cases. All these steps are exemplified in the following lines of code.

2 Load required data.

2.1 Load packages.

## Load packages and set general options
options(stringsAsFactors = F,error=NULL)
library(data.table)
library(matrixStats)
library(genefilter)
## 
## Attaching package: 'genefilter'
## The following objects are masked from 'package:matrixStats':
## 
##     rowSds, rowVars
library(ConsensusClusterPlus)

2.2 Download required data.

We will download the example data as well as all the required functions from github.

##Download required data
#methylation data matrix
download.file("https://raw.githubusercontent.com/Duran-FerrerM/CLLmap-epigenetics/main/data/DNAme.mat.RRBS-SE.tsv", destfile = "DNAme.mat.RRBS-SE.tsv", method="libcurl")

# samples metadata
download.file("https://raw.githubusercontent.com/Duran-FerrerM/CLLmap-epigenetics/main/data/Samples.metadata.RRBS-SE.tsv", destfile = "Samples.metadata.RRBS-SE.tsv", method="libcurl")

## Download and load required functions
download.file("https://raw.githubusercontent.com/Duran-FerrerM/CLLmap-epigenetics/main/code/Epitype_fun.R", destfile = "Epitype_fun.R", method="libcurl")
source("Epitype_fun.R")

2.3 Load data into R.

We can load data into R using the following lines of code. The path to an alternative file can be changed accordingly.

##
## CHANGE THESE LINES OF CODE BY YOUR DATA/PATH ACCORDINGLY
##

####################################################################################################
DNA.mat.path <- "./DNAme.mat.RRBS-SE.tsv"
Samples.metadata.path <- "./Samples.metadata.RRBS-SE.tsv"
Sample_id.column <- "Participant_id_anonymous" ## Look at the following section where matrix containing sample metadata shows the column is named like this.
####################################################################################################

## LOAD DNA methylation data
DNAme.mat <- fread(DNA.mat.path,data.table = F)
DT::datatable(DNAme.mat, options = list(scrollX = T, scrollY=T), rownames = F)

Load required functions and metadata for all the samples, including IHGV identity. Although not required, having the IHGV identity will provide better results and a correct assignment of n-CLL, i-CLL and m-CLL labels to the generated clusters.

## Load metadata for all the samples.
Samples.metadata <- fread(Samples.metadata.path,data.table = F)
DT::datatable(Samples.metadata, options = list(scrollX = T, scrollY=T), rownames = F)

3 Analsyes

The analyses to find CLL epitypes can be summarized in:

  1. Find informative CpGs to estimate CLL epitypes –> cluster.meth.diff.
  2. Perform consensus clustering –> cluster.search.
  3. Assign samples to a cluster membership –> cluster.assing.

3.1 Find informative CpGs to estimate CLL epitypes.

To find informative CpGs, we will perform differentially methylated analyses in our data using the function cluster.meth.diff. This function contains 5 arguments:

  1. DNA.mat refers to the DNA methylation matrix with the structure previously shown.
  2. IGHV_identity.column refers to the column in the metadata previously loaded, in our case, Samples.metadata, which contains the IGHV identity. While this argument is not strictly required and can be left empty, this information will improve the CLL epitype classification and labeling. If not specified, the most variable top.Up.Down.CpGs in the data will be used to find epitypes.
  3. UCLL.cutoff and MCLL.cutoff are cutoffs for IGHV CLL groups. Do not change unless strictly required due to composition of your CLL cohort.
  4. delta is the methylation difference between compared IGHV groups. Ignored is IGHV identity is not provided.
  5. top.Up.Down.CpGs is the number of CpGs considered hyper- and hypomethylation between IGHV groups. If IGHV identity is not provided, the most variable CpGs in the data are selected (top.Up.Down.CpGs*2).

To perform the cluster search we will not use the CpGs genomic coordinates but only the DNA methylation values for each patient. Thus, we subset the matrix accordingly.

## 1. Perform differential analyses between IGHV groups.
DNAme.mat.meth.diff <- cluster.meth.diff(DNAme.mat = DNAme.mat[,4:ncol(DNAme.mat)], # subset of the original matrix to perform the analysis.
                               IGHV_identity.column = "IGHV_identity",
                               UCLL.cutoff = 100,
                               MCLL.cutoff = 95,
                               delta = 0.5,
                               top.Up.Down.CpGs = 100
                               )
## [cluster.meth.diff] Performing differential analyses.
## take a look at the resulting data:
DT::datatable(DNAme.mat.meth.diff, options = list(scrollX = T, scrollY=T), rownames = F)

3.2 Performing consensus clustering.

To search the optimal number of clusters, the package ConsensusClusterPlus will be used.

## 2. Perform consensus clustering with previous differential data.
k.clusters <- cluster.search(DNAme.mat.meth.diff = DNAme.mat.meth.diff,
                             N.permut = 50, ## Increase to 1,000-10,000 for higher robustness
                             maxK = 7,
                             plot = NULL # change to "pdf" to generate a pdf report with all the plots.
                             )
## [cluster search] Performing consensus clustering.
## end fraction
## clustered

## clustered

## clustered

## clustered

## clustered

## clustered

3.3 Assign samples to cluster membership.

The previous plots indicate that the optimal number of clusters is k=3. Please, for further details regarding their interpretation refer to ConsensusClusterPlus. We will use the function cluster.assign to assign the samples cluster membership, probabilities, and in the case IGHV identity is available for the cases, epitype labeling.

NOTE: Please, note that a representative CLL cohort is expected to find these 3 CLL epitypes. In cases where there is bias in cohort composition, these analyses may not be accurate and the k number of clusters could be lower.

## 3. Assign samples to a cluster membership
Clusters.assigned <- cluster.assing(k.clusters =  k.clusters,
                                    k = 3,
                                    IGHV_identity.column = "IGHV_identity"
                                    )

DT::datatable(Clusters.assigned, options = list(scrollX = T, scrollY=T), rownames = F)
## export results
fwrite(x=Clusters.assigned,file = "Clusters/CLL.epitpyes.tsv",sep = "\t",na = "NA")

3.4 Get all code at once

knitr::opts_chunk$set(echo = TRUE)

## Load packages and set general options
options(stringsAsFactors = F,error=NULL)
library(data.table)
library(matrixStats)
library(genefilter)
library(ConsensusClusterPlus)


##Download required data
#methylation data matrix
download.file("https://raw.githubusercontent.com/Duran-FerrerM/CLLmap-epigenetics/main/data/DNAme.mat.RRBS-SE.tsv", destfile = "DNAme.mat.RRBS-SE.tsv", method="libcurl")

# samples metadata
download.file("https://raw.githubusercontent.com/Duran-FerrerM/CLLmap-epigenetics/main/data/Samples.metadata.RRBS-SE.tsv", destfile = "Samples.metadata.RRBS-SE.tsv", method="libcurl")

## Download and load required functions
download.file("https://raw.githubusercontent.com/Duran-FerrerM/CLLmap-epigenetics/main/code/Epitype_fun.R", destfile = "Epitype_fun.R", method="libcurl")
source("Epitype_fun.R")


##
## CHANGE THESE LINES OF CODE BY YOUR DATA/PATH ACCORDINGLY
##

####################################################################################################
DNA.mat.path <- "./DNAme.mat.RRBS-SE.tsv"
Samples.metadata.path <- "./Samples.metadata.RRBS-SE.tsv"
Sample_id.column <- "Participant_id_anonymous" ## Look at the following section where matrix containing sample metadata shows the column is named like this.
####################################################################################################

## LOAD DNA methylation data
DNAme.mat <- fread(DNA.mat.path,data.table = F)
DT::datatable(DNAme.mat, options = list(scrollX = T, scrollY=T), rownames = F)


## Load metadata for all the samples.
Samples.metadata <- fread(Samples.metadata.path,data.table = F)
DT::datatable(Samples.metadata, options = list(scrollX = T, scrollY=T), rownames = F)


## 1. Perform differential analyses between IGHV groups.
DNAme.mat.meth.diff <- cluster.meth.diff(DNAme.mat = DNAme.mat[,4:ncol(DNAme.mat)], # subset of the original matrix to perform the analysis.
                               IGHV_identity.column = "IGHV_identity",
                               UCLL.cutoff = 100,
                               MCLL.cutoff = 95,
                               delta = 0.5,
                               top.Up.Down.CpGs = 100
                               )

## take a look at the resulting data:
DT::datatable(DNAme.mat.meth.diff, options = list(scrollX = T, scrollY=T), rownames = F)


## 2. Perform consensus clustering with previous differential data.
k.clusters <- cluster.search(DNAme.mat.meth.diff = DNAme.mat.meth.diff,
                             N.permut = 50, ## Increase to 1,000-10,000 for higher robustness
                             maxK = 7,
                             plot = NULL # change to "pdf" to generate a pdf report with all the plots.
                             )



## 3. Assign samples to a cluster membership
Clusters.assigned <- cluster.assing(k.clusters =  k.clusters,
                                    k = 3,
                                    IGHV_identity.column = "IGHV_identity"
                                    )

DT::datatable(Clusters.assigned, options = list(scrollX = T, scrollY=T), rownames = F)

## export results
fwrite(x=Clusters.assigned,file = "Clusters/CLL.epitpyes.tsv",sep = "\t",na = "NA")


sessionInfo()

4 Session Information

sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ConsensusClusterPlus_1.60.0 genefilter_1.78.0          
## [3] matrixStats_0.62.0          data.table_1.14.2          
## [5] BiocStyle_2.24.0           
## 
## loaded via a namespace (and not attached):
##  [1] KEGGREST_1.36.2        xfun_0.31              bslib_0.3.1           
##  [4] lattice_0.20-45        splines_4.2.0          vctrs_0.4.1           
##  [7] htmltools_0.5.2        stats4_4.2.0           yaml_2.3.5            
## [10] blob_1.2.3             XML_3.99-0.10          survival_3.3-1        
## [13] rlang_1.0.2            jquerylib_0.1.4        DBI_1.1.2             
## [16] BiocGenerics_0.42.0    bit64_4.0.5            GenomeInfoDbData_1.2.8
## [19] stringr_1.4.0          zlibbioc_1.42.0        Biostrings_2.64.0     
## [22] htmlwidgets_1.5.4      evaluate_0.15          memoise_2.0.1         
## [25] Biobase_2.56.0         knitr_1.39             IRanges_2.30.0        
## [28] fastmap_1.1.0          crosstalk_1.2.0        GenomeInfoDb_1.32.2   
## [31] AnnotationDbi_1.58.0   highr_0.9              Rcpp_1.0.8.3          
## [34] xtable_1.8-4           DT_0.23                BiocManager_1.30.18   
## [37] cachem_1.0.6           S4Vectors_0.34.0       magick_2.7.3          
## [40] annotate_1.74.0        jsonlite_1.8.0         XVector_0.36.0        
## [43] bit_4.0.4              png_0.1-7              digest_0.6.29         
## [46] stringi_1.7.6          bookdown_0.27          grid_4.2.0            
## [49] cli_3.3.0              tools_4.2.0            bitops_1.0-7          
## [52] magrittr_2.0.3         sass_0.4.1             RCurl_1.98-1.7        
## [55] RSQLite_2.2.14         cluster_2.1.3          crayon_1.5.1          
## [58] Matrix_1.4-1           rmarkdown_2.14         httr_1.4.3            
## [61] rstudioapi_0.13        R6_2.5.1               compiler_4.2.0