Quick Start
Bed formatted consensus matrix (chr, start, end and samples)
dim(bed)
## [1] 1000 25
# bed formatted file
head(bed[,1:4])
## Chr Start End B6-18mo-M-BM-47-GT18-01783
## 52834 chr5 24841478 24845196 1592
## 29780 chr17 8162955 8164380 109
## 67290 chr8 40577584 40578029 72
## 51295 chr4 145277698 145278483 110
## 4267 chr1 180808752 180815472 2452
## 45102 chr3 88732151 88732652 49
Create the contrasts you want to compare, here we create contrasts for 22 mice samples from different strains.
# create contrast vector which will be compared.
contrasts<- c("B6", "B6", "B6", "B6", "B6", "NZO", "NZO", "NZO", "NZO", "NZO", "NZO",
"B6", "B6", "B6", "B6", "B6", "NZO", "NZO", "NZO", "NZO", "NZO", "NZO")
cinaR
function directly computes the differentially
accessible peaks.
# If reference genome is not set hg38 will be used!
results <- cinaR(bed, contrasts, reference.genome = "mm10")
## >> preparing features information... 2024-05-22 10:41:06
## >> identifying nearest features... 2024-05-22 10:41:07
## >> calculating distance from peak to TSS... 2024-05-22 10:41:08
## >> assigning genomic annotation... 2024-05-22 10:41:08
## >> assigning chromosome lengths 2024-05-22 10:41:22
## >> done... 2024-05-22 10:41:22
Now, you can access differential accessibility (DA) and enrichment results.
names(results)
## [1] "DA.results" "Enrichment.Results"
Inside DA.results
, you have the consensus peaks (cp) and
differentially accessible (DA) peaks. If batch correction was run, then
cp
will be a batch-corrected consensus matrix, otherwise it
is the filtered and normalized version of original consensus peaks you
provided.
names(results$DA.results)
## [1] "cp" "DA.peaks"
There are many information cinaR
provides such as
adjusted p value, log fold-changes, gene names etc for each peak:
colnames(results$DA.results$DA.peaks$B6_NZO)
## [1] "Row.names" "seqnames" "start" "end"
## [5] "width" "strand" "annotation" "geneChr"
## [9] "geneStart" "geneEnd" "geneLength" "geneStrand"
## [13] "geneId" "transcriptId" "distanceToTSS" "gene_name"
## [17] "logFC" "FDR"
Here is an overview of those DA peaks:
head(results$DA.results$DA.peaks$B6_NZO[,1:5])
## Row.names seqnames start end width
## 1 chr1_134559439_134560787 chr1 134559439 134560787 1349
## 2 chr1_138158514_138159483 chr1 138158514 138159483 970
## 3 chr1_164247654_164251852 chr1 164247654 164251852 4199
## 4 chr1_164860953_164861375 chr1 164860953 164861375 423
## 5 chr1_171092293_171092878 chr1 171092293 171092878 586
## 6 chr1_171631196_171631780 chr1 171631196 171631780 585
Since the comparison is
B6_NZO
, if fold-changes are positive it means they are more accesible in B6 compared to NZO and vice versa for negative values!
and here is a little overview for enrichment analyses results:
## module.name overlapping.genes adj.p
## 1 Myeloid lineage 1 TFEB,FBXL5,PLXNC1,GM2A,AGTPBP1,CTSB 0.07637125
## 2 U_metabolism/replication SLC2A6,GM2A,CTSB,PECAM1 0.07637125
## 3 Myeloid lineage 2 RNF157,FMNL3,MTUS1 0.31948190
## 4 U_mitochondrial proteins PIK3R1,PAQR3,UBE3A,MAP4K4,PTPRC 0.31948190
## 5 U_proteasome/ubiquitin cx PIK3R1,IREB2,PTPRC 0.37199629
## 6 U_Immunity/cytoskeleton RPS6,RPS19 0.62554353
PCA Plots
You can easily get the PCA plots of the samples:
pca_plot(results, contrasts, show.names = F)
You can overlay different information onto PCA plots as well!
# Overlaid information
overlaid.info <- c("B6-18mo", "B6-18mo", "B6-18mo", "B6-18mo", "B6-18mo",
"NZO-18mo", "NZO-18mo", "NZO-18mo", "NZO-18mo", "NZO-18mo", "NZO-18mo",
"B6-3mo", "B6-3mo", "B6-3mo", "B6-3mo", "B6-3mo",
"NZO-3mo", "NZO-3mo", "NZO-3mo", "NZO-3mo", "NZO-3mo", "NZO-3mo")
# Sample IDs
sample.names <- c("S01783", "S01780", "S01781", "S01778", "S01779",
"S03804", "S03805", "S03806", "S03807", "S03808",
"S03809", "S04678", "S04679", "S04680", "S04681",
"S04682", "S10918", "S10916", "S10919", "S10921",
"S10917", "S10920")
pca_plot(results, overlaid.info, sample.names)
Heatmaps
Differential peaks
You can see the available comparisons using:
show_comparisons(results)
## [1] "B6_NZO"
Then, plot the differential peaks for a selected contrast using:
heatmap_differential(results, comparison = "B6_NZO")
Also, you can configure your heatmaps using the additional arguments
of pheatmap
function. For more information check out
?pheatmap
.
heatmap_differential(results, comparison = "B6_NZO", show_colnames = FALSE)
Most variable peaks
You can also plot most variable 100 peaks for all samples:
heatmap_var_peaks(results)
Plus, you can set the number of peaks to be used in these plots, and
again you can change the additional arguments of pheatmap
function. For more information check out ?pheatmap
.
heatmap_var_peaks(results, heatmap.peak.count = 200, cluster_cols = F)
Enrichment Plots
You can plot your enrichment results using:
dot_plot(results)
## Warning: Removed 54 rows containing missing values or values outside the scale range
## (`geom_point()`).
if it gets too crowded, you can get rid of the irrelevant pathways as well:
dot_plot(results, filter.pathways = T)
Creating different contrasts
Note that you can further divide the resolution of contrasts, for instance this is also a valid vector
contrasts <- sapply(strsplit(colnames(bed), split = "-", fixed = T),
function(x){paste(x[1:4], collapse = ".")})[4:25]
unique(contrasts)
## [1] "B6.18mo.M.BM" "B6.18mo.F.BM" "NZO.18mo.F.BM" "NZO.18mo.M.BM"
## [5] "B6.3mo.F.BM" "B6.3mo.M.BM" "NZO.3mo.F.BM" "NZO.3mo.M.BM"
in this case, each of them will be compared to each other which will result in 28 different comparisons.
Comparison scheme
As default, cinaR
will use one vs one (OVA) scheme,
comparing each contrast to others one by one. However, this can be
changed to one vs all (OVA) which will compare each contrast to
rest:
# one-vs-one (results in n choose k comparisons, default)
cinaR(..., comparison.scheme = "OVO")
# one-vs-all (results in n comparisons)
cinaR(..., comparison.scheme = "OVA")
Running for bulk RNA-seq data
To run cinaR
with RNA-seq experiments, just provide the
count matrix, and specify the experiment type:
cinaR(matrix = count.matrix, ..., experiment.type = "RNA-Seq")
Note that, count.matrix
should be in the form of \(g \times (1+n)\) where \(g\) is the number of genes and \(n\) is the number of samples, and plus one
is for gene names.
Note that currently
cinaR
can only handle gene symbols (e.g. FOSL2, FOXA) and ensembl stable IDs (e.g. ENSG00000010404) for both human and mice!
Running enrichment with different dataset
You can run the enrichment analyses with a custom gene set:
cinaR(..., geneset = new_geneset)
cinaRgenesets
Easiest way to do this is to use cinaRgenesets package. You can select your gene set of interest and just plug it into your pipeline.
MSigDB
You can also download different gene sets from MSigDB site. Note that you should use the human gene symbol versions.
You can use
read.gmt
function fromqusage
package to read these genesets into your current environment.
Custom gene sets
A geneset
must be a .gmt
formatted symbol
file.
You can familiarize yourself with the format by checking out :
# default geneset to be used
data("VP2008")
If you have gene and pathway names in a
data.frame
, you can usesplit
function to create your own.gmt
formatted gene sets e.g.split(df$genes, df$pathways)
.
Selecting different reference genomes
For now, cinaR
supports 3 genomes for human and mice
models:
hg38
hg19
mm10
You can set your it using reference.genome
argument.
Batch Effect Correction
If you suspect your data have unknown batch effects, you can use:
cinaR(..., batch.correction = T)
This option will run Surrogate
Variable Analysis (SVA) and try to adjust your data for unknown
batch effects. If however, you already know the batches of the samples,
you can simply set the batch.information
argument as well.
This will not run the SVA but just add the batches to design matrix.
# runs SVA
cinaR(..., batch.correction = T)
# runs SVA with 2 surrogate variables
cinaR(..., batch.correction = T, sv.number = 2)
# adds only batch information to the design matrix! (does not run SVA)
# batch.information should be number a vector where
# the length of it equals to the number of samples.
cinaR(..., batch.correction = T, batch.information = c(rep(0, 11), rep(1,11)))
Reminder - In our example data we have 22 samples
Adding extra covariates
Sometimes, one might want to add additional covariates to adjust the design matrix further, which affects the down-stream analyses. For instance, ages or sexes of the samples could be additional covariates. To account for those:
# Ages of the samples could be not in biological interests but should be accounted for!
cinaR(..., additional.covariates = c((18, 11), (3, 11)))
# More than one covariate for instance, sex and age
sex.info <- c("M", "F", "M", "F", "F", "F", "F", "F", "M", "M", "M",
"F", "F", "M", "M", "M", "F", "F", "M", "M", "F", "M")
age.info <- c((18, 11), (3, 11)
covs <- data.frame(Sex = sex.info, Age = age.info)
cinaR(..., additional.covariates = covs)
Saving DA peaks to excel
Setting save.DA.peaks = TRUE
in cinaR
function will create a DApeaks.xlsx
file in the current
directory. This file includes all the comparisons in different tabs.
Additionally, you can set the path/name of the file using
DA.peaks.path
argument after setting
save.DA.peaks = TRUE
.
For instance,
results <- cinaR(..., save.DA.peaks = T, DA.peaks.path = "./Peaks_mice.xlsx")
will create an excel file with name Peaks_mice.xlsx
in
the current directory.
Using different GLM algorithms
Currently, cinaR
supports 4 different algorithms,
namely;
- edgeR
- limma-voom
- limma-trend
- DESeq2
If not set, it uses edgeR
for differential analyses. You
can change the used algorithm by simply setting DA.choice
argument. For more information, ?cinaR
Some useful arguments
# new FDR threshold for DA peaks
results <- cinaR(..., DA.fdr.threshold = 0.1)
# filters out pathways
results <- cinaR(..., enrichment.FDR.cutoff = 0.1)
# does not run enrichment pipeline
results <- cinaR(..., run.enrichment = FALSE)
# creates the piechart from chIpSeeker package
results <- cinaR(..., show.annotation.pie = TRUE)
# change cut-off value for dot plots
dot_plot(..., fdr.cutoff = 0.05)
References
Robinson MD, McCarthy DJ, Smyth GK (2010). “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics, 26(1), 139-140. doi: 10.1093/bioinformatics/btp616.
Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK (2015). “limma powers differential expression analyses for RNA-sequencing and microarray studies.” Nucleic Acids Research, 43(7), e47.
Love, M.I., Huber, W., Anders, S. (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15:550. 10.1186/s13059-014-0550-8
Session Info
## R version 4.4.0 (2024-04-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] cinaR_0.2.4
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3
## [2] jsonlite_1.8.8
## [3] magrittr_2.0.3
## [4] GenomicFeatures_1.56.0
## [5] farver_2.1.2
## [6] rmarkdown_2.27
## [7] fs_1.6.4
## [8] BiocIO_1.14.0
## [9] zlibbioc_1.50.0
## [10] ragg_1.3.2
## [11] vctrs_0.6.5
## [12] memoise_2.0.1
## [13] Rsamtools_2.20.0
## [14] RCurl_1.98-1.14
## [15] ggtree_3.12.0
## [16] htmltools_0.5.8.1
## [17] S4Arrays_1.4.1
## [18] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [19] plotrix_3.8-4
## [20] curl_5.2.1
## [21] SparseArray_1.4.5
## [22] gridGraphics_0.5-1
## [23] sass_0.4.9
## [24] KernSmooth_2.23-22
## [25] bslib_0.7.0
## [26] desc_1.4.3
## [27] plyr_1.8.9
## [28] cachem_1.1.0
## [29] GenomicAlignments_1.40.0
## [30] igraph_2.0.3
## [31] lifecycle_1.0.4
## [32] pkgconfig_2.0.3
## [33] Matrix_1.7-0
## [34] R6_2.5.1
## [35] fastmap_1.2.0
## [36] GenomeInfoDbData_1.2.12
## [37] MatrixGenerics_1.16.0
## [38] digest_0.6.35
## [39] aplot_0.2.2
## [40] enrichplot_1.24.0
## [41] colorspace_2.1-0
## [42] patchwork_1.2.0
## [43] AnnotationDbi_1.66.0
## [44] S4Vectors_0.42.0
## [45] textshaping_0.3.7
## [46] GenomicRanges_1.56.0
## [47] RSQLite_2.3.6
## [48] labeling_0.4.3
## [49] fansi_1.0.6
## [50] httr_1.4.7
## [51] polyclip_1.10-6
## [52] abind_1.4-5
## [53] compiler_4.4.0
## [54] bit64_4.0.5
## [55] withr_3.0.0
## [56] BiocParallel_1.38.0
## [57] viridis_0.6.5
## [58] DBI_1.2.2
## [59] highr_0.10
## [60] gplots_3.1.3.1
## [61] ggforce_0.4.2
## [62] MASS_7.3-60.2
## [63] ChIPseeker_1.40.0
## [64] DelayedArray_0.30.1
## [65] rjson_0.2.21
## [66] HDO.db_0.99.1
## [67] caTools_1.18.2
## [68] gtools_3.9.5
## [69] tools_4.4.0
## [70] scatterpie_0.2.2
## [71] ape_5.8
## [72] glue_1.7.0
## [73] restfulr_0.0.15
## [74] nlme_3.1-164
## [75] GOSemSim_2.30.0
## [76] shadowtext_0.1.3
## [77] grid_4.4.0
## [78] reshape2_1.4.4
## [79] fgsea_1.30.0
## [80] generics_0.1.3
## [81] gtable_0.3.5
## [82] tidyr_1.3.1
## [83] data.table_1.15.4
## [84] tidygraph_1.3.1
## [85] utf8_1.2.4
## [86] XVector_0.44.0
## [87] BiocGenerics_0.50.0
## [88] stringr_1.5.1
## [89] ggrepel_0.9.5
## [90] pillar_1.9.0
## [91] yulab.utils_0.1.4
## [92] limma_3.60.2
## [93] splines_4.4.0
## [94] dplyr_1.1.4
## [95] tweenr_2.0.3
## [96] treeio_1.28.0
## [97] lattice_0.22-6
## [98] rtracklayer_1.64.0
## [99] bit_4.0.5
## [100] tidyselect_1.2.1
## [101] GO.db_3.19.1
## [102] locfit_1.5-9.9
## [103] Biostrings_2.72.0
## [104] knitr_1.46
## [105] TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0
## [106] gridExtra_2.3
## [107] IRanges_2.38.0
## [108] edgeR_4.2.0
## [109] SummarizedExperiment_1.34.0
## [110] stats4_4.4.0
## [111] xfun_0.44
## [112] graphlayouts_1.1.1
## [113] Biobase_2.64.0
## [114] statmod_1.5.0
## [115] matrixStats_1.3.0
## [116] pheatmap_1.0.12
## [117] stringi_1.8.4
## [118] UCSC.utils_1.0.0
## [119] lazyeval_0.2.2
## [120] ggfun_0.1.4
## [121] yaml_2.3.8
## [122] boot_1.3-30
## [123] evaluate_0.23
## [124] codetools_0.2-20
## [125] ggraph_2.2.1
## [126] qvalue_2.36.0
## [127] tibble_3.2.1
## [128] BiocManager_1.30.23
## [129] ggplotify_0.1.2
## [130] cli_3.6.2
## [131] systemfonts_1.1.0
## [132] munsell_0.5.1
## [133] jquerylib_0.1.4
## [134] Rcpp_1.0.12
## [135] GenomeInfoDb_1.40.0
## [136] png_0.1-8
## [137] XML_3.99-0.16.1
## [138] parallel_4.4.0
## [139] pkgdown_2.0.9
## [140] ggplot2_3.5.1
## [141] blob_1.2.4
## [142] DOSE_3.30.1
## [143] bitops_1.0-7
## [144] tidytree_0.4.6
## [145] viridisLite_0.4.2
## [146] scales_1.3.0
## [147] purrr_1.0.2
## [148] crayon_1.5.2
## [149] rlang_1.1.3
## [150] fastmatch_1.1-4
## [151] cowplot_1.1.3
## [152] KEGGREST_1.44.0