| Title: | Robust Pathway Enrichment for DNA Methylation Studies Using Ensemble Voting |
|---|---|
| Description: | Performs pathway enrichment analysis using a voting-based framework that integrates CpG–gene regulatory information from expression quantitative trait methylation (eQTM) data. For a grid of top-ranked CpGs and filtering thresholds, gene sets are generated and refined using an entropy-based pruning strategy that balances information richness, stability, and probe bias correction. In particular, gene lists dominated by genes with disproportionately high numbers of CpG mappings are penalized to mitigate active probe bias—a common artifact in methylation data analysis. Enrichment results across parameter combinations are then aggregated using a voting scheme, prioritizing pathways that are consistently recovered under diverse settings and robust to parameter perturbations. |
| Authors: | Yinan Zheng [aut, cre] (ORCID: <https://orcid.org/0000-0002-2006-7320>) |
| Maintainer: | Yinan Zheng <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.3 |
| Built: | 2026-05-19 09:08:33 UTC |
| Source: | https://github.com/yinanzheng/pathwayvote |
Create an expression quantitative trait methylation (eQTM) object
create_eQTM(data, metadata = list())
data |
A data.frame containing eQTM data with columns: cpg, statistics, p_value, distance, and at least one of entrez or ensembl.
|
metadata |
A list of metadata (optional). |
An eQTM object.
data <- data.frame(
  cpg = c("cg000001", "cg000002"),
  statistics = c(2.5, -1.8),
  p_value = c(0.01, 0.03),
  distance = c(50000, 80000),
  entrez = c("673", "1956")
)
eqtm_obj <- create_eQTM(data)
A class to store eQTM data for pathway analysis. eQTM stands for Expression Quantitative Trait Methylation.
data |
A data.frame containing eQTM data with columns: cpg, statistics, p_value, distance, and at least one of entrez or ensembl. |
metadata |
A list of metadata (e.g., data source, time point). Reserved for future use. |
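The column requirements above can be illustrated with a small S4 sketch. This is not the package's source: the class `DemoEQTM`, the constructor `create_demo_eqtm`, and the accessor `demoData` are hypothetical names, and the validity rule is only an assumption drawn from the documented column list.

```r
library(methods)

# Hypothetical stand-in for the eQTM class; all names here are illustrative.
setClass(
  "DemoEQTM",
  representation(data = "data.frame", metadata = "list"),
  validity = function(object) {
    required <- c("cpg", "statistics", "p_value", "distance")
    missing_cols <- setdiff(required, names(object@data))
    if (length(missing_cols) > 0) {
      return(paste("Missing required columns:",
                   paste(missing_cols, collapse = ", ")))
    }
    # At least one gene identifier column must be present
    if (!any(c("entrez", "ensembl") %in% names(object@data))) {
      return("Need at least one of 'entrez' or 'ensembl'")
    }
    TRUE
  }
)

# Constructor with the same argument shape as create_eQTM(data, metadata = list())
create_demo_eqtm <- function(data, metadata = list()) {
  new("DemoEQTM", data = data, metadata = metadata)
}

# Accessor in the style of getData()
setGeneric("demoData", function(object) standardGeneric("demoData"))
setMethod("demoData", "DemoEQTM", function(object) object@data)

demo <- create_demo_eqtm(data.frame(
  cpg = c("cg000001", "cg000002"),
  statistics = c(2.5, -1.8),
  p_value = c(0.01, 0.03),
  distance = c(50000, 80000),
  entrez = c("673", "1956")
))

# A data.frame lacking the required columns fails the validity check
bad <- tryCatch(
  create_demo_eqtm(data.frame(cpg = "cg000001")),
  error = function(e) "rejected"
)
```

Placing the check in the class validity function means every object is verified at construction, so downstream code can rely on the columns being present.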
Retrieve the eQTM data.frame from an eQTM object.
getData(object)

## S4 method for signature 'eQTM'
getData(object)
object |
An eQTM object. |
A data.frame stored in the object.
Retrieve the metadata list from an eQTM object.
getMetadata(object)

## S4 method for signature 'eQTM'
getMetadata(object)
object |
An eQTM object. |
A list containing metadata.
Performs pathway enrichment analysis using a voting-based framework that integrates CpG-gene regulatory information from expression quantitative trait methylation (eQTM) data. For a grid of top-ranked CpGs and filtering thresholds, gene sets are generated and refined using an entropy-based pruning strategy that balances information richness, stability, and probe bias correction. In particular, gene lists dominated by genes with disproportionately high numbers of CpG mappings are penalized to mitigate active probe bias, a common artifact in methylation data analysis. Enrichment results across parameter combinations are then aggregated using a voting scheme, prioritizing pathways that are consistently recovered under diverse settings and robust to parameter perturbations.
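The voting step described above can be sketched in base R. This is a simplified illustration, not the package's implementation: `tally_votes` and the pathway IDs are hypothetical, and rounding the cube-root threshold up with `ceiling()` is an assumption.

```r
# Hypothetical input: significant pathway IDs returned by each enrichment run
runs <- list(
  c("P1", "P2"), c("P1", "P3"), c("P1", "P2"), c("P1"), c("P1", "P4"),
  c("P1", "P2"), c("P1", "P3"), character(0), c("P5"), character(0)
)

# Count how many runs recovered each pathway and keep the consistent ones.
tally_votes <- function(runs, min_votes = NULL) {
  n_runs <- length(runs)
  if (is.null(min_votes)) {
    # Cube-root default; rounding up is an assumption of this sketch
    min_votes <- ceiling(n_runs^(1 / 3))
  }
  # Each run contributes at most one vote per pathway
  votes <- table(unlist(lapply(runs, unique)))
  votes <- sort(votes[votes >= min_votes], decreasing = TRUE)
  data.frame(
    ID = as.character(names(votes)),
    votes = as.integer(votes),
    stringsAsFactors = FALSE
  )
}

consensus <- tally_votes(runs)
print(consensus)
```

With 10 runs the default threshold in this sketch is 3 votes, so only pathways recovered under several parameter combinations survive.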
pathway_vote(
  cpg_input,
  eQTM,
  databases = c("Reactome"),
  k_grid = NULL,
  stat_grid = NULL,
  distance_grid = NULL,
  grid_size = 5,
  overlap_threshold = 0.7,
  fixed_prune = NULL,
  min_genes_per_hit = 2,
  readable = FALSE,
  workers = NULL,
  verbose = FALSE
)
cpg_input |
A data.frame containing CpG-level results or identifiers. The first column must contain CpG IDs,
which can be Illumina probe IDs (e.g., "cg00000029") for array-based data, or genomic coordinates
(e.g., "chr1:10468" or "chr1:10468:+") for sequencing-based data. These IDs will be matched against
the eQTM object. Optionally, a second column may provide a ranking metric. If supplied, this must be:
(i) the complete set of raw p-values from association tests (required for automatic |
eQTM |
An |
databases |
A character vector of pathway databases. Supported values: "Reactome", "KEGG", and "GO". |
k_grid |
A numeric vector specifying the top-k CpGs used for gene set construction. If |
stat_grid |
A numeric vector of eQTM statistic thresholds. If NULL, generated based on quantiles of the observed distribution. |
distance_grid |
A numeric vector of CpG-gene distance thresholds (in base pairs). If NULL, generated based on quantiles of the observed distribution. |
grid_size |
Integer. Number of values in each grid when auto-generating. Default is 5. |
overlap_threshold |
Numeric between 0 and 1. Controls the maximum allowed Jaccard similarity between gene lists during redundancy filtering. Default is 0.7, which provides robust and stable results across a variety of simulation scenarios. |
fixed_prune |
Integer or NULL. Minimum number of votes required to retain a pathway. If NULL, the cube root of N is used, where N is the total number of enrichment runs. |
min_genes_per_hit |
Minimum number of genes a pathway must include to be considered. Default is 2. |
readable |
Logical. Whether to convert Entrez IDs to gene symbols in enrichment results. |
workers |
Optional integer. Number of parallel workers. If NULL, 2 logical cores are used. |
verbose |
Logical. Whether to print progress messages. |
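Two of the arguments above, the quantile-based grids and `overlap_threshold`, can be illustrated with a short base R sketch. The helpers `make_grid`, `jaccard`, and `filter_redundant` are hypothetical; the choice of evenly spaced interior quantiles and the greedy filtering order are assumptions, not the package's documented algorithm.

```r
# Quantile-based grid: evenly spaced interior quantiles (an assumption)
make_grid <- function(x, grid_size = 5) {
  probs <- seq(0, 1, length.out = grid_size + 2)[-c(1, grid_size + 2)]
  unique(unname(quantile(x, probs = probs)))
}

# Jaccard similarity between two gene lists: |A intersect B| / |A union B|
jaccard <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  union_size <- length(union(a, b))
  if (union_size == 0) return(0)
  length(intersect(a, b)) / union_size
}

# Greedy redundancy filter: keep a gene list only if its similarity to every
# list already kept does not exceed the threshold.
filter_redundant <- function(gene_lists, overlap_threshold = 0.7) {
  kept <- list()
  for (nm in names(gene_lists)) {
    sims <- vapply(kept, jaccard, numeric(1), b = gene_lists[[nm]])
    if (all(sims <= overlap_threshold)) kept[[nm]] <- gene_lists[[nm]]
  }
  kept
}

distance_grid <- make_grid(seq(1000, 100000, by = 1000), grid_size = 5)

gene_lists <- list(
  set1 = c("673", "1956", "5290", "7157"),
  set2 = c("673", "1956", "5290", "7157", "7422"),  # Jaccard 0.8 with set1
  set3 = c("7422", "5290")                          # Jaccard 0.2 with set1
)
kept <- filter_redundant(gene_lists, overlap_threshold = 0.7)
```

Here `set2` is dropped because it shares 4 of 5 genes with `set1`, which exceeds the 0.7 threshold, while `set3` is retained.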
A named list of data.frames containing:
Enrichment results for each selected database (e.g., 'Reactome', 'KEGG', 'GO'). Each data.frame contains columns: 'ID', 'p.adjust', 'Description', and 'geneID'.
'CpG_Gene_Mapping': A data.frame showing the CpG-Gene relationships for genes identified in the significantly enriched pathways, limited to the CpGs present in the input 'cpg_input'.
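Because the return value mixes enrichment tables with the 'CpG_Gene_Mapping' table, it can be convenient to stack only the enrichment tables into one data.frame. The sketch below uses a mock list shaped like the documented output; every ID, p-value, and gene string in it is made up for illustration.

```r
# Mock output with the documented columns (all values are placeholders)
results <- list(
  Reactome = data.frame(
    ID = c("PW1", "PW2"), p.adjust = c(0.01, 0.20),
    Description = c("Pathway A", "Pathway B"),
    geneID = c("673", "5290"), stringsAsFactors = FALSE
  ),
  KEGG = data.frame(
    ID = "PW3", p.adjust = 0.03, Description = "Pathway C",
    geneID = "7422", stringsAsFactors = FALSE
  ),
  CpG_Gene_Mapping = data.frame(
    cpg = "cg000001", entrez = "673", stringsAsFactors = FALSE
  )
)

# Stack the enrichment tables, tagging each row with its database
enrich_names <- setdiff(names(results), "CpG_Gene_Mapping")
combined <- do.call(rbind, lapply(enrich_names, function(db) {
  data.frame(database = db, results[[db]], stringsAsFactors = FALSE)
}))

# Order by adjusted p-value and keep pathways below an illustrative cutoff
combined <- combined[order(combined$p.adjust), ]
rownames(combined) <- NULL
top <- combined[combined$p.adjust < 0.05, ]
print(top)
```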
set.seed(123)

# Simulated EWAS result: a mix of signal and noise
n_cpg <- 500
ewas <- data.frame(
  cpg = paste0("cg", sprintf("%08d", 1:n_cpg)),
  p_value = c(runif(n_cpg*0.1, 1e-9, 1e-5),
              runif(n_cpg*0.2, 1e-3, 0.05),
              runif(n_cpg*0.7, 0.05, 1))
)

# Corresponding eQTM mapping (some of these CpGs have gene links)
signal_genes <- c("5290", "673", "1956", "7157", "7422")
background_genes <- as.character(1000:9999)
entrez_signal <- sample(signal_genes, n_cpg * 0.1, replace = TRUE)
entrez_background <- sample(setdiff(background_genes, signal_genes), n_cpg * 0.9, replace = TRUE)

eqtm_data <- data.frame(
  cpg = ewas$cpg,
  statistics = rnorm(n_cpg, mean = 2, sd = 1),
  p_value = runif(n_cpg, min = 0.001, max = 0.05),
  distance = sample(1000:100000, n_cpg, replace = TRUE),
  entrez = c(entrez_signal, entrez_background),
  stringsAsFactors = FALSE
)
eqtm_obj <- create_eQTM(eqtm_data)

# Run pathway voting with minimal settings
## Not run: 
results <- pathway_vote(
  cpg_input = ewas,
  eQTM = eqtm_obj,
  databases = c("GO", "KEGG", "Reactome"),
  readable = TRUE,
  verbose = TRUE
)
head(results$GO)
head(results$KEGG)
head(results$Reactome)

# Export results to Excel (optional)
library(openxlsx)
write_enrich_results_xlsx(results, "pathway_vote_results.xlsx")

## End(Not run)
Exports the results from 'pathway_vote' to a multi-sheet Excel file. Validates that the input is a list, checks for the 'openxlsx' package, and handles sheet naming to comply with Excel limitations.
write_enrich_results_xlsx(results, file = "enrich_results.xlsx")
results |
A named list of data.frames (e.g., output from 'pathway_vote'). |
file |
Character. Output file path (e.g., "enrich_results.xlsx"). |
Invisible. The path to the saved file.
## Not run: 
# Assuming `res` is the output from pathway_vote(...)
write_enrich_results_xlsx(res, "my_results.xlsx")

## End(Not run)
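The sheet-naming limitations mentioned for this function refer to Excel's rules: a sheet name may have at most 31 characters and may not contain any of `\ / ? * [ ] :`. The sketch below shows one way such names could be made compliant in base R. The helper `sanitize_sheet_names` is hypothetical and is not taken from the package; how the package itself resolves these cases is not documented here.

```r
# Hypothetical helper: make list names safe to use as Excel sheet names
sanitize_sheet_names <- function(x) {
  # Replace characters Excel forbids in sheet names
  x <- gsub("[\\[\\]:\\\\/?*]", "_", x, perl = TRUE)
  # Enforce the 31-character limit
  x <- substr(x, 1, 31)
  if (anyDuplicated(x) > 0) {
    # Leave room for a short numeric suffix, then disambiguate
    x <- make.unique(substr(x, 1, 28), sep = "_")
  }
  x
}

clean <- sanitize_sheet_names(c("GO", "KEGG/Reactome", strrep("a", 40)))
deduped <- sanitize_sheet_names(c("A:B", "A/B"))
print(clean)
print(deduped)
```

Truncating to 28 characters before calling `make.unique()` keeps names within the limit as long as the suffix stays below three digits, which is a simplification of this sketch.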