---
title: "Getting started with DANDELION"
author: "Peixin Tian"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting started with DANDELION}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
```

# Overview

`DANDELION` identifies candidate disease-proximal genes (DPGs) by integrating trans-regulatory association evidence with gene-level disease association evidence. In the package interface, upstream exposures are referred to as `gene1`, and candidate disease-proximal genes are referred to as `gene2`.

The exposure side can be either:

- **Gene-based**: columns of `p.trans` are distal regulatory genes.
- **SNP-based**: columns of `p.trans` are SNPs.

For both modes, rows of `p.trans` are candidate disease-proximal genes (`gene2`). The vector `p.wes` contains gene-level disease association p-values and should be named using the same gene identifiers as `rownames(p.trans)`.

`DANDELION` removes nearby cis genes using genomic positions and evaluates distal/trans-mediated paths. It returns structured objects that can be printed, inspected, and passed to downstream pair-organization functions.

# Installation

Install the development version from GitHub:

```{r, eval = FALSE}
# install.packages("remotes")
remotes::install_github("mxxptian/DANDELION")
```

Load the package:

```{r}
library(DANDELION)
```

The optional Bioconductor package `qvalue` is used when available for q-value estimation. If `qvalue` is unavailable, `DANDELION` falls back to Benjamini-Hochberg adjusted p-values.

```{r, eval = FALSE}
# install.packages("BiocManager")
BiocManager::install("qvalue")
```

# Input data structure

## `p.trans`

`p.trans` is a numeric matrix of trans-regulatory association p-values.

- Rows: candidate disease-proximal genes (`gene2`)
- Columns: exposures (`gene1`), either distal genes or SNPs
- Both row names and column names are required

## `p.wes`

`p.wes` is a named numeric vector of gene-level disease association p-values. Its names should overlap with `rownames(p.trans)`.

## `ref.table`

`ref.table` is a gene annotation table used for matching genes and removing local cis genes. It must contain:

| Column | Description |
|---|---|
| `gene_name` | Gene symbol or gene identifier |
| `type` | Gene type, such as `protein_coding` or `lincRNA` |
| `Chromosome` | Chromosome in `chr1`, `chr2`, ... format |
| `start` | Gene start position |
| `end` | Gene end position |

For SNP-based DANDELION, `SNP.ref` is also required and must contain `SNP`, `SNPPos`, and `SNPChr`.

# Gene-based DANDELION example

This example uses a small toy dataset. It is designed to demonstrate input format and function usage rather than reproduce manuscript-scale analyses.

```{r gene-data}
set.seed(1)

# Trans-association p-values: rows are gene2, columns are gene1.
p.trans <- matrix(runif(60, 0.001, 0.9), nrow = 12, ncol = 5)
rownames(p.trans) <- paste0("G", 1:12)
colnames(p.trans) <- paste0("E", 1:5)

# Gene-level disease association p-values.
p.wes <- runif(12, 0.001, 0.9)
names(p.wes) <- paste0("G", 1:12)

# Gene annotation table.
ref.table <- data.frame(
  gene_name = c(paste0("G", 1:12), paste0("E", 1:5)),
  type = "protein_coding",
  Chromosome = "chr1",
  start = seq_len(17) * 1e7,
  end = seq_len(17) * 1e7 + 1000
)

head(p.trans[, 1:3])
head(p.wes)
head(ref.table)
```

Run gene-based DANDELION:

```{r gene-run}
res.gene <- med_gene(
  p.trans = p.trans,
  p.wes = p.wes,
  ref.table = ref.table,
  gene1.list = colnames(p.trans),
  gene1.type = "Gene",
  target.fdr = 0.1,
  dist = 5e6
)

res.gene
```

The returned object has class `dandelion_result` and contains:

```{r gene-result-contents}
names(res.gene)
res.gene$gene1
res.gene$mat.sig[1:5, 1:3]
res.gene$mat.p[1:5, 1:3]
```

Next, organize significant gene-gene pairs:

```{r gene-pairs}
ref.table.keep <- ref.table[
  ref.table$type %in% c("lincRNA", "protein_coding") &
    !(ref.table$Chromosome %in% c("chrM", "chrX", "chrY")),
]

pair.gene <- calc_pair.gene(
  mat.sig = res.gene$mat.sig,
  mat.p = res.gene$mat.p,
  p.wes = p.wes,
  gene1 = res.gene$gene1,
  ref.table.keep = ref.table.keep,
  eta.wgs = 1e-5
)

pair.gene
names(pair.gene)
```

The organized pair object contains the significant pair table, locus-level pair table, WES-significant `gene2` genes, and non-WES-significant `gene2` genes:

```{r gene-pair-contents}
head(pair.gene$pairs_dact)
head(pair.gene$gene.pair)
pair.gene$sig_gene2
pair.gene$non_sig.gene2
```

# SNP-based DANDELION example

For SNP-based DANDELION, columns of `p.trans` are SNPs. The user must provide SNP positions through `SNP.ref`.

```{r snp-data}
set.seed(2)

p.trans.snp <- matrix(runif(60, 0.001, 0.9), nrow = 12, ncol = 5)
rownames(p.trans.snp) <- paste0("G", 1:12)
colnames(p.trans.snp) <- paste0("rs", 1:5)

p.wes.snp <- runif(12, 0.001, 0.9)
names(p.wes.snp) <- paste0("G", 1:12)

ref.table.snp <- data.frame(
  gene_name = c(paste0("G", 1:12), paste0("E", 1:5)),
  type = "protein_coding",
  Chromosome = "chr1",
  start = seq_len(17) * 1e7,
  end = seq_len(17) * 1e7 + 1000
)

SNP.ref <- data.frame(
  SNP = paste0("rs", 1:5),
  SNPPos = seq_len(5) * 2e7,
  SNPChr = "1"
)

uniq_snp <- data.frame(
  SNP = paste0("rs", 1:5),
  GeneSymbol = paste0("E", 1:5)
)

head(p.trans.snp[, 1:3])
head(SNP.ref)
head(uniq_snp)
```

Run SNP-based DANDELION:

```{r snp-run}
res.snp <- med_gene(
  p.trans = p.trans.snp,
  p.wes = p.wes.snp,
  ref.table = ref.table.snp,
  gene1.list = colnames(p.trans.snp),
  gene1.type = "SNP",
  SNP.ref = SNP.ref,
  target.fdr = 0.1,
  dist = 5e6
)

res.snp
```

Organize SNP-gene pairs:

```{r snp-pairs}
ref.table.keep.snp <- ref.table.snp[
  ref.table.snp$type %in% c("lincRNA", "protein_coding") &
    !(ref.table.snp$Chromosome %in% c("chrM", "chrX", "chrY")),
]

pair.snp <- calc_pair.snp(
  mat.sig = res.snp$mat.sig,
  mat.p = res.snp$mat.p,
  p.wes = p.wes.snp,
  gene1 = res.snp$gene1,
  uniq_snp = uniq_snp,
  ref.table.keep = ref.table.keep.snp,
  SNP.ref = SNP.ref,
  eta.wgs = 1e-5
)

pair.snp
head(pair.snp$pairs_dact)
head(pair.snp$gene.pair)
```

# Network visualization

`gen_fig()` generates a network figure from an organized pair table. The pair table must contain `region` and `gene2` columns.

Because the toy example may not produce significant pairs, the following code is shown but not evaluated in this vignette.

```{r network-fig, eval = FALSE}
out_dir <- tempdir()
fig_file <- gen_fig(
  gene.pair = pair.gene$gene.pair,
  p.wes = p.wes,
  eta.wgs = 1e-5,
  pic_dir = out_dir
)

fig_file
```

# Verbose output

By default, `DANDELION` functions do not print progress messages to the console. To show progress messages, set `verbose = TRUE`:

```{r verbose-example, eval = FALSE}
res.verbose <- med_gene(
  p.trans = p.trans,
  p.wes = p.wes,
  ref.table = ref.table,
  gene1.list = colnames(p.trans),
  gene1.type = "Gene",
  verbose = TRUE
)
```

# Recommended workflow

A typical analysis workflow is:

1. Prepare the trans-association p-value matrix `p.trans`.
2. Prepare the named gene-level disease association vector `p.wes`.
3. Prepare gene annotation `ref.table`.
4. Run `med_gene()` with `gene1.type = "Gene"` or `gene1.type = "SNP"`.
5. Organize significant pairs using `calc_pair.gene()` or `calc_pair.snp()`.
6. Inspect `pairs_dact`, `gene.pair`, `sig_gene2`, and `non_sig.gene2`.
7. Optionally visualize the resulting network using `gen_fig()`.

# Session information

```{r session-info}
sessionInfo()