This document demonstrates how to use the zingeR method, and we hope you find it helpful.
Before simulating datasets, it is important to estimate some essential parameters from a real dataset so that the simulated data are more realistic. If you do not have a single-cell transcriptomics count matrix at hand, you can use the data collected in the simmethods package via the simmethods::data command.
When you use zingeR to estimate parameters from a real dataset, you must supply a numeric vector that specifies the group or plate each cell comes from, for example other_prior = list(group.condition = the numeric vector).
library(simmethods)
library(SingleCellExperiment)
# Load data
ref_data <- simmethods::data
group_condition <- simmethods::group_condition
## group_condition must be a numeric vector.
other_prior <- list(group.condition = as.numeric(group_condition))
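If your own group labels are stored as character strings rather than as a factor, calling as.numeric() on them directly produces NA values. The following is a minimal sketch in base R of one way to obtain a numeric vector in that case. The labels in cell_labels are hypothetical, and whether zingeR expects a particular coding (for example 1 and 2) is an assumption not confirmed by this document.

```r
# Hypothetical labels, one per cell (not taken from simmethods).
cell_labels <- c("control", "treated", "control", "treated", "treated")

# as.numeric() on a character vector would return NA (with a warning),
# so convert to a factor first and take its integer codes as numbers.
# Levels are sorted alphabetically: "control" -> 1, "treated" -> 2.
group_codes <- as.numeric(factor(cell_labels))

# Basic sanity checks before passing the vector on as group.condition.
stopifnot(is.numeric(group_codes), !anyNA(group_codes))
group_codes
```

The resulting vector can then be used in place of as.numeric(group_condition) when building other_prior.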
Use the simmethods::zingeR_estimation command to run the estimation step.
estimate_result <- simmethods::zingeR_estimation(ref_data = ref_data,
                                                 other_prior = other_prior,
                                                 verbose = TRUE,
                                                 seed = 10)
# Estimating parameters using zingeR
After estimating parameters from a real dataset, we simulate datasets based on the learned parameters under different scenarios.
The reference data contains 160 cells and 4000 genes. If we simulate with the default parameters, we obtain a new dataset of the same size as the reference data. Note that zingeR always simulates two groups of cells, as the nGroups: 2 line in the log output shows.
simulate_result <- simmethods::zingeR_simulation(
  ref_data = ref_data,
  other_prior = other_prior,
  parameters = estimate_result[["estimate_result"]],
  return_format = "SCE",
  seed = 111
)
# nCells: 160
# nGenes: 4000
# nGroups: 2
# prob.group: 0.1
# fc.group: 2
# Loading required package: edgeR
# Loading required package: limma
#
# Attaching package: 'limma'
# The following object is masked from 'package:BiocGenerics':
#
# plotMA
#
# Attaching package: 'edgeR'
# The following object is masked from 'package:SingleCellExperiment':
#
# cpm
# Preparing dataset. Using existing parameters.
# Sampling.
# Calculating differential expression.
# Simulating data.
SCE_result <- simulate_result[["simulate_result"]]
dim(SCE_result)
# [1] 4000 160
head(colData(SCE_result))
# DataFrame with 6 rows and 1 column
# cell_name
# <character>
# Cell1 Cell1
# Cell2 Cell2
# Cell3 Cell3
# Cell4 Cell4
# Cell5 Cell5
# Cell6 Cell6
head(rowData(SCE_result))
# DataFrame with 6 rows and 3 columns
# gene_name de_gene de_fc
# <character> <character> <numeric>
# Gene1 Gene1 no 0
# Gene2 Gene2 no 0
# Gene3 Gene3 no 0
# Gene4 Gene4 no 0
# Gene5 Gene5 no 0
# Gene6 Gene6 no 0
In zingeR, the number of cells and the number of genes can only be set to values higher than those of the reference data. Here, we simulate a new dataset with 1000 cells and 5000 genes:
simulate_result <- simmethods::zingeR_simulation(
  ref_data = ref_data,
  other_prior = list(group.condition = as.numeric(group_condition),
                     nCells = 1000,
                     nGenes = 5000),
  parameters = estimate_result[["estimate_result"]],
  return_format = "list",
  seed = 111
)
# nCells: 1000
# nGenes: 5000
# nGroups: 2
# prob.group: 0.1
# fc.group: 2
# Preparing dataset. Using existing parameters.
# Sampling.
# Calculating differential expression.
# Simulating data.
result <- simulate_result[["simulate_result"]][["count_data"]]
dim(result)
# [1] 5000 1000
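Because the requested size is constrained by the reference data, it can help to check nCells and nGenes before starting a long simulation. The sketch below uses a hypothetical helper, check_sim_size, which is not part of simmethods or zingeR. It assumes a count matrix with genes in rows and cells in columns, and it assumes that values equal to the reference size are acceptable, since the default run reproduces the reference dimensions.

```r
# Hypothetical helper (not part of simmethods): stop early when the
# requested size is smaller than the reference count matrix.
# Assumes genes are in rows and cells are in columns.
check_sim_size <- function(ref_data, nCells, nGenes) {
  stopifnot(is.matrix(ref_data))
  if (nCells < ncol(ref_data)) {
    stop("nCells (", nCells, ") is smaller than the reference (",
         ncol(ref_data), " cells)")
  }
  if (nGenes < nrow(ref_data)) {
    stop("nGenes (", nGenes, ") is smaller than the reference (",
         nrow(ref_data), " genes)")
  }
  invisible(TRUE)
}

# Toy reference matrix standing in for real data: 20 genes x 8 cells.
set.seed(1)
toy_ref <- matrix(rpois(20 * 8, lambda = 5), nrow = 20, ncol = 8)

# Passes: both requested values are at least as large as the reference.
ok <- check_sim_size(toy_ref, nCells = 10, nGenes = 25)
```

With the real data, the same call would be check_sim_size(ref_data, nCells = 1000, nGenes = 5000), provided ref_data is a plain matrix.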
In zingeR, we can only simulate two groups. Note that zingeR does not return cell group information.
For demonstration, we will simulate two groups using the learned parameters. We can set de.prob = 0.2 to simulate 20% of the genes as DEGs.
simulate_result <- simmethods::zingeR_simulation(
  ref_data = ref_data,
  other_prior = list(group.condition = as.numeric(group_condition),
                     nCells = 1000,
                     nGenes = 5000,
                     de.prob = 0.2,
                     fc.group = 4),
  parameters = estimate_result[["estimate_result"]],
  return_format = "list",
  seed = 111
)
# nCells: 1000
# nGenes: 5000
# nGroups: 2
# prob.group: 0.2
# fc.group: 4
# Preparing dataset. Using existing parameters.
# Sampling.
# Calculating differential expression.
# Simulating data.
zingeR does not return cell group information.
result <- simulate_result[["simulate_result"]][["count_data"]]
dim(result)
# [1] 5000 1000
## gene information
gene_info <- simulate_result[["simulate_result"]][["row_meta"]]
### the proportion of DEGs
table(gene_info$de_gene)[2]/nrow(result) ## de.prob = 0.2
# yes
# 0.2
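The table(...)[2] idiom above relies on "yes" being the second entry of the table, which fails if no gene is marked as a DEG. As a minimal sketch of an alternative, the proportion can be computed as the mean of a logical comparison. The gene_info data frame below is a toy stand-in with hypothetical values. Only the column names gene_name and de_gene follow the output shown earlier.

```r
# Toy stand-in for the row_meta table returned by the simulation;
# the values are made up for illustration.
gene_info <- data.frame(
  gene_name = paste0("Gene", 1:10),
  de_gene   = c("yes", "no", "no", "no", "yes", "no", "no", "no", "no", "no")
)

# The mean of a logical vector is the fraction of TRUE values,
# so this gives the proportion of DEGs directly.
de_prop <- mean(gene_info$de_gene == "yes")
de_prop

# The same expression returns 0 instead of NA when there are no DEGs.
no_de_prop <- mean(rep("no", 5) == "yes")
```

Applied to the real result, mean(gene_info$de_gene == "yes") should agree with the table-based value whenever both "no" and "yes" are present.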