% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/detect_damage.R
\name{detect_damage}
\alias{detect_damage}
\title{detect_damage}
\usage{
detect_damage(
  count_matrix,
  ribosome_penalty = 0.01,
  organism = "Hsap",
  annotated_celltypes = FALSE,
  target_damage = c(0.1, 0.8),
  damage_distribution = "right_skewed",
  distribution_steepness = "moderate",
  beta_shape_parameters = NULL,
  damage_levels = 5,
  damage_proportion = 0.15,
  seed = 7,
  mito_quantile = 0.75,
  kN = NULL,
  generate_plot = TRUE,
  display_plot = TRUE,
  palette = c("grey", "#7023FD", "#E60006"),
  filter_threshold = 0.7,
  filter_counts = FALSE,
  verbose = TRUE
)
}
\arguments{
\item{count_matrix}{Matrix or dgCMatrix containing the counts from
single cell RNA sequencing data.}

\item{ribosome_penalty}{Numeric specifying the factor by which the
probability of losing a transcript from a ribosomal gene is multiplied.
Values closer to 0 represent a greater penalty.
\itemize{
\item Default is 0.01.
}}

\item{organism}{String specifying the organism of origin of the input
data. There are two standard options,
\itemize{
\item "Hsap"
\item "Mmus"
}

If a user wishes to use a non-standard organism, they must input a list
containing strings for the patterns to match mitochondrial and ribosomal
genes of the organism. If available, nuclear-encoded genes that are likely
retained in the nucleus, such as in nuclear speckles, must also
be specified. An example for humans is below,
\itemize{
\item organism = list(mito_pattern = "^MT-",
ribo_pattern = "^(RPS|RPL)",
nuclear = c("NEAT1", "XIST", "MALAT1"))
\item Default is "Hsap"
}}

\item{annotated_celltypes}{Boolean specifying whether input matrix has
cell type information stored.
\itemize{
\item Default is FALSE
}}

\item{target_damage}{Numeric vector specifying the lower and upper bounds of
the level of damage that will be introduced.

Here, damage refers to the amount of cytoplasmic RNA lost by a cell where
values closer to 1 indicate more loss and therefore more heavily damaged
cells.
\itemize{
\item Default is c(0.1, 0.8)
}}

\item{damage_distribution}{String specifying whether the distribution of
damage levels among the damaged cells should be shifted towards the
upper or lower range of damage specified in 'target_damage' or follow
a symmetric distribution between them. There are three valid options:
\itemize{
\item "right_skewed"
\item "left_skewed"
\item "symmetric"
\item Default is "right_skewed"
}}

\item{distribution_steepness}{String specifying how concentrated the
damaged cells are about the mean of the target distribution specified in
'target_damage'. Here, an increase in steepness manifests in a more
apparent skewness. There are three valid options:
\itemize{
\item "shallow"
\item "moderate"
\item "steep"
\item Default is "moderate"
}}

\item{beta_shape_parameters}{Numeric vector that allows for the shape
parameters of the beta distribution to be defined explicitly. This offers
greater flexibility than allowed by the 'damage_distribution' and
'distribution_steepness' parameters and will override the defaults they
offer.
\itemize{
\item Default is 'NULL'
}}

\item{damage_levels}{Numeric specifying the number of distinct sets of
artificially damaged cells simulated, each with a defined range of loss.
Default options include,
\itemize{
\item 3 : c(0.00001, 0.08), c(0.1, 0.4), c(0.5, 0.9)
\item 5 : c(0.00001, 0.08), c(0.1, 0.3), c(0.3, 0.5), c(0.5, 0.7), c(0.7, 0.9)
\item 7 : c(0.00001, 0.08), c(0.1, 0.3), c(0.3, 0.4), c(0.4, 0.5), c(0.5, 0.7),
c(0.7, 0.9), c(0.9, 0.99999).
}

A user can also provide a list specifying sets with their own ranges of
loss,
\itemize{
\item damage_levels = list(
pANN_50 = c(0.1, 0.5),
pANN_100 = c(0.5, 1)
)
}

By introducing more sets of damage a user can improve the accuracy of
loss estimations (scaled_pANN) as they are found through scaling the pANN
within each set according to the lower and upper boundary of the set's
damage level. However, introducing more sets increases the computational
time for the function.
\itemize{
\item Default is 5.
}}

\item{damage_proportion}{Numeric describing what proportion
of the input data should be altered to resemble damaged data.
\itemize{
\item Must range between 0 and 1.
\item Default is 0.15.
}}

\item{seed}{Numeric specifying the random seed to ensure reproducibility of
the function's output. Setting a seed ensures that the random sampling
and perturbation processes produce the same results when the function
is run multiple times with the same input data and parameters.
\itemize{
\item Default is 7.
}}

\item{mito_quantile}{Numeric between 0 and 1 specifying the quantile of
mitochondrial proportion below which cells are sampled for simulations.
This protects against simulating damaged cell profiles from cells that
are likely already damaged.
\itemize{
\item Default is 0.75.
}}

\item{kN}{Numeric describing how many nearest neighbours are considered
for pANN calculations. kN cannot exceed the total cell number.
\itemize{
\item Default is one third of the total cell number.
}}

\item{generate_plot}{Boolean specifying whether the QC plot should
be generated. QC plots are generated by default as we recommend
verifying that the perturbed data retains characteristics of true
single cell data.
\itemize{
\item Default is TRUE.
}}

\item{display_plot}{Boolean specifying whether the output QC plot should
be displayed in the global environment. Naturally, this is only relevant
when generate_plot is TRUE.
\itemize{
\item Default is TRUE.
}}

\item{palette}{Character vector specifying the three colours that will be
used to create the continuous colour palette for colouring the
'damage_column'.
\itemize{
\item Default is a range from grey through purple to red,
c("grey", "#7023FD", "#E60006").
}}

\item{filter_threshold}{Numeric specifying the proportion of RNA loss
above which a cell should be considered damaged.
\itemize{
\item Default is 0.7.
}}

\item{filter_counts}{Boolean specifying whether the output matrix
should be filtered and returned containing only cells that fall below
the filter threshold. If FALSE, a data frame containing cell
barcodes and their associated label, either 'damaged' or 'cell',
is returned instead.
\itemize{
\item Default is FALSE.
}}

\item{verbose}{Boolean specifying whether messages and function progress
should be displayed in the console.
\itemize{
\item Default is TRUE.
}}
}
\value{
Filtered matrix or data frame containing damage labels.
}
\description{
Quality control function to identify and filter damaged cells from an
input count matrix, where 'damage' is defined by the loss of cytoplasmic RNA.
}
\details{
Using the simulation framework of \code{simulate_counts()}, \code{detect_damage()}
generates artificially damaged cell profiles by introducing defined levels
of RNA loss into the input data. True and artificial cells are then
merged and pre-processed to compute the following quality control metrics:
\itemize{
\item Log-normalized feature count
\item Log-normalized total counts
\item Mitochondrial proportion
\item Ribosomal proportion
\item Log-normalized MALAT1 gene expression
}

Principal component analysis (PCA) is performed on these metrics,
and a Euclidean distance matrix is constructed from the PC embeddings.
For each true cell, the proportion of nearest neighbours that are
artificial cells (pANN) is calculated across all damage levels and the
damage level with the highest pANN is assigned to the true cell.
Finally, cells exceeding a specified damage threshold, \code{filter_threshold},
are marked as damaged.
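
As a rough illustration of the pANN step, the base R sketch below uses a toy
matrix of random values in place of the five metrics and a single set of
artificial cells rather than several damage levels. The object names
(\code{metrics}, \code{is_artificial}, \code{k}) and the choice of 10
neighbours are assumptions made for illustration and do not reflect the
internal implementation of \code{detect_damage()}.

\preformatted{
# Toy data: 30 'true' and 10 'artificial' cells described by 5 metrics
# (random values; hypothetical stand-ins for the QC metrics listed above)
set.seed(7)
n_true <- 30
n_artificial <- 10
metrics <- matrix(rnorm((n_true + n_artificial) * 5), ncol = 5)
is_artificial <- rep(c(FALSE, TRUE), times = c(n_true, n_artificial))

# PCA on the scaled metrics, then Euclidean distances between PC embeddings
embeddings <- prcomp(metrics, center = TRUE, scale. = TRUE)$x
distances <- as.matrix(dist(embeddings, method = "euclidean"))

# For each true cell, the proportion of its k nearest neighbours
# (excluding the cell itself) that are artificial cells
k <- 10
pANN <- vapply(which(!is_artificial), function(i) {
  neighbours <- order(distances[i, ])[-1][seq_len(k)]
  mean(is_artificial[neighbours])
}, numeric(1))
}

In this sketch, a true cell surrounded mostly by artificial cells receives
a pANN close to 1, while a cell with no artificial neighbours receives 0.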

This filtering method is inspired by approaches developed for DoubletFinder
(McGinnis et al., 2019) to detect doublets in single-cell data.
}
\examples{
data("test_counts", package = "DamageDetective")

test <- detect_damage(
  count_matrix = test_counts,
  ribosome_penalty = 0.001,
  damage_levels = 3,
  damage_proportion = 0.1,
  generate_plot = FALSE,
  seed = 7
)
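
# A further sketch using only the arguments documented above: with
# filter_counts = TRUE the call is expected to return the filtered count
# matrix rather than the data frame of damage labels. The object name
# 'filtered' is illustrative, and filter_threshold = 0.7 restates the
# default shown in the usage section.
filtered <- detect_damage(
  count_matrix = test_counts,
  ribosome_penalty = 0.001,
  damage_levels = 3,
  damage_proportion = 0.1,
  filter_threshold = 0.7,
  filter_counts = TRUE,
  generate_plot = FALSE,
  seed = 7
)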
}
\references{
McGinnis, C. S., Murrow, L. M., & Gartner, Z. J. (2019). DoubletFinder:
Doublet Detection in Single-Cell RNA Sequencing Data Using Artificial
Nearest Neighbors. \emph{Cell Systems, 8}(4), 329-337.e4.
\doi{10.1016/j.cels.2019.03.003}
}
