% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/summ_distance.R
\name{summ_distance}
\alias{summ_distance}
\title{Summarize pair of distributions with distance}
\usage{
summ_distance(f, g, method = "KS")
}
\arguments{
\item{f}{A pdqr-function of any \link[=meta_type]{type} and
\link[=meta_class]{class}.}

\item{g}{A pdqr-function of any type and class.}

\item{method}{Method for computing distance. Should be one of "KS", "totvar",
"compare", "wass", "cramer", "align", "avgdist", "entropy".}
}
\value{
A single non-negative number representing the distance between a pair of
distributions. For methods "KS", "totvar", and "compare" it is not bigger
than 1. For method "avgdist" it is almost always bigger than 0.
}
\description{
This function computes distance between two distributions represented by
pdqr-functions. Here "distance" is used in a broad sense: a single
non-negative number representing how much two distributions differ from one
another. Bigger values indicate a bigger difference. A zero value means that
the input distributions are equivalent based on the method used (except method
"avgdist", which almost always returns a positive value). The notion of
"distance" is useful for doing statistical inference about similarity of two
groups of numbers.
}
\details{
Methods can be separated into three categories: probability based,
metric based, and entropy based.

\strong{Probability based} methods return a number between 0 and 1 which is
computed in a way that is mostly based on probability:
\itemize{
\item \emph{Method "KS"} (short for Kolmogorov-Smirnov) computes the supremum of the
absolute difference between p-functions corresponding to \code{f} and \code{g}
(\verb{|F - G|}). Here "supremum" is meant to describe the fact that if input
functions have different \link[=meta_type]{types}, there might be no point at
which the "KS" distance is achieved. Instead, there might be a sequence of points
from left to right with \verb{|F - G|} values tending to the result (see Examples).
\item \emph{Method "totvar"} (short for "total variation") computes the biggest absolute
difference of probabilities over all subsets of the real line. In other words,
there is a set of points for "discrete" type and intervals for "continuous",
total probability of which under \code{f} and \code{g} differs the most. \strong{Note} that
if \code{f} and \code{g} have different types, output is always 1. In that case the set of
interest consists of all "x" values of the "discrete" pdqr-function: its
probability under the "discrete" distribution is 1 and under the "continuous" one is 0.
\item \emph{Method "compare"} represents a value computed based on probabilities of
one distribution being bigger than the other (see \link[=methods-group-generic]{pdqr methods for "Ops" group generic family} for more details on comparing
pdqr-functions). It is computed as
\code{2*max(P(F > G), P(F < G)) + P(F = G) - 1} (here \code{P(F > G)} is basically
\code{summ_prob_true(f > g)}). This is the maximum of two values (\code{P(F > G) + 0.5*P(F = G)} and \code{P(F < G) + 0.5*P(F = G)}), which lies between 0.5 and 1, rescaled as \code{2*x - 1} to return values from 0
to 1. Another way to look at this measure is that it computes (before
normalization) two \link[=summ_rocauc]{ROC AUC} values with method \code{"expected"}
for the two possible orderings (\verb{f, g} and \verb{g, f}) and takes their maximum.
}

\strong{Metric based} methods compute "how far" two distributions are apart on the
real line:
\itemize{
\item \emph{Method "wass"} (short for "Wasserstein") computes the 1-Wasserstein
distance: "minimum cost of 'moving' one density into another", or "average
path a density point should travel while transforming one into another". It is
computed as the integral of \verb{|F - G|} (absolute difference between p-functions).
If either \code{f} or \code{g} has "continuous" type, \code{\link[stats:integrate]{stats::integrate()}} is used, so
relatively small numerical errors can happen.
\item \emph{Method "cramer"} computes Cramer distance: integral of \code{(F - G)^2}. This
somewhat relates to "wass" method as \link[=summ_var]{variance} relates to \link[=summ_moment]{first central absolute moment}. Relatively small numerical errors
can happen.
\item \emph{Method "align"} computes the absolute value of a shift \code{d} (possibly
negative) that should be added to \code{f} to achieve both \code{P(f+d >= g) >= 0.5}
and \code{P(f+d <= g) >= 0.5} (in other words, to align \code{f+d} and \code{g}) as closely as
reasonably possible. The solution is found numerically with \code{\link[stats:uniroot]{stats::uniroot()}},
so relatively small numerical errors can happen. Also \strong{note} that this
method is somewhat slow (compared to all others). To increase speed, use fewer
elements in \link[=meta_x_tbl]{"x_tbl" metadata}, for example with
\code{\link[=form_retype]{form_retype()}} or a smaller \code{n_grid} argument in \link[=as_p]{as_*()} functions.
\item \emph{Method "avgdist"} computes average distance between sample values from
inputs. Basically, it is a deterministically computed approximation of
expected value of absolute difference between random variables, or in 'pdqr'
code: \code{summ_mean(abs(f - g))} (but computed without randomness). Computation
is done by approximating possibly present continuous pdqr-functions with
discrete ones (see description of \link[=pdqr-package]{"pdqr.approx_discrete_n_grid" option} for more information) and then computing output value
directly based on two discrete pdqr-functions. \strong{Note} that this method
almost never returns zero, even for identical inputs (the exception is a pair of
discrete pdqr-functions that share the same single value).
}

\strong{Entropy based} methods compute output based on entropy characteristics:
\itemize{
\item \emph{Method "entropy"} computes sum of two Kullback-Leibler divergences:
\code{KL(f, g) + KL(g, f)}, which are outputs of \code{\link[=summ_entropy2]{summ_entropy2()}} with method
"relative". \strong{Notes}:
\itemize{
\item If \code{f} and \code{g} don't have the same support, distance can be very high.
\item Error is thrown if \code{f} and \code{g} have different types (the same as in
\code{summ_entropy2()}).
}
}
}
\examples{
d_unif <- as_d(dunif, max = 2)
d_norm <- as_d(dnorm, mean = 1)

vapply(
  c(
    "KS", "totvar", "compare",
    "wass", "cramer", "align", "avgdist",
    "entropy"
  ),
  function(meth) {
    summ_distance(d_unif, d_norm, method = meth)
  },
  numeric(1)
)
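
# A minimal sketch (not the internal code of the package) that rebuilds two
# methods from their definitions in Details. Names `p_greater`, `p_less`,
# `compare_check`, `p_unif`, `p_norm`, `supp`, and `wass_check` are illustrative.
# Results should agree only up to small numerical error.
## "compare": both inputs are continuous, so a tie has zero probability and
## the `P(F = G)` term drops out
p_greater <- summ_prob_true(d_unif > d_norm)
p_less <- summ_prob_true(d_unif < d_norm)
compare_check <- c(
  by_hand = 2 * max(p_greater, p_less) - 1,
  by_method = summ_distance(d_unif, d_norm, method = "compare")
)
compare_check

## "wass": integral of |F - G| over the union of both supports
p_unif <- as_p(d_unif)
p_norm <- as_p(d_norm)
supp <- range(meta_support(d_unif), meta_support(d_norm))
wass_check <- c(
  by_hand = stats::integrate(
    function(x) abs(p_unif(x) - p_norm(x)),
    lower = supp[1], upper = supp[2]
  )$value,
  by_method = summ_distance(d_unif, d_norm, method = "wass")
)
wass_check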

# "Supremum" quality of "KS" distance
d_dis <- new_d(2, "discrete")
## Distance is 1, which is a limit of |F - G| at points which tend to 2 from
## left
summ_distance(d_dis, d_unif, method = "KS")
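
# A short sketch of edge cases described in Details. Expected behavior in the
# comments is taken from the text above, not asserted by this code.
## Inputs of different types: "totvar" is documented to return 1
summ_distance(d_dis, d_unif, method = "totvar")
## Identical inputs: "avgdist" is documented to be positive for a continuous
## pdqr-function and zero for a discrete one with a single value
summ_distance(d_unif, d_unif, method = "avgdist")
summ_distance(d_dis, d_dis, method = "avgdist")
## Inputs of different types: "entropy" is documented to throw an error, so
## capture the message instead of stopping
tryCatch(
  summ_distance(d_dis, d_unif, method = "entropy"),
  error = function(e) conditionMessage(e)
)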
}
\seealso{
\code{\link[=summ_separation]{summ_separation()}} for computation of optimal threshold separating
pair of distributions.

Other summary functions: 
\code{\link{summ_center}()},
\code{\link{summ_classmetric}()},
\code{\link{summ_entropy}()},
\code{\link{summ_hdr}()},
\code{\link{summ_interval}()},
\code{\link{summ_moment}()},
\code{\link{summ_order}()},
\code{\link{summ_prob_true}()},
\code{\link{summ_pval}()},
\code{\link{summ_quantile}()},
\code{\link{summ_roc}()},
\code{\link{summ_separation}()},
\code{\link{summ_spread}()}
}
\concept{summary functions}
