Probability theory is a mathematically rigorous way of modeling
uncertainty in the world. It should be noted that the probability values
that are assigned by a human or autonomous system to various events may
be subjective, based on faulty assumptions, estimated poorly, and
otherwise incorrect. However, *probability theory describes
mathematically sound ways to manipulate probabilities*, which are
guaranteed to lead to accurate predictions as long as the base
assumptions are accurate.

A *random variable* $X$ has a domain $Val(X)$, and it is not known for
certain what value $x \in Val(X)$ it will take on. Instead, we define a
*probability distribution* $P(X=x)$ that assigns nonnegative values to
each $x \in Val(X)$ describing the likelihood that $X$ will take on that
value. As an example, the value showing when a die is rolled can be a
random variable $X$ with $Val(X)=\{1,2,3,4,5,6\}$, and assuming the die
is fair, $P(X=x)=1/6$ for all values of $x \in Val(X)$. To make notation
easier, we typically denote random variables in uppercase letters, and
the corresponding value in lowercase.

When the world contains multiple random variables $X_1,\ldots,X_n$,
their distribution is defined by the *joint distribution*
$$P(X_1=x_1,\ldots,X_n=x_n) \equiv P((X_1=x_1) \wedge \cdots \wedge (X_n=x_n)).$$
In other words, the most basic events in the world are simultaneous
settings of values to variables, and we are assigning probabilities to
the *sample space*
$S = \{ (X_1=x_1) \wedge \cdots \wedge(X_n=x_n) \quad|\quad x_1\in Val(X_1),\ldots,x_n\in Val(X_n) \}$.
(Note that due to the symmetry of the $\wedge$ operation, the order of
arguments does not matter.)

To be a valid probability distribution, $P$ must satisfy the axioms:

$P(s) \geq 0$ for all $s \in S$

$\sum_{s \in S} P(s) = 1$.

The joint distribution is typically written as $P(X_1,\ldots,X_n)$. This
is a shorthand notation in which it is implicitly assumed to be a
*function* over possible values assigned to these variables, and *not a
single number*. We shall also refer to $P(x_1,\ldots,x_n)$, *which is a
single number* giving the probability $P(X_1=x_1,\ldots,X_n=x_n)$.
Moreover, in summations, we shall typically drop the domain of values
that a variable can take on. If the notation is insufficiently clear, we
shall explicitly write the assignments and domains.

The joint probability distribution may be used to order to answer *any*
probabilistic question that might be asked about variables in the world.
Let $e$ be an *event* : some logical statement involving any subset of
$X_1,\ldots,X_N$ taking on certain values. Then $P(e)$ is defined as
$$P(e) = \sum_{s\in S} P(s)I[e\text{ holds in }s]$$ where $I[x]$ is the
*indicator function* that is 1 if $x$ is true and 0 if $x$ is false.

The joint probability distribution can also be manipulated to produce
new distributions. As an example, we consider *marginalization*, which
reduces the number of variables under consideration. If the world
consists of random variables $X$ and $Y$, then the *marginal
distribution* of $X$ is the function $P(X)$ specifying
$$P(X=x) = \sum_{y\in Val(y)} P(X=x,Y=y).$$ In shorthand, we say:
$$P(X) = \sum_{y} P(X,y).$$ We also may say that $Y$ is *marginalized
out* of the joint distribution.

This operation can be performed multiple times to reduce the variable set. If the variables are $X$, $Y$, and $Z$, then marginalizing out $Z$ produces a joint distribution over $X$ and $Y$ given by: $$P(X,Y) = \sum_{z} P(X,Y,z),$$ specifying for each pair of values $x$ and $y$ $$P(X=x,Y=y) = \sum_{z \in Val(Z)} P(X=x,Y=y,Z=z).$$ This can then be marginalized again to obtain a probability distribution over $X$.

The *conditional probability* of two events $e_1$ and $e_2$ is written
as $P(e_1 | e_2)$ and is defined axiomatically as
$$P(e_1 | e_2) = P(e_1 \wedge e_2) / P(e_2).$$ Intuitively, it gives the
probability that $e_1$ will hold after knowing that $e_2$ holds. Another
convenient identity for conditional probabilities is
$$P(e_1 \wedge e_2) = P(e_1 | e_2) P(e_2).$$ More complex conditional
rules include multiple events on the right hand side:
$$P(e_1 | e_2 \wedge e_3) = P(e_1 \wedge e_2 \wedge e_3) / P(e_2 \wedge e_3)$$
which can be easily shown to be equal to
\begin{equation}
P(e_1 | e_2 \wedge e_3) = P(e_1 \wedge e_2 | e_3) / P(e_2 | e_3).
\label{eq:ConditioningThreeEvent}
\end{equation}

The *conditional distribution* expresses the probability distribution of
one or more variables given the value of some other variable(s). If $Y$
is known to take on the value $y$, then the conditional distribution of
$X$ given $Y=y$ is $$P(X|y).$$ This is indeed a probability distribution
over $X$ where the probability of $X=x$ is given by the conditional
probability formula $$P(X=x|Y=y) = P(X=x,Y=y)/P(Y=y).$$ This can be
thought of as a probability distribution $Q(X)$ over *restricted sample
space* in which $Y=y$. In other words, the $Q$ is defined on the sample
space $S_{|y} \{ s \in S \quad|\quad (Y=y) \text{ holds in } s \}$. $Q$
then satisfies all of the probability axioms required of a probability
distribution, as long as $S$ is replaced with $S_{|y}$.

It is also possible to condition over multiple variables. The expression $P(X,Y|z,w)$ means the distribution over $X$ and $Y$ where the joint probability of any sample $X=x$, $Y=y$ is $$P(X=x,Y=y|Z=z,W=w) = P(X=x,Y=y,Z=z,W=w)/P(Z=z,W=w).$$ Conditioning rules for events also extends to variables, such as $$P(x,y,z|w) = P(x,y|z,w)P(z|w)$$ which is a restatement of $\eqref{eq:ConditioningThreeEvent}$.

It is important to note that a conditional distribution is a true
probability distribution over the variables on the left-hand side of the
"given" mark. In particular the probability axioms are satisfied for the
sample space of left-hand variables. Considering $P(X|y)$, we have
$$\sum_{x\in Val(X)} P(x|y) = 1$$ and considering $P(X,Y|z,w)$,
$$\sum_{x\in Val(X)} \sum_{y\in Val(Y)} P(x,y|z,w) = 1$$ On the other
hand, $P(X,y|z)$ is *not* a probability distribution over $X$ (fixing
the value of $Y$ at $y$ and $Z$ at $z$.)

We may also refer to a conditional distribution $P(X|Y)$ *without*
giving a specific value of $Y$. This provides the conditional
distribution $P(X|y)$ for all values of $y$. Another way to think about
$P(X|Y)$ is a two-argument function over values of $x$ and $y$ giving
the value $P(X=x|Y=y)$. Such a distribution is referred to as *the
probability of $X$ given $Y$*. If we were to write $P(X,Y|Z,W)$, this
should be thought of as a four-argument function over values $x$, $y$,
$z$, and $w$.

Two events $e_1$ and $e_2$ are *independent* if $P(e_1 | e_2) = P(e_1)$.
Equivalently, $P(e_1 \wedge e_2) = P(e_1) P(e_2)$. Otherwise they are
said to be *dependent*. Two random variables $X$ and $Y$ are said to be
independent if the events $e_1 = (X=x)$ and $e_2 = (Y=y)$ are
independent for all values $x\in Val(X)$ and $y\in Val(Y)$.

*Bayes' rule* is another standard operation that can be used to change
the order of conditioned statements. It is defined on events as
$$P(e_1|e_2) = \frac{P(e_2|e_1)P(e_1)}{P(e_2)}$$ and on random variables
as $$P(X_1|X_2) = \frac{P(X_2|X_1)P(X_1)}{P(X_2)}$$ in which this
statement is taken to hold for all possible values that $X_1$ and $X_2$
can take on. Both forms can be proven from elementary conditioning
operations. In particular, Bayes' rule is useful because we can derive
the conditional distribution $P(X|Y)$ knowing $P(Y|X)$ and $P(X)$. The
distribution of $P(Y)$ can be derived using marginalization, and we can
obtain
$$P(X|Y) = \frac{P(Y|X)P(X)}{P(Y)} = \frac{P(Y|X)P(X)}{\sum_{x\in Val(X)} P(Y,x)} = \frac{P(Y|X)P(X)}{\sum_{x \in Val(X)} P(Y|x)P(x)}.$$
or in long-hand notation, $$\begin{aligned}
P(X=x|Y=y) &= \frac{P(Y=y|X=x)P(X=x)}{P(Y=y)} = \frac{P(Y=y|X=x)P(X=x)}{\sum_{x^\prime \in Val(X)} P(Y=y,X=x^\prime)} \\
&= \frac{P(Y=y|X=x)P(X=x)}{\sum_{x^\prime \in Val(X)} P(Y=y|X=x^\prime)P(X=x^\prime)}.
\end{aligned}$$ (Note that the summation index in the denominator is
*not* the same value of $x$ for which $P(X|Y)$ is being evaluated.)

When considering probabilities over *continuous* random variables it
becomes difficult to speak of a probability distribution, because the
probability of taking on a single value is usually 0. As an example,
suppose $X$ is a real random variable $Val(X) = \mathbb{R}$ denoting a
vehicle's true velocity, and $x \in \mathbb{R}$ is some value, such as a
speedometer reading of 44 km/h. The probability axioms require that the
sum of $P(X=x)$ over the infinite number of values $x \in \mathbb{R}$
must equal 1, which means that $P(X=x)=0$ for almost all values of $x$.
In other words, if we assigned a nonzero probability of the vehicle
traveling at 44 km/h, then we would need to assign zero probability of
traveling at 44.0001 km/h, or 44.0002 km/h, or 43.9999 km/h. Doing so
would be somewhat strange, and would not accurately reflect our
uncertainty of the true value of $X$, which should be a sort of
continuously "smeared" probability distribution of values around
44 km/h.

So, when we speak of probability distributions over continuous
variables, we usually refer to what is known as a *probability density
function* (pdf). Pdfs share many properties of probability distributions
(strictly considered) but are far more convenient to work with, because
they do not collapse to assigning probability 0 everywhere.
Specifically, a pdf $f$ for a random variable $X$ is defined as a
nonnegative function $f:Val(X)\rightarrow \mathbb{R}$, $f(x) \geq 0$
such that: $$P(a \leq X \leq b) = \int_a^b f(x) dx.$$ Where $P$ is a
true probability distribution. As a result, a pdf must integrate to 1
over the real number line:
$$P(-\infty \leq X \leq \infty) = \int_{-\infty}^\infty f(x) dx = 1.$$

An alternative representation of continuous probability distributions is
known as the *cumulative distribution function* (cdf). The cdf $F$
corresponding to the pdf $f$ is a function defined as:
$$F(c) \equiv P(X \leq c) = \int_{-\infty}^c f(x) dx.$$ Specifically,
$F(c)$ is the probability that $X$ takes on a value less than or equal
to $c$. Any cdf satisfies the following properties:

$F(c) \in [0,1]$ for all $c \in \mathbb{R}$.

$F(-\infty) = 0$, $F(\infty) = 1$.

$F^\prime(c) = f(c)$.

As a result of that last property, we see that $F^\prime(c) \geq 0$,
which means that $F$ is *monotonically non-decreasing*. It is also
straightforward to see that $P(a \leq X \leq b) = F(b) - F(a)$.

Both pdfs and cdfs are in some sense equivalent representations of continuous probability distributions, but for computational and mathematical convenience we typically refer to the pdf form. Specifically, the pdf form is what is meant when referring to a probability $P(x)$.

The most common distributions used in robotics are the uniform distribution and the Gaussian (aka normal) distribution.

The

uniform distributionover the range $[a,b]$ prescribes an equivalent probability density to each value $x$ in the range, and 0 everywhere else.

In such a case we say $X \sim U(a,b)$, and the pdf of this distribution is:
$$f(x) = \left\{ \begin{array}{ll} \frac{1}{b-a} & \text{if }a\leq x \leq b \\
0 & \text{otherwise} \end{array}\right.$$ We also denote this function
as $U(x; a,b)$. The cdf of this distribution is
$$F(x) = \left\{ \begin{array}{ll} \frac{x-a}{b-a} & \text{if }a\leq x \leq b \\
0 & \text{if }a<x \\
1 & \text{if }b>x \end{array}\right.$$ Using the notation of the
*Heaviside function* $H(x) = I[x \geq 0]$, we can say more succinctly
that $f(x) = \frac{H(x-a)-H(x-b)}{b-a}$ and
$F(x) = H(x-a)\frac{x-a}{b-a} - H(x-b) \frac{x-b}{b-a}$.

We shall cover the Gaussian distribution below.

Multivariate continuous densities describe the joint distributions of multiple continuous variables. A multivariate cdf $F(x,y)$ gives $$P(X\leq x,Y\leq y) = F(x,y)$$ while the joint pdf $f(x,y)$ is defined such that $$F(a,b) = \int_{-\infty}^{a}\int_{-\infty}^{b} f(x,y) dy dx.$$ An alternate definition of the density takes the limit $$\lim_{\epsilon\rightarrow 0} P(x\leq X\leq x+ \epsilon,y \leq Y \leq y+\epsilon)/\epsilon^2 = f(x,y).$$ Fig. 1 illustrates an example of the pdf and cdf for the bivariate Gaussian distribution.

Marginalization can also be performed on continuous densities. If $f(x,y)$ is the joint density of random variables $X$ and $Y$, then $$g(x) = \int_{-\infty}^{\infty} f(x,y) dy$$ is the density of $X$ unconditional on $Y$. This operation is illustrated in Fig. 2 (left).

We may also speak of the conditional probability density of a continuous variable. The standard definition $P(x|y) = P(x,y)/P(y)$ does not seem to apply, since $P(y)=0$ for essentially every value of $y$. However, it does work when interpreting probabilities as densities. Let $f(x,y)$ be the joint density and $g(y)$ be the density of $Y$. Let $F$ and $G$ be their respective cdfs. Let us then take the limit of the posterior cdf $P(X \leq x | y \leq Y \leq y + \epsilon)$ as $\epsilon \rightarrow 0$: $$P(X \leq x | y \leq Y \leq y + \epsilon) = \frac{ \int_{-\infty}^{x} \int_y^{y+\epsilon} f(x^\prime,y^\prime) dy^\prime dx^\prime }{\int_y^{y+\epsilon} g(y^\prime) dy^\prime}$$ As $\epsilon \rightarrow 0$, this becomes increasingly closer to $$P(X \leq x | y \leq Y \leq y + \epsilon) \rightarrow \frac{ \int_{-\infty}^{x} f(x^\prime,y)\epsilon dx^\prime }{g(y) \epsilon} = \frac{ \int_{-\infty}^{x} f(x^\prime,y) dx^\prime }{g(y)} = P(X \leq x | Y = y).$$ Then, taking the derivative with respect to $x$, the conditional density becomes $$P(X=x | Y=y) = \frac{d}{dx} P(X \leq x | Y = y) = f(x,y)/g(y).$$ This can be thought of as taking a slice through the pdf with $Y=y$, and then normalizing, as illustrated in Fig. 2 (right).

The mean, standard deviation, and variance are quantities that
characterize the distribution of random variable. If $X \sim P(X)$ is a
continuous random variable, then the *mean* is defined as
$$\bar{X} = \int_{-\infty}^{\infty} x P(x) dx.$$ This is also known as
the *expected value* of the distribution, which is denoted $E[X]$.

The *variance* of the distribution is the expected value of the squared
difference between the variable's value and the mean, and gives a sense
of the spread of the distribution:
$$Var[X] = E[(X - \bar{X})^2] = \int_{-\infty}^{\infty} (x - \bar{X})^2 P(x) dx.$$
Variance is always nonnegative, and is only 0 if $X$ has a nonzero
probability of taking on a value other than $\bar{X}$. The *standard
deviation* is simply the square root of the variance,
$Std[X] = \sqrt{Var[X]}$.

Given two variables $X$ and $Y$, their *covariance* is defined as
$$Cov[X,Y] = E[(X - \bar{X})(Y - \bar{Y})].$$ The covariance of two
independent variables is 0. Covariance is positive if knowing that $X$
is larger than average provides information that $Y$ is larger than
average (positive correlation), and it is negative if it provides
information that $Y$ is smaller than average (negative correlation). It
is also apparent that $Cov[X,X]=Var[X]$.

If we wish to specify a probability distribution over which an expected value, variance, or covariance should be evaluated, it can be written in the subscript. For example, the expected value of $X$ knowing that $Y=y$ can be written as: $$E_{P(X|y)}[X] = \int_{-\infty}^{\infty} x P(x|y) dx.$$

It is not so easy to perform operations and transformations on random variables. Adding two random variables is not as simple as adding their pdfs, and even applying simple functions is not straightforward.

Suppose we wish to compute the distribution of a deterministic function
$h(x)$ with $X$ a random variable. To answer this question, we consider
the event space of two continuous random variables $X$ and $Y$ *related*
by the constraint $y=h(x)$. For example, suppose that it is known that
$y = x^2$, but we only have probabilistic knowledge about the value of
$x$. What is the probability distribution over $Y$ that is consistent
with this information?

If we know that $f$ and $F$ are the pdf and cdf of $X$, respectively,
*and* we know that $h$ is monotonically increasing, then we can
determine the pdf of $Y$ using the *probability integral transform*. Let
$g$ and $G$ be the unknown pdf and cdf of $Y$, and let us examine how to
compute a specific value of $G(y)$. Each point $x$ "counts" toward the
sum as long as $h(x) \leq y$. Hence, the following equation holds:
\begin{equation}
G(y) = P(Y \leq y) = \int_{-\infty}^{\infty} I[h(x) \leq y] P(X=x) dx.
\label{eq:ProbabilityTransformedY}
\end{equation}
If we use the assumption that $h$
is monotonically increasing, then $h(x) \leq y$ is true if and only if
$x \leq h^{-1}(y)$. Hence, we can rewrite
$\eqref{eq:ProbabilityTransformedY}$ as
\begin{equation}
G(y) = \int_{-\infty}^{h^{-1}(y)} f(x) dx = F(h^{-1}(y)).
\label{eq:ProbabilityTransformedY2}
\end{equation}
Using this formula, we can derive
a few results about simple transformations:

If $c$ is a constant, then $P(X + c \leq x) = P(X \leq x-c)$.

$P(-X\leq x) = P(X \geq -x)$.

If $a > 0$ is a constant, then $P(aX+b \leq x) = P(X \leq (x-b)/a)$.

If $a < 0$, then $P(aX+b \leq x) = P(X \geq (x-b)/a)$.

Let us now return to the function $h(x)=x^2$ as we proposed originally. This is not monotonically increasing, so we cannot directly apply $\eqref{eq:ProbabilityTransformedY2}$. However, we can split its domain into two parts, $x \geq 0$ and $x < 0$. The second part is monotonically decreasing, so we shall have to handle that slightly differently. We can still apply $\eqref{eq:ProbabilityTransformedY}$ to obtain $$G(y) = P(Y \leq y) = \int_{-\infty}^{\infty} I[x^2 \leq y] f(x) dx$$ which can be split into two parts in which $h$ is monotonic. Performing a flip in the integration range for the negatively decreasing part, we obtain: $$\begin{aligned} G(y) &= \int_{-\infty}^{0} I[x^2 \leq y] f(x) dx + \int_{0}^{\infty} I[x^2 \leq y] f(x) dx \\ & = \int_{-\sqrt y}^{0} f(x) dx + \int_{0}^{\sqrt y} f(x) dx \\ & = (F(0) - F(-\sqrt y)) + (F(\sqrt y) - F(0)) = F(\sqrt y)-F(-\sqrt y). \end{aligned}$$

If we know that a random variable $Z$ is the sum of two random variables
$X$ and $Y$, we must consider *all* the ways that the sum $z=x+y$ can be
formed. Via marginalization,
$$F(z) = P(Z\leq z) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} I[x+y\leq z] P(X=x,Y=y) dx dy.$$
If $X$ and $Y$ are independent, then we can simplify the double integral
into a single integral over either $X$ or $Y$: $$\begin{aligned}
F(z) = P(Z\leq z) &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} I[x+y\leq z] P(X=x)P(Y=y) dx dy \\
&= \int_{-\infty}^{\infty} P(X\leq z-y)P(Y=y) dy %\\
%&= \int_{-\infty}^{\infty} P(Y\leq z-x)P(X=x) dx.
\end{aligned}$$ Another way to express this is with the pdf, which is
the derivative of the cdf: $$\begin{aligned}
f(z) = \frac{d F}{d z}(z)
&= \int_{-\infty}^{\infty} P(X=z-y)P(Y=y) dy %\\
%&= \int_{-\infty}^{\infty} P(Y=z-x)P(X=x) dx.
\end{aligned}$$

This is not so easy to compute in general; for example, the sum of two uniformly distributed random variables $X$ and $Y$ on the range $[0,1]$ has the pdf $P(X+Y = z) = 0$ for $z \leq 0$ and $z \geq 0$, and for all other $z$ $$\begin{aligned} P(X+Y \leq z) &= \int_{-\infty}^{\infty} P(X=z-y)P(Y=y) dy \\ & = \int_{0}^{1} P(X = z-y) dy \\ & = \int_{0}^{1} (I[z\geq y]-I[z\geq y+1]) dy \\ & = \int_{0}^{1} I[z\geq y]dy - \int_{0}^{1} I[z\geq y+1] dy. \end{aligned}$$ If $z \geq 1$, then $z \geq y$ and the indicator function in the first integral is always active. If $z < 1$, then $z < y+1$ and the indicator function in the second integral is never active. Hence, if $z < 1$, $$P(X+Y = z) = \int_0^1 I[z \geq y]dy = \int_0^z 1 dy = z$$ and for $z \geq 1$, $$\begin{aligned} P(X+Y = z) &= \int_0^1 I[z \geq y]dy - \int_0^1 I[z \geq y+1] \\ & = \int_0^1 1 dy - \int_0^{z-1} 1 dy = 1 - (z-1) = 2-z. \end{aligned}$$

Fortunately, simple, closed form solutions exist for certain operations and classes of distributions. One of the most important classes of distribution is the Gaussian distribution, which is closed under linear transformations.

The univariate Gaussian (or normal) distribution with mean $\mu$ and standard deviation $\sigma$ has the pdf: $$P(x) \equiv N(x;\mu,\sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} e^{\frac{(x-\mu)^2}{2\sigma^2}}$$ This function is has the shape of the famous bell curve, and has a peak at $x=\mu$ and has spread controlled by $\sigma$. We say a Gaussian (or normal) variable $X$ has distribution $\mathcal{N}(\mu,\sigma)$, or $X \sim N(\mu,\sigma^2)$.

The significance of this distribution is the *central limit theorem* :
the distribution of the mean of a sample of independent, identically
distributed random variables approaches a Gaussian as the sample size
grows larger. This holds under very few assumptions about the
distribution of each variable. Specifically, given
$X_1,\ldots,X_n \sim P(X)$, the sample mean $M_n$ is defined as
$$M_n = \frac{1}{n}\sum_{i=1}^n X_n.$$ Roughly, the central limit
theorem states that $M_n$ converges to a gaussian distribution:
$$P(M_n = x) \approx N(x;\bar{X_i},Var[X_i]/n) \text{ as }n\rightarrow \infty.$$

It can also be shown that a linear transformation of a Gaussian variable is also Gaussian. Specifically, with $X \sim N(\mu,\sigma^2)$, we have $$aX+b \sim N(a\mu+b,a^2 \sigma^2).$$

The sum of two independent Gaussian variables is also Gaussian. If $X \sim N(\mu_X,\sigma_X^2)$, $Y \sim N(\mu_Y,\sigma_Y^2)$ are independent, then $$X + Y \sim N(\mu_X+\mu_Y,\sigma_X^2+\sigma_Y^2)$$

A multivariate Gaussian distribution over a vector-valued random variable $\mathbf{X}=[X_1,...,X_n]^T \in \mathbb{R}^n$ with mean vector $\mathbf{\mu}$ and covariance matrix $\Sigma$ has the density function: \begin{equation} P(\mathbf{x}) = N(\mathbf{x};\mathbf{\mu},\Sigma) = \frac{1}{(2\pi)^{n/2}\sqrt{|\Sigma|}} e^{-\frac{1}{2} (\mathbf{x}-\mathbf{\mu})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu})}. \label{eq:MultivariateGaussian} \end{equation} Like the (univariate) Gaussian distribution, it has a peak at $x=\mathbf{\mu}$, and its spread is determined by the matrix $\Sigma$. Fig. 1 shows the pdf of a bivariate Gaussian with $\mathbf{\mu}=0$ and $\Sigma=I$.

The *covariance matrix* is defined as

$\Sigma$ must be symmetric positive definite as well. Its diagonal entries define the variance of individual elements of $\mathbf{x}$, while the off-diagonals determine how much it is skewed.

For example, consider a 2D Gaussian with mean $\mathbf{\mu} = \begin{bmatrix} \mu_{1} \\ \mu_{2} \end{bmatrix}$

and covariance

$$\Sigma = \begin{bmatrix}\Sigma_1 & \Sigma_{12} \\ \Sigma_{12} & \Sigma_{2} \end{bmatrix}.$$If $\Sigma_{12} = 0$, then $X_1$ and $X_2$ are independent, with respective distributions $X_1 \sim N(\mu_1,\Sigma_1)$ and $X_2 \sim N(\mu_2,\Sigma_2)$. Otherwise, the marginal distributions stay the same: $$\begin{aligned} P(X_1=x) &= N(x;\mu_1,\Sigma_1) \\ P(X_2=x) &= N(x;\mu_2,\Sigma_2)\end{aligned}$$ but the joint distribution is skewed. If $\Sigma_{12} > 0$, it is skewed in the upper right and lower left quadrants, while if $\Sigma_{12} < 0$, it is skewed in the upper left and lower right quadrants.

Deriving the following facts requires some extensive manipulation of $\eqref{eq:MultivariateGaussian}$. We shall simply state them without proof; the proofs are left as exercise for the interested reader.

Independent multivariate Gaussian variables can be stacked according to the following rule. If $\mathbf{X} \sim N(\mathbf{\mu}_X,\Sigma_X)$ and $\mathbf{Y} \sim N(\mathbf{\mu}_Y,\Sigma_Y)$ are independent, then $$\mathbf{Z} = \begin{bmatrix}\mathbf{X} \\ \mathbf{Y} \end{bmatrix} \sim N\left(\begin{bmatrix}\mathbf{\mu}_X \\ \mathbf{\mu}_Y \end{bmatrix}, \begin{bmatrix}\Sigma_X & 0 \\ 0 & \Sigma_Y \end{bmatrix}\right)$$ where the mean and covariance matrix are written in block-matrix form.

If $\mathbf{Z} = \begin{bmatrix}\mathbf{X} \\ \mathbf{Y} \end{bmatrix}$ is Gaussian: $$\mathbf{Z} = \begin{bmatrix}\mathbf{X} \\ \mathbf{Y} \end{bmatrix} \sim N\left( \begin{bmatrix}\mathbf{\mu}_X \\ \mathbf{\mu}_Y \end{bmatrix}, \begin{bmatrix} \Sigma_X & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_Y\end{bmatrix} \right)$$ then the marginal distribution of $\mathbf{X}$ is $\mathbf{X} \sim N(\mathbf{\mu}_X,\Sigma_X)$, and the marginal distribution of $\mathbf{y}$ is $\mathbf{Y} \sim N(\mathbf{\mu}_Y,\Sigma_Y)$.

Affine transformations of multivariate Gaussians are also Gaussian (Fig. 3). If $\mathbf{X} \sim N(\mathbf{\mu},\Sigma)$, then for any appropriately sized matrix $A$ and vector $\mathbf{b}$, $$A\mathbf{X}+\mathbf{b} \sim N(A\mathbf{\mu}+\mathbf{b},A\Sigma A^T).$$ More specifically, $\mathbf{X}$ and $A\mathbf{X}+\mathbf{b}$ are jointly distributed according to: $$\begin{bmatrix}\mathbf{X} \\ A\mathbf{X}+\mathbf{b} \end{bmatrix} \sim N\left(\begin{bmatrix}\mathbf{\mu} \\ A\mathbf{\mu}+\mathbf{b} \end{bmatrix},\begin{bmatrix}\Sigma & \Sigma A^T \\ A\Sigma & A\Sigma A^T\end{bmatrix}\right).$$

The sum of independent multivariate Gaussians is also Gaussian, which can be derived by stacking and linear transformation rule. If $\mathbf{X} \sim N(\mathbf{\mu}_X,\Sigma_X)$ and $\mathbf{Y} \sim N(\mathbf{\mu}_Y,\Sigma_Y)$ are independent $n$-D Gaussian variables, then $$\mathbf{Z} = \mathbf{X}+\mathbf{Y} = [I_n \quad I_n]\begin{bmatrix}\mathbf{X} \\ \mathbf{Y} \end{bmatrix} = N(\mathbf{\mu}_X+\mathbf{\mu}_Y,\Sigma_X+\Sigma_Y)$$ where $[I_n \quad I_n]$ is the horizontal block matrix of two identity matrices.

Conditional distributions of multivariate Gaussians are also Gaussian. Specifically, if $Val(\mathbf{X}) = \mathbb{R}^n$ and $Val(\mathbf{Y}) = \mathbb{R}^m$ are jointly Gaussian, that is, $$\begin{bmatrix}{\mathbf{X}} \\ {\mathbf{Y}} \end{bmatrix} \sim N\left(\begin{bmatrix}{\mathbf{\mu}_X} \\ {\mathbf{\mu}_Y} \end{bmatrix},\begin{bmatrix}{\Sigma_X}&{\Sigma_{XY}} \\ {\Sigma_{XY}^T}&{\Sigma_Y} \end{bmatrix}\right),$$ then the conditional distribution of $\mathbf{X}$ given that $\mathbf{Y}=\mathbf{y}$ is also Gaussian: \begin{equation} P(\mathbf{x}|\mathbf{y}) = N\left( \mathbf{x};\mathbf{\mu}_X+\Sigma_{XY}\Sigma_Y^{-1}(\mathbf{y}-\mathbf{\mu}_Y),\Sigma_X - \Sigma_{XY}\Sigma_Y^{-1}\Sigma_{XY}^T \right) \label{eq:GaussianConditioning} \end{equation} It can be observed that the mean of the new distribution depends on the value of $y$, but the covariance does not. Observe also that if $\mathbf{X}$ and $\mathbf{Y}$ are independent ($\Sigma_{XY}=0$), then knowing the value of $\mathbf{Y}$ does not change the marginal distribution of $\mathbf{X}$.

Fig. 2, right illustrates conditioning on a zero-mean Gaussian with $$\Sigma=\begin{bmatrix}{0.25}&{0.3} \\ {0.3}&{1} \end{bmatrix}$$ at $x_1=1$. According to ($\ref{eq:GaussianConditioning}$), the mean of $x_2$ becomes $0.3\cdot 0.25^{-1} \cdot 1 = 1.2$, and its variance becomes $1 - 0.3\cdot 0.25^{-1} \cdot 0.3 = 0.64$.

In [ ]:

```
```