A.3. Probability Distributions

Probability theory is a mathematically rigorous way of modeling uncertainty in the world. It should be noted that the probability values that are assigned by a human or autonomous system to various events may be subjective, based on faulty assumptions, estimated poorly, and otherwise incorrect. However, probability theory describes mathematically sound ways to manipulate probabilities, which are guaranteed to lead to accurate predictions as long as the base assumptions are accurate.

Random Variables and the Joint Distribution

A random variable $X$ has a domain $Val(X)$, and it is not known for certain what value $x \in Val(X)$ it will take on. Instead, we define a probability distribution $P(X=x)$ that assigns nonnegative values to each $x \in Val(X)$ describing the likelihood that $X$ will take on that value. As an example, the value showing when a die is rolled can be a random variable $X$ with $Val(X)=\{1,2,3,4,5,6\}$, and assuming the die is fair, $P(X=x)=1/6$ for all values of $x \in Val(X)$. To make notation easier, we typically denote random variables in uppercase letters, and the corresponding value in lowercase.

When the world contains multiple random variables $X_1,\ldots,X_n$, their distribution is defined by the joint distribution $$P(X_1=x_1,\ldots,X_n=x_n) \equiv P((X_1=x_1) \wedge \cdots \wedge (X_n=x_n)).$$ In other words, the most basic events in the world are simultaneous settings of values to variables, and we are assigning probabilities to the sample space $S = \{ (X_1=x_1) \wedge \cdots \wedge(X_n=x_n) \quad|\quad x_1\in Val(X_1),\ldots,x_n\in Val(X_n) \}$. (Note that due to the symmetry of the $\wedge$ operation, the order of arguments does not matter.)

To be a valid probability distribution, $P$ must satisfy the axioms:

  • $P(s) \geq 0$ for all $s \in S$

  • $\sum_{s \in S} P(s) = 1$.

The joint distribution is typically written as $P(X_1,\ldots,X_n)$. This is a shorthand notation in which it is implicitly assumed to be a function over possible values assigned to these variables, and not a single number. We shall also refer to $P(x_1,\ldots,x_n)$, which is a single number giving the probability $P(X_1=x_1,\ldots,X_n=x_n)$. Moreover, in summations, we shall typically drop the domain of values that a variable can take on. If the notation is insufficiently clear, we shall explicitly write the assignments and domains.

The joint probability distribution may be used to order to answer any probabilistic question that might be asked about variables in the world. Let $e$ be an event : some logical statement involving any subset of $X_1,\ldots,X_N$ taking on certain values. Then $P(e)$ is defined as $$P(e) = \sum_{s\in S} P(s)I[e\text{ holds in }s]$$ where $I[x]$ is the indicator function that is 1 if $x$ is true and 0 if $x$ is false.

The joint probability distribution can also be manipulated to produce new distributions. As an example, we consider marginalization, which reduces the number of variables under consideration. If the world consists of random variables $X$ and $Y$, then the marginal distribution of $X$ is the function $P(X)$ specifying $$P(X=x) = \sum_{y\in Val(y)} P(X=x,Y=y).$$ In shorthand, we say: $$P(X) = \sum_{y} P(X,y).$$ We also may say that $Y$ is marginalized out of the joint distribution.

This operation can be performed multiple times to reduce the variable set. If the variables are $X$, $Y$, and $Z$, then marginalizing out $Z$ produces a joint distribution over $X$ and $Y$ given by: $$P(X,Y) = \sum_{z} P(X,Y,z),$$ specifying for each pair of values $x$ and $y$ $$P(X=x,Y=y) = \sum_{z \in Val(Z)} P(X=x,Y=y,Z=z).$$ This can then be marginalized again to obtain a probability distribution over $X$.

Conditional Distributions and Independence

The conditional probability of two events $e_1$ and $e_2$ is written as $P(e_1 | e_2)$ and is defined axiomatically as $$P(e_1 | e_2) = P(e_1 \wedge e_2) / P(e_2).$$ Intuitively, it gives the probability that $e_1$ will hold after knowing that $e_2$ holds. Another convenient identity for conditional probabilities is $$P(e_1 \wedge e_2) = P(e_1 | e_2) P(e_2).$$ More complex conditional rules include multiple events on the right hand side: $$P(e_1 | e_2 \wedge e_3) = P(e_1 \wedge e_2 \wedge e_3) / P(e_2 \wedge e_3)$$ which can be easily shown to be equal to $$P(e_1 | e_2 \wedge e_3) = P(e_1 \wedge e_2 | e_3) / P(e_2 | e_3). \label{eq:ConditioningThreeEvent}$$

The conditional distribution expresses the probability distribution of one or more variables given the value of some other variable(s). If $Y$ is known to take on the value $y$, then the conditional distribution of $X$ given $Y=y$ is $$P(X|y).$$ This is indeed a probability distribution over $X$ where the probability of $X=x$ is given by the conditional probability formula $$P(X=x|Y=y) = P(X=x,Y=y)/P(Y=y).$$ This can be thought of as a probability distribution $Q(X)$ over restricted sample space in which $Y=y$. In other words, the $Q$ is defined on the sample space $S_{|y} \{ s \in S \quad|\quad (Y=y) \text{ holds in } s \}$. $Q$ then satisfies all of the probability axioms required of a probability distribution, as long as $S$ is replaced with $S_{|y}$.

It is also possible to condition over multiple variables. The expression $P(X,Y|z,w)$ means the distribution over $X$ and $Y$ where the joint probability of any sample $X=x$, $Y=y$ is $$P(X=x,Y=y|Z=z,W=w) = P(X=x,Y=y,Z=z,W=w)/P(Z=z,W=w).$$ Conditioning rules for events also extends to variables, such as $$P(x,y,z|w) = P(x,y|z,w)P(z|w)$$ which is a restatement of ($\ref{eq:ConditioningThreeEvent}$).

It is important to note that a conditional distribution is a true probability distribution over the variables on the left-hand side of the "given" mark. In particular the probability axioms are satisfied for the sample space of left-hand variables. Considering $P(X|y)$, we have $$\sum_{x\in Val(X)} P(x|y) = 1$$ and considering $P(X,Y|z,w)$, $$\sum_{x\in Val(X)} \sum_{y\in Val(Y)} P(x,y|z,w) = 1$$ On the other hand, $P(X,y|z)$ is not a probability distribution over $X$ (fixing the value of $Y$ at $y$ and $Z$ at $z$.)

We may also refer to a conditional distribution $P(X|Y)$ without giving a specific value of $Y$. This provides the conditional distribution $P(X|y)$ for all values of $y$. Another way to think about $P(X|Y)$ is a two-argument function over values of $x$ and $y$ giving the value $P(X=x|Y=y)$. Such a distribution is referred to as the probability of $X$ given $Y$. If we were to write $P(X,Y|Z,W)$, this should be thought of as a four-argument function over values $x$, $y$, $z$, and $w$.

Two events $e_1$ and $e_2$ are independent if $P(e_1 | e_2) = P(e_1)$. Equivalently, $P(e_1 \wedge e_2) = P(e_1) P(e_2)$. Otherwise they are said to be dependent. Two random variables $X$ and $Y$ are said to be independent if the events $e_1 = (X=x)$ and $e_2 = (Y=y)$ are independent for all values $x\in Val(X)$ and $y\in Val(Y)$.

Bayes' rule is another standard operation that can be used to change the order of conditioned statements. It is defined on events as $$P(e_1|e_2) = \frac{P(e_2|e_1)P(e_1)}{P(e_2)}$$ and on random variables as $$P(X_1|X_2) = \frac{P(X_2|X_1)P(X_1)}{P(X_2)}$$ in which this statement is taken to hold for all possible values that $X_1$ and $X_2$ can take on. Both forms can be proven from elementary conditioning operations. In particular, Bayes' rule is useful because we can derive the conditional distribution $P(X|Y)$ knowing $P(Y|X)$ and $P(X)$. The distribution of $P(Y)$ can be derived using marginalization, and we can obtain $$P(X|Y) = \frac{P(Y|X)P(X)}{P(Y)} = \frac{P(Y|X)P(X)}{\sum_{x\in Val(X)} P(Y,x)} = \frac{P(Y|X)P(X)}{\sum_{x \in Val(X)} P(Y|x)P(x)}.$$ or in long-hand notation, $$\begin{aligned} P(X=x|Y=y) &= \frac{P(Y=y|X=x)P(X=x)}{P(Y=y)} = \frac{P(Y=y|X=x)P(X=x)}{\sum_{x^\prime \in Val(X)} P(Y=y,X=x^\prime)} \\ &= \frac{P(Y=y|X=x)P(X=x)}{\sum_{x^\prime \in Val(X)} P(Y=y|X=x^\prime)P(X=x^\prime)}. \end{aligned}$$ (Note that the summation index in the denominator is not the same value of $x$ for which $P(X|Y)$ is being evaluated.)

Probability Distributions over Continuous Variables

When considering probabilities over continuous random variables it becomes difficult to speak of a probability distribution, because the probability of taking on a single value is usually 0. As an example, suppose $X$ is a real random variable $Val(X) = \mathbb{R}$ denoting a vehicle's true velocity, and $x \in \mathbb{R}$ is some value, such as a speedometer reading of 44 km/h. The probability axioms require that the sum of $P(X=x)$ over the infinite number of values $x \in \mathbb{R}$ must equal 1, which means that $P(X=x)=0$ for almost all values of $x$. In other words, if we assigned a nonzero probability of the vehicle traveling at 44 km/h, then we would need to assign zero probability of traveling at 44.0001 km/h, or 44.0002 km/h, or 43.9999 km/h. Doing so would be somewhat strange, and would not accurately reflect our uncertainty of the true value of $X$, which should be a sort of continuously "smeared" probability distribution of values around 44 km/h.

So, when we speak of probability distributions over continuous variables, we usually refer to what is known as a probability density function (pdf). Pdfs share many properties of probability distributions (strictly considered) but are far more convenient to work with, because they do not collapse to assigning probability 0 everywhere. Specifically, a pdf $f$ for a random variable $X$ is defined as a nonnegative function $f:Val(X)\rightarrow \mathbb{R}$, $f(x) \geq 0$ such that: $$P(a \leq X \leq b) = \int_a^b f(x) dx.$$ Where $P$ is a true probability distribution. As a result, a pdf must integrate to 1 over the real number line: $$P(-\infty \leq X \leq \infty) = \int_{-\infty}^\infty f(x) dx = 1.$$

An alternative representation of continuous probability distributions is known as the cumulative distribution function (cdf). The cdf $F$ corresponding to the pdf $f$ is a function defined as: $$F(c) \equiv P(X \leq c) = \int_{-\infty}^c f(x) dx.$$ Specifically, $F(c)$ is the probability that $X$ takes on a value less than or equal to $c$. Any cdf satisfies the following properties:

  • $F(c) \in [0,1]$ for all $c \in \mathbb{R}$.

  • $F(-\infty) = 0$, $F(\infty) = 1$.

  • $F^\prime(c) = f(c)$.

As a result of that last property, we see that $F^\prime(c) \geq 0$, which means that $F$ is monotonically non-decreasing. It is also straightforward to see that $P(a \leq X \leq b) = F(b) - F(a)$.

Both pdfs and cdfs are in some sense equivalent representations of continuous probability distributions, but for computational and mathematical convenience we typically refer to the pdf form. Specifically, the pdf form is what is meant when referring to a probability $P(x)$.

The most common distributions used in robotics are the uniform distribution and the Gaussian (aka normal) distribution.

The uniform distribution over the range $[a,b]$ prescribes an equivalent probability density to each value $x$ in the range, and 0 everywhere else.

In such a case we say $X \sim U(a,b)$, and the pdf of this distribution is: $$f(x) = \left\{ \begin{array}{ll} \frac{1}{b-a} & \text{if }a\leq x \leq b \\ 0 & \text{otherwise} \end{array}\right.$$ We also denote this function as $U(x; a,b)$. The cdf of this distribution is $$F(x) = \left\{ \begin{array}{ll} \frac{x-a}{b-a} & \text{if }a\leq x \leq b \\ 0 & \text{if }a<x \\ 1 & \text{if }b>x \end{array}\right.$$ Using the notation of the Heaviside function $H(x) = I[x \geq 0]$, we can say more succinctly that $f(x) = \frac{H(x-a)-H(x-b)}{b-a}$ and $F(x) = H(x-a)\frac{x-a}{b-a} - H(x-b) \frac{x-b}{b-a}$.

We shall cover the Gaussian distribution below.

Multivariate continuous densities

Multivariate continuous densities describe the joint distributions of multiple continuous variables. A multivariate cdf $F(x,y)$ gives $$P(X\leq x,Y\leq y) = F(x,y)$$ while the joint pdf $f(x,y)$ is defined such that $$F(a,b) = \int_{-\infty}^{a}\int_{-\infty}^{b} f(x,y) dy dx.$$ An alternate definition of the density takes the limit $$\lim_{\epsilon\rightarrow 0} P(x\leq X\leq x+ \epsilon,y \leq Y \leq y+\epsilon)/\epsilon^2 = f(x,y).$$ Fig. 1 illustrates an example of the pdf and cdf for the bivariate Gaussian distribution.


Figure 1. Left: A multivariate density function. Right: the cumulative distribution function corresponding to that density.

Marginalization can also be performed on continuous densities. If $f(x,y)$ is the joint density of random variables $X$ and $Y$, then $$g(x) = \int_{-\infty}^{\infty} f(x,y) dy$$ is the density of $X$ unconditional on $Y$. This operation is illustrated in Fig. 2 (left).


Figure 2. Left: Marginalization of a multivariate density involves integration over a subset variable to obtain a distribution over the others. Right: Conditioning of a density involves "slicing" the distribution at a particular value and then re-normalizing.

We may also speak of the conditional probability density of a continuous variable. The standard definition $P(x|y) = P(x,y)/P(y)$ does not seem to apply, since $P(y)=0$ for essentially every value of $y$. However, it does work when interpreting probabilities as densities. Let $f(x,y)$ be the joint density and $g(y)$ be the density of $Y$. Let $F$ and $G$ be their respective cdfs. Let us then take the limit of the posterior cdf $P(X \leq x | y \leq Y \leq y + \epsilon)$ as $\epsilon \rightarrow 0$: $$P(X \leq x | y \leq Y \leq y + \epsilon) = \frac{ \int_{-\infty}^{x} \int_y^{y+\epsilon} f(x^\prime,y^\prime) dy^\prime dx^\prime }{\int_y^{y+\epsilon} g(y^\prime) dy^\prime}$$ As $\epsilon \rightarrow 0$, this becomes increasingly closer to $$P(X \leq x | y \leq Y \leq y + \epsilon) \rightarrow \frac{ \int_{-\infty}^{x} f(x^\prime,y)\epsilon dx^\prime }{g(y) \epsilon} = \frac{ \int_{-\infty}^{x} f(x^\prime,y) dx^\prime }{g(y)} = P(X \leq x | Y = y).$$ Then, taking the derivative with respect to $x$, the conditional density becomes $$P(X=x | Y=y) = \frac{d}{dx} P(X \leq x | Y = y) = f(x,y)/g(y).$$ This can be thought of as taking a slice through the pdf with $Y=y$, and then normalizing, as illustrated in Fig. 2 (right).

Mean, Standard Deviation, and Variance

The mean, standard deviation, and variance are quantities that characterize the distribution of random variable. If $X \sim P(X)$ is a continuous random variable, then the mean is defined as $$\bar{X} = \int_{-\infty}^{\infty} x P(x) dx.$$ This is also known as the expected value of the distribution, which is denoted $E[X]$.

The variance of the distribution is the expected value of the squared difference between the variable's value and the mean, and gives a sense of the spread of the distribution: $$Var[X] = E[(X - \bar{X})^2] = \int_{-\infty}^{\infty} (x - \bar{X})^2 P(x) dx.$$ Variance is always nonnegative, and is only 0 if $X$ has a nonzero probability of taking on a value other than $\bar{X}$. The standard deviation is simply the square root of the variance, $Std[X] = \sqrt{Var[X]}$.

Given two variables $X$ and $Y$, their covariance is defined as $$Cov[X,Y] = E[(X - \bar{X})(Y - \bar{Y})].$$ The covariance of two independent variables is 0. Covariance is positive if knowing that $X$ is larger than average provides information that $Y$ is larger than average (positive correlation), and it is negative if it provides information that $Y$ is smaller than average (negative correlation). It is also apparent that $Cov[X,X]=Var[X]$.

If we wish to specify a probability distribution over which an expected value, variance, or covariance should be evaluated, it can be written in the subscript. For example, the expected value of $X$ knowing that $Y=y$ can be written as: $$E_{P(X|y)}[X] = \int_{-\infty}^{\infty} x P(x|y) dx.$$

Operations on Random Variables

It is not so easy to perform operations and transformations on random variables. Adding two random variables is not as simple as adding their pdfs, and even applying simple functions is not straightforward.

Suppose we wish to compute the distribution of a deterministic function $h(x)$ with $X$ a random variable. To answer this question, we consider the event space of two continuous random variables $X$ and $Y$ related by the constraint $y=h(x)$. For example, suppose that it is known that $y = x^2$, but we only have probabilistic knowledge about the value of $x$. What is the probability distribution over $Y$ that is consistent with this information?

If we know that $f$ and $F$ are the pdf and cdf of $X$, respectively, and we know that $h$ is monotonically increasing, then we can determine the pdf of $Y$ using the probability integral transform. Let $g$ and $G$ be the unknown pdf and cdf of $Y$, and let us examine how to compute a specific value of $G(y)$. Each point $x$ "counts" toward the sum as long as $h(x) \leq y$. Hence, the following equation holds: $$G(y) = P(Y \leq y) = \int_{-\infty}^{\infty} I[h(x) \leq y] P(X=x) dx. \label{eq:ProbabilityTransformedY}$$ If we use the assumption that $h$ is monotonically increasing, then $h(x) \leq y$ is true if and only if $x \leq h^{-1}(y)$. Hence, we can rewrite ($\ref{eq:ProbabilityTransformedY}$) as $$G(y) = \int_{-\infty}^{h^{-1}(y)} f(x) dx = F(h^{-1}(y)). \label{eq:ProbabilityTransformedY2}$$ Using this formula, we can derive a few results about simple transformations:

  • If $c$ is a constant, then $P(X + c \leq x) = P(X \leq x-c)$.

  • $P(-X\leq x) = P(X \geq -x)$.

  • If $a > 0$ is a constant, then $P(aX+b \leq x) = P(X \leq (x-b)/a)$.

  • If $a < 0$, then $P(aX+b \leq x) = P(X \geq (x-b)/a)$.

Let us now return to the function $h(x)=x^2$ as we proposed originally. This is not monotonically increasing, so we cannot directly apply ($\ref{eq:ProbabilityTransformedY2}$). However, we can split its domain into two parts, $x \geq 0$ and $x < 0$. The second part is monotonically decreasing, so we shall have to handle that slightly differently. We can still apply ($\ref{eq:ProbabilityTransformedY}$) to obtain $$G(y) = P(Y \leq y) = \int_{-\infty}^{\infty} I[x^2 \leq y] f(x) dx$$ which can be split into two parts in which $h$ is monotonic. Performing a flip in the integration range for the negatively decreasing part, we obtain: $$\begin{aligned} G(y) &= \int_{-\infty}^{0} I[x^2 \leq y] f(x) dx + \int_{0}^{\infty} I[x^2 \leq y] f(x) dx \\ & = \int_{-\sqrt y}^{0} f(x) dx + \int_{0}^{\sqrt y} f(x) dx \\ & = (F(0) - F(-\sqrt y)) + (F(\sqrt y) - F(0)) = F(\sqrt y)-F(-\sqrt y). \end{aligned}$$

If we know that a random variable $Z$ is the sum of two random variables $X$ and $Y$, we must consider all the ways that the sum $z=x+y$ can be formed. Via marginalization, $$F(z) = P(Z\leq z) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} I[x+y\leq z] P(X=x,Y=y) dx dy.$$ If $X$ and $Y$ are independent, then we can simplify the double integral into a single integral over either $X$ or $Y$: $$\begin{aligned} F(z) = P(Z\leq z) &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} I[x+y\leq z] P(X=x)P(Y=y) dx dy \\ &= \int_{-\infty}^{\infty} P(X\leq z-y)P(Y=y) dy %\\ %&= \int_{-\infty}^{\infty} P(Y\leq z-x)P(X=x) dx. \end{aligned}$$ Another way to express this is with the pdf, which is the derivative of the cdf: $$\begin{aligned} f(z) = \frac{d F}{d z}(z) &= \int_{-\infty}^{\infty} P(X=z-y)P(Y=y) dy %\\ %&= \int_{-\infty}^{\infty} P(Y=z-x)P(X=x) dx. \end{aligned}$$

This is not so easy to compute in general; for example, the sum of two uniformly distributed random variables $X$ and $Y$ on the range $[0,1]$ has the pdf $P(X+Y = z) = 0$ for $z \leq 0$ and $z \geq 0$, and for all other $z$ $$\begin{aligned} P(X+Y \leq z) &= \int_{-\infty}^{\infty} P(X=z-y)P(Y=y) dy \\ & = \int_{0}^{1} P(X = z-y) dy \\ & = \int_{0}^{1} (I[z\geq y]-I[z\geq y+1]) dy \\ & = \int_{0}^{1} I[z\geq y]dy - \int_{0}^{1} I[z\geq y+1] dy. \end{aligned}$$ If $z \geq 1$, then $z \geq y$ and the indicator function in the first integral is always active. If $z < 1$, then $z < y+1$ and the indicator function in the second integral is never active. Hence, if $z < 1$, $$P(X+Y = z) = \int_0^1 I[z \geq y]dy = \int_0^z 1 dy = z$$ and for $z \geq 1$, $$\begin{aligned} P(X+Y = z) &= \int_0^1 I[z \geq y]dy - \int_0^1 I[z \geq y+1] \\ & = \int_0^1 1 dy - \int_0^{z-1} 1 dy = 1 - (z-1) = 2-z. \end{aligned}$$

Fortunately, simple, closed form solutions exist for certain operations and classes of distributions. One of the most important classes of distribution is the Gaussian distribution, which is closed under linear transformations.

The Gaussian distribution

Univariate Gaussians

The univariate Gaussian (or normal) distribution with mean $\mu$ and standard deviation $\sigma$ has the pdf: $$P(x) \equiv N(x;\mu,\sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} e^{\frac{(x-\mu)^2}{2\sigma^2}}$$ This function is has the shape of the famous bell curve, and has a peak at $x=\mu$ and has spread controlled by $\sigma$. We say a Gaussian (or normal) variable $X$ has distribution $\mathcal{N}(\mu,\sigma)$, or $X \sim N(\mu,\sigma^2)$.

The significance of this distribution is the central limit theorem : the distribution of the mean of a sample of independent, identically distributed random variables approaches a Gaussian as the sample size grows larger. This holds under very few assumptions about the distribution of each variable. Specifically, given $X_1,\ldots,X_n \sim P(X)$, the sample mean $M_n$ is defined as $$M_n = \frac{1}{n}\sum_{i=1}^n X_n.$$ Roughly, the central limit theorem states that $M_n$ converges to a gaussian distribution: $$P(M_n = x) \approx N(x;\bar{X_i},Var[X_i]/n) \text{ as }n\rightarrow \infty.$$

It can also be shown that a linear transformation of a Gaussian variable is also Gaussian. Specifically, with $X \sim N(\mu,\sigma^2)$, we have $$aX+b \sim N(a\mu+b,a^2 \sigma^2).$$

The sum of two independent Gaussian variables is also Gaussian. If $X \sim N(\mu_X,\sigma_X^2)$, $Y \sim N(\mu_Y,\sigma_Y^2)$ are independent, then $$X + Y \sim N(\mu_X+\mu_Y,\sigma_X^2+\sigma_Y^2)$$

Multivariate Gaussians

A multivariate Gaussian distribution over a vector-valued random variable $\mathbf{X}=[X_1,...,X_n]^T \in \mathbb{R}^n$ with mean vector $\mathbf{\mu}$ and covariance matrix $\Sigma$ has the density function: $$P(\mathbf{x}) = N(\mathbf{x};\mathbf{\mu},\Sigma) = \frac{1}{(2\pi)^{n/2}\sqrt{|\Sigma|}} e^{-\frac{1}{2} (\mathbf{x}-\mathbf{\mu})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu})}. \label{eq:MultivariateGaussian}$$ Like the (univariate) Gaussian distribution, it has a peak at $x=\mathbf{\mu}$, and its spread is determined by the matrix $\Sigma$. Fig. 1 shows the pdf of a bivariate Gaussian with $\mathbf{\mu}=0$ and $\Sigma=I$.

The covariance matrix is defined as

$$\Sigma_{ij} = Cov[X_i,X_j].$$

$\Sigma$ must be symmetric positive definite as well. Its diagonal entries define the variance of individual elements of $\mathbf{x}$, while the off-diagonals determine how much it is skewed.

For example, consider a 2D Gaussian with mean $\mathbf{\mu} = \begin{bmatrix} \mu_{1} \\ \mu_{2} \end{bmatrix}$

and covariance

$$\Sigma = \begin{bmatrix}\Sigma_1 & \Sigma_{12} \\ \Sigma_{12} & \Sigma_{2} \end{bmatrix}.$$

If $\Sigma_{12} = 0$, then $X_1$ and $X_2$ are independent, with respective distributions $X_1 \sim N(\mu_1,\Sigma_1)$ and $X_2 \sim N(\mu_2,\Sigma_2)$. Otherwise, the marginal distributions stay the same: $$\begin{aligned} P(X_1=x) &= N(x;\mu_1,\Sigma_1) \\ P(X_2=x) &= N(x;\mu_2,\Sigma_2)\end{aligned}$$ but the joint distribution is skewed. If $\Sigma_{12} > 0$, it is skewed in the upper right and lower left quadrants, while if $\Sigma_{12} < 0$, it is skewed in the upper left and lower right quadrants.

Deriving the following facts requires some extensive manipulation of ($\ref{eq:MultivariateGaussian}$). We shall simply state them without proof; the proofs are left as exercise for the interested reader.

Stacking of multivariate Gaussians

Independent multivariate Gaussian variables can be stacked according to the following rule. If $\mathbf{X} \sim N(\mathbf{\mu}_X,\Sigma_X)$ and $\mathbf{Y} \sim N(\mathbf{\mu}_Y,\Sigma_Y)$ are independent, then $$\mathbf{Z} = \begin{bmatrix}\mathbf{X} \\ \mathbf{Y} \end{bmatrix} \sim N\left(\begin{bmatrix}\mathbf{\mu}_X \\ \mathbf{\mu}_Y \end{bmatrix}, \begin{bmatrix}\Sigma_X & 0 \\ 0 & \Sigma_Y \end{bmatrix}\right)$$ where the mean and covariance matrix are written in block-matrix form.

Marginalization of multivariate Gaussians

If $\mathbf{Z} = \begin{bmatrix}\mathbf{X} \\ \mathbf{Y} \end{bmatrix}$ is Gaussian: $$\mathbf{Z} = \begin{bmatrix}\mathbf{X} \\ \mathbf{Y} \end{bmatrix} \sim N\left( \begin{bmatrix}\mathbf{\mu}_X \\ \mathbf{\mu}_Y \end{bmatrix}, \begin{bmatrix} \Sigma_X & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_Y\end{bmatrix} \right)$$ then the marginal distribution of $\mathbf{X}$ is $\mathbf{X} \sim N(\mathbf{\mu}_X,\Sigma_X)$, and the marginal distribution of $\mathbf{y}$ is $\mathbf{Y} \sim N(\mathbf{\mu}_Y,\Sigma_Y)$.


Figure 3. An affine transform of a Gaussian distribution is also a Gaussian. A transform $A\mathbf{x}+\mathbf{b}$ (left) is applied to an initial zero-mean distribution, whose density is displayed in the leftmost contour plot. The middle contour plot displays the density of the resulting distribution. Applying the transform again yields a density in the rightmost contour plot.

Affine transformations of multivariate Gaussians are also Gaussian (Fig. 3). If $\mathbf{X} \sim N(\mathbf{\mu},\Sigma)$, then for any appropriately sized matrix $A$ and vector $\mathbf{b}$, $$A\mathbf{X}+\mathbf{b} \sim N(A\mathbf{\mu}+\mathbf{b},A\Sigma A^T).$$ More specifically, $\mathbf{X}$ and $A\mathbf{X}+\mathbf{b}$ are jointly distributed according to: $$\begin{bmatrix}\mathbf{X} \\ A\mathbf{X}+\mathbf{b} \end{bmatrix} \sim N\left(\begin{bmatrix}\mathbf{\mu} \\ A\mathbf{\mu}+\mathbf{b} \end{bmatrix},\begin{bmatrix}\Sigma & \Sigma A^T \\ A\Sigma & A\Sigma A^T\end{bmatrix}\right).$$

Sum of multivariate Gaussians

The sum of independent multivariate Gaussians is also Gaussian, which can be derived by stacking and linear transformation rule. If $\mathbf{X} \sim N(\mathbf{\mu}_X,\Sigma_X)$ and $\mathbf{Y} \sim N(\mathbf{\mu}_Y,\Sigma_Y)$ are independent $n$-D Gaussian variables, then $$\mathbf{Z} = \mathbf{X}+\mathbf{Y} = [I_n \quad I_n]\begin{bmatrix}\mathbf{X} \\ \mathbf{Y} \end{bmatrix} = N(\mathbf{\mu}_X+\mathbf{\mu}_Y,\Sigma_X+\Sigma_Y)$$ where $[I_n \quad I_n]$ is the horizontal block matrix of two identity matrices.

Conditioning of multivariate Gaussians

Conditional distributions of multivariate Gaussians are also Gaussian. Specifically, if $Val(\mathbf{X}) = \mathbb{R}^n$ and $Val(\mathbf{Y}) = \mathbb{R}^m$ are jointly Gaussian, that is, $$\begin{bmatrix}{\mathbf{X}} \\ {\mathbf{Y}} \end{bmatrix} \sim N\left(\begin{bmatrix}{\mathbf{\mu}_X} \\ {\mathbf{\mu}_Y} \end{bmatrix},\begin{bmatrix}{\Sigma_X}&{\Sigma_{XY}} \\ {\Sigma_{XY}^T}&{\Sigma_Y} \end{bmatrix}\right),$$ then the conditional distribution of $\mathbf{X}$ given that $\mathbf{Y}=\mathbf{y}$ is also Gaussian: $$P(\mathbf{x}|\mathbf{y}) = N\left( \mathbf{x};\mathbf{\mu}_X+\Sigma_{XY}\Sigma_Y^{-1}(\mathbf{y}-\mathbf{\mu}_Y),\Sigma_X - \Sigma_{XY}\Sigma_Y^{-1}\Sigma_{XY}^T \right) \label{eq:GaussianConditioning}$$ It can be observed that the mean of the new distribution depends on the value of $y$, but the covariance does not. Observe also that if $\mathbf{X}$ and $\mathbf{Y}$ are independent ($\Sigma_{XY}=0$), then knowing the value of $\mathbf{Y}$ does not change the marginal distribution of $\mathbf{X}$.

Fig. 2, right illustrates conditioning on a zero-mean Gaussian with $$\Sigma=\begin{bmatrix}{0.25}&{0.3} \\ {0.3}&{1} \end{bmatrix}$$ at $x_1=1$. According to ($\ref{eq:GaussianConditioning}$), the mean of $x_2$ becomes $0.3\cdot 0.25^{-1} \cdot 1 = 1.2$, and its variance becomes $1 - 0.3\cdot 0.25^{-1} \cdot 0.3 = 0.64$.

In [ ]: