Appendix A. MATHEMATICAL PRELIMINARIES¶

A.2. Real Analysis and Calculus in Higher Dimensions¶

We would like to extend concepts that are familiar to us from working with real numbers $\mathbb{R}$ to other spaces of interest. For example, through much of this class we will operate in $n$-dimensional Euclidean spaces $\mathbb{R}^n$. Elements of $\mathbb{R}^n$ (points) are simply tuples of $n$ real numbers. In this section we extend our understanding the notions of sequences, limits, derivatives, and closed and open intervals to a more general setting of metric spaces.

Metric spaces¶

A metric space $S$ is an arbitrary abstract set equipped with a metric $d(x,y)$ that measures a sense of distance between points. The metric is required to be 1) nonnegative ($d(x,y) \geq 0$), 2) reflexive ($d(x,y) = 0$ iff $x=y$) and 3) symmetric ($d(x,y) = d(y,x)$). It is also required to satisfy the triangle inequality: $d(x,y) \leq d(x,z)+d(z,y)$ for all $x,y,z$.

Examples:

$\mathbb{R}$ is a metric space under the absolute distance metric $d(x,y)=|x-y|$.
Any Euclidean space $\mathbb{R}^n$ is a metric space under the Euclidean distance metric (and indeed, many other metrics).
Any set is a metric space under the discrete metric: $d(x,y) = 0$ if $x=y$, and $d(x,y) = 1$ otherwise.

We can easily generalize the notions of convergence and limits of sequences into the metric space setting. This is done simply by replacing the absolute distances $|x-y|$ with the metric $d(x,y)$. Limits of a function also generalize, but only in the standard case (taken relative to some point $c$). The one-sided and asymptotic limits do not generalize in the same way, nor do the notions of minimum/maximum/infimum/supremum.

Neighborhood. For a metric space $(S,d)$, the $r$-neighborhood of a point $x\in S$ is the set $N_r(x) = { y\in S | d(x,y) < r }$. For a Euclidean space $\mathbb{R}^n$ and the Euclidean metric, this is a ball of dimension $n$ and radius $r$ centered on $x$, not including its surface.

Open set. An open set $A$ is a subset of a metric space $(S,d)$ such that every point $x \in A$ has an $r$-neighborhood completely contained within $A$ with some $r > 0$. More precisely, for all $x$, there exists an $r > 0$ such that $N_r(x) \subseteq A$. (Note that $r$ is allowed to depend on $x$.)

Closed set. A closed set $A$ is a subset of a metric space $(S,d)$ such that its complement $S \setminus A$ is an open set. (Note: typically the "universe" of possible elements $S$ is known and is therefore left implicit.)

Examples:

The half-open interval $(a,b]$ is neither open nor closed.
Any finite set of points $\{ x_1,\ldots,x_n \}$ closed in $\mathbb{R}$.
The empty set is both open and closed.
$\mathbb{R}$ is both open and closed (taking $\mathbb{R}$ as the universe).
The union of any number of open sets is open.
The intersection of any finite number of open sets is open.
The union of any finite number of closed sets is closed.
The intersection of any number of closed sets is closed.

Closure. The closure of a set $A$, denoted $cl(A)$ is the set of all points $x \in S$ such that for any $r > 0$, $N_r(x) \cap A \neq \emptyset$.

From this definition, we can see that $A \subseteq cl(A)$, $cl(A)$ is a closed set, and that $A=cl(A)$ iff $A$ is a closed set.

Interior. The interior of a set $A$, denoted $int(A)$ is the set of all points $x \in S$ such that there exists an $r > 0$ such that $N_r(x) \subseteq A$.

From this definition, we can see that $int(A) \subseteq A$, $int(A)$ is an open set, and that $A=int(A)$ iff $A$ is an open set.

Boundary. The boundary of a set $A$, denoted $\partial A$, is defined as $cl(A)\setminus int(A)$.

Or, equivalently, it is the set of points for which all $r$-neighborhoods with $r>0$ contain at least one point of $A$ and one point of its complement.

Scalar fields¶

We will often work with functions that map many variables (a vector) to a single number (a scalar). A scalar field $f : \mathbb{R}^n \rightarrow \mathbb{R}$, is a real-valued function of $n$ real-valued variables. We can either write the arguments explicitly as $f(x_1,\ldots,x_n)$ or as a single vector argument as $f(\mathbf{x})$. The same notions of continuity, minima, and maxima that apply to univariate functions also apply to scalar fields.

Continuity. A scalar field $f$ is continuous on a domain $S\in \mathbb{R}^n$ if for every point $\mathbf{x} \in S$ and $\epsilon > 0$, there exists a neighborhood with radius $\delta$ around $\mathbf{x}$ that satisfies $|f(\mathbf{x})-f(\mathbf{y})| < \epsilon$ for all $y \in N_\delta(\mathbf{x})$.

Extreme value theorem. A continuous scalar field $f$ attains a minimum and maximum value on a closed set $S \subseteq \mathbb{R}^n$.

Local minima. $\mathbf{x}$ is said to be a local minimum of a scalar field $f$ on an open set $S \subseteq \mathbb{R}^n$ if there exists a neighborhood $N_r(\mathbf{x})$ around $\mathbf{x}$ such that $f(\mathbf{x}) < f(\mathbf{y})$ for all $\mathbf{y} \in N_r(\mathbf{x}) \setminus \{ \mathbf{x}\}$. A similar definition holds for local maxima.

Partial derivatives¶

Partial derivatives allow taking derivatives of a scalar field with respect to certain individual arguments. The partial derivative of $f$ with respect to its argument $x_i$ at the values $(x_1,\ldots,x_n)=(a_1,\ldots,a_n)$ is given by $$\frac{\partial f}{\partial x_i}(a_1,\ldots,a_n) = \lim_{h \rightarrow 0} \frac{f(a_1,\ldots,a_{i-1},a_i+h,a_{i+1},\ldots,a_n)-f(a_1,\ldots,a_n)}{h}$$ In other words, this is the deriative in the direction of the $a_i$ while fixing all of the other $a_j$'s, $i\neq j$, constant.

We can write this definition more compactly in vector notation: $$\frac{\partial f}{\partial x_i}(\mathbf{a}) = \lim_{h \rightarrow 0} \frac{f(\mathbf{a}+h\mathbf{e}_{i})-f(\mathbf{a})}{h}$$

A third possible interpretation is that we are taking the derivative of the function $$g_i(x) = f(a_1,\ldots,a_{i-1},x,a_{i+1},\ldots,a_n)$$ at $x = a_i$. In other words, $g_i^\prime(x)|_{x = a_i} =\frac{\partial f}{\partial x_i}(\mathbf{a})$.

(Note that some other texts may use the notation $f_i$ to denote the partial derivative with respect to the $i$'th argument to $f$. I find this notation to be extremely confusing so we will use the $\frac{\partial f}{\partial x_i}$ notation throughout this class.)

Partial differentiation operator. The notation $\frac{\partial}{\partial x_i}$ refers to the partial differentiation operator on the $i$'th argument to $f$.

For shorthand, we may occasionally write $\frac{\partial}{\partial i}$.

Computing partial derivatives¶

The rules for computing regular derivatives can also be applied when computing partial derivatives. The only difference is that all arguments, except for the one being differentiated, are treated as constants. As an exmaple, consider $$f(x_1,x_2) = x_1^k e^{c x_2}.$$ The partial derivatives are: $$\frac{\partial}{\partial x_1} f(x_1,x_2) = k x_1^{k-1} e^{c x_2}$$ because $x_2$ is treated as a constant, and $$\frac{\partial}{\partial x_2} f(x_1,x_2) = c x_1^{k} e^{c x_2}$$ because $x_1$ is treated as a constant.

Be careful when a variable appears in multiple arguments. For example, given a function $f(x,g(x))$, the partial derivative of $f$ with respect to $x_1$ does not use the partial derivative with respect to $x_2$, or the derivative of $g$ in its calculation. Instead, you should treat the second argument as a constant $y$ and proceed as though it has no dependence on $x$.

You should incorporate those derivatives when computing the total derivative with respect to $x$, which is denoted $\frac{d}{dx}$ or $\cdot^\prime$. The total derivative treats the expression as a single function, $h(x) = f(x,g(x))$, and involves a chain rule with both partial derivatives: $$\frac{d}{dx}f(x,g(x)) = h^\prime(x) = \frac{\partial}{\partial x_1} f(x,g(x)) + \frac{\partial}{\partial x_2} f(x,g(x)) g(x)^\prime.$$

Directional derivatives¶

The directional derivative is another type of derivative that measures the rate of change of the scalar field $f$ as you move from $\mathbf{a}$ in a direction $\mathbf{d}$. You can think about moving along a ray $\mathbf{x}(u) = \mathbf{a} + u\mathbf{d}$ standing at $\mathbf{a}$ and beginning to move with velocity $\mathbf{d}$. The directional derivative measures the slope of the field in your desired direction.

The directional derivative of $f$ at $\mathbf{a}$ in direction $\mathbf{d}$ is the value: $$\nabla_{\mathbf{d}}f(\mathbf{a}) = \lim_{h\rightarrow 0} \frac{f(\mathbf{a}+h\mathbf{d})-f(\mathbf{a})}{h}.$$

Note that the standard partial derivative $\frac{\partial}{\partial i}f(\mathbf{a})$ is equivalent to $\nabla_{\mathbf{e}_{i}}f(\mathbf{a})$.

Gradient¶

A fundamental result in multivariate calculus is that all directional derivatives can be expressed as a dot product of $\mathbf{d}$ with a vector known as the gradient of $f$. The gradient of $f$, denoted $\nabla f(\mathbf{a})$, is the vector of all partial derivatives of $f$: $$\nabla f(\mathbf{a}) = \left(\frac{\partial}{\partial x_1} f(\mathbf{a}), \ldots, \frac{\partial}{\partial x_n} f(\mathbf{a})\right).$$

We will prove the key result that $$\nabla_{\mathbf{d}}f(\mathbf{a}) = (\nabla f(\mathbf{a})) \cdot \mathbf{d}$$

We will need a lemma that directional derviatives are linear, specifically that: $$\nabla_{c\mathbf{d}} f(\mathbf{a}) = c \nabla_{\mathbf{d}} f(\mathbf{a})$$ and $$\nabla_{\mathbf{d}+\mathbf{e}_{i}} f(\mathbf{a}) = \nabla_{\mathbf{d}} f(\mathbf{a}) + \nabla_{\mathbf{e}_{i}} f(\mathbf{a})$$ for any scalar $c$, vectors $\mathbf{d}$, $\mathbf{d}_{1}$, and $\mathbf{d}_{2}$.

First, scaling: $$\begin{split} \nabla_{c\mathbf{d}} f(\mathbf{a}) &= \lim_{h\rightarrow 0} \frac{f(\mathbf{a} + ch\mathbf{d})-f(\mathbf{a})}{h} \\ &= c \lim_{h\rightarrow 0} \frac{f(\mathbf{a} + ch\mathbf{d})-f(\mathbf{a})}{ch} \\ &= c \lim_{(ch)\rightarrow 0} \frac{f(\mathbf{a} + ch\mathbf{d})-f(\mathbf{a})}{ch} \\ &= c \lim_{h\rightarrow 0} \frac{f(\mathbf{a} + h\mathbf{d})-f(\mathbf{a})}{h} \\ &= c \nabla_{\mathbf{d}} f(\mathbf{a}) \end{split}$$

Now, let's look at summing. $$\nabla_{\mathbf{d}_{1}+\mathbf{d}_{2}} f(\mathbf{a}) = \lim_{h\rightarrow 0} \frac{f(\mathbf{a}+h\mathbf{d}_{1}+h\mathbf{d}_{2})-f(\mathbf{a})}{h}$$

Now let $g_h(x) = f(\mathbf{a}+h\mathbf{d}_{1} + x\mathbf{d}_{2})$ and evaluate the Taylor expansion of $g_h(x)$ around $x=0$: $$g_h(x) = f(\mathbf{a}+h\mathbf{d}_{1}) + g_h^\prime(x)|_{x=0} x + O(x^2)$$

So $$\begin{split} \nabla_{\mathbf{d}_{1}+\mathbf{d}_{2}} f(\mathbf{a}) &= \lim_{h\rightarrow 0} \frac{1}{h}[f(\mathbf{a}+h\mathbf{d}_{1}) + g_h^\prime(0) h + O(h^2) - f(\mathbf{a})] \\ &= \lim_{h\rightarrow 0} \left(\frac{1}{h}[f(\mathbf{a}+h\mathbf{d}_{1}) - f(\mathbf{a}) + O(h^2)] + g_h^\prime(0) \right)\\ &= \nabla_{\mathbf{d}_{1}} f(\mathbf{a}) + \lim_{h\rightarrow 0} g_h^\prime(x)|_{x=0}. \end{split}$$ The desired result is obtained by noting that $\lim_{h\rightarrow 0} g_h^\prime(x)|_{x=0}$ is identically $\nabla_{\mathbf{d}_{2}}f(\mathbf{a})$.

Returning to the original question, we use linearity of the directional derivative to obtain $$\begin{split} (\nabla f(\mathbf{a})) \cdot \mathbf{d} &= \sum_{i=1}^n d_{i} \nabla_{\mathbf{e}_{i}}f(\mathbf{a}) \\ &= \sum_{i=1}^n \nabla_{d_i \mathbf{e}_{i}}f(\mathbf{a}) \\ &= \nabla {\sum_{i=1}^n d_i \mathbf{e}_{i}}f(\mathbf{a}) \\ &= \nabla_{\mathbf{d}} f(\mathbf{a}) \end{split}$$ as desired. QED.

First-order Taylor expansion of a scalar field. If a scalar field $f$ is differentiable, we can use the gradient to define the first-order Taylor expansion of $f$ about a point $\mathbf{a}$: $$f(\mathbf{x}) = f(\mathbf{a}) + (\nabla f(\mathbf{a})) \cdot (\mathbf{x}-\mathbf{a}) + O(||\mathbf{x}-\mathbf{a}||^2).$$

Note that the equation $f(\mathbf{x}) = f(\mathbf{a}) + \nabla f(\mathbf{a}) \cdot (\mathbf{x}-\mathbf{a})$ defines a plane in the $(\mathbf{x},f)$ space. This says that the first two terms of the Taylor expansion give the plane that is tangent to the surface defined by $f$ at $\mathbf{a}$. Moreover, when $\mathbf{x}$ is close to $\mathbf{a}$, the fit is quite good. It therefore serves a role that is very much like the tangent line found by the Taylor expansion of a univariate function.

Differentiation rules¶

Here, let $f$ and $g$ be scalar fields, let $h$ be a univariate function, and let $c$ be a scalar.

Linearity. $\nabla ( f(\mathbf{x}) + c g(\mathbf{x})) = \nabla f(\mathbf{x}) + c \nabla g(\mathbf{x})$.
Chain rule. $\nabla h(f(\mathbf{x})) = h^\prime(f(\mathbf{x}))\nabla f(\mathbf{x})$.
Multiplication rule. $\nabla (fg)(\mathbf{x}) = f(\mathbf{x}) \nabla g(\mathbf{x}) + g(\mathbf{x}) \nabla f(\mathbf{x})$.

There is another version of the chain rule that is occasionally useful. Let $\mathbf{y}(u)$ be a vector space curve parameterized by $u \in \mathbb{R}$. Then $$\frac{d}{du}(f(\mathbf{y}(u))) = \nabla f(\mathbf{y}(u)) \cdot \mathbf{y}^\prime(u).$$ In other words, you get the directional derivative of $f$ in the direction $\mathbf{y}^\prime(u)$.

Critical points¶

Analogously to univariate functions, the critical points of a scalar field are those points $\mathbf{x}$ where $\nabla f(\mathbf{x})=\mathbf{0}$. All local minima and maxima are critical points, but not all critical points are local minima and maxima. These critical points are known as saddle points because in two dimensional scalar fields they look like saddles.

Vector fields¶

We will also work heavily with functions that map vectors to vectors, which are called vector fields. A vector field is a function $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$ mapping $\mathbf{x}$ to the vector $f(\mathbf{x})$. We will often refer to the components of the vector output using subscripts:

$$f(\mathbf{x}) = \begin{bmatrix} f_1(\mathbf{x}) \\ f_2(\mathbf{x}) \\ \vdots \\ f_m(\mathbf{x}) \end{bmatrix} $$

where each $f_i$ is itself a scalar field. A vector field can be thought of as a "stack" of scalar fields.

Although there is no notion of a minimum / maximum of such a function, we can still define continuity using the Euclidean norm rather than absolute value:

Continuity. A vector field $f$ is continuous on a domain $S\in \mathbb{R}^n$ if for every point $\mathbf{x} \in S$ and $\epsilon > 0$, there exists a neighborhood with radius $\delta$ around $\mathbf{x}$ that satisfies $\|f(\mathbf{x})-f(\mathbf{y})\| < \epsilon$ for all $y \in N_\delta(\mathbf{x})$.

Equivalently, $f$ is continuous on $S$ if each of its components is continuous on $S$.

Partial derivatives of vector fields¶

The partial derivative of a vector field with respect to one of its arguments is a vector, formed by stacking each of the partial derivatives of its components:

$$\frac{\partial f}{\partial x_j}(\mathbf{x}) = \begin{bmatrix} \frac{\partial f_1}{\partial x_j}(\mathbf{x}) \\ \frac{\partial f_2}{\partial x_j}(\mathbf{x}) \\ \vdots \\ \frac{\partial f_m}{\partial x_j}(\mathbf{x}) \end{bmatrix}.$$

Jacobian matrix¶

The Jacobian of a vector function is given as the following matrix of partial derivatives often written as $\nabla f$ or $\frac{\partial f}{\partial \mathbf{x}}$:

$$\nabla f(\mathbf{x}) \equiv \frac{\partial f}{\partial \mathbf{x}}(\mathbf{x}) \equiv \begin{bmatrix} \frac{\partial f_1(\mathbf{x})}{\partial x_1} & \cdots & \frac{\partial f_1(\mathbf{x})}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m(\mathbf{x})}{\partial x_1} & \cdots & \frac{\partial f_m(\mathbf{x})}{\partial x_n} \end{bmatrix}$$

You could also express this as gradients of the component functions. Assuming gradients are column vectors, we have

$$\nabla f(\mathbf{x}) = \begin{bmatrix} \nabla f_1(\mathbf{x})^T \\ \vdots \\ \nabla f_m(\mathbf{x})^T \end{bmatrix}. $$

The Jacobian is directly related to the directional derivative of $f$, and the following expression gives the rate of change of $f$ as $\mathbf{x}$ moves in a direction $\Delta \mathbf{x}$:

$$\left. \frac{d}{dt} f(\mathbf{x}+t \Delta \mathbf{x}) \right|_{t=0} = \nabla f(\mathbf{x}) \Delta \mathbf{x}.$$

Differentiation rules¶

Here, let $f : \mathbb{R}^n \rightarrow \mathbb{R}^m$, $g : \mathbb{R}^n \rightarrow \mathbb{R}^m$, and $h : \mathbb{R}^m \rightarrow \mathbb{R}^p$ be vector fields, and let $s : \mathbb{R}^m \rightarrow \mathbb{R}$ be a scalar field.

Linearity: $\nabla(f + c\cdot g)(\mathbf{x}) = \nabla f(\mathbf{x}) + c \nabla g(\mathbf{x})$, with $c$ a scalar.
Chain rule (scalar function of vector function): $\nabla_x s(f(\mathbf{x})) = \nabla s(f(\mathbf{x}))^T \cdot \nabla f(\mathbf{x})$. Note that the $\nabla_x$ denotes in the left hand side denotes that we are taking the gradient of $(s \circ f)$ with respect to $\mathbf{x}$ rather than the arguments of $s$. On the right hand side, $\nabla s (f(\mathbf{x}))$ is the gradient of $s$ with respect to its inputs, evaluated at the point $f(\mathbf{x})$.
Chain rule (vector function of vector function): $\nabla_x h(f(\mathbf{x})) = \nabla h (f(\mathbf{x})) \cdot \nabla f(\mathbf{x})$ (A Jacobian matrix). Again, $\nabla_x$ denotes taking the Jacobian of $h \circ f$ and $\nabla h$ takes the Jacobian of $h$.
Multiplication rule: $\nabla (s(\mathbf{x}) f(\mathbf{x})) = s(\mathbf{x}) \cdot \nabla f(\mathbf{x}) + f(\mathbf{x}) \cdot \nabla s(\mathbf{x})^T$. The second term in the sum is the outer product.
Dot product rule: $\nabla (f(\mathbf{x})^T g(\mathbf{x})) = \nabla f(\mathbf{x}) \cdot g(\mathbf{x}) + \nabla g(\mathbf{x}) \cdot f(\mathbf{x})$.