We would like to extend concepts that are familiar to us from working
with real numbers $\mathbb{R}$ to other spaces of interest. For example,
through much of this class we will operate in $n$-dimensional Euclidean
spaces $\mathbb{R}^n$. Elements of $\mathbb{R}^n$ (points) are simply
tuples of $n$ real numbers. In this section we extend the notions of
sequences, limits, derivatives, and closed and open intervals to the
more general setting of *metric spaces*.

A metric space $S$ is an arbitrary abstract set equipped with a *metric*
$d(x,y)$ that measures a sense of distance between points. The metric is
required to be 1) nonnegative ($d(x,y) \geq 0$), 2) reflexive
($d(x,y) = 0$ iff $x=y$) and 3) symmetric ($d(x,y) = d(y,x)$). It is
also required to satisfy the triangle inequality:
$d(x,y) \leq d(x,z)+d(z,y)$ for all $x,y,z$.

Examples:

$\mathbb{R}$ is a metric space under the absolute distance metric $d(x,y)=|x-y|$.

Any Euclidean space $\mathbb{R}^n$ is a metric space under the Euclidean distance metric (and indeed, many other metrics).

Any set is a metric space under the discrete metric: $d(x,y) = 0$ if $x=y$, and $d(x,y) = 1$ otherwise.
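
As a rough sanity check, the axioms above can be tested numerically on a handful of points. The sketch below is illustrative only: the helper `satisfies_metric_axioms`, the tolerance, and the sample points are assumptions made for this example, and passing on a finite sample does not prove that a function is a metric.

```python
import itertools
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def discrete(x, y):
    """Discrete metric: 0 for identical points, 1 otherwise."""
    return 0.0 if x == y else 1.0

def satisfies_metric_axioms(d, points, tol=1e-12):
    """Check the metric axioms on every pair and triple drawn from `points`."""
    for x, y in itertools.product(points, repeat=2):
        if d(x, y) < 0:                        # nonnegativity
            return False
        if (d(x, y) == 0) != (x == y):         # d(x,y) = 0 iff x = y
            return False
        if abs(d(x, y) - d(y, x)) > tol:       # symmetry
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if d(x, y) > d(x, z) + d(z, y) + tol:  # triangle inequality
            return False
    return True

# Hypothetical sample points in R^2, chosen only for illustration.
sample_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (-1.5, 0.5)]
print(satisfies_metric_axioms(euclidean, sample_points))
print(satisfies_metric_axioms(discrete, sample_points))
```

A function such as the squared distance $(x-y)^2$ on $\mathbb{R}$ fails this check, because it violates the triangle inequality for the points $0, 1, 2$.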

We can easily generalize the notions of convergence and limits of sequences into the metric space setting. This is done simply by replacing the absolute distances $|x-y|$ with the metric $d(x,y)$. Limits of a function also generalize, but only in the standard case (taken relative to some point $c$). The one-sided and asymptotic limits do not generalize in the same way, nor do the notions of minimum/maximum/infimum/supremum.

Neighborhood. For a metric space $(S,d)$, the $r$-neighborhood of a point $x\in S$ is the set $N_r(x) = \{ y\in S \mid d(x,y) < r \}$. For a Euclidean space $\mathbb{R}^n$ and the Euclidean metric, this is a ball of dimension $n$ and radius $r$ centered on $x$, not including its surface.

Open set. An *open set* $A$ is a subset of a metric space $(S,d)$ such that every point $x \in A$ has an $r$-neighborhood completely contained within $A$ for some $r > 0$. More precisely, for all $x \in A$, there exists an $r > 0$ such that $N_r(x) \subseteq A$. (Note that $r$ is allowed to depend on $x$.)

Closed set. A *closed set* $A$ is a subset of a metric space $(S,d)$ such that its complement $S \setminus A$ is an open set. (Note: typically the "universe" of possible elements $S$ is known and is therefore left implicit.)

Examples:

The half-open interval $(a,b]$ is neither open nor closed.

Any finite set of points $\{ x_1,\ldots,x_n \}$ is closed in $\mathbb{R}$.

The empty set is both open and closed.

$\mathbb{R}$ is both open and closed (taking $\mathbb{R}$ as the universe).

The union of any number of open sets is open.

The intersection of any finite number of open sets is open.

The union of any finite number of closed sets is closed.

The intersection of any number of closed sets is closed.

Closure. The closure of a set $A$, denoted $cl(A)$ is the set of all points $x \in S$ such that for any $r > 0$, $N_r(x) \cap A \neq \emptyset$.

From this definition, we can see that $A \subseteq cl(A)$, $cl(A)$ is a closed set, and that $A=cl(A)$ iff $A$ is a closed set.

Interior. The interior of a set $A$, denoted $int(A)$ is the set of all points $x \in S$ such that there exists an $r > 0$ such that $N_r(x) \subseteq A$.

From this definition, we can see that $int(A) \subseteq A$, $int(A)$ is an open set, and that $A=int(A)$ iff $A$ is an open set.

Boundary. The boundary of a set $A$, denoted $\partial A$, is defined as $cl(A)\setminus int(A)$.

Or, equivalently, it is the set of points for which all $r$-neighborhoods with $r>0$ contain at least one point of $A$ and one point of its complement.

We will often work with functions that map many variables (a vector) to
a single number (a scalar). A *scalar field*
$f : \mathbb{R}^n \rightarrow \mathbb{R}$, is a real-valued function of
$n$ real-valued variables. We can either write the arguments explicitly
as $f(x_1,\ldots,x_n)$ or as a single vector argument as $f(\mathbf{x})$. The
same notions of continuity, minima, and maxima that apply to univariate
functions also apply to scalar fields.

Continuity. A scalar field $f$ is continuous on a domain $S\subseteq \mathbb{R}^n$ if for every point $\mathbf{x} \in S$ and $\epsilon > 0$, there exists a neighborhood with radius $\delta > 0$ around $\mathbf{x}$ that satisfies $|f(\mathbf{x})-f(\mathbf{y})| < \epsilon$ for all $\mathbf{y} \in N_\delta(\mathbf{x})$.

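Extreme value theorem. A continuous scalar field $f$ attains a minimum and maximum value on any nonempty closed and bounded set $S \subseteq \mathbb{R}^n$. (Closedness alone is not enough: $f(x) = x$ attains no maximum on the closed set $\mathbb{R}$.)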

Local minima. $\mathbf{x}$ is said to be a local minimum of a scalar field $f$ on an open set $S \subseteq \mathbb{R}^n$ if there exists a neighborhood $N_r(\mathbf{x})$ around $\mathbf{x}$ such that $f(\mathbf{x}) < f(\mathbf{y})$ for all $\mathbf{y} \in N_r(\mathbf{x}) \setminus \{ \mathbf{x}\}$. A similar definition holds for local maxima.

Partial derivatives allow taking derivatives of a scalar field with
respect to certain individual arguments. The *partial derivative* of $f$
with respect to its argument $x_i$ at the values
$(x_1,\ldots,x_n)=(a_1,\ldots,a_n)$ is given by
$$\frac{\partial f}{\partial x_i}(a_1,\ldots,a_n) = \lim_{h \rightarrow 0} \frac{f(a_1,\ldots,a_{i-1},a_i+h,a_{i+1},\ldots,a_n)-f(a_1,\ldots,a_n)}{h}$$
In other words, this is the derivative in the direction of $x_i$
while holding all of the other arguments $x_j$, $j\neq i$, constant at $a_j$.

We can write this definition more compactly in vector notation, where $\mathbf{e}_{i}$ denotes the $i$'th standard basis vector (1 in entry $i$ and 0 elsewhere): $$\frac{\partial f}{\partial x_i}(\mathbf{a}) = \lim_{h \rightarrow 0} \frac{f(\mathbf{a}+h\mathbf{e}_{i})-f(\mathbf{a})}{h}$$

A third possible interpretation is that we are taking the derivative of the function $$g_i(x) = f(a_1,\ldots,a_{i-1},x,a_{i+1},\ldots,a_n)$$ at $x = a_i$. In other words, $g_i^\prime(x)|_{x = a_i} =\frac{\partial f}{\partial x_i}(\mathbf{a})$.

(Note that some other texts may use the notation $f_i$ to denote the partial derivative with respect to the $i$'th argument to $f$. I find this notation to be extremely confusing so we will use the $\frac{\partial f}{\partial x_i}$ notation throughout this class.)

Partial differentiation operator. The notation $\frac{\partial}{\partial x_i}$ refers to the partial differentiation operator on the $i$'th argument to $f$.

For shorthand, we may occasionally write $\frac{\partial}{\partial i}$.

The rules for computing regular derivatives can also be applied when
computing partial derivatives. The only difference is that *all
arguments, except for the one being differentiated, are treated as
constants*. As an example, consider $$f(x_1,x_2) = x_1^k e^{c x_2}.$$
The partial derivatives are:
$$\frac{\partial}{\partial x_1} f(x_1,x_2) = k x_1^{k-1} e^{c x_2}$$
because $x_2$ is treated as a constant, and
$$\frac{\partial}{\partial x_2} f(x_1,x_2) = c x_1^{k} e^{c x_2}$$
because $x_1$ is treated as a constant.
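
The two formulas above can be checked numerically. The sketch below is only an illustration: it uses a central difference rather than the one-sided limit in the definition, and the values of `k`, `c`, the evaluation point `a`, and the step size `h` are arbitrary choices made for this example.

```python
import math

def partial_derivative(f, a, i, h=1e-6):
    """Central-difference estimate of df/dx_i at the point a (a list of floats)."""
    forward = list(a)
    backward = list(a)
    forward[i] += h      # perturb only the i'th argument
    backward[i] -= h
    return (f(forward) - f(backward)) / (2 * h)

k, c = 3, 0.5  # illustrative constants, not taken from the text

def f(x):
    return x[0] ** k * math.exp(c * x[1])

a = [1.2, -0.7]
numeric = [partial_derivative(f, a, 0), partial_derivative(f, a, 1)]
analytic = [k * a[0] ** (k - 1) * math.exp(c * a[1]),   # x_2 treated as constant
            c * a[0] ** k * math.exp(c * a[1])]         # x_1 treated as constant
print(numeric)
print(analytic)
```

The two printed lists should agree to several decimal places, with the remaining difference coming from the finite step size and floating-point rounding.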

Be careful when a variable appears in multiple arguments. For example,
given an expression $f(x,g(x))$, where $f(x_1,x_2)$ is a function of two
arguments, the partial derivative of $f$ with respect to $x_1$ does *not*
use the partial derivative with respect to $x_2$, or the derivative of
$g$, in its calculation. Instead, you should treat the second argument as
a constant $y$ and proceed as though it has no dependence on $x$.

You *should* incorporate those derivatives when computing the *total
derivative* with respect to $x$, which is denoted $\frac{d}{dx}$ or
$\cdot^\prime$. The total derivative treats the expression as a single
function, $h(x) = f(x,g(x))$, and involves a chain rule with both
partial derivatives:
$$\frac{d}{dx}f(x,g(x)) = h^\prime(x) = \frac{\partial}{\partial x_1} f(x,g(x)) + \frac{\partial}{\partial x_2} f(x,g(x))\, g^\prime(x).$$
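
To make the distinction concrete, here is a small numerical sketch. The choices $f(x_1,x_2) = x_1^2 x_2$ and $g(x) = \sin x$ are assumptions made purely for illustration, as is the helper `central_diff`.

```python
import math

def f(x1, x2):
    return x1 ** 2 * x2

def g(x):
    return math.sin(x)

def h(x):
    """The composite expression f(x, g(x)) viewed as a function of x alone."""
    return f(x, g(x))

def central_diff(fun, x, step=1e-6):
    """Central-difference estimate of the derivative of a univariate function."""
    return (fun(x + step) - fun(x - step)) / (2 * step)

x = 0.8
df_dx1 = 2 * x * g(x)   # partial w.r.t. the first argument, second held fixed
df_dx2 = x ** 2         # partial w.r.t. the second argument
g_prime = math.cos(x)

total_chain = df_dx1 + df_dx2 * g_prime   # chain-rule formula for the total derivative
total_numeric = central_diff(h, x)        # direct numerical derivative of h
print(df_dx1, total_chain, total_numeric)
```

The partial derivative `df_dx1` differs noticeably from the total derivative, while the chain-rule value and the numerical estimate of $h^\prime(x)$ should agree closely.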

The *directional derivative* is another type of derivative that measures
the rate of change of the scalar field $f$ as you move from $\mathbf{a}$ in a
direction $\mathbf{d}$. You can think of standing at $\mathbf{a}$ and beginning to move
along the ray $\mathbf{x}(u) = \mathbf{a} + u\mathbf{d}$
with velocity $\mathbf{d}$. The directional derivative measures the slope of
the field in your desired direction.

The *directional derivative* of $f$ at $\mathbf{a}$ in direction $\mathbf{d}$ is the value: $$\nabla_{\mathbf{d}}f(\mathbf{a}) = \lim_{h\rightarrow 0} \frac{f(\mathbf{a}+h\mathbf{d})-f(\mathbf{a})}{h}.$$

Note that the standard partial derivative $\frac{\partial}{\partial i}f(\mathbf{a})$ is equivalent to $\nabla_{\mathbf{e}_{i}}f(\mathbf{a})$.

A fundamental result in multivariate calculus is that *all directional
derivatives* can be expressed as a dot product of $\mathbf{d}$ with a vector
known as the *gradient* of $f$. The gradient of $f$, denoted
$\nabla f(\mathbf{a})$, is the vector of all partial derivatives of $f$:
$$\nabla f(\mathbf{a}) = \left(\frac{\partial}{\partial x_1} f(\mathbf{a}), \ldots, \frac{\partial}{\partial x_n} f(\mathbf{a})\right).$$

We will prove the key result that $$\nabla_{\mathbf{d}}f(\mathbf{a}) = (\nabla f(\mathbf{a})) \cdot \mathbf{d}$$

We will need a lemma that directional derivatives are linear in the direction, specifically that: $$\nabla_{c\mathbf{d}} f(\mathbf{a}) = c \nabla_{\mathbf{d}} f(\mathbf{a})$$ and $$\nabla_{\mathbf{d}_{1}+\mathbf{d}_{2}} f(\mathbf{a}) = \nabla_{\mathbf{d}_{1}} f(\mathbf{a}) + \nabla_{\mathbf{d}_{2}} f(\mathbf{a})$$ for any scalar $c$ and vectors $\mathbf{d}$, $\mathbf{d}_{1}$, and $\mathbf{d}_{2}$.

First, scaling: $$\begin{split} \nabla_{c\mathbf{d}} f(\mathbf{a}) &= \lim_{h\rightarrow 0} \frac{f(\mathbf{a} + ch\mathbf{d})-f(\mathbf{a})}{h} \\ &= c \lim_{h\rightarrow 0} \frac{f(\mathbf{a} + ch\mathbf{d})-f(\mathbf{a})}{ch} \\ &= c \lim_{(ch)\rightarrow 0} \frac{f(\mathbf{a} + ch\mathbf{d})-f(\mathbf{a})}{ch} \\ &= c \lim_{h\rightarrow 0} \frac{f(\mathbf{a} + h\mathbf{d})-f(\mathbf{a})}{h} \\ &= c \nabla_{\mathbf{d}} f(\mathbf{a}) \end{split}$$

Now, let's look at summing. $$\nabla_{\mathbf{d}_{1}+\mathbf{d}_{2}} f(\mathbf{a}) = \lim_{h\rightarrow 0} \frac{f(\mathbf{a}+h\mathbf{d}_{1}+h\mathbf{d}_{2})-f(\mathbf{a})}{h}$$

Now let $g_h(x) = f(\mathbf{a}+h\mathbf{d}_{1} + x\mathbf{d}_{2})$ and evaluate the Taylor expansion of $g_h(x)$ around $x=0$: $$g_h(x) = f(\mathbf{a}+h\mathbf{d}_{1}) + g_h^\prime(x)|_{x=0} x + O(x^2)$$

So $$\begin{split} \nabla_{\mathbf{d}_{1}+\mathbf{d}_{2}} f(\mathbf{a}) &= \lim_{h\rightarrow 0} \frac{1}{h}[f(\mathbf{a}+h\mathbf{d}_{1}) + g_h^\prime(0) h + O(h^2) - f(\mathbf{a})] \\ &= \lim_{h\rightarrow 0} \left(\frac{1}{h}[f(\mathbf{a}+h\mathbf{d}_{1}) - f(\mathbf{a}) + O(h^2)] + g_h^\prime(0) \right)\\ &= \nabla_{\mathbf{d}_{1}} f(\mathbf{a}) + \lim_{h\rightarrow 0} g_h^\prime(x)|_{x=0}. \end{split}$$ The desired result is obtained by noting that $\lim_{h\rightarrow 0} g_h^\prime(x)|_{x=0}$ is identically $\nabla_{\mathbf{d}_{2}}f(\mathbf{a})$.

Returning to the original question, we use linearity of the directional derivative to obtain $$\begin{split} (\nabla f(\mathbf{a})) \cdot \mathbf{d} &= \sum_{i=1}^n d_{i} \nabla_{\mathbf{e}_{i}}f(\mathbf{a}) \\ &= \sum_{i=1}^n \nabla_{d_i \mathbf{e}_{i}}f(\mathbf{a}) \\ &= \nabla_{\sum_{i=1}^n d_i \mathbf{e}_{i}}f(\mathbf{a}) \\ &= \nabla_{\mathbf{d}} f(\mathbf{a}) \end{split}$$ as desired. QED.
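
The identity $\nabla_{\mathbf{d}}f(\mathbf{a}) = \nabla f(\mathbf{a}) \cdot \mathbf{d}$ can also be checked numerically. The following is a minimal sketch: the scalar field `f`, the point `a`, and the direction `d` are arbitrary illustrative choices, and both sides are approximated with central differences rather than exact limits.

```python
import math

def f(x):
    """Illustrative scalar field on R^3 (an assumption for this example)."""
    return x[0] ** 2 * x[1] + math.sin(x[2]) * x[0]

def directional_derivative(f, a, d, h=1e-6):
    """Central-difference estimate of the directional derivative of f at a along d."""
    plus = [ai + h * di for ai, di in zip(a, d)]
    minus = [ai - h * di for ai, di in zip(a, d)]
    return (f(plus) - f(minus)) / (2 * h)

def gradient(f, a, h=1e-6):
    """Gradient assembled from directional derivatives along the basis vectors e_i."""
    n = len(a)
    basis = [[1.0 if j == i else 0.0 for j in range(n)] for i in range(n)]
    return [directional_derivative(f, a, e, h) for e in basis]

a = [1.0, -2.0, 0.5]
d = [0.3, 1.5, -2.0]
lhs = directional_derivative(f, a, d)                    # derivative along d directly
rhs = sum(gi * di for gi, di in zip(gradient(f, a), d))  # gradient dotted with d
print(lhs, rhs)
```

Note that `d` is deliberately not a unit vector: the identity holds for any direction, with the result scaling linearly in the length of `d`.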

First-order Taylor expansion of a scalar field. If a scalar field $f$ is differentiable, we can use the gradient to define the first-order Taylor expansion of $f$ about a point $\mathbf{a}$: $$f(\mathbf{x}) = f(\mathbf{a}) + (\nabla f(\mathbf{a})) \cdot (\mathbf{x}-\mathbf{a}) + O(||\mathbf{x}-\mathbf{a}||^2).$$

Note that the equation
$f(\mathbf{x}) = f(\mathbf{a}) + \nabla f(\mathbf{a}) \cdot (\mathbf{x}-\mathbf{a})$ defines a
plane in the $(\mathbf{x},f)$ space. This says that the first two terms of
the Taylor expansion give the plane that is *tangent to the surface*
defined by $f$ at $\mathbf{a}$. Moreover, when $\mathbf{x}$ is close to $\mathbf{a}$,
the fit is quite good. It therefore serves a role that is very much like
the tangent line found by the Taylor expansion of a univariate function.
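
One way to see how good the tangent-plane fit is: halving the distance to $\mathbf{a}$ should cut the approximation error by roughly a factor of four, reflecting the $O(\|\mathbf{x}-\mathbf{a}\|^2)$ remainder. The sketch below assumes an illustrative field $f(x_1,x_2) = e^{x_1}\cos x_2$ with its gradient written out by hand; the point `a` and the direction are arbitrary choices.

```python
import math

def f(x):
    return math.exp(x[0]) * math.cos(x[1])

def grad_f(x):
    """Hand-derived gradient of the illustrative field f."""
    return [math.exp(x[0]) * math.cos(x[1]), -math.exp(x[0]) * math.sin(x[1])]

def tangent_plane(a, x):
    """First-order Taylor approximation of f about a, evaluated at x."""
    return f(a) + sum(g * (xi - ai) for g, xi, ai in zip(grad_f(a), x, a))

a = [0.2, 0.4]
direction = [0.6, 0.8]  # a unit vector, so t below equals the distance from a

def taylor_error(t):
    """Absolute error of the tangent plane at distance t from a along `direction`."""
    x = [ai + t * di for ai, di in zip(a, direction)]
    return abs(f(x) - tangent_plane(a, x))

for t in [1e-2, 5e-3, 2.5e-3]:
    print(t, taylor_error(t))
```

Each printed error should be about a quarter of the previous one.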

Here, let $f$ and $g$ be scalar fields, let $h$ be a univariate function, and let $c$ be a scalar.

- Linearity: $\nabla ( f(\mathbf{x}) + c g(\mathbf{x})) = \nabla f(\mathbf{x}) + c \nabla g(\mathbf{x})$.
- Chain rule: $\nabla h(f(\mathbf{x})) = h^\prime(f(\mathbf{x}))\nabla f(\mathbf{x})$.
- Multiplication rule: $\nabla (fg)(\mathbf{x}) = f(\mathbf{x}) \nabla g(\mathbf{x}) + g(\mathbf{x}) \nabla f(\mathbf{x})$.

There is another version of the chain rule that is occasionally useful. Let $\mathbf{y}(u)$ be a curve in $\mathbb{R}^n$ parameterized by $u \in \mathbb{R}$. Then $$\frac{d}{du}(f(\mathbf{y}(u))) = \nabla f(\mathbf{y}(u)) \cdot \mathbf{y}^\prime(u).$$ In other words, you get the directional derivative of $f$ at $\mathbf{y}(u)$ in the direction $\mathbf{y}^\prime(u)$.
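
A quick numerical sketch of this curve version of the chain rule follows. The field $f(x_1,x_2) = x_1^2 + 3x_1x_2$ and the unit-circle curve $\mathbf{y}(u) = (\cos u, \sin u)$ are assumptions chosen only for illustration.

```python
import math

def f(x):
    return x[0] ** 2 + 3.0 * x[0] * x[1]

def grad_f(x):
    """Hand-derived gradient of the illustrative field f."""
    return [2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]]

def y(u):
    """Illustrative curve: the unit circle."""
    return [math.cos(u), math.sin(u)]

def y_prime(u):
    return [-math.sin(u), math.cos(u)]

def chain_rule_rate(u):
    """Gradient of f at y(u) dotted with the curve velocity y'(u)."""
    return sum(g * v for g, v in zip(grad_f(y(u)), y_prime(u)))

def numeric_rate(u, step=1e-6):
    """Central-difference derivative of u -> f(y(u))."""
    return (f(y(u + step)) - f(y(u - step))) / (2 * step)

u = 0.9
print(chain_rule_rate(u), numeric_rate(u))
```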

Analogously to univariate functions, the critical points of a scalar
field are those points $\mathbf{x}$ where $\nabla f(\mathbf{x})=\mathbf{0}$. All local
minima and maxima of a differentiable scalar field are critical points,
but not all critical points are local minima or maxima. Critical points
that are neither are known as *saddle points* because in two-dimensional
scalar fields they look like saddles.
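
A standard textbook illustration (not taken from the text above) is $f(x,y) = x^2 - y^2$, whose gradient vanishes at the origin even though the origin is neither a minimum nor a maximum. A minimal sketch:

```python
def f(x, y):
    return x ** 2 - y ** 2

def grad_f(x, y):
    """Hand-derived gradient of f."""
    return (2 * x, -2 * y)

origin_value = f(0.0, 0.0)
t = 1e-3  # small illustrative step away from the origin

print(grad_f(0.0, 0.0))         # both components are zero: a critical point
print(f(t, 0.0) > origin_value) # f increases when moving along the x axis
print(f(0.0, t) < origin_value) # f decreases when moving along the y axis
```

Because $f$ rises in one direction and falls in another, no neighborhood of the origin makes it a local minimum or maximum.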

We will also work heavily with functions that map vectors to vectors, which are called *vector fields*. A vector field is a function $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$ mapping $\mathbf{x}$ to the *vector* $f(\mathbf{x})$. We will often refer to the components of the vector output using subscripts:

$$f(\mathbf{x}) = \begin{bmatrix} f_1(\mathbf{x}) \\ \vdots \\ f_m(\mathbf{x}) \end{bmatrix},$$

where each $f_i$ is itself a scalar field. A vector field can be thought of as a "stack" of scalar fields.

Although there is no notion of a minimum / maximum of such a function, we can still define continuity using the Euclidean norm rather than absolute value:

Continuity. A vector field $f$ is continuous on a domain $S\subseteq \mathbb{R}^n$ if for every point $\mathbf{x} \in S$ and $\epsilon > 0$, there exists a neighborhood with radius $\delta > 0$ around $\mathbf{x}$ that satisfies $\|f(\mathbf{x})-f(\mathbf{y})\| < \epsilon$ for all $\mathbf{y} \in N_\delta(\mathbf{x})$.

Equivalently, $f$ is continuous on $S$ if each of its components is continuous on $S$.

The partial derivative of a vector field with respect to one of its arguments is a vector, formed by stacking each of the partial derivatives of its components:

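$$\frac{\partial f}{\partial x_j}(\mathbf{x}) = \begin{bmatrix} \frac{\partial f_1}{\partial x_j}(\mathbf{x}) \\ \frac{\partial f_2}{\partial x_j}(\mathbf{x}) \\ \vdots \\ \frac{\partial f_m}{\partial x_j}(\mathbf{x}) \end{bmatrix}.$$

The Jacobian of a vector field is the following $m \times n$ *matrix of partial derivatives*, often written as $\nabla f$ or $\frac{\partial f}{\partial \mathbf{x}}$:

$$\nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f_1}{\partial x_1}(\mathbf{x}) & \cdots & \frac{\partial f_1}{\partial x_n}(\mathbf{x}) \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1}(\mathbf{x}) & \cdots & \frac{\partial f_m}{\partial x_n}(\mathbf{x}) \end{bmatrix}.$$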

You could also express this as gradients of the component functions. Assuming gradients are column vectors, we have

$$\nabla f(\mathbf{x}) = \begin{bmatrix} \nabla f_1(\mathbf{x})^T \\ \vdots \\ \nabla f_m(\mathbf{x})^T \end{bmatrix}. $$

The Jacobian is directly related to the directional derivative of $f$, and the following expression gives the rate of change of $f$ as $\mathbf{x}$ moves in a direction $\Delta \mathbf{x}$:

$$\left. \frac{d}{dt} f(\mathbf{x}+t \Delta \mathbf{x}) \right|_{t=0} = \nabla f(\mathbf{x}) \Delta \mathbf{x}.$$

Here, let $f : \mathbb{R}^n \rightarrow \mathbb{R}^m$, $g : \mathbb{R}^n \rightarrow \mathbb{R}^m$, and $h : \mathbb{R}^m \rightarrow \mathbb{R}^p$ be vector fields, and let $s : \mathbb{R}^m \rightarrow \mathbb{R}$ be a scalar field.

- Linearity: $\nabla(f + c\cdot g)(\mathbf{x}) = \nabla f(\mathbf{x}) + c \nabla g(\mathbf{x})$, with $c$ a scalar.
- Chain rule (scalar function of vector function): $\nabla_x s(f(\mathbf{x})) = \nabla s(f(\mathbf{x}))^T \cdot \nabla f(\mathbf{x})$. Note that the $\nabla_x$ on the left hand side denotes that we are taking the gradient of $(s \circ f)$ with respect to $\mathbf{x}$ rather than the arguments of $s$. On the right hand side, $\nabla s (f(\mathbf{x}))$ is the gradient of $s$ with respect to its inputs, evaluated at the point $f(\mathbf{x})$.
- Chain rule (vector function of vector function): $\nabla_x h(f(\mathbf{x})) = \nabla h (f(\mathbf{x})) \cdot \nabla f(\mathbf{x})$ (A Jacobian matrix). Again, $\nabla_x$ denotes taking the Jacobian of $h \circ f$ and $\nabla h$ takes the Jacobian of $h$.
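- Multiplication rule: $\nabla (s(\mathbf{x}) f(\mathbf{x})) = s(\mathbf{x}) \cdot \nabla f(\mathbf{x}) + f(\mathbf{x}) \cdot \nabla s(\mathbf{x})^T$. Here $s$ is taken to be a scalar field of $\mathbf{x} \in \mathbb{R}^n$, and the second term in the sum is the outer product of $f(\mathbf{x})$ and $\nabla s(\mathbf{x})$.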
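- Dot product rule: $\nabla (f(\mathbf{x})^T g(\mathbf{x})) = \nabla f(\mathbf{x})^T g(\mathbf{x}) + \nabla g(\mathbf{x})^T f(\mathbf{x})$. The Jacobians are transposed so that each term is an $n$-dimensional column vector, matching the gradient of the scalar field $f^T g$.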
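
The Jacobian's role in directional derivatives and in the vector-to-vector chain rule can be checked with finite differences. The sketch below is only illustrative: the fields `f` and `h`, the point `x`, and the direction `dx` are arbitrary assumptions, and the helpers `jacobian`, `matmul`, and `matvec` are written out by hand to keep the example free of external libraries.

```python
import math

def f(x):
    """Illustrative vector field f : R^2 -> R^3."""
    return [x[0] * x[1], math.sin(x[0]), x[1] ** 2]

def h(y):
    """Illustrative vector field h : R^3 -> R^2."""
    return [y[0] + y[1] * y[2], math.exp(y[0]) - y[2]]

def jacobian(func, x, step=1e-6):
    """Central-difference Jacobian: entry [i][j] approximates d func_i / d x_j."""
    m = len(func(x))
    J = [[0.0] * len(x) for _ in range(m)]
    for j in range(len(x)):
        plus, minus = list(x), list(x)
        plus[j] += step
        minus[j] -= step
        fp, fm = func(plus), func(minus)
        for i in range(m):
            J[i][j] = (fp[i] - fm[i]) / (2 * step)
    return J

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matvec(A, v):
    """Product of a matrix (list of rows) with a vector."""
    return [sum(a * vi for a, vi in zip(row, v)) for row in A]

x = [0.7, -1.3]
dx = [0.5, 2.0]

# Rate of change of f along dx, estimated directly, versus Jacobian times dx.
t = 1e-6
f_plus = f([xi + t * di for xi, di in zip(x, dx)])
f_minus = f([xi - t * di for xi, di in zip(x, dx)])
rate = [(p - q) / (2 * t) for p, q in zip(f_plus, f_minus)]
print(rate, matvec(jacobian(f, x), dx))

# Chain rule: Jacobian of the composition versus the product of Jacobians.
composed = jacobian(lambda z: h(f(z)), x)
product = matmul(jacobian(h, f(x)), jacobian(f, x))
print(composed, product)
```

In both comparisons the two printed quantities should match up to finite-difference error. Note the shapes: the Jacobian of `h` at `f(x)` is $2 \times 3$ and the Jacobian of `f` at `x` is $3 \times 2$, giving a $2 \times 2$ Jacobian for the composition.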
