Finding the optimal (minimum or maximum) value of a real-valued function is a ubiquitous operation in mathematics, engineering, physics, and computer science. This section will discuss mathematical criteria for defining the existence of optima and some numerical approaches for finding them. Minimization is also closely connected to problem of root-finding, which is an important problem in its own right.

Here we consider the problem of finding the minimum value that a
real-valued function $f :\mathbb{R}^n \mapsto \mathbb{R}$ takes on over
the reals. (The problem of maximizing $f$ can be solved by finding the
minimum of $-f$, so we will typically only concern ourselves with
minimization.) This is known as *unconstrained optimization*. The
problem where the domain is limited to a closed set, such as an interval
$[a,b]$, is known as *constrained optimization*, and will be covered in
the next section.

In some cases, it may be possible to examine the function by hand and use mathematical tools to find a closed-form expression of the minimum. When this is possible, such a solution can be very valuable because it can be evaluated quickly on a computer or can be used in further mathematical transformations.

The

minimumof $f$ over a domain $S$ is a value $m$ such that $m=f(c)$ for some value $c \in S$ and that $m \leq f(x)$. If such a value exists, we write: $$m = \min_{x \in S} f(x)$$

We may also be interested in obtaining the x-values $c$ that attains this minimum.

The

argument of the minimum(orarg min) is a point $c$ for which $f(c) = \min_{x \in S} f(x)$. We write $c = \arg\min_{x \in S} f(x).$

Since a function may attain its minimum at multiple points, the $\arg\min$ is technically a set of points. It may also be an empty set if the minimum is not defined, e.g. $\arg \min_{x \in \mathbb{R}} x$.

Similar conditions define the **maximum** and **arg max**.

A value $x$ for which $f$ obtains its minimum (or maximum) over some open interval
containing $x$ is known as a *local minimum* (or maximum) of $f$. If $f$
is differentiable, then one condition for local optimality is that the
derivative of $f$ vanishes at $x$ (Figure 1). In other words, if $f$
attains a minimum (or maximum) at $x$ over the interval $(a,b)$, then
$f^\prime(x) = 0$.

To prove this fact, we will prove the contrapositive: if $x \in (a,b)$ is a point where $f^\prime(x) = L \neq 0$, then $x$ is not a local minimum. Using the definition of a limit, $$L = f^\prime(x) = \lim_{h\rightarrow 0} (f(x+h)-f(x))/h.$$ This means that for any $\epsilon > 0$, we can pick a $\delta$ such that $|(f(x+h)-f(x))/h-L| < \epsilon$ as long as $|h|\leq \delta$. Since $(a,b)$ is an open set, we can choose $\delta$ small enough so that $(x-\delta,x+\delta) \subseteq (a,b)$.

If $L$ is positive, pick $h=\delta$. Then we have $$(f(x+\delta)-f(x))/\delta-L < \epsilon$$ Which can be rearranged to obtain $$f(x+\delta) < f(x)+(L+\epsilon)\delta.$$ Hence, $f(x+\delta)$ is lower than $f(x)$, and hence $f(x)$ is not a minimizer of $f$.

On the other hand, if $L$ is negative, pick $h=-\delta$. Then we have $$-\epsilon < -(f(x-\delta)-f(x))/\delta-L$$ Which can be rearranged to obtain $$f(x-\delta) < f(x)-\delta(L-\epsilon).$$ Since $L-\epsilon$ is negative, this shows that $f(x-\delta)$ is lower than $f(x)$, and hence $f(x)$ is not a minimizer either. QED

Hence it is often useful to find all points where $f^\prime(x) = 0$.
These are known as *critical points*. All local minima and maxima are
critical points. However, not all critical points are local minima and
maxima, as the example $f(x) = x^3$ shows. The derivative vanishes at
$x=0$, but the function does not obtain a local extremum. Critical
points that are not local extrema are known as *inflection points*.

If $f$ is twice differentiable at $x$, then a *sufficient condition* for
a critical point $x$ to be a local minimum is that the second derivative
is positive. In other words, $f^{\prime\prime}(x) > 0$. To show this,
consider the definition of the second derivative:
$$f^{\prime\prime}(x) = \lim_{h\rightarrow 0}(f^\prime(x+h)-f^\prime(x))/h$$
Since $x$ is a critical point, $f^\prime(x)=0$, so
$$f^{\prime\prime}(x) = \lim_{h\rightarrow 0}f^\prime(x+h)/h$$

If $f^{\prime\prime}(x) > 0$, then we can find a $\delta$ small enough such that for all $|h|\leq \delta$, $f^\prime(x+h)/h > 0$. Hence, if $h > 0$, then $f^\prime(x+h) > 0$, which means that $f^\prime$ is positive to the right of $x$. If $h < 0$, then $f^\prime(x+h) < 0$, which means that $f^\prime$ is negative to the left of $x$. This means that on the range $(x,x+\delta)$, $f$ is monotonically increasing, and $(x-\delta,x)$, $f$ is monotonically decreasing. Therefore, on the range $(x-\delta,x+\delta)$, $x$ minimizes $f$.

In many cases, however, $f$ can be very complex, so it may be difficult or impossible to find a closed form expression. Or, $f$ may be given as a black-box computer subroutine, and so a mathematical expression may not be available. In these cases we must resort to numerical (iterative) techniques in order to find a minimum.

The first multivariate optimization technique we will examine is one of the simplest: gradient descent (also known as steepest descent). Gradient descent is an iterative method that is given an initial point, and follows the negative of the gradient in order to move the point toward a critical point, which is hopefully the desired local minimum. Again we are concerned with only local optimization, because global optimization is computationally challenging.

Gradient descent is popular for very large-scale optimization problems because it is easy to implement, can handle "black box" functions, and each iteration is cheap. Its major disadvantage is that it can take a long time to converge. We will also describe a discrete descent method often used to solve large combinatorial problems.

Given a differentiable scalar field $f(\mathbf{x})$ and an initial guess $\mathbf{x}_1$, gradient descent iteratively moves the guess toward lower values of $f$ by taking steps in the direction of the negative gradient $-\nabla f(\mathbf{x})$. Locally, the negated gradient is the steepest descent direction, i.e., the direction that $\mathbf{x}$ would need to move in order to decrease $f$ the fastest. The algorithm typically converges to a local minimum, but may rarely reach a saddle point, or not move at all if $\mathbf{x}_{1}$ lies at a local maximum.

The first order Taylor approximation of $f(\mathbf{x})$ about $f(\mathbf{x}_{1})$ is $$f(\mathbf{x}) = f(\mathbf{x}_{1}) + \nabla f(\mathbf{x}_{1}) \cdot (\mathbf{x}-\mathbf{x}_{1}) + O(||\mathbf{x}-\mathbf{x}_{1}||^2).$$ Consider moving from $\mathbf{x}_{1}$ a small amount $h$ in a unit direction $\mathbf{u}$. We want to find the $\mathbf{u}$ that minimizes $f(\mathbf{x}_{1} + h\mathbf{u})$. Using the Taylor expansion, we see that $$f(\mathbf{x}_{1} + h\mathbf{u}) - f(\mathbf{x}_{1}) = h \nabla f(\mathbf{x}_{1}) \cdot \mathbf{u} + h^2 O(1).$$ If we make the $h^2$ term insignificant by shrinking $h$, we see that in order to decrease $f(\mathbf{x}_{1} + h\mathbf{u}) - f(\mathbf{x}_{1})$ the fastest we must minimize $\nabla f(\mathbf{x}_{1}) \cdot \mathbf{u}$. The unit vector that minimizes $\nabla f(\mathbf{x}_{1}) \cdot \mathbf{u}$ is $\mathbf{u} = -\nabla f(\mathbf{x}_{1}) / || \nabla f(\mathbf{x}_{1}) ||$ as desired.

The algorithm is initialized with a guess $\mathbf{x}_{1}$, a maximum iteration count $N_{max}$, a gradient norm tolerance $\epsilon_g$ that is used to determine whether the algorithm has arrived at a critical point, and a step tolerance $\epsilon_x$ to determine whether significant progress is being made. It proceeds as follows:

**for**$t=1,2,\ldots,N_{max}$$\mathbf{x}_{t+1} \gets \mathbf{x}_{t} - \alpha_t \nabla f(\mathbf{x}_{t})$

**if**$\|\nabla f(\mathbf{x}_{t})\| < \epsilon_g$**return**"Converged on critical point"**if**$\|\mathbf{x}_{t+1}-\mathbf{x}_{t}\| < \epsilon_x$**return**"Converged on an $x$ value"**if**$f(\mathbf{x}_{t+1}) > f(\mathbf{x}_{t})$**return**"Diverging"**return**"Maximum number of iterations reached"

The variable $\alpha_t$ is known as the *step size*, and should be
chosen to maintain a balance between convergence speed and avoiding
divergence. Note that $\alpha_t$ may depend on the step $t$.

To a first-order approximation, each step decreases the value of $f$ by approximately $\alpha_t ||\nabla f(\mathbf{x}_{0})||^2$. If $\alpha_t$ is too small, then the algorithm will converge very slowly. On the other hand, if the step size $\alpha_t$ is not chosen small enough, then the algorithm may fail to reduce the value of $f$. (Because the first order approximation is valid only locally.)

One approach is to *adapt* the step size $\alpha_t$ in order to achieve
a reduction in $f$ while still making sufficiently fast progress. This
procedure is known as a *line search* and is employed in many
multivariate optimization algorithms. The input to a line search is a
function $f$, an initial point $\mathbf{x}$ and direction $\mathbf{d}$, initial
candidate step size $\gamma_1$. The output is a step size $\gamma > 0$
such that $f(\mathbf{x}+\gamma \mathbf{d}) < f(\mathbf{x})$, or failure if the step
size falls below some threshold.

The first step is to check whether $\mathbf{x} + \gamma_1 \mathbf{d}$ decreases the value of $f$. If so, then we can either terminate the line search, or search for a larger valid $\gamma$ by repeated doubling: $\gamma_n = \gamma_1 2^n$. If a step of size $\gamma_1$ increases $f$, then we search through repeated halving: $\gamma_n = \frac{\gamma_1}{2^n}$. This continues until a reduction in $f$ is found, or minimum step size tolerance is reached.

You might be thinking, "wouldn't it make sense for line search to find
the *optimal* step size rather than any one that reduces the function
value?" It turns out that in practice it is usually not worthwhile to do
extra work to obtain an optimal line search because moving along the
current line typically doesn't get you much closer to the local minimum.
More often it is a better use of time to take a step and get a new
gradient direction at the next point. Nevertheless it is sometimes
useful during analysis to assume that optimal line searches are being
performed.

It can be proven, through a relatively technical proof, that gradient descent with an optimal line search obtains a linear convergence. However, there are cases in which the convergence constant is very bad, that is, the error ratio $E_{t+1} / E_{t}$ is close to 1. This occurs when the eigenvalues of $f$'s second derivative matrix, known as the Hessian matrix, is poorly "scaled" at the minimum. Here we will provide some intuition for the causes of slow convergence.

Consider a very simple quadratic function in a two dimensional Euclidean space: $f(x_1,x_2) = a x_1^2 + b x_2^2$, with positive constants $a$ and $b$. The gradient of $f$ is $(2 a x_1, 2 b x_2)$, and of course the minimum of $f$ is at the origin.

If $a=b$, then $f$ increases isotropically from the origin in every direction (Figure 1). Another way to think of this is that the level sets of $f$ are circles. At a point $(x_1,x_2)$, observe that $-\nabla f(x_1,x_2)$ points directly toward the origin. Hence, gradient descent moves any initial point directly toward the origin, and it will perform well.

Now consider the case where $b >> a$, say $b=100a$. Now the level sets
of $f$ are ellipses that are thinner in the $x_1$ direction (Figure 2).
At a point $(x_1,x_2)$, the descent direction
$-\nabla f(x_1,x_2) = (-2 a x_1, -2 b x_2)$ is steeper along the $x_2$
axis than the $x_1$ axis. So, the line search will not be directed
toward the origin. This gives gradient descent a characteristic
zig-zagging trajectory that takes longer to converge to the minimum. In
general, slow convergence holds when the "bowl" around the local minimum
is much thinner in one direction than another. Problems like this are
said to have an *ill-conditioned* Hessian matrix. We will make this more
precise in future lectures.

Gradient descent can be generalized to spaces that involve a discrete
component. The method of *steepest descent* is the discrete analogue of
gradient descent, but the best move is computed using a local
minimization rather rather than computing a gradient. It is typically
able to converge in few steps but it is unable to escape local minima or
plateaus in the objective function.

A discrete search problem is defined by a discrete state space $S$ and a set of edges $E \subseteq S \times S$. This induces a graph $G=(S,E)$ that must be searched in order to find a state that minimizes a given function $f(s)$. A steepest descent search begins at a state $s_0$ and takes steps on $G$ that descend $f(s)$ with the maximum rate of descent.

The algorithm repeatedly computes each transition $s\rightarrow s^\prime$ by performing the local minimization $\arg \min_{s\rightarrow s^\prime \in E} f(s^\prime)$, and terminates when $f(s^\prime) \geq f(s)$. This approach is typically much faster than an exhaustive search if $S$ is large or possibly infinite but the degree of $G$ is small.

*Hill climbing* is a similar approach to steepest descent that is used
for large discrete problems in which the state space is combinatorial.
(Although the name is hill *climbing* the approach can be applied to
either minimization or maximization.) If the state is composed of
several attributes $s=(s_1,\ldots,s_n)$, an optimization can be
formulated over the composite state space
$S = S_1 \times \cdots \times S_n$. In each iteration, each of the
attributes $s_i$ is locally optimized in $S_i$ using a single step of
steepest descent. These sweeps continue until convergence or a
termination condition is reached.

Newton's method is a *higher-order* optimization method that admits faster convergence than gradient descent. The idea is that at a minimum of $f$, the derivative is zero. So, we should seek points $x$ such that $f^\prime(x) = 0$.

Given a current iterate $x_t$, we wish to seek a step $\Delta x_t$ such that $f^\prime(x_t + \Delta x_t)=0$. We can construct a first-order Taylor expansion of $f^\prime$ around $x_t$ to solve for such an approximate step. This approximation uses the second derivative:

$$f^\prime(x_t + \Delta x_t)\approx f^\prime(x_t) + f^{\prime\prime}(x_t) \cdot \Delta x_t$$By setting the right hand side to 0 and solving for $\Delta x_t$, we get the expression $\Delta x_t \approx -f^\prime(x_t) / f^{\prime\prime}(x_t)$. Newton's method uses this as the step direction.

In $n$-D Newton's method, we simply replace the 1-D Taylor expansion of $f^\prime$ with the $n-D$ Taylor expansion of the gradient $\nabla f$:

$$\nabla f(\mathbf{x}_t + \Delta \mathbf{x}_t)\approx \nabla f(\mathbf{x}_t) + \nabla^2 f(\mathbf{x}_t) \cdot \Delta \mathbf{x}_t.$$Letting $H_t = \nabla^2 f(\mathbf{x}_t)$ be the Hessian matrix and $g_t = \nabla f(\mathbf{x}_t)$ the gradient at the current iterate, the solution is given by $\Delta \mathbf{x}_t = -H_t^{-1} g_t$. This is then used as the step direction to determine the next iterate.

Note that by seeking a critical point, the method does not include any information about whether the direction is taking $x_t$ toward a minimum, maximum, or inflection point. Hence, Newton's method is often combined with a line search to prevent divergence. A complete implementation is given below:

**for**$t=1,2,\ldots,N_{max}$$H_t \gets \nabla^2 f(\mathbf{x}_{t})$, $g_t \gets \nabla f(\mathbf{x}_{t}).$

$\Delta \mathbf{x}_{t} \gets -H_t^{-1} g_t$

$\mathbf{x}_{t+1} \gets \mathbf{x}_{t} + \alpha_t \Delta \mathbf{x}_t $ (with $\alpha_t$ determined via line search)

**if**$\|\nabla f(\mathbf{x}_{t})\| < \epsilon_g$**return**"Converged on critical point"**if**$\|\mathbf{x}_{t+1}-\mathbf{x}_{t}\| < \epsilon_x$**return**"Converged on an $x$ value"**if**$f(\mathbf{x}_{t+1}) > f(\mathbf{x}_{t})$**return**"Diverging"**return**"Maximum number of iterations reached"

TODO: Describe newton's method and quasi newton methods.

In the general constrained optimization (minimization) problem, we have a statement of the form

$$\begin{gathered} min_x f(x) \text{ such that} \\ g(x) \geq 0 \\ h(x) = 0. \end{gathered}$$Here, all elements except for $f$ are in general vector-valued: $x \in \mathbb{R}^n$, $g: \mathbb{R}^n \rightarrow \mathbb{R}^m$, and $h: \mathbb{R}^n \rightarrow \mathbb{R}^p$. Analysis and methods for solving these types of problems are considerably more complex than unconstrained optimization techniques. So, it is very convenient to use software toolboxes that allow constrained problems to be solved as a "black box", where the user needs only a minimum of understanding of the underlying methods in order to obtain satisfactory results. That said, algorithmic breakthroughs are often made by understanding constrained optimization more deeply (e.g., support vector machines in machine learning, time scaling methods in robot trajectory optimization).

Extreme Value Theorem. The Extreme Value Theorem states that a continuous function $f$ does indeed attain a minimum and maximum over a closed interval $[a,b]$ at some points $c$ and $d$ in $[a,b]$, respectively.

Although it may seem like a trivial statement, it is possible to construct examples where the minimum and maximum cannot be attained over an interval that is not closed, or not bounded, or when the function is discontinuous.

In the case where the minimum $c$ is in the interior of the interval $(a,b)$, and the derivative of $f$ exists, then $c$ is a critical point, i.e., $f^\prime(c)=0$. If it is not in the interior, then either $c=a$ or $c=b$. This gives rise to an "algorithm" for finding the minimum on an interval:

- Determine all critical points of $f$ in the open interval $(a,b)$. Let these be called $x_1,\ldots,x_k$.
- Return $min f(a), f(b), f(x_1), \ldots, f(x_k)$.

Let us now consider the multi-dimensional constrained optimization case with only equality constraints, i.e., $min_x f(x)$ such that $h(x)=0$. The minimum may not be a critical point of $f$, because most critical points are very unlikely to satisfy $h(x)=0$. If we are to consider the set of points (a manifold) that satisfies $h(x)=0$, the conditions for optimality state that we cannot move *along this manifold* without reducing the value of $f(x)$. This is made more precisely with the concept of *Lagrange multipliers*.

More precisely, the Lagrange multipliers give us a way to define a critical point in an equality-constrained setting, which is a necessary condition for a point to be optimal. Specifically, we define a vector of Lagrange multipliers $\lambda \in \mathbb{R}^p$. The condition states if $x^\star$ is a minimizer of the equality-constrained optimization, then there exists some $\lambda^\star$ satisfying:

$$\nabla f(x^\star) + \lambda^{\star T} \nabla h(x^\star) = 0 $$This, along with the condition $h(x^\star) == 0$, gives a way to characterize all critical points on the manifold.

It is clearer to understand what this means in the case of a single constraint $p=1$. Then $\lambda$ is a scalar, and $\nabla h(x^\star)$ is a gradient vector perpendicular to the surface $h(x)=0$ at $x^\star$. The Lagrange condition becomes

$$\nabla f(x^\star) + \lambda^\star \nabla h(x^\star) = 0.$$Because $\lambda^\star$ can be arbitrary, this means that at $x^\star$, $\nabla f$ and $\nabla h$ are *aligned*, Moreover, $\nabla f \neq 0$. It does not matter how the gradients are scaled with respect to one another, since $\lambda^\star$ can vary freely. In other words, the direction in which $x$ is "pulled" by the gradient of $f$ is canceled out because it cannot move along the surface $h(x)=0$.

The multiple-constraint case is simply an analogue of this, to account for the multiple constraint gradient directions being able to combine, in arbitrary linear combinations, to cancel out the "pull" of $\nabla f$.

For problems with both inequality and equality constraints, the Lagrange multiplier conditions can be extended to form the Karush-Kuhn-Tucker (KKT) conditions. Let us first consider the case where only inequality constraints are present. The set of $x$ satisfying $g(x) \leq 0$ is called the *feasible* set, and we wish to characterize the minimum of $f$ and the minimizing point $x^\star$ over this set.

The KKT conditions are in some sense an analogue of the Extreme Value Theorem, and state that an optimum must be either a critical point in the interior of the feasible set, or it must be on the boundary of the feasible region. Now, if we knew on which boundary of the feasible region it lies, then the Lagrange multiplier conditions become "active" along that boundary and determine the set of critical points.

Let us again examine the case where $g(x)$ is a single scalar, i.e., $m=1$. There are two options for $x$ to be a critical point. Either:

- $\nabla f(x) = 0$ in an interior point ($g(x) > 0$), or
- $x$ is a boundary point ($g(x) = 0$), in which case the Lagrange multiplier conditions $\nabla f(x) - \mu \nabla g(x) = 0$ for some $\mu$. Note that we are using a different symbol for the Lagrange multiplier. If we want the point to be a minimizer, we also include a minus sign in the Langrangian derivative, and require that $\mu > 0$ needs to hold. Otherwise, $f$ can be reduced by moving toward the interior of the feasible set.

This set of options could be analyzed one-by-one to characterize all critical points. But as we add more dimensions to $g$, the options become unwieldy. Consider the $m=2$ case, $g(x) = [g_1(x),g_2(x)]$. Either

- $\nabla f(x) = 0$ in an interior point ($g_1(x) > 0$ and $g_2(x) > 0$), or
- $x$ is a boundary point of $g_1$ but interior to $g_2$ ($g_1(x) = 0$, $g_2(x) > 0$), in which case the Lagrange multiplier conditions $\nabla f(x) + \mu_1 \nabla g_1(x) = 0$ hold for some $\mu_1$.
- $x$ is a boundary point of $g_2$ but interior to $g_1$ ($g_2(x) = 0$, $g_1(x) > 0$), in which case the Lagrange multiplier conditions $\nabla f(x) + \mu_2 \nabla g_2(x) = 0$ hold for some $\mu_2$.
- $x$ is a boundary point of both $g_1$ and $g_2$ ($g_1(x) = 0$, $g_2(x) = 0$), in which case the Lagrange multiplier conditions $\nabla f(x) + \mu_1 \nabla g_1(x) + \mu_2 \nabla g_2(x) = 0$ hold for some $\mu_1, \mu_2$.

In general, for $m$ constraints there are $2^m$ possible combinations of conditions. To compress this into a more digestible form, the KKT conditions write the conditions as follows:

$$\begin{gathered} \nabla f(x) - \mu^T \nabla g(x) = 0 \\ g(x) \geq 0 \\ \mu \geq 0 \\ \mu^T g(x) = 0 \end{gathered}$$The fourth condition is known as the *complementarity* condition, which means that if $g_i(x) > 0$, this forces $\mu_i = 0$. Conversely, if $\mu_i > 0$, then this forces $g_i(x) = 0$. In other words, the multiplier can only be active when the constraint is met exactly.

To complete the picture, we write the full KKT conditions for both inequality and equality constraints.

$$ \begin{gathered} \nabla f(x) - \mu^T \nabla g(x) + \lambda^T \nabla h(x)= 0 \\ g(x) \geq 0 \\ h(x) = 0 \\ \mu \geq 0 \\ \mu^T g(x) = 0.\end{gathered} $$In linear programming (LP), we are interested in solving a optimization problem of the form

$$\begin{gathered} min_x c^T x \text{ such that} \\ Ax \geq b \\ Cx = d. \end{gathered}$$In quadratic programming (QP), we are interested in solving a problem of the form

$$\begin{gathered} min_x \frac{1}{2} x^T P x + q^T x + r \text{ such that} \\ Ax \geq b \\ Cx = d. \\ \end{gathered}$$In both cases, the two classes of techniques that are most often used are active-set methods and interior-point methods. Briefly speaking, active set methods iterate on a subset of inequality constraints that may be solved to equality. With only equality constraints active, the minimizing direction of a LP or QP can be computed efficiently. Interior point methods convert the inequality and equality constraints into *barrier functions* that penalize violation of constraints. These penalties are incorporated into the objective function, which is then addressed using unconstrained optimization techniques.

For QPs solvers to converge to a unique minimum, $P$ must either be positive semi-definite or, for some solvers, strictly positive definite. This is because for non-positive-definite matrices, the objective function is not convex, and there exist suboptimal local optima on the feasible set boundary.

Professional-grade LP and QP software is widely available, highly reliable, and scales to thousands and even millions of variables and constraints. Example software packages include GLPK, Gurobi, and CVXOPT. In order to get optimal scalability to large problems, the A, C, and P matrices ought to be representable using *sparse matrices* (that is, $m\times n$ matrices in which the number of nonzero entries is $O(m+n)$ rather than $O(mn)$), and a package that can exploit sparsity should be employed.

In nonlinear programming (NLP), the $f$, $g$, and $h$ functions may be nonlinear. Problems of this form can be solved to a local optimum using sequential quadratic programming (SQP) or interior-point methods. Like gardient descent, these techniques are given an initial guess $x_0$, and iteratively attempt to reduce the objective function. The challenge in nonlinear programming is that these algorithms must trade off reducing the objective function versus reducing constraint violations, and there is a vast literature on effective methods for doing so.

For the purposes of discussion, we introduce one of the simplest NLP approaches, the *penalty method*. The idea is to define an unconstrained optimization problem whose minimum is close to the minimum of the constrained problem. For simplicity, let us only assume that only equality constraints $h$ are present. We can then define a penalty-augmented objective function
$$f_\mu(x) = f(x) + \mu\|h(x)\|^2 $$
with $\mu > 0$ some parameter. With this objective, we can see that as $\mu$ grows large, the cost of $x$ violating the constraint $h(x)=0$ also grows larger. So, if we solve an optimization problem with $\mu \gg 1$, then the optimal point $x_\mu^\star$ should have a low constraint violation value. Moreover, as $\mu \rightarrow \infty$, $x_\mu^\star$ approaches the optimizer of the original NLP. However, if $\mu$ is large then the $\mu\|h(x)\|^2$ term dominates the $f$ term, so a gradient-based optimization will have a hard time making progress in reducing $f(x)$. A better idea is to start with a lower value of $\mu$, run an unconstrained optimizer for some number of steps, and then increase $\mu$. This then continues until $\mu$ is sufficiently large.

Note that due to nonlinearity, there is no guarantee of finding a global optimum, and moreover there is no guarantee that these methods will even find a point in the feasible set $ \{ x \,|\, g(x) \leq 0, h(x) = 0\}$. There are some *feasible* optimization techniques that will, if given an initial guess in the feasible set, keep all of the subsequent iterates in the feasible set. An example is the *log barrier* method. In this case, assume that only inequality constraints $g$ are present. We can define a barrier-augmented objective function
$$f_\mu(x) = f(x) + \mu\sum_{i=1}^m\log (-g_i(x))$$
that rises to infinity at the boundaries of the feasible set, and is undefined outside of the feasible set. In a manner similar to the penalty method, we can use unconstrained optimization to optimize the barrier-augmented function, and its minimum is guaranteed to be inside the feasible set. As we then shrink $\mu$ toward 0, the effect of the log-barrier decreases, and the optimal value $x_\mu^\star$ approaches the optimum of the original NLP.

Neither the penalty method nor the log barrier method are extremely reliable or efficient in themselves, but variants do form the basis for more reliable solvers such as interior point and augmented Lagrangian methods.
Robust software for solving NLPs are available in SNOPT, IPOPT, Matlab's `fmincon`

function, and Scipy's minimize function. For best results, it is important to provide not only the functions $f$, $g$, and $h$ but also functions to evaluate their gradients (specifically, Jacobian matrices in the case of $g$ and $h$). If gradient functions are not available, the optimizer will fall back to slower finite difference approximations.

Although LP and QP problems are subsumed by the class of NLP problems, general NLP solvers are usually not as fast or reliable as special-purpose solvers. Nevertheless, modern NLP solvers can solve problems with hundreds or thousands of variables and constraints within seconds.

Because gradient-based and NLP solvers are only *local methods* that use local information to help seek local optima, there is no guarantee that they will find a global optimum. (An exception is the class of convex problems, of which LP and QP are common examples.) Hence, to get a suitable solution, they must be provided with good initial guesses that are relatively close to an optimum. But for problems in which we wish to find a global optimum and do not have sufficient prior information about the location of the optimum, global optimization techniques must be used.

The global minimum is said to have a *basin of attraction* of initial points for which a local optimization method will converge to the minimum. Hard optimization problems are often riddled with local minima, and the basin of attraction will be small. But for easier problems, *random restarts* is a simple method for improving the chance of finding a global minimum in a probabilistic sense.

The idea is simple: randomly sample $N$ initial points, and perform a local optimization starting from each initial point. Then, pick the locally optimized point with the best $f$ value. If the basin of attraction of the global optimum is large, then running a local optimizer multiple times from random seeds has a high chance of finding at least one result near the global optimum. Quantifying the probability that random restarts find a global minimum is left as an exercise.

Another class of approaches exhaustively tests points in the optimization domain $D$. Typically, $D$ is taken to be a box in $n$-D space. *Brute-force sampling* generates a set of points in a grid or via random sampling, and then evaluates the objective function. For settings in which the objective function is complex and it is not necessary to have a highly refined objective function value, this may be an effective strategy. (It's certainly the easiest thing to try!)

More sophisticated methods will use information from neighboring points, with the underlying assumption that the function is smooth in some sense, to avoid testing areas of the optimization domain that have no chance to contain an optimum. One example is the DIRECT unconstrained optimization algorithm, which subdivides the optimization domain into progressively smaller rectangles. It stops considering rectangles if it is know that they do not possibly contain an optimal value. It does so using a *Lipschitz condition*
$$\|f(x) - f(y)\| \leq K\|x-y\|,$$
which states that the function values are relatively consistent across neighborhoods. If such a condition holds, we can determine that if we have evaluated a point $x$ and obtained a suboptimal value that is worse than the best found so far ($f(x) \gg f_{best}$), then we do not have to examine a region around $x$ because it is clearly suboptimal. Specifically, for any point $y$ within a radius $r=(f(x) - f_{best})/K$ of $x$, there is no chance that $f(y)$ is less than $f_{best}$.

Exhaustive methods are very straightforward to use, and can accept challenging "black-box" objective functions that are nonsmooth. However, they suffer from the curse of dimensionality and tend to degrade quickly in performance in problems of higher dimension (even a 4- or 5-D optimization problem can take hours).

Branch-and-bound (BNB) methods are global methods that use an auxiliary *bounding function*, custom-defined for a given objective function, to aggressively prune out regions of the space in which there is no chance for an optimum to exist. (The DIRECT algorithm is an example of an BNB method, but one with a relatively simple and loose bound.)

The idea is that along with the function $f$ and the optimization domain $D$, for some problems it is possible to derive a bounding function $f_{min}(S)$ that computes a lower bound on the value $f(x)$ over all of the subset $S\subseteq D$. That is:

$$f_{min}(S) \leq \min_{x \in S} f(x).$$Armed with this bound, we can start to address the problem of whether subsets of $D$ might provide better values than the best value $f_{best}$ found so far. If a subset $S$ under consideration has a higher lower bound than $f_{best}$, that is, $f_{min}(S) \geq f_{best}$, then there is no chance for a point $x\in S$ to be the optimum. Hence, $S$ can be pruned from consideration. It is preferred that the bound should be as tight as possible so that the algorithm can prune as much of the domain as possible.

Although BNB methods could be defined over many domains and subset shapes, in the case of unconstrained optimization it is typical to consider the shape of each subset $S$ to be boxes $S = [a_1,b_1]\times \cdots \times [a_n,b_n]$. A BNB method would then proceed as follows:

*Initialization*: Initialize a queue, starting with the initial domain $D$. Set $f_{best}\gets \infty$.*Evaluate*: Pick a subset $S$ from the queue and evaluate $f(x)$ for some $x \in S$. Maintain the best function value found so far $f_{best} = \min (f_{best}, f(x))$.*Branch*: Subdivide $S$ into subsets. For example, we could break $S$ along its longest edge into boxes $S_1$ and $S_2$.*Bound*: Apply the bound to each subset. If $f_{min}(S_i) \geq f_{best}$, prune it for consideration. Otherwise, add $S_i$ to the queue, and repeat from step 2.

This will then proceed until the subsets are sufficiently small, or some maximum number of iterations is reached. Typically, subsets will be sorted in the queue in order of increasing $f_{min}$.

Now, it remains to be asked. Where do bounds come from? In some cases, this can be done through careful mathematical analysis. In other cases, it may be possible to use an *interval arithmetic* approach. The idea of interval arithmetic is that for functions $f(x)$ composed of a variety of basic arithmetic and trigonometric operations, we can perform those same operations on a set of intervals rather than single points. Specifically, if $x_1 \in [a_1,b_1]$ and $x_2 \in [a_2,b_2]$, then we can determine various elementary rules like

- $x_1 + x_2 \in [a_1 + a_2, b_1 + b_2]$
- $x_1 - x_2 \in [a_1 - b_2, b_1 - a_2]$
- $x_1 * x_2 \in [\min(a_1 a_2,a_1 b_2,b_1 a_2,b_1 b_2), \max(a_1 a_2,a_1 b_2,b_1 a_2,b_1 b_2)]$
- $x_1^2 \in [a_1^2,b_1^2]$ if $a_1 \geq 0$, $[b_1^2,a_1^2]$ if $b_1 \leq 0$, and $[0,\max(a_1^2,b_1^2)]$ otherwise.
- etc...

By replacing each of the standard floating-point operations in $f(x_1,\ldots,x_n)$ with their corresponding interval arithmetic operations, we can determine an interval $[c,d]$ such that $f([a_1,b_1],\ldots,[a_n,b_n]) \in [c,d]$. Software packages that implement robust interval arithmetic are available for various languages.

It should be noted, however, that an implementer must take some care to write his/her operations such that interval arithmetic bounds are tight. For example, $x_1^2$ is much tighter than $x_1 * x_1$ if the interval $[a_1,b_1]$ contains 0. Also, divisions are tricky because loose bounds in the denominator can cause the output interval to be extremely loose. In an extreme case, dividing by an interval that contains 0 yields an unbounded interval!

TODO:

Simulated annealing, evolutionary algorithms, Bayesian optimization.

In [ ]:

```
```