# Gradients and Hessians in Hilbert spaces

Jordan Bell
July 26, 2015

## 1 Gradients

Let $(X,\left\langle\cdot,\cdot\right\rangle)$ be a real Hilbert space. The Riesz representation theorem says that the mapping

 $\Phi(x)(y)=\left\langle y,x\right\rangle,\qquad\Phi:X\to X^{*},$

is an isometric isomorphism. Let $U$ be a nonempty open subset of $X$ and let $f:U\to\mathbb{R}$ be differentiable, with derivative $f^{\prime}:U\to\mathscr{L}(X;\mathbb{R})=X^{*}$. The gradient of $f$ is the function $\mathrm{grad}\,f:U\to X$ defined by

 $\mathrm{grad}\,f=\Phi^{-1}\circ f^{\prime}.$

Thus, for $x\in U$, $\mathrm{grad}\,f(x)$ is the unique element of $X$ satisfying

 $\left\langle\mathrm{grad}\,f(x),y\right\rangle=f^{\prime}(x)(y),\qquad y\in X.$ (1)

Because $\Phi^{-1}:X^{*}\to X$ is continuous, if $f\in C^{1}(U;\mathbb{R})$ then $\mathrm{grad}\,f\in C(U;X)$, being a composition of two continuous functions.

For example, let $T$ be a bounded self-adjoint operator on $X$ and define $f:X\to\mathbb{R}$ by

 $f(x)=\frac{1}{2}\left\langle Tx,x\right\rangle,\qquad x\in X.$

For $x,h\in X$,

 $f(x+h)-f(x)=\frac{1}{2}\left\langle Tx,h\right\rangle+\frac{1}{2}\left\langle Th,x\right\rangle+\frac{1}{2}\left\langle Th,h\right\rangle=\left\langle Tx,h\right\rangle+\frac{1}{2}\left\langle Th,h\right\rangle.$

Thus

 $|f(x+h)-f(x)-\left\langle Tx,h\right\rangle|=\frac{1}{2}|\left\langle Th,h\right\rangle|\leq\frac{1}{2}\left\|T\right\|\left\|h\right\|^{2}=o(\left\|h\right\|),$

which shows that $f$ is differentiable at $x$, with $f^{\prime}(x)(y)=\left\langle Tx,y\right\rangle$. Thus by (1), $\mathrm{grad}\,f(x)=Tx$.
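As a sanity check, here is a minimal finite-dimensional sketch (taking $X=\mathbb{R}^{n}$ with the dot product; the code and all names are illustrative, not from the text) comparing $Tx$ against a central finite-difference approximation of $f^{\prime}(x)$:

```python
import numpy as np

# For f(x) = (1/2)<T x, x> with T self-adjoint, the text shows
# grad f(x) = T x.  Compare against central finite differences.
rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
T = (A + A.T) / 2                      # symmetric, hence self-adjoint on R^n

def f(x):
    return 0.5 * x @ T @ x

x = rng.standard_normal(n)
grad_exact = T @ x

eps = 1e-6
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(n)])
err = np.max(np.abs(grad_fd - grad_exact))
```

Since $f$ is quadratic, the central difference is exact up to rounding, so `err` is at machine-precision scale.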

For example, let $T\in\mathscr{L}(X;X)$, let $h\in X$, and define $f:X\to\mathbb{R}$ by

 $f(x)=\frac{1}{2}\left\|Tx-h\right\|^{2},\qquad x\in X.$

We calculate that

 $\mathrm{grad}\,f(x)=T^{*}Tx-T^{*}h,\qquad x\in X.$

For $x_{0}\in X$, define

 $\phi(t)=\exp(-tT^{*}T)x_{0}+\int_{0}^{t}\exp(-(t-s)T^{*}T)T^{*}h\,ds,\qquad t\geq 0.$

It is proved (cf. J.W. Neuberger, A Sequence of Problems on Semigroups, p. 51, Problem 195) that $\phi$ satisfies

 $\phi^{\prime}(t)=-(\mathrm{grad}\,f)(\phi(t)),\qquad\phi(0)=x_{0}.$
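A finite-dimensional sketch can verify this gradient flow numerically. The following is illustrative only (names hypothetical), assuming $M=T^{*}T$ is invertible so that the integral evaluates in closed form to $M^{-1}(I-\exp(-tM))T^{*}h$:

```python
import numpy as np

# Check that phi(t) = exp(-t M) x0 + \int_0^t exp(-(t-s) M) T^* h ds,
# with M = T^* T, satisfies phi'(t) = -(M phi(t) - T^* h) = -grad f(phi(t)).
rng = np.random.default_rng(1)
n = 4
T = rng.standard_normal((n, n))
h = rng.standard_normal(n)
x0 = rng.standard_normal(n)
M = T.T @ T                                  # symmetric positive definite a.s.

w, V = np.linalg.eigh(M)
expmM = lambda t: V @ np.diag(np.exp(-t * w)) @ V.T   # exp(-t M)

def phi(t):
    integral = np.linalg.solve(M, (np.eye(n) - expmM(t)) @ (T.T @ h))
    return expmM(t) @ x0 + integral

t, dt = 0.7, 1e-6
phi_dot = (phi(t + dt) - phi(t - dt)) / (2 * dt)      # finite-difference phi'
grad_f = M @ phi(t) - T.T @ h                         # grad f at phi(t)
err = np.max(np.abs(phi_dot + grad_f))
```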

For a function $F:X\to X$, we say that $F$ is $L$ Lipschitz if

 $\left\|F(x)-F(y)\right\|\leq L\left\|x-y\right\|,\qquad x,y\in X.$

The following is a useful inequality for functions whose gradients are Lipschitz (Juan Peypouquet, Convex Optimization in Normed Spaces: Theory, Methods and Examples, p. 15, Lemma 1.30).

###### Lemma 1.

If $f:X\to\mathbb{R}$ is differentiable and $\mathrm{grad}\,f:X\to X$ is $L$ Lipschitz, then

 $f(y)\leq f(x)+\left\langle\mathrm{grad}\,f(x),y-x\right\rangle+\frac{L}{2}\left\|y-x\right\|^{2},\qquad x,y\in X.$
###### Proof.

Let $h=y-x$ and define $g:[0,1]\to\mathbb{R}$ by $g(t)=f(x+th)$. By the chain rule, for $0<t<1$,

 $g^{\prime}(t)=f^{\prime}(x+th)(h)=\left\langle\mathrm{grad}\,f(x+th),h\right\rangle.$

Thus by the fundamental theorem of calculus,

 $\int_{0}^{1}\left\langle\mathrm{grad}\,f(x+th),h\right\rangle dt=\int_{0}^{1}g% ^{\prime}(t)dt=g(1)-g(0)=f(x+h)-f(x)=f(y)-f(x),$

and so, using the Cauchy-Schwarz inequality and the fact that $\mathrm{grad}\,f$ is $L$ Lipschitz,

 $\displaystyle f(y)-f(x)$ $\displaystyle=\int_{0}^{1}\left\langle\mathrm{grad}\,f(x+th)-\mathrm{grad}\,f(x)+\mathrm{grad}\,f(x),h\right\rangle dt$ $\displaystyle=\left\langle\mathrm{grad}\,f(x),h\right\rangle+\int_{0}^{1}\left\langle\mathrm{grad}\,f(x+th)-\mathrm{grad}\,f(x),h\right\rangle dt$ $\displaystyle\leq\left\langle\mathrm{grad}\,f(x),h\right\rangle+\int_{0}^{1}\left\|\mathrm{grad}\,f(x+th)-\mathrm{grad}\,f(x)\right\|\left\|h\right\|dt$ $\displaystyle\leq\left\langle\mathrm{grad}\,f(x),h\right\rangle+\int_{0}^{1}L\left\|th\right\|\left\|h\right\|dt$ $\displaystyle=\left\langle\mathrm{grad}\,f(x),y-x\right\rangle+\frac{L}{2}\left\|y-x\right\|^{2},$

proving the claim. ∎
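For a quadratic $f(x)=\frac{1}{2}\left\langle Tx,x\right\rangle$ the gradient $x\mapsto Tx$ is $L$ Lipschitz with $L=\left\|T\right\|$, so Lemma 1 can be tested directly. A hypothetical finite-dimensional sketch:

```python
import numpy as np

# Check the descent inequality of Lemma 1 for f(x) = (1/2)<T x, x>,
# whose gradient x -> T x is L-Lipschitz with L = ||T|| (spectral norm).
rng = np.random.default_rng(2)
n = 6
A = rng.standard_normal((n, n))
T = (A + A.T) / 2
L = np.linalg.norm(T, 2)

f = lambda x: 0.5 * x @ T @ x
grad = lambda x: T @ x

violations = 0
for _ in range(100):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    rhs = f(x) + grad(x) @ (y - x) + 0.5 * L * np.dot(y - x, y - x)
    if f(y) > rhs + 1e-12:      # small tolerance for rounding
        violations += 1
```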

## 2 Hessians

Let $U$ be a nonempty open subset of $X$. We prove that if a function is $C^{2}$ then its gradient is $C^{1}$ (Rodney Coleman, Calculus on Normed Vector Spaces, p. 139, Theorem 6.5).

###### Theorem 2.

Let $U$ be an open subset of $X$. If $f\in C^{2}(U;\mathbb{R})$, then $\mathrm{grad}\,f\in C^{1}(U;X)$, and

 $f^{\prime\prime}(x)(u)(v)=\left\langle v,(\mathrm{grad}\,f)^{\prime}(x)(u)\right\rangle,\qquad x\in U,\quad u,v\in X.$ (2)
###### Proof.

That $f$ is $C^{2}$ means that $f^{\prime}:U\to X^{*}$ is $C^{1}$. That is, for each $x\in U$ there is $f^{\prime\prime}(x)\in\mathscr{L}(X;X^{*})$ such that

 $\left\|f^{\prime}(x+h)-f^{\prime}(x)-f^{\prime\prime}(x)(h)\right\|=o(\left\|h\right\|),$ (3)

as $h\to 0$, and the map $x\mapsto f^{\prime\prime}(x)$ is continuous $U\to\mathscr{L}(X;X^{*})$.

Let $x\in U$ and let $h\in X$. Define $\phi_{h}\in X^{*}$ by

 $\phi_{h}(v)=f^{\prime\prime}(x)(h)(v),\qquad v\in X.$

Define $\nu_{x}(h)=\Phi^{-1}(\phi_{h})\in X$, thus

 $f^{\prime\prime}(x)(h)(v)=\left\langle v,\nu_{x}(h)\right\rangle,\qquad v\in X.$

It is straightforward that $\nu_{x}$ is linear. Because $\Phi$ is an isometric isomorphism,

 $\left\|\nu_{x}(h)\right\|=\left\|\phi_{h}\right\|=\sup_{\left\|v\right\|\leq 1}|\phi_{h}(v)|=\sup_{\left\|v\right\|\leq 1}|f^{\prime\prime}(x)(h)(v)|\leq\left\|f^{\prime\prime}(x)\right\|\left\|h\right\|,$

where $(u,v)\mapsto f^{\prime\prime}(x)(u)(v)$ is a bilinear form, with

 $\left\|f^{\prime\prime}(x)\right\|=\sup_{\left\|u\right\|\leq 1,\left\|v\right\|\leq 1}|f^{\prime\prime}(x)(u)(v)|,$

showing that $\nu_{x}:X\to X$ is a bounded linear operator with $\left\|\nu_{x}\right\|\leq\left\|f^{\prime\prime}(x)\right\|$. For $h$ such that $x+h\in U$ and for $v\in X$,

 $\displaystyle(f^{\prime}(x+h)-f^{\prime}(x)-f^{\prime\prime}(x)(h))(v)$ $\displaystyle=\left\langle v,\mathrm{grad}\,f(x+h)-\mathrm{grad}\,f(x)-\nu_{x}(h)\right\rangle,$

so

 $\displaystyle\left\|f^{\prime}(x+h)-f^{\prime}(x)-f^{\prime\prime}(x)(h)\right\|$ $\displaystyle=\sup_{\left\|v\right\|\leq 1}|\left\langle v,\mathrm{grad}\,f(x+h)-\mathrm{grad}\,f(x)-\nu_{x}(h)\right\rangle|$ $\displaystyle=\left\|\mathrm{grad}\,f(x+h)-\mathrm{grad}\,f(x)-\nu_{x}(h)\right\|.$

Thus by (3),

 $\left\|\mathrm{grad}\,f(x+h)-\mathrm{grad}\,f(x)-\nu_{x}(h)\right\|=o(\left\|h\right\|)$

as $h\to 0$, and because $\nu_{x}\in\mathscr{L}(X;X)$, this means that $\mathrm{grad}\,f:U\to X$ is differentiable at $x$, with $(\mathrm{grad}\,f)^{\prime}(x)=\nu_{x}$. It remains to prove that $x\mapsto\nu_{x}$ is continuous $U\to\mathscr{L}(X;X)$, namely that $(\mathrm{grad}\,f)^{\prime}$ is continuous. For $x\in U$ and for $h$ with $x+h\in U$,

 $\displaystyle\left\|\nu_{x+h}-\nu_{x}\right\|$ $\displaystyle=\sup_{\left\|u\right\|\leq 1}\left\|\nu_{x+h}(u)-\nu_{x}(u)\right\|$ $\displaystyle=\sup_{\left\|u\right\|\leq 1}\sup_{\left\|v\right\|\leq 1}|\left\langle v,\nu_{x+h}(u)-\nu_{x}(u)\right\rangle|$ $\displaystyle=\sup_{\left\|u\right\|\leq 1}\sup_{\left\|v\right\|\leq 1}|f^{\prime\prime}(x+h)(u)(v)-f^{\prime\prime}(x)(u)(v)|$ $\displaystyle=\left\|f^{\prime\prime}(x+h)-f^{\prime\prime}(x)\right\|,$

and because $f^{\prime\prime}$ is continuous on $U$ we get that $x\mapsto\nu_{x}$ is continuous on $U$, completing the proof. ∎

If $f\in C^{2}(U;\mathbb{R})$, we proved in the above theorem that $\mathrm{grad}\,f\in C^{1}(U;X)$. We call the derivative of $\mathrm{grad}\,f$ the Hessian of $f$ (cf. R. A. Tapia, The differentiation and integration of nonlinear operators, pp. 45–101, in Nonlinear Functional Analysis and Applications (Louis B. Rall, ed.)),

 $\mathrm{Hess}\,f=(\mathrm{grad}\,f)^{\prime},\qquad U\to\mathscr{L}(X;X),$

and (2) then reads

 $f^{\prime\prime}(x)(u)(v)=\left\langle v,\mathrm{Hess}\,f(x)(u)\right\rangle,\qquad x\in U,\quad u,v\in X.$

Furthermore, it is a fact that if $f\in C^{2}(U;\mathbb{R})$, then for each $x\in U$, the bilinear form

 $(u,v)\mapsto f^{\prime\prime}(x)(u)(v)$

is symmetric (Serge Lang, Real and Functional Analysis, third ed., p. 344, Theorem 5.3). Thus, for $x\in U$ and $u,v\in X$,

 $\left\langle v,\mathrm{Hess}\,f(x)(u)\right\rangle=\left\langle u,\mathrm{Hess}\,f(x)(v)\right\rangle.$

Now, using that $\left\langle\cdot,\cdot\right\rangle$ is symmetric as $X$ is a real Hilbert space, $(\mathrm{Hess}\,f(x))^{*}\in\mathscr{L}(X;X)$ satisfies

 $\left\langle u,\mathrm{Hess}\,f(x)(v)\right\rangle=\left\langle(\mathrm{Hess}\,f(x))^{*}u,v\right\rangle=\left\langle v,(\mathrm{Hess}\,f(x))^{*}u\right\rangle,$

so

 $\left\langle v,\mathrm{Hess}\,f(x)(u)\right\rangle=\left\langle v,(\mathrm{Hess}\,f(x))^{*}u\right\rangle.$

Because this is true for all $v$ we have $\mathrm{Hess}\,f(x)(u)=(\mathrm{Hess}\,f(x))^{*}u$, and because this is true for all $u$ we have $\mathrm{Hess}\,f(x)=(\mathrm{Hess}\,f(x))^{*}$, i.e. $\mathrm{Hess}\,f(x)$ is self-adjoint.

###### Theorem 3.

If $U$ is an open subset of $X$ and $f\in C^{2}(U;\mathbb{R})$, then for each $x\in U$ it is the case that $\mathrm{Hess}\,f(x)\in\mathscr{L}(X;X)$ is self-adjoint.
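In finite dimensions, Theorem 3 says the Hessian matrix is symmetric. A hypothetical numerical sketch, using a finite-difference Hessian with different inner and outer step sizes so that the symmetry is not built into the difference formula itself:

```python
import numpy as np

# Finite-difference Hessian of a smooth f: R^n -> R; Theorem 3 predicts
# it is (numerically) a symmetric matrix.
rng = np.random.default_rng(3)
n = 4
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

def f(x):
    return np.log(1.0 + np.dot(A @ x, A @ x)) + np.sin(b @ x)

def grad_fd(x, d=1e-4):
    # central-difference gradient with inner step d
    return np.array([(f(x + d * e) - f(x - d * e)) / (2 * d) for e in np.eye(n)])

x = rng.standard_normal(n)
eps = 1e-3                              # outer step, deliberately != d
H = np.array([(grad_fd(x + eps * e) - grad_fd(x - eps * e)) / (2 * eps)
              for e in np.eye(n)])
asym = np.max(np.abs(H - H.T)) / (1.0 + np.max(np.abs(H)))
```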

## 3 Critical points

For an open set $U$ in $X$, for $k\geq 1$, and for $f\in C^{k+2}(U;\mathbb{R})$, we say that $x_{0}\in U$ is a critical point of $f$ if $f^{\prime}(x_{0})=0$. If $x_{0}$ is a critical point of $f$, we say that $x_{0}$ is a nondegenerate critical point of $f$ if $\mathrm{Hess}\,f(x_{0})\in\mathscr{L}(X;X)$ is invertible. The Morse-Palais lemma (Serge Lang, Differential and Riemannian Manifolds, p. 182, chapter VII, Theorem 5.1; Kung-ching Chang, Infinite Dimensional Morse Theory and Multiple Solution Problems, p. 33, Theorem 4.1; André Avez, Calcul différentiel, p. 87, §3; N. A. Bobylev, S. V. Emel’yanov, and S. K. Korovin, Geometrical Methods in Variational Problems, p. 360, Theorem 5.5.2; Hajime Urakawa, Calculus of Variations and Harmonic Maps, p. 87, chapter 3, §1, Theorem 1.10; Jean-Pierre Aubin and Ivar Ekeland, Applied Nonlinear Analysis, p. 52, Theorem 8; Melvyn S. Berger, Nonlinearity and Functional Analysis: Lectures on Nonlinear Problems in Mathematical Analysis, p. 355, Theorem 6.5.4) states that if $f\in C^{k+2}(U;\mathbb{R})$ with $k\geq 1$, $f(0)=0$, and $0$ is a nondegenerate critical point of $f$, then there is some open subset $V$ of $U$ with $0\in V$ and a $C^{k}$ diffeomorphism $\phi:V\to V$, $\phi(0)=0$, such that

 $f(x)=\frac{1}{2}\left\langle\mathrm{Hess}\,f(0)(\phi(x)),\phi(x)\right\rangle,\qquad x\in V.$

If $x$ is a critical point of a differentiable function $f:U\to\mathbb{R}$, we call $f(x)$ a critical value of $f$. If $k\geq n$ and $f\in C^{k}(\mathbb{R}^{n};\mathbb{R})$, Sard’s theorem tells us that the set of critical values of $f$ has Lebesgue measure $0$ and is meager.

For Banach spaces $Y$ and $Z$, a Fredholm operator (Martin Schechter, Principles of Functional Analysis, second ed., chapter 5) is a bounded linear operator $T:Y\to Z$ such that (i) $\alpha(T)=\dim\ker T<\infty$, (ii) $T(Y)$ is a closed subset of $Z$, and (iii) $\beta(T)=\dim\ker T^{*}<\infty$. The index of a Fredholm operator $T$ is

 $\mathrm{ind}\,T=\alpha(T)-\beta(T).$
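In finite dimensions every bounded linear map $T:\mathbb{R}^{n}\to\mathbb{R}^{m}$ is Fredholm: by rank-nullity, $\alpha(T)=n-\mathrm{rank}\,T$ and $\beta(T)=\dim\ker T^{*}=m-\mathrm{rank}\,T$, so $\mathrm{ind}\,T=n-m$ independently of $T$. A small illustrative sketch (names hypothetical):

```python
import numpy as np

# Compute alpha, beta, and the Fredholm index of a random T: R^5 -> R^3.
rng = np.random.default_rng(4)
n, m = 5, 3
T = rng.standard_normal((m, n))        # full rank m almost surely
r = np.linalg.matrix_rank(T)
alpha = n - r                          # dim ker T
beta = m - r                           # dim ker T^T  (= dim coker T)
ind = alpha - beta                     # equals n - m regardless of T
```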

For a differentiable function $f:U\to\mathbb{R}$, $U$ an open subset of $X$, and for $x\in U$, $f^{\prime}(x)\in\mathscr{L}(X;\mathbb{R})=X^{*}$. $f^{\prime}(x)$ is a Fredholm operator if and only if $\dim\ker f^{\prime}(x)<\infty$. For $U$ a connected open subset of $X$ and for $f\in C^{1}(U;\mathbb{R})$, we call $f$ a Fredholm map if $f^{\prime}(x)$ is a Fredholm operator for each $x\in U$. It is a fact that $\mathrm{ind}\,f^{\prime}(x)=\mathrm{ind}\,f^{\prime}(y)$ for all $x,y\in U$, using that $U$ is connected. We denote this common value by $\mathrm{ind}\,f$. A generalization of Sard’s theorem due to Smale tells us that if $X$ is separable, $U$ is a connected open subset of $X$, $f\in C^{k}(U;\mathbb{R})$ is a Fredholm map, and

 $k>\max\{\mathrm{ind}\,f,0\},$

then the set of critical values of $f$ is meager (Eberhard Zeidler, Nonlinear Functional Analysis and its Applications, IV: Applications to Mathematical Physics, p. 829, Theorem 78.A; Melvyn S. Berger, Nonlinearity and Functional Analysis: Lectures on Nonlinear Problems in Mathematical Analysis, p. 125, Theorem 3.1.45).

A function $f\in C^{1}(X;\mathbb{R})$ is said to satisfy the Palais-Smale condition if whenever $(u_{k})$ is a sequence in $X$ such that (i) $\{f(u_{k})\}$ is a bounded subset of $\mathbb{R}$ and (ii) $\mathrm{grad}\,f(u_{k})\to 0$, the set $\{u_{k}\}$ is a precompact subset of $X$: every subsequence of $(u_{k})$ has a Cauchy subsequence.

Often when speaking about ordinary differential equations in $\mathbb{R}^{d}$, we deal with differentiable functions whose derivatives are locally Lipschitz. $\mathbb{R}^{d}$ has the Heine-Borel property: a subset $K$ of $\mathbb{R}^{d}$ is compact if and only if $K$ is closed and bounded. In fact, no infinite dimensional Banach space has the Heine-Borel property. (Some Fréchet spaces do have the Heine-Borel property, like the space of holomorphic functions on the open unit disc; this is what Montel’s theorem says.) Thus a locally Lipschitz function need not be Lipschitz on a bounded subset of $X$. (On a compact set, the set is covered by finitely many balls on which the function is Lipschitz, and then the function is Lipschitz on the compact set with Lipschitz constant equal to the maximum of the finitely many Lipschitz constants on the balls.) We denote by $\mathcal{C}$ the set of functions $f:X\to\mathbb{R}$ that are differentiable and such that for each bounded subset $A$ of $X$, the restriction of $\mathrm{grad}\,f$ to $A$ is Lipschitz.

The mountain pass theorem (Lawrence C. Evans, Partial Differential Equations, p. 480, Theorem 2; Antonio Ambrosetti and David Arcoya Álvarez, An Introduction to Nonlinear Functional Analysis and Elliptic Problems, p. 48, §5.3) states that if (i) $I\in\mathcal{C}$, (ii) $I$ satisfies the Palais-Smale condition, (iii) $I(0)=0$, (iv) there are $r,a>0$ such that $I(u)\geq a$ when $\left\|u\right\|=r$, and (v) there is some $v\in X$ satisfying $\left\|v\right\|>r$ and $I(v)\leq 0$, then

 $\inf_{g\in\Gamma_{v}}\sup_{0\leq t\leq 1}(I\circ g)(t)$

is a critical value of $I$, where

 $\Gamma_{v}=\{g\in C([0,1];X):g(0)=0,g(1)=v\}.$

## 4 Convexity

We prove that a critical point of a differentiable convex function on an open convex set is a minimum (N. A. Bobylev, S. V. Emel’yanov, and S. K. Korovin, Geometrical Methods in Variational Problems, p. 39, Theorem 2.1.4).

###### Theorem 4.

If $A$ is an open convex set, $f:A\to\mathbb{R}$ is differentiable and convex, and $x_{0}\in A$ is a critical point of $f$, then $f(x_{0})\leq f(x)$ for all $x\in A$.

###### Proof.

Because $f$ is convex, for $x\in A$ and $0<t<1$,

 $f(tx+(1-t)x_{0})\leq tf(x)+(1-t)f(x_{0}),$

i.e.

 $\frac{f(x_{0}+t(x-x_{0}))-f(x_{0})}{t}\leq f(x)-f(x_{0}).$

Taking $t\to 0$,

 $f^{\prime}(x_{0})(x-x_{0})\leq f(x)-f(x_{0}),$

and because $x_{0}$ is a critical point,

 $0\leq f(x)-f(x_{0}),$

i.e. $f(x_{0})\leq f(x)$. ∎
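Theorem 4 can be illustrated with the convex least-squares functional $f(x)=\frac{1}{2}\left\|Tx-h\right\|^{2}$ from Section 1: its critical point solves the normal equations $T^{*}Tx=T^{*}h$ and is a global minimum. A hypothetical sketch:

```python
import numpy as np

# The critical point of f(x) = (1/2)||T x - h||^2 solves T^T T x = T^T h,
# and by Theorem 4 (f is convex) it minimizes f globally.
rng = np.random.default_rng(8)
n = 4
T = rng.standard_normal((n, n))
h = rng.standard_normal(n)
f = lambda x: 0.5 * np.dot(T @ x - h, T @ x - h)

x0 = np.linalg.solve(T.T @ T, T.T @ h)          # critical point
grad_norm = np.linalg.norm(T.T @ T @ x0 - T.T @ h)
is_min = all(f(x0) <= f(x0 + rng.standard_normal(n)) + 1e-12
             for _ in range(100))
```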

We establish equivalent conditions for a differentiable function to be convex (Juan Peypouquet, Convex Optimization in Normed Spaces: Theory, Methods and Examples, p. 38, Proposition 3.10).

###### Theorem 5.

If $A$ is an open convex subset of $X$ and $f:A\to\mathbb{R}$ is differentiable, then the following are equivalent:

1. 1.

$f$ is convex.

2. 2.

$f(y)\geq f(x)+\left\langle\mathrm{grad}\,f(x),y-x\right\rangle$, $x,y\in A$.

3. 3.

$\left\langle\mathrm{grad}\,f(x)-\mathrm{grad}\,f(y),x-y\right\rangle\geq 0$, $x,y\in A$.

###### Proof.

Suppose (1). For $x,y\in A$ and $0<t<1$, that $f$ is convex means $f(ty+(1-t)x)\leq tf(y)+(1-t)f(x)$, i.e.

 $\frac{f(x+t(y-x))-f(x)}{t}\leq f(y)-f(x),$

and taking $t\to 0$ yields

 $f^{\prime}(x)(y-x)\leq f(y)-f(x),$

i.e.

 $\left\langle\mathrm{grad}\,f(x),y-x\right\rangle\leq f(y)-f(x).$

Suppose (2) and let $x,y\in A$, for which

 $\left\langle\mathrm{grad}\,f(x),y-x\right\rangle\leq f(y)-f(x),\qquad\left% \langle\mathrm{grad}\,f(y),x-y\right\rangle\leq f(x)-f(y).$

Adding these inequalities,

 $\left\langle\mathrm{grad}\,f(x),y-x\right\rangle-\left\langle\mathrm{grad}\,f(y),y-x\right\rangle\leq 0.$

Suppose (3), let $x,y\in A$, and define $\phi:[0,1]\to\mathbb{R}$ by

 $\phi(t)=f(tx+(1-t)y)-tf(x)-(1-t)f(y).$

$\phi(0)=0$ and $\phi(1)=0$, and for $0<t<1$, using the chain rule gives

 $\displaystyle\phi^{\prime}(t)$ $\displaystyle=f^{\prime}(tx+(1-t)y)(x-y)-f(x)+f(y)$ $\displaystyle=\left\langle\mathrm{grad}\,f(tx+(1-t)y),x-y\right\rangle-f(x)+f(y).$

Let $0<s<t<1$, let $u=sx+(1-s)y$ and $v=tx+(1-t)y$, which both belong to $A$ because $A$ is convex, and so the above reads

 $\phi^{\prime}(s)=\left\langle\mathrm{grad}\,f(u),x-y\right\rangle-f(x)+f(y),\qquad\phi^{\prime}(t)=\left\langle\mathrm{grad}\,f(v),x-y\right\rangle-f(x)+f(y),$

so

 $\phi^{\prime}(s)-\phi^{\prime}(t)=\left\langle\mathrm{grad}\,f(u)-\mathrm{grad}\,f(v),x-y\right\rangle.$

And

 $(s-t)(x-y)=u-y-(v-y)=u-v,$

so

 $\phi^{\prime}(s)-\phi^{\prime}(t)=\frac{1}{s-t}\left\langle\mathrm{grad}\,f(u)-\mathrm{grad}\,f(v),u-v\right\rangle.$

But (3) tells us

 $\left\langle\mathrm{grad}\,f(u)-\mathrm{grad}\,f(v),u-v\right\rangle\geq 0,$

so, as $s-t<0$,

 $\phi^{\prime}(s)-\phi^{\prime}(t)\leq 0,$

showing that $\phi^{\prime}$ is nondecreasing. On the other hand, because $\phi(0)=0$ and $\phi(1)=0$, by the mean value theorem there is some $0<t_{0}<1$ for which $\phi^{\prime}(t_{0})=0$. Therefore, because $\phi^{\prime}$ is nondecreasing it holds that

 $\phi^{\prime}(t)\leq 0,\qquad 0\leq t\leq t_{0},$

and

 $\phi^{\prime}(t)\geq 0,\qquad t_{0}\leq t\leq 1.$

That is, $\phi$ is nonincreasing on $[0,t_{0}]$, and with $\phi(0)=0$ this yields $\phi(t)\leq 0$ for $t\in[0,t_{0}]$, and $\phi$ is nondecreasing on $[t_{0},1]$, and with $\phi(1)=0$ this yields $\phi(t)\leq 0$ for $t\in[t_{0},1]$. Therefore $\phi(t)\leq 0$ for $t\in[0,1]$, which means that

 $f(tx+(1-t)y)-tf(x)-(1-t)f(y)\leq 0,\qquad 0\leq t\leq 1,$

showing that $f$ is convex. ∎
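The equivalent conditions (2) and (3) of Theorem 5 can be checked numerically for a concrete convex function; here a hypothetical sketch uses $f(x)=\log\sum_{i}e^{x_{i}}$ on $\mathbb{R}^{n}$, whose gradient is the softmax:

```python
import numpy as np

# For the convex f(x) = log(sum_i exp(x_i)), grad f is the softmax.
# Check the subgradient inequality (2) and monotonicity (3) of Theorem 5.
rng = np.random.default_rng(5)
n = 5
f = lambda x: np.log(np.sum(np.exp(x)))
grad = lambda x: np.exp(x) / np.sum(np.exp(x))

bad_subgrad = bad_monotone = 0
for _ in range(200):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    # (2): f(y) >= f(x) + <grad f(x), y - x>
    if f(y) < f(x) + grad(x) @ (y - x) - 1e-12:
        bad_subgrad += 1
    # (3): <grad f(x) - grad f(y), x - y> >= 0
    if (grad(x) - grad(y)) @ (x - y) < -1e-12:
        bad_monotone += 1
```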

###### Theorem 6.

If $A$ is an open convex subset of $X$ and $f:A\to\mathbb{R}$ is twice differentiable, then the following are equivalent:

1. 1.

$f$ is convex.

2. 2.

$\left\langle\mathrm{Hess}\,f(x)(v),v\right\rangle\geq 0$, $x\in A$, $v\in X$.

###### Proof.

Suppose (1) and let $x\in A$. By Theorem 5, for $v\in X$ and $t>0$ such that $x+tv\in A$,

 $\left\langle\mathrm{grad}\,f(x+tv)-\mathrm{grad}\,f(x),tv\right\rangle\geq 0,$

i.e.

 $\frac{f^{\prime}(x+tv)(v)-f^{\prime}(x)(v)}{t}\geq 0.$

Taking $t\to 0$,

 $f^{\prime\prime}(x)(v)(v)\geq 0,$

i.e.

 $\left\langle\mathrm{Hess}\,f(x)(v),v\right\rangle\geq 0.$

Suppose (2), let $x,y\in A$ and define $\phi:[0,1]\to\mathbb{R}$ by

 $\phi(t)=f(tx+(1-t)y)-tf(x)-(1-t)f(y).$

Applying the chain rule, for $0<t<1$,

 $\phi^{\prime\prime}(t)=f^{\prime\prime}(tx+(1-t)y)(x-y)(x-y),$

i.e.

 $\phi^{\prime\prime}(t)=\left\langle\mathrm{Hess}\,f(tx+(1-t)y)(x-y),x-y\right\rangle\geq 0,$

showing that $\phi^{\prime}$ is nondecreasing. In the proof of Theorem 5 we deduced from $\phi^{\prime}$ being nondecreasing and satisfying $\phi(0)=0$, $\phi(1)=0$, that $f$ is convex, and the same reasoning yields here that $f$ is convex. ∎
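Condition (2) of Theorem 6 can also be checked concretely. For $f(x)=\log\sum_{i}e^{x_{i}}$ the Hessian at $x$ is $\mathrm{diag}(p)-pp^{\mathsf{T}}$ with $p$ the softmax of $x$; a hypothetical sketch verifies $\left\langle\mathrm{Hess}\,f(x)(v),v\right\rangle\geq 0$:

```python
import numpy as np

# Hess f(x) = diag(p) - p p^T for f(x) = log(sum_i exp(x_i)), p = softmax(x).
# Theorem 6 (with f convex) predicts <H v, v> >= 0 for all v.
rng = np.random.default_rng(6)
n = 5
min_quad = np.inf
for _ in range(200):
    x = rng.standard_normal(n)
    p = np.exp(x) / np.sum(np.exp(x))
    H = np.diag(p) - np.outer(p, p)
    v = rng.standard_normal(n)
    min_quad = min(min_quad, v @ H @ v)
```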

We call a function $F:X\to X$ $\beta$ co-coercive if

 $\left\langle F(x)-F(y),x-y\right\rangle\geq\beta\left\|F(x)-F(y)\right\|^{2},\qquad x,y\in X.$

We prove conditions under which the gradient of a differentiable convex function is co-coercive (Juan Peypouquet, Convex Optimization in Normed Spaces: Theory, Methods and Examples, p. 40, Theorem 3.13).

###### Theorem 7 (Baillon-Haddad theorem).

Let $f:X\to\mathbb{R}$ be differentiable and convex and let $L>0$. Then $\mathrm{grad}\,f$ is $L$ Lipschitz if and only if $\mathrm{grad}\,f$ is $\frac{1}{L}$ co-coercive.

###### Proof.

Suppose that $\mathrm{grad}\,f$ is $L$ Lipschitz. For $x\in X$, define $h_{x}:X\to\mathbb{R}$ by

 $h_{x}(y)=f(y)-f^{\prime}(x)(y)=f(y)-\left\langle\mathrm{grad}\,f(x),y\right\rangle.$

For $y,z\in X$ and $0<t<1$, because $f$ is convex,

 $\displaystyle h_{x}(tz+(1-t)y)$ $\displaystyle=f(tz+(1-t)y)-\left\langle\mathrm{grad}\,f(x),tz+(1-t)y\right\rangle$ $\displaystyle\leq tf(z)+(1-t)f(y)-\left\langle\mathrm{grad}\,f(x),tz+(1-t)y\right\rangle$ $\displaystyle=th_{x}(z)+(1-t)h_{x}(y),$

showing that $h_{x}$ is convex. For $y,z\in X$ (Henri Cartan, Differential Calculus, p. 29, Proposition 2.4.2),

 $h_{x}^{\prime}(y)(z)=f^{\prime}(y)(z)-f^{\prime}(x)(z),$

and in particular $\mathrm{grad}\,h_{x}(x)=0$. Thus by Theorem 4,

 $h_{x}(x)\leq h_{x}(y),\qquad y\in X.$ (4)

For $x,y,z\in X$, by Lemma 1,

 $f(z)\leq f(x)+\left\langle\mathrm{grad}\,f(x),z-x\right\rangle+\frac{L}{2}\left\|z-x\right\|^{2},$

so

 $h_{y}(z)\leq f(x)-\left\langle\mathrm{grad}\,f(y),z\right\rangle+\left\langle\mathrm{grad}\,f(x),z-x\right\rangle+\frac{L}{2}\left\|z-x\right\|^{2},$

i.e.

 $h_{y}(z)\leq h_{x}(x)+\left\langle\mathrm{grad}\,f(x)-\mathrm{grad}\,f(y),z\right\rangle+\frac{L}{2}\left\|z-x\right\|^{2},$

and applying (4),

 $h_{y}(y)\leq h_{x}(x)+\left\langle\mathrm{grad}\,f(x)-\mathrm{grad}\,f(y),z\right\rangle+\frac{L}{2}\left\|z-x\right\|^{2}.$ (5)

Now,

 $\left\|\mathrm{grad}\,f(x)-\mathrm{grad}\,f(y)\right\|=\sup_{\left\|v\right\|\leq 1}\left\langle\mathrm{grad}\,f(x)-\mathrm{grad}\,f(y),v\right\rangle$

so for each $\epsilon>0$ there is some $v_{\epsilon}\in X$ with $\left\|v_{\epsilon}\right\|\leq 1$ and

 $\left\langle\mathrm{grad}\,f(x)-\mathrm{grad}\,f(y),v_{\epsilon}\right\rangle\geq\left\|\mathrm{grad}\,f(x)-\mathrm{grad}\,f(y)\right\|-\epsilon.$

Let $R=\frac{\left\|\mathrm{grad}\,f(x)-\mathrm{grad}\,f(y)\right\|}{L}$, and applying (5) with $z=x-Rv_{\epsilon}$ yields

 $\displaystyle h_{y}(y)$ $\displaystyle\leq h_{x}(x)+\left\langle\mathrm{grad}\,f(x)-\mathrm{grad}\,f(y),x-Rv_{\epsilon}\right\rangle+\frac{L}{2}\left\|Rv_{\epsilon}\right\|^{2}$ $\displaystyle=h_{x}(x)+\left\langle\mathrm{grad}\,f(x)-\mathrm{grad}\,f(y),x\right\rangle-R\left\langle\mathrm{grad}\,f(x)-\mathrm{grad}\,f(y),v_{\epsilon}\right\rangle+\frac{1}{2L}\left\|\mathrm{grad}\,f(x)-\mathrm{grad}\,f(y)\right\|^{2}\left\|v_{\epsilon}\right\|^{2}$ $\displaystyle\leq h_{x}(x)+\left\langle\mathrm{grad}\,f(x)-\mathrm{grad}\,f(y),x\right\rangle-R\left\|\mathrm{grad}\,f(x)-\mathrm{grad}\,f(y)\right\|+R\epsilon+\frac{1}{2L}\left\|\mathrm{grad}\,f(x)-\mathrm{grad}\,f(y)\right\|^{2}$ $\displaystyle=h_{x}(x)+\left\langle\mathrm{grad}\,f(x)-\mathrm{grad}\,f(y),x\right\rangle-\frac{1}{2L}\left\|\mathrm{grad}\,f(x)-\mathrm{grad}\,f(y)\right\|^{2}+R\epsilon.$

Likewise, because $R$ does not change when $x$ and $y$ are switched,

 $h_{x}(x)\leq h_{y}(y)+\left\langle\mathrm{grad}\,f(y)-\mathrm{grad}\,f(x),y\right\rangle-\frac{1}{2L}\left\|\mathrm{grad}\,f(y)-\mathrm{grad}\,f(x)\right\|^{2}+R\epsilon.$

Adding these inequalities,

 $\displaystyle 0$ $\displaystyle\leq\left\langle\mathrm{grad}\,f(x)-\mathrm{grad}\,f(y),x\right\rangle+\left\langle\mathrm{grad}\,f(y)-\mathrm{grad}\,f(x),y\right\rangle-\frac{1}{L}\left\|\mathrm{grad}\,f(x)-\mathrm{grad}\,f(y)\right\|^{2}+2R\epsilon,$

i.e.

 $\frac{1}{L}\left\|\mathrm{grad}\,f(x)-\mathrm{grad}\,f(y)\right\|^{2}\leq\left\langle\mathrm{grad}\,f(x)-\mathrm{grad}\,f(y),x-y\right\rangle+2R\epsilon.$

This is true for all $\epsilon>0$, so

 $\frac{1}{L}\left\|\mathrm{grad}\,f(x)-\mathrm{grad}\,f(y)\right\|^{2}\leq\left\langle\mathrm{grad}\,f(x)-\mathrm{grad}\,f(y),x-y\right\rangle,$

showing that $\mathrm{grad}\,f$ is $\frac{1}{L}$ co-coercive.

Suppose that $\mathrm{grad}\,f$ is $\frac{1}{L}$ co-coercive and let $x,y\in X$. Then applying the Cauchy-Schwarz inequality,

 $\displaystyle\left\|\mathrm{grad}\,f(x)-\mathrm{grad}\,f(y)\right\|^{2}$ $\displaystyle\leq L\left\langle\mathrm{grad}\,f(x)-\mathrm{grad}\,f(y),x-y\right\rangle$ $\displaystyle\leq L\left\|\mathrm{grad}\,f(x)-\mathrm{grad}\,f(y)\right\|\left\|x-y\right\|.$

If $\left\|\mathrm{grad}\,f(x)-\mathrm{grad}\,f(y)\right\|=0$ then certainly $\left\|\mathrm{grad}\,f(x)-\mathrm{grad}\,f(y)\right\|\leq L\left\|x-y\right\|$. Otherwise, dividing by $\left\|\mathrm{grad}\,f(x)-\mathrm{grad}\,f(y)\right\|$ gives

 $\left\|\mathrm{grad}\,f(x)-\mathrm{grad}\,f(y)\right\|\leq L\left\|x-y\right\|,$

showing that $\mathrm{grad}\,f$ is $L$ Lipschitz. ∎
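The Baillon-Haddad theorem can be tested in finite dimensions. A hypothetical sketch uses the convex quadratic $f(x)=\frac{1}{2}\left\langle Tx,x\right\rangle$ with $T$ positive semidefinite, whose gradient is $L$ Lipschitz with $L=\left\|T\right\|$, and checks the $\frac{1}{L}$ co-coercivity inequality:

```python
import numpy as np

# f(x) = (1/2)<T x, x> with T positive semidefinite is convex, and
# grad f(x) = T x is L-Lipschitz with L = ||T||.  The Baillon-Haddad
# theorem then gives <g(x)-g(y), x-y> >= (1/L)||g(x)-g(y)||^2.
rng = np.random.default_rng(7)
n = 6
A = rng.standard_normal((n, n))
T = A.T @ A                      # positive semidefinite
L = np.linalg.norm(T, 2)         # spectral norm = Lipschitz constant

worst = np.inf
for _ in range(200):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    g = T @ (x - y)              # grad f(x) - grad f(y)
    worst = min(worst, g @ (x - y) - (1.0 / L) * (g @ g))
```

In the eigenbasis of $T$ the gap is $\sum_{i}\lambda_{i}d_{i}^{2}(1-\lambda_{i}/\lambda_{\max})\geq 0$, so `worst` is nonnegative up to rounding.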