Gradients and Hessians in Hilbert spaces

Jordan Bell
July 26, 2015

1 Gradients

Let $(X, \langle \cdot, \cdot \rangle)$ be a real Hilbert space. The Riesz representation theorem says that the mapping

$$\Phi(x)(y) = \langle y, x \rangle, \qquad \Phi : X \to X^*,$$

is an isometric isomorphism. Let $U$ be a nonempty open subset of $X$ and let $f : U \to \mathbb{R}$ be differentiable, with derivative $f' : U \to \mathscr{L}(X; \mathbb{R}) = X^*$. The gradient of $f$ is the function $\operatorname{grad} f : U \to X$ defined by

$$\operatorname{grad} f = \Phi^{-1} \circ f'.$$

Thus, for $x \in U$, $\operatorname{grad} f(x)$ is the unique element of $X$ satisfying

$$\langle \operatorname{grad} f(x), y \rangle = f'(x)(y), \qquad y \in X. \tag{1}$$

Because $\Phi^{-1} : X^* \to X$ is continuous, if $f \in C^1(U; \mathbb{R})$ then $\operatorname{grad} f \in C(U; X)$, being a composition of two continuous functions.

For example, let $T$ be a bounded self-adjoint operator on $X$ and define $f : X \to \mathbb{R}$ by

$$f(x) = \tfrac{1}{2}\langle Tx, x \rangle, \qquad x \in X.$$

For $x, h \in X$,

$$f(x+h) - f(x) = \tfrac{1}{2}\langle Tx, h \rangle + \tfrac{1}{2}\langle Th, x \rangle + \tfrac{1}{2}\langle Th, h \rangle = \langle Tx, h \rangle + \tfrac{1}{2}\langle Th, h \rangle.$$

Thus

$$|f(x+h) - f(x) - \langle Tx, h \rangle| = \tfrac{1}{2}|\langle Th, h \rangle| \le \tfrac{1}{2}\|T\|\|h\|^2 = o(\|h\|),$$

which shows that $f$ is differentiable at $x$, with $f'(x)(y) = \langle Tx, y \rangle$. Thus by (1), $\operatorname{grad} f(x) = Tx$.
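A quick numerical sketch of this example in the concrete Hilbert space $\mathbb{R}^n$ (assuming NumPy; the helper `numerical_grad` is illustrative, not from the text): with $T$ a symmetric matrix, the gradient of $f(x) = \frac{1}{2}\langle Tx, x\rangle$ computed by central differences should agree with $Tx$.

```python
import numpy as np

# Sketch in the concrete Hilbert space R^n (helper names are illustrative):
# for f(x) = (1/2)<Tx, x> with T symmetric, the text shows grad f(x) = Tx.

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
T = (A + A.T) / 2                      # a self-adjoint operator on R^n

def f(x):
    return 0.5 * x @ (T @ x)

def numerical_grad(f, x, eps=1e-6):
    """Central-difference approximation of grad f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

x = rng.standard_normal(n)
analytic = T @ x                       # grad f(x) = Tx
numeric = numerical_grad(f, x)
assert np.allclose(analytic, numeric, atol=1e-5)
```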

For another example, let $T \in \mathscr{L}(X; X)$, let $h \in X$, and define $f : X \to \mathbb{R}$ by

$$f(x) = \tfrac{1}{2}\|Tx - h\|^2, \qquad x \in X.$$

We calculate that

$$\operatorname{grad} f(x) = T^* T x - T^* h, \qquad x \in X.$$
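This formula can likewise be checked numerically in $\mathbb{R}^n$, where for a real matrix $T$ the adjoint $T^*$ is the transpose (a sketch assuming NumPy; the helper is illustrative):

```python
import numpy as np

# For f(x) = (1/2)||Tx - h||^2 on R^n, the adjoint T* of a real matrix T is
# its transpose, so grad f(x) = T^T T x - T^T h; compare central differences.

rng = np.random.default_rng(1)
m, n = 6, 4
T = rng.standard_normal((m, n))
h = rng.standard_normal(m)

def f(x):
    return 0.5 * np.linalg.norm(T @ x - h) ** 2

def numerical_grad(f, x, eps=1e-6):
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

x = rng.standard_normal(n)
analytic = T.T @ T @ x - T.T @ h       # grad f(x) = T*Tx - T*h
assert np.allclose(analytic, numerical_grad(f, x), atol=1e-5)
```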

For $x_0 \in X$, define

$$\phi(t) = \exp(-tT^*T)x_0 + \int_0^t \exp(-(t-s)T^*T)\, T^* h \, ds, \qquad t \ge 0.$$

It is proved\footnote{cf. J.W. Neuberger, \textit{A Sequence of Problems on Semigroups}, p. 51, Problem 195.} that $\phi$ satisfies

$$\phi'(t) = -(\operatorname{grad} f)(\phi(t)), \qquad \phi(0) = x_0.$$
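A discrete sketch of this gradient flow (assuming NumPy): forward Euler applied to $\phi' = -(\operatorname{grad} f)(\phi)$ is exactly gradient descent on $f(x) = \frac{1}{2}\|Tx - h\|^2$, and for a small enough step it should approach the equilibrium, where $T^*Tx = T^*h$, i.e. the least-squares solution.

```python
import numpy as np

# Forward Euler for phi'(t) = -(grad f)(phi(t)) with f(x) = (1/2)||Tx - h||^2
# is gradient descent; its equilibrium satisfies T*T x = T*h, i.e. the
# least-squares solution. Step size is kept below the stability threshold.

rng = np.random.default_rng(2)
m, n = 8, 3
T = rng.standard_normal((m, n))
h = rng.standard_normal(m)

grad = lambda x: T.T @ (T @ x - h)       # grad f(x) = T*Tx - T*h
step = 1.0 / np.linalg.norm(T.T @ T, 2)  # < 2/||T*T||, so the scheme is stable

x = rng.standard_normal(n)               # x_0
for _ in range(5000):
    x = x - step * grad(x)               # Euler step for the gradient flow

lstsq = np.linalg.lstsq(T, h, rcond=None)[0]
assert np.allclose(x, lstsq, atol=1e-6)
```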

For a function $F : X \to X$, we say that $F$ is $L$-Lipschitz if

$$\|F(x) - F(y)\| \le L\|x - y\|, \qquad x, y \in X.$$

The following is a useful inequality for functions whose gradients are Lipschitz.\footnote{Juan Peypouquet, \textit{Convex Optimization in Normed Spaces: Theory, Methods and Examples}, p. 15, Lemma 1.30.}

Lemma 1.

If $f : X \to \mathbb{R}$ is differentiable and $\operatorname{grad} f : X \to X$ is $L$-Lipschitz, then

$$f(y) \le f(x) + \langle \operatorname{grad} f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2, \qquad x, y \in X.$$
Proof.

Let $h = y - x$ and define $g : [0,1] \to \mathbb{R}$ by $g(t) = f(x + th)$. By the chain rule, for $0 < t < 1$,

$$g'(t) = f'(x+th)(h) = \langle \operatorname{grad} f(x+th), h \rangle.$$

Thus by the fundamental theorem of calculus,

$$\int_0^1 \langle \operatorname{grad} f(x+th), h \rangle \, dt = \int_0^1 g'(t)\,dt = g(1) - g(0) = f(x+h) - f(x) = f(y) - f(x),$$

and so, using the Cauchy-Schwarz inequality and the fact that $\operatorname{grad} f$ is $L$-Lipschitz,

$$\begin{aligned}
f(y) - f(x) &= \int_0^1 \langle \operatorname{grad} f(x+th) - \operatorname{grad} f(x) + \operatorname{grad} f(x), h \rangle \, dt \\
&= \langle \operatorname{grad} f(x), h \rangle + \int_0^1 \langle \operatorname{grad} f(x+th) - \operatorname{grad} f(x), h \rangle \, dt \\
&\le \langle \operatorname{grad} f(x), h \rangle + \int_0^1 \|\operatorname{grad} f(x+th) - \operatorname{grad} f(x)\| \|h\| \, dt \\
&\le \langle \operatorname{grad} f(x), h \rangle + \int_0^1 L t \|h\| \|h\| \, dt \\
&= \langle \operatorname{grad} f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2,
\end{aligned}$$

proving the claim. ∎
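Lemma 1 can be sanity-checked numerically for the earlier quadratic example $f(x) = \frac{1}{2}\langle Tx, x\rangle$, whose gradient $x \mapsto Tx$ is $L$-Lipschitz with $L = \|T\|$ (a sketch assuming NumPy):

```python
import numpy as np

# For f(x) = (1/2)<Tx, x> with T symmetric, grad f(x) = Tx is linear, hence
# L-Lipschitz with L = ||T|| (operator norm). Check the inequality of Lemma 1
# on random pairs of points.

rng = np.random.default_rng(3)
n = 6
A = rng.standard_normal((n, n))
T = (A + A.T) / 2
L = np.linalg.norm(T, 2)

f = lambda x: 0.5 * x @ (T @ x)
grad = lambda x: T @ x

for _ in range(100):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    lhs = f(y)
    rhs = f(x) + grad(x) @ (y - x) + 0.5 * L * np.linalg.norm(y - x) ** 2
    assert lhs <= rhs + 1e-9           # f(y) <= f(x) + <grad f(x), y-x> + (L/2)||y-x||^2
```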

2 Hessians

Let $U$ be a nonempty open subset of $X$. We prove that if a function is $C^2$ then its gradient is $C^1$.\footnote{Rodney Coleman, \textit{Calculus on Normed Vector Spaces}, p. 139, Theorem 6.5.}

Theorem 2.

Let $U$ be an open subset of $X$. If $f \in C^2(U; \mathbb{R})$, then $\operatorname{grad} f \in C^1(U; X)$, and

$$f''(x)(u)(v) = \langle v, (\operatorname{grad} f)'(x)(u) \rangle, \qquad x \in U,\ u, v \in X. \tag{2}$$
Proof.

That $f$ is $C^2$ means that $f' : U \to X^*$ is $C^1$. That is: for all $x \in U$, the map $f' : U \to X^*$ is continuous at $x$, there is $f''(x) \in \mathscr{L}(X; X^*)$ such that

$$\|f'(x+h) - f'(x) - f''(x)(h)\| = o(\|h\|) \tag{3}$$

as $h \to 0$, and the map $x \mapsto f''(x)$ is continuous $U \to \mathscr{L}(X; X^*)$.

Let $x \in U$ and let $h \in X$. Define $\phi_h \in X^*$ by

$$\phi_h(v) = f''(x)(h)(v), \qquad v \in X.$$

Define $\nu_x(h) = \Phi^{-1}(\phi_h) \in X$; thus

$$f''(x)(h)(v) = \langle v, \nu_x(h) \rangle, \qquad v \in X.$$

It is straightforward that $\nu_x$ is linear. Because $\Phi$ is an isometric isomorphism,

$$\|\nu_x(h)\| = \|\phi_h\| = \sup_{\|v\| \le 1} |\phi_h(v)| = \sup_{\|v\| \le 1} |f''(x)(h)(v)| \le \|f''(x)\| \|h\|,$$

where $(u, v) \mapsto f''(x)(u)(v)$ is a bilinear form, with

$$\|f''(x)\| = \sup_{\|u\| \le 1, \|v\| \le 1} |f''(x)(u)(v)|,$$

showing that $\nu_x : X \to X$ is a bounded linear operator with $\|\nu_x\| \le \|f''(x)\|$. For $h$ such that $x + h \in U$ and for $v \in X$,

$$(f'(x+h) - f'(x) - f''(x)(h))(v) = \langle v, \operatorname{grad} f(x+h) - \operatorname{grad} f(x) - \nu_x(h) \rangle,$$

so

$$\begin{aligned}
\|f'(x+h) - f'(x) - f''(x)(h)\| &= \sup_{\|v\| \le 1} |\langle v, \operatorname{grad} f(x+h) - \operatorname{grad} f(x) - \nu_x(h) \rangle| \\
&= \|\operatorname{grad} f(x+h) - \operatorname{grad} f(x) - \nu_x(h)\|.
\end{aligned}$$

Thus by (3),

$$\|\operatorname{grad} f(x+h) - \operatorname{grad} f(x) - \nu_x(h)\| = o(\|h\|)$$

as $h \to 0$, and because $\nu_x \in \mathscr{L}(X; X)$, this means that $\operatorname{grad} f : U \to X$ is differentiable at $x$, with $(\operatorname{grad} f)'(x) = \nu_x$. It remains to prove that $x \mapsto \nu_x$ is continuous $U \to \mathscr{L}(X; X)$, namely that $(\operatorname{grad} f)'$ is continuous. For $x \in U$ and for $h$ with $x + h \in U$,

$$\begin{aligned}
\|\nu_{x+h} - \nu_x\| &= \sup_{\|u\| \le 1} \|\nu_{x+h}(u) - \nu_x(u)\| \\
&= \sup_{\|u\| \le 1} \sup_{\|v\| \le 1} |\langle v, \nu_{x+h}(u) - \nu_x(u) \rangle| \\
&= \sup_{\|u\| \le 1} \sup_{\|v\| \le 1} |f''(x+h)(u)(v) - f''(x)(u)(v)| \\
&= \|f''(x+h) - f''(x)\|,
\end{aligned}$$

and because $f''$ is continuous on $U$ we get that $x \mapsto \nu_x$ is continuous on $U$, completing the proof. ∎

If $f \in C^2(U; \mathbb{R})$, we proved in the above theorem that $\operatorname{grad} f \in C^1(U; X)$. We call the derivative of $\operatorname{grad} f$ the Hessian of $f$,\footnote{cf. R. A. Tapia, The differentiation and integration of nonlinear operators, pp. 45-101, in \textit{Nonlinear Functional Analysis and Applications} (Louis B. Rall, ed.).}

$$\operatorname{Hess} f = (\operatorname{grad} f)' : U \to \mathscr{L}(X; X),$$

and (2) then reads

$$f''(x)(u)(v) = \langle v, \operatorname{Hess} f(x)(u) \rangle, \qquad x \in U,\ u, v \in X.$$

Furthermore, it is a fact that if $f \in C^2(U; \mathbb{R})$, then for each $x \in U$, the bilinear form

$$(u, v) \mapsto f''(x)(u)(v)$$

is symmetric.\footnote{Serge Lang, \textit{Real and Functional Analysis}, third ed., p. 344, Theorem 5.3.} Thus, for $x \in U$ and $u, v \in X$,

$$\langle v, \operatorname{Hess} f(x)(u) \rangle = \langle u, \operatorname{Hess} f(x)(v) \rangle.$$

Now, using that $\langle \cdot, \cdot \rangle$ is symmetric as $X$ is a real Hilbert space, $(\operatorname{Hess} f(x))^* \in \mathscr{L}(X; X)$ satisfies

$$\langle u, \operatorname{Hess} f(x)(v) \rangle = \langle (\operatorname{Hess} f(x))^* u, v \rangle = \langle v, (\operatorname{Hess} f(x))^* u \rangle,$$

so

$$\langle v, \operatorname{Hess} f(x)(u) \rangle = \langle v, (\operatorname{Hess} f(x))^* u \rangle.$$

Because this is true for all $v$ we have $\operatorname{Hess} f(x)(u) = (\operatorname{Hess} f(x))^* u$, and because this is true for all $u$ we have $\operatorname{Hess} f(x) = (\operatorname{Hess} f(x))^*$, i.e. $\operatorname{Hess} f(x)$ is self-adjoint.

Theorem 3.

If $U$ is an open subset of $X$ and $f \in C^2(U; \mathbb{R})$, then for each $x \in U$ it is the case that $\operatorname{Hess} f(x) \in \mathscr{L}(X; X)$ is self-adjoint.
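In $\mathbb{R}^n$ the Hessian is the matrix of second partial derivatives, and Theorem 3 says this matrix is symmetric. A finite-difference sketch (assuming NumPy; the function and helper are illustrative choices, not from the text):

```python
import numpy as np

# The Hessian as the derivative of the gradient: differentiate an analytic
# gradient by central differences and check the resulting matrix is symmetric.
# The function f(x) = sin(x0) e^{x1} + x0^2 x1 is an illustrative choice.

def grad(x):
    return np.array([np.cos(x[0]) * np.exp(x[1]) + 2 * x[0] * x[1],
                     np.sin(x[0]) * np.exp(x[1]) + x[0] ** 2])

def numerical_hessian(grad, x, eps=1e-6):
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = eps
        H[i] = (grad(x + e) - grad(x - e)) / (2 * eps)  # row i: d(grad f)/dx_i
    return H

H = numerical_hessian(grad, np.array([0.3, -0.7]))
assert np.allclose(H, H.T, atol=1e-5)                   # self-adjoint Hessian
```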

3 Critical points

For an open set $U$ in $X$, for $k \ge 1$, and for $f \in C^{k+2}(U; \mathbb{R})$, we say that $x_0 \in U$ is a critical point of $f$ if $f'(x_0) = 0$. If $x_0$ is a critical point of $f$, we say that $x_0$ is a nondegenerate critical point of $f$ if $\operatorname{Hess} f(x_0) \in \mathscr{L}(X; X)$ is invertible. The Morse-Palais lemma\footnote{Serge Lang, \textit{Differential and Riemannian Manifolds}, p. 182, chapter VII, Theorem 5.1; Kung-ching Chang, \textit{Infinite Dimensional Morse Theory and Multiple Solution Problems}, p. 33, Theorem 4.1; André Avez, \textit{Calcul différentiel}, p. 87, §3; N. A. Bobylev, S. V. Emel'yanov, and S. K. Korovin, \textit{Geometrical Methods in Variational Problems}, p. 360, Theorem 5.5.2; Hajime Urakawa, \textit{Calculus of Variations and Harmonic Maps}, p. 87, chapter 3, §1, Theorem 1.10; Jean-Pierre Aubin and Ivar Ekeland, \textit{Applied Nonlinear Analysis}, p. 52, Theorem 8; Melvyn S. Berger, \textit{Nonlinearity and Functional Analysis: Lectures on Nonlinear Problems in Mathematical Analysis}, p. 355, Theorem 6.5.4.} states that if $f \in C^{k+2}(U; \mathbb{R})$ with $k \ge 1$, $f(0) = 0$, and $0$ is a nondegenerate critical point of $f$, then there is some open subset $V$ of $U$ with $0 \in V$ and a $C^k$ diffeomorphism $\phi : V \to V$, $\phi(0) = 0$, such that

$$f(x) = \tfrac{1}{2}\langle \operatorname{Hess} f(0)(\phi(x)), \phi(x) \rangle, \qquad x \in V.$$

If $x$ is a critical point of a differentiable function $f : U \to \mathbb{R}$, we call $f(x)$ a critical value of $f$. If $k \ge n$ and $f \in C^k(\mathbb{R}^n; \mathbb{R})$, Sard's theorem tells us that the set of critical values of $f$ has Lebesgue measure 0 and is meager.

For Banach spaces $Y$ and $Z$, a Fredholm operator\footnote{Martin Schechter, \textit{Principles of Functional Analysis}, second ed., chapter 5.} is a bounded linear operator $T : Y \to Z$ such that (i) $\alpha(T) = \dim \ker T < \infty$, (ii) $T(Y)$ is a closed subspace of $Z$, and (iii) $\beta(T) = \dim \ker T^* < \infty$. The index of a Fredholm operator $T$ is

$$\operatorname{ind} T = \alpha(T) - \beta(T).$$

For a differentiable function $f : U \to \mathbb{R}$, $U$ an open subset of $X$, and for $x \in U$, $f'(x) \in \mathscr{L}(X; \mathbb{R}) = X^*$. $f'(x)$ is a Fredholm operator if and only if $\dim \ker f'(x) < \infty$. For $U$ a connected open subset of $X$ and for $f \in C^1(U; \mathbb{R})$, we call $f$ a Fredholm map if $f'(x)$ is a Fredholm operator for each $x \in U$. It is a fact that $\operatorname{ind} f'(x) = \operatorname{ind} f'(y)$ for all $x, y \in U$, using that $U$ is connected. We denote this common value by $\operatorname{ind} f$. A generalization of Sard's theorem by Smale tells us that if $X$ is separable, $U$ is a connected open subset of $X$, $f \in C^k(U; \mathbb{R})$ is a Fredholm map, and

$$k > \max\{\operatorname{ind} f, 0\},$$

then the set of critical values of $f$ is meager.\footnote{Eberhard Zeidler, \textit{Nonlinear Functional Analysis and its Applications, IV: Applications to Mathematical Physics}, p. 829, Theorem 78.A; Melvyn S. Berger, \textit{Nonlinearity and Functional Analysis: Lectures on Nonlinear Problems in Mathematical Analysis}, p. 125, Theorem 3.1.45.}

A function $f \in C^1(X; \mathbb{R})$ is said to satisfy the Palais-Smale condition if whenever $(u_k)$ is a sequence in $X$ such that (i) $\{f(u_k)\}$ is a bounded subset of $\mathbb{R}$ and (ii) $\operatorname{grad} f(u_k) \to 0$, it follows that $\{u_k\}$ is a precompact subset of $X$: every subsequence of $(u_k)$ itself has a Cauchy subsequence.

Often when speaking about ordinary differential equations in $\mathbb{R}^d$, we deal with differentiable functions whose derivatives are locally Lipschitz. $\mathbb{R}^d$ has the Heine-Borel property: a subset $K$ of $\mathbb{R}^d$ is compact if and only if $K$ is closed and bounded. In fact, no infinite-dimensional Banach space has the Heine-Borel property.\footnote{Some Fréchet spaces have the Heine-Borel property, like the space of holomorphic functions on the open unit disc, which is what Montel's theorem says.} Thus a locally Lipschitz function need not be Lipschitz on a bounded subset of $X$. (On a compact set, the set is covered by finitely many balls on which the function is Lipschitz, and then the function is Lipschitz on the compact set with Lipschitz constant equal to the maximum of the finitely many Lipschitz constants on the balls.) We denote by $\mathcal{C}$ the set of functions $f : X \to \mathbb{R}$ that are differentiable and such that for each bounded subset $A$ of $X$, the restriction of $\operatorname{grad} f$ to $A$ is Lipschitz.

The mountain pass theorem\footnote{Lawrence C. Evans, \textit{Partial Differential Equations}, p. 480, Theorem 2; Antonio Ambrosetti and David Arcoya Álvarez, \textit{An Introduction to Nonlinear Functional Analysis and Elliptic Problems}, p. 48, §5.3.} states that if (i) $I \in \mathcal{C}$, (ii) $I$ satisfies the Palais-Smale condition, (iii) $I(0) = 0$, (iv) there are $r, a > 0$ such that $I(u) \ge a$ when $\|u\| = r$, and (v) there is some $v \in X$ satisfying $\|v\| > r$ and $I(v) \le 0$, then

$$\inf_{g \in \Gamma_v} \sup_{0 \le t \le 1} (I \circ g)(t)$$

is a critical value of $I$, where

$$\Gamma_v = \{g \in C([0,1]; X) : g(0) = 0,\ g(1) = v\}.$$

4 Convexity

We prove that a critical point of a differentiable convex function on an open convex set is a minimum.\footnote{N. A. Bobylev, S. V. Emel'yanov, and S. K. Korovin, \textit{Geometrical Methods in Variational Problems}, p. 39, Theorem 2.1.4.}

Theorem 4.

If $A$ is an open convex set, $f : A \to \mathbb{R}$ is differentiable and convex, and $x_0 \in A$ is a critical point of $f$, then $f(x_0) \le f(x)$ for all $x \in A$.

Proof.

Because $f$ is convex, for $x \in A$ and $0 < t < 1$,

$$f(tx + (1-t)x_0) \le t f(x) + (1-t) f(x_0),$$

i.e.

$$\frac{f(x_0 + t(x - x_0)) - f(x_0)}{t} \le f(x) - f(x_0).$$

Taking $t \downarrow 0$,

$$f'(x_0)(x - x_0) \le f(x) - f(x_0),$$

and because $x_0$ is a critical point,

$$0 \le f(x) - f(x_0),$$

i.e. $f(x_0) \le f(x)$. ∎

We establish equivalent conditions for a differentiable function to be convex.\footnote{Juan Peypouquet, \textit{Convex Optimization in Normed Spaces: Theory, Methods and Examples}, p. 38, Proposition 3.10.}

Theorem 5.

If $A$ is an open convex subset of $X$ and $f : A \to \mathbb{R}$ is differentiable, then the following are equivalent:

  1. $f$ is convex.

  2. $f(y) \ge f(x) + \langle \operatorname{grad} f(x), y - x \rangle$ for all $x, y \in A$.

  3. $\langle \operatorname{grad} f(x) - \operatorname{grad} f(y), x - y \rangle \ge 0$ for all $x, y \in A$.

Proof.

Suppose (1). For $x, y \in A$ and $0 < t < 1$, that $f$ is convex means $f(ty + (1-t)x) \le t f(y) + (1-t) f(x)$, i.e.

$$\frac{f(x + t(y-x)) - f(x)}{t} \le f(y) - f(x),$$

and taking $t \downarrow 0$ yields

$$f'(x)(y - x) \le f(y) - f(x),$$

i.e.

$$\langle \operatorname{grad} f(x), y - x \rangle \le f(y) - f(x).$$

Suppose (2) and let $x, y \in A$, for which

$$\langle \operatorname{grad} f(x), y - x \rangle \le f(y) - f(x), \qquad \langle \operatorname{grad} f(y), x - y \rangle \le f(x) - f(y).$$

Adding these inequalities,

$$\langle \operatorname{grad} f(x), y - x \rangle - \langle \operatorname{grad} f(y), y - x \rangle \le 0.$$

Suppose (3), let $x, y \in A$, and define $\phi : [0,1] \to \mathbb{R}$ by

$$\phi(t) = f(tx + (1-t)y) - t f(x) - (1-t) f(y).$$

Then $\phi(0) = 0$ and $\phi(1) = 0$, and for $0 < t < 1$, using the chain rule gives

$$\begin{aligned}
\phi'(t) &= f'(tx + (1-t)y)(x - y) - f(x) + f(y) \\
&= \langle \operatorname{grad} f(tx + (1-t)y), x - y \rangle - f(x) + f(y).
\end{aligned}$$

Let $0 < s < t < 1$, let $u = sx + (1-s)y$ and $v = tx + (1-t)y$, which both belong to $A$ because $A$ is convex, and so the above reads

$$\phi'(s) = \langle \operatorname{grad} f(u), x - y \rangle - f(x) + f(y), \qquad \phi'(t) = \langle \operatorname{grad} f(v), x - y \rangle - f(x) + f(y),$$

so

$$\phi'(s) - \phi'(t) = \langle \operatorname{grad} f(u) - \operatorname{grad} f(v), x - y \rangle.$$

And

$$(s - t)(x - y) = u - y - (v - y) = u - v,$$

so

$$\phi'(s) - \phi'(t) = \frac{1}{s - t}\langle \operatorname{grad} f(u) - \operatorname{grad} f(v), u - v \rangle.$$

But (3) tells us

$$\langle \operatorname{grad} f(u) - \operatorname{grad} f(v), u - v \rangle \ge 0,$$

so, as $s - t < 0$,

$$\phi'(s) - \phi'(t) \le 0,$$

showing that $\phi'$ is nondecreasing. On the other hand, because $\phi(0) = 0$ and $\phi(1) = 0$, by the mean value theorem there is some $0 < t_0 < 1$ for which $\phi'(t_0) = 0$. Therefore, because $\phi'$ is nondecreasing it holds that

$$\phi'(t) \le 0, \qquad 0 < t \le t_0,$$

and

$$\phi'(t) \ge 0, \qquad t_0 \le t < 1.$$

That is, $\phi$ is nonincreasing on $[0, t_0]$, and with $\phi(0) = 0$ this yields $\phi(t) \le 0$ for $t \in [0, t_0]$; and $\phi$ is nondecreasing on $[t_0, 1]$, and with $\phi(1) = 0$ this yields $\phi(t) \le 0$ for $t \in [t_0, 1]$. Therefore $\phi(t) \le 0$ for $t \in [0, 1]$, which means that

$$f(tx + (1-t)y) - t f(x) - (1-t) f(y) \le 0, \qquad 0 \le t \le 1,$$

showing that $f$ is convex. ∎
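Condition (3) of Theorem 5, monotonicity of the gradient, can be observed numerically for a concrete convex function on $\mathbb{R}^n$, here $f(x) = \log \sum_i e^{x_i}$, whose gradient is the softmax map (an illustrative choice, assuming NumPy):

```python
import numpy as np

# f(x) = log(sum_i exp(x_i)) is convex on R^n and its gradient is the softmax
# map; condition (3) of Theorem 5 says the gradient is monotone.

def grad(x):
    e = np.exp(x - x.max())        # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(4)
for _ in range(200):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    # monotonicity: <grad f(x) - grad f(y), x - y> >= 0
    assert (grad(x) - grad(y)) @ (x - y) >= -1e-12
```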

Theorem 6.

If $A$ is an open convex subset of $X$ and $f : A \to \mathbb{R}$ is twice differentiable, then the following are equivalent:

  1. $f$ is convex.

  2. $\langle \operatorname{Hess} f(x)(v), v \rangle \ge 0$ for all $x \in A$, $v \in X$.

Proof.

Suppose (1) and let $x \in A$. From Theorem 5, for $v \in X$ and for $t > 0$ such that $x + tv \in A$,

$$\langle \operatorname{grad} f(x + tv) - \operatorname{grad} f(x), tv \rangle \ge 0,$$

i.e.

$$\frac{f'(x + tv)(v) - f'(x)(v)}{t} \ge 0.$$

Taking $t \downarrow 0$,

$$f''(x)(v)(v) \ge 0,$$

i.e.

$$\langle \operatorname{Hess} f(x)(v), v \rangle \ge 0.$$

Suppose (2), let $x, y \in A$ and define $\phi : [0, 1] \to \mathbb{R}$ by

$$\phi(t) = f(tx + (1-t)y) - t f(x) - (1-t) f(y).$$

Applying the chain rule, for $0 < t < 1$,

$$\phi''(t) = f''(tx + (1-t)y)(x - y)(x - y),$$

i.e.

$$\phi''(t) = \langle \operatorname{Hess} f(tx + (1-t)y)(x - y), x - y \rangle \ge 0,$$

showing that $\phi'$ is nondecreasing. In the proof of Theorem 5 we deduced from $\phi'$ being nondecreasing and satisfying $\phi(0) = 0$, $\phi(1) = 0$ that $f$ is convex, and the same reasoning yields here that $f$ is convex. ∎
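A numerical illustration of Theorem 6 in $\mathbb{R}^n$ (assuming NumPy): for the convex function $f(x) = \|x\|^4$, whose Hessian at $x$ works out to $4\|x\|^2 I + 8xx^T$, the eigenvalues of the Hessian should all be nonnegative.

```python
import numpy as np

# For the convex f(x) = ||x||^4 on R^n one computes grad f(x) = 4||x||^2 x and
# Hess f(x) = 4||x||^2 I + 8 x x^T; Theorem 6 predicts this matrix is
# positive semidefinite at every x.

rng = np.random.default_rng(5)
for _ in range(100):
    x = rng.standard_normal(4)
    H = 4 * (x @ x) * np.eye(4) + 8 * np.outer(x, x)
    eigs = np.linalg.eigvalsh(H)   # eigenvalues of the (symmetric) Hessian
    assert eigs.min() >= -1e-10
```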

We call a function $F : X \to X$ $\beta$-co-coercive if

$$\langle F(x) - F(y), x - y \rangle \ge \beta \|F(x) - F(y)\|^2, \qquad x, y \in X.$$

We prove conditions under which the gradient of a differentiable convex function is co-coercive.\footnote{Juan Peypouquet, \textit{Convex Optimization in Normed Spaces: Theory, Methods and Examples}, p. 40, Theorem 3.13.}

Theorem 7 (Baillon-Haddad theorem).

Let $f : X \to \mathbb{R}$ be differentiable and convex and let $L > 0$. Then $\operatorname{grad} f$ is $L$-Lipschitz if and only if $\operatorname{grad} f$ is $\frac{1}{L}$-co-coercive.

Proof.

Suppose that $\operatorname{grad} f$ is $L$-Lipschitz and for $x \in X$, define $h_x : X \to \mathbb{R}$ by

$$h_x(y) = f(y) - f'(x)(y) = f(y) - \langle \operatorname{grad} f(x), y \rangle.$$

For $y, z \in X$ and $0 < t < 1$, because $f$ is convex,

$$\begin{aligned}
h_x(tz + (1-t)y) &= f(tz + (1-t)y) - \langle \operatorname{grad} f(x), tz + (1-t)y \rangle \\
&\le t f(z) + (1-t) f(y) - \langle \operatorname{grad} f(x), tz + (1-t)y \rangle \\
&= t h_x(z) + (1-t) h_x(y),
\end{aligned}$$

showing that $h_x$ is convex. For $y, z \in X$,\footnote{Henri Cartan, \textit{Differential Calculus}, p. 29, Proposition 2.4.2.}

$$h_x'(y)(z) = f'(y)(z) - f'(x)(z),$$

and in particular $\operatorname{grad} h_x(x) = 0$. Thus by Theorem 4,

$$h_x(x) \le h_x(y), \qquad y \in X. \tag{4}$$

For $x, y, z \in X$, by Lemma 1,

$$f(z) \le f(x) + \langle \operatorname{grad} f(x), z - x \rangle + \frac{L}{2}\|z - x\|^2,$$

so

$$h_y(z) \le f(x) - \langle \operatorname{grad} f(y), z \rangle + \langle \operatorname{grad} f(x), z - x \rangle + \frac{L}{2}\|z - x\|^2,$$

i.e.

$$h_y(z) \le h_x(x) + \langle \operatorname{grad} f(x) - \operatorname{grad} f(y), z \rangle + \frac{L}{2}\|z - x\|^2,$$

and applying (4),

$$h_y(y) \le h_x(x) + \langle \operatorname{grad} f(x) - \operatorname{grad} f(y), z \rangle + \frac{L}{2}\|z - x\|^2. \tag{5}$$

Now,

$$\|\operatorname{grad} f(x) - \operatorname{grad} f(y)\| = \sup_{\|v\| \le 1} \langle \operatorname{grad} f(x) - \operatorname{grad} f(y), v \rangle,$$

so for each $\epsilon > 0$ there is some $v_\epsilon \in X$ with $\|v_\epsilon\| \le 1$ and

$$\langle \operatorname{grad} f(x) - \operatorname{grad} f(y), v_\epsilon \rangle \ge \|\operatorname{grad} f(x) - \operatorname{grad} f(y)\| - \epsilon.$$

Let $R = \frac{\|\operatorname{grad} f(x) - \operatorname{grad} f(y)\|}{L}$; applying (5) with $z = x - R v_\epsilon$ yields

$$\begin{aligned}
h_y(y) &\le h_x(x) + \langle \operatorname{grad} f(x) - \operatorname{grad} f(y), x - R v_\epsilon \rangle + \frac{L}{2}\|R v_\epsilon\|^2 \\
&= h_x(x) + \langle \operatorname{grad} f(x) - \operatorname{grad} f(y), x \rangle - R \langle \operatorname{grad} f(x) - \operatorname{grad} f(y), v_\epsilon \rangle \\
&\qquad + \frac{1}{2L}\|\operatorname{grad} f(x) - \operatorname{grad} f(y)\|^2 \|v_\epsilon\|^2 \\
&\le h_x(x) + \langle \operatorname{grad} f(x) - \operatorname{grad} f(y), x \rangle - R \|\operatorname{grad} f(x) - \operatorname{grad} f(y)\| + R\epsilon \\
&\qquad + \frac{1}{2L}\|\operatorname{grad} f(x) - \operatorname{grad} f(y)\|^2 \\
&= h_x(x) + \langle \operatorname{grad} f(x) - \operatorname{grad} f(y), x \rangle - \frac{1}{2L}\|\operatorname{grad} f(x) - \operatorname{grad} f(y)\|^2 + R\epsilon.
\end{aligned}$$

Likewise, because $R$ does not change when $x$ and $y$ are switched,

$$h_x(x) \le h_y(y) + \langle \operatorname{grad} f(y) - \operatorname{grad} f(x), y \rangle - \frac{1}{2L}\|\operatorname{grad} f(y) - \operatorname{grad} f(x)\|^2 + R\epsilon.$$

Adding these inequalities,

$$0 \le \langle \operatorname{grad} f(x) - \operatorname{grad} f(y), x \rangle + \langle \operatorname{grad} f(y) - \operatorname{grad} f(x), y \rangle - \frac{1}{L}\|\operatorname{grad} f(x) - \operatorname{grad} f(y)\|^2 + 2R\epsilon,$$

i.e.

$$\frac{1}{L}\|\operatorname{grad} f(x) - \operatorname{grad} f(y)\|^2 \le \langle \operatorname{grad} f(x) - \operatorname{grad} f(y), x - y \rangle + 2R\epsilon.$$

This is true for all $\epsilon > 0$, so

$$\frac{1}{L}\|\operatorname{grad} f(x) - \operatorname{grad} f(y)\|^2 \le \langle \operatorname{grad} f(x) - \operatorname{grad} f(y), x - y \rangle,$$

showing that $\operatorname{grad} f$ is $\frac{1}{L}$-co-coercive.

Suppose that $\operatorname{grad} f$ is $\frac{1}{L}$-co-coercive and let $x, y \in X$. Then applying the Cauchy-Schwarz inequality,

$$\|\operatorname{grad} f(x) - \operatorname{grad} f(y)\|^2 \le L \langle \operatorname{grad} f(x) - \operatorname{grad} f(y), x - y \rangle \le L \|\operatorname{grad} f(x) - \operatorname{grad} f(y)\| \|x - y\|.$$

If $\|\operatorname{grad} f(x) - \operatorname{grad} f(y)\| = 0$ then certainly $\|\operatorname{grad} f(x) - \operatorname{grad} f(y)\| \le L\|x - y\|$. Otherwise, dividing by $\|\operatorname{grad} f(x) - \operatorname{grad} f(y)\|$ gives

$$\|\operatorname{grad} f(x) - \operatorname{grad} f(y)\| \le L\|x - y\|,$$

showing that $\operatorname{grad} f$ is $L$-Lipschitz. ∎
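The Baillon-Haddad theorem can be checked numerically for the convex quadratic $f(x) = \frac{1}{2}\langle Ax, x\rangle$ with $A$ symmetric positive semidefinite, where $\operatorname{grad} f(x) = Ax$ is $L$-Lipschitz with $L = \lambda_{\max}(A)$ (a sketch assuming NumPy):

```python
import numpy as np

# For f(x) = (1/2)<Ax, x> with A symmetric positive semidefinite,
# grad f(x) = Ax is L-Lipschitz with L = lambda_max(A); by Baillon-Haddad
# it should be (1/L)-co-coercive.

rng = np.random.default_rng(6)
n = 5
B = rng.standard_normal((n, n))
A = B.T @ B                         # symmetric positive semidefinite
L = np.linalg.eigvalsh(A).max()     # Lipschitz constant of grad f

for _ in range(200):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    g = A @ (x - y)                 # grad f(x) - grad f(y)
    # co-coercivity: <grad f(x) - grad f(y), x - y> >= (1/L)||grad f(x) - grad f(y)||^2
    assert g @ (x - y) >= (1.0 / L) * (g @ g) - 1e-9
```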