Subgaussian random variables, Hoeffding’s inequality, and Cramér’s large deviation theorem

Jordan Bell
June 4, 2014

1 Subgaussian random variables

For a random variable X, let Λ_X(t) = log E(e^{tX}), the cumulant generating function of X. For b > 0, a b-subgaussian random variable is a random variable X such that

Λ_X(t) ≤ b²t²/2 for all t ∈ ℝ.

We remark that for γ_{a,σ²} a Gaussian measure, whose density with respect to Lebesgue measure on ℝ is

p(x, a, σ²) = (1/√(2πσ²)) e^{-(x-a)²/(2σ²)},

we have

∫_ℝ e^{tx} dγ_{0,b²}(x) = ∫_ℝ e^{bty} (1/√(2π)) e^{-y²/2} dy = e^{b²t²/2} ∫_ℝ (1/√(2π)) e^{-(y-bt)²/2} dy = e^{b²t²/2}.
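
As a quick numerical illustration of this identity (not part of the text's argument), here is a minimal sketch, assuming NumPy is available, that estimates E(e^{tX}) for X with law γ_{0,b²} by Monte Carlo and compares it with e^{b²t²/2}; the values of b, t, and the sample size are arbitrary choices.

import numpy as np

# Monte Carlo check of the identity ∫ e^{tx} dγ_{0,b²}(x) = e^{b²t²/2}.
# The values of b, t, and the sample size are arbitrary illustrative choices.
rng = np.random.default_rng(0)
b, t = 1.5, 0.7
samples = rng.normal(loc=0.0, scale=b, size=1_000_000)  # X ~ N(0, b²)

empirical_mgf = np.exp(t * samples).mean()
exact_mgf = np.exp(b**2 * t**2 / 2)

print("Monte Carlo estimate of E(e^{tX}):", empirical_mgf)
print("Closed form e^{b²t²/2}:           ", exact_mgf)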

We prove that a b-subgaussian random variable is centered and has variance at most b² (Karl R. Stromberg, Probability for Analysts, p. 293, Proposition 9.8).

Theorem 1.

If X is b-subgaussian, then E(X) = 0 and Var(X) ≤ b².

Proof.

For each ω ∈ Ω, |∑_{k=0}^n t^k X(ω)^k / k!| ≤ e^{|tX(ω)|} ≤ e^{tX(ω)} + e^{-tX(ω)}, which is integrable because X is subgaussian, so by the dominated convergence theorem,

∑_{k=0}^∞ t^k E(X^k) / k! = E(e^{tX}) ≤ e^{b²t²/2} = ∑_{k=0}^∞ (b²t²/2)^k / k!.

Therefore

1 + tE(X) + t²E(X²)/2 + O(t³) ≤ 1 + b²t²/2 + O(t⁴),

whence

tE(X) + t²E(X²)/2 ≤ b²t²/2 + o(t²),

and so, for t > 0,

E(X) + tE(X²)/2 ≤ b²t/2 + o(t).

First, letting t → 0⁺ this yields E(X) ≤ 0; dividing tE(X) + t²E(X²)/2 ≤ b²t²/2 + o(t²) instead by t < 0 and letting t → 0⁻ yields E(X) ≥ 0, so E(X) = 0. Second, since E(X) = 0,

tE(X²)/2 ≤ b²t/2 + o(t),

and then

E(X²) ≤ b² + o(1),

which means that E(X²) ≤ b². ∎

Stromberg attributes the following theorem to Saeki; further, it is proved in Stromberg that if for some t the inequality in the theorem is an equality then the random variable has the Rademacher distribution (Karl R. Stromberg, Probability for Analysts, p. 293, Proposition 9.9; Omar Rivasplata, Subgaussian random variables: An expository note, http://www.math.ualberta.ca/~orivasplata/publications/subgaussians.pdf).

Theorem 2.

If X is a random variable satisfying E(X) = 0 and P(X ∈ [-1, 1]) = 1, then

E(e^{tX}) ≤ cosh t for all t ∈ ℝ.
Proof.

Define f : ℝ → ℝ by

f(t) = e^t (cosh t - E(e^{tX})) = (e^{2t} + 1)/2 - e^t E(e^{tX}).

Then

f′(t) = e^{2t} - e^t E(e^{tX}) - e^t E(X e^{tX});

the derivative of E(e^{tX}) with respect to t is obtained using the dominated convergence theorem. Let Y = 1 + X, with which

f′(t) = e^{2t} - E(e^{tY}) - E(X e^{tY}) = e^{2t} - E(e^{tY}) - E((Y - 1) e^{tY}) = e^{2t} - E(Y e^{tY}).

E(X) = 0, so E(Y) = 1, hence

f′(t) = E(Y e^{2t}) - E(Y e^{tY}) = E(Y(e^{2t} - e^{tY})).

Because P(Y ∈ [0, 2]) = 1, for t ≥ 0 we have almost surely e^{2t} - e^{tY} ≥ 0, and therefore almost surely Y(e^{2t} - e^{tY}) ≥ 0. Therefore, for t ≥ 0,

f′(t) = E(Y(e^{2t} - e^{tY})) ≥ 0,

which tells us that for t ≥ 0,

f(0) ≤ f(t).

As f(0) = 0, for t ≥ 0,

0 ≤ e^t (cosh t - E(e^{tX})),

and so

E(e^{tX}) ≤ cosh t.

For t ≤ 0 we have almost surely e^{2t} - e^{tY} ≤ 0 and hence Y(e^{2t} - e^{tY}) ≤ 0, so f′(t) ≤ 0 and f(t) ≥ f(0) = 0, giving the same conclusion. Thus E(e^{tX}) ≤ cosh t for all t ∈ ℝ. ∎

Corollary 3.

If a random variable X satisfies E(X) = 0 and P(|X| ≤ b) = 1, then X is b-subgaussian. (Apply Theorem 2 to X/b: E(e^{tX}) = E(e^{(tb)(X/b)}) ≤ cosh(tb) ≤ e^{b²t²/2}, using cosh s ≤ e^{s²/2}.)

2 Hoeffding’s inequality

We first prove Hoeffding’s lemma (Stéphane Boucheron, Gábor Lugosi, and Pascal Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence, p. 27, Lemma 2.2).

Lemma 4 (Hoeffding’s lemma).

If a random variable X satisfies E(X) = 0 and P(X ∈ [a, b]) = 1, then X is (b-a)/2-subgaussian.

Proof.

Because P(X ∈ [a, b]) = 1, it follows that

Var(X) ≤ (b-a)²/4,

not using that E(X) = 0 (this is Popoviciu’s inequality).

Write μ = X_*P and for λ ∈ ℝ define

dν_λ(t) = (e^{λt}/e^{Λ_X(λ)}) dμ(t).

We check

∫_ℝ dν_λ(t) = (1/e^{Λ_X(λ)}) ∫_ℝ e^{λt} d(X_*P)(t) = (1/e^{Λ_X(λ)}) ∫_Ω e^{λX} dP = 1.

There is a random variable X_λ : (Ω_λ, ℱ_λ, P_λ) → ℝ for which (X_λ)_*P_λ = ν_λ. X_λ satisfies P_λ(X_λ ∈ [a, b]) = 1, and so

Var(X_λ) ≤ (b-a)²/4.

We calculate

Λ_X′(t) = E(X e^{tX})/E(e^{tX})

and

Λ_X″(t) = E(X² e^{tX})/E(e^{tX}) - E(X e^{tX})²/E(e^{tX})².

But

E(X_λ) = ∫_ℝ t dν_λ(t) = ∫_ℝ t (e^{λt}/e^{Λ_X(λ)}) dμ(t) = (1/e^{Λ_X(λ)}) E(X e^{λX})

and

E(X_λ²) = ∫_ℝ t² dν_λ(t) = (1/e^{Λ_X(λ)}) E(X² e^{λX}),

and so

Var(X_λ) = E(X_λ²) - E(X_λ)² = E(X² e^{λX})/e^{Λ_X(λ)} - E(X e^{λX})²/e^{2Λ_X(λ)} = Λ_X″(λ).

For λ ∈ ℝ, Taylor’s theorem tells us that there is some θ between 0 and λ such that

Λ_X(λ) = Λ_X(0) + λΛ_X′(0) + (λ²/2)Λ_X″(θ) = (λ²/2)Λ_X″(θ);

here we have used that Λ_X(0) = 0 and Λ_X′(0) = E(X) = 0. But from what we have shown, Var(X_θ) = Λ_X″(θ) and Var(X_θ) ≤ (b-a)²/4, so

Λ_X(λ) = (λ²/2) Var(X_θ) ≤ (λ²/2) · (b-a)²/4 = ((b-a)/2)² λ²/2,

which shows that X is (b-a)/2-subgaussian. ∎
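
The following is a minimal numerical check of Hoeffding’s lemma (an illustration only, not part of the text), assuming NumPy is available: it takes a discrete mean-zero law on [a, b] = [-2, 1] (an arbitrary choice), computes log E(e^{λX}) exactly from the atoms, and verifies that it stays below (b-a)²λ²/8.

import numpy as np

# Exact check of Hoeffding's lemma for a discrete mean-zero law on [a, b].
# The support {-2, 1} with weights {1/3, 2/3} is an arbitrary illustrative choice.
values = np.array([-2.0, 1.0])            # support, so [a, b] = [-2, 1]
probs = np.array([1.0, 2.0]) / 3.0        # weights; the mean is exactly 0
a, b = values.min(), values.max()

lambdas = np.linspace(-5.0, 5.0, 201)
# log E(e^{λX}) computed exactly from the atoms of the law
log_mgf = np.array([np.log(np.sum(probs * np.exp(lam * values))) for lam in lambdas])
bound = (b - a) ** 2 * lambdas ** 2 / 8.0

# the printed value should be ≤ 0 (up to floating point rounding)
print("largest value of log E(e^{λX}) - (b-a)²λ²/8:", np.max(log_mgf - bound))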

We now prove Hoeffding’s inequality (Stéphane Boucheron, Gábor Lugosi, and Pascal Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence, p. 34, Theorem 2.8).

Theorem 5 (Hoeffding’s inequality).

Let X_1, …, X_n be independent random variables such that for each 1 ≤ k ≤ n, P(X_k ∈ [a_k, b_k]) = 1, and write S_n = ∑_{k=1}^n X_k. For any a > 0,

P(S_n - E(S_n) ≥ a) ≤ exp(-2a² / ∑_{k=1}^n (b_k - a_k)²).
Proof.

For λ > 0 and ϕ(t) = e^{λt}, because ϕ is nonnegative and nondecreasing, for X a random variable we have

1_{X ≥ a} ϕ(a) ≤ ϕ(X),

and so E(1_{X ≥ a} ϕ(a)) ≤ E(ϕ(X)), i.e.

P(X ≥ a) ≤ E(e^{λX})/e^{λa}.

Using this with X = S_n - E(S_n) and because the X_k are independent,

P(S_n - E(S_n) ≥ a) ≤ (1/e^{λa}) E(e^{λ(S_n - E(S_n))}) = e^{-λa} ∏_{k=1}^n E(e^{λ(X_k - E(X_k))}).

Because P(X_k ∈ [a_k, b_k]) = 1, we have P(X_k - E(X_k) ∈ [a_k - E(X_k), b_k - E(X_k)]) = 1, and as (b_k - E(X_k)) - (a_k - E(X_k)) = b_k - a_k, Hoeffding’s lemma tells us

log E(e^{λ(X_k - E(X_k))}) ≤ (b_k - a_k)²λ²/8,

and thus

P(S_n - E(S_n) ≥ a) ≤ e^{-λa} exp(∑_{k=1}^n (b_k - a_k)²λ²/8) = exp(-λa + (λ²/8) ∑_{k=1}^n (b_k - a_k)²).

We remark that λ does not appear in the left-hand side. Define

g(λ) = -λa + (λ²/8) ∑_{k=1}^n (b_k - a_k)²,

for which

g′(λ) = -a + (λ/4) ∑_{k=1}^n (b_k - a_k)².

Then g′(λ) = 0 if and only if

λ = 4a / ∑_{k=1}^n (b_k - a_k)²,

at which g assumes its infimum. Then

P(S_n - E(S_n) ≥ a) ≤ exp(-4a²/∑_{k=1}^n (b_k - a_k)² + (16a²/8) · 1/∑_{k=1}^n (b_k - a_k)²) = exp(-2a²/∑_{k=1}^n (b_k - a_k)²),

proving the claim. ∎
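
As an illustration of Theorem 5 (not from the text), here is a minimal sketch, assuming NumPy is available, that compares the empirical tail of S_n - E(S_n) with the Hoeffding bound for X_k uniform on [0, 1]; the values of n, a, and the number of trials are arbitrary choices.

import numpy as np

# Compare the empirical tail P(S_n - E(S_n) ≥ a) with the Hoeffding bound
# exp(-2a² / ∑ (b_k - a_k)²) for X_k uniform on [0, 1], so a_k = 0 and b_k = 1.
# The values of n, a, and the number of trials are arbitrary illustrative choices.
rng = np.random.default_rng(1)
n, a, trials = 50, 5.0, 100_000

samples = rng.uniform(0.0, 1.0, size=(trials, n))
centered_sums = samples.sum(axis=1) - n * 0.5        # E(X_k) = 1/2

empirical = np.mean(centered_sums >= a)
hoeffding_bound = np.exp(-2.0 * a**2 / n)            # ∑ (b_k - a_k)² = n here

print("empirical tail probability:", empirical)
print("Hoeffding bound:           ", hoeffding_bound)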

3 Cramér’s large deviation theorem

The following is Cramér’s large deviation theorem (Achim Klenke, Probability Theory: A Comprehensive Course, p. 508, Theorem 23.3).

Theorem 6 (Cramér’s large deviation theorem).

Suppose that X_n : (Ω, ℱ, P) → ℝ, n ≥ 1, are independent identically distributed random variables such that for all t ∈ ℝ,

Λ(t) = log E(e^{tX_1}) < ∞.

For x ∈ ℝ define

Λ*(x) = sup_{t∈ℝ} (tx - Λ(t)).

If a > E(X_1), then

lim_{n→∞} (1/n) log P(S_n ≥ an) = -Λ*(a),

where S_n = ∑_{k=1}^n X_k.

Proof.

For a > E(X_1), let Y_n = X_n - a, let

L(t) = log E(e^{tY_1}) = log E(e^{tX_1} e^{-ta}) = -ta + Λ(t)

and let

L*(x) = sup_{t∈ℝ} (tx - L(t)) = sup_{t∈ℝ} (t(x + a) - Λ(t)) = Λ*(x + a).

Lastly, let T_n = ∑_{k=1}^n Y_k = S_n - na, with which

P(T_n ≥ bn) = P(S_n ≥ (b + a)n).

Thus, if we have

lim_{n→∞} (1/n) log P(T_n ≥ 0) = -L*(0),

then

lim_{n→∞} (1/n) log P(S_n ≥ an) = -L*(0) = -Λ*(a).

Therefore it suffices to prove the theorem when E(X_1) < 0 and a = 0.

Define

ϕ(t) = e^{Λ(t)} = E(e^{tX_1}) = ∫_Ω e^{tX_1} dP = ∫_ℝ e^{tx} d((X_1)_*P)(x), t ∈ ℝ,

the moment generating function of X_1, and define

ρ = e^{-Λ*(0)} = exp(-sup_{t∈ℝ} (-Λ(t))) = exp(inf_{t∈ℝ} Λ(t)) = inf_{t∈ℝ} ϕ(t),

using that x ↦ e^x is increasing.

Using the dominated convergence theorem, for k ≥ 0 we obtain

ϕ^{(k)}(t) = ∫_ℝ x^k e^{tx} d((X_1)_*P)(x) = E(X_1^k e^{tX_1}).

In particular, ϕ′(t) = E(X_1 e^{tX_1}), for which ϕ′(0) = E(X_1) < 0, and ϕ″(t) = E(X_1² e^{tX_1}) > 0 for all t (the expectation is either 0 or positive, and if it were 0 then X_1² e^{tX_1}, and hence X_1, would be 0 almost surely, which contradicts E(X_1) < 0).

Either P(X_1 ≤ 0) = 1 or P(X_1 ≤ 0) < 1. In the first case,

ϕ′(t) = ∫_Ω X_1 e^{tX_1} dP = ∫_{X_1 ≤ 0} X_1 e^{tX_1} dP ≤ 0,

so, using the dominated convergence theorem,

ρ = inf_{t∈ℝ} ϕ(t) = lim_{t→∞} ϕ(t) = ∫_{X_1 ≤ 0} (lim_{t→∞} e^{tX_1}) dP = ∫_{X_1 = 0} dP = P(X_1 = 0).

Then

P(S_n ≥ 0) = P(X_1 = 0, …, X_n = 0) = P(X_1 = 0) ⋯ P(X_n = 0) = ρⁿ.

That is, as a = 0,

P(S_n ≥ an) = e^{-nΛ*(a)},

and the claim is immediate in this case.

In the second case, P(X_1 ≤ 0) < 1. Then ϕ(t) → ∞ as t → ∞; also P(X_1 < 0) > 0 because E(X_1) < 0, so ϕ(t) → ∞ as t → -∞. As moreover ϕ″(t) > 0 for all t, there is some τ at which ϕ(τ) < ϕ(t) for all t ≠ τ (namely, a unique global minimum). Thus,

ϕ(τ) = ρ, ϕ′(τ) = 0.

And ϕ′(0) = E(X_1) < 0, which with the above yields τ > 0. Because τ > 0, S_n(ω) ≥ 0 if and only if τS_n(ω) ≥ 0 if and only if e^{τS_n(ω)} ≥ 1. Applying Chebyshev’s inequality, and because the X_n are independent,

P(S_n ≥ 0) = P(e^{τS_n} ≥ 1) ≤ E(e^{τS_n}) = E(e^{τX_1}) ⋯ E(e^{τX_n}) = ϕ(τ)ⁿ = ρⁿ,

thus log P(S_n ≥ 0) ≤ n log ρ and then

lim sup_{n→∞} (1/n) log P(S_n ≥ 0) ≤ lim sup_{n→∞} log ρ = log ρ = log e^{-Λ*(0)} = -Λ*(0).

To prove the claim, it now suffices to prove that, in the case P(X_1 ≤ 0) < 1,

lim inf_{n→∞} (1/n) log P(S_n ≥ 0) ≥ log ρ. (1)

Let μ = (X_1)_*P, and let

dν(x) = (e^{τx}/ρ) dμ(x).

ν is a Borel probability measure: it is apparent that it is a Borel measure, and

ν(ℝ) = ∫_ℝ dν(x) = ∫_ℝ (e^{τx}/ρ) dμ(x) = (1/ρ) ∫_ℝ e^{τx} dμ(x) = ϕ(τ)/ρ = 1.

There are independent identically distributed random variables Y_n, n ≥ 1, defined on some probability space (Ω′, ℱ′, Q), each with (Y_n)_*Q = ν (Gerald B. Folland, Real Analysis: Modern Techniques and Their Applications, p. 329, Corollary 10.19). Define

ψ(t) = E(e^{tY_1}) = ∫_ℝ e^{tx} dν(x) = ∫_ℝ e^{tx} (e^{τx}/ρ) dμ(x) = (1/ρ) ∫_ℝ e^{(t+τ)x} dμ(x) = ϕ(t + τ)/ρ,

the moment generating function of Y_1. As ϕ′(τ) = 0,

E(Y_1) = ψ′(0) = ϕ′(τ)/ρ = 0.

As ρ > 0 and ϕ″(t) > 0 for all t,

Var(Y_1) = E(Y_1²) = ψ″(0) = ϕ″(τ)/ρ ∈ (0, ∞).

For T_n = ∑_{k=1}^n Y_k, using that the X_n are independent and that the Y_n are independent,

P(S_n ≥ 0) = ∫_{[0,∞)} d((S_n)_*P)
= ∫_{x_1 + ⋯ + x_n ≥ 0} dμ(x_1) ⋯ dμ(x_n)
= ∫_{x_1 + ⋯ + x_n ≥ 0} (ρ e^{-τx_1} dν(x_1)) ⋯ (ρ e^{-τx_n} dν(x_n))
= ρⁿ ∫_{x_1 + ⋯ + x_n ≥ 0} e^{-τ(x_1 + ⋯ + x_n)} dν(x_1) ⋯ dν(x_n).

But

∫_{x_1 + ⋯ + x_n ≥ 0} e^{-τ(x_1 + ⋯ + x_n)} dν(x_1) ⋯ dν(x_n) = ∫_{T_n ≥ 0} e^{-τT_n} dQ = E(1_{T_n ≥ 0} e^{-τT_n}),

hence

P(S_n ≥ 0) = ρⁿ E(1_{T_n ≥ 0} e^{-τT_n}).

Thus, (1) is equivalent to

lim inf_{n→∞} (1/n) log(ρⁿ E(1_{T_n ≥ 0} e^{-τT_n})) ≥ log ρ,

so, to prove the claim it suffices to prove that

lim inf_{n→∞} (1/n) log E(1_{T_n ≥ 0} e^{-τT_n}) ≥ 0.

For any c>0,

log E(1_{T_n ≥ 0} e^{-τT_n}) ≥ log E(1_{0 ≤ T_n ≤ c√n} e^{-τT_n})
≥ log(e^{-τc√n} Q(0 ≤ T_n ≤ c√n))
= -τc√n + log Q(T_n/√n ∈ [0, c]).

Because the Y_n are independent identically distributed L² random variables with mean 0 and variance σ² = Var(Y_1) = ϕ″(τ)/ρ, the central limit theorem tells us that as n → ∞,

Q(T_n/√n ∈ [0, c]) → γ_{0,σ²}([0, c]),

where γ_{a,σ²} is the Gaussian measure, whose density with respect to Lebesgue measure on ℝ is

p(t, a, σ²) = (1/(σ√(2π))) e^{-(t-a)²/(2σ²)}.

Thus, because for c > 0 we have γ_{0,σ²}([0, c]) > 0,

lim inf_{n→∞} (1/n) log E(1_{T_n ≥ 0} e^{-τT_n}) ≥ lim inf_{n→∞} (-τc/√n + (1/n) log Q(T_n/√n ∈ [0, c]))
= lim_{n→∞} (-τc/√n) + lim_{n→∞} (1/n) log Q(T_n/√n ∈ [0, c])
= lim_{n→∞} (1/n) log γ_{0,σ²}([0, c])
= 0,

which completes the proof. ∎

For example, say that X_n are independent identically distributed random variables with (X_1)_*P = γ_{0,1}. We calculate that the cumulant generating function Λ(t) = log E(e^{tX_1}) is

Λ(t) = log(∫_ℝ e^{tx} dγ_{0,1}(x))
= log(∫_ℝ e^{tx} (1/√(2π)) e^{-x²/2} dx)
= log(e^{t²/2} ∫_ℝ (1/√(2π)) e^{-(x-t)²/2} dx)
= log e^{t²/2}
= t²/2,

thus Λ(t) < ∞ for all t ∈ ℝ. Then

Λ*(x) = sup_{t∈ℝ} (tx - Λ(t)) = sup_{t∈ℝ} (tx - t²/2) = x²/2.

Now applying Cramér’s theorem we get that for a > E(X_1) = 0, with S_n = ∑_{k=1}^n X_k we have

lim_{n→∞} (1/n) log P(S_n ≥ an) = -a²/2.
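
Since S_n here has the N(0, n) distribution, P(S_n ≥ an) can be evaluated exactly, and one can watch (1/n) log P(S_n ≥ an) approach -a²/2. The sketch below is an illustration only, assuming NumPy and SciPy are available; the choice a = 0.5 and the values of n are arbitrary.

import numpy as np
from scipy.stats import norm

# For X_k i.i.d. N(0, 1), S_n has the N(0, n) distribution, so
# P(S_n ≥ a n) = P(Z ≥ a √n) for Z ~ N(0, 1).
# Check that (1/n) log P(S_n ≥ a n) approaches -a²/2; a is an arbitrary choice.
a = 0.5
for n in [10, 100, 1000, 10000]:
    log_tail = norm.logsf(a * np.sqrt(n))   # log P(Z ≥ a√n), computed without underflow
    print(f"n = {n:6d}: (1/n) log P(S_n ≥ an) = {log_tail / n:.5f}")

print(f"limit from Cramér's theorem: {-a**2 / 2:.5f}")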

Another example: suppose that X_n are independent identically distributed random variables with the Rademacher distribution,

(X_n)_*P = (1/2)δ_{-1} + (1/2)δ_1.

Then

E(e^{tX_1}) = ∫_ℝ e^{tx} d((1/2)δ_{-1} + (1/2)δ_1)(x) = (1/2)e^{-t} + (1/2)e^t = cosh t,

so the cumulant generating function of X_1 is

Λ(t) = log cosh t,

and indeed Λ(t) < ∞ for all t ∈ ℝ. Then, as d/dt (tx - log cosh t) = x - tanh t,

Λ*(x) = sup_{t∈ℝ} (tx - log cosh t) = x arctanh x - log cosh(arctanh x) for x ∈ (-1, 1).

For x ∈ (-1, 1),

arctanh x = (1/2) log((1+x)/(1-x)).

Then

cosh(arctanh x) = (1/2)(e^{arctanh x} + e^{-arctanh x}) = (1/2)√((1+x)/(1-x)) + (1/2)√((1-x)/(1+x)) = 1/√(1-x²).

With these identities,

Λ*(x) = (x/2) log((1+x)/(1-x)) + (1/2) log(1-x²)
= (x/2) log(1+x) - (x/2) log(1-x) + (1/2) log(1+x) + (1/2) log(1-x)
= ((1+x)/2) log(1+x) + ((1-x)/2) log(1-x).

With S_n = ∑_{k=1}^n X_k, applying Cramér’s theorem, we get that for any a with 0 < a < 1 (so that a > E(X_1) = 0),

lim_{n→∞} (1/n) log P(S_n ≥ an) = -((1+a)/2) log(1+a) - ((1-a)/2) log(1-a).
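
In this example S_n = 2B - n with B binomial(n, 1/2), so P(S_n ≥ an) = P(B ≥ n(1+a)/2) can be computed exactly and compared with the rate -Λ*(a). The following sketch is an illustration only, assuming NumPy and SciPy are available; the choice a = 0.3 and the values of n are arbitrary.

import numpy as np
from scipy.stats import binom

# For Rademacher X_k, S_n = 2B - n with B ~ Binomial(n, 1/2), so
# P(S_n ≥ a n) = P(B ≥ n(1 + a)/2).  Compare (1/n) log P(S_n ≥ an) with -Λ*(a).
# The value a = 0.3 is an arbitrary choice in (0, 1).
a = 0.3
rate = (1 + a) / 2 * np.log(1 + a) + (1 - a) / 2 * np.log(1 - a)   # Λ*(a)

for n in [10, 100, 1000, 10000]:
    k = int(np.ceil(n * (1 + a) / 2))          # P(S_n ≥ an) = P(B ≥ k)
    log_tail = binom.logsf(k - 1, n, 0.5)      # log P(B > k - 1) = log P(B ≥ k)
    print(f"n = {n:6d}: (1/n) log P(S_n ≥ an) = {log_tail / n:.5f}")

print(f"limit from Cramér's theorem: {-rate:.5f}")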

For a Borel probability measure μ on ℝ, we define its Laplace transform μ̌ : ℝ → (0, ∞] by

μ̌(t) = ∫_ℝ e^{ty} dμ(y).

Suppose that ∫_ℝ |y| dμ(y) < ∞ and let M_1 = ∫_ℝ y dμ(y), the first moment of μ. For any t ∈ ℝ the function x ↦ e^{tx} is convex, so by Jensen’s inequality,

e^{tM_1} ≤ ∫_ℝ e^{ty} dμ(y) = μ̌(t).

Thus for all t ∈ ℝ,

tM_1 - log μ̌(t) ≤ 0.

For a Borel probability measure μ with finite first moment, we define its Cramér transform I_μ : ℝ → [0, ∞] (Heinz Bauer, Probability Theory, pp. 89–90, §12) by

I_μ(x) = sup_{t∈ℝ} (tx - log μ̌(t)).

For t = 0, tx - log μ̌(t) = -log μ̌(0) = -log 1 = 0, which shows that indeed I_μ(x) ≥ 0 for all x ∈ ℝ. But tM_1 - log μ̌(t) ≤ 0 for all t yields

I_μ(M_1) = 0.
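
As a final illustration (not from the text), assuming NumPy is available, the following sketch approximates the Cramér transform of the measure μ = 0.3δ_0 + 0.7δ_2 (an arbitrary choice) by maximizing tx - log μ̌(t) over a grid of t values; one sees I_μ(M_1) ≈ 0 at the first moment M_1 = 1.4 while I_μ(x) > 0 elsewhere.

import numpy as np

# Numerically approximate the Cramér transform I_μ(x) = sup_t (tx - log μ̌(t))
# for μ = 0.3 δ_0 + 0.7 δ_2 (an arbitrary choice) and check that I_μ(M_1) = 0.
atoms = np.array([0.0, 2.0])
weights = np.array([0.3, 0.7])
M1 = float(np.dot(atoms, weights))                     # first moment, here 1.4

ts = np.linspace(-20.0, 20.0, 40001)                   # crude grid for the supremum
log_laplace = np.log(weights @ np.exp(np.outer(atoms, ts)))   # log μ̌(t) on the grid

def cramer_transform(x):
    # supremum over the grid of t ↦ tx - log μ̌(t)
    return np.max(ts * x - log_laplace)

for x in [0.5, 1.0, M1, 1.8]:
    print(f"I_μ({x:.2f}) ≈ {cramer_transform(x):.6f}")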