The Berry-Esseen theorem

Jordan Bell
June 3, 2015

1 Cumulative distribution functions

For a random variable X:(Ω,,P), we define its cumulative distribution function FX:R[0,1] by

FX(x)=P(Xx)={Xx}dP=txd(X*P)(t)=(X*P)((-,x]).

A distribution function is a function F:[0,1] such that (i) F(-)=limx-F(x)=0, (ii) F()=limxF(x)=1, (iii) F is nondecreasing, (iv) F is right-continuous: for each x,

F(x+)=limtxF(t)=F(x).

It is a fact that the cumulative distribution function of a random variable is a distribution function and that for any distribution function F there is a random variable X for which F=FX.

Let γ1 be the standard Gaussian measure on R: γ1 has density

p(t,0,1)=12πe-t2/2

with respect to Lebesgue measure on . Let Φ be the cumulative distribution function of γ1:

Φ(x)=γ1((-,x])=-x𝑑γ1(t)=-x12πe-t2/2𝑑t.

We first prove the following lemma about distribution functions.11 1 Kai Lai Chung, A Course in Probability Theory, third ed., p. 236, Lemma 1; cf. Allan Gut, Probability: A Graduate Course, second ed., p. 358, Lemma 6.1.

Lemma 1.

Suppose that F is a distribution function, that G: satisfies

G(-)=limx-G(x)=0,G()=limxG(x)=1,

and that G is differentiable and its derivative satisfies

M=supx|G(x)|<. (1)

Writing

Δ=12Msupx|F(x)-G(x)|,

there is some a such that for all T>0,

2MTΔ(30TΔ1-cosxx2𝑑x-π)|1-cosTxx2(F(x+a)-G(x+a))𝑑x|.
Proof.

Because G(-)=0 and G()=1, there is some compact interval K such that -1<G(x)<2 for xK. Then, because G is continuous it is bounded on K, showing that G is bounded on , and because M>0 we get Δ<.

Write H=F-G. Because H()=0 and H(-)=0, there is a compact interval K for which

2MΔ=supx|H(x)|=supxK|H(x)|.

By the Bolzano-Weierstrass theorem, either there is a sequence unK increasing to some uK such that |H(un)|2MΔ or there is a sequence unK decreasing to some uK such that |H(un)|2MΔ.22 2 In the proof in Chung there are merely two cases but it is not explained why those are exhaustive. In the first case, either there is a subsequence vn of un such that |H(vn)|=H(vn) or there is a subsequence vn of un such that |H(vn)|=-H(vn). In the first subcase we get H(u-)=2MΔ, thus

F(u-)-G(u)=2MΔ. (2)

In the second subcase we get H(u-)=-2MΔ, thus

F(u-)-G(u)=-2MΔ. (3)

In the second case, either there is a subsequence vn of un such that |H(vn)|=H(vn) or there is a subsequence vn of un such that |H(vn)=-H(vn). In the first subcase we get H(u+)=2MΔ, thus

F(u)-G(u)=2MΔ. (4)

In the second subcase we get H(u+)=2MΔ, thus

F(u)-G(u)=-2MΔ. (5)

We now deal with the subcase (3). Let a=u-Δ. For |x|<Δ, by (1) we have

|G(x+a)-G(u)|=|uu+x-ΔG(y)𝑑y||x-Δ|M=(Δ-x)M,

whence

G(x+a)G(u)+(x-Δ)M.

Because x+a=x+u-Δ<u and as F is nondecreasing and using (3),

F(x+a)-G(x+a) F(u-)-G(x+a)
F(u-)-(G(u)+(x-Δ)M)
=-M(x+Δ).

Then, because x1-cosTxx2x is an odd function,

-ΔΔ1-cosTxx2(F(x+a)-G(x+a))𝑑x -M-ΔΔ1-cosTxx2(x+Δ)𝑑x
=-2MΔ0Δ1-cosTxx2𝑑x.

On the other hand,

|(-,-Δ)(Δ,)1-cosTxx2(F(x+a)-G(x+a))𝑑x|2MΔ(-,-Δ)(Δ,)1-cosTxx2𝑑x=4MΔΔ1-cosTxx2𝑑x.

Thus

1-cosTxx2(F(x+a)-G(x+a))𝑑x-2MΔ0Δ1-cosTxx2𝑑x+4MΔΔ1-cosTxx2𝑑x=2MΔ(-30Δ1-cosTxx2𝑑x+201-cosTxx2𝑑x)=2MΔ(-30Δ1-cosTxx2𝑑x+2πT2)=2MTΔ(-30TΔ1-cosxx2𝑑x+π),

which yields the claim of the lemma, for the subcase (3). ∎

We now prove a lemma that gives an inequality for characteristic functions.33 3 Kai Lai Chung, A Course in Probability Theory, third ed., p. 237, Lemma 2; Zhengyan Lin and Zhidong Bai, Probability Inequalities, p. 29, Theorem 4.1.a. We remark that because F is a distribution function, it makes sense to speak about the measure induced by F, and because G is of bounded variation and is continuous, its variation function VG is continuous and the functions VG-G and VG are nondecreasing, and it thus makes sense to speak about the signed measure induced by G=VG-(VG-G), which is equal to the difference of the measures induced by VG and VG-G.

Lemma 2.

Suppose that F is a distribution function, that G: satisfies

G(-)=limx-G(x)=0,G()=limxG(x)=1,

that G is differentiable and of bounded variation and that its derivative satisfies

M=supx|G(x)|<,

and that

|F-G|𝑑x<.

Write

Δ=12Msupx|F(x)-G(x)|

and

f(t)=eitx𝑑F(x),g(t)=eitx𝑑G(x).

Then for all T>0,

Δ1πM0T|f(t)-g(t)|t𝑑t+12πT.
Proof.

For any t, because (F-G)(-)=0 and (F-G)()=0 and because |F-G|𝑑x<, integrating by parts gives

f(t)-g(t) =eitx𝑑F(x)-eitx𝑑G(x)
=eitxd(F-G)(x)
=-it(F-G)(x)eitx𝑑x.

Take a to be the real number that Lemma 1 yields. As

f(t)-g(t)-ite-ita(T-|t|)=(T-|t|)(F(x+a)-G(x+a))eitx𝑑x,

we obtain, using Fubini’s theorem,

-TTf(t)-g(t)-ite-ita(T-|t|)𝑑t=-TT((T-|t|)(F(x+a)-G(x+a))eitx𝑑x)𝑑t=(F(x+a)-G(x+a))(-TT(T-|t|)eitx𝑑t)𝑑x=2(F(x+a)-G(x+a))1-cosTxx2𝑑x.

Therefore, because F and G are real valued and thus |f(-t)-g(-t)|=|f(t)-g(t)¯|=|f(t)-g(t)|,

|(F(x+a)-G(x+a))1-cosTxx2𝑑x| 12-TT|f(t)-g(t)||t|(T-|t|)𝑑t
=0T|f(t)-g(t)|t(T-t)𝑑t
T0T|f(t)-g(t)|t𝑑t.

Using this with Lemma 1,

2MTΔ(30TΔ1-cosxx2𝑑x-π) T0T|f(t)-g(t)|t𝑑t.

But

30TΔ1-cosxx2𝑑x-π =301-cosxx2𝑑x-3TΔ1-cosxx2𝑑x-π
301-cosxx2𝑑x-6TΔ1x2𝑑x-π
=3π2-6TΔ-π
=π2-6TΔ,

with which we have

2MTΔ(30TΔ1-cosxx2𝑑x-π)2MTΔ(π2-6TΔ)=MTΔπ-12M,

and hence

MTΔπ-12MT0T|f(t)-g(t)|t𝑑t,

i.e.

Δ12πT+1πM0T|f(t)-g(t)|t𝑑t,

proving the claim. ∎

2 Berry-Esseen theorem

Let Xn,j, n1, 1jkn, be L3 random variables, with kn, such that for each n, the random variables Xn,j, 1jkn, are independent, and such that for all n and j,

E(Xn,j)=0.

Let Fn,j be the cumulative distribution function of Xn,j:

Fn,j(x)=P(Xn,jx).

Let fn,j be the characteristic function of Xn,j (equivalently, the characteristic function of Fn,j):

fn,j(t)=eitxd(Xn,j*P)(x)=eitx𝑑Fn,j(x).

Write, for n1,

Sn=j=1knXn,j,

and let Fn be the cumulative distribution function of Sn:

Fn(x)=P(Snx)

Also, let fn be the characteristic function of Sn (equivalently, the characteristic function of Fn). Because Xn,j, 1jkn, are independent, we have Sn*P=(Xn,1*P)**(Xn,kn*P) and hence

fn(t)=eitxd(Sn*P)(x)=j=1knfn,j(t).

For n1 and 1jkn, write

σn,j2=E(Xn,j2),sn2=j=1knσn,j2

and

γn,j=E(|Xn,j|3),Γn=j=1knγn,j.

We further assume that for each n,

sn2=j=1knσn,j2=1. (6)

We will use the following inequality which we state separately because it is of general use.

Lemma 3.

For n1 and |z|<1,

|log(1+z)-m=1n-1(-1)m-1zmm||z|nn(1-|z|).

We now prove an inequality for fn, the characteristic function of Sn.44 4 Kai Lai Chung, A Course in Probability Theory, third ed., p. 239, Lemma 3.

Lemma 4.

For n1, if |t|<12Γn1/3 then

|fn(t)-e-t2/2|Γn|t|3e-t2/2.
Proof.

For 1jkn and l0 and v,

fn,j(l)(v)=(i)lE(Xn,jleivXn,j).

Thus

fn,j(0)=1,fn,j(0)=iE(Xn,j)=0,fn,j′′(0)=-E(Xn,j2)=-σn,j2,

and

fn,j′′′(v)=-iE(Xn,j3eivXn,j).

Then by Taylor’s theorem, there is some s between 0 and t such that

fn,j(t)=1-σn,j22t2-iE(Xn,j3eisXn,j)6t3.

Put

-iE(Xn,j3eisXn,j)=θγn,j,

for which

|θ|=|E(Xn,j3eisXn,j)|E(|Xn,j|3)1.

Because the L2 norm is upper bounded by the L3 norm and because |t|<12Γn1/3,

|σn,jt||γn,j1/3t||Γn1/3t|<12,

and hence

|fn,j(t)-1| =|-σn,j22t2+θγn,jt36|
12|σn,jt|2+γn,j48Γn
<18+148
<14.

Lemma 3 and the inequality |a+b|22(|a|2+|b|2) then tell us that

|logfn,j(t)-(fn,j(t)-1)| |fn,j(t)-1|22(1-|fn,j(t)-1|)
<23|fn,j(t)-1|2
=23|-σn,j22t2+θγn,j6t3|2
43(σn,j44t4+|θ|2γn,j236t6).

Because σn,jγn,j1/3 and |θ|1,

|logfn,j(t)-(fn,j(t)-1)| 43(σn,jγn,j4t4+γn,j236t6)
=43(|σn,jt|4+|γn,j1/3t|336)γn,j|t|3
43(124+1836)γn,j|t|3
=37216γn,j|t|3
<15γn,j|t|3.

Combining this with fn,j(t)=1-σn,j22t2+θγn,j6t3,

|logfn,j(t)+σn,j22t2||θγn,j6t3|+15γn,j|t|316γn,j|t|3+15γn,j|t|312γn,j|t|3.

Because this is true for each 1jkn and because, according to (6), j=1knσn,j2=1,

|logfn(t)+t22||t|322j=1knγn,j=|t|32Γn.

For any z it is true that |ez-1||z|e|z|, so the above yields

|fn(t)et2/2-1| =|exp(log(fn(t)et2/2))-1|
|log(fn(t)et2/2)|exp(|log(fn(t)et2/2)|)
=|logfn(t)+t22|exp(|log(fn(t)et2/2)|)
|t|32Γnexp(|t|32Γn).

But |t|3<18Γn, so

|fn(t)et2/2-1||t|32Γne1/16|t|3Γn,

which completes the proof. ∎

The next lemma gives a different bound on the characteristic function of Sn.55 5 Kai Lai Chung, A Course in Probability Theory, third ed., p. 240, Lemma 4.

Lemma 5.

For n1, if |t|<14Γn then

|fn(t)|e-t2/3.
Proof.

First, for a distribution function F with characteristic function f,

|f(t)|2 =f(t)f(t)¯
=eitx𝑑F(x)e-itx𝑑F(y)
=(eit(x-y)𝑑F(x))𝑑F(y)
=(cost(x-y)+isint(x-y)dF(x))𝑑F(y).

Because |f(t)|2 is real it follows that

|f(t)|2=(cost(x-y)𝑑F(x))𝑑F(y).

Using

|cosu-(1-u22)||u|36,|a+b|p2p-1(|a|p+|b|p),

we have

|cost(x-y)-(1-(t(x-y))22)|23(|tx|3+|ty|3)=2|t|33(|x|3+|y|3)

and then

|f(t)|2 (1-(t(x-y))22+2|t|33(|x|3+|y|3)dF(x))𝑑F(y).

Using this for fn,j, and using that E(Xn,j)=0,

|fn,j(t)|2 (1-(t(x-y))22+2|t|33(|x|3+|y|3)dFn,j(x))𝑑Fn,j(y)
=1-t2σn,j22-t2y22+2|t|3γn,j3+2|t|3|y|33dF(y)
=1-t2σn,j22-t2σn,j22+2|t|3γn,j3+2|t|3γn,j3
=1-t2σn,j2+4|t|3γn,j3.

Because 1+ueu for all u,

|fn,j(t)|2exp(-t2σn,j2+4|t|3γn,j3).

Then, by (6),

|fn(t)|2 =j=1kn|fn,j(t)|2
j=1knexp(-t2σn,j2+4|t|3γn,j3)
=exp(-t2j=1knσn,j2+4|t|33j=1knγn,j)
=exp(-t2+4|t|33Γn).

As |t|<14Γn,

|fn(t)|exp(-t22+2|t|33Γn)exp(-t22+2|t|212)=e-t2/3,

proving the claim. ∎

We now combine Lemma 4 and Lemma 5.66 6 Kai Lai Chung, A Course in Probability Theory, third ed., p. 240, Lemma 5.

Lemma 6.

For n1, if |t|<14Γn then

|fn(t)-e-t2/2|16Γn|t|3e-t2/3.
Proof.

Either |t|<12Γn1/3 or 12Γn1/3|t|<14Γn. In the first case, Lemma 4 tells us

|fn(t)-e-t2/2|Γn|t|3e-t2/2Γn|t|3e-t2/316Γn|t|3e-t2/3.

In the second case, Lemma 5 tells us

|fn(t)|e-t2/3,

and so, as in this case we have 18Γn|t|3,

|fn(t)-e-t2/2||fn(t)|+e-t2/2e-t2/3+e-t2/22e-t2/316Γn|t|3e-t2/3,

showing that the claim is true in both cases. ∎

We finally prove the Berry-Esseen theorem.77 7 Kai Lai Chung, A Course in Probability Theory, third ed., p. 235, Theorem 7.4.1; cf. Allan Gut, Probability: A Graduate Course, second ed., p. 356, Theorem 6.2; John E. Kolassa, Series Approximation Methods in Statistics, p. 25, Theorem 2.6.1; Alexandr A. Borovkov, Probability Theory, p. 659, Theorem A5.1; Ivan Nourdin and Giovanni Peccati, Normal Approximations with Malliavin Calculus: From Stein’s Method to Universality, p. 71, Theorem 3.7.1.

Theorem 7 (Berry-Esseen theorem).

There is some A0<36 such that for each n1,

supx|Fn(x)-Φ(x)|A0Γn.
Proof.

Let Z be a random variable with Z*P=γ1, i.e. whose cumulative distribution function is Φ. By (6) and because Xn,j, 1jkn, are independent and satisfy E(Xn,j)=0,

E(Sn2)=j=1knE(Xn,j2)=j=1knσn,j2=1.

If x<0 then by Chebyshev’s inequality

Fn(x)=P(Snx)=P(-Sn-x)1x2E(|Sn|2)=1x2

and

Φ(x)=P(Zx)=P(-Z-x)1x2E(|Z|2)=1x2.

If x>0 then also by Chebyshev’s inequality

1-Fn(x)=1-P(Snx)=P(Sn>x)1x2

and

1-Φ(x)=1-P(Zx)=P(Z>x)1x2.

Therefore, because Fn and Φ are nonnegative and 1-Fn and 1-Φ are nonnegative, for all x we have

|Fn(x)-Φ(x)|1x2.

Then, because |Fn|1 and |Φ|1,

|Fn(x)-Φ(x)|𝑑x|x|12𝑑x+|x|>11x2𝑑x=6<.

Φ(x)=12πe-x2/212π. We apply Lemma 2 with F=Fn, G=Φ, and M=12π, and because the characteristic function of Φ is ϕ(t)=e-t2/2, we obtain for T=14Γn,

supx|Fn(x)-Φ(x)| 2π014Γn|fn(t)-ϕ(t)|t𝑑t+96MΓnπ
=2π014Γn|fn(t)-e-t2/2|t+96Γnπ2π.

Then applying Lemma 6,

supx|Fn(x)-Φ(x)| 2π014Γn16Γnt3e-t2/3t𝑑t+96Γnπ2π
=Γn(32π014Γnt2e-t2/3𝑑t+96π2π).

This proves the claim with

A0=32π0t2e-t2/3𝑑t+96π2π=32π33π4+96π2π=35.64.