The weak and strong laws of large numbers
1 Introduction
Using Egorov’s theorem, one proves that if a sequence of random variables $X_n$ converges almost surely to a random variable $X$, then $X_n$ converges in probability to $X$.¹ [V. I. Bogachev, Measure Theory, volume I, p. 111, Theorem 2.2.3; http://individual.utoronto.ca/jordanbell/notes/L0.pdf, p. 3, Theorem 3.]
Let $X_n$, $n \ge 1$, be a sequence of random variables, and write $S_n = \sum_{k=1}^n X_k$. A weak law of large numbers is a statement that

$$\frac{S_n}{n} - \frac{E(S_n)}{n} \tag{1}$$

converges in probability to $0$. A strong law of large numbers is a statement that (1) converges almost surely to $0$. Thus, if the hypotheses assumed on the sequence of random variables are the same, a strong law implies a weak law, because almost sure convergence implies convergence in probability.
We shall prove the weak law of large numbers for a sequence of independent identically distributed random variables, and the strong law of large numbers under the same hypotheses. We give separate proofs for these theorems as an occasion to inspect different machinery, although to establish the weak law it thus suffices to prove the strong law. One reason to distinguish these laws is for cases when we impose different hypotheses.² [cf. Jordan M. Stoyanov, Counterexamples in Probability, third ed., p. 163, §15.3; Dieter Landers and Lothar Rogge, Identically distributed uncorrelated random variables not fulfilling the WLLN, Bull. Korean Math. Soc. 38 (2001), no. 3, 605–610.]
We also prove Markov’s weak law of large numbers, which states that if $X_n$ is a sequence of random variables in $L^2(P)$ that are pairwise uncorrelated and

$$\frac{1}{n^2} \sum_{k=1}^n \mathrm{Var}(X_k) \to 0, \qquad n \to \infty,$$

then $\frac{S_n}{n} - \frac{E(S_n)}{n}$ converges to $0$ in $L^2(P)$, from which it follows using Chebyshev’s inequality that it converges in probability to $0$. (We remark that a sequence of random variables converging in $L^2(P)$ to $0$ does not imply that it converges almost surely to $0$, although there is indeed a subsequence that converges almost surely to $0$.³ [http://individual.utoronto.ca/jordanbell/notes/L0.pdf, p. 4, Theorem 5.])
If $(\Omega, \mathcal{F}, P)$ is a probability space, $(Y, \mathcal{A})$ is a measurable space, and $X : \Omega \to Y$ is measurable, the pushforward measure of $P$ by $X$ is

$$(X_* P)(A) = P(X^{-1}(A)), \qquad A \in \mathcal{A}.$$

Then $(Y, \mathcal{A}, X_* P)$ is a probability space. We remind ourselves of the change of variables theorem, which we shall use.⁴ [Charalambos D. Aliprantis and Kim C. Border, Infinite Dimensional Analysis: A Hitchhiker’s Guide, third ed., p. 485, Theorem 13.46.] Let $g : Y \to \mathbb{R}$ be a function. On the one hand, if $g \in L^1(X_* P)$ then $g \circ X \in L^1(P)$ and

$$\int_Y g \, d(X_* P) = \int_\Omega (g \circ X) \, dP. \tag{2}$$

On the other hand, if $g$ is $\mathcal{A}$-measurable and $g \circ X \in L^1(P)$, then $g \in L^1(X_* P)$ and (2) holds.
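On a finite probability space the pushforward measure and the change of variables identity (2) can be checked by direct summation. The following Python sketch is ours, not part of the text: $\Omega$ is a four-point space, $Y = \{0, 1, 2\}$, and the names (`pushforward`, `g`) are illustrative.

```python
from fractions import Fraction

# A four-point probability space (Omega, P) and a measurable map X into Y = {0, 1, 2}.
P = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 8), "d": Fraction(1, 8)}
X = {"a": 0, "b": 1, "c": 1, "d": 2}

# Pushforward measure: (X_*P)({y}) = P(X^{-1}({y})).
pushforward = {}
for omega, p in P.items():
    pushforward[X[omega]] = pushforward.get(X[omega], Fraction(0)) + p

g = lambda y: y * y + 1  # an arbitrary function on Y

# Change of variables: the integral of g d(X_*P) equals the integral of g∘X dP.
lhs = sum(g(y) * q for y, q in pushforward.items())
rhs = sum(g(X[omega]) * p for omega, p in P.items())
assert lhs == rhs
print(lhs)  # → 15/8
```

Working with `Fraction` keeps the arithmetic exact, so the two integrals agree identically rather than up to rounding.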
2 The weak law of large numbers
Suppose that $X_n$, $n \ge 1$, are independent identically distributed random variables, and write

$$S_n = \sum_{k=1}^n X_k,$$

for which $E(S_n) = n E(X_1)$.
Lemma 1.
If $X_1, X_2, \ldots$ are $L^2(P)$ and independent, then for any $n \ge 1$ and for any $\varepsilon > 0$,

$$P(|S_n - E(S_n)| \ge n\varepsilon) \le \frac{1}{n^2 \varepsilon^2} \sum_{k=1}^n \mathrm{Var}(X_k).$$
Proof.
Using

$$E(X_i X_j) = E(X_i) E(X_j), \qquad i \ne j,$$

which holds because the $X_i$ are independent, we have

$$E(S_n^2) = \sum_{k=1}^n E(X_k^2) + \sum_{i \ne j} E(X_i) E(X_j).$$

On the other hand,

$$E(S_n)^2 = \left( \sum_{k=1}^n E(X_k) \right)^2 = \sum_{k=1}^n E(X_k)^2 + \sum_{i \ne j} E(X_i) E(X_j).$$

So

$$\mathrm{Var}(S_n) = E(S_n^2) - E(S_n)^2 = \sum_{k=1}^n \left( E(X_k^2) - E(X_k)^2 \right) = \sum_{k=1}^n \mathrm{Var}(X_k).$$

Using this and Chebyshev’s inequality,

$$P(|S_n - E(S_n)| \ge n\varepsilon) \le \frac{\mathrm{Var}(S_n)}{n^2 \varepsilon^2} = \frac{1}{n^2 \varepsilon^2} \sum_{k=1}^n \mathrm{Var}(X_k),$$

proving the claim. ∎
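As a numerical sanity check of the inequality in Lemma 1 (the Bernoulli example and the constants are our choices, not from the text): for $S_n$ a sum of $n$ independent Bernoulli($p$) variables, $\mathrm{Var}(X_k) = p(1-p)$, and the exact tail $P(|S_n - E(S_n)| \ge n\varepsilon)$, computed from the binomial distribution, sits below the bound $p(1-p)/(n\varepsilon^2)$.

```python
from math import comb

def binom_pmf(n, p, k):
    """P(S_n = k) for S_n a sum of n independent Bernoulli(p) variables."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p, eps = 200, 0.3, 0.1
mean = n * p

# Exact tail probability P(|S_n - E(S_n)| >= n*eps).
tail = sum(binom_pmf(n, p, k) for k in range(n + 1) if abs(k - mean) >= n * eps)

# The bound from Lemma 1: sum of variances over (n*eps)^2, i.e. p(1-p)/(n*eps^2).
bound = p * (1 - p) / (n * eps**2)

assert tail <= bound
print(tail, bound)
```

Here the exact tail is far smaller than the Chebyshev-type bound, which is typical: the bound is crude but suffices for the limit arguments below.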
We now prove the weak law of large numbers, which states that if $X_n$ are independent identically distributed random variables each with mean $\mu$, then $\frac{S_n}{n}$ converges in probability to $\mu$. If $X_n$ are independent identically distributed random variables with mean $\mu$, then $X_n - \mu$ are independent identically distributed random variables with mean $0$, and if $\frac{1}{n} \sum_{k=1}^n (X_k - \mu)$ converges in probability to $0$ then $\frac{S_n}{n}$ converges in probability to $\mu$, showing that it suffices to prove the theorem when $\mu = 0$.⁵ [Allan Gut, Probability: A Graduate Course, second ed., p. 270, Theorem 6.3.1, and p. 121, Theorem 3.1.5.]
Theorem 2 (Weak law of large numbers).
Suppose that $E(X_1) = 0$. For each $\varepsilon > 0$,

$$P\left( \left| \frac{S_n}{n} \right| \ge \varepsilon \right) \to 0, \qquad n \to \infty.$$
Proof.
For $n \ge 1$ and $1 \le j \le n$, let

$$Y_{j,n} = X_j 1_{\{|X_j| \le n\}},$$

where $1_{\{|X_j| \le n\}}$ is the indicator function of the event $\{|X_j| \le n\}$. Because $X_1, \ldots, X_n$ are independent and identically distributed, so are $Y_{1,n}, \ldots, Y_{n,n}$.⁶ [Gerald B. Folland, Real Analysis: Modern Techniques and Their Applications, second ed., p. 316, Proposition 10.2.] Moreover, $|Y_{j,n}| \le n$, so each $Y_{j,n}$ belongs to $L^2(P)$. Let

$$T_n = \sum_{j=1}^n Y_{j,n},$$

for which $E(T_n) = n E(Y_{1,n})$.

If $1 \le j \le n$, then

$$\mathrm{Var}(Y_{j,n}) \le E(Y_{j,n}^2) = E(X_1^2 1_{\{|X_1| \le n\}}),$$

and using this and Lemma 1, for $\varepsilon > 0$ we have

$$P(|T_n - E(T_n)| \ge n\varepsilon) \le \frac{1}{n^2 \varepsilon^2} \sum_{j=1}^n \mathrm{Var}(Y_{j,n}) \le \frac{1}{n \varepsilon^2} E(X_1^2 1_{\{|X_1| \le n\}}).$$

For fixed $\varepsilon$ this is

$$\frac{1}{\varepsilon^2} E\left( \frac{X_1^2}{n} 1_{\{|X_1| \le n\}} \right).$$

Now,

$$\frac{X_1^2}{n} 1_{\{|X_1| \le n\}} \le |X_1|, \qquad \frac{X_1^2}{n} 1_{\{|X_1| \le n\}} \to 0 \text{ pointwise},$$

so $E\left( \frac{X_1^2}{n} 1_{\{|X_1| \le n\}} \right)$ tends to $0$ as $n \to \infty$ because $|X_1| \in L^1(P)$ and the dominated convergence theorem applies. Therefore,

$$P(|T_n - E(T_n)| \ge n\varepsilon) \to 0, \qquad n \to \infty,$$

that is,

$$P\left( \left| \frac{T_n}{n} - \frac{E(T_n)}{n} \right| \ge \varepsilon \right) \to 0, \qquad n \to \infty,$$

which implies that $\frac{T_n}{n} - \frac{E(T_n)}{n}$ converges in probability to $0$.

Because $E(X_1) = 0$,

$$E(Y_{1,n}) = E(X_1 1_{\{|X_1| \le n\}}) = -E(X_1 1_{\{|X_1| > n\}}),$$

and hence, by the dominated convergence theorem,

$$\left| \frac{E(T_n)}{n} \right| = |E(Y_{1,n})| \le E(|X_1| 1_{\{|X_1| > n\}}) \to 0, \qquad n \to \infty,$$

thus $\frac{T_n}{n}$ converges in probability to $0$. Finally,

$$P\left( \frac{S_n}{n} \ne \frac{T_n}{n} \right) \le \sum_{j=1}^n P(X_j \ne Y_{j,n}) = n P(|X_1| > n) \le E(|X_1| 1_{\{|X_1| > n\}}),$$

which tends to $0$ as $n \to \infty$, because $n 1_{\{|X_1| > n\}} \le |X_1| 1_{\{|X_1| > n\}}$. Thus $\frac{S_n}{n} - \frac{T_n}{n}$ converges in probability to $0$, and therefore $\frac{S_n}{n}$ converges in probability to $0$, completing the proof. ∎
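A simulation illustrating the conclusion of Theorem 2 (the distribution, sample sizes, and thresholds are our choices): for i.i.d. variables uniform on $[-1, 1]$, which have mean $0$, the fraction of sample paths with $|S_n/n| \ge \varepsilon$ shrinks as $n$ grows.

```python
import random

random.seed(0)

def deviation_fraction(n, eps, trials=2000):
    """Fraction of trials in which |S_n/n| >= eps, for X_k uniform on [-1, 1] (mean 0)."""
    count = 0
    for _ in range(trials):
        s = sum(random.uniform(-1, 1) for _ in range(n))
        if abs(s / n) >= eps:
            count += 1
    return count / trials

f_small = deviation_fraction(10, eps=0.2)
f_large = deviation_fraction(1000, eps=0.2)
print(f_small, f_large)

# Var(S_n/n) = 1/(3n), so Chebyshev gives P(|S_n/n| >= 0.2) <= 1/(3n * 0.04);
# at n = 1000 this is about 0.008, and the observed fraction is far smaller still.
assert f_large < 0.01
```

The deviation fraction at $n = 10$ is substantial while at $n = 1000$ it is essentially zero, which is exactly the convergence in probability the theorem asserts.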
Lemma 3 (Bienaymé’s formula).
If $X_1, \ldots, X_n \in L^2(P)$ are random variables that are pairwise uncorrelated, then

$$\mathrm{Var}\left( \sum_{k=1}^n X_k \right) = \sum_{k=1}^n \mathrm{Var}(X_k).$$
Proof.
Let $Y_k = X_k - E(X_k)$. Using that the $X_k$ are pairwise uncorrelated, for $j \ne k$ we have

$$E(Y_j Y_k) = E(X_j X_k) - E(X_j) E(X_k) = 0,$$

showing that the $Y_k$ are pairwise uncorrelated. Then, because $E(Y_j Y_k) = 0$ for $j \ne k$,

$$E\left( \left( \sum_{k=1}^n Y_k \right)^2 \right) = \sum_{k=1}^n E(Y_k^2) + \sum_{j \ne k} E(Y_j Y_k) = \sum_{k=1}^n E(Y_k^2),$$

and as $\sum_{k=1}^n Y_k = \sum_{k=1}^n X_k - E\left( \sum_{k=1}^n X_k \right)$, we have

$$\mathrm{Var}\left( \sum_{k=1}^n X_k \right) = E\left( \left( \sum_{k=1}^n Y_k \right)^2 \right) = \sum_{k=1}^n E(Y_k^2) = \sum_{k=1}^n \mathrm{Var}(X_k).$$
∎
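Bienaymé’s formula can be verified exactly on a finite probability space; in the following sketch (the example is ours) $\Omega$ consists of triples of fair coin flips, and the coordinate variables are independent, hence in particular pairwise uncorrelated.

```python
from fractions import Fraction
from itertools import product

# Omega: all triples of fair coin flips, each outcome with probability 1/8.
outcomes = list(product([0, 1], repeat=3))
prob = Fraction(1, len(outcomes))

def expectation(f):
    return sum(f(w) * prob for w in outcomes)

def variance(f):
    m = expectation(f)
    return expectation(lambda w: (f(w) - m) ** 2)

# X_k(w) = w[k]. Compare Var(X_1 + X_2 + X_3) with Var(X_1) + Var(X_2) + Var(X_3).
var_sum = variance(lambda w: sum(w))
sum_var = sum(variance(lambda w, k=k: w[k]) for k in range(3))
assert var_sum == sum_var
print(var_sum)  # → 3/4
```

Each fair flip has variance $1/4$, so both sides equal $3/4$, matching the variance of a Binomial($3$, $1/2$) variable.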
We now prove a weak law of large numbers, sometimes attributed to Markov, that neither supposes that the random variables are independent nor supposes that they are identically distributed. We remind ourselves that if a sequence of random variables converges to $0$ in $L^2(P)$ then it converges in probability to $0$; this is proved using Chebyshev’s inequality.
Theorem 4 (Markov’s weak law of large numbers).
If $X_n$, $n \ge 1$, are random variables in $L^2(P)$ that are pairwise uncorrelated and which satisfy

$$\frac{1}{n^2} \sum_{k=1}^n \mathrm{Var}(X_k) \to 0, \qquad n \to \infty,$$

then $\frac{S_n}{n} - \frac{E(S_n)}{n}$ converges to $0$ in $L^2(P)$ and thus in probability.
Proof.
Because $E\left( \frac{S_n}{n} - \frac{E(S_n)}{n} \right) = 0$ and the $X_k$ are pairwise uncorrelated, using Bienaymé’s formula we get

$$E\left( \left( \frac{S_n}{n} - \frac{E(S_n)}{n} \right)^2 \right) = \frac{1}{n^2} \mathrm{Var}(S_n) = \frac{1}{n^2} \sum_{k=1}^n \mathrm{Var}(X_k).$$

Thus

$$\left\| \frac{S_n}{n} - \frac{E(S_n)}{n} \right\|_{L^2(P)}^2 = \frac{1}{n^2} \sum_{k=1}^n \mathrm{Var}(X_k) \to 0$$

as $n \to \infty$, namely, $\frac{S_n}{n} - \frac{E(S_n)}{n}$ converges to $0$ in $L^2(P)$ as $n \to \infty$, proving the claim. ∎
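The hypothesis of Theorem 4 permits variances that grow with $k$, provided $\frac{1}{n^2}\sum_{k=1}^n \mathrm{Var}(X_k) \to 0$. A quick computation (the growth rate $\mathrm{Var}(X_k) = \sqrt{k}$ is our choice, not from the text) shows the squared $L^2$ norm appearing in the proof tending to $0$.

```python
from math import sqrt

def l2_norm_sq(n):
    """(1/n^2) * sum_{k=1}^n Var(X_k) with Var(X_k) = sqrt(k): by Bienaymé's
    formula, the squared L^2 norm of S_n/n - E(S_n)/n."""
    return sum(sqrt(k) for k in range(1, n + 1)) / n**2

values = [l2_norm_sq(n) for n in (10, 100, 1000, 10000)]
print(values)

# sum_{k<=n} sqrt(k) ~ (2/3) n^{3/2}, so the squared norm decays like (2/3) n^{-1/2}.
assert all(a > b for a, b in zip(values, values[1:]))
assert values[-1] < 0.01
```

So even though the individual variances are unbounded, the averaged quantity goes to $0$, and the theorem applies.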
3 Ergodic theory
We here assemble machinery and results that we will use to prove the strong law of large numbers. For a probability space $(\Omega, \mathcal{F}, P)$, a function $T : \Omega \to \Omega$ is said to be a measure-preserving transformation if (i) it is measurable, and (ii) $T_* P = P$. To say that $T_* P = P$ means that for any $A \in \mathcal{F}$, $(T_* P)(A) = P(A)$, i.e. $P(T^{-1}(A)) = P(A)$.
A collection $\mathcal{A}$ of subsets of $\Omega$ is called an algebra of sets if (i) $\emptyset \in \mathcal{A}$, (ii) $A, B \in \mathcal{A}$ implies $A \cup B \in \mathcal{A}$, (iii) if $A \in \mathcal{A}$ then $\Omega \setminus A \in \mathcal{A}$. If $\mathcal{S}$ is a nonempty collection of subsets of $\Omega$, it is a fact that there is a unique algebra of sets $\alpha(\mathcal{S})$ that (i) contains $\mathcal{S}$ and (ii) is contained in any algebra of sets that contains $\mathcal{S}$. We call $\alpha(\mathcal{S})$ the algebra of sets generated by $\mathcal{S}$.
A collection $\mathcal{S}$ of subsets of $\Omega$ is called a semialgebra of sets if (i) $\emptyset \in \mathcal{S}$, (ii) $\Omega \in \mathcal{S}$, (iii) if $A, B \in \mathcal{S}$ then $A \cap B \in \mathcal{S}$, (iv) if $A \in \mathcal{S}$ then there are pairwise disjoint $A_1, \ldots, A_n \in \mathcal{S}$ such that

$$\Omega \setminus A = \bigcup_{k=1}^n A_k.$$

It is a fact that $\alpha(\mathcal{S})$ is equal to the collection of all unions of finitely many pairwise disjoint elements of $\mathcal{S}$.⁷ [V. I. Bogachev, Measure Theory, volume I, p. 8, Lemma 1.2.14.]
A nonempty collection $\mathcal{M}$ of subsets of $\Omega$ is called a monotone class if $A_n \in \mathcal{M}$, $A_1 \subseteq A_2 \subseteq \cdots$ (an increasing sequence of sets), implies that $\bigcup_n A_n \in \mathcal{M}$, and $A_n \in \mathcal{M}$, $A_1 \supseteq A_2 \supseteq \cdots$ (a decreasing sequence of sets), implies that $\bigcap_n A_n \in \mathcal{M}$. In other words, a monotone class is a nonempty collection $\mathcal{M}$ of subsets of $\Omega$ such that if $A_n$ is a monotone sequence in $\mathcal{M}$ then $\lim_n A_n \in \mathcal{M}$. If $\mathcal{S}$ is a nonempty collection of subsets of $\Omega$, it is a fact that there is a unique monotone class $\mu(\mathcal{S})$ that (i) contains $\mathcal{S}$ and (ii) is contained in any monotone class that contains $\mathcal{S}$. We call $\mu(\mathcal{S})$ the monotone class generated by $\mathcal{S}$. The monotone class theorem states that if $\mathcal{A}$ is an algebra of sets then $\mu(\mathcal{A}) = \sigma(\mathcal{A})$.⁸ [V. I. Bogachev, Measure Theory, volume I, p. 33, Theorem 1.9.3.]
The following lemma gives conditions under which we can establish that a function is measure-preserving.⁹ [Peter Walters, An Introduction to Ergodic Theory, p. 20, Theorem 1.1.]
Lemma 5.
Let $T : \Omega \to \Omega$ be a function and suppose that $\mathcal{S}$ is a semialgebra of sets for which $\mathcal{F}$ is the $\sigma$-algebra generated by $\mathcal{S}$. If (i) $T^{-1}(A) \in \mathcal{F}$ for each $A \in \mathcal{S}$ and (ii) $P(T^{-1}(A)) = P(A)$ for each $A \in \mathcal{S}$, then $T$ is measure-preserving.
Proof.
Let

$$\mathcal{G} = \{ A \in \mathcal{F} : T^{-1}(A) \in \mathcal{F} \text{ and } P(T^{-1}(A)) = P(A) \};$$

we wish to prove that $\mathcal{G} = \mathcal{F}$.

If $A_n \in \mathcal{G}$ is an increasing sequence, let $A = \bigcup_n A_n$. Then, as $T^{-1}(A_n) \in \mathcal{F}$ for each $n$,

$$T^{-1}(A) = \bigcup_n T^{-1}(A_n) \in \mathcal{F},$$

and as (i) $T^{-1}(A_n)$ is an increasing sequence, (ii) $P$ is continuous from below,¹⁰ [V. I. Bogachev, Measure Theory, volume I, p. 9, Proposition 1.3.3.] and (iii) $P(T^{-1}(A_n)) = P(A_n)$,

$$P(T^{-1}(A)) = \lim_{n \to \infty} P(T^{-1}(A_n)) = \lim_{n \to \infty} P(A_n) = P(A),$$

and hence $A \in \mathcal{G}$. If $A_n \in \mathcal{G}$ is a decreasing sequence, let $A = \bigcap_n A_n$. Because $T^{-1}(A_n) \in \mathcal{F}$,

$$T^{-1}(A) = \bigcap_n T^{-1}(A_n) \in \mathcal{F},$$

and as (i) $T^{-1}(A_n)$ is a decreasing sequence, (ii) $P$ is continuous from above, and (iii) $P(T^{-1}(A_n)) = P(A_n)$,

$$P(T^{-1}(A)) = \lim_{n \to \infty} P(T^{-1}(A_n)) = \lim_{n \to \infty} P(A_n) = P(A),$$

and hence $A \in \mathcal{G}$. Therefore, $\mathcal{G}$ is a monotone class.

By hypothesis, $\mathcal{S} \subseteq \mathcal{G}$. If $A \in \alpha(\mathcal{S})$, then there are pairwise disjoint $A_1, \ldots, A_n \in \mathcal{S}$ with $A = \bigcup_{k=1}^n A_k$. As $T^{-1}(A_k) \in \mathcal{F}$,

$$T^{-1}(A) = \bigcup_{k=1}^n T^{-1}(A_k) \in \mathcal{F}.$$

As the $A_k$ are pairwise disjoint, so are the sets $T^{-1}(A_k)$, so, because $P(T^{-1}(A_k)) = P(A_k)$,

$$P(T^{-1}(A)) = \sum_{k=1}^n P(T^{-1}(A_k)) = \sum_{k=1}^n P(A_k) = P(A).$$

Therefore $A \in \mathcal{G}$. This shows that $\alpha(\mathcal{S}) \subseteq \mathcal{G}$.

The monotone class theorem tells us $\mu(\alpha(\mathcal{S})) = \sigma(\alpha(\mathcal{S}))$. On the one hand, $\mathcal{S} \subseteq \alpha(\mathcal{S})$ and hence $\mathcal{F} = \sigma(\mathcal{S}) \subseteq \sigma(\alpha(\mathcal{S}))$. On the other hand, $\alpha(\mathcal{S}) \subseteq \mathcal{G}$ and the fact that $\mathcal{G}$ is a monotone class yield

$$\mu(\alpha(\mathcal{S})) \subseteq \mathcal{G}.$$

Therefore

$$\mathcal{F} \subseteq \sigma(\alpha(\mathcal{S})) = \mu(\alpha(\mathcal{S})) \subseteq \mathcal{G}.$$

Of course $\mathcal{G} \subseteq \mathcal{F}$, so $\mathcal{G} = \mathcal{F}$, proving the claim. ∎
Let $(\Omega, \mathcal{F}, P)$ be a probability space. A measure-preserving transformation $T : \Omega \to \Omega$ is called ergodic if $A \in \mathcal{F}$ and $T^{-1}(A) = A$ implies that $P(A) = 0$ or $P(A) = 1$. It is proved¹¹ [Peter Walters, An Introduction to Ergodic Theory, p. 37, Corollary 1.14.2.] using the Birkhoff ergodic theorem that for a measure-preserving transformation $T$, $T$ is ergodic if and only if for all $A, B \in \mathcal{F}$,

$$\frac{1}{n} \sum_{k=0}^{n-1} P(T^{-k}(A) \cap B) \to P(A) P(B), \qquad n \to \infty.$$
A measure-preserving transformation $T$ is called weak-mixing if for all $A, B \in \mathcal{F}$,

$$\frac{1}{n} \sum_{k=0}^{n-1} \left| P(T^{-k}(A) \cap B) - P(A) P(B) \right| \to 0, \qquad n \to \infty.$$

It is immediate that a weak-mixing transformation is ergodic.
A measure-preserving transformation $T$ is called strong-mixing if for all $A, B \in \mathcal{F}$,

$$P(T^{-n}(A) \cap B) \to P(A) P(B), \qquad n \to \infty.$$

If a sequence of real numbers $a_n$ tends to $0$, then

$$\frac{1}{n} \sum_{k=0}^{n-1} |a_k| \to 0, \qquad n \to \infty,$$

and using this we check that a strong-mixing transformation is weak-mixing.
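The Cesàro averaging fact just used, that $a_n \to 0$ implies $\frac{1}{n}\sum_{k=0}^{n-1} |a_k| \to 0$, can be observed numerically; the slowly decaying sequence $a_k = 1/\sqrt{k+1}$ below is our choice of example.

```python
def cesaro_average(n):
    """(1/n) * sum_{k=0}^{n-1} |a_k| for a_k = 1/sqrt(k+1), a sequence tending to 0."""
    return sum(1 / (k + 1) ** 0.5 for k in range(n)) / n

averages = [cesaro_average(n) for n in (10, 1000, 100000)]
print(averages)

# (1/n) * sum_{k<n} 1/sqrt(k+1) ~ 2/sqrt(n) -> 0, even though a_k decays slowly.
assert averages[0] > averages[1] > averages[2]
assert averages[-1] < 0.01
```

The averages decay like $2/\sqrt{n}$, slower than many individual terms, which is why weak-mixing is a genuinely weaker notion than strong-mixing.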
The following statement gives conditions under which a measure-preserving transformation is ergodic, weak-mixing, or strong-mixing.¹² [Peter Walters, An Introduction to Ergodic Theory, p. 41, Theorem 1.17.]
Theorem 6.
Let $(\Omega, \mathcal{F}, P)$ be a probability space, let $\mathcal{S}$ be a semialgebra that generates $\mathcal{F}$, and let $T : \Omega \to \Omega$ be a measure-preserving transformation.

1. $T$ is ergodic if and only if for all $A, B \in \mathcal{S}$,

$$\frac{1}{n} \sum_{k=0}^{n-1} P(T^{-k}(A) \cap B) \to P(A) P(B), \qquad n \to \infty.$$

2. $T$ is weak-mixing if and only if for all $A, B \in \mathcal{S}$,

$$\frac{1}{n} \sum_{k=0}^{n-1} \left| P(T^{-k}(A) \cap B) - P(A) P(B) \right| \to 0, \qquad n \to \infty.$$

3. $T$ is strong-mixing if and only if for all $A, B \in \mathcal{S}$,

$$P(T^{-n}(A) \cap B) \to P(A) P(B), \qquad n \to \infty.$$
4 The strong law of large numbers
Let $\mu$ be a Borel probability measure on $\mathbb{R}$ with finite first moment:

$$\int_{\mathbb{R}} |x| \, d\mu(x) < \infty.$$
We shall specify when we use the hypothesis that $\mu$ has finite first moment; until we say so, what we say merely supposes that $\mu$ is a Borel probability measure on $\mathbb{R}$.
For $n \ge 1$, let $\Omega_n = \mathbb{R}$, let $\mathcal{F}_n = \mathcal{B}_{\mathbb{R}}$, the Borel $\sigma$-algebra of $\mathbb{R}$, and let $P_n = \mu$, for which $(\Omega_n, \mathcal{F}_n, P_n)$ is a probability space. Let $\Omega = \prod_{n=1}^\infty \Omega_n$. A cylinder set is a subset of $\Omega$ of the form

$$\prod_{n=1}^\infty A_n,$$

where $A_n \in \mathcal{F}_n$ for each $n$ and where $\{ n : A_n \ne \Omega_n \}$ is finite. We denote by $\mathcal{C}$ the collection of all cylinder sets. It is a fact that $\mathcal{C}$ is a semialgebra of sets.¹³ [S. J. Taylor, Introduction to Measure Theory and Integration, p. 136, §6.1; http://individual.utoronto.ca/jordanbell/notes/productmeasure.pdf]

The product $\sigma$-algebra is $\mathcal{F} = \sigma(\mathcal{C})$, the $\sigma$-algebra generated by the collection of all cylinder sets. The product measure,¹⁴ [http://individual.utoronto.ca/jordanbell/notes/productmeasure.pdf] which we denote by $P$, is the unique probability measure on $\mathcal{F}$ such that for any cylinder set $\prod_{n=1}^\infty A_n$,

$$P\left( \prod_{n=1}^\infty A_n \right) = \prod_{n=1}^\infty \mu(A_n).$$
Let $\pi_n : \Omega \to \mathbb{R}$, $n \ge 1$, be the projection map $\pi_n(\omega) = \omega_n$. Define $\sigma : \Omega \to \Omega$ by

$$\sigma(\omega_1, \omega_2, \omega_3, \ldots) = (\omega_2, \omega_3, \omega_4, \ldots),$$

the left shift map. In other words, for $n \ge 1$, $\sigma$ satisfies $\pi_n \circ \sigma = \pi_{n+1}$, and so $\sigma(\omega)_n = \omega_{n+1}$.
Lemma 7.
$\sigma : \Omega \to \Omega$ is measure-preserving.
Proof.
For $A = \prod_{n=1}^\infty A_n \in \mathcal{C}$,

$$\sigma^{-1}(A) = \prod_{n=1}^\infty B_n,$$

where $B_1 = \mathbb{R}$ and for $n \ge 2$, $B_n = A_{n-1}$. $\sigma^{-1}(A)$ is a cylinder set so a fortiori belongs to $\mathcal{F}$, and

$$P(\sigma^{-1}(A)) = \mu(\mathbb{R}) \prod_{n=2}^\infty \mu(A_{n-1}) = \prod_{n=1}^\infty \mu(A_n),$$

so

$$P(\sigma^{-1}(A)) = P(A).$$

Therefore by Lemma 5, because $\mathcal{C}$ is a semialgebra that generates $\mathcal{F}$, it follows that $\sigma$ is measure-preserving. ∎
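The computation in Lemma 7 can be sketched for a Bernoulli base measure; the encoding below is ours, not from the text: a cylinder set over $\{0, 1\}^{\mathbb{N}}$ is represented by a dict mapping its finitely many constrained coordinates to required values, and pulling back by $\sigma$ moves each constraint one coordinate to the right, leaving the product measure unchanged.

```python
from fractions import Fraction

# Base measure mu on {0, 1}: Bernoulli with mu({1}) = 1/3.
mu = {1: Fraction(1, 3), 0: Fraction(2, 3)}

def cylinder_measure(constraints):
    """P(A) for the cylinder A = {omega : omega_n = v for (n, v) in constraints};
    each unconstrained coordinate contributes a factor mu of the whole space, i.e. 1."""
    prod = Fraction(1)
    for value in constraints.values():
        prod *= mu[value]
    return prod

def shift_preimage(constraints):
    """sigma^{-1}(A): since sigma(omega)_n = omega_{n+1}, a constraint on
    coordinate n becomes a constraint on coordinate n + 1."""
    return {n + 1: v for n, v in constraints.items()}

A = {1: 1, 3: 0}  # the cylinder {omega : omega_1 = 1, omega_3 = 0}
assert cylinder_measure(shift_preimage(A)) == cylinder_measure(A)
print(cylinder_measure(A))  # → 2/9
```

The first coordinate of the preimage is unconstrained, contributing the factor $\mu(\{0,1\}) = 1$, which is exactly why $P(\sigma^{-1}(A)) = P(A)$ in the proof above.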
Lemma 8.
$\sigma$ is strong-mixing.
Proof.
Let $A = \prod_{n=1}^\infty A_n$ and $B = \prod_{n=1}^\infty B_n$ be cylinder sets. For $k \ge 1$,

$$\sigma^{-k}(A) = \prod_{n=1}^\infty C_n,$$

where $C_n = \mathbb{R}$ for $1 \le n \le k$ and $C_n = A_{n-k}$ for $n > k$. Because $A$ and $B$ are cylinder sets, there is some $N$ such that $A_n = \mathbb{R}$ and $B_n = \mathbb{R}$ when $n \ge N$. Thus for $k \ge N$,

$$\sigma^{-k}(A) \cap B = \prod_{n=1}^\infty D_n, \qquad D_n = \begin{cases} B_n, & 1 \le n \le k, \\ A_{n-k}, & n > k. \end{cases}$$

Hence

$$P(\sigma^{-k}(A) \cap B) = \prod_{n=1}^k \mu(B_n) \cdot \prod_{n>k} \mu(A_{n-k}) = P(B) P(A).$$

That is, there is some $N$ such that when $k \ge N$,

$$P(\sigma^{-k}(A) \cap B) = P(A) P(B),$$

and so a fortiori,

$$P(\sigma^{-k}(A) \cap B) \to P(A) P(B), \qquad k \to \infty.$$

Therefore, because the cylinder sets generate the $\sigma$-algebra $\mathcal{F}$, by Theorem 6 we get that $\sigma$ is strong-mixing. ∎
We now use the hypothesis that $\mu$ has finite first moment.¹⁵ [Elias M. Stein and Rami Shakarchi, Functional Analysis, Princeton Lectures in Analysis, volume IV, p. 208, chapter 5, §2.1.]
Lemma 9.
$\pi_1 \in L^1(P)$.
Proof.
$\pi_1 : \Omega \to \mathbb{R}$ is measurable, and $(\pi_1)_* P = \mu$. The statement that $\mu$ has finite first moment means that $g : \mathbb{R} \to \mathbb{R}$ defined by $g(x) = x$ belongs to $L^1(\mu)$. Therefore by the change of variables theorem (2), we have $\pi_1 = g \circ \pi_1 \in L^1(P)$. ∎
Lemma 10.
For almost all $\omega \in \Omega$,

$$\frac{1}{n} \sum_{k=1}^n \omega_k \to \int_{\mathbb{R}} x \, d\mu(x), \qquad n \to \infty.$$
Proof.
Because $\sigma$ is ergodic and $\pi_1 \in L^1(P)$, the Birkhoff ergodic theorem¹⁶ [Peter Walters, An Introduction to Ergodic Theory, p. 34, Theorem 1.14.] tells us that for almost all $\omega \in \Omega$,

$$\frac{1}{n} \sum_{k=0}^{n-1} \pi_1(\sigma^k \omega) \to \int_\Omega \pi_1 \, dP = \int_{\mathbb{R}} x \, d\mu(x), \qquad n \to \infty,$$

i.e.,

$$\frac{1}{n} \sum_{k=1}^n \omega_k \to \int_{\mathbb{R}} x \, d\mu(x), \qquad n \to \infty,$$

proving the claim. ∎
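Lemma 10 can be watched along a single sample path (the base measure $\mu$, here an exponential distribution with mean $1$, and the sample size are our choices): the average of the first $n$ coordinates of an i.i.d. draw from $\mu$ approaches $\int x \, d\mu(x)$.

```python
import random

random.seed(1)

# mu = exponential distribution with rate 1, whose first moment is 1.
n = 200_000
omega = [random.expovariate(1.0) for _ in range(n)]

avg = sum(omega) / n
print(avg)

# Almost sure convergence of the coordinate averages to the mean of mu;
# the tolerance 0.02 is generous (the standard error here is about 0.002).
assert abs(avg - 1.0) < 0.02
```

One path suffices to illustrate the lemma because the convergence is almost sure, not merely in probability.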
We will use Lemma 10 to prove the strong law of large numbers. First we prove two lemmas about joint distributions.¹⁷ [Elias M. Stein and Rami Shakarchi, Functional Analysis, Princeton Lectures in Analysis, volume IV, p. 208, Lemma 2.2, Lemma 2.3.]
Lemma 11.
If $X = (X_1, \ldots, X_n)$ and $Y = (Y_1, \ldots, Y_n)$ are random variables, defined respectively on probability spaces $(\Omega, \mathcal{F}, P)$ and $(\Omega', \mathcal{F}', Q)$, with the same joint distribution, and for each $1 \le j \le m$, $f_j : \mathbb{R}^n \to \mathbb{R}$ is a continuous function, then for $U = (f_1 \circ X, \ldots, f_m \circ X)$ and $V = (f_1 \circ Y, \ldots, f_m \circ Y)$, the random variables $U$ and $V$ have the same joint distribution.
Proof.
Write $X = (X_1, \ldots, X_n)$, $Y = (Y_1, \ldots, Y_n)$, $U = (f_1 \circ X, \ldots, f_m \circ X)$, and $V = (f_1 \circ Y, \ldots, f_m \circ Y)$, which are Borel measurable. Let $F = (f_1, \ldots, f_m) : \mathbb{R}^n \to \mathbb{R}^m$, which is continuous, and for which

$$U = F \circ X, \qquad V = F \circ Y.$$

To show that $U$ and $V$ have the same joint distribution means to show that $U_* P = V_* Q$. Let $A \in \mathcal{B}_{\mathbb{R}^m}$, for which $F^{-1}(A) \in \mathcal{B}_{\mathbb{R}^n}$ and

$$(U_* P)(A) = P(X^{-1}(F^{-1}(A))) = (X_* P)(F^{-1}(A)),$$

where, because $X$ and $Y$ have the same joint distribution,

$$(X_* P)(F^{-1}(A)) = (Y_* Q)(F^{-1}(A)) = Q(Y^{-1}(F^{-1}(A))) = (V_* Q)(A).$$
∎
Lemma 12.
If sequences of random variables $X_n$ and $Y_n$, $n \ge 1$, defined respectively on probability spaces $(\Omega, \mathcal{F}, P)$ and $(\Omega', \mathcal{F}', Q)$, have the same finite-dimensional distributions and there is some $c \in \mathbb{R}$ such that $X_n$ converges to $c$ almost surely, then $Y_n$ converges to $c$ almost surely.
Proof.
That the sequences $X_n$ and $Y_n$ have the same finite-dimensional distributions means that for each $n \ge 1$,

$$(X_1, \ldots, X_n)_* P = (Y_1, \ldots, Y_n)_* Q,$$

i.e., for each $n \ge 1$ and for each $A \in \mathcal{B}_{\mathbb{R}^n}$,

$$P((X_1, \ldots, X_n) \in A) = Q((Y_1, \ldots, Y_n) \in A).$$

Define

$$A_{k,N,M} = \bigcap_{n=N}^{M} \left\{ |X_n - c| \le \frac{1}{k} \right\}, \qquad B_{k,N,M} = \bigcap_{n=N}^{M} \left\{ |Y_n - c| \le \frac{1}{k} \right\}.$$

Then define

$$A_{k,N} = \bigcap_{M \ge N} A_{k,N,M} = \bigcap_{n \ge N} \left\{ |X_n - c| \le \frac{1}{k} \right\}, \qquad A = \bigcap_{k \ge 1} \bigcup_{N \ge 1} A_{k,N},$$

and

$$B_{k,N} = \bigcap_{M \ge N} B_{k,N,M}, \qquad B = \bigcap_{k \ge 1} \bigcup_{N \ge 1} B_{k,N}.$$

Because $(X_1, \ldots, X_M)$ and $(Y_1, \ldots, Y_M)$ have the same distribution,

$$P(A_{k,N,M}) = Q(B_{k,N,M}).$$

But $A_{k,N,M}$, $M \ge N$, is a decreasing sequence with intersection $A_{k,N}$, and likewise for $B_{k,N,M}$, so, by continuity from above,

$$P(A_{k,N}) = \lim_{M \to \infty} P(A_{k,N,M}) = \lim_{M \to \infty} Q(B_{k,N,M}) = Q(B_{k,N}).$$

Then, as $A_{k,N}$, $N \ge 1$, is an increasing sequence, by continuity from below,

$$P\left( \bigcup_{N \ge 1} A_{k,N} \right) = \lim_{N \to \infty} P(A_{k,N}) = \lim_{N \to \infty} Q(B_{k,N}) = Q\left( \bigcup_{N \ge 1} B_{k,N} \right).$$

Then, as $\bigcup_{N \ge 1} A_{k,N}$, $k \ge 1$, is a decreasing sequence, by continuity from above,

$$P(A) = \lim_{k \to \infty} P\left( \bigcup_{N \ge 1} A_{k,N} \right) = \lim_{k \to \infty} Q\left( \bigcup_{N \ge 1} B_{k,N} \right) = Q(B).$$

That the sequence $X_n$ converges almost surely to $c$ means that $P(A) = 1$, and therefore $Q(B) = 1$, i.e. the sequence $Y_n$ converges almost surely to $c$. ∎
We now use what we have established to prove the strong law of large numbers.¹⁸ [Elias M. Stein and Rami Shakarchi, Functional Analysis, Princeton Lectures in Analysis, volume IV, p. 206, Theorem 2.1.] (We write $(\Omega', \mathcal{F}', P')$ for a probability space because $(\Omega, \mathcal{F}, P)$ denotes the product probability space constructed already.)
Theorem 13.
If $X_n \in L^1(P')$, $n \ge 1$, are independent identically distributed random variables, then for almost all $\omega' \in \Omega'$,

$$\frac{1}{n} \sum_{k=1}^n X_k(\omega') \to E(X_1), \qquad n \to \infty.$$
Proof.
For $n \ge 1$, let $Y_n = \pi_n$; we do this to make the index set the same as for $X_n$. Let $\mu = (X_1)_* P'$ and set $S_n = \sum_{k=1}^n X_k$. Because the $X_n$ are identically distributed, $(X_n)_* P' = \mu$ for each $n$.

Because $X_1$ is $L^1(P')$, $\mu$ has finite first moment. As $(X_1)_* P' = \mu$, applying the change of variables theorem (2),

$$\int_{\mathbb{R}} x \, d\mu(x) = \int_{\Omega'} X_1 \, dP' = E(X_1).$$

With this, Lemma 10 says that for almost all $\omega \in \Omega$,

$$\frac{1}{n} \sum_{k=1}^n \omega_k \to E(X_1), \qquad n \to \infty. \tag{3}$$

For each $n$,

$$(Y_1, \ldots, Y_n)_* P = \mu \times \cdots \times \mu,$$

and because the $X_n$ are independent,

$$(X_1, \ldots, X_n)_* P' = \mu \times \cdots \times \mu,$$

so the sequences $X_n$ and $Y_n$ have the same finite-dimensional distributions. For $n \ge 1$, define $g_n : \mathbb{R}^n \to \mathbb{R}$ by

$$g_n(x_1, \ldots, x_n) = \frac{1}{n} \sum_{k=1}^n x_k.$$

Lemma 11 then tells us that the sequences

$$g_n(X_1, \ldots, X_n) = \frac{S_n}{n}$$

and

$$g_n(Y_1, \ldots, Y_n) = \frac{1}{n} \sum_{k=1}^n \pi_k,$$

$n \ge 1$, have the same finite-dimensional distributions.