Econometrics: Methods and Applications, Erasmus University Rotterdam. https://www.coursera.org/learn/erasmus-econometrics/home/welcome

Training Exercise 1.3

Dataset TrainExer13 contains the winning times (W) of the Olympic 100-meter finals (for men) from 1948 to 2004.

The calendar years 1948-2004 are transformed to games (G) 1-15 to simplify computations. A simple regression model for the trend in winning times is $W_i = \alpha + \beta G_i + \epsilon_i$, where $\epsilon_i$ is an error term.

(a) Compute the coefficients $a$ and $b$ (the OLS estimates of $\alpha$ and $\beta$), and determine the values of $R^2$ and $s$.

(b) Are you confident in the predictive ability of this model? Motivate your answer.

(c) What predictions do you get for 2008, 2012, and 2016? Compare your predictions with the actual winning times.
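For reference (these are the standard simple-regression formulas, not part of the exercise text), the estimates and fit measures asked for in (a) are

$$b = \frac{\sum_{i=1}^{n}(G_i-\bar{G})(W_i-\bar{W})}{\sum_{i=1}^{n}(G_i-\bar{G})^{2}}, \qquad a = \bar{W} - b\,\bar{G},$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n} e_i^{2}}{\sum_{i=1}^{n}(W_i-\bar{W})^{2}}, \qquad s = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n} e_i^{2}},$$

where $e_i = W_i - a - b\,G_i$ are the residuals and $n = 15$.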

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
Documentation references:

  • https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

  • https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html

  • https://numpy.org/doc/stable/reference/generated/numpy.reshape.html

  • https://numpy.org/doc/stable/user/basics.creation.html

  • https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

  • https://matplotlib.org/stable/gallery/statistics/hist.html

In [2]:
# index: Olympic year; G: game number 1-15; T: winning time W in seconds
TrainExer13 = pd.read_csv('TrainExer13.txt', sep='\t', header=0, names=['G', 'T'], index_col=1)
In [3]:
TrainExer13
Out[3]:
G T
1948 1 10.30
1952 2 10.40
1956 3 10.50
1960 4 10.20
1964 5 10.00
1968 6 9.95
1972 7 10.14
1976 8 10.06
1980 9 10.25
1984 10 9.99
1988 11 9.92
1992 12 9.96
1996 13 9.84
2000 14 9.87
2004 15 9.85
In [4]:
X = TrainExer13['G'].to_numpy().reshape((-1, 1))  # regressor: game number as an (n, 1) column
In [5]:
y = TrainExer13['T'].to_numpy()  # response: winning time W in seconds
In [6]:
model = LinearRegression().fit(X, y)  # OLS fit of W on G
In [7]:
model.intercept_
Out[7]:
10.386000000000001
In [8]:
model.coef_
Out[8]:
array([-0.038])
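As a cross-check on the sklearn fit, the same estimates can be reproduced from the closed-form formulas above; a minimal sketch, hard-coding the data from the table:

import numpy as np

# games 1-15 and winning times (s) from the table above
G = np.arange(1, 16)
W = np.array([10.30, 10.40, 10.50, 10.20, 10.00, 9.95, 10.14, 10.06,
              10.25, 9.99, 9.92, 9.96, 9.84, 9.87, 9.85])

b = np.sum((G - G.mean()) * (W - W.mean())) / np.sum((G - G.mean()) ** 2)
a = W.mean() - b * G.mean()
print(a, b)  # reproduces intercept_ = 10.386 and coef_ = -0.038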
In [9]:
plt.scatter(X, y)
plt.plot(X, model.predict(X))
plt.xlabel('Game G')
plt.ylabel('Winning time W (s)')
plt.title('Winning times (W) of the Olympic 100-meter finals (for men) from 1948 to 2004')
Out[9]:
Text(0.5, 1.0, 'Winning times (W) of the Olympic 100-meter finals (for men) from 1948 to 2004')
In [10]:
residuals = y - model.predict(X)  # in-sample residuals e_i
In [11]:
s2 = np.sum(np.square(residuals)) / (y.size - 2)  # estimated error variance s^2, with n - 2 = 13 degrees of freedom
In [12]:
s2
Out[12]:
0.015086153846153952
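The value above is the error variance $s^2$, not $s$ itself. The standard error of regression is its square root (about 0.12 s); a minimal follow-up, reusing s2 from the cell above:

s = np.sqrt(s2)  # standard error of regression s, roughly 0.12 seconds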
In [13]:
r_sq = model.score(X, y)  # coefficient of determination R^2
In [14]:
r_sq
Out[14]:
0.6733728599027362
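model.score returns the coefficient of determination. The same value can be recovered from the definition $R^2 = 1 - \sum_i e_i^2 / \sum_i (W_i - \bar{W})^2$; a one-line check, reusing residuals and y from the earlier cells:

r_sq_check = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)  # approximately 0.673, matching model.score(X, y)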
In [15]:
model.predict(np.array([[16]]))  # forecast for G = 16 (2008 Games)
Out[15]:
array([9.778])
In [16]:
model.predict(np.array([[17]]))  # forecast for G = 17 (2012 Games)
Out[16]:
array([9.74])
In [17]:
model.predict(np.array([[18]]))  # forecast for G = 18 (2016 Games)
Out[17]:
array([9.702])
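Equivalently, the three extrapolations follow directly from the fitted line $\hat{W} = 10.386 - 0.038\,G$; a compact sketch using the coefficients reported above:

import numpy as np

a, b = 10.386, -0.038           # fitted intercept and slope from the cells above
G_new = np.array([16, 17, 18])  # Beijing 2008, London 2012, Rio 2016
print(a + b * G_new)            # approximately [9.778, 9.740, 9.702]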

Actual winning times (source: https://olympics.com/en/olympic-games/olympic-results):

Olympic Games Beijing 2008: 9.890 https://olympics.com/en/olympic-games/beijing-2008/results/athletics/100m-men

Olympic Games London 2012: 9.630 https://olympics.com/en/olympic-games/london-2012/results/athletics/100m-men

Olympic Games Rio 2016: 9.810 https://olympics.com/en/olympic-games/rio-2016/results/athletics/100m-men

In [18]:
print("Residual for 2008 prediction is " + str(9.890 - 9.778))
Residual for 2008 prediction is 0.1120000000000001
In [19]:
print("Residual for 2008 prediction is " + str(9.630 - 9.74))
Residual for 2008 prediction is -0.10999999999999943
In [20]:
print("Residual for 2008 prediction is " + str(9.810 - 9.702))
Residual for 2008 prediction is 0.10800000000000054

Any linear trend model for a quantity that cannot be negative, such as the duration of a 100-meter race, must eventually break down: extrapolated far enough, the fitted line $\hat{W} = 10.386 - 0.038\,G$ predicts negative winning times. There is also a warning sign within the sample: although the residuals are roughly balanced between positive and negative, their absolute size shrinks as $G$ increases, with large residuals for the early Games and small ones for the later Games. Such a systematic pattern suggests that a straight line does not describe the trend well, so predictions from this model should be treated with caution.
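This pattern is easy to see in a residual plot; a minimal sketch, hard-coding the data and the fitted coefficients from above:

import numpy as np
import matplotlib.pyplot as plt

G = np.arange(1, 16)
W = np.array([10.30, 10.40, 10.50, 10.20, 10.00, 9.95, 10.14, 10.06,
              10.25, 9.99, 9.92, 9.96, 9.84, 9.87, 9.85])
residuals = W - (10.386 - 0.038 * G)  # residuals of the fitted trend line

plt.scatter(G, residuals)
plt.axhline(0, color='grey', linewidth=0.8)
plt.xlabel('Game G')
plt.ylabel('Residual (s)')
plt.title('Residuals of the linear trend model')
plt.show()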