Econometrics: Methods and Applications, Erasmus University Rotterdam: https://www.coursera.org/learn/erasmus-econometrics/home/welcome

Training Exercise 1.1

Dataset TrainExer11 contains survey outcomes of a travel agency that wishes to improve recommendation strategies for its clients. The dataset contains 26 observations on age and average daily expenditures during holidays.

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
  • https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

  • https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html

  • https://numpy.org/doc/stable/reference/generated/numpy.reshape.html

  • https://numpy.org/doc/stable/user/basics.creation.html

  • https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

  • https://matplotlib.org/stable/gallery/statistics/hist.html

In [2]:
TrainExer11_pd = pd.read_csv('TrainExer11.txt', sep='\t', header=0, index_col=0)
In [3]:
TrainExer11_pd.head()
Out[3]:
Age Expenditures
Observ.
1 49 95
2 15 104
3 43 91
4 45 98
5 40 94
In [4]:
# Regressor as a single-column 2-D array, the layout scikit-learn expects
X = TrainExer11_pd['Age'].to_numpy().reshape(-1, 1)
In [5]:
# Target as a 1-D array
y = TrainExer11_pd['Expenditures'].to_numpy()
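
The reshape in In [4] is needed because scikit-learn estimators expect the regressors as a two-dimensional array with one row per observation and one column per feature, whereas the target may stay one-dimensional. A minimal sketch of the shapes involved, using only the five observations displayed by head() above (the names `ages`, `X_demo` and `y_demo` are illustrative and not part of the notebook):

```python
import numpy as np

# Ages and expenditures of the first five observations shown by head().
ages = np.array([49, 15, 43, 45, 40])
expenditures = np.array([95, 104, 91, 98, 94])

# reshape(-1, 1) turns the flat vector into a single-column matrix;
# -1 lets NumPy infer the number of rows from the data.
X_demo = ages.reshape(-1, 1)
y_demo = expenditures

print(ages.shape)    # (5,)
print(X_demo.shape)  # (5, 1)
print(y_demo.shape)  # (5,)
```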

(a)

(a) Make two histograms, one of expenditures and the other of age. Make also a scatter diagram with expenditures on the vertical axis versus age on the horizontal axis.

In [6]:
plt.hist(X)
plt.xlabel('Age in years')
plt.ylabel('Frequency')
plt.title('Histogram of Ages')
Out[6]:
Text(0.5, 1.0, 'Histogram of Ages')
In [7]:
plt.hist(y)
plt.xlabel('Average daily expenditures during holidays')
plt.ylabel('Frequency')
plt.title('Histogram of Daily Holiday Expenditures')
Out[7]:
Text(0.5, 1.0, 'Histogram of Daily Holiday Expenditures')
In [8]:
model = LinearRegression().fit(X, y)
In [9]:
r_sq = model.score(X, y)
In [10]:
r_sq
Out[10]:
0.33766820038768774
In [11]:
model.intercept_
Out[11]:
114.24110795493158
In [12]:
model.coef_
Out[12]:
array([-0.3335961])
In [13]:
plt.scatter(X,y)
plt.plot(X, model.predict(X))
plt.xlabel('Age')
plt.ylabel('Expenditures')
plt.title('Expenditures vs. Age')
Out[13]:
Text(0.5, 1.0, 'Expenditures vs. Age')
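
The intercept and slope reported in Out[11] and Out[12] are ordinary least squares estimates. With a single regressor they have the closed form b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄. The sketch below evaluates these formulas with NumPy on the five observations displayed by head() (not the full dataset, so the numbers will differ from those above) and cross-checks them against `np.polyfit`. It also checks that, with one regressor, R² equals the squared sample correlation. All variable names are illustrative.

```python
import numpy as np

# First five observations shown by head(): age and daily expenditures.
x = np.array([49, 15, 43, 45, 40], dtype=float)
y = np.array([95, 104, 91, 98, 94], dtype=float)

# Deviations from the sample means.
x_dev = x - x.mean()
y_dev = y - y.mean()

# Closed-form OLS estimates for a single regressor.
b = np.sum(x_dev * y_dev) / np.sum(x_dev ** 2)  # slope
a = y.mean() - b * x.mean()                     # intercept

# R^2 = 1 - (sum of squared residuals) / (total sum of squares).
residuals = y - (a + b * x)
r_squared = 1.0 - np.sum(residuals ** 2) / np.sum(y_dev ** 2)

# Cross-checks: a degree-1 polynomial fit and the sample correlation.
slope_check, intercept_check = np.polyfit(x, y, 1)
corr = np.corrcoef(x, y)[0, 1]

print(a, b, r_squared)
```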

(c)

(c) Propose a method to analyze these data in a way that assists the travel agent in making recommendations to future clients.

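Separate the observations into two clusters, the "top left" (younger clients with higher expenditures) and the "bottom right" (older clients with lower expenditures), and analyze each cluster on its own.

Within each cluster the regression slope appears to be positive, whereas the full dataset (the union of the two clusters) has a negative regression slope of about -0.33.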
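
The sign reversal between the pooled fit and the within-cluster fits can be reproduced on a toy example. The sketch below uses made-up numbers (not taken from TrainExer11) in which each group lies on an upward-sloping line while the younger group sits at a higher expenditure level; the helper `ols_slope` is hypothetical.

```python
import numpy as np

# Hypothetical, made-up numbers chosen only to illustrate the pattern;
# they are NOT observations from TrainExer11.
age_young = np.array([20.0, 25.0, 30.0])
exp_young = np.array([104.0, 106.0, 108.0])
age_old = np.array([45.0, 50.0, 55.0])
exp_old = np.array([94.0, 96.0, 98.0])


def ols_slope(x, y):
    """Slope of the least squares line of y on x."""
    return np.polyfit(x, y, 1)[0]


# Slopes within each group and for the two groups pooled together.
slope_young = ols_slope(age_young, exp_young)
slope_old = ols_slope(age_old, exp_old)
slope_pooled = ols_slope(
    np.concatenate([age_young, age_old]),
    np.concatenate([exp_young, exp_old]),
)

print(slope_young, slope_old, slope_pooled)
```

In this toy data both within-group slopes are positive while the pooled slope is negative, because the difference in level between the groups dominates the pooled fit.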


(d)

The scatter diagram indicates two groups of clients. Younger clients spend more than older ones. Further, expenditures tend to increase with age for younger clients, whereas the pattern is less clear for older clients.

(d) Compute the sample mean of expenditures of all 26 clients.

In [14]:
np.mean(y)
Out[14]:
101.11538461538461

(e)

(e) Compute two sample means of expenditures, one for clients of age forty or more and the other for clients of age below forty.

In [15]:
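# Cluster A ("top left"): age at most 45 and expenditures of at least 100.
# This conditions on expenditures as well as age, unlike the age-forty cut-off in the question.
A = TrainExer11_pd[(TrainExer11_pd['Age'] <= 45) & (TrainExer11_pd['Expenditures'] >= 100)]
In [16]:
# Cluster B ("bottom right"): the complement of A.
B = TrainExer11_pd[(TrainExer11_pd['Age'] > 45) | (TrainExer11_pd['Expenditures'] < 100)]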
In [17]:
X_A = A['Age'].to_numpy().reshape(-1, 1)
y_A = A['Expenditures'].to_numpy()
In [18]:
len(A)
Out[18]:
13
In [19]:
X_B = B['Age'].to_numpy().reshape(-1, 1)
y_B = B['Expenditures'].to_numpy()
In [20]:
len(B)
Out[20]:
13
In [21]:
np.mean(y_A)
Out[21]:
106.38461538461539
In [22]:
np.mean(y_B)
Out[22]:
95.84615384615384
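
As a consistency check, the overall mean from part (d) should equal the average of the two group means weighted by group size. A short sketch using the values reported in Out[18] through Out[22] (the variable names are illustrative):

```python
# Group sizes and group means as reported in the outputs above.
n_A, mean_A = 13, 106.38461538461539
n_B, mean_B = 13, 95.84615384615384

# Size-weighted average of the two group means.
overall_mean = (n_A * mean_A + n_B * mean_B) / (n_A + n_B)

print(round(overall_mean, 4))  # 101.1154
```

This agrees with the sample mean of all 26 clients in Out[14].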
In [23]:
model_A = LinearRegression().fit(X_A, y_A)
In [24]:
model_B = LinearRegression().fit(X_B, y_B)
In [25]:
plt.scatter(X_A, y_A)
plt.plot(X_A, model_A.predict(X_A))
plt.scatter(X_B, y_B)
plt.plot(X_B, model_B.predict(X_B))
plt.xlabel('Age')
plt.ylabel('Expenditures')
plt.title('Expenditures vs. Age')
Out[25]:
Text(0.5, 1.0, 'Expenditures vs. Age')
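
Part (e) asks for a split on age alone, while cells In [15] and In [16] also condition on expenditures. A minimal sketch of an age-only split at forty follows. It uses only the five observations displayed by head(), so the means are for illustration and are not the full-sample answers. On the full data the same boolean mask could presumably be built from `TrainExer11_pd['Age']`.

```python
import numpy as np

# First five observations shown by head(): age and daily expenditures.
ages = np.array([49, 15, 43, 45, 40])
expenditures = np.array([95, 104, 91, 98, 94])

threshold = 40            # cut-off age stated in part (e)
older = ages >= threshold  # True for clients aged forty or more

# Sample mean of expenditures within each age group.
mean_older = expenditures[older].mean()
mean_younger = expenditures[~older].mean()

print(mean_older, mean_younger)  # 94.5 104.0
```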

(f)

(f) What daily expenditures would you predict for a new client of fifty years old? And for someone who is twenty-five years old?

In [26]:
model_A.predict(np.array([[25]]))
Out[26]:
array([105.18155915])
In [27]:
model_B.predict(np.array([[50]]))
Out[27]:
array([96.19543044])
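
For comparison, the pooled regression line from Out[11] and Out[12] can be evaluated at the same two ages. The sketch below plugs in the reported intercept and the slope as printed (rounded to seven decimals), so the results are approximate; `pooled_prediction` is a hypothetical helper, not part of the notebook.

```python
intercept = 114.24110795493158  # Out[11]
slope = -0.3335961              # Out[12], as printed


def pooled_prediction(age):
    """Fitted daily expenditure from the pooled regression line."""
    return intercept + slope * age


print(round(pooled_prediction(25), 2))  # 105.9
print(round(pooled_prediction(50), 2))  # 97.56
```

These values differ from the cluster-specific predictions above (about 105.18 and 96.20) because the pooled line ignores the two-group structure.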