Worksheet WS20220411
If the relationship between two variables is not linear, it may still be possible to apply the linear model after a non-linear transformation of one or both of the variables. A linear transformation doesn’t help in such cases, because a linear transformation doesn’t affect the strength of a linear relationship.
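The claim about linear transformations can be checked numerically. The sketch below uses only the five rows shown in Table 1 (so the correlation it prints says nothing about the full data set) and a made-up linear transformation `2.5 * X + 10`; the helper name `pearson_r` is an illustrative assumption, not part of the worksheet.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# First five rows of the worksheet's data set (Table 1).
x = [103, 125, 161, 199, 239]
y = [171376, 147400, 148778, 150811, 159908]

# Hypothetical linear transformation of X: rescale and shift.
x_linear = [2.5 * v + 10 for v in x]
# Non-linear transformation of X: squaring.
x_squared = [v ** 2 for v in x]

r_original = pearson_r(x, y)
r_linear = pearson_r(x_linear, y)
r_squared_x = pearson_r(x_squared, y)

# r_original and r_linear agree; r_squared_x is a different value.
print(r_original, r_linear, r_squared_x)
```

The correlation is unchanged by the linear transformation (with a positive scale factor), while squaring X changes it, which is why only non-linear transformations can help here.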
Below you first find a fictitious data set to which several transformations are applied, on one or both variables. An OLS linear regression model is used to describe the relationship between the transformed variable and the untransformed variable or, if both variables are transformed, between the two transformed variables.
Table 1
Head of the fictitious data set
X | Y |
103 | 171,376 |
125 | 147,400 |
161 | 148,778 |
199 | 150,811 |
239 | 159,908 |
As always, the first step is plotting Y versus X in a scatter plot.
Figure 1
Scatterplot Y ~ X
Figure 1a
Residual plot RESIDUAL ~ X
Comment
The plot in Figure 1 shows a positive association between Y and X. Let’s first estimate an OLS simple linear regression model and also plot the corresponding residual plot (Figure 1a).
Model 1: OLS Linear Regression Model Output Y~X
##
## Call:
## lm(formula = Y ~ X, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66995 -25951 -4995 22500 83892
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40547.20 13784.20 2.942 0.00554 **
## X 455.70 22.12 20.605 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35480 on 38 degrees of freedom
## Multiple R-squared: 0.9178, Adjusted R-squared: 0.9157
## F-statistic: 424.6 on 1 and 38 DF, p-value: < 2.2e-16
Although R2 is high, the scatter plot shows that the relationship is curved. The residual plot shows a clear pattern (positive values - negative values - positive values). This is a reason to adjust the model by applying a transformation to the X-variable, the Y-variable or both.
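To make the quantities in the regression output concrete, here is a minimal sketch of a simple OLS fit in plain Python. It uses hypothetical curved data (Y equal to X squared), not the worksheet's data set, and the function name `simple_ols` is an assumption made for illustration. It shows how a straight-line fit to curved data can combine a high R2 with a patterned residual plot.

```python
from math import sqrt

def simple_ols(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    ss_res = sum(e ** 2 for e in residuals)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return {
        "intercept": intercept,
        "slope": slope,
        "residuals": residuals,
        "r_squared": 1 - ss_res / ss_tot,
        # Residual standard error, with n - 2 degrees of freedom.
        "se_res": sqrt(ss_res / (n - 2)),
    }

# Hypothetical curved data, NOT the worksheet's data set.
xs = [1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]

fit = simple_ols(xs, ys)
signs = ["+" if e > 0 else "-" for e in fit["residuals"]]

print(fit["slope"], fit["intercept"], fit["r_squared"], fit["se_res"])
print(signs)  # positive at both ends, negative in the middle
```

The residual signs run positive, negative, positive along the X-axis, the same kind of pattern described for Figure 1a.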
Exercise 1
What is the equation of the estimated regression line in Model 1?
ANSWER: Y = 40,547 + 455.7X
What is the value of R2?
ANSWER: 0.918
What stands out when you study the residual plot?
ANSWER: a clear pattern, positive values - negative values - positive values
Based on what we see in the graph, a possibility is that the relationship between Y and X can be described by a quadratic model Y = \(\alpha\) + \(\beta\)\(X^2\).
To apply this model, we bring it back to a simple linear model by transforming the X-variable: the X-values are squared. In the data set we add a variable X2 = \(X^2\) and then we apply the OLS linear regression model on Y ~ X2.
Table 2
Head of the Data Set with added Variable X2 = \(X^2\)
X | X2 | Y |
103 | 10,609 | 171,376 |
125 | 15,625 | 147,400 |
161 | 25,921 | 148,778 |
199 | 39,601 | 150,811 |
239 | 57,121 | 159,908 |
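As a small illustration of the transformation step, the following sketch recomputes the X2 column of Table 2 from the five X-values shown there. The variable names are chosen for this example only.

```python
# X-values from the head of the data set (Table 2).
x = [103, 125, 161, 199, 239]

# The transformed variable: each X-value squared.
x2 = [v ** 2 for v in x]

print(x2)  # [10609, 15625, 25921, 39601, 57121]
```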
In Figure 2 the Y-values are plotted against the X2-values.
Figure 2
Example Scatterplot Y ~ X2 values
Figure 2a
Example Scatterplot Residuals ~ X2 values
Comment
Figure 2 shows a linear relationship between Y and X2 (= \(X^2\)). The model parameters are estimated by applying an OLS regression on Y versus X2.
Figure 2a shows the residual plot in which the residuals are plotted against the X2 values. Unlike in Figure 1a, there is no apparent pattern in the residuals.
Model 2: OLS Regression Model Output Y~X2
##
## Call:
## lm(formula = Y ~ X2, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41303 -9502 -985 10819 27489
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.395e+05 4.262e+03 32.73 <2e-16 ***
## X2 4.131e-01 8.797e-03 46.96 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16110 on 38 degrees of freedom
## Multiple R-squared: 0.9831, Adjusted R-squared: 0.9826
## F-statistic: 2205 on 1 and 38 DF, p-value: < 2.2e-16
Unlike Figure 1a, the residual plot shows no apparent pattern. Together with the higher R2 value, this indicates that Model 2 fits the data better than Model 1.
Exercise 2
What is the equation of the estimated regression line in Model 2?
ANSWER: Y = 139,504 + 0.4131X2
What is the value of R2?
ANSWER: 0.983
In general, comparing two models is done based on:
Comparing Residual Plots
The residual plot from Model 1 shows a clear pattern in the residuals; the pattern from Model 2 is not that clear.
Model 2 scores better than Model 1 on this point.
Comparing R2 values
Model 1: R2 = 0.918
Model 2: R2 = 0.983
The coefficient of determination R2 is higher for Model 2 than for Model 1.
So Model 2 is better than Model 1 as far as R2 is concerned.
Comparing SE residuals
Model 1: SEres = 35,480
Model 2: SEres = 16,110
The lower the variability in the residuals, the lower the variability in the Y-estimates, the better the model.
So also on this criterion, Model 2 performs better than Model 1.
Conclusion
Model 2 is the better of the two models.
Exercise 3
Using Model 2 to make predictions for a Y-value
Using the model to estimate the Y-value given an X-value is straightforward.
The Model 2 regression equation is: \(\hat{Y}\) = 139,500 + 0.4131 \(\times\) \(X^2\)
If for instance X = 300, \(\hat{Y}\) = 139,500 + 0.4131 \(\times\) \(300^2\) = …………………………….
ANSWER: 176,686
A 95% CI for this Y-value is calculated as follows:
95% CI = Point Estimate \(\pm\) Margin of Error = ……………. \(\pm\) \(t_{df=n-2}\) \(\times\) se = …………………………………………………………
The se-value can be found in the regression output, the t-value in a t-table or with the graphical calculator.
ANSWER:
with point_estimate = 176,686, t-value = 2.024 and se = 16,109
<144,076; 209,297>
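The calculation in this exercise can be sketched in a few lines of Python. The coefficients are the rounded values from the Model 2 output and the answer to Exercise 2, and the t-value 2.024 is taken from the worksheet, so the results differ by a few units from the answers above, presumably because those were computed with unrounded values. Note that this follows the worksheet's simplified interval, which uses only the residual standard error; a full prediction interval would also account for the uncertainty in the estimated coefficients. The function name `predict_model2` is an assumption for illustration.

```python
# Rounded values as reported for Model 2 in the worksheet.
intercept = 139504
slope = 0.4131
se_res = 16109      # residual standard error
t_value = 2.024     # t critical value for df = 38, as given in the worksheet

def predict_model2(x):
    """Point estimate of Y for a given X under Model 2 (Y ~ X squared)."""
    return intercept + slope * x ** 2

x_new = 300
y_hat = predict_model2(x_new)

# Simplified 95% interval: point estimate plus/minus t * se.
margin = t_value * se_res
lower, upper = y_hat - margin, y_hat + margin

print(round(y_hat), round(lower), round(upper))
```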
—
Instead of transforming the X-variable into X2, we could also have chosen to transform the Y-variable into \(\sqrt{Y}\) and apply the OLS linear regression model on \(\sqrt{Y}\) versus X: \(\sqrt{Y}\) = \(\alpha\) + \(\beta\)X + \(\epsilon\).
Figure 3
Example Scatterplot SQRT(Y) ~ X values
Figure 3a
Example Scatterplot Residuals ~ X values
Comment
Figure 3 shows a linear relationship between \(\sqrt{Y}\) and X. The model parameters are estimated by applying an OLS linear regression on \(\sqrt{Y}\) versus X.
Figure 3a shows the plot in which the residuals are plotted against the X-values.
Model 3: OLS Regression Model Output SQRT(Y)~X
##
## Call:
## lm(formula = SQRT_Y ~ X, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.354 -14.837 -1.208 14.744 69.950
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 301.40123 9.72900 30.98 <2e-16 ***
## X 0.41383 0.01561 26.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.04 on 38 degrees of freedom
## Multiple R-squared: 0.9487, Adjusted R-squared: 0.9474
## F-statistic: 702.8 on 1 and 38 DF, p-value: < 2.2e-16
There is a clearer pattern in the residuals than in Figure 2a.
Exercise 4
Compare Model 2 and Model 3.
What is the regression line equation in Model 3?
ANSWER: \(\widehat{\sqrt{Y}}\) = 301.4 + 0.4138 X
Based on the distributions in the residual plots, which of the two is the better model?
ANSWER: model 2; residual plot in model 3 has a clear pattern
Based on the R2 values, which of the two is the better model?
ANSWER: model 2, because R2 in model 2 (0.983) is higher than in Model 3 (0.949)
Note. It makes no sense in this case to compare the residual standard errors (SE). The SE in Model 2, as in Model 1, measures the variability of the Y-values around the regression line, whereas the SE in Model 3 measures the variability of the \(\sqrt{Y}\)-values. The SE in Models 1 and 2 therefore has the same unit as the Y-values, and the SE in Model 3 has the same unit as the \(\sqrt{Y}\)-values. So if, for instance, the Y-values are in dollars, the unit of the SE in Model 3 is \(\sqrt{dollar}\) (whatever that may look like).
Exercise 5
Use model 1 to calculate a 95% CI for the Y-value given X = 500.
ANSWER X = 500: \(\hat{Y}\) = 40,547 + 455.7 * 500 = 268,397
95% CI for Y: \(\hat{Y} \pm\) \(t_{df=38}\) * se = 268,397 \(\pm\) 2.024 * 35,475 = <196,582 ; 340,213>
Use model 2 to calculate a 95% CI for the Y-value given X = 500.
ANSWER
X = 500
\(\hat{Y}\) = 139,504 + 0.4131 * 250,000 = 242,788
95% CI for Y: \(\hat{Y} \pm\) \(t_{df=38}\) * se = 242,788 \(\pm\) 2.024 * 16,109 = <210,178 ; 275,399>
Use model 3 to calculate a 95% CI for the Y-value given X = 500.
ANSWER
X = 500; X2 = 250,000
\(\hat{\sqrt{Y}}\) = 301.4 + 0.4138 * 500 = 508.32
\(\hat{Y}\) = \(508.32^2\) \(\approx\) 258,388
95% CI for \(\sqrt{Y}\): \(\widehat{\sqrt{Y}} \pm\) \(t_{df=38}\) * se = 508.32 \(\pm\) 2.024 * 25.04 = <457.63 ; 559.01>
95% CI for Y: <209,426 ; 312,488>
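The back-transformation used for Model 3 can be sketched as follows. The interval is first built on the \(\sqrt{Y}\) scale and its endpoints are then squared, which is valid here because squaring preserves order for non-negative values. The coefficients are the printed values from the Model 3 output and the t-value is the one used in the worksheet; small differences from the answers above are due to rounding.

```python
# Values as printed in the Model 3 output.
intercept = 301.40123
slope = 0.41383
se_res = 25.04      # residual standard error on the sqrt(Y) scale
t_value = 2.024     # t critical value for df = 38, as given in the worksheet

x_new = 500

# Step 1: point estimate and interval on the sqrt(Y) scale.
sqrt_y_hat = intercept + slope * x_new
margin = t_value * se_res
lower_sqrt, upper_sqrt = sqrt_y_hat - margin, sqrt_y_hat + margin

# Step 2: square everything to return to the Y scale.
y_hat = sqrt_y_hat ** 2
lower_y, upper_y = lower_sqrt ** 2, upper_sqrt ** 2

print(round(sqrt_y_hat, 2), round(lower_sqrt, 2), round(upper_sqrt, 2))
print(round(y_hat), round(lower_y), round(upper_y))
```

After back-transformation the interval is no longer symmetric around \(\hat{Y}\): the upper endpoint lies further from the point estimate than the lower one.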
Another form of transformation is a log-transformation. This can be applied to the X-variable (the Y ~ log(X) model), to the Y-variable (the log(Y) ~ X model), or to both (the log(Y) ~ log(X) model).
These models are shown below. The logarithm with base 10 is used in the models. It is also possible to use another base; the natural logarithm ln, the logarithm with base e (\(\approx\) 2.7182818), is a common choice.
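The choice of base only rescales the transformed variable, because \(\log_{10}(x) = \ln(x) / \ln(10)\). The sketch below illustrates this for one X-value from the data set and shows how a slope on the log10 scale would convert to the ln scale; the slope value is the one reported later for Model 4, and the variable names are assumptions for this example.

```python
import math

x = 103  # first X-value in the data set

log10_x = math.log10(x)
ln_x = math.log(x)

# Change of base: log10(x) equals ln(x) divided by ln(10).
log10_from_ln = ln_x / math.log(10)

# A slope estimated on the log10 scale (here the Model 4 slope)
# corresponds to a slope on the ln scale that is smaller by a factor ln(10).
slope_log10 = 411580
slope_ln = slope_log10 / math.log(10)

# Both forms give the same contribution to the predicted Y.
term_log10 = slope_log10 * log10_x
term_ln = slope_ln * ln_x

print(log10_x, log10_from_ln)
print(term_log10, term_ln)
```

Since the two scales differ only by a constant factor, the fitted values, residuals and R2 of a model are the same for either base.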
Figure 4
Scatterplot Y ~ log10(X) values
Figure 4a
Scatterplot Residuals ~ log10(X) values
Comment
Figure 4 shows the relationship between Y and log(X). An OLS linear regression model is applied, although based on Figure 4 this doesn’t seem a good idea.
The model parameters are estimated by applying an OLS linear regression on Y versus log(X).
Figure 4a shows the plot in which the residuals are plotted against the log(X)-values.
Model 4: OLS Regression Model Output Y ~ log(X)
##
## Call:
## lm(formula = Y ~ log10(X), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -99314 -40909 -18920 36705 152862
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -809930 112701 -7.187 1.38e-08 ***
## log10(X) 411580 41615 9.890 4.64e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 65470 on 38 degrees of freedom
## Multiple R-squared: 0.7202, Adjusted R-squared: 0.7128
## F-statistic: 97.82 on 1 and 38 DF, p-value: 4.636e-12
There is a clear pattern in the residuals, as in Figure 1a.
Compared with the first model, this transformation doesn’t lead to a better fitting model; on the contrary, the fit is worse.
Exercise 6
What is the equation of the regression line in Model 4?
ANSWER: Y = -809,930 + 411,580log10(X)
Why is this model worse than Model 3?
ANSWER
Figure 5
Scatterplot log(Y) ~ X values
Figure 5a
Scatterplot Residuals ~ X values
Comment
Figure 5 shows a linear relationship between log(Y) and X values. The model parameters are estimated by applying an OLS linear regression on log(Y) versus X.
Figure 5a shows the plot in which the residuals are plotted against the X-values.
Model 5: OLS Regression Model Output LOG(Y) ~ X
##
## Call:
## lm(formula = log10(Y) ~ X, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.091705 -0.018100 0.000677 0.020620 0.102983
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.062e+00 1.253e-02 403.93 <2e-16 ***
## X 6.686e-04 2.011e-05 33.25 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03225 on 38 degrees of freedom
## Multiple R-squared: 0.9668, Adjusted R-squared: 0.9659
## F-statistic: 1106 on 1 and 38 DF, p-value: < 2.2e-16
Exercise 6
What is the equation of the regression line in Model 5?
ANSWER: \(\widehat{\log_{10}(Y)}\) = 5.062 + 0.0006686X
Assess Model 5 by examining Figure 5, Figure 5a and the R2 value of Model 5.
ANSWER
Figure 6
Scatterplot log(Y) ~ log(X) values
Figure 6a
Scatterplot Residuals ~ log(X) values
Comment
Figure 6 shows a linear relationship between log(Y) and log(X). The model parameters are estimated by applying an OLS linear regression on log(Y) versus log(X).
Figure 6a shows the plot in which the residuals are plotted against the log(X)-values.
Model 6: OLS Regression Model Output LOG(Y) ~ LOG(X)
##
## Call:
## lm(formula = log10(Y) ~ log10(X), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.13923 -0.05482 -0.01547 0.03959 0.22223
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.74310 0.12690 29.50 <2e-16 ***
## log10(X) 0.63026 0.04686 13.45 5e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07372 on 38 degrees of freedom
## Multiple R-squared: 0.8264, Adjusted R-squared: 0.8218
## F-statistic: 180.9 on 1 and 38 DF, p-value: 5.002e-16
Exercise 7
What is the equation of the regression line in Model 6?
ANSWER: log10(Y) = 3.743 + 0.630 log10(X)
Rewrite this equation to Y = ….
ANSWER
\(10^{\log_{10}(Y)} = 10^{3.743 + 0.630 \log_{10}(X)}\)
\(Y = 10^{3.743} \times 10^{0.630 \log_{10}(X)}\)
\(Y = 10^{3.743} \times (10^{\log_{10}(X)})^{0.630}\)
\(Y = 10^{3.743} \times X^{0.630}\)
\(Y = 5534.8 \times X^{0.630}\)
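The rewriting above can be checked numerically. The sketch below uses the coefficients as printed in the Model 6 output (3.74310 and 0.63026), for which the constant \(10^{3.74310}\) comes out at about 5534.8; the function names are assumptions made for this example.

```python
import math

# Coefficients as printed in the Model 6 output.
intercept = 3.74310
slope = 0.63026

# Constant of the power function: 10 raised to the intercept.
a = 10 ** intercept

def y_log_form(x):
    """Y obtained by back-transforming the fitted log10(Y)."""
    return 10 ** (intercept + slope * math.log10(x))

def y_power_form(x):
    """Y written directly as a power function a * X^slope."""
    return a * x ** slope

for x in (103, 239, 500):
    print(x, y_log_form(x), y_power_form(x))
```

Both forms give the same Y for every X, which confirms that a straight line on the log-log scale corresponds to a power function on the original scale.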
Assess Model 6 by examining Figure 6, Figure 6a and the R2 value of Model 6.
ANSWER
Exercise 8
Rank the six models from 1 (the best fitting) to 6.
ANSWER
| MODEL | Scatter plot | Residual plot | R2 |
|---|---|---|---|
| 1 | linear relationship: yes, but a curved model seems better | clear pattern in residuals | 0.918 |
| 2 | linear relationship: yes | residuals are randomly distributed | 0.983 |
| 3 | linear relationship: yes, but a curved model seems better | clear pattern in residuals | 0.949 |
| 4 | shows a curved rather than a linear relationship | strong pattern in residuals | 0.720 |
| 5 | shows linear relationship | randomly distributed residuals | 0.967 |
| 6 | shows curved relationship | clear pattern in residuals | 0.826 |
Ranking: Model 2 > Model 5 > Model 3 > Model 1 > Model 6 > Model 4
Exercise 9
ANSWER \(\widehat{\sqrt{Y}}\) = 301.4 + 0.4138 X
Exercise 10
The equation of the regression line of Model 5 is:
\(\widehat{\log_{10}(Y)}\) = ……… + ………. X (fill in the blanks)
ANSWER
\(\widehat{\log_{10}(Y)}\) = 5.062 + 0.0006686X
Rewrite this equation in Y = ……………………………. .
ANSWER
\(Y = 10^{5.062 + 0.0006686X}\)
\(Y = 10^{5.062} \times 10^{0.0006686X}\)
\(Y = 10^{5.062} \times (10^{0.0006686})^{X}\)
\(Y = 115,373 \times 1.00154^{X}\)
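A short sketch of the same back-transformation in Python, using the rounded coefficients 5.062 and 0.0006686. With these rounded values the constant comes out slightly below the 115,373 given above, presumably because that answer was computed with the unrounded intercept. The function names are assumptions for this example.

```python
import math

# Rounded coefficients of Model 5 as used in the answer above.
intercept = 5.062
slope = 0.0006686

a = 10 ** intercept              # constant of the exponential function
growth_factor = 10 ** slope      # factor by which Y changes per unit of X

def y_log_form(x):
    """Y obtained by back-transforming the fitted log10(Y)."""
    return 10 ** (intercept + slope * x)

def y_exponential_form(x):
    """Y written directly as an exponential function a * growth_factor^X."""
    return a * growth_factor ** x

# Percentage increase in the predicted Y for each one-unit increase in X.
percent_per_unit = (growth_factor - 1) * 100

print(a, growth_factor, percent_per_unit)
print(y_log_form(300), y_exponential_form(300))
```

Read this way, Model 5 says that each one-unit increase in X multiplies the predicted Y by about 1.00154, that is, an increase of roughly 0.15 percent.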
In this example we use World Bank data at country level. In the data set the objects are the countries of the world and the variables are GDP per person (GDP_PP) and life expectancy (LIFE_EXP).
The question is whether there is a relationship between these two variables and, if so, how to model this relationship with an OLS regression model.
Table 3
Head World Bank Data 2019
COUNTRY | COUNTRYCODE | YEAR | GDP_PP | LIFE_EXP |
Afghanistan | AFG | YR2019 | 494 | 65 |
Albania | ALB | YR2019 | 5,396 | 79 |
Algeria | DZA | YR2019 | 3,990 | 77 |
Angola | AGO | YR2019 | 2,810 | 61 |
Antigua and Barbuda | ATG | YR2019 | 17,377 | 77 |
Argentina | ARG | YR2019 | 10,057 | 77 |
The first step is, as always, graphing the data. To examine the relationship between two quantitative variables, a scatter plot is the most commonly used graph.
Figure 7
Life Expectancy versus GDP per Person for 237 Countries in 2019
Figure 7a
Residuals versus GDP per Person
Comment
As is clear from the scatter plot, the relationship between the two variables is not a linear one. Despite this, an OLS linear regression model is applied on the data.
Model 7: OLS Linear Regression Model Life_Expectancy ~ GDP PP in 2019
##
## Call:
## lm(formula = LIFE_EXP ~ GDP_PP, data = df_2019)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.418 -3.229 1.505 4.244 8.592
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.955e+01 4.432e-01 156.94 <2e-16 ***
## GDP_PP 1.986e-04 1.599e-05 12.42 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.651 on 235 degrees of freedom
## Multiple R-squared: 0.3963, Adjusted R-squared: 0.3937
## F-statistic: 154.2 on 1 and 235 DF, p-value: < 2.2e-16
As was already clear from the scatter plot, an OLS linear regression model is not an adequate model to describe the relationship between Life_Expectancy and GDP per Person.
The plot in Figure 7 shows the relationship between LIFE_EXP and GDP_PP as a curve that rises quickly at first and then flattens more and more. Functions like Y = a \(\times\) \(X^b\) with \(0 < b < 1\) have this behavior.
If Y = a \(\times\) \(X^b\), then
log(Y) = log(a \(\times\) \(X^b\)) or
log(Y) = log(a) + log(\(X^b\)) or
log(Y) = log(a) + b \(\times\) log(X)
The last equation shows a linear relation between log(Y) and log(X).
That’s why a combined transformation Y \(\rightarrow\) log(Y) and X \(\rightarrow\) log(X) can be tried.
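The derivation can be illustrated numerically. The sketch below uses made-up values for a and b (they are not estimates from the World Bank data) and shows that points generated by a power function lie on a straight line after both variables are log-transformed, with slope b and intercept log(a).

```python
import math

# Hypothetical parameters of a power function, for illustration only.
a, b = 50.0, 0.3

xs = [10, 100, 1000, 10000]
ys = [a * x ** b for x in xs]

log_x = [math.log10(x) for x in xs]
log_y = [math.log10(y) for y in ys]

# Slope between each pair of consecutive points on the log-log scale.
slopes = [
    (log_y[i + 1] - log_y[i]) / (log_x[i + 1] - log_x[i])
    for i in range(len(xs) - 1)
]

# Intercept of the line on the log-log scale, compared with log10(a).
intercept = log_y[0] - b * log_x[0]
log_a = math.log10(a)

print(slopes)
print(intercept, log_a)
```

Every slope equals b and the intercept equals log10(a), up to floating-point rounding.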
Figure 8
log(Life Expectancy) versus log(GDP per Person) for 237 Countries in 2019
Figure 8a
Residuals versus GDP per Person
Comment
The scatterplot shows that an OLS linear regression model seems an adequate model to describe the relationship.
Model 8: log(Life_Expectancy) ~ log(GDP PP) in 2019
##
## Call:
## lm(formula = log10(LIFE_EXP) ~ log10(GDP_PP), data = df_2019)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.097708 -0.011149 0.002883 0.015812 0.045574
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.623892 0.010128 160.33 <2e-16 ***
## log10(GDP_PP) 0.061833 0.002632 23.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02465 on 235 degrees of freedom
## Multiple R-squared: 0.7014, Adjusted R-squared: 0.7001
## F-statistic: 552 on 1 and 235 DF, p-value: < 2.2e-16
Another option is to transform only the X variable into log(X) and use the OLS regression model Y = \(\alpha\) + \(\beta\) \(\times\) log(X).
Figure 9
Life Expectancy versus log10(GDP per Person) for 237 Countries in 2019
Figure 9a
Residuals versus log10(GDP per Person)
Comment
The scatterplot shows that an OLS linear regression model seems an adequate model to describe the relationship.
Model 9: Life_Expectancy ~ log(GDP PP) in 2019
##
## Call:
## lm(formula = LIFE_EXP ~ log10(GDP_PP), data = df_2019)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.167 -1.800 0.528 2.342 7.157
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.070 1.562 21.80 <2e-16 ***
## log10(GDP_PP) 10.148 0.406 24.99 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.803 on 235 degrees of freedom
## Multiple R-squared: 0.7266, Adjusted R-squared: 0.7255
## F-statistic: 624.7 on 1 and 235 DF, p-value: < 2.2e-16
Exercise
Compare the three models for the World Bank data.
Write down the regression line equation for each of the models.
Rank the models from (1) best fitting model to (3) worst fitting model. And as always, explain your choices.
See Table 4 below.
Suppose Model 9 is applied to the data set.
What is, in that case, the residual for Jordan?
What is the meaning of the residual value for Jordan in this context? Be precise in your answer.
Table 4
COUNTRY | COUNTRYCODE | YEAR | GDP_PP | LIFE_EXP |
Germany | DEU | YR2019 | 46,795 | 81 |
Iraq | IRQ | YR2019 | 5,981 | 71 |
Israel | ISR | YR2019 | 43,951 | 83 |
Jordan | JOR | YR2019 | 4,405 | 75 |
Korea, Rep. | KOR | YR2019 | 31,937 | 83 |
Netherlands | NLD | YR2019 | 52,476 | 82 |
United States | USA | YR2019 | 65,280 | 79 |
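As a worked illustration of how a residual under Model 9 is obtained, the sketch below computes it for Iraq rather than Jordan, so that the exercise itself is left open. It uses the rounded coefficients from the Model 9 output; the function names are assumptions for this example.

```python
import math

# Rounded coefficients from the Model 9 output.
intercept = 34.070
slope = 10.148

def predict_life_exp(gdp_pp):
    """Predicted life expectancy under Model 9 (LIFE_EXP ~ log10(GDP_PP))."""
    return intercept + slope * math.log10(gdp_pp)

def residual(gdp_pp, life_exp):
    """Residual = observed value minus predicted value."""
    return life_exp - predict_life_exp(gdp_pp)

# Iraq in the table above: GDP_PP = 5,981 and LIFE_EXP = 71.
iraq_predicted = predict_life_exp(5981)
iraq_residual = residual(5981, 71)

print(round(iraq_predicted, 2), round(iraq_residual, 2))
```

A negative residual means that the observed life expectancy is lower than the model predicts for a country with that GDP per person; a positive residual means it is higher.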