Inference for Regression Models (3)

Worksheet WS20220411

Transforming Variables to Achieve Linearity

If the relationship between two variables is not linear, it may still be possible to apply the linear model after a non-linear transformation of one or both of the variables. A linear transformation doesn’t make sense in such cases, because a linear transformation doesn’t affect the strength of a linear relationship.
Below you first find a fictitious data set to which a couple of transformations are applied, on one or both variables. An OLS linear regression model is used to describe the relationship between the transformed variable and the untransformed variable or, if both variables are transformed, between the two transformed variables.

Example

Table 1
Head of the fictitious data set

X	Y
103	171,376
125	147,400
161	148,778
199	150,811
239	159,908

As always, the first step is plotting Y versus X in a scatter plot.

Figure 1
Scatterplot Y ~ X

Figure 1a
Residual plot RESIDUAL ~ X

Comment
The plot in Figure 1 shows a positive association between Y and X. Let’s first estimate an OLS simple linear regression model and also plot the corresponding residual plot (Figure 1a).

Model 1: OLS Linear Regression Model Output Y~X

## 
## Call:
## lm(formula = Y ~ X, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -66995 -25951  -4995  22500  83892 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 40547.20   13784.20   2.942  0.00554 ** 
## X             455.70      22.12  20.605  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35480 on 38 degrees of freedom
## Multiple R-squared:  0.9178, Adjusted R-squared:  0.9157 
## F-statistic: 424.6 on 1 and 38 DF,  p-value: < 2.2e-16

Although R² is high, the scatter plot shows that the relationship is curved. The residual plot shows a clear pattern (positive values - negative values - positive values). This is a reason to adjust the model by applying a transformation to the X-variable, the Y-variable or both.
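A minimal R sketch of this first step (fit the line, then inspect the residuals). The worksheet's full data set is not reproduced here, so the data frame `sim` is simulated from a quadratic trend plus noise. That is an assumption made purely for illustration, and its numbers will not match Model 1.

```r
# Simulated stand-in for the worksheet data (assumption: quadratic trend plus noise).
set.seed(1)
sim <- data.frame(X = seq(100, 1000, length.out = 40))
sim$Y <- 140000 + 0.41 * sim$X^2 + rnorm(40, sd = 16000)

# Fit the simple linear model Y ~ X and extract the residuals.
fit1 <- lm(Y ~ X, data = sim)
res1 <- resid(fit1)

# Residual plot (drawn only in an interactive session):
# a curved band around zero signals a non-linear relationship.
if (interactive()) {
  plot(sim$X, res1, xlab = "X", ylab = "Residual")
  abline(h = 0, lty = 2)
}

# Mean residual at both ends and in the middle of the X-range.
ends   <- mean(res1[c(1:5, 36:40)])
middle <- mean(res1[14:27])
c(ends = ends, middle = middle)
```

Positive mean residuals at the ends and a negative mean in the middle are the numerical counterpart of the positive - negative - positive pattern seen in a residual plot.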

Exercise 1

What is the equation of the estimated regression line in Model 1?
ANSWER: Y = 40,547 + 455.7X
What is the value of R²?
ANSWER: 0.918
What stands out when you study the residual plot?
ANSWER: a clear pattern, positive values - negative values - positive values

Apply X \(\rightarrow\) X² transformation

Based on what we see in the graph, a possibility is that the relationship between Y and X can be described by a quadratic model Y = \(\alpha\) + \(\beta\)X².
To apply this model, we bring it back to a simple linear model using a transformation on the X-variable by squaring the X-values. In the data set we add a variable X2 = X² and then we apply the OLS linear regression model on Y ~ X2.
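As a sketch of how this transformation might be done in R, the block below uses only the five rows shown in Table 1. The name `head_df` is an assumption for illustration; five rows are far too few to reproduce Model 2, so the fits only demonstrate the two equivalent ways of writing the model.

```r
# First five rows of the worksheet's data set (Table 1).
head_df <- data.frame(
  X = c(103, 125, 161, 199, 239),
  Y = c(171376, 147400, 148778, 150811, 159908)
)

# Option 1: add the squared variable as a new column.
head_df$X2 <- head_df$X^2
head_df$X2
# 10609 15625 25921 39601 57121

# Option 2: square inside the formula; I() keeps ^ from being read as formula syntax.
fit_col    <- lm(Y ~ X2, data = head_df)
fit_inline <- lm(Y ~ I(X^2), data = head_df)

# Both ways of writing the model give the same intercept and slope.
all.equal(unname(coef(fit_col)), unname(coef(fit_inline)))
```

Adding the column explicitly, as the worksheet does, has the advantage that the transformed values can be inspected and plotted directly.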

Table 2
Head of the Data Set with added Variable X2 = X²

X	X2	Y
103	10,609	171,376
125	15,625	147,400
161	25,921	148,778
199	39,601	150,811
239	57,121	159,908

In Figure 2 the Y-values are plotted against the X²-values.

Figure 2
Example Scatterplot Y ~ X² values

Figure 2a
Example Scatterplot Residuals ~ X² values

Comment
Figure 2 shows a linear relationship between Y and X2 (= X²). The model parameters are estimated by applying an OLS regression on Y versus X².
Figure 2a shows the residual plot in which the residuals are plotted against the X² values. Unlike Figure 1a, it shows no apparent pattern in the residuals.

Model 2: OLS Regression Model Output Y~X²

## 
## Call:
## lm(formula = Y ~ X2, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -41303  -9502   -985  10819  27489 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.395e+05  4.262e+03   32.73   <2e-16 ***
## X2          4.131e-01  8.797e-03   46.96   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16110 on 38 degrees of freedom
## Multiple R-squared:  0.9831, Adjusted R-squared:  0.9826 
## F-statistic:  2205 on 1 and 38 DF,  p-value: < 2.2e-16

Unlike Figure 1a, the residual plot in Figure 2a shows no apparent pattern. Together with the higher R² value, this suggests that Model 2 fits the data better than Model 1.

Exercise 2

What is the equation of the estimated regression line in Model 2?
ANSWER: Y = 139,504 + 0.4131X²
What is the value of R²?
ANSWER: 0.983

Comparing two models

In general, comparing two models is done based on:

comparing the residual plots
comparing the R² values
comparing the standard error of the residuals

Comparing Model 1 and Model 2

Comparing Residual Plots
The residual plot from Model 1 shows a clear pattern in the residuals; the pattern from Model 2 is not that clear.
Model 2 scores better than Model 1 on this point.

Comparing R² values
Model 1: R² = 0.918
Model 2: R² = 0.983
The coefficient of determination R² is higher for Model 2 than for Model 1.
So Model 2 is better than Model 1 as far as R² is concerned.

Comparing SE residuals
Model 1: SE_res = 35,480
Model 2: SE_res = 16,110

The lower the variability in the residuals, the lower the variability in the Y-estimates, the better the model.
So also on this criterion, Model 2 performs better than Model 1.

Conclusion
Model 2 is the better of the two models.
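The two numerical criteria can be read directly from fitted `lm` objects. The sketch below uses simulated data (an assumption; `sim` is not the worksheet's data set), so the values differ from those reported for Model 1 and Model 2, but the comparison works the same way.

```r
# Simulated stand-in for the worksheet data (assumption: quadratic trend plus noise).
set.seed(1)
sim <- data.frame(X = seq(100, 1000, length.out = 40))
sim$Y  <- 140000 + 0.41 * sim$X^2 + rnorm(40, sd = 16000)
sim$X2 <- sim$X^2

fit1 <- lm(Y ~ X,  data = sim)   # untransformed model
fit2 <- lm(Y ~ X2, data = sim)   # model with squared X

# R-squared and residual standard error side by side.
comparison <- data.frame(
  model  = c("Y ~ X", "Y ~ X2"),
  r2     = c(summary(fit1)$r.squared, summary(fit2)$r.squared),
  se_res = c(summary(fit1)$sigma, summary(fit2)$sigma)
)
comparison
```

The residual plots still have to be compared by eye; a table like this only covers the second and third criterion.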

Exercise 3
Using Model 2 to make predictions for a Y-value
Using the model to estimate the Y-value given an X-value is straightforward.
The Model 2 regression equation is: \(\hat{Y}\) = 139,500 + 0.4131 \(\times\) X²

If for instance X = 300, \(\hat{Y}\) = 139,500 + 0.4131 \(\times\) 300² = …………………………….

ANSWER: 176,686

A 95% CI for this Y-value is calculated as follows:
95% CI = Point Estimate \(\pm\) Margin of Error = ……………. \(\pm\) t_{df=n-2} \(\times\) s_e = …………………………………………………………
The s_e-value (the residual standard error) can be found in the regression output; the t-value can be found in a t-table or with a graphing calculator.

ANSWER:
with point_estimate = 176,686, t-value = 2.024 and s_e = 16,109
<144,076; 209,297>
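The same calculation can be sketched in R. The coefficients below are the rounded values from the Model 2 output and s_e = 16,109 is the value used in the answer above; because of this rounding the result differs from the worksheet's answer by a few units.

```r
# Rounded values taken from the Model 2 output and the answer above.
b0  <- 139504    # intercept
b1  <- 0.4131    # slope for X squared
s_e <- 16109     # residual standard error
n   <- 40        # 38 residual degrees of freedom + 2 estimated parameters

x_new  <- 300
y_hat  <- b0 + b1 * x_new^2          # point estimate
t_crit <- qt(0.975, df = n - 2)      # two-sided 95% critical value, about 2.024
ci     <- y_hat + c(-1, 1) * t_crit * s_e

y_hat
ci
```

Note that this is the simplified formula used in this worksheet. With a fitted model object, `predict(fit, newdata, interval = "prediction")` also accounts for the uncertainty in the estimated coefficients and therefore returns a somewhat wider interval.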

Apply Y \(\rightarrow\) \(\sqrt{Y}\) transformation

Instead of transforming the X-variable into X², we could also have chosen to transform the Y-variable into \(\sqrt{Y}\) and apply the OLS linear regression model on \(\sqrt{Y}\) versus X: \(\sqrt{Y}\) = \(\alpha\) + \(\beta\)X + \(\epsilon\).

Figure 3
Example Scatterplot SQRT(Y) ~ X values

Figure 3a
Example Scatterplot Residuals ~ X values

Comment
Figure 3 shows a linear relationship between \(\sqrt{Y}\) and X. The model parameters are estimated by applying an OLS linear regression on \(\sqrt{Y}\) versus X.
Figure 3a shows the plot in which the residuals are plotted against the X-values.

Model 3: OLS Regression Model Output SQRT(Y)~X

## 
## Call:
## lm(formula = SQRT_Y ~ X, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.354 -14.837  -1.208  14.744  69.950 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 301.40123    9.72900   30.98   <2e-16 ***
## X             0.41383    0.01561   26.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.04 on 38 degrees of freedom
## Multiple R-squared:  0.9487, Adjusted R-squared:  0.9474 
## F-statistic: 702.8 on 1 and 38 DF,  p-value: < 2.2e-16

There is a clearer pattern in the residuals than in Figure 2a.

Exercise 4
Compare Model 2 and Model 3.

What is the regression line equation in Model 3?
ANSWER: \(\widehat{\sqrt{Y}}\) = 301.4 + 0.4138 X
Based on the distributions in the residual plots, which of the two is the better model?
ANSWER: model 2; residual plot in model 3 has a clear pattern
Based on the R² values, which of the two is the better model?
ANSWER: model 2, because R² in model 2 (0.983) is higher than in Model 3 (0.949)

Note. It makes no sense in this case to compare the residual standard errors (s_e). The s_e in Model 2, as in Model 1, measures the variability of the Y-values around the regression line and has the same unit as the Y-values. The s_e in Model 3 measures the variability of the \(\sqrt{Y}\)-values and has the same unit as the \(\sqrt{Y}\)-values. So if, for instance, the Y-values are in dollars, the unit of s_e in Model 3 is \(\sqrt{\text{dollar}}\) (whatever that may look like).

Exercise 5

Use Model 1 to calculate a 95% CI for the Y-value given X = 500.
ANSWER
X = 500
\(\hat{Y}\) = 40,547 + 455.7 \(\times\) 500 = 268,397
95% CI for Y: \(\hat{Y} \pm t_{df=38} \times s_e\) = 268,397 \(\pm\) 2.024 \(\times\) 35,475 = <196,582 ; 340,213>
Use Model 2 to calculate a 95% CI for the Y-value given X = 500.
ANSWER
X = 500; X² = 250,000
\(\hat{Y}\) = 139,504 + 0.4131 \(\times\) 250,000 = 242,788
95% CI for Y: \(\hat{Y} \pm t_{df=38} \times s_e\) = 242,788 \(\pm\) 2.024 \(\times\) 16,109 = <210,178 ; 275,399>
Use Model 3 to calculate a 95% CI for the Y-value given X = 500.
ANSWER
X = 500
\(\widehat{\sqrt{Y}}\) = 301.4 + 0.4138 \(\times\) 500 = 508.32
\(\hat{Y}\) = 508.32² = 258,388
95% CI for \(\sqrt{Y}\): \(\widehat{\sqrt{Y}} \pm t_{df=38} \times s_e\) = 508.32 \(\pm\) 2.024 \(\times\) 25.04 = <457.63 ; 559.01>
95% CI for Y (square both limits): <209,426 ; 312,488>
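For Model 3 the interval is first built on the \(\sqrt{Y}\) scale and then squared. A sketch in R, using the coefficients and residual standard error from the Model 3 output (the variable names are illustrative):

```r
# Values taken from the Model 3 output.
b0  <- 301.40123   # intercept on the sqrt(Y) scale
b1  <- 0.41383     # slope on the sqrt(Y) scale
s_e <- 25.04       # residual standard error on the sqrt(Y) scale

# Point estimate and interval on the sqrt(Y) scale.
sqrt_hat <- b0 + b1 * 500
ci_sqrt  <- sqrt_hat + c(-1, 1) * qt(0.975, df = 38) * s_e

# Back-transform by squaring. This preserves the order of the limits
# because both limits are positive.
y_hat <- sqrt_hat^2
ci_y  <- ci_sqrt^2

ci_sqrt
ci_y
```

The back-transformed interval is not symmetric around \(\hat{Y}\), because squaring stretches the upper part of the interval more than the lower part.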

Log-transformations

Another form of transformation is a log-transformation. This can be applied to the X-variable (Y ~ log(X) model), to the Y-variable (log(Y) ~ X model), or to both (log(Y) ~ log(X) model).
These models are shown below. The logarithm with base 10 is used in the models. It is also possible to use another base; a common choice is the natural logarithm ln, the logarithm with base e (\(\approx\) 2.7182818).
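The choice of base does not change the fit, only the scale of the slope. A small R sketch with simulated data (an assumption; `sim` is not the worksheet's data set) illustrates this: in R, `log10()` is the base-10 logarithm and `log()` is the natural logarithm.

```r
# Simulated data with a logarithmic trend (assumption, for illustration only).
set.seed(2)
sim <- data.frame(X = seq(100, 1000, length.out = 40))
sim$Y <- 50000 + 90000 * log10(sim$X) + rnorm(40, sd = 5000)

fit_log10 <- lm(Y ~ log10(X), data = sim)   # base-10 logarithm
fit_ln    <- lm(Y ~ log(X),   data = sim)   # natural logarithm

# Same fitted values, hence the same residuals and the same R-squared.
all.equal(unname(fitted(fit_log10)), unname(fitted(fit_ln)))

# Only the slope is rescaled, by a factor ln(10).
slope_ratio <- unname(coef(fit_log10)[2] / coef(fit_ln)[2])
c(slope_ratio = slope_ratio, ln10 = log(10))
```

Because log₁₀(X) = ln(X) / ln(10), switching base is itself a linear transformation of the transformed variable, which is why the strength of the linear relationship is unaffected.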

Apply X \(\rightarrow\) log10(X) transformation

Figure 4
Scatterplot Y ~ log10(X) values

Figure 4a
Scatterplot Residuals ~ log10(X) values

Comment
Figure 4 shows the relationship between Y and log(X). An OLS linear regression model is applied, although based on Figure 4 this doesn’t seem a good idea.
The model parameters are estimated by applying an OLS linear regression on Y versus log(X).
Figure 4a shows the plot in which the residuals are plotted against the log(X)-values.

Model 4: OLS Regression Model Output Y ~ log(X)

## 
## Call:
## lm(formula = Y ~ log10(X), data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -99314 -40909 -18920  36705 152862 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -809930     112701  -7.187 1.38e-08 ***
## log10(X)      411580      41615   9.890 4.64e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 65470 on 38 degrees of freedom
## Multiple R-squared:  0.7202, Adjusted R-squared:  0.7128 
## F-statistic: 97.82 on 1 and 38 DF,  p-value: 4.636e-12

There is a clear pattern in the residuals, as in Figure 1a.
Compared with Model 1, this transformation does not lead to a better-fitting model; on the contrary, the fit is worse.

Exercise 6

What is the equation of the regression line in Model 4?
ANSWER: Y = -809,930 + 411,580log₁₀(X)
Why is this model worse than Model 3?
ANSWER

stronger pattern in residual plot in Model 4
R² is smaller in Model 4 than in Model 3 (0.720 in Model 4, 0.949 in Model 3)

Apply Y \(\rightarrow\) log(Y) transformation

Figure 5
Scatterplot log(Y) ~ X values

Figure 5a
Scatterplot Residuals ~ X values

Comment
Figure 5 shows a linear relationship between log(Y) and X values. The model parameters are estimated by applying an OLS linear regression on log(Y) versus X.
Figure 5a shows the plot in which the residuals are plotted against the X-values.

Model 5: OLS Regression Model Output LOG(Y) ~ X

## 
## Call:
## lm(formula = log10(Y) ~ X, data = df)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.091705 -0.018100  0.000677  0.020620  0.102983 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.062e+00  1.253e-02  403.93   <2e-16 ***
## X           6.686e-04  2.011e-05   33.25   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03225 on 38 degrees of freedom
## Multiple R-squared:  0.9668, Adjusted R-squared:  0.9659 
## F-statistic:  1106 on 1 and 38 DF,  p-value: < 2.2e-16

Exercise 6 (continued)

What is the equation of the regression line in Model 5?
ANSWER: \(\widehat{\log_{10}(Y)}\) = 5.062 + 0.0006686X
Assess Model 5 by examining Figure 5, Figure 5a and the R² value of Model 5.
ANSWER

Figure 5, scatterplot shows linear relationship between log₁₀(Y) and X
Figure 5a, residual plot, does not show a clear pattern in the residuals
R² = 0.967; 96.7% of the variation in the log₁₀(Y) values is explained by the model (that is, by the variation in the X-values)
Conclusion: good and useful model

Apply Y \(\rightarrow\) log(Y) and X \(\rightarrow\) log(X) transformation

Figure 6
Scatterplot log(Y) ~ log(X) values

Figure 6a
Scatterplot Residuals ~ log(X) values

Comment
Figure 6 shows a linear relationship between log(Y) and log(X). The model parameters are estimated by applying an OLS linear regression on log(Y) versus log(X).
Figure 6a shows the plot in which the residuals are plotted against the log(X)-values.

Model 6: OLS Regression Model Output LOG(Y) ~ LOG(X)

## 
## Call:
## lm(formula = log10(Y) ~ log10(X), data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.13923 -0.05482 -0.01547  0.03959  0.22223 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.74310    0.12690   29.50   <2e-16 ***
## log10(X)     0.63026    0.04686   13.45    5e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07372 on 38 degrees of freedom
## Multiple R-squared:  0.8264, Adjusted R-squared:  0.8218 
## F-statistic: 180.9 on 1 and 38 DF,  p-value: 5.002e-16

Exercise 7

What is the equation of the regression line in Model 6?
ANSWER: log₁₀(Y) = 3.743 + 0.630 log₁₀(X)
Rewrite this equation to Y = ….
ANSWER
10^log₁₀(Y) = 10^{3.743+0.630log₁₀(X)}
Y = 10^3.743 * 10^{0.630log₁₀(X)}
Y = 10^3.743 * (10^log₁₀(X))^0.630
Y = 10^3.743 * X^0.630
Y = 5534.8 * X^0.630
Assess Model 6 by examining Figure 6, Figure 6a and the R² value of Model 6.
ANSWER

Figure 6, scatterplot shows a curved relationship between log₁₀(Y) and log₁₀(X)
Figure 6a, residual plot, shows a clear pattern in the residuals
R² = 0.826; 82.6% of the variation in the log₁₀(Y) values is explained by the model (that is, by the variation in the log₁₀(X)-values)
Conclusion: although R² is fairly high, the pattern in the residuals shows that another model will be more appropriate
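The rewriting of the Model 6 equation into a power function can be checked numerically. The sketch below uses the coefficients from the Model 6 output; the X-values are arbitrary illustration points, not worksheet data.

```r
# Coefficients from the Model 6 output.
a <- 3.74310   # intercept on the log10 scale
b <- 0.63026   # slope on the log10 scale

x <- c(100, 500, 1000)   # arbitrary X-values for the check

# Prediction via the log-log equation, then back-transformed with 10^.
y_from_log <- 10^(a + b * log10(x))

# Prediction via the rewritten power function Y = 10^a * X^b.
y_power <- 10^a * x^b

10^a                              # the multiplicative constant, about 5534.8
all.equal(y_from_log, y_power)    # both forms give the same predictions
```

The slope on the log-log scale becomes the exponent of X, and the intercept becomes the multiplicative constant after raising 10 to that power.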

The best of six models

Exercise 8
Rank the six models from 1 (the best fitting) to 6.

ANSWER

MODEL	Scatter plot	Residual plot	R²
1	roughly linear, but a curved model seems better	clear pattern in residuals	0.918
2	linear relationship	residuals randomly distributed	0.983
3	roughly linear, but a curved model seems better	clear pattern in residuals	0.949
4	curved relationship, not linear	strong pattern in residuals	0.720
5	linear relationship	residuals randomly distributed	0.967
6	curved relationship	clear pattern in residuals	0.826

Ranking: Model 2 > Model 5 > Model 3 > Model 1 > Model 6 > Model 4

Exercise 9

The equation of the regression line of Model 3 is:

ANSWER \(\widehat{\sqrt{Y}}\) = 301.4 + 0.4138 X

Rewrite this equation in Y = ….. .
ANSWER Y = (301.4 + 0.4138X)² = 90,843 + 249.5X + 0.171X²
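The expansion behind this answer follows from \((a + bX)^2 = a^2 + 2abX + b^2X^2\). A worked version, taking a = 301.40123 and b = 0.41383 from the Model 3 output:

```latex
\begin{aligned}
\hat{Y} &= (301.40123 + 0.41383\,X)^2 \\
        &= 301.40123^2 + 2 \times 301.40123 \times 0.41383\,X + 0.41383^2\,X^2 \\
        &\approx 90{,}843 + 249.5\,X + 0.171\,X^2
\end{aligned}
```

So a straight line on the \(\sqrt{Y}\) scale corresponds to a full quadratic in X on the original scale, including a linear term, which is what distinguishes Model 3 from Model 2.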

Exercise 10

The equation of the regression line of Model 5 is:
\(\widehat{\log_{10}(Y)}\) = ……… + ………. X (fill in the blanks)
ANSWER
\(\widehat{\log_{10}(Y)}\) = 5.062 + 0.0006686X
Rewrite this equation in Y = ……………………………. .
ANSWER
Y = 10^{5.062 + 0.0006686X}
Y = 10^{5.062} * 10^{0.0006686X}
Y = 10^{5.062} * (10^{0.0006686})^X
Y = 115,373 \(\times\) 1.00154^X
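A short R check of this rewriting, using the rounded coefficients printed in the Model 5 output. With the rounded intercept 5.062 the constant comes out at roughly 115,345; the worksheet's 115,373 presumably results from the unrounded intercept, which is an assumption on my part.

```r
# Rounded coefficients from the Model 5 output.
a <- 5.062        # intercept on the log10 scale
b <- 0.0006686    # slope on the log10 scale

constant <- 10^a   # multiplicative constant (about 115,345 with the rounded intercept)
growth   <- 10^b   # growth factor per unit increase in X (about 1.00154)

x <- c(100, 500, 1000)   # arbitrary X-values for the check

# The log-linear equation and the exponential form give the same predictions.
all.equal(10^(a + b * x), constant * growth^x)

c(constant = constant, growth = growth)
```

The growth factor has a direct interpretation: each additional unit of X multiplies the predicted Y by about 1.00154, an increase of roughly 0.154%.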

A Real-World Example

In this example we use country-level data from the World Bank. In the data set the objects are the countries of the world and the variables are:

GDP per capita; Gross Domestic Product in current USD in 2019
Life Expectancy at birth in 2019

The question is whether there is a relationship between these two variables and, if so, whether it can be modeled with an OLS regression model.

Table 3
Head World Bank Data 2019

COUNTRY	COUNTRYCODE	YEAR	GDP_PP	LIFE_EXP
Afghanistan	AFG	YR2019	494	65
Albania	ALB	YR2019	5,396	79
Algeria	DZA	YR2019	3,990	77
Angola	AGO	YR2019	2,810	61
Antigua and Barbuda	ATG	YR2019	17,377	77
Argentina	ARG	YR2019	10,057	77

The first step is, as always, graphing the data. To examine the relationship between two quantitative variables, a scatter plot is the most commonly used graph.

Figure 7
Life Expectancy versus GDP per Person for 237 Countries in 2019

Figure 7a
Residuals versus GDP per Person

Comment
As is clear from the scatter plot, the relationship between the two variables is not a linear one. Despite this, an OLS linear regression model is applied on the data.

Model 7: OLS Linear Regression Model Life_Expectancy ~ GDP PP in 2019

## 
## Call:
## lm(formula = LIFE_EXP ~ GDP_PP, data = df_2019)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.418  -3.229   1.505   4.244   8.592 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.955e+01  4.432e-01  156.94   <2e-16 ***
## GDP_PP      1.986e-04  1.599e-05   12.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.651 on 235 degrees of freedom
## Multiple R-squared:  0.3963, Adjusted R-squared:  0.3937 
## F-statistic: 154.2 on 1 and 235 DF,  p-value: < 2.2e-16

As was already clear from the scatter plot, an OLS linear regression model is not an adequate model to describe the relationship between Life_Expectancy and GDP per Person.

The plot in Figure 7 shows the relationship between LIFE_EXP and GDP_PP as a curve that rises quickly at first and then flattens more and more. Functions like Y = a \(\times\) X^b with \(0 < b < 1\) have this behavior.

If Y = a \(\times\) X^b, then
log(Y) = log(a \(\times\) X^b) or
log(Y) = log(a) + log(X^b) or
log(Y) = log(a) + b \(\times\) log(X)

The last equation shows a linear relation between log(Y) and log(X).
That’s why a combined Y \(\rightarrow\) log(Y) and X \(\rightarrow\) log(X) transformation can be tried.
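The derivation above can be illustrated with a small R sketch. The values a = 30 and b = 0.1 and the X-values are hypothetical, chosen only to show that a regression on the log-log scale recovers the parameters of an exact power function.

```r
# Hypothetical power function Y = a * X^b (values chosen for illustration only).
a <- 30
b <- 0.1
x <- c(500, 1000, 5000, 10000, 50000)
y <- a * x^b

# On the log-log scale the relationship is exactly linear.
fit <- lm(log10(y) ~ log10(x))

b_hat <- unname(coef(fit)[2])        # slope estimates the exponent b
a_hat <- 10^unname(coef(fit)[1])     # 10^intercept estimates the constant a

c(a_hat = a_hat, b_hat = b_hat)
```

With real data the points scatter around the line, so the estimates are approximate and the residual plot remains the main check on whether the power-function shape is appropriate.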

Figure 8
log(Life Expectancy) versus log(GDP per Person) for 237 Countries in 2019

Figure 8a
Residuals versus GDP per Person

Comment
The scatterplot shows that an OLS linear regression model seems an adequate model to describe the relationship.

Model 8: log(Life_Expectancy) ~ log(GDP PP) in 2019

## 
## Call:
## lm(formula = log10(LIFE_EXP) ~ log10(GDP_PP), data = df_2019)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.097708 -0.011149  0.002883  0.015812  0.045574 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.623892   0.010128  160.33   <2e-16 ***
## log10(GDP_PP) 0.061833   0.002632   23.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02465 on 235 degrees of freedom
## Multiple R-squared:  0.7014, Adjusted R-squared:  0.7001 
## F-statistic:   552 on 1 and 235 DF,  p-value: < 2.2e-16

Another option is to only transform the X variable into log(X) and use the OLS regression model Y = \(\alpha\) + \(\beta\) \(\times\) log(X)

Figure 9
Life Expectancy versus log10(GDP per Person) for 237 Countries in 2019

Figure 9a
Residuals versus log10(GDP) per Person

Comment
The scatterplot shows that an OLS linear regression model seems an adequate model to describe the relationship.

Model 9: Life_Expectancy ~ log(GDP PP) in 2019

## 
## Call:
## lm(formula = LIFE_EXP ~ log10(GDP_PP), data = df_2019)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.167  -1.800   0.528   2.342   7.157 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     34.070      1.562   21.80   <2e-16 ***
## log10(GDP_PP)   10.148      0.406   24.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.803 on 235 degrees of freedom
## Multiple R-squared:  0.7266, Adjusted R-squared:  0.7255 
## F-statistic: 624.7 on 1 and 235 DF,  p-value: < 2.2e-16

Exercise
Compare the Three Models for the World Bank data.

Write up the regression line equation for each of the models.
Rank the models from (1) best-fitting model to (3) worst-fitting model. And, as always, explain your choices.
See Table 4 below.
Suppose Model 9 is applied to the data set.
What is, in that case, the residual for Jordan?
What is the meaning of the residual value for Jordan in this context? Be precise in your answer.

JHvdZwan

04/11/2022

Table 4
World Bank Data 2019 for Selected Countries

COUNTRY	COUNTRYCODE	YEAR	GDP_PP	LIFE_EXP
Germany	DEU	YR2019	46,795	81
Iraq	IRQ	YR2019	5,981	71
Israel	ISR	YR2019	43,951	83
Jordan	JOR	YR2019	4,405	75
Korea, Rep.	KOR	YR2019	31,937	83
Netherlands	NLD	YR2019	52,476	82
United States	USA	YR2019	65,280	79
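As a sketch of how a residual under Model 9 is obtained, the block below works through Germany using the coefficients from the Model 9 output and the values in the table above. The helper name `model9_predict` is hypothetical; the same steps apply to any other country in the table.

```r
# Predicted life expectancy under Model 9 (coefficients from the Model 9 output).
model9_predict <- function(gdp_pp) {
  34.070 + 10.148 * log10(gdp_pp)
}

# Germany: GDP per capita 46,795 USD, observed life expectancy 81 years.
germany_pred  <- model9_predict(46795)
germany_resid <- 81 - germany_pred    # residual = observed - predicted

c(predicted = germany_pred, residual = germany_resid)
```

The residual is about -0.46, meaning that the observed life expectancy in Germany is roughly half a year lower than Model 9 predicts for a country with Germany's GDP per capita.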
