Worksheet WS20220411

Transforming Variables to Achieve Linearity

If the relationship between two variables is not linear, it may still be possible to apply the linear model after a non-linear transformation of one or both of the variables. A linear transformation doesn’t make sense in such cases, because a linear transformation doesn’t affect the strength of a linear relationship.
Below you first find a fictitious data set to which several transformations are applied, on one or both variables. An OLS linear regression model is used to describe the relationship between the transformed variable and the untransformed variable or, if both variables are transformed, between the two transformed variables.
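
The claim that a linear transformation leaves the strength of a linear relationship unchanged can be checked numerically. The sketch below uses made-up values (the vectors `x` and `y` and the constants 32 and 1.8 are illustrative assumptions, not the worksheet's data) and compares the correlation before and after a linear and a non-linear transformation of x.

```r
# Illustrative values only; not the worksheet's data set.
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- x^2                             # a curved (quadratic) relationship

r_original  <- cor(x, y)             # correlation between x and y
r_linear    <- cor(32 + 1.8 * x, y)  # after a linear transformation of x
r_nonlinear <- cor(x^2, y)           # after the non-linear transformation x -> x^2

print(c(original = r_original, linear = r_linear, nonlinear = r_nonlinear))
```

The first two correlations coincide, while squaring x changes the correlation; only a non-linear transformation can straighten a curved relationship.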

Example

Table 1
Head of the fictitious data set

As always, the first step is plotting Y versus X in a scatter plot.

Figure 1
Scatterplot Y ~ X

Figure 1a
Residual plot RESIDUAL ~ X

Comment
The plot in Figure 1 shows a positive association between Y and X. Let’s first estimate an OLS simple linear regression model and also plot the corresponding residual plot (Figure 1a).

Model 1: OLS Linear Regression Model Output Y~X

## 
## Call:
## lm(formula = Y ~ X, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -66995 -25951  -4995  22500  83892 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 40547.20   13784.20   2.942  0.00554 ** 
## X             455.70      22.12  20.605  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35480 on 38 degrees of freedom
## Multiple R-squared:  0.9178, Adjusted R-squared:  0.9157 
## F-statistic: 424.6 on 1 and 38 DF,  p-value: < 2.2e-16

Although \(R^2\) is high, the scatter plot shows that the relationship is curved. The residual plot shows a clear pattern (positive values - negative values - positive values). This is a reason to adjust the model by applying a transformation to the X-variable, the Y-variable or both.

Exercise 1

  1. What is the equation of the estimated regression line in Model 1?
    ANSWER: Y = 40,547 + 455.7X

  2. What is the value of \(R^2\)?
    ANSWER: 0.918

  3. What stands out when you study the residual plot?
    ANSWER: a clear pattern, positive values - negative values - positive values


Apply X \(\rightarrow\) \(X^2\) transformation

Based on what we see in the graph, the relationship between Y and X can possibly be described by a quadratic model Y = \(\alpha\) + \(\beta\)\(X^2\).
To apply this model, we reduce it to a simple linear model by transforming the X-variable: we square the X-values. In the data set we add a variable X2 = \(X^2\) and then we apply the OLS linear regression model to Y ~ X2.
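
The steps just described can be sketched in R as follows. The worksheet's data set is not reproduced here, so the sketch simulates a stand-in data frame; the column names `X`, `Y` and `X2` follow the model output in this worksheet, while the simulated values and the name `model_2` are assumptions made for illustration.

```r
# Stand-in data: the worksheet's data set is not reproduced here, so a
# hypothetical quadratic relationship with noise is simulated instead.
set.seed(1)
df <- data.frame(X = seq(100, 1000, length.out = 40))
df$Y <- 140000 + 0.4 * df$X^2 + rnorm(40, mean = 0, sd = 15000)

# Add the squared X-values as a new variable, then fit Y ~ X2 with OLS.
df$X2 <- df$X^2
model_2 <- lm(Y ~ X2, data = df)
print(coef(model_2))

# Residual plot: residuals against the transformed predictor.
plot(df$X2, resid(model_2), xlab = "X2", ylab = "Residual")
abline(h = 0, lty = 2)
```

An alternative that avoids the extra column is `lm(Y ~ I(X^2), data = df)`, which fits the same model.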

Table 2
Head of the Data Set with added Variable X2 = \(X^2\)


In Figure 2 the Y-values are plotted against the X2-values.

Figure 2
Example Scatterplot Y ~ X2 values

Figure 2a
Example Scatterplot Residuals ~ X2 values

Comment
Figure 2 shows a linear relationship between Y and X2 (= \(X^2\)). The model parameters are estimated by applying an OLS regression on Y versus X2.
Figure 2a shows the residual plot in which the residuals are plotted against the X2-values. Unlike Figure 1a, it shows no apparent pattern in the residuals.

Model 2: OLS Regression Model Output Y~X2

## 
## Call:
## lm(formula = Y ~ X2, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -41303  -9502   -985  10819  27489 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.395e+05  4.262e+03   32.73   <2e-16 ***
## X2          4.131e-01  8.797e-03   46.96   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16110 on 38 degrees of freedom
## Multiple R-squared:  0.9831, Adjusted R-squared:  0.9826 
## F-statistic:  2205 on 1 and 38 DF,  p-value: < 2.2e-16


Unlike Figure 1a, the residual plot of Model 2 shows no apparent pattern. Together with its higher \(R^2\) value, this suggests that Model 2 fits the data better than Model 1.

Exercise 2

  1. What is the equation of the estimated regression line in Model 2?
    ANSWER: Y = 139,504 + 0.4131X2

  2. What is the value of \(R^2\)?
    ANSWER: 0.983

Comparing two models

In general, comparing two models is done based on:

  • comparing the residual plots
  • comparing the \(R^2\) values
  • comparing the standard error of the residuals

Comparing Model 1 and Model 2

Comparing Residual Plots
The residual plot from Model 1 shows a clear pattern in the residuals; the residual plot from Model 2 shows no such clear pattern.
Model 2 scores better than Model 1 on this point.

Comparing \(R^2\) values
Model 1: \(R^2\) = 0.918
Model 2: \(R^2\) = 0.983
The coefficient of determination \(R^2\) is higher for Model 2 than for Model 1.
So Model 2 is better than Model 1 as far as \(R^2\) is concerned.

Comparing SE residuals
Model 1: SEres = 35,480
Model 2: SEres = 16,110

The lower the variability in the residuals, the more precise the Y-estimates, and the better the model.
So also on this criterion, Model 2 performs better than Model 1.

Conclusion
Model 2 is the better of the two models.
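
The two numerical criteria can be read directly from fitted model objects. The sketch below is only an illustration: it uses R's built-in `cars` data as a stand-in, with `dist ~ speed` and `dist ~ I(speed^2)` playing the roles of an untransformed and a squared-X model, and the helper name `fit_stats` is hypothetical.

```r
# Stand-in example with R's built-in `cars` data (speed, dist); the two
# models below play the roles of Model 1 and Model 2 in the worksheet.
fit_a <- lm(dist ~ speed, data = cars)
fit_b <- lm(dist ~ I(speed^2), data = cars)

# Collect R-squared and the residual standard error of a fitted model.
fit_stats <- function(fit) {
  s <- summary(fit)
  c(r_squared = s$r.squared, se_residuals = s$sigma)
}

comparison <- rbind(linear = fit_stats(fit_a), squared = fit_stats(fit_b))
print(round(comparison, 3))
```

Comparing the residual standard errors this way is only meaningful when both models have the same response variable, as is the case for Model 1 and Model 2. The residual plots still have to be compared by eye.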

Exercise 3
Using Model 2 to make predictions for a Y-value
Using the model to estimate the Y-value for a given X-value is straightforward.
The Model 2 regression equation is: \(\hat{Y}\) = 139,500 + 0.4131 \(\times\) X2

If for instance X = 300, \(\hat{Y}\) = 139,500 + 0.4131 \(\times\) \(300^2\) = …………………………….

ANSWER: 176,686

A 95% CI for this Y-value is calculated as follows:
95% CI = Point Estimate \(\pm\) Margin of Error = ……………. \(\pm\) \(t_{df=n-2}\) \(\times\) se = …………………………………………………………
The se-value (the residual standard error) can be found in the regression output, the t-value in a t-table or with a graphing calculator.

ANSWER:
with point_estimate = 176,686, t-value = 2.024 and se = 16,109
<144,076; 209,297>
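
The calculation can also be carried out in R, where `qt()` replaces the t-table. This is a minimal sketch that uses the rounded coefficients printed in the Model 2 output; the variable names are chosen for illustration, and n = 40 is inferred from the 38 degrees of freedom.

```r
# Values taken from the Model 2 output (rounded as printed there).
b0 <- 139500   # intercept
b1 <- 0.4131   # slope for X2
se <- 16110    # residual standard error
n  <- 40       # number of observations (df = n - 2 = 38)

x      <- 300
y_hat  <- b0 + b1 * x^2                    # point estimate
t_crit <- qt(0.975, df = n - 2)            # two-sided 95% critical value
ci     <- y_hat + c(-1, 1) * t_crit * se   # point estimate +/- margin of error

print(y_hat)
print(round(t_crit, 3))
print(round(ci))
```

Because the coefficients are rounded, the results differ by a few units from the answer above. The formula is also an approximation: with a fitted model object, `predict()` with `interval = "prediction"` additionally accounts for the uncertainty in the estimated coefficients and gives a slightly wider interval.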

Apply Y \(\rightarrow\) \(\sqrt{Y}\) transformation

Instead of transforming the X-variable into \(X^2\), we could also have chosen to transform the Y-variable into \(\sqrt{Y}\) and apply the OLS linear regression model to \(\sqrt{Y}\) versus X: \(\sqrt{Y}\) = \(\alpha\) + \(\beta\)X + \(\epsilon\).

Figure 3
Example Scatterplot SQRT(Y) ~ X values

Figure 3a
Example Scatterplot Residuals ~ X values

Comment
Figure 3 shows a linear relationship between \(\sqrt{Y}\) and X. The model parameters are estimated by applying an OLS linear regression on \(\sqrt{Y}\) versus X.
Figure 3a shows the plot in which the residuals are plotted against the X-values.

Model 3: OLS Regression Model Output SQRT(Y)~X

## 
## Call:
## lm(formula = SQRT_Y ~ X, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.354 -14.837  -1.208  14.744  69.950 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 301.40123    9.72900   30.98   <2e-16 ***
## X             0.41383    0.01561   26.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.04 on 38 degrees of freedom
## Multiple R-squared:  0.9487, Adjusted R-squared:  0.9474 
## F-statistic: 702.8 on 1 and 38 DF,  p-value: < 2.2e-16


The residuals in Figure 3a show a clearer pattern than those in Figure 2a.

Exercise 4
Compare Model 2 and Model 3.

  1. What is the regression line equation in Model 3?
    ANSWER: \(\widehat{\sqrt{Y}}\) = 301.4 + 0.4138 X

  2. Based on the distributions in the residual plots, which of the two is the better model?
    ANSWER: model 2; residual plot in model 3 has a clear pattern

  3. Based on the \(R^2\) values, which of the two is the better model?
    ANSWER: Model 2, because \(R^2\) in Model 2 (0.983) is higher than in Model 3 (0.949)

Note. In this case it makes no sense to compare the standard errors of the residuals. The SE in Model 2, as in Model 1, measures the variability of the Y-values around the regression line, whereas the SE in Model 3 measures the variability of the \(\sqrt{Y}\)-values. The SE in Models 1 and 2 therefore has the same unit as the Y-values, and the SE in Model 3 has the same unit as the \(\sqrt{Y}\)-values. If, for instance, the Y-values are in dollars, the SE in Model 3 is in \(\sqrt{\text{dollar}}\) (whatever that may look like).

Exercise 5

  1. Use model 1 to calculate a 95% CI for the Y-value given X = 500.
    ANSWER X = 500: \(\hat{Y}\) = 40,547 + 455.7 * 500 = 268,397
    95% CI for Y: \(\hat{Y} \pm t_{df=38}\) * se = 268,397 \(\pm\) 2.024 * 35,475 = <196,582 ; 340,213>

  2. Use model 2 to calculate a 95% CI for the Y-value given X = 500.
    ANSWER
    X = 500
    \(\hat{Y}\) = 139,504 + 0.4131 * 250,000 = 242,788
    95% CI for Y: \(\hat{Y} \pm t_{df=38}\) * se = 242,788 \(\pm\) 2.024 * 16,109 = <210,178 ; 275,399>

  3. Use model 3 to calculate a 95% CI for the Y-value given X = 500.
    ANSWER
    X = 500
    \(\widehat{\sqrt{Y}}\) = 301.4 + 0.4138 * 500 = 508.32
    \(\hat{Y}\) = \(508.32^2\) = 258,388
    95% CI for \(\sqrt{Y}\): \(\widehat{\sqrt{Y}} \pm t_{df=38}\) * se = 508.32 \(\pm\) 2.024 * 25.04 = <457.63 ; 559.01>
    95% CI for Y (squaring both endpoints): <209,426 ; 312,488>
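
The back-transformation used for Model 3 can be sketched in R as follows, using the rounded coefficients from the Model 3 output (the variable names are illustrative). The interval is first computed on the \(\sqrt{Y}\) scale and then squared.

```r
# Values from the Model 3 output (response is sqrt(Y)).
b0 <- 301.4; b1 <- 0.4138; se <- 25.04
t_crit <- qt(0.975, df = 38)

sqrt_hat <- b0 + b1 * 500                       # estimate on the sqrt scale
ci_sqrt  <- sqrt_hat + c(-1, 1) * t_crit * se   # interval for sqrt(Y)

# Squaring is increasing for positive values, so the endpoints can be
# squared to obtain an interval on the original Y scale.
y_hat <- sqrt_hat^2
ci_y  <- ci_sqrt^2

print(round(c(lower = ci_y[1], estimate = y_hat, upper = ci_y[2])))
```

Note that the back-transformed interval is no longer symmetric around the estimate: the upper endpoint lies further from \(\hat{Y}\) than the lower endpoint.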

Log-transformations

Another form of transformation is a log-transformation. This can be applied to the X-variable (Y ~ log(X) model), to the Y-variable (log(Y) ~ X model), or to both (log(Y) ~ log(X) model).
These models are shown below. The logarithm with base 10 is used in the models. It is also possible to use another base for the logarithm; a common choice is the natural logarithm ln, the logarithm with base e (\(\approx\) 2.7182818).
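
The choice of base matters less than it may seem, because logarithms with different bases differ only by a constant factor. A small check in R, with illustrative values for `x`:

```r
x <- c(1, 10, 250, 1000)   # illustrative positive values

log_base10 <- log10(x)
log_nat    <- log(x)        # log() is the natural logarithm in R

# Change of base: log10(x) = ln(x) / ln(10)
print(all.equal(log_base10, log_nat / log(10)))
```

Switching the base is therefore a linear transformation of the logged variable: the estimated coefficients change scale, but \(R^2\) and the shape of the residual plot stay the same.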


Apply X \(\rightarrow\) log10(X) transformation

Figure 4
Scatterplot Y ~ log10(X) values

Figure 4a
Scatterplot Residuals ~ log10(X) values

Comment
Figure 4 shows the relationship between Y and log(X). An OLS linear regression model is applied, although Figure 4 suggests that this is not a good idea.
The model parameters are estimated by applying an OLS linear regression on Y versus log(X).
Figure 4a shows the plot in which the residuals are plotted against the log(X)-values.

Model 4: OLS Regression Model Output Y ~ log(X)

## 
## Call:
## lm(formula = Y ~ log10(X), data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -99314 -40909 -18920  36705 152862 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -809930     112701  -7.187 1.38e-08 ***
## log10(X)      411580      41615   9.890 4.64e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 65470 on 38 degrees of freedom
## Multiple R-squared:  0.7202, Adjusted R-squared:  0.7128 
## F-statistic: 97.82 on 1 and 38 DF,  p-value: 4.636e-12

As in Figure 1a, there is a clear pattern in the residuals.
Compared with the first model, this transformation doesn’t lead to a better-fitting model; on the contrary, the fit is worse.

Exercise 6

  1. What is the equation of the regression line in Model 4?
    ANSWER: Y = -809,930 + 411,580log10(X)

  2. Why is this model worse than Model 3?
    ANSWER

  • stronger pattern in residual plot in Model 4
  • \(R^2\) is smaller in Model 4 than in Model 3 (0.720 in Model 4, 0.949 in Model 3)

Apply Y \(\rightarrow\) log(Y) transformation

Figure 5
Scatterplot log(Y) ~ X values

Figure 5a
Scatterplot Residuals ~ X values

Comment
Figure 5 shows a linear relationship between log(Y) and X values. The model parameters are estimated by applying an OLS linear regression on log(Y) versus X.
Figure 5a shows the plot in which the residuals are plotted against the X-values.

Model 5: OLS Regression Model Output LOG(Y) ~ X

## 
## Call:
## lm(formula = log10(Y) ~ X, data = df)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.091705 -0.018100  0.000677  0.020620  0.102983 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.062e+00  1.253e-02  403.93   <2e-16 ***
## X           6.686e-04  2.011e-05   33.25   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03225 on 38 degrees of freedom
## Multiple R-squared:  0.9668, Adjusted R-squared:  0.9659 
## F-statistic:  1106 on 1 and 38 DF,  p-value: < 2.2e-16

Exercise 6

  1. What is the equation of the regression line in Model 5?
    ANSWER: \(\widehat{\log_{10}(Y)}\) = 5.062 + 0.0006686X

  2. Assess Model 5 by examining Figure 5, Figure 5a and the \(R^2\) value of Model 5.
    ANSWER

  • Figure 5, scatterplot shows linear relationship between log10(Y) and X
  • Figure 5a, residual plot, does not show a clear pattern in the residuals
  • \(R^2\) = 0.967; 96.7% of the variation in the log10(Y) values is explained by the model (that is, by the variation in the X-values)
  • Conclusion: good and useful model

Apply Y \(\rightarrow\) log(Y) and X \(\rightarrow\) log(X) transformation

Figure 6
Scatterplot log(Y) ~ log(X) values

Figure 6a
Scatterplot Residuals ~ log(X) values

Comment
Figure 6 shows a linear relationship between log(Y) and log(X). The model parameters are estimated by applying an OLS linear regression on log(Y) versus log(X).
Figure 6a shows the plot in which the residuals are plotted against the log(X)-values.

Model 6: OLS Regression Model Output LOG(Y) ~ LOG(X)

## 
## Call:
## lm(formula = log10(Y) ~ log10(X), data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.13923 -0.05482 -0.01547  0.03959  0.22223 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.74310    0.12690   29.50   <2e-16 ***
## log10(X)     0.63026    0.04686   13.45    5e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07372 on 38 degrees of freedom
## Multiple R-squared:  0.8264, Adjusted R-squared:  0.8218 
## F-statistic: 180.9 on 1 and 38 DF,  p-value: 5.002e-16

Exercise 7

  1. What is the equation of the regression line in Model 6?
    ANSWER: log10(Y) = 3.743 + 0.630 log10(X)

  2. Rewrite this equation to Y = ….
    ANSWER
    \(10^{\log_{10}(Y)} = 10^{3.743 + 0.630 \log_{10}(X)}\)
    \(Y = 10^{3.743} \times 10^{0.630 \log_{10}(X)}\)
    \(Y = 10^{3.743} \times (10^{\log_{10}(X)})^{0.630}\)
    \(Y = 10^{3.743} \times X^{0.630}\)
    \(Y = 5534.8 \times X^{0.630}\)

  3. Assess Model 6 by examining Figure 6, Figure 6a and the \(R^2\) value of Model 6.
    ANSWER

  • Figure 6, scatterplot shows a curved relationship between log10(Y) and log10(X)
  • Figure 6a, residual plot, shows a clear pattern in the residuals
  • \(R^2\) = 0.826; 82.6% of the variation in the log10(Y) values is explained by the model (that is, by the variation in the log10(X)-values)
  • Conclusion: although \(R^2\) is high, the pattern in the residuals shows that another model will be more appropriate
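
The rewriting of the Model 6 equation in question 2 can be checked numerically. The sketch below evaluates both forms of the equation for a few illustrative X-values (the values in `x` are assumptions; the coefficients are the rounded ones from the answer).

```r
# Coefficients of Model 6 as used in the answer above.
a <- 3.743   # intercept on the log10 scale
b <- 0.630   # slope for log10(X)

x <- c(100, 300, 500, 1000)   # illustrative X-values

y_from_log_form   <- 10^(a + b * log10(x))  # back-transform the fitted line
y_from_power_form <- 10^a * x^b             # rewritten form Y = 10^a * X^b

print(all.equal(y_from_log_form, y_from_power_form))
```

Both forms give the same Y-values, so a straight line on the log-log scale corresponds to a power function on the original scale.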

The best of six models

Exercise 8
Rank the six models from 1 (the best fitting) to 6.

ANSWER

MODEL | Scatter plot | Residual plot | \(R^2\)
1 | linear relationship: yes, but a curved model seems better | clear pattern in residuals | 0.918
2 | linear relationship: yes | residuals are randomly distributed | 0.983
3 | linear relationship: yes, but a curved model seems better | clear pattern in residuals | 0.949
4 | does not show a linear relationship but a curved one | strong pattern in residuals | 0.720
5 | shows linear relationship | randomly distributed residuals | 0.967
6 | shows curved relationship | clear pattern in residuals | 0.826

Ranking: Model 2 > Model 5 > Model 3 > Model 1 > Model 6 > Model 4

Exercise 9

  1. The equation of the regression line of Model 3 is:
    ANSWER: \(\widehat{\sqrt{Y}}\) = 301.4 + 0.4138 X

  2. Rewrite this equation in Y = ….. .
    ANSWER: \(\hat{Y}\) = (301.4 + 0.4138X)\(^2\) = 90,843 + 249.5X + 0.171\(X^2\)

Exercise 10

  1. The equation of the regression line of Model 5 is:
    \(\widehat{\log_{10}(Y)}\) = ……… + ………. X (fill in the blanks)
    ANSWER
    \(\widehat{\log_{10}(Y)}\) = 5.062 + 0.0006686X

  2. Rewrite this equation in Y = ……………………………. .
    ANSWER
    \(Y = 10^{5.062 + 0.0006686X}\)
    \(Y = 10^{5.062} \times 10^{0.0006686X}\)
    \(Y = 10^{5.062} \times (10^{0.0006686})^{X}\)
    \(Y = 115{,}373 \times 1.00154^{X}\)
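
The exponential form makes the slope of Model 5 easier to interpret: each increase of X by one unit multiplies the estimated Y by a fixed growth factor. A minimal sketch with the rounded coefficients (the names `start_value` and `growth_factor` and the X-values are illustrative assumptions):

```r
b0 <- 5.062       # intercept of Model 5 (log10 scale)
b1 <- 0.0006686   # slope of Model 5

start_value   <- 10^b0   # value of Y-hat at X = 0
growth_factor <- 10^b1   # multiplier of Y-hat per unit increase in X

x <- c(0, 100, 500)      # illustrative X-values
y_exponential <- start_value * growth_factor^x
y_direct      <- 10^(b0 + b1 * x)

print(round(growth_factor, 5))
print(all.equal(y_exponential, y_direct))
```

With the rounded intercept 5.062, `start_value` comes out slightly below the 115,373 in the answer, which was presumably computed from the unrounded coefficient.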


A Real-World Example

In this example we use country-level data from the World Bank. In the data set the objects are the countries of the world and the variables are:

  • LIFE_EXP, the life expectancy
  • GDP_PP, the GDP per person

The question is whether there is a relationship between these two variables and, if so, how to model this relationship with an OLS regression model.

Table 3
Head World Bank Data 2019



The first step is, as always, graphing the data. To examine the relationship between two quantitative variables, a scatter plot is the most commonly used graph.

Figure 7
Life Expectancy versus GDP per Person for 237 Countries in 2019

Figure 7a
Residuals versus GDP per Person

Comment
As is clear from the scatter plot, the relationship between the two variables is not a linear one. Despite this, an OLS linear regression model is applied on the data.

Model 7: OLS Linear Regression Model Life_Expectancy ~ GDP PP in 2019

## 
## Call:
## lm(formula = LIFE_EXP ~ GDP_PP, data = df_2019)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.418  -3.229   1.505   4.244   8.592 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.955e+01  4.432e-01  156.94   <2e-16 ***
## GDP_PP      1.986e-04  1.599e-05   12.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.651 on 235 degrees of freedom
## Multiple R-squared:  0.3963, Adjusted R-squared:  0.3937 
## F-statistic: 154.2 on 1 and 235 DF,  p-value: < 2.2e-16

As was already clear from the scatter plot, an OLS linear regression model is not an adequate model to describe the relationship between Life_Expectancy and GDP per Person.

The plot in Figure 7 shows the relationship between LIFE_EXP and GDP_PP as a curve that rises steeply at first and then flattens more and more. Functions of the form Y = a \(\times\) \(X^b\) with \(0 < b < 1\) behave this way.

If Y = a \(\times\) \(X^b\), then
log(Y) = log(a \(\times\) \(X^b\)) or
log(Y) = log(a) + log(\(X^b\)) or
log(Y) = log(a) + b \(\times\) log(X)

The last equation shows a linear relation between log(Y) and log(X).
That’s why a combined Y \(\rightarrow\) log(Y) and X \(\rightarrow\) log(X) transformation can be tried.
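
The derivation can be illustrated with exact power-function data. In the sketch below the parameters a = 3 and b = 0.5 and the X-values are hypothetical; regressing log10(y) on log10(x) should recover b as the slope and log10(a) as the intercept.

```r
# Exact power-law data with hypothetical parameters a = 3 and b = 0.5.
a <- 3
b <- 0.5
x <- c(1, 2, 5, 10, 50, 100, 1000)
y <- a * x^b

# Regress log10(y) on log10(x): intercept estimates log10(a), slope b.
fit <- lm(log10(y) ~ log10(x))
estimates <- unname(coef(fit))

print(c(a = 10^estimates[1], b = estimates[2]))
```

With real data the points scatter around the line, so the estimates only approximate a and b, and the residual plot remains the check on whether the power function is appropriate.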

Figure 8
log(Life Expectancy) versus log(GDP per Person) for 237 Countries in 2019

Figure 8a
Residuals versus GDP per Person

Comment
The scatterplot shows that an OLS linear regression model seems an adequate model to describe the relationship.

Model 8: log(Life_Expectancy) ~ log(GDP PP) in 2019

## 
## Call:
## lm(formula = log10(LIFE_EXP) ~ log10(GDP_PP), data = df_2019)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.097708 -0.011149  0.002883  0.015812  0.045574 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.623892   0.010128  160.33   <2e-16 ***
## log10(GDP_PP) 0.061833   0.002632   23.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02465 on 235 degrees of freedom
## Multiple R-squared:  0.7014, Adjusted R-squared:  0.7001 
## F-statistic:   552 on 1 and 235 DF,  p-value: < 2.2e-16

Another option is to only transform the X variable into log(X) and use the OLS regression model Y = \(\alpha\) + \(\beta\) \(\times\) log(X)

Figure 9
Life Expectancy versus log10(GDP per Person) for 237 Countries in 2019

Figure 9a
Residuals versus log10(GDP per Person)

Comment
The scatterplot shows that an OLS linear regression model seems an adequate model to describe the relationship.

Model 9: Life_Expectancy ~ log(GDP PP) in 2019

## 
## Call:
## lm(formula = LIFE_EXP ~ log10(GDP_PP), data = df_2019)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.167  -1.800   0.528   2.342   7.157 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     34.070      1.562   21.80   <2e-16 ***
## log10(GDP_PP)   10.148      0.406   24.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.803 on 235 degrees of freedom
## Multiple R-squared:  0.7266, Adjusted R-squared:  0.7255 
## F-statistic: 624.7 on 1 and 235 DF,  p-value: < 2.2e-16

Exercise
Compare the Three Models for the World Bank data.

  1. Write down the regression line equation for each of the models.














  2. Rank the models from (1) best-fitting model to (3) worst-fitting model. As always, explain your choices.












  3. See Table 4 below.
    Suppose Model 9 is applied to the data set.
    What is, in that case, the residual for Jordan?











  4. What is the meaning of the residual value for Jordan in this context? Be precise in your answer.