Sunday, October 15, 2017

How to interpret the summary of linear regression with log-transformed variable

How should we interpret the coefficients of a linear regression when we use a log transformation?

In econometrics and data science, we sometimes use log-transformed variables in linear regression. Usually, one of the advantages of linear regression is that the outcome is easy to interpret. But after a log transformation, how should we interpret the outcome?

Overview


In many cases, we adopt linear regression to analyze data because it lets us understand how influential each feature is.

So when we use it, we want features that are as simple as possible, to keep the interpretation easy. If you transform the features, you need to adjust your interpretation accordingly.




Simple linear regression case


First, let’s look at the simple regression case in R. Here, I’ll use the cars data set.

print(head(cars))

  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10

We use the feature speed as the explanatory variable and dist as the response variable.

cars.lm <- lm(cars$dist ~ cars$speed)

plot(cars$speed, cars$dist)
abline(cars.lm, lwd=2, col="red")

[Scatter plot of dist against speed, with the fitted regression line drawn in red]

The red line above shows the estimated relationship between speed and dist.

summary(cars.lm)
Call:
lm(formula = cars$dist ~ cars$speed)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.069  -9.525  -2.272   9.215  43.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
cars$speed    3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

That red line on the image is the fitted regression line:

    dist = -17.5791 + 3.9324 × speed

Simply put, in this case, when the feature speed increases by 1, dist increases by 3.9324 on average.
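As a quick check, we can read the slope off with coef() and confirm that two predictions one unit apart differ by exactly that amount. (This is a minimal sketch; here lm is refit with the data argument, which gives the same coefficients as the cars$dist ~ cars$speed form above.)

```r
# Fit the same model via the data argument
cars.lm <- lm(dist ~ speed, data = cars)
b <- coef(cars.lm)["speed"]

# Predictions at speed = 10 and speed = 11 differ by exactly the slope b
p <- predict(cars.lm, newdata = data.frame(speed = c(10, 11)))
diff(p)   # about 3.93, the same as b
```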

Linear regression to log-transformed features


When we fit a linear regression model and estimate the parameters with a log transformation, how should we interpret them?

There are a few patterns, which can be summarized in the table below. Here “b” means the estimated coefficient of the variable X.

    Model        Equation                 Interpretation
    Level-level  Y = a + bX               X up by 1 unit → Y changes by b units
    Log-level    log(Y) = a + bX          X up by 1 unit → Y changes by about 100·b % (exactly (e^b − 1) × 100 %)
    Level-log    Y = a + b·log(X)         X up by 1 % → Y changes by about b/100 units
    Log-log      log(Y) = a + b·log(X)    X up by 1 % → Y changes by about b %

In a sense, the table above explains everything.
For example, when we fit a linear regression model with the log-transformed response variable Y and the original explanatory variable X, we estimate a and b in log(Y) = a + bX.
When X increases by 1, Y increases by about 100·b percent (exactly, Y is multiplied by e^b).
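The log-level row can be derived in one line. Comparing the model at X and at X + 1:

```latex
\log Y_2 - \log Y_1 = \bigl(a + b(X+1)\bigr) - (a + bX) = b
\quad\Longrightarrow\quad
\frac{Y_2}{Y_1} = e^{b} \approx 1 + b \quad \text{(for small } b\text{)}
```

So a one-unit increase in X multiplies Y by e^b, which for small b is approximately a 100·b % increase.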

For the cars data set, it looks like this.

cars.lm.log <- lm(log(cars$dist) ~ cars$speed)
summary(cars.lm.log)
Call:
lm(formula = log(cars$dist) ~ cars$speed)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.46604 -0.20800 -0.01683  0.24080  1.01519 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.67612    0.19614   8.546 3.34e-11 ***
cars$speed   0.12077    0.01206  10.015 2.41e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4463 on 48 degrees of freedom
Multiple R-squared:  0.6763,    Adjusted R-squared:  0.6696 
F-statistic: 100.3 on 1 and 48 DF,  p-value: 2.413e-13

From the coefficient of cars$speed, we can see that as speed increases by 1, dist increases by roughly 12 % (exactly, e^0.12077 − 1 ≈ 12.8 %).

Related article


In the article below, I used a log transformation.

How to deal with heteroscedasticity

In that article, I wrote about heteroscedasticity. Linear regression with OLS is a simple and powerful method for analyzing data: from the coefficients, we can see how much influence each variable has. Although linear regression with OLS looks easy to use, since it requires little code and mathematics, there are some important conditions that must hold to obtain proper coefficients and valid inference.