Today we will learn

  1. Writing Functions
  2. Bivariate OLS “by hand”
  3. Residuals, \(R^2\), etc.
  4. R Regression Commands
  5. Non-Linearities in Linear Regression

In other words, our goals are to:

  • Manually estimate an OLS model
  • Calculate predicted values, residuals, \(R^2\)
  • Visualize results with some nice plots
  • Learn how to let R do the work for you

# The first line sets an option for the final document that can be produced from
# the .Rmd file. Don't worry about it.
knitr::opts_chunk$set(
  echo = TRUE, # show results
  collapse = TRUE # not interrupt chunks
)

# The next bit is quite powerful and useful.
# First you define which packages you need for your analysis and assign them to
# the p_needed object.
p_needed <- c(
  "foreign", # import files
  "viridis", # color
  "ggplot2", # plotting
  "here", # directory
  "stargazer", # for regression tables
  "dplyr" # for glimpse command
) 

# Now you check which packages are already installed on your computer.
# The function installed.packages() returns a vector with all the installed
# packages.
packages <- rownames(installed.packages())
# Then you check which of the packages you need are not installed on your
# computer yet. Essentially you compare the vector p_needed with the vector
# packages. The result of this comparison is assigned to p_to_install.
p_to_install <- p_needed[!(p_needed %in% packages)]
# If at least one element is in p_to_install you then install those missing
# packages.
if (length(p_to_install) > 0) {
  install.packages(p_to_install)
}
# Now that all packages are installed on the computer, you can load them for
# this project. Additionally the expression returns whether the packages were
# successfully loaded.
sapply(p_needed, require, character.only = TRUE)
## Loading required package: foreign
## Loading required package: viridis
## Loading required package: viridisLite
## Loading required package: ggplot2
## Loading required package: here
## here() starts at C:/Users/vikto/Dropbox/github/qm21/week04_ols_basics
## Loading required package: stargazer
## 
## Please cite as:
##  Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
##   foreign   viridis   ggplot2      here stargazer     dplyr 
##      TRUE      TRUE      TRUE      TRUE      TRUE      TRUE
# This is an option for stargazer tables
# It automatically adapts the output to html or latex,
# depending on whether we want a html or pdf file
stargazer_opt <- ifelse(knitr::is_latex_output(), "latex", "html")

# Don't worry about this part: it ensures that if the file is knitted to html,
# significance notes are depicted correctly
if (stargazer_opt == "html"){
  fargs <- formals(stargazer)
  fargs$notes.append = FALSE
  fargs$notes = c("<sup>&sstarf;</sup>p<0.1; <sup>&sstarf;&sstarf;</sup>p<0.05; <sup>&sstarf;&sstarf;&sstarf;</sup>p<0.01")
  formals(stargazer) <- fargs
}

# set the seed for replicability
set.seed(2021)

1 Writing Functions

Functions in R are very useful! They take objects as input, do something with this input, and return an output. The output is just the result of what happens to the input within the function. In other words, the output is a function of the input. The great thing is that we can freely define what this function should do.

Just as with loops, try to get into the habit of using them. You will save a lot of time and make your R code more efficient and structured.

Functions always have the same structure:

  • They are stored as objects, with a name — preferably a verb that describes what the function does.
  • The () after function() contains placeholders for the input.
  • The {} contains the body of the function, which uses the defined placeholders instead of real input.
do_this <- function([inputs separated by commas]){
 # what to do with those inputs
}

Concepts like this are usually best understood with an example. How about squaring numbers?

1.1 Example I: Squaring numbers

powertwo <- function(x){
  x^2
}

powertwo(x = 5)
## [1] 25

# And now we can use this function with any input.
powertwo(x = 2.25)
## [1] 5.0625

# Or even for a sequence of numbers.
powertwo(x = seq(1, 10, 0.5))
##  [1]   1.00   2.25   4.00   6.25   9.00  12.25  16.00  20.25  25.00  30.25
## [11]  36.00  42.25  49.00  56.25  64.00  72.25  81.00  90.25 100.00
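As a small extension (the function name power is our own, not from the handout), functions can take several arguments, and arguments can have default values:

```r
# power() generalizes powertwo(): a second argument n selects the exponent
# and defaults to 2, so power(x) behaves just like powertwo(x).
power <- function(x, n = 2) {
  x^n
}

power(5)        # 5^2 = 25, using the default n
power(5, n = 3) # 5^3 = 125
```

If you call power(5), R fills in n = 2 for you; you only need to spell out arguments that differ from their defaults.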

1.2 Example II: Growth and Presidents’ vote shares

Let’s write a more useful function! In the lecture our estimated model for the relationship between economic growth and vote share was:

Vote Share = 49.699 + 1.127 * Growth

We can write a function that calculates predicted values for us!

Let’s start simple. How do you calculate the predicted value for Growth = 1?

49.699 + 1.127 * 1
## [1] 50.826

Now you can generalize this equation for any value of Growth and put it in a function.

voteshare_pred <- function(growth) {
  49.699 + 1.127 * growth
}

voteshare_pred(growth = 1)
## [1] 50.826

Let’s make this function even more general. Write a general function to predict values for any bivariate OLS.

predict_bi <- function(x, intercept, slope) {
  
  intercept + slope * x
  
}

predict_bi(x = 1, 
           intercept = 49.699, 
           slope = 1.127)
## [1] 50.826

Let's put the function to work. Predict the values of vote share for a sequence of possible values of growth ranging from -1 to 3 in 0.1 increments. Please use the general function.

growth <- seq(from = -1, to = 3, by = 0.1)

predicted_vote <- predict_bi(x = growth, 
                             intercept = 49.699, 
                             slope = 1.127)


plot(x = growth, 
     y = predicted_vote,
     xlab = "Economic Growth",
     ylab = "Predicted Voteshare",
     main = "Predicted Voteshare from a bivariate OLS",
     pch = 19,
     col = viridis(1),
     bty = "n",
     las = 1)

2 Voteshare of the Green Party and Young Voters

2.1 Exploring Data Set

Let’s examine some district level data from the 2017 federal elections in Germany. Load the data into R and have a look at some summary statistics.

load(here("raw-data/election2017.Rdata"))

# Explore the data set
summary(election2017)
##     cduvote         spdvote          leftvote        greenvote     
##  Min.   :13.91   Min.   : 7.797   Min.   : 4.164   Min.   : 2.233  
##  1st Qu.:28.09   1st Qu.:15.677   1st Qu.: 6.003   1st Qu.: 6.010  
##  Median :31.73   Median :20.654   Median : 7.079   Median : 8.078  
##  Mean   :31.77   Mean   :20.655   Mean   : 9.274   Mean   : 8.778  
##  3rd Qu.:35.20   3rd Qu.:25.029   3rd Qu.:11.165   3rd Qu.:11.439  
##  Max.   :53.10   Max.   :37.821   Max.   :29.300   Max.   :21.167  
##  NA's   :46                                                        
##    age18to34       popdensity     
##  Min.   :14.10   Min.   :   36.9  
##  1st Qu.:18.40   1st Qu.:  143.8  
##  Median :19.80   Median :  267.8  
##  Mean   :20.35   Mean   :  905.0  
##  3rd Qu.:22.55   3rd Qu.:  946.5  
##  Max.   :32.90   Max.   :12655.6  
## 
head(election2017)
##                                    cduvote  spdvote  leftvote greenvote
## Flensburg – Schleswig             34.21853 23.69539  8.213221 13.081293
## Nordfriesland – Dithmarschen Nord 38.33641 22.53848  6.220532 10.975195
## Steinburg – Dithmarschen Süd      36.19255 22.73696  6.672015  9.902579
## Rendsburg-Eckernförde             36.20959 22.88824  6.374946 12.374254
## Kiel                              26.64327 23.86314 10.298789 17.249439
## Plön – Neumünster                 33.54276 23.76220  6.515010 12.527392
##                                   age18to34 popdensity
## Flensburg – Schleswig                  20.0      132.9
## Nordfriesland – Dithmarschen Nord      18.4       83.6
## Steinburg – Dithmarschen Süd           17.5      110.4
## Rendsburg-Eckernförde                  16.6      114.9
## Kiel                                   28.2     1873.8
## Plön – Neumünster                      17.4      168.8
glimpse(election2017)
## Rows: 299
## Columns: 6
## $ cduvote    <dbl> 34.21853, 38.33641, 36.19255, 36.20959, 26.64327, 33.54276,~
## $ spdvote    <dbl> 23.69539, 22.53848, 22.73696, 22.88824, 23.86314, 23.76220,~
## $ leftvote   <dbl> 8.213221, 6.220532, 6.672015, 6.374946, 10.298789, 6.515010~
## $ greenvote  <dbl> 13.081293, 10.975195, 9.902579, 12.374254, 17.249439, 12.52~
## $ age18to34  <dbl> 20.0, 18.4, 17.5, 16.6, 28.2, 17.4, 17.9, 17.4, 15.8, 16.7,~
## $ popdensity <dbl> 132.9, 83.6, 110.4, 114.9, 1873.8, 168.8, 462.9, 234.9, 143~

The data set contains, for each voting district, the following information:

  • Vote shares of CDU, SPD, The Left, The Greens, AfD and FDP.
  • Percentage of the population between 18 and 34 years old.
  • Population density (inhabitants per \(km^2\)).

Base R

# Have a look at the green party's district level vote share
hist(election2017$greenvote, 
     main = "Histogram of the Green's District-level Vote Share in 2017",
     xlab = "Vote Share",
     breaks = 20,
     col = viridis(1),
     border = F,
     las = 1)


# Have a look at the share of "young voters" in districts
hist(election2017$age18to34, 
     main = "Histogram of the Share of Young Voters in Districts",
     xlab = "Share of 18- to 34-year-old voters in district",
     breaks = seq(14, 33, by = 1),
     col = viridis(1),
     border = F,
     las = 1)

Are younger districts “greener”? Let’s have a first look at the scatterplot.

plot(x = election2017$age18to34, 
     y = election2017$greenvote, 
     main = "Scatterplot of Green Vote Share and\nPercentage of Young People in Districts",
     xlab = "Proportion of young people (18 to 34) in %",
     ylab = "Vote share for the Green Party in %",
     bty = "n", 
     las = 1,                      
     pch = 19,                      
     col = viridis(1, alpha = 0.5)) # make color 50% transparent

ggplot2

# Green party's district level vote share
ggplot(
  data = election2017,
  aes(x = greenvote)
) +
  geom_histogram(
    boundary = 10,
    binwidth = 1,
    color = "white",
    fill = viridis(1)
  ) +
  labs(
    title = "Histogram of the Green's District-level Vote Share in 2017",
    x = "Vote Share",
    y = "Frequency"
  ) +
  theme_minimal() +
  scale_x_continuous(breaks = seq(5, 20, by = 5))



# share of "young voters" in districts
ggplot(
  data = election2017,
  aes(x = age18to34)
) +
  geom_histogram(
    boundary = 14,
    binwidth = 1,
    color = "white",
    fill = viridis(1)
  ) +
  labs(
    title = "Histogram of the share of 18- to 34-year-old Voters in Districts",
    x = "Share of 18- to 34-year-old voters in district",
    y = "Frequency"
  ) +
  theme_minimal() +
  scale_x_continuous(breaks = seq(15, 30, by = 5))

Are younger districts “greener”? Let’s have a first look at the scatterplot.

ggplot(data = election2017,  # data used for plotting
       mapping = aes(x = age18to34, 
                     y = greenvote)) +
  geom_point(color = viridis(1, alpha = 0.5), # add points
             size = 2) + 
  theme_minimal() + # change the appearance
  labs(x = "Proportion of young people (18 to 34) in %",
       y = "Vote share for the Green Party in %",
       title = "Scatterplot of Green Vote Share and\nPercentage of Young People in Districts")  

2.2 Bivariate OLS “by hand”

We will now fit our first regression, and we will do it by hand. As a reminder, this is what we are working with:

\[y = \underbrace{\beta_0}_{intercept} + \underbrace{\beta_1}_{slope} x + \underbrace{\epsilon}_{error}\]

In our example, we are trying to estimate this, with \(i\) depicting a district (i.e., an observation):

\[\text{Green Vote Share}_i = \underbrace{\hat{\beta_0}}_{intercept} + \underbrace{\hat{\beta_1}}_{slope} \text{Share of Young Voters}_i\]

We start by estimating the slope, \(\hat{\beta_1}\).

We know that the slope is calculated by \(\hat{\beta_1} = \dfrac{Cov(x,y)}{Var(x)} = \dfrac{\sum(x_i - \bar{x})(y_i-\bar{y})}{\sum(x_i - \bar{x})^2}\).

We then need to translate this into R. Note that we can drop the \(\frac{1}{n-1}\) factors from the covariance and the variance, because they cancel in the ratio:

  • cov = sum((x - mean(x)) * (y - mean(y)))
  • var = sum((x - mean(x))^2)

So let's make this work for our example:

# Calculate Variance of x
var_x_hand <- sum((election2017$age18to34 - mean(election2017$age18to34))^2)

# Calculate Covariance of x and y
cov_hand <- sum(
  (election2017$age18to34 - mean(election2017$age18to34)) *
    (election2017$greenvote - mean(election2017$greenvote))
)

slope <- cov_hand / var_x_hand

slope
## [1] 0.8176557

Now we can calculate the intercept: \[\hat{\beta_0} = \bar y - \hat{\beta_1} \times \bar x.\]

With R commands, the intercept is calculated by mean(y) - slope*mean(x)

intercept <- mean(election2017$greenvote) - slope * mean(election2017$age18to34)

intercept
## [1] -7.861828

Why not make a generalized function out of it?

ols_bi <- function(y, x) {
  cov <- sum((x - mean(x)) * (y - mean(y)))
  var <- sum((x - mean(x))^2)
  b1 <- cov / var
  b0 <- mean(y) - b1 * mean(x)
  cat("Congrats, you just wrote your first estimator!\n \n intercept \t slope \n", b0, "\t", b1)
  return(c(b0, b1))
}

coef_hand <- ols_bi(y = election2017$greenvote, x = election2017$age18to34)
## Congrats, you just wrote your first estimator!
##  
##  intercept    slope 
##  -7.861828    0.8176557
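As a quick sanity check (this example, with its arbitrary true values of 2 and 3, is a sketch and not part of the handout), we can simulate data with a known intercept and slope and confirm that the same covariance-over-variance formula recovers them:

```r
# Simulate data from y = 2 + 3x + noise, then re-estimate the line
# with the same cov/var formula used inside ols_bi().
set.seed(123)
x_sim <- rnorm(1000)
y_sim <- 2 + 3 * x_sim + rnorm(1000)

b1_sim <- sum((x_sim - mean(x_sim)) * (y_sim - mean(y_sim))) /
  sum((x_sim - mean(x_sim))^2)
b0_sim <- mean(y_sim) - b1_sim * mean(x_sim)

c(intercept = b0_sim, slope = b1_sim) # should be close to 2 and 3
```

The estimates will not be exactly 2 and 3 because of the noise term, but with 1000 observations they land very close — a useful habit for checking any estimator you write yourself.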

Ok, now that we fitted our first regression, let’s add the regression line to our scatterplot.

Base R

# we need two colors (one for the dots, one for the line)
twocols <- viridis(2, 
                   alpha = c(0.5, 1)) # only first color transparent

plot(x = election2017$age18to34,
     y = election2017$greenvote,
     main = "Scatterplot of Green Vote Share and\nPercentage of Young People in Districts",
     xlab = "Proportion of young people (18 to 34) in %",
     ylab = "Vote share for the Green Party (%)",
     bty = "n",
     las = 1,
     pch = 19, 
     col = twocols[1],
     ylim = c(0, 22))

# add the line
abline(a = coef_hand[1], # a is the intercept of the line
       b = coef_hand[2], # b is the slope of the line
       col = twocols[2],
       lwd = 2)

ggplot2


# we need two colors (one for the dots, one for the line)
twocols <- viridis(2, 
                   alpha = c(0.5, 1)) # only first color transparent

ggplot(data = election2017,  # data used for plotting
       mapping = aes(x = age18to34, y = greenvote)) +
  geom_point(color = twocols[1], # add points
             size = 2) + 
  theme_minimal() + # change the appearance
  theme(panel.grid.major = element_blank(),   # remove major grid lines
        panel.grid.minor = element_blank()) + # remove minor grid lines
  labs(x = "Proportion of young people (18 to 34) in %",
       y = "Vote share for the Green Party (%)",
       title = "Scatterplot of Green Vote Share and\nPercentage of Young People in Districts",
       subtitle = "Using manual calculations")  +
  geom_abline(intercept = coef_hand[1], # add the line
              slope = coef_hand[2],
              color = twocols[2],
              size = 1)

# but with ggplot, there is also a shortcut... ggplot can estimate the line for 
# you and add it into the plot. We will not be using this a lot, but for the 
# sake of completeness, here is how to do this: 
ggplot(data = election2017,  # data used for plotting
       mapping = aes(x = age18to34, y = greenvote)) +
  geom_point(color = twocols[1], # add points
             size = 2) + 
  theme_minimal() + # change the appearance
  theme(panel.grid.major = element_blank(),   # remove major grid lines
        panel.grid.minor = element_blank()) + # remove minor grid lines
  labs(x = "Proportion of young people (18 to 34) in %",
       y = "Vote share for the Green Party (%)",
       title = "Scatterplot of Green Vote Share and\nPercentage of Young People in Districts",
       subtitle = "Using built-in function")  +
   geom_smooth(method = "lm",  # add line with ggplot command
               se = FALSE, # only plot the line 
               color = twocols[2])
## `geom_smooth()` using formula 'y ~ x'

3 Residuals, \(R^2\), etc.

Deriving and understanding the intercept and slope of a regression line is great and important. However, we also want to learn how well the regression line fits our data. This is why we have to talk about Residuals, Total Sum of Squares (TSS), Explained Sum of Squares (ESS), and \(R^2\):

  • Residuals: \(\hat{\epsilon_i} = y_i - \hat{y_i} = y_i - (\hat{\beta_0} + \hat{\beta_1}x_i)\), so y - y_hat
  • TSS (aka “Variation in DV”): \(TSS = \sum(y_i - \bar y)^2\), so sum((y - mean(y))^2)
  • ESS (aka “The variation in DV we can explain”): \(ESS = \sum(\hat y_i - \bar y)^2\), so sum((y_hat - mean(y)) ^ 2)
  • RSS (aka “Unexplained by model variation in DV”): \(RSS = \sum\hat{\epsilon_i}^2 = \sum(y_i - \hat{y_i})^2\), so sum((y - y_hat)^2)
  • \(R^2 = \dfrac{ESS}{TSS}\), i.e., Explained Sum of Squares (ESS) / Total Sum of Squares (TSS)

Just like the slope and intercept, we can calculate all those things by hand:

TSS_hand <- sum((election2017$greenvote - mean(election2017$greenvote))^2)

# How do we get y_hat? Remember our own predict function?
greenvote_hat <- predict_bi(
  x = election2017$age18to34,
  intercept = coef_hand[1],
  slope = coef_hand[2]
)

ESS_hand <- sum((greenvote_hat - mean(election2017$greenvote))^2)
RSS_hand <- TSS_hand - ESS_hand

r_squared <- ESS_hand / TSS_hand

r_squared
## [1] 0.4562751

Now we also want to calculate the residuals:

election2017$residuals_hand <- election2017$greenvote - greenvote_hat
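A side note: above we obtained RSS by subtraction (TSS - ESS). The decomposition TSS = ESS + RSS holds exactly for any OLS fit, which we can verify directly on simulated data (a quick sketch; the data here are made up just for the check):

```r
# Verify the variance decomposition TSS = ESS + RSS for an OLS fit.
set.seed(42)
x <- runif(200)
y <- 1 + 2 * x + rnorm(200)

# fit the line with the usual cov/var formula
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
y_hat <- b0 + b1 * x

TSS <- sum((y - mean(y))^2)
ESS <- sum((y_hat - mean(y))^2)
RSS <- sum((y - y_hat)^2)

all.equal(TSS, ESS + RSS) # TRUE (up to floating-point precision)
```

This is why computing RSS as TSS - ESS and computing it directly from the residuals give the same number for OLS.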

Base R

And make a residual plot.

plot(x = election2017$age18to34, 
     y = election2017$residuals_hand,
     bty = "n",
     las = 1,
     main = "Residual Plot for greenvote",
     ylab = "Residuals",
     xlab = "Proportion of young people (18 to 34) in %",
     pch = 19,
     col = twocols[1])
abline(h = 0,
       col = twocols[2],
       lwd = 2)
grid() # add grid

ggplot2

And make a residual plot.


ggplot(data = election2017,  # data used for plotting
       mapping = aes(x = age18to34, y = residuals_hand)) +
  geom_point(color = twocols[1], # add points
             size = 2) + 
  theme_minimal() + # change the appearance
  labs(x = "Proportion of young people (18 to 34) in %",
       y = "Residuals",
       title = "Residual Plot for greenvote")  +
  geom_hline(yintercept = 0, # add the line
             color = twocols[2],
             size = 1)

Animated Plot

And here you can see the parts of the variance visually on the scatterplot with our example:

4 R Regression Commands

It comes as no surprise that all of these functions are already in R…

lm_res <- lm(greenvote ~ age18to34, 
             data = election2017)  

# lm() stands for linear model.

lm_res
## 
## Call:
## lm(formula = greenvote ~ age18to34, data = election2017)
## 
## Coefficients:
## (Intercept)    age18to34  
##     -7.8618       0.8177

summary(lm_res)
## 
## Call:
## lm(formula = greenvote ~ age18to34, data = election2017)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.8166 -2.0586 -0.6483  2.0717  8.5081 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.86183    1.06742  -7.365 1.75e-12 ***
## age18to34    0.81766    0.05179  15.787  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.917 on 297 degrees of freedom
## Multiple R-squared:  0.4563, Adjusted R-squared:  0.4544 
## F-statistic: 249.2 on 1 and 297 DF,  p-value: < 2.2e-16

There is a very neat package to generate ready-to-publish regression tables. It is called stargazer, and it generates LaTeX (pronounced lah-tekh) or HTML output. This is how it works:

library(stargazer) # we have already installed it in setup chunk

stargazer(lm_res, 
          type = stargazer_opt, 
          covariate.labels = c("Share of young people"),
          dep.var.labels = c("Green Vote Share"))
                          Dependent variable:
                          -------------------
                          Green Vote Share
--------------------------------------------------------
Share of young people      0.818***
                          (0.052)
Constant                  -7.862***
                          (1.067)
--------------------------------------------------------
Observations               299
R2                         0.456
Adjusted R2                0.454
Residual Std. Error        2.917 (df = 297)
F Statistic                249.232*** (df = 1; 297)
--------------------------------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01

Now you can make ready-to-publish regression tables. (To get this code chunk working in your own projects, you need to generate the stargazer_opt object in the setup chunk.) And we are only in Week 4!!

You can get the coefficients with the coef() command for the output of an lm() object, which we called lm_res. Let’s compare them to our results from above.

cbind(coef_lm = coef(lm_res), 
      coef_hand)
##                coef_lm  coef_hand
## (Intercept) -7.8618276 -7.8618276
## age18to34    0.8176557  0.8176557

coef(lm_res) - coef_hand
##   (Intercept)     age18to34 
##  2.930989e-14 -8.881784e-16
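The differences on the order of 1e-14 are pure floating-point rounding, not a substantive disagreement between lm() and our estimator. A small illustration of why exact comparison of such numbers can mislead, and what to use instead:

```r
# Exact comparison of floating-point numbers can fail even when the
# math says the values are equal:
0.1 + 0.2 == 0.3                   # FALSE, due to binary rounding

# all.equal() compares up to a small numerical tolerance instead:
isTRUE(all.equal(0.1 + 0.2, 0.3))  # TRUE
```

So when checking hand-calculated results against R's built-in functions, all.equal() is the right tool, not ==.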

Fitted (predicted) values are also already in R. Just use the fitted() command.

We can check whether the in-built function returns the same values as the ones we calculated “by hand”:

head(fitted(lm_res) - greenvote_hat)
##             Flensburg – Schleswig Nordfriesland – Dithmarschen Nord 
##                     -1.652012e-13                     -3.907985e-14 
##      Steinburg – Dithmarschen Süd             Rendsburg-Eckernförde 
##                      8.881784e-16                      8.881784e-16 
##                              Kiel                 Plön – Neumünster 
##                      1.776357e-15                      8.881784e-16

# Of course we can also get the residuals.
head(residuals(lm_res) - election2017$residuals_hand)
##             Flensburg – Schleswig Nordfriesland – Dithmarschen Nord 
##                      1.652012e-13                      3.907985e-14 
##      Steinburg – Dithmarschen Süd             Rendsburg-Eckernförde 
##                     -8.881784e-16                     -8.881784e-16 
##                              Kiel                 Plön – Neumünster 
##                     -1.776357e-15                     -8.881784e-16

We did pretty well calculating it by hand, didn’t we?

We have re-estimated the results with a built-in command already, so now let’s present our results in a plot. Additionally, we make a residual plot to investigate whether there are any problems with the model.

Base R

Plot values and regression line in a nice plot.

plot(x = election2017$age18to34, 
     y = election2017$greenvote,
     bty = "n",
     las = 1,
     pch = 19, 
     col = twocols[1], 
     cex = 1,
     ylim = c(0, 25),
     ylab = "Vote share for the Green Party (%)",
     xlab = "Percentage of population between 18 and 34 (%)",
     main = "Younger Districts - Greener Votes?")
abline(lm_res, 
       col = twocols[2],
       lwd = 2)

Now we want to have a look at the residual plot. And label the outliers just like in the plot in the lecture.

plot(x = election2017$age18to34, 
     y = residuals(lm_res),
     bty = "n",
     las = 1,
     pch = 19,
     col = twocols[1],
     ylim = c(-9, 9),
     xlab = "% of the population between 18 and 34", 
     ylab = "Residuals",
     main = "Residual Plot")
abline(h = 0,
       col = twocols[2],
       lwd = 2)
grid()

# Label the largest positive outlier
text(x = election2017$age18to34[residuals(lm_res) == max(residuals(lm_res))], 
     y = max(residuals(lm_res)), 
     names(which(residuals(lm_res) == max(residuals(lm_res)))),
     cex = 0.6, 
     pos = 1)

# Label the largest negative outlier
text(x = election2017$age18to34[residuals(lm_res) == min(residuals(lm_res))], 
     y = min(residuals(lm_res)), 
     names(which(residuals(lm_res) == min(residuals(lm_res)))),
     cex = 0.6, 
     pos = 1)

ggplot2

Plot values and regression line in a nice plot.

ggplot(data = election2017,  # data used for plotting
       mapping = aes(x = age18to34, y = greenvote)) +
  geom_point(color = twocols[1], # add points
             size = 2) + 
  theme_minimal() + # change the appearance
  theme(panel.grid.major = element_blank(),panel.grid.minor = element_blank()) +
  labs(x = "Proportion of young people (18 to 34) in %",
       y = "Vote share for the Green Party (%)",
       title = "Scatterplot of green votes and share of young people")  +
  geom_abline(intercept = coef(lm_res)[1], # add the line
              slope = coef(lm_res)[2],
              color = twocols[2],
              size = 1)

Now we want to have a look at the residual plot. And label the outliers just like in the plot in the lecture.

ggplot(mapping = aes(x = election2017$age18to34, 
                     y = residuals(lm_res))) +
  geom_point(color = twocols[1], # add points
             size = 2) + 
  theme_minimal() + # change the appearance
  labs(x = "Proportion of young people (18 to 34) in %",
       y = "Residuals",
       title = "Residual Plot for greenvote")  +
  geom_hline(yintercept = 0, # add the line
             color = twocols[2],
             size = 1) +
  geom_text(aes(x = election2017$age18to34[residuals(lm_res) == max(residuals(lm_res))], 
                y = max(residuals(lm_res)),
                label = names(which(residuals(lm_res) == max(residuals(lm_res)))),
                vjust = 2),
            size = 3) +
  geom_text(aes(x = election2017$age18to34[residuals(lm_res) == min(residuals(lm_res))], 
                y = min(residuals(lm_res)),
                label = names(which(residuals(lm_res) == min(residuals(lm_res)))),
                vjust = 2),
            size = 3)

5 Exercise Section: Non-Linearities in Linear Regression

In today’s exercise section, you will explore how linear models can be used to model non-linear relationships.

Let’s start with a hypothesis: Somebody argues that the Greens got more votes in urban districts than in districts on the countryside. We measure whether a district is urban or countryside with its population density (popdensity).

Exercise 1: Visual Exploratory Data Analysis

Let’s have a look at the data first. The hypothesis suggests a relationship between two variables.

  • The independent variable is population density (popdensity).
  • The dependent variable is Green vote share (greenvote).

1.1 Plot univariate distribution

  • Plot the univariate distribution of popdensity. Either use a histogram or a density plot (or both).
hist(x = election2017$popdensity)

plot(density(election2017$popdensity))

1.2 Now let’s look at the bivariate distribution of popdensity and greenvote.

  • Create a scatterplot of both variables.
  • Make an educated decision on which variable goes on the x-axis and which on the y-axis.
  • Discuss what the relationship between both variables looks like. Is it linear?
plot(x = election2017$popdensity,
     y = election2017$greenvote)

1.3 Use log(popdensity) instead of popdensity.

To model nonlinear relationships, it is sometimes (but not always!) useful to log-transform one of the variables. Let’s try this out in this case.

  • Calculate the log-values of population density and store it as a new variable logpopdensity. Hint: We can easily calculate log values with the function log().
  • Repeat 1.2, but this time use logpopdensity instead of popdensity.
  • Again discuss what the relationship between both variables looks like.

logpopdensity <- log(election2017$popdensity)
plot(x = logpopdensity,
     y = election2017$greenvote)
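Why does the log transformation help with this kind of relationship? Because equal ratios become equal steps on the log scale — a quick numerical illustration:

```r
# Each tenfold increase in density adds the same amount on the log scale:
log(c(10, 100, 1000, 10000))
diff(log(c(10, 100, 1000, 10000))) # constant steps of log(10), about 2.30
```

So a district going from 100 to 1000 inhabitants per km^2 moves as far along the log scale as one going from 1000 to 10000 — which compresses the long right tail we saw in the histogram of popdensity.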

1.4 Linear model and linear regression with log-transformed variables

Use the lm() command to estimate two OLS Models:

  • A linear model where you regress greenvote on popdensity. Store the regression output in an object called linear_model.
  • A second model where you regress greenvote on logpopdensity. Store the regression output in an object called loglinear_model.

linear_model <- lm(greenvote ~ popdensity,
                   data = election2017)

loglinear_model <- lm(greenvote ~ logpopdensity,
                   data = election2017)

loglinear_model
## 
## Call:
## lm(formula = greenvote ~ logpopdensity, data = election2017)
## 
## Coefficients:
##   (Intercept)  logpopdensity  
##        -1.123          1.663
summary(linear_model)
## 
## Call:
## lm(formula = greenvote ~ popdensity, data = election2017)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.799 -2.523 -0.584  2.492 12.660 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.6237125  0.2368277  32.191   <2e-16 ***
## popdensity  0.0012753  0.0001376   9.271   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.484 on 297 degrees of freedom
## Multiple R-squared:  0.2244, Adjusted R-squared:  0.2218 
## F-statistic: 85.95 on 1 and 297 DF,  p-value: < 2.2e-16
summary(loglinear_model)
## 
## Call:
## lm(formula = greenvote ~ logpopdensity, data = election2017)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6231 -2.3104 -0.2683  2.3917 11.4100 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.1229     0.9493  -1.183    0.238    
## logpopdensity   1.6634     0.1561  10.656   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.365 on 297 degrees of freedom
## Multiple R-squared:  0.2766, Adjusted R-squared:  0.2741 
## F-statistic: 113.5 on 1 and 297 DF,  p-value: < 2.2e-16

1.5 Inspect the results

If you did everything correctly, you can run the following code to inspect your results. If you have time left, discuss the plots in your group.

  • How does the first model compare to the model with the transformed variable?
  • Where does the nonlinearity in the second model come from?

First, the linear model (without log-transformation):

plot(
  x = election2017$popdensity,
  y = election2017$greenvote,
  bty = "n",
  las = 1,
  pch = 19,
  col = twocols[1],
  ylim = c(0, 25),
  ylab = "Green Voteshare in %", xlab = "Population Density per km^2",
  main = "A linear model"
)
abline(linear_model,
  col = twocols[2],
  lwd = 2
)

# We will see the problem when looking at the residuals.
plot(
  x = election2017$popdensity,
  y = residuals(linear_model),
  bty = "n",
  las = 1,
  pch = 19,
  col = twocols[1],
  ylab = "Residuals", xlab = "Population Density per km^2",
  main = "Residuals of the simple linear model"
)
abline(
  h = 0,
  col = twocols[2],
  lwd = 2
)

Second, the model with logpopdensity:

plot(
  x = logpopdensity,
  y = election2017$greenvote,
  bty = "n",
  las = 1,
  pch = 19,
  col = viridis(1, alpha = 0.5),
  ylab = "Green Voteshare in %",
  xlab = "log of Population Density per km^2",
  main = "A model with a transformed variable (log scale)"
)
abline(loglinear_model,
  col = twocols[2],
  lwd = 2
)


# Residual Plot
plot(
  x = logpopdensity,
  y = residuals(loglinear_model),
  bty = "n",
  las = 1,
  pch = 19,
  col = twocols[1],
  ylab = "Residuals", xlab = "Log of Population Density per km^2",
  main = "Residuals of the model with a transformed variable"
)
abline(
  h = 0,
  col = twocols[2],
  lwd = 2
)

We can (and should!) transform population density back to human-readable scale. Now it should become apparent why log-transformation enables us to model non-linear relationships while using a linear model.

plot(
  x = election2017$popdensity,
  y = election2017$greenvote,
  bty = "n",
  las = 1,
  pch = 19,
  col = twocols[1],
  ylim = c(0, 25),
  ylab = "Green Voteshare in %", xlab = "Population Density per km^2",
  main = "A non-linear linear model (human readable scale)"
)
curve(coef(loglinear_model)[1] + coef(loglinear_model)[2] * log(x),
  add = T,
  col = twocols[2],
  from = 5,
  to = 13000,
  n = 100000,
  lwd = 2
)

Concluding remarks

  • In your homework you will:
    • Calculate a simple bivariate regression by hand and in R.
    • Have a look at the relationship between corruption and wealth.