# The first line sets an option for the final document that can be produced from
# the .Rmd file. Don't worry about it.
knitr::opts_chunk$set(
  echo = TRUE, # show results
  collapse = TRUE # do not interrupt chunks
)
# The next bit is quite powerful and useful.
# First you define which packages you need for your analysis and assign it to
# the p_needed object.
p_needed <- c(
  "foreign", # import files
  "viridis", # color
  "ggplot2", # plotting
  "here", # directory
  "stargazer", # for regression tables
  "dplyr" # for glimpse command
)
# Now you check which packages are already installed on your computer.
# The function installed.packages() returns a vector with all the installed
# packages.
packages <- rownames(installed.packages())
# Then you check which of the packages you need are not installed on your
# computer yet. Essentially you compare the vector p_needed with the vector
# packages. The result of this comparison is assigned to p_to_install.
p_to_install <- p_needed[!(p_needed %in% packages)]
# If at least one element is in p_to_install you then install those missing
# packages.
if (length(p_to_install) > 0) {
install.packages(p_to_install)
}
# Now that all packages are installed on the computer, you can load them for
# this project. Additionally the expression returns whether the packages were
# successfully loaded.
sapply(p_needed, require, character.only = TRUE)
## Loading required package: foreign
## Loading required package: viridis
## Loading required package: viridisLite
## Loading required package: ggplot2
## Loading required package: here
## here() starts at C:/Users/vikto/Dropbox/github/qm21/week04_ols_basics
## Loading required package: stargazer
##
## Please cite as:
## Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## foreign viridis ggplot2 here stargazer dplyr
## TRUE TRUE TRUE TRUE TRUE TRUE
# This is an option for stargazer tables
# It automatically adapts the output to html or latex,
# depending on whether we want a html or pdf file
stargazer_opt <- ifelse(knitr::is_latex_output(), "latex", "html")
# Don't worry about this part: it ensures that if the file is knitted to html,
# significance notes are depicted correctly
if (stargazer_opt == "html") {
  fargs <- formals(stargazer)
  fargs$notes.append <- FALSE
  fargs$notes <- c("<sup>⋆</sup>p<0.1; <sup>⋆⋆</sup>p<0.05; <sup>⋆⋆⋆</sup>p<0.01")
  formals(stargazer) <- fargs
}
# set the seed for replicability
set.seed(2021)
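To see why setting a seed matters for replicability, here is a minimal sketch (using only base R): drawing random numbers twice after setting the same seed gives identical results, so anyone running the script gets the same "random" numbers.

```r
# Same seed, same draws: this is what makes analyses
# with random components reproducible.
set.seed(2021)
draw1 <- rnorm(3)

set.seed(2021)
draw2 <- rnorm(3)

identical(draw1, draw2) # TRUE
```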
Functions in R are very useful! They take objects as input, do something with this input, and return an output. The output is just the result of what happens to the input within the function. In other words, the output is a function of the input. The great thing is that we can freely define what this function should do.
Just as with loops, try to get into the habit of using them. You will save a lot of time and make your R code more efficient and structured.
Functions always have the same structure:
do_this <- function([inputs separated by commas]) {
  # what to do with those inputs
}
The parentheses after function() contain placeholders for input. The body inside {} includes the "function" itself, using the defined placeholders instead of real input.
I usually best understand those concepts by looking at an example. How about squaring numbers?
powertwo <- function(x) {
  x^2
}
powertwo(x = 5)
## [1] 25
# And now we can use this function with any input.
powertwo(x = 2.25)
## [1] 5.0625
# Or even for a sequence of numbers.
powertwo(x = seq(1, 10, 0.5))
## [1] 1.00 2.25 4.00 6.25 9.00 12.25 16.00 20.25 25.00 30.25
## [11] 36.00 42.25 49.00 56.25 64.00 72.25 81.00 90.25 100.00
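Functions can also take several inputs, separated by commas, just as in the template above. A small sketch extending powertwo (the function name powerk and its default value are made up for illustration):

```r
# A function with two inputs: the base x and the exponent k.
# Giving k a default value (k = 2) makes that input optional.
powerk <- function(x, k = 2) {
  x^k
}

powerk(x = 5)        # uses the default k = 2, same as powertwo(5): 25
powerk(x = 5, k = 3) # 125
```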
Deriving and understanding the intercept and slope of a regression line is great and important. However, we also want to learn how well the regression line fits our data. This is why we have to talk about Residuals, Total Sum of Squares (TSS), Explained Sum of Squares (ESS), and \(R^2\):
Residuals: y - y_hat
TSS: sum((y - mean(y))^2)
ESS: sum((y_hat - mean(y))^2)
RSS: sum((y - y_hat)^2)
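To see how these quantities fit together, here is a tiny made-up example (the data are invented for illustration): the variance decomposes as TSS = ESS + RSS, and \(R^2\) is the explained share ESS / TSS.

```r
# Toy data, invented for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)
y_hat <- fitted(fit)

TSS <- sum((y - mean(y))^2)
ESS <- sum((y_hat - mean(y))^2)
RSS <- sum((y - y_hat)^2)

# The decomposition holds: TSS = ESS + RSS
all.equal(TSS, ESS + RSS)

# R^2 is the share of explained variance
ESS / TSS
```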
Just like the slope and intercept, we can calculate all those things by hand:
TSS_hand <- sum((election2017$greenvote - mean(election2017$greenvote))^2)

# How do we get y_hat? Remember our own predict function?
greenvote_hat <- predict_bi(
  x = election2017$age18to34,
  intercept = coef_hand[1],
  slope = coef_hand[2]
)

ESS_hand <- sum((greenvote_hat - mean(election2017$greenvote))^2)
RSS_hand <- TSS_hand - ESS_hand

r_squared <- ESS_hand / TSS_hand
r_squared
## [1] 0.4562751
Now we also want to calculate the residuals:
election2017$residuals_hand <- election2017$greenvote - greenvote_hat
And make a residual plot.
plot(x = election2017$age18to34,
y = election2017$residuals_hand,
bty = "n",
las = 1,
main = "Residual Plot for greenvote",
ylab = "Residuals",
xlab = "Proportion of young people (18 to 34) in %",
pch = 19,
col = twocols[1])
abline(h = 0,
col = twocols[2],
lwd = 2)
grid() # add grid
And make a residual plot.
ggplot(data = election2017, # data used for plotting
mapping = aes(x = age18to34, y = residuals_hand)) +
geom_point(color = twocols[1], # add points
size = 2) +
theme_minimal() + # change the appearance
labs(x = "Proportion of young people (18 to 34) in %",
y = "Residuals",
title = "Residual Plot for greenvote") +
geom_hline(yintercept = 0, # add the line
color = twocols[2],
size = 1)
And here you can see the parts of the variance visually on the scatterplot with our example:
It comes as no surprise that all of these functions are already in R…
lm_res <- lm(greenvote ~ age18to34,
             data = election2017)
# lm() stands for linear model.
lm_res
##
## Call:
## lm(formula = greenvote ~ age18to34, data = election2017)
##
## Coefficients:
## (Intercept) age18to34
## -7.8618 0.8177
summary(lm_res)
##
## Call:
## lm(formula = greenvote ~ age18to34, data = election2017)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.8166 -2.0586 -0.6483 2.0717 8.5081
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.86183 1.06742 -7.365 1.75e-12 ***
## age18to34 0.81766 0.05179 15.787 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.917 on 297 degrees of freedom
## Multiple R-squared: 0.4563, Adjusted R-squared: 0.4544
## F-statistic: 249.2 on 1 and 297 DF, p-value: < 2.2e-16
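The numbers printed by summary() can also be extracted programmatically, which is handy for further calculations. A sketch using R's built-in mtcars data (since election2017 is specific to this course):

```r
# Fit a simple linear model on built-in data
fit <- lm(mpg ~ wt, data = mtcars)

coef(fit)                 # intercept and slope
summary(fit)$r.squared    # the value printed as "Multiple R-squared"
summary(fit)$coefficients # coefficient table incl. std. errors and p-values
```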
There is a very neat package to generate ready-to-publish regression tables. It is called stargazer
and it generates LaTeX (pronounced lah-tekh) or HTML output. This is how it works:
library(stargazer) # we have already installed it in setup chunk
stargazer(lm_res,
type = stargazer_opt,
covariate.labels = c("Share of young people"),
dep.var.labels = c("Green Vote Share"))
|                       | Dependent variable:         |
|                       | Green Vote Share            |
| Share of young people | 0.818***                    |
|                       | (0.052)                     |
| Constant              | -7.862***                   |
|                       | (1.067)                     |
| Observations          | 299                         |
| R2                    | 0.456                       |
| Adjusted R2           | 0.454                       |
| Residual Std. Error   | 2.917 (df = 297)            |
| F Statistic           | 249.232*** (df = 1; 297)    |
| Note:                 | ⋆p<0.1; ⋆⋆p<0.05; ⋆⋆⋆p<0.01 |
Now you can make ready-to-publish regression tables. (To get this code chunk working in your own projects, you need to generate the stargazer_opt
object in the setup chunk.) And we are only in Week 4!!
You can get the coefficients with the coef()
command for the output of an lm()
object, which we called lm_res
. Let’s compare them to our results from above.
cbind(coef_lm = coef(lm_res),
      coef_hand)
## coef_lm coef_hand
## (Intercept) -7.8618276 -7.8618276
## age18to34 0.8176557 0.8176557
coef(lm_res) - coef_hand
## (Intercept) age18to34
## 2.930989e-14 -8.881784e-16
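Differences on the order of 1e-14 are floating-point rounding noise, not real disagreement between the two sets of coefficients. When comparing such numbers, all.equal() (which uses a small tolerance) is safer than an exact == comparison:

```r
# Exact comparison of floating-point numbers is fragile:
0.1 + 0.2 == 0.3 # FALSE, due to binary rounding

# all.equal() compares up to a small numerical tolerance:
isTRUE(all.equal(0.1 + 0.2, 0.3)) # TRUE
```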
Fitted (predicted) values are also already in R. Just use the fitted()
command.
We can check whether the in-built function returns the same values as the ones we calculated “by hand”:
head(fitted(lm_res) - greenvote_hat)
## Flensburg – Schleswig Nordfriesland – Dithmarschen Nord
## -1.652012e-13 -3.907985e-14
## Steinburg – Dithmarschen Süd Rendsburg-Eckernförde
## 8.881784e-16 8.881784e-16
## Kiel Plön – Neumünster
## 1.776357e-15 8.881784e-16
# Of course we can also get the residuals.
head(residuals(lm_res) - election2017$residuals_hand)
## Flensburg – Schleswig Nordfriesland – Dithmarschen Nord
## 1.652012e-13 3.907985e-14
## Steinburg – Dithmarschen Süd Rendsburg-Eckernförde
## -8.881784e-16 -8.881784e-16
## Kiel Plön – Neumünster
## -1.776357e-15 -8.881784e-16
We did pretty well calculating it by hand, didn't we?
We have re-estimated the results with a built-in command already, and now let's present our results in a plot. Additionally, we make a residual plot to investigate whether there are any problems with the model.
Plot values and regression line in a nice plot.
plot(x = election2017$age18to34,
y = election2017$greenvote,
bty = "n",
las = 1,
pch = 19,
col = twocols[1],
cex = 1,
ylim = c(0, 25),
ylab = "Vote share for the Green Party (%)",
xlab = "Percentage of population between 18 and 34 (%)",
main = "Younger Districts - Greener Votes?")
abline(lm_res,
col = twocols[2],
lwd = 2)
Now we want to have a look at the residual plot. And label the outliers just like in the plot in the lecture.
plot(x = election2017$age18to34,
y = residuals(lm_res),
bty = "n",
las = 1,
pch = 19,
col = twocols[1],
ylim = c(-9, 9),
xlab = "% of the population between 18 and 34",
ylab = "Residuals",
main = "Residual Plot")
abline(h = 0,
col = twocols[2],
lwd = 2)
grid()
# Label the largest positive outlier
text(x = election2017$age18to34[residuals(lm_res) == max(residuals(lm_res))],
y = max(residuals(lm_res)),
names(which(residuals(lm_res) == max(residuals(lm_res)))),
cex = 0.6,
pos = 1)
# Label the largest negative outlier
text(x = election2017$age18to34[residuals(lm_res) == min(residuals(lm_res))],
y = min(residuals(lm_res)),
names(which(residuals(lm_res) == min(residuals(lm_res)))),
cex = 0.6,
pos = 1)
Plot values and regression line in a nice plot.
ggplot(data = election2017, # data used for plotting
mapping = aes(x = age18to34, y = greenvote)) +
geom_point(color = twocols[1], # add points
size = 2) +
theme_minimal() + # change the appearance
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
labs(x = "Proportion of young people (18 to 34) in %",
y = "Vote share for the Green Party (%)",
title = "Scatterplot of green votes and share of young people") +
geom_abline(intercept = coef(lm_res)[1], # add the line
slope = coef(lm_res)[2],
color = twocols[2],
size = 1)
Now we want to have a look at the residual plot. And label the outliers just like in the plot in the lecture.
ggplot(mapping = aes(x = election2017$age18to34,
y = residuals(lm_res))) +
geom_point(color = twocols[1], # add points
size = 2) +
theme_minimal() + # change the appearance
labs(x = "Proportion of young people (18 to 34) in %",
y = "Residuals",
title = "Residual Plot for greenvote") +
geom_hline(yintercept = 0, # add the line
color = twocols[2],
size = 1) +
geom_text(aes(x = election2017$age18to34[residuals(lm_res) == max(residuals(lm_res))],
y = max(residuals(lm_res)),
label = names(which(residuals(lm_res) == max(residuals(lm_res)))),
vjust = 2),
size = 3) +
geom_text(aes(x = election2017$age18to34[residuals(lm_res) == min(residuals(lm_res))],
y = min(residuals(lm_res)),
label = names(which(residuals(lm_res) == min(residuals(lm_res)))),
vjust = 2),
size = 3)
In today’s exercise section, you will explore how linear models can be used to model non-linear relationships.
Let’s start with a hypothesis: Somebody argues that the Greens got more votes in urban districts than in districts in the countryside. We measure whether a district is urban or rural with its population density (popdensity).
Let’s have a look at the data first. The hypothesis suggests a relationship between two variables: the independent variable is population density (popdensity), and the dependent variable is the Green vote share (greenvote).

First, plot the univariate distribution of popdensity. Either use a histogram or a density plot (or both).

hist(x = election2017$age18to34)
hist(x = election2017$popdensity)

Then make a scatterplot of popdensity and greenvote.

plot(x = election2017$popdensity,
     y = election2017$greenvote)

Now try the same plot with log(popdensity) instead of popdensity. To model nonlinear relationships, it is sometimes (but not always!) useful to log-transform one of the variables. Let’s try this out in this case.
Generate a new variable logpopdensity. Hint: We can easily calculate log values with the function log(). Then make the scatterplot with logpopdensity instead of popdensity.

logpopdensity <- log(election2017$popdensity)
plot(x = log(election2017$popdensity),
     y = election2017$greenvote)
Use the lm()
command to estimate two OLS Models:
- Regress greenvote on popdensity. Store the regression output in an object called linear_model.
- Regress greenvote on logpopdensity. Store the regression output in an object called loglinear_model.

linear_model <- lm(greenvote ~ popdensity,
                   data = election2017)
loglinear_model <- lm(greenvote ~ logpopdensity,
                      data = election2017)
loglinear_model
##
## Call:
## lm(formula = greenvote ~ logpopdensity, data = election2017)
##
## Coefficients:
## (Intercept) logpopdensity
## -1.123 1.663
summary(linear_model)
##
## Call:
## lm(formula = greenvote ~ popdensity, data = election2017)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.799 -2.523 -0.584 2.492 12.660
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.6237125 0.2368277 32.191 <2e-16 ***
## popdensity 0.0012753 0.0001376 9.271 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.484 on 297 degrees of freedom
## Multiple R-squared: 0.2244, Adjusted R-squared: 0.2218
## F-statistic: 85.95 on 1 and 297 DF, p-value: < 2.2e-16
summary(loglinear_model)
##
## Call:
## lm(formula = greenvote ~ logpopdensity, data = election2017)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6231 -2.3104 -0.2683 2.3917 11.4100
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.1229 0.9493 -1.183 0.238
## logpopdensity 1.6634 0.1561 10.656 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.365 on 297 degrees of freedom
## Multiple R-squared: 0.2766, Adjusted R-squared: 0.2741
## F-statistic: 113.5 on 1 and 297 DF, p-value: < 2.2e-16
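Why does the log transformation improve the fit? A small simulation (the data here are invented for illustration) makes the intuition concrete: if y really depends on log(x), a model in log(x) captures the relationship well, while a model in raw x cannot.

```r
set.seed(123)
x <- runif(200, min = 1, max = 1000)          # skewed predictor, like popdensity
y <- -1 + 1.5 * log(x) + rnorm(200, sd = 0.5) # true relationship is in log(x)

fit_linear <- lm(y ~ x)      # misspecified: linear in x
fit_log    <- lm(y ~ log(x)) # correctly specified

summary(fit_linear)$r.squared # fits worse
summary(fit_log)$r.squared    # fits much better
```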
If you did everything correctly, you can run the following code to inspect your results. If you have time left, discuss the plots in your group.
First, the linear model (without log-transformation):
plot(
x = election2017$popdensity,
y = election2017$greenvote,
bty = "n",
las = 1,
pch = 19,
col = twocols[1],
ylim = c(0, 25),
ylab = "Green Voteshare in %", xlab = "Population Density per km^2",
main = "A linear model"
)
abline(linear_model,
col = twocols[2],
lwd = 2
)
# We will see the problem when looking at the residuals.
plot(
x = election2017$popdensity,
y = residuals(linear_model),
bty = "n",
las = 1,
pch = 19,
col = twocols[1],
ylab = "Residuals", xlab = "Population Density per km^2",
main = "Residuals of the simple linear model"
)
abline(
h = 0,
col = twocols[2],
lwd = 2
)
Second, the model with logpopdensity
:
plot(
x = logpopdensity,
y = election2017$greenvote,
bty = "n",
las = 1,
pch = 19,
col = viridis(1, alpha = 0.5),
ylab = "Green Voteshare in %",
xlab = "log of Population Density per km^2",
main = "A model with a transformed variable (log scale)"
)
abline(loglinear_model,
col = twocols[2],
lwd = 2
)
# Residual Plot
plot(
x = logpopdensity,
y = residuals(loglinear_model),
bty = "n",
las = 1,
pch = 19,
col = twocols[1],
ylab = "Residuals", xlab = "Log of Population Density per km^2",
main = "Residuals of the model with a transformed variable"
)
abline(
h = 0,
col = twocols[2],
lwd = 2
)
We can (and should!) transform population density back to human-readable scale. Now it should become apparent why log-transformation enables us to model non-linear relationships while using a linear model.
plot(
x = election2017$popdensity,
y = election2017$greenvote,
bty = "n",
las = 1,
pch = 19,
col = twocols[1],
ylim = c(0, 25),
ylab = "Green Voteshare in %", xlab = "Population Density per km^2",
main = "A non-linear linear model (human readable scale)"
)
curve(loglinear_model$coefficients[1] + loglinear_model$coefficients[2] * log(x),
add = T,
col = twocols[2],
from = 5,
to = 13000,
n = 100000,
lwd = 2
)