In this section we consider statistical hypothesis tests for the relationship between two or more variables, along with the graphs that help in interpreting the results.
Regression analysis is a broad term for a set of methodologies used to predict a response variable (also called a dependent, criterion, or outcome variable) from one or more predictor variables (also called independent or explanatory variables). In general, regression analysis can be used to identify the explanatory variables that are related to a response variable, to describe the form of the relationships involved, and to provide an equation for predicting the response variable from the explanatory variables.
When the regression model contains one dependent variable and one independent variable, the approach is called simple linear regression.
The data set women in the base installation provides the height and weight for a set of 15 women ages 30 to 39. Suppose you want to predict weight from height. Having an equation for predicting weight from height can help you to identify overweight or underweight individuals.
But first, we must check that our data are close to a normal distribution, an assumption of parametric methods such as linear regression. We start by building a Q-Q plot, which shows how well the quantiles of the data (after z transformation) agree with the theoretical quantiles of the corresponding normal distribution (\(\mu = 0, \sigma = 1\)).
library(EnvStats)
#Q-Q plot
women
## height weight
## 1 58 115
## 2 59 117
## 3 60 120
## 4 61 123
## 5 62 126
## 6 63 129
## 7 64 132
## 8 65 135
## 9 66 139
## 10 67 142
## 11 68 146
## 12 69 150
## 13 70 154
## 14 71 159
## 15 72 164
EnvStats::qqPlot(women$weight, add.line = TRUE, points.col = "blue", line.col = "red")
EnvStats::qqPlot(women$height, add.line = TRUE, points.col = "blue", line.col = "red")
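The EnvStats qqPlot() handles the standardization and reference line for us; the same diagnostic can be built in base R, which makes the z transformation explicit. A minimal sketch (scale() standardizes the data to mean 0 and standard deviation 1):
# base-R equivalent: standardized sample quantiles vs. theoretical N(0, 1) quantiles
z <- scale(women$weight)   # z transformation: (x - mean(x)) / sd(x)
qqnorm(z, pch = 19, col = "blue")
qqline(z, col = "red")     # reference line through the quartiles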
Now we will test normality using the Shapiro-Wilk test, whose null hypothesis is that the data are normally distributed.
shapiro.test(women$weight)
##
## Shapiro-Wilk normality test
##
## data: women$weight
## W = 0.96036, p-value = 0.6986
shapiro.test(women$height)
##
## Shapiro-Wilk normality test
##
## data: women$height
## W = 0.96359, p-value = 0.7545
Then we proceed with the parametric regression analysis:
library(ggplot2)
# linear regression using lm (linear model) function
fit <- lm(weight ~ height, data=women)
summary(fit)
##
## Call:
## lm(formula = weight ~ height, data = women)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7333 -1.1333 -0.3833 0.7417 3.1167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.51667 5.93694 -14.74 1.71e-09 ***
## height 3.45000 0.09114 37.85 1.09e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
## F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
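We can also extract 95% confidence intervals for the estimated coefficients; confint() uses a 95% level by default:
# 95% confidence intervals for the intercept and the height slope
confint(fit)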
# graph with regression line and 95% confidence interval
ggplot(data = women, aes(x = height, y = weight)) +
  geom_point(pch = 19, color = "blue", size = 2) +
  geom_smooth(method = "lm", color = "red", linetype = 2) +
  labs(x = "Height, inches", y = "Weight, lb")
## `geom_smooth()` using formula = 'y ~ x'
# graph: residuals vs. height
plot(women$height, residuals(fit),
     xlab = "Height (in inches)",
     ylab = "residuals")
The resulting predictive equation is:
\[\widehat{\text{Weight}} = -87.52 + 3.45 \times \text{Height}\]
From the Pr(>|t|) column, you see that the regression coefficient (3.45) is significantly different from zero (p < 0.001) and indicates that there is an expected increase of 3.45 pounds of weight for every 1 inch increase in height. The multiple R-squared (0.991) indicates that the model accounts for 99.1% of the variance in weights. The residual standard error (1.53 pounds) can be thought of as the average error in predicting weight from height using this model. The F statistic tests whether the predictor variables, taken together, predict the response variable above chance levels.
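Once the model is fitted, predict() applies this equation to new data; the 64.5-inch height below is an arbitrary illustration value:
# predicted weight (lb) for a hypothetical height of 64.5 inches,
# with a 95% prediction interval for an individual observation
predict(fit, newdata = data.frame(height = 64.5), interval = "prediction")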
The residual plot from the linear regression indicates a non-random distribution of the linear model's prediction errors. You might be able to improve the prediction using a regression with a quadratic term (that is, \(X^2\)), which gives us a polynomial regression:
fit2 <- lm(weight ~ height + I(height^2), data=women)
summary(fit2)
##
## Call:
## lm(formula = weight ~ height + I(height^2), data = women)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50941 -0.29611 -0.00941 0.28615 0.59706
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 261.87818 25.19677 10.393 2.36e-07 ***
## height -7.34832 0.77769 -9.449 6.58e-07 ***
## I(height^2) 0.08306 0.00598 13.891 9.32e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3841 on 12 degrees of freedom
## Multiple R-squared: 0.9995, Adjusted R-squared: 0.9994
## F-statistic: 1.139e+04 on 2 and 12 DF, p-value: < 2.2e-16
ggplot(women, aes(x = height, y = weight)) +
  geom_point(color = "blue") +
  stat_smooth(method = "lm", formula = y ~ poly(x, 2), color = "red", linetype = 2, se = TRUE, size = 1) +
  labs(x = "Height, inches", y = "Weight, lb.")
plot(women$height, residuals(fit2),
     xlab = "Height (in inches)",
     ylab = "residuals")
The new polynomial equation is:
\[\widehat{\text{Weight}} = 261.88 - 7.35 \times \text{Height} + 0.083 \times \text{Height}^2\]
The amount of variance accounted for has increased to 99.9%. The significance of the squared term (t = 13.89, p < .001) suggests that including the quadratic term improves the model fit. The prediction errors (residuals) are smaller, and their distribution looks more random.
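Since the linear model is nested within the quadratic one, we can also compare them formally with an F test; a minimal sketch using the two fits from above:
# nested-model F test: does the quadratic term significantly improve the fit?
anova(fit, fit2)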
Previously we considered regression models for predicting a quantitative response variable from quantitative predictor variables. But we can include nominal or ordinal factors as predictors as well. When factors are included as explanatory variables, our focus usually shifts from prediction to understanding group differences, and the methodology is referred to as analysis of variance (ANOVA). ANOVA methodology is used to analyze a wide variety of experimental and quasi-experimental designs.
In the following example, fifty patients received one of five cholesterol-reducing drug regimens (trt). Three of the treatment conditions involved the same drug administered as 20 mg once per day (1time), 10 mg twice per day (2times), or 5 mg four times per day (4times). The two remaining conditions (drugD and drugE) represented competing drugs. Which drug regimen produced the greatest cholesterol reduction (response)?
We are going to use box plots, built with ggplot2, to visualize differences between treatments.
# obtaining data from a package
library(multcomp)
data(cholesterol)
cholesterol
## trt response
## 1 1time 3.8612
## 2 1time 10.3868
## 3 1time 5.9059
## 4 1time 3.0609
## 5 1time 7.7204
## 6 1time 2.7139
## 7 1time 4.9243
## 8 1time 2.3039
## 9 1time 7.5301
## 10 1time 9.4123
## 11 2times 10.3993
## 12 2times 8.6027
## 13 2times 13.6320
## 14 2times 3.5054
## 15 2times 7.7703
## 16 2times 8.6266
## 17 2times 9.2274
## 18 2times 6.3159
## 19 2times 15.8258
## 20 2times 8.3443
## 21 4times 13.9621
## 22 4times 13.9606
## 23 4times 13.9176
## 24 4times 8.0534
## 25 4times 11.0432
## 26 4times 12.3692
## 27 4times 10.3921
## 28 4times 9.0286
## 29 4times 12.8416
## 30 4times 18.1794
## 31 drugD 16.9819
## 32 drugD 15.4576
## 33 drugD 19.9793
## 34 drugD 14.7389
## 35 drugD 13.5850
## 36 drugD 10.8648
## 37 drugD 17.5897
## 38 drugD 8.8194
## 39 drugD 17.9635
## 40 drugD 17.6316
## 41 drugE 21.5119
## 42 drugE 27.2445
## 43 drugE 20.5199
## 44 drugE 15.7707
## 45 drugE 22.8850
## 46 drugE 23.9527
## 47 drugE 21.5925
## 48 drugE 18.3058
## 49 drugE 20.3851
## 50 drugE 17.3071
# creating a box plot for each group
library(ggplot2)
ggplot(cholesterol, aes(x = trt, y = response)) + geom_boxplot(fill = "cornflowerblue") +
  labs(x = "Treatments", y = "Cholesterol Reduction, %")
In a one-way ANOVA, we are interested in comparing the dependent variable means of two or more groups defined by a categorical grouping factor.
library(data.table)
attach(cholesterol)
# Mean and StDev by treatment using data.table
cholTr <- data.table(cholesterol)
cholStat <- cholTr[, list(Mean = mean(response), StDev = sd(response)), by = list(Treatment = trt)]
cholStat
## Treatment Mean StDev
## 1: 1time 5.78197 2.878113
## 2: 2times 9.22497 3.483054
## 3: 4times 12.37478 2.923119
## 4: drugD 15.36117 3.454636
## 5: drugE 20.94752 3.345003
# ANOVA
fit <- aov(response ~ trt)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## trt 4 1351.4 337.8 32.43 9.82e-13 ***
## Residuals 45 468.8 10.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
detach(cholesterol)
The results of the analysis indicate that the treatments differ significantly in their cholesterol-reducing effects: the probability of obtaining an F statistic this large if there were no real differences (a type I error) is very low (Pr(>F) = 9.82e-13).
But wait! The procedure above imposes a couple of requirements on the data: normality and homogeneity of variance.
First, we will check whether the response variable has a normal distribution:
# Q-Q plot
EnvStats::qqPlot(cholesterol$response, add.line = TRUE, points.col = "blue", line.col = "red")
The data falls within the 95% confidence envelope, suggesting that the normality assumption has been met fairly well.
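Because the normality assumption in ANOVA really concerns the model errors, a complementary check is to apply the earlier Shapiro-Wilk test to the residuals of the fitted aov model; a minimal sketch:
# Shapiro-Wilk test on the ANOVA residuals
shapiro.test(residuals(fit))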
Now we will test homogeneity of variance using Bartlett's test:
bartlett.test(response ~ trt, data=cholesterol)
##
## Bartlett test of homogeneity of variances
##
## data: response by trt
## Bartlett's K-squared = 0.57975, df = 4, p-value = 0.9653
Bartlett’s test indicates that the variances in the five groups do not differ significantly (p = 0.97).
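Bartlett's test is sensitive to departures from normality; if that is a concern, Levene's test from the car package is a more robust alternative (this sketch assumes car is installed):
# Levene's test: a robust alternative check of homogeneity of variance
library(car)
leveneTest(response ~ trt, data = cholesterol)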
The ANOVA F test for treatment tells you that the five drug regimens are not equally effective, but it does not tell you which treatments differ from one another. You can use a multiple comparison procedure to answer this question. For example, the TukeyHSD() function provides a test of all pairwise differences between group means, as shown next.
TukeyHSD(fit)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = response ~ trt)
##
## $trt
## diff lwr upr p adj
## 2times-1time 3.44300 -0.6582817 7.544282 0.1380949
## 4times-1time 6.59281 2.4915283 10.694092 0.0003542
## drugD-1time 9.57920 5.4779183 13.680482 0.0000003
## drugE-1time 15.16555 11.0642683 19.266832 0.0000000
## 4times-2times 3.14981 -0.9514717 7.251092 0.2050382
## drugD-2times 6.13620 2.0349183 10.237482 0.0009611
## drugE-2times 11.72255 7.6212683 15.823832 0.0000000
## drugD-4times 2.98639 -1.1148917 7.087672 0.2512446
## drugE-4times 8.57274 4.4714583 12.674022 0.0000037
## drugE-drugD 5.58635 1.4850683 9.687632 0.0030633
par(las = 2)           # rotate axis labels
par(mar = c(5, 8, 4, 2))  # widen the left margin for the comparison labels
plot(TukeyHSD(fit))
For example, the mean cholesterol reductions for 1time and 2times are not significantly different from each other (p = 0.138), whereas the difference between 1time and 4times is significant (p < .001). In this graph, confidence intervals that include 0 indicate treatments that are not significantly different (p > 0.05).
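The multcomp package loaded earlier for the data also provides its own multiple-comparison interface, glht(); a sketch equivalent to the Tukey procedure above:
# Tukey pairwise comparisons via multcomp's general linear hypothesis test
tuk <- glht(fit, linfct = mcp(trt = "Tukey"))
summary(tuk)  # adjusted p-values for all pairwise differences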
When the hypothesis test involves only two samples, one-way ANOVA is equivalent to a two-sample t-test (the F statistic equals the square of the t statistic); we verify this with a sketch after the t-test below.
# data
drogaA <- c(8.8,8.4,7.9,8.7,9.1,9.6)
drogaB <- c(9.9,9.0,11.1,9.6,8.7,10.4,9.5)
# data table - data.frame(drogaA, drogaB) does not work here because the vectors
# have different lengths, so we pad the shorter one with NA before converting
tabla <- list(Droga_A = drogaA, Droga_B = drogaB)
as.data.frame(lapply(tabla, `length<-`, max(sapply(tabla, length))))
## Droga_A Droga_B
## 1 8.8 9.9
## 2 8.4 9.0
## 3 7.9 11.1
## 4 8.7 9.6
## 5 9.1 8.7
## 6 9.6 10.4
## 7 NA 9.5
# alternative hypothesis: drogaA - drogaB different from 0 (smaller or larger)
pruebat <- t.test(drogaA,drogaB, var.equal = TRUE, alternative = "two.sided")
pruebat
##
## Two Sample t-test
##
## data: drogaA and drogaB
## t = -2.4765, df = 11, p-value = 0.03076
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.8752609 -0.1104534
## sample estimates:
## mean of x mean of y
## 8.750000 9.742857
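We can verify the equivalence by running a one-way ANOVA on the same two groups; the data frame and names below (datos, respuesta, droga) are constructed only for this check. The F statistic should equal \(t^2 = (-2.4765)^2 \approx 6.13\), with the same p-value:
# one-way ANOVA on the two drug groups: F equals the square of t
datos <- data.frame(
  respuesta = c(drogaA, drogaB),
  droga = factor(rep(c("A", "B"), c(length(drogaA), length(drogaB))))
)
summary(aov(respuesta ~ droga, data = datos))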
Kabacoff, R., 2015. R in Action: Data Analysis and Graphics with R, 2nd ed. Manning, Shelter Island, NY.
Verzani, J., 2012. Getting Started with RStudio. O'Reilly, Sebastopol, CA.