Advanced regression modeling

Modeling of bear weight based upon bodily measurements using mixed effects and generalized least squares modeling accounting for covariance-variance structures and within-group correlation.

Based upon predicted vs. observed, residuals, residual sum of squares, and AIC, this script determines that bear weight is modeled best by a cubic-splines transformed mixed-effect model in the form of:

lme(weight ~ chestGSc + lengthSc + headWSc + ageNA + 
      splinepoints_chest + splinepoints_length, 
            random = ~ 1 | name/age,
            weights=varExp(form=~as.vector(chestGSc + splinepoints_chest)),
            correlation=corCAR1(form = ~ age | name))

where:

  • chestGSc = scaled chest girth
  • lengthSc = scaled length
  • headWSc = scaled head width
  • ageNA = dummy variable, indicating whether or not age was imputed
  • splinepoints_chest = cubic splines transformation of chest girth, with knots at 7, 25, and 90 percent
  • splinepoints_length = cubic splines transformation of length, with knots at 10, 35, and 75 percent
  • autocorrelation is modeled with an order of 1, with a continuous time covarariate of age grouped by the name of each bear
  • within-group variance is modeled using the exponential variance of scaled chest girth plus its spline transformation

1 Read in Data

require(pacman)
p_load(ggplot2, nlme, lme4, reshape2, dplyr, car)

bears <- read.fwf("Bears.full.dat", skip=20,
      col.names=c("id", "age", "month", "sex", "headL", "headW",
                 "neckG", "length", "chestG", "weight", "obs", "name"),
      widths=c(9, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 10), 
      strip.white=T, na.string = "")

Missing bear names are set to “Unnamed 1” and “Unnamed 2” and relevant columns are converted to factors

bears$name <- as.character(bears$name)
bears[87,12] <- "Unnamed1"
bears[88,12] <- "Unnamed2"
bears$name <- as.factor(bears$name)
bears$month <- as.factor(bears$month)
bears$sex <- as.factor(bears$sex)
attach(bears)

2 Correlations

Exploratory correlations show strong relationships between possible predictor variables and bear weight.

cor(weight, bears[,5:9])
         headL     headW     neckG    length    chestG
[1,] 0.8333214 0.7556092 0.9432771 0.8746451 0.9659937

3 Initial plotting

Initial plots show nonlinear relationships for length, chest, and neck.

bearsplotdata <- bears[,0:12]
bearsplotdata <- bearsplotdata[,-c(1,4,11,12)]
bearsplotdata <- melt(bearsplotdata, id.vars=c("weight"))

bearsplot1 <- ggplot(data=bearsplotdata, aes(y=weight, x=value))+
  theme_bw()+
  geom_point()+
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        strip.background = element_blank(),
        panel.grid.minor = element_blank()) + 
  facet_wrap(~variable, ncol=3, scales="free_x")
print(bearsplot1)

4 Transformations

Transform length, chest, and neck based upon observed relationships in the initial plots.

bears$lengthtransf <- 1/bears$length
bears$chestGtransf <- bears$chestG^2
bears$neckGtransf <- bears$neckG^2

5 Scale response and predictor variables

The measurements and weights that we are using in the model vary by more than an order of magnitude. In order for the model to converge, we need to make sure that the predictors and response variables have similar ranges.

All predictor variables were scaled by subtracting the mean and dividing by the standard deviation.

bears$headLSc <- scale(bears$headL)
bears$neckGSctransf <- scale(bears$neckGtransf)
bears$neckGSc <- scale(bears$neckG)
bears$headWSc <- scale(bears$headW)
bears$lengthSc <- scale(bears$length)
bears$lengthSctransf <- scale(bears$lengthtransf)
bears$chestGSc <- scale(bears$chestG)
bears$chestGSctransf <- scale(bears$chestGtransf)

6 Deal with missing age data

I chose a model-based imputation as outlined in Andrew Gelman’s 2012 textbook “Data Analysis Using Regression and Multilevel/Hierarchical Models”. I create a dummy variable for future models indicating whether or not the age data is imputed, and then infer the value based upon the relationship between age and head width.

I build a linear model relating age to scaled head width, resulting in an estimate of 21.532. This means that, for every 1 unit change in the standard deviation of head width, age increases by 21.532 months. I then apply this relationship to the missing age data.

As a means to check whether the imputation was successful, I compare the mean of the imputed age to the mean of the known age. I then compare the correlation between age and a variable not used to impute age (chest girth) for imputed and known age.

Age is similar between the two (43.4 vs 46.8), as are the correlations (0.734 vs. 0.731). Thus, I continue to use the imputed age data in the modeling process.

bears$ageNA <- 0
bears$ageNA[is.na(bears$age)] <- 1

summary(lm(bears$age[bears$ageNA==0] ~ bears$headWSc[bears$ageNA==0]))

Call:
lm(formula = bears$age[bears$ageNA == 0] ~ bears$headWSc[bears$ageNA == 
    0])

Residuals:
    Min      1Q  Median      3Q     Max 
-63.441 -14.084  -6.891   9.156  79.171 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                       45.893      2.808  16.341  < 2e-16 ***
bears$headWSc[bears$ageNA == 0]   21.532      2.656   8.106 4.65e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 25.44 on 81 degrees of freedom
Multiple R-squared:  0.4479,    Adjusted R-squared:  0.441 
F-statistic:  65.7 on 1 and 81 DF,  p-value: 4.653e-12
meanage <- mean(bears$age[bears$ageNA==0])
bearschange <- bears$headWSc[bears$ageNA==1]*21.532

bears$age[bears$ageNA==1] <- meanage + bearschange
bears$age <- jitter(bears$age, amount=0.0001)

mean(bears$age[bears$ageNA==1])
[1] 46.83522
mean(bears$age[bears$ageNA==0])
[1] 43.43373
cor(bears$chestG[bears$ageNA==0], bears$age[bears$ageNA==0])
[1] 0.7342366
cor(bears$chestG[bears$ageNA==1], bears$age[bears$ageNA==1])
[1] 0.7314328
bears$agescale <- scale(bears$age)

7 Linear model

I start by building a simple linear model to get a base AIC score and model structure moving forward.

lm1 <- lm(weight ~ chestGSctransf + sex + neckGSctransf + lengthSctransf + headLSc 
          + headWSc, data=bears)

summary(lm1)

Call:
lm(formula = weight ~ chestGSctransf + sex + neckGSctransf + 
    lengthSctransf + headLSc + headWSc, data = bears)

Residuals:
    Min      1Q  Median      3Q     Max 
-58.473 -11.112  -1.404  11.803  74.150 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    192.8548     2.1479  89.788  < 2e-16 ***
chestGSctransf  74.9734     5.4441  13.772  < 2e-16 ***
sex2            -2.2553     4.1094  -0.549   0.5840    
neckGSctransf   31.8555     6.0351   5.278 5.03e-07 ***
lengthSctransf  -6.6563     3.6866  -1.806   0.0732 .  
headLSc         -2.0311     4.2655  -0.476   0.6347    
headWSc         -0.9657     2.9858  -0.323   0.7469    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.76 on 136 degrees of freedom
Multiple R-squared:  0.9662,    Adjusted R-squared:  0.9647 
F-statistic: 648.2 on 6 and 136 DF,  p-value: < 2.2e-16
AIC(lm1)
[1] 1282.124

7.1 Variable Inflation Factors

Variable inflation factors indicate that chestG and neckG are highly colinear.

vif(lm1)
chestGSctransf            sex  neckGSctransf lengthSctransf        headLSc 
      9.762581       1.193243      11.997357       4.476711       5.993064 
       headWSc 
      2.936599 

Indeed, chestG and neckG have a 0.94 correlation.

cor(bears$chestGSctransf, bears$neckGSctransf)
          [,1]
[1,] 0.9406729

I keep chestG because it has a slightly higher correlation with weight. I remove neckG.

cor(bears$chestGSctransf, bears$weight)
          [,1]
[1,] 0.9773564
cor(bears$neckGSctransf, bears$weight)
          [,1]
[1,] 0.9532861

8 Mixed-effect model with power variance function

I first build a mixed-effects model with all of the potential predictor variables and no controls for heteroskedasticity or longitudinality.

I group the data by name and age, and allow the slope and intercept to vary by groups.

lme_init <- lme(weight ~ chestGSctransf + lengthSctransf + headWSc +  headLSc,
            random = ~ 1 | name/age, data=bears)

Remove insignificant variable (headLSc) and begin trying various weighting choices.

I choose between the varPower and varExp variance function classes based upon visual inspections of the variance in the initial facetted plot.

lme.power <- lme(weight ~ chestGSctransf + lengthSctransf + headWSc,
            random = ~ 1 | name/age, 
            weights=varPower(form=~as.vector(lengthSctransf)), data=bears)

lme.exp <- lme(weight ~ chestGSctransf + lengthSctransf + headWSc,
            random = ~ 1 | name/age, 
            weights=varExp(form=~as.vector(lengthSctransf)), data=bears)

Choose the varExp function to model variance-covariance structure, based upon its lower AIC score.

AIC(lme.power, lme.exp)
          df      AIC
lme.power  8 1296.343
lme.exp    8 1279.278

8.1 Residuals

Residuals for the linear mixed-effect model.

resid <- resid(lme.exp)
residplot <- ggplot(data=bears, aes(y=resid, x=weight))+
  geom_point()+
  geom_hline(yintercept=0, color="blue")+
  theme_bw()

print(residplot)