# Advanced regression modeling

Modeling of bear weight based upon bodily measurements using mixed effects and generalized least squares modeling accounting for covariance-variance structures and within-group correlation.

Based upon predicted vs. observed, residuals, residual sum of squares, and AIC, this script determines that bear weight is modeled best by a cubic-splines transformed mixed-effect model in the form of:

lme(weight ~ chestGSc + lengthSc + headWSc + ageNA +
splinepoints_chest + splinepoints_length,
random = ~ 1 | name/age,
weights=varExp(form=~as.vector(chestGSc + splinepoints_chest)),
correlation=corCAR1(form = ~ age | name))

where:

• chestGSc = scaled chest girth
• lengthSc = scaled length
• ageNA = dummy variable, indicating whether or not age was imputed
• splinepoints_chest = cubic splines transformation of chest girth, with knots at 7, 25, and 90 percent
• splinepoints_length = cubic splines transformation of length, with knots at 10, 35, and 75 percent
• autocorrelation is modeled with an order of 1, with a continuous time covarariate of age grouped by the name of each bear
• within-group variance is modeled using the exponential variance of scaled chest girth plus its spline transformation

# 1 Read in Data

require(pacman)
p_load(ggplot2, nlme, lme4, reshape2, dplyr, car)

bears <- read.fwf("Bears.full.dat", skip=20,
"neckG", "length", "chestG", "weight", "obs", "name"),
widths=c(9, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 10),
strip.white=T, na.string = "")

Missing bear names are set to “Unnamed 1” and “Unnamed 2” and relevant columns are converted to factors

bears$name <- as.character(bears$name)
bears[87,12] <- "Unnamed1"
bears[88,12] <- "Unnamed2"
bears$name <- as.factor(bears$name)
bears$month <- as.factor(bears$month)
bears$sex <- as.factor(bears$sex)
attach(bears)

# 2 Correlations

Exploratory correlations show strong relationships between possible predictor variables and bear weight.

cor(weight, bears[,5:9])
         headL     headW     neckG    length    chestG
[1,] 0.8333214 0.7556092 0.9432771 0.8746451 0.9659937

# 3 Initial plotting

Initial plots show nonlinear relationships for length, chest, and neck.

bearsplotdata <- bears[,0:12]
bearsplotdata <- bearsplotdata[,-c(1,4,11,12)]
bearsplotdata <- melt(bearsplotdata, id.vars=c("weight"))

bearsplot1 <- ggplot(data=bearsplotdata, aes(y=weight, x=value))+
theme_bw()+
geom_point()+
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
strip.background = element_blank(),
panel.grid.minor = element_blank()) +
facet_wrap(~variable, ncol=3, scales="free_x")
print(bearsplot1)

# 4 Transformations

Transform length, chest, and neck based upon observed relationships in the initial plots.

bears$lengthtransf <- 1/bears$length
bears$chestGtransf <- bears$chestG^2
bears$neckGtransf <- bears$neckG^2

# 5 Scale response and predictor variables

The measurements and weights that we are using in the model vary by more than an order of magnitude. In order for the model to converge, we need to make sure that the predictors and response variables have similar ranges.

All predictor variables were scaled by subtracting the mean and dividing by the standard deviation.

bears$headLSc <- scale(bears$headL)
bears$neckGSctransf <- scale(bears$neckGtransf)
bears$neckGSc <- scale(bears$neckG)
bears$headWSc <- scale(bears$headW)
bears$lengthSc <- scale(bears$length)
bears$lengthSctransf <- scale(bears$lengthtransf)
bears$chestGSc <- scale(bears$chestG)
bears$chestGSctransf <- scale(bears$chestGtransf)

# 6 Deal with missing age data

I chose a model-based imputation as outlined in Andrew Gelman’s 2012 textbook “Data Analysis Using Regression and Multilevel/Hierarchical Models”. I create a dummy variable for future models indicating whether or not the age data is imputed, and then infer the value based upon the relationship between age and head width.

I build a linear model relating age to scaled head width, resulting in an estimate of 21.532. This means that, for every 1 unit change in the standard deviation of head width, age increases by 21.532 months. I then apply this relationship to the missing age data.

As a means to check whether the imputation was successful, I compare the mean of the imputed age to the mean of the known age. I then compare the correlation between age and a variable not used to impute age (chest girth) for imputed and known age.

Age is similar between the two (43.4 vs 46.8), as are the correlations (0.734 vs. 0.731). Thus, I continue to use the imputed age data in the modeling process.

bears$ageNA <- 0 bears$ageNA[is.na(bears$age)] <- 1 summary(lm(bears$age[bears$ageNA==0] ~ bears$headWSc[bears$ageNA==0]))  Call: lm(formula = bears$age[bears$ageNA == 0] ~ bears$headWSc[bears$ageNA == 0]) Residuals: Min 1Q Median 3Q Max -63.441 -14.084 -6.891 9.156 79.171 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 45.893 2.808 16.341 < 2e-16 *** bears$headWSc[bears$ageNA == 0] 21.532 2.656 8.106 4.65e-12 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 25.44 on 81 degrees of freedom Multiple R-squared: 0.4479, Adjusted R-squared: 0.441 F-statistic: 65.7 on 1 and 81 DF, p-value: 4.653e-12 meanage <- mean(bears$age[bears$ageNA==0]) bearschange <- bears$headWSc[bears$ageNA==1]*21.532 bears$age[bears$ageNA==1] <- meanage + bearschange bears$age <- jitter(bears$age, amount=0.0001) mean(bears$age[bears$ageNA==1]) [1] 46.83522 mean(bears$age[bears$ageNA==0]) [1] 43.43373 cor(bears$chestG[bears$ageNA==0], bears$age[bears$ageNA==0]) [1] 0.7342366 cor(bears$chestG[bears$ageNA==1], bears$age[bears$ageNA==1]) [1] 0.7314328 bears$agescale <- scale(bears$age) # 7 Linear model I start by building a simple linear model to get a base AIC score and model structure moving forward. lm1 <- lm(weight ~ chestGSctransf + sex + neckGSctransf + lengthSctransf + headLSc + headWSc, data=bears) summary(lm1)  Call: lm(formula = weight ~ chestGSctransf + sex + neckGSctransf + lengthSctransf + headLSc + headWSc, data = bears) Residuals: Min 1Q Median 3Q Max -58.473 -11.112 -1.404 11.803 74.150 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 192.8548 2.1479 89.788 < 2e-16 *** chestGSctransf 74.9734 5.4441 13.772 < 2e-16 *** sex2 -2.2553 4.1094 -0.549 0.5840 neckGSctransf 31.8555 6.0351 5.278 5.03e-07 *** lengthSctransf -6.6563 3.6866 -1.806 0.0732 . headLSc -2.0311 4.2655 -0.476 0.6347 headWSc -0.9657 2.9858 -0.323 0.7469 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 20.76 on 136 degrees of freedom Multiple R-squared: 0.9662, Adjusted R-squared: 0.9647 F-statistic: 648.2 on 6 and 136 DF, p-value: < 2.2e-16 AIC(lm1) [1] 1282.124 ## 7.1 Variable Inflation Factors Variable inflation factors indicate that chestG and neckG are highly colinear. vif(lm1) chestGSctransf sex neckGSctransf lengthSctransf headLSc 9.762581 1.193243 11.997357 4.476711 5.993064 headWSc 2.936599  Indeed, chestG and neckG have a 0.94 correlation. cor(bears$chestGSctransf, bears$neckGSctransf)  [,1] [1,] 0.9406729 I keep chestG because it has a slightly higher correlation with weight. I remove neckG. cor(bears$chestGSctransf, bears$weight)  [,1] [1,] 0.9773564 cor(bears$neckGSctransf, bears\$weight)
          [,1]
[1,] 0.9532861

# 8 Mixed-effect model with power variance function

I first build a mixed-effects model with all of the potential predictor variables and no controls for heteroskedasticity or longitudinality.

I group the data by name and age, and allow the slope and intercept to vary by groups.

lme_init <- lme(weight ~ chestGSctransf + lengthSctransf + headWSc +  headLSc,
random = ~ 1 | name/age, data=bears)

Remove insignificant variable (headLSc) and begin trying various weighting choices.

I choose between the varPower and varExp variance function classes based upon visual inspections of the variance in the initial facetted plot.

lme.power <- lme(weight ~ chestGSctransf + lengthSctransf + headWSc,
random = ~ 1 | name/age,
weights=varPower(form=~as.vector(lengthSctransf)), data=bears)

lme.exp <- lme(weight ~ chestGSctransf + lengthSctransf + headWSc,
random = ~ 1 | name/age,
weights=varExp(form=~as.vector(lengthSctransf)), data=bears)

Choose the varExp function to model variance-covariance structure, based upon its lower AIC score.

AIC(lme.power, lme.exp)
          df      AIC
lme.power  8 1296.343
lme.exp    8 1279.278

## 8.1 Residuals

Residuals for the linear mixed-effect model.

resid <- resid(lme.exp)
residplot <- ggplot(data=bears, aes(y=resid, x=weight))+
geom_point()+
geom_hline(yintercept=0, color="blue")+
theme_bw()

print(residplot)