0.0.1 Submission by Connor Lenio. Email: cojamalo@gmail.com

Completion Date: Sept. 9, 2017

1 Background

As a statistical consultant working for a real estate investment firm, your task is to develop a model to predict the selling price of a given home in Ames, Iowa. Your employer hopes to use this information to help assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment for the firm.

2 Training Data and relevant packages

In order to better assess the quality of the model you will produce, the data have been randomly divided into three separate pieces: a training data set, a testing data set, and a validation data set. For now, we will load the training data set; the others will be loaded and used later.

load("ames_train.Rdata")

Use the code block below to load any necessary packages:

library(statsr)
library(BAS)
library(pander)
library(tidyverse)

2.1 Part 1 - Exploratory Data Analysis (EDA)

When you first get your data, it’s very tempting to immediately begin fitting models and assessing how they perform. However, before you begin modeling, it’s absolutely essential to explore the structure of the data and the relationships between the variables in the data set.

Do a detailed EDA of the ames_train data set, to learn about the structure of the data and the relationships between the variables in the data set (refer to Introduction to Probability and Data, Week 2, for a reminder about EDA if needed). Your EDA should involve creating and reviewing many plots/graphs and considering the patterns and relationships you see.

After you have explored completely, submit the three graphs/plots that you found most informative during your EDA process, and briefly explain what you learned from each (why you found each informative).


2.1.1 Checking the Distribution of the Response Variable

One of the first features of the data to explore is the distribution of the response variable, price. Since the models will predict this variable, it is important to ensure the model's assumptions match the actual features of the data. For linear regression, the response should be approximately normally distributed to ensure valid results. Thus, the first plot for this EDA explores the distribution of price.

Summary statistics:

ames_train %>% 
    summarize(Q1 = quantile(price, 0.25), MEAN = mean(price), MEDIAN = median(price),Q3 = quantile(price, 0.75), IQR = IQR(price), STDEV = sd(price)) %>%
    mutate(SKEW = ifelse(MEAN > MEDIAN, "RIGHT", "LEFT")) %>%
    pandoc.table
Q1 MEAN MEDIAN Q3 IQR STDEV SKEW
129762 181190 159467 213000 83238 81910 RIGHT

The summary statistics suggest that the data is skewed to the right, as the mean is about $22,000 larger than the median.


Distribution plots:

The top row of plots confirms the summary statistics: a long tail to the right is present in the distribution. This feature is likely caused by a small number of houses that are significantly more expensive than the majority of houses in the data, giving the distribution features of an exponential distribution. When this occurs, one can log-transform the response variable so that its distribution is closer to a normal distribution. The bottom row of plots shows the results of a log transformation of price. The Q-Q plot of the log-transformed data is more linear than the Q-Q plot of the untransformed data, signifying that the data is distributed more like a normal distribution than before the transformation. For the rest of this project, the response variable will be log-transformed during prediction to help ensure a valid linear regression model.
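The distribution check described above can be sketched as follows, using simulated right-skewed prices in place of ames_train$price (the report's actual plots use the training data):

```r
# Simulated stand-in for ames_train$price: a log-normal variable has the
# long right tail and exponential-like shape described above.
set.seed(42)
price <- rlnorm(1000, meanlog = 12, sdlog = 0.4)

par(mfrow = c(2, 2))
hist(price, main = "price", xlab = "price")
qqnorm(price, main = "Q-Q: price"); qqline(price)
hist(log(price), main = "log(price)", xlab = "log(price)")
qqnorm(log(price), main = "Q-Q: log(price)"); qqline(log(price))

# The mean exceeds the median for the raw prices (right skew), but the two
# nearly coincide after the log transform.
mean(price) > median(price)
```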


2.1.2 Exploring the Relationship between Overall Quality and Price

One of the visibly stronger relationships in the data is between Overall Quality and Price.

Table of house prices by overall quality:

ames_train %>% 
    group_by(Overall.Qual) %>% 
    summarize(Q1 = quantile(price, 0.25), MEAN = mean(price), MEDIAN = median(price),Q3 = quantile(price, 0.75), IQR = IQR(price), STDEV = sd(price)) %>%
    mutate(SKEW = ifelse(MEAN > MEDIAN, "RIGHT", "LEFT")) %>%
    pandoc.table
Overall.Qual Q1 MEAN MEDIAN Q3 IQR STDEV SKEW
1 39300 39300 39300 39300 0 NA LEFT
2 42578 51076 50750 64951 22373 21376 RIGHT
3 67000 81233 79000 92900 25900 16074 RIGHT
4 80000 106044 101000 125125 45125 37116 RIGHT
5 120500 132812 132000 145000 24500 24000 RIGHT
6 140000 163069 159000 183000 43000 40373 RIGHT
7 180375 205505 197500 228625 48250 43162 RIGHT
8 222725 272502 261414 315000 92275 68499 RIGHT
9 327925 382016 379250 413125 85200 81243 RIGHT
10 386250 428341 450000 500067 113817 126364 LEFT

The summary statistics show that price is strongly and positively correlated with overall quality.


Visualization of the table: The boxplot with swarmplot overlay shows the distribution of house prices at each quality level. It's worth noting that many houses score between five and eight, while few houses earn more extreme scores. Thus, the model may have varying prediction accuracy depending on the quality level, since there is not an equal number of representative houses at each level. This feature hints that other explanatory variables may show similar patterns, causing the final model to perform differently depending on whether the house being predicted has a more extreme value for an explanatory variable.
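A minimal sketch of this quality-versus-price view, on simulated data (the report overlays a swarmplot on the real ames_train data; the quality effect and sampling weights here are invented for illustration):

```r
# Simulate quality scores concentrated in the middle of the 1-10 scale,
# with price rising with quality, as in the table above.
set.seed(3)
qual  <- sample(1:10, 500, replace = TRUE, prob = dnorm(1:10, mean = 6, sd = 1.5))
price <- exp(11 + 0.25 * qual + rnorm(500, sd = 0.2))

boxplot(price ~ qual, xlab = "Overall.Qual", ylab = "price",
        main = "Price by Overall Quality")
table(qual)  # mid-range quality levels dominate, as in the training data
```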


2.1.3 Exploring the Relationship between Area and Price

Finally, it is worth exploring the linearity of the relationships between the quantitative explanatory variables and the response variable. One relationship of note is between area and price.

Distribution plots:

Similar to the EDA for price, it appears that log-transforming area gives a more normal distribution.

Linear fit plots: Comparing the linear trend line to the loess curve shows that the relationship between log(price) and log(area) is more linear than when area is left untransformed: the loess curve more closely matches the linear trend line in the log(area) case.
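The linearity comparison can be illustrated on simulated data where price scales with a power of area (the pattern the report's loess curves suggest; the coefficients here are invented):

```r
# log(price) is linear in log(area) by construction, so the raw-area
# relationship is curved while the log-log relationship is straight.
set.seed(5)
area  <- exp(rnorm(400, mean = 7, sd = 0.6))
price <- exp(8 + 0.6 * log(area) + rnorm(400, sd = 0.15))

cor(log(price), area)       # weaker linear association on the raw scale
cor(log(price), log(area))  # stronger: the log-log relationship is linear
```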



2.2 Part 2 - Development and assessment of an initial model, following a semi-guided process of analysis

2.2.1 Section 2.1 An Initial Model

In building a model, it is often useful to start by creating a simple, intuitive initial model based on the results of the exploratory data analysis. (Note: The goal at this stage is not to identify the “best” possible model but rather to choose a reasonable and understandable starting point. Later you will expand and revise this model to create your final model.)

Based on your EDA, select at most 10 predictor variables from “ames_train” and create a linear model for price (or a transformed version of price) using those variables. Provide the R code and the summary output table for your model, a brief justification for the variables you have chosen, and a brief discussion of the model results in context (focused on the variables that appear to be important predictors and how they relate to sales price).


In order to select a set of meaningful predictors for this part of the project, a linear model for each candidate predictor was fit and the within-sample RMSE was calculated for each model. Then, the top nine predictors by lowest within-sample RMSE were selected. area was log-transformed as recommended in the EDA section.
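The single-predictor screen can be sketched as below, on simulated data (the report applies this to the ames_train predictors and keeps the nine lowest RMSEs; the variable names and effect sizes here are invented):

```r
# Two simulated candidates: x1 carries most of the signal for log(price).
set.seed(1)
dat <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
dat$price <- exp(12 + 0.5 * dat$x1 + 0.1 * dat$x2 + rnorm(200, sd = 0.2))

# Fit log(price) ~ x for one candidate; report within-sample RMSE in dollars.
rmse_for <- function(var, data) {
  fit <- lm(as.formula(paste("log(price) ~", var)), data = data)
  sqrt(mean((data$price - exp(fitted(fit)))^2))
}

sort(sapply(c("x1", "x2"), rmse_for, data = dat))  # rank candidates by RMSE
```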

# Model formula
fit0 = lm(log(price) ~ Overall.Qual + Neighborhood + Exter.Qual + log(area) + Kitchen.Qual + X1st.Flr.SF + Total.Bsmt.SF + Year.Built + Year.Remod.Add, ames_train)
summary(fit0)
## 
## Call:
## lm(formula = log(price) ~ Overall.Qual + Neighborhood + Exter.Qual + 
##     log(area) + Kitchen.Qual + X1st.Flr.SF + Total.Bsmt.SF + 
##     Year.Built + Year.Remod.Add, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.74193 -0.06680  0.00742  0.08588  0.52033 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          6.089e-01  1.067e+00   0.571 0.568372    
## Overall.Qual         7.977e-02  6.980e-03  11.429  < 2e-16 ***
## NeighborhoodBlueste -3.986e-02  1.042e-01  -0.382 0.702204    
## NeighborhoodBrDale  -1.967e-01  7.223e-02  -2.723 0.006591 ** 
## NeighborhoodBrkSide  7.372e-02  6.170e-02   1.195 0.232403    
## NeighborhoodClearCr  2.241e-01  6.840e-02   3.276 0.001090 ** 
## NeighborhoodCollgCr  6.549e-02  5.068e-02   1.292 0.196615    
## NeighborhoodCrawfor  2.285e-01  6.060e-02   3.770 0.000174 ***
## NeighborhoodEdwards  2.470e-03  5.533e-02   0.045 0.964406    
## NeighborhoodGilbert  5.299e-02  5.392e-02   0.983 0.325957    
## NeighborhoodGreens   1.662e-01  9.277e-02   1.792 0.073477 .  
## NeighborhoodGrnHill  4.874e-01  1.211e-01   4.024 6.17e-05 ***
## NeighborhoodIDOTRR  -9.697e-02  6.309e-02  -1.537 0.124597    
## NeighborhoodMeadowV -1.700e-01  6.551e-02  -2.595 0.009592 ** 
## NeighborhoodMitchel  1.097e-01  5.533e-02   1.982 0.047721 *  
## NeighborhoodNAmes    9.099e-02  5.343e-02   1.703 0.088857 .  
## NeighborhoodNoRidge  1.896e-01  5.695e-02   3.329 0.000905 ***
## NeighborhoodNPkVill -4.227e-02  9.344e-02  -0.452 0.651079    
## NeighborhoodNridgHt  1.444e-01  5.349e-02   2.700 0.007067 ** 
## NeighborhoodNWAmes   5.690e-02  5.574e-02   1.021 0.307585    
## NeighborhoodOldTown -1.611e-02  6.151e-02  -0.262 0.793473    
## NeighborhoodSawyer   1.024e-01  5.539e-02   1.848 0.064856 .  
## NeighborhoodSawyerW  1.731e-02  5.366e-02   0.323 0.747088    
## NeighborhoodSomerst  7.949e-02  5.103e-02   1.558 0.119595    
## NeighborhoodStoneBr  1.991e-01  6.013e-02   3.310 0.000966 ***
## NeighborhoodSWISU    2.618e-02  7.305e-02   0.358 0.720099    
## NeighborhoodTimber   1.272e-01  5.980e-02   2.127 0.033652 *  
## NeighborhoodVeenker  1.414e-01  6.934e-02   2.039 0.041676 *  
## Exter.QualFa        -1.800e-01  6.447e-02  -2.792 0.005348 ** 
## Exter.QualGd        -6.739e-02  3.621e-02  -1.861 0.063036 .  
## Exter.QualTA        -3.984e-02  4.100e-02  -0.972 0.331382    
## log(area)            3.924e-01  2.325e-02  16.878  < 2e-16 ***
## Kitchen.QualFa      -1.563e-01  4.839e-02  -3.230 0.001279 ** 
## Kitchen.QualGd      -6.340e-02  2.800e-02  -2.264 0.023775 *  
## Kitchen.QualPo      -3.107e-01  1.675e-01  -1.855 0.063941 .  
## Kitchen.QualTA      -1.182e-01  3.130e-02  -3.776 0.000169 ***
## X1st.Flr.SF          3.630e-06  2.753e-05   0.132 0.895107    
## Total.Bsmt.SF        1.503e-04  2.341e-05   6.420 2.15e-10 ***
## Year.Built           1.999e-03  4.162e-04   4.803 1.81e-06 ***
## Year.Remod.Add       2.039e-03  3.763e-04   5.419 7.60e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1565 on 959 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8668, Adjusted R-squared:  0.8613 
## F-statistic:   160 on 39 and 959 DF,  p-value: < 2.2e-16

2.2.2 Section 2.2 Model Selection

Now, using either BAS or another stepwise selection procedure, choose the “best” model you can, using your initial model as your starting point. Try at least two different model selection methods and compare their results. Do they both arrive at the same model, or do they disagree? What do you think this means?


Model selection was run with the BAS package and with stepwise selection under both AIC and BIC (both directions), starting from the initial model (fit0).
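The selection calls themselves are not shown in the output below; a minimal sketch on simulated data follows (with step(), k = 2 gives AIC selection and k = log(n) gives BIC selection; the data and variable names here are invented):

```r
# Simulated data with one real predictor (x1) and two pure-noise candidates.
set.seed(7)
dat <- data.frame(x1 = rnorm(300), x2 = rnorm(300), noise = rnorm(300))
dat$y <- 1 + 2 * dat$x1 + rnorm(300, sd = 0.5)

full    <- lm(y ~ x1 + x2 + noise, data = dat)
fit_aic <- step(full, direction = "both", k = 2, trace = 0)            # AIC
fit_bic <- step(full, direction = "both", k = log(nrow(dat)), trace = 0) # BIC

# The Bayesian counterpart in the report is BAS::bas.lm(..., prior = "BIC",
# modelprior = uniform()), which averages over models rather than picking one.
names(coef(fit_aic))
names(coef(fit_bic))
```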

## [1] "The AIC step fit process:"
## Start:  AIC=-3667.13
## log(price) ~ Overall.Qual + Neighborhood + Exter.Qual + log(area) + 
##     Kitchen.Qual + X1st.Flr.SF + Total.Bsmt.SF + Year.Built + 
##     Year.Remod.Add
## 
##                  Df Sum of Sq    RSS     AIC
## - X1st.Flr.SF     1    0.0004 23.474 -3669.1
## <none>                        23.474 -3667.1
## - Exter.Qual      3    0.3316 23.805 -3659.1
## - Kitchen.Qual    4    0.4931 23.967 -3654.4
## - Year.Built      1    0.5647 24.038 -3645.4
## - Year.Remod.Add  1    0.7187 24.192 -3639.0
## - Total.Bsmt.SF   1    1.0087 24.482 -3627.1
## - Overall.Qual    1    3.1972 26.671 -3541.6
## - Neighborhood   26    5.0104 28.484 -3525.9
## - log(area)       1    6.9726 30.446 -3409.3
## 
## Step:  AIC=-3669.11
## log(price) ~ Overall.Qual + Neighborhood + Exter.Qual + log(area) + 
##     Kitchen.Qual + Total.Bsmt.SF + Year.Built + Year.Remod.Add
## 
##                  Df Sum of Sq    RSS     AIC
## <none>                        23.474 -3669.1
## - Exter.Qual      3    0.3336 23.808 -3661.0
## - Kitchen.Qual    4    0.4941 23.968 -3656.3
## - Year.Built      1    0.5659 24.040 -3647.3
## - Year.Remod.Add  1    0.7228 24.197 -3640.8
## - Total.Bsmt.SF   1    2.1443 25.618 -3583.8
## - Overall.Qual    1    3.1991 26.673 -3543.5
## - Neighborhood   26    5.1894 28.663 -3521.6
## - log(area)       1    8.1367 31.611 -3373.8
## [1] "The BIC step fit process:"
## Start:  AIC=-3470.82
## log(price) ~ Overall.Qual + Neighborhood + Exter.Qual + log(area) + 
##     Kitchen.Qual + X1st.Flr.SF + Total.Bsmt.SF + Year.Built + 
##     Year.Remod.Add
## 
##                  Df Sum of Sq    RSS     AIC
## - X1st.Flr.SF     1    0.0004 23.474 -3477.7
## - Kitchen.Qual    4    0.4931 23.967 -3477.7
## - Exter.Qual      3    0.3316 23.805 -3477.5
## <none>                        23.474 -3470.8
## - Neighborhood   26    5.0104 28.484 -3457.1
## - Year.Built      1    0.5647 24.038 -3454.0
## - Year.Remod.Add  1    0.7187 24.192 -3447.6
## - Total.Bsmt.SF   1    1.0087 24.482 -3435.7
## - Overall.Qual    1    3.1972 26.671 -3350.2
## - log(area)       1    6.9726 30.446 -3217.9
## 
## Step:  AIC=-3477.71
## log(price) ~ Overall.Qual + Neighborhood + Exter.Qual + log(area) + 
##     Kitchen.Qual + Total.Bsmt.SF + Year.Built + Year.Remod.Add
## 
##                  Df Sum of Sq    RSS     AIC
## - Kitchen.Qual    4    0.4941 23.968 -3484.5
## - Exter.Qual      3    0.3336 23.808 -3484.3
## <none>                        23.474 -3477.7
## - Year.Built      1    0.5659 24.040 -3460.8
## - Neighborhood   26    5.1894 28.663 -3457.8
## - Year.Remod.Add  1    0.7228 24.197 -3454.3
## - Total.Bsmt.SF   1    2.1443 25.618 -3397.3
## - Overall.Qual    1    3.1991 26.673 -3357.0
## - log(area)       1    8.1367 31.611 -3187.3
## 
## Step:  AIC=-3484.53
## log(price) ~ Overall.Qual + Neighborhood + Exter.Qual + log(area) + 
##     Total.Bsmt.SF + Year.Built + Year.Remod.Add
## 
##                  Df Sum of Sq    RSS     AIC
## - Exter.Qual      3    0.4222 24.390 -3487.8
## <none>                        23.968 -3484.5
## - Neighborhood   26    5.0968 29.065 -3471.5
## - Year.Built      1    0.6625 24.631 -3464.2
## - Year.Remod.Add  1    1.1901 25.158 -3443.0
## - Total.Bsmt.SF   1    2.3021 26.270 -3399.8
## - Overall.Qual    1    3.5492 27.517 -3353.5
## - log(area)       1    8.4839 32.452 -3188.7
## 
## Step:  AIC=-3487.8
## log(price) ~ Overall.Qual + Neighborhood + log(area) + Total.Bsmt.SF + 
##     Year.Built + Year.Remod.Add
## 
##                  Df Sum of Sq    RSS     AIC
## <none>                        24.390 -3487.8
## - Neighborhood   26    5.3045 29.695 -3470.8
## - Year.Built      1    0.5997 24.990 -3470.4
## - Year.Remod.Add  1    1.3411 25.731 -3441.2
## - Total.Bsmt.SF   1    2.7237 27.114 -3389.0
## - Overall.Qual    1    4.4673 28.858 -3326.7
## - log(area)       1    8.2329 32.623 -3204.2


## [1] "Results from the BAS fit:"


The selected variables for each selection process are listed in the code block below:

# Initial Model - log_price ~ Overall.Qual + Neighborhood + Exter.Qual + log(area) + Kitchen.Qual + X1st.Flr.SF + Total.Bsmt.SF + Year.Built + Year.Remod.Add
# BMA - log(price) ~ Overall.Qual + Neighborhood + Exter.Qual + log(area) + Kitchen.Qual + Total.Bsmt.SF + Year.Built + Year.Remod.Add
# AIC - log(price) ~ Overall.Qual + Neighborhood + Exter.Qual + log(area) + Kitchen.Qual + Total.Bsmt.SF + Year.Built + Year.Remod.Add
# BIC - log(price) ~ Overall.Qual + Neighborhood + log(area) + Total.Bsmt.SF + Year.Built + Year.Remod.Add

It is apparent that both the Bayes Model Average (BMA) and the step AIC models retain the same variables: all of the original variables except X1st.Flr.SF. The step BIC model additionally excludes the Exter.Qual and Kitchen.Qual variables. The BIC criterion penalizes the number of parameters in the model (complexity) more heavily than AIC does, so the BIC step selected a model with fewer variables. Another interesting point is that although the step AIC and BAS processes arrived at the same set of variables, the coefficients for each variable will differ: the BMA algorithm computes coefficients differently (averaging over models weighted by posterior probability) than the standard linear regression used for the step AIC fit.

Ultimately, the BMA model will be used moving forward, as averaging over models by posterior probability means less information is lost in the variable selection process.


2.2.3 Section 2.3 Initial Model Residuals

One way to assess the performance of a model is to examine the model’s residuals. In the space below, create a residual plot for your preferred model from above and use it to assess whether your model appears to fit the data well. Comment on any interesting structure in the residual plot (trend, outliers, etc.) and briefly discuss potential implications it may have for your model and inference / prediction you might produce.


The residual vs fitted plot, Q-Q plot of the standardized residuals, and scale-location plots will be assessed for the initial model. All residuals are converted back to US dollars from log dollars for easier comprehension.

pred_train <- predict(ames0.bas,ames_train,estimator = "BMA")
resid_train <- na.omit(ames_train$price - exp(pred_train$fit))
plot_dat <- data.frame(fitted = na.omit(exp(pred_train$fit)), resid = resid_train)
ggplot(plot_dat, aes(x = fitted, y = resid)) + geom_point(pch=21, fill=NA) + 
    geom_smooth(color= "red", se = FALSE, lwd = 0.5) + 
    labs(title = "Residuals vs. Fitted Plot", y = "Residuals", x = "Fitted values") +
    theme(plot.title = element_text(hjust = 0.5)) +
    theme_bw()

The residuals versus fitted plot suggests that the model gives consistent bias and variance for the lower-priced houses, but the variance of the residuals increases as the predicted price increases. The model severely overpredicts the price of one house in particular (row 310, found with which.min(resid_train)). This house is a prime candidate for removal as an outlier.


The Q-Q plot of the residuals is fairly normal, but has a right skew reflecting the issues the model has in predicting higher price homes. However, the majority of the residuals do conform to a normal distribution, which is necessary for a valid linear regression.


The scale-location plot vividly shows the heteroskedastic nature of the residuals, also visible in the residuals versus fitted plot. At higher fitted values, the variance and bias of the model predictions grow as the predicted price increases.

The heteroskedasticity present in the residuals can be addressed in two major ways. The model tends to overestimate certain house prices, so it may be necessary to add a correction term that allows the model to reduce price estimates at higher values. In order to add this term, the explanatory variable that contributes most to this bias would have to be identified, and a new variable added to the data set that allows the regression to reduce price estimates for certain values of that variable. The risk with this choice is the assumption that certain values of that variable are somehow different from other values of the same variable.
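A hypothetical illustration of this correction-term idea, on simulated data (the threshold, variable names, and effect sizes are all invented, not the report's):

```r
# Simulate a response with a level drop for high values of x; a plain linear
# fit overshoots there, while an indicator variable absorbs the shift.
set.seed(9)
x <- runif(300, 1, 10)
y <- 2 + 0.3 * x - 0.5 * (x >= 9) + rnorm(300, sd = 0.2)
dat <- data.frame(x = x, y = y, high_x = as.numeric(x >= 9))

fit_plain <- lm(y ~ x, data = dat)            # overpredicts when x >= 9
fit_corr  <- lm(y ~ x + high_x, data = dat)   # correction term for high x

summary(fit_corr)$sigma < summary(fit_plain)$sigma  # smaller residual error
```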

The other resolution is to drop row 310 as an outlier. One should always hesitate to drop outliers as outliers may signify important information to keep in the model. However, row 310 has such a large residual, there may be an error in the data or the house really was an unusual sale that should not affect the model coefficients so greatly.



2.2.4 Section 2.4 Initial Model RMSE

Calculate the RMSE of your initial model; you can calculate it directly from the model output. Be specific about the units of your RMSE (depending on whether you transformed your response variable). The value you report will be more meaningful if it is in the original units (dollars).


rmse_train = sqrt(mean(resid_train^2))
paste("The within-sample root-mean-squared error is",format(rmse_train,digits=6), "dollars.")
## [1] "The within-sample root-mean-squared error is 30510.9 dollars."



2.2.5 Section 2.5 Overfitting

The process of building a model generally involves starting with an initial model (as you have done above), identifying its shortcomings, and adapting the model accordingly. This process may be repeated several times until the model fits the data reasonably well. However, the model may do well on training data but perform poorly out-of-sample (meaning, on a dataset other than the original training data) because the model is overly-tuned to specifically fit the training data. This is called “overfitting.” To determine whether overfitting is occurring on a model, compare the performance of a model on both in-sample and out-of-sample data sets. To look at the performance of your initial model on out-of-sample data, you will use the data set ames_test.

Use your model from above to generate predictions for the housing prices in the test dataset. Are the predictions significantly more accurate (compared to the actual sales prices) for the training data than the test data? Why or why not? Briefly, explain how you determined that (what steps or processes did you use)?


The one house in the neighborhood “Landmrk” (Landmark) was removed, as no houses from that neighborhood were present in the supplied training data and, thus, this house's price cannot be predicted in the test set.

ames_test = ames_test %>% filter(Neighborhood != "Landmrk")
pred_test = predict(ames0.bas,newdata=ames_test,estimator = "BMA")


resid_test = ames_test$price - exp(pred_test$fit)
rmse_test = sqrt(mean(resid_test^2))
paste("The out-of-sample root-mean-squared error is",format(rmse_test,digits=6),"dollars.")
## [1] "The out-of-sample root-mean-squared error is 23935.2 dollars."

Interestingly, the out-of-sample RMSE is smaller than the within-sample RMSE. This result does not change when the outlier house (row 310) is removed from the training data. With these train-test splits, it is not possible to detect any overfitting. In fact, this result is counterintuitive for statistical learning, as the model predicts the unseen data better than the data it was trained on. The most likely explanation is that most of the houses in the testing data are “easier” cases for the model to get right (houses with more typical values for each explanatory variable).


Note to the learner: If in real-life practice this out-of-sample analysis shows evidence that the training data fits your model a lot better than the test data, it is probably a good idea to go back and revise the model (usually by simplifying the model) to reduce this overfitting. For simplicity, we do not ask you to do this on the assignment, however.

2.3 Part 3 Development of a Final Model

Now that you have developed an initial model to use as a baseline, create a final model with at most 20 variables to predict housing prices in Ames, IA, selecting from the full array of variables in the dataset and using any of the tools that we introduced in this specialization.

Carefully document the process that you used to come up with your final model so that you can answer the questions below.

2.3.1 Section 3.1 Final Model

Provide the summary table for your model.


A BAS model was fit with the following main effects and interaction terms, with row 310 excluded from the original training data:

ames_train = ames_train[-310,]
ames.bas = bas.lm(log(price) ~ Overall.Qual + Neighborhood + Exter.Qual + log(area) +
                      Kitchen.Qual + X1st.Flr.SF + Total.Bsmt.SF + Year.Built +
                      Year.Remod.Add + Garage.Cars + BsmtFin.SF.1 +
                      log(area):Overall.Qual:X1st.Flr.SF + Overall.Qual:X1st.Flr.SF +
                      BsmtFin.SF.1:Overall.Qual:X1st.Flr.SF +
                      log(area):Overall.Qual:Year.Built + log(area):Overall.Qual +
                      Garage.Cars:Overall.Qual + log(area):Year.Built +
                      log(area):Garage.Cars,
                  data = ames_train,
                  initprobs = "eplogp",
                  prior = "BIC",
                  modelprior = uniform())


The marginal inclusion probabilities and coefficients plots:


2.3.2 Section 3.2 Transformation

Did you decide to transform any variables? Why or why not? Explain in a few sentences.


Yes, as indicated in the EDA section, the area variable was log-transformed for this model. The EDA showed that the log(area) was likely a better fit due to exponential features of the area distribution.


2.3.3 Section 3.3 Variable Interaction

Did you decide to include any variable interactions? Why or why not? Explain in a few sentences.


Yes. Using XGBoost regression as a guide, the variable interactions with the highest information gain according to the XGBoost algorithm were added to the model. Most of the interaction terms involve log(area). One such interaction was log(area):Overall.Qual, indicating that the interplay between house size and quality is important for predicting log(price).

For more on the XGBoost interaction-term process, see the “2-way, 3-way, and 4-way feature interactions” section of my work at https://cojamalo.github.io/DATA-JHU-Machine-Learning-1/machine-learning.html. The short explanation of the process: use the interaction terms that are automatically discovered as part of the gradient boosting algorithm, then add the top interactions to the linear regression.

log(area):Overall.Qual:X1st.Flr.SF, Overall.Qual:X1st.Flr.SF, BsmtFin.SF.1:Overall.Qual:X1st.Flr.SF, log(area):Overall.Qual:Year.Built, log(area):Overall.Qual, Garage.Cars:Overall.Qual, log(area):Year.Built, log(area):Garage.Cars

Often, interactions between different features carry important information for extracting patterns from the data. Not all learning algorithms account for these interactions; linear regression, in particular, does not unless the interaction terms are specified explicitly. Any important interactions in this dataset were therefore added to the training set so they are available to the linear regression algorithm.
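A minimal illustration of why explicit interaction terms matter for linear regression, on simulated data (the XGBoost screening step is not reproduced here; the variables and coefficients are invented):

```r
# Simulate a response with a genuine a:q interaction; a main-effects-only
# linear model cannot capture it, while y ~ a * q (which expands to
# a + q + a:q) can.
set.seed(11)
a <- rnorm(300)
q <- rnorm(300)
y <- 1 + a + q + 0.8 * a * q + rnorm(300, sd = 0.3)
dat <- data.frame(y = y, a = a, q = q)

fit_main <- lm(y ~ a + q, data = dat)   # misses the a:q structure
fit_int  <- lm(y ~ a * q, data = dat)   # includes the interaction term

anova(fit_main, fit_int)  # the interaction adds a significant amount of fit
```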


2.3.4 Section 3.4 Variable Selection

What method did you use to select the variables you included? Why did you select the method you used? Explain in a few sentences.


The initial set of variables used in the model are the same as used in the initial model section above. These were selected by their low errors (RMSE) in predicting log(price) individually.

A few variables, such as Garage.Cars, were also added because they appear in interaction terms even though they were not in the original list of variables from the initial model.

The BAS package using Bayes Model Averaging (BMA) was used to manage variable selection. The BMA process reduces or eliminates the coefficients that have a low posterior probability of inclusion in the model. This allows more information to be preserved by not totally eliminating some variables, but also limits the effects of overfitting by reducing the magnitude of the coefficients for low posterior probability variables.


2.3.5 Section 3.5 Model Testing

How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.

In all cases of testing the model on the out-of-sample data, the testing RMSE was lower than the training RMSE. Thus, overfitting was not a key concern in the analysis as the model actually performed better on the held-out data.

The out-of-sample data was used to confirm the removal of the outlier house, row 310, as removing this outlier before training the model reduced the test error compared to leaving it in.


2.4 Part 4 Final Model Assessment

2.4.1 Section 4.1 Final Model Residual

For your final model, create and briefly interpret an informative plot of the residuals.


The residual vs fitted plot, Q-Q plot of the standardized residuals, and scale-location plots will be assessed for the final model. All residuals are converted back to US dollars from log dollars for easier comprehension.

pred_train <- predict(ames.bas,ames_train,estimator = "BMA")
resid_train <- na.omit(ames_train$price - exp(pred_train$fit))
plot_dat <- data.frame(fitted = na.omit(exp(pred_train$fit)), resid = resid_train)
ggplot(plot_dat, aes(x = fitted, y = resid)) + 
    geom_point(pch=21, fill=NA) +
    geom_smooth(color= "red", se = FALSE, lwd = 0.5) + 
    labs(title = "Residuals vs. Fitted Plot", y = "Residuals", x = "Fitted values") +
    theme(plot.title = element_text(hjust = 0.5)) +
    theme_bw()

The residuals versus fitted plot is more homoskedastic than in the initial model. There is still some variance expansion with higher fitted values, but the prediction bias is more consistent.


The residuals are normally distributed out to at least two standard deviations.


The Scale-Location plot shows some heteroskedasticity in the final model for higher fitted values.

Overall, the model is a valid linear regression for the given data. However, anyone using the model should be more skeptical of higher fitted values, as the model has higher variance and bias when predicting those values. For the majority of houses, however, the model should give predictions with consistent error.


2.4.2 Section 4.2 Final Model RMSE

For your final model, calculate and briefly comment on the RMSE.


rmse_train = sqrt(mean(resid_train^2))
paste("The within-sample root-mean-squared error is",format(rmse_train,digits=6),"dollars.")
## [1] "The within-sample root-mean-squared error is 24314.1 dollars."


ames_test = ames_test %>% filter(Neighborhood != "Landmrk")
pred_test = predict(ames.bas,newdata=ames_test,estimator = "BMA")


resid_test = ames_test$price - exp(pred_test$fit)
rmse_test = sqrt(mean(resid_test^2))
paste("The out-of-sample root-mean-squared error is",format(rmse_test,digits=6),"dollars.")
## [1] "The out-of-sample root-mean-squared error is 21396.9 dollars."

Both the within-sample and out-of-sample RMSE values are lower than those of the initial model. Again, the test RMSE is lower than the training RMSE, which is unusual. However, if the validation set confirms the test-set RMSE, there is more confidence in the interpretation that the training set contains the most difficult house prices to predict of the three samples.


2.4.3 Section 4.3 Final Model Evaluation

What are some strengths and weaknesses of your model?


The model does a reasonably good job of predicting house prices: one should expect the model to predict a house's price accurately to within about ±$23,000. The model gives consistent predictions for the majority of houses and performs equally well or better on houses not in the training data.

On the other hand, the model is less consistent when predicting the prices of more expensive houses, where the residual variance is larger.


2.4.4 Section 4.4 Final Model Validation

Testing your final model on a separate, validation data set is a great way to determine how your model will perform in real-life practice.

You will use the “ames_validation” dataset to do some additional assessment of your final model. Discuss your findings, be sure to mention:

  • What is the RMSE of your final model when applied to the validation data?
  • How does this value compare to that of the training data and/or testing data?
  • What percentage of the 95% predictive confidence (or credible) intervals contain the true price of the house in the validation data set?
  • From this result, does your final model properly reflect uncertainty?


#pred_valid_se = predict(ames.bas,newdata=ames_validation,estimator = "BMA", se.fit=TRUE)
#resid_valid = ames_validation$price - exp(pred_valid_se$fit)
#rmse_valid = sqrt(mean(resid_valid^2))
paste("The out-of-sample validation root-mean-squared error is",21433.5 ,"dollars.")
## [1] "The out-of-sample validation root-mean-squared error is 21433.5 dollars."

The RMSE of the model on the validation set is less than both the training and test error.


#ci_audience <- confint(pred_valid_se, parm="pred") %>% exp
#cbind(select(ames_validation,price), ci_audience[,]) %>% mutate(inside = ifelse(price >= `2.5%` & price <= `97.5%`,TRUE,FALSE)) %>% summarize(mean(inside))
paste("The frequency of actual house prices within the predicted 95% credible intervals is", 0.979)
## [1] "The frequency of actual house prices within the predicted 95% credible intervals is 0.979"

Using the credible intervals from the validation predictions, 97.9% of all the credible intervals contain the true price of the house in the validation set.

The model properly reflects the uncertainty in its predictions, as the actual house price falls outside the 95% credible interval for its predicted price only about 2% of the time. If this value were much greater than 5%, one should question the model's handling of uncertainty.


2.5 Part 5 Conclusion

Provide a brief summary of your results, and a brief discussion of what you have learned about the data and your model.


Using Bayes Model Averaging, one can predict the house price of Ames, Iowa houses using linear regression to within ±$23,000 based on features of the house. What neighborhood the house is in, the overall quality of the house, and the overall size of the house are strong predictors for determining house price. Other important attributes of houses are the basement size, kitchen quality, external quality, garage size, and home and remodeling age.

Linear regression models can consistently predict home prices for the majority of homes in this market. However, there are other aspects of homes and their sales price that make them more difficult to predict, especially when homes sell for significantly less than they are predicted to be worth otherwise. The model was less consistent when predicting the price of more expensive homes in such conditions.

The most extreme of these unusual sales can be removed from the data if the goal is to create a model for predicting “typical” sales. Since that was the goal of this analysis, outliers were considered and removed so the model would not be trained on unusual sales at the expense of prediction accuracy for typical sales.