0.0.1 Submission by Connor Lenio. Email: cojamalo@gmail.com

Completion Date: Sept. 25, 2017

1 Background

As a statistical consultant working for a real estate investment firm, your task is to develop a model to predict the selling price of a given home in Ames, Iowa. Your employer hopes to use this information to help assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment for the firm.

2 Training Data and relevant packages

load("ames_train.Rdata")
ames_train = ames_train %>% filter(Sale.Condition == "Normal")

The training data excludes any house not sold under normal conditions. Abnormal sale conditions can substantially change the relationship between a house's features and its sale price, so houses with abnormal sale conditions will not be predicted.

Packages

library(statsr)
library(BAS)
library(pander)
library(tidyverse)
library(forecast)

2.1 Part 1 - Exploratory Data Analysis (EDA)

2.1.1 Checking the Distribution of the Response Variable

One of the first features of the data to explore is the distribution of the response variable, price. Since the models will predict this variable, it is important to ensure that the model's assumptions match the actual features of the data. For linear regression, the residuals should be approximately normally distributed, and a strongly skewed response variable often violates this assumption. Thus, the first plot for this EDA will explore the distribution of price.

Summary statistics:

ames_train %>% 
    summarize(Q1 = quantile(price, 0.25),
              MEAN = mean(price),
              MEDIAN = median(price),
              Q3 = quantile(price, 0.75),
              IQR = IQR(price),
              STDEV = sd(price)) %>%
    mutate(SKEW = ifelse(MEAN > MEDIAN, "RIGHT", "LEFT")) %>%
    pandoc.table()
   Q1      MEAN    MEDIAN     Q3      IQR    STDEV   SKEW
 129000   174622   155500   205000   76000   72270   RIGHT

The summary statistics suggest that the data is skewed to the right as the mean is about $20,000 larger than the median.


Distribution plots: The top row of plots confirms the summary statistics, as a long right tail is present in the distribution. This feature is likely caused by a small number of houses that are significantly more expensive than the majority of houses in the data, giving the data features of an exponential distribution. When this issue occurs, one can log-transform the response variable so that its distribution is closer to a normal distribution. The bottom row of plots shows the result of log-transforming price. The Q-Q plot of the log-transformed data is more linear than the Q-Q plot of the untransformed data, signifying that the data is distributed more like a normal distribution than before the transformation. For the rest of this project, the response variable will be log-transformed during prediction to ensure a successful linear regression model is produced. A sketch of the plotting code appears below.
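The plots themselves are not reproduced here. A minimal sketch of how they might be generated, assuming ggplot2 (loaded via tidyverse) and the gridExtra package for panel layout; this is an illustrative reconstruction, not the report's original plotting code:

library(gridExtra)  # assumed here only for arranging the four panels

p1 <- ggplot(ames_train, aes(price)) + geom_histogram(bins = 30) + theme_bw()
p2 <- ggplot(ames_train, aes(sample = price)) + stat_qq() + theme_bw()
p3 <- ggplot(ames_train, aes(log(price))) + geom_histogram(bins = 30) + theme_bw()
p4 <- ggplot(ames_train, aes(sample = log(price))) + stat_qq() + theme_bw()
grid.arrange(p1, p2, p3, p4, nrow = 2)  # top row: price; bottom row: log(price)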


2.1.2 Exploring the Relationship between Overall Quality and Price

One of the visibly stronger relationships in the data is between Overall Quality and Price.

Table of house prices by overall quality:

ames_train %>% 
    group_by(Overall.Qual) %>% 
    summarize(Q1 = quantile(price, 0.25),
              MEAN = mean(price),
              MEDIAN = median(price),
              Q3 = quantile(price, 0.75),
              IQR = IQR(price),
              STDEV = sd(price)) %>%
    mutate(SKEW = ifelse(MEAN > MEDIAN, "RIGHT", "LEFT")) %>%
    pandoc.table()
 Overall.Qual      Q1      MEAN    MEDIAN      Q3      IQR    STDEV   SKEW
      1          39300     39300    39300    39300       0      NA    LEFT
      2          55000     62801    63900    68104   13104   13925    LEFT
      3          67000     81233    79000    92900   25900   16074    RIGHT
      4          83125    108917   104950   124750   41625   37174    RIGHT
      5         120969    132597   132000   144000   23031   23776    RIGHT
      6         140000    163386   159500   183000   43000   40631    RIGHT
      7         180375    205421   197250   229250   48875   42234    RIGHT
      8         220000    265320   260750   296688   76688   62988    RIGHT
      9         322500    372062   379250   406875   84375   55396    LEFT
     10         367188    440312   418125   491250  124062  129761    RIGHT

The summary statistics give one indication that price is positively correlated with overall quality.


Visualization of the table: The boxplot with swarmplot overlay shows the distribution of house prices at each quality level. It's worth noting that most houses score between four and eight, while few houses earn more extreme values. Thus, the model's prediction accuracy may vary by quality level, since the quality levels are not equally represented in the data. This imbalance hints that other explanatory variables may show similar patterns, causing the final model to perform differently when a predicted house has an extreme value for an explanatory variable. A sketch of the visualization appears below.
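The plot is not reproduced here. A minimal sketch of one way to build it, assuming the ggbeeswarm package for the swarm overlay (the report does not show which overlay implementation was used):

library(ggbeeswarm)  # assumed package for the swarmplot overlay

ggplot(ames_train, aes(x = factor(Overall.Qual), y = price)) +
    geom_boxplot(outlier.shape = NA) +           # suppress duplicate outlier dots
    geom_quasirandom(alpha = 0.4, width = 0.2) + # one point per house
    labs(x = "Overall Quality", y = "Price") +
    theme_bw()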


2.1.3 Exploring the Relationship between Area and Price

Finally, it is worth exploring the linearity of the relationships between the quantitative explanatory variables and the response variable. One relationship of note is between area and price.

Distribution plots:

Similar to the EDA for price, it appears that log-transforming area gives a more normal distribution.

Linear fit plots: Comparing the linear trend line to the loess curve shows that the relationship between log(price) and log(area) is more linear than the relationship between log(price) and untransformed area: the loess curve matches the linear trend line much more closely in the log(area) case. A sketch of the comparison appears below.
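A minimal sketch of this comparison, assuming ggplot2's built-in lm and loess smoothers and gridExtra for the side-by-side panels (an illustrative reconstruction):

library(gridExtra)  # assumed for side-by-side panels

# Blue = linear trend line, red = loess curve
p_raw <- ggplot(ames_train, aes(area, log(price))) +
    geom_point(pch = 21, fill = NA) +
    geom_smooth(method = "lm", se = FALSE, color = "blue") +
    geom_smooth(method = "loess", se = FALSE, color = "red") +
    theme_bw()
p_log <- ggplot(ames_train, aes(log(area), log(price))) +
    geom_point(pch = 21, fill = NA) +
    geom_smooth(method = "lm", se = FALSE, color = "blue") +
    geom_smooth(method = "loess", se = FALSE, color = "red") +
    theme_bw()
grid.arrange(p_raw, p_log, ncol = 2)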



2.2 Part 2 - Development and assessment of an initial model, following a semi-guided process of analysis

2.2.1 Section 2.1 An Initial Model

In order to select a manageable set of meaningful predictors for this part of the project, a simple linear model was fit for each candidate predictor and the within-sample RMSE was calculated for each model. The nine predictors with the lowest within-sample RMSE were then selected. area was log-transformed as recommended in the EDA section. A sketch of this screening step appears below.
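The screening code is not shown in the report; a minimal sketch of the idea, assuming each candidate is fit on its non-missing rows and that degenerate columns are simply skipped:

# Fit log(price) ~ x for every candidate column and rank by RMSE in dollars
candidates <- setdiff(names(ames_train), "price")
rmse_by_var <- sapply(candidates, function(v) {
    tryCatch({
        dat <- ames_train[!is.na(ames_train[[v]]), ]
        fit <- lm(as.formula(paste("log(price) ~", v)), data = dat)
        sqrt(mean((dat$price - exp(fitted(fit)))^2))
    }, error = function(e) NA)  # skip ill-behaved columns (e.g., constant factors)
})
sort(rmse_by_var)[1:9]  # the nine strongest single predictors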

# Initial model: nine predictors chosen by single-predictor RMSE screening
fit0 = lm(log(price) ~ Overall.Qual + Neighborhood + Exter.Qual + log(area) +
              Kitchen.Qual + X1st.Flr.SF + Total.Bsmt.SF + Year.Built +
              Year.Remod.Add, data = ames_train)
summary(fit0)
## 
## Call:
## lm(formula = log(price) ~ Overall.Qual + Neighborhood + Exter.Qual + 
##     log(area) + Kitchen.Qual + X1st.Flr.SF + Total.Bsmt.SF + 
##     Year.Built + Year.Remod.Add, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.88567 -0.06723  0.00367  0.07408  0.50651 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          6.696e-01  9.291e-01   0.721 0.471334    
## Overall.Qual         6.659e-02  6.062e-03  10.986  < 2e-16 ***
## NeighborhoodBlueste -4.074e-03  8.624e-02  -0.047 0.962334    
## NeighborhoodBrDale  -1.450e-01  6.829e-02  -2.124 0.034013 *  
## NeighborhoodBrkSide  9.648e-02  5.714e-02   1.688 0.091714 .  
## NeighborhoodClearCr  2.571e-01  6.239e-02   4.121 4.17e-05 ***
## NeighborhoodCollgCr  8.135e-02  4.865e-02   1.672 0.094901 .  
## NeighborhoodCrawfor  2.499e-01  5.686e-02   4.395 1.26e-05 ***
## NeighborhoodEdwards  6.723e-03  5.272e-02   0.128 0.898553    
## NeighborhoodGilbert  7.751e-02  5.183e-02   1.496 0.135141    
## NeighborhoodGreens   2.053e-01  7.751e-02   2.649 0.008230 ** 
## NeighborhoodGrnHill  5.170e-01  9.827e-02   5.261 1.85e-07 ***
## NeighborhoodIDOTRR  -1.895e-02  5.874e-02  -0.323 0.747041    
## NeighborhoodMeadowV -1.580e-01  5.898e-02  -2.679 0.007531 ** 
## NeighborhoodMitchel  1.372e-01  5.184e-02   2.647 0.008283 ** 
## NeighborhoodNAmes    1.170e-01  5.079e-02   2.303 0.021519 *  
## NeighborhoodNoRidge  2.113e-01  5.240e-02   4.033 6.05e-05 ***
## NeighborhoodNPkVill -1.760e-02  7.818e-02  -0.225 0.821932    
## NeighborhoodNridgHt  1.274e-01  5.196e-02   2.453 0.014395 *  
## NeighborhoodNWAmes   1.068e-01  5.238e-02   2.038 0.041863 *  
## NeighborhoodOldTown  2.073e-02  5.728e-02   0.362 0.717533    
## NeighborhoodSawyer   1.166e-01  5.210e-02   2.238 0.025472 *  
## NeighborhoodSawyerW  3.934e-02  5.101e-02   0.771 0.440832    
## NeighborhoodSomerst  9.621e-02  5.000e-02   1.924 0.054715 .  
## NeighborhoodStoneBr  1.294e-01  5.889e-02   2.197 0.028321 *  
## NeighborhoodSWISU    3.428e-02  6.646e-02   0.516 0.606164    
## NeighborhoodTimber   1.330e-01  5.514e-02   2.411 0.016135 *  
## NeighborhoodVeenker  2.332e-01  6.205e-02   3.759 0.000183 ***
## Exter.QualFa        -2.635e-01  6.110e-02  -4.312 1.82e-05 ***
## Exter.QualGd        -1.172e-01  3.844e-02  -3.048 0.002383 ** 
## Exter.QualTA        -1.002e-01  4.159e-02  -2.409 0.016242 *  
## log(area)            4.049e-01  1.988e-02  20.367  < 2e-16 ***
## Kitchen.QualFa      -1.603e-01  4.266e-02  -3.757 0.000185 ***
## Kitchen.QualGd      -6.634e-02  2.812e-02  -2.359 0.018552 *  
## Kitchen.QualPo      -3.257e-01  1.329e-01  -2.451 0.014473 *  
## Kitchen.QualTA      -1.140e-01  3.026e-02  -3.767 0.000177 ***
## X1st.Flr.SF          1.067e-05  2.313e-05   0.461 0.644795    
## Total.Bsmt.SF        1.711e-04  1.938e-05   8.830  < 2e-16 ***
## Year.Built           2.022e-03  3.567e-04   5.669 2.01e-08 ***
## Year.Remod.Add       1.985e-03  3.142e-04   6.317 4.43e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1217 on 794 degrees of freedom
## Multiple R-squared:  0.9032, Adjusted R-squared:  0.8985 
## F-statistic:   190 on 39 and 794 DF,  p-value: < 2.2e-16

The model explains about 90% of the variability in log(price), as its R-squared value is 0.90. Moreover, the F-statistic is highly significant, so the model fits significantly better than an intercept-only model (i.e., one that simply predicts the mean of log(price)).


2.2.2 Section 2.2 Model Selection

The BAS package and stepwise selection under both AIC and BIC (searching in both directions) were run starting from the initial model (fit0). A sketch of the setup appears below, followed by the reports for each selection process:
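The exact calls are not shown in the report; a plausible setup, assuming MASS::stepAIC for the stepwise searches (note that step output always labels its criterion "AIC", even when the BIC penalty k = log(n) is used):

library(MASS)  # for stepAIC

n <- nrow(ames_train)
fit_aic <- stepAIC(fit0, direction = "both")              # AIC penalty: 2 per parameter
fit_bic <- stepAIC(fit0, direction = "both", k = log(n))  # BIC penalty: log(n) per parameter

# Bayesian model averaging over all submodels of the initial formula
ames0.bas <- bas.lm(formula(fit0), data = ames_train,
                    prior = "BIC", modelprior = uniform())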

## [1] "The AIC step fit process:"
## Start:  AIC=-3474.23
## log(price) ~ Overall.Qual + Neighborhood + Exter.Qual + log(area) + 
##     Kitchen.Qual + X1st.Flr.SF + Total.Bsmt.SF + Year.Built + 
##     Year.Remod.Add
## 
##                  Df Sum of Sq    RSS     AIC
## - X1st.Flr.SF     1    0.0031 11.761 -3476.0
## <none>                        11.758 -3474.2
## - Kitchen.Qual    4    0.3606 12.119 -3457.0
## - Exter.Qual      3    0.3524 12.111 -3455.6
## - Year.Built      1    0.4759 12.234 -3443.1
## - Year.Remod.Add  1    0.5910 12.349 -3435.3
## - Total.Bsmt.SF   1    1.1545 12.913 -3398.1
## - Overall.Qual    1    1.7873 13.546 -3358.2
## - Neighborhood   26    4.0544 15.813 -3279.2
## - log(area)       1    6.1430 17.901 -3125.7
## 
## Step:  AIC=-3476.01
## log(price) ~ Overall.Qual + Neighborhood + Exter.Qual + log(area) + 
##     Kitchen.Qual + Total.Bsmt.SF + Year.Built + Year.Remod.Add
## 
##                  Df Sum of Sq    RSS     AIC
## <none>                        11.761 -3476.0
## - Kitchen.Qual    4    0.3631 12.124 -3458.7
## - Exter.Qual      3    0.3573 12.119 -3457.1
## - Year.Built      1    0.4728 12.234 -3445.1
## - Year.Remod.Add  1    0.5977 12.359 -3436.7
## - Overall.Qual    1    1.7843 13.546 -3360.2
## - Total.Bsmt.SF   1    2.2926 14.054 -3329.5
## - Neighborhood   26    4.2457 16.007 -3271.0
## - log(area)       1    7.3586 19.120 -3072.8
## [1] "The BIC step fit process:"
## Start:  AIC=-3285.19
## log(price) ~ Overall.Qual + Neighborhood + Exter.Qual + log(area) + 
##     Kitchen.Qual + X1st.Flr.SF + Total.Bsmt.SF + Year.Built + 
##     Year.Remod.Add
## 
##                  Df Sum of Sq    RSS     AIC
## - X1st.Flr.SF     1    0.0031 11.761 -3291.7
## - Kitchen.Qual    4    0.3606 12.119 -3286.9
## <none>                        11.758 -3285.2
## - Exter.Qual      3    0.3524 12.111 -3280.7
## - Year.Built      1    0.4759 12.234 -3258.8
## - Year.Remod.Add  1    0.5910 12.349 -3251.0
## - Total.Bsmt.SF   1    1.1545 12.913 -3213.8
## - Neighborhood   26    4.0544 15.813 -3213.0
## - Overall.Qual    1    1.7873 13.546 -3173.9
## - log(area)       1    6.1430 17.901 -2941.4
## 
## Step:  AIC=-3291.69
## log(price) ~ Overall.Qual + Neighborhood + Exter.Qual + log(area) + 
##     Kitchen.Qual + Total.Bsmt.SF + Year.Built + Year.Remod.Add
## 
##                  Df Sum of Sq    RSS     AIC
## - Kitchen.Qual    4    0.3631 12.124 -3293.2
## <none>                        11.761 -3291.7
## - Exter.Qual      3    0.3573 12.119 -3286.9
## - Year.Built      1    0.4728 12.234 -3265.5
## - Year.Remod.Add  1    0.5977 12.359 -3257.1
## - Neighborhood   26    4.2457 16.007 -3209.5
## - Overall.Qual    1    1.7843 13.546 -3180.6
## - Total.Bsmt.SF   1    2.2926 14.054 -3149.9
## - log(area)       1    7.3586 19.120 -2893.2
## 
## Step:  AIC=-3293.24
## log(price) ~ Overall.Qual + Neighborhood + Exter.Qual + log(area) + 
##     Total.Bsmt.SF + Year.Built + Year.Remod.Add
## 
##                  Df Sum of Sq    RSS     AIC
## <none>                        12.124 -3293.2
## - Exter.Qual      3    0.4966 12.621 -3279.9
## - Year.Built      1    0.5832 12.708 -3260.8
## - Year.Remod.Add  1    0.9909 13.116 -3234.4
## - Neighborhood   26    4.1792 16.304 -3221.1
## - Overall.Qual    1    2.0232 14.148 -3171.3
## - Total.Bsmt.SF   1    2.4607 14.585 -3145.9
## - log(area)       1    7.6495 19.774 -2892.0


## [1] "Results from the BAS fit:"


The selected variables for each selection process are listed in the code block below:

# Initial Model - log(price) ~ Overall.Qual + Neighborhood + Exter.Qual + log(area) + Kitchen.Qual + X1st.Flr.SF + Total.Bsmt.SF + Year.Built + Year.Remod.Add
# BMA           - log(price) ~ Overall.Qual + Neighborhood + Exter.Qual + log(area) + Kitchen.Qual + Total.Bsmt.SF + Year.Built + Year.Remod.Add
# AIC           - log(price) ~ Overall.Qual + Neighborhood + Exter.Qual + log(area) + Kitchen.Qual + Total.Bsmt.SF + Year.Built + Year.Remod.Add
# BIC           - log(price) ~ Overall.Qual + Neighborhood + Exter.Qual + log(area) + Total.Bsmt.SF + Year.Built + Year.Remod.Add

Both the Bayes Model Average (BMA) and step AIC models retain the same variables: all of the original variables except X1st.Flr.SF. The step BIC model additionally excludes Kitchen.Qual. BIC penalizes model complexity more heavily than AIC (a penalty of log(n) per parameter rather than 2), so the BIC search selected a model with fewer variables. Another interesting point is that although the step AIC and BAS processes arrived at the same set of variables, the coefficients will differ: BMA averages each coefficient across candidate models, weighted by posterior model probabilities, rather than using the single least-squares fit produced by step AIC.

Ultimately, the BMA model will be used moving forward: by averaging over models according to their posterior probabilities, it retains more flexibility for predicting unseen data while still reducing the influence of weakly supported predictors.


2.2.3 Section 2.3 Initial Model Residuals

The residuals vs. fitted plot, Q-Q plot of the standardized residuals, and scale-location plot will be assessed for the initial model. The residuals are examined on the log(price) scale, the scale on which the model was fit.

# BMA predictions on the training data; residuals are on the log(price) scale
pred_train <- predict(ames0.bas, ames_train, estimator = "BMA")
resid_train <- na.omit(log(ames_train$price) - pred_train$fit)
plot_dat <- data.frame(fitted = na.omit(pred_train$fit), resid = resid_train)
ggplot(plot_dat, aes(x = fitted, y = resid)) + 
    geom_point(pch = 21, fill = NA) + 
    geom_smooth(color = "red", se = FALSE, lwd = 0.5) + 
    labs(title = "Residuals vs. Fitted Plot", y = "Residuals", x = "Fitted values") +
    theme_bw() +  # the complete theme is applied first so that it does not...
    theme(plot.title = element_text(hjust = 0.5))  # ...override the centered title

The residuals vs. fitted plot shows no systematic bias and roughly constant variance in the predictions across the range of fitted values.


The Q-Q plot shows that the residuals are approximately normal out to at least two standard deviations from the mean.


The scale-location plot supports homoskedastic model residuals. The curvature at low fitted values reflects the small number of observations at that extreme rather than genuine heteroskedasticity or bias.

Together, the residual plots indicate that the assumptions of the linear regression are reasonably satisfied.


2.2.4 Section 2.4 Initial Model RMSE

rmse_train = sqrt(mean((na.omit(ames_train$price - exp(pred_train$fit)))^2))
paste("The within-sample root-mean-squared error is",format(rmse_train,digits=6), "dollars.")
## [1] "The within-sample root-mean-squared error is 22592.7 dollars."



2.2.5 Section 2.5 Overfitting

Load the Test Data

The one house in the neighborhood "Landmrk" (Landmark) was removed, as no houses from that neighborhood were present in the supplied training data, so its price cannot be predicted. The test data is also restricted to houses sold under normal conditions, matching the training data.

ames_test = ames_test %>% filter(Neighborhood != "Landmrk", Sale.Condition == "Normal")
pred_test = predict(ames0.bas,newdata=ames_test,estimator = "BMA")


resid_test = ames_test$price - exp(pred_test$fit)
rmse_test = sqrt(mean(resid_test^2))
paste("The out-of-sample root-mean-squared error is",format(rmse_test,digits=6),"dollars.")
## [1] "The out-of-sample root-mean-squared error is 23028.5 dollars."

The out-of-sample RMSE is only about 2% higher than the within-sample RMSE. The difference is minor relative to the variability in the residuals, so overfitting the training data is not a concern.


2.3 Part 3 Development of a Final Model

2.3.1 Section 3.1 Final Model

A BAS model was fit with the following 19 terms (11 main effects and 8 interaction terms):

ames.bas = bas.lm(log(price) ~ Overall.Qual + Neighborhood + Exter.Qual +
                      log(area) + Kitchen.Qual + X1st.Flr.SF + Total.Bsmt.SF +
                      Year.Built + Year.Remod.Add + Garage.Cars + BsmtFin.SF.1 +
                      log(area):Overall.Qual:X1st.Flr.SF +
                      Overall.Qual:X1st.Flr.SF +
                      BsmtFin.SF.1:Overall.Qual:X1st.Flr.SF +
                      log(area):Overall.Qual:Year.Built +
                      log(area):Overall.Qual +
                      Garage.Cars:Overall.Qual +
                      log(area):Year.Built +
                      log(area):Garage.Cars,
                  data = ames_train,
                  initprobs = "eplogp",
                  prior = "BIC",
                  modelprior = uniform())


The marginal inclusion probabilities and coefficients plots:
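The plots are not reproduced here. A minimal sketch of how the inclusion-probability and coefficient plots can be generated with BAS's plotting methods (argument choices are illustrative):

# Marginal posterior inclusion probability for each term
plot(ames.bas, which = 4, ask = FALSE)
# Posterior distributions of the model-averaged coefficients
plot(coef(ames.bas), ask = FALSE)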


2.3.2 Section 3.2 Transformation

As indicated in the EDA section, the area variable was log-transformed for this model. The EDA showed that the log(area) was likely a better fit due to exponential features of the area distribution.


2.3.3 Section 3.3 Variable Interaction

Using XGBoost regression as a guide, the variable interactions with the highest information gain according to the XGBoost algorithm were added to the model. Most of the interaction terms involve log(area), and one such interaction, log(area):Overall.Qual, indicates that the interplay between house size and quality is important for predicting log(price). While boosting algorithms determine the predictive ability of explanatory variables under different assumptions, the working assumption here is that at least some of these interactions capture variability that a linear regression can also harness to improve its predictions. Moreover, the BMA process will shrink or eliminate any added terms with low posterior odds of inclusion in the final model (terms that were informative for the XGBoost algorithm, but not for the linear regression).

For more on the XGBoost interaction-term process, see the "2-way, 3-way, and 4-way feature interactions" section of https://cojamalo.github.io/DATA-JHU-Machine-Learning-1/machine-learning.html. In short, the process takes the interaction terms that the gradient boosting algorithm identifies while building its trees and adds the top interactions to the linear regression.

Often, interactions between features carry important information for extracting patterns from the data. Not all learning algorithms account for such interactions, and linear regression does not do so automatically, so the important interactions in this dataset are added explicitly to make them available to the linear regression algorithm.

Added interaction terms: log(area):Overall.Qual:X1st.Flr.SF, Overall.Qual:X1st.Flr.SF, BsmtFin.SF.1:Overall.Qual:X1st.Flr.SF, log(area):Overall.Qual:Year.Built, log(area):Overall.Qual, Garage.Cars:Overall.Qual, log(area):Year.Built, log(area):Garage.Cars. A sketch of how such candidates can be mined from the boosted trees follows.
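The original mining code lives in the linked write-up; below is a minimal sketch of the idea under simplifying assumptions. It boosts on a handful of quantitative predictors (placeholder variable and hyperparameter choices), then scores candidate 2-way interactions by the total gain of child splits occurring directly beneath a split on a different feature; this parent-child proxy is simpler than the multi-way analysis in the write-up.

library(xgboost)

# Boost on a few quantitative predictors (factors would need encoding first)
num_vars <- ames_train %>%
    transmute(log_area = log(area), Overall.Qual, X1st.Flr.SF,
              Year.Built, Garage.Cars, BsmtFin.SF.1)
dtrain <- xgb.DMatrix(as.matrix(num_vars), label = log(ames_train$price))
bst <- xgb.train(params = list(max_depth = 3, eta = 0.1),
                 data = dtrain, nrounds = 200)

# One row per tree node, with the splitting Feature and its gain (Quality)
tree_dt <- as_tibble(xgb.model.dt.tree(model = bst))

# Pair every split with the features of its two child splits
children <- bind_rows(
    tree_dt %>% select(parent = Feature, child_id = Yes),
    tree_dt %>% select(parent = Feature, child_id = No)
)
pairs <- children %>%
    inner_join(tree_dt %>% select(ID, child = Feature, gain = Quality),
               by = c("child_id" = "ID")) %>%
    filter(child != "Leaf", parent != child) %>%
    count(parent, child, wt = gain, sort = TRUE)
head(pairs)  # top candidate interactions to offer to the linear model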


2.3.4 Section 3.4 Variable Selection

The initial set of variables used in the model are the same as used in the initial model section above. These were selected by their low errors (RMSE) in predicting log(price) individually.

A few variables, such as Garage.Cars and BsmtFin.SF.1, were also added because they take part in interaction terms but were not in the initial model's list of variables.

The BAS package, using Bayesian Model Averaging (BMA), was used to manage variable selection. The BMA process shrinks or effectively eliminates coefficients that have a low posterior probability of inclusion in the model. This preserves more information than discarding variables outright while still limiting overfitting by reducing the magnitude of the coefficients of weakly supported variables.


2.4 Part 4 Final Model Assessment

2.4.1 Section 4.1 Final Model Residuals

The residuals vs. fitted plot, Q-Q plot of the standardized residuals, and scale-location plot will be assessed for the final model. The residual plots are kept on the log(price) scale, the scale on which the regression was fit, to confirm that the linear regression was successful.

# BMA predictions from the final model; residuals are on the log(price) scale
pred_train <- predict(ames.bas, ames_train, estimator = "BMA")
resid_train <- na.omit(log(ames_train$price) - pred_train$fit)
plot_dat <- data.frame(fitted = na.omit(pred_train$fit), resid = resid_train)
ggplot(plot_dat, aes(x = fitted, y = resid)) + 
    geom_point(pch = 21, fill = NA) +
    geom_smooth(color = "red", se = FALSE, lwd = 0.5) + 
    labs(title = "Residuals vs. Fitted Plot", y = "Residuals", x = "Fitted values") +
    theme_bw() +  # the complete theme is applied first so that it does not...
    theme(plot.title = element_text(hjust = 0.5))  # ...override the centered title

The residuals vs. fitted plot shows no systematic bias and roughly constant variance in the predictions across the range of fitted values.


The Q-Q plot shows that the residuals are approximately normally distributed out to at least two standard deviations from the mean.


The scale-location plot supports homoskedastic model residuals. The curvature at low fitted values reflects the small number of observations at that extreme rather than genuine heteroskedasticity or bias.

Overall, the final model satisfies the assumptions of a linear regression for the given data.


2.4.2 Section 4.2 Final Model RMSE

rmse_train = sqrt(mean((na.omit(ames_train$price - exp(pred_train$fit)))^2))
paste("The within-sample root-mean-squared error is",format(rmse_train,digits=6),"dollars.")
## [1] "The within-sample root-mean-squared error is 19358.9 dollars."

Note that this within-sample error for the final model is a reduction of about $3,200 from the initial model's result.


ames_test = ames_test %>% filter(Neighborhood != "Landmrk", Sale.Condition == "Normal")
pred_test = predict(ames.bas,newdata=ames_test,estimator = "BMA")


resid_test = ames_test$price - exp(pred_test$fit)
rmse_test = sqrt(mean(resid_test^2))
paste("The out-of-sample root-mean-squared error is",format(rmse_test,digits=6),"dollars.")
## [1] "The out-of-sample root-mean-squared error is 20293.3 dollars."

Both the within-sample and out-of-sample RMSE values are lower than for the initial model, signifying that the additional terms reduced the model's error.

2.4.3 Section 4.3 Final Model Evaluation

## [1] "The dummy regressor test error is 76689.9 dollars."

One should expect the model to predict a house's price accurately to within about ±$20,000. A dummy regressor that simply predicts the median house price of the training data for every house has a prediction error of $76,689.90, so the model represents a 74% reduction in error over that baseline (a sketch of the baseline calculation appears below). The model gives consistent predictions for the majority of houses and performs well for houses not present in the training data. Moreover, the median house price in the training data was $155,500, so this error is about ±12.9% of the typical home price.
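The baseline calculation is not shown in the report; an assumed reconstruction of the calculation as described takes only a couple of lines:

# Predict the training-data median price for every test house
dummy_pred <- median(ames_train$price)
rmse_dummy <- sqrt(mean((ames_test$price - dummy_pred)^2))
paste("The dummy regressor test error is", format(rmse_dummy, digits = 6), "dollars.")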


2.4.4 Section 4.4 Final Model Validation

Load Validation Data


ames_validation = ames_validation %>% filter(Sale.Condition == "Normal")
# The lines below were run once and their cached result is reported;
# uncomment to reproduce.
#pred_valid_se = predict(ames.bas,newdata=ames_validation,estimator = "BMA", se.fit=TRUE)
#resid_valid = ames_validation$price - exp(pred_valid_se$fit)
#rmse_valid = sqrt(mean(resid_valid^2))
paste("The out-of-sample validation root-mean-squared error is", 20469.90 ,"dollars.")
## [1] "The out-of-sample validation root-mean-squared error is 20469.9 dollars."

The RMSE of the model on the validation set is greater than both the training and test error, but not to an extent that questions the model’s ability to predict out-of-sample data.


# The coverage check below was run once and its cached result is reported;
# uncomment to reproduce.
#ci_audience <- confint(pred_valid_se, parm="pred") %>% exp
#cbind(select(ames_validation, price), ci_audience[,]) %>% mutate(inside = ifelse(price >= `2.5%` & price <= `97.5%`,TRUE,FALSE)) %>% summarize(mean(inside))
paste("The frequency of actual house prices within the predicted 95% credible intervals is", 0.948)
## [1] "The frequency of actual house prices within the predicted 95% credible intervals is 0.948"

Using the credible intervals from the validation predictions, 94.8% of the actual house prices fall within the 95% credible intervals for the predicted prices.

The model therefore properly reflects the uncertainty in its predictions: about 5% of the time, the actual house price falls outside the 95% credible interval for its prediction.


2.5 Part 5 Conclusion

Using Bayes Model Averaging, one can predict the house price of Ames, Iowa houses using linear regression to within about ±$20,000 based on features of the house. What neighborhood the house is in, the overall quality of the house, and the overall size of the house are strong predictors for determining house price. Other important attributes of houses are the basement size, kitchen quality, external quality, garage size, and home and remodeling age.

Linear regression models can consistently predict home prices for the majority of homes in this market. However, this model does not take unusual sale conditions into account, so separate models would be needed to predict prices for houses that were not sold under normal conditions.