library(corrplot)
## corrplot 0.84 loaded
library(ggplot2)
library(ggfortify)
library(car)
## Loading required package: carData
sz <- 22
The purpose of this dataset is to determine what factors are most useful in predicting the price of a diamond. Several measurements are made when describing a diamond, and this study aims to see which of those measurements best determine the price.
This dataset has 53940 observations, but for ease of computation we took a random sample of 300. The dataset was provided by a public posting on Kaggle and can be found at www.kaggle.com. We downloaded the CSV file on March 19, 2020. The data presents 9 factors which may potentially be useful in predicting diamond price. Each variable and its explanation are given below (note that 'Con' denotes a continuous variable and 'Cat' a categorical variable):
| Variable | Description |
|---|---|
| Carat (Con) | Weight of the diamond in carats (1 carat = 0.2 g). |
| Cut (Cat) | Quality of the diamond's cut; 5 levels: (Worst) "Fair", "Good", "Very Good", "Premium", and "Ideal" (Best). |
| Color (Cat) | Color of the diamond; 7 levels: (Best) "D", "E", "F", "G", "H", "I", and "J" (Worst). |
| Clarity (Cat) | Clarity of the diamond; 8 levels: (Worst) "I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", and "IF" (Best). |
| Depth (Con) | Total depth percentage: the height of the diamond (bottom tip to table) divided by its average girdle diameter, expressed as a percent. |
| Table (Con) | Width of the diamond's table (the flat top facet) relative to its widest point, expressed as a percent. |
| X (Con) | Measures diamond length in mm |
| Y (Con) | Measures diamond width in mm |
| Z (Con) | Measures diamond depth in mm |
Given our knowledge of the data, we hypothesized that carat, cut, x, y, and z would all have a significant effect on the price of the diamond. Specifically, we predicted that as these measurements increase, the price of the diamond would also increase accordingly.
In our analysis we checked the assumptions necessary to perform multiple linear regression. When we found that the raw data did not meet certain assumptions, we applied the necessary methods to clean the data and meet them. We found problems with homoscedasticity, the model fitting all observations, and multicollinearity. To fix the homoscedasticity assumption we performed a log transformation on price. To check that the model fit all observations, we ran models both with and without the influential points; we noted that the points did not significantly affect the model, and elected to leave them in. Finally, to fix multicollinearity we removed variables that showed strong linear correlation with one another. Once the assumptions were met, we performed multiple linear regression with price as the response, using the remaining variables as predictors.
We concluded our analysis by evaluating which of the variables were the most important in predicting the price of a diamond, and removed the variables that negatively affected the accuracy of our model. After our regression analysis, we determined that the model we created was very effective in predicting the price of a diamond given certain predictors.
Importing data and taking a random sample to reduce computation time
# setting seed to make the sample reproducible
set.seed(12345)
diamonds <- read.csv('diamonds.csv', header = TRUE)
# removing X identifier variable (X was an ID number for each diamond)
diamonds.s <- diamonds[sample(nrow(diamonds), 300), -1]
Initial Exploratory Analysis
# re-arranging columns
diamonds.s <- diamonds.s[,c(7, 1, 2, 3, 4, 5, 6, 8, 9, 10)]
head(diamonds.s)
## price carat cut color clarity depth table x y z
## 10904 4886 1.01 Ideal H SI1 62.1 56 6.38 6.47 3.99
## 33373 827 0.41 Very Good G VS2 61.7 61 4.70 4.74 2.91
## 9986 4704 1.26 Ideal I SI2 59.6 57 7.04 7.01 4.19
## 75 554 0.30 Good H SI1 63.7 57 4.28 4.26 2.72
## 30311 726 0.34 Ideal F VS2 62.1 56 4.48 4.51 2.79
## 26378 15851 2.00 Very Good F SI1 63.3 56 8.00 7.93 5.04
# coercing categorical variables to factors
diamonds.s[, 3] <- as.factor(diamonds.s[, 3])
diamonds.s[, 4] <- as.factor(diamonds.s[, 4])
diamonds.s[, 5] <- as.factor(diamonds.s[, 5])
# checking scatter plot matrix with select variables
plot(diamonds.s[, c(1, 2, 6, 7, 8, 9, 10)])
# initial check for multicollinearity (removing categorical variables for plot)
corrplot(cor(diamonds.s[, c(1, 2, 6, 7, 8, 9, 10)]), type = "upper")
From our exploratory data analysis, we noticed several interesting things. Depth appeared to have a negative correlation with all other variables (including price), while the other variables had positive correlations with price. There is a very strong linear correlation between x, y, z, and carat. We found this intuitive because diamonds are cut proportionally, so x, y, and z must all be related to one another. Knowing this, we address the multicollinearity problem later in our analysis.
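As a numeric complement to the scatterplot matrix and corrplot, the strongest pairwise correlations can also be printed directly. A small sketch, assuming the `diamonds.s` sample created above:

```r
# numeric pairwise correlations among price and the size-related
# predictors (assumes the diamonds.s sample created above)
round(cor(diamonds.s[, c("price", "carat", "x", "y", "z")]), 2)
```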
Linear Model
# creating an initial linear model
diamonds.s.lm <- lm(price ~. , data = diamonds.s)
diamonds.s$residuals <- diamonds.s.lm$residuals
diamonds.s$fitted.values <- diamonds.s.lm$fitted.values
# summary of the linear model
summary(diamonds.s.lm)
##
## Call:
## lm(formula = price ~ ., data = diamonds.s)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2820.5 -473.7 -139.4 299.5 4674.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14672.90 16190.70 0.906 0.365591
## carat 15451.48 614.38 25.150 < 2e-16 ***
## cutGood -315.65 583.28 -0.541 0.588836
## cutIdeal -397.42 588.06 -0.676 0.499720
## cutPremium -317.62 564.37 -0.563 0.574031
## cutVery Good -631.91 570.33 -1.108 0.268845
## colorE -153.04 184.61 -0.829 0.407822
## colorF -372.03 186.80 -1.992 0.047398 *
## colorG -578.61 185.62 -3.117 0.002019 **
## colorH -1015.49 194.40 -5.224 3.46e-07 ***
## colorI -1482.83 213.41 -6.948 2.66e-11 ***
## colorJ -3222.87 325.16 -9.912 < 2e-16 ***
## clarityIF 4481.86 533.52 8.401 2.37e-15 ***
## claritySI1 3204.45 471.75 6.793 6.75e-11 ***
## claritySI2 2282.02 475.90 4.795 2.66e-06 ***
## clarityVS1 3876.49 486.84 7.963 4.42e-14 ***
## clarityVS2 3681.18 474.21 7.763 1.63e-13 ***
## clarityVVS1 4518.31 518.65 8.712 2.81e-16 ***
## clarityVVS2 4043.45 500.81 8.074 2.12e-14 ***
## depth -119.22 252.64 -0.472 0.637380
## table -50.50 31.67 -1.594 0.111975
## x -6453.22 1713.10 -3.767 0.000202 ***
## y 2897.38 1841.70 1.573 0.116815
## z 1487.76 4206.13 0.354 0.723824
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 859.7 on 276 degrees of freedom
## Multiple R-squared: 0.9593, Adjusted R-squared: 0.9559
## F-statistic: 282.7 on 23 and 276 DF, p-value: < 2.2e-16
# carat
ggplot(data = diamonds.s, mapping = aes(x = carat, y = residuals)) +
geom_point() +
theme_bw() +
xlab("Carat") +
ylab("Residuals") +
ggtitle("Residual vs. Carat Plot") +
theme(aspect.ratio = 1) +
theme(plot.title = element_text(hjust = 0.5)) +
geom_smooth(method = "lm", se = FALSE)
# table
ggplot(data = diamonds.s, mapping = aes(x = table, y = residuals)) +
geom_point() +
theme_bw() +
xlab("Table") +
ylab("Residuals") +
ggtitle("Residual vs. Table Plot") +
theme(aspect.ratio = 1) +
theme(plot.title = element_text(hjust = 0.5)) +
geom_smooth(method = "lm", se = FALSE)
# depth
ggplot(data = diamonds.s, mapping = aes(x = depth, y = residuals)) +
geom_point() +
theme_bw() +
xlab("Depth") +
ylab("Residuals") +
ggtitle("Residual vs. Depth Plot") +
theme(aspect.ratio = 1) +
theme(plot.title = element_text(hjust = 0.5)) +
geom_smooth(method = "lm", se = FALSE)
# x
ggplot(data = diamonds.s, mapping = aes(x = x, y = residuals)) +
geom_point() +
theme_bw() +
xlab("X") +
ylab("Residuals") +
ggtitle("Residual vs. X") +
theme(aspect.ratio = 1) +
theme(plot.title = element_text(hjust = 0.5)) +
geom_smooth(method = "lm", se = FALSE)
# y
ggplot(data = diamonds.s, mapping = aes(x = y, y = residuals)) +
geom_point() +
theme_bw() +
xlab("Y") +
ylab("Residuals") +
ggtitle("Residual vs. Y") +
theme(aspect.ratio = 1) +
theme(plot.title = element_text(hjust = 0.5)) +
geom_smooth(method = "lm", se = FALSE)
# z
ggplot(data = diamonds.s, mapping = aes(x = z, y = residuals)) +
geom_point() +
theme_bw() +
xlab("Z") +
ylab("Residuals") +
ggtitle("Residual vs. Z") +
theme(aspect.ratio = 1) +
theme(plot.title = element_text(hjust = 0.5)) +
geom_smooth(method = "lm", se = FALSE)
avPlots(diamonds.s.lm, terms = ~ carat + depth + table + x + y + z)
ggplot(data = diamonds.s, mapping = aes(x = fitted.values, y = residuals)) +
geom_point() +
theme_bw() +
theme(aspect.ratio = 1) +
theme(plot.title = element_text(hjust = 0.5)) +
xlab("Fitted Values") +
ylab("Residuals") +
ggtitle("Residuals vs Fitted Value Plot") +
geom_smooth(method = 'lm', se = FALSE)
After looking at the residual vs. predictor plots, the added-variable (AV) plots, and the residual vs. fitted plot, we found that the data followed a fairly linear trend (none of the plots showed any relationship other than linear), meaning the linearity assumption is met.
We assume the assumption of independent residuals to be met because we took a random sample from a larger dataset.
ggplot(data = diamonds.s, mapping = aes(y = residuals)) +
geom_boxplot() +
ylab("Residuals") +
ggtitle("Residuals Boxplot") +
theme(aspect.ratio = 1) +
theme(plot.title = element_text(hjust = 0.5))
#### Histogram of Residuals
ggplot(data = diamonds.s, mapping = aes(x = residuals)) +
geom_histogram(mapping = aes(y = ..density..), binwidth = 500) +
stat_function(fun = dnorm, color = "red", size = 2,
args = list(mean = mean(diamonds.s$residuals),
sd = sd(diamonds.s$residuals))) +
theme(aspect.ratio = 1)
shapiro.test(diamonds.s.lm$residuals)
##
## Shapiro-Wilk normality test
##
## data: diamonds.s.lm$residuals
## W = 0.89192, p-value = 9.006e-14
Looking at both the boxplot and the histogram of the residuals, it seemed that the residuals were normally distributed around 0. There did seem to be some outliers (which may explain why we failed the Shapiro-Wilk test for normality). Despite this, we felt that the graphical measures were more convincing than the Shapiro-Wilk test, and continued with the analysis assuming that the normality assumption was met.
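A normal Q-Q plot of the residuals gives one more graphical check to weigh against the Shapiro-Wilk result. A sketch, assuming the `diamonds.s.lm` fit above:

```r
# Q-Q plot: points close to the reference line support normality
# (assumes the diamonds.s.lm model fit above)
qqnorm(diamonds.s.lm$residuals, main = "Q-Q Plot of Residuals")
qqline(diamonds.s.lm$residuals, col = "red")
```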
ggplot(data = diamonds.s, mapping = aes(x = fitted.values, y = residuals)) +
geom_point() +
theme_bw() +
theme(aspect.ratio = 1) +
theme(plot.title = element_text(hjust = 0.5)) +
xlab("Fitted Values") +
ylab("Residuals") +
ggtitle("Residuals vs Fitted Value Plot") +
geom_smooth(method = 'lm', se = FALSE)
Immediately after viewing this plot, we knew that we would have a problem with homoscedasticity. We saw that the spread of the residuals grew toward the upper end of the fitted values. We proceeded to do a Box-Cox analysis and a log transform of the response variable (price); the code is shown below, after checking the other assumptions for regression.
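Since `car` is already loaded, its score test for non-constant error variance can supplement the visual check; a small p-value indicates heteroscedasticity. A sketch, assuming the `diamonds.s.lm` fit above:

```r
# score test for non-constant error variance (car package);
# a small p-value indicates heteroscedasticity
ncvTest(diamonds.s.lm)
```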
diamonds.s.dffits <- data.frame("dffits" = dffits(diamonds.s.lm))
diamonds.s.dffits$obs <- 1:length(diamonds.s$price)
ggplot(data = diamonds.s.dffits) +
geom_point(mapping = aes(x = obs, y = abs(dffits))) +
geom_hline(mapping = aes(yintercept = 2 * sqrt(9 / length(obs))),
color = "red", linetype = "dashed") +
theme_bw() +
theme(aspect.ratio = 1)
#diamonds.s.dffits[abs(diamonds.s.dffits$dffits) > 2 * sqrt(9 / 300), ]
# ## Carat
# diamonds.s.dfbetas <- as.data.frame(dfbetas(diamonds.s.lm))
# diamonds.s.dfbetas$obs <- 1:length(diamonds.s$price)
#
# ggplot(data = diamonds.s.dfbetas) +
# geom_point(mapping = aes(x = obs, y = abs(carat))) +
# geom_hline(mapping = aes(yintercept = 1),
# color = "red", linetype = "dashed") + # for n <= 30
# geom_hline(mapping = aes(yintercept = 2 / sqrt(length(obs))),
# color = "red", linetype = "dashed") + # for n > 30
# theme_bw() +
# theme(aspect.ratio = 1)
#
# carat.extreme.dfbetas <- diamonds.s.dfbetas[abs(diamonds.s.dfbetas$carat) > 1, ]
# carat.extreme.dfbetas[order(carat.extreme.dfbetas$carat), ]
#
# ##Depth
# diamonds.s.dfbetas <- as.data.frame(dfbetas(diamonds.s.lm))
# diamonds.s.dfbetas$obs <- 1:length(diamonds.s$price)
#
# ggplot(data = diamonds.s.dfbetas) +
# geom_point(mapping = aes(x = obs, y = abs(depth))) +
# geom_hline(mapping = aes(yintercept = 1),
# color = "red", linetype = "dashed") + # for n <= 30
# geom_hline(mapping = aes(yintercept = 2 / sqrt(length(obs))),
# color = "red", linetype = "dashed") + # for n > 30
# theme_bw() +
# theme(aspect.ratio = 1)
#
# depth.extreme.dfbetas <- diamonds.s.dfbetas[abs(diamonds.s.dfbetas$depth) > 1, ]
# depth.extreme.dfbetas[order(depth.extreme.dfbetas$depth), ]
#
#
# ## Table
# diamonds.s.dfbetas <- as.data.frame(dfbetas(diamonds.s.lm))
# diamonds.s.dfbetas$obs <- 1:length(diamonds.s$price)
#
# ggplot(data = diamonds.s.dfbetas) +
# geom_point(mapping = aes(x = obs, y = abs(table))) +
# geom_hline(mapping = aes(yintercept = 1),
# color = "red", linetype = "dashed") + # for n <= 30
# geom_hline(mapping = aes(yintercept = 2 / sqrt(length(obs))),
# color = "red", linetype = "dashed") + # for n > 30
# theme_bw() +
# theme(aspect.ratio = 1)
#
# table.extreme.dfbetas <- diamonds.s.dfbetas[abs(diamonds.s.dfbetas$table) > 1, ]
# table.extreme.dfbetas[order(table.extreme.dfbetas$table), ]
#
# ## X
# diamonds.s.dfbetas <- as.data.frame(dfbetas(diamonds.s.lm))
# diamonds.s.dfbetas$obs <- 1:length(diamonds.s$price)
#
# ggplot(data = diamonds.s.dfbetas) +
# geom_point(mapping = aes(x = obs, y = abs(x))) +
# geom_hline(mapping = aes(yintercept = 1),
# color = "red", linetype = "dashed") + # for n <= 30
# geom_hline(mapping = aes(yintercept = 2 / sqrt(length(obs))),
# color = "red", linetype = "dashed") + # for n > 30
# theme_bw() +
# theme(aspect.ratio = 1)
#
# x.extreme.dfbetas <- diamonds.s.dfbetas[abs(diamonds.s.dfbetas$x) > 1, ]
# x.extreme.dfbetas[order(x.extreme.dfbetas$x), ]
#
# ## Y
# diamonds.s.dfbetas <- as.data.frame(dfbetas(diamonds.s.lm))
# diamonds.s.dfbetas$obs <- 1:length(diamonds.s$price)
#
# ggplot(data = diamonds.s.dfbetas) +
# geom_point(mapping = aes(x = obs, y = abs(y))) +
# geom_hline(mapping = aes(yintercept = 1),
# color = "red", linetype = "dashed") + # for n <= 30
# geom_hline(mapping = aes(yintercept = 2 / sqrt(length(obs))),
# color = "red", linetype = "dashed") + # for n > 30
# theme_bw() +
# theme(aspect.ratio = 1)
#
# y.extreme.dfbetas <- diamonds.s.dfbetas[abs(diamonds.s.dfbetas$y) > 1, ]
# y.extreme.dfbetas[order(y.extreme.dfbetas$y), ]
#
# ## Z
# diamonds.s.dfbetas <- as.data.frame(dfbetas(diamonds.s.lm))
# diamonds.s.dfbetas$obs <- 1:length(diamonds.s$price)
#
# ggplot(data = diamonds.s.dfbetas) +
# geom_point(mapping = aes(x = obs, y = abs(z))) +
# geom_hline(mapping = aes(yintercept = 1),
# color = "red", linetype = "dashed") + # for n <= 30
# geom_hline(mapping = aes(yintercept = 2 / sqrt(length(obs))),
# color = "red", linetype = "dashed") + # for n > 30
# theme_bw() +
# theme(aspect.ratio = 1)
#
# #identifies any observations with a DFBETA greater
# # than one (as seen on the plot)
# z.extreme.dfbetas <- diamonds.s.dfbetas[abs(diamonds.s.dfbetas$z) > 1, ]
# z.extreme.dfbetas[order(z.extreme.dfbetas$z), ]
After viewing the DFFITS and DFBETAS, we saw quite a few points above the benchmark cutoff, indicating that we may have some influential points. Later in the analysis we fit the model both with and without these points to see whether they are, in fact, influential.
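Cook's distance offers one more influence diagnostic to cross-check the DFFITS and DFBETAS flags; a common rule of thumb flags observations above 4/n. A sketch, assuming the `diamonds.s.lm` fit above:

```r
# Cook's distance for each observation
cd <- cooks.distance(diamonds.s.lm)
# indices of observations above the 4/n rule-of-thumb cutoff
which(cd > 4 / length(cd))
```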
#corrplot(cor(diamonds.s[, c(1, 2, 6, 7, 8, 9, 10)]), type = "upper")
vif(diamonds.s.lm)
## GVIF Df GVIF^(1/(2*Df))
## carat 34.075503 1 5.837423
## cut 2.936517 4 1.144140
## color 1.698535 6 1.045136
## clarity 2.265528 7 1.060155
## depth 43.333817 1 6.582843
## table 1.897128 1 1.377363
## x 1488.410726 1 38.579926
## y 1690.026087 1 41.109927
## z 3293.296083 1 57.387247
mean(vif(diamonds.s.lm))
## [1] 249.4468
Looking at the correlation matrix (the code is repeated here; for the graph please reference the graphic displayed above), many predictor variables appear strongly correlated, particularly x, y, and z (we thought this made sense because each measures the size of the diamond in a slightly different way). On top of the correlation matrix, the VIF values for these variables are far above 10 (up into the thousands), bringing the average VIF up to 249. It was obvious that our data had problems with multicollinearity, and we apply variable-selection methods to fix this below.
bc <- boxCox(diamonds.s$price ~ diamonds.s$carat + diamonds.s$cut + diamonds.s$color + diamonds.s$clarity + diamonds.s$depth + diamonds.s$table + diamonds.s$x + diamonds.s$y + diamonds.s$z)
After viewing the Box-Cox plot, it was apparent that we needed a log transform of our response variable (a lambda value near 0 indicates a log transform). We proceeded to fit a new model with the log of price as the response, and again examined the residual vs. fitted plot to see whether the transformation improved homoscedasticity.
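Rather than reading lambda off the plot, it can be extracted from the object returned by `boxCox`. A sketch, assuming the `bc` object created above:

```r
# lambda maximizing the profile log-likelihood; a value near 0
# supports the log transform (assumes bc from the boxCox call above)
bc$x[which.max(bc$y)]
```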
diamonds.s.transform <- diamonds.s
diamonds.s.transform[,1] <- log(diamonds.s.transform[,1])
diamonds.transform.lm <- lm(price ~. , data = diamonds.s.transform)
diamonds.s.transform$residuals <- diamonds.transform.lm$residuals
diamonds.s.transform$fitted.values <- diamonds.transform.lm$fitted.values
# original plot (no transformation)
ggplot(data = diamonds.s, mapping = aes(x = fitted.values, y = residuals)) +
geom_point() +
theme_bw() +
theme(aspect.ratio = 1) +
theme(plot.title = element_text(hjust = 0.5)) +
xlab("Fitted Values") +
ylab("Residuals") +
ggtitle("Residuals vs Fitted Value Plot") +
geom_smooth(method = 'lm', se = FALSE)
# log transform plot
ggplot(data = diamonds.s.transform, mapping =
aes(x = fitted.values, y = residuals)) +
geom_point() +
theme_bw() +
theme(aspect.ratio = 1) +
theme(plot.title = element_text(hjust = 0.5)) +
xlab("Fitted Values") +
ylab("Residuals") +
ggtitle("Residuals vs Fitted Value Plot (Transformed)") +
geom_smooth(method = 'lm', se = FALSE)
After viewing the difference between the two plots, we saw that the variance drastically improved, and we continued forward with our transformed model, assuming the homoscedasticity assumption to be met.
To address the high correlation among the x, y, z, and clarity variables, we decided it was best to keep only one of them to mitigate the effect of multicollinearity. We fit a model with each variable in turn, then picked the model that performed best. Originally we ran GLMNET for best subsets, but ran into some trouble, so we performed the best-subsets comparison manually with the models shown below.
# removing the residual and fitted values columns from the data frame to make
# variable selection possible
diamonds.s.transform <- diamonds.s.transform[,-c(11,12)]
# the variable named "y" gave us some trouble with GLMNET, so we changed its name to "yvar"
names(diamonds.s.transform) <- c("price", "carat", "cut", "color", "clarity", "depth", "table", "x", "yvar", "z")
# model using clarity
clarity.model <- lm(price ~ carat + color + clarity, data = diamonds.s.transform)
# model using x
x.model <- lm(price ~ carat + color + x, data = diamonds.s.transform)
# model using y
y.model <- lm(price ~ carat + color + yvar, data = diamonds.s.transform)
# model using z
z.model <- lm(price ~ carat + color + z, data = diamonds.s.transform)
summary(clarity.model)
##
## Call:
## lm(formula = price ~ carat + color + clarity, data = diamonds.s.transform)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.88769 -0.21606 0.02794 0.21900 0.64227
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.82554 0.16485 35.339 < 2e-16 ***
## carat 2.15327 0.04373 49.241 < 2e-16 ***
## colorE -0.10947 0.06278 -1.744 0.082288 .
## colorF -0.15886 0.06402 -2.481 0.013663 *
## colorG -0.25995 0.06269 -4.147 4.46e-05 ***
## colorH -0.30292 0.06646 -4.558 7.68e-06 ***
## colorI -0.48620 0.07278 -6.681 1.24e-10 ***
## colorJ -0.61933 0.11218 -5.521 7.60e-08 ***
## clarityIF 0.70190 0.17402 4.033 7.07e-05 ***
## claritySI1 0.38317 0.15551 2.464 0.014332 *
## claritySI2 0.32718 0.15706 2.083 0.038121 *
## clarityVS1 0.58812 0.15951 3.687 0.000272 ***
## clarityVS2 0.55659 0.15532 3.583 0.000399 ***
## clarityVVS1 0.77459 0.16942 4.572 7.21e-06 ***
## clarityVVS2 0.58915 0.16318 3.610 0.000361 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2996 on 285 degrees of freedom
## Multiple R-squared: 0.9118, Adjusted R-squared: 0.9074
## F-statistic: 210.3 on 14 and 285 DF, p-value: < 2.2e-16
#summary(x.model)
#summary(y.model)
#summary(z.model)
Initially we thought the y model would be best because it had the highest F-statistic and R-squared values, but after looking at each correlation matrix individually, we found that the clarity model was the only one to completely eliminate problems with multicollinearity (VIF values are shown below). The differences between the clarity model and the y model in terms of R-squared and F-statistic are quite small (both return very small p-values), so we continued with the model using only the clarity variable. (The no-multicollinearity assumption is now met.)
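As a cross-check on the R-squared and F-statistic comparison, the four candidate models can also be ranked by information criterion (lower is better). A sketch, assuming the four models fit above:

```r
# AIC comparison of the four candidate models fit above;
# the model with the lowest AIC is preferred
AIC(clarity.model, x.model, y.model, z.model)
```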
vif(clarity.model)
## GVIF Df GVIF^(1/(2*Df))
## carat 1.421520 1 1.192275
## color 1.410925 6 1.029103
## clarity 1.446810 7 1.026734
mean(vif(clarity.model)[1])
## [1] 1.42152
To deal with the potential influential points we saw, we created models with and without those outliers, then compared the two to see whether the points were, in fact, influential.
# creating data frame without flagged DFFIT values
diamonds.without <- diamonds.s.transform[-diamonds.s.dffits[abs(diamonds.s.dffits$dffits) > 1, ]$obs, ]
# linear model of "without" data frame
without <- lm(price ~ carat + color + clarity, data = diamonds.without)
# # comparing both models
# clarity.model
# without
confint(clarity.model)
## 2.5 % 97.5 %
## (Intercept) 5.5010701 6.15001540
## carat 2.0671924 2.23933870
## colorE -0.2330361 0.01410093
## colorF -0.2848689 -0.03284801
## colorG -0.3833445 -0.13655647
## colorH -0.4337453 -0.17209647
## colorI -0.6294407 -0.34295106
## colorJ -0.8401462 -0.39851740
## clarityIF 0.3593675 1.04444112
## claritySI1 0.0770757 0.68926138
## claritySI2 0.0180469 0.63631772
## clarityVS1 0.2741475 0.90208483
## clarityVS2 0.2508623 0.86231251
## clarityVVS1 0.4411197 1.10805681
## clarityVVS2 0.2679652 0.91034300
confint(without)
## 2.5 % 97.5 %
## (Intercept) 5.45506855 6.087928243
## carat 2.09511288 2.271379139
## colorE -0.21373388 0.028237853
## colorF -0.25261579 -0.003980257
## colorG -0.35470244 -0.111931865
## colorH -0.41980973 -0.162874053
## colorI -0.62071345 -0.339650926
## colorJ -0.91634683 -0.461484850
## clarityIF 0.35011096 1.024139403
## claritySI1 0.11519730 0.710281696
## claritySI2 0.03126756 0.631919674
## clarityVS1 0.30147997 0.911290316
## clarityVS2 0.27380959 0.867736928
## clarityVVS1 0.42839738 1.080877540
## clarityVVS2 0.30372889 0.928152727
The difference between the two models seemed pretty negligible, so we felt comfortable leaving in the outliers and continuing assuming that the model described all observations.
# creating a model including interactions
clarity.model.int <- lm(price ~ carat + color + clarity + carat:color, data = diamonds.s.transform)
summary(clarity.model.int)
##
## Call:
## lm(formula = price ~ carat + color + clarity + carat:color, data = diamonds.s.transform)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.93926 -0.18482 0.02157 0.21653 0.58589
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.5073657 0.1867901 29.484 < 2e-16 ***
## carat 2.6133016 0.1626080 16.071 < 2e-16 ***
## colorE 0.0848772 0.1357154 0.625 0.532216
## colorF 0.0831458 0.1387289 0.599 0.549432
## colorG 0.0016880 0.1340444 0.013 0.989962
## colorH -0.0292417 0.1490740 -0.196 0.844631
## colorI -0.0007528 0.1511224 -0.005 0.996029
## colorJ 0.8310160 0.3737309 2.224 0.026978 *
## clarityIF 0.7330943 0.1688624 4.341 1.98e-05 ***
## claritySI1 0.4147173 0.1510045 2.746 0.006417 **
## claritySI2 0.3193634 0.1524312 2.095 0.037061 *
## clarityVS1 0.6254214 0.1548306 4.039 6.93e-05 ***
## clarityVS2 0.5879749 0.1507271 3.901 0.000120 ***
## clarityVVS1 0.8027829 0.1642539 4.887 1.73e-06 ***
## clarityVVS2 0.6331765 0.1586149 3.992 8.39e-05 ***
## carat:colorE -0.3093331 0.1909227 -1.620 0.106319
## carat:colorF -0.3935012 0.1860563 -2.115 0.035320 *
## carat:colorG -0.4228722 0.1797223 -2.353 0.019321 *
## carat:colorH -0.4390674 0.1869930 -2.348 0.019571 *
## carat:colorI -0.6454900 0.1819304 -3.548 0.000455 ***
## carat:colorJ -1.2765449 0.2925383 -4.364 1.80e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2893 on 279 degrees of freedom
## Multiple R-squared: 0.9194, Adjusted R-squared: 0.9136
## F-statistic: 159.2 on 20 and 279 DF, p-value: < 2.2e-16
# comparing the two models to see if there is a significant interaction
anova(clarity.model, clarity.model.int)
## Analysis of Variance Table
##
## Model 1: price ~ carat + color + clarity
## Model 2: price ~ carat + color + clarity + carat:color
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 285 25.577
## 2 279 23.355 6 2.2215 4.423 0.0002722 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Looking at the output from the ANOVA test, we found a significant interaction between carat and color. We made note of this and continued using the clarity model including interaction terms.
After manipulating the data to meet linear regression assumptions, we now had a model that fit our data well and felt comfortable using this model to make inferences and predictions.
While the prediction of diamond price may not be very useful for the average person, our findings may still be applicable in certain situations. Suppose a student hopes to predict how much he will need to save in order to buy his fiancée a wedding ring that fits her requirements. Suppose that after careful consideration, this student's fiancée would like a 0.4 carat diamond with a "Very Good" cut, "H" color, and a clarity of "SI2".
#### Prediction and Confidence Intervals
new.data <- data.frame(carat = 0.4, cut = "Very Good", color = "H", clarity = "SI2")
#Prediction Interval
log.prediction <- predict(clarity.model.int, newdata = new.data, interval = "prediction")
(round(exp(log.prediction)[c(2, 1, 3)], 2))
## [1] 435.36 786.18 1419.68
We are 95% confident that a single diamond with the characteristics specified above (Carat: 0.4, Cut: "Very Good", Color: "H", Clarity: "SI2") will cost between 435.36 and 1419.68 USD.
new.data <- data.frame(carat = 0.4, cut = "Very Good", color = "H", clarity = "SI2")
#Confidence Interval
log.confidence <- predict(clarity.model.int, newdata = new.data, interval = "confidence")
round(exp(log.confidence)[c(2, 1, 3)], 2)
## [1] 671.39 786.18 920.58
We are 95% confident that diamonds with the characteristics specified above (Carat:0.4, Cut:“Very Good”, Color:“H”, and Clarity:“SI2”) will on average cost between 671.39 and 920.58 USD.
confint(clarity.model.int)
## 2.5 % 97.5 %
## (Intercept) 5.13966871 5.87506261
## carat 2.29320720 2.93339598
## colorE -0.18227892 0.35203328
## colorF -0.18994248 0.35623414
## colorG -0.26217890 0.26555490
## colorH -0.32269430 0.26421090
## colorI -0.29823773 0.29673215
## colorJ 0.09532565 1.56670639
## clarityIF 0.40068814 1.06550052
## claritySI1 0.11746440 0.71197012
## claritySI2 0.01930209 0.61942466
## clarityVS1 0.32063694 0.93020589
## clarityVS2 0.29126814 0.88468164
## clarityVVS1 0.47944847 1.12611724
## clarityVVS2 0.32094258 0.94541044
## carat:colorE -0.68516508 0.06649888
## carat:colorF -0.75975371 -0.02724877
## carat:colorG -0.77665618 -0.06908826
## carat:colorH -0.80716360 -0.07097116
## carat:colorI -1.00362063 -0.28735934
## carat:colorJ -1.85240754 -0.70068228
The confidence intervals for the coefficients give us additional insight into our model. Because the response is log(price), the carat interval is on the log scale: for a baseline-color ("D") diamond, we are 95% confident that a one-carat increase multiplies the expected price by a factor between e^2.29 ≈ 9.9 and e^2.93 ≈ 18.8, holding all else constant. The intervals are fairly narrow on the log scale, as are most of the intervals displayed, suggesting that we have a good idea of how each variable affects diamond price.
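Because the model's response is log(price), back-transforming a coefficient interval with `exp()` converts it to a multiplicative effect on price; for the carat interval from `confint()` above:

```r
# multiplicative effect on price of a one-carat increase
# (endpoints taken from the carat row of confint() above)
exp(c(2.2932072, 2.9333960))  # roughly 9.91 and 18.79
```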
Looking at our R-squared value, we found that our model fits the data very well. An R-squared of 0.91 indicates that approximately 91% of the variation in the (log) price of a diamond can be explained by our model.
At the end of the day, understanding how a diamond will be priced is not very important to those outside of the diamond business. For those who are in it, it may be valuable to know that color, clarity, and carat are the predictors most important in affecting a diamond's price. Because of the F-statistic we received from our model, we can safely conclude that at least one of these variables is significant in predicting the price of a diamond. We found that carat and clarity both have a significant positive impact on diamond price. We also found a significant interaction between carat and color: the interaction coefficients are negative, meaning the per-carat price increase is smaller for lower color grades.