library(corrplot)
## corrplot 0.84 loaded
library(ggplot2)
library(ggfortify)
library(car)
## Loading required package: carData
sz <- 22
The purpose of this dataset is to determine what factors are most useful in predicting the price of a diamond. Several measurements are made when describing a diamond, and this study aims to see which of those measurements best determine the price.
This dataset has 53940 observations, but for ease of computation we took a random sample of 300. The dataset was provided by a public posting on Kaggle and can be found at www.kaggle.com. We downloaded the CSV file on March 19, 2020. The data presents 9 factors which may potentially be useful in predicting diamond price. Each variable and its explanation are given below (note that 'Con' denotes a continuous variable and 'Cat' a categorical variable):
| Variable | Description |
|---|---|
| Carat (Con) | Weight of the diamond in carats (1 carat = 0.2 g). |
| Cut (Cat) | Quality of the diamond's cut; 5 levels: (Worst) "Fair", "Good", "Very Good", "Premium", and "Ideal" (Best). |
| Color (Cat) | Color of the diamond; 7 levels: (Best) "D", "E", "F", "G", "H", "I", and "J" (Worst). |
| Clarity (Cat) | Clarity of the diamond; 8 levels: (Worst) "I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", and "IF" (Best). |
| Depth (Con) | Total depth percentage: the height of the diamond (bottom tip to table) divided by its average girdle diameter, expressed as a percent. |
| Table (Con) | Width of the diamond's table (the flat top facet) relative to its widest point, expressed as a percent. |
| X (Con) | Measures diamond length in mm |
| Y (Con) | Measures diamond width in mm |
| Z (Con) | Measures diamond depth in mm |
Given our knowledge of the data, we hypothesized that carat, cut, x, y, and z would all have a significant effect on the price of the diamond. Specifically, we predicted that as these measurements increase, the price of the diamond would also increase accordingly.
In our analysis we checked the assumptions necessary to perform multiple linear regression. When we found that the raw data did not meet certain assumptions, we applied the necessary methods to clean the data and meet them. We found problems with homoscedasticity, the model fitting all observations, and multicollinearity. To fix the homoscedasticity assumption we performed a log transformation on price. To check that the model fit all observations, we ran models both with and without the influential points; we noted that the points did not significantly affect the model, and elected to leave them in. Finally, to fix multicollinearity we removed variables that showed strong linear correlation with one another. Once the assumptions were met, we performed multiple linear regression with price as the response, using the remaining variables as predictors.
We concluded our analysis by evaluating which of the variables were the most important in predicting the price of a diamond, and removed the variables that negatively affected the accuracy of our model. After our regression analysis, we determined that the model we created was very effective in predicting the price of a diamond given certain predictors.
Importing data and taking a random sample to reduce computation time
# setting seed to make the sample reproducible
set.seed(12345)
diamonds <- read.csv('diamonds.csv', header = TRUE)
# removing X identifier variable (X was an ID number for each diamond)
diamonds.s <- diamonds[sample(nrow(diamonds), 300), -1]
Initial Exploratory Analysis
# re-arranging columns
diamonds.s <- diamonds.s[,c(7, 1, 2, 3, 4, 5, 6, 8, 9, 10)]
head(diamonds.s)
## price carat cut color clarity depth table x y z
## 10904 4886 1.01 Ideal H SI1 62.1 56 6.38 6.47 3.99
## 33373 827 0.41 Very Good G VS2 61.7 61 4.70 4.74 2.91
## 9986 4704 1.26 Ideal I SI2 59.6 57 7.04 7.01 4.19
## 75 554 0.30 Good H SI1 63.7 57 4.28 4.26 2.72
## 30311 726 0.34 Ideal F VS2 62.1 56 4.48 4.51 2.79
## 26378 15851 2.00 Very Good F SI1 63.3 56 8.00 7.93 5.04
# coercing categorical variables to factors
diamonds.s[, 3] <- as.factor(diamonds.s[, 3])
diamonds.s[, 4] <- as.factor(diamonds.s[, 4])
diamonds.s[, 5] <- as.factor(diamonds.s[, 5])
# checking scatter plot matrix with select variables
plot(diamonds.s[, c(1, 2, 6, 7, 8, 9, 10)])
# initial check for multicollinearity (removing categorical variables for plot)
corrplot(cor(diamonds.s[, c(1, 2, 6, 7, 8, 9, 10)]), type = "upper")
From our exploratory data analysis, we noticed several interesting things. Depth appeared to have a negative correlation with all other variables (including price), while the other variables had positive correlations with price. There is a very strong linear correlation between x, y, z, and carat. We found this intuitive because diamonds are cut proportionally, so x, y, and z must all be related to one another. Knowing this, we address the multicollinearity problem later in our analysis.
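As a numeric complement to the scatterplot matrix and corrplot, the strongest pairwise correlations can also be printed directly. A small sketch, assuming the `diamonds.s` sample created above:

```r
# numeric pairwise correlations among price and the size-related
# predictors (assumes the diamonds.s sample created above)
round(cor(diamonds.s[, c("price", "carat", "x", "y", "z")]), 2)
```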
Linear Model
# creating an initial linear model
diamonds.s.lm <- lm(price ~. , data = diamonds.s)
diamonds.s$residuals <- diamonds.s.lm$residuals
diamonds.s$fitted.values <- diamonds.s.lm$fitted.values
# summary of the linear model
summary(diamonds.s.lm)
##
## Call:
## lm(formula = price ~ ., data = diamonds.s)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2820.5 -473.7 -139.4 299.5 4674.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14672.90 16190.70 0.906 0.365591
## carat 15451.48 614.38 25.150 < 2e-16 ***
## cutGood -315.65 583.28 -0.541 0.588836
## cutIdeal -397.42 588.06 -0.676 0.499720
## cutPremium -317.62 564.37 -0.563 0.574031
## cutVery Good -631.91 570.33 -1.108 0.268845
## colorE -153.04 184.61 -0.829 0.407822
## colorF -372.03 186.80 -1.992 0.047398 *
## colorG -578.61 185.62 -3.117 0.002019 **
## colorH -1015.49 194.40 -5.224 3.46e-07 ***
## colorI -1482.83 213.41 -6.948 2.66e-11 ***
## colorJ -3222.87 325.16 -9.912 < 2e-16 ***
## clarityIF 4481.86 533.52 8.401 2.37e-15 ***
## claritySI1 3204.45 471.75 6.793 6.75e-11 ***
## claritySI2 2282.02 475.90 4.795 2.66e-06 ***
## clarityVS1 3876.49 486.84 7.963 4.42e-14 ***
## clarityVS2 3681.18 474.21 7.763 1.63e-13 ***
## clarityVVS1 4518.31 518.65 8.712 2.81e-16 ***
## clarityVVS2 4043.45 500.81 8.074 2.12e-14 ***
## depth -119.22 252.64 -0.472 0.637380
## table -50.50 31.67 -1.594 0.111975
## x -6453.22 1713.10 -3.767 0.000202 ***
## y 2897.38 1841.70 1.573 0.116815
## z 1487.76 4206.13 0.354 0.723824
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 859.7 on 276 degrees of freedom
## Multiple R-squared: 0.9593, Adjusted R-squared: 0.9559
## F-statistic: 282.7 on 23 and 276 DF, p-value: < 2.2e-16
# carat
ggplot(data = diamonds.s, mapping = aes(x = carat, y = residuals)) +
geom_point() +
theme_bw() +
xlab("Carat") +
ylab("Residuals") +
ggtitle("Residual vs. Carat Plot") +
theme(aspect.ratio = 1) +
theme(plot.title = element_text(hjust = 0.5)) +
geom_smooth(method = "lm", se = FALSE)
# table
ggplot(data = diamonds.s, mapping = aes(x = table, y = residuals)) +
geom_point() +
theme_bw() +
xlab("Table") +
ylab("Residuals") +
ggtitle("Residual vs. Table Plot") +
theme(aspect.ratio = 1) +
theme(plot.title = element_text(hjust = 0.5)) +
geom_smooth(method = "lm", se = FALSE)
# depth
ggplot(data = diamonds.s, mapping = aes(x = depth, y = residuals)) +
geom_point() +
theme_bw() +
xlab("Depth") +
ylab("Residuals") +
ggtitle("Residual vs. Depth Plot") +
theme(aspect.ratio = 1) +
theme(plot.title = element_text(hjust = 0.5)) +
geom_smooth(method = "lm", se = FALSE)
# x
ggplot(data = diamonds.s, mapping = aes(x = x, y = residuals)) +
geom_point() +
theme_bw() +
xlab("X") +
ylab("Residuals") +
ggtitle("Residual vs. X") +
theme(aspect.ratio = 1) +
theme(plot.title = element_text(hjust = 0.5)) +
geom_smooth(method = "lm", se = FALSE)
# y
ggplot(data = diamonds.s, mapping = aes(x = y, y = residuals)) +
geom_point() +
theme_bw() +
xlab("Y") +
ylab("Residuals") +
ggtitle("Residual vs. Y") +
theme(aspect.ratio = 1) +
theme(plot.title = element_text(hjust = 0.5)) +
geom_smooth(method = "lm", se = FALSE)
# z
ggplot(data = diamonds.s, mapping = aes(x = z, y = residuals)) +
geom_point() +
theme_bw() +
xlab("Z") +
ylab("Residuals") +
ggtitle("Residual vs. Z") +
theme(aspect.ratio = 1) +
theme(plot.title = element_text(hjust = 0.5)) +
geom_smooth(method = "lm", se = FALSE)
avPlots(diamonds.s.lm, terms = ~ carat + depth + table + x + y + z)
ggplot(data = diamonds.s, mapping = aes(x = fitted.values, y = residuals)) +
geom_point() +
theme_bw() +
theme(aspect.ratio = 1) +
theme(plot.title = element_text(hjust = 0.5)) +
xlab("Fitted Values") +
ylab("Residuals") +
ggtitle("Residuals vs Fitted Value Plot") +
geom_smooth(method = 'lm', se = FALSE)
After looking at the residual vs. predictor plots, the added-variable (AV) plots, and the residual vs. fitted plot, we found that the data followed a fairly linear trend (none of the plots showed any relationship other than linear), meaning the linearity assumption is met.
We assume the assumption of independent residuals to be met because we took a random sample from a larger dataset.
ggplot(data = diamonds.s, mapping = aes(y = residuals)) +
geom_boxplot() +
ylab("Residuals") +
ggtitle("Residuals Boxplot") +
theme(aspect.ratio = 1) +
theme(plot.title = element_text(hjust = 0.5))
#### Histogram of Residuals
ggplot(data = diamonds.s, mapping = aes(x = residuals)) +
geom_histogram(mapping = aes(y = ..density..), binwidth = 500) +
stat_function(fun = dnorm, color = "red", size = 2,
args = list(mean = mean(diamonds.s$residuals),
sd = sd(diamonds.s$residuals))) +
theme(aspect.ratio = 1)
shapiro.test(diamonds.s.lm$residuals)
##
## Shapiro-Wilk normality test
##
## data: diamonds.s.lm$residuals
## W = 0.89192, p-value = 9.006e-14
Looking at both the boxplot and the histogram of the residuals, it seemed that the residuals were normally distributed around 0. There did seem to be some outliers (which may explain why we failed the Shapiro-Wilk test for normality). Despite this, we felt that the graphical measures were more convincing than the Shapiro-Wilk test, and continued with the analysis assuming that the normality assumption was met.
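A normal Q-Q plot of the residuals gives one more graphical check to weigh against the Shapiro-Wilk result. A sketch, assuming the `diamonds.s.lm` fit above:

```r
# Q-Q plot: points close to the reference line support normality
# (assumes the diamonds.s.lm model fit above)
qqnorm(diamonds.s.lm$residuals, main = "Q-Q Plot of Residuals")
qqline(diamonds.s.lm$residuals, col = "red")
```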
ggplot(data = diamonds.s, mapping = aes(x = fitted.values, y = residuals)) +
geom_point() +
theme_bw() +
theme(aspect.ratio = 1) +
theme(plot.title = element_text(hjust = 0.5)) +
xlab("Fitted Values") +
ylab("Residuals") +
ggtitle("Residuals vs Fitted Value Plot") +
geom_smooth(method = 'lm', se = FALSE)
Immediately after viewing this plot, we knew that we would have a problem with homoscedasticity. We saw that the spread of the residuals grew toward the upper end of the fitted values. We proceeded to do a Box-Cox analysis and a log transform of the response variable (price); the code is shown below, after checking the other assumptions for regression.
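Since `car` is already loaded, its score test for non-constant error variance can supplement the visual check; a small p-value indicates heteroscedasticity. A sketch, assuming the `diamonds.s.lm` fit above:

```r
# score test for non-constant error variance (car package);
# a small p-value indicates heteroscedasticity
ncvTest(diamonds.s.lm)
```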
diamonds.s.dffits <- data.frame("dffits" = dffits(diamonds.s.lm))
diamonds.s.dffits$obs <- 1:length(diamonds.s$price)
ggplot(data = diamonds.s.dffits) +
geom_point(mapping = aes(x = obs, y = abs(dffits))) +
geom_hline(mapping = aes(yintercept = 2 * sqrt(9 / length(obs))),
color = "red", linetype = "dashed") +
theme_bw() +
theme(aspect.ratio = 1)
#diamonds.s.dffits[abs(diamonds.s.dffits$dffits) > 2 * sqrt(9 / 300), ]
# ## Carat
# diamonds.s.dfbetas <- as.data.frame(dfbetas(diamonds.s.lm))
# diamonds.s.dfbetas$obs <- 1:length(diamonds.s$price)
#
# ggplot(data = diamonds.s.dfbetas) +
# geom_point(mapping = aes(x = obs, y = abs(carat))) +
# geom_hline(mapping = aes(yintercept = 1),
# color = "red", linetype = "dashed") + # for n <= 30
# geom_hline(mapping = aes(yintercept = 2 / sqrt(length(obs))),
# color = "red", linetype = "dashed") + # for n > 30
# theme_bw() +
# theme(aspect.ratio = 1)
#
# carat.extreme.dfbetas <- diamonds.s.dfbetas[abs(diamonds.s.dfbetas$carat) > 1, ]
# carat.extreme.dfbetas[order(carat.extreme.dfbetas$carat), ]
#
# ##Depth
# diamonds.s.dfbetas <- as.data.frame(dfbetas(diamonds.s.lm))
# diamonds.s.dfbetas$obs <- 1:length(diamonds.s$price)
#
# ggplot(data = diamonds.s.dfbetas) +
# geom_point(mapping = aes(x = obs, y = abs(depth))) +
# geom_hline(mapping = aes(yintercept = 1),
# color = "red", linetype = "dashed") + # for n <= 30
# geom_hline(mapping = aes(yintercept = 2 / sqrt(length(obs))),
# color = "red", linetype = "dashed") + # for n > 30
# theme_bw() +
# theme(aspect.ratio = 1)
#
# depth.extreme.dfbetas <- diamonds.s.dfbetas[abs(diamonds.s.dfbetas$depth) > 1, ]
# depth.extreme.dfbetas[order(depth.extreme.dfbetas$depth), ]
#
#
# ## Table
# diamonds.s.dfbetas <- as.data.frame(dfbetas(diamonds.s.lm))
# diamonds.s.dfbetas$obs <- 1:length(diamonds.s$price)
#
# ggplot(data = diamonds.s.dfbetas) +
# geom_point(mapping = aes(x = obs, y = abs(table))) +
# geom_hline(mapping = aes(yintercept = 1),
# color = "red", linetype = "dashed") + # for n <= 30
# geom_hline(mapping = aes(yintercept = 2 / sqrt(length(obs))),
# color = "red", linetype = "dashed") + # for n > 30
# theme_bw() +
# theme(aspect.ratio = 1)
#
# table.extreme.dfbetas <- diamonds.s.dfbetas[abs(diamonds.s.dfbetas$table) > 1, ]
# table.extreme.dfbetas[order(table.extreme.dfbetas$table), ]
#
# ## X
# diamonds.s.dfbetas <- as.data.frame(dfbetas(diamonds.s.lm))
# diamonds.s.dfbetas$obs <- 1:length(diamonds.s$price)
#
# ggplot(data = diamonds.s.dfbetas) +
# geom_point(mapping = aes(x = obs, y = abs(x))) +
# geom_hline(mapping = aes(yintercept = 1),
# color = "red", linetype = "dashed") + # for n <= 30
# geom_hline(mapping = aes(yintercept = 2 / sqrt(length(obs))),
# color = "red", linetype = "dashed") + # for n > 30
# theme_bw() +
# theme(aspect.ratio = 1)
#
# x.extreme.dfbetas <- diamonds.s.dfbetas[abs(diamonds.s.dfbetas$x) > 1, ]
# x.extreme.dfbetas[order(x.extreme.dfbetas$x), ]
#
# ## Y
# diamonds.s.dfbetas <- as.data.frame(dfbetas(diamonds.s.lm))
# diamonds.s.dfbetas$obs <- 1:length(diamonds.s$price)
#
# ggplot(data = diamonds.s.dfbetas) +
# geom_point(mapping = aes(x = obs, y = abs(y))) +
# geom_hline(mapping = aes(yintercept = 1),
# color = "red", linetype = "dashed") + # for n <= 30
# geom_hline(mapping = aes(yintercept = 2 / sqrt(length(obs))),
# color = "red", linetype = "dashed") + # for n > 30
# theme_bw() +
# theme(aspect.ratio = 1)
#
# y.extreme.dfbetas <- diamonds.s.dfbetas[abs(diamonds.s.dfbetas$y) > 1, ]
# y.extreme.dfbetas[order(y.extreme.dfbetas$y), ]
#
# ## Z
# diamonds.s.dfbetas <- as.data.frame(dfbetas(diamonds.s.lm))
# diamonds.s.dfbetas$obs <- 1:length(diamonds.s$price)
#
# ggplot(data = diamonds.s.dfbetas) +
# geom_point(mapping = aes(x = obs, y = abs(z))) +
# geom_hline(mapping = aes(yintercept = 1),
# color = "red", linetype = "dashed") + # for n <= 30
# geom_hline(mapping = aes(yintercept = 2 / sqrt(length(obs))),
# color = "red", linetype = "dashed") + # for n > 30
# theme_bw() +
# theme(aspect.ratio = 1)
#
# #identifies any observations with a DFBETA greater
# # than one (as seen on the plot)
# z.extreme.dfbetas <- diamonds.s.dfbetas[abs(diamonds.s.dfbetas$z) > 1, ]
# z.extreme.dfbetas[order(z.extreme.dfbetas$z), ]
After viewing the DFFITS and DFBETAS, we saw quite a few points above the benchmark cutoff, indicating that we may have some influential points. Later in the analysis we fit the model both with and without these points to see whether they are, in fact, influential.
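Cook's distance offers one more influence diagnostic to cross-check the DFFITS and DFBETAS flags; a common rule of thumb flags observations above 4/n. A sketch, assuming the `diamonds.s.lm` fit above:

```r
# Cook's distance for each observation
cd <- cooks.distance(diamonds.s.lm)
# indices of observations above the 4/n rule-of-thumb cutoff
which(cd > 4 / length(cd))
```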
#corrplot(cor(diamonds.s[, c(1, 2, 6, 7, 8, 9, 10)]), type = "upper")
vif(diamonds.s.lm)
## GVIF Df GVIF^(1/(2*Df))
## carat 34.075503 1 5.837423
## cut 2.936517 4 1.144140
## color 1.698535 6 1.045136
## clarity 2.265528 7 1.060155
## depth 43.333817 1 6.582843
## table 1.897128 1 1.377363
## x 1488.410726 1 38.579926
## y 1690.026087 1 41.109927
## z 3293.296083 1 57.387247
mean(vif(diamonds.s.lm))
## [1] 249.4468
Looking at the correlation matrix (the code is repeated here; for the graph please reference the graphic displayed above), many predictor variables appear strongly correlated, particularly x, y, and z (we thought this made sense because each measures the size of the diamond in a slightly different way). On top of the correlation matrix, the VIF values for these variables are far above 10 (up into the thousands), bringing the average VIF up to 249. It was obvious that our data had problems with multicollinearity, and we apply variable-selection methods to fix this below.
bc <- boxCox(diamonds.s$price ~ diamonds.s$carat + diamonds.s$cut + diamonds.s$color + diamonds.s$clarity + diamonds.s$depth + diamonds.s$table + diamonds.s$x + diamonds.s$y + diamonds.s$z)
After viewing the Box-Cox plot, it was apparent that we needed a log transform of our response variable (a lambda value near 0 indicates a log transform). We proceeded to fit a new model with the log of price as the response, and again examined the residual vs. fitted plot to see whether the transformation improved homoscedasticity.
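Rather than reading lambda off the plot, it can be extracted from the object returned by `boxCox`. A sketch, assuming the `bc` object created above:

```r
# lambda maximizing the profile log-likelihood; a value near 0
# supports the log transform (assumes bc from the boxCox call above)
bc$x[which.max(bc$y)]
```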
diamonds.s.transform <- diamonds.s
diamonds.s.transform[,1] <- log(diamonds.s.transform[,1])
diamonds.transform.lm <- lm(price ~. , data = diamonds.s.transform)
diamonds.s.transform$residuals <- diamonds.transform.lm$residuals
diamonds.s.transform$fitted.values <- diamonds.transform.lm$fitted.values
# original plot (no transformation)
ggplot(data = diamonds.s, mapping = aes(x = fitted.values, y = residuals)) +
geom_point() +
theme_bw() +
theme(aspect.ratio = 1) +
theme(plot.title = element_text(hjust = 0.5)) +
xlab("Fitted Values") +
ylab("Residuals") +
ggtitle("Residuals vs Fitted Value Plot") +
geom_smooth(method = 'lm', se = FALSE)
# log transform plot
ggplot(data = diamonds.s.transform, mapping =
aes(x = fitted.values, y = residuals)) +
geom_point() +
theme_bw() +
theme(aspect.ratio = 1) +
theme(plot.title = element_text(hjust = 0.5)) +
xlab("Fitted Values") +
ylab("Residuals") +
ggtitle("Residuals vs Fitted Value Plot (Transformed)") +
geom_smooth(method = 'lm', se = FALSE)
After viewing the difference between the two plots, we saw that the variance drastically improved, and we continued forward with our transformed model, assuming the homoscedasticity assumption to be met.
To address the high correlation among the x, y, z, and clarity variables, we decided it was best to keep only one of them to mitigate the effect of multicollinearity. We fit a model with each variable in turn, then picked the model that performed best. Originally we ran GLMNET for best subsets, but ran into some trouble, so we performed the best-subsets comparison manually with the models shown below.
# removing the residual and fitted values columns from the data frame to make
# variable selection possible
diamonds.s.transform <- diamonds.s.transform[,-c(11,12)]
# the variable named "y" gave us some trouble with GLMNET, so we changed its name to "yvar"
names(diamonds.s.transform) <- c("price", "carat", "cut", "color", "clarity", "depth", "table", "x", "yvar", "z")
# model using clarity
clarity.model <- lm(price ~ carat + color + clarity, data = diamonds.s.transform)
# model using x
x.model <- lm(price ~ carat + color + x, data = diamonds.s.transform)
# model using y
y.model <- lm(price ~ carat + color + yvar, data = diamonds.s.transform)
# model using z
z.model <- lm(price ~ carat + color + z, data = diamonds.s.transform)
summary(clarity.model)
##
## Call:
## lm(formula = price ~ carat + color + clarity, data = diamonds.s.transform)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.88769 -0.21606 0.02794 0.21900 0.64227
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.82554 0.16485 35.339 < 2e-16 ***
## carat 2.15327 0.04373 49.241 < 2e-16 ***
## colorE -0.10947 0.06278 -1.744 0.082288 .
## colorF -0.15886 0.06402 -2.481 0.013663 *
## colorG -0.25995 0.06269 -4.147 4.46e-05 ***
## colorH -0.30292 0.06646 -4.558 7.68e-06 ***
## colorI -0.48620 0.07278 -6.681 1.24e-10 ***
## colorJ -0.61933 0.11218 -5.521 7.60e-08 ***
## clarityIF 0.70190 0.17402 4.033 7.07e-05 ***
## claritySI1 0.38317 0.15551 2.464 0.014332 *
## claritySI2 0.32718 0.15706 2.083 0.038121 *
## clarityVS1 0.58812 0.15951 3.687 0.000272 ***
## clarityVS2 0.55659 0.15532 3.583 0.000399 ***
## clarityVVS1 0.77459 0.16942 4.572 7.21e-06 ***
## clarityVVS2 0.58915 0.16318 3.610 0.000361 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2996 on 285 degrees of freedom
## Multiple R-squared: 0.9118, Adjusted R-squared: 0.9074
## F-statistic: 210.3 on 14 and 285 DF, p-value: < 2.2e-16
#summary(x.model)
#summary(y.model)
#summary(z.model)
Initially we thought the y model would be best because it had the highest F-statistic and R-squared values, but after looking at each correlation matrix individually, we found that the clarity model was the only one to completely eliminate problems with multicollinearity (VIF values are shown below). The differences between the clarity model and the y model in terms of R-squared and F-statistic are quite small (both return very small p-values), so we continued with the model using only the clarity variable. (The no-multicollinearity assumption is now met.)
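As a cross-check on the R-squared and F-statistic comparison, the four candidate models can also be ranked by information criterion (lower is better). A sketch, assuming the four models fit above:

```r
# AIC comparison of the four candidate models fit above;
# the model with the lowest AIC is preferred
AIC(clarity.model, x.model, y.model, z.model)
```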
vif(clarity.model)
## GVIF Df GVIF^(1/(2*Df))
## carat 1.421520 1 1.192275
## color 1.410925 6 1.029103
## clarity 1.446810 7 1.026734
mean(vif(clarity.model)[1])
## [1] 1.42152
To deal with the potential influential points we saw, we created models with and without those outliers, then compared the two to see whether the points were, in fact, influential.
# creating data frame without flagged DFFIT values
diamonds.without <- diamonds.s.transform[-diamonds.s.dffits[abs(diamonds.s.dffits$dffits) > 1, ]$obs, ]
# linear model of "without" data frame
without <- lm(price ~ carat + color + clarity, data = diamonds.without)
# # comparing both models
# clarity.model
# without
confint(clarity.model)
## 2.5 % 97.5 %
## (Intercept) 5.5010701 6.15001540
## carat 2.0671924 2.23933870
## colorE -0.2330361 0.01410093
## colorF -0.2848689 -0.03284801
## colorG -0.3833445 -0.13655647
## colorH -0.4337453 -0.17209647
## colorI -0.6294407 -0.34295106
## colorJ -0.8401462 -0.39851740
## clarityIF 0.3593675 1.04444112
## claritySI1 0.0770757 0.68926138
## claritySI2 0.0180469 0.63631772
## clarityVS1 0.2741475 0.90208483
## clarityVS2 0.2508623 0.86231251
## clarityVVS1 0.4411197 1.10805681
## clarityVVS2 0.2679652 0.91034300
confint(without)
## 2.5 % 97.5 %
## (Intercept) 5.45506855 6.087928243
## carat 2.09511288 2.271379139
## colorE -0.21373388 0.028237853
## colorF -0.25261579 -0.003980257
## colorG -0.35470244 -0.111931865
## colorH -0.41980973 -0.162874053
## colorI -0.62071345 -0.339650926
## colorJ -0.91634683 -0.461484850
## clarityIF 0.35011096 1.024139403
## claritySI1 0.11519730 0.710281696
## claritySI2 0.03126756 0.631919674
## clarityVS1 0.30147997 0.911290316
## clarityVS2 0.27380959 0.867736928
## clarityVVS1 0.42839738 1.080877540
## clarityVVS2 0.30372889 0.928152727
The difference between the two models seemed pretty negligible, so we felt comfortable leaving in the outliers and continuing assuming that the model described all observations.
# creating a model including interactions
clarity.model.int <- lm(price ~ carat + color + clarity + carat:color, data = diamonds.s.transform)
summary(clarity.model.int)
##
## Call:
## lm(formula = price ~ carat + color + clarity + carat:color, data = diamonds.s.transform)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.93926 -0.18482 0.02157 0.21653 0.58589
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.5073657 0.1867901 29.484 < 2e-16 ***
## carat 2.6133016 0.1626080 16.071 < 2e-16 ***
## colorE 0.0848772 0.1357154 0.625 0.532216
## colorF 0.0831458 0.1387289 0.599 0.549432
## colorG 0.0016880 0.1340444 0.013 0.989962
## colorH -0.0292417 0.1490740 -0.196 0.844631
## colorI -0.0007528 0.1511224 -0.005 0.996029
## colorJ 0.8310160 0.3737309 2.224 0.026978 *
## clarityIF 0.7330943 0.1688624 4.341 1.98e-05 ***
## claritySI1 0.4147173 0.1510045 2.746 0.006417 **
## claritySI2 0.3193634 0.1524312 2.095 0.037061 *
## clarityVS1 0.6254214 0.1548306 4.039 6.93e-05 ***
## clarityVS2 0.5879749 0.1507271 3.901 0.000120 ***
## clarityVVS1 0.8027829 0.1642539 4.887 1.73e-06 ***
## clarityVVS2 0.6331765 0.1586149 3.992 8.39e-05 ***
## carat:colorE -0.3093331 0.1909227 -1.620 0.106319
## carat:colorF -0.3935012 0.1860563 -2.115 0.035320 *
## carat:colorG -0.4228722 0.1797223 -2.353 0.019321 *
## carat:colorH -0.4390674 0.1869930 -2.348 0.019571 *
## carat:colorI -0.6454900 0.1819304 -3.548 0.000455 ***
## carat:colorJ -1.2765449 0.2925383 -4.364 1.80e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2893 on 279 degrees of freedom
## Multiple R-squared: 0.9194, Adjusted R-squared: 0.9136
## F-statistic: 159.2 on 20 and 279 DF, p-value: < 2.2e-16
# comparing the two models to see if there is a significant interaction
anova(clarity.model, clarity.model.int)
## Analysis of Variance Table
##
## Model 1: price ~ carat + color + clarity
## Model 2: price ~ carat + color + clarity + carat:color
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 285 25.577
## 2 279 23.355 6 2.2215 4.423 0.0002722 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Looking at the output from the ANOVA test, we found a significant interaction between carat and color. We made note of this and continued using the clarity model including interaction terms.
After manipulating the data to meet linear regression assumptions, we now had a model that fit our data well and felt comfortable using this model to make inferences and predictions.
While the prediction of diamond price may not be very useful for the average person, our findings may still be applicable in certain situations. Suppose a student hopes to predict how much he will need to save in order to buy his fiancée a wedding ring that fits her requirements. Suppose that after careful consideration, this student's fiancée would like a 0.4 carat diamond with a "Very Good" cut, "H" color, and a clarity of "SI2".
#### Prediction and Confidence Intervals
new.data <- data.frame(carat = 0.4, cut = "Very Good", color = "H", clarity = "SI2")
#Prediction Interval
log.prediction <- predict(clarity.model.int, newdata = new.data, interval = "prediction")
(round(exp(log.prediction)[c(2, 1, 3)], 2))
## [1] 435.36 786.18 1419.68
We are 95% confident that a single diamond with the characteristics specified above (Carat: 0.4, Cut: "Very Good", Color: "H", Clarity: "SI2") will cost between 435.36 and 1419.68 USD.
new.data <- data.frame(carat = 0.4, cut = "Very Good", color = "H", clarity = "SI2")
#Confidence Interval
log.confidence <- predict(clarity.model.int, newdata = new.data, interval = "confidence")
round(exp(log.confidence)[c(2, 1, 3)], 2)
## [1] 671.39 786.18 920.58
We are 95% confident that diamonds with the characteristics specified above (Carat:0.4, Cut:“Very Good”, Color:“H”, and Clarity:“SI2”) will on average cost between 671.39 and 920.58 USD.
confint(clarity.model.int)
## 2.5 % 97.5 %
## (Intercept) 5.13966871 5.87506261
## carat 2.29320720 2.93339598
## colorE -0.18227892 0.35203328
## colorF -0.18994248 0.35623414
## colorG -0.26217890 0.26555490
## colorH -0.32269430 0.26421090
## colorI -0.29823773 0.29673215
## colorJ 0.09532565 1.56670639
## clarityIF 0.40068814 1.06550052
## claritySI1 0.11746440 0.71197012
## claritySI2 0.01930209 0.61942466
## clarityVS1 0.32063694 0.93020589
## clarityVS2 0.29126814 0.88468164
## clarityVVS1 0.47944847 1.12611724
## clarityVVS2 0.32094258 0.94541044
## carat:colorE -0.68516508 0.06649888
## carat:colorF -0.75975371 -0.02724877
## carat:colorG -0.77665618 -0.06908826
## carat:colorH -0.80716360 -0.07097116
## carat:colorI -1.00362063 -0.28735934
## carat:colorJ -1.85240754 -0.70068228
The confidence intervals for the coefficients give us additional insight into our model. Because the response is log(price), the carat interval is on the log scale: for a baseline-color ("D") diamond, we are 95% confident that a one-carat increase multiplies the expected price by a factor between e^2.29 ≈ 9.9 and e^2.93 ≈ 18.8, holding all else constant. The intervals are fairly narrow on the log scale, as are most of the intervals displayed, suggesting that we have a good idea of how each variable affects diamond price.
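Because the model's response is log(price), back-transforming a coefficient interval with `exp()` converts it to a multiplicative effect on price; for the carat interval from `confint()` above:

```r
# multiplicative effect on price of a one-carat increase
# (endpoints taken from the carat row of confint() above)
exp(c(2.2932072, 2.9333960))  # roughly 9.91 and 18.79
```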
Looking at our R-squared value, we found that our model fits the data very well. An R-squared of 0.91 indicates that approximately 91% of the variation in the (log) price of a diamond can be explained by our model.
At the end of the day, understanding how a diamond will be priced is not very important to those outside of the diamond business. For those who are in it, it may be valuable to know that color, clarity, and carat are the predictors most important in affecting a diamond's price. Because of the F-statistic we received from our model, we can safely conclude that at least one of these variables is significant in predicting the price of a diamond. We found that carat and clarity both have a significant positive impact on diamond price. We also found a significant interaction between carat and color: the interaction coefficients are negative, meaning the per-carat price increase is smaller for lower color grades.