The Central Limit Theorem is one of the most famous and widely applied theorems in statistics. In layman's terms, it essentially states that the more pieces of something you repeatedly take, the better you can estimate the whole. In more technical terms: with repeated sampling, as the sample size grows, the distribution of the sample mean approaches a normal distribution centered at the population mean, with variability that shrinks as the sample size increases.

Taking a sample of something is taking a part of a whole; in statistics, you take samples by drawing random pieces from a dataset. For the Central Limit Theorem to apply here, these samples should be taken at random with replacement. "With replacement" is the term statisticians use to specify that once a value is drawn from the dataset, it is added back before the next draw. "Without replacement" signifies drawing values from the dataset and not returning them to the pool.

To illustrate the Central Limit Theorem, we also need to sample repeatedly in order to obtain a sampling distribution. We take many samples (with replacement) from a dataset and compute the mean of each sample. We can then arrange these means into a distribution to gain information about the population. We usually call this the "sampling distribution of x bar," denoting a distribution of all of our sample means. This distribution will not follow the exact shape of the population, but it will be centered around the population mean: the expected value of the sample mean equals the population mean, so the mean of the sampling distribution approximates the population mean. Because of the Central Limit Theorem, as we increase the size of each sample, each sample mean will more closely resemble the population mean, producing a distribution with decreasing variability centered around the true population mean.
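As a quick illustration of the "with replacement" distinction described above, here is a minimal sketch (the vector deck and the draw counts are made up for demonstration):

# Hypothetical example: drawing 5 values from a small pool of 6
deck <- c(1, 2, 3, 4, 5, 6)
sample(deck, 5, replace = TRUE)   # values can repeat; the pool is "refilled" after each draw
sample(deck, 5, replace = FALSE)  # values cannot repeat; each draw shrinks the pool
# sample(deck, 10, replace = FALSE) would error: you cannot take a sample
# larger than the population when replace = FALSE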
# Recreating an example from Khan Academy
possible_values <- 1 : 6
pmf <- c(0.4, 0.0, 0.1, 0.1, 0.0, 0.4)
Creating a function to produce a sample mean.
# Parameters:
# possibleValues - individual variable values from the probability distribution
# ppmf - probability values associated with the variable values
# sampleSize - desired size of the sample
# Returns:
# mean of the drawn sample
sampleMean <- function(possibleValues, ppmf, sampleSize) {
  drawn <- sample(possibleValues, sampleSize, replace = TRUE, prob = ppmf)
  return(mean(drawn))
}
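As a quick sanity check (a minimal sketch using only the objects defined above), the theoretical mean of this distribution is 3.5, and a single sample mean should land somewhere near it:

# Theoretical population mean: 0.4*1 + 0.1*3 + 0.1*4 + 0.4*6 = 3.5
sum(possible_values * pmf)
# One sample mean from a sample of size 30; the exact value varies run to run,
# but it should fall near 3.5
sampleMean(possible_values, pmf, 30)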
Creating a function to make a visualization and specify the number of sample means.
# Parameters:
# NumberYBars - number of sample means to be computed
# FUN - the previously declared sample-mean function (its parameters follow)
# possibleValues - individual variable values from the probability distribution
# ppmf - probability values associated with the variable values
# sampleSize - desired size of each sample
# Returns:
# histogram of the specified number of sample means
createVisualization <- function(NumberYBars,
                                FUN = sampleMean,
                                possibleValues,
                                ppmf,
                                sampleSize) {
  means <- replicate(NumberYBars, FUN(possibleValues, ppmf, sampleSize))
  invisible(hist(means, freq = FALSE,
                 main = paste("n =", sampleSize),
                 ylim = c(0, 1),
                 xlim = c(0, 6)))
}
Creating the graphic of histograms featuring three different sample sizes.
par(mfrow = c(2, 2)) # specifying graphic dimensions
plot(possible_values, pmf, type = "h", lty = 2,
xlab = "y", ylab = "p(y) = P(Y = y)", main = "Data Distribution",
ylim = c(0, 1))
points(possible_values, pmf, pch = 19)
createVisualization(10**4, possibleValues = possible_values, ppmf = pmf, sampleSize = 5) #different sample sizes
createVisualization(10**4, possibleValues = possible_values, ppmf = pmf, sampleSize = 10)
createVisualization(10**4, possibleValues = possible_values, ppmf = pmf, sampleSize = 30)
(Figure: the data distribution alongside histograms of 10,000 sample means for n = 5, n = 10, and n = 30.)
From the above, we can see that as the sample size increases, the sample means cluster more and more tightly around the true mean of 3.5.
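We can also quantify that tightening numerically. Here is a minimal sketch: the Central Limit Theorem predicts that the standard deviation of the sample means is roughly the population standard deviation divided by the square root of n.

# Population mean and standard deviation of the original distribution
mu    <- sum(possible_values * pmf)                  # 3.5
sigma <- sqrt(sum((possible_values - mu)^2 * pmf))   # about 2.25
# Empirical spread of 10,000 sample means for each n, versus sigma / sqrt(n)
for (n in c(5, 10, 30)) {
  means <- replicate(10^4, sampleMean(possible_values, pmf, n))
  cat("n =", n, " empirical sd:", round(sd(means), 3),
      " predicted sd:", round(sigma / sqrt(n), 3), "\n")
}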
possible_values <- 1 : 10
pmf <- c(0.05, 0.0, 0.0, 0.0, 0.00, 0.00, 0.0, 0.0, 0.0, 0.95)
My initial thought for trying to break the Central Limit Theorem was to unbalance the probabilities as much as possible: by putting nearly all of the probability on the value 10 and minimal probability on the value 1, I thought the mean would always be estimated to be 10 (essentially forgetting about the 1).
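Before running the simulation, it is worth computing the true mean of this skewed distribution (a quick check using the vectors just defined): it is 9.55, not 10, so the sample means should still converge somewhere below 10.

# True mean of the skewed distribution: 0.05*1 + 0.95*10 = 9.55
sum(possible_values * pmf)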
Creating a new visualization function to modify the x limits and accommodate values 1-10.
createVisualization <- function(NumberYBars,
                                FUN = sampleMean,
                                possibleValues,
                                ppmf,
                                sampleSize) {
  means <- replicate(NumberYBars, FUN(possibleValues, ppmf, sampleSize))
  invisible(hist(means, freq = FALSE,
                 main = paste("n =", sampleSize),
                 ylim = c(0, 1),
                 xlim = c(0, 10)))
}
par(mfrow = c(2, 2)) # specifying graphic dimensions
plot(possible_values, pmf, type = "h", lty = 2,
xlab = "y", ylab = "p(y) = P(Y = y)", main = "Data Distribution",
ylim = c(0, 1))
points(possible_values, pmf, pch = 19)
createVisualization(10**4, possibleValues = possible_values, ppmf = pmf, sampleSize = 5) #different sample sizes
createVisualization(10**4, possibleValues = possible_values, ppmf = pmf, sampleSize = 10)
createVisualization(10**4, possibleValues = possible_values, ppmf = pmf, sampleSize = 30)
(Figure: the skewed data distribution alongside histograms of 10,000 sample means for n = 5, n = 10, and n = 30.)
After looking at the visualizations, we can see that I didn't beat the theorem. Even though most of the probability sits at 10, the tails of the histograms still show that values other than 10 are possible. This is no surprise: the Central Limit Theorem only requires a finite mean and variance, and any distribution over the values 1 through 10 satisfies that, so skewing the probabilities changes the shape of the data but not the theorem's conclusion. As the sample sizes increase, the sample means become less spread out and more concentrated around the true mean of 9.55. While the distributions are all centered around the same area, the accuracy of the estimates we can draw from them increases with the sample size.
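As one last check, the Central Limit Theorem also predicts that the sampling distribution becomes more symmetric and bell-shaped as n grows. Here is a minimal sketch (the skewness helper is the standard third standardized moment, written by hand to avoid extra packages); the skewness of the simulated means should move toward 0 as n increases:

# Hand-rolled helper: sample skewness as the third standardized moment
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
for (n in c(5, 10, 30, 100)) {
  means <- replicate(10^4, sampleMean(possible_values, pmf, n))
  cat("n =", n, " skewness of sample means:", round(skewness(means), 3), "\n")
}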