This page demonstrates how to use R to understand distribution functions, while introducing some features of R Markdown. The source code of this page can be downloaded from here: distributions.Rmd

Playing with normal distributions

To use mathematical notation within R Markdown documents, I use LaTeX expressions enclosed in dollar ($) characters.

Let’s draw some random samples (\(n=20\)) from the standard normal distribution (\(\mu\) = 0, \(\sigma = 1\)) and plot their histograms:

# Let's take four samples of size 20 from the standard normal distribution 
par(mfrow=c(2,2)) # I will use a 2x2 table to plot their histograms

for(i in 1:4){
  my.sample <- rnorm(20) # I take a random sample of size 20 from the z distribution
  hist(my.sample, 
       main = paste("Sample ", i), # I add a title to each plot (paste concatenates strings)
       xlab = "Range of values",  # I specify the label of the x axis
       ylim = c(0, 9)) # I specify the limits of y values
}

We assume now that the weight of a specific population of adults follows a normal distribution with \(\mu = 60\) Kg and \(\sigma = 10\) Kg. We plot this distribution with R:

my.normal <- function(x) { 
  dnorm(x, mean = 60, sd = 10) 
}

curve(my.normal, xlim = c(20, 100), xlab = "Weight (Kg)", ylab = "Probability density", col = "blue")

Question: What is the probability that a sample randomly drawn from the above population has a weight equal to or lower than 50 Kg?

To answer this question, we can calculate the integral of the above function from \(-\infty\) to \(50\) (although we know that real weights can never take negative values):

area <- integrate(my.normal, -Inf, 50)
print(area)
## 0.1586553 with absolute error < 9.5e-05

Another way to calculate this probability is by using the cumulative distribution function:

curve(
  pnorm(x, mean = 60, sd = 10), 
  xlim = c(20, 100),
  xlab = "Weight (Kg)",
  ylab = "Cumulative probability",
  col = "blue"
)

From this distribution, we can calculate and print the probability as follows:

prob <- pnorm(50, mean = 60, sd = 10)
cat("The probability that a random weight is equal to or lower than 50 is:", prob)
## The probability that a random weight is equal to or lower than 50 is: 0.1586553

We can also do the calculation directly as we write R Markdown text: P(\(x \le 50\)) = 0.1586553.
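In the Rmd source, such an inline value is produced with R Markdown's inline code syntax (an R expression between `` `r `` and a closing backtick), for example:

```markdown
P($x \le 50$) = `r pnorm(50, mean = 60, sd = 10)`
```

When the document is knitted, the expression is replaced by its computed value, so the text always stays in sync with the code.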

We can verify this probability experimentally as follows. We will take a large number of samples from the above distribution and then calculate the ratio of samples whose value is equal to or lower than 50:

samples <- rnorm(100000, mean = 60, sd = 10) # Drawing 100000 samples
subsample <- samples[samples<=50] # Choosing the ones <= 50 
ratio <- length(subsample) / length(samples) # Taking their ratio
cat("The probability that a random weight is equal to or lower than 50 is:", ratio)
## The probability that a random weight is equal to or lower than 50 is: 0.1561

You can see that the value that we experimentally calculated is very close to the theoretical one (and will become closer if we draw a larger number of samples).
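As a quick sketch of this convergence (the seed is an arbitrary choice, used only for reproducibility), we can repeat the estimate for increasing sample sizes and watch it approach the theoretical value:

```r
set.seed(1) # arbitrary seed, only for reproducibility
true.prob <- pnorm(50, mean = 60, sd = 10)
for (n in c(1000, 100000, 1000000)) {
  # mean() of a logical vector gives the proportion of TRUE values directly
  estimate <- mean(rnorm(n, mean = 60, sd = 10) <= 50)
  cat("n =", n, " estimate =", estimate, " error =", abs(estimate - true.prob), "\n")
}
```

Note that `mean(samples <= 50)` is a shorter equivalent of the `length()` ratio computed above.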

Playing with log-normal distributions

Assume now that the time it takes a healthy adult to drink a pint of beer (as fast as possible) follows a log-normal distribution with the following shape (\(\mu = 3\) and \(\sigma = 0.5\), where these parameter values are on a logarithmic scale):

curve(
  dlnorm(x, meanlog = 3, sdlog = 0.5), 
  xlim = c(0, 60),
  xlab = "Drinking time (sec)",
  ylab = "Probability",
  col = "blue"
)

Let’s approximate the mean, median, and standard deviation of the above population by using a very large sample:

sample <- rlnorm(100000, meanlog = 3, sdlog = 0.5)

# Print with the cat function, leaving a line ("\n") in between the different statistics
cat("Mean =", mean(sample), "\n", "Median =", median(sample), "\n", "SD =", sd(sample))
## Mean = 22.74735 
##  Median = 20.07344 
##  SD = 12.13918

Notice that the above values correspond to the mean, median, and standard deviation of the original distribution. The parameters (`meanlog` and `sdlog`) of the log-normal distribution are on the log scale and represent the mean and standard deviation of the normal distribution that we can derive after log-transforming the data:

sample <- rlnorm(100000, meanlog = 3, sdlog = 0.5)
log.sample <- log(sample)
cat("Mean =", mean(log.sample), "\n", "Median =", median(log.sample), "\n", "SD =", sd(log.sample))
## Mean = 2.999603 
##  Median = 3.000965 
##  SD = 0.5003742
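These sample statistics can also be checked against the closed-form moments of the log-normal distribution: mean \(= e^{\mu + \sigma^2/2}\), median \(= e^{\mu}\), and SD \(= e^{\mu + \sigma^2/2}\sqrt{e^{\sigma^2} - 1}\). A quick check with our parameters:

```r
mu <- 3
sigma <- 0.5
theoretical.mean <- exp(mu + sigma^2 / 2)
theoretical.median <- exp(mu)
theoretical.sd <- exp(mu + sigma^2 / 2) * sqrt(exp(sigma^2) - 1)
cat("Mean =", theoretical.mean, "\n", "Median =", theoretical.median, "\n", "SD =", theoretical.sd)
```

These values (approximately 22.76, 20.09, and 12.13) agree closely with the large-sample estimates printed above.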

Let’s produce the histogram of the log-transformed sample and contrast it with the corresponding normal distribution:

hist(log.sample, main = NA, xlab = "Log-transformed values (ln(sec))", xlim = c(0, 6))
par(new=TRUE) # This is to draw a new graph over the previously drawn one!
# I will also draw the target normal curve (the many parameters avoid redrawing labels and axes)
curve(dnorm(x, mean = 3, sd = 0.5), xlim = c(0, 6), col = "red", xlab = NA, ylab = NA, axes = FALSE) 

Sampling distribution of the mean

We now generate and plot the sampling distributions of the mean for the above log-normal distribution, for sample sizes of \(n=15\), \(n=30\), and \(n=100\). In addition to their histograms, I plot the corresponding QQ-plots to visually assess how close each sampling distribution is to the normal distribution.

# I will repeat the sampling process 100000 times (for each sample size)   
sampling.15 <- replicate(100000, mean(rlnorm(15, meanlog = 3, sdlog = 0.5)))
sampling.30 <- replicate(100000, mean(rlnorm(30, meanlog = 3, sdlog = 0.5)))
sampling.100 <- replicate(100000, mean(rlnorm(100, meanlog = 3, sdlog = 0.5)))

par(mfrow=c(2,3)) # I will use a 2x3 table to plot the graphs

#First, the three histograms
hist(sampling.15, main = "n = 15", xlab = "Sample mean (sec)")
hist(sampling.30, main = "n = 30", xlab = "Sample mean (sec)")
hist(sampling.100, main = "n = 100", xlab = "Sample mean (sec)")

#Then, the three QQ-plots 
qqnorm(sampling.15)
qqnorm(sampling.30)
qqnorm(sampling.100)

As expected from the Central Limit Theorem, the larger the size of the sample from which we take the mean, the closer is the resulting sampling distribution to normal.

Note that the QQ-plots contrast each distribution to the standard normal distribution, which explains the range of the theoretical quantiles on the x axis (-4 to 4). You do not need to care about how R chooses which quantiles to plot. It is enough to know that the number of quantiles is selected to match the size of your sample data, e.g., you could infer that it will choose percentiles if \(n=99\). In any case, if your sample follows the theoretical model distribution, the points are expected to be close to the diagonal line, i.e., there is a linear relationship between sample and theoretical (model) quantiles. Clear deviations from this relationship may suggest that the model distribution is not appropriate for your data. Of course, deviations may be hard to judge if your sample size is small (e.g., try to generate the QQ-plot of small random samples drawn from a normal distribution).
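To see the last point for yourself, here is a small sketch (the seed and the sample size of 10 are arbitrary choices) that draws several small normal samples and plots their QQ-plots. Even though the model is exactly right here, the points can wander noticeably around the line:

```r
set.seed(2) # arbitrary seed, only for reproducibility
par(mfrow = c(2, 2)) # a 2x2 table of QQ-plots
for (i in 1:4) {
  small.sample <- rnorm(10) # a small sample from the standard normal distribution
  qqnorm(small.sample, main = paste("Sample", i, "(n = 10)"))
  qqline(small.sample) # the line the points should roughly follow
}
```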

As a final step, we will calculate the standard deviation of the above sampling distributions of the mean:

cat("(n=15) SD =", sd(sampling.15), "sec", 
    "\n", "(n=30) SD =", sd(sampling.30), "sec", 
    "\n", "(n=15) SD =", sd(sampling.100), "sec")
## (n=15) SD = 3.118059 sec 
##  (n=30) SD = 2.20833 sec 
##  (n=100) SD = 1.215153 sec

These standard deviations are widely known as the standard error of the mean (SEM). The SEM decreases as the size of the sample increases.
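We can sketch this relationship with the Central Limit Theorem's prediction, SEM \(= \sigma/\sqrt{n}\), where \(\sigma\) is the population standard deviation (for our log-normal population, \(\sigma \approx 12.13\), as computed from its closed-form SD):

```r
# Population SD of the log-normal distribution (meanlog = 3, sdlog = 0.5)
sigma.pop <- exp(3 + 0.5^2 / 2) * sqrt(exp(0.5^2) - 1)
for (n in c(15, 30, 100)) {
  cat("(n=", n, ") theoretical SEM =", sigma.pop / sqrt(n), "sec", "\n")
}
```

The predicted values (about 3.13, 2.21, and 1.21 sec) match the empirical standard deviations printed above.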

Sampling distribution of the variance

In the previous paragraph, we discussed the sampling distribution of the mean. Here, we examine the sampling distribution of the variance.

Consider again the normal distribution of weights in our earlier population of adults. Below, we construct the sampling distributions of the variance for three different sample sizes (\(n=15\), \(n=30\), and \(n=100\)). We also use QQ-plots to visually assess how far from normal the resulting distributions are:

# I will repeat the sampling process 20000 times (for each sample size)   
sampling.15 <- replicate(20000, var(rnorm(15, mean = 60, sd = 10)))
sampling.30 <- replicate(20000, var(rnorm(30, mean = 60, sd = 10)))
sampling.100 <- replicate(20000, var(rnorm(100, mean = 60, sd = 10)))

par(mfrow=c(2,3)) # I will use a 2x3 table to plot the graphs

#First, the three histograms
hist(sampling.15, main = "n = 15", xlab = "Sample variance (Kg^2)", xlim = c(0, 300))
hist(sampling.30, main = "n = 30", xlab = "Sample variance (Kg^2)", xlim = c(0, 300))
hist(sampling.100, main = "n = 100", xlab = "Sample variance (Kg^2)", xlim = c(0, 300))

#Then, the three QQ-plots 
qqnorm(sampling.15)
qqnorm(sampling.30)
qqnorm(sampling.100)

We learned in class that if a sample of size \(n\) is drawn from the standard normal distribution \(z\), then the sum of squared residuals of the sample follows a \(\chi^2\) (chi-squared) distribution with \(n-1\) degrees of freedom. Since we are interested in variances (rather than sums of squares), we first need to multiply by \(n-1\) to turn the sample variance back into a sum of squares. In addition, since our samples are not standardized (i.e., they are drawn from a non-standard normal distribution), we also need to divide by \(\sigma^2\) to make sure that we apply the chi-squared probability density function to standardized sums of squares:

# This is my probability function to model the sample variance
var.density.fun <- function(x, n, sd){
  # The chi-squared distribution (with n-1 degrees of freedom) models the sum of
  # squared residuals of a sample of size n drawn from a standard normal distribution.
  # Thus, I apply the following transformation to turn my non-standardized variance
  # into a standardized sum of squares
  y <- x*(n-1)/sd^2
  dchisq(y, df = n - 1) 
}

par(mfrow=c(1,3)) # I will use a 1x3 table to plot the graphs

hist(sampling.15, main = "n = 15", xlab = "Sample variance (Kg^2)", xlim = c(0, 300))
par(new=TRUE) # This is to draw the theoretical (chi-squared) model over the first histogram
curve(var.density.fun(x, n = 15, sd = 10), xlim = c(0, 300), col = "red", xlab = NA, ylab = NA, axes = FALSE) 

hist(sampling.30, main = "n = 30", xlab = "Sample variance (Kg^2)", xlim = c(0, 300))
par(new=TRUE) # This is to draw the theoretical (chi-squared) model over the second histogram
curve(var.density.fun(x, n = 30, sd = 10), xlim = c(0, 300), col = "red", xlab = NA, ylab = NA, axes = FALSE) 

hist(sampling.100, main = "n = 100", xlab = "Sample variance (Kg^2)", xlim = c(0, 300))
par(new=TRUE) # This is to draw the theoretical (chi-squared) model over the third histogram
curve(var.density.fun(x, n = 100, sd = 10), xlim = c(0, 300), col = "red", xlab = NA, ylab = NA, axes = FALSE) 
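As a numerical sanity check of this chi-squared model (the threshold of 150 Kg² and the seed are arbitrary choices), we can compare a theoretical probability with the corresponding empirical ratio, using the fact that \((n-1)S^2/\sigma^2\) follows a \(\chi^2\) distribution with \(n-1\) degrees of freedom:

```r
set.seed(3) # arbitrary seed, only for reproducibility
n <- 15
sigma <- 10
sampling <- replicate(20000, var(rnorm(n, mean = 60, sd = sigma)))
# P(S^2 <= 150): standardize the variance and use the chi-squared CDF with n-1 df
p.theory <- pchisq(150 * (n - 1) / sigma^2, df = n - 1)
p.empirical <- mean(sampling <= 150)
cat("Theoretical:", p.theory, " Empirical:", p.empirical)
```

The two probabilities should agree to within sampling noise.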

In the following lectures, we will focus on the sampling distribution of the mean. However, it is good to know how you could also model the sample variance.