A Computational Approach to Statistics


If you know a little about programming and probability theory, this article will change the way you think about statistics and computation.

Aside from language constructs from deterministic computing, such as the assignment x:=42, control flow statements such as if and while, and arrays, in probabilistic programming we moreover require the construct of sampling. Sampling in programming is a way of simulating or modeling real-world phenomena whose measurement outcomes are governed by chance, but according to some pattern. This pattern is described by a class of mathematical functions called probability distributions.

Different real-world phenomena are described by different kinds of probability distributions, so when doing probabilistic programming we want to be able to sample from as many kinds of distributions as possible. The average computer, however, has no clue about what happens in the real world, so how do we go about generating random samples from those probability distributions?

Uniform Distribution

Uniform in probability theory refers to situations in which every possible outcome is equally probable. For example, suppose you wake up after a long period of unconsciousness. Intuitively, today can be any of the 7 days of the week with equal probability (one seventh). The value of a die throw is also uniformly distributed (each outcome has probability one sixth). These are both examples of discrete uniform distributions, meaning: the outcomes can be considered isolated events.

If outcomes are not isolated events, but rather occur on a spectrum, then the probability distribution is said to be continuous. If a random variable (call it X) takes on a value from the interval [a,b] of real (decimal) numbers between a and b, all with equal probability, then we say that X is (continuously) uniformly distributed between a and b; in more compact notation: X ~ uniform(a,b).

Most mainstream programming languages offer a random number generator that, when called upon, provides a (pseudo)random decimal number between 0 and 1. Thus, every such value is a random variable X such that X ~ uniform(0,1). This is the distribution that we will use to sample all others.

It is now clear that it is not too much to ask that

x <~ uniform(0,1)

be a basic construct in our language.

Sampling from uniform(a,b), with a and b real numbers, can be done by first sampling from uniform(0,1), then scaling the “length” appropriately by multiplying with b-a, yielding a sample from uniform(0, b-a).

Scaling the [0,1] interval to [0, b-a]

You should at this point convince yourself that all outcomes between 0 and b-a now have equal probability, since all values between 0 and 1 do (and multiplying by a constant is a linear operation). Finally, we translate by adding a to our sample, yielding a value somewhere between a and b.

We therefore let “x <~ uniform(a,b)” encode

x <~ uniform(0,1); 
x := (b-a)*x + a;
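
As a concrete illustration, here is a minimal Python sketch of the same encoding, using the standard library's random.random() as the uniform(0,1) primitive (the helper name uniform_ab is just a placeholder):

import random

def uniform_ab(a, b):
    x = random.random()       # x <~ uniform(0,1)
    return (b - a) * x + a    # x := (b-a)*x + a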

Bernoulli Distribution

Consider the tossing of a coin. Instead of heads and tails, we have a coin yielding 1 or 0. Assume that the coin is “fair”, meaning: an outcome of 1 (heads) is equally likely as an outcome of 0 (tails); both have probability 50%, or 0.5. A random variable X representing the outcome of such a coin toss is said to be distributed according to the Bernoulli distribution with parameter p=0.5, written X ~ bern(0.5). If we have X ~ bern(0.6), then the coin wouldn’t be “fair”, as 1 is a more probable outcome (namely 0.6) than 0 (namely 0.4). In this situation, the probabilities are often written as follows:

Pr{X=1} = 0.6

Pr{X=0} = 0.4

In general, if X ~ bern(p) with p somewhere between 0 and 1, then X takes value 1 with probability p, and value 0 with probability 1-p. Thus, the encoding of a sampling statement x <~ bern(p) in our probabilistic language should function like the classical assignment x:=1 with probability p, and like x:=0 with probability 1-p. This is done as follows:

x <~ uniform(0,1);   // draw sample
if (x < p) { 
  x := 1       // happens with probability p
} else {
  x := 0       // happens with probability 1-p
}

The (true) body of the if statement executes with probability p because the interval [0,p], as a subinterval of [0,1], has length p, and all outcomes of the uniform sample are equally probable.
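
For comparison, a minimal Python sketch of this encoding might look as follows (bern_sample is a hypothetical helper name):

import random

def bern_sample(p):
    u = random.random()        # u <~ uniform(0,1)
    return 1 if u < p else 0   # 1 with probability p, 0 with probability 1-p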

Binomial Distribution

Now imagine being extremely bored and just flipping an unfair coin with p=0.6, tallying how many heads you toss out of 10 tosses. This experiment can be mathematically described as a collection of 10 random variables Xn ~ bern(0.6), where n ranges from 1 to 10. The random variable that counts the number of heads (outcomes of 1) is then

Y = X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10

and its probability distribution of outcomes is described by something called the binomial distribution. The parameters of a binomial distribution are p, which is the probability of success on each trial, and n, the number of Bernoulli trials performed (10 in our coin flip example).

If Y has an outcome k, with k somewhere between 0 and n (inclusive), then k of the trials come up heads with probability p each, and the other n-k come up tails with probability 1-p each. Assuming the coin flips are independent, we must therefore multiply p^k with (1-p)^(n-k). This explains part of the mathematical function describing the probability of Y taking on value k:

Pr{Y=k} = n! / (k!·(n-k)!) · p^k · (1-p)^(n-k)

The part with the factorials (exclamation marks) effectively computes the number of combinations k out of n, reflecting the fact that, say, 5 heads and 5 tails can happen in more than one order, namely the number of combinations 5 out of 10.
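
For example, with n=10 and p=0.6 as above, the probability of tossing exactly 5 heads is

Pr{Y=5} = 10! / (5!·5!) · 0.6^5 · 0.4^5 = 252 · 0.07776 · 0.01024 ≈ 0.2007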

Sampling a binomial distribution can for instance be encoded by repeated (n times) sampling of a Bernoulli trial, and thus, “k <~ binom(p,n)” encodes

i:=0;
k:=0;
while (i < n) {
  b <~ bern(p);
  k := k+b;
  i := i+1;
}

Here, we have to (and may) assume that the variables i and b are not used elsewhere in the program.
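
A minimal Python sketch of this repeated-Bernoulli encoding (binom_sample and bern_sample are hypothetical helper names):

import random

def bern_sample(p):
    return 1 if random.random() < p else 0

def binom_sample(p, n):
    # sum of n independent Bernoulli(p) trials
    k = 0
    for _ in range(n):
        k = k + bern_sample(p)
    return k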

Poisson Distribution

As is the case with binomially distributed random variables, Poisson sampling also always gives a nonnegative integer as a result, i.e., a number from the set {0,1,2,3,…}. Such counts occur in real-world quantities such as the number of children in a family, the number of phone calls received per hour, or the number of cars passing per minute.

The Poisson distribution is completely characterized by a single parameter, let us call it r. A random variable Z whose outcome is distributed according to the Poisson distribution with parameter r, written Z ~ poisson(r), always has, for example, a relatively high probability of taking on the value r itself, and also values close to it.

The probability mass function for a Poisson random variable with parameter r is given by

Pr{Z=k} = (r^k / k!) · e^(-r),   for k = 0, 1, 2, …

Here, e is Euler’s number, a well-known constant from calculus, which has a value of approximately 2.718281828….

So for r=3, for example, we have

  • Pr{Z=0} = e⁻³ ≈ 0.049787
  • Pr{Z=1} = 3e⁻³ ≈ 0.149361
  • Pr{Z=2} = (9/2)e⁻³ ≈ 0.224042
  • Pr{Z=3} = (27/6)e⁻³ ≈ 0.224042
  • Pr{Z=4} = (81/24)e⁻³ ≈ 0.168031

For Poisson sampling in programming, the method of inverse transform sampling can be applied. This method exploits the fact that the cumulative probability of a Poisson distribution can be easily computed by repeatedly adding the individual discrete probabilities.

Plot of the cumulative distribution function for the Poisson distribution with parameter r=3

The random variable Z can be encoded in probabilistic programming by sampling from the uniform(0,1) distribution, and then, starting from k=0, summing the probabilities Pr{Z=k}, each time checking whether the sum has surpassed the randomly drawn value. The probabilities themselves can be updated incrementally, since Pr{Z=k+1} = Pr{Z=k} · r/(k+1).

More specifically, “z <~ poisson(r)” encodes

k := 0;
u <~ uniform(0,1);
p := exp(-r);   // probability Pr{Z=0}
sum := p;
while(sum < u) {
  k := k+1;
  p := p*r/k;   // now p = Pr{Z=k} for the updated k
  sum := sum + p
};
z := k
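
The same inverse transform loop, sketched in Python (poisson_sample is a hypothetical helper name):

import math
import random

def poisson_sample(r):
    u = random.random()      # u <~ uniform(0,1)
    k = 0
    p = math.exp(-r)         # Pr{Z=0}
    total = p                # running cumulative probability
    while total < u:
        k = k + 1
        p = p * r / k        # now p = Pr{Z=k}
        total = total + p
    return k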

A note on sampling from binomial distributions in the previous section: the inverse transform method presented here can be applied there too, by computing the running cumulative distribution. It actually has a slightly lower expected running time (which is good), as we do not always have to draw n uniform samples, but only one (followed by a multiplication-and-summation loop that runs an expected number of times lower than n).

Geometric Distribution

Consider now the following question: how often should I expect to throw a die before a six turns up? To answer this, we think of each independent throw as a Bernoulli trial with p=1/6: the probability of throwing a six. The probability that we need exactly 1 throw is p=1/6. The probability that we need exactly 2 throws is (5/6)·(1/6): first a “not six”, followed by a six. We do not count combinations here, because the order is important.

In general, we have, for a geometrically distributed random variable X ~ geo(p), the following facts:

  • Pr{X=1} = p
  • Pr{X=2} = (1-p)p
  • Pr{X=3} = (1-p)²p
  • Pr{X=4} = (1-p)³p
  • Pr{X=k} = (1-p)^(k-1) · p

This distribution, too, is best sampled by an inverse transform method, as we did with the Poisson distribution. We therefore encode “x <~ geo(p)” as

u <~ uniform(0,1);
x := 1;
q := p;
sum := p;
while(sum < u) {
  x := x+1;
  q := q*(1-p);   // now q = Pr{X=x} for the updated x
  sum := sum + q
}
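
And a Python sketch of this geometric sampler (geo_sample is a hypothetical helper name):

import random

def geo_sample(p):
    u = random.random()      # u <~ uniform(0,1)
    x = 1
    q = p                    # Pr{X=1}
    total = p                # running cumulative probability
    while total < u:
        x = x + 1
        q = q * (1 - p)      # now q = Pr{X=x}
        total = total + q
    return x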

Exponential Distribution

The geometric distribution is of a discrete type; its outcomes are isolated numbers. The exponential distribution is its counterpart for the continuous case, where outcomes occur on a spectrum. Any single point on the continuous real line has length 0, so any single outcome on a spectrum effectively has a (prior) probability of 0. This has the annoying side effect that it is impossible to list all probabilities as in “Pr{X=0.4}=0.12”, but we can describe the probability of measurable sets such as {X<0.4} – the set of outcomes less than 0.4 – using the cumulative distribution function (cdf).

The cumulative distribution function of an exponentially distributed random variable X ~ exprate(r), where r is called the rate parameter, is given by

F(x) = Pr{X ≤ x} = 1 - e^(-r·x),   for x ≥ 0

If the inverse of the cdf, written F⁻¹, can be expressed in closed form, the inverse transform method is extremely powerful. This is because from probability theory we know that, if U ~ uniform(0,1), then

Pr{U ≤ F(x)} = Pr{F⁻¹(U) ≤ x} = F(x)

so that F⁻¹(U) is distributed exactly according to the cdf F.

We therefore encode “x <~ exprate(r)” as

x <~ uniform(0,1);
x := -1/r*log(x)   // = F⁻¹(1-x); valid since 1-x is also uniform(0,1)
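
A Python sketch of the same two lines (exprate_sample is a hypothetical helper name):

import math
import random

def exprate_sample(r):
    u = 1.0 - random.random()   # uniform in (0,1], avoids log(0)
    return -math.log(u) / r     # inverse transform of the exponential cdf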

Normal Distribution

Normally distributed samples occur in many natural phenomena where quantities lie on a spectrum, such as distance, size, and volume in physics, or other “natural” quantities in chemistry, biology, and psychology. A normal probability distribution has two parameters: μ (mu), the mean; and σ² (sigma squared), the variance, which measures how “spread out” the data is around the mean. The square root of the variance is called the standard deviation, and is (conveniently) denoted σ.

The normal distribution (μ,σ²) with percentages of area under the curve

Normal distributions cannot be sampled using an inverse transform method with as much ease as was the case with Poisson distributions. This is because we are, in practice, unable to sum over all possible outcomes of a normally distributed random variable, precisely because it takes values in the continuum. Moreover, the normal cumulative probability distribution function is computationally speaking not (easily) invertible.

One solution could be to use the Central Limit Theorem. This is a famous result in probability theory which states roughly the following: as n grows, a binomial(n,p) distributed variable starts to approximate a normal distribution about the “most likely” outcome of the binomial sample (which is n multiplied by p, by the way). Subtracting n·p, then, and dividing by the standard deviation √(n·p·(1-p)), we obtain a random variable that is approximately normally distributed with mean 0 and variance 1.

It has been shown by mathematicians that a “good” minimum sample size for this approximating binomial trial is about n=10. This means that we would have to draw about 10 uniform(0,1) samples to generate a single artificial normal sample, which seems rather inefficient. Luckily, there is a faster way to do it.

Box-Muller Transform

The Box-Muller transform is a procedure that takes two uniformly drawn samples from [0,1] and produces two samples from normal(0,1) – a normal distribution around mean 0, with variance (and standard deviation) 1. Over many samples, this effectively means that every uniformly drawn value yields one normally distributed value.

The values are obtained by encoding “x <~ normal(mu,var)” as

// draw two uniform samples
u <~ uniform(0,1); v <~ uniform(0,1);
// calculate one normal random sample
x := sqrt(-2*log(u))*cos(2*pi*v);
// use parameters mu and var to scale normal(0,1)
x := sqrt(var)*x+mu

There is in fact another (independent) normal random variable available when replacing the cos by sin. In an implementation, this can be stored away as a buffer for a later normal sample.

We have to (and may) assume here that sqrt (square root), log (natural logarithm), cos, and sin, are functions available for us to use.
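
A Python sketch of the Box-Muller encoding under those same assumptions (normal_sample is a hypothetical helper name; the sin-based partner sample is simply discarded here rather than buffered):

import math
import random

def normal_sample(mu, var):
    u = 1.0 - random.random()   # uniform in (0,1], avoids log(0)
    v = random.random()
    x = math.sqrt(-2.0 * math.log(u)) * math.cos(2.0 * math.pi * v)
    # the independent partner sample would use sin(2*pi*v) instead of cos(2*pi*v)
    return math.sqrt(var) * x + mu   # scale normal(0,1) to normal(mu, var)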

Chi-squared Distribution

The χ² (chi-squared) distribution is not something that occurs naturally in the real world, but is rather used in statistical testing methods, such as Pearson’s χ² test, which tests independence between two sample frequency distributions from real-world observations.

Modeling a χ²-distributed random variable as the sum of the squares of k standard normal – that is, normal with μ=0 and σ²=1 – random variables, the resulting family of distributions has a single parameter k, which is referred to as the number of degrees of freedom.

Sampling from chi-squared distributions as in “s <~ chisquared(k)” now should encode

i := 0;
s := 0;
while (i < k) {
  x <~ normal(0,1);
  s := s + x*x;   // add the square of the standard normal sample
  i := i + 1
}

Seeing as a chi-squared random sample with k degrees of freedom requires k standard normal samples, this encoding requires about k (or k+1) uniform(0,1) samples. In practice, however, chi-squared samples usually involve only a few degrees of freedom.
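
A Python sketch with the squaring made explicit, here using the standard library’s random.gauss(0, 1) in place of our own normal sampler (chisquared_sample is a hypothetical helper name):

import random

def chisquared_sample(k):
    s = 0.0
    for _ in range(k):
        x = random.gauss(0, 1)   # standard normal sample
        s = s + x * x            # add its square
    return s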

The Gamma and Beta Families

The gamma family has 2 parameters α, β (alpha, beta) and generalizes the notion of exponential distributions and chi-squared distributions. Sampling from it is an important requirement for statisticians. It involves an integral, however, whose computation is beyond the scope of this blog:

Γ(α) = ∫₀^∞ x^(α-1) · e^(-x) dx   (the gamma function)

There is also the beta family, which generalizes uniform distributions. If we can sample a gamma(α,θ) distributed random variable X, and a gamma(β,θ) distributed random variable Y, then, as it conveniently turns out, X / (X+Y) has a beta(α,β) distribution.
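
If a gamma sampler is available – Python’s standard library, for instance, offers random.gammavariate(alpha, beta), with beta a scale parameter – this fact gives a direct sketch of a beta sampler (beta_sample is a hypothetical helper name):

import random

def beta_sample(a, b):
    # X ~ gamma(a, theta) and Y ~ gamma(b, theta) give X/(X+Y) ~ beta(a, b)
    x = random.gammavariate(a, 1.0)   # theta = 1; any common scale works
    y = random.gammavariate(b, 1.0)
    return x / (x + y)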