In 1733, Abraham de Moivre presented an approximation to the Binomial distribution. He later (de Moivre, 1756, page 242) appended the derivation of his approximation to the solution of a problem asking for the calculation of an expected value for a particular game. He posed the rhetorical question of how we might show that experimental proportions should be close to their expected values:
From this it follows, that if after taking a great number of Experiments, it should be perceived that the happenings and failings have been nearly in a certain proportion, such as of 2 to 1, it may safely be concluded that the Probabilities of happening or failing at any one time assigned will be very near in that proportion, and that the greater the number of Experiments has been, so much nearer the Truth will the conjectures be that are derived from them.
But suppose it should be said, that notwithstanding the reasonableness of building Conjectures upon Observations, still considering the great Power of Chance, Events might at long run fall out in a different proportion from the real Bent which they have to happen one way or the other; and that supposing for Instance that an Event might as easily happen as not happen, whether after three thousand Experiments it may not be possible it should have happened two thousand times and failed a thousand; and that therefore the Odds against so great a variation from Equality should be assigned, whereby the Mind would be the better disposed in the Conclusions derived from the Experiments.
In answer to this, I’ll take the liberty to say, that this is the hardest Problem that can be proposed on the Subject of Chance, for which reason I have reserved it for the last, but I hope to be forgiven if my Solution is not fitted to the capacity of all Readers; however I shall derive from it some Conclusions that may be of use to every body: in order thereto, I shall here translate a Paper of mine which was printed November 12, 1733, and communicated to some Friends, but never yet made public, reserving to myself the right of enlarging my own Thoughts, as occasion shall require.
De Moivre then stated and proved what is now known as the normal approximation to the Binomial distribution. The approximation itself has subsequently been generalized to give normal approximations for many other distributions. Nevertheless, de Moivre’s elegant method of proof is still worth understanding. This Chapter will explain de Moivre’s approximation, using modern notation.
A Method of approximating the Sum of the Terms of the Binomial expanded into a Series, from whence are deduced some practical Rules to estimate the Degree of Assent which is to be given to Experiments.
Altho’ the Solution of problems of Chance often requires that several Terms of the Binomial be added together, nevertheless in very high Powers the thing appears so laborious, and of so great difficulty, that few people have undertaken that Task; for besides James and Nicolas Bernouilli, two great Mathematicians, I know of no body that has attempted it; in which, tho’ they have shown very great skill, and have the praise that is due to their Industry, yet some things were further required; for what they have done is not so much an Approximation as the determining very wide limits, within which they demonstrated that the Sum of the Terms was contained. Now the method . . .
Pictures of the binomial
Suppose X has a Bin(n, p) distribution. That is,

P{X = k} = (n choose k) p^k (1 - p)^(n - k)   for k = 0, 1, ..., n.
Recall that we can think of X as a sum X_1 + X_2 + ... + X_n of n independent random variables, each with P{X_i = 1} = p and P{X_i = 0} = 1 - p. From this representation it follows that EX = np and var(X) = np(1 - p).
Recall also that Tchebychev’s inequality suggests the distribution should be clustered around np, with a spread determined by the standard deviation, σ_n = √(np(1 - p)). What does the Binomial distribution look like? The plots in the next display, for the Bin(n, 0.4) distribution with n = 20, 50, 100, 150, 200, are typical. Each plot on the left shows bars of height P{X = k} and width 1, centered at k. The maxima occur near n × 0.4 for each plot. As n increases, the spread also increases, reflecting the increase in the standard deviations √(n × 0.4 × 0.6). Each of the shaded regions on the left has area equal to one, because Σ_k P{X = k} = 1 for each n.
The plots on the right represent the distributions of the standardized random variables Z_n = (X - np)/σ_n. The location and scaling effects of the increasing expected values and standard deviations (with p = 0.4 and various n) are now removed. Each plot is shifted to bring the location of the maximum close to 0, and the horizontal scale is multiplied by a factor 1/σ_n. A bar of height σ_n P{X = k} and width 1/σ_n is now centered at (k - np)/σ_n. The plots all have similar shapes. Each shaded region still has area 1.
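The claims behind these pictures are easy to check numerically. Here is a minimal sketch in Python (the helper name binom_pmf is my own, not from the text), confirming that the Bin(100, 0.4) probabilities sum to one and peak near np = 40:

```python
from math import comb

def binom_pmf(n, k, p):
    """P{X = k} for X with a Bin(n, p) distribution."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 100, 0.4
probs = [binom_pmf(n, k, p) for k in range(n + 1)]

# Each left-hand plot has total area one: the sum over k of P{X = k}.
print(sum(probs))
# The peak sits near np = 40.
print(max(range(n + 1), key=lambda k: probs[k]))
```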
De Moivre’s argument
Notice how the standardized plots in the last picture settle down to a symmetric `bell-shaped’ curve. You can understand this effect by looking at the ratio of successive terms:

P{X = k} / P{X = k - 1} = ((n - k + 1) / k) × (p / (1 - p))   for k = 1, 2, ..., n.
As a consequence, P{X = k} ≥ P{X = k - 1} if and only if (n - k + 1)p ≥ k(1 - p), that is, k ≤ (n + 1)p. For fixed n, the probability P{X = k} achieves its largest value at m = ⌊(n + 1)p⌋. The probabilities increase with k for k < (n + 1)p, then decrease for k > (n + 1)p. That explains why each plot on the left has a peak near np.
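The formula ⌊(n + 1)p⌋ for the location of the peak can be checked directly. A small Python sketch of my own (it sticks to cases where (n + 1)p is not an integer, so the maximum is unique):

```python
from math import comb, floor

def binom_pmf(n, k, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Compare the argmax of the probabilities with floor((n + 1) * p).
for n in (20, 50, 100, 150, 200):
    for p in (0.3, 0.4, 0.5):
        mode = max(range(n + 1), key=lambda k: binom_pmf(n, k, p))
        assert mode == floor((n + 1) * p), (n, p, mode)
print("peak at floor((n + 1) p) in every case checked")
```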
Now for the shape. At least for k near np, we get a good approximation to the logarithm of the ratio of successive terms using the Taylor approximation log(1 + x) ≈ x for x near 0. Indeed, writing k = np + t,

log( P{X = k} / P{X = k - 1} ) = log( (n - k + 1)p / (k(1 - p)) ) ≈ -t / (np(1 - p)),

up to terms of smaller order.
By taking a product of successive ratios we get the ratio of the individual Binomial probabilities to their largest term, P{X = m}. On a log scale the product becomes a sum: for k = m + j,

log( P{X = k} / P{X = m} ) = Σ_{i = m+1}^{k} log( P{X = i} / P{X = i - 1} ) ≈ -Σ_{t=1}^{j} t / (np(1 - p)) ≈ -j² / (2np(1 - p)),

so that P{X = k} ≈ P{X = m} exp( -(k - np)² / (2np(1 - p)) ).
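For moderate n the quality of this bell-shaped approximation is already striking. Here is a Python sketch of my own, comparing the exact P{X = k} with P{X = m} exp(-(k - np)²/(2np(1 - p))) for the Bin(100, 0.4) distribution:

```python
from math import comb, exp, floor

n, p = 100, 0.4
q = 1 - p
m = floor((n + 1) * p)   # location of the largest term

def pmf(k):
    return comb(n, k) * p**k * q**(n - k)

# Compare the exact probabilities with the bell-shaped approximation
# for k within a couple of standard deviations of np = 40.
for k in range(32, 49, 4):
    approx = pmf(m) * exp(-(k - n * p)**2 / (2 * n * p * q))
    print(k, round(pmf(k), 5), round(approx, 5))
```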
The largest binomial probability
Using the fact that the probabilities sum to 1, for p = 1/2 de Moivre was able to show that the largest probability P{X = n/2} should decrease like B/√n, for a constant B that he was initially only able to express as an infinite sum. Referring to his calculation of the ratio of the maximum term in the expansion of (1 + 1)^n to the sum of all the terms, he wrote (de Moivre, 1756, page 244)
When I first began that inquiry, I contented myself to determine at large the Value of B, which was done by the addition of some Terms of the above-written Series; but as I perceived that it converged but slowly, and seeing at the same time that what I had done answered my purpose tolerably well, I desisted from proceeding further till my worthy and learned Friend Mr. James Stirling, who had applied himself after me to that inquiry, found that the Quantity B did denote the Square-root of the Circumference of a Circle whose Radius is Unity, so that if that Circumference be called c, the Ratio of the middle Term to the Sum of all the Terms will be expressed by 2/√(nc).
In modern notation, the vital fact discovered by the learned Mr. James Stirling asserts that

n! ≈ √(2π) n^(n + 1/2) e^(-n),
in the sense that the ratio of the two sides tends to 1 (very rapidly) as n goes to infinity. See Feller (1968, pages 52-53) for an elegant, modern derivation of the Stirling formula.
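A quick numerical look at how rapidly that ratio tends to 1, in a Python sketch of my own (the function name stirling is my invention):

```python
from math import e, factorial, pi, sqrt

def stirling(n):
    """Stirling's approximation: sqrt(2 pi) n^(n + 1/2) e^(-n)."""
    return sqrt(2 * pi) * n**(n + 0.5) * e**(-n)

# The ratio n! / stirling(n) decreases toward 1, roughly like 1 + 1/(12 n).
for n in (1, 2, 5, 10, 50):
    print(n, factorial(n) / stirling(n))
```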
By Stirling’s formula, for p = 1/2 and n even,

P{X = n/2} = (n choose n/2) 2^(-n) ≈ √(2/(πn)),

which identifies de Moivre’s constant as B = √(2/π) = 2/√c, with c = 2π the circumference of a circle of radius one.
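The identification B = √(2/π) ≈ 0.798 can also be checked numerically, in another Python sketch of my own:

```python
from math import comb, pi, sqrt

# sqrt(n) * P{X = n/2} should settle down to B = sqrt(2 / pi).
for n in (10, 100, 1000):
    middle = comb(n, n // 2) * 2.0**(-n)
    print(n, middle * sqrt(n), sqrt(2 / pi))
```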
How does one actually perform a normal approximation? Back in the golden days, I would have interpolated from a table of values for the standard normal distribution function

Φ(x) = (1/√(2π)) ∫_{-∞}^{x} e^(-y²/2) dy,

which was found in most statistics texts. For example, if X has a Bin(100, 1/2) distribution, then np = 50 and √(np(1 - p)) = 5, so (with a continuity correction)

P{45 ≤ X ≤ 55} ≈ Φ((55.5 - 50)/5) - Φ((44.5 - 50)/5) = Φ(1.1) - Φ(-1.1) ≈ 0.729.
These days, I would just calculate in R:
> pnorm(55.5, mean = 50, sd = 5) - pnorm(44.5, mean = 50, sd = 5)
[1] 0.7286679
or use another very accurate, built-in approximation:
> pbinom(55, size = 100, prob = 0.5) - pbinom(44, size = 100, prob = 0.5)
[1] 0.728747
At this point, the integral in the definition of Φ(x) is merely a reflection of the Calculus trick of approximating a sum by an integral. Probabilists have taken a leap into abstraction by regarding Φ, or its derivative

φ(y) := e^(-y²/2) / √(2π),

as a way to define a probability distribution.
Definition. A random variable Y is said to have a continuous distribution (on R) with density function f if

P{a ≤ Y ≤ b} = ∫_a^b f(y) dy   for all intervals [a, b] ⊆ R.
Notice that f should be a nonnegative function, for otherwise it might get awkward when calculating P{a ≤ Y ≤ b} over an interval where f takes negative values. Notice also that

1 = P{-∞ < Y < ∞} = ∫_{-∞}^{∞} f(y) dy.

That is, the integral of a density function over the whole real line equals one.
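As a sanity check that φ really integrates to one, here is a crude Riemann sum in Python (my own sketch; the truncation to [-10, 10] loses only a negligible tail):

```python
from math import exp, pi, sqrt

def phi(y):
    """The standard normal density."""
    return exp(-y * y / 2) / sqrt(2 * pi)

# Crude Riemann sum for the integral of phi over [-10, 10]; the tails
# beyond +/- 10 contribute a negligible amount.
h = 0.001
total = h * sum(phi(-10 + i * h) for i in range(20000))
print(total)
```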
I prefer to think of densities as being defined on the whole real line, with values outside the range of the random variable being handled by setting the density function equal to zero in appropriate places. If a range of integration is not indicated explicitly, it can then always be understood as (-∞, ∞), with the zero density killing off unwanted contributions.
Distributions defined by densities have both similarities to and differences from the sort of distributions I have been considering up to this point in Stat 241/541. All the distributions before now were discrete. They were described by a (countable) discrete set of possible values {x_i : i = 1, 2, ...} that could be taken by a random variable X and the probabilities with which X took those values:

P{X = x_i} = p_i,   with each p_i ≥ 0 and Σ_i p_i = 1.
Expectations, variances, and things like Eg(X) for various functions g, could all be calculated by conditioning on the possible values for X.
For a random variable X with a continuous distribution defined by a density f, we have

P{X = x} = ∫_x^x f(y) dy = 0   for every x ∈ R.

We cannot hope to calculate a probability by adding up (an uncountable set of) zeros. Instead, as you will see in Chapter 7, we must pass to a limit and replace sums by integrals when a random variable X has a continuous distribution.
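As a preview of that sums-to-integrals recipe, the following Python sketch (function names my own) approximates P{-1.1 ≤ Y ≤ 1.1} for a standard normal Y by adding up many thin slices of the density, and compares the answer with the closed form through the error function:

```python
from math import erf, exp, pi, sqrt

def phi(y):
    return exp(-y * y / 2) / sqrt(2 * pi)

def Phi(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

a, b = -1.1, 1.1
h = 1e-4
slices = int(round((b - a) / h))
riemann = h * sum(phi(a + i * h) for i in range(slices))
print(riemann)           # adding up many thin slices of the density
print(Phi(b) - Phi(a))   # the same probability as a single integral
```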
The √(2π) appeared in de Moivre’s approximation by way of Stirling’s formula. It is slightly mysterious why it appears in that formula. The reason for both appearances is the fact that the constant

C := ∫_{-∞}^{∞} e^(-y²/2) dy

is exactly equal to √(2π), as I now explain.