We can use a binomial distribution to study a situation in which we have multiple independent trials, each with the same two possible outcomes, where the probability of each respective outcome is the same for every trial.
A generalization of the binomial distribution is the multinomial distribution. Like the binomial distribution, the multinomial distribution considers multiple independent trials with the probabilities of respective outcomes the same for each trial.
However, the multinomial distribution gives the probability of different outcomes when we have more than two possible outcomes for each trial. This is useful, for example, in proving the distribution of order statistics, where we take the different trials to be the sample data and the outcomes to be the three intervals in the real line in which these data can fall.
Suppose that we have $n$ trials and $k$ mutually exclusive outcomes with probabilities $\theta_1, \theta_2, \ldots, \theta_k$. We will let $f(x_1, x_2, \ldots, x_k)$ be the probability of having $x_i$ outcomes of each corresponding type, for $1 \le i \le k$. Obviously, we must have $x_1 + x_2 + \cdots + x_k = n$. To compute $f(x_1, x_2, \ldots, x_k)$, we first note that the probability of getting these numbers of outcomes in some particular order is
$$\theta_1^{x_1}\theta_2^{x_2}\cdots\theta_k^{x_k}.$$
We now compute the number of orders in which our combination of numbers of outcomes is attainable. The $x_1$ outcomes of the first type can be chosen in $\binom{n}{x_1}$ ways, the $x_2$ outcomes of the second type can be chosen in $\binom{n-x_1}{x_2}$ ways, and so on up to the $x_k$ outcomes of type $k$, which can be chosen in $\binom{n-x_1-\cdots-x_{k-1}}{x_k}$ ways. The total number of orderings is therefore
$$\binom{n}{x_1}\binom{n-x_1}{x_2}\cdots\binom{n-x_1-\cdots-x_{k-1}}{x_k} \;=\; \frac{n!}{x_1!\,(n-x_1)!}\cdot\frac{(n-x_1)!}{x_2!\,(n-x_1-x_2)!}\cdots\frac{(n-x_1-\cdots-x_{k-1})!}{x_k!\,0!}.$$
The product telescopes and we are left with
$$\frac{n!}{x_1!\,x_2!\cdots x_k!}. \tag{A.30}$$
The expression (A.30) is called a multinomial coefficient and is often denoted
$$\binom{n}{x_1, x_2, \ldots, x_k}.$$
Using the multinomial coefficient, we can see that
$$f(x_1, x_2, \ldots, x_k) \;=\; \binom{n}{x_1, x_2, \ldots, x_k}\,\theta_1^{x_1}\theta_2^{x_2}\cdots\theta_k^{x_k}.$$
This is the multinomial distribution. We often write $f(x_1, x_2, \ldots, x_k;\, \theta_1, \theta_2, \ldots, \theta_k, n)$ to emphasize the dependence on the parameters.
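As a quick numerical sketch (ours, not from the text), the formula above can be evaluated directly with factorials. The specific values $n = 4$, $\theta = (0.5, 0.3, 0.2)$ and counts $(2, 1, 1)$ are illustrative choices only.

```python
from math import factorial

def multinomial_pmf(counts, thetas):
    """Probability of observing the given counts of each outcome type,
    in n = sum(counts) independent trials with outcome probabilities thetas."""
    n = sum(counts)
    coeff = factorial(n)
    for x in counts:
        coeff //= factorial(x)  # builds the multinomial coefficient n!/(x1!...xk!)
    prob = coeff
    for x, theta in zip(counts, thetas):
        prob *= theta ** x      # multiply in theta_i^{x_i}
    return prob

# Example: n = 4 trials, three outcomes with probabilities 0.5, 0.3, 0.2.
p = multinomial_pmf([2, 1, 1], [0.5, 0.3, 0.2])
print(p)  # 4!/(2!1!1!) * 0.5^2 * 0.3 * 0.2 = 12 * 0.015 = 0.18
```

Summing the probability over all counts with $x_1 + x_2 + x_3 = n$ returns 1, as a distribution must.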
One can derive the multinomial distribution by repeated uses of the binomial distribution. For example, if $k = 3$ there are three outcomes, say $A$, $B$ and $C$. We may amalgamate $B$ and $C$ and consider the case of two outcomes: $A$ and not $A$. If we let $\theta_1$ equal the probability of $A$ and $1 - \theta_1$ the probability of not $A$, we find the probability of $x_1$ outcomes being $A$ and $n - x_1$ outcomes being not $A$ is just
$$\binom{n}{x_1}\,\theta_1^{x_1}(1 - \theta_1)^{n - x_1}.$$
Let $\theta_2$ be the probability of outcome $B$, and $\theta_3$ the probability of outcome $C$. Given that $A$ does not occur, the probability that $B$ occurs is $\theta_2/(1-\theta_1)$ and the probability that $C$ occurs is $\theta_3/(1-\theta_1)$. Thus the probability that $x_1$ outcomes are $A$, $x_2$ are $B$ and $x_3 = n - x_1 - x_2$ are $C$ is
$$\binom{n}{x_1}\theta_1^{x_1}(1-\theta_1)^{n-x_1}\cdot\binom{n-x_1}{x_2}\left(\frac{\theta_2}{1-\theta_1}\right)^{x_2}\left(\frac{\theta_3}{1-\theta_1}\right)^{x_3} \;=\; \frac{n!}{x_1!\,x_2!\,x_3!}\,\theta_1^{x_1}\theta_2^{x_2}\theta_3^{x_3},$$
which agrees with what we found above.
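The two-step factorization can be checked numerically. The sketch below (our illustration; the values $n = 6$, $x = (3,2,1)$, $\theta = (0.5, 0.3, 0.2)$ are arbitrary) compares the "A versus not-A, then B versus C" computation against the direct multinomial formula.

```python
from math import comb, factorial

def two_step(n, x1, x2, t1, t2, t3):
    """k = 3 probability computed in two binomial stages:
    first A vs not-A, then split the non-A trials between B and C."""
    x3 = n - x1 - x2
    p_a = comb(n, x1) * t1**x1 * (1 - t1)**(n - x1)
    p_bc = comb(n - x1, x2) * (t2 / (1 - t1))**x2 * (t3 / (1 - t1))**x3
    return p_a * p_bc

def multinomial(n, x1, x2, t1, t2, t3):
    """Direct multinomial formula for comparison."""
    x3 = n - x1 - x2
    coeff = factorial(n) // (factorial(x1) * factorial(x2) * factorial(x3))
    return coeff * t1**x1 * t2**x2 * t3**x3

print(two_step(6, 3, 2, 0.5, 0.3, 0.2))
print(multinomial(6, 3, 2, 0.5, 0.3, 0.2))  # agrees up to floating-point rounding
```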
B. Big-Oh Notation
Definition B.1 (Big-Oh Notation). $A(x) = O(B(x))$, read "$A(x)$ is of order (or big-Oh) $B(x)$", means there is a $C > 0$ and an $x_0$ such that for all $x \ge x_0$, $|A(x)| \le C\,B(x)$. This is also written $A(x) \ll B(x)$ or $B(x) \gg A(x)$.
Big-Oh notation is a convenient way to handle lower order terms. For example, if we write $F(x) = x^5 + O(x^2)$, this means that as $x$ tends to infinity, the main term of $F(x)$ grows like $x^5$, and the correction (or error) terms are at most some constant times $x^2$.
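A small numerical illustration (ours, not the text's): for the hypothetical function $F(x) = x^5 + 3x^2$, the constants $C = 3$ and $x_0 = 1$ witness the definition, since the error $F(x) - x^5$ never exceeds $C x^2$.

```python
def F(x):
    # Hypothetical example: main term x^5, error term 3x^2.
    return x**5 + 3 * x**2

C, x0 = 3, 1  # witnesses for |F(x) - x^5| <= C * x^2 whenever x >= x0
for x in [1, 10, 100, 1000]:
    assert abs(F(x) - x**5) <= C * x**2
print("F(x) = x^5 + O(x^2) verified at sample points")
```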
C. Proof That With High Probability $|\tilde{X} - \tilde{\mu}|$ Is Small
In proving the Median Theorem, we assume that we can ignore higher powers of $t = \tilde{X} - \tilde{\mu}$. We are able to do this because, with high probability, $t$ is small. Here we provide a more formal statement of this fact, as well as a proof.
Lemma C.1. Suppose the density $f(x)$ is a continuously differentiable function in some neighborhood of $\tilde{\mu}$, and let $\tilde{X}$ denote the median of a sample of size $2m+1$. Then for any $c > 0$, we have
$$\lim_{m\to\infty} \operatorname{Prob}\bigl(|\tilde{X} - \tilde{\mu}| \ge c\bigr) \;=\; 0.$$
Proof. This is equivalent to proving that
$$\lim_{m\to\infty} \operatorname{Prob}(\tilde{X} \le \tilde{\mu} - c) \;=\; 0 \qquad\text{and}\qquad \lim_{m\to\infty} \operatorname{Prob}(\tilde{X} \ge \tilde{\mu} + c) \;=\; 0.$$
We will prove only the first of these two statements, as the proof of the second is very similar. By (2.15), we can approximate the density of the median of a sample of size $2m+1$ as
$$g(\tilde{x}) \;=\; (2m+1)\binom{2m}{m}\,[F(\tilde{x})]^m\,[1 - F(\tilde{x})]^m\, f(\tilde{x}).$$
We consider the factor $([F(\tilde{x})][1 - F(\tilde{x})])^m$. It is convenient to write $\theta = F(\tilde{x})$ and consider the function $h(\theta) = \theta(1 - \theta)$. This function will attain its maximum for the same value of $\theta = F(\tilde{x})$ as $([F(\tilde{x})][1 - F(\tilde{x})])^m$, and it is a simple exercise in calculus to show that this value is $\theta = 1/2$. This condition holds only for $\tilde{x} = \tilde{\mu}$. We furthermore note that for $\theta < 1/2$, $h'(\theta) = 1 - 2\theta > 0$, so $h$ is increasing. Since $F(\tilde{x}) = 1/2$ precisely when $\tilde{x} = \tilde{\mu}$, this means that for $\tilde{x} \le \tilde{\mu} - c$ the maximum value of $h(\theta)$ occurs for $\tilde{x} = \tilde{\mu} - c$. We therefore have, for $\tilde{x} \le \tilde{\mu} - c$,
$$F(\tilde{x})\,(1 - F(\tilde{x})) \;\le\; F(\tilde{\mu} - c)\,(1 - F(\tilde{\mu} - c)) \;=:\; \frac{\alpha}{4}.$$
In particular, since $F(\tilde{\mu} - c) < 1/2$ and $h(\theta) < 1/4$ for $\theta \ne 1/2$, we note that $\alpha < 1$.
We now begin to look at the probability that $\tilde{X}$ is at most $\tilde{\mu} - c$. We have
$$\operatorname{Prob}(\tilde{X} \le \tilde{\mu} - c) \;=\; \int_{-\infty}^{\tilde{\mu} - c} (2m+1)\binom{2m}{m}\,[F(\tilde{x})(1 - F(\tilde{x}))]^m f(\tilde{x})\,d\tilde{x} \;\le\; (2m+1)\,4^m \int_{-\infty}^{\tilde{\mu} - c} [F(\tilde{x})(1 - F(\tilde{x}))]^m f(\tilde{x})\,d\tilde{x}.$$
In the last step, we use the fact that $\binom{2m}{m} \le 2^{2m} = 4^m$; this simplifies the expression, as the factor $4^m$ is easier to work with than the binomial coefficient. We now apply our bound on $F(\tilde{x})(1 - F(\tilde{x}))$ to find that
$$\operatorname{Prob}(\tilde{X} \le \tilde{\mu} - c) \;\le\; (2m+1)\,4^m \left(\frac{\alpha}{4}\right)^{m} \int_{-\infty}^{\tilde{\mu} - c} f(\tilde{x})\,d\tilde{x} \;=\; (2m+1)\,\alpha^m \int_{-\infty}^{\tilde{\mu} - c} f(\tilde{x})\,d\tilde{x} \;\le\; \frac{(2m+1)\,\alpha^m}{2}.$$
In obtaining the rightmost expression, we have used the fact that $f$ is nonnegative everywhere and positive in a neighborhood of $\tilde{\mu}$, so that
$$\int_{-\infty}^{\tilde{\mu} - c} f(\tilde{x})\,d\tilde{x} \;\le\; \int_{-\infty}^{\tilde{\mu}} f(\tilde{x})\,d\tilde{x} \;=\; F(\tilde{\mu}).$$
Since $\tilde{\mu}$ is the median of the population, by definition we have $F(\tilde{\mu}) = 1/2$.
Since $\alpha < 1$, the exponentially decaying factor $\alpha^m$ dominates the linear factor $2m + 1$, so the right side of this inequality converges to 0 as $m$ goes to infinity; the probability on the left side must likewise converge to 0.
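The lemma can also be checked empirically. The simulation below is our illustration, not the text's: the standard normal population (whose median is $\tilde{\mu} = 0$) and the specific values of $m$, $c$, and the trial count are arbitrary choices. It estimates $\operatorname{Prob}(|\tilde{X} - \tilde{\mu}| \ge c)$ for samples of size $2m+1$ and shows the probability shrinking as $m$ grows.

```python
import random
from statistics import median

def prob_median_far(m, c, trials=2000, seed=0):
    """Monte Carlo estimate of Prob(|sample median - mu| >= c) for samples
    of size 2m+1 drawn from a standard normal population (mu = 0)."""
    rng = random.Random(seed)
    far = 0
    for _ in range(trials):
        sample = [rng.gauss(0, 1) for _ in range(2 * m + 1)]
        if abs(median(sample)) >= c:
            far += 1
    return far / trials

small_m, large_m = prob_median_far(5, 0.3), prob_median_far(50, 0.3)
print(small_m, large_m)  # the estimate drops sharply as m increases
```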
D. Stirling’s Approximation Formula for n!
Exact computations involving factorials of large numbers can be very difficult. Fortunately, there is an approximation formula which can greatly simplify the computations. Stirling's formula states that
$$n! \;\approx\; n^n e^{-n}\sqrt{2\pi n}, \tag{D.46}$$
in the sense that the ratio of the two sides tends to 1 as $n$ tends to infinity.
Proof. For a proof, see [WW]. Here we show only that (D.46) is a reasonable approximation. It is often easier to analyze a product by converting it to a sum; this is readily accomplished by taking logarithms. We have
$$\log n! \;=\; \sum_{k=1}^{n} \log k \;\approx\; \int_{1}^{n} \log t\,dt \;=\; n\log n - n + 1.$$
Thus $\log n! \approx n\log n - n$, or $n! \approx n^n e^{-n}$.
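A quick numerical check of the accuracy (our illustration): the ratio $n!\big/\bigl(n^n e^{-n}\sqrt{2\pi n}\bigr)$ approaches 1 as $n$ grows, with relative error roughly $1/(12n)$.

```python
from math import factorial, e, pi, sqrt

def stirling(n):
    """Stirling's approximation n^n * e^(-n) * sqrt(2*pi*n)."""
    return n**n * e**(-n) * sqrt(2 * pi * n)

for n in [5, 20, 50]:
    ratio = factorial(n) / stirling(n)
    print(n, ratio)  # the ratio tends to 1 as n grows
```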
E. Review of the Exponential Function
The notes in this section are taken from [MT-B].
In this section we study some of the basic properties of the number $e$ (see [Rud] for more properties and proofs). One of the many ways to define the number $e$, the base of the natural logarithm, is to write it as the sum of the following infinite series:
$$e \;=\; \sum_{n=0}^{\infty} \frac{1}{n!}.$$
Hence $e$ is the limit of the convergent sequence of partial sums $s_m = \sum_{n=0}^{m} 1/n!$. This representation is one of the main tools in analyzing the nature of $e$.
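Numerically (our illustration), the partial sums $s_m$ converge to $e$ very quickly, since the tail of the series is bounded by a constant times $1/(m+1)!$:

```python
from math import factorial, e

def s(m):
    """Partial sum s_m = sum_{n=0}^{m} 1/n! of the series for e."""
    return sum(1 / factorial(n) for n in range(m + 1))

for m in [1, 2, 5, 10]:
    print(m, s(m), abs(s(m) - e))  # the error falls off roughly like 1/(m+1)!
```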