Independent and identically distributed random variables

In probability theory and statistics, a collection of random variables is independent and identically distributed if each random variable has the same probability distribution as the others and all are mutually independent.[1] This property is usually abbreviated as i.i.d., iid, or IID. The concept was first used in statistics and has since been applied in other fields such as data mining and signal processing.

In statistics, we commonly deal with random samples. A random sample can be thought of as a set of objects that are chosen randomly or, more formally, as “a sequence of independent, identically distributed (IID) random variables”.

In other words, the terms random sample and IID are basically one and the same. In statistics, we usually say “random sample,” but in probability it’s more common to say “IID.”

Independent and identically distributed random variables are often used as an assumption, which tends to simplify the underlying mathematics. In practical applications of statistical modeling, however, the assumption may or may not be realistic.[3]

The i.i.d. assumption is also used in the central limit theorem, which states that the probability distribution of the suitably normalized sum (or average) of i.i.d. variables with finite variance approaches a normal distribution.[4]
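The statement of the central limit theorem can be illustrated with a small simulation. The following is a minimal sketch, not a proof: it assumes Uniform(0, 1) variables, a sample size of 30, and a fixed seed, all of which are illustrative choices, and the function name `standardized_sample_means` is hypothetical.

```python
import random
import statistics

def standardized_sample_means(n, trials, seed=0):
    """Return `trials` standardized means of n i.i.d. Uniform(0, 1) draws."""
    rng = random.Random(seed)
    mu = 0.5                 # mean of Uniform(0, 1)
    sigma = (1 / 12) ** 0.5  # standard deviation of Uniform(0, 1)
    results = []
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(n)) / n
        # Center and rescale so the limit is a standard normal distribution
        results.append((mean - mu) / (sigma / n ** 0.5))
    return results

z = standardized_sample_means(n=30, trials=20_000)

# For a standard normal variable, about 68.3% of the mass lies within one
# standard deviation of the mean; the simulated fraction should be close to that.
within_one = sum(abs(v) <= 1 for v in z) / len(z)
print(round(statistics.mean(z), 2), round(statistics.stdev(z), 2), round(within_one, 3))
```

Although each individual draw is uniform rather than normal, the standardized averages should have mean near 0, standard deviation near 1, and roughly normal tail fractions.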

Often the i.i.d. assumption arises in the context of sequences of random variables. Then "independent and identically distributed" implies that an element in the sequence is independent of the random variables that came before it. In this way, an i.i.d. sequence is different from a Markov sequence, where the probability distribution for the nth random variable is a function of the previous random variable in the sequence (for a first order Markov sequence). An i.i.d. sequence does not imply the probabilities for all elements of the sample space or event space must be the same.[5] For example, repeated throws of loaded dice will produce a sequence that is i.i.d., despite the outcomes being biased.

In probability theory, two events A and B are independent if and only if P(A ∩ B) = P(A)P(B). In what follows, P(AB) is shorthand for P(A ∩ B).

Suppose A and B are two events of an experiment. If P(A) > 0, the conditional probability P(B|A) is defined. In general, the occurrence of A affects the probability of B, which is why P(B|A) is called a conditional probability. Only when the occurrence of A has no effect on the occurrence of B do we have P(B|A) = P(B).
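The relation between conditional probability and independence can be checked exactly on a small sample space. The sketch below assumes two fair six-sided dice; the events `first_even`, `sum_is_7`, and `sum_is_8` and the helper names are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered outcomes of two fair six-sided dice, all equally likely
OMEGA = list(product(range(1, 7), repeat=2))

def prob(event):
    """Exact probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for w in OMEGA if event(w)), len(OMEGA))

def cond_prob(b, a):
    """P(B | A), defined only when P(A) > 0."""
    return prob(lambda w: a(w) and b(w)) / prob(a)

first_even = lambda w: w[0] % 2 == 0
sum_is_7 = lambda w: w[0] + w[1] == 7
sum_is_8 = lambda w: w[0] + w[1] == 8

# Independent pair: conditioning on A leaves the probability of B unchanged
print(cond_prob(sum_is_7, first_even), prob(sum_is_7))  # 1/6 1/6

# Dependent pair: conditioning on A changes the probability of B
print(cond_prob(sum_is_8, first_even), prob(sum_is_8))  # 1/6 5/36
```

In this example, "the sum is 7" is independent of "the first die is even", while "the sum is 8" is not.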

Note: If P(A) > 0 and P(B) > 0, then A and B cannot be both independent and mutually exclusive, because independence gives P(AB) = P(A)P(B) > 0 while mutual exclusion gives P(AB) = 0. In other words, independent events must be compatible (able to occur together), and mutually exclusive events must be dependent.

Suppose A, B, and C are three events. If the following four equations are all satisfied, then the events A, B, and C are mutually independent:

P(AB) = P(A)P(B), P(BC) = P(B)P(C), P(AC) = P(A)P(C), P(ABC) = P(A)P(B)P(C)

More generally, let A1, A2, ..., An be n events. If, for every choice of 2, 3, ..., n of these events, the probability of their intersection equals the product of the probabilities of the individual events, then A1, A2, ..., An are mutually independent.
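The general definition requires the product rule for every sub-collection of events, not only for pairs. The following sketch checks this by brute force on a finite sample space. It assumes two fair coin tosses, and the events A, B, C in the code are illustrative choices (they are not the generic events of the text); the function name `mutually_independent` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product

OMEGA = list(product("HT", repeat=2))  # two fair coin tosses, equally likely

def prob(event):
    """Exact probability of an event given as a set of outcomes."""
    return Fraction(sum(1 for w in OMEGA if w in event), len(OMEGA))

def mutually_independent(events):
    """Check the product rule for every sub-collection of 2, 3, ..., n events."""
    for k in range(2, len(events) + 1):
        for subset in combinations(events, k):
            joint = set(OMEGA).intersection(*subset)
            expected = Fraction(1)
            for e in subset:
                expected *= prob(e)
            if prob(joint) != expected:
                return False
    return True

A = {w for w in OMEGA if w[0] == "H"}   # first toss is heads
B = {w for w in OMEGA if w[1] == "H"}   # second toss is heads
C = {w for w in OMEGA if w[0] == w[1]}  # both tosses agree

print(mutually_independent([A, B]))     # True
print(mutually_independent([A, B, C]))  # False: every pair passes, the triple fails
```

Here each pair of events satisfies the product rule, but P(ABC) = 1/4 while P(A)P(B)P(C) = 1/8, so pairwise independence alone does not give mutual independence.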

A sequence of outcomes of spins of a fair or unfair roulette wheel is i.i.d. One implication of this is that if the roulette ball lands on "red", for example, 20 times in a row, the next spin is no more or less likely to be "black" than on any other spin (see the Gambler's fallacy).
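The roulette claim can be explored with a simulation. This is a rough sketch under stated assumptions: a single-zero wheel with 18 red, 18 black, and 1 green pocket, a fixed seed, and a streak length of 4 rather than 20 (so that enough streaks occur in the simulated data). The function name `black_rate_after_red_streak` is hypothetical.

```python
import random

def black_rate_after_red_streak(spins, streak):
    """Fraction of 'black' outcomes among spins that follow `streak` reds in a row."""
    following = [
        spins[i]
        for i in range(streak, len(spins))
        if all(s == "red" for s in spins[i - streak:i])
    ]
    return following.count("black") / len(following)

rng = random.Random(1)
# Assumed wheel: 18 red, 18 black, 1 green pocket (single-zero layout)
pockets = ["red"] * 18 + ["black"] * 18 + ["green"]
spins = [rng.choice(pockets) for _ in range(500_000)]

overall = spins.count("black") / len(spins)
after_streak = black_rate_after_red_streak(spins, streak=4)

# Both frequencies should be close to 18/37, up to sampling noise
print(round(overall, 3), round(after_streak, 3))
```

Because the spins are generated independently from the same distribution, the frequency of black after a run of reds should match the overall frequency of black.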

In signal processing and image processing, the notion of transformation to i.i.d. implies two specifications, the "i.d." part and the "i." part:

(i.) the signal spectrum must be flattened, i.e. transformed by filtering (such as deconvolution) to a white noise signal (i.e. a signal where all frequencies are equally present).

Toss a coin 10 times and record how many times the coin lands on heads.

Choose a card from a deck of 54 cards (a standard 52-card deck plus two jokers), then place the card back in the deck. Repeat this 54 times. Record the number of kings that appear.
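Both examples can be simulated in a few lines. The sketch below assumes a fair coin and reads the 54-card deck as 52 ordinary cards plus two jokers; the seed and variable names are arbitrary illustrative choices.

```python
import random

rng = random.Random(42)

# Example 1: ten independent tosses of a fair coin, counting heads
tosses = [rng.choice("HT") for _ in range(10)]
heads = tosses.count("H")

# Example 2: 54 draws with replacement from a 54-card deck
# (assumed here to be 52 ordinary cards plus two jokers)
ranks = ["A"] + [str(n) for n in range(2, 11)] + ["J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = [(rank, suit) for rank in ranks for suit in suits] + [("Joker", None)] * 2

# Drawing with rng.choice never removes a card, so every draw sees the same deck
draws = [rng.choice(deck) for _ in range(54)]
kings = sum(1 for rank, _ in draws if rank == "K")

print(heads, kings)
```

Replacing the card after each draw is what keeps the draws identically distributed and independent; without replacement, the composition of the deck would change from draw to draw.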

Many results that were first proven under the assumption that the random variables are i.i.d. have been shown to be true even under a weaker distributional assumption.

The most general notion that shares the main properties of i.i.d. variables is that of exchangeable random variables, introduced by Bruno de Finetti.[citation needed] Exchangeability means that, while the variables may not be independent, future ones behave like past ones. Formally, any value of a finite sequence is as likely as any permutation of those values; that is, the joint probability distribution is invariant under the symmetric group.

This provides a useful generalization – for example, sampling without replacement is not independent, but is exchangeable.
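The difference between exchangeability and independence can be verified exactly for a small urn. The following sketch assumes an urn with three red and two blue balls and two draws without replacement; the urn contents and names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

urn = ["red", "red", "red", "blue", "blue"]

# All equally likely ordered draws of two distinct balls (no replacement)
draws = [(urn[i], urn[j]) for i, j in permutations(range(len(urn)), 2)]

def prob(pred):
    """Exact probability of a predicate over the ordered draws."""
    return Fraction(sum(1 for d in draws if pred(d)), len(draws))

# Exchangeable: swapping the order of the outcomes does not change the probability
p_red_blue = prob(lambda d: d == ("red", "blue"))
p_blue_red = prob(lambda d: d == ("blue", "red"))
print(p_red_blue, p_blue_red)  # 3/10 3/10

# Not independent: knowing the first draw changes the odds for the second
p_second_red = prob(lambda d: d[1] == "red")
p_second_red_given_first_red = (
    prob(lambda d: d == ("red", "red")) / prob(lambda d: d[0] == "red")
)
print(p_second_red, p_second_red_given_first_red)  # 3/5 1/2
```

The joint probabilities are symmetric in the order of the draws, yet the conditional probability of a red second ball differs from its unconditional probability.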

In stochastic calculus, i.i.d. variables are thought of as a discrete time Lévy process: each variable gives how much one changes from one time to another. For example, a sequence of Bernoulli trials is interpreted as the Bernoulli process. One may generalize this to include continuous time Lévy processes, and many Lévy processes can be seen as limits of i.i.d. variables—for instance, the Wiener process is the limit of the Bernoulli process.
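The limiting relationship can be hinted at numerically. The sketch below is an assumption-laden illustration: it uses symmetric steps of +1 or -1 (a centered variant of Bernoulli trials) and scales the sum of n steps by 1/sqrt(n), so the endpoint should resemble the value of a Wiener process at time 1, which has mean 0 and variance 1. The function name, step count, and seed are arbitrary.

```python
import random
import statistics

def scaled_walk_endpoint(n_steps, rng):
    """Sum of n i.i.d. +/-1 steps, scaled by 1/sqrt(n) (value at time t = 1)."""
    total = sum(rng.choice((-1, 1)) for _ in range(n_steps))
    return total / n_steps ** 0.5

rng = random.Random(7)
endpoints = [scaled_walk_endpoint(200, rng) for _ in range(4_000)]

# Sample mean should be near 0 and sample variance near 1
print(round(statistics.mean(endpoints), 2), round(statistics.variance(endpoints), 2))
```

Each step is one i.i.d. increment, and the running sum is the discrete-time process built from those increments.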

Why assume the data in machine learning are independent and identically distributed?

Machine learning uses large quantities of currently acquired data to deliver faster, more accurate results.[7] To do so, the historical data used for training must be representative of the overall population. If the data obtained are not representative of the overall situation, the rules learned from them will be poor or wrong.

Under the i.i.d. hypothesis, the number of individual cases needed in the training sample can be greatly reduced.

This assumption makes maximization much easier to carry out mathematically. Assuming that the observations are independent and identically distributed simplifies the likelihood function in optimization problems. Because of the independence assumption, the likelihood function of observations x1, x2, ..., xn can be written as a product:

l(θ) = P(x1, x2, ..., xn | θ) = P(x1 | θ) P(x2 | θ) ... P(xn | θ)

To maximize the probability of the observed event, take the logarithm and maximize over the parameter θ:

log l(θ) = log P(x1 | θ) + log P(x2 | θ) + ... + log P(xn | θ)

Computers add many numbers very efficiently but are less efficient at multiplying them. Turning the product into a sum is the core reason for the gain in computational efficiency. The log transformation also turns many exponential functions into linear functions during maximization.
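The effect of taking the logarithm can be seen in a small maximum likelihood example. The sketch below assumes i.i.d. Bernoulli observations with unknown success probability θ (called `theta` in the code); the data set and the grid search are hypothetical and chosen only for illustration. It also shows a practical numerical benefit of the sum form: the raw product of many probabilities underflows in floating point, while the sum of logarithms stays finite.

```python
import math

def likelihood(theta, data):
    """Product of P(x_i | theta) for i.i.d. Bernoulli observations."""
    result = 1.0
    for x in data:
        result *= theta if x == 1 else 1 - theta
    return result

def log_likelihood(theta, data):
    """Sum of log P(x_i | theta); it has the same maximizer as the likelihood."""
    return sum(math.log(theta if x == 1 else 1 - theta) for x in data)

# Hypothetical sample: 7000 successes in 10000 i.i.d. trials
data = [1] * 7000 + [0] * 3000

# Simple grid search over theta in (0, 1)
grid = [k / 100 for k in range(1, 100)]
theta_hat = max(grid, key=lambda t: log_likelihood(t, data))

print(theta_hat)                  # 0.7, the sample proportion
print(likelihood(0.7, data))      # 0.0, the raw product underflows
print(log_likelihood(0.7, data))  # a finite negative number
```

The maximizer found on the grid equals the sample proportion of successes, which is the known closed-form maximum likelihood estimate for this model.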

For two reasons, this hypothesis makes the central limit theorem easy to apply in practical applications.