Random variables

Definition

A random variable is a variable that takes (or produces) numerical values, each one describing or representing the outcome of some underlying random phenomenon that is of interest to an experimenter. A random variable is usually denoted by a capital letter, such as \(X\).

A first example

Let's start with a simple example. Let \(X\) be the random variable describing the "Weight of a person". Note that this person has been subjected to many external forces of the physical world (gravitation, lack of food, family habits, etc.) as well as to internal elements of his/her body (e.g., genetics). These many factors have interacted in an unpredictable way to "produce" the current (at the time the experiment is made) state of the world, to which we associate the (value of the) weight of this person.

We can now think of the specific observed value (denoted thereafter \(x\), with a small letter) as the byproduct of all the (random) interactions described above. This would require us to define what we mean by "random". (This is another topic that will be treated elsewhere.)

If we could do the same experiment another time, say a few days (or weeks) in the future, then we would most certainly observe a different value for the weight of the person.

We will thus imagine our random variable \(X\) as a "black box" that associates to each input (i.e., the random interactions having led to the current state of the world, denoted thereafter \(\omega\)) an output (i.e., the weight of the chosen person).

\[X:\text{input}=\omega\mapsto \text{output}=x.\]

How this association is made is hidden (hence our use of the word "black box"). Indeed, in this particular experiment, we only observe the output (the weight value).

Understanding better using a computer

I will now make a simple analogy using the computer. Below is the construction of such a black box.

X <- function() rnorm(1, mean = 50, sd = 5)

In fact, it would only really be a black box if I did not give you its internal construction. Imagine for a moment that I hide the line above. Then I ask you to press the Enter key on the keyboard after I have typed X(). As you can see below, this action produces a numerical value.

X()
## [1] 41.81731

When you ask a person his or her weight (or when you read it on a bathroom scale), it is exactly as if you were pressing Enter above.

Note that you could not have predicted the above numerical value.

We can do the same experiment another time.

X()
## [1] 56.07248

As you can see, the observed numerical value is different from the first one. Thus, we can conclude that the R function X() (or the random variable \(X\) in mathematical language) produces different values (hence the word variable), even though we did not change X itself. The values produced by X (or \(X\)) are produced at random; this is the 'r' in rnorm that stands for random. As you can see, the randomness is inscribed in the (hidden) body of the X() function (in the same way as it is inscribed in the hidden physical and genetic interrelations I talked about previously).

Remark: When we use a computer, as above, we can often see what's inside the "black box", whereas for real experiments (e.g., the weight), it's not possible. Only Mother Nature knows what's inside.
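For instance, in R you can look inside this particular black box simply by typing its name without the parentheses, which prints the body of the function (the exact display may vary slightly between R versions):

X
## function() rnorm(1, mean = 50, sd = 5)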

We can now describe what we mean by random. We say that \(X\) is a random variable because the values it produces are not predictable and they change each time we call that random variable. It is important to note that it is \(X\) that is random, not the specific value that has been produced by \(X\), which is a real number and as such is fixed (the weight value of the person at the time the experiment was made can be written on a piece of paper, and this written value will still be the same the day after the experiment, for example).
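A small illustration of this distinction in R (the particular value you obtain will of course differ from run to run):

x <- X()  # call the black box once; its output is now a fixed real number
x         # printing x today...
x         # ...or tomorrow always displays the same value: x is fixed, X is random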

Another example

Let's create another example.

X <- function() c(sample(c("Head", "Tail"), 1), sample(c("Head", "Tail"), 1))

The above function reproduces, in silico, the real experiment that consists in tossing two coins.

Let's (virtually) toss these two coins a few times (e.g., 5 times) to observe what happens.

X()
## [1] "Head" "Head"
X()
## [1] "Tail" "Tail"
X()
## [1] "Head" "Head"
X()
## [1] "Head" "Head"
X()
## [1] "Head" "Tail"

The above outputs are not numbers; thus, strictly following our definition at the top of this page, we should not call the X above a random variable.

We can now consider a proper random variable, defined as the function (a black box if I hide the code below) that counts the number of Tails when I toss two fair coins.

X <- function() sum(rbinom(1, 1, 0.5), rbinom(1, 1, 0.5))  # each rbinom(1, 1, 0.5) is one fair coin toss coded as 0 (Head) or 1 (Tail)

And let's do the experiment again 5 times.

X()
## [1] 1
X()
## [1] 0
X()
## [1] 0
X()
## [1] 1
X()
## [1] 1
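As a side note, the same counting black box could also have been built directly on top of the Head/Tail tosses of the previous example; the name X_tails below is just one I introduce for this sketch:

X_tails <- function() sum(sample(c("Head", "Tail"), 2, replace = TRUE) == "Tail")
X_tails()  # returns 0, 1 or 2, just like X()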

At this point, you should have already understood a few things. A random variable is some sort of engine able to produce numerical values. Switching on this engine several times (e.g., typing X() and pressing Enter several times, or measuring the weight of some person several times, or tossing two coins several times) results in the observation (or collection) of several numerical values. In the case of a real experiment, these values are called the sample of (observed) data. At this point, we can try to do two different things:

  1. describe these observed numerical values (e.g., summarising them by computing their average), as sketched in the code just after this list;
  2. describe the random variable \(X\) itself (e.g., stating which values it can produce and giving the probabilities associated with these values, or saying where, on average, these values are produced, etc.).
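For the first point, here is a quick sketch using the last black box defined above (the one counting Tails); the exact average will of course vary from run to run:

tails <- replicate(1000, X())  # switch on the engine 1000 times
mean(tails)                    # summarise the observed values by their average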

Describing the random variable \(X\) itself will be our task in the next sections.

Expectation of a random variable \(X\)

Variance of a random variable \(X\)

Independent and Identically Distributed (i.i.d.) random variables

If the person whose weight is of interest is chosen at random in some given population, it is interesting to note that the randomness is present at two levels:

  1. the random mechanism (the "black box" described above) that produces the weight value of a given person;
  2. the random selection of that person in the population.

What happens if we select several persons in a population? The first level of randomness explains the variation in the magnitude of the observed weight values. The second level of randomness guarantees independence between the observations. And the way we choose the persons in the population leads to exchangeability, and hence to the "identically distributed" concept. This needs some clarification.
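To fix ideas, here is a minimal simulation sketch of these two levels; the population size, the weight distribution, and the measurement mechanism are all made-up choices used purely for illustration:

set.seed(1)
N <- 1000
## level 1: each person comes with his/her own weight-producing black box
true_weights <- rnorm(N, mean = 50, sd = 5)
measure <- function(i) rnorm(1, mean = true_weights[i], sd = 0.5)
## level 2: the persons themselves are selected at random in the population
chosen <- sample(N, 10, replace = TRUE)
sapply(chosen, measure)  # the 10 observed weight values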