1 Introduction


Data is usually the result of a random experiment. The gender of the next person you meet, today’s share price of BioNTech, the number of Taylor Swift’s Spotify streams this month, the sale prices of houses in Cologne, the number of pizzas you eat this year, the delay of the KVB tram, your exam grade: all of this information involves a certain amount of randomness.

Suppose you are conducting a survey and ask 10 random people about their gender, years of education, hourly wage, and years of work experience. It is convenient to work with numerical values only, so we write 1 if the person is female and 0 otherwise. Your data table might look like this:

Table 1.1: Survey data

Person	Female	Education	Wage	Experience
1	1	12	9.20	41
2	0	18	14.55	15
3	1	12	25.29	46
4	1	13	12.18	15
5	0	12	15.33	7
6	1	18	10.95	15
7	1	12	5.18	36
8	0	12	0.00	3
9	0	18	13.14	2
10	0	21	11.03	6
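As a minimal sketch, the survey data of Table 1.1 could be entered in R as a data frame. The object name `survey` and the lowercase column names are illustrative choices, not something prescribed by the course.

```r
# Survey data from Table 1.1 (object and column names are illustrative)
survey <- data.frame(
  person     = 1:10,
  female     = c(1, 0, 1, 1, 0, 1, 1, 0, 0, 0),
  education  = c(12, 18, 12, 13, 12, 18, 12, 12, 18, 21),
  wage       = c(9.20, 14.55, 25.29, 12.18, 15.33, 10.95, 5.18, 0.00, 13.14, 11.03),
  experience = c(41, 15, 46, 15, 7, 15, 36, 3, 2, 6)
)

str(survey)          # one row per person, one numeric column per variable
mean(survey$female)  # share of women in the sample: 0.5
```

Because the gender is coded as 0 or 1, the sample mean of the `female` column is simply the share of women among the 10 respondents.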

The random selection of a particular person to be interviewed and to fill out the spreadsheet is a random experiment. Therefore, in order to make statistical inferences about the dependence of the collected variables, we must first understand randomness and uncertainty from a mathematical perspective. Probability is the mathematical language for situations where the outcome is unknown. Probability theory is the basis of mathematical statistics and econometric theory.

We interpret the entries in Table 1.1 to be the outcomes of random variables. For example, the gender of the first person interviewed is a random variable. The value 1 is the result of this random experiment. If the second person had been interviewed before the first, this value would be zero.
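The idea that an observed value is one outcome of a random experiment can be mimicked in R. The sketch below treats the 10 surveyed people as if they were the whole population (an assumption made purely for illustration) and draws one person at random. The seed value is arbitrary and only makes the run reproducible.

```r
# Female indicator of the 10 people in Table 1.1
female <- c(1, 0, 1, 1, 0, 1, 1, 0, 0, 0)

set.seed(1)                       # arbitrary seed, for reproducibility only
draw <- sample(female, size = 1)  # randomly select one person
draw                              # outcome of the random experiment: 0 or 1

# Repeat the experiment many times (drawing with replacement)
draws <- sample(female, size = 10000, replace = TRUE)
mean(draws)                       # relative frequency of the outcome 1
```

Each single draw is either 0 or 1 and cannot be predicted in advance, while the relative frequency over many repetitions settles near the share of women in the table.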

Let X_1 be the 4 \times 1 vector of the values of the first person interviewed, X_2 the vector of the second person interviewed, and so on. These are random vectors. Their realizations are X_1 = \begin{pmatrix} 1 \\ 12 \\ 9.2 \\ 41 \end{pmatrix}, \quad X_2 = \begin{pmatrix} 0 \\ 18 \\ 14.55 \\ 15 \end{pmatrix}, \quad \ldots

The full data set can be collected in the 10 \times 4 matrix

\boldsymbol X = \begin{pmatrix} 1 & 12 & 9.2 & 41 \\ 0 & 18 & 14.55 & 15 \\ 1 & 12 & 25.29 & 46 \\ 1 & 13 & 12.18 & 15 \\ 0 & 12 & 15.33 & 7 \\ 1 & 18 & 10.95 & 15 \\ 1 & 12 & 5.18 & 36 \\ 0 & 12 & 0 & 3 \\ 0 & 18 & 13.14 & 2 \\ 0 & 21 & 11.03 & 6 \\ \end{pmatrix},

where each column corresponds to one of the variables in Table 1.1, and the i-th row corresponds to the values of individual i, i.e., the first row of \boldsymbol X is X_1', the transpose of the 4 \times 1 vector X_1.
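A possible way to build the data matrix \boldsymbol X in R is sketched below. The object names `X` and `X1` and the column labels are illustrative assumptions. The values are entered row by row, so `byrow = TRUE` is needed because R fills matrices column by column by default.

```r
# 10 x 4 data matrix: one row per person, one column per variable
X <- matrix(
  c(1, 12,  9.20, 41,
    0, 18, 14.55, 15,
    1, 12, 25.29, 46,
    1, 13, 12.18, 15,
    0, 12, 15.33,  7,
    1, 18, 10.95, 15,
    1, 12,  5.18, 36,
    0, 12,  0.00,  3,
    0, 18, 13.14,  2,
    0, 21, 11.03,  6),
  nrow = 10, ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("female", "education", "wage", "experience"))
)

dim(X)                          # 10 rows and 4 columns
X[1, ]                          # values of the first person

X1 <- matrix(X[1, ], ncol = 1)  # X_1 stored as a 4 x 1 column vector
dim(X1)                         # 4 rows and 1 column
t(X1)                           # its transpose: the first row of X as a 1 x 4 matrix
```

Note that `X[1, ]` returns a plain vector without dimensions; wrapping it in `matrix(..., ncol = 1)` makes the 4 \times 1 shape of X_1 explicit.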

Matrix algebra provides a compact representation of multivariate data and an efficient framework for analyzing and implementing statistical methods. We will use matrix algebra frequently throughout this course. Please use the refresher below to review the most important concepts (in particular the first three sections):

Crash Course on Matrix Algebra

The best way to learn statistical methods is to program them yourself. We will use the statistical programming language R to implement statistical methods and apply them to real data sets. If you are new to R, please take a look at this short introduction, which contains a lot of valuable resources:

Getting Started with R