## Multiple regression: Introduction

### What is multiple regression, and what is it used for?

Mathematically, multiple regression is a straightforward generalization
of **simple linear regression**, which is the process of fitting a straight
line through a set of points in a two-dimensional Euclidean space (what you may know as a scatter-plot). Simple linear regression is used to estimate one variable (called the dependent variable or criterion) from the other variable (called the independent variable or predictor). In multiple regression we have one dependent variable and multiple predictor variables.

Regression (simple and multiple) techniques are closely related to the **analysis of variance (ANOVA)**. Both are special cases of the **General
Linear Model** or **GLM**, and you can in fact do an ANOVA using
the regression commands in statistical packages.

What distinguishes multiple regression from other techniques? The following are the main points:

- In multiple regression, we work with **one dependent variable** and **many independent variables**. In simple regression, there is only one independent variable.
- In multiple regression, the **independent variables may be correlated**. In analysis of variance, we arrange for all the independent variables to vary completely independently of each other.
- In multiple regression, the **independent variables can be continuous**. For analysis of variance, they have to be categorical, and if they are naturally continuous, we have to force them into categories, for example by a **median split**.

This means that multiple regression is useful in the following general
class of situations. We observe one dependent variable, whose variation
we want to explain in terms of a number of other independent variables,
which we can also observe. These other variables are typically not under experimental
control - we accept the variations in them that happen to
occur in the sample of people or situations we can observe. We want to
know which if any of these independent variables is significantly correlated
with the dependent variable, taking into account the various correlations
that may exist between the independent variables. So typically we use multiple
regression to analyze data that come from "natural" rather than
experimental situations. This makes it very useful in social psychology,
and social science generally. **Note, however, that it is inherently a
correlational technique; it cannot of itself tell us anything about the
causalities that may underlie the relationships it describes**.

(Jeremy's note: Stephen makes what is to me a classic mistake here that I would like you to understand. Nothing about any statistical technique or model has anything to do with causality or correlation. In the social sciences, causal conclusions are allowed when there is random assignment of subjects to levels of an IV (we call such designs experiments). This is a research design issue; it has nothing at all to do with how the data are analyzed. Recall that Stephen said earlier that multiple regression and ANOVA can be run by the same commands in statistical packages! No statistical package or technique requires us to specify how the subjects were assigned to levels of an IV, and this alone tells us that causal conclusions have nothing to do with statistical analysis.)

There are some additional rules that have to be obeyed if multiple regression is to be useful:

- The units (usually people) we observe should be a **random sample** from some well defined population. This is a basic requirement for all statistical work if we want to draw any kind of general inference from the observations we have made.
- The dependent variable should be **ordered** and polychotomous or continuous. If the dependent variable is only measured on a **nominal** (unordered category, including **dichotomies**) scale, we have to use **discriminant analysis** or **logistic regression** instead.
- The independent variables should also be ordered, but dichotomous categorical variables can be used directly, and there is a way of dealing with **polychotomous** categorical variables as well.
- The distributions of all the variables should be roughly **normal**.
- The relationship between the dependent variable and each independent variable should be **linear**. That is, it should be possible to draw a rough straight line through an x-y scatter-gram of the observed points. If the line looks curved, but is **monotonic** (increases or decreases all the time), things are not too bad and could be made better by transformation. If the line looks U-shaped, we will need to take special steps before regression can be used.
- Although the independent variables can be correlated, there must be no perfect (or near-perfect) correlations among them, a situation called **multicollinearity** (which will be explained later in the course).
- There must be no **interactions**, in the ANOVA sense, between independent variables - the effect of each on the dependent variable must be roughly independent of the effects of all others. However, if interactions are obviously present, and not too complex, there are special steps we can take to cope with the situation.

### The regression equation

Like many statistical procedures, multiple regression has two uses:
to *summarize, represent or describe the features of a set of data*, and to *test hypotheses about the population from which the data were sampled*. The first of these is part of **descriptive statistics**,
the second of **inferential statistics**. We spend most of our time
in elementary statistics courses on basic concepts in inferential statistics but
descriptive statistics are often more important and useful. In this section, we concentrate
on how multiple regression describes a set of data.

#### How do we choose a descriptive statistic?

Any method we use to summarize a set of numbers is part of **descriptive
statistics** (more sophisticated statisticians would call this "data analysis"). Many different descriptive statistics can be calculated
for a given set of data, and different ones are useful for different
purposes. In many cases, a descriptive statistic is chosen because it is
in some sense the best summary of a particular type. But what do we mean
by "best"?

Consider the best known of all descriptive statistics in psychology, the **arithmetic
mean** - what lay people call the average. Why is this the best summary
of a set of numbers? There is an answer, but it isn't obvious. The mean
is the value from which the numbers in the set have the **minimum sum
of squared deviations**. For the meaning of this, see Figure 1.

**Figure 1**

Consider observation 1. Its $y$ value is $y_1$.
If we consider an "average" value $\bar{y}$, we define the
deviation from the average as $y_1 - \bar{y}$, the squared
deviation from the average as $(y_1 - \bar{y})^2$,
and the sum of squared deviations as $\sum_i (y_i - \bar{y})^2$.
The arithmetic mean turns out to be the value of $\bar{y}$ that makes
this sum lowest. It also, of course, has the property that $\sum_i (y_i - \bar{y}) = 0$;
that, indeed, is its definition.
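The minimum-sum-of-squares property can be checked numerically. The following is a minimal sketch in plain Python, using made-up observations (they are not data from the course): it tries a grid of candidate "average" values and confirms that the one with the smallest sum of squared deviations is the arithmetic mean.

```python
from statistics import mean

# Made-up illustrative observations (an assumption, not course data).
y = [2.0, 4.0, 4.0, 5.0, 9.0]


def sum_sq_dev(values, centre):
    """Sum of squared deviations of the values from a candidate centre."""
    return sum((v - centre) ** 2 for v in values)


y_bar = mean(y)

# Candidate "averages" from 2.00 to 9.00 in steps of 0.01.
candidates = [2.0 + 0.01 * k for k in range(701)]
best = min(candidates, key=lambda c: sum_sq_dev(y, c))

# Deviations from the mean cancel out exactly (up to rounding error).
sum_of_deviations = sum(v - y_bar for v in y)

print(f"mean = {y_bar}, best candidate = {best:.2f}")
print(f"sum of deviations from the mean = {sum_of_deviations:.10f}")
```

Any other candidate on the grid gives a larger sum of squared deviations than the mean does.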

#### Describing data with a simple regression equation

If we look at Figure 1, it's obvious that we could summarize the data
better if we could find some way of representing the fact that the observations
with high *y* values tend to be those with high *x* values. Graphically,
we can do this by drawing a straight line on the graph so it passes through
the cluster of points, as in Figure 2. **Simple regression** is a way
of choosing the best straight line for this job.

**Figure 2**

This raises two problems: what is the best straight line, and how can
we describe it when we have found it?

Let's deal first with describing a straight line. Any
straight line can be described by an equation relating the *y* values
to the *x* values. In general, we usually write,

*y* = *mx* + *c*

Here *m* and *c* are constants whose values tell us which
of the infinite number of possible straight lines we are looking at. *m* (from French *monter*) tells us about the slope or **gradient** of the line. Positive *m* means the line slopes upwards to the right;
negative *m* that it slopes downwards. High *m* values mean a
steep slope, low values a shallow one. *c* (from French *couper*)
tells us about the **intercept**, i.e. where the line cuts the y axis:
positive *c* means that when *x* is zero, *y* has a positive
value, negative *c* means that when *x* is zero, *y* has
a negative value. But for regression purposes, it's more convenient to
use different symbols. We usually write:

*y* = *a* + *bx*

*or more generally*

$y = b_0 + b_1 x_1$

This is just the same equation with different names for the constants: *a* is the intercept, *b* is the gradient.

The problem of choosing the best straight line then comes down to finding
the best values of *a* and *b*. We define "best" in
the same way as we did when we explained why the mean is the best summary:
we choose the *a* and *b* values that give us the line such that
the sum of squared deviations *from the line* is minimized. This is
illustrated in Figure 3. The best line is called the **regression line**,
and the equation describing it is called the **regression equation**.
The deviations from the line are also called **residuals**.

**Figure 3**
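For readers who want to see the arithmetic, here is a minimal sketch of a least-squares fit in plain Python. It uses the standard closed-form solution (slope = sum of cross-products of deviations divided by the sum of squared *x* deviations), which is not derived in these notes, and the data values are made up for illustration.

```python
from statistics import mean


def fit_line(x, y):
    """Return (a, b) for the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx            # gradient
    a = y_bar - b * x_bar    # intercept: the line passes through the means
    return a, b


def sum_sq_residuals(x, y, a, b):
    """Sum of squared deviations of the points from the line y = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# Made-up illustrative data (an assumption, not course data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(x, y)
best = sum_sq_residuals(x, y, a, b)
print(f"a = {a:.2f}, b = {b:.2f}, sum of squared residuals = {best:.3f}")

# Nudging either constant away from the fitted value makes the fit worse.
for da, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    assert sum_sq_residuals(x, y, a + da, b + db) > best
```

The final loop is the point of the sketch: any other line has a larger sum of squared residuals than the regression line.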

#### Goodness of fit

Having found the best straight line, the next question is how well it describes the data. We measure this by the fraction

$$1 - \frac{\text{sum of squared deviations from the line}}{\text{sum of squared deviations from the mean}}$$

This is called the **variance accounted for**, symbolized by *R*^{2}. Its square root is the **Pearson correlation
coefficient**. *R*^{2} can vary from 0 (the points are completely
random) to 1 (all the points lie exactly on the regression line); quite
often it is reported as a percentage (e.g. 73% instead of 0.73). The Pearson
correlation coefficient (usually symbolized by *r*) is always reported
as a decimal value. It can take values from -1 to +1; if the value of *b* is negative, the value of *r* will also be negative.

Note that two sets of data can have identical *a* and *b* values and very different *R*^{2} values, or vice versa. Correlation
measures the strength of a linear relationship: it tells you how much scatter
there is about the best fitting straight line through a scatter-gram. *a* and *b*, on the other hand, tell you what the line is. The values
of *a* and *b* will depend on the units of measurement used,
but the value of *r* is independent of units. If we transform *y* and *x* to **z-scores**, which involves rescaling them so they
have means of zero and standard deviations of 1, *b* will equal *r*.

Note carefully that *a*, *b*, *R*^{2} and *r* are all descriptive statistics. We have not said anything about hypothesis
tests. Given a set of paired *x* and *y* values, we can use virtually
any statistics package to find the corresponding values of *a*, *b* and *R*^{2}.
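The relationships described in this section can be checked with a short sketch in plain Python. The data are made up for illustration, and the helper names (`fit_line`, `z_scores`) are ours rather than those of any statistics package. The sketch computes *R*^{2} from the two sums of squared deviations, takes *r* as its signed square root, and then confirms that the slope for z-scored data equals *r* and that changing the units of *x* changes *b* but not *R*^{2}.

```python
from math import sqrt
from statistics import mean, pstdev


def fit_line(x, y):
    """Return (a, b) for the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    return y_bar - b * x_bar, b


def r_squared(x, y):
    """1 - (squared deviations from the line) / (squared deviations from the mean)."""
    a, b = fit_line(x, y)
    y_bar = mean(y)
    ss_line = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_mean = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_line / ss_mean


def z_scores(values):
    """Rescale values to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


# Made-up illustrative data (an assumption, not course data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(x, y)
r2 = r_squared(x, y)
r = sqrt(r2) if b >= 0 else -sqrt(r2)   # r takes its sign from b

# Slope after converting both variables to z-scores.
_, b_z = fit_line(z_scores(x), z_scores(y))

# Same data with x expressed in units 100 times smaller.
x_rescaled = [xi * 100 for xi in x]
_, b_rescaled = fit_line(x_rescaled, y)
r2_rescaled = r_squared(x_rescaled, y)

print(f"R^2 = {r2:.4f}, r = {r:.4f}, slope of z-scores = {b_z:.4f}")
print(f"b = {b:.4f}, b after rescaling x = {b_rescaled:.6f}")
```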

### From simple regression to multiple regression

What happens if we have more than one independent variable? In most
cases, we can't draw graphs to illustrate the relationship between them
all. But we can still represent the relationship by an equation. This is
what multiple regression does. It's a straightforward extension of simple
regression. If there are *n* independent variables, we call them *x*_{1}, *x*_{2}, *x*_{3} and so on up to *x*_{n}.
Multiple regression then finds values of $b_0$, $b_1$, $b_2$, $b_3$ and so on up to $b_n$ which give the best fitting equation of the form

$$y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_n x_n$$

*b*_{1} is called the **coefficient** of *x*_{1}, *b*_{2} is the coefficient of *x*_{2}, and so
forth. The equation is exactly like the one for simple regression, except
that it is very laborious to work out the values of *b*_{0}, *b*_{1} etc. by hand. Most statistics packages, however, do it with exactly the
same command as for simple regression.

What do the regression coefficients mean? The coefficient of each independent
variable tells us what relation that variable has with *y*, the dependent
variable, *when all the other independent variables are held constant*.
So, if *b*_{1} is high and positive, that means that if *x*_{2}, *x*_{3} and so on up to *x*_{n} do not change,
then increases in *x*_{1} will correspond to large increases
in *y*.
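To make the "laborious" calculation concrete, here is a minimal sketch in plain Python that finds the coefficients by solving the normal equations with Gauss-Jordan elimination. This is only one of several ways to do the computation, and not necessarily what any particular statistics package does internally. The function name and the data are hypothetical; the *y* values were generated exactly from *y* = 1 + 2*x*_{1} - 0.5*x*_{2}, so the fit should recover those coefficients.

```python
def fit_multiple_regression(rows, y):
    """Least-squares coefficients [b0, b1, ..., bn] via the normal equations.

    rows: one tuple of regressor values (x1, ..., xn) per observation.
    y:    the dependent variable, one value per observation.
    """
    # Design matrix: a leading 1 for the intercept b0, then the regressors.
    X = [[1.0, *row] for row in rows]
    k = len(X[0])
    # Augmented system (X'X | X'y).
    A = [
        [sum(r[i] * r[j] for r in X) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            # Perfect multicollinearity: no unique solution exists.
            raise ValueError("regressors are perfectly collinear")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [v - factor * p for v, p in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Hypothetical data: x1 and x2 are correlated, but not perfectly.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
y = [1 + 2 * x1 - 0.5 * x2 for x1, x2 in rows]

b0, b1, b2 = fit_multiple_regression(rows, y)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
```

Here *b*_{1} = 2 says that, with *x*_{2} held constant, each unit increase in *x*_{1} goes with an increase of 2 in *y*, even though *x*_{1} and *x*_{2} tend to rise together in the data.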

#### Goodness of fit in multiple regression

In multiple regression, as in simple regression, we can work out a value
for *R*^{2}. However, every time we add another independent
variable, we necessarily increase the value of *R*^{2} (you
can get a feel for how this happens if you compare Fig 3 with Fig 1). Therefore,
in assessing the goodness of fit of a regression equation, we usually work
in terms of a slightly different statistic, called *R*^{2}-adjusted
or *R*^{2}_{adj}. This is calculated as

$$R^2_{adj} = 1 - (1 - R^2)\,\frac{N - 1}{N - n - 1}$$

where *N* is the number of observations in the data set (usually
the number of people) and *n* the number of independent variables
or **regressors**. This allows for the extra regressors. You can see
that *R*^{2}_{adj} will always be lower than *R*^{2} if there is more than one regressor. There is also another way of assessing
goodness of fit in multiple regression, using the *F* statistic which
is discussed below. It is possible in principle to take the square root
of *R*^{2} or *R*^{2}_{adj} to get what
is called the **multiple correlation coefficient**, but we don't usually
bother.
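As a quick worked example, the adjustment can be written as a small function. This sketch uses the standard form of the adjustment, 1 - (1 - *R*^{2})(*N* - 1)/(*N* - *n* - 1). The function name is ours, and the input values (an *R*^{2} of 0.52 from 16 observations and 3 regressors) are chosen purely for illustration.

```python
def adjusted_r_squared(r_squared, n_obs, n_regressors):
    """R-squared adjusted for the number of regressors.

    Uses 1 - (1 - R^2) * (N - 1) / (N - n - 1).
    """
    residual_df = n_obs - n_regressors - 1
    if residual_df <= 0:
        raise ValueError("need more observations than regressors plus one")
    return 1 - (1 - r_squared) * (n_obs - 1) / residual_df


# Illustrative values (an assumption): R^2 = 0.52, N = 16, n = 3.
r2 = 0.52
r2_adj = adjusted_r_squared(r2, n_obs=16, n_regressors=3)
print(f"R^2 = {r2:.2f}, adjusted R^2 = {r2_adj:.2f}")

# The penalty shrinks as the sample grows relative to the number of regressors.
r2_adj_large_sample = adjusted_r_squared(r2, n_obs=160, n_regressors=3)
print(f"with N = 160: adjusted R^2 = {r2_adj_large_sample:.3f}")
```

With a small sample the adjustment is substantial; with a sample ten times larger it nearly disappears.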

#### Prediction

Regression equations can also be used to obtain **predicted** or **fitted** values of the dependent variable for given values of the
independent variable. If we know the values of *x*_{1}, *x*_{2},
... *x*_{n}, it is obviously a simple matter to calculate
the value of *y* which, according to the equation, should correspond
to them: we just multiply *x*_{1} by *b*_{1}, *x*_{2} by *b*_{2}, and so on, and add all the
products to *b*_{0}. We can do this for combinations of independent variables
that are represented in the data, and also for new combinations. We need
to be careful, though, of extending the independent variable values far
outside the range we have observed (**extrapolating**), as it is not guaranteed that the regression equation will still hold accurately.
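The calculation of a fitted value, together with a simple guard against extrapolation, can be sketched as follows. The coefficients and observed ranges are hypothetical, and the range check is just one crude way of flagging values outside what was observed; it does not catch unusual *combinations* of values that each lie within range.

```python
def predict(coefficients, x_values):
    """Fitted y for coefficients [b0, b1, ..., bn] and values [x1, ..., xn]."""
    b0, *bs = coefficients
    if len(bs) != len(x_values):
        raise ValueError("need exactly one x value per regressor")
    return b0 + sum(b * x for b, x in zip(bs, x_values))


def outside_observed_range(x_values, observed_ranges):
    """For each regressor, True if the value lies outside its (low, high) range."""
    return [not (low <= x <= high)
            for x, (low, high) in zip(x_values, observed_ranges)]


# Hypothetical fitted equation: y = 1.0 + 2.0*x1 - 0.5*x2.
coefficients = [1.0, 2.0, -0.5]
# Hypothetical ranges of x1 and x2 in the data the equation was fitted to.
observed_ranges = [(1.0, 6.0), (1.0, 6.0)]

y_hat = predict(coefficients, [3.0, 4.0])          # 1 + 6 - 2
flags_inside = outside_observed_range([3.0, 4.0], observed_ranges)
flags_outside = outside_observed_range([10.0, 4.0], observed_ranges)

print(f"predicted y = {y_hat}")
print(f"extrapolating for (3, 4)?  {any(flags_inside)}")
print(f"extrapolating for (10, 4)? {any(flags_outside)}")
```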

### Interpreting and reporting multiple regression results

#### The main questions multiple regression answers

Multiple regression enables us to answer five main questions about a
set of data, in which *n* independent variables (regressors), *x*_{1} to *x*_{n}, are being used to explain the variation in a single
dependent variable, *y*.

- How well do the regressors, taken together, explain the variation in the dependent variable? This is assessed by the value of *R*^{2}_{adj}. As a very rough guide, in psychological applications we would usually reckon an *R*^{2}_{adj} of above 75% as very good; 50-75% as good; 25-50% as fair; and below 25% as poor and perhaps unacceptable. Alas, *R*^{2}_{adj} values above 90% are rare in psychological data, and if you get one, you should wonder whether there is some artifact in your data.
- Are the regressors, taken together, significantly associated with the dependent variable? This is assessed by the statistic *F* in the "Analysis of Variance" or ANOVA part of the regression output from a statistics package. This is the Fisher *F* as used in the ordinary ANOVA, so its significance depends on its **degrees of freedom**, which in turn depend on sample sizes and/or the nature of the test used. As in ANOVA, *F* has two degrees of freedom associated with it. In general they are referred to as the **numerator** and **denominator** degrees of freedom (because *F* is actually a ratio). In regression, the numerator degrees of freedom are associated with the regression (and equal the number of regressors used), and the denominator degrees of freedom with the **residual** or **error**; you can find them in the Regression and Error rows of the ANOVA table in the output from a statistics package. If you were finding the significance of an *F* value by looking it up in a book of tables, you would need the degrees of freedom to do it. Statistics packages normally work out significances for you, and you will find them in the ANOVA table next to the *F* value; but you need to use the degrees of freedom when reporting the results (see below). It is useful to remember that the higher the value of *F*, the more significant it will be for given degrees of freedom.

- What relationship does each regressor have with the dependent variable when all other regressors are held constant? This is answered by looking at the regression coefficients. Some statistics packages (e.g. XLSTAT) report these twice, once in the form of a regression equation and again (to an extra decimal place) in a table of regression coefficients and associated statistics. Note that regression coefficients have units. So if the dependent variable is number of cigarettes smoked per week, and one of the regressors is annual income, the coefficient for that regressor would have units of (cigarettes per week) per (pound of income per year). That means that if we changed the units of one of the variables, the regression coefficient would change - but the relationship it is describing, and what it is saying about it, would not. So the size of a regression coefficient doesn't tell us anything about the strength of the relationship it describes until we have taken the units into account. The fact that regression coefficients have units also means that we can give a precise interpretation to each coefficient. So, staying with smoking and income, a coefficient of 0.062 in this case would mean that, with all other variables held constant, increasing someone's income by one pound per year is associated with an increase of cigarette consumption of 0.062 cigarettes per week (we might want to make this easier to grasp by saying that an increase in income of 1 pound per week would be associated with an increase in cigarette consumption of 52 × 0.062 = 3.2 cigarettes per week). Negative coefficients mean that when the regressor increases, the dependent variable decreases. If the regressor is a **dichotomous** variable (e.g. gender), the size of the coefficient tells us the size of the difference between the two classes of individual (again, with all other variables held constant). So a gender coefficient of 2.6, with women coded 0 and men coded 1, would mean that with all other variables held constant, men's dependent variable scores would average 2.6 units higher than women's.
- Which independent variable has most effect on the dependent variable? It is not possible to give a fully satisfactory answer to this question, for a number of reasons. The chief one is that we are always looking at the effect of each variable *in the presence of all the others*; since the independent variables need not be independent of one another, it is hard to be sure which one is contributing to a joint relationship (or even to be sure that that means anything). However, the usual way of addressing the question is to look at the **standardized regression coefficients** or **beta weights** for each variable; these are the regression coefficients we would get if we converted all variables (independent and dependent) to **z-scores** before doing the regression. XLSTAT reports beta weights for each independent variable in its regression output.
- Are the relationships of each regressor with the dependent variable statistically significant, with all other regressors taken into account? This is answered by looking at the *t* values in the table of regression coefficients. The degrees of freedom for *t* are those for the residual in the ANOVA table, but statistics packages work out significances for us, so we need to know the degrees of freedom only when it comes to reporting results. Note that if a regression coefficient is negative, most packages will report the corresponding *t* value as negative, but if you were looking it up in tables, you would use the **absolute** (unsigned) value, and the sign should be dropped when reporting results.
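The points about units and beta weights can be illustrated with a little arithmetic. In this sketch the coefficient of 0.062 comes from the smoking example in the list above; the standard deviations are hypothetical, and the conversion used (beta = *b* × SD of the regressor / SD of the dependent variable) is the standard identity that follows from rescaling the variables to z-scores.

```python
WEEKS_PER_YEAR = 52

# Coefficient from the text: (cigarettes per week) per (pound of income per year).
b_per_pound_per_year = 0.062

# Re-expressing income in pounds per week multiplies the coefficient by 52.
b_per_pound_per_week = b_per_pound_per_year * WEEKS_PER_YEAR


def beta_weight(b, sd_regressor, sd_dependent):
    """Standardized coefficient: the b we would get from z-scored variables."""
    return b * sd_regressor / sd_dependent


# Hypothetical standard deviations (assumptions for illustration only).
sd_income_per_year = 260.0                              # pounds per year
sd_income_per_week = sd_income_per_year / WEEKS_PER_YEAR
sd_cigarettes = 40.0                                    # cigarettes per week

beta_from_yearly = beta_weight(b_per_pound_per_year, sd_income_per_year, sd_cigarettes)
beta_from_weekly = beta_weight(b_per_pound_per_week, sd_income_per_week, sd_cigarettes)

print(f"b per pound per week = {b_per_pound_per_week:.3f}")
print(f"beta (yearly units) = {beta_from_yearly:.3f}")
print(f"beta (weekly units) = {beta_from_weekly:.3f}")
```

The raw coefficient changes by a factor of 52 when the units change, but the beta weight is the same either way, which is why beta weights are used to compare regressors measured in different units.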

#### Further questions to ask

Either the nature of the data, or the regression results, may suggest
further questions. For example, you may want to obtain means and standard
deviations or histograms of variables to check on their distributions;
or plot one variable against another, or obtain a matrix of correlations,
to check on first order relationships. You should also check for unusual
observations or "**outliers**".

#### Reporting regression results

Research articles frequently report the results of several different regressions done on a single data set. In this case, it is best to present the results in a table. Where a single regression is done, however, that is unnecessary, and the results can be reported in text. The wording should be something like the following:

The data were analyzed by multiple regression, using as regressors age,
income and gender. The regression was a rather poor fit (*R*^{2}_{adj} = .40), but the overall relationship was significant (*F*_{3,12} = 4.32, *p* < 0.05). With other variables held constant, depression
scores were negatively related to age and income, decreasing by 0.16 for
every extra year of age, and by 0.09 for every extra pound per week income.
Women tended to have higher scores than men, by 3.3 units. Only the effect
of income was significant (*t*_{12} = 3.18, *p* <
0.01).

Normally you will need to go on to discuss the meaning of the trends you have described.
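It can be reassuring to check that the figures in a report like the one above hang together. The sketch below does this using two standard identities that are not derived in these notes: the adjusted *R*^{2} formula solved for *R*^{2}, and *F* = (*R*^{2} / *n*) / ((1 - *R*^{2}) / (*N* - *n* - 1)). The sample size of 16 is inferred from the reported degrees of freedom (3 regressors plus 12 residual degrees of freedom plus 1), and the function names are ours.

```python
def r_squared_from_adjusted(r2_adj, n_obs, n_regressors):
    """Invert R2_adj = 1 - (1 - R2) * (N - 1) / (N - n - 1) to recover R2."""
    return 1 - (1 - r2_adj) * (n_obs - n_regressors - 1) / (n_obs - 1)


def f_statistic(r_squared, n_obs, n_regressors):
    """Overall F for the regression, with its two degrees of freedom."""
    df_numerator = n_regressors
    df_denominator = n_obs - n_regressors - 1
    f = (r_squared / df_numerator) / ((1 - r_squared) / df_denominator)
    return f, df_numerator, df_denominator


# Figures from the example report: R2_adj = .40 with 3 regressors.
n_regressors = 3
n_obs = 12 + n_regressors + 1   # from the residual degrees of freedom

r2 = r_squared_from_adjusted(0.40, n_obs, n_regressors)
f, df1, df2 = f_statistic(r2, n_obs, n_regressors)

print(f"R^2 = {r2:.2f}")
print(f"F({df1},{df2}) = {f:.2f}")
```

This gives an *F* of about 4.33 on 3 and 12 degrees of freedom, which agrees with the reported 4.32 to within the rounding of *R*^{2}_{adj}.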

Note the following pitfalls for the unwary:

- The above brief paragraph does not exhaust what you can say about a set of regression results. There may be features of the data you should look at - "Unusual observations", for example.
- Always report what happened before moving on to its significance - so *R*^{2}_{adj} values before *F* values, regression coefficients before *t* values. Remember, descriptive statistics are more important than significance tests.
- Degrees of freedom for both *F* and *t* values must be given. Usually they are written as subscripts. For *F* the numerator degrees of freedom are given first. You can also put degrees of freedom in parentheses, or report them explicitly, e.g.: "*F*(3,12) = 4.32" or "*F* = 4.32, d. of f. = 3, 12".
- Significance levels can either be reported exactly (e.g. *p* = 0.032) or in terms of conventional levels (e.g. *p* < 0.05). There are arguments in favor of either, so it doesn't much matter which you do. But you should be consistent in any one report.
- Beware of highly significant *F* or *t* values, whose significance levels will be reported by statistics packages as, for example, 0.0000. It is an act of statistical illiteracy to write *p* = 0.0000, because significance levels can never be exactly zero - there is always *some* probability that the observed data could arise if the **null hypothesis** was true. What the package means is that this probability is so low it can't be represented with the number of columns available. We should write it as *p* < 0.00005 (or, if we are using conventional levels, *p* < 0.001).
- Beware of **spurious precision**, i.e. reporting coefficients etc. to huge numbers of **significant figures** when, on the basis of the sample you have, you couldn't possibly expect them to replicate to anything like that degree of precision if someone repeated the study. *F* and *t* values are conventionally reported to two decimal places, and *R*^{2}_{adj} values to the nearest percentage point (sometimes to one decimal place). For coefficients, you should be guided by the sample size: with a sample size of 16 as in the example above, two significant figures is plenty, but even with more realistic samples, in the range of 100 to 1000, three significant figures is usually as far as you should go. This means that you will usually have to round off the numbers that statistics packages give you.
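If you are tidying up package output by program rather than by hand, the last two pitfalls can be handled with a couple of small helpers. This is only a sketch: the function names and the 0.001 threshold are our choices, following the conventional levels mentioned above, and the input numbers are invented.

```python
from math import floor, log10


def round_sig(value, figures):
    """Round a number to the given count of significant figures."""
    if value == 0:
        return 0.0
    return round(value, figures - 1 - floor(log10(abs(value))))


def format_p(p, threshold=0.001):
    """Report small p values as an inequality, never as 'p = 0.0000'."""
    if p < threshold:
        return f"p < {threshold}"
    return f"p = {p:.3f}"


# Invented package output, shown to far more digits than the sample justifies.
print(round_sig(0.16384, 2))     # a coefficient, kept to two significant figures
print(round_sig(-3.28471, 2))
print(format_p(0.0))             # what a package might print as 0.0000
print(format_p(0.03217))
```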

### Carrying out multiple regression in XLSTAT

See the multiple regression videos given on the "readings" page.
