     Regression....
Instructor:          Jeremy Jackson   |    Winter, 2015

Office:                NW 3431    |   New Westminster
Niels Bohr: "Prediction is very difficult, especially if it's about the future."

Selected sections from: Stephen Lea, University of Exeter: http://people.exeter.ac.uk/SEGLea/multvar2/multreg1.html

## Multiple regression: Introduction

### What is multiple regression, and what is it used for?

Mathematically, multiple regression is a straightforward generalization of simple linear regression, which is the process of fitting a straight line through a set of points in a two-dimensional Euclidean space (what you may know as a scatter-plot). Simple linear regression is done in order to estimate one variable (called the dependent variable or criterion) using the other variable (called the independent variable or predictor). In multiple regression we have one dependent variable and multiple predictor variables.

Regression (simple and multiple) techniques are closely related to the analysis of variance (ANOVA). Both are special cases of the General Linear Model or GLM, and you can in fact do an ANOVA using the regression commands in statistical packages.

What distinguishes multiple regression from other techniques? The following are the main points:

• In multiple regression, we work with one dependent variable and many independent variables. In simple regression, there is only one independent variable.
• In multiple regression, the independent variables may be correlated. In analysis of variance, we arrange for all the independent variables to vary completely independently of each other.
• In multiple regression, the independent variables can be continuous. For analysis of variance, they have to be categorical, and if they are naturally continuous, we have to force them into categories, for example by a median split.

This means that multiple regression is useful in the following general class of situations. We observe one dependent variable, whose variation we want to explain in terms of a number of other independent variables, which we can also observe. These other variables are typically not under experimental control - we accept the variations in them that happen to occur in the sample of people or situations we can observe. We want to know which if any of these independent variables is significantly correlated with the dependent variable, taking into account the various correlations that may exist between the independent variables. So typically we use multiple regression to analyze data that come from "natural" rather than experimental situations. This makes it very useful in social psychology, and social science generally. Note, however, that it is inherently a correlational technique; it cannot of itself tell us anything about the causalities that may underlie the relationships it describes.

Jeremy's note: Stephen makes what is to me a classic mistake here that I would like you to understand. There is nothing at all about any statistical technique or model that has anything to do with causality or correlation. In the social sciences, causal conclusions are allowed when there is random assignment of subjects to levels of an IV (we call such designs experiments). This is a research design issue; it has nothing at all to do with how the data are analyzed. Recall that Stephen said earlier that multiple regression and ANOVA can be run by the same commands in statistical packages! Nothing in any statistical package or technique requires us to specify how the subjects were assigned to levels of an IV. This alone tells us that causal conclusions have nothing to do with statistical analysis.

There are some additional rules that have to be obeyed if multiple regression is to be useful:

• The units (usually people) we observe should be a random sample from some well defined population. This is a basic requirement for all statistical work if we want to draw any kind of general inference from the observations we have made.
• The dependent variable should be ordered and polychotomous or continuous. If the dependent variable is only measured on a nominal (unordered category, including dichotomies) scale, we have to use discriminant analysis or logistic regression instead.
• The independent variables should also be ordered, but dichotomous categorical variables can be used directly, and there is a way of dealing with polychotomous categorical variables as well.
• The distributions of all the variables should be roughly normal.
• The relationships between the dependent variable and the independent variable should be linear. That is, it should be possible to draw a rough straight line through an x-y scatter-gram of the observed points. If the line looks curved, but is monotonic (increases or decreases all the time), things are not too bad and could be made better by transformation. If the line looks U-shaped, we will need to take special steps before regression can be used.
• Although the independent variables can be correlated, there must be no perfect (or near-perfect) correlations among them, a situation called multicollinearity (which will be explained later in the course).
• There must be no interactions, in the ANOVA sense, between independent variables - the effect of each on the dependent variable must be roughly independent of the effects of all others. However, if interactions are obviously present, and not too complex, there are special steps we can take to cope with the situation.

### The regression equation

Like many statistical procedures, multiple regression has two uses: to summarize, represent or describe the features of a set of data, and to test hypotheses about the population from which the data were sampled. The first of these is part of descriptive statistics, the second of inferential statistics. We spend most of our time in elementary statistics courses on basic concepts in inferential statistics but descriptive statistics are often more important and useful. In this section, we concentrate on how multiple regression describes a set of data.

#### How do we choose a descriptive statistic?

Any method we use to summarize a set of numbers is part of descriptive statistics (more sophisticated statisticians would call this "data analysis"). Many different descriptive statistics can be calculated for a given set of data, and different ones are useful for different purposes. In many cases, a descriptive statistic is chosen because it is in some sense the best summary of a particular type. But what do we mean by "best"?

Consider the best known of all descriptive statistics in psychology, the arithmetic mean - what lay people call the average. Why is this the best summary of a set of numbers? There is an answer, but it isn't obvious. The mean is the value from which the numbers in the set have the minimum sum of squared deviations. For the meaning of this, see Figure 1.

Figure 1

Consider observation 1. Its y value is $y_1$. If we consider an "average" value $\bar{y}$, we define the deviation from the average as $y_1 - \bar{y}$, the squared deviation from the average as $(y_1 - \bar{y})^2$, and the sum of squared deviations as $\sum_i (y_i - \bar{y})^2$. The arithmetic mean turns out to be the value of $\bar{y}$ that makes this sum lowest. It also, of course, has the property that $\sum_i (y_i - \bar{y}) = 0$; that, indeed, is its definition.
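The least-squares property of the mean can be checked numerically. The following is a minimal sketch, assuming a small set of invented y values (they are not from Figure 1): it tries many candidate "averages" on a grid and confirms that the one with the smallest sum of squared deviations is the arithmetic mean.

```python
# Hypothetical data, invented purely for illustration.
ys = [2.0, 4.0, 4.0, 5.0, 10.0]

def sum_sq_dev(values, centre):
    """Sum of squared deviations of the values from a candidate 'average'."""
    return sum((v - centre) ** 2 for v in values)

mean = sum(ys) / len(ys)

# Try candidate centres on a fine grid and keep the one with the smallest sum.
candidates = [i / 100 for i in range(0, 1001)]   # 0.00, 0.01, ..., 10.00
best = min(candidates, key=lambda c: sum_sq_dev(ys, c))

print(mean, best)                    # both are 5.0 for this data
print(sum(v - mean for v in ys))     # deviations from the mean sum to zero
```

A grid search is used here only to make the point visible; in practice the mean is simply computed directly.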

#### Describing data with a simple regression equation

If we look at Figure 1, it's obvious that we could summarize the data better if we could find some way of representing the fact that the observations with high y values tend to be those with high x values. Graphically, we can do this by drawing a straight line on the graph so it passes through the cluster of points, as in Figure 2. Simple regression is a way of choosing the best straight line for this job.

Figure 2

This raises two problems: what is the best straight line, and how can we describe it when we have found it?
Let's deal first with describing a straight line. Any straight line can be described by an equation relating the y values to the x values. In general, we usually write,

y = mx + c

Here m and c are constants whose values tell us which of the infinite number of possible straight lines we are looking at. m (from French monter) tells us about the slope or gradient of the line. Positive m means the line slopes upwards to the right; negative m that it slopes downwards. High m values mean a steep slope, low values a shallow one. c (from French couper) tells us about the intercept, i.e. where the line cuts the y axis: positive c means that when x is zero, y has a positive value, negative c means that when x is zero, y has a negative value. But for regression purposes, it's more convenient to use different symbols. We usually write:

y = a + bx

or more generally

y = b0 + b1x1

This is just the same equation with different names for the constants: a is the intercept, b is the gradient.

The problem of choosing the best straight line then comes down to finding the best values of a and b. We define "best" in the same way as we did when we explained why the mean is the best summary: we choose the a and b values that give us the line such that the sum of squared deviations from the line is minimized. This is illustrated in Figure 3. The best line is called the regression line, and the equation describing it is called the regression equation. The deviations from the line are also called residuals.

Figure 3
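For a single predictor, the least-squares values of a and b have a well-known closed form: b is the sum of cross-products of deviations divided by the sum of squared x deviations, and a is chosen so the line passes through the point of means. The sketch below assumes a small invented data set and hypothetical function names; it is meant as an illustration, not as what any particular statistics package does internally.

```python
def fit_line(xs, ys):
    """Return (a, b) minimising the sum of squared residuals of y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx                 # gradient
    a = mean_y - b * mean_x       # intercept: line passes through the means
    return a, b

def sum_sq_residuals(xs, ys, a, b):
    """Sum of squared deviations of the points from the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, invented for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b = fit_line(xs, ys)
print(a, b)                                   # roughly 2.2 and 0.6
print(sum_sq_residuals(xs, ys, a, b))         # smallest achievable value
print(sum_sq_residuals(xs, ys, a, b + 0.1))   # any other line does worse
```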

#### Goodness of fit

Having found the best straight line, the next question is how well it describes the data. We measure this by the fraction

```
    (sum of squared deviations from the line)
1 - -----------------------------------------
    (sum of squared deviations from the mean)
```

This is called the variance accounted for, symbolized by R2. Its square root is the Pearson correlation coefficient. R2 can vary from 0 (the points are completely random) to 1 (all the points lie exactly on the regression line); quite often it is reported as a percentage (e.g. 73% instead of 0.73). The Pearson correlation coefficient (usually symbolized by r) is always reported as a decimal value. It can take values from -1 to +1; if the value of b is negative, the value of r will also be negative.

Note that two sets of data can have identical a and b values and very different R2 values, or vice versa. Correlation measures the strength of a linear relationship: it tells you how much scatter there is about the best fitting straight line through a scatter-gram. a and b, on the other hand, tell you what the line is. The values of a and b will depend on the units of measurement used, but the value of r is independent of units. If we transform y and x to z-scores, which involves rescaling them so they have means of zero and standard deviations of 1, b will equal r.

Note carefully that a, b, R2 and r are all descriptive statistics. We have not said anything about hypothesis tests. Given a set of paired x and y values, we can use virtually any statistics package to find the corresponding values of a, b and R2.
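The claims in this section can be illustrated with a short calculation. This is a sketch under stated assumptions: the data are invented, and the function names are hypothetical. It computes R2 from the two sums of squared deviations, computes r directly, and then checks that changing the units of x changes b but not r, and that after converting both variables to z-scores the slope equals r.

```python
import math

def describe_fit(xs, ys):
    """Return (a, b, r_squared, r) for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)   # squared deviations from the mean
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_line = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1 - ss_line / syy
    r = sxy / math.sqrt(sxx * syy)             # Pearson r carries the sign of b
    return a, b, r_squared, r

def z_scores(values):
    """Rescale values to mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

# Hypothetical data, invented for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b, r_squared, r = describe_fit(xs, ys)
_, b_rescaled, _, r_rescaled = describe_fit([x * 100 for x in xs], ys)  # new x units
_, b_z, _, _ = describe_fit(z_scores(xs), z_scores(ys))                 # standardized

print(r_squared, r)            # R2 is the square of r
print(b, b_rescaled)           # the slope depends on the units of x
print(r, r_rescaled, b_z)      # r does not; the standardized slope equals r
```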

### From simple regression to multiple regression

What happens if we have more than two independent variables? In most cases, we can't draw graphs to illustrate the relationship between them all. But we can still represent the relationship by an equation. This is what multiple regression does. It's a straightforward extension of simple regression. If there are n independent variables, we call them x1, x2, x3 and so on up to xn. Multiple regression then finds values of b0, b1, b2, b3 and so on up to bn which give the best fitting equation of the form

y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn

b1 is called the coefficient of x1, b2 is the coefficient of x2, and so forth. The equation is exactly like the one for simple regression, except that it is very laborious to work out the values of b0, b1, etc. by hand. Most statistics packages, however, do it with exactly the same command as for simple regression.

What do the regression coefficients mean? The coefficient of each independent variable tells us what relation that variable has with y, the dependent variable, when all the other independent variables are held constant. So, if b1 is high and positive, that means that if x2, x3 and so on up to xn do not change, then increases in x1 will correspond to large increases in y.
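To make the "best fitting equation" concrete, here is a minimal sketch of one standard way to obtain b0 to bn: solving the least-squares normal equations. This is an assumption about method chosen for illustration; statistics packages generally use more numerically robust routines. The data are invented and were generated exactly from y = 1 + 2*x1 - 3*x2, so the fit should recover those coefficients even though x1 and x2 are correlated.

```python
def solve(A, v):
    """Solve the linear system A x = v by Gaussian elimination with pivoting."""
    n = len(v)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]    # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                        # back-substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_multiple(rows, ys):
    """rows: one [x1, ..., xn] list per observation. Returns [b0, b1, ..., bn]."""
    X = [[1.0] + list(r) for r in rows]                 # leading 1 gives b0
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(k)]
    return solve(XtX, Xty)

# Hypothetical data: y = 1 + 2*x1 - 3*x2 exactly, with correlated x1 and x2.
rows = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]]
ys = [-3.0, 2.0, -5.0, 0.0, -7.0]

b0, b1, b2 = fit_multiple(rows, ys)
print(b0, b1, b2)   # close to 1, 2 and -3
```

Here b1 close to 2 means that, with x2 held constant, a one-unit increase in x1 corresponds to an increase of about 2 in y.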

#### Goodness of fit in multiple regression

In multiple regression, as in simple regression, we can work out a value for R2. However, every time we add another independent variable, we necessarily increase the value of R2 (you can get a feel for how this happens if you compare Fig 3 with Fig 1). Therefore, in assessing the goodness of fit of a regression equation, we usually work in terms of a slightly different statistic, called R2-adjusted or R2adj. This is calculated as

R2adj = 1 - (1 - R2)(N - 1)/(N - n - 1)

where N is the number of observations in the data set (usually the number of people) and n the number of independent variables or regressors. This allows for the extra regressors. Because (N - 1)/(N - n - 1) is greater than 1 whenever there is at least one regressor, R2adj will always be lower than R2 (unless the fit is perfect), and the gap grows as more regressors are added. There is also another way of assessing goodness of fit in multiple regression, using the F statistic which is discussed below. It is possible in principle to take the square root of R2 or R2adj to get what is called the multiple correlation coefficient, but we don't usually bother.
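As a quick illustration, the adjustment can be written as a small function. This sketch uses the standard form of the adjustment, 1 - (1 - R2)(N - 1)/(N - n - 1); the function name and the example numbers are hypothetical.

```python
def adjusted_r_squared(r_squared, n_obs, n_regressors):
    """R2 adjusted for the number of regressors (standard form)."""
    df_residual = n_obs - n_regressors - 1
    if df_residual <= 0:
        raise ValueError("need more observations than regressors plus one")
    return 1 - (1 - r_squared) * (n_obs - 1) / df_residual

# Hypothetical example: R2 = 0.52 from 16 observations.
print(adjusted_r_squared(0.52, 16, 3))   # about 0.40 with 3 regressors
print(adjusted_r_squared(0.52, 16, 6))   # lower still with 6 regressors
```

The same R2 is worth less when it was bought with more regressors, which is the point of the adjustment.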

#### Prediction

Regression equations can also be used to obtain predicted or fitted values of the dependent variable for given values of the independent variable. If we know the values of x1, x2, ... xn, it is obviously a simple matter to calculate the value of y which, according to the equation, should correspond to them: we just multiply x1 by b1, x2 by b2, and so on, and add all the products to b0. We can do this for combinations of independent variables that are represented in the data, and also for new combinations. We need to be careful, though, of extending the independent variable values far outside the range we have observed (extrapolating), as it is not guaranteed that the regression equation will still hold accurately.
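The calculation described above is short enough to sketch directly. The coefficients and observed values below are invented, and both function names are hypothetical. The second function is one simple way, among others, to flag extrapolation: it reports which x values lie outside the range seen in the data.

```python
def predict(coefficients, xs):
    """Fitted y for coefficients [b0, b1, ..., bn] and values [x1, ..., xn]."""
    b0, *bs = coefficients
    if len(bs) != len(xs):
        raise ValueError("need exactly one x value per regressor")
    return b0 + sum(b * x for b, x in zip(bs, xs))

def outside_observed_range(xs, observed_rows):
    """For each x value, True if it lies outside the observed min..max."""
    flags = []
    for j, x in enumerate(xs):
        column = [row[j] for row in observed_rows]
        flags.append(x < min(column) or x > max(column))
    return flags

# Hypothetical regression equation: y = 1 + 2*x1 - 3*x2.
coefficients = [1.0, 2.0, -3.0]
observed_rows = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]]

print(predict(coefficients, [3.0, 2.0]))                  # 1 + 6 - 6 = 1.0
print(outside_observed_range([10.0, 2.0], observed_rows)) # x1 = 10 is extrapolating
```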

### Interpreting and reporting multiple regression results

#### The main questions multiple regression answers

Multiple regression enables us to answer five main questions about a set of data, in which n independent variables (regressors), x1 to xn, are being used to explain the variation in a single dependent variable, y.

1. How well do the regressors, taken together, explain the variation in the dependent variable? This is assessed by the value of R2adj. As a very rough guide, in psychological applications we would usually reckon an R2adj of above 75% as very good; 50-75% as good; 25-50% as fair; and below 25% as poor and perhaps unacceptable. Alas, R2adj values above 90% are rare in psychological data, and if you get one, you should wonder whether there is some artifact in your data.
2. Are the regressors, taken together, significantly associated with the dependent variable? This is assessed by the statistic F in the "Analysis of Variance" or ANOVA part of the regression output from a statistics package. This is the Fisher F as used in the ordinary ANOVA, so its significance depends on its degrees of freedom, which in turn depend on sample sizes and/or the nature of the test used. As in ANOVA, F has two degrees of freedom associated with it. In general they are referred to as the numerator and denominator degrees of freedom (because F is actually a ratio). In regression, the numerator degrees of freedom are associated with the regression (and equal the number of regressors used), and the denominator degrees of freedom with the residual or error; you can find them in the Regression and Error rows of the ANOVA table in the output from a statistics package. If you were finding the significance of an F value by looking it up in a book of tables, you would need the degrees of freedom to do it. Statistics packages normally work out significances for you, and you will find them in the ANOVA table next to the F value; but you need to use the degrees of freedom when reporting the results (see below). It is useful to remember that the higher the value of F, the more significant it will be for given degrees of freedom.
3. What relationship does each regressor have with the dependent variable when all other regressors are held constant? This is answered by looking at the regression coefficients. Some statistics packages (e.g. XLSTAT) report these twice, once in the form of a regression equation and again (to an extra decimal place) in a table of regression coefficients and associated statistics. Note that regression coefficients have units. So if the dependent variable is number of cigarettes smoked per week, and one of the regressors is annual income, the coefficient for that regressor would have units of (cigarettes per week) per (pound of income per year). That means that if we changed the units of one of the variables, the regression coefficient would change - but the relationship it is describing, and what it is saying about it, would not. So the size of a regression coefficient doesn't tell us anything about the strength of the relationship it describes until we have taken the units into account. The fact that regression coefficients have units also means that we can give a precise interpretation to each coefficient. So, staying with smoking and income, a coefficient of 0.062 in this case would mean that, with all other variables held constant, increasing someone's income by one pound per year is associated with an increase of cigarette consumption of 0.062 cigarettes per week (we might want to make this easier to grasp by saying that an increase in income of 1 pound per week would be associated with an increase in cigarette consumption of 52 * 0.062 = 3.2 cigarettes per week). Negative coefficients mean that when the regressor increases, the dependent variable decreases. If the regressor is a dichotomous variable (e.g. gender), the size of the coefficient tells us the size of the difference between the two classes of individual (again, with all other variables held constant). So a gender coefficient of 2.6, with women coded 0 and men coded 1, would mean that with all other variables held constant, men's dependent variable scores would average 2.6 units higher than women's.
4. Which independent variable has most effect on the dependent variable? It is not possible to give a fully satisfactory answer to this question, for a number of reasons. The chief one is that we are always looking at the effect of each variable in the presence of all the others; since the independent variables need not be independent of one another, it is hard to be sure which one is contributing to a joint relationship (or even to be sure that that means anything). However, the usual way of addressing the question is to look at the standardized regression coefficients or beta weights for each variable; these are the regression coefficients we would get if we converted all variables (independent and dependent) to z-scores before doing the regression. XLSTAT reports beta weights for each independent variable in its regression output.
5. Are the relationships of each regressor with the dependent variable statistically significant, with all other regressors taken into account? This is answered by looking at the t values in the table of regression coefficients. The degrees of freedom for t are those for the residual in the ANOVA table, but statistics packages work out significances for us, so we need to know the degrees of freedom only when it comes to reporting results. Note that if a regression coefficient is negative, most packages will report the corresponding t value as negative, but if you were looking it up in tables, you would use the absolute (unsigned) value, and the sign should be dropped when reporting results.
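Three of the quantities in the list above are connected by simple arithmetic, sketched below with hypothetical function names and invented numbers. The overall F can be computed from R2 and the two degrees of freedom using the standard identity F = (R2/n) / ((1 - R2)/(N - n - 1)); a coefficient changes by a known factor when a regressor's units change; and a beta weight is the raw coefficient multiplied by the ratio of the regressor's standard deviation to that of the dependent variable. Finding the significance of F or t still needs tables or a statistics package.

```python
import statistics

def f_statistic(r_squared, n_obs, n_regressors):
    """Overall F for the regression, with its numerator and denominator df."""
    df_num = n_regressors
    df_den = n_obs - n_regressors - 1
    f = (r_squared / df_num) / ((1 - r_squared) / df_den)
    return f, df_num, df_den

def rescale_coefficient(b, factor):
    """Coefficient after the regressor's unit is made `factor` times larger."""
    return b * factor

def beta_weight(b, x_values, y_values):
    """Standardized coefficient: b rescaled by sd(x) / sd(y)."""
    return b * statistics.stdev(x_values) / statistics.stdev(y_values)

# Hypothetical R2 of 0.52 from 16 observations and 3 regressors.
print(f_statistic(0.52, 16, 3))          # F of about 4.33 on 3 and 12 df

# 0.062 cigarettes/week per pound/year, re-expressed per pound/week.
print(rescale_coefficient(0.062, 52))    # about 3.2

# Hypothetical single-regressor data with raw slope b = 0.6.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
print(beta_weight(0.6, xs, ys))          # about 0.77
```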

#### Further questions to ask

Either the nature of the data, or the regression results, may suggest further questions. For example, you may want to obtain means and standard deviations or histograms of variables to check on their distributions; or plot one variable against another, or obtain a matrix of correlations, to check on first order relationships. You should also check for unusual observations or "outliers".

#### Reporting regression results

Research articles frequently report the results of several different regressions done on a single data set. In this case, it is best to present the results in a table. Where a single regression is done, however, that is unnecessary, and the results can be reported in text. The wording should be something like the following:

The data were analyzed by multiple regression, using as regressors age, income and gender. The regression was a rather poor fit (R2adj = .40), but the overall relationship was significant (F3,12 = 4.32, p < 0.05). With other variables held constant, depression scores were negatively related to age and income, decreasing by 0.16 for every extra year of age, and by 0.09 for every extra pound per week income. Women tended to have higher scores than men, by 3.3 units. Only the effect of income was significant (t12 = 3.18, p < 0.01).

Normally you will need to go on to discuss the meaning of the trends you have described.

Note the following pitfalls for the unwary:

• The above brief paragraph does not exhaust what you can say about a set of regression results. There may be features of the data you should look at - "Unusual observations", for example.
• Always report what happened before moving on to its significance - so R2adj values before F values, regression coefficients before t values. Remember, descriptive statistics are more important than significance tests.
• Degrees of freedom for both F and t values must be given. Usually they are written as subscripts. For F the numerator degrees of freedom are given first. You can also put degrees of freedom in parentheses, or report them explicitly, e.g.: "F(3,12) = 4.32" or "F = 4.32, d. of f. = 3, 12".
• Significance levels can either be reported exactly (e.g. p = 0.032) or in terms of conventional levels (e.g. p < 0.05). There are arguments in favor of either, so it doesn't much matter which you do. But you should be consistent in any one report.
• Beware of highly significant F or t values, whose significance levels will be reported by statistics packages as, for example, 0.0000. It is an act of statistical illiteracy to write p = 0.0000, because significance levels can never be exactly zero - there is always some probability that the observed data could arise if the null hypothesis was true. What the package means is that this probability is so low it can't be represented with the number of columns available. We should write it as p < 0.00005 (or, if we are using conventional levels, p < 0.001).
• Beware of spurious precision, i.e. reporting coefficients etc to huge numbers of significant figures when, on the basis of the sample you have, you couldn't possibly expect them to replicate to anything like that degree of precision if someone repeated the study. F and t values are conventionally reported to two decimal places, and R2adj values to the nearest percentage point (sometimes to one decimal place). For coefficients, you should be guided by the sample size: with a sample size of 16 as in the example above, two significant figures is plenty, but even with more realistic samples, in the range of 100 to 1000, three significant figures is usually as far as you should go. This means that you will usually have to round off the numbers that statistics packages give you.
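Several of these reporting rules are mechanical enough to sketch as small helpers. The function names are hypothetical and the numbers are invented; the formats follow the conventions described in the list above (two decimal places for F, degrees of freedom in parentheses, never "p = 0.000", and a limited number of significant figures for coefficients).

```python
import math

def format_p(p, decimals=3):
    """Report p exactly, or as 'p < threshold' when it would round to zero."""
    threshold = 10 ** -decimals
    if p < threshold:
        return f"p < {threshold:.{decimals}f}"
    return f"p = {p:.{decimals}f}"

def round_sig(value, figures):
    """Round a value to the given number of significant figures."""
    if value == 0:
        return 0.0
    exponent = math.floor(math.log10(abs(value)))
    return round(value, figures - 1 - exponent)

def format_f(f, df_num, df_den, p):
    """F to two decimal places, numerator degrees of freedom first."""
    return f"F({df_num},{df_den}) = {f:.2f}, {format_p(p)}"

print(format_p(0.0000004))                 # p < 0.001
print(format_p(0.032))                     # p = 0.032
print(round_sig(0.162738, 2))              # 0.16
print(format_f(4.3178, 3, 12, 0.028))      # F(3,12) = 4.32, p = 0.028
```

With `decimals=3` the fallback is the conventional p < 0.001; a different threshold can be chosen to match the number of decimal places the package prints.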

### Carrying out multiple regression in XLSTAT

See the multiple regression videos given on the "readings" page.


All content copyright of Dr Jeremy Jackson - 2014. Douglas College, Vancouver, BC.