Readings on Hypothesis Testing...
Instructor:          Jeremy Jackson   |    Winter, 2017

Office:                NW 3431    |   New Westminster
Sir Ken Robinson: Learning happens in minds and souls, not in the databases of multiple-choice tests

Some Readings

In their report on methods in psychology, the APA task force on statistical inference recommends that all psychologists should read Cohen (1994). Please read this review of the parts of the Cohen paper that I have highlighted (open the paper and you will see the sections I have commented on below highlighted in yellow). Note that the paper is challenging; do your best and ask in class if you have questions. I do not expect you to understand the whole paper, but I would like you to extract basic themes from the paper. These themes are listed below in the review and highlighted in the paper.

Review of Highlighted Sections of the Cohen Paper

1) The Cohen paper is critical of what is called NHST. This is what we call hypothesis testing in psychology 2300, 3300 and in this class. In the abstract Cohen says:

"After 4 decades of severe criticism, the ritual of null hypothesis significance testing-mechanical dichotomous decisions around a sacred .05 criterion-still persists. This article reviews the problems with this practice, including its near-universal misinterpretation of p as the probability that Ho is false, the misinterpretation that its complement is the probability of successful replication, and the mistaken assumption that if one rejects Ho one thereby affirms the theory that led to the test."

The "mechanical dichotomous decisions" Cohen speaks of above are what we discussed in class. This is the idea that hypothesis testing is a "dichotomous decision procedure", not a procedure of discovery. We decide if the null is false, we do not discover that the null is false.

Cohen also says that p is misinterpreted. P is what we called the "P-value" in class. The p-value is the probability of the observed result, or something more extreme, given that the null hypothesis is true. It is often misinterpreted as the probability that the null hypothesis is true. This is not a problem with hypothesis testing per se, just a problem in how hypothesis testing is used and interpreted. For our purposes, this means that you should never interpret the p-value as the probability the null hypothesis is true.

Cohen also argues that the complement of the p-value (1-p) is NOT the probability that, if we did the experiment again, we would reject the null hypothesis. Let's say we calculate a p-value of .03. Then 1-.03=.97 is NOT the probability we would reject the null hypothesis if we did the experiment again. The point here is that it is often wrongly thought that hypothesis testing tells us something about how likely we would be to get the same or a similar result if we did the study again (replicated it). Actually, the p-value tells us nothing about whether or not a replication of the experiment would lead us to the same conclusion that we made the first time we conducted the experiment. This matters because replication is very important to science. If we get the same result when we do the experiment again and again, this means that the result is reliable and we can trust it. It means we have discovered something stable, predictable, etc. Again, this is not a problem with hypothesis testing per se; it is a problem with how the results of hypothesis testing are interpreted by the users of hypothesis testing.

Finally, Cohen argues above that a rejection of the null hypothesis does not mean that we have confirmed that the IV does actually have an effect on the DV (that our theory that the IV should affect the DV has been confirmed). It just means that in this single experiment, the p-value is less than alpha. So, for our purposes, this means that we should not say we have found more than we have really found in our experiment when we reject the null hypothesis. In order to confirm the theory that tells us that the IV should have an effect on the DV, we have to do much more than merely reject a null hypothesis. For instance, we have to replicate our result in multiple different studies using different samples of subjects (young, medium, old, students, non-students), different levels of the IV (say different dosages of the drug that our theory suggests should have an effect on the DV) and so on.

In a recently published study here, replications of 100 published psychological research papers were reported. In all of the 100 original studies, the existence of an effect was based on the hypothesis testing criterion that p < alpha. In an epic effort, psychologists around the world conducted each of these 100 research studies again to determine whether or not the effect found in the original research study would be found again in a replication. How many of the 100 studies do you think replicated (that is, in how many of the studies was p < alpha again)? You can read the paper to find out the details but it was about 1/3. In social psychology papers it was about 1/4! Cohen is saying that we have to be careful about the scientific legitimacy of hypothesis testing because it does not deliver the kind of result in which a good scientist is interested. It does not deliver a trustworthy, reliable, replicable result.

2) In the second part of the abstract Cohen says what he thinks scientists SHOULD be doing (as opposed to hypothesis testing):

"Exploratory data analysis and the use of graphic methods, a steady improvement in and a movement toward standardization in measurement, an emphasis on estimating effect sizes using confidence intervals, and the informed use of available statistical methods is suggested. For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication."

Exploratory data analysis and the use of graphic methods is what was called "descriptive statistics" in psychology 2300 and 3300. I will just say "data analysis". In this class we will look at univariate, bivariate and multivariate forms of data analysis. Univariate forms of data analysis are forms of analysis that assess features of 1 variable at a time (mean, median, standard deviation, histograms, etc.). Bivariate forms of data analysis assess features of two variables at a time (Pearson r, phi, conditional distributions, scatter plots, etc.). Multivariate data analyses assess features of more than 2 variables at a time (PCA, MDS, multiple regression, etc.).

Standardization of measurement means things like using one and only one IQ test to measure intelligence rather than using the 80 or so different tests we currently use. In our global temperature example in class, this would mean using the same kind of thermometers all around the world for measuring temperature. Or, placing the thermometers in similar environments around the world (putting all of them in a shady place for example, not putting some in the sun and some in the shade).

An emphasis on estimating effect size means that we should focus more on how big the mean differences are that we found or how big the correlation is that we found, not on whether the mean difference or correlation we found is statistically significant. In our global temperature example, this means estimating how much the temperature has changed around the globe, not whether the difference in temperature over time is statistically significant. You may have heard of Cohen's d or Eta squared...these are measures of effect size (by the way...a Pearson r is also a measure of effect size). We will calculate these measures using XLSTAT and/or SPSS in this class.
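As a rough illustration of what an effect size measures, here is a minimal Python sketch of Cohen's d for two groups, using the pooled standard deviation. The scores are invented for illustration and the function name `cohens_d` is hypothetical; it is not the XLSTAT or SPSS procedure used in class.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # variance() is the sample variance (n - 1 in the denominator).
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical scores, invented for illustration only.
drug = [12, 14, 15, 11, 13]
placebo = [10, 12, 13, 9, 11]
d = cohens_d(drug, placebo)
print(f"Cohen's d = {d:.2f}")
```

The result answers "how big is the difference, in standard deviation units?" rather than "is the difference statistically significant?".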

You will have heard of confidence intervals in your previous statistics classes. I talked about them under the heading of "Estimation". What Cohen is suggesting is that it is better to try to estimate what the population value is and then put a range around that value within which you are confident that the true population value falls. This range is the confidence interval. This would mean in our global temperature example that we would try to estimate what the actual global mean temperature is and then put a range around this temperature within which we feel confident the "true" global mean temperature falls. So we might say that we think the global mean temperature is 11.2 degrees, plus or minus .5 degrees. This means we would be confident that the true global temperature is somewhere between 10.7 and 11.7 degrees. This is an attempt to estimate what the actual global temperature is, in a given year, not a hypothesis test of whether or not there is no difference in global temperatures between two years or two periods of time.
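The estimate-plus-range idea can be sketched in a few lines of Python. This is a simplified sketch under stated assumptions: the temperature readings are invented, the function name `mean_confidence_interval` is hypothetical, and the interval uses the normal approximation (for a sample this small a t-based interval would be somewhat wider).

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(values, confidence=0.95):
    """Normal-approximation confidence interval for a population mean."""
    estimate = mean(values)
    standard_error = stdev(values) / sqrt(len(values))
    # Two-sided critical value (about 1.96 for 95% confidence).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * standard_error
    return estimate - margin, estimate, estimate + margin


# Hypothetical temperature readings in degrees, invented for illustration.
readings = [10.9, 11.5, 11.0, 11.4, 11.2, 11.3, 11.1, 11.2]
lower, estimate, upper = mean_confidence_interval(readings)
print(f"estimate {estimate:.1f}, interval {lower:.2f} to {upper:.2f}")
```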

3) In the quotation below, Cohen argues that the nearly sole reliance on rejecting the null hypothesis, at the expense of replication, estimation, data analysis, etc., has been a bad thing for psychology. This sort of view is actually pretty common. Cohen referred to a paper by "Bill Rozeboom" written 33 years earlier (a 1960 paper that you can find on the Classics in the History of Psychology website here) in which Rozeboom makes the strong case that since science is not a decision procedure, hypothesis testing is not good science. If you look carefully, you can find literally thousands of papers criticizing hypothesis testing. This is why the APA formed the task force on statistical inference which recommended that all students of psychology read the Cohen paper.

"I argue herein that NHST has not only failed to support the advance of psychology as a science but also has seriously impeded it."

4) In this next passage, Cohen is suggesting that scientists do ask for advice on the use of hypothesis tests that make no sense at all.

"Would you believe it? And would you believe that if he tried to publish this result without a significance test, one or more reviewers might complain? It could happen."

The reason a hypothesis test makes no sense in this case is that if the null hypothesis that there are no instances of the disease in the population is true, then the sample result (one case of the disease) could not possibly happen. It follows then that the null MUST BE false and so doing a hypothesis test makes no sense (we already know the null is false, so why would we do a test to decide if the null is false?).

5) In this section Cohen refers to Fisher and Neyman and Pearson.

"Fisherian dogma and had not yet heard of Neyman-Pearson"

As I said in class, these three men invented the techniques we now call hypothesis testing. They actually disagreed a fair bit on how hypothesis testing should be done. In the end, many people think that we now use a hybrid of Fisher's approach and Neyman and Pearson's approach. Whether the hybrid is a good thing or not is a whole other story....ask me in my office hours if you are interested.


6) In this part of the paper, Cohen argues one of the points he made above in the abstract (I listed this under my "point 1" above). He says:

"What we want to know is "Given these data, what is the probability that Ho is true?" But as most of us know, what it tells us is "Given that Ho is true, what is the probability of these (or more extreme) data?""

So, we want to know this...P(Ho|D). This means the probability(P) that the null hypothesis is true (Ho), given (|) what we have observed (D). So this would be something like....the probability she does not like me, given she kissed me 8 times. But this is not what the p-value we get in hypothesis testing tells us. What we get in hypothesis testing is....P(D|Ho). This means, the probability of what we have observed given the null hypothesis is true. So this would be something like...the probability she would kiss me 8 times given she does not like me. Cohen is saying that we really want to know the probability the null is true....but that's not what we get in hypothesis testing.
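To see that P(D|Ho) and P(Ho|D) can be very different numbers, here is a minimal Python sketch using Bayes' theorem. All three inputs are hypothetical values chosen for illustration (they are not from Cohen's paper), and D is treated as a single event rather than a tail area. Note that getting P(Ho|D) requires a prior probability for Ho, which a hypothesis test never supplies.

```python
def prob_null_given_data(prior_null, p_data_given_null, p_data_given_alt):
    """Bayes' theorem: P(Ho|D) from P(D|Ho), P(D|H1) and the prior P(Ho)."""
    prior_alt = 1 - prior_null
    # Total probability of the data under both hypotheses.
    p_data = p_data_given_null * prior_null + p_data_given_alt * prior_alt
    return p_data_given_null * prior_null / p_data


# Hypothetical inputs: the data are unlikely under Ho (.05),
# but Ho was quite plausible to begin with (.9).
posterior = prob_null_given_data(prior_null=0.9, p_data_given_null=0.05, p_data_given_alt=0.5)
print(f"P(D|Ho) = 0.05, but P(Ho|D) = {posterior:.2f}")
```

With these made-up inputs, data that would be "significant" at .05 still leave the null with a probability near one half.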

7) Here Cohen discusses a logic that would make sense:

"If this were the reasoning of Ho testing, then it would be formally correct."

What Cohen is saying is that it makes sense to argue...If D happens, then Ho is false. D happened, therefore Ho is false. The problem with hypothesis testing is that the logic is not this but this...D is unlikely if Ho is true. D happened THEREFORE Ho is false. But this is not correct reasoning. Just because D is unlikely if Ho is true, does not mean that when D happens Ho MUST BE false. Imagine: it is unlikely that I would be kissed 8 times if a girl does not like me, I was kissed 8 times, therefore the girl likes me. Or, it is unlikely to get 9 or more heads when we flip an unloaded/fair coin 10 times. I flipped the coin and got 9 heads THEREFORE the coin is loaded. Just because we have rejected the null (concluded she likes me, or concluded the coin is loaded) does not mean that she ACTUALLY likes me or that the coin is ACTUALLY loaded.
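The coin example can be worked out exactly. The short Python sketch below (the function name `prob_at_least` is hypothetical) computes the probability of 9 or more heads in 10 flips of a fair coin, which is the p-value for that result.

```python
from math import comb


def prob_at_least(k, n, p=0.5):
    """P(X >= k) when X is the number of heads in n flips with P(heads) = p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# 9 or 10 heads in 10 flips of a fair coin: (10 + 1) / 1024.
p_value = prob_at_least(9, 10)
print(f"p-value = {p_value:.4f}")
```

The value is small but not zero: a fair coin does produce this result about 11 times in every 1024 sets of 10 flips, so the result alone cannot prove the coin is loaded.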

8) At the end of this section Cohen provides a false reasoning that he says appears in paper after paper in which hypothesis testing is used. This is, he says, a form of argument that is not correct but is often made by researchers in their journal articles. Here it is:

"If Ho is true, then this result (statistical significance) would probably not occur. This result has occurred. Then Ho is probably not true and therefore formally invalid."

Hypothesis testing does not produce the probability that Ho is not true ("Ho is probably not true"). So, in this class, we are not going to talk about the probability Ho is true. We agree with Cohen….this is not a correct way to talk about our hypothesis testing results.

9) We spoke about this next issue in class. This is the idea that the null hypothesis is not the hypothesis of NO DIFFERENCE or NO EFFECT or NO CORRELATION. It is the hypothesis we wish to reject (as Fisher said, the hypothesis to be nullified). The second issue that Cohen mentions here is another idea we discussed in class. This is the idea that the null hypothesis is almost always wrong. Imagine a case in which the correlation between two variables in a population is EXACTLY 0. Does this ever happen? Imagine we collect the height of every Canadian and we collect the number of toasters each Canadian owns. We now calculate the correlation between height and toasters for all Canadians. Can we imagine that this correlation is EXACTLY 0? Well, it might be close to 0, say .0000012, but that's not 0. So, the idea is that the null is almost always wrong when it is phrased as a hypothesis of no difference or no effect.

"But as almost universally used, the null in Ho is taken to mean nil, zero. For Fisher, the null hypothesis was the hypothesis to be nullified. As if things were not bad enough in the interpretation, or misinterpretation, of NHST in this general sense, things get downright ridiculous when Ho is to the effect that the effect size (ES) is 0-that the population mean difference is 0, that the correlation is 0, that the proportion of males is .50, ... (an Ho that can almost always be rejected, even with a small sample - Heaven help us!)."

10) The point above is made a few times. Here, Cohen cites Tukey (one of the greatest statisticians in the world) as follows:

"Tukey (1991) wrote that "It is foolish to ask 'Are the effects of A and B different?' They are always different - for some decimal place" (p. 100)."

So again, the idea is that the null of NO DIFFERENCE in the population from which the samples have been drawn is always false because subjects in the conditions were treated differently (in an experiment, what subjects experience/do is different at different levels of the IV - a placebo contains different ingredients than the drug, so the drug will have SOME effect, even though it may be very small).

So, if the null is always false, the question is....what's the point of testing it? That is perhaps the greatest argument against hypothesis testing and the greatest argument in favour of estimation, descriptive statistics and replication.
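One consequence of a null that is never exactly true is that any nonzero difference, however tiny, becomes "significant" once the sample is large enough. The Python sketch below illustrates this with a two-sided z-test; the difference of 0.01, the standard deviation of 1 and the sample sizes are hypothetical, and the function name `two_sided_p` is made up for this example.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(mean_diff, sd, n_per_group):
    """Two-sided z-test p-value for a difference between two group means,
    assuming a known common standard deviation and equal group sizes."""
    standard_error = sd * sqrt(2 / n_per_group)
    z = mean_diff / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# The same tiny observed difference, tested at increasing sample sizes.
for n in (100, 10_000, 1_000_000):
    print(n, two_sided_p(mean_diff=0.01, sd=1.0, n_per_group=n))
```

The difference never changes; only the sample size does. This is why the size of the effect, not the p-value, is the more informative thing to report.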

11) So this leads to the question of what should be done instead of hypothesis testing. Cohen writes...

"First, don't look for a magic alternative to NHST, some other objective mechanical ritual to replace it. It doesn't exist."

For me, this is the KEY POINT we must learn before beginning a course on data analysis! That is, that an analysis of data is NOT a ritualistic, recipe driven process. There is no set procedure or method for analyzing data. There are tools, objectives, various forms of statistical analysis, etc., but no one set method. Every analysis is different!

This means that the good analyst has to be clear about their objective, understand the logic of a wide array of techniques, understand how to interpret the results of these techniques and be insightful and careful in their interpretation of what their analysis shows.

That's what this course is about!


It's important to understand that any analysis of data is only as good as the data itself. Good data is the foundation of science. So, before we move on, let's just consider some basic principles of good science as they apply to the data we collect (the observations we make).

What is Good Data?

1) Replication is widely regarded as the key to good science. As far as data goes, replication means doing a study over and over again under slightly different conditions to determine whether or not the result observed in the experiment is stable over time, situations, the sample being studied, etc. Such a practice also helps us determine the extent to which an experimental result is influenced by the conditions of the experiment. For example, we may wish to know the effects of a particular drug on depression. We may conduct a perfectly controlled clinical trial and conclude, based on the results of the trial, that the drug reduces the length of depressive episodes by an average of 10%. Based on this finding, we might set out to market the drug for use in the general population and conclude that the use of the drug will lead to a reduction of depression in the population. But imagine over time that we notice the amount of depression in the population is increasing even though the drug is in use in the population. How could this be? Well, there could be literally hundreds of factors that we did not account for in our initial experiment that might have an influence on the efficacy of a drug when it is used in the general population. Perhaps the drug tastes bad and so, unless the patient is required to take the drug on a regular schedule as they would be in a clinical trial, patients in the real-world take the drug less often than they should. Just imagine all the possible reasons why real-world results might not mirror the results of a clinical trial....they are literally endless.

As an aside, this, I believe, is why most educational research on the effectiveness of various different kinds of teaching methods does not translate to the real world. There is no reason that a method found to be effective in a particular, usually contrived, experimental situation should be expected to work when applied in the real-world. Take, for instance, the example of Dr Jordan Peterson. Dr Peterson was voted by the students of the University of Toronto as one of only 3 professors to be transformational in their teaching. His teaching deeply affects his students and many say their lives are changed by taking his course "Maps of Meaning". And what modern teaching method, discovered in modern educational research to be the best teaching method, does he use? He sits down for 2 hours each class and talks. No overhead, no Power-Point, no chalk-board....nothing, nothing modern at all. Take a look at this video, why do you think Dr Peterson is so revered by his students? Can you manipulate this in a research study? Can you implement this in a population of teachers?

This idea was beautifully articulated by Professor Richard Feynman, a Nobel Prize winning physicist. The following is a video made of an address he gave to the students of Caltech in 1974. In his address, Professor Feynman spoke about the need for scientists to be careful and about how results found in experiments may not work when applied to real world settings that are not at all like experimental settings. That is, the need for scientists to replicate their work in a wide array of contexts to determine whether their finding "works"....whether it replicates reliably across contexts.

Go ahead and watch the video here.

Back to replication. Here is a section on replication written by perhaps the greatest social scientist to ever live, Louis Guttman.

"Both estimation and the testing of hypotheses have usually been restricted as if to one-time experiments, both in theory and in practice. But the essence of science is replication: a scientist should always be concerned about what will happen when he or another scientist repeats his experiment."


2) The relationship of the sample to the population is an issue that we come across often in psychology. Most often, this issue is conceptualized as having to do with hypothesis testing and random sampling. I am sure you are all familiar with the idea that samples should be representative of the population from which the sample has been drawn. Many of you though, may believe that the method that ensures representativeness is random sampling. This is not correct. Consider a population of individuals....say 1,000,000 people.... that is 90% employed and 10% unemployed. Now imagine that we draw a random sample of 100 people from this population. What will the number of employed people be in the sample? The correct answer is somewhere between 0 and 100. Randomness means that sampling error is present. That means that it is possible, due to sampling error, to take samples that look very different from the population.

Now, it is true, in this example, that the sample may well contain 90 employed people and 10 unemployed people. But that is not guaranteed. It is also true that the most likely single outcome is that the sample includes 90 employed and 10 unemployed people. This probability can, in fact, be calculated using the Binomial Distribution. I did this and the probability is about .132. This means that the sample proportion of employed people will be perfectly representative of the population proportion of employed people about 13.2% of the time. The other 86.8% of the time, the sample proportion of employed people will be different from the population proportion of employed people. Even for a sample size of 100, the vast majority of the time, the sample will not be perfectly representative of the population!
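The binomial calculation described above can be reproduced with a short Python sketch (the function name `binomial_pmf` is hypothetical). It treats the population as large enough that each of the 100 draws has a .9 chance of being an employed person.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent draws with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Probability that a random sample of 100 contains exactly 90 employed people.
p_exact = binomial_pmf(90, 100, 0.9)
print(f"P(exactly 90 employed) = {p_exact:.3f}")
print(f"P(any other count)     = {1 - p_exact:.3f}")
```

The first number comes out at roughly .13, so an exactly representative sample on this one variable is the exception rather than the rule.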

And this is for just 1 variable. But people differ from one another on thousands and thousands of variables. So that means that on thousands of variables, the sample is likely NOT to be representative of the population. So random sampling does not ensure representativeness. What does? In order to be representative of the population on any given value of a variable, we must know the proportion of cases in the population that have the given value of the variable, and explicitly sample so that the sample has the same proportion of people with this value as the population. This is known as proportionate stratified sampling (or, when cases within each group are not selected at random, quota sampling).
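Sampling so that the sample matches a known population proportion can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the population of 10,000 records is made up, the function name `proportionate_sample` is hypothetical, and only one variable (employment) is matched.

```python
import random


def proportionate_sample(population, key, sample_size, seed=0):
    """Draw a sample whose group proportions on one variable match the population's."""
    rng = random.Random(seed)
    groups = {}
    for person in population:
        groups.setdefault(key(person), []).append(person)
    sample = []
    for members in groups.values():
        # Draw from each group in proportion to its share of the population.
        # Rounding can make the total differ slightly from sample_size in general.
        quota = round(sample_size * len(members) / len(population))
        sample.extend(rng.sample(members, quota))
    return sample


# Hypothetical population: 90% employed, 10% unemployed.
population = [{"employed": True}] * 9000 + [{"employed": False}] * 1000
sample = proportionate_sample(population, key=lambda p: p["employed"], sample_size=100)
employed_in_sample = sum(p["employed"] for p in sample)
print(employed_in_sample, "employed out of", len(sample))
```

Within each group the cases are still chosen at random, but the group sizes are fixed in advance, so the sample matches the population on employment every time. It says nothing about the thousands of other variables that were not used to form the groups.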

So why is random sampling seen to be so important in psychology then? The important thing about random sampling is that it distributes all the variables on which subjects differ (like employment, gender, ethnicity, number of toasters, etc.) RANDOMLY across levels of the IV (not equally). This random distribution of subject variables is a condition that is required in order to satisfy the assumptions of most statistical hypothesis tests. In any hypothesis test, we understand that SAMPLE statistical values (like sample means, for instance) will be different in different groups (at different levels of the IV), due to the fact that samples are randomly drawn from the population. That is, sample differences arise from sampling error or, you might say, the randomness of the sampling process. Because of the randomness of the sampling process, subjects in the groups will be different (there will not be the same number of employed in each sample, the same distribution of ethnicity in each sample, etc.). These differences between the subjects in the groups are what is responsible for differences between groups in, for example, sample means. So ACTUALLY, not only are NON-REPRESENTATIVE samples guaranteed by the sampling process, it is this non-representativeness that is, in part, responsible for differences between samples in the statistics calculated on the sample (sample mean differences, for example).

The whole idea of hypothesis testing is to make a decision about whether or not the differences we observe between samples (for instance, sample mean differences) are due to this sampling error only (error caused by the fact that SUBJECTS IN THE GROUPS ARE DIFFERENT) or sampling error PLUS the effect of the IV (the effect of the IV comes from the fact that subjects were treated differently in the different groups).

Below is a short point-form PowerPoint slide deck on the logic of hypothesis testing and its relationship to sampling error.


Now we look at the last significant feature of good data.

3) Good data has been checked carefully, verified for accuracy, and values in error have been removed. This is perhaps the most important job of the scientist. And this is something that has nothing at all to do with statistics or hypothesis testing. In part, this is a job for the data analyst. The data analyst has the responsibility for ensuring that the data in their data file is actually the data that was recorded. This requires auditing data, checking any programs written on the data file and so on. This is what we deal with next week.......See you then....


All content copyright of Dr Jeremy Jackson - 2014. Douglas College, Vancouver, BC.