Sunday, November 26, 2006

Confidence Intervals

Confidence intervals (CI) allow us to take a statistic from one sample (e.g., mean years of education) and generate a range for what the true value of that statistic (known as a parameter) would be, had we been able to survey every single person in the population.* This holds only as long as the sample is representative of the larger population. We would not, for example, expect a sample of Texas Tech undergraduates to represent the entire US adult population.

CI's can be put around any kind of sample-based statistic, such as a percentage, a mean, or a correlation, to produce a range for estimating the true value in the larger population (i.e., what the percentage, mean, or correlation would be if you surveyed the full population). 
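As a concrete illustration, here is a minimal sketch in Python (not part of the original post) of a 95% CI around a sample mean, such as mean years of education. The data are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical sample of years of education
years_education = np.array([12, 16, 14, 12, 18, 13, 16, 12, 14, 20])

mean = years_education.mean()
se = stats.sem(years_education)              # standard error of the mean
lo, hi = stats.t.interval(0.95, df=len(years_education) - 1, loc=mean, scale=se)
print(f"Sample mean = {mean:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```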

Besides allowing one to see the likely range of possible values of a parameter in the population, CI's can also be used for significance-testing. A 95% CI is most commonly used, corresponding to p < .05 significance.

The following chart presents visual depictions of 95% confidence intervals, based upon correlations reported in this article on children's physical activity. (On some computers, the image below may stall before its bottom portion appears; if you click on the image to enlarge it, you should get the full picture, which features a continuum of correlations from -.30 to .30 at the bottom.)


An important thing to notice from the chart is that there's a direct translation from a confidence interval around a sample correlation to its statistical significance. If a CI does not include zero (i.e., is entirely in positive "territory" or entirely in negative "territory"), then the result is significantly different from zero, and we can reject the null hypothesis of zero correlation in the population. On the other hand, if the CI does include zero (i.e., straddles positive and negative territory), then the result cannot be significantly different from zero, and Ho is maintained. As Westfall and Henning (2013) put it:

...the confidence interval for [a parameter] provides the same information [as the p value from a significance test] as to whether the results are explainable by chance alone, but it gives you more than just that. It also gives the range of plausible values of the parameter, whether or not the results are explainable by chance alone (p. 435).
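For readers who want to see this correspondence in action, here is a minimal sketch in Python (not from the original post) of a 95% CI around a sample correlation, using the Fisher z-transformation; the correlation and sample size are hypothetical. If the resulting interval excludes zero, the correlation is significant at p < .05 (two-tailed).

```python
import numpy as np
from scipy import stats

r, n = 0.25, 120                       # hypothetical sample correlation and sample size

z = np.arctanh(r)                      # Fisher z-transform of r
se = 1 / np.sqrt(n - 3)                # standard error of z
z_crit = stats.norm.ppf(0.975)         # critical value (about 1.96) for a 95% CI

lo, hi = np.tanh([z - z_crit * se, z + z_crit * se])   # back-transform to the r metric
print(f"95% CI for r: [{lo:.3f}, {hi:.3f}]")
print("Significant at p < .05 (two-tailed):", not (lo <= 0 <= hi))
```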

The dichotomous nature of null hypothesis significance testing (NHST) -- the idea that a result either is or is not significantly different from zero -- makes it less informative than the CI approach, where you get an estimated range within which the true population value is likely to fall. Therefore, many researchers have called for the abolition of NHST in favor of CI's (for an example, click here).

Of course, even if only CI's are presented, an interpretation in terms of statistical significance can still be made. Thus, it seems, we can have our cake and eat it too! Of that you can be confident.

This next posting shows how to calculate CI's and also includes a song...

---
*How exactly to interpret a confidence interval in technical terms remains in dispute (see Hoekstra et al. [2014] vs. Miller & Ulrich [2015] for contrasting arguments). My lecture notes above are closer to Miller and Ulrich's perspective.

Friday, November 17, 2006

Statistical Power -- Definitions

(Updated February 1, 2015)

Although we have not formally discussed the issue of statistical power, the general idea has come up many times. In the SPSS examples we've worked through, we have sometimes observed what look like small relationships or differences that nonetheless turned out to be statistically significant because of a large sample size. In other words, large sample sizes give you statistical power.

In research, our aim is to detect a statistically significant relationship (i.e., reject the null hypothesis) when the results warrant it. Therefore, my personal definition of statistical power boils down to just four words:

Detecting something that's there.

I also like to think in terms of a biologist attempting to detect some specimen with a microscope. Two things can aid in such detection: A stronger microscope (e.g., more powerful lenses, advanced technology) or a visually more apparent (e.g., clearer, darker, brighter) specimen.

The analogy to our research is that greater statistical power is like increasing the strength of the microscope. According to King, Rosopa, & Minium (2011), "The selection of sample size is the simplest method of increasing power" (p. 222). There are, however, additional ways to increase statistical power besides increasing sample size.

Further, a stronger relationship in the data (e.g., a .70 correlation as opposed to .30, or a 10-point difference between two means as opposed to a 2.5-point difference) is like a more visually apparent specimen. Strength of the results is also somewhat under the control of the researchers, who can take steps such as using the most reliable and valid measures possible, avoiding range restriction when designing a correlational study, etc.

The calculation of statistical power can be informed by the following example. Suppose an investigator is trying to detect the presence or absence of something. For example, during the Olympics all athletes (or at least the medalists) will be tested for the presence or absence of banned substances in their bodies. Any given athlete either will or will not actually have a banned substance in his or her body (“true state”). The Olympic official, relying upon the test results, will render a judgment as to whether the athlete has or has not tested positive for banned substances (the decision). All the possible combinations of true states and human decisions can be modeled as follows (format based loosely on Hays, W.L., 1981, Statistics, 3rd ed.):

*Substance truly present, test comes back positive: a correct detection (a “hit”); the probability of this outcome is the power of the test.

*Substance truly present, test comes back negative: a “miss”; its probability is known as beta.

*Substance truly absent, test comes back positive: a “false alarm”; its probability is alpha, the significance level.

*Substance truly absent, test comes back negative: a correct decision to clear the athlete.

In other words, power is the probability that you’re not going to miss the fact that the athlete truly has drugs in his/her system.

[Notes: The term “alpha” for a scale’s reliability is something completely separate and different from the present context. Also, the terms “Type I” and “Type II” error are sometimes used to refer, respectively, to false alarms and misses; I personally boycott the terms “Type I” and “Type II” because I feel they are very arbitrary, whereas you can reason out what a false alarm or miss is.]

For the kind of research you’ll probably be doing...

“The power of a test is the ability of a statistical test with a specified number of cases to detect a significant relationship.”

(Original source document for above quote no longer available online.)

Thus, you’re concerned with the presence or absence of a significant relationship between your variables, rather than the presence or absence of drugs in an athlete’s body.

(Another term that means the same as statistical power is "sensitivity.")

This document makes two important points:

*There seems to be a consensus that the desired level of statistical power is at least .80 (i.e., if a signal is truly present, we should have at least an 80% probability of detecting it; note that random sampling error can introduce "noise" into our observations).

*Before you initiate a study, you can calculate the necessary sample size for a given level of power, or, if you're doing secondary analyses with an existing dataset, for example, you can calculate the power for the existing sample size. As the linked document notes, such calculations require you to input a number of study properties.

One that you'll be asked for is the expected "effect size" (i.e., the strength or magnitude of the relationship). Here's where Cohen's classification of small, medium, and large correlations comes in handy. I suggest being cautious and assuming you'll discover only a small relationship in your upcoming study. For comparing two means, remember that the t-test is used only for determining statistical significance (i.e., seeing where your result falls on the t distribution). Thus, for a two-group comparison of means, you have to use something called "Cohen's d" to gauge what a small, medium, or large difference between means (i.e., effect size) would be. Cohen's (1988) book Statistical Power Analysis for the Behavioral Sciences (2nd Ed.) conveys his thinking on what constitutes a small, medium, or large difference between groups...

Small/Cohen's d = .20
"...approximately the size of the difference in mean height between 15- and 16-year-old girls (i.e., .5 in. where the [sigma symbol for SD] is about 2.1)..." (Cohen p. 26).
Medium/Cohen's d = .50
"A medium effect size is conceived as one large enough to be visible to the naked eye. That is, in the course of normal experience, one would become aware of an average difference in IQ between clerical and semiskilled workers or between members of professional and managerial occupational groups (Super, 1949, p. 98)" (Cohen, p. 26).
Large/Cohen's d = .80
"Such a separation, for example, is represented by the mean IQ difference estimated between holders of the Ph.D. degree and typical college freshmen, or between college graduates and persons with only a 50-50 chance of passing in an academic high school curriculum (Cronbach, 1960, p. 174). These seem like grossly perceptible and therefore large differences, as does the mean difference in height between 13- and 18-year-old girls, which is of the same size (d = .8)" (Cohen, p. 27).

Note that there's always a trade-off between reducing one type of error and increasing the other type. As King et al. (2011) explain, "In general, reducing the risk of Type I error [false alarm] increases the risk of committing a Type II error [miss] and thus reduces the power of the test" (p. 223). Consider the implications of using a .05 or .01 significance cut-off. The .01 level makes it harder to declare a result "significant," thus guarding against a potential false alarm. However, making it harder to claim a significant result does what to the likelihood of the other type of error, a miss? What is the trade-off when we use a .05 significance level?

Power-related calculations can be done using online calculators (e.g., here, here, and here). There are power calculators specific to each type of statistical technique (e.g., power for correlational analysis, power for t-tests).
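If you prefer working in code to a web calculator, here is a minimal sketch (not part of the original post) using the Python statsmodels package for an independent-samples t-test; the effect sizes and alpha are the conventional values discussed above, and the sample sizes are hypothetical.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Planning a study: sample size per group for 80% power to detect a small effect (d = .20)
n_per_group = analysis.solve_power(effect_size=0.20, alpha=0.05, power=0.80,
                                    alternative='two-sided')
print(f"Required n per group: {n_per_group:.0f}")

# Secondary analysis of an existing dataset: power to detect a medium effect (d = .50)
# with 50 cases per group
achieved_power = analysis.power(effect_size=0.50, nobs1=50, alpha=0.05,
                                ratio=1.0, alternative='two-sided')
print(f"Power with 50 cases per group: {achieved_power:.2f}")
```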

More power to you!

Wednesday, November 08, 2006

Chi-Square Null Hypotheses

I just got a colorful idea (to say the least) for how to illustrate the null hypothesis of a chi-square test. In the SPSS example we've been using, the principle behind Ho is that one group's (e.g., male) distribution across the different categories (marital statuses) will be equal to the other group's (female) distribution. The pie charts below serve as an illustration:



Again, the null hypothesis is that the male pie (and all its wedge sizes) will equal the female pie (and all its wedge sizes). A significant overall chi-square test tells us, of course, to reject Ho, which is to say, reject the notion of identical male and female pies.

A significant overall chi-square test can derive from one or more categories (wedges) being discrepant across males and females. That's where the standardized residuals for the cells can potentially be informative.
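As a minimal sketch (not part of the original post) of the overall test the pies illustrate, here is how a chi-square on a male/female by marital-status table could be run in Python; the counts are hypothetical.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts; columns are married, never married, divorced, widowed
observed = np.array([[180, 90, 45, 10],    # males
                     [170, 80, 60, 35]])   # females

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.4f}")
# A significant result rejects Ho, i.e., the notion of identical male and female "pies"
```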

Monday, November 06, 2006

Chi-Square in SPSS; Also, the "Reversal Error"

We'll now be learning how to perform chi-square tests in SPSS. Most of the elements will be straightforward (observed and expected frequencies, the overall chi-square, degrees of freedom, and significance). There are a few aspects of the output that may be a little confusing, however, so I've made another handy-dandy guide to aid interpretation (below).

Perhaps the most confusing aspect of a chi-square table is how to report on percentages of respondents. The important thing to remember is that, in a statement of the form "___% of the people in group A have characteristic B," the roles of A and B are not interchangeable.

Here's an example. I would estimate that about 80% of the players in the National Basketball Association (NBA) are from the U.S. (players such as the Dallas Mavericks' Dirk Nowitzki and the Houston Rockets' Yao Ming are part of the growing international presence).

However, we would never claim the reverse -- that 80% of the people in the U.S. are NBA basketball players! You might call this the "Reversal Error."

We thus have to be careful about phrasing the results of a chi-square analysis. A good practice is to request only ROW percentages from SPSS for the cells in the table. If you follow this practice, then you can always phrase your results in the following form. For any given cell, you can say something like:

"Among [category represented by the row], ___% [shown in cell] were [characteristic represented by the column]."

This format corresponds to the SPSS output in the following manner:













Update, November 13, 2007: Television host Keith Olbermann of MSNBC's "Countdown" awarded himself third place in his nightly "Worst Person in the World" competition. Olbermann's offense? He committed a statistical reversal error of the type described above. Quoting from the transcript of the show:

The bronze to me. We inverted a statistic last night. The study based on stats from the Veterans Affairs and Census Bureau indicating the heart breaking percentage of homeless veterans. I said one in every four veterans is homeless. In fact, one of every four homeless is a veteran. That makes the number smaller. It is, quote, only, unquote, 194,000; 1,500 of them, according to the V.A., veterans of Afghanistan and Iraq, already on the streets. I apologize for the statistical mistake.


Here are some other tips:

1. The statistical significance of a chi-square analysis pertains to the table as a whole. You're saying that the overall chi-square (which is the sum of the cell-specific contributions) exceeds the critical value for the relevant degrees of freedom and significance level.

2. To see if one or more cells are making particularly large contributions to the overall significance of the chi-square, you can have SPSS provide the unstandardized (regular) and standardized residuals for each cell. The regular residual is just the difference between the observed and expected frequency counts for a given cell. The standardized residual then puts the residual in z-score form, so that any standardized residual that's 1.96 or greater (in absolute value) could be said to be a major contributor to the overall chi-square. Standardized residuals may not be that informative, however, as with large sample sizes many cells may have standardized residuals greater than 1.96, which makes it hard to pinpoint any one or two cells where the "action" is. For further information, see the links section to the right, under Chi-Square. (A brief sketch of the residual calculations appears after these tips.)

3. If your overall chi-square for an analysis is significant, you should describe your findings in a way that emphasizes contrast. For example: "Among women, 39% followed the election campaigns closely, whereas only 28% of men did."
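As promised under tip 2, here is a minimal sketch in Python (not part of the original post) of computing the regular and standardized residuals for each cell of a hypothetical table.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2 x 3 table of observed counts
observed = np.array([[70, 45, 85],
                     [60, 75, 50]])

chi2, p, dof, expected = chi2_contingency(observed)

residuals = observed - expected                 # regular residuals: observed minus expected
std_residuals = residuals / np.sqrt(expected)   # standardized residuals (z-score form)
print(np.round(std_residuals, 2))
# Cells with |standardized residual| >= 1.96 are major contributors to the overall chi-square
```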