Updated November 23, 2015
The general form for calculating CIs is:
95% CI = Sample estimate (e.g., r or Mean) +/- (1.96)(Standard Error)
The specific forms of this calculation for CIs around a mean, a correlation, and a proportion, respectively, are shown here, here, and here. This document (specifically Figures 7 and 8) explains why a step known as the Fisher z transformation must be implemented in finding the CI for a correlation. Because calculating the CI of a correlation is somewhat complicated, you may wish to use this online calculator for doing so.
Note how increasing one's sample size (N) will shrink the SE and hence narrow the CI.
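To make the arithmetic concrete, here is a minimal Python sketch (my own illustration, not part of the course materials, which use SPSS) that computes a 95% CI for a mean and for a correlation, the latter using the Fisher z transformation mentioned above. The function names and example numbers are hypothetical.

```python
import math

def ci_mean(mean, sd, n, z=1.96):
    """95% CI for a mean: mean +/- z * (sd / sqrt(n))."""
    se = sd / math.sqrt(n)
    return (mean - z * se, mean + z * se)

def ci_correlation(r, n, z=1.96):
    """95% CI for a correlation r, via the Fisher z transformation."""
    zr = 0.5 * math.log((1 + r) / (1 - r))   # Fisher z transform of r
    se = 1 / math.sqrt(n - 3)                # standard error in the z metric
    lo, hi = zr - z * se, zr + z * se
    back = lambda x: (math.exp(2 * x) - 1) / (math.exp(2 * x) + 1)  # back to the r metric
    return (back(lo), back(hi))

print(ci_mean(100, 15, n=25))     # wider interval
print(ci_mean(100, 15, n=100))    # larger N, smaller SE, narrower interval
print(ci_correlation(0.30, n=100))
```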
Also, here's a potentially useful article:
Kalinowski, P., & Fidler, F. (2010). Interpreting ‘significance’: The difference between statistical and practical importance. Newborn and Infant Nursing Review, 10, 50-54.
Finally, I also have written a new song:
True Value
Lyrics by Alan Reifman
(May be sung to the tune of “Moon Shadow,” Cat Stevens; the song has also been recorded by real musicians, as commissioned by the Consortium for the Advancement of Undergraduate Statistics Education or "CAUSE")
Within your CI, you get the true value, true value, true value,
With 95%, you get the true value, true value, true value,
You get a sample statistic, a sample r, or sample M,
You then take plus-or-minus two (it’s really 1.96…), standard errors beyond your stat,
And within this new interval, we can be, so confident,
That the true value, mu or rho, will be somewhere… inside…, our confidence interval,
Within your CI, you get the true value, true value, true value,
With 95%, you get the true value, true value, true value...
Thursday, November 20, 2008
Wednesday, October 22, 2008
t-Test Overview
(Updated May 25, 2021)
We will now be covering t-tests (for comparing the means of two groups) for the next week or so. As we'll discuss, there are two ways to design studies for a t-test:
INDEPENDENT SAMPLES, where a participant in one group (e.g., Obama voters in the 2012 election) cannot be in the other group (Romney voters). The technical term is that the groups are "mutually exclusive." The Obama and Romney voters could be compared, for example, on their average income.
PAIRED/CORRELATED GROUPS, where the same (or matched) person(s) can serve in both groups. For example, the same participant could be asked to complete math problems both during a period where loud hard-rock music is played and during a period where quiet, soothing music is played. Or, if you were comparing men and women on some attitude measure and your participants were heterosexual married couples, that would be considered a correlated design.
The Naked Statistics book briefly discusses the formula for an independent-samples t-test on pp. 164-165. Here's a simplified graphic I found from the web (original source):
Notice from the "Xbar1 - Xbar2" portion that the t statistic is gauging the amount of difference between the two means, in the context of the respective groups' standard deviations (s) and sample sizes (n). Your obtained t value will be compared to the t distribution (which is similar to the normal z distribution) to see if it is extreme enough to be unlikely to stem from chance. You will also need to take account of "degrees of freedom," which for an independent-samples t-test are closely based on total sample size.
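I can't reproduce the graphic here, but as a rough sketch, one commonly shown simplified version of the formula (the unpooled form; the textbook's exact version may differ slightly) can be written in a few lines of Python. The group summaries below are hypothetical, just to show the calculation:

```python
import math

def independent_t(mean1, sd1, n1, mean2, sd2, n2):
    """Simplified (unpooled) independent-samples t:
    t = (Xbar1 - Xbar2) / sqrt(s1^2/n1 + s2^2/n2)."""
    se_diff = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    return (mean1 - mean2) / se_diff

# Hypothetical group summaries (not from the book's example)
t = independent_t(mean1=55.0, sd1=10.0, n1=30, mean2=50.0, sd2=12.0, n2=30)
print(round(t, 2))   # bigger mean differences and bigger n's both push t up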
There's an online graphic that visually illustrates the difference between z (normal) and t distributions (click on this link and then, when the page comes up, on "Click to View"). As noted on this page from Columbia University, "tails of the t-distribution are thicker and extend out further than those of the Z distribution. This indicates that for a given confidence level, t-scores [needed for significance] are larger than Z scores."
More technically, as Westfall and Henning (2013) point out, "Compared to the standard normal distribution, the t-distribution has the same median (0.0) but with variance df/(df-2), which is larger than the standard normal's variance of 1.0" (p. 423). Remember that the variance is just the standard deviation squared.
This table shows the values your obtained t statistic needs to exceed (known as "critical values") for statistical significance, depending on your df and target significance level (typically p < .05, two-tailed). Another website provides a nice overview of one- and two-tailed tests.
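If you have Python with SciPy available, you can also generate these critical values yourself instead of reading them from the table, and confirm the t-versus-z points made above (a sketch for the curious; SPSS remains the course software):

```python
from scipy import stats

# Two-tailed critical values at p < .05: t approaches z as df grow
for df in (5, 10, 30, 100):
    print(f"df = {df:3d}: critical t = {stats.t.ppf(0.975, df):.3f}")
print(f"critical z = {stats.norm.ppf(0.975):.3f}")   # 1.960

# Variance of the t distribution is df/(df - 2), larger than the normal's 1.0
print(stats.t.var(10), 10 / (10 - 2))   # both 1.25
```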
Thursday, October 02, 2008
Hypothesis Testing with Correlations
NOTE: I have edited and reorganized some of my writings on correlation to present the information more coherently (10/11/2012).
The correlation statistic presents the first instance in which we'll be examining statistical significance (here and here). The question is whether we can reject the null hypothesis (Ho) that the correlation between a given pair of variables in the full population is zero (rho = 0).
We, of course, obtain correlations (r) for our sample, and then see if our sample correlation is sufficiently different from zero (in either a positive or negative direction) that it would have been unlikely to arise from pure chance if the population rho were truly zero. That's what we mean by statistical significance. When we achieve statistical significance, we can reject the Ho of zero rho.
In order to have a statistically significant correlation, the correlation (r) itself should be appreciably different from zero, either above zero (a positive correlation) or below zero (a negative correlation).
Also, in order for the correlation to be significant, the significance (or probability) level displayed for a given correlation in your SPSS output must be very small (p < .05, or if the probability is even smaller, you can use one of the other conventional cut-off points, p < .01 or p < .001). Any time the probability p is larger than .05, the correlation is nonsignificant (in my opinion, if you get a correlation with a p level of .06 or .07, it's OK to note in your report that the correlation narrowly missed being significant under conventional standards).
Suppose you find that the correlation between two variables is r = .30, p < .01. This tells us that, if the null hypothesis (Ho) is true -- that is, there truly is no correlation in the population from which the sample was drawn (rho = 0, where rho looks like a curvy capital P) -- then it would be extremely unlikely (p < .01) for a correlation as large as .30 to crop up purely by chance.
[Here's a figure I've added in October 2007, to convey the idea of there truly being no correlation in a large population, but a correlation occurring in one's sample purely by random sampling error:
This web document is also helpful. The opposite problem, where the full population truly has a correlation, but you draw a sample that fails to show it, will be discussed later in the course.]
A significant correlation in our sample thus allows us to reject the null hypothesis and assert, based on an inference from our sample, that there is a correlation in the population. Again, note the inference from sample to population.
The essence of scientific hypothesis testing can thus be distilled to three steps:
1. State the null hypothesis (Ho) that there is no correlation between your two variables in the population (rho = 0). The investigator probably doesn't believe Ho (generally, we seek to uncover significant relationships), but Ho is part of the scientific protocol.
2. Obtain the sample correlation (r) between your two variables and the associated significance (p) level.
3. If the correlation is statistically significant (r is well above or well below zero, and p < .05), reject Ho. If the correlation is nonsignificant (r is close to zero and p is larger than .05), then the null hypothesis that there is no correlation between your two variables in the population must be maintained. We never accept the truth of the null hypothesis for certain; we just say it cannot be rejected.
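If you wanted to mimic SPSS's correlation output in Python, purely as an illustration with made-up data, the three steps look like this (a sketch; the variable names and simulated data are my own):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.4 * x + rng.normal(size=50)   # data simulated with a built-in positive relationship

# Step 1: Ho says rho = 0.  Steps 2-3: get r and p, then compare p to .05.
r, p = stats.pearsonr(x, y)
print(f"r = {r:.2f}, p = {p:.4f}")
print("Reject Ho" if p < .05 else "Fail to reject Ho")
```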
I've stated above that, in order to be significant, a correlation (r) needed to be well above or well below zero. That's not always true, however. As we saw in some of our SPSS illustrations, with a very large sample size (n = 1,000 or more), a correlation does not necessarily need to be that far from zero to be significant. If a correlation appears to be small, yet is listed in the output as being significant (probably only due to the large sample size), you can say the correlation was "significant, though weak." A Wikipedia document on correlation (which I've just added to the links section on the right) displays guidelines developed by the late statistician Jacob Cohen for labeling correlations as "small, medium, and large."
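The role of sample size can be seen directly in the significance test for a correlation, which (as a formula I'm adding for illustration, not one we compute by hand in class) converts r to t = r * sqrt((n - 2) / (1 - r^2)) on n - 2 df. In the sketch below, the same weak correlation of r = .10 is nonsignificant at n = 30 but significant at n = 1,000:

```python
from scipy import stats

def p_for_r(r, n):
    """Two-tailed p for a correlation r, via t = r * sqrt((n - 2) / (1 - r**2))."""
    t = r * ((n - 2) / (1 - r**2)) ** 0.5
    return 2 * stats.t.sf(abs(t), df=n - 2)

for n in (30, 100, 1000):
    print(f"r = .10, n = {n:4d}: p = {p_for_r(0.10, n):.4f}")
```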
Two concepts we will be taking up later in the course, statistical power and confidence intervals, will elaborate upon the issue of small correlations sometimes being significant and, conversely, relatively large correlations not being significant.
Friday, September 26, 2008
Probability Paradoxes
(Updated September 29, 2014)
To close out our coverage of probability, let's look at three brain-teasers.
1. One is the famous "Birthday Paradox." Upon first learning that a group of only 23 people is needed for the probability to reach .50 that at least two of them share a birthday, most observers find this very counterintuitive. Wikipedia's page on the topic may help clarify the key points. One of the approaches taken on the Wikipedia page uses the "n choose k" principle. Another approach elaborates on the "and/multiplication" principle. The probability of at least one pair of people having the same birthday is 1 minus the probability of no one having the same birthday. The latter can be thought of as the product of the following probabilities:
The first person definitely has his/her birthday on some day (1) times...
The second person having it on one of the other 364 days of the year (364/365) times...
The third person having his or hers on one of the remaining 363 days (363/365) times...
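Continuing that product down to the 23rd person (343/365) and subtracting the result from 1 gives the surprising answer. Here's a short Python sketch of the calculation (my own illustration):

```python
def p_shared_birthday(k, days=365):
    """P(at least two of k people share a birthday) =
    1 - (365/365) * (364/365) * (363/365) * ... for k factors."""
    p_all_different = 1.0
    for i in range(k):
        p_all_different *= (days - i) / days
    return 1 - p_all_different

print(p_shared_birthday(23))   # roughly .507, just over one-half
```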
2. The second puzzle is the famous Monty Hall Problem (named after the host of the old game show "Let's Make a Deal"), which is described in detail here. To my mind, the clearest explanation of the surprising solution is that given by Leonard Mlodinow's book The Drunkard's Walk. Based on Mlodinow's writing, here's a diagram we created in class a few years ago (thanks to Kristina for taking the picture):
The basic idea is that scenarios in which switching helps occur twice as often as ones in which switching hurts. The Naked Statistics book has a mini-chapter devoted to the Monty Hall Problem. There are also various YouTube videos on the problem, such as this one.
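If the diagram doesn't convince you, a quick simulation usually will. Here's a short Python sketch (my own, not from any of the linked sources) that plays the game many times and tallies wins for staying versus switching:

```python
import random

def monty_hall(trials=100_000):
    """Simulate the game and count wins for 'stay' vs. 'switch'."""
    stay_wins = switch_wins = 0
    for _ in range(trials):
        car = random.randrange(3)        # door hiding the car
        choice = random.randrange(3)     # contestant's first pick
        # Host opens a goat door that isn't the contestant's pick
        opened = next(d for d in range(3) if d != choice and d != car)
        switched = next(d for d in range(3) if d != choice and d != opened)
        stay_wins += (choice == car)
        switch_wins += (switched == car)
    return stay_wins / trials, switch_wins / trials

print(monty_hall())   # roughly (0.33, 0.67): switching wins twice as often
```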
3. Finally, the third brain-teaser involves winning the lottery twice. Statisticians emphasize the distinction between a particular, named individual winning twice (which had an estimated probability of 1-in-17 trillion in a New Jersey example) and the probability that someone, somewhere, sometime would win twice. The latter probability, because it takes into account the huge number of people who play the lottery and the frequency and volume of tickets sold, is estimated at something more like 1-in-30 (the exact calculations are not shown).
Based on the linked article, let's use the n-choose-k and multiplication/and rules to derive the 1-in-17 trillion probability for a particular individual.
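As a sketch of that derivation, here is how Python's math.comb gives the n-choose-k counts and how the multiplication ("and") rule combines them. The game formats below (pick 6 of 39, then pick 6 of 42) are my assumption based on commonly cited accounts of the New Jersey case; the linked article's exact figures may differ.

```python
from math import comb

odds_first = comb(39, 6)     # 3,262,623 possible tickets in the first game (assumed format)
odds_second = comb(42, 6)    # 5,245,786 possible tickets in the second game (assumed format)

# Multiplication ("and") rule: one named person winning both games with one ticket each
both = odds_first * odds_second
print(f"1 in {both:,}")      # about 1 in 17 trillion
```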
Tuesday, September 23, 2008
Comparing the Olympic Swimming Times of Michael Phelps (2008) vs. Mark Spitz (1972) via z-Scores
I have now corroborated the results of our Michael Phelps/Mark Spitz Olympic swimming z-score class exercise, which I'll show below. This activity was inspired by an earlier study published in the Baseball Research Journal that used z-scores to compare home-run sluggers of different eras.
I shared the activity with two listserv discussion groups, those of the APA Division of Evaluation, Measurement, and Statistics and the Society for Personality and Social Psychology, offering to provide the raw data and documentation on how to conduct the exercise. I'm pleased to report that over 100 people have requested these materials to use in their own statistics classes. The materials can still be requested via my faculty webpage (see link in the right-hand column). I framed the exercise as follows in the documentation:
Michael Phelps, with eight gold medals in the 2008 Beijing Olympics (on top of six golds from the 2004 Athens games), and Mark Spitz, with seven gold medals in the 1972 Munich Olympics, are swimming’s two greatest champions.
The two swam many of the same events. Though the respective times by Phelps are several seconds faster than Spitz’s, the 36 years between 1972 and 2008 are a long time for improvements in training, technique, nutrition, and facilities. A statistic known as the z-score allows us to see which swimmer was more dominant relative to his contemporary peers.
Phelps and Spitz had three individual (non-relay) events in common, the 200-meter freestyle, 200-meter butterfly, and the 100-meter butterfly. Because I have a relatively small class and wanted to have groups of three or four students each work on a different segment of the data, we looked at only the first two of the aforementioned events. Two considerations to note are that (a) times were converted into total seconds to facilitate computations; and (b) where an athlete swam multiple races of the same event (i.e., heats, semifinals, and finals), his fastest time was used. Here are the results:
200 FREESTYLE
2008
Mean = 109.42 seconds
SD = 3.20
Phelps time = 102.96 seconds (1:42.96)
Phelps z = -2.02
[For non-statisticians who may be reading this, z = an individual's value minus the mean, with the difference then divided by the standard deviation. The latter represents how spread out the data are.]
1972
Mean = 120.33 seconds
SD = 4.57 seconds
Spitz time = 112.78 seconds (1:52.78)
Spitz z = -1.65
200 BUTTERFLY
2008
Mean = 117.85 seconds
SD = 3.01
Phelps time = 112.03 seconds (1:52.03)
Phelps z = -1.93
1972
Mean = 129.51 seconds
SD = 5.32
Spitz time = 120.70 seconds (2:00.70)
Spitz z = -1.66
Note that negatively signed z-scores are a "good" thing, indicating by how much Phelps or Spitz was faster (i.e., consuming less time) than his respective competitors. As can be seen, Phelps was more dominant against the 2008 fields in his events than Spitz was against the 1972 fields. It would also be interesting to look at the 100-meter butterfly, which, of course, Phelps won by the narrowest of margins.
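For anyone who wants to check the arithmetic, this small Python sketch (my own addition) reproduces the four z-scores from the means, SDs, and times reported above:

```python
def z_score(time, mean, sd):
    """z = (individual time - field mean) / field SD; negative means faster than the field."""
    return (time - mean) / sd

# Values reported above (times in total seconds)
print(round(z_score(102.96, 109.42, 3.20), 2))   # Phelps, 200 free: -2.02
print(round(z_score(112.78, 120.33, 4.57), 2))   # Spitz,  200 free: -1.65
print(round(z_score(112.03, 117.85, 3.01), 2))   # Phelps, 200 fly:  -1.93
print(round(z_score(120.70, 129.51, 5.32), 2))   # Spitz,  200 fly:  -1.66
```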
I thank Nancy Genero of Wellesley College, a fellow University of Michigan Ph.D., for sharing the results from her class; by comparing our respective data files for possible typographical errors, we were able to reconcile some minor differences. Also, as a technical note, an "outlier" swimmer who had a time of 2:33.75 in the 1972 200 freestyle (when the next slowest time was around 2:13) was excluded. An extreme value would have affected both the mean and SD, of course.
Unlike the above analyses, which used all competitors in an event (regardless of whether they reached the finals or even the semifinals), one could also look exclusively at the finals. To the extent that qualifying rules for the Olympics may have changed between 1972 and 2008, or that other factors were operative, the proportion of weak swimmers (in a world-class context) in the fields might have been different in the two Games, again possibly affecting the z-score results. University of Nevada, Reno graduate student Irem Uz indeed analyzed only the finals, and these were his results:
Phelps 200 free z = -1.92
Spitz 200 free z = -1.34
Phelps 200 fly z = -1.62
Spitz 200 fly z = -2.01
Under this method, there's a little redemption for Spitz. When one examines the results of Spitz's 200 fly win in 1972, his dominance is clear:
1. Mark Spitz 2:00.70 WR
2. Gary Hall 2:02.86
3. Robin Backhaus 2:03.23
4. Jorge Delgado, Jr. 2:04.60
5. Hans Faßnacht 2:04.69
6. András Hargitay 2:04.69
7. Hartmut Flöckner 2:05.34
8. Folkert Meeuw 2:05.57
The mean was roughly 2:04, putting Spitz 3.30 seconds faster than it. Meanwhile, the extremely tight clustering of the fourth- through eighth-place swimmers served to keep the overall SD small (1.63). The upshot is a very big z for Spitz.
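These finals-only numbers can be recomputed directly from the eight times listed above. The sketch below (my own) assumes the sample (n - 1) SD formula, which gives values very close to those reported:

```python
import statistics

# 1972 200-fly finals times, converted to total seconds (from the results above)
finals = [120.70, 122.86, 123.23, 124.60, 124.69, 124.69, 125.34, 125.57]

mean = statistics.mean(finals)    # about 123.96 seconds, i.e., roughly 2:04
sd = statistics.stdev(finals)     # about 1.62 with the n-1 formula, near the 1.63 reported
print(round(mean, 2), round(sd, 2))
print(round((finals[0] - mean) / sd, 2))   # Spitz's z, about -2.01
```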
ADDENDA
The Wall Street Journal's "Numbers Guy," Carl Bialik, provided some other types of Phelps-Spitz comparisons as this year's Olympics were going on.
The New York Times created an amazing slide show of graphics, showing how the swimming times of Phelps and Spitz stacked up against each other, and also how each fared against his respective competition.
Another blogger, Jeremy Yoder, independently came up with the idea to analyze z-scores for Phelps and Spitz. Yoder's results are different from the comparable analyses reported above, for some reason.
Here's a 2014 application of z-scores to golf.