Wednesday, November 28, 2007

Practical Issues in Power Analysis

Below, I've added a new chart, based on things we discussed in class. William Trochim's Research Methods Knowledge Base, in discussing statistical power, sample size, effect size, and significance level, notes that, "Given values for any three of these components, it is possible to compute the value of the fourth." The table I've created attempts to convey this fact in graphical form.


You'll notice the (*) notation by "S, M, L" in the chart. Those, of course, stand for small, medium, and large effect sizes. As we discussed in class, Jacob Cohen developed criteria for what magnitude of result constitutes small, medium, and large, both for correlational studies and for studies comparing the means of two groups (t-test-type studies, although t itself is not an indicator of effect size).

When planning a new study, naturally you cannot know what your effect size will be ahead of time. However, based on your reading of the research literature in your area of study, you should be able to get an idea of whether findings have tended to be small, medium, or large, which you can convert to the relevant values for r or Cohen's d. These, in turn, can be submitted to power-analysis computer programs and online calculators.

I try to err on the side of expecting a small effect size. This requires me to obtain a large sample size in order to detect a small effect, which seems like good practice anyway.
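For those who would rather compute this in code than use an online calculator, here is a minimal sketch in Python using the statsmodels package (my choice of package, and the alpha = .05 and power = .80 settings, are illustrative assumptions rather than requirements). It solves for the per-group n needed to detect Cohen's small, medium, and large d values in a two-group comparison:

# A minimal a priori power-analysis sketch. Cohen's conventional d values
# (0.2, 0.5, 0.8) come from the discussion above; alpha and power are common,
# but not mandatory, choices.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
    # Solve for the per-group sample size, given effect size, alpha, and power.
    n_per_group = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80,
                                       alternative="two-sided")
    print(f"{label} effect (d = {d}): about {n_per_group:.0f} per group")

Running this makes the trade-off concrete: the smaller the effect you plan for, the larger the sample the program tells you to recruit.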

UPDATE: Westfall and Henning (2013) argue that post hoc power analysis, which is what the pink column depicts in the above table, is "useless and counterproductive" (p. 508).

Tuesday, November 27, 2007

Illustration of a "Miss" in Hypothesis Testing (and Relevance for Power Analysis)

My previous blog notes on this topic are pretty extensive, so I'll just add a few more pieces of information (including a couple of songs).

As we've discussed, the conceptual framework underlying statistical power involves two different kinds of errors: rejecting the null (thus claiming a significant result) when the null hypothesis is really true in the population (a Type I error, or "false alarm"); and failing to reject the null when the true population correlation (rho) is actually different from zero (a Type II error, or "miss"). The latter is illustrated below:



And here are my two new power-related songs...

Everything’s Coming Up Asterisks
Lyrics by Alan Reifman (updated 11/18/2014)
(May be sung to the tune of “Everything’s Coming Up Roses,” from Gypsy, Styne/Sondheim)

We've got a scheme, to find p-values, baby.
Something you can use, baby.
But, is it a ruse? Maybe...

State the null! (SLOWLY),
Run the test!
See if H-oh should be, put to rest,

If p’s less, than oh-five,
Then H-oh cannot be, kept alive (SLOWLY),

With small n,
There’s a catch,
There could be findings, you will not snatch,

There’s a chance, you could miss,
Rejecting the null hypothesis… (SLOWLY),

(Bridge)
H-oh testing, how it’s always been done,
Some resisting, will anyone be desisting?

With large n,
You will find,
A problem, of the opposite kind,

Nearly all, you present,
Will be sig-nif-i-cant,

You must start to look more at effect size (SLOWLY),
’Cause, everything’s coming up asterisks, oh-one and oh-five! (SLOWLY)

Believe it or not, the above song has been cited in the social-scientific literature:




Find the Power
Lyrics by Alan Reifman
(May be sung or rapped to the tune of “Fight the Power,” Chuck D/Sadler/Shocklee/Shocklee, for Public Enemy)

People do their studies, without consideration,
If H-oh, can receive obliteration,
Is your sample large enough?
To find interesting stuff,

P-level and tails, for when H-oh fails,
Point-eight-oh’s the way to go,
That’s the kind of power,
That you’ve got to show,

Look into your mind,
For the effect, you think you’ll find,

Got to put this all together,
Got to build your study right,
Got to give yourself enough, statistical might,

You’ve got to get the sample you need,
Got to learn the way,
Find the power!

Find the power!
Get the sample you need!

Find the power!
Get the sample you need!

Monday, November 12, 2007

Non-Parametric/Assumption-Free Statistics

(Updated November 2, 2024)

This week, we'll be covering non-parametric (or assumption-free) statistical tests (brief overview). Parametric techniques, which include the correlation r and the t-test, refer to the use of sample statistics to estimate population parameters (e.g., rho, mu).* Thus far, we've come across a number of assumptions that technically are required to be met for doing parametric analyses, although in practice there's some leeway in meeting the assumptions.

Assumptions for parametric analyses are as follows (for further information, see here):

o Data for a given variable are normally distributed in the population.

o Variables are measured on an equal-interval scale.

o Cases are randomly sampled from the population.

o Variances are homogeneous across groups (for the t-test).

One would generally opt for a non-parametric test when there's violation of one or more of the above assumptions and sample size is small. According to King, Rosopa, and Minium (2011), "...the problem of violation of assumptions is of great concern when sample size is small (< 25)" (p. 382). In other words, if assumptions are violated but sample size is large, you still may be able to use parametric techniques (example). The reason is something called the Central Limit Theorem (CLT). I've annotated the following screenshot from Sabina's Stats Corner to show what the CLT does.


It is important first to distinguish between a frequency plot of raw data (which appears in the top row of Sabina's diagram) and something else known as a sampling distribution. A sampling distribution is what you get when you draw repeated random samples from a full population of individual persons (sometimes known as a "parent" population) and plot the means of all the samples you have drawn. Under the CLT, a frequency plot of these means (the sampling distribution) tends toward normality regardless of the parent population's shape, and it resembles a bell curve more and more closely as the size of each of the samples increases. In essence, the CLT gets us back to (approximately) normal sampling distributions even when the original parent-population distribution is non-normal.

We'll be doing a neat demonstration with dice that conveys how large samples, via the CLT, can salvage analyses when the raw data violate the normal-distribution assumption.
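For anyone who wants to try the dice demonstration on a computer, here's a rough sketch in Python (assuming numpy is available). Each fair die has a flat, decidedly non-normal distribution, yet the means of repeated samples of dice pile up into a bell shape, just as the CLT promises:

# The parent distribution of a single fair die is uniform (flat), not normal,
# yet the sampling distribution of the mean of many dice tends toward a bell shape.
import numpy as np

rng = np.random.default_rng(seed=2007)

def sampling_distribution_of_mean(dice_per_sample, n_samples=10_000):
    """Roll `dice_per_sample` dice, take the mean, repeat `n_samples` times."""
    rolls = rng.integers(1, 7, size=(n_samples, dice_per_sample))
    return rolls.mean(axis=1)

for k in (1, 2, 5, 30):
    means = sampling_distribution_of_mean(k)
    print(f"{k:>2} dice per sample: mean of means = {means.mean():.2f}, "
          f"SD of sample means = {means.std(ddof=1):.3f}")
# As k grows, the SD of the sample means shrinks (roughly sigma/sqrt(k)), and a
# histogram of the means looks increasingly normal, even though each die is uniform.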

Now that we've established that non-parametric statistics typically are used when one or more assumptions of parametric statistics are violated and sample size is small, we can ask: What are some actual non-parametric statistical techniques?

To a large extent, the different non-parametric techniques represent analogues to parametric techniques. For example, the non-parametric Mann-Whitney U test (song below) is analogous to the parametric t-test, when comparing data from two independent groups, and the non-parametric Wilcoxon signed-ranks test is analogous to a repeated-measures t-test. This PowerPoint slideshow demonstrates these two non-parametric techniques corresponding to t-tests. As you'll see, non-parametric statistics operate on ranks (e.g., who has the highest score, the second highest, etc.) rather than original scores, which may have outliers or other problems.
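If you'd like to try these two rank-based tests yourself, here's a small sketch using Python's scipy package (assumed to be installed); the scores are made up for illustration and are not from any actual class data:

# Two rank-based analogues of the t-test.
from scipy import stats

# Two independent groups (analogue of the independent-samples t-test):
group_a = [12, 15, 14, 10, 39]   # note the outlier, which ranks absorb gracefully
group_b = [18, 21, 25, 23, 27]
u_stat, p_u = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_u:.3f}")

# Paired scores from the same people at two times (analogue of the
# repeated-measures t-test):
time1 = [12, 15, 14, 10, 39, 20, 17]
time2 = [14, 18, 15, 13, 41, 24, 16]
w_stat, p_w = stats.wilcoxon(time1, time2)
print(f"Wilcoxon signed-ranks W = {w_stat:.1f}, p = {p_w:.3f}")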

The parametric Pearson correlation has the non-parametric analogue of a Spearman rank-order correlation. Let's work out an example involving the 2024-25 Miami Heat, a National Basketball Association team suggested by one of the students. The roster of a single team gives us a small sample of players, for whom we will correlate annual salary with career performance on a metric called Win Shares (i.e., how many wins are attributed to each player based on his points scored, rebounds, assists, etc.). Here are our data (as of November 1, 2024):


These data clearly have some outliers. Jimmy Butler is the highest-paid Heat player at $48.8 million per year, and he is credited statistically with personally contributing 115.5 wins to his teams over his 14-year career. Other, younger players have smaller salaries (although still huge in layperson terms) and have had many fewer wins attributed to them. The ordinary Pearson correlation and the Spearman rank-order correlation are shown at the bottom of this posting,** in case you'd like to stay in suspense and work them out yourself first. Which do you think will be larger, and why?
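Here's a sketch of how the two correlations could be computed in Python with scipy (assumed installed). To avoid reproducing the full roster here, the numbers below are hypothetical placeholders; only the first pair echoes the Butler figures quoted above:

# Pearson vs. Spearman on a small sample with one extreme pair of values.
# These salary (in $ millions) and Win Shares numbers are invented, except the
# first pair, which mirrors the Butler figures mentioned in the post.
from scipy import stats

salary_millions = [48.8, 2.1, 3.5, 1.8, 12.0, 5.2, 4.0, 19.4]
win_shares      = [115.5, 4.0, 1.2, 6.1, 0.8, 20.3, 3.2, 2.5]

r, p_r = stats.pearsonr(salary_millions, win_shares)
rs, p_rs = stats.spearmanr(salary_millions, win_shares)
print(f"Pearson r = {r:.2f} (p = {p_r:.3f}); Spearman rs = {rs:.2f} (p = {p_rs:.3f})")
# Because Spearman works on ranks, the single extreme salary/Win Shares pair
# pulls the Pearson r up far more than it does the Spearman rs.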

Finally, we'll close with our song...

Mann-Whitney U
Lyrics by Alan Reifman
(May be sung to the tune of “Suzie Q.,” Hawkins/Lewis, covered by John Fogerty)

Mann-Whitney U,
When your groups are two,
If your scaling’s suspect, and your cases are few,
Mann-Whitney U,

The cases are laid out,
Converted to rank scores,
You then add these up, done within each group,
Mann-Whitney U,

(Instrumental)

There is a formula,
That uses the summed ranks,
A distribution’s what you, compare the answer to,
Mann-Whitney U

---
*According to King, Rosopa, and Minium (2011), "Many people call chi-square a nonparametric test, but it does in fact assume the central limit theorem..." (p. 382).

**The Pearson correlation is r = .57 (p = .03), whereas the Spearman correlation is rs = .40 (nonsignificant due to the small sample size).

Tuesday, October 30, 2007

Partial-Correlation Oddity

While grading the correlation assignments, I came across an interesting finding in one of the students' papers (we use the "GSS93 subset" practice data set in SPSS, and each student can select his or her own variables for analysis).

Among the variables selected by this one student were number of children and frequency of sex during the last year. At the bivariate, zero-order level, these two variables were correlated at r = -.102, p < .001 (n = 1,327).

The student then conducted a partial correlation, focusing on the same two variables, but this time controlling for age (the student used the four-category age variable, although a continuous age variable is also available). This partial correlation turned out to be r = .101, p < .001.

Having graded several dozen papers from this assignment over the years, my impression was that, at least among the variables chosen by my students from this data set, partialling out variables generally had little impact on the magnitude of the correlation between the two focal variables. Granted, neither of the correlations in the present example is all that huge, but the change in the correlation's sign from negative (zero-order) to positive (first-order partial), with each of the respective correlations significantly different from zero, struck me as noteworthy.

Also of interest was that age had both a fairly substantial positive correlation with number of children (r = .437) and a comparably powerful negative correlation with frequency of sex (r = -.410).
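For anyone curious how a first-order partial correlation can be computed by hand from the three zero-order correlations, here's a small sketch in Python. (Plugging in the values reported above gives roughly .09 rather than exactly the .101 from SPSS, presumably because the student's partial controlled for the four-category age variable.)

# Standard first-order partial-correlation formula:
#   r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
from math import sqrt

r_xy = -0.102   # children with frequency of sex
r_xz =  0.437   # children with age
r_yz = -0.410   # frequency of sex with age

r_xy_given_z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz**2) * (1 - r_yz**2))
print(f"Partial r (controlling for age) = {r_xy_given_z:.3f}")  # positive, roughly .09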

To probe the difference between the zero-order and partial correlations between number of children and frequency of sex, I went back to the (PowerPoint) drawing board and created a scatter plot, color coding for age. (Also, in scatter plots created in SPSS, a dot indicates only that at least one case occupies a given spot on the graph, not how many cases do, so I attempted to remedy that, too.)

My plot is shown below (you can click to enlarge it). I added trend lines after studying the SPSS scatter plots to see where the lines would go. As can be seen, the full-sample trend is indeed of a negative correlation, whereas all the age-specific trends are positive. We'll discuss this further in class.

Thursday, October 11, 2007

How Significance Cut-Offs for Correlations Vary by Sample Size

Here's a more elaborate diagram (click to enlarge) of what I started sketching on the board at yesterday's class. It shows that, with smaller sample sizes, larger (absolute) values of r (i.e., values further from zero) are needed to attain statistical significance than is the case with larger samples. In other words, with a smaller sample, it takes a stronger correlation (in either a positive or negative direction) to reject the null hypothesis of no true correlation in the full population (rho = 0) and to rule out (to the degree of certainty indicated by the p level) that the correlation in your sample (r) has arisen purely from chance.


As you can see, statisticians sometimes talk about sample sizes in terms of degrees of freedom (df). We'll discuss df more thoroughly later in the course in connection with other statistical techniques. For now, though, suffice it to say that for ordinary correlations, df and sample size (N) are very similar, with df = N - 2 (i.e., the sample size, minus the number of variables in the correlation).

For a partial correlation that controls (holds constant) one variable beyond the two main variables being correlated (a first-order partial), df = N - 3; for one that controls for two variables beyond the two main ones (a second-order partial), df = N - 4, etc.
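Here's a small sketch in Python (using scipy, which is assumed available) of how the two-tailed .05 cut-off for r shrinks as df grows, via the standard relationship between r and t: r_critical = t_critical / sqrt(t_critical^2 + df).

# Critical |r| values for a two-tailed test at alpha = .05, as a function of df.
from math import sqrt
from scipy.stats import t

alpha = 0.05
for df in (10, 25, 50, 100, 500):
    t_crit = t.ppf(1 - alpha / 2, df)
    r_crit = t_crit / sqrt(t_crit**2 + df)
    print(f"df = {df:>3} (N = {df + 2:>3}): |r| must exceed {r_crit:.3f}")

The printed values reproduce the familiar critical-r tables: small samples demand sizable correlations, whereas very large samples let quite modest correlations reach significance.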

This web document also has some useful information.

Wednesday, September 26, 2007

Intro to z-Scores

(Updated July 17, 2013)

We'll next be moving on to standardized (or z) scores and how they relate to the normal/bell curve and percentiles.

For any given body of data, each individual participant can be assigned a z-score on any variable. A z-score is calculated as:

Individual's Raw Score on a Variable - Sample Mean on that Variable
------------------------------------------------------------------------------------
Sample Standard Deviation on that Variable

This website has a good overview of z-scores.
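As a quick illustration of the formula above, here's a tiny sketch in Python (assuming numpy), applied to a made-up set of exam scores:

# z = (raw score - sample mean) / sample standard deviation
import numpy as np

scores = np.array([72, 85, 90, 64, 78, 95, 88])
z = (scores - scores.mean()) / scores.std(ddof=1)  # ddof=1 -> sample SD

for raw, zed in zip(scores, z):
    print(f"raw = {raw}, z = {zed:+.2f}")
# Scores above the sample mean get positive z's; scores below it get negative z's.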

If we wanted to compare the performances of two or more individuals on some task, the ideal way, of course, would be to administer the same measures, under the same conditions, to everyone, and see who scores highest. Sometimes, however, it's clear that the people you're trying to compare have not been assessed under identical conditions.

As one example, a university may have a large, amphitheatre-type lecture class of 400 students, with each student also attending a TA-led discussion section of 25 students. There are eight TA's, each of whom leads two sections. Overall course grades may be based 80% on uniform in-class exams taken at the same time by all the students, and 20% on section performance (mini-paper assignments and spoken participation). The kicker is that, for the 20% of the grade that comes from the sections, different students have different TA's, who can differ in the toughness or easiness of their grading. To account for differences in TA difficulty on the 20% of the course grade that comes from discussion sections, we could compute z-scores for the section grades.

Another example, which comes from an actual published study, involves comparing the home-run prowess of sluggers from different eras. As those of you who are big baseball fans will know, Babe Ruth held the single-season home-run record for many years, with the 60 he hit in 1927. Roger Maris then came along with 61 in 1961, and that's where things stood for another few decades. Within the last decade, we then saw Mark McGwire hit 70 in 1998 and Barry Bonds belt 73 in 2001.

Given the many differences between the 1920s and now, is Bonds's 2001 season really the most impressive? The initial decades of the 1900s were known as the Dead-ball Era, due to the rarity of home runs. In contrast, the last several years have been dubbed "The Live-ball Era," "The Goofy-ball Era," "The Juiced-ball (or Juiced-player) Era," "The Steroid Era," etc. It's not just steroids that are suspected of inflating the home-run totals of contemporary batters; smaller stadiums, more emphasis on weightlifting, and league expansion (which necessitates greater use of inexperienced pitchers) have also been suggested as contributing factors.

A few years ago, a student named Kyle Bang (a great name for analyzing home runs) published an article in SABR's Baseball Research Journal, applying z-scores to the problem (to access the article, click here, then click on Volume 32 -- 2003). Wrote Bang:

The z-score is best considered a measure of domination, since it only determines how well a hitter performed with respect to his contemporaries within the same season (p. 58).

In other words, for any given season, the home-run leader's total could be compared to the mean throughout baseball for the same season, with the difference divided by the standard deviation for the same season. A player with a high season-specific z score thus would have done well relative to all the other players that same season. (Bang actually used homer per at-bat as each player's input, so that, for example, a player's home-run performance would not be penalized if the player missed games due to injury.)
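Here's a sketch of the season-specific z-score idea in Python (assuming numpy). The home-run rates below are invented for one imaginary season; they are not Bang's actual data.

# z-score of the season leader's home-run rate, relative to that season's
# league-wide mean and SD (all values hypothetical).
import numpy as np

hr_per_ab = np.array([0.015, 0.022, 0.030, 0.041, 0.018, 0.055, 0.095])  # leader last
league_mean = hr_per_ab.mean()
league_sd = hr_per_ab.std(ddof=1)

leader_z = (hr_per_ab.max() - league_mean) / league_sd
print(f"Season leader's z = {leader_z:.2f}")
# Holding the spread constant, the same leader rate would yield a smaller z in a
# season where everyone else also hit homers at a high clip (a higher mean),
# which is exactly Bang's point about domination relative to one's contemporaries.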

Under this method, Ruth has had the six best individual seasons in Major League Baseball history. The overall No. 1 season was Ruth's in 1920, when his z-score was 7.97 (i.e., Ruth's homer output in 1920, minus the major-league mean in 1920, divided by the 1920 SD, equaled 7.97). Ruth's 1927 season, in which his absolute number of homers established the record of 60, produced a z-score of 5.57, good for fourth on the list.

Bonds's 2001 season of 73 homers produced a z-score of 5.14, good for seventh on the list, and McGwire's 1998 season of 70 homers produced a z of 4.69, for eighth on the list.

Again, the point is that Ruth exceeded his contemporaries to a greater degree than Bonds and McGwire did theirs. The aforementioned factors that may have been inflating home-run totals in the 1990s and 2000s (steroids, small ballparks, etc.) would have helped Bonds's and McGwire's contemporaries also hit lots of homers, thereby raising the yearly means and weakening Bonds's and McGwire's season-specific z-scores (although their z's were still pretty far out on the normal distribution).

As with any method, the z-score approach has its limitations. Among them, notes Bang, is that just by chance, some eras may have a lot of great hitters coming up at the same time, which raises the mean and weakens the top players' z-scores.

How would this logic apply to the example of the discussion sections of a large lecture class? Each TA could convert his or her students' original grades for section performance to z-scores, relative to that TA's grading mean and SD. If a particular TA were an easy grader, that TA's mean would be high and thus the top students would get their z-adjusted grades knocked down a bit. Conversely, a hard-grading TA would have a lower mean for his or her students, thus allowing them to get their grades bumped up a bit in the z-conversion.

Also, because z-scores have a mean of 0 and an SD of 1, and the section component would be counting 20% of students' overall course grades, the z-converted scores would have to be renormed. To get the students' section grades to top out at (roughly) 20, perhaps they could be converted to a system with a mean of 16 and an SD of 2.
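That renorming is just a linear transformation of the z-scores. A tiny sketch in Python (the z-values are made up for illustration):

# Map each z-scored section grade onto a scale with mean 16 and SD 2,
# so grades top out at roughly 20.
section_z = [-1.5, -0.4, 0.0, 0.7, 1.9]
renormed = [16 + 2 * z for z in section_z]
print(renormed)  # a student exactly 2 SDs above the mean would land at 20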

The example of the discussion section grades is based on a true story. While in graduate school at the University of Michigan, I was a TA for a huge lecture class, and I suggested a z-based renorming of students' section grades. The professor turned the idea down, citing the increased complexity and grading time that would be involved.

I really believe the z-score approach gives you a lot of "bang" for your buck, but not everyone may agree.

Thursday, March 08, 2007

Survival Analysis (Special Blog Entry)

Today I'm giving a guest lecture on survival analysis (also known as event history analysis) in my colleague Dr. Du Feng's graduate class on developmental (longitudinal) data analysis. Survival analysis is not a topic for introductory statistics, but this blog seemed to be as good a place as any for putting these web links to supplement my presentation.

Survival analysis is appropriate when a researcher has a dichotomous outcome variable that can be monitored at regular time intervals to see if each participant has switched from one status to another (e.g., in medical research, from alive to dead).

Here are a couple of documents I found on the web. First is a brief one from University College London that presents both a survival curve and a hazard curve, concepts that we shall discuss during class. Second is a more elaborate document from the University of Minnesota that discusses additional issues, including "censored" data.

The focus of my talk will be a class project from when I taught graduate research methods in the spring of 1998. The study led to a poster paper at the 1999 American Psychological Association conference (copies available upon request). We took advantage of the fact that People magazine, which comes out weekly and is archived in the Texas Tech library, lists celebrity marriages and divorces (and other developments) in a "Passages" section (here's an example, from after the study was completed).

We were looking at the survival of celebrity marriages until divorce. What makes the study a little unusual is that two events had to occur for a couple to provide complete data: the couple would have to get married, then get divorced. Quoting from our paper:

All issues of People between January 1, 1990 and June 30, 1997 were examined, for a total of 392 weeks. All marriages and divorces during this period were recorded, but only those couples whose marriages took place during the study period were used. To facilitate the survival analysis, for each couple the week number of the marriage and of the divorce (if any) were recorded...

Regarding the week numbers noted above, the issue of People dated January 1, 1990 represented week number 1, the January 8, 1990 issue represented week number 2, and so forth, up through the June 30, 1997 issue, which represented week number 392.

The particular type of survival analysis we conducted is called Cox Regression. Like other kinds of regression techniques, Cox Regression tests the relationship of predictor variables (covariates) to an outcome, in this case the hazard curve.
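For readers who want to try this in software, here's a bare-bones sketch of a Cox regression in Python using the lifelines package (a choice on my part; it is not what we used for the original study). The tiny data frame is invented purely for illustration: "weeks" is marriage duration in People-issue weeks, "divorced" is 1 if the divorce was observed and 0 if the couple was censored, and "age_gap" stands in for whatever covariates a real analysis might include.

# Minimal Cox proportional-hazards sketch with made-up celebrity-couple data.
import pandas as pd
from lifelines import CoxPHFitter

couples = pd.DataFrame({
    "weeks":    [52, 130, 392, 210, 75, 300],
    "divorced": [1,   1,   0,   1,  1,   0],   # 0 = still married at study's end (censored)
    "age_gap":  [12,  3,   1,   8,  15,  2],
})

cph = CoxPHFitter()
cph.fit(couples, duration_col="weeks", event_col="divorced")
cph.print_summary()  # shows the estimated hazard ratio for each covariate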

I hope you find this information useful and that everyone "survives" the lecture. For a nice, though somewhat dated, overview of the technique, I would recommend the following article:

Luke, D.A. (1993). Charting the process of change: A primer on survival analysis. American Journal of Community Psychology, 21, 203-246.