Tuesday, October 30, 2007

Partial-Correlation Oddity

While grading the correlation assignments, I came across an interesting finding in one of the students' papers (we use the "GSS93 subset" practice data set in SPSS, and each student can select his or her own variables for analysis).

Among the variables selected by this one student were number of children and frequency of sex during the last year. At the bivariate, zero-order level, these two variables were correlated at r = -.102, p < .001 (n = 1,327).

The student then conducted a partial correlation, focusing on the same two variables, but this time controlling for age (the student used the four-category age variable, although a continuous age variable is also available). This partial correlation turned out to be r = .101, p < .001.

Having graded several dozen papers from this assignment over the years, I had formed the impression that, at least among the variables my students choose from this data set, partialling out a third variable generally has little impact on the magnitude of the correlation between the two focal variables. Granted, neither correlation in the present example is all that large, but the change in sign from negative (zero-order) to positive (first-order partial), with each of the respective correlations significantly different from zero, struck me as noteworthy.

Also of interest was that age had both a fairly substantial positive correlation with number of children (r = .437) and a comparably strong negative correlation with frequency of sex (r = -.410).
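As a quick check, the first-order partial correlation can be reconstructed by hand from the three zero-order correlations reported above. Here is a minimal sketch in Python, plugging in the rounded values from this post; the result comes out near .09 rather than exactly .101, a gap that presumably reflects rounding and which version of the age variable enters each correlation.

```python
from math import sqrt

# Zero-order correlations reported above (rounded)
r_xy = -0.102   # number of children with frequency of sex
r_xz = 0.437    # number of children with age
r_yz = -0.410   # frequency of sex with age

# Standard first-order partial correlation formula:
# r_xy.z = (r_xy - r_xz*r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
r_xy_z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz**2) * (1 - r_yz**2))

print(round(r_xy_z, 3))  # roughly .09, in line with the reported +.101
```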

To probe the difference between the zero-order and partial correlations between number of children and frequency of sex, I went back to the (PowerPoint) drawing board and created a scatter plot, color-coded by age. (In scatter plots created in SPSS, a dot indicates only that at least one case falls at a given spot on the graph, not how many cases do, so I attempted to remedy that as well.)

My plot is shown below (you can click to enlarge it). I added trend lines after studying the SPSS scatter plots to see where the lines would fall. As can be seen, the full-sample trend is indeed negative, whereas each of the age-specific trends is positive. We'll discuss this further in class.
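For anyone who would rather script this than rebuild it in PowerPoint, here is a rough sketch of the same idea in Python with matplotlib. The data frame and column names (childs, sexfreq, agecat) are placeholders rather than the actual GSS93 variables, and the random numbers won't reproduce the sign reversal; with the real data, the group-specific lines should tilt upward while the full-sample line tilts downward.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder data standing in for the SPSS "GSS93 subset" variables
df = pd.DataFrame({
    "childs": np.random.randint(0, 8, 500),
    "sexfreq": np.random.randint(0, 7, 500),
    "agecat": np.random.choice(["18-29", "30-44", "45-64", "65+"], 500),
})

def fit_line(x, y):
    """Ordinary least-squares trend line endpoints."""
    slope, intercept = np.polyfit(x, y, 1)
    xs = np.array([x.min(), x.max()])
    return xs, intercept + slope * xs

fig, ax = plt.subplots()

# One color and one trend line per age group
for (label, grp), color in zip(df.groupby("agecat"), ["C0", "C1", "C2", "C3"]):
    jitter = np.random.uniform(-0.2, 0.2, (len(grp), 2))  # keep overlapping cases visible
    ax.scatter(grp["childs"] + jitter[:, 0], grp["sexfreq"] + jitter[:, 1],
               s=12, color=color, alpha=0.5, label=label)
    xs, ys = fit_line(grp["childs"], grp["sexfreq"])
    ax.plot(xs, ys, color=color)

# Full-sample trend line (the one that runs the other way in the real data)
xs, ys = fit_line(df["childs"], df["sexfreq"])
ax.plot(xs, ys, color="black", linewidth=2, label="all ages")

ax.set_xlabel("Number of children")
ax.set_ylabel("Frequency of sex (past year)")
ax.legend(title="Age group")
plt.show()
```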

Thursday, October 11, 2007

How Significance Cut-Offs for Correlations Vary by Sample Size

Here's a more elaborate diagram (click to enlarge) of what I started sketching on the board at yesterday's class. It shows that, with smaller sample sizes, larger (absolute) values of r (i.e., further from zero) are needed to attain statistical significance than with larger samples. In other words, with a smaller sample it takes a stronger correlation (in either the positive or the negative direction) to reject the null hypothesis of no true correlation in the population (rho = 0) and rule out (to the degree of certainty indicated by the p level) that the correlation in your sample (r) arose purely from chance.


As you can see, statisticians sometimes talk about sample sizes in terms of degrees of freedom (df). We'll discuss df more thoroughly later in the course in connection with other statistical techniques. For now, though, suffice it to say that for ordinary correlations, df and sample size (N) are closely related: df = N - 2 (i.e., the sample size minus the number of variables being correlated).

For a partial correlation that controls (holds constant) one variable beyond the two main variables being correlated (a first-order partial), df = N - 3; for one that controls for two variables beyond the two main ones (a second-order partial), df = N - 4, etc.
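To see where the cut-offs in the diagram come from, here is a small sketch in Python (using scipy; the alpha level and sample sizes are just illustrative). It uses the standard relation t = r * sqrt(df) / sqrt(1 - r^2) to convert the critical t for a given df into the minimum |r| needed for significance.

```python
from math import sqrt
from scipy import stats

def critical_r(n, alpha=0.05, controls=0):
    """Smallest |r| that reaches two-tailed significance for a sample of size n.

    controls = number of variables partialled out (0 for a zero-order correlation),
    so df = n - 2 - controls.
    """
    df = n - 2 - controls
    t_crit = stats.t.ppf(1 - alpha / 2, df)   # two-tailed critical t for this df
    return t_crit / sqrt(t_crit**2 + df)      # invert t = r*sqrt(df)/sqrt(1 - r^2)

for n in (10, 30, 100, 1000):
    print(n, round(critical_r(n), 3))
# Smaller samples demand a larger |r| to reject rho = 0:
# roughly .63 at n = 10, but only about .20 at n = 100.
```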

This web document also has some useful information.