Wednesday, October 06, 2010

Correlation

UPDATED 10/2/2022 

Our next topic is correlational analysis. There are four major areas to address:

1. A general introduction to correlation. Correlation refers to how two variables go together ("co-relate"). There are three main types of correlations:

Positive correlation: As one variable goes up, so does the other. They both follow the same pattern. Knowing where a person stands on one variable, you know roughly where he/she stands on the other (maximum = +1.0). 
  • Example: The more hours one studies before a test, the higher the score he or she will likely get. 
Zero correlation: Knowing where a person stands on one variable tells us nothing about where he/she stands on the other. Someone who has a high score on one variable is equally likely to have a high or a low value on the other. 
  • Example: A person’s number of sneezes per week is (probably) uncorrelated with the percent of a person’s shirts that are blue. 
Negative correlation: As one goes up, the other goes down. They follow an inverse pattern. Knowing where a person stands on one variable, you again know roughly where he/she stands on the other (minimum = -1.0). 
  • Example: The higher the winter temperatures where one lives (e.g., Miami), the fewer the heavy jackets people buy. 
Graphical depictions of positive, zero, and negative correlations. Note that the sign of the correlation (symbolized r) follows the direction of the slope of the best-fitting line (the line that comes closest to all the points), while the size of r reflects the degree to which the points lie close to the line vs. being scattered. 
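For readers who like to see the arithmetic, here is a minimal Python sketch of the textbook Pearson r and least-squares slope formulas. The function names and the tiny data sets (hours studied, test scores, jackets bought) are made up purely for illustration; they are not from our class data or from SPSS.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y, scaled to lie between -1 and +1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def best_fit_slope(x, y):
    """Slope of the least-squares (best-fitting) line predicting y from x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

# Hypothetical data, invented for this illustration only
hours = [1, 2, 3, 4, 5]
score_tight = [52, 61, 69, 80, 88]   # points hug an upward line
score_noisy = [70, 50, 85, 60, 90]   # upward trend, but widely scattered
jackets = [9, 7, 6, 4, 2]            # points hug a downward line

for label, y in [("tight upward", score_tight),
                 ("noisy upward", score_noisy),
                 ("tight downward", jackets)]:
    print(label, "slope =", round(best_fit_slope(hours, y), 2),
          "r =", round(pearson_r(hours, y), 3))
```

The two upward-sloping examples both give a positive r, but the scattered one gives a much smaller r, which is the "closeness to the line" idea in action.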

A song to nail down our understanding of correlation, best-fit lines, upward and downward slopes, etc. 

Fitting the Line 
Lyrics by Alan Reifman 
(May be sung to the tune of “Draggin’ the Line,” James/King) 

Plotting the data, on X and Y, 
Finding the slope, with most points nearby, 
We want to find the angle, of the trend’s incline, 
Fitting the line (fitting the line), 

Upward slopes make r positive, 
Slopes trending down, make it negative, 
From minus-one to plus-one, r can feel fine, 
Fitting the line (fitting the line), 
Fitting the line (fitting the line), 

Points align, how will the data shine? 
If you have upward slopes, it’ll give you a plus sign, 
Fitting the line (fitting the line), 
Fitting the line (fitting the line), 

How strongly will your variables relate? 
Is there a trend, or just a zero flat state? 
You want to know what your analysis will find, 
Fitting the line (fitting the line), 
Fitting the line (fitting the line), 

Points align, how will the data shine? 
Your r will be minus, if the slope declines, 
Fitting the line (fitting the line), 
Fitting the line (fitting the line), 

(Guitar solo) 

Points align, how will the data shine? 
If you have upward slopes, it’ll give you a plus sign, 
Fitting the line (fitting the line), 
Fitting the line (fitting the line)… 

Facebook album of Dr. Reifman meeting singer Tommy James, after the latter's concert at the 2013 South Plains Fair. 

2. Running correlations in SPSS. This graphic of SPSS output tries to make clear that a sample correlation and its significance/probability level are two different things (although related to each other).


To graph the data points and best-fitting line, start in "Graphs," go to "Legacy Dialogs," and select "Scatter/Dot." Then, select "Simple Scatter" and click on "Define." You will then insert the variables you want to display on the X and Y axes, and click "OK." When the scatter plot first appears, you can click on it to do more editing. To add the best-fit line, under "Elements," choose "Fit Line at Total."

Initially the dots will all look the same throughout the scatter plot. To make each dot represent the number of cases at that point (either by thickness of the dot or through color-coding), click on the "Binning" icon (circled below in red). Thanks to Xiaohui for finding this!

3. Statistical significance and testing the null hypothesis, as applied to correlation. Subthemes within this topic include how sample size affects the ease of getting a statistically significant result (i.e., rejecting the null hypothesis of zero correlation in the full population), and one- vs. two-tailed significance tests.
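To get a feel for the sample-size point, here is a rough Python sketch. It uses Fisher's z approximation (a swapped-in shortcut that needs only the standard library), not the t-based test that SPSS reports, so the p-values will differ slightly from SPSS output. The function names are my own, and halving the two-tailed p for a one-tailed test assumes the observed correlation is in the predicted direction.

```python
from math import atanh, erfc, sqrt

def approx_p_two_tailed(r, n):
    """Approximate two-tailed p-value for H0: population correlation = 0.

    Uses Fisher's z transformation: atanh(r) * sqrt(n - 3) is roughly
    standard normal under the null hypothesis (an approximation).
    """
    z = atanh(r) * sqrt(n - 3)
    return erfc(abs(z) / sqrt(2))

def approx_p_one_tailed(r, n):
    """One-tailed p-value, assuming r falls in the predicted direction."""
    return approx_p_two_tailed(r, n) / 2

# Same sample correlation, different (hypothetical) sample sizes
r = 0.20
for n in (30, 200, 2000):
    print("n =", n,
          "two-tailed p =", round(approx_p_two_tailed(r, n), 4),
          "one-tailed p =", round(approx_p_one_tailed(r, n), 4))
```

The same r of .20 fails to reach the .05 level with 30 cases but clears it easily with 200, and the one-tailed p is always half the two-tailed p.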

4. Partial correlation (i.e., the correlation between two variables, holding constant one or more "lurking" variables).
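As a small illustration, the standard first-order formula for a partial correlation (one variable held constant) can be sketched in Python as follows. The function name and the three input correlations are hypothetical values chosen for the example; with more than one control variable you would rely on SPSS or another package rather than this formula.

```python
from math import sqrt

def partial_r(r_xy, r_xz, r_yz):
    """Correlation between x and y, holding constant a single variable z."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical correlations: x and y correlate .50, but both are
# also fairly strongly related to a lurking variable z.
print(round(partial_r(0.50, 0.60, 0.70), 3))

# If z is unrelated to both x and y, controlling for it changes nothing.
print(partial_r(0.50, 0.0, 0.0))
```

In the first example, most of the apparent x-y relationship shrinks away once z is held constant.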

Here are some additional tips:

5. In evaluating the meaning of a correlation that appears as positive or negative in the SPSS output, you must know how each of the variables is keyed (i.e., does a high score reflect more of the behavior or less of the behavior?).

6. Statistical significance is not necessarily indicative of social importance. With really large sample sizes (such as we have available in the GSS), even a correlation that seems only modestly different from zero may be statistically significant. To remedy this situation, the late statistician Jacob Cohen devised criteria for "small," "medium," and "large" correlations.
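Cohen's commonly cited benchmarks for correlations are r = .10 (small), .30 (medium), and .50 (large). Here is a small Python sketch that labels a correlation accordingly. Treating the benchmarks as hard cutoffs, and the "below small" label, are my own simplifications for illustration; Cohen offered them as rough guidelines.

```python
def cohen_label(r):
    """Label a correlation using Cohen's benchmarks (.10, .30, .50).

    The sign is ignored, since -.40 is as strong a relationship as +.40.
    """
    size = abs(r)
    if size >= 0.50:
        return "large"
    if size >= 0.30:
        return "medium"
    if size >= 0.10:
        return "small"
    return "below small"

# Hypothetical correlations; with a very large sample, even the first
# could be statistically significant.
for r in (0.05, 0.12, -0.35, 0.62):
    print(r, "->", cohen_label(r))
```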

7. Correlations should also be interpreted in the context of range restriction (see links section on the right). Here's a song to reinforce the ideas:


Restriction in the Range 
Lyrics by Alan Reifman
(May be sung to the tune of “Laughter in the Rain,” Sedaka/Cody)

Why do you get such a small correlation,
With variables you think should be related?
Seems you’re not studying the full human spectrum,
Just looking at part of bivariate space,
All kinds of thoughts start to race, through your mind…

Ooh, there’s restriction in the range,
Dampening the slope of the best-fit line,
Ooh, I can correct r for this,
Put a better rho estimate in its place...