Monday, November 25, 2024

Fall 2024 HDFS 5349 Quantitative Methods I

Welcome to QM I, the department's introductory statistics class. I'm a bit unusual in using a blog to organize class materials, but I think the approach has worked well over the years. Also, for those of you who haven't been in school here before, welcome to Texas Tech! You'll be visiting this welcome page a lot, as it contains the links to our lecture notes.

I'll do my best to provide a lot of practical, real-world exercises in analyzing data, and I'll try to keep things fun. This passage from a book I read several years ago, Coincidences, Chaos, and All That Math Jazz, by Edward B. Burger and Michael Starbird, provides a concise overview of what statistics can offer:

Statistics can help us understand the world. It is a powerful and effective tool for placing economic, social welfare, sports, and health issues into perspective. It molds data into digestible morsels and shows us a measured way to look at situations that have either random or unknown features. But we must use common sense when applying statistics or other tools that draw on our experience of the world to shape data into meaningful conclusions (p. 60).

In addition, the following article sets forth some goals for what you should learn in this class (and other classes). You can access this article via the Texas Tech Library website or Google Scholar.

Utts, J. (2003). What educated citizens should know about statistics and probability. The American Statistician, 57(2), 74-79.

LECTURE NOTES (asterisked [*] pages are from my undergraduate research-methods class).

Units of analysis*

Sampling*

Types of Measures*

Visual depictions of a data distribution:

  • Histograms (overview). Your textbook authors King and colleagues (2018) offer some advice on interval widths and the appearance of histograms, noting that "it is customary to make the height of the distribution about three-quarters of the width" (pp. 32-33). To adjust the width of the columns, click on the histogram in your output, go to "Options," "Un-bin Element," and "Bin Element," then, in the small square in the upper-right of the screen, under "X Axis," select "Custom" and enter the desired interval width. (A rough non-SPSS sketch of setting a bin width appears after this list.)
  • Frequency tables contain similar information to histograms. The cumulative percentages also are roughly similar to percentiles (for a given score, you can see what percent of the sample falls below it).
  • Shapes of distributions
  • As a class exercise, we will attempt to reproduce via SPSS this histogram of U.S. Presidents' ages upon assuming office (note that Grover Cleveland, who served two non-consecutive terms, is counted as being "two presidents," the 22nd and 24th).
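For anyone who wants to see the bin-width idea outside SPSS, here is a minimal sketch in Python (matplotlib/numpy) of drawing a histogram with a chosen interval width. The data are simulated, not the actual presidential ages.

```python
# Illustrative only: a histogram with a user-chosen interval (bin) width,
# analogous to entering a custom bin width in SPSS's chart editor.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
ages = rng.normal(55, 7, size=45).round()    # made-up "ages at taking office"

bin_width = 5                                # the interval width you want
bins = np.arange(ages.min(), ages.max() + bin_width, bin_width)

plt.hist(ages, bins=bins, edgecolor="black")
plt.xlabel("Age at taking office (simulated)")
plt.ylabel("Frequency")
plt.show()
```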

Descriptive statistics:* Central tendency (mean, median, and mode) and spread (standard deviation); moments of a distribution; and z-scores (here, here, here, and here)

Probability (here and here)

Correlation and significance-testing

t-tests

Chi-square

Non-parametric statistics

Statistical power

Confidence intervals

Big data

Writing Up Statistical Results in APA Style

Sunday, November 24, 2024

Big Data

(Cross-posted and modified from several years ago at the Reifman Multivariate Blog.)

The amount of capturable data people generate, across commerce, health care, law enforcement, sports, and other domains, is truly mind-boggling. For example, according to one article, "Walmart controls more than 1 million customer transactions every hour..." The study of Big Data (also known as "Data Mining") applies statistical techniques such as correlation to discern patterns in the data and make predictions. Let's start with a brief video listing numerous examples. Overview articles are available in the Harvard Business Review and on Wikipedia (plus many other places I'm sure you can find via a Google search). A critique of the approach is available here.

Perhaps the most common use of Big Data is in the business world. A good entry point in this area is the book Super Crunchers by Ian Ayres. Probably the best-known story about Big Data is "How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did." What Target's statisticians did, essentially, was look for associations between whether customers' names were on the baby registry and whether they purchased certain products at higher rates than the average customer. Another potential business usage could come from this data visualization of taxi and Uber trips in New York City (compare to the layout of the city).

In the sports domain, I recommend three books pertaining to baseball: Moneyball (by Michael Lewis) on the Oakland Athletics; The Extra 2% (by Jonah Keri) on the Tampa Bay Rays; and Big Data Baseball (by Travis Sawchik) on the Pittsburgh Pirates. These three teams play in relatively small cities by Major League Baseball standards and thus bring in less local-television revenue than teams in bigger cities such as New York and Los Angeles. Therefore, teams such as Oakland, Tampa Bay, and Pittsburgh must use statistical techniques to (a) discover "deceptively good" players, whose skills are not well-known to other teams and who can thus be paid non-exorbitant salaries, and (b) identify effective strategies that other teams aren't using (yet). The Pirates' signature strategy, now copied by other teams, is the defensive shift, using statistical "spray charts" unique to each batter. (Major League Baseball later passed rules to limit shifting.) 

One final book I would recommend is The Victory Lab by Sasha Issenberg on political campaigns, namely how candidates try to convince people to vote for them and get their supporters to actually vote. 

To end the course, here's one final song...

Big Data 
Lyrics by Alan Reifman 
May be sung to the tune of “Green Earrings” (Fagen/Becker for Steely Dan) 

Sales, patterns, 
Companies try, 
Running equations, 
To predict, what we’ll buy, 

Big data, 
Lots of numbers, 
Floating in, the cloud, 
For computers, 
To analyze, now, 
We know, how, 

Sports, owners, 
Wanting to win, 
Seeking, advantage, 
In the numbers, it’s no sin, 

Big data, 
Lots of numbers, 
Floating in, the cloud, 
For computers, 
To analyze, now, 
We know, how 

Instrumentals/solos 

Big data, 
Lots of numbers, 
Floating in, the cloud, 
For computers, 
To analyze, now, 
We know, how 

Instrumentals/solos

Wednesday, December 20, 2023

Formula for a Paired t-Test

A paired t-test incorporates the possibility that the two variables whose means are being compared may also be correlated. In the following example, participating back-pain patients receive actual medication for one stretch of, say, six weeks and placebo (sugar pills) for a different stretch of six weeks. Ideally, the patients and medical staff who directly provide the pills to the patients would not know what kind of pill was being provided (i.e., a double-blind design) and the order of delivery -- medication then placebo or placebo then medication -- would be varied at random.

The focus of the paired t-test is, of course, whether participants' average reported pain while taking actual medication differs from their average reported pain under placebo. However, because patients with the most severe initial pain might report relatively high pain under medication and even higher pain under placebo, whereas patients with the mildest initial pain might report relatively low pain under placebo and even lower pain while receiving medication, patients' pain reports under medication and placebo could be positively correlated (see following graphic).

As shown in the following screenshot from this University of Georgia webpage, the correlation r between the two variables X and Y (highlighted) enters the paired t-test formula for comparing means.
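For reference, the standard way of writing the paired t-test, with the correlation term shown explicitly, is (the webpage's notation may differ slightly):

$$ t = \frac{\bar{X} - \bar{Y}}{\sqrt{\dfrac{s_X^2 + s_Y^2 - 2\,r\,s_X s_Y}{n}}} $$

where $\bar{X}$ and $\bar{Y}$ are the two means, $s_X$ and $s_Y$ are the corresponding standard deviations, r is the correlation between the paired scores, and n is the number of pairs.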


Another document goes into additional depth regarding the paired/correlated t-test, including the implications of the correlation r being included in the formula. As it notes, a larger correlation between the two variables will increase the (absolute) size of t.

I've also created a graphic to interpret the SPSS output of a paired t-test, emphasizing the t-test comparison of means but also showing where the correlation between the two variables appears.
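If you want to check the arithmetic outside SPSS, here is a minimal sketch in Python (scipy/numpy) using made-up pain ratings; it prints the same three pieces the SPSS output shows: the two means, the correlation between the paired scores, and the paired t and p.

```python
# Illustrative only: a paired (correlated) t-test on simulated pain ratings.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
placebo = rng.normal(6.0, 1.5, size=30)               # hypothetical pain under placebo
medication = placebo - rng.normal(1.0, 0.8, size=30)  # correlated with placebo, lower on average

r, _ = stats.pearsonr(medication, placebo)   # the correlation that enters the formula
t, p = stats.ttest_rel(medication, placebo)  # the paired t-test comparing the two means

print(f"Means: medication = {medication.mean():.2f}, placebo = {placebo.mean():.2f}")
print(f"r = {r:.2f}, t = {t:.2f}, p = {p:.4f}")
```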

Friday, October 10, 2014

Writing Up Statistical Results in APA Style

Our colleague Dr. Shera Jackson has compiled the following list of web resources for how to write up statistical results in APA style:

Reporting Statistics in APA Style (Matthew Hesson-McInnis, Illinois State University)

Reporting Results of Common Statistical Tests in APA Format (Psychology Writing Center, University of Washington)

Statistics in APA Style (Craig Wendorf, University of Wisconsin-Stevens Point)

Tuesday, November 16, 2010

Statistical Power (Overview)

This week we'll be covering statistical power (also known as power analysis). Power is not a statistical technique like correlation, the t-test, or chi-square. Rather, power involves designing your study (particularly obtaining a large enough sample size) so that you can use correlations, t-tests, etc., more effectively. The core concept of power, like so much else, goes back to the distinction between the population and a sample. When there truly is a basis in the population for rejecting the null hypothesis (e.g., a non-zero correlation, a non-zero difference between means), we want to increase the likelihood that the analysis of our sample rejects the null. In other words, we want to be able to pronounce a result significant, when warranted. Here are links to my previous entries on statistical power.

Introductory lecture

Why a powerful design is needed: The population may truly have a non-zero correlation, for example, but due to random sampling error, your sample may not; plus, some songs on statistical power!

Remember that there's also the opposite kind of error (a false positive, or Type I error): The population truly has no correlation at all, but again due to random sampling error, you draw a sample that gives the impression of a non-zero correlation.

How to plan a study using power considerations
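As a rough illustration of that planning step, here is a minimal sketch in Python using the statsmodels power routines (the effect size, alpha, and power values below are just example inputs, not recommendations for any particular study).

```python
# Illustrative only: solving for the sample size needed per group in a
# two-group t-test design, given an assumed effect size, alpha, and power.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5,  # assumed "medium" standardized difference
                                   alpha=0.05,       # two-tailed significance level
                                   power=0.80)       # desired chance of rejecting a false null
print(round(n_per_group))   # roughly 64 participants per group under these assumptions
```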

Wednesday, November 03, 2010

Chi-Square

My introductory stat notes for the methods class include some information on chi-square.

Here are direct links to some old chi-square blog postings. This one discusses the reversibility error and how properly to read an SPSS printout of a chi-square analysis. The other one illustrates the null hypothesis for chi-square analyses in terms of equal pie-charts.
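If you'd like to double-check the numbers on an SPSS chi-square printout, here is a minimal sketch in Python (scipy) using a made-up 2-by-2 table of counts.

```python
# Illustrative only: a chi-square test of independence on invented counts.
from scipy.stats import chi2_contingency

observed = [[30, 20],   # e.g., group A: yes / no
            [18, 32]]   # e.g., group B: yes / no

# Note: scipy applies a continuity correction to 2-by-2 tables by default;
# pass correction=False to get the ordinary Pearson chi-square.
chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p, dof)     # dof = (rows - 1) x (columns - 1) = 1 for a 2-by-2 table
print(expected)         # the counts expected under the null hypothesis of no association
```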

The following photo of the board, containing chi-square tips, was added on November 15, 2011 (thanks to Selen).


Plus a song (added November 1, 2011):

One Degree is Free
Lyrics by Alan Reifman
(May be sung to the tune of “Rock and Roll is Free,” Ben Harper)

Look at your, chi-square table,
If it is, 2-by-2,
One cell can be filled freely,
While the others take their cue,

The formula that you can use,
Come on, from the columns, lose one,
And one, as well, from the rows,
Multiply the two, isn’t this fun?

One degree is free, in your table,
With con-tin-gen-cy, in your table,
One degree is free, in your table,
…free in your table,
…free in your table,

Say, your table is larger,
Maybe it’s 2-by-4,
Multiply one by three,
3 df are in store,

The df’s are essential,
To check significance,
Go to your chi-square table,
And find the right instance,

Three degrees are free, in your table,
With con-tin-gen-cy, in your table,
Three degrees are free, in your table,
…free in your table,
…free in your table,

(Guitar Solo)

Wednesday, October 06, 2010

Correlation

UPDATED 10/2/2022 

Our next topic is correlational analysis. There are four major areas to address:

1. A general introduction to correlation. Correlation refers to how two variables go together ("co-relate"). There are three main types of correlations:

Positive correlation: As one variable goes up, so does the other. They both follow the same pattern. Knowing where a person stands on one variable, you know roughly where he/she stands on the other (maximum = +1.0). 
  • Example: The more hours one studies before a test, the higher the score he or she will likely get. 
Zero correlation: Knowing where a person stands on one variable tells us nothing about where he/she stands on the other. Someone who has a high score on one variable is equally likely to have a high or a low value on the other. 
  • Example: A person’s number of sneezes per week is (probably) uncorrelated with the percent of a person’s shirts that are blue. 
Negative correlation: As one goes up, the other goes down. They follow an inverse pattern. Knowing where a person stands on one variable, you again know roughly where he/she stands on the other (minimum = -1.0). 
  • Example: The higher the winter temperatures where one lives (e.g., Miami), the fewer the heavy jackets people buy. 
Graphical depictions of positive, zero, and negative correlations. Note that the sign of the correlation (symbolized r) matches the direction of the slope of the best-fitting line (the line that comes closest to all the points), and its magnitude reflects the degree to which the points sit close to the line vs. being scattered.
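To see those three patterns numerically, here is a small sketch in Python (numpy) with purely simulated variables.

```python
# Illustrative only: simulated positive, near-zero, and negative correlations.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 200)

y_pos = x + rng.normal(0, 0.5, 200)     # points trend upward   -> r near +0.9
y_zero = rng.normal(0, 1, 200)          # unrelated to x        -> r near 0
y_neg = -x + rng.normal(0, 0.5, 200)    # points trend downward -> r near -0.9

for label, y in [("positive", y_pos), ("zero", y_zero), ("negative", y_neg)]:
    print(label, round(np.corrcoef(x, y)[0, 1], 2))
```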

A song to nail down our understanding of correlation, best-fit lines, upward and downward slopes, etc. 

Fitting the Line 
Lyrics by Alan Reifman 
(May be sung to the tune of “Draggin’ the Line,” James/King) 

Plotting the data, on X and Y, 
Finding the slope, with most points nearby, 
We want to find the angle, of the trend’s incline, 
Fitting the line (fitting the line), 

Upward slopes make r positive, 
Slopes trending down, make it negative, 
From minus-one to plus-one, r can feel fine, 
Fitting the line (fitting the line), 
Fitting the line (fitting the line), 

Points align, how will the data shine? 
If you have upward slopes, it’ll give you a plus sign, 
Fitting the line (fitting the line), 
Fitting the line (fitting the line), 

How strongly will your variables relate? 
Is there a trend, or just a zero flat state? 
You want to know what your analysis will find, 
Fitting the line (fitting the line), 
Fitting the line (fitting the line), 

Points align, how will the data shine? 
Your r will be minus, if the slope declines, 
Fitting the line (fitting the line), 
Fitting the line (fitting the line), 

(Guitar solo) 

Points align, how will the data shine? 
If you have upward slopes, it’ll give you a plus sign, 
Fitting the line (fitting the line), 
Fitting the line (fitting the line)… 

Facebook album of Dr. Reifman meeting singer Tommy James, after the latter's concert at the 2013 South Plains Fair. 

2. Running correlations in SPSS. This graphic of SPSS output tries to make clear that a sample correlation and its significance/probability level are two different things (although related to each other).
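The same distinction holds outside SPSS; in this minimal Python sketch (scipy, invented variables), the correlation and its probability level come out as two separate numbers.

```python
# Illustrative only: a sample correlation (r) and its p-value are related but distinct.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)
x = rng.normal(size=50)
y = 0.4 * x + rng.normal(size=50)

r, p = pearsonr(x, y)
print(f"r = {r:.2f}   (the sample correlation)")
print(f"p = {p:.4f} (chance of an r at least this far from zero if the population correlation is truly zero)")
```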


Next, to graph the data points and the best-fitting line, start in "Graphs," go to "Legacy Dialogs," and select "Scatter/Dot." Then select "Simple Scatter" and click "Define." Insert the variables you want to display on the X and Y axes, and click "OK." When the scatter plot first appears, you can click on it to do more editing. To add the best-fit line, under "Elements," choose "Fit Line at Total."

Initially the dots will all look the same throughout the scatter plot. To make each dot represent the number of cases at that point (either by thickness of the dot or through color-coding), click on the "Binning" icon (circled below in red). Thanks to Xiaohui for finding this!
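For comparison, here is a rough non-SPSS equivalent in Python (matplotlib/numpy, invented data) that draws the scatter plot, adds a least-squares fit line, and uses transparency so that overlapping points look darker (a crude stand-in for the binning idea).

```python
# Illustrative only: scatter plot with a best-fit line.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
x = rng.normal(50, 10, 100)
y = 0.8 * x + rng.normal(0, 8, 100)

slope, intercept = np.polyfit(x, y, 1)       # least-squares line ("Fit Line at Total")
plt.scatter(x, y, alpha=0.5)                 # semi-transparent dots; overlaps appear darker
plt.plot(np.sort(x), intercept + slope * np.sort(x), color="red")
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
```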

3. Statistical significance and testing the null hypothesis, as applied to correlation. Subthemes within this topic include how sample size affects the ease of getting a statistically significant result (i.e., rejecting the null hypothesis of zero correlation in the full population), and one- vs. two-tailed significance tests.
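To illustrate the sample-size point, here is a minimal Python sketch (scipy, simulated data): the population relationship is equally weak in both runs, but the larger sample is far more likely to yield a significant result.

```python
# Illustrative only: the same weak population correlation, tested at two sample sizes.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(11)

for n in (20, 2000):
    x = rng.normal(size=n)
    y = 0.1 * x + rng.normal(size=n)   # a weak relationship built into the population
    r, p = pearsonr(x, y)              # two-tailed p-value by default
    print(f"n = {n:5d}: r = {r:.2f}, p = {p:.4f}")
```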

4. Partial correlation (i.e., the correlation between two variables, holding constant one or more "lurking" variables).
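One way to see what "holding constant" means is to compute the partial correlation by hand: correlate the parts of X and Y that are left over after removing a third variable Z. Here is a minimal Python sketch (numpy/scipy, invented variables).

```python
# Illustrative only: partial correlation as the correlation of residuals.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(5)
z = rng.normal(size=300)            # the "lurking" variable
x = z + rng.normal(size=300)        # x and y are related only through z
y = z + rng.normal(size=300)

def residuals(a, control):
    """The part of 'a' left over after regressing it on 'control'."""
    slope, intercept = np.polyfit(control, a, 1)
    return a - (intercept + slope * control)

r_xy, _ = pearsonr(x, y)                                  # ordinary correlation (noticeably positive)
r_xy_z, _ = pearsonr(residuals(x, z), residuals(y, z))    # partial correlation, holding z constant
print(round(r_xy, 2), round(r_xy_z, 2))                   # the partial r should be near zero
```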

Here are some additional tips:

5. In evaluating the meaning of a correlation that appears as positive or negative in the SPSS output, you must know how each of the variables is keyed (i.e., does a high score reflect more of the behavior or less of the behavior?).

6. Statistical significance is not necessarily indicative of social importance. With really large sample sizes (such as we have available in the GSS), even a correlation that is only modestly different from zero may be statistically significant. To help put correlation sizes in perspective, the late statistician Jacob Cohen devised criteria for "small" (roughly .10), "medium" (roughly .30), and "large" (roughly .50) correlations.

7. Correlations should also be interpreted in the context of range restriction (see links section on the right). Here's a song to reinforce the ideas:


Restriction in the Range 
Lyrics by Alan Reifman
(May be sung to the tune of “Laughter in the Rain,” Sedaka/Cody)

Why do you get such a small correlation,
With variables you think should be related?
Seems you’re not studying the full human spectrum,
Just looking at part of bivariate space,
All kinds of thoughts start to race, through your mind…

Ooh, there’s restriction in the range,
Dampening the slope of the best-fit line,
Ooh, I can correct r for this,
Put a better rho estimate in its place...