Saturday, September 30, 2006

Probability -- Intro Lecture

(Updated October 1, 2013)

This week, we'll be learning about probability. Probability is an important foundation for statistical analysis: when you obtain a finding in your research (e.g., a higher percentage of couples show improvement after receiving a novel, experimental marital therapy than do control-group couples who received an older form of therapy), you need to know how likely it is that this result came up purely by chance.

Although some topics can get fairly complex, the core idea to keep in mind regarding probability is counting how many possible ways something can turn out. A coin can come up heads or tails, so the probability of a head is 1/2 (and the same for a tail).

A website called "Math Goodies" has some excellent lecture notes on probability. In this class, we'll want to look at the units from this site on "Introduction to Probability," "Sample Spaces," "Addition Rules for Probability," and "Independent Events."

We're mainly interested in two principles known as the multiplication/and rule (which requires the events to be independent) and the addition/or rule (which requires the events to be mutually exclusive). Here is a description of non-mutually exclusive events and how to adjust the addition/or procedure accordingly.
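If it helps to see the arithmetic spelled out, here is a minimal sketch in Python (my own illustration, using a fair six-sided die as the example; none of this comes from the linked readings):

# Multiplication / "and" rule (requires independent events):
# P(A and B) = P(A) * P(B)
# Example: two separate die rolls both come up 6.
p_six = 1 / 6
p_two_sixes = p_six * p_six                      # 1/36, about .028

# Addition / "or" rule (requires mutually exclusive events):
# P(A or B) = P(A) + P(B)
# Example: a single roll is a 1 OR a 2.
p_one_or_two = 1 / 6 + 1 / 6                     # 2/6, about .333

# Adjustment for NON-mutually exclusive events:
# P(A or B) = P(A) + P(B) - P(A and B)
# Example: a single roll is even (2, 4, 6) OR greater than 3 (4, 5, 6);
# the rolls 4 and 6 satisfy both, so they must not be double-counted.
p_even, p_gt3, p_both = 3 / 6, 3 / 6, 2 / 6
p_even_or_gt3 = p_even + p_gt3 - p_both          # 4/6, about .667

print(p_two_sixes, p_one_or_two, p_even_or_gt3)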

***

A tool that has many useful applications in statistics and probability is the "n choose k" formulation (see the links section on the right for an online calculator). I alluded to this approach in talking about how the 2005-06 St. Louis University men's basketball team had a perfect alternation of wins and losses through its first 19 games (WLWLWLWLWLWLWLWLWLW). As discussed on my Hot Hand blog, I framed the question in the form of: How many possible different ways are there to distribute 10 items (wins) in 19 boxes (games)?

To simplify things, let's imagine that a sorority needs two of its members to oversee an event, but four members volunteer (Cammy, Pammy, Sammy, and Tammy). How many possible duos can be chosen from four members? This would be stated as "4 choose 2" (there would be a 4 on top of a 2 in parentheses, but this is not a fraction). Keeping in mind that n = 4 and k = 2, the following description from Ken Ross (p. 43) may be helpful:

...the denominator is the product of k numbers from k down to 1...

...the numerator is the product of numbers starting with n and going down until you also have k numbers in the numerator...


Mathematically, we would take:

(4 X 3) / (2 X 1)

which equals 6. The possible duos are thus (a quick code check follows the list):

Cammy with Pammy
Cammy with Sammy
Cammy with Tammy
Pammy with Sammy
Pammy with Tammy
Sammy with Tammy
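
If you want to verify these counts yourself, Python's math.comb function computes "n choose k" directly; here is a quick sketch (mine, not part of the original notes):

from math import comb

# The sorority example: how many duos can be formed from 4 members?
print(comb(4, 2))         # 6 -- matches the six duos listed above

# The basketball example: 10 wins distributed among 19 games
print(comb(19, 10))       # 92378 possible win/loss sequences

# Ross's numerator/denominator description, written out for 4 choose 2
print((4 * 3) / (2 * 1))  # 6.0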

***

Finally, let's look at how binomial distributions (e.g., coin-tossing) lead up to the normal, bell-shaped curve (here and here).

Coin-tossing is an example of a Bernoulli process. Westfall and Henning's book shows how to simulate coin-tossing in Excel. Here are the specifications if anyone wants to try this at home:

You need to enable Excel's "Analysis ToolPak" add-in. Then, access the Random Number Generator through the "Data" tab and the "Data Analysis" button. Set the number of variables to 10 and the number of random numbers to 1000. The distribution to select is "Bernoulli," with p (the probability of success on each trial) = 0.5. The output range should be $A$1. In column K of the first row of data (after the data are generated), enter =SUM(A1:J1). How to drag down to get the sum for all rows is explained here. Finally, making histograms in Excel is explained here.
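
For anyone who would rather not use Excel, here is a rough Python equivalent of the same simulation (my own sketch with NumPy and Matplotlib; it mirrors the specifications above rather than reproducing Westfall and Henning's exact procedure):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng()

# 1,000 rows of 10 Bernoulli trials each (p = 0.5): 10 coin flips per row
flips = rng.binomial(n=1, p=0.5, size=(1000, 10))

# Row sums = number of heads out of 10, like =SUM(A1:J1) dragged down
heads_per_row = flips.sum(axis=1)

# Histogram of the 1,000 sums; it should peak around 5 heads and look
# roughly bell-shaped, which is the binomial-to-normal point of the lecture
plt.hist(heads_per_row, bins=range(0, 12), edgecolor="black")
plt.xlabel("Number of heads in 10 flips")
plt.ylabel("Frequency")
plt.show()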

Wednesday, September 27, 2006

z-Scores and Other Standardization Schemes

(Updated March 14, 2015) 

As you know, I like to find real-life examples of the concepts covered in class. Standard deviations, z scores, etc., rarely come up in the public discourse. One place they do come up, however, is in the criteria for membership in Mensa, also known as the "high IQ society" (click here and then scroll down a bit). Mensa requires someone to be in the top 2% of IQ scores, which corresponds to roughly 2 SD above the mean. Those of you who fly on American Eagle into and out of Lubbock, and read the American Way magazine on the plane, may have noticed that each issue includes some sample Mensa items, for your amusement and bemusement.
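
For those curious how the 2% cutoff and the 2-SD cutoff line up, here is a small Python check (my addition) using SciPy's normal-distribution functions and a typical IQ scale with mean 100 and SD 15:

from scipy.stats import norm

# z-score that cuts off the top 2% of a normal distribution
z_cutoff = norm.ppf(0.98)          # about 2.05 -- roughly "2 SD above the mean"

# Translate that cutoff onto an IQ scale normed to mean 100, SD 15
iq_cutoff = 100 + 15 * z_cutoff    # about 131

print(round(z_cutoff, 2), round(iq_cutoff, 1))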

First, to summarize how different distributions are normed to different means and standard deviations, here's a list (each distribution's properties can be reported in the form, M +/- SD):


Distribution    Mean    SD
Normal (z)         0     1
McCall T          50    10
IQ Tests         100    15 or 16


Here's an informative document from Valdosta State University on standardization of variables onto different scales.
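
Here is a brief sketch of the same idea in Python (the raw scores are made-up numbers, purely for illustration): convert raw scores to z-scores, then re-norm them onto whatever mean and SD the target scale uses.

import numpy as np

# Hypothetical raw scores on some test
raw = np.array([12, 15, 18, 20, 25], dtype=float)

# Step 1: z-scores (mean 0, SD 1); ddof=1 gives the sample SD
z = (raw - raw.mean()) / raw.std(ddof=1)

# Step 2: re-norm onto the other scales in the table above
t_scores = 50 + 10 * z     # McCall T: mean 50, SD 10
iq_style = 100 + 15 * z    # IQ-type scale: mean 100, SD 15

print(np.round(z, 2))
print(np.round(t_scores, 1))
print(np.round(iq_style, 1))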

Here is a link on the "Flynn Effect" (rising population-level IQ's over the years), for those who are interested.

Tuesday, September 19, 2006

Moments of a Distribution

As we finish up descriptive statistics (i.e., ways to describe distributions of data), I wanted to provide a conceptual scheme for those of you who want more structure to organize the information.

In my little example of how often people go to the movies, we saw how two distributions could have the same mean, yet still have pretty different shapes (both were symmetrical, but one was more spread out than the other). The mean alone doesn't distinguish the two distributions, so we have to look at a second feature, the standard deviation, to tell them apart.

You can also have two distributions that have the same mean and same standard deviation, yet are still not identical-looking. The following example from the financial world illustrates just such a case:


The blue and green distributions are clearly different. The distributions are not distinguished by their means (they're the same), nor are the distributions distinguished by their standard deviations (they're also the same). We have to look at a third feature of the distributions, their skewness, to tell them apart.

The logic that I hope you've seized upon has to do with the mathematical concept of moments:

The mean is the first moment.

The variance (which is the SD squared) is the second moment, taken about the mean.

The skewness is the third moment.

The kurtosis is the fourth moment.

The textbook from when I took first-year graduate statistics in the psychology department at Michigan during the 1984-85 academic year (William L. Hays, Statistics, 3rd ed.) nicely summarizes moments, in my view:

Just as the mean describes the "location" of the distribution on the X axis, and the variance describes its dispersion, so do the higher moments reflect other features of the distribution. For example, the third moment about the mean is used in certain measures of degree of skewness... The fourth moment indicates the degree of "peakedness" or kurtosis of the distribution, and so on. These higher moments have relatively little use in elementary applications of statistics, but they are important for mathematical statisticians... The entire set of moments for a distribution will ordinarily determine the distribution exactly... (p. 167)

UPDATE (2014): Peter Westfall (TTU business professor) and Kevin Henning, in their book Understanding Advanced Statistical Methods, argue that the key feature of kurtosis is not a distribution's peakedness, but rather its propensity to produce outliers. The relevant formula is excess kurtosis (the ordinary kurtosis value minus 3), which determines whether kurtosis is reported as positive or negative; the normal distribution comes out at exactly zero. Westfall and Henning note that "if the kurtosis is greater than zero, then the distribution is more outlier-prone than the normal distribution; and if the kurtosis is less than zero, then the distribution is less outlier-prone than the normal distribution" (pp. 250-251).
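
To make the moments concrete, here is a short Python sketch (my addition, using NumPy and SciPy; SciPy's kurtosis function reports excess kurtosis, so values above zero correspond to Westfall and Henning's "more outlier-prone than the normal distribution"):

import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)

normal_sample = rng.normal(size=10_000)       # symmetric, not outlier-prone
skewed_sample = rng.exponential(size=10_000)  # right-skewed, outlier-prone

for name, x in [("normal", normal_sample), ("exponential", skewed_sample)]:
    print(name,
          "| mean:", round(x.mean(), 2),                 # 1st moment
          "| variance:", round(x.var(ddof=1), 2),        # 2nd moment (about the mean)
          "| skewness:", round(skew(x), 2),              # 3rd (standardized) moment
          "| excess kurtosis:", round(kurtosis(x), 2))   # 4th, with 3 subtracted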

Wednesday, September 13, 2006

Normal and Non-Normal (J-Shaped) Distributions

(Updated July 10, 2015)

We're going to move beyond simple histograms and focus more intensively on types of distributions. These two web documents describe types of distributions (here and here).

Much of statistics is based on the idea of a normal "bell-shaped" curve, where most participants have scores in the middle of the distribution (on whatever variable is being studied) and progressively fewer have scores as you move to the extremes (high and low) of the distribution. Variables such as self-identified political ideology in the U.S. and people's height have at least a roughly normal distribution. However, data sets only rarely seem to follow the normal curve. The following article provides evidence that many human characteristics do not follow a normal bell-shaped curve:

O'Boyle, E., & Aguinis, H. (2012). The best and the rest: Revisiting the norm of normality of individual performance. Personnel Psychology, 65, 79-119. (Accessible on TTU computers at this link; see especially Figure 2)

In recent years, however, there has been a lot of interest in "J-shaped" curves. Such curves, which are also known as "power-law," "scale-free," and Pareto distributions, represent the situation where most people have relatively low scores, a few people have moderately high scores, and a very few have extremely high scores. The term "power law" comes from the fact that such distributions follow an equation such as y = 1/x^2; we can plot y when x equals 1, 2, 3, etc.
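
To see why such an equation produces a J-shape, here is a tiny numerical illustration (mine, not from the original post): the y values start high and drop off quickly, leaving a long, thin tail.

# y = 1 / x**2 for x = 1, 2, 3, ..., 10
for x in range(1, 11):
    print(x, round(1 / x**2, 3))
# Output: 1.0, 0.25, 0.111, 0.062, ... -- a steep initial drop and a long tail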

Distributions in surveys of respondents' number of sexual partners (see Figure 2 of linked document) illustrate what we mean by "scale free" or J-shaped. 

Other examples of J-shaped curves include:
  • Wealth distributions. In the United States, "the top 1% of households (the upper class) owned 34.6% of all privately held wealth" (link). Thus, at the extreme high end of the X-axis, where the high monetary values would be located, the frequency of occurrence (Y-axis) would be very small (1%).
  • Alcohol consumption. See here.
  • Number of lifetime romantic partners. See here.
An example of a bimodal distribution is shown in Table 2 of:
  • James, J. D., Breezeel, G. S., & Ross, S. D. (2001). A two-stage study of the reasons to begin and continue tailgating. Sport Marketing Quarterly, 10, 212-222.
Most of the statistical techniques we will learn in this course assume normal distributions. As far as dealing with non-normal data, O'Boyle and Aguinis (2012) suggest the following:

Based on the problems of applying Gaussian [normal] techniques to [a] Paretian distribution, our first recommendation for researchers examining individual performance is to test for normality. Paretian distributions will often appear highly skewed and leptokurtic.

(We will learn about skew and kurtosis [where the term "leptokurtic" comes from] in an upcoming lecture.)
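
As a sketch of what "test for normality" might look like in practice (my own illustration; O'Boyle and Aguinis do not prescribe this particular procedure), one option is the Shapiro-Wilk test in SciPy, along with the sample skewness and kurtosis:

import numpy as np
from scipy.stats import shapiro, skew, kurtosis

rng = np.random.default_rng(1)

# A heavy-tailed, Pareto-like variable vs. a normally distributed one
pareto_like = rng.pareto(a=2.0, size=500)
normal_like = rng.normal(size=500)

for name, x in [("Pareto-like", pareto_like), ("normal-like", normal_like)]:
    stat, p = shapiro(x)    # small p suggests the data are not normal
    print(name,
          "| Shapiro-Wilk p =", round(p, 4),
          "| skewness =", round(skew(x), 2),
          "| excess kurtosis =", round(kurtosis(x), 2))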