0.67 | 0.66 | 0.57 | 0.69 | 0.67 | 0.60 | 0.71 | 0.78 | 0.62 | 0.65 |
0.66 | 0.76 | 0.63 | 0.64 | 0.57 | 0.65 | 0.66 | 0.75 | 0.61 | 0.61 |
0.62 | 0.66 | 0.67 | 0.66 | 0.63 | 0.68 | 0.68 | 0.61 | 0.62 | 0.70 |
0.71 | 0.66 | 0.64 | 0.64 | 0.63 | 0.70 | 0.62 | 0.73 | 0.56 | 0.66 |
0.66 | 0.62 | 0.68 | 0.72 | 0.64 | 0.64 | 0.64 | 0.63 | 0.70 | 0.64 |
19 Sampling Distribution of Sample Proportions
19.1 Repeated Sampling For Proportions
Suppose we start with a population that has a certain population proportion \(p\).
Then imagine we take samples of a certain size \(n\) over and over again, each time we calculate the sample proportion \(\hat p\).
So we:
- choose a sample and calculate a sample proportion \(\hat p\)
- choose a sample and calculate another sample proportion \(\hat p\)
- choose a sample and calculate another sample proportion \(\hat p\)
- and so on
Usually the sample proportions are close to the population proportion \(p\) but sometimes they are farther away (though this is less likely the farther away we get from the population proportion).
It turns out…
If we then plot all these sample proportions on a histogram, we get something that looks like a normal curve!
19.2 Coke vs Pepsi Example
Lets show this with a made up example.
Suppose the population proportion of students that prefer coke to pepsi is \(p=0.65\).
This means for the whole student population:
- 65% prefer coke
- 35% prefer pepsi
(We don’t allow choosing neither as an answer here…its binary)
Suppose we take samples of size \(n=100\) students with the following results:
- The first sample has \(\hat p = 0.67\) prefer coke
- The second sample has \(\hat p = 0.66\) prefer coke
- The third sample has \(\hat p = 0.57\) prefer coke
- and so on
Here are the sample proportions that we get for the first 50 samples:
Suppose we continue and do this 300 times…
Finally let’s plot all of these sample proportions together in a histogram:
We can see that the shape of the distribution is normal
19.3 Sampling Distribution of Sample Proportions
This distribution of the sample proportions is called the sampling distribution of sample proportions or the \(\hat p\)-distribution.
Sampling distributions are made by repeatedly taking samples then computing a statistic, then collecting that data and graphing it in a histogram.
19.4 Central Limit Theorem - Proportions
It turns out that this “normal curve” shape turns up all the time for sampling distributions under the right conditions, and that fact is called the Central Limit Theorem.
This is one of the most important results in statistics and basically it tells us that we can use a normal distribution to approximate the sampling distribution (the \(\hat p\)-distribution that is).
This will allow us to calculate areas for the \(\widehat{p}\) distribution by calculating z-values and using a standard normal curve to find areas we want.
Theorem 19.1 (Central Limit Theorem-Proportions)
When we use a large population with a population proportion of \(p\) and consider samples of size n from this population
- The \(\hat p\) distribution is approximately normal, becoming more and more accurate as n is larger
- The mean and standard deviation of the \(\hat p\) distribution are given by these formulas:
\[ \text{mean} = p \]
\[ \text{standard deviation} = \sqrt{\frac{p(1-p)}{n}} \]
- For the above to be valid must have \(n(p)>5\) and \(n(1-p)>5\).
The first bullet point of this theorem says:
- we can use a standard normal table to calculate areas under the \(\widehat{p}\)-distribution
The second bullet point says:
- the \(\widehat{p}\)-distribution has \(p\) as the center of it, and the standard deviation we can compute with the formula in the theorem.
The third bullet point says:
- You have to compute \(n(p)\) and \(n(1-p)\) and make sure they are at least \(5\). Otherwise you cannot use this approach.
19.5 How to Calculate \(\hat p\)-distribution Areas
So to find areas under the \(\widehat{p}\) distribution we will just convert them to areas under a standard normal curve like we did before but this time we will use this formula for the z-scores:
\[ z=\frac{\hat p-p}{\sqrt{\frac{(1-p)(p)}{n}}} \]
So just like before we will find left tails, right tails and areas between but this time using the z-value formula above.
19.6 Examples of Areas For \(\hat p\)-distribution
Lets see some examples of computing areas for the \(\hat p\)-distribution.
First lets see how to compute a left tail area:
Example 19.1 (Left Tail Area for \(\hat p\)-distribution)
Suppose we have a population proportion of \(79\%\) and take a sample of size \(250\). Suppose the sample proportion from that sample is \(190\) out of \(250\). What is the chance of getting that sample proportion or less?
Solution:
We have population proportion of \(p=0.79\) and sample proportion of \(190\) out of \(250\).
We are interested in how likely it is we get our sample proportion or below.
First let’s calculate our sample proportion:
\[\begin{equation} \hat p = \frac{count}{n} = \frac{190}{250} = 0.76 \end{equation}\]
Before we start we check the conditions of the Central Limit Theorem to make sure we can use a normal distribution here:
\(np=(250)(0.79)= 197.5\) and \(n(1-p)=(250)(0.21)= 52.5\)
Since these are both at least \(5\) we are in good shape.
Here is the \(\hat p\)-distribution showing the area that we want:
Notice:
- The sample proportion \(\hat p=0.76\) is at the edge of the shaded region we want to find.
- The population proportion \(p =0.79\) is in the center.
We want the shaded left tail area that is to the left of \(\hat p = 0.76\):
We will find this area by changing the sample proportion \(\hat p\) into a z-value and using the standard normal table.
First we need the calculation of the standard deviation since we need this in our z-value calculation:
\[\begin{equation} \small{ stdev=\sqrt{\frac{(1-p)p}{n}} =\sqrt{\frac{(1-0.79)0.79}{250}} =\sqrt{\frac{(0.21)0.79}{250}} =\sqrt{\frac{0.1659}{250}} =0.0257604 } \end{equation}\]
Now lets find the z-value using the sample proportion \(\hat p= 0.76\) the population proportion \(p=0.79\) and the standard deviation \(stdev\) we just found:
\[\begin{equation} z=\frac{\hat p-p}{stdev} =\frac{0.76-0.79}{0.0257604} =\frac{-0.03}{0.0257604} =-1.16 \end{equation}\]
So here is the equivalent left tail area for of \(z=-1.16\).
This area is the same size as our original area so we just find this one using the standard normal distribution.
We can look up the area in the standard normal z-table using \(z=-1.16\)
We go to the row that has -1.1 and then to the column that contains .06 and we see this:
.00 | .01 | .02 | .03 | .04 | .05 | .06 | .07 | .08 | .09 | |
-1.2 | .1151 | .1131 | .1112 | .1093 | .1075 | .1056 | .1038 | .1020 | .1003 | .0985 |
-1.1 | .1357 | .1335 | .1314 | .1292 | .1271 | .1251 | .1230 | .1210 | .1190 | .1170 |
-1.0 | .1587 | .1562 | .1539 | .1515 | .1492 | .1469 | .1446 | .1423 | .1401 | .1379 |
So that means:
\[\begin{equation} \text{left tail area} =0.123 \end{equation}\]
Rounded to the nearest percent this is 12%. This means that the shaded area corresponds to 12% of the entire data.
This means there is a 12% chance of getting a sample proportion of \(\hat p=0.76\) or less for this population when using samples of size \(n = 250\).
\[ \tag*{$\blacksquare$} \]
Now lets see how to compute a right tail area:
Example 19.2 (Right Tail Area for \(\hat p\)-distribution)
Suppose we have a population proportion of \(61\%\) and take a sample of size \(320\). Suppose the sample proportion from that sample is \(208\) out of \(320\). What is the chance of getting that sample proportion or more?
Solution:
We have population proportion of \(p=0.61\) and sample proportion of \(208\) out of \(320\).
We are interested in how likely it is we get our sample proportion or above.
First let’s calculate our sample proportion:
\[\begin{equation} \hat p = \frac{count}{n} = \frac{208}{320} = 0.65 \end{equation}\]
Before we start we check the conditions of the Central Limit Theorem to make sure we can use a normal distribution here:
\(np=(320)(0.61)= 195.2\) and \(n(1-p)=(320)(0.39)= 124.8\)
Since these are both at least \(5\) we are in good shape.
Here is the \(\hat p\)-distribution showing the area that we want:
Notice:
- The sample proportion \(\hat p=0.65\) is at the edge of the shaded region we want to find.
- The population proportion \(p =0.61\) is in the center.
We want the shaded right tail area that is to the right of \(\hat p = 0.65\):
We will find this area by changing the sample proportion \(\hat p\) into a z-value and using the standard normal table.
First we need the calculation of the standard deviation since we need this in our z-value calculation:
\[\begin{equation} \small{ stdev=\sqrt{\frac{(1-p)p}{n}} =\sqrt{\frac{(1-0.61)0.61}{320}} =\sqrt{\frac{(0.39)0.61}{320}} =\sqrt{\frac{0.2379}{320}} =0.0272661 } \end{equation}\]
Now lets find the z-value using the sample proportion \(\hat p= 0.65\) the population proportion \(p=0.61\) and the standard deviation \(stdev\) we just found:
\[\begin{equation} z=\frac{\hat p-p}{stdev} =\frac{0.65-0.61}{0.0272661} =\frac{0.04}{0.0272661} =1.47 \end{equation}\]
So here is the equivalent right tail area for of \(z=1.47\).
This area is the same size as our original area so we just find this one using the standard normal distribution.
We can find the right tail area for \(z=1.47\) by finding the left tail area and then subtracting that from 1.0.
So lets look up the left tail area first. Go to the row that has 1.4 and then to the column that contains .07 and we see this:
.00 | .01 | .02 | .03 | .04 | .05 | .06 | .07 | .08 | .09 | |
1.3 | .9032 | .9049 | .9066 | .9082 | .9099 | .9115 | .9131 | .9147 | .9162 | .9177 |
1.4 | .9192 | .9207 | .9222 | .9236 | .9251 | .9265 | .9279 | .9292 | .9306 | .9319 |
1.5 | .9332 | .9345 | .9357 | .9370 | .9382 | .9394 | .9406 | .9418 | .9429 | .9441 |
So that means:
\[\begin{equation} \text{left tail area} =0.9292 \end{equation}\]
This means that the right tail that we want to find is going to be this left tail subtracted from 1.0.
\[\begin{equation} \text{right tail area} = 1.0 - 0.9292 = 0.0708 \end{equation}\]
Rounded to the nearest percent this is 7%. This means the shaded right tail area corresponds to 7% of the entire data.
This means there is a 7% chance of getting a sample proportion of \(\hat p=0.65\) or more for this population when using samples of size \(n = 320\).
\[ \tag*{$\blacksquare$} \]