Parametric and Non-Parametric Tests of Independence

Quantitative Methods

Tests of Independence Using Contingency Table Data

Learning Outcome Statement:

explain tests of independence based on contingency table data

Summary:

This LOS covers the methodology for testing whether two classifications of categorical data are independent, using contingency tables. The test calculates the frequencies expected under the assumption of independence and compares them with the observed frequencies via the chi-square test statistic, which indicates whether there is a significant association between the classifications.

Key Concepts:

Contingency Table

A contingency table, or two-way table, displays the joint frequency distribution of two categorical variables and helps in assessing the relationship between them.
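
To illustrate, a contingency table can be tabulated directly from raw categorical observations. A minimal sketch using pandas.crosstab on hypothetical fund data (the size and rating labels are made up for illustration):

import pandas as pd

# Hypothetical raw observations: one row per fund (labels are illustrative only)
funds = pd.DataFrame({
    "size":   ["small", "small", "large", "large", "small", "large", "small", "large"],
    "rating": ["high",  "low",   "high",  "high",  "low",   "low",   "high",  "low"],
})

# Cross-tabulate the two classifications into a two-way (contingency) table
table = pd.crosstab(funds["size"], funds["rating"])
print(table)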

Chi-Square Test of Independence

This nonparametric test assesses whether observed frequencies in a contingency table differ significantly from expected frequencies, which are calculated assuming no association between the variables.

Expected Frequency

Expected frequencies are calculated under the hypothesis of independence. They represent the expected counts in each cell of the contingency table if the row and column variables are independent.

Degrees of Freedom

In a chi-square test, degrees of freedom are calculated as (number of rows - 1) * (number of columns - 1). This value determines the critical value from the chi-square distribution. For example, a 3 x 4 table has (3 - 1) * (4 - 1) = 6 degrees of freedom.

Critical Value and Decision Rule

The critical value is determined by the chosen level of significance and the degrees of freedom. Because the chi-square statistic cannot be negative, the test is one-tailed (right tail): the null hypothesis of independence is rejected if the chi-square statistic exceeds the critical value.
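
Putting these pieces together, here is a minimal sketch of the full test using scipy.stats.chi2_contingency on a hypothetical 2 x 3 table of observed counts (the numbers are illustrative only):

import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical observed frequencies for a 2 x 3 contingency table
observed = np.array([[20, 30, 25],
                     [30, 20, 25]])

# correction=False applies the plain chi-square formula (no Yates correction)
stat, p_value, df, expected = chi2_contingency(observed, correction=False)

alpha = 0.05
critical = chi2.ppf(1 - alpha, df)  # right-tail critical value

print(f"chi-square = {stat:.3f}, df = {df}, critical = {critical:.3f}, p = {p_value:.4f}")
if stat > critical:
    print("Reject the null hypothesis of independence.")
else:
    print("Fail to reject the null hypothesis of independence.")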

Formulas:

Chi-Square Test Statistic

\chi^2 = \sum_{i=1}^{m} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}

This formula calculates the chi-square statistic by summing the squared differences between observed and expected frequencies, scaled by the expected frequencies.

Variables:
O_{ij}:
Observed frequency in cell (i, j)
E_{ij}:
Expected frequency in cell (i, j)
m:
Total number of cells in the contingency table
Units: unitless
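
As a sketch of the summation itself, assuming hypothetical observed and expected counts (in practice the expected counts come from the formula in the next subsection):

import numpy as np

# Hypothetical observed and expected frequencies (same shape)
O = np.array([[20.0, 30.0], [30.0, 20.0]])
E = np.array([[25.0, 25.0], [25.0, 25.0]])

# Squared deviations scaled by expected frequency, summed over all m cells
chi_square = ((O - E) ** 2 / E).sum()
print(chi_square)  # 4.0 for these numbers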

Expected Frequency Calculation

E_{ij} = \frac{(\text{Total row } i) \times (\text{Total column } j)}{\text{Overall total}}

This formula calculates the expected frequency for each cell of the contingency table under the assumption of independence between the row and column classifications.

Variables:
i:
Row index
j:
Column index
Units: unitless
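
A minimal numpy sketch of this calculation on hypothetical observed counts: the outer product of the row totals and column totals, divided by the overall total, yields every expected frequency at once.

import numpy as np

observed = np.array([[20.0, 30.0, 25.0],
                     [30.0, 20.0, 25.0]])  # hypothetical counts

row_totals = observed.sum(axis=1)  # total of each row i
col_totals = observed.sum(axis=0)  # total of each column j
overall = observed.sum()

# E_ij = (total row i) x (total column j) / overall total, for every cell
expected = np.outer(row_totals, col_totals) / overall
print(expected)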

Tests Concerning Correlation

Learning Outcome Statement:

explain parametric and nonparametric tests of the hypothesis that the population correlation coefficient equals zero, and determine whether the hypothesis is rejected at a given level of significance

Summary:

This LOS covers methods for testing the hypothesis that the population correlation coefficient equals zero. It distinguishes between the parametric test (based on the Pearson correlation coefficient) and the non-parametric test (based on the Spearman rank correlation coefficient), explaining their applications, calculations, and interpretations in hypothesis testing.

Key Concepts:

Parametric Test of Correlation

Parametric tests, such as the Pearson correlation coefficient test, are used when data distribution assumptions (e.g., normality) are met. The test involves calculating a t-statistic to determine if the population correlation coefficient significantly differs from zero.

Non-Parametric Test of Correlation

Non-parametric tests, such as the Spearman rank correlation test, are used when the data do not meet the distributional assumptions required by the parametric test. The test calculates a correlation from ranked data, making it robust to non-normal distributions and outliers.

Hypothesis Testing

Both parametric and non-parametric tests involve setting up null and alternative hypotheses about the population correlation coefficient, selecting an appropriate test statistic, and using this statistic to decide whether to reject the null hypothesis at a specified significance level.
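
As a sketch, both routes can be run on the same hypothetical sample with scipy's pearsonr and spearmanr, each of which returns the coefficient together with a two-sided p-value:

import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(42)
x = rng.normal(size=30)                        # hypothetical sample
y = 0.5 * x + rng.normal(scale=0.8, size=30)   # related series, made up for illustration

alpha = 0.05
r, p_pearson = pearsonr(x, y)       # parametric test
rs, p_spearman = spearmanr(x, y)    # rank-based, non-parametric test

print(f"Pearson  r  = {r:.3f}, p = {p_pearson:.4f}, reject H0: {p_pearson < alpha}")
print(f"Spearman rs = {rs:.3f}, p = {p_spearman:.4f}, reject H0: {p_spearman < alpha}")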

Formulas:

Pearson Correlation Coefficient

r_{XY} = \frac{s_{XY}}{s_X s_Y}

This formula calculates the sample correlation coefficient, which measures the linear relationship between two variables.

Variables:
r_{XY}:
sample correlation coefficient between variables X and Y
s_{XY}:
sample covariance between variables X and Y
s_X:
sample standard deviation of variable X
s_Y:
sample standard deviation of variable Y
Units: unitless
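
A sketch of this formula with numpy on hypothetical data (np.cov with its default settings returns the sample covariance matrix):

import numpy as np

x = np.array([1.2, 2.4, 3.1, 4.8, 5.0])  # hypothetical sample
y = np.array([1.0, 2.9, 2.7, 5.1, 4.6])

s_xy = np.cov(x, y)[0, 1]   # sample covariance s_XY
s_x = np.std(x, ddof=1)     # sample standard deviation s_X
s_y = np.std(y, ddof=1)     # sample standard deviation s_Y

r = s_xy / (s_x * s_y)
print(r)  # matches np.corrcoef(x, y)[0, 1]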

t-Statistic for Testing Correlation

t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}

This formula is used to test the null hypothesis that the population correlation coefficient is zero. The t-statistic follows a t-distribution with n-2 degrees of freedom.

Variables:
t:
t-statistic for testing the correlation
r:
sample correlation coefficient
n:
sample size
Units: unitless
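
A sketch of the decision rule for a hypothetical sample correlation and sample size, comparing the statistic with the two-tailed critical value from the t-distribution with n - 2 degrees of freedom:

import numpy as np
from scipy.stats import t as t_dist

r, n = 0.35, 50    # hypothetical sample correlation and sample size
alpha = 0.05

t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
t_crit = t_dist.ppf(1 - alpha / 2, df=n - 2)  # two-tailed critical value

print(f"t = {t_stat:.3f}, critical = {t_crit:.3f}")
if abs(t_stat) > t_crit:
    print("Reject H0 (population correlation = 0).")
else:
    print("Fail to reject H0.")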

Spearman Rank Correlation Coefficient

r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}

This formula calculates the Spearman rank correlation coefficient, which assesses how well the relationship between two variables can be described by a monotonic function. In this form, the formula assumes there are no tied ranks.

Variables:
r_s:
Spearman rank correlation coefficient
d_i:
difference between the ranks of corresponding values of X and Y
n:
sample size
Units: unitless
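
A sketch of the rank-difference formula on hypothetical data with no tied values, checked against scipy.stats.spearmanr:

import numpy as np
from scipy.stats import rankdata, spearmanr

x = np.array([3.1, 1.2, 4.8, 2.4, 5.0])  # hypothetical sample, no ties
y = np.array([2.7, 1.0, 5.1, 2.9, 4.6])

d = rankdata(x) - rankdata(y)  # rank differences d_i
n = len(x)

r_s = 1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1))
rs_scipy, _ = spearmanr(x, y)
print(r_s, rs_scipy)  # the two agree when there are no tied ranks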