Ramgopal Prajapat:

Learnings and Views

Bivariate Analysis: Contingency Analysis and Chi Square

By: Ram on Apr 02, 2021

To test for an association between two nominal variables, a contingency table and the chi-square test are used. The contingency table is built by listing the values of one variable as rows and the values of the other as columns, with each cell holding the count of observations for that combination.
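As a minimal sketch of how such a table is built, the snippet below counts each combination of two nominal values. The records are made up for illustration and the helper name `contingency_table` is an assumption, not something from the bank's analysis.

```python
from collections import Counter

# Hypothetical customer records as (segment, product) pairs.
# These are illustrative only and are NOT the bank's data.
records = [
    ("Premium", "Online Saver"),
    ("Premium", "Online Saver"),
    ("Premium", "Money Market"),
    ("Core", "FD"),
    ("Core", "Money Market"),
    ("Mass-Market", "FD"),
]

def contingency_table(pairs):
    """Count every (row value, column value) combination."""
    counts = Counter(pairs)
    row_labels = sorted({r for r, _ in pairs})
    col_labels = sorted({c for _, c in pairs})
    # Counter returns 0 for combinations that never occur.
    table = [[counts[(r, c)] for c in col_labels] for r in row_labels]
    return row_labels, col_labels, table

rows, cols, table = contingency_table(records)
for label, counts_row in zip(rows, table):
    print(label, counts_row)
```

Each printed row is one value of the first variable, and each position in the list is one value of the second variable.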

The Pearson chi-square test (the same statistic is used for "goodness-of-fit" tests) checks whether the observed frequencies in the contingency table differ from the frequencies that would be expected if the two variables were unrelated. Under the null hypothesis, the test statistic approximately follows a chi-square distribution. Another statistic that can be used is the Likelihood Ratio Chi-Square. As the sample size increases, the Pearson Chi-Square and Likelihood Ratio Chi-Square statistics converge.
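To make the two statistics concrete, here is a rough sketch that computes both on the same table. The counts are hypothetical (not the bank's data), the function names are my own, and the expected counts are taken as row total * column total / grand total.

```python
import math

def expected_counts(observed):
    """Expected count per cell: row total * column total / grand total."""
    row_totals = [sum(r) for r in observed]
    col_totals = [sum(c) for c in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def pearson_chi_square(observed):
    """Sum of (O - E)^2 / E over all cells."""
    expected = expected_counts(observed)
    return sum((o - e) ** 2 / e
               for o_row, e_row in zip(observed, expected)
               for o, e in zip(o_row, e_row))

def likelihood_ratio_chi_square(observed):
    """2 * sum of O * ln(O / E); cells with O = 0 contribute nothing."""
    expected = expected_counts(observed)
    return 2.0 * sum(o * math.log(o / e)
                     for o_row, e_row in zip(observed, expected)
                     for o, e in zip(o_row, e_row) if o > 0)

# Hypothetical counts: 4 segments (rows) by 3 products (columns).
observed = [
    [30, 60, 10],
    [40, 30, 30],
    [50, 20, 30],
    [80, 40, 80],
]

pearson = pearson_chi_square(observed)
g_stat = likelihood_ratio_chi_square(observed)
print(pearson, g_stat)
```

On this made-up table the two values come out close to each other but not identical, which is the usual picture for a reasonably large sample.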

Application of Contingency Analysis

The segment manager wanted to understand the association between customer segments and saving product holding. The important point is that both variables, customer segment and product holding, are nominal variables. One statistical technique that helps in finding an association between nominal variables is the Chi-Square Test (along with the Likelihood Ratio Chi-Square Test).

A bank has 3 different saving products – Money Market, Online Saver, and Fixed Deposit (FD).

Money Market Product: A Money Market account is similar to other saving accounts but requires higher balances and may limit the number of withdrawals. Customers get a higher interest rate, and the bank gains some stability in the balances on customer accounts.

Online Saving Account: The account is accessed and managed online. It may limit the number of withdrawal transactions. The required balance may be lower. The bank may have lower operational costs associated with the account.

Fixed Deposit (FD): Money is deposited for a fixed period, and the bank agrees to pay a fixed interest rate for that period. Customers get a higher interest rate, and the bank gets to invest the money for a longer and assured period.

Customer segments can be defined by a combination of variables such as income, age, and other information. The bank had 4 segments: Premium, Advance, Core, and Mass-Market.

Contingency Analysis and Calculation of Statistics

Column and row marginal totals and the grand total are important concepts for calculating expected frequencies in a cell.

Every cell has observed frequency/count for a combination of row and column variable values.  Let i refer to a row and j refer to a column.

Oij           : Observed (from input data) frequency for a row i and column j

Ri             : Marginal total for row i

Cj             : Marginal total for column j

N             : Grand total

Eij            : Expected frequency under the assumption that the row and column variables are independent (the null hypothesis), computed as Ri * Cj / N

The degrees of freedom are (R-1)*(C-1), where R is the number of rows and C is the number of columns.
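The definitions above can be sketched in a few lines. The observed counts below are hypothetical and only shaped like the bank example (4 segments by 3 products); they are not the figures from this analysis.

```python
# Hypothetical observed counts Oij (NOT the bank's data).
# Rows: Premium, Advance, Core, Mass-Market
# Columns: Money Market, Online Saver, FD
observed = [
    [30, 60, 10],
    [40, 30, 30],
    [50, 20, 30],
    [80, 40, 80],
]

row_totals = [sum(row) for row in observed]        # Ri
col_totals = [sum(col) for col in zip(*observed)]  # Cj
grand_total = sum(row_totals)                      # N

# Eij = Ri * Cj / N
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Degrees of freedom = (R - 1) * (C - 1)
dof = (len(row_totals) - 1) * (len(col_totals) - 1)

print(row_totals, col_totals, grand_total)
print(expected)
print(dof)
```

Note that the expected table has the same row and column totals as the observed table; only the distribution within the table changes.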

 

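Chi-Square Statistic

Chi-Square = sum over all cells (i, j) of (Oij - Eij)^2 / Eij

Observed Frequencies (Oij)

[Table: observed frequencies by customer segment and saving product]

Expected Frequencies (Eij)

[Table: expected frequencies by customer segment and saving product]

Chi-Square Statistic

[Table: chi-square value of each cell]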

Result and Insights

The null hypothesis is that there is no relationship between a customer's product holding and the segment the customer belongs to. The chi-square statistic of 59.28 is significant for 6 degrees of freedom (= (4-1)*(3-1)); it is well above the 5% critical value of 12.59. The null hypothesis of no association between product holding and customer segment is therefore rejected.
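The significance of 59.28 at 6 degrees of freedom can be checked with a small sketch. The helper `chi2_sf_even_df` is my own illustration: it uses a closed form of the chi-square upper-tail probability that only holds for even degrees of freedom, and the 0.05 significance level is an assumption. For odd degrees of freedom a statistics library routine (for example `scipy.stats.chi2.sf`) would be needed instead.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable.

    Closed form exp(-x/2) * sum_{i < df/2} (x/2)^i / i!,
    valid only when df is a positive even number.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only holds for positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i   # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total

chi_square = 59.28           # statistic reported in this analysis
dof = (4 - 1) * (3 - 1)      # 4 segments, 3 products
p_value = chi2_sf_even_df(chi_square, dof)

alpha = 0.05                 # assumed significance level
reject_null = p_value < alpha
print(dof, p_value, reject_null)
```

The p-value is far below the assumed 0.05 level, which is consistent with rejecting the null hypothesis of no association.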

An important point is that the chi-square value of each cell is computed separately and the cell values add up to the overall statistic, so they can be used for a detailed analysis of how much each cell contributes to the overall chi-square statistic.

Looking at the chi-square contribution of each row, it is evident that the Premium segment has an association with product holding. The Premium segment has a significant holding of Online Saver products compared to other segments. Further analysis of the customer profiles of the Premium segment can help to answer why there is an association between the Premium segment and the Online Saver product.
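A sketch of this kind of drill-down is shown below. The counts are hypothetical and were deliberately chosen to mimic the pattern described here (one segment over-indexing on one product); they are not the bank's figures, and the variable names are my own.

```python
segments = ["Premium", "Advance", "Core", "Mass-Market"]
products = ["Money Market", "Online Saver", "FD"]

# Hypothetical observed counts (NOT the bank's data).
observed = [
    [30, 60, 10],
    [40, 30, 30],
    [50, 20, 30],
    [80, 40, 80],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Per-cell contribution: (Oij - Eij)^2 / Eij with Eij = Ri * Cj / N
contributions = [
    [(o - r * c / n) ** 2 / (r * c / n) for o, c in zip(row, col_totals)]
    for row, r in zip(observed, row_totals)
]

# Row-level contribution and the overall statistic
row_contrib = {seg: sum(cells) for seg, cells in zip(segments, contributions)}
chi_square = sum(row_contrib.values())
top_segment = max(row_contrib, key=row_contrib.get)

for seg, value in row_contrib.items():
    print(f"{seg}: {value:.2f} ({value / chi_square:.0%} of total)")
```

In this made-up table most of the statistic comes from the Premium row, and within that row from the Online Saver cell, which is the type of reading the paragraph above describes.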

One caveat: there should be enough data points in every row and column (Cochran's rule for small expected frequencies: no expected frequency below 1, and no more than 20% of cells with an expected frequency below 5); otherwise the overall chi-square statistic could be driven by a particular row or column. For tables with low frequencies, an exact test (Fisher's exact test) can be used instead.
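A quick check of Cochran's rule could look like the sketch below. It assumes the commonly quoted form of the rule (all expected frequencies at least 1, at most 20% of cells below 5); the function name, parameter names, and both example tables are hypothetical.

```python
def cochran_rule_ok(expected, min_count=1.0, small=5.0, max_small_share=0.20):
    """Return True if the expected frequencies satisfy Cochran's rule.

    Assumed form of the rule: every expected count >= min_count, and
    no more than max_small_share of the cells have a count below `small`.
    """
    cells = [e for row in expected for e in row]
    n_small = sum(1 for e in cells if e < small)
    return min(cells) >= min_count and n_small / len(cells) <= max_small_share

# Hypothetical expected-frequency tables for illustration.
large_sample = [[40.0, 30.0, 30.0], [80.0, 60.0, 60.0]]
sparse_sample = [[0.8, 3.2], [4.0, 16.0]]

print(cochran_rule_ok(large_sample))
print(cochran_rule_ok(sparse_sample))
```

When the check fails, the chi-square approximation is less reliable and an exact test is the safer choice.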

