December 28, 2014 by akhilendra

Business Analytics Part 4- Analysis of variance (ANOVA)

As mentioned in the first tutorial, these chapters are designed to help you learn and train for the SAS Business Analyst certification and other data science certifications. We will first focus on the topics that are most relevant for those exams, and then we will explore the vast ocean of statistical concepts.

If you have gone through the last 3 chapters, you have already got your feet wet with the basic concepts of data science and have a basic understanding of its operations.

In this chapter, we will primarily focus on ANOVA, which is listed as one of the topics on the SAS BA certification page and is widely used in the data science domain.

Analysis of variance (ANOVA)

Analysis of variance, also known as ANOVA, is a statistical model, or rather a collection of models, used to test whether groups differ significantly, usually by comparing their means. It basically tells the user whether the differences between the means of multiple groups are statistically significant.

You can use a t-test if you are going to compare two groups; you can learn about the t-test in our previous chapters.

ANOVA is primarily used when you are going to measure differences between more than 2 groups. ANOVA was developed by R. A. Fisher.

In the last chapter, while explaining the t-test, I used two groups preparing for SAS certification as an example. If you need to compare the performance of more than 2 groups, you will use ANOVA.

Types of ANOVA

One-Way ANOVA
Two-Way ANOVA

Before we move further into ANOVA, let's look at two important concepts which are integral to it.

F test

An F test is a hypothesis test whose test statistic follows an F distribution when the null hypothesis is true. In ANOVA it is used to test whether the means of the populations, which are assumed to have similar standard deviations, are equal.

The critical value is selected from the F table and is then used to decide whether to reject the null hypothesis.

F values are non-negative, the distribution is asymmetric (skewed to the right), and it has two degrees of freedom, one for the numerator and one for the denominator.
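For one special case the critical value can be worked out without a table. When the numerator has exactly 2 degrees of freedom, the tail probability of the F distribution has a simple closed form, P(F > f) = (1 + 2f/d)^(-d/2), where d is the denominator degrees of freedom. The sketch below is only an illustration of that special case, using 2 and 12 degrees of freedom and a 5% level as an example; the function names are made up for this sketch and are not part of SAS or any library.

```python
def f_survival_df1_2(f_value, df_den):
    """P(F > f_value) for an F distribution with 2 numerator degrees of freedom."""
    return (1 + 2 * f_value / df_den) ** (-df_den / 2)


def f_critical_df1_2(alpha, df_den):
    """Critical value c such that P(F > c) = alpha (2 numerator degrees of freedom only)."""
    return (df_den / 2) * (alpha ** (-2 / df_den) - 1)


def decide(f_value, critical_value):
    """Decision rule: reject the null hypothesis only if F exceeds the critical value."""
    return "reject" if f_value > critical_value else "fail to reject"


crit = f_critical_df1_2(0.05, 12)
print(round(crit, 2))  # 3.89
```

For any other numerator degrees of freedom there is no formula this simple, so the F table (or a statistics package) remains the practical route.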

The second concept is the test statistic.

Test Statistic

A test statistic is a standardized value calculated from sample data while performing a hypothesis test; it condenses the sample into a single number that can be compared against a critical value. In an F test, the test statistic is the F ratio, which follows an F distribution when the null hypothesis is true.
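For reference, the F ratio used as the test statistic in one-way ANOVA is usually written as below. The notation is introduced here for illustration and is not from the original text: k is the number of groups, n_j the number of observations in group j, N the total number of observations, \bar{x}_j the mean of group j, and \bar{x} the mean of all observations.

```latex
F = \frac{SS_{between} / (k - 1)}{SS_{within} / (N - k)}, \qquad
SS_{between} = \sum_{j=1}^{k} n_j \left(\bar{x}_j - \bar{x}\right)^2, \qquad
SS_{within} = \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left(x_{ij} - \bar{x}_j\right)^2
```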

Let's go through the following example to understand the F test:

Sample 1	value-mean	Squared	Sample 2	value-mean	Squared	Sample 3	value-mean	Squared
12	-4.8	23.04	18	-5.6	31.36	25	-1.6	2.56
14	-2.8	7.84	19	-4.6	21.16	26	-0.6	0.36
18	1.2	1.44	21	-2.6	6.76	23	-3.6	12.96
22	5.2	27.04	29	5.4	29.16	28	1.4	1.96
18	1.2	1.44	31	7.4	54.76	31	4.4	19.36
16.8		60.8	23.6		143.2	26.6		37.2

The last row gives the mean of each sample and the sum of its squared differences.

Sum of Squares within Groups = Sum of squares Sample 1 + Sum of squares Sample 2 + Sum of squares Sample 3

Sum of Squares within Groups: 60.8 + 143.2 + 37.2 = 241.2

Now calculate the total sum of squares using all groups: subtract the mean of all 15 values (22.33) from each value and square the result.

Value	-	Mean	=	Difference	Square
12	-	22.33	=	-10.33	106.7089
14	-	22.33	=	-8.33	69.3889
18	-	22.33	=	-4.33	18.7489
22	-	22.33	=	-0.33	0.1089
18	-	22.33	=	-4.33	18.7489
18	-	22.33	=	-4.33	18.7489
19	-	22.33	=	-3.33	11.0889
21	-	22.33	=	-1.33	1.7689
29	-	22.33	=	6.67	44.4889
31	-	22.33	=	8.67	75.1689
25	-	22.33	=	2.67	7.1289
26	-	22.33	=	3.67	13.4689
23	-	22.33	=	0.67	0.4489
28	-	22.33	=	5.67	32.1489
31	-	22.33	=	8.67	75.1689
Total					493.33

The total sum of squares is the sum of the between-group and within-group sums of squares:

493.33 = Sum of Squares between Groups + 241.2

So Sum of Squares between Groups = 493.33 - 241.2 = 252.13

To calculate the Sum of Squares between Groups directly, subtract the mean of all the groups (22.33 in this case) from the mean of each group (Sample 1, Sample 2, Sample 3), square each difference, and add them up:

Sample 1	16.8-22.33	=	-5.53	30.5809
Sample 2	23.6-22.33	=	1.27	1.6129
Sample 3	26.6-22.33	=	4.27	18.2329
Total				50.4267

Total x 5 (number of units in each sample)
50.4267*5	=	252.13

You can use basic algebra (total minus within) to figure out the sum of squares between groups, but I also did it manually to give you some idea of how it is done. Both approaches give 252.13, so that is the value we will use.

Now calculate Mean Square between Groups = Sum of Squares between Groups / numerator degrees of freedom, where numerator degrees of freedom = number of sample groups - 1

252.13/2 = 126.07

Now calculate Mean Square within Groups = Sum of Squares within Groups / denominator degrees of freedom, where denominator degrees of freedom = total number of items in all sample groups - number of samples

241.2/(15-3) = 20.1

F is the ratio of Mean Square between Groups to Mean Square within Groups

126.07/20.1 = 6.27

Now we need to determine the critical value using the table below (5% significance level), where the numerator degrees of freedom run across the top and the denominator degrees of freedom run down the left side. We need the value at the intersection of column 2 and row 12.

	1	2	3	4	5	6	7	8	9	10	12	15	20	24	30	40	60	120	∞
1	161.4	199.5	215.7	224.6	230.2	234.0	236.8	238.9	240.5	241.9	243.9	245.9	248.0	249.1	250.1	251.1	252.2	253.3	254.3
2	18.51	19.00	19.16	19.25	19.30	19.33	19.35	19.37	19.38	19.40	19.41	19.43	19.45	19.45	19.46	19.47	19.48	19.49	19.50
3	10.13	9.55	9.28	9.12	9.01	8.94	8.89	8.85	8.81	8.79	8.74	8.70	8.66	8.64	8.62	8.59	8.57	8.55	8.53
4	7.71	6.94	6.59	6.39	6.26	6.16	6.09	6.04	6.00	5.96	5.91	5.86	5.80	5.77	5.75	5.72	5.69	5.66	5.63
5	6.61	5.79	5.41	5.19	5.05	4.95	4.88	4.82	4.77	4.74	4.68	4.62	4.56	4.53	4.50	4.46	4.43	4.40	4.36
6	5.99	5.14	4.76	4.53	4.39	4.28	4.21	4.15	4.10	4.06	4.00	3.94	3.87	3.84	3.81	3.77	3.74	3.70	3.67
7	5.59	4.74	4.35	4.12	3.97	3.87	3.79	3.73	3.68	3.64	3.57	3.51	3.44	3.41	3.38	3.34	3.30	3.27	3.23
8	5.32	4.46	4.07	3.84	3.69	3.58	3.50	3.44	3.39	3.35	3.28	3.22	3.15	3.12	3.08	3.04	3.01	2.97	2.93
9	5.12	4.26	3.86	3.63	3.48	3.37	3.29	3.23	3.18	3.14	3.07	3.01	2.94	2.90	2.86	2.83	2.79	2.75	2.71
10	4.96	4.10	3.71	3.48	3.33	3.22	3.14	3.07	3.02	2.98	2.91	2.85	2.77	2.74	2.70	2.66	2.62	2.58	2.54
11	4.84	3.98	3.59	3.36	3.20	3.09	3.01	2.95	2.90	2.85	2.79	2.72	2.65	2.61	2.57	2.53	2.49	2.45	2.40
12	4.75	*3.89*	3.49	3.26	3.11	3.00	2.91	2.85	2.80	2.75	2.69	2.62	2.54	2.51	2.47	2.43	2.38	2.34	2.30
13	4.67	3.81	3.41	3.18	3.03	2.92	2.83	2.77	2.71	2.67	2.60	2.53	2.46	2.42	2.38	2.34	2.30	2.25	2.21
14	4.60	3.74	3.34	3.11	2.96	2.85	2.76	2.70	2.65	2.60	2.53	2.46	2.39	2.35	2.31	2.27	2.22	2.18	2.13
15	4.54	3.68	3.29	3.06	2.90	2.79	2.71	2.64	2.59	2.54	2.48	2.40	2.33	2.29	2.25	2.20	2.16	2.11	2.07

Now we have:

F(2,12) = 6.27

critical value = 3.89

As the F ratio is greater than the critical value at the 5% significance level, the null hypothesis (that all the group means are equal) is rejected.
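The hand calculation above can be cross-checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `one_way_anova` and the variable names are chosen for this illustration and do not correspond to any SAS procedure.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (ss_between, ss_within, df_between, df_within, f_ratio)."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)

    # Within groups: squared differences from each group's own mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    # Between groups: group size times squared difference between
    # the group mean and the mean of all values.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)

    df_between = len(groups) - 1               # number of groups - 1
    df_within = len(all_values) - len(groups)  # total items - number of groups

    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_ratio


samples = [
    [12, 14, 18, 22, 18],  # Sample 1
    [18, 19, 21, 29, 31],  # Sample 2
    [25, 26, 23, 28, 31],  # Sample 3
]

ss_b, ss_w, df_b, df_w, f = one_way_anova(samples)
print(f"Sum of squares between groups = {ss_b:.2f}")
print(f"Sum of squares within groups  = {ss_w:.2f}")
print(f"F({df_b},{df_w}) = {f:.2f}")
```

Running it on the three samples prints both sums of squares and the F ratio, which can then be compared with the hand calculation and with the critical value from the table.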

Coming back to ANOVA, it is one of the most widely used techniques for examining data collected in psychological experiments.

You can find the same calculation in Excel format below. Please note that some of the cells don't have formulas, so you will need to calculate them manually, but most of the functions are already there:

Download F test calculations

Assumptions of ANOVA

The assumptions of ANOVA are very important in the data science field, and they are also required for the SAS BA exams, so let's look at them now.

The dependent variable is normally distributed within each group, and the groups have the same variance.
Errors are normally distributed and are independent.
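A quick, informal way to look at the equal-variance assumption is to compute the sample variance of each group. The sketch below does this for the three samples from the F test example earlier in this chapter. It is only an eyeball check, not a formal test, and the dictionary and variable names are made up for this illustration.

```python
from statistics import variance

samples = {
    "Sample 1": [12, 14, 18, 22, 18],
    "Sample 2": [18, 19, 21, 29, 31],
    "Sample 3": [25, 26, 23, 28, 31],
}

# Sample variance (divides by n - 1) of each group.
group_variances = {name: variance(values) for name, values in samples.items()}

for name, var in group_variances.items():
    print(f"{name}: variance = {var:.1f}")
# Sample 1: variance = 15.2
# Sample 2: variance = 35.8
# Sample 3: variance = 9.3
```

Whether differences of this size matter is a question for a formal test of equal variances, which this sketch does not attempt.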

We will further look into the assumptions and one way ANOVA in the next chapter.
