If you answered“Yes” to any of the above question, you need to enroll in this course for Machine learning & data science.

### Don't miss your 80% Discount

This report is intended to provide insight about how to use R to perform ANOVA analysis for marketing campaign. In this report, objective is to analyse promotion Campaign data set (“PL_X_SELL”) in R programming language and generate information about the data set. This data exploration report will contain following:

- Problem Description
- Importing the dataset using R
- Identify the dependent variable & independent
- Summary statistics & inferences
- Tests for testing normality & homogeneity
- ANOVA Analysis (one way & two-way anova)
- Post Hoc Test
- Interpretation of results

## Assumptions

- The groups or categories under the treatment variable are independent of each other
- The groups are homogeneous (similar in every manner) except for the case of treatment
- The response (dependent) variable is normally distributed within each of the categories defined by the treatment variable

## Exploratory Data Analysis – Step by step approach

A Typical Data exploration activity consists of the following steps:

- Environment Set up and Data Import
- Problem Description
- Null Hypothesis & Alternative Hypothesis
- Variable Identification & Descriptive statistics
- Test of Assumptions
- ANOVA Analysis (one way & two-way anova)

We shall follow these steps in exploring the provided data set

## Environment Set up and Data Import

## Install necessary Packages and Invoke Libraries

Use this section to install necessary packages and invoke associated libraries. Having all the packages at the same places increases code readability.

Package Name | Description |

CSV | Read and write CSV Files with selected conventions |

Psych | Procedures for Psychological, Psychometric, and Personality Research |

Car | Companion to Applied Regression |

Foreign | Read Data Stored by ‘Minitab’, ‘S’, ‘SAS’, ‘SPSS’, ‘Stata’, ‘Systat’, ‘Weka’, ‘dBase’, … |

MASS | Support Functions and Datasets for Variables and Ripley’s MASS |

robustHD | Robust Methods for High-Dimensional Data |

rcompanion | Functions to Support Extension Education Program Evaluation |

WRS2 | A Collection of Robust Statistical Methods |

tabplot | For visualizing large dataset |

tidyverse | tidyverse is a collection of R packages designed for data science |

dplyr | It is used for subset selection and applying actions on the datasets like applying filter, reorder etc. |

ggplot2 | It is used for creating attractive graphics in R |

readxl | Used to read excel files in R |

nlme | Fit and compare Gaussian linear and nonlinear mixed-effects models. |

onewaytests | Performs one-way tests in independent groups designs, pairwise comparisons, graphical approaches, assesses variance homogeneity and . |

### Set up working Directory

Setting a working directory on starting of the R session makes importing and exporting data files and code files easier. Basically, working directory is the location/ folder on the PC where you have the data, codes etc. related to the project.

Please refer to Appendix A for Source Code for more details on how to set working directory in R.

### Import and Read the Data set

The given data set is in .csv format. Hence, the command ‘read.csv’ is used for importing the file.

### Problem Description

- To Conduct a one-way ANOVA analysis to study whether occupation of the account holder affects quarterly average balance in the account
- To Conduct two-way ANOVA analysis on gender and occupation on quarterly average balance.

### Variable Identification & Descriptive Statistics

- As per the data the independent Variable (Factor Variable) is “Occupation” and the Dependent Variable is “Balance”
- Load all the libraries

- Load the data and run str commands to learn more about the data. It shows that there are 20000 observations with 10 variables, where Cust_ID, Gender, Occupation, AGE_BKT are factor variables, Target, Age, No_of_CR_TXNS, SCR, Holding period are integers and Balance is number variable respectively

- Run summary command to know the Mean, Median, range, min, max value of different variables

- Using table plot from tabplot to visualize entire dataset- tableplot(plxsell)

*visualization with boxplot using ggplot2, it shows the variation between groups*

### Null & Alternative Hypothesis

Null Hypothesis (H0): µ of all occupations is equal (i.e.,Mean of Prof, SAL, Self-emp and SENP are same)

Alternative Hypothesis (HA): µ of all occupations is not equal i.e., at least one of the means is different from the rest.

### Test of Assumptions

- The groups or categories under the treatment variable are independent of each other
- The groups are homogeneous (similar in every manner) except for the case of treatment
- The response (dependent) variable is normally distributed within each of the categories defined by the treatment variable

** ****Normality is tested by Shapiro test. First create a subset with 5000 rows and then apply shapiro test.**

- Since P Values are greater than 0.05 that implies the Normality has not been violated
- Homogeneity in variance across the categories in the factor variable is tested by levene Test & Bartlett Test
**.**

Based on the Homogeneous of variance test using levene & Bartlett method, P-Value is less than 0.05 therefore null hypothesis is rejected i.e., the variations are not homogeneous across occupations

#### Levene Test in R

[crayon-5c4a8607378ab942517707/][crayon-5c4a8607378b3134456887/]#### Bartlett test in R

[crayon-5c4a8607378bb979475389/][crayon-5c4a8607378c2307947026/]## ANOVA Analysis -One-Way ANOVA in R

**Problem**– one-way ANOVA analysis to study whether occupation of the account holder affects quarterly average balance in the account.

**Inference**– Anova analysis shows that F statistic is highly significant therefore the occupation of the account holder do not affect the quarterly average balance.

### Post – Hoc Test

[crayon-5c4a8607378dc002202425/][crayon-5c4a8607378e4449191978/][crayon-5c4a8607378ed768241868/]__The above Post-Hoc Test shows that multiple comparison of means that implies each of the occupation are different from the avg balance based on the p adjusted value____Tukey signifies that the differences across the occupation are all significant__

## How To Conduct Two Way Anova Analysis in R

**Problem- **Conduct two-way ANOVA analysis on gender and occupation on quarterly average balance.

- Two way ANOVA analysis used for more than one factor.
- In this case, we have occupation & gender.
- We have four categories in occupation and two categories in Gender
- The effects that needs to be considered are
- Direct effect from Occupation
- Direct effect from Gender
- The interaction effect of occupation and the gender

- Process remains same
- Male has high mean occupation category compared to the Female and Org

- Now same test for standard deviation reveals that Org has high SD in SAL & SELF-EMP occupation category compared to the Female and male

- Now plot interaction plot. This reveals that Male has higher mean & earnings in occupation category compared to the Female and Org

__Test of Assumptions__

- Normality Assumption is violated
- The homogeneous variance across the groups is also violated.
- F-statistic of the ANOVA does not get affected by much if there is balanced data.

__Two Way Anova Analysis__

- Two way Anova analysis shows that F statistic is highly significant

__Robust Methods One-Way ANOVA__

[crayon-5c4a860737933647643717/][crayon-5c4a86073793a277330883/][crayon-5c4a860737942872467030/][crayon-5c4a86073794a155702448/]__Robust Methods One-Way ANOVA__

__Robust Methods, Two-way ANOVA__

[crayon-5c4a860737952623002433/]

- The results of both one-way and two-way ANOVA are robust to violations of assumptions as the p-values for both occupation and gender are closer to zero.

## Speak Your Mind