December 19, 2014 by akhilendra

Business Analytics Part 2- Data Science Basics with Statistics

Welcome to the second part of our data science series. In this post we will learn the basic concepts of data science and get comfortable with them before we delve deeper.

You can read the first part here.

Data science in a nutshell is “statistics”, and a data scientist is a “statistician”.

Therefore we are going to cover the statistical concepts that are applied in modern data science. These concepts are required to start your data science journey, and if you are planning for any Business Analytics, Machine Learning or Data Science certification, they will be really useful. In fact, without understanding statistics, it will be very hard for you to make sense of the data, so please spend some time brushing up your statistical skills even if you don’t like the subject very much.

OK, so let’s cut to the chase and start with the concepts that are important in data science.

One of the most important concepts in data science and statistics is the “variable”.

Data Science Basics with Statistics

Variables

In maths, a variable is a symbol, usually a letter, which represents a value. Variables are also used in programming, where they serve a similar purpose and hold or represent a value.

In data science, variables are measured, manipulated and explored to identify their values or the output of an equation.

To further elaborate on variables, let’s use two different examples where variables are used:

Consider correlational research where you are measuring the body mass index (BMI) of a population. BMI is measured as mass (kg) / (height (m))². In correlational research, you simply measure the variable and do not manipulate it.

In contrast, consider experimental research, where you manipulate the variables and measure the output. Using the example above, if we change the mass or the height in the BMI calculation, we will get a different output.
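To make the difference between measuring and manipulating a variable concrete, here is a minimal Python sketch of the BMI formula. The function name bmi and all of the numbers are made up for illustration; they do not come from any real study.

```python
def bmi(mass_kg: float, height_m: float) -> float:
    """Body mass index: mass in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return mass_kg / height_m ** 2


# Correlational view: we only record values that (hypothetical) people already have.
observed = [(60.0, 1.60), (72.0, 1.80), (90.0, 1.75)]  # (mass kg, height m)
observed_bmi = [bmi(mass, height) for mass, height in observed]

# Experimental view: we hold height fixed and deliberately vary mass.
manipulated_bmi = [bmi(mass, 1.75) for mass in (60.0, 70.0, 80.0)]

print(observed_bmi)
print(manipulated_bmi)
```

In the first list nothing is changed by us, while in the second list every change in the output comes from the change we made to mass.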

So variables can be used in different ways depending upon the need of the study.

In data science, the relation between variables is one of the most important factors affecting the output.

There is one other important factor influencing variables and the outcome, and that is the “size of the sample”. The smaller the sample, the higher the probability of seeing a spurious pattern, whereas in bigger samples the chances of identifying a pattern closer to reality are high.

Therefore it is important to pay attention to the sample size in data science and when carrying out any research project.

If the sample size is small, the results are less likely to be statistically significant.

Consider a website with five categories and 100 posts in each category.

Category 1: 100 posts

Category 2: 100 posts

Category 3: 100 posts

Category 4: 100 posts

Category 5: 100 posts

If the traffic analytics report indicates that the site is receiving more organic traffic (traffic from search engines) for a particular category, it is reasonable for the owners to treat it as a trend, and they can work on that category to further increase their organic traffic.

But consider a scenario where there is only 1 post in each category and, as per the reports, one category is receiving more organic traffic. The pattern would be weaker and not reliable enough to support a realistic conclusion. It could be pure coincidence that one post is getting more traffic than the others, which makes a genuine trend less certain.
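The website example can be simulated in a few lines of Python. This is only a sketch under stated assumptions: every post, in every category, draws its traffic from the same made-up distribution (mean 1000 visits, standard deviation 200), so any gap between categories is pure chance. The function name and all numbers are hypothetical.

```python
import random
import statistics


def average_gap_between_categories(posts_per_category, trials=500, seed=0):
    """Average gap between the best and worst category when no real difference exists."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    gaps = []
    for _ in range(trials):
        # Mean traffic per category for 5 categories; all posts share one distribution.
        category_means = [
            statistics.fmean(
                [rng.gauss(1000, 200) for _ in range(posts_per_category)]
            )
            for _ in range(5)
        ]
        gaps.append(max(category_means) - min(category_means))
    return statistics.fmean(gaps)


gap_with_1_post = average_gap_between_categories(1)
gap_with_100_posts = average_gap_between_categories(100)
print(gap_with_1_post, gap_with_100_posts)
```

With 1 post per category the chance gap between the "best" and "worst" category comes out much larger than with 100 posts per category, even though no category is actually better. That is why an apparent winner in a tiny sample is weak evidence of a trend.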

There are two important properties of the relation between variables:

Magnitude

Magnitude, or size, is measured by how strongly a variable affects the result of the equation. Using the BMI example again, if a study clearly shows that each time height increases by ‘x’ m with mass held constant, BMI changes by ‘y’ (it falls, because height is in the denominator), we can easily identify the pattern and establish the impact. It would therefore not be very hard to measure the magnitude of the effect of height in that research.
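As a rough illustration of magnitude, the sketch below computes the change ‘y’ in BMI for a step ‘x’ in height while mass stays fixed. Treating this simple difference as the "magnitude" is an assumption made for illustration, and the function names and numbers are hypothetical.

```python
def bmi(mass_kg, height_m):
    """Body mass index: mass (kg) divided by height (m) squared."""
    return mass_kg / height_m ** 2


def change_in_bmi(mass_kg, height_m, height_step_m):
    """Change 'y' in BMI when height rises by 'x' = height_step_m at fixed mass."""
    return bmi(mass_kg, height_m + height_step_m) - bmi(mass_kg, height_m)


# Hypothetical person: 70 kg, 1.70 m, and a height step of 0.05 m.
y = change_in_bmi(mass_kg=70.0, height_m=1.70, height_step_m=0.05)
print(round(y, 2))
```

The value of y is negative here: at the same mass, a taller person has a lower BMI. The size of y tells us how strongly height influences the result for this person.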

Reliability

Reliability is slightly more complicated than magnitude and is considered less “intuitive”. It is also known as “truthfulness”. It indicates how far the result of the test or research can be trusted: it reflects the probability of getting a similar result if the same test is repeated using a different sample from the same population.

This is just the tip of the iceberg for “VARIABLES” in statistics, and we will move forward from here in this tutorial. We will cover more and more ground in data science, including variables.
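One way to picture reliability is to repeat the same study many times on fresh samples and count how often the result points the same way. The sketch below assumes two made-up groups whose true average BMI differs slightly (25 versus 26, standard deviation 4). The function name replication_rate and all numbers are hypothetical.

```python
import random
import statistics


def replication_rate(sample_size, repeats=1000, seed=0):
    """Share of repeated studies that find group B's mean BMI above group A's."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    agree = 0
    for _ in range(repeats):
        # Each repeat draws a new sample from the same two populations.
        group_a = [rng.gauss(25.0, 4.0) for _ in range(sample_size)]
        group_b = [rng.gauss(26.0, 4.0) for _ in range(sample_size)]
        if statistics.fmean(group_b) > statistics.fmean(group_a):
            agree += 1
    return agree / repeats


rate_small_sample = replication_rate(5)
rate_large_sample = replication_rate(200)
print(rate_small_sample, rate_large_sample)
```

With 200 people per group almost every repeat reaches the same conclusion, while with 5 people per group a noticeable share of repeats point the other way. The first result is the more reliable one.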
