Statistics is at the heart of data science. It provides the tools to make sense of complex datasets and draw conclusions from them, so that data science can address business problems and inform decisions. Statistical evidence complements intuition and lowers the risk of making decisions based on emotions or gut feelings, which reduces uncertainty. Statistics for data science covers gathering, reviewing and analysing data, and translating the patterns found in it into evidence.
To become an effective data scientist, one must have a clear understanding of how to interpret and communicate statistical results. The role of statistics in data science is to build quantitative models and representations of real-life observations or experimental data, so that algorithms can turn that data into actionable insights.
Here’s a look at some of the basic statistics concepts or features that every data scientist must know and how to apply them.
1. Bayesian Statistics
Bayesian statistics starts from existing knowledge about the model parameters, expressed as a probability distribution known as the prior distribution. When new data is observed, it enters through the likelihood, which is the probability of the new data given the model parameters. Combining the likelihood with the prior distribution through Bayes' theorem generates an updated probability distribution known as the posterior distribution, which can then be used to estimate the probability of a future event.
For instance, suppose you wish to predict whether at least 50 customers will come to your cafe every Saturday for the next 6 months. Frequentist statistics will estimate a probability from past records alone, while Bayesian statistics can also factor in prior knowledge, such as a book fair that is expected to take place every Saturday evening over the next few months. Incorporating that information can help deliver a more accurate figure.
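The prior-to-posterior update can be sketched with a Beta-Binomial model, a standard conjugate pair in which the update reduces to adding counts. This is only an illustration: the prior weights and the observed Saturdays below are made-up numbers, and the function names are hypothetical.

```python
def beta_binomial_update(prior_alpha, prior_beta, successes, failures):
    """Return the posterior Beta parameters after observing binomial data."""
    return prior_alpha + successes, prior_beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical prior: busy Saturdays (at least 50 customers) are believed to
# happen about 60% of the time, with the weight of 10 earlier observations.
prior_alpha, prior_beta = 6.0, 4.0

# Hypothetical new data: 12 Saturdays observed, 9 of them busy.
busy, quiet = 9, 3

post_alpha, post_beta = beta_binomial_update(prior_alpha, prior_beta, busy, quiet)

prior_mean = beta_mean(prior_alpha, prior_beta)
posterior_mean = beta_mean(post_alpha, post_beta)
print(f"prior mean: {prior_mean:.3f}")
print(f"posterior mean: {posterior_mean:.3f}")
```

The posterior mean sits between the prior belief and the observed frequency, and it moves closer to the data as more Saturdays are observed.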
2. Over and Under Sampling
Over- and under-sampling adjust the class distribution of a dataset. When one class is an underrepresented minority in the data sample, an over-sampling technique is used to add more examples of that class, in the simplest case by duplicating existing ones. This gives the model a more balanced set of examples during training. Similarly, if a class is the overrepresented majority, under-sampling may be used to remove some of its examples and balance it with the minority class. SMOTE (Synthetic Minority Over-Sampling Technique) is a popular over-sampling technique that generates synthetic minority examples rather than exact duplicates, while under-sampling methods include cluster centroids and Tomek links.
For instance, suppose we have 2,000 examples for class 1 but only 200 for class 2. This is problematic for many machine learning techniques, which tend to favour the majority class. The imbalance can be mitigated with over- and under-sampling, which makes this an important statistics concept for data science.
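A minimal sketch of the two ideas, using plain random over- and under-sampling on toy data with the same 2,000 versus 200 split. It is not SMOTE, cluster centroids or Tomek links, which are more involved; the function names and the data are hypothetical.

```python
import random


def random_oversample(minority, target_size, rng):
    """Duplicate randomly chosen minority examples until target_size is reached."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra


def random_undersample(majority, target_size, rng):
    """Keep a random subset of the majority class, drawn without replacement."""
    return rng.sample(majority, target_size)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Toy data: 2,000 examples of class 1 and 200 examples of class 2.
class_1 = [("class 1", i) for i in range(2000)]
class_2 = [("class 2", i) for i in range(200)]

oversampled_class_2 = random_oversample(class_2, len(class_1), rng)
undersampled_class_1 = random_undersample(class_1, len(class_2), rng)

print(len(class_1), len(oversampled_class_2))   # classes balanced by over-sampling
print(len(undersampled_class_1), len(class_2))  # classes balanced by under-sampling
```

Over-sampling keeps every original example but repeats some of them, while under-sampling discards data, so the choice usually depends on how much data can be spared.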
3. Dimensionality Reduction
Dimensionality reduction is one of the statistics concepts essential for data science. It reduces the number of variables (features) in a problem by deriving a smaller set of principal variables that retains most of the information. It helps with data compression and reduces the storage space required as well. The Data Science Architect Program offered by NASSCOM-partner GeekLurn covers the impact of dimensionality on data and how to perform factor analysis using PCA (principal component analysis) to compress dimensions.
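As a rough illustration of PCA, the sketch below finds only the first principal component of a small dataset by power iteration on the covariance matrix, then projects each two-value row down to a single number. The data is hypothetical, power iteration is a choice made here for brevity, and practical work would normally rely on a numerical library instead.

```python
import math


def first_principal_component(rows, iterations=200):
    """Approximate the top eigenvector of the covariance matrix by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]

    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]

    vec = [1.0] * d
    for _ in range(iterations):
        # Multiply by the covariance matrix, then rescale to unit length.
        vec = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return means, vec


def project(rows, means, component):
    """Reduce each row to a single coordinate along the component."""
    return [sum((r[j] - means[j]) * component[j] for j in range(len(means)))
            for r in rows]


# Hypothetical 2-D data lying close to the line y = 2x.
data = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]

means, component = first_principal_component(data)
reduced = project(data, means, component)

print(component)  # direction roughly proportional to (1, 2)
print(reduced)    # one number per row instead of two
```

Because the two columns are almost perfectly correlated, a single coordinate along the component captures nearly all of the variation in the data.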
4. Probability Distribution
A probability distribution describes all the possible outcomes of a random variable and their corresponding probability values between 0 and 1. In simple terms, it gives the likelihood of each result of an experiment, like the chances of rain. For instance, if the weather report says there is a 30% chance of a downpour, there is a 70% chance of it not raining. These two outcomes, together with their probabilities, make up the distribution.
A distribution can be described by 3 core functions:
- Probability Mass Function (for discrete variables)
- Probability Density Function (for continuous variables)
- Cumulative Distribution Function
Common examples include the uniform distribution, where all outcomes are equally likely, and the Gaussian (normal) distribution, the bell-shaped curve that, by the central limit theorem, the distribution of sample means tends towards as the sample size gets larger. The exponential distribution describes the time between events in a Poisson point process. You can also learn about the Bernoulli, Poisson and Binomial distributions in the GeekLurn course under the Statistics and Probability segment.
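The three kinds of function can be tried out with Python's standard library. The sketch below is illustrative only: the helper names are hypothetical, and the rain example assumes that each day has an independent 30% chance of rain, which is a simplification.

```python
import math
from statistics import NormalDist


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent Bernoulli(p) trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, rate):
    """Probability of exactly k events when events occur at the given mean rate."""
    return math.exp(-rate) * rate**k / math.factorial(k)


def exponential_cdf(t, rate):
    """Probability that the wait for the next Poisson event is at most t."""
    return 1.0 - math.exp(-rate * t)


# PMF: chance of exactly 2 rainy days in 7, assuming a 30% chance each day
# and independence between days (a simplifying assumption).
p_two_rainy_days = binomial_pmf(2, 7, 0.3)

# The PMF over all possible outcomes adds up to 1.
pmf_total = sum(binomial_pmf(k, 7, 0.3) for k in range(8))

# PDF and CDF of a standard normal (Gaussian) distribution.
standard_normal = NormalDist(mu=0.0, sigma=1.0)
density_at_zero = standard_normal.pdf(0.0)
prob_below_zero = standard_normal.cdf(0.0)

print(p_two_rainy_days, pmf_total, density_at_zero, prob_below_zero)
print(poisson_pmf(0, 2.0), exponential_cdf(1.0, 2.0))
```

The last line shows the link between the Poisson and exponential distributions: the chance of seeing no events in one unit of time equals the chance that the wait for the next event is longer than that unit, so the two printed values add up to 1.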
5. Regression
Regression determines the statistical relationship between a dependent variable and one or more independent variables. Two common types of regression are:
- Logistic Regression: Explains the relationship between a binary response variable and one or more predictor variables. For instance, it can be used to predict if a political candidate will win or lose an election.
- Linear Regression: Explains the relationship between one or more predictor variables and a numeric response variable. For instance, during childhood, height and age have a roughly linear relationship: as a child gets older, their height increases.
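Both types can be sketched for a single predictor in a few lines. The example below fits a straight line by ordinary least squares and a logistic model by batch gradient descent. All the numbers (children's ages and heights, poll leads and election results) are invented for illustration, and the learning rate and epoch count are arbitrary choices.

```python
import math


def fit_linear(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, labels, learning_rate=0.1, epochs=5000):
    """Logistic regression with one predictor, fitted by batch gradient descent."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(weight * x + bias) - y for x, y in zip(xs, labels)]
        weight -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        bias -= learning_rate * sum(errors) / n
    return weight, bias


# Linear regression: hypothetical ages (years) and heights (cm) of children.
ages = [2, 4, 6, 8, 10]
heights = [86, 100, 114, 128, 142]
slope, intercept = fit_linear(ages, heights)
print(f"height = {slope:.1f} * age + {intercept:.1f}")

# Logistic regression: hypothetical poll leads (percentage points) and
# whether the candidate won (1) or lost (0).
poll_leads = [-4, -3, -2, -1, 0, 1, 2, 3, 4]
won = [0, 0, 0, 1, 0, 1, 1, 1, 1]
weight, bias = fit_logistic(poll_leads, won)
win_prob_trailing = sigmoid(weight * -4 + bias)
win_prob_leading = sigmoid(weight * 4 + bias)
print(win_prob_trailing, win_prob_leading)
```

The linear model outputs a number on the same scale as the response (height in cm), while the logistic model outputs a probability that is then read as win or lose.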
These are a few of the most important statistics concepts in data science. Competency in these areas can help you forge a successful career in one of India’s fastest-growing industries.