Statistics plays an important part in handling large volumes of data. Many organisations now invest heavily in statistics and analytics to grow their businesses, which has created several lucrative job opportunities for data scientists. If you have an important interview to crack, here are 30 statistics interview questions for data science that’ll help you prepare better and ace the interview.
Basic Statistics Interview Questions For Data Science
Q1. What are the different types of data?
There are 2 types of data:
- Qualitative or Categorical Data – This type of data describes qualities or categories that cannot be measured numerically.
Example – Language spoken – Latin, French, Hindi, English etc.
Brands of cars – Audi, Maruti, BMW, Renault etc.
- Numerical or Quantitative Data – Representing numbers, this form of data is further divided into 2 types:
Type 1: Discrete data. Examples include:
- Cost of a mobile phone
- Number of students present or absent in a classroom
- Days in a week or months in a year
Type 2: Continuous data. Examples include:
- Speed of a vehicle
- Height of a person
- Price of a share
Q2. How is the statistical significance of an insight assessed?
Hypothesis testing is used to assess the statistical significance of an insight. First, the null and alternative hypotheses are stated and a significance level (alpha) is chosen. The p-value is then calculated: it is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one obtained. If the p-value is less than alpha, the null hypothesis is rejected and the result is considered statistically significant.
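As a rough illustration of this procedure, here is a minimal Python sketch of a two-sided one-sample z-test. It assumes the population standard deviation is known; the function name `one_sample_z_test` and the data are made up for this example.

```python
from math import sqrt
from statistics import NormalDist, mean


def one_sample_z_test(sample, mu0, sigma, alpha=0.05):
    """Two-sided z-test of H0: population mean == mu0 (sigma assumed known)."""
    n = len(sample)
    # Test statistic: how many standard errors the sample mean is from mu0
    z = (mean(sample) - mu0) / (sigma / sqrt(n))
    # p-value: probability, assuming H0 is true, of a |z| at least this large
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Reject H0 when the p-value falls below the chosen significance level
    return z, p_value, p_value < alpha


# Hypothetical measurements, invented for illustration only
sample = [52, 55, 49, 58, 54, 57, 53, 56, 51, 55]
z, p_value, reject = one_sample_z_test(sample, mu0=50, sigma=5)
print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject}")
```

Note that alpha is fixed before looking at the data; it is not adjusted afterwards to obtain a desired result.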
Q3. What are the types of measurement levels?
There are four levels of measurement: nominal, ordinal, interval and ratio.
Q4. Difference between Descriptive and Inferential Statistics?
|Inferential Statistics||Descriptive Statistics|
|Allows you to test a hypothesis||Summarises the properties of a data set|
|Compares, tests and predicts data||Organises, analyses and presents data in a meaningful way|
|Final results are in the form of probability||Final results are in the form of graphs, tables, charts etc.|
|Is used to explain the chances of occurrence of an event||Is used to describe a situation|
Q5. Where are long-tailed distributions used?
Long-tailed distributions are used to model statistics such as product sales distribution, frequency of internet search terms etc. They are also used in solving classification and regression problems.
Q6. How is missing data handled in statistics?
This is one of the easier basic statistics interview questions. The following are a few of the ways to handle missing data in statistics:
- Prediction of the missing values
- Deletion of rows which have the missing data
- Using algorithms that support missing values, such as some Random Forest implementations
- Mean or median imputation
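Two of these approaches, row deletion and mean or median imputation, can be sketched in a few lines of Python. This is only an illustration: missing values are assumed to be represented as `None`, and the helper names `impute` and `drop_missing` are hypothetical.

```python
from statistics import mean, median


def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]


def drop_missing(rows):
    """Delete every row (tuple) that contains at least one missing value."""
    return [row for row in rows if None not in row]


# Hypothetical column with two missing entries
ages = [25, None, 31, 40, None, 28]
print(impute(ages, "mean"))
print(impute(ages, "median"))
print(drop_missing([(25, "A"), (None, "B"), (31, None)]))
```

Median imputation is usually the safer of the two when the column contains outliers, because the median is less affected by extreme values.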
Q7. Name the types of selection bias in statistics?
Following are some of the types of selection bias in statistics:
- Sampling bias
- Exclusion bias
- Survivorship bias
- Recall bias
- Attrition bias
- Accidental bias
- Misclassification bias
- Observer bias
Q8. What is observational and experimental data in Statistics?
Observational Data is data that is obtained from observational studies. Here, variables are observed to check if there’s any correlation between them.
Data derived from experimental studies is known as Experimental Data. Here, certain variables are held constant while others are deliberately changed, to check whether the change causes an effect.
Q9. What are descriptive statistics?
Descriptive statistics describe and summarise the basic characteristics of a data set in a given study or experiment. There are predominantly 3 types of descriptive statistics. They are:
1) Distribution
2) Central Tendency
3) Variability (AKA Dispersion)
Q10. When throwing two fair dice, what is the probability that the sum is 5? And that it is 8?
There are 4 ways of rolling a 5. They are – 1+4, 4+1, 2+3, and 3+2.
P = 4/36 = 1/9
There are 5 ways of rolling an 8. They are 2+6, 6+2, 3+5, 5+3, and 4+4.
So P = 5/36 = 0.139
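A quick way to check counts like these is to enumerate all 36 outcomes of two six-sided dice. Here is a minimal Python sketch; the function name `prob_of_sum` is just for illustration.

```python
from fractions import Fraction
from itertools import product


def prob_of_sum(target, sides=6):
    """Exact probability that two fair dice with `sides` faces sum to `target`."""
    # Every ordered pair (first die, second die) is equally likely
    outcomes = list(product(range(1, sides + 1), repeat=2))
    favourable = [pair for pair in outcomes if sum(pair) == target]
    return Fraction(len(favourable), len(outcomes))


print(prob_of_sum(5))  # 1/9
print(prob_of_sum(8))  # 5/36
```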
Intermediate Statistics Interview Questions
Q11. What are left-skewed and right-skewed distributions?
A left-skewed distribution is where the left tail is longer than the right tail.
Here, the mean < median < mode.
A right-skewed distribution is where the right tail is longer than the left tail.
Here, the mean > median > mode.
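The ordering of mean, median and mode can be seen on a small made-up data set. The sketch below uses hypothetical values, and the ordering itself is a rule of thumb that holds for typical unimodal data rather than a strict law.

```python
from statistics import mean, median, mode

# Hypothetical incomes (in thousands); one large value stretches the right tail
right_skewed = [20, 22, 22, 25, 27, 30, 120]
# Mirroring the values around zero moves the long tail to the left
left_skewed = [-x for x in right_skewed]

for name, data in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    # Right-skewed: mean 38 > median 25 > mode 22; the left-skewed copy reverses this
    print(name, "mean:", mean(data), "median:", median(data), "mode:", mode(data))
```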
Q12. What is the difference between population parameters and sample statistics?
These are population parameters:
Mean = µ
Standard deviation = σ
These are sample statistics:
Mean = x̄
Standard deviation = s
Q13. Joy took an exam. The test has a mean score of 160 and a standard deviation of 15. If Joy’s Z-score is 1.20, what is Joy’s score on the test?
An intermediate level statistics interview question for data science, the answer to this one is as follows:
X = μ + Zσ
μ is the mean
σ is the standard deviation
Z is the Z-score
X is the value to be calculated
Therefore, X = 160 + (15*1.2) = 160 + 18 = 178
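The same arithmetic can be checked with a couple of lines of Python. This is a minimal sketch; the function names are chosen for this example only.

```python
def score_from_z(mean, std_dev, z):
    """Convert a Z-score back to a raw score: X = mean + Z * std_dev."""
    return mean + z * std_dev


def z_from_score(mean, std_dev, x):
    """Inverse operation: Z = (X - mean) / std_dev."""
    return (x - mean) / std_dev


joy_score = score_from_z(mean=160, std_dev=15, z=1.20)
print(round(joy_score, 2))                         # 160 + 18 = 178
print(round(z_from_score(160, 15, joy_score), 2))  # recovers 1.2
```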
Q14. What is Bessel’s Correction?
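Bessel’s Correction is the use of n − 1 instead of n in the denominator when estimating a population’s variance or Standard Deviation from a sample. It corrects the tendency of a sample to underestimate the population’s variability.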
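To see the effect of the correction, here is a small Python sketch that computes the standard deviation with and without it. The helper `std_dev` and its `ddof` parameter are named for this example; Python’s built-in `statistics.pstdev` and `statistics.stdev` are used as a cross-check.

```python
from math import sqrt
from statistics import mean, pstdev, stdev


def std_dev(data, ddof=1):
    """Standard deviation with divisor n - ddof (ddof=1 applies Bessel's correction)."""
    m = mean(data)
    squared_deviations = sum((x - m) ** 2 for x in data)
    return sqrt(squared_deviations / (len(data) - ddof))


sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
print(std_dev(sample, ddof=0), pstdev(sample))  # divisor n (no correction)
print(std_dev(sample, ddof=1), stdev(sample))   # divisor n - 1 (Bessel's correction)
```

The corrected value is always slightly larger, and the gap shrinks as the sample size grows.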
Q15. What is the proportion of confidence intervals that will not contain the population parameter?
Alpha is the proportion of confidence intervals that will not contain the population parameter.
α = 1 – CL
Q16. What is the relationship between the confidence level and the significance level in statistics?
Significance level is the probability of rejecting the null hypothesis when it is actually true.
Confidence level is the percentage of confidence intervals that would contain the true population parameter if random samples were drawn repeatedly.
They are related by the formula:
Significance level = 1 − Confidence level
Q17. What is the difference between Probability and Likelihood?
Probability is the chance of observing a particular outcome, given a fixed hypothesis or set of parameters. It attaches to possible results.
Likelihood measures how well a particular hypothesis or parameter value explains data that has already been observed. Likelihood attaches to hypotheses.
Q18. What types of variables are used for Pearson’s correlation coefficient?
Variables used for Pearson’s correlation coefficient must be measured on an interval or a ratio scale. The two variables need not be on the same scale: one can be an interval variable and the other a ratio variable.
Q19. When to use t-distribution and when to use z-distribution?
The z-distribution is used when both of the following conditions are satisfied:
The population standard deviation is known
The sample size is > 30
CI = x̄ – Z*σ/√n to x̄ + Z*σ/√n
If these conditions are not satisfied, we use the t-distribution:
CI = x̄ – t*s/√n to x̄ + t*s/√n
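The z-based interval can be sketched in Python using only the standard library. This is a minimal illustration with made-up numbers; the function name `z_confidence_interval` is an assumption of this example. The t-based interval has the same shape, with a t critical value in place of Z and the sample standard deviation s in place of σ.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for the mean when the population sigma is known."""
    alpha = 1 - confidence
    # Two-sided critical value, e.g. about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Hypothetical inputs: sample mean 100, known sigma 15, sample size 36
low, high = z_confidence_interval(sample_mean=100, sigma=15, n=36)
print(f"95% CI: {low:.2f} to {high:.2f}")
```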
Q20. What is the difference between one tail and two tail hypothesis testing?
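A one-tailed test checks for an effect in one direction only, whereas a two-tailed test checks for a difference in either direction.
One-tailed test:
H0: x ≤ µ
H1: x > µ
Two-tailed test:
H0: x = µ
H1: x ≠ µ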
Advanced Data Science Statistics Interview Questions
Q21. Briefly explain the procedure to measure the length of all elephants in the world.
Since every elephant cannot be measured, the average length is estimated from a sample. The following steps are used:
Step 1: Define the confidence level. This is generally around 95%
Step 2: Use sample elephants to measure
Step 3: Calculate the mean and standard deviation of the lengths
Step 4: Determine the t-statistics values
Step 5: Determine the confidence interval in which the mean length lies
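These steps can be sketched in Python as follows. The ten lengths are invented for illustration, and because the standard library has no t-distribution, the critical value 2.262 (two-sided 95%, 9 degrees of freedom) is hard-coded from a standard t-table; both are assumptions of this example.

```python
from math import sqrt
from statistics import mean, stdev

# Step 1: 95% confidence level. For n = 10 (9 degrees of freedom) the
# two-sided t critical value taken from a t-table is about 2.262.
T_CRIT_95_DF9 = 2.262

# Step 2: hypothetical sample of elephant lengths in metres
lengths_m = [5.8, 6.1, 6.4, 5.9, 6.0, 6.3, 6.2, 5.7, 6.5, 6.1]

# Step 3: sample mean and sample standard deviation (n - 1 divisor)
x_bar = mean(lengths_m)
s = stdev(lengths_m)

# Steps 4 and 5: margin of error and confidence interval for the mean length
margin = T_CRIT_95_DF9 * s / sqrt(len(lengths_m))
ci_low, ci_high = x_bar - margin, x_bar + margin
print(f"mean = {x_bar:.2f} m, 95% CI = {ci_low:.3f} m to {ci_high:.3f} m")
```

A different sample size needs a different critical value, so the constant above should not be reused as is.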
Q22. How to calculate p-value using MS Excel?
One of the advanced statistics interview questions often asked, the answer to this is as follows:
- Go to the Data tab
- Click on Data Analysis
- Select Descriptive Statistics
- Choose the corresponding column
- Select the summary statistics and input the confidence level
Q23. What are the properties of normal distributions?
The following are a few of the properties of normal distributions:
- The mean, median and mode are equal.
- Left and right halves of the curve are mirrored.
- It is perfectly symmetrical. The distribution curve can be divided in the middle to produce two equal halves.
Q24. If there is a 30% probability that you will see a red bicycle in any 20-minute time interval, what is the probability that you will see at least 1 red bicycle in the period of 60 minutes?
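The probability of not seeing a red bicycle in 20 minutes is:
= 1 – P
= 1 − 0.3 = 0.7
A 60-minute period contains three 20-minute intervals. Assuming the intervals are independent, the probability of not seeing any red bicycle in 60 minutes is:
= (0.7) ^ 3 = 0.343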
Hence, the probability of seeing at least 1 red bicycle in 60 minutes is:
= 1 – P
= 1 − 0.343 = 0.657
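The same complement calculation can be written as a small Python function. This is a minimal sketch that assumes the 20-minute intervals are independent; the function name is chosen for this example.

```python
def prob_at_least_one(p_single, intervals):
    """P(at least one event) = 1 - P(no event in any of the independent intervals)."""
    return 1 - (1 - p_single) ** intervals


# 60 minutes = three 20-minute intervals, each with a 30% chance
p_60 = prob_at_least_one(p_single=0.3, intervals=3)
print(round(p_60, 3))  # 0.657
```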
Q25. What are some of the low and high-bias Machine Learning algorithms?
Low-bias Machine Learning algorithms: Decision trees, KNN algorithm, Support Vector Machines etc.
High-bias Machine Learning algorithms: Linear regression, logistic regression, linear discriminant analysis.
Q26. Name some techniques to reduce underfitting and overfitting during model training
For reducing underfitting:
– Increase model complexity
– Increase the number of training epochs
– Increase the number of features
– Remove noise from the data
For reducing overfitting:
– Increase training data
– Lasso regularisation
– Use dropout (randomly dropping units during training)
Q27. What is the empirical rule?
Also known as the 68 – 95 – 99.7 rule, the Empirical Rule states that on a Normal Distribution:
68% of the data will be within one Standard Deviation of the Mean
95% of the data will be within two Standard Deviations of the Mean
99.7% of the data will be within three Standard Deviations of the Mean
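These percentages can be verified from the cumulative distribution function of the standard normal distribution. Here is a short Python sketch using the standard library’s `statistics.NormalDist`; the helper name `share_within` is just for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    # Prints roughly 68.27%, 95.45% and 99.73%
    print(f"within {k} SD: {share_within(k):.2%}")
```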
Q28. What is the meaning of sensitivity in statistics?
Sensitivity, also known as recall or the true positive rate, measures how well a classifier identifies actual positive events.
The formula to calculate sensitivity is:
Sensitivity = Correctly predicted positive events (true positives) / Total number of actual positive events
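As an illustration, sensitivity can be computed from a list of true labels and a list of predictions. The sketch below uses made-up labels where 1 marks a positive event; the function name and data are assumptions of this example.

```python
def sensitivity(y_true, y_pred, positive=1):
    """True positive rate: true positives / all actual positives."""
    pairs = list(zip(y_true, y_pred))
    true_positives = sum(1 for t, p in pairs if t == positive and p == positive)
    false_negatives = sum(1 for t, p in pairs if t == positive and p != positive)
    return true_positives / (true_positives + false_negatives)


# Hypothetical labels: 5 actual positives, 3 of which the classifier catches
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1, 0, 0]
print(sensitivity(y_true, y_pred))  # 3 / 5 = 0.6
```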
Q29. What is an undercoverage bias?
Undercoverage bias occurs when some members of the population are inadequately represented in a sample. It is a type of sampling bias that arises when part of the population is left out of the sampling process, or has little chance of being included in it.
Q30. What is the law of large numbers in statistics?
The law of large numbers states that, “as a sample size grows, its mean gets closer to the average of the whole population”.
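A small simulation makes this visible. The sketch below rolls a fair six-sided die, whose population mean is 3.5, and reports how far the sample mean is from it for growing sample sizes. The function name and the fixed seed are choices made for this example.

```python
import random
from statistics import mean

POPULATION_MEAN = 3.5  # expected value of a fair six-sided die


def gap_from_population_mean(n_rolls, seed=0):
    """Absolute gap between the mean of n_rolls simulated rolls and 3.5."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rolls = [rng.randint(1, 6) for _ in range(n_rolls)]
    return abs(mean(rolls) - POPULATION_MEAN)


for n in (10, 1_000, 100_000):
    print(n, round(gap_from_population_mean(n), 4))
```

The gap does not have to shrink at every single step, but it tends towards zero as the number of rolls grows.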
These data science statistics interview questions and answers will definitely make it easier for professionals and students to ace their interviews. For upskilling yourself in the dynamic field of data science, check out GeekLurn’s Data Science Course with placement guarantee. Not only does it have 320+ hours of live interactive sessions with eminent Data Scientists, but also has opportunities to conduct research work with GeekLurn AI Singapore. We wish you all the best with your interview process. Hope you land your dream job soon!