Data Science Basics Interview Questions and Answers


Data science is an interdisciplinary field that unifies statistics, data analysis, informatics, and related methods to understand and analyze data. It draws on techniques and theories from mathematics, statistics, computer science, information science, and domain knowledge.

Data Science aims at extracting or extrapolating knowledge and insights from noisy, structured, and unstructured data across a broad range of application domains. These insights feed into decision-making and strategic planning, helping organizations formulate data-driven solutions.

The accelerating number of data sources (IoT, mobile devices, social networks, and more), and consequently the volume of data, has made Data Science one of the fastest-growing fields across industries. Organizations worldwide increasingly rely on Data Science techniques to interpret data and provide actionable recommendations that improve business outcomes.

Basics of Data Science

Before you head straight to the Data Science interview questions and answers, let's first cover the basics of a Data Science project.

Data Science Lifecycle 

The Data Science lifecycle involves various processes that enable analysts to analyze data, identify patterns, and glean actionable insights. Here is a stepwise description of a Data Science project:

Data Ingestion

The Data Science life cycle begins with collecting data from different sources. You can collect both structured data (such as customer records) and unstructured data (log files, video, audio, pictures) from all relevant sources, using methods such as web scraping and real-time streaming from systems and devices.

Data Storage and Data Processing

Collected data may arrive in different formats and structures, so you need to consider different storage systems. Once stored, the next step is cleaning, deduplicating, transforming, and combining the data using ETL (extract, transform, load) jobs or other data integration technologies.

This promotes data quality before the data is loaded into a data warehouse, data lake, or another repository.

Data Analysis

Data scientists conduct an exploratory data analysis to examine biases, patterns, ranges, and distributions of values within the data and to drive hypothesis generation for A/B testing. This allows analysts to determine the data's relevance for use within modelling efforts for:

  • Predictive analytics 
  • Machine learning
  • Deep learning

Depending on a model’s accuracy, you can rely on these insights for business decision-making.

Communicate

Finally, insights derived from the data analysis are presented as reports and other data visualizations. Businesses identify market trends and use advanced analytics to gain value in many ways, such as:

  • Reducing costs
  • Faster decision-making
  • Marketing
  • Developing new products
  • Customer service

What are the differences between supervised and unsupervised learning?

Supervised Learning in Data Science:

  • Uses known and labeled data as input
  • Has a feedback mechanism
  • The most commonly used supervised learning algorithms are decision trees, logistic regression, and support vector machines

Unsupervised Learning in Data Science:

  • Uses unlabeled data as input
  • Has no feedback mechanism
  • The most commonly used unsupervised learning algorithms are k-means clustering, hierarchical clustering, and the apriori algorithm

How is logistic regression done?

Logistic regression measures the relationship between the dependent variable (our label of what we want to predict) and one or more independent variables (our features) by estimating probability using its underlying logistic function (sigmoid).
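
A minimal sketch of this idea, assuming scikit-learn and a made-up two-feature dataset; the manual calculation at the end reproduces the sigmoid of the linear combination of features:

  import numpy as np
  from sklearn.linear_model import LogisticRegression

  # Toy data: two features, binary label (illustrative only)
  X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]])
  y = np.array([0, 0, 0, 1, 1, 1])

  model = LogisticRegression().fit(X, y)

  # predict_proba applies the sigmoid to the linear combination of features
  print(model.predict_proba([[3.5, 3.5]]))   # [[P(class 0), P(class 1)]]

  # The same class-1 probability computed manually from the learned coefficients
  z = model.intercept_ + model.coef_ @ np.array([3.5, 3.5])
  print(1 / (1 + np.exp(-z)))                # sigmoid(z) = P(class 1)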

Explain the steps in making a decision tree.

  1. Take the entire data set as input
  2. Calculate entropy of the target variable, as well as the predictor attributes
  3. Calculate the information gain of all attributes (we gain information on sorting different objects from each other)
  4. Choose the attribute with the highest information gain as the root node 
  5. Repeat the same procedure on every branch until the decision node of each branch is finalized
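
A minimal sketch of these steps using scikit-learn, which performs the entropy and information-gain calculations internally when criterion="entropy"; the iris dataset and max_depth value are illustrative choices:

  from sklearn.datasets import load_iris
  from sklearn.tree import DecisionTreeClassifier, export_text

  X, y = load_iris(return_X_y=True)          # 1. take the entire data set as input

  # 2-4. entropy and information gain are computed internally to pick each split
  tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
  tree.fit(X, y)

  # 5. the printed rules show the splits repeated down every branch
  print(export_text(tree))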

How do you build a random forest model?

A random forest is built up of a number of decision trees. If you split the data into different samples and build a decision tree on each group of data, the random forest brings all those trees together.

Steps to build a random forest model:

  1. Randomly select ‘k’ features from a total of ‘m’ features where k << m
  2. Among the ‘k’ features, calculate the node D using the best split point
  3. Split the node into daughter nodes using the best split
  4. Repeat steps two and three until leaf nodes are finalized 
  5. Build forest by repeating steps one to four for ‘n’ times to create ‘n’ number of trees
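
A minimal sketch with scikit-learn's RandomForestClassifier, whose max_features and n_estimators parameters correspond to the 'k' features tried per split and the 'n' trees above; the synthetic dataset is only for illustration:

  from sklearn.datasets import make_classification
  from sklearn.ensemble import RandomForestClassifier

  X, y = make_classification(n_samples=500, n_features=20, random_state=0)

  # n_estimators = number of trees ('n'); max_features = features tried per split ('k')
  forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
  forest.fit(X, y)

  print(forest.score(X, y))   # training accuracy of the combined forest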

How can you avoid overfitting your model?

Overfitting refers to a model that is tuned too closely to a small amount of training data and ignores the bigger picture, so it fails to generalize. There are three main methods to avoid overfitting:

  1. Keep the model simple—take fewer variables into account, thereby removing some of the noise in the training data
  2. Use cross-validation techniques, such as k-fold cross-validation
  3. Use regularization techniques, such as LASSO, that penalize certain model parameters if they’re likely to cause overfitting
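
As a small illustration of the third point, a LASSO model in scikit-learn penalizes coefficients and typically shrinks some of them to exactly zero; the synthetic data and the alpha value are made up for this sketch:

  from sklearn.datasets import make_regression
  from sklearn.linear_model import Lasso

  X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                         noise=10, random_state=0)

  # alpha controls the strength of the L1 penalty; larger alpha -> more coefficients at zero
  lasso = Lasso(alpha=5.0).fit(X, y)
  print(sum(coef != 0 for coef in lasso.coef_), "of 20 features kept")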

What are the feature selection methods that you can use to select the right variables?

There are two main methods for feature selection: filter methods and wrapper methods.

Filter Methods

This involves: 

  • Linear discriminant analysis
  • ANOVA
  • Chi-Square

The best analogy for selecting features is “bad data in, bad answer out.” When we’re limiting or selecting the features, it’s all about cleaning up the data coming in. 

Wrapper Methods

This involves: 

  • Forward Selection: Test one feature at a time and keep adding features until you get a good fit.
  • Backward Selection: Start with all the features and remove them one by one to see what works better.
  • Recursive Feature Elimination: Recursively looks through all the different features and how they pair together.

Wrapper methods are very labor-intensive, and you need high-end computers if a lot of data analysis is performed with a wrapper method.
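
A minimal sketch of recursive feature elimination using scikit-learn's RFE wrapper around a logistic regression; the synthetic data and the choice of 5 features are arbitrary:

  from sklearn.datasets import make_classification
  from sklearn.feature_selection import RFE
  from sklearn.linear_model import LogisticRegression

  X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)

  # Repeatedly fit the model and drop the weakest feature until 5 remain
  selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
  selector.fit(X, y)
  print(selector.support_)   # boolean mask of the selected features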

What are dimensionality reduction and its benefits?

Dimensionality reduction refers to the process of converting a data set with a large number of dimensions (fields) into one with fewer dimensions while conveying similar information concisely.

This reduction helps in compressing data and reducing storage space. It also reduces computation time as fewer dimensions lead to less computing. It removes redundant features; for example, there’s no point in storing a value in two different units (meters and inches).
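
A minimal sketch of one common dimensionality reduction technique, PCA in scikit-learn, reducing a made-up 10-dimensional dataset to 2 components:

  import numpy as np
  from sklearn.decomposition import PCA

  rng = np.random.default_rng(0)
  X = rng.normal(size=(200, 10))         # hypothetical 10-dimensional data

  pca = PCA(n_components=2)
  X_reduced = pca.fit_transform(X)       # same rows, only 2 columns now

  print(X_reduced.shape)                 # (200, 2)
  print(pca.explained_variance_ratio_)   # share of variance kept per component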

How should you maintain a deployed model?

The steps to maintain a deployed model are:

Monitor 

All models need constant monitoring to determine their performance accuracy. When you change something, you want to figure out how your changes are going to affect things, so you need to monitor the model consistently.

Evaluate

Evaluation metrics of the current model are calculated to determine if a new algorithm is needed. 

Compare

The new models are compared to each other to determine which model performs the best. 

Rebuild

The best performing model is re-built on the current state of data.

What are recommender systems?

A recommender system predicts what a user would rate a specific product based on their preferences. Generally, you can split it into two different areas:

Collaborative Filtering

As an example, Last.fm recommends tracks that other users with similar interests play often. You can also commonly see this on Amazon: after making a purchase, customers may notice product recommendations accompanied by the message “Users who bought this also bought…”
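
As a tiny illustration of the collaborative idea (the user-item ratings are made up), users with similar rating vectors can be found with cosine similarity, and items liked by the most similar users can then be recommended:

  import numpy as np

  # Rows = users, columns = items; 0 means 'not rated' (made-up ratings)
  ratings = np.array([[5, 4, 0, 1],
                      [4, 5, 1, 0],
                      [1, 0, 5, 4]])

  def cosine(a, b):
      return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

  # User 0 is much closer to user 1 than to user 2, so user 1's items get recommended
  print(cosine(ratings[0], ratings[1]), cosine(ratings[0], ratings[2]))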

Content-based Filtering

As an example, Pandora uses the properties of a song to recommend music with similar properties. Here, we look at the content itself instead of looking at who else is listening to the music.

How can you select k for k-means?

We use the elbow method to select k for k-means clustering. The idea is to run k-means clustering on the data set for a range of values of k (the number of clusters).

For each k, we compute the within-cluster sum of squares (WSS), defined as the sum of the squared distances between each member of a cluster and its centroid. Plotting WSS against k, the "elbow" point where WSS stops decreasing sharply is a good choice for k.
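
A minimal sketch of the elbow method with scikit-learn, where KMeans exposes the WSS as inertia_; the blob data and the range of k are illustrative:

  from sklearn.cluster import KMeans
  from sklearn.datasets import make_blobs

  X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

  # Compute WSS (inertia_) for k = 1..9 and look for the 'elbow' in the curve
  for k in range(1, 10):
      wss = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
      print(k, round(wss, 1))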

What is a star schema?

It is a traditional database schema with a central fact table. Satellite tables map IDs to physical names or descriptions and connect to the central fact table using ID fields. These satellite tables are generally called lookup tables and are principally useful in real-time applications, as they save a lot of memory. Sometimes, star schemas involve several layers of summarization to recover information faster.

How can you treat outlier values?

You can drop an outlier only if it is a garbage value.

Example: height of an adult = abc ft. This cannot be true, as the height cannot be a string value. In this case, you can remove outliers.

You can also remove outliers with extreme values. For example, if all the data points are clustered between zero and 10 but one point lies at 100, we can remove this point.

If you cannot drop outliers, you can try the following:

  • Try a different model. Data detected as outliers by linear models can be fit by nonlinear models. Therefore, be sure you are choosing the correct model.
  • Try normalizing the data. This way, you can pull the extreme data points to a similar range.
  • You can use algorithms that are less affected by outliers; an example would be random forests.
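
One common (though not the only) way to flag extreme values is the interquartile-range rule; here is a small pandas sketch with made-up numbers:

  import pandas as pd

  values = pd.Series([2, 3, 4, 5, 4, 3, 2, 100])   # 100 is the suspicious point

  q1, q3 = values.quantile(0.25), values.quantile(0.75)
  iqr = q3 - q1
  lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

  print(values[(values < lower) | (values > upper)])   # flagged outliers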

Write a basic SQL query that lists all orders with customer information.

Usually, we have an Order table and a Customer table that contain the following columns:

Order table: OrderId, CustomerId, OrderNumber, TotalAmount

Customer table: Id, FirstName, LastName, City, Country

The SQL query is:

  SELECT OrderNumber, TotalAmount, FirstName, LastName, City, Country
  FROM Order
  JOIN Customer
  ON Order.CustomerId = Customer.Id

You are given a dataset on cancer detection. You have built a classification model and achieved an accuracy of 96 percent. Why shouldn’t you be happy with your model performance? What can you do about it?

Cancer detection results in imbalanced data. On an imbalanced dataset, accuracy should not be used as the measure of performance. It is important to focus on the remaining four percent, which represents the patients who were wrongly diagnosed. Early diagnosis is crucial when it comes to cancer detection, and can greatly improve a patient’s prognosis.

Hence, to evaluate model performance, we should use sensitivity (true positive rate), specificity (true negative rate), and the F measure to determine the class-wise performance of the classifier.
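
A small sketch with scikit-learn metrics on made-up imbalanced labels, showing why accuracy hides the misdiagnosed cases while sensitivity exposes them:

  from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, recall_score

  y_true = [0] * 96 + [1] * 4     # 1 = cancer; only 4 positive cases
  y_pred = [0] * 100              # a useless model that predicts 'no cancer' for everyone

  tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
  print(accuracy_score(y_true, y_pred))              # 0.96, looks impressive
  print(recall_score(y_true, y_pred))                # sensitivity = 0.0, every cancer case missed
  print(tn / (tn + fp))                              # specificity = 1.0
  print(f1_score(y_true, y_pred, zero_division=0))   # 0.0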

Which of the following machine learning algorithms can be used for imputing missing values of both categorical and continuous variables?

  • Linear regression 
  • K-NN (k-nearest neighbor)
  • Decision trees 

You should use the k-nearest neighbor (KNN) algorithm: when a value is missing, it imputes it from the nearest neighbors computed on all the other features.

When you're dealing with k-means clustering or linear regression, you need to handle missing values in your pre-processing; otherwise, they will fail. Decision trees have the same problem, although there is some variance among implementations.
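
A minimal sketch with scikit-learn's KNNImputer on numeric data (categorical columns would need to be encoded first); the small matrix is made up:

  import numpy as np
  from sklearn.impute import KNNImputer

  X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [4.0, 8.0]])

  # Missing entries are filled using the average of the 2 nearest rows
  imputer = KNNImputer(n_neighbors=2)
  print(imputer.fit_transform(X))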

What is the ROC curve?

The ROC curve is a graph of the true positive rate (y-axis) against the false positive rate (x-axis) that you can use in binary classification.

Generally, you can calculate the False Positive Rate (FPR) by taking the ratio between False Positives and the total number of negative samples. Also, you can calculate the True Positive Rate (TPR) by taking the ratio between True Positives and the total number of positive samples.

To construct the ROC curve, the TPR and FPR values are plotted at multiple threshold values. The area under the ROC curve ranges between 0 and 1. A completely random model, represented by a straight diagonal line, has an area of 0.5. The amount by which a ROC curve deviates from this straight line indicates the efficiency of the model.
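
A minimal sketch with scikit-learn: roc_curve returns the FPR/TPR pairs across thresholds and roc_auc_score gives the area under the curve; the labels and scores here are made up:

  from sklearn.metrics import roc_curve, roc_auc_score

  y_true   = [0, 0, 1, 1, 0, 1, 1, 0]
  y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3]   # predicted probabilities

  fpr, tpr, thresholds = roc_curve(y_true, y_scores)
  print(list(zip(fpr, tpr)))               # points that trace out the ROC curve
  print(roc_auc_score(y_true, y_scores))   # 0.5 = random, 1.0 = perfect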

What is a Confusion Matrix?

The confusion matrix is a summary of the prediction results for a particular classification problem. It is an n*n table that you can use to describe and evaluate the performance of a classification model.

What do you understand about the true-positive rate and false-positive rate?

TRUE-POSITIVE RATE: The true-positive rate gives the proportion of correct predictions of the positive class. You can also use it to measure the percentage of actual positives that are accurately verified.

FALSE-POSITIVE RATE: The false-positive rate gives the proportion of negative cases that are incorrectly predicted as positive. A false positive occurs when the model declares something to be true when it is actually false.

How is Data Science different from traditional application programming?

The primary and vital difference between Data Science and traditional application programming is that in traditional programming, one has to create rules to translate the input to output. In Data Science, the rules are automatically produced from the data.

What is Prior probability and likelihood?

Prior probability is the proportion of the dependent variable in the data set, while the likelihood is the probability of classifying a given observation in the presence of some other variable.

What is the difference between the long format data and wide format data?

LONG FORMAT DATA: It contains values that repeat in the first column. In this format, each row is one time point per subject.

WIDE FORMAT DATA: In the Wide Format Data, the data’s repeated responses will be in a single row, and each response can be recorded in separate columns.

Long format Table:

NAME    ATTRIBUTE    VALUE
RAMA    HEIGHT       182
SITA    HEIGHT       160

Wide format Table:

NAME    HEIGHT
RAMA    182
SITA    160
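
A small pandas sketch converting between the two formats with melt and pivot, using the tables above:

  import pandas as pd

  wide = pd.DataFrame({"NAME": ["RAMA", "SITA"], "HEIGHT": [182, 160]})

  # Wide -> long: one row per (subject, attribute, value)
  long = wide.melt(id_vars="NAME", var_name="ATTRIBUTE", value_name="VALUE")
  print(long)

  # Long -> wide: repeated responses back into separate columns
  print(long.pivot(index="NAME", columns="ATTRIBUTE", values="VALUE"))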

Mention some techniques used for sampling. What is the main advantage of sampling?

Sampling is the selection of individual members or a subset of the population to estimate the characteristics of the whole population. There are two types of sampling, namely probability sampling (for example, simple random, systematic, stratified, and cluster sampling) and non-probability sampling (for example, convenience, quota, and snowball sampling). The main advantage of sampling is that it lets you draw conclusions about a large population without examining every member, saving time and cost.

Why do you use Python for data cleaning in Data Science?

Data scientists and technical analysts must convert huge amounts of raw data into a usable form. Data cleaning includes removing malformed records, outliers, inconsistent values, redundant formatting, etc. Pandas and NumPy are among the most widely used Python libraries for data cleaning.

Which language is best for text analytics? R or Python?

Python is generally more suitable for text analytics, as it offers a rich set of libraries such as pandas, which provides high-level data analysis tools and data structures that base R does not offer to the same extent.

What are the popular libraries used in Data Science?

The popular libraries used in Data Science are 

  • TensorFlow
  • Pandas
  • NumPy
  • SciPy
  • Scrapy
  • Librosa
  • Matplotlib

What is variance in Data Science?

Variance measures how the individual values in a data set are spread around the mean; it is the average of the squared differences of each value from the mean. Data Scientists use variance to understand the distribution of a data set.
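
A small NumPy sketch with made-up numbers, computing the variance by hand and via the library:

  import numpy as np

  data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
  print(data.mean())                       # 5.0
  print(np.mean((data - data.mean())**2))  # 4.0, the average squared deviation
  print(data.var())                        # 4.0, same thing via numpy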

What is pruning in a decision tree algorithm?

In Data Science and Machine Learning, pruning is a technique related to decision trees. Pruning simplifies the decision tree by reducing its rules, which helps to avoid complexity and improves accuracy. Reduced-error pruning and cost-complexity pruning are among the different types of pruning.

What is Ensemble Learning?

An ensemble is a method of combining a diverse set of learners to improve the stability and predictive power of the model. Two types of ensemble learning methods are:

Bagging

The bagging method trains similar learners on small sample populations drawn from the data and then combines their predictions, which helps produce more stable and accurate predictions.

Boosting

Boosting is an iterative method that adjusts the weight of an observation based on the last classification. Boosting decreases the bias error and helps you build strong predictive models.
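
A minimal sketch of both ideas with scikit-learn, bagging and boosting decision trees on a synthetic dataset chosen only for illustration:

  from sklearn.datasets import make_classification
  from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
  from sklearn.tree import DecisionTreeClassifier

  X, y = make_classification(n_samples=500, random_state=0)

  # Bagging: many trees trained on bootstrap samples, predictions combined by voting
  bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

  # Boosting: trees added sequentially, reweighting the observations misclassified so far
  boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

  for name, model in [("bagging", bagging), ("boosting", boosting)]:
      print(name, model.fit(X, y).score(X, y))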

Explain Eigenvalue and Eigenvector

Eigenvectors help in understanding linear transformations. In Data Science, they are commonly calculated for a covariance or correlation matrix. An eigenvector is a direction along which a specific linear transformation acts only by compressing, flipping, or stretching, and the corresponding eigenvalue is the factor by which it does so.
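
A small NumPy sketch computing the eigenvalues and eigenvectors of a made-up 2x2 covariance matrix:

  import numpy as np

  cov = np.array([[2.0, 0.8],
                  [0.8, 1.0]])   # hypothetical covariance matrix

  eigenvalues, eigenvectors = np.linalg.eig(cov)
  print(eigenvalues)    # how much the transformation stretches each direction
  print(eigenvectors)   # the directions (columns) that are only scaled, not rotated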

State the difference between the expected value and mean value

There are not many differences, but both of these terms are used in different contexts. Mean value is generally referred to when you are discussing a probability distribution whereas expected value is referred to in the context of a random variable.

What is entropy in a decision tree algorithm?

Entropy is the measure of randomness or disorder in a group of observations. It determines how a decision tree chooses where to split the data and is also used to check the homogeneity of the given data. If the entropy is zero, the sample of data is entirely homogeneous, and if the entropy is one, the sample is equally divided between the classes.

What is information gain in a decision tree algorithm?

Information gain is the expected reduction in entropy and decides which attribute is used for each split while building the tree; choosing the attribute with the highest gain makes the decision tree smarter. For a parent node R and a set E of K training examples, it is calculated as the difference between the entropy before and after the split.
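
A small sketch computing entropy and the information gain of a candidate split by hand; the tiny label lists are made up:

  import math
  from collections import Counter

  def entropy(labels):
      counts = Counter(labels)
      total = len(labels)
      return -sum((c / total) * math.log2(c / total) for c in counts.values())

  parent = ["yes", "yes", "yes", "no", "no", "no"]          # entropy = 1.0
  left, right = ["yes", "yes", "yes"], ["no", "no", "no"]   # a perfect split

  gain = entropy(parent) - (
      len(left) / len(parent) * entropy(left)
      + len(right) / len(parent) * entropy(right)
  )
  print(gain)   # 1.0: the split removes all uncertainty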

What is k-fold cross-validation?

K-fold cross-validation is a procedure used to estimate a model's skill on new data. In k-fold cross-validation, every observation from the original dataset appears in both the training and the testing set across the different folds. K-fold cross-validation estimates the accuracy but does not by itself improve the accuracy.
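
A minimal sketch of 5-fold cross-validation with scikit-learn; the logistic regression model and synthetic dataset are illustrative choices:

  from sklearn.datasets import make_classification
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import KFold, cross_val_score

  X, y = make_classification(n_samples=200, random_state=0)

  # Each observation appears in the test fold exactly once across the 5 splits
  scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                           cv=KFold(n_splits=5, shuffle=True, random_state=0))
  print(scores, scores.mean())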

What is a normal distribution?

Normal Distribution is also known as the Gaussian Distribution. The normal distribution shows the data near the mean and the frequency of that particular data. When represented in graphical form, normal distribution generally appears like a bell curve. The parameters included in the normal distribution are Mean, Standard Deviation, Median etc.

What is Deep Learning?

Deep Learning is one of the essential areas of Data Science. It uses algorithms that are inspired by the structure and function of the human brain. In Deep Learning, multiple layers are built from the raw input to extract high-level features.

What is an RNN (recurrent neural network)?

An RNN is a neural network that works with sequential data. RNNs are used in language translation, voice recognition, image captioning, etc. There are different types of RNN architectures, such as one-to-one, one-to-many, many-to-one, and many-to-many. RNNs are used in Google's voice search and Apple's Siri.

Discuss Artificial Neural Networks

Artificial Neural Networks (ANNs) are a special set of algorithms that have revolutionized machine learning. They adapt to changing input, so the network generates the best possible result without redesigning the output criteria.

What is Back Propagation?

Back-propagation is the essence of neural net training. It is the method of tuning the weights of a neural net based on the error rate obtained in the previous epoch. Proper tuning of the weights helps you reduce error rates and makes the model more reliable by increasing its generalization.

What are the feature vectors?

A feature vector is an n-dimensional vector of numerical features that represent an object. In machine learning, feature vectors are used to represent numeric or symbolic characteristics (called features) of an object in a mathematical way that’s easy to analyze.

What are the steps in making a decision tree?

  1. Take the entire data set as input.
  2. Look for a split that maximizes the separation of the classes. A split is any test that divides the data into two sets.
  3. Apply the split to the input data (divide step).
  4. Re-apply steps one and two to the divided data.
  5. Stop when you meet any stopping criteria.
  6. Clean up the tree if you went too far doing splits; this step is called pruning.

What is root cause analysis?

Root cause analysis was initially developed to analyze industrial accidents but is now widely used in other areas. It is a problem-solving technique used for isolating the root causes of faults or problems. A factor is called a root cause if its removal from the problem-fault sequence prevents the final undesirable event from recurring.

What is logistic regression?

Logistic regression is also known as the logit model. It is a technique used to forecast the binary outcome from a linear combination of predictor variables.

What are recommender systems?

Recommender systems are a subclass of information filtering systems that are meant to predict the preferences or ratings that a user would give to a product.

Name three disadvantages of using a linear model

Three disadvantages of the linear model are:

  • The assumption of linearity of the errors.
  • You can’t use this model for binary or count outcomes
  • There are plenty of overfitting problems that it can’t solve

Why do you need to perform resampling?

Resampling is done in the following cases:

  • Estimating the accuracy of sample statistics by drawing randomly with replacement from a set of data points, or by using subsets of the accessible data
  • Substituting labels on data points when performing necessary tests
  • Validating models by using random subsets
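
A small bootstrap sketch, one instance of the first case above: estimate the variability of the sample mean by drawing with replacement from made-up data:

  import numpy as np

  rng = np.random.default_rng(0)
  sample = np.array([3.1, 2.9, 3.5, 3.0, 3.8, 2.7, 3.3, 3.6])

  # Draw 1000 bootstrap samples (with replacement) and recompute the mean each time
  boot_means = [rng.choice(sample, size=len(sample), replace=True).mean()
                for _ in range(1000)]
  print(np.mean(boot_means), np.std(boot_means))   # bootstrap estimate and its spread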

Explain cross-validation.

Cross-validation is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the objective is forecasting and one wants to estimate how accurately a model will perform in practice.

The goal of cross-validation is to set aside a portion of the data to test the model during the training phase (i.e., a validation data set) in order to limit problems like overfitting and to gain insight into how the model will generalize to an independent data set.

Explain the benefits of using statistics by Data Scientists

Statistics help data scientists get a better idea of customers' expectations. Using statistical methods, data scientists can gain knowledge about consumer interest, behavior, engagement, retention, etc. Statistics also help build powerful data models to validate inferences and predictions.

What is collaborative filtering?

Most recommender systems use this filtering process to find patterns and information by combining the perspectives of numerous users, multiple data sources, and several agents.

What is bias?

Bias is an error introduced in your model because of the oversimplification of a machine learning algorithm. It can lead to underfitting.

Do gradient descent methods always converge to similar points?

They do not, because in some cases they converge to a local minimum or local optimum rather than the global optimum. This is governed by the data and the starting conditions.
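
A tiny sketch showing the dependence on starting conditions: plain gradient descent on a made-up function with two minima lands in different basins depending on where it starts:

  def grad(x):
      # derivative of f(x) = x**4 - 3*x**2 + x, which has two local minima
      return 4 * x**3 - 6 * x + 1

  def gradient_descent(x, lr=0.01, steps=1000):
      for _ in range(steps):
          x -= lr * grad(x)
      return x

  print(gradient_descent(-2.0))   # settles near the left minimum (~ -1.30, the global one)
  print(gradient_descent(+2.0))   # settles near the right minimum (~ +1.13, only local)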