Machine Learning Interview Questions


The term Machine Learning is booming across industries because it helps solve real-world problems. Instead of hard-coding rules to solve a problem, a machine learning algorithm learns those rules from data, and what it learns can then be used to make predictions about the future. This gives early adopters a significant edge.

Research has found that more than 80% of companies opting for machine learning and AI report a significant benefit from their investment, since it helps them garner better ROI. Other research has found that enterprises see more than 16% ROI on their machine learning and AI investments.

Considering these stats, the demand for professional machine learning experts is booming, and enterprises have devised tough Machine Learning Interview Questions to ensure that the talent they hire is up to the mark.

Know more about machine learning!

Machine learning's primary goal is to make our lives easier and more accessible. In the early days, most systems relied on hard-coded if/else rules to process data or respond to user input. Machine learning instead lets systems learn from data and figure out the different patterns within it.

Enterprises across the globe want to make services and information more accessible by adopting advanced technologies such as AI and machine learning, so, as noted earlier, adoption of these technologies is growing in industries such as finance, banking, healthcare, and more.

The demand for AI engineers, data scientists, machine learning engineers, and data analysts is soaring. If you want to apply for any of these roles, you should know the machine learning interview questions and answers that hiring managers may ask you.

In this article, we will take you through some of the best machine learning interview questions and answers you may encounter when applying for these job roles. 

What do you understand about Machine learning?

Machine learning is a form of Artificial Intelligence that deals with system programming and automated data analysis, enabling computers to learn and act from experience without being explicitly programmed.

For example, robots can be programmed to perform tasks based on the data they collect from sensors; they automatically learn from that data and improve with experience.

Differentiate between inductive learning and deductive learning?

In inductive learning, the model learns from a set of observed instances (examples) and draws a generalized conclusion from them. In deductive learning, the model starts from already-established conclusions (rules) and applies them to draw conclusions about new observations.

  • Inductive learning is the method of using observations to draw conclusions.
  • Deductive learning is the method of using conclusions to form observations.

What is Cross-Validation?

Cross-validation is a method of repeatedly splitting your data so that every observation is used for both training and validation. In k-fold cross-validation, you split the data into k subsets (folds) and train the model on k−1 of them.

The remaining fold is used for testing, and this is repeated until each fold has served as the test set once. Finally, the scores from all k folds are averaged to produce the final score.
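As a concrete illustration, here is a minimal k-fold cross-validation sketch using scikit-learn (assuming it is installed); the dataset and model are just placeholders:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)           # placeholder dataset
model = LogisticRegression(max_iter=1000)   # placeholder model

# 5-fold CV: train on 4 folds, score on the held-out fold, repeat 5 times
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)
print(scores.mean())  # average of the 5 fold scores = final CV score
```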

What is the difference between Data Mining and Machine Learning?

Data mining can be described as the process of extracting knowledge or interesting, previously unknown patterns from structured data. During this process, it often uses machine learning algorithms.

Machine learning represents the study, design, and development of algorithms that give computers the ability to learn without being explicitly programmed.

What is the meaning of Overfitting in Machine learning?

Overfitting occurs in machine learning when a statistical model describes random error or noise instead of the underlying relationship. You generally observe overfitting when a model is excessively complex, with too many parameters relative to the amount of training data. A model that has been overfitted shows poor predictive performance on new data.

Why does overfitting occur?

Overfitting becomes likely when the criteria used to train the model are not the same as the criteria used to judge its efficiency, for example, when a model is tuned to maximize its accuracy on the training data rather than its performance on unseen data.

What is the method to avoid overfitting?

Overfitting often occurs when we have a small dataset and a model tries to learn from it, so one way to avoid overfitting is to use a large amount of data. If we only have a small dataset and are forced to build a model on it, we can use a technique known as cross-validation: the model is trained on a dataset of known data and then tested against held-out data it has not seen. The primary aim of cross-validation is to define a dataset to "test" the model during the training phase. Regularization techniques, which penalize overly complex models, can also be used to prevent overfitting.

Differentiate supervised and unsupervised machine learning.

  • In supervised machine learning, the machine is trained using labeled data; a new dataset is then given to the learning model so that the algorithm can produce an outcome based on what it learned from the labeled data. For example, when performing classification we first need to label the data that will be used to train the model.
  • In unsupervised machine learning, the machine is not trained using labeled data; the algorithm makes decisions on its own, without any corresponding output variables.

What are Different Kernels in SVM?

The most commonly used kernels in SVM are the following (a short comparison in code follows the list):

  • Linear kernel – used when data is linearly separable. 
  • Polynomial kernel – used when you have discrete data with no natural notion of smoothness.
  • Radial basis function (RBF) kernel – creates a decision boundary that can do a much better job of separating two classes than the linear kernel when the data is not linearly separable.
  • Sigmoid kernel – used as an activation function for neural networks.
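A minimal sketch comparing the kernels via scikit-learn's `SVC` (the dataset is a synthetic placeholder):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)  # non-linear toy data

# Compare the kernels discussed above on the same data
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    score = cross_val_score(SVC(kernel=kernel), X, y, cv=5).mean()
    print(f"{kernel:8s} accuracy: {score:.3f}")
```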

How does Machine Learning differ from Deep Learning?

  • Machine learning is all about algorithms that parse data, learn from that data, and then apply what they have learned to make informed decisions.
  • Deep learning is a subset of machine learning that is inspired by the structure of the human brain and is particularly useful for feature detection.

How is KNN different from k-means?

KNN, or k-nearest neighbors, is a supervised algorithm used for classification: a test sample is assigned the class of the majority of its nearest neighbors. K-means, on the other hand, is an unsupervised algorithm used mainly for clustering: it needs only a set of unlabeled points and the number of clusters k. The algorithm takes the unlabeled data and learns how to group it into clusters by repeatedly assigning each point to the nearest cluster center and recomputing each center as the mean of its assigned points.
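A minimal sketch of the difference using scikit-learn: KNN needs labels to fit, while k-means only needs the points and the number of clusters (the data here is synthetic):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=150, centers=3, random_state=0)  # toy data

# Supervised: KNN is trained on labeled points and predicts a class label
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict(X[:2]))

# Unsupervised: k-means sees only the points and the number of clusters
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])
```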

What are the different types of Algorithm methods in Machine Learning?

The different types of algorithm methods in machine learning are:

  • Supervised Learning
  • Semi-supervised Learning
  • Unsupervised Learning
  • Transduction
  • Reinforcement Learning

What is a Neural Network?

A neural network is a simplified model of the human brain. Much like the brain, it is made of neurons that activate when they encounter something similar to what they have seen before.

Also, there are connections between different neurons that help information flow from one neuron to another.
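A toy forward pass in NumPy can make the idea concrete: inputs flow through weighted connections, and each neuron "activates" via a non-linear function. All numbers here are arbitrary, not a real trained network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])        # input features
W1 = np.random.randn(4, 3) * 0.1      # connections: input -> 4 hidden neurons
W2 = np.random.randn(1, 4) * 0.1      # connections: hidden -> output neuron

hidden = sigmoid(W1 @ x)              # hidden neurons "activate"
output = sigmoid(W2 @ hidden)         # information flows on to the output neuron
print(output)
```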

What do you understand about Reinforcement Learning techniques?

Reinforcement learning is a machine learning technique in which an agent interacts with its environment by producing actions and discovering errors or rewards. Software agents and machines use reinforcement learning to search for the most suitable behavior or path to follow in a specific situation. The agent learns on the basis of the reward or penalty it receives for every action it performs.
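A minimal tabular Q-learning sketch on a made-up 5-state chain environment (everything here is a toy assumption, not a standard API): the agent moves left or right and is rewarded only for reaching the final state.

```python
import random

n_states, n_actions = 5, 2          # toy chain: states 0..4, actions 0=left, 1=right
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.2

for episode in range(500):
    s = 0
    while s != n_states - 1:
        # epsilon-greedy: explore sometimes, otherwise exploit the best known action
        a = random.randrange(n_actions) if random.random() < epsilon \
            else max(range(n_actions), key=lambda i: Q[s][i])
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0   # reward only at the goal
        # Q-learning update: nudge the estimate toward reward + discounted future value
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

print(Q)  # after training, "right" should look better than "left" in every state
```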

What is the trade-off between bias and variance?

Both bias and variance are errors. Bias is an error due to erroneous or overly simplistic assumptions in the learning algorithm. It can lead to the model under-fitting the data, making it hard to have high predictive accuracy and generalize the knowledge from the training set to the test set.

Variance is an error due to too much complexity in the learning algorithm. It leads to the algorithm being highly sensitive to high degrees of variation in the training data, which can lead the model to overfit the data.

To minimize the total error, we need to trade off bias against variance.
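One way to see the trade-off is to fit polynomials of increasing degree to noisy data: a low degree underfits (high bias), a very high degree overfits (high variance). A small sketch with scikit-learn on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60))[:, None]
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=60)  # noisy sine

for degree in [1, 4, 15]:   # underfit, reasonable, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    score = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(f"degree {degree:2d}: CV MSE = {-score:.3f}")
```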

How do classification and regression differ?

  • Classification is the task of predicting a discrete class label, whereas regression is the task of predicting a continuous quantity.
  • In a classification problem, data is labeled into one of two or more classes; a regression problem requires the prediction of a numeric quantity.
  • A classification problem with two classes is called binary classification, and one with more than two classes is called multi-class classification; a regression problem with multiple input variables is called a multivariate regression problem.
  • Classifying an email as spam or not spam is an example of a classification problem; predicting the price of a stock over a period of time is an example of a regression problem.

What are the five popular algorithms we use in Machine Learning?

Five popular algorithms are:

  • Decision Trees
  • Probabilistic Networks
  • Neural Networks
  • Support Vector Machines
  • Nearest Neighbor

What is a Box-Cox transformation?

The Box-Cox transformation is a power transform that converts non-normal dependent variables into an approximately normal shape, since normality is the most common assumption made when using many statistical techniques. It has a lambda parameter; when lambda is set to 0, the transform is equivalent to a log transform. It is used for variance stabilization and to normalize the distribution.
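With SciPy (assuming it is installed), the transform and the fitted lambda come from a single call; the data here is synthetic and must be strictly positive:

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
skewed = rng.exponential(scale=2.0, size=1000)     # right-skewed, strictly positive data

transformed, fitted_lambda = stats.boxcox(skewed)  # lambda is estimated by maximum likelihood
print(f"fitted lambda: {fitted_lambda:.3f}")
print(f"skewness before: {stats.skew(skewed):.2f}, after: {stats.skew(transformed):.2f}")
```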

What do you mean by ensemble learning?

Ensemble learning is the practice of strategically building and combining numerous models, such as classifiers, to solve a specific computational problem. Ensemble methods are also called committee-based learning or learning multiple classifier systems, because they train multiple hypotheses to solve the same problem. One of the most familiar examples of ensemble modeling is the random forest, which uses several decision trees to predict outcomes. Ensembles are used to improve the classification, function approximation, prediction, etc. of a model.
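A minimal committee-style ensemble using scikit-learn's `VotingClassifier` (models and dataset are placeholders):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Three different hypotheses combined into one committee
ensemble = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=5000)),
    ("dt", DecisionTreeClassifier(random_state=0)),
    ("nb", GaussianNB()),
], voting="hard")

print(cross_val_score(ensemble, X, y, cv=5).mean())
```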

What is model selection in Machine Learning?

Model selection is the process of choosing among diverse mathematical models that are used to describe the same data. Model selection is applied in the fields of statistics, data mining, and machine learning.

What are Loss Function and Cost Functions? Explain the key Difference Between them?

When the loss is calculated for a single data point, we use the term loss function.

When the error is aggregated over multiple data points (typically the entire training set), we use the term cost function. Beyond that scope, there is no major difference.

In other words, the loss function is to capture the difference between the actual and predicted values for a single record whereas cost functions aggregate the difference for the entire training dataset.

Mean squared error and hinge loss are among the most common loss functions.

Mean Squared Error (MSE): In simple words, it measures, on average, how far the model's predicted values are from the actual values.

MSE = (1/n) × Σ (actual value − predicted value)²

Hinge loss: It is used to train classifiers, most notably support vector machines, and is defined as

L(y) = max(0, 1 − t · y)

where t = −1 or 1 indicates the true class and y represents the raw output of the classifier. (In other contexts, a cost function can also describe total cost as the sum of fixed and variable costs, as in the linear equation y = mx + b; in machine learning, the cost function aggregates the per-record losses over the training data.)
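A quick NumPy sketch of both computations (the numbers are arbitrary):

```python
import numpy as np

# Mean squared error: average squared difference over the whole dataset (a cost function)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
mse = np.mean((y_true - y_pred) ** 2)
print(f"MSE: {mse:.3f}")

# Hinge loss for a single record (a loss function): t is the true class (+1 or -1),
# y is the raw classifier output
t, y = 1, 0.3
hinge = max(0.0, 1.0 - t * y)
print(f"hinge loss: {hinge:.2f}")
```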

How to Handle Outlier Values?

An outlier is an observation that lies far away from the other observations in the dataset. Tools commonly used to discover outliers include the following (a short detection sketch follows this answer):

  • Box plot
  • Z-score
  • Scatter plot, etc.

Typically, we need to follow three simple strategies to handle outliers:

  • We can drop them. 
  • We can mark them as outliers and include them as a feature. 
  • Likewise, we can transform the feature to reduce the effect of the outlier.
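As promised above, here is a quick detection sketch using the box-plot (IQR) rule; the data and the 1.5×IQR threshold are just the usual illustrative choices:

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 95, 12, 10, 11], dtype=float)  # 95 looks suspicious

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr       # the usual box-plot "whisker" rule
outliers = data[(data < lower) | (data > upper)]
print(outliers)                                     # -> [95.]
```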

What are the three stages of building hypotheses or models in machine learning?

There are three stages to building hypotheses or models in machine learning:

  • Model building
    Choose a suitable algorithm for the model and train it according to the requirements of the problem.
  • Model testing
    Check the accuracy of the model using the test data.
  • Applying the model
    Make the required changes after testing and apply the final model to the real problem.

What, according to you, is the standard approach to supervised learning?

In supervised learning, the standard approach is to split the set of examples into a training set and a test set.
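In scikit-learn, that split is a single call (the 80/20 ratio here is just a common choice, not a rule):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# Hold out 20% of the examples as the test set; train only on the remaining 80%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))
```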

What is a Random Forest? How does it work?

Random forest is a versatile machine learning method capable of performing both regression and classification tasks.

Like bagging and boosting, random forest works by combining a set of individual tree models. Each tree is built from a random sample of the training data, considering a random subset of the columns at each split.

Here are the steps a random forest follows to create its trees (a minimal usage sketch follows the list):

  • Take a random sample from the training data.
  • Begin with a single node.
  • Run the following algorithm, starting from that node:
    • If the number of observations is less than the minimum node size, stop.
    • Select random variables (features).
    • Find the variable that does the “best” job of splitting the observations.
    • Split the observations into two nodes.
    • Repeat these steps recursively on each of the two new nodes.

Describe ‘Training set’ and ‘Test set’.

In machine learning, the set of data used to discover potentially predictive relationships is known as the ‘training set’. The training set is the set of examples given to the learner. The ‘test set’, in contrast, is used to test the accuracy of the hypotheses generated by the learner; it is the set of instances held back from the learner. Thus, the training set is distinct from the test set.

What are the common ways to handle missing data in a dataset?

Missing data is a standard issue when working with real datasets and is one of the greatest challenges data analysts face. There are many ways to impute missing values. Some of the common methods to handle missing data are deleting the affected rows, replacing missing values with the mean/median/mode, predicting the missing values, assigning them a unique category, and using algorithms that support missing values.
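A couple of these strategies sketched with pandas and scikit-learn (the column names and values are made up):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 31, 47], "salary": [50000, 62000, np.nan, 80000]})

dropped = df.dropna()                              # strategy 1: delete rows with missing values
filled = df.fillna(df.median(numeric_only=True))   # strategy 2: replace with the column median

imputer = SimpleImputer(strategy="mean")           # strategy 3: scikit-learn mean imputation
imputed = imputer.fit_transform(df)
print(filled, imputed, sep="\n")
```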

What do you understand about ILP?

ILP stands for Inductive Logic Programming. It is a part of machine learning that uses logic programming. It aims at searching for patterns in data that can be used to build predictive models. In this process, logic programs are treated as the hypotheses.

What are the necessary steps in a Machine Learning Project?

There are several essential steps we must follow to achieve a good working model while doing a Machine Learning project. Those steps include data collection, data preparation, training the model, parameter tuning, model evaluation, and prediction.

Describe Precision and Recall?

Precision and recall are both measures used in the information retrieval domain to measure how well an information retrieval system retrieves the data relevant to a user's request.

Precision is also called the positive predictive value. It is the fraction of relevant instances among the retrieved instances.

Recall, on the other hand, is the fraction of relevant instances that have been retrieved out of the total number of relevant instances. Recall is also called sensitivity.
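In terms of counts, precision = TP / (TP + FP) and recall = TP / (TP + FN); a quick check with scikit-learn (the labels are arbitrary):

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
```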

What do you understand about the Decision Tree in Machine Learning?

A decision tree is a supervised machine learning model in which the data is continuously split according to certain parameters. It builds classification or regression models in the form of a tree structure, breaking the dataset into ever smaller subsets as the tree is developed. A tree is defined by two entities: decision nodes and leaves. The leaves are the decisions or outcomes, and the decision nodes are where the data splits. Decision trees can handle both categorical and numerical data.
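A short sketch that prints the learned decision nodes and leaves, using scikit-learn on the iris data (the depth limit is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Print the learned splits (decision nodes) and the resulting leaves
print(export_text(tree, feature_names=list(iris.feature_names)))
```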

What are the functions of Supervised Learning?

  • Classification
  • Speech Recognition
  • Regression
  • Predict Time Series
  • Annotate Strings

What are the functions of Unsupervised Learning?

  • Finding clusters of the data
  • Finding low-dimensional representations of the data
  • Finding interesting directions in data
  • Finding novel observations/ database cleaning
  • Finding interesting coordinates and correlations

What do you understand about algorithm independent machine learning?

Algorithm-independent machine learning refers to the parts of machine learning whose mathematical foundations are independent of any particular classifier or learning algorithm.

What is a confusion matrix and why do you need it?

A confusion matrix (also called an error matrix) is a table frequently used to illustrate the performance of a classification model (classifier) on a set of test data for which the true values are known.

It allows us to visualize the performance of an algorithm or model, makes it easy to identify confusion between different classes, and is also used as a performance measure for the model.

In short, a confusion matrix is a summary of the predictions made by a classification model: the numbers of right and wrong predictions are summarized as count values and broken down by class label. It shows not only how many errors the classifier made but also which types of errors it made.
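A small example with scikit-learn (the labels are arbitrary):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Rows = actual class, columns = predicted class: [[TN, FP], [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```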

Describe the classifier in machine learning.

A classifier is a case of a hypothesis or discrete-valued function which is used to assign class labels to particular data points. It is a system that inputs a vector of discrete or continuous feature values and outputs a single discrete value, the class.

What do you mean by Genetic Programming?

Genetic Programming (GP) is a type of Evolutionary Algorithm, a subset of machine learning. Genetic programming systems implement an algorithm that uses random mutation, crossover, a fitness function, and multiple generations of evolution to resolve a user-defined task. The genetic programming model works by testing candidate solutions and choosing the best among a set of results.
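Full genetic programming evolves whole programs, which is too long to show here, but the same evolutionary loop can be sketched with a simple genetic algorithm that evolves a bit string toward an all-ones target (everything in this sketch is a toy assumption):

```python
import random

TARGET_LEN, POP_SIZE, GENERATIONS, MUTATION_RATE = 20, 30, 100, 0.02

def fitness(individual):
    return sum(individual)                      # fitness = number of 1s

def crossover(a, b):
    cut = random.randrange(1, TARGET_LEN)       # single-point crossover
    return a[:cut] + b[cut:]

def mutate(ind):
    return [bit ^ 1 if random.random() < MUTATION_RATE else bit for bit in ind]

population = [[random.randint(0, 1) for _ in range(TARGET_LEN)] for _ in range(POP_SIZE)]
for gen in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    parents = population[:POP_SIZE // 2]        # selection: keep the fittest half
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP_SIZE - len(parents))]
    population = parents + children

print(max(fitness(ind) for ind in population))  # should be at or near TARGET_LEN
```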

What is SVM in machine learning? What are the classification methods that SVM can handle?

SVM stands for Support Vector Machine. SVMs are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis.

The classification methods that SVM can handle are:

  • Combining binary classifiers
  • Modifying binary to incorporate multiclass learning

How will you explain a linked list and an array?

An array is a data structure that is available as a built-in type in almost all modern programming languages. It is generally used to store data of a similar type.

But there are many use-cases where we don’t know the quantity of data to be stored. For such cases, advanced data structures are required, and one such data structure is a linked list.

Difference

There are some points that explain how a linked list differs from an array (a minimal linked-list sketch in Python follows the comparison):

  • An array is a group of elements of a similar data type; a linked list is an ordered group of elements of the same type, connected to each other using pointers.
  • Array elements are stored consecutively in memory; linked-list nodes can be stored anywhere in memory.
  • An array supports random access: elements can be accessed directly using their index value, like arr[0] for the 1st element or arr[5] for the 6th, so access takes constant time, O(1). A linked list supports only sequential access: to reach the nth node, we have to traverse the list from the head, so access takes O(n) time.
  • For an array, memory is allocated at compile time, as soon as the array is declared (static memory allocation); for a linked list, memory is allocated at runtime, whenever a new node is added (dynamic memory allocation).
  • Insertion and deletion take more time in an array because the memory locations are consecutive and fixed; in a linked list, a new element is stored at the first free memory location, so insertion and deletion are fast.
  • The size of an array must be declared at the time of declaration; the size of a linked list is variable and grows at runtime as nodes are added.
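As mentioned above, here is a minimal singly linked list in Python, just to make the node/pointer idea concrete:

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None          # pointer to the next node (None = end of the list)

class LinkedList:
    def __init__(self):
        self.head = None

    def append(self, value):      # memory for the new node is allocated at runtime
        node = Node(value)
        if self.head is None:
            self.head = node
            return
        current = self.head
        while current.next:       # sequential access: walk node by node to the end
            current = current.next
        current.next = node

    def to_list(self):
        out, current = [], self.head
        while current:
            out.append(current.value)
            current = current.next
        return out

lst = LinkedList()
for v in [3, 1, 4]:
    lst.append(v)
print(lst.to_list())              # -> [3, 1, 4]
```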

What is the Time series?

A Time series is a sequence of numerical data points in successive order. It tracks the movement of the chosen data points over a specified period of time and records the data points at regular intervals. Time series doesn’t require any minimum or maximum time input. Analysts often use Time series to examine data according to their specific requirement.

What do you mean by Associative Rule Mining (ARM)?

Associative Rule Mining is a technique for discovering patterns in data, such as features (dimensions) that occur together and features (dimensions) that are correlated. It is mostly used in Market Basket Analysis to find how frequently an itemset occurs in a transaction. Association rules have to satisfy a minimum support and a minimum confidence at the same time. Association rule generation generally comprises two different steps:

  • “A min support threshold is given to obtain all frequent item-sets in a database.”
  • “A min confidence constraint is given to these frequent item-sets in order to form the association rules.”

Support is a measure of how often the “item set” appears in the data set and Confidence is a measure of how often a particular rule has been found to be true.
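A tiny worked example of support and confidence on some made-up transactions:

```python
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    # fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule: {bread} -> {milk}
antecedent, both = {"bread"}, {"bread", "milk"}
confidence = support(both) / support(antecedent)
print(f"support({{bread, milk}}) = {support(both):.2f}")   # 2/4 = 0.50
print(f"confidence(bread -> milk) = {confidence:.2f}")     # 2/3 = 0.67
```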

What is Bayes’ Theorem? How is it useful in a machine learning context?

Bayes’ Theorem gives you the posterior probability of an event given what is known as prior knowledge.

Mathematically, the posterior is the probability of a positive test given the condition, multiplied by the prior probability of the condition, divided by the overall probability of a positive test. Say the test comes back positive for 60% of people who actually have the flu, it also comes back positive for 50% of people who do not have the flu, and the overall population only has a 5% chance of having the flu. Would you actually have a 60% chance of having the flu after a positive test?

Bayes’ Theorem says no. It says that the chance of having the flu is (0.6 × 0.05) / ((0.6 × 0.05) + (0.5 × 0.95)) = 0.03 / 0.505 ≈ 0.0594, i.e. about a 5.94% chance of actually having the flu.

Bayes’ Theorem is the basis behind a branch of machine learning that most notably includes the Naive Bayes classifier.
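A quick check of the arithmetic above (the probabilities are the ones assumed in the example):

```python
p_flu = 0.05                 # prior: 5% of the population has the flu
p_pos_given_flu = 0.60       # positive rate among people with the flu, as assumed above
p_pos_given_no_flu = 0.50    # positive rate among people without the flu, as assumed above

posterior = (p_pos_given_flu * p_flu) / (
    p_pos_given_flu * p_flu + p_pos_given_no_flu * (1 - p_flu)
)
print(f"{posterior:.4f}")    # -> 0.0594, i.e. about a 5.94% chance
```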

Explain the difference between L1 and L2 regularization.

L2 regularization tends to spread error among all the terms, while L1 is more binary/sparse, with many weights driven to exactly zero. L1 corresponds to placing a Laplace prior on the coefficients, while L2 corresponds to a Gaussian prior.
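The sparsity difference is easy to see by comparing Lasso (L1) and Ridge (L2) coefficients in scikit-learn; the dataset and alpha values are placeholders:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge

X, y = load_diabetes(return_X_y=True)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1: tends to drive some coefficients exactly to zero
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks coefficients but keeps them all non-zero

print("zero coefficients (L1):", np.sum(lasso.coef_ == 0))
print("zero coefficients (L2):", np.sum(ridge.coef_ == 0))
```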

What’s the difference between Type I and Type II error?

Don’t think that this is a trick question! Many machine learning interviews will lob basic questions at you just to make sure you’re on top of your game and have covered all of your bases.

Type I error is a false positive, while Type II error is a false negative. Briefly stated, Type I error means claiming something has happened when it hasn’t, while Type II error means that you claim nothing is happening when in fact something is.

What’s a Fourier transform?

A Fourier transform is a generic method to decompose generic functions into a superposition of symmetric functions. Or as this more intuitive tutorial puts it, given a smoothie, it’s how we find the recipe. The Fourier transform finds the set of cycle speeds, amplitudes, and phases to match any time signal. A Fourier transform converts a signal from time to frequency domain—it’s a very common way to extract features from audio signals or other time series such as sensor data.
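A short NumPy example: a signal made of two sine waves is decomposed, and the two frequencies show up as the largest peaks in the spectrum (the frequencies and sampling rate are arbitrary):

```python
import numpy as np

fs = 500                                   # sampling rate in Hz
t = np.arange(0, 1, 1 / fs)                # 1 second of samples
signal = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

spectrum = np.abs(np.fft.rfft(signal))     # magnitude of each frequency component
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)

# The two largest peaks should sit at (roughly) 50 Hz and 120 Hz
peaks = freqs[np.argsort(spectrum)[-2:]]
print(sorted(peaks))
```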

What is Marginalization? Explain the process.

Marginalization is summing the joint probability distribution of a random variable X with other variables over those other variables, in order to obtain the probability of X alone. It is an application of the law of total probability.

P(X = x) = ∑y P(X = x, Y = y)

Given the joint probability P(X = x, Y = y), we can use marginalization to find P(X = x). In other words, we find the distribution of one random variable by exhausting the cases of the other random variables.
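A tiny numeric example in NumPy: given a joint table P(X, Y), marginalize out Y by summing across it (the probabilities are made up):

```python
import numpy as np

# Joint distribution P(X, Y): rows index X, columns index Y (all entries sum to 1)
joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])

p_x = joint.sum(axis=1)   # marginalize out Y: P(X=x) = sum over y of P(X=x, Y=y)
print(p_x)                # -> [0.3 0.7]
```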

What’s the difference between a generative and discriminative model?

A generative model will learn categories of data while a discriminative model will simply learn the distinction between different categories of data. Discriminative models will generally outperform generative models on classification tasks.

What’s the F1 score? How would you use it?

The F1 score is a measure of a model’s performance. It is the harmonic mean of a model’s precision and recall, F1 = 2 × (precision × recall) / (precision + recall), with results tending toward 1 being the best and those tending toward 0 being the worst. You would use it in classification tests where true negatives don’t matter much.

Explain the phrase “Curse of Dimensionality”.

The Curse of Dimensionality refers to the situation when your data has too many features.

You can use the phrase to express the difficulty of using brute force or grid search to optimize a function with too many inputs.

It can also refer to several other issues like:

  • If we have more features than observations, we have a risk of overfitting the model.
  • When we have too many features, observations become harder to cluster. Too many dimensions cause every observation in the dataset to appear equidistant from all others and no meaningful clusters can be formed.

Dimensionality reduction techniques like PCA come to the rescue in such cases.
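A minimal PCA sketch with scikit-learn, projecting a higher-dimensional dataset down to a handful of components (the dataset and component count are placeholders):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)        # 64 features per sample

pca = PCA(n_components=10).fit(X)          # keep only 10 directions of maximum variance
X_reduced = pca.transform(X)
print(X.shape, "->", X_reduced.shape)
print(f"variance retained: {pca.explained_variance_ratio_.sum():.2%}")
```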

How do we check the normality of a data set or a feature?

Visually, we can check normality using plots such as histograms or Q-Q plots. There is also a list of formal normality tests (a few of them are sketched in code after the list):

  • Shapiro-Wilk W Test
  • Anderson-Darling Test
  • Martinez-Iglewicz Test
  • Kolmogorov-Smirnov Test
  • D’Agostino Skewness Test
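Several of these tests are available in SciPy (the Martinez-Iglewicz test, to my knowledge, is not); a quick sketch on synthetic, genuinely normal data:

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
data = rng.normal(loc=0, scale=1, size=500)        # synthetic, normally distributed data

stat, p = stats.shapiro(data)                      # Shapiro-Wilk W test
print("Shapiro-Wilk p-value:", p)                  # p > 0.05 -> no evidence against normality

stat, p = stats.kstest(data, "norm")               # Kolmogorov-Smirnov test against N(0, 1)
print("Kolmogorov-Smirnov p-value:", p)

stat, p = stats.normaltest(data)                   # D'Agostino's test (skewness + kurtosis)
print("D'Agostino p-value:", p)

result = stats.anderson(data, dist="norm")         # Anderson-Darling test
print("Anderson-Darling statistic:", result.statistic)
```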