Big-Data Interview Question And Answers


The term Big Data may sound like a misnomer, as if the data we handled earlier were small or less in size. However, it is not only size that defines Big Data.

Big Data refers to data sets that are too large or complex to be handled by traditional data-processing software and that must be analyzed computationally to reveal patterns and trends.

Big Data analysis includes capturing, storing, and analyzing datasets to derive valuable insights from them. There are 7 characteristics of Big Data. Let us look at each one of them.

Characteristics of Big-Data

  • Velocity: It refers to the speed at which data is collected from different sources, stored, and then retrieved. You can also say that Big Data velocity is the speed at which your data moves across different systems.
  • Volume: It refers to the size of the data that you want to manage and analyze across different systems. Generally, it is measured in terabytes or more.
  • Variety: It refers to the diversity and range of data types that you collect from different sources. The data can be structured, unstructured, or semi-structured.
  • Veracity: It refers to the quality of data that you collect. In other words, you can say that it determines the accuracy of the dataset collected for insight discovery and pattern recognition. 
  • Value: It refers to the value that Big Data can provide, and it relates directly to what organizations can do with that collected data.
  • Volatility: It defines how long the knowledge gathered is valid (up-to-date) to use. Hence it determines how long we can use the data collected. 
  • Validity: It determines whether the collected dataset is correct and applicable for the intended use.

Facts about Big Data

Before getting into Big-Data interview questions and answers, here are some facts: 

  • Without sufficient investment/ expertise in Big Data Veracity, the volume and variety of Big Data can produce costs and risks that exceed an organization’s capacity to capture value from it.
  • Big Data analysis relies on predictive analytics, user behaviour analytics, and other advanced data analytics methods to extract value from the data.
  • Engineers use Big Data platforms for analysis and to make accurate data-driven decisions for different reasons. For example, analyzing Big Data sets can reveal new correlations that help prevent diseases, combat crime, spot business trends, and more.
  • The size and number of available data sets increase as data is collected from IoT (Internet of Things) devices, mobiles, and more, which may require massively parallel software running on tens, hundreds, or even thousands of servers.

1 . Tell us about Big-Data in your own words.

Big Data is a collection of huge amounts of data that you cannot handle, store, or analyze using conventional data processing techniques due to its scale and exponential growth.

 2 . Explain in detail the 3 different types of big data.

STRUCTURED DATA: It implies that the information can be processed, stored, and retrieved in a predetermined format. Contact numbers, social security numbers, ZIP codes, employee records, and wages, among other things, are examples of it.

UNSTRUCTURED DATA: This is data that does not have a particular structure or type. Audio, video, social media posts, digital surveillance data, satellite data, and other forms of unstructured data are the most common types.

SEMI-STRUCTURED DATA: This is data that does not conform to a rigid, predefined schema but still carries tags or markers that give it some structure. It sits between the structured and unstructured formats; JSON, XML, and CSV files are common examples.
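
For instance, a record such as the following hypothetical employee entry is semi-structured: every field is labelled, but the overall schema is not rigidly fixed.

{"id": 101, "name": "Asha", "skills": ["Hadoop", "Hive"]}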

3 . What is Hadoop?

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power, and the ability to handle practically unlimited concurrent tasks or jobs.

4 . Are Hadoop and Big-Data interconnected?

Big Data is a resource, and Hadoop is an open-source software framework that helps manage that resource and derive value from it. To extract actionable insights, Hadoop stores, processes, and analyzes complex unstructured data sets in a distributed manner. So yes, they are related, but they are not the same.

5 . Mention the important tools used in Big-Data analytics.

The important tools used in Big Data Analytics are as follows,

  • NodeXL
  • KNIME
  • Tableau
  • Solver
  • OpenRefine
  • Rattle GUI
  • QlikView

6 . Explain the 5 V’s of Big-Data?

Big Data has five ‘V’s: Value, Variety, Veracity, Velocity, and Volume.

Value: The worth of the data you collect is its value.

Variety (Data in a variety of formats): Variety describes the various types of data, such as text, audio files, images, photographs, and PDFs, among others.

Veracity (Data in Doubt): Veracity refers to the processed data’s consistency, trustworthiness, and accuracy.

Velocity (Data in motion): The speed at which you produce, process, and analyze the data is velocity.

Volume (Data at Rest): Volume refers to the amount of data, which mostly comes from social media, cell phones, vehicles, credit cards, photographs, and videos.

7 . List the various vendor-specific distributions of Hadoop.

Hadoop is available in the following vendor-specific distributions:

  • Cloudera
  • MapR
  • Amazon EMR (Elastic MapReduce)
  • Microsoft Azure HDInsight
  • IBM InfoSphere Information Server for Data Integration
  • Hortonworks

8 . Explain FSCK?

HDFS uses the FSCK command, which stands for File System Check. It checks whether files are corrupt, whether blocks are under-replicated, and whether any blocks are missing. FSCK produces a summary report that describes the overall health of the file system.
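
For example, fsck can be run on a directory to report block-level details (the path here is only an illustration):

$ hdfs fsck /user/data -files -blocks -locations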

9 . Explain in your own words about HDFS?

Hadoop Distributed File System (HDFS) is a fault-tolerant distributed file system that runs on commodity hardware. It provides distributed storage with file permissions and authentication. NameNode, DataNode, and Secondary NameNode are the three components that make up its architecture.
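
A few everyday HDFS shell commands look like the following (the paths and file names are placeholders):

$ hdfs dfs -mkdir /user/data              # create a directory in HDFS
$ hdfs dfs -put sample.log /user/data     # copy a local file into HDFS
$ hdfs dfs -ls /user/data                 # list the files stored in HDFS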

10 . Explain about YARN?

YARN stands for Yet Another Resource Negotiator and is a key component of Hadoop 2.0. It is a Hadoop resource management layer that allows various data processing engines to run and process data stored in HDFS, such as graph processing, interactive processing, stream processing, and batch processing. The two key components of YARN are ResourceManager and NodeManager.
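
Once a cluster is running, YARN can be inspected from the command line, for example:

$ yarn node -list            # show the NodeManagers registered with the ResourceManager
$ yarn application -list     # show the applications currently running on the cluster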

 11 . What do you mean by Commodity Hardware?

The basic hardware resource needed to run the Apache Hadoop system is commodity hardware. It’s a general term for low-cost devices that are typically compatible with other low-cost devices.

12 . Tell me about Logistic Regression?

Logistic regression, also known as the logit model, is a technique for predicting a binary outcome from a linear combination of predictor variables.
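
In its simplest form, the model passes a linear combination of the predictor variables through the logistic (sigmoid) function to produce a probability:

p(y = 1 | x) = 1 / (1 + e^-(b0 + b1*x1 + ... + bn*xn))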

 13 . Explain Distributed Cache?

The Hadoop MapReduce framework’s Distributed Cache is a dedicated service used to cache files whenever applications require them. Read-only text files, directories, and jar files are typical examples. These files can be cached, and then accessed and read later on each data node where map/reduce tasks are running.

 14 . In how many modes can Hadoop run?

Hadoop can run in three different modes; a minimal startup sketch for the pseudo-distributed case follows the list:

  • Standalone mode
  • Pseudo Distributed mode (Single node cluster)
  • Fully distributed mode (Multiple node cluster)
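
As a rough sketch, bringing up a pseudo-distributed (single-node) cluster typically looks like this, assuming HDFS and YARN have already been configured in the core-site.xml and yarn-site.xml files:

$ hdfs namenode -format    # format the NameNode once, before first use
$ start-dfs.sh             # start the NameNode, DataNode, and Secondary NameNode
$ start-yarn.sh            # start the ResourceManager and NodeManager
$ jps                      # verify that the daemons are running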

15 . What are the most common data management tools for Hadoop Edge Nodes?

The following are the most popular data management tools used with Hadoop edge nodes:

  • Oozie
  • Ambari
  • Pig
  • Flume

16 . When several clients attempt to write to the same HDFS file, what happens?

Multiple users cannot write to the same HDFS file at the same time. Since the HDFS NameNode supports exclusive write, input from the second user will be rejected while the first user is accessing the file.

17 . Explain Block in HDFS?

When a file is stored in HDFS, it is broken down into a series of blocks, and HDFS has no idea of what is inside the file. The default block size in Hadoop is 128 MB, and this value can be customized for each file.
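
One common way to request a different block size for a single file is to pass it at upload time; the value below is 256 MB expressed in bytes, and the file and path names are placeholders:

$ hdfs dfs -D dfs.blocksize=268435456 -put large_input.csv /user/data/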

18 . Tell us about Collaborative Filtering?

Collaborative filtering is a collection of technologies that predict which products a specific user would like based on the preferences of a group of people. It’s simply a technical term for asking people for their opinions.

19 . Explain the ‘jps’ command functions?

We can use the ‘jps’ command to check whether Hadoop daemons such as the NameNode, DataNode, ResourceManager, and NodeManager are running on the machine.
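
An illustrative run on a healthy single-node cluster might look like this; the process IDs will differ on your machine:

$ jps
2145 NameNode
2290 DataNode
2467 SecondaryNameNode
2650 ResourceManager
2781 NodeManager
3012 Jps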

20 . List the various Hadoop and YARN daemons.

Hadoop daemons are NameNode, DataNode, and Secondary NameNode.

YARN Daemons are ResourceManager, NodeManager, and JobHistoryServer.

21 .  Define Checkpoints?

In HDFS, a checkpoint is important for maintaining file system metadata. A checkpoint is created by merging the fsimage file with the edit log; the latest version of the fsimage is called the checkpoint.
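
An administrator can also force a checkpoint manually; the NameNode must be in safe mode first:

$ hdfs dfsadmin -safemode enter
$ hdfs dfsadmin -saveNamespace    # merge the edit log into a new fsimage
$ hdfs dfsadmin -safemode leave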

22 . How is Big-Data used in business?

Big Data allows businesses to gain a greater understanding of their customers by letting them draw conclusions from vast data sets accumulated over time. It helps them make better decisions.

23 . Why Hadoop in Big-Data?

We need a system to process Big-Data. Hadoop is a free and open-source platform developed by the Apache Software Foundation. When it comes to processing large amounts of data, Hadoop is a must-have.

24 . List the primary steps to take while dealing with big-data?

Start working with Big Data by following the basic steps below; a minimal command-line sketch follows the list.

  • Data Ingestion
  • Data Storage and
  • Data Processing
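
As a minimal sketch of these three steps on a single file (sample.log, /data/raw, /data/out, and my-analytics.jar are hypothetical names used only for illustration):

$ hdfs dfs -put sample.log /data/raw/                            # data ingestion: copy raw data into HDFS
$ hdfs dfs -ls /data/raw/                                        # data storage: verify the file now lives in HDFS
$ hadoop jar my-analytics.jar AnalyticsJob /data/raw /data/out   # data processing: run an analysis job over it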

25 . Tell us about Fault Tolerance in Hadoop.

Hadoop’s data is highly available. Since each piece of data is replicated three times by default, there is very little to no risk of data loss. As a result, Hadoop is regarded as a fault-tolerant system.
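
The replication factor of a file can be checked or changed from the command line, for example (the path is a placeholder):

$ hdfs dfs -setrep -w 3 /user/data/sample.log      # set a replication factor of 3 and wait for it to take effect
$ hdfs fsck /user/data/sample.log -files -blocks   # report the replication status of each block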

26 . Which configuration of hardware is best for Hadoop jobs?

For running Hadoop operations, dual-processor or dual-core machines with 4 GB to 8 GB of RAM and ECC memory are suitable. The hardware configuration, however, varies depending on the project’s workflow and process flow and must be customized accordingly.

27 . When a NameNode goes down, how do you get it back up?

To get the Hadoop cluster up and running, perform the following steps:

  • To start a new NameNode, use the fsimage, which is a file system metadata replica.
  • Configure the DataNodes as well as the clients to recognize the newly launched NameNode.
  • The client will be served once the new NameNode has finished loading the last checkpoint FsImage and obtained enough block reports from the DataNodes.

The NameNode recovery process takes a long time in large Hadoop clusters, and it becomes a more significant challenge during routine maintenance.
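
Assuming the checkpoint directory configured via dfs.namenode.checkpoint.dir holds a recent image, one way to carry out the first step is to start the new NameNode from that checkpoint:

$ hdfs namenode -importCheckpoint    # load the fsimage from the configured checkpoint directory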

28 . Explain RackAwareness in Hadoop?

It is the algorithm by which the NameNode decides where blocks and their replicas are placed. Based on rack definitions, network traffic between DataNodes within the same rack is minimised. For example, if the replication factor is 3, two copies will be placed on one rack and the third copy on a different rack.
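
Rack assignments come from a configurable topology script, and the mapping the NameNode currently uses can be inspected with:

$ hdfs dfsadmin -printTopology    # list each DataNode under the rack the NameNode has assigned to it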

29 . What are the commands for starting and shutting down the Hadoop daemons?

To start all daemons, enter

./sbin/start-all.sh

To stop all daemons, enter

./sbin/stop-all.sh
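
Note that in recent Hadoop releases start-all.sh and stop-all.sh are deprecated; the recommended practice is to start and stop HDFS and YARN separately:

./sbin/start-dfs.sh && ./sbin/start-yarn.sh
./sbin/stop-yarn.sh && ./sbin/stop-dfs.sh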

 30 . Name the reducer’s core methods.

A reducer’s three main methods are setup(), reduce(), and cleanup().

 31 . What are Hadoop’s real-time applications?

Hadoop has real-time implementations in the following areas:

  • Management of information.
  • Financial services.
  • Cybersecurity and protection.
  • Managing social media posts.

 32 . In Hadoop architecture, what are JT, TT, and Secondary name nodes?

JT – JobTracker: the master daemon that assigns jobs to the TaskTrackers.

TT – TaskTracker: the slave daemon that performs the tasks the JobTracker assigns to it.

The Secondary NameNode stores a copy of the NameNode’s metadata information.

The Secondary NameNode refreshes this information at a configurable checkpoint interval (by default, every hour).

33 . Tell us about Hive?

Hive is a data-querying and data-processing platform built on top of Hadoop. Hive started as a Facebook project and was later donated to the Apache Software Foundation.

Hive is primarily used to store and query structured data.

34 . What will happen after you create a table in Hive?

All the table metadata will be stored in the metastore database, and a default directory with the table name will be created under /user/hive/warehouse.
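
As a small illustration (the table name and columns are hypothetical, and the warehouse path assumes the default configuration):

$ hive -e "CREATE TABLE employees (id INT, name STRING);"
$ hdfs dfs -ls /user/hive/warehouse/employees    # the default warehouse directory created for the table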

35 . Name two methods for detecting outliers.

Extreme Value Analysis: Extreme value analysis examines the statistical tails of the data distribution; statistical approaches such as Z-scores on univariate data are good examples.

Probabilistic and statistical models: These determine the unlikely cases by fitting a probabilistic model to the data.
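
For instance, a simple univariate extreme value check scores each point as

z = (x − mean) / standard deviation

and flags points whose |z| exceeds a chosen threshold (3 is a common choice) as outliers.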

36 . Explain the 2 types of Tables in Hive?

Managed/internal table: When a table is deleted, both the metadata and the actual data are removed.

External table: Only the metadata, not the actual data, is removed when a table is deleted.
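
A sketch of the difference in HiveQL, run through the Hive CLI (the table names and the /data/logs path are only placeholders):

$ hive -e "CREATE TABLE managed_logs (line STRING);"
$ hive -e "CREATE EXTERNAL TABLE external_logs (line STRING) LOCATION '/data/logs';"
$ hive -e "DROP TABLE external_logs;"    # removes only the metadata; the files under /data/logs remain in HDFS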

37 . How is big-data analysis helpful in increasing business revenue?

Big data analysis has become very important for businesses. It helps businesses differentiate themselves from competitors and increase revenue. Through predictive analytics, big data analytics provides businesses with customized recommendations and suggestions. It also enables businesses to launch new products based on customer needs and preferences. These factors help businesses earn more revenue, and thus many companies are adopting big data analytics. Companies may see a significant increase of 5-20% in revenue by implementing big data analytics. Some popular companies that use big data analytics to increase their revenue are Walmart, LinkedIn, Facebook, Twitter, Bank of America, etc.

38 . Define respective components of HDFS and YARN

The two main components of HDFS are-

  • NameNode – This is the master node that maintains the metadata information for the data blocks stored within HDFS
  • DataNode/Slave node – This is the slave node that stores the actual data, so it can be processed and used as directed by the NameNode

In addition to serving the client requests, the NameNode executes either of two following roles –

  • CheckpointNode – It runs on a different host from the NameNode
  • BackupNode – It is a read-only NameNode that contains file system metadata information, excluding the block locations

The two main components of YARN are–

  • ResourceManager – This component receives processing requests and allocates them to the respective NodeManagers according to the processing needs.
  • NodeManager – It executes tasks on each DataNode

39 . What are the main differences between NAS (Network-attached storage) and HDFS?

The main differences between NAS (Network-attached storage) and HDFS –

  • HDFS runs on a cluster of machines, while NAS runs on an individual dedicated device. Because HDFS replicates data blocks across the machines in the cluster, data redundancy is built into it. NAS follows a different protocol and does not replicate data in this way, so the chances of data redundancy are much lower.
  • In HDFS, data is stored as data blocks on the local drives of the cluster machines. In NAS, data is stored on dedicated hardware.

40 . What is the Command to format the NameNode?

 $ hdfs namenode -format

 41 . Will you optimize algorithms or code to make them run faster?

The answer to this question should always be “Yes.” Real-world performance matters, and it does not depend on the data or model you are using in your project.

The interviewer might also be interested to know if you have any previous experience in code or algorithm optimization. For a beginner, it obviously depends on which projects they worked on in the past. Experienced candidates can share their experience accordingly. However, be honest about your work; it is fine if you haven’t optimized code in the past. Just let the interviewer know your real experience, and you will be able to crack the big data interview.

42 . How would you transform unstructured data into structured data?

Unstructured data is very common in big data. The unstructured data should be transformed into structured data to ensure proper data analysis. You can start answering the question by briefly differentiating between the two forms. Once that is done, discuss the methods you use to transform one form into another. You might also share a real-world situation where you did this. If you have recently graduated, you can share information related to your academic projects.

By answering this question correctly, you are signaling that you understand the types of data, both structured and unstructured, and that you have the practical experience to work with them. If you answer this question specifically and confidently, you will definitely be able to crack the big data interview.

43 . What happens when two users try to access the same file in the HDFS?

HDFS NameNode supports exclusive write only. Hence, only the first user is granted access to write to the file, and the second user’s request is rejected.

44 . How to recover a NameNode when it is down?

The following steps need to be executed to get the Hadoop cluster up and running:

  1. Use the FsImage which is file system metadata replica to start a new NameNode. 
  2. Configure the DataNodes and also the clients to make them acknowledge the newly started NameNode.
  3. Once the new NameNode completes loading the last checkpoint FsImage and has received enough block reports from the DataNodes, it will start serving clients.

In large Hadoop clusters, the NameNode recovery process consumes a lot of time, which turns out to be an even more significant challenge during routine maintenance.

45 . What do you understand by Rack Awareness in Hadoop?

It is an algorithm applied to the NameNode to decide how blocks and their replicas are placed. Depending on rack definitions, network traffic between DataNodes within the same rack is minimised. For example, if we consider a replication factor of 3, two copies will be placed on one rack, whereas the third copy will be placed on a separate rack.

46 . What is a Distributed Cache? What are its benefits?

No Big Data interview questions and answers guide would be complete without this question. Distributed cache in Hadoop is a service provided by the MapReduce framework for caching files. If a file is cached for a specific job, Hadoop makes it available on the individual DataNodes (in memory and on the local file system) where the map and reduce tasks are executing. This allows you to quickly access and read the cached files to populate any collection (like arrays, hashmaps, etc.) in your code.

Distributed cache offers the following benefits; a command-line usage sketch follows the list:

  • It distributes simple, read-only text/data files and other complex types like jars, archives, etc. 
  • It tracks the modification timestamps of cache files which highlight the files that should not be modified until a job is executed successfully.
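
As a usage sketch, files and jars can be placed in the distributed cache from the command line through Hadoop’s generic options, assuming the driver class uses ToolRunner so those options are parsed (the jar, class, and file names here are hypothetical):

$ hadoop jar my-job.jar MyDriver -files lookup.txt -libjars extra-lib.jar /input /output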

47 . Name the configuration parameters of a MapReduce framework.

The configuration parameters in the MapReduce framework include:

  • The input format of data.
  • The output format of data.
  • The input location of jobs in the distributed file system.
  • The output location of jobs in the distributed file system.
  • The class containing the map function
  • The class containing the reduce function
  • The JAR file containing the mapper, reducer, and driver classes.

48 . Explain MapReduce and write its syntax to run a MapReduce program?

MapReduce is a Hadoop programming model for processing large data sets in parallel across a cluster of computers, with the data typically stored in HDFS. It is essentially a blueprint for parallel programming.

hadoop jar <jar_file_name>.jar <class_name> /input_path /output_path
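
For example, the word count job bundled with Hadoop can be run as follows (the exact path and version of the examples jar depend on your installation):

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input_path /output_path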

49 . Tell us about Sequencefileinputformat?

Hadoop makes use of a file format known as a sequence file. Data is stored in a sequence file as serialized key-value pairs. SequenceFileInputFormat is the input format used for reading sequence files.

50 . DFS can handle large volumes of big data. Then why do we need the Hadoop framework?

Hadoop not only stores but also processes vast amounts of data. While a DFS (Distributed File System) can also store data, it lacks the following features:

  • DFS is not fault-tolerant
  • Data moved over a network is limited by the available bandwidth, whereas Hadoop minimises such movement by bringing the computation to where the data is stored.