Hadoop Interview Questions and Answers


Apache Hadoop is a collection of open-source software utilities that efficiently stores and processes large datasets ranging in size from gigabytes to petabytes. 

The framework facilitates distributed processing of Big Data across clusters of computers using the MapReduce programming model. Clustering multiple computers helps analyze massive datasets in parallel, offering greater speed and efficiency.

Hadoop consists of 4 main modules:

Hadoop Distributed File System (HDFS) 

HDFS runs on standard or low-end hardware and provides better data throughput than traditional file systems, along with high fault tolerance and native support for large datasets.

Yet Another Resource Negotiator (YARN)

YARN manages cluster nodes and resource usage to schedule jobs and tasks.

MapReduce

MapReduce is a software framework that processes Big Data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. It lets programs perform parallel computation on data via Map tasks and Reduce tasks. 

  • The Map task takes input data and converts it into a dataset that can be computed as key-value pairs
  • The output of the Map task is consumed by Reduce tasks, which aggregate the output and produce the desired result
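
To make the two phases concrete, here is a minimal word-count sketch using the org.apache.hadoop.mapreduce API; the class names are chosen for this example only, and each class would normally live in its own source file.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map task: emit (word, 1) for every token in the input line
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, ONE);
        }
    }
}

// Reduce task: sum the counts for each word and emit (word, total)
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}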

Hadoop Common

Hadoop Common provides Java libraries that are used across all the other modules. In addition, Hadoop Ozone is an object store for the Hadoop framework.

1 . How does Hadoop work?

Hadoop splits files into large blocks and distributes them across clusters of nodes. The framework then transfers packaged code to the nodes to process the data in parallel, taking advantage of data locality, where each node works on the data it already holds. This allows the dataset to be processed quickly and efficiently.

Facts about Hadoop

Before we step into Hadoop interview questions and answers, here are some facts that may help:

  • All the modules in Hadoop are designed so that if any hardware fails (a common occurrence), the framework handles it automatically.
  • The core of Apache Hadoop consists of two parts: HDFS and MapReduce.
  • HDFS is the storage part, known as the Hadoop Distributed File System (HDFS).
  • MapReduce is the processing part, based on the MapReduce programming model.
  • The Hadoop framework implements a distributed file system with high-performance access to data across highly scalable Hadoop clusters.
  • HDFS accepts data in any format, regardless of schema. Accepting any format helps optimize high-bandwidth streaming, and HDFS has proven deployments of 100 PB and beyond.

2 . What platform and Java version you need to run Hadoop?

Java 1.6.x or a later version is suitable for Hadoop, preferably from Sun (Oracle). Linux and Windows are the supported operating systems for Hadoop, but BSD, Mac OS X, and Solaris are also known to work.

3 . What is Hadoop?

Hadoop is a distributed computing platform written in Java. It is based on concepts from the Google File System (GFS) and MapReduce.

4 . What kind of Hardware is best for Hadoop?

Hadoop can run on a dual-processor/dual-core machine with 4-8 GB of ECC RAM. The exact hardware depends on the workflow's needs.

5 . What are the most common input formats in Hadoop?

These are the most common input formats in Hadoop:

  1. TextInputFormat
  2. KeyValueInputFormat
  3. SequenceFileInputFormat

TextInputFormat is the default input format.
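
As a hedged illustration of choosing between them, a driver fragment using the newer API might look like this (the conf object is assumed to be an existing Configuration, and the second format appears as KeyValueTextInputFormat in this API):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

Job job = Job.getInstance(conf, "input-format-example");
job.setInputFormatClass(TextInputFormat.class);             // the default; shown explicitly
// job.setInputFormatClass(KeyValueTextInputFormat.class);  // lines split into key/value on a separator
// job.setInputFormatClass(SequenceFileInputFormat.class);  // binary sequence files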

6 . How do you categorize big data?

You can categorize it using the following features:

  • Volume
  • Velocity
  • Variety

7 . Explain the use of the .media class?

This class is used to float media objects from one side to the other.

8 . Give the use of the bootstrap panel.

We use panels in Bootstrap for the boxing of DOM components.

9 . What is the purpose of button groups?

Button groups are used to place more than one button in the same line.

10 . Name the various types of lists supported by Bootstrap.

  • Ordered list
  • Unordered list
  • Definition list

11 . Which command can you use to retrieve the status of daemons running in the Hadoop cluster?

The ‘jps’ command is used to retrieve the status of the daemons running in the Hadoop cluster.

12 . What is TextInputFormat?

In TextInputFormat, each line in the text file is a record. The value is the content of the line, while the key is the byte offset of the line. For instance, key: LongWritable, value: Text.
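
For example, a file whose first two lines are "apple" and "banana" produces the records (0, apple) and (6, banana): the second key is 6 because the first line occupies five bytes plus a newline.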

13 . What is the SequenceFileInputFormat in Hadoop?

In Hadoop, SequenceFileInputFormat reads sequence files. It is a specific compressed binary file format that passes data from the output of one MapReduce job to the input of another MapReduce job.

14 . What is the use of RecordReader in Hadoop?

An InputSplit describes a unit of work but does not know how to access the data in it. The RecordReader class is responsible for loading the data from its source and converting it into key-value pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat.

15 . What is WebDAV in Hadoop?

WebDAV is a set of extensions to HTTP that supports editing and uploading files. On most operating systems, you can mount a WebDAV share as a filesystem, so it is also possible to access HDFS as a standard filesystem by exposing HDFS over WebDAV.

16 . What is Sqoop in Hadoop?

Sqoop is a tool for transferring data between relational database management systems (RDBMS) and Hadoop HDFS. Using Sqoop, you can import data from an RDBMS such as MySQL or Oracle into HDFS, as well as export data from HDFS back to an RDBMS.
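
As a hedged illustration (the connection string, user, table names, and HDFS paths below are placeholders):

sqoop import --connect jdbc:mysql://dbhost/sales --username etl -P --table customers --target-dir /user/hadoop/customers

sqoop export --connect jdbc:mysql://dbhost/sales --username etl -P --table customers_backup --export-dir /user/hadoop/customers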

17 . What is “map” and what is “reducer” in Hadoop?

Map: In Hadoop, a map is a phase of MapReduce query processing. A map reads data from an input location and outputs a key-value pair according to the input type.

Reducer: In Hadoop, a reducer collects the output generated by the mapper, processes it, and creates a final output of its own.

18 . How is indexing done in HDFS?

Hadoop has a distinctive way of indexing. Once the data is stored as per the block size, HDFS keeps storing the last part of the data, which points to the location of the next part of the data.

19 . What is Hadoop Streaming?

Hadoop Streaming is a utility that allows you to create and run MapReduce jobs. It is a generic API that lets programs written in virtually any language work as the Hadoop mapper and reducer.
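
A hedged example of a streaming job that uses standard Unix tools as the mapper and reducer (the jar path varies by version and distribution, and the input/output paths are placeholders):

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input /user/hadoop/input \
  -output /user/hadoop/output \
  -mapper /bin/cat \
  -reducer /usr/bin/wc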

20 . What is a combiner in Hadoop?

A Combiner is a mini-reduce process that operates only on data generated by a Mapper. When the Mapper emits data, the Combiner receives it as input and sends its output to the Reducer.
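
In code, the combiner is attached in the driver; a hedged fragment, assuming a job object and a reducer like the WordCountReducer sketch shown earlier, whose logic is commutative and associative:

job.setMapperClass(WordCountMapper.class);
job.setCombinerClass(WordCountReducer.class);   // runs on map-side output before the shuffle
job.setReducerClass(WordCountReducer.class);

Re-using the reducer as the combiner only works when the reduce logic can safely be applied to partial results.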

21 . What are Hadoop’s three configuration files?

Following are the three configuration files in Hadoop:

  • core-site.xml
  • mapred-site.xml
  • hdfs-site.xml

22 . What are the network requirements for using Hadoop?

Following are the network requirements for using Hadoop:

  • Password-less SSH connection.
  • Secure Shell (SSH) for launching server processes.

23 . What do you know about storage and compute nodes?

Storage node: The storage node is the machine or computer where the file system resides to store the data being processed.

Compute node: The compute node is the machine or computer where the actual business logic is executed.

24 . Is it necessary to know Java to learn Hadoop?

A background in any programming language, such as C, C++, PHP, Python, or Java, is very helpful. However, if you have no Java knowledge, it is necessary to learn Java and also gain basic knowledge of SQL.

25 . How to debug Hadoop code?

There are many ways to debug Hadoop code, but the most popular methods are:

  • Using counters.
  • Using the web interface provided by the Hadoop framework.

26 . Is it possible to provide multiple inputs to Hadoop? If yes, explain.

Yes, it is possible. The input format class provides methods to add multiple directories as input to a Hadoop job.

27 . What is the relation between job and task in Hadoop?

In Hadoop, a job is divided into several small parts known as tasks.

28 . What is the difference between Input-Split and HDFS Block?

An InputSplit is the logical division of the data, while an HDFS block is the physical division of the data. For example, a 130 MB file is stored as two HDFS blocks (128 MB and 2 MB), but a record that crosses the block boundary is still processed by a single mapper, because the InputSplit respects record boundaries.

29 . What is the difference between RDBMS and Hadoop?

  • RDBMS is a relational database management system; Hadoop is a node-based flat structure.
  • RDBMS is used for OLTP processing; Hadoop is used for analytical and big data processing.
  • In an RDBMS, the database cluster uses the same data files stored in shared storage; in Hadoop, data can be stored independently on each processing node.

30 . What is the difference between HDFS and NAS?

HDFS data blocks are distributed across local drives of all machines in a cluster whereas NAS data is stored on dedicated hardware.

31 . What is the difference between Hadoop and other data processing tools?

Hadoop allows you to increase or decrease the number of mappers without worrying about the volume of data to be processed.

32 . What is a distributed cache in Hadoop?

Distributed cache is a facility provided by the MapReduce framework to cache files (text, archives, etc.) needed during the execution of a job. The framework copies the necessary files to the slave nodes before any task is executed on those nodes.
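
A hedged fragment using the newer API (the HDFS path is a placeholder, and job/context are the driver's Job and the task's Context):

// In the driver, before job submission:
job.addCacheFile(new java.net.URI("/user/hadoop/lookup/countries.txt"));

// In the Mapper's or Reducer's setup() method:
java.net.URI[] cached = context.getCacheFiles();   // locations of the cached files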

33 . What commands are used to see all jobs running in the Hadoop cluster and kill a job in LINUX?

hadoop job -list

hadoop job -kill <jobID>
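
In newer Hadoop versions, the same operations are also available as mapred job -list and mapred job -kill <jobID>, and YARN applications can be listed and killed with yarn application -list and yarn application -kill <Application ID>.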

34 . What are the different vendor-specific distributions of Hadoop?

The different vendor-specific distributions of Hadoop are Cloudera, MapR, Amazon EMR, Microsoft Azure HDInsight, IBM InfoSphere, and Hortonworks (now part of Cloudera).

35 . What are the different Hadoop configuration files?

The different Hadoop configuration files include:

  • hadoop-env.sh
  • mapred-site.xml
  • core-site.xml
  • yarn-site.xml
  • hdfs-site.xml
  • Master and Slaves

36 . What are the three modes in which Hadoop can run?

The three modes in which Hadoop can run are :

  1. Standalone mode: This is the default mode. It uses the local FileSystem and a single Java process to run the Hadoop services.
  2. Pseudo-distributed mode: This uses a single-node Hadoop deployment to execute all Hadoop services.
  3. Fully-distributed mode: This uses separate nodes to run Hadoop master and slave services.

37 . What are the differences between regular FileSystem and HDFS?

  1. Regular FileSystem: Data is maintained on a single system. If the machine crashes, data recovery is difficult due to low fault tolerance. Seek time is higher, so processing the data takes more time.
  2. HDFS: Data is distributed and maintained across multiple systems. If a DataNode crashes, data can still be recovered from other nodes in the cluster. However, the time taken to read data is comparatively higher, because of local disk reads and coordination of data across multiple systems.

38 . Why is HDFS fault-tolerant?

HDFS is fault-tolerant because it replicates data on different DataNodes. By default, each block of data is replicated on three DataNodes, stored on different machines, so if one node crashes, the data can still be retrieved from the other DataNodes. 
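
The default replication factor is controlled by the dfs.replication property in hdfs-site.xml, and the replication of an existing file can be changed with, for example, hdfs dfs -setrep -w 3 /path/to/file (the path here is a placeholder).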

39 . Explain the architecture of HDFS.

For an HDFS service, we have a NameNode, which runs the master process on one of the machines, and DataNodes, which are the slave nodes.

NameNode

The NameNode is the master service that hosts metadata on disk and in RAM. It holds information about the various DataNodes, their locations, the size of each block, and so on. 

DataNode

DataNodes hold the actual data blocks and periodically send block reports to the NameNode. A DataNode stores and retrieves blocks when asked by the NameNode. It serves the client's read and write requests and performs block creation, deletion, and replication based on instructions from the NameNode.

  • Data written to HDFS is split into blocks, depending on its size. The blocks are distributed across the nodes, and with the auto-replication feature, these blocks are replicated across multiple machines with the condition that no two replicas of the same block sit on the same machine.
  • As soon as the cluster comes up, the DataNodes start sending their heartbeats to the NameNode every three seconds. The NameNode stores this information; in other words, it starts building metadata in RAM, which contains information about the DataNodes available at the start. This metadata is maintained in RAM as well as on disk.

40 . What are the two types of metadata that a NameNode server holds?

The two types of metadata that a NameNode server holds are:

  • Metadata in Disk – This contains the edit log and the FSImage
  • Metadata in RAM – This contains the information about DataNodes

41 . How can you restart NameNode and all the daemons in Hadoop?

The following commands will help you restart NameNode and all the daemons:

You can stop the NameNode with the ./sbin/hadoop-daemon.sh stop namenode command and then start it again with the ./sbin/hadoop-daemon.sh start namenode command.

You can stop all the daemons with the ./sbin/stop-all.sh command and then start them again with the ./sbin/start-all.sh command.

42 . Which command will help you find the status of blocks and FileSystem health?

To check the status of the blocks, use the command:

hdfs fsck <path> -files -blocks

To check the health status of FileSystem, use the command:

hdfs fsck / -files -blocks -locations > dfs-fsck.log

43 . What would happen if you store too many small files in a cluster on HDFS?

Storing many small files on HDFS generates a lot of metadata. Keeping this metadata in RAM becomes a challenge, as each file, block, or directory takes about 150 bytes of metadata, so the cumulative size of all the metadata grows too large.
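
As a rough illustration, 10 million small files that each occupy their own block mean roughly 20 million objects (one file object and one block object per file) for the NameNode to track; at about 150 bytes each, that is on the order of 3 GB of NameNode memory consumed by metadata alone.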

44 . How do you copy data from the local system onto HDFS?

The following command will copy data from the local file system onto HDFS:

hadoop fs -copyFromLocal [source] [destination]

Example:

hadoop fs -copyFromLocal /tmp/data.csv /user/test/data.csv

In the above syntax, the source is the local path and the destination is the HDFS path. Add the -f (force) option to overwrite a file that already exists at the destination in HDFS. 

45 . When do you use the dfsadmin -refreshNodes and rmadmin -refreshNodes commands?

These commands are used to refresh the node information while commissioning nodes, or when the decommissioning of nodes is complete.

dfsadmin -refreshNodes

This is run via the HDFS client and refreshes the node configuration for the NameNode. 

rmadmin -refreshNodes

This performs the corresponding administrative task for the ResourceManager, refreshing its node information.

46 . Who takes care of replication consistency in a Hadoop cluster and what do under/over replicated blocks mean?

In a cluster, it is always the NameNode that takes care of replication consistency. The fsck command provides information about over- and under-replicated blocks. 

Under-replicated blocks:

These are the blocks that do not meet their target replication for the files they belong to. HDFS will automatically create new replicas of under-replicated blocks until they meet the target replication.

Consider a cluster with three nodes and replication set to three. At any point, if one of the DataNodes crashes, the blocks become under-replicated: a replication factor is set, but there are not enough replicas to satisfy it. If the NameNode does not receive information about the replicas, it waits for a limited amount of time and then starts re-replicating the missing blocks from the available nodes. 

Over-replicated blocks:

These are the blocks that exceed their target replication for the files they belong to. Usually, over-replication is not a problem, and HDFS will automatically delete excess replicas.

Consider a case of three nodes running with a replication factor of three, where one of the nodes goes down due to a network failure. Within a few minutes, the NameNode re-replicates the data, and then the failed node comes back with its set of blocks. This is an over-replication situation, and the NameNode will delete a set of replicas from one of the nodes.

47 . What role do RecordReader, Combiner, and Partitioner play in a MapReduce operation?

RecordReader

This communicates with the InputSplit and converts the data into key-value pairs suitable for the mapper to read. 

Combiner

This is an optional phase; it is like a mini reducer. The combiner receives data from the map tasks, works on it, and then passes its output to the reducer phase. 

Partitioner

The partitioner controls the partitioning of the keys of the intermediate map outputs, deciding which reduce task each key-value pair is sent to; the number of partitions equals the number of reduce tasks configured for the job. It also determines how outputs from combiners are routed to the reducers.
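
A hedged sketch of a custom partitioner (the class name and routing rule are invented for the example):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Must return a partition number in the range [0, numReduceTasks)
        String k = key.toString();
        if (numReduceTasks == 0 || k.isEmpty()) {
            return 0;
        }
        return k.charAt(0) % numReduceTasks;
    }
}

// In the driver: job.setPartitionerClass(FirstLetterPartitioner.class);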

48 . Why is MapReduce slower in processing data in comparison to other processing frameworks?

This is quite a common question in Hadoop interviews; let us understand why MapReduce is slower in comparison to the other processing frameworks:

MapReduce is slower because:

  • It is batch-oriented when it comes to processing data. Here, no matter what, you would have to provide the mapper and reducer functions to work on data. 
  • During processing, whenever the mapper function delivers an output, it will be written to HDFS and the underlying disks. This data will be shuffled and sorted, and then be picked up for the reducing phase. The entire process of writing data to HDFS and retrieving it from HDFS makes MapReduce a lengthier process.
  • In addition to the above reasons, MapReduce jobs are typically written in Java, which tends to be verbose and requires many lines of code.

49 . Is it possible to change the number of mappers to be created in a MapReduce job?

By default, you cannot change the number of mappers, because it is equal to the number of input splits. However, there are different ways in which you can either set a property or customize the code to change the number of mappers.

For example, if you have a 1 GB file that is split into eight blocks of 128 MB each, only eight mappers will run on the cluster.
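
One hedged way to influence the mapper count is to cap the split size in the driver (the 64 MB figure is only an example; with it, the 1 GB file above would yield roughly 16 splits and therefore 16 mappers):

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);  // equivalent property: mapreduce.input.fileinputformat.split.maxsize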

50 . What is speculative execution in Hadoop?

If a DataNode is executing a task slowly, the master node can redundantly execute another instance of the same task on another node. The task that finishes first is accepted, and the other is killed. Speculative execution is therefore useful when you are working in an environment with intensive workloads.

For example, suppose node A is running a slower task. The scheduler keeps track of the available resources, and with speculative execution turned on, a copy of the slower task is started on node B. If node A's task is still slower, the output is accepted from node B.
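
Speculative execution is controlled per phase by the mapreduce.map.speculative and mapreduce.reduce.speculative properties, which default to true and can be disabled for workloads where duplicate attempts waste resources.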

51 . How is identity mapper different from chain mapper?

Identity Mapper:

  • This is the default mapper, chosen when no mapper is specified in the MapReduce driver class.
  • It implements the identity function, writing all its input key-value pairs directly to the output.
  • It is defined in the old MapReduce API (MR1), in the org.apache.hadoop.mapred.lib package.

Chain Mapper:

  • This class is used to run multiple mappers in a single map task.
  • The output of the first mapper becomes the input to the second mapper, the second to the third, and so on.
  • It is defined in the org.apache.hadoop.mapreduce.lib.chain package (ChainMapper class).

52 . What are the major configuration parameters required in a MapReduce program?

We need to have the following configuration parameters:

  • Input location of the job in HDFS
  • Output location of the job in HDFS
  • Input and output formats
  • Classes containing a map and reduce functions
  • JAR file for mapper, reducer and driver classes 

53 . What is the role of the OutputCommitter class in a MapReduce job?

As the name indicates, OutputCommitter describes the commit of task output for a MapReduce job.

Example:

org.apache.hadoop.mapred.OutputCommitter

public abstract class OutputCommitter
    extends org.apache.hadoop.mapreduce.OutputCommitter

MapReduce relies on the OutputCommitter for the following:

  • Setting up the job during initialization
  • Cleaning up the job after completion
  • Setting up the task's temporary output
  • Checking whether a task needs a commit
  • Committing the task output
  • Discarding the task commit

54 . Explain the process of spilling in MapReduce.

Spilling is a process of copying the data from memory buffer to disk when the buffer usage reaches a specific threshold size. This happens when there is not enough memory to fit all of the mapper output. By default, a background thread starts spilling the content from memory to disk after 80 percent of the buffer size is filled. 

For a 100 MB size buffer, the spilling will start after the content of the buffer reaches a size of 80 MB. 
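
Both values are configurable: mapreduce.task.io.sort.mb sets the size of the in-memory sort buffer (100 MB by default) and mapreduce.map.sort.spill.percent sets the spill threshold (0.80 by default).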

55 . How can you set the mappers and reducers for a MapReduce job?

The number of mappers and reducers can be set in the command line using:

-D mapred.map.tasks=5 -D mapred.reduce.tasks=2

In the code, one can configure the JobConf variables:

job.setNumMapTasks(5); // 5 mappers
job.setNumReduceTasks(2); // 2 reducers
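
Note that in the newer API (org.apache.hadoop.mapreduce.Job) only the reducer count is set directly, for example job.setNumReduceTasks(2); the number of map tasks follows from the number of input splits, as discussed in question 49.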

56 . What happens when a node running a map task fails before sending the output to the reducer?

If this ever happens, the map task is assigned to a new node, and the entire task is rerun to re-create the map output. In Hadoop v2, the YARN framework has a temporary daemon called the application master, which takes care of the execution of the application. If a task on a particular node fails due to the unavailability of that node, it is the role of the application master to have the task scheduled on another node.

57 . What benefits did YARN bring in Hadoop 2.0 and how did it solve the issues of MapReduce v1?

In Hadoop v1, MapReduce performed both data processing and resource management; there was only one master process for the processing layer, known as the JobTracker. The JobTracker was responsible for resource tracking and job scheduling. 

Managing jobs with a single JobTracker and the resulting utilization of computational resources was inefficient in MapReduce v1. As a result, the JobTracker was overburdened by handling job scheduling and resource management. Some of the resulting issues were scalability, availability, and poor resource utilization. In addition, non-MapReduce jobs could not run in v1.

To overcome these issues, Hadoop 2 introduced YARN as the processing layer. In YARN, there is a processing master called the ResourceManager. In Hadoop v2, the ResourceManager runs in high-availability mode. There are NodeManagers running on multiple machines, and a temporary daemon called the application master. Here, the ResourceManager only handles client connections and takes care of tracking resources. 

In Hadoop v2, the following features are available:

  • Scalability – You can have a cluster size of more than 10,000 nodes and you can run more than 100,000 concurrent tasks. 
  • Compatibility – The applications developed for Hadoop v1 run on YARN without any disruption or availability issues.
  • Resource utilization – YARN allows the dynamic allocation of cluster resources to improve resource utilization.
  • Multitenancy – YARN can use open-source and proprietary data access engines, as well as perform real-time analysis and run ad-hoc queries.