
What is Hadoop and its Components

Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for the distributed storage and processing of large data sets across clusters of computers using simple programming models, most notably MapReduce. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Components of Hadoop

Hadoop Common is a pre-defined set of utilities and libraries that can be used by the other modules within the Hadoop ecosystem. For example, when HBase and Hive want to access HDFS, they make use of the Java archives (JAR files) stored in Hadoop Common. It is an essential module of the Apache Hadoop framework, along with the Hadoop Distributed File System (HDFS), YARN and MapReduce. Like all the other modules, it assumes that hardware failures are common and that these should be handled automatically in software by the framework.
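For a rough idea of how these shared libraries are used in practice, here is a minimal sketch (assuming the hadoop-common JAR is on the classpath; the file-system address is a placeholder) showing Hadoop Common classes such as Configuration and Path being reused by client code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    public class CommonExample {
        public static void main(String[] args) {
            // Configuration lives in hadoop-common and is shared by HDFS, YARN and MapReduce.
            Configuration conf = new Configuration();

            // Placeholder NameNode address, used here for illustration only.
            conf.set("fs.defaultFS", "hdfs://namenode:9000");

            // Path is another utility class provided by Hadoop Common.
            Path input = new Path("/data/input");

            System.out.println("Default FS: " + conf.get("fs.defaultFS"));
            System.out.println("Input path: " + input);
        }
    }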

Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. It employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable clusters. When a file is moved into HDFS, it is automatically split into smaller blocks. These blocks are then replicated on three different servers by default, so that a copy is still available if a node fails. This replication factor is not hard-set and can be configured as per requirements (see the sketch after the use case below).

HDFS Use Case
Nokia deals with more than 500 terabytes of unstructured data and close to 100 terabytes of structured data. Nokia uses HDFS for storing all the structured and unstructured data sets as it allows processing of the stored data at a petabyte scale.
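To make the replication setting concrete, here is a minimal sketch that writes a file through the HDFS FileSystem API with an explicit replication factor; it assumes a reachable NameNode at a placeholder address, and the path, buffer size and block size are illustrative values:

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder NameNode address for illustration only.
            conf.set("fs.defaultFS", "hdfs://namenode:9000");

            FileSystem fs = FileSystem.get(conf);

            // create(path, overwrite, bufferSize, replication, blockSize) overrides
            // the cluster-wide default replication (dfs.replication, normally 3)
            // for this one file.
            Path file = new Path("/data/example.txt");
            short replication = 2;
            long blockSize = 128L * 1024 * 1024; // 128 MB, a common default block size

            try (FSDataOutputStream out =
                     fs.create(file, true, 4096, replication, blockSize)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            System.out.println("Replication of " + file + ": "
                    + fs.getFileStatus(file).getReplication());
            fs.close();
        }
    }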

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of a map procedure, which performs filtering and sorting, and a reduce procedure, which performs a summary operation. It analyzes large data sets in parallel and then reduces the intermediate results to produce the final output. Within the Hadoop ecosystem, MapReduce is a framework built on the YARN architecture: YARN supplies the resources for parallel processing of huge data sets, while MapReduce provides the framework for easily writing applications that run on thousands of nodes, with fault and failure handling built in. A minimal word-count sketch of the model follows the use case below.

MapReduce Use Case
Skybox has developed an economical imaging satellite system for capturing videos and images from any location on Earth. Skybox uses Hadoop to analyze the large volumes of image data downloaded from its satellites. The image-processing algorithms of Skybox are written in C++, and Busboy, a proprietary Skybox framework, makes use of built-in code from the Java-based MapReduce framework.
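As a minimal sketch of how the map and reduce phases fit together, the classic word-count job below uses the org.apache.hadoop.mapreduce API; the input and output paths are taken from the command line and are placeholders:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: tokenize each input line into (word, 1) pairs.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: sum up the counts emitted for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a JAR, a job like this would typically be submitted with something like hadoop jar wordcount.jar WordCount /data/input /data/output, and YARN then schedules its map and reduce tasks across the cluster.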

YARN (Yet Another Resource Negotiator) is the resource management and job scheduling component of the open-source Hadoop platform, maintained by the non-profit Apache Software Foundation. It sits between HDFS and processing frameworks such as MapReduce, allocating cluster resources (CPU and memory) to applications and scheduling their tasks across the nodes. The other major components of Hadoop are Hadoop Common (the central library system), HDFS (the file handling system) and MapReduce (the batch data-processing framework). A small sketch of querying a YARN cluster programmatically follows the use case below.

YARN Use Case
Yahoo has close to 40,000 nodes running Apache Hadoop, processing about 500,000 MapReduce jobs per day and consuming roughly 230 compute-years of processing every day. YARN helped Yahoo increase the load on its most heavily used Hadoop cluster from 80,000 to 125,000 jobs a day, an increase of close to 50%.
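For a rough idea of how an application talks to YARN, here is a minimal sketch that uses the YarnClient API to list the running NodeManagers and applications; it assumes a yarn-site.xml pointing at the ResourceManager is available on the classpath:

    import java.util.EnumSet;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.NodeState;
    import org.apache.hadoop.yarn.api.records.YarnApplicationState;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class YarnClusterReport {
        public static void main(String[] args) throws Exception {
            // Picks up the ResourceManager address from yarn-site.xml on the classpath.
            Configuration conf = new YarnConfiguration();

            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(conf);
            yarnClient.start();

            // Nodes currently offering compute and memory to the cluster.
            List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
            System.out.println("Running NodeManagers: " + nodes.size());

            // Applications (MapReduce jobs and others) currently running on YARN.
            List<ApplicationReport> apps =
                    yarnClient.getApplications(EnumSet.of(YarnApplicationState.RUNNING));
            for (ApplicationReport app : apps) {
                System.out.println(app.getApplicationId() + " -> " + app.getName());
            }

            yarnClient.stop();
        }
    }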

Key Benefits of the YARN Component

  • Improved cluster utilization.
  • High scalability.
  • Support for programming models beyond Java.
  • Support for novel programming models and services.
  • Greater agility.