Big Data: Hadoop Distributed File System (Part 2)
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. It is a distributed file system that provides access to data across Hadoop clusters, and it manages and supports the analysis of very large volumes of Big Data.
Characteristics of HDFS
1. Fault-Tolerant - HDFS is highly fault-tolerant. HDFS divides data into blocks, creates multiple copies of each block (three by default), and distributes them across the cluster. When any machine in the cluster goes down, a client can still access its data from a different machine that holds a copy of the affected blocks.
2. High Availability - HDFS is a highly available file system. HDFS replicates the data present among the nodes in the Hadoop cluster by creating replicas of the blocks on other slave machines in the system. When a node fails, a user can still access the data from different nodes, since duplicate copies of the blocks are present on other nodes in the HDFS cluster.
3. Scalability - HDFS stores data on multiple nodes in the cluster, and the cluster can be scaled when required. There are two scaling mechanisms: vertical and horizontal scaling. Horizontal scaling is preferred over vertical scaling because the cluster can grow from tens of nodes to hundreds of nodes without any downtime.
HDFS follows a master-slave architecture.
The main components of HDFS are-
1. NameNode - The NameNode server is the main component of the HDFS cluster. It maintains and executes file system namespace operations such as opening, closing, and renaming of files and directories that are present in HDFS.
The Namenode maintains two files:
- A transaction log called an Edit Log
- A namespace image called FsImage.
2. Secondary NameNode - Despite its name, the Secondary NameNode is not a standby or backup NameNode. Its role is checkpointing: it periodically fetches the EditLog and FsImage from the NameNode, merges the logged edits into a new FsImage, and returns the result, keeping the NameNode's metadata files compact.
3. File System - HDFS exposes a file system namespace and allows user data to be stored in files.
4. MetaData - HDFS metadata is the structure of HDFS directories and files in a tree. It includes attributes of directories and files, such as ownership, permissions, quotas, and replication factor.
5. DataNode - Datanode is responsible for storing the actual data in HDFS. It also retrieves the blocks when asked by clients or the NameNode.
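The checkpoint performed by the Secondary NameNode can be sketched in a few lines of Python. This is a toy model, not Hadoop code: the FsImage is modeled as a dict snapshot of the namespace, the EditLog as an ordered list of mutations, and all operation names are invented for illustration.

```python
# Toy model of NameNode checkpointing (hypothetical names, not Hadoop code).
# FsImage: snapshot of the namespace; EditLog: ordered list of mutations.

def apply_edit(namespace, edit):
    """Replay a single EditLog entry against the in-memory namespace."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = {"replication": edit[2]}
    elif op == "rename":
        namespace[edit[2]] = namespace.pop(path)
    elif op == "delete":
        namespace.pop(path, None)
    return namespace

def checkpoint(fsimage, edit_log):
    """What the Secondary NameNode does: merge the EditLog into the FsImage."""
    namespace = dict(fsimage)          # start from the last snapshot
    for edit in edit_log:
        apply_edit(namespace, edit)
    return namespace                   # new FsImage; the EditLog can now be truncated

fsimage = {"/data/a.txt": {"replication": 3}}
edits = [("create", "/data/b.txt", 3), ("rename", "/data/a.txt", "/data/c.txt")]
new_image = checkpoint(fsimage, edits)
print(sorted(new_image))
```

The point of the merge is that the NameNode never has to replay a huge EditLog at startup; it loads the latest compact FsImage instead.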
Data Block Split
It is an important process in the HDFS architecture. Each file is split into one or more blocks, and the blocks are replicated and stored on the DataNodes. The default block size is 128 MB.
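The splitting rule above can be sketched as follows. This is a minimal illustration of how a file's size maps to blocks (only the last block may be smaller than the block size); it is not the actual HDFS client logic.

```python
# Sketch of HDFS-style block splitting (default block size 128 MB).
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB in bytes

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (block_index, length) pairs; only the last block may be short."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((len(blocks), length))
        offset += length
    return blocks

# A 300 MB file becomes two full 128 MB blocks plus one 44 MB block.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks), blocks[-1][1] // (1024 * 1024))
```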
Replication and Rack Awareness in Hadoop
A rack is a collection of machines that are physically located in a single place/data center and connected through a network.
Replicating data is critical to the reliability of HDFS. By default, each block is replicated three times. Hadoop uses the following replication topology:
- The first replica is placed on the same node as that of the client.
- The second replica is placed on a different rack from that of the first replica.
- The third replica is placed on the same rack as that of the second one but on a different node.
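The three placement rules above can be sketched as a small function. The cluster topology, node names, and helper below are all made up for illustration; the real policy lives in the NameNode's block placement code and also handles cases such as a client running outside the cluster.

```python
import random

# Sketch of the default rack-aware replica placement described above
# (topology and node names are hypothetical).
CLUSTER = {
    "rack1": ["node11", "node12", "node13"],
    "rack2": ["node21", "node22", "node23"],
}

def place_replicas(client_node, topology):
    """Return 3 replica locations following the default HDFS policy."""
    rack_of = {n: r for r, nodes in topology.items() for n in nodes}
    first = client_node                       # replica 1: the writer's own node
    other_racks = [r for r in topology if r != rack_of[first]]
    remote_rack = random.choice(other_racks)  # replica 2: a different rack
    second = random.choice(topology[remote_rack])
    third = random.choice(                    # replica 3: same rack as the 2nd,
        [n for n in topology[remote_rack] if n != second])  # different node
    return [first, second, third]

replicas = place_replicas("node11", CLUSTER)
print(replicas)
```

This layout survives the loss of a whole rack (two replicas are off the writer's rack) while keeping write traffic between replicas 2 and 3 within one rack.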
Anatomy of File Write in Hadoop
- Before the client starts writing data to HDFS, it grabs an instance of DistributedFileSystem, the HDFS implementation of the FileSystem class.
- DistributedFileSystem makes an RPC call to the NameNode to create a new file in the filesystem’s namespace. DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to.
- As the client writes data, DFSOutputStream splits it into packets and writes them to its internal data queue; it also maintains an acknowledgment queue of packets waiting to be acknowledged by the DataNodes. The data queue is consumed by the DataStreamer, which streams the packets to the first DataNode in the pipeline.
- The list of DataNodes forms a pipeline; assuming a replication factor of three, there will be three nodes in the pipeline.
- When the client has finished writing data, it calls close() on the stream.
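The write path above can be modeled with a toy simulation. Packet size, node names, and the synchronous ack are simplifications (real packets are around 64 KB, and acks travel back up the pipeline asynchronously); the point is the data queue / ack queue / pipeline flow.

```python
from collections import deque

# Toy model of the HDFS write path: the client splits the stream into
# packets, a "DataStreamer" pushes each packet down a 3-node pipeline,
# and the packet leaves the ack queue once every DataNode has stored it.
PACKET_SIZE = 4  # bytes, tiny on purpose for the demo

def write(data, pipeline):
    data_queue = deque(data[i:i + PACKET_SIZE]
                       for i in range(0, len(data), PACKET_SIZE))
    ack_queue = deque()
    stored = {node: b"" for node in pipeline}
    while data_queue:
        packet = data_queue.popleft()
        ack_queue.append(packet)     # awaiting acknowledgment
        for node in pipeline:        # stream through DN1 -> DN2 -> DN3
            stored[node] += packet
        ack_queue.popleft()          # all nodes acked; packet is durable
    return stored

stored = write(b"hello hdfs!", ["dn1", "dn2", "dn3"])
print(stored["dn3"])
```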
Anatomy of File Read in Hadoop
- The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem.
- DistributedFileSystem calls the Namenode, using RPC (Remote Procedure Call), to determine the locations of the blocks for the first few blocks of the file. For each block, the NameNode returns the addresses of all the DataNodes that have a copy of that block.
- The client calls read() on the stream. DFSInputStream, which has stored the DataNode addresses, then connects to the closest DataNode holding the first block in the file.
- Data is streamed from the DataNode back to the client, which calls read() repeatedly on the stream. When the end of the block is reached, DFSInputStream closes the connection to that DataNode and then finds the best DataNode for the next block.
- Blocks are read in order. When the client has finished reading, it calls close() on the FSDataInputStream.
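The read path above can be sketched the same way. The block map, node distances, and replica contents below are invented for illustration; in real HDFS the NameNode supplies the locations and the client library measures network distance.

```python
# Toy model of the HDFS read path: for each block, pick the "closest"
# DataNode that holds a replica, read it, then move to the next block.
block_locations = {             # block id -> DataNodes holding a replica
    0: ["dn1", "dn2", "dn3"],
    1: ["dn2", "dn4", "dn5"],
}
distance = {"dn1": 0, "dn2": 1, "dn3": 2, "dn4": 0, "dn5": 2}
block_data = {
    ("dn1", 0): b"hello ", ("dn2", 0): b"hello ", ("dn3", 0): b"hello ",
    ("dn2", 1): b"hdfs!",  ("dn4", 1): b"hdfs!",  ("dn5", 1): b"hdfs!",
}

def read_file(num_blocks):
    out = b""
    for block in range(num_blocks):           # blocks are read in order
        closest = min(block_locations[block], key=distance.get)
        out += block_data[(closest, block)]   # stream from the closest replica
    return out

content = read_file(2)
print(content)
```

Because any replica can serve a read, a failed DataNode simply means falling back to the next-closest node holding the same block.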
Learning this technology has many advantages. HDFS is by far the most resilient and fault-tolerant storage technology available as an open-source platform, and it can be scaled up or down as needed, which makes it very hard to find a replacement for Big Data Hadoop storage needs. Understanding HDFS concepts therefore gives you a head start when working on the Hadoop platform. Some of the biggest enterprises on earth are deploying Hadoop at unprecedented scale, and things can only get better in the future.