Hadoop HDFS

HDFS (the Hadoop Distributed File System) is a distributed file system used to store bulk amounts of data, such as terabytes or even petabytes. HDFS supports a high-throughput mechanism for accessing this large amount of information. In HDFS, files are stored in a redundant manner across multiple machines, which guarantees the following:

  1. Durability in the face of failure
  2. High availability to highly parallel applications

NFS (the Network File System) is an example of a distributed file system. It gives remote access to a single logical volume stored on a single machine. An NFS server can expose a portion of its local file system to external clients, and a client can mount this remote file system directly into its own Linux file system and interact with it as though it were part of the local drive. One advantage of NFS is its transparency: clients do not need to be particularly aware that they are working on files stored remotely. This is accomplished with the existing standard library methods such as open(), close(), and fread().
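
To illustrate this transparency, here is a minimal Java sketch that reads a file living on an NFS mount with the same standard file I/O used for local files; the mount point /mnt/nfs/shared/data.txt is a hypothetical example.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Reads the first line of a file that happens to live on an NFS mount.
// The client code is identical to reading a local file: NFS hides the
// fact that the bytes actually come from a remote server.
public class NfsTransparencyDemo {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new FileReader("/mnt/nfs/shared/data.txt"))) {   // hypothetical mount point
            System.out.println(reader.readLine());
        }
    }
}
```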

Advantages of HDFS

  1. HDFS stores large amounts of information.
  2. HDFS has a simple and robust coherency model.
  3. That is, it stores data reliably.
  4. HDFS is scalable and provides fast access to this information; it is also possible to serve a large number of clients by simply adding more machines to the cluster.
  5. HDFS integrates well with Hadoop MapReduce, allowing data to be read and computed upon locally when possible.
  6. HDFS provides streaming read performance.
  7. Data is written to HDFS once and then read several times (a minimal write sketch follows this list).
  8. The overhead of caching is avoided: data can simply be re-read from the HDFS source.
  9. Fault tolerance by detecting faults and applying quick, automatic recovery
  10. Processing logic close to the data, rather than the data close to the processing logic
  11. Portability across heterogeneous commodity hardware and operating systems
  12. Economy by distributing data and processing across clusters of commodity personal computers
  13. Efficiency by distributing data and logic to process it in parallel on nodes where data is located
  14. Reliability by automatically maintaining multiple copies of data and automatically redeploying processing logic in the event of failures
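
As a minimal sketch of the write-once pattern in point 7, the following Java snippet writes a file into HDFS through the org.apache.hadoop.fs.FileSystem API. The path /user/demo/events.log is hypothetical, and a configured Hadoop client classpath is assumed.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Writes a file to HDFS once; later jobs read it many times without rewriting it.
public class HdfsWriteOnceDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/events.log");   // hypothetical path

        // Create the file (overwriting if it exists) and write the data in a single pass.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("event-1\nevent-2\n".getBytes(StandardCharsets.UTF_8));
        }
        // From here on the file is only read; it is not modified in place.
        System.out.println("Wrote " + fs.getFileStatus(file).getLen() + " bytes");
    }
}
```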

HDFS is a block-structured file system: each file is broken into blocks of a fixed size, and these blocks are stored across a cluster of one or more machines with data storage capacity. Individual machines in the cluster are called DataNodes. A file can be made up of several blocks, and they are not necessarily stored on the same machine; the target machine for each block is chosen randomly on a block-by-block basis. Thus access to a file may require the cooperation of multiple machines, but HDFS supports file sizes far larger than a single-machine DFS can hold, since individual files can require more space than a single hard drive could provide. However, if several machines must be involved in serving a file, then the file could be rendered unavailable by the loss of any one of those machines. HDFS combats this problem by replicating each block across a number of machines (3, by default).
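
This block-by-block placement can be inspected from client code. The sketch below uses FileSystem.getFileBlockLocations() to print which DataNodes hold each block of a hypothetical file /user/demo/big-input.dat.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Prints which DataNodes hold each block of a (hypothetical) HDFS file,
// making the block-by-block placement and replication visible.
public class BlockLocationDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/big-input.dat"));

        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (int i = 0; i < blocks.length; i++) {
            // Each block is replicated on several hosts (3 by default).
            System.out.println("block " + i + " -> "
                    + String.join(", ", blocks[i].getHosts()));
        }
    }
}
```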

In the above figure, the DataNodes hold blocks of multiple files with a replication factor of 2, and the NameNode maps the filenames onto the block IDs. Ordinary block-structured file systems commonly use a block size on the order of 4 or 8 KB; the default block size in HDFS is 64 MB. This permits HDFS to decrease the amount of metadata storage required per file.
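
If the default is not suitable, the block size can also be chosen per file. The following sketch (illustrative path and sizes) passes an explicit 128 MB block size to one of the FileSystem.create() overloads.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Creates a file with an explicit 128 MB block size instead of the cluster default.
public class BlockSizeDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        long blockSize = 128L * 1024 * 1024;   // 128 MB blocks for this file (illustrative)
        short replication = 3;                 // default replication factor
        int bufferSize = 4096;

        try (FSDataOutputStream out = fs.create(
                new Path("/user/demo/large-output.dat"),   // hypothetical path
                true, bufferSize, replication, blockSize)) {
            out.writeBytes("payload...\n");
        }
    }
}
```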

In the HDFS block-structured file system, all file system metadata is handled by a single machine called the NameNode. The NameNode stores all the metadata for the file system. Information such as file names, permissions, and the locations of each block of each file is kept in the main memory of the NameNode machine, which permits fast access to the metadata. To open a file, the client first contacts the NameNode and retrieves a list of locations for the blocks that comprise the file; these locations identify the DataNodes which hold each block. Clients then read file data directly from the DataNode servers, possibly in parallel, so the NameNode is not directly involved in the bulk data transfer, keeping its overhead to a minimum. The NameNode's information must be preserved even if the NameNode machine fails, because NameNode failure is more severe for the cluster than DataNode failure. While individual DataNodes may crash and the entire cluster will continue to operate, the loss of the NameNode renders the cluster inaccessible until it is manually restored.
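
From the client side, this read path looks like the sketch below: fs.open() consults the NameNode for the block locations, and the subsequent reads stream the bytes directly from the DataNodes. The file path is hypothetical.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Opens a (hypothetical) HDFS file and streams it line by line.
public class HdfsReadDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // open() asks the NameNode where the blocks live; the data itself
        // is then fetched directly from the DataNodes that store the blocks.
        try (FSDataInputStream in = fs.open(new Path("/user/demo/events.log"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```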

Disadvantages of NFS

As a distributed file system, NFS is limited in its power. The files in an NFS volume all reside on a single machine, which creates several problems:

  1. It does not give any reliability guarantees if that machine goes down, for example by replicating the files to other machines.
  2. All the clients must go to this machine to retrieve their data. This can overload the server if a large number of clients must be handled.
  3. Clients need to copy the data to their local machines before they can operate on it.

Goals of HDFS

  1. Very large distributed file system: 10K nodes, 100 million files, 10 PB.
  2. Assumes commodity hardware: files are replicated to handle hardware failure, and failures are detected and recovered from automatically (a per-file replication sketch follows this list).
  3. Optimized for batch processing: data locations are exposed so that computations can move to where the data resides, providing very high aggregate bandwidth.
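
Since replication is central to these goals, the small sketch below raises the replication factor of one (hypothetical) file with FileSystem.setReplication(); the NameNode then schedules the extra block copies in the background.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Raises the replication factor of one file so it survives more
// simultaneous DataNode failures than the cluster default allows.
public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/critical-results.dat");   // hypothetical path

        boolean accepted = fs.setReplication(file, (short) 5);
        System.out.println("replication change accepted: " + accepted);
        // The NameNode asynchronously creates the additional block copies.
    }
}
```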