Hadoop – Advantages and Disadvantages

This post lists the advantages and disadvantages of Hadoop. MapReduce and HDFS are the two main parts of Hadoop.

Advantages of Hadoop

1)      Distributed data and computation. Keeping computation local to the data prevents network overload.

2)      Tasks are independent. Because the tasks are independent,

  1. Partial failure is easy to handle. Entire nodes can fail and restart.
  2. It avoids the crawling horrors of failure-tolerant synchronous distributed systems.
  3. Speculative execution can work around stragglers.

3)      Linear scaling in the ideal case. It is designed for cheap, commodity hardware.

4)      Simple programming model. The end-user programmer only writes MapReduce tasks.

5)      Flat scalability:-

One of the major advantages of using Hadoop in contrast to other distributed systems is its flat scalability curve. Executing Hadoop on a limited amount of data on a small number of nodes may not demonstrate particularly stellar performance, as the overhead involved in starting Hadoop programs is relatively high. Other parallel/distributed programming paradigms such as MPI (Message Passing Interface) may perform much better on two, four, or perhaps a dozen machines. Though the effort of coordinating work among a small number of machines may be better performed by such systems, the price paid in performance and engineering effort (when adding more hardware as a result of increasing data volumes) increases non-linearly.

A program written in distributed frameworks other than Hadoop may require large amounts of refactoring when scaling from ten to one hundred or one thousand machines. This may involve having the program be rewritten several times; fundamental elements of its design may also put an upper bound on the scale to which the application can grow.

Hadoop, however, is specifically designed to have a very flat scalability curve. After a Hadoop program is written and functioning on ten nodes, very little–if any–work is required for that same program to run on a much larger amount of hardware. Orders of magnitude of growth can be managed with little re-work required for your applications. The underlying Hadoop platform will manage the data and hardware resources and provide dependable performance growth proportionate to the number of machines available.
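To make the simple programming model and the flat scalability curve concrete, here is a minimal sketch in plain Python. It is not Hadoop code: `run_job` is a hypothetical stand-in for the framework, and `map_words` and `reduce_counts` are illustrative names for the only two functions the end user would write. The point is that the user code stays the same while the number of workers changes.

```python
from collections import defaultdict
from itertools import chain

def map_words(line):
    """User-written map step: emit (word, 1) for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_counts(word, counts):
    """User-written reduce step: sum the counts collected for one word."""
    return word, sum(counts)

def run_job(lines, num_workers):
    """Toy stand-in for the framework (an assumption, not Hadoop's API)."""
    # Split the input into independent chunks, one per simulated worker.
    chunks = [lines[i::num_workers] for i in range(num_workers)]
    # Map each chunk on its own; no chunk needs data from another chunk.
    mapped = [
        list(chain.from_iterable(map_words(line) for line in chunk))
        for chunk in chunks
    ]
    # Shuffle: group all emitted values by key across the chunks.
    grouped = defaultdict(list)
    for key, value in chain.from_iterable(mapped):
        grouped[key].append(value)
    # Reduce each key independently of the others.
    return dict(reduce_counts(key, values) for key, values in grouped.items())

lines = ["Hadoop stores data", "Hadoop processes data", "data is everywhere"]
small = run_job(lines, num_workers=1)
large = run_job(lines, num_workers=3)
```

Both runs produce the same word counts. Only the `num_workers` argument changed, which is the sense in which a program that works on a few nodes needs little or no rework to run on many.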

6)      HDFS can store a large amount of information.

7)      HDFS has a simple and robust coherency model.

8)      HDFS should store data reliably.

9)      HDFS provides scalable, fast access to this information, and it is possible to serve a large number of clients by simply adding more machines to the cluster.

10)   HDFS should integrate well with Hadoop MapReduce, allowing data to be read and computed upon locally when possible.

11)  HDFS provides streaming read performance.

12)  Data will be written to the HDFS once and then read several times.

13)  The overhead of caching is high enough that data should simply be re-read from the HDFS source.

14)  Fault tolerance by detecting faults and applying quick, automatic recovery

15)  Processing logic close to the data, rather than the data close to the processing logic

16)  Portability across heterogeneous commodity hardware and operating systems

17)  Economy by distributing data and processing across clusters of commodity personal computers

18)  Efficiency by distributing data and logic to process it in parallel on nodes where data is located

19)  Reliability by automatically maintaining multiple copies of data and automatically redeploying processing logic in the event of failures

20)  HDFS is a block-structured file system: each file is broken into blocks of a fixed size, and these blocks are stored across a cluster of one or more machines with data storage capacity

21)  Ability to write MapReduce programs in Java, a language which even many non-computer scientists can learn with sufficient capability to meet powerful data-processing needs

22)  Ability to rapidly process large amounts of data in parallel

23)  Can be deployed on large clusters of cheap commodity hardware as opposed to expensive, specialized parallel-processing hardware

24)  Can be offered as an on-demand service, for example as part of Amazon’s EC2 cluster computing service
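The block structure and replication points above (items 19 and 20) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the round-robin placement is a toy policy rather than the placement policy HDFS actually uses, and the 128 MB block size and replication factor of 3 are illustrative values.

```python
def split_into_blocks(file_size, block_size):
    """Return (offset, length) pairs; every block is full size except possibly the last."""
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]

def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes (toy round-robin policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }

def readable_blocks(placement, failed_nodes):
    """Blocks that still have at least one replica on a live node."""
    return [
        block
        for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    ]

MB = 1024 * 1024
blocks = split_into_blocks(300 * MB, 128 * MB)
placement = place_replicas(
    len(blocks), ["node1", "node2", "node3", "node4"], replication=3
)
survivors = readable_blocks(placement, failed_nodes={"node1", "node2"})
```

A 300 MB file becomes three blocks here (128 MB, 128 MB and 44 MB). With three replicas per block, all three blocks remain readable after two of the four nodes fail.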

Disadvantages of Hadoop

1)      Rough edges:- Hadoop MapReduce and HDFS are rough around the edges because the software is still under active development.

2)      Programming model is very restrictive:- The lack of central data can be a hindrance.

3)      Joins of multiple datasets are tricky and slow:- No indices! Often the entire dataset gets copied in the process.

4)      Cluster management is hard:- In the cluster, operations such as debugging, distributing software, and collecting logs are hard.

5)      There is still a single master, which requires care and may limit scaling

6)      Managing job flow isn’t trivial when intermediate data must be kept

7)      Optimal configuration of nodes is not obvious, e.g. number of mappers, number of reducers, memory limits
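The point about joins (item 3) is easier to see with a small example. The sketch below imitates a reduce-side join in plain Python. The function name, the record layout and the `records_moved` counter are all assumptions made for illustration, and this is not Hadoop code. With no index to look up matching keys, every record from both datasets has to be sent through the shuffle.

```python
from collections import defaultdict

def reduce_side_join(users, orders):
    """Join users (user_id, name) with orders (user_id, item) on user_id."""
    shuffled = defaultdict(lambda: {"users": [], "orders": []})
    records_moved = 0
    # Map phase: tag each record with its source and emit it under the join key.
    for user_id, name in users:
        shuffled[user_id]["users"].append(name)
        records_moved += 1
    for user_id, item in orders:
        shuffled[user_id]["orders"].append(item)
        records_moved += 1
    # Reduce phase: pair up the records that share a key.
    joined = [
        (user_id, name, item)
        for user_id, group in sorted(shuffled.items())
        for name in group["users"]
        for item in group["orders"]
    ]
    return joined, records_moved

users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(1, "book"), (1, "pen"), (3, "lamp")]
joined, records_moved = reduce_side_join(users, orders)
```

Here all six input records are moved to produce three joined rows, including the user with no orders. On large datasets, that full copy is what makes joins slow.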