Hadoop on demand

Apache Hadoop is a Java software framework that supports data-intensive distributed applications. Hadoop consists of a distributed file system (HDFS) and a system for provisioning virtual Hadoop clusters over a large physical cluster, called Hadoop on Demand (HOD). Sabalcore’s Hadoop on Demand assigns nodes and generates the appropriate configuration files for the Hadoop daemons and client. HOD uses the Torque resource manager to allocate nodes. HOD can also distribute Hadoop to the assigned nodes, automatically creating and tearing down instances. HOD makes it easy for users to rapidly set up and use Hadoop.

Properties of Hadoop at Sabalcore:

  1. Run Hadoop in conjunction with Sabalcore’s HPC on-demand services
  2. No additional charges for local scratch disk storage
  3. Fast, reliable, high-throughput cluster network
  4. Easy to use. Quickly create and tear down instances
  5. Configuration files automatically generated
  6. Completely integrated into Sabalcore’s queuing and accounting systems
  7. Pay only for the time you use

The HOD documentation consists of three guides.

1)      HOD admin guide:

This gives an overview of the HOD architecture, the prerequisites, installing the various components and dependent software, and configuring HOD to get it up and running.

The HOD project is a system for provisioning and managing independent Hadoop MapReduce and HDFS instances on a shared cluster of nodes. HOD helps administrators and users rapidly set up and use Hadoop, and it is useful for Hadoop developers and testers who need to share a physical cluster for testing their own Hadoop versions.

The HOD architecture includes a resource manager (possibly together with a scheduler), the HOD components, and the Hadoop MapReduce and HDFS daemons.

The cluster comprises two sets of nodes:

  1. Submit nodes: Users use the HOD client on these nodes to allocate clusters, and then use the Hadoop client to submit Hadoop jobs.
  2. Compute nodes: Using the resource manager, the HOD components run on these nodes to bring up the Hadoop daemons. The Hadoop jobs then run on them.

Allocating a cluster and running jobs involves the following steps:

  1. The user uses the HOD client on the Submit node to allocate a required number of cluster nodes, and provision Hadoop on them.
  2. The HOD client uses a Resource Manager interface, (qsub, in Torque), to submit a HOD process, called the RingMaster, as a Resource Manager job, requesting the user desired number of nodes. This job is submitted to the central server of the Resource Manager (pbs_server, in Torque).
  3. On the compute nodes, the resource manager slave daemons, (pbs_moms in Torque), accept and run jobs that they are given by the central server (pbs_server in Torque). The RingMaster process is started on one of the compute nodes (mother superior, in Torque).
  4. The RingMaster then uses another Resource Manager interface, (pbsdsh, in Torque), to run the second HOD component, HodRing, as distributed tasks on each of the allocated compute nodes.
  5. The HodRings, after initializing, communicate with the RingMaster to get Hadoop commands, and run them accordingly. Once the Hadoop daemons are started, the HodRings register with the RingMaster, passing information about the daemons.
  6. All the configuration files needed for the Hadoop instances are generated by HOD itself, some of them derived from options given by the user in its own configuration file.
  7. The HOD client keeps communicating with the RingMaster to find out the location of the JobTracker and HDFS daemons.
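Sketched as a shell session, the allocation workflow above might look like the following. The cluster directory, node count, and example job are assumptions for illustration, and the hod and hadoop invocations are shown as comments because they require a working HOD/Torque installation:

```shell
# Hypothetical HOD workflow sketch; paths and node count are assumptions.
CLUSTER_DIR=$HOME/hod-clusters/test   # per-cluster state/config directory
NUM_NODES=3                           # HOD requires a minimum of 3 nodes

# 1. From a submit node, allocate nodes and provision Hadoop on them:
#        hod allocate -d "$CLUSTER_DIR" -n "$NUM_NODES"
# 2. Run jobs against the daemons HOD brought up, using the client
#    configuration HOD generated in the cluster directory:
#        hadoop --config "$CLUSTER_DIR" jar hadoop-examples.jar wordcount in out
# 3. Release the nodes back to the resource manager when finished:
#        hod deallocate -d "$CLUSTER_DIR"
echo "HOD-generated client config would live in $CLUSTER_DIR"
```

The -d cluster directory is how the HOD client later finds the JobTracker and HDFS locations it obtained from the RingMaster.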

Pre-requisites

  • Operating System: HOD is currently tested on RHEL4.
  • Nodes: HOD requires a minimum of 3 nodes configured through a resource manager.
  • Software: The following components must be installed on *ALL* the nodes before using HOD:
      • Torque: resource manager
      • Python: HOD requires version 2.5.1 of Python.
  • The following components can optionally be installed for better functionality:
      • Twisted Python: used to improve the scalability of HOD. If this module is detected, HOD uses it; otherwise it falls back to the default modules.
      • Hadoop: HOD can automatically distribute Hadoop to all nodes in the cluster. It can also use a pre-installed version of Hadoop, if one is available on all nodes in the cluster. HOD currently supports Hadoop 0.15 and above.

Installing HOD

The following steps are used for installing HOD.

  • If you are getting HOD from the Hadoop tarball, it is available under the ‘contrib’ section of Hadoop, under the root directory ‘hod’.
  • If you are building from source, you can run ant tar from the Hadoop root directory, to generate the Hadoop tarball, and then pick HOD from there, as described in the point above.
  • Distribute the files under this directory to all the nodes in the cluster. Note that the location where the files are copied should be the same on all the nodes.
  • Note that compiling Hadoop builds HOD with the appropriate permissions set on all the required HOD script files.
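The build-and-distribute steps above can be sketched as follows. The source path, install path, and node names are assumptions, and the build and copy commands are shown as comments since they need a Hadoop source tree and cluster access:

```shell
# Hypothetical install sketch; paths and hostnames are assumptions.
HADOOP_SRC=$HOME/src/hadoop
HOD_DEST=/opt/hod   # must be the same path on every node

# Build the Hadoop tarball; HOD is included under its contrib section:
#        cd "$HADOOP_SRC" && ant tar
# Distribute the hod directory to the same location on all nodes:
#        for node in node01 node02 node03; do
#            scp -r hod "$node:$HOD_DEST"
#        done
echo "HOD must live at $HOD_DEST on all nodes"
```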

Configuring HOD

After installing HOD, it must be configured before use. The steps for configuring HOD are given below.

a)         On the node from which you want to run hod, edit the file hodrc, which can be found in the <install dir>/conf directory. This file contains the minimal set of values required for running hod.

b)        Specify values suitable to your environment for the following variables defined in the configuration file. Note that some of these variables are defined at more than one place in the file.

  1. ${JAVA_HOME}: Location of Java for Hadoop. Hadoop supports Sun JDK 1.5.x and above.
  2. ${CLUSTER_NAME}: Name of the cluster which is specified in the ‘node property’ as mentioned in resource manager configuration.
  3. ${HADOOP_HOME}: Location of Hadoop installation on the compute and submit nodes.
  4. ${RM_QUEUE}: Queue configured for submitting jobs in the resource manager configuration.
  5. ${RM_HOME}: Location of the resource manager installation on the compute and submit nodes.
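Filled in, the variables above might look like this in hodrc. The concrete paths and names are assumptions for illustration, and some values, such as the Java and Hadoop locations, are defined in more than one section:

```ini
# Hypothetical hodrc fragment; all values are illustrative assumptions.
[hod]
# ${JAVA_HOME}: location of Java for Hadoop
java-home = /usr/java/jdk1.6.0
# ${CLUSTER_NAME}: the 'node property' from the resource manager config
cluster = orange

[resource_manager]
# ${RM_QUEUE}: queue configured for submitting jobs
queue = batch
# ${RM_HOME}: resource manager install location
batch-home = /opt/torque

[gridservice-mapred]
# ${HADOOP_HOME}: Hadoop install location on compute and submit nodes
pkgs = /opt/hadoop

[gridservice-hdfs]
pkgs = /opt/hadoop
```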

c)     The following environment variables *may* need to be set depending on your environment. These variables must be defined where you run the HOD client, and also be specified in the HOD configuration file as the value of the key resource_manager.env-vars. Multiple variables can be specified as a comma separated list of key=value pairs.

  • HOD_PYTHON_HOME: If Python is installed in a non-default location on the compute nodes or submit nodes, this variable must be defined to point to the Python executable in that location.
  • For advanced configuration, refer to the HOD configuration guide.

2)      HOD user guide:

This guide describes how to get started running HOD, covers its various features and command line options, and provides detailed help on troubleshooting.

Getting started using HOD

The command line user interface utility is called hod. Using hod, users can override values from the configuration file. The configuration file can be specified in two ways.

1)      Specify it on the command line, using the -c option: hod <operation> <required-args> -c path-to-the-configuration-file [other-options]

2)      Set up an environment variable HOD_CONF_DIR where hod will be run. It should point to a directory on the local file system containing a file called hodrc. Note that this is analogous to HADOOP_CONF_DIR and the hadoop-site.xml file for Hadoop. If no configuration file is specified on the command line, hod looks for the HOD_CONF_DIR environment variable and a hodrc file under that directory.
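The two ways of supplying the configuration can be sketched as follows; the paths are assumptions, and the hod invocations are shown as comments since they require an installed HOD:

```shell
# Hypothetical examples of pointing hod at its configuration file.

# Way 1: pass the file explicitly with -c on each invocation:
#        hod allocate -d ~/hod-clusters/test -n 3 -c /opt/hod/conf/hodrc

# Way 2: export HOD_CONF_DIR; hod then reads the hodrc file inside it:
HOD_CONF_DIR=/opt/hod/conf
export HOD_CONF_DIR
#        hod allocate -d ~/hod-clusters/test -n 3
echo "hod would read $HOD_CONF_DIR/hodrc"
```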

3)      HOD configuration guide:

This guide describes the various configuration sections and parameters of HOD and their purpose.

HOD features

  1. Provisioning and managing Hadoop clusters
  2. Using a tarball to distribute Hadoop
  3. Using an external HDFS
  4. Options for configuring Hadoop
  5. Viewing Hadoop web UIs
  6. Collecting and viewing Hadoop logs
  7. Auto-deallocation of idle clusters
  8. Specifying additional job attributes
  9. Capturing HOD exit codes in Torque
  10. Command line options
  11. Options for configuring HOD

Troubleshooting HOD

Three common problem scenarios are described below.

1)      HOD hangs during allocation

Three different causes are possible:

a)    HOD or Hadoop components failed to come up: in this case, the hod command returns after a few minutes with an error code of either 7 or 8, as defined in the error codes section.

b)    A large allocation is fired with a tarball: sometimes, due to load on the network or on the allocated nodes, tarball distribution may be significantly slow and take a couple of minutes to complete. Wait for it to finish. Also check that the tarball does not include the Hadoop sources or documentation.

c)    A Torque-related error: in this case, the hod command may not return for more than 5 minutes.

2)      HOD hangs during deallocation

A Torque-related problem, such as load on the Torque server or a very large allocation. Generally, waiting for the command to complete is the only option.

3)      HOD fails with an error code and error message: if the exit code of the hod command is not 0, refer to the table of error exit codes to determine why the code occurred and how to debug the situation.
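Since hod reports failures through its exit code, a small wrapper can surface the code before you consult the error table. This is a hypothetical helper; the only code meanings taken from this document are 0 (success) and 7/8 (components failed to come up):

```shell
# Hypothetical helper mapping known HOD exit codes to short descriptions.
describe_hod_exit() {
    case "$1" in
        0)   echo "success" ;;
        7|8) echo "HOD or Hadoop components failed to come up" ;;
        *)   echo "exit code $1: consult the HOD error code table" ;;
    esac
}

# Usage after an allocation attempt:
#        hod allocate -d ~/hod-clusters/test -n 3
#        describe_hod_exit $?
describe_hod_exit 7
```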

Sections in HOD

The sections in the HOD configuration file are:

  1. hod: Options for the HOD client
  2. resource_manager: Options for specifying which resource manager to use, and other parameters for using that resource manager
  3. ringmaster: Options for the RingMaster process
  4. hodring: Options for the HodRing processes
  5. gridservice-mapred: Options for the MapReduce daemons
  6. gridservice-hdfs: Options for the HDFS daemons

Commonly used configuration options

1)      Common configuration options

a)      temp-dir: Temporary directory for use by the HOD processes. Make sure that the users who will run hod have rights to create directories under the directory specified here.

b)      debug: A numeric value from 1 to 4; 4 produces the most log information, and 1 the least.

c)      log-dir: Directory where log files are stored. By default, this is <install-location>/logs/. The restrictions and notes for the temp-dir variable apply here too.

d)      xrs-port-range: A range of ports from which an available port will be picked to run an XML-RPC server.

e)      http-port-range: A range of ports from which an available port will be picked to run an HTTP server.

f)      java-home: Location of Java to be used by Hadoop.

g)      syslog-address: Address to which a syslog daemon is bound, in the format host:port. If configured, HOD log messages will be logged to syslog using this value.
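The common options above could be sketched in hodrc as follows; note that several of them (temp-dir, debug, java-home, and the port ranges) appear in more than one section. All values here, including the low-high port-range syntax, are illustrative assumptions:

```ini
# Hypothetical common settings, shown for the [ringmaster] section only.
[ringmaster]
# Users must be able to create directories under temp-dir and log-dir.
temp-dir = /tmp/hod
log-dir = /opt/hod/logs
# Log verbosity: 1 (least) to 4 (most).
debug = 3
# Assumed low-high syntax: a free port is picked from each range.
xrs-port-range = 32768-65536
http-port-range = 8000-9000
java-home = /usr/java/jdk1.6.0
# Optional host:port of a syslog daemon (assumed hostname).
syslog-address = loghost:514
```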

2)      hod options

a)      cluster: A descriptive name given to the cluster. For Torque, this is specified as a ‘Node property’ for every node in the cluster. HOD uses this value to compute the number of available nodes.

b)      client-params: A comma-separated list of hadoop config parameters specified as key-value pairs. These will be used to generate a hadoop-site.xml on the submit node that should be used for running MapReduce jobs.

3)      resource_manager options

a)      queue: Name of the queue configured in the resource manager to which jobs are to be submitted.

b)      batch-home: Install directory to which ‘bin’ is appended and under which the executables of the resource manager can be found.

c)      env-vars: A comma-separated list of key-value pairs, expressed as key=value, which will be passed to the jobs launched on the compute nodes. For example, if the Python installation is in a non-standard location, one can set the environment variable ‘HOD_PYTHON_HOME’ to the path to the Python executable. The HOD processes launched on the compute nodes can then use this variable.

4)      ringmaster options

a)      work-dirs: A comma-separated list of paths that will serve as the root for directories that HOD generates and passes to Hadoop for storing DFS and MapReduce data. For example, this is where DFS data blocks will be stored. Typically, as many paths are specified as there are disks available, to ensure all disks are utilized. The restrictions and notes for the temp-dir variable apply here too.

5)      gridservice-hdfs options

a)      external: If false, this indicates that an HDFS cluster must be brought up by the HOD system, on the nodes which it allocates via the allocate command. Note that in that case, when the cluster is deallocated, it will bring down the HDFS cluster, and all the data will be lost. If true, HOD will try to connect to an externally configured HDFS system. Typically, because input for jobs is placed into HDFS before jobs are run, and the output from jobs is also required to persist in HDFS, an internal HDFS cluster is of little value in a production system. However, it allows for quick testing.

b)      host: Hostname of the externally configured NameNode, if any.

c)      fs_port: Port to which the NameNode RPC server is bound.

d)      info_port: Port to which the NameNode web UI server is bound.

e)      pkgs: Installation directory under which the bin/hadoop executable is located. This can be used to use a pre-installed version of Hadoop on the cluster.

f)      server-params: A comma-separated list of hadoop config parameters specified as key-value pairs. These will be used to generate a hadoop-site.xml that will be used by the NameNode and DataNodes.

g)      final-server-params: Same as above, except they will be marked final.
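For instance, pointing HOD at an externally managed HDFS might look like this in hodrc; the hostname, ports, and path are assumptions for illustration:

```ini
# Hypothetical gridservice-hdfs section for an external HDFS.
[gridservice-hdfs]
external = true
# Externally configured NameNode (example hostname).
host = namenode.example.com
# NameNode RPC and web UI ports; these values are assumed, not defaults
# mandated by HOD.
fs_port = 9000
info_port = 50070
# Pre-installed Hadoop on the cluster (assumed path).
pkgs = /opt/hadoop
```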

6)      gridservice-mapred options

a)      external: If false, this indicates that a MapReduce cluster must be brought up by the HOD system on the nodes which it allocates via the allocate command. If true, HOD will try to connect to an externally configured MapReduce system.

b)      host: Hostname of the externally configured JobTracker, if any.

c)      tracker_port: Port to which the JobTracker RPC server is bound.

d)      info_port: Port to which the JobTracker web UI server is bound.

e)      pkgs: Installation directory under which the bin/hadoop executable is located.

f)      server-params: A comma-separated list of hadoop config parameters specified as key-value pairs. These will be used to generate a hadoop-site.xml that will be used by the JobTracker and TaskTrackers.

g)      final-server-params: Same as above, except they will be marked final.

7)      hodring options

a)      mapred-system-dir-root: Directory in the DFS under which HOD will generate sub-directory names and pass the full path as the value of the ‘mapred.system.dir’ configuration parameter to the Hadoop daemons. The format of the full path is value-of-this-option/userid/mapredsystem/cluster-id. Note that the directory specified here should be such that all users can create directories under it, if permissions are enabled in HDFS. Setting the value of this option to /user will make HOD use the user’s home directory to generate the mapred.system.dir value.

b)      log-destination-uri: URL describing a path in an external, static DFS or on the cluster node’s local file system where HOD will upload Hadoop logs when a cluster is deallocated. To specify a DFS path, use the format ‘hdfs://path’. To specify a cluster node’s local file path, use the format ‘file://path’. When clusters are deallocated by HOD, the Hadoop logs are deleted as part of HOD’s cleanup process; to persist these logs, use this configuration option. The format of the path is value-of-this-option/userid/hod-logs/cluster-id. Note that the directory you specify here must be such that all users can create sub-directories under it. Setting this value to hdfs://user will place the logs in the user’s home directory in DFS.

c)      pkgs: Installation directory under which the bin/hadoop executable is located. This will be used by HOD to upload logs if an HDFS URL is specified in the log-destination-uri option. Note that this is useful if the users are using a tarball whose version may differ from the external, static HDFS version.
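As an illustration, the hodring options above might be set as follows in hodrc; the /user and hdfs://user values follow the behavior described above, while the pkgs path is an assumption:

```ini
# Hypothetical hodring section of hodrc.
[hodring]
# mapred.system.dir paths are generated as /user/<userid>/mapredsystem/<cluster-id>.
mapred-system-dir-root = /user
# Persist Hadoop logs to DFS under /user/<userid>/hod-logs/<cluster-id>.
log-destination-uri = hdfs://user
# Needed to upload logs when an HDFS URL is given above (assumed path).
pkgs = /opt/hadoop
```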