Hadoop Virtual Machine

To run the virtual machine, first install the virtual machine software and then the virtual machine image. Users of Mac OS, Linux, or other Unix-like environments can install Hadoop and run it on one or more machines with no additional software other than Java. Running Hadoop on Windows requires Cygwin, and that setup is suitable for development purposes only. Cygwin can be unstable, and installing it can be cumbersome. The virtual machine software avoids this by running the image inside a “sandbox” environment, in which a second operating system runs on top of the existing one.

The sandbox is unaware that another operating environment exists outside it. The sandbox environment is referred to as the “guest machine”, running a “guest operating system.” The actual physical machine running the VM software is referred to as the “host machine”, and it runs the “host operating system.” The virtual machine gives other host-machine applications the appearance that another physical computer is available on the same network. Applications running on the host machine see the VM as a separate machine with its own IP address and can interact with the programs inside the VM accordingly.
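Because the guest looks like a separate machine with its own IP address, a program on the host can reach a service inside the VM the same way it would reach any other networked host. The following Python sketch checks whether a TCP port on the guest accepts connections. The function name is made up for illustration, and the address and port in the usage note are assumptions; substitute the address your VM reports and the port of the service you want to reach.

```python
import socket


def guest_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the address and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, guest_port_open("192.168.56.101", 50070) would report whether something is listening on port 50070 of a guest at that (assumed) address.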

Figure: Hadoop Virtual Machine

A virtual machine wraps one operating system within another. Applications in the virtual machine act as if they run on a separate physical host from the applications in the external operating system. In the figure above, Windows is the host machine and Linux is the guest (virtual) machine. Hadoop is typically developed and run natively on Linux, whereas Windows users must install Cygwin for Hadoop development. The advantages of using a virtual machine are:

1)      The virtual machine provided with this tutorial gives users a convenient alternative development platform that requires minimal configuration.

2)      The virtual machine is easy to reset. If your experiments break the Hadoop configuration or render the operating system unusable, you can restore the original image and start again from a known-good state.
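The reset mentioned above can be done in several ways. One simple approach, assuming you keep an untouched copy of the disk image and the VM is powered off, is to copy that pristine file back over the working image. The sketch below shows this in Python; the function name and both file paths are hypothetical.

```python
import shutil
from pathlib import Path


def reset_image(pristine: Path, working: Path) -> Path:
    """Overwrite the working disk image with the pristine copy."""
    if not pristine.is_file():
        raise FileNotFoundError(f"pristine image not found: {pristine}")
    # copy2 also copies timestamps and permission bits where the platform allows it.
    shutil.copy2(pristine, working)
    return working
```

A call such as reset_image(Path("backup/image.vmdk"), Path("vm/image.vmdk")) would discard every change made inside the guest since the backup was taken, so copy out any work you want to keep first.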

The following steps get Cloudera’s virtual machine image up and running in VirtualBox.

1)      Download the virtual machine image and extract it.

2)      Start VirtualBox. Once it loads, go to the File menu and select Virtual Media Manager.

Figure: Sun VirtualBox

3)      In the Virtual Media Manager window, click New to create a new image.

Figure: Virtual Media Manager

4)      In the file dialog box that appears, browse to the directory where you extracted the download and select the file cloudera-training-0.2-cl3.vmdk.

5)      After closing the Virtual Media Manager window, click the New button in the main VirtualBox window to create a new virtual machine.

6)      From the Create New Virtual Machine dialog box, give your new machine a name. Select Linux as the operating system and Ubuntu as the version.

7)      On the next screen, set the memory size. You can choose a value between 4 and 512 MB; the recommended size is 384 MB.

8)      Next, select the hard disk image that you added earlier.

9)       Double check the summary before clicking Finish.

10)  After closing the Virtual Machine Wizard, you can select the Cloudera machine that you just created and click Start.

11)   Assuming you’ve done everything correctly up to this point and your VirtualBox installation is working properly, you should see a window pop up with the boot-up messages for the new virtual machine. Watch this to make sure everything is booting fine. If you see error messages here or if your machine doesn’t boot up correctly, you may have missed a step earlier or selected the wrong file for the hard disk image.

12)  After a few moments, you should see the desktop of your new image. If you’ve gotten this far, you can stop here if you want, but you’ll be missing out on the enhanced functionality that VirtualBox offers, such as better integration with your existing desktop, sharing of files, etc.

13)  If you want full integration, open a terminal and run the following command:

sudo apt-get install build-essential linux-headers-`uname -r`

This will install the basics that you need before loading the VirtualBox additions.

14)  Select Install Guest Additions from the Devices menu.

15)  You should now see a pop-up window prompting you to run the installer for the guest additions. Click the Run button to continue.

16)  If the dependencies installed correctly earlier, you’ll see a terminal window, which will show you the progress as the add-ons are installed.

17)  At this point, you can select Shutdown from the system menu in the top menu bar, and then choose Restart to reboot your virtual machine. When the VM restarts and the desktop is fully loaded, you should be able to resize the window, use your mouse seamlessly between the virtual machine window and your desktop, and add a shared folder.
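The GUI steps for creating and starting the machine can also be scripted with VirtualBox's command-line tool, VBoxManage. The Python sketch below only builds and prints the commands so that they can be reviewed first. The subcommands and flags are those of recent VirtualBox releases and may differ in the older Sun-era version described here, so check them against VBoxManage's built-in help. The VM name "cloudera-training" and the controller label "IDE" are assumptions, and the Guest Additions steps are not covered.

```python
import shlex
from typing import List


def vboxmanage_commands(vm_name: str, disk_image: str,
                        memory_mb: int = 384) -> List[List[str]]:
    """Build VBoxManage argument lists that mirror the GUI steps above."""
    controller = "IDE"  # assumed label for the storage controller
    return [
        # Create and register a VM with a Linux/Ubuntu guest type.
        ["VBoxManage", "createvm", "--name", vm_name,
         "--ostype", "Ubuntu", "--register"],
        # Set the memory size (384 MB is the value recommended in the steps).
        ["VBoxManage", "modifyvm", vm_name, "--memory", str(memory_mb)],
        # Add a storage controller, then attach the existing disk image to it.
        ["VBoxManage", "storagectl", vm_name, "--name", controller, "--add", "ide"],
        ["VBoxManage", "storageattach", vm_name, "--storagectl", controller,
         "--port", "0", "--device", "0", "--type", "hdd", "--medium", disk_image],
        # Boot the VM.
        ["VBoxManage", "startvm", vm_name],
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for argv in vboxmanage_commands("cloudera-training",
                                    "cloudera-training-0.2-cl3.vmdk"):
        print(shlex.join(argv))
```

To execute the commands rather than print them, each argument list could be passed to subprocess.run with check=True, which stops at the first failing step.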

Advantages of virtual machine-hosted Hadoop

1)         A single image can be cloned, which lowers operations costs.

2)         Hadoop clusters can be set up on demand.

3)         Physical infrastructure can be reused.

4)         You only pay for the CPU time you need.

5)         The cluster size can be expanded or contracted on demand.

Differences between a virtual infrastructure and a physical datacenter

1)      Storage is usually one or more of transient virtual drives, transient local physical drives, persistent local virtual drives, or remote SAN-mounted block stores or file systems.

2)      Storage in virtual hard drives may cause a lot of seeking, even if it appears to be sequential access to the VM.

3)      Networking may be slower and throttled by the infrastructure provider.

4)      Virtual machines are requested on demand from the infrastructure; they may be allocated anywhere in it, possibly on servers running other VMs at the same time.

5)      The other VMs may be heavy CPU and network users which can cause the Hadoop jobs to suffer; alternatively the heavy CPU and network load of Hadoop can cause problems for the other users of the server.

6)      VMs can be suspended and restarted without OS notification; this can cause clocks to move forward in jumps of many seconds.

7)      Other users on the network may be able to listen to traffic, disrupt it, and access ports that do not authenticate all access.

8)      Some infrastructures may move VMs around; this can actually move clocks backwards when the new physical host’s clock is behind that of the original host.

9)      Replication to (transient) hard drives is no longer a reliable way to persist data.

10)  The network topology is not visible to the Hadoop cluster, though latency and bandwidth tests may be used to infer “closeness”, to build a de-facto topology.

11)  The correct way to deal with a VM that shows recurring failures is to release the VM and ask for a new one, instead of blacklisting it.

12)  The JobTracker may want to request extra VMs when there is extra demand.

13)  The JobTracker may want to release VMs when there is idle time.

14)  A failure of the hosting infrastructure can lose all machines simultaneously.
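The clock behaviour in items 6 and 8 matters to any code that measures timeouts or heartbeats with the wall clock. One way to notice a jump is to compare how much time has passed on the wall clock with how much has passed on a monotonic clock, which never runs backwards. The Python sketch below illustrates the idea; the names and the one-second tolerance are assumptions, and whether the monotonic clock keeps counting while a VM is suspended depends on the hypervisor and guest OS, so treat the result as a hint rather than a guarantee.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class ClockSample:
    wall: float       # seconds since the epoch; can jump in either direction
    monotonic: float  # seconds from an arbitrary origin; never decreases


def take_sample() -> ClockSample:
    """Read both clocks as close together as possible."""
    return ClockSample(wall=time.time(), monotonic=time.monotonic())


def wall_clock_jump(previous: ClockSample, current: ClockSample,
                    tolerance: float = 1.0) -> float:
    """Return the wall-clock jump in seconds, or 0.0 if within tolerance.

    A positive value means the wall clock leapt forward relative to the
    monotonic clock; a negative value means it moved backwards.
    """
    wall_elapsed = current.wall - previous.wall
    mono_elapsed = current.monotonic - previous.monotonic
    drift = wall_elapsed - mono_elapsed
    return drift if abs(drift) > tolerance else 0.0
```

A long-running process could call take_sample() periodically and log or compensate whenever wall_clock_jump() returns a non-zero value. Measuring durations with the monotonic reading alone avoids the backwards case entirely.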