Monday, September 30, 2013

How to Install Hadoop

EDIT: As of 10/11/2013 brew is not installing Hadoop correctly. Just ignore the brew parts and install Hadoop manually. 

This will be boring. I planned on making a post today about how to use Hadoop with all kinds of helpful tips and jokes and then spent the entire afternoon trying to get the stupid thing installed. There are a few tutorials floating around already, but they’re all out of date and somewhat incomplete. After tinkering around and taking bits from each one I finally got it working, so here it is. How to freakin’ install Hadoop.

I’ll be using a single-node, pseudo-distributed setup. This is the best configuration if you’re looking to learn how to use Hadoop. My machine is a MacBook Pro running OS X 10.8.5 with Java 1.6. Cluster users: may god have mercy on your soul. At the time of writing, the latest stable version is 1.2.1, but I’ll just say <version>.

First things first: if you don’t use homebrew by now, this is the time to get it. Everything in this tutorial is applicable to a manual installation, but homebrew is great and you should use it. Go get it, I'll wait. You may run into some problems with permissions because of homebrew, but these are easily solvable with a quick Google search and will only have to be fixed once. Install Hadoop with the following terminal command:
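With homebrew on your path, it’s a one-liner:

```shell
brew install hadoop
```

Homebrew unpacks everything under /usr/local/Cellar/hadoop/<version>/, which is where the configuration files we’ll be editing live.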


If you don’t want to use homebrew, get the latest stable build from the Apache Hadoop download mirrors by navigating to the stable/ directory and downloading the hadoop-<version>.tar.gz file. Then, unpack it to any directory you like.

Before getting started with Hadoop itself, we need to make sure ssh is enabled, and that self-login (i.e. ssh-ing into your own machine) can be done without a password prompt. First, navigate to System Preferences -> Sharing and make sure Remote Login is checked. I also suggest changing “Allow access for:” to “Only these users:”, and then adding only yourself to the list. This is optional, but adds some security.

Next, we need to set up an ssh-key so we can authenticate without entering a password. Just type the following into your terminal:
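The standard recipe comes straight from the Hadoop single-node guide (this assumes you don’t already have a key at ~/.ssh/id_rsa that you’d rather keep):

```shell
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
```

The first command generates a key pair with an empty passphrase; the second marks your own public key as authorized to log in.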


Make sure that worked by entering:
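A quick self-login does the trick:

```shell
ssh localhost
```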


It shouldn’t ask you for a password. 

Now we can start configuring Hadoop. There’s actually quite a bit of tweaking you can do, but we’ll keep it to the bare basics for now. First, navigate to your conf/hadoop-env.sh file and open it in a text editor (TextEdit works fine). If you installed with homebrew, the file is located at /usr/local/Cellar/hadoop/<version>/libexec/conf/hadoop-env.sh. It’s easiest to navigate to and open the file with the following commands:
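Assuming the homebrew layout (adjust the path for a manual install):

```shell
cd /usr/local/Cellar/hadoop/<version>/libexec/conf
open -e hadoop-env.sh
```

The -e flag tells open to use TextEdit specifically, rather than whatever app is associated with .sh files.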


We need to do two things with this file. First, we need to specify the location of your Java root directory for the JAVA_HOME variable. Finding the root directory is tricky, but after a bit of snooping around you should get it. I personally ended up setting this line to:
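Your exact path may differ, but on OS X 10.8 with Apple’s bundled Java 1.6, the usual value is:

```shell
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home
```

If that path doesn’t exist on your machine, running /usr/libexec/java_home in a terminal will print the right one.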


Make sure you get rid of the leading '#'. Second, add the following line to the bottom of the file to fix a known bug:
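The bug in question is the infamous “Unable to load realm info from SCDynamicStore” error that Hadoop throws on OS X, and the widely circulated fix is:

```shell
export HADOOP_OPTS="-Djava.security.krb5.realm= -Djava.security.krb5.kdc="
```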


We will now modify three more files in the same directory. Their <configuration> sections should be set as follows:

conf/core-site.xml:
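This is the stock pseudo-distributed setting from the Hadoop 1.x docs, pointing the default filesystem at HDFS on localhost:

```xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```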


conf/hdfs-site.xml:
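A replication factor of 1, since there’s only one node to replicate to:

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```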


conf/mapred-site.xml:
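And the JobTracker address, again per the standard single-node configuration:

```xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```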


Let’s take a moment to reflect on the fact that the last localhost port is in fact, over 9000.

It’s smooth sailing from here. From the command line (even you non-homebrew users), navigate to the directory that contains conf and bin, then format the distributed filesystem by typing:
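The format command, same as in the official quickstart:

```shell
bin/hadoop namenode -format
```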


There should be no errors in the output. Now, let’s test it to make sure it works. Start the hadoop daemons by typing:
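One script starts the NameNode, DataNode, JobTracker, and TaskTracker:

```shell
bin/start-all.sh
```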


Copy some input files to the distributed filesystem:
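The conf directory itself makes a convenient test payload:

```shell
bin/hadoop fs -put conf input
```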


Then execute an example run:
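The grep example from the bundled examples jar is a good first run; it searches the input for strings matching a regular expression:

```shell
bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
```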


You should get output similar to the following:


You can examine the output by typing:
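Read it straight out of HDFS:

```shell
bin/hadoop fs -cat output/*
```

(Or use `bin/hadoop fs -get output output` to copy the results to your local filesystem first.)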


Which should output something like:


Finally, stop all running daemon processes by typing:
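The mirror image of start-all.sh:

```shell
bin/stop-all.sh
```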



There you have it. If everything went without a hitch, Hadoop is now installed and configured on your machine. Stay tuned for next time when we look at MapReduce, what it should be used for, what it shouldn’t be used for, and how to use it.

UPDATE: I'm still working on the main Hadoop post, but I'm having a lot of issues getting everything ready. It seems like there isn't much, if any, support for the Eclipse plugin for Hadoop 1.2.1. I've spent several hours trying to get it set up, but haven't had any luck. It's technically possible to do MapReduce programming without IDE support, but much harder. I'll keep working on it when I have the time but I may have to just come back to it later. 
