In this tutorial I will describe the required steps for setting up a pseudo-distributed, single-node Hadoop cluster backed by the Hadoop Distributed File System, running on Ubuntu Linux.

This tutorial has been tested with the following software versions:

Ubuntu 13.04

Apache Hadoop 1.1.2 (Released on February 15th, 2013)

Prerequisites

Oracle Java 7

Hadoop 1.x requires a working Java 1.6+ (aka Java 6) installation. In this tutorial, I will describe the installation of Oracle Java 1.7.0 Update 21.

You can download a Java Development Kit (JDK) from Oracle's website. Then decompress it to /usr/lib/jvm/jdk1.7.0_21 (any other location works as well, as long as you adjust the paths below accordingly).

After installation, you should set environment variables as follows:

  1. Open ~/.bashrc

  2. Add the following statements:

JAVA_HOME=/usr/lib/jvm/jdk1.7.0_21
export JAVA_HOME

PATH=$PATH:$JAVA_HOME/bin
export PATH

Reload the file (source ~/.bashrc) or open a new terminal, then quickly check whether Oracle’s JDK is set up correctly:

$ java -version
java version "1.7.0_21"
Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
Java HotSpot(TM) Server VM (build 23.21-b01, mixed mode)
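
If java -version fails or reports a different version, it can help to check that JAVA_HOME really points at a JDK directory. The following is only a minimal sketch: the helper name check_java_home is hypothetical, and the fallback path is the one assumed in this tutorial.

```shell
# Hypothetical helper: report whether a directory looks like a usable JDK home,
# i.e. whether it contains an executable bin/java.
check_java_home() {
    java_home="$1"
    if [ -x "$java_home/bin/java" ]; then
        echo "ok: $java_home"
    else
        echo "missing: $java_home/bin/java"
        return 1
    fi
}

# Usage (falls back to the tutorial path if JAVA_HOME is not set):
#   check_java_home "${JAVA_HOME:-/usr/lib/jvm/jdk1.7.0_21}"
```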

Configuring SSH

Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it (which is what we want to do in this short tutorial). For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the user account you use on the system.

I assume that you have SSH up and running on your machine and have configured it to allow SSH public key authentication. If not, there are several online guides available.

First, we have to generate an SSH key for the user.

$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hzxie/.ssh/id_rsa):
Created directory '/home/hzxie/.ssh'.
Your identification has been saved in /home/hzxie/.ssh/id_rsa.
Your public key has been saved in /home/hzxie/.ssh/id_rsa.pub.
The key fingerprint is:
91:05:86:1e:22:d1:23:e2:49:26:6c:ef:7c:f7:8f:6b hzxie@XieHaozhe-Think
The key's randomart image is:
[...snipp...]

If you get the message “Connect to host localhost port 22: Connection refused” when connecting to localhost, which usually means that no SSH server is listening on your machine, do the following:

First, download these three files:

Then install them on your computer.

The ssh-keygen command above creates an RSA key pair with an empty passphrase. Generally, using an empty passphrase is not recommended, but in this case it is needed to unlock the key without your interaction (you don’t want to enter the passphrase every time Hadoop interacts with its nodes).

Second, you have to enable SSH access to your local machine with this newly created key.

$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
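
If public key login is refused later on, overly permissive file modes are a common cause, because sshd (with its default StrictModes setting) can reject key files that other users are able to write to. A minimal sketch, assuming the default ~/.ssh layout; secure_ssh_dir is a hypothetical helper name:

```shell
# Hypothetical helper: restrict an SSH directory and its authorized_keys file
# to the owning user only.
secure_ssh_dir() {
    ssh_dir="$1"
    chmod 700 "$ssh_dir"                   # only the owner may enter the directory
    chmod 600 "$ssh_dir/authorized_keys"   # only the owner may read or write the key list
}

# Usage:
#   secure_ssh_dir "$HOME/.ssh"
```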

The final step is to test the SSH setup by connecting to your local machine as your user. This step is also needed to save your local machine’s host key fingerprint to your user’s known_hosts file. If you have any special SSH configuration for your local machine, such as a non-standard SSH port, you can define host-specific SSH options in $HOME/.ssh/config (see man ssh_config for more information).

$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is 91:05:86:1e:22:d1:23:e2:49:26:6c:ef:7c:f7:8f:6b.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Welcome to Ubuntu 13.04 (GNU/Linux 3.8.0-25-generic i686)
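
For the host-specific options in $HOME/.ssh/config, an entry for localhost could be written roughly as sketched here. The port 2222 in the usage line is purely an example value and add_localhost_ssh_options is a hypothetical helper; Port and IdentityFile are standard ssh_config keywords.

```shell
# Hypothetical helper: append a host-specific block for localhost to an
# SSH client configuration file.
add_localhost_ssh_options() {
    config_file="$1"
    port="$2"
    cat >> "$config_file" <<EOF
Host localhost
    Port $port
    IdentityFile ~/.ssh/id_rsa
EOF
}

# Usage (2222 is only an example of a non-standard port):
#   add_localhost_ssh_options "$HOME/.ssh/config" 2222
```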

If the SSH connection fails, these general tips might help:

  • Enable debugging with ssh -vvv localhost and investigate the error in detail.
  • Check the SSH server configuration in /etc/ssh/sshd_config, in particular the options PubkeyAuthentication (which should be set to yes) and AllowUsers (if this option is active, add your user to it). If you made any changes to the SSH server configuration file, you can force a reload of the configuration with sudo /etc/init.d/ssh reload.
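
To look up the two options mentioned in the tips, a small helper like the following may be handy. It is a rough sketch only: sshd_option is a hypothetical name, and the helper ignores Match blocks and included files, so treat its answer as a hint rather than the effective server configuration.

```shell
# Hypothetical helper: print the value of an option from an sshd_config-style
# file. Only the first uncommented occurrence is reported; keywords are
# compared case-insensitively. Prints nothing if the option is not set.
sshd_option() {
    awk -v key="$2" '
        tolower($1) == tolower(key) {
            $1 = ""              # drop the keyword itself
            sub(/^ +/, "")       # trim the separator left behind
            print
            exit
        }' "$1"
}

# Usage:
#   sshd_option /etc/ssh/sshd_config PubkeyAuthentication
#   sshd_option /etc/ssh/sshd_config AllowUsers
```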

Hadoop

Installation

Download Hadoop from the Apache Download Mirrors and extract the contents of the Hadoop package to a location of your choice. I picked /opt/hadoop.
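
Extracting the package can be sketched as follows. Two assumptions are made here: the downloaded file is called hadoop-1.1.2.tar.gz, and the archive contains a single top-level hadoop-1.1.2/ directory, which is dropped so that the files end up directly in the target directory. extract_hadoop is a hypothetical helper name.

```shell
# Hypothetical helper: unpack a Hadoop release tarball into a target directory,
# stripping the top-level directory contained in the archive.
extract_hadoop() {
    tarball="$1"
    dest="$2"
    mkdir -p "$dest"
    tar -xzf "$tarball" -C "$dest" --strip-components=1
}

# Usage (make the target directory writable for your user first):
#   sudo mkdir -p /opt/hadoop && sudo chown "$USER" /opt/hadoop
#   extract_hadoop hadoop-1.1.2.tar.gz /opt/hadoop
```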

Update $HOME/.bashrc

export HADOOP_PREFIX=/opt/hadoop

Excursus: Hadoop Distributed File System (HDFS)

Before we continue, let us briefly look at Hadoop’s distributed file system.

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop project, which is part of the Apache Lucene project. (Source: The Hadoop Distributed File System: Architecture and Design)

Configuration

Our goal in this tutorial is a single-node setup of Hadoop. More information about what we do in this section is available on the Hadoop Wiki.

hadoop-env.sh

The only required environment variable we have to configure for Hadoop in this tutorial is JAVA_HOME. Open conf/hadoop-env.sh in the editor of your choice (if you used the installation path in this tutorial, the full path is /opt/hadoop/conf/hadoop-env.sh) and set the JAVA_HOME environment variable to the Oracle JDK7 directory.

export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_21

In this section, we will configure the directory where Hadoop will store its data files, the network ports it listens to, etc. Our setup will use Hadoop’s Distributed File System, HDFS, even though our little “cluster” only contains our single local machine.

You can leave the settings below “as is” with the exception of the hadoop.tmp.dir parameter – this parameter you must change to a directory of your choice. We will use the directory /app/hadoop/tmp in this tutorial. Hadoop’s default configurations use hadoop.tmp.dir as the base temporary directory both for the local file system and HDFS, so don’t be surprised if you see Hadoop creating the specified directory automatically on HDFS at some later point.

Now we create the directory and set the required ownerships and permissions:

$ sudo mkdir -p /app/hadoop/tmp
# replace user:user with the user and group that will run Hadoop
$ sudo chown user:user /app/hadoop/tmp
# ...and if you want to tighten up security, chmod from 755 to 750...
$ sudo chmod 750 /app/hadoop/tmp

If you forget to set the required ownerships and permissions, you will see a java.io.IOException when you try to format the name node in the next section.
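
To catch that problem before formatting, you can check as the user who will run Hadoop whether the directory exists and is writable. A minimal sketch; check_writable_dir is a hypothetical helper name:

```shell
# Hypothetical helper: report whether a directory exists and is writable
# for the user running this check.
check_writable_dir() {
    if [ -d "$1" ] && [ -w "$1" ]; then
        echo "writable: $1"
    else
        echo "not writable: $1"
        return 1
    fi
}

# Usage:
#   check_writable_dir /app/hadoop/tmp
```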

Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML file.

In file conf/core-site.xml:

<property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
</property>

<property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
    <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>

In file conf/mapred-site.xml:

<property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>

In file conf/hdfs-site.xml:

<property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication.
    The actual number of replications can be specified when the file is created.
    The default is used if replication is not specified in create time.
    </description>
</property>

See Getting Started with Hadoop and the documentation in Hadoop’s API Overview if you have any questions about Hadoop’s configuration options.
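
For orientation, a complete conf/core-site.xml with both properties placed inside the root configuration element might be generated as sketched here. The descriptions are left out for brevity, and write_core_site is a hypothetical helper; the property names and values are the ones used in this tutorial.

```shell
# Hypothetical helper: write a minimal core-site.xml to the given path.
# Note that this overwrites the target file.
write_core_site() {
    cat > "$1" <<'EOF'
<?xml version="1.0"?>
<configuration>
<property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
</property>
<property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
</property>
</configuration>
EOF
}

# Usage (path as used in this tutorial):
#   write_core_site /opt/hadoop/conf/core-site.xml
```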

Formatting the HDFS filesystem via the NameNode

The first step to starting up your Hadoop installation is formatting the Hadoop filesystem which is implemented on top of the local filesystem of your “cluster” (which includes only your local machine if you followed this tutorial). You need to do this the first time you set up a Hadoop cluster.

Do not format a running Hadoop filesystem as you will lose all the data currently in the cluster (in HDFS)!

To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command:

hzxie@XieHaozhe-Think:~$ /opt/hadoop/bin/hadoop namenode -format

The output will look like this:

hzxie@XieHaozhe-Think:/opt/hadoop/bin$ ./hadoop namenode -format
13/06/15 22:49:27 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = XieHaozhe-Think/127.0.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.1.2
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1 -r 1440782; compiled by 'hortonfo' on Thu Jan 31 02:03:24 UTC 2013
************************************************************/
13/06/15 22:49:27 INFO util.GSet: VM type = 32-bit
13/06/15 22:49:27 INFO util.GSet: 2% max memory = 17.77875 MB
13/06/15 22:49:27 INFO util.GSet: capacity = 2^22 = 4194304 entries
13/06/15 22:49:27 INFO util.GSet: recommended=4194304, actual=4194304
13/06/15 22:49:29 INFO namenode.FSNamesystem: fsOwner=hzxie
13/06/15 22:49:29 INFO namenode.FSNamesystem: supergroup=supergroup
13/06/15 22:49:29 INFO namenode.FSNamesystem: isPermissionEnabled=true
13/06/15 22:49:29 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
13/06/15 22:49:29 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
13/06/15 22:49:29 INFO namenode.NameNode: Caching file names occuring more than 10 times
13/06/15 22:49:29 INFO common.Storage: Image file of size 115 saved in 0 seconds.
13/06/15 22:49:29 INFO namenode.FSEditLog: closing edit log: position=4, editlog=/tmp/hadoop-hzxie/dfs/name/current/edits
13/06/15 22:49:29 INFO namenode.FSEditLog: close success: truncate to 4, editlog=/tmp/hadoop-hzxie/dfs/name/current/edits
13/06/15 22:49:29 INFO common.Storage: Storage directory /tmp/hadoop-hzxie/dfs/name has been successfully formatted.
13/06/15 22:49:29 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at XieHaozhe-Think/127.0.1.1
************************************************************/
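
Because formatting destroys existing HDFS data, it can be worth checking whether a name directory has already been formatted before running the command again. The sketch below assumes that a formatted Hadoop 1.x name directory contains a current/VERSION file; namenode_is_formatted is a hypothetical helper name, and the directory to pass is the one reported in the "Storage directory ... has been successfully formatted" line of your own output.

```shell
# Hypothetical helper: succeed if the given NameNode directory already holds
# a formatted filesystem (detected by the presence of current/VERSION).
namenode_is_formatted() {
    [ -f "$1/current/VERSION" ]
}

# Usage (the path is taken from the example output above):
#   if namenode_is_formatted /tmp/hadoop-hzxie/dfs/name; then
#       echo "already formatted, not formatting again"
#   else
#       /opt/hadoop/bin/hadoop namenode -format
#   fi
```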

Starting your single-node cluster

Run the command:

hzxie@XieHaozhe-Think:~$ /opt/hadoop/bin/start-all.sh

This will start up a NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker on your machine.

The output will look like this:

starting namenode, logging to /opt/hadoop/libexec/../logs/hadoop-hzxie-namenode-XieHaozhe-Think.out
localhost: starting datanode, logging to /opt/hadoop/libexec/../logs/hadoop-hzxie-datanode-XieHaozhe-Think.out
localhost: starting secondarynamenode, logging to /opt/hadoop/libexec/../logs/hadoop-hzxie-secondarynamenode-XieHaozhe-Think.out
starting jobtracker, logging to /opt/hadoop/libexec/../logs/hadoop-hzxie-jobtracker-XieHaozhe-Think.out
localhost: starting tasktracker, logging to /opt/hadoop/libexec/../logs/hadoop-hzxie-tasktracker-XieHaozhe-Think.out

A nifty tool for checking whether the expected Hadoop processes are running is jps (part of Sun’s Java since v1.5.0). See also How to debug MapReduce programs.

hzxie@XieHaozhe-Think:/opt/hadoop/bin$ jps
4662 SecondaryNameNode
4439 DataNode
5883 Jps
4971 TaskTracker
5221 NameNode
4755 JobTracker
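
Comparing the jps output with the list of expected daemons by eye is error-prone, so the check can be scripted. A minimal sketch, assuming the five daemon names shown in the output above; missing_daemons is a hypothetical helper that reads jps output on standard input:

```shell
# Hypothetical helper: read jps output on stdin and print the name of every
# expected Hadoop 1.x daemon that does not appear in it.
missing_daemons() {
    jps_output=$(cat)
    for daemon in NameNode SecondaryNameNode DataNode JobTracker TaskTracker; do
        # Compare the second column exactly, so that "NameNode" does not
        # match the "SecondaryNameNode" line.
        if ! printf '%s\n' "$jps_output" |
                awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            echo "$daemon"
        fi
    done
}

# Usage: prints nothing when all five daemons are running.
#   jps | missing_daemons
```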

Stopping your single-node cluster

Run the command

hzxie@XieHaozhe-Think:~$ /opt/hadoop/bin/stop-all.sh

to stop all the daemons running on your machine.

Example output:

stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode

Hadoop Web Interfaces

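Hadoop comes with several web interfaces, which are available by default at these locations (the defaults are defined in hdfs-default.xml and mapred-default.xml):

  • http://localhost:50070/ (web UI of the NameNode daemon)
  • http://localhost:50030/ (web UI of the JobTracker daemon)
  • http://localhost:50060/ (web UI of the TaskTracker daemon)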

These web interfaces provide concise information about what’s happening in your Hadoop cluster. You might want to give them a try.
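
If you prefer the command line, you can probe the interfaces with curl. The sketch below assumes the Hadoop 1.x default ports (50070 for the NameNode, 50030 for the JobTracker, 50060 for the TaskTracker); if you changed them in your configuration, adjust the numbers. hadoop_web_uis is a hypothetical helper name.

```shell
# Hypothetical helper: print the web UI URLs for a host, assuming the
# Hadoop 1.x default ports.
hadoop_web_uis() {
    host="${1:-localhost}"
    echo "http://$host:50070/"   # NameNode
    echo "http://$host:50030/"   # JobTracker
    echo "http://$host:50060/"   # TaskTracker
}

# Usage: print the HTTP status code returned by each interface (requires curl).
#   for url in $(hadoop_web_uis); do
#       printf '%s ' "$url"
#       curl -s -o /dev/null -w '%{http_code}\n' "$url"
#   done
```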
