Tag Archives: Hadoop

Getting Started with Hadoop 2.0

Apache™ Hadoop® is an open-source software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software’s ability to detect and handle failures at the application layer. Hadoop 1 popularized MapReduce programming for batch jobs and demonstrated the potential value of large-scale, distributed processing. MapReduce, as implemented in Hadoop 1, can be I/O intensive, is not suitable for interactive analysis, and is constrained in its support for graph, machine learning, and other memory-intensive algorithms. Hadoop developers rewrote major…

Running Hadoop 1.1.2 on Ubuntu Linux (Single-Node Cluster)

In this tutorial I describe the steps required to set up a pseudo-distributed, single-node Hadoop cluster backed by the Hadoop Distributed File System (HDFS), running on Ubuntu Linux.

This tutorial has been tested with the following software versions:

  • Ubuntu 13.04
  • Apache Hadoop 1.1.2 (released on February 15th, 2013)

Prerequisites

Oracle Java 7

Hadoop requires a working Java 1.5+ (aka Java 5) installation. In this tutorial, I describe the installation of Java 1.7.0 Update 21. You can download a Java Development Kit (JDK) from Oracle's website, then decompress it to /usr/lib/jvm/jdk1.7.0_21 (any other location works as well). After installation, set the environment variables as follows:

1. Open ~/.bashrc
2. Add the following statements:

You can make a quick check whether Oracle’s JDK…
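As a rough illustration of the environment-variable step described above, the lines added to ~/.bashrc might look like the sketch below. The exact statements are not shown in this excerpt, so the variable names here (JAVA_HOME and PATH) are an assumption based on common JDK setups; only the install path comes from the text.

```shell
# Assumption: the JDK was decompressed to the path named in the tutorial.
# Adjust this if you unpacked it somewhere else.
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_21

# Put the JDK's bin directory first so its java and javac take precedence.
export PATH="$JAVA_HOME/bin:$PATH"
```

After saving the file, opening a new terminal (or running `source ~/.bashrc`) and then `java -version` should report the version of whichever JDK is first on the PATH.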

Contact Us
  • Nanyang Technological University, Singapore
  • root [at] haozhexie [dot] com