Sunday, April 6, 2014

CS131: Hash Tables

A hash table (also called a map or dictionary) is an abstract collection of items in which any data item can be stored and accessed with an associated key. Both the data and the key can be of any type.

The word hash refers to the transformation applied to a given key.

Direct address tables

If the total number of possible keys is reasonably small we may use a direct address table, where the key itself is the index into the data array. A plain array is, in effect, a direct address table.

Search(), Insert(), and Delete() are all trivial on a direct address table: each is a single array access and takes O(1) time.

Hash function

When a direct address table is not a feasible option, for example when the space of possible keys is too large, we use a function that takes the key k and returns an index i into the table.

        i = h(k)   [ 0 <= i <= M-1, where M is the total number of buckets in the table ]

If the number of possible keys is larger than the number of available buckets in the hash table, two keys may map to the same index. This situation is called a collision.

Let us consider the following keys

5, 13, 15, 17, 21, 24

We define the hash function as

    h(k) = k mod M
  index:   0    1    2    3    4    5    6
         +----+----+----+----+----+----+----+
         | 21 | 15 |    | 17 |    |  5 | 13 |
         +----+----+----+----+----+----+----+

Fig: The mapping of the first five keys for M=7

Now for the first five keys (5, 13, 15, 17, 21) we can place each key in its bucket without any problem. But when we try to insert 24 we find 24 mod 7 = 3, and 17 is already mapped to bucket 3. We will look at a few techniques to solve this problem.
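The collision above can be reproduced in a few lines (a Python sketch; M and the keys are taken from the example):

```python
# Map each example key to its bucket with h(k) = k mod M, M = 7.
M = 7
keys = [5, 13, 15, 17, 21, 24]
buckets = {}
for k in keys:
    i = k % M
    buckets.setdefault(i, []).append(k)

print(buckets)  # 17 and 24 both land in bucket 3
```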

It is possible to create a very large hash table to minimize the number of collisions, but we would waste a lot of memory if the table is mostly empty. The average number of elements per bucket is called the load factor:

Load factor = n/M

The larger the load factor, the less space is wasted (at the cost of more collisions per bucket).

Choosing a hash method

A good hash function should minimize collisions, be easy to compute, and distribute keys evenly across the available buckets.

The division method

h(k) = k mod M

For a good choice of M, take a prime number that is not close to a power of 2.

If M = 2^p, then h(k) depends only on the p lowest-order bits of k, so keys that differ only in their higher-order bits (for example, keys sharing the same last characters or digits) map to the same bucket.

If M = 2^p - 1 and keys are interpreted as radix-2^p character strings, then permuting a key's characters does not change its hash (since 2^p ≡ 1 mod 2^p - 1), so keys with the same set of characters/digits map to the same bucket.
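The first pitfall is easy to demonstrate (a Python sketch; M = 256 and the sample keys are my own choices):

```python
# Why M = 2^p is a poor modulus: with M = 256 = 2^8,
# h(k) = k mod 256 depends only on the last byte of k,
# so keys that differ only in higher bits all collide.
M = 256
keys = [0x1234AB, 0x9999AB, 0x00AB]  # same low byte 0xAB
indexes = [k % M for k in keys]
print(indexes)  # all three map to the same bucket
```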

The multiplication method

For all k, h(k) = floor(M * (kA - floor(kA))), where A is a constant with 0 < A < 1; the quantity kA - floor(kA) is the fractional part of kA. The value of M is not critical here.
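A small Python sketch of the multiplication method; the constant A = (sqrt(5) - 1) / 2 is Knuth's commonly suggested choice, not something fixed by the formula (any 0 < A < 1 works):

```python
import math

def h(k, M, A=(math.sqrt(5) - 1) / 2):
    frac = k * A - math.floor(k * A)  # fractional part of k*A
    return math.floor(M * frac)

M = 8  # M is not critical; a power of 2 is fine here
indexes = [h(k, M) for k in [5, 13, 15, 17, 21, 24]]
```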

Collision resolution by chaining

When two keys map to the same index we can use a linked list for that bucket to store multiple values. If there are M buckets we maintain M linked lists, each of which may contain zero or more items.

Head(1) [*]-->[Value 1 | * ]-->[Value 2 | $ ]
Head(2) [*]-->[Value 3 | $ ]
Head(3) [*]-->[Value 5 | * ]-->[Value 6 | * ]-->[Value 7 | $ ]
Head(4) [$]
Head(5) [*]-->[Value 8 | $ ]

Fig: Using linked list for collision resolution
I <-- h(k) + 1   [ 1 <= I <= M ]
If bucket I is not empty
    Do
        If k = KEY[I] return I
        I = Link[I]
    While (I != 0)
Return not found
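A minimal chained hash table sketch in Python (lists play the role of the per-bucket linked lists, and M = 7 as in the earlier example; the class and method names are my own):

```python
class ChainedHashTable:
    def __init__(self, M=7):
        self.M = M
        self.buckets = [[] for _ in range(M)]  # one chain per bucket

    def insert(self, key, value):
        self.buckets[key % self.M].append((key, value))

    def search(self, key):
        # Walk the chain for this bucket until the key is found.
        for k, v in self.buckets[key % self.M]:
            if k == key:
                return v
        return None

t = ChainedHashTable()
t.insert(17, "a")
t.insert(24, "b")  # collides with 17: both hash to bucket 3
```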

Open addressing with Linear probing and insertion

Open addressing with Double hashing

Deletion

We cannot simply clear a bucket to delete a key, since more than one key may have hashed to the same bucket.

Deletion with linear probing

Universal hashing

For any fixed hash function it is possible for someone to come up with a set of keys that all map to the same bucket (or to a very small number of buckets), degrading the hash table into a linked list where accessing any item takes O(n) time instead of the expected O(1).

To solve this problem we can define a family of different hash functions and pick one at random when we initialize the hash table, before first use. This technique is called universal hashing, and any operation on an element is then expected to take O(1) average time.

Once chosen, the hash function is not changed for the lifetime of the hash table, so that each key keeps hashing to the same bucket.
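A sketch of one standard universal family, h_{a,b}(k) = ((a*k + b) mod p) mod M, with a and b chosen at random; the prime p = 2**31 - 1 is my own assumption (any prime larger than the largest key works):

```python
import random

def make_hash(M, p=2**31 - 1):
    # Pick a and b once, at table-creation time.
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda k: ((a * k + b) % p) % M

h = make_hash(7)       # chosen once, before first use ...
i1, i2 = h(24), h(24)  # ... and never changed, so h(24) is stable
```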

Linear Hashing

The hash table grows or shrinks as items are inserted or deleted. Despite the name, linear hashing is not related to linear probing.

If N is the initial number of buckets, the total number of buckets grows as 2N, 4N, 8N, 16N, etc., that is, a power of 2 times N. The exponent used is called the level, so the total number of buckets is 2^level * N; at level 2, for example, there are 4N buckets.

Linear hashing is implemented using a dynamic array, a variable-size array that allows random access of keys. When the size of the array changes, only one bucket is split (or merged) at a time, so the cost of resizing is spread across many operations.

Extendible Hashing

Growing a hash table requires rehashing the keys, which can be a big performance hit when it happens. To avoid this situation, the extendible hashing technique uses a trie as a hierarchical directory so that rehashing can be done incrementally, one bucket at a time, as the hash table grows.

CS131: Priority Queue

A priority queue is a data structure that stores a set of entries, each with an associated key, and maintains a total order on the keys.

A priority queue can be either a min priority queue, which retrieves keys in increasing order, or a max priority queue, which retrieves keys in decreasing order. In this chapter we will use a min priority queue.

A priority queue can identify and remove the smallest key very quickly, and an item can be inserted at any time.

Operations:

A priority queue supports the following operations:

Insert - inserts an item into the priority queue
Min - returns the item with the smallest key
RemoveMin - removes the item with the smallest key and returns it

Here are a few examples of the operations:

[ | | | | | ]

Insert(4)

[4| | | | | ]

Insert(8)

[4|8| | | | ]

Insert(2)

[2|4|8| | | ]

Min() -> returns 2 and the priority queue is unchanged

[2|4|8| | | ]

RemoveMin() -> returns 2 and removes it from the priority queue

[4|8| | | | ]
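The operation sequence above can be sketched with Python's heapq module, which implements a min priority queue over a plain list:

```python
import heapq

pq = []
heapq.heappush(pq, 4)        # Insert(4)
heapq.heappush(pq, 8)        # Insert(8)
heapq.heappush(pq, 2)        # Insert(2)
smallest = pq[0]             # Min() -> 2, queue unchanged
removed = heapq.heappop(pq)  # RemoveMin() -> 2
```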

A priority queue can be implemented using a binary heap.

Complete Binary Tree: a binary tree in which every level of the tree is full except possibly the bottom level, which is filled from left to right. A complete binary tree has 1 + floor(log2 n) levels, where n is the number of items.

Fig: Complete Binary Tree

Binary heap

A binary heap is a complete binary tree whose entries satisfy the heap order property: no child has a key less than its parent's key. Multiple copies of the same key are allowed.

                (1)
             /       \
         /               \
      (3)                 (4) 
     /   \               /   \
   /       \           /       \
  (9)       (11)      (12)      (17)
 /   \     /   \     /
(13) (14) (15) (16) (12)

          Fig: Binary heap

Storing a binary heap in an array

We do a level-order traversal, visiting each level from left to right, and store the keys sequentially in an array. Doing this for the binary heap in the figure above gives the following array.

[x|1|3|4|9|11|12|17|13|14|15|16|12|---]

Notice that we have kept the first cell unused.

In the array, item i's children are located at indexes 2i and 2i+1, and item i's parent is located at index floor(i/2).

For example, the item at index 4 is 9. Its children are the 8th and 9th items, with keys 13 and 14. Its parent is at index 2, which holds 3. You may verify these keys against the binary heap figure above.
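The index arithmetic can be checked directly against the array above (a Python sketch; None stands in for the unused cell 'x' at index 0):

```python
heap = [None, 1, 3, 4, 9, 11, 12, 17, 13, 14, 15, 16, 12]
i = 4
children = (heap[2 * i], heap[2 * i + 1])  # items at indexes 8 and 9
parent = heap[i // 2]                      # item at index 2
```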

Operations

Min():
Return the entry at the root. In the array the root is stored at index 1. 
This is a constant time operation.
Insert(k):
1. Place k at the first open position on the bottom level, from the left. In the
   array, put k in the first empty slot after the existing items.
2. Bubble the element up until the heap order property is satisfied:
        While the item is not at the root and its parent's key is bigger than
        the item's key, swap the item with its parent.

Fig: Insert example

RemoveMin():
    1. Remove the entry at the root.
    2. Take the last entry on the bottom level of the tree, from the left, and put it
       at the root position. In the array this is the last item.
    3. Bubble the root entry down through the heap:
         while the item's key is larger than either child's key, swap it with its smaller child.

Fig: Remove example

Insert and RemoveMin take Theta(log n) time in the worst case and Theta(1) in the best case.
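The Insert and RemoveMin steps above can be sketched as a small Python class using a 1-based array, matching the storage scheme described earlier (a minimal sketch, not production code; an empty heap is not handled):

```python
class MinHeap:
    def __init__(self):
        self.a = [None]  # index 0 unused, as in the array figure

    def min(self):
        return self.a[1]

    def insert(self, k):
        self.a.append(k)          # first empty slot at the end
        i = len(self.a) - 1
        while i > 1 and self.a[i // 2] > self.a[i]:  # bubble up
            self.a[i], self.a[i // 2] = self.a[i // 2], self.a[i]
            i //= 2

    def remove_min(self):
        root = self.a[1]
        self.a[1] = self.a[-1]    # move the last entry to the root
        self.a.pop()
        i, n = 1, len(self.a) - 1
        while 2 * i <= n:         # bubble down via the smaller child
            c = 2 * i
            if c + 1 <= n and self.a[c + 1] < self.a[c]:
                c += 1
            if self.a[i] <= self.a[c]:
                break
            self.a[i], self.a[c] = self.a[c], self.a[i]
            i = c
        return root

h = MinHeap()
for k in [4, 8, 2]:
    h.insert(k)
```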

Bottom up heap construction

If we are given an array of items and need to create a heap from it, we can insert the items into the tree one by one, which takes Theta(n log n) time, or we can use the bottom-up heap construction technique to do it in Theta(n).

BottomUpHeap(itemArray):
1. Make a complete binary tree without considering order. For a given array of
items no operation is required.
2. For each non-leaf node, starting from the last one, bubble the item
down (swap with the smaller child) until its key is no larger than both of its children's keys.

Fig: Bottom up heap construction example
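The two steps above can be sketched in Python on a 1-based array (the last non-leaf is at index n // 2; the example input array is my own):

```python
def build_heap(a):
    # a[0] is an unused placeholder; items live at a[1..n].
    n = len(a) - 1
    for i in range(n // 2, 0, -1):  # each non-leaf, last one first
        j = i
        while 2 * j <= n:           # bubble down via the smaller child
            c = 2 * j
            if c + 1 <= n and a[c + 1] < a[c]:
                c += 1
            if a[j] <= a[c]:
                break
            a[j], a[c] = a[c], a[j]
            j = c
    return a

a = build_heap([None, 9, 4, 17, 1, 11, 3])
```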

CS131: Queue

A queue is a linear list data structure. The insert and remove operations are performed at opposite ends of the list, called the rear and the front respectively.

enqueue   ---->   |  3 |  4 |  2 |  <------ dequeue
              rear              front

Fig: Queue showing rear and front

A queue can be implemented using a circular array, or a linked list that maintains references to the rear and front items.

enqueue   -----+                       +------ dequeue
               |                       |    
               v  rear           front v
     +---->  |  |  3 |  4 |  2 | 7 | 10 |  |  | >-----+
     |                                                |
     +------------------------------------------------+

Fig: Circular list implementation using array
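A Python sketch of the circular-array approach from the figure: enqueue advances the rear and dequeue advances the front, both wrapping around modulo the capacity (the class and method names are my own):

```python
class CircularQueue:
    def __init__(self, capacity):
        self.items = [None] * capacity
        self.front = 0   # index of the oldest item
        self.count = 0

    def enqueue(self, item):
        if self.count == len(self.items):
            raise OverflowError("queue is full")
        rear = (self.front + self.count) % len(self.items)
        self.items[rear] = item
        self.count += 1

    def dequeue(self):
        if self.count == 0:
            raise IndexError("queue is empty")
        item = self.items[self.front]
        self.front = (self.front + 1) % len(self.items)
        self.count -= 1
        return item

q = CircularQueue(3)
for x in [3, 4, 2]:
    q.enqueue(x)
first = q.dequeue()  # FIFO: 3 comes out first
q.enqueue(7)         # rear wraps around into the freed cell
```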

enqueue   -----+                       +------ dequeue
               |                       |    
               v  rear           front v
   [head]---> [3|*]-->[4|*]-->[2|*]-->[9|*]--->$

Fig: Linked list implementation

Operations

    public interface Queue
    {
        /// <summary>
        /// Inserts an item to the rear of the list of items.
        /// </summary>
        /// <param name="item">item to be inserted</param>
        void enqueue(object item);

        /// <summary>
        /// Removes an item from the front of the list of items and returns it.
        /// </summary>
        /// <returns>The removed item</returns>
        object dequeue();

        /// <summary>
        /// Returns the item in front of the list of items. The list is not altered in any way.
        /// </summary>
        /// <returns>The item at the front of the queue</returns>
        object front();

        /// <summary>
        /// Check if the queue is empty or not
        /// </summary>
        /// <returns>true if queue is empty, false otherwise</returns>
        bool empty();

        /// <summary>
        /// Checks if the queue is full or not
        /// </summary>
        /// <returns>true if queue is full, false otherwise</returns>
        bool full();

        /// <summary>
        /// Calculates the number of items in the queue
        /// </summary>
        /// <returns>Number of items in the queue</returns>
        int size();
    }

Implementation

    public class Node
    {
        public Node next, previous;
        public object data;
    }


    public class LinkedListQueue : Queue
    {
        Node rear_item, front_item;
        int item_count;

        public LinkedListQueue()
        {
            rear_item = front_item = null;
            item_count = 0;
        }

        public void enqueue(object data)
        {
            Node new_item = new Node();
            new_item.data = data;
            new_item.previous = null;
            new_item.next = rear_item;

            if (rear_item != null)
            {
                rear_item.previous = new_item;
            }

            rear_item = new_item;

            if (item_count == 0)
            {
                front_item = new_item;
            }

            item_count++;
        }

        public object dequeue()
        {
            if (front_item == null)
            {
                return null;
            }

            object data = front_item.data;

            item_count--;

            if (item_count == 0)
            {
                rear_item = null;
                front_item = null;
            }
            else
            {
                front_item = front_item.previous;
                front_item.next = null;
            }

            return data;
        }

        public object front()
        {
            if (front_item == null)
            {
                return null;
            }

            return front_item.data;
        }

        public bool empty()
        {
            return (item_count == 0);
        }

        public bool full()
        {
            return false;
        }

        public int size()
        {
            return item_count;
        }

        public static void Test()
        {
            Queue queue = new LinkedListQueue();
            object data = 10;
            queue.enqueue(data);
            object item = queue.dequeue();
            TestEngine.AssertEquals(item, data, "Dequeue does not return expected data");
        }
    }    

Deque

A double-ended queue, where insertion and deletion are allowed at both ends of the list.

Problem: Implement a deque

CS131: Stack

A stack is a linear list data structure that allows all operations at one end of the list, called the top. The item at the top was inserted last, and no item other than the top one may be accessed. Since the last item inserted is the first one that can be accessed, a stack is called a Last In First Out (LIFO) data structure.

Fig: Stack with top, insert and remove positions

Stack supports three operations- push, pop and top

Push operation

Inserts an item on top of the stack.

push(item)
    if top >= max items allowed - 1  //overflow
        throw error
    top = top + 1
    stack[top] = item

Pop operation

Removes the item at the top from the stack and returns it.

Pop()
    if top<0
        throw error

    item= stack[top]
    top=top-1
    return item

Top operation

Returns the item at top of the stack

Top()
    if top<0
        throw error  //underflow
    return stack[top]
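The three operations can be sketched as a small Python class following the pseudocode above: top is the index of the top item and starts at -1 for an empty stack (the class name and exception choices are my own):

```python
class Stack:
    def __init__(self, max_items):
        self.items = [None] * max_items
        self.top = -1            # empty stack

    def push(self, item):
        if self.top >= len(self.items) - 1:
            raise OverflowError("stack overflow")
        self.top += 1
        self.items[self.top] = item

    def pop(self):
        if self.top < 0:
            raise IndexError("stack underflow")
        item = self.items[self.top]
        self.top -= 1
        return item

    def peek(self):
        if self.top < 0:
            raise IndexError("stack underflow")
        return self.items[self.top]

s = Stack(7)
for k in [15, 18, 3]:
    s.push(k)
popped = s.pop()  # LIFO: the last item pushed comes off first
```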

Stack operation example

Start with an empty stack

[    |    |     |    |     |    |     ]  top=-1

Push(15)

[ 15 |    |     |    |     |    |     ] top = 0

Push(18)

[ 15 | 18 |     |    |     |    |     ] top = 1

Push(3)

[ 15 | 18 |  3  |    |     |    |     ] top = 2

Pop() ---> returns 3

[ 15 | 18 |     |    |     |    |     ] top = 1

Overflow and underflow

A stack usually has an upper limit on the number of items that can be inserted before we run out of memory. If we try to push an item when the stack is full and no more memory can be allocated, an error condition occurs. This condition is known as overflow.

Another error condition occurs when the stack is empty and pop() is called. This is called underflow.

Fixed and Dynamic stack

A stack can be implemented such that the maximum amount of memory it may use is allocated when the stack is created, and this amount is fixed. This type of implementation may use a fixed-size array.

Another implementation may allocate memory dynamically as the stack grows, and overflow occurs only when the application can no longer allocate more memory. When items are popped from the stack the extra memory may be released. This type of implementation may use a linked list that grows and shrinks dynamically.

CS131: Linked List

A linked list is a linear data structure where each element keeps a pointer to track the elements in linear order. If the first element of the list is accessible, it is possible to reach any element of the list. Each element of the list is called a node, and it stores data and a reference to the next node of the list.
We use a special node marked as head to reference the first node of the list.
In a singly linked list each node stores data and keeps reference to the next node. First node is referenced by head and last node does not have a next node reference.
[head|o]--->[data1|o]--->[data2|o]--->[data3|o]--->%
Fig: A singly linked list
In a doubly linked list each node stores data and keeps references to both the previous node and the next node. The first node does not have a previous node reference and the last node does not have a next node reference.
[head|o]--->[o|data1|o]--->[o|data2|o]--->[o|data3|o]--->%
             |     ^        |    ^         |
       %<----+     +--------+    +---------+
Fig: A doubly linked list
A circular list is similar to a singly linked list, except that the last node references the first node as its next node. Head points to the first item, which can be any item in the list because of the circular nature of the list.
[head|o]--->[data1|o]--->[data2|o]--->[data3|o]---+
              ^                                   |
              |                                   |
              +-------------[o|data4]<------------+
Fig: A circular linked list

Representing linked list

class SNode
{  
    object data;
    SNode next;
};

class DNode
{  
    object data;
    DNode next, prev;
};

class SList
{      
    SNode head;
    int itemCount;
};

class DList
{      
    DNode head;
    int itemCount;
};

Insert Operation

We want to insert a node at the front of the list
void insert(SList list, object data)
{
    SNode node = new SNode();
    node.data = data;
    node.next = list.head;
    list.head = node;
}
Quiz: How would you modify the list or the insert method to insert an item at the end of the list in O(1) time?
Quiz: How would you implement a stack using a linked list?
Problem: Implement a doubly linked list
Problem: Implement a circular list

Find operation

Given a data value, we need to find the node that stores it. We use the head reference to get the first item in the list and follow the next references until we find the item with the given data or we reach the end of the list.
SNode find(SList list, object data)
{
    SNode cur = list.head;
    while(cur != null && !cur.data.equals(data))
    {
        cur = cur.next;
    }
    return cur;
}
Since there is no way to jump to the middle of the list directly, find takes O(n) time even for a sorted list.

Delete operation

Let us consider the following list:
[head|o]--->[  5 |o]--->[  9 |o]--->[  6 |o]--->[ 13 |o]--->%
If we want to delete the element with data 9 the list will be like this after deletion:
[head|o]--->[  5 |o]--->[  6 |o]--->[ 13 |o]--->%
One way to do this is to find the previous element in the list and copy the next reference from the node we are deleting into it, which takes O(n) time. If the node is not the last node in the list we can do it in O(1) time instead: copy the data and next reference from the next node, then delete the next node.
[head|o]--->[  5 |o]--->[  6 |o]-X->[  6 |o]--->[ 13 |o]--->%
                            |       delete        ^
                            +---------------------+
void delete(SList list, SNode node)
{
    if(node.next != null)
    {
        node.data = node.next.data;
        node.next = node.next.next;
    }
    else
    {
        if(list.head == node)
        {
            // deleting the only node in the list
            list.head = null;
            return;
        }

        SNode cur = list.head;
        while(cur.next != node)
        {
            cur = cur.next;
        }
        cur.next = node.next;                
    }
}
Most of the time the delete operation will complete in constant time, but for the last node it takes O(n) time. To avoid this situation we can use a special node, called a sentinel node, to mark the end of the list. Then, if we need to delete the last real node of the list, we just mark that node as the sentinel.
Problem: Implement the list that uses a sentinel node to mark end of list and rewrite the delete operation by using sentinel node so that it always completes in constant time.

Sunday, December 8, 2013

My Private Cloud

It is costly to rent systems from Azure/Amazon for regular experiments at home for learning purposes when I need many machines. I have four machines lying around at home with a lot of RAM and 32 processor cores in total. I bought them second hand from ebay for less than $500. These machines would not be economical for datacenters because of their low processing power and RAM capacity, but for experimental purposes I can create hundreds of virtual machines easily (most of them won't be running most of the time).

Now, installing an operating system on each of the virtual machines is a very time-consuming task, so I decided to go with MAAS on Ubuntu Server so that it handles the OS installation. Also, with juju I can deploy a lot of services (hadoop, mongodb, django and many more) with a single command.

I will not be exposing the services outside my home to other people, so I do not need high bandwidth, but I still want to see how it works from the Internet.

I split my home network into two subnetworks (192.168.0.0 and 192.168.1.0). One powers my home equipment and the other is for my private cloud machines.

I have a router (192.168.0.1) that is connected to Internet and works as gateway to Internet.

I have another system (cloud gateway) running Linux with two network interfaces (192.168.0.100, 192.168.1.2) and configured as a router for my private cloud and works as gateway. It has forward packet rule enabled in iptables.

I have one virtual machine on the cloud gateway that I have configured as MAAS server.

Cloud Gateway


Dual interface 192.168.0.100, 192.168.1.2

192.168.0.100 is connected to Internet router by direct link [GW + DNS/DHCP: 192.168.0.1  ].
192.168.1.2 is connected to cloud machines switch (no DNS, DHCP - managed by MAAS server 192.168.1.3).

/etc/network/interfaces

#while br0 is using dhcp the IP is static and managed by router DHCP server
auto br0
iface br0 inet dhcp
        bridge_ports eth0
        bridge_stp off
        bridge_fd 0
        bridge_maxwait 0

auto br1
iface br1 inet static
        address 192.168.1.2
        netmask 255.255.255.0
        network 192.168.1.0
        broadcast 192.168.1.255
        bridge_ports eth1
        bridge_stp off
        bridge_fd 0
        bridge_maxwait 0


Enable forwarding:

sudo su
echo 1 > /proc/sys/net/ipv4/ip_forward

To make it run on boot

/etc/init.d/lekhonicloud
#!/bin/sh
# turn ip_forward on
echo 1 > /proc/sys/net/ipv4/ip_forward

sudo chmod +x /etc/init.d/lekhonicloud
sudo update-rc.d lekhonicloud defaults


iptables -A FORWARD -p tcp -m state -d 192.168.1.0/255.255.255.0 --state RELATED,ESTABLISHED -j ACCEPT

iptables -A FORWARD -p udp -m state -d 192.168.1.0/255.255.255.0 --state RELATED,ESTABLISHED -j ACCEPT

iptables -A FORWARD -p tcp -m state -d 192.168.0.0/255.255.255.0 --state RELATED,ESTABLISHED -j ACCEPT

iptables -A FORWARD -p udp -m state -d 192.168.0.0/255.255.255.0 --state RELATED,ESTABLISHED -j ACCEPT


Internet router


Dual Interface:
LAN 192.168.0.1
WAN [Static IP mapped to public domain name]

Destination     Gateway         Netmask         Flags Metric Ref    Use Iface
192.168.1.0     192.168.0.100   255.255.255.0   UG    1      0        0 LAN
192.168.0.0     *               255.255.255.0   U     0      0        0 LAN
default         [ISP Gtwy IP]   0.0.0.0         UG    0      0        0 WAN

 MAAS System Settings


From MAAS Network Config page
Interface eth0
Management: Manage DHCP and DNS
IP 192.168.1.3
Subnet mask 255.255.255.0
Router : 192.168.1.2
Broadcast: 192.168.1.255

IP Range: 192.168.1.110-192.168.1.149

/etc/network/interfaces

# The primary network interface
auto eth0
iface eth0 inet static
        address 192.168.1.3
        netmask 255.255.255.0
        gateway 192.168.1.2
        #dns-nameservers 192.168.1.3

auto eth1
iface eth1 inet dhcp
    #dns-nameservers 192.168.0.1

Install and configure MAAS

Installing MAAS is easy. Details are available at http://maas.ubuntu.com/docs/install.html. I am listing the basic steps here. After you boot a system with an Ubuntu Server bootable disk, from the first screen select

Multiple server install with MAAS

Then follow the setup. When Install or enlist with Ubuntu MAAS Server dialog box select

Create a new MAAS on this server .

When you are asked for the Ubuntu MAAS API address on the Configuring maas-cluster-controller screen, provide the right IP. It should be automatically detected and shown, but if it is not correct change it:

http://<maas ip address>/MAAS/

If you want MAAS to manage DHCP and DNS install those components:

sudo apt-get install maas-dhcp maas-dns

It makes things much easier.

After these steps you want to create an admin user and import boot images:

$ sudo maas createadmin --username=root --email=<adminemail>
$ maas-cli maas node-groups import-boot-images

You may now login to http://<maas ip address>/MAAS/ and start configuring.

Commission and start your servers


When you first start a server that is not yet added to MAAS and that has PXE boot enabled, it will boot using an image it receives from the MAAS-maintained DHCP/TFTP server, collect data about its hardware configuration, report it to MAAS, and shut itself down. The node will then be shown in the MAAS node list. You can change some of its parameters, like host name and power options, and press the Commission button to start commissioning. If the system is not powered up for some reason you may power it up manually. The system will again boot using the MAAS boot image and power itself off. The node will now be shown as Ready in the MAAS console. You may start the node now; on boot it will install Ubuntu (you can set which version from the MAAS web console).

You may keep any node in the Ready state so that juju can pick it up for deploying a service on it. We will configure juju shortly. Since I have only 4 physical servers I'll be using virtual machines to create the juju node pool. MAAS treats both types of machines the same way.

MAAS inserts the SSH keys that we set in the MAAS Preferences page for the default user "ubuntu". But in case the network interface is down it is hard to get the machine back. While it may be bad security practice, I create a user on each physical server so that I can log in through the console.

Install KVM on each Server

First install the required components and setup user:

sudo apt-get install qemu-kvm libvirt-bin ubuntu-vm-builder bridge-utils
 
sudo adduser `id -un` kvm
sudo adduser `id -un` libvirtd
 
Now log out and log back in, and verify the installation using:
 
ubuntu@Server06:~$ virsh -c qemu:///system list
 Id    Name                           State
----------------------------------------------------

Setting up virsh on MAAS system


From the MAAS system the query
 
$ virsh -c qemu+ssh://<username>@<serverip>/system list --all 
 
should show all virtual machines on the server when you are logged in as the maas 
user (sudo su - maas). If it does not, we must do the following steps.

First make sure the user maas has a home directory (/home/maas). 
If not, create it and assign /bin/bash as the default shell:
$ sudo mkdir /home/maas
$ sudo chown maas:maas /home/maas 
$ sudo chsh -s /bin/bash maas

Now sudo as maas

$ sudo su - maas
 
Now generate the key and copy it to the server: 
 
$ ssh-keygen
$ ssh-copy-id -i ~/.ssh/id_rsa <username>@<server_ip>
 
The username here can be any user name. If that user is not already set up with ssh keys 
you'll be prompted for a password.
 
Now test again if the command works:
 
$ virsh -c qemu+ssh://<username>@<serverip>/system list --all 
 
If it still fails look at http://wiki.libvirt.org/page/Failed_to_connect_to_the_hypervisor for  more troubleshooting information.

Now you may use qemu+ssh://<username>@<serverip>/system in the node configuration page on MAAS portal.

Power Type: virsh (virtual systems)
Address: qemu+ssh://<username>@<serverip>/system
Power Id: <Virtual machine name>

The <virtual machine name> should be the same as shown by the command

virsh -c qemu+ssh://<username>@<serverip>/system list --all

With this setup MAAS should automatically start and stop the system. When it says "This node is now allocated to you. It has been asked to start up." make sure the system actually has been powered up.


Install and Configure Juju

First install the components:


sudo apt-get install juju-core juju

Then create a configuration file ~/.juju/environments.yaml

default: maas
environments:
 maas:
    type: maas
    # Change this to where your MAAS server lives.  It must specify the base path.
    maas-server: 'http://<MAAS Server IP>/MAAS/'
    maas-oauth: '<MAAS API key>'
    admin-secret: '<admin secret>'
    default-series: precise
    authorized-keys-path: ~/.ssh/id_rsa.pub #authorized_keys # or any file you want.
    # Or:
    # authorized-keys: ssh-rsa keymaterialhere

You can get MAAS API Key from MAAS preferences page.

You can generate this file using the juju generate-config command, then edit it to keep only the MAAS configuration and fill in the keys.

Now do the bootstrap using

$ juju bootstrap --upload-tools

And see the status using

$ juju status 

You may now deploy hadoop

$ juju deploy hadoop

This will pick a machine from the MAAS machine pool and install Linux and then hadoop on it.

 

References 

http://askubuntu.com/questions/95354/how-to-enable-forwarding-of-data-between-two-local-interfaces

http://maas.ubuntu.com/docs/install.html

https://help.ubuntu.com/community/KVM/Installation

http://askubuntu.com/questions/292061/how-to-configure-maas-to-be-able-to-boot-virtual-machines


http://wiki.libvirt.org/page/Failed_to_connect_to_the_hypervisor

http://maas.ubuntu.com/docs/juju-quick-start.html

Sunday, October 20, 2013

Hadoop quickstart: Run your first MapReduce job in 10 minutes

We set up a single-machine cluster in a few simple steps. This is an experimental system to quickly get started with hadoop. Once we are familiar with the core components we can easily add more nodes to the system. There are a few differences from the official quickstart guide.

Assuming you have downloaded the latest hadoop binary from the apache hadoop site, let's start configuring it.

Let's assume you have extracted the contents at

/home/hduser/hadoop

Now let us generate an ssh key so that we can connect to the machine using ssh without typing a password.

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

This should do the trick. Make sure you can connect to localhost without typing a password:

$ ssh localhost


You should be connected to local machine. Exit the session for next steps.

Install JDK 7 on your machine. I am assuming the JDK is installed at /usr/lib/jvm/java-7-openjdk-amd64.

Let's edit .bashrc (or your shell's init script if you use something other than bash) and add the following lines at the end.

export HADOOP_HOME=/home/hduser/hadoop
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export PATH=$PATH:$HADOOP_HOME/bin
Now edit the file $HADOOP_HOME/etc/hadoop/hadoop-env.sh and add the following line after the export JSVC_HOME=${JSVC_HOME} line.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

Now we need to edit 3 configuration files in hadoop. In each file we'll be adding a few property blocks inside the <configuration> ... </configuration> block.

$HADOOP_HOME/etc/hadoop/core-site.xml

<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hduser/tmp</value>
  <description>TMP Directory</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
  <description></description>
</property>

$HADOOP_HOME/etc/hadoop/mapred-site.xml

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
  <description>MapReduce job tracker</description>
</property>

$HADOOP_HOME/etc/hadoop/hdfs-site.xml

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description></description>
</property>

After editing the files we are now ready to format the filesystem:

$hdfs namenode -format

It should print a lot of information, and among the lines you should see one like:

13/10/20 11:02:21 INFO common.Storage: Storage directory /home/hduser/tmp/dfs/name has been successfully formatted.

Now the HDFS filesystem is ready. Let's create a directory to store our data files.

hadoop fs -mkdir /user
hadoop fs -mkdir /user/hduser
hadoop fs -mkdir /user/hduser/data

Now we need sample data files that contain a lot of words. You may use a free book that is in text format. I created three sample text files for this and put them in the /home/hduser/data/ directory on my machine.

We can copy these files to HDFS using following command:

#Usage: hadoop fs -copyFromLocal <localsrc> URI
bin/hdfs dfs -copyFromLocal  /home/hduser/data/* /user/hduser/data

There are two paths required for the copyFromLocal operation: the first is the local file path and the second is the HDFS URI. We can see the files using the -ls command:

hduser@hadoop:~/hadoop$ hdfs dfs -ls /user/hduser/data


-rw-r--r--   1 hduser supergroup     538900 2013-10-20 11:08 /user/hduser/data/file1.txt
-rw-r--r--   1 hduser supergroup     856710 2013-10-20 11:08 /user/hduser/data/file2.txt
-rw-r--r--   1 hduser supergroup     357800 2013-10-20 11:08 /user/hduser/data/file3.txt

Now let's start the hadoop cluster:

hduser@hadoop:~/hadoop$ sbin/start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
localhost:  09:36:31 up 14:05,  2 users,  load average: 0.02, 0.04, 0.08
[output removed]

We are now ready to run the mapreduce sample that ships with hadoop. Let's copy the jar file to our current directory.

cp /home/hduser/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar ./

And run it by providing the mapreduce class name, data location and output location:

bin/hadoop jar hadoop-mapreduce-examples-2.2.0.jar wordcount /user/hduser/data /user/hduser/wcoutput

13/10/20 11:25:36 INFO input.FileInputFormat: Total input paths to process : 3
13/10/20 11:25:36 INFO mapreduce.JobSubmitter: number of splits:3
13/10/20 11:25:37 INFO mapreduce.Job: Running job: job_local3325941_0001
[output removed]
13/10/20 11:25:43 INFO mapreduce.Job: Job job_local3325941_0001 completed successfully

OK, the job completed successfully. We should be able to see the output files in the output directory.

hduser@hadoop:~/hadoop$hdfs dfs -ls /user/hduser/wcoutput/

-rw-r--r--   1 hduser supergroup          0 2013-10-20 11:25 /user/hduser/wcoutput/_SUCCESS
-rw-r--r--   1 hduser supergroup     880838 2013-10-20 11:25 /user/hduser/wcoutput/part-r-00000

Now we can download the output file to our local disk:

#Usage: hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
$bin/hdfs dfs -copyToLocal  /user/hduser/wcoutput/part-r-00000 ./wc_result.txt

And view it if you want:

$vim wc_result.txt

Finally we stop the cluster:

$sbin/stop-all.sh