Monday, January 26, 2009

Hadoop pseudo-distributed mode in Win XP

Hadoop is a utility that works on grids..
A grid is just a group of computers that are connected normally (Internet or LAN, nothing hi-fi) and are willing to run a few pieces of software..
Hadoop is one of them.

Say you have a task to run on a large dataset.. Hadoop splits the data into many parts, does the work on the many connected computers, and gives you the results..

Hadoop does just the splitting and combining part. You have to write your own code..
If you find it interesting, visit websites and blogs on
Hadoop, grid computing, distributed computing, etc.
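
Conceptually it's the same idea as a Unix pipeline, just spread across many machines.. Hadoop's Streaming mode even lets you plug in scripts more or less like this (mapper.sh and reducer.sh here are hypothetical placeholders for your own code; the one-liner is purely illustrative, not a real Hadoop command):

$ cat input/* | ./mapper.sh | sort | ./reducer.sh > output   # illustrative only: map, shuffle/sort, reduce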

I'll speak only about configuring Hadoop.
It took me one day to do the configuration..

Prerequisites..

1. Cygwin with OpenSSH (server) and Emacs (or another Unix-based editor)
2. The latest Hadoop release
3. Java SE JDK
Step 1.

Install Cygwin with OpenSSH. There are many sites that tell you about this; the
Univ Page
might be useful for setting up OpenSSH.
Make sure that you have a text editor.

Hayes Davis' page
has detailed instructions. Hope you're back.

Step 2.
Untar the release that you downloaded from Hadoop's site.

Go to conf/hadoop-env.sh using Cygwin. Use a Unix-based editor (to avoid all the trouble) and
find the line that sets JAVA_HOME (the 13th line, I think):
export JAVA_HOME=/cygdrive/c/"Program Files"/java1.5
Change it according to your Java path and
remove the # in front.
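
Note: Hadoop's scripts tend to choke on spaces in JAVA_HOME, so a safer bet (my suggestion, not part of the original steps) is the 8.3 short name for "Program Files".. the java1.5 folder name is just an example, substitute your own:

export JAVA_HOME=/cygdrive/c/Progra~1/java1.5   # Progra~1 = short name for "Program Files"; assumed JDK location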

Now save the file and do

ssh-host-config

When it asks "Should privilege separation be used?", I recommend you answer "no" to save tons of trouble.

It might or might not pause and prompt you to enter a value for "CYGWIN="; enter "ntsec tty".

Start sshd (sometimes it is already started along with ssh):

/usr/sbin/sshd

then do

ssh localhost


If it asks for a passphrase to ssh to the localhost, press "ctrl + c" and type the following commands:

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
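
If sshd still asks for a password after this, file permissions are the usual culprit.. sshd's StrictModes option rejects keys kept in group-writable places (a standard ssh tip, not from the original steps):

chmod 700 ~/.ssh                    # only you may touch the .ssh directory
chmod 600 ~/.ssh/authorized_keys    # only you may read/write the key list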

You're done configuring SSH.. type
exit
to come out of server mode for some time..
Step 3

You're almost done..
There are three modes in Hadoop


1. Standalone
Easiest and fastest for beginners..
It just works after what you've done

Try an example


$ mkdir input
$ cp conf/*.xml input
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
$ cat output/*

If it shows JAVA_HOME not found, please check the path and make sure the file was edited with a Unix-based editor.
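
Windows line endings are the other silent killer here.. a quick fix (assuming Cygwin's dos2unix utility is installed; it isn't part of the original steps):

$ dos2unix conf/hadoop-env.sh   # strip CRLF line endings that break the script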


2. Pseudo-Distributed Operation
Hadoop can also be run on a single node in pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process.

For this, change conf/hadoop-site.xml (replace it, keeping a backup) as follows.


Configuration
Use the following conf/hadoop-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
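
With fs.default.name pointing at hdfs://localhost:9000, the bin/hadoop fs commands now talk to HDFS instead of the local filesystem. Once the daemons are up (see the Execution part below) you can sanity-check it (the / path is just an example):

$ bin/hadoop fs -ls /   # lists the root of the new HDFS; needs the daemons running
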
Now you're almost there.. But it's essential to set up the master and slave keys:

$> scp ~/.ssh/id_dsa.pub masterusername@localhost:~/.ssh/master-key.pub
$> cat ~/.ssh/master-key.pub >> ~/.ssh/authorized_keys

$> scp ~/.ssh/id_dsa.pub slaveusername@localhost:~/.ssh/slave-key.pub
$> cat ~/.ssh/slave-key.pub >> ~/.ssh/authorized_keys

Substitute masterusername and slaveusername with your Win XP account name (one with admin privileges)....

Now you're ready to go..

do

ssh masterusername@localhost   (likewise for slaveusername)

If you're not connected do

/usr/sbin/sshd
then
ssh masterusername@localhost

I hope you're connected...

Bingo.. you're ready to execute your first example on a pseudo cluster in Win XP!

Execution
Format a new distributed filesystem:
$ bin/hadoop namenode -format

Start the hadoop daemons:
$ bin/start-all.sh

The hadoop daemon log output is written to the ${HADOOP_LOG_DIR} directory (defaults to ${HADOOP_HOME}/logs).
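
As a quick sanity check (jps ships with the JDK; this tip isn't part of the original steps), list the running Java processes.. in pseudo-distributed mode you should see NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker:

$ jps   # prints the PIDs and class names of running Java processes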

Browse the web interface for the NameNode and the JobTracker; by default they are available at:

NameNode - http://localhost:50070/
JobTracker - http://localhost:50030/
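
If you'd rather check from the shell (assuming Cygwin's curl package is installed.. just a convenience, not part of the original steps):

$ curl -s http://localhost:50070/ | head   # any response means the NameNode web UI is up
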
Copy the input files into the distributed filesystem:
$ bin/hadoop fs -put conf input

Run some of the examples provided:
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'

Examine the output files:

Copy the output files from the distributed filesystem to the local filesystem and examine them:
$ bin/hadoop fs -get output output
$ cat output/*

or

View the output files on the distributed filesystem:
$ bin/hadoop fs -cat output/*

When you're done, stop the daemons with:
$ bin/stop-all.sh


If you have a problem after you restart your system.. delete the tmp directory in the root of your Hadoop installation and format the namenode again...
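
A minimal sketch of that recovery (assuming the tmp directory really is where the note above says; its actual location depends on hadoop.tmp.dir in your config):

$ bin/stop-all.sh              # make sure no daemons are holding the old data
$ rm -rf tmp                   # assumed hadoop.tmp.dir location, per the note above
$ bin/hadoop namenode -format
$ bin/start-all.sh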

3. Distributed mode in Win XP..

Hayes Davis' site is good.. It was where I figured out pseudo-distributed mode operation..


Happy Map-Reduce...

Please post comments on any mistakes or anything left out..


Acknowledgments:

Hayes Davis
Brandeis Univ site


Wednesday, January 14, 2009

Redhat's commercial

Sensational video, take a look:

Redhat Serious commercial

Open Source is cool

Recently, I had the opportunity of attending classes on Hadoop organised at our college.
The architecture of Hadoop is fairly simple and understandable, but building an application that runs on Hadoop requires some practice.

Two years back the whiteboard was blank.. Now there are phrases. But every phrase took someone a lifetime to develop.
There are so many good people all around the world willing to share.

It's logical to go for it if so many are willing and a person is inspired.


Open source is cool, and I'm serious that it's my hobby.. (:) finally I discovered one)..


Truth Happens ..