Thursday, 14 May 2015

HDFS...Hadoop Distributed Filesystem



The Hadoop Distributed Filesystem (HDFS) is like any other file system, with a few exceptions.

HDFS is best described by the following three points.

·   Works best when you have very large files.

·   Streaming access (write once, read many times).

·   Works on commodity hardware.

It differs from a Portable Operating System Interface (POSIX) filesystem in how blocks are used: on a local disk, a file smaller than the block size (typically 4 KB) still occupies the whole block, whereas in HDFS a file smaller than a block does not take up a full block's worth of underlying storage, so the unused space remains free.

HDFS uses a much larger block size than POSIX filesystems, 128 MB by default, which means a file larger than 128 MB is split into n blocks. Each block is stored on a different datanode, while the namenode holds only the metadata (which datanode holds which block).

HDFS also has a concept of replication (the default replication factor is 3), meaning each block is stored in three different locations to provide fault tolerance. A secondary namenode keeps a merged copy of the namenode's metadata, which helps in recovering from a namenode failure (it is not a hot standby).
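To make the block and replication ideas concrete, here is a hedged sketch (the class name and the file path passed on the command line are my own, not from this post) that uses the FileSystem API to print a file's block size, replication factor, and which datanodes hold each block:

package com.hadoop.tutorial;

import java.net.URI;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {

  public static void main(String[] args) throws Exception {
    String uri = args[0]; // e.g. hdfs://quickstart.cloudera:8020/user/cloudera/abc.txt
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);

    FileStatus status = fs.getFileStatus(new Path(uri));
    // Per-file block size and replication factor
    System.out.println("Block size  : " + status.getBlockSize());
    System.out.println("Replication : " + status.getReplication());

    // Which datanodes hold each block of the file
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("Offset " + block.getOffset() + " length " + block.getLength()
          + " hosts " + Arrays.toString(block.getHosts()));
    }
  }
}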


HDFS’s fsck command understands blocks. For example, running:
% hdfs fsck / -files -blocks
gives the block info of all the files in HDFS.
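To check a single file instead of the whole namespace (the path below is just the abc.txt used later in this post), you can also add -locations to see which datanodes hold each block:
% hdfs fsck /user/cloudera/abc.txt -files -blocks -locations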
Java API for HDFS:
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/fs/package-summary.html

I have developed the code below, which reads a file from HDFS, writes (copies) it to another HDFS path, and then deletes the source file.

package com.hadoop.tutorial;

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {

  public static void main(String[] args) throws Exception {
    String uri = args[0];   // source file in HDFS
    String dest = args[1];  // destination file in HDFS
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    FSDataInputStream in = null;
    FSDataOutputStream out = null;
    try {
      // Open the source file and create the destination file
      in = fs.open(new Path(uri));
      out = fs.create(new Path(dest));

      // Print the source file to stdout (read)
      IOUtils.copyBytes(in, System.out, 4096, false);

      // Rewind and copy the same bytes to the destination (write)
      in.seek(0);
      IOUtils.copyBytes(in, out, 4096, false);

      // Remove the source file (delete); 'true' enables recursive delete
      fs.delete(new Path(uri), true);
    } finally {
      // Close both streams; closing 'out' also flushes the written data
      IOUtils.closeStream(out);
      IOUtils.closeStream(in);
    }
  }
}
$ hadoop jar gettingstarted_mapreduce.jar com.hadoop.tutorial.FileSystemCat hdfs://quickstart.cloudera:8020/user/cloudera/abc.txt hdfs://quickstart.cloudera:8020/user/cloudera/xyz.txt
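Run as above, the program prints the contents of abc.txt to the console, copies it to xyz.txt, and then deletes abc.txt from HDFS.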


Coherency Model:


Once more than a block of data has been written to HDFS, the completed blocks are visible to other readers, but the block currently being written is not.
HDFS provides a way to force all buffers to be flushed to the datanodes via the hflush() method on FSDataOutputStream. Once data is written and hflush() has returned, it is visible to readers; use hsync() if you also need the data persisted to disk on each datanode (similar to a POSIX fsync).
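A minimal sketch of using hflush() and hsync() (the file path and contents here are made up for illustration):

package com.hadoop.tutorial;

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HflushExample {

  public static void main(String[] args) throws Exception {
    // Hypothetical path, following the quickstart URIs used earlier in this post
    String uri = "hdfs://quickstart.cloudera:8020/user/cloudera/coherency.txt";
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);

    FSDataOutputStream out = fs.create(new Path(uri));
    out.write("first line\n".getBytes("UTF-8"));

    // Readers are not guaranteed to see the bytes above yet.
    out.hflush();   // force buffers to the datanodes; data is now visible to readers
    out.hsync();    // additionally ask the datanodes to persist the data to disk

    out.write("second line\n".getBytes("UTF-8"));
    out.close();    // close() implies a flush, so everything is visible afterwards
  }
}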

Parallel Copying with distcp
distcp runs a MapReduce job to copy data in parallel, and can be used to transfer data between directories, nodes, or even two clusters. For example:

hadoop distcp dir1 dir2
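To copy between two clusters, give fully qualified HDFS URIs (the namenode hostnames below are hypothetical):

hadoop distcp hdfs://namenode1:8020/user/cloudera/dir1 hdfs://namenode2:8020/user/cloudera/dir1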


 






