Thursday, 14 May 2015

HDFS...Hadoop Distributed Filesystem



The Hadoop Distributed Filesystem (HDFS) is like any other file system, with a few exceptions.

HDFS is best described by the following three points.

·   Works best when you have very large files.

·   Streaming access (write once, read many times).

·   Works on commodity hardware.

It differs from a Portable Operating System Interface (POSIX) filesystem in how blocks are used: on a local disk, a file smaller than the block size (typically 4 KB) still occupies the whole block, whereas in HDFS a file smaller than a block does not take up a full block's worth of underlying storage, so the unused space remains free.

HDFS uses a much larger block size than POSIX filesystems, 128 MB by default, which means a file larger than 128 MB is split into n blocks. Each block is stored on a different datanode, while the namenode holds only the metadata (which datanode holds which block).

HDFS also has a concept of replication (the default replication factor is 3), meaning each block is stored in three different locations to provide fault tolerance. A secondary namenode keeps a merged copy of the namenode's metadata, which helps in recovering from a namenode failure (it is not a hot standby).
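To make the block and replication ideas concrete, here is a hedged sketch (the class name and the file path passed on the command line are my own, not from this post) that uses the FileSystem API to print a file's block size, replication factor, and which datanodes hold each block:

package com.hadoop.tutorial;

import java.net.URI;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {

  public static void main(String[] args) throws Exception {
    String uri = args[0]; // e.g. hdfs://quickstart.cloudera:8020/user/cloudera/abc.txt
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);

    FileStatus status = fs.getFileStatus(new Path(uri));
    // Per-file block size and replication factor
    System.out.println("Block size  : " + status.getBlockSize());
    System.out.println("Replication : " + status.getReplication());

    // Which datanodes hold each block of the file
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("Offset " + block.getOffset() + " length " + block.getLength()
          + " hosts " + Arrays.toString(block.getHosts()));
    }
  }
}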


HDFS’s fsck command understands blocks. For example, running:
% hdfs fsck / -files -blocks
gives the block info of all the files in HDFS.
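To check a single file instead of the whole namespace (the path below is just the abc.txt used later in this post), you can also add -locations to see which datanodes hold each block:
% hdfs fsck /user/cloudera/abc.txt -files -blocks -locations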
Java API for HDFS:
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/fs/package-summary.html

I have developed the code below, which reads a file from HDFS, writes (copies) it to another HDFS path, and then deletes the source file.

package com.hadoop.tutorial;

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {

  public static void main(String[] args) throws Exception {
    String uri = args[0];   // source file in HDFS
    String dest = args[1];  // destination file in HDFS
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    FSDataInputStream in = null;
    FSDataOutputStream out = null;
    try {
      // Open the source file and create the destination file
      in = fs.open(new Path(uri));
      out = fs.create(new Path(dest));

      // Print the source file to stdout (read)
      IOUtils.copyBytes(in, System.out, 4096, false);

      // Rewind and copy the same bytes to the destination (write)
      in.seek(0);
      IOUtils.copyBytes(in, out, 4096, false);

      // Remove the source file (delete); 'true' enables recursive delete
      fs.delete(new Path(uri), true);
    } finally {
      // Close both streams; closing 'out' also flushes the written data
      IOUtils.closeStream(out);
      IOUtils.closeStream(in);
    }
  }
}
$ hadoop jar gettingstarted_mapreduce.jar com.hadoop.tutorial.FileSystemCat hdfs://quickstart.cloudera:8020/user/cloudera/abc.txt hdfs://quickstart.cloudera:8020/user/cloudera/xyz.txt
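Run as above, the program prints the contents of abc.txt to the console, copies it to xyz.txt, and then deletes abc.txt from HDFS.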


Coherency Model:


Once more than a block of data has been written to HDFS, the completed blocks are visible to other readers, but the block currently being written is not.
HDFS provides a way to force all buffers to be flushed to the datanodes via the hflush() method on FSDataOutputStream. Once data is written and hflush() has returned, it is visible to readers; use hsync() if you also need the data persisted to disk on each datanode (similar to a POSIX fsync).
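A minimal sketch of using hflush() and hsync() (the file path and contents here are made up for illustration):

package com.hadoop.tutorial;

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HflushExample {

  public static void main(String[] args) throws Exception {
    // Hypothetical path, following the quickstart URIs used earlier in this post
    String uri = "hdfs://quickstart.cloudera:8020/user/cloudera/coherency.txt";
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);

    FSDataOutputStream out = fs.create(new Path(uri));
    out.write("first line\n".getBytes("UTF-8"));

    // Readers are not guaranteed to see the bytes above yet.
    out.hflush();   // force buffers to the datanodes; data is now visible to readers
    out.hsync();    // additionally ask the datanodes to persist the data to disk

    out.write("second line\n".getBytes("UTF-8"));
    out.close();    // close() implies a flush, so everything is visible afterwards
  }
}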

Parallel Copying with distcp
distcp runs a MapReduce job to copy data in parallel, and can be used to transfer data between directories, nodes, or even two clusters. For example:

hadoop distcp dir1 dir2
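To copy between two clusters, give fully qualified HDFS URIs (the namenode hostnames below are hypothetical):

hadoop distcp hdfs://namenode1:8020/user/cloudera/dir1 hdfs://namenode2:8020/user/cloudera/dir1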


 






