Sunday, 31 May 2015

Hive


Developed at Facebook (on the data team led by Jeff Hammerbacher), this MapReduce tool is very useful for people who are good at SQL-based programming. Hive isn't a good fit, though, for applications where machine-learning algorithms are needed for complex logic.

One can install it very easily and run it in 3 different ways:
  1. The Hive shell (hive> SHOW TABLES;)
  2. Running scripts containing Hive commands (hive -f script.q)
  3. For short scripts, the -e option to specify the commands inline, in which case the final semicolon is not required (hive -S -e 'SELECT * FROM dummy')
Hive converts the above queries into Java MapReduce jobs and executes them the same way we would with the hadoop jar command.
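If you're curious what Hive actually generates, the EXPLAIN keyword prints the plan without running it (a sketch; the table and columns assume the NYSE_2014 table created later in this post):

```sql
-- Print the stages (map/reduce) Hive will run for this query,
-- without actually executing it.
EXPLAIN
SELECT symbol, MAX(high)
FROM nyse_2014
GROUP BY symbol;
```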

Hive creates logical tables over the data files in HDFS. There are two components of Hive:


  1. Metastore 

  2. Warehouse(/user/cloudera/any.db/tablename/0000_m )

The metastore is a database (embedded Derby by default; MySQL in many distributions). It stores metadata about the tables created by Hive.
The warehouse is simply the physical location in HDFS where the data files for Hive tables are stored (the path shown above).

One can set properties in Hive just like we set them in MapReduce.
There is a precedence hierarchy to setting properties. In the following list, lower numbers take precedence over higher numbers:
  1. The Hive SET command (sets a particular property for the current session)
  2. The command-line -hiveconf option (sets a property for a particular session; note that --config is different: it points Hive at an alternative configuration directory, e.g. hive --config /Users/tom/dev/hive-conf)
  3. hive-site.xml and the Hadoop site files (core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml)
  4. The Hive defaults and the Hadoop default files (core-default.xml, hdfs-default.xml, mapred-default.xml, and yarn-default.xml)
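A quick sketch of levels 1 and 2 (the property name here is chosen just as an example):

```sql
-- Inside the Hive shell: set a property for this session only
SET mapreduce.job.reduces=2;
-- Omitting the value prints the current setting
SET mapreduce.job.reduces;
-- From the command line, the equivalent per-session override is:
--   hive -hiveconf mapreduce.job.reduces=2
```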

Comparison with Traditional Databases

Schema on Read Versus Schema on Write

In a traditional database, the schema is enforced while loading data into a table: the data is checked against the schema, its integrity is verified, and so on. This is called schema on write. In Hive, by contrast, no schema check or data verification is performed at load time; loading is just a copy or move of the data files. The data is checked against the schema only when a query is run, which is called schema on read.
Schema on write makes query performance fast, but if the schema is not known at the time the data is created, Hive's schema-on-read approach is a good choice.

  Tables

1)Managed

CREATE TABLE NYSE_2014 (symbol STRING, `date` STRING, high FLOAT, low FLOAT, close FLOAT, open FLOAT, quantity STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA INPATH '/user/cloudera/nyse_2014.csv' OVERWRITE INTO TABLE NYSE_2014;

The file /user/cloudera/nyse_2014.csv will be moved to the warehouse directory, and a schema definition will be created in the metastore DB. When we drop this table (DROP TABLE NYSE_2014), both the data files and the schema definition of the table are deleted. (Files are copied, rather than moved, to the warehouse directory if we use LOAD DATA LOCAL INPATH, i.e. a local file.)
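One easy way to verify all of this is DESCRIBE FORMATTED, which prints the metastore's view of the table (a sketch; the exact output layout varies by Hive version):

```sql
DESCRIBE FORMATTED NYSE_2014;
-- Among the output, look for lines like:
--   Table Type:   MANAGED_TABLE
--   Location:     hdfs://.../user/hive/warehouse/nyse_2014
```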

2)External

CREATE EXTERNAL TABLE external_table (dummy STRING) LOCATION '/user/rakesh/external_table';
LOAD DATA INPATH '/user/rakesh/data.txt' INTO TABLE external_table;

Files will not be copied or moved to the warehouse directory; Hive doesn't even check whether the external location exists at the time the table is defined. And when we drop this table, only the schema definition will be deleted, not the files.
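A minimal sketch of that drop behaviour, using a hypothetical external table and path:

```sql
CREATE EXTERNAL TABLE ext_demo (dummy STRING)
LOCATION '/user/rakesh/ext_demo';

-- Removes only the schema definition from the metastore; any files
-- under /user/rakesh/ext_demo stay in HDFS, so the table can later
-- be re-created over the same data.
DROP TABLE ext_demo;
```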

Partitioning and Bucketing

Both partitioning and bucketing help performance when looking through the data files of a Hive table: they increase the efficiency of queries on the table, since queries can read only the specific files they need rather than the whole data set.

 Partitioning:

create table nyse_2014_partition (symbol string,date string,high float,low float,close float,open float,quantity string) partitioned by (dt string,country string);

insert overwrite table nyse_2014_partition partition(dt='2014',country='US') select * from default.nyse_2014;


 
At the filesystem level, partitions are simply nested subdirectories of the table directory. After the insert above, the directory structure for nyse_2014_partition might look like this:

/user/hive/warehouse/nyse_2014_partition/
    dt=2014/
        country=US/
            000000_0

Partition columns are pseudo-columns, so one can filter and select on them just like regular columns.

select * from nyse_2014_partition where country = 'GB';
This query will look only at the files in the partitions having country='GB', not the whole data set.
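You can check which partitions the metastore knows about with SHOW PARTITIONS:

```sql
-- Prints one line per partition of the table, e.g.:
--   dt=2014/country=US
SHOW PARTITIONS nyse_2014_partition;
```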

Bucketing:

CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;

set hive.enforce.bucketing=true;
insert overwrite table bucketed_users select * from users;

will create 4 files in the warehouse/bucketed_users directory; each bucket is a separate file, not a directory.

SELECT * FROM bucketed_users
     TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);
will return the rows of the first bucket, i.e. roughly 25% of the records.
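TABLESAMPLE is flexible about how many buckets it reads (these examples assume the 4-bucket table above):

```sql
-- Read two of the four buckets: roughly half the rows
SELECT * FROM bucketed_users
TABLESAMPLE(BUCKET 1 OUT OF 2 ON id);

-- Sampling on rand() instead of the bucketing column still works,
-- but forces a scan of the whole table
SELECT * FROM bucketed_users
TABLESAMPLE(BUCKET 1 OUT OF 4 ON rand());
```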

If you want to create a new, empty table with the same schema as another table, then use the LIKE keyword:

create table nyse_2014_like like nyse_2014;

CTAS

create table symbols_nyse as select distinct(symbol) from nyse_2014;

Using a custom SerDe: RegexSerDe

Let’s see how to use a custom SerDe for loading data. We’ll use a contrib SerDe that uses a regular expression for reading the fixed-width station metadata from a text file:
CREATE TABLE stations (usaf STRING, wban STRING, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(\\d{6}) (\\d{5}) (.{29}) .*"
);
In previous examples, we used the DELIMITED keyword to refer to delimited text in the ROW FORMAT clause. In this example, we instead specify a SerDe with the SERDE keyword and the fully qualified classname of the Java class that implements the SerDe, org.apache.hadoop.hive.contrib.serde2.RegexSerDe.
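Once created, the stations table can be queried like any other; the regex's three capture groups populate usaf, wban, and name in order:

```sql
-- Each row of the text file is matched against input.regex at read
-- time (schema on read); lines that don't match come back as NULLs.
SELECT usaf, wban, name FROM stations LIMIT 4;
```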
to be continued…
