Developed at Facebook (on the data team led by Jeff Hammerbacher), Hive turns SQL-like queries into MapReduce jobs, which makes it very useful for people who are strong in SQL-based programming. Hive isn't a good fit, though, for applications where machine learning algorithms or other complex logic are needed.
One can install it very easily and run it in three different modes:
- The Hive shell (hive> SHOW TABLES;)
- Calling scripts containing Hive commands (hive -f script.q; see the sketch after this list)
- For short scripts, the -e option to specify the commands inline, in which case the final semicolon is not required (hive -S -e 'SELECT * FROM dummy')
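For the script-file mode, a minimal sketch (the file name script.q and the table dummy are just for illustration):

-- script.q
SHOW TABLES;
SELECT COUNT(*) FROM dummy;

Run it with:
hive -f script.q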
Hive creates logical tables over the data files in HDFS. There are two components of Hive:
- Metastore (the database holding the table definitions/schemas)
- Warehouse (the physical location of the data files backing Hive tables, e.g. /user/cloudera/any.db/tablename/0000_m)
One can set properties in Hive just as we do in MapReduce. There is a precedence hierarchy to setting properties; in the following list, lower numbers take precedence over higher numbers:
1) The Hive SET command, which sets a particular property for the current session (see the quick example after this list)
2) The command-line -hiveconf option, which sets a property for a particular session (hive -hiveconf fs.defaultFS=hdfs://localhost). Note that --config is different: it points Hive at an alternative configuration directory (hive --config /Users/tom/dev/hive-conf)
3) hive-site.xml and the Hadoop site files (core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml)
4) The Hive defaults and the Hadoop default files (core-default.xml, hdfs-default.xml, mapred-default.xml, and yarn-default.xml)
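As a quick sketch of the first two levels (the property shown, mapred.reduce.tasks, is a standard Hadoop one; the value is just for illustration):

hive> SET mapred.reduce.tasks=2;   -- session-level override, wins over everything below it
hive> SET mapred.reduce.tasks;     -- with no value, prints the current setting

$ hive -hiveconf mapred.reduce.tasks=2    # same effect, set when launching the shell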
Comparison with Traditional Databases
Schema on Read Versus Schema on Write
In a traditional database we have to enforce the schema (check the data, its integrity, etc.) while loading it into a table, which is called schema on write. In Hive, by contrast, no schema check or data comparison is performed while loading data into tables; the load is just a copy or move operation on the data files, and the schema is applied only when the data is read (schema on read). Checking at write time makes queries faster, but if the schema is not known at the time the data is created, Hive is a good choice at that time.
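A small sketch of schema on read in action (table and file names are hypothetical): the load does no validation, and any field that can't be parsed as the declared type simply comes back as NULL when queried.

CREATE TABLE prices (symbol STRING, price FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- No checking happens here; the file is just moved into the warehouse.
LOAD DATA INPATH '/user/cloudera/prices.csv' INTO TABLE prices;
-- A row like ABC,oops is returned with price = NULL.
SELECT * FROM prices;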
Tables
1) Managed
CREATE TABLE NYSE_2014 (symbol string, date string, high float, low float, close float, open float, quantity string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA INPATH '/user/cloudera/nyse_2014.csv' OVERWRITE INTO TABLE NYSE_2014;
The /user/cloudera/nyse_2014.csv file will be moved to the warehouse directory, and a schema definition will be created in the metastore DB. When we drop this table (drop table nyse_2014), both the files and the schema definition of the table are deleted. (The files are copied, rather than moved, to the warehouse directory if we use LOAD DATA LOCAL INPATH, i.e. a local file.)
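To confirm where the data ended up, one can ask the metastore for the table's location (the warehouse path shown is the common default, but it depends on your configuration):

hive> DESCRIBE FORMATTED nyse_2014;   -- the Location: field gives the warehouse path
$ hadoop fs -ls /user/hive/warehouse/nyse_2014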
2) External
CREATE EXTERNAL TABLE external_table (dummy STRING) LOCATION '/user/rakesh/external_table';
LOAD DATA INPATH '/user/rakesh/data.txt' INTO TABLE external_table;
Files will not be copied or moved to the warehouse directory; Hive doesn't even check whether the external location exists at the time the table is defined. When we drop this table, only the schema definition is deleted, not the files.
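A quick sketch of that drop behavior (using the paths from above):

hive> DROP TABLE external_table;   -- removes only the schema definition from the metastore
$ hadoop fs -ls /user/rakesh/external_table   # the data files are untouched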
Partitioning and Bucketing
Both partitioning and bucketing help performance when scanning the data files behind Hive tables: they increase the efficiency of queries on the table, since Hive can look only at the relevant files instead of the whole data set.
Partitioning:
create table nyse_2014_partition (symbol string, date string, high float, low float, close float, open float, quantity string) partitioned by (dt string, country string);
insert overwrite table nyse_2014_partition partition(dt='2014',country='US') select * from default.nyse_2014;
At the filesystem level, partitions are simply nested subdirectories of the table directory. After loading a few more files into the nyse_2014_partition table, the directory structure might look like this:
/user/hive/warehouse/nyse_2014_partition/
    dt=2014/
        country=US/
            00000
Now, since partition columns are pseudo-columns, one can query on them and get results:
select * from nyse_2014_partition where country = 'GB';
This query will look only at the files in the partitions having country='GB', not the whole data set.
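One can also ask Hive which partitions a table has; for the insert above the output would be something like:

hive> SHOW PARTITIONS nyse_2014_partition;
dt=2014/country=US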
Bucketing:
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;
insert overwrite table bucketed_users select * from users;
will create 4 files in the warehouse/bucketed_users directory, one per bucket.
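One caveat: on older Hive releases (before 2.0), Hive has to be told to honour the bucket definition when populating the table, otherwise the insert won't split the data into buckets:

set hive.enforce.bucketing = true;   -- not needed on Hive 2.0+, where bucketing is always enforced
insert overwrite table bucketed_users select * from users;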
SELECT * FROM bucketed_users
TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);
will return only the rows in the first of the four buckets, i.e. roughly 25% of the table's records.
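Sampling isn't limited to bucketed tables, though on a non-bucketed table Hive has to scan the whole data set to produce the sample. A sketch using rand():

SELECT * FROM users
TABLESAMPLE(BUCKET 1 OUT OF 4 ON rand());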
If you want to create a new, empty table with the same schema as another table, then use the LIKE keyword:
create table nyse_2014_like like nyse_2014;
CTAS (CREATE TABLE...AS SELECT)
create table symbols_nyse as select distinct(symbol) from nyse_2014;
Using a custom SerDe: RegexSerDe
Let's see how to use a custom SerDe for loading data. We'll use a contrib SerDe that uses a regular expression for reading the fixed-width station metadata from a text file:

CREATE TABLE stations (usaf STRING, wban STRING, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(\\d{6}) (\\d{5}) (.{29}) .*"
);

In previous examples, we have used the DELIMITED keyword to refer to delimited text in the ROW FORMAT clause. In this example, we instead specify a SerDe with the SERDE keyword and the fully qualified classname of the Java class that implements the SerDe, org.apache.hadoop.hive.contrib.serde2.RegexSerDe.
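To round this out, a hedged sketch of loading and querying through that SerDe (the input path is hypothetical; the fixed-width file is assumed to match the regex above):

LOAD DATA LOCAL INPATH '/home/cloudera/stations-fixed-width.txt' INTO TABLE stations;
-- Each regex capture group becomes one of the table's columns at read time.
SELECT usaf, wban, name FROM stations LIMIT 4;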