Friday, January 5, 2024

Big data - Hive

HIVE

Hive was developed by Facebook.
Earlier, only Java could be used to write MapReduce jobs,
so Facebook developed Hive, whose query language is similar to SQL.

 


Hive is used as an Enterprise Data Warehouse -
most data warehouses have moved to Hadoop.

Advantages of Hive
Similar to SQL
Can process much more data
Hadoop is scalable, so Hive scales with it

Hive applications
Log processing
Text mining
Document indexing
Business Intelligence (e.g., Google Analytics)
Predictive modeling

 

 

Shell: allows interactive queries (similar to the MySQL shell)
Driver: handles sessions, fetch operations, and execute operations
Compiler: parses (syntax check), plans (logical and physical plans), and optimizes the query
Execution Engine: runs the query as a DAG (Directed Acyclic Graph) of stages (see the EXPLAIN sketch below)
Metastore: holds the schema, the location in HDFS, and the SerDe; converting a Java object to text and back is called SerDe (Serialization/De-serialization).
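
The plan and the DAG of stages can be inspected with EXPLAIN - a minimal sketch, assuming an employee table already exists:

hive> EXPLAIN SELECT count(*) FROM employee;

The output lists the stage dependencies and the map/reduce operator tree for the query.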

 

Data Model
Hive is mostly used for processing structured data.
Table: typed columns (int, float, string, ...);
       complex types such as lists are also supported (e.g., JSON data)
Partitions: e.g., a card-transactions table where 100,000s of transactions arrive daily - the table grows very big and read times get high, so we create partitions under the table (month-wise, for example) and a query only reads the relevant partition.
Buckets: if the data is bigger still, you can have buckets inside a partition (b1, b2, etc.); b1 might hold card-present transactions, b2 card-not-present transactions, and so on.
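
A minimal sketch of such a table - the card_txns name and its columns are assumptions for illustration:

hive> CREATE TABLE card_txns (
    >     txn_id STRING,
    >     amount FLOAT,
    >     card_id STRING)
    > PARTITIONED BY (txn_month STRING)        -- one partition per month
    > CLUSTERED BY (card_id) INTO 4 BUCKETS    -- buckets inside each partition
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE;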
 


 

Metastore
It holds all the databases.
Database: a namespace containing a set of tables.
Table definitions (column types): whether a column is int, string, bool, etc.
It also contains info about partitioning and bucketing.
The metastore uses the JPOX ORM for its implementation: Hive understands Java objects, so JPOX is used to convert them to and from the relational store.
ORM - Object Relational Mapping
JPOX - Java Persistent Objects; also known as DataNucleus (it has been called DataNucleus since 2008) - it is used for data persistence in Java.
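
What the metastore holds for a table can be viewed with DESCRIBE FORMATTED - a sketch, using the shakesper table created in the next section:

hive> DESCRIBE FORMATTED shakesper;

The output includes the column types, the table's HDFS location, and the SerDe library it uses.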

 

start hive
hive
hive>
hive>select * from card;

Create table
hive> show tables;
(tables are stored under /user/hive/warehouse/)
hive>CREATE TABLE shakesper(freq INT, word STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
hive>DESCRIBE shakesper;


Hive authorization
Users: user privileges
Groups: group privileges
Roles: role privileges

Command    -    Description
ALL        -    all privileges
ALTER      -    alter metadata
UPDATE     -    update data
CREATE     -    create objects
DROP       -    drop objects
INDEX      -    indexing
LOCK       -    locking & unlocking
SELECT     -    access the data
SHOW DATABASE - to see all available databases
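
A minimal sketch of granting and checking a privilege - the user name harish is an assumption, and employee is the table used later in this post:

hive> GRANT SELECT ON TABLE employee TO USER harish;
hive> SHOW GRANT USER harish ON TABLE employee;
hive> REVOKE SELECT ON TABLE employee FROM USER harish;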


 

Run queries in Hive
hive                    - start the interactive shell (hive> prompt)
hive -e "query string"  - run a single query
hive -f <file path>     - run multiple queries stored in a file
hive -H                 - help

Hive command line options
hive -d <key=value>  - define a variable for substitution
hive -i <file>       - initialization file, run before the shell starts
hive -h <hostname>   - host name of the Hive server to connect to
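
A short sketch of these options in use - the file path, variable name, and table names are assumptions:

hive -e 'select * from employee'
hive -f /home/cloudera/queries.hql
hive -d env=prod -e 'select * from ${hivevar:env}_logs'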


Order by - arranges rows in asc/desc order globally, but in big data the rows are stored on multiple nodes, not one, so a total order forces everything through a single reducer.
Sort by - orders the data within each reducer, but the overall order across reducers may not be a true total order.
Distribute by - ensures that the reducers get non-overlapping ranges of x;
                no reducer gets overlapping data, which avoids duplicate processing of the same rows.
Cluster by - distribute by + sort by;
             it does both of the above - in simple terms, a per-reducer sort on the distributed key.
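
A minimal sketch of the four clauses, assuming a txns table with a card_id column:

hive> SELECT * FROM txns ORDER BY card_id;       -- total order, single reducer
hive> SELECT * FROM txns SORT BY card_id;        -- ordered within each reducer only
hive> SELECT * FROM txns DISTRIBUTE BY card_id;  -- same card_id always goes to the same reducer
hive> SELECT * FROM txns DISTRIBUTE BY card_id SORT BY card_id;
hive> SELECT * FROM txns CLUSTER BY card_id;     -- equivalent to the line above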

 

Group by - similar to SQL
select employeename, count(*) from employee group by employeename;


Using a having condition
select employeename, count(*) from employee group by employeename having count(*) > 1;

 

Join

Example: suppose table A has RNO values 1, 2, 3, 4 and table B has RNO values 3, 5, 6, 7.

Inner join: 3
Left outer join: 1 2 3 4
Right outer join: 3 5 6 7
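
A sketch of the corresponding queries on the assumed A and B tables:

select A.RNO from A JOIN B ON A.RNO = B.RNO;              -- inner
select A.RNO from A LEFT OUTER JOIN B ON A.RNO = B.RNO;   -- left outer
select B.RNO from A RIGHT OUTER JOIN B ON A.RNO = B.RNO;  -- right outer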


IN (standard SQL)
select * from A where RNO IN (select RNO from B)

Hive equivalent (LEFT SEMI JOIN)
select * from A LEFT SEMI JOIN B ON A.RNO = B.RNO

Union all (keeps duplicates, e.g. a result like 1, 2, 3, 3, 4, 5)
select * from A UNION ALL select * from B;

Views
create view studview AS select RNO, Name, Std from studfile;

drop view studview;
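
Once created, a view is queried like a table - a one-line sketch assuming Std holds a numeric class value:

select RNO, Name from studview where Std = 10;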

 

Partitioning
Transaction file---TL20180101
                 |_TL20180102
                 |_TL20180103
We divide the data file into partitions to achieve speed:
each query loads less data, so searching is faster.

Bucketing
A bucket sits inside a partition; one partition may contain multiple buckets.
Bucketing is less commonly used than partitioning.

 

HINT - hints are used when we do some kind of join operation.
employee detail
 1    AB
 2    CD
 3    EF
 4    GH
 5    IJ
 6    KL

employee month salary
01/18,1,10000
02/18,2,20000
03/18,3,30000

Joins are costly in Hadoop.

A hint is a directive or suggestion provided by the user to the query optimizer to influence the execution plan of a Hive query.
Map Join Hint - this hint instructs the optimizer to use a map-side join for the specified (small) table, which can be more efficient for certain types of joins.
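
A minimal sketch of the map join hint - the table names emp_detail(id, name) and emp_salary(month, emp_id, salary) are assumptions for the two data sets above:

SELECT /*+ MAPJOIN(d) */ d.name, s.salary
FROM emp_detail d JOIN emp_salary s ON d.id = s.emp_id;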

 

     

Hive 

1. Starting HIVE
[cloudera@quickstart ~]$ hive
hive> 


2. Create database
hive> create database hive180520;
OK
Time taken: 2.168 seconds


3. To See database
hive> show databases;
OK
hive180520
Time taken: 0.015 seconds, Fetched: 1 row(s)


4. Use Database
hive> use hive180520;
OK

*********** INTERNAL TABLE (Managed Table)*********************

5. Create Table
hive> create table employee(
    >      name string,
    >      location string,
    >      extn int,
    >      desig string)
    >      row format delimited fields terminated by ',';
OK
Time taken: 0.436 seconds


6. See the file structure
hive> describe employee;
OK
name                	string              	                    
location            	string              	                    
extn                	int                 	                    
desig               	string              	                    
Time taken: 0.092 seconds, Fetched: 4 row(s)


7. Load the data from employee.txt file
hive> load data local inpath '/home/cloudera/hive180520/employee.txt' into table employee;
Loading data to table hive180520.employee
Table hive180520.employee stats: [numFiles=1, totalSize=101]
OK
Time taken: 0.808 seconds


8. Run SELECT query to see the data
hive> select * from employee;
OK
Harish	A 25	5159	Soft Engg
Shoaib	A 26	5160	Sr Soft Engg
Paul	A 27	5169	Lead
Trump	A 28	5170	Manager
Time taken: 0.482 seconds, Fetched: 4 row(s)


9. You can also open the file explorer and go to the path below to see the data
/user/hive/warehouse/hive180520.db/employee/employee.txt


10. Drop the table
hive> drop table employee;
OK
Time taken: 0.662 seconds

Since this is an internal (managed) table, the schema as well as the data is deleted (can verify in the file explorer).
*********** INTERNAL TABLE OVER **************************

*********** EXTERNAL TABLE *******************************


11. Create External Table
hive> create external table employee_ext(
    > name string,
    > location string,
    > desig string)
    > row format delimited fields terminated by ','
    > location '/hive180520';
OK
Time taken: 0.069 seconds


12. The query below will return no records, since no data has been loaded yet
hive> select * from employee_ext;
OK


13. Open another command prompt and put the data into HDFS
[cloudera@quickstart ~]$ cd hive180520
[cloudera@quickstart hive180520]$ hdfs dfs -put employee.txt /hive180520


14. Back in the Hive shell, the same SQL query will now return records, as the data was loaded in step 13.
hive> select * from employee_ext;
OK
Harish	A 25	5159
Shoaib	A 26	5160
Paul	A 27	5169
Trump	A 28	5170
Time taken: 0.076 seconds, Fetched: 4 row(s)
(can also see the data in the file explorer at path '/hive180520/employee.txt')

15. We can execute HQL queries to fetch the data
hive> select * from employee_ext where name = 'Harish';
OK
Harish	A 25	5159
Time taken: 0.274 seconds, Fetched: 1 row(s)


16. Drop the table
hive> drop table employee_ext;
OK
Time taken: 0.162 seconds

Only the schema is deleted; the data is still present in HDFS.

*********** EXTERNAL TABLE OVER **************************
 

Partition

1. Create the Table
create table employee_stp(name string,location string,extn int) 
partitioned by (desig string)  
row format delimited fields terminated by ',' stored as textfile;


2. See the schema
describe employee_stp;

******************* STATIC PARTITION **************************************
3. Load the data using Static Partition:
load data local inpath '/home/cloudera/hive180520/employee_SE.txt' into table employee_stp partition (desig = 'SE');
load data local inpath '/home/cloudera/hive180520/employee_SSE.txt' into table employee_stp partition (desig = 'SSE');
load data local inpath '/home/cloudera/hive180520/employee_Lead.txt' into table employee_stp partition (desig = 'Lead');


4. Query the data
select * from employee_stp where desig = 'SE' and name = 'HarishSE';


5. Can also see the partition directories at '/user/hive/warehouse/hive180520.db/employee_stp/'

******************* STATIC PARTITION END **************************************


******************* DYNAMIC PARTITION **************************************
6. 
#Dynamic Partitioning

Step A: Create an internal table (it can be a delta load coming from different servers daily)
create table employee_full(name string,location string,extn int, desig string)  
row format delimited fields terminated by ',' stored as textfile;

load data local inpath '/home/cloudera/hive180520/employee.txt' into table employee_full;

select * from employee_full;


Step B: Insert the data from the internal table in Step A into the dynamically partitioned table below
create table employee_dyp(name string,location string,extn int) 
partitioned by (desig string)  
row format delimited fields terminated by ',' stored as textfile;


-- allow dynamic partitioning
set hive.exec.dynamic.partition=true;
-- all partition columns may be dynamic (no static value required)
set hive.exec.dynamic.partition.mode=nonstrict;
-- cap on the total number of partitions one statement may create
set hive.exec.max.dynamic.partitions=100;
-- cap per mapper/reducer node
set hive.exec.max.dynamic.partitions.pernode=100;


insert into table employee_dyp partition(desig) select name,location, extn, desig from employee_full;
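
To verify, the partitions just created can be listed:

show partitions employee_dyp;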

******************* DYNAMIC PARTITION END **************************************

 

Bucket

1. Create the full (unbucketed) table
create table realestate (street string, cityname string, zipcode int, state string, beds int, baths int, sq_feet int, flat_type string, price int) row format delimited fields terminated by ',' stored as textfile;

2. Load entire Data
load data local inpath '/home/cloudera/hive180520/realestatedata.csv' into table realestate;

3. Select query 
select * from realestate;

4. Enable bucketing (Hive will then set the number of reducers to match the bucket count)
set hive.enforce.bucketing=true;

5. Create Bucket Table
create table realestate_bucket (street string, zipcode int, state string, beds int, baths int, sq_feet int, flat_type string, price int) partitioned by (city string) clustered by (street) into 4 buckets row format delimited fields terminated by  ',';

6. Insert data in to bucket table
insert into table realestate_bucket partition(city) select street, zipcode ,state, beds, baths, sq_feet,flat_type,price,cityname from realestate;

7. Check the output from '/user/hive/warehouse/hive180520.db/realestate_bucket'
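
Bucketing also makes sampling efficient - a sketch that reads just one of the four buckets:

select * from realestate_bucket tablesample(bucket 1 out of 4 on street);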

