Big Data
The 3 V's of big data
- Data volume
- Data velocity
- Data variety
Data volume is increasing - from KB to MB, MB to GB, and so on.
Data velocity - in the old days there were only batch jobs; a list of jobs ran,
data was generated in batches and there was very little of it.
From batch it changed to periodic (hourly etc.).
From periodic it changed to near real-time (every few minutes).
From near real-time it changed to real-time - video calls, audio calls etc.
Velocity - the speed at which data is generated.
Data variety
old days - Excel tables, then databases, then photos from the web, audio -
unstructured data.
Flow of big data
Data comes in mostly 2 ways:
- Real-time
- Batch processing
Then again it can be structured data or
unstructured data (which you need to figure out).
Data sources
Data comes from different locations:
- mobile
- bank
- cars
- IoT
Privacy
DLP (may be applied to the data)
Encryption / Scrambling / Masking
Privacy - if you have SSNs, Aadhaar numbers etc., we don't store them directly.
Pre-processing
If the data comes in some other format, you convert it to the required format before storing it.
Big data analysis
We analyze the big data and get insights from it; we can visualize things and generate reports.
Basically we analyze the data here and try to find some output from the big data - a report.
Data Integration
Integrating with other systems to generate the reports - for internal business use;
they can be given to government agencies or sent to NGOs.
What is the difference between a data analyst and a data scientist?
Data analyst
SQL
BI packages
Intermediate statistics
Looks backward (at what already happened)
Data scientist
Data acquisition, movement, manipulation
Programming
Looks forward - predicts things
Advanced statistics
Where to use big data?
You have multiple sites or apps, and you have a warehouse for inventory;
you keep track of everything.
The data generated by your apps is what you analyze for business decisions.
Performance: vertical vs horizontal scaling - big data uses horizontal scaling.
Shared nothing vs shared disk
How is data stored in Hadoop?
Partitioning
Replication
RDBMS - a table kind of structure, like in Excel.
Hadoop - parallel processing
Handling failure in Hadoop
by the replication factor.
2003-2013
Hadoop - MapReduce
Everything happens in 2 stages:
Map - divide and share across multiple machines for parallel processing
Reduce - then aggregate all the processed results, reduce them and provide the report
In 2007
each big player invented a new way of handling big data:
Pig script - Yahoo - developed a scripting language
with simple instructions like LOAD, STORE etc.
Hive query - Facebook - Hive query, similar to SQL - used a lot nowadays,
equivalent to writing a SQL program
Impala query - SQL-like query engine
Mahout - for machine learning
Flume - used to fetch streaming data from different systems
Sqoop - fetches data from RDBMS systems
Oozie - scheduler - schedules jobs
Pregel, Giraph - graph processing
- If you use Hive or Pig, it internally converts your job into MapReduce.
You don't need to learn Java; because you know SQL, you can work on Hive/Pig.
You just learn Hive, and you can become a big data (Hadoop) developer.
2014 - Spark came.
It is a complete package - SQL, Streaming, MLlib.
Fast (in-memory processing)
Ecosystem
Customers (www, mobile) -> NoSQL / RDBMS -> Hadoop (data loaded in) -> Reports (analysis)
HDFS - Hadoop Distributed File System
YARN - resource negotiator, arranges the resources
MapReduce - programming model - programs can be written in Java/Python
Sqoop - fetches data from RDBMS systems
Pig - scripting language
Hive - query language
Spark - runs directly on YARN - it does not convert the job into a Java/MapReduce program, it runs directly
on YARN; Spark programs can be written in Scala, Java 8/9, Python or R; it is fast
HBase - NoSQL - a columnar kind of database - runs directly on HDFS
Impala - SQL engine
Flume - gets streaming data from different systems, used a lot for web logs
ZooKeeper - manages the Hadoop system - it is the coordinator, it does all the coordination between the different systems; a coordination framework.
What is Hadoop good at?
Processing large files
Processing sequential files
Handling partial failure
Handling unstructured data
Performance cost is linear
What is it not good at?
A large number of small files - it is good with a small number of large files,
because it has to store the metadata and do the replication, so there is no point storing and processing small files on a big processing cluster
Random access
Modifying data (write once, read multiple times)
Does not support 3NF (normal forms); does not traditionally support ACID properties
It is not a kind of RDBMS
Hadoop
Hadoop Creation History
Features of Hadoop
Reliable - handles failure (replication)
Flexible - add more systems (add more nodes) with no downtime
Stable - stable versions of Hadoop
Economical - the commodity hardware used is cheap
Major components: Hadoop MapReduce
Map - distributes the queries
Reduce - gathers the results
Resource management
HDFS
2.1 - what is HDFS?
Hadoop distributed file system
Core component
like a regular file system (txt, email, image etc.)
stores TBs & PBs
HDFS Architecture
2 things:
NameNode
DataNode
The NameNode is a high-configuration machine, and the NameNode controls the DataNodes.
Data is stored in the DataNodes, and its metadata is stored in the NameNode.
The NameNode is aware of which file is present where, because it has the metadata.
The NameNode has the fsimage.
The data stored in the DataNodes can change - files can be deleted, updated etc.
The changes happening over a period of time are recorded in an edit log; the Secondary NameNode
periodically merges the fsimage and the edit log and hands the result back to the NameNode.
Example: if you have a 300 MB file, it is not stored directly as one 300 MB piece;
it is stored in blocks - 128 MB + 128 MB + 44 MB on different machines - and it can be processed in parallel.
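A tiny Python sketch of that block arithmetic, assuming the default 128 MB block size:

def split_into_blocks(file_size_mb, block_size_mb=128):
    # return the sizes of the HDFS blocks a file of this size would occupy
    blocks = []
    while file_size_mb > 0:
        blocks.append(min(block_size_mb, file_size_mb))
        file_size_mb -= block_size_mb
    return blocks

print(split_into_blocks(300))   # [128, 128, 44]
print(split_into_blocks(700))   # [128, 128, 128, 128, 128, 60]  (see Block Storage below)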
HDFS Client
1. Requests metadata about the file system from the NameNode.
2. Gets the response with the metadata.
3. The actual data transfer happens directly between the client and the DataNodes, based on the metadata received from the NameNode.
What is a heartbeat?
The DataNode sends a signal to the NameNode to say that it is alive.
HDFS Architecture
Typical Hadoop cluster
Master
- MapReduce layer
  TaskTracker
  JobTracker
- HDFS
  NameNode
  DataNode
Slave
- MapReduce layer
  TaskTracker
- HDFS
  DataNode
JobTracker & TaskTracker (MapReduce layer, Hadoop 1.x)
JobTracker - one instance of the JobTracker on the master
- The MapReduce job is submitted to the JobTracker.
- It initiates tasks on the nodes where the file is located.
TaskTracker - multiple instances of the TaskTracker on the slaves
- Receives instructions from the JobTracker.
- Sends heartbeats to the JobTracker.
When you submit a MapReduce job, it is sent to the JobTracker, and the JobTracker sends the tasks to the 4 (or however many) computers where that particular data file is located; it initiates the task where the data is.
The TaskTracker has multiple instances on the slaves; it receives from the JobTracker what it needs to do, and it continuously sends heartbeats to the JobTracker, so the JobTracker knows the TaskTracker is alive.
The MapReduce job is submitted to the client -> client to JobTracker -> JobTracker to the data locations -> and on those locations the TaskTrackers run the tasks.
Block Storage
A 700 MB file is split into
Block 1 to Block 6 (5 blocks of 128 MB + one block of 60 MB).
Rack awareness
Example: in Rack 1 we have 4 machines, in Rack 2 we again have 4 machines,
and in Rack 3 again 4 machines.
The NameNode always knows where a file is present.
The NameNode is rack aware, and it always tells you where to access the data from.
Write file
- The HDFS client asks the NameNode to write a file ->
- the NameNode asks the client to break the file into blocks with a replication factor of 3: Block A should go on DataNode 1, Block B on DataNode 3, and so on ->
- and the client does exactly what the NameNode said ->
- then the DataNodes send block reports to the NameNode; the changes go into the edit log (handled via the Secondary NameNode), which is later merged back into the NameNode
- the DataNodes inform the client that the file was written successfully
Read file
- The client asks the NameNode to read a file -
- the NameNode says okay, the file is present on these particular DataNodes
- finally the client reads from those DataNodes
- The file is present on multiple nodes; a checksum is kept (calculated by a formula over the entire bits), and whenever the client reads the file it calculates the checksum and matches it against the stored one; if the checksums match, the read is correct.
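A small Python sketch of that checksum check, using CRC32 from zlib (HDFS really does keep CRC checksums for the data it stores; the exact variant and where it is kept are simplified here):

import zlib

data_written = b"contents of block A"          # data as it was written
stored_checksum = zlib.crc32(data_written)     # checksum saved at write time

data_read = b"contents of block A"             # data as read back by the client
if zlib.crc32(data_read) == stored_checksum:
    print("checksums match - the read is correct")
else:
    print("checksum mismatch - read the block from another replica")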
Limitations of Hadoop 1.x
SPOF - the NameNode -> fixed with a standby NameNode
1 JobTracker => bottleneck -
in Hadoop 2.x its work is split between the ResourceManager / ApplicationMaster and a NodeManager on each node
1000s of TaskTrackers report to that single JobTracker
Hadoop 1.x can connect about 4k nodes;
in Hadoop 2.0 we can connect up to 10k.
YARN
Yet Another Resource Negotiator
- Manages the cluster's resources.
- Introduced from Hadoop 2.x.
Why did YARN come?
Hadoop 1.x
- Resource management was done by MapReduce
- Processing was also done by MapReduce
So there was a load on MapReduce; even other applications like Pig, Hive etc. ran on top of MapReduce,
and it directly supported only MapReduce.
Hadoop 2.x
- Resource management is given to YARN
- Processing can be done by MapReduce, Spark, Tez
YARN Architecture
JobTracker - replaced by the ResourceManager
and the ApplicationMaster
TaskTracker - replaced by the NodeManager
Slots
The execution locations in Hadoop 1.x are called slots; a DataNode can have multiple slots.
In Hadoop 2.x
slots are called containers; a single node can have multiple containers.
ResourceManager
Resource scheduler - schedules as per the requests from the ApplicationMaster.
Example: the ApplicationMaster requests, say, 120 MB of RAM, and the scheduler allocates it.
ApplicationMaster liveness monitor / NodeManager liveness monitor -
both work on heartbeats.
Several event handlers - example: one machine fails, and the same data needs to be replicated on another machine.
ApplicationMaster
Works at the application level.
Manages the lifecycle of the application.
Negotiates resources from the RM.
The load on the RM is reduced -
the load is distributed between the ResourceManager and the ApplicationMaster.
The Hadoop 1.x JobTracker is replaced by the ResourceManager and the ApplicationMaster.
NodeManager
- Runs on the slaves
- Manages many containers per node
- Manages the lifecycle of the containers
- Sends heartbeats to the ResourceManager
container #1 (e.g. Spark)    container #2 (e.g. MapReduce)
YARN application startup
- The job is submitted by the client -> the job goes to the ResourceManager.
- The ResourceManager goes to the resource scheduler to assign the resources.
- The ResourceManager contacts the NodeManager.
- The NodeManager creates a container.
- In the container the ApplicationMaster is executed, and the ApplicationMaster runs the application.
1. The client submits the application to the ResourceManager.
2. The ResourceManager allocates a container.
3. The ResourceManager contacts the NodeManager.
4. The NodeManager launches the container.
5. The container executes the ApplicationMaster.
Hadoop Distributed File System (HDFS) commands
Creating a Directory:
hdfs dfs -mkdir /path/to/directory
Copying Files or Directories to HDFS:
hdfs dfs -copyFromLocal /path/to/local/file_or_directory /path/in/hdfs
Copying Files or Directories from HDFS to the Local File System:
hdfs dfs -copyToLocal /path/in/hdfs /path/to/local/directory
Listing Files or Directories in HDFS:
hdfs dfs -ls /path/in/hdfs
Removing Files or Directories from HDFS:
hdfs dfs -rm /path/in/hdfs/file_or_directory
Moving (Renaming) Files or Directories in HDFS:
hdfs dfs -mv /path/in/hdfs/source /path/in/hdfs/destination
Displaying the Contents of a File in HDFS:
hdfs dfs -cat /path/in/hdfs/file
Creating an Empty File in HDFS:
hdfs dfs -touchz /path/in/hdfs/file
Checking Disk Usage in HDFS:
hdfs dfs -du -h /path/in/hdfs
Checking the Size of a File in HDFS:
hdfs dfs -du -h /path/in/hdfs/file
Setting Replication Factor for a File:
hdfs dfs -setrep -w <replication_factor> /path/in/hdfs/file
Getting Help:
hdfs dfs -help
Introduction to MapReduce
Why do we use MapReduce?
There are performance issues with vertical scaling;
with commodity hardware and horizontal scaling, performance increases linearly.
Advantages of MapReduce
- Scalability (can add more machines without any downtime)
- Cost effective (uses commodity hardware, which is cheap)
- Flexible (R, Python, Java, Scala, xls, video, audio)
- Fast (parallel processing)
- Security (chown, chmod 640) (HDFS, HBase)
- High availability (NameNode, standby NameNode)
How MapReduce works
{k1; v1, v2}
Map: creates key-value pairs
Reduce: combines the key-value pairs
Example:
We have a big input file; it is divided and stored on 6 different machines.
The map program runs on all 6 machines.
After the program runs, it creates key-value pairs like:
machine 1 - k1=v, k2=v, k3=v
machine 2 -
machine 3 - k1=v, k3=v
machine 4 - k2=v
machine 5 - k3=v, k5=v
machine 6 - k4=v
Group by key
Shuffle & sort
k1=v,v   k2=v,v   k3=v,v,v   k4=v   k5=v
Reducer
combines the key-value pairs and gives the output
o/p = k1=2v, k2=2v, k3=3v, k4=v, k5=v
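A small local Python simulation of the map -> shuffle & sort -> reduce flow, using a word count so the key-value pairs are concrete (on a real cluster the map and reduce functions run on different machines; the sample lines here are just made-up input splits):

from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) key-value pair for every word
    for word in line.split():
        yield word, 1

def shuffle(pairs):
    # Shuffle & sort: group all values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine the values for one key (here, sum the counts)
    return key, sum(values)

splits = ["it is what it is", "what is it", "it is a mango"]
mapped = [pair for line in splits for pair in map_phase(line)]
grouped = shuffle(mapped)
print(dict(reduce_phase(k, v) for k, v in grouped.items()))
# {'it': 4, 'is': 4, 'what': 2, 'a': 1, 'mango': 1}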
How many mappers run?
The number of mappers is decided based on how many splits the input has.
How many reducers run?
The number of reducers depends entirely on the data and on what processing we are doing; it is decided internally.
In the example above, 6 mappers ran (one per split) and the framework decided the number of reducers.
Indexing
T[0] = 'it is what it is'
T[1] = 'what is it'
T[2] = 'it is a mango'
Forward index - indexing on the basis of the file/document name.
Inverted index - indexing on the basis of the terms (used by search engines):
"a": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"mango": {2}
"what": {0, 1}
TF-IDF
Term Frequency - Inverse Document Frequency
TF = no. of times "Hive" appears in the doc / no. of terms in the doc
   = 50 / 1000 = 0.05
IDF = log10(total no. of docs / no. of docs containing the word "Hive")
    = log10(10,000,000 / 1,000) = 4
TF x IDF = 0.05 x 4 = 0.20
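The same arithmetic in Python (log base 10, matching the numbers above):

import math

tf = 50 / 1000                          # "Hive" appears 50 times in a 1000-term doc
idf = math.log10(10_000_000 / 1_000)    # 10M docs total, 1000 of them contain "Hive"
print(tf, idf, tf * idf)                # prints: 0.05 4.0 0.2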
Partitioner
Between map and reduce we have the partitioner.
It makes sure that a particular key's data is processed by a particular reducer.
For example, the data from the map has keys k1 and k2;
the k1 data goes to reducer 1 and the k2 data goes to reducer 2.
The partitioner is again a program.
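A tiny sketch of the partitioning idea in Python (Hadoop's default HashPartitioner does essentially hash(key) mod numReducers; the simple character-sum hash below is only so the output stays stable across runs):

def partition(key: str, num_reducers: int) -> int:
    # decide which reducer handles this key
    return sum(ord(c) for c in key) % num_reducers

num_reducers = 2
for key in ["k1", "k2", "k3", "k4", "k5"]:
    print(f"{key} -> reducer {partition(key, num_reducers)}")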
Hadoop Streaming (utility)
Allows you to create & run MapReduce jobs
in non-Java languages - R, Python.
Syntax:
$ /usr/bin/hadoop jar /usr/lib/hadoop-map/hadoop-streaming.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /user/harish/file1.txt \
    -output /user/harish/file1_output
(the -output path is a directory and must not already exist)
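A minimal mapper.py / reducer.py pair that a command like this could run (a word-count sketch; these scripts are hypothetical examples that read stdin and write stdout, which is how Hadoop Streaming talks to non-Java programs - the reducer relies on Hadoop sorting the map output by key):

#!/usr/bin/env python3
# mapper.py - emit "word<TAB>1" for every word on stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py - input arrives sorted by key, so counts for a word are contiguous
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")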
MRJob (from Yelp)
Allows you to run MapReduce programs on the local system (Unix, Linux) -
e.g. "people who viewed this also viewed this" style jobs.
Mapper.py & Reducer.py
Advantages of MRJob
- The whole MapReduce job is a single class.
- Easy to upload, install and deploy dependencies.
- Can download error logs.
- Easy to integrate with
Amazon EC2 / EMR.
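A minimal MRJob word count, showing the "single class" advantage (assumes mrjob is installed, e.g. pip install mrjob; the file name wordcount_job.py is just an example):

# wordcount_job.py - run locally with: python wordcount_job.py input.txt
from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, _, line):
        # emit (word, 1) for every word in the input line
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        # sum the counts for each word
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()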
MapReduce practice
Install Hadoop on Docker:
https://medium.com/analytics-vidhya/how-to-easily-install-hadoop-with-docker-ad094d556f11
https://drive.google.com/drive/folders/1LrYzvpjvZ7k5Y8plzq6vDVk8-IvcYr0O
