Big Data
The 3 V's of big data
- Data volume
- Data velocity
- Data variety
Data volume is increasing - from KB to MB, MB to GB, and so on.
Data velocity - in the old days there were only batch jobs; a list of jobs ran,
data was generated in batches and there was very little of it.
From batch it changed to periodic (hourly etc.).
From periodic it changed to near real-time (every few minutes).
From near real-time it changed to real-time - video calls, audio calls etc.
Velocity - the speed at which data is generated.
Data variety
old days - Excel tables, then databases, then photos from the web, audio -
unstructured data.
Flow of big data
Data comes in mostly 2 ways:
- Real-time
- Batch processing
Then again it can be structured data or
unstructured data (which you need to figure out).
Data sources
Data comes from different locations:
- mobile
- bank
- cars
- IoT
Privacy
DLP (may be applied to the data)
Encryption / Scrambling / Masking
Privacy - if you have SSNs, Aadhaar numbers etc., we don't store them directly.
Pre-processing
If the data comes in some other format, you convert it to the required format before storing it.
Big data analysis
We analyze the big data and get insights from it; we can visualize things and generate reports.
Basically we analyze the data here and try to find some output from the big data - a report.
Data Integration
Integrating with other systems to generate the reports - for internal business use;
they can be given to government agencies or sent to NGOs.
What is the difference between a data analyst and a data scientist?
Data analyst
SQL
BI packages
Intermediate statistics
Looks backward (at what already happened)
Data scientist
Data acquisition, movement, manipulation
Programming
Looks forward - predicts things
Advanced statistics
Where to use big data?
You have multiple sites or apps, and you have a warehouse for inventory;
you keep track of everything.
The data generated by your apps is what you analyze for business decisions.
Performance: vertical vs horizontal scaling - big data uses horizontal scaling.
Shared nothing vs shared disk
How is data stored in Hadoop?
Partitioning
Replication
RDBMS - a table kind of structure, like in Excel.
Hadoop - parallel processing
Handling failure in Hadoop
by the replication factor.
2003-2013
Hadoop - MapReduce
Everything happens in 2 stages:
Map - divide and share across multiple machines for parallel processing
Reduce - then aggregate all the processed results, reduce them and provide the report
In 2007
each big player invented a new way of handling big data:
Pig script - Yahoo - developed a scripting language
with simple instructions like LOAD, STORE etc.
Hive query - Facebook - Hive query, similar to SQL - used a lot nowadays,
equivalent to writing a SQL program
Impala query - SQL-like query engine
Mahout - for machine learning
Flume - used to fetch streaming data from different systems
Sqoop - fetches data from RDBMS systems
Oozie - scheduler - schedules jobs
Pregel, Giraph - graph processing
- If you use Hive or Pig, it internally converts your job into MapReduce.
You don't need to learn Java; because you know SQL, you can work on Hive/Pig.
You just learn Hive, and you can become a big data (Hadoop) developer.
2014 - Spark came.
It is a complete package - SQL, Streaming, MLlib.
Fast (in-memory processing)
Ecosystem
Customers (www, mobile) -> NoSQL / RDBMS -> Hadoop (data loaded in) -> Reports (analysis)
HDFS - Hadoop Distributed File System
YARN - resource negotiator, arranges the resources
MapReduce - programming model - programs can be written in Java/Python
Sqoop - fetches data from RDBMS systems
Pig - scripting language
Hive - query language
Spark - runs directly on YARN - it does not convert the job into a Java/MapReduce program, it runs directly
on YARN; Spark programs can be written in Scala, Java 8/9, Python or R; it is fast
HBase - NoSQL - a columnar kind of database - runs directly on HDFS
Impala - SQL engine
Flume - gets streaming data from different systems, used a lot for web logs
ZooKeeper - manages the Hadoop system - it is the coordinator, it does all the coordination between the different systems; a coordination framework.
What is Hadoop good at?
Processing large files
Processing sequential files
Handling partial failure
Handling unstructured data
Performance cost is linear
What is it not good at?
A large number of small files - it is good with a small number of large files,
because it has to store the metadata and do the replication, so there is no point storing and processing small files on a big processing cluster
Random access
Modifying data (write once, read multiple times)
Does not support 3NF (normal forms); does not traditionally support ACID properties
It is not a kind of RDBMS
Hadoop
Hadoop Creation History
Features of Hadoop
Reliable - handles failure (replication)
Flexible - add more systems (add more nodes) with no downtime
Stable - stable versions of Hadoop
Economical - the commodity hardware used is cheap
Major components: Hadoop MapReduce
Map - distributes the queries
Reduce - gathers the results
Resource management
HDFS
2.1 - what is HDFS?
Hadoop distributed file system
Core component
like a regular file system (txt, email, image etc.)
stores TBs & PBs
HDFS Architecture
2 things:
NameNode
DataNode
The NameNode is a high-configuration machine, and the NameNode controls the DataNodes.
Data is stored in the DataNodes, and its metadata is stored in the NameNode.
The NameNode is aware of which file is present where, because it has the metadata.
The NameNode has the fsimage.
The data stored in the DataNodes can change - files can be deleted, updated etc.
The changes happening over a period of time are recorded in an edit log; the Secondary NameNode
periodically merges the fsimage and the edit log and hands the result back to the NameNode.
Example: if you have a 300 MB file, it is not stored directly as one 300 MB piece;
it is stored in blocks - 128 MB + 128 MB + 44 MB on different machines - and it can be processed in parallel.
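A tiny Python sketch of that block arithmetic, assuming the default 128 MB block size:

def split_into_blocks(file_size_mb, block_size_mb=128):
    # return the sizes of the HDFS blocks a file of this size would occupy
    blocks = []
    while file_size_mb > 0:
        blocks.append(min(block_size_mb, file_size_mb))
        file_size_mb -= block_size_mb
    return blocks

print(split_into_blocks(300))   # [128, 128, 44]
print(split_into_blocks(700))   # [128, 128, 128, 128, 128, 60]  (see Block Storage below)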
HDFS Client
1. Requests metadata about the file system from the NameNode.
2. Gets the response with the metadata.
3. The actual data transfer happens directly between the client and the DataNodes, based on the metadata received from the NameNode.
What is a heartbeat?
The DataNode sends a signal to the NameNode to say that it is alive.
HDFS Architecture
Typical Hadoop cluster
Master
- MapReduce layer
  TaskTracker
  JobTracker
- HDFS
  NameNode
  DataNode
Slave
- MapReduce layer
  TaskTracker
- HDFS
  DataNode
JobTracker & TaskTracker (MapReduce layer, Hadoop 1.x)
JobTracker - one instance of the JobTracker on the master
- The MapReduce job is submitted to the JobTracker.
- It initiates tasks on the nodes where the file is located.
TaskTracker - multiple instances of the TaskTracker on the slaves
- Receives instructions from the JobTracker.
- Sends heartbeats to the JobTracker.
When you submit a MapReduce job, it is sent to the JobTracker, and the JobTracker sends the tasks to the 4 (or however many) computers where that particular data file is located; it initiates the task where the data is.
The TaskTracker has multiple instances on the slaves; it receives from the JobTracker what it needs to do, and it continuously sends heartbeats to the JobTracker, so the JobTracker knows the TaskTracker is alive.
The MapReduce job is submitted to the client -> client to JobTracker -> JobTracker to the data locations -> and on those locations the TaskTrackers run the tasks.
Block Storage
A 700 MB file is split into
Block 1 to Block 6 (5 blocks of 128 MB + one block of 60 MB).
Rack awareness
Example: in Rack 1 we have 4 machines, in Rack 2 we again have 4 machines,
and in Rack 3 again 4 machines.
The NameNode always knows where a file is present.
The NameNode is rack aware, and it always tells you where to access the data from.
Write file
- The HDFS client asks the NameNode to write a file ->
- the NameNode asks the client to break the file into blocks with a replication factor of 3: Block A should go on DataNode 1, Block B on DataNode 3, and so on ->
- and the client does exactly what the NameNode said ->
- then the DataNodes send block reports to the NameNode; the changes go into the edit log (handled via the Secondary NameNode), which is later merged back into the NameNode
- the DataNodes inform the client that the file was written successfully
Read file
- The client asks the NameNode to read a file -
- the NameNode says okay, the file is present on these particular DataNodes
- finally the client reads from those DataNodes
- The file is present on multiple nodes; a checksum is kept (calculated by a formula over the entire bits), and whenever the client reads the file it calculates the checksum and matches it against the stored one; if the checksums match, the read is correct.
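A small Python sketch of that checksum check, using CRC32 from zlib (HDFS really does keep CRC checksums for the data it stores; the exact variant and where it is kept are simplified here):

import zlib

data_written = b"contents of block A"          # data as it was written
stored_checksum = zlib.crc32(data_written)     # checksum saved at write time

data_read = b"contents of block A"             # data as read back by the client
if zlib.crc32(data_read) == stored_checksum:
    print("checksums match - the read is correct")
else:
    print("checksum mismatch - read the block from another replica")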
Limitations of Hadoop 1.x
SPOF - the NameNode -> fixed with a standby NameNode
1 JobTracker => bottleneck -
in Hadoop 2.x its work is split between the ResourceManager / ApplicationMaster and a NodeManager on each node
1000s of TaskTrackers report to that single JobTracker
Hadoop 1.x can connect about 4k nodes;
in Hadoop 2.0 we can connect up to 10k.
YARN
Yet Another Resource Negotiator
- Manages the cluster's resources.
- Introduced from Hadoop 2.x.
Why did YARN come?
Hadoop 1.x
- Resource management was done by MapReduce
- Processing was also done by MapReduce
So there was a load on MapReduce; even other applications like Pig, Hive etc. ran on top of MapReduce,
and it directly supported only MapReduce.
Hadoop 2.x
- Resource management is given to YARN
- Processing can be done by MapReduce, Spark, Tez
YARN Architecture
JobTracker - replaced by the ResourceManager
and the ApplicationMaster
TaskTracker - replaced by the NodeManager
Slots
The execution locations in Hadoop 1.x are called slots; a DataNode can have multiple slots.
In Hadoop 2.x
slots are called containers; a single node can have multiple containers.
ResourceManager
Resource scheduler - schedules as per the requests from the ApplicationMaster.
Example: the ApplicationMaster requests, say, 120 MB of RAM, and the scheduler allocates it.
ApplicationMaster liveness monitor / NodeManager liveness monitor -
both work on heartbeats.
Several event handlers - example: one machine fails, and the same data needs to be replicated on another machine.
ApplicationMaster
Works at the application level.
Manages the lifecycle of the application.
Negotiates resources from the RM.
The load on the RM is reduced -
the load is distributed between the ResourceManager and the ApplicationMaster.
The Hadoop 1.x JobTracker is replaced by the ResourceManager and the ApplicationMaster.
NodeManager
- Runs on the slaves
- Manages many containers per node
- Manages the lifecycle of the containers
- Sends heartbeats to the ResourceManager
container #1 (e.g. Spark)    container #2 (e.g. MapReduce)
YARN application startup
- The job is submitted by the client -> the job goes to the ResourceManager.
- The ResourceManager goes to the resource scheduler to assign the resources.
- The ResourceManager contacts the NodeManager.
- The NodeManager creates a container.
- In the container the ApplicationMaster is executed, and the ApplicationMaster runs the application.
1. The client submits the application to the ResourceManager.
2. The ResourceManager allocates a container.
3. The ResourceManager contacts the NodeManager.
4. The NodeManager launches the container.
5. The container executes the ApplicationMaster.
Hadoop Distributed File System (HDFS) commands
Creating a Directory:
hdfs dfs -mkdir /path/to/directory
Copying Files or Directories to HDFS:
hdfs dfs -copyFromLocal /path/to/local/file_or_directory /path/in/hdfs
Copying Files or Directories from HDFS to the Local File System:
hdfs dfs -copyToLocal /path/in/hdfs /path/to/local/directory
Listing Files or Directories in HDFS:
hdfs dfs -ls /path/in/hdfs
Removing Files or Directories from HDFS:
hdfs dfs -rm /path/in/hdfs/file_or_directory
Moving (Renaming) Files or Directories in HDFS:
hdfs dfs -mv /path/in/hdfs/source /path/in/hdfs/destination
Displaying the Contents of a File in HDFS:
hdfs dfs -cat /path/in/hdfs/file
Creating an Empty File in HDFS:
hdfs dfs -touchz /path/in/hdfs/file
Checking Disk Usage in HDFS:
hdfs dfs -du -h /path/in/hdfs
Checking the Size of a File in HDFS:
hdfs dfs -du -h /path/in/hdfs/file
Setting Replication Factor for a File:
hdfs dfs -setrep -w <replication_factor> /path/in/hdfs/file
Getting Help:
hdfs dfs -help
Introduction to MapReduce
Why do we use MapReduce?
There are performance issues with vertical scaling;
with commodity hardware and horizontal scaling, performance increases linearly.
Advantages of MapReduce
- Scalability (can add more machines without any downtime)
- Cost effective (uses commodity hardware, which is cheap)
- Flexible (R, Python, Java, Scala, xls, video, audio)
- Fast (parallel processing)
- Security (chown, chmod 640) (HDFS, HBase)
- High availability (NameNode, standby NameNode)
How MapReduce works
{k1; v1, v2}
Map: creates key-value pairs
Reduce: combines the key-value pairs
Example:
We have a big input file; it is divided and stored on 6 different machines.
The map program runs on all 6 machines.
After the program runs, it creates key-value pairs like:
machine 1 - k1=v, k2=v, k3=v
machine 2 -
machine 3 - k1=v, k3=v
machine 4 - k2=v
machine 5 - k3=v, k5=v
machine 6 - k4=v
Group by key
Shuffle & sort
k1=v,v   k2=v,v   k3=v,v,v   k4=v   k5=v
Reducer
combines the key-value pairs and gives the output
o/p = k1=2v, k2=2v, k3=3v, k4=v, k5=v
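A small local Python simulation of the map -> shuffle & sort -> reduce flow, using a word count so the key-value pairs are concrete (on a real cluster the map and reduce functions run on different machines; the sample lines here are just made-up input splits):

from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) key-value pair for every word
    for word in line.split():
        yield word, 1

def shuffle(pairs):
    # Shuffle & sort: group all values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine the values for one key (here, sum the counts)
    return key, sum(values)

splits = ["it is what it is", "what is it", "it is a mango"]
mapped = [pair for line in splits for pair in map_phase(line)]
grouped = shuffle(mapped)
print(dict(reduce_phase(k, v) for k, v in grouped.items()))
# {'it': 4, 'is': 4, 'what': 2, 'a': 1, 'mango': 1}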
How many mappers run?
The number of mappers is decided based on how many splits the input has.
How many reducers run?
The number of reducers depends entirely on the data and on what processing we are doing; it is decided internally.
In the example above, 6 mappers ran (one per split) and the framework decided the number of reducers.
Indexing
T[0] = 'it is what it is'
T[1] = 'what is it'
T[2] = 'it is a mango'
Forward index - indexing on the basis of the file/document name.
Inverted index - indexing on the basis of the terms (used by search engines):
"a": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"mango": {2}
"what": {0, 1}
TF-IDF
Term Frequency - Inverse Document Frequency
TF = no. of times "Hive" appears in the doc / no. of terms in the doc
   = 50 / 1000 = 0.05
IDF = log10(total no. of docs / no. of docs containing the word "Hive")
    = log10(10,000,000 / 1,000) = 4
TF x IDF = 0.05 x 4 = 0.20
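The same arithmetic in Python (log base 10, matching the numbers above):

import math

tf = 50 / 1000                          # "Hive" appears 50 times in a 1000-term doc
idf = math.log10(10_000_000 / 1_000)    # 10M docs total, 1000 of them contain "Hive"
print(tf, idf, tf * idf)                # prints: 0.05 4.0 0.2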
Partitioner
Between map and reduce we have the partitioner.
It makes sure that a particular key's data is processed by a particular reducer.
For example, the data from the map has keys k1 and k2;
the k1 data goes to reducer 1 and the k2 data goes to reducer 2.
The partitioner is again a program.
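A tiny sketch of the partitioning idea in Python (Hadoop's default HashPartitioner does essentially hash(key) mod numReducers; the simple character-sum hash below is only so the output stays stable across runs):

def partition(key: str, num_reducers: int) -> int:
    # decide which reducer handles this key
    return sum(ord(c) for c in key) % num_reducers

num_reducers = 2
for key in ["k1", "k2", "k3", "k4", "k5"]:
    print(f"{key} -> reducer {partition(key, num_reducers)}")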
Hadoop Streaming (utility)
Allows you to create & run MapReduce jobs
in non-Java languages - R, Python.
Syntax:
$ /usr/bin/hadoop jar /usr/lib/hadoop-map/hadoop-streaming.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /user/harish/file1.txt \
    -output /user/harish/file1_output
(the -output path is a directory and must not already exist)
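A minimal mapper.py / reducer.py pair that a command like this could run (a word-count sketch; these scripts are hypothetical examples that read stdin and write stdout, which is how Hadoop Streaming talks to non-Java programs - the reducer relies on Hadoop sorting the map output by key):

#!/usr/bin/env python3
# mapper.py - emit "word<TAB>1" for every word on stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py - input arrives sorted by key, so counts for a word are contiguous
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")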
MRJob (from Yelp)
Allows you to run MapReduce programs on the local system (Unix, Linux) -
e.g. "people who viewed this also viewed this" style jobs.
Mapper.py & Reducer.py
Advantages of MRJob
- The whole MapReduce job is a single class.
- Easy to upload, install and deploy dependencies.
- Can download error logs.
- Easy to integrate with
Amazon EC2 / EMR.
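A minimal MRJob word count, showing the "single class" advantage (assumes mrjob is installed, e.g. pip install mrjob; the file name wordcount_job.py is just an example):

# wordcount_job.py - run locally with: python wordcount_job.py input.txt
from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, _, line):
        # emit (word, 1) for every word in the input line
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        # sum the counts for each word
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()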
MapReduce practice
Install Hadoop on Docker:
https://medium.com/analytics-vidhya/how-to-easily-install-hadoop-with-docker-ad094d556f11
https://drive.google.com/drive/folders/1LrYzvpjvZ7k5Y8plzq6vDVk8-IvcYr0O
