Tuesday, November 28, 2023

GCP Data Engineering

The 4 stages of the data lifecycle:


Ingestion

Gather data from multiple sources, for example from applications:

  •  Event logs, clickstream data, e-commerce transactions

Streaming ingest

  •  Pub/Sub

Batch Ingest

  •  Various transfer services
  •  GCS via gsutil

Store data

Once your data is gathered, you need to decide how to store it.
Aim for cost-efficient & durable data storage.

Structured data
Semi-structured data
Unstructured data

Structured data

  • Transactional
  • Analytical

https://cloud.google.com/products/databases?hl=en#store

 

Storage Type Category | GCP Storage Name | When and Why to Use | Open Source Example

Structured (Transactional) | Cloud SQL
Structured data storage for relational databases.
Open source examples: MySQL, PostgreSQL

Structured (Transactional) | Cloud Spanner
Globally distributed, horizontally scalable database.
Suitable for transactional workloads and ACID compliance.
Examples: online transactional systems, e-commerce.

Structured (Analytical) | BigQuery
Serverless data warehouse for analytics.
Suited for complex queries and large-scale reporting.
Examples: business intelligence, data warehousing.
Open source examples: Apache Hive, Amazon Redshift

Semi-Structured (Fully Indexed) | Cloud Datastore
NoSQL database for semi-structured data.
Fully managed, scalable, suitable for document-like data.
Examples: content management systems, flexible data models.
Open source examples: MongoDB, Apache Cassandra

Semi-Structured (Row Key) | Cloud Bigtable
Wide-column store for large-scale, real-time workloads.
Suitable for time-series data and IoT applications.
Examples: IoT data storage, monitoring systems.
Open source example: Apache HBase

Unstructured | Cloud Storage
Object storage for unstructured data, files, and media.
Scalable, durable, suitable for large datasets.
Examples: image storage, backups, multimedia assets.
Open source examples: Amazon S3, Azure Blob Storage

 

ACID properties:

Atomicity: Ensures that a transaction is treated as a single, indivisible unit of work: either all operations within the transaction are executed successfully, or none are. Example: a funds transfer between two bank accounts either completes fully (debit from one account and credit to another) or does not happen at all.

Consistency: Ensures that a transaction brings the database from one valid state to another, satisfying predefined integrity constraints before and after the transaction. Example: a rule that all accounts must maintain a positive balance; a transaction violating this rule is rejected.

Isolation: Ensures that the execution of one transaction is isolated from other concurrently executing transactions. Example: when two transactions concurrently update the same data, isolation ensures that each transaction's changes are not visible to the other until both are complete.

Durability: Guarantees that once a transaction is committed, its effects persist even in the event of system failures such as power outages or crashes. Example: after an online purchase is confirmed, the purchase information is stored persistently and won't be lost even if the system crashes.
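The atomicity and consistency properties above can be sketched with Python's built-in sqlite3 module (the table, account names, and balance rule are illustrative, not from any real system):

```python
# Minimal sketch of atomicity: a failed funds transfer rolls back entirely.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # opens a transaction; rolls back automatically on exception
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'alice'")
        # Consistency rule: no account may go negative.
        (bal,) = conn.execute("SELECT balance FROM accounts WHERE name = 'alice'").fetchone()
        if bal < 0:
            raise ValueError("insufficient funds")
        conn.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'bob'")
except ValueError:
    pass

# Neither update took effect: the whole transaction was rolled back.
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 0}
```

The `with conn:` block is what makes the two UPDATEs a single indivisible unit: either both commit or both roll back.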

 

 

Process and analyze

In this stage, think about what kind of outcome you want and what analysis you want to perform.
This means converting your data into meaningful information, rather than just keeping a raw data dump.
You can analyze data with BigQuery, or apply some kind of ML:

  •  BigQuery ML
  •  Spark ML with Dataproc
  •  Vertex AI
  •  Build an ML model with AutoML or a custom model

 


Explore and visualize

Google Data Studio - easy-to-use BI tool
 Dashboards & visualization

Datalab
Interactive Jupyter notebooks
Support for all major data science libraries

Prebuilt ML APIs
Vision API
Speech API

 

Types of Data

Structured
Semi-structured
Unstructured


Structured data

Tabular data

Represented by rows & columns

SQL can be used to interact with the data

Fixed schema

Each row has the same number of columns

Relational databases are structured

MySQL, PostgreSQL

In GCP: Cloud SQL, Cloud Spanner

 

 

Semi-Structured Data

Each record has a variable number of properties
No fixed schema
Flexible structure
NoSQL-style data
Stores data as key-value pairs
JSON (JavaScript Object Notation) is the most common way to represent semi-structured data
MongoDB, Cassandra, Redis, Neo4j
In GCP: Bigtable, Datastore, Memorystore
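The "variable properties per record" idea can be sketched with Python's standard json module (the records below are made up for illustration):

```python
# Two JSON records in the same collection can carry different sets of keys:
# there is no fixed schema to enforce.
import json

records = [
    '{"id": 1, "name": "alice", "email": "alice@example.com"}',
    '{"id": 2, "name": "bob", "tags": ["admin", "beta"]}',
]

parsed = [json.loads(r) for r in records]
print(sorted(parsed[0].keys()))  # ['email', 'id', 'name']
print(sorted(parsed[1].keys()))  # ['id', 'name', 'tags']
```

A relational table would reject the second record (or force NULL columns); a document store simply stores whatever keys each record carries.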

 

 

Unstructured Data

No predefined structure in the data

  •  Images
  •  Video data
  •  Natural language

In GCP, use Cloud Storage or Filestore to store unstructured data.

Batch Data vs Streaming Data

Batch data processing
 Defined start & end of the data - data size is known
 Processes a high volume of data at periodic intervals
 Takes a long time to process the data
 Example: payment processing

Streaming data
 Unbounded - no end defined
 Data is processed as it arrives
 Size is unknown
 No heavy processing - takes milliseconds to seconds per record
 Example: stock data processing
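The contrast above can be sketched in a few lines of Python (the payment amounts are made up for illustration):

```python
# Batch: the whole (bounded) dataset is known up front; process it in one go.
def batch_total(payments):
    return sum(payments)

# Streaming: records arrive one at a time from a potentially unbounded
# source; each record is processed as it arrives and a result is emitted
# per event rather than once at the end.
def streaming_totals(event_stream):
    running = 0
    for amount in event_stream:  # could just as well be an endless generator
        running += amount
        yield running

payments = [10, 20, 30]
print(batch_total(payments))             # 60
print(list(streaming_totals(payments)))  # [10, 30, 60]
```

The key difference: `batch_total` cannot start until all data is present, while `streaming_totals` emits a result after every event.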


SLA committed

99.9% uptime: This means that the service is guaranteed to be operational and available 99.9% of the time. In a year, this allows for approximately 8.76 hours of downtime.

99.99% uptime: This represents a higher level of reliability, with the service guaranteed to be available 99.99% of the time. This translates to about 52.56 minutes of downtime per year.

 

For 99.9% uptime:

Downtime = (1 - 99.9/100) × Total time in a year

Downtime = 0.001 × Total time in a year

Given that there are 24 hours in a day and 365 days in a year:

Downtime = 0.001 × (24 × 365)

Downtime = 0.001 × 8760

Downtime = 8.76 hours
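The same arithmetic in a few lines of Python, covering both SLA levels mentioned above:

```python
# Annual downtime allowed by an uptime SLA.
HOURS_PER_YEAR = 24 * 365  # 8760

def downtime_hours(uptime_percent):
    return (1 - uptime_percent / 100) * HOURS_PER_YEAR

print(round(downtime_hours(99.9), 2))        # 8.76 hours/year
print(round(downtime_hours(99.99) * 60, 2))  # 52.56 minutes/year
```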


Google Cloud Storage

Object Organization


Storage Location

Storage Class

Object lifecycle management
Standard storage
Nearline storage
Coldline storage
Archive storage

With lifecycle management, an object in any of the above classes can move to the next class based on conditions we set (for example, the object's age).
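As a sketch, such a policy can be written as a JSON lifecycle configuration and applied with `gsutil lifecycle set lifecycle.json gs://<bucket>`; the age thresholds here (30/90/365 days) are illustrative assumptions, not recommendations:

```json
{
  "rule": [
    {
      "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
      "condition": {"age": 30, "matchesStorageClass": ["STANDARD"]}
    },
    {
      "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
      "condition": {"age": 90}
    },
    {
      "action": {"type": "Delete"},
      "condition": {"age": 365}
    }
  ]
}
```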




Encryption (in GCS)

Google-managed encryption keys

  •  No configuration
  •  Fully managed

Customer-managed encryption keys (CMEK)

  •  Create a key ring in Cloud KMS
  •  The key is managed by the customer, e.g. key rotation

Customer-supplied encryption keys (CSEK)

  •  Generate a key with: openssl rand -base64 32
  •  Use gsutil to encrypt with the CSEK


Create a key ring if one is not available.

Select the key, then create the bucket; afterwards the bucket's encryption will show as "Customer-managed".


Customer supplied


Generate a base64 key:
openssl rand -base64 32

Create a file:
foo.txt
Hello
Foo
bar

List all buckets:
gsutil ls

Copy the file to the bucket (unencrypted, for comparison):
gsutil cp foo.txt gs://manjeet-demo-1/

Copy a file with a customer-supplied key:
gsutil -o 'GSUtil:encryption_key=3vdssgs97+cdsfsdF=' cp foo.txt gs://manjeet-demo-1/

Now go to the console and check: the object's encryption should show "Customer-supplied".

How to see the contents of a file encrypted with a customer-supplied key:
gsutil -o 'GSUtil:encryption_key=3vdssgs97+cdsfsdF=' cat gs://manjeet-demo-1/foo.txt
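A CSEK must be a 256-bit key, base64-encoded. A quick local sanity check of the key-generation step (no GCP access needed; the gsutil usage above is unchanged):

```shell
# Generate a 32-byte (256-bit) random key, base64-encoded.
KEY=$(openssl rand -base64 32)
echo "$KEY"

# 32 raw bytes always base64-encode to 44 characters (including '=' padding).
echo "${#KEY}"                    # 44

# Decoding it back should yield exactly 32 bytes.
echo "$KEY" | base64 -d | wc -c   # 32
```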


Object versioning

Helps prevent accidental deletion of objects
Enable/disable versioning at the bucket level
Access older versions with (object key + version number)
If you don't need earlier versions, delete them to reduce storage cost
If you don't specify a version number, you always retrieve the latest version


Command to enable versioning on a bucket:
gsutil versioning set on gs://manjeet-demo-1/

Check the versioning status of a bucket:
gsutil versioning get gs://manjeet-demo-1/

Now create a file:
hello.txt
i am writing from here

Update the same file with added text:
hello.txt
i am writing from here
2nd line

Upload the file to the bucket (after each change):
gsutil cp hello.txt gs://manjeet-demo-1/

List all files in the bucket:
gsutil ls gs://manjeet-demo-1/

To see all versions as well:
gsutil ls -a gs://manjeet-demo-1/


To see the contents of a particular version of hello.txt:
gsutil cat gs://manjeet-demo-1/hello.txt#1701399298051418

gsutil cat gs://manjeet-demo-1/hello.txt#1701399356570041


Delete the 1st version of the file:
gsutil rm gs://manjeet-demo-1/hello.txt#1701399298051418


Controlling access on bucket

Who can do what on GCS, and at what level

Permissions applied at the bucket level:

Uniform access

  •  No object-level permissions
  •  The same permissions apply uniformly to all objects inside the bucket


Fine-grained permissions

  •  Access Control Lists (ACLs) for each object separately


Applied at the project level:

 IAM
 Different predefined roles:
  •   Storage Admin
  •   Storage Object Admin
  •   Storage Object Creator
  •   Storage Object Viewer

 Or create a custom role

Assign a bucket-level role:

  •  Select the bucket & assign the role
  •  To a user
  •  To another GCP service or product



Practical: Uniform and Fine-grained

Create 2 buckets:
Uniform
Fine-grained


Create 2 files:
uniform.txt
finegrained.txt

Upload each file to its respective bucket.


Uniform access control means:
access to all objects in the bucket is controlled by bucket-level permissions only.

Fine-grained:
specify access to individual objects using object-level permissions (ACLs) in addition to bucket-level permissions.

When you try to give public access to a single object in a uniform bucket, you will get an error.

How to make a uniform bucket's objects publicly accessible?
Give bucket-level access:
Grant access - Principal: allUsers - Role: Storage Object Viewer


 

Project-level role
Exam: we want to give access to a valid Google user.

After granting access to a Google user, they can log in to Cloud Storage.
Verify whether their access works.

 



Signed URLs for GCS bucket objects (temporary access)

Temporary access for outside users.
You can give access to users who don't have a Google account.
The URL expires after the defined time period.
The maximum period for which a URL is valid is 7 days.
gsutil signurl -d 10m -u gs://<bucket>/<object>

Generate a key for your IAM service account.
Download the key in JSON format.

  • Create an object in a bucket; upload the key file to Cloud Shell
  • Generate the signed URL with the key:
  • gsutil signurl -d 60s signed-key.json gs://<bucket>/<object>


In case gsutil has issues generating the signed URL, install these dependencies:
sudo apt-get install libssl-dev
sudo pip3 install --upgrade pip
sudo pip3 install pyopenssl

Bucket Retention Policy

The minimum duration for which objects in the bucket are protected from:
Deletion
Modification


GCS - Pricing

Storage pricing
Data access pricing
Go to the Cloud Console, create a bucket, and observe the pricing.


Compare the pricing with the Nearline class.

Data Transfer Services


Data migration Services

From On-premises to Google Cloud Storage(GCS)

From One bucket to another bucket inside same GCP

From Other public cloud Amazon S3, Azure Container to GCS


Online mode of transfer

gsutil - command-line utility

  •  Online mode of transfer
  •  Installed locally with the Google Cloud SDK
  •  gsutil -m cp large_number_of_small_files gs://<bucket>/ (-m for parallel upload)
  •  Should we go for it or not? Depends on data volume and network bandwidth.


Transfer Service for on-premises data
 Quickly and securely moves your data from private data centers into Google Cloud Storage
 Two-step process:

  •   Install an agent
  •   Create a transfer job


Transfer Appliance (offline mode of transfer)

  •  A physical device that securely transfers large amounts of data to Google Cloud Platform
  •  Use when the data exceeds 20 TB or would take more than a week to upload



Block storage

Block storage - hard disk storage


Direct-attached - Local SSD

  •  Local SSD
  •  Physically attached to the VM
  •  Very high performance - 10x to 100x that of a persistent disk
  •  Costlier than a persistent disk
  •  Cannot be re-attached to another VM
  •  Once the VM is destroyed, the Local SSD is deleted
  •  Lower availability
  •  Temporary/ephemeral storage
  •  No snapshots


Network-attached storage

  •  Network-attached hard disk
  •  Persistent Disks
  •  Zonal or regional
  •  Not attached directly to any VM
  •  Can be re-attached to another VM
  •  Very flexible - resizes easily
  •  Permanent storage
  •  Snapshots supported
  •  Cheaper than Local SSD

Create network storage

Attach an existing disk, created earlier, from the Disks page.

Add a new persistent disk.

Disks are then created and attached.


Storage

Which storage to use when

Cloud Storage

  • Unstructured data storage
  • Video streams, images
  • Staging environment
  • Compliance
  • Backups
  • Data lake


Persistent Disk

  • Attach disks to VMs & containers
  • Share a read-only disk with multiple VMs
  • Database storage


Local Disk

  • Temporary, high-performance attached disk


Filestore

  • Predictable performance
  • Lift-and-shift millions of files


OLTP vs OLAP

OLTP - Online Transaction Processing

Simple queries
Large number of small transactions
Traditional RDBMS
Database modifications
Popular databases: MySQL, PostgreSQL, Oracle, MSSQL
ERP, CRM, banking applications
GCP: Cloud SQL, Cloud Spanner



OLAP - Online Analytical Processing

Data warehousing
Data is collected from multiple sources
Complex queries
Data analysis
Google Cloud BigQuery - petabyte-scale data warehouse
Reporting applications, web click analysis, BI dashboard apps


RTO - Recovery Time Objective

  •  Maximum time for which the system can be down

RPO - Recovery Point Objective

  •  Maximum period of data loss the organization can tolerate


Durability

Losing data means:
  •  the business is down
  •  no business can afford to lose data

How healthy & resilient your data is.
Object storage providers measure durability in terms of the number of 9's.
Exam: 99.999999999% - 11 9's
That means that even with one billion objects, you would likely go a hundred years without losing a single one.
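A quick back-of-the-envelope check of that claim (an approximation: it assumes an independent annual loss probability per object, which is a simplification of how durability is actually modeled):

```python
# 11 nines of durability: probability an object survives one year.
durability = 0.99999999999
p_loss_per_object_year = 1 - durability  # ~1e-11
objects = 1_000_000_000                  # one billion objects

expected_losses_per_year = objects * p_loss_per_object_year
print(expected_losses_per_year)          # ~0.01 objects lost per year

years_per_single_loss = 1 / expected_losses_per_year
print(years_per_single_loss)             # ~100 years per single loss
```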

Availability

If the region where your data is stored goes down:

  •  Replicate data across multiple regions

How much of the time the data is up/available to access.

  • Data replicated across multiple regions means higher availability
  • SLA - Service Level Agreement
  • SLA of 99.99%: four 9's

https://uptime.is
https://cloud.google.com/terms/sla



