Data engineering
4 stages of the data lifecycle: ingestion, storage, process & analyze, explore & visualize.
Ingestion
Gather data from multiple sources
Data gathered from apps
- Event logs, clickstream data, e-commerce transactions
Streaming ingest
- Pub/Sub (see the sketch after this list)
Batch Ingest
- Different transfer services
- GCS - gsutil
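A quick sketch of both ingest modes (the topic, file, and bucket names here are hypothetical):
streaming ingest - publish an event to a Pub/Sub topic:
gcloud pubsub topics create click-events
gcloud pubsub topics publish click-events --message='{"event":"click","page":"/home"}'
batch ingest - copy a file into Cloud Storage:
gsutil cp events.csv gs://my-ingest-bucket/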
Store data
Once your data is gathered, you need to think about how to store it.
Cost-efficient & durable data storage
Structured data
Semi-structured data
Unstructured data
Structured data
- Transactional
- Analytical
https://cloud.google.com/products/databases?hl=en#store
| Storage Type Category | GCP Storage Name | When and Why to Use | Open Source Example |
| --- | --- | --- | --- |
| Structured (Transactional) | Cloud SQL | Structured data storage for relational databases. | MySQL, PostgreSQL |
| Structured (Transactional) | Cloud Spanner | Globally distributed, horizontally scalable database. Suitable for transactional workloads and ACID compliance. Examples: online transactional systems, e-commerce. | |
| Structured (Analytical) | BigQuery | Serverless data warehouse for analytics. Suited for complex queries, large-scale reporting. Examples: business intelligence, data warehousing. | Apache Hive, Amazon Redshift |
| Semi-Structured (Fully Indexed) | Cloud Datastore | NoSQL database for semi-structured data. Fully managed, scalable, suitable for document-like data. Examples: content management systems, flexible data models. | MongoDB, Apache Cassandra |
| Semi-Structured (Row Key) | Cloud Bigtable | Wide-column store for large-scale, real-time workloads. Suitable for time-series data, IoT applications. Examples: IoT data storage, monitoring systems. | Apache HBase |
| Unstructured | Cloud Storage | Object storage for unstructured data, files, and media. Scalable, durable, suitable for large datasets. Examples: image storage, backups, multimedia assets. | Amazon S3, Azure Blob Storage |
ACID properties:
| ACID Property | Definition | Example |
| --- | --- | --- |
| Atomicity | Ensures that a transaction is treated as a single, indivisible unit of work. Either all operations within the transaction are executed successfully, or none are executed at all. | Funds transfer between two bank accounts: Either the entire transfer (debit from one account and credit to another) is completed successfully or none of it happens. |
| Consistency | Ensures that a transaction brings the database from one valid state to another, satisfying predefined integrity constraints before and after the transaction. | Enforcing a rule that all accounts must maintain a positive balance; a transaction violating this rule will be rejected, ensuring consistency. |
| Isolation | Ensures that the execution of one transaction is isolated from the execution of other transactions, even if multiple transactions are executing concurrently. | Two transactions concurrently updating the same set of data: Isolation ensures that each transaction's changes are not visible to the other until both transactions are complete. |
| Durability | Guarantees that once a transaction is committed, its effects persist even in the event of system failures, such as power outages or crashes. | Confirmation message after completing an online purchase: Durability ensures that the purchase information is stored persistently and won't be lost, even if the system crashes after the confirmation. |
Process and analyze
In this stage we need to think: what kind of outcome do we want?
What analysis do we want to perform?
This means converting your data into meaningful information, rather than just a simple data dump.
You can also analyze data with BigQuery
Or you can apply some kind of ML
- BigQuery ML
- Spark ML with Dataproc
- Vertex AI
- build an ML model with AutoML or a custom model
https://cloud.google.com/products/databases?hl=en
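For example, a quick sketch of training a model with BigQuery ML from the bq command-line tool (the dataset, table, and column names here are hypothetical):
bq query --use_legacy_sql=false '
CREATE OR REPLACE MODEL mydataset.churn_model
OPTIONS (model_type="logistic_reg", input_label_cols=["churned"]) AS
SELECT age, tenure_months, churned
FROM mydataset.customers'
You can then score new rows with ML.PREDICT in another bq query.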
Explore and visualize
Google Data Studio - easy-to-use BI tool
Dashboards & visualization
Datalab
Interactive Jupyter notebooks
Support for all major data science libraries
Prebuilt ML APIs
Vision API
Speech API
Types of Data
Structured
Semi-structured
Unstructured
Structured data
Tabular Data
Represented by Rows & Columns
SQL can be used to interact with data
Fixed Schema
Each row has the same number of columns
Relational databases are structured
MySQL, PostgreSQL
In GCP, Cloud SQL, Cloud Spanner
Semi-Structured Data
Each record has a variable number of properties
No Fixed schema
Flexible structure
NoSQL kind of data
Stores data as key-value pairs
JSON (JavaScript Object Notation) is the base way to represent semi-structured data
MongoDB, Cassandra, Redis, Neo4j
In GCP: Bigtable, Datastore, Memorystore
Unstructured Data
No predefined structure in the data
Examples: images, video data, natural language
In GCP, use Cloud Storage or Filestore to store unstructured data
Batch Data vs Streaming Data
Batch data processing
Defined start & end of the data - the data size is known
Processes a high volume of data at periodic intervals
Takes a long time to process the data
Example: payment processing
Streaming Data
Unbounded - no end defined
Data is processed as it arrives
Size is unknown
No heavy processing - takes milliseconds to seconds to process the data
Example: stock data processing
SLA committed
99.9% uptime: the service is guaranteed to be operational and available 99.9% of the time. In a year, this allows for approximately 8.76 hours of downtime.
99.99% uptime: a higher level of reliability, with the service guaranteed to be available 99.99% of the time. This translates to about 52.56 minutes of downtime per year.
Worked out for 99.9% uptime, given 24 hours in a day and 365 days in a year:
allowed downtime = 0.1% of a year = 0.001 × 365 × 24 = 8.76 hours
For 99.99% uptime: 0.0001 × 8,760 hours = 0.876 hours ≈ 52.56 minutes
Google Cloud Storage
Object Organization
Storage Location
Storage Class
Object lifecycle management
Standard storage
Nearline storage
Coldline storage
Archive storage
With a lifecycle rule, an object in any of the above classes moves to the next (colder) class when the condition we set is met.
One example is shown below.
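A minimal sketch of a lifecycle config (the ages are just example values): save the rules as lifecycle.json and apply them with gsutil. These rules move objects to Nearline after 30 days and delete them after 365 days.
{
  "rule": [
    {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"}, "condition": {"age": 30}},
    {"action": {"type": "Delete"}, "condition": {"age": 365}}
  ]
}
gsutil lifecycle set lifecycle.json gs://manjeet-demo-1/
gsutil lifecycle get gs://manjeet-demo-1/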
Encryption (in GCS)
Google-managed encryption keys
- no configuration
- fully managed
Customer-managed encryption keys (CMEK)
- create a key ring in Cloud KMS
- the key is managed by the customer, e.g. key rotation
Customer-supplied encryption keys (CSEK)
- we generate the key with: openssl rand -base64 32
- gsutil - encrypt with CSEK
Create a key ring if one is not available.
Select the key; then, once the key is available, create the bucket. You will then see the bucket encryption shown as Customer-managed.
Customer supplied
generate base64 key
openssl rand -base64 32
create a file (foo.txt) with some content:
Hello
Foo
bar
to list all buckets
gsutil ls
copy the file to the bucket
gsutil cp foo.txt gs://manjeet-demo-1/
copy a file with the customer-supplied key
gsutil -o 'GSUtil:encryption_key=3vdssgs97+cdsfsdF=' cp foo.txt gs://manjeet-demo-1/
Now go to the console and check: the object's encryption should show as Customer-supplied.
How to see the data of a file encrypted with a customer-supplied key?
gsutil -o 'GSUtil:encryption_key=3vdssgs97+cdsfsdF=' cat gs://manjeet-demo-1/foo.txt
Object versioning
Helps prevent accidental deletion of objects
Enable/disable versioning at the bucket level
Get access to an older version with (object key + version number)
If you don't need earlier versions, delete them to reduce storage cost
If you don't specify a version number, you always retrieve the latest version
cmd to enable versioning on a bucket
gsutil versioning set on gs://manjeet-demo-1/
get the status of versioning on a bucket
gsutil versioning get gs://manjeet-demo-1/
now create a file
hello.txt
i am writing from here
upload the same file with added text
hello.txt
i am writing from here
2nd line
upload the file in bucket
gsutil cp hello.txt gs://manjeet-demo-1/
list all file in a bucket
gsutil ls gs://manjeet-demo-1/
but if you want to see all versions
gsutil ls -a gs://manjeet-demo-1/
to see the content of one version of the hello.txt file
gsutil cat gs://manjeet-demo-1/hello.txt#1701399298051418
gsutil cat gs://manjeet-demo-1/hello.txt#1701399356570041
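An older version can also be restored by copying it over the live object - the copy becomes the new latest version (a sketch reusing one of the generation numbers above):
gsutil cp gs://manjeet-demo-1/hello.txt#1701399298051418 gs://manjeet-demo-1/hello.txt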
delete the 1st version of the file
gsutil rm gs://manjeet-demo-1/hello.txt#1701399298051418
Controlling access on bucket
Permissions: who can do what on GCS, and at what level
Apply at the bucket level
Uniform bucket-level access
- no object-level permissions
- applies uniformly to all objects inside the bucket
Fine-grained permissions
- Access Control Lists (ACLs) for each object separately
Apply at the project level
Different predefined Role
- Storage Admin
- Storage object Admin
- Storage object Creator
- Storage object Viewer
Create Custom Role
Assign a bucket-level role (see the CLI sketch below)
- select the bucket & assign the role
- to a user
- to another GCP service or product
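A sketch of granting a bucket-level role from the command line (the user email is hypothetical):
gsutil iam ch user:alice@example.com:roles/storage.objectViewer gs://manjeet-demo-1/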
Practical Uniform and Fine-grained
Create 2 buckets
Uniform
Fine-grained
Create 2 files
uniform.txt
finegrained.txt
Upload the files to their respective buckets
Uniform access control means:
access to all objects in the bucket is managed using only bucket-level permissions.
Fine-grained:
specify access to individual objects by using object-level permissions (ACLs) in addition to your bucket-level permissions.
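A sketch of creating both buckets from the command line: -b on enables uniform bucket-level access, -b off keeps fine-grained ACLs (the bucket names are hypothetical):
gsutil mb -b on gs://demo-uniform-bucket
gsutil mb -b off gs://demo-finegrained-bucket
gsutil cp uniform.txt gs://demo-uniform-bucket
gsutil cp finegrained.txt gs://demo-finegrained-bucket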
When you try to give public access to an individual object in the Uniform bucket, you will get an error, because object ACLs are disabled there.
How to access a Uniform bucket object publicly?
Give bucket-level access:
Grant access - Principal: allUsers - Role: Storage Object Viewer
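The same grant as a command-line sketch (allUsers makes every object in the bucket publicly readable, so use it carefully):
gsutil iam ch allUsers:objectViewer gs://manjeet-demo-1/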
Project Level role
Exam: we want to give access to a valid Google user
After being granted access, the user can log in to Cloud Storage
To verify, check that his/her access is working
Signed URL for a GCS bucket object (temporary access)
Temporary access for an outside user.
You can give access to a user who doesn't have a Google account.
The URL expires after the defined time period.
The max period for which a URL is valid is 7 days.
gsutil signurl -d 10m -u gs://<bucket>/<object>
Generate a key for your IAM service account
Download the key in JSON format
- create an object in a bucket; upload the key to Cloud Shell
- generate the signed URL with the key
- gsutil signurl -d 60s signed-key.json gs://<bucket>/<object>
In case gsutil has an issue generating the signed URL, install the below dependencies
sudo apt-get install libssl-dev
sudo pip3 install --upgrade pip
sudo pip3 install pyopenssl
Bucket Retention Policy
Minimum duration for which objects in the bucket are protected from:
Deletion
Modification
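A sketch of setting and checking a retention policy with gsutil (the 30-day period is just an example):
gsutil retention set 30d gs://manjeet-demo-1/
gsutil retention get gs://manjeet-demo-1/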
GCS - Pricing
Storage pricing
Data access pricing
Go to the Cloud Console, create a bucket, and observe the pricing.
Check it with Nearline as well.
Data Transfer Services
Data migration Services
From On-premises to Google Cloud Storage(GCS)
From one bucket to another bucket inside the same GCP project
From other public clouds (Amazon S3, Azure containers) to GCS
ONLINE mode of transfer
gsutil - command-line utility
- online mode of transfer
- install the Google Cloud SDK locally
- gsutil -m cp large_number_of_small_files (-m for parallel upload)
- Should we go for it or not? It depends on the data size and available bandwidth (see the 20 TB guideline below).
Transfer Service for on-premises data
This quickly and securely moves your data from private data centers into Google Cloud Storage.
Two-step process
- install an agent
- create a transfer job
Transfer Appliance (OFFLINE mode of transfer)
- a physical device which securely transfers large amounts of data to Google Cloud Platform
- use when the data exceeds 20 TB or would take more than a week to upload
Block storage
Block storage - hard disk storage
Direct-attached - Local SSD
- Local SSD
- physically attached to the VM
- very high performance - 10x to 100x of a Persistent Disk
- costlier than a Persistent Disk
- cannot be reattached to another VM
- once the VM is destroyed, the Local SSD is deleted
- lower availability
- temporary/ephemeral storage
- no snapshots
Network-attached storage
- network-attached hard disk
- Persistent Disks
- zonal, regional
- not attached directly to any VM
- can be reattached to another VM
- very flexible - resizes easily
- permanent storage
- snapshots supported
- cheaper than Local SSD
Create Network Storage
Attach an existing disk, created earlier, from the Disks page
Add a new persistent disk
Disks created and attached
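The same flow as a command-line sketch (the disk name, VM name, and zone are hypothetical):
gcloud compute disks create demo-disk --size=100GB --type=pd-balanced --zone=us-central1-a
gcloud compute instances attach-disk demo-vm --disk=demo-disk --zone=us-central1-a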
Storage
Which storage to use when
Cloud Storage
- Unstructured data storage
- Video streams, images
- Staging environment
- Compliance
- backup
- data lake
Persistent Disk
- Attach disks to VMs & containers
- Share a read-only disk with multiple VMs
- Database storage
Local SSD
- temporary, high-performance attached disk
Filestore
- predictable performance
- lift-and-shift millions of files
OLTP vs OLAP
OLTP- Online Transaction Processing
Simple queries
Large number of small transactions
Traditional RDBMS
Database modifications
Popular databases - MySQL, PostgreSQL, Oracle, MS SQL Server
ERP, CRM, banking applications
GCP - Cloud SQL, Cloud Spanner
OLAP - Online Analytical Processing
Data warehousing
Data is collected from multiple sources
Complex queries
Data analysis
Google BigQuery - petabyte-scale data warehouse
Reporting applications, web click analysis, BI dashboard apps
RTO - Recovery Time objective
- maximum time for which the system can be down
RPO - Recovery Point objective
- maximum window of data loss (measured in time) the organization can tolerate
Durability
If you lose data, the business is down
- no business can afford to lose data
How healthy & resilient your data is
Object storage providers measure durability in terms of the number of 9's
Exam: 99.999999999% - eleven 9's
That means that even with one billion objects, you would likely go a hundred years without losing a single one
Availability
If the region where your data is stored goes down
- replicate data across many regions
How much of the time the data is up/available to access
- data replicated across multiple regions means higher availability
- SLA - service level agreement
- SLA - 99.99%: four 9's
https://uptime.is
https://cloud.google.com/terms/sla