Friday, December 23, 2022

Prometheus and Grafana

Logging
Logs are useful when a problem has already happened and you know about it, to diagnose the problem.
In production you go to the log files and try to figure out what the problem is and fix it.

But sometimes you need to find out why a product/application is slow, and answer questions like:
How many errors and exceptions?
What is the response time?
How many times is the API called?
How many servers?
How many users from New York?

For this type of information, we need 3 parameters:
Metric - Value - Time
Metric - error, response time etc
Value - errors - 3, response time - 4ms


Prometheus - is a time series database.
It can Store, Query, and Alert.

It can store time series: metric, time and value.
It can query and alert based on a threshold.
Threshold example: if the number of errors within a given 5 minutes goes above 10, then raise an alert.
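The threshold idea above can be sketched in plain Python (made-up sample data; in Prometheus this is really a PromQL expression plus an alerting rule, not code like this):

```python
# Sketch: alert when more than 10 errors occur within a 5-minute window.
# "samples" is a hypothetical list of (timestamp_seconds, error_count) pairs.

WINDOW = 5 * 60      # 5 minutes
THRESHOLD = 10

def should_alert(samples, now):
    """Sum the error counts recorded in the last WINDOW seconds."""
    recent = sum(count for ts, count in samples if now - ts <= WINDOW)
    return recent > THRESHOLD

samples = [(0, 4), (60, 3), (290, 5)]   # errors logged at t=0s, 60s, 290s
print(should_alert(samples, 300))       # 4+3+5 = 12 > 10 -> True
print(should_alert(samples, 700))       # window has moved past them -> False
```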

What is Telemetry?
Telemetry in software refers to the collection of business and diagnostic data from software running in production, and storing and visualising it for the purpose of diagnosing production issues or improving the business.

Prometheus - its visualization is not good, and it's not enough for real use cases.

Also, if we need to visualize metrics that are stored elsewhere, e.g. in SQL or Amazon CloudWatch,
we don't need to move them into Prometheus; instead we can use Grafana for visualization.

You can also create alerts in Grafana.


Data collection
How to collect metrics and store them in Prometheus.

So we have Prometheus on one server or in a cluster,
and we want to send metrics to Prometheus.
1. We have code on the server, and using a client library the code exposes metrics to Prometheus.
The code can be e.g. Ruby, Python, .NET or Java; the library collects the metrics and makes them available to Prometheus.

  

Push Gateway - is a component of Prometheus. It acts as temporary storage where applications can send their metrics; Prometheus then goes to the Push Gateway and scrapes (pulls) the metrics from it.

 

 

2. In many cases we don't have the source code, e.g. SQL, Amazon CloudWatch, HAProxy, IoT,
and we cannot change the code there (e.g. sensors storing into an IoT time series database).
In this scenario we use an Exporter, and Prometheus will go and pull the metrics from the exporter.


Scraping
The process of connecting to the exporter and pulling the metrics into Prometheus is called scraping.
Scraping is configured in the Prometheus config file prometheus.yml; the default scrape interval is 15s.

Data model
Prometheus stores data as time series
Every time series is identified by metric name and labels
Labels are a key and value pair
Labels are optional
 <metric name>{key=value, key=value, ...}
 auth_api_hit{count=1, time_taken=800}

Data Types

1. Scalar   - Float   1  1.5
              String

prometheus_http_requests_total{code="200", job="prometheus"}

Here 200 is a string, not a float, because it is in double quotes.
 

Query
prometheus_http_requests_total{code=~"2.*",job="prometheus"}

If you query for a float value you don't use any double or single quotes.

 

2. Instant Vectors - single sample value

An instant vector selects a set of time series with a single sample value each, at one timestamp.
You only use the metric name; you are not using a range.

If you want to filter the metrics, you can use the key-value pairs (labels).
e.g. auth_api_hit  5
Because you get only one value per series, it's called an instant vector.

If you use some filter:
auth_api_hit{count=1, time_taken=800}
the value you will get is 1.

3. Range Vector []

Another data type is the range vector.
Range vectors are similar to instant vectors except they select a range of samples.
metric_name[time_spec]
auth_api_hit[5m]  - means the 5 minutes before now


One metric has multiple values, that's why it's a vector.

In a range vector 2 things matter: the [] time range you mention, and the scrape interval configured in prometheus.yml.
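As a rough arithmetic sketch of that relationship (assuming evenly spaced scrapes with no gaps):

```python
# How many samples land in a range-vector window, given the
# scrape_interval from prometheus.yml.

def samples_in_window(window_seconds, scrape_interval_seconds):
    return window_seconds // scrape_interval_seconds

print(samples_in_window(60, 15))    # a [1m] window with a 15s scrape -> 4 samples
print(samples_in_window(300, 15))   # a [5m] window -> 20 samples
```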

Operators

+ Addition
- Subtraction
* Multiplication
/ Division
% Modulo
^ Power

Scalar + Instant Vector
Applies to every value of the instant vector.


e.g. we check a Prometheus metric
prometheus_sd_updates_total

prometheus_sd_updates_total + 6

All values go up by 6; if you subtract, all values go down by 6.

Instant Vector + Instant Vector
Applies to every value of the left vector and its matching value in the right vector.
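Both cases can be sketched by modelling an instant vector as a dict of {label set: value} (the label sets below are made up):

```python
# Sketch of PromQL vector arithmetic.

left = {("code=200",): 10.0, ("code=500",): 3.0}

# Scalar + instant vector: the scalar is applied to every value.
plus6 = {labels: v + 6 for labels, v in left.items()}

# Instant vector + instant vector: values are added only where the
# label sets match on both sides; unmatched series are dropped.
right = {("code=200",): 5.0}
summed = {labels: v + right[labels] for labels, v in left.items() if labels in right}

print(plus6)   # {('code=200',): 16.0, ('code=500',): 9.0}
print(summed)  # {('code=200',): 15.0}
```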




Comparison Binary Operators
== Equal
!= Non-equal
> Greater
< Less-than
>= Greater or Equal
<= Less-than or Equal


 

Filter Matchers/Selectors
<metric name>{filter_key=value, filter_key=value, ...}
A query in Prometheus looks very similar to a metric.
Filters are a set of labels;
every label is a key-value pair and they are separated by commas,
and each comma acts as an AND.


=  Two values must be equal
!= Two values must NOT be equal
=~ Value on left must match the Regular Expression (regex) on right [this is for text and string matching]
!~ Value on left must NOT match the Regular Expression (regex) on right 
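The four matchers can be sketched with Python's re module; note that PromQL regexes are fully anchored, so =~ behaves like re.fullmatch, not re.search:

```python
import re

# Sketch of the =, !=, =~ and !~ matchers against a single label value.

def matches(value, op, pattern):
    if op == "=":  return value == pattern
    if op == "!=": return value != pattern
    if op == "=~": return re.fullmatch(pattern, value) is not None
    if op == "!~": return re.fullmatch(pattern, value) is None
    raise ValueError(op)

print(matches("200", "=~", "2.*"))   # True  - like {code=~"2.*"}
print(matches("500", "=~", "2.*"))   # False
print(matches("500", "!~", "2.*"))   # True
```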


Aggregation Operators.
Aggregate the elements of a single Instant Vector
The result is a new Instant Vector with aggregated values

sum    Calculated sum over dimensions.
min    Selects minimum over dimensions.
max    Selects maximum over dimensions.
avg    Selects average over dimensions.
count  Selects number of elements over dimensions.
group  Groups elements. All values in the resulting vector are equal to 1
count_values Counts the number of elements with the same value
topk   Largest elements by sample value
bottomk Smallest elements by sample value
stddev Finds population standard deviation over dimensions
stdvar Finds population standard variance over dimensions

 
 


Syntax: <Aggregation Operator>(<Instant Vector>)
sum(node_cpu_total)

<Aggregation Operator>(<Instant Vector>) by (<label list>)
sum(node_cpu_total) by (http_code)

<Aggregation Operator>(<Instant Vector>) without (<label list>)
sum(node_cpu_total) without (http_code)
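What "sum(...) by (label)" does can be sketched as grouping series by one label and adding their values (the series and label names below are made up):

```python
from collections import defaultdict

# Sketch of sum(metric) by (code): collapse the series, keeping only
# the grouping label.

series = [
    ({"code": "200", "instance": "a"}, 10.0),
    ({"code": "200", "instance": "b"}, 5.0),
    ({"code": "500", "instance": "a"}, 2.0),
]

def sum_by(series, label):
    out = defaultdict(float)
    for labels, value in series:
        out[labels[label]] += value
    return dict(out)

print(sum_by(series, "code"))   # {'200': 15.0, '500': 2.0}
```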


offset - e.g. 5 min ago

e.g.
prometheus_http_requests_total offset 5m

If we want to check the data from 5 minutes ago, we mention offset 5m.

Note this is different from a range vector:
prometheus_http_requests_total[5m]
In the graph view this will give an error, because a range vector selects the last 5 minutes of samples and cannot be graphed directly.


So if we do group by, the value is always 1:
group(prometheus_http_requests_total) by (code)

But with avg/sum by, the value comes accordingly:
avg(prometheus_http_requests_total) by (code)

 

If we apply offset after by, will it work?
avg(prometheus_http_requests_total) by (code) offset 5m  X
No, it does not work; offset only works directly after the metric.
e.g.
avg(prometheus_http_requests_total offset 5h) by (code)

 

Functions


 


absent(<Instant Vector>) Checks if an instant vector has any members.
                         Returns an empty vector if the parameter has elements.
absent(node_cpu_seconds_total{cpu="x09d"})

Another e.g. - this will give an error, because absent() cannot accept a range vector:

absent(node_cpu_seconds_total[1h])



absent_over_time(<Range Vector>) Checks if a range vector has any members. Returns an empty vector if the parameter has elements.
absent_over_time(node_cpu_seconds_total{cpu="xrft"}[1h])


clamp_min(node_cpu_seconds_total, 300)
all values come back as at least 300 (smaller values are raised to 300, not dropped)

clamp_max(node_cpu_seconds_total, 15000)
all values come back as at most 15000
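A one-line sketch of what clamping does to each value (values are clamped, not filtered out):

```python
# clamp_min / clamp_max sketch over a list of made-up sample values.

def clamp_min(values, lo):
    return [max(v, lo) for v in values]

def clamp_max(values, hi):
    return [min(v, hi) for v in values]

print(clamp_min([120, 450, 9], 300))      # [300, 450, 300]
print(clamp_max([120, 45000, 9], 15000))  # [120, 15000, 9]
```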

delta and idelta
day_of_month(<Instant Vector>) For every UTC time returns day of month 1..31
day_of_week(<Instant Vector>) For every UTC time returns day of week 0..6 (0 is Sunday)



delta(<Range Vector>) Can only be used with gauges. Returns the difference between the first and last samples in the range.
idelta(<Range Vector>) Returns the difference between the last two samples.
delta(node_cpu_temp[2h]) CPU temp. change over two hours
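The difference between the two can be sketched over a list of (timestamp, value) samples (made-up temperatures):

```python
# delta = last - first over the whole range; idelta = last two samples only.

def delta(samples):
    return samples[-1][1] - samples[0][1]

def idelta(samples):
    return samples[-1][1] - samples[-2][1]

temps = [(0, 41.0), (3600, 43.5), (7200, 42.0)]  # hypothetical CPU temps
print(delta(temps))    # 42.0 - 41.0 = 1.0
print(idelta(temps))   # 42.0 - 43.5 = -1.5
```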

log2(<Instant Vector>) Returns the binary logarithm of each scalar value
log10(<Instant Vector>) Returns the decimal logarithm of each scalar value
ln(<Instant Vector>) Returns the natural logarithm of each scalar value
sort(<Instant Vector>) Sorts elements in ascending order
sort_desc(<Instant Vector>) Sorts elements in descending order
time() Returns a near-current timestamp
timestamp(<Instant Vector>) Returns the timestamp of each time series (element)

eg
sort(clamp(node_cpu_seconds_total, 300, 15000))        sorts the values in ascending order
timestamp(clamp(node_cpu_seconds_total, 300, 15000))    gives each metric's timestamp
timestamp(clamp(node_cpu_seconds_total offset 1h, 300, 15000))    gives the metric timestamps from 1 hour ago

Aggregation Over time
avg_over_time(<range Vector>) Returns the average of items in a range vector
sum_over_time(<range Vector>) Returns the sum of items in a range vector
min_over_time(<range Vector>) Returns the min of items in a range vector
max_over_time(<range Vector>) Returns the max of items in a range vector
count_over_time(<range Vector>) Returns the count of items in a range vector 



eg
avg_over_time(node_cpu_seconds_total{cpu="0"}[2h])

Alerts in Prometheus
Alerts are visible on the Prometheus dashboard;
for alert notifications we need to use Alertmanager.
Alertmanager can trigger notifications to different platforms like email, Slack, PagerDuty, webhooks.

For alerts (on Linux) we put the alert rule files in /etc/prometheus/rules.
For Mac/Windows we give a relative path in the prometheus.yml file.

for Mac
/etc/rule/alert.yaml
groups:
  - name: Alerts
    rules:
    - alert: Is Node Exporter Up
      expr: up{job="node_exporter"} == 0

 

Another way to write the rule:
groups:
  - name: Alerts
    rules:
    - alert: Is Node Exporter Up
      expr: absent(up{job="node_exporter"})   
      for: 5m

Note: here expr means the expression returns something non-empty; when this condition matches, you will get a firing alert.

etc/prometheus.yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "prometheus"
    static_configs:
    - targets: ["localhost:9090"]

  - job_name: "node_exporter"
    static_configs:
    - targets: ["localhost:9100"]

rule_files:
  - "rule/alerts.yaml"

restart the prometheus service

check again in targets


Now let's test the alert: stop the node exporter service on 9100.

For more alert rule examples, see the Prometheus official website and:

https://awesome-prometheus-alerts.grep.to/rules.html

The "for" expression
Use the "for" expression to define a time threshold,
e.g. if the node_exporter node has been down for the last 5m, the alert will fire.

Another way to write the rule:
groups:
  - name: Alerts
    rules:
    - alert: Is Node Exporter Up
      expr: absent(up{job="node_exporter"})   
      for: 5m
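The "for" behaviour can be sketched like this: the alert goes pending when the expression first becomes true, and only fires after it has stayed true for the full duration (timestamps below are made up, in seconds):

```python
# Sketch of the alerting "for:" clause.

FOR = 5 * 60   # for: 5m

def alert_state(true_since, now):
    """true_since: timestamp when the expr became true, or None."""
    if true_since is None:
        return "inactive"
    return "firing" if now - true_since >= FOR else "pending"

print(alert_state(None, 1000))    # inactive
print(alert_state(1000, 1100))    # pending (true for only 100s)
print(alert_state(1000, 1300))    # firing  (true for the full 300s)
```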


Alert Manager

Converts alerts to notifications
Can receive alerts from multiple Prometheus servers
Can de-duplicate alerts - into one single alert to avoid confusion
Can silence alerts - you can choose to silence some alerts for some time (in case they are raising false alarms)
It has a web user interface
The web user interface is accessed via port 9093
It is configured via the alertmanager.yml file

Download Alert manager
1. go to official Prometheus website
https://prometheus.io/download/
copy link
sudo wget link
2. untar package
sudo tar xvf alertxxxx
3. mkdir /var/lib/alertmanager
4. cd alertxxxx_package
5. sudo mv * /var/lib/alertmanager
6. cd /var/lib/alertmanager
7. mkdir data
8. sudo chown -R prometheus:prometheus /var/lib/alertmanager
9. sudo chown -R prometheus:prometheus /var/lib/alertmanager/*
10. sudo chmod -R 755 /var/lib/alertmanager
11. sudo chmod -R 755 /var/lib/alertmanager/*
12. sudo vi /etc/systemd/system/alertmanager.service

[Unit]
Description=Prometheus Alert Manager
Documentation=https://prometheus.io/docs/introduction/overview/
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=prometheus
Group=prometheus
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/var/lib/alertmanager/alertmanager --storage-path="/var/lib/alertmanager/data"

SyslogIdentifier=prometheus_alert_manager
Restart=always

[Install]
WantedBy=multi-user.target


13. sudo systemctl daemon-reload
14. sudo systemctl start alertmanager
15. sudo systemctl enable alertmanager
16. sudo systemctl status alertmanager


alertmanager configuration file
config.yml
global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'in-v3.mailjet.com:587'
  smtp_from: 'info@cloudware.net.au'
  smtp_auth_username: ''
  smtp_auth_password: ''

route:
  receiver: main_receiver

  routes:
  - receiver: 'urgent_receiver'
    matchers:
    - severity='Critical'


receivers:
- name: 'main_receiver'
  email_configs:
  - to: 'manjeetyadavrajokri@gmail.com'

- name: 'urgent_receiver'
  email_configs:
  - to: 'info@cloudware.net.au'        

So now if you disable the node exporter, it will trigger an email to info@cloudware.net.au, because the alert has severity Critical and matches the urgent_receiver route.

Integrating Alertmanager with Slack
Incoming webhooks - this is not specific to Slack; you can use it with Microsoft Teams or Hangouts. You can use the same incoming webhook technology to send notifications from Prometheus to your messenger.

1. First you need a channel.
Create a channel on Slack; you need admin access on that channel. If you have admin access you will get an option in the top corner: View all members of this channel.
2. Click on the corner tab - go to Integrations - Add app - in the search box look for Incoming WebHooks - Install - Add to Slack - in Post to Channel choose the channel - click "Add Incoming WebHooks integration" - Setup Instructions - Webhook URL


 

3. Here, copy the webhook URL and change the customize icon if you want.

4. Save Settings

5. Go to the alertmanager config file and put the below under receivers:

receivers:

- name: 'urgent_receiver'
  slack_configs:
  - api_url: 'http://webhookurl'
    channel: '#udemy-course-for-prometheus'


stop node_exporter
restart alertmanager

now you should get the notification on slack

Inhibiting Alerts
Silencing is temporary - you can create a silence rule.
Inhibiting notifications is permanent.

It inhibits an alert if another alert is firing.

Scenario: we have 2 alerts, server down and website down. If alert 1 triggers, there is no need to trigger alert 2:
Prometheus
Alert 1: Server Down
Alert 2: Website Down


How to define it?
1. Find a metric on Prometheus:
search for the below metric and copy it
prometheus_build_info


2. Go to the alert.yaml configuration file
and replace the expr:

groups:
  - name: Alerts
    rules:
    - alert: Is Node Exporter Up?
      expr: up{job="node_exporter"}==0
      for: 0m
      labels:
        team: Team Alpha
        severity: Critical
      annotations:
        summary: "{{ $labels.instance }} Is Down"
        description: "Team Alpha to restart the server {{ $labels }} VALUE: {{ $value }}"

  - name: Alert
    rules:
    - alert: Is Prometheus Build Info Present?
      expr: prometheus_build_info{branch="non-git", goversion="go1.16.5", instance="localhost:9090", job="prometheus"}
      for: 0m
      labels:
        team: Team Beta
        severity: Critical
      annotations:
        summary: "{{ $labels.instance }} Is Down"
        description: "Team Beta to restart the server {{ $labels }} VALUE: {{ $value }}"

3. Alertmanager config file

config.yaml

global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'in-v3.mailjet.com:587'
  smtp_from: 'info@cloudware.net.au'
  smtp_auth_username: ''
  smtp_auth_password: ''

route:
  receiver: main_receiver

  routes:
  - receiver: 'urgent_receiver'
    matchers:
    - severity='Critical'

receivers:
- name: 'main_receiver'
  email_configs:
  - to: 'manjeetyadavrajokri@gmail.com'

- name: 'urgent_receiver'
  email_configs:
  - to: 'info@cloudware.net.au'

inhibit_rules:
- source_match:
    team: "Team Alpha"
  target_match:
    team: "Team Beta"
  equal: ['severity']
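The inhibit rule above can be sketched as a predicate: the target alert is muted when a matching source alert is firing and both alerts agree on every label listed in "equal" (the alert label sets below mirror the example):

```python
# Sketch of Alertmanager inhibition matching.

def inhibited(source, target, source_match, target_match, equal):
    return (all(source.get(k) == v for k, v in source_match.items())
            and all(target.get(k) == v for k, v in target_match.items())
            and all(source.get(k) == target.get(k) for k in equal))

server_down  = {"team": "Team Alpha", "severity": "Critical"}
website_down = {"team": "Team Beta", "severity": "Critical"}

print(inhibited(server_down, website_down,
                {"team": "Team Alpha"}, {"team": "Team Beta"},
                ["severity"]))   # True - Website Down gets suppressed
```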


4. restart the alert manager

Recording rules
computed metrics



avg, sum, count

If we have a large set of data and we do avg/sum/count very frequently, it can slow down Prometheus,
so we can create recording rules (and store the result in another metric created from the rule).

Calculate the "avg" of "temperature" from IoT every 5 minutes and save it as IoT_Avg_Temp

You can record your rules in yml files:
IoT_rules.yml
db_rules.yml

or you can put them in a separate directory, like on Linux:
/etc/prometheus/rules

Creating a rule

e.g. we want the below metric:
avg by (mode) (node_cpu_seconds_total)
avg by (cpu) (node_cpu_seconds_total)

We get the metric, but for a recording rule we also need a time range.
We can't do it as below; it would be invalid:
avg by (cpu) (node_cpu_seconds_total[1m])

The below works, but we can't do aggregations like avg, sum etc. on it:
node_cpu_seconds_total[1m]

So the way to use a time range is with rate:
rate(node_cpu_seconds_total[1m])
irate - when the counter changes very frequently, use irate.

But for alerts and recording rules, Prometheus says you should use rate:
avg by (cpu) (rate(node_cpu_seconds_total[1m]))
Naming convention: <label>:<metric name>:<aggregation type>
cpu:node_cpu_seconds_total:avg
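What rate() computes can be roughly sketched in Python: the per-second increase of a counter over the window, compensating for counter resets (the samples below are made up and assumed evenly spaced):

```python
# Rough sketch of rate(metric[window]) over (timestamp_seconds, value) samples.

def rate(samples):
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A counter only goes up; a drop means the process restarted,
        # so count the post-reset value as the increase.
        increase += cur - prev if cur >= prev else cur
    duration = samples[-1][0] - samples[0][0]
    return increase / duration

# 100 -> 130 (+30), reset to 10 (+10), 10 -> 40 (+30) over 45s
print(rate([(0, 100), (15, 130), (30, 10), (45, 40)]))  # 70/45 per second
```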

now go to rule.yml file
rules.yml
groups:
  - name: node exporter rules
    rules:
    - record: cpu:node_cpu_seconds_total:avg
      expr: avg by (cpu) (rate(node_cpu_seconds_total[1m]))
      labels:
        exporter_type: node
     

prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "prometheus"
    static_configs:
    - targets: ["localhost:9090"]
  - job_name: "node_exporter"
    static_configs:
    - targets: ["localhost:9100"]

rule_files:
  - "rule/rules.yml"

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093

restart the prometheus service    

Client Libraries
Choose a Prometheus client library that matches the language your application is written in. Internal metrics are exposed via an HTTP endpoint on your application's instance.


Python Library
Generate some metrics and make them available to Prometheus.

prom-test.py
if __name__ == '__main__':
  print("This is a python app")


Add an endpoint to the application.
Install the library:
pip3 install prometheus-client

now prom-test.py

------------------------------------------------------------------------------------------------------------------------------------
from prometheus_client import start_http_server, Summary
import random
import time

REQUEST_TIME = Summary("request_processing_second", "Time spent processing a function")

@REQUEST_TIME.time()
def process_request(t):
    time.sleep(t)

if __name__ == '__main__':
    start_http_server(8000)
    while True:
        process_request(random.random())
    print("The end")  # never reached; the loop above runs forever


Now go to a browser and open localhost:8000

Counter metrics

Import the Counter class from prometheus_client, then we can create a variable with any given name.

from prometheus_client import start_http_server, Summary, Counter
import random
import time

REQUEST_TIME = Summary("request_processing_second", "Time spent processing a function")
MY_COUNTER = Counter("my_counter", "")

@REQUEST_TIME.time()
def process_request(t):
    MY_COUNTER.inc(5)
    time.sleep(t)

if __name__ == '__main__':
    start_http_server(8000)
    process_request(random.random())

    while True:
        A = 1  # keep the process (and the metrics endpoint) alive
 
Gauge metrics

from prometheus_client import start_http_server, Summary, Counter, Gauge
import random
import time

REQUEST_TIME = Summary("request_processing_second", "Time spent processing a function")
MY_COUNTER = Counter("my_counter", "")
MY_GAUGE = Gauge("my_gauge", "")

@REQUEST_TIME.time()
def process_request(t):
    MY_COUNTER.inc(5)
    MY_GAUGE.set(5)
    MY_GAUGE.inc(5)   # a gauge can go up...
    MY_GAUGE.dec(2)   # ...and down
    time.sleep(t)

if __name__ == '__main__':
    start_http_server(8000)
    process_request(random.random())

    while True:
        A = 1  # keep the process alive

For exception handling

from prometheus_client import start_http_server, Summary, Counter, Gauge
import random
import time

REQUEST_TIME = Summary("request_processing_second", "Time spent processing a function")
MY_COUNTER = Counter("my_counter", "")
MY_GAUGE = Gauge("my_gauge", "")

@REQUEST_TIME.time()
@MY_COUNTER.count_exceptions()   # increments my_counter when the function raises
def process_request(t):
    MY_COUNTER.inc(5)
    MY_GAUGE.set(5)
    MY_GAUGE.inc(5)
    MY_GAUGE.dec(2)
    time.sleep(t)

if __name__ == '__main__':
    start_http_server(8000)
    process_request(random.random())

    while True:
        A = 1  # keep the process alive
 

Labels

Attaching labels to your metrics:
when you only declare label names, Prometheus will not add them to your metric yet,
because Prometheus doesn't know what value has to be assigned
to each of these label names (name, age), so you should assign a value to each label before using the metric.

from prometheus_client import start_http_server, Summary, Counter, Gauge
import random
import time

REQUEST_TIME = Summary("request_processing_second", "Time spent processing a function")
MY_COUNTER = Counter("my_counter", "", ["name", "age"])
MY_GAUGE = Gauge("my_gauge", "")

@REQUEST_TIME.time()
def process_request(t):
    # A labelled counter must be used via .labels(); label values are strings.
    # (count_exceptions() was dropped here - it cannot be called on the
    # parent of a labelled counter.)
    MY_COUNTER.labels(name="Joe", age="30").inc(3)
    time.sleep(t)

if __name__ == '__main__':
    start_http_server(8000)
    process_request(random.random())

    while True:
        A = 1  # keep the process alive


Now go to the Prometheus configuration
prometheus.yml

scrape_configs:
  - job_name: "prometheus"
    static_configs:
    - targets: ["localhost:9090"]
  - job_name: "node_exporter"
    static_configs:
    - targets: ["localhost:9100"]
  - job_name: "python"
    static_configs:
    - targets: ["localhost:8000"]



rule_files:
  - "rule/alerts.yml"
  - "rule/recording_rule.yml"

alerting:
  alertmanagers:

restart Prometheus

go to Prometheus dashboard - targets - Endpoint 

http://localhost:8000/metrics

check for a metric
my_counter_total
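The page at /metrics serves the Prometheus text exposition format. A stdlib sketch of how those lines break down (the sample text below is made up, in the shape of what the client library exposes):

```python
# Sketch: parse a few lines of the Prometheus text exposition format.

def parse(text):
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip HELP/TYPE comments
            continue
        name, value = line.rsplit(" ", 1)
        metrics[name] = float(value)
    return metrics

sample = """\
# HELP my_counter_total
# TYPE my_counter_total counter
my_counter_total 5.0
my_gauge 8.0
"""
print(parse(sample))   # {'my_counter_total': 5.0, 'my_gauge': 8.0}
```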



Service Discovery and Push Gateway

So we have servers and we also have auto scaling,
so due to load the number of servers can increase or decrease.

In this case, Prometheus is unable to get a proper response from all targets.





Or we have multiple servers behind a load balancer,
so Prometheus is unable to get the metrics correctly: on each scrape, only one server behind the load balancer will respond.


Or we have serverless functions (they don't have an IP or a DNS name), so it is impossible to pull metrics from them.
In that case:

Push Gateway

For these types of situations we need a component called the Prometheus Push Gateway; the Push Gateway allows any code to push metrics to it.
We can configure our serverless code to send metrics to the push gateway, and our applications on virtual servers can push their metrics too. The push gateway has an internal exporter; Prometheus connects to that and scrapes the metrics from the push gateway.


Service Discovery
Configured in the prometheus.yml file

<ec2_sd_config> - for EC2 machines on AWS
<dns_sd_config> - DNS based
<file_sd_config> - file based
<kubernetes_sd_config> - Kubernetes
<azure_sd_config> - Azure cloud based
<gce_sd_config> - Google Compute Engine (GCP) based

Service Discovery and AWS
Works with AWS EC2 and AWS Lightsail.
In the Prometheus config file you need to use
<ec2_sd_config> or <lightsail_sd_config>
Other config options:
port
region
access_key  secret_key
role_arn
filters, source_labels
refresh_interval

Source Labels
__meta_ec2_ami: the EC2 Amazon Machine Image
__meta_ec2_availability_zone
__meta_ec2_availability_zone_id
__meta_ec2_instance_id
__meta_ec2_instance_state
__meta_ec2_instance_type
__meta_ec2_private_ip
__meta_ec2_public_ip
__meta_ec2_private_dns_name
__meta_ec2_public_dns_name
__meta_ec2_vpc_id
__meta_ec2_tag_<tagkey>
__meta_ec2_tag_environment="production"

e.g.
scrape_configs:
  - job_name: "My EC2"
    ec2_sd_configs:
    - port: 8000


Service Discovery in AWS


We have an AWS instance tagged "dev-prometheus".

prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "prometheus"
    static_configs:
    - targets: ["localhost:9090"]

  - job_name: "aws sd"
    ec2_sd_configs:
     - port: 9090
       region:
       access_key:
       secret_key:
       filters:
         - name: tag:Name
           values:
             - dev-.*
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name, __meta_ec2_private_ip]
        target_label: instance
      - source_labels: [__meta_ec2_tag_Name]
        regex: dev-.*
        action: keep   # use keep or drop as needed

rule_files:
  - "rule/alerts.yml"
  - "rule/recording_rule.yml"

alerting:
  alertmanagers:
  - static_configs:
    - targets:
       - localhost:9093

restart prometheus service
The above job will discover the instances, but the targets will use the private IP.
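The keep/drop relabel actions can be sketched with Python regexes; relabel regexes are fully anchored, and the made-up names below stand in for __meta_ec2_tag_Name values:

```python
import re

# Sketch of the keep/drop relabel actions.

def relabel(targets, regex, action):
    pat = re.compile(regex)
    if action == "keep":
        return [t for t in targets if pat.fullmatch(t)]
    if action == "drop":
        return [t for t in targets if not pat.fullmatch(t)]
    raise ValueError(action)

names = ["dev-prometheus", "dev-test-1", "prod-api"]
print(relabel(names, "dev-.*", "keep"))       # ['dev-prometheus', 'dev-test-1']
print(relabel(names, "dev-test-.*", "drop"))  # ['dev-prometheus', 'prod-api']
```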

To discover by public IP:

- job_name: "aws sd"
  ec2_sd_configs:
   - port: 9090
     region:
     access_key:
     secret_key:
     filters:
       - name: tag:Name
         values:
           - dev-.*
  relabel_configs:
    - source_labels: [__meta_ec2_tag_Name, __meta_ec2_private_ip]
      target_label: instance
    - source_labels: [__meta_ec2_tag_Name]
      regex: dev-test-.*
      action: drop
    - source_labels: [__meta_ec2_public_ip]
      replacement: ${1}:9090
      target_label: __address__

restart prometheus

File-based service discovery
Suppose you have some cloud that doesn't have built-in service discovery support,
e.g.
IBM Cloud
Alibaba Cloud
In this case we use file-based SD with a file_sd yml file.


Create a folder in the same location as your prometheus.yml file:
mkdir file_sd
and put your SD file in this folder.

file.yml

- targets:
  - localhost:9100
  labels:
    team: "Team Alpha"  
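File-based SD also accepts JSON files. A stdlib sketch that produces the same targets as file.yml above (target address and label are the example values, not real infrastructure):

```python
import json

# Build the same target list as file.yml, in the JSON shape file_sd accepts:
# a list of {"targets": [...], "labels": {...}} groups.

targets = [{"targets": ["localhost:9100"], "labels": {"team": "Team Alpha"}}]
doc = json.dumps(targets, indent=2)
print(doc)
```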

now need to update prometheus.yml file


global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "prometheus"
    static_configs:
    - targets: ["localhost:9090"]

  - job_name: "file sd"
    file_sd_configs:
      - files:
        - /usr/local/etc/file_sd/file.yml
        - /usr/local/etc/file_sd/file1.yml
        - /usr/local/etc/file_sd/*.yml

  - job_name: "aws sd"
    ec2_sd_configs:
     - port: 9090
       region:
       access_key:
       secret_key:
       filters:
         - name: tag:Name
           values:
             - dev-.*
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name, __meta_ec2_private_ip]
        target_label: instance
      - source_labels: [__meta_ec2_tag_Name]
        regex: dev-.*
        action: keep   # use keep or drop as needed

rule_files:
  - "rule/alerts.yml"
  - "rule/recording_rule.yml"

alerting:
  alertmanagers:
  - static_configs:
    - targets:
       - localhost:9093

restart prometheus service and check target in prometheus console.

Install push gateway

Download push gateway from official website:
https://prometheus.io/download/#pushgateway
wget https://github.com/prometheus/pushgateway/releases/download/v1.5.1/pushgateway-1.5.1.linux-amd64.tar.gz
Create the needed systemd service file:
[Unit]
Description=Prometheus Pushgateway
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/pushgateway

[Install]
WantedBy=multi-user.target

tar xvf pushgatewayxxxx.tar.gz
cd pushgatewayxxxx
cp pushgateway /usr/local/bin/pushgateway
chown prometheus:prometheus /usr/local/bin
chown prometheus:prometheus /usr/local/bin/*

vi /etc/systemd/system/pushgateway.service
pass above service file content

systemctl daemon-reload
systemctl start pushgateway
systemctl enable pushgateway
systemctl status pushgateway

go to web browser
check http://localhost:9091/metrics

How to send metrics to Pushgateway

Sending Metrics to PushGateway with Python

First we need to install the Prometheus client (and activate the Python env if needed).
Install the Python client:
pip3 install prometheus_client

/prometheus/python-client/push-to-gateway.py
from prometheus_client import push_to_gateway, CollectorRegistry, Gauge
import time

registry = CollectorRegistry()

gauge = Gauge("python_push_to_gateway", "python_push_to_gateway", registry=registry)

while True:
    gauge.set_to_current_time()
    push_to_gateway("localhost:9091", job="Job A", registry=registry)
    time.sleep(15)  # added: pause between pushes instead of hammering the gateway


run the python code
python3 push-to-gateway.py

now go to prometheus
check the metric name python_push_to_gateway

Authentication Methods in Prometheus


Basic Authentication

it can be used to secure web ui and api

  • Choose username and password
  • Create a bcrypt hash of your password
  • Create a web configuration file
  • Launch prometheus with the web configuration file


1. Choose a username
and password.

If you have Apache tools (httpd-tools), that is one way to create a bcrypt hash of the password:

2. htpasswd
htpasswd -nBC 10 "admin"
New password:
Re-type new password:
admin:fsdfwewwef2efdfdssfs

3. Go to the prometheus.yml location and
create a web.yml file in the same location:
basic_auth_users:
  admin: fsdfwewwef2efdfdssfs

cmd
prometheus --web.config.file=/usr/local/etc/web.yml --config.file=/usr/local/etc/prometheus.yml
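Once basic auth is enabled, the scraper (or browser) sends an Authorization header containing base64("user:password"); the server checks it against the bcrypt hash in web.yml. A stdlib sketch with example credentials:

```python
import base64

# Build the HTTP Basic auth header a client would send.
# "admin"/"secret" are example credentials only.

def basic_auth_header(user, password):
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return f"Authorization: Basic {token}"

print(basic_auth_header("admin", "secret"))
# Authorization: Basic YWRtaW46c2VjcmV0
```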

or in case if you have MAC
check prometheus.args file
--config.file /usr/local/etc/prometheus.yml
--web.listen-address=127.0.0.1:9090
--storage.tsdb.path /usr/local/var/prometheus
--web.config.file=/usr/local/etc/web.yml

and restart Prometheus service


Enabling HTTPS for Improved Security

Check the Prometheus console on the https port; it will not open.
If you want to add a certificate, you can also generate your own self-signed certificate and add that.

cmd
openssl req -new -newkey rsa:2048 -days 365 -nodes -x509 -keyout my.key -out my.crt -subj "/C=BE/ST=Antwerp/L=Brasschaat/O=Inuits/CN=localhost"

or generate a pair of RSA keys online:
go to https://cryptotools

 

put your certificate on the same location where you have web.yml

open web.yml file
tls_server_config:
 cert_file: prom.crt
 key_file: prom.key


basic_auth_users:
  admin: fsdfwewwef2efdfdssfs

restart prometheus

go to prometheus again.
http will not work, now try with https

Enabling HTTPS on Exporters i.e. Node Exporter
Making the communication between Prometheus and node exporter use HTTPS.

Go to the same location where you have web.yml
and create a file node_web.yml:
tls_server_config:
 cert_file: /usr/local/etc/prom.crt
 key_file: /usr/local/etc/prom.key

Run a cmd to check:
node_exporter --web.config.file=/usr/local/etc/node_web.yml

If you are on a Linux machine, you need to update the node_exporter service file with:
--web.config.file=/usr/local/etc/node_web.yml

restart node_exporter service

try to open node_exporter
https://localhost:9100

Now go to prometheus.yml and update the certificate for the job:

scrape_configs:
 - job_name: "prometheus"
   static_configs:
   - targets: ["localhost:9090"]

 - job_name: "node_exporter"
   scheme: https
   tls_config:
     ca_file: /usr/local/etc/prom.crt
     server_name: 'localhost'
   basic_auth:
     username: admin
     password: password
   static_configs:
   - targets: ["localhost:9100"]


restart prometheus service

now check metrics
node_cpu_seconds_total 

 

Securing PushGateway

Go to the same location where you have web.yml
and create a file pushgateway.yml:
tls_server_config:
 cert_file: /usr/local/etc/prom.crt
 key_file: /usr/local/etc/prom.key

Run a cmd to check:
./pushgateway --web.config.file=/usr/local/etc/pushgateway.yml

If you are on a Linux machine, you need to update the pushgateway service file with:
--web.config.file=/usr/local/etc/pushgateway.yml

restart pushgateway service

try to open pushgateway
https://localhost:9091



Connecting to Push Gateway Securely

open push-to-gateway.py

/prometheus/python-client/push-to-gateway.py
from prometheus_client import push_to_gateway, CollectorRegistry, Gauge
from prometheus_client.exposition import basic_auth_handler
import time

def auth_handler(url, method, timeout, headers, data):
    return basic_auth_handler(url, method, timeout, headers, data, "admin", "password")

registry = CollectorRegistry()

gauge = Gauge("python_push_to_gateway", "python_push_to_gateway", registry=registry)

while True:
    gauge.set_to_current_time()
    push_to_gateway("https://localhost:9091", job="Job A", registry=registry, handler=auth_handler)
    time.sleep(15)  # pause between pushes

run the script
python push-to-gateway.py

If you get an error related to the self-signed certificate, do the below:
export SSL_CERT_FILE=/usr/local/etc/prom.crt

Now go to prometheus.yml and update the certificate for the job:

scrape_configs:
 - job_name: "prometheus"
   static_configs:
   - targets: ["localhost:9090"]

 - job_name: "pushgateway"
   scheme: https
   tls_config:
     ca_file: /usr/local/etc/prom.crt
     server_name: 'localhost'
   basic_auth:
     username: admin
     password: password
   static_configs:
   - targets: ["localhost:9091"]


restart the Prometheus service

now check the Prometheus web UI

now check the metric
python_push_to_gateway

 


What is Grafana?

Open-source software to:
 visualise time-series data (metrics)
 visualise metrics from various data sources
 alert on metrics
 support multiple organisations

 


installing grafana

port 3000 needs to be open for Grafana

1. go to https://grafana.com/grafana/download
select Linux

2. wget https://

sudo dpkg -i grafana_xxxx

3. sudo apt-get -f install

4. systemctl status grafana-server

5. systemctl start grafana-server

systemctl enable grafana-server

6. http://public_ip:3000 (Grafana serves plain HTTP by default)
admin
admin

Configuring Grafana


1. cd /etc/grafana
ls
grafana.ini
ldap.toml
provisioning

don't make changes directly in the grafana.ini file; instead make a copy
cp grafana.ini custom.ini
and make your changes in custom.ini

2. vi custom.ini

# INSTANCE NAME
;instance_name = ${HOSTNAME}

# Directory where grafana can store logs
;logs = /var/log/grafana

# Either "mysql", "postgres" or "sqlite3"
;type = sqlite3
;host = 127.0.0.1:3306
;name = grafana
;user = root

The database section is about where Grafana stores its own data (users, dashboards, etc.). If you are happy to keep it on the same machine you can use sqlite3, but if you want to put the data somewhere else, for example in a Docker environment, point these settings at an external database.

Another scenario is when you run multiple instances of Grafana that need to share data, so that no matter which instance you access you see the same dashboards and the same users. In this case you should use a shared database such as an external MySQL or PostgreSQL.
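For example, pointing Grafana at a shared external MySQL database could look roughly like this in custom.ini (the host, credentials, and database name here are placeholders):

```ini
[database]
type = mysql
host = 10.0.0.5:3306
name = grafana
user = grafana
password = secret
```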

after making the changes restart the Grafana service

systemctl restart grafana-server

Connect Grafana to Prometheus


go to Settings - Data Sources - Add data source - a list of supported sources will be there - e.g. select Prometheus

URL - the Prometheus URL
Access - Browser
Timeout - leave blank

Auth
Basic Auth - disable, With Credentials - disable
TLS Client Auth
Basic Auth Details

Alerting
- enable / disable
if enabled, you will receive alerts from Prometheus as well.

two ways of querying
HTTP Method: POST / GET

Save & Test

you should see a green checkmark
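As an alternative to clicking through the UI, Grafana can also provision data sources from files under /etc/grafana/provisioning/datasources (the directory seen earlier). A minimal sketch; adjust the URL to your setup:

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
```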

Creating Your First Grafana Dashboard

go to the + icon on the left side - click "Dashboard"
give the dashboard a name
Folder - put your dashboard in a folder - General / My Folder, or create a new folder
Save

to edit - click on the settings option
left side - General


Annotations in Grafana

An annotation marks a chosen spot on your graph; it is about the time of an event that happened at a certain point, e.g. a deployment or an outage, rather than the actual data.




In the same way we can add multiple annotations; an enable/disable toggle for each annotation will be visible on all charts.

Alerts in Grafana

Alerts are defined on a Graph Panel
Each Graph Panel can have one or many alerts
Alerts fire when a rule is violated
A rule indicates whether a value on the graph is above or below a threshold
Rules are stored in and evaluated by the Rule Engine

An alert is in one of three states:
OK
Pending
Alerting
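The three states behave roughly like this: a rule goes Pending when the threshold is first breached, and moves to Alerting only once it has stayed breached for the configured "for" duration. A simplified stdlib sketch (not Grafana's actual rule engine; the function and parameter names are made up for illustration):

```python
def evaluate(values, threshold, pending_evals):
    """Return the alert state after each evaluation.

    pending_evals: number of consecutive breaching evaluations
    required before the alert moves from Pending to Alerting.
    """
    breaches = 0
    states = []
    for v in values:
        if v > threshold:
            breaches += 1
            state = "Alerting" if breaches >= pending_evals else "Pending"
        else:
            breaches = 0
            state = "OK"
        states.append(state)
    return states

print(evaluate([350, 420, 430, 450, 300], threshold=400, pending_evals=3))
# ['OK', 'Pending', 'Pending', 'Alerting', 'OK']
```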

create an alert

configuration parameters in an alert:


here the alert "US refunds" is evaluated every 20s, and fires if the rule stays violated for 1m

condition:

WHEN avg() OF query(A, 10s, now) IS ABOVE 400

you can also choose what happens on execution errors and on no data

after setting the alert, check the panel - you will see a red threshold line

and if you want to create a new panel for alerts only - click + - select a visualization


in the panel's Alert List type you have many settings, like

max items, filters, status filter - OK, Alerting, Pending - enable/disable


User management in Grafana

create user - Server Admin - users - New User

switch organisation


in case you don't want to create users directly, you can invite them:


 
Google Authentication for Grafana

go to https://console.developers.google.com

Create OAuth client ID

Application type : Web application
Name: Grafana

Authorised JavaScript origins
http://localhost:3000

Authorised redirect URIs
http://localhost:3000/login/google

Create

and you will get
Client Id
Client Secret


go to Grafana configuration
vi default.ini    
find auth.google in the file

################
[auth.google]
enabled = true
allow_sign_up = true
client_id =  xxxxxxxxxxx
client_secret = xxxxxxxxx
scopes = https://www.googleapis.com/auth/userinfo.profile https://www.googleapis.com/auth/userinfo.email
auth_url = https://accounts.google.com/o/oauth2/auth
token_url = https://accounts.google.com/o/oauth2/token
api_url = https://www.googleapis.com/oauth2/v1/userinfo
allowed_domains =
hosted_domain =

restart grafana

and now try to login on grafana with google user

Authentication with LDAP

LDAP stands for Lightweight Directory Access Protocol
it is supported by major directory services such as Active Directory
directories are mainly used to manage domains, groups and users

Practical
1. create an AD server on Windows Server 2016 (in AWS)
2. on the server, create a domain
grafana.local
3. create 2 users
binder (the search/bind user)
and a Grafana login user
Aref Karimi


Now go to grafana config location
ls
default.ini
ldap.toml

sample.ini

vi default.ini

#####################
[auth.ldap]
enabled = true
config_file = ..\conf\ldap.toml
allow_sign_up = true

#################################

vi ldap.toml

[[servers]]
# Ldap server host (specify multiple hosts space separated)
host = "13.54.19.240"
port = 389
use_ssl = false
start_tls = false
ssl_skip_verify = true
# set to path to your root CA certificate or leave unset to use system defaults
# root_ca_cert = "/path/to/certificate.crt"
# Authentication against LDAP servers requiring client certificates
# client_cert = "/path/to/client.crt"
# client_key = "/path/to/client.key"
# Search user bind dn
bind_dn = "CN=binder,CN=Users,DC=grafana,DC=local"
# If the password contains # or ; you have to wrap it with triple quotes. Ex """#password;"""
bind_password = 'asd123_'

# User search filter, for example "cn=%s" or "(sAMAccountName=%s)" or "(uid=%s)"
search_filter = "(sAMAccountName=%s)"


# An array of base dns to search through
search_base_dns = ["dc=grafana,dc=local"] 

to give a group admin access,
we create a group (e.g. grafana-admins) in Active Directory

and go to
# Map ldap groups to grafana org roles
[[servers.group_mappings]]
group_dn = "cn=grafana-admins,dc=grafana,dc=local"
org_role = "Admin"
# To make the user an instance admin (Grafana Admin) uncomment the line below
# grafana_admin = true
# The Grafana organization database id, optional, if left out the default org (id 1) will be used
# org_id = 1
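You can define more than one mapping; Grafana's ldap.toml also accepts "*" as a catch-all group_dn, which is useful as a fallback role for users who match no other group:

```toml
[[servers.group_mappings]]
# fallback: any authenticated LDAP user not matched above gets read-only access
group_dn = "*"
org_role = "Viewer"
```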

Now if you log in to Grafana as the user Aref, you will have admin access.
