Logging
Logs are useful after a problem has already happened and you know about it: to diagnose the problem
in production, you go to the log files and try to figure out what the problem is and fix it.
But sometimes you need to find out why a product/application is slow, and answer questions such as:
How many errors and exceptions are there?
What is the response time?
How many times is the API called?
How many servers are there?
How many users are from New York?
For this type of information, we need 3 parameters:
Metric - Value - Time
Metric: error count, response time, etc.
Value: errors = 3, response time = 4ms
Prometheus is a time-series database.
It can Store, Query, and Alert:
it can store time series (metric, time, and value),
and it can query and alert based on a threshold.
Threshold example: if the number of errors within a given 5-minute window goes above 10, raise an alert.
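The thresholding logic can be sketched in plain Python to make the idea concrete (a toy illustration with made-up names, not how Prometheus implements it):

```python
from collections import deque

WINDOW_SECONDS = 300   # the 5-minute window from the example
THRESHOLD = 10         # alert when more than 10 errors in the window

error_timestamps = deque()

def record_error(now):
    """Record one error occurrence at time `now` (in seconds)."""
    error_timestamps.append(now)

def should_alert(now):
    """Drop samples older than the window, then compare the count to the threshold."""
    while error_timestamps and now - error_timestamps[0] > WINDOW_SECONDS:
        error_timestamps.popleft()
    return len(error_timestamps) > THRESHOLD
```

Prometheus expresses the same idea declaratively, as a query over a time window, instead of explicit bookkeeping like this.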
What is Telemetry?
Telemetry in software refers to collecting business and diagnostic data from software in production, then storing and visualising it for the purpose of diagnosing production issues or improving the business.
Prometheus's built-in visualization is basic, and it is not enough for real use cases.
For example, if we need to visualize metrics stored in SQL or Amazon CloudWatch,
we don't need to move them into Prometheus; instead we can use Grafana for visualization,
and you can also create alerts in Grafana.
Data collection
How do we collect metrics and store them in Prometheus?
We have Prometheus on one server or a cluster,
and we want to get the metrics into Prometheus.
1. We use code on the server: using a client library, the application exposes its metrics to Prometheus.
The code can be e.g. Ruby, Python, .NET, or Java; it collects the metrics and makes them available to Prometheus.
Push Gateway is a component of Prometheus that acts as temporary storage: applications can send their metrics to it to store, and Prometheus then goes and pulls (scrapes) the metrics from the Push Gateway.
2. In many cases we don't have the source code, e.g. SQL, Amazon CloudWatch, HAProxy, or IoT
(we cannot change the code there, or the data is an IoT time series from e.g. sensors).
In this scenario we use an Exporter, and Prometheus goes and pulls the metrics from the exporter.
Scraping
The process of connecting to the exporter and pulling the metrics into Prometheus is called scraping.
Scraping is configured in the Prometheus config file prometheus.yml; the default scrape interval is 15 seconds.
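A minimal sketch of that configuration (the job name and target address are assumptions for illustration):

```yaml
# prometheus.yml (sketch)
global:
  scrape_interval: 15s                  # how often Prometheus scrapes each target

scrape_configs:
  - job_name: "node_exporter"           # hypothetical job
    static_configs:
      - targets: ["localhost:9100"]     # address of the exporter to scrape
```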
Data model
Prometheus stores data as time series
Every time series is identified by metric name and labels
Labels are key/value pairs
Labels are optional
<metric name>{key="value", key="value", ...}
auth_api_hit{count="1", time_taken="800"}
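The identity rule (metric name plus label set) can be sketched in Python; the metric and label names here are made up:

```python
# A time series is identified by its metric name plus its label key/value pairs.
# Two samples belong to the same series only if both parts match exactly;
# label order does not matter.

def series_id(name, labels):
    """Build a hashable identity for a time series."""
    return (name, tuple(sorted(labels.items())))

a = series_id("auth_api_hit", {"count": "1", "time_taken": "800"})
b = series_id("auth_api_hit", {"time_taken": "800", "count": "1"})   # same series
c = series_id("auth_api_hit", {"count": "2", "time_taken": "800"})   # different series
```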
Data Types
1. Scalar - a float value, e.g. 1 or 1.5
String
prometheus_http_requests_total{code="200", job="prometheus"}
Here "200" is a string, not a float, because it is in double quotes.
Query
prometheus_http_requests_total{code=~"2.*",job="prometheus"}
If you query for a float value, you don't use any double or single quotes.
2. Instant Vector - single sample value
An instant vector selects a set of time series with a single sample value per series, all at the same timestamp.
You only use the metric name; you are not using a range.
If you want to filter the metrics, you can use key/value pairs, i.e. labels.
e.g. auth_api_hit 5
Because you get only one value per series, it is called an instant vector.
If you use a filter:
auth_api_hit{count="1", time_taken="800"}
you will get one value.
3. Range Vector []
Another data type is the range vector.
Range vectors are similar to instant vectors, except that they select a range of samples:
<metric name>[time_spec]
auth_api_hit[5m] - meaning the 5 minutes before now.
One metric has multiple values here; that is why it is a vector.
For a range vector, 2 things matter: the [] time range you mention, and the scrape interval configured in prometheus.yml.
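The interaction of those two settings can be sketched in Python: the number of samples a range vector holds is roughly the range divided by the scrape interval (ignoring missed scrapes):

```python
def samples_in_range(range_seconds, scrape_interval_seconds):
    """Approximate number of samples a range vector like metric[5m] will hold."""
    return range_seconds // scrape_interval_seconds

# metric[5m] with the default 15s scrape interval gives about 20 samples
samples_in_range(300, 15)
```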
Operators
+ Addition
- Subtraction
* Multiplication
/ Division
% Modulo
^ Power
Scalar + Instant Vector
Applies the scalar to every value of the instant vector.
e.g. check a Prometheus metric:
prometheus_sd_updates_total + 5
All values go up by 5; if you use - 5, all values go down by 5.
Instant Vector + Instant Vector
Applies to every value of the left vector and its matching value in the right vector.
instant vector + 5
instant vector + instant vector
Comparison Binary Operators
== Equal
!= Non-equal
> Greater
< Less-than
>= Greater or Equal
<= Less-than or Equal
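For example (using the built-in prometheus_http_requests_total metric), a comparison acts as a filter by default; adding the bool modifier returns 0/1 instead:

```promql
# keep only series whose current value is greater than 100
prometheus_http_requests_total > 100

# return 1 or 0 for every series instead of filtering
prometheus_http_requests_total > bool 100
```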
Filter Matchers/Selectors
<metric name>{filter_key="value", filter_key="value", ...}
A query in Prometheus looks very similar to a metric.
Filters are a set of labels;
every label is a key/value pair, and they are separated by commas,
and each comma acts as an AND.
= Two values must be equal
!= Two values must NOT be equal
=~ Value on left must match the Regular Expression (regex) on right [this is for text and string matching]
!~ Value on left must NOT match the Regular Expression (regex) on right
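Putting the matchers together - a sketch using the built-in prometheus_http_requests_total metric:

```promql
# code equals "200" AND job is not "node" (the comma acts as AND)
prometheus_http_requests_total{code="200", job!="node"}

# regex matchers: any 2xx code, excluding handlers ending in /metrics
prometheus_http_requests_total{code=~"2.*", handler!~".*/metrics"}
```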
Aggregation Operators.
Aggregates the elements of a single Instant Vector.
The result is a new Instant Vector with aggregated values.
sum Calculates the sum over dimensions.
min Selects the minimum over dimensions.
max Selects the maximum over dimensions.
avg Calculates the average over dimensions.
count Counts the number of elements over dimensions.
group Groups elements; all values in the resulting vector are equal to 1.
count_values Counts the number of elements with the same value.
topk Largest k elements by sample value.
bottomk Smallest k elements by sample value.
stddev Calculates the population standard deviation over dimensions.
stdvar Calculates the population standard variance over dimensions.

Syntax: <Aggregation Operator>(<Instant Vector>)
sum(node_cpu_seconds_total)
<Aggregation Operator>(<Instant Vector>) by (<label list>)
sum(node_cpu_seconds_total) by (cpu)
<Aggregation Operator>(<Instant Vector>) without (<label list>)
sum(node_cpu_seconds_total) without (cpu)
offset - e.g. 5 minutes ago
e.g.
prometheus_http_requests_total offset 5m
If we want to check the data from 5 minutes ago, we add offset 5m.
If instead we try to do it with a range vector,
prometheus_http_requests_total[5m]
it will give an error: a range vector selects the last 5 minutes of samples, not the value from 5 minutes ago.
With group by, the value is always 1:
group(prometheus_http_requests_total) by (code)
but with avg/sum by, the value is computed accordingly:
avg(prometheus_http_requests_total) by (code)
If we apply offset after by, will it work?
avg(prometheus_http_requests_total) by (code) offset 5m   X
No, it does not work: offset only works right after the metric selector.
e.g.
avg(prometheus_http_request_total offset 5h) by (code)
Functions
absent(<Instant Vector>) - checks whether an instant vector has any members.
Returns an empty result if the parameter has elements, and a 1-valued vector if it does not.
absent(node_cpu_seconds_total{cpu="x09d"})
returns {cpu="x09d"} 1 because no such series exists.
Another example - this gives an error, because absent cannot accept a range vector:
absent(node_cpu_seconds_total[1h])
absent_over_time(<Range Vector>) - checks whether a range vector has any members. Returns an empty result if the parameter has elements.
absent_over_time(node_cpu_seconds_total{cpu="xrft"}[1h])
clamp_min(node_cpu_seconds_total, 300)
All values smaller than 300 are raised to 300 (the result is never below 300).
clamp_max(node_cpu_seconds_total, 15000)
All values larger than 15000 are lowered to 15000 (the result is never above 15000).
delta and idelta
day_of_month(<Instant Vector>) - for every UTC time, returns the day of month, 1..31.
day_of_week(<Instant Vector>) - for every UTC time, returns the day of week, 0..6 (0 is Sunday).
delta(<Range Vector>) - can only be used with gauges; returns the difference between the first and last value in the range.
idelta(<Range Vector>) - returns the difference between the last two samples.
delta(node_cpu_temp[2h]) - CPU temperature change over two hours.
log2(<Instant Vector>) - returns the binary logarithm of each value.
log10(<Instant Vector>) - returns the decimal logarithm of each value.
ln(<Instant Vector>) - returns the natural logarithm of each value.
sort(<Instant Vector>) - sorts elements in ascending order.
sort_desc(<Instant Vector>) - sorts elements in descending order.
time() - returns a near-current timestamp.
timestamp(<Instant Vector>) - returns the timestamp of each time series (element).
e.g.
sort(clamp(node_cpu_seconds_total, 300, 15000)) - sorts the clamped values in ascending order.
timestamp(clamp(node_cpu_seconds_total, 300, 15000)) - gives each metric's timestamp.
timestamp(clamp(node_cpu_seconds_total offset 1h, 300, 15000)) - gives the timestamps as of 1 hour ago.
Aggregation Over time
avg_over_time(<Range Vector>) Returns the average of items in a range vector
sum_over_time(<Range Vector>) Returns the sum of items in a range vector
min_over_time(<Range Vector>) Returns the min of items in a range vector
max_over_time(<Range Vector>) Returns the max of items in a range vector
count_over_time(<Range Vector>) Returns the count of items in a range vector
eg
avg_over_time(node_cpu_seconds_total{cpu="0"}[2h])
Alerts in Prometheus
Alerts are visible on the Prometheus dashboard,
but for alert notifications we need to use Alertmanager.
Alertmanager can trigger notifications to different platforms like email, Slack, PagerDuty, and webhooks.
For alerts (on Linux), we put the alert rule files in /etc/prometheus/rules;
on Mac/Windows we give a relative path in the prometheus.yml file.
For Mac:
/etc/rule/alert.yaml
groups:
  - name: Alerts
    rules:
      - alert: Is Node Exporter Up
        expr: up{job="node_exporter"} == 0
Another way to write the rule:
groups:
  - name: Alerts
    rules:
      - alert: Is Node Exporter Up
        expr: absent(up{job="node_exporter"})
        for: 5m
Note: the expr fires when it returns a non-empty result; when this condition matches, you will get a firing alert.
etc/prometheus.yaml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "node_exporter"
    static_configs:
      - targets: ["localhost:9100"]
rule_files:
  - "rule/alerts.yaml"
Restart the Prometheus service,
check the targets again,
and now test the alert by stopping the node exporter service on port 9100.
For more ready-made alert rules, see:
https://awesome-prometheus-alerts.grep.to/rules.html
The "for" expression
Use the "for" clause to define a time threshold:
e.g. if the node_exporter node has been down for the last 5 minutes, the alert will fire.
For example:
groups:
  - name: Alerts
    rules:
      - alert: Is Node Exporter Up
        expr: absent(up{job="node_exporter"})
        for: 5m
Alert Manager
Converts alerts into notifications.
Can receive alerts from multiple Prometheus servers.
Can de-duplicate alerts into one single alert, to avoid confusion.
Can silence alerts - you can choose to silence some alerts for some time (in case they raise false alarms).
It has a web user interface,
accessed via port 9093.
It is configured via the alertmanager.yml file.
Download Alertmanager
1. Go to the official Prometheus website:
https://prometheus.io/download/
copy the link
sudo wget <link>
2. Untar the package:
sudo tar xvf alertxxxx
3. sudo mkdir /var/lib/alertmanager
4. cd alertxxxx_package
5. sudo mv * /var/lib/alertmanager
6. cd /var/lib/alertmanager
7. sudo mkdir data
8. sudo chown -R prometheus:prometheus /var/lib/alertmanager
9. sudo chown -R prometheus:prometheus /var/lib/alertmanager/*
10. sudo chmod -R 755 /var/lib/alertmanager
11. sudo chmod -R 755 /var/lib/alertmanager/*
12. sudo vi /etc/systemd/system/alertmanager.service
[Unit]
Description=Prometheus Alert Manager
Documentation=https://prometheus.io/docs/introduction/overview/
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
User=prometheus
Group=prometheus
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/var/lib/alertmanager/alertmanager --storage-path="/var/lib/alertmanager/data"
SyslogIdentifier=prometheus_alert_manager
Restart=always
[Install]
WantedBy=multi-user.target
13. sudo systemctl daemon-reload
14. sudo systemctl start alertmanager
15. sudo systemctl enable alertmanager
16. sudo systemctl status alertmanager
Alertmanager configuration file
config.yml
global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'in-v3.mailjet.com:587'
  smtp_from: 'info@cloudware.net.au'
  smtp_auth_username: ''
  smtp_auth_password: ''
route:
  receiver: main_receiver
  routes:
    - receiver: 'urgent_receiver'
      matchers:
        - severity='Critical'
receivers:
  - name: 'main_receiver'
    email_configs:
      - to: 'manjeetyadavrajokri@gmail.com'
  - name: 'urgent_receiver'
    email_configs:
      - to: 'info@cloudware.net.au'
So now if you disable the node exporter, it will trigger an email to info@cloudware.net.au, because the alert has severity Critical and matches the urgent route.
Integrating Alertmanager with Slack
Incoming webhooks are not specific to Slack - you can use them with Microsoft Teams or Hangouts; the same incoming-webhook technology can send notifications from Prometheus to your messenger.
1. First you need a channel:
create a channel on Slack; you then need admin access on that channel. If you have admin access, you will see an option in the top corner, "View all members of this channel".
2. Click on the corner tab - go to Integrations - Add app - in the search box look for "incoming webhook" - Install - Add to Slack - in "Post to channel" choose the channel - click "Add incoming webhooks integration" - Setup instructions - Webhook URL.
3. Copy the webhook URL here, and change the customized icon if you want.
4. Save Settings.
5. Go to the Alertmanager config file and put the below under receivers:
receivers:
  - name: 'urgent_receiver'
    slack_configs:
      - api_url: 'http://webhookurl'
        channel: '#udemy-course-for-prometheus'
Stop node_exporter,
restart Alertmanager,
and now you should get the notification on Slack.
Inhibiting Alerts
Silencing is temporary - you can create a silence rule.
Inhibiting notifications is permanent:
it inhibits an alert if another alert is firing.
Scenario: we have 2 alerts, Server Down and Website Down, as below; if alert 1 triggers, there is no need to trigger alert 2:
Prometheus
Alert 1: Server Down
Alert 2: Website Down
How do we define it?
1. Find a metric on Prometheus:
search for the metric below and copy it:
prometheus_build_info
2. Go to the alert.yaml configuration file
and replace the expr:
groups:
  - name: Alerts
    rules:
      - alert: Is Node Exporter Up?
        expr: up{job="node_exporter"} == 0
        for: 0m
        labels:
          team: Team Alpha
          severity: Critical
        annotations:
          summary: "{{ $labels.instance }} Is Down"
          description: "Team Alpha to restart the server {{ $labels }} VALUE: {{ $value }}"
  - name: Alert
    rules:
      - alert: Is Node Exporter Up?
        expr: prometheus_build_info{branch="non-git", goversion="go1.16.5", instance="localhost:9090", job="prometheus"}
        for: 0m
        labels:
          team: Team Beta
          severity: Critical
        annotations:
          summary: "{{ $labels.instance }} Is Down"
          description: "Team Beta to restart the server {{ $labels }} VALUE: {{ $value }}"
3. Alertmanager config file
config.yaml
global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'in-v3.mailjet.com:587'
  smtp_from: 'info@cloudware.net.au'
  smtp_auth_username: ''
  smtp_auth_password: ''
route:
  receiver: main_receiver
  routes:
    - receiver: 'urgent_receiver'
      matchers:
        - severity='Critical'
receivers:
  - name: 'main_receiver'
    email_configs:
      - to: 'manjeetyadavrajokri@gmail.com'
  - name: 'urgent_receiver'
    email_configs:
      - to: 'info@cloudware.net.au'
inhibit_rules:
  - source_match:
      team: "Team Alpha"
    target_match:
      team: "Team Beta"
    equal: ['severity']
4. Restart the Alertmanager.
Recording rules
Recording rules precompute metrics.
avg, sum, count:
if we have a large set of data and we run avg/sum/count very frequently, it can slow down Prometheus.
So we can create recording rules (and store the result in another metric created from the rule).
Example: calculate the "avg" of "temperature" from IoT every 5 minutes and save it as IoT_Avg_Temp.
You can record your rules in yml files:
IoT_rules.yml
db_rules.yml
or you can put them in separate yml files, e.g. on Linux in:
/etc/prometheus/rules
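The IoT example above could be written as a recording-rule sketch like this (the `temperature` metric is hypothetical; the group's `interval` provides the 5-minute evaluation):

```yaml
# IoT_rules.yml (sketch)
groups:
  - name: iot_rules
    interval: 5m                  # evaluate every 5 minutes
    rules:
      - record: IoT_Avg_Temp
        expr: avg(temperature)    # hypothetical source metric
```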
Creating a rule
e.g. we want the metrics below:
avg by (mode) (node_cpu_seconds_total)
avg by (cpu) (node_cpu_seconds_total)
We get the metric, but we also need a time range for a recording rule.
We can't do it as below - it is invalid:
avg by (cpu) (node_cpu_seconds_total[1m])
The below works, but then we can't do aggregations like avg, sum, etc.:
node_cpu_seconds_total[1m]
So rate is the way we can use a time range:
rate(node_cpu_seconds_total[1m])
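What rate() computes can be sketched in Python as the per-second increase across the window (a toy version; the real rate() also handles counter resets and extrapolates to the window boundaries):

```python
def simple_rate(samples):
    """samples: list of (timestamp_seconds, counter_value) pairs, oldest first.
    Returns the per-second increase between the first and last samples."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# counter went from 100 to 160 over 60 seconds
simple_rate([(0, 100), (15, 115), (30, 130), (45, 145), (60, 160)])
```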
Use irate when the counter changes very frequently,
but for alerts and recording rules, Prometheus recommends using rate.
avg by (cpu) (rate(node_cpu_seconds_total[1m]))
Recording-rule naming convention: <label>:<metric name>:<aggregation type>
cpu:node_cpu_seconds_total:avg
Now go to the rules.yml file:
rules.yml
groups:
  - name: node exporter rules
    rules:
      - record: cpu:node_cpu_seconds_total:avg
        expr: avg by (cpu) (rate(node_cpu_seconds_total[1m]))
        labels:
          exporter_type: node
prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "node_exporter"
    static_configs:
      - targets: ["localhost:9100"]
rule_files:
  - "rule/rules.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
restart the prometheus service
Client Libraries
Choose a Prometheus client library that matches the language your application is written in; internal metrics are then exposed via an HTTP endpoint on your application's instance.
Python Library
Generate some metrics and make them available to Prometheus.
prom-test.py
if __name__ == '__main__':
    print("This is a python app")
Add an endpoint to the application:
install the library
pip3 install prometheus-client
now prom-test.py:
------------------------------------------------------------------------------------------------------------------------------------
from prometheus_client import start_http_server, Summary
import random
import time

REQUEST_TIME = Summary("request_processing_second", "Time spent processing a function")

@REQUEST_TIME.time()
def process_request(t):
    time.sleep(t)

if __name__ == '__main__':
    start_http_server(8000)
    while True:
        process_request(random.random())
    print("The end")
now go to browser open localhost:8000
Counter metrics
Import the Counter class from the Prometheus client; then we can create a variable with any given name.
from prometheus_client import start_http_server, Summary, Counter
import random
import time

REQUEST_TIME = Summary("request_processing_second", "Time spent processing a function")
MY_COUNTER = Counter("my_counter", "")

@REQUEST_TIME.time()
def process_request(t):
    MY_COUNTER.inc(5)
    time.sleep(t)

if __name__ == '__main__':
    start_http_server(8000)
    process_request(random.random())
    while True:
        A = 1
    print("The end")
Gauge metrics
from prometheus_client import start_http_server, Summary, Counter, Gauge
import random
import time

REQUEST_TIME = Summary("request_processing_second", "Time spent processing a function")
MY_COUNTER = Counter("my_counter", "")
MY_GAUGE = Gauge("my_gauge", "")

@REQUEST_TIME.time()
def process_request(t):
    MY_COUNTER.inc(5)
    MY_GAUGE.set(5)
    MY_GAUGE.inc(5)
    MY_GAUGE.dec(2)
    time.sleep(t)

if __name__ == '__main__':
    start_http_server(8000)
    process_request(random.random())
    while True:
        A = 1
    print("The end")
For exception handling
from prometheus_client import start_http_server, Summary, Counter, Gauge
import random
import time

REQUEST_TIME = Summary("request_processing_second", "Time spent processing a function")
MY_COUNTER = Counter("my_counter", "")
MY_GAUGE = Gauge("my_gauge", "")

@REQUEST_TIME.time()
@MY_COUNTER.count_exceptions()
def process_request(t):
    MY_COUNTER.inc(5)
    MY_GAUGE.set(5)
    MY_GAUGE.inc(5)
    MY_GAUGE.dec(2)
    time.sleep(t)

if __name__ == '__main__':
    start_http_server(8000)
    process_request(random.random())
    while True:
        A = 1
    print("The end")
Labels
Attaching labels to your metrics:
when you introduce labels, Prometheus will not add them to your metric yet,
because it doesn't know what value has to be assigned
to each of these label names (name, age); so you should assign a value to these labels.
from prometheus_client import start_http_server, Summary, Counter, Gauge
import random
import time

REQUEST_TIME = Summary("request_processing_second", "Time spent processing a function")
MY_COUNTER = Counter("my_counter", "", ["name", "age"])
MY_GAUGE = Gauge("my_gauge", "")

@REQUEST_TIME.time()
@MY_COUNTER.count_exceptions()
def process_request(t):
    MY_COUNTER.labels(name="Joe", age=30).inc(3)
    time.sleep(t)

if __name__ == '__main__':
    start_http_server(8000)
    process_request(random.random())
    while True:
        A = 1
    print("The end")
Now go to the Prometheus configuration.
prometheus.yml
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "node_exporter"
    static_configs:
      - targets: ["localhost:9100"]
  - job_name: "python"
    static_configs:
      - targets: ["localhost:8000"]
rule_files:
  - "rule/alerts.yml"
  - "rule/recording_rule.yml"
alerting:
  alertmanagers:
Restart Prometheus,
go to the Prometheus dashboard - Targets - Endpoint:
http://localhost:8000/metrics
and check for the metric
my_counter_total
Service Discovery and Push Gateway
We have servers, and we also have auto-scaling,
so due to load the number of servers can increase or decrease.
In this case, Prometheus is unable to keep track of all the targets.
Or we have multiple servers behind a load balancer,
so Prometheus cannot get metrics correctly: on each scrape, only one server behind the load balancer will respond.
Or we have serverless functions (they don't have an IP or a DNS name), so it is impossible to scrape metrics from them.
In these cases we use the
Push Gateway.
For this type of situation we need a component called the Prometheus Push Gateway; the Push Gateway allows any code to send metrics to it.
We can configure our serverless code to send metrics to the Push Gateway, and our applications on virtual servers can push their metrics to the Push Gateway. The Push Gateway has an internal exporter, so Prometheus can connect to it and scrape the metrics from the Push Gateway.
Service Discovery
Configured in the prometheus.yml file:
<ec2_sd_config> - for EC2 machines on AWS
<dns_sd_config> - DNS based
<file_sd_config> - file based
<kubernetes_sd_config> - Kubernetes
<azure_sd_config> - Azure cloud
<gce_sd_config> - Google Cloud (GCE)
Service discovery and AWS
Works with AWS EC2 and AWS Lightsail.
In the Prometheus config file you need to use
<ec2_sd_config> or <lightsail_sd_config>
Other config options:
port
region
access_key, secret_key
role_arn
filters, source_labels
refresh_interval
Source labels:
__meta_ec2_ami: the EC2 Amazon Machine Image
__meta_ec2_availability_zone
__meta_ec2_availability_zone_id
__meta_ec2_instance_id
__meta_ec2_instance_state
__meta_ec2_instance_type
__meta_ec2_private_ip
__meta_ec2_public_ip
__meta_ec2_private_dns_name
__meta_ec2_public_dns_name
__meta_ec2_vpc_id
__meta_ec2_tag_<tagkey>
e.g. __meta_ec2_tag_environment="production"
e.g.
scrape_configs:
  - job_name: "My EC2"
    ec2_sd_configs:
      - port: 8000
Service Discovery in AWS
We have an AWS instance with the Name tag "dev-prometheus".
prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "aws sd"
    ec2_sd_configs:
      - port: 9090
        region:
        access_key:
        secret_key:
        filters:
          - name: tag:Name
            values:
              - dev-.*
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name, __meta_ec2_private_ip]
        target_label: instance
      - source_labels: [__meta_ec2_tag_Name]
        regex: dev-.*
        action: keep    # or: drop
rule_files:
  - "rule/alerts.yml"
  - "rule/recording_rule.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
Restart the Prometheus service.
The job above discovers the instances, but it targets the private IP by default.
To discover by public IP:
- job_name: "aws sd"
  ec2_sd_configs:
    - port: 9090
      region:
      access_key:
      secret_key:
      filters:
        - name: tag:Name
          values:
            - dev-.*
  relabel_configs:
    - source_labels: [__meta_ec2_tag_Name, __meta_ec2_private_ip]
      target_label: instance
    - source_labels: [__meta_ec2_tag_Name]
      regex: dev-test-.*
      action: drop
    - source_labels: [__meta_ec2_public_ip]
      replacement: ${1}:9090
      target_label: __address__
restart prometheus
File based service discovery
Suppose you use a cloud that doesn't have built-in service discovery support,
e.g.
IBM Cloud
Alibaba Cloud
In this case we use file-based service discovery (file_sd).
Create a folder in the same location as the prometheus.yml file:
mkdir file_sd
and put your SD file in this folder.
file.yml
- targets:
    - localhost:9100
  labels:
    team: "Team Alpha"
Now update the prometheus.yml file:
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "file sd"
    file_sd_configs:
      - files:
          - /usr/local/etc/file_sd/file.yml
          - /usr/local/etc/file_sd/file1.yml
          - /usr/local/etc/file_sd/*.yml
  - job_name: "aws sd"
    ec2_sd_configs:
      - port: 9090
        region:
        access_key:
        secret_key:
        filters:
          - name: tag:Name
            values:
              - dev-.*
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name, __meta_ec2_private_ip]
        target_label: instance
      - source_labels: [__meta_ec2_tag_Name]
        regex: dev-.*
        action: keep    # or: drop
rule_files:
  - "rule/alerts.yml"
  - "rule/recording_rule.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
Restart the Prometheus service and check the targets in the Prometheus console.
Install Push Gateway
Download the Push Gateway from the official website:
https://prometheus.io/download/#pushgateway
wget https://github.com/prometheus/pushgateway/releases/download/v1.5.1/pushgateway-1.5.1.linux-amd64.tar.gz
Create the needed service file:
[Unit]
Description=Prometheus Pushgateway
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/pushgateway
[Install]
WantedBy=multi-user.target
tar xvf pushgatewayxxxx.tar.gz
cd pushgatewayxxxx
cp pushgateway /usr/local/bin/pushgateway
chown prometheus:prometheus /usr/local/bin
chown prometheus:prometheus /usr/local/bin/*
vi /etc/systemd/system/pushgateway.service
pass above service file content
systemctl daemon-reload
systemctl start pushgateway
systemctl enable pushgateway
systemctl status pushgateway
go to web browser
check http://localhost:9091/metrics
How to send metrics to the Push Gateway
Sending metrics to the Push Gateway with Python:
first you need to install the Prometheus client (and activate your Python env if needed).
Install the Python client:
pip3 install prometheus_client
/prometheus/python-client/push-to-gateway.py
from prometheus_client import push_to_gateway, CollectorRegistry, Gauge
import time

registry = CollectorRegistry()
gauge = Gauge("python_push_to_gateway", "python_push_to_gateway", registry=registry)

while True:
    gauge.set_to_current_time()
    push_to_gateway("localhost:9091", job="Job A", registry=registry)
    time.sleep(15)  # avoid pushing in a tight loop

Run the python code:
python3 push-to-gateway.py
now go to prometheus
check the metric name python_push_to_gateway
Authentication Methods in Prometheus
Basic Authentication
can be used to secure the web UI and API:
- Choose a username and password
- Create a bcrypt hash of your password
- Create a web configuration file
- Launch Prometheus with the web configuration file
1. Add a username
and password.
If you have apache2-utils or httpd-tools,
a different way to create a bcrypt password:
2. htpasswd
htpasswd -nBC 10 "admin"
New password:
Re-type new password:
admin:fsdfwewwef2efdfdssfs
3. Go to the prometheus.yml location
and create a web.yml file in the same location:
basic_auth_users:
  admin: fsdfwewwef2efdfdssfs
cmd:
prometheus --web.config.file=/usr/local/etc/web.yml --config.file=/usr/local/etc/prometheus.yml
Or, if you are on a Mac,
check the prometheus.args file:
--config.file /usr/local/etc/prometheus.yml
--web.listen-address=127.0.0.1:9090
--storage.tsdb.path /usr/local/var/prometheus
--web.config.file=/usr/local/etc/web.yml
and restart Prometheus service
Enabling HTTPS for Improved Security
Check the Prometheus console on the HTTPS port - it will not open.
If you want to add a certificate, you can also generate your own certificate and add that.
cmd:
openssl req -new -newkey rsa:2048 -days 365 -nodes -x509 -keyout my.key -out my.crt -subj "/C=BE/ST=Antwerp/L=Brasschaat/O=Inuits/CN=localhost"
Or generate an RSA key pair online:
go to https://cryptotools
Put your certificate in the same location where you have web.yml,
then open the web.yml file:
tls_server_config:
  cert_file: prom.crt
  key_file: prom.key
basic_auth_users:
  admin: fsdfwewwef2efdfdssfs
Restart Prometheus
and go to Prometheus again:
http will not work; now try with https.
Enabling HTTPS on Exporters, i.e. Node Exporter
Making the communication between Prometheus and node_exporter use HTTPS:
go to the same location where you have web.yml
and create a file node_web.yml:
tls_server_config:
  cert_file: /usr/local/etc/prom.crt
  key_file: /usr/local/etc/prom.key
Run a command to check:
node_exporter --web.config.file=/usr/local/etc/node_web.yml
If you are on a Linux machine, you need to update the node_exporter service file with:
--web.config.file=/usr/local/etc/node_web.yml
Restart the node_exporter service
and try to open node_exporter:
https://localhost:9100
Now go to prometheus.yml and add the certificate to the job:
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "node_exporter"
    scheme: https
    tls_config:
      ca_file: /usr/local/etc/prom.crt
      server_name: 'localhost'
    basic_auth:
      username: admin
      password: password
    static_configs:
      - targets: ["localhost:9100"]
restart prometheus service
now check metrics
node_cpu_seconds_total
Securing the Push Gateway
Go to the same location where you have web.yml
and create a file pushgateway.yml:
tls_server_config:
  cert_file: /usr/local/etc/prom.crt
  key_file: /usr/local/etc/prom.key
Run a command to check:
./pushgateway --web.config.file=/usr/local/etc/pushgateway.yml
If you are on a Linux machine, you need to update the pushgateway service file with:
--web.config.file=/usr/local/etc/pushgateway.yml
Restart the pushgateway service
and try to open the pushgateway:
https://localhost:9091
Connecting to the Push Gateway Securely
Open push-to-gateway.py:
/prometheus/python-client/push-to-gateway.py
from prometheus_client import push_to_gateway, CollectorRegistry, Gauge
from prometheus_client.exposition import basic_auth_handler
import time

def auth_handler(url, method, timeout, headers, data):
    return basic_auth_handler(url, method, timeout, headers, data, "admin", "password")

registry = CollectorRegistry()
gauge = Gauge("python_push_to_gateway", "python_push_to_gateway", registry=registry)

while True:
    gauge.set_to_current_time()
    push_to_gateway("https://localhost:9091", job="Job A", registry=registry, handler=auth_handler)
    time.sleep(15)  # avoid pushing in a tight loop

Run the script:
python push-to-gateway.py
If you get an error related to the self-signed certificate, do the below:
export SSL_CERT_FILE=/usr/local/etc/prom.crt
Now go to prometheus.yml and add the certificate to the job:
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "pushgateway"
    scheme: https
    tls_config:
      ca_file: /usr/local/etc/prom.crt
      server_name: 'localhost'
    basic_auth:
      username: admin
      password: password
    static_configs:
      - targets: ["localhost:9091"]
restart prometheus service
now check the Prometheus web UI
now check metrics
python_push_to_gateway
What is Grafana?
Open-source software to:
visualise time-series data (metrics)
visualise metrics from various data sources
it supports alerting
and is multi-organisational
Installing Grafana
Port 3000 needs to be open for Grafana to work.
1. Go to https://grafana.com/grafana/download
and select Linux.
2. wget https://
sudo dpkg -i grafana_xxxx
3. sudo apt-get -f install
4. systemctl status grafana-server
5. systemctl start grafana-server
systemctl enable grafana-server
6. http://public_ip:3000
admin
admin
Configuring Grafana
1. cd /etc/grafana
ls
grafana.ini
ldap.toml
provisioning
Don't make changes directly in the grafana.ini file; instead make a copy:
cp grafana.ini custom.ini
and make changes in custom.ini.
2. vi custom.ini
# INSTANCE NAME
;instance_name = ${HOSTNAME}
# Directory where grafana can store logs
;logs = /var/log/grafana
# Either "mysql", "postgres" or "sqlite3"
;type = sqlite3
;host = 127.0.0.1:3306
;name = grafana
;user = root
Here "database" means: if you want to keep your data in the same location you can use sqlite3, but if you want to put your data somewhere else (e.g. you have a Docker environment), then fill in your external database info.
Another scenario: when you run multiple instances of Grafana and they need to share data (so that no matter which Grafana instance you access, you see the same dashboards and the same database), you should use a shared external database like MySQL or PostgreSQL.
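For that shared-database scenario, the relevant custom.ini section might look like this (the host, database name, and credentials below are placeholders):

```ini
; custom.ini (sketch): point multiple Grafana instances at one shared MySQL
[database]
type = mysql
host = 127.0.0.1:3306
name = grafana
user = grafana
password = change_me
```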
After making the changes, restart the Grafana service:
systemctl restart grafana-server
Connect Grafana to Prometheus
go to settings - Data Source - Add data source - a list of supported sources will be there - eg. select prometheus
url - prometheus url
access - Browser
Timeout - leave blank
Auth
Basic Auth - disable, With Credentials - disable
TLS Client Auth
Basic Auth Details
Alerting
- enable / disable
you will receive alerts from prometheus as well.
2 ways of querying
HTTP Method : POST / GET
Save & Test
you will see a green checkmark
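As an alternative to clicking through the UI, the same data source can be set up from a file via the provisioning directory listed earlier under /etc/grafana. A minimal sketch (the file name and url below are assumptions):

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy        # the grafana backend proxies the queries
    url: http://localhost:9090
    isDefault: true
```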
Creating Your First Grafana Dashboard
go to the + icon on the left side - click "Dashboard"
give the dashboard a name
Folder - put your dashboard in a folder - General / My folder, or you can also create a new folder
Save
to edit - click on the settings option
Left side - General
Annotations in Grafana
An annotation marks a chosen spot on your graph; it is about the time of an event that happened at a certain moment rather than the actual data. For example, you can define an annotation as below.
In the same way we can add multiple annotations; an enable/disable button for the annotations will be visible on all charts.
Alerts in Grafana
Alerts are defined on a Graph Panel
Each Graph Panel can have one to many alerts
Alerts are raised when a rule is violated
A rule indicates whether a value on the graph is above or below a threshold
Rules are stored in and evaluated by the Rule Engine
Alert states:
OK
Pending
Alerting
create an alert
config parameters in an alert
for example, the alert "US refunds": Evaluate every 20s, and fire if it stays active For 1m
condition:
WHEN avg() OF query(A, 10s, now) IS ABOVE 400
error and No data handling
after setting the alert, check the panel - you will see a red threshold line
and if you want to create a new panel for alerts only - click + - select visualization
in the panel, the Alert list type option has many settings like
max items, filters, status filter, OK, Alerting, Pending - enable / disable
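Since alerts can also come from prometheus itself (as noted above), the same kind of threshold rule would look roughly like this on the prometheus side; the metric and alert names here are made up for illustration:

```yaml
# rules.yml, referenced from prometheus.yml under rule_files:
groups:
  - name: demo-alerts
    rules:
      - alert: HighRefunds
        expr: avg_over_time(refunds_total[10s]) > 400
        for: 1m          # must stay above the threshold for 1 minute
        labels:
          severity: warning
        annotations:
          summary: "Average refunds above 400"
```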
User management in Grafana
create a user - Server Admin - Users - New User
switch organisation
if you don't want to create users manually, you can invite users:
Google Authentication for Grafana
go to https://console.developers.google.com
Create OAuth client ID
Application type : Web application
Name: Grafana
Authorised JavaScript origins
http://localhost:3000
Authorised redirect URIs
http://localhost:3000/login/google
Create
and you will get
Client Id
Client Secret
go to Grafana configuration
vi default.ini
find auth.google in the file
################
[auth.google]
enabled = true
allow_sign_up = true
client_id = xxxxxxxxxxx
client_secret = xxxxxxxxx
scopes = https://www.googleapis.com/auth/userinfo.profile https://www.googleapis.com/auth/userinfo.email
auth_url = https://accounts.google.com/o/oauth2/auth
token_url = https://accounts.google.com/o/oauth2/token
api_url = https://www.googleapis.com/oauth2/v1/userinfo
allowed_domains =
hosted_domain =
restart grafana
and now try to log in to grafana with a google user
Authentication with LDAP
LDAP stands for Lightweight Directory Access Protocol
It is supported by major directory services such as Active Directory
Directories are mainly used to manage domains, groups and users
Practical
1. Create an AD server on Windows Server 2016 (in aws)
on the server create a domain
grafana.local
create 2 users:
binder
and a grafana login user
Aref Karimi
Now go to grafana config location
ls
default.ini
ldap.toml
sample.ini
vi default.ini
#####################
[auth.ldap]
enabled = true
config_file = /etc/grafana/ldap.toml
allow_sign_up = true
#################################
vi ldap.toml
[[servers]]
# Ldap server host (specify multiple hosts space separated)
host = "13.54.19.240"
port = 389
use_ssl = false
start_tls = false
ssl_skip_verify = true
# set to path to your root CA certificate or leave unset to use system defaults
# root_ca_cert = "/path/to/certificate.crt"
# Authentication against LDAP servers requiring client certificates
# client_cert = "/path/to/client.crt"
# client_key = "/path/to/client.key"
# Search user bind dn
bind_dn = "CN=binder,CN=Users,DC=grafana,DC=local"
#If the password contains # or ; you have to wrap it with triple quotes. Ex """#password;"""
bind_password = 'asd123_'
# User search filter, for example "cn=%s" or "(sAMAccountName=%s)" or "(uid=%s)"
search_filter = "(sAMAccountName=%s)"
# An array of base dns to search through
search_base_dns = ["dc=grafana,dc=local"]
for adding a group for admin access
we create a group in active directory
and go to
# Map ldap groups to grafana org roles
[[servers.group_mappings]]
group_dn = "cn=grafana-admins,dc=grafana,dc=local"
org_role = "Admin"
# To make user an instance admin (Grafana Admin) uncomment line below
# grafana_admin = true
# The Grafana organization database id, optional, if left out the default org (id 1) will be used
# org_id = 1
Now if you log in to grafana as user Aref, you will see admin access.