Tuesday, December 13, 2022

ELK

Elasticsearch is Document Oriented
Insert Doc
Delete Doc
Retrieve Doc
Analyze Doc
Search Doc

How does Elasticsearch search so fast?
The inverted index.
Elasticsearch builds its inverted index using the Apache Lucene library.
What is an inverted index?
- It maps words to the documents (and the locations within them) where they occur.
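A toy illustration with two hypothetical documents:

Doc 1: "red honda car"
Doc 2: "blue honda truck"

Inverted index (term -> documents):
honda -> [1, 2]
red   -> [1]
blue  -> [2]
car   -> [1]
truck -> [2]

Looking up a search term in this structure is a direct lookup, which is why searches stay fast even over huge document sets.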

Elasticsearch is Document Oriented

 

 

Inserting data into Elasticsearch is called indexing.

Index document syntax:

 

HTTP verbs
  • Retrieve a single item (GET)
  • Retrieve a list of items (GET)
  • Create an item (POST)
  • Update an item (PUT)
  • Delete an item (DELETE)

 

For example, let's index a document:
PUT /vehicles/car/123
{
  "make": "honda",
  "milage": 87000,
  "color": "red"
}

Another one, overwriting the same document with more fields:
PUT /vehicles/car/123
{
  "make": "Honda",
  "Color": "Black",
  "HP": 250,
  "milage": 24000,
  "price": 19300.47
}

The response looks something like the following:
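A sketch of a typical index response (values are illustrative; _seq_no and _primary_term only appear in newer versions):

{
  "_index": "vehicles",
  "_type": "car",
  "_id": "123",
  "_version": 2,
  "result": "updated",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 1,
  "_primary_term": 1
}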

 

The _version field increments each time you index (overwrite) the same document again.

Mapping types were deprecated in Elasticsearch 6.x and removed in 7.x, so an index can no longer hold multiple types such as car, truck and so on.

And if you don't specify an id while indexing (inserting a document), Elasticsearch generates one automatically.

 How to get a document?

GET /vehicles/car/123



GET /vehicles/car/123/_source - returns only the document data, without the metadata.

What if we just want to check whether a document exists, without seeing the metadata or the data?

HEAD /vehicles/car/123

Suppose we update the Color field from black to blue by re-indexing the whole document:
PUT /vehicles/car/123
{
  "make": "Honda",
  "Color": "blue",
  "HP": 250,
  "milage": 24000,
  "price": 19300.47
}


 
Updating a document.
In a database, an update modifies the same row in place, but in Elasticsearch documents are immutable: whenever you update a field, Elasticsearch creates a whole new document and increments the version.

Elasticsearch also has an update API for partial updates:
POST /vehicles/car/123/_update
{
  "doc": {
    "make": "Honda",
    "Color": "blue",
    "HP": 250,
    "milage": 19000,
    "price": 19300.97
 }
}



POST /vehicles/car/123/_update
{
  "doc": {
    "price": 1000
 }
}



If a field doesn't exist in the document yet, we can add it as well, either by re-indexing with PUT or via the POST update API:
POST /vehicles/car/123/_update
{
  "doc": {
    "driver": "Tom"
 }
}


Delete a document
DELETE /vehicles/car/123

 

When we delete a document, the space isn't freed at the same time: the document is only marked as deleted, stays on disk for a while, and is purged later when segments are merged.

How to delete an index?

But first, how do we view the whole index?

GET /vehicles

DELETE /vehicles

Components of an index
GET /business
{
  "business": {
    "aliases": {},
    "mappings": {},
    "settings": {}
  }
}

Settings

mapping:


Note: we can't add more than one mapping type per index. If the business index already has a building type, we can't add another type such as employee; only one mapping type is allowed.

How do we search documents?
GET business/building/_search
This searches across all the data in the business index.

 

default query
GET business/_search
{
   "query": {
     "match_all": {}
  }
}

And the equivalent curl request:

curl -XGET "http://localhost:9200/business/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_all": {}
  }
}'

If you want to see the result nicely indented, add ?pretty after _search.
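For example, the same request as above via curl with pretty-printing enabled:

curl -XGET "http://localhost:9200/business/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_all": {}
  }
}'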

 


Distributed Execution of Requests
We can build an Elasticsearch cluster that handles many requests in parallel. A really large search problem can be broken up into small pieces, and all of the small pieces can be solved in parallel in a much shorter period of time.

Index settings structure:
"number_of_shards": "5"
"number_of_replicas": "1"

When you index a document in a multi-node Elasticsearch cluster, the request is forwarded to the primary shard the document belongs to; if replicas are configured, the document is also replicated to the replica shard.
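The primary shard is chosen deterministically from the document id (the default routing value), roughly:

shard_num = hash(_routing) % number_of_primary_shards

This is also why the number of primary shards cannot be changed after an index is created.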



In the same way, a delete request in a multi-node cluster goes to the primary shard first, and the deletion is then synced to the replica as well.

 

A read request (GET) can be served by any node, because every node may hold either the primary or a replica of the shard, so traffic is spread across the copies in round-robin fashion.


In Elasticsearch an index is a logical representation of the data; physically it is a set of shards, and that's how an index gets split across the cluster.

What exactly is a shard?
A shard is a Lucene index. Lucene is the search library (first released in 1999) that Elasticsearch builds on, and since Elasticsearch is distributed, its searching power comes from the shards.

What does a shard contain?
A shard holds segments, and segments hold inverted indexes. An inverted index is a table-like structure of words (tokens), where each word records which documents it occurs in.

Now suppose we search for the word "man": it returns all matching documents, scored according to their relevance.

What is analysis in Elasticsearch?
Converting document text into tokens (terms), which are then put into the inverted index.



How does a document end up in a segment during analysis?
The document is converted into tokens, the tokens are collected in a buffer, and once the buffer is full they are written into a segment inside a shard.

Once data is saved in a segment, its inverted index is immutable: it's permanent.

 

Text analysis - common filters (see the sketch after this list):

1. Remove stop words - e.g. "the"
2. Lowercasing - "The Swimmers" -> "the swimmers"
3. Stemming (reduce to the root word) - "Swimming", "Swimmers" -> "swim"
4. Synonyms - "thin" -> "skinny"
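A quick way to see these steps in action is the _analyze API with the built-in english analyzer, which lowercases, removes stop words and stems (exact tokens may vary slightly by version, but "The Swimmers are swimming" becomes roughly the terms [swimmer, swim]):

POST _analyze
{
  "analyzer": "english",
  "text": "The Swimmers are swimming"
}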


 Analyzer Architecture 

 

 


1. Tokenizer - splits the text into tokens
2. Token filters - transform the tokens (lowercasing, stop-word removal, stemming, synonyms)
(An analyzer can also have character filters that pre-process the raw text before tokenizing.)

After the tokenizer and filters run, the document looks like the following:

 

And where does the analyzer do its work?
It works on the fields of a document, e.g. name, username, tweeted:


The analyzer runs both at query time and at indexing time.

Where do we configure the analyzer and the replica/shard settings?
When we create the index structure, we define its settings and mappings.


PUT /customers
{
   "settings": {
     "number_of_shards": 2,
     "number_of_replicas": 1
   },
   "mappings": {}
}

 GET customers

PUT /customers
{
   "mappings": {
     "online": {
       "properties": {
         "gender": {
           "type": "text",
           "analyzer": "standard"          
        },
         "age": {
           "type": "integer"
        },
         "total_spent": {
           "type": "float"
        },
         "is_new": {
           "type": "boolean"
        },
         "name": {
           "type": "boolean"
           "analyzer": "standard"
        }       
      }
    }
  }    
}
Note: since Elasticsearch 7, only one mapping type can be defined per index.
Fields are declared under properties.
Another thing: we can add additional fields later on as well.
e.g.
PUT /customers/online/124
{
   "name": "Mary Cranford",
   "address": "310 Clark Ave",
   "gender": "female",
   "age": "34",
   "total_spent": 550.75,
   "is_new": false
}

But in production you may not want Elasticsearch to add new fields at any point in time; you might want to lock down the set of fields. That is controlled by the dynamic setting on the mapping:


 

PUT customers/_mapping/online
{
   "dynamic": false
}

PUT customers/_mapping/online
{
   "dynamic": "strict"
}

With "dynamic": false, unmapped fields are ignored (kept in _source but not indexed); with "dynamic": "strict", documents containing unmapped fields are rejected.
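For instance, with this hypothetical document (the nickname field is not in the mapping), "dynamic": "strict" rejects the request with a strict_dynamic_mapping_exception, while "dynamic": false indexes the document but leaves nickname unsearchable:

PUT /customers/online/125
{
   "name": "John Doe",
   "nickname": "JD"
}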


More on analyzers:
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/analysis-analyzers.html

POST _analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
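The whitespace analyzer only splits on whitespace - no lowercasing, no stop-word removal - so the tokens come back roughly as:

[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]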

 

 

 

 DSL Query context

 

 GET /courses/_search
{
   "query": {
      "match_all": {}
  }
}


 

GET /courses/_search
{
   "query": {
      "match": {"name" : "computer" }
  }
}


Each query result returns some hits, and each hit has a _score based on the relevance of the document.

 GET /courses/_search
{  
   "query": {
   "exists": {"field": "professor.email"}
  }
}

This checks whether the field professor.email exists; every document found gets the same score of 1.


How do we query with two conditions in a DSL query?
We use a bool query with a must clause.

All conditions inside must have to match.

Must
GET /courses/_search
{
   "query": {
     "bool": {
       "must": [
         {"match": {"name": "computer"}},
         {"match": {"room": "c8"}}
       ]
     }
   }
}

 Must not

GET /courses/_search
{
   "query": {
     "bool": {
       "must": [
         {"match": {"name": "accounting"}},
         {"match": {"room": "e3"}}
        ],

       "must_not":[
         {"match": {"professor.name": "bill"}}
       ]
    }
  }
}

should means "nice to have": when a must clause is present, should does not take precedence over must and documents aren't excluded for failing a should clause; it mostly just influences the score.
We can force should to be required by defining minimum_should_match.

What is minimum_should_match?
If we set it to 1, at least one of the should conditions must be true.

GET /courses/_search
{
   "query": {
     "bool": {
       "must": [
         {"match": {"name": "accounting"}},
         {"match": {"room": "e3"}}
       ],
       "must_not": [
         {"match": {"professor.name": "bill"}}
       ],
       "should": [
       ],
       "minimum_should_match": 1
     }
   }
}

multi_match query?
A multi_match query searches for the query string across multiple fields; if it matches in any of the fields, the document matches:
GET /courses/_search
{
   "query": {
     "multi_match": {
       "fields": ["name", "professor.department"],
       "query": "accounting"
    }    
  }
}

What is match_phrase?
It's a query that matches a whole phrase in a field:
GET /courses/_search
{
   "query": {
     "match_phrase": {
       "course_description": "from the business school taken by final year"
     }
   }
}
Note: with match_phrase, if a token is cut short the query won't match. For example, if "final" in the query above is truncated to "fin", nothing is found.

 

How can we still match a phrase when the last word in the search text is incomplete?
match_phrase_prefix - the trailing partial token is treated as a prefix:
GET /courses/_search
{
   "query": {
     "match_phrase_prefix": {
       "course_description": "from the business school taken by fin"
     }
   }
}


Now a combined query with everything we've used: must, must_not, should, and range:
GET /courses/_search
{
   "query": {
     "bool": {
       "must": [
         {"match": {"name": "accounting"}}
       ],
       "must_not": [
         {"match": {"room": "e7"}}
       ],
       "should": [
         {
           "range": {
             "students_enrolled": {
               "gte": 10,
               "lte": 20
             }
           }
         }
       ]
     }
   }
}

 DSL filter context

GET /courses/_search
{
   "filter":{
     "match": {"name": "accounting"}
  }
}
but this is not a valid query.


Note: the filter goes inside the query, and within the query it goes inside a bool:
GET /courses/_search
{
  "query": {
    "bool": {
      "filter": {
        "match": {"name": "accounting"}
      }
    }
  }
}

Note: also notice that every document's score is 0 in a filter context.

How does a filter differ from a query?
1. A filter does no scoring; it doesn't rank documents by relevance.
2. Filter results are also cached, which helps make repeated searches fast.
3. That is why a filter is faster than a query: it skips the scoring work that a query does.


The same must/must_not conditions, written inside a filter:
GET /courses/_search
{
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must": [
            {"match": {"professor.name": "bill"}},
            {"match": {"name": "accounting"}}
          ],
          "must_not": [
            {"match": {"room": "e7"}}
          ]
        }
      }
    }
  }
}

Query and filter together:
GET /courses/_search
{
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must": [
            {"match": {"professor.name": "bill"}},
            {"match": {"name": "accounting"}}
          ]
        }
      },
      "must": [
        {"match": {"room": "e3"}}
      ]
    }
  }
}


How do we boost documents (what is boosting)?
Here the filter narrows the results, and the should clauses boost the score of documents that also satisfy them:

GET /courses/_search
{
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "students_enrolled": {
                  "gte": 12
                }
              }
            }
          ]
        }
      },
      "should": [
        {"match": {"room": "e3"}},
        {
          "range": {
            "students_enrolled": {
              "gte": 13
            }
          }
        }
      ]
    }
  }
}
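A more direct form of field boosting (a sketch reusing the fields from the earlier examples) is the caret syntax in multi_match: name^3 gives matches in the name field three times the weight of matches in course_description:

GET /courses/_search
{
  "query": {
    "multi_match": {
      "query": "accounting",
      "fields": ["name^3", "course_description"]
    }
  }
}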


Bulk indexing - the bulk API
How do we index multiple documents in a single request?
POST /vehicles/cars/_bulk
{ "index": {}}
{ "price" : 10000, "color" : "white", "make" : "honda", "sold" : "2016-10-28", "condition": "okay"}
{ "index": {}}
{ "price" : 20000, "color" : "white", "make" : "honda", "sold" : "2016-11-05", "condition": "new" }
{ "index": {}}
{ "price" : 30000, "color" : "green", "make" : "ford", "sold" : "2016-05-18", "condition": "new" }
{ "index": {}}
{ "price" : 15000, "color" : "blue", "make" : "toyota", "sold" : "2016-07-02", "condition": "good" }
{ "index": {}}
{ "price" : 12000, "color" : "green", "make" : "toyota", "sold" : "2016-08-19" , "condition": "good"}
{ "index": {}}
{ "price" : 18000, "color" : "red", "make" : "dodge", "sold" : "2016-11-05", "condition": "good"  }
{ "index": {}}
{ "price" : 80000, "color" : "red", "make" : "bmw", "sold" : "2016-01-01", "condition": "new"  }
{ "index": {}}
{ "price" : 25000, "color" : "blue", "make" : "ford", "sold" : "2016-08-22", "condition": "new"  }
{ "index": {}}
{ "price" : 10000, "color" : "gray", "make" : "dodge", "sold" : "2016-02-12", "condition": "okay" }
{ "index": {}}
{ "price" : 19000, "color" : "red", "make" : "dodge", "sold" : "2016-02-12", "condition": "good" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "chevrolet", "sold" : "2016-08-15", "condition": "good" }
{ "index": {}}
{ "price" : 13000, "color" : "gray", "make" : "chevrolet", "sold" : "2016-11-20", "condition": "okay" }
{ "index": {}}
{ "price" : 12500, "color" : "gray", "make" : "dodge", "sold" : "2016-03-09", "condition": "okay" }
{ "index": {}}
{ "price" : 35000, "color" : "red", "make" : "dodge", "sold" : "2016-04-10", "condition": "new" }
{ "index": {}}
{ "price" : 28000, "color" : "blue", "make" : "chevrolet", "sold" : "2016-08-15", "condition": "new" }
{ "index": {}}
{ "price" : 30000, "color" : "gray", "make" : "bmw", "sold" : "2016-11-20", "condition": "good" }

By default, how many documents does a search show (e.g. in Kibana Dev Tools)?
10 - that's the default size of the search API.
How do we see more if needed?
Specify size in the query:
GET /vehicles/cars/_search
{
   "size": 20,
   "query": {
     "match_all":{}
   }
}

And how do we paginate, e.g. fetch only 5 documents at a time starting from the first?
GET /vehicles/cars/_search
{
   "from": 0,
   "size": 5,
   "query": {
     "match_all":{}
   }
}

How do we sort documents by a particular field?
GET /vehicles/cars/_search
{
   "from": 0,
   "size": 5,
   "query": {
     "match_all":{}
   },
   "sort": [
     {"price": {"order": "desc"}}
  ]
}


How do we get the count of matching documents?
GET /vehicles/cars/_count
{
   "query": {
     "match":{"make": "toyota"}
   }
}

Aggregation
Aggregations give insight from a high-level overview, and this is done through summarization.
E.g. if we have data for thousands of US employees, we can ask questions like the following using aggregations (see the sketch after this list):
1. Which state has the highest number of employees?
2. What percentage of employees are male vs. female?
3. What is the average salary of employees per state or city?
4. In which year did employees receive the highest bonus, per state?
Aggregations help the business answer these kinds of questions.
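For example, question 3 could be answered with a terms bucket per state and an avg metric inside it (a sketch only - the employees index and its state/salary fields are hypothetical):

GET /employees/_search
{
  "size": 0,
  "aggs": {
    "per_state": {
      "terms": {"field": "state.keyword"},
      "aggs": {
        "avg_salary": {"avg": {"field": "salary"}}
      }
    }
  }
}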

GET /vehicles/cars/_search
{
  "aggs": {
    "popular_cars": {
      "terms": {
        "field": "make.keyword"
        }
   }
 }
}

Why do we use make.keyword?
Because aggregations compute things like counts or averages over exact values, while a text field is analyzed (tokenized). The keyword sub-field keeps the whole field value as a single un-analyzed term, which is what aggregations need.
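With dynamic mapping, a string field ends up mapped roughly like this, which is where the .keyword sub-field comes from - the analyzed text part is used for full-text search, the keyword part for aggregations and sorting:

"make": {
  "type": "text",
  "fields": {
    "keyword": {"type": "keyword", "ignore_above": 256}
  }
}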


GET /vehicles/cars/_search
{
  "aggs": {
    "popular_cars": {
      "terms": {
        "field": "make.keyword"
     },
     "aggs": {
       "avg_price": {
         "avg": {
           "field": "price"
         }
       }
     }
   }
 }
}

Max price / min price per make (the metric aggs are siblings inside one aggs block):
GET /vehicles/cars/_search
{
  "aggs": {
    "popular_cars": {
      "terms": {
        "field": "make.keyword"
      },
      "aggs": {
        "avg_price": {"avg": {"field": "price"}},
        "max_price": {"max": {"field": "price"}},
        "min_price": {"min": {"field": "price"}}
      }
    }
  }
}

We can also add a query alongside the aggs, and set size if we want to limit or hide the hits:
GET /vehicles/cars/_search
{
  "size": 5,
  "query": {
    "match": {"color": "red"}
  },
  "aggs": {
    "popular_cars": {
      "terms": {
        "field": "make.keyword"
      },
      "aggs": {
        "avg_price": {"avg": {"field": "price"}},
        "max_price": {"max": {"field": "price"}},
        "min_price": {"min": {"field": "price"}}
      }
    }
  }
}

If we want to see min/max/avg all at once, we can use the stats aggregation:
GET /vehicles/cars/_search
{
 "size": 0,
 "query":{
   "match": {"color": "red"}
  },
    "aggs": {
      "popular_cars": {
        "terms": {
          "field": "make.keyword"
       },
       "aggs": {
         "stats_on_price": {
           "stats": {
             "field": "price"
           }
         }
       }
     }
   }
}


An aggregation has two parts: a bucket and a metric.
E.g. the query above can be split into these two parts:
bucket
    "aggs": {
      "popular_cars": {
        "terms": {
          "field": "make.keyword"
       },

metric
       "aggs": {
         "stats_on_price": {
           "stats": {
             "field": "price"

So there are two parts here: popular_cars defines the buckets (one per make) and stats_on_price is the metric computed for each bucket.

Create range buckets:
GET /vehicles/cars/_search
{
  "size": 0,
  "query": {
    "match": {"color": "red"}
  },
  "aggs": {
    "popular_cars": {
      "terms": {
        "field": "make.keyword"
      },
      "aggs": {
        "sold_date_ranges": {
          "range": {
            "field": "sold",
            "ranges": [
              {"from": "2016-01-01", "to": "2016-05-18"},
              {"from": "2016-05-18", "to": "2017-05-18"}
            ]
          }
        }
      }
    }
  }
}

Find vehicle counts grouped by vehicle condition:
GET /vehicles/cars/_search
{
 "size": 0,
 "query":{
   "match": {"color": "red"}
  },
    "aggs": {
      "cars_conditions": {
        "terms": {
          "field": "condition.keyword"
       }
     }
   }
}

Find the average price for each vehicle condition:
GET /vehicles/cars/_search
{
 "size": 0,
 "query":{
   "match": {"color": "red"}
  },
    "aggs": {
      "cars_conditions": {
        "terms": {
          "field": "condition.keyword"
       },
       "aggs": {
         "avg_price": {
           "avg": {
             "field": "price"
          }
        }
      }
     }
   }
}


Find min/max price per manufacturer within each vehicle condition:
GET /vehicles/cars/_search
{
  "size": 0,
  "aggs": {
    "cars_conditions": {
      "terms": {
        "field": "condition.keyword"
      },
      "aggs": {
        "avg_price": {"avg": {"field": "price"}},
        "make": {
          "terms": {
            "field": "make.keyword"
          },
          "aggs": {
            "min_price": {"min": {"field": "price"}},
            "max_price": {"max": {"field": "price"}}
          }
        }
      }
    }
  }
}

 

Logstash

Logstash is a pipeline engine used to ingest data from a multitude of sources simultaneously.
You might be getting data from application logs, databases, or other NoSQL data stores;
it doesn't matter - the source of the data can be anything.
And you can send the data not only to Elasticsearch but to other engines as well.
You can ingest data of all shapes, sizes and sources.
 

Logstash Pipeline
The data flows through three parts:
1. Data sources
2. The Logstash pipeline (inputs - filters - outputs)
3. Data destinations



1. Data source - wherever the data is coming from.
2. Logstash:
   a. the input stage receives the data coming from the source;
   b. the filter stage filters out whatever data is not needed;
   c. the output stage prepares the data to be stored at the destination.
3. Data destination - where the data is stored after Logstash processing.
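A minimal config illustrating the three stages (a sketch: it reads lines from stdin, tags each event with an extra field, and prints it to stdout):

input {
    stdin {}
}
filter {
    mutate {
        add_field => { "pipeline" => "demo" }
    }
}
output {
    stdout {
        codec => rubydebug
    }
}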

 

Download Logstash,
go to the Logstash directory,
create a config file,
and then run Logstash:
bin/logstash -f filename.config

Verify the install with an inline pipeline:
bin/logstash -e 'input { stdin {} } output { stdout {} }'
If you get an error here, check your Java version.

Steps to ingest dummy Apache logs using Logstash:
1. Download the dummy logs from this repo:
   https://github.com/elastic/elk-index-size-tests/blob/master/logs.gz
2. Create a folder data/logs on the same location where you have Kibana/Elasticsearch/Logstash,
   and put the log file in data/logs.
3. Create an apache.conf file in the data/ folder (data/apache.conf); this is the Logstash config file.
4. The config:
input
{
    file {
        path => "../../data/logs/logs"
        type => "logs"
        start_position => "beginning"
   }
}    
     
filter
{   
    grok {
       match => {
           "message" => "%{COMBINEDAPACHELOG}"
       }
    }     
    mutate {
        convert => {"bytes" => "integer"}
    }
    date {
       match => [ "timestamp", "dd/MMM/YYYY:HH:mm:ss Z" ]
       locale => en
       remove_field => "timestamp"
    }
    geoip {
        source => "clientip"
    }
    useragent {
        source => "agent"
        target => "useragent"
    }
}
output {
    stdout {
        codec => dots
    }

}
5. Run Logstash, pointing it at the config file: bin/logstash -f filename.config
   
The input can be anything - a database, even another instance of Elasticsearch.
Logstash has many plugins for the filter stage;
in the config above we are using five: grok, mutate, date, geoip and useragent.
grok is a regular-expression matcher that comes with many built-in patterns; in the example above we use the %{COMBINEDAPACHELOG} pattern.

grok patterns GitHub page:
https://github.com/elastic/logstash/blob/v1.4.2/patterns/grok-patterns
It shows what each pattern variable translates to as a regular expression, so you don't need to memorize or write regexes yourself - you can use these variables directly.
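For example, given a raw Apache combined log line, %{COMBINEDAPACHELOG} extracts named fields roughly like these (the line and values are illustrative):

83.149.9.216 - - [17/May/2015:10:05:03 +0000] "GET /presentations/ HTTP/1.1" 200 9085 "http://semicomplete.com/" "Mozilla/5.0 ..."

clientip    -> 83.149.9.216
timestamp   -> 17/May/2015:10:05:03 +0000
verb        -> GET
request     -> /presentations/
httpversion -> 1.1
response    -> 200
bytes       -> 9085
referrer    -> "http://semicomplete.com/"
agent       -> "Mozilla/5.0 ..."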


Logstash parses fields out of the log file with the help of filters. It can also add fields to the document that do not exist in the raw logs - enriched fields that Logstash is able to deduce (e.g. geoip location, parsed user agent).
So Logstash is not just a parser, it's also an enhancer of the data.

Logstash official documentation:
the Logstash reference.
The key areas to check in the documentation are:
input plugins
output plugins
filter plugins

After the Apache log data is loaded into Elasticsearch, you can count the number of documents that exist there:
localhost:9200/logstash-*/_count


How to check the documents on the Kibana dashboard:
go to Management - Elasticsearch - Index Management



The Discover tab assists in exploring the data present in Elasticsearch indices; it provides the ability to query data, filter data, and inspect document structure.

The difference between the Discover and Visualize tabs: Visualize helps in building visualizations; it contains a variety of visualization types, such as bar charts, line charts, maps, and tag clouds.

And after we create a visualization, we can display it on a dashboard.

1. So first we need to define an index pattern, e.g. logstash-*

2. Configure the settings

These above are all the fields in a given document.

Now if you go to Visualize, you can create a visualization.

Suppose we select - Data Table.


After creating the visualization, go to Dashboard, select your visualization, and save.


What is Beats?
Beats are lightweight data shippers.
They move data from the application servers to the Logstash server.

Filebeat
allows shipping logs to Logstash or to other destinations such as Elasticsearch, Kafka, or Redis.

Difference between Logstash and Filebeat:
Logstash is the central location where the logs are stashed, parsed, and sent to Elasticsearch for indexing;
Logstash should run on a dedicated node, while Filebeat runs on the application servers and ships the logs to Logstash from there.
Filebeat is one kind of Beat; there are other types as well, such as Metricbeat (system metrics), Packetbeat (network data), and Heartbeat (checks whether an app is up and running).


Filebeat also uses a backpressure-sensitive protocol when sending data to Logstash or Elasticsearch.

Filebeat installation
1. Click on - Kibana - Logging - Apache logs
2. Getting started - download Filebeat
3. Unzip it and go to the filebeat folder; ls shows filebeat.yml

4. When you open the filebeat.yml file
you will see two main sections:
1. Filebeat inputs
filebeat.inputs:
- type: log
  enabled: false
  paths:
    - /var/log/*.log

2. Filebeat output
#----- Elasticsearch output -----
output.elasticsearch:
  hosts: ["localhost:9200"]
#----- Logstash output -----
#output.logstash:
#  hosts: ["localhost:5044"]

For more documentation you can go to the official website.


Configure Filebeat with Logstash
1. Go to the Logstash config file:
input
{
    beats {
        port => 5044
   }
}    
     
filter
{   
    grok {
       match => {
           "message" => "%{COMBINEDAPACHELOG}"
       }
    }     
    mutate {
        convert => {"bytes" => "integer"}
    }
    date {
       match => [ "timestamp", "dd/MMM/YYYY:HH:mm:ss Z" ]
       locale => en
       remove_field => "timestamp"
    }
    geoip {
        source => "clientip"
    }
    useragent {
        source => "agent"
        target => "useragent"
    }
}
output {
    stdout {
        codec => dots
    }
    elasticsearch {
       hosts => ["http://localhost:9200"]
       index => "%{[@metadata][beat]}-%{[@metadata][version]}-%{+YYYY.MM.dd}"
    }
}

2. Go to the filebeat.yml file and
uncomment the Logstash output section and comment out the Elasticsearch output section:
#----- Elasticsearch output -----
#output.elasticsearch:
#  hosts: ["localhost:9200"]
#----- Logstash output -----
output.logstash:
  hosts: ["localhost:5044"]

3. Run Logstash with the conf file:
bin/logstash -f ~/data/apache.conf

4. Go to the filebeat folder and run the filebeat script:
./filebeat







