Elasticsearch Workshop

2016

Agenda

  • How does a search engine work
  • Elasticsearch
    • What is it
    • Use cases
    • Own experience
    • How to get started
    • Mapping
    • Queries and filters
    • Aggregations
    • Highlighting
  • Workshop tasks

How does a search engine work?

Your document collection is big!
Scan through all the documents every time you search for something?
Pre-process the documents and create an index!

Create an inverted index

Find unique words

Search against the inverted index
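
The steps above (find the unique words, build the index, search against it) can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the tokenizer just lowercases and splits on whitespace, the function names are made up for this sketch, and the sample titles are placeholders. A real search engine does considerably more.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it on whitespace (a deliberately naive analyzer)."""
    return text.lower().split()

def build_inverted_index(docs):
    """Map each unique word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    postings = [index.get(word, set()) for word in tokenize(query)]
    if not postings:
        return set()
    return set.intersection(*postings)

# Placeholder documents for the sketch
docs = {
    1: "The Pragmatic Programmer",
    2: "Clean Code",
    3: "Working Effectively with Legacy Code",
}
index = build_inverted_index(docs)
print(sorted(search(index, "code")))         # [2, 3]
print(sorted(search(index, "legacy code")))  # [3]
```

The point of the index is that a search only looks up the query words; it never scans the documents themselves.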

Sort by relevance

How well each document matches the query
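
One way to picture "how well a document matches" is a simplified TF-IDF score: a word counts for more when it makes up a larger share of the document and when it is rare across the collection. The sketch below is an assumption-laden toy, not the formula Elasticsearch actually uses (Lucene's scoring has more factors), and the function names are invented for illustration.

```python
import math

def tokenize(text):
    """Lowercase and split on whitespace (naive stand-in for real analysis)."""
    return text.lower().split()

def score(query, doc_text, all_docs):
    """Sum a simple TF-IDF weight for each query word found in the document."""
    words = tokenize(doc_text)
    total = 0.0
    for term in tokenize(query):
        tf = words.count(term) / len(words)  # larger share of the doc -> higher
        df = sum(1 for d in all_docs if term in tokenize(d))
        if tf == 0 or df == 0:
            continue
        idf = math.log(len(all_docs) / df) + 1.0  # rarer across docs -> higher
        total += tf * idf
    return total

docs = [
    "The Pragmatic Programmer",
    "Working Effectively with Legacy Code",
    "Clean Code",
]
ranked = sorted(docs, key=lambda d: score("code", d, docs), reverse=True)
for title in ranked:
    print(f"{score('code', title, docs):.3f}  {title}")
```

With this toy formula the short title "Clean Code" ranks above the longer one, because "code" is half of its words.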

Use Cases

What can Elasticsearch be used for?

For Big Data

GitHub uses Elasticsearch to search 20 TB of data, including 1.3 billion files and 130 billion lines of code

Text search

With filtering, aggregations, highlighting, pagination...

Pure Analytics

Count things and summarize your data, even lots of it, often timestamped!

Centralized Logging

Logs > Logstash > Elasticsearch > Kibana

Geolocation

Own experience with Elasticsearch

Alt Mulig Mat

  • Text based searching
  • Structured searching (get all "Dessert" recipes)

How to use Elasticsearch?

Commonly used in addition to another database...

How to get started with Elasticsearch?

It is that easy

  • Download Elasticsearch from www.elastic.co
  • Elasticsearch only requires Java to run

wget https://download.elasticsearch.org/elasticsearch/release/...
tar -zxvf elasticsearch-2.2.0.tar.gz
cd elasticsearch-2.2.0/bin
./elasticsearch
                

Zero configuration

  • Elasticsearch just works
    • No configuration is needed
    • It has sensible default settings

Is Elasticsearch alive?

You can access it at http://localhost:9200 in your web browser, which returns this:


{
   "status":200,
   "name":"Cypher",
   "cluster_name":"elasticsearch",
   "version":{
      "number":"1.5.2",
      "build_hash":"62ff9868b4c8a0c45860bebb259e21980778ab1c",
      "build_timestamp":"2015-04-27T09:21:06Z",
      "build_snapshot":false,
      "lucene_version":"4.10.4"
   },
   "tagline":"You Know, for Search"
}
                

REST API

  • Elasticsearch hides the complexities of Lucene behind a REST API
    • POST (create)
    • GET (read)
    • PUT (update)
    • DELETE (delete)
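
As a rough sketch of how the four verbs line up with document operations, the snippet below builds (but does not send) one request per action using Python's standard library. The base URL assumes a local node on the default port, `build_request` is a hypothetical helper, and the `books/computer/2` path is just an example document address.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # assumption: local node, default port

def build_request(method, path, body=None):
    """Build (but do not send) an HTTP request for the given document path."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if body is not None else {}
    return Request(f"{BASE_URL}/{path}", data=data, headers=headers, method=method)

book = {"name": "Clean Code", "category": "Programming", "price": 14.90}
requests_by_action = {
    "create": build_request("POST", "books/computer/2", book),
    "read": build_request("GET", "books/computer/2"),
    "update": build_request("PUT", "books/computer/2", book),
    "delete": build_request("DELETE", "books/computer/2"),
}
for action, req in requests_by_action.items():
    print(action, req.get_method(), req.full_url)
```

To actually send one of these against a running node, pass it to `urllib.request.urlopen`.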

curl works just fine!

  • An index is like a database
  • A type is like a SQL table

What is stored in Elasticsearch?

JSON documents!


{
   "title": "Elasticsearch Workshop",
   "date": "2016-04-08"
}
                

Let's do an example - A book website

  • We are building a website to find books
  • We have a collection of books
  • We want simple text based searching

How to store the books?

The act of storing data in Elasticsearch is called indexing.


$ curl -X POST localhost:9200/books/computer/1 --data '{
    "name": "The Pragmatic Programmer",
    "category": "Programming",
    "price": 29.90
}'

$ curl -X POST localhost:9200/books/computer/2 --data '{
    "name": "Clean Code",
    "category": "Programming",
    "price": 14.90
}'

$ curl -X POST localhost:9200/books/computer/3 --data '{
    "name": "Working Effectively with Legacy Code",
    "category": "Refactoring",
    "price": 45.50
}'
                

Get


$ curl -X GET localhost:9200/books/computer/1
                

Result:


{
   "_index": "books",
   "_type": "computer",
   "_id": "1",
   "_version": 1,
   "found": true,
   "_source": {
      "name": "The Pragmatic Programmer",
      "category": "Programming",
      "price": 29.9
  }
}
                

Update


$ curl -X PUT localhost:9200/books/computer/1 --data '{
   "name":"The Awesome Programmer"
}'
                

Result:


{
   "_index":"books",
   "_type":"computer",
   "_id":"1",
   "_version":2,
   "created":false
}
               

Delete


$ curl -X DELETE localhost:9200/books/computer/1
                

So far

  • All we have is a NoSQL document store, which is
    • Fast
    • Scalable
    • Easy to use
  • Now to the really cool part, full-text search...

Full-text search

Find all books that contain the word "code"


$ curl -X GET 'localhost:9200/books/computer/_search?q=code'
                

Full-text search - Result

Sorted by relevance!


{
   "took":6,
   "timed_out":false,
   "_shards":{
      "total":5,
      "successful":5,
      "failed":0
   },
   "hits":{
      "total":2,
      "max_score":0.15342641,
      "hits":[
         {
            "_index":"books",
            "_type":"computer",
            "_id":"2",
            "_score":0.15342641,
            "_source":{
               "name":"Clean Code",
               "category":"Programming",
               "price":14.9
            }
         },
         {
            "_index":"books",
            "_type":"computer",
            "_id":"3",
            "_score":0.11506981,
            "_source":{
               "name":"Working Effectively with Legacy Code",
               "category":"Refactoring",
               "price":45.5
            }
         }
      ]
   }
}
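
The interesting part of the response lives under `hits.hits`. As a small sketch, the snippet below parses a trimmed copy of the response above and prints one line per hit. `summarize_hits` is a name made up for this example.

```python
import json

# Trimmed copy of the search response shown above
response_text = """
{
  "hits": {
    "total": 2,
    "max_score": 0.15342641,
    "hits": [
      {"_id": "2", "_score": 0.15342641,
       "_source": {"name": "Clean Code"}},
      {"_id": "3", "_score": 0.11506981,
       "_source": {"name": "Working Effectively with Legacy Code"}}
    ]
  }
}
"""

def summarize_hits(response):
    """Return (id, score, name) for each hit, in the order the server returned them."""
    return [
        (hit["_id"], hit["_score"], hit["_source"]["name"])
        for hit in response["hits"]["hits"]
    ]

response = json.loads(response_text)
rows = summarize_hits(response)
for doc_id, score, name in rows:
    print(f"{score:.4f}  #{doc_id}  {name}")
```

The hits arrive already sorted by `_score`, highest first, so the client does not need to sort them again.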
                

Mapping

What is mapping?

Mapping is used to define how a document, and the fields it contains, are stored and indexed.

This is similar to a database schema.

Mapping example

Define the data types of the document fields


{
  "mappings": {
    "book": {
      "properties": {
        "name": {
          "type": "string"
        },
        "category": {
          "type": "string",
          "index": "not_analyzed"
        },
        "price": {
          "type": "float"
        }
      }
    } 
  }
}
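
The `not_analyzed` setting on `category` matters for exact matching. A rough way to picture it: an analyzed string is broken into lowercase words before it is indexed, while a `not_analyzed` string is indexed as one unchanged term. The sketch below imitates that with a deliberately simplified analyzer (lowercase and split on whitespace); the function names are hypothetical and the real analysis chain does more.

```python
def analyze(value):
    """Simplified stand-in for default string analysis: lowercase, split into words."""
    return value.lower().split()

def index_terms(value, not_analyzed=False):
    """Return the terms that would be stored in the inverted index for one value."""
    return [value] if not_analyzed else analyze(value)

def term_matches(search_term, value, not_analyzed=False):
    """An exact-term lookup: the search term is compared as-is, without analysis."""
    return search_term in index_terms(value, not_analyzed)

print(index_terms("Programming"))                                     # ['programming']
print(index_terms("Programming", not_analyzed=True))                  # ['Programming']
print(term_matches("Programming", "Programming", not_analyzed=True))  # True
print(term_matches("Programming", "Programming"))                     # False
```

This is why an exact-match lookup for "Programming" only finds the documents when the field is left unanalyzed.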
                

Queries and filters

Query DSL

  • Alternative way of building queries
  • Allows us to build queries using JSON

Full-text query

Find the books with a name that contains the word "code"


$ curl -XGET 'localhost:9200/books/book/_search' -d '{
  "query": {
    "match": {
      "name": "code"
    }
  }
}'
                        

Term query

Find books belonging to the "Programming" category


{
  "query": {
    "term": {
      "category": "Programming"
    }
  }
}
                        

Filter

Find books belonging to the "Programming" category, while skipping relevance scoring


{
  "query": {
    "bool": {
      "filter": [
        { "term": { "category": "Programming" } }
      ]
    }
  }
}
                        

Query vs Filter

Query              Filter
Full-text search   Exact match
Relevance scoring  Binary yes/no
Relatively slow    Fast
Not cacheable      Cacheable

Choosing between query and filter

  • Use queries for full-text search, or for cases where you want a relevance score
  • Use filters for everything else

Or combine them: Filter first, then query remaining docs.
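
A combined request might look like the sketch below: a `bool` query with a `match` clause under `must` (scored) and a `term` clause under `filter` (not scored). The helper name `filtered_search` is invented for this example, and the field names are the ones from the book example.

```python
import json

def filtered_search(text, category):
    """Build a bool query: full-text match on name, exact-match filter on category."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"name": text}}],          # contributes to the score
                "filter": [{"term": {"category": category}}]  # yes/no, no scoring
            }
        }
    }

body = filtered_search("code", "Programming")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a `_search` request.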

Aggregations

  • Used to perform analysis on the data
  • Broken into 2 "families"
    • Metric
    • Bucket
Metric  Bucket
Min     Range
Max     Terms
Sum     Histogram
Avg
Stats

Buckets

Range

...
"aggs" : {
  "price_ranges" : {
    "range" : {
      "field" : "price",
      "ranges" : [
        { "to" : 10 },
        { "from" : 10, "to" : 30 },
        { "from" : 30 }
      ]
    }
  }
}
...
									

...
"buckets": {
  "*-10.0": {
    "to": 10,
    "doc_count": 0
  },
  "10.0-30.0": {
    "from": 10,
    "to": 30,
    "doc_count": 2
  },
  "30.0-*": {
    "from": 30,
    "doc_count": 1
  }
}
...
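
To see where these counts come from, here is a small sketch that buckets the three sample prices by hand. It assumes `from` is inclusive and `to` is exclusive, and the helper names are made up for illustration.

```python
prices = [29.90, 14.90, 45.50]  # the three sample books

ranges = [
    {"to": 10},
    {"from": 10, "to": 30},
    {"from": 30},
]

def bucket_key(r):
    """Build a key such as '10.0-30.0', using '*' for an open end."""
    lo = str(float(r["from"])) if "from" in r else "*"
    hi = str(float(r["to"])) if "to" in r else "*"
    return f"{lo}-{hi}"

def range_buckets(values, ranges):
    """Count the values in each range; 'from' is inclusive, 'to' is exclusive."""
    buckets = {}
    for r in ranges:
        lo = r.get("from", float("-inf"))
        hi = r.get("to", float("inf"))
        buckets[bucket_key(r)] = sum(1 for v in values if lo <= v < hi)
    return buckets

print(range_buckets(prices, ranges))
# {'*-10.0': 0, '10.0-30.0': 2, '30.0-*': 1}
```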
									

Buckets

Histogram

...
"aggs" : {
  "prices" : {
    "histogram" : {
      "field" : "price",
      "interval" : 15
    }
  }
}
...
									

...
"prices" : {
  "buckets": [
    {
      "key": 0,
      "doc_count": 1
    },
    {
      "key": 15,
      "doc_count": 1
    },
    {
      "key": 30,
      "doc_count": 0
    },
    {
      "key": 45,
      "doc_count": 1
    }
  ]
}
...
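
The bucket keys follow from rounding each price down to the nearest multiple of the interval. The sketch below reproduces that for the three sample prices. It assumes an integer interval and a non-empty list, and `histogram` is just a name chosen for this example.

```python
import math

def histogram(values, interval):
    """Group values into fixed-width buckets keyed by each bucket's lower bound."""
    counts = {}
    for v in values:
        key = math.floor(v / interval) * interval  # round down to a multiple of interval
        counts[key] = counts.get(key, 0) + 1
    # Include the empty buckets between the smallest and largest key
    lo, hi = min(counts), max(counts)
    return {k: counts.get(k, 0) for k in range(lo, hi + interval, interval)}

print(histogram([29.90, 14.90, 45.50], 15))
# {0: 1, 15: 1, 30: 0, 45: 1}
```

The empty bucket at 30 appears because gaps between the first and last bucket are filled with a count of zero.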
									

Buckets

Terms

...
"aggs" : {
  "categories" : {
    "terms" : {
      "field" : "category"
    }
  }
}
...									

...
"buckets": [
  {
    "key": "programming",
    "doc_count": 2
  },
  {
    "key": "refactoring",
    "doc_count": 1
  }
]
...
										

Metrics

Min

...
"aggs" : {
  "min_price" : {
    "min" : {
     "field" : "price"
    }
  }
}
...
										

...
"aggregations": {
  "min_price": {
    "value": 14.9
  }
}
...
										

Metrics

Avg

...
"aggs" : {
  "avg_price" : {
    "avg" : {
      "field" : "price"
    }
  }
}
...
										

...
"aggregations": {
  "avg_price": {
    "value": 30.099999999999998
  }
}
...
										

Metrics

Stats

...
"aggs" : {
  "price_stats" : {
    "stats" : {
      "field" : "price"
    }
  }
}
...
										

...
"aggregations": {
  "price_stats": {
    "count": 3,
    "min": 14.9,
    "max": 45.5,
    "avg": 30.099999999999998,
    "sum": 90.3
  }
}
...
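
The `stats` aggregation bundles the individual metrics into one response. As a sanity check, the same numbers can be computed by hand for the three sample prices. This is a plain-Python sketch, with `stats` as an illustrative function name.

```python
def stats(values):
    """Compute count, min, max, avg and sum for a non-empty list of numbers."""
    total = sum(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": total / len(values),
        "sum": total,
    }

result = stats([29.90, 14.90, 45.50])
for name, value in result.items():
    print(name, value)
```

Because prices are stored as floating-point numbers, an average such as 30.1 can show up with a small rounding error, as in the response above.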
										

Highlighting


{
  "query": {
    "match": {
      "name": "legacy code"
    }
  },
  "highlight": {
    "fields": {
      "name": {}
    }
  }
}
										

...
"highlight": {
  "name": [
    "Working Effectively with <em>Legacy</em> <em>Code</em>"
  ]
}
...
"highlight": {
  "name": [
    "Clean <em>Code</em>"
  ]
}
...
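
By default, Elasticsearch marks each matching word in a highlighted fragment with `<em>` tags. The idea can be imitated with a regular expression, as in this sketch. It is only an approximation (whole-word, case-insensitive matching) and `highlight` is a hypothetical helper, not part of any Elasticsearch client.

```python
import re

def highlight(text, query, pre="<em>", post="</em>"):
    """Wrap every whole-word, case-insensitive match of a query word in tags."""
    words = [re.escape(w) for w in query.split()]
    if not words:
        return text
    pattern = re.compile(r"\b(" + "|".join(words) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"{pre}{m.group(0)}{post}", text)

print(highlight("Working Effectively with Legacy Code", "legacy code"))
# Working Effectively with <em>Legacy</em> <em>Code</em>
print(highlight("Clean Code", "legacy code"))
# Clean <em>Code</em>
```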
										

Workshop tasks

What will you learn?

19 tasks - learning Query DSL

  • Intro task - match all (task 0)
  • Full-text search (task 1-4)
  • Term queries (task 5-8)
  • Aggregations (task 9-13)
  • Combine full-text search and aggregations (task 14)
  • Sorting (task 15)
  • Highlighting (task 16)
  • Pagination (task 17-18)

List of pizzas

The data used during the workshop is a list of pizzas, indexed with its own mapping

Tasks look like this


Feature: Topic of the task

 // Use https://www.elastic.co/guide/en/...

 Scenario: Description of the task
  Given all pizzas are indexed
  When I make a query
  """
  { todo }
  """
  Then the response should contain
  """
  { subset }
  """
						
  • Your task is to replace the `{ todo }` with the correct query
  • A task passes when the response to your query contains the expected { subset }

Known issues - A query needs to be correctly aligned

Correct


  When I make a query
  """
  {
    ...
  }
  """
							

Wrong


 When I make a query
  """{
    ...
}
  """
							

Known issues - Tabs cause trouble

  • Please use spaces instead of tabs for indentation

Compare against subsets

Total


{
   "workshop": "Elasticsearch",
   "date" : "2016-04-08"
}
							

Subset


{
   "date" : "2016-04-08"
}
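
A subset comparison of this kind can be sketched as a small recursive function. This is only a guess at the idea, not the workshop's actual test code: `is_subset` is a made-up name, and the rule for lists (every expected item must match some actual item) is an assumption.

```python
def is_subset(expected, actual):
    """Return True if everything in `expected` also appears in `actual`, recursively."""
    if isinstance(expected, dict):
        return isinstance(actual, dict) and all(
            key in actual and is_subset(value, actual[key])
            for key, value in expected.items()
        )
    if isinstance(expected, list):
        # Assumption: each expected item must match at least one actual item
        return isinstance(actual, list) and all(
            any(is_subset(e, a) for a in actual) for e in expected
        )
    return expected == actual

total = {"workshop": "Elasticsearch", "date": "2016-04-08"}
print(is_subset({"date": "2016-04-08"}, total))  # True
print(is_subset({"date": "2016-04-09"}, total))  # False
```

Extra fields in the full response are ignored, so a task only needs to list the parts of the response it cares about.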
							

The tasks

  • The tasks can be found under tests_bdd/
  • Use the Chrome extension Sense while writing your queries
    • Better feedback
    • Autocomplete
    • ...
  • When the query is correct, paste it into the task and run it for validation

Running

  • Manual installation
    • Windows: `run-tasks.cmd`
    • Linux: `./run-tasks.sh`
  • Docker
    • make run-tasks