Getting started with Elasticsearch in Python


The updated version of this post for Elasticsearch 7.x is available here.

In this post, I am going to discuss Elasticsearch and how you can integrate it with different Python apps.

What is ElasticSearch?

ElasticSearch (ES) is a distributed, highly available, open-source search engine built on top of Apache Lucene. It is written in Java and therefore runs on many platforms. You store unstructured data as JSON documents, which also makes it a kind of NoSQL database. Unlike most other NoSQL databases, though, ES also provides full search engine capabilities and related features.

ElasticSearch Use Cases

You can use ES for many purposes; a couple of them are given below:

  • You are running a website that provides lots of dynamic content, be it an e-commerce site or a blog. By implementing ES you can not only provide a robust search engine for your web app but also native auto-complete features.
  • You can ingest different kinds of log data and then use ES to find trends and statistics.

 

Setting up and Running

The easiest way to install Elasticsearch is to just download it and run the executable. Make sure that you are using Java 8 or greater, as Elasticsearch 6.x requires it.

Once downloaded, unzip it and run the binary:

➜  elasticsearch-6.2.4 bin/elasticsearch

A lot of text will scroll by in the window. If you see a line like the one below, the server is up:

[2018-05-27T17:36:11,744][INFO ][o.e.h.n.Netty4HttpServerTransport] [c6hEGv4] publish_address {127.0.0.1:9200}, bound_addresses {[::1]:9200}, {127.0.0.1:9200}

But since seeing is believing, access http://localhost:9200 in your browser or via cURL, and something like the following should welcome you:

{
  "name" : "c6hEGv4",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "HkRyTYXvSkGvkvHX2Q1-oQ",
  "version" : {
    "number" : "6.2.4",
    "build_hash" : "ccec39f",
    "build_date" : "2018-04-12T20:37:28.497551Z",
    "build_snapshot" : false,
    "lucene_version" : "7.2.1",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}

Now, before I move on to accessing Elasticsearch in Python, let's do some basic stuff. As I mentioned, ES provides a REST API, and we will use it to carry out different tasks.

Basic Examples

The very first thing you have to do is create an index. Everything is stored in an index. The RDBMS equivalent of an index is a database, so don't confuse it with the typical indexing concept you learn in relational databases. I am using Postman to run the REST calls.

Elastic Search - Create Index

If it runs successfully you will see something like below in response:

{
    "acknowledged": true,
    "shards_acknowledged": true,
    "index": "company"
}

So we have created a database with the name company. In other words, we have created an index called company. If you access http://localhost:9200/company from your browser, you will see something like below:

{
  "company": {
    "aliases": {
      
    },
    "mappings": {
      
    },
    "settings": {
      "index": {
        "creation_date": "1527638692850",
        "number_of_shards": "5",
        "number_of_replicas": "1",
        "uuid": "RnT-gXISSxKchyowgjZOkQ",
        "version": {
          "created": "6020499"
        },
        "provided_name": "company"
      }
    }
  }
}

Ignore mappings for now; we will discuss them later. A mapping is essentially the schema of your documents. creation_date is self-explanatory. number_of_shards is the number of partitions that will hold the data of this index. Keeping all of the data on a single disk does not make sense at all: if you are running a cluster of multiple Elastic nodes, the data is split across them. In simple words, if there are 5 shards then the data is spread across 5 shards, and the Elasticsearch cluster can serve requests from any of its nodes.

Replicas are mirrors of your data. If you are familiar with the master-slave concept, this should not be new to you. You can learn more about basic ES concepts here.
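
By the way, you can choose these values yourself when creating an index by sending a settings body with the request. Here is a small sketch using Python's requests library (5 shards and 1 replica happen to be the 6.x defaults you see above):

import requests

# Create the index with explicit shard/replica settings
settings = {'settings': {'number_of_shards': 5, 'number_of_replicas': 1}}
r = requests.put('http://localhost:9200/company', json=settings)
print(r.json())  # {'acknowledged': True, ...}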

The cURL version of creating an index is a one-liner.

➜  elasticsearch-6.2.4 curl -X PUT localhost:9200/company
{"acknowledged":true,"shards_acknowledged":true,"index":"company"}%

You can also perform both index creation and record insertion in a single go. All you have to do is pass your record in JSON format. You can do something like below in Postman:

Elasticsearch - Add a record

Make sure you set the Content-Type header to application/json.

It will create an index, here named company, if it does not exist, and then create a new type, here called employees. A type is the ES version of a table in an RDBMS.

The above request will output the following JSON structure:

{
    "_index": "company",
    "_type": "employees",
    "_id": "1",
    "_version": 1,
    "result": "created",
    "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
    },
    "_seq_no": 0,
    "_primary_term": 1
}

 

You pass /1 as the ID of your record; it is not necessary, though. All it does is set the _id field to 1. You then pass your data in JSON format, which will be inserted as a new record, or document. If you access http://localhost:9200/company/employees/1 from the browser you will see something like below:

{"_index":"company","_type":"employees","_id":"1","_version":1,"found":true,"_source":{
    "name": "Adnan Siddiqi",
    "occupation": "Consultant"
}}

You can see the actual record along with the metadata. If you want, you can change the request to http://localhost:9200/company/employees/1/_source and it will output only the JSON structure of the record itself.

The cURL version would be:

➜  elasticsearch-6.2.4 curl -X POST \
>   http://localhost:9200/company/employees/1 \
>   -H 'content-type: application/json' \
>   -d '{
quote>     "name": "Adnan Siddiqi",
quote>     "occupation": "Consultant"
quote> }'
{"_index":"company","_type":"employees","_id":"1","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":0,"_primary_term":1}%

What if you want to update that record? Well, it's pretty simple. All you have to do is send your changed JSON record to the same endpoint. Something like below:

ElasticSearch - Update Record

and it will generate the following output:

{
    "_index": "company",
    "_type": "employees",
    "_id": "1",
    "_version": 2,
    "result": "updated",
    "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
    },
    "_seq_no": 1,
    "_primary_term": 1
}

Notice the result field, which is now set to updated instead of created.
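
Outside Postman, the update is just a re-send of the document to the same URL. A minimal sketch with Python's requests library:

import requests

# Re-indexing the same ID with a modified body updates the document
doc = {'name': 'Adnan Siddiqi', 'occupation': 'Software Consultant'}
r = requests.post('http://localhost:9200/company/employees/1', json=doc)
print(r.json())  # 'result' is now 'updated' and '_version' is incremented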

And of course, you can delete a certain record too.

ElasticSearch - Delete record
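
Under the hood this is a plain HTTP DELETE on the document URL; for example, with requests:

import requests

# Delete the employees document with ID 1 from the company index
r = requests.delete('http://localhost:9200/company/employees/1')
print(r.json())  # 'result' will be 'deleted'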

And if you are going crazy, or your girlfriend has dumped you, you can burn the entire world by running curl -XDELETE localhost:9200/_all from the command line.

Let's do some basic searching. If you run http://localhost:9200/company/employees/_search?q=adnan, it will search all fields under the type employees and return the relevant records:

{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "company",
        "_type": "employees",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "name": "Adnan Siddiqi",
          "occupation": "Software Consultant"
        }
      }
    ]
  }
}

The max_score field tells how relevant the best match is; it is the highest relevance score among the returned records. With multiple matching records it could well be a different number.

You can also limit your search criteria to a certain field by passing the field name. Therefore, http://localhost:9200/company/employees/_search?q=name:Adnan will search only in the name field of the document. It is the SQL equivalent of SELECT * FROM table WHERE name='Adnan'.
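
The same field-scoped query is easy to fire from Python too, for instance with the requests library:

import requests

# URI search, limited to the name field
r = requests.get('http://localhost:9200/company/employees/_search',
                 params={'q': 'name:Adnan'})
print(r.json()['hits']['hits'])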

I have just covered the basic examples. ES can do lots of things, but I will let you explore the rest via the documentation and switch over to accessing ES in Python.

Accessing ElasticSearch in Python

To be honest, the REST API of ES is good enough that you could use the requests library to perform all of your tasks. Still, you may want to use a Python library for Elasticsearch so you can focus on your main tasks instead of worrying about how to construct requests.

Install it via pip and then you can access it in your Python programs.

pip install elasticsearch

To make sure it's correctly installed, run the following basic snippet from the command line:

➜  elasticsearch-6.2.4 python
Python 3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 12:04:33) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from elasticsearch import Elasticsearch
>>> es = Elasticsearch([{'host': 'localhost', 'port': 9200}])
>>> es
<Elasticsearch([{'host': 'localhost', 'port': 9200}])>
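
The client exposes the same banner JSON we fetched in the browser earlier; calling es.info() is a quick sanity check (output abbreviated):

>>> es.info()
{'name': 'c6hEGv4', 'cluster_name': 'elasticsearch', 'tagline': 'You Know, for Search', ...}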

Web Scraping and Elasticsearch

Let's discuss a small practical use case for Elasticsearch. The objective is to access online recipes and store them in Elasticsearch for search and analytics purposes. We will first scrape data from Allrecipes and store it in ES. We will also create a strict schema, or mapping, so that we can make sure the data is indexed in the correct format and type. I am pulling the salad recipe listing only. Let's begin!

Scraping data

import json
from time import sleep

import requests
from bs4 import BeautifulSoup


def parse(u):
    title = '-'
    submit_by = '-'
    description = '-'
    calories = 0
    ingredients = []
    rec = {}

    try:
        r = requests.get(u, headers=headers)

        if r.status_code == 200:
            html = r.text
            soup = BeautifulSoup(html, 'lxml')
            # title
            title_section = soup.select('.recipe-summary__h1')
            # submitter
            submitter_section = soup.select('.submitter__name')
            # description
            description_section = soup.select('.submitter__description')
            # ingredients
            ingredients_section = soup.select('.recipe-ingred_txt')

            # calories
            calories_section = soup.select('.calorie-count')
            if calories_section:
                calories = calories_section[0].text.replace('cals', '').strip()

            if ingredients_section:
                for ingredient in ingredients_section:
                    ingredient_text = ingredient.text.strip()
                    if 'Add all ingredients to list' not in ingredient_text and ingredient_text != '':
                        ingredients.append({'step': ingredient.text.strip()})

            if description_section:
                description = description_section[0].text.strip().replace('"', '')

            if submitter_section:
                submit_by = submitter_section[0].text.strip()

            if title_section:
                title = title_section[0].text

            rec = {'title': title, 'submitter': submit_by, 'description': description, 'calories': calories,
                   'ingredients': ingredients}
    except Exception as ex:
        print('Exception while parsing')
        print(str(ex))
    finally:
        return json.dumps(rec)


if __name__ == '__main__':
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36',
        'Pragma': 'no-cache'
    }
    url = 'https://www.allrecipes.com/recipes/96/salad/'
    r = requests.get(url, headers=headers)
    if r.status_code == 200:
        html = r.text
        soup = BeautifulSoup(html, 'lxml')
        links = soup.select('.fixed-recipe-card__h3 a')
        for link in links:
            sleep(2)
            result = parse(link['href'])
            print(result)
            print('=================================')

So this is the basic program that pulls the data. Since we need the data in JSON format, I convert each record accordingly before returning it.
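
For reference, a single parsed record comes out of parse() in the following shape; this is the same salad we will query later (description and ingredients truncated):

{"title": "Kale and Feta Salad", "submitter": "Leslie",
 "description": "I've been making variations of this salad for years. ...",
 "calories": "102",
 "ingredients": [{"step": "1 bunch kale, large stems discarded, leaves finely chopped"}, ...]}

Note that calories is still a string at this point; Elasticsearch will happily coerce numeric strings like this when we index against an integer mapping, while a truly non-numeric string fails, as we will see shortly.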

Creating Index

OK, so we have the desired data and we have to store it. The very first thing to do is create an index; let's name it recipes. The type will be called salads. The other thing I am going to do is create a mapping for our document structure.

Before we create the index, we have to connect to the Elasticsearch server.

import logging

from elasticsearch import Elasticsearch


def connect_elasticsearch():
    _es = Elasticsearch([{'host': 'localhost', 'port': 9200}])
    if _es.ping():
        print('Yay Connect')
    else:
        print('Awww it could not connect!')
    return _es


if __name__ == '__main__':
    logging.basicConfig(level=logging.ERROR)

_es.ping() actually pings the server and returns True if it gets connected. It took me a while to figure out how to catch the stack trace; it turned out it was just being logged!

def create_index(es_object, index_name='recipes'):
    created = False
    # index settings
    settings = {
        "settings": {
            "number_of_shards": 1,
            "number_of_replicas": 0
        },
        "mappings": {
            "members": {
                "dynamic": "strict",
                "properties": {
                    "title": {
                        "type": "text"
                    },
                    "submitter": {
                        "type": "text"
                    },
                    "description": {
                        "type": "text"
                    },
                    "calories": {
                        "type": "integer"
                    },
                }
            }
        }
    }

    try:
        if not es_object.indices.exists(index_name):
            # Ignore 400 means to ignore "Index Already Exist" error.
            es_object.indices.create(index=index_name, ignore=400, body=settings)
            print('Created Index')
        created = True
    except Exception as ex:
        print(str(ex))
    finally:
        return created

Lots of things are happening here. First, we passed a config variable that contains the mapping of the entire document structure. Mapping is Elastic's terminology for a schema. Just as we set field data types in tables, we do something similar here. Check the docs; mappings cover more than that. All fields are of type text except calories, which is of type integer.

Next, I make sure that the index does not already exist, and only then create it. The parameter ignore=400 is not required anymore once we check for existence, but if you skip the check, you can use it to suppress the "index already exists" error so the call does not blow up. Be careful, though: silently ignoring errors can hide real problems with your index.
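
Putting the helpers together, creating the index from the main block looks like this (reusing connect_elasticsearch() from above):

if __name__ == '__main__':
    logging.basicConfig(level=logging.ERROR)
    es = connect_elasticsearch()
    if es is not None:
        create_index(es, 'recipes')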

If the index is successfully created, you can verify it by visiting http://localhost:9200/recipes/_mappings and it will print something like below:

{
  "recipes": {
    "mappings": {
      "salads": {
        "dynamic": "strict",
        "properties": {
          "calories": {
            "type": "integer"
          },
          "description": {
            "type": "text"
          },
          "submitter": {
            "type": "text"
          },
          "title": {
            "type": "text"
          }
        }
      }
    }
  }
}

By passing dynamic: strict we are forcing Elasticsearch to do strict checking of any incoming document. Here, salads is the document type; as mentioned earlier, a type is Elasticsearch's answer to RDBMS tables.

Indexing Records

The next step is storing the actual data or document.

def store_record(elastic_object, index_name, record):
    try:
        outcome = elastic_object.index(index=index_name, doc_type='salads', body=record)
        print(outcome)
    except Exception as ex:
        print('Error in indexing data')
        print(str(ex))
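
To wire this up with the scraper, pass each JSON record produced by parse() straight through. A sketch, reusing the link loop from the scraping script above:

es = connect_elasticsearch()
if es is not None and create_index(es, 'recipes'):
    for link in links:  # the recipe links collected by the scraping script
        sleep(2)
        result = parse(link['href'])
        store_record(es, 'recipes', result)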

Run it and you will be welcomed by:

Error in indexing data
TransportError(400, 'strict_dynamic_mapping_exception', 'mapping set to strict, dynamic introduction of [ingredients] within [salads] is not allowed')

Can you guess why this is happening? Since we did not declare ingredients in our mapping, ES did not allow us to store a document containing an ingredients field. Now you know the advantage of assigning a mapping in the first place: you can avoid corrupting your data. Let's change the mapping a bit; it will now look like below:

"mappings": {
            "salads": {
                "dynamic": "strict",
                "properties": {
                    "title": {
                        "type": "text"
                    },
                    "submitter": {
                        "type": "text"
                    },
                    "description": {
                        "type": "text"
                    },
                    "calories": {
                        "type": "integer"
                    },
                    "ingredients": {
                        "type": "nested",
                        "properties": {
                            "step": {"type": "text"}
                        }
                    },
                }
            }
        }

We added ingredients with type nested and then assigned the data type of the internal field; in our case it's text.

The nested data type lets you set the type of nested JSON objects. Run it again and you will be greeted by the following output:

{
  '_index': 'recipes',
  '_type': 'salads',
  '_id': 'OvL7s2MBaBpTDjqIPY4m',
  '_version': 1,
  'result': 'created',
  '_shards': {
    'total': 1,
    'successful': 1,
    'failed': 0
  },
  '_seq_no': 0,
  '_primary_term': 1
}

Since you did not pass an _id at all, ES itself assigned a dynamic ID to the stored document. I use Chrome, and I view the data with the help of a tool called ElasticSearch Toolbox.

ElasticSearch Toolbox

Before we move on, let's send a string in the calories field and see how it goes. Remember that we set it as an integer. Indexing such a document gives the following error:

TransportError(400, 'mapper_parsing_exception', 'failed to parse [calories]')

So now you know the benefit of assigning a mapping to your documents. If you don't assign one, things will still work, as Elasticsearch will infer its own mapping at index time.

Querying Records

Now that the records are indexed, it's time to query them as needed. I am going to write a function called search() which will display the results of our queries.

def search(es_object, index_name, search):
    res = es_object.search(index=index_name, body=search)
    pprint(res)

It is pretty basic: you pass the index name and the search criteria to it, and it pretty-prints the raw response (pprint comes from Python's standard pprint module). Let's try some queries.

if __name__ == '__main__':
    es = connect_elasticsearch()
    if es is not None:
        search_object = {'query': {'match': {'calories': '102'}}}
        search(es, 'recipes', json.dumps(search_object))

The above query will return all records in which calories is equal to 102. In our case the output would be:

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': 'YkTAuGMBzBKRviZYEDdu',
                    '_index': 'recipes',
                    '_score': 1.0,
                    '_source': {'calories': '102',
                                'description': "I've been making variations of "
                                               'this salad for years. I '
                                               'recently learned how to '
                                               'massage the kale and it makes '
                                               'a huge difference. I had a '
                                               'friend ask for my recipe and I '
                                               "realized I don't have one. "
                                               'This is my first attempt at '
                                               'writing a recipe, so please '
                                               'let me know how it works out! '
                                               'I like to change up the '
                                               'ingredients: sometimes a pear '
                                               'instead of an apple, '
                                               'cranberries instead of '
                                               'currants, Parmesan instead of '
                                               'feta, etc. Great as a side '
                                               'dish or by itself the next day '
                                               'for lunch!',
                                'ingredients': [{'step': '1 bunch kale, large '
                                                         'stems discarded, '
                                                         'leaves finely '
                                                         'chopped'},
                                                {'step': '1/2 teaspoon salt'},
                                                {'step': '1 tablespoon apple '
                                                         'cider vinegar'},
                                                {'step': '1 apple, diced'},
                                                {'step': '1/3 cup feta cheese'},
                                                {'step': '1/4 cup currants'},
                                                {'step': '1/4 cup toasted pine '
                                                         'nuts'}],
                                'submitter': 'Leslie',
                                'title': 'Kale and Feta Salad'},
                    '_type': 'salads'}],
          'max_score': 1.0,
          'total': 1},
 'timed_out': False,
 'took': 2}

What if you want to get records in which calories are greater than 20?

search_object = {'_source': ['title'], 'query': {'range': {'calories': {'gte': 20}}}}

You can also specify which columns or fields you want returned. The above query will return all records in which calories are greater than 20, and it will display only the title field under _source.
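
As before, you run it through the search() helper:

search(es, 'recipes', json.dumps(search_object))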

Conclusion

Elasticsearch is a powerful tool that can make your existing or new apps searchable by providing robust features that return the most accurate result set. I have covered just the gist of it. Do read the docs and get acquainted with this powerful tool. The fuzzy search feature in particular is quite awesome. If I get a chance, I will cover the Query DSL in coming posts.

As always, the code is available on GitHub.

Header Image Credit: tryolabs.com

If you like this post then you should subscribe to my blog for future updates.
