```python
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure()
# fig.gca(projection='3d') was removed in newer Matplotlib; use add_subplot
ax = fig.add_subplot(projection='3d')

X = np.arange(-5, 5, 0.25)
Y = np.arange(-5, 5, 0.25)
X, Y = np.meshgrid(X, Y)
Z = X**2 + Y**2

ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap=cm.coolwarm,
                linewidth=0, antialiased=False)
plt.show()
```
Lately, here at Tryolabs, we have been gaining interest in big data and search-related platforms, which give us excellent resources to build our complex web applications. One of them is Elasticsearch.
Elastic{ON}15, the first ES conference, is coming up, and since we see a lot of interest in this technology nowadays, we are taking the opportunity to give an introduction and a simple example for the Python developers out there who want to start using it or give it a try.
### 1. What is Elasticsearch?
Elasticsearch is a distributed, real-time search and analytics platform.
### 2. Yeah, but what IS Elasticsearch?
Good question! The previous definition is full of hype-sounding tech terms (distributed, real-time, analytics), so let's unpack them. ES is distributed: it organizes information in clusters of nodes, so it can run on multiple servers if we need it to. ES is real-time: since data is indexed as it arrives, we get responses to our queries super fast! And last but not least, it does searches and analytics: the main problem we are solving with this tool is exploring our data. A platform like ES is the foundation of any respectable search engine.
### 3. How does it work?
Elasticsearch saves data and indexes it automatically through a RESTful API. It assigns a type to each field, so searches can be done smartly and quickly using filters and different kinds of queries. It runs on the JVM in order to be as fast as possible. It splits indexes into "shards" of data and replicates those shards across different nodes, so it is distributed and clusters can keep functioning even when some nodes are down. Adding nodes is super easy, and that is what makes it so scalable.

ES uses Lucene to resolve searches. This is quite an advantage compared with, for example, Django query strings. A RESTful API call lets us perform searches using JSON objects as parameters, which is much more flexible and lets us give each search parameter within the object a different weight, importance, or priority. The final result ranks the objects that comply with the search query requirements. You can even use synonyms, autocompletion, spelling suggestions, and typo correction. While the usual query strings provide results that follow fixed logic rules, ES queries give you a ranked list of results that may match different criteria, ordered by how well they comply with a given rule or filter.

ES can also answer data-analysis questions, such as averages, counts of unique terms, and other statistics. This is done with aggregations. To dig a little deeper into this feature, check the documentation here.
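As a hedged sketch of what an aggregations request looks like, the body below asks for the average of a numeric field and the number of unique terms in another; the "sw" index matches the example later in this post, but the "height" and "eye_color" field names and the aggregation names are illustrative assumptions:

```python
import json

# "size": 0 asks ES to skip the matching documents themselves and
# return only the aggregation results.
agg_body = {
    "size": 0,
    "aggs": {
        "avg_height": {"avg": {"field": "height"}},                    # average of a numeric field
        "unique_eye_colors": {"cardinality": {"field": "eye_color"}},  # count of unique terms
    },
}

print(json.dumps(agg_body, indent=2))
```

With the elasticsearch-py client, a body like this would be sent with `es.search(index="sw", body=agg_body)`.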
### 4. Should I use ES?
The main selling points are scalability and getting results and insights very fast. In many cases, plain Lucene could be enough to cover your needs. Tools like these can seem designed for projects with tons of data, distributed so they can handle tons of users. Startups dream of growing into that scenario, but they may start small to build a prototype and, once the data is there, begin thinking about scaling problems. Does it make sense, and does it pay off, to be prepared to grow A LOT? Why not? Elasticsearch has no real drawbacks and is easy to use, so adopting it is simply a decision to be prepared for the future. I'm going to give you a quick example of a dead-simple project that uses Elasticsearch to quickly and beautifully search some example data. It will be quick to build, Python-powered, and ready to scale if we ever need it to: the best of both worlds.
### 5. Easy first steps with ES
For the following part, it helps to be familiar with concepts like cluster, node, document, and index. Take a look at the official guide if you have doubts. First things first: get ES from here. I followed this video tutorial to get things started in just a minute, and I recommend you all check it out later. Once you have downloaded ES, it's as simple as running bin/elasticsearch, and you will have your ES cluster, with one node, up and running! You can interact with it at http://localhost:9200/. If you hit that URL you will get something like this:
Creating another node is as simple as:

```
bin/elasticsearch -Des.node.name=Node-2
```

It automatically detects the existing node as its master and joins our cluster. By default we can talk to this new node on port 9201, at http://localhost:9201. Now we can query either node and receive the same data, since they are supposed to hold identical copies.
### 6. Let's Pythonize this thing!
To use ES with our all-time favorite language, Python, it gets easier if we install the elasticsearch-py package: pip install elasticsearch. Now we can use this package to index and search data from Python.
### 7. Let's add some public data to our cluster
So, I wanted to make this project a "real world example". I really did, but after I found out there is a Star Wars API (http://swapi.co/), I couldn't resist, and it ended up being a fictional, "galaxy far, far away" example. The API is dead simple to use, so we will get some data from there. I'm using an IPython Notebook for this test; I started with the sample request to make sure we can hit the ES server.
```python
# make sure ES is up and running
import requests

res = requests.get('http://localhost:9200')
print(res.content)
```
Then we connect to our ES server using Python and the elasticsearch-py library:

```python
# connect to our cluster
from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])
```
I added some data to test, and then deleted it. I'm skipping that part in this guide, but you can check it out in the notebook. Now, using The Force, we connect to the Star Wars API and index some fictional people:

```python
# let's iterate over swapi people documents and index them
import json

i = 1
while True:
    r = requests.get('http://swapi.co/api/people/' + str(i))
    if r.status_code != 200:
        break  # no more people: stop before indexing an error body
    es.index(index='sw', doc_type='people', id=i, body=json.loads(r.content))
    i = i + 1
print(i)
```

Notice that we automatically created the index "sw" and the doc type "people" with the indexing command. We get responses for the first 16 people from swapi and index them with ES. I'm sure there are many more "people" in the swapi DB, but it seems we get a 404 for http://swapi.co/api/people/17. Bug report here! :-) Anyway, to check that everything worked with these few results, we try to get the document with id=5:

```python
es.get(index='sw', doc_type='people', id=5)
```

We will get Princess Leia:
Now, let's add more data, this time using node 2! And let's start at the 18th person, where we stopped.
```python
# this time, send the documents through node 2
es2 = Elasticsearch([{'host': 'localhost', 'port': 9201}])

i = 18
while True:
    r = requests.get('http://swapi.co/api/people/' + str(i))
    if r.status_code != 200:
        break
    es2.index(index='sw', doc_type='people', id=i, body=json.loads(r.content))
    i = i + 1
```
We got the rest of the characters just fine.
### 8. Now, let's try an interesting search
Where is Darth Vader? Here is our search query:

```python
es.search(index="sw", body={"query": {"match": {"name": "Darth Vader"}}})
```

This gives us both Darth Vader AND Darth Maul, ids 4 and 44 (notice that they are in the same index, even though we used different nodes' clients to run the index commands). Both results come with a score, and Darth Vader's is much higher than Darth Maul's (2.77 vs. 0.60), since Vader is an exact match. Take that, Darth Maul!
So, this query gives us results when a word matches exactly in our indexed data. What if we want to build some kind of autocomplete input, where we get the names that contain the characters we are typing? There are many ways to do that, and a great number of other queries as well; take a look here to learn more. I picked this one to get all documents with the prefix "lu" in their name field:

```python
es.search(index="sw", body={"query": {"prefix": {"name": "lu"}}})
```

We get Luke Skywalker and Luminara Unduli, both with the same 1.0 score, since they both match the same two initial characters.
There are many other interesting queries we can run. If, for example, we want to get all elements that are similar in some way, for a related-results or correction search, we can use something like this:

```python
es.search(index="sw", body={"query": {"fuzzy_like_this_field": {"name": {"like_text": "jaba", "max_query_terms": 5}}}})
```

And we get Jabba even though we had a typo in our search query. That is powerful!
This was just a simple overview of how to set up your Elasticsearch server and start working with some data using Python. The code used here is publicly available in this IPython notebook. We encourage you to learn more about ES, and especially to take a look at the Elastic stack, where you can build beautiful analytics and insights with Kibana and go through logs using Logstash. In upcoming posts we will cover more advanced ES features, and we will extend this simple test to show a more interesting Django app powered by this data and by ES. Hope this post was useful for developers trying to enter the ES world. At Tryolabs we are official Elastic partners. If you want to talk about Elasticsearch, ELK, applications, and possible projects using these technologies, drop us a line at hello@tryolabs.com (or fill out this form) and we will be glad to connect!
Clients SHOULD NOT include a Referer header field in a (non-secure) HTTP request if the referring page was transferred with a secure protocol. This is for security reasons: when a non-encrypted page is visited from an encrypted page, the client does not send the Referer header. IE has always implemented it this way, and Firefox is no exception. Navigation from one encrypted page to another encrypted page is not affected.
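The rule above can be sketched as a small decision helper (the function name is ours, for illustration only, not part of any standard library or spec):

```python
from urllib.parse import urlparse

def should_send_referer(referring_url, target_url):
    """Suppress the Referer header only when going from an HTTPS page
    to a plain-HTTP page; HTTP->HTTP, HTTP->HTTPS, and HTTPS->HTTPS
    navigations all keep it."""
    from_secure = urlparse(referring_url).scheme == "https"
    to_secure = urlparse(target_url).scheme == "https"
    return not (from_secure and not to_secure)

print(should_send_referer("https://bank.example/account", "http://blog.example/"))  # False: secure -> non-secure
print(should_send_referer("https://a.example/", "https://b.example/"))              # True: secure -> secure
print(should_send_referer("http://a.example/", "http://b.example/"))                # True: non-secure -> non-secure
```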
Receive store-change messages and update in near real time. Set a 5-minute expiration time on every POI cache entry; after it expires, reload the data from the DB and write it back to the cache. This second strategy is a strong complement to the first one: it covers the cases where the first strategy fails, such as manual DB changes made without sending a message, or temporary errors in the message-consuming update program. Together, this double-insurance mechanism effectively guarantees the reliability and freshness of the POI cache data.

Will the cache fill up, and what do we do when it does? For any cache service, in theory, as the cached data keeps growing under a limited capacity, the cache is bound to fill up one day. How do we respond?

① Choose a suitable eviction algorithm for the cache service, such as the common LRU.
② Set a sensible warning threshold relative to the configured capacity; for example, with a 10 GB cache, start alerting when cached data reaches 8 GB, so problems can be investigated or capacity expanded ahead of time.
③ For keys that do not need to be kept long-term, set expiration times whenever possible.
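A minimal sketch of the combined strategy, per-entry TTL plus LRU eviction when full; the class and key names are illustrative, the capacity is tiny for demonstration, and a real deployment would use a cache service such as Redis rather than an in-process dict:

```python
import time
from collections import OrderedDict

class TTLCache:
    """Entries expire after `ttl` seconds; when the cache is full,
    the least-recently-used entry is evicted."""

    def __init__(self, capacity, ttl):
        self.capacity = capacity
        self.ttl = ttl
        self._data = OrderedDict()  # key -> (value, expires_at), oldest first

    def get(self, key, loader=None):
        item = self._data.get(key)
        if item is not None:
            value, expires_at = item
            if time.time() < expires_at:
                self._data.move_to_end(key)  # mark as recently used
                return value
            del self._data[key]              # expired: drop the stale entry
        if loader is None:
            return None
        value = loader(key)                  # e.g. reload from the DB
        self.set(key, value)
        return value

    def set(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        elif len(self._data) >= self.capacity:
            self._data.popitem(last=False)   # evict the LRU entry
        self._data[key] = (value, time.time() + self.ttl)

cache = TTLCache(capacity=2, ttl=300)
cache.set("poi:1", "Store A")
cache.set("poi:2", "Store B")
cache.get("poi:1")             # touch poi:1 so it is recently used
cache.set("poi:3", "Store C")  # cache full: evicts poi:2, the LRU entry
print(cache.get("poi:2"))      # None -- evicted
print(cache.get("poi:1"))      # Store A
```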
Improve performance by saving the overhead of thread creation and destruction. Provide rate limiting: give the thread pool a fixed capacity, and once that capacity is reached, additional tasks go into a queue and wait, which keeps processing stable even when the machine is under extreme pressure. When using the JDK's built-in thread pool, make sure you thoroughly understand the meaning of each constructor parameter, such as core pool size, max pool size, keepAliveTime, and the worker queue, and on that basis tune these values through repeated testing until you reach the best results.

If a single machine's processing power is not enough, you need a multi-machine, multi-threaded approach, which requires some distributed-systems knowledge. First, you must introduce a dedicated node to act as the scheduler, with the other machines acting as executor nodes. The scheduler is responsible for splitting tasks and dispatching them to suitable executor nodes; the executors run the tasks in a multi-threaded (or possibly single-threaded) fashion. At this point the task system evolves from a single machine into a cluster, where different nodes play different roles, each with its own duties, and the nodes interact with one another. Besides mechanisms like multi-threading and thread pools, network-communication mechanisms such as RPC and heartbeats become indispensable. I will publish a simple distributed scheduling framework in a follow-up post.
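The thread-pool idea above is not JDK-specific; as a rough Python sketch (handle_task is a stand-in for real work), concurrent.futures gives the same reuse-and-queue behavior:

```python
from concurrent.futures import ThreadPoolExecutor

def handle_task(n):
    # stand-in for real work (parsing, I/O, computation, ...)
    return n * n

# max_workers plays roughly the role of the JDK pool size: the executor
# keeps a fixed set of reusable worker threads, and tasks submitted
# beyond that count wait in its internal queue instead of spawning
# new threads.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(handle_task, range(10)))  # map preserves input order

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```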