Elasticsearch is a distributed search engine for full-text search and analytics. In some ways it resembles MongoDB, but Elasticsearch stores documents in Lucene index files while MongoDB stores them as BSON, and Elasticsearch's text-search features are much richer.
The reason I want to use Elasticsearch is that more than a decade of my company's translation files have just been sitting on disk as plain files without being put to good use. If a search engine could do full-text search across all of them, they could be used far more effectively.
To learn more about Elasticsearch, see the official reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html.
Elasticsearch is written in Java and exposes its functionality through a REST API as well as a Java API. A local node can be reached at http://127.0.0.1:9200.
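As a quick sanity check (assuming a default local install listening on port 9200), hitting the root endpoint returns basic node and version information:
$ curl -X GET "localhost:9200/?pretty"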
Concepts
- Near Realtime: there is only a very short delay between indexing a document and it becoming searchable
- Node and Cluster: a node is a single running instance of Elasticsearch on a machine, and a group of nodes forms a cluster
- Index: an index is a collection of documents that share a similar structure, for example documents that all have a certain field
- Type: an index used to be subdivided into several types, but since Elasticsearch 6 an index can hold only one type, and types are deprecated from 7.0 onwards
- Document: the basic unit of information, expressed as JSON
- Shards: if an index holds a lot of data, a single node may run out of disk space and read/write performance suffers; splitting the index into shards solves this (see the sketch after this list)
- Replica: a replica is a copy of a shard, guarding against node failures
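The number of shards and replicas is set when an index is created. A minimal sketch (the index name my_index and the counts are placeholders chosen for illustration):
curl -X PUT "localhost:9200/my_index?pretty" -H 'Content-Type: application/json' -d'
{
  "settings" : {
    "number_of_shards" : 3,
    "number_of_replicas" : 2
  }
}
'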
Basic Operations
Add three documents to an index named twitter:
curl -XPUT 'http://localhost:9200/twitter/_doc/1?pretty' -H 'Content-Type: application/json' -d '
{
"user": "kimchy",
"post_date": "2009-11-15T13:12:00",
"message": "Trying out Elasticsearch, so far so good?"
}'
curl -XPUT 'http://localhost:9200/twitter/_doc/2?pretty' -H 'Content-Type: application/json' -d '
{
"user": "kimchy",
"post_date": "2009-11-15T14:12:12",
"message": "Another tweet, will it be indexed?"
}'
curl -XPUT 'http://localhost:9200/twitter/_doc/3?pretty' -H 'Content-Type: application/json' -d '
{
"user": "elastic",
"post_date": "2010-01-15T01:46:38",
"message": "Building the site, should be kewl"
}'
List all indices:
$ curl -X GET "localhost:9200/_cat/indices?v"
Result:
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open twitter FLSTkYHXRIqIQajFzNnVDQ 5 1 3 0 14kb 14kb
Retrieve a single document by ID:
$ curl -XGET 'http://localhost:9200/twitter/_doc/1?pretty=true'
Result:
{
"_index" : "twitter",
"_type" : "_doc",
"_id" : "1",
"_version" : 1,
"_seq_no" : 0,
"_primary_term" : 1,
"found" : true,
"_source" : {
"user" : "kimchy",
"post_date" : "2009-11-15T13:12:00",
"message" : "Trying out Elasticsearch, so far so good?"
}
}
Search
There are two ways to do this.
Using a query string in the URL:
curl -XGET 'http://localhost:9200/twitter/_search?q=user:kimchy&pretty=true'
Or using Elasticsearch's JSON-based Query DSL:
curl -XGET 'http://localhost:9200/twitter/_search?pretty=true' -H 'Content-Type: application/json' -d '
{
"query" : {
"match" : { "user": "kimchy" }
}
}'
Result:
{
"took" : 20,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.6931472,
"hits" : [
{
"_index" : "twitter",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.6931472,
"_source" : {
"user" : "kimchy",
"post_date" : "2009-11-15T14:12:12",
"message" : "Another tweet, will it be indexed?"
}
},
{
"_index" : "twitter",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.2876821,
"_source" : {
"user" : "kimchy",
"post_date" : "2009-11-15T13:12:00",
"message" : "Trying out Elasticsearch, so far so good?"
}
}
]
}
}
Use size to specify how many results to return, and from to specify the offset of the first result:
$ curl -X GET "localhost:9200/_search" -H 'Content-Type: application/json' -d'
{
"from" : 0, "size" : 1,
"query" : {
"term" : { "user" : "kimchy" }
}
}
'
Update
$ curl -X POST "localhost:9200/twitter/_doc/1/_update?pretty" -H 'Content-Type: application/json' -d'
{
"doc": { "user": "Jane Doe" }
}
'
Delete
$ curl -X DELETE "localhost:9200/twitter/_doc/2?pretty"
Check cluster health
$ curl -X GET "localhost:9200/_cat/health?v"
Result:
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1552132035 11:47:15 elasticsearch yellow 1 1 15 15 0 0 15 0 - 50.0%
More Advanced Operations
Highlighting
Search the message field of the documents stored earlier and highlight the matches:
curl -X GET "localhost:9200/_search?pretty=true" -H 'Content-Type: application/json' -d'
{
"query" : {
"match": { "message": "tweet" }
},
"highlight" : {
"fields" : {
"message" : {}
}
}
}
'
Result:
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 15,
"successful" : 15,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.6931472,
"hits" : [
{
"_index" : "twitter",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.6931472,
"_source" : {
"user" : "kimchy",
"post_date" : "2009-11-15T14:12:12",
"message" : "Another tweet, will it be indexed?"
},
"highlight" : {
"message" : [
"Another <em>tweet</em>, will it be indexed?"
]
}
}
]
}
}
If you don't want the contents of _source in the response, add a _source parameter to the request body, like this:
curl -X GET "localhost:9200/_search?pretty=true" -H 'Content-Type: application/json' -d'
{
"query" : {
"match": { "message": "tweet" }
},
"_source":"",
"highlight" : {
"fields" : {
"message" : {}
}
}
}
'
Result:
{
"took" : 13,
"timed_out" : false,
"_shards" : {
"total" : 15,
"successful" : 15,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "twitter",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.2876821,
"_source" : { },
"highlight" : {
"message" : [
"Another <em>tweet</em>, will it be indexed?"
]
}
}
]
}
}
Aggregations
Similar to GROUP BY in SQL: aggregations group the data and let you analyze each group.
Usage:
curl -X GET "localhost:9200/twitter/_search?pretty=true" -H 'Content-Type: application/json' -d'
{
"aggs" : {
"usernames" : {
"terms" : { "field" : "user.keyword" }
}
}
}
'
The name usernames is arbitrary. Here we use a terms aggregation, which groups documents by field value; we aggregate on user.keyword rather than user because the text field is analyzed and not suited to aggregation, while its keyword sub-field holds the exact, unanalyzed value.
Besides the search hits, the response now contains an additional aggregations section. Its buckets are the groups of matching documents, each with some statistics.
{
"took" : 7,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 3,
"max_score" : 1.0,
"hits" : [
{
"_index" : "twitter",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"user" : "kimchy",
"post_date" : "2009-11-15T14:12:12",
"message" : "Another tweet, will it be indexed?"
}
},
{
"_index" : "twitter",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"user" : "kimchy",
"post_date" : "2009-11-15T13:12:00",
"message" : "Trying out Elasticsearch, so far so good?"
}
},
{
"_index" : "twitter",
"_type" : "_doc",
"_id" : "3",
"_score" : 1.0,
"_source" : {
"user" : "elastic",
"post_date" : "2010-01-15T01:46:38",
"message" : "Building the site, should be kewl"
}
}
]
},
"aggregations" : {
"usernames" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "kimchy",
"doc_count" : 2
},
{
"key" : "elastic",
"doc_count" : 1
}
]
}
}
}
Set size to 0 so that no hits are returned:
curl -X GET "localhost:9200/twitter/_search?pretty=true" -H 'Content-Type: application/json' -d'
{
"size":0,
"aggs" : {
"usernames" : {
"terms" : { "field" : "user.keyword" }
}
}
}
'
Result:
{
"took" : 41,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 3,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"usernames" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "kimchy",
"doc_count" : 2
},
{
"key" : "elastic",
"doc_count" : 1
}
]
}
}
}
Chinese Word Segmentation
When searching Chinese text, Elasticsearch by default splits the query into individual characters and matches them one at a time, for example:
Search:
$ curl -X GET "localhost:9200/_search?pretty=true" -H 'Content-Type: application/json' -d'
{
"query" : {
"match": { "message": "名字张三" }
},
"highlight" : {
"fields" : {
"message" : {}
}
}
}
'
Result:
{
"took" : 140,
"timed_out" : false,
"_shards" : {
"total" : 15,
"successful" : 15,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.1507283,
"hits" : [
{
"_index" : "twitter",
"_type" : "_doc",
"_id" : "5",
"_score" : 1.1507283,
"_source" : {
"user" : "kimchy",
"post_date" : "2009-11-15T13:12:00",
"message" : "我是中华人民共和国的成员,我的名字叫张三。"
},
"highlight" : {
"message" : [
"我是中华人民共和国的成员,我的<em>名</em><em>字</em>叫<em>张</em><em>三</em>。"
]
}
}
]
}
}
We can use the ik plugin for proper Chinese word segmentation.
Download and install the ik plugin from https://github.com/medcl/elasticsearch-analysis-ik, then restart Elasticsearch.
Create the index again:
$ curl -XPUT http://localhost:9200/twitter
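Note that this only succeeds if the index does not exist yet. If the twitter index from the earlier examples is still around, delete it first (this removes all of its documents):
$ curl -XDELETE 'http://localhost:9200/twitter?pretty'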
Set the mapping for the documents in the index so that the message field is analyzed with ik:
curl -XPOST http://localhost:9200/twitter/_doc/_mapping -H 'Content-Type:application/json' -d'
{
"properties": {
"message": {
"type": "text",
"analyzer": "ik_max_word",
"search_analyzer": "ik_max_word"
}
}
}'
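To verify that the ik analyzer is picked up, the _analyze API can be run against the index; a quick check (the sample sentence is arbitrary):
curl -X GET "localhost:9200/twitter/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer": "ik_max_word",
  "text": "我是中华人民共和国的成员"
}
'
The response lists the tokens ik produced, which should be whole words rather than single characters.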
Add a document containing Chinese text:
$ curl -XPUT 'http://localhost:9200/twitter/_doc/5?pretty' -H 'Content-Type: application/json' -d '
{
"user": "kimchy",
"post_date": "2009-11-15T13:12:00",
"message": "我是中华人民共和国的成员,我的名字叫张三。"
}'
Running the same search again now returns:
{
"took" : 10,
"timed_out" : false,
"_shards" : {
"total" : 15,
"successful" : 15,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.8630463,
"hits" : [
{
"_index" : "twitter",
"_type" : "_doc",
"_id" : "5",
"_score" : 0.8630463,
"_source" : {
"user" : "kimchy",
"post_date" : "2009-11-15T13:12:00",
"message" : "我是中华人民共和国的成员,我的名字叫张三。"
},
"highlight" : {
"message" : [
"我是中华人民共和国的成员,我的<em>名字</em>叫<em>张三</em>。"
]
}
}
]
}
}
The mapping we modified here defines the properties of each field. It is generated automatically when documents are added, and changing it afterwards is not recommended; if we do need to change it, we have to reindex the data.
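A minimal sketch of that reindexing step with the _reindex API, assuming a hypothetical new index twitter_v2 has already been created with the desired mapping:
curl -X POST "localhost:9200/_reindex?pretty" -H 'Content-Type: application/json' -d'
{
  "source": { "index": "twitter" },
  "dest": { "index": "twitter_v2" }
}
'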
View the updated mapping:
$ curl -XGET 'http://localhost:9200/twitter/_mapping?pretty=true'
{
"twitter" : {
"mappings" : {
"_doc" : {
"properties" : {
"message" : {
"type" : "text",
"analyzer" : "ik_max_word"
},
"post_date" : {
"type" : "date"
},
"user" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
}
Miscellaneous
If disk usage gets too high (above 95%, the flood-stage watermark; see the Elasticsearch documentation on disk-based shard allocation), Elasticsearch puts indices into a read-only state and requests fail with the following error:
{
"error" : {
"root_cause" : [
{
"type" : "cluster_block_exception",
"reason" : "blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"
}
],
"type" : "cluster_block_exception",
"reason" : "blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"
},
"status" : 403
}
The read-only block can then be removed with:
curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'
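Clearing the block does not free any disk space by itself, so it also helps to check how full each node actually is; _cat/allocation shows per-node disk usage:
$ curl -X GET "localhost:9200/_cat/allocation?v"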
Product Stack
Besides Elasticsearch, Elastic also develops a series of related products:
- Kibana: a front-end console for managing and visualizing data
- Logstash: collects and processes log files and ships them to Elasticsearch
- Beats: lightweight agents that collect log, network, and monitoring data from servers
- Elastic Cloud: a one-stop SaaS subscription service