Mappings in ES fall into two categories: dynamic mappings and static (explicit) mappings.
Dynamic mapping:
In a relational database you must first create the database, then create tables inside it, defining columns, types, lengths, primary keys, and so on, before any rows can be inserted. Elasticsearch, by contrast, does not require a mapping (the equivalent of a relational table schema) to be defined up front: when a document is written to Elasticsearch, the type of each field is inferred automatically from its value. This mechanism is called dynamic mapping.
The dynamic mapping rules are as follows (JSON value -> inferred field type):
null -> no field is added
true / false -> boolean
integer -> long
floating-point number -> float
object -> object
array -> determined by the first non-null element
string -> date if it passes date detection, otherwise text with a keyword sub-field
Static mapping
Alternatively, the mapping can be defined in Elasticsearch ahead of time, specifying each field's type, analyzer, and so on. This approach is called static (explicit) mapping.
Mapping operations in ES
Dynamic mapping
Delete the previously created index
DELETE /es_db
Create the index
PUT /es_db
Create a document (ES infers the mapping automatically from the field values)
PUT /es_db/_doc/1
{
"name": "Jack",
"sex": 1,
"age": 25,
"remark": "java入门至精通",
"address": "广州小蛮腰"
}
Retrieve the mapping
GET /es_db/_mapping
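The response should look roughly like the following sketch (for a 7.x cluster; exact output may vary by version). Note how, per the rules above, the strings were mapped to text with a keyword sub-field and the integers to long:
{
  "es_db" : {
    "mappings" : {
      "properties" : {
        "address" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "age" : {
          "type" : "long"
        },
        "name" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "remark" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "sex" : {
          "type" : "long"
        }
      }
    }
  }
}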
Create a static mapping (if es_db still exists from the previous step, delete it first, since PUT on an existing index fails)
PUT /es_db
{
"mappings": {
"properties": {
"name": {
"type": "keyword",
"index": true,
"store": true
},
"sex": {
"type": "integer",
"index": true,
"store": true
},
"age": {
"type": "integer",
"index": true,
"store": true
},
"remark": {
"type": "text",
"index": true,
"store": true
},
"address": {
"type": "text",
"index": true,
"store": true
}
}
}
}
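As before, GET /es_db/_mapping returns the mapping, which this time is exactly what was defined above rather than inferred.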
Specifying the IK analyzer for text fields in a static mapping
Define a document mapping that uses the IK analyzer.
First delete the previously created es_db,
then recreate es_db:
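DELETE /es_db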
Define the mapping with ik_smart:
PUT /es_db
{
"mappings": {
"properties": {
"name": {
"type": "keyword",
"index": true,
"store": true
},
"sex": {
"type": "integer",
"index": true,
"store": true
},
"age": {
"type": "integer",
"index": true,
"store": true
},
"remark": {
"type": "text",
"index": true,
"store": true,
"analyzer": "ik_smart",
"search_analyzer": "ik_smart"
},
"address": {
"type": "text",
"index": true,
"store": true
}
}
}
}
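To confirm the analyzer is applied, you can run a quick tokenization test with the _analyze API (this assumes the IK analyzer plugin is installed on the cluster; without it, the index creation above would already have failed):
GET /es_db/_analyze
{
  "analyzer": "ik_smart",
  "text": "java入门至精通"
}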
Modifying an existing mapping
Elasticsearch does not allow the type of an existing field to be changed in place, so the steps are:
1) Build a new index with the desired static mapping.
2) Import the data from the old index into the new one.
3) Delete the original index.
4) Give the new index an alias equal to the original index name, as shown below.
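For example, first create the new index with the corrected mapping. This is a minimal sketch: the field definitions here are illustrative, so adjust them to whatever actually needs to change (es_db_index is the destination index name used by the _reindex call below):
PUT /es_db_index
{
  "mappings": {
    "properties": {
      "name":    { "type": "keyword" },
      "sex":     { "type": "integer" },
      "age":     { "type": "integer" },
      "remark":  { "type": "text", "analyzer": "ik_smart" },
      "address": { "type": "text" }
    }
  }
}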
POST _reindex
{
"source": {
"index": "es_db"
},
"dest": {
"index": "es_db_index"
}
}
DELETE /es_db
PUT /es_db_index/_alias/es_db
Note: these steps achieve a smooth, zero-downtime transition; thanks to the alias, clients that query es_db are now transparently served by es_db_index.
Core field types in ES
Strings: string, the string family, consists of text and keyword.
text: used to index long-form text. The text is analyzed (split into terms) before indexing, so ES can search on the individual words; text fields cannot be used for sorting or aggregations.
keyword: not analyzed. It can be used for exact-match filtering, sorting, and aggregations, but it does not support the tokenized fuzzy matching that text provides.
Numeric: long, integer, short, byte, double, float
Date: date
Boolean: boolean
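A minimal sketch covering the numeric, date, and boolean types (the index name type_demo and its fields are purely illustrative):
PUT /type_demo
{
  "mappings": {
    "properties": {
      "title":   { "type": "keyword" },
      "price":   { "type": "double" },
      "stock":   { "type": "integer" },
      "on_sale": { "type": "boolean" },
      "created": { "type": "date" }
    }
  }
}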
Differences between the keyword and text mapping types
With the book field mapped as keyword (exact-match queries only, no analyzed queries; supports aggregations and sorting):
POST /es_db/_search
{
"query": {
"term": {
"book": "elasticSearch入门至精通"
}
}
}
With the book field mapped as text (supports analyzed, fuzzy-style queries; no aggregations or sorting):
POST /es_db/_search
{
"query": {
"match": {
"book": "elasticSearch入门至精通"
}
}
}
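In practice you often want both behaviors on the same field. The standard approach is a multi-field: map book as text with a keyword sub-field, then run match queries against book, and term queries, sorting, or aggregations against book.keyword. This is exactly the shape dynamic mapping produces for strings by default:
"book": {
  "type": "text",
  "fields": {
    "keyword": { "type": "keyword", "ignore_above": 256 }
  }
}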
How ES computes relevance scores
The relevance score measures how closely the text in an index matches the search text. Classically this is the term frequency / inverse document frequency (TF/IDF) family of algorithms; since version 5.0, Elasticsearch's default similarity is BM25, a refinement of TF/IDF, which is what the _explain output later in this section shows.
Term frequency: how many times each term of the search text occurs in the field's text; the more occurrences, the more relevant the document.
PUT /score/_doc/1
{
"doc":"hello you, and world is very good"
}
PUT /score/_doc/2
{
"doc":"hello, how are you"
}
GET /score/_search
{
"query": {
"match": {
"doc": "hello world"
}
}
}
The query returns:
{
"took" : 8,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.56501156,
"hits" : [
{
"_index" : "score",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.56501156,
"_source" : {
"doc" : "hello you, and world is very good"
}
},
{
"_index" : "score",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.1546153,
"_source" : {
"doc" : "hello, how are you"
}
}
]
}
}
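Here doc 1 scores higher (0.565 vs. 0.155) because it contains both query terms, hello and world, while doc 2 contains only hello.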
Inverse document frequency: how many documents across the whole index contain each term of the search text; the more documents a term appears in, the less relevant that term is.
Field-length norm: the longer the field, the weaker the relevance.
doc1:{ "title": "hello article", "content": "...... N words" }
doc2:{ "title": "my article", "content": "...... N words, hi world" }
hello and world each occur the same number of times across the index, but doc1 is more relevant because its matching field, title, is shorter.
Analyzing how a document's _score is computed
GET /es_db/_explain/1
{
"query": {
"match": {
"remark": "java developer"
}
}
}
{
"_index" : "es_db",
"_type" : "_doc",
"_id" : "1",
"matched" : true,
"explanation" : {
"value" : 1.4691012,
"description" : "sum of:",
"details" : [
{
"value" : 0.5598161,
"description" : "weight(remark:java in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.5598161,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.5389965,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 3,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 5,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.472103,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 2.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 2.2,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
},
{
"value" : 0.90928507,
"description" : "weight(remark:developer in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.90928507,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.87546873,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 2,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 5,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.472103,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 2.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 2.2,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
}
}
Vector space model (vector-based scoring)
How multiple terms combine into one overall score for a doc
hello world --> based on how hello and world score across all docs, ES computes a query vector.
Say the term hello, scored over all docs, gets a weight of 3,
and the term world, scored over all docs, gets a weight of 6.
The query vector is then [3, 6].
Doc vectors: take 3 docs, one containing only hello, one only world, and one both:
doc1: contains hello --> [3, 0]
doc2: contains world --> [0, 6]
doc3: contains hello and world --> [3, 6]
For each doc, every query term gets a score (one for hello, one for world), and these per-term scores form that doc's vector. Plot each doc vector together with the query vector and take the angle between them: the doc's total score for the multi-term query is derived from that angle. The larger the angle, the lower the score; the smaller the angle, the higher the score (i.e., cosine similarity).
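A worked example with the vectors above, using cosine similarity cos θ = (d · q) / (|d| × |q|), where q = [3, 6] and |q| = √45 ≈ 6.71:
doc1 = [3, 0]: cos θ = 9 / (3 × 6.71) ≈ 0.447
doc2 = [0, 6]: cos θ = 36 / (6 × 6.71) ≈ 0.894
doc3 = [3, 6]: cos θ = 45 / (6.71 × 6.71) = 1.0
So doc3, with the smallest angle, ranks first, and doc2 outranks doc1 because world carries the higher term weight.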
With more than two terms, the same computation is carried out with linear algebra in a higher-dimensional space and can no longer be drawn on a 2D chart.