ES Index Field Mapping Types and ES's Underlying Scoring Logic

Published: 2022-12-29

Mappings in ES fall into two kinds: dynamic mappings and static mappings.

Dynamic mapping

In a relational database you must first create the database, then create tables within it (defining columns, types, lengths, primary keys, and so on), and only then can you insert rows. Elasticsearch needs no such up-front Mapping definition (the equivalent of a relational table schema): when a document is written, ES infers each field's type from its value and builds the mapping automatically. This mechanism is called dynamic mapping.
The main dynamic mapping rules are:

- null → no field is added
- true / false → boolean
- integer → long
- floating-point number → float
- string that looks like a date → date (when date detection is enabled)
- other string → text, with a keyword sub-field
- object → object
- array → type of the first non-null element

Static mapping

A static mapping is defined in Elasticsearch ahead of time, specifying each document field's type, analyzer, and so on, before any documents are indexed.

Mapping operations in ES

Dynamic mapping

Delete the previously created index
DELETE /es_db
Create the index
PUT /es_db

Create a document (ES infers the mapping automatically from each field's data type)

PUT /es_db/_doc/1
{
  "name": "Jack",
  "sex": 1,
  "age": 25,
  "remark": "java入门至精通",
  "address": "广州小蛮腰"
}

View the generated mapping

GET /es_db/_mapping
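For reference, on a 7.x cluster the response to this request typically looks like the sketch below (exact details vary by version and index settings): integers are inferred as long, and strings as text with a keyword sub-field.

```json
{
  "es_db": {
    "mappings": {
      "properties": {
        "address": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
        "age":     { "type": "long" },
        "name":    { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
        "remark":  { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
        "sex":     { "type": "long" }
      }
    }
  }
}
```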

Create a static mapping (delete es_db first if it still exists, since PUT on an existing index fails)

PUT /es_db
{
  "mappings": {
    "properties": {
      "name": {
        "type": "keyword",
        "index": true,
        "store": true
      },
      "sex": {
        "type": "integer",
        "index": true,
        "store": true
      },
      "age": {
        "type": "integer",
        "index": true,
        "store": true
      },
      "remark": {
        "type": "text",
        "index": true,
        "store": true
      },
      "address": {
        "type": "text",
        "index": true,
        "store": true
      }
    }
  }
}

Specifying the ik analyzer for text fields in a static mapping

To set the ik analyzer in the document mapping:
first delete the existing es_db,
then recreate es_db,
defining a mapping that uses ik_smart:

PUT /es_db
{
  "mappings": {
    "properties": {
      "name": {
        "type": "keyword",
        "index": true,
        "store": true
      },
      "sex": {
        "type": "integer",
        "index": true,
        "store": true
      },
      "age": {
        "type": "integer",
        "index": true,
        "store": true
      },
      "remark": {
        "type": "text",
        "index": true,
        "store": true,
        "analyzer": "ik_smart",
        "search_analyzer": "ik_smart"
      },
      "address": {
        "type": "text",
        "index": true,
        "store": true
      }
    }
  }
}
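To check that the ik plugin is installed and see how ik_smart segments a piece of Chinese text, the _analyze API can be used (the exact tokens produced depend on the ik plugin and dictionary version):

```
GET /_analyze
{
  "analyzer": "ik_smart",
  "text": "java入门至精通"
}
```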

Modifying an existing mapping

The type of an existing field cannot be changed in place. The procedure:
1) To replace the current mapping, create a new index with the desired static mapping
2) Reindex the data from the old index into the new one
3) Delete the original index
4) Give the new index an alias equal to the original index name

POST _reindex
{
  "source": {
    "index": "es_db"
  },
  "dest": {
    "index": "es_db_index"
  }
}

DELETE /es_db

PUT /es_db_index/_alias/es_db

Note: these steps migrate the index to the new mapping smoothly, with essentially zero downtime.
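One refinement: between the DELETE and the alias creation there is a brief window in which the name es_db does not resolve. The _aliases API can perform both steps in a single atomic request instead (remove_index is a standard alias action on recent versions):

```
POST /_aliases
{
  "actions": [
    { "remove_index": { "index": "es_db" } },
    { "add": { "index": "es_db_index", "alias": "es_db" } }
  ]
}
```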

ES core types

String: string, which is split into text and keyword.
text: used to index long text; the text is analyzed (split into terms) before indexing, and ES can then search on those terms. text fields cannot be used for sorting or aggregations.
keyword: not analyzed; it can be used for exact filtering, sorting, and aggregations, but it does not support the analyzed, term-by-term fuzzy search that text does.
Numeric: long, integer, short, byte, double, float
Date: date
Boolean: boolean

Differences between the keyword and text mapping types

With the book field mapped as keyword (exact-term queries only, no analyzed search; sorting and aggregations are supported):

POST /es_db/_search
{
  "query": {
    "term": {
      "book": "elasticSearch入门至精通"
    }
  }
}

With the book field mapped as text (analyzed and fuzzy queries work; sorting and aggregations do not):

POST /es_db/_search
{
  "query": {
    "match": {
      "book": "elasticSearch入门至精通"
    }
  }
}
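The sorting restriction can be seen directly. With book mapped as text, the request below is rejected (ES reports that fielddata is disabled on text fields by default and suggests using a keyword field); with book mapped as keyword it sorts normally:

```
GET /es_db/_search
{
  "sort": [
    { "book": "asc" }
  ]
}
```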

ES scoring internals

The relevance score, in short, measures how well a document's text matches the search text. Lucene's classic model is the term frequency / inverse document frequency algorithm, TF/IDF for short; recent Elasticsearch versions use BM25, a refinement of the same idea (its k1 and b parameters appear in the _explain output shown later).

Term frequency: how many times each term of the search text appears in the document's field; the more occurrences, the more relevant the document.

PUT /score/_doc/1
{
  "doc": "hello you, and world is very good"
}
PUT /score/_doc/2
{
  "doc": "hello, how are you"
}

GET /score/_search
{
  "query": {
    "match": {
      "doc": "hello world"
    }
  }
}

The query result:

{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.56501156,
    "hits" : [
      {
        "_index" : "score",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.56501156,
        "_source" : {
          "doc" : "hello you, and world is very good"
        }
      },
      {
        "_index" : "score",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.1546153,
        "_source" : {
          "doc" : "hello, how are you"
        }
      }
    ]
  }
}

Inverse document frequency: how many documents in the entire index contain each term of the search text; the more documents that contain a term, the less relevant that term is.

Field-length norm: the longer the field, the weaker the relevance.

doc1:{ "title": "hello article", "content": "...... N words" }

doc2:{ "title": "my article", "content": "...... N words, hi world" }

If "hello" and "world" each occur the same number of times across the whole index, doc1 is more relevant: its match falls in the title field, which is much shorter than content.
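A toy scorer combining the factors above can make this concrete. This is a deliberately simplified classic TF-IDF sketch (real Elasticsearch uses Lucene's BM25, as the _explain output below shows), with field-length normalization folded into the term-frequency step:

```python
import math

# Toy TF-IDF scorer over the two /score documents indexed above.
docs = {
    1: "hello you and world is very good".split(),
    2: "hello how are you".split(),
}

def score(query_terms, doc_terms, all_docs):
    n_docs = len(all_docs)
    total = 0.0
    for term in query_terms:
        # term frequency, normalized by field length (shorter field => higher tf)
        tf = doc_terms.count(term) / len(doc_terms)
        # document frequency: how many docs in the index contain the term
        df = sum(1 for d in all_docs.values() if term in d)
        # inverse document frequency: rarer terms weigh more
        idf = math.log(n_docs / df) if df else 0.0
        total += tf * idf
    return total

query = "hello world".split()
scores = {doc_id: score(query, terms, docs) for doc_id, terms in docs.items()}
# doc 1 contains both "hello" and "world", doc 2 only "hello",
# so doc 1 scores higher -- the same ordering ES returned above.
```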

Analyzing how a document's _score is computed (the typed request form below is deprecated in ES 7.x; GET /es_db/_explain/1 is the current equivalent)

GET /es_db/_doc/1/_explain
{
  "query": {
    "match": {
      "remark": "java developer"
    }
  }
}
#! Deprecation: [types removal] Specifying a type in explain requests is deprecated.
{
  "_index" : "es_db",
  "_type" : "_doc",
  "_id" : "1",
  "matched" : true,
  "explanation" : {
    "value" : 1.4691012,
    "description" : "sum of:",
    "details" : [
      {
        "value" : 0.5598161,
        "description" : "weight(remark:java in 0) [PerFieldSimilarity], result of:",
        "details" : [
          {
            "value" : 0.5598161,
            "description" : "score(freq=1.0), computed as boost * idf * tf from:",
            "details" : [
              {
                "value" : 2.2,
                "description" : "boost",
                "details" : [ ]
              },
              {
                "value" : 0.5389965,
                "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                "details" : [
                  {
                    "value" : 3,
                    "description" : "n, number of documents containing term",
                    "details" : [ ]
                  },
                  {
                    "value" : 5,
                    "description" : "N, total number of documents with field",
                    "details" : [ ]
                  }
                ]
              },
              {
                "value" : 0.472103,
                "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                "details" : [
                  {
                    "value" : 1.0,
                    "description" : "freq, occurrences of term within document",
                    "details" : [ ]
                  },
                  {
                    "value" : 1.2,
                    "description" : "k1, term saturation parameter",
                    "details" : [ ]
                  },
                  {
                    "value" : 0.75,
                    "description" : "b, length normalization parameter",
                    "details" : [ ]
                  },
                  {
                    "value" : 2.0,
                    "description" : "dl, length of field",
                    "details" : [ ]
                  },
                  {
                    "value" : 2.2,
                    "description" : "avgdl, average length of field",
                    "details" : [ ]
                  }
                ]
              }
            ]
          }
        ]
      },
      {
        "value" : 0.90928507,
        "description" : "weight(remark:developer in 0) [PerFieldSimilarity], result of:",
        "details" : [
          {
            "value" : 0.90928507,
            "description" : "score(freq=1.0), computed as boost * idf * tf from:",
            "details" : [
              {
                "value" : 2.2,
                "description" : "boost",
                "details" : [ ]
              },
              {
                "value" : 0.87546873,
                "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                "details" : [
                  {
                    "value" : 2,
                    "description" : "n, number of documents containing term",
                    "details" : [ ]
                  },
                  {
                    "value" : 5,
                    "description" : "N, total number of documents with field",
                    "details" : [ ]
                  }
                ]
              },
              {
                "value" : 0.472103,
                "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                "details" : [
                  {
                    "value" : 1.0,
                    "description" : "freq, occurrences of term within document",
                    "details" : [ ]
                  },
                  {
                    "value" : 1.2,
                    "description" : "k1, term saturation parameter",
                    "details" : [ ]
                  },
                  {
                    "value" : 0.75,
                    "description" : "b, length normalization parameter",
                    "details" : [ ]
                  },
                  {
                    "value" : 2.0,
                    "description" : "dl, length of field",
                    "details" : [ ]
                  },
                  {
                    "value" : 2.2,
                    "description" : "avgdl, average length of field",
                    "details" : [ ]
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}
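The numbers in this explanation can be reproduced by hand. The sketch below plugs the values reported by _explain into the formulas given in its description fields (score = boost * idf * tf, summed over the two terms):

```python
import math

def bm25_idf(n, N):
    # idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    # tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

boost = 2.2  # Lucene's default BM25 boost, k1 + 1

# weight(remark:java): n=3 docs contain the term, N=5 docs have the field
w_java = boost * bm25_idf(3, 5) * bm25_tf(1.0, dl=2.0, avgdl=2.2)

# weight(remark:developer): n=2, N=5
w_developer = boost * bm25_idf(2, 5) * bm25_tf(1.0, dl=2.0, avgdl=2.2)

total = w_java + w_developer  # the "sum of:" at the top of the explanation
print(f"{total:.7f}")  # -> 1.4691012, the reported _score
```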

Vector space model

How multiple terms yield one total score per doc:
For the query hello world, ES builds a query vector from each term's index-wide score (how the term scores across all docs).

Suppose the term hello gets an index-wide weight of 3,
and the term world gets an index-wide weight of 6.

query vector = [3, 6]

For each doc, a doc vector is built the same way, one component per query term. Take 3 docs: one contains hello, one contains world, one contains both:

doc1: contains hello --> [3, 0]
doc2: contains world --> [0, 6]
doc3: contains hello, world --> [3, 6]

Each doc gets a score per term (one for hello, one for world); those per-term scores form the doc vector, which can be plotted together with the query vector. The angle between each doc vector and the query vector then gives the doc's total score over all query terms: the smaller the angle, the higher the score; the larger the angle, the lower the score.
With more than two terms the same computation runs in higher dimensions using linear algebra, where it can no longer be drawn as a 2-D picture.
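The angle comparison can be sketched with cosine similarity: the cosine of the angle between two vectors is largest (1.0) when the angle is zero, so a higher cosine means a higher score:

```python
import math

def cosine(a, b):
    # cos(angle) = dot(a, b) / (|a| * |b|); larger cosine = smaller angle
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

query = [3, 6]          # index-wide weights of "hello" and "world"

docs = {
    "doc1": [3, 0],     # contains only hello
    "doc2": [0, 6],     # contains only world
    "doc3": [3, 6],     # contains hello and world
}

ranking = sorted(docs, key=lambda d: cosine(docs[d], query), reverse=True)
print(ranking)  # -> ['doc3', 'doc2', 'doc1']: doc3 points exactly along the query
```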

