[Project] Boost Search Engine

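Project Background

Many search engines are already on the market, such as Baidu, Google, and Bing, and all of them search the entire web.
This project instead implements site search, much like searching for C++ topics on the cplusplus website: the scope is a single site, so the search is more vertical.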

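How a Search Engine Works at a High Level

The server is our host machine, and it runs the service program searcher. Before building anything we need to prepare the data, namely the relevant HTML pages. We save the downloaded HTML files on disk under data/input, and then do two things: 1. strip the tags and clean the data; 2. build the index (the index is what makes searching fast).

Next, a user submits search keywords from the browser through an HTTP GET request, which starts a search task. The service program looks up the index, finds the relevant HTML documents, and returns the assembled results to the user.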

Tech Stack and Project Environment

Tech stack: C/C++, C++11, STL, the quasi-standard library Boost, Jsoncpp, cppjieba, cpp-httplib, HTML5, CSS, JavaScript, jQuery, Ajax

Project environment: CentOS 7 cloud server, vim/gcc(g++)/Makefile, VS2022, VS Code

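Forward Index vs. Inverted Index: How the Search Engine Actually Works

1. A forward index maps a document ID to the document's content (the keywords in the document)

Document 1: College students need to take exams
Document 2: College students' exam week has started

Document ID	Document content
1	College students need to take exams
2	College students' exam week has started

2. Segment the target documents into words

Purpose: to make building and querying the inverted index easier
Document 1 -> College students need to take exams: college students / need / exams
Document 2 -> College students' exam week has started: college students / exam week / started

3. An inverted index segments the document content, collects the distinct keywords, and maps each keyword to the IDs of the documents that contain it

Keyword	Document ID
college students	Document 1, Document 2
need	Document 1
exams	Document 1
started	Document 2
exam week	Document 2

Simulating one lookup:
User input: college students -> look up the inverted index -> extract the document IDs (1, 2) -> look up the forward index -> fetch the document content -> build the response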
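
The lookup flow above can be sketched in a few lines of C++. This is only a toy model under stated assumptions: ToyIndex, AddDocument, and Lookup are hypothetical names that do not appear in the project, the words are passed in already segmented, and document IDs start at 0 because they are vector indexes (the tables above number documents from 1).

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical toy types for illustration only (not the project's classes).
struct ToyIndex {
    std::vector<std::string> forward;  // doc ID (vector index) -> content
    std::unordered_map<std::string, std::vector<std::size_t>> inverted;  // word -> doc IDs
};

// Adds one document whose words have already been segmented.
// Returns the ID assigned to the document.
std::size_t AddDocument(ToyIndex& index, const std::string& content,
                        const std::vector<std::string>& words) {
    const std::size_t doc_id = index.forward.size();
    index.forward.push_back(content);
    for (const std::string& word : words) {
        std::vector<std::size_t>& ids = index.inverted[word];
        // Record each document at most once per word.
        if (ids.empty() || ids.back() != doc_id) ids.push_back(doc_id);
    }
    return doc_id;
}

// Looks a word up in the inverted index, then resolves every hit
// through the forward index.
std::vector<std::string> Lookup(const ToyIndex& index, const std::string& word) {
    std::vector<std::string> hits;
    auto it = index.inverted.find(word);
    if (it == index.inverted.end()) return hits;
    for (std::size_t doc_id : it->second) hits.push_back(index.forward[doc_id]);
    return hits;
}
```

Adding the two example documents and calling Lookup with "college students" returns both contents, while "need" returns only the first document.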

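Writing the Parser Module: Tag Removal and Data Cleaning

1. Prepare the data

Boost official site: https://www.boost.org/
First create a project directory with mkdir boost_searcher, then download the Boost release archive from the official site, choosing the latest version.

Once the archive is on the local disk, upload it into the project directory we just created on the cloud server. Here we use the rz command to upload the local file to the server.

Note: if the upload is slow, open cmd on Windows and use the scp command to copy the file to a path on the cloud server:

scp /path/to/localfile username@server_ip:/path/to/destination

/path/to/localfile is the path and name of the local file.

username is your user name on the server.

server_ip is the IP address of your server.

/path/to/destination is the destination path and file name on the server.

Opening one of the manuals in the Boost documentation, for example https://www.boost.org/doc/libs/1_84_0/doc/html/array.html, the URL shows that the manuals live under the doc/html path.

After extracting the archive, we copy all the HTML files under that path into data/input inside the project directory and build the index from them:

mkdir -p data/input
cp -rf boost_1_84_0/doc/html/* data/input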

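2. Remove the tags

// create the source file for the tag-removal program
touch parser.cc

<> : HTML tags. They have no value for searching, so they need to be removed. Tags usually come in pairs.

ls -Rl | grep -E '\.html$' | wc -l  // count how many .html files there are

Goal: strip the tags from every document and write all of them into a single file. Each document occupies exactly one line in the form title\3content\3url\n: the three fields are separated by \3, and documents are separated by \n.
\3 is a control character in the ASCII table and is not printable, so it will not collide with the document content.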
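
As a small illustration of this record format, the sketch below joins the three fields of one document into a single line. FormatRecord is a hypothetical helper name introduced only for this example; the project's own writer is SaveHtml, shown later.

```cpp
#include <string>

// Hypothetical helper: joins the three fields of one document into a single
// line of the form title\3content\3url\n.
std::string FormatRecord(const std::string& title, const std::string& content,
                         const std::string& url) {
    const char kSep = '\3';  // ASCII 0x03 (ETX), a non-printable control character
    std::string line;
    line.reserve(title.size() + content.size() + url.size() + 3);
    line += title;
    line += kSep;
    line += content;
    line += kSep;
    line += url;
    line += '\n';
    return line;
}
```

Because every record ends with \n, a reader can later fetch one whole document per std::getline call and then split the line on \3.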

3. Write parser.cc

#include <iostream>
#include <string>
#include <vector>
#include <fstream>
#include <boost/filesystem.hpp>
#include "util.hpp"

// Directory that holds all the HTML pages
const std::string src_path = "data/input";
// Output file for the cleaned documents (the data/raw_html directory must already exist)
const std::string output = "data/raw_html/raw.txt";


typedef struct DocInfo
{
  std::string title;   // document title
  std::string content; // document content
  std::string url;     // URL of this document on the official site
}DocInfo_t;

// Parameter conventions:
// const & : input
// *       : output
// &       : input and output

bool EnumFile(const std::string& src_path, std::vector<std::string>* file_list);
bool ParseHtml(const std::vector<std::string>& file_list, std::vector<DocInfo_t>* results);
bool SaveHtml(const std::vector<DocInfo_t>& results, const std::string& output);

int main()
{
  std::vector<std::string> file_list;
  // Step 1: recursively collect every HTML file name (with its path) into file_list,
  // so the files can be read one by one later
  if (!EnumFile(src_path, &file_list))
  {
    std::cerr << "enum file name error!" << std::endl;
    return 1;
  }

  // Step 2: read and parse each file listed in file_list
  std::vector<DocInfo_t> results;
  if(!ParseHtml(file_list, &results))
  {
    std::cerr << "Parse html error" << std::endl;
    return 2;
  }

  // Step 3: write the parsed documents to the output file, one document per line,
  // with \3 separating the fields of each document
  if(!SaveHtml(results, output))
  {
    std::cerr << "Save Html error" << std::endl;
    return 3;
  }
  return 0;
}

// The three functions below are stubs for now; they are filled in by the following steps.
bool EnumFile(const std::string& src_path, std::vector<std::string>* file_list)
{
  return true;
}
bool ParseHtml(const std::vector<std::string>& file_list, std::vector<DocInfo_t>* results)
{
  return true;
}
bool SaveHtml(const std::vector<DocInfo_t>& results, const std::string& output)
{
  return true;
}

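4. Install the Boost library on the server

// install the Boost development files and headers
sudo yum install -y boost-devel

Boost is an open-source collection of C++ libraries. It provides feature-rich modules that cover many areas, from data structures to multithreaded programming. Its goal is to give C++ programmers a set of high-quality, portable, cross-platform tools and components that extend what the language and its standard library offer. Many Boost components have been adopted into the C++ standard over time, so Boost is widely used in the C++ community.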

5. Implement EnumFile (collect the HTML file names with their paths)

bool EnumFile(const std::string& src_path, std::vector<std::string>* file_list)
{
  // shorten the boost namespace (requires #include <boost/filesystem.hpp>)
  namespace fs = boost::filesystem;
  fs::path root_path(src_path);

  // if the path does not exist, return false right away
  if(!fs::exists(root_path))
  {
    std::cerr << src_path << " does not exist" << std::endl;
    return false;
  }
  // a default-constructed iterator marks the end of the recursive traversal
  fs::recursive_directory_iterator end;
  for(fs::recursive_directory_iterator iter(root_path); iter != end; ++iter)
  {
    // skip anything that is not a regular file
    if(!fs::is_regular_file(*iter))
    {
      continue;
    }
    // check the extension and skip files that are not .html
    if(iter->path().extension() != ".html")
    {
      continue;
    }
    // at this point the path is a regular file with the .html extension
    file_list->push_back(iter->path().string());
  }
  return true;
}

6. Implement ParseHtml (parse the document content)

bool ParseHtml(const std::vector<std::string>& file_list, std::vector<DocInfo_t>* results)
{
  for(const std::string& file : file_list)
  {
    std::string result;
    // read the file at this path and store the HTML content in result
    if(!ns_util::FileUtil::ReadFile(file, &result))
    {
      continue;
    }

    // parse the file and extract the title
    DocInfo_t doc;
    if(!ParseTitle(result, &doc.title))
    {
      continue;
    }

    // parse the content of the file
    if(!ParseContent(result, &doc.content))
    {
      continue;
    }

    // build the URL from the file path
    if(!ParseUrl(file, &doc.url))
    {
      continue;
    }

    // for debugging; must run before doc is moved from
    ShowDebug(doc);

    // everything about this document is now stored in doc;
    // move it into results to avoid copying the strings
    results->push_back(std::move(doc));
  }
  return true;
}

A small utility we wrote ourselves: it opens a file and reads the content of the HTML page

// util.hpp
#pragma once
#include<iostream>
#include<string>
#include<fstream>

namespace ns_util
{
    class FileUtil
    {
        public:
            static bool ReadFile(const std::string& file_path, std::string* out)
            {
                std::ifstream in(file_path, std::ios::in);
                if(!in.is_open())
                {
                    std::cerr << "open file " << file_path << " error" << std::endl;
                    return false;
                }
                std::string line;
                while(std::getline(in, line))
                {
                    *out += line;
                }
                in.close();
                return true;
            }
    };
}

Parse the given file and extract the title

static bool ParseTitle(const std::string& file, std::string* title)
{
  // the title is whatever sits between <title> and </title>
  std::size_t begin = file.find("<title>");
  if(begin == std::string::npos)
  {
    return false;
  }
  std::size_t end = file.find("</title>");
  if(end == std::string::npos)
  {
    return false;
  }

  // move begin to the first character of the title text
  begin += std::string("<title>").size();
  if(begin > end)
  {
    return false;
  }

  // cut out the title
  *title =  file.substr(begin, end - begin);
  return true;
}

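Parse the content of the file

static bool ParseContent(const std::string& file, std::string* content)
{
  // tag removal, written as a simple two-state machine
  enum status
  {
    LABEL,
    CONTENT
  };

  // an HTML file is assumed to start with a tag
  enum status s = LABEL;
  for(char c : file)
  {
    switch(s)
    {
      case LABEL:
        // '>' closes the current tag, so text follows
        if(c == '>')s = CONTENT;
        break;
      case CONTENT: 
        // '<' opens a new tag
        if(c == '<')s = LABEL;
        else 
        {
          // \n is reserved as the document separator in the output file,
          // so newlines inside the text are replaced with spaces
          if(c == '\n') c = ' ';
          content->push_back(c);
        }
        break;
      default:
        break;
    }
  }
  return true;
}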
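
To see what the state machine produces, here is a stand-alone sketch of the same idea that can be run on a short string. StripTags is a hypothetical name used only for this example, and like the function above it assumes the input starts with a tag.

```cpp
#include <string>

// Illustrative stand-alone version of the same two-state machine.
std::string StripTags(const std::string& html) {
    enum class State { kTag, kText };
    State state = State::kTag;  // assumes the input starts with a tag
    std::string text;
    for (char c : html) {
        if (state == State::kTag) {
            if (c == '>') state = State::kText;   // tag closed, text follows
        } else if (c == '<') {
            state = State::kTag;                  // a new tag opens
        } else {
            text.push_back(c == '\n' ? ' ' : c);  // keep text, flatten newlines
        }
    }
    return text;
}
```

For the input "<p>Hello <b>Boost</b>\nworld</p>" this yields "Hello Boost world". One limitation worth knowing: text from adjacent elements is joined with no separator, so "<h1>A</h1><p>B</p>" becomes "AB".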

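Parse the file path and build the URL

static bool ParseUrl(const std::string& file_path, std::string* url)
{
  // URL prefix of the documentation on the official site
  // (the version must match the downloaded release, 1_84_0 here)
  std::string url_head = "https://www.boost.org/doc/libs/1_84_0/doc/html";

  // the part of the local path that follows src_path, e.g. /array.html
  std::string url_tail = file_path.substr(src_path.size());
  *url = url_head + url_tail;
  return true;
}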
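
The path-to-URL mapping can be checked in isolation with a sketch like the following. BuildUrl is a hypothetical helper that takes the local root and the URL prefix as parameters instead of reading globals, and it adds a prefix check that the function above does not perform.

```cpp
#include <string>

// Hypothetical helper mirroring ParseUrl: replaces the local root directory
// prefix with the URL prefix of the online documentation.
std::string BuildUrl(const std::string& local_root, const std::string& url_head,
                     const std::string& file_path) {
    // Guard: only rewrite paths that really start with local_root.
    if (file_path.compare(0, local_root.size(), local_root) != 0) return "";
    return url_head + file_path.substr(local_root.size());
}
```

With local_root "data/input" and the URL prefix used above, the local file data/input/array.html maps to https://www.boost.org/doc/libs/1_84_0/doc/html/array.html, which is the manual page mentioned in the data preparation step.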

7. Implement SaveHtml (write the parsed documents stored in the structs to a file)

bool SaveHtml(const std::vector<DocInfo_t>& results, const std::string& output)
{
#define SEP '\3'
  // Write each parsed document to the file in the form
  // title\3content\3url\n, so that one getline call returns a whole document.
  // The file is opened in binary mode.
  std::ofstream out(output.c_str(), std::ios::out | std::ios::binary);
  if(!out.is_open()) 
  {
    std::cerr << "open " << output << " error" << std::endl;
    return false;
  }

  // write the documents one by one
  for(auto& item : results)
  {
    std::string out_string;
    out_string += item.title;
    out_string += SEP;
    out_string += item.content;
    out_string += SEP; 
    out_string += item.url;
    out_string += '\n';

    out.write(out_string.c_str(), out_string.size());
  }
  out.close();
  return true;
}

Writing the Index Module (index)

For the forward index we use a vector; the vector index is the document ID.

For the inverted index we use a hash table; given a keyword from the document content, we look up its inverted list (the list of documents that contain it).

namespace ns_index
{
    struct DocInfo
    {
        std::string title;    // document title
        std::string content;  // document content after tag removal
        std::string url;      // URL of the document on the official site
        uint64_t doc_id;      // document ID
    };

    struct InvertedElem
    {
        uint64_t doc_id;
        std::string word;
        int weight;  // weight
    };
    
    class Index
    {
    private:
        // the forward index is an array; the array index is a natural document ID
        std::vector<DocInfo>foward_index;
        // inverted index: one keyword maps to one or more InvertedElem entries
        std::unordered_map<std::string, std::vector<InvertedElem>> inverted_index;
    public:
        Index(){}
        ~Index(){}

        // find the document content by doc_id
        DocInfo* GetForwardIndex(const uint64_t doc_id)
        {
            return nullptr;
        }

        // get the inverted list for a keyword
        std::vector<InvertedElem>* GetInvertedElemList(const std::string& word)
        {
            return nullptr;
        }

        // build the forward and inverted indexes from the cleaned, formatted documents
        bool BuildIndex(const std::string& input)
        {
            return true;
        }
    private:
        // build the forward index
        DocInfo* BuildForwardIndex(const std::string& line)
        {
            // parse line by splitting the string
           return nullptr;
        }
        
        // build the inverted index
        bool BuildInvertedIndex(const DocInfo& doc)
        {
            return true;
        }


    };
}

1. Look up a document's content by its document ID

DocInfo* GetForwardIndex(const uint64_t doc_id)
{
	if(doc_id >= foward_index.size())
	{
		std::cerr << "doc_id out range, error!" << std::endl;
		return nullptr;
	}
	return &foward_index[doc_id];
}

2. Get the inverted list for a keyword

std::vector<InvertedElem>* GetInvertedElemList(const std::string& word)
{
	auto iter = inverted_index.find(word);
	if(iter == inverted_index.end())
	{
		std::cerr << word << " not found in unordered_map" << std::endl;
		return nullptr;
	}
	return &(iter->second);
}

3. Build the index (build the forward and inverted indexes from the cleaned, formatted documents)

// Build the forward and inverted indexes from the cleaned, formatted document file
bool BuildIndex(const std::string& input)
{
	std::ifstream in(input, std::ios::in | std::ios::binary);
	if(!in.is_open())
	{
		std::cerr << input << " open error" << std::endl;
		return false;
	}
	std::string line;
	while(std::getline(in,line))  // getline stops at \n and discards it; once the end of the file is reached the loop condition becomes false
	{
		// parse one cleaned HTML record and add it to the forward index
		DocInfo* doc = BuildForwardIndex(line);  
		if(doc == nullptr)
		{
			std::cerr << "build error" << std::endl;
			continue;
		}
		// build the inverted index from the parsed document
		BuildInvertedIndex(*doc);
	}
	return true;
}

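4. Build the forward index

DocInfo* BuildForwardIndex(const std::string& line)
{
	// parse line by splitting the string
	std::vector<std::string>results;

	// the fields of each parsed HTML record are separated by sep
	const std::string sep = "\3";
	ns_util::StringUtil::CutString(line, &results, sep);
	if(results.size() != 3)
	{
		return nullptr;
	}

	// fill the strings into a DocInfo
	DocInfo doc;
	doc.title = results[0];
	doc.content = results[1];
	doc.url = results[2];
	doc.doc_id = foward_index.size();  // the ID is the index the element will get in the array
            
	// insert into the forward index
	foward_index.push_back(std::move(doc));
	// return the address of the stored element; returning &doc would hand out
	// a pointer to a local variable that is destroyed when the function returns
	return &foward_index.back();
}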
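
The article does not show ns_util::StringUtil::CutString, so the following is only a sketch of what such a splitter could look like, written with the standard library alone. The free function name and the choice to keep empty fields are assumptions made for this example; the author's helper may be implemented differently.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for ns_util::StringUtil::CutString, written with the
// standard library only. Empty fields are kept, so "a\3\3c" yields 3 fields.
void CutString(const std::string& line, std::vector<std::string>* out,
               const std::string& sep) {
    if (sep.empty()) {  // nothing to split on: return the line unchanged
        out->push_back(line);
        return;
    }
    std::size_t start = 0;
    while (true) {
        const std::size_t pos = line.find(sep, start);
        if (pos == std::string::npos) {
            out->push_back(line.substr(start));  // last field
            break;
        }
        out->push_back(line.substr(start, pos - start));
        start = pos + sep.size();
    }
}
```

Keeping empty fields means a record whose content happens to be empty still splits into exactly three parts, so the results.size() != 3 check above rejects only malformed lines.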

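5. Build the inverted index

In an inverted index, each keyword maps to one or more InvertedElem entries (a mapping from a keyword to its inverted list):

std::unordered_map<std::string, std::vector<InvertedElem>> inverted_index;

struct InvertedElem{ uint64_t doc_id; std::string word; int weight; };
// Example document:
title: Eat grapes
content: Eat grapes without spitting out the grape skins
url: http://XXXX
doc_id: 123
From the document content we produce one or more InvertedElem entries.
Because documents are processed one at a time and a single document contains many words, every one of those words must map to the current doc_id.

Both the title and the content have to be segmented first, using jieba:
title: eat / grapes / eat grapes (title_word)
content: eat / grapes / not spit / grape skins (content_word)
Relevance between a word and a document is based on word frequency: a word that appears in the title is considered more relevant, and a word that appears in the content less so.
Word frequency counting (pseudocode):

struct word_cnt{
  int title_cnt;
  int content_cnt;
};
unordered_map<std::string, word_cnt> word_map;
for (word : title_word) {
  word_map[word].title_cnt++;   // eat(1) / grapes(1) / eat grapes(1)
}
for (word : content_word) {
  word_map[word].content_cnt++; // eat(1) / grapes(1) / not spit(1) / grape skins(1)
}

Now we know how many times each word appears in the title and in the content of the document.

Custom relevance (pseudocode):

for (word : word_map) {
  // One word's relationship to document 123. When several different words point to the same document, which result should be shown first? That is decided by relevance.
  InvertedElem elem;
  elem.doc_id = 123;
  elem.word = word.first;
  // relevance: a deliberately simple formula
  elem.weight = 10 * word.second.title_cnt + word.second.content_cnt;
  inverted_index[word.first].push_back(elem);
}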
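
The counting and weighting steps can be tried out without jieba by feeding in word lists that are already segmented. The sketch below does exactly that; WordCount and ComputeWeights are hypothetical names for this example, and the formula is the same 10 * title_cnt + content_cnt used above.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

struct WordCount {
    int title_cnt = 0;
    int content_cnt = 0;
};

// Computes weight = 10 * title_cnt + content_cnt for every distinct word.
// The word lists are assumed to be already segmented (by jieba in the project).
std::unordered_map<std::string, int> ComputeWeights(
    const std::vector<std::string>& title_words,
    const std::vector<std::string>& content_words) {
    std::unordered_map<std::string, WordCount> counts;
    for (const std::string& w : title_words) counts[w].title_cnt++;
    for (const std::string& w : content_words) counts[w].content_cnt++;

    std::unordered_map<std::string, int> weights;
    for (const auto& entry : counts) {
        weights[entry.first] = 10 * entry.second.title_cnt + entry.second.content_cnt;
    }
    return weights;
}
```

For the grape example, "eat" and "grapes" each appear once in the title and once in the content, so both get weight 11; "eat grapes" appears only in the title and gets 10; "not spit" and "grape skins" appear only in the content and get 1.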

Get cppjieba: git clone https://gitcode.net/mirrors/yanyiwu/cppjieba.git
Note one detail: we have to run cd cppjieba; cp -rf deps/limonp include/cppjieba/ ourselves, otherwise compilation fails.

#pragma once
#include <iostream>
#include <string>
#include <vector>
#include <unordered_map>
#include <fstream>
#include <cstdint>
#include "util.hpp"
#include <mutex>

namespace ns_index
{
    struct DocInfo
    {
        std::string title;   // document title
        std::string content; // document content after tag removal
        std::string url;     // URL of the document on the official site
        uint64_t doc_id;     // document ID
    };

    struct InvertedElem
    {
        uint64_t doc_id;
        std::string word;
        int weight; // weight
    };

    class Index
    {
    private:
        // the forward index is an array; the array index is a natural document ID
        std::vector<DocInfo> foward_index;
        // inverted index: one keyword maps to one or more InvertedElem entries
        std::unordered_map<std::string, std::vector<InvertedElem>> inverted_index;

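    private:
        Index() {}
        Index(const Index &) = delete;
        Index &operator=(const Index &) = delete;

        static Index *instance;
        static std::mutex mutx;

    public:
        ~Index() {}
        static Index *GetInstance()
        {
            if (instance == nullptr) // fast path: skip the cost of locking once the instance exists
            {
                // Without the lock, several threads could pass the check at the same time
                // and each create an object, so creation has to be serialized.
                std::lock_guard<std::mutex> guard(mutx);
                if (instance == nullptr)
                {
                    instance = new Index();
                }
            }
            // return on every path, not only when the instance was just created
            return instance;
        }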

        // find the document content by doc_id
        DocInfo *GetForwardIndex(const uint64_t doc_id)
        {
            if (doc_id >= foward_index.size())
            {
                std::cerr << "doc_id out range, error!" << std::endl;
                return nullptr;
            }
            return &foward_index[doc_id];
        }

        // get the inverted list for a keyword
        std::vector<InvertedElem> *GetInvertedElemList(const std::string &word)
        {
            auto iter = inverted_index.find(word);
            if (iter == inverted_index.end())
            {
                std::cerr << word << " not found in unordered_map" << std::endl;
                return nullptr;
            }
            return &(iter->second);
        }

        // build the forward and inverted indexes from the cleaned, formatted document file
        bool BuildIndex(const std::string &input)
        {
            std::ifstream in(input, std::ios::in | std::ios::binary);
            if (!in.is_open())
            {
                std::cerr << input << " open error" << std::endl;
                return false;
            }
            std::string line;
            while (std::getline(in, line))
            {
                DocInfo *doc = BuildForwardIndex(line);
                if (doc == nullptr)
                {
                    std::cerr << "build error" << std::endl;
                    continue;
                }

                BuildInvertedIndex(*doc);
            }
            return true;
        }

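    private:
        DocInfo *BuildForwardIndex(const std::string &line)
        {
            // parse line by splitting the string
            std::vector<std::string> results;

            // the fields of each parsed HTML record are separated by sep
            const std::string sep = "\3";
            ns_util::StringUtil::CutString(line, &results, sep);
            if (results.size() != 3)
            {
                return nullptr;
            }

            // fill the strings into a DocInfo
            DocInfo doc;
            doc.title = results[0];
            doc.content = results[1];
            doc.url = results[2];
            doc.doc_id = foward_index.size(); // the ID is the index the element will get in the array

            // insert into the forward index and return the address of the stored
            // element; &doc would point to a local that dies when the function returns
            foward_index.push_back(std::move(doc));
            return &foward_index.back();
        }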

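        bool BuildInvertedIndex(const DocInfo &doc)
        {
            // word segmentation
            std::vector<std::string> title_words;
            struct word_cnt
            {
                int title_cnt;
                int content_cnt;
                word_cnt() : title_cnt(0), content_cnt(0) {}
            };
            // map used to store the word frequencies
            std::unordered_map<std::string, word_cnt> word_map;
            ns_util::JiebaUtil::CutString(doc.title, &title_words);

            // Searcher lowercases the query words with boost::to_lower, so the words are
            // lowercased here as well; otherwise mixed-case words would never match.
            // (boost/algorithm/string.hpp is assumed to come in through util.hpp,
            // which searcher.hpp already relies on.)
            for (std::string &s : title_words)
            {
                boost::to_lower(s);
                word_map[s].title_cnt++;
            }

            std::vector<std::string> content_words;
            ns_util::JiebaUtil::CutString(doc.content, &content_words);
            // count word frequencies in the content
            for (auto &s : content_words)
            {
                boost::to_lower(s);
                word_map[s].content_cnt++;
            }

            for (auto &word_pair : word_map)
            {
                InvertedElem item;
                item.doc_id = doc.doc_id;
                item.word = word_pair.first;
                // a word in the title counts ten times as much as a word in the content
                item.weight = 10 * word_pair.second.title_cnt + 1 * word_pair.second.content_cnt;
                std::vector<InvertedElem> &inverted_list = inverted_index[word_pair.first];
                inverted_list.push_back(std::move(item));
            }

            return true;
        }
    };
    // static members must be defined outside the class, the mutex included
    Index *Index::instance = nullptr;
    std::mutex Index::mutx;
}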
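
As a side note on the singleton, C++11 also offers a shorter route than a hand-written lock: a function-local static is guaranteed to be initialized exactly once even when several threads reach it at the same time. The sketch below shows that alternative pattern on a hypothetical Registry class; it is not how the Index class above is written.

```cpp
// Hypothetical class name, for illustration only.
class Registry {
public:
    // C++11 guarantees that a function-local static is initialized exactly
    // once, even when several threads call Instance() concurrently.
    static Registry& Instance() {
        static Registry instance;
        return instance;
    }
    Registry(const Registry&) = delete;
    Registry& operator=(const Registry&) = delete;

    int counter = 0;  // sample state shared through the single instance

private:
    Registry() = default;
};
```

Every call to Registry::Instance() returns a reference to the same object, so no static pointer, mutex, or out-of-class definition is needed.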

Writing the Search Module: Searcher

#pragma once
#include "index.hpp"
#include "util.hpp"
#include <algorithm>
#include <cctype>
#include <jsoncpp/json/json.h>

namespace ns_seacher
{
    class Searcher
    {
    private:
        ns_index::Index* index = nullptr;  // the singleton index object
    public:
        Searcher() {}
        ~Searcher() {}
        void InitSearcher(const std::string &input)
        {
            // get (or create) the Index singleton
            index = ns_index::Index::GetInstance();
            // build the forward and inverted indexes from the cleaned data file
            index->BuildIndex(input);
        }

        // query: the search keywords; json_string: the search results returned to the user's browser
        void Search(const std::string &query, std::string *json_string)
        {
            // segment the query into words, then look each one up
            std::vector<std::string>words;
            ns_util::JiebaUtil::CutString(query,& words);

            // search by each segmented word
            std::vector<ns_index::InvertedElem> inverted_list_all;
            for(std::string word : words)
            {
                boost::to_lower(word);
                std::vector<ns_index::InvertedElem> *inverted_list = index->GetInvertedElemList(word);
                if(inverted_list == nullptr)
                {
                    continue;
                }
                inverted_list_all.insert(inverted_list_all.end(), inverted_list->begin(), inverted_list->end());
            }
            // merge the results and sort them by relevance in descending order
            std::sort(inverted_list_all.begin(), inverted_list_all.end(), [](const ns_index::InvertedElem& e1, const ns_index::InvertedElem& e2){
                return e1.weight > e2.weight;
            });

            // build the JSON string from the results
            Json::Value root;
            for(auto& item : inverted_list_all)
            {
                // look up the forward index
                ns_index::DocInfo*  doc = index->GetForwardIndex(item.doc_id);
                if(nullptr == doc)
                {
                    continue;
                }
                Json::Value elem;
                elem["title"] = doc->title;
                elem["desc"] = GetDesc(doc->content, item.word);
                elem["url"] = doc->url;

                root.append(elem);
            }
            Json::StyledWriter writer;
            *json_string = writer.write(root);   // serialize
        }

        // cut a short excerpt out of the content to show as the summary
        std::string  GetDesc(const std::string& content, const std::string& word)
        {
            // Find the first occurrence of the keyword in content, then take up to 50 bytes
            // before it (or start at the beginning) and up to 100 bytes after it (or stop at the end)
            const std::size_t prev_pos = 50;
            const std::size_t next_pos = 100;

            // word has been lowercased while content has not, so the match must ignore case
            auto iter = std::search(content.begin(), content.end(), word.begin(), word.end(),
                [](char x, char y){
                    return std::tolower(static_cast<unsigned char>(x)) == std::tolower(static_cast<unsigned char>(y));
                });
            // the word may appear only in the title
            if(iter == content.end())return "None";
            std::size_t pos = static_cast<std::size_t>(iter - content.begin());
    
            // compute start and end
            std::size_t start = 0;
            std::size_t end = content.size();
            // size_t is unsigned: compare before subtracting, or pos - prev_pos wraps around
            if(pos > prev_pos)start = pos - prev_pos;
            if(pos + next_pos < end)end = pos + next_pos;
            // cut out the substring
            if(start >=  end)return "None";
            return content.substr(start, end - start);
        }
    };
};
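
One case the Search function does not handle: when a query is segmented into several words that all occur in the same document, that document is appended once per word and therefore shows up several times in the results. A possible refinement, which is not part of the original code, is to merge hits by doc_id before sorting. The sketch below shows the idea with hypothetical types (Hit mirrors InvertedElem; MergedHit and MergeHits are made up for this example).

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical types for illustration; Hit mirrors InvertedElem.
struct Hit {
    std::uint64_t doc_id;
    std::string word;
    int weight;
};

struct MergedHit {
    std::uint64_t doc_id = 0;
    std::vector<std::string> words;  // every query word that matched this document
    int weight = 0;                  // sum of the individual weights
};

// Collapses hits that refer to the same document and sorts by total weight.
std::vector<MergedHit> MergeHits(const std::vector<Hit>& hits) {
    std::unordered_map<std::uint64_t, MergedHit> by_doc;
    for (const Hit& hit : hits) {
        MergedHit& merged = by_doc[hit.doc_id];
        merged.doc_id = hit.doc_id;
        merged.words.push_back(hit.word);
        merged.weight += hit.weight;
    }
    std::vector<MergedHit> result;
    result.reserve(by_doc.size());
    for (const auto& entry : by_doc) result.push_back(entry.second);
    std::sort(result.begin(), result.end(),
              [](const MergedHit& a, const MergedHit& b) {
                  if (a.weight != b.weight) return a.weight > b.weight;
                  return a.doc_id < b.doc_id;  // deterministic tie-break
              });
    return result;
}
```

Summing the weights is only one option; it ranks a document that matches several query words above one that matches a single word with the same individual weight.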

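Writing the http_server Module

cpp-httplib library: https://gitee.com/zhangkt1995/cpp-httplib?_from=gitee_search

Note: cpp-httplib requires a fairly recent gcc, while the default on CentOS 7 is gcc 4.8.5, so gcc has to be upgraded first.

Install scl:
sudo yum install centos-release-scl scl-utils-build
Install the newer gcc:
sudo yum install -y devtoolset-7-gcc devtoolset-7-gcc-c++
Enable the newer gcc from the command line (only effective for the current session):
scl enable devtoolset-7 bash

Install cpp-httplib
The latest cpp-httplib can misbehave at run time unless gcc is recent enough, so cpp-httplib 0.7.15 is recommended.
Download the zip from https://gitee.com/welldonexing/cpp-httplib/tree/v0.7.15, transfer it to the server, and extract cpp-httplib.zip with unzip.

Note: if cpp-httplib reports errors during testing, the gcc/g++ version is probably too old.

// Solutions:
// 1. Enable the newer toolchain (gcc/g++) for the current session only:
scl enable devtoolset-7 bash

// 2. Add the command above to ~/.bash_profile, which is run every time a login bash process starts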

Testing cpp-httplib

Writing httpserver.cc

#include "cpp-httplib/httplib.h"
#include "searcher.hpp"
#include <iostream>
#include <string>

const std::string root_path = "./wwwroot";
const std::string input = "data/raw_html/raw.txt";

int main()
{
    ns_seacher::Searcher search;
    search.InitSearcher(input);

    httplib::Server svr;
    // serve the static front-end files from root_path
    svr.set_base_dir(root_path.c_str());
    // like Baidu, the user's query is passed as a parameter in the URL, e.g. /s?word=...
    svr.Get("/s", [&search](const httplib::Request & req, httplib::Response &rsp){
        // check whether the URL carries a parameter with the given name
        if(!req.has_param("word"))
        {
            // plain text, sent as text/plain with a UTF-8 charset
            rsp.set_content("A search keyword is required!","text/plain; charset=utf-8");
            return;
        }
        std::string word = req.get_param_value("word");
        std::cout << "User is searching for: " << word << std::endl;
        std::string json_string;
        search.Search(word, &json_string);
        rsp.set_content(json_string, "application/json");
    });
    svr.listen("0.0.0.0",8081);
    return 0; 
}

Writing the Front-End Code

1. Set up the HTML skeleton first

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Boost Search Engine</title>
</head>
<body>
    <div class="container">
        <div class="search">
            <input type="text" value="Enter search keywords">
            <button onclick="Search()">Search</button>
        </div>
        <div class="result">
            <div class="item">
                <a href="#">This is the title</a>
                <p>This is the summary</p>
                <i>http://search</i>
            </div>
            <div class="item">
                <a href="#">This is the title</a>
                <p>This is the summary</p>
                <i>http://search</i>
            </div>
            <div class="item">
                <a href="#">This is the title</a>
                <p>This is the summary</p>
                <i>http://search</i>
            </div>
            <div class="item">
                <a href="#">This is the title</a>
                <p>This is the summary</p>
                <i>http://search</i>
            </div>
            <div class="item">
                <a href="#">This is the title</a>
                <p>This is the summary</p>
                <i>http://search</i>
            </div>
            <div class="item">
                <a href="#">This is the title</a>
                <p>This is the summary</p>
                <i>http://search</i>
            </div>
        </div>
    </div>
</body>
</html>

2. Write the CSS styles

<style>
        /* remove all default margins and padding on the page */
        * {
            /* outer margin */
            margin: 0;
            /* inner padding */
            padding: 0;
        }

        html,
        body {
            height: 100%;
        }

        .container {
            width: 800px;
            margin: 0px auto;
            margin-top: 15px;
        }

        .container .search {
            width: 100%;
            height: 52px;
        }

        /* when setting the height of the input, take its border into account */
        .container .search input {
            float: left;
            width: 600px;
            height: 50px;
           
            border: 1px solid black;
            border-right: none;
            padding-left: 10px;
            color: rgb(68, 67, 67);
            font-size: 15px;
        }

        .container .search button {
            float: left;
            width: 150px;
            height: 52px;
            background-color: #4e6ef2;
            color: #FFF;
            font-size: 19px;
            font-family: Georgia, 'Times New Roman', Times, serif;

        }

        .container .result {
            width: 100%;
        }

        .container .result .item {
            margin-top: 15px;
        }

        .container .result .item a {
            display: block;
            /* remove the underline from the a tag */
            text-decoration: none;
            font-size: 22px;
            color: #4e6ef2;
        }

        .container .result .item a:hover {
            color: #2c428b;
            text-decoration: underline;
            /* dark blue */
        }

        .container .result .item a:active {
            color: #335b9b;
            /* color while the link is being clicked */
        }

        .container .result .item p {
            margin-top: 5px;
            color: #71777D;
            font: 16px / normal Arial, Helvetica, Sans-Serif;
        }

        .container .result .item i {
            display: block;
            font-style: normal;
            color: #4007A2;
        }
    </style>

3. Write the JavaScript code

Include jQuery: <script src="http://code.jquery.com/jquery-2.1.1.min.js"></script>

<script>
        function Search(){
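            // 1. read the query from the input box
            let query = $(".container .search input").val();
            console.log("query=" + query);
            // 2. send the HTTP request; $.ajax is jQuery's function for exchanging data between the front end and the back end
            $.ajax({
                type:"GET",
                // encode the query so that spaces and special characters survive in the URL
                url:"/s?word=" + encodeURIComponent(query),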
                success:function(data){
                    console.log(data); 
                    BuildHtml(data)
                }
            });
        }

        function BuildHtml(data){
            let result_lable = $(".container .result");
            result_lable.empty();
            for(let elem of data)
            {
                let a_lable = $("<a>",{
                    text: elem.title,
                    href: elem.url,
                    // open the link in a new page
                    target: "_blank"
                });

                let p_lable = $("<p>", {
                    text: elem.desc,
                });

                let i_lable = $("<i>", {
                    text: elem.url
                });

                let div_lable = $("<div>", {
                    class: "item"
                });
                a_lable.appendTo(div_lable);
                p_lable.appendTo(div_lable);
                i_lable.appendTo(div_lable);
                div_lable.appendTo(result_lable);
            }
        }
    </script>
