字符串搜索@串的模式匹配@朴素暴力算法-EW帮帮网

文章目录

串的模式匹配

子串的定位操作通常称为串的模式匹配，它求的是子串（常称模式串）在主串中的位置。

本文主要介绍暴力算法的不同风格的写法

暴力匹配算法(BF算法)

首先介绍采用定长顺序存储结构，给出一种不依赖于其他串操作的暴力匹配算法。(Brute-Force)

暴力算法又有不同的写法

下面的算法是串从位序1开始计数的(不能直接运行,可运行版本附在后面)

int Index_deprecated(String S, String T)
{
    int i = 1, j = 1;
    while (i <= S.length && j <= T.length)
    {
        if (S.ch[i] == T.ch[j])
        {
            ++i;
            ++j; // 继续比较后继字符
        }
        else
        {
            // 指针后退重新开始匹配
            j = 1; // 模式串的指针重置为1
            i = i - j + 2;
        }
    }
    if (j > T.length)
        return i - T.length;
    else
        return -1;
}

算法思想

在上述算法中，分别用计数指针 $i$ 和 $j$ 指示主串 $S$ 和模式串 $T$ 中当前正待比较的字符位置。

算法思想为：从主串 $S$ 的第一个字符起，与模式串 $T$ 的第一个字符比较，若相等，则继续逐个比较后续字符；否则从主串的下一个字符起，重新和模式串的字符比较；

以此类推，直至模式串 $T$ 中的每个字符依次和主串 $S$ 中的一个连续的字符序列相等，则称匹配成功，函数值为与模式串 $T$ 中第一个字符相等的字符在主串 $S$ 中的序号，否则称匹配不成功，函数值为零。

分析

在开始前,需要明确偏移规律:

对于序列 $a_{p}a_{p+1}\cdots a_{q}$ , $(q>p,p,q\in\mathbb{N})$ ,指向 $a_{p}$ 的指针偏移到 $a_{q}$ 需要的偏移量为 $q - p$ (设从当前元素偏移到下一个元素的偏移量为1);反之,从 $q$ 偏移到 $p$ 的负方向的偏移量也为 $q - p$

例如 $1, 2, 3, 4, 5$ ,指针从3偏移到5需要的偏移量为 $5 - 3 = 2$ ;反之从3偏移到5需要的偏移量为 $5 - 3 = 2$

在基于回溯的算法中,我们会用到这个基本事实,例如从字符串的第 $j$ 个字符回退到第 $1$ 个字符,那么偏移量为 $j - 1$

基于回溯的暴力匹配算法中,设串 $S_{1}$ = $s_{t}s_{t+1}\cdots{s_{x}}$ 是主串 $s_{1}s_{2}\cdots{s_{n}}$ 的一个子串;而模式串 $p_{1}p_{2}\cdots{p_{m}}$ ;

为了检查匹配主串和字串,需要设立两个计数指针 $i, j$ ,分别指向主串的当前字符位置和模式串的当前位置

设某个匹配过程中从主串的 $s_{t}$ 和 $p_{1}$ 开始匹配,一直到 $s_{x},p_{j}$ 发生了失配( $s_{x}\neq{p_{j}}$ ),
$\Large\begin{matrix} s_{t}&s_{t+1}&\cdots&\overset{\downarrow^{\Large i}}{\boxed{s_{x}}}\cdots\\ p_{1}&p_{2}&\cdots&\underset{{\uparrow}_{\Large j}}{\boxed {p_{j}}}\cdots\\ \end{matrix}$
此时需要回溯然后重新开始匹配

那么回溯的量(偏移量)为多少合适?检查本次匹配的起始位置 $s_{i}$ ,因为本次匹配失败,所以只能在主串的下一个位置 $s_{t+1}$ 开始新一轮的匹配尝试;问题是主串的指针应该如何回溯到 $s_{t+1}$ ?(或者说回溯偏移量为多少的问题)

不难发现 $s_{t+1}$ 和模式串的 $p_{2}$ 位置相对应,那么从 $p_{j}$ 回溯到 $p_{2}$ 的回溯量为 $j - 2$ ,那么 $s_{x}$ 回溯到 $s_{t+1}$ 的偏移量也为 $j - 2$ ,从而指针 $i$ 回溯到 $i - (j - 2)$ = $i - j + 2$
$\Large\begin{matrix} s_{t}&\overset{\downarrow^{i'}}{s_{t+1}}&\cdots&\overset{\downarrow^{i}} s_{x}\cdots\\ p_{1}&\underset{{\uparrow}_{j'}}{p_{2}}&\cdots&\underset{{\uparrow}_{j}}{p_{j}}\cdots\\ \end{matrix}$

此外,根据偏移量 $j - 2$ 还可以得出 $x = t + 1 + j - 2$ = $t + j - 1$ (虽然这个关系式对于算法不重要)

观察 $p_{j}\to{p_{2}}$ 子序列的下标差,实际上有 $i - i^{'}$ = $j - 2$ ,也可以得出 $i^{'} = i - j + 2$ ,这就是 $i$ 的回溯公式

另一方面模式串 $T$ 的指针 $j$ 也要回溯,不过回溯的目标是指向 $T$ 的首字符,也就是1

令 $L(T)=\rm{T.Length}$ , $L(S)=\rm{S.Length}$

这样循环执行,直到S,T中某个串的所有字符被遍历完(如果 $T$ 存在于 $S$ ,那么最终 $j = L (T) + 1$ ,或写作 $j > L (T)$ )否则 $S$ 中不存在 $T$ 那么 $i = L (S) + 1$ ,或者也用 $j$ 来判断: $j\leqslant{L(T)}$ )

若经过上述判断 $j > L (T)$ ,即 $T$ 存在于 $S$ 中,应该返回出现的位置

根据下面意图可以看出,返回的位置是 $s_{t}$ 的索引,退出循环的时候 $i$ 指向 $s_{x+1}$ ,注意此时 $p'_{\small L(T)+1}$ 是虚拟的,实际上不存在该字符,标注出来是为了表示 $j = L (T) + 1$ , $p_{L(T)}$ 是模式串 $T$ 的最后一个字符了

如何得到指向 $s_{t}$ 的 $i^{'}$ 呢?我们同样有两种办法, $s_{x+1},s_{t}$ 的偏移量可以转换到 $p_{1},p_{\small {L(T)+1}}$ 的偏移量,也就是 $L (T)$ ,所以 $i^{'} = i - L (T)$
$\Large\begin{matrix} \overset{\downarrow^{i'}} {\boxed {s_{t}}}&{s_{t+1}}&\cdots&s_{x}&\overset{\downarrow^{i}} {\boxed {s_{x+1}}}\cdots\\ {p_{1}}&{p_{2}}&\cdots&{p_{\small L(T)}}&\underset{{\uparrow}_{j}}{\boxed {(p'_{\small L(T)+1})}}\cdots\\ \end{matrix}$

小结

简单模式匹配算法的最坏时间复杂度为 $O (nm)$ ，其中 $n$ 和 $m$ 分别为主串和模式串的长度。

例如，当模式串为 6个0,1个1组成的长度为7的串而主串为前45个’0’和1个’1’组成的长度为46的串时，由于模式串中的前 6 个字符均为 ‘0’，主串中的前 45 个字符均为 ‘0’，每趟匹配都是比较到模式串中的最后一个字符时才发现不等，指针 i 需要回溯 39 次

(字串在尾部对齐时成功匹配,主串的指针移动到第46-7+1=40时不再回溯(前39次都要回溯,对于本例,每次回溯就要比较7次(失配在模式串的最后一个字符处),比较次数和模式串长度相等), $i$ 移动到主串第40个字符时也要比较7次,因此总比较次数为 $40 \times 7 = 280$ 次。

可执行代码

将上述从1开始计数的指针和公式全部减去1,调整必要的比价关系运算符号,就可以得到可运行代码

/* 定义C语言的String(基于堆分配内存) */
typedef struct String
{
    char *ch;
    int length;
} String;



int Index(String S, String T)
{
    int i = 0, j = 0;
    while (i < S.length && j < T.length) // 从0开始计数
    {
        if (S.ch[i] == T.ch[j])
        {
            // printf("%d %d\n", i, j);
            // printf("%c %c\n", S.ch[i], T.ch[j]);
            ++i;
            ++j; // 继续比较后继字符(如果两个串都还未结束)
        }
        else
        {
            // 指针后退重新开始匹配
            j = 0; // 模式串的指针重置为0
            i = i - j + 1;
        }
    } // 离开循环时至少有一个字符串结束
    // printf("%d %d\n", i, j);
    if (j == T.length)
    {

        printf("%d \n", i - T.length);
        return i - T.length;
    }
    else
    {

        printf("%d \n", -1);
        return -1;
    }
}

辅助函数和运行和测试

// 初始化函数
void StrAssign(String *str, const char *initValue)
{
    if (str == NULL)
    {
        return; // 避免对空指针操作
    }
    if (initValue == NULL)
    {
        str->ch = NULL; // 如果初始值为空字符串
        str->length = 0;
    }
    else
    {
        str->length = strlen(initValue);                            // 获取字符串长度
        str->ch = (char *)malloc((str->length + 1) * sizeof(char)); // 分配内存
        if (str->ch != NULL)
        {
            strcpy(str->ch, initValue); // 复制字符串
        }
    }
}

// 销毁函数
void freeString(String *str)
{
    if (str != NULL && str->ch != NULL)
    {
        free(str->ch); // 释放动态分配的内存
        str->ch = NULL;
        str->length = 0;
    }
}
 
// 测试函数
void testIndex()
{
    // 测试用例1：两个空字符串
    String S1, T1;
    StrAssign(&S1, "");
    StrAssign(&T1, "");
    assert(Index(S1, T1) == 0);

    // 测试用例2：S不包含T
    String S2, T2;
    StrAssign(&S2, "abcdef");
    StrAssign(&T2, "xyz");
    assert(Index(S2, T2) == -1);

    // 测试用例3：S包含T
    String S3, T3;
    StrAssign(&S3, "abcdef");
    StrAssign(&T3, "def");
    assert(Index(S3, T3) == 3);

    // 测试用例4：T为空字符串
    String S4, T4;
    StrAssign(&S4, "abcdef");
    StrAssign(&T4, "");
    assert(Index(S4, T4) == 0);

    // 测试用例5：S为单个字符，S等于T
    String S5, T5;
    StrAssign(&S5, "a");
    StrAssign(&T5, "a");
    assert(Index(S5, T5) == 0);

    // 测试用例6：S为单个字符，S不等于T
    String S6, T6;
    StrAssign(&S6, "a");
    StrAssign(&T6, "b");
    assert(Index(S6, T6) == -1);

    // 测试用例7：给定的测试用例
    String S, T;
    StrAssign(&S, "hello");
    StrAssign(&T, "ll");
    assert(Index(S, T) == 2);

    // 更多测试用例...
}

int main()
{

    testIndex();
    return 0;
}

其他暴力算法

以下的暴力算法逻辑上更加简单,需要做的分析比较少,但是上述回溯法暴力算法可以更好和KMP匹配算法接轨

#include <stdio.h>
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>
 

/**
 * @brief Determine if the main string M starts with the target string T
 *
 * @param M The main string
 * @param T The target string
 * @return 1 if M starts with T, 0 otherwise
 */
bool startMatch(char *M, char *T)
{
    // Check if the length of M is less than the length of T
    if (strlen(M) < strlen(T))
        return 0;

    // Iterate through T and compare the characters with M
    for (int i = 0; T[i] != '\0'; i++)
    {
        // If the characters do not match, return 0
        if (M[i] != T[i])
        {
            return false;
        }
    }

    // If all characters match, return 1
    return true;
}
/**
 * @brief Determine if the main string M starts with the target string T
 *
 * @details This is a more concise version of startMatch, it uses a while loop to compare the characters of M and T.
 *          If the characters do not match, return false. Otherwise, return true.
 *
 * @param M The main string
 * @param T The target string
 * @return 1 if M starts with T, 0 otherwise
 */
bool startMatch2(char *M, char *T)
{
    /*
    主串M和模式串T匹配成功的标志是T被遍历匹配到最后一个字符'\0'(此前中途如果遇到不匹配的情况,则表明匹配失败)
    主串M的长度可能短于模式串T,对于两个字符串的遍历,任意一个遇到'\0'结束或者失配时结束遍历,最后再检查一下是否T被遍历完(检查离遍历循环时模式串的检查位置是否停留在\0处即可,如果是说明匹配成功,否则失败)
     */
    // while(*M++==*T++);//比较到出现'\0'时应该立即结束循环,即便是两串相等,防止'\0'=='\0'后继续比较下去

    // while (*M && *T && (*M++ == *T++))//比较到出现'\0'结束循环;把++放到循环里代码更加紧凑,但是可读性降低,不方便调试(新手不友好);这里不能省略*M,*T
    //*M保证了主串还未结束,字符'0'是真值(和`\0`不同,`\0`是0,是假值)

    // 如果将++放到循环体中,仔细观察可以发现,*M,*T可以省略掉其中的一个,*M==*T会判断剩下的一个
    while (*M && (*M == *T))
    {
        // printf("M:%c,T:%c\n", *M, *T);//打印放在循环末尾会丢失第一个字符打印结果
        M++;
        T++;
    }
    return *T == '\0';
}
void testStartMatchFunc(bool (*func)(char *, char *))
{
    assert(func("hello", "hello") == true);        // Full match
    assert(func("hello", "hell") == true);         // Prefix match
    assert(func("hello", "hello there") == false); // M longer than T, no match
    assert(func("hello", "helloa") == false);      // Different characters, no match
    assert(func("", "") == true);                  // Both empty strings
    assert(func("a", "a") == true);                // Single character match
    assert(func("abc", "ab") == true);             // T ends before M
    assert(func("abc", "abcd") == false);          // T longer than M, no match
    // assert(func("abc", "abc\0") == true);          // T ends with null character
    // assert(func("abc\0", "abc") == false);         // M ends with null character, T doesn't
}
int subString(char *M, char *T)
{

    int lenM = strlen(M);
    int lenT = strlen(T);
    int stop = lenM - lenT + 1; // 需要遍历主串的大最大字符数(主串的最大位序字符)
    for (int i = 0; i < stop; i++)
    {
        if (startMatch(M + i, T))
        {
            return i;
        }
    }
    return -1;
}
// 单元测试函数
void test_subString()
{
    assert(subString("hello", "ll") == 2);
    assert(subString("aaaaa", "aa") == 0);
    assert(subString("abcdef", "def") == 3);
    assert(subString("abcdef", "xyz") == -1);
    assert(subString("a", "a") == 0);
    assert(subString("", "a") == -1);
    assert(subString("abc", "") == 0);
    assert(subString("aaa", "a") == 0);
    assert(subString("abcabc", "abc") == 0);
    assert(subString("abcabc", "bca") == 1);
    printf("All tests passed successfully.\n");
}
int main()
{

    // testIndex();
    test_subString();
    return 0;
}

字符串搜索@串的模式匹配@朴素暴力算法

文章目录

串的模式匹配

暴力匹配算法(BF算法)

分析

小结

可执行代码

其他暴力算法

网站公告

今日签到

热门文章

最新发布