kmp 算法

-> 概念说明

KMP 是三个人名的简称 Knuth-Morris-Pratt (opens new window)，该算法主要用于字符串匹配

举例说明， leetcode No.28 (opens new window), 类似自行实现字符串的 indexOf方法。

判断一个字符串是否包含另一个子串
如果包含，返回子串在父字符串中第一次出现的索引
不包含则返回 -1

最简单的实现则是暴力循环

brute-force

class Solution {
    public int strStr(String haystack, String needle) {
        int c_len = needle.length();
        int p_len = haystack.length();
        int ret = -1;
        if(c_len == 0){
            return 0;
        }
        if(p_len == 0 || p_len < c_len){
            return ret;
        }
        for(int i =0; i < p_len; i++){
            if(haystack.charAt(i) != needle.charAt(0)){
                continue;
            }
            int j = 1;
            while(j < c_len && p_len - i >= c_len){
                if(haystack.charAt(i+j) == needle.charAt(j)){
                    j++;
                }else{
                    break;
                }
            }
            if(j == c_len){
                ret = i;
                break;
            }
        }
        return ret;
    }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31

-> 使用说明

虽然能实现需求，但逐位遍历效率太低。
KMP算法提供了一种实现：根据已经匹配的部分，得出一种准确向后移动多位的规律，而不必逐位遍历。从而实现效率提升。

这种方法需要额外维护一张表，称之为部分匹配表(partial match table), 举例说明如下：

char:  | a | b | a | b | a | b | c | a |
index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 
value: | 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 |

1
2
3

假设要搜索的目标字符串为 abababca, 那么它的部分匹配表则为 [0, 0, 1, 2, 3, 4, 0, 1]

当发现不匹配需要向后移位时，移动规律为

移动位数 = 已匹配的字符个数 - 对应(已匹配部分的最后一个字符)部分匹配表的值

假设原字符串为 bacbababaabcbab,

则第一次匹配位置为
```
bacbababaabcbab
 |
 abababca
```
1
2
3
此时，已匹配个数为 1, 对应部分匹配表的值 table[0]即为 0, 因此向后移动 1位
移动之后不匹配，继续向后遍历，第二次匹配位置为
```
bacbababaabcbab
    |||||
    abababca
```
1
2
3
此时，已匹配个数为 5, 对应部分匹配表的值 table[4]即为 3, 因此向后移动 2位
```
bacbababaabcbab
    xx|||
      abababca
```
1
2
3
此时，已匹配个数为 3, 对应部分匹配表的值 table[2]即为 1, 继续向后移动 2位
```
bacbababaabcbab
    xxxx|
        abababca
```
1
2
3
至此，搜索字符串长度已经超过原字符串剩余部分，故可以得出结论，不存在这样的子串

-> partial match table

可以看到，KMP算法借助部分匹配表，空间换时间，可以打破每次只移动一位的限制，提升效率。
核心部分就是这个表该如何得到。
简单来说是：每个位置子串的 prefix和 suffix的最长交集部分的长度

prefix 指除了最后一个字符以外，一个字符串的全部头部的组合
suffix 指除了第一个字符以外，一个字符串的全部尾部的组合
以 abababca为例：
位置 0的子串a, prefix 为空, suffix 为空，
- 没有交集, 故值为 0
位置 1的子串ab, prefix 为 ["a"], suffix 为 ["b"]，
- 没有交集, 故值为 0
位置 2的子串aba, prefix 为 ["a", "ab"], suffix 为 ["ba", "a"]
- 存在一个交集 "a", 长度为 1，故值为 1
位置 3的子串abab, prefix 为 ["a", "ab", "aba"], suffix 为 ["bab", "ab", "b"]
- 存在一个交集 "ab", 长度为 2, 故值为 2
位置 4的子串ababa, prefix 为 ["a", "ab", "aba", "abcb"], ["baba", "aba", "ba", "a"]
- 存在交集为 ["a", "aba"], 最长为 "abc", 长度为 3, 故值为 3

......

以此类推，可以得到部分匹配表每个位置的值。

部分匹配表在具体实现时通常称之为 next数组, 假设搜索字符串为 s, 每个位置上的字符为 s[n]

next[0] 一定为 0
next[1]的值，由 s[0] == s[1] 决定，相等则为 1, 不相等则为 0
next[n]的值，由前面的值决定
- 如果 next[n] == s[next[n-1]]成立, 则 next[n] = next[n-1] + 1
- 如果不成立，如果 next[n-1] 为 0，则 next[n] 为 0
  - 如果 next[n-1] 大于 0，则令 next[n-1] = next[next[n-1] - 1] 继续上述比较，直到 next[n-1]为 0为止

具体实现如下：

next 数组

public int[] getNext(String needle) {
    int len = needle.length();
    if(len == 0){
        return new int[]{0};
    }

    int[] next = new int[len];

    int i = 1;
    int now = 0;
    while(i < len){
        if(needle.charAt(i) == needle.charAt(now)){
            next[i] = now + 1;
            i++;
            now++;
        }else if(now == 0){
            i++;
        }else{
            now = next[now-1];
        }
    }
    return next;
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23

因此，即可实现完整的 KMP算法

KMP

class Solution {
    public int strStr(String haystack, String needle) {
        int c_len = needle.length();
        int p_len = haystack.length();
        int ret = -1;
        if(c_len == 0){
            return 0;
        }
        if(p_len == 0 || p_len < c_len){
            return ret;
        }

        int[] next = this.getNext(needle);
        int j = 1;
        for(int i =0; i < p_len; i = i + j - next[j-1]){
            if(haystack.charAt(i) != needle.charAt(0)){
                continue;
            }
            j = 1;
            while(j < c_len && p_len - i >= c_len){
                if(haystack.charAt(i+j) == needle.charAt(j)){
                    j++;
                }else{
                    break;
                }
            }
            if(j == c_len){
                ret = i;
                break;
            }
        }
        return ret;
    }
    public int[] getNext(String needle) {
        int len = needle.length();
        if(len == 0){
            return new int[]{0};
        }
        
        int[] next = new int[len];
        
        int i = 1;
        int now = 0;
        while(i < len){
            if(needle.charAt(i) == needle.charAt(now)){
                next[i] = now + 1;
                i++;
                now++;
            }else if(now == 0){
                i++;
            }else{
                now = next[now-1];
            }
        }
        return next;
    }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57

-> 总结

KMP 算法的原理是根据要搜索的字符串自身特点，总结出遍历过程中可以直接跳过的情况的规律，从而减少了总的遍历次数。
因此，如果被搜索字符串没有重复的情况，该算法不仅没有提升效率，还略微增加了一点负担。
相反，如果被搜索字符串自身包含大量重复，该算法优化的提升则会非常明显。

编辑此页

#KMP

更新于: 2019-10-23 16:32

← 数组划分 manacher 算法→