KMP 알고리즘

Drill_Labito 2023. 1. 9. 16:10

Reference : https://www.geeksforgeeks.org/kmp-algorithm-for-pattern-searching/?ref=gcse

KMP Algorithm for Pattern Searching - GeeksforGeeks

A Computer Science portal for geeks. It contains well written, well thought and well explained computer science and programming articles, quizzes and practice/competitive programming/company interview Questions.

www.geeksforgeeks.org

Reference 2 : https://youtu.be/GTJr8OvyEVQ

KMP ( Knuth Morris Pratt ) 알고리즘은, 문자열을 처리하는 알고리즘 중 한가지 방법으로 이전에 다뤘던 Pattern Searching 알고리즘 방법 중 한가지 방법이다.

이전에 Naive Pattern Searching 방법으로 무식하게 패턴을 이중반복문 돌려가면서 짜는 방식의 경우, 최악의 경우 시간복잡도가 O( NM ) 이므로 ( N : Text 길이, M : 패턴 길이 ) 비효율적인 방법이다.

KMP 알고리즘의 경우, 결론을 먼저 말하면 시간복잡도와, 공간복잡도의 효율을 따지면 다음과 같다.

- 시간복잡도 ( Time Complexity ) : O( N )

- 공간복잡도 ( Auxiliary Space ) : O( M )

( N : Text 길이, M : Pattern 길이 )

이제 KMP 알고리즘의 원리를 살펴보자.

처음에 이 알고리즘을 이해하기가 어려워서, Reference1, 2 에 있는 설명글 및, 영상을 여러번 돌려보고나서야 이해가 되었다.

먼저 KMP 알고리즘을 알기전에 사전지식으로 Prefix, Suffix 라는 개념을 알아야 한다.

예를들어 "ABC" 라는 문자열이 있을 때, Prefix, Suffix 는 다음과 같다.

- Prefix : "", "A", "AB", "ABC"

- Suffix : "", "C", "BC", "ABC"

즉, 주어진 문자열의 앞에서부터 하나씩 부분문자열을 잘라낸것을 Prefix, 뒤에서부터 하나씩 잘라낸 부분문자열을 Suffix 라고 이해하면 되지 싶다.

이제 이 개념을 알았으니, KMP 알고리즘을 이용하기 전, 전처리 작업을 진행해보자.

lps[ ] ( Longest proper Prefix String ; 참고한 사이트에서 이런 명칭을 쓰고 있어서 그대로 들고옴 ) 배열을 구하는게 전처리 작업인데,

lps 배열의 사이즈는 Pattern 사이즈를 M이라고 할 때, M과 동일하게 취급하면 된다.

txt 를 입력으로 들어온 문자열이라고 하고,

pat 를 찾는 패턴이라고 할 때

lps[i] = the longest proper prefix of pat[0..i] which is also a suffix of pat[0..i]

라고 정의해보자.

즉 lps[i] 는 부분문자열의 prefix == suffix 가 될 수 있는 가장 긴 것의 길이이다.

( lps[i] could also be defined as the longest prefix which is also a proper suffix. We need to use it properly in one place to make sure that the whole substring is not considered. )

=> 이 때, prefix 가 0~i 까지의 부분문자열과 같으면 안된다.

말로설명하면 헷갈리니, 직접 lps를 구하는 것을 보자.

참고한 사이트에선, lps 를 구할 때, 패턴 문자 전체에 대해 일치하는 경우는 빼고 구하고 있어, 그대로 적용해봄.

ex1 ) "AAAA"

i / substring / lps[i]

0 / A / 0

1 / AA / 1 ( prefix : A, suffix : A )

2 / AAA / 2 ( prefix : AA, suffix : AA )

3 / AAAA / 3 ( prefix : AAA, suffix : AAA )

lps[] = { 0, 1, 2, 3 }

ex2 ) "AABAACAABAA"

2번 예제부턴, 바로 부분문자열의 prefix, suffix 가 가장 긴 경우만 바로 표시를 해볼 것이다.

i / substring / lps[i]

0 / A / 0

1 / AA / 1 ( prefix : A, suffix : A )

2 / AAB / 0 ( X )

3 / AABA / 1 ( prefix : A, suffix : A )

4 / AABAA / 2 ( prefix : AA, suffix : AA )

5 / AABAAC / 0 ( X )

6 / AABAACA / 1 ( prefix : A, suffix : A )

7 / AABAACAA / 2 ( prefix : AA, suffix : AA )

8 / AABAACAAB / 3 ( prefix : AAB, suffix AAB )

9 / AABAACAABA / 4 ( prefix : AABA, suffix : AABA )

10 / AABAACAABAA / 5 ( prefix : AABAA, suffix : AABAA )

lps[] = { 0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5 }

ex3 ) "AAABAAA"

i / substring / lps[i]

0 / A / 0

1 / AA / 1 ( prefix : A, suffix : A )

2 / AAA / 2 ( prefix : AA, suffix : AA )

3 / AAAB / 0 ( X )

4 / AAABA / 1 ( prefix : A, suffix : A )

5 / AAABAA / 2 ( prefix : AA, suffix : AA )

6 / AAABAAA / 3 ( prefix : AAA, suffix : AAA )

예시를 통해 보면, lps[i] 는, 주어진 패턴 문자열의 부분문자열 ( substring ) 중에서, prefix, suffix 가 가장 길게 일치하는 길이를 담아놓는다고 이해할 수 있다.

이제 이 lps 배열을 이용하여, 빠르게 Text 문자열에서, pattern 문자열을 구하는 과정을 알아보자.

아래 과정은 lps 를 구하는 과정을 나타낸 것이다.

-------------------------------

ex )

- pat[] = "AAACAAAA"

len = 0, i = 0.

lps[0] is always 0, we move to i = 1

len = 0, i = 1

Since pat[len] and pat[i] match, do len++,

store it in lps[i] and do i++.

len = 2, lps[2] = 2, i = 3

len = 2, i = 3.

Since pat[len] and pat[i] do not match, and len > 0,

set len = lps[len - 1] = lps[1] = 1

len = 1, i = 3.

Since pat[len] and pat[i] do not match and len > 0,

len = lps[len - 1] = lps[0] = 0

len = 0, i = 3.

Since pat[len] and pat[i] do not match and len = 0,

Set lps[3] = 0 and i = 4

We know that characters pat

len = 0, i = 4

Since pat[len] and pat[i] match, do len++,

store it in lps[i] and do i++. len = 1, lps[4] = 1, i = 5

len = 1, i = 5.

Since pat[len] and pat[i] match, do len++,

store it in lps[i] and do i++.

len = 2, lps[5] = 2, i = 6

len = 2, i = 6

Since pat[len] and pat[i] match, do len++,

store it in lps[i] and do i++.

len = 2, lps[5] = 2, i = 6

len = 2, i = 6

Since pat[len] and pat[i] match, do len++,

store it in lps[i] and do i++.

len = 3, lps[6] = 3, i = 7

len = 3, i = 7.

Since pat[len] and pat[i] do not match and len > 0,

set len = lps[len - 1] = lps[2] = 2

len = 2, i = 7.

Since pat[len] and pat[i] match, do len++,

store it in lps[i] and do i++.

len = 3, lps[7] = 3, i = 8

We stop here as we have constructed the whole lps[ ]

-------------------------------

이제 이렇게 구한 lps를 활용하여, 이전에 무식한 방법으로 했던 Naive 알고리즘과 어떻게 차이가 있는지 알아보자.

먼저 KMP 알고리즘에선, lps 를 이용하여, Text 문자열에서 pattern 문자열과 다른 부분이 나오면, lps 배열을 참고하여, 바로 pattern 의 처음으로 돌아가는것이 아니라, lps 에서 들어있는 내용물의 이전 인덱스로 돌아간다.

말로 설명하는것보다, 실제로 예시를 통해 이해를 해보자.

-----------------------------

ex ) txt[] = "AAAAABAAABA", pat[] = "AAAA"

lps[] = { 0, 1, 2, 3 }

i = 0, j = 0

txt[] = "AAAAABAAABA"

pat[] = "AAAA"

txt[i] and pat[j] match, do i++, j++

i = 1, j = 1

txt[] = "AAAAABAAABA"

pat[] = "AAAA"

txt[i] and pat[j] match, do i++, j++

i = 2, j = 2

txt[] = "AAAAABAAABA"

pat[] = "AAAA"

pat[i] and pat[j] match, do i++, j++

i = 3, j = 3

txt[] = "AAAAABAAABA"

pat[] = "AAAA"

txt[i] and pat[j] match, do i++, j++

i = 4, j = 4

Since j == M, print pattern found and reset j,

j = lps[j - 1] = lps[3] = 3

Here unlike Naive algorithm, we do not match first three characters of this window.

Value of lps[j - 1] ( in above step ) gave us index of next character to match.

i = 4, j = 3

txt[] = "AAAAABAAABA"

pat[] = "AAAA"

txt[i] and pat[j] match, do i++, j++

i = 5, j = 4

Since j == M, print pattern found and reset j, j = lps[j - 1] = lps[3] = 3

Again unlike Naive algorithm, we do not match first three characters of this window.

Value of lps[j - 1] ( in above step ) gave us index of next character to match

i = 5, j = 3

txt[] = "AAAAABAAABA"

pat[] = "AAAA"

txt[i] and pat[j] do NOT match and j > 0, change only j

j = lps[j - 1] = lps[2] = 2

i = 5, j = 3

txt[] = "AAAAABAAABA"

pat[] = "AAAA"

txt[i] and pat[j] do NOT match and j > 0, change only j

j = lps[j - 1] lps[2] = 2

i = 5, j = 2

txt[] = "AAAAABAAABA"

pat[] = "AAAA"

txt[i] and pat[j] do NOT match and j > 0, change only j

j = lps[j - 1] = lps[1] = 1

i = 5, j = 1

txt[] = "AAAAABAAABA"

pat[] = "AAAA"

txt[i] and pat[j] do NOT match and j > 0, change only j

j = lps[j - 1] = lps[0] = 0

i = 5, j = 0

txt[] = "AAAAABAAABA"

pat[] = "AAAA"

txt[i] and pat[j] do NOT match and j is 0, we do i++

i = 6, j = 0

txt[] = "AAAAABAAABA"

pat[] = "AAAA"

txt[i] and pat[j] match, do i++ and j++

i = 7, j = 1

txt[] = "AAAAABAAABA"

pat[] = "AAAA"

txt[i] and pat[j] match, do i++ and j++

We continue this way till there are sufficient characters in the text to be compared with the characters in the pattern.

-----------------------------

이제 위 과정을 구현해보면 아래와 같다.

#include <iostream>

#include <string.h>

using namespace std;

// KMP algorithm implementation

void computeLPSArray(char* pat, int M, int* lps);

// Prints occurrences of txt[] in pat[]

void KMPSearch(char* pat, char* txt)

{

int M = strlen(pat);

int N = strlen(txt);

// craate lps[] that will hold the longest prefix suffix

// values for pattern

int lps[1000]; // int lps[M];

// Preprocess the pattern ( calculate lps[] array )

computeLPSArray(pat, M, lps);

int i = 0; // index for txt[]

int j = 0; // index for pat[]

while ((N - i) >= (M - j))

{

if (pat[j] == txt[i])

{

j++;

i++;

}

if (j == M)

{

printf("Found pattern at index %d\n", i - j);

j = lps[j - 1];

}

// mismatch after j matches

else if (i < N && pat[j] != txt[i])

{

// Do not match lps[0..lps[j-1]] characters,

// they will match anyway

if (j != 0)

j = lps[j - 1];

else

i = i + 1;

}

// Fills lps[] for given pattern pat[0..M-1]

void computeLPSArray(char* pat, int M, int* lps)

{

// length of the previous longest prefix suffix

int len = 0;

lps[0] = 0; // lps[0] is always 0

// the loop calculates lps[i] for i = 1 to M-1

int i = 1;

while (i < M)

{

if (pat[i] == pat[len])

{

len++;

lps[i] = len;

i++;

}

else // (pat[i] != pat[len])

{

// This is tricky. Consider the example.

// AAACAAAA and i = 7. The idea is similar

// to search step

if (len != 0)

{

len = lps[len - 1];

// Also, note that we do not increment

// i here

}

else // if (len == 0)

{

lps[i] = 0;

i++;

}

int main()

{

char txt[] = "ABABDABACDABABCABAB";

char pat[] = "ABABCABAB";

KMPSearch(pat, txt);

return 0;

}