Tuesday, June 12, 2012

Boyer-Moore algorithm

This is an efficient algorithm for string search. On wikipedia, I found it difficult to understand. But then, I finally learnt it. What I found is that a good way of learning is to see how does different cases work - learn by example. So, I decided to have a blog where I present examples followed by algo/code. Note that I am assuming here that the string from which strings search is performed to be made has ASCII characters only. This is just to make the algorithm simpler and easier to understand.
Let me first give an insight in layman terms, on how this algorithm works:
The string is searched starting from the end towards the beginning.
Before start searching there is a preprocessing done on the string to be searched (S). There is a small table created (I call it jump table). As the name suggests, the table is used for jumping to a new location in the Original string (T) in case the string is not matched at the current location.
Let's assume S: LAP
And T: THIS IS A LAPTOP
To create the jump table, we will give each character in S as the position value:
LAP
210


i.e., L is given 2, A as 1 and P as 0.
The algorithm starts comparing from ^ and moves towards left as shown below:
T: THIS IS A LAPTOP
S: LAP
 <-- ^


In this case, the match is found at I in string T. 
Now since I is not in the jump table, string S starting location becomes 'S', i.e., at the next location to last mismatched location.

T: THIS IS A LAPTOP
S:    LAP
   <--  ^
In this iteration also, the same follows, so the next iteration point is:
T: THIS IS A LAPTOP
S:       LAP
      <--  ^
And the next iteration follows the same:
T: THIS IS A LAPTOP
S:          LAP
         <--  ^
Now, in this iteration, the mismatch character in T is 'A', which is in the jump table - has a value 1. This means that the jump for the next iteration is 1 character. Hence the next iteration looks like.

T: THIS IS A LAPTOP
S:           LAP
          <--  ^

For this iteration, the match is found, hence the algorithm terminates.
If there is no mismatch, the algorithm terminates when the string T ends.
I hope, I am able to clearly state the algorithm using the above example.
I have tried to capture the same in the C/C++ code as follows. Note that for the jump table I have used an array of 256 entries, one for each ASCII character. This can be replaced with different implementations, but the idea remains the same.
Now the code:

// The algorithm here is boyer-moore's string matching algorithm
#include <iostream>
#include <cstring>
#define MAX 256
using namespace std;


// Assuming that the array has been initialized to appropriated values
void PreprocessString(char *str, int size, int *arr, int sarr)
{
if (str == NULL)
{
cout << "String passed is null" << endl;
return;
}

if (size <= 0)
{
cout << "String can't be processed as its size is zero or negative" << endl;
return;
}

if (arr == NULL)
{
cout << "Int array is NULL" << endl;
return;
}

if (sarr <= 0)
{
cout << "Array size is zero or negative" << endl;
return;
}

for (int i = 0; i < size; i++)
if (arr[str[i]] == -1)
arr[str[i]] = size - i - 1;

return;
}

int BoyerMoore(char *T, char *S, int lt, int ls)
{
if (T == NULL)
{
cout << "String to be searched is NULL" << endl;
return -1;
}
if (S == NULL)
{
cout << "Pattern string is NULL" << endl;
return -1;
}

if (lt <= 0)
{
cout << "Actual string can't be processed as its size is zero or negative" << endl;
return -1;
}
if (ls <= 0)
{
cout << "Pattern string can't be processed as its size is zero or negative" << endl;
return -1;
}

// Initialize the array assuming the ascii characters in the array
int arr[MAX];

for (int i = 0; i < MAX; i++)
{
arr[i] = -1;
}

PreprocessString(S, ls, arr, MAX);

// This will tell how much to jump for next iteration
int jump = 0;

for(int counter = 0; counter <= lt - 1;)
{
// Now perform the backward search
int j;
for (j = ls - 1; j >= 0; j--)
{
if (S[j] != T[counter + j])
break;
}

// if j < 0, we have a match at counter location
if (j < 0) return counter;

// Else we need to adjust jump
if (arr[T[counter + j]] != -1)
jump = arr[T[counter + j]];
else
jump = j == 0 ? 1 : j;

counter += jump;
}

return -1;
}


I appreciate any feedback!

No comments:

Post a Comment

Post a Comment