This is part 2 of the post “suffix array”. It covers how a suffix array can be used to solve the string matching/searching problem. You may want to read part 1 first.
String Matching/Searching
Once we have the suffix array, it’s easy to check whether another string appears in the input string/text. If the string appears in the input text, it will be a prefix of one or more of the text’s suffixes. Since the suffix array holds the suffixes of the input text in sorted order, binary search can be applied.
Below is an implementation that finds all occurrences of a string in the input text,
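To make the prefix property concrete, here is a minimal sketch of the baseline that the suffix array improves on: it tests every suffix of the text in turn, without any sorting. The function name naive_find is made up for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical baseline: return the first position in text where pat
   occurs, or -1 if it does not occur. pat occurs at position i exactly
   when pat is a prefix of the suffix that starts at text + i. */
long naive_find(const char *text, const char *pat) {
    size_t n = strlen(text);
    size_t m = strlen(pat);
    size_t i;
    for (i = 0; i + m <= n; ++i) {
        /* compare pat against the first m characters of suffix i */
        if (strncmp(text + i, pat, m) == 0) {
            return (long)i;
        }
    }
    return -1;
}
```

This scan looks at up to n suffixes per query. Sorting the suffixes first is what lets binary search cut that down to about log n suffix comparisons.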
/**
This program demonstrates how to use a suffix array to search for a particular substring.
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
// qsort comparator: a and b each point to an element of the suffix pointer array
int pstrcmp(const void *a, const void *b) {
    //printf("pstrcmp: %s %s\n", *(const char * const *)a, *(const char * const *)b);
    return strcmp(*(const char * const *)a, *(const char * const *)b);
}
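// binary search for any suffix that has target as a prefix;
// returns its index in ap, or -1 if there is no such suffix
int mybsearch(const char *target, char **ap, int len) {
    int l, r, m, cmp;
    size_t tsize = strlen(target);
    for (l = 0, r = len - 1; l <= r;) {
        m = l + (r - l) / 2;
        // compare only the first tsize characters, and only once per step
        cmp = strncmp(target, ap[m], tsize);
        if (cmp < 0) {
            r = m - 1;
        } else if (cmp > 0) {
            l = m + 1;
        } else {
            return m;
        }
    }
    return -1;
}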
int main(int argc, char **argv) {
    int i;
    char **ap; //suffix pointers array
    int len;
    int bpos;
    size_t tsize;
    if (argc < 3) {
        fprintf(stderr, "usage: %s <text> <string to search for>\n", argv[0]);
        return 1;
    }
    len = (int)strlen(argv[1]);
    tsize = strlen(argv[2]);
    ap = (char**)malloc(len*sizeof(char*));
    if (ap == NULL) {
        perror("malloc");
        return 1;
    }
    for (i = 0; i < len; ++i) {
        ap[i] = &(argv[1][i]);
    }
    for (i = 0; i < len; ++i) {
        printf("%s\n", ap[i]);
    }
    printf("sort the suffixes\n");
    qsort(ap, len, sizeof(char *), pstrcmp);
    for (i = 0; i < len; ++i) {
        printf("%s\n", ap[i]);
    }
    printf("searching for string:%s\n", argv[2]);
    bpos = mybsearch(argv[2], ap, len);
    //index 0 is a valid hit, so test for >= 0; only touch ap[bpos] after a hit
    if (bpos >= 0) {
        printf("found at suffix array pos, original string pos: %d, %d\n", bpos, (int)(ap[bpos]-argv[1]));
        printf("found %s\n", ap[bpos]);
        //the binary search can land on any matching suffix, so we scan both to the left
        //and to the right for neighbouring suffixes that also have the target string as a prefix
        for (i = bpos-1; i >= 0; --i){
            if (strncmp(argv[2], ap[i], tsize) == 0) {
                printf("found at suffix array pos, original string pos: %d, %d\n", i, (int)(ap[i]-argv[1]));
            } else {
                break;
            }
        }
        for (i = bpos + 1; i < len; ++i) {
            if (strncmp(argv[2], ap[i], tsize) == 0) {
                printf("found at suffix array pos, original string pos: %d, %d\n", i, (int)(ap[i]-argv[1]));
            } else {
                break;
            }
        }
    } else {
        printf("not found\n");
    }
    free(ap);
    return 0;
}
The program accepts two input parameters, the input text and the string to search for.
Compile the code using the command below,
gcc -o sufsearch sufsearch.c
Below is a screenshot of the execution,
Figure 1. Searching Strings using Suffix Array
Suppose the length of the search string is m and the length of the input text is n. Each comparison of the string against a suffix of the input text takes O(m) time, and O(log n) comparisons are needed to perform the binary search. Therefore, the binary search above runs in O(m log n), excluding the cost of building the suffix array. Reporting every occurrence by scanning left and right from the hit adds O(m) for each additional match.
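If only the range of matching suffixes is needed, the neighbour scan can be replaced by two more binary searches, one for each end of the range. The sketch below is one possible way to do that and is not part of the program above. The names sa_bound and sa_count are made up for this illustration, and sa is assumed to be an array of suffix pointers already sorted with strcmp.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: first index in [0, n) whose suffix, truncated to
   m characters, is >= target (strict == 0) or > target (strict != 0).
   Returns n when no such suffix exists. sa must be sorted with strcmp. */
size_t sa_bound(const char **sa, size_t n, const char *target, size_t m, int strict) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int c = strncmp(sa[mid], target, m);
        if (c < 0 || (strict && c == 0)) {
            lo = mid + 1;   /* the answer lies to the right of mid */
        } else {
            hi = mid;       /* mid itself may be the answer */
        }
    }
    return lo;
}

/* Count the suffixes that start with target. If first is not NULL, it
   receives the index of the first matching entry in sa. */
size_t sa_count(const char **sa, size_t n, const char *target, size_t *first) {
    size_t m = strlen(target);
    size_t lo = sa_bound(sa, n, target, m, 0);
    size_t hi = sa_bound(sa, n, target, m, 1);
    if (first != NULL) {
        *first = lo;
    }
    return hi - lo;
}
```

Both searches cost O(m log n), so counting the occurrences no longer depends on how many there are. The matches are the entries sa[first] up to sa[first + count - 1].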
There are better algorithms that use longest common prefix (LCP) information to speed up the search. The run time is O(m + log n). Interested readers can refer to reference 6 for details.
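The full O(m + log n) method needs precomputed LCP tables and is not reproduced here. As a rough illustration of the underlying idea only, the sketch below uses a simpler shortcut: it remembers how many characters of the target already matched at the two boundaries of the search range and skips those characters at the midpoint. Its worst case is still O(m log n). The name sa_find_skip is made up, and sa is assumed to be an array of suffix pointers sorted with strcmp.

```c
#include <stddef.h>

/* Hypothetical sketch: return the index of some suffix in sa that starts
   with pat, or -1 if there is none.
   llcp / rlcp hold the number of leading characters of pat known to match
   the suffix just left / just right of the current search range. Every
   suffix inside the range shares at least min(llcp, rlcp) characters with
   pat, so the comparison at mid can start from there. */
int sa_find_skip(const char **sa, int n, const char *pat) {
    int l = 0, r = n - 1;
    size_t llcp = 0, rlcp = 0;
    while (l <= r) {
        int mid = l + (r - l) / 2;
        const char *s = sa[mid];
        size_t k = llcp < rlcp ? llcp : rlcp;
        /* extend the match character by character */
        while (pat[k] != '\0' && s[k] == pat[k]) {
            ++k;
        }
        if (pat[k] == '\0') {
            return mid;                 /* all of pat matched */
        }
        if ((unsigned char)s[k] < (unsigned char)pat[k]) {
            l = mid + 1;                /* suffix sorts before pat */
            llcp = k;
        } else {
            r = mid - 1;                /* suffix sorts after pat */
            rlcp = k;
        }
    }
    return -1;
}
```

The characters are compared as unsigned char so that the ordering agrees with the strcmp order used to sort the suffixes.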
Note that the input text can be preprocessed once to build the suffix array, and the search strings can be supplied later. This differs from the KMP algorithm, which preprocesses the pattern, so the pattern must be given before the computation can begin. This difference makes them suitable for different applications.
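The build-once, query-many pattern can be sketched with the standard library alone, using qsort for the preprocessing step and bsearch for each query. The names pattern_key, build_index and find_in_index are made up for this illustration, and this is just one possible arrangement.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical search key: the pattern plus its length, so the bsearch
   comparator can limit the comparison to the pattern's length. */
struct pattern_key {
    const char *pat;
    size_t len;
};

/* qsort comparator over an array of suffix pointers */
static int cmp_suffix(const void *a, const void *b) {
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* bsearch comparator: the key comes first, the array element second */
static int cmp_key(const void *key, const void *elem) {
    const struct pattern_key *k = key;
    return strncmp(k->pat, *(const char *const *)elem, k->len);
}

/* Preprocessing, done once per text: returns the sorted suffix pointers
   and stores their count in *n_out. The caller frees the result. */
const char **build_index(const char *text, size_t *n_out) {
    size_t n = strlen(text);
    size_t i;
    const char **sa = malloc((n ? n : 1) * sizeof *sa);
    if (sa == NULL) {
        return NULL;
    }
    for (i = 0; i < n; ++i) {
        sa[i] = text + i;
    }
    qsort(sa, n, sizeof *sa, cmp_suffix);
    *n_out = n;
    return sa;
}

/* Query, repeatable for any number of patterns: returns the position in
   text of one occurrence of pat, or -1 if pat does not occur. */
long find_in_index(const char **sa, size_t n, const char *text, const char *pat) {
    struct pattern_key key;
    const char **hit;
    key.pat = pat;
    key.len = strlen(pat);
    hit = bsearch(&key, sa, n, sizeof *sa, cmp_key);
    return hit != NULL ? (long)(*hit - text) : -1;
}
```

A caller would run build_index once for the text, call find_in_index for each search string as it arrives, and free the array at the end. When a pattern occurs more than once, bsearch may return any one of the matching suffixes.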
References:
1. Suffix Array, Wikipedia. http://en.wikipedia.org/wiki/Suffix_array
2. Brief Introduction to Suffix Array: http://sary.sourceforge.net/docs/suffix-array.html
3. Algorithms, 4th edition website: http://algs4.cs.princeton.edu/63suffix/
4. Programming Pearls, 2nd Edition.
5. libdivsufsort, http://code.google.com/p/libdivsufsort/
6. Suffix Arrays: A New Method for On-Line String Searches: http://delivery.acm.org/10.1145/330000/320218/p319-manber.pdf?ip=137.132.250.14&acc=ACTIVE%20SERVICE&CFID=67974067&CFTOKEN=52233823&__acm__=1330495809_b1bcb53c53d3d7a276f870cb1e0fdf89