Suffix array can be used to find longest common substring (LCS) among multiple strings. This post illustrate the algorithm by an example of finding LCS among two strings.
Algorithm
The idea is that LCS for two strings is the longest common prefix of some suffices of the two string. Suppose we have two strings abc and bcd, then the suffix of the two strings will be {abc, bc, c} and {bcd, cd, d} respectively. If we sort them according to alphabetical order, it will be rearranged as below,
Figure 1. Sorted Suffices of Both abc and bcd
It is not difficult to tell that the LCS is bc, which is longest common prefix of the two suffices from different strings.
This observation leads to the algorithm below,
1. combine two input strings into one single string
2. compute the suffix array
3. find the longest common prefix of two neighboring suffix that start with suffix of different input strings
Using the same example abc and bcd, we combine the strings to get a single string abcbcd, then the algorithm can be carried out as figure below,
Figure 2. LCS Algorithm using Suffix Array
There are two prefix of the neighboring suffices that start with suffix of different input strings, bc of {bcbcd, bcd} and c of {cbcd, cd}. The longer one is bc. Therefore the LCS of abc and bcd is bc.
Implementation
Below is a sample implementation of the algorithm described above in C,
/**
find longest common substring of two input texts using suffix array
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int pstrcmp(const void *a, const void *b) {
//printf("pstrcmp: %s %s\n", (const char*)*(char **)a, (const char*)*(char **)b);
return strcmp((const char *)*(char **)a, (const char *)*(char **)b);
}
int lcp(char *a, char *b) {
int len1, len2;
int i;
len1 = strlen(a);
len2 = strlen(b);
for (i = 0; (i < len1) && (i < len2); ++i) {
if (a[i] != b[i]) break;
}
return i;
}
int main(int argc, char **argv) {
int i;
char **ap; //suffix pointers array
int len1, len2;
char *cstr;
int lcslen = 0, lcplen, lcssufpos = -1;
len1 = strlen(argv[1]);
len2 = strlen(argv[2]);
ap = (char**)malloc((len1+len2)*sizeof(char*));
cstr = (char*)malloc((len1+len2)*sizeof(char));
cstr = strcat(argv[1], argv[2]);
for (i = 0; i < len1+len2; ++i) {
ap[i] = &(cstr[i]);
}
for (i = 0; i < len1+len2; ++i) {
printf("%s\n", ap[i]);
}
printf("sort the suffices\n");
qsort(ap, len1+len2, sizeof(char *), pstrcmp);
for (i = 0; i < len1+len2; ++i) {
printf("%s\n", ap[i]);
}
printf("finding the lcs\n");
for (i = 0; i < len1 + len2 - 1; ++i) {
if ((ap[i] - cstr >= len1) && (ap[i+1] - cstr >= len1)) {
//both starts with suffix of second string
continue;
} else if ((ap[i] - cstr < len1) && (ap[i+1] - cstr < len1)) {
//both starts with suffix of first string
continue;
} else {
lcplen = lcp(ap[i], ap[i+1]);
if (lcplen > lcslen) {
lcslen = lcplen;
lcssufpos = i;
}
}
}
if (lcssufpos == -1) {
printf("no lcs\n");
} else {
printf("%.*s\n", lcslen, ap[lcssufpos]);
}
}
Compile the code using command below,
gcc -o suflcs suflcs.c
Then a sample execution,
Figure 3. Execution of LCS Program
Note that the algorithm can be extended to more than two input strings, say K strings. Then the problem becomes finding the longest common prefix of neighboring K suffices which start at suffix of K input strings.
References:
1. http://algs4.cs.princeton.edu/63suffix/LCS.java.html