Quantcast
Channel: roman10 » Data Structure & Algorithms
Viewing all articles
Browse latest Browse all 15

Suffix Array Part 3 — Longest Common Substring (LCS)

$
0
0

Suffix array can be used to find longest common substring (LCS) among multiple strings. This post illustrate the algorithm by an example of finding LCS among two strings.

Algorithm

The idea is that LCS for two strings is the longest common prefix of some suffices of the two string. Suppose we have two strings abc and bcd, then the suffix of the two strings will be {abc, bc, c} and {bcd, cd, d} respectively. If we sort them according to alphabetical order, it will be rearranged as below,

Figure 1. Sorted Suffices of Both abc and bcd

It is not difficult to tell that the LCS is bc, which is longest common prefix of the two suffices from different strings.

This observation leads to the algorithm below,

1. combine two input strings into one single string
2. compute the suffix array
3. find the longest common prefix of two neighboring suffix that start with suffix of different input strings

Using the same example abc and bcd, we combine the strings to get a single string abcbcd, then the algorithm can be carried out as figure below,

Figure 2. LCS Algorithm using Suffix Array

There are two prefix of the neighboring suffices that start with suffix of different input strings, bc of {bcbcd, bcd} and c of {cbcd, cd}. The longer one is bc. Therefore the LCS of abc and bcd is bc.

Implementation

Below is a sample implementation of the algorithm described above in C,

/**

find longest common substring of two input texts using suffix array

*/

#include <stdio.h>

#include <stdlib.h>

#include <string.h>

 

int pstrcmp(const void *a, const void *b) {

    //printf("pstrcmp: %s %s\n", (const char*)*(char **)a, (const char*)*(char **)b);

    return strcmp((const char *)*(char **)a, (const char *)*(char **)b);

}

 

int lcp(char *a, char *b) {

    int len1, len2;

    int i;

    len1 = strlen(a);

    len2 = strlen(b);

    for (i = 0; (i < len1) && (i < len2); ++i) {

        if (a[i] != b[i]) break;

    }

    return i;

}

 

int main(int argc, char **argv) {

    int i;

    char **ap;      //suffix pointers array

    int len1, len2;

    char *cstr;

    int lcslen = 0, lcplen, lcssufpos = -1;

    len1 = strlen(argv[1]);

    len2 = strlen(argv[2]);

    ap = (char**)malloc((len1+len2)*sizeof(char*));

    cstr = (char*)malloc((len1+len2)*sizeof(char));

    cstr = strcat(argv[1], argv[2]);

    for (i = 0; i < len1+len2; ++i) {

        ap[i] = &(cstr[i]);

    }

    for (i = 0; i < len1+len2; ++i) {

        printf("%s\n", ap[i]);

    }

    printf("sort the suffices\n");

    qsort(ap, len1+len2, sizeof(char *), pstrcmp);

    for (i = 0; i < len1+len2; ++i) {

        printf("%s\n", ap[i]);

    }

    printf("finding the lcs\n");

    for (i = 0; i < len1 + len2 - 1; ++i) {

        if ((ap[i] - cstr >= len1) && (ap[i+1] - cstr >= len1)) {

            //both starts with suffix of second string

            continue;

        } else if ((ap[i] - cstr < len1) && (ap[i+1] - cstr < len1)) {

            //both starts with suffix of first string

            continue;

        } else {

            lcplen = lcp(ap[i], ap[i+1]);

            if (lcplen > lcslen) {

                lcslen = lcplen;

                lcssufpos = i;

            }

        }

    }

    if (lcssufpos == -1) {

        printf("no lcs\n");

    } else {

        printf("%.*s\n", lcslen, ap[lcssufpos]);

    }

}

Compile the code using command below,

gcc -o suflcs suflcs.c

Then a sample execution,

Figure 3. Execution of LCS Program

Note that the algorithm can be extended to more than two input strings, say K strings. Then the problem becomes finding the longest common prefix of neighboring K suffices which start at suffix of K input strings.

References:
1. http://algs4.cs.princeton.edu/63suffix/LCS.java.html


Viewing all articles
Browse latest Browse all 15

Trending Articles