Class LongestCommonSubsequenceDistance

java.lang.Object
org.cicirello.sequences.distance.EditDistance
org.cicirello.sequences.distance.LongestCommonSubsequenceDistance
All Implemented Interfaces:
SequenceDistanceMeasurer, SequenceDistanceMeasurerDouble

public final class LongestCommonSubsequenceDistance extends EditDistance
LongestCommonSubsequenceDistance is a form of EditDistance, where the edit operations are limited to deletions and insertions (i.e., no replacements or changes), and where the cost of an edit operation is simply 1. It is the minimum number of element insertions and deletions necessary to transform one sequence into the other. It can be computed as: N + M - 2 * lcs(s1, s2), where lcs(s1, s2) is the length of the longest common subsequence of sequences s1 and s2, and N and M are the lengths of s1 and s2 (see section 5 of Wagner and Fischer (1974)).

This class supports computing distance for Java String objects or arrays of any of the primitive types, or arrays of objects. It makes no assumptions about the contents of the Strings or arrays, and they can contain duplicates, or can be such that some elements only appear in one or the other of the sequences, or can be of different lengths.

Runtime: O(n*m), where n and m are the lengths of the two sequences (i.e., Strings or arrays).

If you need to compute this distance for permutations (i.e., same length, same set of elements, unique elements), then it is recommended to instead use the ReinsertionDistance class. That class exploits the common length, common set of elements, and unique elements properties of permutations to more efficiently (in O(n lg n) time) compute the longest common subpermutation (i.e., that class does not delegate the work to the edit distance algorithm). However, the result of ReinsertionDistance is half of what LongestCommonSubsequenceDistance would compute. This is because for permutations the elements that would be inserted are exactly the same as those that would be deleted by the edit operations, and ReinsertionDistance is defined as an edit distance with one edit operation, removal/reinsertion (i.e., a deletion is only half the operation, and the insertion is the other half of the operation).

Wagner and Fischer's String Edit Distance was introduced in:
R. A. Wagner and M. J. Fischer, "The string-to-string correction problem," Journal of the ACM, vol. 21, no. 1, pp. 168–173, January 1974.

  • Constructor Details

    • LongestCommonSubsequenceDistance

      public LongestCommonSubsequenceDistance()
      Constructs a longest common subsequence distance.