Class KendallTauSequenceDistance

java.lang.Object
org.cicirello.sequences.distance.KendallTauSequenceDistance
All Implemented Interfaces:
SequenceDistanceMeasurer, SequenceDistanceMeasurerDouble

public final class KendallTauSequenceDistance extends Object implements SequenceDistanceMeasurer
Kendall Tau Sequence Distance is the minimum number of adjacent swaps necessary to transform one sequence into the other. It is an edit distance with adjacent swap as the edit operation. It is applicable only if both sequences are the same length and contain the same set of elements.

As a distance metric, Kendall Tau Distance originated specifically to measure the distance between permutations (i.e., sequences of unique elements). The Kendall Tau Sequence Distance implemented here extends Kendall Tau Distance to general sequences (i.e., strings that can contain duplicate elements).

Consider this example. Let s1 = "abcdaabb" and s2 = "dcbababa". The shortest sequence of adjacent swaps to edit s2 into s1 is the following sequence of 9 swaps: "cdbababa", "cbdababa", "bcdababa", "bcadbaba", "bacdbaba", "abcdbaba", "abcdabba", "abcdabab", "abcdaabb".
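The example above can be checked with a minimal sketch of the definition. It assumes the usual reduction: match the k-th occurrence of each element in s1 with its k-th occurrence in s2, then count inverted pairs, since each inversion costs one adjacent swap. The class name KendallTauSequenceSketch and its quadratic inversion count are hypothetical and chosen for readability; this is not the library's implementation.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;

final class KendallTauSequenceSketch {

    // Quadratic-time illustration of the definition (hypothetical helper, not the library's code).
    static int distance(String s1, String s2) {
        if (s1.length() != s2.length()) {
            throw new IllegalArgumentException("sequences must have the same length");
        }
        int n = s1.length();
        // For each character, queue the positions where it occurs in s2, left to right.
        Map<Character, ArrayDeque<Integer>> positions = new HashMap<>();
        for (int i = 0; i < n; i++) {
            positions.computeIfAbsent(s2.charAt(i), c -> new ArrayDeque<>()).add(i);
        }
        // Match the k-th occurrence of each character in s1 with its k-th occurrence in s2.
        int[] target = new int[n];
        for (int i = 0; i < n; i++) {
            ArrayDeque<Integer> queue = positions.get(s1.charAt(i));
            if (queue == null || queue.isEmpty()) {
                throw new IllegalArgumentException("sequences must contain the same elements");
            }
            target[i] = queue.poll();
        }
        // Count pairs that appear in the opposite relative order.
        int inversions = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (target[i] > target[j]) {
                    inversions++;
                }
            }
        }
        return inversions;
    }

    public static void main(String[] args) {
        System.out.println(distance("abcdaabb", "dcbababa")); // prints 9
    }
}
```

For the two example strings the matched positions are {3, 2, 1, 0, 5, 7, 4, 6}, which contain 9 inversions, in agreement with the 9 swaps listed.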

This class provides implementations of two algorithms. Both can compute the distance between two arrays of any primitive type (int, long, short, byte, char, float, double, or boolean) as well as between two String objects. In these cases, the runtime of both algorithms is O(n lg n), where n is the length of the array or String.
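One standard way to reach an O(n lg n) bound for this kind of problem is to count inversions with a modified merge sort. The sketch below assumes the sequence has already been reduced to an array of integers and uses hypothetical names (InversionCounter, countInversions); it illustrates the technique and is not taken from the library's source.

```java
final class InversionCounter {

    // Counts pairs i < j with a[i] > a[j] in O(n lg n) time; works on a copy of the input.
    static int countInversions(int[] a) {
        int[] work = a.clone();
        return sortAndCount(work, new int[work.length], 0, work.length);
    }

    // Sorts work[lo, hi) and returns the number of inversions inside that range.
    private static int sortAndCount(int[] work, int[] buffer, int lo, int hi) {
        if (hi - lo < 2) {
            return 0;
        }
        int mid = (lo + hi) >>> 1;
        int count = sortAndCount(work, buffer, lo, mid) + sortAndCount(work, buffer, mid, hi);
        int i = lo, j = mid, k = lo;
        while (i < mid && j < hi) {
            if (work[i] <= work[j]) {
                buffer[k++] = work[i++];
            } else {
                // work[j] is smaller than every element still waiting in the left half.
                count += mid - i;
                buffer[k++] = work[j++];
            }
        }
        while (i < mid) {
            buffer[k++] = work[i++];
        }
        while (j < hi) {
            buffer[k++] = work[j++];
        }
        System.arraycopy(buffer, lo, work, lo, hi - lo);
        return count;
    }

    public static void main(String[] args) {
        // For each character of "abcdaabb", the index of its matching occurrence in "dcbababa".
        System.out.println(countInversions(new int[] {3, 2, 1, 0, 5, 7, 4, 6})); // prints 9
    }
}
```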

If you are computing the distance between two arrays of Objects, the two algorithms have the following restrictions. The default algorithm requires the objects to be of a class that overrides the hashCode and equals methods of the Object class. The alternate algorithm requires the objects to be of a class that implements the Comparable interface and overrides the equals method of the Object class.

The runtime of the default algorithm for arrays of objects is O(h(m) n + n lg n), where n is the array length, m is the size of the objects in the array, and h(m) is the runtime to compute a hash of an object of size m. The runtime of the alternate algorithm for arrays of objects is O(c(m) n lg n), where n and m are as before, and c(m) is the runtime of the compareTo method for objects of size m.

The default algorithm is preferred in most cases. The alternate algorithm may run faster if the cost to compare objects, c(m), is significantly less than the cost to hash objects, h(m).
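To see why one algorithm needs hashCode while the other needs Comparable, it can help to look at two ways of mapping objects to integer labels so that equal objects share a label. The sketch below is a hypothetical illustration (RelabelSketch, labelsByHashing, and labelsBySorting are invented names), and the library's actual internals may differ. Hashing does about h(m) work per element, while sorting performs O(n lg n) comparisons at c(m) each. For brevity it labels a single array; a distance computation would apply the same map or sorted array to both sequences.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class RelabelSketch {

    // Hash-based labeling: relies on hashCode and equals of the element class.
    static <T> int[] labelsByHashing(T[] s) {
        Map<T, Integer> ids = new HashMap<>();
        int[] labels = new int[s.length];
        for (int i = 0; i < s.length; i++) {
            Integer id = ids.get(s[i]);
            if (id == null) {
                id = ids.size(); // next unused label
                ids.put(s[i], id);
            }
            labels[i] = id;
        }
        return labels;
    }

    // Comparison-based labeling: relies on compareTo of the element class.
    static <T extends Comparable<T>> int[] labelsBySorting(T[] s) {
        T[] sorted = s.clone();
        Arrays.sort(sorted);
        int[] labels = new int[s.length];
        for (int i = 0; i < s.length; i++) {
            labels[i] = lowerBound(sorted, s[i]);
        }
        return labels;
    }

    // Index of the first element in sorted that is not less than key.
    private static <T extends Comparable<T>> int lowerBound(T[] sorted, T key) {
        int lo = 0, hi = sorted.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid].compareTo(key) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        String[] s = {"b", "a", "b", "c"};
        System.out.println(Arrays.toString(labelsByHashing(s))); // prints [0, 1, 0, 2]
        System.out.println(Arrays.toString(labelsBySorting(s))); // prints [1, 0, 1, 3]
    }
}
```

The label values differ between the two methods, but in both cases equal elements receive equal labels and distinct elements receive distinct labels, which is all that an inversion count over the labeled sequence needs.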

Runtime: O(n lg n) for String objects and sequences of primitives, where n is the length of the sequence.

If your sequences are guaranteed not to have duplicates, and to contain the same set of elements, then consider instead using the KendallTauDistance class, which assumes permutations of the integers from 0 to N-1.

This distance metric and both algorithms were first described in the paper:
V.A. Cicirello, "Kendall Tau Sequence Distance: Extending Kendall Tau from Ranks to Sequences," Industrial Networks and Intelligent Systems, 7(23), Article e1, April 2020.

  • Constructor Details

    • KendallTauSequenceDistance

      public KendallTauSequenceDistance()
      The KendallTauSequenceDistance class provides two algorithms. This constructor selects the default algorithm, which requires sequence elements to either be primitives (e.g., byte, short, int, long, char, float, double, boolean) or to be objects of a class that overrides the hashCode and equals methods of the Object class.
    • KendallTauSequenceDistance

      public KendallTauSequenceDistance(boolean useAlternateAlg)
      The KendallTauSequenceDistance class provides two algorithms. This constructor enables you to select which algorithm to use.

      The default algorithm requires sequence elements to either be primitives (e.g., byte, short, int, long, char, float, double, boolean) or to be objects of a class that overrides the hashCode and equals methods of the Object class.

      The alternate algorithm requires sequence elements to either be primitives (e.g., byte, short, int, long, char, float, double, boolean) or to be objects of a class that implements the Comparable interface, and overrides the equals method of the Object class.

      Under most conditions, the preferred algorithm is the default. The alternate algorithm may be desirable if the cost to compare objects is significantly less than the cost to hash objects, or if the objects are of a class that implements Comparable but which does not provide an implementation of hashCode.

      Parameters:
      useAlternateAlg - Pass true to use the alternate algorithm, or false to use the default algorithm.
  • Method Details

    • distance

      public int distance(int[] s1, int[] s2)
      Measures the distance between two arrays.
      Specified by:
      distance in interface SequenceDistanceMeasurer
      Parameters:
      s1 - First array.
      s2 - Second array.
      Returns:
      distance between s1 and s2
      Throws:
      IllegalArgumentException - if sequences are of different lengths, or contain different elements
    • distance

      public int distance(long[] s1, long[] s2)
      Measures the distance between two arrays.
      Specified by:
      distance in interface SequenceDistanceMeasurer
      Parameters:
      s1 - First array.
      s2 - Second array.
      Returns:
      distance between s1 and s2
      Throws:
      IllegalArgumentException - if sequences are of different lengths, or contain different elements
    • distance

      public int distance(short[] s1, short[] s2)
      Measures the distance between two arrays.
      Specified by:
      distance in interface SequenceDistanceMeasurer
      Parameters:
      s1 - First array.
      s2 - Second array.
      Returns:
      distance between s1 and s2
      Throws:
      IllegalArgumentException - if sequences are of different lengths, or contain different elements
    • distance

      public int distance(byte[] s1, byte[] s2)
      Measures the distance between two arrays.
      Specified by:
      distance in interface SequenceDistanceMeasurer
      Parameters:
      s1 - First array.
      s2 - Second array.
      Returns:
      distance between s1 and s2
      Throws:
      IllegalArgumentException - if sequences are of different lengths, or contain different elements
    • distance

      public int distance(char[] s1, char[] s2)
      Measures the distance between two arrays.
      Specified by:
      distance in interface SequenceDistanceMeasurer
      Parameters:
      s1 - First array.
      s2 - Second array.
      Returns:
      distance between s1 and s2
      Throws:
      IllegalArgumentException - if sequences are of different lengths, or contain different elements
    • distance

      public int distance(String s1, String s2)
      Measures the distance between two Strings.
      Specified by:
      distance in interface SequenceDistanceMeasurer
      Parameters:
      s1 - First String.
      s2 - Second String.
      Returns:
      distance between s1 and s2
      Throws:
      IllegalArgumentException - if sequences are of different lengths, or contain different elements
    • distance

      public int distance(float[] s1, float[] s2)
      Measures the distance between two arrays.
      Specified by:
      distance in interface SequenceDistanceMeasurer
      Parameters:
      s1 - First array.
      s2 - Second array.
      Returns:
      distance between s1 and s2
      Throws:
      IllegalArgumentException - if sequences are of different lengths, or contain different elements
    • distance

      public int distance(double[] s1, double[] s2)
      Measures the distance between two arrays.
      Specified by:
      distance in interface SequenceDistanceMeasurer
      Parameters:
      s1 - First array.
      s2 - Second array.
      Returns:
      distance between s1 and s2
      Throws:
      IllegalArgumentException - if sequences are of different lengths, or contain different elements
    • distance

      public int distance(boolean[] s1, boolean[] s2)
      Measures the distance between two arrays.
      Specified by:
      distance in interface SequenceDistanceMeasurer
      Parameters:
      s1 - First array.
      s2 - Second array.
      Returns:
      distance between s1 and s2
      Throws:
      IllegalArgumentException - if sequences are of different lengths, or contain different elements
    • distance

      public int distance(Object[] s1, Object[] s2)
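      Measures the distance between two arrays of objects. The objects in the arrays must be of a class that overrides the Object.equals method, and that additionally overrides hashCode (for the default algorithm) or implements the Comparable interface (for the alternate algorithm).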
      Specified by:
      distance in interface SequenceDistanceMeasurer
      Parameters:
      s1 - First array.
      s2 - Second array.
      Returns:
      distance between s1 and s2
      Throws:
      IllegalArgumentException - if sequences are of different lengths, or contain different elements.
      ClassCastException - If the distance measurer object is configured, via the constructor, to use the alternate algorithm, but the arrays passed to this method contain objects that do not implement the Comparable interface.
    • distance

      public <T> int distance(List<T> s1, List<T> s2)
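      Measures the distance between two lists of objects. The objects in the lists must be of a class that overrides the Object.equals method, and that additionally overrides hashCode (for the default algorithm) or implements the Comparable interface (for the alternate algorithm).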
      Specified by:
      distance in interface SequenceDistanceMeasurer
      Type Parameters:
      T - Type of List elements.
      Parameters:
      s1 - First list.
      s2 - Second list.
      Returns:
      distance between s1 and s2
      Throws:
      IllegalArgumentException - if s1.size() is not equal to s2.size(), or if they contain different elements.
      ArrayStoreException - If the distance measurer object is configured, via the constructor, to use the alternate algorithm, but the Lists passed to this method contain objects that do not implement the Comparable interface.
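The array overload documents ClassCastException while the List overload documents ArrayStoreException for the same misuse. That difference is consistent with how plain Java behaves: casting an element to Comparable fails with ClassCastException, whereas copying a List into a Comparable array with toArray fails with ArrayStoreException. The sketch below demonstrates only this standard JDK behavior, using hypothetical helper names; whether the library reaches these exceptions in exactly this way is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

final class ComparableFailureDemo {

    // Casts each array element to Comparable and reports the exception type, if any.
    static String castArrayElements(Object[] s) {
        try {
            for (Object o : s) {
                Comparable<?> c = (Comparable<?>) o; // fails for non-Comparable elements
            }
            return "none";
        } catch (RuntimeException e) {
            return e.getClass().getSimpleName();
        }
    }

    // Copies a List into a Comparable array and reports the exception type, if any.
    static String copyListToComparableArray(List<?> s) {
        try {
            s.toArray(new Comparable<?>[0]); // the array store check rejects non-Comparable elements
            return "none";
        } catch (RuntimeException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        Object[] array = {new Object()};
        List<Object> list = new ArrayList<>();
        list.add(new Object());
        System.out.println(castArrayElements(array));        // prints ClassCastException
        System.out.println(copyListToComparableArray(list)); // prints ArrayStoreException
    }
}
```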