Distance functions

The library comes with a set of predefined distance measures: two for plain strings and two for paths.

Levenshtein distance

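This is an implementation of the classic Levenshtein distance for strings[16]: the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other. This function can be used for clustering regular strings.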
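
As an illustration of the measure itself, here is a minimal Python sketch of the textbook dynamic-programming algorithm. It is not the library's own code, and the function name levenshtein is chosen only for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook Levenshtein distance, keeping only two rows of the table."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]
```

For instance, levenshtein("kitten", "sitting") evaluates to 3: two substitutions and one insertion.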

Damerau distance

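This is an implementation of the classic Damerau-Levenshtein distance for strings[17]. It extends the Levenshtein distance by also counting a transposition of two adjacent characters as a single edit. This function can be used for clustering regular strings.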
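
The effect of the transposition rule can be seen in the following Python sketch. It implements the optimal string alignment variant (the restricted form of Damerau-Levenshtein, in which no substring is edited more than once). Which variant the library implements is not stated here, so treat this as an illustration only; the name osa_distance is hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment (restricted Damerau-Levenshtein) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Adjacent transposition: "xy" versus "yx" counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

With this sketch, osa_distance("ab", "ba") is 1, whereas plain Levenshtein distance between the same strings is 2.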

Journey distance

This function is an adaptation of the Damerau distance that can be used to cluster paths. Paths should be represented as strings in which each event is coded as three characters that encode, in order, event type, duration and intensity. For example, the string "122523" encodes a two-event path: the first event is of type 1, duration 2 and intensity 2; the second event is of type 5, duration 2 and intensity 3. The function computes the Damerau-Levenshtein distance over the event types (the first character of each three-character group) and adds a small factor to account for differences in duration and intensity between matching events in the two strings being compared. The event weights are used as a multiplication factor for insertions and deletions, so adding or removing events with high weights counts more towards the total distance than adding or removing events with low weights.
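
The description above can be sketched in Python as follows. This is a rough illustration, not the library's implementation, and it rests on several assumptions that the text does not confirm: duration and intensity are single decimal digits, the "small factor" is a fixed multiplier (here the hypothetical parameter detail=0.1) applied to the absolute differences, substitutions and transpositions cost 1, and event types missing from the weights mapping get weight 1. All function and parameter names are made up for this example.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, int, int]  # (type, duration, intensity)

def parse_path(path: str) -> List[Event]:
    """Split a path string into three-character events."""
    if len(path) % 3 != 0:
        raise ValueError("path length must be a multiple of 3")
    return [(path[i], int(path[i + 1]), int(path[i + 2]))
            for i in range(0, len(path), 3)]

def journey_distance(p: str, q: str, weights: Dict[str, float],
                     detail: float = 0.1) -> float:
    a, b = parse_path(p), parse_path(q)

    def w(event: Event) -> float:
        # Assumption: unknown event types default to weight 1.
        return weights.get(event[0], 1.0)

    d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = d[i - 1][0] + w(a[i - 1])  # weighted deletions
    for j in range(1, len(b) + 1):
        d[0][j] = d[0][j - 1] + w(b[j - 1])  # weighted insertions
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            ea, eb = a[i - 1], b[j - 1]
            if ea[0] == eb[0]:
                # Matching types: only a small duration/intensity penalty.
                sub = detail * (abs(ea[1] - eb[1]) + abs(ea[2] - eb[2]))
            else:
                sub = 1.0
            d[i][j] = min(d[i - 1][j] + w(ea),   # delete ea
                          d[i][j - 1] + w(eb),   # insert eb
                          d[i - 1][j - 1] + sub)
            # Transposition of two adjacent events, compared by type only.
            if i > 1 and j > 1 and ea[0] == b[j - 2][0] and a[i - 2][0] == eb[0]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1.0)
    return d[-1][-1]
```

Under these assumptions, journey_distance("122", "122523", {"5": 2.0}) is 2.0, because the only edit is the insertion of a type-5 event with weight 2.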

Event histogram distance

This function first computes, for each path, a vector of total intensities per event type. The total intensity of an event type in a path is the sum of intensity × duration over all events of that type. These vectors are then normalized. To compute the distance between two paths, a weighted sum of the total intensity differences for each event type is computed. The weights are taken from the eventWeights table.
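
A minimal Python sketch of this computation is shown below. It is not the library's code. It assumes the three-character event encoding described for the journey distance with single-digit duration and intensity, assumes that normalization means dividing by the vector's sum (the text does not say which norm is used), takes absolute differences, and represents the eventWeights table as a plain dictionary with a default weight of 1. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict

def event_histogram(path: str) -> Dict[str, float]:
    """Normalized total intensity (intensity x duration) per event type."""
    if len(path) % 3 != 0:
        raise ValueError("path length must be a multiple of 3")
    totals: Dict[str, float] = defaultdict(float)
    for i in range(0, len(path), 3):
        etype = path[i]
        duration, intensity = int(path[i + 1]), int(path[i + 2])
        totals[etype] += intensity * duration
    norm = sum(totals.values())
    if norm == 0:
        return dict(totals)
    # Assumption: normalize so that the entries sum to 1.
    return {etype: value / norm for etype, value in totals.items()}

def histogram_distance(p: str, q: str, event_weights: Dict[str, float]) -> float:
    """Weighted sum of absolute per-type differences between two histograms."""
    hp, hq = event_histogram(p), event_histogram(q)
    return sum(
        event_weights.get(etype, 1.0) * abs(hp.get(etype, 0.0) - hq.get(etype, 0.0))
        for etype in set(hp) | set(hq)
    )
```

For the path "122523", the totals are 4 for type 1 and 6 for type 5, which this sketch normalizes to 0.4 and 0.6. Because only the per-type totals are compared, the order of events within a path does not affect this distance.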

[16] https://en.wikipedia.org/wiki/Levenshtein_distance

[17] https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance