Approximate nearest neighbors and sequence comparison with block operations

Abstract

We study sequence nearest neighbors (SNN). Let D be a database ofn sequences; we would like topreprocess D so that given any on-line query sequence Q we can quickly find a sequence S in D for which d(S, Q) <= d(S, T) for any other sequence T in D. Here d(S, Q) denotes the distance between sequences S and Q, defined to be the minimum number of edit operations needed to transform one to another (all edit operations will be reversible so that d(S, T) = d(T, S) for any two sequences T and S). These operations correspond to the notion of similarity between sequences that we wish to capture in a given application. Natural edit operations include character edits (inserts, replacements, deletes etc), block edits (moves, copies, deletes, reversals) and block numerical transformations (scaling by an additive or a multiplicative constant). The SNN problem arises in many applications.