The Complexity of Gene Placement

Goldberg LA, Goldberg P, Pevzner P, Paterson M, Sahinalp SC, Sweedyk Z
17 January 1999
Proc. of SODA 1999, 386-395
Presented at SODA’09 (Tenth Annual ACM-SIAM Symposium on Discrete Algorithms), Baltimore, Maryland, USA, January 17-19

Abstract

We focus on algorithmic problems related to deriving gene locations on DNA sequences of closely related species by using comparative mapping data. Conventional genetic mapping generates intervals on the DNA sequence of given species for potential gene positions. The simultaneous analysis of gene intervals in related species,e.g., humanand mouse, may eliminate some of the ambiguities and lead to better estimates of gene locations. We address the problem of eliminating the ambi- guities in gene orders by means of minimizing the number of conserved regions among the species. This is equivalent to the problem of choosing gene coordinates (gene placement) that satisfy the genetic mapping constraints and minimize the breakpoint distance between genomes. We first show that the gene ordering problem is hard: there is no polynomial-time approximation scheme unless P = NP, even under the restrictions that: (1) the order of genes in one of species is known, or (2) at most two intervals overlap at any location on the map of any of the species. Then we provide two polynomial-time algorithms under restriction (1) above; the first approximates the problem within a factor of 3, and thesecond exactly solves the problem under the additional restriction that (3) no more than O􏰀􏰀logn􏰁/􏰀loglogn􏰁􏰁 intervals overlap at a location on the map of any of the species. We also prove the tractability of the general problem when there is a single conserved region i.e., when there exists a gene placement resulting in identical gene orders).