Computational Biology (Current State and History)

This article was originaly appeared on simons institute website

Since Descartes, and especially in the contributions of Kepler, Galileo and Newton, the main goal of the physical sciences has been to develop universal laws expressed as mathematical formulae, to which one can “input” measured conditions, i.e., known variables, to produce an “output,” representing certain aspects of the unknown. Biology as a scientific discipline did not initially follow this reductionist path until the revolutionary works of Darwin and Mendel, which demonstrated that biological systems shared common traits and patterns, expressed as “models”. By the early 20th century, mathematical methods, and statistical methods in particular, were already in use for the analysis of quantitative data in the biological sciences. The roots of computational biology, or bioinformatics, can be traced to that time, although, as in other sciences, biological data analyses were initially performed by humans. This changed after the advent of electronic computers, which were developed while pioneering physicists like Delbruck and Crick were establishing that hereditary information is carried by DNA, in long double-helical chains, encoding digital information – just like the tape in a Turing machine. Although the first computers, such as ENIAC, were initially employed in science for physical (thermonuclear) modeling by Von Neumann and others, it soon became natural for the likes of Gamow and Ulam to consider the use of ENIAC for biomolecular modeling and comparison. Simultaneously, Fisher started to use EDSAC in Britain to assess statistical significance, measured through the newly introduced notion of p-value. These happened while both Turing and Shannon, respective founders of theoretical computer science and information theory, were working on problems in biological computing.

By the late 1950s, computers were already used in population studies, species classification, and taxonomy construction. However, all these applications used the computer as a fancy calculator. While the work of Wiener and others on cybernetics laid the foundations of systems biology, much of the groundwork in that direction was speculative and not very algorithmic. Arguably, computational biology only became a distinct discipline in the 1970s with the biomolecular sequence comparison algorithms of Needleman and Wunsch as well as Smith and Waterman, developed in response to growth in the field of molecular biology. The 1970s also witnessed the development of the first protein structure prediction algorithms to aid crystallography and molecular dynamics simulations.

The first bioinformatics algorithms were developed simultaneously with the young field of theoretical computer science. After the introduction of Sanger sequencing, the first computational methods for genomic and proteomic sequence similarity search, as well as RNA secondary structure prediction, benefited from new algorithmic design and analysis ideas in dynamic programming (e.g. by Levenshtein for edit distance computation) and data structures for string matching (e.g. the suffix trees of Weiner and McCreight). With the kick-off of the Human Genome Project, the 1990s witnessed a significant increase in algorithmic activity for solving biomolecular problems. Novel methods for multiple sequence alignment, motif finding, haplotyping, phylogenetic tree construction, and protein structure analysis and prediction emerged at this time, all preparing the field for the eventual completion of the human genome sequence. The late 1990s were especially exciting, with both public and private efforts aiming to assemble the human genome from Sanger sequencing reads. Simultaneous development of microarray technologies resulted in numerous statistical and combinatorial techniques for clustering, classification and regression. Additional advances in co-expression analysis and protein interaction prediction through the yeast two hybrid system expanded systems biology into the realm of network science.

The completion of the Human Genome Project and the emergence of next generation sequencing technologies moved bioinformatics into a new era of Big Data science. Read mapping and variant detection algorithms, gene annotation and functional element discovery methods, gene expression and alternative splicing analysis tools that use RNA-Seq data, became central to international scientific projects of unprecedented scale, such as the 1000 Genomes Project, ENCODE and modENCODE, The Cancer Genome Atlas, and the International Cancer Genome Consortium.

As the size and variety of biomolecular data grow, interactions among molecular biology, theoretical computer science, statistics and statistical machine learning will need to deepen. Although algorithmic technology from the 80s and 90s, such as suffix arrays, locality sensitive hashing and color coding, as well as general techniques in linear and non-linear optimization and approximation algorithms for NP-hard problems, have already found profound applications in bioinformatics, newer techniques in streaming, sketching, metric embeddings, compressed data structures, differential privacy, homomorphic encryption and others are yet to attract significant interest from the computational biology community. Similarly, emerging problems in big data genomics, transcriptomics and proteomics have hardly attracted any attention in the theoretical computer science community. One of the objectives of our lab is to bring together the practices and expertise of the two communities, and develop algorithms and technologies to advance both research fields.