# Stéphane Robin, François Rodolphe + Sophie Schbath

Cambridge University Press 2005
A book review by Danny Yee © 2009 https://dannyreviews.com/
Given a genome or part of a genome, how can we tell if a particular DNA sequence — a word — is unusual, occurs unusually often or rarely, or is distributed unusually? DNA, Words and Models presents some mathematics which is useful in answering this question.

The mathematics is kept in close touch with biology. The two main case studies are Chi sequences in E. coli, which encourage recombination and help prevent free double strand DNA, and bacterial restriction sites, which are targeted by restriction enzymes as a counter to phages. The former are expected to be frequent and well distributed in bacterial genomes; the latter to be avoided.

Before we can start thinking about what might be unusual or exceptional, we need a background model for normal sequences. Chapters two and three introduce permutation models, Bernoulli models, where nucleotide probabilities are independent, and Markov chains, where each nucleotide is dependent on some number (the order) of preceding nucleotides. There are various ways of estimating the model parameters; maximum likelihood is most often used with Markov chains.
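
To make the idea of a background model concrete, here is a minimal Python sketch (not taken from the book) of maximum likelihood estimation for a Markov chain of a given order: each transition probability is estimated as the number of times a context is followed by a given nucleotide, divided by the number of times the context occurs. The function name `fit_markov` and the toy sequence are assumptions made for illustration; setting `order=0` reduces to a Bernoulli model.

```python
from collections import Counter

def fit_markov(seq, order=1):
    """Maximum likelihood transition probabilities for a Markov chain.

    Returns a dict mapping (context, next_letter) to an estimated
    probability, where context is the `order` preceding letters.
    `fit_markov` is a hypothetical helper, not an API from the book.
    """
    context_counts = Counter()
    transition_counts = Counter()
    for i in range(order, len(seq)):
        context = seq[i - order:i]
        context_counts[context] += 1
        transition_counts[(context, seq[i])] += 1
    return {
        (context, letter): n / context_counts[context]
        for (context, letter), n in transition_counts.items()
    }

toy = "ACGTACGGACGT"  # made-up sequence, not real genomic data
probs = fit_markov(toy, order=1)
print(probs[("G", "T")])  # G is followed by T in 2 of its 4 transitions: 0.5
```

On a real genome the same counting applies, though higher orders need far more data: an order-m model over four nucleotides has 3 × 4^m free parameters.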

Chapter four extends these models to cover various kinds of heterogeneity. Phased models can represent codons. Piecewise homogeneous models represent genetic units such as operons, introns and promoters, and hidden Markov chains cover the situation where this structure is not known. And translation conditional models capture synonymity in translation from DNA sequence to amino acid sequence.

Looking at the number of occurrences of a word, an exact distribution can be derived with a first order Markov model, and Gaussian and compound Poisson approximations are possible for (respectively) long sequences and rare words. Markov and compound Poisson models can also be applied to the distribution of word occurrences — which can be looked at through cumulative distances, distribution homogeneity, intensity plots and moving windows.
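
The raw quantities these distributions describe are easy to compute. The following Python sketch (an illustration, not the book's method) finds all occurrences of a word, overlaps included, then derives the distances between successive occurrences and the counts in moving windows. The helper names, the toy sequence and the word `ATAT` are all assumptions chosen for the example.

```python
def occurrences(seq, word):
    """Start positions (0-based) of every occurrence of `word`, overlaps included."""
    positions = []
    start = seq.find(word)
    while start != -1:
        positions.append(start)
        start = seq.find(word, start + 1)  # step by 1 so overlaps are kept
    return positions

def window_counts(positions, seq_len, window, step):
    """Number of occurrence starts falling in each window [s, s + window)."""
    return [
        sum(s <= p < s + window for p in positions)
        for s in range(0, seq_len - window + 1, step)
    ]

toy = "ATATATGCATAT"  # made-up sequence, not real genomic data
pos = occurrences(toy, "ATAT")
gaps = [b - a for a, b in zip(pos, pos[1:])]
counts = window_counts(pos, len(toy), window=6, step=3)
print(pos, gaps, counts)  # [0, 2, 8] [2, 6] [2, 1, 1]
```

Overlapping occurrences matter here: a word such as `ATAT` can overlap itself, which is one reason occurrences tend to arrive in clumps and why compound Poisson approximations are a natural choice for rare words.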

Chapter six looks at words with unexpected frequencies. Again there are different approximations to exact distributions, but comparisons of different models are more revealing than the results of single models. This is applied to overrepresentation of Chi sites in E. coli and H. influenzae and to underrepresentation of palindromes of length six in E. coli and Lambda phage.
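
A small Python sketch can show why comparing models is informative. It computes plug-in expected counts of a word under a Bernoulli model and under a first order Markov model, using a commonly quoted estimator for the latter (the product of the counts of the word's overlapping two-letter subwords, divided by the product of the counts of its interior letters). This is an assumption about a reasonable estimator, not necessarily the exact formula used in the book, and the function names and toy sequence are hypothetical.

```python
from collections import Counter

def count_overlapping(seq, word):
    """Observed number of occurrences of `word`, overlaps included."""
    return sum(seq.startswith(word, i) for i in range(len(seq) - len(word) + 1))

def expected_bernoulli(seq, word):
    """Plug-in expected count when letters are independent (order 0)."""
    n = len(seq)
    freq = Counter(seq)
    p = 1.0
    for letter in word:
        p *= freq[letter] / n
    return (n - len(word) + 1) * p

def expected_markov1(seq, word):
    """Plug-in expected count under a first order Markov model (assumed estimator)."""
    if len(word) < 2:
        raise ValueError("word must have at least two letters")
    numerator = 1.0
    for i in range(len(word) - 1):
        numerator *= count_overlapping(seq, word[i:i + 2])
    denominator = 1.0
    for letter in word[1:-1]:  # interior letters only
        denominator *= count_overlapping(seq, letter)
    return numerator / denominator if denominator else 0.0

toy = "ACGTACGGACGT"  # made-up sequence, not real genomic data
observed = count_overlapping(toy, "ACG")
print(observed, expected_bernoulli(toy, "ACG"), expected_markov1(toy, "ACG"))
```

In this toy case the word `ACG` occurs 3 times; the Bernoulli model expects about 0.21 occurrences, which makes the word look strongly overrepresented, while the first order Markov model expects 3.0, because the dinucleotide composition already accounts for the count. A real analysis would also need the variance of the count to turn such differences into a significance score.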

And finally, on words with unexpected locations, chapter seven considers the distribution of Chi sites in H. influenzae, of palindromes in E. coli, and of promoter sites in B. subtilis.

DNA, Words and Models will not be difficult reading for anyone with experience of formal mathematics, but they will want to be comfortable with at least basic combinatorial algebra. Computational issues are discussed, and approximate methods presented and their accuracy explored, but the approach remains analytical — there's no use of simulations or numerical methods, or details of algorithms.

Robin et al. provide a clear explanation of the mathematics, but also of what models are, of their limitations, and of how they connect to biology. So they are up front about Markov chains being used because they are easy to work with, even though they have no biological justification, and in a few places they present mathematics that hasn't yet found applications. An afterword considers the limitations of in silico approaches, but also their power: "There is a complementary relationship between in silico analysis and experimental work: the former is not only an ancillary to the latter."

February 2009
