*DNA, Words and Models* presents some mathematics which is useful in answering this question.

The mathematics is kept in close touch with biology. The two main case
studies are Chi sequences in *E. coli*, which encourage recombination
and help prevent free double strand DNA, and bacterial restriction
sites, which are targeted by restriction enzymes as a counter to phages.
The former are expected to be frequent and well distributed in bacterial
genomes, the latter to be avoided.

Before we can start thinking about what might be unusual or exceptional, we need a background model for normal sequences. Chapters two and three introduce permutation models, Bernoulli models, where nucleotide probabilities are independent, and Markov chains, where each nucleotide is dependent on some number (the order) of preceding nucleotides. There are various ways of estimating the model parameters; maximum likelihood is most often used with Markov chains.
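As a rough illustration of maximum likelihood estimation for a first order Markov chain, the sketch below estimates each transition probability as the count of a dinucleotide divided by the count of its first letter. The function name `estimate_markov1` and the toy sequence are assumptions made for this example, not anything taken from the book.

```python
from collections import Counter


def estimate_markov1(seq):
    """Maximum-likelihood transition probabilities for a first-order chain.

    Returns a dict mapping (a, b) to the estimated P(next = b | current = a):
    the number of times b follows a, divided by the number of times a
    appears in any position that has a successor.
    """
    pair_counts = Counter(zip(seq, seq[1:]))   # dinucleotide counts
    start_counts = Counter(seq[:-1])           # letters that have a successor
    return {
        (a, b): n / start_counts[a]
        for (a, b), n in pair_counts.items()
    }


# Toy sequence (hypothetical, for illustration only).
probs = estimate_markov1("AACA")
for (a, b), p in sorted(probs.items()):
    print(f"P({b} | {a}) = {p:.2f}")
```

Transitions that never occur in the sequence are simply absent from the result, so a caller would treat them as having estimated probability zero.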

Chapter four extends these models to cover various kinds of heterogeneity. Phased models can represent codons. Piecewise homogeneous models represent genetic units such as operons, introns and promoters, while hidden Markov chains handle the situation where this structure is not known. Translation conditional models capture synonymity in translation from DNA sequence to amino acid sequence.

For the number of occurrences of a word, an exact distribution can be derived with a first order Markov model, and Gaussian and compound Poisson approximations are possible for long sequences and rare words respectively. Markov and compound Poisson models can also be applied to the distribution of word occurrences, which can be examined through cumulative distances, distribution homogeneity, intensity plots and moving windows.
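To make the idea of a word count and its expectation concrete, here is a minimal sketch that counts overlapping occurrences of a word and computes the expected count under the simplest background, a Bernoulli model with letter probabilities estimated from the sequence itself. It does not attempt the exact Markov distribution or the Gaussian and compound Poisson approximations; the function names and test strings are assumptions for illustration.

```python
from collections import Counter
from math import prod


def count_overlapping(seq, word):
    """Count occurrences of word in seq, allowing overlaps.

    str.count would miss overlaps: "AAAA".count("AA") is 2, not 3.
    """
    m = len(word)
    return sum(1 for i in range(len(seq) - m + 1) if seq[i:i + m] == word)


def expected_count_bernoulli(seq, word):
    """Expected count of word under an independent-letter (Bernoulli) model.

    Letter probabilities are the observed frequencies in seq; the expectation
    is (number of possible start positions) * (product of letter probabilities).
    """
    n, m = len(seq), len(word)
    freqs = Counter(seq)
    p_word = prod(freqs[c] / n for c in word)
    return (n - m + 1) * p_word


observed = count_overlapping("ACGTACGT", "ACG")
expected = expected_count_bernoulli("ACGTACGT", "ACG")
print(observed, expected)
```

Comparing the observed count with an expectation like this is only a first step; judging whether a difference is unusual needs the variance or the full distribution of the count.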

Chapter six looks at words with unexpected frequencies. Again there
are different approximations to exact distributions, but comparisons of
different models are more revealing than the results of single models.
This is applied to overrepresentation of Chi sites in *E. coli* and
*H. influenzae* and to underrepresentation of palindromes of length six
in *E. coli* and *Lambda* phage.
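For readers unfamiliar with DNA palindromes, the following sketch assumes the usual molecular biology sense of the term: a word that equals its own reverse complement, as with the restriction site GAATTC. It enumerates all such words of a given even length and checks individual words; the helper names are hypothetical.

```python
from itertools import product

# Translation table mapping each nucleotide to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Return the reverse complement of a DNA word."""
    return word.translate(COMPLEMENT)[::-1]


def is_palindrome(word):
    """True if word reads the same on both strands."""
    return word == reverse_complement(word)


def palindromes(length=6):
    """All reverse-complement palindromes of an even length.

    The first half can be any word; the second half is then forced to be
    its reverse complement, giving 4 ** (length // 2) palindromes.
    """
    halves = ("".join(h) for h in product("ACGT", repeat=length // 2))
    return [h + reverse_complement(h) for h in halves]


six_mers = palindromes(6)
print(len(six_mers), is_palindrome("GAATTC"))
```

Each of these words could then be counted in a genome and compared with its expected frequency under a chosen background model.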

And finally, on words with unexpected locations, chapter seven considers
the distribution of Chi sites in *H. influenzae*, of palindromes in
*E. coli*, and of promoter sites in *B. subtilis*.

*DNA, Words and Models* will not be difficult reading for anyone with
experience of formal mathematics, but they will want to be comfortable
with at least basic combinatorial algebra. Computational issues are
discussed, and approximate methods presented and their accuracy explored,
but the approach remains analytical — there's no use of simulations or
numerical methods, or details of algorithms.

Robin et al. provide a clear explanation of the mathematics, but also of
what models are, of their limitations, and of how they connect to biology.
So they are up front about Markov chains being used because they are easy
to work with, even though they have no biological justification, and in
a few places they present mathematics that hasn't yet found applications.
An afterword considers the limitations of *in silico* approaches, but also
their power: "There is a complementary relationship between *in silico*
analysis and experimental work: the former is not only an ancillary to
the latter."

February 2009

**Related reviews:**
- more biology
- books about mathematics
- books published by Cambridge University Press