Peter Ruest wrote:
> Basically, any sequence within the transastronomically huge
> combinatorial space of the L^20 possible sequences of proteins of length
> L would be accessible during evolution, if there is a mutational path
> which leads from an existing sequence to the target considered and which
> does not contain any intermediates which are selected against (or even
> lethal). In order to evaluate this mechanism of evolution and the
> probability of its success, we should have an idea about the frequency
> of useful sequences in sequence space. This information has been
> missing, but now some indications about it are available.
You should at least make a mental note somewhere that
correlation effects in a polymer are *not* limited to single
peptides, nor single nucleotides, nor any other monomer that
you can name. Typically, nearest neighboring monomers tend
to be coupled due to the lack of free rotation about their
bond axes. For nucleotides, this includes correlation between
the aromatic rings. For peptides, it is more complicated
because you have more interactions: hydrophobic, hydrophilic,
and acid/base interactions. In any case, you can (and should)
expect a polymer to have correlation between its nearest
neighbors.
This "correlation" is defined by the persistence length
which is also a measure of the "stiffness" of a polymer
chain. It is a measure of the tendency for a polymer
chain to "persist" as a _single unit_. Based on the
few proteins measured using atomic force microscopy,
it would seem that the persistence length is
on the order of 3 peptides (that is, 3 amino acid
residues). Stretches rich in proline are likely to be
stiffer and those rich in glycine more flexible, but
I think 3 peptides is a good estimate to start from.
The reason this is important is that interacting peptides within
a given persistence length are *not* independent. This is one
point that is not adequately treated in the information entropy
assumptions made on proteins.
As I understand it, information entropy originates from the
assumption that the transmitter and the receiver are sampling
rapidly enough that a complete data set is presumably passed.
It is then a matter of estimating the fidelity of the information
based on the sources of noise in the channel. However,
regardless of how noisy the channel is, each datum can be
taken as independent.
Now suppose the transmitter sends data at a frequency
of 5 MHz and the receiver detects this data at only 1 MHz.
Clearly quite a lot of information is lost, and you can no longer
apply the same independence assumptions to the data you are
receiving. To properly account for the data, you must first rescale
the transmitted data to an effective 1 MHz channel, where you
can again consider each datum to be independent.
Hence, I would be inclined to argue that the number of degrees
of freedom has been greatly overestimated in the quoted L^20
(strictly, 20^L for a chain of L peptides), and 20^(L/3) is a
more realistic estimate of the odds involved. That is admittedly
still a big number for any long protein chain, and may still lead
to astronomically huge odds, but certainly not _as_ huge.
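To put rough numbers on that, here is a small Python sketch. It counts
20^L sequences for a chain of L peptides (20 choices per position) and
compares that with the count obtained if every stretch of about 3
peptides behaves as one independent unit that still has only 20 states.
That last point is my simplifying assumption, not a measured result, and
the function name and the example length of 80 (the size of the random
region in the Keefe and Szostak library quoted in this post) are chosen
only for illustration.

```python
import math

ALPHABET = 20      # number of standard amino acids
PERSISTENCE = 3    # assumed persistence length, in peptides


def log10_space(length, block=1, alphabet=ALPHABET):
    """Return log10 of the number of distinct sequences when every
    `block` consecutive peptides act as one independent unit that
    can take `alphabet` different states."""
    return (length / block) * math.log10(alphabet)


if __name__ == "__main__":
    L = 80
    naive = log10_space(L)                 # every peptide independent
    coarse = log10_space(L, PERSISTENCE)   # one unit per persistence length
    print(f"independent peptides: about 10^{naive:.0f} sequences")
    print(f"blocks of {PERSISTENCE} peptides:  about 10^{coarse:.0f} sequences")
```

For L = 80 this gives roughly 10^104 against roughly 10^35: still
enormous, but smaller by about 69 orders of magnitude.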
> Keefe A.D., Szostak J.W., "Functional proteins from a random-sequence
> library", Nature 410 (2001), 715-718, generated a library of 6x10^12
> proteins, each containing 80 contiguous random amino acids, and enriched
> those proteins that bound to ATP. They found four new families of
> ATP-binding proteins unrelated to each other and unrelated to the
> natural ones. The selectively enriched substitutions were distributed
> over 62 of the 80 randomized amino acids, and a core domain of 45 amino
> acids sufficient for ATP-binding was defined. Keefe et al. estimated
> that roughly 1 in 10^11 of all random-sequence proteins have ATP-binding
> activity.
> Silverman J.A., Balakrishnan R., Harbury P.B., "Reverse engineering the
> ([beta]/[alpha])8 barrel fold", Proceedings of the National Academy of
> Sciences USA 98 (2001), 3092-3097, analyzed the most commonly occurring
> fold among protein catalysts, the TIM (triosephosphate isomerase) barrel
> consisting of 8 analogous units of beta sheet, loop, alpha helix, and
> turn, which together form a barrel accommodating a variable active site,
> used in a large family of different enzymes. Silverman et al. applied
> combinatorial mutagenesis of 182 amino acid positions in the barrel and
> functional selection for TIM activity in E.coli, requiring a minimal
> threshold of 10^-4 of wild-type activity. They estimate that fewer than
> 1 in 10^10 of the sequences in their degenerate library are able to
> complement in vivo.
> Thus, the two estimates agree quite well, even though they are derived
> in very different ways. If we look at protein sequence space, less (how
> much?) than 1 in 10^10 sequences is a triosephosphate isomerase enzyme,
> and 1 in 10^11 sequences binds ATP, which is a partial activity of many
> enzymes.
> As the human genome contains an estimated 30,000 genes, and the number
> of different protein folds is estimated to be a few thousand, we may, as
> a very rough approximation, assume that there are less than 10^4
> basically different protein families in the biosphere, within each of
> which a number of similar proteins can be derived from each other by
> feasible evolutionary paths.
> The question is whether each of the 10^4 different protein families can
> be similarly derived from one or very few initial sequences, or by
> random mutational walks. If a novel enzyme or other functional protein
> is to arise, which is not easily derivable by a few selected mutations
> from an already existing one, we need a mutational random walk. The
> probability of finding any sequence with the activity required is about
> 10^-11. If, at a given moment in the evolution of a species, any one of
> 10^4 different novel activities will prove advantageous, the probability
> of finding any such sequence is about 10^-7.
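The arithmetic in the quoted estimate can be checked in a few lines of
Python. This sketch only restates figures quoted in this post (a hit
frequency of 10^-11, 10^4 alternative activities, and the library of
6x10^12 proteins from Keefe and Szostak). The function names are mine,
and treating the 10^4 activities as independent of one another is an
assumption made for the sake of the calculation.

```python
import math


def any_of(p_single, n_targets):
    """Probability that one random sequence shows at least one of
    n_targets activities, each occurring independently with
    frequency p_single. log1p/expm1 keep precision for tiny p."""
    return -math.expm1(n_targets * math.log1p(-p_single))


def expected_hits(library_size, p_single):
    """Expected number of active sequences in a random library."""
    return library_size * p_single


if __name__ == "__main__":
    p = 1e-11  # quoted frequency of ATP binders
    print(f"any of 10^4 activities: {any_of(p, 10**4):.2e}")
    print(f"expected binders in 6x10^12: {expected_hits(6e12, p):.0f}")
```

With these inputs the first figure comes out at about 10^-7, in line
with the quoted estimate, and the library would be expected to contain
on the order of 60 binders.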
I am still not sure myself exactly what to make of the folds.
Do they represent a language? If so, to what extent: are they
mere commands or is there something more? By (admittedly rather
dangerously poor) analogy, the early 8008 processor functioned
successfully with only 17 instructions. Hence, if the "function"
of a protein is quite limited, then the required "instruction set"
could also be quite small.
So one thing that seems to need clarification is the level
of complexity of a given protein. There is a big difference
between the complexity of a human language, and the complexity
of a simple computer program carrying out a small instruction
set. Likewise, how many instructions are actually necessary
is not fully clear to me. In that case, it is not so much
the _number_ of folds, but what the folds actually _do_ that
needs to be defined clearly.
> These estimates assume that directed evolution in the lab is a valid
> model for natural evolution. Of course, this is not the case, as in
> directed evolution one does not have to bother about the viability of
> each intermediate organism in a linear sequence of point mutations, but
> only about the isolated activity of a new protein sequence after several
> or many mutations. Directed evolution jumps around in sequence space,
> whereas natural evolution is limited to single-step paths, and none of
> these steps must go downhill on the fitness surface.
> How, then, is it possible that any one of the 10^3 or 10^4 basically
> different protein folds (families) arose (anywhere in the biosphere),
> let alone all of them? If there was the need for 10^3 different searches
> with probabilities of around 10^-10, it seems a hopeless proposition.
> (And the few million years available for the formation of the first
> viable organism appear transastronomically inadequate.)
> The only possibility of a way out seems to be to claim that every single
> one of the different protein families used in the biosphere are
> intimately connected in sequence space, such that simple linear
> sequences of point mutations, with all intermediates naturally selected,
> will do for all proteins. In this case, more than 99.999999999% (eleven
> nines altogether) of sequence space is barren for life and was never
> visited by any sequence during evolution. Whether this is a feasible
> proposition will have to be shown experimentally.
> This still leaves us with the mystery of the origin of the first living
> organism capable of natural evolution.
> But the very interesting finding of the two papers mentioned is that the
> protein sequence space is extremely sparsely populated with useful
> sequences. This makes evolution (which, for theological reasons, I
> believe has happened) an astonishingly marvelous process.
You are beginning to rant again here. I can agree to some
extent that the laboratory conditions _somewhat_ favor the
expectations of the experimentalist. When these ideal
conditions are removed, and these materials have to compete
with all the other crud in a vat full of brown tar, it is
not particularly clear that the results will be favorable.
I think it also pertinent to say here that often the one
thing that seems seriously lacking in these exchanges
(perhaps more so from the evolution side) is reverence for how
astoundingly lucky we really are to even have the privilege
to think about where we came from. YEC folk err greatly in
other ways, but I recognize that (in part) this is because
they respect the Lord. In much the same way, I'm sure this
is probably at the heart of ID arguments, viz., by invoking
evolution, we seem to be denying the Lord's providence in
our lives. I think it fair enough to say that ignoring
the Lord is folly, and I understand that I have regularly
come up short on more than this account alone.
That being said, I am not fully decided on this matter, but I
would contend that there are a lot of curious properties in
polymers that allow for interesting possibilities.
The abiogenesis arguments, although persuasive, *may* turn
out to be wrong, but they are certainly arguments that
can be tested, and a testable hypothesis is something that
a scientist can work on. "Give up" arguments are not
(or at least, not until the funding runs out).
As I see it, the major problems that currently
plague an abiogenesis scenario are probably as follows.
(1) A power source for running an RNA world. RNA does
not appear to have a very large diversity of catalytic
activity (at least compared to proteins). Without an
engine and something to burn, the RNA world would
"run out of gas" rather quickly. Introducing proteins
brings us back to the chicken or egg question and greatly
increases the complexity of the prebiotic world.
(2) The "replicators" in a prebiotic world. If proteins
must be an integral part of the abiogenesis process,
the transcription machinery becomes more complicated
as well. There have been a few attempts at replicators
for RNA (I suspect mostly inadequate), but if this must
include the replication of proteins, then the difficulty
of making "first base" becomes far greater.
(3) Even if we can eventually find a way to explain (1)
and (2), let's not forget that life is an astoundingly
lucky privilege and that we should honor the
Lord. Our call to follow Christ is in no way diminished
whether life came about by probabilities or miracles.
Life is itself a "miracle," and it is a blessing that we
*can* even choose to follow.
by Grace we proceed,
This archive was generated by hypermail 2b29 : Mon Aug 06 2001 - 11:22:30 EDT