Re: Evolution of proteins in sequence space

Date: Mon Aug 06 2001 - 11:21:49 EDT

    Peter Ruest wrote:

    > Basically, any sequence within the transastronomically huge
    > combinatorial space of the 20^L possible sequences of proteins of length
    > L would be accessible during evolution, if there is a mutational path
    > which leads from an existing sequence to the target considered and which
    > does not contain any intermediates which are selected against (or even
    > lethal). In order to evaluate this mechanism of evolution and the
    > probability of its success, we should have an idea about the frequency
    > of useful sequences in sequence space. This information has been
    > missing, but now some indications about it are available.

    You should at least make a mental note that
    correlation effects in a polymer are *not* limited to single
    peptides, nor single nucleotides, nor any other monomer that
    you can name. Typically, nearest neighboring monomers tend
    to be coupled due to the lack of free rotation about their
    bond axes. For nucleotides, this includes correlation between
    the aromatic rings. For peptides, it is more complicated
    because you have more interactions: hydrophobic, hydrophilic,
    and acid/base interactions. In any case, you can (and should)
    expect a polymer to have correlation between its nearest neighbors.

    This "correlation" is defined by the persistence length
    which is also a measure of the "stiffness" of a polymer
    chain. It is a measure of the tendency for a polymer
    chain to "persist" as a _single unit_. Based on the
    few proteins measured using atomic force microscopy,
    it would seem that the persistence length is
    on the order of 3 peptides (amino acid residues). For
    residues like proline it is likely to be longer and
    for glycine shorter, but I think 3 peptides
    is a good estimate to start from.

    The reason this is important is that interacting peptides within
    a given persistence length are *not* independent. This is one
    point that is not adequately treated in the information entropy
    assumptions made about proteins.
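
    The counting consequence of this can be sketched with a toy
    example in Python. It assumes the most extreme case, where
    every monomer in a block simply copies the first monomer of
    that block; real coupling is far weaker than this. The function
    names and the two-letter alphabet are hypothetical, chosen only
    so that the sequences can be enumerated by brute force.

```python
from itertools import product


def count_independent(alphabet, length):
    """Count sequences when every position varies independently."""
    return sum(1 for _ in product(alphabet, repeat=length))


def count_coupled(alphabet, length, block):
    """Count sequences when each block of `block` positions acts as one unit.

    Toy assumption: all positions in a block repeat the block's first symbol.
    """
    n_blocks = length // block
    sequences = {
        "".join(symbol * block for symbol in choice)
        for choice in product(alphabet, repeat=n_blocks)
    }
    return len(sequences)


alphabet = "AB"
print(count_independent(alphabet, 6))   # every position free
print(count_coupled(alphabet, 6, 3))    # two blocks of three
```

    With a block of 3, the six-position chain behaves like a
    two-position chain, which is the sense in which the positions
    are "not independent."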

    As I understand it, information entropy originates from the
    assumption that the transmitter and the receiver are sampling
    rapidly enough that a complete data set is presumably passed.
    It is then a matter of estimating the fidelity of the information
    based on the sources of noise in the channel. However,
    regardless of how noisy the channel is, each datum can be
    taken as independent.

    Now suppose the transmitter sends data at a frequency
    of 5 MHz and the receiver detects this data at only 1 MHz.
    Clearly quite a lot of information is lost, and you cannot apply
    the same independence assumptions about the data you are receiving
    anymore. To properly account for the data, you must first rescale
    the transmission data to that of a 1 MHz channel, where you
    can again consider the data to be independent.

    Hence, I would be inclined to argue that the number of degrees
    of freedom has been greatly overestimated in 20^L, and
    20^(L/3) is a more realistic estimate of the odds involved.
    That is admittedly still a big number for any long protein
    chain, and may still lead to astronomically huge odds, but
    certainly not _as_ huge.
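
    To put rough numbers on this, here is a minimal sketch in
    Python. It takes the number of sequences of length L over 20
    amino acids as 20^L and assumes that a persistence length of 3
    simply divides the number of independent units by 3. The
    function and parameter names are hypothetical; the chain
    lengths are the ones that appear in the papers quoted below.

```python
import math


def log10_sequence_space(length, alphabet_size=20, persistence_length=1):
    """Return log10 of alphabet_size ** (length / persistence_length)."""
    effective_units = length / persistence_length
    return effective_units * math.log10(alphabet_size)


# 45-residue core domain, 80 randomized residues, 182 barrel positions.
for length in (45, 80, 182):
    independent = log10_sequence_space(length)
    coupled = log10_sequence_space(length, persistence_length=3)
    print(f"L={length:3d}: 10^{independent:.1f} if independent, "
          f"10^{coupled:.1f} with blocks of 3")
```

    The exponent drops to a third in every case, so the space is
    still enormous for a long chain, only less so.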

    > Keefe A.D., Szostak J.W., "Functional proteins from a random-sequence
    > library", Nature 410 (2001), 715-718, generated a library of 6x10^12
    > proteins, each containing 80 contiguous random amino acids, and enriched
    > those proteins that bound to ATP. They found four new families of
    > ATP-binding proteins unrelated to each other and unrelated to the
    > natural ones. The selectively enriched substitutions were distributed
    > over 62 of the 80 randomized amino acids, and a core domain of 45 amino
    > acids sufficient for ATP-binding was defined. Keefe et al. estimated
    > that roughly 1 in 10^11 of all random-sequence proteins have ATP-binding
    > activity.
    > Silverman J.A., Balakrishnan R., Harbury P.B., "Reverse engineering the
    > ([beta]/[alpha])8 barrel fold", Proceedings of the National Academy of
    > Sciences USA 98 (2001), 3092-3097, analyzed the most commonly occurring
    > fold among protein catalysts, the TIM (triosephosphate isomerase) barrel
    > consisting of 8 analogous units of beta sheet, loop, alpha helix, and
    > turn, which together form a barrel accommodating a variable active site,
    > used in a large family of different enzymes. Silverman et al. applied
    > combinatorial mutagenesis of 182 amino acid positions in the barrel and
    > functional selection for TIM activity in E.coli, requiring a minimal
    > threshold of 10^-4 of wild-type activity. They estimate that fewer than
    > 1 in 10^10 of the sequences in their degenerate library are able to
    > complement in vivo.
    > Thus, the two estimates agree quite well, even though they are derived
    > in very different ways. If we look at protein sequence space, less (how
    > much?) than 1 in 10^10 sequences is a triosephosphate isomerase enzyme,
    > and 1 in 10^11 sequences binds ATP, which is a partial activity of many
    > enzymes.
    > As the human genome contains an estimated 30,000 genes, and the number
    > of different protein folds is estimated to be a few thousand, we may, as
    > a very rough approximation, assume that there are less than 10^4
    > basically different protein families in the biosphere, within each of
    > which a number of similar proteins can be derived from each other by
    > feasible evolutionary paths.
    > The question is whether each of the 10^4 different protein families can
    > be similarly derived from one or very few initial sequences, or by
    > random mutational walks. If a novel enzyme or other functional protein
    > is to arise, which is not easily derivable by a few selected mutations
    > from an already existing one, we need a mutational random walk. The
    > probability of finding any sequence with the activity required is about
    > 10^-11. If, at a given moment in the evolution of a species, any one of
    > 10^4 different novel activities will prove advantageous, the probability
    > of finding any such sequence is about 10^-7.
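
    The arithmetic in the last step can be checked with a short
    Python sketch. It assumes the 10^4 activities are independent
    and that each has the same 10^-11 chance per random sequence,
    neither of which is established by the papers. The function
    names are hypothetical.

```python
import math


def prob_at_least_one(p_single, n_targets):
    """P(at least one success) = 1 - (1 - p)^n, computed stably for tiny p."""
    return -math.expm1(n_targets * math.log1p(-p_single))


def trials_for_probability(p_hit, target=0.5):
    """Independent random sequences needed to reach `target` odds of a hit."""
    return math.log1p(-target) / math.log1p(-p_hit)


p_any = prob_at_least_one(1e-11, 10**4)
print(f"P(any of 10^4 activities) = {p_any:.3e}")
print(f"sequences for even odds   = {trials_for_probability(p_any):.3e}")
```

    For probabilities this small, the result is essentially the
    simple product 10^4 x 10^-11.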

    I am still not sure myself exactly what to make of the folds.
    Do they represent a language? If so, to what extent: are they
    mere commands or is there something more? By (admittedly rather
    dangerously poor) analogy, the early 8008 processor functioned
    successfully with only 17 instructions. Hence, if the "function"
    of a protein is quite limited, then the required "instruction set"
    could also be quite small.

    So one thing that seems to need clarification is the level
    of complexity of a given protein. There is a big difference
    between the complexity of a human language, and the complexity
    of a simple computer program carrying out a small instruction
    set. Likewise, how many instructions are actually necessary
    is not fully clear to me. In that case, it is not so much
    the _number_ of folds, but what the folds actually _do_ that
    needs to be defined clearly.

    > These estimates assume that directed evolution in the lab is a valid
    > model for natural evolution. Of course, this is not the case, as in
    > directed evolution one does not have to bother about the viability of
    > each intermediate organism in a linear sequence of point mutations, but
    > only about the isolated activity of a new protein sequence after several
    > or many mutations. Directed evolution jumps around in sequence space,
    > whereas natural evolution is limited to single-step paths, and none of
    > these steps must go downhill on the fitness surface.
    > How, then, is it possible that any one of the 10^3 or 10^4 basically
    > different protein folds (families) arose (anywhere in the biosphere),
    > let alone all of them? If there was the need for 10^3 different searches
    > with probabilities of around 10^-10, it seems a hopeless proposition.
    > (And the few million years available for the formation of the first
    > viable organism appear transastronomically inadequate.)
    > The only possibility of a way out seems to be to claim that every single
    > one of the different protein families used in the biosphere are
    > intimately connected in sequence space, such that simple linear
    > sequences of point mutations, with all intermediates naturally selected,
    > will do for all proteins. In this case, more than 99.999999999% (eleven
    > nines altogether) of sequence space is barren for life and was never
    > visited by any sequence during evolution. Whether this is a feasible
    > proposition will have to be shown experimentally.
    > This still leaves us with the mystery of the origin of the first living
    > organism capable of natural evolution.
    > But the very interesting finding of the two papers mentioned is that the
    > protein sequence space is extremely sparsely populated with useful
    > sequences. This makes evolution (which, for theological reasons, I
    > believe has happened) an astonishingly marvelous process.

    You are beginning to rant again here. I can agree to some
    extent that the laboratory conditions _somewhat_ favor the
    expectations of the experimentalist. When these ideal
    conditions are removed, and these materials have to compete
    with all the other crud in a vat full of brown tar, it is
    not particularly clear that the results will be favorable.

    I think it also pertinent to say here that often the one
    thing that seems seriously lacking in these exchanges
    (perhaps more so from the evolution side) is reverence for how
    astoundingly lucky we really are to even have the privilege
    to think about where we came from. YEC folk err greatly in
    other ways, but I recognize that (in part) this is because
    they respect the Lord. In much the same way, I'm sure this
    is probably at the heart of ID arguments, viz., by invoking
    evolution, we seem to be denying the Lord's providence in
    our lives. I think it fair enough to say that ignoring
    the Lord is folly, and I understand that I have regularly
    come up short on more than this account alone.

    That being said, I am not fully decided on this matter, but I would
    contend that there are a lot of curious properties in
    polymers that allow for interesting possibilities.
    The abiogenesis arguments, although persuasive, *may* turn
    out to be wrong, but they are certainly arguments that
    can be tested, and a testable hypothesis is something that
    a scientist can work on. "Give up" arguments are not
    (or at least, not until the funding runs out).

    As I see it, the major problems that currently
    plague an abiogenesis scenario are probably as follows.

    (1) A power source for running an RNA world. RNA does
    not appear to have a very large diversity of catalytic
    activity (at least compared to proteins). Without an
    engine and something to burn, the RNA world would
    "run out of gas" rather quickly. Introducing proteins
    brings us back to the chicken or egg question and greatly
    increases the complexity of the prebiotic world.

    (2) The "replicators" in a prebiotic world. If proteins
    must be an integral part of the abiogenesis process,
    the transcription machinery becomes more complicated
    as well. There have been a few attempts at replicators
    for RNA (I suspect mostly inadequate), but if this must
    include the replication of proteins, then the difficulty
    of making "first base" becomes far greater.

    (3) Even if we can eventually find a way to explain (1)
    and (2), let's not forget that life is an astoundingly
    lucky privilege and that we should honor the
    Lord. Our call to follow Christ is in no way diminished
    whether life came about by probabilities or miracles.
    Life is itself a "miracle," and it is a blessing that we
    *can* even choose to follow.

    by Grace we proceed,

    This archive was generated by hypermail 2b29 : Mon Aug 06 2001 - 11:22:30 EDT