Evolution of proteins in sequence space

From: pruest@pop.dplanet.ch
Date: Thu Aug 02 2001 - 11:18:27 EDT

    Proteins may evolve in two basically different modes. One mode is by a
    sequence of point mutations. The other mode is by genetic recombination
    of preexisting modules or fragments. (Let me ignore deletions which
    presumably are deleterious in the vast majority of cases - except
    perhaps for some occasional deletions of entire codons.) Each of the new
    sequences produced must then be accepted (and fixed in the population)
    by natural selection or by random drift (if it is lost, it does not
    contribute to evolution). Novel sequence information is generated only in
    the first case, by a series of several point mutations.

    Basically, any sequence within the transastronomically huge
    combinatorial space of the 20^L possible sequences of proteins of length
    L would be accessible during evolution, if there is a mutational path
    which leads from an existing sequence to the target considered and which
    does not contain any intermediates which are selected against (or even
    lethal). In order to evaluate this mechanism of evolution and the
    probability of its success, we should have an idea about the frequency
    of useful sequences in sequence space. This information has been
    missing, but now some indications about it are available.
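
    To give a feeling for the size of this space, here is a minimal Python
    sketch. It assumes the 20 standard amino acids and, as an example
    length, the 80 residues of the random library discussed in this post.
    The function names are hypothetical, chosen only for illustration.

```python
from math import log10

AMINO_ACIDS = 20  # assumption: the 20 standard amino acids


def sequence_space_size(length: int) -> int:
    """Exact number of possible protein sequences of a given length (20^L)."""
    return AMINO_ACIDS ** length


def log10_sequence_space(length: int) -> float:
    """Order of magnitude of the sequence space, i.e. log10(20^L)."""
    return length * log10(AMINO_ACIDS)


if __name__ == "__main__":
    L = 80  # example length, as in the 80-residue random library
    print(f"20^{L} is about 10^{log10_sequence_space(L):.1f}")
```

    For L = 80 the exponent comes out slightly above 104, so even a
    library of 6x10^12 sequences samples only a vanishing fraction of the
    space.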

    Keefe A.D., Szostak J.W., "Functional proteins from a random-sequence
    library", Nature 410 (2001), 715-718, generated a library of 6x10^12
    proteins, each containing 80 contiguous random amino acids, and enriched
    those proteins that bound to ATP. They found four new families of
    ATP-binding proteins unrelated to each other and unrelated to the
    natural ones. The selectively enriched substitutions were distributed
    over 62 of the 80 randomized amino acids, and a core domain of 45 amino
    acids sufficient for ATP-binding was defined. Keefe et al. estimated
    that roughly 1 in 10^11 of all random-sequence proteins have ATP-binding
    activity.

    Silverman J.A., Balakrishnan R., Harbury P.B., "Reverse engineering the
    (beta/alpha)8 barrel fold", Proceedings of the National Academy of
    Sciences USA 98 (2001), 3092-3097, analyzed the most commonly occurring
    fold among protein catalysts, the TIM (triosephosphate isomerase) barrel
    consisting of 8 analogous units of beta sheet, loop, alpha helix, and
    turn, which together form a barrel accommodating a variable active site,
    used in a large family of different enzymes. Silverman et al. applied
    combinatorial mutagenesis of 182 amino acid positions in the barrel and
    functional selection for TIM activity in E. coli, requiring a minimal
    threshold of 10^-4 of wild-type activity. They estimate that fewer than
    1 in 10^10 of the sequences in their degenerate library are able to
    complement in vivo.

    Thus, the two estimates agree quite well, even though they are derived
    in very different ways. If we look at protein sequence space, fewer (by
    how much?) than 1 in 10^10 sequences is a triosephosphate isomerase enzyme,
    and 1 in 10^11 sequences binds ATP, which is a partial activity of many
    enzymes.

    As the human genome contains an estimated 30,000 genes, and the number
    of different protein folds is estimated to be a few thousand, we may, as
    a very rough approximation, assume that there are fewer than 10^4
    basically different protein families in the biosphere, within each of
    which a number of similar proteins can be derived from each other by
    feasible evolutionary paths.

    The question is whether each of the 10^4 different protein families can
    be similarly derived from one or very few initial sequences, or by
    random mutational walks. If a novel enzyme or other functional protein
    is to arise, which is not easily derivable by a few selected mutations
    from an already existing one, we need a mutational random walk. The
    probability of finding any sequence with the activity required is about
    10^-11. If, at a given moment in the evolution of a species, any one of
    10^4 different novel activities will prove advantageous, the probability
    of finding any such sequence is about 10^-7.
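
    The arithmetic behind the 10^-7 figure can be checked with a short
    Python sketch. It assumes that each of the 10^4 activities has the
    same frequency of 10^-11 and that they are independent of each other;
    neither assumption comes from the two papers, and the function names
    are hypothetical.

```python
from math import expm1, log1p

P_SINGLE = 1e-11    # estimated frequency of one given activity (Keefe et al.)
N_TARGETS = 10**4   # assumed number of alternative useful activities


def prob_any_target(p: float, n: int) -> float:
    """Probability that one random sequence has at least one of n activities.

    Computes 1 - (1 - p)^n via log1p/expm1 to avoid rounding loss for tiny p.
    Assumes the n activities are independent and equally rare.
    """
    return -expm1(n * log1p(-p))


def union_bound(p: float, n: int) -> float:
    """Simple upper bound n * p, capped at 1."""
    return min(1.0, n * p)


if __name__ == "__main__":
    print(prob_any_target(P_SINGLE, N_TARGETS))
    print(union_bound(P_SINGLE, N_TARGETS))
```

    With probabilities this small the exact value and the simple product
    n * p are practically identical, so multiplying 10^-11 by 10^4 is a
    fair shortcut.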

    These estimates assume that directed evolution in the lab is a valid
    model for natural evolution. Of course, this is not the case, as in
    directed evolution one does not have to bother about the viability of
    each intermediate organism in a linear sequence of point mutations, but
    only about the isolated activity of a new protein sequence after several
    or many mutations. Directed evolution jumps around in sequence space,
    whereas natural evolution is limited to single-step paths, and none of
    these steps may go downhill on the fitness surface.

    How, then, is it possible that any one of the 10^3 or 10^4 basically
    different protein folds (families) arose (anywhere in the biosphere),
    let alone all of them? If there were a need for 10^3 different searches
    with probabilities of around 10^-10, it seems a hopeless proposition.
    (And the few million years available for the formation of the first
    viable organism appear transastronomically inadequate.)

    The only possibility of a way out seems to be to claim that every single
    one of the different protein families used in the biosphere is
    intimately connected in sequence space, such that simple linear
    sequences of point mutations, with all intermediates naturally selected,
    will do for all proteins. In this case, more than 99.999999999% (eleven
    nines altogether) of sequence space is barren for life and was never
    visited by any sequence during evolution. Whether this is a feasible
    proposition will have to be shown experimentally.
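
    The "eleven nines" figure follows directly from a useful fraction of
    10^-11, as the following minimal Python sketch shows. It uses exact
    decimal arithmetic so that no digits are lost to floating-point
    rounding; the function name is hypothetical.

```python
from decimal import Decimal

USEFUL_FRACTION = Decimal("1E-11")  # about 1 useful sequence in 10^11


def barren_percentage(useful_fraction: Decimal) -> Decimal:
    """Percentage of sequence space that is not useful, computed exactly."""
    return ((Decimal(1) - useful_fraction) * 100).normalize()


if __name__ == "__main__":
    pct = barren_percentage(USEFUL_FRACTION)
    print(f"{pct}% barren, {str(pct).count('9')} nines")
```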

    This still leaves us with the mystery of the origin of the first living
    organism capable of natural evolution.

    But the very interesting finding of the two papers mentioned is that the
    protein sequence space is extremely sparsely populated with useful
    sequences. This makes evolution (which, for theological reasons, I
    believe has happened) an astonishingly marvellous process.


    Dr Peter Ruest			Biochemistry
    Wagerten			Creation and evolution
    CH-3148 Lanzenhaeusern		Tel.:	++41 31 731 1055
    Switzerland			E-mail:	<pruest@dplanet.ch>
     - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    	In biology - there's no free lunch -
    		and no information without an adequate source.
    	In Christ - there is free and limitless grace -
    		for those of a contrite heart.
