
Knowledge Representation in the Genome:
New Genes, Exons, and Pleiotropy

Lee Altenberg

Genetics 110: supplement, s41. 1985.


A ubiquitous process of selection is described that would structure the genome's encoding of the phenotype so as to channel the production of genetic variation towards more adaptive directions. This process provides a selection mechanism for Gilbert's explanation of why genes have introns. Data from the DNA sequence databases are presented showing new features of exons at the DNA level predicted by this process.

New selected genes are thought to arise from the duplication of existing selected DNA sequences; therefore, the creation of new genes can be thought of as "reproduction" in a population of DNA sequences. This is not merely a feature of putative "selfish DNA", but applies to genes essential to the organism. Any feature that increases the probability that a sequence can spawn new adaptive genes gives it a "fitness" advantage on this level. If such features are conserved then the genome will tend to accumulate them in lineages where the genome is accreting new genes. Such a process of "constructional" selection would operate over geologic time. The feature that evolves by constructional selection is not organismal fitness, but the ability to construct new useful genes.

New sequences with fewer pleiotropic effects, affecting a minimum number of functions, generally would receive a "constructional" advantage. The genome may therefore come to possess sources of low-pleiotropy variation, e.g., exons corresponding to protein structural and functional units. Predictions applying to homeosis, genetic correlation of characters, and functional morphology are discussed.

Text of the Presented Paper

Given at the 54th Annual Meeting of the Genetics Society of America, Boston University, August 11-15, 1985

I am going to talk about a kind of selection process which has not really been talked about before. I'm calling it "constructional selection", because it would emerge out of the construction of new useful genes over the course of evolution. What is interesting about it is that it leads to the prediction that the way the genotype encodes its knowledge about the phenotype will be such as to make evolution easier.

I'm going to introduce this idea in the context of Gilbert's exon shuffling hypothesis, where it gives some new predictions about exons. I will describe some tests I have done of these predictions, and if there is time, talk about more general predictions.

When split genes were discovered in 1977, Gilbert put forth a novel idea to account for why genes had introns. The idea was that having introns could speed up the evolution of proteins, because introns would increase the frequency of recombination events within genes, and even more important, if exons coded for separate functions of a protein, these could be reassorted into novel combinations. Then Blake extended this idea by proposing that exons might code for the folding units of proteins, domains and subdomains.

Now, in the intervening years many proteins have been found which fit Gilbert's and Blake's predictions, where the exons did code for peptide structures or functions. Then again, there have been proteins found where exons did not seem to relate to peptide structure. But overall, their predictions have support, and more recently, exon sharing between different gene families has been found.

Yet we are still left with an explanatory problem, which Crick called "the fallacy of evolutionary foresight". Even though having exons which code for peptide structures might help speed up evolution in the future, it does not explain where this correspondence came from in the first place. Doolittle, Crick and others have raised this problem, and two possible solutions have been offered.

Blake has put forward the idea that maybe exons were the original "proto-genes", as it were, coding for the minimal units of peptide structure, later assembled into larger units of peptide structure and function. The other idea is that introns arose by being inserted into intact coding regions, and that selection would allow introns to fall only in non-critical regions of the protein, such as hinge regions between secondary structures or domains.

Here I want to toss a third idea into the hat, the idea of constructional selection, which is basically that, in the construction of new genes over evolution, some exons have proliferated more than other exons.

I'll explain the general idea like this: We're all familiar with the idea of selfish DNA, which is supposedly DNA that survives not because of what it does for the organism, but because of what it does for itself, by leaving more copies of itself within the genome. But now, if we think about the DNA of "normal" genes, genes that benefit the organism, this DNA also appears to arise by the duplication of existing DNA.

So here I ask the question, what if there were features of selectively maintained DNA sequences that made them better able to construct new, useful genes? Then you would have a kind of competition between different DNA sequences as to which was most likely to foster the next set of new useful genes in the genome.

If a feature that conferred an advantage in this competition was conserved over time, then we can see that as new genes are added to the genome over evolutionary time, these features would come to predominate. In other words, as new genes are added to the genome, the genes you get should be better and better able to construct new useful genes. That's the basic idea of constructional selection.

Now my main point is that the features of split genes that Gilbert and Blake said could speed up evolution would do more than that. They would give a competitive advantage to those genes and exons that had them over those that didn't, in becoming a part of new, useful genes.

Now if we dissect the process where new genes are stably added to the genome, we see two basic places constructional selection could act. First there is the overall rate at which duplications of sequences are added to the genome. Second, there is the chance that this new DNA does something useful for the organism, and is therefore fixed in the population and preserved by selection. If you think in terms of baseball, the first is like the number of times you get up to bat, while the second is like your batting average.

And here is a list of features that would affect the batting average of a duplication in producing a useful gene. I want to focus on this property of modularity. This is where Blake's ideas about the folding properties of an exon product would come in. Blake's reasoning was that if an exon coded for a stable peptide structure, then when it was added to another peptide, it would stand a better chance of producing a functional protein.

Now, this idea can be extended to a general idea of modularity. Modularity requires, first, that an exon product have properties that are intrinsic to it, that are independent of different surrounding peptides it may become a part of. Second, it requires that it not disturb the properties of peptides to which it is added.

Now, why should high modularity increase the constructional advantage of an exon? Well, if the properties of a protein are selected for, then a principle of low pleiotropy would apply, which is that the more features of a protein that exon shuffling affects, the less likely it is that the resulting product is useful. Greater modularity, as I've defined it, means lower pleiotropy on the molecular level, and hence, I claim, a higher batting average in producing useful products.

So, let's consider what would happen in evolution if split genes and modular exons did have a constructional advantage. If we started with a genome containing split genes and unsplit genes, then after eons had gone by and new genes had accumulated, more of the new genes would be split genes. And even if the exons we started with were random with respect to peptide structure, it would be those exons having greater modularity which would proliferate over the eons.
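The proliferation argument can be made concrete with a toy calculation in Python. This is only a sketch under assumptions that are not in the talk: two hypothetical exon classes, a fixed expected number of duplications per copy per round (the "at-bats"), a fixed probability that a duplicate is retained as part of a useful gene (the "batting average"), and made-up numbers for all of them. It tracks expected copy numbers only, with no randomness.

```python
def proliferate(counts, at_bats, batting_avg, rounds):
    """Track expected copy numbers of each exon class over several rounds.

    counts      : dict mapping class name -> starting number of copies
    at_bats     : dict mapping class name -> expected duplications per copy per round
    batting_avg : dict mapping class name -> chance a duplicate is kept as a useful gene
    All three are hypothetical inputs chosen for illustration.
    """
    history = [dict(counts)]
    for _ in range(rounds):
        # Each copy adds at_bats * batting_avg retained new copies in expectation.
        counts = {k: n * (1.0 + at_bats[k] * batting_avg[k]) for k, n in counts.items()}
        history.append(dict(counts))
    return history


def fraction(counts, key):
    """Share of all copies that belong to the given class."""
    return counts[key] / sum(counts.values())


# Made-up starting genome: modular exons are initially rare.
start = {"modular": 10.0, "non_modular": 90.0}
at_bats = {"modular": 0.1, "non_modular": 0.1}        # same duplication rate
batting_avg = {"modular": 0.2, "non_modular": 0.05}   # modular duplicates kept more often

history = proliferate(start, at_bats, batting_avg, rounds=100)
print(fraction(history[0], "modular"), fraction(history[-1], "modular"))
```

With equal duplication rates, the class with the higher batting average makes up a growing share of the genome each round, whatever the starting composition.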

Now, recall Blake's explanation for why many exons correspond to protein structural units, which is that they were the original structural units from which all larger proteins were assembled. Here, what I'm saying is that constructional selection could have produced this correspondence regardless of whether exons were the original "proto-genes", and regardless of what the original composition of exons in the genome was like. And with respect to evolutionary foresight, we see that through differential exon proliferation, the usefulness of exons for future evolution could evolve from exon shuffling itself. What it requires is that properties, like modularity, be conserved over geologic time, which seems reasonable for many exons.

Now, once you start thinking in terms of a competition between exons in producing new genes, you can think of other properties that would give them an advantage. Here, I want to consider two properties of exons on the DNA level rather than the peptide level, that would give them greater modularity.

The first one is the exon being a multiple of 3 nucleotides in length. This would allow it to be tandemly duplicated or inserted into other genes without shifting the reading frame downstream. The second feature is the ends of the exon falling between rather than within codons. If it is between codons then the terminal amino acid of the exon won't change during an exon shuffle (unless of course the reading frame is off for the entire exon). Both of these features would give the exon a constructional advantage.
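The two DNA-level features can be illustrated with a small Python sketch. The sequences, function names, and insertion point are made up for illustration and are not from the talk. The sketch checks that inserting a block whose length is a multiple of 3 leaves the downstream codons unchanged, and it labels a position in the coding sequence by its phase within the codon.

```python
def codons(seq):
    """Split a coding sequence into 3-nucleotide codons, dropping any trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def intron_phase(coding_offset):
    """Phase of a position in the coding sequence: 0 means between codons,
    1 or 2 means after the first or second nucleotide of a codon."""
    return coding_offset % 3


def downstream_codons_unchanged(host, exon, pos):
    """Insert `exon` into `host` at `pos` and report whether the host codons
    downstream of the insertion point are still read as the same codons."""
    tail = codons(host)[(pos + 2) // 3:]          # host codons starting at or after pos
    new = codons(host[:pos] + exon + host[pos:])  # codons of the shuffled product
    return new[len(new) - len(tail):] == tail


host = "ATGGCCAAAGGGTTTTAA"  # hypothetical coding sequence: ATG GCC AAA GGG TTT TAA

print(intron_phase(6))                                # insertion point lies between codons
print(downstream_codons_unchanged(host, "GAGCTC", 6)) # 6 nt insert, a multiple of 3
print(downstream_codons_unchanged(host, "GAGC", 6))   # 4 nt insert, shifts the frame
```

The comparison is done on the actual codon lists rather than on `len(exon) % 3` so that the frameshift is visible in the data, not just asserted.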

One would predict, then, that if constructional selection had left its mark on the exons in the genome, there should be an excess of exons that are multiples of 3 in length and whose introns fall between codons. And this is what I have tested by going through the large DNA sequence databases now available, and counting exon lengths. This is postdoctoral work I've done with Doug Brutlag at Stanford.

The result we find is that there is a significant excess both of exons and of pairs of exons that are multiples of 3 in length, and also a significant excess of introns that fall between codons.

Here are the data where I've tallied all the exons in the database, excluding pseudogenes and viruses. The exons fall into 5 categories, and here are the numbers that are 0, 1 or 2 nucleotides longer than a multiple of 3. When we do a G-test for the goodness of fit between these observed frequencies and a random expectation, we find that in this one category, the middle coding exons, just as we would predict, there is an excess of exons that are multiples of 3 long. Now in the data collection, I have not counted exons that are suspected duplications of other exons, as in the collagen gene, so this figure is actually on the conservative side.
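For readers unfamiliar with the G-test, here is a minimal Python sketch of the goodness-of-fit statistic, G = 2 Σ O ln(O/E), applied to three length classes (0, 1, or 2 nucleotides over a multiple of 3). The counts are invented purely for illustration and are not the data from the talk, and taking "random expectation" to mean equal thirds is an assumption; other expected proportions can be passed in. With three classes there are 2 degrees of freedom, and for 2 degrees of freedom the chi-square tail probability is exactly exp(-G/2), so no statistics library is needed.

```python
import math


def g_statistic(observed, expected_props=None):
    """G = 2 * sum(O * ln(O / E)) for observed counts against expected proportions.
    If no proportions are given, equal proportions are assumed (an assumption here)."""
    total = sum(observed)
    if expected_props is None:
        expected_props = [1.0 / len(observed)] * len(observed)
    g = 0.0
    for o, p in zip(observed, expected_props):
        if o > 0:  # a zero count contributes nothing to the sum
            g += o * math.log(o / (total * p))
    return 2.0 * g


def p_value_two_df(g):
    """Upper-tail chi-square probability for 2 degrees of freedom (three classes)."""
    return math.exp(-g / 2.0)


# Hypothetical counts of exons with length mod 3 equal to 0, 1, 2 (NOT the talk's data).
counts = [60, 30, 30]
g = g_statistic(counts)
print(g, p_value_two_df(g))
```

To test against a non-uniform expectation, such as one derived from observed intron phase frequencies, pass it as `expected_props`, for example `g_statistic(counts, [0.4, 0.3, 0.3])`.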

Now when we look at the introns, we find a significant excess of introns falling between rather than within codons. What is more striking, but which I can't explain, is the deficit of introns between the 2nd and 3rd nucleotides of codons. Now this raises a possible source of ambiguity for the test. Instead of constructional selection, maybe it is because of introns being inserted selectively that this pattern emerges. And if introns had a preferred phase in the codon to insert, this would also generate an excess of exons a multiple of 3 long. Under this hypothesis, the observed number of exons a multiple of 3 long, though in excess, is not significantly so.

So here's a problem. In order to resolve this ambiguity we need to go back and think about how constructional selection would work. Exon shuffling could occur for pairs or larger groups of exons, not just single exons, in which case constructional selection would act on these larger units.

Now let's take a look at all the genes with two or more middle exons from the NBRF database. These are the lengths of their middle exons, mod 3.

We notice that in most of the cases where an exon is not a multiple of 3, it is part of a pair of exons whose total length is a multiple of 3. I've put these in parentheses. Now is this significant?

To test this, I've done a stochastic simulation where all the intron phases from the actual set of genes were randomly permuted to regenerate the exons. Then I counted all the exons that weren't a multiple of 3 and couldn't be paired into a multiple of 3. Five thousand replicates of this intron scrambling give us a distribution on the number of unpaired exons left over for this set of introns. And the result is that the observed number left over was significantly less than expected. So even if introns were being inserted non-randomly with respect to codon phase, it would not explain why so many pairs of exons were multiples of 3 long.
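The intron-scrambling test can be sketched in Python as follows. Several details are assumptions rather than the procedure actually used: each gene is represented only by the ordered list of its intron phases (0, 1, or 2), a middle exon's length mod 3 is taken as the difference between the phases of its flanking introns, pairs are formed greedily between adjacent exons, and phases are shuffled across the pooled set of genes while each gene keeps its number of introns. The gene data below are invented.

```python
import random


def exon_residues(intron_phases):
    """Length mod 3 of each middle exon, from the phases of its two flanking introns."""
    return [(end - start) % 3 for start, end in zip(intron_phases, intron_phases[1:])]


def count_unpaired(residues):
    """Count exons that are not a multiple of 3 and cannot be paired with the
    next exon to give a combined length that is a multiple of 3 (greedy, adjacent pairs)."""
    unpaired, i = 0, 0
    while i < len(residues):
        if residues[i] == 0:
            i += 1
        elif i + 1 < len(residues) and (residues[i] + residues[i + 1]) % 3 == 0:
            i += 2  # this exon and the next form a multiple-of-3 pair
        else:
            unpaired += 1
            i += 1
    return unpaired


def total_unpaired(genes):
    return sum(count_unpaired(exon_residues(g)) for g in genes)


def scramble_test(genes, replicates=5000, seed=0):
    """Return the observed unpaired count and the fraction of scrambled
    replicates with an unpaired count at least as small."""
    rng = random.Random(seed)
    observed = total_unpaired(genes)
    pool = [phase for g in genes for phase in g]
    as_small = 0
    for _ in range(replicates):
        rng.shuffle(pool)
        shuffled, start = [], 0
        for g in genes:  # deal the shuffled phases back out, keeping gene sizes
            shuffled.append(pool[start:start + len(g)])
            start += len(g)
        if total_unpaired(shuffled) <= observed:
            as_small += 1
    return observed, as_small / replicates


# Invented intron phases for three hypothetical genes (NOT the NBRF data).
genes = [[0, 1, 0, 0], [1, 1, 2, 1], [0, 0, 0]]
print(scramble_test(genes))
```

Because the shuffle reuses the observed phases, any overall preference of introns for a particular codon phase is preserved in every replicate; only the arrangement of phases along the genes is randomized.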

So here we have at least circumstantial support for this idea that constructional selection has affected the composition of the exons in the genome.

I'll conclude by mentioning that this kind of selection would be expected to operate on other genetic units besides exons, such as regulatory sequences and whole genes: any unit of duplication. The principles of modularity and low pleiotropy should apply here as well, and offer another point of view for approaching questions about the evolution of development and the genetic correlation of characters.