The Epigenetics Revolution

Authors: Nessa Carey

Back in the 1970s scientists compared simple one-celled organisms and complex creatures like humans. The amount of DNA in their cells seemed surprisingly similar, considering how dissimilar the organisms were. This suggested that some genomes must contain a lot of DNA that isn’t really used for anything, and led to the concept of ‘junk DNA’ [3] – chromosome sequences that don’t do anything useful, because they don’t code for proteins. At around the same time a number of labs showed that large amounts of the mammalian genome contain DNA sequences that seem to be repeated over and over again, and don’t code for proteins (repetitive DNA). Because they don’t code for protein, it was assumed they weren’t contributing anything to the cell’s functions. They just appeared to be along for the ride [4, 5]. Francis Crick and others coined the phrase ‘selfish DNA’ to describe these regions. These two models, of junk DNA and selfish DNA, have been delightfully described recently as ‘the emerging view of the genome as being largely populated by genetic hobos and evolutionary debris’ [6].
We humans are remarkable, with our trillions of cells, our hundreds of cell types, our multitudes of tissues and organs. Let’s compare ourselves (a little smugly, perhaps) with a distant relative, a microscopic worm, the nematode Caenorhabditis elegans. C. elegans, as we usually call it, is only about one millimetre long and lives in soil. It has many of the same organs as higher animals, such as a gut, mouth and gonads. However, it only consists of around 1,000 cells. Remarkably, as C. elegans develops, scientists have been able to identify exactly how each of these cells arises.
This tiny worm is a powerful experimental tool, because it provides a roadmap for cell and tissue development. Scientists are able to alter expression of a gene and then plot out with great precision the effects of that mutated gene on normal development. In fact, C. elegans has laid the foundation for so many breakthroughs in developmental biology that in 2002 the Nobel Committee awarded the Prize in Physiology or Medicine to Sydney Brenner, Robert Horvitz and John Sulston for their work on this organism.
We can’t fault C. elegans on grounds of utility, but it is clearly a much less complex organism than our good selves. Why are we so much more sophisticated? Given the importance of proteins in cellular function, the original assumption was that complex organisms like mammals have more protein-coding genes than simple creatures like C. elegans. This was a perfectly reasonable hypothesis but it has fallen foul of a phenomenon described by Thomas Henry Huxley. He was Darwin’s great champion in the 19th century and it was Huxley who first described ‘the slaying of a beautiful hypothesis by an ugly fact’.
As DNA sequencing technologies improved in cost and efficiency, numerous labs throughout the world sequenced the genomes of a number of different organisms. They were able to use various software tools to identify the likely protein-coding genes in these different genomes. What they found was really surprising. There were far fewer protein-coding genes than expected. Before the human genome was decoded, scientists had predicted there would be over 100,000 such genes. We now know the real number is between 20,000 and 25,000 genes [7]. Even more oddly, C. elegans contains about 20,200 genes [8], not so very different a number from us.
Not only do we and C. elegans have about the same number of genes, these genes tend to code for pretty much the same proteins. By this we mean that if we analyse the sequence of a gene in human cells, we can find a gene of broadly similar sequence in the nematode worm. So the phenotypic differences between worms and humans aren’t caused by Homo sapiens having more, different or ‘better’ genes.
Admittedly, more complicated organisms tend to splice their genes in more ways than simpler creatures. Using our CARDIGAN example from Chapter 3 as an analogy once again, C. elegans might only be able to make the proteins DIG and DAN whereas mammals would be able to make those two proteins and also CARD, RIGA, CAIN and CARDIGAN.
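The CARDIGAN analogy can be made concrete with a small sketch. It assumes, purely for illustration, that a "spliced product" is any word read off the gene from left to right by skipping letters but never reordering them; the function name `can_splice` and the two product lists are hypothetical labels, not terms from the text.

```python
def can_splice(gene: str, product: str) -> bool:
    """Return True if `product` can be built from `gene` by skipping letters
    while keeping the remaining letters in their original order."""
    remaining = iter(gene)
    # `letter in remaining` consumes the iterator up to the first match,
    # so later letters can only be matched further along the gene.
    return all(letter in remaining for letter in product)


GENE = "CARDIGAN"
# Illustrative split taken from the analogy: the worm makes two products,
# mammals make those two plus four more.
WORM_PRODUCTS = ["DIG", "DAN"]
MAMMAL_PRODUCTS = WORM_PRODUCTS + ["CARD", "RIGA", "CAIN", "CARDIGAN"]

if __name__ == "__main__":
    for product in MAMMAL_PRODUCTS:
        print(product, can_splice(GENE, product))
    # A word that needs the letters out of order cannot be made.
    print("GRID", can_splice(GENE, "GRID"))
```

Every word in the analogy passes the check, while a word such as GRID fails because its letters would have to be taken out of order. Real splicing joins exons rather than letters, so this is only a picture of the idea that one gene can yield several ordered products.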
This certainly would allow humans to generate a much greater repertoire of proteins than the 1 mm worm, but it introduces a new problem. How do more complicated organisms regulate their more complicated splicing patterns? This regulation could in theory be controlled solely by proteins, but this in turn has difficulties. The more proteins a cell needs to regulate in a complicated network, the more proteins it needs to do the regulation. Mathematical models have shown that this rapidly leads to a situation where the number of proteins that we need begins to outstrip the number of proteins that we actually possess – clearly a non-starter.
Do we have an alternative? We do, and it’s indicated in Figure 10.1.
Figure 10.1. This graph demonstrates that the complexity of living organisms scales much better with the percentage of the genome that doesn’t code for protein (black columns) than it does with the number of base-pairs coding for protein in a genome (white columns). The data are adapted from Mattick, J. (2007), J. Exp. Biol. 210: 1526–1547.
At one extreme we have the bacteria. Bacteria have very small, highly compacted genomes. Their protein-coding genes cover about 4,000,000 base-pairs, which is about 90 per cent of their genome. Bacteria are very simple organisms and fairly rigid in the way they control their gene expression. But things change as we move further up the evolutionary tree.
The protein-coding genes of C. elegans cover about 24,000,000 base-pairs, but that only accounts for about 25 per cent of their genome. The remaining 75 per cent doesn’t code for protein. By the time we reach humans, the protein-coding regions cover about 32,000,000 base-pairs, but this only represents about 2 per cent of the total genome. There are various ways that we can calculate the protein-coding regions, but they make relatively little difference to the astonishing bottom line. Over 98 per cent of the human genome doesn’t code for protein. All but 2 per cent of our genome is ‘junk’.
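The contrast in these figures can be checked with a few lines of arithmetic. The sketch below takes the rounded numbers quoted in the text at face value; the dictionary and function names are hypothetical, and nothing here is a measured genome size.

```python
# Rounded figures quoted in the text: protein-coding base-pairs and the
# fraction of the whole genome that those base-pairs represent.
GENOMES = {
    "bacteria": (4_000_000, 0.90),
    "C. elegans": (24_000_000, 0.25),
    "human": (32_000_000, 0.02),
}


def non_coding_percent(coding_fraction: float) -> float:
    """Percentage of the genome that does not code for protein."""
    return round((1.0 - coding_fraction) * 100, 1)


def coding_fold_change(organism: str, baseline: str = "bacteria") -> float:
    """How many times more protein-coding base-pairs than the baseline."""
    return GENOMES[organism][0] / GENOMES[baseline][0]


if __name__ == "__main__":
    for name, (coding_bp, fraction) in GENOMES.items():
        print(
            f"{name}: {coding_bp:,} coding base-pairs, "
            f"{non_coding_percent(fraction)}% non-coding, "
            f"{coding_fold_change(name):.0f}x the bacterial coding content"
        )
```

On these figures the protein-coding content grows only about eightfold from bacteria to humans, while the non-coding share climbs from about 10 per cent to about 98 per cent.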
In other words, the numbers of genes, or the sizes of these genes, don’t scale with complexity. The only feature of a genome that really seems to get bigger as organisms get more complicated is the section that doesn’t code for protein.
The tyranny of language
So what are these non-coding regions of the genome doing, and why are they so important? It’s when we start to consider this that we begin to notice what a strong effect language and terminology have on human thought processes. These regions are called non-coding, but what we mean is that they don’t code for protein. This isn’t the same as not coding at all.
There is a well-known scientific proverb: absence of evidence is not the same as evidence of absence. For example, in astronomy, once scientists had developed telescopes that could detect infrared radiation, they were able to detect thousands of stars that had never been ‘seen’ before. The stars had always been there, but we couldn’t detect them conclusively until we had an instrument for doing so. A more everyday example might be a mobile phone signal. Such signals are all around us, but we cannot detect them unless we have a mobile phone. In other words, what we find depends very much on how we are looking.
Scientists identify the genes which are expressed in a specific cell type by analysing the RNA molecules. This is done by extracting all the RNA from cells and then analysing it using various different techniques, so that you build a database of all the RNA molecules that are present. When researchers in the 1980s first began investigating which genes were expressed in a given cell type, the techniques available were relatively insensitive. They were also designed to detect only mRNA molecules, as these were the ones that were assumed to be important. These methods tended to be good at detecting highly expressed mRNAs and quite poor at detecting the less well-expressed sequences. Another confounding factor was that the software used to analyse mRNA was set so that it would ignore signals originally generated from repetitive, i.e. ‘junk’, DNA.
These techniques served us very well for profiling the mRNA that we were already interested in – the mRNA molecules that coded for proteins. But as we have seen, this only represents about 2 per cent of the genome. It wasn’t until new detection technologies were coupled with hugely increased computing power that we began to realise that something very interesting was happening in the remaining 98 per cent – the non-coding part of our genome.
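The blind spot of the early methods can be pictured with a toy filter. Everything in this sketch is invented for illustration: the transcript names, the abundance values and both thresholds are hypothetical and do not come from any real assay or software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    name: str
    abundance: float   # arbitrary, invented units
    is_mrna: bool      # True if the RNA codes for a protein
    from_repeat: bool  # True if it arises from repetitive DNA


def early_profile(transcripts, threshold=10.0):
    """Mimic an insensitive, mRNA-only method whose software also
    discards signals that originate from repetitive DNA."""
    return [
        t.name
        for t in transcripts
        if t.is_mrna and not t.from_repeat and t.abundance >= threshold
    ]


def sensitive_profile(transcripts, threshold=0.5):
    """Mimic a later method: lower detection threshold, no filtering
    by RNA type or by genomic origin."""
    return [t.name for t in transcripts if t.abundance >= threshold]


# Invented sample: one abundant mRNA, one rare mRNA, two ncRNAs.
SAMPLE = [
    Transcript("abundant_mRNA", 120.0, True, False),
    Transcript("rare_mRNA", 3.0, True, False),
    Transcript("long_ncRNA", 6.0, False, False),
    Transcript("repeat_derived_ncRNA", 15.0, False, True),
]

if __name__ == "__main__":
    print("early method sees:", early_profile(SAMPLE))
    print("sensitive method sees:", sensitive_profile(SAMPLE))
```

With the same made-up sample, the early-style filter reports only the abundant mRNA, while the more sensitive, unfiltered profile reports all four molecules. The RNA was present in both cases; only the way of looking changed.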
With these improved methodologies, the scientific world began to appreciate that there was actually a huge amount of transcription going on in the parts of the genome that didn’t code for proteins. Initially this was dismissed as ‘transcriptional noise’. It was suggested that there was a baseline murmur of expression from all over the genome, as if these regions of DNA occasionally produced an RNA molecule that got above a detection threshold. The concept was that although we could detect these molecules with our new, more sensitive equipment, they weren’t really biologically meaningful.
The phrase ‘transcriptional noise’ implies a basically random event. However, the patterns of expression of these non-protein-coding RNAs were different for different cell types, which suggested that their transcription was far from random [9]. For example, there was a lot of this expression in the brain. It’s now become clear that the patterns of expression are different in different brain regions [10]. This effect is reproducible when the various brain regions are compared from different individuals. This isn’t what we would expect if this low-level transcription of RNA was a purely random process.
It is becoming clearer that this transcription from genes that don’t code for protein is actually critically important for cellular function. Oddly, however, we remain caught in a linguistic trap of our own making. The RNA that is produced from these regions, the RNA that was previously under our radar, is still called non-coding RNA (ncRNA). It’s a sloppy shorthand, because what we really mean is non-protein-coding RNA. The ncRNA does, in fact, code for something – it codes for itself, a functional RNA molecule. Unlike mature mRNA, which is an RNA means to a protein end, ncRNAs are themselves the end-points.
Re-defining rubbish
This is the paradigm shift. For at least 40 years molecular biologists and geneticists have focused almost exclusively on the genes that code for proteins, and the proteins themselves. There have been exceptions, but we’ve just treated these as the odd bits of rubble on the top of the shed. But non-coding RNAs are finally starting to stand firmly alongside proteins as fully functional molecules. Different but equal.
These ncRNAs are found all over the genome. Some come from introns. Originally it was assumed that the spliced-out bits of mRNA from the introns get degraded by cells. It now seems much more likely that at least some (if not all or most) are actually processed to act as functional ncRNAs in their own right. Others overlap genes, frequently transcribed from the opposite strand to the protein-coding mRNA. Yet others are found in regions where there are no protein-coding genes at all.
We met two ncRNAs in the last chapter. These were Xist and Tsix, the ncRNAs that are required for X inactivation. These are both very long ncRNAs, of several thousand bases in length. When Xist was first identified, it was only the second known ncRNA. Current estimates suggest there are thousands of such molecules in the cells of higher mammals, with over 30,000 ‘long’ ncRNAs (defined as having a length greater than 200 bases) reported in mice [11]. Long ncRNAs may actually out-number protein-coding mRNAs.
In addition to X inactivation, long ncRNAs also appear to play a critical role in imprinting. Many imprinted regions contain a section that encodes a long ncRNA, which silences the expression of surrounding genes. This is similar to the effect of Xist. The protein-coding mRNAs are silenced on the copy of the chromosome which expresses the long ncRNA. For example, there is an ncRNA called Air, expressed in the placenta, exclusively from the paternally inherited mouse chromosome 17. Expression of Air ncRNA represses the nearby Igf2r gene, but only on the same chromosome [12]. This mechanism ensures that Igf2r is only expressed from the maternally inherited chromosome.
