Apr 24

Mapping the Unknown

Biology, Measurement, Technology, History

It’s hard to believe, but the first complete sequence of the human genome was only finished in 2003. The final draft was the culmination of 13 years of intense international collaboration among 20 different institutes. It was meant to be a victory, a roadmap to a consensus genome that would allow scientists around the world to more easily trace the genetic roots of disease.

Seventeen years later, the sequence is still not fully complete. The consensus sequence has around 50 unclosed gaps, down from around 800 in the early 2000s. These gaps include repetitive stretches of nucleotides like those found in telomeres, highly variable regions like the MHC immune complex, and other problem areas of unknown provenance.

In the years since the Human Genome Project was officially declared completed, there has been a revolution in genome sequencing, driven by the tumbling price of sequencing a genome. In 2003, the year that the Project was completed, the cost of sequencing a single human genome was around $100 million USD. In 2020, the average cost for full genome sequencing, which includes all the pieces that aren’t known to code for anything, has fallen to around $1,500. Want to sequence just the parts that code for proteins? That’ll run you only about $200.

The proliferation of next-generation sequencing approaches, a methodological zoo that deserves its own treatment, has driven the huge decrease in the price of sequencing, launched the personalized medicine movement, and opened our eyes to the massive differences between individuals. It turns out that our early understanding of what constitutes a “human genome” was deeply naïve. Rather than a single consensus sequence that reflects a “healthy” human, we’ve found enormous variation at each position in the genome.

In addition to finding almost bottomless variation between individuals, the development of single-cell sequencing technologies has demonstrated that there may not even be a consensus sequence in a single individual, let alone for humans in general.

The Human Holobiont

A transformative effect of the plummeting cost of sequencing was the beginning of the “-omics” era, in which scientists were able to turn their attention to network analysis rather than the study of a single organism.

Microbiology, the study of single-celled microorganisms, immediately benefited from this approach. One of the greatest difficulties for traditional microbiology is that an organism needs to be culturable in order to be identified, and we have no idea how to culture the vast majority of environmental microbes. For a long time, it didn’t matter that much, because the advancements that could be made with the microbes we could culture were sufficient. We had cloning that let us produce recombinant molecules like insulin for a fraction of the price. We could study the organisms that cause human disease in the laboratory, because those were the ones for which the techniques were designed.

It had long been argued that a change was necessary: that continuing to study only the microbes that could cause harm was reductionist, and that the model organisms that worked in the lab couldn’t explain the vast complexity of the microbes beyond the reach of regular culture techniques. The reality, though, was that there was little funding for basic science, and even less for research that didn’t appear to bear on human wellbeing, whether through agriculture, the food supply, or medicine.

The arrival of sequencing technologies, especially those that didn’t require any kind of nucleic acid enrichment prior to sequencing, solved this problem – and allowed microbe hunters to go out into the world and see what they could find.

In 2003, the year the Human Genome Project was completed, there were 754 published papers that mentioned the “microbiota,” the term used to refer to microbes that lived in the environment prior to the “-omics” era. As far back as the 1950s, scientists had been pointing out that there were bacteria living in our bodies. They pointed to the obvious places - gingival plaques, the gut. In an incredible paper published in 1966, Dr. Rene Dubos of Rockefeller University makes the astounding claim that, based on his research, normal development of the intestines requires the presence of microbes.

At this point, his conclusion is old news. We’ve come to realize that our microbial hitchhikers are at least as numerous as our own cells (the often-quoted figure of 10-to-1 has since been revised down to something closer to 1-to-1), and that they do everything: synthesizing neurotransmitters, signaling satiety, modulating the immune system, and affecting the progression of diseases like obesity, Parkinson’s, and cirrhosis.

Those effects, profound as they are, don’t even begin to cover the whole of it. Emerging research suggests that in addition to the bacterial load in the gut, we have a circulating population of bacteria in the blood, a diverse population in the lungs, and a small but steady population in the eyes. Women even appear to have a bladder microbiome.

Given that we’re not walking, talking biofilms (though… maybe, more on that in another article), it’s been a continuous shock to discover stable populations of bacteria everywhere in the body, and has seriously shaken the historical understanding that the function of bacteria is to cause disease.

These discoveries have shifted our understanding of how to define “human.” Instead of viewing ourselves as a ship tossed on the sea of biology, more and more researchers are realizing that we are an ecological construct unto ourselves: the human holobiont. In this view, the human being is simply one part of a more complex ecosystem, one shared with bacteria, archaea, fungi, and viruses.

The Deep Well of Metagenomes

Sequencing the human genome allowed the price of sequencing other genomes to drop precipitously, and this sort of drop encouraged a new discipline – metagenomics. Instead of focusing on the genome of a single species, metagenomicists focus on capturing all genetic diversity of a given environment. This approach has underscored the diversity of the human gut microbiota, and supported the perspective of the human holobiont rather than the human being.

It was also a better approach for understanding the taxonomic relationships between the organisms that colonize every single surface in the environment. Historically, bacterial taxonomic trees were built from ribosomal RNA gene sequences, which worked, but couldn’t uncover the full extent of the similarities between different kinds of bacteria.

The difficulty arises when attempting to capture sequences of low abundance, rather than the species that predominate in an environment. Most genomes, including the original human genome, are assembled using something called “shotgun sequencing.” In this approach, chromosomal DNA is first extracted from cells and broken into smaller pieces. Those smaller pieces are placed into vectors, circular pieces of DNA of known sequence. These vectors are then amplified in a laboratory bacterium, usually E. coli, and then the sequence of the variable portion is determined. Here’s a graphic overview of the process:

The main hurdle to reconstructing the diversity of an ecosystem comes down to read depth, represented on the far right of the graphic. For the human genome, the rule of thumb is that a given sequence needs to appear in at least 100 reads in order to have a decent sense of the variability at that site, since not every cell carries the exact same genetic code.
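To make read depth concrete, here is a minimal Python sketch. It is not the pipeline used by any sequencing project. It assumes fixed-length reads that land uniformly at random along a genome, which real data does not quite do, and the function names and the toy numbers at the bottom are made up for illustration.

```python
import random

def simulate_read_starts(genome_length, read_length, num_reads, seed=0):
    """Pick a start position for each read (uniform sampling is an assumption)."""
    rng = random.Random(seed)
    last_start = genome_length - read_length
    return [rng.randint(0, last_start) for _ in range(num_reads)]

def depth_per_position(genome_length, read_length, starts):
    """Count how many reads cover each position, using a difference array."""
    diff = [0] * (genome_length + 1)
    for start in starts:
        diff[start] += 1                # read begins covering here
        diff[start + read_length] -= 1  # read stops covering here
    depth = []
    running = 0
    for i in range(genome_length):
        running += diff[i]
        depth.append(running)
    return depth

def mean_depth(genome_length, read_length, num_reads):
    """Average depth: total bases sequenced divided by genome length."""
    return num_reads * read_length / genome_length

if __name__ == "__main__":
    # Toy numbers, chosen only to keep the example fast.
    genome_length, read_length, num_reads = 10_000, 100, 3_000
    starts = simulate_read_starts(genome_length, read_length, num_reads)
    depth = depth_per_position(genome_length, read_length, starts)
    print("mean depth:", mean_depth(genome_length, read_length, num_reads))
    print("lowest depth at any position:", min(depth))
```

The point of printing the minimum alongside the mean is that individual positions can sit well below the average, which is part of why depth targets are set with a generous margin.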

In a complex environment where a few species dominate, it can be difficult to get sufficient read coverage for genomes that are relatively rare. This makes it difficult to reconstruct the full complexity of a system, and casts some doubt on conclusions reached through metagenomic analysis, since small changes can be hidden by a lack of read depth.
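A rough back-of-the-envelope calculation shows how quickly rare genomes get starved of reads. The Python sketch below assumes reads are drawn in proportion to how much DNA each community member contributes (cell abundance times genome length) and ignores extraction and amplification bias. The community, the abundances, and the name `expected_depths` are all hypothetical.

```python
# Hypothetical community: name -> (fraction of cells, genome length in bases).
EXAMPLE_COMMUNITY = {
    "dominant": (0.989, 5_000_000),
    "common": (0.010, 5_000_000),
    "rare": (0.001, 5_000_000),
}

def expected_depths(total_reads, read_length, community):
    """Estimate the average read depth each genome receives in a mixed sample."""
    # DNA contributed by each member: abundance weighted by genome size.
    dna = {name: abundance * length
           for name, (abundance, length) in community.items()}
    total_dna = sum(dna.values())
    depths = {}
    for name, (_abundance, length) in community.items():
        reads_for_member = total_reads * dna[name] / total_dna
        depths[name] = reads_for_member * read_length / length
    return depths

if __name__ == "__main__":
    # Ten million reads of 150 bases each, again just an illustrative choice.
    for name, depth in expected_depths(10_000_000, 150, EXAMPLE_COMMUNITY).items():
        print(f"{name}: about {depth:.1f}x")
```

With these made-up numbers the dominant member is covered hundreds of times over, while the member making up 0.1% of cells averages well under one read per position. Closing that gap means sequencing far more deeply, or enriching for the rare fraction before sequencing.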

This has become an issue as researchers have moved into studying even smaller parts of the microbiome: viruses. As we’ve covered previously, viruses are a strange process that doesn’t fit neatly into our biological definitions of life, but they contain nucleic acids and are massively abundant.

Take, for example, an early viral metagenomics paper on the genetic diversity of uncultured viral communities. In order to isolate sufficient genetic material for their study, the team had to process 200 L of seawater to remove contaminants and enrich for virus-like particles. Once they had collected a size-selected fraction of material, they tested it for the presence of contaminants like bacteria, and then sequenced it following a method similar to the shotgun approach described above. The method yielded a ton of information, but left a lot of uncertainty about how the information fit together: in the abstract, they report between 374 and 7,114 viral types in the sample. A significant spread, by any measure.

It was an improvement over previous approaches to understanding viral diversity, which relied on manually examining viruses and grouping them by capsid geometry, but it still fell far short of reliability. One thing this and other early studies showed was that viruses aren’t closely related to one another in the way the rest of life is; most genetic material in viral metagenomics studies has never been seen before, while 95% of genetic material in conventional sequencing has been previously identified.

Seventeen years later, these limitations still plague metagenomics research. The majority of new viral sequences are unrelated to one another and cannot be effectively grouped together, so it remains hard to explain how this sort of diversity is maintained rather than simply pointing out that it occurs.

Attempts to standardize viral sequence assembly have been made, but even these attempts point out the obvious – the approaches being used to study viral diversity are still in their infancy. One serious failure has been that most studies target DNA-based viruses, effectively obscuring all other viral diversity.

For now, explorations of the human virome have only just begun, and the role that viruses play in the human holobiont is unclear. In some ways, it feels like the early 2000s all over again: methods for probing these unknown depths are improving, and there is a sense that great changes are on the horizon.