The limiting factor then became the speed at which the shorter shotgun fragments could be sequenced. Hood’s work in automating the sequencing process—notably in adopting a different fluorescent dye for each of DNA’s four chemical letters—meant that raw sequence data could be produced in great quantities. This shifted the rate-limiting step to what was known as base-calling: deciding which of the four bases—A, C, G or T—was indicated by the fluorescent bands. As Green noted in 1996: “Editing (correction of base calls and assembly errors) is at present one of the most skill-intensive aspects of genome sequencing, and as such is a bottleneck to increased throughput, a potential source of uneven sequence quality, and an obstacle to more widespread participation in genomic sequencing by the community. We are working toward the long term goal of completely removing the need for human intervention at this stage, with short-term goals of improving the accuracy of assembly and base-calling, and of more precisely delineating sequence regions requiring human review.”
Green’s solution was a pair of programs, called Phred (for Phil’s read editor) and Phrap (Phil’s revised assembly program). Phred takes the raw fluorescent output and decides which bases are represented, assigning each one a measure of how likely it is to be wrong. This is already useful, since it flags up the least reliable sections, allowing human intervention to be concentrated where it is most needed. Phred’s output can be fed into Phrap, an assembly program like the TIGR Assembler but a more intelligent one: it takes into account the differing quality of the base calls, rather than simply assuming that they are all correct. As a result, Phrap can judge which of several possible overlaps is more likely to be correct, leading to higher-quality assemblies with fewer errors.
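Phred’s error measure became famous in its own right as the “Phred quality score,” a logarithmic scale on which a score of 30 means a one-in-a-thousand chance that the call is wrong. The Python sketch below illustrates the scale, and how an assembler can use quality scores to weigh one candidate overlap against another; the scoring function here is an illustrative stand-in for the idea behind Phrap, not its actual algorithm.

```python
import math

def phred_quality(p_error: float) -> int:
    """Phred's scale: Q = -10 * log10(P); an error probability of 0.001 gives Q = 30."""
    return round(-10 * math.log10(p_error))

def error_probability(q: int) -> float:
    """Invert the scale: the probability that a call with quality q is wrong."""
    return 10 ** (-q / 10)

def overlap_score(calls_a, calls_b):
    """Score two positionally aligned read segments, weighting by call quality.

    Each segment is a list of (base, quality) pairs. A mismatch between two
    high-quality calls counts heavily against the overlap; a mismatch involving
    a low-quality call is nearly ignored. (Illustrative only; Phrap's real
    scoring is more sophisticated.)
    """
    score = 0.0
    for (base_a, q_a), (base_b, q_b) in zip(calls_a, calls_b):
        weight = min(q_a, q_b)  # a disagreement matters only if both calls are trustworthy
        score += weight if base_a == base_b else -weight
    return score

# Two candidate overlaps, each disagreeing at one position: the first conflict
# involves a shaky call (Q8), the second two confident calls (Q35 vs Q40).
read = [("A", 30), ("C", 35), ("G", 40)]
candidate_1 = [("A", 25), ("C", 30), ("T", 8)]
candidate_2 = [("A", 25), ("T", 40), ("G", 30)]
print(overlap_score(read, candidate_1))  # 47.0: mild penalty, likely the true overlap
print(overlap_score(read, candidate_2))  # 20.0: heavy penalty from a confident mismatch
```

The design point is that the assembler asks not simply whether two reads disagree, but how confident each underlying call was, which is precisely the judgment a human editor previously made by eye.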
Green’s software allowed human intervention in the sequencing process to be minimized. This meant that it was possible to run more machines faster, and for longer, without the need to scale up the human personnel involved—a critical requirement of the Human Genome Project, since there were some three billion base-pairs to sequence. When the possibility of sequencing the human genome was first discussed back in the 1980s, it was generally assumed that new sequencing technologies would need to be developed to address the issue. As it turned out, the old technologies—dating right back to the use of gels to separate biological molecules in 1954—proved adequate; the trick was scaling the whole process up to unimagined levels. Hood’s machines, together with Green’s programs—and ones written by Rodger Staden, a pioneer in the study of base-call quality and in many other areas of bioinformatics—helped make this possible.
Even before these tools were available, the public project was beginning to gear up for the final stage of sequencing the human genome. In particular, the two most experienced sequencers—John Sulston at the Sanger Centre and Bob Waterston at the Washington University School of Medicine—were starting to push for an acceleration of the public genome project.
In his autobiography, Sulston wrote that toward the end of 1994 Waterston did some calculations on what might be possible, and sent what Sulston called “an indecent proposal.” It was a strategy for completing the human genome in even less time than the 15-year period laid down in the timetable at the start of the U.S. public project in 1990. That this was at all conceivable, as Sulston pointed out, was due to the fact that the suggestion “departed from our previous practice in proposing that we should churn out sequence as fast as possible and use automatic assembly and editing procedures to string it together to a reasonable but not absolute standard of accuracy—99.9 per cent rather than 99.99.” This was by no means a turning away from the absolutist principles that had hitherto guided them, however; it was merely a stepping stone toward the final destination. “We could continue the slower process of hand finishing,” Sulston explained, but as a separate track alongside the faster sequencing. In this way, they would have the best of both worlds: obtaining the rough sequence more quickly while still arriving at the full ‘gold standard’ in due course.
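The gap between those two standards is larger than it looks. Over a genome of roughly three billion base-pairs (the size here is an approximation), a quick calculation shows what each implies:

```python
# Expected miscalled bases at each accuracy standard, over an
# (approximate) three-billion-base human genome.
GENOME = 3_000_000_000

for accuracy in (0.999, 0.9999):
    errors = GENOME * (1 - accuracy)
    print(f"{accuracy:.2%} accurate -> roughly {errors:,.0f} wrong bases")

# 99.90% accurate -> roughly 3,000,000 wrong bases
# 99.99% accurate -> roughly 300,000 wrong bases
```

A tenfold difference in residual errors, in other words, which is why the hand-finishing track still mattered.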
Waterston presented this idea for the first time at a meeting in Reston, Virginia, on December 16, 1994. He explained how the experience gained in sequencing the nematode worm had led him and Sulston to believe that it was time to move on to the final phase of the human genome project. He said that his own lab was producing 15,000 runs per week of 400–500 bases each on the automated sequencing machines. He projected that he could scale this up to 84,000 runs a week, and noted that if three laboratories managed this kind of output it would take just five years to sequence 99 percent of the human genome with 99.9 percent accuracy.
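Those figures hang together. Shotgun sequencing requires each stretch of the genome to be read several times over, so that the fragments overlap enough to be assembled; a common rule of thumb puts the target at roughly eight- to ten-fold coverage. A back-of-the-envelope check (the 450-base average read length, taken as the midpoint of Waterston’s 400–500 range, and the three-billion-base genome size are assumptions, not figures from the Reston talk):

```python
# Rough check of Waterston's projection: three labs, each producing
# 84,000 runs a week at an assumed 450 bases per run, for five years.
runs_per_week = 84_000
bases_per_run = 450          # midpoint of the quoted 400-500 base range
labs = 3
weeks = 52 * 5               # five years

raw_bases = runs_per_week * bases_per_run * labs * weeks
coverage = raw_bases / 3_000_000_000
print(f"{raw_bases / 1e9:.1f} billion raw bases, about {coverage:.0f}x coverage")
# -> 29.5 billion raw bases, about 10x coverage
```

Sulston and Waterston made more presentations in early 1995, and in June 1995 Science reported that “no pistol shot marked the start, but the race to sequence the human genome began in earnest this spring.” In October of that year, one of the most respected voices in the genomics community, Maynard Olson, wrote an article entitled simply “A time to sequence,” in which he looked back at the progress of the Human Genome Project so far and considered the way forward to the ultimate goal.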
As Olson noted: “Many participants in the Human Genome Project, including this author, envisioned the project as a vehicle for developing powerful new sequencing tools that would displace the techniques of the 1980s through a combination of fundamental advances and automation. What has happened instead is arguably a better development for experimental biology. Sequencing methodology has improved incrementally in a way that is leading to convergence, rather than divergence, between the methods employed in ‘genome centers’ and those used in more typical molecular biology laboratories.” Olson concluded grandly, with a striking observation: “While huge, the central task of the Human Genome Project is bounded by one of the most remarkable facts in all of science: The development of a human being is guided by just 750 megabytes of digital information. In vivo, this information is stored as DNA molecules in an egg or sperm cell. In a biologist’s personal computer, it could be stored on a single CD-ROM. The Human Genome Project should get on with producing this disk, on time and under budget.”
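Olson’s figure is straightforward information arithmetic: with four possible letters at each position, a base of DNA carries two bits, so the whole genome compresses to:

```python
# Reproducing Olson's "750 megabytes": four letters per position = 2 bits per base.
bases = 3_000_000_000
megabytes = bases * 2 / 8 / 1_000_000  # bits -> bytes -> megabytes
print(f"{megabytes:.0f} MB")  # -> 750 MB
```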
Although Olson called the technological convergence between genome centers and traditional laboratories “a better development,” it contained within it the seed of a change that was to prove unwelcome to many in the field. The genome centers achieved their increasingly impressive sequencing rates by scaling up the whole process. To gain the greatest advantage from this approach, though, it would be necessary to create a few high-throughput centers, which could push scaling to the limit, rather than to support many smaller laboratories where economies of scale would not be so great. This meant that for the Human Genome Project to succeed under this regime, more and more resources would have to be concentrated at fewer centers. The move toward what has been called an “industrialization” of biology was a painful transition for the community, in which many fine institutions saw their grants stagnate or even shrink, and looked on with envy as more money was piled into the few select institutions being groomed as sequencing powerhouses.