JUDITH BLAKE
CENTER FOR BIODIVERSITY AND CONSERVATION
SPRING SYMPOSIUM
CONSERVATION GENETICS
IN THE AGE OF GENOMICS
AMERICAN MUSEUM OF
NATURAL HISTORY
APRIL 5-6, 2001
DAY THREE,
PART IV
THE ROLE
OF EXPANDING TECHNOLOGY
IN CONSERVING BIODIVERSITY
ROB DeSALLE (MODERATOR), Curator,
Division of Invertebrate Zoology,
American Museum of Natural History
RD: ... One of the charges that we gave ourselves as organizers of this
conference was articulated quite nicely yesterday by George Amato. If
I remember it correctly, we charged ourselves to assess the past legacy
of genetics in conservation and also to look forward to the future of
conservation genetics. Both sessions today will attempt to do that, looking
forward into the future.
At first glance, this first session might seem like a hodgepodge. But
in Dr. Conway’s wonderful talk yesterday, he articulated how genomics
and new genetic technology will bring us vast genetic and reproductive knowledge
of organisms, and these will be the tools of the 21st-century conservation biologists.
This session is about some of those tools, including genomic databases
and what their impact will be on our way of dealing with conservation.
Another set of subjects we will touch upon in this session concerns genetic
modification and cloning. For those of you who did make it in this morning
from the poster session last night, you will have noticed that there were
a lot of posters on genetic resource banking. And I believe Ollie may
touch on that this afternoon.
These are all new things for us, in this field of conservation genetics.
And with these new things come new questions—new social, ethical,
and legal questions. And so we’ve also included in this session
a talk by some lawyers. (Laughter) And I’ll refrain from my lawyer
jokes until I introduce them, okay?
Without further delay, I’d like to introduce our first speaker.
This is Judy Blake. I first met Judy last year at an NSF workshop, and
I was totally impressed with how calm she is. She said she was nervous
just a second ago. But with someone who has to deal with so much data—so
many data—and so big of a problem as annotating genomes, I was just
amazed at how calm and organized she was. And I think Judy has made some
really great contributions to genomic science, and I think she’s
one of the few people who actually thinks across genomics to conservation
biology.
Judy has written many, many papers—in single papers she has citation
indices that outdo total citation indices of many scientists—by
an order of magnitude, with a single paper. This is how influential her
work is, and has been.
Judy is now at The Jackson Laboratory in Bar Harbor, Maine. She runs,
or directs, the Mouse Genome Informatics Group there. And her talk today
is on “Comparative Genomics and the Conservation of Biodiversity.”
Judy?
(Applause)
COMPARATIVE GENOMICS AND
THE CONSERVATION OF BIODIVERSITY
Judith A. Blake, Research Scientist,
Mouse Genome Informatics, The Jackson Laboratory
JB: I want to thank the organizers
for inviting me to this symposium, because I really enjoy talking across
groups. I am involved with genomics right now, very deeply, but I have
a long-standing interest in biodiversity and conservation genetics. This
kind of talk does make me nervous, because I’m forced to think across
these disciplines and to try to synthesize what it is that we have in
common and where we might go together.
The charge I’ve been given today is “Comparative Genomics
and the Conservation of Biodiversity.” I am dividing this talk into
two parts. The first part looks at some whole genome comparisons. And
this means whole genome comparisons. Because we have whole genomes now,
and that’s been such a major advance in our thinking. We have a
complete set of sequence from an organism. And secondly, I’m going
to look at integrated information systems for data exploration. This is
where, I think, there is a lot happening that will impact on the field
of conservation genetics.
I want to start with the microbial genomes. This is where the first sequencing
was done, with Haemophilus influenzae, when I was at The Institute
for Genomic Research. Microbial genomes have been a fascinating study in
the impact of a new technology on understanding biodiversity. The first
sequences were done more as demonstrations of the technology. But since
then, the sequencing continued, and we have over 40 completed genomes
now. Microbial life forms are known to occur in all parts of the earth—from
high-temperature archaea to methanogens. There are many, many interesting
genomes, and it’s thought, quite amazing I think, that fewer than
1% of them have been described in any way.
The estimate is that we’ll have 115 or more completed microbial
genomes in the next two or three years. Some of them are in private hands,
being analyzed by private companies, but many of them are in the public
domain. Many of these genomes are from uncultivated microorganisms—I
mean to say, these are organisms that can’t be grown in the laboratory—
so this sequencing technology is enabling us to discover new things about
our world. There seem to be hot spots of microbial diversity at depths
below 100 feet in the ocean that are only just being described and understood.
Also, a lot of these microbial genomes are very special animals—for
instance, you may have heard of Deinococcus radiodurans. This organism
survives 1.5 million rads of radiation—which is 3,000 times the
lethal dose for humans. Its genome has been sequenced, and it is now being
genetically engineered to transform divalent mercury into less toxic forms.
And, it also might come to be used in radioactive waste cleanup.
This slide I’m showing now is from the Department of Energy
Joint Genome Institute—JGI—a relatively new consolidation
of Department of Energy labs. Last October they had a microbial marathon,
and they sequenced 15 microbial genomes in a month.
So, of course, the first thing you want to do when you have all this information
is compare things. And I have to say that comparative genomics is really
an exciting endeavor right now—just looking at systems and saying:
What’s happening here? What’s happening there? For example,
these are two of the first genomes that were sequenced—Haemophilus
influenzae, as I mentioned; and Mycoplasma pneumoniae. And people are
starting to look at and compare functional systems between these two organisms.
So this slide is a comparison of transport-system characteristics from
two distantly related organisms, and we want to make some statements about where
different aspects of transport systems occur in different groups of organisms.
Of course, all of this started with the Human Genome Project. And one
of the great things that happened was the recognition that we weren’t
going to start right off and sequence the human genome. And so the concept
of model organisms was developed. These “model organisms”
were considered key organisms that would give us insights—they are
typically useful in the laboratory, and well-known biologically.
Initially, the first genome that was planned to be sequenced was E. coli.
It actually ended up, I think, being fifth or sixth. This slide shows
a set of species whose genomes have been sequenced now. And, again—very
elemental comparative work being done. For instance, among the organisms
whose genomes are sequenced, genome size does not correlate with the numbers
of genes. This statement itself illustrates how basic our knowledge is
in this world of whole-genome comparisons.
I’d like to mention particularly that, across all the genomes that have
been sequenced, up to 50% of the sequence elements that
we think are coding for gene products represent sequences that are dissimilar
to anything we’ve seen before. We can identify new gene families
of sequences, but we don’t have any idea what the gene products
of these families might do. So we have a lot of discovery work to do,
to understand those genes and their role in these organisms.
Here, again, we’re looking at comparative gene classes. And when
we look at conservation genetics, or taxonomic studies, often we’re
looking at genes, and we’re doing comparative analysis between genes
in different species. Here sequences are being functionally annotated
and described based on certain sequence characteristics. In this analysis,
an overall similarity of sequence is not the defining factor. One of the
great genome informatics projects going on right now is at the Swiss-Prot
protein database, a resource based in the U.K. and in Switzerland. They
are doing a very careful analysis of the actual architecture of a protein,
the particular elements that must be present in a given configuration,
in order for the protein to have a certain function. So it’s not
the overall similarity—it’s that there’s a certain critical
amino acid; a certain structure to the molecule that gives it its function.
Again, though, you’re seeing here the comparative gene classes in
this slide. And here are the model organisms. When we start looking at,
say, acetylcholine receptors, we see that there are 17 of these elements
in gene products in human; 12 in fly; 56 in worm. And we’re just
trying to sort out the numbers. I often think of this process as comparable
to the early explorations of Humboldt, going down the Amazon or wherever,
and just picking up beetles and other natural objects and throwing them
into bins. And that’s really what we are doing. We’re just
binning genes, at this point, because we have no idea how it’s all
going to sort out when we’re comparing them.
We go a little further, now, and here’s an example of gene-specific
similarities—and now we’re working into the gene sets that
we know something about. Recently the fly genome was finished. And one
of the first things that we observed was that we could identify in the
fly genome genes that had high degrees of similarity to genes recognized
as being important in human diseases. We heard the other night about recessive
disease. Of course, only a small portion of diseases and syndromes of
human interest that have a genetic component are understood to be due
to the nature of a single variant. Most diseases occur with a complex
interaction of genetics and environment and are mostly beyond our full
understanding at this time.
But here is a slide of some of those single-gene diseases. And here we
can see—for instance, in the upper block—there’s polycystic
kidney, PKD-1 gene, and it has a very closely related gene in fly. And
so, in fact, the fly, as a genetic model, is being used to study human
diseases. And that only reinforces for us how closely related we human
beings are to all other organisms.
Of course, the human-mouse comparison is the one we hear the most about.
Mouse is the primary model system for understanding human biology and
disease. That’s because it has much of the same physiology; it has
a short generation time; we are genetically able to manipulate this organism
to specifically study human disease processes.
Here we see some of the work of Lisa Stubbs. She set out to analyze the
sequence of human chromosome 19. And there are 15 relevant conserved regions
in mouse, and she sequenced all those, as well (she has a paper coming
out in Genomics really soon, and I’ve been talking with her and
hearing her talk). And so she’s able to look at these conserved
regions between mouse and human, and begin to come to some new understandings
about genomic level similarities and differences between mice and humans.
Human genes, for example, appear to be on average much bigger; the
introns are much bigger than those of the same gene in the mouse. Mouse genes are
much smaller. Segments in the mouse genes duplicate more, so there are
more members of a gene family in mouse. And so we’re starting to
get these bits of information from this kind of comparative work.
Of course, in our own work in the Mouse Genome Informatics group, we’ve
paid very close attention to the orthologous gene groups. We focus on
comparative analysis of human, mouse, and rat. We do, actually, collect
information for 18 different mammals, a representative set of mammals
that are important in various ways. And so that’s yet another element
of comparative genetics that is mostly yet to be realized.
I’d like to end this section of the talk by saying that one of the
impacts that this genetic technology revolution is having is that we’re
moving genes onto different genetic backgrounds. So in trying to understand
the function of genes, it’s at a point where we choose the experimental
system that works for us: Is it fly? Is it mouse? And so, in many cases,
we have instances where we’re moving human genes onto specific mouse
backgrounds, in order to study the function of the human gene. So the
interest is in moving from understanding the similarities between the
sequences to understanding the function of the genes—and moving
beyond, into understanding how the organism generates a certain phenotype.
At the Jackson Laboratory, we have a very large induced-mutant resource.
It’s a national repository of specifically developed strains of
mice. Scientists create particular mouse models, typically moving a particularly
constructed gene onto a particular genetic background. And then, when
their work is done, we at the Jackson Laboratory will preserve that genetic
construction and make it available to the scientific community. And here
in this slide we see a listing of a particular group of those induced
mutant mouse strains: a web page showing available neural tube defect
mouse model resources.
So in comparative genomics, the data generation
has been phenomenal. There is lots of data, and a fair amount of preliminary
analysis. We are amassing large catalogues of genes, and putative genes—about
half of which have no prediction of function at all, because we’ve
never seen anything like them before. And in fact, the first flush of
genome projects is over, and whole sets of people are rapidly moving into describing
what we call the “transcriptome,” which is the set of all
the gene products produced from a given genome.
As many of you are probably aware, a gene can produce multiple gene products.
I think the biggest number I’ve heard so far is one gene with 138
exons. And it’s predicted it could produce 38,000 gene products—and
that’s before they’re modified by methylation or anything
else that might influence their function as mature proteins. Ultimately,
we want to get to understanding, then, the proteome. So what does this
get us to? And what is happening in the development of information and
analysis systems to manage all this data? So this is the role of expanding
technology. We’ve had all the generation of sequence data, and now
we’re looking at visualization and annotation. And, most important
from my perspective—and, I think, for this community, as well—is
data integration.
Here’s an example of visualization of genomes. This slide presents
a linearization of a circular microbial genome—Neisseria meningitidis—and
a specific strain. This is available publicly, this visualization. And
actually, what happens here is, you can click on any of these, and you
keep drilling down further and further. This is about a 2-million-base-pair
genome. And you drill down further and further—you can actually
get to the actual DNA sequence of a given segment of this gene.
This can be compared to a kind of GIS system. It’s just a linear
model. One can envision having all kinds of information available to someone
on a transect in this way. Of course, there’s a lot of data collection
and standardization that goes on in order to be able to have this kind
of presentation.
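
As a rough illustration of the drill-down idea, here is a minimal Python sketch. It assumes the circular genome is held as a single linear string cut at an arbitrary origin; the function names `drill_down` and `zoom_windows`, the zoom factor, and the minimum window size are hypothetical and are not taken from the viewer shown on the slide.

```python
def drill_down(genome: str, start: int, length: int) -> str:
    """Return `length` bases beginning at 0-based `start`.

    The genome is circular, so a window that runs past the end of the
    linearized string wraps around to the origin.
    """
    n = len(genome)
    if n == 0 or length < 0 or length > n:
        raise ValueError("window must fit within a non-empty genome")
    start %= n
    end = start + length
    if end <= n:
        return genome[start:end]
    # Window crosses the origin: join the tail and the head.
    return genome[start:] + genome[: end - n]


def zoom_windows(genome_length: int, center: int, factor: int = 10, min_window: int = 50):
    """Yield (start, window_size) pairs, each narrower than the last.

    Each step mimics one click in a browser: the view stays centered on
    the same coordinate while the window shrinks by `factor`.
    """
    window = genome_length
    while window >= min_window:
        yield (center - window // 2) % genome_length, window
        window //= factor


if __name__ == "__main__":
    toy_genome = "ACGTACGTAA"  # stand-in for a roughly 2-million-base-pair genome
    print(drill_down(toy_genome, 8, 4))  # window that wraps past the origin
    for start, size in zoom_windows(2_000_000, 1_234_567):
        print(start, size)
```

The same coordinate-window pattern is what makes the comparison to a transect apt: any data set keyed to positions along the line can be layered on top of it.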
We have many, many sequence-annotation tools. Here we have curated sets
and computational sets of genes. And when we annotate BACs (Bacterial
Artificial Chromosomes) we use these kinds of tools. This slide shows
an interesting tool, first developed at Berkeley. The grad students who
developed it were brought into the Celera fly-genome project, and there
they developed a proprietary application. Now the annotation and visualization
software is being redeveloped, by Berkeley again, in conjunction
with the Sanger Institute. And now this is all freeware, and available
to anyone who wants to use it. And we in our group are using it in various
ways in our project of integrating mouse sequence with mouse biology.
In the Mouse Genome Informatics Databases we represent the biology of
the laboratory mouse. Right now, as might be guessed, we’re consumed
with sequence and the integration of sequence data with other biological
attributes of the mouse. We’ve just recently accessioned 21,000
full-length cDNAs, and we have done the analysis and determined that 12,000
of them represent new genes in the mouse. On the other side of the equation,
there are major mutagenesis centers now that are starting to generate
and characterize mouse mutants based on phenotype analysis, at the rate
of about 4,000 to 8,000 a month. There are several major international mutagenesis
programs. Some focus on heart, lung, and blood disorders; others focus
on neurobiology phenotypes. And all of this, the genome sequence and the
phenotypic analysis, comes together—at some point, we hope—in
generating the understanding of the whole phenomenon of the laboratory
mouse as an organism.
Three main projects in our group are the mouse genome sequencing project,
headed by Carol Bult, which is integrating the high throughput mouse-sequence
data with the mouse data; the gene-expression project under the leadership
of Martin Ringwald is looking at expression data and integrating and understanding
the micro-array and expression profiles; and the mouse genome database
overseen by Janan Eppig.
I want to talk now about how we generate unique designations for genomic
features. This is a very important and difficult task. We do a lot of
indexing and co-curation with the sequencing centers, and with the
different sequence repositories, such as NCBI and Swiss-Prot, in
order to have a representative set of information about a gene, or any
other genomic feature.
So what that allows us to do, then, is to integrate data including, for
example, multiple publications, various sets of mapping data, and information
about phenotypic alleles. We have a consensus map representation,
a little minimap of the chromosome, showing the location of a gene on
a chromosome. And I’m going to spend some time now talking about
the classification systems we are using, which are standard systems, and
how we then carefully curate links to collect a set of sequence associations
for this gene.
Ultimately, we want to move beyond the mouse. We want to be able to speak
the same language when we’re talking to people describing other
organisms. I was just at a meeting recently where we had a big discussion:
What is the definition of development? What do you mean, for Arabidopsis,
which is a plant, when you use the word “development”?
When does the process of “development” start? Different research
communities will give you different answers to this question. Where does
the anatomical element called the heart start in
the developmental process? What are the components of the “heart”?
What is outside the concept “heart”? What is inside the term
“heart”? Does it include the circulatory elements? What about
the pericardium? Typically, the pericardium is not considered a part of
the heart, yet when researchers search for information about pericarditis,
they might logically query for “heart disease.”
So we’re moving towards developing a common language for biology,
at least for molecular biology. And this work has come, again, out of
the model organism community, where we want to use the power of comparative
genomics to look at and compare information about shared genes. We’re
really looking at a set of genes that happen to be on different genotypes.
You know, one way to approach this problem is to say it’s the same
gene, whether you have it on a fly background, or on a yeast background,
or on a mouse background. But we were using different languages to describe
our knowledge about the function of the gene. And so when we would use
a term—like cell-division cycle number 42 gene—is this what
the orthologous gene is called in all of the model organisms? And what
is our collective understanding of this gene’s function?
So we all got together, that is to say, folks working on the informatics
of the model organisms. The project I will describe to you now actually
started with mouse, yeast, and fly bioinformatics communities. We figured
that if we could have a common language of describing the molecular biology
of our three organisms, then it would help the whole biological community
to communicate more effectively. And that’s what we were having
a problem doing, as we were doing our comparative analysis.
So we formed the Gene Ontology Consortium—now expanded from fly,
yeast, and mouse to include Arabidopsis, E. coli, and others. The microbial
genomes are talking with us, and want to use our standard vocabularies.
And other groups are now involved as well, including C. elegans (the worm).
We have set about creating structured, controlled vocabularies for molecular
biology. Why do we want structured vocabularies? We want to standardize
our annotations. We want to be able to ask complex queries across all
these genomes. We want to be able to ask questions such as “What
are all the genes in the model organism systems that function in the initiation
of cell division?” for example. And for that we need a standard
set of terms which also have to have standardized definitions. What the
definition is isn’t as important as the fact that we have defined
a definition for the term we’re using. Thus we would be able to use
terms (concepts) with a clear understanding of the definition of the concept
within this system.
The Gene Ontology Consortium has built three separate vocabularies: “molecular
function,” what a gene product actually does; “biological
process,” a set of broader concepts such as DNA replication or
carbohydrate metabolism, a larger process that this gene product
is involved in; and, finally, “cellular component,” where
this gene product is found.
The goals of the Consortium, then, are to develop these structured vocabularies—these
ontologies—each term of which has a unique ID, has a definition,
and has a defined relationship to other terms. It is a hierarchical system,
but it’s not a simple hierarchy. Many terms have multiple parents.
And I can review that for people who are interested.
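
A small Python sketch may make the "not a simple hierarchy" point concrete. It is an illustration under stated assumptions, not the Consortium's data model: the `Term` class, the `EX:` identifiers, and the placeholder definitions are all hypothetical. The only properties carried over from the description above are that each term has a unique ID, a definition, and defined relationships to other terms, and that a term may have more than one parent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    term_id: str            # unique ID (hypothetical "EX:" prefix, not a real GO ID)
    name: str
    definition: str         # placeholder text, for illustration only
    parents: tuple = ()     # IDs of parent terms; more than one is allowed


ONTOLOGY = {
    t.term_id: t
    for t in [
        Term("EX:0001", "biological process", "Placeholder definition."),
        Term("EX:0002", "cell cycle", "Placeholder definition.", ("EX:0001",)),
        Term("EX:0003", "cell division", "Placeholder definition.", ("EX:0001",)),
        # This term has two parents, so the structure is a graph, not a tree.
        Term("EX:0004", "initiation of cell division", "Placeholder definition.",
             ("EX:0002", "EX:0003")),
    ]
}


def ancestors(term_id, ontology=ONTOLOGY):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = list(ontology[term_id].parents)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(ontology[current].parents)
    return seen


if __name__ == "__main__":
    print(sorted(ancestors("EX:0004")))
```

Because a term can sit under several parents, a gene product annotated to the most specific term is found by a query on any of the broader terms above it.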
And then we each work in our own areas of expertise. So the mouse people, the mouse
scientific curators, annotate our gene products to a specific term:
with a term, with a reference, with a citation, and with an evidence statement.
We are documenting why we think that this gene product is involved in
this molecular function, for example. In the end, of course, we realized
immediately that if we toss all of these annotations into a common data
resource, then the biological scientific community can query that resource
across all these shared collected annotations, across the genomes of any
organisms that use this annotation standard.
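
The shared-annotation idea can be sketched in a few lines of Python. Everything named here is an assumption made for illustration: the `Annotation` record, the `EX:` term IDs, the gene names, and the `REF:` citations are hypothetical, and the evidence strings are examples rather than an official list. The sketch shows only the two points made above: an annotation carries a term, an evidence statement, and a citation, and pooled annotations can be queried across organisms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    organism: str
    gene: str
    term_id: str
    evidence: str   # why the curator believes the association
    citation: str   # where the supporting data were published

    def __post_init__(self):
        # An association without evidence and a citation is rejected.
        if not self.evidence or not self.citation:
            raise ValueError("an annotation needs both evidence and a citation")


# Hypothetical parent links between terms (child -> parents).
PARENTS = {
    "EX:0003": [],            # cell division
    "EX:0004": ["EX:0003"],   # initiation of cell division
    "EX:0005": [],            # carbohydrate metabolism
}

ANNOTATIONS = [
    Annotation("mouse", "gene_a", "EX:0004", "inferred from mutant phenotype", "REF:1"),
    Annotation("yeast", "gene_b", "EX:0003", "inferred from direct assay", "REF:2"),
    Annotation("fly", "gene_c", "EX:0005", "inferred from direct assay", "REF:3"),
]


def falls_under(term_id, target_id, parents=PARENTS):
    """True if term_id is the target term or one of its descendants."""
    stack, seen = [term_id], set()
    while stack:
        current = stack.pop()
        if current == target_id:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return False


def genes_for(target_id, annotations=ANNOTATIONS):
    """All (organism, gene) pairs annotated to the target term or below it."""
    return sorted(
        (a.organism, a.gene) for a in annotations if falls_under(a.term_id, target_id)
    )


if __name__ == "__main__":
    print(genes_for("EX:0003"))
```

Querying on the broader term returns genes from more than one organism, which is the kind of cross-genome question the common vocabulary is meant to support.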
So here we see a representation. Again, I showed you this before—but
now you understand more about these new terms we have. And we have various
ways of browsing the hierarchies and looking at the definitions of each
term. And we have an evidence statement. This association of the term
“skeletal development” with this vitamin-D receptor gene was
inferred from a mutant phenotype presented in this reference.
And throughout our annotations, in everything we do, this is a basic paradigm:
if you make any data association, assign any attribute to a gene object,
there must be an evidence statement and a citation for this annotation
event. This is very important, because one of the problems we’ve
had in the first flush of genomic information is the transference of putative
function based on untraceable statements or perhaps because of some general
sequence similarity. And so by setting the standard of providing evidence
and citation for attribute assignments we are trying to drive the data
accumulation with some statement of confidence in the knowledge presented.
And so we have the Gene Ontology Consortium, and I’ll give you the
web URL at the end of the talk.
We have a series, then, of structured vocabularies. And this, I’d
say, is the major thing that we’re working on now. We have the GO
vocabularies. But also, there are anatomies for each of these organisms
that are being standardized; nomenclature for each of these genomic features
categorized in the integration process; also, now the development of phenotype
vocabularies; disease models; and, of course, many other smaller ones.
The impact of genomics, then, I think, has been a whole-genome view of
comparative analysis—and we’re just beginning to understand
the implications of that—and the development of an integrated information
system to handle all this comprehensive data, including data-generation
and analysis tools, integrated information systems, and all these shared
structured vocabularies.
Where do I think this takes us, in terms of comparative genomics and the
conservation of biodiversity? Well, as I showed in the microbial systems,
one development that I think was somewhat unexpected is that there has been
great interest in, and impact on, the increased discovery and analysis of
biodiversity, particularly in the microbial area. And I certainly
heard it yesterday, in people talking about genotypes, and understanding
the genetic diversity in various groups of organisms. And so the tools
that have been developed have been very important in that.
On the genomic side there’s certainly been an increased recognition
of the diverse organisms as sources of comparative information, and this
has been very useful from both perspectives. I think when people were
coming from the biomedical approach of wanting the human genome for study
and understanding of the human condition, sometimes the recognition of
the power of genetic diversity wasn’t there. And it certainly has
been eye-opening for many folks.
And finally, I think these new tools—some of which I showed you
today—for data visualization and integration can be useful models—or,
perhaps, even used directly in extending biodiversity representations
and data exploration.
I’d like to acknowledge the many people who have contributed to
this work, and I list them here. Most of them are in our group, in Mouse Genome
Informatics, but also, at bottom left, are major players in the Gene Ontology Consortium.
The URL for the Gene Ontology Consortium is http://www.geneontology.org.
And there’s a shot from down the road, of the coast of Maine.
Thank you very much.
(Applause)