Metazome
Overview
Metazome is a joint project of the Department of Energy's and the to facilitate comparative genomic studies amongst metazoans. Clusters of orthologous and paralogous genes that represent the modern descendants of ancestral gene sets are constructed at key phylogenetic nodes. These clusters allow easy access to clade-specific orthology/paralogy relationships as well as clade-specific genes and gene expansions. As of version 2.0.4, Metazome provides access to twenty-four sequenced and annotated metazoan genomes, clustered at nine evolutionarily significant nodes. Where possible, each gene has been annotated with PFAM, KOG, KEGG, and PANTHER assignments, and publicly available annotations from RefSeq, UniProt, Ensembl, and JGI are hyperlinked and searchable.
Included Organisms
Genes from the following organisms are clustered in release 2.0.4 of Metazome:
Organism | Common name | Source |
Homo sapiens | Human | NCBI 36 build from |
Mus musculus | Mouse | NCBI m36 build from |
Rattus norvegicus | Rat | RGSC 3.4 from |
Canis familiaris | Dog | NCBI gene models build 2 version 1 on assembly release by Broad 10 May 2005. |
Monodelphis domestica | Opossum | MonDom 4 from |
Gallus gallus | Chicken | Assembly WASHUC 1, Mar 2004 from (gene build December 2005) |
Xenopus tropicalis | Frog | assembly and annotation |
Gasterosteus aculeatus | Stickleback | Broad S1 from |
Oryzias latipes | Medaka | MEDAKA 1 from |
Takifugu rubripes | Fugu pufferfish | assembly and annotation |
Danio rerio | Zebrafish | Zv6 from |
Ciona savignyi | Sea squirt - savignyi | CSAV 2.0 from |
Ciona intestinalis | Sea squirt - intestinalis | JGI 2 from |
Branchiostoma floridae | Amphioxus | from April 11 2006 |
Strongylocentrotus purpuratus | Sea Urchin | NCBI gene build 2 version 1 on |
Drosophila melanogaster | Fruitfly | BDGP 4 from |
Anopheles gambiae | Mosquito | AgamP3 from |
Aedes aegypti | Yellow fever Mosquito | AaegL 1 from |
Bombyx mori | Silkworm | |
Tribolium castaneum | Red Flour Beetle | NCBI gene build 1 version 1 on |
Caenorhabditis elegans | Worm | release WS164 |
Caenorhabditis briggsae | Briggsae Worm | release WS164 |
Lottia gigantea | Owl limpet (snail) | Early p02S assembly (see for most recent Lottia assembly and annotation) |
Nematostella vectensis | Sea anemone | assembly and annotation |
Nodes
Clustering is used to group extant genes into sets representing the ancestral genes that existed just prior to various significant evolutionary events (nodes). Extant genes have been clustered at nodes representing the following speciation events:
Eumetazoan: | Gene families representing the most recent common ancestor of bilateria and cnidaria. |
Bilaterian: | Gene families representing the most recent common ancestor of deuterostomes and protostomes. |
Protostome: | Gene families representing the most recent common ancestor of lophotrochozoans and ecdysozoans. |
Deuterostome: | Gene families representing the most recent common ancestor of chordates and echinoderms. |
Arthropod: | Gene families representing the most recent common ancestor of coleoptera and diptera. |
Chordate: | Gene families representing the most recent common ancestor of cephalochordates and olfactores. |
Vertebrate: | Gene families representing the most recent common ancestor of teleosts and tetrapods. |
Teleost: | Gene families representing the most recent common ancestor of euteleostei and otocephala. |
Tetrapod: | Gene families representing the most recent common ancestor of amniota and amphibia. |
Clustering Methodology
Clustering is performed hierarchically, from the crown nodes (extant organisms) to the root. At each node, in-group (paralog) clustering is performed first, followed by out-group (ortholog) clustering. In-group and out-group relationships are defined relative to a given node: descendant nodes (internal as well as crown) that connect to it through the same deepest branch are in-group with respect to each other, while those that connect through opposite deepest branches are out-group. For example, human, mouse, rat, dog, opossum, chicken, frog and the teleosts are all in-group with respect to the chordate node, but with respect to the vertebrate node, teleosts are an out-group for tetrapods, and vice versa.
Gene families are created at a given node by first building paralog clusters amongst the in-group members: in-group genes that are more similar to each other (as measured by blastp) than to their most similar out-group gene are placed in the same paralogous cluster. Four-fold-degenerate transversion rates (4DTV, one measure of the divergence time between genes based on the number of neutral substitutions) are used to preclude clustering of "paralogs" with low degrees of similarity that would otherwise be drawn together simply because they lack a high-scoring hit to an out-group gene.
After paralog clustering along both branches, clusters are merged across branches to capture orthologous relationships. This merging combines Mutual-Best-Hit (MBH, based on blastp sequence similarity) and synteny metrics. Synteny here rests on the hypothesis that disjoint genomic regions containing a fair number of genes with high-quality hits to each other likely share a common origin in a speciation, whole-genome duplication, or segmental duplication event; only the first of these is directly relevant to orthology identification. This process is repeated at each node, moving from the crown nodes (organisms) toward the root.
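The two-step procedure at a single node can be illustrated with a minimal Python sketch. It is not the Metazome implementation: the gene names, the scores, and all function names are hypothetical, the similarity values stand in for blastp scores, and the 4DTV filter and synteny evidence described above are left out. The paralog rule is read here as "a pair is clustered when its mutual score exceeds every out-group score of either gene", which is one plausible interpretation of the text.

```python
from collections import defaultdict


def score(scores, a, b):
    """Symmetric lookup of a similarity score; 0.0 means no reported hit."""
    return scores.get(frozenset((a, b)), 0.0)


def best_hit(scores, gene, targets):
    """Return the highest-scoring target for gene, or None if it has no hit."""
    hits = [(score(scores, gene, t), t) for t in targets if score(scores, gene, t) > 0]
    return max(hits)[1] if hits else None


def paralog_pairs(scores, in_group, out_group):
    """In-group pairs that score higher with each other than with any out-group gene.

    Assumption: without a 4DTV-style filter, a weak pair with no out-group
    hit at all would also be accepted here.
    """
    genes = sorted(in_group)
    pairs = []
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            best_out = max(
                (score(scores, x, o) for x in (g, h) for o in out_group),
                default=0.0,
            )
            if score(scores, g, h) > best_out:
                pairs.append((g, h))
    return pairs


def mutual_best_hits(scores, branch_a, branch_b):
    """Cross-branch pairs in which each gene is the other's best hit."""
    pairs = []
    for a in sorted(branch_a):
        b = best_hit(scores, a, branch_b)
        if b is not None and best_hit(scores, b, branch_a) == a:
            pairs.append((a, b))
    return pairs


def build_families(scores, branch_a, branch_b):
    """Paralog clustering on each branch, then merging across branches by MBH."""
    parent = {g: g for g in list(branch_a) + list(branch_b)}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path compression
            g = parent[g]
        return g

    links = (
        paralog_pairs(scores, branch_a, branch_b)
        + paralog_pairs(scores, branch_b, branch_a)
        + mutual_best_hits(scores, branch_a, branch_b)
    )
    for a, b in links:
        parent[find(a)] = find(b)

    families = defaultdict(set)
    for g in parent:
        families[find(g)].add(g)
    return sorted(sorted(f) for f in families.values())


# Toy data for a vertebrate-like node: tetrapod branch versus teleost branch.
tetrapod = ["human_a1", "human_a2", "human_b"]
teleost = ["zfish_a", "zfish_b"]
scores = {
    frozenset(("human_a1", "human_a2")): 900.0,
    frozenset(("human_a1", "zfish_a")): 700.0,
    frozenset(("human_a2", "zfish_a")): 650.0,
    frozenset(("human_b", "zfish_b")): 800.0,
    frozenset(("human_a1", "human_b")): 100.0,
    frozenset(("zfish_a", "zfish_b")): 90.0,
}

print(build_families(scores, tetrapod, teleost))
```

In this toy case the two human_a genes are joined as paralogs because they resemble each other more than either resembles its best zebrafish hit, and the mutual best hits then attach one zebrafish gene to each family.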
Note that, by construction, for any given node and gene family, the members of that family remain together in any (larger) gene family defined at a more ancient (i.e., closer to the root) node; the clustering is rigorously hierarchical. Also note that every gene from an organism present at a particular node is in one and only one cluster/family at that node (the clusters are "hard", not "fuzzy").
Some clusters may contain only one extant gene (singletons). Singletons can arise from gene loss, from gene-calling errors, or from "fast" evolution that produces so much sequence divergence that sequence-similarity-based clustering is confounded.
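The properties described above (hard clusters, strict nesting between nodes, and singletons) can be checked on any set of families with a few lines of Python. This is an illustrative sketch only: the family contents and function names are hypothetical and are not drawn from Metazome data.

```python
def is_hard_partition(families, genes):
    """True if every gene appears in exactly one family."""
    members = [g for fam in families for g in fam]
    return len(members) == len(set(members)) and set(members) == set(genes)


def is_nested(younger, older):
    """True if each family at the younger node sits inside one family at the older node."""
    owner = {gene: idx for idx, fam in enumerate(older) for gene in fam}
    for fam in younger:
        owners = {owner.get(gene) for gene in fam}
        if len(owners) != 1 or None in owners:
            return False
    return True


def singletons(families):
    """Genes that are the only member of their family."""
    return sorted(next(iter(fam)) for fam in families if len(fam) == 1)


# Hypothetical families at a younger (tetrapod) and an older (vertebrate) node.
tetrapod_families = [{"human_a1", "human_a2"}, {"human_b"}]
vertebrate_families = [
    {"human_a1", "human_a2", "zfish_a"},
    {"human_b", "zfish_b"},
    {"zfish_c"},
]
vertebrate_genes = ["human_a1", "human_a2", "human_b", "zfish_a", "zfish_b", "zfish_c"]

print(is_hard_partition(vertebrate_families, vertebrate_genes))
print(is_nested(tetrapod_families, vertebrate_families))
print(singletons(vertebrate_families))
```

A gene such as human_b is a singleton at the tetrapod node but not at the vertebrate node, which shows why singleton counts differ from node to node.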
Clustering Statistics
Node | Gene Families | Singletons | Largest Cluster Size and Defline |
Eumetazoan | 31902 | 203761 | 496 - Notch-1 related |
Bilaterian | 30367 | 192975 | 489 - Helix-loop-helix DNA-binding domain-containing protein |
Protostome | 16314 | 75876 | 141 - UDP-glucuronosyltransferase-related |
Deuterostome | 21031 | 126047 | 230 - Major Facilitator Superfamily domain-containing protein |
Arthropod | 8663 | 32389 | 49 - Odorant-binding protein AgamOBP3 |
Chordate | 17527 | 91781 | 189 - RAB5A |
Vertebrate | 18382 | 87487 | 121 - G protein signaling regulator |
Teleost | 16213 | 33235 | 26 - Glicacolin-related and C1q domain-containing protein |
Tetrapod | 15654 | 62031 | 30 - ETS domain-containing predicted transcription factor |
Metazome Team
Software: | David M. Goodstein, Rusty Howson, Rochak Neupane, Shengqiang Shu |
Analysis: | Bill Dirks, Uffe Hellsten, Therese Mitros, Dan Rokhsar |