Metazome
Overview
Metazome is a joint project of the Department of Energy's Joint Genome Institute and the Center for Integrative Genomics to facilitate comparative genomic studies among metazoans. Clusters of orthologous and paralogous genes that represent the modern descendants of ancestral gene sets are constructed at key phylogenetic nodes. These clusters allow easy access to clade-specific orthology/paralogy relationships as well as clade-specific genes and gene expansions. As of version 2.0.4, Metazome provides access to twenty-four sequenced and annotated metazoan genomes, clustered at nine evolutionarily significant nodes. Where possible, each gene has been annotated with PFAM, KOG, KEGG, and PANTHER assignments, and publicly available annotations from RefSeq, UniProt, Ensembl, and JGI are hyperlinked and searchable.
Included Organisms
Genes from the following organisms are clustered in release 2.0.4 of Metazome:
Organism | Common name | Source |
Homo sapiens | Human | NCBI 36 build from Ensembl 41 |
Mus musculus | Mouse | NCBI m36 build from Ensembl 41 |
Rattus norvegicus | Rat | RGSC 3.4 from Ensembl 41 |
Canis familiaris | Dog | NCBI gene models build 2 version 1 on the assembly released by Broad, 10 May 2005 |
Monodelphis domestica | Opossum | MonDom 4 from Ensembl 41 |
Gallus gallus | Chicken | Assembly WASHUC 1, Mar 2004 from Ensembl 41 (gene build December 2005) |
Xenopus tropicalis | Frog | JGI v4.1 assembly and annotation |
Gasterosteus aculeatus | Stickleback | Broad S1 from Ensembl 41 |
Oryzias latipes | Medaka | MEDAKA 1 from Ensembl 41 |
Takifugu rubripes | Fugu pufferfish | JGI v4.0 assembly and annotation |
Danio rerio | Zebrafish | Zv6 from Ensembl 41 |
Ciona savignyi | Seasquirt - savignyi | CSAV 2.0 from Ensembl 41 |
Ciona intestinalis | Seasquirt - intestinalis | JGI 2 from Ensembl 41 |
Branchiostoma floridae | Amphioxus | Brafl1 JGI annotation from April 11 2006 |
Strongylocentrotus purpuratus | Sea Urchin | NCBI gene build 2 version 1 on Baylor's assembly Spur_v2.1 |
Drosophila melanogaster | Fruitfly | BDGP 4 from Ensembl 41 |
Anopheles gambiae | Mosquito | AgamP3 from Ensembl 41 |
Aedes aegypti | Yellow fever Mosquito | AaegL 1 from Ensembl 41 |
Bombyx mori | Silkworm | Beijing Genomics Institute |
Tribolium castaneum | Red Flour Beetle | NCBI gene build 1 version 1 on Baylor assembly Tcas_2.0 |
Caenorhabditis elegans | Worm | Wormbase release WS164 |
Caenorhabditis briggsae | Briggsae Worm | Wormbase release WS164 |
Lottia gigantea | Owl limpet (snail) | Early p02S assembly (see JGI for most recent Lottia assembly and annotation) |
Nematostella vectensis | Sea anemone | JGI v1.0 assembly and annotation |
Nodes
Clustering is used to group extant genes into sets representing the ancestral genes that existed just prior to various significant evolutionary events (nodes). Extant genes have been clustered at nodes representing the following speciation events:
Eumetazoan: | Gene families representing the most recent common ancestor of bilateria and cnidaria. |
Bilaterian: | Gene families representing the most recent common ancestor of deuterostomes and protostomes. |
Protostome: | Gene families representing the most recent common ancestor of lophotrochozoans and ecdysozoans. |
Deuterostome: | Gene families representing the most recent common ancestor of chordates and echinoderms. |
Arthropod: | Gene families representing the most recent common ancestor of coleoptera and diptera. |
Chordate: | Gene families representing the most recent common ancestor of cephalochordates and olfactores. |
Vertebrate: | Gene families representing the most recent common ancestor of teleosts and tetrapods. |
Teleost: | Gene families representing the most recent common ancestor of euteleostei and otocephala. |
Tetrapod: | Gene families representing the most recent common ancestor of amniota and amphibia. |
Clustering Methodology
Clustering is performed hierarchically, from the crown (extant organism) nodes to the root. At each node, in-group (paralog) clustering is performed first, followed by out-group (ortholog) clustering. In-group and out-group relationships are defined as follows: with respect to a given node, all descendant nodes (internal as well as crown) that connect to it through the same deepest branch are in-group to one another, and those that connect through the opposite deepest branch are out-group. For example, human, mouse, rat, dog, opossum, chicken, frog, and the teleosts are all in-group with respect to the chordate node, but with respect to the vertebrate node, teleosts are an out-group for tetrapods, and vice versa.
Gene families are created at a given node by first building paralog clusters among the in-group members: in-group genes that are more similar to each other (as measured by blastp) than to their most similar out-group gene are placed in the same paralogous cluster. Four-fold-degenerate transversion rates (4DTV, a measure of the divergence time between genes based on the number of neutral substitutions) are used to preclude clustering of "paralogs" with low similarity that would otherwise be drawn together simply because neither has a high-scoring hit to an out-group gene.
After paralog clustering along both branches, clusters are merged across branches to capture orthologous relationships. This merging uses a combination of two metrics: Mutual-Best-Hit (MBH, based on blastp sequence similarity) and synteny. Synteny here is the hypothesis that disjoint genomic regions containing a fair number of genes with high-quality hits to each other likely share a common origin in a speciation, whole-genome duplication, or segmental duplication event; only the first of these is directly relevant to orthology identification. This process is repeated moving down from the crown nodes (organisms) to the root.
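The steps described above can be illustrated with a small Python sketch. It is not the Metazome pipeline: the toy tree, the gene names, the scores, and the function names (`branch_sides`, `paralog_pairs`, `mutual_best_hits`) are all hypothetical. The paralog rule is read here as "more similar to each other than either gene is to its best out-group hit", which is one possible interpretation, and the 4DTV filter and synteny evidence are left out.

```python
# Toy species tree (hypothetical encoding): internal node -> its two child branches.
TREE = {
    "vertebrate": ["teleost", "tetrapod"],
    "teleost": ["zebrafish", "medaka"],
    "tetrapod": ["frog", "human"],
}

def leaves(node):
    """Return the set of extant organisms (crown nodes) at or below `node`."""
    children = TREE.get(node)
    if children is None:  # a crown node has no children
        return {node}
    found = set()
    for child in children:
        found |= leaves(child)
    return found

def branch_sides(node):
    """Group organisms by the branch through which they connect to `node`.
    Organisms on the same side are in-group to each other; the other side is out-group."""
    return [leaves(child) for child in TREE[node]]

def paralog_pairs(in_scores, best_out):
    """in_scores: {(gene1, gene2): similarity} for same-side gene pairs.
    best_out: {gene: similarity to its best out-group hit} (absent means no hit).
    Keep pairs that are more similar to each other than either is to the out-group."""
    return sorted(
        pair for pair, score in in_scores.items()
        if score > best_out.get(pair[0], 0) and score > best_out.get(pair[1], 0)
    )

def mutual_best_hits(cross_scores):
    """cross_scores: {(gene_a, gene_b): similarity}, gene_a from one side, gene_b from the other.
    Return the pairs in which each gene is the other's top-scoring cross-branch hit."""
    best_a, best_b = {}, {}
    for (a, b), score in cross_scores.items():
        if a not in best_a or score > best_a[a][1]:
            best_a[a] = (b, score)
        if b not in best_b or score > best_b[b][1]:
            best_b[b] = (a, score)
    return sorted((a, b) for a, (b, _) in best_a.items() if best_b[b][0] == a)

# Made-up similarity scores for illustration only.
in_scores = {("hs1", "hs1b"): 400, ("hs1", "hs2"): 100}
best_out = {"hs1": 300, "hs1b": 280, "hs2": 250}
cross_scores = {
    ("zf1", "hs1"): 300, ("zf1", "hs2"): 120,
    ("zf2", "hs2"): 250, ("zf2", "hs1"): 90,
    ("zf3", "hs1"): 200,
}

print(branch_sides("vertebrate"))
print(paralog_pairs(in_scores, best_out))
print(mutual_best_hits(cross_scores))
```

In this toy data, `zf3` has a best hit to `hs1`, but `hs1` prefers `zf1`, so `zf3` is not merged by the MBH criterion alone. In the procedure described above, synteny could supply additional evidence in such a case.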
Note that, by construction, for any given node and gene family, the members of that family remain together in any (larger) gene family defined at a more ancient (i.e., closer to the root) node; the clustering is rigorously hierarchical. Also note that every gene from an organism present at a particular node is in one and only one cluster/family at that node (the clusters are "hard", not "fuzzy").
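Both properties can be stated as simple checks on sets of gene identifiers. The following Python sketch uses hypothetical function names (`is_hard`, `is_nested`) and made-up gene identifiers; it illustrates the two invariants and is not code from Metazome.

```python
def is_hard(clusters, genes):
    """True if every gene in `genes` appears in exactly one cluster ("hard" clustering)."""
    members = [g for cluster in clusters for g in cluster]
    no_overlap = len(members) == len(set(members))
    return no_overlap and set(members) == set(genes)

def is_nested(younger, older):
    """True if each cluster at the younger node lies wholly inside
    some cluster at the older (closer to the root) node."""
    return all(any(set(c) <= set(o) for o in older) for c in younger)

# Made-up families at a tetrapod-like node and at a vertebrate-like node.
tetrapod = [{"hs1", "xt1"}, {"hs2"}]
vertebrate = [{"hs1", "xt1", "dr1"}, {"hs2", "dr2"}]
# A clustering that splits the family {"hs1", "xt1"} would violate the hierarchy.
broken = [{"hs1", "dr1"}, {"xt1", "hs2", "dr2"}]

print(is_hard(tetrapod, {"hs1", "hs2", "xt1"}))  # True
print(is_nested(tetrapod, vertebrate))           # True
print(is_nested(tetrapod, broken))               # False
```

Note that `{"hs2"}` is a single-gene family at the younger node and still satisfies both checks.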
Some clusters may contain only one extant gene (singletons). Singletons can arise from gene loss, from gene-calling errors, or from "fast" evolution that produces so much sequence divergence that sequence-similarity-based clustering is confounded.
Clustering Statistics
Node | Gene Families | Singletons | Largest Cluster Size and Defline |
Eumetazoan | 31902 | 203761 | 496 - Notch-1 related |
Bilaterian | 30367 | 192975 | 489 - Helix-loop-helix DNA-binding domain-containing protein |
Protostome | 16314 | 75876 | 141 - UDP-glucuronosyltransferase-related |
Deuterostome | 21031 | 126047 | 230 - Major Facilitator Superfamily domain-containing protein |
Arthropod | 8663 | 32389 | 49 - Odorant-binding protein AgamOBP3 |
Chordate | 17527 | 91781 | 189 - RAB5A |
Vertebrate | 18382 | 87487 | 121 - G protein signaling regulator |
Teleost | 16213 | 33235 | 26 - Gliacolin-related and C1q domain-containing protein |
Tetrapod | 15654 | 62031 | 30 - ETS domain-containing predicted transcription factor |
Metazome Team
Software: | David M. Goodstein, Rusty Howson, Rochak Neupane, Shengqiang Shu |
Analysis: | Bill Dirks, Uffe Hellsten, Therese Mitros, Dan Rokhsar |