Metazome

Overview

Metazome is a joint project of the Department of Energy's and the to facilitate comparative genomic studies amongst metazoans. Clusters of orthologous and paralogous genes that represent the modern descendants of ancestral gene sets are constructed at key phylogenetic nodes. These clusters allow easy access to clade-specific orthology/paralogy relationships as well as clade-specific genes and gene expansions. As of version 2.0.4, Metazome provides access to twenty-four sequenced and annotated metazoan genomes, clustered at nine evolutionarily significant nodes. Where possible, each gene has been annotated with PFAM, KOG, KEGG, and PANTHER assignments, and publicly available annotations from RefSeq, UniProt, Ensembl, and JGI are hyperlinked and searchable.

Included Organisms

Genes from the following organisms are clustered in release 2.0.4 of Metazome:

Organism | Common name | Source
Homo sapiens | Human | NCBI 36 build from
Mus musculus | Mouse | NCBI m36 build from
Rattus norvegicus | Rat | RGSC 3.4 from
Canis familiaris | Dog | NCBI gene models build 2 version 1 on assembly release by Broad 10 May 2005.
Monodelphis domestica | Opossum | MonDom 4 from
Gallus gallus | Chicken | Assembly WASHUC 1, Mar 2004 from (gene build December 2005)
Xenopus tropicalis | Frog | assembly and annotation
Gasterosteus aculeatus | Stickleback | Broad S1 from
Oryzias latipes | Medaka | MEDAKA 1 from
Takifugu rubripes | Fugu pufferfish | assembly and annotation
Danio rerio | Zebrafish | Zv6 from
Ciona savignyi | Sea squirt - savignyi | CSAV 2.0 from
Ciona intestinalis | Sea squirt - intestinalis | JGI 2 from
Branchiostoma floridae | Amphioxus | from April 11 2006
Strongylocentrotus purpuratus | Sea urchin | NCBI gene build 2 version 1 on
Drosophila melanogaster | Fruit fly | BDGP 4 from
Anopheles gambiae | Mosquito | AgamP3 from
Aedes aegypti | Yellow fever mosquito | AaegL 1 from
Bombyx mori | Silkworm |
Tribolium castaneum | Red flour beetle | NCBI gene build 1 version 1 on
Caenorhabditis elegans | Worm | release WS164
Caenorhabditis briggsae | Briggsae worm | release WS164
Lottia gigantea | Owl limpet (snail) | Early p02S assembly (see for most recent Lottia assembly and annotation)
Nematostella vectensis | Sea anemone | assembly and annotation

Nodes

Clustering is used to group extant genes into sets representing the ancestral genes that existed just prior to various significant evolutionary events (nodes). Extant genes have been clustered at nodes representing the following speciation events:

Eumetazoan: Gene families representing the most recent common ancestor of bilateria and cnidaria.
Bilaterian: Gene families representing the most recent common ancestor of deuterostomes and protostomes.
Protostome: Gene families representing the most recent common ancestor of lophotrochozoans + ecdysozoans.
Deuterostome: Gene families representing the most recent common ancestor of chordates and echinoderms.
Arthropod: Gene families representing the most recent common ancestor of coleoptera and diptera.
Chordate: Gene families representing the most recent common ancestor of cephalochordates + olfactores.
Vertebrate: Gene families representing the most recent common ancestor of teleosts and tetrapods.
Teleost: Gene families representing the most recent common ancestor of euteleostei and otocephala.
Tetrapod: Gene families representing the most recent common ancestor of amniota and amphibia.
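
Each node in the list is defined by the two branches that meet at it. As an illustration only, the vertebrate part of this hierarchy could be recorded as below. The `BRANCHES` table and the helper names are hypothetical, and the assignment of organisms to branches follows standard taxonomy rather than anything taken from the Metazome software.

```python
# Two branches below each node. A branch lists child node names and/or
# organism names. Only the vertebrate subtree is shown (illustrative layout).
BRANCHES = {
    "Vertebrate": (["Teleost"], ["Tetrapod"]),
    "Teleost": (
        ["Gasterosteus aculeatus", "Oryzias latipes", "Takifugu rubripes"],  # euteleostei
        ["Danio rerio"],  # otocephala
    ),
    "Tetrapod": (
        ["Homo sapiens", "Mus musculus", "Rattus norvegicus",
         "Canis familiaris", "Monodelphis domestica", "Gallus gallus"],  # amniota
        ["Xenopus tropicalis"],  # amphibia
    ),
}


def organisms(name):
    """All extant organisms at or below `name` (a node or an organism)."""
    if name not in BRANCHES:
        return {name}  # a crown node: the organism itself
    children = [child for branch in BRANCHES[name] for child in branch]
    return set().union(*(organisms(child) for child in children))


def branch_members(node):
    """The organism sets found on each of the two branches below `node`."""
    first, second = BRANCHES[node]
    return (set().union(*map(organisms, first)),
            set().union(*map(organisms, second)))


teleosts, tetrapods = branch_members("Vertebrate")
```

Here `branch_members("Vertebrate")` separates the four teleosts from the seven tetrapods, which is the split the Vertebrate node represents.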

Clustering Methodology

Clustering is performed hierarchically, from the crown (extant organism) nodes to the root. At each node, in-group (paralog) clustering is performed first, followed by out-group (ortholog) clustering. In-group and out-group relationships are defined as follows: all descendant nodes (internal as well as crown) that connect to a given node through the same deepest branch are in-group with respect to one another at that node, and those that connect through the opposite deepest branch are out-group. For example, human, mouse, rat, dog, opossum, chicken, frog and the teleosts are all in-group with respect to the chordate node, but with respect to the vertebrate node, teleosts are an out-group for tetrapods, and vice versa.

Gene families are created at a given node by first building paralog clusters among the in-group members: in-group genes that are more similar to each other (as measured by blastp) than to their most similar out-group gene are placed in the same paralogous cluster. Four-fold-degenerate transversion rates (4DTV, one measure of the divergence time between genes based on the number of neutral substitutions) are used to preclude clustering of "paralogs" with low degrees of similarity that are drawn together simply by the lack of a high-scoring hit to an out-group gene.

After paralog clustering along both branches, clusters are merged across branches to capture orthologous relationships. This merging uses a combination of Mutual-Best-Hit (MBH, based on blastp sequence similarity) and synteny metrics. Synteny here rests on the hypothesis that disjoint genomic regions containing a fair number of genes with high-quality hits to each other likely share a common origin in a speciation, whole-genome duplication, or segmental duplication event; only speciation is directly relevant to orthology identification. This process is repeated as we move down from the crown nodes (organisms) to the root.
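
The two similarity criteria described above (in-group paralog clustering and mutual best hits) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Metazome pipeline: the function names, gene identifiers, and scores are hypothetical, blastp scores are assumed to be precomputed and symmetric, and the 4DTV and synteny filters are left out.

```python
from itertools import combinations


def pair_score(a, b, score):
    """Similarity of an unordered gene pair (0 when no hit was recorded)."""
    return score.get(frozenset((a, b)), 0)


def paralog_clusters(ingroup, outgroup, score):
    """Group in-group genes that are more similar to each other than either
    is to its most similar out-group gene (merged with union-find)."""
    parent = {g: g for g in ingroup}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g

    out_best = {g: max((pair_score(g, o, score) for o in outgroup), default=0)
                for g in ingroup}
    for a, b in combinations(ingroup, 2):
        s = pair_score(a, b, score)
        if s > out_best[a] and s > out_best[b]:
            parent[find(a)] = find(b)

    clusters = {}
    for g in ingroup:
        clusters.setdefault(find(g), set()).add(g)
    return sorted(clusters.values(), key=sorted)


def mutual_best_hits(left, right, score):
    """Pairs (a, b) where b is a's best hit in `right` and a is b's best hit in `left`."""
    def best_hit(gene, candidates):
        scored = [(pair_score(gene, c, score), c) for c in candidates]
        top_score, top_gene = max(scored, default=(0, None))
        return top_gene if top_score > 0 else None

    pairs = []
    for a in left:
        b = best_hit(a, right)
        if b is not None and best_hit(b, left) == a:
            pairs.append((a, b))
    return pairs


# Hypothetical scores at the vertebrate node: tetrapod genes vs. teleost genes.
score = {
    frozenset(("hsA1", "hsA2")): 900,
    frozenset(("hsA1", "drA")): 700,
    frozenset(("hsA2", "drA")): 650,
    frozenset(("hsB", "drB")): 800,
    frozenset(("hsA1", "hsB")): 100,
}
tetrapod_genes = ["hsA1", "hsA2", "hsB"]
teleost_genes = ["drA", "drB"]

paralogs = paralog_clusters(tetrapod_genes, teleost_genes, score)
orthologs = mutual_best_hits(tetrapod_genes, teleost_genes, score)
```

With these toy scores, hsA1 and hsA2 score higher against each other than against any teleost gene, so they form one paralog cluster while hsB stays alone. The mutual best hits are hsA1/drA and hsB/drB, which would be the links used to merge clusters across the two branches.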

Note that, by construction, for any given node and gene family, the members of that family remain together in any (larger) gene family defined at a more ancient (i.e., closer to the root) node; the clustering is rigorously hierarchical. Also note that every gene from an organism present at a particular node is in one and only one cluster/family at that node (the clusters are "hard", not "fuzzy").
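
The two properties in this note are easy to check on any set of families. The sketch below assumes families are given as plain Python sets of gene identifiers; the function names and the toy gene names are hypothetical and are not part of any Metazome interface.

```python
def is_hard_partition(families, genes):
    """True if every gene in `genes` appears in exactly one family."""
    members = [g for family in families for g in family]
    return len(members) == len(set(members)) and set(members) == set(genes)


def is_nested(younger, older):
    """True if each family at the younger node lies wholly inside
    a single family at the older (closer to the root) node."""
    owner = {g: i for i, family in enumerate(older) for g in family}
    for family in younger:
        ids = {owner.get(g) for g in family}
        if len(ids) != 1 or None in ids:
            return False
    return True


# Toy families at a tetrapod-like node and at the vertebrate-like node above it.
tetrapod_families = [{"hs1", "mm1"}, {"xt1"}]
vertebrate_families = [{"hs1", "mm1", "xt1", "dr1"}]
broken_families = [{"hs1", "dr1"}, {"mm1", "xt1"}]  # splits {"hs1", "mm1"}
```

In this toy data `is_nested(tetrapod_families, vertebrate_families)` holds, while `broken_families` violates the hierarchy because it separates two genes that were already clustered together at the younger node.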

Some clusters may contain only one extant gene (singletons). Singletons can arise from gene loss, from gene-calling errors, or from "fast" evolution that produces so much sequence divergence that sequence-similarity-based clustering is confounded.

Clustering Statistics


Node | Gene Families | Singletons | Largest Cluster Size | Largest Cluster Defline
Eumetazoan | 31902 | 203761 | 496 | Notch-1 related
Bilaterian | 30367 | 192975 | 489 | Helix-loop-helix DNA-binding domain-containing protein
Protostome | 16314 | 75876 | 141 | UDP-glucuronosyltransferase-related
Deuterostome | 21031 | 126047 | 230 | Major Facilitator Superfamily domain-containing protein
Arthropod | 8663 | 32389 | 49 | Odorant-binding protein AgamOBP3
Chordate | 17527 | 91781 | 189 | RAB5A
Vertebrate | 18382 | 87487 | 121 | G protein signaling regulator
Teleost | 16213 | 33235 | 26 | Glicacolin-related and C1q domain-containing protein
Tetrapod | 15654 | 62031 | 30 | ETS domain-containing predicted transcription factor

Metazome Team


Software: David M. Goodstein, Rusty Howson, Rochak Neupane, Shengqiang Shu
Analysis: Bill Dirks, Uffe Hellsten, Therese Mitros, Dan Rokhsar
©2023 University of California Regents. All rights reserved