Metazome

Overview

Metazome is a joint project of the Department of Energy's and the to facilitate comparative genomic studies amongst metazoans. Clusters of orthologous and paralogous genes that represent the modern descendants of ancestral gene sets are constructed at key phylogenetic nodes. These clusters allow easy access to clade-specific orthology/paralogy relationships as well as clade-specific genes and gene expansions. As of version 2.0.4, Metazome provides access to twenty-four sequenced and annotated metazoan genomes, clustered at nine evolutionarily significant nodes. Where possible, each gene has been annotated with PFAM, KOG, KEGG, and PANTHER assignments, and publicly available annotations from RefSeq, UniProt, Ensembl, and JGI are hyperlinked and searchable.

Included Organisms

Genes from the following organisms are clustered in release 2.0.4 of Metazome:

Organism	Common name	Source
Homo sapiens	Human	NCBI 36 build from
Mus musculus	Mouse	NCBI m36 build from
Rattus norvegicus	Rat	RGSC 3.4 from
Canis familiaris	Dog	NCBI gene models build 2 version 1 on assembly release by Broad 10 May 2005.
Monodelphis domestica	Opossum	MonDom 4 from
Gallus gallus	Chicken	Assembly WASHUC 1, Mar 2004 from (gene build December 2005)
Xenopus tropicalis	Frog	assembly and annotation
Gasterosteus aculeatus	Stickleback	Broad S1 from
Oryzias latipes	Medaka	MEDAKA 1 from
Takifugu rubripes	Fugu pufferfish	assembly and annotation
Danio rerio	Zebrafish	Zv6 from
Ciona savignyi	Sea squirt - savignyi	CSAV 2.0 from
Ciona intestinalis	Sea squirt - intestinalis	JGI 2 from
Branchiostoma floridae	Amphioxus	from April 11 2006
Strongylocentrotus purpuratus	Sea Urchin	NCBI gene build 2 version 1 on
Drosophila melanogaster	Fruitfly	BDGP 4 from
Anopheles gambiae	Mosquito	AgamP3 from
Aedes aegypti	Yellow fever Mosquito	AaegL 1 from
Bombyx mori	Silkworm
Tribolium castaneum	Red Flour Beetle	NCBI gene build 1 version 1 on
Caenorhabditis elegans	Worm	release WS164
Caenorhabditis briggsae	Briggsae Worm	release WS164
Lottia gigantea	Owl limpet (snail)	Early p02S assembly (see for most recent Lottia assembly and annotation)
Nematostella vectensis	Sea anemone	assembly and annotation

Nodes

Clustering is used to group extant genes into sets representing the ancestral genes that existed just prior to various significant evolutionary events (nodes). Extant genes have been clustered at nodes representing the following speciation events:

Eumetazoan:	Gene families representing the most recent common ancestor of bilateria and cnidaria.
Bilaterian:	Gene families representing the most recent common ancestor of deuterostomes and protostomes.
Protostome:	Gene families representing the most recent common ancestor of lophotrochozoans + ecdysozoans.
Deuterostome:	Gene families representing the most recent common ancestor of chordates and ambulacrarians (echinoderms + hemichordates).
Arthropod:	Gene families representing the most recent common ancestor of coleoptera and diptera.
Chordate:	Gene families representing the most recent common ancestor of cephalochordates + olfactores.
Vertebrate:	Gene families representing the most recent common ancestor of teleosts and tetrapods.
Teleost:	Gene families representing the most recent common ancestor of euteleostei and otocephala.
Tetrapod:	Gene families representing the most recent common ancestor of amniota and amphibia.
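
Because each node is defined as the common ancestor of two clades, the node list can be read as a small binary tree. The following is a minimal sketch in Python covering only the Vertebrate, Teleost and Tetrapod nodes; the dictionary layout and the function names (`organisms_under`, `branches`) are illustrative assumptions and not part of Metazome itself. Organism-to-clade assignments follow standard taxonomy for the species listed above.

```python
# Each internal node maps to the two clades whose common ancestor it represents.
# Only the vertebrate subtree is encoded here, to keep the sketch short.
TREE = {
    "Vertebrate": ("Teleost", "Tetrapod"),
    "Teleost": ("euteleostei", "otocephala"),
    "Tetrapod": ("amniota", "amphibia"),
}

# Leaf clades hold the sequenced organisms directly (common names from the table).
LEAF_CLADES = {
    "euteleostei": ["Stickleback", "Medaka", "Fugu pufferfish"],
    "otocephala": ["Zebrafish"],
    "amniota": ["Human", "Mouse", "Rat", "Dog", "Opossum", "Chicken"],
    "amphibia": ["Frog"],
}


def organisms_under(clade):
    """Return every organism that descends from the given clade or node."""
    if clade in LEAF_CLADES:
        return list(LEAF_CLADES[clade])
    left, right = TREE[clade]
    return organisms_under(left) + organisms_under(right)


def branches(node):
    """Return the organisms on each of the two branches below a node."""
    left, right = TREE[node]
    return organisms_under(left), organisms_under(right)


if __name__ == "__main__":
    for node in TREE:
        print(node, branches(node))
```

Calling `branches("Vertebrate")` separates the four teleosts from the seven tetrapods, which is the split the Vertebrate node represents.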

Clustering Methodology

Clustering is performed hierarchically, from the crown (extant organism) nodes to the root. At each node, in-group (paralog) clustering is performed first, followed by out-group (ortholog) clustering. In-group and out-group relationships are defined relative to a given node: descendant nodes (internal as well as crown) that connect to the node through the same deepest branch are in-group with respect to one another, and those that connect through the opposite deepest branch are out-group. For example, human, mouse, rat, dog, opossum, chicken, frog and the teleosts are all in-group with respect to the chordate node, but with respect to the vertebrate node, teleosts are an out-group for tetrapods, and vice versa.

Gene families are created at a given node by first building paralog clusters amongst the in-group members: in-group genes that are more similar to each other (as measured by blastp) than to their most similar out-group gene are placed in the same paralogous cluster. Four-fold-degenerate transversion rates (4DTV, one measure of the divergence time between genes based on the number of neutral substitutions) are used to preclude clustering of "paralogs" with low similarity that are drawn together simply by the lack of a high-scoring hit to an out-group gene.

After paralog clustering along both branches, clusters are merged across branches to capture orthologous relationships. This merging uses a combination of Mutual-Best-Hit (MBH, based on blastp sequence similarity) and synteny metrics. Synteny here rests on the hypothesis that disjoint genomic regions containing a fair number of genes with high-quality hits to each other likely share a common origin in a speciation, whole-genome duplication, or segmental duplication event; only the first of these is directly relevant to orthology identification. This process is repeated at each node, moving from the crown nodes (organisms) to the root.
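
The two similarity rules described above can be illustrated with a toy example. This is a simplified sketch and not the Metazome pipeline: it works on single genes instead of clusters, it omits the 4DTV filter and the synteny metric, and it reads the paralog rule as "a pair of in-group genes scores higher with each other than either does with any out-group gene", which is one plausible interpretation. The gene names (`t1`, `f1`, ...) and the scores are made up, and the function names are hypothetical.

```python
from itertools import combinations


def sim(scores, a, b):
    """Symmetric lookup of a blastp-like score; missing pairs count as 0."""
    return scores.get(frozenset((a, b)), 0.0)


def best_hit(gene, candidates, scores):
    """Return the highest-scoring candidate for gene, or None if nothing hits."""
    hits = [(sim(scores, gene, c), c) for c in candidates if sim(scores, gene, c) > 0]
    return max(hits)[1] if hits else None


def paralog_pairs(in_group, out_group, scores):
    """In-group pairs that score higher with each other than with any out-group gene."""
    pairs = []
    for a, b in combinations(sorted(in_group), 2):
        best_out = max(
            (sim(scores, g, o) for g in (a, b) for o in out_group), default=0.0
        )
        if sim(scores, a, b) > best_out:
            pairs.append((a, b))
    return pairs


def mutual_best_hits(branch_a, branch_b, scores):
    """Cross-branch pairs in which each gene is the other's best hit."""
    pairs = []
    for a in sorted(branch_a):
        b = best_hit(a, branch_b, scores)
        if b is not None and best_hit(b, branch_a, scores) == a:
            pairs.append((a, b))
    return pairs


# Toy vertebrate node: tetrapod genes on one branch, teleost genes on the other.
tetrapod = {"t1", "t2", "t3"}
teleost = {"f1", "f2"}
scores = {
    frozenset(pair): value
    for pair, value in {
        ("t1", "t2"): 900.0,  # recent duplicates within the tetrapod branch
        ("t1", "f1"): 500.0,
        ("t2", "f1"): 450.0,
        ("t3", "f2"): 700.0,
        ("t1", "t3"): 100.0,
        ("f1", "f2"): 80.0,
    }.items()
}

if __name__ == "__main__":
    print(paralog_pairs(tetrapod, teleost, scores))
    print(mutual_best_hits(tetrapod, teleost, scores))
```

In this toy data `t1` and `t2` pair up as in-group paralogs, while `t1`/`f1` and `t3`/`f2` are mutual best hits across the two branches; `t2` has no mutual best hit of its own and would only reach `f1` through its paralog cluster with `t1`.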

Note that, by construction, for any given node and gene family, the members of that family remain together in any (larger) gene family defined at a more ancient (i.e., closer to the root) node; the clustering is rigorously hierarchical. Also note that every gene from an organism present at a particular node is in one and only one cluster/family at that node (the clusters are "hard", not "fuzzy").
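
Both properties are easy to state as checks on sets of gene identifiers. The sketch below assumes clusters are represented as Python sets of gene names; the function names and the example genes (`hsA`, `mmA`, `xtB`, `drA`, `drB`) are hypothetical and used only for illustration.

```python
def is_hard_partition(clusters, genes):
    """True if every gene appears in exactly one cluster and no others appear."""
    seen = [g for cluster in clusters for g in cluster]
    return len(seen) == len(set(seen)) and set(seen) == set(genes)


def is_nested(younger, older):
    """True if each cluster at the younger node lies wholly inside one older cluster."""
    return all(any(set(y) <= set(o) for o in older) for y in younger)


# Toy clusters at a younger node (tetrapod) and at a more ancient node (vertebrate).
tetrapod_clusters = [{"hsA", "mmA"}, {"xtB"}]
vertebrate_clusters = [{"hsA", "mmA", "drA"}, {"xtB", "drB"}]

if __name__ == "__main__":
    all_genes = {"hsA", "mmA", "xtB", "drA", "drB"}
    print(is_hard_partition(vertebrate_clusters, all_genes))
    print(is_nested(tetrapod_clusters, vertebrate_clusters))
```

A clustering that split `hsA` and `mmA` at the vertebrate node after grouping them at the tetrapod node would fail `is_nested`, which is exactly the situation the hierarchical construction rules out.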

Some clusters contain only one extant gene (singletons). Singletons can arise from gene loss, from gene-calling errors, or from "fast" evolution that produces so much sequence divergence that clustering based on sequence similarity is confounded.
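
As a small illustration, singletons can be separated from multi-gene families by cluster size. This sketch assumes clusters are sets of gene names and that singletons are tallied separately from multi-gene families; the function name `split_singletons` and the gene names are hypothetical.

```python
def split_singletons(clusters):
    """Separate multi-gene families from clusters holding a single extant gene."""
    families = [c for c in clusters if len(c) > 1]
    singletons = [c for c in clusters if len(c) == 1]
    return families, singletons


# Toy clusters at one node.
clusters = [{"hsA", "mmA", "drA"}, {"xtB"}, {"drB", "olB"}, {"ggC"}]
families, singletons = split_singletons(clusters)

if __name__ == "__main__":
    print(len(families), "families;", len(singletons), "singletons")
```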

Clustering Statistics

Node	Gene Families	Singletons	Largest Cluster Size and Defline
Eumetazoan	31902	203761	496 - Notch-1 related
Bilaterian	30367	192975	489 - Helix-loop-helix DNA-binding domain-containing protein
Protostome	16314	75876	141 - UDP-glucuronosyltransferase-related
Deuterostome	21031	126047	230 - Major Facilitator Superfamily domain-containing protein
Arthropod	8663	32389	49 - Odorant-binding protein AgamOBP3
Chordate	17527	91781	189 - RAB5A
Vertebrate	18382	87487	121 - G protein signaling regulator
Teleost	16213	33235	26 - Glicacolin-related and C1q domain-containing protein
Tetrapod	15654	62031	30 - ETS domain-containing predicted transcription factor

Metazome Team

Software:	David M. Goodstein, Rusty Howson, Rochak Neupane, Shengqiang Shu
Analysis:	Bill Dirks, Uffe Hellsten, Therese Mitros, Dan Rokhsar