Phytozome
Overview
Phytozome is a joint project of the Department of Energy's Joint Genome Institute and the Center for Integrative Genomics to facilitate comparative genomic studies amongst land plants. Clusters of orthologous and paralogous genes that represent the modern descendants of ancestral gene sets are constructed at key phylogenetic nodes. These clusters allow easy access to clade-specific orthology/paralogy relationships as well as clade-specific genes and gene expansions. As of version 2.0.4, Phytozome provides access to six sequenced and annotated land plant genomes, five of which have been clustered into gene families at six evolutionarily significant nodes. Where possible, each gene has been annotated with PFAM, KOG, KEGG, and PANTHER assignments, and publicly available annotations from RefSeq, UniProt, TAIR, and JGI are hyperlinked and searchable.
Included Organisms
The proteomes of the following organisms are clustered in release 2.0.4 of Phytozome:
Organism | Common name | Source |
Arabidopsis thaliana | Mouse-ear cress | release 6 acquired from |
Oryza sativa | Rice | |
Populus trichocarpa | Poplar | annotation of the v1.0 assembly |
Sorghum bicolor | Sweet Sorghum | Preliminary Genomescan annotation of assembly sbi0 |
Physcomitrella patens | Moss | |
Access is also provided to the sequence and annotation of soybean, though it is not yet included in Phytozome gene families.
Glycine max | Soybean | Preliminary Genomescan/FgenesH/PASA annotation Glyma0.1 of assembly Glyma0 |
Nodes
Clustering is used to group extant genes into sets representing the ancestral genes that existed just prior to various significant evolutionary events (nodes). Extant genes have been clustered at nodes representing the following speciation and genome-wide duplication events:
Land Plants node (~450 Mya): | Genes representing the most recent common ancestor of Tracheophyta (represented by the angiosperms) and Bryophyta (represented by the moss Physcomitrella). |
Angiosperms node (~120 Mya): | Genes representing the most recent common ancestor of grasses (Sorghum and Rice) and Rosids (Arabidopsis and Poplar). |
Grasses node (~68 Mya): | Genes representing the most recent common ancestor of Sorghum and Rice. |
Rosids node (~110 Mya): | Genes representing the most recent common ancestor of Arabidopsis and Poplar. |
Poplar duplication node (~60 Mya): | Genes representing the ancestral gene set just prior to the post-speciation genome-wide duplication in Poplar. |
Arabidopsis duplication node (~60 Mya): | Genes representing the ancestral gene set just prior to the post-speciation genome-wide duplication in Arabidopsis. |
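The node descriptions imply a small tree, and any child-before-parent traversal of that tree gives a valid order in which to build clusters. The following is a minimal sketch of that idea; the dictionary layout and the function name `bottom_up_order` are illustrative assumptions, not part of Phytozome.

```python
# Tree of nodes taken from the list above; names not in the dictionary's
# keys are treated as leaves (extant genomes). The layout is hypothetical.
NODE_CHILDREN = {
    "Land Plants": ["Angiosperms", "Physcomitrella patens"],
    "Angiosperms": ["Grasses", "Rosids"],
    "Grasses": ["Sorghum bicolor", "Oryza sativa"],
    "Rosids": ["Poplar duplication", "Arabidopsis duplication"],
    "Poplar duplication": ["Populus trichocarpa"],
    "Arabidopsis duplication": ["Arabidopsis thaliana"],
}


def bottom_up_order(root, children=NODE_CHILDREN):
    """Return internal nodes in post-order, so children precede parents."""
    order = []

    def visit(node):
        for child in children.get(node, []):
            visit(child)
        if node in children:  # skip leaves (extant genomes)
            order.append(node)

    visit(root)
    return order


clustering_order = bottom_up_order("Land Plants")
```

In this sketch `clustering_order` ends with the Land Plants node, and every node appears only after all of the nodes beneath it.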
Clustering Methodology
Clustering is performed hierarchically, from the crown nodes to the root. First, the recent whole-genome duplications in poplar and Arabidopsis were analyzed to form intraspecies clusters. Next, poplar-Arabidopsis (Rosid) clusters were created from poplar-Arabidopsis orthologs, with the requirement that all previous intraspecies clusters remain together. The Grass node captures the rice-sorghum common ancestor. The Angiosperm (poplar-Arabidopsis-rice-sorghum) node was created from grass-Rosid orthologs, again requiring that all previous Rosid and grass clusters remain together. The final land plant node was created using mutual-best hits between the moss Physcomitrella patens and the previously clustered plants as orthologs, because no detectable syntenic signal remains at this scale of evolutionary time.
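The two ideas in this paragraph, mutual-best hits as orthologs and merging while keeping earlier clusters together, can be sketched as follows. This is an illustration under stated assumptions, not the Phytozome pipeline: the score dictionary, the gene identifiers, and the use of a union-find structure are all hypothetical choices made for the example.

```python
from collections import defaultdict


def mutual_best_hits(scores):
    """Return gene pairs that are each other's best-scoring hit.

    `scores` maps (query, subject) -> similarity score; this input format
    is an assumption (e.g. parsed from an all-against-all alignment).
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {
        tuple(sorted((q, s)))
        for q, (s, _) in best.items()
        if s in best and best[s][0] == q
    }


def merge_clusters(child_clusters, ortholog_pairs):
    """Build parent-node clusters from child-node clusters.

    Every child cluster stays intact; two child clusters are joined
    whenever an ortholog pair links a gene in one to a gene in the other.
    """
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Seed the structure so that each existing cluster remains together.
    for cluster in child_clusters:
        members = sorted(cluster)
        for gene in members:
            parent.setdefault(gene, gene)
        for gene in members[1:]:
            union(members[0], gene)

    # Join clusters that are linked by an ortholog pair.
    for a, b in ortholog_pairs:
        if a in parent and b in parent:
            union(a, b)

    groups = defaultdict(set)
    for gene in parent:
        groups[find(gene)].add(gene)
    return list(groups.values())


# Toy data with made-up gene identifiers and scores.
rosid_clusters = [{"at1", "at2", "pt1"}, {"at3"}]
grass_clusters = [{"os1", "sb1"}, {"os2"}]
scores = {
    ("at1", "os1"): 90, ("os1", "at1"): 88,  # reciprocal best hits
    ("at3", "os1"): 40,                      # one-directional only
    ("os2", "at1"): 30,                      # one-directional only
}
pairs = mutual_best_hits(scores)
angiosperm_clusters = merge_clusters(rosid_clusters + grass_clusters, pairs)
```

With this toy input, only `at1` and `os1` are reciprocal best hits, so their two clusters merge into one while `at3` and `os2` remain singletons. A union-find structure is one convenient way to guarantee that earlier clusters are never split; the source does not say which data structure was actually used.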
Note that, by construction, every gene from an organism present at a particular node belongs to exactly one cluster at that node. Some clusters may contain only one extant gene (singletons). Singletons can arise from gene loss, from gene-calling errors, or from "fast" evolution, in which sequence divergence is so great that sequence-similarity-based clustering is confounded.
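The one-cluster-per-gene property is easy to verify for any candidate clustering. Below is a minimal sketch of such a check; the function names and the representation of clusters as Python sets of gene identifiers are assumptions made for illustration.

```python
from collections import Counter


def check_partition(genes, clusters):
    """Raise ValueError unless each gene is in exactly one cluster."""
    counts = Counter(gene for cluster in clusters for gene in cluster)
    missing = set(genes) - set(counts)
    unknown = set(counts) - set(genes)
    duplicated = {gene for gene, n in counts.items() if n > 1}
    if missing or unknown or duplicated:
        raise ValueError(
            f"missing={sorted(missing)} unknown={sorted(unknown)} "
            f"duplicated={sorted(duplicated)}"
        )


def singleton_genes(clusters):
    """Return the genes that sit alone in a cluster."""
    return sorted(next(iter(c)) for c in clusters if len(c) == 1)


# Toy example with made-up gene identifiers.
genes = ["g1", "g2", "g3", "g4"]
clusters = [{"g1", "g2"}, {"g3"}, {"g4"}]
check_partition(genes, clusters)   # passes silently
lonely = singleton_genes(clusters)
```

Here `lonely` holds `g3` and `g4`, and the check would raise if a gene were left out of the clustering or placed in two clusters at once.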
Clustering Statistics
Node | Gene families | Singletons | Median family size | Largest family (size - description) |
Land Plants | 10232 | 14117 | 6 | 499 - No apical meristem (NAM) protein |
Angiosperms | 11963 | 13467 | 5 | 410 - F-box domain and leucine-rich repeat |
Grasses | 15362 | 7562 | 2 | 362 - retrotransposon |
Rosids | 13976 | 6785 | 3 | 112 - Plant protein of unknown function |
Poplar Duplication | 9564 | 24679 | 2 | 15 - Leucine-rich repeat |
Arabidopsis Duplication | 2686 | 20095 | 2 | 25 - Tyrosine Kinase |
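Columns like those in the table can be derived from a list of clusters at a node. The sketch below assumes that a "gene family" is a cluster with at least two genes and that the median is taken over families only, excluding singletons; the source does not state either convention, and the function name `node_statistics` is hypothetical.

```python
from statistics import median


def node_statistics(clusters):
    """Summarize cluster sizes at one node.

    Assumption: families are clusters with two or more genes, and
    singletons are counted separately and excluded from the median.
    """
    sizes = [len(cluster) for cluster in clusters]
    family_sizes = [size for size in sizes if size >= 2]
    return {
        "gene_families": len(family_sizes),
        "singletons": sizes.count(1),
        "median_family_size": median(family_sizes) if family_sizes else 0,
        "largest": max(family_sizes, default=0),
    }


# Toy clusters with made-up gene identifiers: sizes 5, 3, 2, 1, 1.
toy_clusters = [
    {"a1", "a2", "a3", "a4", "a5"},
    {"b1", "b2", "b3"},
    {"c1", "c2"},
    {"d1"},
    {"e1"},
]
stats = node_statistics(toy_clusters)
```

For the toy input this yields three families, two singletons, a median family size of 3, and a largest family of 5 genes.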
Phytozome Team
Software: | David M. Goodstein, Rusty Howson, Rochak Neupane, Shengqiang Shu |
Analysis: | Bill Dirks, Uffe Hellsten, Therese Mitros, Dan Rokhsar |