Phytozome is a joint project of the Department of Energy's and the to facilitate comparative genomic studies amongst land plants. Clusters of orthologous and paralogous genes that represent the modern descendants of ancestral gene sets are constructed at key phylogenetic nodes. These clusters allow easy access to clade-specific orthology/paralogy relationships as well as clade-specific genes and gene expansions. As of version 2.0.4, Phytozome provides access to six sequenced and annotated land plant genomes, five of which have been clustered into gene families at six evolutionarily significant nodes. Where possible, each gene has been annotated with PFAM, KOG, KEGG, and PANTHER assignments, and publicly available annotations from RefSeq, UniProt, TAIR, and JGI are hyperlinked and searchable.

Included Organisms

The proteomes of the following organisms are clustered in release 2.0.4 of Phytozome:

Organism                Common name       Source
Arabidopsis thaliana    Mouse-ear cress   release 6 acquired from
Oryza sativa            Rice
Populus trichocarpa     Poplar            annotation of the v1.0 assembly
Sorghum bicolor         Sweet Sorghum     Preliminary Genomescan annotation of assembly sbi0
Physcomitrella patens   Moss

Access is also provided to the sequence and annotation of soybean, though it is not yet included in Phytozome gene families.

Glycine max             Soybean           Preliminary Genomescan/FgenesH/PASA annotation Glyma0.1 of assembly Glyma0


Clustering is used to group extant genes into sets representing the ancestral genes that existed just prior to various significant evolutionary events (nodes). Extant genes have been clustered at nodes representing the following speciation and genome-wide duplication events:

Land Plants node (~450 Mya): Genes representing the most recent common ancestor of Tracheophyta (represented by angiosperms) and Bryophyta (represented by the moss Physcomitrella).
Angiosperms node (~120 Mya): Genes representing the most recent common ancestor of grasses (sorghum and rice) and Rosids (Arabidopsis and poplar).
Grasses (~68 Mya): Genes representing the most recent common ancestor of sorghum and rice.
Rosids (~110 Mya): Genes representing the most recent common ancestor of Arabidopsis and poplar.
Poplar duplication (~60 Mya): Genes representing the ancestral gene set just prior to the post-speciation genome-wide duplication in poplar.
Arabidopsis duplication (~60 Mya): Genes representing the ancestral gene set just prior to the post-speciation genome-wide duplication in Arabidopsis.
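
The nesting of these nodes can be written down as a small tree. The sketch below is only an illustration: the node names and ages are copied from the list above, while the dictionary layout and the helper names (`NODES`, `organisms_under`, `bottom_up_order`) are assumptions made for this example and are not part of Phytozome.

```python
# Hypothetical encoding of the node hierarchy; names and ages come from the
# list above, the data layout itself is an assumption for illustration.
NODES = {
    "Land Plants": {"age_mya": 450, "children": ["Angiosperms", "Physcomitrella patens"]},
    "Angiosperms": {"age_mya": 120, "children": ["Grasses", "Rosids"]},
    "Grasses": {"age_mya": 68, "children": ["Sorghum bicolor", "Oryza sativa"]},
    "Rosids": {"age_mya": 110, "children": ["Arabidopsis duplication", "Poplar duplication"]},
    "Poplar duplication": {"age_mya": 60, "children": ["Populus trichocarpa"]},
    "Arabidopsis duplication": {"age_mya": 60, "children": ["Arabidopsis thaliana"]},
}


def organisms_under(node):
    """Return the sorted organisms whose genes are clustered at `node`."""
    if node not in NODES:  # anything that is not a node is treated as an organism
        return [node]
    found = []
    for child in NODES[node]["children"]:
        found.extend(organisms_under(child))
    return sorted(found)


def bottom_up_order(root="Land Plants"):
    """List nodes children-first, so each node follows the nodes nested inside it."""
    order = []

    def visit(name):
        if name in NODES:
            for child in NODES[name]["children"]:
                visit(child)
            order.append(name)

    visit(root)
    return order


if __name__ == "__main__":
    for name in bottom_up_order():
        print(name, NODES[name]["age_mya"], organisms_under(name))
```

Listing nodes children-first gives one order in which every node is handled only after the nodes nested inside it.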

Clustering Methodology

Clustering is performed hierarchically, from the crown nodes to the root. First, the recent whole-genome duplications in poplar and Arabidopsis were analyzed. Next, poplar-Arabidopsis (Rosid) clusters were created from poplar-Arabidopsis orthologs, while requiring that all previous intraspecies clusters remain together. The Grass node captures the rice-sorghum common ancestor. The Angiosperm (poplar-Arabidopsis-rice-sorghum) node was created from grass-Rosid orthologs, requiring that all previous Rosid and grass clusters remain together. The final Land Plants node was created using mutual-best hits between the moss Physcomitrella patens and the previously clustered plants as orthologs, because no detectable syntenic signal remains at this scale of evolutionary time.
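
Two ideas in this paragraph lend themselves to a short sketch: picking mutual-best hits as ortholog pairs, and merging clusters at a deeper node while keeping every earlier cluster intact. The code below is a minimal illustration under stated assumptions, not the Phytozome pipeline: the score dictionary, the gene names, and the functions `mutual_best_hits` and `merge_clusters` are hypothetical, and the merge is a plain union-find (single-linkage) step chosen for simplicity.

```python
def mutual_best_hits(scores):
    """Return pairs of genes that are each other's highest-scoring hit.

    `scores` maps (query, subject) -> similarity score and is assumed to
    contain only cross-species comparisons, in both directions.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {
        frozenset((query, subject))
        for query, (subject, _) in best.items()
        if subject in best and best[subject][0] == query
    }


class DisjointSet:
    """Minimal union-find over a fixed set of items."""

    def __init__(self, items):
        self.parent = {item: item for item in items}

    def find(self, item):
        while self.parent[item] != item:
            self.parent[item] = self.parent[self.parent[item]]  # path halving
            item = self.parent[item]
        return item

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def merge_clusters(previous_clusters, ortholog_pairs):
    """Join earlier clusters wherever an ortholog pair links them.

    Every gene in `ortholog_pairs` must already appear in `previous_clusters`
    (a gene with no earlier cluster is passed in as a one-gene cluster).
    Earlier clusters are never split, only joined.
    """
    genes = [gene for cluster in previous_clusters for gene in cluster]
    sets = DisjointSet(genes)
    for cluster in previous_clusters:  # keep previous clusters together
        for gene in cluster[1:]:
            sets.union(cluster[0], gene)
    for pair in ortholog_pairs:  # link clusters through orthologs
        members = list(pair)
        for gene in members[1:]:
            sets.union(members[0], gene)
    groups = {}
    for gene in genes:
        groups.setdefault(sets.find(gene), []).append(gene)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    # Toy data: two moss genes and two previously built angiosperm clusters.
    previous = [["moss1"], ["moss2"], ["ath1", "osa1", "ptr1"], ["ath2", "ptr2"]]
    scores = {
        ("moss1", "ath1"): 90, ("ath1", "moss1"): 90,
        ("moss1", "ath2"): 40, ("ath2", "moss1"): 40,
        ("moss2", "ath2"): 30, ("ath2", "moss2"): 30,
    }
    print(merge_clusters(previous, mutual_best_hits(scores)))
```

In the toy data, `moss2` stays a singleton: its best hit is `ath2`, but the best hit of `ath2` is `moss1`, so the pair is not mutual.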

Note that, by construction, every gene from an organism present at a particular node is in one and only one cluster at that node. Some clusters may contain only one extant gene (singletons). Singletons can arise from gene loss, from gene-calling errors, or from "fast" evolution that produces so much sequence divergence that sequence-similarity-based clustering is confounded.
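
The "one and only one cluster" property means the clusters at a node form a partition of that node's genes. A check of this invariant might look like the sketch below; the function name `check_partition` and the list-of-lists cluster representation are assumptions made for illustration.

```python
from collections import Counter


def check_partition(genes, clusters):
    """Report genes that break the one-gene-one-cluster rule at a node.

    Returns three sorted lists: genes in no cluster, genes in more than one
    cluster, and clustered genes that are not in `genes` at all.
    All three are empty when `clusters` is a valid partition of `genes`.
    """
    gene_set = set(genes)
    counts = Counter(gene for cluster in clusters for gene in cluster)
    missing = sorted(gene for gene in gene_set if counts[gene] == 0)
    duplicated = sorted(gene for gene, n in counts.items() if n > 1)
    unknown = sorted(gene for gene in counts if gene not in gene_set)
    return missing, duplicated, unknown


if __name__ == "__main__":
    genes = ["g1", "g2", "g3", "g4"]
    clusters = [["g1", "g2"], ["g3"], ["g4"]]  # two singletons: g3 and g4
    print(check_partition(genes, clusters))
    print([c[0] for c in clusters if len(c) == 1])
```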

Clustering Statistics

Node                      Gene Families   Singletons   Median Family Size   Largest Family
Land Plants               10232           14117        6                    499 - No apical meristem (NAM) protein
Angiosperms               11963           13467        5                    410 - F-box domain and leucine-rich repeat
Grasses                   15362           7562         2                    362 - retrotransposon
Rosids                    13976           6785         3                    112 - Plant protein of unknown function
Poplar Duplication        9564            24679        2                    15 - Leucine-rich repeat
Arabidopsis Duplication   2686            20095        2                    25 - Tyrosine Kinase
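
Statistics of this kind can be derived directly from the clusters at a node. The sketch below assumes that "gene families" are clusters with two or more genes and that the median is taken over those families only, which is how the table appears to separate families from singletons. The function name `node_statistics` and the toy clusters are hypothetical.

```python
from statistics import median


def node_statistics(clusters):
    """Summarize the clusters at one node.

    Assumption: a "gene family" is any cluster with at least two genes;
    clusters with exactly one gene are counted as singletons.
    """
    sizes = [len(cluster) for cluster in clusters]
    families = [size for size in sizes if size >= 2]
    return {
        "gene_families": len(families),
        "singletons": sizes.count(1),
        "median_family_size": median(families) if families else None,
        "largest_family_size": max(families) if families else None,
    }


if __name__ == "__main__":
    toy_clusters = [["a"], ["b"], ["c"], ["d", "e"], ["f", "g"], ["h", "i", "j", "k", "l"]]
    print(node_statistics(toy_clusters))
```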

Phytozome Team

Software: David M. Goodstein, Rusty Howson, Rochak Neupane, Shengqiang Shu
Analysis: Bill Dirks, Uffe Hellsten, Therese Mitros, Dan Rokhsar
©2023 University of California Regents. All rights reserved