Metazome Help

Nodes, Clusters and Consensus Sequences

Nodes and Clusters
Consensus Sequences

Searching for Clusters

Keyword Search
BLAST Search

Viewing Cluster Details

Cluster Naming and Classification
Genes in this Cluster
Functional Analysis
Multiple Sequence Alignments
Align cluster members
Find related clusters
Get Sequences
Display Options

Analyzing Cluster Sequences

Nodes, Clusters and Consensus Sequences

Nodes and Clusters:

: Please go to the info page for information on nodes and clustering.

Consensus Sequences

: A consensus peptide sequence is constructed for each cluster from the MSA (multiple sequence alignment). The consensus is that sequence which maximizes the sum of the pairwise scores with each cluster member's peptide sequence. For Metazome, BLOSUM62 was used as the scoring matrix, with gap opening and extension costs of 11 and 2, respectively. This relatively simple approach produces a cluster consensus sequence that is comparable to more sophisticated profile construction algorithms, in terms of its ability to post facto correctly assign (via BLAST) cluster members to their correct clusters.
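The selection rule described above (choose the sequence with the largest summed pairwise score against all cluster members) can be sketched as follows. This is only an illustration: `pick_consensus` and `toy_score` are hypothetical names, and `toy_score` is a simple identity count over equal-length aligned strings, standing in for the BLOSUM62 scoring with gap costs that Metazome actually uses.

```python
def toy_score(a: str, b: str) -> int:
    # Hypothetical stand-in for a real alignment score (NOT BLOSUM62):
    # +1 where two equal-length aligned strings agree, -1 where they differ.
    return sum(1 if x == y else -1 for x, y in zip(a, b))


def pick_consensus(candidates, members, score=toy_score):
    # Return the candidate whose summed pairwise score against
    # every cluster member is the largest.
    return max(candidates, key=lambda c: sum(score(c, m) for m in members))


members = ["MKTAY", "MKSAY", "MRTAY"]
# Here the candidates are simply the members themselves.
consensus = pick_consensus(members, members)
```

Any scoring function with the same two-argument signature can be passed as `score`, so a proper substitution-matrix scorer could replace the toy one.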

Searching for Clusters

Keyword Search:

Gene Symbol, Identifier, Defline and Ontology Search

Gene clusters can be retrieved by providing keywords that match the symbols, data source identifiers, deflines, and functional annotations (ontologies) that have been assigned either directly to the gene cluster, or to any of its members. There are two types of keyword searches: Symbols/Identifiers/Deflines and Ontologies.

Symbols/Identifiers/Deflines

This search type allows you to find gene clusters whose members have textual annotations (deflines, symbols, database reference identifiers) matching one or more search terms. For example, if you know the HUGO, EMBL, Unigene, Uniprot, RefSeq, or other identifier for a member of the cluster, you can pull up the entire cluster. Likewise, if you have a defline or description of a gene, you can use all or part of it to find the clusters associated with that gene. Please see the "Search Syntax" section below for details on how to enter search terms.

Ontologies

This search type allows you to find gene clusters with one or more members whose classification by GO, PFAM, PANTHER, EC, KEGG Orthology, or KOG matches the search terms. "Cluster KEGG Orthology" is handled slightly differently. If you select "Cluster KEGG Orthology" and your search terms match one or more KO (KEGG Orthology) classifications, the search will only return clusters that have been specifically classified with those KOs at the cluster level (this requires that a majority of cluster members share the same KO classification). The term can be either an ontology id or the text associated with an ontology id (e.g., "non-specific rna polymerase ii transcription factor" is the description of GO ID GO:0016252, so a search term of "non-specific rna polymerase" would give the same result as searching with "16252"). Be sure to select at least one ontology type for this type of search, and see the "Search Syntax" section below for details on how to enter ontology identifiers.

When searching with specific ontology ids, be sure to follow the format below:

Ontology	format
GO	remove the "GO:" prefix and any leading zeros (e.g., 16455, not GO:0016455)
PFAM	PF01000
PANTHER	PTHR19376.SF11 (for subfamilies) or PTHR19376 (for family)
EC	2.7.1.121
KEGG Orthology	K03085
Cluster KEGG Orthology	K04354 (for 4th level) or 00453 () ids.
KOG	KOG1354
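As an illustration of the GO format, the sketch below converts a full GO identifier into the bare number the search expects, matching the GO:0016252 to 16252 example given earlier. The function name `go_search_term` is hypothetical and is not part of Metazome.

```python
import re


def go_search_term(go_id: str) -> str:
    # Drop an optional "GO:" prefix and any leading zeros,
    # e.g. "GO:0016252" becomes "16252".
    match = re.fullmatch(r"(?:GO:)?0*(\d+)", go_id.strip(), flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"not a GO identifier: {go_id!r}")
    return match.group(1)
```

The other ontology types in the table are entered as shown, so they need no conversion.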

A successful ontology search will return a table of ontology ids (for example, this rather broad search for transcription) that match the search terms, followed by the search results. Clicking on one of the ontology ids will repeat the search as if you searched for just that specific id.

Search Syntax

You can enter one or more search terms. By default, these terms will be OR-ed (results will be returned as long as they match at least one of the search terms). To perform an AND search (retrieve results that match ALL of your search terms), preface each term with "+". To make sure search results DO NOT match a particular term, preface it with "-". Wildcards ("*") are allowed anywhere in a term except as the first character. To exactly match a phrase, rather than individual terms, place the phrase in quotes (e.g., "rna polymerase"). By default, a trailing wildcard is added to each term, except in quoted exact-phrase searches (do not use wildcards in exact-phrase searches). To disable this, uncheck the "add trailing wildcard" box next to the search box.
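To make these rules concrete, here is a minimal sketch of how a query string could be broken into terms under the syntax described above. It illustrates the documented behaviour only and is not Metazome's actual parser; `Term` and `parse_query` are hypothetical names.

```python
import re
from typing import List, NamedTuple


class Term(NamedTuple):
    mode: str    # "any" (default OR), "must" ("+" prefix), "must_not" ("-" prefix)
    text: str    # the term or phrase, including any wildcard
    exact: bool  # True for a quoted exact phrase


# An optional +/- prefix followed by either a quoted phrase or a bare word.
_TOKEN = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')
_MODES = {"": "any", "+": "must", "-": "must_not"}


def parse_query(query: str, add_trailing_wildcard: bool = True) -> List[Term]:
    terms = []
    for m in _TOKEN.finditer(query):
        prefix, phrase, word = m.groups()
        if phrase is not None:
            # Quoted phrases are matched exactly and never get a wildcard.
            terms.append(Term(_MODES[prefix], phrase, True))
            continue
        if word.startswith("*"):
            raise ValueError(f"wildcard cannot be the first character: {word!r}")
        if add_trailing_wildcard and not word.endswith("*"):
            word += "*"
        terms.append(Term(_MODES[prefix], word, False))
    return terms


terms = parse_query('+rna -"polymerase ii" kinas*')
```

In this example, `rna` is required and gains a trailing wildcard, the quoted phrase is excluded and left untouched, and `kinas*` is an optional term that already carries its own wildcard.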

Search Results

Search results are presented in pages displaying 100 clusters each, ordered from the largest to the smallest clusters. Information on each cluster consists of its size (number of members), node, and internal identifier, a defline (if available), and lastly the distribution of cluster members by organism. To view detailed information on any particular gene cluster, click on the magnifying glass icon. To view additional pages of search results, use the "First", "Previous", "Next", and "Last" links at the top and bottom of each page.

Analysis of Search Results and Composite Cluster creation

You can immediately begin analyzing the sequences associated with one or more clusters using the links contained here. Simply select one or more of the clusters listed on the Search Results page (by checking the box to the left of each cluster row), and then decide whether you wish to view/align the clusters' member proteins, coding sequences, or consensus sequences (note that for consensus sequences, you are working with one sequence per cluster, so viewing and aligning the consensus sequences is typically useful only when you've selected more than one cluster). Go to the section Analyzing Cluster Sequences to learn more.

As of version 2.0.3, you can also create a composite cluster by selecting two or more clusters from the search results and clicking "View selected clusters as a single composite cluster." This will immediately take you to a cluster summary page that contains all the members of the selected clusters, arranged for viewing as a single, composite cluster. Note that this option is only available for searches performed at a single node.

Display Options for Search Results

The columns on the right-hand side of the Search Results page show a species-by-species breakdown of the members of each gene family. When you uncheck the box for a given species in the Display tab, that species column is removed from the display, and the counts for that species are added to the "Oth" ( = "Other") column. You can make these display options permanent by clicking the "Save these Settings" button. If these settings are made permanent, they will also affect the display of information on the cluster summary page.
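The effect of hiding a species on the per-species counts can be sketched as follows. The species codes and the function name `fold_hidden_species` are made up for illustration; only the "Oth" column name comes from the page itself.

```python
def fold_hidden_species(counts, visible):
    # Keep the counts for visible species and add every
    # other count (including any existing "Oth") to "Oth".
    shown = {}
    other = 0
    for species, n in counts.items():
        if species in visible and species != "Oth":
            shown[species] = n
        else:
            other += n
    shown["Oth"] = other
    return shown


row = fold_hidden_species({"Hsa": 3, "Mmu": 2, "Dme": 1, "Oth": 4}, {"Hsa", "Mmu"})
```

The cluster size is unchanged by hiding a column, since the hidden counts are moved rather than dropped.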

BLAST Search:

The BLAST search uses NCBI BLAST (v2.2.13) to enable sequence similarity searches of both individual organism genomes and the cluster consensus sequences defined at each node. Simply paste your sequence (with or without a fasta header) into the Query Sequence text box. If you are mainly interested in analyzing your sequence in the genomic context of a particular organism, use the "Organism Genome" target type. To analyze the evolutionary history and possible orthologs of your sequence, select the "Node Consensus" target type. Note that the former is a nucleotide BLAST database, while the latter is a protein database.

BLAST Options:

The available options are mostly standard NCBI options. These include:

Allow Gapped Alignment
Comparison Matrix	The scoring matrix that determines the cost of each possible residue mismatch between query and target sequence.
Word Length	The minimum number of consecutive residues that must match identically between the query and target sequence in order to seed an alignment.
E threshold	The maximum expectation value of retained alignments.
# of alignments to show	How many top-scoring alignments should be displayed in the result set
Filter options	Whether to remove low complexity regions from the query sequence, using DUST for blastn searches, and SEG for all others.

For Node Consensus searches, there is also an option to include singletons in the target database. Singletons are clusters of size 1, whose consensus sequence is simply the single member's protein sequence.

BLAST results

BLAST results are organized into a graphical overview of HSPs (High Scoring Pairs) of query and target sequence, and a hyperlinked text report. The HSPs are color-coded to indicate significance (red being the most significant alignments and blue the least). If you mouse over an HSP, the info box above the graphic will display the significance and score for the hit, as well as a descriptive defline for the corresponding target sequence. Clicking on an HSP will take you to a detailed view of the alignment (within the textual report). The textual report includes a summary of target sequences producing significant alignments. The E values in this summary are each hyperlinked to a detailed alignment view. You can also click on the target name (or magnifying glass icon) to be taken to a view of the BLAST hit in its genomic context (for genomic BLAST searches) or a cluster summary (for BLAST searches against the Node Consensus databases).

Analyze BLAST results and Create Composite Clusters (Node Consensus BLAST only)

Using the checkboxes to the left of the "Significant Alignments" list, you can select one or more clusters to analyze within Jalview. Click on "Align Member protein sequences" to load the selected clusters' peptide sequences into Jalview. Click on "Align Member coding sequences" to launch Jalview with the clusters' coding sequences instead. Finally, click on "Align cluster consensus sequences" to load the consensus sequences for each cluster (only useful if you've selected more than one cluster). When loading more than one cluster into Jalview, each cluster's sequences will be shaded in the same color so they can be readily distinguished from sequences from other clusters. All alignments, sequences, and trees can be downloaded from Jalview in multiple formats. Please see the Analyzing Cluster Sequences section for more information.

As of version 2.0.3, you can also create a composite cluster by selecting two or more clusters from the BLAST results and clicking "View selected clusters as a single composite cluster." This will immediately take you to a cluster summary page that contains all the members of the selected clusters, arranged for viewing as a single, composite cluster.

Viewing Cluster Details

The Cluster Summary page provides a detailed picture of a given cluster's constituent genes. This page is accessed by clicking the "magnifying glass" icon next to the cluster of interest on the Search Results or BLAST Results pages.

Cluster Naming and Classification

This section gives a brief summary and high-level classification of the ancestral gene represented by this cluster. The summary includes the node name, the number of crown (extant organism) genes in the cluster, an automatically generated cluster name, and, where possible, both the KOG letter and the KEGG Brite classification of the cluster (also referred to as "Cluster KEGG Orthology").

Names for non-singleton clusters are either KOG-based (if more than 50% of a cluster's member genes are annotated with the same KOG, the cluster name is the KOG description) or SwissProt-based (if a cluster is purely orthologous, meaning all organisms at that node have one and only one member in the cluster, then the cluster is named according to the SwissProt name of a "prominent", meaning well-annotated, member). If neither of these cases applies, then the SwissProt or Trembl name of the member that is most similar to the cluster consensus sequence is used. If this still does not yield a name, then the cluster is named simply "Hypothetical Gene." Singleton clusters are named with the definition line of their sole member.
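The naming rules above can be read as a decision cascade, sketched below. This is a hedged illustration rather than Metazome's implementation: the dictionary keys (`organism`, `kog`, `swissprot`, `prominent`, `defline`), the `closest_name` argument (the SwissProt or Trembl name of the member most similar to the consensus, if any), and the choice to test the KOG rule before the SwissProt rule are all assumptions.

```python
from collections import Counter


def name_cluster(members, node_organisms, closest_name=None):
    # Singleton clusters take the defline of their sole member.
    if len(members) == 1:
        return members[0]["defline"]

    # KOG-based: more than 50% of members share the same KOG description.
    kog_counts = Counter(m["kog"] for m in members if m.get("kog"))
    if kog_counts:
        kog, count = kog_counts.most_common(1)[0]
        if count > len(members) / 2:
            return kog

    # SwissProt-based: purely orthologous cluster (exactly one member per
    # organism at the node) with a well-annotated ("prominent") member.
    per_organism = Counter(m["organism"] for m in members)
    if all(per_organism[org] == 1 for org in node_organisms):
        for m in members:
            if m.get("prominent") and m.get("swissprot"):
                return m["swissprot"]

    # Fall back to the member closest to the consensus, then to the default.
    return closest_name or "Hypothetical Gene"
```

Each rule returns as soon as it yields a name, so later rules only apply when earlier ones fail.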

The KOG Class assignment follows the same rule as KOG naming, described above. The KEGG Brite classification is done similarly, with the modification that only 50% of the member genes that could possibly have a KO (KEGG Orthology) are required to agree. This modification is due to the fact that not all organisms have been analyzed via KO, which is a prerequisite for KEGG Brite classification. The two shallowest levels of the KEGG Brite classification are hyperlinked. Clicking on the links will find all clusters at the current node assigned this KEGG Brite annotation.

Note that for composite clusters, naming and classification information are not provided.

Genes in this Cluster

Information on each member gene of this cluster is available in this section. This information includes: the species code, the genomic location (chromosome/scaffold, with the start and end coordinates available via mouse hovering), reference identifiers for this gene in other datasets (e.g., RefSeq, Unigene, Uniprot, Ensembl, JGI), gene symbol(s), defline, a cartoon of any PFAM domains found on this gene's product, and a cartoon of the upstream and downstream neighbors of this gene. Note that each of these columns can be hidden or made visible by selecting it in the Display Options panel (see below). If an expand icon is visible next to a row, the row can be expanded to show more information.

The CHROM entry is always hyperlinked to a Gbrowse-based local genomic view of the gene model. For some organisms, this view will also include supporting EST evidence, homologous proteins aligned to the genome, and genome-to-genome synteny information plots. Place the mouse cursor over the CHROM entry, and the gene start and end positions will be visible as well.

The first id in the DbXREF column is always from the primary source dataset (the dataset from which this gene was obtained, typically Ensembl or JGI). If the gene is available in other datasets, those identifiers are listed as well (you will need to expand the row to see them). Where possible, all reference identifiers are hyperlinked to an information page provided by their source's curators. A two-letter code is used to indicate the source database of a given identifier. The codes are:

RS	NCBI RefSeq
ST	SwissProt/Trembl
UG	NCBI UniGene
EG	NCBI EntrezGene
HU	HUGO Gene Nomenclature Database
EM	EMBL

The Domains tab provides a cartoon view of any PFAM domains called on this gene's peptide. The same PFAM domain in different peptides will be rendered in the same color. Mouse over a domain to see the PFAM id, description, and domain coordinates displayed in a pop-up to the left of the domains. Click on the domain to see it highlighted in all other rows in which it appears. Note that all peptides in the cluster are scaled to the same length for viewing.

The Synteny tab provides a view of the 5 upstream and 5 downstream neighbors ("syntenic block") of this gene (known as the "anchor gene", which is always rendered in black, except in the case of composite clusters, where the anchor genes are not necessarily from the same cluster). The syntenic blocks are oriented so that anchor genes are always on the same strand (consistent with their implied descent from a common ancestral gene). Mousing over any syntenic gene will produce an info box displaying the gene's primary id, and the name and id of the cluster containing that gene. The box also includes a link to that cluster's summary page. To access the link, click on the syntenic gene (which freezes the info box and highlights all other genes that are members of the same cluster), then move the cursor over to the hyperlink and click.
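The orientation step can be pictured with the sketch below, which flips a block whenever its anchor gene lies on the minus strand. The data layout (a list of `(gene_id, strand)` pairs in genomic order), the function name `orient_block`, and the choice of the plus strand as the common orientation are assumptions made for illustration.

```python
def orient_block(block, anchor_index):
    # block: list of (gene_id, strand) pairs in genomic order.
    # Returns the block and anchor position with the anchor on "+".
    if block[anchor_index][1] == "+":
        return list(block), anchor_index
    # Reverse the gene order and flip every strand, as when
    # viewing the same region from the opposite strand.
    flipped = [(gene, "+" if strand == "-" else "-")
               for gene, strand in reversed(block)]
    return flipped, len(block) - 1 - anchor_index


oriented, anchor = orient_block([("a", "+"), ("b", "-"), ("c", "-")], 1)
```

Reversing the order together with the strands keeps upstream and downstream neighbors on the correct side of the anchor.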

Functional Analysis

The functional and domain annotations (e.g., KOG, KEGG, GO, PFAM, PANTHER) that have been assigned to members of this cluster are displayed here. For each annotation type, the identifier and description are provided, as well as this annotation's phylogenetic fingerprint (i.e., how many of the genes in this cluster have been assigned this annotation, broken down by organism).

Multiple Sequence Alignment

A Multiple Sequence Alignment (MSA) has been precalculated for each gene family. You can view the MSA in this panel, as well as download a conservation-colored html file of the alignment (please use the Get Sequences tab if you want the raw clustalw output). Note that any organisms which have been hidden in the Display Options will also be excluded from the MSA, though the MSA will not be recalculated. If you want to recalculate the MSA with certain sequences excluded or modified, you should go to the Align cluster members tab and launch Jalview.

Note that MSAs are not pre-calculated for composite clusters. If a composite cluster has fewer than 75 members, the Multiple Sequence Alignment tab will be visible, and clicking on it will launch a real-time alignment. For composite clusters with more than 75 members, the MSA tab will not be accessible.

Align cluster members

This tab provides access to the Jalview Multiple Sequence Viewer. Click on "Align Member protein sequences" to load this cluster's peptide sequences into Jalview. Click on "Align Member coding sequences" to launch Jalview with the cluster's coding sequences instead. For all "reasonably" sized clusters, the Clustalw Multiple Sequence Alignment has been pre-computed, and will automatically load when Jalview is launched (protein sequences only). Otherwise, you can apply Clustalw (or MUSCLE, a similar Multiple Sequence Aligner) within Jalview. Once you have an alignment, you can build Neighbor-Joining or Maximum Likelihood phylogenetic trees. All alignments, sequences, and trees can be downloaded from Jalview in multiple formats. Please see the Analyzing Cluster Sequences section for more information.

Find related clusters

There are several methods available for finding clusters related to the current cluster by descent, sequence similarity, or functional annotation.

Clusters related by descent: Click on the View Ancestor link to be taken to the cluster summary page of the parent (immediate ancestor) of the current cluster. This cluster is guaranteed to contain all the members of the current cluster. If the current cluster is a root (i.e., most ancient) node cluster, of course, no ancestor exists. Click on the View descendants link to find the children (immediate descendants) of the current cluster. The union of these child clusters exactly reconstructs the current cluster. We currently do not include crown nodes (nodes consisting solely of a single extant organism). If you are already at a terminal (most modern) node, you won't see a link for descendant clusters. If you are interested in tracing the ancestry of only a particular subset of the members of the current cluster, select them (by checkbox) and click "Find all clusters with selected gene(s)". This will return only those clusters containing all of the selected genes.
Clusters related by sequence similarity: Each (non-composite) cluster is represented by a consensus peptide sequence, which is based on a residue-by-residue consensus constructed from the multiple sequence alignment of the cluster members' peptide sequences. One can search for clusters whose consensus sequence is similar to that of the current cluster by clicking the "BLAST for similar clusters" link. This link will load the BLAST search page with the current node selected. One can use consensus sequences from a different node as the target database by using the node selector on the BLAST page.
For composite clusters of fewer than 75 members, an MSA and consensus sequence are calculated on-the-fly, and the "BLAST for similar clusters" link functions exactly as for non-composite clusters. For composite clusters with more than 75 members, however, this link is not available.
Clusters related by functional annotation: Use the checkboxes in the Functional Analysis section of the Cluster summary to select one or more functional annotations that have been assigned to the current cluster. Click on "Find node clusters with selected annotation(s)" to find all clusters at the current node which also have been assigned all the selected annotations.
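The descent relationships described above amount to simple set relations, sketched below with clusters modelled as sets of gene identifiers. The function names, cluster ids, and gene ids are hypothetical; the sketch only illustrates the two stated properties (child clusters together reconstruct the parent, and "Find all clusters with selected gene(s)" returns clusters containing every selected gene).

```python
def children_reconstruct_parent(parent, children):
    # The union of the child clusters should equal the parent cluster.
    return set().union(*children) == set(parent)


def clusters_with_all_genes(clusters, selected):
    # clusters: mapping of cluster id -> set of member gene ids.
    # Return the ids of clusters that contain every selected gene.
    wanted = set(selected)
    return [cid for cid, genes in clusters.items() if wanted <= genes]


clusters = {
    "root_1": {"g1", "g2", "g3", "g4"},
    "child_a": {"g1", "g2"},
    "child_b": {"g3", "g4"},
}
hits = clusters_with_all_genes(clusters, ["g1", "g2"])
```

Here the selection of `g1` and `g2` matches both the ancestral cluster and the one child that holds both genes.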

Get Sequences

Use this tab to download sequences associated with a given gene family/cluster. You can download the peptide or nucleotide (CDS) sequence for each cluster member, the consensus sequence for the cluster, or the raw clustalw Multiple Sequence Alignment. Choose "View" to load the fasta sequence into a browser window, or "Download" to save it to a file. Note that any species hidden via Display Options will not be included in the "Cluster Sequences" download, though they will be included in the "Raw CLUSTALW alignment." Note that for composite clusters with more than 75 members, neither the Raw CLUSTALW alignment nor the consensus sequence is available.

Display options

Click on "columns" to select which columns are displayed in the "Genes in this cluster" section. The "Graphical Analyses" column refers to the Domain and Synteny views. The synteny color control refers to how many of the (displayed and hidden) syntenic blocks must contain members of a cluster for that cluster's members to be rendered in color (all members of the same cluster will be rendered in the same non-white color). By default this number is 2, but it can be increased or decreased by clicking "+" or "-" in the column heading.
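The synteny color control can be thought of as a threshold on how many syntenic blocks a cluster appears in. The sketch below assumes each block is given as a list of cluster ids (one per neighboring gene) and that "must contain" means "at least"; the function name `colored_clusters` is hypothetical.

```python
from collections import Counter


def colored_clusters(blocks, threshold=2):
    # blocks: one list of cluster ids per syntenic block.
    # A cluster is colored when at least `threshold` blocks contain it.
    block_counts = Counter()
    for block in blocks:
        # Count each cluster once per block, however many
        # of its members appear within that block.
        block_counts.update(set(block))
    return {cid for cid, n in block_counts.items() if n >= threshold}


blocks = [["c1", "c2", "c2"], ["c1", "c3"], ["c4"]]
colored = colored_clusters(blocks)
```

With the default threshold of 2, only `c1` is colored here; lowering the threshold to 1 would color every cluster shown.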

The Species Visible section of the Display Options allows the user to hide results from particular species. Unchecking a species' checkbox will cause information for that organism's genes to be removed from the cluster display. This affects the "Genes in this Cluster", "Functional Analysis", and "Multiple Sequence Alignment" tabs, as well as the "Cluster Sequences" download in the Get Sequences tab. If you wish to make these filter choices permanent, click on the "Save Species Setting" button.

Click-Info

The Click-Info tab displays additional information about PFAM domains and syntenic genes when they are selected (by mouse click) in the Domains or Synteny tabs. For a selected PFAM domain, the PFAM identifier and description are displayed. For a selected syntenic gene, the id and name of that gene's cluster are displayed, along with a link to the cluster summary page.

Analyzing Cluster Sequences

: Jalview is used for sequence viewing, alignment, and tree-building. When you launch Jalview (having selected one or more clusters), the protein or coding sequences of each cluster member are loaded into an alignment panel. If the set of sequences corresponds to a single cluster, the pre-computed MSA (multiple sequence alignment) is also retrieved and loaded into another alignment panel. Otherwise, you can launch a CLUSTALW or MUSCLE MSA yourself (under the "Align" menu in Jalview). Sequences are grouped by greatest pairwise similarity after an alignment. Once you have an MSA, you can build a neighbor-joining or maximum likelihood tree from the aligned sequences (under the "Tree" menu).
: You can always remove a sequence from the set by highlighting the sequence name and choosing "Edit->Delete" from the menu. If you'd like to add one or more sequences to the set, choose "Edit->Add Sequence(s)". It's important to re-align the set after you add or delete sequences.
: The Features menu allows you to visualize PFAM domains directly on the sequences in the alignment panels. Simply select "Features->PAC Protein Domains", and a list of PFAM identifiers and descriptions will appear in a panel to the right. Clicking on any one of these entries will highlight (in blue) that particular PFAM domain on all sequences in the panel.
: If you would like to save an MSA or tree, choose the "File->Save As" menu item, and specify the desired file format (Fasta, clustal, MSF, etc.). If you'd like an image or HTML page of the alignment, choose the "File->Export" menu item instead.
: More help on Jalview is available in the Jalview documentation.