An Introduction to CATH And Gene3D
A Brief Introduction
At the heart of the system is the CATH classification of protein domains, derived from a combination of semi-automatic processing and manual curation of high-resolution 3D structures in the wwPDB. Protein domains are identified in these structures and compared to detect homology relationships and other structural similarities. The resulting hierarchy can be browsed, and its relationships studied, through the website. CATH also provides a set of tools for general structural comparison.
The CATH superfamilies are then extended to the major protein sequence repositories in three steps: modelling sequence variation within domain superfamilies, searching with the hidden Markov model software HMMER3, and resolving potential matches into a unified multi-domain architecture (referred to as an ‘MDA’) with an in-house algorithm called DomainFinder. These predicted sequence domains are presented as the Gene3D resource. Gene3D also merges in many different sources of protein function annotation, ranging from pathway data to active sites, and presents these through a web interface with complex querying abilities.
For more details on the construction of these resources, we recommend reading the latest NAR papers and the documentation on the sites. We are also happy to answer any direct questions about the data (email@example.com, firstname.lastname@example.org).
Welcome to Gene3D
Fusing structural annotation with genomes and functions. In this guide you can learn a few things about the types of data in Gene3D, how you can retrieve sets of interest, and what tools are built into the website. There are several ways of beginning your investigation, depending on whether you are interested in particular proteins, superfamilies or genomes, so feel free to jump to the section that best describes what you wish to do and start there.
Querying a protein or gene name at Gene3D
Gene3D can be queried with most recognised identifiers (e.g. UniProt IDs), along with any gene names provided by these resources. If your query returns more than one sequence, you can choose the appropriate one from the lists provided. Here we want to find out about VAV1 in human. Enter ‘VAV1’ in the proteins search, type 'human' in the taxon filter box (to restrict the results to VAV1 proteins in human) and click 'get proteins' to retrieve the proteins. Direct link to Results.
Looking through the list you will find two distinct records for the search. This is because Gene3D merges resources at the sequence level, so slightly differing sequences for the same protein are treated as distinct entries. However, by clicking the 'Get more functional annotation' button we can see that only one of the sequences is found in the Ensembl human genome assembly.
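Merging at the sequence level can be pictured as grouping records by their exact sequence. The sketch below only illustrates that idea and is not Gene3D's own pipeline: the function name, the source identifiers and the short sequences are all invented for the example.

```python
from collections import defaultdict


def group_by_sequence(records):
    """Group (identifier, sequence) pairs so identical sequences share one entry."""
    groups = defaultdict(list)
    for identifier, sequence in records:
        # Normalise case and whitespace so formatting differences do not split a group.
        key = "".join(sequence.split()).upper()
        groups[key].append(identifier)
    return dict(groups)


# Hypothetical records: two sources agree on the sequence, a third differs by one residue.
records = [
    ("sourceA:VAV1", "MELWRQCTHW"),
    ("sourceB:VAV1", "melwrqcthw"),
    ("sourceC:VAV1", "MELWRQCTHV"),
]

merged = group_by_sequence(records)
for sequence, identifiers in merged.items():
    print(sequence, identifiers)
```

Under this scheme the first two records collapse into one entry, while the third stays separate because its sequence differs, which is why a single gene name can return more than one record.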
The Single Protein View
Clicking on the 'Get protein' link for the VAV1 protein that is in Ensembl, we get a detailed summary view for this protein. Direct link to Results.
The first tab has a summary page of annotations for the protein. The second tab, ‘Sequence Features’, shows the predicted CATH domains along with sequence annotation from other resources, including other domain databases and UniProt sequence annotation (e.g. active sites).
'Mouse Over for More'
Clicking on domain images will reveal extra functional information and link-outs for a domain.
By looking around the various tabs and the funfam assignments, you should be able to find annotations from GO and KEGG on the role of VAV1 in the cell and its molecular function. We can also inspect the functions of its interactors to help establish the roles of this protein in the cell.
The Protein Collection View
In the Sequence Features tab for VAV1, click on the link 'Click here for Proteins with similar CATH arrangements' to retrieve other proteins with a similar domain organisation. This page also gives a summary of GO annotations, and the associated evidence, for all proteins with this domain organisation. You can then retrieve the sequences from the organism of interest, for example Homo sapiens. Direct link to Results. This displays a protein collection page of multiple proteins; further annotation can be obtained from the drop-down menu.
The Superfamily summary
We can find a summary of a superfamily by searching from the “Get superfamily summary” tab on the front page. For example, searching for 184.108.40.206, we can see information on functions, domain partners, genome distributions etc. Direct link to Results. If we click on the Domain organisation tab we can see the different domain combinations and the organisms they are found in.
For example, clicking on the “number of viruses”, we can see that this domain is found alongside other domains in certain viruses.
The Genome summary
We can find a summary of a genome by searching from the “Get genome summary” tab on the front page. For example, searching for taxon id 4932, we can see information on the superfamilies, funfams, domain organisations etc. of a genome. Direct link to Results. From each of these pages it is possible to retrieve individual protein sets.
The Genome comparison page
We can compare two genomes by searching from the “Compare Genomes” tab on the front page. For example, let's compare the human pathogen Plasmodium vivax with the more lethal species Plasmodium falciparum. Direct link to Results. We can click on the individual tabs to see superfamilies, funfams and domain organisations compared between the two genomes by the number of proteins in each species. For example, on the funfams tab we can see that the “Rifin-like domain” is found in several sequences in P. falciparum and is absent from P. vivax. The corresponding proteins can be retrieved for either genome on any of the tabs.
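The comparison by protein counts described above can be sketched in a few lines. This is a minimal illustration, not the code behind the website: the function name is hypothetical, and the family labels and counts are made-up placeholders rather than real Plasmodium data.

```python
from collections import Counter


def compare_counts(genome_a, genome_b):
    """Return (family, count_a, count_b) rows, largest count difference first."""
    counts_a = Counter(genome_a)
    counts_b = Counter(genome_b)
    families = set(counts_a) | set(counts_b)
    # A Counter returns 0 for a missing key, so absent families show up as zero.
    rows = [(family, counts_a[family], counts_b[family]) for family in families]
    rows.sort(key=lambda row: (-abs(row[1] - row[2]), row[0]))
    return rows


# Hypothetical input: one family label per annotated protein in each genome.
genome_a = ["family_1"] * 3 + ["family_2"] * 2
genome_b = ["family_2"] * 2 + ["family_3"]

rows = compare_counts(genome_a, genome_b)
for family, count_a, count_b in rows:
    print(f"{family}\t{count_a}\t{count_b}")

# Families present in the first genome but absent from the second.
only_in_a = [family for family, count_a, count_b in rows if count_a > 0 and count_b == 0]
```

Sorting by the absolute difference puts the families that distinguish the two genomes most strongly at the top, which mirrors how a family found in one species and absent from the other stands out on the comparison tabs.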
Finding Domains in Sequences
Gene3D also provides sequence searching facilities. This service also incorporates disordered region prediction and Eukaryotic Linear Motif prediction (ELM).
An example sequence is provided by clicking on the 'Example' link. However, VAV1 provides an interesting case in itself.
Enter this sequence in the search box and hit the green 'Scan Sequence' button.
The main track is the top one, displaying the resolved MDA (the coloured blobs) and all the matches from the various HMM profiles (dotted brackets). Matches from the same superfamily are the same colour, and you can find the E-value of a match by mousing over it. Hopefully this image demonstrates two things: (1) the complexity involved in precisely defining domain boundaries, and (2) the robustness of DomainFinder3, the in-house algorithm for match selection (paper under review).
Feel free to try your own sequence.
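To give a feel for what resolving overlapping profile matches into one architecture involves, here is a deliberately simple greedy sketch. It is not DomainFinder3, whose method is not described in this guide: it just keeps the lowest E-value matches that do not overlap one another. The `Match` type, the `max_overlap` parameter, the superfamily labels and all coordinates are assumptions made for the example.

```python
from typing import NamedTuple


class Match(NamedTuple):
    superfamily: str
    start: int   # first residue of the match, inclusive
    end: int     # last residue of the match, inclusive
    evalue: float


def overlap(a, b):
    """Number of residues shared by two matches (0 if they do not overlap)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def resolve_matches(matches, max_overlap=0):
    """Greedily keep the best-scoring matches that overlap by at most max_overlap residues."""
    chosen = []
    for match in sorted(matches, key=lambda m: m.evalue):
        if all(overlap(match, kept) <= max_overlap for kept in chosen):
            chosen.append(match)
    # Report the architecture in sequence order.
    return sorted(chosen, key=lambda m: m.start)


# Hypothetical matches from several HMM profiles against one sequence.
matches = [
    Match("sf_A", 1, 120, 1e-30),
    Match("sf_A", 10, 130, 1e-12),
    Match("sf_B", 125, 200, 1e-20),
    Match("sf_C", 190, 260, 1e-5),
]

mda = resolve_matches(matches)
print([(m.superfamily, m.start, m.end) for m in mda])
```

A greedy pass like this is easy to follow but can discard a combination of weaker matches that together would cover the sequence better, which is one reason precise boundary definition is harder than it first appears.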