Data curation : Domain Chopping (DomChop) tutorial (last updated August, 2023)

(Tutorial documented by: Dr. Natalie Dawson and Dr. Vaishali Waman; DomChop webpage created by Dr. Ian Sillitoe)

What is domain chopping?

Domain chopping (DomChopping) is the process of 'chopping' polypeptide chains from the Protein Data Bank (PDB) into one or more protein structure domains.

Logging in

• Go to http://update.cathdb.info/cgi-bin/index.pl

• Select the 'DomChop' link

• When prompted to log in, make sure you select the 'Current CATH database (production)' option from the Database list

• Select 'Continue', then select the 'DomChop' link.

General information

Please note that pages can take a few seconds to load because of database read/write processes. Please avoid clicking buttons multiple times while a page is still loading, as this can cause page errors.

Checking the literature

Please keep in mind throughout that it can be very helpful to consult the publication associated with the PDB id when deciding on a chopping. This could be, for example, to confirm a chopping or when the algorithm results do not provide a reasonable solution.

How to identify domain boundaries, AKA choosing a chopping (quick overview)

On the DomChop home page, select 'Get New Chain'. This will load a new chain for you to process.

First, load the image of the chain to get an idea of how the structure looks. To view an illustration of the chain, select the RasMol icon in the top right-hand box that contains the chain's image. If the 3D structure is very unpacked and does not have a compact, globular structure, add the comment (under the 'Comments' tab) “Unpacked chain” and move on to the next chain. If the 3D structure consists of a fragment, for example a single helix (e.g. as in 5lv6A), add the comment “Fragment” and move on to the next chain.

Next, look for a ChopClose (CC) result. If there is no CC result then this means that there is no sequence-similar chopped protein chain in CATH.

Please note that the values and scores provided below are only guidelines. For example, even if the ChopClose result has a bad SSAP score, it could still be the case that it provides an accurate chopping for your chain. Please always view the 3D structure before making a decision on which result to choose.

1. ChopClose (CC)

If there is a CC result available, first look at the superposition of the query chain with the matching chopped chain from CATH. A good superposition is typically expected if the “NW sequence identity” field is at least 30%, the SSAP score is >= 70 (preferably >= 80), and the RMSD is <= 5 Angstroms.

The red boxes in the CC result specify why this chain failed the automatic chopping process and therefore has to be manually curated. This can help explain why the CC result may not be very good.

Once you have looked at the superposition, have a look at the RasMol of the chain on its own ('Rasmol' button) to get a clearer view of the proposed chopping.

Superpositions

The CC superpositions comprise the new query chain aligned with the best-matching chain that has already been chopped in CATH. The darker colours represent the new query and the lighter colours represent the best match in CATH.

2. CATHEDRAL

This is the next result to check after CC. Any putative domain that matches a CATH domain with a SSAP score >= 70 (preferably >= 80) indicates a good match.

Superpositions

The CATHEDRAL superpositions comprise the new query chain aligned with the best-matching domains that have already been chopped in CATH. The darker colours represent the new query and the lighter colours represent the best-matching domains in CATH.

3. HMM

Any putative domain that matches a CATH domain with an E-value below 1×10^-5 represents a good match.

4. PUU, Detective, Domak

These are ab initio-based algorithms and do not produce scores. These algorithms are very useful in providing results when the query PDB chains do not have any closely-related matches in CATH. If you don't find any chopping you are happy with in the previous steps, have a look at these results. Sometimes, these three algorithms can help to confirm the above-mentioned results.
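The guideline values in steps 1 to 3 can be collected into a small helper. The following is only a minimal sketch and is not part of DomChop: the function names and the way results are passed in are hypothetical, and the thresholds are simply the guideline numbers quoted above. A result that fails these checks may still give an accurate chopping, so the 3D structure should always be viewed before deciding.

```python
# Guideline thresholds quoted in the tutorial text (guidelines, not hard rules).
MIN_SEQ_ID = 30.0        # ChopClose: NW sequence identity (%)
MIN_SSAP = 70.0          # ChopClose / CATHEDRAL: SSAP score
PREFERRED_SSAP = 80.0    # ChopClose / CATHEDRAL: preferred SSAP score
MAX_RMSD = 5.0           # ChopClose: RMSD (Angstroms)
MAX_HMM_EVALUE = 1e-5    # HMM: E-value should be below this


def chopclose_looks_good(seq_id, ssap, rmsd):
    """True if a ChopClose result meets all three guideline values."""
    return seq_id >= MIN_SEQ_ID and ssap >= MIN_SSAP and rmsd <= MAX_RMSD


def cathedral_looks_good(ssap):
    """True if a CATHEDRAL match meets the SSAP guideline."""
    return ssap >= MIN_SSAP


def hmm_looks_good(evalue):
    """True if an HMM match has an E-value below the guideline."""
    return evalue < MAX_HMM_EVALUE


def ssap_label(ssap):
    """Describe an SSAP score relative to the two guideline values."""
    if ssap >= PREFERRED_SSAP:
        return "preferred"
    if ssap >= MIN_SSAP:
        return "acceptable"
    return "below guideline"


def suggest_starting_point(cc=None, cathedral_ssap=None, hmm_evalue=None):
    """Follow the order in the text: CC, then CATHEDRAL, then HMM,
    then the ab initio methods. `cc` is a hypothetical
    (seq_id, ssap, rmsd) tuple; None means no result is available."""
    if cc is not None and chopclose_looks_good(*cc):
        return "ChopClose"
    if cathedral_ssap is not None and cathedral_looks_good(cathedral_ssap):
        return "CATHEDRAL"
    if hmm_evalue is not None and hmm_looks_good(hmm_evalue):
        return "HMM"
    return "PUU/Detective/Domak"
```

The returned label only indicates which result to inspect first; it does not replace looking at the superposition and the chain itself.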

Submitting a chopping to the curator for review

If you are completely satisfied with a chopping proposed by one of the above algorithms, please select the 'Send for review' button next to the appropriate chopping. This chain will then be sent to the curator for reviewing.

For example, if the ChopClose algorithm provides the chopping you wish to select, select 'Send for review' next to the ChopClose result and enter the appropriate comment.

Please see the 'Manual adjustment of choppings' text below if you are not completely happy with any of the proposed domain boundaries provided.

Manual adjustment of choppings

It may be necessary at times to manually adjust proposed domain boundaries, for example when a domain boundary is defined so that it splits a secondary structure element (e.g. a beta strand or an alpha helix) in two. In such cases, choose the chopping that most closely represents your solution and select the 'Inherit Chopping' button (top right-hand corner).
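As an illustration of the check described above, the sketch below tests whether a proposed boundary would cut a secondary structure element in two. It is not a DomChop feature: the function name, the tuple layout and the residue ranges are all hypothetical, and residue numbers are assumed to be plain integers (PDB insertion codes are ignored).

```python
def split_elements(boundary, elements):
    """Return the secondary structure elements cut in two by a boundary.

    `boundary` is the last residue of one domain segment; the next
    segment is assumed to start at boundary + 1.
    `elements` is a list of (kind, start, end) tuples with inclusive
    residue ranges.
    """
    return [
        (kind, start, end)
        for kind, start, end in elements
        # The element is split if it has residues on both sides of the cut.
        if start <= boundary < end
    ]


# Hypothetical secondary structure assignment for an example chain.
elements = [("helix", 10, 25), ("strand", 30, 36)]

print(split_elements(20, elements))  # boundary inside the helix
print(split_elements(27, elements))  # boundary in the loop between elements
```

A boundary that returns an empty list falls in a loop or exactly at the end of an element, which is usually the better place to move it to.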

Scroll down to the bottom of the web page and you will be provided with the tools to edit the domain boundaries.

For example, for one of the proposed solutions for the chain 5ja2A (http://update.cathdb.info/cgi-bin/DomChop.pl?chain_id=4uj8B; please note that you need your assigned login and password to access this page)

Once you have edited the domain boundaries as necessary, refresh the chopping using the blue button. Then check the new chopping using RasMol, PyMOL or the 3D view.

When you are happy with the chopping, it needs to be submitted for review so that the curator can check it. Please add a comment describing the algorithm that the edited chopping was based upon and why the chopping was edited. Also, select the difficulty of the chopping using the 'easy', 'medium' or 'hard' buttons.

When you are ready to submit the chopping for review, select the 'Submit Chopping/Send To Review' button.

Some (25) examples of chopped chains that have undergone manual curation

Substituting these chain ids into the following URL will load the relevant web page, which shows examples of chopped chains that have been reviewed by the CATH curator. Select the 'Chopping' tab and then load the RasMol for the 'Chopped' result. This page will also inform you of the reasons behind the chosen result.

http://update.cathdb.info/cgi-bin/DomChop.pl?chain_id=4uj8B

  1. 4uj8B
  2. 4y25A
  3. 4znoB
  4. 5a57A
  5. 5a8jA
  6. 5aoqA
  7. 5axgA
  8. 5b04I
  9. 5c0xK
  10. 5c14A
  11. 5c1fA
  12. 5c1sA
  13. 5c22C
  14. 5c2wD
  15. 5c4nD
  16. 5c6tA
  17. 5cwwB
  18. 5cylF
  19. 5cyxA
  20. 5cz3A
  21. 5dcpA
  22. 5dcqF
  23. 5dqrA
  24. 5du3A
  25. 5fx0A
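The URL substitution described above can be scripted. This is a minimal sketch that only builds the URLs from the pattern shown earlier; the function name is hypothetical, only the first three chain ids from the list are included for brevity, and the pages still require your assigned login and password.

```python
from urllib.parse import urlencode

# Base address taken from the example URL in this tutorial.
BASE_URL = "http://update.cathdb.info/cgi-bin/DomChop.pl"


def domchop_url(chain_id):
    """Build the DomChop page URL for a chain id such as '4uj8B'."""
    return f"{BASE_URL}?{urlencode({'chain_id': chain_id})}"


# First three chain ids from the list above; extend as needed.
example_chains = ["4uj8B", "4y25A", "4znoB"]

for chain_id in example_chains:
    print(domchop_url(chain_id))
```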

Chopping summary acronyms

These give a general summary of a given chopping and ideally should be included at the beginning of a comment in a comma-separated list within square brackets (e.g. [WCD, AMA])

WCD [Whole Chain Domain]

This is a chain consisting of a single domain, with no fragment regions.

AMA [All Methods Agree]

This means that all the automatic chopping methods agree on the chopping.

ASMA [All Scanning Methods Agree]

This means that all the methods that involve scans against known structures agree on the chopping assignment: ChopClose, HMM and CATHEDRAL.

SVT [Structurally Variable Tail]

This means that one or both of the ends consist of random coils which vary. Usually this is due to NMR structures having a number of solutions that differ significantly in these regions (see Chain in PDB Rasmol link).
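The comment convention for the summary acronyms (a comma-separated list in square brackets at the start of the comment) can be sketched as follows. The helper name and the example comment text are hypothetical; the acronym table simply repeats the four summary acronyms defined in this section.

```python
# The four chopping summary acronyms defined in this section.
SUMMARY_ACRONYMS = {
    "WCD": "Whole Chain Domain",
    "AMA": "All Methods Agree",
    "ASMA": "All Scanning Methods Agree",
    "SVT": "Structurally Variable Tail",
}


def tagged_comment(acronyms, text):
    """Prefix a comment with summary acronyms, e.g. '[WCD, AMA] ...'."""
    unknown = [a for a in acronyms if a not in SUMMARY_ACRONYMS]
    if unknown:
        raise ValueError(f"Unknown summary acronym(s): {', '.join(unknown)}")
    return f"[{', '.join(acronyms)}] {text}"


print(tagged_comment(["WCD", "AMA"], "Single compact domain."))
```

Rejecting unknown acronyms is a design choice of this sketch to catch typos; the tutorial itself does not state that other tags are disallowed.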

Other Chopping Acronyms

CC [ChopClose]

This refers to the ChopClose algorithm.

CIP [Chain In PDB]

This refers to the Chain in PDB RasMol link.

COAF [Chopped Off As Fragment]

This refers to regions that might be chopped off as a fragment (rather than included as part of a domain). Example: “This might be COAF, but I don't think it should be”

MCD [Multi-chain Domain]

A domain composed of segments from multiple different chains. CATH cannot handle such domains so the best we can do is to chop into the different parts of the domain in the different chains.

IDS [Incomplete Domain Structure]

This refers to a domain which is a clear structural unit but that is missing many of its residues, as a result of the experimental method. Such domains are typically not integrated into CATH to avoid the build-up of fragmented structures, which may in turn affect future HMM quality, for example.

If in doubt…

If you are still unsure how to chop the protein chain, please add the PDB id to a list of cases to go through with me (Vaishali; email: v.waman@ucl.ac.uk) and move on to the next one.

Happy domain chopping !!!!