An atlas of the thioredoxin fold class reveals the complexity of function-enabling adaptations

Atkinson, HJ, and Babbitt, PC. "An atlas of the thioredoxin fold class reveals the complexity of function-enabling adaptations." 2009, in preparation.

Background
The thioredoxin (Trx) fold class is large and diverse. Assessing the variation in the catalytic machinery of Trx fold proteins is essential to providing a foundation for understanding their functional diversity and for predicting the functions of the class's many uncharacterized members.

Methodology/Principal Findings
The proteins of the Trx fold class retain common features, including variations on a dithiol CxxC active site motif, that lead to delivery of function. We use protein similarity networks to guide an analysis of how structural and sequence motifs track with catalytic function and taxonomic categories for 4,082 representative sequences spanning the known superfamilies of the Trx fold. Domain structure in the fold class is varied and modular, with 2.8% of sequences containing more than one Trx fold domain. Most member proteins are bacterial. The fold class exhibits many modifications to the CxxC active site motif: only 56.8% of proteins have both cysteines, and no functional groupings have absolute conservation of the expected catalytic motif. Only a small fraction of Trx fold sequences have been functionally characterized.

Conclusions & significance
This work provides a global view of the complex distribution of domains and catalytic machinery throughout the fold class, showing that each superfamily contains remnants of the CxxC active site. The unifying context provided by this work can guide the comparison of members of different Trx fold superfamilies to gain insight about their structure-function relationships, illustrated here with the thioredoxins and peroxiredoxins.

Available files:

  1. Supplementary figures referenced in the text
  2. Supplementary tables referenced in the text
  3. Datafiles generated in the analysis: sequence files, trees, protein similarity networks, etc
    1. Dataset files -- sequences and IDs
    2. Networks and tree
      1. Structure-based networks and tree [Fig 3]
      2. Sequence-based networks [Fig 4+]
    3. Prediction of CxxC active sites
      1. Definition of CxxC motif for each superfamily in terms of PFAM HMMs
  4. Tutorials on sequence similarity networks
    1. How to view SSN files
    2. Basic Cytoscape tutorial
    3. Movie illustrating how network topology changes with threshold

1. Supplementary figures referenced in the text. All files are in Portable Document Format (PDF)
File Description
SF1_strucNets_minority.pdf
Fig. S1.
A structure-based similarity network describes a map of the Trx fold class: colored by minority Thioredoxin-like Clan families

A. Structure-similarity network, containing 159 structures that are a maximum of 60% identical (by sequence) that span the Trx fold class. Similarity is defined by FAST scores better than a score of 4.5; edges at this limiting score represent alignments with a median of 2.75 RMSD across 72 aligned positions. Each node is colored by a PFAM Thioredoxin-like Clan family if the chain sequence is a member of that family. Nodes with thick red borders and bold labels denote chains present in the hierarchical clustering tree in D. Labels like 1ON4_A denote PDB ID 1ON4, chain A. B. Structure similarity network containing the same structures as in A, shown at the more stringent threshold of 7.5. Edges at this limiting score correspond to alignments with a median of 2.45 RMSD across 89 aligned positions. Nodes are colored as in A. C. Structure similarity network containing the 105 structures from the large connected cluster in B, displayed at a FAST score cutoff of 12.0; edges at this limiting score represent alignments with a median of 2.21 RMSD across 102 aligned positions. Nodes are colored as in A. D. Complete linkage hierarchical clustering tree based on pairwise FAST scores for 15 representative structures singled out in the networks in A-C, with PDB IDs in bold, and associated SWISSPROT sequence IDs in plain text.

SF2_TrxClan_AnnotBySP.pdf
Fig. S2.
A sequence similarity network shows how each Trx fold superfamily is distributed (colored by SwissProt classification)
[From UniProtKB/Swiss-Prot family/domain classification: http://ca.expasy.org/cgi-bin/get-similar?all=all]

Sequence similarity network, containing 4,082 representative sequences that are a maximum of 40% identical that span the Trx fold class. Similarity is defined by pairwise BLAST alignments better than an E-value of 1x10^-12; edges at this threshold represent alignments with a median 30% identity over 120 residues, while the rest of the edges represent better alignments. Each node is colored by the sequence's SWISSPROT family classification, if available; sequences that are not classified in SWISSPROT are colored grey. Large nodes represent sequences that are at least 40% identical to the 159 structures in Fig. 3. The sequences associated with the 15 representative structures in Fig. 3C are labeled using bold text and white arrows. The general locations of other sequences representing different superfamilies are noted using italicized text.

SF3_TrxClan_AnnotByDomOrder.pdf
Fig. S3.
Many Trx domains occur in combination with other Trx domains

A. Sequence similarity network, containing 4,082 representative sequences that are a maximum of 40% identical that span the Trx fold class. Similarity is defined by pairwise BLAST alignments better than an E-value of 1x10^-12; edges at this threshold represent alignments with a median 30% identity over 120 residues, while the rest of the edges represent better alignments. Nodes are colored by the number of PFAM Thioredoxin-like Clan family domains occurring within the sequence; with the exception of H. influenzae Prx 5 -- labeled (iii) -- and the monothiol glutaredoxins -- labeled (ii) -- these domains are typically duplications of the same domain, as in the PDI-type enzymes (iv), which can contain two to four thioredoxin domains, or the few DSBA-like enzymes (i), which contain up to three DSBA-like domains. Large nodes represent sequences that are at least 40% identical to the 159 structures in Fig. 3. The sequences associated with the 15 representative structures in Fig. 3C are labeled using bold text and white arrows. The locations of other sequences representing different superfamilies are noted using italicized text. B. Domain structures for example sequences from the groups labeled (i)-(iv); some domains are shorter than expected, which is denoted by a gradient that fades to white. The sequences are identified by their UNIPROT sequence IDs.

SF4_TrxFold_byPFAM_tally.tsv.R.pdf
Fig. S4.
The relative populations of the Trx fold superfamilies vary

A. 4,082 representative sequences that are a maximum of 40% identical and span the Trx fold class, binned according to their membership in PFAM families within the Thioredoxin-like Clan. B. All 29,206 sequences in the Trx fold class.

SF5_strucNets_withSeqNet.pdf
Fig. S5.
There is good correspondence between the structure and sequence-based Trx fold class networks

The three views of the structure-based network from Fig. 3 are repeated in A-C, and panel D contains a sequence-based network derived from the amino acid sequences in the 159 structure chains. A. Structure similarity network, containing 159 structures that are a maximum of 60% identical (by sequence) that span the Trx fold class. Similarity is defined by FAST scores better than a score of 4.5; edges at this threshold represent alignments with a median of 2.75 RMSD across 72 aligned positions, while the rest of the edges represent better alignments. Each node is colored by a PFAM Thioredoxin-like Clan family if the chain sequence is a member. Nodes with thick white borders and bold labels denote chains present in the hierarchical clustering tree in Fig. 3D. Labels like 1ON4_A denote PDB ID 1ON4, chain A. B. Structure similarity network containing the same structures as in A, shown at the more stringent threshold of 7.5. Edges at this threshold correspond to alignments with a median of 2.45 RMSD across 89 aligned positions. Nodes are colored as in A. C. Structure similarity network containing the 105 structures from the large connected cluster in B, displayed at a FAST score cutoff of 12.0; edges at this threshold represent alignments with a median of 2.21 RMSD across 102 aligned positions. Nodes are colored as in A. D. Sequence similarity network, containing 159 chain sequences from A-C. Similarity is defined by pairwise BLAST alignments better than an E-value of 1x10^-5; edges at this threshold represent alignments with a median 27% identity over 84 residues, while the rest of the edges represent better alignments.

SF6_TrxClan_AnnotByTax.pdf
Fig. S6.
Use of some members of the Trx fold class is restricted to taxonomic subsets

Here, the sequence similarity network from Fig. 4, containing 4,082 sequences, is colored by the species kingdom (Metazoa, Fungi, Viridiplantae) or superkingdom (Bacteria, Eukaryota, Archaea). Note that Eukaryota includes all eukaryotic species without a more specific kingdom, and is primarily associated with protozoan parasites. Large nodes represent sequences that are associated with the structures from Fig. 3. Blue letter labels correspond to sequence groups in Fig. 5.



2. Supplementary tables referenced in the text. All files are in Portable Document Format (PDF)
File Description
ST_1-3.pdf

Table S1.
Number of unique structures in each Thioredoxin-like Clan family

Table S2.
Number of sequences in each Thioredoxin-like Clan family

Table S3.
Network edges from Fig. 4 due to sequence similarity outside of the domain of interest


3. Datafiles generated in the analysis: sequence files, trees, protein similarity networks, etc

  1. Dataset files -- sequences and IDs
  2. Networks and tree
    1. Structure-based networks and tree [Fig 3]
    2. Sequence-based networks [Fig 4+]
  3. Prediction of CxxC active sites -- motif definitions

1. Dataset files
File Description
tab-separated file:
SIMILARITY_classes.txt
List of the 20 Trx-fold-relevant SwissProt superfamilies, example sequence IDs, and counts for each class that contributed to the total sequence set.
tab-separated file:
SwissProtPfamTrxClan.ids.tsv
All 29,206 Trx fold class sequences:
Column: Description
AC: UniProt accession
name: UniProt/SwissProt/TrEMBL ID
DE: UniProt definition line
sequence: protein sequence
seq_length: sequence length
species: species
taxID: NCBI taxonomy ID
struc_ct: count of structures associated with sequence
has_struc: yes if sequence is associated with a structure
dbsource: sequence is in SwissProt (sp) or TrEMBL (tr) database
SPFamily: SwissProt classification from SIMILARITY line, if it exists
strucIDs: list of PDB IDs associated with this sequence
sequences:
SwissProtPfamTrxClan40nrVips60gr.fa
4,082 Trx-fold sequences: the above sequences filtered to a max of 40% identity and a minimum length of 60 amino acids
sequences:
SwissProtPfamTrxClanOnly.pdb.fa
563 sequences extracted from Trx-fold structures:
  1. Start with 29,206 Trx fold class sequences
  2. Collect all PDB IDs associated with these sequences that are associated with X-ray or NMR structures
  3. For each PDB ID, a structure is counted once for each unique chain sequence containing a Trx fold
  4. Extract sequences from the PDB SEQRES records for 563 chains
sequences:
SwissProtPfamTrxClanOnly60nr.pdb.fa
159 sequences extracted from Trx-fold structures (chains):
The above sequences, filtered to a maximum of 60% identity
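
The column layout listed above for SwissProtPfamTrxClan.ids.tsv lends itself to a quick tally. The following is a minimal sketch, assuming the file carries a header row with the documented column names; the two sample rows are invented for illustration and are not taken from the dataset, and `summarize` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

# Hypothetical two-row sample using a subset of the documented columns.
# Assumption: the real file has a header row with these column names.
SAMPLE = (
    "AC\tname\tseq_length\thas_struc\tdbsource\n"
    "X00001\tEXAMPLE1_TEST\t108\tyes\tsp\n"
    "X00002\tEXAMPLE2_TEST\t95\tno\ttr\n"
)


def summarize(handle):
    """Tally rows by source database and count rows flagged as having a structure."""
    by_source = Counter()
    with_structure = 0
    for row in csv.DictReader(handle, delimiter="\t"):
        by_source[row["dbsource"]] += 1
        if row["has_struc"] == "yes":
            with_structure += 1
    return by_source, with_structure


by_source, with_structure = summarize(io.StringIO(SAMPLE))
print(dict(by_source), with_structure)
```

To run it on the downloaded file, pass an open file handle instead of the `io.StringIO` sample.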



2.1 Structure-based networks and tree

File Description
network: SwissProtPfamTrxClanOnly60nr.pdb.tsv. ids_4.5.fc.xgmml
[3M]
Fig. 3A: structure similarity network, FAST score cutoff of 4.5

Structure-similarity network, containing 159 structures that are a maximum of 60% identical (by sequence) that span the Trx fold class. Similarity is defined by FAST scores better than a score of 4.5; edges at this threshold represent alignments with a median of 2.75 RMSD across 72 aligned positions, while the rest of the edges represent better alignments.
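
The effect of a score cutoff on network topology can be illustrated with a toy example. The sketch below is not the authors' pipeline: the node names and pairwise scores are invented, and treating "better than" as "greater than or equal to" is an assumption. It keeps only the pairs that meet a cutoff and reports the connected clusters that remain at the three cutoffs used for Fig. 3A-C.

```python
from collections import defaultdict


def threshold_network(scores, cutoff):
    """Return the pairs whose score meets the cutoff (assumed: score >= cutoff)."""
    return [pair for pair, score in scores.items() if score >= cutoff]


def connected_components(nodes, edges):
    """Group nodes into connected clusters with a depth-first search."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            current = stack.pop()
            component.append(current)
            for other in neighbors[current]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        components.append(sorted(component))
    return components


# Hypothetical pairwise similarity scores; higher means more similar.
NODES = ["A", "B", "C", "D"]
SCORES = {("A", "B"): 13.0, ("B", "C"): 8.0, ("C", "D"): 5.0, ("A", "D"): 3.0}

clusters_by_cutoff = {
    cutoff: connected_components(NODES, threshold_network(SCORES, cutoff))
    for cutoff in (4.5, 7.5, 12.0)
}
for cutoff, clusters in clusters_by_cutoff.items():
    print(cutoff, clusters)
```

In this toy data the single cluster at 4.5 splits as the cutoff rises, which is the same qualitative behavior the three network files are meant to show.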

Load in Cytoscape using File: Import network
Attribute groups:
  • PDB chain
  • Associated SwissProt sequence info
  • Annotation of PDB chain sequence: PFAM Trx Clan HMMs

Attribute: Description

    PDB chain
      ID: PDB ID_Chain ID
      pdb_chainID: PDB ID_Chain ID
      pdbID: PDB ID
      comment: PDB structure "title"
      exp_method: NMR or X-ray
      resolution: if X-ray, resolution; 0.0 if NMR
      year: date of deposition in the PDB
      het_name: list of non-standard residues (often ligands)
      chain_seq: amino acid sequence of PDB chain
      chain_seq_length: length of 'chain_seq'
      chain_seq_range: sequence indices corresponding to SwissProt sequence
      chains: list of chain IDs found in entire PDB structure
      inClustTree: yes if in clustering tree in Fig. 3D

    Associated SwissProt sequence info
      sp_ID: associated SwissProt sequence ID
      chain_seq_range: SwissProt sequence indices indicating coverage by chain sequence
      AC: UniProt accession
      DE: UniProt definition line
      species: species
      taxID: NCBI taxonomy ID
      SPFamily: SwissProt classification from SIMILARITY line, if it exists
      SPSequence: full-length SwissProt sequence associated with sp_ID
      dbsource: sequence is in SwissProt (sp) or TrEMBL (tr) database
      seq_length: SwissProt sequence length
      strucIDs: colon-separated list of PDB IDs associated with this SwissProt ID in the PDB database
      struc_ct: number of PDB IDs associated with this SwissProt ID in the PDB database

    Annotation of PDB chain sequence: PFAM Trx Clan HMMs
      1stModel: best hit to PFAM Trx Clan member, if aligned better than gathering threshold; Fig. 3A-C coloring
      1stLEval: -log10(E-value) [1stModel]
      domain_len: length of 'domain_seq'
      domain_seq: sequence region aligning to '1stModel' HMM
      seq_st: start index in ID:sequence for 'domain_seq'
      seq_end: end index

    network: SwissProtPfamTrxClanOnly60nr.pdb.tsv. ids_7.5.fc.xgmml
    [1M]
    Fig. 3B: structure similarity network, FAST score cutoff of 7.5

    Structure-similarity network, containing 159 structures that are a maximum of 60% identical (by sequence) that span the Trx fold class. ... same structures as in A, shown at the more stringent threshold of 7.5. Edges at this threshold correspond to alignments with a median of 2.45 RMSD across 89 aligned positions.

    *See description of Fig. 3A network above for xgmml file structure attributes

    network: fast7.5_subset.ids_12.0.fc.xgmml
    [686K]
    Fig. 3C: structure similarity network, FAST score cutoff of 12.0

    Structure-similarity network containing the 105 structures from the large connected cluster in B, displayed at a FAST score cutoff of 12.0; edges at this threshold represent alignments with a median of 2.21 RMSD across 102 aligned positions.

    *See description of Fig. 3A network above for xgmml file structure attributes

    tree: repr_strucs2.ids.fast.m.cLink0-30.tre
    [4K]
    Fig. 3D: hierarchical clustering tree

    Complete linkage hierarchical clustering tree based on pairwise FAST scores for 15 representative structures singled out in the networks in A-C
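
For readers unfamiliar with complete linkage, the following is a minimal pure-Python sketch of the idea applied directly to similarity scores. It is an illustration only: the labels and scores are invented, the function name is hypothetical, and the clustering software and any score transformation actually used to build the tree are not stated here.

```python
from itertools import combinations


def complete_linkage(labels, score):
    """Agglomerative clustering on similarity scores (higher = more similar).

    Under complete linkage, the similarity between two clusters is the
    LOWEST score between any pair of their members. At each step the two
    clusters with the highest such similarity are merged.
    Returns the tree as nested tuples.
    """
    # Each cluster is stored as (subtree, list of member labels).
    clusters = [(label, [label]) for label in labels]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            link = min(score(a, b) for a in clusters[i][1] for b in clusters[j][1])
            if best is None or link > best[0]:
                best = (link, i, j)
        _, i, j = best
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical symmetric similarity scores for four structures.
TOY = {
    frozenset(("A", "B")): 12.0,
    frozenset(("A", "C")): 5.0,
    frozenset(("B", "C")): 6.0,
    frozenset(("A", "D")): 2.0,
    frozenset(("B", "D")): 3.0,
    frozenset(("C", "D")): 9.0,
}

tree = complete_linkage(["A", "B", "C", "D"], lambda a, b: TOY[frozenset((a, b))])
print(tree)
```

Because the weakest pairwise score decides each merge, complete linkage tends to produce compact clusters whose members are all mutually similar.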


    2.2 Sequence-based networks

    File Description
    zipped network: SwissProtPfamTrxClan40nrVips60gr.fa _1e-12.fc.xgmml.zip
    [4M zipped; 60M unzipped]
    Fig. 4, 5, 6, S2, S3, S6: sequence similarity network, BLAST E-value cutoff of 1x10^-12

    Sequence-similarity network, containing 4,082 sequences that are a maximum of 40% identical that span the Trx fold class. Similarity is defined by pairwise BLAST alignments better than an E-value of 1x10^-12; edges at this threshold represent alignments with a median 30% identity over 120 residues, while the rest of the edges represent better alignments.

    1. Unzip file using gunzip (or double-click)

    2. Load in xgmml file in Cytoscape using File: Import network

    Note: This file is large (60M unzipped) and can take several minutes or more to load.
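
The network can also be inspected outside Cytoscape. The sketch below is a hedged example that assumes only that XGMML is XML built from `node`, `edge`, and `att` elements; the toy network and the in-memory zip archive are invented so the example runs without the real download, and `read_xgmml` is a hypothetical helper name.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def local(tag):
    """Drop any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def read_xgmml(stream):
    """Return ({node_id: {attribute: value}}, [(source, target), ...])."""
    nodes, edges = {}, []
    # iterparse streams the file, which helps with a 60M network.
    for _, elem in ET.iterparse(stream, events=("end",)):
        kind = local(elem.tag)
        if kind == "node":
            nodes[elem.get("id")] = {
                att.get("name"): att.get("value")
                for att in elem
                if local(att.tag) == "att"
            }
            elem.clear()  # free memory for nodes already processed
        elif kind == "edge":
            edges.append((elem.get("source"), elem.get("target")))
            elem.clear()
    return nodes, edges


# Invented toy network standing in for the real file.
SAMPLE = b"""<?xml version="1.0"?>
<graph label="toy" xmlns="http://www.cs.rpi.edu/XGMML">
  <node id="1" label="n1"><att name="catType" value="CxxC" type="string"/></node>
  <node id="2" label="n2"><att name="catType" value="other" type="string"/></node>
  <edge source="1" target="2" label="n1 (pp) n2"/>
</graph>"""

# Build an in-memory zip archive so the sketch is self-contained.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("toy.xgmml", SAMPLE)

with zipfile.ZipFile(archive) as zf:
    with zf.open(zf.namelist()[0]) as stream:
        nodes, edges = read_xgmml(stream)

print(len(nodes), "nodes;", len(edges), "edges")
```

For the real download, replace the in-memory archive with the path to the zipped network file.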
    Attribute groups:
    1. General
    2. Domains and PFAM Trx Clan HMMs
    3. Structures
    4. Sequence motifs
    5. Taxonomic

    Attribute: Description

    General
      ID: UniProt:SwissProt/TrEMBL ID
      name: UniProt:SwissProt/TrEMBL ID
      AC: UniProt accession
      DE: UniProt definition line
      SPFamily: SwissProt classification from SIMILARITY line, if it exists; Fig. S2 coloring
      dbsource: sequence is in SwissProt (sp) or TrEMBL (tr) database
      sequence: full-length sequence
      seq_length: sequence length

    Domains and PFAM Trx Clan HMMs
      1stModel: best hit to PFAM Trx Clan member, if aligned better than gathering threshold; Fig. 4 coloring
      1stLEval: -log10(E-value) [1stModel]
      domain_seq: sequence region aligning to '1stModel' HMM
      domain_len: length of 'domain_seq'
      seq_st: start index in ID:sequence for 'domain_seq'
      seq_end: end index
      2ndModel: second-best hit to PFAM Trx Clan member, if aligned better than gathering threshold
      2ndLEval: -log10(E-value) [2ndModel]
      3rdModel: third-best hit to PFAM Trx Clan member, if aligned better than gathering threshold
      3rdLEval: -log10(E-value) [3rdModel]
      domOrder: colon-separated list of Trx domains present in sequence, ordered N to C
      doE-vals: HMM alignment E-values for each of the above domains
      do_num_domains: number of Trx clan domains present in sequence; Fig. S3 coloring

    Structures
      has_struc: yes if the sequence is associated with a structure in the PDB database
      strucIDs: colon-separated list of PDB IDs associated with this sequence in the PDB database
      struc_ct: number of PDB IDs associated with this sequence in the PDB database
      idsFrom60nrPdbNet: list of PDB chains with >= 40% identity to this sequence in the 159-structure network from Fig. 3 (i.e., chain sequences are also < 60% identical to each other)
      nStrucsFrom60nrPdbNet: = count(idsFrom60nrPdbNet)
      60nrPDB_ID: PDB ID for the representative of this sequence in the structure-based network (Fig. 3) (i.e., member of the >= 40% identity sequence cluster containing the sequence associated with that PDB structure chain -- nearest by sequence from 'idsFrom60nrPdbNet')

    Sequence motifs
      catType: CxxC, Cxxc, cxxC, loopC_C3, other; Fig. 6 coloring
      C0: amino acid at Cxxc position
      X1: amino acid at cXxc position
      X2: amino acid at cxXc position
      C3: amino acid at cxxC position
      CXXC: 'C0'+'X1'+'X2'+'C3'
      loopC: amino acid at Cxxxc position
      cPorR: is there a Pro or Arg at the N-term of the third beta strand? Fig. 7 coloring

    Taxonomic
      species: species
      Note that the standard hierarchy is: superkingdom, kingdom, phylum, class, order, family, genus, species
      kingToSuperking: if species is associated with a taxonomic kingdom, this is the attribute value; otherwise, the superkingdom value is used; Fig. S6 coloring
      phylumToClass: if species is associated with a taxonomic phylum, this is the attribute value; otherwise, the class value is used
      orderToFamily: if species is associated with a taxonomic order, this is the attribute value; otherwise, the family value is used
      taxID: NCBI taxonomy ID
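
The catType categories can be related to the C0, C3, and loopC attributes with a small classifier. The sketch below is one plausible reading of the attribute descriptions, not the authors' stated rule: it assumes an uppercase letter in a category name marks a cysteine at that position, that loopC_C3 means cysteines at both the loopC and C3 positions, and that categories are tested in the order shown. The function name `cat_type` is hypothetical.

```python
def cat_type(c0, c3, loop_c):
    """Classify an active site from the residues at the C0, C3, and loopC positions.

    Assumed precedence: CxxC, then Cxxc, then loopC_C3, then cxxC, then other.
    """
    first_cys = c0 == "C"
    last_cys = c3 == "C"
    if first_cys and last_cys:
        return "CxxC"       # both cysteines of the dithiol motif
    if first_cys:
        return "Cxxc"       # only the N-terminal cysteine
    if last_cys and loop_c == "C":
        return "loopC_C3"   # cysteine shifted one position N-terminal (CxxxC-like)
    if last_cys:
        return "cxxC"       # only the C-terminal cysteine
    return "other"


# Residues read off the example motifs in activeSiteMotifsByFamily.tsv;
# the loopC value is a placeholder ("A") except for the CxxxC-like motif.
examples = {
    "CGpC": cat_type("C", "C", "A"),
    "CGlT": cat_type("C", "T", "A"),
    "tPvC": cat_type("T", "C", "A"),
    "CPdiC": cat_type("P", "C", "C"),
    "spra": cat_type("S", "A", "A"),
}
for motif, category in examples.items():
    print(motif, category)
```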

    File Description
    network: 159_chain_seqs.fa_1e-05.fc.xgmml
    [1M]
    Fig. S5D: sequence similarity network, BLAST E-value cutoff of 1x10^-5

    Sequence-similarity network, containing 159 chain sequences from Fig. 3A-C. Similarity is defined by pairwise BLAST alignments better than an E-value of 1x10^-5; edges at this threshold represent alignments with a median 27% identity over 84 residues, while the rest of the edges represent better alignments.

    Load in Cytoscape using File: Import network

    *See description of Fig. 3A network above for xgmml file structure/sequence attributes


    3. Prediction of CxxC active sites
    File Description
    tab-separated text file: activeSiteMotifsByFamily.tsv
    [961B]
    Text file associating each Thioredoxin-like Clan PFAM model with a CxxC motif, if present.

    Columns:

    • PFAM model
    • model motif that appears in alignments
    • notes (how does motif correspond to CxxC positions?)
    • example sequence and structure

    Contents:

    PFAM model	motif		notes	example seq and structure
    AhpC-TSA	tPvC		=cxxC	PRDX6_HUMAN 1PRX T44,C47
    ArsC		Cstc		=CxxC	ARSC1_ECOLI 1I9D C12,S15
    Calsequestrin	[no motif]	has multiple Trx domains, no CxxC in any	CASQ1_RABIT 1A8Y
    DSBA		CPyC		=CxxC	DSBA_ECOLI 1FVK C30,C33
    DUF1687		[no motif]		YK29_YEAST 1WPI
    DUF836		ChLC		=CxxC	Q8P6W3_XANCP 1TTZ C11,C14
    DUF953		CGpC		=CxxC	Q9BRA2_HUMAN 1WOU C43,C46
    ERp29_N		[no motif]		WBL_DROME 1OVN
    GSHPx		CGlT		=Cxxc	GPX7_HUMAN 2P31 C57,T60
    GST_N		spra		poor fit: cxxc	GSTT1_HUMAN 2C3N S11,C14
    Glutaredoxin	CpfC		=CxxC	GLRX3_ECOLI 3GRX C11,S14
    HyaE		[no motif]		HYAE_ECOLI 2HFD
    OST3_OST6	CqlC		=Cxxc	OST3_YEAST [no structure] C73,C76
    Phosducin	GtdA		poor fit: model based on 2 seq; 't'-pos is Cys	PHOS_RAT 1B9Y_C C148
    Redoxin		cPtC		=cxxC	PRDX5_HUMAN 1OC3 T44,C47
    SCO1-SenC	CPdiC		=CxxxC	SCO1_HUMAN 2GGT C169,C173
    SH3BGR		[no motif]		SH3L1_HUMAN 1U6T
    T4_deiodinase	TCP			IOD2_HUMAN [no structure] U133 (Sec)
    Thioredoxin	CGpC		=CxxC	THIO_BACSU 2GZY C29,C32
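
The listing above pads columns with repeated tabs and leaves the notes column empty for some rows, so a plain tab split yields ragged records. The sketch below shows one way to read it, assuming the fields are separated by runs of tab characters and that a three-field row is missing only its notes; both are assumptions about the file, and `parse_motif_line` is a hypothetical helper name. The sample rows are copied from the listing.

```python
import re


def parse_motif_line(line):
    """Split one row on runs of tabs and return a dict of its four columns."""
    fields = [f for f in re.split(r"\t+", line.strip()) if f]
    if len(fields) == 4:
        model, motif, notes, example = fields
    elif len(fields) == 3:
        # Assumption: when only three fields are present, the notes are missing.
        model, motif, example = fields
        notes = ""
    else:
        raise ValueError(f"unexpected number of fields: {line!r}")
    return {
        "model": model,
        "motif": None if motif == "[no motif]" else motif,
        "notes": notes,
        "example": example,
    }


ROWS = [
    "Thioredoxin\tCGpC\t\t=CxxC\tTHIO_BACSU 2GZY C29,C32",
    "HyaE\t\t[no motif]\t\tHYAE_ECOLI 2HFD",
    "SCO1-SenC\tCPdiC\t\t=CxxxC\tSCO1_HUMAN 2GGT C169,C173",
]

records = [parse_motif_line(row) for row in ROWS]
for record in records:
    print(record["model"], record["motif"], record["notes"])
```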
    
