Welcome to the Effective help page! The following sections are available:
- Description of the input format for online submission
- Calculation time
- Interpretation of the result
- Classification modules
- Interpretation of score
- About the algorithm
- Requirements for WebStart or stand-alone version
- Materials and Data
Eukaryotic-like protein domains
Please submit either a single protein sequence or multiple protein sequences in multi Fasta format. When using the file upload, always provide a multi Fasta file.
A correctly formatted Fasta entry looks like this example and allows
white-space and newline characters within the sequence:
>name description
MKSVKIMGTMPPSISLAKAHERISQHWQNPVGELNIGGKRYRIIDNQVLRLNPHSGFSLF
REGVGKIFSGKMFNFSIARNLTDTLHAAQKTTSQELRSDIPNALSNLFGAKPQTELPLGW
KGEPLSGAPDLEGMRVAETDKFAEGESHISIIETKDKQRLVAKIERSIAEGHLFAELEAY
KHIYKTAGKHPNLANVHGMAVVPYGNRKEEALLMDEVDGWRCSDTLRTLADSWKQGKINS
Here, name and description are separated by a white-space character. A new entry starts with ">" at the beginning of a line. The entries may be separated by any number of newlines.
Sequences with length less than 25 residues are too short to perform a prediction. Sequences containing residues that do not belong to the 20 proteinogenic amino acids cannot be predicted. Invalid sequences are discarded and marked in the output. You may upload up to 10000 sequences in one request.
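The input rules above can be illustrated with a minimal Python sketch. It is not the server's actual parser: the function names, the precedence of the two checks and the acceptance of lower-case letters are assumptions made for this example, and a raw sequence without a ">" header line is not handled.

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 proteinogenic amino acids
MIN_LENGTH = 25        # shorter sequences cannot be predicted
MAX_SEQUENCES = 10000  # upload limit per request


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from multi-Fasta text."""
    entries = []
    header, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            # A ">" at the beginning of a line starts a new entry.
            if header is not None:
                entries.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # White-space inside the sequence is allowed, so drop it.
            chunks.append("".join(line.split()))
    if header is not None:
        entries.append((header, "".join(chunks)))
    if len(entries) > MAX_SEQUENCES:
        raise ValueError("at most 10000 sequences per request")
    return entries


def check_sequence(sequence):
    """Classify a sequence as 'ok', 'too_short' or 'invalid_residues'."""
    # Assumption: invalid letters are reported before the length check.
    if any(residue not in VALID_RESIDUES for residue in sequence.upper()):
        return "invalid_residues"
    if len(sequence) < MIN_LENGTH:
        return "too_short"
    return "ok"
```

Running check_sequence over the result of parse_fasta before uploading shows which entries would be discarded.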
The input may contain thousands of protein sequences, but please keep the required calculation time in mind.
Estimate for 1000 protein sequences:
Running the EffectiveT3 prediction, signal peptide detection and identification of eukaryotic-like domains together takes about 4 minutes per 1000 sequences.
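As a rough guide, the figure above can be turned into a runtime estimate. The sketch below assumes that the time scales linearly with the number of sequences, which the help page does not state explicitly. The function name is hypothetical.

```python
MINUTES_PER_1000_SEQUENCES = 4  # figure quoted on this page


def estimated_minutes(n_sequences):
    """Estimate the runtime in minutes, assuming linear scaling."""
    return n_sequences * MINUTES_PER_1000_SEQUENCES / 1000
```

Under this assumption, a full upload of 10000 sequences would take roughly 40 minutes.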
Interpretation of the output
The output consists of a table of all proteins which received a positive prediction from at least one of the three methods. The positive predictions are marked in green.
is Sec secreted:
This column shows "+" if SignalP detected a signal peptide for Sec-pathway secretion in the particular protein.
is T3 secreted / T3 Score:
EffectiveT3 returns a score in the range [0 ... 1.0]. The higher the score, the more confident the prediction. Depending on the chosen settings, the proteins predicted to be effectors secreted by the type III secretion system (TTSS) are shown.
Sequences which contain invalid letters are marked in yellow; sequences that are too short are marked in red.
For each eukaryotic-like domain detected in the query sequence, the Pfam accession is provided. The accession links lead to the respective domain report page for further analysis.
The following schemata are currently available:
- Standard classification module: Prediction schema trained with all effector sequences as described in the EffectiveT3 publication. It comprises effectors of E. coli, Salmonella, Chlamydia, Yersinia, and Pseudomonas.
- Plant classification module: Prediction schema trained with effector sequences from the plant symbiont Pseudomonas syringae.
- Animal classification module: Prediction schema trained with effector sequences from E. coli, Salmonella, Chlamydia, Yersinia (Animal/Human pathogens).
The cut-off defines the minimum score a prediction has to achieve to be reported as a positive (= secreted) prediction. Each module comes with a selective setting (0.9999, reports fewer putative effectors with high confidence) and a sensitive setting (0.95, reports more effectors but with a higher false positive rate). In addition to the pre-defined cut-offs, the interface allows you to set the cut-off to any value in the range [0 ... 1.0].
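How the cut-off turns a score into a positive or negative call can be sketched as follows. The function names and the default setting are hypothetical. The ">=" comparison is an assumption based on the phrase "minimum score" above, not a confirmed detail of the server.

```python
# Pre-defined cut-offs named on this page.
PRESET_CUTOFFS = {"selective": 0.9999, "sensitive": 0.95}


def resolve_cutoff(setting):
    """Accept a preset name or a custom number in the range [0, 1.0]."""
    if isinstance(setting, str):
        return PRESET_CUTOFFS[setting]
    cutoff = float(setting)
    if not 0.0 <= cutoff <= 1.0:
        raise ValueError("cut-off must lie in [0, 1.0]")
    return cutoff


def is_predicted_secreted(score, setting="selective"):
    """Report a positive prediction if the score reaches the cut-off."""
    return score >= resolve_cutoff(setting)
```

For example, a protein with a score of 0.97 would be reported with the sensitive setting but not with the selective one.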
Interpretation of the EffectiveT3 score
EffectiveT3 returns a score in the range [0 ... 1.0]. The higher the score, the more confident the prediction. However, this score cannot be interpreted as a true p-value, since the probability distribution of random predictions is unknown. Instead, the score can be interpreted as a probability estimated empirically by the classification algorithm employed.
About the EffectiveT3 algorithm
To date, only a few Type III effectors are known. The EffectiveT3 software detects probable effector candidates by means of a specific secretion signal in the N-termini of protein sequences. EffectiveT3 is based on an algorithm which is trained to separate effector from non-effector proteins by judging a combination of discriminative sequence properties of the N-termini. These properties (for example, an enriched serine content) describe an N-terminal secretion signal. They have been extracted from a training set of high-confidence effector sequences.
For more details, please read the EffectiveT3 publication:
Sequence-based prediction of type III secreted proteins.
Arnold R, Brandmaier S, Kleine F, Tischler P, Heinz E, Behrens S, Niinikoski A, Mewes HW, Horn M, Rattei T.
PLoS Pathog. 2009 Apr;5(4):e1000376. PMID: 19390696.
Requirements using the web-start or the stand-alone version
To run the stand-alone or the web-start application of EffectiveT3, you need the Java Runtime Environment Version 5.0 or a later version (http://www.java.com/download/). EffectiveT3 has been tested on the following operating systems:
- Microsoft Windows XP
- Linux Fedora 11
- MacOS X 10.5
Identification of eukaryotic-like protein domains
Materials and Data
Proteome data and features from a variety of resources are integrated into the portal and structured in a genome repository: we derived the annotated proteins of all publicly available, completely sequenced genomes listed in the RefSeq database, as well as all domain signatures detected by Pfam using the Simap database. For genomes included in the eggNOG Clusters of Orthologous Groups, we have additionally stored reduced proteomes containing only evolutionarily conserved sequences that are members of a COG or NOG. This approach eliminates ORFans, which may represent gene over-predictions, and thus improves the determination of domain enrichment scores. All organisms covered by the genome repository are classified into eukaryotes and, according to the Resource on Microbial Genomes, into pathogenic, symbiotic and non-pathogenic bacteria.
In order to eliminate the influence of bacterial contaminations in eukaryotic genomes, the calculation was restricted to protein domains that are detected in pathogenic genomes as well as in at least 3 eukaryotic genomes.
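The filter described above can be written as a simple predicate. This is only a sketch: the function name is hypothetical, and reading "detected in pathogenic genomes" as "detected in at least one pathogenic genome" is an assumption.

```python
MIN_EUKARYOTIC_GENOMES = 3  # threshold stated above


def is_candidate_domain(n_pathogenic_genomes, n_eukaryotic_genomes):
    """Keep a domain only if it is found in pathogens and in enough eukaryotes."""
    # Assumption: a single pathogenic genome is sufficient.
    found_in_pathogens = n_pathogenic_genomes >= 1
    found_in_eukaryotes = n_eukaryotic_genomes >= MIN_EUKARYOTIC_GENOMES
    return found_in_pathogens and found_in_eukaryotes
```

A domain seen in only two eukaryotic genomes would be dropped, because such sparse hits may stem from bacterial contamination.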
To estimate the background model for each remaining domain, the average and standard deviation of its frequencies in all non-pathogenic genomes are calculated. For genomes included in the eggNOG Clusters of Orthologous Groups, frequencies based on the reduced proteomes containing only evolutionarily conserved sequences have been determined additionally.
For each pathogen genome, the domain enrichment score of each domain has been calculated as the number of standard deviations by which the domain frequency in that particular pathogenic genome differs from the background frequency in non-pathogen genomes.
The score thus directly reflects the enrichment of a particular eukaryotic-like domain in proteins of a particular pathogenic genome.
Example of score calculation: MACPF (PF01823)
MACPF (membrane-attack complex/perforin domain, PF01823) is widely distributed over proteins of bacterial and eukaryotic origin. It is known to mediate vertebrate defense and bacterial attack (Hadders et al., Science 2007).
The domain report reveals that proteins containing the MACPF domain exist in the proteomes of
28 pathogens/symbionts, 3 non-pathogens and 33 eukaryotes (all values used in the example calculation are based on the genome repository status of November 2010).
The domain occurrence in eukaryotic organisms (33) is above the cut-off, so the domain is considered for further calculations.
Furthermore, MACPF is found in proteins of 3 non-pathogens:
Chlorobium limicola DSM 245 (1)
Chlorobium phaeobacteroides DSM 266 (1)
Trichodesmium erythraeum IMS101 (1)
Considering all 292 non-pathogenic bacteria listed in the genome repository, the average background frequency in non-pathogens can be calculated as
avg_np = sum_over_all_nonpathogens(frequency_in_nonpathogen) / #nonpathogens =
= (1+1+1)/292 ~ 0.01
The standard deviation in non-pathogens is estimated as
stdev_np = sqrt(sum_over_all_nonpathogens(pow(frequency_in_nonpathogen - avg_np, 2)) / #nonpathogens) =
= sqrt((3*pow(1 - 0.01, 2) + 289*pow(0 - 0.01, 2)) / 292) ~ 0.1
The score for a particular pathogen with a proteome that contains 1 protein having the domain (e.g. Chlamydophila pneumoniae CWL029) is
score = (frequency_in_pathogen - avg_np) / stdev_np =
= (1 - 0.01)/0.1 = 9.9 ~ 10
In Chlamydophila pneumoniae CWL029 the MACPF-domain achieves a high domain score of 10 and therefore is considered to be enriched.
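The MACPF calculation can be reproduced with a few lines of Python. This is a sketch of the arithmetic shown in the example, not the portal's own code: the function names are hypothetical, "frequency" is taken to be the number of proteins per genome that contain the domain, and the 289 non-pathogenic genomes without MACPF are assumed to enter the background with a frequency of 0.

```python
import math


def background_model(frequencies):
    """Mean and population standard deviation of per-genome frequencies."""
    n = len(frequencies)
    avg = sum(frequencies) / n
    variance = sum((f - avg) ** 2 for f in frequencies) / n
    return avg, math.sqrt(variance)


def enrichment_score(frequency_in_pathogen, avg_np, stdev_np):
    """Number of standard deviations above the non-pathogen background."""
    return (frequency_in_pathogen - avg_np) / stdev_np


# MACPF: 3 of the 292 non-pathogenic genomes hold one MACPF protein each.
nonpathogen_frequencies = [1, 1, 1] + [0] * 289
avg_np, stdev_np = background_model(nonpathogen_frequencies)

# Chlamydophila pneumoniae CWL029 has 1 protein with the domain.
score = enrichment_score(1, avg_np, stdev_np)
print(round(avg_np, 2), round(stdev_np, 2), round(score, 1))
```

Without rounding the intermediate values, the score comes out at about 9.8 rather than 9.9. Both round to the domain score of 10 quoted above.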
The domain enrichment score makes it possible to distinguish between protein domains that are uniformly distributed over different classes of organisms and eukaryotic-like domains that are enriched in the proteomes of pathogenic bacteria.
The domain enrichment score is similar to the common Z-score used to estimate the statistical significance of normally distributed observations. Although the distribution of domain occurrences across genomes has varying shapes, manual inspection has shown that the domain enrichment scores typically show the characteristics of Z-scores and can be considered significant above a threshold of about 3 to 5. Domains that occur only in genomes of pathogens and eukaryotes are listed with a score of 10000. Domains that occur in just one pathogenic genome have a score of 9000.
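When processing domain scores programmatically, the two special values need to be handled before any threshold is applied. The sketch below shows one way to do this. The function name, the labels and the default threshold of 3.0 (the lower end of the range mentioned above) are assumptions made for illustration.

```python
# Special scores quoted on this page.
SCORE_ONLY_PATHOGENS_AND_EUKARYOTES = 10000
SCORE_SINGLE_PATHOGEN_GENOME = 9000


def describe_score(score, threshold=3.0):
    """Return a short label for a domain enrichment score."""
    if score == SCORE_ONLY_PATHOGENS_AND_EUKARYOTES:
        return "only in pathogens and eukaryotes"
    if score == SCORE_SINGLE_PATHOGEN_GENOME:
        return "only in one pathogenic genome"
    # Assumption: scores strictly above the threshold count as significant.
    if score > threshold:
        return "enriched"
    return "not significant"
```

A stricter analysis could pass threshold=5.0 to use the upper end of the range.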
Update of genome repository and re-calculation of eukaryotic domains.
EffectiveT3 classification modules release: