Welcome to the Effective help page! The following sections are available:
- Interactive prediction
  - Description of the input format for online submission
  - Calculation time
  - Interpretation of the result
- EffectiveT3
  - Classification modules
  - Cut-off
  - Interpretation of score
  - About the algorithm
  - Requirements for WebStart or stand-alone version
- Eukaryotic-like protein domains
  - Materials and Data
  - Calculation
  - Interpretation
Interactive prediction
Input format
Please submit either a single protein sequence or multiple protein sequences in multi-FASTA format. When using the
file upload, always provide a multi-FASTA file.
A correctly formatted FASTA entry looks like the following example.
White-space and newline characters are allowed within the sequence:
>name description
MKSVKIMGTMPPSISLAKAHERISQHWQNPVGELNIGGKRYRIIDNQVLRLNPHSGFSLF
REGVGKIFSGKMFNFSIARNLTDTLHAAQKTTSQELRSDIPNALSNLFGAKPQTELPLGW
KGEPLSGAPDLEGMRVAETDKFAEGESHISIIETKDKQRLVAKIERSIAEGHLFAELEAY
KHIYKTAGKHPNLANVHGMAVVPYGNRKEEALLMDEVDGWRCSDTLRTLADSWKQGKINS
Here, name and description are separated by a white-space character. A new entry starts with ">" at the beginning of a line. The entries may be separated by any number of newlines.
Sequences shorter than 25 residues are too short for a prediction. Sequences containing residues other than the 20 proteinogenic amino acids cannot be predicted. Invalid sequences are discarded and marked in the output. You may upload up to 10000 sequences in one request.
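The input rules above can be checked locally before submission. The following Python sketch is not part of Effective; the function names, the status labels, and the choice to treat lowercase letters as valid are assumptions made for illustration. It splits multi-FASTA text into entries and flags sequences that are too short or contain letters outside the 20 proteinogenic amino acids.

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 proteinogenic amino acids
MIN_LENGTH = 25        # shorter sequences cannot be predicted
MAX_SEQUENCES = 10000  # upload limit per request


def parse_fasta(text):
    """Yield (header, sequence) pairs from multi-FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # entries may be separated by any number of newlines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # white-space inside the sequence is allowed and dropped here
            chunks.append("".join(line.split()))
    if header is not None:
        yield header, "".join(chunks)


def check_sequence(sequence):
    """Return 'ok', 'invalid_residue' or 'too_short' (hypothetical labels)."""
    sequence = sequence.upper()  # assumption: lowercase input is accepted
    if set(sequence) - VALID_RESIDUES:
        return "invalid_residue"
    if len(sequence) < MIN_LENGTH:
        return "too_short"
    return "ok"
```

A file can be screened with `[(h, check_sequence(s)) for h, s in parse_fasta(text)]`; comparing the number of entries against `MAX_SEQUENCES` covers the upload limit.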
Calculation time
The input data can contain thousands of protein sequences,
but please be aware of the calculation time this requires.
Estimates for 1000 protein sequences are:
| EffectiveT3 | 20 sec |
| SignalP | 40 sec |
| Domains | 180 sec |
Running EffectiveT3 prediction, signal peptide detection, and identification of eukaryotic-like domains together takes about 4 minutes per 1000 sequences.
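For a rough estimate with other input sizes, the per-method figures can be scaled. The sketch below assumes that run time grows linearly with the number of sequences, which the help page does not state; the function name is hypothetical.

```python
# Estimated seconds per 1000 sequences, taken from the table above.
SECONDS_PER_1000 = {"EffectiveT3": 20, "SignalP": 40, "Domains": 180}


def estimate_seconds(n_sequences, methods=("EffectiveT3", "SignalP", "Domains")):
    """Estimate the run time in seconds, assuming linear scaling."""
    per_1000 = sum(SECONDS_PER_1000[m] for m in methods)
    return per_1000 * n_sequences / 1000
```

With all three methods selected, `estimate_seconds(1000)` gives 240 seconds, which matches the 4 minutes quoted above.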
Interpretation of the output
The output is a table of all proteins that received a positive
prediction from at least one of the three methods. Positive predictions
are highlighted in green.
is Sec secreted:
This column shows "+" if SignalP detected a signal peptide for Sec-pathway
secretion in the particular protein.
is T3 secreted / T3 Score:
EffectiveT3 returns a score between 0 and 1.0. The higher the score, the
more confident the prediction. Depending on the chosen classification module and cut-off,
proteins predicted to be secreted by the type III secretion system (TTSS) are reported as positive.
Sequences that contain invalid letters are highlighted in yellow;
sequences that are too short are highlighted in red.
Euk. domains:
For each eukaryotic-like domain detected in the query sequence, the Pfam accession
is provided. Each accession links to the respective domain report page for
further analysis.
EffectiveT3
Prediction schemata and cut-offs
EffectiveT3 provides three differently trained classification modules.
The effectors used for each set are listed [here].
The following schemata are currently available:
- Standard classification module: Prediction schema trained with all effector sequences as described in the EffectiveT3 publication. It comprises effectors of E. coli, Salmonella, Chlamydia, Yersinia, and Pseudomonas.
- Plant classification module: Prediction schema trained with effector sequences from the plant pathogen Pseudomonas syringae.
- Animal classification module: Prediction schema trained with effector sequences from E. coli, Salmonella, Chlamydia, Yersinia (Animal/Human pathogens).
Cut-Off
The cut-off defines the minimum score a prediction has to
achieve to be reported as a positive (= secreted) prediction. Each module
comes with a selective setting (0.9999; reports fewer putative effectors, with high
confidence) and a sensitive setting (0.95; reports more effectors, but with a higher
false positive rate). In addition to these pre-defined cut-offs, the interface lets you
set the cut-off to any value between 0 and 1.0.
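The effect of the two pre-defined cut-offs can be illustrated with a few lines of Python. This is a sketch only: the function name and the example scores are made up, and treating a score equal to the cut-off as positive is an assumption based on the phrase "minimum score".

```python
SELECTIVE_CUTOFF = 0.9999  # fewer predictions, high confidence
SENSITIVE_CUTOFF = 0.95    # more predictions, higher false positive rate


def positive_predictions(scores, cutoff=SELECTIVE_CUTOFF):
    """Return the proteins whose score reaches the cut-off."""
    if not 0.0 <= cutoff <= 1.0:
        raise ValueError("cut-off must lie between 0 and 1.0")
    # assumption: a score equal to the cut-off counts as positive
    return {name: score for name, score in scores.items() if score >= cutoff}


# Made-up scores for illustration only.
example_scores = {"protA": 0.99995, "protB": 0.97, "protC": 0.40}
```

With these example scores, the selective setting reports only `protA`, while the sensitive setting reports `protA` and `protB`.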
Interpretation of the EffectiveT3 score
EffectiveT3 returns a score between 0 and 1.0. The higher the score, the
more confident the prediction. However, this score cannot be
interpreted as a true p-value, since the probability distribution of random
predictions is unknown. The score can be interpreted as a probability measured
empirically by the classification algorithm employed.
About the EffectiveT3 algorithm
To date, only a few type III effectors are known. The EffectiveT3 software can
detect probable effector candidates based on a
specific secretion signal in the N-termini of protein sequences.
EffectiveT3 is based on an algorithm that is trained to
distinguish effector from non-effector proteins by evaluating a combination of
discriminative sequence properties of the N-termini. These properties
(for example, an enriched serine content) describe an N-terminal
secretion signal. They have been extracted from a training set of
high-confidence effector sequences.
For more details, please read the EffectiveT3 publication:
Sequence-based prediction of type III secreted proteins.
Arnold R, Brandmaier S, Kleine F, Tischler P, Heinz E, Behrens S,
Niinikoski A, Mewes HW, Horn M, Rattei T.
PLoS Pathog. 2009 Apr;5(4):e1000376.
PMID: 19390696.
Requirements for the web-start or the stand-alone version
To run the stand-alone or the web-start application of EffectiveT3, you need the Java Runtime Environment Version 5.0 or a later version (http://www.java.com/download/).
EffectiveT3 has been tested on the following operating systems:
- Microsoft Windows XP
- Linux Fedora 11
- Mac OS X 10.5
Identification of eukaryotic-like protein domains
Materials and Data
Proteome data and features from a variety of resources are
integrated into the portal and structured in a genome repository:
we derived the annotated proteins of all publicly available, completely sequenced
genomes listed in the RefSeq database, as well as all domain signatures detected by Pfam,
using the Simap database. For genomes included in the eggNOG Clusters of Orthologous Groups,
we have additionally stored reduced proteomes containing only evolutionarily conserved sequences
that are members of a COG or NOG. This approach eliminates ORFans, which represent possible gene over-predictions,
and thus improves the determination of domain enrichment scores.
All organisms covered by the genome repository are classified into eukaryotes and,
according to the Resource on Microbial Genomes,
into pathogenic, symbiotic and non-pathogenic bacteria.
Calculation
To eliminate the influence of bacterial contamination in eukaryotic genomes,
the calculation was restricted to protein domains that are detected in pathogenic genomes
as well as in at least 3 eukaryotic genomes.
To estimate the background model for each remaining domain,
the average and standard deviation of its frequencies in all non-pathogenic genomes are calculated.
For genomes included in the eggNOG Clusters of Orthologous Groups,
frequencies based on the reduced proteomes, which contain only evolutionarily
conserved sequences, have been determined additionally.
For each pathogenic genome, the domain enrichment score of each domain has been calculated as the number of
standard deviations by which the domain frequency in that particular pathogenic
genome differs from the background frequency in non-pathogenic genomes.
The score thereby directly reflects the enrichment of a particular eukaryotic-like domain
in the proteins of a particular pathogenic genome.
Example of score calculation: MACPF (PF01823)
MACPF (membrane-attack complex/perforin domain, PF01823) is widely distributed over proteins of bacterial and eukaryotic origin. It is known to mediate vertebrate defense and bacterial attack (Hadders et al., Science 2007).
The domain report shows that proteins containing the MACPF domain exist in the proteomes of
28 pathogens/symbionts, 3 non-pathogens and 33 eukaryotes (all values used in the example calculation are based on the genome repository status of November 2010).
The domain occurrence in eukaryotic organisms (33) is above the cut-off of 3 eukaryotic genomes, so the domain is considered for further calculations.
Furthermore, MACPF is found in proteins of 3 non-pathogens:
Chlorobium limicola DSM 245 (1)
Chlorobium phaeobacteroides DSM 266 (1)
Trichodesmium erythraeum IMS101 (1)
Considering all 292 non-pathogenic bacteria listed in the genome repository,
the average background frequency in non-pathogens can be calculated as
avg_np = sum_over_all_nonpathogens(frequency_in_nonpathogen) / #nonpathogens
       = (1+1+1)/292 ~ 0.01
The standard deviation in non-pathogens is estimated as
stdev_np = sqrt(sum_over_all_nonpathogens(pow(frequency_in_nonpathogen - avg_np, 2)) / #nonpathogens)
         ~ 0.1
The score for a particular pathogen with a proteome that contains 1 protein having the domain
(e.g. Chlamydophila pneumoniae CWL029) is
score(pathogen) = (frequency_in_pathogen - avg_np) / stdev_np
                = (1 - 0.01)/0.1 = 9.9 ~ 10
In Chlamydophila pneumoniae CWL029, the MACPF domain achieves a high domain enrichment score of 10 and is therefore considered to be enriched.
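The worked example can be reproduced with a short Python sketch. It follows the formulas given above (population standard deviation over all non-pathogens); the function names are hypothetical, and the zero-variance check is an added safeguard rather than documented behaviour of the portal.

```python
import math

MIN_EUKARYOTIC_GENOMES = 3  # cut-off stated in the Calculation section


def is_eligible(n_pathogen_genomes, n_eukaryotic_genomes):
    """A domain is scored only if it occurs in pathogens and in >= 3 eukaryotes."""
    return n_pathogen_genomes > 0 and n_eukaryotic_genomes >= MIN_EUKARYOTIC_GENOMES


def enrichment_score(frequency_in_pathogen, nonpathogen_frequencies):
    """Number of standard deviations between the domain frequency in one
    pathogen and the background frequency in non-pathogens."""
    n = len(nonpathogen_frequencies)
    avg_np = sum(nonpathogen_frequencies) / n
    variance = sum((f - avg_np) ** 2 for f in nonpathogen_frequencies) / n
    stdev_np = math.sqrt(variance)
    if stdev_np == 0:
        # added safeguard: the formula is undefined without background variance
        raise ValueError("background has zero variance; score is undefined")
    return (frequency_in_pathogen - avg_np) / stdev_np


# MACPF example: 3 of 292 non-pathogens carry one MACPF protein each.
macpf_background = [1, 1, 1] + [0] * 289
macpf_score = enrichment_score(1, macpf_background)
```

Without the intermediate rounding of avg_np and stdev_np, the score comes out at about 9.8, which also rounds to 10.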
Interpretation
The domain enrichment score makes it possible to distinguish between protein domains
that are uniformly distributed over different classes of organisms and
eukaryotic-like domains that are enriched in the proteomes of pathogenic bacteria.
The domain enrichment score is similar to the common Z-score used
to estimate the statistical significance of normally distributed observations.
Although the distribution of domain occurrences across genomes has varying shapes,
manual inspection has shown that domain enrichment
scores typically show the characteristics of Z-scores and can be
considered significant if higher than 3 to 5.
Domains that occur only in genomes of pathogens and eukaryotes are listed with
a score of 10000. Domains that occur in just one pathogenic genome have a
score of 9000.
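When post-processing a list of domain enrichment scores, the two reserved values need to be separated from ordinary scores. The following Python sketch shows one way to do this; the function name, the returned labels, and the default threshold of 3 (the lower end of the range given above) are assumptions.

```python
ONLY_PATHOGENS_AND_EUKARYOTES = 10000  # reserved score
SINGLE_PATHOGENIC_GENOME = 9000        # reserved score


def interpret_score(score, threshold=3.0):
    """Translate a domain enrichment score into a short label."""
    if score == ONLY_PATHOGENS_AND_EUKARYOTES:
        return "only in pathogens and eukaryotes"
    if score == SINGLE_PATHOGENIC_GENOME:
        return "only in one pathogenic genome"
    # assumption: scores above the threshold are treated as significant
    return "enriched" if score > threshold else "not enriched"
```

A stricter reading of the 3 to 5 range can be applied by passing `threshold=5.0`.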
Update of genome repository and re-calculation of eukaryotic domains.
1.0.1 (2009/08/26)
EffectiveT3 classification modules release:
1.0.1 (2009/08/26)