ASAP: a machine learning framework for local protein properties Open Access


Table 1.

Performance of CleavePred models (simple and advanced) and the known motif (KM) model on the NeuroPred dataset

Metric	Simple CleavePred (%)	Advanced CleavePred (%)	KM model (%)	Mammal model (%)
AUC	80.42	76.75	74.78	81.38
Accuracy	89.87	88.68	71.55	77.02
Sensitivity	64.98	57.23	81.60	68.69
Precision	79.13	78.69	48.29	55.92
Specificity	95.87	96.26	67.85	80.08
F1-Score	71.36	66.26	60.67	61.65


Performance measured using CV (10-fold) on 4,802 windows/samples. AUC: Area under ROC curve.
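As a quick consistency check on the table above, the F1-score can be recomputed from the reported precision and sensitivity (recall) as their harmonic mean. The sketch below is illustrative only: the dictionary copies the Table 1 values, and the function names are hypothetical rather than part of CleavePred.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, sensitivity %, reported F1 %) copied from Table 1.
TABLE_1 = {
    "Simple CleavePred": (79.13, 64.98, 71.36),
    "Advanced CleavePred": (78.69, 57.23, 66.26),
    "KM model": (48.29, 81.60, 60.67),
    "Mammal model": (55.92, 68.69, 61.65),
}


def f1_discrepancies(table):
    """Absolute difference between recomputed and reported F1 per model."""
    return {
        name: abs(f1_score(p, r) - reported)
        for name, (p, r, reported) in table.items()
    }


for name, diff in f1_discrepancies(TABLE_1).items():
    print(f"{name}: |recomputed - reported| = {diff:.3f}")
```

The recomputed values agree with the reported ones to within the rounding of the two-decimal inputs.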

Table 2 shows our performance compared to two state-of-the-art competing methods, the Mammal model (M) (48) and the KM model (both using the implementation provided by the NeuroPred website), on the hold-out UniProt test set.

Table 2.

UniProt test-set performance

Metric	Simple CleavePred (%)	Advanced CleavePred (%)	KM model (%)
AUC	88.17	89.08	82.56
Accuracy	93.48	94.40	77.57
Sensitivity	80.28	81.17	89.72
Precision	79.97	84.06	49.26
Specificity	96.07	96.99	74.18
F1-Score	80.13	82.59	63.60


Several conclusions can be drawn from the analysis shown in Tables 1 and 2:

Our models are superior in most measures of performance, both on the UniProt test set and the NeuroPred dataset (Table 1, NeuroPred CV).
Precision improves markedly (from 48–55% to 79–84%).
The performance on the test set (from UniProt/SwissProt) is lower with respect to the NeuroPred set (10-fold cross-validation, CV; compare Tables 1 and 2). Recall that the test set is ‘noisier’ and may suffer from a shortage of true positives owing to missing experimental validation. Furthermore, some proteins in the validation and test sets appeared in the Mammal model’s training data, giving it an unrealistic advantage in these cases.
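To make the precision gain concrete, precision can be converted into the expected number of false positives that accompany each true positive, (1 − P)/P. This is a minimal sketch; the function name is hypothetical, and the two precision values are taken from the KM model column of Table 1 and the Advanced CleavePred column of Table 2.

```python
def false_positives_per_true_positive(precision: float) -> float:
    """Expected number of false positives accompanying each true positive."""
    if not 0 < precision <= 1:
        raise ValueError("precision must be in (0, 1]")
    return (1 - precision) / precision


# Precision values (as fractions) copied from Tables 1 and 2.
for label, precision in [
    ("KM model (Table 1)", 0.4829),
    ("Advanced CleavePred (Table 2)", 0.8406),
]:
    ratio = false_positives_per_true_positive(precision)
    print(f"{label}: {ratio:.2f} false positives per true positive")
```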

Informative top features

We used Scikit-learn’s recursive feature elimination with cross-validation (RFECV) with a random forest (49, 50) in order to identify top features in each of the four configurations (simple and advanced models over NeuroPred and UniProt datasets). This procedure iteratively fits a classifier on the dataset and eliminates the least-informative features according to this classifier (random forest in this case). We focused on subsets of selected features common to both datasets. We found 44 such informative features for the simple model and 192 for the advanced one, which account for 9% and 20% of the original respective sets of features.
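The selection procedure described above can be sketched with Scikit-learn's `RFECV` wrapped around a random forest. This is an illustration under stated assumptions, not the authors' script: the two synthetic matrices stand in for the NeuroPred and UniProt feature sets, and the helper name, forest size, step size and fold count are arbitrary choices made to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold


def select_features(X, y, seed=0):
    """Return the set of column indices kept by RFECV with a random forest."""
    selector = RFECV(
        estimator=RandomForestClassifier(n_estimators=25, random_state=seed),
        step=2,  # drop the two lowest-ranked features per round
        cv=StratifiedKFold(n_splits=3),
        scoring="roc_auc",
    )
    selector.fit(X, y)
    return {i for i, kept in enumerate(selector.support_) if kept}


# Synthetic stand-ins for the two datasets (assumption: real data would be
# the feature matrices extracted from NeuroPred and UniProt windows).
X_a, y_a = make_classification(
    n_samples=200, n_features=12, n_informative=4, shuffle=False, random_state=0
)
X_b, y_b = make_classification(
    n_samples=200, n_features=12, n_informative=4, shuffle=False, random_state=1
)

selected_a = select_features(X_a, y_a)
selected_b = select_features(X_b, y_b)
common = selected_a & selected_b  # features selected on both datasets
print(sorted(common))
```

Taking the intersection mirrors the focus on features common to both datasets; with highly correlated features, which member of a correlated group survives can vary between runs.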

We note that our ‘engineered’ features appear consistently, while classic positional features (e.g. the AA at each position) were less effective. The exceptions are R or K at the positions immediately before the cleavage site (positions 11 and 12 in the CleavePred window, Figure 2).

The features that are well outside the ‘classic’ cleavage motif's location are of special interest. These features probably mark the preference for disorder quite remote from the actual cleavage recognition site.

Various AA scales were effective, notably solvent accessibility (51), Atchley scales at positions 0–4 and 7–12, tripeptide flexibility, Hydrophobicity (hw) and TOP-IDP at positions 6 and 13–16. Global features were also important, including the amount of basic AA prior to the cleavage site, GRAVY, Aromaticity, Aliphaticness, net charge and the presence of a potential known motif (KM).

For a detailed explanation of the feature descriptors, see https://github.com/ddofer/asap.
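As a rough illustration of the global features named above, the sketch below computes GRAVY, aromaticity, a crude net charge and the number of basic residues before a candidate site for a sequence window. It is not the ASAP API: all function names are hypothetical, GRAVY uses the standard Kyte-Doolittle hydropathy scale, "basic" is assumed to mean K or R, and the net charge simply counts K/R as +1 and D/E as −1, ignoring histidine and pH. Only the 20 standard amino acids are handled.

```python
# Kyte-Doolittle hydropathy values, the scale behind the GRAVY index.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
BASIC = set("KR")      # assumption: histidine is not counted as basic
ACIDIC = set("DE")
AROMATIC = set("FWY")


def gravy(seq: str) -> float:
    """Grand average of hydropathy: mean Kyte-Doolittle value."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def aromaticity(seq: str) -> float:
    """Fraction of aromatic residues (F, W, Y)."""
    return sum(aa in AROMATIC for aa in seq) / len(seq)


def net_charge(seq: str) -> int:
    """Crude net charge: +1 per K/R, -1 per D/E."""
    return sum(aa in BASIC for aa in seq) - sum(aa in ACIDIC for aa in seq)


def basic_before_site(window: str, site: int) -> int:
    """Number of K/R residues in window[:site], i.e. before the site index."""
    return sum(aa in BASIC for aa in window[:site])


def global_features(window: str, site: int) -> dict:
    """Bundle the global descriptors for one window."""
    return {
        "gravy": gravy(window),
        "aromaticity": aromaticity(window),
        "net_charge": net_charge(window),
        "basic_before_site": basic_before_site(window, site),
    }


print(global_features("GFGFTKKAL", site=7))
```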

In terms of advanced features, the PSSM- and entropy-based features proved quite powerful, both positionally and in aggregated segments (including the maximal entropy segment). The aggregated sums of exposed, buried or intrinsically disordered residues on either side of the site were also important.
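The entropy-based features can be pictured with a small sketch that computes per-column Shannon entropy from a multiple sequence alignment and then locates the window with the highest summed entropy. This uses plain Shannon entropy for illustration; the background-frequency-corrected measure of reference (28) and ASAP's actual implementation may differ, and the function names and toy alignment are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column: str) -> float:
    """Shannon entropy (bits) of one alignment column, ignoring gaps."""
    counts = Counter(aa for aa in column if aa != "-")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def positional_entropy(alignment):
    """Entropy of every column of an alignment given as equal-length rows."""
    return [column_entropy("".join(col)) for col in zip(*alignment)]


def max_entropy_segment(entropies, width):
    """Start index and summed entropy of the highest-entropy window."""
    best_start, best_sum = 0, float("-inf")
    for start in range(len(entropies) - width + 1):
        segment_sum = sum(entropies[start:start + width])
        if segment_sum > best_sum:
            best_start, best_sum = start, segment_sum
    return best_start, best_sum


toy_alignment = ["GAKRSL", "GSKRTL", "GTKRAL", "GAKR-L"]  # made-up rows
entropies = positional_entropy(toy_alignment)
print([round(h, 2) for h in entropies])
print(max_entropy_segment(entropies, width=2))
```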

It should be noted, however, that many of the features are highly correlated with each other, and therefore the choice of some of them at the expense of others is somewhat arbitrary. It should also be stressed that this procedure was applied only for reporting the top features in this section; it was not part of the actual training, validation and testing of the model.

Annotating novel genomes with CleavePred

Many of the peptides activated by PCs are peptide cell modulators. These peptides were studied in mammals and insects and, to a lesser extent, in other taxonomical branches. C. elegans is an important model for cell lineage and development. Therefore, peptides that function in signaling and communication between neurons were sought. Tens of such peptides were identified using MS and comparative genomics (52). Many of these identified peptides were used for training CleavePred.

We tested CleavePred as a cleavage-site predictor on poorly annotated genomes. To this end, we selected the draft genome of Ascaris suum (Pig roundworm) (53). We focused on the secreted proteome (i.e. proteins with a putative signal peptide). Among the tested sequences, several had high-probability cleavage sites.

One of these sequences is U1M532_ASCSU (Figure 3), which shows a repeated pattern of cleavage sites. Active peptides (14 high-probability sites, 15 peptides, 14 AA each) were predicted using CleavePred. The confidence for the cleavage probability is high (0.64–0.88). Interestingly, an identical cleavage pattern was found in other worms, including Toxocara canis (Dog roundworm) and Brugia malayi (a nematode that infects humans). A similar organization of peptides was identified in the sinus gland of the crustacean Blue Crab (Callinectes sapidus). The repeated pattern (Figure 3) is common and was reported in Arthropods and insects (54). We conclude that CleavePred allows accurate prediction of active peptides in a wide range of poorly annotated genomes. ProP (19), a general convertase predictor, identified 13 (of 14) sites. A discrepancy is observed at residue 251 of the sequence (GFGFTKK|AL, Figure 3, marked x). Other predictions of NeuroPred using default parameters are shown (Figure 3, marked +).
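The step from per-residue probabilities to candidate peptides can be sketched as follows. This is a simplified illustration, not CleavePred's code: the function names, toy sequence and probabilities are made up, cleavage is assumed to occur immediately after a predicted K/R residue, and trimming of the C-terminal basic residues is ignored. With n predicted sites, the sequence splits into n + 1 segments, which matches the 14 sites and 15 peptides reported above.

```python
def predicted_sites(sequence, probabilities, threshold=0.5):
    """Indices of K/R residues whose cleavage probability exceeds threshold."""
    return [
        i
        for i, (aa, p) in enumerate(zip(sequence, probabilities))
        if aa in "KR" and p > threshold
    ]


def split_at_sites(sequence, sites):
    """Cut immediately after each site index; n sites give n + 1 segments.

    The last segment is empty if the final residue is itself a site.
    """
    segments, start = [], 0
    for site in sorted(sites):
        segments.append(sequence[start:site + 1])
        start = site + 1
    segments.append(sequence[start:])
    return segments


# Toy input (hypothetical values, not taken from U1M532_ASCSU).
toy_sequence = "AAAKRGGGGKLLL"
toy_probabilities = [0.0] * len(toy_sequence)
toy_probabilities[3] = 0.2   # K below threshold
toy_probabilities[4] = 0.9   # R above threshold
toy_probabilities[9] = 0.7   # K above threshold

sites = predicted_sites(toy_sequence, toy_probabilities)
print(sites, split_at_sites(toy_sequence, sites))
```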

Figure 3.

Example predictions using CleavePred’s website interface. Graphical view of CleavePred results for a protein from the Ascaris suum genome (Pig roundworm, U1M532_ASCSU, 279 AA). Although there are 40 K/R residues along the sequence, only 14 of them are predicted as cleavage sites (colored red, probability >0.5). Each residue is associated with its cleavage prediction. The repeated nature of the sequence is evident. The signal sequence is underlined. An x marks a cleavage site missed by ProP; additional cleavage sites according to NeuroPred are marked +.

We further tested the potential of the ASAP-CleavePred pipeline to predict active peptides from ‘uncharacterized proteins’. We focused on Pfam's Bombesin-like peptide family, which includes sequences from amphibian skin (27%) and mammals (45%). We collected all 59 ‘uncharacterized’ proteins (Figure 4, Supplementary Data S1). We sought to identify cleavage sites regulating the production of short, potentially active peptides (8–14 AA) from the full proproteins. CleavePred identified paired cleavage sites for 24 of these sequences (at a probability threshold >0.5). For the remaining 35 sequences, only cleavage sites at the C′-terminal of the active peptides were predicted (Figure 4).
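The search for paired cleavage sites flanking a short peptide can be sketched as a filter over consecutive predicted sites. This is an assumption-laden illustration rather than the pipeline's actual logic: the function name is hypothetical, site indices are assumed to come from thresholding at 0.5, and peptide length is taken as the number of residues strictly between the two sites.

```python
def paired_sites(sites, min_len=8, max_len=14):
    """Consecutive site pairs enclosing a peptide of min_len..max_len residues.

    Returns (left_site, right_site, peptide_length) tuples, where the length
    counts residues strictly between the two site indices.
    """
    ordered = sorted(sites)
    pairs = []
    for left, right in zip(ordered, ordered[1:]):
        length = right - left - 1
        if min_len <= length <= max_len:
            pairs.append((left, right, length))
    return pairs


# Hypothetical site indices for one proprotein.
print(paired_sites([10, 21, 40]))
```

A sequence with a single predicted site yields no pair, corresponding to the cases where only the C′-terminal site was predicted.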

Figure 4.

Bombesin putative peptides derived from Pfam PF02044 ‘uncharacterized’ proteins. Graphical view of the conserved region from 59 sequences named as ‘uncharacterized’ from Pfam’s model for Bombesin-like peptides (PF02044, 148 sequences). This set includes 23% Neopterygii (new fins fish); the rest are Amniota, including representatives from reptiles, rabbit, elephant and more. For the majority of the sequences, CleavePred identified the overlooked sites. Cleavage confidence at the N′-terminal sites was lower than the cleavage site probabilities at the C′-terminal of the sequences (0.51–0.67 relative to 0.85–0.91, respectively).

When the 59 uncharacterized sequences were analyzed with ProP using a relaxed setting for convertase cleavage-site prediction, only 11 high-confidence sites were reported. None of ProP's results included two adjacent cleavage sites, so ProP would have predicted no active peptides, in contrast to the 24 active peptides that were correctly predicted by CleavePred.

Conclusion

In this study, we presented ASAP, a universal, generic, modular platform for extracting features and predicting local protein properties. ASAP is useful as a bioinformatics platform, allowing extensive analysis of new genomes and novel sequences. This generic framework can be applied to any residue-level problem. In our tutorial (https://github.com/ddofer/asap/-wiki/Getting-Started:-A-Basic-Tutorial), we demonstrate the usability of ASAP in approaching biological problems and obtaining non-trivial results ASAP (i.e. in minutes). In the tutorial, we also demonstrate its use on another biological task, predicting phosphorylated serine. While feature engineering, fine-tuning and parameter optimization are always important, we suggest that ASAP is suited as an entry point for a wide range of prediction tasks.

We combined naive features, feature engineering (e.g. aggregated features), and simple ‘rule based’ patterns (i.e. the canonical ‘known motif’) (32). This combined approach outperformed the state-of-the-art results substantially. Our approach also supports integration of external properties such as structure. This provides superior performance to either individual method.

Analyzing the results of the ASAP pipeline on CleavePred feature selection indicates that regions outside of the ‘canonical’ known motif itself affect whether a putative site is actually cleaved or not. We note the unexpectedly minor, and sometimes negative (in terms of sensitivity), effects of adding structural features to the model, though adding just PSSM-based features did provide a net benefit (Table 2).

We demonstrated the power of ASAP on the specific challenge of precursor protein proteolytic cleavage prediction (CleavePred). The number of substrates of processing enzymes in mammals is broader than anticipated. General convertase enzymes (PCs) regulate many pathways, including lipid homeostasis, neoplastic and infectious diseases (55); as such, PCs are attractive targets for therapeutics (56). For this task, we used a more challenging training and validation set and reported the results on a novel test set (Table 2).

We attribute the superior performance and usability of our results to the feature engineering at the heart of ASAP. CleavePred is extremely fast and suitable for scanning multiple genomes. Given the high cost of pursuing false positives experimentally, the precision of CleavePred allows a focus on only high-confidence candidates for further validation. Recall that CleavePred is suitable for any organism and that its performance is superior to models trained only on specialized subsets (e.g. mammal-model; Table 2). CleavePred provides highly confident predictions for a diverse collection of organisms (Figure 4). The generality of CleavePred in terms of taxonomical coverage distinguishes it from other prediction efforts trained only on selected taxa (e.g. Drosophila, humans).

CleavePred is accessible via a web interface at http://protonet.cs.huji.ac.il/cleavepred.

ASAP and CleavePred are free, open source (https://github.com/ddofer/asap), and come with a simple and well-documented Python API.

Funding

The project is partially supported by ELIXIR Accelerate grant (as part of the ELIXIR-IL). This research was partially funded by H2020, ELIXIR-EXCELERATE grant.

Conflict of interest: None declared.

References

1. Finn R.D., Bateman, Clements et al. (2014) Pfam: The protein families database. Nucleic Acids Res, D222–D230.
2. Mitchell, Chang H.Y., Daugherty et al. (2014) The InterPro protein families database: the classification resource after 15 years. Nucleic Acids Res, D213–D221.
3. Dinkel, Van Roey, Michael et al. (2014) The eukaryotic linear motif resource ELM: 10 years and counting. Nucleic Acids Res, D259–D266.
4. Sigrist C.J.A., De Castro, Langendijk-Genevaux P.S. et al. (2005) ProRule: a new database containing functional and structural information on PROSITE profiles. Bioinformatics, 4060–4066.
5. Radivojac, Clark W.T., Oron T.R. et al. (2013) A large-scale evaluation of computational protein function prediction. Nat. Methods, 221–227.
6. Jiang, Oron T.R., Clark W.T. et al. (2016) An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome Biol, 184.
7. Arighi C.N., Roberts P.M., Agarwal et al. (2011) BioCreative III interactive task: an overview. BMC Bioinformatics, 12 Suppl 8.
8. Petersen T.N., Brunak, von Heijne et al. (2011) SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat. Methods, 785–786.
9. Julenius, Mølgaard, Gupta et al. (2005) Prediction, conservation analysis, and structural characterization of mammalian mucin-type O-glycosylation sites. Glycobiology, 153–164.
10. Biswas A.K., Noman, Sikder A.R. (2010) Machine learning approach to predict protein phosphorylation sites by incorporating evolutionary information. BMC Bioinformatics, 273.
11. M.G., Huang K.Y., C.T. et al. (2014) TopPTM: A new module of dbPTM for identifying functional post-translational modifications in transmembrane proteins. Nucleic Acids Res, D537–D545.
12. Spencer, Eickholt, Cheng (2014) A deep learning network approach to ab initio protein secondary structure prediction. IEEE/ACM Trans. Comput. Biol. Bioinformat, 103–115.
13. Lyons, Dehzangi, Heffernan et al. (2014) Predicting backbone Cα angles and dihedrals from protein sequences by stacked sparse auto-encoder deep neural network. J. Comput. Chem, 2040–2046.
14. Jones D.T., Cozzetto (2015) DISOPRED3: precise disordered region predictions with annotated protein-binding activity. Bioinformatics, 857–863.
15. Cai C.Z.Z., Han L.Y., Z.L. et al. (2003) SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res, 3692–3697.
16. You Z.H., Lei Y.K., Zhu et al. (2013) Prediction of protein-protein interactions from amino acid sequences with ensemble extreme learning machines and principal component analysis. BMC Bioinformatics, 14 Suppl 8, S10.
17. Ofer, Linial (2015) ProFET: feature engineering captures high-level protein functions. Bioinformatics, 3429–3436.
18. Ofer, Linial, Ofer et al. (2014) NeuroPID: a predictor for identifying neuropeptide precursors from metazoan proteomes. Bioinformatics, 931–940.
19. Duckert, Brunak, Blom (2004) Prediction of proprotein convertase cleavage sites. Protein Eng. Des. Sel, 107–112.
20. Kim, Bark, Hook et al. (2011) NeuroPedia: neuropeptide database and spectral library. Bioinformatics, 2772–2773.
21. Hummon A.B., Amare, Sweedler J.V. (2006) Discovering new invertebrate neuropeptides using mass spectrometry. Mass Spectrom. Rev.
22. Wang, Yin et al. (2015) NeuroPep: a comprehensive resource of neuropeptides. Database, 2015, bav038.
23. Tirosh, Ofer, Eliyahu et al. (2013) Short toxin-like proteins attack the defense line of innate immunity. Toxins (Basel), 1314–1331.
24. Karsenty, Rappoport, Ofer et al. (2014) NeuroPID: a classifier of neuropeptide precursors. Nucleic Acids Res, W182–W186.
25. Shiryaev S.A., Chernov A.V., Golubkov V.S. et al. (2013) High-resolution analysis and functional mapping of cleavage sites and substrate proteins of furin in the human proteome. PLoS One, e54290.
26. Cheng, Randall A.Z., Sweredoski M.J. et al. (2005) SCRATCH: a protein structure and structural feature prediction server. Nucleic Acids Res, W72–W76.
27. Magnan C.N., Baldi (2014) SSpro/ACCpro 5: almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity. Bioinformatics, 2592–2597.
28. Wang, Samudrala (2006) Incorporating background frequency improves entropy-based residue conservation measures. BMC Bioinformatics, 385.
29. King J.L., Jukes T.H. (1969) Non-Darwinian evolution. Science, 164, 788–798.
30. Prilusky, Felder C.E., Zeev-Ben-Mordehai et al. (2005) FoldIndex: a simple tool to predict whether a given protein sequence is intrinsically unfolded. Bioinformatics, 3435–3438.
31. Toporik, Borukhov, Apatoff et al. (2014) Computational identification of natural peptides based on analysis of molecular evolution. Bioinformatics, 2137–2141.
32. Southey B.R., Rodriguez-Zas S.L., Sweedler J.V. (2006) Prediction of neuropeptide prohormone cleavages with application to RFamides. Peptides, 1087–1098.
33. Veenstra J.A. (2000) Mono- and dibasic proteolytic cleavage sites in insect neuroendocrine peptide precursors. Arch. Insect Biochem. Physiol.
34. Groitl, Horowitz, Makepeace K.A.T. et al. (2016) Protein unfolding as a switch from self-recognition to high-affinity client binding. Nat. Commun, 10357.
35. Gasteiger, Gattiker, Hoogland et al. (2003) ExPASy: the proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res, 3784–3788.
36. Varshavsky, Fromer, Man et al. (2007) When Less Is More: Improving Classification of Protein Families with a Minimal Set of Global Features. In: Giancarlo,R., Hannenhalli,S. (eds). Algorithms in Bioinformatics. Proceedings of the 7th International Workshop, WABI 2007, Philadelphia, PA, USA, September 8–9, 2007. Springer, Berlin, pp. 12–24.
37. Campen, Williams R.M., Brown C.J. et al. (2008) TOP-IDP-scale: a new amino acid scale measuring propensity for intrinsic disorder. Protein Pept. Lett, 956–963.
38. Atchley W.R., Zhao, Fernandes A.D. et al. (2005) Solving the protein sequence metric problem. Proc. Natl. Acad. Sci. USA, 102, 6395–6400.
39. Georgiev A.G. (2009) Interpretable numerical descriptors of amino acid space. J. Comput. Biol, 703–723.
40. Southey B.R., Sweedler J.V., Rodriguez-Zas S.L. (2008) Prediction of neuropeptide cleavage sites in insects. Bioinformatics, 815–825.
41. Southey B.R., Amare, Zimmerman T.A. et al. (2006) NeuroPred: a tool to predict cleavage sites in neuropeptide precursors and provide the masses of the resulting peptides. Nucleic Acids Res, W267–W272.
42. Boutet, Lieberherr, Tognolli et al. (2007) UniProtKB/Swiss-Prot: the manually annotated section of the UniProt KnowledgeBase. Methods Mol. Biol, 406, –112.
43. Huang, Niu, Gao et al. (2010) CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics, 680–682.
44. Edgar R.C. (2010) Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 2460–2461.
45. Kliger, Gofer, Wool et al. (2008) Predicting proteolytic sites in extracellular proteins: only halfway there. Bioinformatics, 1049–1055.
46. Tegge A.N., Southey B.R., Sweedler J.V. et al. (2008) Comparative analysis of neuropeptide cleavage sites in human, mouse, rat, and cattle. Mamm. Genome, 106–120.
47. Pedregosa, Varoquaux, Gramfort et al. (2011) Scikit-learn: machine learning in python. J. Mach. Learn. Res, 2825–2830.
48. Amare, Hummon A.B., Southey B.R. et al. (2006) Bridging neuropeptidomics and genomics with bioinformatics: prediction of mammalian neuropeptide prohormone processing. J. Proteome Res, 1162–1167.
49. Breiman (1999) Random forest. Mach. Learn.
50. B.Q., Cai Y.D., Feng K.Y. et al. (2012) Prediction of protein cleavage site with feature selection by random forest. PLoS One, e45854.
51. Artimo, Jonnalagedda, Arnold et al. (2012) ExPASy: SIB bioinformatics resource portal. Nucleic Acids Res, W597–W603.
52. Clynen, Liu, Husson S.J. et al. (2010) Bioinformatic approaches to the identification of novel neuropeptide precursors. Methods Mol. Biol, 615, 357–374.