Abstract

Lysine lipoylation is a special type of posttranslational modification studied in the proteomics of both prokaryotes and eukaryotes. This modification takes part in several significant biological processes and plays a key role at the cellular level. To construct an accurate classification algorithm for identifying lipoylation sites at the protein level, computational approaches should be taken into account. Several factors play different roles in the identification of modification sites; the two foundational elements are an informative feature description and a highly effective classifier. With these two elements, distinguishing lipoylation samples from nonlipoylation samples can be treated as a typical classification problem in machine learning. In this work, we propose a method named LipoFNT, which employs two feature sets, the Position-Specific Scoring Matrix and bi-profile Bayesian, as the classification features. The flexible neural tree algorithm is then used to deal with the imbalanced classification problem in the lipoylation sample dataset. The proposed method achieves 81.07% in sensitivity (Sn), 80.29% in specificity (Sp), 80.68% in accuracy (Acc), 0.8076 in F1, and 0.6136 in MCC. We also demonstrate the relationship between peptide length and the identification of modification sites.

1. Introduction

Lysine lipoylation is one of the most significant modifications in biology and is highly conserved. It is a special type of posttranslational modification (PTM) studied in the proteomics of both prokaryotes and eukaryotes [1]. Lipoylation is the covalent attachment of lipoic acid to 2-oxoacid dehydrogenase multienzyme complexes [2–5]. At the level of protein sequence, it differs from other PTM types, which depend on the local amino acid residues: because lipoylation is so highly conserved, it is hardly influenced by the neighboring amino acid residues [6]. Lysine lipoylation, an evolutionarily conserved process, appears in various enzymes, including pyruvate dehydrogenase and related enzymes, in many organisms from bacteria to mammals [7–9]. Lipoylation also plays a significant role in many key metabolic pathways and protein interactions [10]. Over several years, important studies have reported that the modification is related to several human diseases, including metabolic disorders, cancer, viral infection, and Alzheimer's disease [11–15]. For these reasons, discovering the biological function of this modification can help, to some degree, in understanding the causes of these serious diseases. Nevertheless, large numbers of lipoylation sites can hardly be identified effectively and accurately. Without identifying the modification sites, the molecular functions of lipoylation can hardly be discovered and studied. The identification of lipoylation sites is therefore one of the urgent topics in the related fields.

Lipoylation is one of the rare but highly conserved lysine PTM types. As research on lipoylation has developed, some important findings have been reported. One of them is that there are only four types of lipoylated multimeric metabolic enzymes in mammals, and the majority of these proteins lie at the core of the metabolic landscape. Dysregulation of these mitochondrial proteins has been reported to cause, to some degree, human metabolic disorders and other diseases. The most striking feature is the lipoylation itself: further in-depth study of this highly conserved lysine modification shows that both its addition and its removal are evolutionarily conserved at the protein level across the majority of species. In short, lipoylation provides one of the most significant essential cofactors in biology. We therefore describe the biological functions and significance of this modification below. For these reasons, understanding the regulation of this modification is a necessary element in the research of human diseases.

In terms of function, lipoamide is a cofactor central to cellular metabolism [7, 16]. Lipoylation is a conserved lysine PTM on essential multimeric metabolic complexes, and this functional group is needed for the enzymatic activity of these protein complexes [17, 18]. For instance, the pyruvate dehydrogenase (PDH) and alpha-ketoglutarate dehydrogenase (KDH) complexes both regulate distinct carbon entry points into the TCA cycle, a key metabolic pathway. For both complexes, lipoylation plays a critical role in proper enzyme function, and removing this lysine modification may inhibit their activities to some degree. It has been reported that lipoylated enzymes are evolutionarily conserved across a variety of species and contribute to several core metabolic pathways at the level of the organism [8, 9]. This theme of conservation extends to the lipoylated complexes themselves [19, 20]. Given the striking evolutionary conservation of this rare lysine modification, it has been noted that these modified enzymes contribute both to the maintenance of health and to several serious diseases [12, 13, 21].

To better understand the molecular mechanisms of lipoylation, the identification of modification sites can be treated as a classification problem in which the positive and negative samples differ greatly in number. Some experimental and biological approaches have been proposed in this field. However, they can hardly meet the needs, because they are time-consuming and resource-intensive. Several types of PTM sites, including phosphorylation [22–24], S-nitrosylation [25–28], succinylation [29, 30], hydroxylation [31, 32], crotonylation [33, 34], sumoylation [35], glycosylation [36], ubiquitination [37], prenylation [38], carbonylation [39], and methylation [40–45], have been successfully classified with in silico methods. These successful instances point to several key elements of such a classification problem: feature evaluation, model construction, classification model selection, and the measurement of classification performance. In addition, the imbalance of the dataset, in which negative samples far outnumber positive ones, should be considered.

To construct an accurate classification algorithm for identifying lipoylation sites at the protein level, the foundational elements, as far as the surveyed research shows, are an informative feature description and a highly effective classifier. With these two elements, distinguishing lipoylation samples from nonlipoylation samples can be treated as a typical classification problem in machine learning. In this work, we employ two feature sets, the Position-Specific Scoring Matrix (PSSM) and bi-profile Bayesian, as the classification features. The flexible neural tree (FNT) algorithm is then used to deal with the imbalanced classification problem in the lipoylation sample dataset. By comparing against other feature sets and other machine learning models, we find that the proposed method performs better than other state-of-the-art methods in the field of PTM site identification. Moreover, we demonstrate the relationship between peptide length and the identification of modification sites. The steps are shown in Figure 1 and are introduced one by one in the following sections (http://121.250.173.184/).

2. Materials and Methods

2.1. Dataset

All protein sequences were collected from the UniProt database (http://www.uniprot.org/), yielding 576 lipoylated protein sequences. Because high sequence similarity must be taken into account, redundancy reduction was applied: protein sequences with similarity higher than 40% were removed with the CD-HIT program [58, 59]. This procedure gave a nonredundant sample set of 44 lipoylated proteins covering 52 lipoylation sites and 1035 nonmodified lysine sites. To discard uninformative protein segments, we used a sliding window centered on every lysine residue in the protein sequences. The size of the sliding window is examined in this work, since we want to find the relationship between window size and classification performance. Some positions in a sliding window may be empty when the lysine lies near a terminus of the protein. In that case, a placeholder amino acid symbol stands for each empty position in the sample peptide segment.
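The windowing step described above can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the implementation used in this work: the function name `extract_windows` and the placeholder symbol `X` are hypothetical choices made here.

```python
def extract_windows(sequence, radius, pad="X"):
    """Return (position, peptide) pairs for every lysine (K) in `sequence`.

    Each peptide has length 2 * radius + 1 and is centered on the lysine.
    Positions beyond either terminus are filled with `pad`, a hypothetical
    placeholder symbol chosen for this sketch.
    """
    padded = pad * radius + sequence + pad * radius
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # Residue i sits at index i + radius in `padded`, so the slice
            # [i, i + 2 * radius + 1) is centered on it.
            windows.append((i, padded[i:i + 2 * radius + 1]))
    return windows


if __name__ == "__main__":
    # Toy sequence, not taken from the dataset.
    for position, peptide in extract_windows("MKAVLKDE", radius=3):
        print(position, peptide)
```

A radius of 11 corresponds to a window length of 23; each peptide would then be labeled positive or negative according to whether its central lysine is an annotated lipoylation site.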

2.2. Feature Construction

The first feature set is the PSSM information of the protein samples. In bioinformatics, one of the most significant and challenging issues in processing biological sequences is how to represent them, and the available options include discrete methods and vector methods. These methods differ in how much sequence information and how many key pattern properties they keep. It has been pointed out that vector methods keep only basic information and lose several sequence patterns at the protein level. To avoid losing such information, the pseudo amino acid composition [60, 61], or PseAAC [62], was utilized in this work. This model has been widely used in processing biological sequences at the protein, DNA, and RNA levels [63–66]. "Pse-in-One" [67] and its updated version "Pse-in-One2.0" [68] are among the most powerful tools in this area [68, 69].
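One common way to turn PSSM information into a fixed-length feature vector for a candidate site is to take the PSSM rows inside the sliding window and concatenate them. The Python sketch below illustrates this idea only; it assumes the PSSM is already available as a list of rows (one row of 20 scores per residue), and the function name and the zero-filling of out-of-range positions are assumptions of this sketch rather than details given in this work.

```python
def pssm_window_features(pssm, center, radius):
    """Flatten the PSSM rows in a window centered on residue `center`.

    `pssm` is a list of rows, one per residue, each holding the scores for
    that position. Rows that would fall outside the protein are replaced
    by zeros (an assumption of this sketch).
    """
    n_cols = len(pssm[0])
    features = []
    for i in range(center - radius, center + radius + 1):
        if 0 <= i < len(pssm):
            features.extend(pssm[i])
        else:
            features.extend([0.0] * n_cols)
    return features
```

For a window of length 23 and 20 columns per row, this yields a 460-dimensional vector per candidate lysine.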

The second feature set is the BPB feature set, a relatively new encoding method [70] based on Bayes' theorem. Consider a sample, that is, a peptide segment $S = s_1 s_2 \cdots s_n$ containing $n$ amino acid residues. The sample belongs to one of two classes, positive or negative, which we denote by $C_p$ and $C_n$, respectively. $C_p$ means that the central lysine residue of the peptide segment is lipoylated, and $C_n$ means that the central lysine residue is not lipoylated. By Bayes' rule, the posterior probabilities of the peptide for the two classes are

$$P(C_p \mid S) = \frac{P(S \mid C_p)\,P(C_p)}{P(S)}, \tag{1}$$

$$P(C_n \mid S) = \frac{P(S \mid C_n)\,P(C_n)}{P(S)}. \tag{2}$$

Assuming that the $n$ amino acid residues are mutually independent, so that $P(S \mid C) = \prod_{j=1}^{n} P(s_j \mid C)$, these can be rewritten as

$$\log P(C_p \mid S) = \sum_{j=1}^{n} \log P(s_j \mid C_p) + \log \frac{P(C_p)}{P(S)}, \tag{3}$$

$$\log P(C_n \mid S) = \sum_{j=1}^{n} \log P(s_j \mid C_n) + \log \frac{P(C_n)}{P(S)}. \tag{4}$$

We assume that the prior distribution is uniform, so the prior probabilities of negative and positive samples are equal, $P(C_p) = P(C_n)$. The decision function is then

$$f(S) = \operatorname{sgn}\!\left( \sum_{j=1}^{n} \log P(s_j \mid C_p) - \sum_{j=1}^{n} \log P(s_j \mid C_n) \right). \tag{5}$$

Following Shao's method, (5) can be rewritten as

$$f(S) = \operatorname{sgn}(\mathbf{w} \cdot \mathbf{p}), \tag{6}$$

where $\mathbf{p} = (p_1, \ldots, p_n, p_{n+1}, \ldots, p_{2n})$ with $p_j = \log P(s_j \mid C_p)$ and $p_{n+j} = \log P(s_j \mid C_n)$ for $j = 1, \ldots, n$, and $\mathbf{w} = (1, \ldots, 1, -1, \ldots, -1)$ consists of $n$ entries equal to $1$ followed by $n$ entries equal to $-1$.
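The bi-profile idea can be sketched in Python as follows. This is a minimal illustration under several assumptions, not the code used in this work: position-specific residue probabilities are estimated separately from the positive and the negative training peptides, a pseudocount of 1 avoids zero probabilities, the placeholder symbol `X` is hypothetical, and the feature vector keeps the raw probabilities while the decision function uses their logarithms under a uniform prior.

```python
import math

# 20 standard residues plus a hypothetical placeholder symbol "X".
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWYX"


def position_frequencies(peptides, pseudocount=1.0):
    """Estimate P(residue | position) from equal-length peptides."""
    length = len(peptides[0])
    tables = []
    for j in range(length):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for peptide in peptides:
            counts[peptide[j]] += 1.0
        total = sum(counts.values())
        tables.append({aa: c / total for aa, c in counts.items()})
    return tables


def bpb_encode(peptide, pos_tables, neg_tables):
    """Return the 2n-dimensional bi-profile vector of a peptide."""
    pos_part = [pos_tables[j][aa] for j, aa in enumerate(peptide)]
    neg_part = [neg_tables[j][aa] for j, aa in enumerate(peptide)]
    return pos_part + neg_part


def bayes_decision(peptide, pos_tables, neg_tables):
    """Sign of the log-likelihood difference, assuming a uniform prior."""
    score = sum(
        math.log(pos_tables[j][aa]) - math.log(neg_tables[j][aa])
        for j, aa in enumerate(peptide)
    )
    return 1 if score >= 0 else -1
```

In a pipeline of this kind, the vector returned by `bpb_encode` would be passed to the classifier; `bayes_decision` only shows how a naive Bayes rule follows from the same two profiles.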

2.3. Flexible Neural Tree

The flexible neural tree, a special type of tree-structured neural network, was proposed by Chen [71, 72]. The model constructs a neural network in the form of a tree, and this type of network has been widely used for classification problems in machine learning. The main steps of the algorithm are described below.

Initially, the instruction set used to generate the FNT model is

$$S = F \cup T = \{+_2, +_3, \ldots, +_N\} \cup \{x_1, x_2, \ldots, x_n\}, \tag{7}$$

where the instruction set $S$ contains two subsets, the operation set $F$ and the variable set $T$. The operation set contains the operators $+_i$, each of which takes $i$ arguments, and the variable set contains the input variables. The operation set is used mainly in the nonleaf nodes of the tree-structured network, and the variable set mainly in the leaf nodes. In other words, the variables are the inputs of the neural nodes, and the operators are the neural nodes themselves. The flexible activation function employed is

$$f(a_i, b_i, x) = \exp\!\left( -\left( \frac{x - a_i}{b_i} \right)^{2} \right). \tag{8}$$

Next, the output of each neural node is calculated recursively. For each operation set element $+_i$, the total excitation is

$$net_i = \sum_{j=1}^{i} w_j x_j, \tag{9}$$

where $x_j$ $(j = 1, 2, \ldots, i)$ are the inputs to node $+_i$ and $w_j$ are the corresponding connection weights. The output of node $+_i$ is then

$$out_i = f(a_i, b_i, net_i) = \exp\!\left( -\left( \frac{net_i - a_i}{b_i} \right)^{2} \right). \tag{10}$$
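The recursive evaluation of a flexible neural tree can be sketched in Python as follows. The tuple-based tree representation and the function names are assumptions made for this illustration; the sketch covers only the forward computation of a given tree and does not include the optimization of the tree structure or of the parameters.

```python
import math


def flexible_activation(a, b, x):
    """Flexible activation f(a, b, x) = exp(-((x - a) / b) ** 2)."""
    return math.exp(-((x - a) / b) ** 2)


def evaluate(node, inputs):
    """Recursively compute the output of a flexible neural tree.

    A leaf is ("x", index) and returns inputs[index].
    An operator node is ("+", a, b, weights, children): it forms the
    weighted sum of its children's outputs (the total excitation) and
    passes it through the flexible activation with parameters a and b.
    """
    if node[0] == "x":
        return inputs[node[1]]
    _, a, b, weights, children = node
    net = sum(w * evaluate(child, inputs) for w, child in zip(weights, children))
    return flexible_activation(a, b, net)


if __name__ == "__main__":
    # A toy tree with hypothetical parameters: a root with two arguments,
    # one of which is itself an operator node over two input variables.
    tree = ("+", 0.5, 1.0, [0.4, 0.6], [
        ("x", 0),
        ("+", 0.0, 2.0, [1.0, -1.0], [("x", 1), ("x", 2)]),
    ])
    print(evaluate(tree, [0.2, 0.7, 0.7]))
```

For binary classification, the output of the root node could be thresholded to assign a sample to the positive or negative class; the threshold itself would be a further design choice.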

2.4. Performance Measurements

To assess model performance, we use several well-known measurements for the identification of modification sites: sensitivity, specificity, accuracy, F1 score, and the Matthews Correlation Coefficient (MCC) [73, 74]. In addition, the AUC [75] is employed to test performance on the imbalanced classification problem, in which the number of negative samples is much larger than the number of positive ones.

In this classification problem, samples are of two types, positive and negative, so classification leads to four possible outcomes. If a modification sample is classified as a modification sample, the result is a true positive (TP). If a modification sample is classified as a nonmodification sample, the result is a false negative (FN). Likewise, a nonmodification sample classified as a modification sample is a false positive (FP), and a nonmodification sample classified as a nonmodification sample is a true negative (TN). From the numbers of TP, TN, FP, and FN, we obtain the sensitivity, specificity, accuracy, F1 score, and MCC as

$$Sn = \frac{TP}{P} = \frac{TP}{TP + FN}, \tag{11}$$

$$Sp = \frac{TN}{N} = \frac{TN}{TN + FP}, \tag{12}$$

$$Acc = \frac{TP + TN}{P + N}, \tag{13}$$

$$F1 = \frac{2\,TP}{2\,TP + FP + FN}, \tag{14}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}, \tag{15}$$

where $P = TP + FN$ means the number of positive samples and $N = TN + FP$ means the number of negative samples.
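These measurements can be computed directly from the four counts. The Python sketch below uses the standard definitions; the function name is an assumption of this illustration, and the sketch assumes that both classes are present so that no denominator of Sn, Sp, or Acc is zero.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return Sn, Sp, Acc, F1, and MCC from the confusion-matrix counts."""
    sn = tp / (tp + fn)                    # sensitivity (recall on positives)
    sp = tn / (tn + fp)                    # specificity (recall on negatives)
    acc = (tp + tn) / (tp + tn + fp + fn)  # overall accuracy
    f1 = 2 * tp / (2 * tp + fp + fn)       # harmonic mean of precision and Sn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "F1": f1, "MCC": mcc}


if __name__ == "__main__":
    # Hypothetical counts, used only to exercise the function.
    print(classification_metrics(tp=40, tn=45, fp=5, fn=10))
```

Unlike accuracy, MCC uses all four counts symmetrically, which is why it is often reported for imbalanced datasets.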

3. Results and Comparisons

3.1. Performance of LipoFNT

In this section, we look for the most suitable length of the sliding window for each sample. We employ several window lengths, ranging from 3 to 29, each centered on a lysine residue, so the radius of each sample ranges from 1 to 14. The ROC curves for each length are shown in Figure 2.

Figure 2 shows that the 14 window lengths play different roles in the classification of this modification type. Because this is a typical imbalanced classification problem in machine learning, the ROC (receiver operating characteristic) curve is a reasonable measurement to use. The AUC, which is the area under the ROC curve, reaches its highest value when the length is equal to 23. We therefore conclude that 23 is the most suitable of the lengths tested with the FNT method and the combination of PSSM and BPB features.
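The selection of the window length by AUC can be sketched in Python as follows. The AUC is computed here through its rank interpretation, the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function names are assumptions of this illustration, and the sketch assumes that both classes are present.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly.

    `labels` holds 1 for positive samples and any other value for negative
    samples; `scores` holds the classifier outputs. Ties count as 1/2.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y != 1]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def best_window_length(auc_by_length):
    """Return the window length with the highest AUC."""
    return max(auc_by_length, key=auc_by_length.get)
```

In use, `roc_auc` would be evaluated once per window length on held-out predictions, and the resulting dictionary passed to `best_window_length`. The pairwise loop is quadratic in the number of samples, which is acceptable for a dataset of this size.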

To demonstrate the performance of the algorithm, we compare it with several typical feature descriptions and with several state-of-the-art methods in this field.

Table 1 compares the proposed algorithm with several typical feature description methods, including binary encoding, amino acid composition, grouping amino acid composition, physicochemical properties, KNN features, secondary structure tendency, Bi-gram [76], and Tri-gram [77]. As Table 1 shows, the proposed method achieves 81.07% in Sn, 80.29% in Sp, 80.68% in Acc, 0.8076 in F1, and 0.6136 in MCC. These typical and classical features play various roles in this classification problem. However, they can hardly close the gap between sensitivity and specificity.

Table 2 compares the proposed algorithm with several state-of-the-art methods, including DNABIND, DNAbinder, DBD-Threader, DBPPred, and other approaches in this field. In this comparison, BRABSB achieves the highest sensitivity and Phosida achieves the highest specificity. The proposed algorithm achieves its best performance when the length is equal to 23.

4. Conclusions and Discussions

In this study, a novel predictor named LipoFNT was developed to predict lysine lipoylation sites using bi-profile Bayes feature encoding and the flexible neural tree algorithm. To the best of our knowledge, this is the first time the flexible neural tree has been used to classify lipoylation and nonlipoylation samples. The experimental results show that LipoFNT performs well and could be a useful bioinformatics algorithm for the accurate identification of lipoylation sites.

The results above identify three candidate lengths among all the lengths employed in this work: 19, 23, and 25. In this section, we discuss the performance at these three lengths and compare it with some state-of-the-art methods and features. The detailed information is shown in Tables S1–S6. The F-score of the BPB features [78, 79] was calculated for each sample, as shown in Table 3. Among the candidate lengths, the most suitable is 23, at which the proposed method achieves good performance.

Several significant elements of lipoylated lysine site identification should also be taken into account. First, reasonable and effective features should be discovered and described for this classification problem, since the features strongly influence how samples are evaluated. Second, the classification model should be fast and accurate. A good model can overcome some shortcomings and limitations of the features; in other words, it can discard redundant and useless features and make more effective use of the key ones. Last but not least is the selection of a suitable sample length, which can reduce useless features and the influence of uninformative neighboring amino acid residues.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Authors’ Contributions

Wenzheng Bao conceived the method. Rong Bao designed the method. Yuehui Chen conducted the experiments, Wenzheng Bao wrote the main manuscript text, and Bin Yang designed the website of this algorithm. All authors reviewed the manuscript.

Acknowledgments

This work was supported by the grants of the National Science Foundation of China, Nos. 61873270 and 61702445, and the grant from the Ph.D. Programs Foundation of Ministry of Education of China (No. 20120072110040).

Supplementary Materials

The supplementary material includes Tables S1 to S6 and Figures S1 to S14.