Abstract

The cloud computing paradigm is growing rapidly. It lets users obtain services over the Internet on a pay-per-use basis and is convenient for developing, deploying, and accessing mobile applications. Security is a pressing concern owing to the open and distributed nature of the cloud, and the copious amounts of data it holds attract attackers. Developing an effective intrusion detection system (IDS) is therefore an imperative task. This article analyzes four intrusion detection systems for the detection of attacks. Two standard benchmark datasets, namely, NSL-KDD and UNSW-NB15, were used for the simulations. Additionally, this study highlights the growing challenges for the security of sensitive user data and gives recommendations to address the identified issues. Finally, the results show that the hybrid methods that use a support vector machine classifier outperform the other techniques on the datasets investigated.

1. Introduction

Cloud computing is defined as an Internet-based computing platform in which virtually shared servers provide software, platform, infrastructure, policies, and other functions [1]. Users adopt it to reduce overall cost and complexity. It is gaining popularity owing to on-demand service provision, flexible resource allocation, higher fault tolerance, and higher scalability. Various cloud service providers (CSPs), including Google, Amazon, and Microsoft, use virtualization technologies with self-service capabilities. Virtualization is a prerequisite of cloud computing [2]. The rapid growth of IT leads to daily increases in data [3]. Attackers have taken advantage of cloud computing because it produces copious amounts of data, reported as greater than 665 Gb/s [4]. The huge volume of data generated by the cloud has become its biggest problem because it has made the cloud a target for attackers [5]. Hackers are drawn to the cloud by its open and distributed nature and by the amount of traffic it produces [6]. Attackers can interrupt the services of the users, misuse sensitive information, and misuse the services and resources provided by the CSP. An intrusion is an attack that misuses the private or sensitive information of the users or consumes resources such as CPU, bandwidth, and storage. Traditional security measures such as firewalls are not sufficient, so a dedicated system is needed to protect the users. An intrusion detection system (IDS) detects attacks in the network by analyzing the network data. There are two main categories of IDS based on the deployment strategy: host-based IDS and network-based IDS [7, 8]. A host-based IDS detects attacks by monitoring the host system only, whereas a network-based IDS analyzes the whole network. In the host-based case, every node in the cloud has its own IDS and storage [9].

A host-based IDS based on statistics and probability theory is proposed in Ref. [10]. SNORT-based detection is performed in the Eucalyptus cloud in Ref. [11]. The network-based IDS proposed in Ref. [12] comprises an intrusion detection system management unit and an intrusion detection system sensor. Distributed intrusion detection systems are also growing with time, as they merge the characteristics of both of the abovementioned IDS types [13]. Two further types of IDS are distinguished by their detection mechanism: signature-based IDS and anomaly-based IDS. A signature-based IDS detects attacks by comparing network activity with the attack signatures stored in a database. An anomaly-based IDS detects attacks by analyzing the dynamic activities in the network. In an anomaly-based IDS, a profile is created by observing the activities of the users and applications during a particular period [14, 15].
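The contrast between the two detection mechanisms can be illustrated with a minimal Python sketch. The signature strings, the requests-per-minute feature, and the three-standard-deviation threshold are illustrative assumptions and are not taken from any of the cited systems.

```python
from statistics import mean, stdev

# Hypothetical signature database: byte patterns of known attacks (illustrative only).
SIGNATURES = {b"/etc/passwd", b"' OR '1'='1"}


def signature_match(payload: bytes) -> bool:
    """Signature-based check: flag the payload if it contains a stored pattern."""
    return any(sig in payload for sig in SIGNATURES)


def build_profile(observations):
    """Anomaly-based profile: mean and standard deviation of normal activity."""
    return mean(observations), stdev(observations)


def is_anomalous(value, profile, k=3.0):
    """Flag a value that deviates more than k standard deviations from the profile."""
    mu, sigma = profile
    return abs(value - mu) > k * sigma


# Requests per minute observed during a (hypothetical) period of normal activity.
profile = build_profile([98.0, 102.0, 100.0, 97.0, 103.0])
```

A signature check cannot flag a pattern that is absent from its database, whereas the profile-based check can flag behavior it has never seen before, at the cost of false alarms when legitimate activity changes.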

Numerous researchers have used data mining and machine learning approaches [16]. Zero-day attacks are the biggest concern for the cloud [17]. Classifiers based on machine learning are commonly used to separate attack packets from normal packets [18]. Another emerging technique is association rule mining [19]. Artificial neural networks are widely used because of their ability to work on incomplete datasets [20]. Some researchers have highlighted the importance of machine learning algorithms for intrusion detection in the cloud, given the scalability and elasticity of the cloud computing paradigm [21–24]. Different optimization algorithms such as the genetic algorithm [25], particle swarm optimization [26], harmony search [27], and artificial bee colony [28] are also used with various classifiers for categorizing the attack packets and normal packets of the network.

The main contributions of the article are as follows:
(i) Identified the methodologies followed by different intrusion detection systems for the cloud computing environment, along with the attacks they consider.
(ii) Compared four existing intrusion detection systems for the detection of attacks.
(iii) Compared the various attacks of two standard benchmark datasets: the NSL-KDD dataset and the UNSW-NB15 dataset.
(iv) Summarized the study of various existing intrusion detection systems for the cloud computing environment, presented our experimental results, and identified which methodology performed best in our comparative analysis.
(v) Described the remaining challenges in cloud security and suggested possible recommendations for addressing them.

The remainder of the article is structured as follows: Section 2 reviews the literature. Section 3 describes the methodology. Section 4 presents the experiments and comparative analysis. Section 5 presents the future scopes and recommendations for the cloud computing environment. Conclusions are presented in Section 6.

2. Literature Review

This section reviews journal papers related to intrusion detection in the cloud computing environment. The review is presented in tabular form: Table 1 summarizes the reviewed papers and suggests possible future scopes for each of them.

Additionally, we have compared our survey article with other recent survey papers. Table 2 shows how our survey differs from the other surveys and highlights its novelty.

3. Methodology

This section describes our methodology, which is implemented in three modules: preprocessing, classification, and evaluation. We have used four existing methodologies for the detection of attacks. Three of them are applied to the cloud computing environment, and the fourth is applied to a general network, which strengthens the comparison. We chose these four methodologies because they include the popular classifiers for intrusion detection, and one of them also uses an optimization algorithm. Comparing them therefore covers both plain and optimization-assisted hybrid classifiers.

3.1. Dataset

We have used two standard benchmark datasets for the comparative analysis: the NSL-KDD dataset [52] and the UNSW-NB15 dataset [53].

3.1.1. UNSW-NB15 Dataset

The UNSW-NB15 dataset was created to overcome the drawbacks of the NSL-KDD dataset. It contains low-footprint attack characteristics and some traffic schemes, and there is no discrepancy between the distributions of its datasets. The dataset contains 49 features, the last two of which represent the category and the label (0 for normal and 1 for attack records). Figure 1 shows a pie chart of the distribution of the various classes in the UNSW-NB15 dataset.

3.1.2. NSL-KDD Dataset

The NSL-KDD dataset is a publicly available refinement of the KDD-CUP 1999 dataset. It does not contain redundant records in the training and testing sets. There is no requirement for creating subsets of the dataset for experimentation purposes. Figure 2 shows a pie chart of the class distribution of the NSL-KDD dataset.

3.2. Preprocessing

Raw datasets can lead to high false-alarm rates [54]. Datasets used for classification include various attributes, which can be numeric or non-numeric. Symbolic (non-numeric) attributes must be converted to a numeric form that the classifiers can interpret. We preprocessed the raw datasets and converted every attribute to numeric form. For example, in the NSL-KDD dataset, attribute 41 is of no use for classification, so we did not consider it. Attributes that have no importance for classification only increase the computation time and are therefore excluded from the dataset.
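The conversion step can be sketched as follows. This is a minimal illustration, assuming integer coding of symbolic values; the function name `encode_symbolic`, the toy records, and the choice of dropped column are hypothetical and do not reproduce the exact preprocessing used in the experiments.

```python
def encode_symbolic(rows, drop_columns=()):
    """Map non-numeric attributes to integer codes and drop unused columns.

    rows: list of records (lists) mixing numeric and string values.
    drop_columns: zero-based column indices to exclude.
    Returns the numeric rows and the per-column code tables.
    """
    drop = set(drop_columns)
    codebooks = {}  # column index -> {symbol: integer code}
    encoded = []
    for row in rows:
        out = []
        for col, value in enumerate(row):
            if col in drop:
                continue
            if isinstance(value, str):
                book = codebooks.setdefault(col, {})
                # Assign the next free code the first time a symbol is seen.
                value = book.setdefault(value, len(book))
            out.append(float(value))
        encoded.append(out)
    return encoded, codebooks


# Toy records: duration, protocol, service, source bytes, unused attribute.
raw = [
    [0, "tcp", "http", 181, 21],
    [2, "udp", "domain_u", 105, 18],
    [0, "tcp", "smtp", 239, 20],
]
numeric, books = encode_symbolic(raw, drop_columns=[4])
```

Integer codes impose an arbitrary order on the symbols; one-hot encoding is a common alternative when the classifier is sensitive to that order.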

3.3. Classification

Classification of the dataset into normal and attack packets plays an important role in providing security to the cloud computing environment. Classification can be a binary classification or multiclass classification. Binary classification results in two classes. Multiclass classification results in more than two classes. We have performed multiclass classification. For the classification, we have implemented four existing intrusion detection methodologies. The four methodologies are described next.

3.3.1. FCM-ANN

This methodology is implemented in four modules [33]. The flowchart of the methodology is shown in Figure 3.

(1) Preprocessing Module. The raw dataset is preprocessed, and the dataset is converted into a form that is easily analyzed by the classifier.

(2) FCM Module. This module groups the dataset into clusters. The clusters are created from the membership degrees, and the FCM objective function is represented [33], in its standard form, by the following equation:

$$J_m = \sum_{i=1}^{N} \sum_{j=1}^{K} U_{ij}^{m} \, \lVert x_i - c_j \rVert^{2},$$

where $N$ is the number of elements, $K$ is the number of clusters, $m$ is a real number with $m > 1$, $c_j$ is the center of the $j$th cluster, and $U_{ij}$ is the degree of membership of the data point $x_i$ in the $j$th cluster.

The output of this module is a set of clusters with homogeneity within each cluster and heterogeneity between the clusters.
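The alternating updates of fuzzy c-means can be sketched in Python as follows. This is the standard textbook algorithm and not necessarily the exact variant used in Ref. [33]; the function names, the default fuzzifier m = 2, the convergence tolerance, the explicit initial centers, and the toy data are assumptions made for illustration.

```python
import math


def memberships(points, centers, m):
    """Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1))."""
    exponent = 2.0 / (m - 1.0)
    table = []
    for x in points:
        # Clamp distances away from zero so a point sitting on a center is safe.
        d = [max(math.dist(x, c), 1e-12) for c in centers]
        table.append([1.0 / sum((dj / dk) ** exponent for dk in d) for dj in d])
    return table


def update_centers(points, u, m):
    """Center update: c_j = sum_i u_ij^m x_i / sum_i u_ij^m."""
    centers = []
    for j in range(len(u[0])):
        weights = [row[j] ** m for row in u]
        total = sum(weights)
        centers.append([
            sum(w * x[a] for w, x in zip(weights, points)) / total
            for a in range(len(points[0]))
        ])
    return centers


def fuzzy_c_means(points, init_centers, m=2.0, n_iter=100, tol=1e-6):
    """Alternate the two updates until the centers stop moving."""
    centers = [list(c) for c in init_centers]
    for _ in range(n_iter):
        new_centers = update_centers(points, memberships(points, centers, m), m)
        shift = max(math.dist(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships(points, centers, m)


# Toy two-dimensional data with two well-separated groups.
data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers, u = fuzzy_c_means(data, init_centers=[(0.0, 0.0), (5.0, 5.0)])
```

Each row of `u` sums to 1, so every record belongs to every cluster to some degree; a hard cluster assignment, if needed, is the index of the largest entry in the row.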

(3) ANN Module. This module classifies the clusters generated by the fuzzy c-means algorithm. The backpropagation algorithm is commonly used for training neural networks [55]. In this module, the pattern of each cluster is learned by training a feed-forward neural network with the backpropagation algorithm. A feed-forward neural network has an input layer, an output layer, and one or more hidden layers. The input given to node $k$ of the hidden layer is $\mathrm{ln}(k)$, and it is given [33] by

$$\mathrm{ln}(k) = b_k + \sum_{i} x_i \, w_{ik},$$

where $b_k$ is the bias of the hidden layer, $x_i$ is the input given to node $i$ of the input layer, and $w_{ik}$ is the weight between the input layer and the hidden layer.

The activation function, which processes $\mathrm{ln}(k)$, is the sigmoid function. It is given [33] by the following equation:

$$f(z) = \frac{1}{1 + e^{-z}}.$$

The result of the activation function, $f(\mathrm{ln}(k))$, is sent to all the neurons of the output layer. The output is given [33] by the following equation:

$$y_j = b_j + \sum_{k} w_{kj} \, f(\mathrm{ln}(k)),$$

where $y_j$ is the output of node $j$ of the output layer, $b_j$ is the bias of the output layer, $w_{kj}$ is the weight between the hidden layer and the output layer, and $f(\mathrm{ln}(k))$ is the activated output of hidden node $k$.
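The forward pass described in this module can be sketched as follows. The sketch assumes a single hidden layer and a linear output layer, as the description above suggests; the function and parameter names and the toy weights are hypothetical, and training by backpropagation is omitted.

```python
import math


def sigmoid(z):
    """Logistic activation f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One forward pass through a single-hidden-layer feed-forward network.

    w_hidden[i][k]: weight from input node i to hidden node k.
    w_out[k][j]:    weight from hidden node k to output node j.
    """
    # Net input of each hidden node: bias plus weighted sum of the inputs.
    net_hidden = [
        b_hidden[k] + sum(x[i] * w_hidden[i][k] for i in range(len(x)))
        for k in range(len(b_hidden))
    ]
    activated = [sigmoid(v) for v in net_hidden]
    # Output of each output node: bias plus weighted sum of hidden activations.
    return [
        b_out[j] + sum(activated[k] * w_out[k][j] for k in range(len(activated)))
        for j in range(len(b_out))
    ]


# Toy weights chosen so that both hidden nodes receive a net input of zero.
y = forward(
    x=[1.0, 2.0],
    w_hidden=[[0.5, -0.5], [0.25, 0.25]],
    b_hidden=[-1.0, 0.0],
    w_out=[[1.0], [2.0]],
    b_out=[0.5],
)
```

With these weights each hidden activation is f(0) = 0.5, so the single output is 0.5 + 0.5 × 1 + 0.5 × 2 = 2.0.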

(4) Aggregation Module. The last module combines the intermediate results of all the artificial neural networks and generates the final result.

3.3.2. SVM-ANN Methodology

In this methodology [34], the SVM classifier uses the anomaly detection technique, and the ANN classifier uses the misuse detection technique. The whole methodology is implemented in three modules. The modules are preprocessing module, SVM module, and ANN module. The flowchart of the SVM-ANN methodology is shown in Figure 4.

(1) Preprocessing Module. Preprocessing is a very important part of the classification methodology because it makes the dataset ready for classification. The raw dataset contains redundant and useless data, which preprocessing removes.

(2) SVM Module. The preprocessed dataset is given as input to the support vector machine (SVM) classifier, which performs binary classification into two classes: normal and attack. A normal packet is labelled as normal, whereas an attack packet is labelled as attack. The SVM classifier usually maps the data into a higher-dimensional space, which makes it easier to separate the data into different classes. A hyperplane $H$ in $\mathbb{R}^n$ can be expressed [56] by the following equation:

$$H = \{\, x \in \mathbb{R}^{n} : w^{\top} x + b = 0 \,\},$$

where $x$ is an element of $\mathbb{R}^n$, $w$ is the normal vector of the hyperplane in $\mathbb{R}^n$, and $b$ is an element of $\mathbb{R}$.

Some studies report that SVM has been applied successfully to regression and classification [52, 53, 57–59].
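How a trained SVM assigns a packet to one side of the separating hyperplane can be sketched as follows. The support vectors, coefficients, bias, kernel parameter, and the convention that a positive score means attack are all hypothetical values chosen for illustration; in practice they come from training.

```python
import math


def linear_kernel(x, z):
    """Inner product; with this kernel the score reduces to w . x + b."""
    return sum(a * c for a, c in zip(x, z))


def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - c) ** 2 for a, c in zip(x, z)))


def svm_decision(x, support_vectors, dual_coefs, b, kernel=rbf_kernel):
    """Kernel SVM score: sum_i coef_i * k(s_i, x) + b, with coef_i = alpha_i * y_i."""
    return sum(c * kernel(s, x) for s, c in zip(support_vectors, dual_coefs)) + b


def classify(x, support_vectors, dual_coefs, b, kernel=rbf_kernel):
    """Label a record by the sign of its score (positive side: attack, by assumption)."""
    return "attack" if svm_decision(x, support_vectors, dual_coefs, b, kernel) >= 0 else "normal"


# Hypothetical trained model with one support vector per class.
support_vectors = [[0.0, 0.0], [2.0, 2.0]]
dual_coefs = [-1.0, 1.0]
b = 0.0
```

The score involves only the support vectors, so records that are not support vectors have no influence on the decision once training is finished.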

(3) ANN Module. The attack packets are the input to the artificial neural network classifier. A feed-forward neural network trained with the backpropagation algorithm, which is commonly used for neural networks [55], is implemented. This classifier performs multiclass classification and outputs the attack packets together with their types.
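The two-stage flow of this methodology can be sketched as follows. The two callables stand in for the trained SVM and ANN; the threshold rules, feature names, and class labels used here are hypothetical placeholders rather than the trained models of Ref. [34].

```python
def two_stage_detect(records, is_attack, attack_type):
    """Stage 1 separates normal from attack; stage 2 names the attack type.

    is_attack:   callable(record) -> bool, standing in for the trained SVM.
    attack_type: callable(record) -> str, standing in for the trained ANN.
    """
    labels = []
    for record in records:
        if is_attack(record):
            labels.append(attack_type(record))  # only attack records reach stage 2
        else:
            labels.append("normal")
    return labels


def toy_svm(record):
    """Hypothetical binary rule used in place of a trained SVM."""
    return record["count"] > 100 or record["failed_logins"] > 3


def toy_ann(record):
    """Hypothetical multiclass rule used in place of a trained ANN."""
    return "dos" if record["count"] > 100 else "r2l"


records = [
    {"count": 5, "failed_logins": 0},
    {"count": 400, "failed_logins": 0},
    {"count": 3, "failed_logins": 6},
]
labels = two_stage_detect(records, toy_svm, toy_ann)
```

Because normal records are filtered out in the first stage, the multiclass classifier only has to distinguish between attack types.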

3.3.3. FCM-SVM Methodology

In this methodology [44], a hybrid approach combines FCM with the SVM classifier. The methodology comprises four modules. Figure 5 shows the flowchart of the FCM-SVM methodology.

(1) Preprocessing Module. The first module converts the dataset into a form easily understood by the classifier. Preprocessing saves time and resources because unwanted data are removed in this module.

(2) FCM Module. This module divides the dataset into groups based on membership functions. The equations related to the FCM algorithm were discussed earlier in this study.

(3) SVM Module. This module classifies the clusters using support vector machine classifiers, which perform multiclass classification.

(4) Aggregation Module. The outputs of all the SVM classifiers are combined, and the aggregation module generates the final output.
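One way to combine the per-cluster classifier outputs is a membership-weighted sum of class scores, sketched below. The source does not specify the aggregation rule, so this weighting is an assumption, as are the function names, class names, and toy scores.

```python
def aggregate(memberships, cluster_scores):
    """Combine per-cluster classifier outputs into one class score vector.

    memberships:    degree of membership of the sample in each cluster.
    cluster_scores: cluster_scores[j][c] is the score that the classifier
                    trained on cluster j assigns to class c.
    """
    n_classes = len(cluster_scores[0])
    return [
        sum(u * scores[c] for u, scores in zip(memberships, cluster_scores))
        for c in range(n_classes)
    ]


def predict(memberships, cluster_scores, class_names):
    """Return the class with the largest aggregated score."""
    combined = aggregate(memberships, cluster_scores)
    best = max(range(len(combined)), key=combined.__getitem__)
    return class_names[best]


# A sample that belongs mostly to cluster 0 (hypothetical values).
memberships = [0.8, 0.2]
cluster_scores = [
    [0.9, 0.1, 0.0],  # classifier trained on cluster 0
    [0.2, 0.3, 0.5],  # classifier trained on cluster 1
]
class_names = ["normal", "dos", "probe"]
combined = aggregate(memberships, cluster_scores)
label = predict(memberships, cluster_scores, class_names)
```

Weighting by membership lets the classifier of the cluster closest to the sample dominate the decision without discarding the other classifiers entirely.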

3.3.4. SMO-ANN Methodology

This methodology is based on a fuzzy c-means clustering algorithm optimized with the spider monkey optimization (SMO) algorithm [45]. Figure 6 shows the flowchart of the SMO-ANN methodology. The methodology is divided into three modules, which are described next.

(1) Preprocessing Module. Preprocessing converts the raw dataset into the preprocessed dataset, which contains no useless data.

(2) FCM-SMO Module. The whole dataset is divided into various clusters in this module. SMO is applied to the clusters to reduce the dataset further and obtain an optimized dataset.

(3) ANN Module. In this module, an artificial neural network (ANN) is applied to classify the dataset into attack packets and normal packets. Attack packets are further classified into their types.

3.4. Evaluation

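Performance metrics are vital for comparing different intrusion detection systems, as they indicate which system performs better than the others. In the following definitions, TP, TN, FP, and FN denote the numbers of true positives (attack packets flagged as attacks), true negatives (normal packets passed as normal), false positives (normal packets flagged as attacks), and false negatives (attack packets passed as normal), respectively.
(1) Accuracy: Accuracy describes the proportion of the predictions of the intrusion detection system that are correct. Accuracy is represented by the following equation:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$

(2) Precision: Precision describes the ratio of the attack packets correctly identified as intrusions by the intrusion detection system to the total number of packets identified as intrusions. Precision is represented by

$$\text{Precision} = \frac{TP}{TP + FP}.$$

(3) Detection Rate: The detection rate (recall) describes the proportion of attack packets that are identified correctly. It is represented by

$$\text{Detection rate} = \frac{TP}{TP + FN}.$$

(4) F-measure: F-measure is defined as the harmonic mean of recall and precision. It is represented by

$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Detection rate}}{\text{Precision} + \text{Detection rate}}.$$

(5) False-Positive Rate: The false-positive rate (false alarm rate) is the proportion of normal packets that are incorrectly flagged as attacks; it forms the horizontal axis of the ROC curve. False-positive rate is represented by

$$\text{False-positive rate} = \frac{FP}{FP + TN}.$$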

These performance metrics are used for comparing various methodologies by using two standard benchmark datasets.

We use multiclass datasets for the performance assessment. We calculate the performance metrics for every class of both datasets, the UNSW-NB15 and the NSL-KDD datasets. For example, we calculate the accuracy of every class of the NSL-KDD dataset and then obtain the overall accuracy for the whole dataset by averaging the accuracies of all the classes. The other performance metrics are calculated in the same way for both datasets. We compare the four existing intrusion detection systems both on every attack of both datasets and on the overall performance metrics of each dataset.
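The per-class calculation and the averaging step can be sketched as follows, treating each class in turn as the positive class (one-vs-rest). The function names, the zero convention for empty denominators, and the toy labels are assumptions made for illustration.

```python
def per_class_metrics(y_true, y_pred, positive):
    """One-vs-rest metrics for a single class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = len(y_true) - tp - fp - fn

    def ratio(num, den):
        # Return 0.0 for an empty denominator (a convention assumed here).
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    detection_rate = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(y_true)),
        "precision": precision,
        "detection_rate": detection_rate,
        "f_measure": ratio(2 * precision * detection_rate, precision + detection_rate),
        "false_positive_rate": ratio(fp, fp + tn),
    }


def macro_average(y_true, y_pred):
    """Average each metric over all classes present in y_true."""
    rows = [per_class_metrics(y_true, y_pred, c) for c in sorted(set(y_true))]
    return {name: sum(r[name] for r in rows) / len(rows) for name in rows[0]}


# Toy labels (hypothetical): six records, three classes.
y_true = ["normal", "normal", "dos", "dos", "probe", "probe"]
y_pred = ["normal", "dos", "dos", "dos", "probe", "normal"]
dos = per_class_metrics(y_true, y_pred, "dos")
overall = macro_average(y_true, y_pred)
```

For the "dos" class in this toy example there are two true positives, one false positive, and no false negatives, so precision is 2/3 and the detection rate is 1.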

4. Experiments and Comparative Analysis

To evaluate the performance of the existing IDSs, we conducted experiments on four existing IDSs using two benchmark datasets: the NSL-KDD dataset and the UNSW-NB15 dataset. We analyze the results with respect to five performance metrics: accuracy, detection rate, precision, F-measure, and false-positive rate. Table 3 shows the hardware and software used in the experiments.

In Table 4, the SVM-ANN methodology has the highest precision of 1 and the lowest false-positive rate of 0. The FCM-SVM methodology has the highest accuracy of 0.99855, the highest detection rate of 0.98475, and the highest F-measure of 0.98431. In Table 5, the SVM-ANN methodology has the highest precision of 1 and the lowest false-positive rate of 0. The FCM-SVM methodology has the highest accuracy of 0.99925, the highest detection rate of 0.99254, and the highest F-measure of 0.99482. In Table 6, the SVM-ANN methodology has the highest precision of 1 and the lowest false-positive rate of 0. The FCM-SVM methodology has the highest accuracy of 0.99954 and the highest F-measure of 0.99482. The SMO-ANN methodology has the highest detection rate of 1. In Table 7, the SVM-ANN methodology has the highest detection rate of 0.98624. The FCM-SVM methodology has the highest accuracy of 0.99793 and the highest F-measure of 0.98068. The SMO-ANN methodology has the highest precision of 1 and the lowest false-positive rate of 0. In Table 8, the SVM-ANN methodology has the highest precision of 0.99926 and the lowest false-positive rate of 0.00074. The FCM-SVM methodology has the highest accuracy of 0.99838, the highest detection rate of 0.99047, and the highest F-measure of 0.98969. In Table 9, the FCM-SVM methodology has the highest accuracy of 0.99983, the highest detection rate of 0.99984, and the highest F-measure of 0.99934. The SMO-ANN methodology has the highest precision of 1 and the lowest false-positive rate of 0. In Table 10, the SVM-ANN methodology has the highest precision of 1 and the lowest false-positive rate of 0. The FCM-SVM methodology has the highest accuracy of 0.99788 and the highest F-measure of 0.97563. The SMO-ANN methodology has the highest detection rate of 1. In Table 11, the SMO-ANN methodology has the highest accuracy of 1, the highest detection rate of 1, a precision of 1, an F-measure of 1, and the lowest false-positive rate of 0. In Table 12, the SVM-ANN and SMO-ANN methodologies have a precision of 1 and the lowest false-positive rate of 0. The SMO-ANN methodology has the highest accuracy of 1, the highest detection rate of 1, and the highest F-measure of 1. In Table 13, the FCM-ANN methodology has the highest accuracy of 0.99862, the highest detection rate of 0.98710, the highest precision of 0.98710, the highest F-measure of 0.98710, and the lowest false-positive rate of 0.000658. In Table 14, the SVM-ANN methodology has the highest accuracy of 0.99151, the highest detection rate of 0.98408, and the highest F-measure of 0.98836. The FCM-ANN and FCM-SVM methodologies have a precision of 1 and the lowest false-positive rate of 0.

In Table 15, the SVM-ANN methodology has the highest accuracy of 0.99365 and the highest F-measure of 0.96540. The FCM-SVM methodology has the highest detection rate of 1. The FCM-ANN and SMO-ANN methodologies have the highest precision of 1 and the lowest false-positive rate of 0. In Table 16, the SVM-ANN methodology has the highest accuracy of 0.99805, the highest detection rate of 0.76555, and the highest F-measure of 0.86721. All methodologies have a precision of 1 and a false-positive rate of 0. In Table 17, the SVM-ANN methodology has the highest accuracy of 0.99996, the highest detection rate of 1, and the highest F-measure of 0.95652. All methodologies have a precision of 1 and a false-positive rate of 0. In Table 18, the SVM-ANN methodology has the highest accuracy of 0.99362, the highest detection rate of 0.94270, the highest precision of 0.96460, the highest F-measure of 0.95270, and the lowest false-positive rate of 0.00484.

The different attacks of the UNSW-NB15 and NSL-KDD datasets are analyzed to evaluate various intrusion detection systems for cloud computing environments. The above tables present the results of our experiments.

Tables 4 to 18 show the values of the performance metrics for the different attacks of the UNSW-NB15 dataset. The FCM-SVM methodology performs better than the other methodologies in detecting every attack of the UNSW-NB15 dataset. Table 12 shows the values of the performance metrics for the complete UNSW-NB15 dataset. The overall performance of the FCM-SVM methodology in detecting attacks of the UNSW-NB15 dataset is better than that of the other methodologies. Tables 13 to 16 show the values of the performance metrics for the different attacks of the NSL-KDD dataset. The FCM-SVM and SMO-ANN methodologies perform better than the other methodologies in detecting every attack of the NSL-KDD dataset. Table 17 shows the values of the performance metrics for the complete NSL-KDD dataset. The overall performance of the SMO-ANN methodology in detecting attacks of the NSL-KDD dataset is better than that of the other methodologies. The main advantage of the SVM classifier is that its decision function depends only on the support vectors; it is not influenced by the complete dataset, unlike many artificial neural networks (ANNs). SVM also deals efficiently with many features through its use of kernel functions. The SMO algorithm has a low rate of convergence, and its premature convergence also affects performance. Hybridizing SVM with other classifiers might therefore yield an efficient intrusion detection system.

5. Future Scopes and Recommendations

Intrusion detection systems detect known and unknown attacks, but the copious amounts of data generated and stored on the cloud make the intrusion detection problem more complex. We summarize the following future scopes:
(i) The rapidly growing number of zero-day attacks and their vulnerabilities make zero-day detection a demanding future scope in developing intrusion detection systems for cloud computing.
(ii) Another future scope is developing an adaptive architecture of intrusion detection systems to handle dynamic computations.
(iii) Researchers can also focus on integrating the intrusion detection system with blockchain technologies.

The possible recommendations for the above future scopes are as follows:
(i) An adaptive intrusion detection system must be developed that can adapt to changing requirements such as environment configurations, computational resources, and the various locations where intrusion detection systems are deployed.
(ii) It should expand dynamically by adding virtual machines when the cloud network extends.

6. Conclusion

This article reviews various intrusion detection systems related to cloud computing and implements and compares four of them. Using two standard benchmark datasets, we observed that the FCM-SVM methodology outperforms the other techniques on the UNSW-NB15 dataset and that the SVM-ANN methodology outperforms the others on the NSL-KDD dataset. Hence, SVM is identified as a better classifier than the other classifiers considered. In future work, we will address zero-day attacks and develop an adaptive intrusion detection system that adapts to changing cloud architecture.

Data Availability

The datasets used in the article are publicly available standard benchmark datasets referred to in Refs. [54, 56, 60].

Conflicts of Interest

The authors declare no conflict of interest related to this work.

Authors’ Contributions

P.R. and I.B. were responsible for the conceptualization of the topic; article gathering and sorting were carried out by A.M., Y.K., S.K.P., N.G., A.K., S.R., and A.L.I; manuscript writing and original drafting and formal analysis were carried out by P.R., I.B., Y.K., and S.P; and writing of reviews and editing were carried out by N.G., A.K., S.R., and A.L.I. All authors have read and agreed to the published version of the manuscript.

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean Government (MSIT) (Grant no. 2020R1G1A1099559). The work of Agbotiname Lucky Imoize was supported in part by the Nigerian Petroleum Technology Development Fund (PTDF) and in part by the German Academic Exchange Service (DAAD) through the Nigerian-German Postgraduate Program under Grant no. 57473408.