Abstract

Twitter, combined with streaming data technologies and machine learning, can add new value to healthcare. This paper presents a real-time system that predicts breast cancer from patients’ health data streamed from Twitter. The proposed system consists of two major components: an offline model-building component and an online prediction pipeline. In the first component, we computed the correlation between the features of the Breast Cancer Wisconsin Diagnostic dataset and removed highly correlated features to reduce their number. Two feature selection algorithms, recursive feature elimination and univariate feature selection, were then applied to the remaining features to select the essential ones. Four classifiers, decision tree, logistic regression, support vector machine, and random forest, were trained on the features after correlation and after feature selection. Hyperparameter tuning and cross-validation were also applied to optimize the models and enhance accuracy. The second component was developed with Apache Spark, Apache Kafka, and the Twitter Streaming API. The model with the highest accuracy from the first component predicts breast cancer in real time from streaming tweets. The results showed that the random forest classifier achieved the best accuracy.

1. Introduction

Cancer, Rodríguez Larumbe [1], arises from mutations or abnormal changes in the genes responsible for regulating the growth of cells and keeping them healthy. The genes are in each cell’s nucleus, which acts as the “control room” of the cell. Normally, the cells in our bodies replace themselves through an orderly process of cell growth: healthy new cells take over as the older ones die. Mutations, however, can “turn on” certain genes and “turn off” others in a cell, which gives the cell the ability to keep dividing without control, producing more cells identical to it and forming a tumor. A tumor, Rodríguez Larumbe [1], can be benign or malignant. Benign tumors are not considered cancerous: their cells are close to normal, they grow slowly, and they do not invade nearby tissues or other parts of the body. Malignant tumors, however, are cancerous. If they are left unchecked, malignant cells can eventually spread beyond the original tumor to other tissues of the body. Breast cancer is a type of cancer that forms in the cells of the breast tissue, Clinic [23]; Board [3]. It is an uncontrolled growth of breast cells. The symptoms of breast cancer, Board [3], may include a lump in the breast, a change in breast volume and form, dimpling of the skin, fluid coming from the nipple, a newly inverted nipple, or a red or scaly patch of skin. Statistically, breast cancer is ranked the second most fatal disease worldwide for women, Group et al. [4]. According to the statistics of the World Health Organization [5], 627,000 women died from breast cancer in 2018. This accounts for nearly 15% of all cancer deaths among women. In the Western part of the world, previous research has indicated that one of every nine women is likely to develop breast cancer in the course of her life [6].
For all these reasons, there is a continuing need for a robust and accurate system for the early diagnosis and detection of breast cancer, one that accurately distinguishes benign from malignant breast tumors, in order to lower the number of deaths and increase the number of survivors.

When it comes to data science applications, the healthcare environment is one of the most appealing data sources because of the tremendous amount of available data and its sustained growth: each hospital has a dataset that constantly increases with time. Improving the healthcare system is a noble goal that almost everyone shares. Data mining and machine learning techniques can lead to direct improvements in the healthcare system.

Recently, machine learning algorithms have been playing an essential role in predicting breast cancer. For example, Asri et al. [7] applied different machine learning classification algorithms such as support vector machine, decision tree, naive Bayes, and k-nearest neighbors on the Wisconsin Breast Cancer dataset to predict breast cancer. Moreover, Aloraini [8] compared five classification learning algorithms, including Bayesian network, naïve Bayes, decision trees J4.8, ADTree, and multilayer neural network, to classify benign cancer or malignant cancer from the Wisconsin Breast Cancer dataset. Besides this, the hybrid method is a new technique that is used to reduce the number of features using feature selection methods to enhance the performance of machine learning models. For example, Akay [9] introduced a hybrid technique, which is a combination of the support vector machine integrated with feature selection for breast cancer diagnosis. Zheng et al. [10] extracted features using a hybrid of K-means with support vector machine algorithms to predict breast cancer.

Nowadays, streaming data have become a new source of data that is challenging to process and store using traditional database storage, and they play a pivotal role in many fields such as health, industry, and decision-making. Streaming data are generated continuously by sources such as social networks, sensors, and mobile devices. To store, analyze, and process streaming data, researchers have used big data platforms such as Apache Spark, Apache Hadoop [11], Apache Kafka [12], and Apache Storm [13]. For example, Zhang et al. [14] proposed a task-level adaptive MapReduce framework for processing real-time streaming data in healthcare applications. Nair et al. [15] used a machine learning model built on Apache Spark to predict heart disease from streaming tweets. The ultimate goals of an effective healthcare system are saving people’s lives, shortening hospitalization periods, and providing better preventative care. Real-time streaming analytics technologies have recently offered significant progress toward these goals. Streaming analytics, together with the Internet of Things (IoT), enables healthcare providers to observe trends and patterns faster than ever by analyzing data in real time. Such patterns enhance decision-making through predictive analytics. These techniques not only reduce the workload of nurses and doctors but also improve patient care in general and lower the cost of healthcare appointments. Top hospitals around the world apply data analytics to data streams in various medical fields, such as internal and neurological medicine for adults and neonatal care for infants. A massive amount of medical data is available for real-time processing without requiring the healthcare provider to visit the patient.
Multiple sensors and devices, such as clinical alarms and vital-sign monitors, generate data every second. Accessing medical data as they are produced, then analyzing them and visualizing the results, helps healthcare providers detect early signs of illness, which reduces healthcare costs in general. The main factor in achieving this goal is applying data analytics techniques to streaming data collected from multiple sources. Nowadays, social networks are used extensively as a health support tool for raising the community’s health awareness and for spreading medical updates and current recommendations when a crisis happens. Social media data can be a useful addition to healthcare databases; they could improve both diagnosis [16] and clinical decisions [17]. Social media also adds a new dimension to healthcare by making real-time patient data available for the early detection of breast cancer. Twitter in particular is rich in medical information and is increasingly used for health and medical purposes, including sharing information about diabetes [18], identifying adverse drug effects [19], and analyzing breast cancer [20]. Also, the Twitter Streaming API allows researchers to read streaming data in real time. Therefore, researchers can integrate Twitter with big data streaming tools to develop applications that work in real time, such as [15, 21]. This paper addresses the problem of predicting breast cancer from streaming health data collected from Twitter users. Previous studies of breast cancer prediction have focused only on historical data and traditional machine learning algorithms. These studies do not predict breast cancer in real time using streaming data collected from social networks.
The goal of this work is to predict breast cancer in real time from patients’ social posts using machine learning algorithms integrated with Apache Spark and Apache Kafka. The real-time breast cancer prediction system consists of two components: developing an offline model and an online prediction pipeline. In the offline component, distributed machine learning algorithms based on Apache Spark, namely, decision tree (DT), support vector machine (SVM), random forest (RF), and logistic regression (LR), are trained and tested on the Breast Cancer Wisconsin (Diagnostic) dataset (BCWD) to select the best model, which is then used to predict breast cancer in real time. In the online prediction pipeline, patients’ tweets are collected from Twitter by Apache Kafka, and Apache Spark preprocesses the data in real time. Our contributions can be summarized as follows:
(i) Developing a real-time system to predict breast cancer from streaming tweets
(ii) Applying different feature selection algorithms to select essential features from the dataset
(iii) Applying different machine learning algorithms to the features selected after correlation on the Breast Cancer Wisconsin (Diagnostic) dataset
(iv) Applying grid search with cross-validation to optimize the machine learning algorithms and enhance accuracy
(v) Developing an offline model to find the model with the highest accuracy, which is used to predict breast cancer in real time from streaming tweets

This paper is organized as follows: Section 2 reviews previous studies. Section 3 describes the big data tools. Section 4 describes the dataset. Section 5 presents the real-time system for breast cancer prediction. Section 6 discusses the experimental results in detail. Section 7 concludes the paper.

2. Related Work

Many researchers have applied data mining and machine learning techniques to develop models and systems that predict or diagnose breast cancer. For example, Ak [22] made a comparative analysis using data visualization and machine learning to detect and diagnose breast cancer. Different machine learning algorithms, including LR, KNN, SVM, NB, RF, and rotation forest, were applied to the breast cancer dataset collected by Dr. William H. Wolberg of the University of Wisconsin Hospital. The results show that LR with all features achieved the highest accuracy. Delen et al. [23] utilized two data mining algorithms, artificial neural networks and DT, together with the statistical method of logistic regression to develop prediction models using a large dataset. They compared the three models using 10-fold cross-validation to obtain unbiased estimates of their performance. Agarap [24] applied six machine learning algorithms, namely, Gated Recurrent Unit (GRU) with SVM, LR, multilayer perceptron, KNN, softmax regression, and SVM, to the WDBC dataset to predict breast cancer. The multilayer perceptron achieved the best accuracy. Oyewola et al. [25] used five machine learning algorithms, LR, linear discriminant analysis, quadratic discriminant analysis, RF, and SVM, to predict breast cancer based on the mammographic diagnostic method. The results show that SVM is the best classifier for prediction. Benbrahim et al. [26] compared 11 machine learning algorithms, KNN, NB, RF, LR, DT, stochastic gradient descent, linear SVM, Extra Tree, linear discriminant analysis, quadratic discriminant analysis, and neural network, on the WDBC dataset to predict breast cancer. The best accuracy was achieved by the neural network. Asri et al. [7] compared the performance of SVM, DT, naive Bayes (NB), and K-nearest neighbors (KNN) on the BCWD dataset using the WEKA data mining tool to predict breast cancer. The results showed that SVM is the best classifier. Eshlaghy et al. [27] used DT, SVM, and artificial neural network on the dataset of patients who were registered in the Iranian Center for Breast Cancer program from 1997 to 2008. The results show that the SVM model has higher accuracy than the others. Some researchers have also applied feature selection algorithms together with machine learning to improve accuracy by reducing the number of features. For example, Liu et al. [28] proposed a hybrid system that uses an information gain directed simulated annealing genetic algorithm wrapper to rank all features and a cost-sensitive support vector machine learning algorithm to predict breast cancer. Luo and Cheng [29] used two feature selection methods, forward selection and backward selection, to improve the accuracy of breast cancer prediction on the dataset collected at the Institute of Radiology of the University of Erlangen-Nuremberg between 2003 and 2006. Chen et al. [30] applied a rough set reduction algorithm with the SVM to remove redundant features and improve accuracy on the BCWD dataset. Currently, researchers are using big data techniques to predict breast cancer. For example, Alghunaim and Al-Baity [31] used three machine learning algorithms, SVM, DT, and RF, with Weka and Apache Spark to predict cancer. The results show that SVM on Apache Spark is a better classifier than the others.

3. Big Data Tools

This section explains the big data tools that are used in the proposed system.

3.1. Apache Kafka

Apache Kafka [12] is a distributed streaming platform for building real-time streaming data pipelines. Kafka can receive large volumes of streaming data in real time with low latency, fault tolerance, and reliability. Kafka stores streaming data in topics. Kafka includes two main APIs, the Producer API and the Consumer API. With the Producer API, applications send a stream of records to Kafka topics. With the Consumer API, applications read streams of data from Kafka topics. In our work, Kafka receives streaming tweets from Twitter and stores them in a Kafka topic, from which Apache Spark reads the data as a stream.
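
The producer/consumer relationship around a topic can be illustrated with a small in-memory sketch. This is not the Kafka client API and involves no broker; the class `InMemoryTopic`, its methods, and the consumer name "spark" are hypothetical names chosen for illustration only. It merely shows the idea of an append-only log that producers write to and that each consumer reads from at its own offset.

```python
from collections import defaultdict


class InMemoryTopic:
    """Toy stand-in for a topic: an append-only list of records (not Kafka itself)."""

    def __init__(self):
        self._log = []                     # records in arrival order
        self._offsets = defaultdict(int)   # next offset to read, per consumer

    def produce(self, record):
        """Producer side: append a record and return its offset."""
        self._log.append(record)
        return len(self._log) - 1

    def consume(self, consumer_id):
        """Consumer side: return the unread records and advance this consumer's offset."""
        start = self._offsets[consumer_id]
        records = self._log[start:]
        self._offsets[consumer_id] = len(self._log)
        return records


topic = InMemoryTopic()
topic.produce("tweet 1")
topic.produce("tweet 2")
first_batch = topic.consume("spark")    # both tweets produced so far
topic.produce("tweet 3")
second_batch = topic.consume("spark")   # only the tweet produced since the last read
```

Because each consumer keeps its own offset, a second consumer that starts later would still see every record from the beginning of the log.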

3.2. Apache Spark

Apache Spark [32] is an open-source big data framework designed for fast processing of large datasets. Spark is typically faster than Hadoop MapReduce because it executes processing in memory. A strong point of Apache Spark is that it includes the two libraries used in this work, the Spark Streaming API and the MLlib API. The MLlib API is Spark’s machine learning (ML) library. It provides different types of machine learning algorithms, such as classification and regression; feature transformations, such as standardization, normalization, and hashing; and tools for model evaluation and hyperparameter tuning. We used the MLlib API to implement the offline model-building component, including the classification algorithms SVM, DT, RF, and LR with grid search and cross-validation. The Spark Streaming API provides scalable and fault-tolerant processing of data streams. In our work, the Spark Streaming API implements the online prediction pipeline component: it reads tweets as a stream from the Kafka topic, preprocesses them in real time, and sends the preprocessed tweets to the best model developed in the offline component to predict in real time whether the tweets indicate breast cancer.

4. Dataset Description

In this section, we describe the Breast Cancer Wisconsin (Diagnostic) dataset that is used to build the offline model.

4.1. Breast Cancer Wisconsin (Diagnostic) Dataset (BCWD)

We used the BCWD dataset [33] to train and test the models because it is a free and reliable dataset that has been used for breast cancer prediction by various researchers such as Agarap [24], Dubey et al. [34], and Sridevi and Murugan [35]. It includes 30 features and one class label. The features describe the cell nuclei found in a digitized image of a fine needle aspirate (FNA) of a breast mass. The class label has two values, 0 and 1: 0 indicates a benign tumor, and 1 indicates a malignant tumor. In this work, we first reduced the number of features using correlation and then applied two types of feature selection algorithms to the remaining features. Reducing the number of features is necessary for machine learning because unnecessary features can degrade the models’ performance and accuracy; it also helps to reduce overfitting. Correlation measures the relationship between two or more features of a dataset. We used the correlation matrix in Python [45] to study the relationships between the features in the dataset. For each pair of features whose correlation exceeded 90%, we deleted one of the two features. After this step, 20 features remained. The description of these features is shown in Table 1.
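
The correlation-based reduction can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions, not the authors' code: the helper names `pearson` and `drop_correlated`, the greedy keep-first order, and the toy values are all hypothetical, and `pe_mean` is an illustrative abbreviation rather than a column name taken from the paper. The sketch keeps a feature only if its absolute Pearson correlation with every feature already kept does not exceed the 0.9 threshold.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.9):
    """Keep a feature only if |r| <= threshold against every feature kept so far."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy columns (not values from the BCWD dataset).
toy = {
    "ra_mean": [1.0, 2.0, 3.0, 4.0, 5.0],
    "pe_mean": [2.1, 4.0, 6.2, 8.1, 9.9],   # nearly proportional to ra_mean
    "te_mean": [3.0, 1.0, 4.0, 1.0, 5.0],
}
selected = drop_correlated(toy)
```

Which member of a correlated pair is dropped depends on the column order in this sketch; the paper does not state the rule that was actually applied.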

5. The Real-Time System of Breast Cancer Prediction

The architecture of the real-time system for breast cancer prediction consists of two components, namely, developing an offline model and an online prediction pipeline, as shown in Figure 1. The two components are described in detail in the following sections.

5.1. Developing an Offline Model

The goal of the offline model component is to find the optimal machine learning model, that is, the one with the highest accuracy. Two feature selection algorithms, recursive feature elimination with cross-validated selection (RFECV) and univariate feature selection, are used to select the essential features from the dataset that contains the features retained after correlation. Four machine learning algorithms, decision tree, logistic regression, support vector machine, and random forest classifier, are used to classify breast tumors as benign or malignant. Figure 1 shows the main stages of developing an offline model: feature selection methods, data splitting, classifiers’ optimization and training, and evaluating the models. Each stage of this component is described in detail as follows.

5.1.1. Feature Selection Methods

The process of selecting the important input features for a predictive model is called feature selection. The selection process reduces the total number of input variables, which shortens the execution time, and it focuses the model on the important features, which increases the classification accuracy. The objective of applying feature selection methods is to identify the key features in the dataset that play a crucial role in the prediction process. These key features must be available for the system to predict cancer correctly; feature selection also identifies the features whose absence does not affect the system’s ability to predict correctly. In this paper, we used two feature selection algorithms: recursive feature elimination with cross-validated selection (RFECV) and univariate feature selection.
(1) Recursive feature elimination with cross-validated selection (RFECV): RFECV [36] is a type of wrapper method. RFECV assigns a ranking to each feature and selects the best number of features with the highest ranking.
(2) Univariate feature selection: this is a type of filter method. We used the chi-square test [37] with SelectKBest [38] to select the best features. The scikit-learn library in Python provides SelectKBest, which can be used with different statistical tests to select a specific number of features.
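
The two methods above can be sketched with scikit-learn as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the data are synthetic (generated with `make_classification`), the base estimator for RFECV is assumed to be logistic regression because the paper does not name one, 5-fold CV is an arbitrary choice, and `k=9` mirrors the number of features reported for the univariate method in Section 6.2.2.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the 20 features kept after the correlation step.
X, y = make_classification(n_samples=200, n_features=20, n_informative=6,
                           random_state=0)
X = MinMaxScaler().fit_transform(X)   # chi2 requires non-negative inputs

# Wrapper method: recursive elimination, feature count chosen by cross-validation.
rfecv = RFECV(estimator=LogisticRegression(max_iter=1000), step=1, cv=5,
              scoring="accuracy")
rfecv.fit(X, y)
rfecv_mask = rfecv.support_           # True for the features with ranking_ == 1

# Filter method: keep the k features with the highest chi-square scores.
kbest = SelectKBest(score_func=chi2, k=9).fit(X, y)
kbest_mask = kbest.get_support()
```

After fitting, `rfecv.ranking_` gives the rank of each feature (1 for selected features) and `kbest.scores_` gives the chi-square score of each feature, which correspond to the kind of ranking and score listings reported in Section 6.2.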

5.1.2. Database Splitting

The dataset is split into an 80% training dataset and a 20% testing dataset (unseen dataset) using a stratified method. The training set is used to optimize and train the ML models, and the unseen test set is used to evaluate the resulting models.
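
A stratified 80/20 split keeps the ratio of benign to malignant samples the same in both parts. The sketch below shows the idea in plain Python; the function name `stratified_split`, the seed, and the 60/40 toy labels are hypothetical, and the paper's own split was presumably done with library tooling that is not specified here.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                           # shuffle within each class
        n_test = round(len(indices) * test_fraction)   # class share of the test set
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 60 benign (0) and 40 malignant (1) samples.
labels = [0] * 60 + [1] * 40
train_idx, test_idx = stratified_split(labels)
```

With these toy labels, the test set holds 12 benign and 8 malignant samples, the same 60/40 ratio as the full set.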

5.1.3. Classifiers’ Optimization and Training

The grid search method with 10-fold cross-validation (CV) was used to find the optimal hyperparameters of the machine learning algorithms and enhance accuracy. Four machine learning classification algorithms, logistic regression (LR) [39], decision tree (DT) [40], random forest classifier (RF) [41, 42], and support vector machine (SVM) [43], are used in this work. The accuracy on cross-validation and on unseen data is calculated for each model. K-fold cross-validation divides the dataset into k equal groups of samples, called folds. In each round, k − 1 folds are used to train the classifier, and the remaining fold is used to test it. In the 10-fold CV process, 90% of the data is therefore used for training and 10% for testing in each round. Furthermore, hyperparameter tuning is used to try different parameter values for the model. Grid search is a widely used technique for hyperparameter tuning. In grid search, the user defines a set of values for each hyperparameter. The method then evaluates every combination of these values and selects the combination that achieves the best accuracy.
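
The interplay of k-fold cross-validation and grid search can be sketched in plain Python. This is not the Spark MLlib tooling used in the paper; the helpers `k_fold_indices`, `grid_search`, and `evaluate`, the toy one-feature data, and the single `offset` hyperparameter are all hypothetical and exist only to make the procedure concrete. Every combination in the grid is scored by its mean accuracy over the k folds, and the best combination is returned.

```python
from itertools import product


def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; each fold serves as the test set once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train_idx, folds[i]


def grid_search(param_grid, evaluate, n_samples, k=10):
    """Return (best_params, best_mean_accuracy) over all grid combinations."""
    names = list(param_grid)
    best_params, best_score = None, -1.0
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [evaluate(params, train_idx, test_idx)
                  for train_idx, test_idx in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Hypothetical one-feature data: class 0 has values 0..49, class 1 has 60..109.
xs = list(range(50)) + list(range(60, 110))
ys = [0] * 50 + [1] * 50


def evaluate(params, train_idx, test_idx):
    """Fit a midpoint-threshold classifier on the training folds, score the test fold."""
    class0 = [xs[i] for i in train_idx if ys[i] == 0]
    class1 = [xs[i] for i in train_idx if ys[i] == 1]
    midpoint = (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2
    threshold = midpoint + params["offset"]   # "offset" is the tuned hyperparameter
    correct = sum((xs[i] >= threshold) == (ys[i] == 1) for i in test_idx)
    return correct / len(test_idx)


best_params, best_score = grid_search({"offset": [-20, 0, 20]}, evaluate, len(xs))
```

With 100 samples and k = 10, each round trains on 90 samples and tests on 10, matching the 90%/10% proportion described above.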

5.1.4. Evaluating the Models

We used accuracy to evaluate the models, as defined in the following equation, where TP is true positive, TN is true negative, FP is false positive, and FN is false negative:

\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
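
As a small worked example, accuracy is the share of correct predictions, (TP + TN) / (TP + TN + FP + FN), and can be computed from true and predicted labels as below. The function name `accuracy` and the toy label lists are hypothetical; label 1 is taken as the positive (malignant) class, following the class encoding in Section 4.1.

```python
def accuracy(y_true, y_pred, positive=1):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical labels: TP = 3, TN = 3, FP = 1, FN = 1.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
score = accuracy(y_true, y_pred)   # 6 of the 8 predictions are correct
```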

5.2. Online Prediction Pipeline

To implement the streaming processing pipeline component, Apache Kafka and Apache Spark, both distributed streaming technologies, are utilized. Also, the Twitter Streaming API [44] is used to collect real-time streaming data from Twitter. The main goals of this component are to study how efficiently the proposed system works in real time on streaming tweets and to measure its ability to predict benign or malignant cancer based on the health status information contained in a tweet. Apache Kafka [12] is chosen for its high throughput, low latency, and ordering guarantees. Kafka reads tweets from Twitter and stores them in a Kafka topic. Apache Spark works as the stream processor, taking its input streams from the Kafka topic. For each tweet, the extracted data are represented as a vector, with the attributes in the same order as in the training dataset, and passed to the best model to predict whether the tweet indicates benign or malignant breast cancer. The model that gives the highest prediction accuracy is referred to as the best model.

5.2.1. Streaming Processing Pipeline

Twitter is one of the most used social media platforms, to the extent that it is considered one of the major data sources for medical and healthcare-related applications. People use Twitter to share medical conditions, concerns, possible side effects of drugs, etc. This large amount of data makes Twitter an important resource for data science researchers. Also, the Twitter Streaming API [44] allows researchers to read streaming data in real time. Therefore, researchers can integrate Twitter with big data streaming tools to develop applications that work in real time. In this step, the Twitter Streaming API and Apache Kafka are used to capture tweets containing the “#streamingcancer” hashtag. Streaming data that include breast cancer-related information are retrieved synchronously using the Twitter Streaming API. Prediction is then performed to determine which of the two breast cancer types (benign or malignant) is indicated in the tweet. Tweepy, a Python library, is used for accessing Twitter data. To establish the connection to the Twitter Streaming API, both a keep-alive HTTP connection and an OAuth-based user authorization method were used. In addition, an account was created on the Twitter app to obtain the consumer key, consumer secret, access token, and access token secret needed for authorized access to tweet streams. Afterwards, we ran the developed script to capture the streaming tweets containing the “#streamingcancer” hashtag. Figure 2 shows an example of the type of tweet collected in our streaming dataset. This tweet includes a sequence of attribute values, which are ra_mean, te_mean, sm_mean, com_mean, con_mean, fr_di_mean, ra_se, com_se, con_se, sm_worst, com_worst, con_worst, and sym_worst, in the same order as the attributes used in the training dataset. The attribute values are separated by spaces. The Twitter streaming data are then transferred to a Kafka topic in real time.

5.2.2. Online Prediction

Once the data collection steps described above are complete, the Kafka topic holds the streaming Twitter data. Spark Streaming consumes the streaming tweets from the Kafka topic and applies several steps, including removing unimportant data and extracting the health attributes. The health attributes are then transformed into a vector and sent to the best model to predict malignant or benign breast cancer. Specifically, the best offline prediction model classifies each breast cancer-related tweet as benign or malignant in real time. For example, for the sample tweet in Figure 2, the proposed system processes the information in the tweet and determines that this specific Twitter user is concerned with a malignant breast cancer condition.
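
The preprocessing step, which turns a tweet into a feature vector, can be sketched as follows. This is a hedged illustration, not the authors' Spark code: the function `tweet_to_vector`, the hashtag spelling, the sample values, and the exact tweet layout (space-separated numbers plus one hashtag) are assumptions. The feature order follows the 12 features reported as selected by RFECV in Sections 6.2.1 and 6.5.

```python
# Feature order assumed from the RFECV result reported in Sections 6.2.1 and 6.5.
FEATURES = [
    "ra_mean", "te_mean", "com_mean", "con_mean", "fr_di_mean", "ra_se",
    "com_se", "con_se", "sm_worst", "com_worst", "con_worst", "sym_worst",
]
HASHTAG = "#streamingcancer"   # assumed hashtag spelling


def tweet_to_vector(text):
    """Return the tweet's attribute values as floats, or None if the tweet is malformed."""
    tokens = [t for t in text.split() if t.lower() != HASHTAG]   # drop the hashtag
    if len(tokens) != len(FEATURES):
        return None                        # wrong number of attribute values
    try:
        return [float(t) for t in tokens]  # values stay in training-set order
    except ValueError:
        return None                        # a token was not numeric


# Hypothetical tweet text; the values are illustrative, not taken from Figure 2.
sample = ("17.99 10.38 0.2776 0.3001 0.07871 1.095 "
          "0.04904 0.05373 0.1622 0.6656 0.7119 0.4601 #streamingcancer")
vector = tweet_to_vector(sample)
```

A vector produced this way could then be handed to the trained model; tweets that do not match the expected layout are skipped by returning `None`.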

6. Experimental Results and Discussion

6.1. Experimental Setup

The proposed system is implemented in Python. The machine learning classifiers are implemented with Spark’s MLlib API using PySpark. Apache Kafka receives streaming tweets from Twitter and stores them in a Kafka topic. The Spark Streaming API consumes the streaming data from the Kafka topic using PySpark. The feature selection methods are implemented in Python. The proposed system was run on a Spark cluster that includes one master node and two worker nodes. The cluster was built on Ubuntu virtual machines (VMs) running Java, with 20 GB of RAM, seven cores, and a 100 GB disk.

6.2. The Result of Feature Selection Methods

The experiments use the dataset that contains the features retained after the correlation step. The RFECV and univariate feature selection algorithms are applied to this dataset to select the important features. The selected features are described in detail in the following.

6.2.1. The Result of Applying RFECV

The RFECV algorithm selects the important features, that is, those whose ranking value is one. The ranking of the features is shown in Figure 3. According to the figure, the optimal number of features is 12. The features with ranking 1 are ra_mean, te_mean, com_mean, con_mean, fr_di_mean, ra_se, com_se, con_se, sm_worst, com_worst, con_worst, and sym_worst. The features te_se and sy_se registered the worst ranks, at 9 and 8, respectively.

6.2.2. The Result of Applying Univariate

Feature selection is called univariate when the best features are selected based on the results of a univariate statistical test. The scores of all features computed by the univariate test are shown in Table 2. Features with higher scores are more important to the classifier. ra_mean is the most critical feature for the diagnosis of cancer, while sy_se and fr_di_mean have the smallest scores, at 0.00008 and 0.00007, respectively. After sorting the features in descending order of score, the 9 highest-scoring features are selected; Figure 4 shows these 9 features. The highest score is registered by ra_mean at 266.1. The second most important feature is te_mean, with a score of 93.897. Furthermore, con_mean and com_worst have similar scores, at 19.71 and 19.31, respectively.

6.3. The Results of Machine Learning

The goal of the experiments is to select the best model, that is, the one that registers the highest accuracy in cross-validation and on the unseen dataset. We split the dataset into an 80% training dataset and a 20% testing dataset (unseen dataset) using stratified splitting. Moreover, 10-fold cross-validation with hyperparameter tuning is applied to the training dataset. In each round of 10-fold cross-validation, 90% of the data is used to train the models and 10% is used to evaluate them using accuracy. The average accuracy over the 10 folds is computed for each model. The four machine learning algorithms, LR, DT, SVM, and RF, were applied to the features after correlation and after feature selection. For hyperparameter tuning, the following parameters of the machine learning algorithms were tuned. For SVM, three parameters were tuned: the kernel, the regularization parameter (regParam), and the maximum number of iterations (maxIter). For LR, two parameters were tuned: the regularization parameter (regParam) and the maximum number of iterations (maxIter). For RF, two parameters were tuned: the maximum number of bins for discretizing continuous features (maxBins) and the maximum depth of the tree (maxDepth). For DT, three parameters were tuned: the impurity criterion used for information gain (impurity), the maximum depth of the tree (maxDepth), and the number of bins for discretizing continuous features (maxBins).

6.3.1. The Result of Applying ML on Features after Correlation

Table 3 shows the 10-fold CV accuracy and the accuracy on the unseen dataset registered by the four models: LR, DT, SVM, and RF. For cross-validation, RF achieved the best accuracy at 99.5%, LR and SVM recorded 99.06% and 99.1%, respectively, and DT achieved the lowest accuracy at 98.6%. For unseen data, the best accuracy is registered by LR at 98.8%, while DT recorded the lowest accuracy at 90.3%. Overall, RF achieved the best accuracy for cross-validation and LR for unseen data. Table 3 also displays the best values of each classifier’s hyperparameters, which play an essential role in achieving high accuracy.

6.3.2. Accuracy Using Selected Features by RFECV

Table 4 shows the 10-fold CV accuracy and the accuracy on the unseen dataset registered by the four models, LR, DT, SVM, and RF, on the features selected by RFECV. For cross-validation, RF registered the highest accuracy at 99.1%, while DT achieved the lowest accuracy at 98.6%. For unseen data, the best accuracy is registered by RF at 100%, while DT recorded the lowest accuracy at 91.2%. SVM and LR scored similar accuracies, at 98.5% and 98.8%, respectively. Overall, RF achieved the best accuracy for both cross-validation and unseen data. Table 4 also displays the best values of each classifier’s hyperparameters, which play an essential role in achieving high accuracy.

6.3.3. Results of Models Applied on the Selected Features by Univariate

Table 5 shows the 10-fold CV accuracy on the training dataset and the unseen-dataset accuracy achieved by the four models, LR, DT, SVM, and RF, on the features selected by univariate feature selection. For cross-validation, RF achieved the highest accuracy at 99.1%, followed by LR with an accuracy of 98.6%. For unseen data, LR achieved the best accuracy at 98.4%, while DT recorded the lowest at 90.35%. Overall, RF achieved the best accuracy for cross-validation, and LR achieved the best accuracy for unseen data. Table 5 also lists the best parameter values found for each classifier, which play an essential role in achieving high accuracy.

6.4. Discussion

In our analysis, two feature selection algorithms, namely, univariate feature selection and RFECV, were used to select the most essential features from those retained after correlation analysis of the BCWD dataset. Figure 5 shows the best models according to the cross-validation results. As can be seen, RF achieved the best accuracy: 99.5% with the features after correlation, 100% with the features selected by RFECV, and 99.1% with the features selected by univariate feature selection. Figure 6 shows the best models according to the unseen-data results. As can be seen, RF achieved the highest accuracy at 99.1% with the features selected by RFECV, while LR obtained the highest accuracy at 98.7% with the features after correlation and 98.4% with the features selected by univariate feature selection. RF therefore achieved the highest accuracy for both cross-validation and unseen data with the features selected by RFECV. Consequently, RF with the features selected by RFECV is used to evaluate the proposed system in real time.

6.5. The Result of Evaluating the Proposed System in Real Time

The best model is RF with the 12 features selected by RFECV: ra_mean, te_mean, com_mean, con_mean, fr_di_mean, ra_se, com_se, con_se, sm_worst, com_worst, con_worst, and sym_worst. The goal of the real-time experiment is to evaluate whether the proposed system can operate in real time and predict malignant or benign breast cancer from tweets as they arrive. The proposed system receives streaming tweets, each of which carries the 12 features; these are passed to RF, which classifies the tweet as malignant or benign breast cancer. Table 6 shows a sample of structured tweets and their predicted labels. It can be seen that two tweets contain malignant breast cancer indications and five tweets contain benign breast cancer indications.
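To make the feature-ordering step concrete, the following minimal sketch turns one structured tweet into a feature vector in the training order before it is handed to the classifier. The 12 feature names are those listed above; everything else is an assumption for illustration: the "name=value" token layout, the tweet_to_vector helper, and the numeric values are hypothetical, since the exact tweet format is shown only in Table 6.

```python
# Feature names in the order used for training (taken from the text above).
FEATURE_ORDER = [
    "ra_mean", "te_mean", "com_mean", "con_mean", "fr_di_mean", "ra_se",
    "com_se", "con_se", "sm_worst", "com_worst", "con_worst", "sym_worst",
]


def tweet_to_vector(text):
    """Parse whitespace-separated 'name=value' tokens (an assumed format)
    and return the values as floats in training order.
    Returns None if a feature is missing or a value is not numeric."""
    values = {}
    for token in text.split():
        name, sep, raw = token.partition("=")
        if not sep or name not in FEATURE_ORDER:
            continue  # skip the header word and any unrelated tokens
        try:
            values[name] = float(raw)
        except ValueError:
            return None
    if len(values) != len(FEATURE_ORDER):
        return None
    return [values[name] for name in FEATURE_ORDER]


# Made-up tweet whose features arrive in reverse order.
tweet = "streamingcancer " + " ".join(
    f"{name}={i}" for i, name in enumerate(reversed(FEATURE_ORDER))
)
vector = tweet_to_vector(tweet)
print(vector[0], vector[-1])  # 11.0 0.0
```

Reordering by name rather than by position means a tweet with shuffled fields still yields the same vector, and incomplete tweets are rejected instead of being scored with missing inputs.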

7. Conclusion

In this research, we proposed a system for predicting breast cancer in real time. The proposed system is based on Apache Spark and Apache Kafka and consists of two components: an offline model-building component and an online prediction pipeline. In the offline component, we evaluated the performance of four machine learning algorithms, LR, SVM, RF, and DT, on features from the BCWD dataset to predict malignant or benign breast cancer. We applied correlation analysis to select the critical features and then applied two feature selection algorithms to the retained features to choose the most essential ones. Machine learning models with k-fold cross-validation and hyperparameter tuning were applied to the features after correlation and after feature selection to obtain the model with the highest accuracy. In the online prediction pipeline, the proposed system is evaluated in real time using streaming tweets. Streaming tweets are retrieved from Twitter using the header word “∗streamingcancer” and sent to a Kafka topic. Apache Spark reads the tweets from the Kafka topic, extracts the health attributes, and sends them to the online prediction step. The online prediction step then passes the health attributes, as a vector in the same order as the training data, to the developed model to predict whether the tweet indicates malignant or benign breast cancer. The results showed that RF with the features selected by RFECV has the best accuracy at 99.1%.

Data Availability

The data used to support the findings of this study are available in the Breast Cancer Wisconsin (Diagnostic) dataset (https://www.kaggle.com/uciml/breast-cancer-wisconsin-data).

Conflicts of Interest

The authors declare that they have no conflicts of interest.