4.2. Evaluation Indicators
The performance metrics used to evaluate the machine-learning model have been extensively explained in previous studies [18,19]. Here is a brief introduction to these metrics:
Accuracy measured the overall performance of the model by calculating the ratio of correctly classified samples to the total number of samples.
Precision evaluated the ability of the model to correctly identify positive samples among those predicted as positive. It was calculated as the ratio of true-positives to the sum of true-positives and false-positives.
Recall, also known as sensitivity or the true-positive rate, measured the model’s ability to correctly identify positive samples. It was calculated as the ratio of true-positives to the sum of true-positives and false-negatives.
The F1-score measured the model’s balance between precision and recall. It was the harmonic mean of the two and provided an overall assessment of the model’s performance.
The formulas for calculating these metrics are shown in Equations (7)–(10), where $TP$ represents true-positives, $FP$ represents false-positives, $TN$ represents true-negatives, and $FN$ represents false-negatives:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (7)$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (8)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (9)$$

$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (10)$$
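These four metrics follow directly from the confusion-matrix counts. A minimal sketch in plain Python, with hypothetical counts chosen purely for illustration:

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)      # Eq. (7)
    precision = tp / (tp + fp)                      # Eq. (8)
    recall = tp / (tp + fn)                         # Eq. (9)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (10): harmonic mean
    return accuracy, precision, recall, f1

# Hypothetical counts for illustration.
acc, prec, rec, f1 = classification_metrics(tp=90, fp=10, tn=85, fn=15)
```

Note that precision and recall trade off against each other, which is why the harmonic mean (F1) is reported alongside them.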
4.3. Dataset Selection
The completeness of the training data directly affects the effectiveness of deep-learning models and their performance in real-world applications. Public datasets in the field of intrusion detection, such as KDD CUP 99 and NSL-KDD, have had some limitations, including outdated data, a lack of integrity and diversity in attacks, and data-sanitization issues. As a result, most attacks in these datasets have lacked meaningful payload information and thus have not effectively reflected attack trends. After conducting multiple experimental attempts, we found that the UNSW-NB15 and CICIDS2017 datasets contained a significant number of valid attack payloads. These datasets could accurately reflect the trends of network attacks over a period of time, meeting the requirements of this study. The UNSW-NB15 dataset consisted of 257,673 flow records, with 175,341 records used for the training set and 82,332 records used for the testing set. Each flow record contained 49 features, including 5 flow features, 13 basic features, 8 content features, 9 time features, 12 additional features, and 2 label tags. The CICIDS2017 dataset consisted of 8 CSV files, comprising a total of 2,273,097 network flow samples. Each flow sample included 79 features and 1 label.
4.4. Experimental Setup
The experimental dataset was constructed using the CICIDS2017 dataset, which was a standard dataset for network traffic anomaly detection. The training process of ResADM is shown in Algorithm 1.
The CICIDS2017 dataset needed to be cleaned according to the prescribed model design. First, any illegal characters in the “Label” column of the dataset were identified and replaced. Next, column 55 (Fwd Header Length), which duplicated column 34 (Fwd Header Length), was deleted. Furthermore, rows in which the values in columns 31 (Bwd PSH Flags), 33 (Bwd URG Flags), 56 (Fwd Avg Bytes/Bulk), 57 (Fwd Avg Packets/Bulk), 58 (Fwd Avg Bulk Rate), 59 (Bwd Avg Bytes/Bulk), 60 (Bwd Avg Packets/Bulk), and 61 (Bwd Avg Bulk Rate) were all equal to zero were discarded. Column 14 (Flow Bytes/s) contained 1358 instances of “NaN” and 1509 instances of “Inf”, and column 15 (Flow Packets/s) contained 2867 instances of “Inf”. To rectify these invalid values, the column mean was calculated and substituted for them.
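The cleaning steps above can be sketched with pandas. The tiny synthetic frame below stands in for a CICIDS2017 CSV (the sample values are invented for illustration; only the two flow-rate column names come from the dataset):

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame standing in for a CICIDS2017 CSV (values are illustrative).
df = pd.DataFrame({
    "Flow Bytes/s": [1.0, np.nan, np.inf, 4.0],
    "Flow Packets/s": [2.0, 3.0, np.inf, 5.0],
    "Label": ["BENIGN", "DoS \ufffdHulk", "DDoS", "BENIGN"],
})

# 1. Replace illegal (non-printable-ASCII) characters in the "Label" column.
df["Label"] = df["Label"].str.replace(r"[^\x20-\x7E]", "-", regex=True)

# 2. Treat Inf as missing, then substitute each column's mean for the
#    invalid (NaN/Inf) entries, as described in the text.
num_cols = ["Flow Bytes/s", "Flow Packets/s"]
df[num_cols] = df[num_cols].replace([np.inf, -np.inf], np.nan)
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
```

Dropping the duplicated header-length column and the all-zero bulk-feature rows would follow the same pattern with `DataFrame.drop` and boolean row masks.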
Algorithm 1: ResADM for Attack Detection
Once the dataset was cleaned, purposive sampling of the network attack samples could be performed. First, 10,000 benign flows labeled “BENIGN” were randomly selected from the file “Monday-WorkingHours”. Then, 5000 instances of brute-force attacks labeled “FTP-Patator” and another 5000 instances labeled “SSH-Patator” were extracted from the file “Tuesday-WorkingHours” and merged to form a set of 10,000 attack samples labeled “Patator”. Similarly, 10,000 instances of denial-of-service attacks labeled “DoS” were sampled from the file “Wednesday-workingHours”. Lastly, 10,000 instances of distributed denial-of-service attacks labeled “DDoS” were extracted from the file “Friday-WorkingHours-Afternoon-DDos”. These steps resulted in a dataset containing 40,000 network flow samples, each with 78 features and 1 label. The dataset was then randomly split into an 80% training set and a 20% test set. The distribution of each class within the divided dataset can be found in Table 2.
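The sampling and splitting procedure can be sketched with pandas. Small synthetic frames stand in for the dataset’s CSV files, and the sample sizes are scaled down from the paper’s 10,000-per-class counts:

```python
import pandas as pd

# Synthetic stand-ins for two CICIDS2017 CSV files (sizes scaled down).
monday = pd.DataFrame({"Label": ["BENIGN"] * 100})
tuesday = pd.DataFrame({"Label": ["FTP-Patator"] * 50 + ["SSH-Patator"] * 50})

# Randomly sample benign flows.
benign = monday[monday["Label"] == "BENIGN"].sample(n=20, random_state=42)

# Sample both brute-force classes and merge them under a single "Patator" label.
patator = pd.concat([
    tuesday[tuesday["Label"] == "FTP-Patator"].sample(n=10, random_state=42),
    tuesday[tuesday["Label"] == "SSH-Patator"].sample(n=10, random_state=42),
]).assign(Label="Patator")

dataset = pd.concat([benign, patator], ignore_index=True)

# 80/20 random train/test split.
train = dataset.sample(frac=0.8, random_state=42)
test = dataset.drop(train.index)
```

The DoS and DDoS classes would be sampled from their respective files in exactly the same way before concatenation.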
The training set was then passed to the attack-feature-selection layer for feature-importance analysis using the LightGBM model. The importance of each feature was calculated during the multi-classification of the traffic samples, and the features were ordered accordingly. The top k features with the highest importance were selected for retraining the baseline multi-classification model. The value of k was gradually increased until it reached 32, which maximized the accuracy of the retrained baseline model. The remaining features were removed and the selected 32 features were kept, as shown in Figure 4.
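The rank-then-retrain loop can be sketched as follows. A scikit-learn random forest stands in for the paper’s LightGBM model here (LightGBM’s `LGBMClassifier` exposes the same `fit`/`feature_importances_` interface), and the toy data, feature counts, and estimator settings are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy multi-class data standing in for the 78-feature flow samples.
X, y = make_classification(n_samples=500, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Rank features by importance from a trained baseline model.
base = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
order = np.argsort(base.feature_importances_)[::-1]

# Gradually grow k, retraining on the top-k features, and keep the best k.
best_k, best_acc = 0, 0.0
for k in range(1, X.shape[1] + 1):
    cols = order[:k]
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    acc = clf.fit(X_tr[:, cols], y_tr).score(X_te[:, cols], y_te)
    if acc > best_acc:
        best_k, best_acc = k, acc
```

In the paper this search settled at k = 32 on the 78-feature flow data; on the toy data above the optimum will of course differ.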
To visualize the correlation between these 32 features and their sample labels, the K-means clustering algorithm was applied to the network flow samples, as illustrated in Figure 5. The visualization demonstrated a clear distinction between the benign flow samples and the various categories of network attack flow samples, indicating that these 32-dimensional features effectively characterized normal traffic and attack traffic. Although DoS and DDoS attacks shared some behavioral features, there were still significant differences due to the presence of a large number of dispersed source addresses in the DDoS attacks. Therefore, the selected 32-dimensional features could be used as a feature set for training the backbone network.
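The clustering step can be sketched with scikit-learn. The blob data below is a toy stand-in for the 32-dimensional selected feature set, with four centers mirroring the four traffic categories (BENIGN, Patator, DoS, DDoS); agreement between clusters and true labels is checked with the adjusted Rand index rather than a plot:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Toy 32-dimensional samples with 4 well-separated ground-truth classes.
X, y_true = make_blobs(n_samples=400, n_features=32, centers=4,
                       cluster_std=1.0, random_state=0)

# Cluster into as many groups as there are traffic categories.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# High agreement between clusters and labels indicates the features
# separate the classes well, as Figure 5 shows for the real data.
ari = adjusted_rand_score(y_true, km.labels_)
```

An ARI near 1.0 corresponds to the clear visual separation described for the real 32-feature samples.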
4.5. Result Analysis
The training set was reduced to include only the 32 effective features. The data were then transformed and prepared to meet the required dimensions for the ResADM model. The transformed training set was used to train the model. Initially, the dataset was trained on the backbone network built with the ResNet-50 architecture, which had been pre-trained on the UNSW-NB15 dataset. This pre-training enabled the utilization of learned generic features from the pre-trained model for transfer learning in the task of network-attack-behavior detection. During the fine-tuning process, a portion of the convolutional layer weights was frozen, and only the newly added, fully connected layers were trained. This approach reduced the need for parameter fine-tuning, conserved computational resources, and allowed for the efficient adaptation of the features to the categories of the network-attack-behavior dataset.
To ensure the stable optimization of the model parameters during the fine-tuning process, the learning rate was set to 0.001. The fully connected layers and a softmax-classifier layer were added after the backbone network. The fully connected layers retrained the features to align with the categories of the network-attack-behavior dataset and converted the output of the backbone network into a probability distribution, facilitating network-attack-behavior detection. During the training process of ResADM, the model underwent 200 iterations at a batch size of 64. The Adam optimizer was used for optimization. The overall multi-classification loss function employed was SparseCategoricalCrossentropy. The loss was calculated as described in Equation (11), where $y_i$ represents the true class label of sample $i$ (not one-hot encoded), and $\hat{p}_{i}$ denotes the predicted probability distribution of the model for that sample, indicating the probability of belonging to each class:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\log \hat{p}_{i,\,y_i} \quad (11)$$
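The sparse categorical cross-entropy loss can be written out directly in NumPy: for each sample, take the predicted probability of its integer class label and average the negative logs. The probabilities below are hypothetical values for illustration:

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, probs):
    """Mean negative log-probability of the true class.

    y_true: integer class labels (not one-hot), shape (N,)
    probs:  predicted probability distributions, shape (N, C)
    """
    n = y_true.shape[0]
    # Pick out probs[i, y_true[i]] for each sample i, then average -log.
    return float(-np.mean(np.log(probs[np.arange(n), y_true])))

# Hypothetical predictions for 2 samples over 4 classes.
probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.5, 0.2, 0.1]])
y_true = np.array([0, 1])
loss = sparse_categorical_crossentropy(y_true, probs)
# loss = -(log 0.7 + log 0.5) / 2
```

This is the same quantity Keras’ `SparseCategoricalCrossentropy` computes when given probabilities (i.e., with `from_logits=False`).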
The trends of accuracy and loss values during the 200 iterations of ResADM training are shown in
Figure 6. In
Figure 6a, it was observed that the model quickly converged and achieved a high level of accuracy. After approximately 100 iterations, the accuracy stabilized at around 0.99, indicating that the model had successfully learned to accurately classify network attack behaviors.
Figure 6b displays the loss values of the model on both the training and testing sets. The low values of the loss function on both sets demonstrated that the model had effectively minimized errors during the training process. This suggested that the model had successfully learned the underlying patterns and features associated with network attack behaviors. The convergence of the accuracy and the low levels of loss indicated the effectiveness of the ResADM model in detecting network attack behaviors. These results demonstrated that the model had generalized well on the given dataset, achieving high accuracy and effectively minimizing errors.
Following the completion of the ResADM training, the parameters of the backbone network were updated, and ResADM was evaluated using the validation set, as presented in
Table 3. The overall accuracy of ResADM was reported as 99.9%, indicating its high performance in accurately detecting network attack behaviors. Furthermore, the recognition performance for each network category exceeded 99%, highlighting the model’s effectiveness for classifying different types of network traffic.
ResADM utilized a purposeful sampling-based data-cleaning method, which contributed to its accurate identification and labeling of attack behaviors within large volumes of CPS traffic. This approach yielded high accuracy across all network traffic categories, approaching or equaling 100%. As a result, it effectively addressed the challenge of data-labeling for CPS attack behaviors. Additionally, ResADM employed a feature-selection method based on feature importance, allowing for data dimension reduction and the elimination of unnecessary features. This enabled the model to identify the relevant features associated with covert or deceptive CPS attack behaviors. As a result, ResADM achieved a high recall rate, demonstrating its ability to effectively identify and capture instances of attack behaviors. The combination of high accuracy, excellent recognition performance across network categories, and the capability to handle challenging data labeling and discover effective features made ResADM a robust model for detecting CPS attack behaviors.
Figure 7 presents the confusion matrix heatmap of the validation results obtained by ResADM. From the figure, it could be observed that due to the similarity in the behavioral features between BENIGN traffic and the DDoS attack behavior, the recall rate and F1-score for BENIGN were relatively low. However, leveraging the concept of transfer learning, ResADM utilized the advantages of pre-training the ResNet-50 backbone network and retained relevant features, which effectively addressed this issue. Consequently, ResADM achieved favorable results across all metrics.
To thoroughly validate the advancements of the proposed ResADM method, comparative experiments were conducted in the same simulated environment with the latest relevant works, including CNN [14], DNN [20], and LSTM [21]. The experiments were carried out using identical dataset distributions and hyper-parameter settings to ensure a controlled experiment. The results of these experiments are presented in
Table 4.
The analysis of the results in Table 4 demonstrated that the purposeful sampling-based data-cleaning method proposed in this study effectively partitioned the dataset of CPS attack behaviors and addressed the challenge of labeling such data. As a result, all the comparative models achieved favorable classification accuracy on this dataset. However, CNN exhibited higher sensitivity towards image data, while LSTM networks were more suited to sequential data; therefore, when mining CPS attack-behavior features, the performance of these two models was relatively weak compared to the feature-selection method based on feature importance proposed in this study. Furthermore, although the effectiveness of DNN improved with increasing model depth, it still fell short of the performance achieved by the transfer-learning model based on ResNet-50. The proposed ResADM model demonstrated superior performance compared to existing methods in terms of classification accuracy and feature extraction for attack-behavior detection. It leveraged transfer learning and feature selection based on feature importance to overcome the limitations of other models and offered a more effective solution for identifying CPS attack behaviors.