1. Introduction
With the decade-long development of remote sensing technology, the Gaofen series of satellites has formed a “three-high” observation system with high spatial, temporal, and spectral resolution [
1], which uses sensors to acquire information about the Earth from long distances. In remote sensing imagery, cloud and cloud shadow regions are important targets for identification; by locating cloud shadows in an image, visible-light, infrared, and other ground information can be obtained and used to monitor cloud coverage, cloud type, and the direction of cloud movement. This provides meteorologists and weather forecasters with critical data to help them predict the weather more accurately. However, merely identifying the location of cloud cover is insufficient. The presence of cloud shadows can obstruct analysis in precision agriculture and other fields, leading to biased results. Consequently, cloud shadow detection is increasingly widely applied in meteorological forecasting, environmental monitoring, and natural disaster detection. Cloud detection technology alone [
2] is inadequate; thus, utilizing cloud and cloud shadow detection technology to accurately detect cloud cover from remote sensing images is a crucial preprocessing step for most satellite imagery. In this paper, we propose a segmentation algorithm for separating the three components of clouds, cloud shadows, and background in remote sensing images.
Traditional cloud shadow segmentation methods can be broadly categorized into the following five types: 1. thresholding-based methods; 2. morphology-based methods; 3. statistics-based methods; 4. texture-feature-based methods; and 5. machine-learning-based methods. Thresholding methods use physical quantities derived from data such as AVHRR and NIR imagery to set feature thresholds (e.g., luminance and chromaticity) for detecting cloud shadows in the image. In early research, fixed thresholds were used to distinguish clouds from other regions. For instance, Saunders and Kriebel [
3] processed the NOAA-9 dataset over a week by determining thresholds for a range of physical parameters including cloud-top temperatures, optical depths, and liquid water content. While the fixed threshold method is straightforward and user-friendly, it lacks the adaptability needed to accommodate various meteorological conditions, lighting scenarios, geographical regions, and times of day. Additionally, it often necessitates manual threshold adjustments, which pose numerous shortcomings and limitations. Later, many researchers proposed improvements by using dynamic thresholding for cloud detection [
4,
5,
6,
7]. The dynamic thresholding method adjusts thresholds based on environmental conditions through the construction of diverse physical models, thereby enhancing the accuracy of automatic cloud analysis. However, for complex cloud and surface types this method can be difficult to apply, and it also incurs significant computational costs. Secondly, morphological methods, grounded in set theory, define a series of image operations such as dilation, erosion, opening and closing, and hit-or-miss transformations. Danda and Xiang Liu et al. [
8,
9] constructed skeleton features to help analyze the morphology of the cloud and thus separate it from other regions by using a gray-level morphological edge extraction method. Moreover, Tom et al. [
10] established a common method based on morphological data to create an efficient computational paradigm for the combination of simple nonlinear grayscale operations such that the cloud detection filter exhibits spatial high-pass properties, emphasizes cloud shadow regions in the data, and suppresses all other clutter. Morphological methods are more effective when cloud edges are blurred and shapes are complex, but they are difficult to apply directly to multispectral images. Thirdly, statistical methods use statistical and analytical tools to establish regression equations for differences in reflectance, brightness, or temperature between image pixels in satellite data to detect clouds. For example, Amato et al. [
11] used PCA and nonparametric density estimation applied to the SEVIRI sensor dataset, and Wylie et al. [
12] combined time-series analyses of more than 20 years of polar-orbiting satellite cloud data to predict future cloud trends. However, since the sample data used in regression models are historical, this type of method is not widely used and is limited to specific times and regions. Fourthly, the texture feature method identifies cloudy and non-cloudy regions by extracting the texture features of images. For example, Abuhussein et al. [
13,
14] conducted segmentation by analyzing the GLCM (Gray-Level Co-occurrence Matrix) to capture spatial relationships and covariance frequencies between pixels of varying gray levels in the image. This process enables the extraction of crucial information regarding the image texture. Reiter and Changhui et al. [
15,
16,
17] completed segmentation by using the wavelet transform to detect texture features and edge information in the image at different spatial scales and to decompose the cloud image into details at different scales to obtain local and global features of the cloud, while Surya et al. [
18] used a clustering algorithm to group texture regions similar to the cloud shadow. This method works better for texture-rich cloud shadow images. To overcome the limitations of the first four traditional methods, machine learning algorithms are proposed to realize cloud shadow segmentation by training classifiers. Support vector machines, random forests, and neural networks are typical classifiers. For instance, Li et al. [
19] proposed a classifier based on support vector machines to detect clouds in images, while Ishida et al. [
20] quantitatively guided the support vector machines with the help of classification effect metrics to improve the feature space used for detecting cloud shadows and to reduce the frequency of erroneous results. Fu et al. [
21] combined the ensemble thresholding method and random forest for the FY-2G image set to improve the meteorological satellite cloud detection technique, and Jin et al. [
22] established a backpropagation (BP) neural network model for the MODIS dataset, which improved the learning model to a certain extent. Although these methods are indeed more effective, they necessitate manual feature engineering and a large volume of annotated data for training and testing. Furthermore, the quality of the model is directly influenced by the features selected.
To overcome the shortcomings of manual feature engineering, deep convolutional neural networks (CNNs) gradually emerged; a variety of CNNs were proposed for remote sensing image segmentation tasks, and semantic segmentation algorithms based on deep learning gradually became mainstream. Long et al. [
23] first proposed a fully convolutional neural network, FCN, for semantic segmentation in 2015, which can directly realize end-to-end pixel-by-pixel classification. Mohajerani et al. [
24] applied the FCN network to the remote sensing image Landsat dataset cloud detection technique in 2018, which dramatically improved the efficiency of the target classification of remote sensing images; however, the results obtained were still not fine enough and not sensitive enough for the detailed parts of the image. Since then, there has been a surge in deep learning networks, with numerous CNN frameworks continuously being proposed. In 2015, Badrinarayanan et al. [
25] introduced SegNet, a segmentation network based on an encoder–decoder structure, utilizing up-sampling with the unpooling operation. Subsequently, in 2019, Lu et al. [
26] adapted the SegNet network model for cloud recognition in remote sensing images. Their approach improved the accuracy of cloud recognition by preserving positional indices during the pooling process, thus retaining image details through a symmetrical parallel structure. Although it demonstrated some ability in cloud–snow differentiation, its training time was found to be excessively long and inefficient. In 2016, Chen et al. [
27] designed a dilated (atrous) convolutional network called DeepLab, aimed at expanding the receptive field by introducing holes into the convolution kernel. DeepLab enhances the robustness of image segmentation but imposes specific requirements on the size of the segmented target: it excels at segmenting foreground targets within a typical size range, yet for extreme size variations, such as very small or very large targets, it performs poorly and suffers from segmentation instability. In 2015, Ronneberger et al. [
28] proposed the UNet image segmentation network, named because the network framework is shaped like the letter U. The contextual information is fused through feature splicing in the channel dimension during the up-sampling process to achieve a more fine-grained segmentation, which is suitable for highly detailed segmentation tasks. In 2017, Zhao et al. [
29] designed a pyramidal scene parsing network structure, PSPNet, which integrates contextual information from different regions, applies convolutional kernels of different sizes, and employs multi-scale receptive fields to efficiently combine local and global cues. In 2022, Zhang et al. [
30] proposed a dual pyramidal network, DPNet, inspired by PSPNet. Its multi-scale design captures image features at different scales, thus enhancing the network’s feature extraction capability, but it also incurs greater computational cost, making training and prediction slower.
Although existing CNNs perform better in remote sensing image segmentation tasks, there is still a general problem: due to the down-sampling nature of the convolutional operation, the network is prone to lose critical detail information during feature extraction and scale reduction, which leads to many problems, such as inaccuracy and blurred edges in segmentation results. Many studies have demonstrated that combining low-level and high-level semantic information can significantly improve model performance [
31]. However, traditional feature fusion methods are usually too simple and do not pay enough attention to edge information and image features to effectively restore lost information, especially for tasks with complex backgrounds, which may lead to missed detection of fine targets and edge blurring. To address these challenges in semantic segmentation, we propose a new approach for cloud shadow segmentation—an attention mechanism feature fusion network based on the UNet framework. The encoder–decoder architecture of UNet effectively extracts and restores feature information across various scales, making it particularly suitable for smaller-scale datasets. Therefore, we adopt this U-shaped network structure as our baseline and integrate the channel attention mechanism and spatial attention mechanism module into it. This integration allows for adaptive attention to different channels of the image and feature map information, with the goal of enhancing the fine detection of cloud shadows. The addition of the new feature fusion module can effectively fuse the low-level and high-level features, restore the lost information, and segment the fine features more accurately in such a complex context as the cloud shadow segmentation task. The AFMUNet network framework is shown in
Figure 1. After an image is input, high-level image features are first extracted through down-sampling. Subsequently, during up-sampling, as feature map resolution is restored, we adaptively enlarge the receptive field and apply different channel operations. In addition, a feature fusion module is used at each layer to integrate contextual information more accurately and fuse low-level and high-level information. Furthermore, an innovative loss function is employed during training, and classification results are output after multiple sampling stages. Through the combined effect of these modules, the detection accuracy of our network is substantially improved. The main contributions of this paper are as follows:
An integrated module of channel space attention mechanism, suitable for cloud shadow segmentation tasks within a U-shaped structure, is proposed. This model facilitates dynamic adjustment of feature map weights, enhancing the ability to capture crucial image features and thereby improving segmentation accuracy.
The feature fusion operation of the original network is updated, which helps to better understand the target and background in the image, segment the image using information from different scales, and deal with cloud shadow targets of different sizes and shapes.
An innovative weighted loss function is developed for the dataset, which improves the accuracy of model learning and optimizes the model performance to some extent.
A network that integrates the above three features and combines them with a feature extraction network is proposed to segment high-resolution remote sensing images.
2. Methodology
Since the purpose of the cloud–shadow segmentation task is to match labels on a pixel-by-pixel basis on an image to distinguish between clouds, cloud shadows, and backgrounds, the task can be regarded as a semantic segmentation task for triple categorization. Recently, CNNs have achieved great success in the field of computer vision, especially in image segmentation tasks. As pointed out in
Section 1, due to the diversity of cloud layers, irregular shapes, and variations in lighting conditions and shooting locations, cloud shadow segmentation tasks often require highly accurate models to cope with these complexities. Nevertheless, traditional machine learning algorithms may face challenges in meeting the stringent accuracy demands of cloud shadow segmentation tasks, particularly in scenarios involving snowy mountainous terrain or under low-light conditions [
32]. When dealing with the cloud shadow segmentation task, we need an efficient network structure that can fully capture the detailed features of clouds while preserving the surface information. To fulfill this requirement, we choose the UNet structure as the backbone network framework, which is appropriately modified to incorporate CSAM and FFM improvement modules to further improve the performance of the model in capturing the complex structure and irregular shape of cloud shadows.
2.1. UNet—A Network Based on Encoder–Decoder Architecture (Related Work)
UNet is a classical deep-learning architecture especially suited for image segmentation tasks. It is designed as an encoder–decoder structure with special skip connections to better capture features and details at different scales in segmentation tasks. The following are the main features and working principles of UNet:
1. Encoder Part: The encoder part of UNet consists of multiple convolutional layers that gradually halve the size of the feature map while increasing the number of feature channels. This helps to extract high-level feature representations of the image and capture semantic information at different scales. The encoder part usually includes operations such as convolutional layers, pooling layers, etc.
2. Skip connections: UNet introduces skip connections that concatenate the feature maps of the encoder with those of the decoder, bringing more detailed information into the decoder. This helps to overcome the information loss that may be introduced by pooling operations and improves the performance of the segmentation model.
3. Decoder Part: The decoder part of UNet consists of multiple convolutional and up-sampling layers that gradually recover the spatial resolution of the feature map through operations such as transposed convolution. The decoder restores the low-resolution feature map to the size of the original input image through up-sampling while performing feature extraction through convolution operations.
4. Output Layer: The output layer of UNet is usually a convolutional layer whose output is a segmentation mask indicating the class or segmentation result of each pixel in the image. The number of channels in the output layer is usually equal to the number of categories in the task.
The UNet architecture has achieved excellent performance in a variety of fields, such as medical image segmentation, remote sensing image analysis, and automated driving, where it can efficiently capture semantic information and details in an image while maintaining high resolution. In our study, only the basic architecture of UNet is retained, based on which innovations and modifications are made.
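The four components above can be sketched as a minimal PyTorch model. This is an illustrative two-level UNet, not the AFMUNet proposed in this paper; the channel widths, depth, and transposed-convolution up-sampling are simplifying assumptions chosen for brevity.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions, as in the classical UNet encoder/decoder stages
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

class MiniUNet(nn.Module):
    """Minimal two-level UNet: the encoder halves resolution, the decoder
    restores it, and skip connections concatenate encoder features into
    the decoder along the channel dimension."""
    def __init__(self, in_ch=3, n_classes=3, base=16):
        super().__init__()
        self.enc1 = conv_block(in_ch, base)
        self.enc2 = conv_block(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(base * 2, base * 4)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.dec2 = conv_block(base * 4, base * 2)   # *4: skip concat doubles channels
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = conv_block(base * 2, base)
        self.head = nn.Conv2d(base, n_classes, 1)    # 1x1 conv output layer

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))  # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.head(d1)  # per-pixel class logits
```

For the three-class cloud/shadow/background task, `n_classes=3` and the output logits have the same spatial size as the input image.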
2.2. CSAM (Channel Spatial Attention Module)
To better understand the key features and structures in an image and to improve the segmentation of complex scenes, we introduce the attention mechanism. The concept of attention mechanism originated in the field of natural language processing. It serves to emphasize words at different positions within an input sentence, thereby facilitating improved translation into the target language [
33,
34]. For instance, in machine translation, the attention mechanism helps the model focus on relevant parts of the input sentence when generating each word of the translation. This allows for more accurate and contextually appropriate translations, especially in cases where the input sentence is long or complex. Similarly, in text summarization, the attention mechanism aids in identifying important sentences or phrases to include in the summary, resulting in more concise and informative summaries. Now, we apply it to image semantic segmentation tasks to help process image information more efficiently by focusing attention on key regions in the image while suppressing irrelevant information. This is an approach that mimics the human visual and cognitive system, which is similar to how the human cerebral cortex achieves efficient analysis by focusing on specific parts when processing image and video information in complex scenes. In general, the attention mechanism can be categorized along four dimensions—channel, spatial, temporal, and branch [
35]—which play different roles in different computer vision tasks.
As shown in
Figure 2 below, we add the CSAM module, which skillfully combines the channel and spatial attention mechanisms, to the basic structure of UNet after each stage of the up-sampling phase. For a given feature map, the CSAM module generates feature map information in the channel and spatial dimensions [
36] and multiplying them with the original input feature map to perform adaptive feature adjustment and correction. Eventually, the CSAM module outputs feature maps, adjusted by the attention mechanism, with stronger semantic information and adaptability. This module enhances our ability to focus on the channel information of the image during cloud shadow segmentation tasks, thereby improving cloud perception and segmentation accuracy.
2.2.1. CAB (Channel Attention Block)
CAB is an important component of the CSAM module. It focuses on weighting attention given to the channel dimensions in the feature map [
37,
38]. The goal of the channel attention mechanism is to enhance the attention given to different channels by dynamically adjusting the weights between channels. This is crucial to improve the model’s ability to perceive different features in the image. The CAB module works as follows:
The steps of the CAB module are shown in
Figure 3 below. Step 1: Firstly, the input feature map F is subjected to global average and maximum pooling operations, compressing the input information to obtain a 1 × 1 average-pooled feature, F_avg^c, and a 1 × 1 max-pooled feature, F_max^c. Step 2: Then, both are fed into a weight-sharing two-layer neural network (MLP). Step 3: Finally, the two MLP outputs are summed element by element and, after activation by the Sigmoid function, applied to the input feature map to generate the final Channel Attention Feature, M_c(F). The above computational process is expressed as Equation (1), shown below:

M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W_1(W_0(F_avg^c)) + W_1(W_0(F_max^c)))    (1)

where σ is the Sigmoid function and W_0 and W_1 represent the weights of the hidden and output layers, respectively; the parameters W_0 and W_1 are shared in the MLP.
Attention weights on the channel dimensions, indicating the contribution of different channels to the final feature representation, were generated by CAB, and these weights were applied to the original input feature map to generate features for the input spatial attention mechanism module. Channel-level feature tuning is achieved by weighting each channel’s features. This means that the model can better focus on the channel features that are important to the task at hand, improving the representation of semantic information.
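The channel attention computation of Equation (1) can be sketched as a PyTorch module; the channel reduction ratio of the shared MLP (16 here) is an assumption, as it is not specified above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttentionBlock(nn.Module):
    """CBAM-style channel attention: a shared two-layer MLP over the global
    average- and max-pooled descriptors, summed and Sigmoid-activated."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Weight-sharing MLP (W_0, W_1), implemented with 1x1 convolutions
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False))

    def forward(self, x):
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))   # MLP(AvgPool(F))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))    # MLP(MaxPool(F))
        w = torch.sigmoid(avg + mx)                   # channel weights in (0, 1)
        return x * w                                  # reweighted feature map
```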
2.2.2. SAB (Spatial Attention Block)
Unlike CAB, the SAB module focuses on the spatial dimension of the feature map. Its goal is to enhance the focus on different regions in the image by adjusting the weights of different spatial locations to improve the model’s perception of global contextual information. The SAB module works in
Figure 4 as follows:
Step 1: First, the feature map output from the CAB module, F′, is used as the input of this module, and global maximum pooling and average pooling are performed along the channel dimension; the two results are then concatenated. Step 2: Next, a 7 × 7 convolution kernel is chosen to perform a convolution operation on the concatenated result, reducing the channel dimension to 1. Step 3: Finally, the Sigmoid activation function maps the weights to values between 0 and 1, representing the importance of each position, and these spatial attention weights are applied to the input to generate the feature map of the spatial attention mechanism, M_s(F′). The above computational process is expressed as Equation (2), shown below:

M_s(F′) = σ(f^{7×7}([AvgPool(F′); MaxPool(F′)]))    (2)

where f^{7×7} denotes a convolution with a 7 × 7 kernel; this size performs better than other choices.
SAB generates attention weights in the spatial dimension through a series of convolutional operations and activation functions that indicate the contribution of different locations to the final feature representation. This means that the model can better focus on key regions in the image, thus improving the perception of global contextual information. The SAB module helps us to more accurately capture the contours and structure of objects in tasks such as semantic segmentation.
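The three steps of the SAB can be sketched as a PyTorch module, a minimal rendering of Equation (2):

```python
import torch
import torch.nn as nn

class SpatialAttentionBlock(nn.Module):
    """Spatial attention: channel-wise average and max pooling, concatenation,
    a 7x7 convolution down to one channel, then Sigmoid weighting."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)    # average pooling over channels
        mx, _ = x.max(dim=1, keepdim=True)   # max pooling over channels
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w                         # position-wise reweighting
```

Applying this block to the output of the channel attention block yields the overall CSAM behaviour described above.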
2.3. FFM (Feature Fusion Module)
The introduction of the FFM module [
39,
40,
41] plays a key role in fusing information from feature maps obtained at deeper and shallower layers, where the skip connections of the original network structure are involved. The FFM module allows us to efficiently fuse features of different scales and resolutions in order to capture the complex structure and irregular shapes of cloud shadows.
The steps of the FFM module are depicted in
Figure 5. Step 1: Accept two feature maps with different resolutions, from the encoder and decoder sections, as input. Step 2: Perform a series of operations, such as concatenation and convolution, to fuse them into an enhanced hybrid feature map, which strengthens the representation of the hybrid features and makes them more suitable for subsequent processing. Step 3: Perform global average pooling on the hybrid feature map to reduce its spatial dimension to 1 × 1 and obtain global channel statistics. Step 4: Apply two consecutive 1 × 1 convolution operations with ReLU and Sigmoid activation functions to enhance the nonlinearity and express the importance of each channel. Step 5: Multiply the resulting channel attention weights element by element with the hybrid feature map obtained in Step 2 to obtain a weighted feature map. Step 6: Finally, add the weighted feature map from Step 5 element by element to the hybrid feature map from Step 2 to produce the final fused feature map. The above computational process is expressed as Equation (3), shown below:

F_m = Conv(Concat(F_low, F_high)),  A = σ(Conv_{1×1}(ReLU(Conv_{1×1}(GAP(F_m)))))
F_out = (F_m ⊗ A) ⊕ F_m    (3)

where F_m is the fusion of the inputs from the shallow and deep layers, A represents the enhanced nonlinear result as an intermediate variable, and ⊗ and ⊕ denote element-wise multiplication and addition.
The FFM module is a well-designed feature fusion mechanism that effectively integrates feature maps from shallow and deep layers by means of utilizing channel complementarity, adaptively adjusting the weights of the channel features dynamically to better fuse information from different scales and semantic levels. This innovative fusion module offers an effective tool for our research and improves the performance of the capture and segmentation tasks of feature statistics.
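The six steps can be sketched as a PyTorch module as follows; the 3 × 3 fusion convolution, batch normalization, reduction ratio, and bilinear resizing of the deeper feature map are assumptions where details are not specified above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusionModule(nn.Module):
    def __init__(self, low_ch, high_ch, out_ch, reduction=4):
        super().__init__()
        # Steps 1-2: concatenate and fuse into a hybrid feature map
        self.fuse = nn.Sequential(
            nn.Conv2d(low_ch + high_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True))
        # Steps 3-4: global average pooling, then two 1x1 convolutions
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, out_ch // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch // reduction, out_ch, 1),
            nn.Sigmoid())

    def forward(self, low, high):
        # Resize the deeper (lower-resolution) map to match the shallow one
        high = F.interpolate(high, size=low.shape[2:], mode='bilinear',
                             align_corners=False)
        fm = self.fuse(torch.cat([low, high], dim=1))  # hybrid feature map F_m
        a = self.attn(fm)                              # channel weights A
        return fm * a + fm                             # Steps 5-6: mul, then add
```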
2.4. Loss Function
The loss function is an important component in various segmentation network models based on deep learning [
42]. It is used to measure the difference between the prediction and true values of the network and guide the model to make more accurate predictions. In the segmentation task, the reasonable selection, optimization, and innovation of the loss function can enhance the learning process of the model to achieve better segmentation results [
43] as well as portability and application to other networks; thus, the study of the loss function selection is particularly important. The commonly used loss functions [
44] are as follows:
1. Cross Entropy Loss Function
L_CE = −(1/N) Σ_{i=1}^{N} Σ_{c=1}^{M} y_{i,c} log(p_{i,c})

where N denotes the number of samples, M denotes the number of categories, y_{i,c} is the one-hot ground-truth label, and p_{i,c} is the predicted probability that sample i belongs to category c. As the most commonly used loss function in image segmentation, applicable to a large number of semantic segmentation tasks, the cross-entropy loss measures how well the model fits the dataset and helps the network classify pixels correctly.
2. Weighted Cross-Entropy Loss Function
Similar to the cross-entropy loss function, but with each positive sample multiplied by a weighting coefficient, the weighted cross-entropy loss allows the model to focus more on categories with fewer samples, thus mitigating the problem of class imbalance.
3. Focal Loss
In addition to the imbalance in the number of samples across categories, an imbalance between easily recognized and hard-to-recognize samples is often encountered; the Focal Loss can help the network better handle such imbalances in the sample distribution.
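A minimal multi-class Focal Loss can be sketched as follows; the focusing parameter γ = 2 follows the common default rather than a value specified here.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, gamma=2.0):
    """Down-weights well-classified pixels by (1 - p_t)^gamma so that
    training focuses on hard-to-recognize samples."""
    logp = F.log_softmax(logits, dim=1)                     # (N, C, H, W)
    logpt = logp.gather(1, target.unsqueeze(1)).squeeze(1)  # log p of true class
    pt = logpt.exp()
    return (-(1.0 - pt) ** gamma * logpt).mean()
```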
4. Dice Loss
L_Dice = 1 − 2|X ∩ Y| / (|X| + |Y|)

where |X ∩ Y| is the intersection between samples X and Y, |X| represents the number of elements in X, and |Y| stands for the number of elements in Y.
Unlike the weighted cross-entropy loss function, the Dice Loss does not require category reweighting; it calculates the loss directly from the Dice coefficients, which can help the network better handle overlaps and boundaries between categories.
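The soft (differentiable) form of the Dice Loss used in training can be sketched as follows; averaging over classes and the smoothing constant `eps` are common conventions, not values specified above.

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, target, eps=1e-6):
    """Soft Dice loss: 1 - 2|X ∩ Y| / (|X| + |Y|), averaged over classes."""
    probs = torch.softmax(logits, dim=1)                    # X: predicted masks
    onehot = F.one_hot(target, probs.shape[1])              # Y: ground truth
    onehot = onehot.permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(2, 3))                # |X ∩ Y|
    denom = probs.sum(dim=(2, 3)) + onehot.sum(dim=(2, 3))  # |X| + |Y|
    return (1.0 - (2.0 * inter + eps) / (denom + eps)).mean()
```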
5. IOU Loss
L_IoU = 1 − |X ∩ Y| / |X ∪ Y|

where |X ∪ Y| depicts the union between samples X and Y.
The IOU loss measures how similar the predicted segmentation results are to the true segmentation, and it helps to optimize the spatial consistency of the segmentation.
In summary, since the cloud shadows in the image are prone to overlap and it is desirable to distinguish the boundary between the two as accurately as possible, L_CE and L_Dice are selected in this experiment and combined with appropriate weights, L = α · L_CE + β · L_Dice, to derive an innovative loss function suited to the dataset used in this paper.
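The weighted combination of cross-entropy and Dice losses can be sketched as below; the weights alpha = beta = 0.5 are placeholders, since the actual proportions are the subject of the weighting experiments reported in Table 1.

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, target, alpha=0.5, beta=0.5, eps=1e-6):
    """Weighted sum of cross-entropy and soft Dice loss. The weights alpha
    and beta are placeholder values; the paper tunes them empirically."""
    ce = F.cross_entropy(logits, target)
    probs = torch.softmax(logits, dim=1)
    onehot = F.one_hot(target, probs.shape[1]).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(2, 3))
    denom = probs.sum(dim=(2, 3)) + onehot.sum(dim=(2, 3))
    dice = (1.0 - (2.0 * inter + eps) / (denom + eps)).mean()
    return alpha * ce + beta * dice
```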
From
Table 1, it is evident that the last row, which utilizes different weight proportions in the loss function weighted combination, achieves the best performance. This finding aligns with our initial conjecture. The Dice Loss effectively distinguishes between overlap regions and boundaries, aiding in completing the classification task more effectively. Moreover, continuous training is essential for further enhancing the model’s classification accuracy.
4. Conclusions
In remote sensing images, the accurate segmentation of cloud shadow regions is of great practical significance for tasks such as meteorological prediction, environmental monitoring, and natural disaster detection. In this paper, an attention mechanism feature aggregation algorithm is proposed for cloud shadow segmentation, fully leveraging the advantages of convolutional neural networks in deep learning. UNet is selected as the backbone network, an innovative loss function is employed, and two auxiliary modules, CSAM and FFM, are introduced. Our proposed model first performs successive down-sampling to extract high-level features. During each up-sampling step, as the resolution of the feature maps increases, the receptive field is adaptively enlarged and different channel operations are selected, enabling the acquisition of rich contextual information. This facilitates the accurate fusion of low- and high-level information within each layer’s feature fusion module, ultimately restoring the classification and localization of high-resolution remote sensing images. Compared with previous deep learning segmentation methods, our approach achieves a significant improvement in accuracy on cloud shadow segmentation tasks. Experiments demonstrate the remarkable noise resistance and identification capabilities of this method. It accurately locates cloud shadows and segments fine cloud crevices in complex environments, while also producing smoother edge segmentation. Particularly noteworthy is its performance in the task of identifying thick clouds.
However, there are still some shortcomings in cloud shadow segmentation: (1) under the influence of light, some inconspicuous cloud seams may be incorrectly segmented into other features and thus recognized as background; (2) the segmentation of thin clouds still needs refinement to capture the fragmented information of cloud shadows; (3) to better suit practical applications, the model should in the future be appropriately compressed and simplified while maintaining accuracy, reducing inference time and improving the training speed of the network. In the future, augmented learning can be implemented by incorporating a pre-training phase into the model, aiming to enhance segmentation accuracy and reduce training time. Additionally, efforts will be made to explore its application in other domains, including river segmentation and medical tumor segmentation.