Article

Fusion-Former: Fusion Features across Transformer and Convolution for Building Change Detection

1 School of Electronic and Optical Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
2 DFH Satellite Co., Ltd., Beijing 100094, China
3 School of Instrument and Electronics, North University of China, Taiyuan 030051, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(23), 4823; https://doi.org/10.3390/electronics12234823
Submission received: 10 November 2023 / Revised: 24 November 2023 / Accepted: 27 November 2023 / Published: 29 November 2023

Abstract

Change detection (CD) in remote sensing images is a technique for analyzing and characterizing surface changes from remotely sensed data acquired at different times. However, because of the diverse nature of targets in complex remote sensing scenes, current deep-learning-based methods still sometimes extract features that are not discriminative enough, resulting in false detections and loss of detail. To address these challenges, we propose a building change detection method called Fusion-Former. Its backbone, named Fusion-Block, fuses window-based self-attention with depth-wise convolution, combining convolutional neural networks (CNNs) and a transformer to integrate information at different scales effectively. Moreover, to further enhance the performance of the transformer and of Fusion-Block, an innovative attention module called Vision-Module is introduced. On the LEVIR-CD and WHU-CD datasets, our model achieved F1-scores of 89.53% and 86.00%, respectively, showcasing its superior performance over state-of-the-art methods.

1. Introduction

Change detection (CD) in remote sensing is a technique for extracting and characterizing changes in an object or phenomenon of interest over time, based on observations of remotely sensed images from different phases together with reference data, and for quantitatively analyzing and determining those changes [1,2,3]. It has been widely used in applications such as disaster assessment, deforestation monitoring, and urban expansion [4,5,6].
Researchers have developed numerous methods to enhance the efficiency of change detection. Over the past decade, deep learning has gained immense popularity across various fields, including computer vision, information retrieval, natural language processing, and image processing [7,8,9]. It is evident that advances in deep learning are progressively drawing the remote sensing community toward deep-learning-based remote sensing technology [10,11,12].
Deep learning refers to deep neural networks built from layers of artificial neurons. These neurons receive input data and convert them into an output by applying an activation function, learning progressively higher-level features layer by layer. Common neural networks include deep belief networks (DBNs) [13], stacked autoencoders (SAEs) [14], recurrent neural networks (RNNs) [15], convolutional neural networks (CNNs) [16], and transformers, which were introduced in 2017 [17]. The neural networks most commonly used for change detection are CNNs and transformers. A CNN can be regarded as a special form of neural network designed for data with a known grid-like structure; image data, for example, can be considered a two-dimensional (2D) grid of pixels. CNNs are used in almost every task related to image processing, and change detection is no exception [18,19,20,21].
Nowadays, owing to the remarkable discrimination capabilities of deep learning, CNNs have been effectively applied to RS image analysis and have demonstrated impressive performance in building change detection [22]. For instance, Peng et al. [23] elaborated an improved U-Net++ that uses dense skip connections between the layers of the architecture to ease the learning of multi-scale feature maps. Zhang et al. introduced a fully atrous convolutional neural network (FACNN) [24], in which a pixel-based change map is generated from the classification map of current images and an outdated land cover geographical information system (GIS) map. Zhang and Lu [25] presented a spectral–spatial joint learning network (SSJLN) that tackles the problem of joint features not being extracted well. There have been bold attempts to apply a wide variety of classical networks to change detection. However, because these methods rely on convolutional kernel filtering, they are suited only to local feature extraction. For bitemporal images, CNNs require more layers and parameters and are prone to more severe overfitting.
Transformers are a relatively new type of neural network. They were first widely used in natural language processing (NLP), where they have become more popular than CNNs and RNNs. They completely abandon recurrence and convolution and rely entirely on the attention mechanism, which allows for greater parallelism and improved training efficiency. As a result, they are no longer confined to NLP and have achieved good results in other areas such as image classification.
Transformers have only a short history of use in change detection. Concerns have been raised about the effectiveness of methods that rely solely on channel attention to improve the accuracy of building change detection. Change detection witnessed a significant breakthrough with the introduction of the bitemporal image transformer (BIT) [26], a novel approach that combines convolutional neural networks (CNNs) and transformers and marks a new era in change detection applications. Shao et al. designed an attention-guided edge refinement network (AERNet) [27], which uses a CNN and a transformer to capture channel and location associations between features. ChangeFormer [28] uses a hierarchical transformer encoder to extract coarse and fine features from diachronic images.
Despite the high performance of transformers, their shortcomings cannot be ignored. Transformers typically bring huge model sizes and complex model structures. When the extracted features are complex, existing methods may not effectively integrate the local and global features of the bi-temporal images, which can lead to the loss of some shallow feature information and degrade detection effectiveness.
To address these limitations, a combined network called Fusion-Former is proposed in this paper. We fuse window-based self-attention with depth-wise convolution to form a backbone block named Fusion-Block, thereby combining the CNN and the transformer. This block captures contextual patterns within a window and uses depth-wise convolution for local characteristics, expanding the model's receptive field. We then employ a module called Vision-Module to improve the network's feature extraction for change detection. Briefly, this paper offers the following contributions:
  • We propose a new network called Fusion-Former, which combines Fusion-Block and the Vision-Module for change detection.
  • We design Fusion-Block as an encoder to extract both coarse and fine features from diachronic images.
  • We introduce a unique attention module, the Vision-Module, to boost the model's detection effectiveness.

2. Materials and Methods

2.1. Overall Network Architecture

The overall network architecture is shown in Figure 1, which briefly illustrates how Fusion-Former performs change detection on two bi-temporal images. The two bi-temporal images are fed into the network as inputs and then pass through the Fusion-Blocks, the Vision-Module, and the Feature-Enhancing Module in turn to obtain the final output. A detailed description is provided below.

2.1.1. Special-Downsampling and Downsampling

At the beginning of the network, the two input remote sensing images both have size H × W × C, where H, W, and C denote height, width, and number of channels, respectively. They pass through four encoder stages of downsampling and Fusion-Blocks, generating eight different outputs $I_1^i$ and $I_2^i$, where i takes values in the range 1~4.
Downsampling is a 3 × 3 convolutional layer with kernel size K = 3, stride S = 2, and padding P = 1. Special-downsampling is slightly different: one 3 × 3 convolutional layer with S = 2, two 3 × 3 convolutional layers with S = 1, and one 2 × 2 convolutional layer with S = 2.
Downsampling reduces the sizes of $I_1^i$ and $I_2^i$ to $\frac{H}{2^{i+1}} \times \frac{W}{2^{i+1}} \times C_i$. The exact value of $C_i$ can be chosen freely, but $C_{i+1}$ should be larger than $C_i$; we set $C_{i+1} = 2C_i$. A sketch of the two downsampling variants follows.
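The following is a minimal PyTorch sketch of these two downsampling variants; it is an illustration rather than the authors' implementation, and the channel width of 48 used in the example is an assumption.

```python
import torch
import torch.nn as nn

class Downsampling(nn.Module):
    """Plain downsampling: one 3 x 3 conv with stride 2 and padding 1 (halves H and W)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        return self.conv(x)

class SpecialDownsampling(nn.Module):
    """Special-downsampling stem: one 3 x 3 conv with stride 2, two 3 x 3 convs with
    stride 1, and one 2 x 2 conv with stride 2 (overall spatial reduction of 4x).
    The intermediate channel handling is an assumption."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
            nn.Conv2d(c_out, c_out, 3, stride=1, padding=1),
            nn.Conv2d(c_out, c_out, 3, stride=1, padding=1),
            nn.Conv2d(c_out, c_out, 2, stride=2),
        )

    def forward(self, x):
        return self.stem(x)

# A 256 x 256 x 3 input chip becomes 64 x 64 x 48 after the special-downsampling stem.
x = torch.randn(1, 3, 256, 256)
print(SpecialDownsampling(3, 48)(x).shape)   # torch.Size([1, 48, 64, 64])
```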

2.1.2. Fusion-Block

Fusion-Block, shown in Figure 2, is the centerpiece of Fusion-Former. Within Fusion-Block, we propose the concept of Bidirectional Interaction, illustrated in Figure 3, which enhances the modeling capabilities of window self-attention and depth-wise convolution in both the channel and spatial dimensions. Channel and spatial interaction mechanisms are designed: the channel interaction mechanism generates attention along the channel dimension using a global average pooling (GAP) layer followed by two consecutive 1 × 1 convolutional layers, while the spatial interaction mechanism uses two 1 × 1 convolutional layers with a reduced number of channels. Both interaction mechanisms employ information from the other branch for attention generation, providing complementary cues to their respective branches and thus improving the modeling capability of the model. A sketch of the two interaction mechanisms follows.
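The following is a minimal PyTorch sketch of the channel and spatial interaction mechanisms; the sigmoid gating and the channel-reduction ratio of 4 are assumptions not specified in the text.

```python
import torch
import torch.nn as nn

class ChannelInteraction(nn.Module):
    """Channel interaction: per-channel attention generated from one branch
    (GAP followed by two 1 x 1 convs) modulates the other branch."""
    def __init__(self, dim, reduction=4):            # reduction ratio is an assumption
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                 # global average pooling
            nn.Conv2d(dim, dim // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, source, target):               # cues from `source` gate `target`
        return target * self.gate(source)

class SpatialInteraction(nn.Module):
    """Spatial interaction: a spatial attention map generated from one branch
    (two 1 x 1 convs with reduced channels) modulates the other branch."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(dim, dim // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, source, target):
        return target * self.gate(source)

# Example: the attention branch and the convolution branch exchange complementary cues.
attn_feat, conv_feat = torch.randn(1, 48, 64, 64), torch.randn(1, 48, 64, 64)
conv_gated = ChannelInteraction(48)(attn_feat, conv_feat)
attn_gated = SpatialInteraction(48)(conv_feat, attn_feat)
```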
Fusion-Block divides the image into several patches, each of which can be considered a small image block of size 4 × 4. Each patch is mapped by a neural network into a fixed-dimension feature vector representing that patch. This converts the input image into tokens and, at the same time, completes the downsampling process.
In Figure 3, a 7 × 7 window-based self-attention is designed for a more comprehensive extraction of global features. We split the feature maps obtained after the first downsampling into windows and then use self-attention within each window to extract local features, as in the transformer encoder of the original work [17]. The self-attention is formulated as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_{head}}}\right)V$$
where Q, K, and V denote the Query, Key, and Value, respectively, and $d_{head}$ is the channel dimension of the triple. The Softmax function is defined as follows:
$$\mathrm{Softmax}(z_j) = \frac{e^{z_j}}{\sum_{n=1}^{N} e^{z_n}}, \quad j = 1, 2, \ldots, N$$
At the core of the transformer is multi-head self-attention (MSA), which runs multiple independent attention heads in parallel; their outputs are concatenated and then projected to produce the final values. The MSA can be expressed as:
$$\mathrm{MSA} = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \mathrm{head}_3, \ldots, \mathrm{head}_h)$$
where each head is a self-attention operation and h is the number of attention heads. In our W-attention, however, V is not simply the Value but a new matrix obtained by multiplying V with the feature matrix x produced by the depth-wise convolution. A wider range of contextual details is thus captured and the problem of a limited receptive field is alleviated, which is a significant advantage for capturing change information in remotely sensed images at different scales. The overall MSA result is output 1 of a Fusion-Block.
After the 3 × 3 depth-wise convolution, the convolutional features are multiplied with the final MSA output to form output 2. Finally, we concatenate the two outputs and pass the result through a feed-forward network (FFN) layer to obtain the final output of the Fusion-Block, as sketched below.
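The following is a simplified PyTorch sketch of this fusion path; window partitioning and normalization are omitted for brevity, and the exact placement of projections is an assumption rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class FusionBlockSketch(nn.Module):
    """Simplified sketch of the Fusion-Block fusion path: depth-wise conv features
    modulate the attention value V, output 1 is the MSA result, output 2 multiplies
    output 1 with the conv branch, and their concatenation passes through an FFN."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)   # 3x3 depth-wise conv
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(2 * dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):                          # x: (B, C, H, W)
        B, C, H, W = x.shape
        local = self.dwconv(x)                     # local features from the conv branch
        tokens = x.flatten(2).transpose(1, 2)      # (B, H*W, C)
        loc_tok = local.flatten(2).transpose(1, 2)
        v = tokens * loc_tok                       # V modulated by the depth-wise features
        out1, _ = self.attn(tokens, tokens, v)     # output 1: multi-head self-attention
        out2 = out1 * loc_tok                      # output 2: attention x conv branch
        fused = self.ffn(torch.cat([out1, out2], dim=-1))
        return fused.transpose(1, 2).view(B, C, H, W)

# Example: a stage-1 feature map of size 64 x 64 with 48 channels.
y = FusionBlockSketch(48)(torch.randn(1, 48, 64, 64))
print(y.shape)   # torch.Size([1, 48, 64, 64])
```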

2.1.3. Vision-Module

When $I_1^i$ and $I_2^i$ are fed into the i-th Vision-Module, which forms the decoder section of the network, the outputs become $F_i$. The Vision-Module greatly improves the performance of the transformer on visual tasks. In the standard self-attention mechanism, the output of each location depends on the inputs of that location and of all surrounding locations, which not only causes a heavy storage burden but also ignores the effect of physical distance. We therefore propose the Vision-Module so that the output of a location can selectively depend on other locations that are physically closer to it; it can be seen as a localized version of the self-attention mechanism.
Figure 4 illustrates the Vision-Module, which can encode fine-level information efficiently. For each spatial location (x, y), we compute the similarity to all neighbors within an area of size K × K. This is easy to implement because the whole calculation is accomplished through reshaping operations.
Additionally, we introduce a new attention called Vision Attention (VA). For an H × W × C feature map, we use a linear layer and an unfold layer to reshape it into (H × W) × K² × C and obtain the value V. In parallel, we use a linear layer to project the feature map into H × W × K⁴ and then reshape it into (H × W) × K² × K² to obtain the attention map A. The most significant step in the Vision-Module is the computation of the VA attention score, which is formulated as:
$$\mathrm{Attention} = \mathrm{Softmax}(A)\,V$$
Compared with Equation (1), the Vision-Module is similar to the self-attention in the transformer, but it does not require any Q–K matrix multiplication. This design allows each output element to freely select information from a larger, flexible receptive field, playing a complementary role to the transformer. A sketch of this operation follows.
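The following is a minimal single-head PyTorch sketch of Vision Attention with K = 3; head splitting, normalization, and the exact projection layout are simplifications of the description above, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisionAttention(nn.Module):
    """Sketch of Vision Attention: attention scores are predicted directly from the
    feature map (no Q-K product) and applied to K x K neighbourhoods gathered with unfold."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.k = kernel_size
        self.v_proj = nn.Linear(dim, dim)                   # value projection
        self.attn_proj = nn.Linear(dim, kernel_size ** 4)   # K^2 x K^2 scores per pixel
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):                                   # x: (B, H, W, C)
        B, H, W, C = x.shape
        k = self.k
        # value: gather K x K neighbourhoods -> (B, H*W, K*K, C)
        v = self.v_proj(x).permute(0, 3, 1, 2)              # (B, C, H, W)
        v = F.unfold(v, k, padding=k // 2)                  # (B, C*K*K, H*W)
        v = v.view(B, C, k * k, H * W).permute(0, 3, 2, 1)
        # attention map A: (B, H*W, K*K, K*K), no Q-K^T multiplication
        a = self.attn_proj(x).view(B, H * W, k * k, k * k).softmax(dim=-1)
        out = a @ v                                          # (B, H*W, K*K, C)
        # fold the K*K responses back onto the spatial grid
        out = out.permute(0, 3, 2, 1).reshape(B, C * k * k, H * W)
        out = F.fold(out, (H, W), k, padding=k // 2)         # (B, C, H, W)
        return self.out_proj(out.permute(0, 2, 3, 1))        # (B, H, W, C)

# Example: (B, H, W, C) = (1, 64, 64, 48)
z = VisionAttention(48)(torch.randn(1, 64, 64, 48))
print(z.shape)   # torch.Size([1, 64, 64, 48])
```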

2.1.4. Feature-Enhancing Module

In the Feature-Enhancing Module, we use two 3 × 3 convolutional layers as a feature enhancer to make feature extraction more comprehensive. Finally, the result is processed by an MLP layer to predict the change mask with resolution H × W × C_class, where C_class is the number of classes. Since we are concerned only with change detection of buildings, the final black-and-white map contains only the changed building regions: white denotes change and black denotes no change. A sketch of this head follows.
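A minimal sketch of this head is shown below; implementing the per-pixel MLP with 1 × 1 convolutions and using ReLU activations are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class FeatureEnhancingModule(nn.Module):
    """Sketch of the prediction head: two 3x3 convs enhance the fused features, then a
    per-pixel MLP (realized as 1x1 convs) predicts the change mask with C_class channels
    (2 classes here: change / no change)."""
    def __init__(self, dim, n_classes=2):
        super().__init__()
        self.enhance = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.classifier = nn.Sequential(            # MLP applied pixel-wise
            nn.Conv2d(dim, dim, 1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, n_classes, 1),
        )

    def forward(self, x):                            # x: (B, dim, H, W)
        return self.classifier(self.enhance(x))      # (B, n_classes, H, W)

mask_logits = FeatureEnhancingModule(48)(torch.randn(1, 48, 256, 256))
print(mask_logits.shape)   # torch.Size([1, 2, 256, 256])
```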

2.2. Loss Function

The choice of loss function has a great impact on the training results of the model. Since change detection is a pixel-level binary (or multi-class) segmentation task in which each pixel must be categorized as positive or negative, we use the Binary Cross-Entropy Dice Loss (BCE-Dice Loss) as our loss function. It combines two loss functions, the Binary Cross-Entropy Loss (BCE-Loss) and the Dice Loss. For each pixel, BCE-Loss computes the cross-entropy between the two categories (positive and negative). The mathematical formula for BCE-Loss is as follows:
$$\mathrm{BCE\text{-}Loss} = -\left[\, y \log p + (1 - y)\log(1 - p) \,\right]$$
where y is the ground truth (GT) and p is the predictive probability of the model.
The Dice Loss is based on the Dice coefficient, a metric used to assess the similarity of two samples; a larger value means the two samples are more similar. The mathematical expression for the Dice coefficient is as follows:
$$\mathrm{Dice} = \frac{2\,|X \cap Y|}{|X| + |Y|}$$
where X represents the ground truth (GT) and Y represents the prediction; the numerator is the number of elements in the intersection of X and Y, and the denominator is the total number of elements in X and Y. The Dice Loss can simply be written as $1 - \mathrm{Dice}$, and the combined BCE-Dice Loss is given below:
$$\mathrm{BCE\text{-}Dice\ Loss} = \mathrm{BCE\text{-}Loss} - \log(\mathrm{Dice\ Loss})$$
Minimizing the BCE-Dice Loss when training an image segmentation model allows the model to classify pixels more accurately in the binary classification task and to achieve better segmentation quality, yielding better change detection results. An implementation sketch follows.
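Below is a hedged PyTorch sketch of this loss. Subtracting the logarithm of the Dice coefficient is one common implementation of the BCE-Dice combination written above; the smoothing constant is an implementation assumption and the exact term inside the logarithm may differ from the authors' code.

```python
import torch
import torch.nn.functional as F

def bce_dice_loss(logits, target, eps=1e-6):
    """Sketch of the BCE-Dice loss: binary cross-entropy on the predicted change
    probability p, minus the log of the Dice coefficient between prediction and GT.
    The eps smoothing term is an implementation assumption."""
    prob = torch.sigmoid(logits)                       # p
    bce = F.binary_cross_entropy(prob, target)         # -[y log p + (1 - y) log(1 - p)]
    inter = (prob * target).sum()                      # |X ∩ Y|
    dice = (2 * inter + eps) / (prob.sum() + target.sum() + eps)
    return bce - torch.log(dice)

# Toy example: random logits against a random binary change mask.
logits = torch.randn(4, 1, 256, 256)
target = torch.randint(0, 2, (4, 1, 256, 256)).float()
print(bce_dice_loss(logits, target))
```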

2.3. Dataset and Preprocessing

2.3.1. LEVIR-CD Dataset

The LEVIR-CD dataset [29] is a commonly used building CD dataset. It consists of 637 Google Earth (GE) image patch pairs. They are very high-resolution (VHR, 0.5 m/pixel) images of 1024 × 1024 pixels collected via the Google Earth API. The dataset is divided into training, validation, and test sets containing 7120, 1024, and 2048 pairs of images, respectively, in a 7:1:2 ratio. We cropped the original images into 256 × 256 chips with no overlap in order to train the model more efficiently, as sketched below.
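A minimal sketch of the non-overlapping tiling is given below; dropping edge remainders smaller than a full chip is an assumption made for illustration.

```python
import numpy as np

def tile_image(img, chip=256):
    """Crop a large (H, W, C) image into non-overlapping chip x chip tiles, as done
    for both LEVIR-CD and WHU-CD; edge remainders smaller than a chip are dropped."""
    h, w = img.shape[:2]
    return [img[r:r + chip, c:c + chip]
            for r in range(0, h - chip + 1, chip)
            for c in range(0, w - chip + 1, chip)]

# A 1024 x 1024 LEVIR-CD image yields 16 chips of 256 x 256.
chips = tile_image(np.zeros((1024, 1024, 3), dtype=np.uint8))
print(len(chips))   # 16
```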

2.3.2. WHU-CD Dataset

Like the LEVIR-CD dataset, the WHU-CD dataset [30] also focuses mainly on building change, but it differs in one respect: it contains only a single pair of bitemporal high-resolution RS images of size 32,507 × 15,354. Following previous work, we again split the two large images into 256 × 256 chips with no overlap. In line with the principle of controlled variables, we randomly divided the chips into training, validation, and test sets in the same manner as for LEVIR-CD, with roughly 6096/762/762 chips for training/validation/test, respectively.

2.3.3. Preprocessing of Dataset

To promote the robustness of the algorithm and enhance the generalization ability of the model, we apply a preprocessing (augmentation) step to the training set during training. Specifically, we randomly select fifty percent of the images to be flipped horizontally and vertically and cropped randomly. Likewise, we randomly select fifty percent of the images to be rotated by 90, 180, or 270 degrees, each with probability one-third. Afterwards, we add Gaussian noise to all images and perform a color transformation, increasing the brightness, saturation, contrast, and hue of the images by thirty percent each; no color transformation is applied to the labels. A sketch of this pipeline follows.
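Below is a hedged torchvision-based sketch of this pipeline; the random-crop step is omitted, the noise standard deviation is an assumption, and inputs are assumed to be float tensors in [0, 1].

```python
import random
import torch
import torchvision.transforms.functional as TF

def augment(img_a, img_b, label, noise_std=0.02):
    """Sketch of the training augmentation: geometric transforms are applied jointly to
    both bi-temporal images and the label; Gaussian noise and the ~30% color adjustments
    are applied to the images only. noise_std is an assumption."""
    if random.random() < 0.5:                               # 50%: horizontal + vertical flip
        img_a, img_b, label = map(TF.hflip, (img_a, img_b, label))
        img_a, img_b, label = map(TF.vflip, (img_a, img_b, label))
    if random.random() < 0.5:                               # 50%: rotate by 90 / 180 / 270
        angle = random.choice([90, 180, 270])
        img_a, img_b, label = (TF.rotate(t, angle) for t in (img_a, img_b, label))
    def color(img):                                          # ~30% brightness/saturation/contrast/hue
        img = TF.adjust_brightness(img, 1.3)
        img = TF.adjust_saturation(img, 1.3)
        img = TF.adjust_contrast(img, 1.3)
        return TF.adjust_hue(img, 0.3)
    img_a, img_b = color(img_a), color(img_b)
    img_a = (img_a + noise_std * torch.randn_like(img_a)).clamp(0, 1)   # Gaussian noise
    img_b = (img_b + noise_std * torch.randn_like(img_b)).clamp(0, 1)
    return img_a, img_b, label

# Example on random 256 x 256 tensors.
a, b, m = augment(torch.rand(3, 256, 256), torch.rand(3, 256, 256),
                  torch.randint(0, 2, (1, 256, 256)).float())
```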

2.4. Details of the Experiment

The whole process of training, validation, and testing is performed on a server equipped with two NVIDIA RTX 3090 GPUs with 24 GB of memory each. In our experiments, we use SGD as the optimizer. The initial learning rate is 0.002 and the minimum learning rate is 0.0005. The total number of epochs is set to 100 and the batch size to 8, as sketched below.
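The configuration can be sketched as follows; the momentum value, the cosine learning-rate schedule, and the placeholder model, data, and loss are assumptions made so the snippet runs, and would be replaced by Fusion-Former, the real data loader, and the BCE-Dice loss.

```python
import torch
import torch.nn as nn

# Dummy stand-ins so the sketch runs; the real model and data loader replace these.
model = nn.Conv2d(6, 1, 3, padding=1)                     # placeholder taking concatenated bi-temporal input
data = [(torch.randn(8, 3, 256, 256), torch.randn(8, 3, 256, 256),
         torch.randint(0, 2, (8, 1, 256, 256)).float()) for _ in range(2)]

optimizer = torch.optim.SGD(model.parameters(), lr=0.002, momentum=0.9)   # momentum is an assumption
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(                   # schedule type is an assumption;
    optimizer, T_max=100, eta_min=0.0005)                                  # the text only gives 0.002 -> 0.0005
criterion = nn.BCEWithLogitsLoss()                                         # stand-in for the BCE-Dice loss above

for epoch in range(100):                                                   # 100 epochs
    for img_a, img_b, label in data:                                       # batch size 8
        optimizer.zero_grad()
        loss = criterion(model(torch.cat([img_a, img_b], dim=1)), label)
        loss.backward()
        optimizer.step()
    scheduler.step()
```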
To quantitatively characterize the effect of change detection, four widely used evaluation metrics, Intersection over Union (IOU), Precision (Pre), Recall (Rec), and F1-score (F1), are adopted to evaluate the performance of each model for change detection in high-resolution remote sensing images. All four are derived from the quantities TP, FP, TN, and FN. TP is the number of changed pixels that are accurately recognized, while FP is the number of unchanged pixels that are wrongly detected as changed. TN denotes the number of correctly detected unchanged pixels, and FN denotes the number of changed pixels that are wrongly detected as unchanged.
Precision is computed with respect to the prediction and indicates how many of the samples predicted as positive are positive in the GT. Recall indicates how many of the positive examples in the GT are correctly predicted. IOU is a commonly used evaluation metric that measures the degree of overlap between the prediction and the ground truth by computing the ratio of their intersection to their union. The F1-score is a metric used in statistics to measure the accuracy of a binary or multi-class classification model; it takes both precision and recall into account and can be viewed as their harmonic mean, with a maximum value of 1 and a minimum value of 0. The four metrics are calculated as follows, and a helper for computing them is sketched after the equations:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{IOU} = \frac{TP}{TP + FP + FN}$$
$$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
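A small helper for computing the four metrics from binary change maps, following the TP/FP/FN definitions above:

```python
import numpy as np

def cd_metrics(pred, gt):
    """Compute Precision, Recall, IOU and F1 from binary change maps (1 = change)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    precision = tp / (tp + fp + 1e-9)
    recall = tp / (tp + fn + 1e-9)
    iou = tp / (tp + fp + fn + 1e-9)
    f1 = 2 * precision * recall / (precision + recall + 1e-9)
    return precision, recall, iou, f1

# Toy example: pred = [1,1,0,0], gt = [1,0,1,0] -> TP=1, FP=1, FN=1.
print(cd_metrics(np.array([1, 1, 0, 0]), np.array([1, 0, 1, 0])))
```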

3. Results

We compare our proposed Fusion-Former with several existing classical methods: FC-EF [31], FC-Siam-conc [31], FC-Siam-diff [31], CDNet [32], DSIFN [33], and BIT. In addition to the quantitative analysis, we also perform a qualitative analysis, which makes the differences between models more intuitive to observe. We present our experimental results in two main parts. The first part is a comparison with other models; the final visualizations use four colors, white, black, red, and green, which represent TP, TN, FP, and FN, respectively. The second part consists of ablation experiments. We conjecture that the number of channels initialized in the Vision-Module may affect the experimental metrics, so we vary the number of channels and conduct further comparison experiments.

3.1. Visualization Results of Different Methods

3.1.1. Experiment Results on LEVIR-CD Dataset

Table 1 reports the quantitative comparison between our proposed Fusion-Former and the other approaches. Fusion-Former achieves the highest F1 and IOU among all methods. Compared with the best-performing competing method, BIT, these two metrics improve by 0.35% and 0.48%, respectively, although Precision decreases by 1.06% compared with BIT and Recall decreases by 3.09% compared with DSIFN.
To ensure the randomness of the samples, we randomly selected five pairs of bi-temporal images, in some of which the buildings are densely distributed while in others they are sparse. The qualitative results in Figure 5 show the differences in effectiveness between the change detection algorithms more intuitively. Fusion-Former shows the best results of all the algorithms. Generally speaking, all models produce some missed or false detections, but visually, both misdetections and omissions are minimized in our model, which successfully extracts most of the building change regions. Analyzing the five sets of experiments, we found that FP and FN appear mainly in two kinds of places: first, where buildings are densely distributed, and second, at the edges of buildings, regardless of whether the building shape is regular. In addition, regions whose colors are very similar to forest, or that are partly covered by forest, are also difficult areas in which to detect change.

3.1.2. Experiment Results on WHU-CD Dataset

Our approach achieves equally satisfactory results on this dataset. As displayed in Table 2, Fusion-Former outperforms the others in terms of F1, IOU, and Recall, with scores of 86.00%, 75.44%, and 86.72%, respectively, which are 1.7%, 1.41%, and 1.12% higher than the second-best. However, CDNet and DSIFN surpass BIT in terms of Recall and IOU, so BIT is not the second-best on those metrics. Furthermore, DSIFN also achieves the highest Precision on the WHU-CD dataset, 3.5% higher than Fusion-Former. We investigate the reasons for these findings further in the Discussion.
We carefully chose five sets of bi-temporal images with varying seasons and building densities for analysis and comparison. Figure 6 presents the corresponding visualizations; the color scheme is the same as for the LEVIR-CD dataset. Our method again exhibits the most promising results, with fewer false positives (FP) and false negatives (FN) than the other methods, although some minor misclassifications and missed detections remain.
In addition, in some of the later-date images containing newly added buildings, certain buildings are very similar in color to the ground. A model may then identify the ground as a building, and Fusion-Former is no exception. We also notice that some construction waste is detected as building change, which is not what we want.

3.1.3. Visualization of the Training Process

Visualization of the training process is crucial in deep learning as it allows for a more intuitive understanding of the data. The learning process involves optimization, where the goal is to find the optimal point as the final result of the training process. By visualizing aspects such as loss function and parameter distribution, we can gain insights into the model’s performance, identify and debug potential issues, and track the progress of the model.
Since our model is built on a UNet-based architecture, we chose UNet as the baseline (Base) of our model. We show the trends of training F1 and validation F1 for Base and Fusion-Former in the plots in Figure 7 and Figure 8. From these curves, we can see that although Fusion-Former and Base behave similarly in terms of the stability of the training process, Fusion-Former outperforms Base on both LEVIR-CD and WHU-CD, demonstrating the superiority of our algorithm.

3.2. Ablation Studies

To illustrate our method's effectiveness, we conducted ablation studies on context modeling. First, we use the baseline UNet (Base) as the initial comparison. Next, Base is replaced by Fusion-Block without the Vision-Module. Then, we add Vision-Modules with different channel numbers to the model, aiming to select the best-performing Fusion-Former. Three channel numbers, 48, 96, and 192, are used, and the corresponding models are named Fusion48, Fusion96, and Fusion192.
We compute F1, Precision, Recall, and IOU for the five models on each of the two datasets; the results are presented in Table 3 and Table 4, respectively. On the LEVIR-CD dataset, F1, Precision, Recall, and IOU are all highest for Fusion192, reaching 89.53%, 90.30%, 88.78%, and 81.05%, while on WHU-CD Fusion96 (F1 of 86.00%, Precision of 86.40%, Recall of 86.72%, and IOU of 75.44%) is slightly better than Fusion192 (F1 of 85.90%, Precision of 85.35%, Recall of 86.45%, and IOU of 75.29%). Based on these ablation studies, models with a larger number of channels generally yield better results than those with fewer channels. However, simply increasing the number of channels does not guarantee improved performance.

4. Discussion

FC-EF, FC-Siam-diff, and FC-Siam-conc seem to fall short in terms of performance metrics on both datasets. They use a Fully Convolutional Network (FCN) as their backbone. However, FCN tends to be less sensitive to image details and may lack spatial consistency as it does not fully consider pixel-to-pixel relationships. This might result in a decrease in accuracy.
The Recall and Precision of our model are in some cases slightly lower than those of other models. It is essential to note that Precision and Recall are both high only in an ideal scenario, suggesting a certain complementary relationship between the two metrics. Our model exhibits better Recall but lower Precision than BIT on the LEVIR-CD dataset, better Precision but lower Recall than DSIFN on the LEVIR-CD dataset, and lower Precision but higher Recall than DSIFN on the WHU-CD dataset. To thoroughly analyze and improve the performance, these factors need to be considered.
During our analysis, we found that in DSIFN the extracted deep features are used in a deeply supervised difference discrimination network (DDN) for change detection. To further improve the effectiveness of this approach, a change-map loss is applied directly to the middle layers of the network, allowing end-to-end training of the entire network. This approach has the potential to provide more reliable indicators of change, which may well explain why DSIFN achieves the best Precision of 89.92% on the WHU-CD dataset.
In the case of BIT, the authors use the transformer as the decoder, which differs from most papers that use the transformer solely as the encoder. This choice may contribute to the higher precision of their results. Furthermore, to enhance the robustness of our algorithm, we add noise to the images, which may affect the identification of changed and unchanged objects between the bi-temporal images. This may be why BIT obtains the best Precision of 91.36% on the LEVIR-CD dataset.
However, our Fusion-Former combines window-based self-attention and depth-wise convolution, which allows it to extract more comprehensive and detailed features. This advantage is particularly useful for capturing information about changes in remote sensing images at various scales. As a result, our model achieves the highest F1-score and IOU on both datasets.
Last but not least, we analyze the effect of the number of channels in the Vision-Module on the experimental results. The WHU-CD dataset is smaller than the LEVIR-CD dataset, so the most likely reason for the drop in the effectiveness of Fusion192 is overfitting. The fact that the larger model performs well on the larger dataset suggests that the metrics still have room to rise and that our model has the potential to achieve better experimental results.

5. Conclusions

In our proposed Fusion-Former, Fusion-Block enables the model to extract more effective change information from the multi-channel data of remote sensing images, and the Vision-Module significantly improves the performance of Fusion-Block. The experimental comparisons show that Fusion-Former achieves the best overall performance. However, our algorithm still has some deficiencies in aspects such as edge feature extraction, so there is still considerable room for the model's accuracy to improve. In the future, we will continue to explore the combination of CNNs and transformers, improve the efficiency of feature extraction, and make the model more lightweight. This technique is applicable not only to remote sensing image change detection but also to regular semantic segmentation tasks, making it a valuable area for further study and exploration.

Author Contributions

Conceptualization, Z.F. and S.W.; methodology, Z.F.; software, Y.L.; validation, X.P., Z.F. and S.W.; formal analysis, Z.F.; investigation, X.P. and H.W.; resources, S.W. and X.P.; data curation, H.W.; writing—original draft preparation, Z.F.; writing—review and editing, Z.F. and Y.L.; visualization, Z.F., X.S., and Q.C.; supervision, Z.F. and X.S.; project administration, X.S.; funding acquisition, Q.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China (Grant no. 62301253, 62305163, 62105152), Fundamental Research Funds for the Central Universities (Grant no. 30919011401, 30922010204, 30922010718, JSGP202202), Funds of the Key Laboratory of National Defense Science and Technology (Grant no: 6142113210205), Leading Technology of Jiangsu Basic Research Plan (BK20192003), The Excellent Member of Youth Innovation Promotion Association CAS (No. Y2021071, No. Y202058).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

Author Sanqian Wang was employed by the company DFH Satellite Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Huang, X.; Zhang, L.; Zhu, T. Building change detection from multitemporal high-resolution remotely sensed images based on a morphological building index. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2013, 7, 105–115. [Google Scholar] [CrossRef]
  2. Gong, M.; Zhao, J.; Liu, J.; Miao, Q.; Jiao, L. Change detection in synthetic aperture radar images based on deep neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2015, 27, 125–138. [Google Scholar] [CrossRef] [PubMed]
  3. Chen, J.; Yuan, Z.; Peng, J.; Chen, L.; Li, H. DASNet: Dual attentive fully convolutional siamese networks for change detection of high resolution satellite images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 14, 1194–1206. [Google Scholar] [CrossRef]
  4. Nielsen, A.A.; Conradsen, K.; Simpson, J.J. Multivariate alteration detection (MAD) and MAF postprocessing in multispectral, bitemporal image data: New approaches to change detection studies. Remote Sens. Environ. 1998, 64, 1–19. [Google Scholar] [CrossRef]
  5. Gapper, J.J.; El-Askary, H.M.; Linstead, E.; Piechota, T. Coral Reef Change Detection in Remote Pacific Islands Using Support Vector Machine Classifiers. Remote Sens. 2019, 11, 1525. [Google Scholar] [CrossRef]
  6. Xie, Z.; Li, Y.; Niu, J.; Wang, Z.; Lu, G. Hyperspectral face recognition based on sparse spectral attention deep neural networks. Opt. Express 2020, 28, 36286–36303. [Google Scholar] [CrossRef]
  7. Niu, J.; Xie, Z.; Li, Y. Scale fusion light CNN for hyperspectral face recognition with knowledge distillation and attention mechanism. Appl. Intell. 2022, 52, 6181–6195. [Google Scholar] [CrossRef]
  8. Alshingiti, Z.; Alaqel, R.; Al-Muhtadi, J.; Haq, Q.E.U.; Saleem, K.; Faheem, M.H. A Deep Learning-Based Phishing Detection System Using CNN, LSTM, and LSTM-CNN. Electronics 2023, 12, 232. [Google Scholar] [CrossRef]
  9. Gil-Yepes, J.L.; Ruiz, L.A.; Recio, J.A.; Balaguer-Beser, Á.; Hermosilla, T. Description and validation of a new set of object-based temporal geostatistical features for land-use/land-cover change detection. ISPRS J. Photogramm. Remote Sens. 2016, 121, 77–91. [Google Scholar] [CrossRef]
  10. Ma, L.; Liu, Y.; Zhang, X.; Ye, Y.; Yin, G.; Johnson, B.A. Deep learning in remote sensing applications: A meta-analysis and review. ISPRS J. Photogramm. Remote Sens. 2019, 152, 166–177. [Google Scholar] [CrossRef]
  11. Pan, B.; Xu, X.; Shi, Z.; Zhang, N.; Luo, H.; Lan, X. DSSNet: A Simple Dilated Semantic Segmentation Network for Hyperspectral Imagery Classification. IEEE Geosci. Remote Sens. Lett. 2020, 17, 1968–1972. [Google Scholar] [CrossRef]
  12. Chen, H.; Wu, C.; Du, B.; Zhang, L.; Wang, L. Change Detection in Multisource VHR Images via Deep Siamese Convolutional Multiple-Layers Recurrent Neural Network. IEEE Trans. Geosci. Remote Sens. 2020, 58, 2848–2864. [Google Scholar] [CrossRef]
  13. Mohamed, A.R.; Dahl, G.E.; Hinton, G. Acoustic modeling using deep belief networks. IEEE Trans. Audio Speech Lang. Process 2012, 20, 14–22. [Google Scholar] [CrossRef]
  14. Zabalza, J.; Ren, J.; Zheng, J.; Zhao, H.; Qing, C.; Yang, Z.; Du, P.; Marshall, S. Novel segmented stacked autoencoder for effective dimensionality reduction and feature extraction in hyperspectral imaging. Neurocomputing 2016, 185, 1–10. [Google Scholar] [CrossRef]
  15. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536. [Google Scholar] [CrossRef]
  16. Yan, L.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1989, 1, 541–551. [Google Scholar]
  17. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; pp. 5998–6008. [Google Scholar]
  18. Shi, W.; Zhang, M.; Zhang, R.; Chen, S.; Zhan, Z. Change Detection Based on Artificial Intelligence: State-of-the-Art and Challenges. Remote Sens. 2020, 12, 1688. [Google Scholar] [CrossRef]
  19. Ye, S.; Chen, D.; Yu, J. A Targeted Change-Detection Procedure by Combining Change Vector Analysis and Post-Classification Approach. ISPRS J. Photogramm. Remote Sens. 2016, 114, 115–124. [Google Scholar] [CrossRef]
  20. Hussain, M.; Chen, D.; Cheng, A.; Wei, H.; Stanley, D. Change detection from remotely sensed images: From pixel-based to object-based approaches. ISPRS J. Photogramm. Remote Sens. 2013, 80, 91–106. [Google Scholar] [CrossRef]
  21. Wang, B.; Choi, J.; Choi, S.; Lee, S.; Wu, P.; Gao, Y. Image Fusion-Based Land Cover Change Detection Using Multi-Temporal High-Resolution Satellite Images. Remote Sens. 2017, 9, 804. [Google Scholar] [CrossRef]
  22. Mo, W.; Tan, Y.; Zhou, Y.; Zhi, Y.; Cai, Y.; Ma, W. Multispectral Remote Sensing Image Change Detection Based on Twin Neural Networks. Electronics 2023, 12, 3766. [Google Scholar] [CrossRef]
  23. Peng, D.; Zhang, Y.; Guan, H. End-to-End change detection for high resolution satellite images using improved UNet++. Remote Sens. 2019, 11, 1382. [Google Scholar] [CrossRef]
  24. Zhang, C.; Wei, S.; Ji, S.; Lu, M. Detecting large-scale urban land cover changes from very high-resolution remote sensing images using CNN-based classification. ISPRS Int. J. Geo-Inf. 2019, 8, 189. [Google Scholar] [CrossRef]
  25. Zhang, W.; Lu, X. The spectral-spatial joint learning for change detection in multispectral imagery. Remote Sens. 2019, 11, 240. [Google Scholar] [CrossRef]
  26. Chen, H.; Qi, Z.; Shi, Z. Remote sensing image change detection with transformers. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5607514. [Google Scholar] [CrossRef]
  27. Zhang, J.; Shao, Z.; Ding, Q.; Huang, X.; Wang, Y.; Zhou, X.; Li, D. AERNet: An Attention-Guided Edge Refinement Network and a Dataset for Remote Sensing Building Change Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–16. [Google Scholar] [CrossRef]
  28. Bandara, W.G.C.; Patel, V.M. A transformer-based siamese network for change detection. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 207–210. [Google Scholar]
  29. Chen, H.; Shi, Z. A spatial-temporal attention-based method and a new dataset for remote sensing image change detection. Remote Sens. 2020, 12, 1662. [Google Scholar] [CrossRef]
  30. Ji, S.; Wei, S.; Lu, M. Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery dataset. IEEE Trans. Geosci. Remote Sens. 2018, 57, 574–586. [Google Scholar] [CrossRef]
  31. Daudt, R.C.; Le Saux, B.; Boulch, A. Fully convolutional siamese networks for change detection. In Proceedings of the IEEE International Conference on Image Processing, Athens, Greece, 7–10 October 2018; pp. 4063–4067. [Google Scholar]
  32. Jin, W.D.; Xu, J.; Han, Q.; Zhang, Y.; Cheng, M.M. CDNet: Complementary depth network for RGB-D salient object detection. IEEE Trans. Image Process. 2021, 30, 3376–3390. [Google Scholar] [CrossRef]
  33. Zhang, C.; Yue, P.; Tapete, D.; Jiang, L.; Shangguan, B.; Huang, L.; Liu, G. A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images. ISPRS J. Photogramm. Remote Sens. 2020, 166, 183–200. [Google Scholar] [CrossRef]
Figure 1. The overall network architecture of Fusion-Former.
Figure 2. Fusion-Block.
Figure 3. Bidirectional Interaction.
Figure 4. Vision-Module.
Figure 5. Qualitative results of the different methods on the LEVIR-CD dataset: (a) the image of the first date, (b) the image of the second date, (c) the ground truth (GT), (d) our method, (e) BIT, (f) DSIFN, (g) CDNet, (h) FC-EF, (i) FC-Siam-diff, and (j) FC-Siam-conc. TP areas are shown in white, FP areas in red, FN areas in green, and TN areas in black.
Figure 6. Qualitative results of the different methods on the WHU-CD dataset: (a) the image of the first date, (b) the image of the second date, (c) the ground truth (GT), (d) our method, (e) BIT, (f) DSIFN, (g) CDNet, (h) FC-EF, (i) FC-Siam-diff, and (j) FC-Siam-conc. TP areas are shown in white, FP areas in red, FN areas in green, and TN areas in black.
Figure 7. The training F1-score on the two datasets: (a) LEVIR-CD dataset; (b) WHU-CD dataset.
Figure 8. The validation F1-score on the two datasets: (a) LEVIR-CD dataset; (b) WHU-CD dataset.
Table 1. Quantitative results on LEVIR-CD dataset.

Model            F1-Score (%)    Precision (%)    Recall (%)    IOU (%)
FC-EF            67.32           64.91            69.92         50.74
FC-Siam-diff     85.23           87.02            83.52         74.26
FC-Siam-conc     86.14           87.70            84.64         75.66
CDNet            86.66           88.35            85.05         76.50
DSIFN            89.02           86.35            91.87         80.22
BIT              89.18           91.36            87.21         80.57
Ours             89.53           90.30            88.78         81.05
Table 2. Quantitative results on WHU-CD dataset.

Model            F1-Score (%)    Precision (%)    Recall (%)    IOU (%)
FC-EF            71.44           63.79            66.68         62.56
FC-Siam-diff     71.68           73.20            70.21         55.86
FC-Siam-conc     73.65           76.56            70.96         58.30
CDNet            83.80           82.20            85.60         71.26
DSIFN            85.08           89.92            80.73         74.03
BIT              84.30           84.37            84.22         72.86
Ours             86.00           86.40            86.72         75.44
Table 3. Ablation studies on LEVIR-CD dataset.

Model            F1-Score (%)    Precision (%)    Recall (%)    IOU (%)
Base             85.68           84.94            86.43         74.95
Fusion-Block     86.11           86.54            85.69         75.62
Fusion48         88.81           86.52            86.22         79.88
Fusion96         89.40           90.12            88.64         80.84
Fusion192        89.53           90.30            88.78         81.05
Table 4. Ablation studies on WHU-CD dataset.

Model            F1-Score (%)    Precision (%)    Recall (%)    IOU (%)
Base             75.92           79.68            72.50         61.19
Fusion-Block     79.80           80.02            79.57         66.39
Fusion48         85.58           86.26            84.91         74.80
Fusion96         86.00           86.40            86.72         75.44
Fusion192        85.90           85.35            86.45         75.29
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
