Article

SFFNet: Staged Feature Fusion Network of Connecting Convolutional Neural Networks and Graph Convolutional Neural Networks for Hyperspectral Image Classification

1 School of Mathematics and Computer Science, Wuhan Polytechnic University, Wuhan 430048, China
2 Electronic Information School, Wuhan University, Wuhan 430072, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(6), 2327; https://doi.org/10.3390/app14062327
Submission received: 3 February 2024 / Revised: 3 March 2024 / Accepted: 8 March 2024 / Published: 10 March 2024

Abstract: The immense representation power of deep learning frameworks has kept them in the spotlight in hyperspectral image (HSI) classification. Graph Convolutional Neural Networks (GCNs) can be used to compensate for the lack of spatial information in Convolutional Neural Networks (CNNs). However, most GCNs construct graph data structures based on pixel points, which requires the construction of neighborhood matrices on all data. Meanwhile, the setting of GCNs to construct similarity relations based on spatial structure is not fully applicable to HSIs. To make the network more compatible with HSIs, we propose a staged feature fusion model called SFFNet, a neural network framework connecting CNN and GCN models. The CNN performs the first stage of feature extraction, assisted by adding neighboring features and overcoming the defects of local convolution; then, the GCN performs the second stage for classification, and the graph data structure is constructed based on spectral similarity, optimizing the original connectivity relationships. In addition, the framework enables the batch training of the GCN by using the extracted spectral features as nodes, which greatly reduces the hardware requirements. The experimental results on three publicly available benchmark hyperspectral datasets show that our proposed framework outperforms other relevant deep learning models, with an overall classification accuracy of over 97%.

1. Introduction

HSIs have been widely used in many fields due to their ability to distinguish subtle spectral differences [1]. In the field of classification, most problems center on the "curse of dimensionality", the under-utilization of information, and the lack of labeled samples. Researchers have tried to solve these issues with various methods such as the simplified maximum likelihood classification technique [2], Support Vector Machines (SVMs) [3], and Markov random fields (MRFs) [4]. The majority of conventional techniques approach the problem from a linear standpoint, making it difficult to capture the underlying nonlinear relationship between the original spectral data and the associated materials. With the development of machine learning, deep learning has enabled high-precision classification of HSIs.
Deep learning is very widely used on HSIs. The most common network models include the Recurrent Neural Network (RNN), Generative Adversarial Network (GAN), CNN, and Transformer. In 2017, Mou et al. used an RNN for the first time in the field of HSI classification, analyzing hyperspectral pixels as sequential data [1]. In 2019, Hang et al. proposed a cascaded RNN to explore the redundant and complementary information of HSIs [5]. In 2018, Zhu et al. first demonstrated the possibility of using a GAN for HSI classification [6]. With the rise of the Transformer, the Vision Transformer (ViT) [8] set the trend in the image processing field. He et al. [7] used a bidirectional encoder Transformer to implement HSI classification, with good training results. Hong et al. proposed SpectralFormer, a new approach that classifies HSIs from a Transformer's sequence perspective [9]. Different models have different shortcomings in HSI classification. For example, RNNs are prone to performance bottlenecks in large-scale training tasks, GANs are difficult to debug, and Transformers demand large amounts of data. As 3D images, HSIs are generally characterized by a large amount of information and large differences between categories, and the publicly available hyperspectral datasets are limited. For practical applications based on hyperspectral images, CNN-based methods have emerged as the most widely adopted approach for HSI classification [10].
CNNs include 1D-CNN, 2D-CNN, and 3D-CNN models. These models differ in their computational dimensions, and they all rely on backpropagation for network updates [1]. In 2015, Hu et al. first applied a 1D-CNN to HSI classification based on spectral features [11]. Chen et al. introduced Dropout to address overfitting [12]. However, the phenomena of "same object, different spectra" and "different objects, same spectrum" in HSIs imply that spectral information alone cannot deliver exceptional classification results. To explore the spatial information of HSIs, researchers have experimented with 2D-CNNs. In 2018, Sharma et al. employed a CNN for feature extraction and subsequently utilized AdaBoostSVM for band selection [13]. Makantasis et al. utilized a 2D-CNN for feature extraction and a perceptron model for classification [14]. Yue et al. expanded into the spatial domain [15], while Song et al. achieved deep spatial feature extraction and fusion [16]. The 2D-CNN has yielded outstanding outcomes in HSI classification. However, its limitation is that the original image must be downscaled before convolution, which may result in the loss of key information. In order to simultaneously utilize the spatial and spectral information of hyperspectral images, an increasing number of researchers are opting for a "spatial-spectral" processing approach. In 2016, Yang et al. proposed the Two-CNN model, which separates the spatial and spectral channels to extract information before concatenating the features to complete the classification [17]. In 2017, Li et al. developed a simple three-dimensional convolutional neural network (3D-CNN) framework for HSI classification, which directly captures spatial and spectral information without requiring any preprocessing [18]. In 2022, Zhou et al. added a multi-scale fused spectral attention module to the 3D-CNN to improve spectral continuity [19]. In their latest research, Xu et al. proposed a local-spectral feature optimization method using local and global spectral feature extraction blocks to automatically extract raw information [20]. The emergence of CNNs has effectively advanced the field of HSI classification: by extensively exploring the underlying information embedded in images, it becomes possible to discern the correspondence between features and categories. However, CNNs can only extract information within the region covered by a fixed-size convolutional kernel, so a CNN cannot capture the global information of an HSI. The interaction between spatial and spectral information is also worthy of further study. Recognizing the limitations of CNNs in HSI processing, researchers have introduced a neural network for handling non-Euclidean data, known as the Graph Neural Network (GNN) [21]. Notably, the Graph Convolutional Neural Network (GCN) [22] has assumed an increasingly significant role in this domain [23,24,25].
The GCN is able to continually update the correlations between nodes by aggregating the information of neighboring nodes. By constructing data relationships across the whole graph, the GCN serves as a bridge for global information exchange. In 2018, Qin et al. proposed spatial graph convolutional networks for semi-supervised HSI classification (S2GCNs) [23], which utilize adjacent nodes in the graph to approximate convolution. Wan et al. proposed the MDGCN in 2019 [26]; the model adopts the GCN in order to overcome the drawback of traditional CNN models, which can only perform static computations. In 2020, Wan et al. proposed the CAD-GCN to address context insensitivity [27], improving the edge weights and connection relations of the traditional GCN. In 2021, Ding et al. proposed a novel semi-supervised network based on graph sample and aggregate-attention (SAGE-A), which utilizes a self-attention mechanism to capture the information of HSIs [28]. The GCN's ability in spatial modeling is well documented: it can effectively model similarity relationships between data and, by incorporating topological information, optimize the classification of HSIs. However, most of these models are also accompanied by computational complexity and training difficulty. Researchers have gradually run into the bottleneck of single-network frameworks and have realized that fusing frameworks can be effective. Hong et al. proposed a new minibatch GCN in 2020, which allows data to be trained in a minibatch manner [29]; through this method, the GCN becomes able to process large-scale HSI data. Liu et al. proposed a multilevel superpixel structured graph U-Net (MSSGU) to learn multiscale features on multilevel graphs [30], and further improved the original framework by incorporating a graph attention mechanism combining a 3D-CNN and a 2D-CNN. In 2022, Dong et al. proposed a Weighted Feature Fusion of Convolutional Neural Network and Graph Attention Network (WFCG) for HSI classification, exploiting the characteristics of the superpixel-based GAT and the pixel-based CNN [31]. In a 2023 study, Zhou et al. proposed an adaptive model called the AMGCFN [32], which includes two sub-networks, a multi-scale fully convolutional network and a multi-hop GCN, to extract the multilevel information of HSIs. Yu et al. proposed a two-branch deeper GCN (TBDGCN) [33] that combines a GCN and a CNN to extract the pixel-level information of HSIs. In their latest research, Yu et al. make full use of the advantages of CNNs and GNNs and propose a graph-polarized fusion (GPF) model; from long-range and multi-angle perspectives, this model can acquire pixel-level and super-pixel-level features simultaneously [34]. The CNN refines classification through local convolution, and the GCN acquires long-range similarity relationships through global connections. However, GCNs are still not an ideal fit for HSIs. Traditional CNNs and GCNs collectively lack consideration for the structural relationships present within the data. When fusing different models and features, most methods use additive (A), elementwise multiplicative (M), or concatenation (C) operations, and the network connections are typically interpreted as local connections.
Analysis shows that CNNs and GCNs have their respective strengths from different perspectives. At the same time, their fusion strategies need to be further studied. The following concerns are worthy of our consideration.
(1) HSIs encompass rich ground features. However, existing methods often overlook the correlation between the spatial and spectral domains. The challenge lies in achieving a balanced utilization of spatial and spectral information, making the combination of local and global HSI information an area worthy of further exploration.
(2) Most GCN-based methods focus on extracting spatial relationships between image data. These frameworks use adjacent pixels to represent adjacency relationships. However, the advantage of HSIs lies in the spectral dimension, where the expression of spatial relationships is limited and requires a more relevant, new expression of relationships.
(3) GCNs require the adjacency matrix of the original data as input, so GCNs only allow for full-batch network learning. When the hyperspectral data are large, the one-time data input is demanding on the hardware equipment. There may be negative effects, such as slow gradient descent.
To address the above problems, we propose a staged feature fusion network architecture that combines CNNs and GCNs, called SFFNet. Simply put, we first perform random sampling on HSIs while obtaining the central node for classification and the image blocks for auxiliary classification. These are fed into the CNN convolution module to extract features and enhance local information. After extracting the corresponding features, we construct a graph data structure and send it to the GCN module, updating the relationship between node features based on spectral similarity. Classification is completed after outputting the final predicted probability. SFFNet can make full use of the spatial-spectral information of HSIs to strengthen the global information. The main contributions of this paper are as follows.
(1) A HSI classification model connecting a CNN and GCN is proposed, which is called SFFNet. The CNN refines the original information and the GCN obtains the information correlation. By improving the spectral-spatial feature-awareness capability of the model, the model can achieve high-precision HSI classification.
(2) Optimizing the connectivity relationships of GCNs based on the similarity of spectral dimensions. The introduction of the GCN addresses the limitation of traditional CNNs, which can only extract local features using fixed convolution kernels. This approach combines both the local and global information of images.
(3) Allowing the GCN to input data in batches. Following feature extraction by the CNN, the GCN constructs information node similarities. Through multiple iterations, the network weights can be optimized. This approach reduces hardware requirements while ensuring classification accuracy.
The remainder of this paper is organized as follows. Section 2 begins with an introduction to related work. Section 3 describes our image classification framework based on combining the CNN and GCN in detail. Section 4 organizes the background and settings related to the experiments. We will give an analysis of the experimental results in Section 5. Finally, the whole manuscript will be summarized in Section 6.

2. Related Work

Our model consists of two main parts: a CNN module for feature extraction and a GCN module for classification prediction. In this section, we describe in detail the relevant methods used in the model.

2.1. CNN and VGG 16

A CNN is a widely used artificial neural network with good performance in various fields. It mainly consists of a convolutional layer, pooling layer, and fully connected layer. With the deepening of CNN research, researchers have proposed many improvement methods.
The VGG is a model proposed by the Visual Geometry Group for the ImageNet Challenge, and the competition results and experimental performance show that it can be used well for classification and localization tasks [35]. AlexNet [36], the pioneering model, suffers from an overly large convolutional kernel, an overly large stride, and no padding. These problems are addressed by the VGG, which strengthens feature extraction with small convolutional kernels and proves that increasing the depth of the network can improve the final performance. The VGG16, as the name suggests, has 16 weight layers, comprising 13 convolutional layers and 3 fully connected layers, separated by max-pooling. All the activation units use the ReLU function. The VGG16 uses 64, 128, 256, 512, and 512 convolutional kernels of size 3 × 3 in the first through fifth convolutional blocks, respectively, and the 2 × 2 pooling layers use a stride of 2. The last convolutional layer uses 512 3 × 3 convolutional kernels before feeding the features into the fully connected layers, after which a softmax layer makes the prediction.
In setting the convolutional kernel size, the VGG chooses to stack multiple small convolutional kernels instead of a single large kernel. Multiple small kernels introduce fewer learnable parameters than a large kernel during feature extraction, so the model is less computationally intensive and faster to train. The analysis shows that the VGG improves the performance of the model by increasing the depth of the network, with the layer settings affecting the accuracy of the results. In 2021, Ye et al. proposed a lightweight VGG-16 model, showing that the VGG is competitive in the field of remote sensing image processing [37]. In 2023, Patel et al. examined the transfer learning capability of the VGG16 on HSIs through experimental validation, and the experiments demonstrated the accuracy of the VGG on HSIs [38].
In terms of structure, the VGG16 mainly uses convolutional and pooling layers, which makes the structure relatively simple and easy to understand. Its 16-layer network helps it learn deeper features. One of the biggest features of the VGG16 is that the overall framework chooses smaller convolutional kernels, i.e., 3 × 3-size convolutional kernels. This setup increases the nonlinear representation of the network and improves the accuracy of the model. The treatment of setting the same size convolution kernel for different convolutional layers enables the network to share parameters and improves the efficiency of the model. However, due to the deep layers in the VGG16, the VGG16 has a large memory requirement. Because of this, the VGG16 is prone to overfitting when there are insufficient data.
To address the deficiencies exhibited by the VGG16, we chose to reduce the number of layers to lower the parameters and computation. To prevent a lack of features due to insufficient depth, we dropped the pooling layer to reduce the loss of valid information. We maintained the use of 3 × 3 size convolutional kernels to preserve crucial features. Additionally, we retained the exponentially increasing characteristics in terms of channel settings to ensure feature retention.

2.2. GCN and Graph SAGE

The GCN is a network that represents data in the form of graph data structures for learning. It can simultaneously extract the topological structure and vertex attributes of data. Then, it will continue to learn the information of target vertices in the global space. The GCN belongs to the field of transductive learning.
The graph is usually represented by G ( V , E ) , where V represents the node set and E represents the edge set. Under this definition, nodes are usually represented by pixels in HSIs, and adjacent nodes have a connection relationship, which is represented in Equation (1):
$$
A_{ij} =
\begin{cases}
1, & \text{if } \{u_i, u_j\} \in E \ \text{and} \ i \neq j \\
0, & \text{otherwise}
\end{cases}
\tag{1}
$$

When there is an edge between nodes $u_i$ and $u_j$, the corresponding entry is 1; otherwise, it is 0. A key aspect of graph processing is the appropriate construction of the Laplacian matrix [39]. After determining $A$, we can calculate the Laplacian matrix $L$ of the graph as in Equation (2):

$$
L_{ij} =
\begin{cases}
d(u_i), & \text{if } i = j \\
-1, & \text{if } \{u_i, u_j\} \in E \ \text{and} \ i \neq j \\
0, & \text{otherwise}
\end{cases}
\tag{2}
$$

where $d(u_i)$ is the number of edges incident to node $u_i$ [40]. Based on these definitions, we can derive the connections between the data.
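As a minimal illustration of Equations (1) and (2), the following NumPy sketch builds the adjacency matrix $A$ and the combinatorial Laplacian $L = D - A$ for a small hypothetical graph; the edge list is purely illustrative and not taken from the paper.

```python
import numpy as np

# Hypothetical 4-node graph with edge set E = {{0,1}, {1,2}, {2,3}, {0,3}} (illustrative only)
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
n = 4

# Equation (1): A_ij = 1 if {u_i, u_j} in E and i != j, else 0
A = np.zeros((n, n), dtype=int)
for i, j in edges:
    A[i, j] = A[j, i] = 1

# Equation (2): L_ii = d(u_i); L_ij = -1 for edges; 0 otherwise (i.e., L = D - A)
D = np.diag(A.sum(axis=1))
L = D - A
print(L)
```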
GraphSAGE [41], on the other hand, is a variant of the GCN. It extends the GCN to trainable aggregation functions and uses the attributes of vertices to efficiently generate embeddings for unseen vertices. Thus, unlike the GCN, SAGE performs inductive learning and makes up for the GCN's inability to generalize directly. In a recent study, Cui et al. investigated a center-weighted convolution and GraphSAGE cooperative network (CW-SAGE) for HSI classification, exploiting the ability of SAGE to capture contextual relationships [42]. The GraphSAGE procedure is given in Algorithm 1:
Algorithm 1: GraphSAGE embedding generation (i.e., forward propagation) algorithm
Input: graph $G(V, E)$; input features $\{x_v, \forall v \in V\}$; depth $K$; weight matrices $W^k, \forall k \in \{1, \ldots, K\}$; non-linearity $\sigma$; differentiable aggregator functions $\mathrm{AGGREGATE}_k, \forall k \in \{1, \ldots, K\}$; neighborhood function $\mathcal{N}: v \rightarrow 2^V$
Output: vector representations $z_v$ for all $v \in V$
1: $h_v^0 \leftarrow x_v, \forall v \in V$
2: for $k = 1 \ldots K$ do
3:  for $v \in V$ do
4:   $h_{\mathcal{N}(v)}^k \leftarrow \mathrm{AGGREGATE}_k(\{h_u^{k-1}, \forall u \in \mathcal{N}(v)\})$
5:   $h_v^k \leftarrow \sigma\big(W^k \cdot \mathrm{CONCAT}(h_v^{k-1}, h_{\mathcal{N}(v)}^k)\big)$
6:  end for
7:  $h_v^k \leftarrow h_v^k / \lVert h_v^k \rVert_2, \forall v \in V$
8: end for
9: $z_v \leftarrow h_v^K, \forall v \in V$
GraphSAGE first randomly samples the neighboring nodes of each node in the graph to reduce computational complexity. It then generates the target node embedding by aggregating the information embedded in the sampled neighbors. During propagation, a node's information is extended to its kth-order neighbors after k rounds of aggregation. The embedding, as the input to the fully connected layer, is then used to predict the label of the node.
GraphSAGE is a spatial domain-based graph convolution whose features include both neighbor sampling and feature aggregation. The neighbor sampling setting enables the GCN to attempt large-scale data processing. Similarly, SAGE has some issues that need to be addressed. The randomness in SAGE sampling leads to feature instability in its inference process. Meanwhile, the sampling process may lead to the loss of important information on the nodes. SAGE also relies on the construction of graph data structures.
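To make Algorithm 1 concrete, the following is a minimal sketch of one mean-aggregator GraphSAGE layer in plain PyTorch (aggregate sampled neighbors, concatenate with the node's own embedding, apply a linear map and nonlinearity, then L2-normalize). The tensor shapes, the neighbor-index format, and the feature width are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MeanSAGELayer(nn.Module):
    """One GraphSAGE step: h_v <- sigma(W * CONCAT(h_v, mean(h_u for u in N(v)))), then L2-normalize."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(2 * in_dim, out_dim)

    def forward(self, h, neighbor_idx):
        # h: (num_nodes, in_dim); neighbor_idx: (num_nodes, num_sampled) indices of sampled neighbors
        h_neigh = h[neighbor_idx].mean(dim=1)        # AGGREGATE_k (mean aggregator)
        h_cat = torch.cat([h, h_neigh], dim=-1)      # CONCAT(h_v^{k-1}, h_N(v)^k)
        h_new = F.relu(self.lin(h_cat))              # sigma(W^k * ...)
        return F.normalize(h_new, p=2, dim=-1)       # h_v^k / ||h_v^k||_2

# Example: 6 nodes with 1024-dim features (the width produced by the CNN stage), 3 sampled neighbors each
h = torch.randn(6, 1024)
neighbors = torch.randint(0, 6, (6, 3))
out = MeanSAGELayer(1024, 512)(h, neighbors)         # -> (6, 512)
```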
To address the problems exhibited by SAGE, we introduce the features of neighboring pixels as a reference during sampling. The input from the original image is shifted to the input of the features. The local convolution in the previous stage effectively compensates for SAGE’s lack of information. In order to ensure the similarity relationship between nodes, we built an edge between the nodes. By iterating, the connection relationship of the model will be optimized and strengthened.

3. Proposed Method

HSIs provide rich information on geological features. We propose a model framework that can improve the utilization of information and make features more targeted. We randomly sampled and collected pixels near the sampling point as auxiliary classification, and fed the image block into the CNN module. In order to utilize the unique features of HSIs to obtain rich feature information, we chose to use the VGG16 to extract image features.
In a narrow sense, GCNs are used in computer vision to construct spatial topological relationships between data; in a broader sense, GCNs can be used to reflect similarity connections between data. GCNs need a more suitable entry point for processing HSIs, and starting from such a point, GCNs can be extended more effectively to the HSI field. Therefore, when representing the connection relationships of graph nodes, we use spectral similarity as the criterion rather than spatial adjacency. We constructed a graph data structure with the sampling points and their extracted information as feature values. Using SAGE for aggregation from a global perspective solves the problem that CNNs extract features only within a limited space. Finally, we make a probability prediction of the result. This section first introduces the overall framework of SFFNet and then describes the two sub-parts of the framework: CNN extraction and GCN classification.

3.1. Overall Framework

In SFFNet, the framework mainly consists of two parts: CNN feature extraction and GCN image classification. We used the CNN to extract rich information from the data. These features extracted by the CNN were then used to construct the graph data structure. In the GCN module, we used GraphSAGE as the graph convolutional layer to build the classification model. The previous features were aggregated to obtain the final prediction probability. The overall framework is shown below.
As shown in Figure 1, we first randomly sampled the hyperspectral raw image. In addition to the selected sampling point, we extracted an image block of 7 × 7 size centered on that pixel; the image block size was set equal to the patch_size. The other data points in this image block are only used to provide features that assist in classifying the central pixel. The image block serves as the initial information of the sampling point and is fed into the feature extraction module, where it undergoes convolution; the final classification still targets the central sampling point. Let the number of samples be n and the number of image blocks be N. The N image blocks are fed into the CNN feature extraction module, which consists of three convolutional layers with a kernel size of 3 × 3. The CNN performs convolution in fixed local regions, which reflects its limitation in feature extraction. According to our design of the CNN module, after the third convolutional layer we finally obtain n × (1 × 1 × 1024) feature vectors. These are the feature nodes used to construct the graph data structure. Since we add the auxiliary image blocks while extracting the pixel points, the topological and semantic information of the spatial-spectral blocks is richer, which gives our model more reference information and stronger connectivity.
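A minimal sketch of the random sampling and 7 × 7 patch extraction described above, assuming the HSI is stored as a NumPy array of shape (height, width, bands); the reflect padding at the image border, the function name, and the use of the label map are illustrative assumptions.

```python
import numpy as np

def sample_patches(hsi, labels, n_samples, patch_size=7, seed=0):
    """Randomly sample labeled pixels and cut a patch_size x patch_size block around each one."""
    rng = np.random.default_rng(seed)
    half = patch_size // 2
    # Reflect-pad so that border pixels also get a full-size neighborhood
    padded = np.pad(hsi, ((half, half), (half, half), (0, 0)), mode="reflect")
    rows, cols = np.nonzero(labels > 0)                  # consider labeled pixels only
    pick = rng.choice(len(rows), size=n_samples, replace=False)
    patches, targets = [], []
    for r, c in zip(rows[pick], cols[pick]):
        patches.append(padded[r:r + patch_size, c:c + patch_size, :])
        targets.append(labels[r, c])                     # label of the central pixel
    return np.stack(patches), np.array(targets)

# Example with a random cube standing in for an HSI (145 x 145 pixels, 200 bands)
hsi = np.random.rand(145, 145, 200).astype(np.float32)
labels = np.random.randint(0, 17, size=(145, 145))
blocks, y = sample_patches(hsi, labels, n_samples=64)    # blocks: (64, 7, 7, 200)
```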
When constructing the graph data structure, we first constructed a connection relationship between each feature node and updated the relationship weights according to the spectral similarity between the nodes. When the weights are reduced, the influence produced by their node vectors on the target nodes is likewise reduced. As shown in the GCN classification module in Figure 1, the solid lines in the graph data structure represent the connection relationships between nodes, and the dashed lines represent the aggregation relationships between graph convolutional layers.
Unlike other networks, the SFFNet construction of the graph data structure occurs after the CNN extracts features rather than when the data are initially processed. The complete network is shown below.
Figure 2 shows the network of SFFNet. Our core SFFNet block consists of two sub-modules: CNN extraction and GCN classification. The graph data structure is constructed between the CNN extraction block and the GCN classification block. The CNN extraction block is built from a three-layer 2D-CNN, in which the number of convolutional channels grows exponentially from layer to layer. The GCN classification block contains two SAGE layers that convolve the feature data while exponentially reducing the number of channels. We then arranged three fully connected layers to reduce the number of channels to the size required for classification; the output of the last layer gives the final classification prediction probability. ReLU is used as the activation function throughout, although this is not shown explicitly in Figure 2.
As can be seen from Table 1, the number of convolution channels grows or shrinks by factors of 2, the idea being to retain as much feature information as possible while keeping the refinement stable. After a complete process of convolution and aggregation, high-precision classification results are finally obtained.

3.2. CNN Extraction

The VGG16 doubles the number of convolutional channels after each pooling, which allows the network to retain more features. We continued this idea and doubled the number of convolution channels when building the neural network. The VGG16 initially uses 64 3 × 3 convolution kernels for the first convolution layer. Based on the rich spectral information of hyperspectral images, we set the size of the convolution kernel in the first layer to 3 × 3, and the number of output channels was 256. In order to comply with the VGG settings, we doubled the number of convolutional channels to 512 and 1024, respectively. The schematic diagram of feature extraction is as follows.
In total, we set up three convolutional layers instead of the original 13 convolutional layers of the VGG, as shown in Figure 3. This is because increasing the depth of the network, as the VGG16 does, also increases the difficulty of tuning the parameters. We wanted the model to have an advantage in reducing information loss during feature extraction, so we reduced the number of network layers. To prevent the reduced depth from degrading performance, we also omitted the pooling layers to reduce information loss.
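A minimal PyTorch sketch of the three-layer extraction block described above (3 × 3 kernels, channel widths 256, 512, and 1024, no pooling). The number of input bands and the absence of padding are assumptions made so that a 7 × 7 patch reduces to a 1 × 1 × 1024 feature vector, consistent with the output stated in Section 3.1; this is not the authors' exact code.

```python
import torch
import torch.nn as nn

class CNNExtractor(nn.Module):
    """Three 3x3 convolutions with exponentially growing channels (256, 512, 1024) and no pooling."""
    def __init__(self, in_bands=200):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_bands, 256, kernel_size=3), nn.ReLU(inplace=True),
            nn.Conv2d(256, 512, kernel_size=3), nn.ReLU(inplace=True),
            nn.Conv2d(512, 1024, kernel_size=3), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # x: (batch, bands, 7, 7) -> (batch, 1024, 1, 1) -> (batch, 1024)
        return self.features(x).flatten(1)

patches = torch.randn(64, 200, 7, 7)       # a batch of 7 x 7 spectral patches
nodes = CNNExtractor(200)(patches)         # (64, 1024) feature vectors, one graph node per patch
```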
CNNs are usually classified into 1D, 2D, and 3D. We use the 2D convolutional neural network, which is the most commonly used in the field of computer vision. Its calculation formula is as follows:
$$
v_{ij}^{xy} = \sigma\!\left(b_{ij} + \sum_{m=0}^{M_i-1} \sum_{p=0}^{P_i-1} \sum_{q=0}^{Q_i-1} w_{ijm}^{pq}\, v_{(i-1)m}^{(x+p)(y+q)}\right)
\tag{3}
$$
In Equation (3), $P_i$ and $Q_i$ denote the height and width of the convolution kernel of the $i$th convolutional layer, and $p$ and $q$ are the corresponding indexes. $M_i$ denotes the number of feature maps over which the kernels of the $i$th convolutional layer convolve, i.e., the depth of that layer's input, and $m$ is the corresponding index. The position indexes are $x$ and $y$. $v_{ij}^{xy}$ denotes the value of the $j$th feature map output by the $i$th convolutional layer at position $(x, y)$. $w_{ijm}^{pq}$ denotes the weight at $(p, q)$ of the $m$th kernel used to compute the $j$th feature map in the $i$th layer. $b_{ij}$ denotes the bias term of the $j$th feature map in the $i$th layer. Finally, $\sigma$ is the nonlinear activation function of the network.
Let the input size be $(C_{in}, H_{in}, W_{in})$ and the output size be $(C_{out}, H_{out}, W_{out})$. After one conv2d layer, the output size can be calculated as in Equations (4) and (5):

$$
H_{out} = \left\lfloor \frac{H_{in} + 2 \times \mathrm{padding}[0] - \mathrm{dilation}[0] \times (\mathrm{kernel\_size}[0] - 1) - 1}{\mathrm{stride}[0]} \right\rfloor + 1
\tag{4}
$$

$$
W_{out} = \left\lfloor \frac{W_{in} + 2 \times \mathrm{padding}[1] - \mathrm{dilation}[1] \times (\mathrm{kernel\_size}[1] - 1) - 1}{\mathrm{stride}[1]} \right\rfloor + 1
\tag{5}
$$

where $C$, $H$, and $W$ represent the number of channels, height, and width of the data, respectively. According to Equations (4) and (5), we can calculate the size of the HSI features after convolutional extraction, which facilitates data processing in the subsequent image classification stage.
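A short helper implementing Equations (4) and (5); the default padding, dilation, and stride values shown here mirror common conv2d defaults and are stated only for illustration.

```python
def conv2d_out_size(h_in, w_in, kernel_size, stride=1, padding=0, dilation=1):
    """Output height and width of a conv2d layer, per Equations (4) and (5)."""
    h_out = (h_in + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1
    w_out = (w_in + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1
    return h_out, w_out

# A 7 x 7 patch after one 3 x 3 convolution with stride 1 and no padding -> 5 x 5
print(conv2d_out_size(7, 7, kernel_size=3))   # (5, 5)
```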
Note that in our framework, the CNN only serves the purpose of feature extraction and is not used for the final classification, so there is no fully connected (FC) layer or softmax in this module.

3.3. GCN Classification

The proposal of the GCN makes it possible to represent and analyze data as graph data structures. Unlike the CNN, which only extracts information from image samples in a local space, the GCN is able to aggregate from a global perspective, taking full account of the topological structure and semantic information of the image. Existing methods for constructing graph data structures can be classified mainly into two types: constructing graphs with pixels as nodes, or constructing graphs with blocks of pixels (e.g., superpixels) as nodes. Some works use an attention mechanism to characterize the relationships between graph nodes [43]. The main reason for using segmentation algorithms to process the data is to prevent an excessive number of image nodes, which would lead to information redundancy and high computational cost.
It can be seen that common graph convolutional neural networks are mainly applied to extract the spatial relationships between the pixels of HSIs. Since the spatial adjacency between hyperspectral pixels may carry little useful information when neighboring pixels belong to different categories, we want to apply GCNs to HSIs in a more targeted way. When creating the graph data structure, we switch from the traditional spatial topological relationship to spectral similarity and update the weight relationships between spectra over several iterations. HSIs are rich in spectral information, which is valuable for information utilization.
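A minimal sketch of building graph edges from spectral similarity rather than spatial adjacency, as described above. Cosine similarity between the CNN feature nodes combined with a k-nearest-neighbor rule is used here as one plausible realization; the similarity measure, the value of k, and the edge weighting are assumptions, since the paper does not fix them at this point.

```python
import torch
import torch.nn.functional as F

def spectral_knn_graph(node_feats, k=10):
    """Connect each feature node to its k most spectrally similar nodes (cosine similarity)."""
    x = F.normalize(node_feats, p=2, dim=1)          # unit-norm feature vectors
    sim = x @ x.t()                                  # pairwise cosine similarity
    sim.fill_diagonal_(-1.0)                         # keep self-loops out of the top-k
    topk = sim.topk(k, dim=1).indices                # k most similar nodes per node
    src = torch.arange(node_feats.size(0)).repeat_interleave(k)
    dst = topk.reshape(-1)
    edge_index = torch.stack([src, dst], dim=0)      # (2, num_edges) edge list in COO form
    edge_weight = sim[src, dst]                      # similarity reused as the edge weight
    return edge_index, edge_weight

nodes = torch.randn(64, 1024)                        # feature nodes from the CNN stage
edge_index, edge_weight = spectral_knn_graph(nodes, k=8)
```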
The adjacency matrix, denoted by $A$, describes the relationship between a node and its neighboring nodes. $D$ is a diagonal matrix whose entry $D_{ii}$ denotes the degree of the $i$th node, as in Equation (6):

$$
D_{ii} = \sum_{j=1}^{n} A_{i,j}
\tag{6}
$$

The spectral convolution is then defined as follows:

$$
\tilde{A} = A + I
\tag{7}
$$

$$
\tilde{D}_{i,i} = \sum_{j} \tilde{A}_{i,j}
\tag{8}
$$

$$
H^{(l+1)} = \sigma\!\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)
\tag{9}
$$

where $H^{(l)}$ and $W^{(l)}$ denote the output and trainable parameters of layer $l$. Let $L$ denote the Laplacian matrix of the graph; $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ is a symmetrically normalized form of $L$ that prevents the gradient from exploding or vanishing as the depth of the GCN increases during training. $\sigma(\cdot)$ is an activation function.
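A small NumPy sketch of the renormalization trick in Equations (7)-(9): add self-loops, compute the degree matrix, symmetrically normalize, and apply one propagation step. The toy adjacency matrix, the feature and weight dimensions, and the choice of ReLU for $\sigma$ are illustrative assumptions.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step: sigma(D~^{-1/2} A~ D~^{-1/2} H W), with sigma = ReLU."""
    A_tilde = A + np.eye(A.shape[0])                 # Eq. (7): add self-loops
    d = A_tilde.sum(axis=1)                          # Eq. (8): degrees of A~
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt        # symmetric normalization
    return np.maximum(A_hat @ H @ W, 0.0)            # Eq. (9)

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)   # toy 3-node chain graph
H = np.random.rand(3, 1024)                          # node features from the CNN stage
W = np.random.rand(1024, 512)                        # trainable weights of the layer
H_next = gcn_layer(A, H, W)                          # (3, 512)
```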
In our classification module, we chose SAGE, a variant of the GCN, instead of the plain GCN because SAGE better matches our goals of rapid convergence and strong global information aggregation.
As shown in Figure 4, after the previous stage of CNN extraction, we obtained informative pixel features. We constructed these feature nodes into a graph data structure and aggregated the neighboring nodes of each target node to obtain the final feature information. We used a total of two SAGE layers for graph aggregation. Due to the processing of the feature extraction module, the number of input channels received in the second stage is 1024. In the first graph convolutional layer, we compressed the number of channels to 512 and normalized them with a one-dimensional BatchNorm (BN), which keeps the inputs of each network layer consistent during training. Mirroring the channel-doubling scheme of the VGG, we adopted a halving mechanism for the number of channels in the graph convolutional layers. The second graph convolutional layer was compressed to 256 channels and used as input for the subsequent fully connected layers. We again chose three fully connected layers and used ReLU as the activation function.
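A minimal sketch of this classification head, assuming PyTorch Geometric's SAGEConv as the graph convolution. The widths of the three fully connected layers and the number of classes are illustrative assumptions, since the paper does not specify them here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv   # assumes PyTorch Geometric is installed

class GCNClassifier(nn.Module):
    """Two SAGE layers (1024 -> 512 -> 256) with BatchNorm, followed by three FC layers."""
    def __init__(self, num_classes=16):
        super().__init__()
        self.sage1 = SAGEConv(1024, 512)
        self.bn1 = nn.BatchNorm1d(512)
        self.sage2 = SAGEConv(512, 256)
        self.bn2 = nn.BatchNorm1d(256)
        self.fc = nn.Sequential(
            nn.Linear(256, 128), nn.ReLU(inplace=True),   # FC widths are illustrative
            nn.Linear(128, 64), nn.ReLU(inplace=True),
            nn.Linear(64, num_classes),
        )

    def forward(self, x, edge_index):
        x = F.relu(self.bn1(self.sage1(x, edge_index)))
        x = F.relu(self.bn2(self.sage2(x, edge_index)))
        return self.fc(x)                                  # class logits per feature node

nodes = torch.randn(64, 1024)                              # CNN-stage features as graph nodes
edge_index = torch.randint(0, 64, (2, 512))                # illustrative edge list
logits = GCNClassifier(16)(nodes, edge_index)              # (64, 16)
```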

4. Experimental Discussion

In this section, we conduct specific experiments on SFFNet and compare it with other related machine learning methods to test the performance of the framework. The robustness and adaptability of the framework can be tested by training it on datasets with different characteristics and attributes. We used single-class classification accuracy, overall classification accuracy (OA), average classification accuracy (AA), and the Kappa coefficient as performance evaluation metrics. OA, AA, and Kappa are defined in Equations (10)-(12), respectively:

$$
OA = \frac{\sum_{i=1}^{C} m_{ii}}{n}
\tag{10}
$$

$$
AA = \frac{1}{C} \sum_{i=1}^{C} \frac{m_{ii}}{n_i}
\tag{11}
$$

$$
\kappa = \frac{N \sum_{i=1}^{r} x_{ii} - \sum_{i=1}^{r} x_{i+} \times x_{+i}}{N^2 - \sum_{i=1}^{r} x_{i+} \times x_{+i}}
\tag{12}
$$

where $C$ is the number of sample categories, $m_{ii}$ is the number of samples correctly classified in category $i$, $n$ is the total number of samples, and $n_i$ is the number of samples in category $i$. $r$ is the number of rows and columns of the confusion matrix, $N$ is the total number of observations, $x_{ii}$ is the $(i, i)$ entry of the confusion matrix, and $x_{i+}$ and $x_{+i}$ are the marginal sums of row $i$ and column $i$, respectively.
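A compact NumPy implementation of OA, AA, and the Kappa coefficient computed from a confusion matrix, following Equations (10)-(12); the toy confusion matrix is illustrative and assumes rows index the ground-truth classes.

```python
import numpy as np

def classification_metrics(conf):
    """OA, AA, and the Kappa coefficient from a (C x C) confusion matrix (rows = ground truth)."""
    n = conf.sum()
    oa = np.trace(conf) / n                                      # Eq. (10)
    aa = np.mean(np.diag(conf) / conf.sum(axis=1))               # Eq. (11): per-class accuracy, averaged
    pe = (conf.sum(axis=1) * conf.sum(axis=0)).sum() / n ** 2    # chance agreement
    kappa = (oa - pe) / (1 - pe)                                 # algebraically equal to Eq. (12)
    return oa, aa, kappa

conf = np.array([[50, 2, 0],
                 [3, 45, 2],
                 [0, 1, 47]], dtype=float)                       # toy confusion matrix
print(classification_metrics(conf))
```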
This framework mainly combines two ideas: the CNN and the GCN. Therefore, in selecting comparative experiments, we focused on these two neural network frameworks. At the same time, since we fuse different network frameworks, we needed to include this idea in the selection of comparison methods. Based on these principles, and also taking recognition and novelty into account, we chose the following four frameworks for comparison: 3D-CNN [18], WFCG [31], MDGCN [26], and AMGCFN [32].

4.1. Introduction to the Datasets

Three publicly available hyperspectral datasets were used in this experiment: the Indian Pines dataset, the Pavia University scene dataset, and the Houston 2013 dataset. We used the different models to classify these three datasets. To make the samples uniform, both normalization and standardization were applied as pre-processing steps.
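A minimal sketch of the per-band normalization and standardization used as preprocessing; scaling each band to [0, 1] and then standardizing to zero mean and unit variance is one common realization and is an assumption here, since the paper does not specify the exact order or formulas.

```python
import numpy as np

def preprocess(hsi):
    """Scale each spectral band to [0, 1], then standardize it to zero mean and unit variance."""
    h, w, b = hsi.shape
    x = hsi.astype(np.float64).reshape(-1, b)                               # (pixels, bands)
    x = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0) + 1e-12)       # normalization
    x = (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-12)                      # standardization
    return x.reshape(h, w, b)

cube = np.random.rand(145, 145, 200) * 4000     # raw digital numbers standing in for an HSI
cube_pp = preprocess(cube)
```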
Table 2 and Figure 5 present the feature types of the Indian Pines dataset. This dataset was collected by the AVIRIS sensor at the Indian Pines test site in northwestern Indiana. Its image size is 145 × 145 pixels and it contains 224 spectral bands, of which 200 effective spectral bands remain after removing the water absorption band. The scene is composed of two-thirds agriculture and one-third forest or vegetation, with a total of 16 feature classes. Its characteristics are a large number of bands and a narrow spectral range. In terms of class distribution, this dataset is relatively concentrated. As a public dataset, it can be obtained from Hyperspectral Remote Sensing Scenes—Grupo de Inteligencia Computacional (GIC) (ehu.eus).
Table 3 and Figure 6 present the feature types of the Pavia University scene dataset. This dataset was acquired by the ROSIS sensor over Pavia, northern Italy, in 2001, with an image size of 610 × 610 pixels. The dataset contains 103 bands with wavelengths ranging from 0.43 μm to 0.86 μm, totaling 42,776 labeled samples. After removing samples that contain no information, the data can be divided into roughly nine classes, mainly covering urban materials, water, and vegetation. Compared to the other datasets, the Pavia University scene dataset has fewer spectral bands and a large number of samples. In terms of class distribution, this dataset includes both scattered and concentrated data. As a public dataset, it can be obtained from Hyperspectral Remote Sensing Scenes—Grupo de Inteligencia Computacional (GIC) (ehu.eus).
Table 4 and Figure 7 present the feature types of the Houston 2013 dataset. This dataset was obtained from the University of Houston campus and neighboring urban areas. The dataset was collected by the ITRES CASI-1500 sensor and is 349 × 1905 pixels in size with 144 spectral bands ranging from 364 to 1046 nm. The spectral resolution is 10 nm. It was used in the 2013 IEEE GRSS data fusion competition. The dataset we used is the HSI version of the cloud-free case, which can be classified into 15 taxonomically challenging categories. The Houston 2013 dataset’s characteristics of high spectral and spatial resolution make it an important data source in the remote sensing field. It has a wide spectral coverage and dispersed categories, posing higher demands on classification models. As a public dataset, it can be obtained from the 2013 IEEE GRSS Data Fusion Contest—Fusion of Hyperspectral and LiDAR Data at the Hyperspectral Image Analysis Lab (uh.edu).

4.2. Accuracy Analysis

All the experiments were implemented in the PyTorch framework. The environment setup of SFFNet is the same as that of the other comparison experiments. The training and test sets were randomly generated in a ratio of 8:2. The experiments were carried out on an NVIDIA GeForce RTX 3090 24 GB GPU and an Intel(R) Core(TM) i7-13700KF CPU.
In this experiment, the patch size was empirically set to 7, the batch size to 64, the learning rate to 0.001, and the momentum to 0.9. All other parameters of the comparison experiments were configured according to the optimal settings provided in their original papers, and we took the optimal values reported in the original experiments for comparison. To make the results more convincing, each experimental framework was run 10 times on each dataset to obtain averages and accuracy ranges, from which the stability of the results can be inferred. The specific experimental results are shown below, and we evaluate the classification performance both quantitatively and qualitatively. The best results are highlighted in bold. The framework and comparison experiments are analyzed dataset by dataset. For comparison, we provide false-color images and ground-truth maps of the datasets.
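For reference, the hyperparameters listed above can be gathered into a single configuration; the choice of SGD as the optimizer is an assumption inferred from the stated momentum value, and the placeholder model is purely illustrative.

```python
import torch

config = {
    "patch_size": 7,        # spatial size of the image blocks
    "batch_size": 64,
    "learning_rate": 1e-3,
    "momentum": 0.9,
    "train_ratio": 0.8,     # 8:2 train/test split
    "niter": 300,           # training iterations used in Section 5
    "runs": 10,             # each experiment repeated 10 times and averaged
}

model = torch.nn.Linear(1024, 16)                       # placeholder standing in for SFFNet
optimizer = torch.optim.SGD(model.parameters(),
                            lr=config["learning_rate"],
                            momentum=config["momentum"])
```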

4.2.1. Results on the Indian Pines Datasets

As shown in Figure 8, SFFNet makes almost no obvious classification errors. The other frameworks also achieve good classification results, but they are prone to errors at the boundaries between different categories. The errors of the AMGCFN are more obvious and are roughly uniformly distributed. This shows that our framework is more advantageous in distinguishing the edge boundaries of different categories.
Table 5 shows the quantitative results of the different frameworks on the Indian Pines dataset. It can be seen that our framework achieves top-level results, with optimal single-class accuracies of up to 100.00%. Although the suboptimal framework (WFCG) also reaches 100.00% on some single categories, its overall accuracy (OA) falls about 4% short of the optimal framework (SFFNet). Compared with the 3D-CNN framework, which uses only convolutional neural networks, the other three models incorporating GCNs all obtain better classification results, which suggests that exploiting the topology of HSIs is necessary.

4.2.2. Results on the Pavia University Scene Dataset

As can be seen from the comparison in Figure 9, the classification map generated by SFFNet is closer to the real ground value. Compared with the other frameworks combining CNN and GCN, SFFNet first uses a CNN to extract features and then constructs a graph data structure for the GCN classification, which can better capture image information and spatial relationships. This method can successfully complete the task of category differentiation with large jumps.
Table 6 shows the classification results of the different frameworks on the PU dataset. As can be seen from the optimal results, the classification performance of SFFNet is significantly better than that of other competitors. SFFNet is able to better extract the data features and construct a more appropriate graph structure, which not only takes into account its local information, but also flexibly extracts the global information. It efficiently perceives the changes in spatial structure in order to meet the classification requirements. Compared with the sub-optimal result (MDGCN), our framework can improve the classification accuracy by about 2.5%.

4.2.3. Results on the Houston 2013 Dataset

Figure 10 provides a comparison of the visual performance on the Houston data. The visual results look similar because the categories are dispersed and the amount of data is large. However, examining the same densely populated regions shows that SFFNet makes fewer classification errors.
The Houston dataset is larger and more complex compared to the other datasets. It is more fragmented in terms of category distribution. Table 7 provides the classification results of the five frameworks on the Houston dataset. Regardless of the classification accuracy of each class or the overall classification accuracy, our framework is still able to achieve the best results, with up to 100.00%. The sub-optimal framework (WFCG) has about 4% lower classification accuracy than SFFNet, indicating that the construction of the refined graph data structure is effective in improving classification accuracy.

5. Discussion

In this paper, experiments were conducted on a fusion framework of a GCN and CNN. In fact, the training accuracy of the present framework was able to converge to 100.00% on several categories, and the comparison methods were also able to do so on some categories. We attribute this partly to the limited number of publicly available datasets, which allows models to be tailored closely to them; through progressive optimization, these models all achieve high-accuracy classification results. This situation could be changed in two ways: providing new datasets to increase the generality of the models, or introducing additional evaluation metrics for HSI classification.
In practical applications, a model can be evaluated in many ways, such as its convergence speed and stability. We further analyzed the performance of the CNN, the GCN, and CNN+GCN (i.e., SFFNet). Table 8 shows the classification accuracies of the single frameworks and the fused framework on the different datasets when niter is set to 300.
As shown in Table 8, SFFNet essentially achieves the best accuracy on each dataset. The AA of the CNN on the Indian Pines dataset is better than that of SFFNet, but as can be seen from Section 4, the final accuracy of SFFNet is higher once the number of training iterations is increased. These accuracy results demonstrate the superiority of the fusion framework.
Figure 11 illustrates the change in the loss of the three models during the training iterations. All three models exhibit a decreasing trend on all three datasets, but the decrease rate of SFFNet is significantly faster than the other two single-classification models. During subsequent iterations, the loss of SFFNet converges almost infinitely to 0, reflecting its favorable achievement of high classification accuracy.
Figure 12 shows the overall classification accuracy of the three models over the niter interval from 0 to 300, with classification data recorded every 10 iterations. The local accuracy provides a clearer indication of the classification trends. It can be observed that all the models eventually achieve high-precision classification. However, the CNN exhibits stronger fluctuations across the different datasets, whereas SFFNet is relatively more stable and converges faster.
We observed a greater variation in accuracy across the different datasets, which may be attributed to the variability of the datasets. For instance, the data in the Indian Pines dataset are more concentrated and block-distributed, whereas the data in the Pavia University and Houston datasets are scattered. Consequently, the classification accuracy and stability of the models under the Pavia University and Houston datasets are better than those under the Indian Pines dataset. Therefore, it is reasonable to speculate that SFFNet is more suitable for classifying large scattered datasets, aligning with our current practical application environment.

6. Conclusions

This paper presents a model called SFFNet that combines a CNN and GCN for HSI classification. To address the challenge of constructing the graph data structure of original HSIs, we first utilized the CNN framework to extract the original features. Then, we constructed the graph data structure. This structure was input into the GCN to complete the final classification. The framework leverages topological information to ensure that local image information complements global information, allowing for flexible feature capture and substantial improvement in classification performance across various datasets.
Compared with single models, SFFNet remedies the shortcomings of CNNs, which rely on local convolution, and of GCNs, which require full-batch input. The model further provides deep extraction of spectral–spatial information, which makes it more suitable for HSIs. Compared with other multi-model methods, SFFNet optimizes the fusion strategy between models, enabling each original model to better express its characteristics. Based on the experimental results, SFFNet consistently achieved the best performance: in quantitative comparisons, it exhibited the highest classification accuracy, and in qualitative comparisons, it produced the best classification maps. It is evident that SFFNet excels in category differentiation, particularly in distinguishing edge categories. When handling boundary information, the fixed convolutional kernel of a CNN extracts features indiscriminately, whereas the introduction of the GCN makes the model more flexible in feature extraction and biases it toward features with higher similarity. Additionally, the incorporation of neighboring nodes in SFFNet enables the model to better differentiate the categories of surrounding nodes. For HSIs, spectral information is the most important feature, and the spectral-similarity structure in SFFNet makes the model more sensitive to differences between spectra. Consequently, SFFNet possesses a strong advantage in feature extraction and differentiation, and can complete classification tasks to a high standard.
However, SFFNet also has its limitations. Although SFFNet achieves better results with the same number of iterations, it requires higher time costs than other models. In future work, we will further update the model so that it maintains its stability and high accuracy while reducing the time overhead, for example, by choosing a more lightweight CNN structure to reduce the parameters and computation, or by updating the sampling method of SAGE.

Author Contributions

Conceptualization, H.L. and X.X.; methodology, X.X and C.L.; software, Y.M.; validation, H.L., Y.L. and S.Z.; formal analysis, H.L., X.X. and C.L.; resources, H.L. and C.L.; data curation, X.X.; writing—original draft preparation, H.L. and X.X.; writing—review and editing, H.L., C.L. and Y.M.; visualization, X.X. and S.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by Hubei’s Key Project of Research and Development Program under Grant 2023BBB046, and National Natural Science Foundation of China under Grant U23B2050, and Excellent young and middle-aged scientific and technological innovation teams in colleges and universities of Hubei Province under Grant T2021009, and NSFC-CAAC under Grant U1833119, and Science and Technology Program of Hubei Provincial Education Department under Grant D20221604, and the Natural Science Foundation of Hubei Province of China under Grant 2023AFB351, and the University-Industry Collaborative Education Program under Grant 230705841293521.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found here: [https://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes] (20 May 2011); [https://hyperspectral.ee.uh.edu/?page_id=459] (16 February 2013).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Mou, L.; Ghamisi, P.; Xiao, X.Z. Deep Recurrent Neural Networks for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3639–3655. [Google Scholar] [CrossRef]
  2. Jia, X.; Richards, J.A. Efficient maximum likelihood classification for imaging spectrometer data set. IEEE Trans. Geosci. Remote Sens. 1994, 32, 274–281. [Google Scholar]
  3. Vapnik, V. The Nature of Statistical Learning Theory; Springer Science & Business Media: Berlin/Heidelberg, Germany, 1999. [Google Scholar]
  4. Tarabalka, Y.; Fauvel, M.; Chanussot, J.; Benediktsson, J.A. SVM-and MRF-based method for accurate classification of hyperspectral images. IEEE Geosci. Remote Sens. Lett. 2010, 7, 736–740. [Google Scholar] [CrossRef]
  5. Hang, R.; Liu, Q.; Hong, D.; Ghamisi, P. Cascaded recurrent neural networks for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2019, 57, 5384–5394. [Google Scholar] [CrossRef]
  6. Zhu, L.; Chen, Y.; Ghamisi, P.; Benediktsson, J. Generative adversarial networks for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2018, 56, 5046–5063. [Google Scholar] [CrossRef]
  7. He, J.; Zhao, L.; Yang, H.; Zhang, M.; Li, W. HSI-BERT: Hyperspectral image classification using the bidirectional encoder representation from Transformers. IEEE Trans. Geosci. Remote Sens. 2019, 58, 165–178. [Google Scholar] [CrossRef]
  8. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  9. Hong, D.; Han, Z.; Yao, J.; Gao, L.; Zhang, B.; Plaza, A.; Chanussot, J. SpectralFormer: Rethinking hyperspectral image classification with transformers. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–15. [Google Scholar] [CrossRef]
  10. Audebert, N.; Le Saux, B.; Lefèvre, S. Deep learning for classification of hyperspectral data: A comparative review. IEEE Geo-Sci. Remote Sens. Mag. 2019, 7, 159–173. [Google Scholar] [CrossRef]
  11. Hu, W.; Huang, Y.; Wei, L.; Zhang, F.; Li, H. Deep convolutional neural networks for HSI classification. J. Sens. 2015, 2015, 258619. [Google Scholar] [CrossRef]
  12. Chen, Y.; Jiang, H.; Li, C.; Jia, X.; Ghamisi, P. Deep feature extraction and classification of hyperspectral images based on convolutional neural networks. IEEE Trans. Geosci. Remote Sens. 2016, 54, 6232–6251. [Google Scholar] [CrossRef]
  13. Sharma, V.; Diba, A.; Tuytelaars, T.; Van Gool, L. Hyperspectral CNN for Image Classification & Band Selection, with Application to Face Recognition; Technical Report KUL/ESAT/PSI/1604; KU Leuven, ESAT: Leuven, Belgium, 2016. [Google Scholar]
  14. Makantasis, K.; Karantzalos, K.; Doulamis, A.; Van Gool, L. Deep supervised learning for hyperspectral data classification through convolutional neural networks. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, Milan, Italy, 26–31 July 2015; IEEE: Milan, Italy, 2015; pp. 4959–4962. [Google Scholar]
  15. Yue, J.; Zhao, W.Z.; Mao, S.J.; Liu, H. Spectral-spatial classification of hyperspectral images using deep convolutional neural networks. Remote Sens. Lett. 2015, 6, 468–477. [Google Scholar] [CrossRef]
  16. Song, W.W.; Li, S.T.; Fang, L.Y.; Lu, T. Hyperspectral image classification with deep feature fusion network. IEEE Trans. Geosci. Remote Sens. 2018, 56, 3173–3184. [Google Scholar] [CrossRef]
  17. Yang, J.; Zhao, Y.; Chan, C.W.; Yi, C. Hyperspectral image classification using two-channel deep convolutional neural network. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium, Beijing, China, 10–15 July 2016. [Google Scholar]
  18. Li, Y.; Zhang, H.; Shen, Q. Spectral–spatial classification of hyperspectral imagery with 3D convolutional neural network. Remote Sens. 2017, 9, 67. [Google Scholar] [CrossRef]
  19. Zhou, J.; Zeng, S.; Xiao, Z.; Zhou, J.; Li, H.; Kang, Z. An Enhanced Spectral Fusion 3D CNN Model for Hyperspectral Image Classification. Remote Sens. 2022, 14, 5334. [Google Scholar] [CrossRef]
  20. Xu, Z.; Su, C.; Wang, S.; Zhang, X. Local and Global Spectral Features for Hyperspectral Image Classification. Remote Sens. 2023, 15, 1803. [Google Scholar] [CrossRef]
  21. Wu, Z.H.; Pan, S.R.; Chen, F.W.; Long, G.; Zhang, C.; Philip, S.Y. A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 4–24. [Google Scholar] [CrossRef]
  22. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017; ICLR: Toulon, France, 2017; pp. 1–14. [Google Scholar]
  23. Qin, A.Y.; Shang, Z.W.; Tian, J.Y.; Wang, Y.; Zhang, T.; Tang, Y.Y. Spectral-spatial graph convolutional networks for semisupervised hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 2019, 16, 241–245. [Google Scholar] [CrossRef]
  24. Mou, L.C.; Lu, X.Q.; Li, X.L.; Zhu, X.X. Nonlocal graph convolutional networks for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2020, 58, 8246–8257. [Google Scholar] [CrossRef]
  25. Zhang, S.; Ting, H.H.; Xu, J.J.; Maciejewski, R. Graph convolutional networks: A comprehensive review. Comput. Soc. Netw. 2019, 6, 1–23. [Google Scholar] [CrossRef]
  26. Wan, S.; Gong, C.; Zhong, P.; Du, B.; Zhang, L.; Yang, J. Multiscale dynamic graph convolutional network for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2019, 58, 3162–3177. [Google Scholar] [CrossRef]
  27. Wan, S.; Gong, C.; Zhong, P.; Pan, S.; Li, G.; Yang, J. Hyperspectral image classification with context-aware dynamic graph convolutional network. IEEE Trans. Geosci. Remote Sens. 2020, 59, 597–612. [Google Scholar] [CrossRef]
  28. Ding, Y.; Zhao, X.; Zhang, Z.; Cai, W.; Yang, N. Graph sample and aggregate-attention network for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
29. Hong, D.; Gao, L.; Yao, J.; Zhang, B.; Plaza, A.; Chanussot, J. Graph convolutional networks for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2020, 59, 5966–5978. [Google Scholar] [CrossRef]
  30. Liu, Q.; Xiao, L.; Yang, J.; Wei, Z. Multilevel superpixel structured graph U-Nets for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–15. [Google Scholar] [CrossRef]
  31. Dong, Y.; Liu, Q.; Du, B.; Zhang, L. Weighted feature fusion of convolutional neural network and graph attention network for hy-perspectral image classification. IEEE Trans. Image Process. 2022, 31, 1559–1572. [Google Scholar] [CrossRef]
  32. Zhou, H.; Luo, F.; Zhuang, H.; Weng, Z.; Gong, X.; Lin, Z. Attention Multi-hop Graph and Multi-scale Convolutional Fusion Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–14. [Google Scholar]
  33. Yu, L.; Peng, J.; Chen, N.; Sun, W.; Du, Q. Two-Branch Deeper Graph Convolutional Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–14. [Google Scholar] [CrossRef]
  34. Yu, Q.; Wei, W.; Pan, Z.; He, J.; Wang, S.; Hong, D. GPF-Net: Graph-polarized fusion network for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5508614. [Google Scholar] [CrossRef]
  35. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  36. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  37. Ye, M.; Ruiwen, N.; Chang, Z.; He, G.; Tianli, H.; Shijun, L.; Yu, S.; Tong, Z.; Ying, G. A lightweight model of VGG-16 for remote sensing image classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 6916–6922. [Google Scholar] [CrossRef]
  38. Patel, U.; Pathan, M.; Kathiria, P.; Patel, V. Crop type classification with hyperspectral images using deep learning: A transfer learning approach. Model. Earth Syst. Environ. 2023, 9, 1977–1987. [Google Scholar] [CrossRef]
  39. Belda, J.; Vergara, L.; Salazar, A.; Safont, G. Estimating the Laplacian matrix of Gaussian mixtures for signal processing on graphs. Signal Process. 2018, 148, 241–249. [Google Scholar] [CrossRef]
  40. Zhao, W.; Wu, D.; Liu, Y. Hyperspectral image classification with multi-scale graph convolution network. Int. J. Remote Sens. 2021, 42, 8380–8397. [Google Scholar] [CrossRef]
  41. Hamilton, W.; Ying, Z.; Leskovec, J. Inductive representation learning on large graphs. Adv. Neural Inf. Process. Syst. 2017, 30, 1025–1035. [Google Scholar]
42. Cui, Y.; Shao, C.; Luo, L.; Wang, L.; Gao, L.; Chen, L. Center Weighted Convolution and GraphSAGE Cooperative Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–16. [Google Scholar] [CrossRef]
  43. Sha, A.; Wang, B.; Wu, X.; Zhang, L. Semisupervised classification for hyperspectral images using graph attention networks. IEEE Geosci. Remote Sens. Lett. 2020, 18, 157–161. [Google Scholar] [CrossRef]
Figure 1. SFFNet framework. The framework comprises four main parts, distinguished by dotted boxes. The core components are the CNN feature extraction module and the GCN classification module: the CNN module extracts features from the original pixel blocks, and the GCN module connects the extracted features according to their similarity and performs classification. Different colors represent different nodes.
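As the Figure 1 caption notes, the GCN stage connects the CNN-extracted features by similarity rather than by spatial adjacency. The snippet below is a minimal sketch of one way to build such a graph, assuming cosine similarity and a fixed neighborhood size k; the function name build_knn_edge_index and both of those choices are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def build_knn_edge_index(features: torch.Tensor, k: int = 10) -> torch.Tensor:
    """Connect each feature vector (graph node) to its k most similar peers.

    features: (N, D) matrix of CNN-extracted spectral features.
    Returns a (2, N * k) edge_index tensor in PyTorch Geometric convention.
    """
    normed = torch.nn.functional.normalize(features, dim=1)
    sim = normed @ normed.t()                # pairwise cosine similarity
    sim.fill_diagonal_(-float("inf"))        # exclude self-loops
    neighbors = sim.topk(k, dim=1).indices   # (N, k) indices of most similar nodes
    src = torch.arange(features.size(0)).repeat_interleave(k)
    dst = neighbors.reshape(-1)
    return torch.stack([src, dst], dim=0)
```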
Figure 2. SFFNet network structure.
Figure 3. Schematic diagram of the CNN feature extraction module.
Figure 4. Schematic diagram of the GCN classification module. Different colors represent different nodes.
Figure 5. False-color image and ground truth of the Indian Pines dataset.
Figure 6. False-color image and ground truth of the Pavia University scene dataset.
Figure 7. False-color image and ground truth of the Houston 2013 dataset.
Figure 8. Classification results on the Indian Pines dataset.
Figure 9. Classification results on the Pavia dataset.
Figure 10. Classification results on the Houston dataset.
Figure 11. Training loss of the model on the three datasets.
Figure 12. Overall accuracy (OA, %) of the model on the three datasets.
Table 1. Network framework of SFFNet.
Net                   | CNN                        | GraphSAGE
Input Dimension       | H × W × C                  | -
                      | Random sampling            |
Feature Extraction    | Conv3-5 × 5 × 256          | -
                      | Conv3-3 × 3 × 512          | -
                      | Conv3-1 × 1 × 1024         | -
                      | Build Graph Data Structure |
Image Classification  | -                          | SAGEConv-512
                      |                            | BN1d-512
                      | -                          | SAGEConv-256
                      |                            | BN1d-256
Multilayer Perceptron | FC-1024                    |
                      | ReLU                       |
                      | Dropout = 0.5              |
                      | FC-1024                    |
                      | ReLU                       |
                      | Dropout = 0.5              |
                      | FC-Number of categories    |
Output Dimension      | C                          |
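To make the layer listing in Table 1 concrete, the following is a minimal PyTorch / PyTorch Geometric sketch of a network with the same layer sizes. It reads the Conv entries as 2D convolutions with 5 × 5, 3 × 3, and 1 × 1 kernels and 256/512/1024 channels, pools each pixel block to a single 1024-dimensional node feature, and assumes the similarity graph (edge_index) is built externally, for example with the k-nearest-neighbor sketch given after Figure 1; padding, pooling, and the class name SFFNetSketch are assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import SAGEConv

class SFFNetSketch(nn.Module):
    """Rough layer-by-layer reading of Table 1; hyperparameters not listed
    there (padding, pooling, dropout placement) are assumptions."""

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        # Stage 1: CNN feature extraction on pixel blocks of shape C x H x W.
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(256, 512, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(512, 1024, kernel_size=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # one 1024-d feature vector per block
        )
        # Stage 2: GraphSAGE classification over the feature graph.
        self.sage1 = SAGEConv(1024, 512)
        self.bn1 = nn.BatchNorm1d(512)
        self.sage2 = SAGEConv(512, 256)
        self.bn2 = nn.BatchNorm1d(256)
        # Multilayer perceptron head.
        self.mlp = nn.Sequential(
            nn.Linear(256, 1024), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(1024, 1024), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(1024, num_classes),
        )

    def forward(self, blocks: torch.Tensor, edge_index: torch.Tensor):
        # blocks: (N, C, H, W) pixel blocks; edge_index: (2, E) similarity graph.
        x = self.cnn(blocks).flatten(1)                     # (N, 1024) node features
        x = torch.relu(self.bn1(self.sage1(x, edge_index)))
        x = torch.relu(self.bn2(self.sage2(x, edge_index)))
        return self.mlp(x)                                  # (N, num_classes) logits
```

A forward pass takes a batch of pixel blocks together with the corresponding edge_index and returns per-node class logits.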
Table 2. Feature types and sample sizes of the Indian Pines dataset.
Label | Class | Samples
1 | Alfalfa | 46
2 | Corn-notill | 1428
3 | Corn-mintill | 830
4 | Corn | 237
5 | Grass-pasture | 483
6 | Grass-trees | 730
7 | Grass-pasture-mowed | 28
8 | Hay-windrowed | 478
9 | Oats | 20
10 | Soybean-notill | 972
11 | Soybean-mintill | 2455
12 | Soybean-clean | 593
13 | Wheat | 205
14 | Woods | 1265
15 | Buildings-Grass-Trees-Drives | 386
16 | Stone-Steel-Towers | 93
Total | - | 10,249
Table 3. Feature types and sample sizes of the Pavia University scene dataset.
Label | Class | Samples
1 | Asphalt | 6631
2 | Meadows | 18,649
3 | Gravel | 2099
4 | Trees | 3064
5 | Painted metal sheets | 1345
6 | Bare Soil | 5029
7 | Bitumen | 1330
8 | Self-Blocking Bricks | 3682
9 | Shadows | 947
Total | - | 42,776
Table 4. Feature types and sample sizes of the Houston 2013 dataset.
Label | Class | Samples
1 | Healthy grass | 1251
2 | Stressed grass | 1254
3 | Synthetic grass | 697
4 | Trees | 1244
5 | Soil | 1242
6 | Water | 325
7 | Residential | 1268
8 | Commercial | 1244
9 | Road | 1252
10 | Highway | 1227
11 | Railway | 1235
12 | Parking Lot 1 | 1233
13 | Parking Lot 2 | 469
14 | Tennis Court | 428
15 | Running Track | 660
Total | - | 15,029
Table 5. Accuracy (%) of the different methods on the Indian Pines dataset.
Label | Class | 3D-CNN | WFCG | MDGCN | AMGCFN | SFFNet
1 | Alfalfa | 100.00 ± 0.00 | 88.89 ± 1.45 | 77.78 ± 1.32 | 69.20 ± 0.18 | 88.89 ± 1.11
2 | Corn-notill | 95.41 ± 4.89 | 96.39 ± 4.62 | 95.41 ± 1.04 | 90.68 ± 0.03 | 99.67 ± 0.03
3 | Corn-mintill | 99.38 ± 3.66 | 83.75 ± 2.01 | 67.50 ± 0.21 | 90.82 ± 0.07 | 100.00 ± 0.00
4 | Corn | 100.00 ± 0.00 | 87.23 ± 2.55 | 91.49 ± 0.14 | 89.71 ± 0.09 | 95.74 ± 2.01
5 | Grass-pasture | 100.00 ± 0.00 | 98.75 ± 3.68 | 100.00 ± 0.00 | 85.78 ± 0.05 | 100.00 ± 0.00
6 | Grass-trees | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 | 96.34 ± 0.03 | 100.00 ± 0.00
7 | Grass-pasture-mowed | 80.00 ± 9.02 | 100.00 ± 0.00 | 80.00 ± 2.35 | 88.22 ± 0.11 | 100.00 ± 0.00
8 | Hay-windrowed | 100.00 ± 0.00 | 100.00 ± 0.00 | 95.92 ± 0.03 | 99.62 ± 0.00 | 100.00 ± 0.00
9 | Oats | 100.00 ± 0.00 | 100.00 ± 0.00 | 85.71 ± 1.65 | 99.47 ± 0.02 | 100.00 ± 0.00
10 | Soybean-notill | 87.43 ± 1.62 | 100.00 ± 0.00 | 100.00 ± 0.00 | 89.37 ± 0.05 | 100.00 ± 0.00
11 | Soybean-mintill | 51.85 ± 3.41 | 98.91 ± 3.16 | 99.13 ± 1.32 | 95.01 ± 0.03 | 100.00 ± 0.00
12 | Soybean-clean | 86.55 ± 2.34 | 68.07 ± 2.34 | 89.08 ± 2.27 | 83.16 ± 0.05 | 94.96 ± 1.72
13 | Wheat | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 | 93.63 ± 0.07 | 100.00 ± 0.00
14 | Woods | 81.30 ± 2.54 | 100.00 ± 0.00 | 100.00 ± 0.00 | 98.29 ± 0.01 | 100.00 ± 0.00
15 | Buildings-Grass-Trees-Drives | 91.80 ± 4.02 | 100.00 ± 0.00 | 86.89 ± 0.03 | 92.01 ± 0.07 | 100.00 ± 0.00
16 | Stone-Steel-Towers | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 | 88.74 ± 0.13 | 100.00 ± 0.00
OA |  | 83.55 ± 1.65 | 95.55 ± 0.63 | 94.79 ± 0.42 | 92.68 ± 0.01 | 99.49 ± 0.41
AA |  | 92.11 ± 1.98 | 95.12 ± 0.41 | 91.81 ± 0.25 | 90.63 ± 0.02 | 98.70 ± 0.10
Kappa |  | 81.52 ± 1.86 | 94.91 ± 0.88 | 94.03 ± 0.54 | 91.67 ± 0.01 | 99.42 ± 0.05
Table 6. Accuracy (%) of the different methods on the Pavia dataset.
Label | Class | 3D-CNN | WFCG | MDGCN | AMGCFN | SFFNet
1 | Asphalt | 100.00 ± 0.00 | 100.00 ± 0.00 | 98.74 ± 0.56 | 95.38 ± 0.02 | 100.00 ± 0.00
2 | Meadows | 88.54 ± 2.11 | 99.21 ± 2.03 | 100.00 ± 0.00 | 99.56 ± 0.01 | 99.32 ± 0.55
3 | Gravel | 98.99 ± 2.06 | 100.00 ± 0.00 | 100.00 ± 0.00 | 79.16 ± 0.20 | 100.00 ± 0.00
4 | Trees | 96.94 ± 2.25 | 100.00 ± 0.00 | 99.83 ± 1.69 | 86.40 ± 0.06 | 100.00 ± 0.00
5 | Painted metal sheets | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 | 98.35 ± 0.42 | 100.00 ± 0.00
6 | Bare Soil | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 | 98.36 ± 0.03 | 100.00 ± 0.00
7 | Bitumen | 53.08 ± 0.86 | 62.31 ± 2.68 | 24.62 ± 1.24 | 91.27 ± 0.07 | 100.00 ± 0.00
8 | Self-Blocking Bricks | 67.72 ± 1.88 | 61.38 ± 2.89 | 96.28 ± 1.42 | 96.02 ± 0.03 | 98.48 ± 0.14
9 | Shadows | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 | 70.96 ± 0.13 | 100.00 ± 0.00
OA |  | 89.96 ± 2.56 | 94.54 ± 1.22 | 96.55 ± 0.33 | 95.62 ± 0.01 | 99.04 ± 0.43
AA |  | 89.47 ± 1.38 | 91.43 ± 1.23 | 91.05 ± 0.49 | 90.66 ± 0.02 | 99.76 ± 0.20
Kappa |  | 87.60 ± 2.87 | 93.47 ± 1.32 | 96.12 ± 0.32 | 94.18 ± 0.02 | 99.44 ± 0.53
Table 7. Accuracy (%) of the different methods on the Houston dataset.
Label | Class | 3D-CNN | WFCG | MDGCN | AMGCFN | SFFNet
1 | Healthy grass | 95.12 ± 0.82 | 99.19 ± 4.03 | 99.59 ± 2.89 | 79.20 ± 0.15 | 100.00 ± 0.00
2 | Stressed grass | 99.61 ± 0.64 | 85.49 ± 3.56 | 100.00 ± 0.00 | 60.50 ± 0.14 | 100.00 ± 0.00
3 | Synthetic grass | 99.30 ± 0.58 | 100.00 ± 0.00 | 100.00 ± 0.00 | 96.75 ± 0.02 | 100.00 ± 0.00
4 | Trees | 94.94 ± 0.58 | 100.00 ± 0.00 | 99.22 ± 2.47 | 67.15 ± 0.23 | 100.00 ± 0.00
5 | Soil | 96.34 ± 0.83 | 100.00 ± 0.00 | 82.11 ± 1.69 | 94.50 ± 0.07 | 100.00 ± 0.00
6 | Water | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 | 42.75 ± 0.25 | 100.00 ± 0.00
7 | Residential | 96.58 ± 5.03 | 100.00 ± 0.00 | 97.01 ± 3.08 | 56.51 ± 0.12 | 100.00 ± 0.00
8 | Commercial | 95.31 ± 3.78 | 99.22 ± 5.41 | 96.88 ± 4.57 | 35.14 ± 0.14 | 99.61 ± 0.04
9 | Road | 65.89 ± 1.83 | 85.05 ± 2.16 | 93.46 ± 3.44 | 65.01 ± 0.12 | 99.53 ± 0.07
10 | Highway | 78.84 ± 1.68 | 91.29 ± 5.22 | 97.51 ± 3.57 | 61.10 ± 0.18 | 100.00 ± 0.00
11 | Railway | 79.51 ± 3.02 | 98.77 ± 3.01 | 91.39 ± 2.68 | 61.07 ± 0.11 | 99.59 ± 0.01
12 | Parking Lot 1 | 90.73 ± 4.92 | 96.77 ± 5.18 | 89.52 ± 2.01 | 60.04 ± 0.12 | 99.60 ± 0.02
13 | Parking Lot 2 | 93.02 ± 4.23 | 96.51 ± 4.06 | 95.35 ± 3.21 | 65.25 ± 0.31 | 98.84 ± 0.02
14 | Tennis Court | 86.49 ± 0.43 | 100.00 ± 0.00 | 98.65 ± 2.51 | 90.34 ± 0.12 | 100.00 ± 0.00
15 | Running Track | 87.02 ± 0.43 | 98.47 ± 1.46 | 100.00 ± 0.00 | 87.20 ± 0.20 | 100.00 ± 0.00
OA |  | 88.45 ± 1.01 | 94.31 ± 1.13 | 93.51 ± 0.84 | 66.88 ± 0.02 | 97.80 ± 0.04
AA |  | 90.58 ± 1.21 | 96.72 ± 0.91 | 96.05 ± 0.71 | 68.17 ± 0.03 | 99.81 ± 0.03
Kappa |  | 89.49 ± 1.18 | 95.96 ± 1.03 | 95.08 ± 0.88 | 64.21 ± 0.03 | 99.82 ± 0.03
Table 8. Accuracy (%) of the single models (CNN, GCN) and the combined SFFNet on the three datasets.
Model | IP OA% | IP AA% | IP Kappa% | PU OA% | PU AA% | PU Kappa% | HU OA% | HU AA% | HU Kappa%
SFFNet | 94.44 | 93.89 | 94.41 | 99.76 | 99.95 | 99.97 | 97.87 | 99.87 | 99.89
CNN | 89.22 | 95.28 | 90.71 | 99.16 | 99.71 | 99.60 | 96.71 | 98.65 | 98.60
GCN | 80.03 | 81.74 | 77.48 | 95.83 | 95.14 | 94.81 | 94.61 | 96.53 | 96.29
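The overall accuracy (OA), average accuracy (AA), and kappa coefficient reported in Tables 5–8 are standard confusion-matrix statistics. The following is a minimal sketch of how they can be computed; the function name classification_metrics is illustrative.

```python
import numpy as np

def classification_metrics(y_true: np.ndarray, y_pred: np.ndarray, num_classes: int):
    """Return overall accuracy (OA), average accuracy (AA) and Cohen's kappa."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                                        # build confusion matrix
    total = cm.sum()
    oa = np.trace(cm) / total                                # overall accuracy
    per_class = np.diag(cm) / cm.sum(axis=1).clip(min=1)     # per-class accuracy
    aa = per_class.mean()                                    # average accuracy
    expected = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total**2
    kappa = (oa - expected) / (1.0 - expected)                # chance-corrected agreement
    return oa, aa, kappa
```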
