Article

LightSeg: Local Spatial Perception Convolution for Real-Time Semantic Segmentation

1 School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
2 Guangxi Key Laboratory of Image and Graphic Intelligent Processing, Guilin University of Electronic Technology, Guilin 541004, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2023, 13(14), 8130; https://doi.org/10.3390/app13148130
Submission received: 13 June 2023 / Revised: 5 July 2023 / Accepted: 11 July 2023 / Published: 12 July 2023
(This article belongs to the Special Issue Application of Machine Vision and Deep Learning Technology)

Abstract

Semantic segmentation is increasingly being applied on mobile devices due to advancements in mobile chipsets, particularly in low-power scenarios. However, the lightweight designs required on mobile devices limit the receptive field, which is crucial for dense prediction problems. Existing approaches balance lightweight design and high accuracy by downsampling features in the backbone, but this downsampling may discard local details at each network stage. To address this challenge, this paper presents local spatial perception convolution (LSPConv), a compact and efficient convolutional module, and builds on it a real-time convolutional neural network (CNN), LightSeg. The effectiveness of our architecture is demonstrated on the Cityscapes dataset, where the model achieves an impressive balance between accuracy and inference speed. Specifically, LightSeg, without ImageNet pretraining, achieves an mIoU of 76.1 at 61 FPS on the Cityscapes validation set using an RTX 2080Ti GPU with mixed precision, and reaches 115.7 FPS on the Jetson NX with int8 precision.

1. Introduction

Unmanned systems, including autonomous driving and industrial robots, have experienced significant advancements in recent years. These systems heavily rely on sensors and intelligent algorithms to perform tasks such as object recognition and tracking. Semantic segmentation plays a crucial role in obtaining semantic information for such systems, as it assigns a label to each pixel on an image, representing its semantic class, such as “road”, “building”, or “person”. However, performing semantic segmentation on low-power devices poses challenges due to limited computing resources. To achieve real-time performance on these devices, efficient and compact deep-learning models are required to extract semantic information while keeping computational complexity low.
To address this challenge, researchers have proposed various approaches for lightweight neural networks, including lightweight design, model compression, and network architecture search (NAS).
Depth-wise and group convolutions are commonly used to reduce computational cost, but they introduce information disfluency: each output channel is derived from only a small part of the input channels. The MobileNet [1,2,3] series relies on depth-wise separable convolutions to reduce complexity, while the ShuffleNet [4,5] series adds channel shuffle to restore information flow across groups. However, channel shuffle disrupts memory access patterns and introduces additional latency.
In addition to computational efficiency, contextual information is crucial for image segmentation. Recent studies have introduced self-attention mechanisms, such as non-local [6] and DANet [7,8], to capture local, multi-scale, and global contexts. Inspired by ResNeXt [9], RegSeg [10] employs a D block with dilated convolutions to increase the receptive field without sacrificing local information. However, dilated convolutions can introduce drawbacks such as the “Gridding effect” [11,12], which compromises pixel-level dense prediction tasks.
To overcome these challenges, this study proposes a novel block called "local spatial perception convolution" (LSPConv), which uses spatially separable convolutions to decompose a large kernel into the outer product of two vectors, while a standard convolution extracts local features in parallel. LSPConv reduces model complexity and parameter count, achieving a 26.7% reduction in parameters and a speed of up to 61 FPS on an RTX 2080Ti GPU (Graphics Processing Unit).
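To make the parameter saving of this decomposition concrete, the following minimal sketch compares a dense k × k convolution with its spatially separable k × 1 / 1 × k factorization in PyTorch. It is an illustration only, not code from the paper; the channel count of 128 and the kernel size of 11 are chosen purely as examples.

```python
# Compare parameter counts: dense k x k convolution vs. k x 1 followed by 1 x k.
import torch.nn as nn

def count_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

channels, k = 128, 11
dense = nn.Conv2d(channels, channels, kernel_size=k, padding=k // 2, bias=False)
separable = nn.Sequential(
    nn.Conv2d(channels, channels, kernel_size=(k, 1), padding=(k // 2, 0), bias=False),
    nn.Conv2d(channels, channels, kernel_size=(1, k), padding=(0, k // 2), bias=False),
)
# Roughly k^2 * C^2 weights vs. 2 * k * C^2 weights.
print(count_params(dense), count_params(separable))
```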
Based on LSPConv, an LSPBlock is introduced to enhance the receptive field while reducing computation. Leveraging these modules, a real-time semantic segmentation model called LightSeg is proposed. LightSeg performs well across devices and settings: it reaches 115.7 FPS on embedded devices with int8 inference, demonstrating its efficiency in resource-constrained environments, and 61 FPS on the RTX 2080Ti with mixed precision, even at a larger input resolution of 1024 × 2048. These results highlight the optimized inference capabilities of LightSeg across embedded devices and precision modes.
In summary, this study contributes in three key areas:
  • It proposes the novel local spatial perception convolution (LSPConv) to extract both global and local features.
  • It introduces the LSP Block, based on LSPConv, to enhance the receptive field while reducing computation.
  • It presents LightSeg, a real-time semantic segmentation model, achieving high inference speeds on embedded devices and GPUs, demonstrating its efficiency and versatility.

2. Related Works

2.1. Semantic Segmentation

Semantic segmentation has been a focus of research in computer vision, and several models have contributed significantly to this field.
The fully convolutional network (FCN) [13] was a pioneering network that extended end-to-end convolutional networks to semantic segmentation tasks. It introduced a deconvolutional layer for upsampling, enabling the network to recover spatial information lost during downsampling. Another influential model, U-Net [14], built upon the FCN architecture by incorporating skip connections to combine high-level semantic information with coarse and fine surface details. While FCN fuses features by element-wise addition of corresponding pixel values, U-Net concatenates them along the channel dimension.
The Deeplab series, including Deeplab [15], DeeplabV2 [16], and DeeplabV3 [17], has made significant contributions to semantic segmentation. These models introduced atrous (dilated) convolution, which increases the field of view without sacrificing spatial resolution. Deeplab models also incorporate Atrous spatial pyramid pooling (ASPP) [16] to extract multi-scale features using multiple dilation rates. Additionally, they leverage fully connected conditional random fields (CRFs) [15] to refine segmentation results and improve boundary smoothness.
However, recent work has examined limitations inherent in CNN methods, which primarily rely on local information processing. To overcome these limitations, researchers have started to incorporate Transformer schemes such as ViT [18], SETR [19], TransUNet [20], and SegFormer [21]. These models tokenize the input image, enabling the exploitation of global information for better semantic segmentation performance.

2.2. Real-Time Semantic Segmentation

Real-time semantic segmentation requires efficient models that can process images quickly without sacrificing accuracy. Researchers have explored different strategies to achieve real-time performance.
To avoid introducing additional branches for obtaining contextual information and expanding the field of perception, some models have adopted skip connections [22] to preserve image information. Additionally, lightweight segmentation networks based on transformer structures have introduced new ideas for real-time semantic segmentation. For example, EdgeNeXt [23] and Mobile-Former [24] propose novel architectures that leverage transformer principles while maintaining computational efficiency.
In contrast, the RegSeg [10] model proposes the D block to enhance contextual information and increase the receptive field. These modules enable real-time semantic segmentation by avoiding additional computational overhead.

3. Methods

3.1. Overall Architecture

The overall architecture of LightSeg is depicted in Figure 1, which follows a standard segmentation network design consisting of an encoder and a decoder. The encoder module is responsible for extracting meaningful semantic features from the input image, capturing essential information about the image’s content and context. These extracted features are then forwarded to the decoder module, which utilizes them to make a pixel-wise classification for the entire image by effectively leveraging the rich semantic information encoded in the extracted features.
Our network architecture starts with a stem module, which consists of a 3 × 3 convolution. This module performs downsampling and reduces the size of the input feature map. Extensive experiments and analysis have shown that this downsampling has little impact on overall accuracy [25] because the network effectively captures and retains essential information for accurate predictions, while the lower spatial resolution improves inference speed. Subsequently, multi-level features are extracted at 1/4, 1/8, and 1/16 of the original resolution. These features are fused by the UpHead module into a unified representation. Finally, a 3 × 3 convolutional layer predicts the segmentation mask at 1/4 resolution. Note that the decoder outputs C channels, one per segmentation class.
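The following minimal PyTorch sketch illustrates this overall flow under simplifying assumptions: the stage bodies are reduced to single convolutions and the channel widths are placeholders. It is a structural outline, not the authors' implementation.

```python
# Structural sketch of the encoder-decoder flow: stem, multi-level features, fusion, 1/4-resolution prediction.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightSegSketch(nn.Module):
    def __init__(self, num_classes: int, widths=(48, 128, 256)):
        super().__init__()
        # Stem: a 3x3 convolution that halves the spatial resolution.
        self.stem = nn.Sequential(
            nn.Conv2d(3, widths[0], 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(widths[0]), nn.ReLU(inplace=True))
        # Stages producing 1/4, 1/8, and 1/16 resolution features (block details omitted).
        self.stage1 = nn.Conv2d(widths[0], widths[0], 3, stride=2, padding=1)
        self.stage2 = nn.Conv2d(widths[0], widths[1], 3, stride=2, padding=1)
        self.stage3 = nn.Conv2d(widths[1], widths[2], 3, stride=2, padding=1)
        # Decoder: fuse multi-level features, then predict C classes at 1/4 resolution.
        self.fuse = nn.Conv2d(sum(widths), 64, 1)
        self.head = nn.Conv2d(64, num_classes, 3, padding=1)

    def forward(self, x):
        f2 = self.stage1(self.stem(x))   # 1/4 resolution
        f3 = self.stage2(f2)             # 1/8 resolution
        f4 = self.stage3(f3)             # 1/16 resolution
        size = f2.shape[-2:]
        fused = torch.cat([
            f2,
            F.interpolate(f3, size=size, mode="bilinear", align_corners=False),
            F.interpolate(f4, size=size, mode="bilinear", align_corners=False)], dim=1)
        return self.head(self.fuse(fused))  # logits at 1/4 resolution
```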

3.2. Local Spatial Perception Convolution

This paper introduces a novel neural convolution architecture called “Local Spatial Perception Convolution (LSPConv)” that integrates local perception and long-range dependency capturing. Figure 2 illustrates the basic structure of LSPConv, highlighting its distinctive features compared to ordinary convolutions.
The LSPConv module aims to extract more informative features from the input feature map by applying different types of convolutions to different parts of the input channels. Specifically, the input feature map is divided into two channel parts, $f_a$ and $f_b$, which are processed separately using different convolutional layers. The mathematical expression is shown in Equation (1). Figure 3 provides a more visual demonstration of this process.
$$
\begin{aligned}
f_a^{out} &= W_a f_a + b_a \\
f_b^{out} &= W_{b,1} (W_{b,2} f_b) + b_b \\
f^{out} &= \sigma\left(\mathrm{Concat}\left(f_a^{out}, f_b^{out}\right)\right)
\end{aligned}
\qquad (1)
$$
For $f_a$, a standard convolutional layer with weight parameters $W_a$ and bias parameters $b_a$ is applied. This operation produces the output feature map $f_a^{out}$.
For $f_b$, a spatially separable convolution is used: an $m \times m$ kernel is factorized into an $m \times 1$ and a $1 \times m$ convolution applied in sequence, represented by the weight parameters $W_{b,1}$ and $W_{b,2}$, with each filter applied channel-wise to the input feature map $f_b$. The output feature map of this operation is denoted as $f_b^{out}$.
Finally, the output feature maps $f_a^{out}$ and $f_b^{out}$ are concatenated along the channel axis using the operator $\mathrm{Concat}$ and passed through the nonlinear activation $\sigma$ to form the final output feature map $f^{out}$.
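A minimal PyTorch sketch of this split-transform-concatenate scheme is given below. The 3 × 3 local kernel, the depthwise grouping of the separable branch, and the ReLU activation are illustrative assumptions rather than the exact configuration used in LightSeg.

```python
# Sketch of LSPConv following Equation (1): split channels, local conv on one half,
# spatially separable conv on the other, then concatenate and activate.
import torch
import torch.nn as nn

class LSPConvSketch(nn.Module):
    def __init__(self, channels: int, m: int = 5):
        super().__init__()
        half = channels // 2
        # f_a branch: standard local convolution (W_a, b_a).
        self.local = nn.Conv2d(half, half, kernel_size=3, padding=1)
        # f_b branch: spatially separable convolution (W_b,1 applied after W_b,2), channel-wise.
        self.spatial = nn.Sequential(
            nn.Conv2d(half, half, kernel_size=(m, 1), padding=(m // 2, 0), groups=half, bias=False),
            nn.Conv2d(half, half, kernel_size=(1, m), padding=(0, m // 2), groups=half))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        f_a, f_b = torch.chunk(x, 2, dim=1)                     # split the input channels
        out = torch.cat([self.local(f_a), self.spatial(f_b)], dim=1)
        return self.act(out)                                    # sigma(Concat(f_a_out, f_b_out))
```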

3.3. Local Spatial Perception Block

A new network block named LSPBlock was designed based on LSPConv. Figure 4b illustrates the difference between LSPBlock and DBlock [10]. LSPBlock keeps the number of input and output channels equal. First, it applies a 5 × 5 standard convolution to extract features from the input. The features with w channels are then split into two groups of w/2 channels each: a standard convolution is applied to one group, while the other undergoes spatially separable convolution. The outputs of the two groups are concatenated, and a pointwise convolution fuses the channel features. Notably, the LSPBlock module is not used in the early stages of the network, where other modules are adopted instead. Additionally, as indicated by the parameter g in LightSeg, group convolutions are used to reduce the number of parameters and computational complexity.
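The following sketch illustrates the LSPBlock sequence described above (5 × 5 convolution, channel split, standard and spatially separable branches, concatenation, pointwise fusion). The group size, the omission of normalization layers, and the residual connection are assumptions made only to keep the example compact.

```python
# Sketch of an LSPBlock: 5x5 conv -> channel split -> (standard conv | spatially separable conv)
# -> concat -> 1x1 pointwise fusion.
import torch
import torch.nn as nn

class LSPBlockSketch(nn.Module):
    def __init__(self, channels: int, m: int = 7, groups: int = 4):
        super().__init__()
        half = channels // 2
        self.conv5x5 = nn.Conv2d(channels, channels, 5, padding=2, groups=groups, bias=False)
        self.branch_std = nn.Conv2d(half, half, 3, padding=1, groups=groups, bias=False)
        self.branch_sep = nn.Sequential(
            nn.Conv2d(half, half, (m, 1), padding=(m // 2, 0), groups=groups, bias=False),
            nn.Conv2d(half, half, (1, m), padding=(0, m // 2), groups=groups, bias=False))
        self.fuse = nn.Conv2d(channels, channels, 1)   # pointwise fusion of channel features
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.conv5x5(x)
        a, b = torch.chunk(y, 2, dim=1)                # split into two groups of w/2 channels
        y = torch.cat([self.branch_std(a), self.branch_sep(b)], dim=1)
        return self.act(self.fuse(y) + x)              # residual connection assumed for illustration
```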
Assuming the input feature X has C channels and the configuration of spatially separable convolution size is m, the number of parameters P in the LSP block can be calculated as follows when the stride is 1:
$$
\begin{aligned}
P_{5\times5} &= 5^2 \times C \\
P_{\mathrm{SeparableConv}} &= 3^2 C^2 + 2 m C^2 \\
P_{1\times1} &= C \times C \\
P &= P_{5\times5} + P_{\mathrm{SeparableConv}} + P_{1\times1}
\end{aligned}
\qquad (2)
$$
where $P_{5\times5}$ represents the number of parameters in the 5 × 5 convolutional layer, $P_{\mathrm{SeparableConv}}$ the number of parameters in the spatially separable convolutional layer with kernel size m, $P_{1\times1}$ the number of parameters in the 1 × 1 convolutional layer, and P the total number of parameters in the LSP block.
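As a sanity check, Equation (2) can be evaluated directly. The helper below simply transcribes the formula as written, with C the channel count and m the spatially separable kernel size; the example values are illustrative only.

```python
def lsp_block_params(C: int, m: int) -> int:
    """Evaluate Equation (2) for the parameter count of an LSP block (stride 1)."""
    p_5x5 = 5 ** 2 * C                              # 5 x 5 convolution term
    p_separable = 3 ** 2 * C ** 2 + 2 * m * C ** 2  # standard 3 x 3 + spatially separable term
    p_1x1 = C * C                                   # pointwise fusion term
    return p_5x5 + p_separable + p_1x1

print(lsp_block_params(C=64, m=7))                  # example: C = 64 channels, m = 7
```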
A typical LightSeg Encoder is shown in Table 1.
In the initial part of the network, i.e., the stem and the early blocks of stage 4, standard convolution was used to extract feature information from the input.
Downsampling is performed at the beginning of each stage by applying the block with stride 2 (s = 2). The network is structured as shown in Figure 5, and features are downsampled at the first standard 5 × 5 convolution. These steps reduce the image resolution as quickly as possible and increase the network speed.
Table 2 compares LSPBlock and DBlock [10] in terms of parameter count and inference speed. The kernel shape denotes the specific kernel configurations of the LSPBlock and DBlock. Note that the DBlock only modifies the dilation rate of its 3 × 3 kernel without altering its size, so a single row is dedicated to the DBlock in the table. The input channel count was set to 64, with the output channel count equal to the input. The input image size was 116. The results show that LSPBlock significantly reduces inference time for the same receptive field.

3.4. Hierarchical Feature Decoding and Fusion

The Up-Head approach was used to decode features from different levels of the encoder. Drawing on the experience of DeeplabV3+ [26], 1 × 1 convolutions were used to reduce the dimensionality of low-level features, minimizing the weakening of high-level features. Two sizes of Up-Head were designed to suit different scenarios. Specifically, the final 1/4, 1/8, and 1/16 features from the backbone were selected as inputs, and different convolutional operations and upsampling techniques were used to fuse them. For example, a [1 × 1, 128] convolution is first applied to the 1/16 feature for dimensionality reduction; the result is then upsampled and added to the 1/4 feature. A [3 × 3, 64] convolution is subsequently used for feature fusion, and the 1/4 feature is finally concatenated with the output as input to the SegHead, which typically consists of a 1 × 1 convolutional layer for final feature mapping. The UpHead structure is illustrated in Figure 6. This simple decoder provides a convenient parameter search paradigm and can be easily inserted into existing networks to search for optimal performance.
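A minimal sketch of this fusion path is shown below. The 1/8 branch is omitted, and an extra 1 × 1 projection is assumed so that the upsampled 1/16 feature can be added to the 1/4 feature; the channel widths follow the [1 × 1, 128] and [3 × 3, 64] values quoted above, but the module as a whole is a simplified illustration, not the authors' UpHead.

```python
# Simplified Up-Head fusion: reduce 1/16 feature, upsample, add to 1/4 feature, fuse, concat, SegHead.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpHeadSketch(nn.Module):
    def __init__(self, c4: int, c16: int, num_classes: int):
        super().__init__()
        self.reduce16 = nn.Conv2d(c16, 128, 1)          # [1x1, 128] on the 1/16 feature
        self.align4 = nn.Conv2d(c4, 128, 1)             # assumed projection so channels match for addition
        self.fuse = nn.Conv2d(128, 64, 3, padding=1)    # [3x3, 64] feature fusion
        self.seghead = nn.Conv2d(64 + c4, num_classes, 1)

    def forward(self, f4, f16):
        x = self.reduce16(f16)
        x = F.interpolate(x, size=f4.shape[-2:], mode="bilinear", align_corners=False)
        x = x + self.align4(f4)                         # upsample and add to the 1/4 feature
        x = self.fuse(x)
        x = torch.cat([x, f4], dim=1)                   # concatenate with the 1/4 feature
        return self.seghead(x)                          # final 1x1 mapping to class logits
```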

4. Experiments

4.1. Datasets

Cityscapes [27] is a large-scale street view dataset. The semantic segmentation dataset contains 2975 images for training, 500 images for validation, and 1525 images for testing. The objects in the images were divided into 19 categories, with an image size of 1024 × 2048 .
CamVid [28,29] is a dataset consisting of street scene images. It contains 367 images for training, 101 for validation, and 233 for testing. We specifically focus on using 11 predefined classes while assigning the ignore label = 255 to all other classes. During our training process, we train the model on the trainval set and evaluate its performance on the test set. The images in the CamVid dataset have a standardized size of 720 × 960 pixels.

4.2. Training Setup and Parameters

Our CNN was trained in the PyTorch 1.11.0 framework with CUDA 11.3 (Compute Unified Device Architecture), using a momentum of 0.9, an initial learning rate of 0.05, and a weight decay of 0.0001. The input images were cropped to 768 × 768, and the learning rate was adjusted with a poly strategy during training. Random resizing in the range [400, 1600] was applied to the Cityscapes images, which were then randomly cropped to 768 × 768 and horizontally flipped with 50% probability. Furthermore, a set of simplified RandAug [30] operations (auto-contrast, equalization, rotation, color, contrast, luminance, and sharpness) was applied. In addition, class-uniform sampling [31] was used for category-balanced sampling with a uniform percentage of 0.5, and the model was initialized using PyTorch's default initialization. Training ran for 1000 epochs with a batch size of 8 on a single NVIDIA RTX 2080Ti GPU. An NVIDIA Jetson Xavier NX with 21 TOPS was used to test the FPS in a real deployment.
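The optimizer and poly learning-rate schedule described above can be set up as follows. This is a generic sketch: the poly power of 0.9 and the placeholder model are assumptions, not values stated in the paper.

```python
# SGD with momentum 0.9, lr 0.05, weight decay 1e-4, and a per-iteration poly learning-rate decay.
import torch

model = torch.nn.Conv2d(3, 19, 1)        # placeholder for the LightSeg network
optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9, weight_decay=1e-4)

iters_per_epoch = 2975 // 8              # Cityscapes training images / batch size
max_iters = 1000 * iters_per_epoch
poly = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: (1.0 - it / max_iters) ** 0.9)  # poly power 0.9 assumed

for it in range(max_iters):
    # forward pass, loss computation, and loss.backward() would go here
    optimizer.step()
    poly.step()                          # decay the learning rate every iteration
```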
The following loss function Equation (3) was employed to supervise the network:
$$
\ell(x, y) = L = \{l_1, \dots, l_N\}, \qquad
l_n = -\log \frac{\exp\left(x_{n, y_n}\right)}{\sum_{c=1}^{C} \exp\left(x_{n, c}\right)} \cdot \mathbb{1}\{y_n \neq \mathrm{ignore\_index}\}
\qquad (3)
$$
Given an input x and its corresponding true label y, the loss L is computed as a vector of individual losses $l_n$, one per pixel n. Each individual loss $l_n$ is the negative logarithm of the softmax probability assigned by the model to the target class $y_n$: the output scores are exponentiated and normalized by the sum of exponentiated scores over all C classes. The loss term is only counted if the corresponding label $y_n$ is not marked as the ignore index.
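Equation (3) is the standard per-pixel cross-entropy with an ignore index, which is available directly in PyTorch. A minimal sketch follows, using an ignore label of 255 (as used for CamVid above) and arbitrary tensor sizes for illustration.

```python
# Per-pixel cross-entropy with ignored labels, as in Equation (3).
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(ignore_index=255)

logits = torch.randn(2, 19, 96, 96)              # N x C x H x W class scores
labels = torch.randint(0, 19, (2, 96, 96))       # N x H x W ground-truth class indices
labels[:, :8, :] = 255                           # ignored pixels contribute no loss
loss = criterion(logits, labels)
print(loss.item())
```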
For the ablation experiments, only 400 epochs were trained. Since no ImageNet pre-training was performed, all reported results were averaged over three runs.

4.3. LSPBlock Ablation Studies

Experiments on kernel sizes and the number of LSPConv blocks were conducted for the last 13 blocks under the same experimental settings. In the notation [m], m represents the LSPConv kernel size. The rate indicates the fraction of blocks in Stage 4 that use NormConv. For example, a rate of 0.3 indicates that NormConv was used for the first four (≈0.3 × 13) blocks in Stage 4 and LSPConv for the remaining nine blocks. The experimental results are displayed in Figure 7.
The results in Figure 7 show that the proportion of LSPConv has a significant effect on the performance of the network. For larger kernel sizes, using a higher proportion of LSPConv can lead to improved performance. Conversely, the improvement in performance may not be as significant for smaller kernel sizes. This indicates that the choice of LSPConv proportion is dependent on the kernel size and should be adjusted accordingly for optimal performance.

4.4. LSPBlock Location Ablation Studies

In addition, the effect of the position of LSPConv within the network on performance was investigated for different kernel sizes. The network was therefore tested with LSPConvs of different sizes placed at different depths. Figure 8 shows the experimental setup for the location ablation study, where $B_x$ represents the x-th block (from 1 to 7) and k represents the number of consecutive LSPBlocks, which was set to 5 in the experiments. Figure 9 shows the results of the location ablation study for LSPConv placement.
The results of the experiments showed that the position of LSPConv in the network affects the effectiveness of different kernel sizes. Specifically, deeper layers in the network may be more effective for larger kernel sizes, while shallower layers may be more effective for smaller kernel sizes.
This finding could be attributed to the receptive field of the convolution operation. As the receptive field of a convolutional layer increases with network depth, it becomes more capable of capturing more spatial information in the input image. Therefore, the deeper layers with a larger receptive field could be more effective for larger kernel sizes. On the other hand, shallower layers may be more effective for smaller kernel sizes since they can capture local features better than deeper layers.

4.5. Block Ablation Studies

The encoder based on spatially separable convolution was compared with RegSeg's encoder. To ensure fairness, RegSeg's decoder was used for both CNNs. The FPS and GFLOPs for each experiment were measured on an RTX 2080Ti. For 3 × 2048 × 1024 input images, the spatially separable convolution-based model reduced GFLOPs by about 14.14% and increased FPS by 52.36%, while mIoU decreased by about 2.56% compared to RegSeg (see Table 3).

4.6. Timing Setup and Parameters

An RTX 2080Ti GPU was used to time LightSeg in mixed-precision mode, with PyTorch 1.10.0 and CUDA 11.3. The input size was 1 × 3 × 1024 × 2048 on Cityscapes. Ten iterations were processed as a warm-up before starting the test, and inference times were averaged over the next 100 iterations.
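A minimal sketch of this timing protocol (mixed-precision inference, 10 warm-up iterations, then the mean latency over 100 timed iterations) is shown below. It is a generic benchmark helper rather than the exact script used by the authors.

```python
# Benchmark mixed-precision inference: warm up, then average latency over timed iterations.
import time
import torch

def benchmark(model, device="cuda", warmup=10, iters=100):
    model.eval().to(device)
    x = torch.randn(1, 3, 1024, 2048, device=device)   # 1 x 3 x 1024 x 2048 input
    with torch.no_grad(), torch.cuda.amp.autocast():
        for _ in range(warmup):                         # warm-up iterations (not timed)
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()
    latency = (time.time() - start) / iters
    return 1.0 / latency                                # frames per second
```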
The Jetson Xavier NX used PyTorch 1.11.0 with JetPack 5.0.2-b231 and was accelerated with TensorRT.
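One common deployment path matching this setup is to export the trained network to ONNX and build a TensorRT engine with trtexec. The sketch below is a generic recipe under assumed file names, not the authors' deployment script; int8 engines normally also require a calibration step.

```python
# Export a (placeholder) model to ONNX for TensorRT deployment on the Jetson.
import torch

model = torch.nn.Conv2d(3, 19, 1).eval()            # placeholder for the trained network
dummy = torch.randn(1, 3, 1024, 2048)
torch.onnx.export(model, dummy, "lightseg.onnx", opset_version=13,
                  input_names=["input"], output_names=["logits"])
# On the Jetson (example command):
#   trtexec --onnx=lightseg.onnx --int8 --saveEngine=lightseg.engine
```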

4.7. Comparison with State-of-the-Art Methods

In Table 4, our model is compared with other lightweight real-time semantic segmentation models, showing leading results in both speed and complexity. Here, "IM" indicates that the model was pre-trained on ImageNet. Notably, our model achieves a 16% faster inference speed and a nearly 56% reduction in complexity compared to the previous state-of-the-art model RegSeg, while maintaining a similar mIoU. Furthermore, when deployed with TensorRT on the Jetson NX, the model reaches 115.7 FPS, outperforming most of the results in Table 4. These improvements in speed and complexity can be attributed to the design of the LSPBlock and UpHead modules, which effectively reduce computational and memory costs while maintaining segmentation accuracy. Figure 10 displays the visual results of LightSeg and RegSeg under different scenarios.
As shown in Table 5, our proposed method, LightSeg, achieves an mIoU of 77.1% on the challenging CamVid test set while operating at an impressive 99.4 FPS. Compared to the previous state-of-the-art (SOTA) model DDRNet-23 [36], our method is about 5% faster, and compared to RegSeg it offers a substantial speed boost of 42%.

4.8. Effective Receptive Field Analysis

The receptive fields of RegSeg and LightSeg were visualized to demonstrate how our network effectively addresses the “Gridding Effect” [12] issue in RegSeg.
The visualization results are shown in Figure 11; darker colors indicate locations that contribute more strongly to the receptive field of the corresponding feature. Based on the visualization results, the following conclusions can be drawn:
  • The D Block in RegSeg effectively increases the receptive field, but also causes sparse gradients due to its dilated convolution structure, resulting in the “Gridding Effect”.
  • Both LightSeg and RegSeg exhibited relatively small ERFs in the early stages of the network.
  • In Stage 3, LightSeg showed a global receptive field while still maintaining a strong focus on local regions.

5. Conclusions

This paper introduces LSPConv, a novel convolutional neural network (CNN) module specifically designed for intelligent recognition and detection on mobile devices. LSPConv effectively addresses the challenge of limited receptive field in lightweight designs by employing spatial separable convolutions with large kernels in the encoder, while also capturing local information using local convolutions. The proposed LSPBlock further enhances the network’s receptive field without the drawbacks of dilated convolutions and reduces computational complexity through depthwise separable convolutions.
By performing spatial separable convolutions channel-wise and fusing features using point-wise convolutions, the proposed approach achieves real-time performance on low-power embedded platforms. This results in a 41% reduction in the number of parameters compared to standard convolution, while increasing inference speed by 30%, thereby improving efficiency for deployment on unmanned systems with limited computational resources.
Experiments on kernel size combinations reveal that incorporating large kernels in the initial stage using LSPConv does not significantly improve network performance. Therefore, the kernel size is gradually increased in the LSPBlock of stage 3.
Evaluation on the Cityscapes dataset demonstrates the effectiveness of our model: it achieves nearly a 13% speedup compared to DDRNet-23 [36] while sacrificing only 1.7% accuracy relative to RegSeg. These results highlight the efficiency and accuracy achieved by our LSPConv-based approach in real-time semantic segmentation on resource-constrained mobile devices.

Author Contributions

Conceptualization, X.L. and J.L.; methodology, X.L. and J.L.; software, J.L. and Z.G.; validation, J.L.; formal analysis, J.L.; investigation, X.L. and J.L.; resources, X.L. and Z.J.; data curation, J.L. and Z.G.; writing—original draft preparation, J.L.; writing—review and editing, J.L., X.L. and Z.J.; visualization, J.L. and Z.G.; supervision, Z.J.; project administration, Z.J.; funding acquisition, Z.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (61876049, 62172118) and Nature Science Key Foundation of Guangxi (2021GXNSFDA196002); in part by the Sichuan Regional Innovation Cooperation Project (2021YFQ0002); in part by the Guangxi Key Laboratory of Image and Graphic Intelligent Processing under Grants (GIIP2004) and Student’s Platform for Innovation and Entrepreneurship Training Program under Grant (S202210595177, 202210595029).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

For the Cityscapes dataset, the data supporting the reported results can be found at the following link: https://www.cityscapes-dataset.com/ (accessed on 12 May 2023). The dataset is publicly available for research purposes and can be downloaded upon registration on the website. For the CamVid dataset, the data supporting the reported results can be found at the following link: http://mi.eng.cam.ac.uk/research/projects/VideoRec/CamVid/ (accessed on 20 June 2023). The dataset is publicly available for research purposes and can be downloaded upon registration on the website.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
FCN: Fully Convolutional Networks
CNN: Convolutional Neural Networks
LSPConv: Local Spatial Perception Convolution
LSPBlock: Local Spatial Perception Block
CUDA: Compute Unified Device Architecture (a parallel computing platform by NVIDIA)
GPU: Graphics Processing Unit
FPS: Frames Per Second (the number of frames processed or displayed per second)
GFlops: Gigaflops (the number of billions of floating-point operations per second)
mIoU: Mean Intersection over Union (a metric used to evaluate the accuracy of segmentation models)
TensorRT: Tensor Runtime (an optimization and inference engine for deep learning models)

References

  1. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  2. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  3. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, South Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  4. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6848–6856. [Google Scholar]
  5. Yan, M.; Xiong, R.; Shen, Y.; Jin, C.; Wang, Y. Intelligent generation of Peking opera facial masks with deep learning frameworks. Herit. Sci. 2023, 11, 20. [Google Scholar] [CrossRef]
  6. Zhu, Z.; Xu, M.; Bai, S.; Huang, T.; Bai, X. Asymmetric non-local neural networks for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, South Korea, 27 October–2 November 2019; pp. 593–602. [Google Scholar]
  7. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3146–3154. [Google Scholar]
  8. Yan, M.; Lou, X.; Chan, C.A.; Wang, Y.; Jiang, W. A semantic and emotion-based dual latent variable generation model for a dialogue system. CAAI Trans. Intell. Technol. 2023, 8, 319–330. [Google Scholar] [CrossRef]
  9. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500. [Google Scholar]
  10. Gao, R. Rethinking Dilated Convolution for Real-Time Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Vancouver, BC, Canada, 18–22 June 2023; pp. 4674–4683. [Google Scholar]
  11. Yu, F.; Koltun, V.; Funkhouser, T. Dilated residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 472–480. [Google Scholar]
  12. Wang, P.; Chen, P.; Yuan, Y.; Liu, D.; Huang, Z.; Hou, X.; Cottrell, G. Understanding convolution for semantic segmentation. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 1451–1460. [Google Scholar]
  13. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  14. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Proceedings of the 18th International Conference, Munich, Germany, 5–9 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  15. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv 2014, arXiv:1412.7062. [Google Scholar]
  16. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  17. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  18. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  19. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.; et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 6881–6890. [Google Scholar]
  20. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
  21. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  22. Jia, S. LRD-SLAM: A Lightweight Robust Dynamic SLAM Method by Semantic Segmentation Network. Wirel. Commun. Mob. Comput. 2022, 2022, 7332390. [Google Scholar] [CrossRef]
  23. Maaz, M.; Shaker, A.; Cholakkal, H.; Khan, S.; Zamir, S.W.; Anwer, R.M.; Shahbaz Khan, F. Edgenext: Efficiently amalgamated cnn-transformer architecture for mobile vision applications. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 3–20. [Google Scholar]
  24. Chen, Y.; Dai, X.; Chen, D.; Liu, M.; Dong, X.; Yuan, L.; Liu, Z. Mobile-former: Bridging mobilenet and transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 July 2022; pp. 5270–5279. [Google Scholar]
  25. Zhao, P.; Haitao, H.; Li, A.; Mansourian, A. Impact of data processing on deriving micro-mobility patterns from vehicle availability data. Transp. Res. Part Transp. Environ. 2021, 97, 102913. [Google Scholar] [CrossRef]
  26. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  27. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  28. Brostow, G.J.; Shotton, J.; Fauqueur, J.; Cipolla, R. Segmentation and Recognition Using Structure from Motion Point Clouds. In Proceedings of the European Conference on Computer Vision (ECCV), Marseille, France, 12–18 October 2008; pp. 44–57. [Google Scholar]
  29. Brostow, G.J.; Fauqueur, J.; Cipolla, R. Semantic Object Classes in Video: A High-Definition Ground Truth Database. Pattern Recognit. Lett. 2008, 30, 88–97. [Google Scholar] [CrossRef]
  30. Cubuk, E.D.; Zoph, B.; Shlens, J.; Le, Q.V. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 702–703. [Google Scholar]
  31. Zhu, Y.; Sapra, K.; Reda, F.A.; Shih, K.J.; Newsam, S.; Tao, A.; Catanzaro, B. Improving semantic segmentation via video propagation and label relaxation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 8856–8865. [Google Scholar]
  32. Si, H.; Zhang, Z.; Lv, F.; Yu, G.; Lu, F. Real-time semantic segmentation via multiply spatial fusion network. arXiv 2019, arXiv:1911.07217. [Google Scholar]
  33. Orsic, M.; Kreso, I.; Bevandic, P.; Segvic, S. In defense of pre-trained imagenet architectures for real-time semantic segmentation of road-driving images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 12607–12616. [Google Scholar]
  34. Chen, W.; Gong, X.; Liu, X.; Zhang, Q.; Li, Y.; Wang, Z. FasterSeg: Searching for Faster Real-time Semantic Segmentation. arXiv 2019, arXiv:1912.10917. [Google Scholar]
  35. Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 325–341. [Google Scholar]
  36. Pan, H.; Hong, Y.; Sun, W.; Jia, Y. Deep Dual-Resolution Networks for Real-Time and Accurate Semantic Segmentation of Traffic Scenes. IEEE Trans. Intell. Transp. Syst. 2022, 24, 3448–3460. [Google Scholar] [CrossRef]
  37. Fan, M.; Lai, S.; Huang, J.; Wei, X.; Chai, Z.; Luo, J.; Wei, X. Rethinking bisenet for real-time semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 9716–9725. [Google Scholar]
  38. Lin, P.; Sun, P.; Cheng, G.; Xie, S.; Li, X.; Shi, J. Graph-guided architecture search for real-time semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4203–4212. [Google Scholar]
  39. Zhang, Y.; Qiu, Z.; Liu, J.; Yao, T.; Liu, D.; Mei, T. Customizable architecture search for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 11641–11650. [Google Scholar]
  40. Chandra, S.; Couprie, C.; Kokkinos, I. Deep spatio-temporal random fields for efficient video segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8915–8924. [Google Scholar]
  41. Yu, C.; Gao, C.; Wang, J.; Yu, G.; Shen, C.; Sang, N. Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmentation. Int. J. Comput. Vis. 2021, 129, 3051–3068. [Google Scholar] [CrossRef]
Figure 1. Network structure diagram of LightSeg, illustrating the encoder-decoder architecture. Each stage is stacked with an LSPBlock module for robust feature extraction. The UpHead module serves as the multi-scale feature fusion module.
Figure 2. The figure illustrates the distinction between feature extraction methods: standard convolution, grouped convolution, and our proposed LSPConv module. In standard convolution, a small kernel size is applied uniformly across all channels for feature extraction. Grouped convolution divides the input channels into groups, performing feature extraction using standard convolution within each group. In contrast, our LSPConv module partitions the channels into two groups, utilizing standard convolution for one group and spatial separable convolution for the other group.
Figure 3. Illustration of the LSPConv module. The input feature map is split into two channel parts, $f_a$ and $f_b$, which are processed separately using different convolutional layers. The output feature maps $f_a^{out}$ and $f_b^{out}$ are then concatenated and passed through a nonlinear activation function to obtain the final output feature map $f^{out}$.
Figure 4. Comparison of LightSeg's LSP block and RegSeg's D block. The symbol g denotes the size of the group convolution, while the symbols $d_1$ and $d_2$ represent the dilation rates of the dilated convolutions.
Figure 5. The symbol s denotes the stride of the convolution. When the stride is set to 2 using the LSPBlock, a 2 × 2 average pooling operation is performed for downsampling, followed by a 1 × 1 convolutional layer for channel transformation.
Figure 6. The Up-Head approach enables the decoding of features from various levels of the encoder. The diagram depicts the process of fusing features from the final 1/4, 1/8, and 1/16 stages of the backbone.
Figure 7. The effect of the proportion of LSPConv on network performance for different kernel sizes.
Figure 8. We conducted experiments on the positions of different sizes of LSPBlocks in Stage 3 to investigate the preference of different kernel sizes for network depth. In the experiments, k represents the length of consecutive LSPBlocks, while x represents the starting block of using LSPBlock.
Figure 9. We conducted experiments with three different kernel sizes of LSPBlock, namely 5, 7, and 9. The horizontal axis represents the starting position of five consecutive LSPBlocks in Stage 3, while the vertical axis represents the mIoU metric.
Figure 10. Visual results comparison of LightSeg and Regseg on Cityscapes validation set under various scenarios. The visualization results clearly indicate the superiority of our network over RegSeg in accurately capturing elongated objects. Moreover, the inherent local spatial perception ability of LSPConv significantly enhances the segmentation performance for large-scale objects.
Figure 11. Effective receptive field (ERF) Analysis on Cityscapes. The depth of color represents the degree of influence of the input image on the target value. A darker color indicates a significant influence, while a lighter color indicates a minimal impact on the target value.
Table 1. The architectures of lightseg-base and lightseg-large, where [ ] × n denotes repeating the corresponding block for n times.
Stage | Output Size | LightSeg-Base | LightSeg-Large
1 | H/4 × W/4 | C1 = 48: [3 × 3, 48; 3 × 3, 48; 1 × 1, 48], stride 2 | C1 = 64: [3 × 3, 64; 3 × 3, 64; 1 × 1, 64], stride 2
2 | H/8 × W/8 | C2 = 128: [3 × 3, 128; 3 × 3, 128; 1 × 1, 128], stride 2 | C2 = 128: [3 × 3, 128; 3 × 3, 128; 1 × 1, 128], stride 2
  |           | [3 × 3, 128; 3 × 3, 128; 1 × 1, 128] × 2 | [3 × 3, 128; 3 × 3, 128; 1 × 1, 128] × 2
3 | H/16 × W/16 | C3 = 256: [3 × 3, 256; 3 × 3, 256; 1 × 1, 256], stride 2 | C3 = 256: [3 × 3, 256; 3 × 3, 256; 1 × 1, 256], stride 2
  |             | [3 × 3, 256; 3 × 3, 256; 5 × 1, 128; 1 × 5, 128; 1 × 1, 256] × 2 (LSPBlock) | [3 × 3, 256; 3 × 3, 128; 5 × 1, 128; 1 × 5, 128; 1 × 1, 256] × 2 (LSPBlock)
  |             | [3 × 3, 256; 3 × 3, 128; 7 × 1, 128; 1 × 7, 128; 1 × 1, 256] × 4 (LSPBlock) | [3 × 3, 256; 3 × 3, 256; 7 × 1, 128; 1 × 7, 128; 1 × 1, 256] × 4 (LSPBlock)
  |             | [3 × 3, 256; 3 × 3, 128; 11 × 1, 128; 1 × 11, 128; 1 × 1, 256] × 6 (LSPBlock) | [3 × 3, 256; 3 × 3, 256; 11 × 1, 128; 1 × 11, 128; 1 × 1, 256] × 6 (LSPBlock)
  |             | [3 × 3, 320; 3 × 3, 320; 11 × 1, 160; 1 × 11, 160; 1 × 1, 320] (LSPBlock) | [3 × 3, 512; 3 × 3, 512; 11 × 1, 256; 1 × 11, 256; 1 × 1, 512] (LSPBlock)
Table 2. Parameters and Benchmark: The table illustrates a comparison between LSPBlock and DBlock in terms of parameter quantity and benchmark metrics. The symbol "↓" indicates that a smaller value in the column is preferred, while the symbol "↑" indicates that a larger value in the column is preferred. The color of the text indicates the ranking, with red indicating the best and blue indicating the second-best.
Block | Kernel Shape | Parameters ↓ | Delay ↓
LSPBlock | 3 × 3, 1 × 3, 3 × 1 | 16.05 KB | 0.7552 ms
LSPBlock | 3 × 3, 1 × 5, 5 × 1 | 18.10 KB | 0.8159 ms
LSPBlock | 3 × 3, 1 × 7, 7 × 1 | 20.14 KB | 0.8477 ms
LSPBlock | 3 × 3, 1 × 11, 11 × 1 | 24.24 KB | 0.8653 ms
DBlock [10] | 3 × 3, 3 × 3 | 19.92 KB | 0.9437 ms
Table 3. Comparison of frames per second (FPS) and Gigaflops (GFLOPs) with 3 × 2048 × 1024 Image Resolution.
Model | mIoU ↑ | FPS ↑ | GFLOPs (Whole Model) ↓
Ours | 76.8 | 61.1 | 16.887 G
RegSeg | 78.1 | 40.1 | 19.670 G
Table 4. Comparison with state-of-the-art methods on Cityscapes. IM means ImageNet. † indicates that the deployment uses TensorRT benchmark.
Model | val mIoU ↑ | Speed (FPS) ↑ | Pre-Training | GPU | Resolution | GFLOPs ↓ | Params ↓
MSFNet [32] | - | 41 | IM | RTX 2080Ti | 2048 × 1024 | 96.8 | -
SwiftNetRN-18 [33] | - | 39.9 | IM | RTX 2080Ti | 2048 × 1024 | 104 | 11.8 M
FasterSeg [34] | 73.1 | 163.9 | None | RTX 2080Ti | 2048 × 1024 | 28.2 | 4.4 M
BiSeNet1 [35] | 69.0 | 40.9 | IM | RTX 2080Ti | 2048 × 1024 | 14.8 | 5.8 M
BiSeNet2 [35] | 74.8 | 42.2 | IM | RTX 2080Ti | 2048 × 1024 | 55.3 | 49 M
DDRNet-23 [36] | 79.5 (79.1 ± 0.3) | 37.1 | IM | RTX 2080Ti | 2048 × 1024 | 143.1 | 20.1 M
RegSeg [10] | 78.13 ± 0.48 | 35 | None | RTX 2080Ti | 2048 × 1024 | 39.1 | 3.34 M
LightSeg | 76.8 | 61.1 | None | RTX 2080Ti | 2048 × 1024 | 16.88 | 2.44 M
LightSeg † | - | 115.7 | None | Jetson NX | 2048 × 1024 | 16.88 | 2.44 M
Table 5. Comparison with state-of-the-art methods on Camvid, IM: ImageNet, C: Cityscapes.
Model | Extra Data | FPS ↑ | mIoU ↑
STDC2-Seg [37] | IM | 152.2 | 73.9
GAS [38] | - | 153.1 | 72.8
CAS [39] | - | 169 | 71.2
VideoGCRF [40] | C | - | 75.2
BiSeNetV2 [41] | C | 124 | 76.7
BiSeNetV2-L [41] | C | 33 | 78.5
DDRNet-23 [36] | C | 94 | 80.1
RegSeg [10] | C | 70 | 80.9
LightSeg | C | 99.4 | 77.1
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
