Article

GLBRF: Group-Based Lightweight Human Behavior Recognition Framework in Video Camera

1 Department of Computer Software, Daegu Catholic University, Gyeongsan 38430, Republic of Korea
2 Department of Software Convergence, Soonchunhyang University, Asan 31538, Republic of Korea
3 School of Computer Software, Daegu Catholic University, Gyeongsan 38430, Republic of Korea
4 Department of Computer Software Engineering, Soonchunhyang University, Asan 31538, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(6), 2424; https://doi.org/10.3390/app14062424
Submission received: 16 February 2024 / Revised: 7 March 2024 / Accepted: 11 March 2024 / Published: 13 March 2024
(This article belongs to the Special Issue Advances in Image Recognition and Processing Technologies)

Abstract

Behavior recognition analyzes human movement to identify the actions a person is performing. It is used in various fields, such as anomaly detection and health estimation, where deep learning models recognize and classify the features and patterns of each behavior. However, video-based behavior recognition models are typically trained on large datasets and demand substantial computational power, so a lightweight learning framework that can efficiently recognize various behaviors is needed. In this paper, we propose a group-based lightweight human behavior recognition framework (GLBRF) that achieves both low computational burden and high accuracy in video-based behavior recognition. GLBRF reduces computational cost by training a 2D CNN model on a relatively small dataset and improves behavior recognition accuracy by applying location-based grouping to capture interaction behaviors between people, enabling efficient recognition of multiple behaviors across various services. With grouping, accuracy reached 98%; without grouping, accuracy was only 68%.

1. Introduction

With the decreasing cost of digital monitoring equipment such as cameras and microphones, video surveillance systems have been extensively applied in public places like banks, museums, and shopping malls to monitor unusual events. However, the traditional video surveillance system mainly captures anomalies and related evidence through offline video, which requires continuous monitoring and consumes a lot of resources and time. To reduce the workload of security personnel and improve their work efficiency, research on deep learning-based behavioral recognition technology has been active. This research is considered to be one of the important topics in the field of computer vision, which has demonstrated excellent performance and can be utilized in various applications [1,2,3].
Behavior recognition technology aims to analyze information related to human movement and classify or predict behavior. Currently, many researchers are trying to solve behavior recognition problems using deep learning, and various deep learning models, such as convolutional neural networks (CNNs) and long short-term memory (LSTM), are being used [4]. Among them, 2D CNN and 3D CNN models have received the most attention and are widely used in the field of behavior recognition [5]. The difference between their feature extraction methods is shown in Figure 1. 2D CNNs are effective mainly at extracting spatial features of images but are limited in learning temporal information directly. 3D CNNs, on the other hand, can learn both spatial and temporal features, making them more suitable for recognizing continuous motion in video data; however, they require large frame datasets and a lot of computation [6].
Recently, there has been a growing demand for intelligent CCTV in smart cities and surveillance systems for social security provisions. Research is not only focused on individuals but also extensively on group behaviors. To recognize group behaviors, identified individuals must be grouped together. Grouping involves analyzing patterns such as size, location, direction, and speed of the recognized individuals and clustering them with similar patterns. For collective behaviors like assaults or fights, classifying the extracted individuals as adjacent groups rather than recognizing individual behaviors can enhance accuracy. This is because it utilizes data that influence the behavior of each member within the group [7].
Developing lightweight models is essential for fast and efficient behavior recognition. In addition, in behavior recognition research, detecting not only individual behaviors but also interaction information between people through grouping is an important factor in improving the accuracy of behavior recognition. Therefore, in this paper, we propose a group-based lightweight behavior recognition framework (GLBRF). The proposed system learns by utilizing 2D CNN for a single frame, which enables efficient behavior recognition by minimizing the amount of computation. In addition, by applying location-based grouping to reflect the interaction information between people, we focus on improving the accuracy of behavior recognition with less computational cost. This approach can operate efficiently on devices with limited resources and can reduce computation while maintaining the accuracy of behavior recognition. In other words, the proposed system simultaneously achieves the two main goals of minimizing the amount of computation required for real-time behavior recognition and improving accuracy. The main contributions of our work are as follows:
  • A lightweight framework: GLBRF is based on 2D CNN, which can accurately recognize human behavior in a video while significantly reducing computational cost. This means that it can be effectively used in resource-constrained environments and presents a new approach in the field of video-based behavior recognition that reduces computational overhead while providing effective recognition results.
  • Improved recognition accuracy through grouping: We improved the accuracy of behavior recognition by applying grouping techniques to account for interactions between objects. This is especially useful for recognizing complex behavior patterns where multiple objects interact.
  • Combining object tracking and behavior recognition: GLBRF includes object recognition, center coordinate calculation, object tracking, and grouping functions to efficiently recognize interacting group actions among recognized objects in a video. This allows for final behavior classification based on a specific number of frames, which is more reliable than single-frame behavior recognition.
The rest of the paper is structured as follows: Section 2 describes the principles and differences between 2D CNN and 3D CNN commonly used in behavior recognition. Section 3 describes the principles of the YOLO model and the centroid-tracking algorithm used to recognize and track people. We also describe MobileNetV2, which is used as a learning model for behavior recognition. Section 4 describes the overall system structure and implementation process of the proposed framework, the training dataset, and the model. Section 5 presents the implementation and performance comparison of the framework. Finally, Section 6 discusses the effectiveness and limitations of GLBRF, proposes future research directions, and summarizes the paper.

2. Related Work

As an essential component of social interaction, human behavior carries great significance, and understanding and analyzing it is valuable in many fields. With the rapid development of computer vision and machine learning technologies, intelligent machines are beginning to replace humans in observing, recognizing, and analyzing image and video data [8,9]. Behavior recognition based on computer vision and machine learning is one of these challenges and has become a particularly active research topic in various fields, such as intelligent monitoring, smart homes, virtual reality, and medical diagnosis. In this section, we discuss some of the related work by other researchers in this area.
Gul et al. [10] classified abnormal patient activities by monitoring and recognizing patient activities to recognize activities that require emergency medical assistance. They used YOLO as a backbone CNN model and built a large patient video dataset labeled with patient actions and patient locations to train it. The results showed that the accuracy of abnormal behavior recognition was 96.8%, indicating that the proposed framework can be useful for patient monitoring in hospitals and elderly care facilities. Zhang et al. [11] proposed an efficient 2D framework called FENet to significantly reduce the overhead of model storage and computation. To validate the effectiveness of FENet, they conducted several experiments on the UCF101 dataset and showed that FENet achieved optimal accuracy on small datasets and a competitive accuracy of 71.79% with minimal computational cost.
However, these studies were based on the 2D CNN structure, which is limited in capturing motion information encoded across multiple consecutive frames: the convolutional filters learn only spatial features, and temporal information is reflected only in the final classification layer [12]. Therefore, 3D CNN-based behavior recognition has been proposed to overcome these structural limitations of 2D CNNs.
Vrskova et al. [13] introduced a 3D CNN architecture for classifying human activities in video data, which demonstrated improved precision over existing models. The proposed 3D CNN architecture achieved high test accuracies of 85.2% and 84.4% using the UCF YouTubeAction and UCF101 datasets, demonstrating the effectiveness of 3D CNN in recognizing different activities and offering potential applications in monitoring non-standard human behavior in medical fields and public places by accurately monitoring and classifying activities. Wang et al. [14] presented an algorithm based on double-branch 3D CNN for human behavior classification in videos. It uses two consecutive convolutional neural networks to recognize human actions in videos, extracting temporal and spatial features of video data in the training network and then performing classification and recognition in the test network. Experimental results showed that the algorithm achieved 95.0% accuracy on the UCF-101 dataset.
In summary, 3D CNNs overcome the structural limitations of 2D CNNs by adding a time axis to the 2D CNN layer configuration and using 3D convolutional operations. Because every convolutional filter is three-dimensional, the feature map generated by a single filter is also three-dimensional, so 3D CNNs can learn spatiotemporal features from continuous frame data. However, 3D CNNs have the disadvantage of requiring a large amount of computation for training due to the large frame datasets and 3D convolutional filters [15].
Previous research has focused on individual behavior recognition rather than group behavior recognition. Although there are prior works on crowd behavior, most of them focus on large crowds and contribute to areas such as crowd counting, tracking, and anomaly detection [16,17,18,19]. In other words, research on behavior recognition in individual small groups is relatively limited. Furthermore, they utilize high-dimensional 3D CNN structures for analysis, which is computationally burdensome.
Therefore, this paper proposes a 2D CNN-based GLBRF to overcome the limitations of existing studies. We designed a 2D CNN-based model that requires less computation and applied a grouping technique for relatively accurate behavior recognition. As a result, we were able to perform sufficient analysis with a relatively small amount of data and demonstrated its efficiency in terms of required computation and behavior recognition accuracy.

3. Background

The first step of the GLBRF framework is to perform object recognition on the input video frames. We used the you only look once (YOLO) model as a method to recognize objects and the centroid-tracking algorithm to calculate and track the locations of recognized objects. Finally, we used the MobileNetV2 model to recognize the behavior of the recognized objects.

3.1. You Only Look Once (YOLO)

YOLO is one of the object detection methods known for its rapid real-time performance and high accuracy [20,21,22]. YOLO does not require region proposals to localize and classify objects. Instead, it divides the entire image into an S × S grid and identifies “m” bounding boxes within each grid. Each bounding box predicts a class probability and offset values. Bounding boxes predicting class probabilities below a certain threshold are discarded. As a result, YOLO offers fast detection speeds and high recognition rates and can be applied to small modules, making it suitable for real-time object recognition. The final bounding boxes of detected objects obtained through YOLO can be seen in Figure 2, providing information on the location of the detected objects.
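As an illustration of this detection step, the following sketch uses the Ultralytics YOLO package to keep only "person" detections. The paper does not state which YOLO version or weights were used, so the model file and API shown here are assumptions.

```python
# Minimal person-detection sketch using the Ultralytics YOLO package.
# "yolov8n.pt" (pretrained COCO weights) is an illustrative assumption;
# the paper does not specify the YOLO version or implementation used.
import cv2
from ultralytics import YOLO

model = YOLO("yolov8n.pt")

def detect_people(frame):
    """Return a list of (x1, y1, x2, y2) bounding boxes for the 'person' class."""
    results = model(frame, verbose=False)[0]
    boxes = []
    for box in results.boxes:
        if int(box.cls) == 0:  # COCO class 0 is "person"
            x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
            boxes.append((x1, y1, x2, y2))
    return boxes

cap = cv2.VideoCapture("input_video.mp4")
ok, frame = cap.read()
if ok:
    print(detect_people(frame))
```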

3.2. Centroid-Tracking Algorithm

In this paper, a centroid-tracking algorithm is introduced to determine how individual objects recognized in consecutive frames are connected over time and remain the same object. This object tracking process enables consistent identification and tracking of recognized objects in a video, which is essential for accurate analysis of object movement and behavior patterns, especially in dynamic environments.
First, Equation (1) is applied to each object detected in a single frame through object detection to calculate its center coordinates.
$$P(x, y) = \left( x_1 + \frac{x_2 - x_1}{2},\; y_1 + \frac{y_2 - y_1}{2} \right) \quad (1)$$
$$d_E(P_1, P_2) = \sqrt{(P_{1x} - P_{2x})^2 + (P_{1y} - P_{2y})^2} \quad (2)$$
Then, a unique ID is assigned by calculating the Euclidean distance [23] between the center coordinates of the current frame and those of the previous frame, as shown in Figure 3. For the center coordinates $P_1$ and $P_2$ of two objects, the Euclidean distance is given by Equation (2). The pair of centroids from the previous and current frames whose Euclidean distance is smallest, and smaller than a threshold, is determined to be the same object and keeps the existing ID. The threshold is the average width of the bounding boxes of the recognized people. Centroids that are farther than this threshold, or that have not been assigned an ID, are judged to be new objects and receive a new unique ID [24]. The centroid-tracking algorithm is thus a multi-step process that calculates the Euclidean distance between each centroid detected in the previous frame and each centroid detected in the current frame and tracks the objects that are determined to be the same.
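Equations (1) and (2), together with the nearest-neighbor ID assignment described above, translate directly into a few lines of Python; the helper names below are ours, not from the paper.

```python
import math

def centroid(x1, y1, x2, y2):
    """Equation (1): center point of a bounding box."""
    return (x1 + (x2 - x1) / 2, y1 + (y2 - y1) / 2)

def euclidean_distance(p1, p2):
    """Equation (2): Euclidean distance between two centroids."""
    return math.sqrt((p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2)

def assign_ids(prev, current, max_ed, next_id):
    """Match current-frame centroids to previous-frame IDs (centroid tracking).

    prev: dict {object_id: (x, y)} of centroids from the previous frame.
    current: list of (x, y) centroids detected in the current frame.
    max_ed: matching threshold (average bounding-box width in this paper).
    Returns the dict {object_id: (x, y)} for the current frame and the next free ID.
    """
    assigned = {}
    for c in current:
        # Closest previous centroid that has not been claimed yet.
        candidates = [(euclidean_distance(c, p), pid)
                      for pid, p in prev.items() if pid not in assigned]
        if candidates:
            dist, pid = min(candidates)
            if dist <= max_ed:
                assigned[pid] = c          # same object: keep its existing ID
                continue
        assigned[next_id] = c              # new object: assign a fresh ID
        next_id += 1
    return assigned, next_id
```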

3.3. MobileNetV2

In this paper, MobileNetV2 is applied to recognize the behavior of objects once their data (bounding boxes, IDs) have been collected through tracking of the detected objects. MobileNetV2 lightens a general CNN (reducing memory and computation) by applying depthwise-separable convolution and inverted residuals. Depthwise-separable convolution splits a standard convolutional layer into two layers: a depthwise convolution and a pointwise convolution. The depthwise convolution applies a single filter to each channel of the input data, and the pointwise convolution, a 1 × 1 convolution, combines the outputs of the depthwise convolution across channels. The inverted residual block is an enhancement of the ResNet residual block that first expands the channel dimension and later collapses it back down. In this way, the trade-off between computation and accuracy can be managed, minimizing the loss of accuracy while significantly reducing the amount of computation [25,26].
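To make the depthwise-separable idea concrete, the short Keras fragment below (an illustrative sketch, not the paper's code) contrasts a standard convolution with its depthwise + pointwise factorization.

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = layers.Input(shape=(224, 224, 32))

# Standard convolution: a 3x3 filter bank that mixes space and channels at once.
standard = layers.Conv2D(64, kernel_size=3, padding="same")(inputs)

# Depthwise-separable factorization used by MobileNetV2:
#   1) depthwise 3x3 convolution filters each input channel independently,
#   2) pointwise 1x1 convolution combines the channels.
depthwise = layers.DepthwiseConv2D(kernel_size=3, padding="same")(inputs)
separable = layers.Conv2D(64, kernel_size=1)(depthwise)

model = tf.keras.Model(inputs, [standard, separable])
model.summary()  # the separable branch has far fewer parameters than the standard one
```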

4. Proposed System

The typical process of a behavior recognition system is depicted in Figure 4 [27]. The first step involves receiving a video as input and undergoing a preprocessing phase. This phase includes tasks such as noise removal, color normalization, and frame size adjustments. Subsequently, the feature extraction process is carried out on the preprocessed video. Feature extraction is a process that extracts essential feature information and eliminates unnecessary details, ensuring that the core elements of behavior are clearly derived. The extracted features are then passed to the behavior recognition model to predict specific behaviors. Moreover, the results of the preprocessing and feature extraction processes can be used as training datasets for the behavior recognition model. Through such a process, the behavior recognition system can identify and classify various behaviors. However, video data are high-dimensional and possess complex features, necessitating significant computational resources and time for feature extraction and model training. Additionally, because most behavior recognition models focus on recognizing individual behaviors, they face challenges in recognizing behaviors involving interactions between multiple individuals. We designed the GLBRF framework to address these issues.

4.1. Group-Based Lightweight Human Behavior Recognition Framework

The overall system structure of GLBRF is shown in Figure 5. Rectangles represent functions, and circles represent data. The object detection, centroid coordinates calculation, centroid tracking, and grouping functions correspond to the data preprocessing and feature extraction processes for behavior recognition, which can be used to build a behavior recognition training dataset. In this process, objects are recognized and tracked in each frame of the input video, and grouping is applied. These are then fed into the behavior recognition learning model in the behavior recognition function to recognize behaviors. Finally, the final behavior is classified by the classification function based on a specific number of frames. With this GLBRF, we can efficiently recognize interacting group behaviors between recognized objects in a video.

4.2. System Procedure

The flowchart of GLBRF is shown in Figure 6. First, the people detection function receives each frame of the video as input and recognizes objects using the YOLO model. If a person is recognized among the detected objects, it outputs a bounding box with the person's coordinates in the frame ($x_1, y_1, x_2, y_2$). If no person is recognized, the centroid coordinates cannot be calculated, so processing moves on to subsequent frames until a person is detected. The centroid coordinates calculation function computes the center coordinates ($x, y$) from the bounding-box coordinates $x_1, y_1, x_2, y_2$ using Equation (1); this center coordinate is used as the point representing the person's location. The centroid-tracking function uses Equation (2) and the centroid-tracking algorithm to track each person's center coordinates across consecutive frames. A maximum Euclidean distance is set as a threshold to determine whether a person in the current frame is the same as one in the previous frame; this threshold is the average width of the bounding boxes of the recognized people. If the distance to every center coordinate of the previous frame exceeds the threshold, the person is determined to be new and is assigned a unique ID. Otherwise, the person with the closest center coordinate is determined to be the same person and inherits that centroid's ID. The grouping function groups people who belong to the same group in the current frame. It again uses a maximum Euclidean distance as a threshold to determine group membership. This threshold is based on arm length, considering the distance at which people interact with each other. The arm length is based on the "8th Domestic Body Size Survey" by Size Korea, an institution providing standard body-size data for Koreans under the Korea Agency for Technology and Standards (KATS) [28], which reports that the average arm length is about one-third of a person's height. Thus, the distance determining proximity between people in the grouping process is set to two-thirds of the average height of the detected people. If the Euclidean distance between centroid coordinates in the current frame is below the threshold, the corresponding people are deemed to be in the same group. The behavior recognition function identifies specific behaviors using the behavior recognition training model; in the current system, the training data are built by labeling each group's behavior, and MobileNetV2 is used as the behavior recognition model. The behavior of every group is recognized in each frame of the video, and the classification function then classifies the behavior of each group over a set number of frames: the behavior recognition results (behavior classes) are stored for each group in each frame, and for each group, the behavior class recognized most frequently over that span of frames is taken as the behavior for the entire span.
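The per-frame control flow of Figure 6 can be summarized in the following skeleton; detect_people, assign_ids, group_centroids, and recognize_behavior are placeholders for the components described in Sections 3 and 4.3, and the names are ours.

```python
# High-level skeleton of the GLBRF per-frame procedure (illustrative only).
import cv2
from collections import defaultdict, Counter

SPAN = 10  # number of frames over which the final behavior is decided

def run_glbrf(video_path, detect_people, assign_ids, group_centroids, recognize_behavior):
    cap = cv2.VideoCapture(video_path)
    prev_centroids, next_id = {}, 0
    history = defaultdict(list)               # per-group list of predicted classes
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        boxes = detect_people(frame)
        if not boxes:
            continue                          # no person detected: move to the next frame
        centroids = [(x1 + (x2 - x1) / 2, y1 + (y2 - y1) / 2) for x1, y1, x2, y2 in boxes]
        max_ed = sum(x2 - x1 for x1, _, x2, _ in boxes) / len(boxes)   # average box width
        prev_centroids, next_id = assign_ids(prev_centroids, centroids, max_ed, next_id)
        groups = group_centroids(boxes, centroids)   # location-based grouping -> {group_id: box}
        for gid, group_box in groups.items():
            history[gid].append(recognize_behavior(frame, group_box))
            if len(history[gid]) == SPAN:             # classify once per frame span
                final = Counter(history[gid]).most_common(1)[0][0]
                print(f"group {gid}: {final}")
                history[gid].clear()
    cap.release()
```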

4.3. Object Grouping Algorithm

The grouping function in Figure 5 and Figure 6 is executed through the pseudocode in Algorithm 1. $C_i^k$ represents the centroid coordinates of a person present within a frame (line 0), and $k$ denotes the frame sequence, where 0 indicates the previous frame and 1 indicates the current frame (line 1). Thus, $C^{k=0}$ signifies the centroid coordinates of the previous frame, and $C^{k=1}$ denotes those of the current frame. $i$ and $j$ are unique IDs of centroid coordinates (line 2); hence, $C_i$ and $C_j$ represent the centroid coordinates of two different individuals. Max_ED is the maximum Euclidean distance used to determine whether the same person exists in both the current and previous frames (line 3). Max_GD is the maximum Euclidean distance used to determine membership in the same group (line 4). The function Centroid_Coordinates_Calculation computes the centroid coordinates of the detected individuals (lines 5–8): it takes the center of $x_1$ and $x_2$ and the center of $y_1$ and $y_2$ and returns the coordinates $(x, y)$. Euclidean_Distance calculates the Euclidean distance between the centroid coordinates of detected individuals (lines 9–11). Grouping clusters adjacent centroid coordinates (lines 12–19): for the individuals detected in the current frame, the Euclidean distance between their centroid coordinates is calculated, and if it is less than Max_GD, they are deemed to be in the same group; consequently, one bounding box is output for each group. Upon receiving a video input, individuals are detected in each frame (lines 20–21). If individuals are detected, the loop iterates over the number of detected individuals (line 22). If no individuals were detected in the previous frame, the centroid coordinates of the detected individuals are calculated and unique IDs are assigned (lines 23–25); C_ID is the unique ID of a centroid coordinate. If individuals were detected in the previous frame, the centroid coordinates of the individuals detected in the current frame are calculated (lines 26–27). Next, iterating over the centroid coordinates, the Euclidean distance between the centroid coordinates of the previous and current frames is calculated (lines 28–29). If the Euclidean distance exceeds Max_ED, the detection is determined to be a new individual and a unique ID is assigned (lines 30–31); otherwise, the ID of the previous-frame centroid with the smallest Euclidean distance is assigned (lines 32–33). The grouping function then operates on the centroid coordinates of the individuals detected in the current frame, where G signifies a group (line 36). Once grouping is complete, the centroid coordinates and IDs of the current frame are stored as those of the previous frame (lines 37–38).
Algorithm 1 Grouping
0:  C_i^k: Object Centroid Coordinates
1:  k: Frame Sequence (0: Previous Frame, 1: Current Frame)
2:  i, j: Different Unique IDs
3:  Max_ED: Maximum Euclidean Distance
4:  Max_GD: Maximum Group Distance
5:  FUNCTION Centroid_Coordinates_Calculation(x1, y1, x2, y2)
6:    x, y ← ((x1 + x2) / 2, (y1 + y2) / 2)
7:    RETURN x, y
8:  ENDFUNCTION
9:  FUNCTION Euclidean_Distance(C^{k=0}, C^{k=1})
10:   RETURN sqrt((C^{k=0}_x − C^{k=1}_x)^2 + (C^{k=0}_y − C^{k=1}_y)^2)
11: ENDFUNCTION
12: FUNCTION Grouping(C_j^{k=1})
13:   LOOP C^{k=1} DO
14:     IF Euclidean_Distance(C_i^{k=1}, C_j^{k=1}) < Max_GD THEN
15:       G ← C^{k=1}(Min x1, Min y1, Max x2, Max y2)
16:     ENDIF
17:   ENDLOOP
18:   RETURN G
19: ENDFUNCTION
20: LOOP Frame DO
21:   People Detection
22:   LOOP People DO
23:     IF C^{k=0} = NIL THEN
24:       C^{k=0} ← Centroid_Coordinates_Calculation(Objects.x1, Objects.y1, Objects.x2, Objects.y2)
25:       C^{k=0}_ID ← Unique ID
26:     ELSE
27:       C^{k=1} ← Centroid_Coordinates_Calculation(Objects.x1, Objects.y1, Objects.x2, Objects.y2)
28:       LOOP C^{k=1} DO
29:         ED ← Euclidean_Distance(C^{k=0}, C^{k=1})
30:         IF ED > Max_ED THEN
31:           C^{k=1}_ID ← Unique ID
32:         ELSE
33:           C^{k=1}_ID ← Min C^{k=0}_ID
34:         ENDIF
35:       ENDLOOP
36:       G ← Grouping(C^{k=1})
37:       C^{k=0} ← C^{k=1}
38:       C^{k=0}_ID ← C^{k=1}_ID
39:     ENDIF
40:   ENDLOOP
41: ENDLOOP
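For readers who prefer runnable code, a Python transcription of the Grouping function might look as follows; the pairwise proximity rule of lines 12–19 is interpreted transitively here (via union-find), and the function and variable names are ours.

```python
import math

def grouping(centroids, boxes, max_gd):
    """Python sketch of the Grouping step in Algorithm 1 (lines 12-19).

    centroids: dict {person_id: (x, y)} for the current frame.
    boxes: dict {person_id: (x1, y1, x2, y2)} for the current frame.
    max_gd: maximum Euclidean distance for two people to belong to one group
            (two-thirds of the average detected height in this paper).
    Returns one bounding box (x1, y1, x2, y2) per group.
    """
    ids = list(centroids)
    parent = {i: i for i in ids}                   # union-find over person IDs

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    # Link every pair of people whose centroids are closer than Max_GD.
    for a in ids:
        for b in ids:
            if a < b and math.dist(centroids[a], centroids[b]) < max_gd:
                parent[find(a)] = find(b)

    # Merge member boxes: min of the top-left coordinates, max of the bottom-right (line 15).
    members = {}
    for i in ids:
        members.setdefault(find(i), []).append(boxes[i])
    return [(min(b[0] for b in group), min(b[1] for b in group),
             max(b[2] for b in group), max(b[3] for b in group))
            for group in members.values()]
```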

5. System Implementation and Experimental Evaluation

In this section, we present and analyze the implementation and evaluation results of the GLBRF system. Section 5.1 describes the process of building the training dataset for behavior recognition. Then, in Section 5.2, we describe the structure of the learning model used for behavior recognition. Section 5.3 introduces the environment in which we implemented the GLBRF system, and Section 5.4 presents and analyzes the results of the implemented system. Finally, Section 5.5 concludes by comparing the performance of the proposed GLBRF with representative models of 3D CNN and evaluating its efficiency and superiority.

5.1. Behavior Recognition Training Dataset

The process of building a training dataset for behavior recognition on GLBRF is shown in Figure 7. In this study, we chose a violent situation as a case to analyze the distinct interaction behavior between people. We utilized the open dataset titled “Abnormal Behavior (Violence) CCTV Video” from AI Hub, and each frame of the video underwent a data preprocessing step followed by labeling. AI Hub is an integrated AI platform operated by the National Information Society Agency, providing AI infrastructure such as AI data, AI SW API, and computing resources [29]. It operates as part of an initiative to build and distribute AI training data. Initially, when a frame from the video is input, individuals are detected, and their centroid coordinates are calculated. The Euclidean distance of these centroid coordinates is computed for grouping. Bounding boxes for each group are determined, and a behavior class is labeled for each bounding box. Behaviors are classified into “normal” or “violence” classes to construct the behavior recognition training dataset. In this system, we used 30 abnormal behavior (violence) CCTV video datasets, each with an average size of 2.3 GB, capturing either normal behavior or violence.
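The labeling step amounts to cropping one image per group bounding box and sorting it into a class folder; a minimal sketch is shown below, with the folder layout and file naming assumed for illustration.

```python
import os
import cv2

def save_group_crop(frame, group_box, label, out_dir="dataset", index=0):
    """Crop one group's bounding box and store it in its class folder.

    label: "normal" or "violence" (the two classes used in this study).
    The folder layout dataset/<label>/ and the file naming are illustrative assumptions.
    """
    x1, y1, x2, y2 = (int(v) for v in group_box)
    crop = frame[y1:y2, x1:x2]
    class_dir = os.path.join(out_dir, label)
    os.makedirs(class_dir, exist_ok=True)
    cv2.imwrite(os.path.join(class_dir, f"{label}_{index:05d}.jpg"), crop)
```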

5.2. Behavior Recognition Learning Model

The structure of the learning model for behavior recognition is depicted in Figure 8. The 3D CNN model has the drawback of requiring 3D convolutional filters and a large dataset, leading to extensive computational demands. Therefore, in this paper, we employed a 2D CNN model, specifically MobileNetV2, to significantly reduce computational requirements while achieving high accuracy. In addition, we fine-tuned the model by appending six dense layers with 32 and 64 nodes alternately to the existing MobileNetV2 structure and adding dropout between the dense layers to improve performance. Dense layers [30] allow for efficient learning because only the reduced-dimension feature maps are used as inputs and connected to the output. Dropout [31], applied at a rate of 50%, is a technique for addressing overfitting. The activation function of the dense layers was ReLU [32], while the final dense layer used a sigmoid [33] to produce a binary-classification output. As the optimizer, we used adaptive moment estimation (Adam), which combines momentum and root mean square propagation (RMSProp). Momentum retains the direction of past updates by adding a fraction of the previous update to the current gradient, while RMSProp [34,35] uses an exponential moving average to give more weight to the most recent gradients rather than simply accumulating them. The total number of parameters in the entire model was 3,340,674.
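A Keras sketch of the fine-tuned model described above is given below (MobileNetV2 backbone, six alternating 32- and 64-node dense layers with 50% dropout, a sigmoid output, and the Adam optimizer with a learning rate of 0.0001). Details the paper does not specify, such as the pooling layer and the exact placement of each dropout, are assumptions, so the parameter count of this sketch need not match the reported 3,340,674 exactly.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_glbrf_classifier(input_shape=(224, 224, 3)):
    # MobileNetV2 backbone without its classification head.
    base = tf.keras.applications.MobileNetV2(
        input_shape=input_shape, include_top=False, weights="imagenet")

    model = models.Sequential([base, layers.GlobalAveragePooling2D()])
    # Six dense layers alternating 32 and 64 nodes, with 50% dropout in between.
    for units in (32, 64, 32, 64, 32, 64):
        model.add(layers.Dense(units, activation="relu"))
        model.add(layers.Dropout(0.5))
    model.add(layers.Dense(1, activation="sigmoid"))   # binary: normal vs. violence

    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model
```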

5.3. System Implementation Environment

Information about the specifications and libraries of the server used to implement and experiment with the GLBRF system is shown in Table 1. The learning server consisted of an Intel Xeon W-2123 CPU (octa-core) (Intel, Santa Clara, CA, USA), an NVIDIA GeForce RTX 2080 Ti GPU (NVIDIA, Santa Clara, CA, USA), 32 GB of RAM, and Ubuntu 20.04.6 LTS as the operating system. The learning algorithm was implemented using TensorFlow version 2.5.0, an open-source machine learning library, and the Keras API built into TensorFlow. OpenCV version 4.4.1 was used for image processing of the video frames.

5.4. System Implementation Result

In this section, we present and analyze the implementation results of the GLBRF system. To recognize interactions between people, we first applied grouping to calculate center coordinates and track people. Next, the dataset and system parameters were set for behavior recognition, and training was performed to evaluate the accuracy and loss of the model. Then, we examined how the classification process, which classified the representative behavior of a group for each set frame, improved the reliability of behavior recognition over single frames. We also compared the learning results of behavior recognition with and without grouping. Finally, we analyzed the results of video-based behavior recognition using GLBRF.

5.4.1. Applying Object Grouping

To recognize interactions between individuals, grouping was applied. Initially, as shown in Figure 9, individuals were detected based on the YOLO model. Following this, the centroid coordinates were calculated using the bounding boxes of the detected individuals. The Euclidean distance between the centroids in the current frame and the previous frame was computed for tracking, essentially determining if the individual in the current frame was the same as in the previous frame. In this system, the maximum Euclidean distance for identifying the same individual was denoted as Max_ED in Algorithm 1 and set to the average width of the detected bounding boxes. Centroids with a Euclidean distance greater than Max_ED were all considered new individuals. Among the centroids with a distance less than Max_ED, the one with the smallest distance was deemed the same individual, while the rest were considered new. Subsequently, to perform grouping, the Euclidean distance was computed within the same frame. In this system, the maximum Euclidean distance for determining the same group was denoted as Max_GD in Algorithm 1 and set to two-thirds of the average height of the detected bounding boxes. If the Euclidean distance was less than Max_GD, it was considered the same group; if it was greater, it was considered a different group. Figure 10 displays the results of applying grouping to Figure 9.
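Both thresholds can be derived from the detections themselves; a small sketch (function name ours) is shown below.

```python
def compute_thresholds(boxes):
    """Derive the two thresholds from the current frame's detections.

    Max_ED: average bounding-box width (same-person matching threshold).
    Max_GD: two-thirds of the average bounding-box height (same-group threshold),
            following the arm-length reasoning in Section 4.2.
    """
    widths = [x2 - x1 for x1, y1, x2, y2 in boxes]
    heights = [y2 - y1 for x1, y1, x2, y2 in boxes]
    max_ed = sum(widths) / len(widths)
    max_gd = (2 / 3) * (sum(heights) / len(heights))
    return max_ed, max_gd
```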

5.4.2. Human Behavior Recognition Learning

Behavior recognition training was conducted using the training dataset presented in Section 5.1 and the behavior recognition training model outlined in Section 5.2. Table 2 lists the dataset used for behavior recognition training and the values of the key system parameters, which were set to obtain the best-performing model. The image dataset consisted of 1000 images labeled violence and 1000 images labeled normal, totaling 2000 images. For model validation, 80% of the entire dataset was used as the training dataset, while the remaining 20% served as the test dataset. The number of epochs refers to how many times the entire training dataset is passed through the behavior recognition model; training was carried out for 30 epochs. The batch size, set to 32, indicates the number of data points used for each parameter update during model training. The learning rate, which determines the size of the updates to the learning parameters at each step, was configured to 0.0001.
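Training with the parameters of Table 2 might then look as follows; the 80:20 split, 30 epochs, batch size of 32, and the two image classes come from the table, while the dataset folder layout and the use of image_dataset_from_directory are assumptions (build_glbrf_classifier refers to the model sketch in Section 5.2).

```python
import tensorflow as tf

# Illustrative training setup matching Table 2 (80:20 split, 30 epochs, batch size 32);
# the folder layout "dataset/{normal,violence}" is an assumption, not from the paper.
make_ds = tf.keras.preprocessing.image_dataset_from_directory
train_ds = make_ds("dataset", validation_split=0.2, subset="training", seed=42,
                   label_mode="binary", image_size=(224, 224), batch_size=32)
val_ds = make_ds("dataset", validation_split=0.2, subset="validation", seed=42,
                 label_mode="binary", image_size=(224, 224), batch_size=32)

model = build_glbrf_classifier()   # the Keras sketch from Section 5.2
history = model.fit(train_ds, validation_data=val_ds, epochs=30)
```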
In this work, we allocated 80% of the total training dataset for training and the remaining 20% for validation to evaluate the performance of the model. Figure 11 displays a graph illustrating the training evaluation metrics, accuracy, and loss, plotted against the number of epochs. Accuracy is a metric that indicates how closely the model’s predictions align with the actual outcomes, while loss represents the error between the model’s predictions and the true values.
Table 3 presents the results of behavior recognition training. Training accuracy and training loss were used to evaluate the model’s performance based on the data used for training. Validation accuracy and validation loss were used to assess the model’s performance using new data that was not utilized during training. The proposed model for behavior recognition exhibited a training accuracy and validation accuracy of 98.0% and 99.0%, respectively, and a training loss and validation loss of 0.1 and 0.05, respectively. These results indicate high accuracy and low loss. The absence of a significant difference between the training and validation results confirmed that overfitting did not occur. Consequently, it was determined that the training was successfully conducted.

5.4.3. Lightweight Framework for Behavior Recognition

The GLBRF system is designed to combine a lightweight behavior recognition model with location-based grouping. First, it recognizes and tracks people in each frame of a video to perform grouping. Then, it analyzes the behavior of these groups in each frame and stores the information. The key point is that the final behavior is determined across frames, based on a set number of frames of data. For example, Figure 12 shows the results of GLBRF with the number of frames set to 10. Here, the behavior of each group was analyzed over the first 10 frames of the input video; if at least half of these 10 frames (5 or more) were recognized as violence, the behavior of that group was finally classified as violence. In other words, through the classification process of assigning a representative behavior to a group over each set of frames, we can have more confidence in the behavior recognition results than with single-frame recognition. The GLBRF system utilizes a lightweight 2D CNN-based behavior recognition model to reduce the computational burden while applying location-based grouping to improve accuracy with fewer resources, thus outperforming existing behavior recognition models.
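The frame-span decision rule, here 10 frames classified as violence when at least half of the per-frame predictions are violence, reduces to a few lines; the sketch below is illustrative.

```python
def classify_span(frame_predictions, span=10, threshold=0.5):
    """Final behavior of one group over a span of frames.

    frame_predictions: per-frame class labels for the group, e.g. ["violence", "normal", ...].
    Returns "violence" when the violence ratio over the last `span` frames reaches the threshold.
    """
    window = frame_predictions[-span:]
    violence_ratio = window.count("violence") / len(window)
    return "violence" if violence_ratio >= threshold else "normal"
```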

5.4.4. GLBRF Application: Violence Recognition

GLBRF was developed as a video-based behavior recognition method using multiple frames to overcome the difficulty of extracting accurate motion information, which is a limitation of image-based behavior recognition methods. Figure 13 shows the result of applying GLBRF. People in each frame of the video were recognized and grouped in the data preprocessing stage. Behavior recognition was performed for each group in each frame, and the behavior class was saved. The behavior recognition result for each group was the behavior class with the highest percentage over 10 frames: if at least half of a group's behavior classes (5 or more of the 10 frames) were violence, the group was classified as violence for all 10 frames; otherwise, it was classified as normal for all 10 frames.

5.4.5. Human Behavior Recognition without Object Grouping

To confirm that grouping plays an important role in improving accuracy in behavior recognition, we trained behavior recognition on an ungrouped dataset. The training dataset was constructed by recognizing people from video frames, extracting bounding boxes, and labeling each bounding box with a behavior class to collect a single object (person) behavior dataset without applying grouping. The ungrouped image dataset consisted of 1000 violence images and 1000 normal images, for a total of 2000 images. The training results using the ungrouped dataset can be seen in Figure 14 and Table 4. The training accuracy and validation accuracy were 68.7% and 69.8%, respectively, and the training loss and validation loss were 0.51 and 0.46, respectively. This showed a significant difference in performance compared to Table 3, which shows the results when grouping was applied.
Figure 15 shows behavior recognition without grouping. We omitted the grouping process of GLBRF and used the behavior recognition model trained on the dataset without grouping. Because the accuracy of this learning model is relatively low compared to when grouping is applied, it often recognizes violent behavior as normal or normal behavior as violent. Therefore, grouping is an important factor in improving the accuracy of behavior recognition; the performance comparison results with and without grouping are shown in Figure 16.

5.5. Comparison of Behavior Recognition Models of GLBRF and 3D CNN

Figure 17 is a graph showing the dataset size and number of parameters of GLBRF's behavior recognition model, 3D ResNet (a prominent 3D CNN), and I3D. The video dataset for 3D ResNet and I3D was composed of 1000 videos labeled violence and 1000 videos labeled normal, totaling 2000 videos. Each video contained 10 frames, and for model validation, 80% of the entire dataset was used for training, while the remaining 20% was allocated for testing. The dataset size for GLBRF's behavior recognition model was 26,247,046 bytes (approximately 25 MB), with a parameter count of 3,340,674. For a relatively lightweight comparison, the shallower 3D ResNet18 was employed for 3D ResNet; its input dataset size was 262,470,460 bytes (about 250 MB), with 34,535,352 parameters. The two-stream I3D, which was trained on frame RGB values and optical flow data, was used for I3D; with the addition of optical flow, its input dataset size was 524,940,920 bytes (roughly 500 MB), and it had 25,066,066 parameters.
Table 5 and Figure 18 show the analysis results of the proposed GLBRF and the two behavior recognition models based on various performance evaluation metrics. We used accuracy, number of parameters, dataset size, and FLOPs as performance evaluation metrics. The accuracy of 3D ResNet and I3D, both 3D CNNs, was relatively low because the dataset was not large enough to train their 3D filters; in contrast, the GLBRF-based behavior recognition model showed higher accuracy. These results indicate that a substantial amount of data is required to train 3D CNN models. Additionally, the computational demand of 3D convolution is high, leading to elevated FLOPs values. Here, FLOPs denotes the number of floating-point operations a model must perform and serves as a measure of its computational complexity: a higher FLOPs value indicates a more complex or computationally demanding model. Therefore, it is challenging to create lightweight models with 3D CNNs.

6. Conclusions

Behavior recognition technology analyzes and interprets information related to human movement to recognize what actions a person is taking. Image-based behavior recognition methods using 2D CNNs are limited in capturing motion information across multiple consecutive frames because they compute features only in the spatial dimension. Notably, 3D CNNs, a way to overcome this limitation, are more effective than 2D CNNs for video analytics problems that use spatiotemporal information, but they require a large amount of training data and computation because they must be trained on large datasets. Behavior recognition techniques therefore need lightweight models that can recognize behaviors efficiently and quickly, because the types of behaviors to be recognized vary depending on the service.
Therefore, in this paper, we proposed the GLBRF system to extract accurate motion information for video-based behavior recognition and to classify the extracted behaviors efficiently. GLBRF combines a lightweight 2D CNN-based behavior recognition model (which reduces computation) with location-based grouping (which improves behavior recognition accuracy). The system works as follows: after recognizing people in the input video frames, it extracts bounding boxes for the recognized people; it then calculates and tracks the center coordinates of the extracted bounding boxes, computes the Euclidean distance between these center coordinates, and groups adjacent people together; finally, based on the data from a set number of frames, the final behavior across those frames is determined. In this study, to analyze the distinct interaction behaviors between people, we chose a violent situation as an example to verify the performance of the proposed GLBRF through experiments. The accuracy of GLBRF with grouping was 98%, while the accuracy without grouping was 68.7%, and the accuracy of the 3D CNN trained on the insufficient dataset was 66.7%. These results show that the proposed GLBRF not only improves the accuracy of action recognition through grouping but also performs well without a large dataset. In addition, GLBRF uses a 2D CNN-based model, which is lightweight compared to a 3D CNN and enables faster behavior recognition.
The experiments in this paper specifically focused on assault situations and were conducted using a limited dataset, which may not fully reflect the complexity and diversity of real-world settings. However, a key strength of this work is the versatility of the overall framework. This means that despite the focus on behavior recognition in assault situations, the proposed framework has the ability to recognize different types of behaviors that occur in different situations and environments. The system’s approach can also be effectively applied to a variety of human behavioral situations where the benefits of grouping are prominent, such as team sports, collective action, social interaction, and teaching and learning situations. This goes beyond individual behavior recognition and provides deep insights into understanding the diversity and interaction of human behavior in complex social situations. Therefore, this generality is a major advantage beyond the limitations of this work and is an important factor to be further explored in future research. Future research will extend and validate the accuracy and efficiency of this framework by applying it to various behavior recognition scenarios.

Author Contributions

Conceptualization, Y.-C.L. and D.-Y.K.; methodology, Y.-C.L., S.-Y.L., B.K. and D.-Y.K.; software, Y.-C.L. and S.-Y.L.; validation, Y.-C.L., S.-Y.L., B.K. and D.-Y.K.; formal analysis, Y.-C.L., S.-Y.L., B.K. and D.-Y.K.; investigation, Y.-C.L., S.-Y.L., B.K. and D.-Y.K.; resources, Y.-C.L. and S.-Y.L.; data curation, Y.-C.L. and S.-Y.L.; writing—original draft preparation, Y.-C.L. and S.-Y.L.; writing—review and editing, S.-Y.L. and D.-Y.K.; visualization, Y.-C.L. and S.-Y.L.; supervision, B.K. and D.-Y.K.; project administration, Y.-C.L., S.-Y.L. and D.-Y.K.; funding acquisition, D.-Y.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea Government (MSIT) (No. 2021R1C1C1013133), and this work was supported by the Soonchunhyang University Research Fund.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=data&dataSetSn=171 (accessed on 22 January 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lu, J.; Nguyen, M.; Yan, W.Q. Deep Learning Methods for Human Behavior Recognition. In Proceedings of the 35th International Conference on Image and Vision Computing New Zealand (IVCNZ), Wellington, New Zealand, 25–27 November 2020; pp. 1–6. [Google Scholar]
  2. Fort, A.; Peruzzi, G.; Pozzebon, A. Quasi-Real Time Remote Video Surveillance Unit for LoRaWAN-based Image Transmission. In Proceedings of the 2021 IEEE International Workshop on Metrology for Industry 4.0 & IoT (MetroInd4.0&IoT), Rome, Italy, 7–9 June 2021; pp. 588–593. [Google Scholar]
  3. Imane, R.; Bouarifi, W.; Oujaoura, M. A Review of Computer Vision Techniques for Video Violence Detection and intelligent video surveillance systems. Int. J. 2022, 11, 62–70. [Google Scholar]
  4. Hsueh, Y.L.; Lie, W.N.; Guo, G.Y. Human Behavior Recognition from Multiview Videos. Inf. Sci. 2020, 517, 275–296. [Google Scholar] [CrossRef]
  5. Hu, K.; Jin, J.; Zheng, F.; Weng, L.; Ding, Y. Overview of behavior recognition based on deep learning. Artif. Intell. Rev. 2023, 56, 1833–1865. [Google Scholar] [CrossRef]
  6. Jannat, T.; Sayeed, A.; Afrin, S. Supervised Linear Discriminant Analysis for Dimension Reduction and Hyperspectral Image Classification Method Based on 2D-3D CNN. In Proceedings of the 2021 International Conference on Automation, Control and Mechatronics for Industry 4.0 (ACMI), Rajshashi, Bangladesh, 8–9 July 2021; pp. 1–6. [Google Scholar]
  7. Sánchez, F.L.; Hupont, I.; Tabik, S.; Herrera, F. Revisiting crowd behavior analysis through deep learning: Taxonomy, anomaly detection, crowd emotions, datasets, opportunities and prospects. Inf. Fusion 2020, 64, 318–335. [Google Scholar] [CrossRef] [PubMed]
  8. Munteanu, D.; Moina, D.; Zamfir, C.G.; Petrea, Ș.M.; Cristea, D.S.; Munteanu, N. Sea Mine Detection Framework Using YOLO, SSD and EfficientDet Deep Learning Models. Sensors 2022, 23, 9536. [Google Scholar] [CrossRef] [PubMed]
  9. Oroceo, P.P.; Kim, J.-I.; Caliwag, E.M.F.; Kim, S.-H.; Lim, W. Optimizing Face Recognition Inference with a Collaborative Edge–Cloud Network. Sensors 2022, 22, 8371. [Google Scholar] [CrossRef] [PubMed]
  10. Gul, M.A.; Yousaf, M.H.; Nawaz, S.; Ur Rehman, Z.; Kim, H. Patient Monitoring by Abnormal Human Activity Recognition Based on CNN Architecture. Electronics 2020, 9, 1993. [Google Scholar] [CrossRef]
  11. Zhang, Z.; Jin, Y.; Feng, S.; Li, Y.; Wang, T.; Tian, H. FENet: An Efficient Feature Excitation Network for Video-based Human Action Recognition. In Proceedings of the 16th IEEE International Conference on Signal Processing (ICSP), Beijing, China, 21–24 October 2022; Volume 1, pp. 540–544. [Google Scholar]
  12. Kong, Y.; Fu, Y. Human Action Recognition and Prediction: A Survey. Int. J. Comput. Vis. 2022, 130, 1366–1401. [Google Scholar] [CrossRef]
  13. Vrskova, R.; Hudec, R.; Kamencay, P.; Sykora, P. Human Activity Classification Using the 3DCNN Architecture. Appl. Sci. 2022, 12, 931. [Google Scholar] [CrossRef]
  14. Wang, Y.; Sun, J. Video Human Action Recognition Algorithm Based on Double Branch 3D-CNN. In Proceedings of the 15th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Beijing, China, 5–7 November 2022; pp. 1–6. [Google Scholar]
  15. Bouali, S.N.; Amara, N.E.B. 3D CNN for Human Action Recognition. In Proceedings of the 18th International Multi-Conference on Systems, Signals & Devices (SSD), Monastir, Tunisia, 22–25 March 2021; pp. 276–282. [Google Scholar]
  16. Vahora, S.; Galiya, K.; Sapariya, H.; Varshney, S. Comprehensive Analysis of Crowd Behavior Techniques: A Thorough Exploration. Int. J. Comput. Digit. Syst. 2021, 11, 991–1007. [Google Scholar] [CrossRef]
  17. Elbishlawi, S.; Abdelpakey, M.H.; Eltantawy, A.; Shehata, M.S.; Mohamed, M.M. Deep Learning-Based Crowd Scene Analysis Survey. J. Imag. 2020, 6, 95. [Google Scholar] [CrossRef] [PubMed]
  18. Lazaridis, L.; Dimou, A.; Daras, P. Abnormal Behavior Detection in Crowded Scenes Using Density Heatmaps and Optical Flow. In Proceedings of the 26th European Signal Processing Conference (EUSIPCO), Rome, Italy, 3–7 September 2018; pp. 2060–2064. [Google Scholar]
  19. You, Q.; Jiang, H. Action4d: Online action recognition in the crowd and clutter. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 11857–11866. [Google Scholar]
  20. Liu, C.; Tao, Y.; Liang, J.; Li, K.; Chen, Y. Object Detection Based on YOLO Network. In Proceedings of the IEEE 4th Information Technology and Mechatronics Engineering Conference (ITOEC), Chongqing, China, 14–16 December 2018; pp. 799–803. [Google Scholar]
  21. Maity, M.; Banerjee, S.; Chaudhur, S.S. Faster R-CNN and YOLO based Vehicle detection: A Survey. In Proceedings of the 5th International Conference on Computing Methodologies and Communication (ICCMC), Erode, India, 8–10 April 2021; pp. 1442–1447. [Google Scholar]
  22. Fang, W.; Wang, L.; Ren, P. Tinier-YOLO: A Real-Time Object Detection Method for Constrained Environments. IEEE Access 2020, 8, 1935–1944. [Google Scholar] [CrossRef]
  23. Rahman, Z.; Ami, A.M.; Ullah, M.A. A Real-Time Wrong-Way Vehicle Detection Based on YOLO and Centroid Tracking. In Proceedings of the IEEE Region 10 Symposium (TENSYMP), Dhaka, Bangladesh, 5–7 June 2020; pp. 916–920. [Google Scholar]
  24. Sukumar, S.; Libish, T.M. Centroid Based Human Annotation for Object Tracking. In Proceedings of the 2nd International Conference on Trends in Electronics and Informatics (ICOEI), Tirunelveli, India, 11–12 May 2018; pp. 377–380. [Google Scholar]
  25. Nagrath, P.; Jain, R.; Madan, A.; Arora, R.; Kataria, P.; Hemanth, J. SSDMNV2: A real time DNN-based face mask detection system using single shot multibox detector and MobileNetV2. Sustain. Cities Soc. 2021, 66, 102964. [Google Scholar] [CrossRef] [PubMed]
  26. Saxen, F.; Werner, P.; Handrich, S.; Othman, E.; Dinges, L.; Al-Hamadi, A. Face Attribute Detection with MobileNetV2 and NasNet-Mobile. In Proceedings of the 11th International Symposium on Image and Signal Processing and Analysis (ISPA), Dubrovnik, Croatia, 23–25 September 2019; pp. 176–180. [Google Scholar]
  27. Singh, R.; Kushwaha, A.K.S.; Chandni; Srivastava, R. Recent trends in human activity recognition–A comparative study. Cogn. Syst. Res. 2023, 77, 30–44. [Google Scholar] [CrossRef]
  28. Size Korea. Available online: https://sizekorea.kr (accessed on 22 January 2024).
  29. AI-Hub. Available online: https://aihub.or.kr (accessed on 22 January 2024).
  30. Dileep, P.; Das, D.; Bora, P.K. Dense Layer Dropout Based CNN Architecture for Automatic Modulation Classification. In Proceedings of the National Conference on Communications (NCC), Kharagpur, India, 21–23 February 2020; pp. 1–5. [Google Scholar]
  31. Garbin, C.; Zhu, X.; Marques, O. Dropout vs. batch normalization: An empirical study of their impact to deep learning. Multimed. Tools Appl. 2020, 79, 12777–12815. [Google Scholar] [CrossRef]
  32. Ide, H.; Kurita, T. Improvement of learning for CNN with ReLU activation by sparse regularization. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; pp. 2684–2961. [Google Scholar]
  33. Sharma, S. Activation functions in neural networks. Data Sci. 2017, 6, 310–316. [Google Scholar] [CrossRef]
  34. Luo, Z.; Chen, Y.; Li, C.; Xiong, X.; Zhu, L. Minimum BER Criterion and Adaptive Moment Estimation Based Enhanced ICA for Wireless Communications. IEEE Access 2020, 8, 152071–152080. [Google Scholar] [CrossRef]
  35. Kumar, A.; Sarkar, S.; Pradhan, C. Malaria Disease Detection Using CNN Technique with SGD, RMSprop and ADAM Optimizers. Deep Learn. Tech. Biomed. Health Inform. 2019, 68, 211–230. [Google Scholar]
Figure 1. 2D CNN and 3D CNN.
Figure 2. Bounding box coordinates of the detected object.
Figure 3. Centroid tracking.
Figure 4. General process of behavior recognition using deep learning.
Figure 5. Proposed system architecture.
Figure 6. System operation flowchart of GLBRF.
Figure 7. Data collection process for building a training dataset.
Figure 8. Fine-tuned MobileNetV2 model.
Figure 9. Recognition of people and calculation of centroid coordinates.
Figure 10. Result of applying grouping.
Figure 11. Training accuracy and loss graph.
Figure 12. GLBRF results with 10 frames.
Figure 13. GLBRF application results.
Figure 14. Training accuracy and loss graph using dataset without grouping.
Figure 15. Behavior recognition without grouping.
Figure 16. Performance comparison with and without grouping.
Figure 17. Graph showing the dataset size and number of parameters of behavior recognition models.
Figure 18. Comparison of violence behavior recognition models.
Table 1. System specifications and libraries.
CPU         Intel Xeon W-2123 (Octa-Core)
GPU         NVIDIA GeForce RTX 2080 Ti
RAM         32 GB
OS          Ubuntu 20.04.6 LTS
TensorFlow  2.5.0
Keras       TensorFlow built-in
OpenCV      4.4.1

Table 2. Dataset information and learning parameters.
Dataset (images)   2000 (violence: 1000, normal: 1000)
Train/test split   80:20
Epochs             30
Batch size         32
Learning rate      0.0001

Table 3. Behavior recognition training results with grouped datasets.
Training accuracy     98.0%
Validation accuracy   99.0%
Training loss         0.10
Validation loss       0.05

Table 4. Behavior recognition training results without grouped datasets.
Training accuracy     68.7%
Validation accuracy   69.8%
Training loss         0.51
Validation loss       0.46

Table 5. Comparison of violence behavior recognition models.
                  GLBRF   3D ResNet   I3D
Accuracy (%)      98.0    66.7        71.2
Parameters (M)    3.3     33.4        25.0
Dataset (MB)      25      250         500
FLOPs (B)         0.3     5.1         8.8