A deep model method for recognizing activities of workers on offshore drilling platform by multistage convolutional pose machine


Journal of Loss Prevention in the Process Industries 64 (2020) 104043


Faming Gong a, Yuhui Ma a, Pan Zheng d, Tao Song a, b, c, *

a College of Computer and Communication Engineering, China University of Petroleum, Qingdao, 266580, Shandong, China
b Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai), Zhuhai, 519082, China
c Department of Artificial Intelligence, Faculty of Computer Science, Polytechnical University of Madrid, Campus de Montegancedo, Boadilla del Monte, 28660, Madrid, Spain
d Department of Accounting and Information Systems, University of Canterbury, Christchurch, 8041, New Zealand

ARTICLE INFO

Keywords: Action recognition; Multi-level convolutional pose machine; Activity recognition; Offshore drilling platform; Multi-rule region proposal marker algorithm

ABSTRACT

The growing diversity of image scenes poses a great challenge to human activity recognition in practice. Traditional activity recognition methods cannot satisfy the demand for precise action recognition in complex scenes. In this work, we build a training set of workers' activities on an offshore drilling platform by collecting data from the platform's surveillance system, and we then propose and train an improved multi-level convolutional pose machine (MCPM) to recognize the activities of workers on the platform. For human object detection, a multi-rule region proposal marker algorithm is developed to separate the seawater area, and pipelines that resemble personnel are pre-discriminated by a support vector machine. We exploit the fact that human body key-points are not affected by complex background noise to assist the detection of human targets. The results show that our method performs better than the Faster-RCNN, MobileNet-SSD, and SSD algorithms in detecting human targets on the offshore drilling platform, and achieves good accuracy in recognizing several key activities. To the best of our knowledge, this is the first attempt to use a deep model to recognize workers' activities on an offshore drilling platform.

1. Introduction

Human action recognition is a key research direction in the fields of computer vision and pattern recognition, owing to the complexity and diversity of human activity. In order to quickly locate the moment an action occurs in a video and to understand the behavior of workers on an offshore drilling platform in a timely manner, human action recognition can be divided into three parts: object detection, human pose estimation, and action recognition and detection (Toshev and Szegedy, 2014; Chéron et al., 2015; Redmon et al., 2016). In recent years, significant progress has been made in research on human action recognition. Early methods were usually based on appearance and optical flow (Simonyan and Zisserman, 2014; Peng and Schmid, 2017; Wang et al., 2017), while dynamic human key-point modeling methods received less attention. With further research and development, human key-point detection based on deep learning has had a significant impact on the performance of action recognition, for example the hierarchical co-occurrence network (Li et al., 2018) and spatial temporal graph convolutional networks (Yan et al., 2018). The human action recognition algorithms from (Singh et al., 2016) have performed prominently in single, specific scenarios, but they are affected by background changes in a complex scene such as an offshore drilling platform, and it is difficult to ensure high recognition accuracy.

There are unique challenges in accurately identifying and analyzing human targets on offshore drilling platforms. The weather changes often, and target detection is greatly affected by environmental factors such as fog and strong illumination. In addition, many cameras are not installed at reasonable positions, which leads to problems such as distant human targets being too small and people being physically obstructed in the bird's-eye view. The human body may also be obstructed by the dense pipelines on offshore drilling platforms. Since the color of the pipelines is similar to the color of the workers' safety suits, the flow of seawater and the influence of the complex scene make human action recognition even more difficult. Some attempts (Li et al., 2016; Insafutdinov et al., 2017; Singh et al., 2018) have been made to solve real-time online action recognition in complex scenarios, but the performance of existing methods remains far from satisfactory.

* Corresponding author. College of Computer and Communication Engineering, China University of Petroleum, Qingdao, 266580, Shandong, China. E-mail address: [email protected].
https://doi.org/10.1016/j.jlp.2020.104043
Received 25 March 2019; Received in revised form 2 January 2020; Accepted 2 January 2020; Available online 10 January 2020
0950-4230/© 2020 Elsevier Ltd. All rights reserved.


Fig. 1. Human action recognition algorithm flowchart. The framework can be divided into four phases: data pre-processing, SVM-based posterior discrimination, human target detection, and human key-point detection.

Human pose estimation focuses on the identification and location of the key-points of human targets in images (Simonyan and Zisserman, 2014; Limin et al., 2015; Kaiming et al., 2018), which promotes the use of deep convolutional neural networks to further solve the problem of action recognition. A combined convolutional neural network and cascading method is proposed in (Chen et al., 2018). In that method, the coordinates of each joint node are initially calculated, and the corresponding partial image is then obtained in the original image according to these coordinates to improve the recognition accuracy. However, each node's coordinates require a repeated convolution operation, which increases the time complexity and cannot meet the requirements of real-time online identification. Moreover, static single-frame human pose estimation relies only on spatial information, which makes it difficult to solve the problems of body-part occlusion and continuous action recognition (Wang et al., 2016). For scenes with more occlusion and interference factors, most methods are limited to datasets with relatively simple backgrounds. The performance indicators reported for pose estimation in complex scenarios are still very low and cannot simply be extended to more practical engineering scenarios. We propose an improved multi-level convolutional pose machine (MCPM) and apply it to human pose estimation with low time complexity (Chen et al., 2017); it integrates all levels of feature maps and the distance relationships between key-points through progressively refined pipelines.

The main purpose of this research is to design a deep model method for recognizing the activities of workers in a specific practical application scenario, making full use of the fact that the key-points of the human body are not affected by the noise of a complex background. This method uses the key-point information of the human body as a high-level feature for action recognition, and establishes the relative positional relationships of key-points by utilizing the length of the vector displacement between various parts of the human body. It provides a new solution to the problem of predicting and estimating invisible key-points in complex scenes. In addition, the locations of the key-points of the human body can also assist target detection, thereby forming a closed-loop feedback mechanism that improves the accuracy of recognition. It is the first attempt to combine real-time online action recognition tasks on an offshore drilling platform with human key-point information.

This work considers recognizing the activities of workers on offshore drilling platforms. Since offshore drilling platforms have particular features, such as being far away from land, complex sea conditions, and difficulty of escape, oil drilling operations on offshore drilling platforms face various hazards such as falling into the sea and equipment collapse. This requires detecting and identifying human targets in a large number of surveillance videos, but this task is difficult to do manually (Carreira et al., 2016; Zhu et al., 2018). Therefore, it is especially important to monitor the status of offshore drilling platforms in real time. In addition, an early warning of an ongoing action helps in understanding the entire process of an accident or other event, thereby establishing an early warning system for safety events.

In public places equipped with monitoring facilities, the application of human action recognition facilitates management and security. It is important to analyze in real time whether a worker's behavior is abnormal and to find the time period when the event occurred (Du et al., 2017; Huang et al., 2017; Kalogeiton et al., 2017). The deep model method for recognizing activities of workers on an offshore drilling platform by a multistage convolutional pose machine is suitable for practical engineering scenes where the human body is seriously occluded and there are many interferences. In the data preparation stage, a multi-rule region proposal marker (M-RPM) method is proposed to separate the ocean area from the land area, and a support vector machine (SVM) is designed to pre-identify the stationary pipelines. After that, we design an improved MCPM to detect the key-points of the human body, through which the positions of workers can be detected. The coordinates of all of the joint points of the human body can be extracted during the key-point detection phase, thereby improving the accuracy of human action recognition. The motion trajectory, which is formed by the coordinate sequence of the key-points of the human body, is recorded as spatial information, and the optical flow trajectory as temporal information. Combining these two features makes full use of spatial structure information and time-series structure information to realize frame-level classification and recognition of human body motion. The main contributions of this paper are as follows.

(1) For the problem of partial occlusion, the improved MCPM approach is proposed, which makes full use of the fact that the key-points of the human body are not affected by the noise of a complex background, thus improving the accuracy of recognition.
(2) We designed a framework for action recognition based on human key-points in complex scenes, which is the first attempt to combine a real-time online action recognition task in a specific practical application scenario with human key-point information.
(3) We demonstrated the effectiveness and real-time performance of our proposed improved MCPM approach by applying the method to complex practical engineering scenarios to classify and identify human activities from the surveillance video of an offshore drilling platform.

2. Method

Since there is no public data set of activities of workers on offshore drilling platforms, we take video recorded from an offshore drilling platform as input and generate single-frame static maps and multi-frame optical flow maps. The object detection model is trained on our labeled image dataset. Bounding boxes and coordinate information provide the data source for key-point extraction. We combine object detection and the MCPM algorithm to obtain key-points, whose coordinate information is regarded as spatial flow information. The action trajectory and the optical flow trajectory are then merged and superimposed to realize human action classification and recognition. The flow of our method is shown in Fig. 1.


2.1. Data pre-processing by M-RPM

Since there are plenty of marine areas in the surveillance video acquired by the offshore drilling platform monitoring center, the flow of seawater can affect the accuracy of object detection. To solve this problem, we propose a data pre-processing method based on M-RPM. According to the scenes defined for this scenario, an over-segmentation method is proposed to segment the image into small areas, which are then numbered sequentially. The two most probable regional schemes are generated by the selective search and merging rules (Girshick et al., 2014). This process is repeated until the entire image is merged into two areas whose boundary is relatively obvious, that is, the marine area and the non-marine area.

In the image-area division process, the following four areas are prioritized to be merged: (1) the texture-similar area, (2) the color-similar area, (3) the combined area with the minimum size, and (4) the combined area that has the largest proportion of its original image. The first two are measured by the gradient histogram and the color histogram, respectively. The minimum-area retention rule ensures that the merging operation is uniform in size and avoids large regions continuously annexing other small regions. The maximum duty-cycle principle ensures that the shape of the merged regions is regular. The area features generated by color, texture, area, and location can be calculated directly from the characteristics of the sub-regions, which speeds up the calculation. Marking the area is also a pre-criterion rule for the subsequent object recognition results.

Any input image is divided equally into 9 areas of the same size. The middlemost area, which has 8 neighboring areas, is taken as the initial area. Initially, the middlemost area is randomly merged with one of its nearest neighboring areas. The obtained area is then randomly merged with one of the areas that is smaller than it. In this way, the areas are merged sequentially into an image. The merging operation is performed simultaneously in the RGB, HSV, and Lab color spaces. By removing the duplicated areas of the results obtained in the three color spaces, the boundaries of the marine and non-marine areas can be obtained, i.e., the label (marine or non-marine) of each area of an image can be obtained.

In non-marine areas, we use the optical flow method from (Yuan et al., 2016) and set an optical flow threshold to extract the effective human motion region in the video, by which video clips containing humans can be filtered for single-frame image conversion. The optical flow method assumes that the target moving distance is small enough that the time required for the movement can be neglected. The constraint equation of the optical flow image is

$I_x V_x + I_y V_y + I_z V_z = -I_t$,

where $I_x, I_y, I_z, I_t$ are the components (partial differentials) of $I(x, y, z, t)$ at $x, y, z, t$, respectively, and $V_x, V_y, V_z$ are the $x, y, z$ components of the optical flow vector of $I(x, y, z, t)$. The three partial differentials are the differences of the corresponding directions of the image at the pixels $x, y, z, t$. The optical flow graph is obtained by continuously extracting multiple frames at time $t$, and each pixel in the image is given a velocity vector to form a motion vector field; through this pre-processing operation we obtain single-frame still images and multi-frame optical flow images. We design a hash function that randomly extracts one frame from every 24 frames, which avoids selecting a large number of repeated frames.
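For illustration, the following sketch (not the authors' code) shows how the pre-processing step above could be realized: dense optical flow is computed on the non-marine region, a motion threshold keeps clips that contain people, and one frame is sampled at random from every 24 frames. OpenCV's Farneback flow and the threshold value are assumptions of this sketch.

```python
# Illustrative sketch of the pre-processing described above (not the paper's code).
import random
import cv2
import numpy as np

MOTION_THRESHOLD = 0.5   # assumed value; in practice tuned on the surveillance footage

def has_human_motion(prev_gray, curr_gray, mask=None):
    """Return True when the mean optical-flow magnitude exceeds the threshold."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)
    if mask is not None:                 # restrict to the non-marine area
        magnitude = magnitude[mask > 0]
    return magnitude.mean() > MOTION_THRESHOLD

def sample_frames(frames, block=24):
    """Randomly keep one frame from every block of 24 frames (hash-like sampling)."""
    return [random.choice(frames[i:i + block]) for i in range(0, len(frames), block)]
```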

2.2. Object detection

A single-shot multibox detector (SSD) method (Liu et al., 2016) is used to detect human targets in single frames of static images. We take color, shape, and texture as the main features from which different levels of feature units are formed. For input images of different sizes, a set of default bounding boxes is generated and features are extracted within these boxes. The set of default bounding boxes is associated with the feature map cells, and the feature maps are tiled in a convolution such that the position of each default bounding box is fixed. A filter with a small convolution kernel is used to predict the objects and calculate the confidence level. The confidence level is the primary screening stage of object detection. SVM discrimination is then performed to further estimate the target's prior probability. If the target is determined to be human, its bounding box is fine-tuned using a linear regression to adjust its position; otherwise, it is regarded as an invalid bounding box and rejected. The output is a series of discretized target bounding boxes, which are generated on different levels of feature maps and have different aspect ratios. Default bounding boxes with different aspect ratios are used at each location of the different scales of feature maps. In the confidence-discrimination process, the error and score of the actual bounding box are calculated for each set of default bounding boxes, and the category and confidence of all of the targets in the region can then be predicted. The most commonly used threshold for confidence is 0.6 (Liu et al., 2016; Redmon et al., 2016); that is, if IoU > 0.6, the result is considered a true detection, otherwise a false detection, where IoU is an evaluation parameter for measuring positioning accuracy (Girshick et al., 2014). We calculate the IoU value of each detection frame obtained by the model and compare it with the IoU threshold of 0.6 to count the correct detections for each category in each image. The default bounding box is matched with any actual bounding box whose value is higher than the threshold, and the matching process can be simplified by SVM posterior discrimination. Moreover, a plurality of overlapping default bounding boxes is allowed, instead of choosing only the box with the largest degree of overlap for scoring prediction.

2.2.1. The loss function

The model loss calculation is performed by the loss function [31]

$L(e) = \frac{1}{2}(\alpha - y)^2$, (1)

where $L(e)$ is the loss function, $y$ the expected output, and $\alpha$ the actual output. If there is a variance between the actual output and the expected output, then the greater the variance, the greater the loss. In practice, the distribution of $y$ cannot be calculated, but it can be estimated from the value of $\alpha$. The value of $\alpha$ can then be used to calculate the cross-entropy $L(\alpha, y)$ by

$L(\alpha, y) = -\sum_{i} y_i \log(\alpha_i)$, (2)

where $\alpha_i$ is the actual output of the $i$th default bounding box and $y_i$ is the expected output of the $i$th default bounding box. The average cross-entropy $L_{ACE}(\alpha, y)$ over $n$ default bounding boxes is

$L_{ACE}(\alpha, y) = -\frac{1}{n}\sum_{n}\sum_{i} y_{i,n} \log(\alpha_{i,n})$. (3)

We finally choose the average cross-entropy as the loss function, since it considers the situation of multiple samples, covers many possibilities, and is suitable for multi-object classification.

2.2.2. SVM for posterior discrimination

We pre-train an SVM on a large number of manually annotated images to obtain a classifier for human and cylindrical pipeline targets. After the initial confidence level is discriminated, local SVM classification and re-discrimination are carried out. The identified cylindrical pipelines are regarded as negative samples and removed, and only the confidence of the positive sample type is estimated to determine whether it is a real target, which decreases the amount of calculation spent on negative samples. The overall objective loss function, obtained through this double discrimination, is the weighted sum of the confidence loss and the localized score loss, calculated by

$L(\alpha, c, f) = \frac{1}{N}\left[L(\alpha, c) + \delta L(\alpha, f)\right]$. (4)

The initial weight is set to 1 by cross-validation.
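A minimal numpy sketch of the losses in Eqs. (2)–(4) is given below for illustration only; the array shapes, one-hot encoding, and delta = 1 are assumptions, and the paper's actual implementation (in Caffe) is not reproduced here.

```python
# Illustrative numpy sketch of Eqs. (2)-(4); not the authors' implementation.
import numpy as np

def average_cross_entropy(alpha, y, eps=1e-12):
    """Eq. (3): alpha, y have shape (n_boxes, n_classes); y is one-hot."""
    return -np.mean(np.sum(y * np.log(alpha + eps), axis=1))

def overall_objective(conf_loss, loc_loss, n_matched, delta=1.0):
    """Eq. (4): weighted sum of confidence and localization losses."""
    if n_matched == 0:           # no default box matches the ground truth
        return 0.0
    return (conf_loss + delta * loc_loss) / n_matched
```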


When the expected output is evaluated by confidence, the confidence level is $c$. The loss function of confidence is

$L(\alpha, c) = -\frac{1}{N}\sum_{N}\sum_{i} y_{i,N} \log(\alpha_{i,N})$, (5)

where $N$ is the number of default bounding boxes that match the actual bounding box; if $N = 0$, the confidence loss is set to zero. Let $\alpha^{p}_{ij} = 1$ if the $i$th default bounding box matches the $j$th actual bounding box of category $p$; otherwise, $\alpha^{p}_{ij} = 0$. The localized scoring loss is

$L(\alpha, f) = \sum_{j \neq a_i} \max\left(0,\; f_j - f_{a_i} + \Delta\right)$, (6)

where $f_j - f_{a_i} + \Delta$ is the scoring margin of the default bounding box; if it matches the actual bounding-box score, the value of $\Delta$ is 1. We focus on minimizing the localized scoring loss function to find a global minimum in a gradual process, which minimizes the difference in scoring and yields more accurate predictions. The target bounding box is adjusted to better match the shape of the object.

2.3. Human joint node extraction based on improved MCPM

The MCPM consists of a series of convolutional networks, each consisting of a series of predictors. Each phase of the MCPM repeatedly generates a two-dimensional (2D) confidence map for each part of the body (Fang et al., 2017), which, along with the image features, serves as input to the next stage to provide a more accurate estimate of the positions of the joints of the body. The images to be input to the MCPM are enlarged by discretizing the target bounding-box coordinates. An example of enlarging the input image to the MCPM is shown in Fig. 2.

Fig. 2. Images with the general or enlarged bounding box on the offshore drilling platform.

The human target bounding box obtained by object detection has a partial error within a certain range. As shown in Fig. 2(a), the target's hands and feet do not completely appear in the blue bounding box. We enlarge the perceived field by expanding the original bounding box at a ratio of 1.0–1.2 times, and an example of the enlarged image is shown in Fig. 2(b). An improved MCPM algorithm is proposed for human joint node extraction, which can be separated into multiple stages.
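The following sketch illustrates the bounding-box enlargement just described: the detected box is expanded by a factor between 1.0 and 1.2 about its center and clipped to the image, so that hands and feet cut off by the detector are recovered before the crop is passed to the MCPM. The function name and the fixed ratio are assumptions of the sketch.

```python
# Illustrative box-enlargement sketch under stated assumptions.
def enlarge_box(box, image_w, image_h, ratio=1.2):
    """box = (x1, y1, x2, y2) in pixels; returns the enlarged box clipped to the image."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w, half_h = (x2 - x1) * ratio / 2.0, (y2 - y1) * ratio / 2.0
    return (max(0, cx - half_w), max(0, cy - half_h),
            min(image_w, cx + half_w), min(image_h, cy + half_h))
```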

Stage 1: We predict the confidence value of each part from the color image with a basic convolutional network, thus generating a corresponding confidence map. The human body is divided into $P$ model parts, so a total of $P+1$ confidence maps can be obtained; the default value of $P$ is 14 (Cao et al., 2017). Assume that $x$ is a pixel with salient features in the image. The original image is input into the network, and the salient features in the image are extracted by the convolution operation. We use $C_1$ to denote the first-stage classifier, which can roughly predict the location of each part, resulting in the confidence map of each part. The classifier $C_1$ at pixel position $x_i$ is calculated by

$C_1(x_i) \rightarrow \left\{ b_1^p(Y_p) \right\}_{p \in \{0,\dots,P\},\, i \in \{0,\dots,Z\}}$, (7)

where $Z \subset R^2$ represents the pixel space of the image, $x_i$ is the position of each pixel in the image, $p$ is a specific model part, and $b_1^p(Y_p)$ is the confidence value of part $p$ in the first stage.
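A hedged PyTorch sketch of such a Stage-1 head is shown below (the paper's implementation uses Caffe): a small convolutional network maps the input image to P+1 confidence maps, one per body part plus one background map. The layer sizes are illustrative, not the authors' exact configuration.

```python
# Illustrative Stage-1 confidence-map head; layer sizes are assumptions.
import torch.nn as nn

class StageOneHead(nn.Module):
    def __init__(self, num_parts=14):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=9, padding=4), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=9, padding=4), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(128, 128, kernel_size=9, padding=4), nn.ReLU(inplace=True),
        )
        # one confidence map per part plus one background map (P + 1 in total)
        self.score = nn.Conv2d(128, num_parts + 1, kernel_size=1)

    def forward(self, image):
        return self.score(self.features(image))   # (batch, P + 1, H', W')
```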


Fig. 4. The structure of MCPM network.

Stage 2: The confidence maps and image features obtained in Stage 1 are taken as the input of Stage 2. Since we enlarge the selected regions, the feature functions should include the image data features, the location maps of the parts from the previous stage, and the context information of each classifier. The classifier $C_2$ predicts the position of each part as a correction of the position predicted in the previous stage. The overall goal function $F(t)$ is

$F(t) = \sum_{t=1}^{T}\sum_{p=1}^{P}\sum_{i \in Z} \left\| b_t^p(i) - b_*^p(i) \right\|, \quad T \geq 2,$ (8)

where $b_*^p(i)$ denotes the ideal confidence value at stage $t \in T$. The two stages are iterated to make the position prediction more accurate, and ultimately a more precise location of each part is obtained. Our improved MCPM algorithm extracts the features of the input image at each scale to obtain the confidence maps of all parts of the human body. The confidence maps of all scales are then accumulated for each part, and the point with the highest confidence in the total confidence map, that is, the human joint position, can be found.

Stage n: The MCPM consists of a series of convolutional networks as designed in (Wei et al., 2016), each of which is composed of a series of predictors. Each phase of the MCPM repeatedly generates a 2D confidence map for each part of the body, which, along with the image features, serves as input to the next stage to provide a more accurate estimate of the positions of the key-points of the human body. It uses a sequential convolutional architecture to express spatial information and texture information. The sequential convolution architecture manifests itself in the multiple phases of the network, each with supervised training (Newell et al., 2016). The first stage uses the original image as input, and the later stages use the feature map of the previous stage as input, mainly to fuse spatial information and texture information. Secondly, a large convolution kernel is used to obtain a large receptive field, from which the relative positional relationship between the key-points is obtained; this is very effective for inferring occluded key-points. Finally, in the training phase, the MCPM uses relay intermediate supervision to calculate the loss at the output of each stage, which addresses the problem of vanishing gradients and can guarantee the normal update of the underlying parameters, as shown in Fig. 3.

Fig. 3 shows how intermediate supervision solves the problem of vanishing gradients as the depth increases. In Fig. 3(a), the gradient is propagated directly across the entire network; that is, the score is compared to the ground truth only at Stage 6. The model error is calculated by the loss function, and the update and optimization of the network weights is guided by gradient back-propagation. Without intermediate supervision, the error of the output layer is greatly reduced by multi-layer back-propagation, resulting in a relatively uniform top-layer gradient distribution while the bottom-layer gradients are concentrated near 0, so the network may suffer from vanishing gradients. Our MCPM algorithm calculates the loss at the output of each stage. As shown in Fig. 3(b), with intermediate supervision the gradient distribution of each layer is good, and the network can be updated normally during training. The normal update of the underlying parameters solves the problem of vanishing gradients.

Fig. 3. The topological structure of MCPM without or with intermediate supervision.

The MCPM network shown in Fig. 4 takes a color image as input. The center map (green) is a pre-generated Gaussian function template that is used to converge the response to the center of the image. Taking the whole-body model as an example, the network can be divided into six stages, each of which outputs the response map of each part (the blue score), and the network uses the response-map output of the last stage. The first stage is a basic convolutional network (gray convs), which directly predicts the response of each part from the color image. The whole-body model has 13 components and a background response, for a total of 14 layers of response. The second stage also predicts the response of each part from the color image, but there is a concatenation layer (yellow concat) in the middle of the convolutional layers. It combines the following three kinds of data: the staged convolution result (46*46*32), which is the texture feature; the response of each part from the previous stage (46*46*15), which is the spatial feature; and the center map (46*46*1). The spatial size after concatenation is unchanged, and the network depth becomes 48 layers (32 + 15 + 1). The third stage does not use the original image as input, but takes a feature image with a depth of 128 layers from the second stage as input. The same three kinds of data, namely texture features, spatial features, and center constraints, are again combined using a concatenation layer. The network structure of the fourth and subsequent stages is exactly the same as that of the third stage. When designing other network structures, such as the half-length model, it is only necessary to adjust the number of body parts from 14 to 10 and repeat the third-stage structure. Each stage of the MCPM network outputs the predicted result of the joint points, repeatedly outputs the belief maps of each joint-point position, estimates the position of each joint point in a progressively refined manner, and can infer the position information of the occluded key-points.
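A hedged PyTorch sketch of one refinement stage and of the intermediate supervision discussed above is given below (the authors use Caffe). The 32 + 15 + 1 = 48 input channels follow the description of Fig. 4; the layer depths, kernel sizes, and the use of a squared-error stage loss are assumptions of the sketch.

```python
# Illustrative refinement stage with concatenated texture features, previous-stage
# part responses, and the center map; plus a summed per-stage (intermediate) loss.
import torch
import torch.nn as nn

class RefineStage(nn.Module):
    def __init__(self, feat_ch=32, map_ch=15):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(feat_ch + map_ch + 1, 128, kernel_size=7, padding=3), nn.ReLU(True),
            nn.Conv2d(128, 128, kernel_size=7, padding=3), nn.ReLU(True),
            nn.Conv2d(128, map_ch, kernel_size=1),          # refined part responses
        )

    def forward(self, texture_feat, prev_maps, center_map):
        x = torch.cat([texture_feat, prev_maps, center_map], dim=1)   # 48 channels
        return self.refine(x)

def intermediate_supervision_loss(stage_outputs, target_maps):
    """Sum per-stage distances to the ideal confidence maps (the role of Eq. (8))."""
    return sum(((out - target_maps) ** 2).sum() for out in stage_outputs)
```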

2.4. Human action classification and recognition

In human body movement analysis, the details of local actions should be the focus, but the action details in video surveillance tend not to be obvious. Through hierarchical processing of the human joint-point coordinates, a rough classification can be obtained (Yeung et al., 2016; Xiong et al., 2016). The activities are roughly classified into head actions, upper-limb actions, trunk actions, and lower-limb actions by judging the degree of change of the positions of the joint points of the human body. For different types of activities, the trajectory focus is quite different. For instance, upper- and lower-extremity movements focus on the trajectory changes of hand and leg joints, whereas trunk movements often focus on the trajectory changes of body-centered joints (Yin et al., 2016). The key-points of each group of roughly classified actions can be obtained by our improved MCPM algorithm.

For local movement recognition, the motion trajectory is represented by the key-point sequence of the roughly classified action, and dense optical flow trajectories are obtained by superimposing multi-frame optical flows. The entire motion sequence is used for recognition from the perspectives of space and time. The spatial stream maps each track point to a human joint point on a single frame of a static image. The temporal stream identifies the motion in the form of a dense optical flow. Finally, by comparing the similarity between the two trajectories, the task of classification and recognition is completed. The above process is shown in Fig. 5.

Fig. 5. Human action recognition based on key-point sequences. Taking the original video data as input, the key-point sequence is obtained by pose estimation, the action category is obtained by using temporal and spatial information for classification, and finally the key-point sequence is synthesized into the original video, as in the recognition results for sitting and walking.
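A minimal sketch of the trajectory comparison described in Section 2.4 is given below, under assumptions: the key-point sequence (spatial stream) and the superimposed optical flow (temporal stream) are flattened into trajectory vectors and matched against stored action templates by cosine similarity. The similarity measure and the template library are illustrative, not the authors' exact procedure.

```python
# Illustrative two-stream trajectory matching; the similarity measure is an assumption.
import numpy as np

def trajectory_vector(keypoint_seq, flow_seq):
    """keypoint_seq: (frames, 14, 2) joint coordinates; flow_seq: (frames, 2) mean flow."""
    return np.concatenate([np.asarray(keypoint_seq).ravel(),
                           np.asarray(flow_seq).ravel()])

def classify_action(sample_vec, templates):
    """templates: dict mapping action label -> template trajectory vector."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return max(templates, key=lambda label: cosine(sample_vec, templates[label]))
```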

Fig. 6. Training samples histogram by activity type.

3. Experimental results and analysis

3.1. Experiment preparation

In these experiments, an ordinary desktop computer was used as the hardware platform (CPU, E5-2609 v2; clock frequency, 2.5 GHz; 8 processors; memory, 32 GB; graphics cards, Nvidia GTX 1080 Ti ×2). The software environment was the Windows 10 operating system, MatLab R2016a (MathWorks, USA), the Caffe deep-learning framework, and the Python development language. According to the framework flow of this paper, the experiment was divided into three parts: object detection, human key-point detection, and action recognition and classification. Thus, three experimental databases were created. In our approach, we divided the data sets into training, validation, and test sets in a 7:1:2 ratio. The entire training time over the experimental database was approximately 28 h on two 1080 Ti GPUs.

The raw data came from the streaming server of the deep-sea oil production platform. The monitoring equipment on each offshore drilling platform remains stationary, and the ocean working platform is used as the monitoring scene. The real-time monitoring video is transmitted and stored on the streaming media server via microwave. The tag dataset: the key-frame image extraction method is used to select images containing targets, and the images are then manually annotated to build the tag dataset for object detection. The dataset stores 20,000 target images with tag type and location information. Each scene is captured by a 204-way camera. The key-point dataset: it is formed by the detection of key-points of the human body and stores key-point sequences, including the name of the image, the 14 key-points of the human body, and the coordinate sequence of the key-points. The key activities dataset on the offshore drilling platform: we consider six kinds of action types from the perspective of security work: standing, walking, bending, falling, jumping, and making a call.


A total of 1000 sets of action sequences are collected as the standard for the human action model library. In addition, falling and calling activities do not occur frequently, so there are few images of falling and calling. The data distribution for each type of activity is shown in Table 1. Fig. 6 is a histogram of the training samples by activity type, which shows the data distribution of the activity types in detail. The frequently occurring actions account for more than 40% of the total data, because there are very few falls on the offshore drilling platform. The amount of training data collected for such actions is small, and their training may be less effective than that of regular actions.

Table 1. Data distribution of different action types.

Action category    Label    Data volume/group
Standing           0        350
Walking            1        350
Bending            2        75
Falling            3        50
Sitting            4        25
Making a call      5        75
Other              6        75

3.2. Experimental design and analysis

In the object detection experiments, the error rate for personnel targets is increased because the color of the pipelines is similar to the color of the staff's safety suits and because of the shape of the cylindrical pipelines. To avoid this, we proposed an SVM-based algorithm to pre-train an object classifier to distinguish personnel and pipeline targets. The classifier was then combined with the object detection model to reduce the error rate. The relationships between the recognition rate of the classifier, the size of the training set, and the training iterations are shown in Fig. 7.

Fig. 7. The relationship between the object detection recognition rate and the training sets.

As shown in Fig. 7, when the size of the training set increases, the recognition rate of the target does not increase linearly, but decreases after reaching a peak. The reason is that the extracted features in the same scene become too singular as the training set increases. If the feature distributions of the training and test sets are inconsistent, overfitting is likely to occur. Furthermore, due to the obvious color and shape features of the pipelines, the cylindrical pipeline classifier converges more easily than the human object classifier. We select the model at the peak as the object classifier and combine it with the object detection model to obtain the test results, see Fig. 8. Fig. 8(a) shows the original image of the ocean platform taken by the camera. The human target is disturbed by the cylindrical pipeline, see Fig. 8(b), in which the cylindrical pipeline has been misidentified as a human target. Fig. 8(c) shows the detection result after adding the object classifier, which can separately detect the human targets and the cylindrical pipelines. Fig. 8(d) gives the final result of the object detection after removing the negative samples.

Fig. 8. The process of classifiers identifying and eliminating cylindrical pipeline targets.

In addition, we adjust the color, brightness, and style of the image and blend it harmoniously into the original image (Wei et al., 2018). In data enhancement, we use elastic deformation to expand the data with a series of image transformations (Zhong et al., 2018), such as image translation, rotation, and scaling, and we complete the conversion of images captured by multiple cameras. Drawing on the SF-GAN method proposed by Zhan et al. (2019), we have expanded the training set for bending, falling, jumping, making phone calls, and smoking to improve the performance of our proposed method.
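A hedged sketch of the SVM pre-discrimination step is shown below: image patches cropped from the detector's candidate boxes are described by HOG features and classified as person vs. cylindrical pipeline, so that pipeline detections can be dropped as negative samples. The HOG descriptor and scikit-learn's LinearSVC are assumptions; the paper specifies only that an SVM classifier is pre-trained on annotated images.

```python
# Illustrative person-vs-pipeline discriminator; feature choice and library are assumptions.
import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.svm import LinearSVC

def hog_descriptor(patch):
    patch = resize(patch, (128, 64), anti_aliasing=True)   # fixed-size grayscale patch
    return hog(patch, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

def train_discriminator(person_patches, pipeline_patches):
    X = np.array([hog_descriptor(p) for p in person_patches + pipeline_patches])
    y = np.array([1] * len(person_patches) + [0] * len(pipeline_patches))
    return LinearSVC(C=1.0).fit(X, y)

def keep_person_boxes(classifier, patches, boxes):
    """Drop candidate boxes whose patch is classified as a pipeline (label 0)."""
    labels = classifier.predict(np.array([hog_descriptor(p) for p in patches]))
    return [b for b, lab in zip(boxes, labels) if lab == 1]
```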


After data enhancement, we generated 2000 images for the test set for performance testing. To verify the validity of this method, we compare our method with the Faster-RCNN (Girshick et al., 2014), MobileNet-SSD (Howard et al., 2017), and SSD512 (Liu et al., 2016) algorithms for object detection. The comparison results are shown in Table 2.

Table 2. Accuracy of object detection methods in the offshore drilling platform scenario.

Number of samples (sheets)    Faster-RCNN    MobileNet-SSD    SSD512    Our method
5000                          63.2%          64.8%            69.5%     73.7%
10000                         69.3%          68.9%            72.1%     79.6%
15000                         75.8%          74.2%            78.7%     84.5%
20000                         80.1%          79.3%            82.3%     87.4%
22000                         81.3%          80.1%            83.4%     88.0%

Table 2 shows the comparison between Faster-RCNN, MobileNet-SSD, SSD512, and our approach, which gives a significant improvement in speed and accuracy. Although the SSD algorithm performs well in terms of accuracy, we find that it is not suitable for multi-target detection in a single image because it causes targets to be missed during detection. MobileNet-SSD is a lightweight network and may perform slightly worse than the other methods. The experimental results show that our method performs better than the other algorithms in the same data-scale verification experiment. As the dataset size expands, the accuracy becomes much higher than that of the other three algorithms, which significantly improves the accuracy of object detection.

The improved MCPM algorithm detects all of the parts of the human body in order from head to toe (Pishchulin et al., 2016). We divide the human body into 14 sites: head, neck, right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, right hip, right knee, right ankle, left hip, left knee, and left ankle. In multi-person pose estimation, the joint points of the human body are estimated first. These joint points are then grouped into a graph, and the nodes in the graph are clustered to determine which person each node belongs to. In this way, the problem of counting people under crowded conditions can also be addressed. The confidence is calculated from the predicted position of each part and the response diagram of each part is generated, see Fig. 9. According to the sequence of key-points of the human body, the distribution of the key-points can be obtained, and the 2D joint map of the human body can finally be constructed.

We compared our method with CPM (Wei et al., 2016), PAF (Cao et al., 2017), CPN (Chen et al., 2018), and other methods on the same dataset. We trained on person-centric annotations and evaluated our method using the Percentage of Correct Key-points (PCK) metric (Wei et al., 2016). A toolkit from (Pishchulin et al., 2016) is used to measure the average accuracy (mAP) of all body parts based on the PCKh threshold. The results are shown in Table 3, which compares the mAP performance of our method and the other methods on the same subset of test data; the results of all methods were obtained from actual tests. We calculate the average of the output heatmaps and then use the maximum value of each heatmap as the final output position of the human key-points. The experiments show that, for the complex scene of offshore drilling platforms, our method outperforms the current methods by 10% mAP.

Besides the above measures, our method achieved a very good detection rate for walking behavior in different scenarios. Fig. 10 distinguishes the segmentation results in different colors.
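A hedged sketch of the key-point read-out described above is given below: the heatmaps produced by the stages are averaged, and the location of the maximum of each averaged heatmap is taken as the predicted key-point. The array shapes are assumptions.

```python
# Illustrative heatmap-to-coordinate read-out; shapes are assumptions.
import numpy as np

def keypoints_from_heatmaps(stage_heatmaps):
    """stage_heatmaps: array of shape (stages, parts, H, W). Returns (parts, 2) as (x, y)."""
    avg = np.mean(stage_heatmaps, axis=0)                    # (parts, H, W)
    coords = []
    for part_map in avg:
        y, x = np.unravel_index(np.argmax(part_map), part_map.shape)
        coords.append((int(x), int(y)))
    return np.array(coords)
```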

Fig. 9. The detection and distribution results of key-points of the human body in an offshore drilling platform scenario.

Table 3. The comparison of mAP for detecting key-points of the human body in the offshore drilling platform scenario (%).

Method          Head    Neck    Sho     Elb     Wri     Hip     Kne     Ank     mAP
Flow convnet    75.8    71.3    67.3    64.3    60.4    63.7    60.1    55.7    64.8
CPM             80.9    74.1    70.8    67.5    65.9    68.3    63.4    57.1    68.5
Deepcut         83.6    74.6    73.9    71.1    66.0    70.9    64.1    59.4    70.5
PAF             82.6    75.8    75.9    71.3    68.5    68.9    65.9    60.9    71.2
CPN             85.4    77.1    79.6    74.5    69.9    73.6    70.9    64.8    74.5
Ours            93.6    83.2    90.5    83.2    73.6    77.9    76.2    71.7    81.2
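For illustration, a hedged sketch of the PCK-style evaluation referred to above is given: a predicted key-point is counted as correct when its distance to the ground truth is below a fraction of a per-person reference length (the head-segment length in PCKh). The threshold value and the reference length supplied by the caller are assumptions of this sketch.

```python
# Illustrative PCK-style metric; threshold and reference length are assumptions.
import numpy as np

def pck(pred, gt, ref_length, threshold=0.5):
    """pred, gt: arrays of shape (num_people, parts, 2); ref_length: (num_people,)."""
    dists = np.linalg.norm(pred - gt, axis=2)                # (num_people, parts)
    correct = dists <= threshold * ref_length[:, None]
    return correct.mean(axis=0)                              # per-part accuracy
```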


Fig. 10. Action classification and location results.

Fig. 11. Training session’s progress over iterations.

Among them, green indicates a true action segmentation result in the ground truth; blue indicates a redundant non-action video segment; yellow indicates a true action segmentation result in the ground truth, but possibly only part of it; and red indicates extra or incorrectly detected action video clips. By quickly locating the target segment in the video and judging the activity state of the segment (Zhao et al., 2017; Song et al., 2018), we can extract the human body posture trajectory efficiently and stably from the video, which helps us better understand the behavior of the person in the video and the interaction between the person and the surrounding environment.

The initial learning rate of the RGB network was set to 0.1, and the initial learning rate of the optical flow network was set to 0.5. As the number of data iterations increases, the model parameters are continually updated toward more appropriate values. One advantage of training deep structures is that we do not need to traverse all of the samples, which is very effective when the amount of data is very large. We define an epoch according to the actual problem, taking 1000 iterations as 1 epoch. If the batch size of each iteration is set to 256, then 1 epoch is equivalent to 256,000 training samples. The progress of the test accuracy with the number of iterations is shown in Fig. 11. The loss decreases with the number of iterations, and the test accuracy is almost the same as the training accuracy. The test loss is slightly lower than the training loss. When convergence is reached, the loss is already small. Our model has good accuracy for regular actions.

The results of comparison experiments based on template-matching algorithms indicate that our framework is more accurate for the recognition of basic actions. Among the basic actions, the main movement trajectory for the standing and falling actions can be detected and correctly classified as trunk actions. For walking, the movements of the lower limbs and the whole body were mainly detected and classified as lower-limb actions. The trajectory of the upper limb was classified as upper-limb action. Fig. 12 shows that the recognition results of standing and falling actions are good. From the test results, we find that expanding the training set does not greatly change the recognition accuracy of regular actions, but the recognition of action types with scarce data is significantly improved.
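For reference, a small sketch of the iteration bookkeeping described above: with 1000 iterations defined as one epoch and a batch size of 256, one epoch covers 256,000 training samples. The values simply repeat the settings quoted in the text; nothing here is an additional claim.

```python
# Iteration bookkeeping for the training schedule described above.
ITERS_PER_EPOCH = 1000
BATCH_SIZE = 256
SAMPLES_PER_EPOCH = ITERS_PER_EPOCH * BATCH_SIZE      # 256,000 samples

TRAIN_CONFIG = {
    "rgb_initial_lr": 0.1,          # spatial (RGB) stream
    "flow_initial_lr": 0.5,         # temporal (optical flow) stream
    "batch_size": BATCH_SIZE,
    "iters_per_epoch": ITERS_PER_EPOCH,
}
```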


Fig. 12. Human action recognition results under an offshore drilling platform scenario.

Fig. 13. Visualization result of our method on the offshore drilling platform scenario.

Therefore, in combination with the needs of actual engineering, we re-expanded the training set for bending, falling, jumping, making phone calls, and smoking to improve the performance of our proposed method in the complex scene of offshore drilling platforms. In addition, the phone and smoking actions are easily confused with walking, head shaking, and upper-limb swings, so the accuracy of our recognition of such fine-grained actions still needs to be improved. The results of human action recognition on the offshore drilling platform are shown in Fig. 13. The above experimental results show that the human action recognition algorithm for complex scenes proposed in this paper can both accurately identify human targets and estimate the occluded parts, thus obtaining a complete sequence of human joint points, which can basically meet the accuracy requirements for the recognition of human actions. Finally, the datasets for offshore drilling platforms are being expanded to build our own datasets and improve the accuracy of our proposed methods.

4. Conclusions


In this work, we first build a data set of workers' activities on an offshore drilling platform by collecting frames from the platform's surveillance video, and then, using it as the training set, an improved MCPM method is proposed to recognize the activities of workers on the platform. The data experiments show that our method performs better than the Faster-RCNN, MobileNet-SSD, and SSD algorithms in human object detection, and achieves good accuracy in recognizing key activities. To the best of our knowledge, this is the first attempt to use a deep model to recognize workers' activities on an offshore drilling platform.

For future work, it is worth using the methods of (Song et al., 2019; Martinez et al., 2017; Tao et al., 2019; Xu et al., 2019) for temporal action recognition based on 3D human key-points, which can parse more useful information from 3D data. As well, some more activities should be considered, such as waving hands and dropping a helmet. The training dataset can be enlarged by involving more labeled pictures, which would be helpful in improving the recognition accuracy.

Declaration of interest statement

All authors must disclose any financial and personal relationships with other people or organizations that could inappropriately influence (bias) their work or state if there are no interests to declare.

CRediT authorship contribution statement

Faming Gong: Conceptualization, Methodology, Software, Data curation, Writing - original draft. Yuhui Ma: Visualization, Investigation. Pan Zheng: Writing - review & editing. Tao Song: Supervision.

Acknowledgement

This work was supported by National Key Research and Development Program (No. 2018YFC1406204, 2018YFC1406201), Research and Application of Innovative Methods for Oil and Gas Exploitation in Big Data Environment (No. 2015IM010300), Tai Shan Scholar Foundation (No. tsqn201812029), Natural Science Foundation of China under grant U1811464, National Natural Science Foundation of China (61572522, 61572523, 61672033, 61672248 and 61873280), National Natural Science Foundation of Shandong Province (ZR2019MF012, 2019GGX101067), Fundamental Research Funds of Central Universities (18CX02152A, 19CX05003A-6), Shandong Province Innovation Researching Group (2019KJN014).

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.jlp.2020.104043.

References

Cao, Z., Simon, T., Wei, S.E., Sheikh, Y., 2017. Realtime multi-person 2d pose estimation using part affinity fields. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7291–7299. https://doi.org/10.1109/CVPR.2017.143.
Carreira, J., Agrawal, P., Fragkiadaki, K., Malik, J., 2016. Human pose estimation with iterative error feedback. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4733–4742. https://doi.org/10.1109/CVPR.2016.512.
Chen, Y., Shen, C., Wei, X.S., Liu, L., Yang, J., 2017. Adversarial posenet: a structure-aware convolutional network for human pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1212–1221. https://doi.org/10.1109/iccv.2017.137.
Chen, Y., Wang, Z., Peng, Y., Zhang, Z., Yu, G., Sun, J., 2018. Cascaded pyramid network for multi-person pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7103–7112. https://doi.org/10.1109/cvpr.2018.00742.
Chéron, G., Laptev, I., Schmid, C., 2015. P-CNN: pose-based CNN features for action recognition. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3218–3226. https://doi.org/10.1109/ICCV.2015.368.
Du, W., Wang, Y., Qiao, Y., 2017. Rpan: an end-to-end recurrent pose-attention network for action recognition in videos. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3725–3734. https://doi.org/10.1109/iccv.2017.402.
Fang, H.S., Xie, S., Tai, Y.W., Lu, C., 2017. Rmpe: regional multi-person pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2334–2343. https://doi.org/10.1109/ICCV.2017.256.
Girshick, R., Donahue, J., Darrell, T., Malik, J., 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587. https://doi.org/10.1109/cvpr.2014.81.
Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Adam, H., 2017. Mobilenets: Efficient Convolutional Neural Networks for Mobile Vision Applications. Computer Vision and Pattern Recognition.
Huang, J., Li, N., Zhang, T., Li, G., 2017. A Self-Adaptive Proposal Model for Temporal Action Detection Based on Reinforcement Learning. Computer Vision and Pattern Recognition.
Insafutdinov, E., Andriluka, M., Pishchulin, L., Tang, S., Levinkov, E., Andres, B., Schiele, B., 2017. Arttrack: articulated multi-person tracking in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6457–6465. https://doi.org/10.1109/cvpr.2017.142.
Kaiming, H., Georgia, G., Piotr, D., Ross, G., 2018. Mask r-cnn. IEEE Trans. Pattern Anal. Mach. Intell. 99, 1-1. https://doi.org/10.1109/TPAMI.2018.2844175.
Kalogeiton, V., Weinzaepfel, P., Ferrari, V., Schmid, C., 2017. Action tubelet detector for spatio-temporal action localization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4405–4413. https://doi.org/10.1109/iccv.2017.472.
Li, Y., Lan, C., Xing, J., Zeng, W., Yuan, C., Liu, J., 2016. Online human action detection using joint classification-regression recurrent neural networks. In: European Conference on Computer Vision. Springer, Cham, pp. 203–220. https://doi.org/10.1007/978-3-319-46478-7_13.
Li, C., Zhong, Q., Xie, D., Pu, S., 2018. Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. In: International Joint Conference on Artificial Intelligence. https://doi.org/10.24963/ijcai.2018/109.
Limin, Y., Yue, L.I., Bin, D.U., Hao, P., 2015. Dynamic gesture recognition based on key feature points trajectory. Optoelectron. Technol. 35 (3), 187–190. https://doi.org/10.3969/j.issn.1005-488X.2015.03.010.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C., 2016. SSD: single shot multibox detector. In: European Conference on Computer Vision. Springer, Cham, pp. 21–37. https://doi.org/10.1007/978-3-319-46448-0_2.
Martinez, J., Hossain, R., Romero, J., Little, J.J., 2017. A simple yet effective baseline for 3d human pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2640–2649. https://doi.org/10.1109/ICCV.2017.288.
Newell, A., Yang, K., Deng, J., 2016. Stacked hourglass networks for human pose estimation. In: European Conference on Computer Vision. Springer, Cham, pp. 483–499. https://doi.org/10.1007/978-3-319-46484-8_29.
Peng, X., Schmid, C., 2017. Multi-region two-stream R-CNN for action detection. In: European Conference on Computer Vision. Springer, Cham, pp. 744–759. https://doi.org/10.1007/978-3-319-46493-0_45.
Pishchulin, L., Insafutdinov, E., Tang, S., Andres, B., Andriluka, M., Gehler, P.V., Schiele, B., 2016. Deepcut: joint subset partition and labeling for multi person pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4929–4937. https://doi.org/10.1109/cvpr.2016.533.
Redmon, J., Divvala, S., Girshick, R., Farhadi, A., 2016. You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788. https://doi.org/10.1109/cvpr.2016.91.
Simonyan, K., Zisserman, A., 2014. Two-stream convolutional networks for action recognition in videos. Adv. Neural Inf. Process. Syst. 1 (4), 568–576.
Singh, A., Patil, D., Omkar, S.N., 2018. Eye in the sky: real-time drone surveillance system (DSS) for violent individuals identification using ScatterNet hybrid deep learning network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1629–1637. https://doi.org/10.1109/cvprw.2018.00214.
Singh, G., Saha, S., Sapienza, M., Torr, P., Cuzzolin, F., 2016. Online real-time multiple spatiotemporal action localisation and prediction. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3637–3646. https://doi.org/10.1109/iccv.2017.393.
Song, T., Alfonso, R., Zheng, P., et al., 2018. Spiking neural P systems with colored spikes. IEEE Trans. Cogn. Dev. Syst. 10 (4), 1106–1115. https://doi.org/10.1109/TCDS.2017.2785332.
Song, T., Pan, L., Wu, T., et al., 2019. Spiking neural P systems with learning functions. IEEE Trans. NanoBioscience 18 (2), 176–180. https://doi.org/10.1109/TNB.2019.2896981.
Tao, S., Pang, Hao, S., Alfonso, R., Pan, Z., 2019. A parallel image skeletonizing method using spiking neural P systems with weights. Neural Process. Lett. https://doi.org/10.1007/s11063-018-9947-9.
Toshev, A., Szegedy, C., 2014. DeepPose: human pose estimation via deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1653–1660. https://doi.org/10.1109/cvpr.2014.214.
Wang, L., Ge, L., Li, R., Fang, Y., 2017. Three-stream CNNs for action recognition. Pattern Recognit. Lett. 92, 33–40. https://doi.org/10.1016/j.patrec.2017.04.004.
Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Van Gool, L., 2016. Temporal segment networks: towards good practices for deep action recognition. In: European Conference on Computer Vision. Springer, Cham, pp. 20–36. https://doi.org/10.1007/978-3-319-46484-8_2.
Wei, S.E., Ramakrishna, V., Kanade, T., Sheikh, Y., 2016. Convolutional pose machines. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4724–4732. https://doi.org/10.1109/cvpr.2016.511.
Wei, L., Zhang, S., Gao, W., Tian, Q., 2018. Person transfer gan to bridge domain gap for person re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 79–88. https://doi.org/10.1109/CVPR.2018.00016.
Xiong, Y., Zhao, Y., Wang, L., Lin, D., Tang, X., 2016. A Pursuit of Temporal Accuracy in General Activity Detection. Computer Vision and Pattern Recognition.



Xu, D., Wang, Z., Song, T., 2019. A Novel Dual Path Gated Recurrent Unit Model for Sea Surface Salinity Prediction. https://doi.org/10.1175/JTECH-D-19-0168.1.
Yan, S., Xiong, Y., Lin, D., 2018. Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Thirty-Second AAAI Conference on Artificial Intelligence.
Yeung, S., Russakovsky, O., Mori, G., Fei-Fei, L., 2016. End-to-end learning of action detection from frame glimpses in videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2678–2687. https://doi.org/10.1109/CVPR.2016.293.
Yin, J., Liu, X., Tian, G., Wei, J., Zhang, L., Xu, T., 2016. Human action recognition based on the sequence of key points. Robot 38 (2), 200–207.
Yuan, J., Ni, B., Yang, X., Kassim, A.A., 2016. Temporal action localization with pyramid of score distribution features. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3093–3102. https://doi.org/10.1109/CVPR.2016.337.
Zhan, F., Zhu, H., Lu, S., 2019. Spatial fusion gan for image synthesis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3653–3662.
Zhao, Y., Xiong, Y., Wang, L., Wu, Z., Tang, X., Lin, D., 2017. Temporal action detection with structured segment networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2914–2923. https://doi.org/10.1109/ICCV.2017.317.
Zhong, Z., Zheng, L., Zheng, Z., Li, S., Yang, Y., 2018. Camera style adaptation for person re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5157–5166. https://doi.org/10.1109/CVPR.2018.00541.
Zhu, Y., Lan, Z., Newsam, S., Hauptmann, A., 2018. Hidden two-stream convolutional networks for action recognition. In: Asian Conference on Computer Vision. Springer, Cham, pp. 363–378. https://doi.org/10.1007/978-3-030-20893-6_23.
