Convolutional long short-term memory model for recognizing construction workers’ postures from wearable inertial measurement units


Advanced Engineering Informatics 46 (2020) 101177

Junqi Zhao a,*, Esther Obonyo a,b

a Department of Architectural Engineering, The Pennsylvania State University, State College, PA 16802, USA
b Engineering Design, Technology, and Professional Programs, The Pennsylvania State University, State College, PA 16802, USA



Keywords: Posture recognition; Wearable sensing; Deep neural networks; Construction worker; Injury prevention; Ergonomics

This paper proposes using Deep Neural Networks (DNN) models for recognizing construction workers’ postures from motion data captured by wearable Inertial Measurement Units (IMUs) sensors. The recognized awkward postures can be linked to known risks of Musculoskeletal Disorders among workers. Conventional Machine Learning (ML)-based models have shown promising results in recognizing workers’ postures. However, ML models are limited: they rely on heuristic feature engineering to construct discriminative features for characterizing postures, which makes it challenging to further improve recognition accuracy. In this paper, the authors investigate the feasibility of addressing this problem using a DNN model that integrates Convolutional Neural Networks (CNN) with Long Short-Term Memory (LSTM) layers to automate feature engineering and sequential pattern detection. The model’s recognition performance was evaluated using datasets collected from four workers on construction sites. The DNN model integrating one convolutional and two LSTM layers resulted in the best performance (measured by F1 Score). The proposed model outperformed baseline CNN and LSTM models, suggesting that it leveraged the advantages of the two baseline models for effective feature learning. It improved benchmark ML models’ recognition performance by an average of 11% under personalized modelling. The recognition performance was also improved by 3% when the proposed model was applied to 8 types of postures across three subjects. These results suggest that the proposed DNN model has high potential for addressing the recognition performance limitations observed with ML models.

1. Introduction

The research discussed in this paper is part of a project directed at developing a data-driven injury prevention approach for construction workers. Musculoskeletal Disorders (MSDs) account for nearly one-third of the total costs spent on workers’ compensation in the U.S. [1]. Employers spent as much as $53.1 billion annually on direct costs for MSDs treatment in 2012–2014 [2]. MSDs-related compensation in construction is higher than in most other sectors [3]. Construction workers executing manually intensive tasks are highly susceptible to MSDs [3]. This problem can be addressed proactively by identifying awkward postures and linking this information to MSDs risk factors. This strategy requires access to reliable posture data. For this to work in construction, suitable data capture techniques are needed, and processing the information should not require hiring additional specialists. The authors have shown the feasibility of using Machine Learning (ML)-based models for workers’ posture recognition from motion data captured by emerging wearable Inertial Measurement Units (IMUs) in previous experiments [4,5]. They also observed that the heuristic

Abbreviations: BT, Bending Posture; CLN, Convolutional LSTM model; CNN, Convolutional Neural Networks; CV, Computer Vision; DL, Deep Learning; DNN, Deep Neural Networks; DT, C4.5 Decision Tree; EEG, Electroencephalography; IMUs, Inertial Measurement Units; KN, Kneeling Posture; KNN, K-Nearest Neighbor; LB, Lateral Bending Posture; LSTM, Long Short-Term Memory; ML, Machine Learning; MSDs, Musculoskeletal Disorders; NB, Naive Bayes; NON, Transitional Posture; OW, Overhead Working Posture; OWAS, Ovako Working Posture Analyzing System; RF, Random Forest; RFE, Recursive Feature Elimination; RNN, Recurrent Neural Networks; SGD, Stochastic Gradient Descent; SQ, Squatting Posture; SRS, Stratified Random Shuffle; ST, Standing Posture; SVM, Support Vector Machine; VS, Vision-based Sensing; WS, Wearable Sensing.
* Corresponding author. E-mail addresses: [email protected] (J. Zhao), [email protected] (E. Obonyo).
Received 4 December 2019; Received in revised form 8 August 2020; Accepted 16 September 2020; Available online 24 September 2020.
1474-0346/© 2020 Elsevier Ltd. All rights reserved.


The authors’ proposed solution is based on the integration of a multilayer Convolutional LSTM architecture as a DNN-based recognition model. The proposed model performs feature engineering, sequential pattern learning, and model optimization in an automated manner. A channel-wise normalization converts the multi-channel motion data streams into a 2D “Motion Image”. Such an “Image” serves as the input for the proposed DNN models. The integration of “Motion Image” construction with the Convolutional LSTM model architecture is expected to result in a high-performance recognition technique for detecting workers’ awkward postures. It also improves the usability of ML-based models, as it eliminates the need for manual feature engineering.

The remainder of the paper is organized as follows. Section 2 reviews related research. Section 3 provides the theoretical background of DNN models and the proposed design of the Convolutional LSTM architecture. The approach used for evaluating the proposed models is summarized in Section 4. The result analysis and discussion are in Section 5, followed by the conclusion in Section 6. Section 7 discusses the limitations of this study and future research.

feature engineering used in ML models can be biased in constructing discriminative features for different postures [6–8]. In related studies [4,9–13], the recognition performance denotes whether the model can correctly classify workers’ postures using constructed features. Biased feature engineering can impede ML models from achieving optimal recognition performance. There is an opportunity to enhance feature learning by using Deep Neural Networks (DNN)-based models.

This paper proposes a DNN-based model that exhibits high-performance posture recognition based on motion data captured from IMUs sensors. The specific research objectives associated with this goal include: (i) developing an integrated Convolutional Long Short-Term Memory (LSTM)-based DNN architecture for automated feature engineering and sequential pattern learning; (ii) investigating the proper Convolutional LSTM architecture for achieving high recognition performance; (iii) evaluating the improvement in recognition performance over ML models. The authors hypothesize that a well-configured Convolutional LSTM model can improve on the posture recognition performance of conventional ML-based models. The test was conducted on workers’ posture data collected from real job sites using wearable IMUs sensors.

MSDs can develop through exposure to repetitive, awkward working postures. The gradual development of these injuries provides an opportunity for monitoring and taking corrective measures to mitigate or prevent the impact of the exposure. Vision-based Sensing (VS) and Wearable Sensing (WS) can be used in capturing the required data [3]. VS has been successfully applied to posture and activity recognition [14–17]. However, the performance of VS can be adversely affected by poor environmental conditions on construction sites [18], such as lighting conditions [19,20], background noise, and occlusion [21].
WS, particularly wearable IMUs, offers advantages in continuous and non-intrusive data sensing with less environmental influence [3,22]. IMUs sensors are widely used in commercial wearable products for health monitoring. A recent review [23] shows that these products focus mainly on detecting daily living postures related to spine (e.g., neck and back) movements based on IMUs output. Such limited posture monitoring cannot cover workers’ full-body awkward postures beyond spine movement. In addition, the commercial products typically provide diagnostic information (e.g., warnings and health scores from accompanying applications) without the raw IMUs output needed for customized posture analysis.

To detect diverse awkward postures from workers, closely related studies have applied ML models to the raw sensor output from both wearable IMUs [4,9,10] and smart device built-in IMUs [11–13,24,25]. These studies have demonstrated the potential of ML-based models. However, it is important to note that conventional ML-based models use the “sliding-window-based analysis pipeline” [8], which: (i) is prone to heuristic engineering biases; (ii) ignores the sequential patterns; and (iii) isolates feature construction from feature selection and model optimization. These issues result in suboptimal performance of the recognition model. They also limit the effective use of ML-based models on real job sites, where workers’ natural postures come with imbalances among different classes and variations within the same class.

DNN techniques that leverage deep hierarchical architectures can be used to address the limitations of the conventional ML-based models. Specifically, the Convolutional Neural Networks (CNN) and the Recurrent Neural Networks (RNN) are widely used DNN architectures. CNN extracts rich features from raw data without manual feature engineering, and the RNN improves the recognition of temporal patterns [8].
The RNN can also be extended to capture long-term patterns by using the Long Short-Term Memory (LSTM) architecture. Deep hybrid models that integrate the CNN and RNN have been successfully applied to improve the models’ performance in detecting workers’ activities from VS output [20] and daily living activities from IMUs output [7,26]. However, there is still a lack of high-performance DNN model architectures that can be easily applied to IMUs-generated data for recognizing construction workers’ postures.

2. Review of related work

2.1. Posture-based MSDs assessment and prevention

Epidemiological studies have identified the contributing factors that pose the highest risks for MSDs and other similar injuries [27]. Workers in labor-intensive sectors such as construction routinely adopt awkward working postures; this exposes them to a high risk of developing MSDs. Efforts invested in proactively preventing the injuries include the development of workplace ergonomics assessment strategies. These are based on the use of ergonomics rules to monitor the frequency and duration of repetitive awkward postures. Common ergonomic rules for posture assessment include “Rapid Upper Limb Assessment” [28] and “Rapid Entire Body Assessment” [29]; the Ovako Working Posture Analyzing System (OWAS) [30] and its derivatives [31]; and ISO 11226:2000. The research discussed in this paper focuses on the opportunity for MSDs risk monitoring through awkward posture detection with wearable IMUs and DNN techniques.

2.2. Motion capture techniques in construction

Construction projects are executed in constantly changing geographical locations and exhibit non-standardized operations. The dynamic nature of construction sites limits the effectiveness of observation-based MSDs risk assessment strategies because of the complexities of adapting to rapidly changing site conditions. Motion sensing, particularly VS and WS, has been used to overcome the limitations of approaches that rely on supervisors [3].

2.2.1. Vision-based Sensing

VS has high accuracy and is often coupled with powerful analytical tools for biomechanical assessment of postures and joint load [32–34]. However, sophisticated sensing devices, such as range cameras [35] and stereo cameras [33], cannot be easily deployed on dynamically changing job sites [19,20]. The captured motion data also require post-processing to reconstruct the 3D skeleton model, which makes this approach more applicable to controlled lab settings.
Advances in Computer Vision (CV) and DNN techniques allow object detection and recognition from regular cameras. There is a growing interest in applying DNN models to the safety monitoring of construction workers. The use of CNN [14,19,36] and hybrid DNN models [20] has demonstrated the feasibility of detecting a worker’s postures and activities from site surveillance videos. DNN techniques also facilitate multi-object detection and recognition for safety monitoring on construction sites. Multi-object detection has been tested for monitoring Personal Protective Equipment misuse among workers [37–39], workers’ activities [15,40], and workers’ interaction with


the environment (e.g., equipment, materials, and building structure) [16,17,41–44]. Additionally, workers’ identity can also be recognized from site surveillance using DNN models. Identity detection can help to track workers committing unsafe acts [45] and assess their competence [44]. The integration of DNN models with VS for construction safety monitoring has been made possible by: (i) well-developed CNN-based models (e.g., VGG-16 in [37–39] and ResNet in [40,41]) for automated feature learning and R-CNN-based multi-object detection from 2D images [16,17,37–40]; (ii) deep RNN architectures and their variations (e.g., LSTM used in [42]) for capturing sequential patterns; and (iii) the existence of large labeled image datasets (e.g., ImageNet) for training complicated deep models. However, there are still implementation constraints for VS. Firstly, VS is affected by lighting variability [19,20], leading to decreased performance due to visual disturbances [46]. Secondly, cameras must be installed at fixed locations, where they capture non-targeted information and suffer from occlusions [21]. Thirdly, the available image datasets are not domain-specific: using a pre-trained deep architecture requires domain-specific datasets, such as images of construction workers, for model fine-tuning under transfer learning. These challenges can limit the deployment of DNN models with VS in real-life applications [18–20].

and Behzadan [11] applied the SVM model for activity recognition from smartphones mounted on the arm and wrist of 2 workers. Their model achieved an accuracy of up to 90.2%. The authors’ previous work [4] also investigated the optimal configuration for a personalized model, where the configured SVM model gave an accuracy of 74–83% in experiments.

The ongoing research shows the potential of applying ML models to recognize workers’ postures from wearable IMUs. However, further improvements are necessary to enable their successful application on real job sites. Most current efforts have evaluated the performance of ML-based models within controlled laboratory experiments. The motion data in these efforts were collected from imitated construction activities performed by students [4,10,13] or workers [9,11,12], following a prescribed experimental protocol. Performing ML-based posture recognition on a real job site is a challenging task. Natural human postures tend to be highly imbalanced [7], which makes it difficult for a classifier to perform well on both majority and minority classes. Additionally, the same posture can vary among workers executing similar tasks, and even for one worker doing a routine task over time. These factors bring the challenges of intra-class variation and inter-class similarity to ML-based classification [8]. There is a need for proper feature engineering and high-performance recognition models that can effectively discriminate workers’ postures on job sites.

2.2.2. Wearable Sensing

WS is an alternative motion-sensing approach whose growing popularity is driven by its higher applicability on the construction site [3]. Wearable physiological sensors, measuring electrocardiogram (ECG) and electroencephalography (EEG) signals, are used for monitoring both mental [47] and physical strain [48–51]. Smart insole-based sensors have been applied to detect workers’ fall risk [52] and integrated with the 3D pose for biomechanical analysis [53,54]. In particular, miniature IMUs sensors have been applied in non-intrusive WS systems to capture motion data from both the full body [4,9,10,55–58] and targeted body segments (e.g., ankles [59,60], head, and low back [22]). IMUs sensors built into smartphones and wristbands are increasingly being used to track workers’ motion data [11–13,24,25,61,62]. Motion data from wearable IMUs can be transformed into clinically meaningful joint angles [22,55,57] and load [56,58]. Such information can be further used for ergonomics assessment [63]. There are also studies using IMUs output for assessing gait stability [59,60,64,65] and fatigue analysis [66].

Posture recognition from IMUs-based motion data is usually formulated as a classification problem. Data-driven ML models are suitable for such classification tasks. In the area of Human Activity Recognition, ML models are mainly trained and evaluated on open datasets of daily living activities conducted by subjects (ranging from 1 to 29) wearing IMUs sensors in experiments [6]. However, there is a lack of large-scale open datasets of construction workers that can be used to evaluate the feasibility of using ML models for recognizing postures and activities. For this reason, related studies, as reviewed below, typically conduct their own in-house data collection experiments, where subjects (ranging from 2 to 25) performed construction tasks wearing IMUs.
Akhavian and Behzadan [24] tested recognizing 2 workers’ postures using smartphone built-in IMUs sensors, achieving an accuracy of over 90% with neural network models. Alwasel, Sabet, Nahangi, Haas and Abdel-Rahman [9] identified the productive masonry postures of 21 workers with a Support Vector Machine (SVM) model trained on full-body IMUs output. Chen, Qiu and Ahn [10] applied an SVM-based model and achieved an overall accuracy of 60%–80% in an experiment with 4 students wearing full-body IMUs sensors. Ryu, Seo, Jebelli and Lee [12] evaluated an accelerometer-embedded wristband for recognizing workers’ activities, where the SVM model showed the best accuracy of around 88% in a laboratory test with 10 workers. Yang, Yuan, Zhang, Zhao and Tian [13] also tested an SVM-based model for recognizing construction activities from 25 students, using smartphone and wristband built-in IMUs. Nath, Chaspari

2.3. DNN for posture recognition from wearable IMUs

The conventional ML-based model typically uses a “sliding-window-based analysis pipeline” [8]. This segments the continuous data stream into separate windows before constructing and selecting features (such as widely used features from the Time-Domain and Frequency-Domain [4,12,13,24]) to discriminate postures. Probabilistic classification models (e.g., SVM tested in related studies [4,9–13]) are then used to assign a posture label to each window. Nevertheless, such a pipeline suffers from some typical problems in challenging classification tasks.

Feature engineering is the most important phase of the ML model development process [67], and it correlates with model performance [6]. The conventional ML-based model relies on manual heuristic feature engineering. Despite its use of expert domain knowledge, heuristic feature engineering is a biased and time-consuming process [8]. Posture intra-class variability and inter-class similarity compound the complexities in developing generic features for high-performance ML models [6]. Besides, human motion is inherently translational and temporal [6,7]. The commonly used ML-based model does not capture the sequential patterns in each segmented window, treating time-series motion data as static. Additionally, the conventional ML-based model separates the feature engineering, feature selection, and parameter tuning steps during model development. These problems make it challenging to further improve ML-based model performance and thus result in a sub-optimal model.

The authors contend that Deep Learning (DL) techniques with automated feature extraction capabilities can be used to address the above challenges. DL is an ML technique that uses representation learning to automatically discover feature representations in raw data [68].
Unlike conventional ML models that require human-engineered features to perform optimally, DL models comprise deep multi-layer Neural Networks, a.k.a. DNN, that represent features hierarchically from low to high levels. DNN models have achieved high performance with respect to pattern recognition in CV, Speech Recognition, and Natural Language Processing [8]. The exceptional performance of DNN models drives their application in recognizing WS-based daily living activities. Motion data from IMUs sensor channels can be treated as a 2D “Image” after being segmented into windows of the same dimension. This representation makes it possible to apply DNN models.

A simple Deep Belief Network model was first introduced into WS-based posture recognition in an effort directed at finding discriminative features [69]. More recently, featured DNN architectures have been deployed and validated in WS-based activity recognition, such as
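As a minimal illustrative sketch of the conventional “sliding-window-based analysis pipeline” discussed above (not the authors’ implementation), the following Python code segments a toy data stream into fixed-size windows and computes two hand-picked Time-Domain features per channel. The window size, step, toy data, and the function names `sliding_windows` and `time_domain_features` are assumptions made for illustration; the classifier that would label each window (e.g., an SVM) is omitted.

```python
from statistics import mean, pstdev


def sliding_windows(stream, size, step):
    """Split a list of per-timestep samples into fixed-size windows."""
    return [stream[i:i + size] for i in range(0, len(stream) - size + 1, step)]


def time_domain_features(window):
    """Hand-picked features: mean and standard deviation of each channel."""
    channels = list(zip(*window))  # transpose: one tuple per sensor channel
    feats = []
    for ch in channels:
        feats.extend([mean(ch), pstdev(ch)])
    return feats


# Toy stream: 8 timesteps, 2 hypothetical sensor channels.
stream = [[float(t), float(t % 2)] for t in range(8)]
windows = sliding_windows(stream, size=4, step=2)
features = [time_domain_features(w) for w in windows]
```

Every feature here is chosen by hand, which is the heuristic step that the DNN models discussed in this paper aim to automate.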


Autoencoder [46], CNN [7,70], and RNN [71]. Notably, hybrid DNN models have the potential to leverage the functionality of diverse DNN layers [6]. The multi-layer CNN models automatically extract rich features with increasing complexity from input data, eliminating tedious manual feature engineering. The RNN can be extended as LSTM to learn long-term sequential patterns. Integrating both networks can potentially leverage the learning power for extremely complex features and sequential patterns from motion data [8]. Applying the hybrid CNN + LSTM model with WS has achieved state-of-the-art performance in recognizing daily living activities [7,26,72] and detecting sleep condition [73].

The use of DNN to process WS output is a rapidly advancing area. Nevertheless, some key research questions need to be addressed. Firstly, unlike the well-developed DNN architectures under VS (e.g., VGG-16), there is no universal, pre-trained, and ready-to-use model architecture for different WS-based application scenarios [8]. It is, therefore, necessary to investigate the proper DNN architectures for workers’ posture recognition. This needs to be done while considering the challenges of balancing model performance with complexity, imbalanced posture class distribution, and model generality for different users. Secondly, the heterogeneous motion data from multi-channel sensors should be pre-processed as a “Motion Image” [6]. The pre-processing task has a significant influence on both computational efficiency and recognition performance. Additionally, despite the existence of publicly available WS-based motion datasets (e.g., Opportunity), most of them cover routine daily living activities. Domain-specific motion datasets are needed for developing and validating DNN models that can serve specific goals, such as detecting injury-related postures among workers.
Compared with the many studies applying DL with VS for safety monitoring, only a few recent studies have explored using DNN models to process sensor output in construction, such as equipment activity recognition from IMUs [74] and workers’ stress detection from EEG [75]. In previous efforts [76], the authors investigated the feasibility of using a DNN-based model for recognizing workers’ postures. This paper expands the initial study and contributes to the current body of knowledge in the following aspects: (i) proposing an easy-to-use data pre-processing approach for converting multi-channel IMUs output into a “Motion Image”; (ii) designing and developing a Convolutional LSTM architecture to work with IMUs for workers’ posture recognition; (iii) investigating the proper model architecture for balancing model recognition performance and complexity; (iv) evaluating the feature learning performance and generality of the proposed DNN architecture; (v) validating the DNN-based posture recognition model with workers’ naturalistic working postures on a job site.

operation renders a “Motion Image” with the size of “S by D by 1-layer depth”.

3.2. Convolutional Neural Networks for automated feature learning

Convolutional Neural Networks is a DNN architecture with interconnected structures [78]. Multiple CNN layers can be constructed with different convolution kernels. Convolutional kernels operating on the input data are optimized during the supervised learning process, in an attempt to maximize the activation level of kernels for data subsets in the same class [7]. The discriminative features learned by CNN render a feature map. Using the 1D data from one sensor channel as a simple example, Eq. (1) and Fig. 1 show how feature maps are extracted via 1D convolution.

$$a_j^{l+1}(p) = \sigma\left[ b_j^l + \sum_{f=1}^{F^l} K_{jf}^l(p) * a_f^l(p) \right] = \sigma\left[ b_j^l + \sum_{f=1}^{F^l} \left[ \sum_{p=1}^{p^l} K_{jf}^l(p)\, a_f^l(p) \right] \right] \qquad (1)$$

where $a_j^l(p)$ denotes the unit $p$ of feature map $j$ in layer $l$; $F^l$ is the number of feature maps in layer $l$; $K_{jf}^l$ is the kernel convolved over feature map $f$ in layer $l$ to create the feature map $j$ in layer $l+1$; $p^l$ is the length of kernels in layer $l$; $b_j^l$ is the bias vector toward feature map $j$ in the next layer $l+1$ from the current layer $l$; and $\sigma$ is the non-linear activation function.

The optimized kernel weights serve as a feature detector. It identifies a specific salient pattern of targets (such as posture corresponding to a motion signal pattern). The stacked convolutional layers learn the hierarchical representation of input data. The deeper layers progressively represent the prior layer’s output in a more abstract way. The stacked convolutional layers are the “de facto” standard approach for automated feature extraction [7].

3.3. Long short-term memory model for learning sequential patterns

The LSTM [79] extends the classical Recurrent Neural Networks (RNN)’s abilities in learning long-term temporal relationships. It intelligently controls historical information with “gates” [80]. Fig. 2 explains the LSTM working procedure following Olah [81]’s work. Specifically, the LSTM isolates two memories: “long-term memory”, which is the information the LSTM keeps remembering, denoted as the cell state at time t ($c_t$); and “short-term memory”, which is the information used directly and should be focused on for the current task, denoted as the hidden state at time t ($h_t$). Firstly, the LSTM learns which information should be kept or forgotten in the long-term memory $c_t$ with a forget gate $f_t$ from the new data input. Secondly, the model extracts the candidate new information $\tilde{c}_t$ to be added into the long-term memory $c_t$ with an input gate $i_t$. Next, $c_t$ is updated from input $i_t$, candidate new information $\tilde{c}_t$ and previous cell state $c_{t-1}$, via both forget gate $f_t$ and input gate $i_t$. Finally, the model determines which part of the long-term memory $c_t$ should be focused on for the current work $h_t$, which is controlled by output gate $o_t$.
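The gate updates described above can be sketched for a single LSTM unit in plain Python. The sketch follows the standard LSTM formulation as presented by Olah [81]; the function name `lstm_step`, the parameter layout, and the weight values are hypothetical choices for illustration and do not come from the trained models in this study.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One time step of a single-unit LSTM.

    params maps a gate name to (w_x, w_h, b): input weight, recurrent
    weight, and bias. This layout is an assumption for the sketch.
    """
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x_t + w_h * h_prev + b

    f_t = sigmoid(pre("forget"))            # how much of c_prev to keep
    i_t = sigmoid(pre("input"))             # how much new information to admit
    c_tilde = math.tanh(pre("candidate"))   # candidate new information
    c_t = f_t * c_prev + i_t * c_tilde      # updated long-term memory
    o_t = sigmoid(pre("output"))            # which part of c_t to expose
    h_t = o_t * math.tanh(c_t)              # short-term memory (hidden state)
    return h_t, c_t


# Arbitrary illustrative weights (not learned).
params = {name: (0.5, 0.5, 0.0)
          for name in ("forget", "input", "candidate", "output")}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, params)
```

With all weights set to zero, every gate evaluates to 0.5 and the candidate to 0, so the cell state is simply halved at each step; this makes the role of the forget gate easy to check by hand.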

3. Design and development of DNN-based model architectures

3.1. Sensor output as “Motion Image”

The motion data captured by the IMUs sensors form multi-module (accelerometers and gyroscopes) and multi-channel (tri-axial channels) time-series data. These motion data can be converted into a 2D “Motion Image”, a description that enables the learning of discriminative features [77]. The continuous motion data were segmented into consecutive motion images using a fixed-size window (window size selection is discussed in Section 4.2.2). Each window is treated as an image formed by a matrix of “pixels”. Each value in the motion data matrix is the sensor output from a specific channel ($d_i$) at a certain timestamp ($s_i$), instead of a number in the range 0–255 representing the darkness level of each “pixel” in an image. Channel-wise normalization was applied to the sensor output for each channel within a window. It is achieved by centering to the mean and scaling to unit variance. The normalization addresses the unit difference across sensor channels. In this case, a motion data window becomes an “S by D” 2D image, where S is the number of timesteps in a window and D is the total number of sensor channels. All channels are combined in the same layer. This

3.4. Convolutional LSTM model development

3.4.1. Components in convolutional LSTM model

An integrated Convolutional LSTM model (referred to as the CLN model) was deployed for posture recognition. Fig. 3-a presents the conceptual architecture of the proposed CLN model integrating a one-layer CNN and a one-layer LSTM.

Model Input. Model input was segmented by a sliding window with a fixed size. The raw sensor output was normalized for each channel. All channels were combined as a matrix of “S” by “D” (60 by 30 in Fig. 3), which was treated as a 2D “Motion Image”.

Batch and Epoch. The entire dataset for developing DNN models was divided into multiple (non-overlapping) groups of equal size. One


Fig. 1. CNN-based Feature Learning from 1D Motion Data as an Example. Two convolutional kernels (depth of 1) in Layer 2 are used to learn two feature maps from Layer 1. One convolutional kernel (depth of 2) in Layer 3 is used to learn one feature map from Layer 2.
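The layer arrangement in Fig. 1 can be mimicked with a small sketch. The code below assumes the common deep-learning convention of sliding each kernel over the input without flipping it, no padding, and a tanh activation; the kernel values, biases, toy signal, and the helper name `conv_layer` are invented for illustration.

```python
import math


def conv_layer(feature_maps, kernels, biases):
    """Apply 1D kernels across the feature maps of one layer.

    feature_maps: list of F equal-length lists (layer l)
    kernels: kernels[j][f] is the 1D kernel linking input map f to output map j
    biases: one bias per output feature map
    """
    length = len(feature_maps[0])
    out = []
    for kernel_set, bias in zip(kernels, biases):
        k_len = len(kernel_set[0])
        out_map = []
        for t in range(length - k_len + 1):
            total = bias
            for f, kernel in enumerate(kernel_set):
                total += sum(kernel[p] * feature_maps[f][t + p]
                             for p in range(k_len))
            out_map.append(math.tanh(total))
        out.append(out_map)
    return out


layer1 = [[0.0, 1.0, 2.0, 3.0, 4.0, 5.0]]            # one channel, six timesteps
k2 = [[[1.0, -1.0]], [[0.5, 0.5]]]                     # two kernels, each of depth 1
layer2 = conv_layer(layer1, k2, biases=[0.0, 0.0])     # two feature maps
k3 = [[[1.0, 0.0], [0.0, 1.0]]]                        # one kernel of depth 2
layer3 = conv_layer(layer2, k3, biases=[0.0])          # one feature map
```

Without padding, each layer shortens the maps by the kernel length minus one, so the six-sample input yields maps of length five in Layer 2 and length four in Layer 3.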

Fig. 2. LSTM Model Working Process.

can then feed each group into the model for training. Each group is also referred to as a batch (batch size of 10 in Fig. 3-a), which is used for effective model training. One pass of all batches of training data forward and backward through the model constitutes one epoch. Multiple epochs can be used for fully training the model in cases where data is limited.

Convolutional Layer. A convolutional layer computes the output that is connected to the local region of each sample in the input. The stride (1 by 1 in Fig. 3-a) specifies how far the convolutional kernel moves along the vertical and horizontal directions. The input data were zero-padded to avoid losing information on the border of the 2D input. The number of convolutional kernels, “n”, determines the number of feature maps (20 in Fig. 3-a). This creates an added depth dimension for the convolutional layers. A Flatten operation was used to connect to a fully-connected dense layer. The Flatten operation converts each sample’s feature maps into a one-dimensional vector representing one sample for classification. For example, if the 20 feature maps are flattened, the data change from a “60 by 30 by 20” 3D tensor to a “1 by 36,000” 1D vector.

LSTM and Classification Layers. Flattening the entire CNN output ignores temporal dependencies within motion data. LSTM can address this problem. The feature maps were flattened only along the depth dimension. The vertical time step dimension in Fig. 3-a was reserved for applying the LSTM model and capturing sequential patterns. Each slice of motion data over the time step has a dimension of 600 (features) by 10 (batch size). Samples in a batch fully connect with 64 neurons in the LSTM layer. A 50% dropout operation was used before fully connected layers (such as LSTM and Dense layers) to control model overfitting. The dropout operation randomly sets the activation of half of the units in a subsequent dense or recurrent layer to zero. The LSTM neurons were fully connected with the softmax layer. The softmax layer was used to predict the class of each sample in a batch. The LSTM gives a prediction for every time step “t” in the sequence. The LSTM memory units tend to become more informed as more time steps pass. Because the activation information in LSTM neurons at each time step is passed on to the next, the more time steps LSTM neurons have “seen”, the more informative the model will be [7]. The class probability distribution at the last time step “T” was used as the recognition result. At this point, the full
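A minimal sketch of the Model Input, Batch, and Flatten steps is given below, using only the Python standard library. The helper names (`to_motion_image`, `make_batches`, `flattened_length`), the toy two-channel window, the handling of constant channels, and the dropping of a trailing partial batch are assumptions for illustration rather than details reported in this study.

```python
from statistics import mean, pstdev


def to_motion_image(window):
    """Channel-wise normalization of an S x D window (zero mean, unit variance)."""
    channels = list(zip(*window))  # one tuple per sensor channel
    normed = []
    for ch in channels:
        mu, sd = mean(ch), pstdev(ch)
        sd = sd or 1.0  # assumption: a constant channel maps to zeros
        normed.append([(v - mu) / sd for v in ch])
    return [list(row) for row in zip(*normed)]  # back to S rows x D columns


def make_batches(samples, batch_size):
    """Non-overlapping groups of equal size (a trailing partial group is dropped)."""
    n_full = len(samples) // batch_size
    return [samples[i * batch_size:(i + 1) * batch_size] for i in range(n_full)]


def flattened_length(s, d, n_kernels):
    """Vector length after flattening S x D x n feature maps; stride 1 with
    zero-padding leaves S and D unchanged."""
    return s * d * n_kernels


window = [[float(t), 2.0 * t + 1.0] for t in range(60)]  # S = 60, D = 2 toy window
image = to_motion_image(window)
batches = make_batches(list(range(25)), batch_size=10)
```

Calling `flattened_length(60, 30, 20)` reproduces the 36,000-element vector in the example above. After normalization the two toy channels become identical, which illustrates how scaling removes the unit difference between channels.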

J. Zhao and E. Obonyo

Advanced Engineering Informatics 46 (2020) 101177

Fig. 3. (a) CLN Conceptual Architecture and (b) Baseline CNN Conceptual Architecture. In the conceptual architecture, the exemplary parameter values are used for illustration. The actual parameter setup is discussed in Section 3.4.2.

sequence in the window has been processed.

3.4.2. CLN and baseline DNN model architectures

The proposed CLN model aims at leveraging the advantages of CNN-based feature engineering and LSTM-based sequential pattern learning. In this sense, non-recurrent CNN and non-convolutional LSTM should be used as baseline models for validation. The proposed CLN and baseline models shared the same architecture for comparison purposes. All the constructed deep architectures process the input motion data in a layer-wise approach. Each layer provides the representation of the input that will be used as data for the next layer. The difference between the proposed CLN and the baseline CNN is the topology of the fully connected dense layers. LSTM units serve as the dense layers under the CLN model, while the baseline CNN model uses only non-recurrent dense layers. For the baseline LSTM, the motion data feed into the LSTM units directly, without feature extraction by convolutional layers. This baseline setup ensures that performance differences are due to architectural differences rather than better optimization, pre-processing, or ad-hoc customization [7].

CLN Model. The depth of convolutional layers in non-recurrent CNN models was a key hyperparameter influencing model performance in the initial study [76]. Investigating the optimal CLN model architecture by varying the convolutional layer depth is part of the research discussed in this paper. The depth was varied from one to five, as suggested in [7]. Because the convolutional layer parameter setup used in the CLN model by Zhao and Obonyo [76] had promising results, it was adopted here. In this setup, each convolutional layer has 64 kernels with a size of “5 by the number of sensor channels”, a 1 × 1 stride, and zero-padding. The 2-layer LSTM architecture recommended by Ordóñez and Roggen [7] and Karpathy, Johnson and Fei-Fei [82] was adopted. Each LSTM layer had 128 neurons. The last LSTM layer’s output was used by the softmax layer for prediction. This model can be expressed as C(64) × N − RL(128) × 2 − Sm [83], where C(nc) denotes a convolutional layer with nc features, RL(nr) is a recurrent LSTM layer with nr units, and Sm is the softmax classification. The hyperbolic tangent function (tanh) was used to activate neurons in each convolutional and LSTM layer.

Baseline CNN and LSTM Models. All elements in the vector generated by the Flatten layer in the baseline CNN model (Fig. 3-b) were connected with neurons in a subsequent fully-connected layer. The last fully connected layer had as many neurons as there were class labels. The softmax function gave a class probability distribution for the samples in the batch. Each sample was assigned the class label with the highest probability. The CNN model can be expressed as C(64) × N − D(128) × 2 − Sm, where D(nd) denotes a dense layer with nd units. N was the same as the number of convolutional layers in the CLN model. Similarly, the baseline LSTM model (Fig. 4) is expressed as RL(128) × 2 − Sm. It used the normalized sensor output from each window as model input and did not use features learned by convolutional layers. Similar to the CLN model, the baseline LSTM model also used the class probability distribution predicted at the last time step “T” as the result, after the LSTM layers had fully learned the sequential patterns.
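To make the C(64) × N − RL(128) × 2 − Sm notation concrete, the following sketch traces the output shape of each layer for a single window. It is not the authors' implementation: the function name is hypothetical, the notation string uses ASCII "x" and "-", and it assumes that zero-padding keeps the time-step and channel dimensions unchanged and that only the last LSTM layer returns a single vector. The example values (40 time steps, 30 channels, 5 classes) are illustrative.

```python
def cln_shape_trace(time_steps, channels, n_conv, n_classes,
                    kernels=64, lstm_units=128, lstm_layers=2):
    """Return (notation, [(layer_name, output_shape), ...]) for one window."""
    conv_part = f"C({kernels}) x {n_conv} - " if n_conv else ""
    notation = f"{conv_part}RL({lstm_units}) x {lstm_layers} - Sm"

    trace = [("input", (time_steps, channels, 1))]
    for i in range(n_conv):
        # Assumption: zero-padding preserves height (time) and width (channels).
        trace.append((f"conv_{i + 1}", (time_steps, channels, kernels)))

    # Flatten along depth only, so every time step keeps its own feature vector.
    width = channels * kernels if n_conv else channels
    trace.append(("flatten_depth", (time_steps, width)))

    for i in range(lstm_layers):
        is_last = i == lstm_layers - 1
        # Assumption: only the last LSTM layer emits the final time step.
        shape = (lstm_units,) if is_last else (time_steps, lstm_units)
        trace.append((f"lstm_{i + 1}", shape))

    trace.append(("softmax", (n_classes,)))
    return notation, trace


notation, trace = cln_shape_trace(time_steps=40, channels=30,
                                  n_conv=1, n_classes=5)
print(notation)
for name, shape in trace:
    print(f"{name:14s} {shape}")
```

Setting `n_conv=0` gives the trace of the baseline LSTM, in which the normalized sensor output feeds the LSTM layers directly.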


Fig. 4. Baseline LSTM Conceptual Architecture. In the conceptual architecture, the exemplary parameter values are used for illustration. The actual parameter setup is discussed in Section 3.4.2.

4. Evaluation of convolutional LSTM model for posture recognition

collected data record with video reference. One data record represents the sensor output from all channels at a specific time step t. The definition of postures in OWAS was used as a reference. The distribution of labels and the postures represented by each label are explained in Table 1. As the video reference can be blocked during data collection, data without video references were not considered during the analysis. Sliding window sizes of 0.5 s to 2.5 s are commonly used for human activity recognition [85]. The authors’ previous research established that a window size between 1 s and 1.3 s can achieve high posture recognition performance [4]. A window size of 1 s was adopted for segmentation in this paper. The dataset was segmented using a sliding window with 50% overlap to capture the transitional postures between consecutive windows. Each window was labelled using the majority label of the data records in the window. The sensor output was converted into a 2D “Motion Image” using the channel-wise normalization described in Section 3.1.
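The segmentation described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name and the toy data are hypothetical, a window size of 40 records stands in for 1 s at 40 Hz, and ties between labels are resolved in favour of the label that appears first in the window, which is an assumption.

```python
from collections import Counter


def sliding_windows(records, labels, window_size, overlap=0.5):
    """Segment records into overlapping windows with a majority label each."""
    step = max(1, int(window_size * (1 - overlap)))
    windows = []
    for start in range(0, len(records) - window_size + 1, step):
        end = start + window_size
        # Majority vote over the per-record labels inside the window.
        majority = Counter(labels[start:end]).most_common(1)[0][0]
        windows.append((records[start:end], majority))
    return windows


# Toy data: 100 records, the first half standing (ST), the second bending (BT).
records = list(range(100))
labels = ["ST"] * 50 + ["BT"] * 50

windows = sliding_windows(records, labels, window_size=40)
window_labels = [label for _, label in windows]
print(len(windows), window_labels)  # 4 ['ST', 'ST', 'BT', 'BT']
```

The second and third windows straddle the transition between the two postures; the majority vote assigns each to the posture that dominates it.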

4.1. Job site data collection

The authors tested the feasibility of the developed DNN models on workers’ motion datasets collected on job sites, following the research approach in related studies (reviewed in Section 2.2.2). Four subjects were recruited from a residential building construction project as an initial test case. The workers’ consent was obtained following Institutional Review Board approved protocols. The average work experience of the subjects is 17 years. Five body locations were selected for motion data collection according to the human body segments and landmarks suggested in [84]. Five IMU sensors (MbientLab Meta Motion C¹) were deployed at the forehead (on the front of the hardhat), chest center, right upper arm, right thigh, and right crus (Fig. 5). All subjects were right-handed. The sensors were attached by sticking them on the workers’ clothing. This loose attachment was intended to avoid discomfort. Each subject was asked to perform their routine tasks for 20 to 30 min. Workers’ postures were recorded for video referencing. The collected data are summarized in Table 1.

4.3. Setup for model training and evaluation

4.3.1. Construct the posture datasets

The motion data from multiple workers can be used separately as personalized datasets and combined as a generalized dataset. These two kinds of dataset were used to evaluate the CLN model’s performance in recognizing the postures of a specific worker and those of multiple workers (Fig. 6). The generalized dataset was constructed by down-sampling S3 and S4 to 20 Hz, then combining S1, S3, and S4 as S5. S2 was not used as it did not contain sensor output from the arm.

4.2. Motion dataset preparation

4.2.1. Motion data pre-processing

The low-cost wearable IMUs used can provide unstable signals. When configured at 50 Hz, the IMUs collected some data at a lower (down to 20 Hz) or higher (up to 90 Hz) frequency. The motion data collected within one second was treated as a window. The window size selection is discussed in Section 4.2.2. 40 Hz was used as a cut-off to remove lower-frequency windows. From each channel in a window, 40 data points were randomly sampled with their temporal sequence preserved before being combined into a down-sampled window. This down-sampling approach was used to synthesize a 40 Hz motion dataset for subjects S2, S3, and S4. The motion data from S1 were collected at 25 Hz and down-sampled to 20 Hz.
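One way to read this down-sampling step is sketched below. The function name is hypothetical, and the sketch assumes that the same randomly chosen time indices are applied to all channels of a window; the paper does not state whether the sampling was shared across channels or done per channel. Windows with fewer records than the target are discarded, mirroring the 40 Hz cut-off.

```python
import random


def downsample_window(window, target_len=40, rng=None):
    """Randomly keep target_len records from a window, preserving their order.

    window: list of records, one per time step (each record holds all channels).
    Returns None when the window falls below the cut-off.
    """
    rng = rng or random.Random()
    if len(window) < target_len:
        return None  # fewer samples than the cut-off: drop the window
    keep = sorted(rng.sample(range(len(window)), target_len))
    return [window[i] for i in keep]


rng = random.Random(0)
one_second = list(range(55))           # toy window recorded at ~55 Hz
down = downsample_window(one_second, target_len=40, rng=rng)
too_sparse = downsample_window(list(range(30)), target_len=40, rng=rng)

print(len(down), down == sorted(down))  # 40 True
print(too_sparse)                       # None
```

Sorting the sampled indices is what keeps the temporal sequence intact after random selection.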

4.3.2. Dataset split strategy for model training and testing

Stratified Random Shuffle (SRS) was used for dataset splitting, given the following considerations. Stratification ensures that the train and test datasets contain the same posture classes. Shuffling was applied to train the DNN models effectively and to reduce the potential influence of data drift on the models. Training DNN models typically relies on “mini-batch based gradient descent” [86,87], where mini-batches are expected to approximate the distribution of the entire dataset for effective model training [87]. For data that naturally group the same classes in sequence (like the workers’ naturalistic posture data in this study), shuffling before dividing the data into consecutive mini-batches is recommended for training DNN models effectively [86,87]. Otherwise, batches containing only a limited number of classes are non-representative. Biased gradients calculated from the non-representative batches will seriously mislead the DNN model weight updating from the true

4.2.2. Windowing and labelling

The authors manually labelled the corresponding postures for each

1 “Wearables for Motion Tracking Wireless Environment Monitoring.” MbientLab, 1 July 2019,



Fig. 5. Subjects Working with Sensors. The sensors blocked for Subject 1 to 4 are not circled. The rightmost picture provides an example showing all sensor placements for a worker.

these postures. The unweighted average of F1-score is known as Macro F1-score (Eq. (2)).

gradient, thus resulting in ineffective model training [87]. Although shuffling was applied for effective DNN training, the benchmark ML models used in this study were not influenced by shuffling, given that all training data were fed into the ML models at one time.

Shuffling is also effective in minimizing influences from motion data drift [88]. Drift describes the change over time of the data properties characterizing the target (e.g., motion data characterizing postures). In this research context, data drift can result from the drift of IMU sensor output [89] and from intra-class posture variation over different time spans [90]. Recognition models evaluated on a shuffled dataset can better reflect their capability in learning posture data patterns with minimized influence from drift. It is also worth noting that the SRS used in this study may lead to an overestimation of the recognition performance of all tested models, which will be discussed in Section 7. However, this study aims at comparing the recognition performance of the proposed DNN and benchmark ML models. Given that all DNN and ML models were trained and tested on the same datasets under SRS, the comparative performance evaluation between models is not influenced by data shuffling.

Fig. 7 describes the SRS implementation. The “train” and “test” datasets were split as 9:1, which preserved more data for training the complex deep models. The “train” dataset was further split into “training” and “validation” datasets using a ratio of 8:2. This split allows one to fully develop the DNN models on the training dataset and fine-tune them on the validation dataset. The SRS was repeated for five rounds using different random states. The repetition minimizes the impact of biases during the splitting of a dataset. The use of random parameter initialization and Stochastic Gradient Descent (SGD) algorithms for model optimization introduces some randomness into the DNN training process. Because of this randomness, the same training process may yield DNN models with slightly different performance on test data. Repeating model training and evaluation on different dataset splits can improve the reliability of the model evaluation.
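The split in Fig. 7 can be sketched in plain Python as below. This is an illustrative stand-in, not the authors' implementation: the helper names and the toy label distribution are hypothetical, and per-class counts are rounded, so real splitters may differ by a sample or two. Each of the five rounds uses a different seed.

```python
import random
from collections import defaultdict


def stratified_shuffle_split(labels, held_out_fraction, seed):
    """Return (kept, held_out) index lists with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    kept, held_out = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_held = max(1, round(len(indices) * held_out_fraction))
        held_out.extend(indices[:n_held])
        kept.extend(indices[n_held:])
    rng.shuffle(kept)       # mix classes so mini-batches are representative
    rng.shuffle(held_out)
    return kept, held_out


def three_way_split(labels, seed):
    """9:1 train/test split, then 8:2 training/validation split of the train part."""
    train_idx, test_idx = stratified_shuffle_split(labels, 0.1, seed)
    train_labels = [labels[i] for i in train_idx]
    fit_pos, val_pos = stratified_shuffle_split(train_labels, 0.2, seed)
    training = [train_idx[p] for p in fit_pos]
    validation = [train_idx[p] for p in val_pos]
    return training, validation, test_idx


# Toy, imbalanced label sequence grouped by class, as in naturalistic data.
labels = ["ST"] * 700 + ["BT"] * 200 + ["KN"] * 100
for round_seed in range(5):
    training, validation, test = three_way_split(labels, round_seed)
    print(round_seed, len(training), len(validation), len(test))
```

With these ratios, 72% of the windows are used for training, 18% for validation, and 10% for testing in every round.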

\[
\text{Macro F1} = \frac{1}{N} \sum_{i=1}^{N} \frac{2 \times \text{Precision}_i \times \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i} \tag{2}
\]
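The Macro F1-score can be computed directly from its definition, as in the sketch below. The function name is hypothetical; classes are taken from the true labels, and a class with no correct predictions is given an F1 of zero, which is an assumption about edge-case handling. The toy example shows why accuracy alone is misleading on imbalanced posture data.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true))
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)  # assumption: undefined F1 counts as zero
    return sum(scores) / len(scores)


# A naive model that always predicts the majority class (ST).
y_true = ["ST"] * 8 + ["BT"] * 2
y_pred = ["ST"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                  # 0.8
print(macro_f1(y_true, y_pred))  # about 0.444
```

The naive model reaches 80% accuracy but a Macro F1-score of only 4/9, because the minority class contributes an F1 of zero with equal weight.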


N denotes the number of class labels in the dataset and i represents a specific class. A high Macro F1-score reflects high classification performance.

4.3.4. Checkpoint in DNN model training

The authors monitored the DNN model performance (Macro F1 Score) on the validation dataset after every epoch in the training process. The models were trained until their performance ceased to improve. This can occur around 50–100 epochs based on the authors’ initial study [76]. A total of 300 epochs was used to fully train the deep model. The model training checkpoint was set to “save” the trained model only when its performance improved, overwriting the previous save, so that the DNN model with the highest recognition performance was retained after all 300 epochs.

4.3.5. ML-based models as benchmark

The ML-based models were developed using the same dataset as the DNN models. They were subjected to the same pre-processing and dataset splitting processes. The performance differences between DNN and ML models can therefore be attributed to feature representations. The heuristic features used in ML-based models were constructed from each sensor channel, giving a total of 390 features (13 features by 30 channels) for a given window (Table 2). The ML models were developed using five commonly used classification algorithms: SVM, Naive Bayes (NB), K-Nearest Neighbour (KNN), C4.5 Decision Tree (DT), and Random Forest (RF). Using the same training/validation/test splitting approach as for the DNN models (Fig. 7), all ML-based models were developed on the training subset and fine-tuned via Grid-Search on the validation subset. The tuned parameters and their searching ranges (in Table 3) were selected based on the recommendations of a related study [91] and the Scikit-learn package [92]. The parameter combination giving the highest Macro F1 Score on the validation subset was identified as the fine-tuning result for each ML model.
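The size of the grid search can be checked by enumerating the parameter ranges listed in Table 3. In the sketch below, the dictionary keys follow scikit-learn's parameter names, and the grouping of the three tree parameters under the Decision Tree is an inference from the flattened table rather than something the source states explicitly. The counts agree with the combination totals that survive in Table 3 (392, 100, and 216).

```python
from itertools import product

# Ranges as listed in Table 3; assignment to algorithms is partly inferred.
svm_grid = {
    "gamma": [2.0 ** x for x in (-10, -5, -1, 0, 1, 5, 10)],
    "C": [2.0 ** x for x in (-10, -5, -1, 0, 1, 5, 10)],
    "kernel": ["poly", "rbf"],
    "degree": [2, 3, 4, 5],
}
knn_grid = {"n_neighbors": list(range(1, 101))}
tree_grid = {
    "max_depth": list(range(5, 31, 5)),
    "min_samples_leaf": list(range(5, 31, 5)),
    "min_samples_split": list(range(5, 31, 5)),
}


def count_combinations(grid):
    """Number of parameter settings an exhaustive grid search would try."""
    return sum(1 for _ in product(*grid.values()))


n_features = 13 * 30  # 13 heuristic features for each of the 30 channels

print(count_combinations(svm_grid))   # 392
print(count_combinations(knn_grid))   # 100
print(count_combinations(tree_grid))  # 216
print(n_features)                     # 390
```

Every combination requires a separate model fit on the training subset and an evaluation on the validation subset, which is part of why the tuning process is time-consuming.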
Recursive Feature Elimination (RFE) was also performed for feature selection. The fine-tuned ML-based models were finally evaluated on test subsets. The presented process is a commonly used “sliding-window-based analysis pipeline” for ML-based posture recognition models [8]. It is important to notice the innate challenges of ML models impeding posture recognition performance improvement. Firstly, features used are chosen from those widely used for daily living activity recognition. The heuristic features can have discriminative power, whereas they are not domain-specific and come with engineering bias. Secondly, feature construction, feature selection, and classification algorithm tuning are implemented through sequential, but fragmented, steps. Model hyper­ parameters across steps are not optimized together for high-

4.3.3. Model performance evaluation metric

The collected workers’ postures were highly unbalanced among classes, as shown in Table 1. This imbalance is expected when dealing with natural human postures. Classification accuracy is insufficient for measuring classification performance in this setting, since a naive model would achieve high accuracy by classifying every sample as the majority class. The Macro F1-score was used to account for the class label imbalance. The F1-score of a class is calculated as the harmonic mean of Precision and Recall. After acquiring the F1-score for each class, the average F1-score was used as the evaluation metric. The method used to calculate the average F1-score depends on which class labels are of interest. The collected motion data have shown that the awkward posture labels of interest can be either the majority (in S2 and S3) or the minority (in S1 and S4). It is, therefore, appropriate to give equal weights to both majority and minority labels. The models were trained to achieve high performance for recognizing all

• BT - Static bending, minor movement with bending, minor lateral bending, and bending for pickup
• KN - Kneel on one leg and both legs
• LB - Lateral bend
• NON - Transitional movements between postures
• OW - Overhead work with one arm or both arms
• SQ - Squatting
• ST - Standing, without the knee touching the ground

Fig. 6. (a) Personalized and (b) Generalized Modelling.

Fig. 7. Dataset Split under SRS.

The arm sensor malfunctioned during data collection, so the six channels from the arm sensor were not considered.




Features: Description
Minimum, Maximum, Mean, Variance: Basic statistics of sensor output in each window
Average Absolute Deviation: Mean absolute deviations from center
Slope: Sen’s slope for a series of data
Root Mean Square: Square root of arithmetic mean
Frequency-Domain
Spectral*: Frequency spectrum after FT
Entropy: Shannon entropy
Spectral Centroid: Centroid of a given spectrum
Skewness: Symmetry of distribution
Kurtosis: Heavy tail of distribution
Signal Energy: Sum squared signal amplitude [93]
Frequency Range Power: Sum of absolute signal amplitude [94]
Packages (R): Stats (3.4.1), Trend (1.0.1), Seewave (2.0.5)

* Spectral is not a feature; it is the basis on which all the other frequency-domain features were constructed.

Table 3 ML-based Model Setup for Parameter Tuning.
Columns: Classification Algorithms; Tuned Parameters and Searching Range; Combinations

Gamma: 2^x, x ∈ [−10, −5, −1, 0, 1, 5, 10]; Cost (C): 2^x, x ∈ [−10, −5, −1, 0, 1, 5, 10]; Kernel: [‘poly’, ‘rbf’]; Degree: [2, 3, 4, 5] (Combinations: 392)
Added Variance: 100 values spaced evenly in the range of [1e–9, 1]
K-Number of Neighbours: [1, 2, 3, …, 100]
Maximum Depth: [5, 10, 15, …, 30]; Minimum Samples in Leaf: [5, 10, 15, …, 30]; Minimum Samples to Split: [5, 10, 15, …, 30]
Number of Trees: [5, 10, 15, 20, 25, …, 400]




18.50 min @[email protected] channels 28.48 min @[email protected] channels Wire Pulling S4 Electrician

18.50 min @[email protected] channels 26.94 min @[email protected] channels Ground Electrical Conduit Installation S3 Electrician

S2 Demolition

Ground Guardrail Installation

30.64 min @[email protected] channels

30.27 min @[email protected] channels *

BT (14.7%) KN (2.0%) LB (12.3%) OW (7.2%) ST (52.4%) WK (3.4%) BT (72.9%) NON (4.3%) ST (12.5%) WK (9.1%) BT (13.6%) KN (46.7%) NON (15.0%) SQ (3.0%) ST (22.0%) BT (12.3%) ST (71.5%) WK (12.3%) S1 Masonry


30.75 min @25hz @30 channels

30.27 min @[email protected] channels

Table 2 Heuristic Features Constructed from Each Channel.

Posture Labels (Proportion) Collected Data Tasks Subjects

Table 1 Description of Collected Motion Dataset.


Actual Data Used

Label Explanation



100 216



can be a suitable architecture for personalized modelling. This can be explained by the fact that a greater network depth increases the number of parameters significantly (see Fig. 8-b). In addition to the risk of overfitting a deeper network on limited training data, the vanishing gradient problem can also emerge in an overly deep architecture. The gradient decreases exponentially in the initial layers after propagating back through multiple activation layers, which results in inefficient model training. The model operation time also increased significantly as the model went deeper (Fig. 8-b). This computational burden impedes the real-time deployment of DNN-based recognition models [6]. Based on these findings, the CLN architecture of C(64) × 1 − RL(128) × 2 − Sm emerged as ideal for developing a personalized recognition model.
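The growth in parameters with depth can be illustrated with a back-of-the-envelope count. The sketch below is not the authors' code. It assumes 40 time steps per window, 30 sensor channels, 5 posture classes (an S3-like dataset), kernels of size 5 by 30 with zero-padding that preserves the input size, and the standard four-gate LSTM parameter formula; all function names are hypothetical. Under these assumptions the totals coincide with the trainable-parameter counts reported in Section 5.2.1 (1,190,981 for the one-layer CLN and 12,315,205 for the five-layer baseline CNN).

```python
def conv_params(kernel_h, kernel_w, in_maps, out_maps):
    """Weights plus one bias per output feature map."""
    return kernel_h * kernel_w * in_maps * out_maps + out_maps


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    return input_dim * units + units


def conv_stack_params(n_conv, channels, kernels=64, kernel_h=5):
    total, in_maps = 0, 1
    for _ in range(n_conv):
        total += conv_params(kernel_h, channels, in_maps, kernels)
        in_maps = kernels
    return total


def cln_params(n_conv, channels, n_classes, kernels=64, units=128):
    """C(64) x n_conv - RL(128) x 2 - Sm, flattened along depth only."""
    total = conv_stack_params(n_conv, channels, kernels)
    step_features = channels * kernels  # features seen by the LSTM per step
    total += lstm_params(step_features, units) + lstm_params(units, units)
    return total + dense_params(units, n_classes)


def cnn_params(n_conv, time_steps, channels, n_classes, kernels=64, units=128):
    """C(64) x n_conv - D(128) x 2 - Sm, with a full Flatten."""
    total = conv_stack_params(n_conv, channels, kernels)
    flat = time_steps * channels * kernels
    total += dense_params(flat, units) + dense_params(units, units)
    return total + dense_params(units, n_classes)


cln_1 = cln_params(n_conv=1, channels=30, n_classes=5)
cnn_5 = cnn_params(n_conv=5, time_steps=40, channels=30, n_classes=5)
reduction = 1 - cln_1 / cnn_5

print(cln_1)                       # 1190981
print(cnn_5)                       # 12315205
print(round(reduction * 100, 2))   # 90.33
```

Most of the baseline CNN's parameters sit in the first dense layer after the full Flatten, whereas the CLN's first LSTM layer only sees the per-time-step features, which is why the shallower CLN is roughly ten times smaller.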

performance posture modelling. The commonly used classification algorithms are also unable to capture sequential patterns within time-series motion data. Additionally, manual feature engineering, parameter fine-tuning, and RFE-based feature selection are highly repetitive and time-consuming, limiting the practicality of applying ML-based models. ML-based models were used as the benchmark for validating the performance improvement from the proposed DNN-based model.

4.3.6. Model implementation

The DNN model architecture was developed using Keras 2.2.2 [95] (TensorFlow 1.9.0 GPU version). The ML-based models were developed using Scikit-learn in Python [92]. The feature construction was implemented in R (3.4.1). The models were all implemented on a Windows 10 PC (Intel Core i7-7700 CPU at 2.8 GHz, 16 GB RAM, NVIDIA GeForce GTX 1060 GPU). The code is available at [96].

5.2. Evaluation of personalized CLN model

5.2.1. Comparative evaluation between CLN and baseline DNN models

Personalized modelling leveraged the CLN model’s capability in capturing individual posture idiosyncrasy [8]. The CLN models consistently outperformed the baseline CNN models until the convolutional layer depth reached five (see Fig. 10). The CLN model with zero convolutional layers (C0) was the baseline LSTM model. CLN models with fewer than five convolutional layers also outperformed the baseline LSTM model. Without the convolutional layers, the baseline CNN model became a Multilayer Perceptron (MLP) model (D(128) × 2 − Sm). The baseline CNN model’s performance increased as the depth of convolutional layers increased from zero to five. The increasingly complex features learned by the deep CNN model tend to enhance posture recognition under personalized modelling. However, the baseline CNN model needs an overly deep architecture (five layers) to achieve performance comparable with the CLN and baseline LSTM models. The recommended architecture of C(64) × 1 − RL(128) × 2 − Sm achieved the highest performance among all tested DNN architectures. Specifically, C(64) × 1 − RL(128) × 2 − Sm (with 1,190,981 trainable parameters) improved on the performance of the best baseline CNN model (C(64) × 5, with 12,315,205 trainable parameters) by 1.91%, with 90.33% fewer trainable parameters. These results support the authors’ position that the CLN architecture can leverage the advantages of both the CNN and LSTM layers to enhance posture recognition. The findings are consistent with previous work, which integrated CNN and RNN models in vision-based [20] and sensor-based [7] activity recognition studies. The recommended

5. Results and discussion

The focus of this research was improving on the posture recognition performance of ML-based models. This was accomplished by developing a Convolutional LSTM-based DNN model, which automates feature engineering and sequential pattern detection. This section presents the results of investigating the recommended CLN model architecture. The performance of the CLN is compared against the baseline and benchmark models in the following sub-sections.

5.1. Investigation of convolutional LSTM model architecture

The authors investigated the proper model architecture. Different CLN architectures were constructed by varying the convolutional layer depth from one to five while preserving the two-layer LSTM. Both the recognition performance and the model complexity were analysed to find the proper architecture. The CLN model training and testing were conducted for each subject under personalized modelling. Fig. 8 depicts the results of the evaluation (see Fig. 9). Increasing the convolutional layer depth from one to two decreased the CLN model performance. Performance plateaued at three to four convolutional layers and decreased again with five. The result suggests that a “shallow” CLN model with one-layer CNN and two-layer LSTM

Fig. 8. Comparison of CLN Model Performance. (a) Analysis of CLN model performance with varying convolutional layer depth. The dots for S1-S4 represent the average performance over five-round SRS. (b) Analysis of model complexity with varying convolutional layer depth, using the model test result on S3’s dataset as an example.


Fig. 9. CLN Model Performance Evaluation (Personalized Modelling)-Comparison with Baseline DNN Models. The dots represent the average performance over four subjects for CLN and baseline models.

CLN architecture also reduced the model complexity – this contributed to the observed higher recognition performance.

evaluated on the generalized dataset. The goal here was validating the CLN model’s capability with respect to learning generic features. The generalized model was trained and tested on dataset S5, using the pro­ cess that was previously discussed in Section 4.3. As shown in Fig. 11, the CLN models consistently outperformed the baseline CNN models across different convolutional layer depths. It achieved an average performance improvement of 8.5%. The proposed CLN model also outperformed the baseline LSTM model with an average of 3.6%, regardless of the convolutional layer depth. The baseline CNN model’s performance is no better than the MLP model when using a single convolutional layer. These results suggest that the convolutional layers can extract generic subject-invariant features effectively when the proper depth is applied. When using the non-recurrent CNN model, the improper feature representation from one “shallow” convolutional layer can have an adverse impact on the model’s generalization performance. The proposed CLN model leverages the advantages of both con­ volutional and LSTM layers when learning generic features. The C(64) × 1 − RL(128) × 2 − Sm emerged as the ideal generalized CLN ar­ chitecture. The baseline LSTM model outperformed all baseline CNN models with varying convolutional layer depth. This suggests that sequential patterns contribute more to recognizing workers’ postures in a generalized model. The recognition performance can increase signif­ icantly when adding one convolutional layer to the baseline LSTM model. However, the increased convolutional layers tend to decrease the generalized CLN model performance. The reduced model complexity may help to regularize the CLN models. This can mitigate the overfitting and improve the model’s generality. The CLN model was also evaluated against benchmark ML-based models in generalized modelling. 
As shown in Table 5, the recom­ mended generalized CLN model (C(64) × 1 − RL(128) × 2 − Sm) out­ performed the ML-based model performance by 3%. This supports the authors’ position that the CLN model can automatically extract subjectinvariant features across subjects. It outperformed the approach using heuristic feature engineering. Additionally, the CLN model achieved higher performance improvement from the baseline DNN models under generalized modelling (see results in Table 4 and Table 5). This may be explained by the benefit of increased data size when training deep models. (See Table 6) Confusion matrices in Fig. 12 show that the CLN model improved the benchmark ML model’s recognition for common postures across multi­ ple subjects, such as WK (improved by 17%) and BT (improved by 11%). When being applied to posture data derived from only one subject (e.g., NON, and OW), the CLN model also showed a comparable and even higher recognition performance than the ML model. It is worth noting the generalized CLN model improved the recognition for OW by 12% from the personalized CLN model (on S1 in Fig. 10). This might be

5.2.2. Comparative evaluation of DNN-based models and ML-based models

The performance of the recommended CLN (C(64) × 1 − RL(128) × 2 − Sm) model was compared against the ML-based models (see Table 4). The recorded performance data are the average of the Macro F1 Scores obtained from each subject via five-round SRS tests for all models. The ML-based model performance was based on the optimal combination of selected features and candidate classification algorithms, which gave the highest Macro F1 score for each subject. As shown in Table 4, the C(64) × 1 − RL(128) × 2 − Sm model outperformed the benchmark ML-based models on three out of four subjects and provided a slightly lower performance on S4. The increase in recognition performance across all subjects averaged 11%. The confusion matrices presented in Fig. 10 detail how the CLN models improved on the performance of the ML-based models. The ML-based models tended to make errors when differentiating static postures (e.g., BT and ST). This was noted when the CLN and ML-based models for S1, S2, and S4 were compared. The CLN model, in contrast, can enhance the differentiation between BT and ST. The CLN model also showed potential in improving the recognition of dynamic transitional postures, such as NON (S2, S3). These results suggest that the proposed CLN-based automated feature engineering has a high potential for improving on the heuristic feature engineering approach used in conventional ML models. The tests of personalized modelling also suggest that the CLN-learned features can result in higher performance when characterizing an individual’s postures with inter-class similarity (e.g., differentiating static BT and ST) and those of a dynamic nature (e.g., NON). It is also worth noting that the CLN model did not improve on the benchmark ML-based model on S4. This may be attributed to data imbalance. S4 contains the lowest number of posture classes, with an imbalanced posture distribution (Table 1). The CLN model, with its high number of parameters (see Fig. 8-b), may not be effectively trained on a minority class with limited data, which in turn decreases the model performance.

5.3. Evaluation of model generalization

One of the primary needs in the use of posture recognition models is generalizing their applicability to different subjects [8]. Model generalization requires one to address the fundamental challenge of intra-subject variability and inter-subject similarity in implementing human activity recognition [97]. Posture data can vary when the same subject works at different times. Variations also occur when different subjects perform the same task. The developed DNN-based models were


Fig. 10. Confusion Matrix for CLN and Benchmark ML-based Models. The confusion matrix is based on the model giving the closest performance to average per­ formance among the five-round SRS test for each subject. The numbers in the confusion matrix were normalized for better interpretation.

explained by the fact that the CLN model can learn generalized features for posture ST among multiple subjects, which contributes to reducing the misclassification between OW and ST observed under personalized modelling on S1. The results suggest that the generalized CLN model has a high potential for learning subject-invariant features, thus enhancing the detection of common postures with inter-subject variability as well as postures belonging to a specific subject.

5.4. Evaluation of fusing sensor output into “Motion Image”

Multi-module and multi-channel IMU output was fused into a 2D “Motion Image”. Each column in the “Image” represented a body segment motion measurement, e.g., acceleration of head movement along the vertical direction. The “Motion Image” was obtained through a channel-wise normalization process. This assumed that the normalized outputs across channels had the same nature as pixels within a 2D image. This assumption has been validated in subsequent efforts. The identified optimal generalized CLN model’s performance was evaluated on subsets of the S5 dataset using two different groups of sensor channels, namely accelerometer and gyroscope. The CLN model can be directly applied to the two different groups of sensor channels by modifying the convolutional kernel sizes. The model developed from the


indicate that the DNN-based automated feature engineering has the high potential in reducing the bias that was observed in the heuristic feature engineering process. One of the scaling-related challenges with the use of the conventional ML-based model is the construction of the generic features. The pro­ posed CLN model allows one to capture the “subject-invariant” features when detecting postures based on data collected from different subjects. The CLN model can potentially balance the trade-off between model generalization and personalization. The results have also shown the CLN model can learn complex features directly from the raw IMUs output. The accelerometer output contributes more to discriminating workers’ postures. Additionally, the authors recommend tuning the convolutional layer depth as a hyperparameter when using the proposed CLN models, which can potentially improve both the model recognition performance and computational efficiency.

Table 4 Model Performance (Macro F1 Score) Evaluation under Personalized Modelling.
DNN models: CLN: C(64) × 1 − RL(128) × 2 − Sm; CNN: C(64) × 1 − D(128) × 2 − Sm; LSTM: RL(128) × 2 − Sm
Rows: ML-based Model as Benchmark; Improved Performance from Baseline DNN (CNN, LSTM) to CLN model; Improved Performance from ML-based to CLN model
Values by subject (S1 onward), as recovered: 0.816 (SVM), 12%, 6%; 0.640 (SVM), −1%, 3%; 0.799 (RF), 16%, 4%; 0.847 (SVM), 21%, −2%; 7%; −0.6%

7. Limitations and further research directions

The following limitations need to be addressed before further evaluating and applying DNN-based posture recognition with wearable IMUs. The observed performance improvement from the DNN-based model was based on an initial test with four workers. This study can benefit from a larger sample size to more fully capture the various postures used in construction tasks. Further studies could evaluate the generality of the proposed DNN model on more workers from different trades. In particular, given that one trade may be more susceptible to certain postures (e.g., prevalent back pain among masons), further work can focus on expanding the sample size from targeted trades instead of all

accelerometer channels achieved superior performance compared to that from the gyroscope (see Table 4). This is consistent with the findings of Ordóñez and Roggen [7]. The authors observed a marginal performance improvement (by 1.2%) when the datasets from the two sensor channel groups were fused. This can be attributed to the feature selection operation in CLN: the convolutional layer can set lower weights for certain channels (e.g., gyroscopes) to reduce their influence on recognition results. These findings suggest that fusing motion data via channel-wise normalization is a feasible strategy for deploying DNN models. Accelerometers contribute more to recognition performance than gyroscopes. Therefore, using the sensor output from accelerometers alone can be an option when a trade-off between computational efficiency and model performance is required.
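The channel-wise normalization and channel-group selection discussed in this section can be illustrated with a short Python sketch. The normalization formula is not restated here, so the sketch assumes a per-channel z-score; the toy readings, the column layout (two accelerometer columns followed by two gyroscope columns), and the function names are likewise hypothetical.

```python
from statistics import mean, pstdev


def normalize_channels(window):
    """Rescale every channel (column) of a window to zero mean, unit variance.

    `window` is a list of time steps; each time step is a list with one
    reading per sensor channel. Z-scoring is an assumed choice of
    channel-wise normalization.
    """
    stats = []
    for column in zip(*window):
        sigma = pstdev(column)
        # Guard against constant channels to avoid division by zero.
        stats.append((mean(column), sigma if sigma > 0 else 1.0))
    return [
        [(value - mu) / sigma for value, (mu, sigma) in zip(row, stats)]
        for row in window
    ]


def select_channels(window, indices):
    """Keep only the listed channel columns, e.g. the accelerometer group."""
    return [[row[i] for i in indices] for row in window]


# Toy window: 4 time steps x 4 channels with very different value ranges.
WINDOW = [
    [0.0, 9.8, 100.0, -50.0],
    [1.0, 9.6, 120.0, -40.0],
    [2.0, 9.9, 80.0, -60.0],
    [3.0, 9.7, 100.0, -50.0],
]
NORMALIZED = normalize_channels(WINDOW)        # fused "Motion Image"
ACCEL_ONLY = select_channels(NORMALIZED, [0, 1])  # accelerometer-only subset

if __name__ == "__main__":
    for row in NORMALIZED:
        print([round(value, 3) for value in row])
```

After normalization all columns share a comparable scale, which is what allows them to be treated like pixel intensities. Dropping the gyroscope columns halves the input width, in line with the efficiency trade-off noted above.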

Table 5 Performance Evaluation of Generalized Model (Macro F1 Score).

6. Conclusions

The findings in this study support the authors’ hypothesis that the Convolutional Long Short-Term Memory model can outperform the conventional ML-based model when recognizing workers’ postures from wearable IMUs output. Results from the feasibility test with four construction workers suggest that: (i) it is feasible to fuse multi-module, multi-channel wearable IMUs output as a “Motion Image” for applying DNN models; (ii) the integrated CLN model can leverage the advantages of both convolutional and LSTM layers in feature learning, with reduced model complexity; (iii) C(64) × 1 − RL(128) × 2 − Sm, with a shallow convolutional layer, is a recommended model architecture under both personalized and generalized modelling. These findings




CLN: C(64) × 1 − RL(128) × 2 − Sm
Baseline CNN: C(64) × 1 − D(128) × 2 − Sm
Baseline LSTM: RL(128) × 2 − Sm
Benchmark ML: 0.846 (SVM)


Table 6 Performance Evaluation of Fusing Multi-Sensor Channels (Macro F1 Score).
Sensor Output                                        CLN (C(64) × 1 − RL(128) × 2 − Sm)
Accelerometer (20 Hz × 15 channels)                  0.860
Gyroscope (20 Hz × 15 channels)                      0.650
Accelerometer + Gyroscope (20 Hz × 30 channels)      0.870

Fig. 11. CLN Model Performance Evaluation (Generalized Modelling): Comparison with Baseline DNN Models. The dots represent the performance of CLN and baseline models.


Fig. 12. Confusion Matrices for (a) Generalized CLN (C(64) × 1 − RL(128) × 2 − Sm) and (b) Benchmark ML-based Model (SVM). The confusion matrices are constructed in the same way as in Fig. 10.
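As a reading aid for the Macro F1 Scores reported in the tables, the following Python sketch computes per-class F1 and their unweighted mean from a confusion matrix. The 3 × 3 matrix is made-up toy data, not a result from this study, and the optional `weights` argument is a hypothetical extension showing how selected postures could be emphasized.

```python
def per_class_f1(confusion):
    """Per-class F1, where confusion[i][j] counts true class i predicted as j."""
    n_classes = len(confusion)
    scores = []
    for k in range(n_classes):
        true_pos = confusion[k][k]
        false_neg = sum(confusion[k]) - true_pos
        false_pos = sum(confusion[i][k] for i in range(n_classes)) - true_pos
        denominator = 2 * true_pos + false_pos + false_neg
        scores.append(2 * true_pos / denominator if denominator else 0.0)
    return scores


def macro_f1(confusion, weights=None):
    """Mean of per-class F1; equal weights give the usual Macro F1 Score."""
    scores = per_class_f1(confusion)
    if weights is None:
        weights = [1.0] * len(scores)
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# Toy confusion matrix for three posture classes (illustrative values only).
CONFUSION = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 3, 7],
]
MACRO = macro_f1(CONFUSION)
# Hypothetical weighting that counts the third class twice as much.
WEIGHTED = macro_f1(CONFUSION, weights=[1.0, 1.0, 2.0])

if __name__ == "__main__":
    print("per-class F1:", [round(s, 3) for s in per_class_f1(CONFUSION)])
    print("macro F1:", round(MACRO, 3))
    print("weighted F1:", round(WEIGHTED, 3))
```

Because every class contributes equally to the unweighted mean, a rarely observed posture influences the score as much as a frequent one.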

workers. In this way, the model can be evaluated for effective detection of awkward postures among the targeted group of vulnerable workers.

This study used 50%-overlap windows for segmenting the streaming motion data and shuffled the windows during train/test splitting. Neighbouring windows (sharing 50% of their motion data) can therefore be assigned to the train and test datasets, respectively, which introduces similarity between the two datasets. Such similarity may lead to performance overestimation for all tested recognition models. In future work, controlled experiments can be adopted for model evaluation, where workers perform pre-defined postures (to reduce posture variation) in two separate train/test sessions. IMUs sensors with onboard filters can also be applied to minimize data drift in experiments. In this way, the models can be trained and tested on two independent sessions when evaluating recognition performance.

Additionally, the Macro F1 Score used in this study helps to select a recognition model that achieves balanced recognition performance across all postures. Given that the detection of awkward postures is of greater interest for MSDs risk monitoring, further work should explore adjusting the weights in the F1 Score to improve the models’ performance in recognizing targeted awkward postures.

Findings of this research indicate that DNN models, widely used for processing 2D image data, can be adapted to work with multi-channel IMUs sensor output. To facilitate the use of DNN models for recognizing workers’ postures from IMUs, further research can investigate adaptive DNN models under Transfer Learning, which can enable learning new postures from the incoming motion data of new subjects. It is also worth noting the opportunity to deploy the developed DNN models via real-time mobile computing, which would broaden the applicability of posture recognition from wearable sensors.
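The window-overlap limitation described above can be reproduced with a small Python sketch. It segments a recording into 50%-overlap windows, then compares an interleaved split (a deterministic stand-in for shuffling) with a split across two separate sessions. The recording length (200 samples), window length (20 samples), session labels, and function names are illustrative assumptions.

```python
def sliding_windows(session_id, n_samples, window_len, overlap=0.5):
    """Return (session_id, start, end) index triples for overlapping windows."""
    step = max(1, int(window_len * (1 - overlap)))
    return [
        (session_id, start, start + window_len)
        for start in range(0, n_samples - window_len + 1, step)
    ]


def shared_samples(a, b):
    """Number of raw samples two windows have in common."""
    if a[0] != b[0]:  # windows from different sessions never share samples
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def leaked_test_windows(train, test):
    """Test windows that share at least one sample with a training window."""
    return [t for t in test if any(shared_samples(t, tr) > 0 for tr in train)]


# One hypothetical session: 200 samples, 20-sample windows, 50% overlap.
WINDOWS = sliding_windows("session_a", n_samples=200, window_len=20)

# Interleaved split: neighbouring windows land on opposite sides.
LEAKED_INTERLEAVED = leaked_test_windows(WINDOWS[0::2], WINDOWS[1::2])

# Session split: train on one session, test on an independent one.
TEST_SESSION = sliding_windows("session_b", n_samples=200, window_len=20)
LEAKED_SESSION = leaked_test_windows(WINDOWS, TEST_SESSION)

if __name__ == "__main__":
    print("windows per session:", len(WINDOWS))
    print("leaked test windows (interleaved):", len(LEAKED_INTERLEAVED))
    print("leaked test windows (separate sessions):", len(LEAKED_SESSION))
```

In the interleaved case every test window shares half of its samples with a training neighbour, whereas the session-based split shares none, which is the motivation for the two-session protocol proposed above.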
Awkward posture detection is the first step in proactive MSDs prevention. Further research can also explore: (i) the approaches to, and feasibility of, MSDs injury risk assessment using the detected postures; (ii) effective strategies for delivering the risk assessment results to construction workers on job sites; (iii) the effectiveness of behaviour intervention for workers’ injury prevention. Beyond real-time MSDs risk monitoring on job sites, the improved posture recognition model can also facilitate “Prevention through Design” (PtD) practices by identifying workers’ ergonomic risks under different workplace designs. These endeavours are directed at improving the effectiveness and efficiency of data-driven injury prevention in construction. The injury prevention approach can also be adapted for various labour-intensive occupations, such as manufacturing, health care, and agriculture.

8. Data availability statement

The datasets used in this study are available from the corresponding author upon request.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

The authors would like to acknowledge their construction industry partners for their help in data collection. The authors would also like to thank Mr. Qin Yin for constructive criticism of the manuscript. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

References

[1] United States Bureau of Labor Statistics, Injuries, illnesses, and fatalities. https://, 2016 (accessed 22 April 2019).
[2] United States Bone and Joint Initiative, The burden of musculoskeletal diseases in the United States., 2018 (accessed 16 June 2020).
[3] D. Wang, F. Dai, X. Ning, Risk assessment of work-related musculoskeletal disorders in construction: State-of-the-art review, J. Constr. Eng. Manage. 141 (2015) 04015008,
[4] J. Zhao, E. Obonyo, Towards a data-driven approach to injury prevention in construction, Workshop European Group Intelligent Comput. Eng. (2018) 385–411,
[5] J. Zhao, E. Obonyo, E-health of construction works: A proactive injury prevention approach, in: 2018 14th International Conference on Wireless and Mobile Computing, Networking and Communications (WiMob), 2018, pp. 145-152. 10.1109/WiMOB.2018.8589167.
[6] H.F. Nweke, Y.W. Teh, M.A. Al-Garadi, U.R. Alo, Deep learning algorithms for human activity recognition using mobile and wearable sensor networks: State of the art and research challenges, Exp. Syst. Appl. 105 (2018) 233–261, https://doi.org/10.1016/j.eswa.2018.03.056.
[7] F.J. Ordóñez, D. Roggen, Deep convolutional and lstm recurrent neural networks for multimodal wearable activity recognition, Sensors 16 (2016) 115, https://doi.org/10.3390/s16010115.
[8] T. Plötz, Y. Guan, Deep learning for human activity recognition in mobile computing, Computer 51 (2018) 50–59, MC.2018.2381112.
[9] A. Alwasel, A. Sabet, M. Nahangi, C.T. Haas, E. Abdel-Rahman, Identifying poses of safe and productive masons using machine learning, Autom. Constr. 84 (2017) 345–355,
[10] J. Chen, J. Qiu, C. Ahn, Construction worker’s awkward posture recognition through supervised motion tensor decomposition, Autom. Constr. 77 (2017) 67–81,



[11] N.D. Nath, T. Chaspari, A.H. Behzadan, Automated ergonomic risk monitoring using body-mounted sensors and machine learning, Adv. Eng. Inf. 38 (2018) 514–526,
[12] J. Ryu, J. Seo, H. Jebelli, S. Lee, Automated action recognition using an accelerometer-embedded wristband-type activity tracker, J. Constr. Eng. Manage. 145 (2018) 04018114,
[13] Z. Yang, Y. Yuan, M. Zhang, X. Zhao, B. Tian, Assessment of construction workers’ labor intensity based on wearable smartphone system, J. Constr. Eng. Manage. 145 (2019),, 04019039.
[14] X. Yan, H. Li, C. Wang, J. Seo, H. Zhang, H. Wang, Development of ergonomic posture recognition technique based on 2d ordinary camera for construction hazard prevention through view-invariant features in 2d skeleton motion, Adv. Eng. Inf. 34 (2017) 152–163,
[15] X. Luo, H. Li, D. Cao, Y. Yu, X. Yang, T. Huang, Towards efficient and objective work sampling: Recognizing workers’ activities in site surveillance videos with two-stream convolutional networks, Autom. Constr. 94 (2018) 360–370, https://
[16] W. Fang, B. Zhong, N. Zhao, P.E. Love, H. Luo, J. Xue, S. Xu, A deep learning-based approach for mitigating falls from height with computer vision: Convolutional neural network, Adv. Eng. Inf. 39 (2019) 170–177, aei.2018.12.005.
[17] W. Fang, L. Ding, B. Zhong, P.E. Love, H. Luo, Automated detection of workers and heavy equipment on construction sites: A convolutional neural network approach, Adv. Eng. Inf. 37 (2018) 139–149,
[18] B. Zhong, H. Wu, L. Ding, P.E.D. Love, H. Li, H. Luo, L. Jiao, Mapping computer vision research in construction: Developments, knowledge gaps and implications for research, Autom. Constr. 107 (2019) 102919, autcon.2019.102919.
[19] H. Luo, C. Xiong, W. Fang, P.E. Love, B. Zhang, X. Ouyang, Convolutional neural networks: Computer vision-based workforce activity assessment in construction, Autom. Constr. 94 (2018) 282–289, autcon.2018.06.007.
[20] L. Ding, W. Fang, H. Luo, P.E. Love, B. Zhong, X. Ouyang, A deep hybrid learning model to detect unsafe behavior: Integrating convolution neural networks and long short-term memory, Autom. Constr. 86 (2018) 118–124, j.autcon.2017.11.002.
[21] J. Yang, M.N. Nguyen, P.P. San, X.L. Li, S. Krishnaswamy, Deep convolutional neural networks on multichannel time series for human activity recognition, in: Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
[22] X. Yan, H. Li, A.R. Li, H. Zhang, Wearable imu-based real-time motion warning system for construction workers’ musculoskeletal disorders prevention, Autom. Constr. 74 (2017) 2–11,
[23] N.K.M. Yoong, J. Perring, R.J. Mobbs, Commercial postural devices: A review, Sensors 19 (2019) 5128,
[24] R. Akhavian, A.H. Behzadan, Smartphone-based construction workers’ activity recognition and classification, Autom. Constr. 71 (2016) 198–209, 10.1016/j.autcon.2016.08.015.
[25] M. Zhang, T. Cao, X. Zhao, Using smartphones to detect and identify construction workers’ near-miss falls based on ann, J. Constr. Eng. Manage. 145 (2018) 04018120,
[26] X. Li, Y. Zhang, J. Zhang, S. Chen, I. Marsic, R.A. Farneth, R.S. Burd, Concurrent activity recognition with multimodal cnn-lstm structure, arXiv preprint arXiv:1702.01638, (2017).
[27] I.L. Nunes, P.M. Bush, Work-related musculoskeletal disorders assessment and prevention, Ergonomics-a systems approach, InTech, 2012. 10.5772/37229.
[28] L. McAtamney, E.N. Corlett, Rula: A survey method for the investigation of work-related upper limb disorders, Appl. Ergon. 24 (1993) 91–99, 10.1016/0003-6870(93)90080-S.
[29] S. Hignett, L. McAtamney, Rapid entire body assessment (reba), Appl. Ergon. 31 (2000) 201–205,
[30] O. Karhu, P. Kansi, I. Kuorinka, Correcting working postures in industry: A practical method for analysis, Appl. Ergon. 8 (1977) 199–201, 10.1016/0003-6870(77)90164-8.
[31] P. Kivi, M. Mattila, Analysis and improvement of work postures in the building industry: Application of the computerised owas method, Appl. Ergon. 22 (1991) 43–48,
[32] J. Seo, R. Starbuck, S. Han, S. Lee, T.J. Armstrong, Motion data-driven biomechanical analysis during construction tasks on sites, J. Comput. Civil Eng. 29 (2014) B4014005,
[33] S. Han, S. Lee, A vision-based motion capture and recognition framework for behavior-based safety management, Autom. Constr. 35 (2013) 131–141, https://
[34] D. Wang, F. Dai, X. Ning, R.G. Dong, J.Z. Wu, Assessing work-related risk factors on low back disorders among roofing workers, J. Constr. Eng. Manage. 143 (2017) 04017026,
[35] S.J. Ray, J. Teizer, Real-time construction worker posture analysis for ergonomics training, Adv. Eng. Inf. 26 (2012) 439–455, aei.2012.02.011.
[36] H. Zhang, X. Yan, H. Li, Ergonomic posture recognition using 3d view-invariant features from single ordinary camera, Autom. Constr. 94 (2018) 1–10, https://doi.org/10.1016/j.autcon.2018.05.033.
[37] Q. Fang, H. Li, X. Luo, L. Ding, H. Luo, T.M. Rose, W. An, Detecting non-hardhat-use by a deep learning method from far-field surveillance videos, Autom. Constr. 85 (2018) 1–9,
[38] Q. Fang, H. Li, X. Luo, L. Ding, H. Luo, C. Li, Computer vision aided inspection on falling prevention measures for steeplejacks in an aerial environment, Autom. Constr. 93 (2018) 148–164,
[39] J. Wu, N. Cai, W. Chen, H. Wang, G. Wang, Automatic detection of hardhats worn by construction personnel: A deep learning approach and benchmark dataset, Autom. Constr. 106 (2019) 102894, autcon.2019.102894.
[40] H. Son, H. Choi, H. Seong, C. Kim, Detection of construction workers under varying poses and changing background in image sequences via very deep residual networks, Autom. Constr. 99 (2019) 27–38, autcon.2018.11.033.
[41] X. Luo, H. Li, D. Cao, F. Dai, J. Seo, S. Lee, Recognizing diverse construction activities in site images via relevance networks of construction-related objects detected by convolutional neural networks, J. Comput. Civil Eng. 32 (2018),
[42] J. Cai, Y. Zhang, H. Cai, Two-step long short-term memory method for identifying construction activities through positional and attentional cues, Autom. Constr. 106 (2019), 102886,
[43] X. Luo, H. Li, H. Wang, Z. Wu, F. Dai, D. Cao, Vision-based detection and visualization of dynamic workspaces, Autom. Constr. 104 (2019) 1–13, https://
[44] Q. Fang, H. Li, X. Luo, L. Ding, T.M. Rose, W. An, Y. Yu, A deep learning-based method for detecting non-certified work on construction sites, Adv. Eng. Inf. 35 (2018) 56–68,
[45] R. Wei, P.E. Love, W. Fang, H. Luo, S. Xu, Recognizing people’s identity in construction sites with computer vision: A spatial and temporal attention pooling network, Adv. Eng. Inf. 42 (2019) 100981, aei.2019.100981.
[46] A. Wang, G. Chen, C. Shang, M. Zhang, L. Liu, Human activity recognition in a smart home environment with stacked denoising autoencoders, Int. Conf. Web-Age Inform. Manage., Springer (2016) 29–40,
[47] H. Li, D. Wang, J. Chen, X. Luo, J. Li, X. Xing, Pre-service fatigue screening for construction workers through wearable eeg-based signal spectral analysis, Autom. Constr. 106 (2019) 102851,
[48] H. Jebelli, B. Choi, S. Lee, Application of wearable biosensors to construction sites. I: Assessing workers’ stress, J. Constr. Eng. Manage. 145 (2019), 10.1061/(ASCE)CO.1943-7862.0001729, 04019079.
[49] W. Lee, G.C. Migliaccio, Temporal effect of construction workforce physical strain on diminishing marginal productivity at the task level, J. Constr. Eng. Manage. 144 (2018) 04018083,
[50] W. Lee, K.-Y. Lin, E. Seto, G.C. Migliaccio, Wearable sensors for monitoring on-duty and off-duty worker physiological status and activities in construction, Autom. Constr. 83 (2017) 341–353,
[51] H. Jebelli, B. Choi, S. Lee, Application of wearable biosensors to construction sites. II: Assessing workers’ physical demand, J. Constr. Eng. Manage. 145 (2019),, 04019080.
[52] M.F. Antwi-Afari, H. Li, Fall risk assessment of construction workers based on biomechanical gait stability parameters using wearable insole pressure system, Adv. Eng. Inf. 38 (2018) 683–694,
[53] Y. Yu, H. Li, W. Umer, C. Dong, X. Yang, M. Skitmore, A.Y. Wong, Automatic biomechanical workload estimation for construction workers by computer vision and smart insoles, J. Comput. Civil Eng. 33 (2019) 04019010, 10.3390/s19235128.
[54] L. Kong, H. Li, Y. Yu, H. Luo, M. Skitmore, M.F. Antwi-Afari, Quantifying the physical intensity of construction workers, a mechanical energy approach, Adv. Eng. Inf. 38 (2018) 404–419,
[55] E. Valero, A. Sivanathan, F. Bosché, M. Abdel-Wahab, Musculoskeletal disorders in construction: A review and a novel system for activity tracking with body area network, Appl. Ergon. 54 (2016) 120–130, apergo.2015.11.020.
[56] A. Alwasel, E.M. Abdel-Rahman, C.T. Haas, S. Lee, Experience, productivity, and musculoskeletal injury among masonry workers, J. Constr. Eng. Manage. 143 (2017) 05017003,
[57] E. Valero, A. Sivanathan, F. Bosché, M. Abdel-Wahab, Analysis of construction trade worker body motions using a wearable and wireless motion sensor network, Autom. Constr. 83 (2017) 48–55,
[58] J. Ryu, A. Alwasel, C.T. Haas, E. Abdel-Rahman, Analysis of relationships between body load and training, work methods, and work rate: Overcoming the novice mason’s risk hump, J. Constr. Eng. Manage. 146 (2020) 04020097, https://doi.org/10.1061/(ASCE)CO.1943-7862.0001889.
[59] H. Jebelli, C.R. Ahn, T.L. Stentz, Comprehensive fall-risk assessment of construction workers using inertial measurement units: Validation of the gait-stability metric to assess the fall risk of iron workers, J. Comput. Civil Eng. 30 (2015) 04015034,
[60] H. Jebelli, C.R. Ahn, T.L. Stentz, The validation of gait-stability metrics to assess construction workers’ fall risk, Am. Soc. Civil Eng. (ASCE) (2014), 10.1061/9780784413616.124.
[61] T.-K. Lim, S.-M. Park, H.-C. Lee, D.-E. Lee, Artificial neural network–based slip-trip classifier using smart sensor for construction workplace, J. Constr. Eng. Manage. 142 (2015) 04015065,
[62] R. Dzeng, Y. Fang, I. Chen, A feasibility study of using smartphone built-in accelerometers to detect fall portents, Autom. Constr. 38 (2014) 74–86, https://


[63] X. Yan, H. Li, H. Zhang, T.M. Rose, Personalized method for self-management of trunk postural ergonomic hazards in construction rebar ironwork, Adv. Eng. Inf. 37 (2018) 31–41,
[64] K. Yang, C.R. Ahn, Inferring workplace safety hazards from the spatial patterns of workers’ wearable data, Adv. Eng. Inf. 41 (2019), 100924, 10.1016/j.aei.2019.100924.
[65] K. Yang, C.R. Ahn, H. Kim, Validating ambulatory gait assessment technique for hazard sensing in construction environments, Autom. Constr. 98 (2019) 302–309,
[66] L. Zhang, M.M. Diraneyya, J. Ryu, C.T. Haas, E.M. Abdel-Rahman, Jerk as an indicator of physical exertion and fatigue, Autom. Constr. 104 (2019) 120–128,
[67] P.M. Domingos, A few useful things to know about machine learning, Commun. ACM 55 (2012) 78–87,
[68] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (2015) 436, https://doi.org/10.1038/nature14539.
[69] T. Plötz, N.Y. Hammerla, P.L. Olivier, Feature learning for activity recognition in ubiquitous computing, in: Twenty-Second International Joint Conference on Artificial Intelligence, 2011. 10.5591/978-1-57735-516-8/IJCAI11-290.
[70] M. Zeng, L.T. Nguyen, B. Yu, O.J. Mengshoel, J. Zhu, P. Wu, J. Zhang, Convolutional neural networks for human activity recognition using mobile sensors, in: 6th International Conference on Mobile Computing, Applications and Services, IEEE, 2014, pp. 197-205. 10.4108/icst.mobicase.2014.257786.
[71] Y. Guan, T. Plötz, Ensembles of deep lstm learners for activity recognition using wearables, in: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 1 (2017) 11.
[72] F.J.O. Morales, D. Roggen, Deep convolutional feature transfer across mobile activity recognition domains, sensor modalities and locations, in: Proceedings of the 2016 ACM International Symposium on Wearable Computers, 2016, pp. 92–99,
[73] A. Sathyanarayana, S. Joty, L. Fernandez-Luque, F. Ofli, J. Srivastava, A. Elmagarmid, T. Arora, S. Taheri, Sleep quality prediction from wearable data using deep learning, JMIR mHealth and uHealth 4 (2016), e125, 10.2196/mhealth.6562.
[74] K.M. Rashid, J. Louis, Times-series data augmentation and deep learning for construction equipment activity recognition, Adv. Eng. Inf. 42 (2019), 100944,
[75] H. Jebelli, M.M. Khalili, S. Lee, Mobile eeg-based workers’ stress recognition by applying deep neural network, Advances in informatics and computing in civil and construction engineering, Springer, 2019, pp. 173-180. 978-3-030-00220-6_21.
[76] J. Zhao, E. Obonyo, Convolutional long short-term memory model for recognizing postures from wearable sensor, CEUR Workshop Proc. (2019).
[77] A. Sathyanarayana, S. Joty, L. Fernandez-Luque, F. Ofli, J. Srivastava, A. Elmagarmid, S. Taheri, T. Arora, Impact of physical activity on sleep: A deep learning based exploration, arXiv preprint arXiv:1607.07034, (2016). 10.2196/mhealth.6562.
[78] Y. LeCun, F.J. Huang, L. Bottou, Learning methods for generic object recognition with invariance to pose and lighting, CVPR (2), Citeseer (2004) 97–104, https://
[79] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (1997) 1735–1780,
[80] F.A. Gers, J. Schmidhuber, F. Cummins, Learning to forget: Continual prediction with lstm, (1999). 10.1049/cp:19991218.
[81] C. Olah, Understanding lstm networks, 2015., 2015 (accessed 28 January 2020).
[82] A. Karpathy, J. Johnson, L. Fei-Fei, Visualizing and understanding recurrent networks, arXiv preprint arXiv:1506.02078, (2015). 1506.02078.
[83] L. Pigou, A. Van Den Oord, S. Dieleman, M. Van Herreweghe, J. Dambre, Beyond temporal pooling: Recurrence and temporal convolutions for gesture recognition in video, Int. J. Comput. Vision 126 (2018) 430–439, s11263-016-0957-7.
[84] S. Plagenhoef, F.G. Evans, T. Abdelnour, Anatomical data for analyzing human motion, Res. Q. Exerc. Sport 54 (1983) 169–178, 02701367.1983.10605290.
[85] O. Banos, J.-M. Galvez, M. Damas, H. Pomares, I. Rojas, Window size impact in human activity recognition, Sensors 14 (2014) 6474–6499, 10.3390/s140406474.
[86] Y. Bengio, Practical recommendations for gradient-based training of deep architectures, Neural networks: Tricks of the trade, Springer, 2012, pp. 437-478. 10.1007/978-3-642-35289-8_26.
[87] I. Goodfellow, Y. Bengio, A. Courville, Deep learning, MIT press, 2016.
[88] T.S. Sethi, M. Kantardzic, On the reliable detection of concept drift from streaming unlabeled data, Expert Syst. Appl. 82 (2017) 77–99, eswa.2017.04.008.
[89] J.M. Lambrecht, R.F. Kirsch, Miniature low-power inertial sensors: Promising technology for implantable motion capture systems, IEEE Trans. Neural Syst. Rehabil. Eng. 22 (2014) 1138–1147, TNSRE.2014.2324825.
[90] Z. Wang, M. Jiang, Y. Hu, H. Li, An incremental learning method based on probabilistic neural networks and adjustable fuzzy clustering for human activity recognition by using wearable sensors, IEEE Trans. Inf Technol. Biomed. 16 (2012) 691–699,
[91] P. Probst, B. Bischl, A.-L. Boulesteix, Tunability: Importance of hyperparameters of machine learning algorithms, arXiv preprint arXiv:1802.09596, (2018).
[92] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, Scikit-learn: Machine learning in python, J. Mach. Learn. Res., 12 (2011) 2825-2830. hal-00650905.
[93] N. Ravi, N. Dandekar, P. Mysore, M.L. Littman, Activity recognition from accelerometer data, 2005. 1620092.1620107.
[94] E.A. Heinz, K.S. Kunze, M. Gruber, D. Bannach, P. Lukowicz, Using wearable sensors for real-time recognition tasks in games of martial arts-an initial experiment, in: Computational Intelligence and Games, 2006 IEEE Symposium on, IEEE, 2006, pp. 98-102. 10.1109/CIG.2006.311687.
[95] F. Chollet, Keras: Deep learning library for theano and tensorflow. , 2015 (accessed July 13 2020).
[96] J. Zhao, Convolutional lstm model for wearable imus. JunqiZhao/Convolutional-LSTM-for-Wearable-IMUs, 2019 (accessed 28 July 2020).
[97] A. Bulling, U. Blanke, B. Schiele, A tutorial on human activity recognition using body-worn inertial sensors, ACM Comput. Surv. (CSUR) 46 (2014) 33, https://doi.org/10.1145/2499621.