A lightweight approach to two-person interaction classification in sparse image sequences

A lightweight neural network-based approach to two-person interaction classification in image sequences, based on human skeletons detected in sparse video frames, is proposed. The idea is to use an ensemble of pose classifiers ("experts"), where every expert is trained on different time-indexed snapshots of an interaction. Thus, the expertise of "weak" classifiers is distributed over the time duration of an interaction. The overall classification result is a weighted combination of the results of all the pose experts. An important element of the proposed solution is the refinement of skeleton data, based on a merging-of-joints procedure. This allows the generation of reliable features that are passed to the artificial neural network (ANN). This is the key to our lightweight solution, as the ANN resources needed for feature space transformation can be significantly limited. Our network model was trained and tested on the interaction subset of the well-known NTU RGB+D dataset, although only 2D skeleton information is used, as is typical in video analysis. The test results show that the performance of our method is comparable with some of the best LSTM- and CNN-based classifiers reported so far for this dataset, when they process sparse frame sequences, as we do. The recently proposed multi-stream Graph CNNs have shown superior results but only when processing dense frame sequences. Considering the dominating processing time and resources needed for skeleton estimation in every frame of the sequence, the key to real-time interaction recognition is to limit the number of processed frames.


I. INTRODUCTION
The aim of our work is the analysis of human interactions in specific time-related image sequences. The data can originate from the decomposition of video clips into frames or directly from snapshots of videos posted as image galleries on the Internet. Their common property is the sparsity of time-relevant information (Figure 1).
The approaches to vision-based human activity recognition can be divided into two main categories: activity recognition directly in video data [1] or skeleton-based methods [2], where the 2D or 3D human skeletons are detected first, even by specialized devices, like the Microsoft Kinect.
In early solutions, hand-designed features like edges, contours, Scale-Invariant Feature Transform (SIFT) and Histogram of Oriented Gradients (HOG) have usually been used for detection and localization of human body parts or key points in the image [3], [4].
This work was supported by "Narodowe Centrum Badań i Rozwoju", Warszawa, Poland, grant No. CYBERSECIDENT/455132/III/NCBR/2020 (the APAKT project).
More recently, neural network-based solutions have been successfully proposed for human pose and human activity recognition problems, e.g., solutions based on Deep Neural Networks (DNN) [5], especially on Long Short-Term Memory (LSTM) models and Convolutional Neural Networks (CNN) [6], and more recently on Graph CNNs [7]. CNNs can automatically learn rich semantic and discriminative features from images and multi-dimensional signals. Furthermore, CNNs can learn both spatial and temporal information from signals and model scale-invariant features as well. Graph CNNs allow efficient implementations of convolution layers when structured data (i.e., graphs) are processed. Some popular solutions to human skeleton estimation (i.e., detection and localization) in images, based on DNN and CNN models, are OpenPose [8], DeepPose [9] and DeeperCut [10].
Hence, human action and interaction recognition in video is nowadays most often based on skeleton data extracted from video frames. The state-of-the-art solutions to human action encoding and classification, which process human skeleton data, typically use "heavy" deep neural networks, like 3D CNNs and LSTMs, or somewhat lighter Graph CNNs [11], [12].
In this work, we focus on two-person interaction recognition in sparse frame sequences, assuming the existence of skeleton data for key video frames. We took the straightforward idea of extending two-person pose classification in still images to two-person interaction classification in image sequences by applying an ensemble of pose classifiers [13]. Typically for a classifier ensemble, individual classifiers are "experts" in different parts of the input data domain and an extra weighting network differentiates between subdomains. In our approach, the pose classifiers are experts at different time stages, while their input space itself (i.e., the spatial image information) does not affect the fusion weights. By performing a simple time decomposition, we distinguish four subsequent time periods of an interaction process: start, before midterm, after midterm and final. The final fusion takes the form of a weighted sum of the class likelihoods of all the pose classifiers.
There are four remaining sections in this work. Section II reviews some recent approaches to human pose, action and interaction recognition. Our solution is presented in Section III. In Section IV, experiments that verify the approach are described. The classifiers are trained and tested on two datasets: our own human pose image dataset, called "humiact5", and the well-known video dataset for action and interaction recognition, NTU RGB+D [14]. Finally, in Section V, we summarize our work and contribution to the subject.

II. RELATED WORK
The recognition of human activities in video has been a hot research topic over the last 15 years. Typically, human activity recognition in images and video first requires a detection of human body parts or key-points of a human skeleton. The skeleton-based methods compensate for some of the drawbacks of vision-based methods, for example by assuring the privacy of persons and reducing the sensitivity to scene lighting.
The vast majority of research is based on the use of artificial neural networks. However, more classical approaches have also been tried, such as the SVM (e.g. [15], [16]). Yan et al. [17] used multiple features, like a "bag of interest points" and a "histogram of interest point locations", to represent human actions. They proposed a combination of classifiers in which AdaBoost and sparse representation (SR) are used as basic algorithms. In the work of Vemulapalli et al. [18], human actions are modeled as curves in a Lie group. The classification process uses a combination of dynamic time warping, Fourier temporal pyramid representation and a linear SVM.
Thanks to higher-quality results, artificial neural networks are replacing other methods. Thus, most of the recent research in the area of human activity classification differs only in the proposed network architecture. Networks based on the LSTM architecture or a modification of this architecture (an ST-LSTM network with trust gates) were proposed by Liu et al. [19] and Shahroudy et al. [14]. They introduced so-called "Trust Gates" for controlling the content of an LSTM cell and designed an LSTM network capable of capturing spatial and temporal dependencies at the same time (denoted as ST-LSTM). The task performed by the gates is to assess the reliability of the obtained joint positions based on the temporal and spatial context. This context is based on the position of the examined joint at the previous time step (temporal context) and the position of the previously examined joint at the present time step (spatial context). This behavior is intended to help the network memory cells assess which locations should not be remembered and which ones should be kept in memory. The authors also drew attention to the importance of capturing the inherent spatial dependencies already present in the skeleton data. They experimented with different mappings of the joint set to a sequence. Among others, they mapped the skeleton data into a tree representation, duplicating joints when necessary to keep the spatial neighborhood relation, and performed a tree traversal to get a sequence of joints. Such an enhancement of the input data increased the classification accuracy by several percent.
The work [20] introduced the idea of applying convolutional filters to pseudo-images in the context of action classification. A pseudo-image is a map (a 2D matrix) of feature vectors from successive time points, aligned along the time axis. Thanks to these two dimensions, the convolutional filters find local relationships of a combined time-space nature. Liang et al. [21] extended this idea to a multi-stream network with three stages. They use 3 types of features, extracted from the skeleton data: positions of joints, motions of joints and orientations of line segments between joints. Every feature type is processed independently in its own stream, but after every stage the results are exchanged between streams.
Graph convolutional networks are currently considered a natural approach to the action (and interaction) recognition problem. They are able to achieve high-quality results with only modest requirements for computational resources. "Spatial Temporal Graph Convolutional Networks" [22] and "Actional-Structural Graph Convolutional Networks" [23] are examples of such solutions.
Another recent development is the pre-processing of the skeleton data in order to extract different types of information (e.g., information on joints and bones, and their relations in space and time). Such data streams are first processed separately by so-called multi-stream neural networks and later fused into a final result. Examples of such solutions are the "Two-Stream Adaptive Graph Convolutional Network" (2S-AGCN) and the "Multistream Adaptive Graph Convolutional Network" (AAGCN), proposed by Shi et al. [24], [25].
One of the best performances on the NTU RGB+D interaction dataset is reported in the work of Perez et al. [26]. Its main contribution is a powerful two-stream network with three stages, called "Interaction Relational Network" (IRN).
The network input consists of basic relations between the joints of two interacting persons, tracked over the length of the image sequence. An important step is the initial extraction of relations between pairs of joints: both the distances between joints and their motion are obtained. The neural network performs further encoding and decoding of these relations and a final classification. The first stream processes within-person relations, while the second one processes between-person relations. The high-quality version of the IRN network, called IRN-LSTM, uses a final LSTM with 256 units. This allows the network to reason over the interactions during the whole video sequence, and all frames of the video clip are expected to be processed. In the basic IRN, a simple densely-connected classifier is used instead of the LSTM and a sparse sequence of frames is processed.
The currently best results are reported by Zhu et al. [27], where two new modules are proposed for a baseline 2S-AGCN network. The first module extends the idea of modelling relational links between two skeletons by a spatio-temporal graph to a "Relational Adjacency Matrix (RAM)". The second novelty is a processing module, called "Dyadic Relational Graph Convolution Block", which combines the RAM with spatial graph convolution and temporal convolution to generate new spatial-temporal features.
From the analysis of the most successful recent solutions, we can draw three main conclusions: 1) an analytic preprocessing of skeleton data should be used to extract meaningful information and cancel noisy data, either by employing classic functions or learnable function approximations (e.g. relational networks); 2) lightweight solutions that employ background (problem-specific) knowledge are preferred, e.g. graph CNNs instead of CNNs, or CNNs with 2-D kernels instead of 3-D CNNs; 3) a video clip containing a specific human action or interaction can be processed alternatively as a sparse or a dense frame sequence, where a sparse sequence is chosen to achieve real-time processing under limited computational resources, while the processing of a dense sequence leads to better performance.

A. Structure
The input data for our interaction classifier is a sequence of sparse video frames. Assuming a video clip is given, the start and end of an interaction should be detected first. Then, the video clip is split into some number M of consecutive time intervals (e.g. M = 16). From each interval one frame is selected for classification. Assume that M = N · m, where N is the number of time periods and m is the number of frames in one period. We may distinguish N = 4 periods: start, 1st intermediate, 2nd intermediate and final. A separate pose classifier (the "expert") is dedicated to the classification of frames from a single period. As shown in Figure 2, the proposed solution consists of several processing stages:
1) Skeleton estimation: the OpenPose net [28] is applied to detect human skeletons with their 2D joints in an RGB image (a video frame).
2) Feature engineering: a keypoint enhancement algorithm is proposed in order to get two more reliable sets of skeleton joints from the OpenPose results; next, feature vectors are extracted from the refined joints.
3) Pose classifier training: several lightweight, densely-connected MLP networks are trained; every one of them is a "weak" classifier.
4) Model evaluation: alternative network models are evaluated to find the optimal model configuration and training parameters. A Keras tuner [29], the RandomSearch algorithm [30], is applied to find optimal hyper-parameter settings.
5) Ensemble classifier: a dense gain network is also trained to learn the weights for the results of the individual pose classifiers. Two versions of the final classifier are implemented: one with fixed weights and one with learned weights.
6) Model testing: after accumulating the pose class likelihoods over the frame sequence, the most likely interaction class is selected as the winner.
Two datasets, our own humiact5 and the RGB subset of the NTU RGB+D dataset, are used to evaluate the created models.

B. Skeleton estimation
In the paper [8], a multi-person 2D pose estimation architecture was proposed based on part affinity fields (PAFs). The work introduced an explicit nonparametric representation of the keypoint association, which encodes both the position and orientation of the human limbs. The designed architecture can learn both human keypoint detection and association, using heatmaps of human key-points and part affinity fields, respectively. It iteratively predicts part affinity fields and part detection confidence maps. The part affinity fields encode part-to-part associations, including part locations and orientations. In the iterative architecture, both PAFs and confidence maps are refined over successive stages, with intermediate supervision at each stage. Subsequently, a greedy parsing algorithm is employed to effectively parse human poses. The work resulted in the release of the OpenPose library, the first real-time system for multi-person 2D pose estimation [28]. In our research, we use the core block of OpenPose, the "body_25 model", to extract 25 human key-points in images. The result is a 25-element array providing the 2D image coordinates and a confidence score for every keypoint.

C. Feature engineering
From the (possibly more than two) sets of skeleton joints detected in the image by OpenPose, the two main actors are selected based on a size measure. The total variability of skeleton keypoint locations is calculated for every skeleton and the two skeletons with the highest variability are chosen for feature engineering.
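The selection of the two main actors can be sketched as follows. The source does not define the variability measure exactly; as an assumption, it is taken here to be the sum of the variances of the x and y coordinates of the detected joints. The helper names `variability` and `select_main_actors` are hypothetical.

```python
from statistics import pvariance


def variability(skeleton):
    """Total spread of the detected joints of one skeleton.

    skeleton: list of (x, y, confidence) tuples; a confidence of 0
    is assumed to mark an undetected joint.
    """
    pts = [(x, y) for x, y, c in skeleton if c > 0]
    if len(pts) < 2:
        return 0.0
    xs, ys = zip(*pts)
    # Assumed measure: variance along x plus variance along y.
    return pvariance(xs) + pvariance(ys)


def select_main_actors(skeletons, k=2):
    """Return the k skeletons with the highest variability."""
    return sorted(skeletons, key=variability, reverse=True)[:k]
```

A small background figure produces a tight cluster of joints and therefore a low score, so it is dropped in favour of the two largest figures.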
1) Skeleton enhancement: There are cases where OpenPose wrongly splits one human region into different regions due to occlusion, low resolution, or complex visual context. Therefore, we developed a keypoint (i.e., skeleton joints) merging and replacement algorithm. In the first step, we try to merge sets of joints, where applicable, to produce finer skeleton joints (see Figure 3).
Two calculations are made for each pair of sets: the number of intersection points of the two sets and the distance between them. The intersection indicator value is scaled by the number of points of the smaller set. The distance calculation takes the two mean points and the two standard deviation values into account. These calculated values are then compared with the corresponding thresholds to decide whether the two sets are going to be merged or not. If the merging conditions are met, the intersection points of the two sets are treated in the following way: the data points with the higher probability are kept and the lower ones are ignored.
For the sake of clarity, Figure 4 illustrates the merging procedure of two specific sets, $A$ and $B$, based on the assumption that they come from the same person in the image. The bigger set $A$ is missing key-points for the left leg, while the smaller set $B$ includes these key-points. The mean points (centers of gravity) of $A$ and $B$ are $m_A$ and $m_B$, respectively, and the standard deviations of the joint locations for $A$ and $B$ are $[std_{A,x}, std_{A,y}]$ and $[std_{B,x}, std_{B,y}]$, respectively.
A merging action is performed when both the scaled intersection indicator and the distance satisfy their threshold conditions, where $\theta_1$ and $\theta_2$ are the intersection threshold and the distance threshold, respectively.
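The merging test described above can be illustrated with a short sketch. The exact form of the two conditions is not given in the text, so the following are assumptions made only for illustration: the intersection indicator is the number of joints detected in both sets divided by the size of the smaller set and must not exceed `theta1`, and the distance is the per-axis difference of the mean points divided by the sum of the standard deviations and must not exceed `theta2`. All function names are hypothetical.

```python
from statistics import mean, pstdev


def detected(s):
    """Indices of joints with non-zero confidence; s is a list of (x, y, conf)."""
    return {i for i, (_, _, c) in enumerate(s) if c > 0}


def centre_and_spread(s):
    """Mean point and per-axis standard deviation of the detected joints."""
    pts = [s[i] for i in sorted(detected(s))]
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return (mean(xs), mean(ys)), (pstdev(xs), pstdev(ys))


def should_merge(a, b, theta1, theta2):
    """Assumed merging test for two joint sets of the same length."""
    ia, ib = detected(a), detected(b)
    if not ia or not ib:
        return False
    # Intersection indicator, scaled by the size of the smaller set.
    overlap = len(ia & ib) / min(len(ia), len(ib))
    (ax, ay), (sax, say) = centre_and_spread(a)
    (bx, by), (sbx, sby) = centre_and_spread(b)
    eps = 1e-9  # avoids division by zero for degenerate sets
    dx = abs(ax - bx) / (sax + sbx + eps)
    dy = abs(ay - by) / (say + sby + eps)
    return overlap <= theta1 and max(dx, dy) <= theta2


def merge(a, b):
    """Per joint, keep the detection with the higher confidence."""
    return [pa if pa[2] >= pb[2] else pb for pa, pb in zip(a, b)]
```

The `merge` step follows the rule stated in the text: where both sets contain the same joint, the more confident detection wins.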
After a merging action has been performed, the remaining joints in the smaller set (call it $S$) may replace the corresponding low-confidence joints in a subset $B_s$ of the bigger set $B$. To decide about this, the following values are considered: the normalized Euclidean distance between the joints of the smaller set and the corresponding candidate joints of the subset $B_s$, the average confidence of all candidate joints in the smaller set $S$ and the average confidence of the corresponding joints in the bigger set (Figure 5).
Let $S$ and $B_s$ be $N$-element sets of corresponding joints considered for possible replacement. The standard deviations of the joint locations of the smaller set are $[std_{S,x}, std_{S,y}]$. Let the confidence value of a joint $j$ be denoted as $P(j)$. A joint replacement is performed when the three values listed above satisfy their threshold conditions, where $\theta_3$ is the normalized Euclidean distance threshold, $\theta_4$ is the confidence threshold for $S$ and $\theta_5$ is the confidence threshold for $B_s$.
The skeletons that remain after the merging and replacement steps are ordered by their bounding box size in descending order. With $(w, h)$ representing the width and height of a bounding box, the score is $w \cdot h$. The two sets with the highest score are kept and used further in the feature extraction step.
2) Feature extraction: Feature extraction means the calculation of normalized distances between pairs of joints from two skeletons, tracked in the frame sequence of a video clip. First, the distance between the two middle-of-spine points of the two human skeletons is calculated and normalized by the length of the spine $l_1$ of the first person, giving the distance feature (Figure 6). Then, every set of joints is independently normalized by: translating the local coordinate system to the middle-of-spine point $O_1$ or $O_2$, rotating the points so that the spine segment (connecting joint 1 with joint 8) is parallel to the Y axis of the local system, and finally, scaling the point coordinates by the spine length $l_i$.
Denote by $H_1$, $H_2$ the skeletons of the first and the second human; by $O_1$, $O_2$ the centers of the spine segments of the first and second human, respectively; by $l_1$ the length of the spine segment of the first human; and by $\alpha_1$, $\alpha_2$ the rotation angles that make the corresponding spine segments parallel to the Y axes of the local Cartesian coordinate systems. The distance feature is calculated as the distance between the local system origins $O_1$, $O_2$, normalized by the length $l_1$:
$$d = \frac{\lVert O_1 - O_2 \rVert}{l_1}$$
The normalization of joint coordinates (translation to the local system, rotation, scaling) is performed independently for every set $H_1$, $H_2$. Let $p_i = (p_{i,x}, p_{i,y})$ denote the image coordinates of a joint from skeleton $H_i$ $(i = 1, 2)$. The normalization of this joint is given as follows:
$$p'_i = \frac{1}{l_i}\, R(\alpha_i)\,(p_i - O_i), \qquad R(\alpha_i) = \begin{pmatrix} \cos\alpha_i & -\sin\alpha_i \\ \sin\alpha_i & \cos\alpha_i \end{pmatrix}$$
3) Feature vector: Both OpenPose (applied for our RGB dataset) and the built-in skeleton detector of Kinect v2 (generating the skeleton data in the NTU RGB+D dataset) deliver person skeletons of 25 joints. By analysing a small skeleton data subset, we found that the data for joints numbered from 15 to 24, corresponding to "small" parts, like fingers, are very often missing. Thus, we use only joints numbered from 0 to 14. The feature vector obtained from the skeleton data of a single frame has 61 dimensions, as there are 15 joints × 2 coordinates × 2 sets and one distance feature. Assuming that we have selected m frames for analysis, we get a map of m × 61 features.
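The translation, rotation and scaling steps and the assembly of the 61-dimensional feature vector can be sketched as below. This is a minimal illustration, not the authors' implementation: the ordering of the features inside the vector, the sign convention of the rotation and the function names `normalize` and `feature_vector` are assumptions, and both spine joints (1 and 8) are assumed to be detected.

```python
import math


def normalize(skel):
    """Normalize one skeleton given as a list of 15 (x, y) joints.

    Returns (spine centre, spine length, normalized joints).
    """
    (x1, y1), (x8, y8) = skel[1], skel[8]
    ox, oy = (x1 + x8) / 2, (y1 + y8) / 2      # middle-of-spine point O
    length = math.hypot(x8 - x1, y8 - y1)      # spine length l
    # Angle that rotates the spine vector (joint 1 -> joint 8) onto +Y.
    alpha = math.atan2(x8 - x1, y8 - y1)
    ca, sa = math.cos(alpha), math.sin(alpha)
    out = []
    for x, y in skel:
        tx, ty = x - ox, y - oy                # translate to the local origin
        rx, ry = ca * tx - sa * ty, sa * tx + ca * ty   # rotate
        out.append((rx / length, ry / length))          # scale
    return (ox, oy), length, out


def feature_vector(h1, h2):
    """61 features: one distance plus 2 x 15 normalized joints (x, y)."""
    o1, l1, n1 = normalize(h1)
    o2, _, n2 = normalize(h2)
    d = math.hypot(o1[0] - o2[0], o1[1] - o2[1]) / l1
    return [d] + [c for p in n1 + n2 for c in p]
```

After normalization the two spine end points always lie on the local Y axis at a distance of 0.5 from the origin, which makes the remaining joint coordinates independent of the position, orientation and size of the person in the image.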

D. Pose classifier training and evaluation
The feature data is fed to several MLP-based pose classifiers. We use a fully-connected MLP architecture with variants of several hyper-parameters: the number of hidden layers of the network can vary from 1 to 3, different activation functions (ReLU and/or sigmoid) may be chosen, and the number of neurons in the hidden layers and the learning rate can vary as well. The ANN is implemented using Keras [29].
Automated hyper-parameter tuning [31] is a crucial step during ANN model training to increase the model's performance. We perform a hyper-parameter search during training using the Random Search algorithm offered in Keras [30]. For both datasets the hyper-parameter search space is defined as the tuple $(a_{fun}, l_{rate}, n_{layer}, n_{neur})$, where the entries are: the activation function, $a_{fun} \in \{\text{relu}, \text{sigmoid}\}$, the learning rate, $l_{rate} \in [10^{-5}, 10^{-2}]$, the number of hidden layers, $n_{layer} \in \{1, 2, 3\}$, and the number of neurons in a hidden layer, $n_{neur} \in \{100, 200, \ldots, 1000\}$.
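To make the search space concrete, the sketch below draws random configurations from it using only the Python standard library. It is not the Keras implementation used in the paper: the log-uniform sampling of the learning rate, the independent choice of activation and width for every hidden layer, and the names `sample_config`, `random_search` and `score_fn` are assumptions. In practice `score_fn` would train a model with the given configuration and return its validation accuracy.

```python
import random


def sample_config(rng):
    """Draw one configuration from the search space described in the text."""
    n_layer = rng.choice([1, 2, 3])
    return {
        "n_layer": n_layer,
        # One activation and one width per hidden layer (assumption).
        "a_fun": [rng.choice(["relu", "sigmoid"]) for _ in range(n_layer)],
        "n_neur": [rng.choice(range(100, 1001, 100)) for _ in range(n_layer)],
        # Log-uniform over [1e-5, 1e-2] (assumption).
        "l_rate": 10 ** rng.uniform(-5, -2),
    }


def random_search(score_fn, trials, seed=0):
    """Return the best of `trials` randomly drawn configurations."""
    rng = random.Random(seed)
    candidates = [sample_config(rng) for _ in range(trials)]
    return max(candidates, key=score_fn)
```

Random search evaluates each candidate independently, so the number of trials directly trades search time against coverage of the space.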

E. Ensemble classifier
As mentioned earlier, every pose classifier is an "expert" in recognizing snapshots taken during a different time period of an interaction. In practice, the training of such an ensemble is performed at the same time, but 3 out of 4 "expert" networks are always in a dropout mode. The network that is actually updated depends on the time period the current input frame belongs to.
In the testing process, the interaction class is known after the entire frame sequence from a single video clip has been classified and the results of the individual pose classifiers have been accumulated. The likelihood of every interaction class comes from an aggregation of pose class likelihoods, as a weighted sum of pose likelihoods, for frames indexed from t=0 to t=T.
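The accumulation of weighted pose likelihoods over a frame sequence can be sketched as follows. The data layout and the names `period_of`, `hard_weights` and `classify_sequence` are hypothetical. The "hard" weighting shown here, where only the expert of the current time period contributes, is one possible fixed-weight scheme and is an assumption, since the text does not list the fixed weights.

```python
def period_of(t, num_frames, num_periods=4):
    """Index of the time period that frame t belongs to."""
    return min(t * num_periods // num_frames, num_periods - 1)


def hard_weights(num_frames, num_periods=4):
    """Weight 1 for the expert of the current period, 0 for the others."""
    return [
        [1.0 if i == period_of(t, num_frames, num_periods) else 0.0
         for i in range(num_periods)]
        for t in range(num_frames)
    ]


def classify_sequence(likelihoods, weights):
    """Weighted sum of pose likelihoods over all frames and experts.

    likelihoods[t][i][c]: likelihood of class c from expert i for frame t.
    weights[t][i]: weight of expert i at frame t.
    Returns (winning class index, list of accumulated class scores).
    """
    num_classes = len(likelihoods[0][0])
    scores = [0.0] * num_classes
    for frame_lik, frame_w in zip(likelihoods, weights):
        for expert_lik, w in zip(frame_lik, frame_w):
            for c, p in enumerate(expert_lik):
                scores[c] += w * p
    winner = max(range(num_classes), key=scores.__getitem__)
    return winner, scores
```

Passing a different `weights` table, for example one produced by a trained gain network, changes the fusion without touching the accumulation code.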
2) The gain network: In the trained case, the gain network provides gain coefficients $w_i(t)$ for the four pose classifiers depending on the frame index $t$.

IV. RESULTS

A. Datasets
In order to evaluate and test the trained classifiers, two datasets were used. The search for the best hyper-parameters of a single pose classifier is performed by training and validating the classifiers on our humiact5 dataset. It consists of images of 5 two-person poses, which are snapshots of interactions: boxing, facing, hand holding, hand shaking and hugging/kissing. There are 1695 images in total, of which 1154 images are in the training set and the remaining 541 images are in the evaluation set (Figure 7). In this series of experiments, the OpenPose library has been applied for skeleton detection in RGB images.
The best configuration of the pose experts and the final, time-accumulating network are trained and tested on the interaction subset of the NTU RGB+D dataset. It includes 11 two-person interactions of 40 actors: A50: punch/slap, A51: kicking, A52: pushing, A53: pat on back, A54: point finger, A55: hugging, A56: giving object, A57: touch pocket, A58: shaking hands, A59: walking towards, A60: walking apart. In our experiments, the skeleton data already provided with the NTU RGB+D dataset is used. There are 10420 video clips in total, of which ca. 70% are in the training set and the remaining 30% are in the test set. No distinct validation subset is distinguished.
The NTU RGB+D dataset allows a cross-subject (CS) or a cross-view (CV) evaluation to be performed. In the cross-subject setting, samples used for training show actions performed by half of the actors, while test samples show actions of the remaining actors. In the cross-view setting, samples recorded by two cameras are used for training, while samples recorded by the remaining camera are used for testing. We apply the cross-subject (CS) evaluation mode, i.e., videos of 20 persons are used for training and videos of the remaining 20 persons for testing. The training set contains video clips of users identified as: 1, 2, 4, 5, 8, 9, 13, 14, 15, 16, 17, 18, 19, 25, 27, 28, 31, 34, 35 and 38. The number of samples in the training set is 7649, while in the test set it is 2771. Each skeleton instance consists of 25 joints of 3D skeletons that apparently represent a single person (Figure 8). As our research objective is to analyse video data and to focus only on reliably detected joints, we use only the 2D information of the first 15 joints.
PROCEEDINGS OF THE FEDCSIS. SOFIA, BULGARIA, 2022
Fig. 7. Samples from our humiact5 dataset: RGB images with skeleton data
Fig. 8. Samples from the NTU RGB+D interaction dataset: RGB video frames with skeleton data [14]
From a video sample, a set of frames is chosen as follows: the video clip is uniformly split into N = 4 time intervals ("periods") and from every interval some number of frames m is selected (we tested m = 2, 4, 8). As m grows from 2 to 8, the number of frames in the training set grows from 61192 to 244768 and the number of frames in the test set grows from 22168 to 88672.
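The frame selection and the resulting dataset sizes can be reproduced with a few lines of code. The text does not say which frames are taken inside a period; picking the centre of each of the N · m equal sub-intervals is an assumption, and the names `sample_frame_indices` and `frames_in_set` are hypothetical.

```python
def sample_frame_indices(num_frames, num_periods=4, m=2):
    """Pick m frames from each of num_periods equal intervals of a clip.

    Assumption: the clip is cut into num_periods * m equal sub-intervals
    and the centre frame of every sub-interval is selected.
    """
    total = num_periods * m
    return [
        min(int((k + 0.5) * num_frames / total), num_frames - 1)
        for k in range(total)
    ]


def frames_in_set(num_clips, m, num_periods=4):
    """Number of frames extracted from num_clips video clips."""
    return num_clips * num_periods * m


# Sizes reported in the text: 7649 training clips and 2771 test clips.
train_sizes = [frames_in_set(7649, m) for m in (2, 4, 8)]
test_sizes = [frames_in_set(2771, m) for m in (2, 4, 8)]
```

With N = 4 periods every clip contributes 4 · m frames, which gives the reported totals for m = 2 and m = 8.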

B. Pose classifier optimization
The hyper-parameter optimization of a pose classifier is performed on the small humiact5 dataset. In order to run the RandomSearch function of Keras, an NNHyperModel is created, which implements the HyperModel class from the Keras-tuner. The hyper-parameters of the search space are declared in NNHyperModel as class parameters. Using the RandomSearch function, we identified three ANN configurations, each one being optimal for a given number of hidden layers (1, 2 or 3).
The performances of the three selected models after 100 epochs of training are shown in Table I. The best test accuracy (i.e., the recall averaged over all classes) of 84% was achieved by the second model, whereas the other two have shown an accuracy of 82%. Consequently, we have chosen an ANN configuration of 2 hidden layers with 700 and 500 neurons in the first and second layer, respectively. The activation functions are ReLU and sigmoid, respectively. The learning rate is $5.89 \cdot 10^{-5}$.

C. Verification on the NTU RGB+D dataset
We train and test our models in the CS (cross-subject) verification mode proposed for the NTU RGB+D dataset, i.e. the actors in the training set are different from those in the test set, but data from all the camera views are included in both sets. The frame sampling process for both training and testing is done three times with a different number of frames per time period (i.e., extracted from a single video sample): 2, 4, 8. The training set is split into learning and validation subsets: two thirds for learning and one third for validation/testing. Training is run for 100 epochs and the best validation result is chosen.

1) Pose classifiers:
In the following, we apply the second version of the ANN pose classifiers, with two hidden layers, as reported earlier in Table I (Table II).
An immediate observation is that all learning and test accuracies increase when the training data size is increased. Specifically, with 8 frames per time period (f/p), these accuracies reach 88% and 76%, respectively. The average per-class accuracies (i.e. over the four pose classes representing the same interaction class) of the ANN experts, obtained with 4 f/p frame sampling on the test set, are shown in Table III. There are 3 classes (A55, A58, A59) with an accuracy of at least 80%, 6 other classes with an accuracy from 60% to 80% and two classes below 60%. For comparison, with 11 classes a random prediction (a guess) would reach 1/11 = 9.09%. The highest accuracy is observed for the "A55: hugging" class. Here, the distance between the two persons is significantly smaller than for the other classes and the poses are relatively stable in every time period.
2) Ensemble of pose classifiers: There are two variants of the final ensemble classifier: E-ANN-1, where the final score of every interaction class is obtained with fixed weights, according to equation (11), and E-ANN-2, where the trainable gain network is used, according to equation (12). The class with the highest score is selected as the winner of the interaction classifier. A notable improvement of interaction classification is observed when the weighted pose likelihoods are accumulated over the time sequence. The mean accuracy of the best version of the pose experts (i.e., for a frame sampling of 8 f/p) was 88.2% (training) and 76.1% (testing), while the ensemble classifier has reached 92.4% and 81.3% (version 1), or 94.5% and 83.3% (version 2), respectively (Table IV).

D. Comparison
Many approaches to two-person interaction classification have been tested on the NTU RGB+D interaction dataset. We list some of the leading works in Table VI. Our solution needs a low number of weights to be trained and it processes a sparse frame sequence. It shows a good tradeoff between competitive accuracy and low complexity when compared with other recently reported results.
Let us explain how we counted the number of parameters of the E-ANN-2 network. Recall that the pose classifiers have a common part, the feature-transforming MLP with 2 hidden layers, and that there are separate fully-connected output layers for every pose classifier. We can create two versions of the E-ANN network: one network with multiple feature-transforming MLPs that process the four frame subsets in parallel, and another one that processes all frames in sequence.
As the individual results are finally aggregated over all frames, both configurations deliver the same final result. In the first configuration, 1 597 677 weights are needed, while in the sequential version only 415 977 weights:
1) The feature transforming ANN: 61 · 700 + 700 + 700 · 500 + 500 = 393 900; the FC classification layer: 500 · 11 + 11 = 5 511; the gain network: (11 + 11) + 11 = 33
2) Four parallel pose classifiers: 4 · 393 900 + 4 · 5 511 + 33 = 1 597 677
3) Four sequential pose classifiers: 393 900 + 4 · 5 511 + 33 = 415 977
Taking into account that the dominating processing time for a single frame is spent by the skeleton detector (on our equipment, it takes ca. 67 ms, compared to 1 ms for the pose classifier), the sequential version is preferred. Even when the skeleton detection itself is performed in parallel and one pose classifier is allocated to every phase subset of frames, the sequential version takes only (N − 1) ms more time than N pose classifiers running in parallel.
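The weight counts can be checked with a few lines of code. The layer sizes are taken from the text; the gain network size of 33 is copied as stated, and the helper name `dense_params` is hypothetical. Evaluating the listed terms with four separate output layers gives 415 977 weights for the sequential version, whereas a single shared output layer (393 900 + 5 511 + 33) would give 399 444.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully-connected layer."""
    return n_in * n_out + n_out


# Feature-transforming MLP: 61 inputs, hidden layers of 700 and 500 neurons.
feature_mlp = dense_params(61, 700) + dense_params(700, 500)
# Output (classification) layer of one pose expert: 11 interaction classes.
head = dense_params(500, 11)
# Gain network size as given in the text: (11 + 11) + 11.
gain = (11 + 11) + 11

parallel = 4 * feature_mlp + 4 * head + gain        # four complete experts
sequential = feature_mlp + 4 * head + gain          # shared feature MLP
single_head = feature_mlp + head + gain             # one shared output layer

print(feature_mlp, head, parallel, sequential, single_head)
```

Sharing the feature-transforming MLP removes three of its four copies, which accounts for almost all of the difference between the two configurations.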
Typically, the performance of an interaction classifier is significantly improved when dense frame sequences are processed instead of sparse ones. However, the overall processing time grows proportionally to the number of frames, as the computation is dominated by the skeleton estimation step. Thus, processing a dense sequence of 100 frames (typical for the best performing solutions with accuracy > 90%) takes roughly three times longer than processing a sparse sequence of 32 frames (where a typical accuracy is < 90%). The recently proposed multi-stream Graph CNNs have shown superior results but only when processing dense frame sequences. Considering the dominating processing time and resources needed for skeleton estimation in every frame of the sequence, the key to real-time interaction recognition is to limit the number of processed frames.
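A rough estimate of the per-clip processing time follows from the per-frame timings quoted earlier (ca. 67 ms for skeleton detection and 1 ms for the pose classifier). The sketch assumes that frames are processed one after another without any parallelism; the function name `clip_time_ms` is hypothetical.

```python
SKELETON_MS = 67   # skeleton detection per frame, as quoted in the text
POSE_MS = 1        # pose classification per frame, as quoted in the text


def clip_time_ms(num_frames):
    """Estimated processing time of one clip, frames handled sequentially."""
    return num_frames * (SKELETON_MS + POSE_MS)


sparse = clip_time_ms(32)    # sparse sequence: 4 periods x 8 frames
dense = clip_time_ms(100)    # dense sequence of 100 frames
ratio = dense / sparse
print(sparse, dense, ratio)
```

Because the cost per frame is constant, the ratio depends only on the number of frames (100 / 32), which is the "roughly three times" stated above.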

V. CONCLUSION
A lightweight approach to two-person interaction classification was proposed that can be applied both in video and in single-image analysis. This is a skeleton-based approach, which means that an external module for human detection and skeleton estimation in images is needed. We adopted the state-of-the-art OpenPose library for this purpose. This is a powerful deep network solution for human skeleton estimation in images. Our main contributions are algorithms for skeleton data correction and normalization and the design of an ANN classifier that has the form of an ensemble of several ANN-based pose experts. Aggregating four or more "weak" pose classifiers leads to an efficient and robust solution to human interaction classification. We also found that a comparison of classification approaches should consider not only the accuracy measure but also the amount of information received (i.e., whether a sparse or a dense frame sequence is analyzed). Our future research should focus on the extraction of motion information for the skeleton joints and on testing the model network on longer frame sequences.