Who danced better? Ranked TikTok dance video dataset and pairwise action quality assessment method

Video-based action quality assessment (AQA) is a non-trivial task due to the subtle visual differences between data produced by experts and non-experts. Current methods are extended from the action recognition domain where most are based on temporal pattern matching. AQA has additional requirements where order and tempo matter for rating the quality of an action. We present a novel dataset of ranked TikTok dance videos, and a pairwise AQA method for predicting which video of a same-label pair was sourced from the better dancer. Exhaustive pairings of same-label videos were randomly assigned to 100 human annotators, ultimately producing a ranked list per label category. Our method relies on a successful detection of the subject’s 2D pose inside successive query frames where the order and tempo of actions are encoded inside a produced String sequence. The detected 2D pose returns a top-matching Visual word from a Codebook to represent the current frame. Given a same-label pair, we generate a String value of concatenated Visual words for each video. By computing the edit distance score between each String value and the Gold Standard’s (i.e., the top-ranked video(s) for that label category), we declare the video with the lower score as the winner. The pairwise AQA method is implemented using two schemes, i.e., with and without text compression. Although the average precision for both schemes over 12 label categories is low, at 0.45 with text compression and 0.48 without, precision values for several label categories are comparable to past methods’ (median: 0.47, max: 0.66).


Introduction
Present social media platforms and video-sharing sites contain copious amounts of tutorial-type videos of a person (or a group) performing a skilled or semi-skilled task in front of a stationary or moving camera. Audiences seek visual demonstration of a task or a skill to emulate, and there is ample evidence of this trend gaining popularity in recent years. These tutorial videos are popular mainly due to their accessibility and perceived usefulness. In a survey [1] of 141 university students, all had consumed YouTube tutorial videos before, with 57.4% citing easy access (to the videos) and time saved on figuring out how to perform a task as their main motivation. In 2020, Statista Research published a report on TikTok views categorized by hashtags [2]. Dance-related content had the second highest number of hashtag views, at 181b views. Also, 5 out of the 10 most popular hashtags were tutorial-type content, covering wide-ranging topics, i.e., fitness/sports (57b views), home renovation/DIY (39b views), beauty/skincare (33b views), and recipes/cooking (18b views). In these videos, individuals of varying skill levels performed tasks such as cooking, assembly, and repairs, as well as performative arts, for example, martial arts and dancing.
The ability to gauge the subject's skill level would be valuable for applications that require indexing and retrieval of a video database according to the subject's task expertise. For example, a video-sharing site could place videos performed by higher-skilled subjects at the top of the list when returning a search result. The current workaround is to crowdsource the ranking data via the upvote and downvote buttons. For this solution to work, it must be assumed that videos performed by higher-skilled subjects are upvoted by the majority, and downvoted otherwise. Nevertheless, this method is highly unreliable as the quality of annotations is influenced by the annotator's reliability and the task's difficulty level [3]. Worse yet, some participants may be adversarial by deliberately providing false labels [4].
Automated methods for producing ranking data from videos are classified under the Action Quality Assessment (AQA) domain. Existing AQA methods either use regression to estimate the action quality score from a single video, or train models to predict the relative ranking between an input pair of videos. Our proposed method uses the subject's 2D poses throughout a dance performance to codify a String value. We then predict the video sourced from the better dancer by comparing the edit distance scores of two String value pairs, i.e., Subject A vs. Gold Standard and Subject B vs. Gold Standard. We define the Gold Standard as the top-ranked video(s) for each label category. We estimate the subject's 2D pose from a single query frame. We only process query frames containing complete pose data, i.e., all the required keypoints are detected, including false positives. Query frames containing incomplete pose data are rejected outright. The 2D pose is codified based on the position of four joints with respect to two auxiliary axes. A detailed explanation of the pose coding module can be found in Section 3.2. The matching Visual word is then appended to a String value that represents the current video. This process is repeated for the rest of the query frames, ultimately producing a String value whose length equals the number of complete query frames.
The two principal contributions of this paper revolve around automatically assessing the quality of a subject's actions, as captured by a stationary camera. The first contribution is the pairwise AQA method for determining the better dancer, given a pair of same-label videos. As for the second contribution, we release a new and open-access dataset for action quality assessment in the hope of facilitating future research. The rest of this paper is organized as follows. Section 2 describes the related work. In Section 3, we present the novel dataset along with the pairwise AQA method, implemented under two different schemes. In Section 4, we discuss the results obtained from the two schemes. Finally, we conclude this paper in Section 5.

Method
Our proposed method has three modules, i.e., the Region coding, Pose coding, and String builder modules (see Fig. 1). The Region coding module encodes the detected 2D pose and feeds the encoded value into the Pose coding module. The Pose coding module acts as a look-up function by searching for the closest match inside the Codebook. The returned Visual word is then fed into the String builder module. This process is repeated for every query frame that contains complete pose data. Ultimately, the String builder module produces a String value whose length equals the number of complete query frames in the video. The produced String value is later used during the pairwise skill annotation task between two videos of the same dance label. The rest of the section describes the proposed method by its components.

Video Classification
Action Quality Assessment (AQA) is a subset of the video classification problem. Therefore, it is beneficial to review existing work on video classification. Earlier methods use appearance and/or motion-based features extracted from consecutive frames. These features are built into a spatio-temporal volume or a bag of features. One of the earliest appearance-based methods is Dollar et al. [5]. Their method samples local interest points over time, producing a cuboid of intensity, gradient, and motion information. The cuboid is further divided into stacked regions, each described as a histogram. As for motion-based methods, Wang et al. [6] capture local motion(s) by extracting dense trajectories described using motion boundary histograms. Instead of utilizing the entire frame, Hipiny et al. [7] generate a Visual word from a gaze-directed image region. The visual words from a frame sequence are tabulated into a histogram or appended into a String value for matching purposes.
Recent comparison studies [8], [9] have shown that deep neural networks are superior to handcrafted approaches in the image classification task. As such, it is natural to extend their usage to video classification. Researchers have recently been looking at incorporating temporal data into the deep classification framework to improve accuracy. The success of approaches [10], [11] that utilized the two-stream fusion architecture [12] suggests the value of doing so. Karpathy et al. [10] introduced three strategies, i.e., early, late, and slow fusion, to fuse information from a spatial network and a temporal network, with both running in parallel. Feichtenhofer et al. [11] proposed a two-pathway model, i.e., the SlowFast network. The first pathway captures the semantic information from frames captured at a lower rate, whilst the second operates at a higher rate. The difference in temporal resolution ensures that both slow and fast-changing motions are adequately captured. Gong et al. [13] recently developed Auto-TSNet, a two-stream model optimized for a giant multivariate search space. The model utilizes a progressive procedure that performs a search over individual streams, fusion, and attention blocks.

Video-based Dance Genre Classification
At the early stage, the same methods introduced for video classification were used for video-based dance genre classification. Nevertheless, dance is a highly dynamic and specialized class of human action. Therefore, an extended sequence of frames containing temporal information is often required for accurate classification. Previous work has incorporated temporal information in several ways. Castro et al. [14] used temporal 3D CNNs utilizing raster images, optical flow, and visualized multi-person 2D pose. The three separate stacks run in parallel and are fused at the end to produce the predicted class label. The videos were processed in 16-frame chunks as a computational cost-saving measure. In Tsuchida et al. [15], the dancer's body motions are learned per frame as a 126-dimensional feature vector containing pose, velocity, and acceleration information. The feature vectors are aggregated within a temporal interval set by beat positions or fixed by a uniform window length. The classifiers were trained using LSTM and SVM models, achieving a 91.4% accuracy on a custom dataset. In Wysoczańska and Trzciński [16], the dance videos were augmented with audio track information. They extended the work in [14] by adding a fourth stream, i.e., an audio-specific stream employing Liu et al.'s Bottom-up Broadcast Neural Network [17]. The final class prediction is averaged from the softmax output of each stack. More recently, Hu and Ahuja [18] proposed an LSTM-based hierarchical framework for classifying dance genres from a video. They trained an LSTM model to recognize 3D movements associated with specific body parts. The movements are identified from 3D poses extrapolated from the estimated 2D poses. The 2D to 3D mapping was learned by training temporal convolutional networks on videos with manually segmented movements.

Video-based AQA Methods
Existing work on automated action quality assessment of videos uses local features or deep neural networks. An example of the former is Pirsiavash et al. [19]. They trained a regression model to map spatio-temporal features to an action quality score over a training set with manual scoring data. The feature set consists of low-level image features and a high-level feature, i.e., discrete cosine transform (DCT) encoded body pose, extracted from a single video. Similar to Pirsiavash et al.'s, the following deep learning methods also work on a single video and generate scores via regression. Instead of using manual scoring data as labels during supervised training, Parmar and Morris [20] proposed a multitask learning approach. 3D CNNs were used to learn spatio-temporal motions and appearance-based features inside a video. The 3D CNNs were jointly optimized end-to-end to enable fine-grained action description and AQA scoring. In Xu et al. [21], two complementary networks, i.e., Self-Attentive LSTM (S-LSTM) and Multi-Scale Convolutional Skip LSTM (M-LSTM), were used to predict the AQA score. S-LSTM selectively learns the spatio-temporal features based on the frame's weight. Frames containing complicated and technical movements are weighted more heavily than the rest. The second network, i.e., M-LSTM, learns to model the action at multiple scales using varying kernel sizes. This dual-network setup allows the action's local and global information to be adequately captured. Pan et al. [22] predict the AQA score based on the interactive motion pattern of neighboring joints. The patterns are modeled using spatial and temporal graphs.
Regression-based methods work on the assumption that manual scoring by human judges is consistent and bias-free. To address this inherent ambiguity, Tang et al. [23] developed an uncertainty-aware score distribution learning approach where the score is inferred from Gaussian-distributed scores. Yu et al. [24] trained a binary tree classifier on feature vectors learned from query and exemplar pairs to predict the score difference. Each feature vector consists of spatio-temporal features extracted from both videos and the reference score. This coarse-to-fine approach enables more accurate skill scoring since the final score is averaged from scores of multiple exemplar videos with similar attributes (i.e., stored in neighboring leaves belonging to the same and non-overlapping group). Instead of regressing a score from a video, Doughty et al. [25] trained temporal attention modules to focus only on frames containing skill-relevant parts. Their method uses a novel rank-aware loss function to perform pairwise ranking of egocentric videos. John et al. [26] measure the percentage of overlapping area between aligned foreground images from frames belonging to an unlabeled video and a video with an annotated score. The resulting value indicates how similar the aerobic dance moves are between the pair of videos.

Dance-related Datasets
Castro et al.'s Let's Dance dataset [14] contains 1,000 videos belonging to 10 different dance categories. These dance categories share overlapping micro-actions, making it impossible to make a class prediction from a single frame. The dances were performed by the same person/duo/group at the same venue; hence the videos share many appearance-based features. A much larger dataset, i.e., the AIST Dance Video Database, was introduced by Tsuchida et al. [15]. The dataset focuses on street/urban dancing styles, containing almost 14k videos in 10 different genres. Professional dancers performed the dances, and the recording was done inside a well-lit studio using high-definition cameras. A more recent dataset, i.e., the UID dataset, was introduced by Hu and Ahuja [18]. The dataset contains 1,143 videos belonging to 9 classic dance genres. Most videos were curated from YouTube, hence exhibiting greater variations in quality. Dance datasets in other modalities are also available, e.g., motion-capture in Dewan et al. [27] and Li et al. [28], as well as depth data in [29]-[31].
Like Hu and Ahuja [18], our videos were captured under unregulated background and illumination settings. Unlike the formal dances covered in Castro et al.'s dataset [14], the TikTok dance challenges involve much slower and less intricate dance movements. Nevertheless, the movements still require some skill to execute correctly.

Dataset and Ground Truth Preparation
We first describe our novel dataset and the ground truth preparation steps. To prepare the dataset, we asked 20 participants (P1-P20) to perform 12 TikTok dance challenges each, resulting in a total of 240 videos (see Fig. 2). The videos, each 8 to 23 seconds long, were captured with various background scenes and lighting conditions. Participants used their own camera devices to capture these videos; hence different levels of image quality are expected. We used an interval of one-fifth of a second for extracting the still frames. We found this interval sufficient for capturing our participants' dance motions. We paired the videos exhaustively within each label category; 20 videos yield 20 × 19 / 2 = 190 pairs per category. Thus, the total pairings across 12 categories equal 2,280 pairs. The pairs were divided randomly and equally among 100 paid human annotators. The annotators were tasked to mark the winning video for each pair. Before annotating a winner, the annotator must watch the included reference video in full. The short reference videos were sourced from TikTok and selected based on the highest number of likes. Annotators were reminded to select the winner based exclusively on the dance motions' quality whilst ignoring the dancer's appearance and video production quality.
After every pair was annotated, we traversed the entire list and updated each video's accumulated win count, ω. Ultimately, we produced a ranked list per label category as our ground truth. For each label category, the videos are ranked based on their win count. Videos with identical win counts are placed in the same ranking position. The top-ranked video(s) are thus declared as the Gold Standard(s). The complete dataset and the ranking data (for each label category) can be downloaded from kaggle.com/datasets/irwandihipiny/cdrg-unimas-tiktok.
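The ranking step above can be illustrated with a minimal Python sketch. The function names and the toy annotations are hypothetical, and the use of dense ranking (tied videos share a position and the next position is not skipped) is an assumption; the text only states that videos with identical win counts share a ranking position.

```python
from collections import Counter

def rank_by_wins(videos, winners):
    """Map each video id to its ranking position (1 = most wins).

    videos  -- all video ids in one label category
    winners -- one entry per annotated pair: the id of the winning video
    Videos with identical win counts share a position (dense ranking,
    which is an assumption of this sketch).
    """
    wins = Counter({v: 0 for v in videos})   # start every video at zero wins
    wins.update(winners)                     # accumulate the win count per video
    distinct = sorted(set(wins.values()), reverse=True)
    position = {count: i + 1 for i, count in enumerate(distinct)}
    return {v: position[wins[v]] for v in videos}

def gold_standards(ranks):
    """Return the top-ranked video(s), i.e., the Gold Standard(s)."""
    return sorted(v for v, r in ranks.items() if r == 1)

# Toy category with 4 videos and 6 exhaustive pairs (hypothetical data).
videos = ["P1", "P2", "P3", "P4"]
winners = ["P1", "P1", "P1", "P2", "P3", "P4"]
ranks = rank_by_wins(videos, winners)
print(ranks, gold_standards(ranks))
```

In this toy example P1 wins three pairs and the other videos win one each, so P2, P3, and P4 share the second position.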

Building the Codebook
Since a region label has four possible values, four region labels produce 4^4 = 256 possible combinations. However, we only consider the 49 most common poses, as observed in our dataset. We added a 50th class, i.e., OTHERS, to represent the rest of the (low-frequency) codes. Fig. 3 shows all 49+1 possible region codes inside our Codebook. The 49+1 codes are each represented by an ASCII character (65-114), i.e., a Visual word, inside our Codebook.
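As a rough illustration of the Codebook construction and look-up, the sketch below keeps the most frequent region codes and maps everything else to OTHERS. All names are hypothetical, and assigning ASCII characters in descending order of frequency (with OTHERS taking the last character) is an assumption, not a detail given in the text.

```python
from collections import Counter
from itertools import product

REGIONS = ("AL", "AR", "BL", "BR")
OTHERS = "OTHERS"

def all_region_codes():
    """Every possible code: four labels with four values each, 4**4 = 256."""
    return list(product(REGIONS, repeat=4))

def build_codebook(observed_codes, keep=49, first_ascii=65):
    """Map the `keep` most frequent codes, plus OTHERS, to single characters.

    Characters are assigned in descending order of frequency starting at
    ASCII 65; OTHERS receives the next character (an assumption).
    """
    freq = Counter(observed_codes)
    codebook = {code: chr(first_ascii + i)
                for i, (code, _) in enumerate(freq.most_common(keep))}
    codebook[OTHERS] = chr(first_ascii + keep)
    return codebook

def lookup(codebook, code):
    """Return the Visual word for a code, falling back to OTHERS."""
    return codebook.get(code, codebook[OTHERS])

# Toy usage with a Codebook of 2+1 Visual words (hypothetical frequencies).
c1, c2, c3 = ("AL",) * 4, ("BR",) * 4, ("AL", "AR", "BL", "BR")
small = build_codebook([c1] * 3 + [c2] * 2 + [c3], keep=2)
print(lookup(small, c1), lookup(small, c2), lookup(small, c3))
```

With the default of 49 kept codes, the characters span ASCII 65 to 114, matching the range stated above.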

Region Coding Module
We implemented Papandreou et al.'s PoseNet [32] in our Region coding module to estimate the subject's 2D pose inside a query frame. PoseNet estimates 2D poses in real time by returning an array containing fifteen 2D coordinates with a confidence score each. Our method requires only eight of these keypoints to be detected, regardless of whether each detection is a true or a false positive. See Fig. 4 for query frame samples containing complete pose data. We require the following four keypoints for pose definition, i.e., rightElbow, rightWrist, leftElbow, and leftWrist, and the following four keypoints for region coding, i.e., rightShoulder, leftShoulder, neck, and chest. We define a query frame with complete pose data as one in which all eight keypoints are detected. Query frames without complete pose data are rejected outright. We purposely set a slightly weaker threshold value than the default used in Papandreou et al. [32] to increase the number of query frames per video, since we found the default threshold value to be too strict, thus producing a low number of query frames. The lower threshold risks a false keypoint detection, causing an erroneous Visual word to be appended to the String value. Nevertheless, we believe this event to be a rare occurrence due to the reliability of Papandreou et al.'s [32] method. Based on our observation, the keypoints rejected under the default threshold are often correct and fall only slightly below the confidence threshold.
The Region coding module produces a code given a query frame with complete pose data. The code consists of four region labels, one for each pose-definition keypoint. The possible values of a region label are Above-Left (AL), Above-Right (AR), Below-Left (BL), and Below-Right (BR). The value is determined by the position of the current keypoint relative to the two intersecting lines formed by the four region-coding keypoints, i.e., rightShoulder, leftShoulder, neck, and chest. The first line/axis connects the leftShoulder and rightShoulder keypoints, and the second line/axis connects the neck and chest keypoints.
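One possible way to compute the region labels is sketched below. The text does not specify how "above/below" and "left/right" are decided, so this sketch makes two assumptions: "below" means the same side of the shoulder axis as the chest keypoint, and "left" means the same side of the neck-chest axis as the leftShoulder keypoint. The function names, the order of the four labels, the tie-breaking for points lying exactly on an axis, and the toy coordinates are all hypothetical.

```python
POSE_KEYPOINTS = ("rightElbow", "rightWrist", "leftElbow", "leftWrist")

def side(a, b, p):
    """Sign of the 2D cross product (b - a) x (p - a).

    Positive and negative values correspond to the two sides of line ab.
    """
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def region_label(p, kp):
    """Return AL, AR, BL, or BR for point p given the four auxiliary keypoints."""
    s_a, s_b = kp["leftShoulder"], kp["rightShoulder"]   # first axis
    t_a, t_b = kp["neck"], kp["chest"]                   # second axis
    # Assumption: the chest lies below the shoulder axis, so any point on
    # the chest's side is "Below"; points on the axis count as "Above".
    below = side(s_a, s_b, p) * side(s_a, s_b, kp["chest"]) > 0
    # Assumption: the leftShoulder's side of the neck-chest axis is "Left";
    # points on the axis count as "Right".
    left = side(t_a, t_b, p) * side(t_a, t_b, kp["leftShoulder"]) > 0
    return ("B" if below else "A") + ("L" if left else "R")

def region_code(kp):
    """Return the four-label code for one frame with complete pose data."""
    return tuple(region_label(kp[name], kp) for name in POSE_KEYPOINTS)

# Toy frame in image coordinates (x to the right, y downwards).
frame = {
    "leftShoulder": (60, 50), "rightShoulder": (40, 50),
    "neck": (50, 45), "chest": (50, 70),
    "rightElbow": (35, 60), "rightWrist": (30, 30),
    "leftElbow": (65, 60), "leftWrist": (70, 30),
}
print(region_code(frame))
```

Defining the sides relative to the chest and leftShoulder keypoints keeps the labels tied to the subject's body rather than to the image axes, which is a design choice of this sketch only.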

Pose Coding Module
The encoded value then becomes an input for the Pose coding module. This module acts as a lookup function, i.e., searching for the top match (Visual word) inside the Codebook for the given encoded value. The same module is used during the Codebook creation to generate query frames (with pose labels) from training videos.

String Builder Module
The String Builder module generates a String value whose length is equal to the number of complete query frames. It appends the matching Visual word, α_j, to the current video's String value. This step is repeated for all complete query frames, Φ. We implemented a secondary scheme with a text compression feature. In this particular scheme, we only append α_j to the String value if α_j ≠ α_{j−1}, i.e., if it differs from the Visual word of the preceding complete query frame. This scheme compensates for instances of the subject executing a specific dance motion part faster or slower than the ideal duration. Only the appearance order of 2D poses is retained, while the magnitude (i.e., the number of frames per pose) is capped at a fixed value of 1. A combined flowchart explaining the process flow between the three modules is shown in Fig. 5.
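The two String-building schemes can be sketched in a few lines of Python. The function name is hypothetical, and representing a rejected (incomplete) query frame as None is an assumption made only for this illustration.

```python
def build_string(visual_words, compress=False):
    """Concatenate per-frame Visual words into one String value.

    visual_words -- one single-character Visual word per query frame, or
                    None for a frame rejected for incomplete pose data
                    (the None convention is an assumption of this sketch)
    compress     -- if True, skip a word equal to the previously appended one
    """
    out = []
    for word in visual_words:
        if word is None:
            continue                      # rejected frame: nothing appended
        if compress and out and out[-1] == word:
            continue                      # secondary scheme: drop repeats
        out.append(word)
    return "".join(out)

frames = ["A", "A", None, "B", "B", "B", "A", "C"]
print(build_string(frames))                 # primary scheme
print(build_string(frames, compress=True))  # secondary scheme
```

The primary scheme keeps one character per complete frame, so holding a pose for longer lengthens the String value, whereas the compressed String value records only the order in which poses appear.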

Pairwise Skill Annotation
Pairwise skill annotation on a same-label pair, (A, B), is performed by first computing the following two edit distance scores. The first score, d_1, is between A's String value and the Gold Standard's. The second score, d_2, is between B's String value and the Gold Standard's. We determine the winning video using
$$\text{winner}(A, B) = \begin{cases} A, & d_1 < d_2 \\ B, & d_2 < d_1 \\ \infty, & d_1 = d_2 \end{cases}$$
where ∞ indicates a stalemate. In the event of a label category containing two or more Gold Standards, we compute the edit distance score against each Gold Standard and select the lowest score as the representative value for the current video.
Table 1 shows the precision values obtained using the two schemes for each label category, with the last column reporting the average value. We report results with the default Codebook size of 49+1 Visual words. The primary scheme obtains an average precision of 0.48, while the secondary scheme with text compression manages a slightly lower value of 0.45 over 12 dance labels. Encoding the temporal duration of choreographed dance motions is beneficial since it influences our perception of motion quality (i.e., too slow or too fast). Nevertheless, the secondary scheme outperforms the primary scheme in 4 out of 12 label categories, i.e., D-dm, L(SB), StTL, and TMM. We argue that the best scheme depends on the nature of the choreographed dance motions. Some TikTok dance challenges involve rapidly changing motions; hence temporal duration might not be a decisive factor in action quality assessment. The best precision value (0.66) is achieved using the secondary scheme on the D-dM set. We note that the difference in precision values between the primary and secondary schemes on this set is only 0.07. The most significant difference is in the TDS set, with the primary scheme outperforming the secondary scheme by 0.15.
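The pairwise decision rule described at the start of this subsection can be sketched as follows. The text does not state which edit distance variant is used, so this sketch assumes the Levenshtein distance with unit costs for insertion, deletion, and substitution. The function names are hypothetical, and a stalemate is returned as None rather than ∞.

```python
def edit_distance(s, t):
    """Levenshtein distance between strings s and t (unit costs, assumed)."""
    prev = list(range(len(t) + 1))          # distances from "" to t[:j]
    for i, cs in enumerate(s, 1):
        cur = [i]                           # distance from s[:i] to ""
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # delete cs
                           cur[j - 1] + 1,              # insert ct
                           prev[j - 1] + (cs != ct)))   # substitute or match
        prev = cur
    return prev[-1]

def score(string_value, gold_strings):
    """Lowest edit distance to any Gold Standard of the label category."""
    return min(edit_distance(string_value, g) for g in gold_strings)

def pairwise_winner(string_a, string_b, gold_strings):
    """Return "A", "B", or None (stalemate) for a same-label pair."""
    d1 = score(string_a, gold_strings)
    d2 = score(string_b, gold_strings)
    if d1 < d2:
        return "A"
    if d2 < d1:
        return "B"
    return None

print(pairwise_winner("ABAC", "ABBB", ["ABCC"]))  # "A": distance 1 vs. 2
print(pairwise_winner("AB", "BA", ["AA"]))        # None: both at distance 1
```

Taking the minimum over all Gold Standards mirrors the rule of selecting the lowest edit distance score when a label category has two or more Gold Standards.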

Results and Discussion
Our results are not directly comparable to previous work since the dataset(s) and metrics used were different. Admittedly, the average precision over 12 label categories is rather low, at 0.45 with text compression, and 0.48 without. Nevertheless, precision values for several label categories are comparable to past methods' (median: 0.47, max: 0.66). Our method's performance is comparable to the state-of-the-art; for example, Jain et al. [33] reported an average precision of 0.58 for the task of returning video clips containing poorly-executed actions from a same-label set.

Codebook Sizes
We repeat the same experiment using the primary scheme with smaller Codebook sizes (see Fig. 6). We reduce the Codebook size to 38, 29, 19, and 10 Visual words by removing the Visual words with the lowest frequencies. For example, at a Codebook size of 10, Visual words with a frequency value of 23 or less were removed, i.e., merged into OTHERS. We observe the primary scheme to be resilient against the reduction in the Codebook's size. The removal of certain Visual words from the Codebook affects the label categories differently since they may or may not be present inside the videos' original String value (at the full Codebook size of 50). At a Codebook size of 10, the frequency values of the remaining Visual words are between 29 and 157 (excluding OTHERS). The high frequency values are expected since the 12 dance challenges tend to share visually similar and repetitive dance motions.

Separation in Ranking
We are interested in the method's performance when selecting the winner from pairs with high separation vs. close pairs inside the ranked list. Fig. 7 compares the results obtained using the two schemes. The primary scheme shows an increase in precision value if we limit the evaluation to high-separation pairs. The secondary scheme's precision value, by contrast, is consistent across separation values. The primary scheme works best if the two subjects have a markedly clear difference in skill level. At the maximum separation value of 8, the primary scheme achieves a high precision value of 0.69.
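A breakdown of this kind could be computed as in the sketch below. It assumes that separation is the absolute difference between the two videos' ranking positions and that precision is the fraction of pairs predicted correctly within each separation value; both are assumptions, as are the function name and the toy data.

```python
from collections import defaultdict

def precision_by_separation(results):
    """Group pairs by rank separation and report precision per group.

    results -- iterable of (predicted, rank_a, rank_b), where predicted is
               "A", "B", or None for a stalemate (hypothetical format)
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for predicted, rank_a, rank_b in results:
        separation = abs(rank_a - rank_b)   # assumed definition of separation
        # Ground truth: the video with the better (lower) ranking position.
        if rank_a < rank_b:
            truth = "A"
        elif rank_b < rank_a:
            truth = "B"
        else:
            truth = None                    # tied videos: stalemate expected
        totals[separation] += 1
        hits[separation] += int(predicted == truth)
    return {s: hits[s] / totals[s] for s in sorted(totals)}

# Toy predictions (hypothetical data).
toy = [("A", 1, 9), ("B", 1, 9), ("A", 2, 3), (None, 4, 4)]
print(precision_by_separation(toy))
```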

Evaluation Metrics
To evaluate our method, we check the predicted winner of each same-category pair against the category-specific ranked list. We define the following two scenarios as a correct prediction, i.e., i) the predicted winner is indeed ranked higher, and ii) a stalemate is predicted for a pair of videos with identical win counts. We report the precision values in the following subsection.
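The two scenarios above can be expressed as a small check, sketched here in Python. The names and the input format are hypothetical, and computing precision as the fraction of evaluated pairs that are predicted correctly is an assumption about how the reported values were obtained.

```python
def is_correct(predicted, rank_a, rank_b):
    """Check one pair against the ranked list (rank 1 is the best position).

    predicted -- "A", "B", or None for a predicted stalemate
    """
    if predicted is None:
        return rank_a == rank_b      # scenario ii: stalemate on tied videos
    if predicted == "A":
        return rank_a < rank_b       # scenario i: A is indeed ranked higher
    return rank_b < rank_a           # scenario i: B is indeed ranked higher

def precision(results):
    """Fraction of pairs predicted correctly (assumed definition)."""
    results = list(results)
    if not results:
        return 0.0
    return sum(is_correct(*r) for r in results) / len(results)

# Toy predictions (hypothetical data): two of the four pairs are correct.
pairs = [("A", 1, 3), ("B", 1, 3), (None, 2, 2), (None, 1, 2)]
print(precision(pairs))
```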

Conclusion
We presented a pairwise AQA method with two schemes to determine a winner from a pair of same-label TikTok dance videos. In both schemes, the subject's 2D poses are codified into a String value based on the position of selected joints relative to two auxiliary axes. Using both schemes, we compute the edit distance score between the resulting String value and the Gold Standard's. The secondary scheme implements an additional step with text compression before calculating the edit distance score. We tested these two schemes on a newly created dataset and achieved an average precision of 0.48 and 0.45, respectively. We also experimented with different Codebook sizes and compared precision values for pairs with high and low separation in ranking position. We also showed the current limitation of our method, i.e., its dependency on the 2D pose estimation result. An incorrect pose estimation produces an erroneous String value that affects the precision of the pairwise action quality assessment task. We see our work as a step toward an automated ranking of general and task-independent videos based on the subject's skill level. Measuring the similarity of task-related motions to an expert's is a valid approach for estimating task mastery. The novel TikTok dance video dataset is made publicly available on the authors' website to motivate more work in skill determination from video. Further work involves exploring higher-dimensional ways to represent the subject's 2D/3D poses and testing on additional AQA datasets.