
I'm quite new to machine learning (I followed Andrew Ng's Coursera course and am now starting the deeplearning.ai courses).

I want to classify human actions in real time, like:

  • Left arm bent
  • Arm above shoulder
  • ...

I first did some research on pre-trained models, but I didn't find any. Because I'm still quite new to this, I'd like some advice on how to solve it.

  1. Maybe I need to collect enough pictures of every action and, from there, do image classification.

  2. Or I use PoseNet from TensorFlow to get pose-estimation points. I then record videos of a couple of seconds for every pose I want to track and save the estimation points. From there, I use a classification algorithm (a neural network) to classify those points.

Which is the more efficient option? Or are they both bad, and is there a better way to do this?

user1007522

3 Answers


My suggestion is to go with the first option. The reason is that you will get to know your data well, and while you will run into some challenges developing the model at the initial stage, over time you will get better results after hyperparameter tuning. Please go through the article; ignore this if you have already read it.
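
As a rough sketch of what that frame-level image classification could look like, here is a minimal transfer-learning setup, assuming TensorFlow/Keras and a folder of labeled action images; the directory layout, model choice, and hyperparameters are all illustrative:

```python
# Minimal sketch of option 1: frame-level image classification via transfer
# learning. Assumes TensorFlow/Keras is installed and that images are stored
# as data/<action_name>/*.jpg; all names and hyperparameters are illustrative.
import tensorflow as tf

train_ds = tf.keras.utils.image_dataset_from_directory(
    "data", image_size=(224, 224), batch_size=32)

base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, pooling="avg")
base.trainable = False  # start by training only the new classification head

num_actions = len(train_ds.class_names)
model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # MobileNetV2 expects [-1, 1]
    base,
    tf.keras.layers.Dense(num_actions, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=5)
```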


I recommend using YOLOv8 for pose detection. It's an efficient, low-cost model for pose estimation that performs well in real-time applications, and you can use it to extract keypoints from the video frames. Any other pose-detection model would work as well; YOLOv8 is just a suggestion because it's easy to use, highly accurate, and gives around 17 human-pose keypoints.
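
For example, a minimal sketch of keypoint extraction with the `ultralytics` package (the model size and webcam index below are illustrative):

```python
# Minimal sketch of keypoint extraction with Ultralytics YOLOv8-pose.
# Assumes the `ultralytics` and `opencv-python` packages are installed.
import cv2
from ultralytics import YOLO

model = YOLO("yolov8n-pose.pt")  # small pretrained pose model (17 COCO keypoints)
cap = cv2.VideoCapture(0)        # webcam, for real-time use

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = model(frame, verbose=False)
    # (num_persons, 17, 2) array of (x, y) pixel coordinates per detected person
    keypoints = results[0].keypoints.xy.cpu().numpy()
    print(keypoints.shape)

cap.release()
```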

However, remember that many human actions depend on movements made in the past, so a simple classification of individual frames might not be enough for accurate action recognition. Instead, I suggest training a CNN-LSTM-Attention model.

  • CNN will help capture spatial features.
  • LSTM (Long Short-Term Memory) will account for temporal dependencies (previous movements).
  • The attention mechanism can further enhance the model's ability to focus on the most important frames/keypoints for classification.

This architecture would allow you to classify actions based on sequences of keypoints extracted from videos, giving you much better accuracy than classifying single frames or individual poses.
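
As a rough illustration, here is a minimal PyTorch sketch of the sequence-classification part, assuming keypoints have already been extracted per frame. Since the inputs are keypoints rather than raw pixels, the per-frame CNN is replaced here by a small linear encoder; all layer sizes and the number of actions are illustrative.

```python
# Minimal sketch of an LSTM + attention classifier over pose-keypoint
# sequences (PyTorch). All sizes are illustrative.
import torch
import torch.nn as nn

class PoseActionClassifier(nn.Module):
    def __init__(self, num_keypoints=17, hidden=128, num_actions=5):
        super().__init__()
        self.encoder = nn.Linear(num_keypoints * 2, hidden)    # per-frame features
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)  # temporal model
        self.attn = nn.Linear(hidden, 1)                       # frame-importance scores
        self.head = nn.Linear(hidden, num_actions)

    def forward(self, x):                  # x: (batch, frames, num_keypoints * 2)
        h, _ = self.lstm(torch.relu(self.encoder(x)))
        weights = torch.softmax(self.attn(h), dim=1)   # (batch, frames, 1)
        context = (weights * h).sum(dim=1)             # attention-weighted summary
        return self.head(context)                      # action logits

# Example: a batch of 8 clips, 30 frames each, 17 (x, y) keypoints per frame.
model = PoseActionClassifier()
logits = model(torch.randn(8, 30, 34))
print(logits.shape)  # torch.Size([8, 5])
```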

Keval Pandya

I suggest using a pose-estimation model, as you mentioned, then linking the different keypoints to create a "body graph" of sorts.

At that point, you can use trigonometry to calculate the angle between lines crossing the same keypoint (see here). For example, to measure whether an arm has been raised, take the dot product of the vector going from the neck to the shoulder and the vector going from the elbow to the shoulder, and divide by the product of their lengths. Then apply arccos to the obtained value to get the angle.
By doing so for every joint, you get all the joint angles. From these angles, you can either build a decision tree (or any other model) or hand-code a rule-based system to determine whether an arm has been raised, a leg has been extended, and so on.
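
As a minimal sketch of this computation (the keypoint coordinates and the 90-degree threshold below are made up for illustration; in practice the coordinates come from the pose-estimation model):

```python
# Joint angle from three keypoints, via dot product and arccos (NumPy).
import numpy as np

def joint_angle(a, b, c):
    """Angle at keypoint b (in degrees) formed by keypoints a-b-c."""
    u, v = np.asarray(a) - np.asarray(b), np.asarray(c) - np.asarray(b)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))  # clip for float safety

# Illustrative (x, y) pixel coordinates; image y-axis points down.
neck, shoulder, elbow = (200, 80), (240, 100), (260, 40)
angle = joint_angle(neck, shoulder, elbow)
print(f"shoulder angle: {angle:.1f} degrees")

# A hand-coded rule; the threshold is chosen arbitrarily for illustration.
if angle < 90:
    print("arm raised")
```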

The main drawback of this approach is that it depends very much on the position of the person in the image: some keypoints can be hidden by other parts of the body and throw off the joint angles.
If your images always show the same side of the body (front, back, etc.), this can be a good approach. If not, consider using a model with time-series-related layers such as an LSTM, or attention, as in the answer Keval wrote.

race