Deep learning based human pose estimation

Pose estimation uses computer vision to detect the position and orientation of an object. This usually means detecting keypoint locations that describe the object.

For example, in face pose estimation (a.k.a. facial landmark detection), we detect landmarks on a human face. A related problem is head pose estimation, where we use the facial landmarks to obtain the 3D orientation of a human head with respect to the camera.

In this article, we will focus on human pose estimation, where the goal is to detect and localize the major parts and joints of the body (e.g. shoulders, ankles, knees, wrists). Remember the scene where Tony Stark puts on the Iron Man suit using gestures? If such a suit is ever built, it will require human pose estimation! For the purpose of this article, though, we will tone down our ambition a tiny bit and solve the simpler problem of detecting keypoints on the body. A typical output of a pose detector is shown below:

OpenPose Skeleton
Figure 1 : Sample Skeleton output of pose estimation. Image Credit: Oliver Sjöström, Instagram: @ollivves, Website:

Keypoint detection datasets

Until recently, progress in pose estimation was held back by the lack of high-quality datasets. With the current enthusiasm for AI, problems that once seemed out of reach are now being tackled, and several new datasets released in the last few years have made it easier for researchers to work on them.

Some of these datasets are:

  1. COCO Keypoints challenge
  2. MPII Human Pose Dataset
  3. VGG Pose Dataset

In short, the more labeled images a system sees during training, the better it generally performs.

Multi-person pose estimation model

The model used in this tutorial is based on a paper titled Multi-Person Pose Estimation by the Perceptual Computing Lab at Carnegie Mellon University. The authors of the paper train very deep neural networks for this task. Let’s briefly go over the architecture before we explain how to use the pre-trained model.

The model takes as input a color image of size w × h and produces, as output, the 2D locations of keypoints for each person in the image. The detection takes place in three stages :

Stage 1: The first 10 layers of the VGGNet are used to create feature maps for the input image.

Stage 2: A 2-branch multi-stage CNN is used, where the first branch predicts a set of 2D confidence maps (S) of body part locations (e.g. elbow, knee). The confidence map for the Left Shoulder keypoint is shown below.

confidence map left shoulder
Figure 3 : Showing confidence maps for Left Shoulder for the given image

The second branch predicts a set of 2D vector fields (L) of part affinities, which encode the degree of association between parts. The figure below shows the part affinity between the Neck and the Left Shoulder.

Pose Estimation Affinity Map Left Shoulder
Figure 4 : Showing Part Affinity maps for Neck – Left Shoulder pair for the given image

Stage 3: The confidence and affinity maps are parsed by greedy inference to produce the 2D keypoints for all people in the image. This architecture won the COCO keypoints challenge in 2016.
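
To make the role of the affinity maps in Stage 3 more concrete, here is a simplified sketch of how a candidate connection between two keypoints can be scored: sample points along the segment joining them and average how well the affinity field aligns with the segment direction. This is an illustration only, not the authors' implementation; the function name `paf_score`, the `n_samples` parameter and the nearest-pixel sampling are assumptions made for this example.

```python
import numpy as np

def paf_score(paf_x, paf_y, p1, p2, n_samples=10):
    """Average alignment between one affinity field and the segment p1 -> p2.

    paf_x, paf_y: (H, W) arrays holding the x and y components of the field.
    p1, p2: (x, y) candidate keypoint locations in map coordinates.
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    direction = p2 - p1
    length = np.linalg.norm(direction)
    if length == 0:
        return 0.0  # identical points define no direction
    unit = direction / length

    score = 0.0
    for u in np.linspace(0.0, 1.0, n_samples):
        # Nearest map pixel to the sampled point on the segment
        x, y = np.rint(p1 + u * direction).astype(int)
        # Dot product of the field vector with the segment direction
        score += paf_x[y, x] * unit[0] + paf_y[y, x] * unit[1]
    return score / n_samples

# Toy field pointing along +x everywhere
paf_x = np.ones((10, 10))
paf_y = np.zeros((10, 10))
print(paf_score(paf_x, paf_y, (1, 5), (8, 5)))  # segment aligned with the field
print(paf_score(paf_x, paf_y, (5, 1), (5, 8)))  # segment perpendicular to it
```

A high score suggests the two candidates belong to the same limb. A greedy parser can then keep the best-scoring pairs for each limb type and assemble them into one skeleton per person.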

Pre-trained models for human pose estimation

The authors of the paper have shared two models: one is trained on the MPII Human Pose Dataset and the other on the COCO dataset. The COCO model produces 18 points, while the MPII model outputs 15 points. The outputs plotted on a person are shown in the image below.

keypoints difference of coco and mpi

COCO output format: Nose – 0, Neck – 1, Right Shoulder – 2, Right Elbow – 3, Right Wrist – 4, Left Shoulder – 5, Left Elbow – 6, Left Wrist – 7, Right Hip – 8, Right Knee – 9, Right Ankle – 10, Left Hip – 11, Left Knee – 12, Left Ankle – 13, Right Eye – 14, Left Eye – 15, Right Ear – 16, Left Ear – 17, Background – 18

MPII output format: Head – 0, Neck – 1, Right Shoulder – 2, Right Elbow – 3, Right Wrist – 4, Left Shoulder – 5, Left Elbow – 6, Left Wrist – 7, Right Hip – 8, Right Knee – 9, Right Ankle – 10, Left Hip – 11, Left Knee – 12, Left Ankle – 13, Chest – 14, Background – 15

As we saw in the previous section, the output consists of confidence maps and affinity maps. These outputs can be used to find the pose of every person in a frame when multiple people are present.

The output is a 4D matrix :

  1. The first dimension is the image ID (in case you pass more than one image to the network).
  2. The second dimension indicates the index of a keypoint. The model produces confidence maps and part affinity maps, which are all concatenated. For the COCO model it consists of 57 parts: 18 keypoint confidence maps + 1 background + 19*2 part affinity maps. Similarly, the MPII model produces 44 parts (15 + 1 + 14*2). We will be using only the first few parts, which correspond to the keypoints.
  3. The third dimension is the height of the output map.
  4. The fourth dimension is the width of the output map.
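
The channel layout described above can be checked with a small sketch. It splits a dummy array of the stated shape into keypoint maps, the background map and the part affinity maps. The helper name `split_output` and the 46 × 46 map size are assumptions for illustration; a real output map size depends on the input resolution.

```python
import numpy as np

def split_output(output, n_keypoints):
    """Split a (N, C, H, W) output into keypoint, background and affinity channels."""
    n_conf = n_keypoints + 1                      # keypoint maps plus one background map
    confidence = output[:, :n_keypoints]          # one confidence map per keypoint
    background = output[:, n_keypoints:n_conf]    # single background map
    affinity = output[:, n_conf:]                 # x and y component for each limb
    return confidence, background, affinity

# Dummy COCO-shaped output: 18 + 1 + 19*2 = 57 channels (map size is assumed)
dummy = np.zeros((1, 57, 46, 46), dtype=np.float32)
conf, bg, paf = split_output(dummy, n_keypoints=18)
print(conf.shape, bg.shape, paf.shape)
# (1, 18, 46, 46) (1, 1, 46, 46) (1, 38, 46, 46)
```

For the MPII model the same call with `n_keypoints=15` on a 44-channel output leaves 28 affinity channels, matching 14*2.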

We check whether each keypoint is present in the image or not. We get the location of a keypoint by finding the maximum of its confidence map, and we discard it if the confidence at that location falls below a threshold, which reduces false detections.
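
A minimal sketch of this step for a single person, using only NumPy: take the peak of each confidence map, keep it if it clears the threshold, and scale it from map coordinates to image coordinates. The function name `find_keypoints` and the default threshold of 0.1 are assumptions chosen for the example, not values taken from the paper.

```python
import numpy as np

def find_keypoints(conf_maps, image_w, image_h, threshold=0.1):
    """Return one (x, y) per keypoint, or None when the peak is below threshold.

    conf_maps: array of shape (n_keypoints, H, W) for a single image.
    """
    n_keypoints, map_h, map_w = conf_maps.shape
    points = []
    for k in range(n_keypoints):
        # Location of the global maximum of this keypoint's confidence map
        flat_idx = int(np.argmax(conf_maps[k]))
        row, col = divmod(flat_idx, map_w)
        peak = float(conf_maps[k, row, col])
        if peak > threshold:
            # Scale from output-map coordinates to image coordinates
            x = int(col * image_w / map_w)
            y = int(row * image_h / map_h)
            points.append((x, y))
        else:
            points.append(None)
    return points

# Two toy 4 x 4 maps: the first has a clear peak, the second is empty
maps = np.zeros((2, 4, 4), dtype=np.float32)
maps[0, 1, 2] = 0.9
print(find_keypoints(maps, image_w=400, image_h=200))
# [(200, 50), None]
```

Taking only the global maximum finds at most one location per keypoint, so it suits single-person images. With several people in the frame, every local peak would have to be kept as a candidate instead.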

Once the keypoints are detected, we just plot them on the image. Since we know the indices of the points beforehand, we can draw the skeleton by joining the appropriate pairs of keypoints.
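
Joining the pairs can be sketched as follows. The pair list is an assumption built from the COCO keypoint indices given earlier (for example Neck 1 to Right Shoulder 2), connecting anatomically adjacent parts; the names `COCO_PAIRS` and `skeleton_segments` are made up for this example. The function returns line segments that a drawing routine could then render on the image.

```python
# Assumed limb pairs for the 18-point COCO layout, as (index_a, index_b)
COCO_PAIRS = [
    (1, 0), (1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 7),
    (1, 8), (8, 9), (9, 10), (1, 11), (11, 12), (12, 13),
    (0, 14), (14, 16), (0, 15), (15, 17),
]

def skeleton_segments(points, pairs=COCO_PAIRS):
    """Return the segments whose two end keypoints were both detected.

    points: list of (x, y) tuples or None, indexed by keypoint ID.
    """
    segments = []
    for a, b in pairs:
        if points[a] is not None and points[b] is not None:
            segments.append((points[a], points[b]))
    return segments

# Only Neck (1), Right Shoulder (2) and Right Elbow (3) were detected
points = [None] * 18
points[1] = (50, 20)
points[2] = (40, 22)
points[3] = (35, 40)
print(skeleton_segments(points))
# [((50, 20), (40, 22)), ((40, 22), (35, 40))]
```

Skipping pairs with a missing end point keeps the skeleton from drawing limbs to keypoints that were rejected by the threshold.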

This is how we detect body poses in a video, whether the feed shows a single person or several people.



The artificial intelligence uses frame by frame comparison to detect an object that was not previously there or the disappearance of an object from the field of view.  It learns what ‘normal’ looks like and spots differences.  In ‘Supervised’ mode, alerts to objects such as pool toys or outdoor furniture being moved will be suppressed to avoid false alarms.  When you leave the pool area a new ‘normal’ is established.

By using a combination of two cameras, one to identify individuals as they enter the designated area and the other to monitor the whole area, the artificial intelligence can keep track of an identified individual for as long as they remain within the field of view.  Both cameras are connected to the same processor so the first can pass the identity to the second, allowing the second to continue showing the identity of the individual even when their face is not visible to the camera.

In case you are concerned about privacy, be assured that nobody sees the feed from your camera unless an emergency is detected and not acknowledged locally.  Instead, the artificial intelligence identifies key points on the human body such as shoulders, elbows, wrists, hips, knees and ankles.  It uses the relative position of these key points to determine the pose of the body and has been trained to recognise poses that indicate danger. 

‘Supervised’ mode is designed for use when swimming is planned, with a responsible adult present.  It won’t bother you with constant alerts as people enter the area but the system will still raise the alarm if someone disappears underwater for longer than you have deemed acceptable. 

We strongly encourage use of a Pool Angel lanyard during such sessions so that there is no doubt over who has assumed responsibility for keeping watch over children in the pool.  Child drownings can happen even with multiple adults present if they all assume that someone else is paying attention.  Pool Angel offers you an added layer of protection; by comparing the number of people detected frame by frame, the artificial intelligence can spot when someone is missing and raise the alarm.    

Because the artificial intelligence can learn from experience, it can learn to tell the difference between your pet and local wildlife that might encroach on your pool area.  This means that you can keep your pets safe without being disturbed by false alarms when animals wander in during the night, although you might be intrigued to view clips of your nocturnal visitors in the morning.  A short video clip is stored each time something is detected.

If an adult is detected in the pool area the system will alert you and prompt you to switch to ‘Supervised’ mode if you haven’t already done so.  This mode is designed for planned use of the pool and will suppress alerts to entry and exit from the pool area.  When the last adult leaves the area the system detects that too and prompts you to switch back to keeping watch over the empty pool.  An emergency alarm is raised if the departure of that adult leaves an unsupervised child in the pool area.

Although we refer to the boundary around a swimming pool, the camera can be used to keep watch over any boundary you designate. It can keep watch over a trampoline, climbing frame, the tool shed, any area that could present a danger to unsupervised children. By comparing what was present in a previous frame with what is currently in frame, the artificial intelligence can detect the arrival of something or someone new in the designated area.