Pose estimation uses computer vision to detect the position and orientation of an object. In practice, this usually means detecting the locations of keypoints that describe the object. For example, in face pose estimation (a.k.a. facial landmark detection), we detect landmarks on a human face. A related problem is head pose estimation, where we use the facial landmarks to obtain the 3D orientation of a human head with respect to the camera.
In this article, we will focus on human pose estimation, where the task is to detect and localize the major parts/joints of the body (e.g. shoulders, ankles, knees, wrists). Remember the scene where Tony Stark puts on the Iron Man suit using gestures? If such a suit is ever built, it will require human pose estimation! For the purposes of this article, though, we will tone down our ambition a tiny bit and solve the simpler problem of detecting keypoints on the body. A typical output of a pose detector is shown below:
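To make the output concrete, here is a minimal sketch of how a pose detector's result is commonly represented: one (x, y, confidence) triple per body keypoint. The keypoint names, coordinates, and threshold below are illustrative assumptions, not tied to any particular model or dataset.

```python
# Hypothetical detections for a single person: keypoint -> (x, y, confidence).
# All values below are made up for illustration.
detections = {
    "nose": (120, 40, 0.92),
    "neck": (118, 80, 0.88),
    "r_shoulder": (90, 85, 0.81),
    "l_shoulder": (146, 84, 0.33),   # low confidence, e.g. occluded
}

def confident_keypoints(dets, threshold=0.5):
    """Keep only keypoints whose confidence exceeds the threshold."""
    return {name: (x, y) for name, (x, y, c) in dets.items() if c >= threshold}

kept = confident_keypoints(detections)
print(sorted(kept))  # l_shoulder is filtered out
```

Downstream code (drawing a skeleton, gesture logic, etc.) would then work only with the confidently detected points.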
1. Keypoint detection datasets
Until recently, progress in pose estimation was held back by the lack of high-quality datasets. Such is the enthusiasm in AI these days that problems once considered out of reach are now being addressed. Several exciting new datasets have been released in the last few years, making it easier for researchers to attack the problem with all their intellectual might.
Some of these datasets are:
- COCO Keypoints challenge
- MPII Human Pose Dataset
- VGG Pose Dataset
In short, the more images a system sees, the better and more intelligent it gets.
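As a concrete example of what these datasets contain, COCO stores each person's keypoints as a flat list `[x1, y1, v1, x2, y2, v2, ...]`, where `v` is a visibility flag (0 = not labeled, 1 = labeled but occluded, 2 = labeled and visible). The annotation below is a made-up fragment in that format, not a real COCO record.

```python
# A made-up annotation fragment in COCO keypoint format:
# flat [x, y, v] triples, plus the count of labeled keypoints.
annotation = {
    "keypoints": [310, 120, 2,   0, 0, 0,   305, 115, 1],
    "num_keypoints": 2,
}

def unpack_keypoints(flat):
    """Group a flat COCO-style keypoint list into (x, y, v) triples."""
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]

triples = unpack_keypoints(annotation["keypoints"])
labeled = [t for t in triples if t[2] > 0]      # v > 0 means labeled
print(len(labeled))  # matches num_keypoints
```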
2. Multi-person pose estimation model
The model used in this tutorial is based on a paper titled Multi-Person Pose Estimation by the Perceptual Computing Lab at Carnegie Mellon University. The authors of the paper train very deep neural networks for this task. Let’s briefly go over the architecture before we explain how to use the pre-trained model.
The model takes as input a color image of size w × h and produces, as output, the 2D locations of keypoints for each person in the image. The detection takes place in two stages:
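Before the w × h image reaches the network, it is typically converted into a normalized, channel-first blob. The sketch below shows that preprocessing with plain numpy; the 368×368 input size and 1/255 scale factor are common choices for such models but are assumptions here, not a specification of this particular network.

```python
import numpy as np

def to_blob(image_hwc, scale=1.0 / 255.0):
    """Convert an HxWx3 uint8 image to a 1x3xHxW float32 blob."""
    blob = image_hwc.astype(np.float32) * scale       # normalize pixel values
    blob = np.transpose(blob, (2, 0, 1))              # HWC -> CHW
    return blob[np.newaxis, ...]                      # add batch dimension

img = np.zeros((368, 368, 3), dtype=np.uint8)         # dummy 368x368 color image
blob = to_blob(img)
print(blob.shape)  # (1, 3, 368, 368)
```

In practice this step is usually done by the deep learning framework's own image-to-blob utility rather than by hand.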
Stage 1: The first 10 layers of the VGGNet are used to create feature maps for the input image.
Stage 2: A 2-branch multi-stage CNN is used, where the first branch predicts a set of 2D confidence maps (S) of body part locations (e.g. elbow, knee). Given below are the confidence map and affinity map for the keypoint Left Shoulder.
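A confidence map is simply a heatmap, one per body part, usually at a reduced resolution; the part's location is the heatmap's peak, scaled back to image coordinates. The sketch below illustrates this with a synthetic map; the 46×46 map size and the 0.1 threshold are assumptions for illustration.

```python
import numpy as np

def find_peak(conf_map, image_w, image_h, threshold=0.1):
    """Return the peak of a confidence map in image coordinates, or None."""
    idx = np.unravel_index(np.argmax(conf_map), conf_map.shape)
    if conf_map[idx] < threshold:
        return None                       # no confident detection
    map_h, map_w = conf_map.shape
    y, x = idx
    # Scale map coordinates back to image coordinates.
    return (int(x * image_w / map_w), int(y * image_h / map_h))

heat = np.zeros((46, 46), dtype=np.float32)
heat[10, 20] = 0.9                        # synthetic peak, e.g. the left shoulder
print(find_peak(heat, 368, 368))          # (160, 80)
```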
The second branch predicts a set of 2D vector fields (L) of part affinities, which encode the degree of association between parts. In the figure below, the part affinity between the Neck and Left Shoulder is shown.
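The idea behind part affinity fields can be sketched as follows: the field stores a unit vector per pixel along each limb, and a candidate pairing of two parts is scored by how well the field aligns with the segment joining them (in the paper, a line integral along the segment). The code below approximates that integral by sampling points on the segment and averaging dot products; the field here is synthetic and the sampling count is an illustrative assumption.

```python
import numpy as np

def paf_score(paf_x, paf_y, p1, p2, n_samples=10):
    """Average alignment of the PAF with the segment from p1 to p2."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    vec = p2 - p1
    norm = np.linalg.norm(vec)
    if norm == 0:
        return 0.0
    unit = vec / norm                          # direction of the candidate limb
    total = 0.0
    for t in np.linspace(0, 1, n_samples):     # sample points along the segment
        x, y = (p1 + t * vec).astype(int)
        total += paf_x[y, x] * unit[0] + paf_y[y, x] * unit[1]
    return total / n_samples

# Synthetic PAF pointing purely in the +x direction everywhere:
paf_x = np.ones((46, 46), dtype=np.float32)
paf_y = np.zeros((46, 46), dtype=np.float32)
print(paf_score(paf_x, paf_y, (5, 20), (40, 20)))   # ~1.0: well aligned
print(paf_score(paf_x, paf_y, (20, 5), (20, 40)))   # ~0.0: perpendicular
```

A high score means the two candidate keypoints plausibly belong to the same person's limb; this is what lets the model group keypoints across multiple people.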