Computer Vision and Image Understanding
Vol. 73, No. 3, March, pp. 428–440, 1999
Article ID cviu.1998.0744, available online at http://www.idealibrary.com
Human Motion Analysis: A Review
J. K. Aggarwal and Q. Cai
Computer and Vision Research Center, Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, Texas 78712
Received October 22, 1997; accepted September 25, 1998
Copyright © 1999 by Academic Press
All rights of reproduction in any form reserved.

Human motion analysis is receiving increasing attention from computer vision researchers. This interest is motivated by a wide spectrum of applications, such as athletic performance analysis, surveillance, man-machine interfaces, content-based image storage and retrieval, and video conferencing. This paper gives an overview of the various tasks involved in motion analysis of the human body. We focus on three major areas related to interpreting human motion: (1) motion analysis involving human body parts, (2) tracking a moving human from a single view or multiple camera perspectives, and (3) recognizing human activities from image sequences. Motion analysis of human body parts involves the low-level segmentation of the human body into segments connected by joints and recovers the 3D structure of the human body using its 2D projections over a sequence of images. Tracking human motion from a single view or multiple perspectives focuses on higher-level processing, in which moving humans are tracked without identifying their body parts. After successfully matching the moving human image from one frame to another in an image sequence, understanding the human movements or activities comes naturally, which leads to our discussion of recognizing human activities. © 1999 Academic Press

1. INTRODUCTION

Human motion analysis is receiving increasing attention from computer vision researchers. This interest is motivated by applications over a wide spectrum of topics. For example, segmenting the parts of the human body in an image, tracking the movement of joints over an image sequence, and recovering the underlying 3D body structure are particularly useful for analysis of athletic performance, as well as medical diagnostics. The capability to automatically monitor human activities using computers in security-sensitive areas such as airports, borders, and building lobbies is of great interest to the police and military. With the development of digital libraries, the ability to automatically interpret video sequences will save tremendous human effort in sorting and retrieving images or video sequences using content-based queries. Other applications include building man-machine user interfaces, video conferencing, etc. This paper gives an overview of recent development in human motion analysis from image sequences using a hierarchical approach.

In contrast to our previous review of motion estimation of a rigid body, this survey concentrates on motion analysis of the human body, which is a nonrigid form. Our discussion covers three areas: (1) motion analysis of the human body structure, (2) tracking human motion without using the body parts from a single view or multiple perspectives, and (3) recognizing human activities from image sequences. The relationship among these three areas is depicted in Fig. 1. Motion analysis of the human body usually involves low-level processing, such as body part segmentation, joint detection and identification, and the recovery of 3D structure from the 2D projections in an image sequence. Tracking moving individuals from a single view or multiple perspectives involves applying visual features to detect the presence of humans directly, i.e., without considering the geometric structure of the body parts. Motion information such as position and velocity, along with intensity values, is employed to establish matching between consecutive frames. Once feature correspondence between successive frames is solved, the next step is to understand the behavior of these features throughout the image sequence. Therefore, our discussion turns to recognition of human movements and activities.

Two typical approaches to motion analysis of human body parts are reviewed, depending on whether a priori shape models are used. Figure 2 lists a number of publications in this area over the past years. In both model-based and nonmodel-based approaches, the representation of the human body evolves from stick figures to 2D contours to 3D volumes as the complexity of the model increases. The stick figure representation is based on the observation that human motion is essentially the movement of the supporting bones. The use of 2D contours is directly associated with the projection of the human figure in images. Volumetric models, such as generalized cones, elliptical cylinders, and spheres, attempt to describe the details of a human body in 3D and, therefore, require more parameters for computation. Various levels of representations could be used for graphical animation with different resolutions.

With regard to the tracking of human motion without involving body parts, we differentiate the work based on whether the subject is imaged at one time instant by a single camera or from multiple perspectives using different cameras. In both configurations, the features to be tracked vary from points to 2D blobs and 3D volumes. There is always a trade-off between feature complexity and tracking efficiency. Lower-level features, such as points, are easier to extract but relatively more difficult to
track than higher-level features such as blobs and 3D volumes. Most of the work in this area is listed in Fig. 3.

FIG. 1. Relationship among the three areas of human motion analysis addressed in the paper.

To recognize human activities from an image sequence, two types of approaches were addressed: approaches based on a state-space model and those using the template matching technique. In the first case, the features used for recognition could be points, lines, and 2D blobs. Methods using template matching usually apply meshes of a subject image to identify a particular movement. Figure 4 gives an overview of the research in this area. In some of the publications, recognition is conducted using only parts of the human figure. Since these methods can be naturally extended to recognition of a whole body movement, we also include them in our discussion.

This paper is an extension of the review in . The organization of the paper is as follows: Section 2 reviews work on motion analysis of the human body structure. Section 3 covers research on the higher-level tasks of tracking a human body as a whole. Section 4 extends the discussion to recognition of human activity in image sequences based upon successfully tracking the features between consecutive frames. Finally, Section 5 concludes the paper and delineates possible directions for future research.
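
As a rough illustration of the frame-to-frame matching summarized in this introduction, the following sketch pairs point features across two consecutive frames by combining a constant-velocity position prediction with an intensity difference. It is a minimal sketch under stated assumptions and is not taken from any of the surveyed papers: the `Feature` record, the greedy assignment, the weight `w_intensity`, and the threshold `max_cost` are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Feature:
    x: float
    y: float
    vx: float  # estimated velocity in pixels per frame (assumed known)
    vy: float
    intensity: float  # mean gray level around the feature (assumed measured)


def match_features(prev, curr, w_intensity=0.1, max_cost=20.0):
    """Greedily match features of the previous frame to the current frame.

    The cost of a pair is the distance between the constant-velocity
    prediction and the candidate position, plus a weighted absolute
    intensity difference. Returns a dict: index in `prev` -> index in `curr`.
    """
    candidates = []
    for i, p in enumerate(prev):
        px, py = p.x + p.vx, p.y + p.vy  # predicted position in the next frame
        for j, c in enumerate(curr):
            cost = hypot(c.x - px, c.y - py)
            cost += w_intensity * abs(c.intensity - p.intensity)
            if cost <= max_cost:  # discard implausible pairs
                candidates.append((cost, i, j))

    matches, used_prev, used_curr = {}, set(), set()
    for cost, i, j in sorted(candidates):  # cheapest pairs first
        if i not in used_prev and j not in used_curr:
            matches[i] = j
            used_prev.add(i)
            used_curr.add(j)
    return matches


# Two synthetic frames: the features appear in swapped order in the second.
prev_frame = [Feature(10, 10, 5, 0, 100), Feature(50, 50, 0, -5, 200)]
curr_frame = [Feature(50, 45, 0, -5, 198), Feature(15, 10, 5, 0, 102)]
matches = match_features(prev_frame, curr_frame)
print(matches)
```

With these two synthetic frames, each feature is paired with the candidate lying at its predicted position, so the result maps index 0 to 1 and index 1 to 0. The greedy loop is only one possible design; when features are dense, a globally optimal assignment would be a natural replacement.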
FIG. 2. Past research on motion analysis of human body parts.
FIG. 3. Past research on tracking of human motion without using body parts.
2. MOTION ANALYSIS OF HUMAN BODY PARTS

This section focuses on motion analysis of human body parts, i.e., approaches which involve 2D or 3D analysis of the human body structure throughout image sequences. Conventionally, human bodies are represented as stick figures, 2D contours, or volumetric models; therefore, body segments can be approximated as lines, 2D ribbons, and 3D volumes, accordingly. Figures 5, 6, and 7 show examples of the stick figure, 2D contour, and elliptical cylinder representations of the human body, respectively. Human body motion is typically addressed by the movement of the limbs and hands, such as the velocities of the hand or limb segments, or the angular velocity of various body parts.

Two general strategies are used, depending upon whether information about the object shape is employed in the motion analysis, namely, model-based approaches and methods which do not rely on a priori shape models. Both methodologies follow the general framework of: (1) feature extraction, (2) feature correspondence, and (3) high-level processing. The major difference between the two methodologies is in establishing feature correspondence between consecutive frames. Methods which assume a priori shape models match the real images to a predefined model. Feature correspondence is automatically achieved once matching between the real images and the model is established. When no a priori shape models are available, however, correspondence between successive frames is based upon prediction or estimation of features related to position, velocity, shape, texture, and color. These two methodologies can also be combined at various levels to verify the matching between consecutive frames and, finally, to accomplish more complex tasks. Since we have surveyed these methods previously in , we will restrict our discussion to very recent work.

2.1. Motion Analysis without a priori Shape Models

Most approaches to 2D or 3D interpretation of human body structure focus on motion estimation of the joints of body segments. When no a priori shape models are assumed, heuristic assumptions are usually used to establish the correspondence
of joints between successive frames. These assumptions impose constraints on feature correspondence, decrease the search space, and eventually result in a unique match.

FIG. 4. Past work on human activity recognition.

FIG. 5. A stick-figure human model (based on Chen and Lee's work).

The simplest representation of a human body is the stick figure, which consists of line segments linked by joints. The motion of joints provides the key to motion estimation and recognition of the whole figure. This concept was initially considered by Johansson, who marked joints as moving light displays (MLD). Along this vein, Rashid attempted to recover a connected human structure with projected MLD by assuming that points belonging to the same object have higher correlations in projected positions and velocities. Later, Webb and Aggarwal [11, 12] extended this approach to 3D structure recovery of Johansson-type figures in motion. They imposed the fixed axis assumption, which assumes that the motion of each rigid object (or part of an articulated object) is constrained so that its axis of rotation remains fixed in direction. Therefore, the depth of the joints can be estimated from their 2D projections. Detailed review of  can be found in . All of these approaches inevitably demand a high degree of accuracy in extracting body segments and joints. The segmentation problem is avoided by using MLDs directly, which in turn limits the usefulness of these methods for images of humans in natural clothing.

FIG. 6. A 2D contour human model (similar to Leung and Yang's model).

Another way to describe the human body is using 2D contours. In such descriptions, the human body segments are analogous to 2D ribbons or blobs. For example, Shio and Sklansky focused their work on 2D translational motion of human blobs. The blobs were grouped based on the magnitude and direction of the pixel velocity, which were obtained using techniques similar to the optical flow method. The velocity of each part was considered to converge to a global average value over several frames. This average velocity corresponds to the motion of the whole human body and leads to identification of the whole body via region grouping of blobs with a similar smoothed velocity. Kurakake and Nevatia attempted to locate the joints in images of walking humans by establishing correspondence between extracted ribbons. They assumed small motion between two consecutive frames, and feature correspondence was conducted using various geometric constraints. Joints were finally identified as the center of the area where two ribbons overlap.

FIG. 7. A volumetric human model (derived from Hogg's work).

Recent work by Kakadiaris et al. [18, 19] focused on body part decomposition and joint location from image sequences of the moving subject using a physics-based framework. Unlike , where joints are located from shapes, here the joint is revealed only when the body segments connected to it involve motion. In the beginning, the subject image is assumed to be one deformable model. As the subject moves and new postures occur,
multiple new models are produced to replace the old ones, with each of them representing an emerging subpart. Joints are determined based on the relative motion and shape of two moving subparts. These methods usually require small image motion between successive frames, which will not be a major concern as the video sampling rate increases. Our main concern is still segmentation under normal circumstances. Among the addressed methods, Shio and Sklansky relied on motion as the cue for segmentation, while the latter two approaches are based on intensity or texture. To the best of our knowledge, robust segmentation technologies are yet to be developed. Since this problem is the major obstacle to most of the work involving low-level processing, we will not mention it repeatedly in later discussions.

Finally, we want to address the recent work by Rowley and Rehg, which focuses on the segmentation of optical flow fields of articulated objects. It is an extension to motion analysis of rigid objects using the expectation-maximization (EM) algorithm. Compared to , the major contribution of  is to add kinematic motion constraints to the data at each pixel. The strength of this work lies in its combination of motion segmentation and estimation in EM computation, i.e., segmentation is accomplished in the E-step, and motion analysis in the M-step. These two steps are computed iteratively in a forward-backward manner to minimize the overall energy function of the whole image. The motion addressed in the paper is restricted to 2D affine transforms. We would expect to see its extension to 3D cases under perspective projections.

2.2. Model-Based Approaches

In the above subsection, we examined several approaches to motion analysis that do not require a priori shape models. In this type of approach, which is necessary when no a priori shape models are available, it is typically more difficult to establish feature correspondence between consecutive frames. Therefore, most methods for the motion analysis of human body parts apply predefined models for feature correspondence and body structure recovery. Our discussion will be based on the representations of various models.

Chen and Lee recovered the 3D configuration of a moving subject according to its projected 2D image. Their model used 17 line segments and 14 joints to represent the features of the head, torso, hip, arms, and legs (shown in Fig. 5). Various constraints were imposed for the basic analysis of the gait. The method was computationally expensive, as it searched through all possible combinations of 3D configurations, given the known 2D projection, and required accurate extraction of 2D stick figures.

Bharatkumar et al. also used stick figures to model the lower limbs of the moving human body. They aimed at constructing a general kinematic model for gait analysis in human walking. Medial-axis transformations were applied to extract 2D stick figures of the lower limbs. The body segment angle and joint displacement were measured and smoothed from real image sequences, and then a common kinematic pattern was detected for each walking cycle. A high correlation (>0.95) was found between the real images and the model. The drawback to this model is that it is view-based and sensitive to changes of the perspective angle at which the images are captured.

Huber's human model is a refined version of the stick figure representation. Joints are connected by line segments with a certain degree of relaxation as "virtual springs." Thus, this articulated kinematic model behaves analogously to a mass-spring-damper system. Motion and stereo measurements of joints are confined to a three-dimensional space called proximity space (PS). The human head serves as the starting point for tracking all PS locations. In the end, particular gestures were recognized based on the PS states of the joints associated with the head, torso, and arms. The key to solving this problem is obtaining the 3D positions of the joints in an image sequence.

A recent publication by Iwasawa focused on real-time extraction of stick figures from monocular thermal images. The height of the human image and the distance between the subject and the camera were precalibrated. Then the orientation of the upper half of the body was calculated as the principal axis of inertia of the human silhouette. Significant points, such as the top of the head and the tips of the hands and feet, were heuristically located as the extreme points farthest from the center of the silhouette. Finally, major joints such as the elbows and knees were estimated, based on the positions of the detected points through genetic learning. The drawback to this method is, again, that it is view-oriented and gesture-restricted. For example, if the human arms are placed in front of the body, there is no way to extract the finger tips using this method and, therefore, it fails to locate the elbow joints.

Niyogi and Adelson pursued another route to estimate the joint motion of human body segments. They first examined the spatial-temporal (XYT) braided pattern produced by the lower limb trajectories of a walking human and conducted gait analysis for coarse human recognition. Then the projection of head movements in the spatial-temporal domain was located, followed by the identification of other joint trajectories. These joint trajectories were then utilized to outline the contour of a walking human, based on the observation that the human body is spatially contiguous. Finally, a more accurate gait analysis was performed using the outlined 2D contour, which led to a fine-level recognition of specific humans. The major concern we have with this work is how to obtain these XYT trajectories in real image sequences without attaching specific sensors to the head and feet during image acquisition.

In Akita's work, both stick figures and cone approximations were integrated and processed in a coarse-to-fine manner. A key frame sequence of stick figures indicates the approximate order of the motion and spatial relationships between the body parts. A cone model is included to provide knowledge of the rough shape of the body parts, whose 2D segments correspond to the counterparts of the stick figure model. A precondition of this approach is that these key frames be obtained prior to body part segmentation and motion estimation. Perales and Torres also made use of both stick figure and volumetric representations in their work. They introduced a predefined