## Cristobal Curio, Heinrich H. Bulthoff, and Martin A. Giese

Print publication date: 2010

Print ISBN-13: 9780262014533

Published to MIT Press Scholarship Online: August 2013

DOI: 10.7551/mitpress/9780262014533.001.0001


(c) Copyright The MIT Press, 2018. All Rights Reserved.

# Markerless Tracking of Dynamic 3D Scans of Faces

Chapter:
16 Markerless Tracking of Dynamic 3D Scans of Faces
Source:
Dynamic Faces
Publisher:
The MIT Press
DOI:10.7551/mitpress/9780262014533.003.0017

# Abstract and Keywords

This chapter presents a novel computer graphics approach to the 3D tracking of facial movements. This kernel-based approach for constructing 3D facial animations also provides the information needed for realistic, controllable facial animation and for dynamic analyses of face space. Because it yields a fully automatic, markerless tracking system, the approach offers a platform on which psychology intersects with other research areas such as computer graphics and machine learning. The chapter also highlights the need for future systems that improve the mesh regularization terms in a face-specific manner by using feedback and findings from previous tracking systems.

Human perception of facial appearance and motion is a very sensitive and complex, yet mostly unconscious process. In order to be able to systematically investigate this phenomenon, highly controllable stimuli are a useful tool. This chapter describes a new approach to capturing and processing real-world data on human performance in order to build graphics models of faces, an endeavor where psychology intersects with other research areas such as computer graphics and machine learning, as well as the artistic fields of movie or game production.

Creating animated 3D models of deformable objects such as faces is an important and difficult task in computer graphics because the human perception of face appearance and motion is particularly sensitive. By drawing on a lifetime of experience with real faces, people are able to detect even the slightest peculiarities in an artificially animated face model, potentially eliciting unintended emotional or other communicative responses. This has made the animator’s job rather difficult and time-consuming when done manually, even for talented artists, and has led to the use of data-driven animation techniques, which aim to capture and re-use the dynamic performance of a live subject.

Data-driven face animation has recently enjoyed considerable success in the movie industry, in which marker-based methods are the most widely used. Although steady progress has been made in marker-based tracking, there are certain limitations involved in placing physical markers on a subject’s face. Summarizing the face by a sparse set of point locations may lose some information and necessitates involved retargeting of geometric motion to map the marker motion onto that of a model suitable for animation. Markers also partially occlude the subject’s face, which itself contains dynamic information such as that caused by changes in blood flow and expression wrinkles. On a practical level, time and effort are required to correctly place the markers, especially when short recordings of a large number of actors are needed—a scenario likely to arise in the computer game industry for example, but also common in psychology research.


Figure 16.1 Results of an automatically tracked sequence. The deforming mesh is rendered with a wire frame to show the correspondence between time steps. Since no markers are used, we also obtain an animated texture capturing subtle dynamic elements like expression wrinkles.

Tracking the changes of a deforming surface over time without markers is more difficult (see figure 16.1). To date, many markerless tracking methods have made extensive use of optical flow calculations between adjacent time steps of the sequence. Since local flow calculations are noisy and inconsistent, it is necessary to introduce spatial coherency constraints. Although significant progress has been made in this direction, for example, by Zhang, Snavely, Curless, and Seitz (2004), the sequential use of between-frame flow vectors can lead to continual accumulation of errors, which may eventually necessitate labor-intensive manual corrections as reported by Borshukov and Lewis (2003). It is also noteworthy that facial cosmetics designed to remove skin blemishes strike directly at the key assumptions of optical flow-based methods.

For face-tracking purposes, there is significant redundancy between the geometry and color information. Our goal is to exploit this multitude of information sources to obtain high-quality tracking results in spite of possible ambiguities in any of the individual sources. In contrast to classical motion capture, we aim to capture the surface densely rather than at a sparse set of locations.

We present a novel surface-tracking algorithm that addresses these issues. The input to the algorithm consists of an unorganized set of four-dimensional (3D plus time) surface points, along with a corresponding set of surface normals and surface colors. From these data, we construct a 4D implicit surface model, as well as a regressed function that models the color at any given point in space and time. Since we require only an unorganized point cloud, the algorithm is not restricted to scanners that produce a sequence of 3D frames and can handle samples at arbitrary points in time and space as produced by a laser scanner, for example. The algorithm also requires a mesh as a template for tracking, which we register with the first frame of the scanned sequence using an interactive tool. The output of our algorithm is a high-quality tracking result in the form of a sequence of deformations that move the template mesh in correspondence with the scanned subject, along with an animated texture. A central design feature of the algorithm is that in contrast to sequential frame-by-frame approaches, it solves for the optimal motion with respect to the entire sequence while incorporating geometry and color information on an equal footing. As a result, the tracking is robust against slippage that is due to the accumulation of errors.

# Background on Facial Animation

Since the human face is such an important communication surface, facial animation has been of interest from early on in the animation community and, later, in the computer graphics community. In the following paragraphs we give an overview of related work, focusing on automated, algorithmic approaches rather than artistic results such as traditional cel animation.

Generally, facial animation can be produced either by a purely 2D image-based approach or by using some kind of 3D representation of the facial geometry. By using the full appearance information from images or video, 2D approaches naturally produce realistic-looking results, but the possibilities for manipulation are limited. Conversely, 3D approaches allow a great amount of control over every aspect of the facial appearance but often have difficulty achieving the same level of realism as image-based methods. For 3D facial animation, two main ingredients are required: a deformable face model that can be animated, and animation data that describe when and where the face changes. Techniques for building deformable 3D face models fall roughly into three groups:

• Example-based 3D morphing, also known as blend shape animation.

• Physical simulation of tissue interaction (e.g., muscle simulations, finite-element systems); see, for example, the work of Sifakis, Neverov, and Fedkiw (2005).

• General nonrigid deformation of distinct surface regions (e.g., radial basis function approaches, cluster or bone animation) (Guenter, Grimm, Wood, Malvar, & Pighin, 1998).

Animation data can be generated by interpolating manually created keyframes, by procedural animation systems (e.g., as produced by text-to-speech systems), or by capturing the motion of a live actor using a specialized recording system. Once the animation data are available, different techniques exist for transferring the motion onto the facial model. If the model is not identical to the actor’s face, this process is called retargeting and the capturing process is often called facial performance capture.

A fourth category of facial animation techniques exists that combines the model and the animation into one dataset by densely recording shape, appearance, and timing aspects of a facial performance. By capturing both 3D and 2D data from moving faces, the work described in this chapter uses a combination of both and thus allows for high realism while still being able to control individual details of the synthetic face. An example of general nonrigid deformations based on motion capture data is used in this book by Knappmeyer, whereas Boker and Cohn follow an example-based 2D approach, and Curio et al. describe a motion-retargeting approach for 3D space.

A wide variety of modalities have been used to capture facial motion and transfer it onto animatable face models. Sifakis et al. (2005) used marker-based optical motion capture to drive a muscle simulation system; Blanz, Basso, Poggio, and Vetter (2003) used a statistical 3D face model to analyze facial motion from monocular video; Guenter et al. (1998) used dense marker motion data to directly deform an otherwise static 3D face geometry.

Recently, several approaches for dense 3D capture of facial motion were presented: Kalberer and Gool (2002) used structured light projection and painted markers to build a face model, then employed Independent Component Analysis to compute the basis elements of facial motion. For the feature film The Matrix, Borshukov and Lewis (2003) captured realistic faces in motion using a calibrated rig of high-resolution cameras and a 3D fusion of optical flow from each camera, but heavily relied on skilled manual intervention to correct for the accumulation of errors. The Mova multiple-camera system (http://www.mova.com) offers a capture service with a multicamera setup, establishing correspondence across cameras and time using speckle patterns caused by phosphorescent makeup. Zhang et al. (2004) employ color optical flow as constraints for tracking. Wand et al. (2007) present a framework for reconstructing moving surfaces from sequences based solely on unordered scanned 3D point clouds. They propose an iterative model assembly that produces the most appropriate mesh topology with correspondence through time.

Many algorithms in computer graphics make use of the concept of implicit surfaces, which are one way of defining a surface mathematically. The surface is represented implicitly by way of an embedding function that is zero on the surface of the object and has a different sign (positive or negative) on the interior and exterior of the object. The study of implicit surfaces in computer graphics began with Blinn (1982) and Nishimura et al. (1985). Variational methods have since been employed (Turk & O’Brien, 1999), and although the exact computation of the variational solution is computationally prohibitive, effective approximations have since been introduced (Carr et al., 2001) and proven useful in representing dynamic 3D scanner data by constructing 4D implicits (Walder, Schölkopf, & Chapelle, 2006). At the expense of useful regularization properties, the partition-of-unity methods exemplified by Ohtake, Belyaev, Alexa, Turk, and Seidel (2003) allow even greater efficiency.

The method we present in this chapter is a new type of partition-of-unity implicit based on nearest-neighbor calculations. As a partition-of-unity method, it represents the surface using many different local approximations that are then stitched together. The stitching is done by weighting each local approximation by one of a set of basis functions which together have the property that they sum to one, or in other words, they partition unity. The new approach extends trivially to arbitrary dimensionality, is simple to implement efficiently, and convincingly models moving surfaces such as faces. Thus we provide a fully automatic markerless tracking system that avoids the manual placement of individual mesh deformation constraints by a straightforward combination of texture and geometry through time, thereby providing both meshes in correspondence and complete texture images. Our approach is able to robustly track moving surfaces that are undergoing general motion without requiring 2D local flow calculations or any specific shape knowledge.
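The partition-of-unity idea can be illustrated with a generic one-dimensional sketch. This is not the nearest-neighbor construction used in this chapter; the Gaussian bumps, the `width` parameter, and all function names are assumptions made only for the example. Each local model is a tangent line to the curve y = x², and the normalized weights stitch the lines into one global function.

```python
import math

def pou_weights(x, centers, width=1.0):
    """Non-negative weights that sum to one at every x (they partition unity)."""
    raw = [math.exp(-((x - c) / width) ** 2) for c in centers]
    total = sum(raw)
    return [r / total for r in raw]

def blend(x, centers, local_models, width=1.0):
    """Stitch the local approximations together using the normalized weights."""
    weights = pou_weights(x, centers, width)
    return sum(w * g(x) for w, g in zip(weights, local_models))

# Hypothetical local models: tangent lines to y = x**2 at each center.
centers = [0.0, 1.0, 2.0]
local_models = [lambda x, c=c: c * c + 2.0 * c * (x - c) for c in centers]

approximation = blend(1.2, centers, local_models, width=0.5)
```

Because the weights sum to one, the blend reproduces any function on which all local models agree, which is what makes the stitching well behaved.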

# Hardware, Data Acquisition, and Computation

Here we provide only a short description of the scanning system we employed for this work since it is not our main focus. The dynamic 3D scanner we use is a commercial prototype developed by ABW GmbH (http://www.abw-3d.de) that uses a modified coded light approach with phase unwrapping (Wolf, 2003). The hardware (see figure 16.2) consists of a minirot H1 projector synchronized with two photon focus MV-D752-160 gray-scale cameras that run with 640 by 480 pixels at 200 frames per second, and one Basler A601fc color camera, running at 40 frames per second and a resolution of 656 by 490 pixels. Our results would likely benefit from higher resolutions, but it is in fact a testament to the efficacy of the algorithm that the system succeeds even with the current hardware setup. The distance from the projector to the recorded subject is approximately 1 m, and the baseline between the two gray-scale cameras is approximately 65 cm, with the color camera and projector in the middle. Before recording, the system is calibrated in order to determine the relative transformations between the cameras and projector.

During recording, the subject wears a black ski mask to cover irrelevant parts of the head and neck. This is not necessary for our surface tracking algorithm but is useful in practice because it reduces the volume of scan data produced and thereby lightens the load on our computational pipeline. To minimize occlusion of the coded light, the subject faces the projector directly and avoids strong rigid head motion during the performances, which range between 5 and 10 seconds. After recording, relative phase information is computed from the stripe patterns cast by the projector. The relative phase is then unwrapped into absolute phase, allowing 3D coordinates to be computed for each valid pixel in each camera image, producing two overlapping triangle meshes per time step, which are then exported for processing by the tracking algorithm. A typical frame results in around 40k points with texture coordinates that index into the color image (see figure 16.3).

Figure 16.2 Setup of the dynamic 3D scanner. Two high-speed gray-scale cameras (a, d) compute forty depth images per second from coded light produced by the synchronized projector (c). Two strobes (far left and right) are triggered by the color camera (b), capturing color images with controlled lighting at a rate of forty frames per second.

Figure 16.3 A single unprocessed mesh produced by the 3D scanning system, with and without the color image (shown in the center) projected onto it.

# Problem Setting and Notation

### Input

Figure 16.4 Result of the interactive, nonrigid alignment of a common template mesh to scans of three different people.

The data produced by our scanner consist of a sequence of 3D meshes with texture images, sampled at a constant rate. As a first step we transform each mesh into a set of points and normals, where the points are the mesh vertices and the corresponding normals are computed by a weighted average of the adjacent face normals using the method described by Desbrun, Meyer, Schröder, and Barr (2002). Furthermore, we append to each 3D point the time at which it was sampled, yielding a 4D spatiotemporal point cloud. To simplify the subsequent notation, we also append to each 3D surface normal a fourth temporal component of zero value. To represent the color information, we assign to each surface point a 3D color vector representing the red-green-blue (RGB) color, which we obtain by projecting the mesh produced by the scanner into the texture image. Hence we summarize the data from the scanner as the following set of $m$ (point, normal, color) triplets:
$$\{(x_i, n_i, c_i)\}_{1 \le i \le m} \subset \mathbb{R}^4 \times \mathbb{R}^4 \times \mathbb{R}^3$$
(16.1)
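The construction of the spatiotemporal samples described above can be sketched as follows. The function name, the data layout (one list of per-vertex tuples per frame), and the convention that frame i is sampled at time i times `dt` are assumptions made for illustration.

```python
def to_spatiotemporal(frames, dt):
    """Turn per-frame (point, normal, color) samples into 4D triplets.

    frames[k] holds the samples of frame k + 1; each sample is a tuple of
    a 3D point, a 3D normal, and an RGB color. dt is the frame interval.
    """
    samples = []
    for i, frame in enumerate(frames, start=1):
        t = i * dt  # sampling time of frame i (constant frame rate assumed)
        for (x, y, z), (nx, ny, nz), rgb in frame:
            point4 = (x, y, z, t)        # 3D point with its sampling time appended
            normal4 = (nx, ny, nz, 0.0)  # normal with a zero temporal component
            samples.append((point4, normal4, tuple(rgb)))
    return samples
```

The result is a single unordered list, which matches the requirement that the tracking input be an unorganized point cloud rather than a sequence of frames.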

### Template Mesh

In addition to the data described here, we also require a template mesh in correspondence with the first frame produced by the scanner, which we denote by $M_1 = (V_1, G)$, where $V_1 \in \mathbb{R}^{3 \times n}$ are the $n$ vertices and $G \subset J \times J$ the edges, where $J = \{1, 2, \ldots, n\}$. The construction of the template mesh could be automated; for example, we could (1) take the first frame itself [or some adaptive refinement of it, for example as produced by a marching-cubes type of algorithm such as that proposed by Kazhdan, Bolitho, & Hoppe (2006)], or (2) automatically register a custom mesh as was done in a similar context by Zhang et al. (2004). Instead, we opt for an interactive approach using the CySlice software package. This semiautomated step requires approximately 15 minutes of user interaction and is guaranteed to lead to a high-quality initial registration (see figure 16.4). Note that although some of our figures depict the template as a quadrangular mesh, this is only for visual clarity, and we assume throughout that the mesh is triangulated. However, this is not an algorithmic restriction, and the performance has also been demonstrated for a higher-resolution mesh.

### Output

The aim is to compute the vertex locations of the template mesh for each frame $i = 2, \ldots, s$, so that it moves in correspondence with the observed surface. We denote the vertex locations of the $i$-th frame by $V_i \in \mathbb{R}^{3 \times n}$. We shall refer to the $j$-th vertex of $V_i$ as $v_{i,j}$. We also use $\tilde{v}_{i,j} \in \mathbb{R}^4$ to represent $v_{i,j}$ concatenated with the relative time of the $i$-th frame. That is, $\tilde{v}_{i,j} = (v_{i,j}^\top, \Delta i)^\top$, where $\Delta$ is the interval between frames.

## The Algorithm

We take the widespread approach of minimizing an energy functional, $E_{\mathrm{obj.}}$, which in our case is defined in terms of the entire sequence of vertex locations, $V_1, V_2, \ldots, V_s$. Rather than using the (point, normal, color) triplets of equation 16.1 directly, we instead use summarized versions of the geometry and color, as represented by the implicit surface embedding function $f_{\mathrm{imp.}}$ and color function $f_{\mathrm{col.}}$, respectively. The construction of these functions is explained in detail in the appendix at the end of this chapter. For now, it is sufficient to know the following:

1. $f_{\mathrm{imp.}} : \mathbb{R}^4 \to \mathbb{R}$ takes as input a spatiotemporal location [say, $\mathbf{x} = (x, y, z, t)$] and returns an estimate of the signed distance to the scanned surface. The signed distance to a surface $S$ evaluated at $\mathbf{x}$ has an absolute value $|\mathrm{dist}(S, \mathbf{x})|$, and a sign that differs on different sides of $S$. At any fixed $t$, the 4D implicit surface can be thought of as a 3D implicit surface in $(x, y, z)$ (see figure 16.5, left).

2. $f_{\mathrm{col.}} : \mathbb{R}^4 \to \mathbb{R}^3$ takes a similar input, but returns a 3-vector representing an estimate of the RGB color value at any given point. Evaluated away from the surface, the function returns an estimate of the color of the surface nearest to the evaluation point (see figure 16.5, right).

3. Both functions are differentiable almost everywhere, vary smoothly through both space and time, and (under some mild assumptions on the density of the samples) can be set up and evaluated efficiently. See the appendix at the end of this chapter.
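To make the roles of the two functions concrete, here is a deliberately crude stand-in that uses only the single nearest sample. It is not the construction from the appendix: it is piecewise rather than smooth, it measures nearness with an unscaled 4D Euclidean distance, and it takes the side the normal points to as positive. All of these choices, and the function names, are assumptions for illustration.

```python
def nearest_sample(x, samples):
    """Return the (point4, normal4, color) triplet whose 4D point is closest to x."""
    return min(samples, key=lambda s: sum((a - b) ** 2 for a, b in zip(x, s[0])))

def f_imp_sketch(x, samples):
    """Signed distance from x to the tangent plane of the nearest sample."""
    point, normal, _ = nearest_sample(x, samples)
    return sum(n * (a - p) for n, a, p in zip(normal, x, point))

def f_col_sketch(x, samples):
    """Color of the sample nearest to x, also when x lies off the surface."""
    return nearest_sample(x, samples)[2]
```

A smooth version would blend many such local estimates, for instance with weights that sum to one as in a partition-of-unity scheme.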

Modeling the geometry and color in this way has the practical advantage that as we construct $f_{\mathrm{imp.}}$ and $f_{\mathrm{col.}}$, we may separately adjust parameters that pertain to the noise level in the raw data, and then visually verify the result. Having done so, we may solve the tracking problem under the assumption that $f_{\mathrm{imp.}}$ and $f_{\mathrm{col.}}$ contain little noise, while summarizing the relevant information in the raw data.

Figure 16.5 The nearest-neighbor implicit surface (left, intensity plot of $f_{\mathrm{imp.}}$, darker is more positive) and color (right, gray-level plot of $f_{\mathrm{col.}}$) models. Although $f_{\mathrm{col.}}$ regresses on RGB values, for this plot we have mapped to gray scale for printing. Here we fix time and one space dimension, plotting over the two remaining space dimensions. The data used are that of a human face; we depict here a vertical slice that cuts the nose, revealing the profile contour with the nose pointing to the right. For reference, a line along the zero level set of the implicit appears in both images.

The energy we minimize depends on the vertex locations through time and the connectivity (edge list) of the template mesh, the implicit surface model, and the color model, i.e., $V_1, \ldots, V_s$, $G$, $f_{\mathrm{imp.}}$, and $f_{\mathrm{col.}}$. With a slight abuse of notation, the functional can be written
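$$E_{\mathrm{obj.}} = \sum_{l \in \{\mathrm{imp.},\, \mathrm{col.},\, \mathrm{acc.},\, \mathrm{reg.}\}} \alpha_l \, E_l(V_1, \ldots, V_s, G, f_{\mathrm{imp.}}, f_{\mathrm{col.}})$$
(16.2)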

where the $\alpha_l$ are parameters that we fix as described in the section on parameter selection, and the $E_l$ are the individual terms of the energy function, which we now introduce. Note that it is possible to interpret the minimizer of the above energy functional as a maximum a posteriori estimate in which the individual terms $\alpha_l E_l$ are interpreted as negative log probabilities, but we do not elaborate on this point.

### Distance to the Surface

The first term is straightforward; in order to keep the mesh close to the surface, we approximate the integral over the template mesh of the squared distance to the scanned surface. As an approximation to this squared distance we take the squared value of the implicit surface embedding function $f_{\mathrm{imp.}}$. We approximate the integral by taking an area-weighted sum over the vertices. The quantity we minimize is given by

$$E_{\mathrm{imp.}} = \sum_{j=1}^{n} a_j \sum_{i=2}^{s} f_{\mathrm{imp.}}(\tilde{v}_{i,j})^2$$
(16.3)

Here, as throughout, $a_j$ refers to the Voronoi area (Desbrun et al., 2002) of the $j$-th vertex of $M_1$, the template mesh at its starting position.
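A minimal sketch of the distance term of equation 16.3 follows, assuming the sum runs over the tracked frames 2 to s (frame 1 being the fixed template) and that `f_imp` is any callable taking a 4D point. The data layout and names are hypothetical.

```python
def e_imp(vertices, areas, dt, f_imp):
    """Area-weighted sum of squared implicit-function values.

    vertices[k][j] is the 3D position of vertex j in frame k + 1,
    areas[j] is the Voronoi area a_j measured on the template mesh,
    dt is the interval between frames.
    """
    total = 0.0
    for k in range(1, len(vertices)):      # frames 2..s (list is 0-based)
        t = (k + 1) * dt                   # relative time of frame k + 1
        for (x, y, z), a in zip(vertices[k], areas):
            total += a * f_imp((x, y, z, t)) ** 2
    return total
```

With an exact signed distance in place of `f_imp`, the value is zero precisely when every tracked vertex lies on the scanned surface.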

### Color

We assume that each vertex should remain on a region of stable color, and accordingly we minimize the sum over the vertices of the sample variance of the color components observed at the sampling times of the dynamic 3D scanner. We discuss the validity of this assumption in our presentation of the results. The sample variance of a vector of observations $y = (y_1, y_2, \ldots, y_s)^\top$ is

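$$V(y) = \frac{1}{s} \sum_{i=1}^{s} \left( y_i - \bar{y} \right)^2, \qquad \bar{y} = \frac{1}{s} \sum_{i'=1}^{s} y_{i'}$$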

To ensure a scaling that is compatible with that of $E_{\mathrm{imp.}}$, we neglect the term $1/s$ in the above expression. Summing these variances over RGB channels, and taking the same approximate integral as before, we obtain the following quantity to be minimized:

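$$E_{\mathrm{col.}} = \sum_{j=1}^{n} a_j \sum_{c \in \{\mathrm{R}, \mathrm{G}, \mathrm{B}\}} s \, V\!\left( \left( f_{\mathrm{col.}}^{c}(\tilde{v}_{1,j}), \ldots, f_{\mathrm{col.}}^{c}(\tilde{v}_{s,j}) \right)^{\top} \right)$$
(16.4)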
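The color term can be sketched as below, assuming the colors returned by the color function have already been evaluated at every vertex in every frame. The factor 1/s is dropped as described in the text; the function names and the data layout are hypothetical.

```python
def unnormalized_variance(values):
    """Sum of squared deviations from the mean (sample variance without 1/s)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def e_col(colors, areas):
    """Area-weighted color variance, summed over vertices and RGB channels.

    colors[k][j] is the RGB triple observed at vertex j in frame k + 1,
    areas[j] is the Voronoi area a_j measured on the template mesh.
    """
    total = 0.0
    for j, a in enumerate(areas):
        for channel in range(3):  # R, G, B
            track = [frame[j][channel] for frame in colors]
            total += a * unnormalized_variance(track)
    return total
```

A vertex that stays on the same patch of skin sees a nearly constant color over time and therefore contributes little to this term.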

### Acceleration

To guarantee smooth motion and temporal coherence, we also minimize a similar approximation to the surface integral of the squared acceleration of the mesh. For a physical analogy, this is similar to minimizing a discretization in time and space of the integral of the squared accelerating forces acting on the mesh, assuming that it is perfectly flexible and has constant mass per area. The corresponding term is given by

$$E_{\mathrm{acc.}} = \sum_{j=1}^{n} a_j \sum_{i=2}^{s-1} \left\| v_{i-1,j} - 2 v_{i,j} + v_{i+1,j} \right\|^2$$
(16.5)
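A minimal sketch of the acceleration term follows, assuming the acceleration is discretized by the second finite difference of each vertex trajectory and that any constant factor involving the frame interval is absorbed into the weight of the term. Names and data layout are hypothetical.

```python
def e_acc(vertices, areas):
    """Area-weighted sum of squared second differences of vertex positions.

    vertices[k][j] is the 3D position of vertex j in frame k + 1,
    areas[j] is the Voronoi area a_j measured on the template mesh.
    """
    total = 0.0
    for k in range(1, len(vertices) - 1):  # interior frames only
        for j, a in enumerate(areas):
            prev, cur, nxt = vertices[k - 1][j], vertices[k][j], vertices[k + 1][j]
            total += a * sum((p - 2.0 * c + n) ** 2 for p, c, n in zip(prev, cur, nxt))
    return total
```

Motion at constant velocity incurs no penalty, so the term discourages jitter without discouraging movement itself.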

### Mesh Regularization

In addition to the previous terms, it is also necessary to regularize deformations of the template mesh in order to prevent unwanted distortions during the tracking phase. Typically such regularization is done by minimizing measures of the amount of bending and stretching of the mesh. In our case however, since we are constraining the mesh to lie on the surface defined by $f_{\mathrm{imp.}}$, which itself bends only as much as the scanned surface, we need only control the stretching of the template mesh.

We now provide a brief motivation for our regularizer. It is possible to use variational measures of mesh deformations, but we found these energies inappropriate for the following reason. In our experiments with them, it was difficult to choose the correct amount by which to penalize the terms; we invariably encountered one of two undesirable scenarios: (1) the penalization was insufficient to prevent undesirable stretching of the mesh in regions of low deformation, or (2) the penalization was too great to allow the correct deformation in regions of high deformation.

It is more effective to penalize an adaptive measure of stretch, which measures the amount of local distortion of the mesh, while retaining invariance to the absolute amount of stretch. To this end, we compute the ratio of the area of adjacent triangles and penalize the deviation of this ratio from that of the initial template mesh $M_1$. The precise expression for $E_{\mathrm{reg.}}$ is

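$$E_{\mathrm{reg.}} = \sum_{e \in G} a(e) \sum_{i=2}^{s} \left( \frac{\mathrm{area}[\mathrm{face}_1(e_i)]}{\mathrm{area}[\mathrm{face}_2(e_i)]} - \frac{\mathrm{area}[\mathrm{face}_1(e_1)]}{\mathrm{area}[\mathrm{face}_2(e_1)]} \right)^2$$
(16.6)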

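Here, $\mathrm{face}_1(e)$ and $\mathrm{face}_2(e)$ are the two triangles containing edge $e$, $\mathrm{area}(\cdot)$ is the area of the triangle, and $a(e) = \mathrm{area}[\mathrm{face}_1(e_1)] + \mathrm{area}[\mathrm{face}_2(e_1)]$, where the subscript on $e$ denotes the frame, so that $e_1$ refers to the edge in the template mesh. Note that the ordering of $\mathrm{face}_1$ and $\mathrm{face}_2$ affects the above term. In practice we restore invariance with respect to this ordering by augmenting the above energy with an identical term with reversed order.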
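The area-ratio regularizer, including the reversed-order term, can be sketched as follows. The representation of each interior edge by the two vertex-index triples of its adjacent triangles, and all names, are assumptions for illustration.

```python
import math

def triangle_area(p, q, r):
    """Area of a 3D triangle: half the norm of the cross product of two edges."""
    u = [q[k] - p[k] for k in range(3)]
    v = [r[k] - p[k] for k in range(3)]
    cx = u[1] * v[2] - u[2] * v[1]
    cy = u[2] * v[0] - u[0] * v[2]
    cz = u[0] * v[1] - u[1] * v[0]
    return 0.5 * math.sqrt(cx * cx + cy * cy + cz * cz)

def e_reg(vertices, face_pairs):
    """Penalize changes in the area ratio of adjacent triangles.

    vertices[k][j] is the 3D position of vertex j in frame k + 1 (frame 1 is
    the template); face_pairs holds one (tri1, tri2) pair of vertex-index
    triples per interior edge.
    """
    template = vertices[0]
    total = 0.0
    for tri1, tri2 in face_pairs:
        a1 = triangle_area(*(template[k] for k in tri1))
        a2 = triangle_area(*(template[k] for k in tri2))
        weight = a1 + a2  # a(e), measured on the template
        for frame in vertices[1:]:
            b1 = triangle_area(*(frame[k] for k in tri1))
            b2 = triangle_area(*(frame[k] for k in tri2))
            # Both orderings of the two faces, so the result is symmetric.
            total += weight * ((b1 / b2 - a1 / a2) ** 2 + (b2 / b1 - a2 / a1) ** 2)
    return total
```

Because only ratios of areas enter, a uniform scaling of the mesh costs nothing, which is the invariance to absolute stretch mentioned above.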

## Implementation

Below we provide details on the parametrization and optimization of our tracking algorithm.

### Deformation-Based Reparameterization

So far we have cast the surface tracking problem as an optimization with respect to the $3(s-1)n$ variables corresponding to the $n$ 3D vertex locations of frames $2, 3, \ldots, s$. This has the following shortcomings:

1. It necessitates further regularization terms to prevent folding and clustering of the mesh, for example.

2. The number of variables is rather large.

3. Compounding the previous shortcoming, convergence will be slow, as this direct parameterization is guaranteed to be ill-conditioned. This is because, for example, the regularization term $E_{\mathrm{reg.}}$ of equation 16.6 acts in a sparse manner between individual vertices. Hence, loosely speaking, gradients in the objective function that are due to local information (for example, the color term $E_{\mathrm{col.}}$ of equation 16.4) will be propagated by the regularization term in a slow domino-like manner from one vertex to the next only after each subsequent step in the optimization.

Figure 16.6 Deformation using control vertices. In this example the template mesh (left) is deformed via three deformation control vertices (black dots) with deformation displacement constraints (black arrows), leading to the deformed mesh (right). In this example 117 control vertices were used (white dots).

A simple way of overcoming these shortcomings is to optimize with respect to a lower-dimensional parameterization of plausible meshes. To do this, we manually select a set of control vertices that are displaced in order to deform the template mesh (see figure 16.6). Note that the precise placement of these control vertices is not critical provided they afford sufficiently many degrees of freedom. To this end, we take advantage of some ideas from interactive mesh deformation (Botsch & Kobbelt, 2004). This leads to a linear parameterization of the vertex locations $V_2, V_3, \ldots, V_s$, namely

$$\bar{V}_i = V_1 + P_i B$$
(16.7)

where $P_i \in \mathbb{R}^{3 \times p}$ represent the free parameters and $B \in \mathbb{R}^{p \times n}$ represent the basis vectors derived from the deformation scheme (Botsch & Sorkine, 2008). We have written $\bar{V}_i$ instead of $V_i$ because we apply another parameterized transformation, namely, the rigid-body transformation. This is necessary since the surfaces we wish to track are not only deformed versions of the template, but also undergo rigid body motion. Hence our vertex parameterization takes the form

$$V_i = R(\theta_i) \, \bar{V}_i + r_i$$
(16.8)

where $r_i \in \mathbb{R}^3$ allows an arbitrary translation, $\theta_i = (\alpha_i, \beta_i, \gamma_i)$ is a vector of angles, and $R(\theta)$ is

$Display mathematics$
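The reparameterization can be sketched as below, under two loudly stated assumptions: the deformation is taken to be the template plus a linear combination of control displacements weighted by the basis, and the rotation is composed as a rotation about x, then y, then z. The axis order, the function names, and the data layout are not taken from the chapter.

```python
import math

def rotation(alpha, beta, gamma):
    """Rotation about x by alpha, then y by beta, then z by gamma (assumed order)."""
    ca, sa = math.cos(alpha), math.sin(alpha)
    cb, sb = math.cos(beta), math.sin(beta)
    cg, sg = math.cos(gamma), math.sin(gamma)
    return [
        [cg * cb, cg * sb * sa - sg * ca, cg * sb * ca + sg * sa],
        [sg * cb, sg * sb * sa + cg * ca, sg * sb * ca - cg * sa],
        [-sb, cb * sa, cb * ca],
    ]

def frame_vertices(template, basis, params, theta, r):
    """Vertex positions of one frame from its low-dimensional parameters.

    template[j]  3D position of vertex j in the template mesh
    basis[k][j]  influence of control displacement k on vertex j (p x n)
    params[k]    3D displacement of control k (the p free parameters)
    theta, r     rotation angles and translation of the rigid-body motion
    """
    rot = rotation(*theta)
    result = []
    for j, v1 in enumerate(template):
        # Nonrigid part: template vertex plus weighted control displacements.
        vbar = [v1[d] + sum(params[k][d] * basis[k][j] for k in range(len(basis)))
                for d in range(3)]
        # Rigid part: rotate, then translate.
        result.append(tuple(sum(rot[a][d] * vbar[d] for d in range(3)) + r[a]
                            for a in range(3)))
    return result
```

Each frame is then described by 3p deformation parameters plus six rigid-body parameters, instead of 3n free vertex coordinates.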

### Remarks on the Reparameterization

The way in which we have proposed to reparameterize the mesh does not amount to tracking only the control vertices. Rather, the objective function contains terms from all vertices, and the positions of the control vertices are optimized to minimize this global error. Alternatively, one could optimize all vertex positions in an unconstrained manner. The main drawback of doing so, however, is not the greatly increased computation times, but the fact that allowing each vertex to move freely necessitates numerous additional regularization terms in order to prevent undesirable mesh behaviors such as triangle flipping. While such regularization terms may succeed in solving this problem, the reparameterization described here is a more elegant solution because we found the problem of choosing various additional regularization parameters to be more difficult in practice than the problem of choosing a set of control vertices that is sufficient to capture the motion of interest. Hence the computational advantages of our scheme are a fortunate side effect of the regularizer induced by the reparameterization.

### Optimizer

We use the popular L-BFGS-B optimization algorithm of Byrd, Lu, Nocedal, and Zhu (1995), a quasi-Newton method that requires as input (1) a function that returns the value and gradient of the objective function at an arbitrary point and (2) a starting point. We set the number of optimization line searches to twenty-five for all of our experiments. In our case, the optimization of the $V_i$ is done with respect to the parameters $\{(P_i, \theta_i, r_i)\}$ described earlier. Hence the function passed to the optimizer first uses equation 16.8 to compute the $V_i$ based on the parameters, then computes the objective function of equation 16.2 and its gradient with respect to the $V_i$, and finally uses these gradients to compute the gradients with respect to the parameters by application of the chain rule of calculus.
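The chain-rule step can be written out as a short worked derivation. It is a sketch that assumes the parameterization has the form $V_i = R(\theta_i)(V_1 + P_i B) + r_i \mathbf{1}^\top$ with $V_i \in \mathbb{R}^{3 \times n}$, where $\mathbf{1}$ is the all-ones vector of length $n$ and $G_i = \partial E_{\mathrm{obj.}} / \partial V_i \in \mathbb{R}^{3 \times n}$ is the gradient with respect to the vertices; the symbols $G_i$ and $\mathbf{1}$ are introduced here only for illustration.

$$\frac{\partial E_{\mathrm{obj.}}}{\partial P_i} = R(\theta_i)^\top G_i B^\top, \qquad \frac{\partial E_{\mathrm{obj.}}}{\partial r_i} = G_i \mathbf{1}, \qquad \frac{\partial E_{\mathrm{obj.}}}{\partial \theta_{i,k}} = \left\langle G_i, \; \frac{\partial R(\theta_i)}{\partial \theta_{i,k}} \left( V_1 + P_i B \right) \right\rangle$$

Here $\langle \cdot, \cdot \rangle$ is the sum of elementwise products and $k$ indexes the three angles. The first identity follows from $\mathrm{d}V_i = R(\theta_i)\, \mathrm{d}P_i\, B$, so that $\langle G_i, \mathrm{d}V_i \rangle = \langle R(\theta_i)^\top G_i B^\top, \mathrm{d}P_i \rangle$.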

### Incremental Optimization

It turns out that even in this lower dimensional space of parameters, optimizing the entire sequence at once in this manner is computationally infeasible. First, the number of variables is still rather large: $3(s-1)(p+2)$, corresponding to the parameters $\{(P_i, \theta_i, r_i)\}_{i=2,\ldots,s}$. Second, the objective function is rather expensive to compute, as we discuss in the next paragraph. However, optimizing the entire sequence would be problematic even if it were computationally feasible, owing to the difficulty of finding a good starting point for the optimization. Since the objective function is nonconvex, it is essential to be able to find a starting point that is near a good local minimum, but it is unclear how to initialize all frames $2, 3, \ldots, s$ given only the first frame and the raw scanner data. Fortunately, both the computational issue and that of the starting point are easily dealt with by incrementally optimizing within a moving temporal window. In particular, we first optimize frame 2, then frames 2–3, frames 2–4, frames 3–5, frames 4–6, etc. With the exception of the first two steps, we always optimize a window of three frames, with all previous frames held fixed. It is now reasonable to simply initialize the parameters of each newly included frame with those of the previous frame at the end of the previous optimization step.
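The moving-window schedule described above can be sketched as a small generator. The function name and the `window` argument are hypothetical; frame numbering starts at 1, with frame 1 being the fixed template.

```python
def window_schedule(num_frames, window=3):
    """Yield the frame numbers that are free at each incremental step."""
    for last in range(2, num_frames + 1):
        first = max(2, last - window + 1)  # never include the template frame
        yield list(range(first, last + 1))

# Each step adds one new frame, initialized from its predecessor,
# and frees at most `window` frames while earlier ones stay fixed.
steps = list(window_schedule(6))
```

For a six-frame sequence this yields the windows {2}, {2, 3}, {2, 3, 4}, {3, 4, 5}, and {4, 5, 6}, so every frame from 4 onward is optimized in three successive steps.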

Note that although we optimize on a temporal window with the other frames fixed, we include in the objective function all frames from the first to the current, eventually encompassing the entire sequence. Hence, loosely speaking, the color variance term $E_{\mathrm{col.}}$ of equation 16.4 forces each vertex inside the optimization window to stay within regions that have a color similar to that “seen” previously by the given vertex at previous time steps. One could also treat the final output of the incremental optimization as a starting point for optimizing the entire sequence with all parameters unfixed, but we found this leads to little change in practice. This is not surprising as, given the moving window of three frames, the optimizer essentially has three chances to get each frame right, with a forward and backward lookahead of up to two frames.

### Parameter Selection

Choosing parameter values is straightforward since the terms in equation 16.2 are, loosely speaking, close to orthogonal. For example, tracking color and staying near the implicit surface are goals that typically compete very little; either can be satisfied without compromising the other. Hence the results are insensitive to the ratio of these parameters, namely αimp./αcol. Furthermore, the parameters relating to the nearest-neighbor-based implicit surface and color models, and the deformation-based reparameterization, can both be verified independently of the tracking step and were fixed for all experiments (see the appendix for details).

In order to determine suitable parameter settings for αimp., αcol., αacc., and αreg. of equation 16.2, we employed the following strategy. First, we removed a degree of freedom by fixing without loss of generality αimp. = 1. Next we assumed that the implicit surface was sufficiently reliable and treated the distance-to-surface term almost like the hard constraint Eimp. = 0 by setting the next parameter αcol. to be 1/100. We then took a sample dataset and ran the system over a 2D grid of values of the weights αacc. and αreg. of the terms Eacc. and Ereg., inspected the results visually, and fixed these two remaining parameters accordingly for all subsequent experiments.
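The selection strategy amounts to a small grid search with two weights pinned. The sketch below only enumerates the candidate settings; the grid values are placeholders chosen for illustration (the chapter does not state the values that were tried), and the function name `parameter_grid` is hypothetical.

```python
import itertools

ALPHA_IMP = 1.0   # fixed without loss of generality
ALPHA_COL = 0.01  # 1/100, so the surface term acts almost as a hard constraint


def parameter_grid(acc_values, reg_values):
    """Return one weight setting per point of the 2D (acc, reg) grid."""
    return [
        {"imp": ALPHA_IMP, "col": ALPHA_COL, "acc": acc, "reg": reg}
        for acc, reg in itertools.product(acc_values, reg_values)
    ]


# Hypothetical logarithmic grid; each setting would be run on a sample
# dataset and the tracking result inspected visually.
candidates = parameter_grid([0.01, 0.1, 1.0, 10.0], [0.01, 0.1, 1.0, 10.0])
```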

# Results

Tracking results are best visualized with animation, hence the majority of our results are presented in the accompanying video (available at http://www.kyb.mpg.de/~dynfaces). Here we discuss the practical performance of the system and show still images from the animated sequences produced by the surface-tracking algorithm (see figure 16.7).

Figure 16.7 Challenging examples visualized by projecting the tracked mesh into the color camera image.

## Performance

We now provide timings of the tracking algorithm, which ran on a 64-bit, 2.4-GHz AMD Opteron 850 processor with 4 GB of random-access memory (RAM), using a mixture of Matlab and C++ code. We focus on timings for face data and only report averages since the timings vary little with respect to identity and performance.

The recording length is currently limited to 400 frames by the amount of RAM in the scanner computer, which is in turn limited by its 32-bit architecture: 64-bit drivers are not available for our frame grabbers, and the data rate is too high to store directly to hard disk. Note that this limitation is not due to our tracking algorithm, which has constant memory and linear time requirements in the length of the sequence.

After recording a sequence, the scanning software computes the geometry with texture coordinates in the form depicted in the top panel of figure 16.8. Before starting the tracking algorithm, a fraction of a second per frame is required to compute (point, normal, color) triplets from the output of the scanner. The computation time for B of equation 16.7 was negligible for the face template mesh we used, because it consists of only 2,100 vertices. Almost all the computation time of the tracking algorithm was spent evaluating the objective function and its gradient during the optimization phase, and of this, about 80% was spent doing nearest-neighbor searches into the scanner data using the algorithm of Merkwirth, Parlitz, and Lauterborn (2000) in order to evaluate the implicit surface and color models. Including the 1–2 seconds required to build the data structure of the nearest-neighbor search algorithm for each temporal window, the optimization phase of the tracking algorithm required about 20 seconds per frame. Note that the algorithm is trivially parallelizable, and that only a small fraction of the recorded data needs to be stored in RAM at any given time. Note also that the computation times seem to scale roughly linearly with template mesh density.

## Markerless Tracking

Figure 16.8 shows various stills from the recording of a single subject, depicting the data input to the tracking system, as well as various visualizations of the tracking results. The tracking result is convincing and exhibits very little accumulation of error, as can be seen by the consistent alignment of the template mesh with the neutral expression in the first and last frames. Since no markers were used, the original color camera images can be projected onto the deformed template mesh, yielding photorealistic expression wrinkles.

Figure 16.8 Snapshots from a tracking sequence. The top panel shows the input to the tracking system: texture images from the color camera (top) and the geometry data (bottom). The bottom panel visualizes the output of the system: the tracked mesh with checkerboard pattern to show the correspondence (top), the tracked mesh with animated texture taken directly from the color camera (middle), and for reference the tracked mesh as a wire frame projected into the original color camera image (bottom). We show the first and final frames of the sequence at the far left and right, with three intermediate frames between them.

Some more challenging examples are shown in figure 16.7. The expressions performed by the male subject in the top row involve complex deformations around the mouth area that the algorithm captures convincingly. To test the reliance on color, we also applied face paint to the female subject shown in the video of our results at http://www.kyb.mpg.de/~dynfaces. The deterioration in performance is graceful in spite of both the high specularity of the paint and the sparseness of the color information. To demonstrate that the system is not specific to faces, we also include a final example showing a colored cloth being tracked in the same manner as all of the other examples, only with a different template mesh topology. The cloth tracking exhibits only minor inaccuracies around the border of the mesh because there is less information here to resolve the problems caused by plain-colored and strongly shadowed regions. A further example included in the accompanying video shows a uniformly colored, deforming, and rotating piece of foam being tracked reliably using shape cues alone.

# Discussion and Future Work

By design, our algorithm does not use optical flow calculations as the basis for surface tracking. Rather, we combine shape and color information on a coarser scale, under the assumption that the color does not change excessively on any part of the surface. This assumption did not cause major problems in the case of expression wrinkles because such wrinkles tend to appear and disappear on a part of the face with little relative motion with respect to the skin. Hence, in terms of the color penalty in the objective function, wrinkles do not induce a strong force in any specific direction.

Although there are other lighting effects that are more systematic, such as specularities and self-shadowing, we believe these do not represent a serious practical concern for the following reasons. First, we found that in practice the changes caused by shadows and highlights were largely accounted for by the redundancy in color and shape over time. Second, it would be easy to reduce the severity of these lighting effects using light polarizers, more strobes, and lighting normalization based on a model of the fixed-scene lighting. With the recent interest in markerless surface-capturing methods, we hope that in the future the performance of new approaches such as those presented in this chapter can be systematically compared with others. The tracking system we have presented is automated; however, it is straightforward to modify the energy functional we minimize in order to allow the user to edit the result by adding vertex constraints, for example. It would also be interesting to develop a system that can improve the mesh regularization terms in a face-specific manner by learning from previous tracking results. Another interesting direction is intelligent occlusion handling, which could overcome some of the limitations of structured light methods and also allow the tracking of more complex self-occluding objects.

# Acknowledgments

This work was supported by the European Union project BACS FP6-IST-027140 and the Deutsche Forschungsgemeinschaft (DFG) Perceptual Graphics project PAK 38.

# Appendix: KNN Implicit Surface and Color Models

In this appendix we motivate and define our nearest-neighbor-based implicit surface and color models. Our approach falls into the category of partition-of-unity methods, in which locally approximating functions are mixed together to form a global one. Let Ω be our domain of interest and assume that we have a set of non-negative (and typically compactly supported) functions {ϕi} that satisfy

$$\sum_i \phi_i(x) = 1, \quad \forall x \in \Omega.$$
(16.9)

Now let {fi} be a set of locally approximating functions, one for each support supp(ϕi). The partition-of-unity approximating function on Ω is f(x) = Σi ϕi(x) fi(x). The ϕi are typically defined implicitly by way of a set of compactly supported auxiliary functions {wi}. Provided the wi are non-negative and satisfy supp(wi) = supp(ϕi), the following choice is guaranteed to satisfy equation 16.9:

$$\phi_i(x) = \frac{w_i(x)}{\sum_j w_j(x)}.$$
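As a brief check, assuming the choice above is the usual normalized form ϕi = wi / Σj wj and that at least one wj is strictly positive at every x in Ω, summing over i recovers the partition-of-unity condition of equation 16.9:

```latex
\sum_i \phi_i(x)
  = \sum_i \frac{w_i(x)}{\sum_j w_j(x)}
  = \frac{\sum_i w_i(x)}{\sum_j w_j(x)}
  = 1, \qquad \forall x \in \Omega .
```

Non-negativity of the ϕi follows directly from that of the wi, and each ϕi vanishes exactly where its wi does.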

At present we take the extreme approach of associating a local approximating function fi with each data point from the set x1, x2, …, xm ∈ ℝ⁴ produced by our scanner. In particular, for the implicit surface embedding function fimp. : ℝ⁴ → ℝ, we associate with xi the linear locally approximating function fi(x) = (x − xi)ᵀ ni, where ni is the surface normal at xi. For the color model fcol. : ℝ⁴ → ℝ³, the local approximating functions are simply the constant vector-valued functions fi(x) = ci, where ci ∈ ℝ³ represents the RGB color at xi. Note that the description here constitutes a slight abuse of notation, owing to our having redefined fi twice.

Figure 16.9 An ℝ1 example of our nearest-neighbor-based mixing functions {ϕi}, with k = 5. The horizontal axis represents the one-dimensional real line on which the {xi} are represented as crosses. The correspondingly colored curves represent the value of the mixing functions {ϕi}.

To define the ϕi, we first assume without loss of generality that d1 ≤ d2 ≤ … ≤ dk ≤ di, ∀i > k, where x is our evaluation point and di = ∥x − xi∥. In practice, we obtain such an ordering by way of a k nearest-neighbor search using the TSTOOL software library (Merkwirth et al., 2000). By now letting ri = di/dk and choosing wi = (1 – ri)+, where (·)+ denotes the positive part, it is easy to see that the corresponding ϕi of equation 16.9 are continuous, differentiable almost everywhere, and that we only need to examine the k nearest neighbors of x in order to compute them (see figure 16.9). Note that the nearest-neighbor search costs are easily amortized between the evaluation of fimp. and fcol. Larger values of k average over more local estimates and hence lead to smoother functions; for our experiments, we fixed k = 50. Note also that the nearest-neighbor search requires Euclidean distances in 4D, so we must decide, say, what spatial distance is equivalent to the temporal distance between frames. If the spatial distance is too small, each frame will be treated separately, whereas if it is too large, the frames will be smeared together temporally. The heuristic we used was to adjust the time scale so that on average approximately half of the k nearest neighbors of each data point come from the same time (that is, the same 3D frame from the scanner) as that data point, and the other half comes from the surrounding frames. In this way we obtain functions that vary smoothly through space and time.
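The construction above can be illustrated with a short, self-contained sketch. It is not the authors' implementation: it replaces the TSTOOL search with a brute-force sort, pads each normal with a zero time component so that it lives in ℝ⁴ (an assumption), and falls back to a uniform average in the degenerate case where all k neighbors are equidistant (also an assumption). All function names are hypothetical.

```python
import math


def knn_mixing_weights(x, points, k):
    """Return (indices, phi) for the k nearest neighbours of x.

    Uses r_i = d_i / d_k, w_i = max(1 - r_i, 0) and phi_i = w_i / sum_j w_j.
    """
    # Brute-force search for clarity; a spatial data structure would be
    # used in practice.
    nearest = sorted((math.dist(x, p), i) for i, p in enumerate(points))[:k]
    d_k = nearest[-1][0]
    if d_k > 0.0:
        w = [max(1.0 - d / d_k, 0.0) for d, _ in nearest]
    else:
        w = [0.0] * len(nearest)
    total = sum(w)
    if total == 0.0:
        # Degenerate case: all k neighbours equidistant. Uniform fallback
        # is an assumption of this sketch.
        w, total = [1.0] * len(nearest), float(len(nearest))
    return [i for _, i in nearest], [wi / total for wi in w]


def f_imp(x, points, normals, k):
    """Implicit surface value: mix of the local planes (x - x_i)^T n_i."""
    idx, phi = knn_mixing_weights(x, points, k)
    return sum(
        p * sum((a - b) * n for a, b, n in zip(x, points[i], normals[i]))
        for i, p in zip(idx, phi)
    )


def f_col(x, points, colors, k):
    """Colour value: mix of the constant local colours c_i."""
    idx, phi = knn_mixing_weights(x, points, k)
    return tuple(sum(p * colors[i][c] for i, p in zip(idx, phi)) for c in range(3))


# Toy data: a static plane z = 0 sampled in (x, y, z, t) with uniform grey colour.
points = [(float(i), float(j), 0.0, 0.0) for i in range(-3, 4) for j in range(-3, 4)]
normals = [(0.0, 0.0, 1.0, 0.0)] * len(points)
colors = [(0.5, 0.5, 0.5)] * len(points)
query = (0.2, 0.1, 0.5, 0.0)
print(f_imp(query, points, normals, k=5))
print(f_col(query, points, colors, k=5))
```

For planar data with consistent normals, every local function returns the same signed distance, so the mixed value is that distance regardless of the weights. Note also that the k-th neighbor always receives zero weight, which is what makes the ϕi continuous as the neighbor set changes.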

It is easy to visually verify the effect of this choice by rendering the implicit surface and color models, as demonstrated in the accompanying video. This method is particularly efficient when we optimize on a moving window as discussed in the implementation section. Provided the data are of a roughly constant spatial density near the surface, as is the case with our dynamic 3D scanner, one may easily bound the temporal interval between any given point in the optimization window and its k-th nearest neighbor. Hence it is possible to perform the nearest-neighbor searches on a temporal slice of the full dataset. In this case, for a constant temporal window size, the implicit surface and color models enjoy setup and evaluation costs of O[q log(q)] and O[k log(q)], respectively, where q is the number of vertices in a single 3D scan from the scanner. These costs are those of building and traversing the data structure used by the nearest-neighbor searcher (Merkwirth et al., 2000).
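The time-scale heuristic from the preceding paragraph (about half of each point's k nearest neighbors coming from its own frame) can be sketched as a one-dimensional search. The sketch below is one possible reading, not the procedure actually used: it excludes each point from its own neighbor list, uses a brute-force search, and bisects on the scale, relying on the fact that the same-frame fraction does not decrease as the time axis is stretched (up to ties). The function names and the search bounds are hypothetical.

```python
import math


def same_frame_fraction(points, frames, k, time_scale):
    """Average fraction of each point's k nearest 4D neighbours sharing its frame."""
    lifted = [p + (time_scale * t,) for p, t in zip(points, frames)]
    same = 0
    for i, q in enumerate(lifted):
        nearest = sorted(
            (math.dist(q, r), j) for j, r in enumerate(lifted) if j != i
        )[:k]
        same += sum(1 for _, j in nearest if frames[j] == frames[i])
    return same / (k * len(lifted))


def calibrate_time_scale(points, frames, k, target=0.5, lo=0.0, hi=1e3, steps=40):
    """Bisect for a scale whose same-frame fraction is at least `target`.

    Assumes the fraction at `hi` already reaches the target.
    """
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if same_frame_fraction(points, frames, k, mid) < target:
            lo = mid  # frames still too close together in 4D
        else:
            hi = mid  # frames far enough apart; try a smaller scale
    return hi


# Toy data: three frames of six points on a line, shifted slightly per frame.
frames = [t for t in range(3) for _ in range(6)]
points = [(i + 0.1 * t, 0.0, 0.0) for t in range(3) for i in range(6)]
scale = calibrate_time_scale(points, frames, k=2)
```

Because the fraction is a step function of the scale on finite data, the target is only met approximately; on real scans with k = 50 the steps would be much finer.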

# References

Blanz, V., Basso, C., Poggio, T., & Vetter, T. (2003). Reanimating faces in images and video. Comput Graph Forum, 22, 641–650.

Blinn, J. F. (1982). A generalization of algebraic surface drawing. SIGGRAPH Comput Graph, 16, 273.

Borshukov, G., Piponi, D., Larsen, O., Lewis, J. P., & Tempelaar-Lietz, C. (2003). Universal capture-image-based facial animation for “The Matrix Reloaded.” In SIGGRAPH 2003 Sketches. New York: ACM Press.

Botsch, M., & Kobbelt, L. (2004). An intuitive framework for real-time freeform modeling. In SIGGRAPH ’04: ACM SIGGRAPH 2004 Papers (pp. 630–634). New York: ACM Press.

Botsch, M., & Sorkine, O. (2008). On linear variational surface deformation methods. IEEE Trans Vis Comput Graph, 14, 213–230.

Byrd, R. H., Lu, P., Nocedal, J., & Zhu, C. (1995). A limited memory algorithm for bound constrained optimization. SIAM J Sci Comput, 16, 1190–1208.

Carr, J. C., Beatson, R. K., Cherrie, J. B., Mitchell, T. J., Fright, W. R., McCallum, B. C., & Evans, T. R. (2001). Reconstruction and representation of 3D objects with radial basis functions. In ACM SIGGRAPH 2001 (pp. 67–76). New York: ACM Press.

Desbrun, M., Meyer, M., Schröder, P., & Barr, A. H. (2002). Discrete differential-geometry operators for triangulated 2-manifolds. Vis Math, 2, 35–57.

Guenter, B., Grimm, C., Wood, D., Malvar, H., & Pighin, F. (1998). Making faces. In SIGGRAPH ’98: Proceedings of the 25th annual conference on computer graphics and interactive techniques (pp. 55–66). New York: ACM Press.

Kalberer, G. A., & Gool, L. V. (2002). Realistic face animation for speech. J Visual Comput Anima, 13, 97–106.

Kazhdan, M., Bolitho, M., & Hoppe, H. (2006). Poisson surface reconstruction. In SGP ’06: Proceedings of the fourth Eurographics symposium on geometry processing (pp. 61–70). Aire-la-Ville, Switzerland: Eurographics Association.

Merkwirth, C., Parlitz, U., & Lauterborn, W. (2000). Fast nearest-neighbor searching for nonlinear signal processing. Phys Rev E, 62, 2089–2097.

Nishimura, H., Hirai, M., Kawai, T., Kawata, T., Shirakawa, I., & Omura, K. (1985). Object modeling by distribution function and a method of image generation. Trans Inst Electron Commun Eng Japan, 68, 718–725.

Ohtake, Y., Belyaev, A., Alexa, M., Turk, G., & Seidel, H.-P. (2003). Multi-level partition of unity implicits. ACM Trans Graph, 22, 463–470.

Sifakis, E., Neverov, I., & Fedkiw, R. (2005). Automatic determination of facial muscle activations from sparse motion capture marker data. ACM Trans Graph, 24, 417–425.

Turk, G., & O’Brien, J. F. (1999). Shape transformation using variational implicit functions. In Proceedings of ACM SIGGRAPH 1999 (pp. 335–342). Los Angeles, CA. New York: ACM Press.

Walder, C., Schölkopf, B., & Chapelle, O. (2006). Implicit surface modelling with a globally regularised basis of compact support. Proc Eurographics, 25, 635–644.

Wand, M., Jenke, P., Huang, Q., Bokeloh, M., Guibas, L., & Schilling, A. (2007). Reconstruction of deforming geometry from time-varying point clouds. In SGP ’07: Proceedings of the fifth Eurographics symposium on geometry processing (pp. 49–58). Aire-la-Ville, Switzerland: Eurographics Association.

Wolf, K. (2003). 3D measurement of dynamic objects with phase-shifting techniques. In T. Ertl (ed.), Proceedings of the vision, modeling, and visualization conference 2003 (pp. 537–544). Aka GmbH.

Zhang, L., Snavely, N., Curless, B., & Seitz, S. M. (2004). Spacetime faces: High-resolution capture for modeling and animation. ACM SIGGRAPH (pp. 548–558). Los Angeles, CA. New York: ACM Press.