If eyes are the window to the soul, then the face surely is the front door. For many years, CG artists and animators have tried to reproduce human faces digitally and to produce creatures with human emotions that play across their faces. Those with a keen eye for detail have painstakingly copied expressions frame by frame from video onto digital characters. And for many years, CG scientists have helped make that task easier and faster by creating camera- and sensor-based systems that capture, with various degrees of accuracy and artistry, human faces in motion – thus giving the artists and animators a digital starting point based in reality.
We’ve seen the results in film – from the first major attempt to reproduce actors’ faces in Final Fantasy (2001), through Robert Zemeckis’ body of work from The Polar Express (2004) through Beowulf (2007) created at Sony Pictures Imageworks, to A Christmas Carol (2009) from ImageMovers Digital.
Also in 2009, the digital, aged version of Brad Pitt that Digital Domain created for The Curious Case of Benjamin Button (2008) helped that film earn an Oscar nomination for best picture and brought home an Oscar for best visual effects.
Similarly, when done well, digital characters animated using facial expressions captured from humans bring home Oscars and Oscar nominations for best visual effects and drive box-office success: Davy Jones created at Industrial Light & Magic for Pirates of the Caribbean: Dead Man’s Chest (2006) and the Na’vi created at Weta Digital in Avatar (2009) gave those studios Oscars for best visual effects. Weta Digital’s Caesar in Rise of the Planet of the Apes (2011) and the Hulk created at ILM for The Avengers (2012) resulted in Oscar nominations for best visual effects.
So, what’s new? Three things: At the high end, facial motion capture and retargeting are becoming more realistic as studios and companies perfect their tools and artists become more practiced. Game engines can render more realistic characters in real time. And at a nearly consumer level, 3D sensors, such as Microsoft’s Kinect and those from PhaseSpace, combined with capture and retargeting software, are “democratizing” facial capture.
The opportunity for actors to play themselves at any age, size, or shape is becoming more appealing and available as digital doppelgangers and characters become ever more realistic and costs decline. In games, sports figures look more like themselves, and warriors, aliens, and avatars can carry believable emotions – especially in cut-scenes and cinematics but, increasingly, in gameplay. Soon, consumers will be able to create avatars that wear their facial expressions and see expressive versions of themselves in kiosks.
It’s an exciting time for facial-capture products and techniques. As Brian Rausch, CEO of the motion-capture facility House of Moves, puts it: “We’ve all gotten really good at body capture. It’s been 20 years now. Most body rigs tend to look roughly the same. The variations in how people apply the data are small. But faces … that’s still the Wild West. Everyone has their own ideas. It’s like a bunch of mad scientists blowing things up while we’re out in the middle of the street. It’s weird, but cool. Very cool.”
The House of Moves has performers wearing markers on their faces and bodies, and captures both simultaneously in motion-capture volumes with hundreds of cameras; at other times, the performers wear head-mounted cameras. “It costs a lot to surround an actor with cameras,” Rausch says, “and we can get solid information from the head-mounted cameras. But sometimes the helmet bars get in the way. Every technical solution has a place and use.”
It would be impossible for us to target everything happening with facial capture right now, but to begin coloring in the picture, we talked with several vendors and service providers in this space, with studios that continue to push the state of the art, and with a game developer that moved to a unique, low-cost solution for quickly animating dozens of faces.
Vicon’s Cara 3D stereo marker-based system
Drawing on Vicon’s long history of providing marker-based optical systems for capturing full-body performances with its proprietary cameras, the company now offers a head rig with four cameras for stereo capture of markers on faces. The cameras sit on a brace that extends out several inches from the face, two on either side. Typically, the actor wearing the head rig would have markers painted on his or her face.
“Cara was born originally out of work we did with ImageMovers Digital for A Christmas Carol,” says Phil Elderfield, entertainment product manager at Vicon. “That predecessor had lower resolution and lower frame rates – it was cruder in every respect, really. But it captured the infinite bendiness of Jim Carrey. It became clear we could evolve that into a product that fit a broader need.” Five years later, the company introduced Cara, which Elderfield calls the most complex product the company has produced.
“There have been a lot of single-camera systems around for a long time,” Elderfield says. “We went for the highest possible quality: an open, modular system from which we can extract 3D data. With one camera, you have to process and analyze the data to infer depth. With Cara’s four cameras, we get good coverage and can extract 3D points based on markers placed on the face, straight off the bat. We get the 3D representation and the motion.”
Although the markers provide accuracy through Cara’s system, Vicon doesn’t demand allegiance to its proprietary system. “The system is open to not only our background software, but to any other processing pipeline. You could capture images without markers and process through another system. There is so much going on in this area, so many approaches, that we don’t force people down any road. The market isn’t embryonic, but it’s evolving. There are lots of ways to skin this cat.”
Vicon built the rig-mounted 720p, 60 fps cameras from scratch. “You can capture faces in a body system,” Elderfield says. “That’s well understood and commonly used. But, these cameras are locked to the face and close up.” Thus, rather than using the large number of very-high-resolution cameras necessary to capture small face markers within a volume designed for body capture, Cara provides a less-expensive yet accurate solution for the facial-capture part of the performance capture. “When you capture the body and face at the same time without using the head rig, it can take a lot of cleanup and editing in post,” Elderfield says. “We try to extract the same kind of data in comparable quality to what you would get in an optical volume – in a fraction of the time.”
CaraLive software gives an operator images from all four cameras and control over the acquisition system. CaraPost software takes the images from the four cameras and creates a 3D point-cloud representation of the marker positions on the face. Vicon has priced a complete system at approximately $45,000. “It comes in a case,” Elderfield says. “You could take the lid off and, potentially within an hour, capture 3D points on a face. We even include a marker pen.”
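CaraPost’s solver is proprietary, but the underlying idea – recovering a 3D marker position from two or more calibrated views – is classic stereo triangulation. The sketch below uses the standard direct linear transform; the camera matrices and marker position are invented purely for illustration:

```python
import numpy as np

def triangulate_marker(P1, P2, uv1, uv2):
    """Recover a 3D marker position from two calibrated views via the
    direct linear transform (DLT).

    P1, P2   : 3x4 camera projection matrices
    uv1, uv2 : (u, v) image coordinates of the same marker in each view
    """
    # Each view contributes two linear constraints on the homogeneous
    # 3D point X, e.g. u * (P[2] @ X) - P[0] @ X = 0.
    A = np.array([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    # The least-squares solution is the right singular vector with the
    # smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # de-homogenize

def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Two hypothetical rig cameras: one at the origin, one offset 100 mm in x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-100.0], [0.0], [0.0]])])
point = np.array([20.0, 30.0, 400.0])  # a face marker 400 mm from the rig

recovered = triangulate_marker(P1, P2, project(P1, point), project(P2, point))
print(np.round(recovered, 3))  # → [ 20.  30. 400.]
```

With four cameras, as on Cara, each extra view simply adds two more rows to the linear system, making the estimate more robust to noise and occlusion.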
Glenn Derry – Technoprops
High-resolution, high-fidelity, head-mounted video-capture rig
Glenn Derry’s name became famous when the world learned that he was the virtual production supervisor for James Cameron’s Avatar. In fact, Avatar is the touchstone for anyone doing on-set facial motion capture. Although Derry is currently working on Clint Eastwood’s upcoming film Jersey Boys, these days the company finds itself capturing actors for game developers as much as or more than for film studios. “Eighty percent of our work right now is for video game cinematics,” Derry says, “especially for the new console games. About a year and a half ago, companies really started buying into the fact that they could get a more cohesive performance if they do body and face capture simultaneously. So, we’re applying a lot of the same techniques we used for Avatar.”
Derry’s group provides a head rig equipped with cameras that capture 1920x1080-resolution, 60 fps, time-code-embedded, uncompressed color video. “Our guys are on the ground fitting the helmet, making sure we’re capturing good data,” he says. On set, the company typically works cheek to cheek with Giant Studios. “They do the body capture and actual facial retargeting and solving, as well,” Derry says. “And in some cases, we do the reference work. We bring our systems and plug in to theirs. We can do the same with other companies, but the bulk of our work is with Giant.”
As part of the on-set facial capture, Derry’s team sometimes helps a studio’s makeup artists apply markers. Prior to the shoot, the team scans the actor’s face, creates a vacuform shell from the scan, and drills holes to show the makeup artists where to paint the markers. But markers aren’t always used – or necessary. Because the video is high resolution and uncompressed, it’s helpful for markerless pattern tracking. “We are marker agnostic,” Derry says. “We don’t care if the client does marker, markerless, or surface capture. Most of our clients use a technique similar to what we did on Avatar with paint-based markers. But some are doing a surface scan, and some are doing ingenious pattern-tracking that doesn’t require markers. We just provide the video. We don’t do the processing. There are other outfits good at the processing side.”
Typically, a motion-capture or facial-capture supervisor for the project will decide what works best for their production. “Every company approaches the problem differently, and every company has a different marker pattern,” Derry says. “There is no set-in-stone way to do it. With body capture, the technique is pretty much the same whatever system you use. Facial is still up in the air. If you’re going to put an actor in the show, a surface-scan technique makes sense. If you have an actor playing an ogre or another character, you might go through the process of applying markers, tracking, solving the source, and then retargeting. Some groups send the video to a service bureau; others do a lot of work in-house. It’s still a little Wild West.”
Derry believes that with the help of performance capture, we’ll see characters performing with photoreal faces in games in the future. “That isn’t as far out as people might think,” he says. “The animation side has come leaps and bounds in the past few years. The big trick has been lighting, not animation.”
USC ICT High-resolution faces with video performance capture
Studios have turned to USC’s Institute for Creative Technologies (ICT) for years to capture static expression scans in the LightStage system, which surrounds an actor’s face with light. “For the static scans, we ask an actor to hold an expression for two seconds while we turn on different patterns of light and take 16 photos of the face,” explains Paul Debevec, associate director for graphics research at USC ICT and research professor at USC. “We have flat, even illumination, and then gradients with the bright light at the top, at the bottom, moving from left to right, and right to left. We also have two polarization states from vertical to horizontal.” The goal is to get detailed texture maps so the rendered face can have the same lighting as the static images. “So we analyze the specular reflections, the shine, to get high-resolution skin detail,” Debevec says. “We can separate specular from subsurface scattering, which are the two main components in how faces reflect light – the subsurface diffuses the light and the specular light plays off the surface.”
The high-resolution scans produce geometry that is accurate to one-tenth of a millimeter. “We get the resolution of the face down to the level of skin pores and fine creases,” Debevec says.
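The polarization trick Debevec describes reduces to a per-pixel subtraction: a cross-polarized exposure blocks the surface shine, leaving only subsurface scattering, while a parallel-polarized exposure records both components. A toy numpy sketch, with invented pixel intensities standing in for the two exposures:

```python
import numpy as np

# Hypothetical per-pixel intensities for a tiny 2x2 region of skin.
# The cross-polarized exposure blocks specular reflection (leaving the
# diffuse, subsurface-scattered light); the parallel-polarized exposure
# records diffuse plus specular.
cross = np.array([[0.42, 0.40],
                  [0.38, 0.41]])      # diffuse only
parallel = np.array([[0.55, 0.88],
                     [0.39, 0.60]])   # diffuse + specular

diffuse = cross
# Subtracting isolates the shine per pixel; clip guards against noise
# pushing the difference slightly negative.
specular = np.clip(parallel - cross, 0.0, None)

print(np.round(specular, 2))
```

It is the specular component, playing off the surface rather than diffusing through it, that carries the pore-level detail Debevec mentions.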
Recently, the group began shooting high-definition video from several points of view using Canon EOS-1D X cameras and created a markerless facial motion-capture technique. “What we do is similar to what Mova was doing by using fluorescent makeup to track facial textures,” Debevec says. “In 2006, the resolution of good digital still cameras wasn’t enough to see facial texture. But now, we can see the bumps and divots, the skin pores, freckles, and other imperfections well enough to track from frame to frame and see how the skin is moving.”
At SIGGRAPH, the group demonstrated its work with a real-time performance of “Digital Ira.” Graham Fyffe, an ICT computer scientist, explains: “We used LightStage to capture the aesthetic expressions using digital cameras, then we captured a video of Ira’s performance. We track movement of his face based on the high-definition video. The look comes from the expression scans in LightStage. The LightStage capture gives us the range of expressions the actor can make with his face, and the scans inform the tracking by providing almost a library of what the person’s face can do.”
The tracking and retargeting software is proprietary. “The main thing we’re trying to do is improve the way digital characters move, especially their faces,” Fyffe says. “We want them to look real through a wide range of expressions. This is not a solved problem. It took a great deal of artistic effort to achieve the kind of performance we saw in Benjamin Button. We’re trying to build something that captures an actor’s performance more directly with less effort on the part of the artists. We’re trying to push the quality forward, and we think we have done that by combining multiple scans with video in a way people haven’t done before.”
Faceware 2D auto-tracking from video
With roots that trace back to Image Metrics, which provided proprietary facial-capture and retargeting services, Faceware now offers products for automatic tracking, analysis, and retargeting of 2D motion onto a 3D rig. “We can take any video and automatically track a performance without any user input,” says Peter Busch, vice president of business development. “That sets us apart. We’ve taken the data from 12 years of performance capture and built it into a global tracking model that works in any lighting condition. Our technology looks for key features in a face, like the eyes and mouth. Once it recognizes those, it knows where the rest of the face is and tracks the motion. So, what used to take an hour or two now takes two minutes. No one enjoys tracking. The fact that we’ve automated the process with one button click is massive.”
Once footage is tracked and analyzed using Faceware’s Analyzer software, the company’s Retargeter software moves the facial tracking data created in Analyzer onto a facial rig created in Autodesk’s Maya, 3ds Max, MotionBuilder, or Softimage. Animators can then perfect the performance using standard keyframe techniques. Because tracking is quick and easy, the company suggests that studios push the resulting data into retargeting and animation without fiddling with it first. If animators decide the data doesn’t fit the digital character well enough – that is, hasn’t retargeted sufficiently well – they can quickly try tracking again. “Animators can check a performance on a few poses, hit ‘retarget,’ and then toggle back and forth to see how the retargeting influences the performance,” Busch says. “That gives the animator an intuitive workflow they embrace.”
Pricing for Analyzer with auto-tracking starts at $795, with professional versions that include features designed for higher volume starting at $6,995. Retargeter Lite costs $395, and Retargeter Pro carries a $1,995 price tag.
In addition, Faceware now offers a head-mounted camera rig and software called Faceware Live for real-time facial motion capture. The rig is a helmet with one camera attached to a “boom” that extends in front of the face. “We built a mount for the GoPro camera,” Busch says. “We have six mounting bars that slide in and out of the helmet, three sizes of helmets in different thicknesses, and two lenses for the cameras. We’ve had to drop in on motion-capture stages for two or three days, so we created a kit that is completely portable and flexible.” An entry-level version of the system designed for schools and independent production sells for $400. “Our Faceware Live does the work of Analyzer and Retargeter in real time,” Busch says. “We have a three-second calibration. Faceware Live takes in the video, automatically tracks the performances, and tells the user the quality of the extracted motion. We aimed the product at virtual production and rapid prototyping, at previewing the quality of the animation in real time.”
Data streamed from the performance into MotionBuilder drives a character rig built from a pre-designed expression set. “What you capture on set with Faceware Live indicates what you’ll get in post,” Busch says. To demonstrate a virtual production process, Faceware collaborated with The Third Floor previs studio and KnightVision motion-capture studio, both in Los Angeles, for a demo at the Directors Guild. Directors at the demo could move a virtual camera in a motion-capture volume and frame performances from actors on stage applied to digital characters. “They could see the face and body capture all together and get a feel for the digital characters,” Busch says. “They could frame, play the scene back, and frame again until the camera was where they wanted it. We streamed the shots into Final Cut, so the directors would go to an editorial station, pick the camera angles, and in a matter of five minutes they had a QuickTime file that was the equivalent of an animatic. And that’s when the lightbulb went on. That’s why we developed Faceware Live. Even just seeing the eyelines helps directors make good decisions for camera position.”
Meanwhile, according to Busch, game developers like Rockstar and Sony’s Visual Arts Facility are already using virtual production tools. “They see the value of virtual production,” he says. “They get all the animatics locked early. That’s how they make their massive games.”
Cubic Motion Tracking, analyzing, retargeting from video
“We’re not an alternative capture solution,” says Gareth Edwards, director of Cubic Motion. “Our specialty is machine vision. We’re good at analyzing video data from whatever source and extracting geometric measurements at a fine level of detail. We can track exactly where that information is and help people map the data onto digital characters. We deliver animation curves on a customer’s rig.”
The company’s primary customers are video game developers. Its ideal customer is one who asks what type of capture technique to use. “We have strong views,” Edwards says. “Facial animation is often done back to front. Someone will capture 50 views of a head. It might be useful in certain circumstances, but might not be optimal for what they’re trying to get on-screen. We ask whether they’re trying to reproduce an actor or create a character driven by an actor. Only then do we start saying what type of capture they need. We can look at a rig, see how a character behaves, and then determine the set of measurements the customer needs.”
Edwards gives an extreme example of a character that looks like a crocodile and has a jaw with one degree of freedom. “The character might be photoreal,” he says, “but the set of measurements you need is small. Knowing what you’re trying to produce is important in the video game industry. In the interactive world, they have hours of animation. They aren’t making a movie. And the video game character may be rendered in the game engine.”
Although the company’s animators and engineers often track markerless footage, they also work with data captured using markers. “For most game characters, you wouldn’t need to track more than a few hundred markers,” Edwards says.
“Even though your target mesh may have tens of thousands of vertices, you don’t need to track thousands of points because in most cases on the game side, the character is not the actor.” Founded in 2009, Cubic Motion now has a core engineering team of seven and approximately 20 technical animators. “Sometimes we might take part in a capture session, but we usually try to stay away from that,” Edwards says. “And that’s the way we feel the industry ought to work. We don’t want to get involved in capture, and we think maybe capture studios should not spend time analyzing if it’s not their specialty.”
Dimensional Imaging 3D markerless performance capture
Founded in 2003, Dimensional Imaging was a pioneer in passive stereo photogrammetry; that is, in creating 3D models from stereo pairs of images. Its customers included facial surgery researchers. Recently, the company evolved from static 3D capture to performance capture, and moved into entertainment as well as medical research. “Our breakthrough project in entertainment was the trailer for the Dead Island game from Axis Animation, which went viral,” says Colin Urquhart, CEO.
Now the company’s systems are helping map an actor’s performance onto a character for a feature film. “Instead of still cameras, we’re using 60 fps video and acquiring stereo pairs of images,” Urquhart says. “We can track through a sequence of frames to get a fixed-topology mesh that deforms. That’s what makes our system unique and useful for entertainment. You could think of it as Mova without the makeup and with a smaller number of cameras.” A typical system utilizes three pairs of cameras: two monochrome sets for shape and tracking and one set for texture and color. “The fundamental difference between our system and others that capture 3D data at video rates is that our tracking works at the pixel level rather than tracking features on the face. We use the natural skin texture and the pores.”
The process begins with a template mesh that has the topology the customer wants. “We fit that to one frame in the sequence to get the initial mapping onto the individual actor, and then we track every vertex on the mesh,” Urquhart explains, “every single one of the 2,000 vertices. Other systems track a sparse set of data that drives a rig that deforms a character. We deform the face mesh directly. Our approach is full-3D.” The data eventually lands on a facial animation rig so animators can refine the performance, and Dimensional Imaging helps its customers build appropriate systems. “When you have a 2,000-vertex mesh, you need a detailed, accurate, and well-defined rig,” Urquhart says.
“We have our customers go through 100 or more expressions that we track using the detailed mesh to create a rig.”
Recently, the company has been working with a hardware company on a head-mounted capture system to make it possible for customers to capture the body and face simultaneously. “Our technology is amenable to a two-camera system,” Urquhart says. “The problem has been that the video from existing cameras is highly compressed. The compression method looks for similarities between two frames and deletes them, but it’s exactly that similarity that our tracking software requires. The Kinect system is interesting because it produces a video-rate depth map, which is similar to our system, but the fidelity of machine-vision cameras is an order of magnitude higher. The holy grail is a head-mounted system with high fidelity that can track thousands of virtual [not painted] marker points on a face.”
Currently, Dimensional Imaging sells software systems, licenses software, and does service work. “The take-home message from me is that traditional methods based on sparse sets of points don’t provide enough detail,” Urquhart says, “so people are looking for an alternative, and hopefully that’s where we come in.”
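Dimensional Imaging’s tracker is proprietary, but tracking at the pixel level using natural skin texture is in the spirit of patch matching by normalized cross-correlation: follow the texture surrounding each projected vertex from one frame to the next. A minimal sketch, with random texture standing in for skin and an invented one-pixel-down, two-pixel-right head motion:

```python
import numpy as np

def track_patch(prev, curr, pos, patch=3, search=2):
    """Track one projected vertex from frame `prev` to frame `curr` by
    matching its surrounding texture patch (normalized cross-correlation
    over a small search window)."""
    r, c = pos
    p = prev[r - patch:r + patch + 1, c - patch:c + patch + 1].astype(float)
    p = (p - p.mean()) / (p.std() + 1e-9)
    best, best_pos = -np.inf, pos
    for dr in range(-search, search + 1):
        for dc in range(-search, search + 1):
            q = curr[r + dr - patch:r + dr + patch + 1,
                     c + dc - patch:c + dc + patch + 1].astype(float)
            q = (q - q.mean()) / (q.std() + 1e-9)
            score = (p * q).mean()  # correlation of normalized patches
            if score > best:
                best, best_pos = score, (r + dr, c + dc)
    return best_pos

rng = np.random.default_rng(1)
frame0 = rng.integers(0, 255, size=(32, 32))          # stand-in for skin texture
frame1 = np.roll(frame0, shift=(1, 2), axis=(0, 1))   # face moved 1 down, 2 right

print(track_patch(frame0, frame1, (15, 15)))  # → (16, 17)
```

A production system repeats this for every vertex, constrains neighbors to move coherently, and works in 3D from the stereo pairs, but the per-pixel matching idea is the same.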
Dynamixyz Markerless tracking with 3D sensor
Gaspard Breton, CEO of Dynamixyz, has been working on facial capture and performance since he was a PhD student 15 years ago. He now heads a team of former PhD students and a company that sells a markerless, model-based facial-capture system and provides facial motion-capture services. “Like most of these facial systems, we have a [CG] model with the actor’s expressions,” Breton says. “When the camera sends data from the actor, the system combines expressions on the model and tries to match what the camera sees. That way we produce the same expressions on the model as the ones the actor performs. When that is successful, every pixel of the face is a marker. We also construct the color so you can see a blush, and if the actor has wrinkles, we rebuild the wrinkles. We really reconstruct the image.”
The company’s first systems worked with 2D cameras, but at SIGGRAPH, Dynamixyz released a “4D” system – 3D plus time – using a head-mounted rig. “You can capture 3D from a multi-view camera system, like Vicon’s Cara, or with a 3D sensor, like Kinect, SoftKinetic’s cameras, or ones from Leap Motion,” Breton says. “We use ones from SoftKinetic. The quality is not quite as good as the Kinect’s, but when it is on a helmet, it’s less than one meter from the eyes, and we think the Kinect shouldn’t be used that close. You need to have the camera connected to a helmet because global head movement introduces a lot of noise in tracking; the worst thing for a tracking algorithm is head rotation. Most of the time, if you turn your head more than 10 or 15 degrees from the camera, the tracking fails. So, a head-mounted system is more accurate.”
It is not as accurate, however, as a marker-based system. “Marker-based systems tend to be accurate, but you need a lot of markers to capture facial movement, and some regions, like eyes and inner lips, can barely be captured,” Breton says. “Markerless systems tend to be more successful in capturing whole expressions.” So, Breton imagines that some customers will use Cara’s marker-based system in combination with the Dynamixyz system. “They can send us the video feed from Cara, and we can work on that,” he says. “We feel strongly this is the winning solution. They get the accuracy from the markers and can work the way they are familiar with using a marker-based approach, and also have the extra information we can add by capturing whole expressions and areas like the inner lips with our markerless system.”
In addition to providing facial capture for the entertainment industry, the company’s computer-vision algorithms help doctors and medical researchers. “We have an R&D project with a French hospital to help children with cerebral palsy learn how to re-educate the movement of their faces,” Breton says. “The kids are prisoners of their own bodies. They don’t want to see themselves in the mirror because it’s so [emotionally] painful. But using a visual corrector, they see their own face moving the way it should move.” The company offers an evaluation license, a production license, and annual licenses, and is considering project fees rather than licenses.
The evaluation license provides a functional system, free to use. The company also is developing a network of value-added service providers that work with motion-capture studios. “The message we want to send is that we’re strongly R&D oriented,” Breton says. “If a company selects us and works with us, they have the assurance that they’ll always have the latest cutting-edge technology in computer vision.”
Faceshift Real-time 3D markerless tracking and retargeting
Company Vice President Doug Griffin arrived at Starbucks to demonstrate Faceshift carrying his laptop and a 3D sensor in a sock.
Griffin joined the company, also named Faceshift, after leading motion-capture teams at Industrial Light & Magic, Electronic Arts, and ImageMovers Digital, and serving as vice president of product and strategy at Vicon. He is eager to demonstrate the reason he joined Faceshift. “We want to democratize facial tracking,” Griffin says. “With our system, you can have live results with a five-minute setup and game-ready results with a couple minutes of post-processing. It’s so inexpensive you could outfit an entire team with Faceshift. You could capture faces in a sound booth.”
Griffin aims the sensor at himself, points to a 3D model of a face on-screen, and explains how the system works. “We use the PrimeSense 3D sensor,” he says. “We start with a standard character head that we train. I hold a pose, look back and forth while the sensor scans, and then you can watch while Faceshift deforms the character to match the scan data. We suggest doing 18 training poses. That’s the five-minute process. You only have to do it once. And, you don’t need makeup or markers.”
Brian Amberg, co-founder and CTO, explains: “We decompose the training data into 48 asymmetric expressions. Then, as an actor performs and the system runs live, it searches for the combination of these expressions that best matches the current expression of the actor. It does this frame by frame in real time. For even better results, our post-processing algorithm can optimize across many frames. You can also touch up detections, and that will retrain the algorithm and improve the results.”
The system is, in effect, puppeteering the blendshape model. If the model is highly characterized rather than a digital double, the blendshapes’ 0 to 1 values might, for example, raise the character’s eyebrow a lot when the actor raises an eyebrow a little. A team of PhDs from Swiss universities created the algorithms that do the real-time tracking and retargeting.
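The search Amberg describes – finding the combination of trained expressions that best matches the current scan – is at heart a least-squares solve for blendshape weights. Faceshift’s actual solver is proprietary and constrained; the numpy sketch below uses an unconstrained solve with clipping as a crude stand-in, and all of the data is synthetic:

```python
import numpy as np

def solve_blendshape_weights(neutral, blendshapes, scan):
    """Find per-frame weights so that
       neutral + sum_i w_i * (blendshape_i - neutral) ≈ scan.

    neutral     : (n_verts*3,) rest-pose vertex positions
    blendshapes : (n_shapes, n_verts*3) trained expression targets
    scan        : (n_verts*3,) depth-sensor observation of the face
    """
    # Delta formulation: each shape is an offset from the neutral pose.
    B = (blendshapes - neutral).T                       # (n_verts*3, n_shapes)
    w, *_ = np.linalg.lstsq(B, scan - neutral, rcond=None)
    # A real system solves a constrained, temporally regularized problem;
    # clipping to the [0, 1] blendshape range is a crude substitute.
    return np.clip(w, 0.0, 1.0)

rng = np.random.default_rng(0)
neutral = rng.normal(size=30)                  # toy face: 10 vertices
shapes = neutral + rng.normal(size=(4, 30))    # 4 toy expression targets
true_w = np.array([0.8, 0.0, 0.3, 0.1])
scan = neutral + (shapes - neutral).T @ true_w  # synthetic sensor frame

w = solve_blendshape_weights(neutral, shapes, scan)
print(np.round(w, 2))  # recovers the synthetic weights 0.8, 0.0, 0.3, 0.1
```

Running this per frame is what makes the system a real-time puppeteer of the blendshape model: the recovered weights, not the raw scan, drive the character.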
“We had started developing our own face-tracking system and hardware before the Kinect came out,” Amberg says. “Suddenly there was a consumer scanner with 3D data. So, we changed course to focus on this sensor. The consumer-grade cameras meant we could make face capture available to everyone. That’s what I find exciting. Independent artists can afford to make a film. And, animators can have facial capture on their desks and use it like a mirror. We think many people will use it.”
People can buy and download Faceshift Studio and Faceshift Freelance from the website for $800 to $1,500, with various types of licensing and academic discounts available. Amberg and Griffin note that in addition to making it possible for more artists to use facial capture for animated characters, the low price point and real-time capture are enabling other markets, as well. “Our SDK can do tracking without the setup phase,” Amberg says. “The expressions aren’t quite as accurate, but it’s exciting for consumer applications. You could put your facial expressions on an avatar in a virtual world like Second Life.” “Or,” Griffin says, “imagine you just woke up. Your hair’s a mess. You could Skype with an avatar and look great.”
The facial-capture data could also drive something other than another face. The company has developed plug-ins for MotionBuilder, Maya, Unity, and other software platforms. “We give you 0 to 1 curves,” Griffin says. “You can read those into a 3D application and trigger whatever you want.”
Thus, it could trigger digital music and other sounds. Or images. One artist, for example, has created a digital painting that viewers can interact with through their facial expressions. The traveling exhibition is currently in Moscow. “When the viewers smile, the sun rises,” Amberg says.
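Because those 0-to-1 curves are just per-frame scalars, driving “whatever you want” can be as simple as thresholding them into on/off events – a rising sun when a smile weight crosses a threshold, for instance. A hypothetical sketch; the expression names, threshold, and frame data are all invented:

```python
def curve_triggers(frames, threshold=0.6):
    """Turn per-frame 0-to-1 expression weights into discrete trigger
    events, firing 'on' when a curve crosses the threshold upward and
    'off' when it falls back below."""
    events = []
    active = set()
    for t, weights in enumerate(frames):
        for name, w in weights.items():
            if w >= threshold and name not in active:
                active.add(name)
                events.append((t, name, "on"))   # e.g., start a sound
            elif w < threshold and name in active:
                active.discard(name)
                events.append((t, name, "off"))  # e.g., stop it
    return events

# A short invented stream of retargeting curves, one dict per frame.
stream = [
    {"smile": 0.1, "brow_raise": 0.2},
    {"smile": 0.7, "brow_raise": 0.3},   # smile crosses the threshold
    {"smile": 0.8, "brow_raise": 0.9},   # brow_raise follows
    {"smile": 0.2, "brow_raise": 0.9},   # smile releases
]
print(curve_triggers(stream))
# → [(1, 'smile', 'on'), (2, 'brow_raise', 'on'), (3, 'smile', 'off')]
```

The same weights could just as easily be mapped continuously – scaling a sound’s volume or a sun’s height – rather than thresholded into discrete events.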
Launched in November 2012, the product is already a success, according to Griffin. “We have customers in 21 countries, and a number of big studios have bought licenses,” he says. “I’m surprised by how many people I ring up and they say, ‘We’ve already got it.’” And that’s just the beginning.
Researcher Hao Li at USC, for example, has developed algorithms for capturing faces and hair using a Kinect device. “The biggest thing I’ve done is solve the problem of correspondence between two scans using a method called a graph-based, non-rigid registration algorithm,” he says. “The idea is that if you’re given two meshes, two surfaces, the computer computes the correspondences automatically. If you want to do facial tracking, you can throw 2,000 frames into the solver and it will find the correspondence.” Li plans to show his work at SIGGRAPH Asia.
Once manufacturers embed 3D sensors into mass-market mobile devices, as seems likely to happen, the door to facial capture bursts wide open.