Reconstructions generated by OFER. Our method can reconstruct faces from a single image under hard occlusions (a), providing multiple solutions with diverse expressions that align with the input image (d). We use two diffusion models that denoise shape and expression parameters of FLAME conditioned on the image. A novel ranking mechanism selects an optimal identity (c) from the generated set of shapes, on top of which the expression variants are applied to obtain the final results.
Abstract
Reconstructing 3D face models from a single image is an inherently ill-posed problem, which becomes even more challenging in the presence of occlusions. In addition to fewer available observations, occlusions introduce an extra source of ambiguity, where multiple reconstructions can be equally valid. Despite the ubiquity of the problem, very few methods address its multi-hypothesis nature. In this paper, we introduce OFER, a novel approach for single-image 3D face reconstruction that can generate plausible, diverse, and expressive 3D faces, even under strong occlusions. Specifically, we train two diffusion models to generate the shape and expression coefficients of a face parametric model, conditioned on the input image. This approach captures the multi-modal nature of the problem, generating a distribution of solutions as output. However, to maintain consistency across diverse expressions, the challenge is to select the best-matching shape. To achieve this, we propose a novel ranking mechanism that sorts the outputs of the shape diffusion network based on predicted shape accuracy scores. We evaluate our method using standard benchmarks and introduce CO-545, a new protocol and dataset designed to assess the accuracy of expressive faces under occlusion. Our results show improved performance over occlusion-based methods, while also enabling the generation of diverse expressions for a given image.
Overview
Our method takes as input a single-view image of an occluded face and generates a set of 3D faces as output. The goal is to produce reconstructions that explore a diverse range of expressions in the occluded areas, while accurately capturing the visible regions of the input image.
OFER is structured around three key components. First, the Identity Generative Network (IdGen), a conditional DDPM diffusion model, outputs a set of FLAME shape coefficients, recovering a distribution of plausible neutral 3D faces. Next, the Identity Ranking Network (IdRank), a small MLP, evaluates and ranks the shape samples produced by the previous network; this step selects the single shape coefficient that best explains the image. Finally, the Expression Generative Network (ExpGen), another conditional DDPM, generates a diverse set of FLAME expression coefficients. For both IdGen and ExpGen we use the standard DDPM framework with a DDPM sampler and 1000 sampling time steps. All three networks are conditioned on the same input image. Combining the selected shape coefficient with the set of expression hypotheses yields the final set of 3D reconstructions.
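The coefficient-sampling step above can be sketched as plain DDPM ancestral sampling over low-dimensional parameter vectors. The snippet below is a minimal illustrative sketch, not the authors' implementation: `eps_model` is a placeholder for the trained conditional denoiser (IdGen or ExpGen), the linear beta schedule and the 300-dimensional coefficient size are assumptions, and the conditioning vector stands in for the image embedding.

```python
# Illustrative DDPM ancestral sampling of FLAME coefficient vectors.
# Assumptions (not from the paper): linear beta schedule, 300-dim shape
# coefficients, and a toy linear denoiser in place of the trained network.
import numpy as np

T = 1000                                # sampling steps, as in the paper
betas = np.linspace(1e-4, 0.02, T)      # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_model(x_t, t, cond):
    """Placeholder for the trained conditional denoiser (IdGen / ExpGen).
    A real model is a neural network conditioned on image features."""
    return 0.1 * x_t + 0.01 * cond

def ddpm_sample(dim, cond, rng):
    """Reverse (ancestral) DDPM sampling of one coefficient vector."""
    x = rng.standard_normal(dim)        # start from pure Gaussian noise
    for t in reversed(range(T)):
        eps = eps_model(x, t, cond)
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / np.sqrt(alphas[t])
        if t > 0:                       # no noise is added at the final step
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(dim)
        else:
            x = mean
    return x

rng = np.random.default_rng(0)
cond = rng.standard_normal(300)         # stand-in image conditioning vector
# Draw several hypotheses for the same image by re-running the sampler.
samples = [ddpm_sample(300, cond, rng) for _ in range(4)]
```

Because each run starts from fresh Gaussian noise while sharing the same conditioning, repeated sampling yields a distribution of plausible coefficient vectors rather than a single point estimate.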
Results
1. IdGen -> IdRank
2. IdRank + ExpGen
3. IdRank
The median MSE between the ground-truth scan and the reconstructed 3D shape is displayed below each rendering. These results demonstrate that ranking-based selection improves the quality of the chosen samples.
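The selection step evaluated above can be sketched as follows. This is a hypothetical illustration, assuming a scorer that maps each shape hypothesis to a scalar accuracy score; the linear `predict_scores` stands in for the small ranking MLP (IdRank), and all names are illustrative.

```python
# Illustrative ranking-based selection: score each shape hypothesis and
# keep the top-ranked one. The linear scorer is a placeholder for the MLP.
import numpy as np

def predict_scores(shape_samples, w):
    """Placeholder for the ranking network: maps each coefficient vector
    to a scalar score (higher = better predicted fit to the image)."""
    return shape_samples @ w

def select_best(shape_samples, w):
    """Sort hypotheses by descending predicted score; return the winner."""
    scores = predict_scores(shape_samples, w)
    order = np.argsort(-scores)
    return shape_samples[order[0]], order

rng = np.random.default_rng(1)
hypotheses = rng.standard_normal((8, 300))  # 8 candidate shape vectors
w = rng.standard_normal(300)                # stand-in scorer weights
best, ranking = select_best(hypotheses, w)
```

The selected shape is then combined with each sampled expression vector, so a single ranked identity underlies all expression variants.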
Paper: OFER_arXiv (CVPR 2025)
Please cite for reference:
@misc{pratheba2024OFER,
  author = {Pratheba Selvaraju and Victoria Fernandez Abrevaya and Timo Bolkart and Rick Akkerman and Tianyu Ding and Faezeh Amjadi and Ilya Zharkov},
  title  = {OFER: Occluded Face Expression Reconstruction},
  year   = {2024},
  eprint = {arXiv:2410.21629},
}
Contact: ofer@tue.mpg.de