Generative A-Eye #2 - 17th Sept, 2024
A (more or less) daily newsletter featuring brief summaries of the latest papers on AI-based human image synthesis, and on adjacent research.
The most popular day of the week for Arxiv submissions varies throughout the year, but it’s usually Tuesday. Since submissions are scheduled in advance, I can only assume the sector believes that the requisite journalists, VCs and influencers tend to take long weekends (!).
Despite over 180 new submissions, today has been pretty thin for AI-based human synthesis papers of interest. In fact, the whole year has been thin, as existing technologies get stirred around, with a minimum of genuinely new ‘dishes’.
There are some fundamental challenges in generating convincing AI humans, and most of the solutions I’ve seen this year pile on secondary systems, or build complex workflows around 3DMM-style interstitial models such as FLAME. With the possible exception of Gaussian Splatting, the core technologies (LDMs, NeRF, GANs, SDFs, etc.) are hard to access and instrumentalize.
We’re still waiting for the architecture that takes T2V beyond Instagram virality and into narrative and content continuity (among many other requirements).
Neuromorphic Facial Analysis with Cross-Modal Supervision
Facial Action Units, part of the Facial Action Coding System (FACS), are the ABCs of new research into facial expression recognition (FER), whereas the sector really needs a dictionary. Nonetheless, this interesting Italian outing makes good use of these standards in an effort to detect micro-expressions - exactly what is necessary to surpass FACS.
'FACEMORPHIC [is] a multimodal temporally synchronized face dataset comprising both RGB videos and event streams. The data is labeled at a video level with facial Action Units and also contains streams collected with a variety of applications in mind, ranging from 3D shape estimation to lip-reading. We then show how temporal synchronization can allow effective neuromorphic face analysis without the need to manually annotate videos: we instead leverage cross-modal supervision bridging the domain gap by representing face shapes in a 3D space.'
http://export.arxiv.org/abs/2409.10213
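For readers new to FACS, the ‘ABCs’ in question are numbered Action Units, each tied to a specific facial muscle movement, with expressions read off as combinations of them. The handful of AUs below, and the AU6 + AU12 ‘Duchenne smile’ combination, are standard FACS definitions; the code itself is purely a toy illustration:

# A few standard FACS Action Units and one classic combination, purely to
# illustrate why AUs are the 'letters' of expression recognition rather than
# a full 'dictionary' of affect.

ACTION_UNITS = {
    1:  "Inner brow raiser",
    2:  "Outer brow raiser",
    4:  "Brow lowerer",
    6:  "Cheek raiser",
    12: "Lip corner puller",
    15: "Lip corner depressor",
    25: "Lips part",
    26: "Jaw drop",
}

def describe(active_aus):
    """List the muscle movements implied by a set of detected AUs."""
    return [ACTION_UNITS.get(au, f"AU{au} (not in this toy table)") for au in sorted(active_aus)]

# The prototypical 'Duchenne' smile is conventionally coded as AU6 + AU12.
print(describe({6, 12}))   # ['Cheek raiser', 'Lip corner puller']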
PSHuman: Photorealistic Single-view Human Reconstruction using Cross-Scale Diffusion
Free up your GPUs for this initiative’s video-heavy project page! Check out the use of human priors to generate plausible rear views from single monocular input (first illustration on the first page). As is common (particularly this year), the CGI-based SMPL-X model is used as an interstitial template. While the results appear static, the project seems to possess a NeRF-like ability to see around corners, with considerably less input than NeRF requires. However, the paper concedes that the system is subject to pose estimation errors, which, in my experience, are compounded by non-tight clothing; this is a separate and vigorous area of study in the literature.
'[Full]-body reconstruction from a monocular RGB image remains challenging due to the ill-posed nature of the problem and sophisticated clothing topology with self-occlusions. In this paper, we propose PSHuman, a novel framework that explicitly reconstructs human meshes utilizing priors from the multiview diffusion model. It is found that directly applying multiview diffusion on single-view human images leads to severe geometric distortions, especially on generated faces. To address it, we propose a cross-scale diffusion that models the joint probability distribution of global full-body shape and local facial characteristics, enabling detailed and identity-preserved novel-view generation without any geometric distortion.'
http://export.arxiv.org/abs/2409.10141
https://penghtyx.github.io/PSHuman/
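For context on what the SMPL-X ‘interstitial template’ step looks like in practice, here is a minimal sketch using the public smplx Python package. This is not PSHuman’s own pipeline, and the model directory is a placeholder for files that must be downloaded separately from the SMPL-X project site:

# Hypothetical illustration of SMPL-X as an intermediate body template, using
# the public 'smplx' package (pip install smplx). SMPLX_MODELS_DIR is a placeholder.

import torch
import smplx

SMPLX_MODELS_DIR = "path/to/smplx/models"

body_model = smplx.create(
    SMPLX_MODELS_DIR,
    model_type="smplx",
    gender="neutral",
    num_betas=10,                          # shape coefficients
)

with torch.no_grad():
    output = body_model(
        betas=torch.zeros(1, 10),          # neutral body shape
        body_pose=torch.zeros(1, 63),      # 21 body joints x 3 axis-angle values
        return_verts=True,
    )

verts = output.vertices                    # (1, 10475, 3) template mesh that a
joints = output.joints                     # downstream generative stage can condition on
print(verts.shape, joints.shape)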
Towards Kinetic Manipulation of the Latent Space
Latent pushing and exploration are among the most fascinating and high-potential strands of research in image synthesis. This new initiative offers a system that can follow the path of changes from a live feed (or other input content) directly through a StyleGAN3 latent space. Mapping these relationships and connections in such an explicit way, rather than via the cruder heat-maps of COLMAP and similar earlier projects, may be one of the keys to improved instrumentality.
To boot, there’s code!
[Clicking on the video below will take you to the web version of this newsletter, where you can play the video. It’s not possible to embed videos directly into this newsletter]
'The latent space of many generative models [is] rich in unexplored valleys and mountains. The majority of tools used for exploring them are so far limited to Graphical User Interfaces (GUIs). While specialized hardware can be used for this task, we show that a simple feature extraction of pre-trained Convolutional Neural Networks (CNNs) from a live RGB camera feed does a very good job at manipulating the latent space with simple changes in the scene, with vast room for improvement. We name this new paradigm Visual-reactive Interpolation.'
https://github.com/PDillis/stylegan3-fun
http://export.arxiv.org/abs/2409.09867
https://nvlabs-fi-cdn.nvidia.com/_web/stylegan3/videos/video_8_internal_activations.mp4
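For a sense of how such a pipeline might be wired together, here is a minimal, hypothetical sketch of the general ‘visual-reactive’ idea. It does not use the authors’ code (linked above), and the feature-to-latent mapping, smoothing factor and generator call are all my own illustrative assumptions:

# A rough, hypothetical sketch of visual-reactive latent manipulation: features
# from a pretrained CNN, extracted from a live camera feed, nudge a latent vector
# that a StyleGAN3-style generator could consume. Not the paper's implementation.

import cv2                                   # pip install opencv-python
import torch
import torchvision.models as models
import torchvision.transforms as T

device = "cuda" if torch.cuda.is_available() else "cpu"

# Pretrained CNN used purely as a feature extractor (ImageNet weights).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()            # expose the 512-d pooled features
backbone.eval().to(device)

preprocess = T.Compose([
    T.ToTensor(),
    T.Resize((224, 224), antialias=True),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

Z_DIM = 512                                  # StyleGAN3's default z-dimension
latent = torch.zeros(1, Z_DIM, device=device)

cap = cv2.VideoCapture(0)                    # live RGB camera feed; Ctrl+C to stop
with torch.no_grad():
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        feats = backbone(preprocess(rgb).unsqueeze(0).to(device))

        # Scale the features to a plausible latent norm and blend them into the
        # current latent, so that scene changes nudge it rather than replace it.
        target = feats / (feats.norm() + 1e-8) * Z_DIM ** 0.5
        latent = 0.9 * latent + 0.1 * target

        # img = generator(latent)            # e.g. a loaded StyleGAN3 network
        # ...render or display img here...
cap.release()

The point of the actual project, as the abstract notes, is that no GUI sits between the camera and the walk through the latent space.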
There are a couple of other interesting contenders in today’s submissions, but these are the pick of the crop. More tomorrow, unless it’s a really ‘dry’ day.
_________________________
Under the fold
If you want to see more extensive examples of my writing on research, as well as some epic features, many of which hit big at Hacker News and garnered significant traffic, check out my portfolio website at https://martinanderson.ai.