Generative A-Eye #11 - 2nd/3rd Oct, 2024
A (more or less) daily newsletter featuring brief summaries of the latest papers on AI-based human image synthesis, and on research adjacent to the topic.
This is another catch-up piece, due to the usual work deadlines being prioritized.
First of all, this article I wrote at unite.ai spent a respectable amount of time in the top spot at Hacker News (though this snapshot catches it on the way up, at #2). Here is the HN comments thread.
Image Editing with Gaussian Splatting
A method of ‘pulling out’ a selection from an image into 3D space so that the user can manipulate it with a physics engine, then insert the modified result back into the image. Unusually, it comes with full GitHub code.
https://www.unite.ai/image-editing-with-gaussian-splatting/
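For anyone curious about the general mechanics, here is a minimal NumPy sketch of the broad idea as I understand it from the article - not the project's actual code: a masked region is back-projected into 3D via a depth estimate, transformed (a stand-in for the physics step), and re-projected over the original image. All the data and camera values here are made up for illustration.

```python
# Minimal sketch (not the project's code): lift a masked image region to 3D
# points via a depth map, apply a rigid transform as a stand-in for the
# physics step, and re-project it over the original image. Data is synthetic.
import numpy as np

H, W, f = 64, 64, 80.0                       # image size and focal length (assumed)
image = np.random.rand(H, W, 3)              # stand-in photograph
depth = np.full((H, W), 2.0)                 # stand-in monocular depth estimate
mask = np.zeros((H, W), dtype=bool)
mask[20:40, 20:40] = True                    # user-selected region to 'pull out'

# Back-project masked pixels to camera-space 3D points (one primitive per pixel)
ys, xs = np.nonzero(mask)
z = depth[ys, xs]
pts = np.stack([(xs - W / 2) * z / f, (ys - H / 2) * z / f, z], axis=1)
cols = image[ys, xs]

# 'Physics' step: here just a small rotation about the Y axis plus a shift
theta = np.deg2rad(15)
R = np.array([[np.cos(theta), 0, np.sin(theta)],
              [0, 1, 0],
              [-np.sin(theta), 0, np.cos(theta)]])
pts = pts @ R.T + np.array([0.1, 0.0, 0.0])

# Re-project and composite the edited region back into the image (z-buffered)
out = image.copy()
zbuf = np.full((H, W), np.inf)
for (X, Y, Z), c in zip(pts, cols):
    u, v = int(round(X * f / Z + W / 2)), int(round(Y * f / Z + H / 2))
    if 0 <= u < W and 0 <= v < H and Z < zbuf[v, u]:
        zbuf[v, u], out[v, u] = Z, c
```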
Now for the research papers that have caught my eye since the last newsletter.
ImageFolder: Autoregressive Image Generation with Folded Tokens
I have not looked extensively into the impact of token length in latent generative models, but this offering seems to have found a novel method of folding tokens into shorter sequences to deliver better results.
'Image tokenizers are crucial for visual generative models, e.g., diffusion models (DMs) and autoregressive (AR) models, as they construct the latent representation for modeling. Increasing token length is a common approach to improve the image reconstruction quality. However, tokenizers with longer token lengths are not guaranteed to achieve better generation quality. There exists a trade-off between reconstruction and generation quality regarding token length. In this paper, we investigate the impact of token length on both image reconstruction and generation and provide a flexible solution to the tradeoff. We propose ImageFolder, a semantic tokenizer that provides spatially aligned image tokens that can be folded during autoregressive modeling to improve both generation efficiency and quality.'
https://github.com/lxa9867/ImageFolder <- EMPTY ATOW
https://lxa9867.github.io/works/imagefolder/index.html
https://arxiv.org/abs/2410.01756
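As a rough illustration of the ‘folding’ idea - and emphatically not the authors' implementation - the hedged sketch below merges two spatially aligned token maps into a single sequence position each, so the autoregressive model works on half as many positions. The codebook and embedding sizes are assumptions.

```python
# Hedged sketch of the general 'folding' idea (not the authors' code): two
# spatially aligned token maps share one sequence slot each, halving the
# length the autoregressive model has to predict. Sizes are assumptions.
import torch
import torch.nn as nn

codebook_size, embed_dim, side = 1024, 256, 16
semantic_tokens = torch.randint(0, codebook_size, (1, side * side))
detail_tokens = torch.randint(0, codebook_size, (1, side * side))

embed_semantic = nn.Embedding(codebook_size, embed_dim)
embed_detail = nn.Embedding(codebook_size, embed_dim)

# 'Fold': the two aligned tokens at each spatial position occupy one position
folded = embed_semantic(semantic_tokens) + embed_detail(detail_tokens)
print(folded.shape)   # torch.Size([1, 256, 256]) -- 256 positions, not 512
```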
ComfyGen: Prompt-Adaptive Workflows for Text-to-Image Generation
The node-based interface of the ComfyUI latent diffusion framework has a nasty learning curve, so it’s unsurprising that LLMs are being proposed as a way to build or select workflows automatically.
‘While workflow-based approaches can lead to improved image quality, crafting effective workflows requires significant expertise, owing to the large number of available components, their complex inter-dependence, and their dependence on the generation prompt. Here, we introduce the novel task of prompt-adaptive workflow generation, where the goal is to automatically tailor a workflow to each user prompt. We propose two LLM-based approaches to tackle this task: a tuning-based method that learns from user-preference data, and a training-free method that uses the LLM to select existing flows.’
https://arxiv.org/abs/2410.01731
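The training-free variant, as described in the abstract, amounts to asking an LLM to route each prompt to an existing workflow. A speculative sketch follows; `call_llm` is a hypothetical stand-in for whatever chat-completion endpoint you use, and the candidate workflows are invented for illustration.

```python
# Speculative sketch of prompt-to-workflow routing; `call_llm` is a
# hypothetical placeholder, and the candidate workflows are invented.
import json

def call_llm(system: str, user: str) -> str:
    """Stand-in for whatever LLM endpoint you use; returns a canned answer here."""
    return "photoreal_portrait"

def select_workflow(prompt: str, workflows: dict) -> str:
    """workflows maps workflow_id -> short natural-language description."""
    system = ("You route text-to-image prompts to ComfyUI workflows. "
              "Reply with the single best workflow id, nothing else.")
    user = json.dumps({"prompt": prompt, "candidates": workflows})
    choice = call_llm(system, user).strip()
    return choice if choice in workflows else next(iter(workflows))

flows = {
    "photoreal_portrait": "SDXL base, face detailer, film-grain post pass",
    "anime_scene": "anime checkpoint, LoRA stack, latent upscaler",
}
print(select_workflow("a candid photo of a street musician at dusk", flows))
```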
Data Extrapolation for Text-to-image Generation on Small Datasets
If you have ever trained vision models, you’ll know the standard set of data augmentation options, such as flipping, rotating, etc. These are applied so that the model does not overfit on the source images. But this interesting project proposes a RAG-style system that finds similar images on the internet and uses them for data augmentation. Of course, this is dodgy in terms of copyright compliance, an issue that has come to the fore in 2024.
‘Text-to-image generation requires large amount of training data to synthesizing high-quality images. For augmenting training data, previous methods rely on data interpolations like cropping, flipping, and mixing up, which fail to introduce new information and yield only marginal improvements. In this paper, we propose a new data augmentation method for text-to-image generation using linear extrapolation. Specifically, we apply linear extrapolation only on text feature, and new image data are retrieved from the internet by search engines.’
https://arxiv.org/abs/2410.01638
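My reading of ‘linear extrapolation on text features’ is the standard trick of pushing an anchor embedding away from a neighbour to create a new retrieval query; the sketch below illustrates that operation on random stand-in embeddings, and is not the authors' code.

```python
# Sketch of linear extrapolation on text features (my reading of the abstract,
# not the authors' code): push an anchor embedding away from a neighbour to
# synthesise a new query feature, which would then drive image retrieval.
import numpy as np

rng = np.random.default_rng(0)
text_feats = rng.normal(size=(100, 512))      # stand-in caption embeddings

def extrapolate(anchor, neighbour, lam=0.5):
    """Move past `anchor`, away from `neighbour`, by a factor lam."""
    return anchor + lam * (anchor - neighbour)

query = extrapolate(text_feats[3], text_feats[57])
# `query` would be handed to a search engine / retrieval index for new images.
```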
Fake It Until You Break It: On the Adversarial Robustness of AI-generated Image Detectors
Sticking to ancient baselines, datasets and metrics remains the curse of the deepfake/AI-detection sector, since the ground shifts so quickly. Yet it is difficult to find a robust low-level feature that can survive the rapid evolution of text-to-image and other innovations. This paper underlines the importance of real-world evaluation in this sub-strand of computer vision research.
‘While generative AI (GenAI) offers countless possibilities for creative and productive tasks, artificially generated media can be misused for fraud, manipulation, scams, misinformation campaigns, and more. To mitigate the risks associated with maliciously generated media, forensic classifiers are employed to identify AI-generated content.
‘However, current forensic classifiers are often not evaluated in practically relevant scenarios, such as the presence of an attacker or when real-world artifacts like social media degradations affect images. In this paper, we evaluate state-of-the-art AI-generated image (AIGI) detectors under different attack scenarios.
‘We demonstrate that forensic classifiers can be effectively attacked in realistic settings, even when the attacker does not have access to the target model and post-processing occurs after the adversarial examples are created, which is standard on social media platforms.’
https://arxiv.org/abs/2410.01574
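The threat model is easy to reproduce in miniature: craft an adversarial example against a detector, then apply social-media-style degradation afterwards. The sketch below uses a toy CNN as the ‘forensic classifier’ and a single targeted FGSM step; it is an illustration of the scenario, not the paper's attack.

```python
# Hedged illustration of the threat model, not the paper's attack: craft a
# targeted FGSM example against a toy detector, then JPEG-compress it
# afterwards, as a social platform would.
import io
import torch
import torch.nn as nn
from PIL import Image

detector = nn.Sequential(                 # toy stand-in for a forensic classifier
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2))

image = torch.rand(1, 3, 224, 224, requires_grad=True)   # 'AI-generated' input
real_label = torch.tensor([0])            # class 0 = 'real' (the attacker's target)

# One signed-gradient step towards the 'real' class
loss = nn.functional.cross_entropy(detector(image), real_label)
loss.backward()
adv = (image - 8 / 255 * image.grad.sign()).clamp(0, 1).detach()

# Post-processing happens *after* the attack, as it would on social media
buf = io.BytesIO()
Image.fromarray((adv[0].permute(1, 2, 0).numpy() * 255).astype("uint8")).save(
    buf, format="JPEG", quality=75)
degraded = Image.open(buf)                # what the detector would actually see
```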
EVA-Gaussian: 3D Gaussian-based Real-time Human Novel View Synthesis under Diverse Camera Settings
Tolerable results for this GSplat human synthesis outing, though the distinction between results from prior methods and this approach could be clearer.
‘The feed-forward based 3D Gaussian Splatting method has demonstrated exceptional capability in real-time human novel view synthesis. However, existing approaches are restricted to dense viewpoint settings, which limits their flexibility in free-viewpoint rendering across a wide range of camera view angle discrepancies. To address this limitation, we propose a real-time pipeline named EVA-Gaussian for 3D human novel view synthesis across diverse camera settings.’
https://arxiv.org/abs/2410.01425
https://zhenliuzju.github.io/huyingdong/EVA-Gaussian
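For context, feed-forward Gaussian-splat human synthesis generally means a network that maps source-view pixels directly to 3D Gaussian attributes, which are then splatted from the requested camera. The sketch below is a generic gloss on that family of methods, not EVA-Gaussian itself; every layer size is invented, and the splatting renderer is omitted.

```python
# Generic feed-forward 3DGS gloss (not EVA-Gaussian itself): a small CNN maps
# each source-view pixel to the attributes of one 3D Gaussian; a splatting
# renderer (omitted here) would rasterise them from any requested camera.
import torch
import torch.nn as nn

class PixelToGaussians(nn.Module):
    def __init__(self):
        super().__init__()
        # per-pixel outputs: depth(1) + opacity(1) + scale(3) + rotation(4) + rgb(3)
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 12, 1))

    def forward(self, views):                        # views: (B, 3, H, W)
        out = self.net(views)
        depth, opacity, scale, rot, rgb = torch.split(out, [1, 1, 3, 4, 3], dim=1)
        return {
            "depth": depth.exp(),                    # strictly positive depth
            "opacity": opacity.sigmoid(),
            "scale": scale.exp(),
            "rotation": nn.functional.normalize(rot, dim=1),   # unit quaternions
            "rgb": rgb.sigmoid(),
        }

gaussians = PixelToGaussians()(torch.rand(1, 3, 256, 256))
```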
High-quality Animatable Eyelid Shapes from Lightweight Captures
So, in human image synthesis, this is an absolute first, at least as far as I have seen - a system that concentrates on the ways eyelids deform during gaze movement. This is the kind of attention to detail that we need!
‘High-quality eyelid reconstruction and animation are challenging for the subtle details and complicated deformations. Previous works usually suffer from the trade-off between the capture costs and the quality of details. In this paper, we propose a novel method that can achieve detailed eyelid reconstruction and animation by only using an RGB video captured by a mobile phone. Our method utilizes both static and dynamic information of eyeballs (e.g., positions and rotations) to assist the eyelid reconstruction, cooperating with an automatic eyeball calibration method to get the required eyeball parameters.’
https://arxiv.org/abs/2410.01360
Towards Native Generative Model for 3D Head Avatar
An interesting paper that explores the apparently unbreakable relationship between parametric head models (such as 3DMM) and neural head rendering.
‘[We] delve into how to learn a native generative model for 360° full head from a limited 3D head dataset. Specifically, three major problems are studied: 1) how to effectively utilize various representations for generating the 360°-renderable human head; 2) how to disentangle the appearance, shape, and motion of human faces to generate a 3D head model that can be edited by appearance and driven by motion; 3) and how to extend the generalization capability of the generative model to support downstream tasks.’
https://arxiv.org/abs/2410.01226
Running out of space now. Other papers of interest:
LaDTalk: Latent Denoising for Synthesizing Talking Head Videos with High Frequency Details
Removing Distributional Discrepancies in Captions Improves Image-Text Alignment
Cafca: High-quality Novel View Synthesis of Expressive Faces from Casual Few-shot Captures
Towards Better Control Of Latent Spaces For Face Editing
My domain expertise is in AI image synthesis, and I’m the former science content head at Metaphysic.ai. I’m an occasional machine learning practitioner and an educator. I’m also a native Brit, currently resident in Bucharest.
If you want to see more extensive examples of my writing on research, as well as some epic features (many of which hit big at Hacker News and garnered significant traffic), check out my portfolio website at https://martinanderson.ai.