Generative A-Eye #5 - 20th Sept, 2024
A (more or less) daily newsletter featuring brief summaries of the latest papers on AI-based human image synthesis, and on research adjacent to the topic.
Another thin day for AI-based human image synthesis, at least on Arxiv (though we can expect a true annual drought from mid-December to late March – with the exception of major releases from the big players, who favor this period).
The one paper that caught my eye touches on a subject that I intend to explore in a feature soon – the inability of latent diffusion models (LDMs) to produce output that stays consistent across shots, in terms of characters, what they are wearing, and the environments they inhabit.
Companies such as Runway ML tend to promote their text-to-video offerings in rapid-cut visual blitzes designed to overwhelm the senses – and also to divert attention from the lack of narrative consistency that their systems currently offer (a common issue with other API-based T2V systems, such as Kling, Kaiber, and Luma).
Narrative consistency therefore remains an unsolved problem, addressed across such systems by a variety of ‘hacks’ of limited effectiveness – at least for now.
The new work comes from Xiaohongshu Inc. (AKA ‘Red’, and often characterized as ‘China’s answer to Instagram’), and is remarkably similar to NVIDIA’s ConsiStory project, at least in terms of the static output it produces. Below is a sample of ConsiStory’s work:
Source: https://arxiv.org/pdf/2402.03286
And here is some output from the new system described in the Xiaohongshu paper StoryMaker: Towards Holistic Consistent Characters in Text-to-image Generation:
Source: http://export.arxiv.org/pdf/2409.12576
The key difference, besides ConsiStory’s training-free approach, is that StoryMaker’s code appears to be complete and available on GitHub (https://github.com/RedAIGC/StoryMaker).
In any case, there’s a distinct lack of profile views on offer, with the generations limited to social media-style ‘to-camera’ shots. This could be because that is where Xiaohongshu’s potential market lies, or simply (and, I think, more likely) because these interpretations are not terribly flexible.
The authors describe their approach:
‘Tuning-free personalized image generation methods have achieved significant success in maintaining facial consistency, i.e., identities, even with multiple characters. However, the lack of holistic consistency in scenes with multiple characters hampers these methods' ability to create a cohesive narrative. In this paper, we introduce StoryMaker, a personalization solution that preserves not only facial consistency but also clothing, hairstyles, and body consistency, thus facilitating the creation of a story through a series of images.
StoryMaker incorporates conditions based on face identities and cropped character images, which include clothing, hairstyles, and bodies. Specifically, we integrate the facial identity information with the cropped character images using the Positional-aware Perceiver Resampler (PPR) to obtain distinct character features. To prevent intermingling of multiple characters and the background, we separately constrain the cross-attention impact regions of different characters and the background using MSE loss with segmentation masks. Additionally, we train the generation network conditioned on poses to promote decoupling from poses. A LoRA is also employed to enhance fidelity and quality. Experiments underscore the effectiveness of our approach’
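The masking idea in that last paragraph can be made a little more concrete. Below is a minimal PyTorch sketch of an MSE penalty between per-character cross-attention maps and segmentation masks; region_loss, attn_maps and seg_masks are my own illustrative names, and the paper’s exact formulation may well differ.

```python
# Hypothetical sketch (not the authors' code): penalizing cross-attention that
# 'leaks' outside each character's segmentation mask, as the abstract describes.
import torch
import torch.nn.functional as F

def region_loss(attn_maps: torch.Tensor, seg_masks: torch.Tensor) -> torch.Tensor:
    """
    attn_maps: (batch, num_regions, H, W) - aggregated cross-attention for the
               tokens conditioning each character (plus the background).
    seg_masks: (batch, num_regions, H, W) - binary segmentation masks for the
               same regions, resized to the attention resolution.
    Returns a scalar MSE penalty that encourages each region's attention to
    stay inside its own mask.
    """
    # Normalise each attention map to [0, 1] so it is comparable to the mask.
    flat = attn_maps.flatten(2)
    attn_norm = (flat - flat.amin(-1, keepdim=True)) / (
        flat.amax(-1, keepdim=True) - flat.amin(-1, keepdim=True) + 1e-6
    )
    attn_norm = attn_norm.view_as(attn_maps)
    return F.mse_loss(attn_norm, seg_masks)

# Toy usage: two characters plus background at a 64x64 attention resolution.
attn = torch.rand(1, 3, 64, 64)
masks = (torch.rand(1, 3, 64, 64) > 0.5).float()
print(region_loss(attn, masks))
```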
Effectively, StoryMaker is powered by ControlNet, aspects of IPAdapter (which is clear from a look at the repository’s folder structure on GitHub) and LoRA – all three of which are relatively high-effort approaches that are already in use in production, and which require a good deal of manual effort.
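For readers who haven’t combined those three components before, the generic diffusers sketch below shows how they typically fit together. This is not StoryMaker’s own pipeline: the model IDs are the standard public ones, and the LoRA path and reference images are placeholders.

```python
# Generic diffusers sketch (not StoryMaker's code) combining ControlNet,
# IP-Adapter and LoRA weights; paths marked as placeholders are assumptions.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# IP-Adapter supplies the identity/appearance reference.
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                     weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.7)

# A fine-tuned LoRA (placeholder path) to sharpen fidelity, much as StoryMaker does.
pipe.load_lora_weights("path/to/character_lora")

pose = load_image("pose_reference.png")        # OpenPose skeleton image (placeholder)
face = load_image("character_reference.png")   # cropped character image (placeholder)

image = pipe(
    "a woman reading a newspaper in a cafe",
    image=pose,                 # ControlNet pose conditioning
    ip_adapter_image=face,      # appearance conditioning
    num_inference_steps=30,
).images[0]
image.save("out.png")
```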
Training requires eight NVIDIA A100 GPUs, with a batch size of eight images per GPU (an effective batch of 64 images per step), using the AdamW optimizer. If StoryMaker qualifies at all as an end-to-end system, it’s one with exorbitant demands – and, in the end, it only seems effective at producing ‘photo comics’.
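For the curious, here is a minimal sketch of what that reported recipe looks like in practice (eight GPUs, per-GPU batch of eight, AdamW); the stand-in model, learning rate and weight decay are my own assumptions, not values from the paper.

```python
# Minimal sketch of the reported training recipe; launch with
#   torchrun --nproc_per_node=8 train_sketch.py
# The tiny stand-in model, learning rate and weight decay are assumptions.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.optim import AdamW

def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(768, 768).cuda()   # stand-in for the trainable modules
    model = DDP(model, device_ids=[rank])
    optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)  # lr/wd assumed

    per_gpu_batch = 8                          # as reported in the paper
    for _ in range(10):                        # toy loop with random data
        x = torch.randn(per_gpu_batch, 768, device=rank)
        loss = model(x).pow(2).mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```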
(It’s also interesting that the majority of East Asia’s prolific generative AI research in voice synthesis uses East Asian personalities, while practically every image/video-based generative system coming out of the region uses Western personalities such as Anne Hathaway.)
_________________________
My domain expertise is in AI image synthesis, and I’m the former science content head at Metaphysic.ai. I’m an AI developer, a current machine learning practitioner, and an educator. I’m also a native Brit, currently resident in Bucharest, but potentially open to relocation.
If you want to see more extensive examples of my writing on research, as well as some epic features, many of which hit big at Hacker News and garnered significant traffic, check out my portfolio website at https://martinanderson.ai.