Since the release of OpenAI's text-to-image model DALL-E, the AI world has been moving in the direction of related models, for instance, Midjourney and Imagen, to name a few. Soon came text-to-video models like Transframer, NUWA-Infinity, CogVideo, and others. Even text-to-speech models like VALL-E were recently unveiled by Microsoft.
Last month, researchers from Show Lab, National University of Singapore, came up with a text-to-video (TTV) generator called Tune-A-Video to tackle the problem of One-Shot Video Generation, where only a single text-video pair is provided for training an open-domain text-to-video generator. With a customised Sparse-Causal Attention mechanism, Tune-A-Video extends spatial self-attention to the spatiotemporal domain using pretrained text-to-image (TTI) diffusion models.
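The core idea of Sparse-Causal Attention is that each frame's queries attend only to the keys and values of the first frame and the immediately preceding frame, rather than to every frame. Below is a minimal sketch of that idea, assuming PyTorch and caller-supplied linear projection layers; the tensor shapes and names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the sparse-causal attention idea, assuming PyTorch.
# x holds per-frame tokens; to_q/to_k/to_v are nn.Linear projection layers.
import torch
import torch.nn.functional as F


def sparse_causal_attention(x, to_q, to_k, to_v):
    """x: (num_frames, num_tokens, dim) latent tokens for a short clip."""
    out = []
    for t in range(x.shape[0]):
        q = to_q(x[t])  # queries come from the current frame
        # Keys/values come only from the first frame and the previous frame,
        # extending spatial self-attention across time without attending to
        # every frame (the "sparse causal" part).
        kv_source = torch.cat([x[0], x[max(t - 1, 0)]], dim=0)
        k, v = to_k(kv_source), to_v(kv_source)
        attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        out.append(attn @ v)
    return torch.stack(out)  # (num_frames, num_tokens, dim)
```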
Check out the unofficial implementation of Tune-A-Video here.
With a single training sample, the projection matrices in the attention blocks are updated to incorporate the relevant motion information. Tune-A-Video can create temporally coherent videos for numerous applications, including changing the subject or background, modifying attributes, and transferring styles.
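What "updating only the projection matrices" can look like in practice is sketched below: everything in a pretrained, diffusers-style UNet is frozen except the attention projection layers. The module names (`to_q`, `to_k`, `to_v`) and the exact set of tuned projections are assumptions for illustration, not the paper's precise recipe.

```python
# Hedged sketch: freeze the pretrained UNet and keep only the attention
# projection matrices trainable. The "to_q"/"to_k"/"to_v" naming follows
# diffusers conventions and is an assumption, not the authors' code.
def select_trainable_params(unet):
    trainable = []
    for name, param in unet.named_parameters():
        if any(key in name for key in ("to_q", "to_k", "to_v")):
            param.requires_grad = True
            trainable.append(param)
        else:
            param.requires_grad = False
    return trainable
```

The returned parameters can then be handed to an optimiser for the one-shot tuning step described below.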
It was found that TTI models can produce images that match verb phrases well, and that extending TTI models to generate multiple images simultaneously demonstrates surprisingly strong content consistency.
Fine-Tuning: TTI models are extended to TTV models using TTI model weights that have already been pretrained. The text-video pair is then subjected to one-shot tuning in order to create a one-shot TTV model.
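A rough sketch of what such a one-shot tuning loop might look like with the standard latent-diffusion noise-prediction objective is shown below. The `unet`, `vae`, `text_encoder`, and `scheduler` objects follow diffusers-style interfaces; the step count, learning rate, and frame handling are illustrative assumptions rather than the authors' settings.

```python
# Hedged sketch of one-shot tuning on a single text-video pair with the
# standard noise-prediction loss. Interfaces follow diffusers conventions;
# hyperparameters and the frame-in-batch layout are assumptions.
import torch
import torch.nn.functional as F


def one_shot_tune(unet, vae, text_encoder, scheduler, frames, prompt_ids,
                  steps=500, lr=3e-5):
    optimizer = torch.optim.AdamW(
        [p for p in unet.parameters() if p.requires_grad], lr=lr)
    # Encode the training prompt once and repeat it for every video frame.
    text_emb = text_encoder(prompt_ids)[0].repeat(frames.shape[0], 1, 1)
    # Encode the video frames into the latent space of the pretrained VAE.
    latents = vae.encode(frames).latent_dist.sample() * 0.18215
    for _ in range(steps):
        noise = torch.randn_like(latents)
        t = torch.randint(0, scheduler.config.num_train_timesteps, (1,),
                          device=latents.device)
        noisy = scheduler.add_noise(latents, noise, t)
        pred = unet(noisy, t, encoder_hidden_states=text_emb).sample
        loss = F.mse_loss(pred, noise)  # predict the noise that was added
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```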
Inference: A modified text prompt is used to generate new videos.
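In other words, once tuned on a single clip described by one prompt, the model is sampled with an edited version of that prompt. The snippet below is purely illustrative: the `tuned_pipeline` wrapper, the prompts, and the arguments are assumptions, not the API of the official or unofficial code.

```python
# Illustrative usage only; the pipeline wrapper and its arguments are assumed.
training_prompt = "a man is skiing"                  # prompt paired with the training video
edited_prompt = "Spider-Man is skiing on the beach"  # subject and background changed
video = tuned_pipeline(edited_prompt, num_frames=8, num_inference_steps=50)
```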
After receiving a video-text pair as input, the method updates the projection matrices in the attention blocks.
Read the full paper here.