Late last week, OpenAI announced a new generative AI system named Sora, which produces short videos from text prompts. While Sora is not yet available to the public, the high quality of the sample outputs published so far has provoked both excitement and concern.
The sample videos published by OpenAI, which the company says were created directly by Sora without modification, show outputs from prompts like "photorealistic closeup video of two pirate ships battling each other as they sail inside a cup of coffee" and "historical footage of California during the gold rush."
At first glance, it is often hard to tell that the videos are generated by AI, thanks to their high visual quality, detailed textures, realistic scene dynamics and camera movements, and a good level of consistency.
OpenAI chief executive Sam Altman also posted some videos to X (formerly Twitter) generated in response to user-suggested prompts, to demonstrate Sora's capabilities.
How does Sora work?
Sora combines features of text- and image-generating tools in what is called a "diffusion transformer model".
Transformers are a type of neural network first introduced by Google in 2017. They are best known for their use in large language models such as ChatGPT and Google Gemini.
Diffusion models, on the other hand, are the foundation of many AI image generators. They work by starting with random noise and iterating towards a "clean" image that fits an input prompt.
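The idea of iterating from noise towards a clean image can be illustrated with a deliberately simplified sketch. This is not how Sora or any production image generator is implemented: the function name `toy_denoise`, the step size and the noise schedule are all assumptions made for illustration, and a fixed `target` array stands in for the prediction that a trained, prompt-conditioned network would supply.

```python
import numpy as np

def toy_denoise(target, steps=50, seed=0):
    """Cartoon of iterative denoising: start from noise, step towards a 'clean' image.

    `target` is a stand-in for what a trained denoising network would predict
    for a given prompt; no real network is involved in this sketch.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(target.shape)        # start from pure random noise
    errors = [float(np.abs(x - target).mean())]  # track distance to the clean image
    for t in range(steps):
        noise_scale = 1.0 - (t + 1) / steps      # injected noise shrinks to zero
        predicted_clean = target                 # assumption: a perfect "denoiser"
        x = x + 0.2 * (predicted_clean - x)      # move a little towards the clean image
        x = x + 0.05 * noise_scale * rng.standard_normal(target.shape)
        errors.append(float(np.abs(x - target).mean()))
    return x, errors

# A tiny 8 x 8 "image": a bright square on a dark background.
target = np.zeros((8, 8))
target[2:6, 2:6] = 1.0

image, errors = toy_denoise(target)
print(f"error at start: {errors[0]:.3f}, error at end: {errors[-1]:.3f}")
```

In a real diffusion model the hard part is the denoiser itself, which must be learned from large amounts of data; the loop around it is comparatively simple.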
A video can be made from a sequence of such images. However, in a video, coherence and consistency between frames are essential.
Sora uses the transformer architecture to handle how frames relate to one another. While transformers were initially designed to find patterns in tokens representing text, Sora instead uses tokens representing small patches of space and time.
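To make "patches of space and time" more concrete, here is a minimal sketch of cutting a video array into such patches and flattening each one into a token-like vector. The function name `video_to_patches`, the patch sizes and the toy video dimensions are assumptions chosen for illustration; the details of Sora's actual patching scheme are not described here.

```python
import numpy as np

def video_to_patches(video, pt=2, ph=4, pw=4):
    """Split a video of shape (frames, height, width, channels) into
    non-overlapping space-time patches, one flat vector per patch."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0, "sizes must divide evenly"
    # Separate each axis into (number of patches, size within a patch).
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Bring the three patch-index axes to the front.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # One row per patch, holding every value inside that patch.
    return x.reshape(-1, pt * ph * pw * C)

# Toy video: 8 frames of 16 x 16 pixels with 3 colour channels.
video = np.arange(8 * 16 * 16 * 3, dtype=np.float32).reshape(8, 16, 16, 3)
tokens = video_to_patches(video)
print(tokens.shape)  # (64, 96): 64 patches, each covering 2 frames x 4 x 4 pixels x 3 channels
```

Each row spans a small region of the picture across a few consecutive frames, so a transformer reading these rows sees how content changes over time as well as across the image.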
Leading the pack
Sora is not the first text-to-video model. Earlier models include Emu by Meta, Gen-2 by Runway, Stable Video Diffusion by Stability AI, and, most recently, Lumiere by Google.
Lumiere, released just a few weeks ago, claimed to produce better video than its predecessors. But Sora appears to be more powerful than Lumiere in at least some respects.
Sora can generate videos with a resolution of up to 1920 × 1080 pixels, and in a variety of aspect ratios, while Lumiere is limited to 512 × 512 pixels. Lumiere's videos are around five seconds long, while Sora makes videos up to 60 seconds.
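A quick calculation puts these figures in proportion. It uses only the numbers quoted above and treats Lumiere's "around five seconds" as exactly five, which is an approximation; frame rates are not given, so total frame counts are not compared.

```python
# Maximum frame sizes quoted above, in pixels.
sora_pixels = 1920 * 1080       # 2,073,600 pixels per frame
lumiere_pixels = 512 * 512      # 262,144 pixels per frame

# Maximum clip lengths quoted above, in seconds (Lumiere's is approximate).
sora_seconds = 60
lumiere_seconds = 5

pixel_ratio = sora_pixels / lumiere_pixels
duration_ratio = sora_seconds / lumiere_seconds

print(f"pixels per frame: about {pixel_ratio:.1f} times as many")
print(f"maximum duration: about {duration_ratio:.0f} times as long")
```

On these figures, a Sora frame holds roughly eight times as many pixels as a Lumiere frame, and a clip can run about twelve times as long.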
Lumiere cannot make videos composed of multiple shots, while Sora can. Sora, like other models, is also reportedly capable of video-editing tasks such as creating videos from images or other videos, combining elements from different videos, and extending videos in time.
Both models generate broadly realistic videos, but both may suffer from hallucinations. Lumiere's videos may be more easily recognized as AI-generated, while Sora's look more dynamic, with more interactions between elements.
However, in many of the example videos inconsistencies become apparent on close inspection.
This article is republished from The Conversation under a Creative Commons license.
Citation: What is Sora? A new generative AI tool could transform video production and amplify disinformation risks (2024, February 20) retrieved 20 February 2024 from https://techxplore.com/news/2024-02-sora-generative-ai-tool-video.html