Not content with unleashing two of the world’s most influential AI tools so far in ChatGPT and Dall-E, OpenAI this week turned its attention to a new frontier (AI-generated video) with its new model called Sora. While big questions remain, it might even be the most impressive of the lot.
How does it work?
OpenAI’s research paper says that Sora is both a “diffusion model” (like Dall-E) and a “transformer” (like ChatGPT). This means it can predict sequences or patterns (in this case, video) based on vast quantities of training data. What we don’t yet know is exactly what training data was used, which is a fairly big unanswered question.
Sora is a text-to-video tool that can create all kinds of video – photo-realistic, animated, downright strange – of up to sixty seconds in length. It isn’t publicly available to try yet, but a wave of sample videos released by OpenAI has created a clamor for that to happen as soon as possible. Well, unless you make stock videos for a living.
These early samples suggest that Sora is by far the most impressive text-to-video tool we’ve seen so far. It’s far from the first – the likes of Google Imagen and Runway Gen-2 have laid the groundwork, with Nvidia releasing its own impressive demos last year. But Sora appears to trump all of them because it’s capable of doing a few new things.
Early AI-generated videos were dogged by inconsistency, warping and other oddities that instantly broke the illusion. But Sora, as OpenAI’s blog post explains, is not only able to create “complex scenes with multiple characters”, it can also “simulate the physical world in motion” and understand how objects should exist in that world. The result? From what we can see so far, you get coherent, consistent videos where everything largely stays where it should (something that’s known as ‘object permanence’).
Sora is far from perfect and a lot of questions remain unanswered. OpenAI admits that it can struggle with “accurately simulating the physics of a complex scene”, understanding “specific instances of cause and effect” and can also “confuse spatial details of a prompt”. We also don’t know which GPT model was used to build Sora, what data it was trained on, when OpenAI will deem it ready to be released beyond its early testers, and how much it might cost.
But still, it’s hard not to be blown away by the quality of some of Sora’s early examples and what it could ultimately mean for video, cameras, movies, gaming and, most importantly, gifs. Here are 11 of the most impressive AI-generated videos so far from Sora and what they tell us about where this all could be going…
1. It can make convincing sci-fi trailers
- The prompt: A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors.
This sci-fi short is one of the more impressive examples of Sora’s generative chops, showcasing its ability to make photo-real characters and also ape particular cinematic styles.
The prompt specifies a “movie trailer” so it includes cuts and close-ups – and what it lacks in narrative coherence it makes up for in quality and consistency compared to other text-to-video tools. There’s no sound, of course, but as a tool for storyboarding and brainstorming, it’s already seemingly hit new heights.
2. AI-generated humans look photo-real
- The prompt: A instructional cooking session for homemade gnocchi hosted by a grandmother social media influencer set in a rustic Tuscan country kitchen with cinematic lighting
It’s barely been eighteen months since Meta and Google showed their early examples of text-to-video tools, but Sora videos like the one above show the rapid progress that’s been made – particularly when it comes to creating clips involving people.
Early Google Imagen clips steered clear of humans and animals, but the example above – published by OpenAI CEO Sam Altman on X (formerly Twitter) after a request for prompts – shows the realistic, crisp detail it can produce. Even the hands look fairly realistic, although there is a disappearing spoon to show its AI origins.
3. Pixar-style animated shorts are possible too
- The prompt: Animated scene features a close-up of a short fluffy monster kneeling beside a melting red candle (see post for full prompt).
This Sora-made clip shows the potential for AI-generated video to democratize animation and open it up to anyone with an imagination. It shows a Pixar-style fluffy monster with incredibly detailed fur and realistic candle reflections.
The prompt may be long and we don’t know the processing time, but it’s sure to be a lot shorter than the historical processes used by animation studios. Pixar has previously talked about the painstaking process of making fur in Monsters, Inc, and the original Toy Story took 800,000 machine hours to make, with Pixar able to render less than 30 seconds of footage per day.
4. It could replace your drone
- The prompt: Drone view of waves crashing against the rugged cliffs along Big Sur’s garay point beach (see post for full prompt).
Text-to-video tools won’t replace the best drones for capturing personal memories. But if you need some generic, stock aerial video (that can even roughly approximate real locations) then the Sora-made example above shows it could be up to the task – and good weather is guaranteed.
Only the waves in this clip give away that it is AI-generated – and even then, only if you look closely. It would certainly be good enough for social media, and another example of the Amalfi Coast shows the quality isn’t a one-off. The only question is, whose real aerial imagery has it been trained on?
5. It can transport you to an AI-generated past
- The prompt: Historical footage of California during the gold rush.
Did they have drones in the mid-19th century? Not to our knowledge, but Sora here gives us an idea of what one of DJI’s flying cameras might have captured had it existed in California during the gold rush.
This clip raises serious questions about what AI-generated video could do to our recollection of historical events if it was simply unleashed into the wild. That’s why OpenAI says it’s “building tools to help detect misleading content such as a detection classifier”, which can tell if a video was made by Sora.
While it’s good to hear that OpenAI’s taking these safety steps, it still leaves us concerned about social media, given the old adage that ‘a lie can travel halfway around the world while the truth is still putting its shoes on’.
- The prompt: Extreme close up of a 24 year old woman’s eye blinking, standing in Marrakech during magic hour, cinematic film shot in 70mm, depth of field, vivid colors, cinematic
All that money spent on an f/1.2 prime lens for your full-frame camera and a text-to-video tool rustles up this clip with a simple prompt – sickening. Of course, we’ll still need cameras to capture real people, events and memories, but this clip shows there’s no doubt that Sora and its rivals will again reduce the need for stock video clips.
The movement of the eye, the eyelashes, the realistic skin pores, the reflections of the Marrakech sunset – all are pretty much on point. It even seems to simulate a momentary focusing error. We haven’t seen anything quite as good as this from a text-to-video generator before, and they’re only going to get better.
7. It can get as surreal as your sea dreams
- The prompt: A bicycle race on ocean with different animals as athletes riding the bicycles with drone camera view
One of the most impressive things about Sora from this first range of sample clips is its versatility. It can do photo-realism and Pixar-style animation, but also combine the two to make some surreal clips that would otherwise take hours to animate.
This ocean-based bicycle race certainly isn’t perfect – quite why there’s a porpoise suspended in mid-air isn’t clear – but somehow the cycling sea creatures don’t look completely unnatural either. At the very least, our gif games are going up several notches.
8. A new kind of personalized gaming could be near
- The prompt: the camera follows behind a white vintage SUV with a black roof rack as it speeds up a steep dirt road surrounded by pine trees on a steep mountain slope, dust kicks up from it’s tires, the sunlight shines on the SUV as it speeds along the dirt road, casting a warm glow over the scene. (See post for full prompt).
Sora is a way off being able to create a video game as realistic as the AI-generated video above, but it certainly has the potential to have a major impact on the gaming industry. An OpenAI paper reveals that it can render video games, learn physics and help create game worlds.
As noted by Nvidia Senior Researcher Dr Jim Fan on X (formerly Twitter), Sora is more than just an image generator like the ones we’ve seen before in the likes of Dall-E. It’s more akin to a “data-driven physics engine”, effectively learning physics and opening up realistic text-to-3D creation.
As OpenAI’s paper states, “Sora can simultaneously control the player in Minecraft with a basic policy while also rendering the world and its dynamics in high fidelity”. Clearly, that’s just the start of its gaming potential.
9. Advertising could lap up the creative potential
- The prompt: Photorealistic closeup video of two pirate ships battling each other as they sail inside a cup of coffee.
Sora’s photo-realistic video potential and seemingly impressive understanding of physics could make it a potent creative weapon for lots of things, including advertising.
Expect to see your YouTube pre-rolls and social ads get a lot more surreal as scenes like the one above become available to limited marketing budgets that would have previously only stretched to a simple smartphone-made short. That is, assuming OpenAI fends off its copyright lawsuits and Sora becomes viable for commercial use.
10. It has decent directorial chops
Sora developer Bill Peebles shared the clip above on X (formerly Twitter), stating that “this is a single video generated by sora, shot changes and all”.
We don’t know exactly what prompt was used to generate ‘bling zoo’, which features some animals that appear to be enjoying a generous inheritance, but the video demonstrates an understanding of cuts and pacing that suggests Sora can go beyond looping the same sequences for a minute. Amateur filmmakers will no doubt be near the front of the queue.
11. Dog gifs are about to go next-level
- The prompt: A litter of golden retriever puppies playing in the snow. Their heads pop out of the snow, covered in.
Not all of the implications of OpenAI’s Sora are world-changing or industry-shifting – we’re frankly just as excited about the imminent possibilities for our gif game.
It seems that Sora is particularly adept at creating short, photo-real clips of dogs, puppies and cats – and while there isn’t exactly a shortage of those on the internet already, we are looking forward to tailoring the ideal clip for those times when Giphy falls short.
Well, unless the tech behind Sora commands an extortionate monthly subscription, which isn’t beyond the realms of possibility.