That works best if the scene has some action — but not too much action — something like “a rainy day in the big city” or “a dog with a cellphone in the park.” Hit enter, and the system generates a video in a minute or two.
“At this point, if I see a high-resolution video, I am probably going to trust it,” said Phillip Isola, a professor at the Massachusetts Institute of Technology who specializes in AI. “But that will change pretty quickly.”
Several startups, including OpenAI, have released similar technology that can generate still images from short prompts like “photo of a teddy bear riding a skateboard in Times Square.” And the rapid advancement of AI-generated photos could suggest where the new video technology is going.
“There was a time when people would post deepfakes, and they wouldn’t fool me, because they were so outlandish or not very realistic,” he said. “Now, we can’t take any of the images we see on the internet at face value.”
When someone describes an image for the system, it generates a list of features that the image might include. One feature might be the curve at the top of a dog’s ear. Another might be the edge of a cellphone. Then, a second neural network, called a diffusion model, creates the image and generates the pixels needed for the features. It eventually transforms the pixels into a coherent image.
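For readers who want a concrete picture of that two-stage process, here is a toy sketch in Python. It is an illustration under heavy assumptions, not how any real product works: the function names `prompt_to_features` and `denoise` are hypothetical, the "features" are just a short list of numbers derived from the words in the prompt, and the "image" is a handful of pixel values. In a real diffusion model the finished image is not known in advance; a trained neural network predicts what noise to remove at each step. The sketch keeps only the overall shape of the idea: turn a description into features, then start from random noise and refine it gradually.

```python
import random


def prompt_to_features(prompt, size=8):
    """Stand-in for the first network: turn a prompt into a list of features.

    Hypothetical: a real system learns this mapping from data. Here each
    word simply bumps one slot in the list, chosen from its characters.
    """
    features = [0.0] * size
    for word in prompt.lower().split():
        slot = sum(ord(ch) for ch in word) % size
        features[slot] += 1.0
    peak = max(features) or 1.0  # avoid dividing by zero for an empty prompt
    return [f / peak for f in features]  # scale every feature into 0..1


def denoise(features, steps=50, seed=0):
    """Stand-in for the diffusion model: begin with noise, refine step by step.

    Assumption: the features act directly as the target pixel values, which
    a real model would never have; it would predict the noise instead.
    """
    rng = random.Random(seed)
    pixels = [rng.random() for _ in features]  # start from pure noise
    for step in range(steps):
        # Close a growing share of the remaining gap; the last step closes it all.
        rate = 1.0 / (steps - step)
        pixels = [p + rate * (f - p) for p, f in zip(pixels, features)]
    return pixels


features = prompt_to_features("a dog with a cellphone in the park")
pixels = denoise(features)
```

Calling `denoise` with a smaller `steps` value reaches the same result in fewer, coarser passes, which loosely mirrors the trade-off between speed and gradual refinement in real systems.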