Google Makes It Easy to Turn Photos into Short Videos - Meet VDIM

Google has unveiled a new AI model that can take two images and fill in the gaps between them to create a seamless animation that closely resembles live action.

VDIM (Video Interpolation With Diffusion Models) was created by Google's research division DeepMind and uses one image as the first frame and the other as the last frame. All frames in between are then generated by AI to create the video.

This could be great for bringing to life photos of a child playing in the park, or for filling in the action at an event where you forgot to capture video.

While it is currently only a research preview, the underlying technology may one day become an everyday part of smartphone photography.

VDIM converts still images into video by generating the missing frames with a diffusion model, the same class of model that powers Midjourney, DALL-E, and Google's own Imagen 2.

It begins by generating a low-resolution version of the complete video. This is done by running a cascade of diffusion models in sequence, progressively refining the result. This first step allows VDIM to capture the motion and dynamics of the final output.
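
Google has not released code, but the idea can be illustrated with a toy sketch. The Python snippet below is a hypothetical mock-up of this base stage, assuming an invented toy_denoise stand-in (a simple pull toward a linear blend of the two endpoint photos) in place of the learned video diffusion network; the frame count, resolution, and step count are made up for illustration.

```python
import numpy as np

NUM_FRAMES, H, W, STEPS = 9, 16, 16, 50  # assumed toy sizes, not VDIM's

def toy_denoise(video, first, last, t):
    """Stand-in for the learned denoiser: nudge every frame toward a
    linear blend of the two conditioning photos, harder as t -> 0."""
    alphas = np.linspace(0.0, 1.0, NUM_FRAMES)[:, None, None]
    target = (1 - alphas) * first + alphas * last
    strength = 1.0 - t / STEPS
    return video + 0.2 * strength * (target - video)

def sample_base_video(first, last, seed=0):
    """Start from pure noise and refine it step by step, keeping the
    first and last frames pinned to the two input photos."""
    rng = np.random.default_rng(seed)
    video = rng.standard_normal((NUM_FRAMES, H, W))
    for t in reversed(range(STEPS)):
        video = toy_denoise(video, first, last, t)
        video[0], video[-1] = first, last  # conditioning frames stay fixed
    return video

first_photo = np.zeros((H, W))   # placeholder "photo 1"
last_photo = np.ones((H, W))     # placeholder "photo 2"
low_res = sample_base_video(first_photo, last_photo)
print(low_res.shape)             # (9, 16, 16): coarse motion, low detail
```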

That low-resolution result is then passed to a higher-resolution stage, where it is upscaled and refined to match the input images more closely and to make the motion more fluid.
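
Continuing the same hedged sketch, the super-resolution stage can be mocked up as below; nearest-neighbour upsampling stands in for the second diffusion model, and the endpoint frames are pinned to the full-resolution inputs. The name toy_super_resolve and all sizes are assumptions, not Google's API.

```python
import numpy as np

def toy_super_resolve(low_res_video, first_hr, last_hr, scale=4):
    """Stand-in for the cascade's super-resolution stage: upsample each
    low-res frame (the real stage is another diffusion model) and pin
    the endpoint frames to the full-resolution input photos."""
    up = np.repeat(np.repeat(low_res_video, scale, axis=1), scale, axis=2)
    up[0], up[-1] = first_hr, last_hr
    return up

low_res = np.random.default_rng(0).random((9, 16, 16))  # stand-in base output
first_hr, last_hr = np.zeros((64, 64)), np.ones((64, 64))
high_res = toy_super_resolve(low_res, first_hr, last_hr)
print(high_res.shape)  # (9, 64, 64): same motion, four times the resolution
```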

One potential use case for VDIM that the team examined in its research paper is video restoration. AI is already used to improve old photographs, and the same approach could help clean up old family films or repair footage with damaged frames.

Older films may have burned-out frames in the middle of a sequence that are difficult to see, or several consecutive frames marred by scratches.

VDIM can be given a clean frame on either side of the damage and used to recreate the motion between those two points.
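
As a rough illustration of that restoration workflow, the hypothetical snippet below brackets a damaged span with the nearest clean frames and fills the gap; the interpolate function here is a simple linear blend standing in for the motion VDIM would actually synthesize.

```python
import numpy as np

def interpolate(first, last, n):
    """Linear-blend stand-in for VDIM: produce n in-between frames.
    The real model would synthesize plausible motion instead."""
    alphas = np.linspace(0.0, 1.0, n + 2)[1:-1, None, None]
    return (1 - alphas) * first + alphas * last

rng = np.random.default_rng(1)
clip = rng.random((10, 32, 32))   # toy clip of ten 32x32 frames
damaged = slice(3, 6)
clip[damaged] = np.nan            # frames 3-5 are "burned out"

# Bracket the damage with the nearest clean frames and fill the gap.
first_clean, last_clean = clip[2], clip[6]
clip[damaged] = interpolate(first_clean, last_clean, 3)
print(np.isnan(clip).any())       # False: the sequence is whole again
```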

Since VDIM is a research project, no one outside the Google DeepMind research team has actually used it yet, but the example clips are a promising start for a new type of AI video.

Examples of videos shared by Google DeepMind include the start of a box-cart race generated from just two still images.

Another example turns two still images of a woman on a swing into fluid swinging motion.

Personally, I think this is one research project Google should pursue and find a way to build into shipping software, particularly for video restoration, and especially if it can be extended beyond a few seconds or a few dozen frames.
