AI video models try to mimic real-world physics, but don't understand it

AI video generators cannot learn the laws of physics just by watching videos, scientists have found.

Following in the footsteps of chatbots and image generators, AI video generators like Sora and Runway have already produced impressive results. However, a team of scientists from ByteDance Research, Tsinghua University, and the Technion wanted to know whether such models could discover physical laws from visual data alone, without additional human input.

In the real world, we express physics through mathematics. In the world of video generation, a model that understands physics should be able to look at a series of frames and predict what the next frame will look like, and it should be able to do so whether or not it has seen a similar scene before.
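To make that test concrete, here is a minimal sketch of how next-frame prediction could be scored. The `model_predict` callable and the pixel-error metric are illustrative assumptions, not the evaluation protocol used in the paper.

```python
import numpy as np

def next_frame_error(model_predict, frames):
    """Score a video model on next-frame prediction.

    model_predict -- hypothetical callable mapping the context frames
                     (array of shape [k, H, W]) to a predicted frame [H, W].
    frames        -- ground-truth clip of shape [k+1, H, W].
    Returns the mean squared pixel error between the prediction and the true last frame.
    """
    context, target = frames[:-1], frames[-1]
    prediction = model_predict(context)
    return float(np.mean((prediction - target) ** 2))

def copy_last_frame(context):
    """Trivial stand-in for a real video model: just repeat the last frame seen."""
    return context[-1]

if __name__ == "__main__":
    clip = np.random.rand(4, 32, 32)  # 3 context frames plus 1 target frame
    print(next_frame_error(copy_last_frame, clip))
```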

To investigate whether such understanding exists, the scientists built 2D simulations of simple shapes and movements and generated hundreds of thousands of mini-videos to train and test the models. The result: the models were able to “mimic” physics, but not understand it.

The three basic physical laws they chose for the simulations were uniform linear motion of a ball, perfectly elastic collision between two balls, and parabolic motion of a ball.
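For reference, all three laws can be written down in a few lines of code. The sketch below uses an assumed time step `DT` and simplified one- and two-dimensional setups; it illustrates the underlying equations, not the paper's simulator.

```python
import numpy as np

DT = 0.1  # assumed time step between frames

def uniform_motion(x0, v, steps):
    """Uniform linear motion: position advances by v * DT every frame."""
    return [x0 + v * DT * t for t in range(steps)]

def elastic_collision_1d(m1, v1, m2, v2):
    """Perfectly elastic head-on collision: momentum and kinetic energy are conserved."""
    v1_new = ((m1 - m2) * v1 + 2 * m2 * v2) / (m1 + m2)
    v2_new = ((m2 - m1) * v2 + 2 * m1 * v1) / (m1 + m2)
    return v1_new, v2_new

def parabolic_motion(x0, y0, vx, vy, g, steps):
    """Projectile motion: constant horizontal velocity, constant downward acceleration."""
    trajectory = []
    for frame in range(steps):
        t = frame * DT
        trajectory.append((x0 + vx * t, y0 + vy * t - 0.5 * g * t * t))
    return trajectory

if __name__ == "__main__":
    print(uniform_motion(0.0, 2.0, 5))
    print(elastic_collision_1d(1.0, 3.0, 1.0, -1.0))   # equal masses swap velocities
    print(parabolic_motion(0.0, 0.0, 1.0, 4.0, 9.8, 5))
```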

According to the team's preprint paper, the trained models reproduced the shapes' behavior correctly in scenarios resembling their training data, but failed in new, unseen scenarios. At best, they mimicked the closest training example.

Over the course of the experiments, the scientists also observed that the video generator often morphed one shape into another (for example, a square randomly turning into a ball) and made other nonsensical changes. When generalizing, the models followed a clear priority hierarchy: color mattered most, followed by size and then velocity, with shape mattering least.

“It is difficult to determine whether the video models are learning laws rather than simply memorizing data,” the researchers said.

The researchers explained that since the model's internal knowledge is inaccessible, one can only infer the model's understanding by examining its predictions for unknown scenarios.

“Our detailed analysis suggested that generalization of the video model relies on referring to similar training examples rather than learning universal rules,” they said, emphasizing that this occurs regardless of how much training data the model has seen.
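To illustrate the distinction the researchers are drawing, here is a toy contrast (not from the paper) between a model that has internalized the law of uniform motion and one that merely retrieves the closest memorized training case:

```python
import numpy as np

def rule_based_prediction(x0, v, t):
    """A model that has learned the law: apply uniform motion to any initial condition."""
    return x0 + v * t

def case_based_prediction(x0, v, t, training_cases):
    """A model that memorizes: find the closest training example and replay its outcome."""
    nearest = min(training_cases, key=lambda c: abs(c["x0"] - x0) + abs(c["v"] - v))
    return nearest["x0"] + nearest["v"] * t

# Training covers only slow velocities; the test asks about a faster, unseen one.
train = [{"x0": 0.0, "v": v} for v in np.linspace(0.5, 1.0, 6)]
x0_test, v_test, t = 0.0, 2.5, 4.0

print(rule_based_prediction(x0_test, v_test, t))          # 10.0, the correct physics
print(case_based_prediction(x0_test, v_test, t, train))   # ~4.0, replays the nearest seen case
```

Inside the range of velocities it has seen, the case-based predictor looks almost as good as the rule-based one; outside that range it fails in just the way the paper describes, replaying the nearest training example instead of applying the law.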

Have they found a solution? “Actually, this is probably the mission of the entire AI community,” lead author Bingyi Kang wrote on X.
