DrivingWorld: Constructing World Model for Autonomous Driving via Video GPT

1The Hong Kong University of Science and Technology, 2Horizon Robotics,
*Contributed Equally, Project Leader, Corresponding Author

Abstract

Recent successes of autoregressive (AR) generation models, such as the GPT series in natural language processing, have motivated efforts to replicate this success in visual tasks. Some works attempt to extend this approach to autonomous driving by building video-based world models capable of generating realistic future video sequences and predicting ego states. However, prior works tend to produce unsatisfactory results, since the classic GPT framework is designed to handle 1D contextual information, such as text, and lacks the inherent ability to model the spatial and temporal dynamics essential for video generation. In this paper, we present DrivingWorld, a GPT-style world model for autonomous driving, featuring several spatial-temporal fusion mechanisms. This design enables effective modeling of both spatial and temporal dynamics, facilitating high-fidelity, long-duration video generation. Specifically, we propose a next-state prediction strategy to model temporal coherence between consecutive frames and apply a next-token prediction strategy to capture spatial information within each frame. To further enhance generalization, we propose novel masking and reweighting strategies for token prediction that mitigate long-term drifting and enable precise control. Our model produces high-fidelity, consistent video clips of over 40 seconds, more than twice as long as those of state-of-the-art driving world models. Experiments show that, compared with prior works, our method achieves superior visual quality and significantly more accurate, controllable future video generation. Our code will be publicly available.
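To make the two-level prediction concrete, below is a minimal, self-contained PyTorch sketch (a toy, not the released model). Frame tokens are flattened into one causal stream so that the standard next-token loss learns spatial structure within each frame, while the frame-to-frame ordering enforces temporal coherence. The random input masking, the mask_ratio value, and all sizes are our own illustrative reading of the masking strategy, not the paper's exact recipe.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy sizes; the real system uses a large GPT-style transformer over VQ codes.
VOCAB, D, FRAMES, TOKENS_PER_FRAME = 1024, 128, 4, 16

class ToyVideoGPT(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D)
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D, VOCAB)

    def forward(self, tokens):                       # tokens: (B, T)
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.blocks(self.embed(tokens), mask=causal)
        return self.head(h)                          # next-token logits

def train_step(model, frame_tokens, mask_ratio=0.1, mask_id=VOCAB - 1):
    """frame_tokens: (B, FRAMES, TOKENS_PER_FRAME) discrete codes.
    A fraction of input tokens is randomly masked so the model stays
    robust to its own imperfect predictions during long rollouts
    (one plausible reading of the masking strategy; the ratio is a guess)."""
    tokens = frame_tokens.flatten(1)                 # (B, FRAMES*TOKENS_PER_FRAME)
    corrupted = tokens.clone()
    corrupted[torch.rand_like(tokens, dtype=torch.float) < mask_ratio] = mask_id
    logits = model(corrupted[:, :-1])                # predict token t+1 from tokens <= t
    return F.cross_entropy(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))

model = ToyVideoGPT()
codes = torch.randint(0, VOCAB, (2, FRAMES, TOKENS_PER_FRAME))
print(train_step(model, codes).item())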

Pipeline

The vehicle orientations, ego locations, and a front-view image sequence are taken as conditional inputs and are first tokenized into latent embeddings. Our multi-modal world model then comprehends these tokens and forecasts the future states, which are detokenized back into the vehicle orientation, location, and front-view image. By repeating this autoregressive process, we can generate videos of over 40 seconds.
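The following is a minimal sketch of this rollout under assumed interfaces: tokenize_state, detokenize_state, and world_model are stand-ins for the learned VQ tokenizers and the trained transformer, and the token budgets per state are illustrative, not the actual configuration.

import torch

VOCAB, IMG_TOKENS, POSE_TOKENS = 1024, 16, 2

def tokenize_state(orientation, location, image):
    """Map one ego state (pose + front-view image) to discrete tokens.
    Faked with random codes here purely for illustration."""
    pose_tokens = torch.randint(0, VOCAB, (POSE_TOKENS,))
    img_tokens = torch.randint(0, VOCAB, (IMG_TOKENS,))
    return torch.cat([pose_tokens, img_tokens])

def detokenize_state(state_tokens):
    """Inverse mapping back to (pose tokens, image tokens); stubbed."""
    return state_tokens[:POSE_TOKENS], state_tokens[POSE_TOKENS:]

def world_model(token_seq):
    """Stand-in for the trained model: returns next-token logits."""
    return torch.randn(token_seq.shape[0], VOCAB)

@torch.no_grad()
def rollout(condition_states, num_future_frames):
    """Autoregressively forecast future states from conditioning frames."""
    seq = torch.cat([tokenize_state(*s) for s in condition_states]).unsqueeze(0)
    tokens_per_state = POSE_TOKENS + IMG_TOKENS
    future = []
    for _ in range(num_future_frames):
        new_tokens = []
        for _ in range(tokens_per_state):
            logits = world_model(seq)                       # (1, VOCAB)
            nxt = torch.multinomial(logits.softmax(-1), 1)  # sample next token
            seq = torch.cat([seq, nxt], dim=1)
            new_tokens.append(nxt)
        future.append(detokenize_state(torch.cat(new_tokens, dim=1).squeeze(0)))
    return future

# Two conditioning frames: (orientation, location, image placeholder).
cond = [(0.0, (0.0, 0.0), None), (0.1, (0.5, 0.0), None)]
states = rollout(cond, num_future_frames=3)
print(len(states))  # 3 forecast states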

Experiment 1: Long-term Generation

In the following examples, we present several long-term generations; the longest spans over 600 frames. Frames with white borders are conditioning frames, and frames with red borders are generated by the model. All results are produced by the same model.

Experiment 2: Controllable Generation

In the following examples, we design multiple trajectories for the ego car to generate different driving scenarios. Frames with white borders are conditioning frames, and frames with red borders are generated by the model. All results are produced by the same model. The "x2" in the bottom-right corner of a video indicates that it is played at 2x speed. A sketch of how such designed trajectories could be encoded as conditions follows the examples below.

Example 1: The car is designed to drive in a curved path, going straight forward first and then turning left into the left lane.

Example 2: The car is designed to drive in a curved path, going straight forward first and then turning right into the right lane.

Example 3: The car is designed to drive in a curved path, turning left twice into its left lane.

Example 4: The car is designed to drive in a curved path, turning left first and then turning right.

Example 5: The car is designed to drive in a curved path, turning right into the right lane and then left back into the original lane.

Example 6: The car is designed to drive in a curved path, going straight forward first and then making a U-turn.

Example 7: The car is designed to drive in a curved path, slowly turning left.

Example 8: The car is designed to drive in a curved path, slowly turning left.

Example 9: The car is designed to drive in a curved path, turning right.

Example 10: The car is designed to drive in a curved path, turning right.

Example 11: The car is designed to drive in a curved path, turning right.
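As referenced above, here is a hedged sketch of how a hand-designed trajectory (e.g. "straight, then ease into the left lane") might be converted into per-frame pose tokens that condition the rollout. The binning scheme, the value ranges, and the left_lane_change helper are hypothetical illustrations, not the paper's exact encoding.

import math
import torch

NUM_BINS = 256  # assumed size of the discrete pose vocabulary

def pose_to_tokens(yaw_deg, dx, dy, max_yaw=30.0, max_step=5.0):
    """Quantize one frame's ego motion (heading change in degrees and
    planar displacement in meters) into three discrete bins."""
    def bin_value(v, lo, hi):
        v = min(max(v, lo), hi)                      # clamp to range
        return int((v - lo) / (hi - lo) * (NUM_BINS - 1))
    return [
        bin_value(yaw_deg, -max_yaw, max_yaw),
        bin_value(dx, -max_step, max_step),
        bin_value(dy, -max_step, max_step),
    ]

def left_lane_change(num_frames=20):
    """Toy trajectory: drive straight, then ease left into the next lane."""
    tokens = []
    for t in range(num_frames):
        half = num_frames // 2
        yaw = 0.0 if t < half else 3.0 * math.sin(math.pi * (t - half) / half)
        tokens.append(pose_to_tokens(yaw_deg=yaw, dx=2.0, dy=0.05 * yaw))
    return torch.tensor(tokens)                      # (num_frames, 3)

control_tokens = left_lane_change()
print(control_tokens.shape)  # torch.Size([20, 3]); one pose condition per future frame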

BibTeX

@article{hu2024drivingworld,
  author    = {Hu, Xiaotao and Yin, Wei and Jia, Mingkai and Deng, Junyuan and Guo, Xiaoyang and Zhang, Qian and Long, Xiaoxiao and Tan, Ping},
  title     = {DrivingWorld: Constructing World Model for Autonomous Driving via Video GPT},
  year      = {2024},
}