Most generative AI models today are autoregressive: they work by next-token prediction, and the transformer has been the dominant implementation of that idea for years now, largely thanks to its computational efficiency. The concept itself is simple to grasp - as long as you aren't interested in the details: everything can be tokenized and fed into an autoregressive (AR) model. And by everything, I mean everything: text as you'd expect, but also images, videos, 3D models and whatnot.
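
To make the loop concrete, here is a minimal sketch of autoregressive generation. Everything in it is hypothetical: `next_token_logits` is a toy stand-in for a trained transformer, and the tiny vocabulary is made up for illustration. The point is only the shape of the process: score the next token given the prefix, pick one, append it, repeat.

```python
import numpy as np

# Toy vocabulary; a real tokenizer maps text (or image patches, etc.)
# to integer ids over a much larger vocabulary.
VOCAB = ["<bos>", "the", "cat", "sat", "on", "mat", "<eos>"]

def next_token_logits(tokens: list[int]) -> np.ndarray:
    """Hypothetical stand-in for a trained model: one score per vocab entry.
    A real transformer would compute these from the full token prefix."""
    rng = np.random.default_rng(seed=sum(tokens))  # deterministic toy scores
    return rng.normal(size=len(VOCAB))

def generate(prompt: list[int], max_new_tokens: int = 8) -> list[int]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)
        next_id = int(np.argmax(logits))  # greedy: take the highest-scoring token
        tokens.append(next_id)            # feed it back in: the "auto" in autoregressive
        if VOCAB[next_id] == "<eos>":
            break
    return tokens

print([VOCAB[i] for i in generate([0])])
```

Nothing in this loop cares what the tokens stand for - characters, image patches, or mesh coordinates - which is exactly why "tokenize everything" works as a recipe.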