Thank you. This shit is cool, no matter which grifters insist it’s the next internet or whatever.
We have proven that data and compute alone can produce deeply spooky results. LLMs should not be capable of answering questions. They’re glorified Markov chains. And yet: asking “what word comes next?” is correct-ish, often enough, that we have to talk about the limits of their knowledge and reasoning. Diffusion models genuinely remove the parts of the marble that don’t look like a statue. Combined, you can just type “Han Solo scene but Jabba is Shrek” and it’ll do a half-ass job of rendering video of that ridiculous premise. It’ll do a lot more if you provide a half-ass job yourself and tell it to remove the parts that don’t look right. It’s straight-up science fiction and people are performatively incensed that it’s William Gibson instead of Isaac Asimov.
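To be clear about the “glorified Markov chain” jab: the generation loop really is just “pick a likely next word, append, repeat.” A real LLM conditions on the whole context through a transformer instead of a lookup table, but here’s a toy bigram sketch of the loop itself (the corpus and names are made up for illustration):

```python
import random

# Toy bigram "language model": count which word follows which in a tiny corpus.
corpus = "the cat sat on the mat and the cat ran".split()
follows = {}
for prev, nxt in zip(corpus, corpus[1:]):
    follows.setdefault(prev, []).append(nxt)

def generate(start, n_words, seed=0):
    """Repeatedly ask 'what word comes next?' and append the answer."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(n_words):
        choices = follows.get(out[-1])
        if not choices:  # dead end: no observed successor
            break
        out.append(rng.choice(choices))
    return " ".join(out)

print(generate("the", 5))
```

The spooky part is that scaling this exact loop up (bigger context, learned weights instead of counts) gets you something that answers questions.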
Their use of diffusion to mean wider distribution is an unfortunate choice.
… this whole article still treats the technology as slightly magic, huh. Disappointing.
Look, the future is in smarter questions for what a neural network should do. We’ve solved training in a way that a shitload of effort will definitely produce results. We don’t seem to give a shit about licensing data, which is fine by me, because eternal copyright blows, and training is transformative use. So fundamental shifts in how well a pile of linear algebra does what we want will depend on knowing what we want.
Even just using the two formats anyone cares about, you can use diffusion on text, and it’ll gradually unfuck whatever’s written. That’s a significant improvement over guessing the next word and proceeding like it’s carved in stone.
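The shape of that “gradually unfuck it” loop is worth spelling out. A real text-diffusion model learns the denoiser; in this sketch a hard-coded oracle stands in for it (the target sentence and step counts are made up), purely to show how every position stays revisable instead of being carved in stone left to right:

```python
import random

# Toy discrete text diffusion: start from pure noise (all [MASK]), then
# iteratively "denoise" some positions per step. A trained model would
# predict the proposal; an oracle lookup stands in for it here.
TARGET = "the quick brown fox jumps over the lazy dog".split()
MASK = "[MASK]"

def oracle_denoiser(tokens):
    """Stand-in for a learned model: proposes a word for every position."""
    return list(TARGET)

def diffuse(steps, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * len(TARGET)
    for _ in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        proposal = oracle_denoiser(tokens)
        # Commit roughly a third of the remaining noise each step. Nothing
        # stops a real denoiser from also revising already-filled positions.
        for i in rng.sample(masked, max(1, len(masked) // 3)):
            tokens[i] = proposal[i]
    return tokens

print(" ".join(diffuse(steps=10)))
```

Autoregressive decoding commits to each word forever; this loop can keep reconsidering the whole sequence until it stops looking wrong, which is exactly the marble-and-statue move.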