
Right now, most generative image models fall into one of two main categories: diffusion models, like Stable Diffusion, or autoregressive models, like OpenAI's GPT-4o. But Apple just released two papers that show there may be room for a third, largely forgotten approach: Normalizing Flows. And with a dash of Transformers on top, they may be more capable than previously thought.
First things first: What are Normalizing Flows?
Normalizing Flows (NFs) are a type of AI model that works by learning how to mathematically transform real-world data (like images) into structured noise, and then reverse that process to generate new samples.
The big advantage is that they can calculate the exact likelihood of every image they generate, something diffusion models can't do. This makes flows especially appealing for tasks where understanding the probability of an outcome really matters.
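To make that concrete, here is a minimal sketch (not Apple's code) of a one-layer flow on 1-D data. The `scale` and `shift` parameters are hypothetical stand-ins for what a real flow would learn; the point is that the data-to-noise map is exactly invertible, and the change-of-variables formula gives an exact log-likelihood:

```python
import numpy as np

# Hypothetical learned parameters of an affine flow: z = (x - shift) / scale
scale, shift = 2.0, 0.5

def forward(x):
    """Data -> noise, plus the log|det Jacobian| needed for the likelihood."""
    z = (x - shift) / scale
    log_det = -np.log(scale)          # dz/dx = 1/scale
    return z, log_det

def inverse(z):
    """Noise -> data: sampling is just the exact inverse of the forward map."""
    return z * scale + shift

def log_likelihood(x):
    """Exact log p(x) = log N(z; 0, 1) + log|det dz/dx|."""
    z, log_det = forward(x)
    log_pz = -0.5 * (z**2 + np.log(2 * np.pi))
    return log_pz + log_det

# Round trip: inverting the flow recovers the data exactly
x = 1.3
z, _ = forward(x)
print(inverse(z))   # -> 1.3
```

Real flows stack many such invertible layers (TarFlow builds them out of Transformer blocks), but the exact-likelihood property comes from this same structure.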
But there's a reason most people haven't heard much about them lately: early flow-based models produced images that looked blurry or lacked the detail and diversity offered by diffusion and transformer-based systems.
Study #1: TarFlow
In the paper “Normalizing Flows are Capable Generative Models”, Apple introduces a new model called TarFlow, short for Transformer AutoRegressive Flow.
At its core, TarFlow replaces the old, handcrafted layers used in earlier flow models with Transformer blocks. Basically, it splits images into small patches and generates them in blocks, with each block predicted based on all the ones that came before. That's what's called autoregressive, which is the same underlying approach OpenAI currently uses for image generation.
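The block-by-block loop can be sketched like this. This is an illustrative toy, not Apple's code: the sizes are made up, and `predict_block` is a random stand-in for the Transformer, which in the real model would attend over all previously generated blocks:

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_BLOCKS, PATCHES_PER_BLOCK, PATCH_DIM = 4, 16, 48  # hypothetical sizes

def predict_block(previous_blocks):
    """Stand-in for the Transformer: maps all earlier blocks to the next one.
    A real model would condition on `previous_blocks`; here we just sample."""
    return rng.standard_normal((PATCHES_PER_BLOCK, PATCH_DIM))

blocks = []
for _ in range(NUM_BLOCKS):
    blocks.append(predict_block(blocks))   # condition on everything so far

# Continuous pixel values per patch -- no discrete token vocabulary involved
image_patches = np.concatenate(blocks)
print(image_patches.shape)   # -> (64, 48)
```

The output of each step is a block of continuous values rather than a token ID, which is the distinction the next paragraph gets into.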

The key difference is that while OpenAI generates discrete tokens, treating images like long sequences of text-like symbols, Apple's TarFlow generates pixel values directly, without tokenizing the image first. It's a small but important distinction, because it lets Apple avoid the quality loss and rigidity that often come with compressing images into a fixed vocabulary of tokens.
Still, there were limitations, especially when it came to scaling up to larger, high-resolution images. And that's where the second study comes in.
Study #2: STARFlow
In the paper “STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis”, Apple builds directly on TarFlow and presents STARFlow (Scalable Transformer AutoRegressive Flow), with key upgrades.
The biggest change: STARFlow no longer generates images directly in pixel space. Instead, it works on a compressed version of the image, then hands things off to a decoder that upsamples everything back to full resolution at the final step.

This shift to what's called latent space means STARFlow doesn't have to predict millions of pixels directly. It can focus on the broader image structure first, leaving fine texture detail to the decoder.
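A conceptual sketch of that two-stage pipeline, with illustrative sizes (a 32×32 latent grid decoded up to a 256×256 image) and stand-in functions that are not Apple's API:

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_HW, UPSCALE = 32, 8   # hypothetical: 32x32 latent -> 256x256 image

def flow_generate_latent():
    """Stand-in for the autoregressive flow, working on the small latent grid."""
    return rng.standard_normal((LATENT_HW, LATENT_HW, 4))

def decode(latent):
    """Stand-in decoder: upsamples the latent back to pixel space.
    (A real decoder is a learned network, not a nearest-neighbor repeat.)"""
    rgb = latent[..., :3]    # pretend three latent channels map to RGB
    return rgb.repeat(UPSCALE, axis=0).repeat(UPSCALE, axis=1)

image = decode(flow_generate_latent())
print(image.shape)   # -> (256, 256, 3)
```

With these toy numbers the flow only has to model 32×32×4 values instead of 256×256×3 pixels, which is the whole point of moving generation into latent space.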
Apple also reworked how the model handles text prompts. Instead of building a separate text encoder, STARFlow can plug in existing language models (like Google's small language model Gemma, which in theory could run on-device) to handle language understanding when the user prompts the model to create an image. This keeps the image-generation side of the model focused on refining visual details.
How STARFlow compares with OpenAI's 4o image generator
While Apple is rethinking flows, OpenAI has also recently moved beyond diffusion with its GPT-4o model. But their approach is fundamentally different.
GPT-4o treats images as sequences of discrete tokens, much like words in a sentence. When you ask ChatGPT to generate an image, the model predicts one image token at a time, building the picture piece by piece. This gives OpenAI huge flexibility: the same model can generate text, images, and audio within a single, unified token stream.
The tradeoff? Token-by-token generation can be slow, especially for large or high-resolution images. And it's extremely computationally expensive. But since GPT-4o runs entirely in the cloud, OpenAI isn't as constrained by latency or power use.
In short: both Apple and OpenAI are moving beyond diffusion, but while OpenAI is building for its data centers, Apple is clearly building for our pockets.
FTC: We use income earning auto affiliate links. More.