Apple Explores Alternative Approach to Video AI with STARFlow-V

Tue 2nd Dec, 2025

Apple's artificial intelligence research division has introduced a novel video generation model named STARFlow-V, diverging from the widely adopted diffusion models that dominate the current landscape of AI-generated video. Instead, this new system utilizes a method known as Normalizing Flows, an approach that has seen limited application in video synthesis until now.

STARFlow-V distinguishes itself by producing highly realistic videos that closely adhere to the input prompts. Unlike many existing AI video generators, which often produce unnatural effects or visible distortions, Apple's new model delivers more consistent and natural motion, although the output currently remains limited to a resolution of 480p. The focus at this stage appears to be on demonstrating technical feasibility rather than commercial readiness.

The underlying model comprises approximately seven billion parameters and is capable of generating videos from text descriptions, extending still images into video sequences, and editing pre-existing video content. For training, STARFlow-V was exposed to a dataset of 70 million text-video pairs and an additional 400 million text-image pairs. The system can generate video clips at 16 frames per second, with each segment lasting up to five seconds. Longer sequences are created by sequentially extending these segments, using the last frame of a previous segment as the starting point for the next. Demonstrations have shown video sequences of up to 30 seconds in length.
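The chaining scheme is easy to picture in pseudocode. The sketch below is purely illustrative and assumes a hypothetical generate_segment() sampling call; it is not Apple's implementation, and details such as whether the conditioning frame is repeated in the output are assumptions.

```python
# Illustrative sketch (not Apple's code): building a longer clip by chaining
# fixed-length segments, conditioning each new segment on the final frame of
# the previous one. generate_segment() is a hypothetical stand-in for the
# model's sampling call.

FPS = 16                  # frames per second reported for STARFlow-V
SEGMENT_SECONDS = 5       # maximum length of a single generated segment
SEGMENT_FRAMES = FPS * SEGMENT_SECONDS

def generate_segment(prompt, first_frame=None, num_frames=SEGMENT_FRAMES):
    """Placeholder: returns a list of frames conditioned on the text prompt
    and, optionally, on a starting frame (image-to-video / continuation)."""
    raise NotImplementedError

def generate_long_video(prompt, total_seconds=30):
    frames = []
    last_frame = None
    while len(frames) < total_seconds * FPS:
        segment = generate_segment(prompt, first_frame=last_frame)
        # Assumed: drop the duplicated conditioning frame when stitching segments.
        frames.extend(segment if last_frame is None else segment[1:])
        last_frame = frames[-1]
    return frames[: total_seconds * FPS]
```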

A key technical advantage of STARFlow-V lies in its mathematical reversibility, a feature inherent to Normalizing Flows. This allows the model to compute the exact likelihood of a generated video, removes the need for a separate encoder for input images, and supports direct end-to-end training. In contrast to diffusion models, which typically denoise all frames in parallel, STARFlow-V generates each frame in strict chronological order, so that every frame depends only on the frames that precede it.
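In standard normalizing-flow terms, the exact likelihood follows from the change-of-variables formula: an invertible transform maps the data to a simple base distribution, and the log-density of the data is the base log-density of the transformed sample plus the log-determinant of the transform's Jacobian. The formula below is the generic formulation; STARFlow-V's precise parameterization may differ.

```latex
\log p_X(x) \;=\; \log p_Z\!\bigl(f_\theta(x)\bigr) \;+\; \log \left| \det \frac{\partial f_\theta(x)}{\partial x} \right|
```

Because the transform is invertible, sampling amounts to running its inverse on noise drawn from the base distribution, and training can maximize this exact likelihood directly rather than a variational bound, which is why no separate encoder is required.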

The architecture incorporates a 'Global-Local' design, where broader temporal relationships are managed in a compact global space, and detailed aspects are handled locally within individual frames. This structure helps mitigate the accumulation of minor errors over longer video sequences. To improve efficiency, the system employs a 'video-aware Jacobi iteration,' allowing multiple blocks to be processed in parallel and accelerating the traditionally slow autoregressive generation process.
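Jacobi-style decoding replaces strict one-block-at-a-time generation with a parallel fixed-point iteration: every block is initialized with a guess, all blocks are updated simultaneously, and the process repeats until the sequence stops changing. The sketch below shows only the generic idea, with a hypothetical decode_step() standing in for one parallel model pass; the paper's 'video-aware' variant adds video-specific refinements not shown here.

```python
# Generic illustration of Jacobi-style parallel decoding (not the paper's exact
# "video-aware" variant). decode_step() is a hypothetical stand-in for a single
# parallel pass of the autoregressive model over all block positions.

def decode_step(prefix, blocks):
    """Placeholder: given the fixed prefix and current guesses for all blocks,
    return updated predictions for every block in one parallel pass."""
    raise NotImplementedError

def jacobi_decode(prefix, num_blocks, init_block, max_iters=50):
    blocks = [init_block] * num_blocks           # initial guess for every block
    for _ in range(max_iters):
        new_blocks = decode_step(prefix, blocks) # update all blocks at once
        if new_blocks == blocks:                 # fixed point reached: result matches
            break                                # sequential decoding
        blocks = new_blocks
    return blocks
```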

Benchmarking on the VBench platform reveals that STARFlow-V's performance is on par with current diffusion-based models, although it still trails behind leading commercial solutions such as Google's Veo 3 and Runway's Gen-3. Notably, the system is not without its flaws; for instance, some generated videos depict implausible scenarios, such as an octopus passing through solid objects or a hamster displaying unrealistic movement within a wheel. Furthermore, despite optimization efforts, the model's inference speed remains below real-time capabilities.

Apple has not yet disclosed specific plans for integrating STARFlow-V into its product ecosystem. However, the model's relatively compact size suggests potential for on-device deployment, which could be advantageous for privacy-focused applications or real-time augmented and virtual reality experiences. There is also speculation that the technology could support Apple's ambitions in robotics or serve as a foundational model for future multimedia AI applications.

Researchers and developers can access the model's code and technical documentation on GitHub, indicating an unusually open approach from Apple, which has traditionally maintained a closed stance toward its AI research. This move may signal a shift towards greater transparency and collaboration within the AI research community.
