Apple's "LLM in a Flash" lets you run huge AI models
CVS Health's Dan Woods just ran a 397-billion-parameter AI model (Qwen 3.5 397B) on his MacBook Pro with 48GB of RAM, thanks to Apple's "LLM in a Flash" technique.
It's impressive because models this big usually need giant servers, not laptops you can actually own.
How it works
"LLM in a Flash" streams most of the model from the SSD instead of loading it all into memory, so it only uses about 5.5GB of RAM.
The Mixture-of-Experts (MoE) setup activates just 4 out of 512 "experts" for each token, and clever compression shrinks the model's footprint further, so you don't need supercomputer-level hardware.
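The "4 out of 512 experts" step is a standard top-k gating computation. Here is a minimal sketch of that routing logic, assuming a simple linear gate (the function and variable names are made up for illustration and don't come from the article or from Qwen's actual code):

```python
import numpy as np

def route_top_k(hidden, gate_weights, k=4):
    """Pick the top-k experts for a single token (MoE gating sketch)."""
    logits = hidden @ gate_weights             # one score per expert, shape (n_experts,)
    top = np.argsort(logits)[-k:][::-1]        # indices of the k highest-scoring experts
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                       # softmax over just the selected experts
    return top, probs

rng = np.random.default_rng(0)
d, n_experts = 64, 512
hidden = rng.standard_normal(d)                # token's hidden state
gate = rng.standard_normal((d, n_experts))     # gating matrix

experts, mix = route_top_k(hidden, gate, k=4)
print(len(experts))  # 4 -- only these experts' weights need to be in RAM
```

Because only the selected experts' weights are needed per token, this routing is exactly what makes the SSD-streaming trick viable: the rest of the 512 experts can stay on disk.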
Implications for the future of AI
This experiment shows that running advanced AI locally is getting way more accessible, even on regular devices.
Most of the code was even written by other AI models during rapid-fire experiments, hinting at how much faster and easier building powerful tools could become for everyone.