In case you missed it, here is a research paper that postulates something game-changing. In their paper, LLM in a flash: Efficient Large Language Model Inference with Limited Memory, Apple engineers propose tackling the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters on flash memory and bringing them into DRAM on demand. If this approach can be mastered and productized, it will allow devices like iPhones to run LLMs. And that will be game-changing.
I don’t need to explain why LLMs are getting all the attention. LLMs can contain hundreds of billions or even trillions of parameters. The very scale that makes these models powerful also makes them challenging to load and run efficiently, especially on resource-constrained devices.
As highlighted in the paper, the current approach is to load the entire model into DRAM for inference, which significantly limits the maximum model size that can be run on many devices. The paper gives an example: a 7-billion-parameter model requires over 14 GB of memory just to load the parameters in half-precision floating-point format, exceeding the capabilities of most edge devices. The approach described in the paper can bypass this constraint.
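A quick back-of-the-envelope check of that 14 GB figure (a minimal sketch; the only assumption is that half precision means 2 bytes per parameter):

```python
# Rough memory footprint of a 7B-parameter model in half precision (fp16).
params = 7_000_000_000      # 7 billion parameters
bytes_per_param = 2         # fp16 stores each parameter in 2 bytes
print(params * bytes_per_param / 1e9, "GB")  # -> 14.0 GB, before activations or KV cache
```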
The proposed approach involves developing an inference cost model that harmonizes with flash memory behavior, guiding optimization in two critical areas (a rough sketch of the tradeoff follows the list):
- Reducing the volume of data transferred from flash
- Reading data in larger, more contiguous chunks
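To see why both levers matter, here is a toy latency estimate. This is my own simplification for illustration, not the paper's actual cost model, and the latency and bandwidth numbers are assumed:

```python
# Toy flash-read cost: each read pays a fixed setup latency plus a
# bandwidth-bound transfer time. The numbers below are illustrative assumptions.
READ_LATENCY_S = 1e-4       # assumed ~100 microseconds per random read
BANDWIDTH_BPS = 2e9         # assumed ~2 GB/s sequential flash bandwidth

def load_time(total_bytes: float, chunk_bytes: float) -> float:
    """Estimated time to pull total_bytes from flash in chunk_bytes chunks."""
    num_reads = total_bytes / chunk_bytes
    return num_reads * READ_LATENCY_S + total_bytes / BANDWIDTH_BPS

# Same 1 GB of weights, different chunk sizes: larger contiguous reads win.
print(load_time(1e9, 4 * 1024))      # many 4 KB reads  -> latency-dominated (~25 s)
print(load_time(1e9, 1024 * 1024))   # fewer 1 MB reads -> near bandwidth limit (~0.6 s)
```

Reducing the volume of data attacks the transfer term; reading in larger chunks attacks the per-read overhead.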
The proposed framework introduces two principal techniques (a brief sketch of both follows the list):
- Windowing: Strategically reduces data transfer by reusing previously activated neurons
- Row-column bundling: Tailored to flash memory’s sequential data-access strengths, it increases the size of the data chunks read from flash memory
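To make the two techniques concrete, here is a minimal sketch. The window size, the neuron bookkeeping, and the array shapes are my own illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# --- Windowing (illustrative sketch) ---
# Keep the neurons activated for the last few tokens resident in DRAM, and
# fetch from flash only the neurons newly needed for the current token.
WINDOW = 4                                  # assumed sliding-window size, in tokens
recent_activations: list[set[int]] = []     # activated neuron indices per recent token

def neurons_to_fetch(current_active: set[int]) -> set[int]:
    """Return only the neuron indices not already cached by the window."""
    cached = set().union(*recent_activations) if recent_activations else set()
    recent_activations.append(current_active)
    if len(recent_activations) > WINDOW:
        recent_activations.pop(0)
    return current_active - cached

# --- Row-column bundling (illustrative sketch) ---
# Store a feed-forward neuron's up-projection row next to its down-projection
# column, so one contiguous flash read brings in both.
def bundle(up_proj: np.ndarray, down_proj: np.ndarray) -> np.ndarray:
    # up_proj: (num_neurons, d_model), down_proj: (d_model, num_neurons)
    return np.concatenate([up_proj, down_proj.T], axis=1)  # one bundle per row
```

The intuition behind windowing, as I read it, is that the neurons activated for nearby tokens overlap heavily, so much of what the current token needs is already sitting in DRAM.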
The gist, as quoted in the paper, is: “Our integration of sparsity awareness, context-adaptive loading, and a hardware-oriented design paves the way for effective inference of LLMs on devices with limited memory.”
Think about the implications of this for Edge AI. It pushes the boundaries of the constraints on Edge AI even further. This is not just about running LLMs on iPhones. As the approach gets refined, the range of limited-memory devices that can support LLMs will grow. And with that comes exponential growth in innovation. I suspect that within 2-3 years, we will see LLMs with decent capability running on our phones.
The current resource constraints of edge devices limit the Edge AI solutions you can design. Progress like this can help erase those limitations. The progress is now two-fold: on one hand, the resource capacity of edge devices keeps increasing; on the other, this advance allows significantly better use of limited memory. In shop-floor operations, I can already see possibilities for Edge AI solutions that could not have been designed before. LLMs running on mobile devices on warehousing and manufacturing floors can help organizations take another leap in their digital transformation journey. The opportunities extend beyond operations into marketing.
There has never been a better time to live than now for anyone excited about such advances in AI.

