Apple researchers have developed a method that allows the company's devices, such as iPhones, to run large language models locally without exceeding the devices' available memory. In a research paper (PDF), the team describes a new technique in which the AI model's data is kept in the device's flash storage and loaded on demand, so that the DRAM capacity is not exceeded.
The approach uses two methods that minimize data transfer and maximize throughput. The first, 'windowing', reuses data that has already been loaded for recently processed tokens, so that less data needs to be fetched from flash. The second, 'row-column bundling', groups related data together so that it can be read from flash memory in larger, faster sequential chunks.
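The article only names the techniques; as a rough illustration of what 'row-column bundling' could look like, here is a minimal sketch under assumed details: in a feed-forward layer, neuron i needs row i of the up-projection and column i of the down-projection at the same time, so storing both contiguously lets one sequential read fetch everything that neuron requires. All names and sizes below are hypothetical.

```python
import numpy as np

def bundle(up_proj: np.ndarray, down_proj: np.ndarray) -> np.ndarray:
    """Store row i of up_proj and column i of down_proj contiguously.

    up_proj:   shape (hidden, d_model), row i used by neuron i
    down_proj: shape (d_model, hidden), column i used by neuron i
    Returns a (hidden, 2 * d_model) array: one contiguous bundle per neuron,
    so a single sequential read from flash fetches both halves at once.
    """
    return np.concatenate([up_proj, down_proj.T], axis=1)

def load_neurons(bundled: np.ndarray, active: list[int], d_model: int):
    """Fetch only the bundles of the active neurons (sparse loading)."""
    chunk = bundled[active]          # one contiguous read per active neuron
    up = chunk[:, :d_model]          # recovered up-projection rows
    down = chunk[:, d_model:].T      # recovered down-projection columns
    return up, down

# Tiny demo with made-up sizes: bundle, then load two "active" neurons.
d_model, hidden = 4, 6
up = np.arange(hidden * d_model, dtype=float).reshape(hidden, d_model)
down = np.arange(d_model * hidden, dtype=float).reshape(d_model, hidden)
b = bundle(up, down)
u, d = load_neurons(b, [1, 3], d_model)
assert np.allclose(u, up[[1, 3]])
assert np.allclose(d, down[:, [1, 3]])
```

The point of the layout, under these assumptions, is that each read from flash is twice as large and fully sequential, which suits flash storage's preference for large contiguous reads over many small random ones.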
By combining these methods, AI models up to twice the size of an iPhone's available DRAM could still run locally on the device. The technique is also said to run four to five times faster than naive loading on the CPU, and even twenty to twenty-five times faster on the GPU. The researchers call this a technological 'breakthrough' that will be crucial for 'the use of advanced LLMs in environments with limited resources'. It is not stated whether this means a future iPhone will actually ship with on-device AI.
Source: Apple Research (PDF)