Efficient Reasoning on the Edge
Challenges: Reasoning on the Edge
Deploying reasoning models on smartphones presents three core challenges:
- Memory bottleneck: Chain-of-thought traces can exceed 4,000 tokens, creating KV-cache footprints that strain mobile memory bandwidth and capacity.
- Latency overhead: Generating verbose reasoning traces token-by-token can take several minutes on mobile processors, making real-time interaction infeasible.
- Resource waste: Most user queries need simple chat responses, yet existing reasoning models apply the same expensive processing to every request.
LoRA Reasoning Adapters
To bring reasoning to the edge, we rely on Low-Rank Adaptation (LoRA) rather than full model fine-tuning. This modular approach gives us a critical advantage: because LoRA adapters are lightweight and kept separate from the base model, we can toggle them on or off instantly. As a result, the device can switch seamlessly between cheap, general-purpose chat and expensive, high-performance reasoning.
We built these reasoning capabilities by fine-tuning our adapters on the OpenThoughts3-1.2M dataset, exposing the model to over a million diverse examples across math, code, and science. To ensure the model had sufficient capacity to absorb these complex reasoning patterns, we applied LoRA adapters with a rank of 128 and alpha of 256 across all linear layers for 5 epochs.
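As a rough illustration, the adapter setup described above maps onto a standard PEFT configuration along the following lines. Only the rank, alpha, and "all linear layers" choices come from the text; the checkpoint name, dtype, and dropout are illustrative assumptions rather than our exact recipe.

```python
# Minimal sketch of the reasoning-adapter setup using Hugging Face PEFT.
# Rank, alpha, and target modules follow the text above; other details are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct", torch_dtype="bfloat16")

lora_cfg = LoraConfig(
    r=128,                        # adapter rank
    lora_alpha=256,               # scaling factor
    target_modules="all-linear",  # attach adapters to every linear layer
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()
```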
| Model | AIME25 | MATH500 | GPQA | AMC '23 |
|---|---|---|---|---|
| Qwen2.5-7B-Instruct | 0.17 | 0.77 | 0.37 | 0.60 |
| R1-Distill-Qwen2.5-7B-Instruct | 0.42 | 0.92 | 0.51 | 0.91 |
| Qwen2.5-7B-Instruct (dense, Ours) | 0.49 | 0.96 | 0.47 | 0.86 |
| Qwen2.5-7B-Instruct (LoRA 128, Ours) | 0.43 | 0.92 | 0.44 | 0.82 |
| Qwen2.5-7B-Instruct (W4A16 - LoRA 128, Ours) | 0.40 | 0.92 | 0.45 | 0.85 |
While LoRA adapters successfully teach the 7B model to reason, the resulting traces remain too verbose for mobile deployment. This motivates our budget forcing approach.
Budget Forcing
Deploying inference-time scaling on resource-constrained devices, particularly smartphones, is hindered by the latency of generating full Chain-of-Thought (CoT) traces. Because standard LLMs reason verbosely, producing a final answer can take several minutes, a bottleneck that makes real-time interaction infeasible. Budget forcing, a Reinforcement Learning (RL) based fine-tuning method, addresses this: by training the model to generate substantially more concise responses, it cuts latency enough to make inference-time scaling practical on edge devices.
To instantiate this approach, we implemented a specialized RL training pipeline governed by a dual-objective reward function. The primary objective is to minimize the generation budget — quantified as the total number of tokens produced — while simultaneously preserving, or potentially improving upon, the accuracy of the base model by maintaining its capacity to sample correct solutions. This training regime incentivizes the model to discover and follow the shortest viable reasoning trajectory from the input prompt to the correct final answer. The resulting model is capable of producing compressed CoT traces that remove superfluous content while preserving the essential deductive steps required for solving complex problems.
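As a concrete sketch, the dual-objective reward can be thought of as a correctness term minus a length penalty. The weighting, budget, and normalization below are assumptions for illustration, not the trained recipe.

```python
def budget_forcing_reward(response_tokens: list[int],
                          is_correct: bool,
                          token_budget: int = 1024,
                          length_weight: float = 0.5) -> float:
    """Hypothetical dual-objective reward: stay correct while spending few tokens."""
    correctness = 1.0 if is_correct else 0.0
    # Normalized generation cost: 0 for an empty response, 1 at (or beyond) the budget.
    length_penalty = min(len(response_tokens) / token_budget, 1.0)
    return correctness - length_weight * length_penalty
```

Under this shaping, the highest-reward trajectories are those that reach a correct answer with the fewest tokens, which is exactly the behavior described above.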
Switcher
Making individual reasoning traces efficient is only part of the solution: to avoid paying the cost of a long reasoning trace on every request, we attach a lightweight switcher head that enables the LoRA reasoning adapters only when the prompt actually requires reasoning. The switcher sits on top of the final transformer layer, takes the mean-pooled hidden states over all prompt tokens as input, and outputs a binary decision between chat mode and reasoning mode. In chat mode, the model responds using only the frozen base weights, while in reasoning mode the LoRA adapters are activated on top of the same base model.
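A minimal sketch of such a head is below; the module name and the two-way logit layout are our own illustrative choices.

```python
import torch
import torch.nn as nn

class SwitcherHead(nn.Module):
    """Illustrative chat-vs-reasoning classifier on top of the final transformer layer."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 2)  # index 0 = chat, index 1 = reasoning

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: [batch, prompt_len, hidden]; mean-pool over non-padded prompt tokens.
        mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
        pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.classifier(pooled)  # logits over {chat, reasoning}
```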
To let both modes share a single KV cache, we always encode the prompt with the base model alone and train the reasoning LoRA adapters to decode tokens conditioned on KV states produced without LoRA. In practice, this lets us reuse the same KV cache when switching between chat and reasoning, without re-encoding the prompt with LoRA applied. As a result, the switcher design keeps everyday chat fast and inexpensive, while only turning on the budget-forced LoRA reasoning traces for genuinely complex, multi-step queries.
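To make the shared-KV-cache idea concrete, a simplified decode loop might look like the following. This is a sketch, not the deployed runtime: it assumes batch size 1, greedy decoding, no stopping criteria, a PEFT-wrapped model with the reasoning LoRA loaded (so `disable_adapter()` is the standard PEFT context manager), and the hypothetical switcher head above.

```python
import torch

@torch.no_grad()
def respond(model, switcher, input_ids, attention_mask, max_new_tokens=256):
    # 1) Prefill the prompt with adapters disabled, so the KV cache is always "base".
    with model.disable_adapter():
        out = model(input_ids=input_ids, attention_mask=attention_mask,
                    use_cache=True, output_hidden_states=True)
    past = out.past_key_values

    # 2) Route once per request: enable the reasoning LoRA only if the switcher asks for it.
    use_reasoning = switcher(out.hidden_states[-1], attention_mask).argmax(dim=-1).item() == 1

    # 3) Decode token by token, reusing the same base-model KV cache in both modes.
    next_id = out.logits[:, -1:].argmax(dim=-1)
    generated = [next_id]
    for _ in range(max_new_tokens - 1):
        if use_reasoning:
            step = model(input_ids=next_id, past_key_values=past, use_cache=True)
        else:
            with model.disable_adapter():
                step = model(input_ids=next_id, past_key_values=past, use_cache=True)
        past = step.past_key_values
        next_id = step.logits[:, -1:].argmax(dim=-1)
        generated.append(next_id)
    return torch.cat(generated, dim=1)
```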
Parallel Reasoning and Verification
While budget forcing reduces the cost of sequential generation, parallel compute on modern mobile processors offers a complementary path to improve reasoning performance. Instead of reducing sequence length, parallel reasoning aims to increase accuracy while maintaining a similar latency budget. Our parallel reasoning features:
- Increased compute utilization during the memory-bound generation process by exploring multiple reasoning paths independently and concurrently.
- Reduced memory and compute overhead through a joint Generation-Verification architecture that combines a base generator model with a verification head, which minimizes the verifier's memory footprint, avoids model-switching costs (such as loading parameters into DRAM), and reuses the existing KV cache (a simplified sketch follows this list).
- Improved accuracy by parallel verification, as each reasoning chain is independently verified for correctness and assigned a score, allowing for rapid validation of multiple solutions.
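A best-of-N style sketch of this idea follows. The verification head, sampling parameters, and the extra re-scoring forward pass are illustrative simplifications; in the deployed design the verifier reads hidden states already produced during generation rather than re-encoding each chain.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def parallel_reason_and_verify(model, tokenizer, verifier_head: nn.Linear,
                               prompt: str, n_chains: int = 4) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    gen = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.8,
        num_return_sequences=n_chains,   # explore n_chains reasoning paths concurrently
        max_new_tokens=1024,
        return_dict_in_generate=True,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Score every chain with the verification head (here: mean-pooled final hidden states).
    # For clarity this sketch re-encodes the chains; the deployed design reuses the
    # generator's existing forward passes and KV cache instead.
    seqs = gen.sequences                                     # [n_chains, total_len]
    attn = (seqs != tokenizer.eos_token_id).long()           # mask out right-padding
    hidden = model(seqs, attention_mask=attn, output_hidden_states=True).hidden_states[-1]
    mask = attn.unsqueeze(-1).to(hidden.dtype)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
    scores = verifier_head(pooled).squeeze(-1)               # one scalar score per chain
    best = scores.argmax().item()
    return tokenizer.decode(seqs[best], skip_special_tokens=True)
```

Here `verifier_head` is assumed to be a small scoring layer (e.g. `nn.Linear(hidden_size, 1)`) trained to predict whether a chain reaches a correct answer.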
BibTeX
@misc{efficient_reasoning_edge,
title = {Efficient Reasoning on the Edge},
author = {Yelysei Bondarenko and Thomas Hehn and Rob Hesselink and Romain Lepert and Fabio Valerio Massoli and Evgeny Mironov and Leyla Mirvakhabova and Spyridon Stasis and Tribhuvanesh Orekondy and Andrey Kuzmin and Anna Kuzina and Markus Nagel and Corrado Rainone and Ork de Rooij and Paul N Whatmough and Arash Behboodi and Babak Ehteshami Bejnordi},
howpublished = {\url{https://qualcomm-ai-research.github.io/llm-reasoning-on-edge/}},
note = {Qualcomm AI Research},
year = {2025}
}