Efficient Reasoning on the Edge
Challenges: Reasoning on the Edge
Deploying reasoning models on smartphones presents three core challenges:
- Memory bottleneck: Chain-of-thought traces can exceed 4,000 tokens, creating KV-cache footprints that strain mobile memory bandwidth and capacity.
- Latency overhead: Generating verbose reasoning traces token-by-token can take several minutes on mobile processors, making real-time interaction infeasible.
- Resource waste: Most user queries need simple chat responses, yet existing reasoning models apply the same expensive processing to every request.
LoRA Reasoning Adapters
To bring reasoning to the edge, we rely on Low-Rank Adaptation (LoRA) rather than full model fine-tuning. This modular approach gives us a critical advantage: because LoRA adapters are lightweight and kept separate from the base model, we can toggle them on or off instantly. As a result, the device can switch seamlessly between cheap, general-purpose chat and expensive, high-performance reasoning.
We built these reasoning capabilities by fine-tuning our adapters on the OpenThoughts3-1.2M dataset, exposing the model to over a million diverse examples across math, code, and science. To ensure the model had sufficient capacity to absorb these complex reasoning patterns, we applied LoRA adapters with a rank of 128 and alpha of 256 across all linear layers for 5 epochs.
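As a rough illustration, the adapter setup described above maps onto a standard PEFT configuration along the following lines. Only the rank, alpha, and "all linear layers" choices come from the text; the checkpoint name, dtype, and dropout are illustrative assumptions rather than our exact recipe.

```python
# Minimal sketch of the reasoning-adapter setup using Hugging Face PEFT.
# Rank, alpha, and target modules follow the text above; other details are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct", torch_dtype="bfloat16")

lora_cfg = LoraConfig(
    r=128,                        # adapter rank
    lora_alpha=256,               # scaling factor
    target_modules="all-linear",  # attach adapters to every linear layer
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()
```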
| Model | AIME25 | MATH500 | GPQA | AMC '23 |
|---|---|---|---|---|
| Qwen2.5-7B-Instruct | 0.17 | 0.77 | 0.37 | 0.60 |
| R1-Distill-Qwen2.5-7B-Instruct | 0.42 | 0.92 | 0.51 | 0.91 |
| Qwen2.5-7B-Instruct (dense, Ours) | 0.49 | 0.96 | 0.47 | 0.86 |
| Qwen2.5-7B-Instruct (LoRA 128, Ours) | 0.43 | 0.92 | 0.44 | 0.82 |
| Qwen2.5-7B-Instruct (W4A16 - LoRA 128, Ours) | 0.40 | 0.92 | 0.45 | 0.85 |
While LoRA adapters successfully teach the 7B model to reason, the resulting traces remain too verbose for mobile deployment. This motivates our budget forcing approach.
Budget Forcing
Deploying inference-time scaling on resource-constrained devices, particularly smartphones, is hindered by the latency of generating full Chain-of-Thought (CoT) traces. Because standard LLMs reason verbosely, producing a final answer can take several minutes, a bottleneck that makes real-time interaction infeasible. Budget forcing, a Reinforcement Learning (RL) based fine-tuning method, addresses this: by training the model to generate substantially more concise responses, it cuts latency enough to make inference-time scaling practical on edge devices.
To instantiate this approach, we implemented a specialized RL training pipeline governed by a dual-objective reward function. The primary objective is to minimize the generation budget — quantified as the total number of tokens produced — while simultaneously preserving, or potentially improving upon, the accuracy of the base model by maintaining its capacity to sample correct solutions. This training regime incentivizes the model to discover and follow the shortest viable reasoning trajectory from the input prompt to the correct final answer. The resulting model is capable of producing compressed CoT traces that remove superfluous content while preserving the essential deductive steps required for solving complex problems.
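As a concrete sketch, the dual-objective reward can be thought of as a correctness term minus a length penalty. The weighting, budget, and normalization below are assumptions for illustration, not the trained recipe.

```python
def budget_forcing_reward(response_tokens: list[int],
                          is_correct: bool,
                          token_budget: int = 1024,
                          length_weight: float = 0.5) -> float:
    """Hypothetical dual-objective reward: stay correct while spending few tokens."""
    correctness = 1.0 if is_correct else 0.0
    # Normalized generation cost: 0 for an empty response, 1 at (or beyond) the budget.
    length_penalty = min(len(response_tokens) / token_budget, 1.0)
    return correctness - length_weight * length_penalty
```

Under this shaping, the highest-reward trajectories are those that reach a correct answer with the fewest tokens, which is exactly the behavior described above.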
Switcher
Making individual reasoning traces efficient is only part of the solution: to avoid paying the cost of a long reasoning trace on every request, we attach a lightweight switcher head that enables the LoRA reasoning adapters only when the prompt actually requires reasoning. The switcher sits on top of the final transformer layer, takes the mean-pooled hidden states over all prompt tokens as input, and outputs a binary decision between chat mode and reasoning mode. In chat mode, the model responds using only the frozen base weights, while in reasoning mode the LoRA adapters are activated on top of the same base model.
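A minimal sketch of such a head is below; the module name and the two-way logit layout are our own illustrative choices.

```python
import torch
import torch.nn as nn

class SwitcherHead(nn.Module):
    """Illustrative chat-vs-reasoning classifier on top of the final transformer layer."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 2)  # index 0 = chat, index 1 = reasoning

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: [batch, prompt_len, hidden]; mean-pool over non-padded prompt tokens.
        mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
        pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.classifier(pooled)  # logits over {chat, reasoning}
```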
To let both modes share a single KV cache, we always encode the prompt with the base model alone and train the reasoning LoRA adapters to decode tokens conditioned on KV states produced without LoRA. In practice, this lets us reuse the same KV cache when switching between chat and reasoning, without re-encoding the prompt with LoRA applied. As a result, the switcher design keeps everyday chat fast and inexpensive, while only turning on the budget-forced LoRA reasoning traces for genuinely complex, multi-step queries.
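To make the shared-KV-cache idea concrete, a simplified decode loop might look like the following. This is a sketch, not the deployed runtime: it assumes batch size 1, greedy decoding, no stopping criteria, a PEFT-wrapped model with the reasoning LoRA loaded (so `disable_adapter()` is the standard PEFT context manager), and the hypothetical switcher head above.

```python
import torch

@torch.no_grad()
def respond(model, switcher, input_ids, attention_mask, max_new_tokens=256):
    # 1) Prefill the prompt with adapters disabled, so the KV cache is always "base".
    with model.disable_adapter():
        out = model(input_ids=input_ids, attention_mask=attention_mask,
                    use_cache=True, output_hidden_states=True)
    past = out.past_key_values

    # 2) Route once per request: enable the reasoning LoRA only if the switcher asks for it.
    use_reasoning = switcher(out.hidden_states[-1], attention_mask).argmax(dim=-1).item() == 1

    # 3) Decode token by token, reusing the same base-model KV cache in both modes.
    next_id = out.logits[:, -1:].argmax(dim=-1)
    generated = [next_id]
    for _ in range(max_new_tokens - 1):
        if use_reasoning:
            step = model(input_ids=next_id, past_key_values=past, use_cache=True)
        else:
            with model.disable_adapter():
                step = model(input_ids=next_id, past_key_values=past, use_cache=True)
        past = step.past_key_values
        next_id = step.logits[:, -1:].argmax(dim=-1)
        generated.append(next_id)
    return torch.cat(generated, dim=1)
```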
Parallel Reasoning and Verification
While budget forcing reduces the cost of sequential generation, parallel compute on modern mobile processors offers a complementary path to improve reasoning performance. Instead of reducing sequence length, parallel reasoning aims to increase accuracy while maintaining a similar latency budget. Our parallel reasoning features:
- Increased compute utilization during the memory-bound generation process by exploring multiple reasoning paths independently and concurrently.
- Reduced memory and compute overhead through a joint Generation-Verification architecture that combines a base generator model with a verification head, which minimizes the verifier's memory footprint, avoids model-switching costs (such as loading parameters into DRAM), and reuses the existing KV cache (a simplified sketch follows this list).
- Improved accuracy by parallel verification, as each reasoning chain is independently verified for correctness and assigned a score, allowing for rapid validation of multiple solutions.
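A best-of-N style sketch of this idea follows. The verification head, sampling parameters, and the extra re-scoring forward pass are illustrative simplifications; in the deployed design the verifier reads hidden states already produced during generation rather than re-encoding each chain.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def parallel_reason_and_verify(model, tokenizer, verifier_head: nn.Linear,
                               prompt: str, n_chains: int = 4) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    gen = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.8,
        num_return_sequences=n_chains,   # explore n_chains reasoning paths concurrently
        max_new_tokens=1024,
        return_dict_in_generate=True,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Score every chain with the verification head (here: mean-pooled final hidden states).
    # For clarity this sketch re-encodes the chains; the deployed design reuses the
    # generator's existing forward passes and KV cache instead.
    seqs = gen.sequences                                     # [n_chains, total_len]
    attn = (seqs != tokenizer.eos_token_id).long()           # mask out right-padding
    hidden = model(seqs, attention_mask=attn, output_hidden_states=True).hidden_states[-1]
    mask = attn.unsqueeze(-1).to(hidden.dtype)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
    scores = verifier_head(pooled).squeeze(-1)               # one scalar score per chain
    best = scores.argmax().item()
    return tokenizer.decode(seqs[best], skip_special_tokens=True)
```

Here `verifier_head` is assumed to be a small scoring layer (e.g. `nn.Linear(hidden_size, 1)`) trained to predict whether a chain reaches a correct answer.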
BibTeX
@misc{efficient_reasoning_edge,
title = {Efficient Reasoning on the Edge},
author = {Yelysei Bondarenko and Thomas Hehn and Rob Hesselink and Romain Lepert and Fabio Valerio Massoli and Evgeny Mironov and Leyla Mirvakhabova and Spyridon Stasis and Tribhuvanesh Orekondy and Andrey Kuzmin and Anna Kuzina and Markus Nagel and Corrado Rainone and Ork de Rooij and Paul N Whatmough and Arash Behboodi and Babak Ehteshami Bejnordi},
howpublished = {\url{https://qualcomm-ai-research.github.io/llm-reasoning-on-edge/}},
note = {Qualcomm AI Research},
year = {2025}
}