The first open small language model that reasons through tool interaction instead of chain-of-thought.
"Small models shouldn't think in words — they should think through actions."
Standard "reasoning" models generate thousands of words of internal monologue before answering. Manthan skips the monologue entirely — it calls tools, observes results, and iterates. A July 2025 paper showed this approach is far more effective for small models (sub-2B), yet no open model implemented it. Manthan-1.5B is that implementation — trained, evaluated, and published openly.
Built on Qwen2.5-1.5B-Instruct with 4-bit QLoRA via Unsloth. Total training cost: zero dollars, ~35 GPU hours on Kaggle T4 free tier. All weights and code are available openly.
Train on ~7K tool-interaction traces — synthetic traces (Claude-generated) combined with curated glaive + hermes function-calling datasets. ChatML format with tool_call / tool_response roles.
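A single training trace in this format might look like the following hypothetical sketch (the role and field names follow the tool_call / tool_response convention described above; the released dataset's exact schema may differ):

```python
# Hypothetical sketch of one SFT trace in the ChatML tool-interaction format.
# Field names and values are illustrative only.
trace = [
    {"role": "user", "content": "What is 17% of 2340?"},
    {"role": "assistant",
     "content": '<tool_call>{"code": "print(0.17 * 2340)"}</tool_call>'},
    {"role": "tool_response", "content": "397.8"},
    {"role": "assistant", "content": "final_answer: 397.8"},
]
```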
Stage 1 cost: ~3 hrs · 5 GB VRAM.
Stage 2: Group Relative Policy Optimization (GRPO) with a novel reward signal: did the tool call execute? Was the result correct? Denser rewards than verbal CoT. Implemented via Unsloth + the TRL GRPOTrainer. Cost: ~25 hrs · 9 GB VRAM.
Stage 3: "Wait" token injection. A zero-cost inference technique: a LogitsProcessor injects a "Wait" token when the model tries to conclude before making the minimum required tool calls. Forces deeper reasoning at no training cost. Cost: 0 GPU hrs, inference-only.
Each claim is independently defensible. Together they place Manthan in a category of one.
Rainone et al. (arXiv:2507.05065) proved the concept but released no model or code. Manthan is the first downloadable, usable implementation.
Hugging Face's official agent framework, smolagents, references only 7B–72B models. Manthan fills that verified gap in the ecosystem as a drop-in CodeAgent.
Standard GRPO rewards final-answer correctness. Manthan rewards intermediate tool-execution success, creating denser learning signals and faster convergence.
The "Wait" token technique has only been applied to verbal CoT (s1 paper). Applying it to tool-interaction traces — forcing additional tool calls — is entirely unexplored.
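The injection mechanism can be sketched as a minimal logits processor. This is a sketch, not the project's released code: all token ids below are hypothetical placeholders, and a real implementation would subclass transformers.LogitsProcessor and operate on torch tensors.

```python
EOS_ID = 2        # hypothetical end-of-sequence token id
WAIT_ID = 7       # hypothetical "Wait" token id
TOOL_CALL_ID = 5  # hypothetical id of the "<tool_call>" opening token

class WaitInjector:
    """Suppress early stopping until the generation contains enough tool calls."""

    def __init__(self, min_tool_calls: int):
        self.min_tool_calls = min_tool_calls

    def __call__(self, input_ids: list[int], scores: list[float]) -> list[float]:
        calls_made = input_ids.count(TOOL_CALL_ID)
        about_to_stop = max(range(len(scores)), key=scores.__getitem__) == EOS_ID
        # Only intervene when the model is trying to conclude too early.
        if calls_made < self.min_tool_calls and about_to_stop:
            scores = scores[:]                   # don't mutate the caller's list
            scores[EOS_ID] = float("-inf")       # forbid stopping
            scores[WAIT_ID] = max(scores) + 1.0  # make "Wait" the top token
        return scores
```

In a real transformers pipeline, a processor like this would be passed to `model.generate(...)` via its `logits_processor` argument.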
Three independent, composable reward functions. Each returns a float in [0.0, 1.0]. Combined with configurable weights for flexible training objectives.
+0.5 if tool_call parses as valid JSON with non-empty code. +0.5 if sandboxed execution succeeds with output.
1.0 for exact match. 0.9 for numeric within 0.1%. 0.5 within 1%. 0.0 for wrong or missing final_answer.
0.1 if at least one <tool_call> block is present. Penalizes verbal chain-of-thought by rewarding the tool-mediated format.
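Under those definitions, the three rewards can be sketched as plain functions. This is a sketch, assuming a hypothetical `run_sandboxed` hook for code execution; function names and details in the released training code may differ.

```python
import json
import re

def tool_execution_reward(completion: str, run_sandboxed) -> float:
    """+0.5 for a valid-JSON tool call with non-empty code, +0.5 if it runs."""
    m = re.search(r"<tool_call>(.*?)</tool_call>", completion, re.DOTALL)
    if not m:
        return 0.0
    try:
        call = json.loads(m.group(1))
    except json.JSONDecodeError:
        return 0.0
    score = 0.0
    if isinstance(call, dict) and call.get("code", "").strip():
        score += 0.5
        ok, output = run_sandboxed(call["code"])  # hypothetical sandbox hook
        if ok and output:
            score += 0.5
    return score

def answer_reward(predicted, target: str) -> float:
    """1.0 exact match, 0.9 numeric within 0.1%, 0.5 within 1%, else 0.0."""
    if predicted is None:
        return 0.0
    if predicted.strip() == target.strip():
        return 1.0
    try:
        rel = abs(float(predicted) - float(target)) / max(abs(float(target)), 1e-9)
    except ValueError:
        return 0.0
    if rel <= 0.001:
        return 0.9
    if rel <= 0.01:
        return 0.5
    return 0.0

def format_reward(completion: str) -> float:
    """0.1 for using the tool-mediated format at all."""
    return 0.1 if "<tool_call>" in completion else 0.0

def combined_reward(completion, predicted, target, run_sandboxed,
                    weights=(1.0, 1.0, 1.0)) -> float:
    w1, w2, w3 = weights
    return (w1 * tool_execution_reward(completion, run_sandboxed)
            + w2 * answer_reward(predicted, target)
            + w3 * format_reward(completion))
```

Each function returns a float in [0.0, 1.0] on its own, and `combined_reward` applies the configurable weights.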
All metrics are measured tool-augmented — the model generates tool calls, executes them, then answers. The baseline is Qwen2.5-1.5B-Instruct with no fine-tuning. Manthan-1.5B posts large gains over that baseline at every stage and closes most of the distance to each target.
| Metric | Baseline (Qwen2.5-1.5B) | After SFT | After GRPO | Target |
|---|---|---|---|---|
| Tool call parsability | ~7% | ~50% | ~80% | >85% |
| GSM8K (tool-augmented) | ~45% | ~55% | ~62% | >65% |
| MBPP pass@1 | ~35% | ~42% | ~48% | >50% |
| smolagents CodeAgent success | untested | — | — | >70% |
| Avg tool calls / problem | N/A | ~1.2 | ~1.8 | 1.5 – 3.0 |
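The generate-execute-answer loop behind these numbers can be sketched as follows; `generate` and `execute` are hypothetical hooks standing in for the model and the sandbox, and the tag formats mirror the trace format used in training.

```python
import re

def tool_augmented_answer(generate, execute, prompt: str, max_turns: int = 5):
    """Alternate model generation and tool execution until the model emits a
    final_answer or the turn budget runs out."""
    transcript = prompt
    for _ in range(max_turns):
        completion = generate(transcript)
        transcript += completion
        final = re.search(r"final_answer:\s*(.+)", completion)
        if final:
            return final.group(1).strip()
        call = re.search(r"<tool_call>(.*?)</tool_call>", completion, re.DOTALL)
        if call:
            result = execute(call.group(1))  # hypothetical sandboxed run
            transcript += f"<tool_response>{result}</tool_response>"
    return None  # no answer within the turn budget
```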
Passionate about pushing the boundaries of small language models. Built Genesis Manthan to prove that tool-mediated reasoning — not verbal chain-of-thought — is the right paradigm for sub-2B models. Working under the Genesis AGI mission to make capable, open agentic AI accessible on consumer hardware.
Download the model, read the white paper, or try the live demo. Built in the open — for the open source community.