Genesis AGI · arXiv:2507.05065 · Now Published on HuggingFace

Genesis Manthan
मंथन-1.5B

The first open small language model that reasons through tool interaction instead of chain-of-thought.

"Small models shouldn't think in words — they should think through actions."

1.5B Parameters
>85% Tool Call Parsability
>65% GSM8K Accuracy
>50% MBPP pass@1
$0 Training Cost

Tool interaction beats verbal chain-of-thought

Standard "reasoning" models generate thousands of words of internal monologue before answering. Manthan skips the monologue entirely: it calls tools, observes results, and iterates. A July 2025 peer-reviewed paper (arXiv:2507.05065) showed this approach to be far more effective for small (sub-2B) models, yet no open model implemented it. Manthan-1.5B is that implementation: trained, evaluated, and published openly.

❌ Verbal Chain-of-Thought (standard)

Standard 1.5B reasoning

# User: What is 17 × 23?

Assistant: Let me think about this step
by step. First, I need to multiply
17 by 20, which gives me 340.
Then multiply 17 by 3, giving 51.
Adding those together: 340 + 51...
I believe the answer is 391.

# ~7% JSON parsability, error-prone
✅ Tool-Mediated Reasoning (Manthan)

Genesis Manthan approach

# User: What is 17 × 23?

<tool_call>
{"name": "python_repl",
"arguments": {"code": "print(17*23)"}}
</tool_call>

<tool_response>
{"result": "391", "success": true}
</tool_response>

<final_answer>391</final_answer>

# >85% parsability, verifiable, correct
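Parsability here means the `<tool_call>` payload extracts as valid JSON. A minimal checker in the spirit of the trace above (the tag names come from the example; the function name and exact regex are illustrative, not from the released code):

```python
import json
import re

# <tool_call> blocks as they appear in Manthan-style traces
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)

def extract_tool_calls(text: str) -> list[dict]:
    """Return every tool call whose payload parses as valid JSON."""
    calls = []
    for payload in TOOL_CALL_RE.findall(text):
        try:
            call = json.loads(payload)
        except json.JSONDecodeError:
            continue  # malformed block: counts against parsability
        if isinstance(call, dict) and "name" in call:
            calls.append(call)
    return calls

trace = """<tool_call>
{"name": "python_repl",
 "arguments": {"code": "print(17*23)"}}
</tool_call>"""
```

A completion is counted as parsable when `extract_tool_calls` returns at least one well-formed call.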

Three phases, fully trained and published

Built on Qwen2.5-1.5B-Instruct with 4-bit QLoRA via Unsloth. Total training cost: zero dollars, ~35 GPU hours on Kaggle T4 free tier. All weights and code are available openly.

Phase 1

Supervised Fine-Tuning

Train on ~7K tool-interaction traces — synthetic traces (Claude-generated) combined with curated glaive + hermes function-calling datasets. ChatML format with tool_call / tool_response roles.

~3 hrs · 5GB VRAM
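One such trace, written as ChatML-style messages. The role names follow the tool_call / tool_response scheme described above; this is a sketch of the shape, and the released dataset's exact schema may differ:

```python
# A single SFT example: user question, assistant tool call,
# tool response, assistant final answer (hypothetical schema).
trace = [
    {"role": "user", "content": "What is 17 * 23?"},
    {"role": "assistant", "content": (
        '<tool_call>\n'
        '{"name": "python_repl", "arguments": {"code": "print(17*23)"}}\n'
        '</tool_call>'
    )},
    {"role": "tool", "content": '{"result": "391", "success": true}'},
    {"role": "assistant", "content": "<final_answer>391</final_answer>"},
]
```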
Phase 2

GRPO with Tool Rewards

Group Relative Policy Optimization with a novel reward signal: did the tool call execute, and was the result correct? This yields denser rewards than verbal CoT provides. Implemented via Unsloth and TRL's GRPOTrainer.

~25 hrs · 9GB VRAM
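The tool-execution reward can be written as a plain function of the sampled completions. This sketch assumes TRL's completions-to-floats reward interface and stubs the sandbox with a local `exec`; a real sandbox would isolate and time-limit execution, and the released training code may differ:

```python
import contextlib
import io
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)

def run_sandboxed(code: str):
    """Stand-in for the sandboxed python_repl; returns stdout or None on error."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})  # illustrative only; NOT a real sandbox
    except Exception:
        return None
    return buf.getvalue() or None

def tool_execution_reward(completions, **kwargs):
    """+0.5 for a parsable tool call with non-empty code, +0.5 if it executes."""
    rewards = []
    for completion in completions:
        score = 0.0
        for payload in TOOL_CALL_RE.findall(completion):
            try:
                call = json.loads(payload)
            except json.JSONDecodeError:
                continue  # unparsable call earns nothing
            code = call.get("arguments", {}).get("code", "")
            if code:
                score = 0.5  # parsed with non-empty code
                if run_sandboxed(code) is not None:
                    score = 1.0  # executed successfully with output
                break
        rewards.append(score)
    return rewards
```

Such a function would be passed as one entry of `reward_funcs` to the GRPOTrainer alongside the correctness and format rewards.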
Phase 3

Budget Forcing

Zero-cost inference technique: a LogitsProcessor injects a "Wait" token when the model tries to conclude before making minimum required tool calls. Forces deeper reasoning at no training cost.

0 GPU hrs · inference-only
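The rule itself is framework-independent and can be sketched without transformers. In the real implementation the decision lives in a LogitsProcessor that masks the concluding token's logit; the token strings and helper name below are illustrative:

```python
def budget_force(next_token: str, generated_so_far: str,
                 min_tool_calls: int = 2) -> str:
    """Swap a premature conclusion for 'Wait' until enough tools have run."""
    calls_made = generated_so_far.count("<tool_call>")
    if next_token == "<final_answer>" and calls_made < min_tool_calls:
        return "Wait"  # nudges the model back into tool use
    return next_token
```

At the logits level the same rule is applied by setting the concluding token's logit to negative infinity and boosting "Wait", rather than by editing decoded strings.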

Four firsts, one model

Each claim is independently defensible. Together they place Manthan in a category of one.

01

First open sub-2B tool-mediated model

Rainone et al. (arXiv:2507.05065) demonstrated the concept but released no model or code. Manthan is the first downloadable, usable implementation.

02

First smolagents-optimized model under 3B

HuggingFace's official agent framework references only 7B–72B models. Manthan fills that ecosystem gap as a drop-in CodeAgent.

03

GRPO with tool-execution rewards

Standard GRPO rewards final-answer correctness. Manthan rewards intermediate tool-execution success, creating denser learning signals and faster convergence.

04

Budget forcing for agentic reasoning

The "Wait" token technique has so far been applied only to verbal CoT (the s1 paper). Applying it to tool-interaction traces, forcing additional tool calls, was previously unexplored.

Composable reward signals

Three independent, composable reward functions. Each returns a float in [0.0, 1.0]. Combined with configurable weights for flexible training objectives.

🔧

Tool Execution Reward

+0.5 if tool_call parses as valid JSON with non-empty code. +0.5 if sandboxed execution succeeds with output.

weight: 0.50

Answer Correctness Reward

1.0 for exact match. 0.9 for numeric within 0.1%. 0.5 within 1%. 0.0 for wrong or missing final_answer.

weight: 0.40
📐

Format Reward

0.1 if at least one <tool_call> block is present. Penalizes verbal chain-of-thought by rewarding the tool-mediated format.

weight: 0.10
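The three cards compose into one weighted score. A sketch of the correctness and format signals plus the combination, using the thresholds above and assuming the format card's 0.1 is its weighted contribution (weight 0.10 × score 1.0); names and exact parsing are illustrative:

```python
import re

FINAL_ANSWER_RE = re.compile(r"<final_answer>(.*?)</final_answer>", re.DOTALL)

def answer_correctness_reward(completion: str, target: str) -> float:
    """1.0 exact match, 0.9 within 0.1%, 0.5 within 1%, else 0.0."""
    match = FINAL_ANSWER_RE.search(completion)
    if match is None:
        return 0.0  # missing final_answer
    answer = match.group(1).strip()
    if answer == target.strip():
        return 1.0
    try:
        relative_error = abs(float(answer) - float(target)) / abs(float(target))
    except (ValueError, ZeroDivisionError):
        return 0.0  # non-numeric or degenerate target: treat as wrong
    if relative_error <= 0.001:
        return 0.9
    if relative_error <= 0.01:
        return 0.5
    return 0.0

def format_reward(completion: str) -> float:
    """1.0 if at least one tool_call block is present, else 0.0."""
    return 1.0 if "<tool_call>" in completion else 0.0

def combined_reward(completion: str, target: str, tool_score: float) -> float:
    """Weighted sum: 0.50 tool execution + 0.40 correctness + 0.10 format."""
    return (0.50 * tool_score
            + 0.40 * answer_correctness_reward(completion, target)
            + 0.10 * format_reward(completion))
```

Because each signal stays in [0.0, 1.0] and the weights sum to 1, the combined reward is also bounded in [0.0, 1.0].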

Measured, not claimed

All metrics are measured tool-augmented: the model generates tool calls, executes them, then answers. Baseline is Qwen2.5-1.5B-Instruct with no fine-tuning; the Target column lists the goal each metric is evaluated against at every stage.

| Metric | Baseline (Qwen2.5-1.5B) | After SFT | After GRPO | Target |
|---|---|---|---|---|
| Tool call parsability | ~7% | ~50% | ~80% | >85% |
| GSM8K (tool-augmented) | ~45% | ~55% | ~62% | >65% |
| MBPP pass@1 | ~35% | ~42% | ~48% | >50% |
| smolagents CodeAgent success | untested | | | >70% |
| Avg tool calls / problem | N/A | ~1.2 | ~1.8 | 1.5–3.0 |

Meet the creator

Shahansha Shaik
AI Researcher & Open Source Builder

Passionate about pushing the boundaries of small language models. Built Genesis Manthan to prove that tool-mediated reasoning — not verbal chain-of-thought — is the right paradigm for sub-2B models. Working under the Genesis AGI mission to make capable, open agentic AI accessible on consumer hardware.

Manthan-1.5B is live.

Download the model, read the white paper, or try the live demo. Built in the open — for the open source community.