AIML

by Chandra Pendyala

If your goal is reliable answers on real workloads (enterprise QA, doc analysis, math/coding with tools), you don’t need a frontier-scale LLM. A Small Language Model (SML, ~0.5–1B params) paired with a logic verifier, orchestrated through an Agent-to-Agent (A2A) workflow, and backed by an engineered RAG stack with strong verifiers can reach GPT-4-level outcomes on many tasks—at a small fraction of the cost. You’ll still lag on tool-resistant, long-horizon reasoning (e.g., HLE-style puzzles), but for most business cases this architecture is the highest ROI.

Summary Architecture:

[User/Task]
      |
[Planner (SML)] --decides subgoals & budgets--> [Retriever]
      |                                          /       \
      |                                      [BM25]    [Dense]
      |                                          \       /
      |                                        [Re-ranker]
      |                                             |
[Solver (SML)] <-- evidence & tools -- [Evidence Store / Tools]
      |    \          (code exec, math, SQL)
      |     \________ candidates / traces ________
      |                                           |
[Critic (SML)] --logic/NLI checks--> [Strong Verifiers: NLI, Math, Code, SQL, Consistency]
      |                                           |
      +----(replan / escalate)--------------------+
      |
[Selector] --weighted vote & abstain--> [Final Answer + Citations + Confidence]
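The control flow above can be sketched as a small orchestration loop. This is a minimal illustration, not the production implementation: `plan`, `retrieve`, `solve`, and `critic_score` are hypothetical stand-ins for the Planner/Solver/Critic model calls and the retrieval stack.

```python
def plan(task):
    # Planner (SML): decompose into subgoals with per-subgoal budgets.
    return [{"subgoal": task, "k": 3}]

def retrieve(subgoal):
    # Retriever: hybrid BM25 + dense, then re-ranked (stubbed here).
    return ["evidence span A", "evidence span B"]

def solve(subgoal, evidence):
    # Solver (SML): grounded hypothesis citing evidence spans.
    return {"answer": "42", "citations": evidence[:1]}

def critic_score(candidate, evidence):
    # Critic + verifiers: a 0..1 score; here, just "has citations".
    return 1.0 if candidate["citations"] else 0.0

def run(task, abstain_threshold=0.5):
    candidates = []
    for sub in plan(task):
        evidence = retrieve(sub["subgoal"])
        for _ in range(sub["k"]):              # self-consistency samples
            cand = solve(sub["subgoal"], evidence)
            cand["score"] = critic_score(cand, evidence)
            candidates.append(cand)
    best = max(candidates, key=lambda c: c["score"])
    if best["score"] < abstain_threshold:      # abstain under uncertainty
        return {"answer": None, "abstained": True}
    return {**best, "abstained": False}
```

The key design point is that the Selector step (the final `max` plus threshold) protects precision: a low-scoring best candidate triggers abstention or escalation rather than a confident wrong answer.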

Why this architecture works now
• Scaling laws still favor bigger models—but returns on generalization per dollar are flattening for non-frontier teams.
• Externalizing knowledge (RAG) and verifying outputs (logic/math/code/tabular verifiers) beats trying to “store the world” in parameters.
• Adaptive workflows (A2A) use thinking tokens only where they pay off, controlling latency and spend.

Key ideas

  • SML: handles grammar, composition, and orchestration.
  • Logic model/verifier: small NLI/consistency model catches unsupported or contradictory claims.
  • A2A: Planner → Solver → Critic with a budget policy (how many steps, how many self-consistency samples k, when to call tools).
  • Engineered RAG: hybrid retrieval, cross-encoder re-ranking, semantic chunking, MMR diversity, freshness gates.
  • Strong verifiers: symbolic math/units; code execution; table/SQL evaluation; NLI/entailment for faithfulness.
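The A2A budget policy can be made concrete with a toy sketch. The thresholds and budgets below are illustrative assumptions, not tuned values; `difficulty` stands in for an upstream estimate such as the Planner's confidence.

```python
def budget_policy(difficulty: float, tool_available: bool) -> dict:
    """Toy budget policy: spend thinking tokens only where they pay off.

    difficulty is assumed to be in [0, 1]; cutoffs are placeholders.
    Returns the number of reasoning steps, self-consistency samples k,
    and whether to invoke tools for this subgoal.
    """
    if difficulty < 0.3:
        return {"steps": 1, "k": 1, "use_tools": False}
    if difficulty < 0.7:
        return {"steps": 3, "k": 3, "use_tools": tool_available}
    return {"steps": 6, "k": 5, "use_tools": tool_available}
```

In a real system this policy would be learned via the budget-aware RL described below, with compute entering the reward as a penalty term.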

What “good” looks like (expected performance ranges)

| Task (setting)                  | SML + Logic + A2A + Engineered RAG + Verifiers | GPT-4-class (reference) |
|---------------------------------|------------------------------------------------|-------------------------|
| Enterprise QA (doc-grounded)    | 85–95% EM on curated evals                     | 80–90%                  |
| Open-domain QA (RAG allowed)    | 75–85% EM                                      | 75–85%                  |
| Coding w/ execution tests       | pass@1: 55–65% (mid-diff sets)                 | 55–70%                  |
| Math w/ solver                  | 65–80% (e.g., GSM8K-style)                     | 70–85%                  |
| MMLU-Open (RAG/tools allowed)   | 60–68%                                         | ~65–72%                 |
| MMLU-Strict (no tools)          | 52–58%                                         | ~70%                    |
| HLE (reasoning, tools allowed)  | 12–16%                                         | ~18–24%                 |

Because the architecture is RAG-dependent, it performs strongly on hallucination benchmarks, and because answers are doc-grounded, result quality on evidence-based tasks is competitive with much larger LLMs.

Components that matter (and why)

  1. Planner (SML)
    • Decomposes task → subgoals; assigns a thinking budget B and self-consistency sample count k.
    • Generates disjoint retrieval queries to maximize evidence diversity.
  2. Retriever + Re-ranker
    • Hybrid (BM25 + dense) → cross-encoder re-ranker (≤110M) → MMR diversity.
    • Chunking 512–800 tokens with 64–96 overlap; semantic tiling for tables.
  3. Solver (SML)
    • Produces grounded hypotheses citing spans, calls tools (math, code, SQL).
    • Keeps chains short and auditable (“program-of-thought” style).
  4. Critic (SML) + Strong Verifiers
    • NLI/entailment: evidence ⇒ claim, contradiction detection.
    • Math/units: symbolic checks (e.g., SymPy) + unit algebra.
    • Code: sandbox run + unit tests.
    • Tabular/SQL: query-then-answer; compare result vs natural-language claim.
    • Logic consistency across steps/hypotheses.
    • Feeds scores into selection and the RL reward.
  5. Selector & Abstain
    • Verifier-weighted voting over k traces; abstain/route-up under uncertainty to protect precision.
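The MMR diversity step in the Retriever can be sketched over pre-computed similarities. This is a minimal illustration under the assumption that relevance and chunk-to-chunk similarity scores are already available (e.g., from the dense encoder); the λ value is a common default, not a tuned choice.

```python
def mmr(query_sim, doc_sims, lam=0.7, top_n=3):
    """Maximal Marginal Relevance selection over retrieved chunks.

    query_sim: relevance score of each chunk vs. the query
    doc_sims:  doc_sims[i][j] = similarity between chunks i and j
    Picks chunks that balance relevance against redundancy, so two
    near-duplicate top hits don't crowd out a distinct third source.
    """
    selected, remaining = [], list(range(len(query_sim)))
    while remaining and len(selected) < top_n:
        def mmr_score(i):
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lam * query_sim[i] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With two near-duplicate top hits, MMR skips the redundant second chunk in favor of a less relevant but distinct one, which is exactly the "evidence diversity" the Planner's disjoint queries aim for.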

Training & cost envelope (order-of-magnitude)

  • SML pretrain / continue-pretrain (0.5–1B params)
    • 50–100B high-quality tokens for grammar/semantics/structure.
    • Cost: ~$50k–$200k on efficient H100 clusters (high utilization).
  • SFT + RL (“budget-aware thinking”)
    • Few-hundred-k curated traces; RL with reward = faithfulness + success + consistency − compute.
    • Cost: ~$10k–$60k.
  • Verifiers & retrieval
    • Train NLI and re-ranker (small models): $5k–$20k.
    • RAG infra (100M docs, FAISS/HNSW + re-ranker): $10k–$50k/yr.

Total v1: ~$70k–$300k to reach strong, production-ready performance on evidence-grounded workloads (ex-data licensing).
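At inference time, the cost/answer KPI tracked below decomposes the same way. A back-of-envelope sketch, where every price is a placeholder the reader should replace with their own measured numbers:

```python
def cost_per_answer(tokens_in, tokens_out, price_in_per_m, price_out_per_m,
                    retrieval_cost, verifier_passes, verifier_cost):
    """Back-of-envelope cost per answer (all prices are placeholders).

    price_*_per_m: $ per million input/output tokens for the SML
    retrieval_cost: $ per query across the hybrid retrieval stack
    verifier_cost: $ per verifier pass (NLI, math, code, SQL)
    """
    llm = (tokens_in / 1e6) * price_in_per_m + (tokens_out / 1e6) * price_out_per_m
    return llm + retrieval_cost + verifier_passes * verifier_cost
```

Because the SML term dominates only at frontier-model prices, the verifier and retrieval passes stay a small fraction of total spend, which is what makes verifier-heavy workflows affordable at SML scale.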

KPIs you should track

  • Faithfulness@k (claimed spans entail answer).
  • Consistency@k (agreement across traces).
  • Answer-Change Rate when masking top evidence (leak check).
  • Exact Match / F1 on QA sets; pass@k on coding; unit-correct on math.
  • Cost/answer (tokens + retrieval + verifier passes).
  • Abstain rate (and business impact).
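The first two KPIs above can be computed directly from the k traces. A minimal sketch: `entails` stands in for an NLI call (evidence ⇒ claim) and is a hypothetical callback, not a real library API.

```python
from collections import Counter

def consistency_at_k(answers):
    """Fraction of traces that agree with the majority answer."""
    if not answers:
        return 0.0
    _, count = Counter(answers).most_common(1)[0]
    return count / len(answers)

def faithfulness_at_k(traces, entails):
    """Fraction of traces whose every cited span entails the answer.

    entails(span, answer) -> bool is assumed to wrap the NLI verifier.
    """
    if not traces:
        return 0.0
    faithful = sum(
        1 for t in traces
        if all(entails(span, t["answer"]) for span in t["citations"])
    )
    return faithful / len(traces)
```

The same per-trace entailment bits feed the Selector's verifier-weighted vote, so the KPI and the runtime gate share one implementation.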

Common failure modes & mitigations

  • Retriever myopia → diversify queries; enforce lexical/semantic diversity; MMR.
  • Hallucinated synthesis → hard faithfulness gate (NLI must pass; show spans).
  • Math/coding errors → require successful tool execution; retry with minimal temperature.
  • Latency spikes → budget-aware RL; escalate locally (per subgoal), not globally.
  • Over-abstain → calibrate thresholds per domain; route to a larger model only when ROI is clear.
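The last two mitigations combine into a per-subgoal routing rule. A toy sketch, assuming a scalar verifier score per subgoal; the threshold and margin are per-domain calibration targets, not fixed constants.

```python
def route(score, threshold, retry_margin=0.1, can_escalate=True):
    """Per-subgoal routing from a verifier score in [0, 1].

    accept   - verifier score clears the calibrated domain threshold
    retry    - near-miss: re-sample locally at minimal temperature
    escalate - hand this subgoal (not the whole task) to a larger model
    abstain  - no larger model worth the cost; protect precision
    """
    if score >= threshold:
        return "accept"
    if score >= threshold - retry_margin:
        return "retry"
    return "escalate" if can_escalate else "abstain"
```

Routing per subgoal rather than per task is what keeps latency spikes local: one weak retrieval hop triggers one escalation, not a full-task handoff.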

When to scale the core

If you need better tool-resistant reasoning (HLE-style) without frontier costs, lift the core to ~3B params (keep the same stack). Expect step-changes on MMLU-Strict (+3–6 pts) and HLE (+3–4 pts), with modest inference cost increases.


Conclusion

For most real applications, the winning move is workflow engineering, not parameter bloat. An SML + logic verifier + A2A orchestration + engineered RAG + strong verifiers delivers GPT-4-class outcomes where it matters: evidence-grounded, auditable, and cost-controlled answers.

There is no single engineering recipe for every problem; good engineering matters, and our proprietary trained models help accelerate projects and cut costs.