AIML

Re-Architecting Enterprise Threat Intelligence for the Foundation Model Era

Chandra Pendyala

Working Paper v5

14,000+ enterprise workloads processed | ~60% inference cost reduction per request | Tens of thousands of production requests measured

Executive Summary

| Dimension | Result / Position |
|---|---|
| Modernization trigger | TensorFlow-era infrastructure cost pressure driving a migration of an existing production platform |
| Production scale | 14,000+ enterprise workloads across IaC, IAM, CI/CD, deployment and governance artifacts |
| Cost outcome | ~60% reduction in inference cost per threat model request, measured against AWS compute spend per request |
| Core shift | Model-centric pipelines to evaluation-centric orchestration |
| Unexpected value | Discovery of hidden operational defects that earlier pipelines could not observe |

This paper documents the evolution of a production enterprise AI threat-modeling platform during a practical modernization effort originally driven by infrastructure cost optimization. The project began as a bounded effort to modernize a TensorFlow-era platform, improve hardware efficiency, reduce operational cost, and take advantage of newer foundation-layer improvements in models, runtimes, and accelerator economics.

The expectation was primarily framework and infrastructure modernization. Instead, the migration exposed deeper architectural constraints embedded throughout the system: framework-coupled semantics, hidden preprocessing divergence, silent context corruption, replay instability, string ambiguity propagation, topology loss, and evaluation fragmentation. Correcting these issues recursively forced a broader redesign — evolving from framework migration into a rethinking of how enterprise AI systems represent, validate, orchestrate, and govern probabilistic reasoning.

Key Takeaway  Cost pressure started the migration. The evaluation infrastructure we built to manage it safely turned out to expose deeper problems in the original system — and fixing those produced more value than the migration itself.

How to Read This Paper

The document is designed for layered reading. Skimmers can follow tables, diagrams, and key-takeaway boxes. Patient readers can use the prose sections for the causal chain and operational detail.

| If you care about… | Read… |
|---|---|
| Why the migration started and what it cost | Sections 1–2 |
| TensorFlow to PyTorch and JAX migration lessons | Sections 3, 9, and 10 |
| Semantic numericalization and long strings | Sections 4–5 |
| Hidden production defects (the core finding) | Section 7 |
| Threat-modeling architecture patterns | Sections 6 and 8 |
| Hardware-aware orchestration and future portability | Section 9 |
| Operational outcomes and cost breakdown | Section 12 |
| Future quality comparison work | Section 13 |

1. Introduction

The platform described in this paper had been running in production for years, processing threat models across more than 14,000 enterprise workloads. The decision to migrate off TensorFlow was straightforward: inference costs were rising, the ecosystem had moved toward PyTorch and JAX, and staying put was getting more expensive with each passing quarter.

The first practical question the migration raised was: how do we know the new system produces equivalent output? Users had years of familiarity with what the platform generated. Threat models had been reviewed, calibrated against, and built into security workflows. Migrating the infrastructure while silently changing the outputs was not acceptable. Before touching the model pipeline, we needed a rigorous evaluation framework — one that could define “identical quality” precisely enough to serve as a regression baseline throughout the migration.

Investing in evaluation first turned out to be the decision that changed the scope of the project. Once we had production replay infrastructure and deterministic verification in place, we could see things the original pipeline had never exposed. Preprocessing inconsistencies. Silent truncation of long inputs. Train/inference divergence. Retrieval instability. Defects that had been invisible not because they were subtle, but because there had been no mechanism to observe them. The migration became a redesign.

This paper documents that progression: a migration that started with a cost problem, was disciplined by an evaluation-first approach, and expanded into a broader rethinking of how the platform represents, validates, and orchestrates its reasoning. The sections that follow describe each of the major architectural decisions in the order they were encountered.

Key Takeaway  Investing in evaluation before touching the model pipeline was a risk management decision. It turned into the lens that made everything else visible.

2. The Migration Trigger

The immediate pressures were concrete. TensorFlow inference costs were rising. PyTorch had become the practical standard for new model development. The team maintaining the platform needed to stay current with an ecosystem that had largely moved on. A migration was going to happen; the question was how to do it without breaking a production system that people depended on.

Why Not Just Use a Frontier Model?

Before investing in a purpose-built system, we prototyped with frontier models. The answer came back quickly: simple agents built on the OpenAI, Anthropic, and Gemini APIs could not handle the task adequately. The failure was not a prompting problem. It was structural.

Enterprise threat modeling across 14,000 heterogeneous workloads requires understanding platform-specific jargon, proprietary infrastructure patterns, and deployment conventions that frontier models had never seen. Without a semantic numericalization and parser layer feeding clean, canonicalized context, the models produced outputs that failed basic verifier checks at an unacceptable rate. With heavy prompt engineering, pass rates improved but remained well below production requirements. The quality gap closed partially when we applied the same preprocessing layer we had built for our own models — which confirmed that the input representation problem was the core issue, not model capability.

Cost and data sovereignty closed the question entirely.

| Dimension | Frontier Model (GPT-4) | Our System (Post-Redesign) |
|---|---|---|
| Cost per threat model | >$20 (token-based API) | ~$0.0X (on-prem compute) |
| Verifier pass rate | ~30% initial; ~60% with heavy prompt engineering | 95%+ |
| Enterprise jargon handling | Failed without custom preprocessing | Works with semantic numericalization |
| Data sovereignty | Cloud API; data leaves premises | On-premises; compliant by design |
| P95 latency | 8–15s (API + tokenization overhead) | <2s (optimized pipeline) |
| Human review rate | ~50% (hallucinations, missed context) | <15% |

At $20+ per threat model across tens of thousands of requests, the cost was not competitive. More importantly, the enterprise customers this platform serves will not send infrastructure topology, IAM policies, and deployment specifications to an external API. Data residency is a hard requirement, not a preference.

Training data compounds the constraint further. The historical security audits, vulnerability findings, remediation records, proprietary risk models, compliance stances, and control mappings that make the threat models valuable represent the accumulated institutional security knowledge of the enterprise — competitively sensitive and in many cases legally protected. Fine-tuning on that corpus via an external API is not an option regardless of cost. The full training pipeline runs on-premises against data that does not move.

The frontier model prototype was useful despite its limitations. It established a concrete quality baseline, confirmed that the input representation problem was real and solvable with proper preprocessing, and validated that a purpose-built system could achieve quality the API approach could not match at any price. That early comparison also gave us the verifier framework that later became the backbone of the migration evaluation infrastructure.

| Legacy Assumption | Emerging Pressure | Architectural Consequence |
|---|---|---|
| TensorFlow graph execution | PyTorch and JAX dynamic orchestration | Separate semantic contracts from framework execution |
| Direct string handling inside the pipeline | Long, heterogeneous enterprise artifacts | Numericalize strings before model execution |
| Monolithic GPU training | Mixed precision, routing, LoRA, distillation | Optimize per task and per hardware target |
| Large custom models for narrow tasks | Small models, retrieval, frontier escalation | Use the cheapest reliable adaptation path |
| Framework-coupled preprocessing | Need for portability and reproducibility | Externalize preprocessing as versioned contracts |
| Static evaluation reports | Need to compare migrations safely | Build replay-driven evaluation fabric |

On this platform — threat modeling against structured enterprise inputs with deterministic verifiers and constrained output schemas — pipeline failures dominated over model capability gaps. That may not hold universally; tasks requiring open-ended reasoning over very long contexts face different constraints. But for structured enterprise AI at scale, getting the pipeline right mattered more than getting a better model.

Key Takeaway  The original architecture made sense for its time. Migrating it carefully surfaced what had accumulated inside it — and addressing that turned out to be the bulk of the work.

3. The Unexpected Discovery: Migration Was Mostly Understanding

AI-assisted code generation provided comfort during the migration. Files could be translated quickly, syntax could be updated, and the visible signs of progress were real. But the apparent speed did not remove the work that actually determined whether the migration was correct.

Most of the engineering effort remained concentrated in reading, running, replaying, tracing, evaluating, and thinking. The difficult work was semantic reconstruction: determining which behaviors had to be preserved, which behaviors were legacy artifacts, and which behaviors were accidental bugs disguised as model behavior.

| What code generation helped with | What still dominated effort |
|---|---|
| Syntax translation | Recovering architectural intent |
| Framework API substitution | Validating train/eval/production parity |
| Creating visible migration momentum | Tracing hidden preprocessing assumptions |
| Reducing typing burden | Rebuilding evaluation datasets and replay harnesses |
| Generating scaffold code | Determining whether outputs were semantically equivalent |

Key Takeaway  Code generation accelerated implementation mechanics. The bottleneck remained semantic: recovering architectural intent, validating operational behavior, and determining whether outputs were equivalent. None of that can be automated away.

4. Semantic Numericalization

String handling was one of the first practical friction points in the migration. TensorFlow and PyTorch treat strings differently at a low level, and the platform processed a lot of them — IAM policies, Terraform configurations, Kubernetes YAML, architecture documents, CI/CD specifications, audit artifacts. Getting string behavior consistent across the migration boundary required building explicit handling that the TensorFlow pipeline had never needed.

Solving the string problem properly — rather than patching it just enough to compile — meant building a layer that canonicalized inputs before they reached the model. Once that layer existed, it became clear how much ambiguity the original pipeline had been passing through silently. Enterprise strings carry dense semantic content: an IAM policy encodes permission structure, a Terraform file encodes infrastructure topology, an architecture document encodes deployment relationships. Treating them as undifferentiated text and processing them inside framework execution graphs had been obscuring that structure rather than preserving it.

The semantic numericalization layer that resulted performs parser-assisted canonicalization, symbolic reduction, ontology-aware entity mapping, semantic hashing, controlled tokenization, deployment-topology extraction, deterministic feature projection, and long-string decomposition. The governing principle it enforces is simple: ambiguity should be resolved at the ingestion boundary, before it reaches probabilistic reasoning. Passing ambiguous inputs into a model means the model absorbs the ambiguity into its outputs without signaling it — which is precisely what had been happening.

Core Design Principle
Governing Principle: Ambiguity must terminate at the ingestion boundary.
If a string is ambiguous, that ambiguity must be resolved — or explicitly represented — before it enters downstream processing. Passing ambiguity forward into model execution means outputs reflect the ambiguity without signaling it, making errors structurally invisible. This was a root cause of the cost problem: ambiguous inputs drove inconsistent token consumption, unpredictable context sizes, and expensive retry cycles.
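
To make the principle concrete, here is a minimal sketch of canonicalization plus semantic hashing for an IAM-style JSON policy. It is an illustration under stated assumptions, not the platform's implementation: the function names (`canonicalize_policy`, `semantic_hash`, `ambiguity_flags`) and the choice of which fields to normalize are hypothetical. The idea it shows is that cosmetically different but equivalent inputs collapse to one representation, while remaining ambiguity (here, wildcards) is surfaced explicitly instead of being passed forward.

```python
import hashlib
import json


def canonicalize_policy(raw: str) -> dict:
    """Parse an IAM-style policy and normalize the parts that vary cosmetically."""
    policy = json.loads(raw)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear unwrapped
        statements = [statements]
    canonical = []
    for stmt in statements:
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        resources = stmt.get("Resource", [])
        if isinstance(resources, str):
            resources = [resources]
        canonical.append({
            "effect": stmt.get("Effect", "").lower(),
            # Assumption: action names are compared case-insensitively.
            "actions": sorted(a.lower() for a in actions),
            # Resources are left case-sensitive and only sorted.
            "resources": sorted(resources),
        })
    # Statement order carries no meaning here, so sort for stability.
    canonical.sort(key=lambda s: json.dumps(s, sort_keys=True))
    return {"statements": canonical}


def semantic_hash(canonical: dict) -> str:
    """Hash the canonical form so equivalent inputs share one identifier."""
    payload = json.dumps(canonical, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def ambiguity_flags(canonical: dict) -> list:
    """Represent unresolved ambiguity explicitly rather than hiding it."""
    flags = []
    for i, stmt in enumerate(canonical["statements"]):
        if any("*" in a for a in stmt["actions"]):
            flags.append(f"statement[{i}]: wildcard action")
        if any(r == "*" for r in stmt["resources"]):
            flags.append(f"statement[{i}]: wildcard resource")
    return flags
```

Two policies that differ only in key order, casing of the action name, or list wrapping produce the same hash under this sketch, which gives downstream stages a stable key for caching, replay, and comparison.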

One honest caveat: semantic numericalization is not a solved problem. As enterprise input formats evolve — new cloud providers, new IaC dialects, new policy languages — the ontology and parser layer requires ongoing maintenance to remain current. The layer reduces ambiguity for known input types; novel formats still require human-guided extension before they can be reliably canonicalized.

Key Takeaway  What started as a migration friction point became a foundational architectural layer. The string canonicalization layer requires ongoing maintenance as input formats evolve (new cloud providers, new IaC dialects), but the investment paid for itself quickly in reduced tokenization variance and more predictable model behavior.

5. Long String Architecture and Topology Preservation

Enterprise documents are not ordinary text. An architecture document encodes deployment topology, trust relationships, component dependencies, and control structures in a form that requires domain-specific interpretation to decompress correctly. Terraform encodes infrastructure as a graph of interdependent resources with implicit ordering and dependency constraints. A CI/CD specification encodes execution sequences, environment assumptions, and deployment conditions that interact with every other component in the system.

Traditional approaches to long strings — fixed-length truncation and naive chunking — fail to preserve the semantic structure that makes enterprise documents useful inputs for reasoning tasks. Truncation silently discards content that may be critical. Naive chunking breaks documents at boundaries that have no semantic significance, scattering related information across chunks in ways that prevent coherent reconstruction during retrieval.

The redesigned pipeline treated long document processing as a first-class engineering problem, introducing hierarchical chunking that preserves document structure, topology-aware segmentation that extracts deployment relationships as explicit graph representations, parser-guided decomposition by document type, semantic compression into canonical intermediate representations, and retrieval-oriented numericalization for bounded-context reasoning.
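
As a rough illustration of structure-preserving chunking, the sketch below splits a document at numbered headings and attaches the heading trail to every chunk, so a retrieved chunk still knows where it sat in the document. The heading pattern, the `Chunk` type, and the `max_chars` budget are assumptions made for the example; real parsers would be selected per document type, and a body line that happens to start with a number would be misread as a heading by this simple pattern.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    path: tuple  # heading trail, e.g. ("1. Network", "1.1 Ingress")
    text: str


# Assumption: headings look like "1. Title" or "1.1 Title".
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(.+)$")


def hierarchical_chunks(document: str, max_chars: int = 400) -> list:
    chunks = []
    trail = []  # stack of (depth, heading line)
    body = []   # lines collected under the current heading

    def flush():
        text = "\n".join(body).strip()
        body.clear()
        if not text:
            return
        path = tuple(title for _, title in trail)
        piece = ""
        # Oversize sections split at paragraph boundaries, never mid-paragraph.
        # A single paragraph longer than max_chars is kept whole, not truncated.
        for para in re.split(r"\n\s*\n", text):
            if piece and len(piece) + len(para) + 2 > max_chars:
                chunks.append(Chunk(path, piece))
                piece = para
            else:
                piece = f"{piece}\n\n{para}" if piece else para
        chunks.append(Chunk(path, piece))

    for line in document.splitlines():
        match = HEADING.match(line.strip())
        if match:
            flush()  # close the previous section under its own trail
            depth = match.group(1).count(".") + 1
            while trail and trail[-1][0] >= depth:
                trail.pop()
            trail.append((depth, line.strip()))
        else:
            body.append(line)
    flush()
    return chunks
```

The design point is that nothing is dropped: content is regrouped under its structural path, which is the opposite of fixed-length truncation.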

Key Takeaway  The truncation defects in Section 7 trace directly to this: topology that was discarded at ingestion could not be recovered downstream. Preserving document structure at parse time turned out to matter more than context window size.

6. Evaluation-Centric Design

The evaluation infrastructure built for the migration turned into the most operationally significant part of the redesign. The original reason was straightforward: we needed to know whether the migrated system was producing equivalent outputs. What we built to answer that question ended up changing how the entire platform was developed and validated going forward.

Periodic evaluation against held-out data had been sufficient when the pipeline was stable. During a migration, with preprocessing changing, tokenization behavior shifting, and retrieval indices evolving, it was not enough. We needed continuous correctness checks across every version, not just snapshots at release time.

In the redesigned system, evaluation became the primary software discipline. The evaluation fabric introduced curated benchmark datasets covering known threat-modeling scenarios; production replay datasets that reproduce historical behavior; regression validation suites that detect behavioral changes across versions; verifier disagreement scoring that flags divergence between deterministic checks and generative outputs; retrieval quality analysis; lineage-aware tracking of dataset, transform, model, hardware, and output provenance; and deterministic promotion gates before production release.
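
A replay-driven promotion gate of the kind described above might be sketched as follows. This is a simplified illustration, not the platform's evaluation fabric: the `ReplayCase` fields, the Jaccard overlap measure, and the 0.9 threshold are all assumptions chosen to keep the example small.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplayCase:
    case_id: str
    baseline_threats: frozenset  # threat IDs recorded from the reference pipeline
    required_threats: frozenset  # threat IDs a deterministic verifier insists on


def evaluate_candidate(cases, candidate_fn, min_overlap=0.9):
    """Replay each case through the candidate and compare against the baseline."""
    report = []
    for case in cases:
        produced = frozenset(candidate_fn(case.case_id))
        union = produced | case.baseline_threats
        # Jaccard overlap between candidate output and recorded baseline.
        overlap = len(produced & case.baseline_threats) / len(union) if union else 1.0
        missing = case.required_threats - produced
        report.append({
            "case_id": case.case_id,
            "overlap": overlap,
            "missing_required": sorted(missing),
            "passed": overlap >= min_overlap and not missing,
        })
    return report


def promotion_gate(report) -> bool:
    """Deterministic gate: every replayed case must pass before release."""
    return all(row["passed"] for row in report)
```

Because the gate is a pure function of the replay dataset and the candidate's outputs, the same inputs give the same promotion decision, which is what makes comparisons across models, routing strategies, and hardware trustworthy.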

Key Takeaway  Having evaluation infrastructure in place before the migration meant we could run experiments — different models, different routing approaches, different hardware — and trust the results. It also meant regressions showed up before production rather than after.

7. Hidden Operational Defects Revealed

One of the least anticipated outcomes of the redesign was the systematic discovery of previously hidden operational defects. The earlier platform lacked the instrumentation necessary to make these failures visible. Without deterministic replay, there was no reliable way to reproduce historical behavior. Without semantic canonicalization, there was no stable basis for comparing inputs across pipeline versions. Without lineage-aware evaluation, there was no mechanism for tracing outputs back to the preprocessing decisions that produced them.

What emerged was a pattern of pipeline failures. Preprocessing decisions had propagated silently into training data, learned representations, and production outputs simultaneously.

| Defect Type | Why It Stayed Hidden | How the Redesign Exposed It |
|---|---|---|
| Silent truncation | Outputs remained plausible even when context was missing | Production replay plus long-string decomposition |
| Train/inference divergence | Training and production shared the same flawed preprocessing assumptions | Lineage-aware evaluation and transform versioning |
| String ambiguity propagation | Framework-specific string behavior hid semantic drift | Canonicalization and numericalization |
| Retrieval instability | Non-repeatable context looked like ordinary model variance | Retrieval quality analysis and replay |
| Topology loss | Threat models could look complete while missing trust-boundary relationships | Topology extraction and verifier checks |

Representative Defect: Silent Truncation of Architecture Documents

The legacy platform applied fixed-length truncation to long string inputs as a standard preprocessing step. This was a practical decision at the time: bounded input lengths simplified framework execution, reduced memory consumption, and kept training tractable. Architecture description documents, however, frequently exceeded this boundary and were truncated silently. The pipeline produced no signal indicating information loss had occurred.
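
The missing signal is cheap to add. The sketch below shows one hypothetical way to bound an input while recording that information was lost, so truncation becomes measurable instead of silent. The character-based limit and the names `BoundedInput`, `bound_input`, and `truncation_rate` are assumptions for illustration; the source does not state the legacy boundary or its unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundedInput:
    text: str
    original_length: int
    dropped_chars: int

    @property
    def truncated(self) -> bool:
        return self.dropped_chars > 0


def bound_input(text: str, max_chars: int) -> BoundedInput:
    """Apply a length bound, but keep a record of what was discarded."""
    kept = text[:max_chars]
    return BoundedInput(kept, len(text), len(text) - len(kept))


def truncation_rate(inputs) -> float:
    """Fraction of inputs that lost content; a metric the pipeline can alert on."""
    inputs = list(inputs)
    if not inputs:
        return 0.0
    return sum(1 for item in inputs if item.truncated) / len(inputs)
```

Emitting `dropped_chars` alongside each record lets both training and inference pipelines report how often, and by how much, inputs were cut.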

The downstream consequence was structurally compromised threat models — though the nature of the compromise was more subtle than simple omission. A post-redesign review suggests the defect likely manifested as over-generation rather than under-generation. The architecture description field — the most likely truncation victim — may have carried less discriminating signal than the architecture components field, which escaped truncation. With components intact but architectural context incomplete, the model appears to have compensated by generating a higher volume of less topology-specific threats. Newer outputs look noticeably tighter: fewer threats, more grounded in actual deployment context.

Silent Failure: Train/Inference Distribution Mismatch
The truncation defect had two compounding layers:
Layer 1 (Inference): Incomplete architectural context drove over-generation. Threat models appeared thorough but were structurally misaligned: more threats than the architecture warranted, less grounded in actual topology.
Layer 2 (Training): Truncation occurred inside the training pipeline itself. The model never learned from complete architectural descriptions, so complete documents were out-of-distribution inputs at inference time.
Scope: A proper controlled comparison between legacy and redesigned outputs has not yet been completed. Preliminary observation suggests over-generation was widespread wherever architecture documents exceeded the truncation boundary, but quantification awaits the statistical study described in Section 13.

This defect was operationally invisible for its entire production lifetime. Outputs looked reasonable — voluminous, even, which can read as thoroughness. No pipeline errors were raised. No evaluation signal flagged the mismatch. The system produced structurally misaligned threat models systematically, at scale, wherever architecture documents exceeded the truncation boundary.

The model performed as well as its pipeline allowed. The pipeline was where the failure lived. Model-centric evaluation frameworks measure output quality against labels or human judgment — neither of which easily detects systematic context distortion baked into training itself. The defect was invisible to the tools the system had for looking.

Key Takeaway  Cost reduction was the goal. Exposing operational failures that earlier architectures could not observe was the unexpected dividend — and revealed that the failure mode was likely excess rather than absence.

8. Threat Modeling System Architecture

The threat-modeling task required reconstructing deployment topology from heterogeneous inputs, mapping components to vulnerability patterns, generating STRIDE-aligned threats appropriate to the architecture, validating generated threats against deterministic control mappings, and producing outputs specific enough to drive remediation.

The Epistemic Contract

The system knows what it knows. It knows what it does not know. Both are first-class outputs.

| Principle | Meaning |
|---|---|
| Input assessed before generation | The verifier sets the confidence ceiling before the model sees a token. |
| Abstention is valid | Reporting "insufficient input to assess a boundary" is better than hallucinated specificity. |
| Confidence ceiling is deterministic | Derived from input completeness and verifier logic. The model probability plays no part in setting the ceiling. |
| Human reviewer is co-author | The reviewer completes what the system cannot complete alone. Review is a co-authorship step, embedded in the workflow by design. |
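
One way to read the contract in code: the ceiling is computed from the input alone, before generation, and the model's own probability can only be capped by it, never raise it. The sketch below is hypothetical. The field list in `REQUIRED_FIELDS`, the completeness ratio, and the abstention threshold are assumptions, not the platform's actual verifier logic.

```python
# Hypothetical completeness checklist; the real verifier logic is not described here.
REQUIRED_FIELDS = ("components", "trust_boundaries", "data_flows", "data_classes")


def confidence_ceiling(artifact: dict) -> float:
    """Deterministic ceiling derived only from input completeness."""
    present = sum(1 for field in REQUIRED_FIELDS if artifact.get(field))
    return present / len(REQUIRED_FIELDS)


def finalize(artifact: dict, model_confidence: float, abstain_below: float = 0.5) -> dict:
    """Cap reported confidence at the ceiling, or abstain when input is too thin."""
    ceiling = confidence_ceiling(artifact)  # computed without the model's probability
    missing = [field for field in REQUIRED_FIELDS if not artifact.get(field)]
    if ceiling < abstain_below:
        # Abstention is a first-class output, with the gaps named for the reviewer.
        return {"status": "abstain", "confidence": 0.0, "missing": missing}
    return {
        "status": "assessed",
        "confidence": min(model_confidence, ceiling),
        "missing": missing,
    }
```

The `missing` list is what the human reviewer, as co-author, is asked to complete.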

Three Cognitive Modes

| Human Mode | System Component | Failure Mode |
|---|---|---|
| Pattern recognition | Student LLM, SFT on historical reviews and component threat models | Hallucination when pattern class is absent from training corpus |
| Principled reasoning | STRIDE/MITRE ontology plus control objective / control procedure / control standard framework | Gap detection failures when rules are incomplete or mappings are stale |
| Evidence grounding | Two-pass RAG and evidence retrieval | Retrieval misses when index coverage is poor or embeddings diverge |

Knowledge Fabric: Two Semantic Spaces

The platform deliberately separates threat patterns from policy and control records because they live in different semantic spaces. Merging the indexes causes the same vector query to retrieve the wrong kind of evidence.

| Index | Semantic Purpose | Retrieval Rule |
|---|---|---|
| Index A: Policy and Control Corpus | CO/CP/CS records, AWS component threat models, enterprise policies, compliance frameworks | Pass 2 uses strict terms filtering on co_id, not kNN. |
| Index B: Historical Threat Patterns | Approved threat descriptions by component type, trust boundary, and STRIDE category | Pass 1 uses semantic kNN over component_type, trust_boundary, and data_class. |
| Neptune graph | AWSComponent → ThreatProfile → ControlObjective | Traversal starts at ThreatProfile IDs from Pass 1, not globally at AWSComponent. |

Key Takeaway  The system combines learned pattern recognition, deterministic reasoning, evidence grounding, verification, and human co-authorship, each handling the class of problem it is suited for.
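
The two-pass retrieval rule in the Knowledge Fabric table can be sketched with in-memory stand-ins for the two indexes and the graph. Everything here is illustrative: the record shapes, the toy vectors, and the function names are assumptions, and a production system would call its search engine and graph database instead of Python lists and dicts. What the sketch preserves is the ordering: semantic kNN over threat patterns first, graph traversal from those ThreatProfile IDs, then exact filtering on `co_id`.

```python
import math


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pass1_knn(query_vec, index_b, k=2):
    """Pass 1: semantic nearest neighbours over historical threat patterns."""
    ranked = sorted(index_b, key=lambda rec: cosine(query_vec, rec["vec"]), reverse=True)
    return [rec["threat_profile_id"] for rec in ranked[:k]]


def traverse(threat_profile_ids, edges):
    """Graph step: ThreatProfile -> ControlObjective, starting from Pass 1 hits only."""
    co_ids = []
    for profile_id in threat_profile_ids:
        for co_id in edges.get(profile_id, []):
            if co_id not in co_ids:
                co_ids.append(co_id)
    return co_ids


def pass2_terms(co_ids, index_a):
    """Pass 2: exact-match filter on co_id; no vector similarity involved."""
    wanted = set(co_ids)
    return [rec for rec in index_a if rec["co_id"] in wanted]
```

Keeping Pass 2 as an exact filter means a control record is returned only because the graph linked it to a retrieved threat profile, not because its text happened to embed near the query.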

9. Hardware-Aware Execution

Modern AI systems are increasingly hardware-sensitive: mixed precision, sharding, LoRA/QLoRA, distillation, CPU-resident inference, vLLM serving, and JAX/XLA-style compilation create task-specific cost profiles that static architectures cannot exploit.

The redesign separated orchestration logic from hardware execution. The system could route work to the cheapest reliable execution path rather than default every request to a large model or expensive GPU path.

| Task Class | Preferred Execution Path | Reason |
|---|---|---|
| Simple deterministic validation | Rules, parsers, or SQL-style checks | Lowest cost and highest traceability |
| Known component threat pattern | Small specialized model plus retrieval | Fast, cheap, repeatable |
| Ambiguous or novel architecture | Escalation to larger teacher/synthesis model | Higher reasoning capacity only where entropy is high |
| Repeated expensive behavior | Distill or SFT smaller model | Collapse cost after behavior is understood |
| Long document understanding | Topology extraction plus bounded retrieval | Avoids raw context-window inflation |

The routing table above summarizes how that played out in practice: each task class got the cheapest execution path that could pass the verifiers.
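
One simple way to express "cheapest path that passes the verifiers" is a cascade: try paths in ascending cost order and stop at the first verified output. This is a sketch under assumptions, not the platform's router. The path records, the `verifier` callable, and the fallback to human review are hypothetical, and a real router could classify the task up front instead of paying for failed cheaper attempts.

```python
def route(request, paths, verifier):
    """Try execution paths from cheapest to most expensive; stop at the first verified output."""
    spent = 0.0
    for path in sorted(paths, key=lambda p: p["cost"]):
        output = path["run"](request)
        spent += path["cost"]  # failed attempts still cost something
        if verifier(request, output):
            return {"path": path["name"], "output": output, "spent": spent}
    # Nothing passed: hand off rather than return an unverified answer.
    return {"path": "human_review", "output": None, "spent": spent}
```

The cascade only makes economic sense when cheap paths succeed often; tracking `spent` per request makes that trade-off visible.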

Hardware Portability
Because orchestration and execution are separated in the redesigned architecture, future hardware targets plug in without requiring pipeline re-engineering. Frameworks like TurboQuant and ThunderKittens (Stanford) are making it easier to target new accelerator generations with less expert effort. ThunderKittens showed near-theoretical memory bandwidth utilization on H100s, and its Apple Silicon port ThunderMittens required only a tile-size change to cross architectures.
As these abstraction layers mature, the cost of adding a new hardware target decreases. The separation we built for cost reasons in 2024 turns out to have ongoing portability value.

10. Mechanical Recoding Limitations

Early modernization attempts used AI-assisted code generation to translate TensorFlow-era pipelines into PyTorch implementations. The approach was appealing: it reduced the mechanical effort of syntax translation and created visible migration momentum. For simpler, well-structured pipelines, code generation can be highly effective. In our case — where the pipeline was tightly coupled to TensorFlow’s execution graph and preprocessing assumptions were implicit throughout — the limitations became visible quickly.

Framework translation and architectural modernization are different activities. Generated code frequently reproduced hidden preprocessing assumptions, framework coupling, ambiguous tensor semantics, inconsistent transformation ordering, and train/evaluation divergence — because these properties were embedded in the logic of the original code, and code generation preserves logic while updating syntax.
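
A small, contrived example of how transformation ordering can diverge while still looking like a faithful port: both functions below apply the same three steps, and they agree on clean input, but they disagree once an input has leading whitespace. The two preprocessing functions and the 16-character bound are hypothetical and are not taken from the platform's code.

```python
def legacy_preprocess(text: str, max_chars: int = 16) -> str:
    # Truncate first, then strip and lowercase.
    return text[:max_chars].strip().lower()


def ported_preprocess(text: str, max_chars: int = 16) -> str:
    # Same steps, different order: strip and lowercase, then truncate.
    return text.strip().lower()[:max_chars]


def parity_failures(samples, reference, candidate):
    """Return the inputs on which the two pipelines disagree."""
    return [s for s in samples if reference(s) != candidate(s)]


if __name__ == "__main__":
    unit_test_inputs = ["allow s3:GetObject"]
    replayed_inputs = ["allow s3:GetObject", "   allow s3:GetObject on bucket"]
    print(parity_failures(unit_test_inputs, legacy_preprocess, ported_preprocess))
    print(parity_failures(replayed_inputs, legacy_preprocess, ported_preprocess))
```

A unit test built from tidy fixtures reports no differences here, while replaying messier production-like input exposes the divergence.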

| Migration Activity | Observed Reality |
|---|---|
| Mechanical TensorFlow to PyTorch recoding | Useful for comfort and scaffolding, but not a reliable source of end-to-end time savings in tightly coupled systems |
| Generated code review | Required deep reading because generated code preserved hidden assumptions |
| Compilation success | Did not imply semantic parity |
| Local unit tests | Could miss production replay divergence |
| Developer time allocation | Overwhelmingly reading, running, tracing, evaluating, and thinking, not typing |

Key Takeaway  Typing was never the bottleneck. The real work was recovering semantic intent and validating operational behavior, and code generation leaves that work entirely to the engineer.

11. Organizational Consequences

A major operational constraint emerged from the divergence between legacy TensorFlow ecosystems and newer PyTorch- and JAX-centered environments. The challenge was not simply hiring availability, although engineers fluent across TensorFlow graph semantics, legacy preprocessing systems, modern PyTorch execution models, distributed inference, and evolving accelerator ecosystems are rare. The deeper problem was that critical system behavior had become implicitly embedded inside framework-specific implementations rather than expressed as explicit architectural contracts. Understanding the system required knowing the framework-specific mechanics that produced its behavior, and that knowledge was neither documented nor transferable without significant effort.

The redesign addressed this by externalizing semantic normalization, evaluation logic, lineage tracking, deterministic verification, and orchestration behavior from framework runtimes. These concerns became explicit architectural components with defined interfaces, documented behavior, and evaluation coverage. New engineers could reason about system behavior from architectural documentation and evaluation results rather than needing to internalize framework-specific implementation details.

Key Takeaway  Externalizing system behavior into explicit contracts made the platform maintainable by engineers who understood the architecture, not engineers who had memorized the framework internals.

12. Operational Outcomes

The redesign produced an approximately 60% reduction in inference cost per threat model request, measured as AWS compute spend per request across tens of thousands of production requests. This is a concrete, auditable business metric: the numerator is cloud compute spend and the denominator is the number of production threat model requests processed. The four mechanisms that drove the reduction are summarized below.
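
The metric definition reduces to two small calculations, sketched below. The function names are hypothetical, and any figures passed in when trying it out are placeholders rather than the platform's actual spend or request counts.

```python
def cost_per_request(compute_spend: float, requests: int) -> float:
    """Cloud compute spend divided by production requests processed."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return compute_spend / requests


def relative_reduction(before: float, after: float) -> float:
    """Fractional drop in cost per request between two measurement periods."""
    if before <= 0:
        raise ValueError("baseline cost must be positive")
    return 1.0 - after / before
```

Because both periods are normalized per request, the comparison stays meaningful even if request volume differs between the legacy and redesigned measurement windows.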

| Mechanism | Cost Driver Eliminated | Outcome |
|---|---|---|
| Semantic numericalization + model routing | Default frontier-model calls for all requests | Smaller, cheaper models for routine work |
| Long document decomposition | Oversized context windows and excess token cost | Bounded, predictable token consumption |
| Deterministic verification | Expensive retries on generative failures | Failures caught early, upstream |
| Hardware-aware routing | GPU-centric execution for all task types | Right hardware matched to each task class |

None of these gains came from switching cloud providers or renegotiating rates. They came entirely from architectural decisions. Beyond cost reduction, the redesign produced substantial improvements in governance posture, replay consistency, traceability, and deployment portability.

| Outcome Area | Result |
|---|---|
| Cost | ~60% reduction in inference cost per threat model request |
| Retraining overhead | Reduced through smaller adaptation loops, routing, and distillation |
| GPU dependency | Reduced by assigning work to cheaper execution paths where sufficient |
| Traceability | Improved through MLflow lineage, replay datasets, and transform versioning |
| Governance | Improved through deterministic verifier checks and evidence-linked outputs |
| Defect visibility | Improved through production replay, semantic numericalization, and evaluation gates |
| Deployment portability | Improved by separating orchestration logic from hardware execution |

Key Takeaway  The 60% figure covers inference cost. The output correctness improvement, which came from fixing defects that had been invisible since the original deployment, is harder to quantify but likely more significant in practice.

13. Future Work: Statistical Output Comparison

Future work involves statistically comparing outputs between the legacy platform and the redesigned platform using identical production replay datasets. Preliminary operational evidence — particularly the observation that newer outputs look tighter and more topology-grounded — suggests the redesigned platform may produce substantially superior outputs in addition to operational savings. But the quality claim requires controlled comparison before it can be stated with confidence.

| Evaluation Dimension | Question |
| --- | --- |
| Threat completeness | Does the redesigned platform identify more valid threats per workload, or fewer but better-grounded ones? |
| Over-generation | Does the redesigned platform produce fewer excess threats, consistent with the truncation defect hypothesis? |
| False negatives | Does topology preservation reduce missed critical threats? |
| Remediation quality | Are recommendations more specific and actionable? |
| Control mapping accuracy | Are STRIDE/MITRE/control mappings more coherent? |
| Replay consistency | Do identical inputs produce stable outputs across versions? |
| Traceability quality | Can reviewers reconstruct why a threat was generated? |
| Human review effort | Does reviewer time decrease without reducing quality? |

Key Takeaway: The next scientific step is a controlled, statistically defensible comparison of old and new outputs under identical production replay conditions, with particular attention to whether over-generation decreased and topology grounding improved.
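One way such a comparison could be set up is a paired bootstrap over per-workload differences, since both platforms would see the same replay inputs. The sketch below is illustrative only: the function name, the choice of "threats generated per workload" as the metric, and the sample counts are assumptions made for the example, not measurements from the platform.

```python
import random
from statistics import mean


def paired_bootstrap_ci(legacy, redesigned, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (redesigned - legacy)."""
    if len(legacy) != len(redesigned) or not legacy:
        raise ValueError("inputs must be non-empty and paired by workload")
    # Pairing by workload removes between-workload variance from the comparison.
    diffs = [r - l for l, r in zip(legacy, redesigned)]
    rng = random.Random(seed)  # fixed seed keeps the analysis itself replayable
    n = len(diffs)
    resampled = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower = resampled[int((alpha / 2) * n_resamples)]
    upper = resampled[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lower, upper


# Placeholder counts of threats generated for the same eight replayed workloads.
legacy_counts = [12, 9, 15, 11, 14, 10, 13, 12]
redesigned_counts = [8, 7, 10, 9, 9, 8, 10, 9]

point, lower, upper = paired_bootstrap_ci(legacy_counts, redesigned_counts)
```

An interval that excludes zero would support the over-generation hypothesis for that metric. It would not show that the removed threats were invalid, which still requires reviewer labels.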

14. What This Project Taught Us

This project ran across four years of significant change in models, frameworks, and hardware economics. Looking back at what actually drove the outcomes, a few patterns stand out that we expect to apply to similar problems going forward.

Invest in deterministic verification early

The verifier pipeline was first built as a regression safety mechanism for the migration. It turned out to be the most valuable infrastructure investment of the entire project. Verifiers gave us a consistent, objective quality bar that made cost experimentation tractable: we could try a classification model feeding a generation model, smarter retrieval-augmented approaches, or different routing strategies, and know quickly whether each passed or failed on what mattered. Without that bar, each experiment would have been a judgment call.

Designing for deterministic verification from the start is worth the investment. When you know what correct looks like deterministically, the whole system becomes more tractable — experimentation is faster, regressions are visible, and the quality bar survives model and framework changes.

String handling, context management, and parsing discipline have outsized impact

These concerns get underweighted because they are unglamorous. On this project they had outsized impact on output predictability. How strings are canonicalized, how long documents are decomposed, how context is assembled before inference — these decisions propagate into every output the system produces. The truncation defect in Section 7 is a direct example: a preprocessing decision made for practical reasons at training time propagated silently into every threat model the system produced for years.
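As a rough illustration of the two disciplines named above, the sketch below canonicalizes a string and decomposes a long token sequence into overlapping windows instead of cutting it at a fixed length. The function names, window size, and overlap are assumptions for the example; the platform's actual canonicalization rules and chunking strategy are not specified here.

```python
import re
import unicodedata


def canonicalize(text: str) -> str:
    """Apply NFKC normalization, then collapse every whitespace run to one space."""
    normalized = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", normalized).strip()


def chunk_tokens(tokens, max_len, overlap):
    """Split tokens into overlapping windows that together cover every token."""
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("need max_len > 0 and 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


tokens = list(range(10))
chunks = chunk_tokens(tokens, max_len=4, overlap=1)
truncated = tokens[:4]  # what a fixed-length cut would have kept
covered = sorted({t for chunk in chunks for t in chunk})
```

Here `covered` contains all ten tokens while `truncated` keeps only four, which is the kind of difference that stays invisible unless coverage is checked explicitly.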

Layer contracts make code generation productive

One unexpected benefit of the layered architecture was what it did for code generation. When each layer has a stable, well-defined contract — known inputs, known outputs, known verification criteria — code generation becomes genuinely productive. The problem is local and bounded. The generator cannot silently break something downstream because the contract defines the boundary. The verifier checks the answer.

Earlier attempts at code-generation-assisted migration struggled because the pipeline was tightly coupled — there were no clean boundaries to contain the generated code. Architectural discipline and productive code generation turned out to be mutually reinforcing. Clean layer design makes generation tractable. Generation pressure is an incentive to keep it that way.
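A minimal sketch of what a layer contract with its own verification criteria might look like follows. All type and class names are hypothetical and chosen for illustration; they are not taken from the production codebase. The point is that a generated implementation can be checked against the contract without knowing anything about the layers around it.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class CanonicalDocument:
    doc_id: str
    text: str


@dataclass(frozen=True)
class ContextChunk:
    doc_id: str
    index: int
    text: str


class Chunker(Protocol):
    """Layer contract: known input type, known output type."""

    def chunk(self, doc: CanonicalDocument) -> list[ContextChunk]: ...


def contract_violations(chunker: Chunker, doc: CanonicalDocument) -> list[str]:
    """Deterministic checks any implementation of the layer must pass."""
    chunks = chunker.chunk(doc)
    problems = []
    if [c.index for c in chunks] != list(range(len(chunks))):
        problems.append("chunk indexes are not 0..n-1 in order")
    if any(c.doc_id != doc.doc_id for c in chunks):
        problems.append("chunk lost its source document id")
    if "".join(c.text for c in chunks) != doc.text:
        problems.append("chunks do not reassemble to the input text")
    return problems


class FixedWidthChunker:
    """A candidate implementation, for example one produced by a code generator."""

    def __init__(self, width: int) -> None:
        self.width = width

    def chunk(self, doc: CanonicalDocument) -> list[ContextChunk]:
        starts = range(0, len(doc.text), self.width)
        return [ContextChunk(doc.doc_id, i, doc.text[s:s + self.width])
                for i, s in enumerate(starts)]


class TruncatingChunker:
    """A faulty candidate that silently drops everything past eight characters."""

    def chunk(self, doc: CanonicalDocument) -> list[ContextChunk]:
        return [ContextChunk(doc.doc_id, 0, doc.text[:8])]


doc = CanonicalDocument("wl-1", "resource aws_iam_role admin")
good = contract_violations(FixedWidthChunker(8), doc)
bad = contract_violations(TruncatingChunker(), doc)
```

The faulty candidate is rejected at the layer boundary, before its output can reach anything downstream.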

Stochastic inference belongs in the orchestration layer as an escape hatch

The convenience of letting a large frontier model handle everything is real. A single capable model absorbs distribution shift gracefully, requires no routing logic, and produces coherent output across a wide range of inputs. For many applications — content generation, summarization, creative work — that convenience is the right tradeoff.

For threat models, financial audit statements, compliance reports, and other outputs where correctness is auditable and stakes are high, the calculus is different. Stochastic inference by default means unpredictable outputs by default. That is a liability in domains where consumers need to trust and act on what the system produces.

The more productive pattern is to treat stochastic inference as an escape hatch in the orchestration layer. Once the intelligence for a class of problem is understood well enough to distill and verify, serve it deterministically. Reserve live inference for cases that are genuinely novel or where verifier confidence is low. The verifier is the mechanism that makes this routing decision trustworthy — it tells you when the deterministic path is safe and when it is not.
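The routing decision described above can be sketched as a small function. The callables, the `Routed` type, and the 0.8 threshold are assumptions for illustration; how the platform scores verifier confidence is not specified in this paper.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Routed:
    output: str
    path: str  # "deterministic" or "live_inference"


def route(
    request: str,
    deterministic: Callable[[str], Optional[str]],
    verifier_confidence: Callable[[str, str], float],
    live_inference: Callable[[str], str],
    threshold: float = 0.8,  # hypothetical cut-off
) -> Routed:
    """Serve the distilled path when the verifier trusts it; otherwise escalate."""
    candidate = deterministic(request)
    if candidate is not None and verifier_confidence(request, candidate) >= threshold:
        return Routed(candidate, "deterministic")
    # Escape hatch: the request is novel or verifier confidence is low.
    return Routed(live_inference(request), "live_inference")


# Stub components standing in for the real layers.
distilled = {"s3-public-bucket": "Threat: unintended public read access"}
lookup = distilled.get
model = lambda request: f"generated analysis for {request}"

known = route("s3-public-bucket", lookup, lambda req, out: 0.95, model)
novel = route("new-service-mesh", lookup, lambda req, out: 0.95, model)
doubtful = route("s3-public-bucket", lookup, lambda req, out: 0.40, model)
```

Only the first request is served deterministically. The other two fall through to live inference, one because no distilled answer exists and one because the verifier does not trust the answer that does.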

On this platform, where outputs feed into security review workflows and reviewers need to trust and act on the results, that routing decision was worth the engineering investment. As models continue to improve and inference costs continue to fall, this pattern becomes more practical, not less. The question for each problem class shifts from “which model is capable enough?” to “have we crystallized the intelligence well enough to stop running it stochastically at every inference?” For high-stakes auditable outputs, that is the question worth asking.

Conclusion

A cost optimization effort became a broader architectural redesign because the migration exposed hidden semantic coupling, topology loss, replay instability, evaluation fragmentation, and framework-bound behavior that model upgrades alone would never have surfaced.

The resulting architecture uses foundation models selectively: as teachers, synthesis engines, and escalation paths for cases the deterministic layer cannot handle confidently. The durable value came from the discipline applied at every layer — how inputs were canonicalized, how context was preserved, how outputs were verified, how behavior was made replayable. That discipline survived four years of rapid change in models, frameworks, and hardware because it was never coupled to any of them.

The engineering discipline that made this project work — evaluation-first, stable layer contracts, pipeline rigor, deterministic verification — felt specific to this migration at the time. Looking back, it reads more like a pattern. We expect to reach for the same approach on similar problems: any production AI system where output quality is auditable, costs need to be managed explicitly, and the underlying models and infrastructure are still moving fast enough that you need the system to stay flexible around them.

Appendix A: Operational Architecture Details

The following details summarize selected implementation patterns from the production threat-modeling architecture. They are included to make the paper concrete without turning the main narrative into a deployment manual.

Verifier Pipeline

| Check | Stage | Behavior |
| --- | --- | --- |
| V1 Input Sufficiency | Pre-generation | Blocking. Missing trust boundary map prevents generation. |
| V2 Document Completeness | Pre-generation | Lowers confidence ceiling for missing optional inputs; does not block generation. |
| V3 STRIDE Completeness | Post-generation | Flags missing applicable STRIDE categories for human review. |
| V4 MITRE Coherence | Post-generation | Flags incoherent technique mappings for review. |
| V5 Semantic Coherence | Post-generation | Checks threat description against STRIDE category and evidence chunk. |
| V6 Evidence Sufficiency | Post-generation | Validates that evidence content supports the stated threat; chunk presence alone is insufficient. |
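The two pre-generation checks differ in kind: V1 stops the request, while V2 only caps how confident the output may claim to be. A minimal sketch of that distinction follows. The input key names, the list of optional inputs, and the 0.1 penalty per missing input are hypothetical values chosen for the example.

```python
from dataclasses import dataclass, field

# Hypothetical optional inputs; the real list is not given in this paper.
OPTIONAL_INPUTS = ("data_flow_diagram", "iam_policy", "deployment_manifest")


@dataclass
class PreGenerationResult:
    blocked: bool
    confidence_ceiling: float
    notes: list = field(default_factory=list)


def pre_generation_checks(inputs: dict, penalty: float = 0.1) -> PreGenerationResult:
    """Run V1 (blocking) and then V2 (non-blocking) before any generation."""
    # V1 Input Sufficiency: a missing trust boundary map prevents generation.
    if not inputs.get("trust_boundary_map"):
        return PreGenerationResult(True, 0.0, ["V1: missing trust boundary map"])

    # V2 Document Completeness: lower the ceiling, but let generation proceed.
    result = PreGenerationResult(blocked=False, confidence_ceiling=1.0)
    for name in OPTIONAL_INPUTS:
        if not inputs.get(name):
            lowered = max(0.0, result.confidence_ceiling - penalty)
            result.confidence_ceiling = round(lowered, 2)
            result.notes.append(f"V2: missing optional input {name}")
    return result


blocked = pre_generation_checks({"iam_policy": "..."})
partial = pre_generation_checks({"trust_boundary_map": "...", "iam_policy": "..."})
```

In this sketch `partial` proceeds to generation with a lowered ceiling and two recorded notes, while `blocked` never reaches the model.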

Training Record Types

| Record Type | Purpose | Important Constraint |
| --- | --- | --- |
| Generation | SFT imitation learning from teacher synthetic data and approved reviews | Synthetic data downweighted and eventually retired as real reviews accumulate. |
| Correction | RL/reasoning distillation from human corrections | Teacher reasoning explains why the correction was needed, not just what changed. |
| Abstention | Teaches knowledge boundaries | Prevents the worst failure mode: confident hallucination under insufficient input. |
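The three record types and their constraints could be represented roughly as follows. The field names, the linear decay schedule, and the `retire_after` value of 1000 reviews are assumptions for illustration; the paper states only that synthetic data is downweighted and eventually retired.

```python
from dataclasses import dataclass
from enum import Enum


class RecordType(Enum):
    GENERATION = "generation"
    CORRECTION = "correction"
    ABSTENTION = "abstention"


@dataclass(frozen=True)
class TrainingRecord:
    record_type: RecordType
    prompt: str
    target: str
    synthetic: bool = False
    teacher_reasoning: str = ""

    def __post_init__(self) -> None:
        # Correction records must carry the reason for the fix, not just the fix.
        if self.record_type is RecordType.CORRECTION and not self.teacher_reasoning:
            raise ValueError("correction records require teacher_reasoning")


def sample_weight(record: TrainingRecord, real_review_count: int,
                  retire_after: int = 1000) -> float:
    """Linearly downweight synthetic records until they are retired at weight 0."""
    if not record.synthetic:
        return 1.0
    return max(0.0, 1.0 - real_review_count / retire_after)


synthetic = TrainingRecord(RecordType.GENERATION, "workload A", "threats...", synthetic=True)
reviewed = TrainingRecord(RecordType.GENERATION, "workload B", "threats...")
weights = [sample_weight(synthetic, n) for n in (0, 500, 1000, 2000)]
```

With these placeholder settings the synthetic record's weight falls from 1.0 to 0.0 over the first thousand real reviews, while reviewed records always keep full weight.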

MLOps Promotion Metrics

| Metric | Target | Blocking? |
| --- | --- | --- |
| schema_valid_rate | ≥ 99% | Yes |
| co_match_accuracy | ≥ 90% | Yes |
| stride_coverage_rate | ≥ 85% | Yes |
| abstention_precision | ≥ 80% | Yes |
| false_negative_rate | ≤ 5% | Yes (immediate) |
| generation_p95_latency | ≤ 30s | Alert only |
| teacher_synthetic_quality | ≥ 85% | Teacher gate |
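The thresholds above translate directly into a promotion gate. The sketch below uses the metric names and targets from the table; the function name, the return shape, and the fail-closed handling of missing metrics are assumptions, and the teacher gate is reported separately because the table does not say whether it blocks promotion.

```python
import operator

# (comparison, target, mode) per metric, with rates expressed as fractions.
GATES = {
    "schema_valid_rate": (operator.ge, 0.99, "blocking"),
    "co_match_accuracy": (operator.ge, 0.90, "blocking"),
    "stride_coverage_rate": (operator.ge, 0.85, "blocking"),
    "abstention_precision": (operator.ge, 0.80, "blocking"),
    "false_negative_rate": (operator.le, 0.05, "blocking"),
    "generation_p95_latency": (operator.le, 30.0, "alert"),  # seconds
    "teacher_synthetic_quality": (operator.ge, 0.85, "teacher_gate"),
}


def evaluate_promotion(metrics: dict) -> dict:
    """Return failed metrics grouped by mode, plus the promotion decision."""
    failed = {"blocking": [], "alert": [], "teacher_gate": []}
    for name, (compare, target, mode) in GATES.items():
        value = metrics.get(name)
        # Fail closed: a metric that was never reported counts as failed.
        if value is None or not compare(value, target):
            failed[mode].append(name)
    return {"promote": not failed["blocking"], **failed}


candidate = {
    "schema_valid_rate": 0.995,
    "co_match_accuracy": 0.93,
    "stride_coverage_rate": 0.88,
    "abstention_precision": 0.84,
    "false_negative_rate": 0.03,
    "generation_p95_latency": 42.0,
    "teacher_synthetic_quality": 0.90,
}
slow_but_ok = evaluate_promotion(candidate)
regressed = evaluate_promotion({**candidate, "false_negative_rate": 0.07})
```

In this example the slow candidate is still promoted with a latency alert, while the candidate with a 7% false negative rate is blocked.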