Re-Architecting Enterprise Threat Intelligence for the Foundation Model Era
Chandra Pendyala
Working Paper v5
| 14,000+ enterprise workloads processed | ~60% inference cost reduction per request | Tens of thousands of production requests measured |
Executive Summary
| Dimension | Result / Position |
| Modernization trigger | TensorFlow-era infrastructure cost pressure driving a migration of an existing production platform |
| Production scale | 14,000+ enterprise workloads across IaC, IAM, CI/CD, deployment and governance artifacts |
| Cost outcome | ~60% reduction in inference cost per threat model request, measured against AWS compute spend per request |
| Core shift | Model-centric pipelines to evaluation-centric orchestration |
| Unexpected value | Discovery of hidden operational defects that earlier pipelines could not observe |
This paper documents the evolution of a production enterprise AI threat-modeling platform during a practical modernization effort originally driven by infrastructure cost optimization. The project began as a bounded effort to modernize a TensorFlow-era platform, improve hardware efficiency, reduce operational cost, and take advantage of newer foundation-layer improvements in models, runtimes, and accelerator economics.
The expectation was primarily framework and infrastructure modernization. Instead, the migration exposed deeper architectural constraints embedded throughout the system: framework-coupled semantics, hidden preprocessing divergence, silent context corruption, replay instability, string ambiguity propagation, topology loss, and evaluation fragmentation. Correcting these issues recursively forced a broader redesign — evolving from framework migration into a rethinking of how enterprise AI systems represent, validate, orchestrate, and govern probabilistic reasoning.
| Key Takeaway: Cost pressure started the migration. The evaluation infrastructure we built to manage it safely turned out to expose deeper problems in the original system, and fixing those produced more value than the migration itself. |
How to Read This Paper
The document is designed for layered reading. Skimmers can follow tables, diagrams, and key-takeaway boxes. Patient readers can use the prose sections for the causal chain and operational detail.
| If you care about… | Read… |
| Why the migration started and what it cost | Sections 1–2 |
| TensorFlow to PyTorch and JAX migration lessons | Sections 3, 9, and 10 |
| Semantic numericalization and long strings | Sections 4–5 |
| Hidden production defects — the core finding | Section 7 |
| Threat-modeling architecture patterns | Sections 6 and 8 |
| Hardware-aware orchestration and future portability | Section 9 |
| Operational outcomes and cost breakdown | Section 12 |
| Future quality comparison work | Section 13 |
1. Introduction
The platform described in this paper had been running in production for years, processing threat models across more than 14,000 enterprise workloads. The decision to migrate off TensorFlow was straightforward: inference costs were rising, the ecosystem had moved toward PyTorch and JAX, and staying put was getting more expensive with each passing quarter.
The first practical question the migration raised was: how do we know the new system produces equivalent output? Users had years of familiarity with what the platform generated. Threat models had been reviewed, calibrated against, and built into security workflows. Migrating the infrastructure while silently changing the outputs was not acceptable. Before touching the model pipeline, we needed a rigorous evaluation framework — one that could define “identical quality” precisely enough to serve as a regression baseline throughout the migration.
Investing in evaluation first turned out to be the decision that changed the scope of the project. Once we had production replay infrastructure and deterministic verification in place, we could see things the original pipeline had never exposed. Preprocessing inconsistencies. Silent truncation of long inputs. Train/inference divergence. Retrieval instability. Defects that had been invisible not because they were subtle, but because there had been no mechanism to observe them. The migration became a redesign.
This paper documents that progression: a migration that started with a cost problem, was disciplined by an evaluation-first approach, and expanded into a broader rethinking of how the platform represents, validates, and orchestrates its reasoning. The sections that follow describe each of the major architectural decisions in the order they were encountered.
| Key Takeaway: Investing in evaluation before touching the model pipeline was a risk management decision. It turned into the lens that made everything else visible. |
2. The Migration Trigger
The immediate pressures were concrete. TensorFlow inference costs were rising. PyTorch had become the practical standard for new model development. The team maintaining the platform needed to stay current with an ecosystem that had largely moved on. A migration was going to happen; the question was how to do it without breaking a production system that people depended on.
Why Not Just Use a Frontier Model?
Before investing in a purpose-built system, we prototyped with frontier models. The answer came back quickly: simple agents built on the OpenAI, Anthropic, and Gemini APIs could not handle the task adequately. The failure was not a prompting problem. It was structural.
Enterprise threat modeling across 14,000 heterogeneous workloads requires understanding platform-specific jargon, proprietary infrastructure patterns, and deployment conventions that frontier models had never seen. Without a semantic numericalization and parser layer feeding clean, canonicalized context, the models produced outputs that failed basic verifier checks at an unacceptable rate. With heavy prompt engineering, pass rates improved but remained well below production requirements. The quality gap closed partially when we applied the same preprocessing layer we had built for our own models — which confirmed that the input representation problem was the core issue, not model capability.
Cost and data sovereignty closed the question entirely.
| Dimension | Frontier Model (GPT-4) | Our System (Post-Redesign) |
| Cost per threat model | >$20 (token-based API) | ~$0.0X (on-prem compute) |
| Verifier pass rate | ~30% initial; ~60% with heavy prompt engineering | 95%+ |
| Enterprise jargon handling | Failed without custom preprocessing | Works with semantic numericalization |
| Data sovereignty | Cloud API — data leaves premises | On-premises — compliant by design |
| P95 latency | 8–15s (API + tokenization overhead) | <2s (optimized pipeline) |
| Human review rate | ~50% (hallucinations, missed context) | <15% |
At $20+ per threat model across tens of thousands of requests, the cost was not competitive. More importantly, the enterprise customers this platform serves will not send infrastructure topology, IAM policies, and deployment specifications to an external API. Data residency is a hard requirement, not a preference.
Training data compounds the constraint further. The historical security audits, vulnerability findings, remediation records, proprietary risk models, compliance stances, and control mappings that make the threat models valuable represent the accumulated institutional security knowledge of the enterprise — competitively sensitive and in many cases legally protected. Fine-tuning on that corpus via an external API is not an option regardless of cost. The full training pipeline runs on-premises against data that does not move.
The frontier model prototype was useful despite its limitations. It established a concrete quality baseline, confirmed that the input representation problem was real and solvable with proper preprocessing, and validated that a purpose-built system could achieve quality the API approach could not match at any price. That early comparison also gave us the verifier framework that later became the backbone of the migration evaluation infrastructure.
| Legacy Assumption | Emerging Pressure | Architectural Consequence |
| TensorFlow graph execution | PyTorch and JAX dynamic orchestration | Separate semantic contracts from framework execution |
| Direct string handling inside the pipeline | Long, heterogeneous enterprise artifacts | Numericalize strings before model execution |
| Monolithic GPU training | Mixed precision, routing, LoRA, distillation | Optimize per task and per hardware target |
| Large custom models for narrow tasks | Small models, retrieval, frontier escalation | Use the cheapest reliable adaptation path |
| Framework-coupled preprocessing | Need for portability and reproducibility | Externalize preprocessing as versioned contracts |
| Static evaluation reports | Need to compare migrations safely | Build replay-driven evaluation fabric |
On this platform — threat modeling against structured enterprise inputs with deterministic verifiers and constrained output schemas — pipeline failures dominated over model capability gaps. That may not hold universally; tasks requiring open-ended reasoning over very long contexts face different constraints. But for structured enterprise AI at scale, getting the pipeline right mattered more than getting a better model.
| Key Takeaway: The original architecture made sense for its time. Migrating it carefully surfaced what had accumulated inside it, and addressing that turned out to be the bulk of the work. |
3. The Unexpected Discovery: Migration Was Mostly Understanding
AI-assisted code generation provided comfort during the migration. Files could be translated quickly, syntax could be updated, and the visible signs of progress were real. But the apparent speed did not remove the work that actually determined whether the migration was correct.
Most of the engineering effort remained concentrated in reading, running, replaying, tracing, evaluating, and thinking. The difficult work was semantic reconstruction: determining which behaviors had to be preserved, which behaviors were legacy artifacts, and which behaviors were accidental bugs disguised as model behavior.
| What code generation helped with | What still dominated effort |
| Syntax translation | Recovering architectural intent |
| Framework API substitution | Validating train/eval/production parity |
| Creating visible migration momentum | Tracing hidden preprocessing assumptions |
| Reducing typing burden | Rebuilding evaluation datasets and replay harnesses |
| Generating scaffold code | Determining whether outputs were semantically equivalent |
| Key Takeaway: Code generation accelerated implementation mechanics. The bottleneck remained semantic: recovering architectural intent, validating operational behavior, and determining whether outputs were equivalent. None of that work can be automated away. |
4. Semantic Numericalization
String handling was one of the first practical friction points in the migration. TensorFlow and PyTorch treat strings differently at a low level, and the platform processed a lot of them — IAM policies, Terraform configurations, Kubernetes YAML, architecture documents, CI/CD specifications, audit artifacts. Getting string behavior consistent across the migration boundary required building explicit handling that the TensorFlow pipeline had never needed.
Solving the string problem properly — rather than patching it just enough to compile — meant building a layer that canonicalized inputs before they reached the model. Once that layer existed, it became clear how much ambiguity the original pipeline had been passing through silently. Enterprise strings carry dense semantic content: an IAM policy encodes permission structure, a Terraform file encodes infrastructure topology, an architecture document encodes deployment relationships. Treating them as undifferentiated text and processing them inside framework execution graphs had been obscuring that structure rather than preserving it.
The semantic numericalization layer that resulted performs parser-assisted canonicalization, symbolic reduction, ontology-aware entity mapping, semantic hashing, controlled tokenization, deployment-topology extraction, deterministic feature projection, and long-string decomposition. The governing principle it enforces is simple: ambiguity should be resolved at the ingestion boundary, before it reaches probabilistic reasoning. Passing ambiguous inputs into a model means the model absorbs the ambiguity into its outputs without signaling it — which is precisely what had been happening.
| Core Design Principle: Ambiguity must terminate at the ingestion boundary. If a string is ambiguous, that ambiguity must be resolved, or explicitly represented, before it enters downstream processing. Passing ambiguity forward into model execution means outputs reflect the ambiguity without signaling it, making errors structurally invisible. This was a root cause of the cost problem: ambiguous inputs drove inconsistent token consumption, unpredictable context sizes, and expensive retry cycles. |
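To make the principle concrete, here is a minimal sketch of canonicalization followed by semantic hashing for an IAM-style policy. It is an illustration under stated assumptions, not the platform's parser: the function names, the `SUPPORTED_KEYS` set, and the choice to reject unsupported statement keys are all hypothetical. The sketch relies on IAM action names being case-insensitive and on statement order carrying no meaning.

```python
import hashlib
import json

# Assumption: this sketch only understands these statement keys.
# Anything else is rejected rather than silently dropped.
SUPPORTED_KEYS = {"Sid", "Effect", "Action", "Resource"}


def _as_list(value):
    """IAM allows a scalar or a list in several fields; normalize to a list."""
    return list(value) if isinstance(value, list) else [value]


def canonicalize_policy(policy: dict) -> dict:
    """Return an order-independent, case-normalized form of a policy."""
    statements = []
    for stmt in _as_list(policy.get("Statement", [])):
        unknown = set(stmt) - SUPPORTED_KEYS
        if unknown or "Effect" not in stmt:
            # Ambiguity terminates here instead of reaching the model.
            raise ValueError(f"cannot canonicalize statement: {sorted(unknown)}")
        statements.append({
            "Effect": stmt["Effect"],
            # Lowercase, de-duplicate, and sort the action names.
            "Action": sorted({a.lower() for a in _as_list(stmt.get("Action", []))}),
            "Resource": sorted(set(_as_list(stmt.get("Resource", [])))),
        })
    # Sort statements by their serialized form so input order is irrelevant.
    statements.sort(key=lambda s: json.dumps(s, sort_keys=True))
    return {"Statement": statements}


def semantic_hash(policy: dict) -> str:
    """Hash the canonical form so equivalent policies share one identifier."""
    canonical = json.dumps(canonicalize_policy(policy),
                           sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two policies that differ only in key order, action casing, the `Sid` label, or scalar-versus-list form hash identically, while a statement containing a key the sketch does not understand (for example `Condition`) raises instead of passing through partially interpreted.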
One honest caveat: semantic numericalization is not a solved problem. As enterprise input formats evolve — new cloud providers, new IaC dialects, new policy languages — the ontology and parser layer requires ongoing maintenance to remain current. The layer reduces ambiguity for known input types; novel formats still require human-guided extension before they can be reliably canonicalized.
| Key Takeaway: What started as a migration friction point became a foundational architectural layer. The string canonicalization work requires ongoing maintenance as input formats evolve (new cloud providers, new IaC dialects), but the investment paid for itself quickly in reduced tokenization variance and more predictable model behavior. |
5. Long String Architecture and Topology Preservation
Enterprise documents are not ordinary text. An architecture document encodes deployment topology, trust relationships, component dependencies, and control structures in a form that requires domain-specific interpretation to decompress correctly. Terraform encodes infrastructure as a graph of interdependent resources with implicit ordering and dependency constraints. A CI/CD specification encodes execution sequences, environment assumptions, and deployment conditions that interact with every other component in the system.
Traditional approaches to long strings — fixed-length truncation and naive chunking — fail to preserve the semantic structure that makes enterprise documents useful inputs for reasoning tasks. Truncation silently discards content that may be critical. Naive chunking breaks documents at boundaries that have no semantic significance, scattering related information across chunks in ways that prevent coherent reconstruction during retrieval.
The redesigned pipeline treated long document processing as a first-class engineering problem, introducing hierarchical chunking that preserves document structure, topology-aware segmentation that extracts deployment relationships as explicit graph representations, parser-guided decomposition by document type, semantic compression into canonical intermediate representations, and retrieval-oriented numericalization for bounded-context reasoning.
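As an illustration of topology-aware segmentation, the sketch below groups already-parsed infrastructure resources into chunks that follow dependency edges rather than character offsets. The input shape (a mapping from resource address to a body with a `depends_on` list) and the function names are assumptions made for this example; a real pipeline would obtain them from a Terraform parser and would also consider implicit references.

```python
from collections import defaultdict


def extract_edges(resources: dict):
    """Return (edges, unresolved) from explicit depends_on declarations.

    Dependencies that point outside the document are reported in
    `unresolved` instead of being dropped silently.
    """
    edges, unresolved = set(), set()
    for address, body in resources.items():
        for dep in body.get("depends_on", []):
            (edges if dep in resources else unresolved).add((address, dep))
    return edges, unresolved


def topology_chunks(resources: dict) -> list:
    """Group resources into connected components of the dependency graph."""
    edges, _ = extract_edges(resources)
    neighbors = defaultdict(set)
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)

    seen, chunks = set(), []
    for start in sorted(resources):          # sorted for deterministic output
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:                          # iterative depth-first walk
            node = stack.pop()
            component.append(node)
            for nxt in sorted(neighbors[node]):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        chunks.append(sorted(component))
    return chunks
```

Each chunk keeps a resource together with everything it transitively depends on, so a retrieval step can fetch a whole trust-relevant neighborhood instead of an arbitrary text window.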
| Key Takeaway: The truncation defects in Section 7 trace directly to this: topology that was discarded at ingestion could not be recovered downstream. Preserving document structure at parse time turned out to matter more than context window size. |
6. Evaluation-Centric Design
The evaluation infrastructure built for the migration turned into the most operationally significant part of the redesign. The original reason was straightforward: we needed to know whether the migrated system was producing equivalent outputs. What we built to answer that question ended up changing how the entire platform was developed and validated going forward.
Periodic evaluation against held-out data had been sufficient when the pipeline was stable. During a migration, with preprocessing changing, tokenization behavior shifting, and retrieval indices evolving, it was not enough. We needed continuous correctness checks across every version, not just snapshots at release time.
In the redesigned system, evaluation became the primary software discipline. The evaluation fabric introduced curated benchmark datasets covering known threat-modeling scenarios; production replay datasets that reproduce historical behavior; regression validation suites that detect behavioral changes across versions; verifier disagreement scoring that flags divergence between deterministic checks and generative outputs; retrieval quality analysis; lineage-aware tracking of dataset, transform, model, hardware, and output provenance; and deterministic promotion gates before production release.
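A deterministic promotion gate over a replay dataset can be sketched as follows. This is a simplified stand-in for the evaluation fabric described above: the `ReplayCase` structure, the use of Jaccard overlap on threat identifiers as the equivalence measure, and the default thresholds are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplayCase:
    case_id: str
    baseline_threat_ids: frozenset   # threats recorded for this historical input


def jaccard(a: frozenset, b: frozenset) -> float:
    """Set overlap in [0, 1]; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def promotion_gate(cases, candidate_outputs, min_overlap=0.9, max_failures=0):
    """Compare a candidate run against the replay baseline.

    Returns (passed, failures). Thresholds are hypothetical defaults.
    """
    failures = []
    for case in cases:
        produced = candidate_outputs.get(case.case_id)
        if produced is None:
            failures.append((case.case_id, "missing from candidate run"))
            continue
        score = jaccard(case.baseline_threat_ids, frozenset(produced))
        if score < min_overlap:
            failures.append((case.case_id, f"overlap={score:.2f}"))
    return len(failures) <= max_failures, failures
```

Because the gate is a pure function of the replay cases and the candidate outputs, the same inputs always yield the same promotion decision, which is the property that lets it serve as a regression baseline across versions.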
| Key Takeaway: Having evaluation infrastructure in place before the migration meant we could run experiments (different models, different routing approaches, different hardware) and trust the results. It also meant regressions showed up before production rather than after. |
7. Hidden Operational Defects Revealed
One of the least anticipated outcomes of the redesign was the systematic discovery of previously hidden operational defects. The earlier platform lacked the instrumentation necessary to make these failures visible. Without deterministic replay, there was no reliable way to reproduce historical behavior. Without semantic canonicalization, there was no stable basis for comparing inputs across pipeline versions. Without lineage-aware evaluation, there was no mechanism for tracing outputs back to the preprocessing decisions that produced them.
What emerged was a pattern of pipeline failures. Preprocessing decisions had propagated silently into training data, learned representations, and production outputs simultaneously.
| Defect Type | Why It Stayed Hidden | How the Redesign Exposed It |
| Silent truncation | Outputs remained plausible even when context was missing | Production replay plus long-string decomposition |
| Train/inference divergence | Training and production shared the same flawed preprocessing assumptions | Lineage-aware evaluation and transform versioning |
| String ambiguity propagation | Framework-specific string behavior hid semantic drift | Canonicalization and numericalization |
| Retrieval instability | Non-repeatable context looked like ordinary model variance | Retrieval quality analysis and replay |
| Topology loss | Threat models could look complete while missing trust-boundary relationships | Topology extraction and verifier checks |
Representative Defect: Silent Truncation of Architecture Documents
The legacy platform applied fixed-length truncation to long string inputs as a standard preprocessing step. This was a practical decision at the time: bounded input lengths simplified framework execution, reduced memory consumption, and kept training tractable. Architecture description documents, however, frequently exceeded this boundary and were truncated silently. The pipeline produced no signal indicating information loss had occurred.
The downstream consequence was structurally compromised threat models — though the nature of the compromise was more subtle than simple omission. A post-redesign review suggests the defect likely manifested as over-generation rather than under-generation. The architecture description field — the most likely truncation victim — may have carried less discriminating signal than the architecture components field, which escaped truncation. With components intact but architectural context incomplete, the model appears to have compensated by generating a higher volume of less topology-specific threats. Newer outputs look noticeably tighter: fewer threats, more grounded in actual deployment context.
| Silent Failure: Train/Inference Distribution Mismatch. The truncation defect had two compounding layers. Layer 1 (Inference): Incomplete architectural context drove over-generation. Threat models appeared thorough but were structurally misaligned, with more threats than the architecture warranted and less grounding in actual topology. Layer 2 (Training): Truncation occurred inside the training pipeline itself. The model never learned from complete architectural descriptions, so complete documents were out-of-distribution inputs at inference time. Scope: A proper controlled comparison between legacy and redesigned outputs has not yet been completed. Preliminary observation suggests over-generation was widespread wherever architecture documents exceeded the truncation boundary, but quantification awaits the statistical study described in Section 13. |
This defect was operationally invisible for its entire production lifetime. Outputs looked reasonable — voluminous, even, which can read as thoroughness. No pipeline errors were raised. No evaluation signal flagged the mismatch. The system produced structurally misaligned threat models systematically, at scale, wherever architecture documents exceeded the truncation boundary.
The model performed as well as its pipeline allowed. The pipeline was where the failure lived. Model-centric evaluation frameworks measure output quality against labels or human judgment — neither of which easily detects systematic context distortion baked into training itself. The defect was invisible to the tools the system had for looking.
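The difference between silent and observable truncation is small in code. The sketch below shows one way a bounding step could carry an explicit loss signal forward; `BoundedText` and `bound_input` are hypothetical names, and the redesigned pipeline replaces truncation with decomposition rather than merely annotating it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundedText:
    text: str               # the content that was kept
    original_length: int    # length before bounding, in characters
    truncated: bool         # True if any content was discarded

    @property
    def dropped_chars(self) -> int:
        return self.original_length - len(self.text)


def bound_input(text: str, max_chars: int) -> BoundedText:
    """Bound the input length but record whether information was lost."""
    if max_chars < 0:
        raise ValueError("max_chars must be non-negative")
    return BoundedText(
        text=text[:max_chars],
        original_length=len(text),
        truncated=len(text) > max_chars,
    )
```

With the flag attached, an evaluation or lineage step can count how many training and inference inputs lost content, which is exactly the signal the legacy pipeline never emitted.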
| Key Takeaway: Cost reduction was the goal. Exposing operational failures that earlier architectures could not observe was the unexpected dividend, and it revealed that the failure mode was likely excess rather than absence. |
8. Threat Modeling System Architecture
The threat-modeling task required reconstructing deployment topology from heterogeneous inputs, mapping components to vulnerability patterns, generating STRIDE-aligned threats appropriate to the architecture, validating generated threats against deterministic control mappings, and producing outputs specific enough to drive remediation.
The Epistemic Contract
The system knows what it knows. It knows what it does not know. Both are first-class outputs.
| Principle | Meaning |
| Input assessed before generation | The verifier sets the confidence ceiling before the model sees a token. |
| Abstention is valid | Reporting "insufficient input to assess this boundary" is better than hallucinated specificity. |
| Confidence ceiling is deterministic | Derived from input completeness and verifier logic. The model probability plays no part in setting the ceiling. |
| Human reviewer is co-author | The reviewer completes what the system cannot complete alone. Review is a co-authorship step, embedded in the workflow by design. |
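The contract can be sketched in a few lines. Everything specific here is an assumption for illustration: the `REQUIRED_FIELDS` list, the equal weighting of fields, and the abstention threshold are hypothetical, and the real verifier logic is richer than a completeness ratio. The point shown is that the ceiling is computed from the input alone and the model's own probability can only lower the reported confidence, never raise the ceiling.

```python
# Hypothetical input fields the verifier expects for a boundary assessment.
REQUIRED_FIELDS = ("components", "trust_boundaries", "data_flows", "data_classes")


def confidence_ceiling(artifact: dict) -> float:
    """Deterministic ceiling derived only from input completeness.

    A field counts as present when it exists and is non-empty.
    """
    present = sum(1 for field in REQUIRED_FIELDS if artifact.get(field))
    return present / len(REQUIRED_FIELDS)


def finalize(threat: str, model_confidence: float, ceiling: float,
             abstain_below: float = 0.5) -> dict:
    """Apply the ceiling to a generated threat, or abstain explicitly."""
    if ceiling < abstain_below:
        # Abstention is a first-class output routed to the human reviewer.
        return {"status": "abstain", "reason": "insufficient input",
                "confidence": None}
    return {"status": "emitted", "threat": threat,
            "confidence": min(model_confidence, ceiling)}
```

In this sketch the ceiling is computed before generation, so an artifact with too little input never produces a confident-looking threat, regardless of what the model reports.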
Three Cognitive Modes
| Human Mode | System Component | Failure Mode |
| Pattern recognition | Student LLM, SFT on historical reviews and component threat models | Hallucination when pattern class is absent from training corpus |
| Principled reasoning | STRIDE/MITRE ontology plus control objective / control procedure / control standard framework | Gap detection failures when rules are incomplete or mappings are stale |
| Evidence grounding | Two-pass RAG and evidence retrieval | Retrieval misses when index coverage is poor or embeddings diverge |
Knowledge Fabric: Two Semantic Spaces
The platform deliberately separates threat patterns from policy and control records because they live in different semantic spaces. Merging the indexes causes the same vector query to retrieve the wrong kind of evidence.
| Index | Semantic Purpose | Retrieval Rule |
| Index A: Policy and Control Corpus | CO/CP/CS records, AWS component threat models, enterprise policies, compliance frameworks | Pass 2 uses strict terms filtering on co_id, not kNN. |
| Index B: Historical Threat Patterns | Approved threat descriptions by component type, trust boundary, and STRIDE category | Pass 1 uses semantic kNN over component_type, trust_boundary, and data_class. |
| Neptune graph | AWSComponent → ThreatProfile → ControlObjective | Traversal starts at ThreatProfile IDs from Pass 1, not globally at AWSComponent. |
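The retrieval rules in the table can be illustrated with an in-memory sketch. The record shapes, the toy three-dimensional vectors, and the dictionary standing in for the graph are assumptions made for this example; the production system uses real indexes and a graph database rather than Python lists.

```python
import math


def cosine(u, v) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pass_one_knn(query_vector, index_b, k):
    """Pass 1: semantic kNN over historical threat patterns (Index B)."""
    ranked = sorted(index_b,
                    key=lambda rec: cosine(query_vector, rec["vector"]),
                    reverse=True)
    return [rec["threat_profile_id"] for rec in ranked[:k]]


def traverse(threat_profile_ids, graph):
    """Graph step: start at the ThreatProfile IDs found in Pass 1 only."""
    return {co for tp in threat_profile_ids for co in graph.get(tp, [])}


def pass_two_filter(co_ids, index_a):
    """Pass 2: strict filter on co_id over the policy corpus (Index A)."""
    return [rec for rec in index_a if rec["co_id"] in co_ids]
```

Keeping Pass 2 as an exact-match filter means the control records returned are determined entirely by the graph traversal, so a semantically similar but unrelated policy cannot enter the evidence set.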
| Key Takeaway: The system combines learned pattern recognition, deterministic reasoning, evidence grounding, verification, and human co-authorship, each handling the class of problem it is suited for. |
9. Hardware-Aware Execution
Modern AI systems are increasingly hardware-sensitive: mixed precision, sharding, LoRA/QLoRA, distillation, CPU-resident inference, vLLM serving, and JAX/XLA-style compilation create task-specific cost profiles that static architectures cannot exploit.
The redesign separated orchestration logic from hardware execution. The system could route work to the cheapest reliable execution path rather than default every request to a large model or expensive GPU path.
| Task Class | Preferred Execution Path | Reason |
| Simple deterministic validation | Rules, parsers, or SQL-style checks | Lowest cost and highest traceability |
| Known component threat pattern | Small specialized model plus retrieval | Fast, cheap, repeatable |
| Ambiguous or novel architecture | Escalation to larger teacher/synthesis model | Higher reasoning capacity only where entropy is high |
| Repeated expensive behavior | Distill or SFT smaller model | Collapse cost after behavior is understood |
| Long document understanding | Topology extraction plus bounded retrieval | Avoids raw context-window inflation |
The routing table above summarizes how that played out in practice: each task class got the cheapest execution path that could pass the verifiers.
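One way to express "cheapest path that passes the verifiers" is a cost-ordered cascade, sketched below. This is an assumption about mechanism, not a description of the production router, which may assign paths per task class ahead of time; `ExecutionPath`, `relative_cost`, and `route` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ExecutionPath:
    name: str
    relative_cost: float                     # hypothetical unitless cost
    run: Callable[[str], Optional[str]]      # returns None when not applicable


def route(request: str, paths, verify: Callable[[str], bool]):
    """Try paths from cheapest to most expensive; stop at the first output
    that passes the deterministic verifier.

    Returns (path_name, output, attempted_names); the first two are None
    when no path produces a verified output.
    """
    attempted = []
    for path in sorted(paths, key=lambda p: p.relative_cost):
        attempted.append(path.name)
        output = path.run(request)
        if output is not None and verify(output):
            return path.name, output, attempted
    return None, None, attempted
```

The `attempted` list doubles as a lineage record: it shows which cheaper paths were tried and rejected before escalation, which is useful when deciding whether a repeated expensive behavior is worth distilling into a smaller model.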
| Hardware Portability: Because orchestration and execution are separated in the redesigned architecture, future hardware targets plug in without requiring pipeline re-engineering. Frameworks like TurboQuant and ThunderKittens (Stanford) are making it easier to target new accelerator generations with less expert effort: ThunderKittens showed near-theoretical memory bandwidth utilization on H100s, and its Apple Silicon port ThunderMittens required only a tile-size change to cross architectures. As these abstraction layers mature, the cost of adding a new hardware target decreases. The separation we built for cost reasons in 2024 turns out to have ongoing portability value. |
10. Mechanical Recoding Limitations
Early modernization attempts used AI-assisted code generation to translate TensorFlow-era pipelines into PyTorch implementations. The approach was appealing: it reduced the mechanical effort of syntax translation and created visible migration momentum. For simpler, well-structured pipelines, code generation can be highly effective. In our case — where the pipeline was tightly coupled to TensorFlow’s execution graph and preprocessing assumptions were implicit throughout — the limitations became visible quickly.
Framework translation and architectural modernization are different activities. Generated code frequently reproduced hidden preprocessing assumptions, framework coupling, ambiguous tensor semantics, inconsistent transformation ordering, and train/evaluation divergence — because these properties were embedded in the logic of the original code, and code generation preserves logic while updating syntax.
| Migration Activity | Observed Reality |
| Mechanical TensorFlow to PyTorch recoding | Useful for comfort and scaffolding, but not a reliable source of end-to-end time savings in tightly coupled systems |
| Generated code review | Required deep reading — generated code preserved hidden assumptions |
| Compilation success | Did not imply semantic parity |
| Local unit tests | Could miss production replay divergence |
| Developer time allocation | Overwhelmingly reading, running, tracing, evaluating, and thinking — not typing |
| Key Takeaway: Typing was never the bottleneck. The real work was recovering semantic intent and validating operational behavior, and code generation leaves that work entirely to the engineer. |
11. Organizational Consequences
A major operational constraint emerged from the divergence between legacy TensorFlow ecosystems and newer PyTorch- and JAX-centered environments. The challenge was not simply hiring availability, although engineers fluent across TensorFlow graph semantics, legacy preprocessing systems, modern PyTorch execution models, distributed inference, and evolving accelerator ecosystems are rare. The deeper problem was that critical system behavior had become implicitly embedded inside framework-specific implementations rather than expressed as explicit architectural contracts. Understanding the system required knowing the framework-level mechanics that produced its behavior, and that knowledge was neither documented nor transferable without significant effort.
The redesign addressed this by externalizing semantic normalization, evaluation logic, lineage tracking, deterministic verification, and orchestration behavior from framework runtimes. These concerns became explicit architectural components with defined interfaces, documented behavior, and evaluation coverage. New engineers could reason about system behavior from architectural documentation and evaluation results rather than needing to internalize framework-specific implementation details.
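A versioned preprocessing contract with a parity check might look like the sketch below. The names (`TransformContract`, `run_with_lineage`, `check_parity`) and the record layout are hypothetical; the example only illustrates how externalizing a transform lets training and inference prove they used the same one.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TransformContract:
    name: str
    version: str
    apply: Callable[[str], str]   # framework-independent transform

    @property
    def contract_id(self) -> str:
        return f"{self.name}@{self.version}"


def run_with_lineage(contract: TransformContract, raw: str) -> dict:
    """Apply the transform and record which contract produced the output."""
    return {
        "contract_id": contract.contract_id,
        "input_sha256": hashlib.sha256(raw.encode("utf-8")).hexdigest(),
        "output": contract.apply(raw),
    }


def check_parity(training_record: dict, inference_record: dict) -> None:
    """Raise if training and inference used different transform contracts."""
    if training_record["contract_id"] != inference_record["contract_id"]:
        raise RuntimeError(
            "train/inference divergence: "
            f"{training_record['contract_id']} != {inference_record['contract_id']}"
        )
```

Because the contract is a plain object rather than logic embedded in a framework graph, an engineer can read its identifier in a lineage record and know which preprocessing behavior applied without tracing framework internals.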
| Key Takeaway: Externalizing system behavior into explicit contracts made the platform maintainable by engineers who understood the architecture, not engineers who had memorized the framework internals. |
12. Operational Outcomes
The redesign produced an approximately 60% reduction in inference cost per threat model request, measured as AWS compute spend per request across tens of thousands of production requests. This is a concrete, auditable business metric: the numerator is cloud compute spend and the denominator is production threat model requests processed. The four mechanisms that drove the reduction are summarized below.
| Mechanism | Cost Driver Eliminated | Outcome |
| Semantic numericalization + model routing | Default frontier-model calls for all requests | Smaller, cheaper models for routine work |
| Long document decomposition | Oversized context windows and excess token cost | Bounded, predictable token consumption |
| Deterministic verification | Expensive retries on generative failures | Failures caught early, upstream |
| Hardware-aware routing | GPU-centric execution for all task types | Right hardware matched to each task class |
None of these gains came from switching cloud providers or renegotiating rates. They came entirely from architectural decisions. Beyond cost reduction, the redesign produced substantial improvements in governance posture, replay consistency, traceability, and deployment portability.
| Outcome Area | Result |
| Cost | ~60% reduction in inference cost per threat model request |
| Retraining overhead | Reduced through smaller adaptation loops, routing, and distillation |
| GPU dependency | Reduced by assigning work to cheaper execution paths where sufficient |
| Traceability | Improved through MLflow lineage, replay datasets, and transform versioning |
| Governance | Improved through deterministic verifier checks and evidence-linked outputs |
| Defect visibility | Improved through production replay, semantic numericalization, and evaluation gates |
| Deployment portability | Improved by separating orchestration logic from hardware execution |
| Key Takeaway: The 60% figure covers inference cost. The output correctness improvement (fixing defects that had been invisible since the original deployment) is harder to quantify but likely more significant in practice. |
13. Future Work: Statistical Output Comparison
Future work involves statistically comparing outputs between the legacy platform and the redesigned platform using identical production replay datasets. Preliminary operational evidence — particularly the observation that newer outputs look tighter and more topology-grounded — suggests the redesigned platform may produce substantially superior outputs in addition to operational savings. But the quality claim requires controlled comparison before it can be stated with confidence.
| Evaluation Dimension | Question |
| Threat completeness | Does the redesigned platform identify more valid threats per workload, or fewer but better-grounded ones? |
| Over-generation | Does the redesigned platform produce fewer excess threats — consistent with the truncation defect hypothesis? |
| False negatives | Does topology preservation reduce missed critical threats? |
| Remediation quality | Are recommendations more specific and actionable? |
| Control mapping accuracy | Are STRIDE/MITRE/control mappings more coherent? |
| Replay consistency | Do identical inputs produce stable outputs across versions? |
| Traceability quality | Can reviewers reconstruct why a threat was generated? |
| Human review effort | Does reviewer time decrease without reducing quality? |
| Key Takeaway The next scientific step is a controlled, statistically defensible comparison of old and new outputs under identical production replay conditions — with particular attention to whether over-generation decreased and topology grounding improved. |
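One way to frame the over-generation question is as a paired comparison: each replay workload is run through both platforms and the per-workload difference in threat count is analyzed. The sketch below uses a percentile bootstrap over paired differences. It is an illustration under stated assumptions rather than the method used on this platform: the function name `paired_bootstrap_ci`, the resample count, and the choice of bootstrap over other paired tests are all assumptions.

```python
import random
from statistics import mean


def paired_bootstrap_ci(legacy, redesigned, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (redesigned - legacy).

    `legacy` and `redesigned` hold one measurement per replay workload
    (for example, threats generated), in the same workload order.
    Returns (mean_difference, ci_low, ci_high).
    """
    if len(legacy) != len(redesigned) or not legacy:
        raise ValueError("inputs must be non-empty and paired by workload")

    # Pairing removes workload-to-workload variation from the comparison.
    diffs = [new - old for old, new in zip(legacy, redesigned)]
    rng = random.Random(seed)  # fixed seed so the analysis itself is replayable
    n = len(diffs)

    # Resample the paired differences with replacement and record each mean.
    resampled_means = sorted(
        mean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    low = resampled_means[int((alpha / 2) * n_resamples)]
    high = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), low, high
```

If the interval for threat count excludes zero on the negative side, that is consistent with reduced over-generation. It does not by itself show that the removed threats were invalid, which still needs reviewer-labeled data.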
14. What This Project Taught Us
This project ran across four years of significant change in models, frameworks, and hardware economics. Looking back at what actually drove the outcomes, a few patterns stand out that we expect to apply to similar problems going forward.
Invest in deterministic verification early
The verifier pipeline was built first as a regression safety mechanism for the migration. It turned out to be the most valuable infrastructure investment of the entire project. Verifiers gave us a consistent, objective quality bar that made cost experimentation tractable. We could try a classification model feeding a generation model, smarter retrieval-augmented approaches, or different routing strategies, and know quickly whether each passed or failed on what mattered. Without that bar, each experiment would have been a judgment call.
Designing for deterministic verification from the start is worth the investment. When you know what correct looks like deterministically, the whole system becomes more tractable — experimentation is faster, regressions are visible, and the quality bar survives model and framework changes.
String handling, context management, and parsing discipline have outsized impact
These concerns get underweighted because they are unglamorous. On this project they had an outsized impact on output predictability. Decisions about how strings are canonicalized, how long documents are decomposed, and how context is assembled before inference propagate into every output the system produces. The truncation defect in Section 7 is a direct example: a preprocessing decision made for practical reasons at training time propagated silently into every threat model the system produced for years.
Layer contracts make code generation productive
One unexpected benefit of the layered architecture was what it did for code generation. When each layer has a stable, well-defined contract — known inputs, known outputs, known verification criteria — code generation becomes genuinely productive. The problem is local and bounded. The generator cannot silently break something downstream because the contract defines the boundary. The verifier checks the answer.
Earlier attempts at code-generation-assisted migration struggled because the pipeline was tightly coupled: there were no clean boundaries to contain the generated code. Architectural discipline and productive code generation turned out to be mutually reinforcing. Clean layer design makes generation tractable, and the wish to keep generation productive is an incentive to keep the layers clean.
Stochastic inference belongs in the orchestration layer as an escape hatch
The convenience of letting a large frontier model handle everything is real. A single capable model absorbs distribution shift gracefully, requires no routing logic, and produces coherent output across a wide range of inputs. For many applications — content generation, summarization, creative work — that convenience is the right tradeoff.
For threat models, financial audit statements, compliance reports, and other outputs where correctness is auditable and stakes are high, the calculus is different. Stochastic inference by default means unpredictable outputs by default. That is a liability in domains where consumers need to trust and act on what the system produces.
The more productive pattern is to treat stochastic inference as an escape hatch in the orchestration layer. Once the intelligence for a class of problem is understood well enough to distill and verify, serve it deterministically. Reserve live inference for cases that are genuinely novel or where verifier confidence is low. The verifier is the mechanism that makes this routing decision trustworthy — it tells you when the deterministic path is safe and when it is not.
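The routing decision described above can be sketched as a small function. This is a minimal illustration, not the platform's orchestration code: `route_request`, the `Route` record, the three callables, and the 0.9 confidence threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Route:
    path: str    # "deterministic" or "stochastic"
    output: str


def route_request(
    request: str,
    distilled_lookup: Callable[[str], Optional[str]],
    verifier_confidence: Callable[[str, str], float],
    live_inference: Callable[[str], str],
    threshold: float = 0.9,  # assumed cut-off, not a value from the paper
) -> Route:
    """Serve the distilled answer when the verifier trusts it; otherwise escalate."""
    candidate = distilled_lookup(request)
    if candidate is not None and verifier_confidence(request, candidate) >= threshold:
        # Problem class is understood and verified: no stochastic call needed.
        return Route("deterministic", candidate)
    # Novel input or low verifier confidence: use live inference as the escape hatch.
    return Route("stochastic", live_inference(request))
```

Keeping the verifier as an injected callable means the routing rule does not depend on which model or runtime sits behind either path.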
On this platform, where outputs feed into security review workflows and reviewers need to trust and act on the results, that routing decision was worth the engineering investment. As models continue to improve and inference costs continue to fall, this pattern becomes more practical, not less. The question for each problem class shifts from “which model is capable enough?” to “have we crystallized the intelligence well enough to stop running it stochastically at every inference?” For high-stakes auditable outputs, that is the question worth asking.
Conclusion
A cost optimization effort became a broader architectural redesign because the migration exposed hidden semantic coupling, topology loss, replay instability, evaluation fragmentation, and framework-bound behavior that model upgrades alone would never have surfaced.
The resulting architecture uses foundation models selectively: as teachers, synthesis engines, and escalation paths for cases the deterministic layer cannot handle confidently. The durable value came from the discipline applied at every layer — how inputs were canonicalized, how context was preserved, how outputs were verified, how behavior was made replayable. That discipline survived four years of rapid change in models, frameworks, and hardware because it was never coupled to any of them.
The engineering discipline that made this project work — evaluation-first, stable layer contracts, pipeline rigor, deterministic verification — felt specific to this migration at the time. Looking back, it reads more like a pattern. We expect to reach for the same approach on similar problems: any production AI system where output quality is auditable, costs need to be managed explicitly, and the underlying models and infrastructure are still moving fast enough that you need the system to stay flexible around them.
Appendix A: Operational Architecture Details
The following details summarize selected implementation patterns from the production threat-modeling architecture. They are included to make the paper concrete without turning the main narrative into a deployment manual.
Verifier Pipeline
| Check | Stage | Behavior |
| V1 Input Sufficiency | Pre-generation | Blocking. Missing trust boundary map prevents generation. |
| V2 Document Completeness | Pre-generation | Lowers confidence ceiling for missing optional inputs; does not block generation. |
| V3 STRIDE Completeness | Post-generation | Flags missing applicable STRIDE categories for human review. |
| V4 MITRE Coherence | Post-generation | Flags incoherent technique mappings for review. |
| V5 Semantic Coherence | Post-generation | Checks threat description against STRIDE category and evidence chunk. |
| V6 Evidence Sufficiency | Post-generation | Validates that evidence content supports the stated threat — chunk presence alone is insufficient. |
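The difference between a blocking check (V1), a confidence-lowering check (V2), and a flag-for-review check (V3) can be sketched as follows. Only the trust boundary map requirement and the STRIDE categories come from the text above. The optional input names, the per-input penalty of 0.1, and the function and field names are assumptions made for illustration.

```python
from dataclasses import dataclass, field

REQUIRED_INPUTS = ("trust_boundary_map",)               # V1 requirement from the table
OPTIONAL_INPUTS = ("data_flow_diagram", "iam_policy")   # hypothetical examples
STRIDE = {
    "Spoofing", "Tampering", "Repudiation",
    "Information Disclosure", "Denial of Service", "Elevation of Privilege",
}


@dataclass
class PreCheck:
    blocked: bool
    confidence_ceiling: float
    notes: list = field(default_factory=list)


def pre_generation(inputs: dict, penalty: float = 0.1) -> PreCheck:
    """V1 blocks on missing required inputs; V2 only lowers the confidence ceiling."""
    missing_required = [k for k in REQUIRED_INPUTS if not inputs.get(k)]
    if missing_required:
        notes = [f"V1 missing required input: {k}" for k in missing_required]
        return PreCheck(blocked=True, confidence_ceiling=0.0, notes=notes)

    missing_optional = [k for k in OPTIONAL_INPUTS if not inputs.get(k)]
    # Assumed scoring rule: each missing optional input costs a fixed penalty.
    ceiling = max(0.0, 1.0 - penalty * len(missing_optional))
    notes = [f"V2 missing optional input: {k}" for k in missing_optional]
    return PreCheck(blocked=False, confidence_ceiling=ceiling, notes=notes)


def stride_gaps(applicable: set, threats: list) -> set:
    """V3: return applicable STRIDE categories with no generated threat (flag, do not block)."""
    covered = {t["stride"] for t in threats}
    return (applicable & STRIDE) - covered
```

A non-empty result from `stride_gaps` would be attached to the output for human review rather than used to reject it, matching the post-generation behavior in the table.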
Training Record Types
| Record Type | Purpose | Important Constraint |
| Generation | SFT imitation learning from teacher synthetic data and approved reviews | Synthetic data downweighted and eventually retired as real reviews accumulate. |
| Correction | RL/reasoning distillation from human corrections | Teacher reasoning explains why the correction was needed — not just what changed. |
| Abstention | Teaches knowledge boundaries | Prevents the worst failure mode: confident hallucination under insufficient input. |
MLOps Promotion Metrics
| Metric | Target | Blocking? |
| schema_valid_rate | ≥ 99% | Yes |
| co_match_accuracy | ≥ 90% | Yes |
| stride_coverage_rate | ≥ 85% | Yes |
| abstention_precision | ≥ 80% | Yes |
| false_negative_rate | ≤ 5% | Yes — immediate |
| generation_p95_latency | ≤ 30s | Alert only |
| teacher_synthetic_quality | ≥ 85% | Teacher gate |
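The table above maps naturally onto a declarative promotion gate. The sketch below encodes the listed thresholds as data and evaluates a candidate's metrics against them. The metric names and targets are taken from the table. The function name, the result shape, the treatment of a missing metric as a failure, and the handling of the teacher gate as non-blocking for promotion are assumptions.

```python
import operator

# (comparison, target, kind) per metric; rates are fractions, latency is in seconds.
GATES = {
    "schema_valid_rate": (operator.ge, 0.99, "blocking"),
    "co_match_accuracy": (operator.ge, 0.90, "blocking"),
    "stride_coverage_rate": (operator.ge, 0.85, "blocking"),
    "abstention_precision": (operator.ge, 0.80, "blocking"),
    "false_negative_rate": (operator.le, 0.05, "blocking"),
    "generation_p95_latency": (operator.le, 30.0, "alert"),
    "teacher_synthetic_quality": (operator.ge, 0.85, "teacher_gate"),
}


def evaluate_promotion(metrics: dict) -> dict:
    """Compare measured metrics to the gates and decide whether promotion may proceed."""
    blocking_failures, alerts = [], []
    for name, (compare, target, kind) in GATES.items():
        value = metrics.get(name)
        # Assumption: a metric that was never measured counts as a failure.
        passed = value is not None and compare(value, target)
        if passed:
            continue
        if kind == "blocking":
            blocking_failures.append(name)
        else:
            # Alert-only and teacher-gate misses are reported but do not block here.
            alerts.append(name)
    return {
        "promote": not blocking_failures,
        "blocking_failures": blocking_failures,
        "alerts": alerts,
    }
```

Holding the thresholds in one table-like structure keeps the promotion rule reviewable and versionable alongside the transforms and replay datasets.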