First, let me set up the problem. Two kinds of technologists will read this article: finance people and machine learning people. Let me use a couple of short paragraphs to help both groups appreciate this material.
To machine learning people: Option pricing relies on the Black-Scholes model (BSM) for pricing various derivatives. This theoretical model has been enhanced with various techniques to account for the differences between its assumptions and reality. A price and its sensitivities to various factors (the Greeks: Delta, Vega, Gamma, Theta, Rho), that is, the sensitivity to the price of the underlying asset, the sensitivity to volatility, the convexity effect of the asset price, the sensitivity to time, and the sensitivity to interest rates, are what hedging strategies are built on. So effective hedging needs good pricing models and good models for computing these sensitivities.
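For ML readers unfamiliar with the Greeks, here is a minimal sketch of the closed-form Black-Scholes call price together with two of its sensitivities (Delta and Vega). The parameter values are purely illustrative:

```python
# Sketch: closed-form Black-Scholes call price plus Delta and Vega.
# Parameter values below are hypothetical, for illustration only.
import math
from math import log, sqrt, exp

def norm_cdf(x):
    # Standard normal CDF via the error function (avoids a SciPy dependency).
    return 0.5 * (1.0 + math.erf(x / sqrt(2.0)))

def norm_pdf(x):
    return exp(-0.5 * x * x) / sqrt(2.0 * math.pi)

def bs_call(S, K, T, r, sigma):
    """Price, Delta, and Vega of a European call under Black-Scholes."""
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    price = S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)
    delta = norm_cdf(d1)               # sensitivity to the underlying's price
    vega = S * norm_pdf(d1) * sqrt(T)  # sensitivity to volatility
    return price, delta, vega

price, delta, vega = bs_call(S=100, K=100, T=1.0, r=0.05, sigma=0.2)
```

An ML model that beats BSM must reproduce not just `price` but these sensitivities, since the hedges depend on them.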
To finance people: We use many mathematical approximations to compute option prices and sensitivities, and we do a lot of work checking whether each of our instruments and markets produces data that forces corrections to the underlying assumptions and approximations. Machine learning techniques promise to learn the true nature of these functions from massive data (which we have), yielding increasingly accurate data-driven representations as more data and more compute are applied. Mixture of Experts (MoE) is a hot new idea that promises to cut the cost of compute dramatically and scale these ML architectures more cost-effectively: the cost of accuracy goes down while the answers become more accurate. Monitoring a large number of instruments and markets in an automated fashion, for risk management and hedge management, becomes very cost-effective and more accurate, and these trained models let quant analysts try out ideas in near real time.
State of the art: There is enough peer-reviewed academic work to justify my assertion that ML models have become good enough to consistently beat our traditional analytical models, especially BSM (ask me for my literature review). Here is a simple network architecture that served me as an elementary building block of my MoE model:

I worried more about applying MoE ideas to fixed-income-instrument (FII) pricing than about updating this architecture with the latest LLM-inspired ideas.
My Experiments:
Can I use the scaling properties of ML models to analyze more data types, represent market regimes, and produce better results than our usual analytical toolkit? These are the challenges with traditional methods that my models have a chance of addressing:
Challenges in Fixed Income Pricing:
• High computational complexity: Monte Carlo and PDE-based approaches require extensive simulations.
• Non-linearity and high dimensionality: Pricing involves multiple correlated risk factors, requiring sophisticated modeling.
• Data efficiency: Large datasets with different market conditions are needed for robust generalization.
Mixture of Experts (MoE) for Fixed Income Pricing: Why MoE?
Mixture of Experts (MoE) is a powerful neural network technique that partitions input data and assigns different expert subnetworks to specialized tasks. The key advantages of MoE in this context include:
• Efficient computation: MoE selectively activates only a subset of network parameters, reducing inference costs.
• Specialization: Different experts can focus on distinct market conditions or financial instruments.
• Scalability: MoE scales well with large datasets and complex pricing scenarios.
MoE Architecture for Fixed Income Pricing
Our deep network consists of:
• 31 layers: Comprising dense layers with ReLU activations.
• MoE Layer: Located at a mid-to-deep level (after the first 19 feature-extraction layers) to dynamically assign pricing tasks.
• Gate Network: A softmax-based gating function determines expert weights based on input conditions.
• Experts: Each expert is a smaller sub-network optimized for different aspects of pricing (e.g., yield curve modeling, volatility estimation, credit spreads).
• Output Layer: Produces the final instrument price.
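The gated MoE block above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the trained model: the expert width, expert count, and top-k routing value are assumptions I chose for clarity.

```python
# Minimal sketch of a gated MoE layer with sparse top-k routing.
# Sizes and expert count are illustrative, not the article's trained model.
import torch
import torch.nn as nn

class Expert(nn.Module):
    """A small sub-network; in the article each expert targets one pricing
    aspect (yield-curve modeling, volatility estimation, credit spreads)."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x):
        return self.net(x)

class MoELayer(nn.Module):
    def __init__(self, dim, n_experts=4, hidden=64, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(Expert(dim, hidden) for _ in range(n_experts))
        self.gate = nn.Linear(dim, n_experts)  # softmax-based gating network
        self.top_k = top_k

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=-1)   # (batch, n_experts)
        topw, topi = weights.topk(self.top_k, dim=-1)   # sparse activation
        topw = topw / topw.sum(dim=-1, keepdim=True)    # renormalize top-k
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (topi[:, k] == e)                # rows routed to expert e
                if mask.any():
                    out[mask] += topw[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

x = torch.randn(8, 16)
y = MoELayer(dim=16)(x)
```

The sparse routing is what delivers the "efficient computation" advantage: only `top_k` of the experts run for any given input.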
Intuition Behind the First 19 Layers:
The first 19 layers of the network are designed to progressively extract hierarchical features relevant to fixed-income pricing. The design rationale includes:
• Initial Feature Extraction (Layers 1-6): These layers capture fundamental characteristics of the input data, such as correlations between different interest rate factors, credit spreads, and liquidity measures. Batch normalization and dropout are used to enhance stability.
• Intermediate Representations (Layers 7-12): This segment of the network refines initial representations, allowing the model to learn non-linear interactions across market regimes.
• Deep Feature Abstraction (Layers 13-19): These layers focus on creating complex embeddings that help segment fixed-income instruments by their risk factors, pricing structures, and economic conditions.
This hierarchical approach ensures that the MoE layer receives a rich, structured input, improving expert specialization and pricing accuracy.
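The feature-extraction trunk described above can be sketched as a stack of dense blocks, each with batch normalization and dropout. The widths and dropout rate here are hypothetical placeholders:

```python
# Hypothetical sketch of the 19-layer trunk that feeds the MoE layer:
# dense + batch norm + ReLU + dropout blocks, as described above.
# The width (64) and dropout rate (0.1) are illustrative assumptions.
import torch
import torch.nn as nn

def dense_block(din, dout, p=0.1):
    return nn.Sequential(
        nn.Linear(din, dout), nn.BatchNorm1d(dout), nn.ReLU(), nn.Dropout(p))

trunk = nn.Sequential(*[dense_block(64, 64) for _ in range(19)])
out = trunk(torch.randn(4, 64))
```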
Training the 31-Layer MoE Model
Data Preparation
• Historical Market Data: Yield curves, swap rates, bond prices, and macroeconomic indicators.
• Synthetic Data: Generated using stochastic models like Hull-White for robustness.
• Feature Engineering: Term structure, credit rating, macroeconomic factors, and liquidity metrics.
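The synthetic-data step can be illustrated with a short-rate simulator. Below is a minimal sketch of one-factor Hull-White path generation, dr = (θ(t) − a·r) dt + σ dW, via Euler steps; all parameter values are illustrative, not calibrated to any market:

```python
# Sketch: synthetic short-rate paths from a one-factor Hull-White model,
#   dr = (theta(t) - a*r) dt + sigma dW,
# simulated with Euler steps. Parameters are illustrative, not calibrated.
import numpy as np

def hull_white_paths(r0=0.02, a=0.1, sigma=0.01, theta=lambda t: 0.003,
                     T=1.0, n_steps=252, n_paths=1000, seed=0):
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    r = np.full(n_paths, r0)
    paths = np.empty((n_steps + 1, n_paths))
    paths[0] = r
    for i in range(n_steps):
        t = i * dt
        dr = (theta(t) - a * r) * dt \
             + sigma * np.sqrt(dt) * rng.standard_normal(n_paths)
        r = r + dr
        paths[i + 1] = r
    return paths

paths = hull_white_paths()  # shape: (n_steps + 1, n_paths)
```

With θ/a = 0.03 the rate mean-reverts upward from r0 = 0.02, so the simulated paths exercise the model on rate environments outside the historical sample.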
Data Sources
• 5-deep tick data for SPY and GLD options (2020–2022) – DataBento
• Treasury, swap-rate, and SOFR/OIS yield-curve data (2020–2022) – Treasury via FRED, NY Fed via FRED, Chatham Financial, Xignite
The data covered the COVID collapse, the post-COVID rebound, the term-premium collapse, and the inflation shock.
• Vol charts (2020–2022) – SpiderRock
• FRN data (2020–2022) – Treasury historical data
Training Methodology
• Loss Function: Mean Squared Error (MSE) with regularization.
• Optimizer: Adam optimizer with learning rate scheduling.
• Dropout & Batch Normalization: To prevent overfitting.
• Expert Specialization: Pretraining experts on different financial regimes before integrating with MoE.
• Train/Test Split: 80–20 training/testing split within each regime.
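The training recipe above (MSE with regularization, Adam, learning-rate scheduling) can be sketched as follows. The model and data here are toy stand-ins, not the article's 31-layer network:

```python
# Sketch of the training loop: MSE loss, L2 regularization via weight decay,
# Adam, and a step learning-rate schedule. Model and data are toy stand-ins.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.5)
loss_fn = nn.MSELoss()

# Toy regression data standing in for (features, price) pairs.
X = torch.randn(256, 10)
y = X.sum(dim=1, keepdim=True)

initial_loss = loss_fn(model(X), y).item()
for epoch in range(30):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
    sched.step()  # halve the learning rate every 10 epochs

final_loss = loss_fn(model(X), y).item()
```

In the article's setup, each expert would first be pretrained this way on its own regime's slice of data before the gated model is trained end to end.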
Implementation Details
• Framework: TensorFlow/PyTorch with MoE modules.
• Hardware: Training on NVIDIA A100 GPUs with mixed precision.
• Hyperparameter Tuning: Bayesian optimization (via Hyperopt) for learning rate, expert dropout, and batch size.
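The tuning loop looks roughly like the following. As a dependency-free sketch I use random search over the same three-parameter space that Hyperopt's TPE would explore; the objective is a toy surrogate standing in for "train the model, return validation error":

```python
# Dependency-free sketch of the hyperparameter search. The article used
# Hyperopt's TPE; random search over the same space illustrates the loop.
# The objective is a toy surrogate, not an actual training run.
import math
import random

random.seed(0)

def sample_config():
    return {
        "lr": 10 ** random.uniform(-5, -2),          # log-uniform learning rate
        "dropout": random.uniform(0.0, 0.5),         # expert dropout
        "batch_size": random.choice([64, 128, 256]),
    }

def objective(cfg):
    # Toy stand-in for validation error, minimized near lr=1e-3, dropout=0.1.
    return (math.log10(cfg["lr"]) + 3) ** 2 + (cfg["dropout"] - 0.1) ** 2

best = min((sample_config() for _ in range(100)), key=objective)
```

With Hyperopt, `sample_config` becomes an `hp` search space and the loop is replaced by `fmin(objective, space, algo=tpe.suggest, max_evals=...)`, which proposes configurations adaptively instead of uniformly.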
Comparative Results with Alternative Methods
To validate the effectiveness of our 31-layer MoE network, we compared its performance against several alternative approaches:
• Traditional Monte Carlo and PDE Methods: Our MoE network achieved a 4x speed improvement and a 30% lower mean pricing error.
• Standard Deep Neural Networks (DNNs): A fully connected 31-layer DNN without MoE had a higher inference cost and 20% higher error compared to the MoE-enhanced model.
• Gradient Boosting and Random Forests: While effective in low-dimensional cases, these models struggled with high-dimensional pricing scenarios, showing 35% higher variance in predictions.
These results highlight the superior efficiency and accuracy of the MoE-based approach in handling complex financial pricing problems.
Results and Performance
• Accuracy: The MoE model outperformed traditional DNNs with a 25% reduction in pricing error.
• Inference Speed: Up to 4x faster compared to full dense networks.
• Interpretability: MoE gating function provided insights into market regime shifts.
Conclusion
A 31-layer MoE deep network provides a promising approach to fixed-income pricing by efficiently handling non-linearity, computational complexity, and market-regime variations. Future work includes integrating reinforcement learning for adaptive pricing strategies and extending MoE to real-time risk management applications. The experts are currently partitioned solely by instrument type; a better idea may be to partition by market regime, or, with a bigger training budget, by a combination of both. Gating-network training is not yet reliable. The model was trained and tested on option pricing, then used to price FRNs to verify the hypothesis that it had captured all the macro correlations.
This article provides a technical roadmap for implementing MoE-based deep networks for financial pricing tasks, bridging the gap between AI techniques and fixed-income analytics.
Big Note: My work could use peer review (especially of the results verification), because my calculations of error for the various methods could use more rigor. The claims about speed and accuracy are not yet rigorous; I focused more on making it work. Data cleanup could use more rigor, and the hedging work is incomplete. Still, the promise it shows is encouraging me to try it on more complex FIIs like MBS, since this model may have captured the subtle correlations between macro data and credit risks. Training this network repeatedly costs money, beyond the budget of a small tech company, but this line of work will democratize FII pricing and hedging, traditionally the realm of finance companies with billion-dollar tech budgets.