A Three-Phase Foundation Model for Tax-Aware Personalized Portfolio Management
Quick Answer
This paper introduces a three-phase deep reinforcement learning model for personalized portfolio management, addressing ticker lock-in, monolithic objectives, and static user models.
Quick Take
The system encodes assets with a ticker-identity-free encoder augmented by Chronos, a T5-based time series foundation model. It serves six investment objectives through a Mixture of Experts actor-critic trained with PPO, and it personalizes at inference time with a small LoRA module fine-tuned on each user's brokerage transaction history. The authors state that, to their knowledge, this is the first application of a time series foundation model to portfolio management RL.
Key Points
- Phase 1 utilizes a ticker-identity-free encoder with a T5-based time series model.
- Phase 2 employs a Mixture of Experts architecture for six distinct investment goals.
- Phase 3 personalizes investment strategies using a 76-parameter LoRA module.
- A learned intent router blends expert heads by objective and market regime, which the authors say eliminates cross-objective gradient conflict.
- Investment objectives are inferred from actual trading behavior, not questionnaires.
Article Content
From source RSS / original summary
arXiv:2606.30997v1 Announce Type: new
Abstract: We present a three-phase deep reinforcement learning system for personalized portfolio management that addresses three limitations shared by all prior financial RL work: 1) ticker lock-in, 2) monolithic objectives, and 3) static user models.
Phase 1 pretrains a ticker-identity-free cross asset encoder via self-supervised learning on a multi-asset corpus, augmented by a frozen parallel branch using Chronos, a T5-based time series foundation model, fused via a learned gating mechanism. To our knowledge, this is the first application of a time series foundation model to portfolio management RL. The encoder generalizes to any publicly traded asset via a 50-dimensional observable metadata vector that requires no retraining for new tickers.
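The abstract does not specify the form of the learned gate that fuses the trainable encoder with the frozen Chronos branch. A minimal sketch, assuming a per-dimension sigmoid gate over the concatenated embeddings, might look like the following. The function name `gated_fusion`, the embedding size, and the random stand-in embeddings are all illustrative assumptions, not details from the paper.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(h_trainable, h_frozen, gate_w, gate_b):
    """Blend two same-sized embeddings with a per-dimension gate.

    ASSUMPTION: the gate is sigmoid(W [h_trainable; h_frozen] + b).
    gate_w has shape [d][2*d]; gate_b has shape [d].
    """
    concat = h_trainable + h_frozen  # list concatenation, length 2*d
    fused = []
    for i in range(len(h_trainable)):
        logit = sum(w * x for w, x in zip(gate_w[i], concat)) + gate_b[i]
        g = sigmoid(logit)
        # g near 1 favors the trainable encoder, g near 0 the frozen branch.
        fused.append(g * h_trainable[i] + (1.0 - g) * h_frozen[i])
    return fused


# Random stand-ins for the two branch outputs (hypothetical size d = 4).
rng = random.Random(0)
d = 4
h_a = [rng.uniform(-1, 1) for _ in range(d)]
h_b = [rng.uniform(-1, 1) for _ in range(d)]
W = [[rng.gauss(0, 0.1) for _ in range(2 * d)] for _ in range(d)]
b = [0.0] * d
fused = gated_fusion(h_a, h_b, W, b)
```

Because the gate output lies in (0, 1), each fused coordinate is a convex combination of the two branch values, so the frozen branch can be down-weighted per dimension without being retrained.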
Phase 2 fine-tunes a MoE (Mixture of Experts) portfolio actor critic with PPO under an objective-conditioned reward that simultaneously serves six distinct investment goals sampled per episode: short-term alpha, short-term gain, long-term gain, capital preservation, tax-loss harvesting, and long-term-gains-only.
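To make per-episode objective sampling concrete, here is a rough sketch of one way an objective-conditioned reward could be wired up. The six objective names come from the abstract; the reward components and every weight in `REWARD_WEIGHTS` are hypothetical placeholders, since the source does not give the reward definitions.

```python
import random

OBJECTIVES = [
    "short_term_alpha",
    "short_term_gain",
    "long_term_gain",
    "capital_preservation",
    "tax_loss_harvesting",
    "long_term_gains_only",
]

# HYPOTHETICAL weights over the components
# (excess_return, portfolio_return, drawdown, tax_cost).
# These values are illustrative only and are not taken from the paper.
REWARD_WEIGHTS = {
    "short_term_alpha": (1.0, 0.0, -0.1, 0.0),
    "short_term_gain": (0.0, 1.0, -0.1, 0.0),
    "long_term_gain": (0.0, 1.0, -0.5, -0.5),
    "capital_preservation": (0.0, 0.2, -2.0, 0.0),
    "tax_loss_harvesting": (0.0, 0.2, -0.2, -2.0),
    "long_term_gains_only": (0.0, 1.0, -0.2, -1.0),
}


def sample_objective(rng):
    """Draw one objective uniformly at the start of an episode."""
    return rng.choice(OBJECTIVES)


def one_hot(objective):
    """Conditioning vector that tells the policy which objective is active."""
    return [1.0 if name == objective else 0.0 for name in OBJECTIVES]


def reward(objective, components):
    """Weighted sum of reward components under the active objective."""
    weights = REWARD_WEIGHTS[objective]
    return sum(w * c for w, c in zip(weights, components))


episode_rng = random.Random(42)
active = sample_objective(episode_rng)
conditioning = one_hot(active)
step_reward = reward(active, (0.01, 0.02, 0.005, 0.001))
```

The same policy therefore sees different reward signals across episodes, and the one-hot vector is what lets it tell them apart.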
A MoE architecture assigns each objective to a specialized expert head (momentum, growth, defensive, tax-aware), and a learned intent router blends experts based on the active objective and current market regime, which eliminates cross-objective gradient conflict.
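The routing step can be illustrated with a small sketch, assuming a linear router followed by a softmax over the four expert heads named in the abstract. The input layout (a six-way objective one-hot concatenated with two market-regime features), the router weights, and the per-expert allocations below are all hypothetical.

```python
import math

EXPERTS = ["momentum", "growth", "defensive", "tax_aware"]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(objective_onehot, regime_features, router_w):
    """Return mixture weights over experts.

    ASSUMPTION: a single linear layer; router_w has one row per expert
    over the concatenated input [objective_onehot; regime_features].
    """
    x = objective_onehot + regime_features
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_w]
    return softmax(logits)


def blend(expert_allocations, mixture):
    """Convex combination of the per-expert portfolio weights."""
    n_assets = len(expert_allocations[0])
    return [
        sum(p * alloc[j] for p, alloc in zip(mixture, expert_allocations))
        for j in range(n_assets)
    ]


# Hypothetical example: objective index 4 active, two regime features.
objective = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
regime = [0.3, -0.1]
router_w = [[0.0] * 8 for _ in EXPERTS]
router_w[3][4] = 2.0  # bias the tax_aware expert toward objective index 4
mixture = route(objective, regime, router_w)
allocations = [[0.7, 0.3], [0.6, 0.4], [0.2, 0.8], [0.5, 0.5]]
portfolio = blend(allocations, mixture)
```

Since the mixture weights sum to one, blending allocations that each sum to one yields a portfolio that also sums to one.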
Phase 3 adds a lightweight personalization layer further adapted at inference time to each individual via a 76-parameter LoRA module fine-tuned on real brokerage transaction history, inferring investment objectives from revealed trading behavior rather than questionnaires. A natural language intent parser converts free-form goals directly into structured investment objective parameters.
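A LoRA adapter of rank r on a layer with d_in inputs and d_out outputs adds r * (d_in + d_out) trainable parameters. The abstract gives only the total of 76, not the layer shape, so the sketch below assumes one shape that happens to match (rank 2, 32 inputs, 6 outputs). Those dimensions, the helper names, and the initial values are assumptions for illustration.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters in a LoRA pair: A is [rank][d_in], B is [d_out][rank]."""
    return rank * (d_in + d_out)


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W, A, B, x, scale=1.0):
    """Compute y = W x + scale * B (A x).

    W stays frozen; only A and B would be updated during per-user fine-tuning.
    """
    base = matvec(W, x)
    low_rank = matvec(A, x)       # project down to `rank` dimensions
    delta = matvec(B, low_rank)   # project back up to d_out dimensions
    return [b + scale * d for b, d in zip(base, delta)]


# HYPOTHETICAL shape chosen only because it yields 76 parameters.
D_IN, D_OUT, RANK = 32, 6, 2
n_params = lora_param_count(D_IN, D_OUT, RANK)  # 2 * (32 + 6) = 76

W = [[0.01] * D_IN for _ in range(D_OUT)]
A = [[0.1] * D_IN for _ in range(RANK)]
B = [[0.0] * RANK for _ in range(D_OUT)]  # zero init: adapter starts as a no-op
x = [1.0] * D_IN
y = lora_forward(W, A, B, x)
```

Initializing B to zero is the usual LoRA convention: the adapted layer reproduces the frozen layer exactly until a user's transaction history moves the adapter away from zero.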