A Three-Phase Foundation Model for Tax-Aware Personalized Portfolio Management

arXiv cs.AI·Ramin Pishehvar

7/1/2026

Quick Take

This paper introduces a three-phase deep reinforcement learning system for personalized portfolio management, addressing ticker lock-in, monolithic objectives, and static user models. It pairs a ticker-identity-free asset encoder with a frozen Chronos branch (a T5-based time series foundation model), uses a Mixture of Experts actor-critic to serve six investment goals, and adds a personalization layer fine-tuned on each user's brokerage transaction history. The authors describe it as, to their knowledge, the first application of a time series foundation model to portfolio management RL.

Key Points

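Phase 1 pretrains a ticker-identity-free cross-asset encoder, augmented by a frozen branch using Chronos, a T5-based time series foundation model.
Phase 2 fine-tunes a Mixture of Experts actor-critic with PPO to serve six distinct investment goals.
Phase 3 personalizes the policy for each user with a 76-parameter LoRA module fine-tuned on brokerage transaction history.
According to the authors, routing objectives to specialized experts eliminates cross-objective gradient conflict.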
Investment objectives are inferred from actual trading behavior, not questionnaires.

Article Content

From source RSS / original summary

arXiv:2606.30997v1. Abstract: We present a three-phase deep reinforcement learning system for personalized portfolio management that addresses three limitations shared by all prior financial RL work: 1) ticker lock-in, 2) monolithic objectives, and 3) static user models.

Phase 1 pretrains a ticker-identity-free cross-asset encoder via self-supervised learning on a multi-asset corpus, augmented by a frozen parallel branch using Chronos, a T5-based time series foundation model, fused via a learned gating mechanism. To our knowledge, this is the first application of a time series foundation model to portfolio management RL. The encoder generalizes to any publicly traded asset via a 50-dimensional observable metadata vector that requires no retraining for new tickers.
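The abstract does not give the form of the learned gate, so the following is only a minimal sketch of one common choice: a per-dimension sigmoid gate over the concatenated branch outputs. The names `gated_fusion`, `h_enc`, `h_tsfm`, `gate_w`, and `gate_b` are hypothetical, and both embeddings are assumed to have been projected to the same width beforehand.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(h_enc, h_tsfm, gate_w, gate_b):
    """Blend two equal-length embeddings with a per-dimension sigmoid gate.

    h_enc  : output of the trainable cross-asset encoder (assumed shape: d)
    h_tsfm : output of the frozen time series branch (assumed shape: d)
    gate_w : d rows, each of length 2*d, applied to [h_enc; h_tsfm]
    gate_b : d biases
    All shapes and names are illustrative assumptions, not the paper's.
    """
    if len(h_enc) != len(h_tsfm):
        raise ValueError("both branches must be projected to the same width")
    concat = list(h_enc) + list(h_tsfm)
    fused = []
    for i, (row, bias) in enumerate(zip(gate_w, gate_b)):
        g = sigmoid(sum(w * x for w, x in zip(row, concat)) + bias)
        # g near 1 trusts the trainable encoder, g near 0 trusts the frozen branch.
        fused.append(g * h_enc[i] + (1.0 - g) * h_tsfm[i])
    return fused


if __name__ == "__main__":
    zero_w = [[0.0] * 4 for _ in range(2)]
    print(gated_fusion([1.0, 2.0], [3.0, 4.0], zero_w, [0.0, 0.0]))
```

With all gate weights and biases at zero, the gate is 0.5 in every dimension, so the fused vector is the plain average of the two branches. In a real system only the gate parameters and the trainable encoder would receive gradients, since the second branch is described as frozen.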

Phase 2 fine-tunes a MoE (Mixture of Experts) portfolio actor-critic with PPO under an objective-conditioned reward that simultaneously serves six distinct investment goals sampled per episode: short-term alpha, short-term gain, long-term gain, capital preservation, tax-loss harvesting, and long-term-gains-only.
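As a rough illustration of per-episode objective sampling and an objective-conditioned reward, the sketch below draws one of the six goals at the start of an episode, encodes it as a one-hot vector for the policy input, and scores a step with goal-specific weights. The six objective names come from the abstract; the reward terms (return, drawdown, tax cost) and every numeric weight are hypothetical placeholders, since the abstract does not define the reward.

```python
import random

OBJECTIVES = [
    "short_term_alpha",
    "short_term_gain",
    "long_term_gain",
    "capital_preservation",
    "tax_loss_harvesting",
    "long_term_gains_only",
]

# Hypothetical weights on (return, drawdown penalty, tax cost penalty).
# These numbers are illustrative only and do not come from the paper.
REWARD_WEIGHTS = {
    "short_term_alpha": (1.0, 0.1, 0.0),
    "short_term_gain": (1.0, 0.2, 0.1),
    "long_term_gain": (0.8, 0.3, 0.3),
    "capital_preservation": (0.3, 1.0, 0.1),
    "tax_loss_harvesting": (0.5, 0.2, 1.0),
    "long_term_gains_only": (0.6, 0.3, 0.8),
}


def sample_objective(rng: random.Random) -> str:
    """Pick the active objective for one training episode."""
    return rng.choice(OBJECTIVES)


def one_hot(objective: str) -> list:
    """Conditioning vector appended to the policy's observation."""
    return [1.0 if name == objective else 0.0 for name in OBJECTIVES]


def reward(objective: str, portfolio_return: float,
           drawdown: float, tax_cost: float) -> float:
    """Weighted step reward for the active objective."""
    w_ret, w_dd, w_tax = REWARD_WEIGHTS[objective]
    return w_ret * portfolio_return - w_dd * drawdown - w_tax * tax_cost


if __name__ == "__main__":
    rng = random.Random(0)
    active = sample_objective(rng)
    print(active, one_hot(active), reward(active, 0.02, 0.01, 0.005))
```

Because the objective is resampled each episode, a single policy sees all six goals during training while the reward it receives always matches the goal it was conditioned on.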

A MoE architecture assigns each objective to a specialized expert head (momentum, growth, defensive, tax-aware), and a learned intent router blends experts based on the active objective and current market regime, which eliminates cross-objective gradient conflict.
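One way to picture the intent router is as a softmax over expert logits computed from the objective encoding and a few market-regime features, with the final allocation taken as the mixture of the experts' proposals. This is a sketch under assumptions: the four expert names come from the abstract, but the linear router, the `regime_features` input, and the helper names `route` and `blend` are hypothetical.

```python
import math

EXPERTS = ["momentum", "growth", "defensive", "tax_aware"]


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_w, objective_one_hot, regime_features):
    """Return mixing weights over experts.

    router_w has one row per expert; each row spans the concatenated
    objective encoding and regime features (an assumed linear router).
    """
    x = list(objective_one_hot) + list(regime_features)
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_w]
    return softmax(logits)


def blend(expert_outputs, mix):
    """Mixture of per-expert portfolio weights, one output row per expert."""
    n_assets = len(expert_outputs[0])
    return [
        sum(m * out[j] for m, out in zip(mix, expert_outputs))
        for j in range(n_assets)
    ]


if __name__ == "__main__":
    objective = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0]   # e.g. tax-loss harvesting
    regime = [0.2, -0.1]                          # hypothetical regime features
    router_w = [[0.0] * 8 for _ in EXPERTS]       # untrained router
    mix = route(router_w, objective, regime)
    proposals = [
        [0.7, 0.2, 0.1],
        [0.5, 0.4, 0.1],
        [0.1, 0.2, 0.7],
        [0.3, 0.3, 0.4],
    ]
    print(dict(zip(EXPERTS, mix)), blend(proposals, mix))
```

If every expert proposes weights that sum to one, the blended allocation also sums to one, because the mixing weights are themselves a probability distribution.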

Phase 3 adds a lightweight personalization layer that is further adapted to each individual at inference time via a 76-parameter LoRA module fine-tuned on real brokerage transaction history, inferring investment objectives from revealed trading behavior rather than questionnaires. A natural language intent parser converts free-form goals directly into structured investment objective parameters.
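To make the size of such an adapter concrete, the sketch below shows the standard LoRA update, y = Wx + scale * B(Ax), with the base matrix W frozen and only the low-rank factors A and B trainable. The abstract does not state the rank or the layer dimensions; the configuration used here (rank 2, input width 32, output width 6) is a hypothetical one chosen only because it totals 76 trainable parameters.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, w_frozen, a, b, scale=1.0):
    """Compute W x + scale * B (A x).

    w_frozen : d_out x d_in base weights, never updated per user
    a        : rank x d_in trainable factor
    b        : d_out x rank trainable factor
    """
    base = matvec(w_frozen, x)
    delta = matvec(b, matvec(a, x))
    return [y + scale * d for y, d in zip(base, delta)]


if __name__ == "__main__":
    # Hypothetical shapes; the paper's actual configuration is not given.
    print(lora_param_count(d_in=32, d_out=6, rank=2))

    x = [1.0, 2.0]
    w = [[1.0, 0.0], [0.0, 1.0]]
    a = [[1.0, 1.0]]                 # rank 1
    b_zero = [[0.0], [0.0]]          # zero B: adapter is a no-op
    b_user = [[1.0], [2.0]]          # after hypothetical per-user tuning
    print(lora_forward(x, w, a, b_zero), lora_forward(x, w, a, b_user))
```

LoRA conventionally initializes B to zero, so a new user starts from the shared policy unchanged and the adapter only departs from it as that user's transaction history is fitted.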
