GSPO is an experimental feature. The API and behavior may change in future releases.
## Overview

GSPO (Group Sequence Policy Optimization) was introduced by the Qwen team to train state-of-the-art models including Qwen3-235B-A22B-Instruct-2507. It can improve training stability and efficiency for Mixture-of-Experts (MoE) models, and may have limited or no impact on dense models.

## Key Benefits
- Stable Training: Maintains a stable optimization process and resolves the stability challenges that arise when training large MoE models
- Efficient Scaling: Achieves higher training efficiency and continues improving with increased computational resources
- Infrastructure-Friendly: More tolerant of precision discrepancies, eliminating the need for complex strategies like “Routing Replay”
## How It Works
GSPO’s core innovation is its sequence-level optimization objective. Instead of weighting individual token likelihoods, GSPO defines importance ratios based on the likelihood of the whole sequence, with length normalization to reduce variance. The algorithm optimizes:

$$
J_{\text{GSPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\Big(s_i(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(s_i(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_i\Big)\right]
$$

where the sequence-level importance ratio $s_i(\theta)$ is defined as:

$$
s_i(\theta) = \left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)}\right)^{1/|y_i|} = \exp\left(\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\log\frac{\pi_\theta(y_{i,t} \mid x,\, y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t} \mid x,\, y_{i,<t})}\right)
$$

Here $G$ is the number of responses sampled for a prompt $x$, $\hat{A}_i$ is the group-normalized advantage of response $y_i$, and $\varepsilon$ is the clipping range.
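To make the objective concrete, here is a minimal PyTorch sketch of the sequence-level ratio and the clipped loss. It is illustrative only, not ART's internal implementation; the tensor names (`new_logprobs`, `old_logprobs`, `mask`, `advantages`) and the default `clip_eps` value are hypothetical placeholders.

```python
import torch

def gspo_loss(
    new_logprobs: torch.Tensor,  # (G, T) per-token log-probs under the current policy
    old_logprobs: torch.Tensor,  # (G, T) per-token log-probs under the old policy
    mask: torch.Tensor,          # (G, T) 1.0 on response tokens, 0.0 on padding
    advantages: torch.Tensor,    # (G,) group-normalized advantages
    clip_eps: float = 0.2,       # clipping range epsilon (assumed value)
) -> torch.Tensor:
    # Length-normalized sequence log-ratio: the mean per-token log-ratio over
    # response tokens, so s_i(theta) = exp((1/|y_i|) * sum_t log(pi/pi_old)).
    log_ratio = (new_logprobs - old_logprobs) * mask
    seq_log_ratio = log_ratio.sum(dim=-1) / mask.sum(dim=-1).clamp(min=1.0)
    s = torch.exp(seq_log_ratio)

    # PPO-style clipped surrogate, applied once per sequence rather than per token.
    unclipped = s * advantages
    clipped = torch.clamp(s, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()  # negated: optimizers minimize
```

Because the ratio is computed once per sequence, a single unlikely token cannot blow up the update on its own, which is the source of the stability gains described above.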
## Configuration
GSPO can be configured using the `importance_sampling_level` parameter when training with ART:
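A minimal sketch, assuming `importance_sampling_level` is accepted by `art.TrainConfig` and that `"sequence"` selects GSPO's sequence-level ratios (with `"token"` as the standard token-level behavior); check the ART reference for the exact field location and defaults:

```python
import art

# Assumption: importance_sampling_level is a field on art.TrainConfig and
# "sequence" enables GSPO's sequence-level importance ratios.
gspo_config = art.TrainConfig(
    learning_rate=1e-5,
    importance_sampling_level="sequence",
)
```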
## Usage Example
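The following is a hedged end-to-end sketch built on ART's public trajectory-group training API (`art.TrainableModel`, `art.TrajectoryGroup`, `art.gather_trajectory_groups`, `model.train`). The model name, project, base model, rollout logic, and reward are placeholders, and the GSPO switch on `art.TrainConfig` is the same assumption as in the configuration example above.

```python
import asyncio
import art

async def rollout(model: art.TrainableModel) -> art.Trajectory:
    # Placeholder rollout: a real implementation would sample a completion
    # from the model and score it; here we return a fixed trajectory.
    return art.Trajectory(
        messages_and_choices=[
            {"role": "user", "content": "Say hello."},
            {"role": "assistant", "content": "Hello!"},
        ],
        reward=1.0,  # placeholder reward
    )

async def main():
    model = art.TrainableModel(
        name="gspo-demo",            # placeholder name
        project="gspo-example",      # placeholder project
        base_model="Qwen/Qwen3-30B-A3B",  # placeholder MoE base model
    )
    backend = art.LocalBackend()     # assumes a local training backend
    await model.register(backend)

    # Gather groups of trajectories; GSPO normalizes advantages within each group.
    groups = await art.gather_trajectory_groups(
        art.TrajectoryGroup(rollout(model) for _ in range(8)) for _ in range(4)
    )

    await model.train(
        groups,
        config=art.TrainConfig(
            learning_rate=1e-5,
            importance_sampling_level="sequence",  # assumed GSPO switch
        ),
    )

asyncio.run(main())
```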
## Technical Details
For a deeper understanding of GSPO’s technical foundations and a comparative analysis with other RL algorithms, see the original research paper.

## Limitations
- As an experimental feature, GSPO may have limited compatibility with some model architectures
- Performance characteristics may vary depending on model size and dataset
- API is subject to change in future releases