GSPO is an experimental feature. The API and behavior may change in future releases.
## Overview

GSPO (Group Sequence Policy Optimization) was introduced by the Qwen team to train state-of-the-art models including Qwen3-235B-A22B-Instruct-2507. It can improve training stability and efficiency for Mixture-of-Experts (MoE) models, and may have limited or no impact on dense models.

## Key Benefits
- Stable Training: Maintains a stable optimization process and resolves the stability challenges that arise when training large MoE models
- Efficient Scaling: Achieves higher training efficiency and continues improving with increased computational resources
- Infrastructure-Friendly: More tolerant of precision discrepancies, eliminating the need for complex strategies like “Routing Replay”
## How It Works
GSPO’s core innovation is its sequence-level optimization objective. Instead of weighting individual token likelihoods, GSPO defines importance ratios based on the likelihood of the whole sequence, with length normalization to reduce variance. The algorithm optimizes:

$$
J_{\text{GSPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\Big(s_i(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(s_i(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_i\Big)\right]
$$

where the sequence-level importance ratio $s_i(\theta)$ is defined as:

$$
s_i(\theta) = \left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)}\right)^{1/|y_i|} = \exp\left(\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\log\frac{\pi_\theta(y_{i,t} \mid x,\, y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t} \mid x,\, y_{i,<t})}\right)
$$

Here $G$ is the number of responses sampled for a prompt $x$, $\hat{A}_i$ is the group-normalized advantage of response $y_i$, and $\varepsilon$ is the clipping range.
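To make the objective concrete, here is a minimal PyTorch sketch of the sequence-level ratio and the clipped loss. It is illustrative only, not ART's internal implementation; the tensor names (`new_logprobs`, `old_logprobs`, `mask`, `advantages`) and the default `clip_eps` value are hypothetical placeholders.

```python
import torch

def gspo_loss(
    new_logprobs: torch.Tensor,  # (G, T) per-token log-probs under the current policy
    old_logprobs: torch.Tensor,  # (G, T) per-token log-probs under the old policy
    mask: torch.Tensor,          # (G, T) 1.0 on response tokens, 0.0 on padding
    advantages: torch.Tensor,    # (G,) group-normalized advantages
    clip_eps: float = 0.2,       # clipping range epsilon (assumed value)
) -> torch.Tensor:
    # Length-normalized sequence log-ratio: the mean per-token log-ratio over
    # response tokens, so s_i(theta) = exp((1/|y_i|) * sum_t log(pi/pi_old)).
    log_ratio = (new_logprobs - old_logprobs) * mask
    seq_log_ratio = log_ratio.sum(dim=-1) / mask.sum(dim=-1).clamp(min=1.0)
    s = torch.exp(seq_log_ratio)

    # PPO-style clipped surrogate, applied once per sequence rather than per token.
    unclipped = s * advantages
    clipped = torch.clamp(s, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()  # negated: optimizers minimize
```

Because the ratio is computed once per sequence, a single unlikely token cannot blow up the update on its own, which is the source of the stability gains described above.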
## Configuration
GSPO can be configured using the `importance_sampling_level` parameter when training with ART:
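A minimal sketch, assuming `importance_sampling_level` is accepted by `art.TrainConfig` and that `"sequence"` selects GSPO's sequence-level ratios (with `"token"` as the standard token-level behavior); check the ART reference for the exact field location and defaults:

```python
import art

# Assumption: importance_sampling_level is a field on art.TrainConfig and
# "sequence" enables GSPO's sequence-level importance ratios.
gspo_config = art.TrainConfig(
    learning_rate=1e-5,
    importance_sampling_level="sequence",
)
```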
## Usage Example
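The following is a hedged end-to-end sketch built on ART's public trajectory-group training API (`art.TrainableModel`, `art.TrajectoryGroup`, `art.gather_trajectory_groups`, `model.train`). The model name, project, base model, rollout logic, and reward are placeholders, and the GSPO switch on `art.TrainConfig` is the same assumption as in the configuration example above.

```python
import asyncio
import art

async def rollout(model: art.TrainableModel) -> art.Trajectory:
    # Placeholder rollout: a real implementation would sample a completion
    # from the model and score it; here we return a fixed trajectory.
    return art.Trajectory(
        messages_and_choices=[
            {"role": "user", "content": "Say hello."},
            {"role": "assistant", "content": "Hello!"},
        ],
        reward=1.0,  # placeholder reward
    )

async def main():
    model = art.TrainableModel(
        name="gspo-demo",            # placeholder name
        project="gspo-example",      # placeholder project
        base_model="Qwen/Qwen3-30B-A3B",  # placeholder MoE base model
    )
    backend = art.LocalBackend()     # assumes a local training backend
    await model.register(backend)

    # Gather groups of trajectories; GSPO normalizes advantages within each group.
    groups = await art.gather_trajectory_groups(
        art.TrajectoryGroup(rollout(model) for _ in range(8)) for _ in range(4)
    )

    await model.train(
        groups,
        config=art.TrainConfig(
            learning_rate=1e-5,
            importance_sampling_level="sequence",  # assumed GSPO switch
        ),
    )

asyncio.run(main())
```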
## Technical Details
For a deeper understanding of GSPO’s technical foundations and a comparative analysis with other RL algorithms, see the original research paper.

## Limitations
- As an experimental feature, GSPO may have limited compatibility with some model architectures
- Performance characteristics may vary depending on model size and dataset
- API is subject to change in future releases