Enabling Muon Optimizer in DeepSpeed #7509


Open · wants to merge 30 commits into master

Conversation


@PKUWZP (Collaborator) commented on Aug 23, 2025

Authorship: @pengdurice and @PKUWZP

Related Issue: #7438

Introduction

Muon, a new optimizer that has recently attracted the community's attention, shows promising results in training large language models. Adding Muon to DeepSpeed, a popular open-source framework for large-scale training and inference, is therefore important for DeepSpeed users and developers. A previous PR (huge thanks to @qimcis) attempted this adoption and is a good starting point, but substantially more work is needed to make Muon fully compatible with DeepSpeed. This PR fully enables the Muon optimizer within DeepSpeed.

Issues and solutions

Issues

  1. With ZeRO stage 1, 2, or 3, the optimizer states are partitioned within the same data-parallel group. Each process therefore already handles only a subset of the model parameters, so the data-parallel partitioning logic in the original Muon code is unnecessary.
  2. The parameters (and their gradients) are flattened into a 1D vector before being passed to the optimizer, which breaks the core assumption of Muon: it works by orthogonalizing the update for each matrix-shaped parameter (dim >= 2). See the sketch below.
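
For context, Muon's core step approximately orthogonalizes each matrix-shaped update with a Newton-Schulz iteration, which is why a flattened 1D gradient cannot be fed to it directly. Below is a minimal sketch of that iteration, based on the publicly available Muon reference implementation; the coefficients and step count are illustrative and may differ from what this PR ships:

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a matrix-shaped update via Newton-Schulz iteration."""
    assert G.ndim >= 2, "Muon's orthogonalization needs matrix-shaped (dim >= 2) parameters"
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic iteration coefficients from the reference code
    X = G.float()
    X = X / (X.norm() + 1e-7)           # scale so the spectral norm is <= 1
    transposed = X.size(-2) > X.size(-1)
    if transposed:
        X = X.mT                        # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.mT
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.mT
    return X.to(G.dtype)
```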

Solutions

To address these issues, this PR makes the following changes:

  1. We simplify the Muon code by removing its internal partitioning and update logic.

  2. We move the Muon update into the get_flat_partition function of the stage 1 and 2 DeepSpeedZeroOptimizer, where per-parameter gradients are collected before being flattened and consumed by the optimizer to update the model parameters. Because each gradient still has its original shape at this point, the Muon update can be applied directly (see the sketch after this list).

  3. We also save the momentum buffer in the optimizer's state so that training converges smoothly after resuming from a saved checkpoint.

  4. We add comprehensive unit tests to validate the Muon optimizer's correctness and functionality.
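
To illustrate points 2 and 3, here is a simplified sketch of the per-parameter Muon step as it would run in the ZeRO stage 1/2 gradient-collection path, before flattening. It reuses the newton_schulz_orthogonalize sketch above; the function name, state layout, and scaling here are illustrative and omit the surrounding ZeRO bookkeeping:

```python
import torch

def muon_transform_grad(grad: torch.Tensor, state: dict, momentum: float = 0.95) -> torch.Tensor:
    """Apply the Muon transform to one parameter's gradient while it still has
    its original shape, i.e. before ZeRO flattens it into the 1D partition."""
    if grad.ndim < 2:
        # Non-matrix parameters (biases, norms) keep their plain gradient and are
        # handled by the auxiliary Adam path of MuonWithAuxAdam.
        return grad
    # Point 3: keep the momentum buffer in the optimizer state so that training
    # resumes smoothly from saved checkpoints.
    buf = state.setdefault("momentum_buffer", torch.zeros_like(grad))
    buf.mul_(momentum).add_(grad)
    update = newton_schulz_orthogonalize(buf)   # see the sketch in the Issues section
    # Rescale so the update magnitude is comparable to an Adam-style step;
    # the exact scaling used in the PR may differ.
    return update * max(1.0, grad.size(-2) / grad.size(-1)) ** 0.5
```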

Future directions and roadmap

Several follow-up items are of interest:

  • Create a CPU-offload version.
  • Apply Muon to ZeRO stage 3.
  • Use the highly optimized Adam implementation for the Adam part of the MuonWithAuxAdam optimizer.
  • Pursue more efficient implementations, e.g. (a) specialized kernels for the Newton-Schulz iteration and Muon updates; (b) parallelized updates across parameters (currently each parameter is updated separately and sequentially).
