fix(vqbet): use in-place fill_ to avoid overwriting DDP GPU buffers with CPU tensors (#3128)

mirror of https://github.com/huggingface/lerobot.git synced 2026-07-09 02:51:56 +00:00

* fix(vqbet): use in-place fill_ to avoid overwriting DDP GPU buffers with CPU tensors

When the VQ discretization phase completed, the code overwrote the
registered buffers 'discretized' and 'freeze_codebook' with
torch.tensor(True), which is created on the CPU. DDP then failed in
_sync_buffers() with: RuntimeError: No backend type associated with
device type cpu. Fix by updating the buffers in place with .fill_(True)
so that device and registration are preserved.
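
The difference between the two update styles can be seen outside of VQ-BeT. The following is a minimal sketch, not the lerobot code: `ToyVQ` is a hypothetical module with one boolean buffer, and the comments assume PyTorch's default device has not been changed from CPU.

```python
import torch
from torch import nn


class ToyVQ(nn.Module):
    """Hypothetical stand-in for a module holding a boolean state buffer."""

    def __init__(self):
        super().__init__()
        self.register_buffer("discretized", torch.tensor(False))


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Reassignment: the buffer slot now holds a brand-new tensor that was
# allocated on the default device, regardless of where the module lives.
replaced = ToyVQ().to(device)
before = replaced.discretized
replaced.discretized = torch.tensor(True)
replaced_is_same_tensor = replaced.discretized is before
replaced_device = replaced.discretized.device

# In-place update: the same tensor object keeps its device and gets a new value.
updated = ToyVQ().to(device)
before = updated.discretized
updated.discretized.fill_(True)
updated_is_same_tensor = updated.discretized is before
updated_device = updated.discretized.device

print(replaced_is_same_tensor, replaced_device.type)
print(updated_is_same_tensor, updated_device.type)
```

On a CUDA machine the first buffer would end up on `cpu` while the second stays on `cuda`. On a CPU-only machine only the object identity differs.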

Made-with: Cursor

* test(vqbet): add regression test for in-place buffer update during discretization

Verifies that discretize() updates the 'discretized' and 'freeze_codebook'
registered buffers in-place (via fill_()) rather than replacing them with new
CPU tensors. The test checks data_ptr() identity and that the tensors remain
registered buffers after the call. This prevents regressions of the DDP fix.
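
A check of this kind might look roughly like the sketch below. It is an illustration under assumptions, not the test added in this commit: `ToyVQ`, `finish_discretization`, and `check_in_place_update` are hypothetical names, and the toy module only mimics the two buffers mentioned above.

```python
import torch
from torch import nn


class ToyVQ(nn.Module):
    """Hypothetical module; not the lerobot VQ-VAE."""

    def __init__(self):
        super().__init__()
        self.register_buffer("discretized", torch.tensor(False))
        self.register_buffer("freeze_codebook", torch.tensor(False))

    def finish_discretization(self):
        # In-place update, in the style of the fix described above.
        self.discretized.fill_(True)
        self.freeze_codebook.fill_(True)


def check_in_place_update(model):
    names = ("discretized", "freeze_codebook")
    # Record the storage address of each buffer before the update.
    ptrs_before = {name: getattr(model, name).data_ptr() for name in names}

    model.finish_discretization()

    registered = dict(model.named_buffers())
    for name in names:
        # Still listed among the module's registered buffers.
        assert name in registered
        # Same underlying storage, so the tensor was not replaced.
        assert getattr(model, name).data_ptr() == ptrs_before[name]
        # The value was actually flipped to True.
        assert bool(getattr(model, name))
    return True


result = check_in_place_update(ToyVQ())
```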

Made-with: Cursor

* test(vqbet): add GPU regression test to verify buffers stay on CUDA after discretize()

Directly catches the original DDP failure mode: when buffers are replaced with
torch.tensor(True) they land on CPU, causing NCCL to raise 'No backend type
associated with device type cpu' in _sync_buffers(). The GPU test places the
model on cuda:0 and asserts both buffers remain on CUDA after discretization.

Made-with: Cursor

* test(vqbet): simplify to single device-check test in test_policies.py

Per reviewer feedback: remove the separate test file and replace the two
CPU/GPU tests (with data_ptr checks) with a single focused test in
tests/policies/test_policies.py that only asserts the registered buffers
remain on the model device after discretize(). Uses DEVICE from tests/utils.py
so it runs on whatever device the CI/user selects (cpu, cuda, mps).

Made-with: Cursor

* style: fix import order in test_policies.py to pass ruff/pre-commit checks

Made-with: Cursor

---------

Co-authored-by: Zhan DiJia <2476100824@example.com>
Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>

Author: Altman, 2026-03-18 20:24:07 +08:00 (committed by GitHub)
parent d9ec3a6fa2
commit e64fa667c3

2 changed files with 46 additions and 2 deletions

									
src/lerobot/policies/vqbet/modeling_vqbet.py (+2 -2)

@@ -467,8 +467,8 @@ class VQBeTHead(nn.Module):
         self.vqvae_model.optimized_steps += 1
         # if we updated RVQ more than `n_vqvae_training_steps` steps, we freeze the RVQ part.
         if self.vqvae_model.optimized_steps >= n_vqvae_training_steps:
-            self.vqvae_model.discretized = torch.tensor(True)
-            self.vqvae_model.vq_layer.freeze_codebook = torch.tensor(True)
+            self.vqvae_model.discretized.fill_(True)
+            self.vqvae_model.vq_layer.freeze_codebook.fill_(True)
             print("Finished discretizing action data!")
             self.vqvae_model.eval()
             for param in self.vqvae_model.vq_layer.parameters():