
Conversation

@cyyever (Contributor) commented Aug 23, 2025

What does this PR do?

In vLLM inference and SFT training there are many blocking operations on image_grid_thw, such as torch.prod and Tensor.tolist, so let's always keep image_grid_thw on CPU to avoid them.
A simple grep gives the following examples (the sketch after the list shows why these calls block):

src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py:        split_sizes = (image_grid_thw.prod(-1) // self.visual.spatial_merge_size**2).tolist()
src/transformers/models/qwen2_vl/modeling_qwen2_vl.py:        split_sizes = (image_grid_thw.prod(-1) // self.visual.spatial_merge_size**2).tolist()
src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py:                    samples = torch.split(image_grid_thw, list(image_nums))
src/transformers/models/qwen2_vl/modeling_qwen2_vl.py:                    samples = torch.split(image_grid_thw, list(image_nums))
src/transformers/models/glm4v/modeling_glm4v.py:                    samples = torch.split(image_grid_thw, list(image_nums))
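
A minimal sketch (assuming a CUDA device; shapes and values are illustrative) of why these calls block when the grid tensor lives on the GPU:

```python
import torch

spatial_merge_size = 2  # illustrative value

# On the GPU, .tolist() copies the values back to the host, which forces a
# synchronization of the CUDA stream before the call can return:
image_grid_thw = torch.tensor([[1, 28, 28], [1, 14, 14]], device="cuda")
split_sizes = (image_grid_thw.prod(-1) // spatial_merge_size**2).tolist()  # blocks

# Created on (or kept on) the CPU, the same computation never touches the
# GPU stream, so nothing has to wait:
image_grid_thw = torch.tensor([[1, 28, 28], [1, 14, 14]])  # CPU by default
split_sizes = (image_grid_thw.prod(-1) // spatial_merge_size**2).tolist()  # no sync
```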

@cyyever force-pushed the use_cpu_image_grid_thw branch from 95e6e30 to 64526de on August 23, 2025 at 10:47
Signed-off-by: cyy <cyyever@outlook.com>
@cyyever force-pushed the use_cpu_image_grid_thw branch from 64526de to 3e489ac on August 23, 2025 at 10:48
[For maintainers] Suggested jobs to run (before merge)

run-slow: glm4v, qwen2_vl

@cyyever changed the title from "Fix image_grid_thw tensor to CPU" to "Fix image_grid_thw to be in CPU" on Aug 23, 2025
@Rocketknight1 (Member) commented:

cc @qubvel @zucchini-nlp

@zucchini-nlp (Member) left a comment

I don't think we need this. The model needs all inputs to be on the same device, and users who pass device='cuda' will expect the image grid tensors to be moved to CUDA during processing. This would lead to inconsistencies in the API.

I'd rather have vLLM move the tensor back to CPU if needed, or call the image processor with device="cpu".
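
A hedged sketch of that caller-side workaround; the checkpoint, prompt, and placeholder image below are illustrative, not taken from this PR:

```python
import torch
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
image = Image.new("RGB", (224, 224))  # placeholder input
inputs = processor(text="Describe this image.", images=image, return_tensors="pt")

# Move everything to the accelerator except the index tensor, which stays on
# CPU so later .prod()/.tolist() calls do not synchronize the GPU:
inputs = {k: v if k == "image_grid_thw" else v.to("cuda") for k, v in inputs.items()}
```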

@qubvel (Member) commented Aug 26, 2025

We might need to provide pixel_values.device to avoid creating a tensor on CPU by default, wdyt?
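
A hedged sketch of this suggestion (variable names are illustrative, not the actual transformers internals): the grid tensor would follow pixel_values instead of defaulting to CPU.

```python
import torch

pixel_values = torch.randn(2, 3, 224, 224, device="cuda")  # illustrative
grid_sizes = [[1, 28, 28], [1, 14, 14]]

# Create the index tensor wherever pixel_values already lives, so the pair
# stays consistent whether the user requested "cpu" or "cuda":
image_grid_thw = torch.tensor(grid_sizes, device=pixel_values.device)
```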

@cyyever (Contributor, Author) commented Aug 26, 2025

@zucchini-nlp @qubvel This tensor is special because its main purpose is index manipulation. For such indexing tensors, moving them to GPU provides no acceleration.

The dilemma is that we want almost all tensors to be on CUDA except a few special ones, so passing device to the processor is not a good solution.
Admittedly, this is a semantics-breaking change.

@zucchini-nlp (Member)

This tensor is special because its main purpose is index manipulation. For such indexing tensors, moving them to GPU provides no acceleration.

I see, though it would be quite breaking and, as noted above, would not be consistent with what the image processing API does. IMO all returned inputs need to be on the same device and of the same type, as requested by the user. So I agree with @qubvel that both should be created on pixel_values.device, which can be "cuda" or "cpu" depending on the input kwargs.
