
Model conversion with OLIVE and running with the ONNX GenAI runtime: Error 5005 on NPU models, and GPU error in Microsoft.Neutron.OpenAI.Provider #1679

Description

@leestott

When converting the model to the ONNX GenAI runtime, the model converts successfully using OLIVE.

When running the converted model with AITK or Foundry Local:

NPU - Error 5005, Qualcomm Snapdragon issue (microsoft/Foundry-Local#67)
GPU - Error in Microsoft.Neutron.OpenAI.Provider.OpenAIServiceProviderOnnx.LoadModelAsync

GPU error log:

2025-08-07 20:27:35.858 [info] Information: Microsoft.Neutron.OpenAI.Provider.OpenAIServiceProviderOnnx [1400]  2025-08-07T20:27:35.857568+08:00 Loading model:gpt-oss-20b-cuda-gpu

2025-08-07 20:27:35.865 [info] Error: Microsoft.Neutron.OpenAI.Provider.OpenAIServiceProviderOnnx [1402]  2025-08-07T20:27:35.8645887+08:00 Failed loading model:gpt-oss-20b-cuda-gpu error: [Load model from /home/lokinfey/.aitk/models/Microsoft/gpt-oss-20b-cuda-gpu/model.onnx failed:This is an invalid model. In Node, ("/model/layers.0/attn/GroupQueryAttention", GroupQueryAttention, "com.microsoft", -1) : ("/model/layers.0/attn/qkv_proj/Add/output_0": tensor(float16),"","","past_key_values.0.key": tensor(float16),"past_key_values.0.value": tensor(float16),"/model/attn_mask_reformat/attn_mask_subgraph/Sub/Cast/output_0": tensor(int32),"/model/attn_mask_reformat/attn_mask_subgraph/Gather/Cast/output_0": tensor(int32),"cos_cache": tensor(float16),"sin_cache": tensor(float16),"","","model.layers.0.attn.sinks": tensor(float16),) -> ("/model/layers.0/attn/GroupQueryAttention/output_0": tensor(float16),"present.0.key": tensor(float16),"present.0.value": tensor(float16),) , Error Node(/model/layers.0/attn/GroupQueryAttention) with schema(com.microsoft::GroupQueryAttention:1) has input size 12 not in range [min=7, max=9].,    at Microsoft.ML.OnnxRuntimeGenAI.Result.VerifySuccess(IntPtr) + 0x56

  at Microsoft.Neutron.OpenAI.Provider.OpenAIServiceProviderOnnx.LoadModelAsync(String, String, String, CancellationToken) + 0x730

  at Microsoft.Neutron.OpenAI.Provider.OpenAIServiceProviderBase`1.<EnsureModelLoadedAsync>d__44.MoveNext() + 0x549]

2025-08-07 20:27:35.866 [info] Information: Microsoft.Neutron.OpenAI.Provider.OpenAIServiceProviderOnnx [1401]  2025-08-07T20:27:35.864944+08:00 Finish loading model:gpt-oss-20b-cuda-gpu elapsed time:00:00:00.0073828

2025-08-07 20:27:35.905 [error] Failed loading model gpt-oss-20b-cuda-gpu. Load model from /home/lokinfey/.aitk/models/Microsoft/gpt-oss-20b-cuda-gpu/model.onnx failed:This is an invalid model. In Node, ("/model/layers.0/attn/GroupQueryAttention", GroupQueryAttention, "com.microsoft", -1) : ("/model/layers.0/attn/qkv_proj/Add/output_0": tensor(float16),"","","past_key_values.0.key": tensor(float16),"past_key_values.0.value": tensor(float16),"/model/attn_mask_reformat/attn_mask_subgraph/Sub/Cast/output_0": tensor(int32),"/model/attn_mask_reformat/attn_mask_subgraph/Gather/Cast/output_0": tensor(int32),"cos_cache": tensor(float16),"sin_cache": tensor(float16),"","","model.layers.0.attn.sinks": tensor(float16),) -> ("/model/layers.0/attn/GroupQueryAttention/output_0": tensor(float16),"present.0.key": tensor(float16),"present.0.value": tensor(float16),) , Error Node(/model/layers.0/attn/GroupQueryAttention) with schema(com.microsoft::GroupQueryAttention:1) has input size 12 not in range [min=7, max=9]. 

(A second load attempt at 20:27:41 fails with the identical error.)
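The decisive line in the trace is the schema check: the exported graph's GroupQueryAttention node carries 12 inputs (including the gpt-oss-specific model.layers.0.attn.sinks input), but the com.microsoft::GroupQueryAttention:1 schema registered in the installed runtime accepts only 7 to 9, which points to a version mismatch between the exporter and the runtime. A minimal stdlib-only stand-in for that arity check (operator name and bounds taken from the error message; this mirrors, not reproduces, the runtime's validation):

```python
def check_input_arity(op_type: str, n_inputs: int, min_in: int, max_in: int) -> None:
    """Mirror the runtime's schema validation that rejects the node."""
    if not (min_in <= n_inputs <= max_in):
        raise ValueError(
            f"Node({op_type}) with schema(com.microsoft::{op_type}:1) "
            f"has input size {n_inputs} not in range [min={min_in}, max={max_in}]."
        )

# The exported gpt-oss graph wires 12 inputs into GroupQueryAttention
# (qkv projection, past key/value, seqlen tensors, cos/sin caches, attention
# sinks, ...), but the installed runtime's schema allows only 7-9 inputs:
try:
    check_input_arity("GroupQueryAttention", 12, 7, 9)
except ValueError as e:
    print(e)
```

A model exported against a newer operator contract than the runtime supports will fail this check at load time regardless of how the conversion itself went.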

To Reproduce
Steps to reproduce the behavior:

  1. Convert the GPT OSS model using OLIVE (4 x H200 required)
  2. Run the converted model in the AITK Playground
  3. The error above is received for NPU or GPU
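Since the conversion machine and the runtime machine differ here, a first diagnostic for the exporter/runtime mismatch is comparing the installed package versions on both sides. A small stdlib-only sketch (the distribution names listed are assumptions; adjust to the packages actually installed):

```python
from importlib import metadata

def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    out = {}
    for pkg in packages:
        try:
            out[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            out[pkg] = None
    return out

# Packages involved in this issue; None means the package is absent here.
print(installed_versions(["olive-ai", "onnxruntime-gpu", "onnxruntime-genai"]))
```

If the converting environment carries a newer onnxruntime than the one AITK or Foundry Local loads the model with, the GroupQueryAttention schema mismatch above is the expected symptom.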

Expected behavior
The converted model should load and run in the playground.
