Local LLM-assisted text completion, AI chat, and agentic coding extension for VS Code
- Auto-suggest on input
- Accept a suggestion with `Tab`
- Accept the first line of a suggestion with `Shift + Tab`
- Accept the next word with `Ctrl/Cmd + Right`
- Toggle the suggestion manually by pressing `Ctrl + L`
- Control max text generation time
- Configure scope of context around the cursor
- Ring context with chunks from open and edited files and yanked text
- Supports very large contexts even on low-end hardware via smart context reuse
- Display performance stats
- Llama Agent for agentic coding
- Add/remove/export/import for models - completion, chat, embeddings and tools
- Model selection - for completion, chat, embeddings and tools
- Env (group of models) concept: selecting/deselecting an env selects/deselects all the models in it
- Add/remove/export/import for env
- Predefined models (including OpenAI gpt-oss 20B added as a local one)
- Predefined envs for different use cases - completion only, chat + completion, chat + agent, local full package (with gpt-oss 20B), etc.
- MCP tools selection for the agent (from MCP servers installed in VS Code)
- Search and download models from Huggingface directly from llama-vscode
Install the llama-vscode extension from the VS Code extension marketplace:
Note: also available at Open VSX
Open the llama-vscode menu by clicking llama-vscode in the status bar (or pressing Ctrl+Shift+M) and select "Install/Upgrade llama.cpp". This installs llama.cpp automatically on Mac and Windows. On Linux, download the latest binaries and add the bin folder to your PATH (see the sketch below).
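For example, on Linux the manual steps look roughly like this; the archive name below is only a placeholder, so pick the asset matching your OS/architecture from the llama.cpp releases page:

```sh
# Download the latest Linux archive from
# https://github.com/ggml-org/llama.cpp/releases and extract it
# (the filename is a placeholder - use the asset matching your system):
unzip llama-*-bin-ubuntu-x64.zip -d ~/llama.cpp

# Add the folder that contains the llama-server binary to your PATH,
# e.g. by appending this line to ~/.bashrc (adjust if the archive layout differs):
export PATH="$HOME/llama.cpp/build/bin:$PATH"
```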
Once llama.cpp is installed, you can select an env for your needs from the llama-vscode menu via "Select/start env...".
Below are some details on how to install llama.cpp manually, if you prefer that.
Mac (Homebrew):

```sh
brew install llama.cpp
```

Windows (winget):

```sh
winget install llama.cpp
```
Either use the latest release binaries or build llama.cpp from source (a build sketch follows below). For more information on how to run the llama.cpp server, please refer to the Wiki.
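As a rough sketch, a default (CPU) build from source with CMake looks like this; see the llama.cpp build documentation for GPU backends and other options:

```sh
# Fetch the sources and build with CMake; the binaries end up in build/bin
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
```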
Here are recommended settings, depending on the amount of VRAM that you have:
- More than 16GB VRAM: `llama-server --fim-qwen-7b-default`
- Less than 16GB VRAM: `llama-server --fim-qwen-3b-default`
- Less than 8GB VRAM: `llama-server --fim-qwen-1.5b-default`
CPU-only configs

These are `llama-server` settings for CPU-only hardware. Note that the quality will be significantly lower:
```sh
llama-server \
    -hf ggml-org/Qwen2.5-Coder-1.5B-Q8_0-GGUF \
    --port 8012 -ub 512 -b 512 --ctx-size 0 --cache-reuse 256
```

```sh
llama-server \
    -hf ggml-org/Qwen2.5-Coder-0.5B-Q8_0-GGUF \
    --port 8012 -ub 1024 -b 1024 --ctx-size 0 --cache-reuse 256
```
You can use any other FIM-compatible model that your system can handle. By default, the models downloaded with the `-hf` flag are stored in:
- Mac OS: `~/Library/Caches/llama.cpp/`
- Linux: `~/.cache/llama.cpp`
- Windows: `LOCALAPPDATA`
The plugin requires FIM-compatible models: HF collection
The extension includes Llama Agent:
- Llama Agent UI in Explorer view
- Works with local models - gpt-oss 20B is the best choice for now
- Can also work with external models (for example, from OpenRouter)
- MCP support - can use tools from MCP servers that are installed and started in VS Code
- 9 internal tools available for use
  - custom_tool - returns the content of a file or a web page
  - custom_eval_tool - write your own tool in JavaScript (a function that takes a string input and returns a string); see the sketch after this list
- Attach the selection to the context
- Configure maximum loops for Llama Agent
- Open Llama Agent with Ctrl+Shift+A or from the llama-vscode menu via "Show Llama Agent"
- Select an env with an agent if you haven't done so already
- Write a query and attach files with the @ button if needed
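For illustration, the body of a custom_eval_tool could look roughly like the following. This is only a sketch of the "string in, string out" idea; the exact function signature and the environment that llama-vscode provides may differ:

```js
// Hypothetical custom_eval_tool: receives a string input from the agent
// and must return a string result. This example pretty-prints JSON input.
function customEvalTool(input) {
  try {
    const parsed = JSON.parse(input);
    return JSON.stringify(parsed, null, 2); // formatted JSON back to the agent
  } catch (err) {
    return "Input was not valid JSON: " + err.message;
  }
}
```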
[More details](https://github.com/ggml-org/llama.vscode/wiki)
Speculative FIMs running locally on an M2 Studio:
llama-vscode-1.mp4
The extension aims to be very simple and lightweight and at the same time to provide high-quality and performant local FIM completions, even on consumer-grade hardware.
- The initial implementation was done by Ivaylo Gardev @igardev using the llama.vim plugin as a reference
- Technical description: ggml-org/llama.cpp#9787
- Vim/Neovim: https://github.com/ggml-org/llama.vim