Improvements for working on larger codebases #4082
Replies: 2 comments
- Shared this discussion on Discord, excited to see what others think. <3
- Thinking a little out of the box, but what if there is HITL (human in the loop) and agents working on issues and feature requests at the same time in a corporate setting?
After a discussion with a colleague we were exploring the ways in which different tools work well with larger repositories (e.g. complex large codebases, or even monorepos with thousands of modules).
I wanted to note down some thoughts on things to address and try out.
Speculative ideas
There are many well-trodden, well-understood paths here, but first I wanted to throw in some other ideas to try, based on patterns I have observed and things I have done, that may help with larger codebases:
Look at size/metrics of repo
If the agent operates in a project directory, there can be some "common sense": if there is no .goosehints etc., it can detect whether there are thousands of top-level dirs/modules, and then suggest/activate tools (or even provide automatic up-front context in the system prompt before even starting).
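A minimal sketch of that "common sense" check, assuming invented thresholds and hint-file names (the real heuristics would need tuning):

```python
# Hypothetical heuristic: inspect a project directory and decide whether
# "large repo" tooling should be suggested up front. The threshold and the
# hint-file names checked here are assumptions, not goose behaviour.
import os

def repo_size_hints(root: str, dir_threshold: int = 1000) -> dict:
    """Count top-level directories and flag repos that look like large monorepos."""
    entries = os.listdir(root)
    top_dirs = [e for e in entries if os.path.isdir(os.path.join(root, e))]
    has_hints = any(e in (".goosehints", "AGENTS.md") for e in entries)
    return {
        "top_level_dirs": len(top_dirs),
        "has_hints": has_hints,
        # Only suggest extra tooling when the repo is big and unannotated.
        "suggest_indexing": len(top_dirs) >= dir_threshold and not has_hints,
    }
```

A check like this is cheap enough to run on every session start, so the suggestion (or the injected context) costs nothing when the repo is small.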
Make use of git
git and the GitHub origin (if using GitHub) can provide a rich vein of latent information for an agent to mine, to work out what the user may need to work on, and where.
A simple script like this one here
can automatically look at recent history for the user and report things (possibly into the system prompt on boot-up) like:
…and more, which, combined with the commit comments, can provide excellent, highly dense context for whatever the user may ask next.
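A hedged sketch of the kind of script described above: mine the current user's recent git history and summarise which top-level areas they touch most. The git invocations are standard; the summary shape is an assumption.

```python
# Summarise where the current user has been working recently, from git history.
import collections
import subprocess

def summarise_paths(paths, top_n=5):
    """Pure helper: count top-level directories across a list of changed paths."""
    counts = collections.Counter(p.split("/")[0] for p in paths if p)
    return counts.most_common(top_n)

def recent_focus(repo=".", n_commits=50):
    """Changed paths from the current user's last n commits, summarised."""
    email = subprocess.run(
        ["git", "-C", repo, "config", "user.email"],
        capture_output=True, text=True, check=True).stdout.strip()
    # --name-only with an empty pretty format yields just the changed paths.
    log = subprocess.run(
        ["git", "-C", repo, "log", f"--author={email}", f"-{n_commits}",
         "--name-only", "--pretty=format:"],
        capture_output=True, text=True, check=True).stdout
    return summarise_paths(log.splitlines())
```

The output is small enough to drop straight into a system prompt on boot-up.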
Automatically generate .goosehints/AGENTS.md
Similar to the above, if there are no existing instructions, it may be possible to automatically crawl, probe, and generate something less than a full map: hints as to what it thinks are the relevant parts of the tree.
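One possible shape for such a crawl, as a sketch: look for well-known manifest files to guess what each part of the tree is. The manifest list and depth limit are assumptions for illustration.

```python
# Speculative sketch: if no .goosehints/AGENTS.md exists, walk the tree and
# emit lightweight hints (not a full map) about where key things live.
import os

# Assumed mapping from manifest file to a human-readable hint.
MANIFESTS = {"Cargo.toml": "Rust crate", "package.json": "Node package",
             "pyproject.toml": "Python project", "go.mod": "Go module"}

def draft_hints(root: str, max_depth: int = 2) -> list[str]:
    """Return one hint line per recognised module near the top of the tree."""
    hints = []
    for dirpath, dirnames, filenames in os.walk(root):
        depth = os.path.relpath(dirpath, root).count(os.sep)
        if depth >= max_depth:
            dirnames.clear()  # prune: do not descend further
        for name in filenames:
            if name in MANIFESTS:
                rel = os.path.relpath(dirpath, root)
                hints.append(f"{rel}: {MANIFESTS[name]}")
    return sorted(hints)
```

The output could be written to a draft .goosehints for the user to review rather than trusted blindly.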
Bespoke per repo MCPs
This is probably the heaviest approach, but there is no reason why an organisation couldn't make bespoke MCPs which have code search, rules, even build tools directly in an MCP which they provide, which can be activated when entering relevant directories.
Concrete tactical improvements to goose
Multi file read
Search/index/rg-style tools can find a lot of relevant content, but a text viewer tool that takes not just a file plus an optional range but a list of files can rapidly speed up the planning stage: instead of making linear round trips, the agent can read files in parallel.
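A sketch of what such a multi-file read tool could look like (the tool shape and request format are invented for illustration): take a list of (path, optional line range) pairs and return all contents in one round trip, reading concurrently.

```python
# Hypothetical multi-file read: many (path, range) requests, one round trip.
from concurrent.futures import ThreadPoolExecutor

def read_span(path, span=None):
    """Read a whole file, or a 1-indexed inclusive (start, end) line range."""
    with open(path) as f:
        lines = f.readlines()
    if span:
        start, end = span
        lines = lines[start - 1:end]
    return path, "".join(lines)

def read_many(requests):
    """requests: list of (path, (start, end) | None). Returns {path: text}."""
    with ThreadPoolExecutor() as pool:
        return dict(pool.map(lambda r: read_span(*r), requests))
```

From the model's perspective this collapses N tool calls into one, which matters most when latency per round trip dominates.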
Multi write
This is a relatively small upgrade to write: let it replace/write a batch of content in multiple places in a file, or across multiple files (the same content). Worth noting that goose already supports specialised fast-apply editor models which can help with this (at least for single-file content editing), but the idea of more bulk editing is also interesting in these larger cases, taking load off the main LLM.
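The simplest version of that bulk edit, sketched under the assumption of plain textual replacement (real fast-apply models would do something smarter):

```python
# Illustrative bulk edit: apply the same textual replacement across many
# files in one tool call, reporting per-file hit counts so the model can
# verify the edit landed where expected.
def replace_in_files(paths, old, new):
    hits = {}
    for path in paths:
        with open(path) as f:
            text = f.read()
        count = text.count(old)
        if count:
            with open(path, "w") as f:
                f.write(text.replace(old, new))
        hits[path] = count  # 0 means the file was left untouched
    return hits
```

Returning hit counts rather than full diffs keeps the tool's output compact, which matters when hundreds of files are touched.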
Expanding lead/worker to role based models
Lead/worker has been shown to work quite well, but can be expanded to have oracles, second opinions, reviewers and other roles.
This can be user-directed but also automatic. It also works nicely because you can play different models and providers off against each other right at the source, where the most is known.
Happily, this coincides with some WIP: #4036. The idea is that goose can decide at various points to switch to or consult another model: based on complexity, use a deeper-thinking model or a mixture of experts; try another approach to tool calls with a model suited to that; and play off different models as needed as things move fluidly back and forth between planning and execution.
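A toy sketch of that routing decision, with role names and thresholds invented purely for illustration (a real router would use richer signals than keywords and file counts):

```python
# Hypothetical role router: pick which model role should handle a task
# based on a crude complexity signal. All names/thresholds are invented.
def pick_role(task: str, files_touched: int) -> str:
    if "review" in task.lower():
        return "reviewer"            # second-opinion / review model
    if files_touched > 20 or "refactor" in task.lower():
        return "oracle"              # slower, deeper-thinking model
    return "worker"                  # fast default for routine edits
```

The interesting part is less the heuristic than where it runs: at the source, where goose knows the task, the repo, and the available providers.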
Code indexing and search
Indexing a codebase can take many forms: just-in-time tools, search or semantic search, symbol trees, and more.
Many MCPs exist and many approaches have been tried; some tools use embeddings and a vector store, some don't. There is no single obviously best approach, but there are many.
There are limitations to using only grep-style search to form a picture of and navigate a codebase, so it is worth exploring what can help but also scale to the largest codebases, especially now that goose providers can expose their embedding ability.
One MCP I contributed to uses qdrant behind the scenes with OpenAI embeddings: https://github.com/lambdamechanic/groma/ (in Rust), which has had some good results but is not battle-tested.
There are also many off-the-shelf MCPs that provide semantic code search and the like (though it would be nice to have recommended ones and a one-click experience). Similarity searching via vectors can work well, with an up-front cost of calculating and storing the embeddings (they can be recalculated from time to time, but as you are just looking for hints of where to find similar concepts, some drift is OK).
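The retrieval step itself is simple once embeddings exist. A brute-force sketch for illustration only: real setups (like the groma MCP above) delegate storage to qdrant and embedding to a provider, and the toy vectors here are stand-ins.

```python
# Illustrative only: brute-force cosine similarity over precomputed embeddings.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, index, k=3):
    """index: {file_path: embedding}. Returns paths ranked by similarity."""
    ranked = sorted(index, key=lambda p: cosine(query_vec, index[p]),
                    reverse=True)
    return ranked[:k]
```

Brute force is fine up to tens of thousands of chunks; beyond that an ANN index (which is what the vector stores provide) earns its keep.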
There is also WIP for tree-sitter: #3389
Tree-sitter can provide a fairly solid map of the code, helping find relevant symbols and sections rapidly with an in-memory search; rg can help with on-disk search; and vector search can help find relevant files and code sections, especially conceptually similar ones.