feat(filesystem): add streaming get_file_hash tool for cryptographic digests (md5/sha1/sha256) #2516

Open: wants to merge 7 commits into base: main

@Pucciano Pucciano commented Aug 9, 2025

Add a streaming file-hash tool to the filesystem server with Zod-validated
inputs, allowed-roots enforcement, and optional digest encoding.

Description

This PR adds a new tool, get_file_hash, to the filesystem MCP server.

  • Computes cryptographic digests via Node crypto.createHash + streaming
    fs.createReadStream (efficient on large files)
  • Supported algorithms (policy gate): md5, sha1, sha256 (default: sha256)
  • Output encoding: "hex" (default) or "base64" (optional)
  • Rejects non-regular files (directories/devices); respects roots/realpath checks
  • Zod input schema + ListTools registration
  • README updated with a tool entry consistent with existing docs
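The streaming approach described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the function name `hashFileStreaming` and its defaults are assumptions, and it leaves out the roots checks and the algorithm policy gate.

```typescript
import { createHash } from "node:crypto";
import { createReadStream } from "node:fs";

type DigestEncoding = "hex" | "base64";

// Hypothetical helper (name is an assumption, not taken from the PR).
// The file is piped through a Hash object chunk by chunk, so memory use
// stays flat no matter how large the file is.
function hashFileStreaming(
  filePath: string,
  algorithm: string = "sha256",
  encoding: DigestEncoding = "hex",
): Promise<string> {
  return new Promise((resolve, reject) => {
    // createHash throws synchronously for unknown algorithms; inside the
    // Promise executor that surfaces as a rejection.
    const hash = createHash(algorithm);
    const stream = createReadStream(filePath);
    stream.on("error", reject);
    stream.on("data", (chunk) => hash.update(chunk));
    stream.on("end", () => resolve(hash.digest(encoding)));
  });
}
```

Calling `hashFileStreaming("/some/file")` would resolve to a hex SHA-256 digest; passing `"base64"` as the third argument switches the output encoding.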

Server Details

  • Server: filesystem
  • Changes to: tools (new tool), unit tests (new tests), docs (README tools section)

Motivation and Context

I’m a computer forensics expert; verifying file integrity is critical to chain of
custody. Standards (e.g., ISO/IEC 27037, SWGDE) emphasize hashing digital evidence.
NIST recommends collision-resistant hashes (SHA-2); SHA-1/MD5 remain for legacy
identification but not for collision-sensitive uses. This tool defaults to SHA-256
while retaining MD5/SHA-1 for interoperability. Keeping the algorithm set small
improves DFIR compatibility and simplifies model prompts.

Providing get_file_hash inside the filesystem server lets LLM-driven workflows
compute/compare hashes under the same allowed-roots and realpath/symlink controls
as other file operations—no external copying, consistent and auditable results.

How Has This Been Tested?

  • Environment: macOS 15.5, LM Studio 0.3.22
  • MCP client: LM Studio (server built via Docker and used as a Docker-mapped
    MCP server; tool discovered via ListTools)
  • Models:
    • qwen/qwen3-coder-30b (benefits from explicit “when to use” + args in prompt)
    • openai/gpt-oss-120b (works with concise descriptions)
    • mistralai/devstral-small-2507 (tool calls succeed)
  • Unit tests: text vectors ("abc", "ForensicShark"), small binary snippet,
    encodings (hex/base64), non-regular paths rejected (dir/symlink/device), and
    unsupported algorithms (e.g., sha512, crc32, whirlpool) rejected by policy
  • Manual: end-to-end via stdio within LM Studio; expected digests returned;
    clear error on unavailable algorithms (FIPS/build)
  • Platform note: Not tested on Windows
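The "abc" text vector has widely published digests, so a known-answer check along these lines is possible. This is an illustrative sketch built on `node:assert`, not the PR's actual test file; the names `ABC_VECTORS` and `checkAbcVectors` are hypothetical. On a FIPS-restricted build, `md5` may be unavailable and `createHash` would throw.

```typescript
import { createHash } from "node:crypto";
import { strict as assert } from "node:assert";

// Well-known digests of the ASCII string "abc".
const ABC_VECTORS: Record<string, string> = {
  md5: "900150983cd24fb0d6963f7d28e17f72",
  sha1: "a9993e364706816aba3e25717850c26c9cd0d89d",
  sha256: "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad",
};

// Hypothetical check: hex output must equal the known vector, and base64
// output must decode to the same bytes.
function checkAbcVectors(): void {
  for (const [algorithm, expectedHex] of Object.entries(ABC_VECTORS)) {
    const hex = createHash(algorithm).update("abc").digest("hex");
    assert.equal(hex, expectedHex, `${algorithm} hex mismatch`);

    const b64 = createHash(algorithm).update("abc").digest("base64");
    const decoded = Buffer.from(b64, "base64").toString("hex");
    assert.equal(decoded, expectedHex, `${algorithm} base64 mismatch`);
  }
}
```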

Breaking Changes

None. Additive only.

Types of changes

  • New feature (non-breaking)
  • Documentation update
  • Bug fix
  • Breaking change

Checklist

  • Follows MCP security best practices (roots-restricted, realpath/symlink checks)
  • README updated (tool entry)
  • Code follows repo style guidelines
  • Appropriate error handling (unsupported algorithms / non-regular files)
  • Inputs documented (path, algorithm, encoding)
  • Tested with an LLM client (LM Studio)
  • New and existing tests pass locally
  • CI not included in this PR (out of scope)

Additional context

  • Paths should be relative to the allowed base directory (as returned by
    list_directory) when calling the tool from clients.
  • Tool description made concise and Qwen-friendly; instructs models to return
    only the digest string.
  • Only md5, sha1, sha256 are supported by design; no extended algorithms
    or env toggles in this PR.
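For readers unfamiliar with the allowed-roots and realpath checks mentioned in this description, the idea can be sketched as below. The server's own validation code is not shown in this PR text, so `resolveWithinRoots` and its error message are assumptions made purely for illustration.

```typescript
import { realpath } from "node:fs/promises";
import * as path from "node:path";

// Hypothetical helper: resolve symlinks first, then require the real path
// to sit inside one of the allowed root directories.
async function resolveWithinRoots(
  requested: string,
  allowedRoots: string[],
): Promise<string> {
  const real = await realpath(requested); // follows symlinks
  for (const root of allowedRoots) {
    const realRoot = await realpath(root);
    const rel = path.relative(realRoot, real);
    const escapes =
      rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel);
    if (!escapes) {
      return real;
    }
  }
  throw new Error(`Access denied: ${requested} is outside the allowed directories`);
}
```

Checking the resolved path rather than the requested one is what keeps a symlink inside an allowed directory from pointing a hash request at a file outside it.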

Introduce a streaming file-hash tool using `crypto.createHash` and `fs.createReadStream`.
Validates input via Zod and respects allowed-roots path checks.

- Tool: `get_file_hash`
- Args: { path: string, algorithm: "md5"|"sha1"|"sha256", encoding?: "hex"|"base64" }
- Output: digest as hex (default) or base64
- Handler: added to CallToolRequest; included in ListTools

Notes:
- Rejects non-regular files (e.g., directories/devices)
- Fails fast if algorithm is unavailable in this Node/OpenSSL build
  (e.g., FIPS) with a clear error message.
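The argument shape listed above can be illustrated with a small validator. The PR uses a Zod schema; the sketch below is a dependency-free stand-in written with plain TypeScript checks, and `parseGetFileHashArgs` is a hypothetical name. Defaulting `algorithm` to `sha256` and `encoding` to `hex` follows the defaults stated in the description, but how the real schema applies them is an assumption.

```typescript
const ALGORITHMS = ["md5", "sha1", "sha256"] as const;
const ENCODINGS = ["hex", "base64"] as const;

type Algorithm = (typeof ALGORITHMS)[number];
type Encoding = (typeof ENCODINGS)[number];

interface GetFileHashArgs {
  path: string;
  algorithm: Algorithm;
  encoding: Encoding;
}

// Hypothetical stand-in for the Zod schema: validates raw tool arguments
// and fills in the documented defaults.
function parseGetFileHashArgs(raw: unknown): GetFileHashArgs {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("arguments must be an object");
  }
  const { path, algorithm = "sha256", encoding = "hex" } =
    raw as Record<string, unknown>;
  if (typeof path !== "string" || path.length === 0) {
    throw new Error("path must be a non-empty string");
  }
  if (!ALGORITHMS.includes(algorithm as Algorithm)) {
    throw new Error(`algorithm must be one of ${ALGORITHMS.join(", ")}`);
  }
  if (!ENCODINGS.includes(encoding as Encoding)) {
    throw new Error(`encoding must be one of ${ENCODINGS.join(", ")}`);
  }
  return { path, algorithm: algorithm as Algorithm, encoding: encoding as Encoding };
}
```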

…ed testing

Extract `getFileHash` from `index.ts` into a standalone module to avoid pulling server bootstrap and top-level await into unit tests.
This decouples hashing logic from transport setup and other side effects, allowing tests to import the function directly.

No behavior change: the server now imports `getFileHash` from `hash-file.ts`. This prepares the codebase for comprehensive unit tests covering success and error paths.

Add a policy gate to `getFileHash` that rejects any algorithm not in {md5, sha1, sha256}.

Motivation: these are the widely used hashes in digital forensics; keeping the list small helps
interoperability with DFIR tools and simplifies model/tool prompts. Also prepares unit tests
to assert failure on unsupported algorithms (e.g., sha512, crc32, etc.). Runtime availability is
still checked via crypto.getHashes to surface FIPS/build issues cleanly. Default remains sha256;
md5/sha1 are retained for legacy sets.
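A two-step check of this kind (policy first, runtime availability second) might look like the sketch below. The function name `assertAlgorithmUsable` and the error wording are assumptions; only `crypto.getHashes` and the three-algorithm set come from the commit message.

```typescript
import { getHashes } from "node:crypto";

// Policy set named in the commit message.
const POLICY_ALGORITHMS = new Set(["md5", "sha1", "sha256"]);

// Hypothetical gate: reject anything outside the policy set, then confirm
// that this Node/OpenSSL build actually provides the algorithm.
function assertAlgorithmUsable(algorithm: string): void {
  if (!POLICY_ALGORITHMS.has(algorithm)) {
    throw new Error(
      `Unsupported algorithm "${algorithm}": only md5, sha1, sha256 are allowed`,
    );
  }
  if (!getHashes().includes(algorithm)) {
    throw new Error(
      `Algorithm "${algorithm}" is not available in this Node/OpenSSL build`,
    );
  }
}
```

Keeping the two failures separate lets a caller tell "disallowed by policy" (for example `sha512`) apart from "allowed but missing in this build" (for example `md5` under a FIPS configuration).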

Add unit tests covering hashing of the text "ForensicShark" across md5/sha1/sha256 with expected
digests. Validate rejection of non-regular paths (directory, symlink to directory, device like
/dev/null when present). Verify hashing of a small binary snippet across all three algorithms.
Assert that unsupported algorithms (e.g. sha512, crc32, whirlpool) throw per the policy gate.
Exercise both encodings (hex and base64) for text and binary cases. Tests are platform-aware:
skip the device case on Windows and create a junction for the directory symlink on Windows.
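The non-regular-path cases exercised by these tests rest on a check similar to the one sketched here. `assertRegularFile` is a hypothetical name, and using `stat` (which follows symlinks) is an assumption about how the rejection could be implemented, not a description of the PR's code.

```typescript
import { stat } from "node:fs/promises";

// Hypothetical guard: stat() follows symlinks, so a symlink to a directory
// is reported as a directory and rejected along with devices and sockets.
async function assertRegularFile(filePath: string): Promise<void> {
  const info = await stat(filePath);
  if (!info.isFile()) {
    throw new Error(`Not a regular file: ${filePath}`);
  }
}
```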

Refine the `get_file_hash` tool description to be concise and Qwen-friendly with
properly escaped quotes. Document encoding as optional with default "hex";
require an absolute path and state the input must be a regular file under allowed
directories (not directories/devices). Instruct models to return only the digest
string. This improves tool-calling reliability and matches the Zod schema and
server defaults.
@Pucciano Pucciano marked this pull request as ready for review August 9, 2025 14:39
@olaservo olaservo added the server-filesystem (Reference implementation for the Filesystem MCP server - src/filesystem) and enhancement (New feature or request) labels Aug 13, 2025