Agent tools: Building a Filesystem Toolkit for AI Agents

After comparing different AI agent SDKs in my previous post, I noticed something: while the SDKs themselves are relatively similar at the surface level, the differences become more apparent when you dig deeper into actual capabilities.

One capability that stands out in Claude Code is its filesystem tools. The ability to work with local files—reading, writing, searching, and editing code—is what makes it so useful and versatile, not only for software development tasks but for almost anything that involves files (which is nearly everything). It’s not just about calling an API; it’s about having a sophisticated understanding of how to interact with a large base of information, whether that’s code or other types of documents.

This got me wondering: What would it take to build similar filesystem tools for other agent frameworks? And more importantly, how much of Claude Code’s effectiveness comes from the tools versus everything else?

I decided to find out by building my own filesystem toolkit. My goals were to:

  1. Learn how fast I can develop agent capabilities using Claude Code and AI-driven development
  2. Measure how close (or far) I can quickly get to the quality of Claude Code’s filesystem tools

The Build

Architecture First

Before jumping into implementation, I designed a simple but extensible architecture:

Generic Filesystem API - A set of core operations (read, write, edit, list, glob, grep) that work independently of any specific agent framework.

Pluggable Tools - A functional API using curried functions where each operation is bound to a workspace directory for security sandboxing.

SDK Adapters - Thin wrappers that adapt the generic API to different agent frameworks (starting with LangChain).

Eval System - A way to test the tools with an actual agent running real scenarios.

The architecture decisions paid off immediately. By keeping the core separate from framework-specific code, I could:

  • Unit test individual tools without spinning up a full agent
  • Iterate much faster on tool implementation
  • Easily add support for new frameworks later

Implementation with Claude Code

The implementation went surprisingly smoothly:

Skeleton Setup - I used Claude Code to set up the basic project structure, TypeScript configuration, and build setup.

Individual Tools - Here’s where it got interesting. I used vibe-kanban to multitask Claude Code, building multiple tools in parallel. Each tool (read, write, edit, list, glob, grep) was implemented with its own test suite.

The development flow felt effortless. Instead of writing detailed specifications, I gave relatively high-level guidance, and Claude Code handled the implementation details. For example, I mentioned using a “currying pattern for workspace restriction,” and it implemented the entire functional API pattern correctly.

LangChain Adapter - As with my SDK comparison work, wrapping the generic tools for LangChain was straightforward with Claude Code’s help.
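
The adapter layer really is thin. Here’s a dependency-free sketch of its shape—the real package uses LangChain’s own tool helper, so the interface and names below are illustrative only:

```typescript
// Sketch of an adapter: the generic tool (a plain function) gets wrapped
// with the name/description metadata an agent framework expects.
interface AgentTool {
  name: string;
  description: string;
  invoke(input: Record<string, string>): Promise<string>;
}

function adaptReadTool(read: (relPath: string) => string): AgentTool {
  return {
    name: 'read_file',
    description:
      'Read a file from the workspace. Input: { path: relative file path }',
    invoke: async (input) => read(input.path),
  };
}
```

Because the core tool is just a curried function, adapters for other frameworks only differ in the metadata format they emit.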

Eval System - This took longer than building the tools themselves, but I believe it was worth it. Testing tools in isolation is one thing; seeing how an agent actually uses them in realistic scenarios is entirely different.

The eval system consists of:

  • A simple ReAct agent using Claude 3.5 Sonnet
  • A set of filesystem scenarios with specific tasks
  • A runner that executes scenarios and reports results

Building the evals took about another two hours, and they let me verify that the tools are “somewhat working” when used by an agent. Still, it feels like getting them to be “really good” would require much more learning and iteration—something that would need considerably more time.

The Numbers

  • Tools Implementation: ~2 hours
  • Eval System: ~2 hours
  • Total Test Coverage: 139+ tests
  • Final Result: A working package for internal use

The tools work quite well based on the evals. An agent can:

  • Find files by pattern using glob
  • Search code with grep (powered by ripgrep)
  • Read and write files
  • Make surgical edits without rewriting entire files
  • Navigate directory structures
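
As one illustration of what a glob tool has to get right, patterns like src/**/*.ts can be compiled to a regex. This is a toy translation I’m sketching for the post (covering only * and **/), not the package’s implementation:

```typescript
// Toy glob-to-regex translation: "*" stays within one path segment,
// "**/" matches zero or more whole segments.
function globToRegExp(pattern: string): RegExp {
  const source = pattern
    .replace(/[.+^${}()|[\]\\]/g, '\\$&') // escape regex special characters
    .replace(/\*\*\//g, '\0')             // stash "**/" so "*" handling skips it
    .replace(/\*/g, '[^/]*')              // "*" matches within one segment
    .replace(/\0/g, '(?:[^/]+/)*');       // "**/" matches any directory depth
  return new RegExp('^' + source + '$');
}
```

A real implementation also needs bare `**`, `?`, brace expansion, and ignore rules—exactly the kind of long tail that separates “somewhat working” from “really good.”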

What I Learned

1. Code is Easy, Quality is Hard

The “last 10-20%” of tool quality is surprisingly difficult to achieve. It comes down to:

  • Parameter design - What arguments should the tool accept?
  • Description clarity - How do you describe the tool so the agent uses it correctly?
  • Result formatting - How should results be structured for the agent to understand?
  • Edge cases - For example, how does the agent use the edit tool when it wants to add a new line? (Spoiler: it cannot use "" as the old string.)

Each of these requires careful iteration. A small change in tool description can dramatically affect how and when an agent chooses to use it.
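
That empty-old-string edge case shaped the edit tool’s contract. Here’s a sketch of the replacement logic—the error messages and exact rules are illustrative, not copied from the package:

```typescript
// Sketch: string-replacement edit that rejects an empty old string
// (which would match everywhere) and requires the old string to be
// unique, so edits stay surgical.
function applyEdit(content: string, oldStr: string, newStr: string): string {
  if (oldStr === '') {
    throw new Error(
      'old string must be non-empty; to append, include the preceding line in both strings',
    );
  }
  const first = content.indexOf(oldStr);
  if (first === -1) {
    throw new Error('old string not found in file');
  }
  if (content.indexOf(oldStr, first + 1) !== -1) {
    throw new Error('old string is not unique; include more surrounding context');
  }
  return content.slice(0, first) + newStr + content.slice(first + oldStr.length);
}
```

The useful part is the error messages: when an edit is rejected, the message tells the agent how to fix its next attempt.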

2. The System Prompt Matters, a LOT

While my tools performed reasonably well, there’s still a significant quality gap between my simple eval agent and Claude Code. The gap becomes especially apparent in complex scenarios.

I believe much of this comes down to the system prompt. Claude Code’s agent prompt likely encodes a lot of knowledge about:

  • When to use which tool
  • How to break down complex tasks
  • When to ask for clarification vs. making assumptions
  • How to recover from errors

This kind of prompt engineering is an art form I’m only beginning to appreciate.

3. Code Might Be Easy to Reproduce, Prompts Are Not

Here’s the most important lesson: In the age of AI-assisted development, code itself is becoming easier to reproduce. But the prompts—both for tools and for agents—represent accumulated knowledge that’s much harder to replicate.

You can look at Claude Code’s tool interface and implement something similar in a few hours. But the quality embedded in how it describes those tools to the agent, and in the agent’s own system prompt, represents months or years of learning and refinement.

4. Evals Are Critical (And I Need to Learn More)

I spent about as much time building evals as building the tools, and I still feel like I’m at the beginning of understanding effective evaluation.

Without good evals:

  • You can’t measure improvements systematically
  • You don’t know which changes help or hurt
  • You can’t compare different approaches objectively

Getting from “somewhat working” to “really good” tools requires solid evals to work against. This is definitely an area where I need to dig deeper.

Try It Yourself

The toolkit is available on GitHub:

Agent Toolkit on GitHub

It provides:

  • A functional API for filesystem operations
  • Built-in workspace sandboxing for security
  • LangChain/LangGraph integration
  • Full TypeScript support

Here’s a quick example to get an idea of how it can be used in an agent:

import { ChatAnthropic } from '@langchain/anthropic';
import { createReactAgent } from '@langchain/langgraph/prebuilt';
import { createLangchainFileSystemTools } from '@agent-toolkit/filesystem/langchain';

// Create filesystem tools
const tools = createLangchainFileSystemTools({
  workspace: '/path/to/project'
});

// Create an agent
const llm = new ChatAnthropic({
  model: 'claude-3-5-sonnet-20241022',
  temperature: 0,
});

const agent = createReactAgent({ llm, tools });

// Run it
const result = await agent.invoke({
  messages: [{
    role: 'user',
    content: 'Find all TypeScript files that import React'
  }]
});

What’s Next

This exercise taught me that building agent capabilities with modern tools is faster than ever. But it also showed me how much depth there is to making those capabilities truly excellent.

Some areas I want to explore:

  • Digging deeper on effective eval design and methodology
  • Understanding what makes a tool description effective for agents
  • Learning more about system prompt patterns that improve agent reliability
  • Building more sophisticated tools (web scraping, API interactions, etc.)