What I Learned Building AI Agents in 2025
I spent this year building AI agents and kept learning the same lesson the hard way: agent architecture is forced into existence by problems, not designed in advance. This post is how I got there. It’s not a tutorial. It’s a memo to myself, written so I don’t repeat the same mistakes next year.
1. The Core Shift in Thinking
My initial approach was to find well-written articles on agent architecture, thinking I could just copy them and everything would be perfect. So I built a series of elegant-looking designs — upgraded system prompts, context isolation, added memory. Token costs doubled, accuracy dropped, failures multiplied, and the worst failures were impossible to trace back to any single change.
What went wrong? Traditional software is deterministic: given the same input, it produces the same output, so you can reason about architecture before you build. An AI agent is non-deterministic. Its behavior depends on the model’s interpretation of context, which shifts with every change you make. Layering sophisticated architecture onto a non-deterministic system means stacking uncertainty on top of uncertainty. You can’t design around behavior you haven’t observed yet.
Those “brilliant” articles with perfect architectures are correct, but they’re the destination, not the starting point. Copy them directly, without going through the struggle in between, and you probably can’t even get the first component working.
2. When to Use Agents, When Not To
When you first get agents working, there’s a temptation to use them for everything. I fell into this trap. Not every problem needs an agent.
If a single API call can solve it, don’t use an agent. I learned this when I wanted to summarize a paragraph into one sentence and deployed a Plan-and-Execute pattern. What could have been done with one call got split into planning first, then executing. The task didn’t get more complex; the pipeline did.
Use workflows for deterministic processes, not conversational agents. A chain like “transcribe subtitles → identify edit points → generate plan → control audio/video” has defined inputs, fixed intermediate steps, one-shot output, and no user intervention. This kind of task belongs in workflow tools like n8n or Dify.
Two signals that you actually need a conversational agent:
- The process requires human involvement. Some tasks aren’t right-or-wrong questions; they’re aesthetic questions. AI can rarely generate satisfying results in one shot. People need to guide and iterate.
- There are too many options for a UI to handle. If you insist on using buttons for every modification request, the interface becomes an airplane cockpit.
3. Building the First Version
Get it running first, make it elegant later. I started one project by choosing a full-featured agent framework and spending days designing node graphs before writing a single prompt. When I finally ran it, the baseline performance was so poor that all that architecture was beside the point. Now I use the simplest form of whatever I pick. Get the baseline running. Know where the floor is. Then decide whether to add anything.
Start prompts loose, then tighten. I wasted time early on looking for “god-tier prompts” to copy. A first prompt shouldn’t be a complex instruction manual. Write it without constraints and see what the model naturally does, then keep adding constraints one at a time — formatting, reasoning paths, few-shot examples. As long as the agent follows each new instruction, keep layering.
Add tools when prompts aren’t enough. Sometimes poor performance isn’t a prompt problem — the model simply lacks the capability. If I want it to reference trending designs but it can’t access the data, no amount of prompt tweaking will help. It needs a tool.
Once tools are in place, something clicks. The model starts deciding which tools to use on its own, chaining search into code generation without being told to. It feels like emergence. The temptation is to keep adding more.
But when tools multiply, so do problems. Each tool’s description, conversation history, code, and images all compete for attention. The model starts picking the wrong tool, or ignoring relevant context buried three screens up.
This was where I stopped adding capabilities and started controlling what the model could see.
4. Context Engineering
Different tasks need different types of information, and mixing them causes interference. Take a video agent: design tasks care about user intent, style, and mood; coding tasks care about interface structure, output format, and correctness. Feed both to the same context, and design information muddies code accuracy while code information slows down design decisions.
Separating planner from executor. A top-level planner dispatches tasks to specialized sub-agents. The design agent only sees design information. The code agent only sees code information.
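The separation can be sketched in a few lines. This is a minimal illustration, not a real framework: `SubAgent`, `route_task`, and the keyword-based routing are all hypothetical stand-ins for an actual LLM dispatch, but they show the key property, that each sub-agent owns its own prompt and context slice.

```python
# Minimal sketch of planner/executor separation with isolated contexts.
# All names (SubAgent, route_task) and the keyword routing are illustrative.

from dataclasses import dataclass, field


@dataclass
class SubAgent:
    name: str
    system_prompt: str                            # only this agent's instructions
    context: list = field(default_factory=list)   # only this agent's information

    def handle(self, task: str) -> str:
        # A real system would call an LLM here with self.system_prompt
        # plus self.context; we just echo for illustration.
        return f"[{self.name}] handled: {task}"


design_agent = SubAgent("design", "You decide style, mood, and layout.")
code_agent = SubAgent("code", "You write correct code against the given interface.")


def route_task(task: str) -> str:
    """Top-level planner: dispatch to the specialist, never mix contexts."""
    agent = design_agent if ("style" in task or "mood" in task) else code_agent
    return agent.handle(task)
```

The point is structural: the design agent's context never leaks into the code agent's, so adding design information can no longer degrade code accuracy.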
Slimming down the planner’s context. Sub-agents are easy to isolate, but the planner seemingly needs to see everything to make assignments, so its context keeps growing. My approach: don’t feed all context to the planner. First determine what the user wants, then dynamically load only the necessary rules and tools. Specific techniques include RAG (retrieving relevant documents based on user input) and code filtering (using grep or scripts to surface relevant content).
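The code-filtering idea can be sketched with plain regex matching, grep-style. Everything here is an assumption for illustration (`filter_context`, the one-line window, the file-dict input); the technique is simply: surface only lines that mention the user's terms, instead of loading whole files into the planner.

```python
# Hypothetical grep-style filtering step: only lines relevant to the
# user's request (plus a small window) reach the planner's context.
import re


def filter_context(files: dict[str, str], query_terms: list[str], window: int = 1) -> str:
    """Return only the lines that mention a query term, with nearby context."""
    pattern = re.compile("|".join(map(re.escape, query_terms)), re.IGNORECASE)
    picked = []
    for path, text in files.items():
        lines = text.splitlines()
        hits = {i for i, line in enumerate(lines) if pattern.search(line)}
        keep = sorted({j for i in hits
                       for j in range(max(0, i - window), min(len(lines), i + window + 1))})
        if keep:
            picked.append(f"--- {path} ---")
            picked.extend(lines[j] for j in keep)
    return "\n".join(picked)
```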
There’s a trap here: you can’t use another LLM call to cheaply decide what context to load, because that model still needs enough context to make a good decision. Embedding-based retrieval helps, but for nuanced routing decisions, you often end up feeding in nearly as much as you were trying to save.
Progressive disclosure is a more structured approach. Load in layers:
- First layer: load only skill names and descriptions, so the agent knows what it can do
- Second layer: read specific execution guides only when tasks match
- Third layer: read forms, templates, and scripts on demand during execution
The skill library can grow without proportional cost increases.
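The three layers above can be sketched as a file-backed skill library. The directory layout (`skills/<name>/description.txt`, `guide.md`, `assets/`) and function names are my own assumptions for illustration; what matters is that only layer one is ever in the base context.

```python
# Sketch of three-layer progressive disclosure for a skill library.
# Layer 1 (always loaded): names + one-line descriptions.
# Layer 2 (on task match): the skill's execution guide.
# Layer 3 (on demand): templates and scripts read during execution.
# The file layout is an assumption, not a standard.
from pathlib import Path

SKILLS_DIR = Path("skills")


def layer1_index() -> str:
    """Only names and descriptions enter the base context."""
    lines = []
    for skill in sorted(SKILLS_DIR.iterdir()):
        desc = (skill / "description.txt").read_text().strip()
        lines.append(f"- {skill.name}: {desc}")
    return "\n".join(lines)


def layer2_guide(skill_name: str) -> str:
    """Loaded only when the planner matches a task to this skill."""
    return (SKILLS_DIR / skill_name / "guide.md").read_text()


def layer3_asset(skill_name: str, asset: str) -> str:
    """Templates/scripts pulled in on demand during execution."""
    return (SKILLS_DIR / skill_name / "assets" / asset).read_text()
```

Adding a hundredth skill costs one more line in the layer-one index, not a hundredth guide in the context window.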
5. Memory Systems
With sub-agents in place, I hit the question of how the planner hands code to the executor. Direct copy-paste consumes massive output tokens, and models aren’t good at mechanical copying — they tend to introduce bugs.
File-based “pointers” instead of passing content. The planner writes code to a file and tells the executor to read and modify the file at that path. Output tokens drop dramatically, and error rates decrease.
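The handoff protocol looks roughly like this. The helper names and message schema are hypothetical; the point is that the executor message carries a path, never the content.

```python
# Sketch of passing a file path instead of file content between agents.
# Function names and the message schema are illustrative assumptions.
from pathlib import Path


def planner_handoff(code: str, workdir: Path) -> dict:
    """Planner writes the code once and passes only a pointer."""
    path = workdir / "draft.py"
    path.write_text(code)
    # The message to the executor costs a handful of tokens, not the whole file,
    # and there is no lossy model-mediated copy step to introduce bugs.
    return {"task": "refactor", "file": str(path)}


def executor_receive(message: dict) -> str:
    """Executor reads the file itself instead of trusting a pasted copy."""
    return Path(message["file"]).read_text()
```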
Short-term vs. long-term memory. I first realized I needed both when a multi-step editing task failed after the conversation reset. The agent had no idea what it had already done. Short-term memory lives within the current context window and vanishes when the conversation ends. Long-term memory persists across turns — to-do lists, current progress, what’s been completed. Without it, any interrupted multi-step task starts over from scratch.
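A minimal version of that long-term memory is just a to-do list persisted to disk after every step, so a conversation reset loses nothing. The JSON schema and file path here are assumptions for illustration.

```python
# Sketch of file-backed long-term memory: task progress that survives
# conversation resets. Schema and path are illustrative assumptions.
import json
from pathlib import Path

STATE_FILE = Path("task_state.json")


def load_state() -> dict:
    """Rehydrate progress at the start of any (possibly new) conversation."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {"todo": [], "done": []}


def complete_step(state: dict, step: str) -> dict:
    """Move a step from todo to done and persist immediately."""
    state["todo"].remove(step)
    state["done"].append(step)
    STATE_FILE.write_text(json.dumps(state))  # persist after every step
    return state
```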
Intermediate state. In multi-turn conversations, should tool call details carry over to the next turn? My default: don’t pass them. Only pass user input and agent output. When this causes a failure, resist the urge to dump all state back in. Instead, find the one key piece of information that was lost and save it explicitly (e.g., a file path the executor needed but the planner didn’t pass along).
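As a sketch, the pruning rule is: carry over user and assistant messages, drop tool-call records, and re-inject only the facts that were explicitly saved. The message format and the `saved` dict are assumptions, not a particular API.

```python
# Sketch of intermediate-state pruning between turns: keep user input and
# agent output, drop tool-call details, re-inject explicitly saved facts.
# The message shape and "saved" dict are illustrative assumptions.
def next_turn_context(history: list[dict], saved: dict) -> list[dict]:
    """Build the next turn's context from a pruned copy of the last one."""
    kept = [m for m in history if m["role"] in ("user", "assistant")]
    if saved:
        # The one key piece of information that would otherwise be lost,
        # e.g. a file path the executor needed.
        facts = "; ".join(f"{k}={v}" for k, v in saved.items())
        kept.append({"role": "system", "content": f"Saved state: {facts}"})
    return kept
```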
6. Debugging
With sub-agents, context isolation, and memory in place, debugging becomes extremely hard. A failure means staring at a chain of decisions made by multiple models, each with its own context. Where did it go wrong?
The only answer I’ve found is a robust logging system that stores process, not results: which tools were called, in what order, how many tokens each step consumed, and which loaded context went unused.
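A process log can be as simple as an append-only JSONL file with one entry per decision. The field names below are my own choices, not a standard schema; the essential ones are the step counter (ordering), the acting agent, the tool chosen, and the context that was loaded.

```python
# Sketch of process logging: one JSONL entry per tool call, recording
# order, cost, and loaded context. Field names are illustrative.
import json
import time


class AgentLog:
    def __init__(self, path: str):
        self.path = path
        self.step = 0

    def record(self, agent: str, tool: str, tokens: int, context_keys: list[str]):
        self.step += 1
        entry = {
            "step": self.step,        # position in the decision chain
            "ts": time.time(),
            "agent": agent,           # who made the call
            "tool": tool,             # which tool was chosen
            "tokens": tokens,         # cost of this step
            "context": context_keys,  # what was loaded (to spot unused context)
        }
        with open(self.path, "a") as f:
            f.write(json.dumps(entry) + "\n")
```

Replaying the chain then means reading the file top to bottom until the first entry that looks wrong.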
One failure had me stuck for hours until I checked the logs and saw the planner had routed a design task to the code agent. The code agent tried to interpret mood keywords as variable names. With the logs, I could see the routing decision, the context each agent received, and exactly where the chain broke. Without them, I would have blamed the prompt.
7. From Breadth to Depth: Introducing Skills
By this point, the system could handle breadth — it could generate code, search for references, manage files. But when I asked it to produce a financial report end-to-end, the output came back different every time.
Conversational agents are inherently loose. They arrange their own workflows, and each run might differ. Some tasks, though, need a strict methodology with fixed steps.
Skills solve this. A skill is packaged, composable, executable “procedural knowledge” for an agent, with three characteristics:
- Procedural: It encodes “how to do it” — for a financial report: fetch data, validate, aggregate, then generate.
- Composable: Skills stack like building blocks. One task can load multiple skills simultaneously.
- Executable: The model handles planning and decisions; scripts handle deterministic execution (calculations, transformations, file generation).
The agent still serves as the entry point, handling intent recognition and conversation. The skill ensures the report always comes back with the same columns in the same order.
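The three characteristics can be made concrete in a toy skill. Everything here is illustrative (the `Skill` class, the step functions, the report format); the structural point is that the steps run in a fixed order and the deterministic work is plain code, not model output.

```python
# Sketch of a skill as packaged procedural knowledge: a fixed, ordered
# pipeline whose deterministic steps are scripts, not LLM calls.
# All names and the report format are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Skill:
    name: str
    steps: list[tuple[str, Callable]]  # ordered (description, function) pairs

    def run(self, data):
        for desc, fn in self.steps:    # same steps, same order, every run
            data = fn(data)
        return data


financial_report = Skill(
    "financial_report",
    steps=[
        ("fetch", lambda rows: rows),  # stand-in for a real data fetch
        ("validate", lambda rows: [r for r in rows if r["amount"] is not None]),
        ("aggregate", lambda rows: {"total": sum(r["amount"] for r in rows)}),
        ("generate", lambda agg: f"Total: {agg['total']}"),
    ],
)
```

Two skills loaded for one task are just two such pipelines the planner can sequence, which is what makes them composable.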
Those articles I tried to copy at the start? They make sense now. Not because I finally understood them, but because I built the problems they were solving.