Because [[models have a limit.]], you don’t need the best model at any given time, you need a "good enough" model combined with architectural innovations.
Tool use is a good example; back in the day there used to be a little "web" button on chatgpt (and other tools), which would perform a Retrieval-Augmented-Generation ("RAG"). All that meant is that the chat harness would do a web search, stuff the top ten results into the prompt context, and run it. Usually, this led to pointless answers along the lines of "i don’t have enough information to answer your question" or, humorously, "that webpage had an error".

However, very soon after, GPT Researcher came out. It proposed that we have these LLM’s which can answer questions, why not answer "do i need more info to answer the query?" A few months later, chatgpt added a "deep research" button (which was most people’s introduction to the idea).

The exact same model produces a far better result when put through a better architecture. LLM’s aren’t about the quality of the model anymore, they’re about the integration into other tools.
"Fill-In-Middle" (fim) is when you’re in a code editor, writing code, and a model offers an autocomplete suggestion for the next few statements. In 2023, the first use of this was Copilot, backed by (OpenAI) Codex. In VSCode or Visual Studio, suddenly instead of "intellisense" popping up windows asking which variable you might mean, the entire line would have a (usually correct) proposed solution for you. Hit tab, all done.
This led to a rush of testimonials like this;
I have been using Copilot now for just under a year and it’s completely changed the way I code. I can spend less time writing small or repeatable functions and more time thinking about more intricate problems that require solving.
tjharrison
Very soon after, a "chat" feature was added, that let users ask questions about code. Snippets of code, or parts of the codebase, would be included in the context along with the prompt, and the model would answer them, or propose changes. This was the original "agentic" implementation.
Anthropic took this idea further, what if we took Deep Research’s looping behavior, and the idea of reading and editing files, and made all of those the same interface? Models would be offered "tools" they could run, and they’d use those tools until they were satisfied. This became the MCP protocol, and shortly thereafter Anthropic released Claude Code, which almost immediately obsoleted stuff like Cursor or other in-IDE chatting tools. It’d loop, creating edits, verifying changes, searching for info, and even shelling out on your computer to compile code or run tools, until the prompt was satisfied.
While Claude Code tends to emphasize that [[Software Managers have Dunning-Kruegar]], it’s undeniably a much better tool than trying to work with chatgpt over the web, like the old days.
Ultimately we’re using context to make up for the model’s shortcomings. And [[Userland is Not Enough]]. The model’s training might bias it towards some poor answers, and we can patch over it with architecture, but we can never truly be rid of it. It’s not quite a straightforward win, the way that inverting RAG to tool-use was, we’re starting to see cracks.
There are probably still architectural wins to be had - integrations with more tools, adding LLMs into small use cases in other industries. Things like Openclaw demonstrate a real desire for people to manage their entire technical life with an LLM (whoever can make windows drivers, sound and bluetooth settings into a chat window will probably be a millionaire) - though these broader uses of LLMs are mostly serving to introduce people to the concept that [[Alignment is Hard, Actually]]. And it’s possible that some other excellent tool like code editing harnesses will come out. But, again, [[models have a limit.]]. There is only so complex we can make this before the lack of quality prevents the tool being used.