Context Engineering: The Foundation for Reliable AI Agents

Context is king in the agentic world. Pairing a performant reasoning model like Claude, DeepSeek or GPT-5 with the right context drives efficient planning and tool usage and improves multistep reasoning, leading to personalized conversations, higher task accuracy and relevant responses.

In this article, we present the need for context engineering and associated benefits, identify challenges developers face as they use it for developing AI agents, and propose a high-level architecture to help address them.

Solving the Context Dilemma: Too Much vs. Too Little

Enterprises have access to vast amounts of structured and unstructured data. However, feeding this data directly to agents as context introduces noise and buries important information, confusing task comprehension, and it can hurt a large language model’s (LLM) situational awareness once the limited context window is breached. Using a long context is not always the solution to this problem, as I’ve written before. On the other hand, sending too little context can cause agents to hallucinate. Simply put: garbage in, garbage out.

Context engineering refers to a collection of techniques and tools used to ensure an AI agent has only the necessary information to complete assigned tasks successfully. Based on the concept of context engineering described by Harrison Chase of LangChain, context engineering consists of the following:

  • Tool selection means ensuring the agent has access to the right tools for retrieving the information needed to accomplish the specified task. For example, consider a scenario where an agent is asked to complete an action, such as planning a trip to Maui for a family with two kids and a dog. It should be able to retrieve all tools that are required to answer the user’s question and execute tasks reliably.
  • Memory use is also a factor. It’s important to equip the agent with short-term memory that provides context for personalizing the ongoing session between the user and the agent, as well as long-term memory that offers context across multiple sessions to make the interactions cohesive, factual and even more personalized. This spans various memory types such as profile, semantic, episodic, conversational and procedural. It also includes working memory, which is used for sharing context for seamless task coordination among agents in a multiagent system.
  • Another component is prompt engineering. This ensures the agent has access to the right prompt, which is clearly defined in terms of the agent’s behavior, including specific instructions and constraints.
  • Finally, there’s retrieval. Dynamically retrieving relevant data based on the user’s question and inserting it into the prompt before sending it to the LLM keeps responses grounded and accurate. This is achieved by using Retrieval-Augmented Generation (RAG) and direct database calls. Enterprises generally have a polyglot environment with multiple sources of truth. In such cases, the Model Context Protocol (MCP) allows developers to retrieve context from numerous data sources in a standardized manner.

The above context is shared with the agent and subsequently the reasoning LLM. Augmenting the agent prompt with the relevant tool names and their specs, the contents of short- and long-term memory, and relevant content retrieved from RAG, databases and SaaS services sets the agent up for successful task execution. Figure 1 shows a conceptual view of the architecture for context engineering.
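
As a rough illustration, the augmented agent prompt can be assembled along these lines. This is a minimal sketch; the function and field names are hypothetical rather than a specific Couchbase or LangChain API.

def build_agent_prompt(task, tools, short_term, long_term, retrieved_docs):
    """Assemble an augmented agent prompt from engineered context.

    tools: list of dicts with 'name' and 'spec' retrieved from a tool catalog
    short_term / long_term: summarized memory relevant to the current user
    retrieved_docs: passages returned by RAG, database or MCP lookups
    """
    tool_section = "\n".join(f"- {t['name']}: {t['spec']}" for t in tools)
    reference_section = "\n".join(retrieved_docs)
    return (
        "You are a helpful planning agent.\n\n"
        f"Available tools:\n{tool_section}\n\n"
        f"Session memory:\n{short_term}\n\n"
        f"User profile and history:\n{long_term}\n\n"
        f"Reference material:\n{reference_section}\n\n"
        f"Task: {task}"
    )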

Figure 1: Conceptual view of the architecture for context engineering (source: Couchbase).

This is how it works. First, the user sends a request to the multiagent system. The agent application then retrieves context via APIs that span prompts and tools from the catalog; context for RAG from the vector store; summarized conversations from the short- and long-term memories; and summaries, sentiments and extracted entities from the operational database, including sources external to the database, using various MCP servers.

The agent application then augments the prompt with the consolidated context to create the agent prompt, which is sent to a reasoning model such as Claude, DeepSeek or GPT-5. The reasoning loop is triggered within the agent framework, like LangGraph, which exchanges messages with the reasoning model, during which several relevant tools are called.

Depending on the agent architecture, other agents may be called and context shared between them. Afterward, the generated answer is sent to the user, and the user-agent conversation is stored in memory to ensure continuity in conversations in subsequent sessions.
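
A simplified sketch of this request flow is shown below. The catalog, vector store, memory and model objects are placeholders for whatever services back the architecture; only catalog.find_tools and catalog.find_prompt mirror the calls discussed later in this article.

def handle_request(user_id, question, catalog, vector_store, memory, llm):
    # 1. Gather context: relevant tools, prompt, RAG passages and memories.
    tools = catalog.find_tools(query=question)
    prompt_template = catalog.find_prompt(query=question)
    passages = vector_store.search(question, top_k=5)
    history = memory.recall(user_id, query=question)

    # 2. Augment the prompt and run the reasoning loop; the framework
    #    (e.g., LangGraph) exchanges messages and executes tool calls.
    agent_prompt = prompt_template.render(
        question=question, tools=tools, passages=passages, history=history
    )
    answer = llm.run(agent_prompt, tools=tools)

    # 3. Persist the exchange so subsequent sessions stay coherent.
    memory.store(user_id, question=question, answer=answer)
    return answer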

Here are a few challenges that developers face during context engineering and how the aforementioned architecture could help address them.

Extracting Context From Unstructured Data at Scale

Eighty percent of enterprise data is unstructured and is largely unusable as context. Therefore, to extract the context required to power important use cases, developers currently write extract, transform, load (ETL) jobs for Spark, Flink or other data processing engines. These jobs read unstructured data from a source database, process it and write back results for subsequent consumption by agents. These DIY solutions, albeit performant, not only slow down developer velocity but also create operational and maintenance overhead.

A few example use cases include summarizing the details of the “support_ticket_desc” field in a document so that the customer support AI agent can easily understand and take action; extracting medical terms (diseases, medications, symptoms) from the “patient_diagnosis” field so that a triaging agent can come up with an initial diagnosis for the patient; and labeling whether text in the “email_content” field is “irrelevant,” “promotional spam,” “potentially a scam” or “phishing attempt” so an email assistant can reason whether to automatically respond to an email.

AI functions allow developers to invoke LLMs from within SQL statements, with the ability to write prompts that control the format, tone and other aspects of the LLM output. Here’s an example: A developer augments product reviews stored in a database with a sentiment label and a summary using AI functions. A retail AI agent later reads them via a tool call and reasons whether to provide a compelling offer to a dissatisfied user to improve the Customer Satisfaction Score (CSAT) based on the severity of the issues they reported. The agent can also create a product feature request to drive product improvements.

Consider the following product review left by a customer who was disappointed with the performance and durability of a blender:

“I had high hopes for this blender based on the product description and reviews, but it’s been a let-down from day one. The motor struggles even with soft fruits, and it overheats after just a couple of minutes of use. I’ve had to stop mid-smoothie several times to let it cool down, which completely defeats the purpose of having a ‘high-speed’ blender.”

Here’s a no-code analysis using SQL:

SQL statement:

SELECT
  review_id,
  SUMMARIZE(review_text) AS summary,
  SENTIMENT(review_text, prompt = 'Evaluate the sentiment of the review_text field on a 5-point scale: very negative, negative, neutral, positive, very positive') AS sentiment
FROM customer_reviews
WHERE review_text IS NOT NULL;

Response:

"sentiment": "very negative"

"summary": "The blender needs a stronger motor to handle frozen fruits and ice without overheating, sharper blades for smoother blends and a better-sealed lid to prevent leaks. Durability should be improved to eliminate loud grinding noises and burning smells after short-term use."

This requires an underlying database that automates the above tasks in a no-code manner by invoking leading LLMs from within SQL statements.

Fitting Context Into a Limited Context Window

When it comes to context, less (but relevant) is more! A 1-million+ token limit does not mean you can treat the context like unlimited memory. Each additional token has cost, latency and performance implications. Instead of stuffing the prompt with long, unnecessary context, causing important details to get lost (especially in the middle of the prompt), consider using techniques like RAG to keep the context lean and highly relevant.

Listing all available tools that the LLM could use leads to prompt bloat and can confuse the agent when tools have similar names or specs. Further, tool proliferation, driven largely by a lack of tool reusability and governance, increases the likelihood of agent failure. Cataloging all tools in a centralized location, however, not only supports reusability but also makes it possible to retrieve only the tools relevant to the user’s question. Combined with well-written tool descriptions and tool routing, this boosts tool-call accuracy. For example, the below API could retrieve only the tools within the agent application that are relevant to answering the user’s query:

catalog.find_tools(query="Plan a trip to Maui")
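
Under the hood, a lookup like this could be implemented as vector similarity over tool descriptions. The sketch below assumes a generic embed() function and a catalog held in memory; it is illustrative, not a specific catalog implementation.

import numpy as np

def find_tools(query, tool_catalog, embed, top_k=3):
    """Return the top_k catalog entries most relevant to the query.

    tool_catalog: list of dicts with 'name', 'description' and a precomputed
    'embedding'; embed() maps text into the same vector space.
    """
    q = np.asarray(embed(query), dtype=float)

    def score(tool):
        v = np.asarray(tool["embedding"], dtype=float)
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))

    return sorted(tool_catalog, key=score, reverse=True)[:top_k]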


Agent behavior is highly sensitive to the quality of prompts; hence, changes to the prompts should be carefully managed. Cataloging all prompts with versioning and rollback support ensures consistent agent behavior, despite changes to the prompt. For example, the below API could retrieve only the prompt that is relevant to the query, keeping the context accurate:

catalog.find_prompt(query="Plan a trip to Maui")
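
Versioning and rollback can be modeled by keeping every revision of a prompt and pinning agents to a known-good version. A minimal sketch with hypothetical names:

class PromptCatalog:
    def __init__(self):
        self._versions = {}  # prompt name -> list of revisions

    def publish(self, name, text):
        self._versions.setdefault(name, []).append(text)
        return len(self._versions[name])  # new version number

    def get(self, name, version=None):
        revisions = self._versions[name]
        return revisions[-1] if version is None else revisions[version - 1]

    def rollback(self, name):
        # Drop the latest revision if a prompt change degrades agent behavior.
        return self._versions[name].pop()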


You can achieve this by using a performant multimodel database, which allows you to extract context from a large volume of structured and unstructured data using vector search via RAG, and store and select highly relevant tools and prompts.

Managing Decay and Resolving Conflict in Agent Memory

Agent memory is a critical building block of context engineering. However, implementing memory decay and conflict resolution is not a trivial undertaking for developers.

Conversational agents accumulate vast amounts of data from their interactions. If an agent remembers every single past message, the context window will quickly fill up, leading to a loss of coherence and the inability to process new information. Hence, it is necessary to decay outdated information.

The challenge here is that information decays at different rates. For example, the return policies of a retailer do not change as frequently as the demand for fast fashion clothing. Hence, there is a need to implement information-specific Time to Live (TTL) in the memory across various user conversations so that a clothing recommendation agent does not recall outdated information from memory.

Additionally, developers need the option to delete outdated context from memory when necessary. This requires that agent memory be implemented using a database that supports TTLs to decay memory at desired rates and also to delete memory in a consistent manner as needed.
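
For illustration, per-entry TTLs can be modeled as shown below. This is a generic in-memory sketch; a production implementation would rely on the database's native document expiry rather than this class.

import time

class DecayingMemory:
    def __init__(self):
        self._entries = {}  # key -> (value, expiry timestamp or None)

    def remember(self, key, value, ttl_seconds=None):
        expiry = time.time() + ttl_seconds if ttl_seconds else None
        self._entries[key] = (value, expiry)

    def recall(self, key):
        value, expiry = self._entries.get(key, (None, None))
        if expiry is not None and time.time() > expiry:
            del self._entries[key]  # decayed: drop the outdated context
            return None
        return value

# Fast-moving facts decay quickly; stable facts live much longer.
memory = DecayingMemory()
memory.remember("trend:fast_fashion", "oversized blazers", ttl_seconds=7 * 86400)
memory.remember("policy:returns", "30-day return window", ttl_seconds=365 * 86400)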

In a multiagent system, a single agent could hold conflicting information, or multiple agents within the same user session might try to commit conflicting information to memory. This conflict can be resolved by attaching a timestamp to each message and sharing it with the LLM as context about how the information evolved. Further, each message can be annotated with the agent name and other metadata so the LLM can decide which piece of information to keep in memory.
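
One way to surface that evolution to the LLM is to store each memory write with a timestamp and the writing agent, then render the entries in chronological order so the model can weigh the most recent, most relevant value. The structure below is illustrative, not a specific memory API.

from datetime import datetime, timezone

def annotate(agent_name, content):
    return {
        "agent": agent_name,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "content": content,
    }

def render_for_llm(entries):
    # Present conflicting writes in order so the model can see how the
    # information evolved and decide which entry to trust.
    ordered = sorted(entries, key=lambda e: e["timestamp"])
    return "\n".join(
        f"[{e['timestamp']}] ({e['agent']}) {e['content']}" for e in ordered
    )

entries = [
    annotate("booking_agent", "Hotel preference: ocean view"),
    annotate("budget_agent", "Hotel preference: cheapest available room"),
]
print(render_for_llm(entries))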

Sign up for a Preview

At Couchbase, the ability to engineer context and serve it quickly is paramount to helping developers create performant and reliable agents. With Couchbase Capella AI Services, currently in private preview, and Capella NoSQL DBaaS, you can use a single data platform that encompasses operational, analytics, vector, tool, prompt and memory stores to extract context using SQL++ and augment your prompt. AI functions, an AI Services capability, automate the extraction of context from large volumes of data by invoking leading LLMs from within familiar SQL statements. Agent memory implemented on Couchbase helps tackle complex issues like memory decay and conflict resolution.

Sign up for a private preview of Capella AI Services and try Capella NoSQL DBaaS for free, and start building your agentic application.



Biden says he's starting VP search this month


Joe Biden said he's spoken to Sen. Bernie Sanders and former President Barack Obama about selecting a running mate — and that he wants to build "a bench of younger, really qualified people" who can lead the nation over the course of the next four presidential cycles.

Driving the news: Biden spoke about the state of the 2020 race during a virtual fundraiser on Friday night that was opened to pooled coverage.


  • He said he'll announce a committee in mid-April to oversee the vice presidential selection process.
  • The former VP has a near-insurmountable lead over Sanders, but has not yet secured the number of delegates needed to claim the Democratic nomination.
  • "I don’t want him to think I’m being presumptuous, but you have to start now deciding who you’re going to have background checks done on as potential vice presidential candidates, and it takes time."

Between the lines: Biden also acknowledged the coronavirus has overshadowed coverage of the race and given President Trump an opportunity to dominate messaging via his task force briefings.

  • “I got a lot of people who are supporters getting very worried," Biden said. "‘Where’s Joe? Where’s Joe? The president’s every day holding these long press conferences.’”
  • “For a while there, I kept getting calls — people saying, ‘Joe, the president’s numbers are going way up and he’s every day on the news. What are you going to do about it?’”
  • “You can’t compete with a president. That’s the ultimate bully pulpit," Biden said, but added, "Those numbers aren’t going up anymore" because "the things he’s saying are turning out not to be accurate and people are getting very upset by it.”

Biden, 77, also said he's begun outreach to assess who he could bring into the administration if elected.

  • He said "one of the ways to deal with age" is "to build a bench of younger, really qualified people who haven’t had the exposure that others have had but are fully capable of being the leaders of the next four, eight, 12, 16 years to run the country."
  • "They’ve got to have an opportunity to rise up."



A good nudge trumps a good prediction


Editor’s note: this is part of our investigation into analytic models and best practices for their selection, deployment, and evaluation.

We all know that a working predictive model is a powerful business weapon. By translating data into insights and subsequent actions, businesses can offer better customer experience, retain more customers, and increase revenue. This is why companies are now allocating more resources to develop, or purchase, machine learning solutions.

While expectations on predictive analytics are sky high, the implementation of machine learning in businesses is not necessarily a smooth path. Interestingly, the problem often is not the quality of data or algorithms. I have worked with a number of companies that collected a lot of data; ensured the quality of the data; used research-proven algorithms implemented by well-educated data scientists; and yet, they failed to see beneficial outcomes. What went wrong? Doesn’t good data plus a good algorithm equal beneficial insights?

The short answer: evaluation. You cannot improve what you improperly measure. A misguided evaluation approach leads us to adopt ineffective machine learning solutions. I have observed a number of common mistakes in companies trying to evaluate their predictive models. In this series, I will present a variety of evaluation methods and solutions, with practical industry examples. Here, in this first piece, I’ll look at accuracy evaluation metrics and the confusion between a good prediction and a good nudge.

Good predictions? Good nudges.

Let’s take the retail industry as an example. Many retailers believe if they can accurately predict their customers’ future purchasing preferences, they can increase sales. After all, everyone has heard the stories of Target identifying a pregnant teen and Netflix’s success with its recommendation system.

This supposedly flawless assumption that accurate predictions can increase sales is easily overthrown, however, when we implement the accuracy evaluation metrics in reality. In machine learning, a metric measures how well a predictive model performs, usually based on some predefined scoring rules. Different models can then be compared. For instance, in recommender systems research, metrics such as recall/precision, Root Mean Square Error (RMSE), and Mean Average Precision (MAP) are frequently used to evaluate how “good” the models are. Roughly speaking, these metrics assume that a good model can accurately predict which products a customer will purchase or give the highest rating.

What’s the problem then? Let’s say I want to buy cereal and milk on Google Shopping Express. Once I open the app, let’s assume it accurately predicts that I will buy cereal and milk and shows them to me. I click and buy them. That’s great, right? But wait, the retailer originally expected the predictive model to increase sales. In this case, I was going to buy cereal and milk anyway, regardless of the accuracy of the prediction. Although my customer experience is probably improved, I do not necessarily buy more stuff. If the aim is to increase sales, the metric should, for example, focus on how well the model can predict and recommend products that will nudge me to buy much more than just cereal and milk.
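
To make the distinction concrete, here is a toy sketch contrasting an accuracy-style metric with a nudge-style metric that only credits purchases the customer would not have made anyway. The data and field names are invented for illustration; in practice the "would have bought anyway" signal usually comes from a holdout or control group.

def precision(recommended, purchased):
    # Accuracy view: how many recommended items were bought at all.
    return len(set(recommended) & set(purchased)) / len(recommended)

def incremental_rate(recommended, purchased, would_have_bought_anyway):
    # Nudge view: only credit purchases the customer would not have made
    # without the recommendation.
    nudged = (set(recommended) & set(purchased)) - set(would_have_bought_anyway)
    return len(nudged) / len(recommended)

recommended = ["cereal", "milk", "protein bars"]
purchased = ["cereal", "milk", "protein bars"]
planned = ["cereal", "milk"]  # purchases that would have happened anyway

print(precision(recommended, purchased))                  # 1.0  -> looks perfect
print(incremental_rate(recommended, purchased, planned))  # 0.33 -> actual nudge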

Oftentimes, the true objective is to nudge customers toward some choice or action.

Researchers and businesses have a vested interest in nudging. For instance, Lowe’s grocery store in El Paso successfully conducted an experiment that nudged customers to buy more produce. How? Simply by adding huge green arrows on the floor that pointed toward the produce aisle. Another successful example of nudging is “checkout charity”: retailers raise millions of dollars for charity by asking customers for small donations at the checkout screens.

Applying the predictive power of machine learning techniques to nudge, if done right, can be extremely valuable. Many bright statistical minds are, however, confused by the subtle difference between a good prediction and a good nudge. If we mix up the two in the evaluation process, we may end up choosing a model that does not help us achieve our goal. For example, Facebook’s emotional contagion experiment, despite its controversy, not only shows us how data influence users’ emotions, it also gives us a vivid example of when the goal of a metric should be to measure influence (or nudge) rather than predict.

The best metric is one that is consistent with your business goal. Oftentimes, the true commercial objective is to nudge customers toward some choice or action. Perhaps it is time for data practitioners to increase their awareness of metrics that reflect this kind of psychological impact of machine learning — sometimes the most effective result requires more than just good prediction.


Taylor Swift is right about music, and the industry should act on her ideas


Country-pop star Taylor Swift penned an optimistic essay in Tuesday’s Wall Street Journal about the lasting bonds between performers and their fans, and why she thinks the music industry is “just coming alive.” You can think what you want about Swift’s songs, but her take on the business is a welcome change from the doom-and-gloom we normally read.

In her essay, Swift is upfront about what everyone knows: CD sales fell off a cliff and, while streaming and digital sales have grown dramatically, they have not plugged a shortfall that has seen overall music revenue fall from $15 billion in 2003 to $7 billion today.

Often, such numbers are a cue for a musician to launch into a long harangue about piracy and the need for Congress to expand copyright. Instead, Swift does something different. She offers some new insights into the evolving relationship between musicians and fans. Here’s what she says about autographs, for instance:

There are a few things I have witnessed becoming obsolete in the past few years, the first being autographs. I haven’t been asked for an autograph since the invention of the iPhone with a front-facing camera. The only memento ‘kids these days’ want is a selfie. It’s part of the new currency, which seems to be ‘how many followers you have on Instagram.’

And here is how Swift sees social media changing traditional music deals:

A friend of mine, who is an actress, told me that when the casting for her recent movie came down to two actresses, the casting director chose the actress with more Twitter followers. I see this becoming a trend in the music industry … In the future, artists will get record deals because they have fans—not the other way around.

Swift’s essay also makes a heartfelt plea for the album as art, and expresses a belief that fans will always pay for those special albums that change their lives: “I think the future still holds the possibility for this kind of bond, the one my father has with the Beach Boys and the one my mother has with Carly Simon.”


A way forward

It’s easy to be snarky to Swift. Indeed, others have already pointed out that her faith in revived album sales is misguided, and that the economics of digital distribution mean that only a lucky few, like Swift or Justin Bieber, have the celebrity clout to sell records in this day and age.

That might be true, but it doesn’t mean that Swift’s other observations aren’t helpful — if only the music industry would act on them. Alas, the industry is instead expending its legal and lobbying power on trying to wring more money from 50-year-old music. Just look at the current efforts to squeeze the likes of Pandora dry with ever-higher royalty rates and ill-considered class action suits.

Imagine if the industry directed more of its energy to finding new revenue sources amid all those selfies and Twitter followers surrounding Swift. As my colleague Mathew Ingram has explained in the context of news, new business ideas based on “membership” may offer more promise than attempting to replace past product sales.

Yes, many of the details are still fuzzy. But new user-based companies like Twitter and Instagram are still developing their business models, and in coming years they will no doubt offer Swift and others a range of money-making ideas that we have yet to imagine. Meanwhile, YouTube, despite contract scuffles, is already offering millions in ad revenue to famous acts and upcoming ones alike.

In the future, there will also continue to be a growing range of web and app platforms – everything from games to disappearing messaging apps — that offer both music licensing opportunities, and new ways for fans and performers to connect. The money may take time to emerge but, for cynics and the music industry, this is the way forward. Or in Swift’s words, “This is a love story, just say yes.”

This story was updated at 12:30pm ET to reflect that the music industry figures are in billions not millions.

