Documentation
This guide explains what ThinkingMemory is, why each piece exists, and how it works under the hood. For endpoints, parameters, and copy-paste calls, see the API guide.
Why ThinkingMemory
The problem. LLM agents are stateless. Between sessions they forget everything, and within a session they only know what is in the prompt. Teams patch this two ways, and both are bad:
- Stuff everything into the prompt. It blows the token budget, costs more, and buries the useful facts in noise, which hurts answer quality.
- Bolt on a vector store. You get "nearest neighbours" to a query, but that is not the same as the right context for a task. You still write retrieval glue, re-ranking, dedup, and budget-trimming yourself, and the store never improves.
The idea. Treat memory as a database with a single, purpose-built query: recall. You tell it the agent's intent and a token budget; it returns the useful, deduped, cited context, and it gets better the more the agent uses it. Storage, ranking, packing, decay, and cleanup are the database's job, not yours.
The recall primitive
What it is. One call: intent in, context out. Instead of "find vectors similar to X," recall answers "give me what I need to act on X right now, in N tokens."
Why it matters. It moves the hard parts (which signals to combine, how to rank, what to keep, how to fit the budget, how to cite) out of your app and into the engine, so every agent benefits from the same well-tuned retrieval. The result is a string you can paste straight into a prompt, with [n] citations back to source memories.
How it is different from RAG. Classic RAG is "embed, nearest-neighbour, stuff." Recall fuses several retrievers, reranks with a cross-encoder, packs to a budget, and feeds a lifecycle that reshapes memory over time. RAG retrieves; recall curates.
One unified substrate
What it is. Everything you store is a memory: a single row holding text, an optional structured content object, a server-generated embedding, and metadata (type, scope, salience, confidence, timestamps, provenance).
Why one table. Older designs split memory into separate "layers" with separate APIs. That forces you to decide up front where something belongs and makes cross-layer search awkward. ThinkingMemory keeps one substrate so recall can search across everything or narrow by tag, and so the lifecycle can move a memory between roles (for example, promote an event into a durable fact) without copying it between stores.
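As a rough illustration of the single-row model, the sketch below represents a memory as a Python dataclass. The field names follow the attributes named in this guide; the `Memory` class, the types, the default values, and the `promote_to_fact` helper are assumptions made for illustration, not the engine's actual schema or API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass
class Memory:
    """Hypothetical in-process model of one memory row (not the real schema)."""
    text: str
    mtype: str = "episodic"                    # episodic | semantic | procedural | working
    content: Optional[dict[str, Any]] = None   # optional structured content object
    embedding: Optional[list[float]] = None    # generated server-side in the real service
    scope: str = "private"                     # private | shared | global
    salience: float = 0.5                      # assumed default
    confidence: float = 1.0                    # 0..1
    decay_rate: float = 0.01                   # assumed default
    valid_from: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    valid_to: Optional[datetime] = None        # None means the window is still open
    source: Optional[str] = None               # provenance: where it came from
    derived_from: list[str] = field(default_factory=list)  # provenance: parent ids


def promote_to_fact(event: Memory) -> Memory:
    # Same row, new role: the tag changes, nothing is copied to another store.
    event.mtype = "semantic"
    return event
```

Because role is only a tag on the row, a change of role is a field update rather than a move between systems.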
Memory types
The mtype tag records what a memory is, so you can store and recall by role:
| Type | What it holds | Why / when |
|---|---|---|
| episodic | Events and observations. | "What happened": interactions, actions, results. The raw stream the agent later learns from. |
| semantic | Facts and knowledge. | Stable truths about a user, account, or domain. The durable knowledge you most want recalled. |
| procedural | How-to and skills. | Steps and procedures that worked, so the agent repeats success instead of rediscovering it. |
| working | Scratch state. | Short-lived context for the current task. Often kept in the faster working-memory store instead. |
Attributes & the feedback loop
Each memory carries metadata that drives ranking and lifecycle:
- salience: an importance weight. How it works: recall boosts the salience of every memory it surfaces, so frequently useful memories rise over time and rarely useful ones sink. This is the core feedback loop that makes recall sharpen with use.
- decay_rate: how fast relevance fades. How it works: the lifecycle lowers a memory's effective relevance as it ages at this rate; being recalled offsets decay. Why: stale context should naturally lose to fresh, used context.
- confidence: how sure you are (0–1). Why: lets the agent and the contradiction logic prefer trusted facts.
- scope: private, shared, or global within your tenant. How it works: recall can be limited to certain scopes, so one agent's private notes stay separate from team-wide knowledge.
- provenance: where it came from (source, derived_from). Why: this is what makes the trace possible and keeps derived facts linked to their evidence.
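The interplay between salience and decay_rate can be sketched in a few lines. This guide does not give the engine's formulas, so the exponential decay, the additive boost, and the cap at 1.0 below are all assumptions chosen only to show the shape of the feedback loop.

```python
import math


def effective_relevance(salience: float, decay_rate: float, age_days: float) -> float:
    """Salience discounted by age. The exponential form is an assumption."""
    return salience * math.exp(-decay_rate * age_days)


def boost_on_recall(salience: float, boost: float = 0.05) -> float:
    """Hypothetical feedback step: each recall nudges salience up, capped at 1.0."""
    return min(1.0, salience + boost)


# A memory that keeps being recalled holds its ground against ageing.
s = 0.5
for _ in range(3):
    s = boost_on_recall(s)
print(effective_relevance(s, decay_rate=0.02, age_days=10))
```

Under these assumptions, a memory that is never recalled only ever loses effective relevance, while a recalled one gains salience that offsets its age.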
How recall works
A single recall runs a five-stage pipeline. Each stage exists to fix a weakness of the others:
1. Candidate generation (hybrid retrieval)
Three retrievers run in parallel: vector similarity (cosine over embeddings, catches meaning and paraphrase), keyword full-text search (catches exact terms, names, and IDs that embeddings blur), and recency (favours fresh memories). Why hybrid: vectors miss exact tokens; keyword misses synonyms; recency breaks ties. Together they cover each other's blind spots.
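A toy version of candidate generation, assuming memories are plain dicts with hypothetical keys `id`, `text`, `vec`, and `created_at`. Keyword matching here is naive whitespace-token overlap standing in for real full-text search, and the three lists are built sequentially rather than in parallel.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def candidate_lists(query_text, query_vec, memories, top_k=10):
    """Return three ranked lists of memory ids: by vector, by keyword, by recency."""
    terms = set(query_text.lower().split())

    def overlap(m):
        return len(terms & set(m["text"].lower().split()))

    by_vector = sorted(memories, key=lambda m: cosine(query_vec, m["vec"]), reverse=True)
    # Only memories sharing at least one exact term appear in the keyword list.
    by_keyword = sorted((m for m in memories if overlap(m) > 0), key=overlap, reverse=True)
    by_recency = sorted(memories, key=lambda m: m["created_at"], reverse=True)
    return [[m["id"] for m in lst[:top_k]] for lst in (by_vector, by_keyword, by_recency)]
```

Each retriever returns its own ordering; none of the raw scores leave the function, which is what makes rank-based fusion a natural next step.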
2. Graph expansion (optional)
If you pass graph_hops > 0, the top hits are expanded along their graph links so closely-related memories come along even if they did not match the query text directly.
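Hop-limited expansion can be pictured as a breadth-first walk from the top hits. In this sketch the `edges` adjacency dict and the `expand` function are hypothetical, and edge types and weights are ignored for brevity.

```python
from collections import deque


def expand(seeds, edges, graph_hops):
    """Breadth-first expansion of seed ids along links, up to graph_hops hops.

    `edges` maps a memory id to a list of neighbour ids. Returns the seeds
    first, then newly reached ids in discovery order.
    """
    result = list(dict.fromkeys(seeds))  # de-duplicate while keeping order
    visited = set(result)
    queue = deque((seed, 0) for seed in result)
    while queue:
        node, depth = queue.popleft()
        if depth >= graph_hops:
            continue  # hop limit reached on this branch
        for neighbour in edges.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                result.append(neighbour)
                queue.append((neighbour, depth + 1))
    return result
```

With `graph_hops=0` the function returns the seeds unchanged, which matches the stage being optional.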
3. Fusion (Reciprocal Rank Fusion)
The ranked lists are merged with RRF, which scores each memory by its rank in each list (roughly the sum of 1/(k+rank)) rather than by raw, incomparable scores. Why: it combines retrievers fairly without tuning weights, and it is robust to one retriever returning wild scores. Salience then nudges important memories up.
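A minimal sketch of Reciprocal Rank Fusion as described above. The constant `k=60` is a commonly used default and is an assumption here, since this guide does not state the value the engine uses; the salience nudge is left out.

```python
from collections import defaultdict


def rrf(ranked_lists, k=60):
    """Fuse ranked id lists: score(id) = sum over lists of 1 / (k + rank), rank from 1."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, mem_id in enumerate(ranking, start=1):
            scores[mem_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


fused = rrf([["a", "b", "c"], ["b", "a"], ["b", "c"]])
print(fused)
```

A memory that appears near the top of several lists beats one that tops a single list, and no retriever's raw score scale can dominate the result.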
4. Rerank (cross-encoder)
The top candidates are re-scored by a cross-encoder that reads the intent and the memory together (on by default on the cloud). Why: the first-stage retrievers encode query and document separately for speed; a cross-encoder is slower but much more precise, which makes it well suited as a final filter on a small shortlist.
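The shape of the rerank stage, sketched with a pluggable scorer. `score_pair` stands in for a cross-encoder; the `toy_overlap_score` function below is only a placeholder based on token overlap so the example runs without a model, and it is not how a real cross-encoder scores pairs.

```python
def rerank(intent, candidates, score_pair, top_n=5):
    """Score each (intent, text) pair jointly and keep the best top_n ids.

    `candidates` is a list of (memory_id, text). `score_pair` is any callable
    taking (intent, text) and returning a float, higher meaning more relevant.
    """
    scored = [(score_pair(intent, text), mem_id) for mem_id, text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [mem_id for _, mem_id in scored[:top_n]]


def toy_overlap_score(intent, text):
    # Placeholder scorer (Jaccard token overlap), NOT a real cross-encoder.
    a, b = set(intent.lower().split()), set(text.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0
```

The point of the structure is that the expensive scorer only ever sees the fused shortlist, never the whole store.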
5. Token-budget packing
The highest-ranked memories are packed in order until token_budget is reached, de-duplicated, and numbered with [n] citations. Why: you get a context that is guaranteed to fit your prompt and is traceable to its sources. The response also reports tokens_saved_vs_dump so you can see the ROI versus dumping everything, and each item lists why (which signals surfaced it). Pass as_of to run the whole pipeline against what the agent believed at a past moment.
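Packing can be sketched as a greedy walk over the ranked list. Several details here are assumptions: tokens are counted by whitespace splitting rather than a real tokenizer, duplicates are detected by normalised text, packing stops at the first memory that does not fit, and `tokens_saved_vs_dump` is computed as the total tokens of all candidates minus the tokens actually packed.

```python
def pack(ranked, token_budget, count_tokens=lambda s: len(s.split())):
    """Pack (memory_id, text) pairs in rank order until the budget is reached."""
    lines, cited, seen, used = [], [], set(), 0
    for mem_id, text in ranked:
        key = " ".join(text.lower().split())  # normalise case and whitespace
        if key in seen:
            continue  # drop exact duplicates after normalisation
        cost = count_tokens(text)
        if used + cost > token_budget:
            break  # the next memory would overflow the budget
        seen.add(key)
        used += cost
        cited.append(mem_id)
        lines.append(f"[{len(cited)}] {text}")
    total = sum(count_tokens(text) for _, text in ranked)
    return {
        "context": "\n".join(lines),
        "citations": cited,            # citations[n-1] is the source of [n]
        "tokens_saved_vs_dump": total - used,
    }
```

The returned `context` string carries the [n] markers inline, so a citation in the model's answer can be mapped back to a memory id through `citations`.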
The lifecycle engine
Why it exists. Memory that only ever grows becomes noise: duplicates pile up, facts go stale, contradictions accumulate, and recall quality drops. The lifecycle keeps memory healthy automatically. It runs on a daily scheduler, and you can trigger it on demand per agent.
- Decay: lowers the effective relevance of memories as they age (by each one's decay_rate); recall offsets it. So "old and unused" loses to "fresh or frequently recalled."
- Extraction: distills durable semantic facts out of recent episodic events, linking the new fact to the events it came from. So one-off observations become reusable knowledge.
- Consolidation: merges near-duplicate memories into one stronger memory. So repetition increases confidence instead of cluttering recall.
- Supersession: when newer information replaces older, the old memory's validity window is closed (kept for history, hidden from "now"). So the agent acts on the current truth.
- Contradiction resolution: detects pairs that are highly similar in topic but opposite in meaning, and keeps the more confident/recent one. So conflicting facts do not both surface.
- Forgetting: prunes low-salience, expired memories. So storage and recall stay focused on what matters.
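To make one lifecycle step concrete, here is a small consolidation sketch. It is an illustration under stated assumptions: near-duplicates are detected with `difflib.SequenceMatcher` on the text (the engine may well compare embeddings instead), the 0.9 threshold and the 0.1 confidence bump are invented values, and `merged_ids` is a hypothetical field.

```python
from difflib import SequenceMatcher


def consolidate(memories, threshold=0.9):
    """Merge near-duplicate texts into the first matching memory.

    `memories` is a list of dicts with keys id, text, confidence.
    Input dicts are not modified; survivors are returned as copies.
    """
    kept = []
    for mem in memories:
        for survivor in kept:
            ratio = SequenceMatcher(
                None, survivor["text"].lower(), mem["text"].lower()
            ).ratio()
            if ratio >= threshold:
                # Repetition strengthens the survivor instead of adding a row.
                survivor["confidence"] = min(1.0, survivor["confidence"] + 0.1)
                survivor["merged_ids"].append(mem["id"])
                break
        else:
            kept.append({**mem, "merged_ids": []})
    return kept
```

Keeping the merged ids on the survivor preserves the link back to the memories it absorbed, in the same spirit as provenance.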
Bitemporal memory
What it is. Every memory has a validity window (valid_from / valid_to). Nothing is hard-deleted by default; superseded or forgotten memories simply have their window closed.
Why it matters. You can ask "what did the agent believe last Tuesday?" and get the answer as it stood then, not as it is now. That is essential for debugging agent decisions, auditing, and reproducing past behaviour. How: pass as_of to recall, or read a full snapshot from the timeline endpoint.
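The effect of as_of can be pictured as a filter on validity windows. In this sketch memories are dicts with `valid_from` and `valid_to`; treating the window as half-open (`valid_from` inclusive, `valid_to` exclusive) is an assumption, and `visible_at` is a hypothetical helper rather than part of the API.

```python
from datetime import datetime


def visible_at(memories, as_of):
    """Return the memories whose validity window contains `as_of`.

    valid_to=None means the window is still open.
    """
    return [
        m for m in memories
        if m["valid_from"] <= as_of and (m["valid_to"] is None or as_of < m["valid_to"])
    ]


old = {"id": "old", "valid_from": datetime(2024, 1, 1), "valid_to": datetime(2024, 3, 1)}
new = {"id": "new", "valid_from": datetime(2024, 3, 1), "valid_to": None}
print([m["id"] for m in visible_at([old, new], datetime(2024, 2, 1))])
```

The superseded memory is still stored; it simply falls outside the window for any moment after it was closed.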
Provenance & audit
Provenance / trace. Each memory records where it came from and what it was derived from. The trace walks that chain so you can answer "why do I know this?", for example by following a semantic fact back to the episodic events that extraction built it from. Why: explainability and trust.
Audit log. An append-only record of every operation (remember, recall, forget, maintenance). Why: compliance, debugging, and understanding what the agent did and when.
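Walking a provenance chain amounts to following derived_from links until they run out. The `store` shape and the `trace` function below are hypothetical and only illustrate the traversal, not the trace endpoint's actual response.

```python
def trace(memory_id, store):
    """Walk derived_from links back to the original evidence.

    `store` maps id -> {"text": str, "derived_from": [ids]}.
    Returns (depth, id) pairs in depth-first order; unknown ids and
    already-visited ids are skipped, so cycles cannot loop forever.
    """
    out, seen = [], set()

    def walk(mid, depth):
        if mid in seen or mid not in store:
            return
        seen.add(mid)
        out.append((depth, mid))
        for parent in store[mid].get("derived_from", []):
            walk(parent, depth + 1)

    walk(memory_id, 0)
    return out


store = {
    "fact": {"text": "User prefers email", "derived_from": ["e1", "e2"]},
    "e1": {"text": "User asked for an email follow-up", "derived_from": []},
    "e2": {"text": "User ignored two phone calls", "derived_from": []},
}
for depth, mid in trace("fact", store):
    print("  " * depth + store[mid]["text"])
```

The depth value makes it easy to print the chain as an indented tree, with the fact at the top and its evidence beneath it.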
The entity graph
What it is. Memories can be linked with typed, weighted edges into a graph. Why: related knowledge is not always textually similar; a person, an account, and a contract may each be phrased differently but belong together. How: create edges with link, and set graph_hops on recall to pull a hit's neighbours into the candidate set, so recall can follow relationships, not just text similarity.
Working vs durable memory
The main memory database is durable, ranked, and curated. Working memory is a separate, fast, ephemeral key/value store (TTL'd) for the current task: the scratchpad an agent uses mid-task. Why two: you do not want transient state competing with durable knowledge in recall, and you do not want to pay ranking/lifecycle cost for values that live for seconds.
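For intuition, a TTL'd scratchpad can be as small as the class below. This is an in-process stand-in written for illustration; the `WorkingMemory` class and its method names are assumptions and do not describe the real working-memory store or its API.

```python
import time


class WorkingMemory:
    """Tiny TTL key/value scratchpad (illustrative only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so expiry can be tested without sleeping
        self._items = {}      # key -> (value, expires_at)

    def set(self, key, value, ttl_seconds):
        self._items[key] = (value, self._clock() + ttl_seconds)

    def get(self, key, default=None):
        entry = self._items.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._items[key]  # lazily evict expired entries on read
            return default
        return value
```

Nothing here is ranked, embedded, or decayed, which is the reason to keep short-lived values out of the durable store.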
Isolation & multi-tenancy
Why it matters. Your memories must never leak across accounts. ThinkingMemory enforces isolation in depth: the application always scopes queries by tenant; PostgreSQL row-level security enforces it again at the database so a query without the right tenant context returns nothing; and the memory table is hash-partitioned by tenant so each account's rows are physically grouped. API keys are stored only as hashes.
Embeddings
How it works. You send text; the service generates embeddings server-side with a local model (BAAI/bge-small-en-v1.5, 384 dimensions). The same model is used to embed what you store and to embed recall queries. Why: no per-token embedding bills, no extra service to run, and storage and query always use a consistent vector space.
Using it in your agent
The typical loop: recall before the model acts, remember after. Everything else (ranking, decay, cleanup) is automatic.
# 1) before the LLM call: pull the right context
ctx = recall(agent_id, intent=user_message, token_budget=1000)
prompt = SYSTEM + ctx.context + user_message # ctx.context has [n] citations
# 2) call your model
answer = llm(prompt)
# 3) after: write back what happened
remember(agent_id,
         content={"text": f"User asked: {user_message}; answered: {answer}"},
         mtype="episodic")
# the lifecycle later distills durable facts, decays, dedupes, and forgets on its own
Open core vs cloud
The engine (the memory model, recall, lifecycle, graph, bitemporal) is open source under Apache-2.0, so you can self-host and inspect it. ThinkingMemory Cloud is the managed, multi-tenant service on top: self-serve accounts, API keys, usage and billing, row-level isolation, the background scheduler, and this console. Same engine, run for you.
Ready to build? Head to the API guide or the in-console Quickstart. Questions? Contact us.