In this tutorial, we explore how to build a Context-Folding LLM Agent that efficiently solves long, complex tasks by intelligently managing limited context. We design the agent to break a large task into smaller subtasks, perform reasoning or calculations when needed, and then fold each completed sub-trajectory into concise summaries. By doing this, we preserve essential knowledge while keeping the active memory small.
import os, re, sys, math, random, json, textwrap, subprocess, shutil, time
from typing import List, Dict, Tuple

try:
    import transformers
except ImportError:
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", "transformers", "accelerate", "sentencepiece"], check=True)

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

MODEL_NAME = os.environ.get("CF_MODEL", "google/flan-t5-small")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
llm = pipeline("text2text-generation", model=model, tokenizer=tokenizer, device_map="auto")

def llm_gen(prompt: str, max_new_tokens=160, temperature=0.0) -> str:
    out = llm(prompt, max_new_tokens=max_new_tokens, do_sample=temperature > 0.0, temperature=temperature)[0]["generated_text"]
    return out.strip()
We begin by setting up the environment and loading a lightweight Hugging Face model. We use this model to generate and process text locally, ensuring the agent runs smoothly on Google Colab without any API dependencies.
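Before moving on, we can sanity-check the pipeline with a one-off call (a minimal sketch; the prompt string below is our own illustration, not part of the original code):

# Quick smoke test of the local model (illustrative prompt).
print(llm_gen("Summarize in one line: context folding compresses finished subtasks into short notes.", max_new_tokens=48))
# flan-t5-small is tiny, so expect terse output; this only verifies the setup works.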
import ast, operator as op

# Whitelisted AST operators for the safe calculator.
OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv, ast.Pow: op.pow, ast.USub: op.neg, ast.FloorDiv: op.floordiv, ast.Mod: op.mod}

def _eval_node(n):
    if isinstance(n, ast.Num): return n.n
    if isinstance(n, ast.UnaryOp) and type(n.op) in OPS: return OPS[type(n.op)](_eval_node(n.operand))
    if isinstance(n, ast.BinOp) and type(n.op) in OPS: return OPS[type(n.op)](_eval_node(n.left), _eval_node(n.right))
    raise ValueError("Unsafe expression")

def calc(expr: str):
    node = ast.parse(expr, mode="eval").body
    return _eval_node(node)

class FoldingMemory:
    def __init__(self, max_chars: int = 800):
        self.active = []; self.folds = []; self.max_chars = max_chars
    def add(self, text: str):
        self.active.append(text.strip())
        # When the active context exceeds the budget, fold out the oldest entry.
        while len(self.active_text()) > self.max_chars and len(self.active) > 1:
            popped = self.active.pop(0)
            fold = f"- Folded: {popped[:120]}..."
            self.folds.append(fold)
    def fold_in(self, summary: str): self.folds.append(summary.strip())
    def active_text(self) -> str: return "\n".join(self.active)
    def folded_text(self) -> str: return "\n".join(self.folds)
    def snapshot(self) -> Dict: return {"active_chars": len(self.active_text()), "n_folds": len(self.folds)}
We define a simple calculator tool for basic arithmetic and create a memory system that dynamically folds old context into concise summaries. This helps us keep the active memory manageable while retaining essential information.
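As a quick check of both utilities (a minimal sketch; the sample inputs are our own):

# Safe arithmetic via the AST whitelist.
print(calc("799.99 + 149.5 + 23.75"))   # 973.24

# Folding memory: once the active text exceeds max_chars, the oldest entry is folded.
mem = FoldingMemory(max_chars=60)
mem.add("NOTE: first observation that will eventually be folded out")
mem.add("NOTE: a second observation that pushes us past the limit")
print(mem.snapshot())      # small active context, one fold recorded
print(mem.folded_text())   # '- Folded: NOTE: first observation ...'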
SUBTASK_DECOMP_PROMPT = """You are an expert planner. Decompose the task below into 2-4 crisp subtasks.
Return each subtask as a bullet starting with '- ' in priority order.
Task: "{task}" """

SUBTASK_SOLVER_PROMPT = """You are a precise problem solver with minimal steps.
If a calculation is needed, write one line 'CALC(expr)'.
Otherwise write 'ANSWER: <final answer>'.
Think briefly; avoid chit-chat.
Task: {task}
Subtask: {subtask}
Notes (folded context):
{notes}
Now respond with either CALC(...) or ANSWER: ..."""
SUBTASK_SUMMARY_PROMPT = """Summarize the subtask result in at most 3 short bullets starting with '- '.
Subtask: {title}
Trace:
{trace}
Final: {final}"""

FINAL_SYNTH_PROMPT = """Using only the folded summaries below, write the final answer to the task.
Task: {task}
Folded summaries:
{folds}
Final answer:"""

def parse_bullets(text: str) -> List[str]:
    # Extract lines that start with '- ' and strip the bullet marker.
    return [ln[2:].strip() for ln in text.splitlines() if ln.strip().startswith("- ")]
We design prompt templates that guide the agent in decomposing tasks, solving subtasks, and summarizing results. These structured prompts enable clear communication between reasoning steps and the model's responses.
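To see what parse_bullets does with a typical planner response (a small sketch; the sample text is our own):

# Parsing a bullet-formatted plan into a list of subtask strings.
plan = "- Outline the 3-day schedule\n- Assign daily workouts\n- Draft simple meals"
print(parse_bullets(plan))
# ['Outline the 3-day schedule', 'Assign daily workouts', 'Draft simple meals']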
def run_subtask(task: str, subtask: str, memory: FoldingMemory, max_tool_iters: int = 3) -> Tuple[str, str, List[str]]:
    notes = (memory.folded_text() or "(none)")
    trace = []; final = ""
    for _ in range(max_tool_iters):
        prompt = SUBTASK_SOLVER_PROMPT.format(task=task, subtask=subtask, notes=notes)
        out = llm_gen(prompt, max_new_tokens=96); trace.append(out)
        m = re.search(r"CALC\((.+?)\)", out)
        if m:
            try:
                val = calc(m.group(1))
                trace.append(f"TOOL:CALC -> {val}")
                out2 = llm_gen(prompt + f"\nTool result: {val}\nNow produce 'ANSWER: ...' only.", max_new_tokens=64)
                trace.append(out2)
                if out2.strip().startswith("ANSWER:"):
                    final = out2.split("ANSWER:", 1)[1].strip(); break
            except Exception as e:
                trace.append(f"TOOL:CALC ERROR -> {e}")
        if out.strip().startswith("ANSWER:"):
            final = out.split("ANSWER:", 1)[1].strip(); break
    if not final:
        final = "No definitive answer; partial reasoning:\n" + "\n".join(trace[-2:])
    summ = llm_gen(SUBTASK_SUMMARY_PROMPT.format(title=subtask, trace="\n".join(trace), final=final), max_new_tokens=80)
    summary_bullets = "\n".join(parse_bullets(summ)[:3]) or f"- {subtask}: {final[:60]}..."
    return final, summary_bullets, trace
class ContextFoldingAgent:
    def __init__(self, max_active_chars: int = 800):
        self.memory = FoldingMemory(max_chars=max_active_chars)
        self.metrics = {"subtasks": 0, "tool_calls": 0, "chars_saved_est": 0}
    def decompose(self, task: str) -> List[str]:
        plan = llm_gen(SUBTASK_DECOMP_PROMPT.format(task=task), max_new_tokens=96)
        subs = parse_bullets(plan)
        return subs[:4] if subs else ["Main solution"]
    def run(self, task: str) -> Dict:
        t0 = time.time()
        self.memory.add(f"TASK: {task}")
        subtasks = self.decompose(task)
        self.metrics["subtasks"] = len(subtasks)
        folded = []
        for st in subtasks:
            self.memory.add(f"SUBTASK: {st}")
            final, fold_summary, trace = run_subtask(task, st, self.memory)
            # Fold the completed sub-trajectory into a compact summary.
            self.memory.fold_in(fold_summary)
            folded.append(f"- {st}: {final}")
            self.memory.add(f"SUBTASK_DONE: {st}")
        final = llm_gen(FINAL_SYNTH_PROMPT.format(task=task, folds=self.memory.folded_text()), max_new_tokens=200)
        t1 = time.time()
        return {"task": task, "final": final.strip(), "folded_summaries": self.memory.folded_text(),
                "active_context_chars": len(self.memory.active_text()),
                "subtask_finals": folded, "runtime_sec": round(t1 - t0, 2)}
We implement the agent's core logic, in which each subtask is executed, summarized, and folded back into memory. This step demonstrates how context folding lets the agent reason iteratively without losing track of prior reasoning.
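For a single-task run outside the demo loop below (a minimal sketch; the task string is our own example):

# One end-to-end run; the folded summaries hold the compressed history.
agent = ContextFoldingAgent(max_active_chars=500)
result = agent.run("Estimate the cost of 3 notebooks at 4.5 each plus 2.25 shipping.")
print(result["final"])
print(agent.memory.snapshot())  # active memory stays bounded regardless of task length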
DEMO_TASKS = [
    "Plan a 3-day study schedule for ML with daily workouts and simple meals; include time blocks.",
    "Compute a small project budget with 3 items (laptop 799.99, course 149.5, snacks 23.75), add 8% tax and 5% buffer, and present a one-paragraph recommendation."
]

def pretty(d): return json.dumps(d, indent=2, ensure_ascii=False)

if __name__ == "__main__":
    agent = ContextFoldingAgent(max_active_chars=700)
    for i, task in enumerate(DEMO_TASKS, 1):
        print("=" * 70)
        print(f"DEMO #{i}: {task}")
        res = agent.run(task)
        print("\n--- Folded Summaries ---\n" + (res["folded_summaries"] or "(none)"))
        print("\n--- Final Answer ---\n" + res["final"])
        print("\n--- Diagnostics ---")
        diag = {k: res[k] for k in ["active_context_chars", "runtime_sec"]}
        diag["n_subtasks"] = len(agent.decompose(task))
        print(pretty(diag))
We run the agent on sample tasks to observe how it plans, executes, and synthesizes final results. Through these examples, we see the complete context-folding process in action, producing concise and coherent outputs.
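If we want to keep run artifacts for later inspection (an optional addition of our own, not part of the original tutorial), we can persist each result dictionary as JSON:

# Optional: save each demo result to disk (our own addition).
os.makedirs("runs", exist_ok=True)
for i, task in enumerate(DEMO_TASKS, 1):
    res = ContextFoldingAgent(max_active_chars=700).run(task)
    with open(f"runs/demo_{i}.json", "w", encoding="utf-8") as f:
        json.dump(res, f, indent=2, ensure_ascii=False)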
In conclusion, we demonstrate how context folding enables long-horizon reasoning while avoiding memory overload. We see how each subtask is planned, executed, summarized, and distilled into compact knowledge, mimicking how an intelligent agent would handle complex workflows over time. By combining decomposition, tool use, and context compression, we create a lightweight yet powerful agentic system that scales reasoning efficiently.