Essay

The Body Problem

Every agentic workflow eventually hits a step that requires a body — and most software still has no honest answer for what happens next.

May 25, 2026

Central claim

Agents cannot accept custody, authorize handoffs, access mobility infrastructure, grant physical access, complete commerce, or carry persistent context across real-world workflows — not because the models are weak, but because the stack still assumes a human proxy somewhere in the loop.

OpenMatter alignment

This essay frames the infrastructure gap OpenMatter is working on — governed execution across real-world handoffs — without treating our protocols as the only valid response.

Abstract

Background

Agents are marketed as autonomous, but operational workflows still break when software must become presence, custody, mobility, or permission in the physical world.

Objective

Define the body problem as a company-wide research frame and document where agents cannot complete common real-world actions without a human proxy.

Methods

We evaluate six exemplar workflows — accepting custody, authorizing outbound handoff, accessing mobility infrastructure, granting physical access, completing end-to-end agentic commerce, and persisting context across services — and classify where software stops and proxy dependency begins.

Results

Agents routinely cannot complete physical handoffs even when they return successful tool responses. The failure is structural: missing addressability, authority, execution surfaces, durable state, and a unified harness for inherited, multimodal, and persistent context across providers.

Conclusion

The body problem names a structural gap in agent infrastructure. Governed execution — inspectable, attributable, revocable physical action — is the design target. Whatever solutions emerge, they must close the gap between software intent and operational completion.

Keywords

body problem · agentic workflows · agentic commerce · physical handoffs · human proxy · unified context · tool calling · agent orchestration · execution harness · execution infrastructure · governed execution

Introduction

Software agents are getting better at deciding. They are not getting better at being present.

The modern world was designed around human presence. Addresses, signatures, payments, access control, identity verification, and logistics systems all assume a human body somewhere in the loop. Agents do not fail at these workflows because they lack intelligence. They fail because the infrastructure was never designed for software actors to inherit authority, custody, or physical execution directly.

Governed execution refers to physical actions initiated or coordinated by software that remain inspectable, attributable, revocable, policy-scoped, and persistently recorded across providers and execution environments.
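One way to make this definition concrete is to treat the five properties as checks against an action record. The sketch below is illustrative only: `ActionRecord`, its field names, and the identifier strings are assumptions made for this example, not an OpenMatter schema or any provider's API.

```python
from dataclasses import dataclass, field
from typing import Optional


# Hypothetical record shape; every field name here is an assumption.
@dataclass
class ActionRecord:
    action: str                               # e.g. "release_item"
    initiated_by: Optional[str] = None        # actor or agent the action is attributed to
    policy_scope: Optional[str] = None        # policy that bounds the action
    revocation_handle: Optional[str] = None   # how the authority can be withdrawn
    evidence: list = field(default_factory=list)       # what an inspector can read
    recorded_with: list = field(default_factory=list)  # providers that hold the record


def missing_properties(record: ActionRecord) -> list:
    """Return the governed-execution properties this record does not satisfy."""
    checks = {
        "inspectable": bool(record.evidence),
        "attributable": record.initiated_by is not None,
        "revocable": record.revocation_handle is not None,
        "policy-scoped": record.policy_scope is not None,
        "persistently recorded": bool(record.recorded_with),
    }
    return [name for name, ok in checks.items() if not ok]


# An "authorization" that lives only in chat carries none of the five properties.
chat_only = ActionRecord(action="release_item")

# A record that carries all five (identifiers are made up for the example).
governed = ActionRecord(
    action="release_item",
    initiated_by="agent:assistant-1",
    policy_scope="policy:outbound-release",
    revocation_handle="grant:123",
    evidence=["scan:abc"],
    recorded_with=["provider:carrier", "provider:storage"],
)
```

Calling `missing_properties(chat_only)` lists all five properties, while `missing_properties(governed)` returns an empty list. The check is deliberately shallow: it tests whether each property is represented at all, not whether the underlying evidence is trustworthy.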

The body problem names the moment an agentic workflow needs a physical actor (someone or something on site to receive, release, carry, unlock, inspect, or accept custody) and software has no governed way to complete that step.

The body problem does not necessarily require humanoid robotics. It emerges anywhere software must coordinate physical authority, custody, mobility, or access across human-built infrastructure.

This is not a rare edge case. It is the default when agents move from documents and APIs into logistics, mobility, storage, access, and payments that touch atoms.

An agent may successfully purchase medication online, yet still fail to securely receive it. It may reserve transportation without being able to identify the correct passenger. It may authorize a courier pickup without being able to verify custody transfer at the handoff. The software run completes; the real-world operation does not.

The body problem is not only presence in space. It is also context — what a run should already know, what field evidence attaches to a handoff, and what must survive when execution crosses providers. We use three terms throughout: inherited context (mandates, custody, prior grants at action start), multimodal context (scans, location, signatures bound to the run), and persistent context (the same item, site, and policy references for the full governed run). Most stacks still treat all three as prompt stuffing or session-local state.
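The three context terms can be read as three parts of one run-scoped structure rather than as prompt text. The following is a minimal sketch under that assumption; the class names, field names, and the `hand_to` envelope are hypothetical and chosen only to mirror the definitions above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class InheritedContext:
    """What the run already knows at action start."""
    mandates: tuple
    custody_holder: str
    prior_grants: tuple


@dataclass(frozen=True)
class Evidence:
    """Multimodal field evidence, bound to a run instead of orphaned in chat."""
    kind: str      # "scan", "location", or "signature"
    ref: str
    run_id: str


@dataclass(frozen=True)
class PersistentRefs:
    """References that must stay the same for the full governed run."""
    item_id: str
    site_id: str
    policy_id: str


@dataclass
class RunContext:
    run_id: str
    inherited: InheritedContext
    refs: PersistentRefs
    evidence: list = field(default_factory=list)

    def attach(self, kind: str, ref: str) -> Evidence:
        """Bind a piece of field evidence to this run."""
        ev = Evidence(kind=kind, ref=ref, run_id=self.run_id)
        self.evidence.append(ev)
        return ev

    def hand_to(self, provider: str) -> dict:
        """Build the envelope a provider receives; nothing is rebuilt from prompts."""
        return {
            "provider": provider,
            "run_id": self.run_id,
            "inherited": self.inherited,
            "refs": self.refs,
            "evidence": tuple(self.evidence),
        }


ctx = RunContext(
    run_id="run-1",
    inherited=InheritedContext(
        mandates=("receive parcels",),
        custody_holder="actor:alice",
        prior_grants=("grant:locker-7",),
    ),
    refs=PersistentRefs("item:42", "site:locker-7", "policy:home-delivery"),
)
ctx.attach("scan", "scan:001")
to_courier = ctx.hand_to("courier")
to_locker = ctx.hand_to("locker")
```

Both envelopes carry the same `refs` object and the same inherited mandates, which is the property the essay calls persistent context. Session-local state, by contrast, would force the second provider to reconstruct `item_id` and `site_id` on its own.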

Software intent ≠ operational completion

1. Intent
2. Planning
3. Tool invocation
4. Payment
5. Handoff gap (the body problem)
6. Custody
7. Verification
8. Persistent state
9. Completion

We use the body problem as an organizing frame for research on execution infrastructure — including protocol work at OpenMatter, but not limited to it. Without governed execution and a unified harness for context and proof, “autonomous operations” is a transcript fiction.

Literature Review

The industry still sells autonomy while shipping products that assume a human operator behind the glass. Someone approves, someone signs in, someone meets the driver, someone catches the exception.

The market is also beginning to name the gap openly. RentAHuman markets humans-for-hire wired to agents through MCP — accept custody on site, meet couriers, run errands — because the default stack cannot complete those steps alone [1]. That category treats the body as an outsourced integration surface: an honest admission of the body problem, not a substitute for governed execution.

Payment rails are advancing on a different axis. At Stripe Sessions 2026, Kaliski et al. walk through checkout (UCP) and machine payments (MPP, x402): structured APIs and scoped credentials instead of agents scraping human checkout pages [4]. Stripe’s machine payments documentation describes the same software-metered pattern: payment per invocation, with protocols such as MPP and x402 settling in USDC or on card rails [2]. That improves how an agent pays for API access or a structured checkout session. It does not, by itself, let an agent buy something for an actor end to end. Payment method, billing address, shipping address, and fulfillment proof still pull a human back in at each step, and they rarely survive when another agent resumes the run.

Academic research frames the same divide from the modeling side. Fung et al. survey embodied agents — virtual, wearable, and robotic — and argue that world models integrating perception, planning, and memory are central to acting in the physical world [3]. They distinguish agents “instantiated in a visual, virtual, or physical form” from web-based agents that “do not possess embodiment” [3]. That work targets how embodied systems learn and plan; the body problem names what disembodied operational agents still cannot complete across logistics, custody, and human infrastructure.

Tool use expanded what agents can request. It did not standardize what it means for a physical operation to complete.

Orchestration frameworks chain software tools well. They do not, by themselves, give agents durable identities for custodied items, scoped authority for carriers, or state that survives from dispatch through handoff through completion.

A successful HTTP response is not operational completion. Completion requires the world itself to change state — custody transferred, access granted, delivery received, mobility engaged — and for that state transition to remain attributable to the initiating actor and agent.
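The distinction can be sketched as two separate predicates: one over the tool response, one over observed state transitions. The types and the `confirmed` flag below are assumptions for illustration; how a field system would actually confirm a transition is outside the scope of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolResponse:
    status_code: int


@dataclass(frozen=True)
class StateTransition:
    kind: str        # e.g. "custody_transferred", "access_granted"
    actor: str       # the actor on whose behalf the run executes
    agent: str       # the agent that initiated the action
    confirmed: bool  # confirmed by a field signal, not inferred from the API call


def software_succeeded(response: ToolResponse) -> bool:
    """True when the tool call itself returned a 2xx status."""
    return 200 <= response.status_code < 300


def operationally_complete(response, transitions, expected_kind, actor, agent) -> bool:
    """True only if the expected world-state change was confirmed and is
    attributable to the initiating actor and agent."""
    if not software_succeeded(response):
        return False
    return any(
        t.kind == expected_kind and t.confirmed and t.actor == actor and t.agent == agent
        for t in transitions
    )


ok = ToolResponse(status_code=200)
handoff = StateTransition("custody_transferred", "actor:alice", "agent:a1", confirmed=True)
unconfirmed = StateTransition("custody_transferred", "actor:alice", "agent:a1", confirmed=False)
```

With no transitions recorded, `operationally_complete(ok, [], "custody_transferred", "actor:alice", "agent:a1")` is false even though `software_succeeded(ok)` is true. That gap between the two predicates is the handoff gap the essay describes.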

World-model work emphasizes multimodal perception inside an embodied agent [3]. Operational agents still need field signals bound to the same run as API metadata — not orphaned in chat.

“I’m unaware of an agent that needs a T-shirt or anything like that, but certainly agents need things like API calls, access to data, invoking MCP servers, and so on.”

— Kaliski (2026), Stripe Sessions transcript [4]

“These agents, which include virtual avatars, wearable devices, and robots, are designed to perceive, learn and act within their surroundings, which makes them more similar to how humans learn and interact with the environments as compared to disembodied agents. We propose that the development of world models is central to reasoning and planning of embodied AI agents, allowing these agents to understand and predict their environment… World modeling encompasses the integration of multimodal perception, planning through reasoning for action and control, and memory to create a comprehensive understanding of the physical world.”

— Fung et al. (2025), abstract [3]

Methodology

We stress-test the body problem with six exemplar workflows that appear constantly in agent product demos and production roadmaps. For each, we ask what the agent can actually execute versus what still requires a human proxy.

Accepting custody on behalf of an actor. An agent may know that an item is arriving (from a tracking webhook, inventory record, or actor instruction) and that custody should transfer at a specific place and time. It cannot be present at that place: a residence, curbside, retail counter, locker, or facility loading dock. It cannot verify condition, witness the handoff, or accept custody into a governed record on the actor’s behalf. Custody changes are physical events wherever they occur, not only in warehouses. Without an execution surface that represents sites, principals, and handoffs, the agent stops at “notify the user.”

Authorizing an outbound handoff. An agent may draft a release request or fill a form. It cannot meet a carrier at a curb, release items under policy, or bind authority to a specific driver and time window in a revocable, inspectable way. The “authorization” lives in chat or email unless infrastructure carries grants and receipts.

Accessing mobility infrastructure. An agent may call a mobility API and receive a quote or even a reservation ID. It cannot enter a vehicle, confirm the correct passenger, or enforce route constraints in the world. Mobility state often dies when the session ends: the next turn rebuilds context from prompts instead of reading persistent infrastructure state.

Granting physical access. An agent may know that a courier, technician, or guest should enter a building, unit, or locker bank at a specific time. It cannot unlock a door, issue a revocable credential tied to place and policy, or prove who crossed the threshold and when. Access still flows through proprietary apps, SMS codes, or a human with a key, not a governed grant the run can inspect, extend, or revoke.

Completing end-to-end agentic commerce. An agent may find a product, populate a cart, and call checkout APIs. It cannot buy something for an actor without a human proxy at nearly every step: approving or supplying a payment method, confirming billing address, confirming shipping address, and often re-authorizing per merchant. Machine payments and shared payment tokens [2, 4] narrow the software settlement gap; they do not install governed delivery identity, fulfillment proof, or continuity when a different agent resumes the purchase. Context a human entered for one agent does not reliably exist for the next: billing, shipping, and mandate state fragment across sessions and vendors instead of inheriting through a unified commerce harness.

Persisting context across the workflow. Each exemplar above spans multiple services. Inherited, multimodal, and persistent context should compose for the full run. Instead, an agent may pay via machine payments [2, 4] while still unable to prove who accepted custody, who authorized handoff, or who was admitted on site, and it rebuilds references at every boundary. That amnesia is a body-problem symptom: no unified execution harness for atoms.
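The outbound handoff exemplar describes authority bound to a specific driver and time window in a revocable, inspectable way. A minimal sketch of such a grant follows. `HandoffGrant`, its fields, and the receipt format are hypothetical; a real system would also need authentication of the driver identity, which this sketch assumes has already happened.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class HandoffGrant:
    item_id: str
    driver_id: str
    not_before: datetime
    not_after: datetime
    revoked: bool = False
    receipts: list = field(default_factory=list)

    def revoke(self) -> None:
        """Withdraw the authority; later release attempts are refused."""
        self.revoked = True

    def permits(self, driver_id: str, at: datetime) -> bool:
        """True if this driver may take the item at this time."""
        return (
            not self.revoked
            and driver_id == self.driver_id
            and self.not_before <= at <= self.not_after
        )

    def release(self, driver_id: str, at: datetime) -> bool:
        """Attempt a release and record a receipt whether or not it is allowed."""
        allowed = self.permits(driver_id, at)
        self.receipts.append(
            {"item_id": self.item_id, "driver_id": driver_id,
             "at": at.isoformat(), "allowed": allowed}
        )
        return allowed


start = datetime(2026, 5, 25, 9, 0, tzinfo=timezone.utc)
grant = HandoffGrant("item:42", "driver:7", start, start + timedelta(hours=2))

first = grant.release("driver:7", start + timedelta(minutes=30))      # named driver, in window
wrong_driver = grant.release("driver:9", start + timedelta(minutes=30))
too_late = grant.release("driver:7", start + timedelta(hours=3))
grant.revoke()
after_revoke = grant.release("driver:7", start + timedelta(minutes=45))
```

Only the first attempt is allowed. Every attempt, including the refused ones, leaves a receipt, so the run can later inspect what was tried and by whom instead of relying on a chat or email thread.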

Results

Across all six exemplars, the pattern is the same. The agent produces plausible language and partial software artifacts. The operation remains incomplete until a person or field system closes the gap.

Accepting custody on behalf of an actor. Custody is ambiguous at every site type. The agent cannot prove what was received on the actor’s behalf, by whom, under which grant, or at which addressable location, whether the handoff happened at home, on a curb, or in a facility.

Authorizing an outbound handoff. Carriers and senders coordinate through ad hoc channels. The agent did not authorize the handoff in a machine-readable, revocable form.

Accessing mobility infrastructure. The agent may hold a reservation reference while the user still stands outside, or the trip diverges from intent with no durable state for the run to recover.

Granting physical access. Entry may happen in the world while the run holds only a message like “code sent” or “ask the front desk.” The agent cannot bind admission to scoped policy, receipt it, or revoke it when the visit ends.

End-to-end agentic commerce. A checkout may clear in a demo while the actor’s addresses and mandates stay outside the run. Payment success is not delivery completion. When another agent continues the workflow, shipping, billing, and spend authority are not reliably inherited, so the human must proxy the same details again.

Persisting context across the workflow. A later agent turn cannot reliably answer “which item is under custody, who may authorize release, what mobility was engaged, who was admitted where, and what was purchased for whom at which address?” without re-deriving everything from scratch.

Context dimensions. Inherited and multimodal context fail at provider boundaries: mandates and field evidence drop instead of binding to the execution record, so software cannot reconcile what was authorized with what occurred.
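If mandates and field evidence were bound to the same execution record, reconciling what was authorized with what occurred would reduce to a set comparison. The sketch below assumes both sides can be expressed as `(action, subject)` pairs, which is a simplification introduced here for illustration.

```python
def reconcile(authorized: set, occurred) -> dict:
    """Compare authorized (action, subject) pairs with observed ones.

    Both inputs are assumed to use the same identifiers, which is exactly
    what fails when context is dropped at provider boundaries.
    """
    occurred_set = set(occurred)
    return {
        "matched": sorted(authorized & occurred_set),
        "authorized_not_observed": sorted(authorized - occurred_set),
        "observed_not_authorized": sorted(occurred_set - authorized),
    }


authorized = {("release", "item:42"), ("admit", "courier:7")}
occurred = [("release", "item:42"), ("admit", "guest:3")]
report = reconcile(authorized, occurred)
```

Here the release is matched, the courier's admission was authorized but never observed, and a guest's entry was observed without authorization. When identifiers differ across providers, every pair falls into the two unmatched buckets, which is how the fragmentation described above shows up in practice.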

These are not model failures. They are infrastructure failures. The body problem is the correct name for the gap between software intent, physical completion, and context that should travel with both.

Discussion

The body problem is why agent-native execution cannot mean software-native execution with humans assumed at every handoff.

Closing the gap requires shared execution infrastructure: protocol primitives composed as one harness. That means addressable items and sites; scoped grants for handoff, spend, and admission; composed logistics, mobility, and access services; and the three context dimensions carried with traceability after execution.

Human-in-the-loop copy does not solve the body problem. It documents it. The design target remains governed execution — including when a physical step must be assigned to a person or robot with explicit authority.

Renting humans through marketplaces such as RentAHuman [1] scales the symptom on the physical side. Machine payments [2, 4] scale agent spend, not custody completion. World-model research [3] advances planning inside a body but does not by itself give operational agents governed handoffs or cross-provider state. Verification that payment cleared is not verification that the world changed [4]. None of these alone closes the full gap.

Limitations: this paper defines and illustrates the problem frame. Provider benchmarks and controlled trials across live networks are reported in subsequent company essays and implementation notes.

Conclusion

Agents are incapable of accepting custody, authorizing outbound handoffs, accessing mobility infrastructure, granting governed physical access, completing agentic commerce on an actor’s behalf, and carrying inherited, multimodal, and persistent context through a single governed run — not because they lack intelligence, but because the stack lacks a unified execution harness for atoms.

The internet standardized communication between computers. The next infrastructure transition standardizes execution across the physical world.

The missing layer is not intelligence. It is operational continuity between software intent and physical state change.

Naming the body problem sets the design bar: items addressable, authority explicit, services composable, governed execution with context that inherits and persists. OpenMatter’s protocol catalog is one attempt at that bar; we expect other stacks to pursue overlapping goals in different shapes.

Any product claiming operational agents should be evaluated on the six exemplars in this essay. If the physical world still depends on invisible human intervention to complete the workflow, the body problem remains unsolved.
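The evaluation described above can be written as a simple rubric. This is a sketch, not a benchmark: the exemplar keys are shorthand invented for this example, and each boolean stands for a judgment that the workflow completed without invisible human intervention.

```python
# Shorthand keys for the six exemplars in this essay (names are assumptions).
EXEMPLARS = (
    "accept_custody",
    "authorize_outbound_handoff",
    "access_mobility",
    "grant_physical_access",
    "complete_commerce",
    "persist_context",
)


def evaluate(results: dict) -> dict:
    """Score a product against the six exemplars.

    `results` maps an exemplar key to True when the workflow completed
    without invisible human intervention. Missing keys count as unsolved.
    """
    unknown = set(results) - set(EXEMPLARS)
    if unknown:
        raise ValueError(f"unknown exemplars: {sorted(unknown)}")
    unsolved = [name for name in EXEMPLARS if not results.get(name, False)]
    return {"solved": not unsolved, "unsolved": unsolved}


payments_only = evaluate({"complete_commerce": True})
all_six = evaluate({name: True for name in EXEMPLARS})
```

A product that handles only checkout leaves five exemplars unsolved, so `payments_only["solved"]` is false. Treating missing entries as unsolved matches the essay's bar: an untested handoff is not evidence of completion.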

Contributions

A formal definition of the body problem and governed execution for agent-native operations.
Documented incapabilities across custody, logistics authority, mobility, physical access, agentic commerce, and three context failures: inherited, multimodal, and persistent.
A research basis for protocol primitives — identity, context, policy, place, capability, and state — composed through a unified execution harness.

References

[1] RentAHuman. Hire humans for AI agents (MCP integration). — https://rentahuman.ai
[2] Stripe. Machine payments. — https://docs.stripe.com/payments/machine
[3] Fung, P., Bachrach, Y., Celikyilmaz, A., Chaudhuri, K., Chen, D., Chung, W., Dupoux, E., Jégou, H., Lazaric, A., Majumdar, A., Madotto, A., Meier, F., Metze, F., Moutakanni, T., Pino, J., Terver, B., Tighe, J., & Malik, J. (2025). Embodied AI Agents: Modeling the World. Meta AI Research. — https://arxiv.org/abs/2506.22355 · DOI 10.48550/arXiv.2506.22355
[4] Kaliski, S., Surtani, M., & Poncin, G. (2026). Machine payments and the protocols behind agentic commerce. Stripe Sessions 2026. — https://stripe.com/sessions/2026/machine-payments-and-the

Declarations

Funding

This work was conducted as part of OpenMatter internal research and protocol design.

Conflict of interest

The authors are affiliated with OpenMatter.

Ethics approval

Not applicable. This essay presents systems research and protocol proposals rather than human-subjects experimentation.

Data availability

Supporting notes and implementation artifacts referenced in this paper are available to qualified researchers on request.
