Embodied Agents Meet Personalization: Exploring Memory Utilization for Personalized Assistance

Taeyoon Kwon¹* Dongwook Choi¹* Sunghwan Kim¹ Hyojun Kim¹
Seungjun Moon¹ Beong-woo Kwak¹ Kuan-Hao Huang² Jinyoung Yeo¹

¹Yonsei University ²Texas A&M University

Why Personalized Embodied Agents?

Embodied agents empowered by large language models (LLMs) have recently demonstrated remarkable success in executing object rearrangement tasks in household environments. The primary objective of embodied agents is to provide assistance to users while interacting with the physical world. But do such tasks truly reflect the challenges in providing meaningful assistance to the users?

To provide personalized assistance, embodied agents must understand the unique semantics that users assign to the physical world (e.g., favorite cup, breakfast routine) by leveraging prior interaction history to interpret dynamic, real-world instructions. In this work, we propose Memento, a framework to evaluate the memory utilization of embodied agents in providing personalized assistance.

Through Memento, we evaluated embodied agents powered by various LLMs and found that even state-of-the-art models struggle to utilize episodic memory containing personalized knowledge. GPT-4o showed a 30.5% drop in success rate when required to reference multiple memories, particularly in tasks involving user patterns. The agents were also easily distracted by irrelevant memories. These findings offer guidance for developing personalized embodied agents.

Preliminaries

Rearrangement Tasks for LLM-Powered Embodied Agents

The object rearrangement task for LLM-powered embodied agents involves moving objects to their correct target locations in a physical environment based on natural language instructions. At the beginning of each episode, the agent receives an instruction such as "Place the mug on the table and the book on the shelf." It must interpret this instruction and internally derive the intended goal — identifying which objects need to be moved and where they should be placed. Using this goal, the agent then plans and executes a sequence of actions to manipulate the environment accordingly. During execution, the agent incrementally receives text-based observations of nearby objects and decides on its next action by reasoning over the history of past observations and actions. The overall objective is to complete the rearrangement so that the final environment configuration matches the instruction-defined goal.

In this work, we focus on the partial observability of the task, meaning the agent can only perceive a limited view of its surroundings at any given time.
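The observe-then-act loop described above can be illustrated with a toy example. This is a minimal sketch under stated assumptions: the `rearrange` function, the room-based world, and the fixed exploration rule are hypothetical stand-ins for an LLM planner and a simulator, and none of them come from Memento itself.

```python
def rearrange(world, start_room, obj, target_room, max_steps=20):
    """Toy observe-act loop: the agent only sees objects in its current room."""
    history = []                      # past (action, observation) pairs
    room, holding = start_room, None
    unvisited = [r for r in world if r != start_room]

    for _ in range(max_steps):
        observation = sorted(world[room])      # partial view: current room only
        if holding is None and obj in observation:
            world[room].remove(obj)            # object is visible: pick it up
            holding = obj
            action = ("pick", obj)
        elif holding is not None and room == target_room:
            world[room].add(obj)               # at the target: place and finish
            history.append((("place", obj), observation))
            return True, history
        elif holding is not None:
            room = target_room                 # carry the object to the target
            action = ("navigate", room)
        elif unvisited:
            room = unvisited.pop(0)            # keep exploring unseen rooms
            action = ("navigate", room)
        else:
            break                              # nothing left to explore
        history.append((action, observation))
    return False, history


world = {"kitchen": {"plate"}, "bedroom": {"mug"}, "living room": {"book"}}
success, history = rearrange(world, "kitchen", "mug", "living room")
```

A real agent would replace the hard-coded branching with an LLM call conditioned on `history`; the point here is only that the agent must explore because each observation covers a single room.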

Designing Memento

Overview of Memento.

Two-stage Evaluation Process

We divide the evaluation process into two stages: (1) Memory Acquisition Stage and (2) Memory Utilization Stage. In the Memory Acquisition Stage, agents perform conventional object rearrangement tasks with instructions that contain personalized knowledge, while accumulating interaction history. In the Memory Utilization Stage, agents are given modified versions of the same tasks, which require them to recall and apply the personalized knowledge acquired in the previous stage in order to succeed.
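The two stages can be sketched as a small evaluation harness. This is an illustrative sketch only: `toy_agent`, `run_two_stage`, the task fields, and the choice to store raw instructions as the interaction history are all assumptions made for this example, not the actual Memento pipeline.

```python
KNOWN_OBJECTS = ["blue mug", "red mug"]  # toy object vocabulary (assumed)


def toy_agent(instruction, history):
    """Stand-in for an LLM planner: names the object it would move."""
    text = " ".join([instruction] + history)
    return next((obj for obj in KNOWN_OBJECTS if obj in text), None)


def run_two_stage(agent, tasks):
    """Run acquisition first, then utilization with the accumulated history."""
    memory = []                       # interaction history gathered in stage 1
    acquisition, utilization = [], []

    # Stage 1: instructions spell out the personalized knowledge.
    for task in tasks:
        choice = agent(task["instruction"], history=[])
        acquisition.append(choice == task["goal_object"])
        memory.append(task["instruction"])

    # Stage 2: modified instructions rely on what was learned in stage 1.
    for task in tasks:
        choice = agent(task["modified_instruction"], history=memory)
        utilization.append(choice == task["goal_object"])

    return acquisition, utilization


tasks = [{
    "instruction": "Put my favorite cup, the blue mug, on the table.",
    "modified_instruction": "Put my favorite cup on the table.",
    "goal_object": "blue mug",
}]
acq, util = run_two_stage(toy_agent, tasks)
```

In this toy setup the modified instruction is only solvable with the stored history: calling `toy_agent` on it with an empty history returns `None`.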

Personalized Knowledge Categorization

We categorize personalized knowledge into two types, Object Semantics and User Patterns, each isolating a distinct reasoning challenge that the agent must address using episodic memory. For Object Semantics, we include subcategories such as ownership (e.g., my cup), preference (e.g., my favorite running gear), past history (e.g., a graduation gift from my grandma), and grouped references (e.g., my childhood toy collections), all referring to objects that carry personal meaning for the user. For User Patterns, we consider personal routines (e.g., my remote work setup) and arrangement preferences (e.g., my cozy dinner atmosphere), which reflect recurring sequences of actions that the user consistently performs in specific contexts. The User Patterns category assesses the agent's ability to reconstruct the complete goal g by leveraging previously observed behavioral patterns across multiple objects and locations.
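For readers who want to tag episodes by knowledge type, the taxonomy above could be represented as follows. The enum, the dictionary layout, and the identifier spellings are hypothetical choices for this sketch; only the category names and example phrases come from the text.

```python
from enum import Enum


class KnowledgeType(Enum):
    OBJECT_SEMANTICS = "object_semantics"
    USER_PATTERNS = "user_patterns"


# Subcategories named in the text, each with its example phrase.
SUBCATEGORIES = {
    KnowledgeType.OBJECT_SEMANTICS: {
        "ownership": "my cup",
        "preference": "my favorite running gear",
        "past_history": "a graduation gift from my grandma",
        "grouped_references": "my childhood toy collections",
    },
    KnowledgeType.USER_PATTERNS: {
        "personal_routine": "my remote work setup",
        "arrangement_preference": "my cozy dinner atmosphere",
    },
}


def knowledge_type_of(subcategory):
    """Map a subcategory name back to its top-level knowledge type."""
    for ktype, subs in SUBCATEGORIES.items():
        if subcategory in subs:
            return ktype
    raise ValueError(f"unknown subcategory: {subcategory}")
```

Grouping results by `knowledge_type_of(...)` is one simple way to break performance down by knowledge type.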

Evaluation metric

We use two main metrics: Percent Complete (PC), the proportion of the goal that is completed, and Success Rate (SR), the rate of full task completion. To evaluate memory utilization, we also report the performance drops between the acquisition and utilization stages as ΔPC and ΔSR; for joint-memory tasks, these are computed relative to the average performance of the corresponding acquisition-stage episodes. In addition, we measure Simulation Steps, the number of steps taken to complete a task, and Planning Cycles, the number of LLM inference calls during execution.
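As a rough illustration of how these metrics relate, the sketch below computes PC, SR, and the stage-to-stage drops from per-episode results. The `Episode` record, its field names, the toy numbers, and the sign convention (acquisition minus utilization) are assumptions made for this example rather than the benchmark's own code.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Episode:
    """Hypothetical per-episode result: sub-goals satisfied out of total."""
    satisfied: int
    total: int


def percent_complete(episodes):
    """Mean fraction of sub-goals satisfied, as a percentage."""
    return 100.0 * mean(e.satisfied / e.total for e in episodes)


def success_rate(episodes):
    """Percentage of episodes in which every sub-goal is satisfied."""
    return 100.0 * mean(1.0 if e.satisfied == e.total else 0.0 for e in episodes)


def drop(acquisition, utilization, metric):
    """Performance drop (assumed sign: acquisition minus utilization)."""
    return metric(acquisition) - metric(utilization)


# Toy results, invented for illustration only.
acquisition = [Episode(2, 2), Episode(3, 3), Episode(2, 2), Episode(1, 2)]
utilization = [Episode(2, 2), Episode(1, 2), Episode(1, 2), Episode(0, 2)]

delta_pc = drop(acquisition, utilization, percent_complete)
delta_sr = drop(acquisition, utilization, success_rate)
```

Note how PC gives partial credit for half-finished episodes while SR does not, so the two drops can differ substantially on the same set of runs.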

Evaluating Personalized Knowledge Utilization in Episodic Memory

Main Results

Model performance across the memory acquisition and utilization stages in Memento.

LLM-powered embodied agents struggle to consistently utilize personalized knowledge. Even GPT-4o shows a 30.5% drop in success rate on joint-memory tasks, and all models experience a drop of over 20% compared to the memory acquisition stage. This highlights the challenge of referencing and applying personalized knowledge across multiple steps. The increased planning cycles and simulation steps suggest that agents misinterpret instructions, which leads to inefficient exploration. Furthermore, the larger gap between percent complete and success rate indicates that agents often overlook critical information needed for successful task completion.

Personalized Knowledge Type Based Analysis

The results of personalized knowledge type based analysis (single-memory).

We compare success rates between the memory acquisition stage and the single-memory task to analyze performance gaps across personalized knowledge types. LLMs perform relatively well on object semantics, showing minimal drop, but struggle significantly with user patterns, which require reasoning over action sequences. This suggests that while LLMs can recall relevant objects, they have difficulty understanding and integrating temporal behavior patterns.

Top-k Retrieval Based Analysis

The results of top-k retrieval based analysis (single-memory).

As the number of retrieved memories k increases, all models show consistent performance degradation across both task types. This highlights the difficulty of identifying relevant information from a larger memory set. The drop is especially severe in user pattern tasks, indicating their sensitivity to noise when reasoning over multi-step, implicit personalized knowledge.
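To make the role of k concrete, here is a minimal sketch of top-k memory retrieval. It uses Jaccard token overlap as the similarity score, which is an assumption chosen for simplicity; the text does not specify the retriever, and `tokenize` and `top_k_memories` are hypothetical helper names.

```python
def tokenize(text):
    """Lowercase and split on whitespace, ignoring commas and periods."""
    return set(text.lower().replace(",", " ").replace(".", " ").split())


def top_k_memories(query, memories, k):
    """Return the k memories with the highest Jaccard overlap with the query."""
    q = tokenize(query)

    def score(memory):
        t = tokenize(memory)
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    # sorted() is stable, so ties keep their original order.
    return sorted(memories, key=score, reverse=True)[:k]


memories = [
    "Put my favorite cup, the blue mug, on the table.",
    "Move the book to the shelf.",
    "Place the lamp on the bedroom desk.",
]
query = "Bring my favorite cup to the sofa."

relevant_only = top_k_memories(query, memories, k=1)
with_distractors = top_k_memories(query, memories, k=3)
```

With k=1 only the memory about the favorite cup is returned; with k=3 the two unrelated memories are passed along as well, which is the kind of noise the analysis above varies.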

Case Study

The case study results of Memento.

To analyze how LLM-powered embodied agents leverage personalized knowledge from episodic memory during planning, we conduct qualitative case studies on both successes and failures. In particular, analysis of the frontier model GPT-4o reveals two key recurring failure patterns:

Missed personalization cues: Agents often misinterpret user-specific references as generic or proper nouns, failing to capture user intent.
Commonsense over personalization: Even when relevant memory exists, agents tend to rely on commonsense reasoning rather than drawing from episodic memory, leading to incorrect or suboptimal behaviors.

BibTeX

@article{kwon2024personalizedembodied,
title={Embodied Agents Meet Personalization: Exploring Memory Utilization for Personalized Assistance},
author={Kwon, Taeyoon and Choi, Dongwook and Kim, Sunghwan and Kim, Hyojun and Moon, Seungjun and Kwak, Beong-woo and Huang, Kuan-Hao and Yeo, Jinyoung},
journal={arXiv preprint arXiv:2403.xxxxx},
year={2024}
}