Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory

Lin Long1,2,*, Yichen He1,*, Wentao Ye1,2, Yiyuan Pan1,3, Yuan Lin1,†, Hang Li1, Junbo Zhao2, Wei Li1
1ByteDance Seed, 2Zhejiang University, 3Shanghai Jiao Tong University
*Equal Contribution, †Corresponding author

linyuan.0@bytedance.com

Abstract

We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update its long-term memory. Beyond episodic memory, it also develops semantic memory, enabling it to accumulate world knowledge over time. Its memory is organized in an entity-centric, multimodal format, allowing deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn, iterative reasoning and retrieves relevant information from memory to accomplish the task. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a new long-video question answering benchmark. M3-Bench comprises 100 newly recorded real-world videos captured from a robot’s perspective (M3-Bench-robot) and 929 web-sourced videos across diverse scenarios (M3-Bench-web). We annotate question-answer pairs designed to test key capabilities essential for agent applications, such as human understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 8.2%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web, and VideoMME-long, respectively. Our work advances multimodal agents toward more human-like long-term memory and provides insights into their practical design. Model, code, and data are available at https://github.com/bytedance-seed/m3-agent.

Human-centric memorization

A demo of M3-Agent as a personal assistant!

M3-Bench

We introduce M3-Bench, a long-video question-answering dataset designed to evaluate the capability of multimodal agents to perform reasoning over long-term memory. Each instance in M3-Bench comprises a long video simulating the perceptual input of an agent, along with a series of open-ended question-answer pairs. The dataset is organized into two subsets:

1. M3-Bench-robot, which contains 100 real-world videos recorded from a robot's first-person perspective,

2. M3-Bench-web, which includes 929 web-sourced videos covering a wider variety of content and scenarios.
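The instance structure described above (one long video plus a list of open-ended question-answer pairs, each tagged with one or more question types) can be sketched as a small schema. This is an illustrative sketch only; the field names (`video_path`, `subset`, `question_types`, etc.) are assumptions and not the released dataset format.

```python
from dataclasses import dataclass, field

@dataclass
class QAPair:
    question: str
    answer: str
    # Each question may carry multiple type tags,
    # e.g. ["Human Understanding", "Cross-Modal Reasoning"].
    question_types: list[str] = field(default_factory=list)

@dataclass
class M3BenchInstance:
    video_path: str          # long video simulating the agent's perceptual input
    subset: str              # "robot" (first-person robot video) or "web"
    qa_pairs: list[QAPair] = field(default_factory=list)

# Example mirroring a QA pair from M3-Bench-robot below
# (the path is hypothetical).
example = M3BenchInstance(
    video_path="videos/robot/kitchen_001.mp4",
    subset="robot",
    qa_pairs=[QAPair(
        question="Why is Abel a little bit angry?",
        answer="Because Cary drank up all the yogurt.",
        question_types=["Human Understanding", "Cross-Modal Reasoning"],
    )],
)
```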

m3-bench-example

Examples from M3-Bench. M3-Bench-robot features long videos from realistic robotic work scenarios, while M3-Bench-web expands the video diversity to support broader evaluation. The question-answering tasks are designed to assess a multimodal agent’s ability to construct consistent and reliable long-term memory, as well as to reason effectively over that memory.

Data Examples


M3-Bench-Robot

1. Why is Abel a little bit angry? [Human Understanding, Cross-Modal Reasoning]

    Answer: Because Cary drank up all the yogurt.

2. What dish is in the red pot? [Multi-Detail Reasoning]

    Answer: Winter melon stewed with spareribs.

3. Which shelf in the refrigerator, counting from the top, does Cary's family usually put the wine they bought on? [Cross-Modal Reasoning]

    Answer: Third.

4. Is Cary good at cooking? [Human Understanding]

    Answer: No.

M3-Bench-Web

1. Which collection has the highest starting price among the five items shown in the video? [Multi-Detail Reasoning]

    Answer: Pirate Ship Float.

2. What did the authentication expert do after examining the ink marks on the Led Zeppelin album? [Multi-Hop Reasoning]

    Answer: He compared the handwriting.

3. Which collection is Rick's favorite, as indicated in the video? [Multi-Hop Reasoning, Human Understanding]

    Answer: The album of Led Zeppelin.

4. Does Rick trust Trump's abilities, as shown in the video? [Multi-Detail Reasoning, Human Understanding]

    Answer: No.

M3-Bench-Web

1. How should the knife be positioned to efficiently and neatly slice scallions into shreds? [General Knowledge Extraction]

    Answer: The blade should not be higher than the fingers.

2. Is Lucas skilled at cooking based on his performance in the video? [Multi-Detail Reasoning, Human Understanding]

    Answer: No.

3. What symbolic meaning does the father associate with keeping the shrimp head intact? [Human Understanding]

    Answer: It symbolizes a complete beginning and ending.

Dataset Statistics

m3-bench-statistic

Statistical overview of M3-Bench benchmark. Each question may correspond to multiple question types.

M3-Agent

Architecture of M3-Agent. The system consists of two parallel processes: memorization and control. During memorization, M3-Agent processes video and audio streams online to generate episodic and semantic memory. During control, it executes instructions by iteratively thinking and retrieving from long-term memory. The long-term memory is structured as a multimodal graph.

m3-agent
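The control process described above, in which the agent alternates between thinking and retrieving from its long-term memory until it can answer, can be sketched in a few lines. This is a minimal toy sketch, not the released implementation: the `MemoryGraph` class, the `policy` callable, and the action dictionary format are all hypothetical interfaces invented for illustration.

```python
class MemoryGraph:
    """Toy stand-in for the entity-centric multimodal memory graph."""
    def __init__(self, entries):
        self.entries = entries  # {entity: [memory strings]}

    def search(self, query):
        # Return memories for every entity mentioned in the query.
        return [m for ent, mems in self.entries.items()
                if ent in query for m in mems]

def control(instruction, memory, policy, max_turns=5):
    """Multi-turn loop: each turn the policy either retrieves or answers."""
    context = [f"Instruction: {instruction}"]
    for _ in range(max_turns):
        action = policy(context)
        if action["type"] == "answer":
            return action["content"]
        context.append(f"Retrieved: {memory.search(action['query'])}")
    # Turn budget exhausted: force a final answer.
    return policy(context + ["Answer now."])["content"]

# Toy policy: issue one retrieval, then answer with what was found.
def toy_policy(context):
    if any(line.startswith("Retrieved:") for line in context):
        return {"type": "answer", "content": context[-1]}
    return {"type": "search", "query": "Cary yogurt"}

memory = MemoryGraph({"Cary": ["Cary drank up all the yogurt."]})
print(control("Why is Abel angry?", memory, toy_policy))
```

In the real system the policy is the trained multimodal model and retrieval operates over the graph built during memorization; the loop structure is the part this sketch aims to convey.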

Experimental Results

result1

result1

result1

BibTeX

@misc{long2025seeing,
      title={Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory}, 
      author={Lin Long and Yichen He and Wentao Ye and Yiyuan Pan and Yuan Lin and Hang Li and Junbo Zhao and Wei Li},
      year={2025},
      eprint={2508.09736},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}