Multimodal representation learning is central to modern AI, enabling applications across retrieval, generation, RAG, reasoning, agentic AI, and embodied intelligence. With the growing ubiquity of multimodal data—from e-commerce listings to social media and video content—new challenges arise in multimodal retrieval, where both queries and indexed content span multiple modalities. This task requires deeper semantic understanding and reasoning, especially at scale, where data complexity and noise become significant hurdles. Following the success of our first edition, the second Multimodal Representation and Retrieval Workshop at ICCV 2025 will continue to foster progress in this critical area. The half-day event will feature keynote talks, an invited talk, and oral and poster presentations.
Our objective with this workshop is to capture the interest of researchers in the emerging field of multimodal retrieval and representation learning. As users increasingly rely on LLM-based agents to interact with the world, the tools needed to retrieve relevant information will need to evolve to serve agents as well as human users. We anticipate that the workshop will serve as a catalyst for establishing a dedicated community focused on this topic. By highlighting the novelty and significance of the problem, we aim to attract researchers who are eager to explore and contribute to this field. We invite original research and industrial application papers on learning multimodal representations and building multimodal retrieval systems.
Submissions must be in English, in PDF format, and at most 8 pages long (including figures, tables, proofs, appendices, acknowledgments, and all content except references), with unrestricted space for references, in the ICCV style. Please download the ICCV 2025 Author Kit for detailed formatting instructions.
Papers that are not properly anonymized, do not use the template, or are shorter than four pages or longer than eight pages (excluding references) will be rejected without review. We expect at least one author from each submission to be available as a reviewer.
Papers must be submitted electronically.
Accepted papers will appear in the ICCV proceedings by default unless the authors notify the organizers (email: mrr-2025-iccv@googlegroups.com) separately before July 3 (11:59 pm PST).
| Time | Session |
|---|---|
| 8:30 - 8:35 am | Opening Remarks |
| 8:35 - 8:55 am | Invited Talk: Roei Herzig, "Towards Structured Physical Intelligence Models" |
| 8:55 - 9:35 am | Keynote: Cordelia Schmid, "Multi-modal video understanding" |
| 9:35 - 9:50 am | "Rate–Distortion Limits for Multimodal Retrieval: Theory, Optimal Codes, and Finite-Sample Guarantees" (Thomas Y. Chen) |
| 9:50 - 10:35 am | Coffee Break & Poster Session |
| 10:35 - 10:50 am | "MIND-RAG: Multimodal Context-Aware and Intent-Aware Retrieval-Augmented Generation for Educational Publications" (Jiayang Yu, Yuxi Xie, Guixuan Zhang, Jie Liu, Zhi Zeng, Ying Huang, Shuwu Zhang) |
| 10:50 - 11:30 am | Keynote: Jianwei Yang, "Towards Intelligent Multimodal AI Agents: From Digital to Physical Worlds and Beyond" |
| 11:30 - 11:45 am | "Refining Skewed Perceptions in Vision-Language Contrastive Models through Visual Representations" (Haocheng Dai, Sarang Joshi) |
| 11:45 - 12:25 pm | Keynote: Kristen Grauman, "Multimodal activity understanding" |
| 12:25 - 12:30 pm | Closing Remarks |
Cordelia Schmid, Research Director, Inria
Title: Multi-modal video understanding
Bio: Cordelia Schmid holds an M.S. degree in Computer Science from the University of Karlsruhe and a Doctorate, also in Computer Science, from the Institut National Polytechnique de Grenoble (INPG). Her doctoral thesis on "Local Greyvalue Invariants for Image Matching and Retrieval" received the best thesis award from INPG in 1996. She received the Habilitation degree in 2001 for her thesis entitled "From Image Matching to Learning Visual Models". Dr. Schmid was a post-doctoral research assistant in the Robotics Research Group of Oxford University in 1996--1997. Since 1997 she has held a permanent research position at Inria, where she is a research director.
Dr. Schmid is a member of the German National Academy of Sciences, Leopoldina, and a fellow of IEEE and the ELLIS society. She was awarded the Longuet-Higgins prize in 2006, 2014 and 2016, the Koenderink prize in 2018 and the Helmholtz prize in 2023, all for fundamental contributions in computer vision that have withstood the test of time. She received an ERC advanced grant in 2013, the Humboldt research award in 2015, the Inria & French Academy of Science Grand Prix in 2016, the Royal Society Milner award in 2020 and the PAMI distinguished researcher award in 2021. In 2023 she received the Körber European Science Prize and in 2024 the European Inventor Award in the research category. Dr. Schmid has been an Associate Editor for IEEE PAMI (2001--2005) and for IJCV (2004--2012), an editor-in-chief for IJCV (2013--2018), a program chair of IEEE CVPR 2005 and ECCV 2012, as well as a general chair of IEEE CVPR 2015, ECCV 2020 and ICCV 2023. Since 2018 she has held a joint appointment with Google Research.
Kristen Grauman, Professor, Department of Computer Science, UT Austin
Title: Multimodal activity understanding
Bio: Kristen Grauman is a Professor in the Department of Computer Science at the University of Texas at Austin. Her research focuses on video understanding and embodied perception. Before joining UT-Austin in 2007, she received her Ph.D. at MIT. She is an AAAS Fellow, IEEE Fellow, AAAI Fellow, Sloan Fellow, and recipient of the 2025 Huang Prize and the 2013 Computers and Thought Award. She and her collaborators have been recognized with several Best Paper awards in computer vision, including a 2011 Marr Prize and a 2017 Helmholtz Prize (test of time award). She has served as Associate Editor-in-Chief for PAMI and Program Chair of CVPR 2015, NeurIPS 2018, and ICCV 2023.
Jianwei Yang, Research Scientist, Meta MSL
Title: Towards Intelligent Multimodal AI Agents: From Digital to Physical Worlds and Beyond
Bio: Jianwei Yang is an AI Research Scientist at Meta and was previously a Principal Researcher at MSR. His research lies at the intersection of computer vision and multimodal learning, with a focus on developing general-purpose multimodal agents capable of interacting with both humans and environments. He has co-organized several academic events, including the Workshops on Transformers for Vision, the Workshops on Computer Vision in the Wild, and the Tutorials on Recent Advances in Vision Foundation Models. Jianwei has also served as an Area Chair for top-tier conferences such as ICCV, NeurIPS, and ICLR. His work has been recognized with several honors, including a Best Student Paper Finalist at CVPR 2022, first place in the V3Det Challenge at CVPR 2024, and the Best Paper Award at the CoRL 2024 LangRob Workshop.
View all papers in the official proceedings.
Xinliang Zhu, Amazon
Arnab Dhua, Amazon
Shengsheng Qian, Associate Professor, Chinese Academy of Sciences
Xin (Eric) Wang, Assistant Professor, University of California, Santa Cruz
Rene Vidal, Rachleff University Professor, University of Pennsylvania
Douglas Gray, Amazon
For any questions, please email mrr-2025-iccv@googlegroups.com.