Multimodal data is available in many applications, such as e-commerce product listings, social media posts, and short videos. However, existing algorithms dealing with these types of data still focus on uni-modal representation learning via vision-language alignment and cross-modal retrieval. In this workshop, we aim to introduce a new retrieval problem in which both queries and documents are multimodal. With the growing popularity of vision-language modeling, large language models (LLMs), retrieval-augmented generation (RAG), and multimodal LLMs, we see many new opportunities for multimodal representation and retrieval tasks. This will be a half-day workshop dedicated to multimodal representation and retrieval. The agenda includes keynote speeches, oral presentations, and an interactive panel discussion.
Submissions of short papers must be in English, in PDF format, and at most 4 pages long (including figures, tables, proofs, appendices, acknowledgments, and any content except references), with unrestricted space for references, in the current ACM two-column conference format. Suitable LaTeX, Word, and Overleaf templates are available from the ACM Website (use the "sigconf" proceedings template for LaTeX and the Interim Template for Word). ACM's CCS concepts and keywords are required for review.
For LaTeX, the following should be used:
\documentclass[sigconf,natbib=true,anonymous=true]{acmart}
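For reference, a minimal skeleton along these lines might look as follows; the title, author block, CCS concept, keywords, and bibliography file below are placeholders, not requirements beyond what the call above states:

```latex
\documentclass[sigconf,natbib=true,anonymous=true]{acmart}

\begin{document}

% Placeholder title and anonymized author block (submissions are double-blind)
\title{Your Short Paper Title}
\author{Anonymous Author(s)}
\affiliation{\institution{Anonymous Institution}\country{}}

\begin{abstract}
One-paragraph abstract of the submission.
\end{abstract}

% CCS concepts and keywords are required for review.
% The XML can be generated with the ACM CCS tool (https://dl.acm.org/ccs);
% the concept used here is only an example.
\begin{CCSXML}
<ccs2012>
 <concept>
  <concept_id>10002951.10003317</concept_id>
  <concept_desc>Information systems~Information retrieval</concept_desc>
  <concept_significance>500</concept_significance>
 </concept>
</ccs2012>
\end{CCSXML}
\ccsdesc[500]{Information systems~Information retrieval}
\keywords{multimodal retrieval, representation learning}
\maketitle

% Body of the paper: at most 4 pages including everything except references.

% References do not count toward the page limit.
\bibliographystyle{ACM-Reference-Format}
\bibliography{references}
\end{document}
```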
Submissions must be anonymous and should be submitted electronically via EasyChair:
Image Captioning for Baidu Ad Image Generation with Multi-Stage Refinements.
Time | Activity | Host |
---|---|---|
9:00 AM - 9:05 AM | Opening Remarks | Doug Gray |
9:05 AM - 9:35 AM | Keynote Address by Hamed Zamani | Doug Gray |
9:35 AM - 10:35 AM | Oral Presentations | Xinliang Zhu |
10:35 AM - 10:45 AM | Coffee Break | - |
10:45 AM - 11:15 AM | Keynote Address by Dinesh Manocha | Arnab Dhua |
11:15 AM - 11:45 AM | Panel Discussion | Arnab Dhua |
11:45 AM - 11:50 AM | Closing Remarks | Xinliang Zhu |
11:50 AM - 12:15 PM | Networking | - |
Abstract: Information access systems, such as search engines and recommender systems, have long supported people in accomplishing a wide range of tasks. In this talk, I will discuss how one can broaden the scope of users of information access systems to include task-driven machines, such as generative AI models. In this way, the core principles of indexing, representation, retrieval, and ranking can be applied and extended to substantially improve model generalization, scalability, robustness, and interpretability. I will describe a generic retrieval-enhanced machine learning (REML) framework and connect this framework with the information retrieval literature. I will next introduce our recent implementations of REML for various language and vision tasks. Finally, I will discuss open problems in this area for future exploration.
Abstract: Perceiving and understanding non-speech sounds and non-verbal speech are essential for making informed decisions that facilitate our interactions with our surroundings. Audio is a crucial modality, offering rich, contextual information that complements visual and textual data, thereby enhancing the capabilities of AI systems. In this talk, we will highlight the significance of audio as an integral component in developing the next generation of intelligent AI agents. We will explain why audio is an indispensable modality for AI, showing how humans naturally and extensively rely on auditory cues to navigate and comprehend the physical and virtual worlds. Understanding auditory signals is fundamental to creating AI systems that can interact with the world in a more human-like and intuitive manner. Next, we will discuss how contemporary AI systems are beginning to integrate audio perception with other modalities to achieve more holistic and accurate environmental awareness. We will describe key advancements and methodologies that enable these multimodal integrations, focusing on the role of audio encoders and large language models (LLMs) in this synergy. Finally, we will address the open challenges and future directions in the field of audio question answering and multimodal AI.