Humans share a wide variety of images related to their personal experiences within conversations via instant messaging tools. However, existing works (1) focus on image-sharing behavior within single sessions, which limits long-term social interaction, and (2) lack personalized image-sharing behavior. In this work, we introduce Stark, a large-scale long-term multi-modal dialogue dataset that covers a wide range of social personas, time intervals, and images in a multi-modal format. To construct Stark automatically, we propose Mcu, a novel multi-modal contextualization framework that generates long-term multi-modal dialogues by distilling from ChatGPT and our proposed plan-and-execute image aligner. Using Stark, we train a multi-modal conversation model, Ultron 7B, which demonstrates impressive visual imagination ability. Furthermore, we demonstrate the effectiveness of our dataset through human evaluation.
Despite being constructed with generative models, namely ChatGPT and our proposed image aligner (which includes several diffusion models), our dataset preserves the relevance of image-sharing moments and the quality of the generated images. This demonstrates the robustness and reliability of our framework in producing coherent and engaging multi-modal conversations, even when compared to datasets that use actual photo-realistic images.
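As a minimal usage sketch, assuming the dataset is distributed via the Hugging Face Hub, it could be loaded with the `datasets` library. The repository identifier below is a placeholder rather than the official release name, and the available fields depend on the released schema.

# Minimal sketch for loading Stark, assuming a Hugging Face Hub release.
# "your-org/stark" is a placeholder repository id, not the official name.
from datasets import load_dataset

stark = load_dataset("your-org/stark", split="train")

# Inspect one long-term multi-modal dialogue episode;
# the field names depend on the released schema.
episode = stark[0]
print(episode.keys())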
@misc{lee2024starksociallongtermmultimodal,
  title={Stark: Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge},
  author={Young-Jun Lee and Dokyong Lee and Junyoung Youn and Kyeongjin Oh and Byungsoo Ko and Jonghwan Hyeon and Ho-Jin Choi},
  year={2024},
  eprint={2407.03958},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2407.03958},
}