We present the problem of Visually Precise Query (VPQ) generation, which enables a more intuitive match between a user's information need and an e-commerce site's product descriptions. Given an image of a fashion item, what is the optimal search query that will retrieve the exact same product, or closely related products, with high probability? The VPQ generation task takes a product image and its title as input and produces a word-level extractive summary of the title: a list of salient attributes that can then be used as a query to search for similar products. We collect a large dataset of fashion images and their titles and merge it with an existing research dataset that was created for a different task. Given an image and title pair, the VPQ problem is posed as identifying a non-contiguous collection of spans within the title. We release a dataset of around 400K image, title, and corresponding VPQ entries to the research community. We describe the data collection process in detail and discuss future research directions for the problem. We provide standard text-domain and visual-domain baselines, as well as multi-modal baseline models, to analyze the task. Finally, we propose a hybrid fusion model, which we see as a promising direction for research in the multi-modal community.
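To make the task formulation concrete, the sketch below shows one way to represent a VPQ as a binary keep/drop mask over title tokens, from which the non-contiguous spans can be read off. The title, the mask, and the helper names `apply_mask` and `spans_from_mask` are hypothetical illustrations; they are not taken from the released dataset or from the models in the paper, and whitespace tokenization is an assumption.

```python
from typing import List, Tuple


def apply_mask(title: str, mask: List[int]) -> str:
    """Keep the title tokens whose mask entry is 1 (whitespace tokenization assumed)."""
    tokens = title.split()
    if len(tokens) != len(mask):
        raise ValueError("mask must have one entry per title token")
    return " ".join(tok for tok, keep in zip(tokens, mask) if keep)


def spans_from_mask(mask: List[int]) -> List[Tuple[int, int]]:
    """Group consecutive kept tokens into half-open (start, end) token spans."""
    spans: List[Tuple[int, int]] = []
    start = None
    for i, keep in enumerate(mask):
        if keep and start is None:
            start = i                      # a new span opens here
        elif not keep and start is not None:
            spans.append((start, i))       # the open span closes before token i
            start = None
    if start is not None:
        spans.append((start, len(mask)))   # span runs to the end of the title
    return spans


# Hypothetical product title and a hand-written mask, for illustration only.
title = "Brand X Women Blue Floral Print A-Line Dress with Pockets"
mask = [0, 0, 1, 1, 1, 0, 1, 1, 0, 0]

vpq = apply_mask(title, mask)
spans = spans_from_mask(mask)
```

Under this framing, a model for the task would predict the mask from the image and title jointly; here the mask is simply written by hand.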
Recommended citation: Dasgupta, Riddhiman and Tom, Francis and Kumar, Sudhir and Das Gupta, Mithun and Kumar, Yokesh and Patro, Badri N. and Namboodiri, Vinay P. (2020). "Visually Precise Query." Proceedings of the 28th ACM International Conference on Multimedia, New York, NY, USA, 2020.