Sitemap

A list of all the posts and pages found on the site. For the robots out there, an XML version is available for digesting as well.

Pages

Posts

Future Blog Post

less than 1 minute read

Published:

This post will show up by default. To disable scheduling of future posts, edit _config.yml and set future: false.

Blog Post number 4

less than 1 minute read

Published:

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Blog Post number 3

less than 1 minute read

Published:

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Blog Post number 2

less than 1 minute read

Published:

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Blog Post number 1

less than 1 minute read

Published:

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Portfolio

Publications

Image retrieval based on projective invariance

Published in IEEE International Conference on Image Processing (ICIP), 2004

We propose an image retrieval scheme based on projectively invariant features. Since the cross-ratio is the fundamental invariant for points under projective transformations, we use it as the basic feature parameter. We compute the cross-ratios of point sets in quadruplets and obtain a discrete representation of the distribution of the cross-ratio from the computed values. This distribution is used as the feature for retrieval. The method is very effective in retrieving images of scenes, such as buildings, that have similar planar 3D structures.

Recommended citation: Rajashekhar, S. Chaudhuri and V. P. Namboodiri (2004). "Image retrieval based on projective invariance." IEEE International Conference on Image Processing (ICIP), Singapore, October 2004, pp. 405-408. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1418776
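
As a pointer for readers, the projective invariant the abstract relies on is the standard cross-ratio of four collinear points A, B, C, D (the textbook definition, not a formula quoted from the paper):

```latex
% Cross-ratio of four collinear points A, B, C, D; \overline{AC} denotes the
% signed distance from A to C. This quantity is preserved by any projective
% transformation of the line.
(A, B; C, D) = \frac{\overline{AC} \cdot \overline{BD}}{\overline{BC} \cdot \overline{AD}}
```

The retrieval feature is then built from the distribution of such cross-ratios computed over point quadruplets.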

Use of Linear Diffusion in depth estimation based on defocus cue

Published in Fourth Indian Conference on Computer Vision, Graphics & Image Processing, 2004

Diffusion has been used extensively in computer vision. Most common applications of diffusion have been in low level vision problems like segmentation and edge detection. In this paper a novel application of the linear diffusion principle is made for the estimation of depth using the properties of the real aperture imaging system. The method uses two defocused images of a scene and the lens parameter setting as input and estimates the depth in the scene, and also generates the corresponding fully focused equivalent pin-hole image. The algorithm described here also brings out the equivalence of the two modalities, viz. depth from focus and depth from defocus for structure recovery.

Recommended citation: V.P. Namboodiri and S. Chaudhuri (2004). “Use of Linear Diffusion in depth estimation based on defocus cue”, Proceedings of Fourth Indian Conference on Computer Vision, Graphics & Image Processing (ICVGIP), Kolkata, India, December 2004. http://vinaypn.github.io/files/icvgip04.pdf
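
For orientation, the standard scale-space fact behind this line of work (stated generically, not quoted from the paper) is that Gaussian defocus blur can be generated by linear diffusion, so the diffusion coefficient carries the depth-dependent blur:

```latex
% Linear diffusion of the focused image u_0; running the heat equation for
% time t with coefficient c produces a Gaussian blur of variance 2 c t.
\frac{\partial u}{\partial t} = c\, \nabla^{2} u, \qquad
u(x, y, 0) = u_0(x, y), \qquad
\sigma^{2}(x, y) = 2\, c(x, y)\, t
```

A space-varying blur radius σ(x, y), and hence the scene depth, therefore maps to a space-varying coefficient c(x, y).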

Shock Filters based on Implicit Cluster Separation

Published in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2005

One of the classic problems in low level vision is image restoration. An important contribution toward this effort has been the development of shock filters by Osher and Rudin (1990), which perform image deblurring using hyperbolic partial differential equations. In this paper we relate the notion of cluster separation from the field of pattern recognition to the shock filter formulation. A new type of shock filter is proposed based on gradient-based separation of clusters. The proposed formulation is general in that it can accommodate various models of density functions in the cluster separation process. The efficacy of the method is demonstrated through various examples.

Recommended citation: V.P. Namboodiri and S. Chaudhuri (2005). "Shock Filters based on Implicit Cluster Separation." Proc. of IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, June 2005, pp. 1-6. http://vinaypn.github.io/files/cvpr05.pdf
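
For context, the classical Osher-Rudin shock filter that this work generalizes evolves the image so that edges sharpen into shocks (standard formulation; the cluster-separation variant proposed in the paper is not reproduced here):

```latex
% eta is the unit direction of the image gradient; u_{eta eta} is the second
% derivative of u along that direction, so the sign test detects which side
% of an edge a pixel lies on.
\frac{\partial u}{\partial t} = -\operatorname{sign}\!\left(u_{\eta\eta}\right) \lvert \nabla u \rvert,
\qquad
u_{\eta\eta} = \frac{u_x^2 u_{xx} + 2 u_x u_y u_{xy} + u_y^2 u_{yy}}{u_x^2 + u_y^2}
```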

Improved Kernel-Based Object Tracking Under Occluded Scenarios

Published in Fifth Indian Conference on Computer Vision, Graphics & Image Processing, 2006

A successful approach for object tracking has been kernel based object tracking [1] by Comaniciu et al. The method provides an effective solution to the problems of representation and localization in tracking. It involves representing an object by a feature histogram with an isotropic kernel and performing a gradient based mean shift optimization for localizing the kernel. Though robust, this technique fails under occlusion. We improve kernel based object tracking by performing the localization using a generalized (bidirectional) mean shift based optimization, which makes the method resilient to occlusions. Another aspect related to the localization step is the handling of scale changes by varying the bandwidth of the kernel. Here, we suggest a technique based on SIFT features [2] by Lowe to enable the bandwidth of the kernel to change even in the presence of occlusion. We demonstrate the effectiveness of the proposed techniques through extensive experimentation on a number of challenging data sets.

Recommended citation: V.P. Namboodiri, A. Ghorawat and S. Chaudhuri (2006) “Improved Kernel-Based Object Tracking Under Occluded Scenarios”. In: Kalra P.K., Peleg S. (eds) Computer Vision, Graphics and Image Processing. Lecture Notes in Computer Science, vol 4338. Springer, Berlin, Heidelberg http://vinaypn.github.io/files/icvgip06.pdf
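
For readers who want to experiment, the sketch below shows the standard histogram-plus-mean-shift tracking loop that the paper improves on, using plain OpenCV; the bidirectional mean shift and SIFT-based bandwidth update proposed in the paper are not implemented here, and the video path and initial window are placeholders.

```python
import cv2
import numpy as np

# Placeholder video path and initial object window (x, y, w, h).
cap = cv2.VideoCapture("input.avi")
ok, frame = cap.read()
track_window = (200, 150, 60, 80)

# Build the target model: a hue histogram inside the initial window.
x, y, w, h = track_window
roi = frame[y:y + h, x:x + w]
hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
mask = cv2.inRange(hsv_roi, np.array((0., 60., 32.)), np.array((180., 255., 255.)))
roi_hist = cv2.calcHist([hsv_roi], [0], mask, [180], [0, 180])
cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)

# Stop after 10 mean-shift iterations or when the shift is below 1 pixel.
term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    # Back-project the target histogram to get a likelihood map,
    # then localize the kernel with mean shift.
    back_proj = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)
    _, track_window = cv2.meanShift(back_proj, track_window, term_crit)

cap.release()
```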

Retrieval of images of man-made structures based on projective invariance

Published in Pattern Recognition Journal, 2007

In this paper we propose a geometry-based image retrieval scheme that makes use of projectively invariant features. The cross-ratio (CR) is an invariant under projective transformations for collinear points. We compute the CRs of point sets in quadruplets and use the CR histogram as the feature for retrieval. Being a geometric feature, it allows us to retrieve similar images irrespective of viewpoint and illumination changes. We can retrieve the same building even if the facade has received a fresh coat of paint! Color and textural features can also be included, if desired. Experimental results show very good retrieval accuracy when tested on an image database of size 4000. The method is very effective in retrieving images of man-made objects rich in polygonal structures, such as buildings, rail tracks, etc.

Recommended citation: Rajashekhar, S. Chaudhuri and V.P. Namboodiri (2007), “Retrieval of images of man-made structures based on projective invariance”. Pattern Recognition Journal, Volume 40, Issue 1, January 2007, Pages 296-308 https://www.sciencedirect.com/science/article/pii/S0031320306001671

On defocus, diffusion and depth estimation

Published in Pattern Recognition Letters, 2007

An intrinsic property of real aperture imaging has been that the observations tend to be defocused. This artifact has been used in an innovative manner by researchers for depth estimation, since the amount of defocus varies with varying depth in the scene. There have been various methods to model the defocus blur. We model the defocus process using the model of diffusion of heat. The diffusion process has been traditionally used in low level vision problems like smoothing, segmentation and edge detection. In this paper a novel application of the diffusion principle is made for generating the defocus space of the scene. The defocus space is the set of all possible observations for a given scene that can be captured using a physical lens system. Using the notion of defocus space we estimate the depth in the scene and also generate the corresponding fully focused equivalent pin-hole image. The algorithm described here also brings out the equivalence of the two modalities, viz. depth from focus and depth from defocus for structure recovery.

Recommended citation: V.P. Namboodiri and S. Chaudhuri (2007). “On defocus, diffusion and depth estimation” Pattern Recognition Letters Volume 28, Issue 3, 1 February 2007, Pages 311-319 http://vinaypn.github.io/files/prl07.pdf

Super-Resolution Using Sub-band Constrained Total Variation

Published in International Conference on Scale Space and Variational Methods in Computer Vision (SSVM), 2007

Super-resolution of a single image is a severely ill-posed problem in computer vision. One way to approach it is through a total variation based regularization framework, which helps in formulating an edge preserving scheme for super-resolution. However, this scheme tends to produce a piece-wise constant result. To address this issue, we extend the formulation by incorporating an appropriate sub-band constraint which ensures the preservation of textural details, traded off against the noise present in the observation. The proposed framework is extensively evaluated and the corresponding experimental results are presented.

Recommended citation: P. Chatterjee, V.P. Namboodiri, S. Chaudhuri (2007) “Super-Resolution Using Sub-band Constrained Total Variation”. In: Sgallari F., Murli A., Paragios N. (eds) Scale Space and Variational Methods in Computer Vision (SSVM 2007). Lecture Notes in Computer Science, vol 4485. Springer, Berlin, Heidelberg. http://vinaypn.github.io/files/ssvm07.pdf
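
For orientation, a generic total-variation regularized super-resolution objective of the kind this scheme starts from (the sub-band constraint contributed by the paper is omitted); z is the high-resolution estimate, y the low-resolution observation, H a blur operator and D a decimation operator:

```latex
\hat{z} = \arg\min_{z} \; \lVert y - D H z \rVert_{2}^{2}
          \;+\; \lambda \int_{\Omega} \lvert \nabla z \rvert \, dx \, dy
```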

Shape Recovery Using Stochastic Heat Flow

Published in British Machine Vision Conference (BMVC), 2007

We consider the problem of depth estimation from multiple images based on the defocus cue. For a Gaussian defocus blur, the observations can be shown to be the solution of a deterministic but inhomogeneous diffusion process. However, the diffusion process does not sufficiently address the case in which the Gaussian kernel is deformed. This deformation happens due to several factors like self-occlusion, possible aberrations and imperfections in the aperture. These issues can be solved by incorporating a stochastic perturbation into the heat diffusion process. The resultant flow is that of an inhomogeneous heat diffusion perturbed by a stochastic curvature driven motion. The depth in the scene is estimated from the coefficient of the stochastic heat equation without actually knowing the departure from the Gaussian assumption. Further, the proposed method also takes into account the non-convex nature of the diffusion process. The method provides a strong theoretical framework for handling the depth from defocus problem.

Recommended citation: V.P. Namboodiri and S. Chaudhuri (2007) “Shape Recovery Using Stochastic Heat Flow”, Proceedings of the British Machine Vision Conference 2007, University of Warwick, UK, September 10-13, 2007. http://vinaypn.github.io/files/bmvc07.pdf

Image Restoration using Geometrically Stabilized Reverse Heat Equation

Published in IEEE International Conference on Image Processing (ICIP), 2007

Blind restoration of blurred images is a classical ill-posed problem. There has been considerable interest in the use of partial differential equations to solve this problem. The blurring of an image has traditionally been modeled with the heat equation, following Witkin [10] and Koenderink [4]; this has been the basis of the Gaussian scale space. However, a similar theoretical formulation has not been possible for deblurring of images due to the ill-posed nature of the reverse heat equation. Here we consider the stabilization of the reverse heat equation. We do this by damping the distortion along the edges by adding a normal component of the heat equation in the forward direction. We use a stopping criterion based on the divergence of the curvature in the resulting reverse heat flow. The resulting stabilized reverse heat flow makes it possible to solve the challenging problem of blind space-varying deconvolution. The method is justified by a varied set of experimental results.

Recommended citation: V.P. Namboodiri and S. Chaudhuri (2007). "Image Restoration using Geometrically Stabilized Reverse Heat Equation." Proceedings of IEEE International Conference on Image Processing (ICIP), San Antonio, Texas, USA, 2007, pp. IV-413 - IV-416. http://vinaypn.github.io/files/icip07.pdf

Recovery of relative depth from a single observation using an uncalibrated (real-aperture) camera

Published in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008

In this paper we investigate the challenging problem of recovering the depth layers in a scene from a single defocused observation. The problem is definitely solvable if there are multiple observations. In this paper we show that one can perceive the depth in the scene even from a single observation. We use the inhomogeneous reverse heat equation to obtain an estimate of the blur, thereby preserving the depth information characterized by the defocus. However, the reverse heat equation, due to its parabolic nature, is divergent. We stabilize the reverse heat equation by considering the gradient degeneration as an effective stopping criterion. The amount of (inverse) diffusion is actually a measure of relative depth. Because of ill-posedness we propose a graph-cuts based method for inferring the depth in the scene using the amount of diffusion as a data likelihood and a smoothness condition on the depth in the scene. The method is verified experimentally on a varied set of test cases.

Recommended citation: V.P. Namboodiri and S. Chaudhuri (2008). "Recovery of relative depth from a single observation using an uncalibrated (real-aperture) camera." Proc. of IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Anchorage, AK, USA, June 2008, pp. 1-6. http://vinaypn.github.io/files/cvpr08.pdf

Regularized depth from defocus

Published in IEEE International Conference on Image Processing (ICIP), 2008

In the area of depth estimation from images, an interesting approach has been structure recovery from the defocus cue. Towards this end, there have been a number of approaches [4,6]. Here we propose a technique to estimate regularized depth from defocus using diffusion. The coefficient of the diffusion equation is modeled using a pair-wise Markov random field (MRF) ensuring spatial regularization to enhance the robustness of the estimated depth. This framework is solved efficiently using a graph-cuts based technique. The MRF representation is enhanced by incorporating a smoothness prior that is obtained from a graph based segmentation of the input images. The method is demonstrated on a number of data sets and its performance is compared with state of the art techniques.

Recommended citation: V.P. Namboodiri, S. Chaudhuri and S. Hadap (2008). “Regularized depth from defocus”, Proceedings of IEEE International Conference on Image Processing (ICIP), San Diego, CA, USA, pp. 1520-1523. http://vinaypn.github.io/files/icip08.pdf

Action Recognition: A Region Based Approach

Published in IEEE Workshop on Applications of Computer Vision (WACV), 2011

We address the problem of recognizing actions in real-life videos. Space-time interest point-based approaches have been widely prevalent towards solving this problem. In contrast, more spatially extended features such as regions have not been so popular. The reason is that any local region based approach requires the motion flow information for a specific region to be collated temporally. This is challenging as the local regions are deformable and not well delineated from their surroundings. In this paper we address this issue by using robust tracking of regions, and we show that it is possible to obtain region descriptors for classification of actions. This paper lays the groundwork for further investigation into region based approaches. Through this paper we make the following contributions: a) we advocate identification of salient regions based on motion segmentation; b) we adopt a state-of-the-art tracker for robust tracking of the identified regions rather than using isolated space-time blocks; c) we propose optical flow based region descriptors to encode the extracted trajectories in piece-wise blocks. We demonstrate the performance of our system on real-world data sets.

Recommended citation: H. Bilen, V.P. Namboodiri and L. Van Gool (2011). “Action recognition: A region based approach”, 2011 IEEE Workshop on Applications of Computer Vision (WACV), Kona, HI , 2011, pp. 294-300 http://vinaypn.github.io/files/wacv2011.pdf

Object and Action Classification with Latent Variables

Published in British Machine Vision Conference, 2011

In this paper we propose a generic framework to incorporate unobserved auxiliary information for classifying objects and actions. This framework allows us to explicitly account for localisation and alignment of representations for generic object and action classes as latent variables. We approach this problem in the discriminative setting as learning a max-margin classifier that infers the class label along with the latent variables. Through this paper we make the following contributions: a) we provide a method for incorporating latent variables into object and action classification; b) we specifically account for the presence of an explicit class related subregion which can include foreground and/or background; c) we explore a way to learn a better classifier by iterative expansion of the latent parameter space. We demonstrate the performance of our approach by rigorous experimental evaluation on a number of standard object and action recognition datasets.
Awarded: Best Paper Prize

Recommended citation: H. Bilen, V.P. Namboodiri and L. Van Gool (2011). “Object and Action Classification with Latent Variables”, In Jesse Hoey, Stephen McKenna and Emanuele Trucco, Proceedings of the British Machine Vision Conference, pages 17.1-17.11. BMVA Press, September 2011 http://vinaypn.github.io/files/bmvc2011.pdf

Super-resolution techniques for minimally invasive surgery

Published in MICCAI workshop on augmented environments for computer assisted interventions-AE-CAI, 2011

We propose the use of super-resolution techniques to aid visualization while carrying out minimally invasive surgical procedures. These procedures are performed using small endoscopic cameras, which inherently have limited imaging resolution. The use of higher-end cameras is technologically challenging and currently not yet cost effective. A promising alternative is to consider improving the resolution by post-processing the acquired images through the use of currently prevalent super-resolution techniques. In this paper we analyse the different methodologies that have been proposed for super-resolution and provide a comprehensive evaluation of the most significant algorithms. The methods are evaluated using challenging in-vivo real world medical datasets. We suggest that the use of a learning-based super-resolution algorithm combined with an edge-directed approach would be most suited for this application.

Recommended citation: V. De Smet, V.P. Namboodiri and L. Van Gool (2011). “Super-resolution techniques for minimally invasive surgery”, 6th MICCAI workshop on augmented environments for computer assisted interventions-AE-CAI 2011, Toronto, Canada. http://vinaypn.github.io/files/AECAI.pdf

Systematic evaluation of super-resolution using classification

Published in Visual Communications and Image Processing (VCIP), 2011

Currently two evaluation methods of super-resolution (SR) techniques prevail: The objective Peak Signal to Noise Ratio (PSNR) and a qualitative measure based on manual visual inspection. Both of these methods are sub-optimal: The latter does not scale well to large numbers of images, while the former does not necessarily reflect the perceived visual quality. We address these issues in this paper and propose an evaluation method based on image classification. We show that perceptual image quality measures like structural similarity are not suitable for evaluation of SR methods. On the other hand a systematic evaluation using large datasets of thousands of real-world images provides a consistent comparison of SR algorithms that corresponds to perceived visual quality. We verify the success of our approach by presenting an evaluation of three recent super-resolution algorithms on standard image classification datasets.

Recommended citation: V. De Smet, V.P. Namboodiri and L. Van Gool (2011). “Systematic evaluation of super-resolution using classification”, 2011 Visual Communications and Image Processing (VCIP), Tainan, 2011, pp. 1-4. http://vinaypn.github.io/files/VCIP.pdf

Classification with Global, Local and Shared Features

Published in Joint DAGM (German Association for Pattern Recognition) and OAGM Symposium, 2012

We present a framework that jointly learns and then uses multiple image windows for improved classification. Apart from using the entire image content as context, class-specific windows are added, as well as windows that target class pairs. The location and extent of the windows are set automatically by handling the window parameters as latent variables. This framework makes the following contributions: a) the addition of localized information through the class-specific windows improves classification, b) windows introduced for the classification of class pairs further improve the results, c) the windows and classification parameters can be effectively learnt using a discriminative max-margin approach with latent variables, and d) the same framework is suited for multiple visual tasks such as classifying objects, scenes and actions. Experiments demonstrate the aforementioned claims.

Recommended citation: Bilen H., Namboodiri V.P., Van Gool L.J. (2012) Classification with Global, Local and Shared Features. In: Pinz A., Pock T., Bischof H., Leberl F. (eds) Joint DAGM (German Association for Pattern Recognition) and OAGM Symposium. Lecture Notes in Computer Science, vol 7476. Springer, Berlin, Heidelberg, pp 134-143 http://vinaypn.github.io/files/dagm2012.pdf

Nonuniform Image Patch Exemplars for Low Level Vision

Published in IEEE Workshop on Applications of Computer Vision (WACV), 2013

We approach the classification problem in a discriminative setting, as learning a max-margin classifier that infers the class label along with the latent variables. Through this paper we make the following contributions: a) we provide a method for incorporating latent variables into object and action classification; b) these variables determine the relative focus on foreground vs. background information that is taken account of; c) we design an objective function to more effectively learn in unbalanced data sets; d) we learn a better classifier by iterative expansion of the latent parameter space. We demonstrate the performance of our approach through experimental evaluation.

Recommended citation: V. De Smet, L. Van Gool and V. P. Namboodiri, "Nonuniform image patch exemplars for low level vision," 2013 IEEE Workshop on Applications of Computer Vision (WACV), Tampa, FL, 2013, pp. 23-30. http://vinaypn.github.io/files/wacv2013.pdf

Object and Action Classification with Latent Window Parameters

Published in International Journal of Computer Vision (IJCV), 2014

In this paper we propose a generic framework to incorporate unobserved auxiliary information for classifying objects and actions. This framework allows us to automatically select a bounding box and its quadrants from which best to extract features. These spatial subdivisions are learnt as latent variables. The paper is an extended version of our earlier work [2], complemented with additional ideas, experiments and analysis.
We approach the classification problem in a discriminative setting, as learning a max-margin classifier that infers the class label along with the latent variables. Through this paper we make the following contributions: a) we provide a method for incorporating latent variables into object and action classification; b) these variables determine the relative focus on foreground vs. background information that is taken account of; c) we design an objective function to more effectively learn in unbalanced data sets; d) we learn a better classifier by iterative expansion of the latent parameter space. We demonstrate the performance of our approach through experimental evaluation on a number of standard object and action recognition data sets.

Recommended citation: H. Bilen, V.P. Namboodiri and L. Van Gool (2014), “Object and Action Classification with Latent Window Parameters”, International Journal of Computer Vision (IJCV) Vol: 106: 237 - 251, February 2014 http://vinaypn.github.io/files/ijcv2014.pdf

Object classification with adaptable regions

Published in IEEE Conference on Computer Vision and Pattern Recognition, 2014

In classification of objects substantial work has gone into improving the low level representation of an image by considering various aspects such as different features, a number of feature pooling and coding techniques and considering different kernels. Unlike these works, in this paper, we propose to enhance the semantic representation of an image. We aim to learn the most important visual components of an image and how they interact in order to classify the objects correctly. To achieve our objective, we propose a new latent SVM model for category level object classification. Starting from image-level annotations, we jointly learn the object class and its context in terms of spatial location (where) and appearance (what). Furthermore, to regularize the complexity of the model we learn the spatial and co-occurrence relations between adjacent regions, such that unlikely configurations are penalized. Experimental results demonstrate that the proposed method can consistently enhance results on the challenging Pascal VOC dataset in terms of classification and weakly supervised detection. We also show how semantic representation can be exploited for finding similar content.

Recommended citation: H. Bilen, M. Pedersoli, V. P. Namboodiri, T. Tuytelaars, L. Van Gool, “Object Classification with Adaptable Regions”, Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2014 http://vinaypn.github.io/files/cvpr2014.pdf

Mind the gap: Subspace based hierarchical domain adaptation

Published in Workshop on Transfer and Multi-View Learning, Advances in Neural Information Processing Systems (NIPS) 27, 2014

Domain adaptation techniques aim at adapting a classifier learnt on a source domain to work on the target domain. Exploiting the subspaces spanned by features of the source and target domains respectively is one approach that has been investigated towards solving this problem. These techniques normally assume the existence of a single subspace for the entire source / target domain. In this work, we consider the hierarchical organization of the data and consider multiple subspaces for the source and target domain based on the hierarchy. We evaluate different subspace based domain adaptation techniques under this setting and observe that using different subspaces based on the hierarchy yields consistent improvement over a non-hierarchical baseline.

Recommended citation: A. Raj, V. P. Namboodiri, T. Tuytelaars, “Mind the Gap: Subspace based Hierarchical Domain Adaptation”, Workshop on Transfer and Multi-View Learning, Advances in Neural Information Processing Systems (NIPS) 27, Canada, 2014 http://vinaypn.github.io/files/task2014.pdf
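
As a concrete reference point, below is a minimal numpy/scikit-learn sketch of the single-subspace alignment baseline (in the spirit of Fernando et al.'s subspace alignment) that the hierarchical variant builds on; the data, dimensions and the hierarchy handling are placeholders:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
Xs = rng.normal(size=(500, 100))         # source features (placeholder)
Xt = rng.normal(size=(400, 100)) + 0.5   # shifted target features (placeholder)
d = 20                                   # subspace dimensionality

# PCA bases of the source and target domains (columns are components).
Ps = PCA(n_components=d).fit(Xs).components_.T   # (100, d)
Pt = PCA(n_components=d).fit(Xt).components_.T   # (100, d)

# Alignment matrix M maps the source basis towards the target basis.
M = Ps.T @ Pt                                    # (d, d)

# Project: source features through the aligned basis, target through its own.
Xs_aligned = Xs @ Ps @ M                         # (500, d)
Xt_proj = Xt @ Pt                                # (400, d)

# A classifier trained on Xs_aligned can now be applied to Xt_proj.
print(Xs_aligned.shape, Xt_proj.shape)
```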

Where is my Friend? - Person identification in Social Networks

Published in Proceedings of the Eleventh IEEE International Conference on Automatic Face and Gesture Recognition (FG 2015), 2015

One of the interesting applications of computer vision is to be able to identify or detect persons in the real world. This problem has been posed in the context of identifying people in television series [2] or in multi-camera networks [8]. However, a common scenario for this problem is to be able to identify people among images prevalent on social networks. In this paper we present a method that aims to solve this problem in real world conditions where the person can be in any pose, profile and orientation and the face itself is not always clearly visible. Moreover, we show that the problem can be solved with weak supervision: only a label indicating whether the person is present or not, which is usually the case as people are tagged in social networks. This is challenging as there can be ambiguity in associating the right person. The problem is solved in this setting using a latent max-margin formulation where the identity of the person is the latent parameter that is classified. This framework builds on other off-the-shelf computer vision techniques for person detection and face detection and is able to account for inaccuracies of these components. The idea is to model the complete person in addition to the face, with only weak supervision. We also contribute three real-world datasets that we have created for extensive evaluation of the solution. We show using these datasets that the problem can be effectively solved using the proposed method.

Recommended citation: D. Pathak, Sai Nitish S. and V. P. Namboodiri, “Where is my Friend? - Person identification in Social Networks”, Proceedings of the Eleventh IEEE International Conference on Automatic Face and Gesture Recognition (FG 2015), Ljubljana, Slovenia, 2015 http://vinaypn.github.io/files/fg2015.pdf

Subspace alignment based domain adaptation for RCNN detector

Published in Proceedings of British Machine Vision Conference (BMVC), 2015

In this paper, we propose subspace alignment based domain adaptation of the state of the art RCNN based object detector. The aim is to be able to achieve high quality object detection in novel, real world target scenarios without requiring labels from the target domain. While unsupervised domain adaptation has been studied in the case of object classification, for object detection it has been relatively unexplored. In subspace based domain adaptation for objects, we need access to source and target subspaces for the bounding box features. The absence of supervision (labels and bounding boxes are absent) makes the task challenging. In this paper, we show that we can still adapt subspaces that are localized to the object by obtaining detections from the RCNN detector trained on the source and applied on the target. We then form localized subspaces from the detections and show that subspace alignment based adaptation between these subspaces yields improved object detection. This evaluation is done by considering challenging real world datasets with PASCAL VOC as source and the validation set of the Microsoft COCO dataset as target for various categories.

Recommended citation: Anant Raj, Vinay P. Namboodiri and Tinne Tuytelaars, “Subspace Alignment based Domain Adaptation for RCNN Detector”, Proceedings of British Machine Vision Conference (BMVC 2015), Swansea, UK, 2015 http://vinaypn.github.io/files/bmvc2015rnt.pdf

Adapting RANSAC SVM to Detect Outliers for Robust Classification.

Published in Proceedings of British Machine Vision Conference (BMVC), 2015

Most visual classification tasks assume the authenticity of the label information. However, due to several reasons such as difficulty of annotation or inadvertent human error, the annotation can often be noisy. This results in examples that are wrongly annotated. In this paper, we consider the examples that are wrongly annotated to be outliers. The task of learning a robust inlier model in the presence of outliers is typically done through the RANSAC algorithm. In this paper, we show that instead of adopting RANSAC to obtain the 'right' model, we can use many instances of randomly sampled sets to build a large number of models. The collective decision of all these classifiers can be used to identify samples that are likely to be outliers. This results in a modification to RANSAC SVM that explicitly obtains probable outliers from the set of given samples. Once the outliers are detected, these examples are excluded from the training set. The method can also be used to identify very hard examples from the training set; in this case, where we believe the examples are correctly annotated, we can achieve good generalization when such examples are excluded from the training set. The method is evaluated using the standard PASCAL VOC dataset. We show that the method is particularly suited for identifying wrongly annotated examples, resulting in an improvement of more than 12% over the RANSAC SVM approach. Hard examples in the PASCAL VOC dataset are also identified by this method, and this even results in a marginal improvement of the classification accuracy over the base classifier provided with all clean samples.

Recommended citation: Subhabrata Debnath, Anjan Banerjee and Vinay P. Namboodiri, “Adapting RANSAC SVM to detect outliers for Robust Classification”,Proceedings of British Machine Vision Conference (BMVC 2015), Swansea, UK, 2015 http://vinaypn.github.io/files/bmvc2015dbn.pdf
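
A rough, self-contained sketch of the random-subset voting mechanism described above, on synthetic data (an illustration of the general idea, not the paper's exact formulation or parameters):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic two-class data with a few deliberately flipped (noisy) labels.
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
flipped = rng.choice(200, size=10, replace=False)
y[flipped] = 1 - y[flipped]

n_models, subset_size = 50, 60
misclassified = np.zeros(len(y))

# Train many linear SVMs on random subsets and count, for every sample,
# how often the resulting models disagree with its given label.
for _ in range(n_models):
    idx = rng.choice(len(y), size=subset_size, replace=False)
    clf = SVC(kernel="linear").fit(X[idx], y[idx])
    misclassified += (clf.predict(X) != y)

# Samples misclassified by most models are flagged as probable outliers.
scores = misclassified / n_models
outliers = np.argsort(scores)[::-1][:10]
print("flagged:", sorted(outliers), "actually flipped:", sorted(flipped))
```

Samples flagged this way are then dropped before retraining the final classifier on the cleaned set.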

EDS pooling layer

Published in Journal of Image and Vision Computing, 2020

Convolutional neural networks (CNNs) have been the source of recent breakthroughs in many vision tasks. Feature pooling layers are widely used in CNNs to reduce the spatial dimensions of the feature maps of the hidden layers. This gives CNNs the property of spatial invariance and also results in speed-up and reduces over-fitting. However, it also causes significant information loss. All existing feature pooling layers follow a one-step procedure for spatial pooling, which affects the overall performance due to significant information loss. Not much work has been done on efficient feature pooling operations in CNNs. To reduce the loss of information at this critical operation of CNNs, we propose a new EDS layer (Expansion Downsampling learnable-Scaling) to replace the existing pooling mechanism. We propose a two-step procedure to minimize the information loss by increasing the number of channels in the pooling operation. We also use feature scaling in the proposed EDS layer to highlight the most relevant channels/feature-maps. Our results show a significant improvement over the generally used pooling methods such as MaxPool, AvgPool, and StridePool (strided convolutions with stride > 1). We have done experiments on image classification and object detection tasks. ResNet-50 with our proposed EDS layer has performed comparably to ResNet-152 with stride pooling on the ImageNet dataset.

Recommended citation: P. Singh, P. Raj and V.P. Namboodiri, "EDS Pooling Layer", Journal of Image and Vision Computing, Volume 98, June 2020, 103923 https://www.sciencedirect.com/science/article/abs/pii/S026288562030055X

Deep attributes for one-shot face recognition

Published in ECCV Workshop on ‘Transferring and Adapting Source Knowledge in Computer Vision’, 2016

We address the problem of one-shot unconstrained face recognition. This is addressed by using a deep attribute representation of faces. While face recognition has considered the use of attribute based representations, for one-shot face recognition the methods proposed so far have used different features that represent the limited example available. We postulate that by using an intermediate attribute representation, it is possible to outperform purely face based feature representations for one-shot recognition. We use two one-shot face recognition techniques based on exemplar SVM and the one-shot similarity kernel to compare face based deep feature representations against deep attribute based representations. A key result: the evaluation on the standard 'Labeled Faces in the Wild' dataset suggests that deep attribute based representations can outperform deep feature based face representations for this problem of one-shot face recognition.

Recommended citation: A. Jadhav, V.P. Namboodiri and K.S. Venkatesh, “Deep Attributes for One-Shot Face Recognition”, ECCV Workshop on ‘Transferring and Adapting Source Knowledge in Computer Vision’, Amsterdam, 2016 http://vinaypn.github.io/files/task2016.pdf

Using Gaussian Processes to Improve Zero-Shot Learning with Relative Attributes

Published in Proceedings of Asian Conference on Computer Vision (ACCV), 2016

Relative attributes can serve as a very useful method for zero-shot learning of images. This was shown by the work of Parikh and Grauman [1] where an image is expressed in terms of attributes that are relatively specified between different class pairs. However, for zero-shot learning the authors had assumed a simple Gaussian Mixture Model (GMM) that used GMM based clustering to obtain the label for an unknown target test example. In this paper, we contribute a principled approach that uses Gaussian Process based classification to obtain the posterior probability for each sample of an unknown target class, in terms of Gaussian process classification and regression for nearest sample images. We analyse different variants of this approach and show that such a principled approach yields improved performance and a better understanding in terms of probabilistic estimates. The method is evaluated on the standard Pubfig and Shoes with Attributes benchmarks.

Recommended citation: Y. Dolma and V.P. Namboodiri, “Gaussian Processes to Improve Zero-Shot Learning with Relative Attributes”, Proceedings of Asian Conference on Computer Vision (ACCV), Taipei, Taiwan, 2016 http://vinaypn.github.io/files/accv2016.pdf

Contextual RNN-GANs for abstract reasoning diagram generation

Published in Thirty-First AAAI Conference on Artificial Intelligence (AAAI), 2017

Understanding, predicting, and generating object motions and transformations is a core problem in artificial intelligence. Modeling sequences of evolving images may provide better representations and models of motion and may ultimately be used for forecasting, simulation, or video generation. Diagrammatic Abstract Reasoning is an avenue in which diagrams evolve in complex patterns and one needs to infer the underlying pattern sequence and generate the next image in the sequence. For this, we develop a novel Contextual Generative Adversarial Network based on Recurrent Neural Networks (Context-RNN-GANs), where both the generator and the discriminator modules are based on contextual history (modeled as RNNs) and the adversarial discriminator guides the generator to produce realistic images for the particular time step in the image sequence. We evaluate the Context-RNN-GAN model (and its variants) on a novel dataset of Diagrammatic Abstract Reasoning, where it performs competitively with 10th-grade human performance but there is still scope for interesting improvements as compared to college-grade human performance. We also evaluate our model on a standard video next-frame prediction task, achieving improved performance over comparable state-of-the-art.

Recommended citation: A. Ghosh, V. Kulharia, A. Mukerjee, V.P. Namboodiri, M. Bansal, “Contextual RNN-GANs for Abstract Reasoning Diagram Generation”, Proceedings of Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), San Francisco, California, USA, February 2017 http://vinaypn.github.io/files/aaai2017.pdf

SketchSoup: Exploratory ideation using design sketches

Published in Computer Graphics Forum (CGF) Journal, 2017

A hallmark of early stage design is a number of quick-and-dirty sketches capturing design inspirations, model variations, and alternate viewpoints of a visual concept. We present SketchSoup, a workflow that allows designers to explore the design space induced by such sketches. We take an unstructured collection of drawings as input, register them using a multi-image matching algorithm, and present them as a 2D interpolation space. By morphing sketches in this space, our approach produces plausible visualizations of shape and viewpoint variations despite the presence of sketch distortions that would prevent standard camera calibration and 3D reconstruction. In addition, our interpolated sketches can serve as inspiration for further drawings, which feed back into the design space as additional image inputs. SketchSoup thus fills a significant gap in the early ideation stage of conceptual design by allowing designers to make better informed choices before proceeding to more expensive 3D modeling and prototyping. From a technical standpoint, we describe an end-to-end system that judiciously combines and adapts various image processing techniques to the drawing domain – where the images are dominated not by color, shading and texture, but by sketchy stroke contours.

Recommended citation: R. Arora, I. Darolia, V.P. Namboodiri, K. Singh and A. Bousseau, “SketchSoup: Exploratory Ideation Using Design Sketches”, Computer Graphics Forum, 2017 http://vinaypn.github.io/files/cgf2017.pdf

Compact Environment-Invariant Codes for Robust Visual Place Recognition

Published in 14th Conference on Computer and Robot Vision (CRV), 2017

Robust visual place recognition (VPR) requires scene representations that are invariant to various environmental challenges such as seasonal changes and variations due to ambient lighting conditions during day and night. Moreover, a practical VPR system necessitates compact representations of environmental features. To satisfy these requirements, in this paper we suggest a modification to the existing pipeline of VPR systems to incorporate supervised hashing. The modified system learns (in a supervised setting) compact binary codes from image feature descriptors. These binary codes imbibe robustness to the visual variations exposed to them during the training phase, thereby making the system adaptive to severe environmental changes. Also, incorporating supervised hashing makes VPR computationally more efficient and easy to implement on simple hardware. This is because binary embeddings can be learned over simple-to-compute features and the distance computation is also in the low-dimensional Hamming space of binary codes. We have performed experiments on several challenging data sets covering seasonal, illumination and viewpoint variations. We also compare two widely used supervised hashing methods, CCA-ITQ and MLH, and show that this new pipeline outperforms or closely matches the state-of-the-art deep learning VPR methods that are based on high-dimensional features extracted from pre-trained deep convolutional neural networks.

Recommended citation: U.Jain, V.P. Namboodiri and G. Pandey,“Supervised Hashing for Robust Visual Place Recognition”, 14th Conference on Computer and Robot Vision, Edmonton, Alberta, May 16-19, 2017 http://vinaypn.github.io/files/crv2017.pdf

Reactive Displays for Virtual Reality

Published in IEEE International Symposium on Mixed and Augmented Reality (ISMAR-Adjunct), 2017

The feeling of presence in virtual reality has enabled a large number of applications. These applications typically deal with 360° content. However, a large amount of existing content is available as images and videos, i.e. 2D content. Unfortunately, these do not react to the viewer's position or motion when viewed through a VR HMD. Thus in this work, we propose reactive displays for VR which instigate a feeling of discovery while exploring 2D content. We create this by taking into account the user's position and motion to compute homography based mappings that adapt the 2D content and re-project it onto the display. This allows the viewer to obtain a richer experience of interacting with 2D content, similar to the effect of viewing a scene through a window. We also provide a VR interface that uses a constrained set of reactive displays to easily browse through 360° content. The proposed interface tackles the problem of nausea caused by existing interfaces like photospheres by providing a natural room-like intermediate interface before changing 360° content. We perform user studies to evaluate both of our interfaces. The results show that the proposed reactive display interfaces are indeed beneficial.

Recommended citation: G S S Srinivas Rao, Neeraj Thakur, Vinay P. Namboodiri, “Reactive Displays for Virtual Reality”, Proceedings of 16th IEEE International Symposium on Mixed and Augmented Reality (ISMAR) (Poster Proceedings), Nantes, France, 2017 http://vinaypn.github.io/files/ismar2017.pdf
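
To make the re-projection step concrete, here is a minimal OpenCV sketch that warps a 2D image with a homography driven by a hard-coded head offset; the real system derives this offset from the HMD's tracked position, which is not reproduced here, and the file names are placeholders:

```python
import cv2
import numpy as np

img = cv2.imread("photo.jpg")          # placeholder 2D content
h, w = img.shape[:2]

# Pretend the viewer's head moved slightly to the right and up:
# shift the top corners of the virtual "window" accordingly.
dx, dy = 0.05 * w, -0.03 * h           # placeholder head offset in pixels

src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
dst = np.float32([[dx, dy], [w + dx, dy], [w, h], [0, h]])

# Homography mapping the original image plane to the reactive display.
H = cv2.getPerspectiveTransform(src, dst)
warped = cv2.warpPerspective(img, H, (w, h))

cv2.imwrite("reactive_view.jpg", warped)
```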

Visual Odometry Based Omni-directional Hyperlapse

Published in Proceedings of National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG 2017),, 2017

The prohibitive amount of time required to review the large quantities of data captured by surveillance and other cameras has brought into question the very utility of large scale video logging. Yet, one recognizes that such logging and analysis are indispensable to security applications. The only way out of this paradox is to devise expedited browsing through the creation of hyperlapse. We address the hyperlapse problem for the very challenging category of intensive egomotion, which makes the hyperlapse highly jerky. We propose an economical approach for trajectory estimation based on Visual Odometry and implement cost functions to penalize pose and path deviations. Also, this is implemented on data taken by an omni-directional camera, so that the viewer can opt to observe any direction while browsing. This requires many innovations, including handling the massive radial distortions and implementing scene stabilization that operates on the least distorted region of the omni view.

Recommended citation: P. Rani, A. Jangid, V.P. Namboodiri and K.S. Venkatesh, “Visual Odometry based Omni-directional Hyperlapse”, Proceedings of National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG 2017), Mandi, India 2017 http://vinaypn.github.io/files/ncvpripg2017.pdf

No Modes left behind: Capturing the data distribution effectively using GANs

Published in Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), 2018

Generative adversarial networks (GANs), while being very versatile in realistic image synthesis, are still sensitive to the input distribution. Given a set of data that has an imbalance in the distribution, the networks are susceptible to missing modes and not capturing the data distribution. While various methods have been tried to improve the training of GANs, these have not addressed the challenge of covering the full data distribution. Specifically, a generator is not penalized for missing a mode. We show that GANs are therefore still susceptible to not capturing the full data distribution. In this paper, we propose a simple approach that combines an encoder based objective with novel loss functions for the generator and discriminator that improves the solution in terms of capturing missing modes. We validate that the proposed method results in substantial improvements through detailed analysis on toy and real datasets. The quantitative and qualitative results demonstrate that the proposed method improves the solution for the problem of missing modes and improves the training of GANs.

Recommended citation: S. Sharma and V.P. Namboodiri, “No Modes left behind: Capturing the data distribution effectively using GANs”, Proceedings of Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), New Orleans, USA, February 2018 http://vinaypn.github.io/files/aaai2018.pdf

Word spotting in silent lip videos

Published in IEEE Winter Conference on Applications of Computer Vision (WACV), 2018

Our goal is to spot words in silent speech videos without explicitly recognizing the spoken words, where the lip motion of the speaker is clearly visible and audio is absent. Existing work in this domain has mainly focused on recognizing a fixed set of words in word-segmented lip videos, which limits the applicability of the learned model due to limited vocabulary and high dependency on the model's recognition performance. Our contribution is two-fold: 1) we develop a pipeline for recognition-free retrieval, and show its performance against recognition-based retrieval on a large-scale dataset and another set of out-of-vocabulary words; 2) we introduce a query expansion technique using pseudo-relevant feedback and propose a novel re-ranking method based on maximizing the correlation between spatio-temporal landmarks of the query and the top retrieval candidates. Our word spotting method achieves 35% higher mean average precision over the recognition-based method on the large-scale LRW dataset. Finally, we demonstrate the application of the method by word spotting in a popular speech video ("The Great Dictator" by Charlie Chaplin), where we show that word retrieval can be used to understand what was spoken, perhaps even in silent movies.

Recommended citation: A. Jha, V. P. Namboodiri and C. V. Jawahar, "Word Spotting in Silent Lip Videos," IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, 2018, pp. 150-159. http://vinaypn.github.io/files/wacv2018.pdf

Unsupervised domain adaptation of deep object detectors.

Published in European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), 2018

Domain adaptation has been well understood and adopted in vision. Recently, with the advent of deep learning, a number of techniques have been proposed for deep learning based domain adaptation. However, the methods proposed have mostly been used for adapting object classification techniques. In this paper, we address domain adaptation for object detection, which is more commonly used in practice. We adapt deep adaptation techniques to the Faster R-CNN framework, specifically recent techniques based on Gradient Reversal and Maximum Mean Discrepancy (MMD) reduction. Among them, we show that the MK-MMD based method, when used appropriately, provides the best results. We analyze our model with standard real world settings by using Pascal VOC as source and MS-COCO as target, and show a gain of 2.5 mAP at IoU of 0.5 over a source-only trained model. We show that this improvement is statistically significant.

Recommended citation: Debjeet Majumdar and Vinay P. Namboodiri, “Unsupervised domain adaptation of deep object detectors”, 26th European Symposium on Artificial Neural Networks, ESANN 2018, Bruges, Belgium, April 25-27, 2018. http://vinaypn.github.io/files/esann2018.pdf
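
For reference, a small numpy sketch of the squared MMD statistic with a single RBF kernel, the quantity that MK-MMD extends with multiple weighted kernels; the multi-kernel weighting and its integration into Faster R-CNN layers are not shown, and the feature matrices are placeholders:

```python
import numpy as np

def rbf_mmd2(Xs, Xt, gamma=1.0):
    """Biased estimate of the squared MMD between two feature sets
    under an RBF kernel with bandwidth parameter gamma."""
    def rbf(A, B):
        # Pairwise squared distances, then the Gaussian kernel.
        d2 = (np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :]
              - 2.0 * A @ B.T)
        return np.exp(-gamma * d2)

    k_ss = rbf(Xs, Xs).mean()
    k_tt = rbf(Xt, Xt).mean()
    k_st = rbf(Xs, Xt).mean()
    return k_ss + k_tt - 2.0 * k_st

rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, (256, 64))   # placeholder source features
Xt = rng.normal(0.5, 1.0, (256, 64))   # placeholder (shifted) target features
print("MMD^2:", rbf_mmd2(Xs, Xt))
```

In adaptation training this statistic (or its multi-kernel version) is added to the detection loss so that minimizing it pulls the source and target feature distributions together.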

Multi-agent diverse generative adversarial networks

Published in IEEE Conference on Computer Vision and Pattern Recognition, 2018

We propose MAD-GAN, an intuitive generalization to the Generative Adversarial Networks (GANs) and its conditional variants to address the well known problem of mode collapse. First, MAD-GAN is a multi-agent GAN architecture incorporating multiple generators and one discriminator. Second, to enforce that different generators capture diverse high probability modes, the discriminator of MAD-GAN is designed such that along with finding the real and fake samples, it is also required to identify the generator that generated the given fake sample. Intuitively, to succeed in this task, the discriminator must learn to push different generators towards different identifiable modes. We perform extensive experiments on synthetic and real datasets and compare MAD-GAN with different variants of GAN. We show high quality diverse sample generations for challenging tasks such as image-to-image translation and face generation. In addition, we also show that MAD-GAN is able to disentangle different modalities when trained using highly challenging diverse-class dataset (e.g. dataset with images of forests, icebergs, and bedrooms). In the end, we show its efficacy on the unsupervised feature representation task.

Recommended citation: A. Ghosh, V. Kulharia, V.P. Namboodiri, P.H.S. Torr and P. Dokania, “Multi-Agent Diverse Generative Adversarial Networks”, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, Utah, June 2018. http://openaccess.thecvf.com/content_cvpr_2018/html/Ghosh_Multi-Agent_Diverse_Generative_CVPR_2018_paper.html

Differential Attention for Visual Question Answering

Published in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018

In this paper we aim to answer questions based on images when provided with a dataset of question-answer pairs for a number of images during training. A number of methods have focused on solving this problem by using image based attention. This is done by focusing on a specific part of the image while answering the question. Humans also do so when solving this problem. However, the regions that the previous systems focus on are not correlated with the regions that humans focus on. The accuracy is limited due to this drawback. In this paper, we propose to solve this problem by using an exemplar based method. We obtain one or more supporting and opposing exemplars to obtain a differential attention region. This differential attention is closer to human attention than other image based attention methods. It also helps in obtaining improved accuracy when answering questions. The method is evaluated on challenging benchmark datasets. We perform better than other image based attention methods and are competitive with other state of the art methods that focus on both image and questions.

Recommended citation: B.N. Patro and V.P. Namboodiri, “Differential Attention for Visual Question Answering”, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, Utah, June 2018. https://badripatro.github.io/DVQA/

Eclectic domain mixing for effective adaptation in action spaces

Published in Journal on Multimedia Tools and Applications, 2018

Although videos appear to be very high-dimensional in terms of duration × frame-rate × resolution, temporal smoothness constraints ensure that the intrinsic dimensionality for videos is much lower. In this paper, we use this idea for investigating Domain Adaptation (DA) in videos, an area that remains under-explored. An approach that has worked well for image DA is based on subspace modeling of the source and target domains, which works under the assumption that the two domains share a latent subspace where the domain shift can be reduced or eliminated. In this paper, we first extend three subspace based image DA techniques to human action recognition and then combine them with our proposed Eclectic Domain Mixing (EDM) approach to improve the effectiveness of the DA. Further, we use discrepancy measures such as Symmetrized KL Divergence and Target Density Around Source for an empirical study of the proposed EDM approach. While this work mainly focuses on Domain Adaptation in videos, for completeness of the study we comprehensively evaluate our approach using both object and action datasets. In this paper, we have achieved consistent improvements over chosen baselines and obtained some state-of-the-art results for the datasets.

Recommended citation: Jamal, A., Deodhare, D., Namboodiri, V.P., Venkatesh, K.S. “Eclectic domain mixing for effective adaptation in action spaces”, Journal on Multimedia Tools and Applications, November 2018, Volume 77, Issue 22, pp 29949–29969. https://link.springer.com/article/10.1007/s11042-018-6179-y

Learning semantic sentence embeddings using sequential pair-wise discriminator

Published in International Conference on Computational Linguistics (COLING), 2018

In this paper, we propose a method for obtaining sentence-level embeddings. While the problem of securing word-level embeddings is very well studied, we propose a novel method for obtaining sentence-level embeddings. This is obtained by a simple method in the context of solving the paraphrase generation task. If we use a sequential encoder-decoder model for generating paraphrase, we would like the generated paraphrase to be semantically close to the original sentence. One way to ensure this is by adding constraints for true paraphrase embeddings to be close and unrelated paraphrase candidate sentence embeddings to be far. This is ensured by using a sequential pair-wise discriminator that shares weights with the encoder that is trained with a suitable loss function. Our loss function penalizes paraphrase sentence embedding distances from being too large. This loss is used in combination with a sequential encoder-decoder network. We also validated our method by evaluating the obtained embeddings for a sentiment analysis task. The proposed method results in semantic embeddings and outperforms the state-of-the-art on the paraphrase generation and sentiment analysis task on standard datasets. These results are also shown to be statistically significant.

Recommended citation: B.N. Patro, V.K. Kurmi, S. Kumar and V.P. Namboodiri, “Learning semantic sentence embeddings using sequential pair-wise discriminator”, Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 2018 https://badripatro.github.io/Question-Paraphrases/

Monoaural Audio Source Separation Using Variational Autoencoders.

Published in Proceedings of Interspeech Conference, 2018

We introduce a monaural audio source separation framework using a latent generative model. Traditionally, discriminative training for source separation has been carried out using deep neural networks or non-negative matrix factorization. In this paper, we propose a principled generative approach using variational autoencoders (VAE) for audio source separation. A VAE performs efficient approximate Bayesian inference, which leads to a continuous latent representation of the input data (spectrogram). It contains a probabilistic encoder that projects the input data to the latent space and a probabilistic decoder that projects data from the latent space back to the input space. This allows us to learn a robust latent representation of sources corrupted with noise and other sources. The latent representation is then fed to the decoder to yield the separated source. Both encoder and decoder are implemented as multilayer perceptrons (MLP). In contrast to prevalent techniques, we argue that VAE is a more principled approach to source separation. Experimentally, we find that the proposed framework yields reasonable improvements compared to baseline methods available in the literature, i.e., DNN and RNN with different masking functions and autoencoders. We show that our method performs better than the best of the relevant methods, with an improvement of ∼2 dB in the source-to-distortion ratio.
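
As a rough illustration of the pipeline above, the sketch below implements a small MLP-based VAE over magnitude-spectrogram frames; the layer sizes, the Softplus output, and the training objective shown are assumptions for this example, not the authors' configuration.

```python
import torch
import torch.nn as nn

class SpectrogramVAE(nn.Module):
    """Probabilistic encoder maps a spectrogram frame to a Gaussian latent;
    probabilistic decoder maps a latent sample back to the frame."""
    def __init__(self, n_bins=513, latent_dim=64, hidden=512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_bins, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_bins), nn.Softplus())   # non-negative magnitudes

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterisation
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
        return self.decoder(z), kl

# train by reconstructing the target source from mixture frames plus the KL term
vae = SpectrogramVAE()
mixture = torch.rand(8, 513)
recon, kl = vae(mixture)
loss = ((recon - mixture) ** 2).sum(dim=1).mean() + kl.mean()
```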

Recommended citation: Laxmi Pandey, Anurendra Kumar and Vinay P. Namboodiri, “Monoaural Audio Source Separation Using Variational Autoencoders.”, 19th Annual Conference of the International Speech Communication Association, Interspeech 2018, Hyderabad, India, 2-6 September 2018 http://vinaypn.github.io/files/interspeech2018.pdf

Deep Domain Adaptation in Action Space.

Published in Proceedings of British Machine Vision Conference (BMVC), 2018

In the general settings of supervised learning, human action recognition has been a widely studied topic. The classifiers learned in this setting assume that the training and test data have been sampled from the same underlying probability distribution. However, in most practical scenarios, this assumption does not hold, resulting in suboptimal performance of the classifiers. This problem, referred to as Domain Shift, has been extensively studied, but mostly for the image/object classification task. In this paper, we investigate the problem of Domain Shift in action videos, an area that has remained under-explored, and propose two new approaches named Action Modeling on Latent Subspace (AMLS) and Deep Adversarial Action Adaptation (DAAA). In the AMLS approach, the action videos in the target domain are modeled as a sequence of points on a latent subspace and adaptive kernels are successively learned between the source domain point and the sequence of target domain points on the manifold. In the DAAA approach, an end-to-end adversarial learning framework is proposed to align the two domains. The action adaptation experiments were conducted using various combinations of multi-domain action datasets, including six common classes of the Olympic Sports and UCF50 datasets and all classes of the KTH, MSR and our own SonyCam datasets. We achieve consistent improvements over the chosen baselines and obtain some state-of-the-art results on these datasets.

Recommended citation: Arshad Jamal, Vinay P. Namboodiri, Dipti Deodhare and K.S. Venkatesh, “Deep Domain Adaptation in Action Space”, British Machine Vision Conference 2018, BMVC 2018, Northumbria University, Newcastle, UK, September 3-6, 2018 http://bmvc2018.org/contents/papers/0960.pdf

Deep active learning for object detection.

Published in Proceedings of British Machine Vision Conference (BMVC), 2018

Object detection methods like the Single Shot Multibox Detector (SSD) provide highly accurate object detection and run in real time. However, these approaches require a large number of annotated training images, and evidently not all of these images are equally useful for training. Moreover, obtaining bounding-box annotations for each image is costly and tedious. In this paper, we aim to obtain a highly accurate object detector using only a fraction of the training images. We do this by adopting active learning, which uses a ‘human in the loop’ paradigm to select the set of images that would be most useful if annotated. Towards this goal, we make the following contributions: 1. We develop a novel active learning method which poses the layered architecture used in object detection as a ‘query by committee’ paradigm to choose the set of images to be queried. 2. We introduce a framework to use the exploration/exploitation trade-off in our methods. 3. We analyze the results on standard object detection datasets, which show that with only a third of the training data we can obtain more than 95% of the localization accuracy of full supervision. Further, our methods outperform classical uncertainty-based active learning algorithms such as maximum entropy.
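
The sketch below shows one plausible way to turn the ‘query by committee’ idea into a selection score, assuming the committee members are class-probability predictions read off different layers of the detector; the disagreement measure (average KL to the consensus) and all names here are assumptions of this example, not the paper's exact criterion.

```python
import torch

def committee_disagreement(member_probs):
    """Score an unlabeled image by how much the committee members disagree.
    `member_probs` is a (members, classes) tensor of class probabilities;
    a higher score suggests a more informative image to annotate."""
    consensus = member_probs.mean(dim=0)
    kl = (member_probs * (member_probs.clamp_min(1e-8).log()
                          - consensus.clamp_min(1e-8).log())).sum(dim=1)
    return kl.mean()

# rank a pool of unlabeled images by the score and query the top-k
pool_scores = [committee_disagreement(torch.softmax(torch.randn(6, 21), dim=1))
               for _ in range(100)]
```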

Recommended citation: Soumya Roy, Asim Unmesh and Vinay P. Namboodiri, “Deep active learning for object detection”, British Machine Vision Conference 2018, BMVC 2018, Northumbria University, Newcastle, UK, September 3-6, 2018 http://bmvc2018.org/contents/papers/0287.pdf

Multimodal differential network for visual question generation

Published in Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018

Generating natural questions from an image is a semantic task that requires using the visual and language modalities to learn multimodal representations. Images can have multiple visual and language contexts that are relevant for generating questions, namely places, captions, and tags. In this paper, we propose the use of exemplars for obtaining the relevant context. We obtain this by using a Multimodal Differential Network to produce natural and engaging questions. The generated questions show a remarkable similarity to natural questions, as validated by a human study. Further, we observe that the proposed approach substantially improves over state-of-the-art benchmarks on the quantitative metrics (BLEU, METEOR, ROUGE, and CIDEr).

Recommended citation: Badri N. Patro, Sandeep Kumar, Vinod K. Kurmi, Vinay P. Namboodiri,”Multimodal Differential Network for Visual Question Generation”, 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 2018 https://badripatro.github.io/MDN-VQG/

U-DADA: Unsupervised Deep Action Domain Adaptation

Published in Asian Conference on Computer Vision (ACCV), 2018

The problem of domain adaptation has been extensively studied for the object classification task; however, it has not been as well studied for recognizing actions. While object recognition is well understood, the wide variety of videos in action recognition makes addressing domain shift more challenging. We address this problem by proposing a novel adaptation technique that we term unsupervised deep action domain adaptation (U-DADA). The main concept we propose is that of explicitly modeling density-based adaptation and using it while adapting domains for recognizing actions. We show that these techniques work well both for domain adaptation through adversarial learning to obtain invariant features and for explicitly reducing the domain shift between distributions. The method is shown to work well on existing benchmark datasets such as UCF50, UCF101, HMDB51 and Olympic Sports. As a pioneering effort in the area of deep action adaptation, we present several benchmark results and techniques that could serve as baselines to guide future research in this area.

Recommended citation: Jamal A., Namboodiri V.P., Deodhare D., Venkatesh K.S. (2019) U-DADA: Unsupervised Deep Action Domain Adaptation. In: Jawahar C., Li H., Mori G., Schindler K. (eds) Computer Vision – ACCV 2018. ACCV 2018. Lecture Notes in Computer Science, vol 11363. Springer, https://link.springer.com/chapter/10.1007/978-3-030-20893-6_28

Supervised Hashing for Retrieval of Multimodal Biometric Data

Published in 3rd Workshop on Computer Vision Applications (WCVA), 2018

Biometric systems commonly utilize multi-biometric approaches, where a person is verified or identified based on multiple biometric traits. However, deployed systems usually require verification or identification against a large number of enrolled candidates, which is feasible only if there are efficient methods to retrieve relevant candidates in a multi-biometric system. To solve this problem, we analyze the use of hashing techniques for retrieval. Based on our analysis, we specifically recommend the use of supervised hashing over deep learned features as a common technique to solve this problem. Our investigation includes a comparison of some of the supervised and unsupervised methods, viz. Principal Component Analysis (PCA), Locality Sensitive Hashing (LSH), Locality-sensitive binary codes from shift-invariant kernels (SKLSH), Iterative Quantization (ITQ), Binary Reconstructive Embedding (BRE) and Minimum Loss Hashing (MLH), which represent the prevalent classes of such systems, and we present our analysis for the following biometric data: face, iris, and fingerprint on a number of standard datasets. The main technical contributions of this work are as follows: (a) a Siamese-network-based deep learned feature extraction method; (b) an analysis of common feature extraction techniques for multiple biometrics with respect to a reduced feature space representation; (c) advocacy of supervised hashing for obtaining a compact feature representation across different biometric traits; and (d) an analysis of the performance of deep representations against shallow representations in a practical reduced feature representation framework. Through experimentation with multiple biometric traits, feature representations, and hashing techniques, we conclude that deep learned features retrieved using supervised hashing can form a standard pipeline for most unimodal and multimodal biometric identification tasks.

Recommended citation: Sumesh T.A., Namboodiri V., Gupta P. (2019) Supervised Hashing for Retrieval of Multimodal Biometric Data. In: Arora C., Mitra K. (eds) Computer Vision Applications. WCVA 2018. Communications in Computer and Information Science, vol 1019. Springer, Singapore https://link.springer.com/chapter/10.1007/978-981-15-1387-9_8

Multi-layer pruning framework for compressing single shot multibox detector

Published in IEEE Winter Conference on Applications of Computer Vision (WACV), 2019

We propose a framework for compressing a state-of-the-art Single Shot MultiBox Detector (SSD). The framework addresses compression in the following stages: Sparsity Induction, Filter Selection, and Filter Pruning. In the Sparsity Induction stage, the object detector model is sparsified via an improved global threshold. In the Filter Selection & Pruning stage, we select and remove filters using sparsity statistics of filter weights in two consecutive convolutional layers. This results in a model smaller than most existing compact architectures. We evaluate the performance of our framework on multiple datasets and compare against multiple methods. Experimental results show that our method achieves state-of-the-art compression of 6.7X and 4.9X on the PASCAL VOC dataset for the SSD300 and SSD512 models, respectively. We further show that the method produces a maximum compression of 26X with SSD512 on the German Traffic Sign Detection Benchmark (GTSDB). Additionally, we empirically show our method's adaptability to the classification architecture VGG16 on the CIFAR and German Traffic Sign Recognition Benchmark (GTSRB) datasets, achieving compression rates of 125X and 200X with reductions in FLOPs of 90.50% and 96.6%, respectively, with no loss of accuracy. In addition, our method does not require any special libraries or hardware support for the resulting compressed models.
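
A toy sketch of the first two stages is given below: a global magnitude threshold zeroes small weights (sparsity induction), and filters are then ranked by how many non-zero weights survive (filter selection). The paper also uses statistics from the following layer; this single-layer version and the threshold value are simplifying assumptions of the example.

```python
import torch
import torch.nn as nn

def induce_sparsity(conv, threshold):
    """Sparsity induction: zero weights whose magnitude falls below a global
    threshold (a stand-in for the paper's improved global threshold)."""
    with torch.no_grad():
        conv.weight.mul_((conv.weight.abs() >= threshold).float())

def filters_to_prune(conv, keep_ratio=0.75):
    """Filter selection: keep the densest `keep_ratio` fraction of output
    filters (by surviving non-zero weights) and mark the rest for pruning."""
    nonzero = (conv.weight != 0).flatten(1).sum(dim=1)    # per-filter count
    order = torch.argsort(nonzero, descending=True)
    keep = set(order[: int(keep_ratio * len(order))].tolist())
    return sorted(set(range(conv.out_channels)) - keep)

conv = nn.Conv2d(64, 128, 3)
induce_sparsity(conv, threshold=1e-2)
print(filters_to_prune(conv))
```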

Recommended citation: P. Singh, Manikandan R., N. Matiyali and V. P. Namboodiri, "Multi-layer pruning framework for compressing single shot multibox detector," IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, Hawaii, USA. https://arxiv.org/abs/1811.08342

Stability Based Filter Pruning for Accelerating Deep CNNs

Published in IEEE Winter Conference on Applications of Computer Vision (WACV), 2019

Convolutional neural networks (CNNs) have achieved impressive performance on a wide variety of tasks (classification, detection, etc.) across multiple domains at the cost of high computational and memory requirements. Thus, leveraging CNNs for real-time applications necessitates model compression approaches that not only reduce the total number of parameters but also reduce the overall computation. In this work, we present a stability-based approach for filter-level pruning of CNNs. We evaluate our proposed approach on different architectures (LeNet, VGG-16, ResNet, and Faster RCNN) and datasets and demonstrate its generalizability through extensive experiments. Moreover, our compressed models can be used at run-time without requiring any special libraries or hardware. Our model compression method reduces the number of FLOPS by an impressive factor of 6.03X and GPU memory footprint by more than 17X, significantly outperforming other state-of-the-art filter pruning methods.

Recommended citation: P. Singh, Manikandan V.S.R. Kadi, N. Verma and V. P. Namboodiri, "Stability Based Filter Pruning for Accelerating Deep CNNs," IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, Hawaii, USA. https://arxiv.org/abs/1811.08321

Spotting words in silent speech videos: a retrieval-based approach

Published in Journal of Machine Vision and Applications (MVA), 2019

Our goal is to spot words in silent speech videos without explicitly recognizing the spoken words, where the lip motion of the speaker is clearly visible and audio is absent. Existing work in this domain has mainly focused on recognizing a fixed set of words in word-segmented lip videos, which limits the applicability of the learned model due to the limited vocabulary and a high dependency on the model's recognition performance. Our contribution is twofold: (1) we develop a pipeline for recognition-free retrieval and show its performance against recognition-based retrieval on a large-scale dataset and on another set of out-of-vocabulary words; (2) we introduce a query expansion technique using pseudo-relevant feedback and propose a novel re-ranking method based on maximizing the correlation between spatiotemporal landmarks of the query and the top retrieval candidates. Our word spotting method achieves 35% higher mean average precision than the recognition-based method on the large-scale LRW dataset. We also demonstrate the application of the method by spotting words in a popular speech video (“The Great Dictator” by Charlie Chaplin), where we show that word retrieval can be used to understand what was spoken, for instance in silent movies. Finally, we compare our model against ASR in a noisy environment and analyze the effect of the performance of the underlying lip-reader and of the input video quality on the proposed word spotting pipeline.

Recommended citation: A. Jha, V. P. Namboodiri and C. V. Jawahar,”Spotting words in silent speech videos: a retrieval-based approach”, Journal of Machine Vision and Applications, March 2019, Volume 30, Issue 2, pp 217–229 https://link.springer.com/article/10.1007/s00138-019-01006-y

Cross-language Speech Dependent Lip-synchronization

Published in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019

Understanding videos of people speaking across international borders is hard, as audiences from different demographics do not understand the language. Such speech videos are often supplemented with subtitles, but these hamper the viewing experience as the viewer's attention is shared. Simple audio dubbing in a different language makes the video appear unnatural due to unsynchronized lip motion. In this paper, we propose a system for automated cross-language lip synchronization for re-dubbed videos. Our model generates superior photorealistic lip synchronization over the original video in comparison to the current re-dubbing method. With the help of a user study, we verify that our method is preferred over unsynchronized videos.

Recommended citation: A. Jha, V. Voleti, V. Namboodiri and C. V. Jawahar, "Cross-language Speech Dependent Lip-synchronization," ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, 2019, pp. 7140-7144. http://vinaypn.github.io/files/icassp2019.pdf

HetConv: Heterogeneous kernel-based convolutions for deep CNNs

Published in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019

We present a novel deep learning architecture in which the convolution operation leverages heterogeneous kernels. The proposed HetConv (Heterogeneous Kernel-Based Convolution) reduces the computation (FLOPs) and the number of parameters as compared to the standard convolution operation while still maintaining representational efficiency. To show the effectiveness of our proposed convolution, we present extensive experimental results on standard convolutional neural network (CNN) architectures such as VGG and ResNet. We find that after replacing the standard convolutional filters in these architectures with our proposed HetConv filters, we achieve a 3X to 8X FLOPs based improvement in speed while still maintaining (and sometimes improving) the accuracy. We also compare our proposed convolution with group/depthwise convolutions and show that it achieves more FLOPs reduction with significantly higher accuracy.
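
A rough sketch of the heterogeneous-kernel idea is shown below: each output filter applies 3x3 kernels to only a 1/p fraction of its input channels (realised here with a grouped convolution) and cheap 1x1 kernels elsewhere. Folding the 1x1 branch over all input channels is a simplification made for this example; it is not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class HetConvLike(nn.Module):
    """Illustrative heterogeneous-kernel convolution: a grouped 3x3 branch
    (each output sees in_ch/p channels with 3x3 kernels) plus a 1x1 branch
    applied over all input channels."""
    def __init__(self, in_ch, out_ch, p=4):
        super().__init__()
        assert in_ch % p == 0 and out_ch % p == 0
        self.conv3x3 = nn.Conv2d(in_ch, out_ch, 3, padding=1, groups=p, bias=False)
        self.conv1x1 = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        return self.conv3x3(x) + self.conv1x1(x)

print(HetConvLike(64, 128, p=4)(torch.randn(1, 64, 32, 32)).shape)  # [1, 128, 32, 32]
```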

Recommended citation: Pravendra Singh, Vinay Kumar Verma, Piyush Rai and Vinay P. Namboodiri, “HetConv: Heterogeneous Kernel-Based Convolutions for Deep CNNs”, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, California, June 2019. http://openaccess.thecvf.com/content_CVPR_2019/papers/Singh_HetConv_Heterogeneous_Kernel-Based_Convolutions_for_Deep_CNNs_CVPR_2019_paper.pdf

Attending to Discriminative Certainty for Domain Adaptation

Published in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019

In this paper, we aim to solve unsupervised domain adaptation of classifiers, where we have access to label information for the source domain but not for the target domain. While various methods have been proposed for solving this, including adversarial discriminator based methods, most approaches have focused on adapting the entire image. In an image, there are regions that can be adapted better; for instance, the foreground object may be similar in nature across domains. To obtain such regions, we propose methods that consider the probabilistic certainty estimates of various regions and focus on these regions during classification for adaptation. We observe that just by incorporating the probabilistic certainty of the discriminator while training the classifier, we are able to obtain state-of-the-art results on various datasets as compared against recent methods. We provide a thorough empirical analysis of the method through ablation analysis, statistical significance tests, and visualization of the attention maps and t-SNE embeddings. These evaluations convincingly demonstrate the effectiveness of the proposed approach.
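
One way to read "probabilistic certainty of the discriminator" is sketched below: a dropout-enabled domain discriminator is sampled several times and low predictive variance is treated as high certainty, which then weights the feature map. The discriminator architecture, shapes, and the variance-based certainty measure are assumptions of this illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

def certainty_map(discriminator, feat, n_samples=10):
    """Per-region certainty from the variance of a stochastic (dropout)
    domain discriminator; `feat` is (B, C, H, W), output is (B, 1, H, W)."""
    discriminator.train()                          # keep dropout active
    with torch.no_grad():
        preds = torch.stack([torch.sigmoid(discriminator(feat))
                             for _ in range(n_samples)])     # (S, B, 1, H, W)
    return 1.0 - preds.var(dim=0)                  # low variance -> high certainty

# a tiny fully-convolutional domain discriminator with dropout
disc = nn.Sequential(nn.Conv2d(256, 64, 1), nn.ReLU(), nn.Dropout2d(0.5),
                     nn.Conv2d(64, 1, 1))
feat = torch.randn(2, 256, 14, 14)
weighted_feat = feat * certainty_map(disc, feat)   # attend to certain regions
```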

Recommended citation: Vinod Kumar Kurmi*, Shanu Kumar* and Vinay P Namboodiri, “Attending to Discriminative Certainty for Domain Adaptation”, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, California, June 2019. http://openaccess.thecvf.com/content_CVPR_2019/papers/Kurmi_Attending_to_Discriminative_Certainty_for_Domain_Adaptation_CVPR_2019_paper.pdf

Unsupervised Synthesis of Anomalies in Videos: Transforming the Normal

Published in International Joint Conference on Neural Networks (IJCNN) , 2019

Abnormal activity recognition requires detecting the occurrence of anomalous events, which suffer from a severe imbalance in data. In a video, normal is used to describe activities that conform to usual events, while the irregular events which do not conform to the normal are referred to as abnormal. It is far more common to observe normal data than to obtain abnormal data in visual surveillance. In this paper, we propose an approach where we can obtain abnormal data by transforming normal data. This is a challenging task that is solved through a multi-stage pipeline approach. We utilize a number of techniques from unsupervised segmentation in order to synthesize new samples of data that are transformed from an existing set of normal examples. Further, this synthesis approach has useful applications as a data augmentation technique. An incrementally trained Bayesian convolutional neural network (CNN) is used to carefully select the set of abnormal samples that can be added. Finally, through this synthesis approach we obtain a comparable set of abnormal samples that can be used for training the CNN for the classification of normal vs. abnormal samples. We show that this method generalizes to multiple settings by evaluating it on two real-world datasets and achieves improved performance over other probabilistic techniques that have been used in the past for this task.

Recommended citation: Abhishek Joshi and Vinay P. Namboodiri,”Unsupervised Synthesis of Anomalies in Videos: Transforming the Normal”, Proceedings of International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary https://arxiv.org/abs/1904.06633

Looking back at Labels: A Class based Domain Adaptation Technique

Published in International Joint Conference on Neural Networks (IJCNN) , 2019

In this paper, we tackle the problem of Domain Adaptation. In a domain adaptation setting, we are provided a labeled source dataset with multiple classes and a target dataset that has no supervision. In this setting, we propose an adversarial discriminator based approach. While approaches based on an adversarial discriminator have been proposed previously, in this paper we present an informed adversarial discriminator. Our observation relies on the analysis showing that if the discriminator has access to all the available information, including the class structure present in the source dataset, then it can guide the transformation of the target features to a more structured adapted space. Using this formulation, we obtain state-of-the-art results for the standard evaluation on benchmark datasets. We further provide a detailed analysis which shows that using all the labeled information results in improved domain adaptation.

Recommended citation: Vinod Kumar Kurmi and Vinay P. Namboodiri, “Looking back at Labels: A Class based Domain Adaptation Technique”, Proceedings of International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary https://vinodkkurmi.github.io/DiscriminatorDomainAdaptation/

Play and Prune: Adaptive Filter Pruning for Deep Model Compression

Published in International Joint Conference on Artificial Intelligence (IJCAI-2019), 2019

While convolutional neural networks (CNNs) have achieved impressive performance on various classification/recognition tasks, they typically consist of a massive number of parameters. This results in significant memory requirements as well as computational overheads. Consequently, there is a growing need for filter-level pruning approaches for compressing CNN based models that not only reduce the total number of parameters but also reduce the overall computation. We present a new min-max framework for filter-level pruning of CNNs. Our framework, called Play and Prune (PP), jointly prunes and fine-tunes CNN model parameters, with an adaptive pruning rate, while maintaining the model's predictive performance. Our framework consists of two modules: (1) an adaptive filter pruning (AFP) module, which minimizes the number of filters in the model; and (2) a pruning rate controller (PRC) module, which maximizes the accuracy during pruning. Moreover, unlike most previous approaches, our approach allows directly specifying the desired error tolerance instead of the pruning level. Our compressed models can be deployed at run-time, without requiring any special libraries or hardware. Our approach reduces the number of parameters of VGG-16 by an impressive factor of 17.5X, and the number of FLOPS by 6.43X, with no loss of accuracy, significantly outperforming other state-of-the-art filter pruning methods.

Recommended citation: Pravendra Singh, Vinay Kumar Verma, Piyush Rai and Vinay P. Namboodiri, “Play and Prune: Adaptive Filter Pruning for Deep Model Compression”, Proceedings of International Joint Conference on Artificial Intelligence (IJCAI-2019), Macao, China, August 2019. https://arxiv.org/abs/1905.04446

Curriculum based Dropout Discriminator for Domain Adaptation

Published in Proceedings of British Machine Vision Conference (BMVC), 2019

Domain adaptation is essential to enable wide usage of deep learning based networks trained using large labeled datasets. Adversarial learning based techniques have shown their utility towards solving this problem using a discriminator that ensures source and target distributions are close. However, here we suggest that rather than using a point estimate, it would be useful if a distribution based discriminator could be used to bridge this gap. This could be achieved using multiple classifiers or using traditional ensemble methods. In contrast, we suggest that a Monte Carlo dropout based ensemble discriminator could suffice to obtain the distribution based discriminator. Specifically, we propose a curriculum based dropout discriminator that gradually increases the variance of the sample based distribution and the corresponding reverse gradients are used to align the source and target feature representations. The detailed results and thorough ablation analysis show that our model outperforms state-of-the-art results.
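
A compact sketch of a Monte Carlo dropout discriminator with a sample-count curriculum is given below; the schedule, layer sizes, and function names are assumptions chosen to illustrate the idea of gradually growing the sample-based ensemble, not the paper's exact training recipe.

```python
import torch
import torch.nn as nn

class DropoutDiscriminator(nn.Module):
    """Domain discriminator whose dropout stays active so that repeated
    forward passes behave like samples from an ensemble of discriminators."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(),
                                 nn.Dropout(0.5), nn.Linear(128, 1))

    def forward(self, feat, n_samples):
        return torch.stack([self.net(feat) for _ in range(n_samples)]).mean(0)

def curriculum_samples(epoch, start=1, step_every=10, max_samples=16):
    """Curriculum: slowly increase the number of dropout samples with training."""
    return min(max_samples, start + epoch // step_every)

disc = DropoutDiscriminator()
feat = torch.randn(4, 256)
for epoch in (0, 25, 90):
    domain_logit = disc(feat, curriculum_samples(epoch))   # fed to the adversarial loss
```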

Recommended citation: Vinod Kumar Kurmi, Vipul Bajaj, Venkatesh K Subramanian and Vinay P Namboodiri, “Curriculum based Dropout Discriminator for Domain Adaptation”, British Machine Vision Conference 2019, BMVC 2019, Cardiff, UK, September 9-12, 2019 https://delta-lab-iitk.github.io/CD3A/

Towards Automatic Face-to-Face Translation

Published in 27th ACM International Conference on Multimedia (ACM-MM), 2019

In light of the recent breakthroughs in automatic machine translation systems, we propose a novel approach that we term as "Face-to-Face Translation". As today's digital communication becomes increasingly visual, we argue that there is a need for systems that can automatically translate a video of a person speaking in language A into a target language B with realistic lip synchronization. In this work, we create an automatic pipeline for this problem and demonstrate its impact in multiple real-world applications. First, we build a working speech-to-speech translation system by bringing together multiple existing modules from speech and language. We then move towards "Face-to-Face Translation" by incorporating a novel visual module, LipGAN for generating realistic talking faces from the translated audio. Quantitative evaluation of LipGAN on the standard LRW test set shows that it significantly outperforms existing approaches across all standard metrics. We also subject our Face-to-Face Translation pipeline, to multiple human evaluations and show that it can significantly improve the overall user experience for consuming and interacting with multimodal content across languages. Code, models and demo video are made publicly available.

Recommended citation: Prajwal Renukanand*, Rudrabha Mukhopadhyay*, Jerin Philip, Abhishek Jha, Vinay Namboodiri and C.V. Jawahar, “Towards Automatic Face-to-Face Translation”, 27th ACM International Conference on Multimedia (ACM-MM), Nice, France, 2019, Pages 1428–1436 https://cvit.iiit.ac.in/research/projects/cvit-projects/facetoface-translation

U-CAM: Visual Explanation using Uncertainty based Class Activation Maps

Published in IEEE International Conference on Computer Vision (ICCV), 2019

Understanding and explaining deep learning models is an imperative task. Towards this, we propose a method that obtains gradient-based certainty estimates that also provide visual attention maps. In particular, we address the visual question answering task. We incorporate modern probabilistic deep learning methods that we further improve by using the gradients for these estimates. These have two-fold benefits: a) improvement in obtaining the certainty estimates that correlate better with misclassified samples and b) improved attention maps that provide state-of-the-art results in terms of correlation with human attention regions. The improved attention maps result in consistent improvement for various methods for visual question answering. Therefore, the proposed technique can be thought of as a recipe for obtaining improved certainty estimates and explanations for deep learning models. We provide detailed empirical analysis for the visual question answering task on all standard benchmarks and comparison with state-of-the-art methods.

Recommended citation: Badri N. Patro, Mayank Lunayach, Shivansh Patel and Vinay P. Namboodiri, “U-CAM: Visual Explanation using Uncertainty based Class Activation Maps”, Proceedings of IEEE International Conference on Computer Vision (ICCV), Seoul, South Korea, October 2019. https://delta-lab-iitk.github.io/U-CAM/

Granular Multimodal Attention Networks for Visual Dialog

Published in ICCV Workshop on Intelligent Short Videos (ISV), 2019

Vision and language tasks have benefited from attention. There have been a number of different attention models proposed. However, the scale at which attention needs to be applied has not been well examined. Particularly, in this work, we propose a new method Granular Multi-modal Attention, where we aim to particularly address the question of the right granularity at which one needs to attend while solving the Visual Dialog task. The proposed method shows improvement in both image and text attention networks. We then propose a granular Multi-modal Attention network that jointly attends on the image and text granules and shows the best performance. With this work, we observe that obtaining granular attention and doing exhaustive Multi-modal Attention appears to be the best way to attend while solving visual dialog.

Recommended citation: B.N. Patro, S. Patel and V.P. Namboodiri, “Granular Multimodal Attention Networks for Visual Dialog”, ICCV Workshop on Intelligent Short Videos (ISV), Seoul, Korea, 2019. https://arxiv.org/abs/1910.05728

Dynamic Attention Networks for Task Oriented Grounding

Published in ICCV Workshop on Intelligent Short Videos (ISV), 2019

In order to successfully perform tasks specified by natural language instructions, an artificial agent operating in a visual world needs to map words, concepts, and actions from the instruction to visual elements in its environment. This association is termed Task-Oriented Grounding. In this work, we propose a novel Dynamic Attention Network architecture for the efficient multi-modal fusion of text and visual representations, which can generate a robust definition of state for the policy learner. Our model assumes no prior knowledge from the visual and textual domains and is end-to-end trainable. For a 3D visual world where the observation changes continuously, the attention on the visual elements tends to be highly correlated from one time step to the next. We term this "Dynamic Attention". In this work, we show that Dynamic Attention helps in achieving grounding and also aids the policy learning objective. Since most practical robotic applications take place in the real world, where the observation space is continuous, our framework can be used as a generalized multi-modal fusion unit for robotic control through natural language. We show the effectiveness of using 1D convolution over the Gated Attention Hadamard product on the rate of convergence of the network. We demonstrate that the cell state of a Long Short Term Memory (LSTM) is a natural choice for modeling Dynamic Attention and show through visualization that the generated attention is very close to how humans tend to focus on the environment.

Recommended citation: S. Dasgupta, B.N. Patro, V.P. Namboodiri, “Dynamic Attention Networks for Task Oriented Grounding”, ICCV Workshop on Intelligent Short Videos (ISV), Seoul, Korea, 2019. https://arxiv.org/abs/1910.06315

HetConv: Beyond Homogeneous Convolution Kernels for Deep CNNs

Published in International Journal of Computer Vision (IJCV), 2019

While the usage of convolutional neural networks (CNNs) is widely prevalent, methods proposed so far have always considered homogeneous kernels for this task. In this paper, we propose a new type of convolution operation using heterogeneous kernels. The proposed Heterogeneous Kernel-Based Convolution (HetConv) reduces the computation (FLOPs) and the number of parameters as compared to the standard convolution operation while maintaining representational efficiency. To show the effectiveness of our proposed convolution, we present extensive experimental results on standard CNN architectures such as VGG, ResNet, Faster-RCNN, MobileNet, and SSD. We observe that after replacing the standard convolutional filters in these architectures with our proposed HetConv filters, we achieve a 1.5× to 8× FLOPs based improvement in speed while maintaining (and sometimes improving) the accuracy. We also compare our proposed convolution with group/depthwise convolution and show that it achieves more FLOPs reduction with significantly higher accuracy. Moreover, we demonstrate the efficacy of HetConv based CNNs by showing that they also generalize to object detection and are not constrained to image classification tasks. We also empirically show that the proposed HetConv convolution is more robust to over-fitting as compared to standard convolution.

Recommended citation: Pravendra Singh, Vinay Kumar Verma, Piyush Rai and Vinay P. Namboodiri, “HetConv: Beyond Homogeneous Convolution Kernels for Deep CNNs”, International Journal of Computer Vision, accepted https://link.springer.com/article/10.1007/s11263-019-01264-3

FALF ConvNets: Fatuous auxiliary loss based filter-pruning for efficient deep CNNs

Published in Image and Vision Computing Journal, 2019

Obtaining efficient Convolutional Neural Networks (CNNs) is imperative to enable their application to a wide variety of tasks (classification, detection, etc.). While several methods have been proposed to solve this problem, we propose a novel strategy that is orthogonal to those proposed so far. We hypothesize that if we add a fatuous auxiliary task to a network that aims to solve a semantic task such as classification or detection, the filters devoted to solving this frivolous task would not be relevant for solving the main task of concern. These filters could be pruned, and pruning them would not reduce the performance on the original task. We demonstrate that this strategy is not only successful, it in fact allows for improved performance on a variety of tasks such as object classification, detection and action recognition. An interesting observation is that the task needs to be fatuous so that no semantically meaningful filters are relevant for solving it. We thoroughly evaluate our proposed approach on different architectures (LeNet, VGG-16, ResNet, Faster RCNN, SSD-512, C3D, and MobileNet V2) and datasets (MNIST, CIFAR, ImageNet, GTSDB, COCO, and UCF101) and demonstrate its generalizability through extensive experiments. Moreover, our compressed models can be used at run-time without requiring any special libraries or hardware. Our model compression method reduces the number of FLOPS by an impressive factor of 6.03X and GPU memory footprint by more than 17X for VGG-16, significantly outperforming other state-of-the-art filter pruning methods. We demonstrate the usability of our approach for 3D convolutions and various vision tasks such as object classification, object detection, and action recognition.

Recommended citation: Pravendra Singh, Vinay Sameer Raja Kadi and Vinay P.Namboodiri, “FALF ConvNets: Fatuous auxiliary loss based filter-pruning for efficient deep CNNs”, Image and Vision Computing Journal, Volume 93, January 2020, 103857 https://www.sciencedirect.com/science/article/pii/S0262885619304500

Explanation vs Attention: A Two-Player Game to Obtain Attention for VQA

Published in Association for the Advancement of Artificial Intelligence, 2020

In this paper, we aim to obtain improved attention for a visual question answering (VQA) task. It is challenging to provide supervision for attention. An observation we make is that visual explanations as obtained through class activation mappings (specifically Grad-CAM), which are meant to explain the performance of various networks, could form a means of supervision. However, as the distributions of attention maps and of Grad-CAMs differ, it would not be suitable to directly use these as a form of supervision. Rather, we propose the use of a discriminator that aims to distinguish samples of visual explanation and attention maps. The use of adversarial training of the attention regions as a two-player game between attention and explanation serves to bring the distributions of attention maps and visual explanations closer. Significantly, we observe that providing such a means of supervision also results in attention maps that are more closely related to human attention, resulting in a substantial improvement over baseline stacked attention network (SAN) models. It also results in a good improvement in the rank correlation metric on the VQA task. This method can also be combined with recent MCB based methods and results in consistent improvement. We also provide comparisons with other means for learning distributions, such as those based on Correlation Alignment (Coral), Maximum Mean Discrepancy (MMD) and Mean Square Error (MSE) losses, and observe that the adversarial loss outperforms the other forms of learning the attention maps. Visualization of the results also confirms our hypothesis that attention maps improve using this form of supervision.

Recommended citation: Patro, B., Anupriy & Namboodiri, V. (2020, April). Explanation vs attention: A two-player game to obtain attention for vqa. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 07, pp. 11848-11855). https://aaai.org/ojs/index.php/AAAI/article/view/6858/6712, AAAI-2020

Bridged Variational Autoencoders for Joint Modeling of Images and Attributes

Published in IEEE Winter Conference on Applications of Computer Vision (WACV), 2020

Generative models have recently shown the ability to realistically generate data and model the distribution accurately. However, joint modeling of an image with the attribute that it is labeled with requires learning a cross-modal correspondence between image and attribute data. Though the information present in a set of images and in its attributes possesses completely different statistical properties, there exists an inherent correspondence that is challenging to capture. Various models have aimed at capturing this correspondence either through joint modeling of a variational autoencoder or through separate encoder networks that are then concatenated. We present an alternative by proposing a bridged variational autoencoder that allows for learning cross-modal correspondence by incorporating cross-modal hallucination losses in the latent space. In comparison to existing methods, we have found that by using a bridge connection in latent space we not only obtain better generation results, but also obtain a highly parameter-efficient model, which provides a 40% reduction in training parameters for the bimodal dataset and nearly a 70% reduction for the trimodal dataset. We validate the proposed method through comparison with state-of-the-art methods and benchmarking on standard datasets.

Recommended citation: Ravindra Yadav, Ashish Sardana, Vinay P Namboodiri, Rajesh M Hegde. "Bridged Variational Autoencoders for Joint Modeling of Images and Attributes." Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2020, pp. 1479-1487 https://ieeexplore.ieee.org/abstract/document/9093565

Can I teach a robot to replicate a line art

Published in 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), 2020

Line art is arguably one of the fundamental and versatile modes of expression. We propose a pipeline for a robot to look at a grayscale line drawing and redraw it. The key novel elements of our pipeline are: a) we propose a novel task of mimicking line drawings; b) to solve the pipeline we modify the Quick-draw dataset to obtain supervised training for converting a line drawing into a series of strokes; and c) we propose a multi-stage segmentation and graph interpretation pipeline for solving the problem. The resultant method has also been deployed on a CNC plotter as well as a robotic arm. We have trained several variations of the proposed methods and evaluate them on a dataset obtained from Quick-draw. With the best methods we observe an accuracy of around 98% for this task, which is a significant improvement over the baseline architecture we adapted from. This therefore allows for deployment of the method on robots for replicating line art in a reliable manner. We also show that while rule-based vectorization methods suffice for simple drawings, they fail for more complicated sketches, unlike our method, which generalizes well to more complicated distributions.

Recommended citation: R. B. Venkataramaiyer, S. Kumar, and V. P. Namboodiri (2020). "Can I teach a robot to replicate a line art", Proceedings of 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Aspen, USA, Mar. 2020. https://bvraghav.com/can-i-teach-a-robot-to-replicate-a-line-art/

Cooperative Initialization based Deep Neural Network Training

Published in IEEE Winter Conference on Applications of Computer Vision (WACV), 2020

Researchers have proposed various activation functions. These activation functions help the deep network to learn non-linear behavior and have a significant effect on training dynamics and task performance. The performance of these activations also depends on the initial state of the weight parameters, i.e., different initial states lead to differences in the performance of a network. In this paper, we propose a cooperative initialization for training deep networks with the ReLU activation function to improve network performance. Our approach uses multiple activation functions in the initial few epochs for the update of all sets of weight parameters while training the network. These activation functions cooperate to overcome their individual drawbacks in the update of weight parameters, which in effect learns a better "feature representation" and boosts the network performance later. Cooperative initialization based training also helps in reducing the overfitting problem and does not increase the number of parameters or the inference (test) time of the final model while improving performance. Experiments show that our approach outperforms various baselines and, at the same time, performs well over various tasks such as classification and detection. The Top-1 classification accuracy of the model trained using our approach improves by 2.8% for VGG-16 and 2.1% for ResNet-56 on the CIFAR-100 dataset.
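
A hedged sketch of the idea is below: during the first few epochs several activation functions cooperate (their outputs are simply averaged here), after which the layer falls back to plain ReLU. The averaging rule, the chosen activations, and the warm-up length are assumptions of this example, not the exact update scheme of the paper.

```python
import torch
import torch.nn as nn

class CooperativeActivation(nn.Module):
    """Average several activations during a short warm-up, then use ReLU only."""
    def __init__(self, warmup_epochs=5):
        super().__init__()
        self.warmup_epochs = warmup_epochs
        self.epoch = 0                     # updated from the training loop
        self.acts = [torch.relu, torch.tanh, nn.functional.leaky_relu]

    def forward(self, x):
        if self.training and self.epoch < self.warmup_epochs:
            return sum(act(x) for act in self.acts) / len(self.acts)
        return torch.relu(x)

act = CooperativeActivation()
act.epoch = 0                              # set once per epoch during training
y = act(torch.randn(4, 16))
```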

Recommended citation: P. Singh, M. Varshney, V. P. Namboodiri, "Cooperative Initialization based Deep Neural Network Training", IEEE Winter Conference on Applications of Computer Vision (WACV), 2020 https://arxiv.org/abs/2001.01240

Leveraging Filter Correlations for Deep Model Compression

Published in Winter Conference on Applications of Computer Vision (WACV ’20), 2020

We present a filter correlation based model compression approach for deep convolutional neural networks. Our approach iteratively identifies pairs of filters with the largest pairwise correlations and drops one of the filters from each such pair. However, instead of discarding one of the filters from each such pair naïvely, the model is re-optimized to make the filters in these pairs maximally correlated, so that discarding one of the filters from the pair results in minimal information loss. Moreover, after discarding the filters in each round, we further finetune the model to recover from the potential small loss incurred by the compression. We evaluate our proposed approach using a comprehensive set of experiments and ablation studies. Our compression method yields state-of-the-art FLOPs compression rates on various benchmarks, such as LeNet-5, VGG-16, and ResNet-50,56, while still achieving excellent predictive performance for tasks such as object detection on benchmark datasets.
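
The core step, finding the most correlated pair of filters in a layer, can be sketched as follows; the Pearson-correlation computation is standard, while treating one member of the top pair as the pruning candidate mirrors the description above. Names and defaults are illustrative.

```python
import torch
import torch.nn as nn

def most_correlated_filter_pair(conv):
    """Return indices of the two output filters with the largest absolute
    pairwise correlation; one of them is a candidate for removal."""
    w = conv.weight.detach().flatten(1)                       # (out_channels, rest)
    w = (w - w.mean(dim=1, keepdim=True)) / (w.std(dim=1, keepdim=True) + 1e-8)
    corr = (w @ w.t()) / w.shape[1]                           # Pearson correlations
    corr.fill_diagonal_(0)                                    # ignore self-correlation
    flat = torch.argmax(corr.abs()).item()
    return divmod(flat, corr.shape[0])

i, j = most_correlated_filter_pair(nn.Conv2d(32, 64, 3))
print(f"filters {i} and {j} are most correlated; drop one after re-optimisation")
```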

Recommended citation: P. Singh, V.K. Verma, P. Rai, V.P. Namboodiri, "Leveraging Filter Correlations for Deep Model Compression", IEEE Winter Conference on Applications of Computer Vision (WACV ’20) pages 824-833, 2020 https://arxiv.org/abs/1811.10559

A Network Pruning Network Approach to Deep Model Compression

Published in IEEE Winter Conference on Applications of Computer Vision (WACV), 2020

We present a filter pruning approach for deep model compression, using a multitask network. Our approach is based on learning a pruner network to prune a pre-trained target network. The pruner is essentially a multitask deep neural network with binary outputs that help identify the filters from each layer of the original network that do not have any significant contribution to the model and can therefore be pruned. The pruner network has the same architecture as the original network except that it has a multitask/multi-output last layer containing binary-valued outputs (one per filter), which indicate which filters have to be pruned. The pruner's goal is to minimize the number of filters from the original network by assigning zero weights to the corresponding output feature-maps. In contrast to most of the existing methods, instead of relying on iterative pruning, our approach can prune the network (original network) in one go and, moreover, does not require specifying the degree of pruning for each layer (and can learn it instead). The compressed model produced by our approach is generic and does not need any special hardware/software support. Moreover, augmenting with other methods such as knowledge distillation, quantization, and connection pruning can increase the degree of compression for the proposed approach. We show the efficacy of our proposed approach for classification and object detection tasks.

Recommended citation: V.K. Verma, P. Singh, V.P. Namboodiri, P. Rai, "A "Network Pruning Network" Approach to Deep Model Compression", IEEE Winter Conference on Applications of Computer Vision (WACV), 2020 https://arxiv.org/abs/2001.05545

Jointly Trained Image and Video Generation using Residual Vectors

Published in 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), 2020

In this work, we propose a modeling technique for jointly training image and video generation models by simultaneously learning to map latent variables with a fixed prior onto real images and interpolate over images to generate videos. The proposed approach models the variations in representations using residual vectors encoding the change at each time step over a summary vector for the entire video. We utilize the technique to jointly train an image generation model with a fixed prior along with a video generation model lacking constraints such as disentanglement. The joint training enables the image generator to exploit temporal information while the video generation model learns to flexibly share information across frames. Moreover, experimental results verify our approach's compatibility with pre-training on videos or images and training on datasets containing a mixture of both. A comprehensive set of quantitative and qualitative evaluations reveal the improvements in sample quality and diversity over both video generation and image generation baselines. We further demonstrate the technique's capabilities of exploiting similarity in features across frames by applying it to a model based on decomposing the video into motion and content. The proposed model allows minor variations in content across frames while maintaining the temporal dependence through latent vectors encoding the pose or motion features.
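
A minimal sketch of the residual-vector construction is given below, assuming a video-level summary latent and per-frame residuals produced by a small recurrent module; the shapes, the GRU, and the additive combination are assumptions made for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ResidualVideoLatents(nn.Module):
    """Per-frame latent = video summary vector + a small residual per step;
    each frame latent would then be decoded by a shared image generator."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.residual_rnn = nn.GRU(latent_dim, latent_dim, batch_first=True)

    def forward(self, summary, noise):
        # summary: (B, latent_dim), noise: (B, T, latent_dim)
        residuals, _ = self.residual_rnn(noise)       # temporally coherent residuals
        return summary.unsqueeze(1) + residuals       # (B, T, latent_dim)

latents = ResidualVideoLatents()(torch.randn(2, 128), torch.randn(2, 16, 128))
# latents[:, t] would be fed to a shared image decoder G(z) to render frame t
```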

Recommended citation: Yatin Dandi, Aniket Das, Soumye Singhal, Vinay Namboodiri, Piyush Rai; “Jointly Trained Image and Video Generation using Residual Vectors “, Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2020, pp. 3028-3042 https://openaccess.thecvf.com/content_WACV_2020/html/Dandi_Jointly_Trained_Image_and_Video_Generation_using_Residual_Vectors_WACV_2020_paper.html

Deep Bayesian Network for Visual Question Generation

Published in The IEEE Winter Conference on Applications of Computer Vision, 2020

Generating natural questions from an image is a semantic task that requires using vision and language modalities to learn multimodal representations. Images can have multiple visual and language cues such as places, captions, and tags. In this paper, we propose a principled deep Bayesian learning framework that combines these cues to produce natural questions. We propose Minimizing Uncertainty of Mixture of Cues (MUMC), which minimizes the uncertainty present in a mixture of cue experts for generating probabilistic questions. This is a Bayesian framework and the results show a remarkable similarity to natural questions, as validated by a human study. We observe that with the addition of more cues and by minimizing uncertainty among the cues, the Bayesian framework becomes more confident. Ablation studies of our model indicate that a subset of cues is inferior at this task and hence the principled fusion of cues is preferred. Further, we observe that the proposed approach substantially improves over state-of-the-art benchmarks on the quantitative metrics (BLEU-n, METEOR, ROUGE, and CIDEr). The project page for Deep Bayesian VQG is available at https://delta-lab-iitk.github.io/BVQG/.

Recommended citation: Patro, B., Kurmi, V., Kumar, S., & Namboodiri, V. (2020). Deep Bayesian Network for Visual Question Generation. In The IEEE Winter Conference on Applications of Computer Vision (pp. 1566-1576). https://openaccess.thecvf.com/content_WACV_2020/html/Patro_Deep_Bayesian_Network_for_Visual_Question_Generation_WACV_2020_paper.html

Robust Explanations for Visual Question Answering

Published in The IEEE Winter Conference on Applications of Computer Vision, 2020

In this paper, we propose a method to obtain robust explanations for visual question answering (VQA) that correlate well with the answers. Our model explains the answers obtained through a VQA model by providing visual and textual explanations. The main challenges that we address are i) Answers and textual explanations obtained by current methods are not well correlated and ii) Current methods for visual explanation do not focus on the right location for explaining the answer. We address both these challenges by using a collaborative correlated module which ensures that even if we do not train for noise based attacks, the enhanced correlation ensures that the right explanation and answer can be generated. We further show that this also aids in improving the generated visual and textual explanations. The use of the correlated module can be thought of as a robust method to verify if the answer and explanations are coherent. We evaluate this model using VQA-X dataset. We observe that the proposed method yields better textual and visual justification that supports the decision. We showcase the robustness of the model against a noise-based perturbation attack using corresponding visual and textual explanations. A detailed empirical analysis is shown.

Recommended citation: Patro, B., Patel, S., & Namboodiri, V. (2020). Robust Explanations for Visual Question Answering. In The IEEE Winter Conference on Applications of Computer Vision (pp. 1577-1586). https://openaccess.thecvf.com/content_WACV_2020/papers/Patro_Robust_Explanations_for_Visual_Question_Answering_WACV_2020_paper.pdf

Accuracy booster: Performance boosting using feature map re-calibration

Published in Winter Conference on Applications of Computer Vision (WACV), 2020

Convolutional Neural Networks (CNNs) have been extremely successful in solving intensive computer vision tasks. The convolutional filters used in CNNs have played a major role in this success by extracting useful features from the inputs. Recently, researchers have tried to boost the performance of CNNs by re-calibrating the feature maps produced by these filters, e.g., Squeeze-and-Excitation Networks (SENets). These approaches have achieved better performance by exciting the important channels or feature maps while diminishing the rest. However, in the process, architectural complexity has increased. We propose an architectural block that introduces much lower complexity than the existing methods of CNN performance boosting while performing significantly better than them. We carry out experiments on the CIFAR, ImageNet and MS-COCO datasets, and show that the proposed block can challenge the state-of-the-art results. Our method boosts the ResNet-50 architecture to perform comparably to the ResNet-152 architecture, which is a three times deeper network, on classification. We also show experimentally that our method is not limited to classification but also generalizes well to other tasks such as object detection.
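
For context, the snippet below shows the general feature-map re-calibration mechanism of the SENet family (squeeze to a channel descriptor, predict per-channel weights, rescale); the paper's block is a lighter-weight alternative, so treat this only as an illustration of the mechanism being improved upon.

```python
import torch
import torch.nn as nn

class Recalibration(nn.Module):
    """SE-style re-calibration: global-average-pool each channel, predict a
    per-channel weight with a small MLP, and rescale the feature map."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                        # x: (B, C, H, W)
        weights = self.fc(x.mean(dim=(2, 3)))    # squeeze -> (B, C)
        return x * weights.unsqueeze(-1).unsqueeze(-1)

out = Recalibration(64)(torch.randn(2, 64, 28, 28))
```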

Recommended citation: Pravendra Singh, Pratik Mazumder and Vinay P. Namboodiri, “Accuracy booster: Performance boosting using feature map re-calibration”, The IEEE Winter Conference on Applications of Computer Vision, pages 884-893, 2020 http://openaccess.thecvf.com/content_WACV_2020/papers/Singh_Accuracy_Booster_Performance_Boosting_using_Feature_Map_Re-calibration_WACV_2020_paper.pdf

CPWC: Contextual point wise convolution for object recognition

Published in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020

Convolutional layers are a major driving force behind the successes of deep learning. Pointwise convolution (PWC) is a 1 × 1 convolutional filter that is primarily used for parameter reduction. However, the PWC ignores the spatial information around the points it is processing. This design is by choice, in order to reduce the overall parameters and computations. However, we hypothesize that this shortcoming of PWC has a significant impact on the network performance. We propose an alternative design for pointwise convolution, which uses spatial information from the input efficiently. Our design significantly improves the performance of the networks without substantially increasing the number of parameters and computations. We experimentally show that our design results in significant improvement in the performance of the network for classification as well as detection.
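
One plausible realisation of a context-aware pointwise convolution is sketched below: the usual 1x1 branch is summed with a cheap depthwise-3x3-plus-1x1 branch so that the point also sees its neighbourhood. The specific branch design is an assumption of this example, not the authors' exact layer.

```python
import torch
import torch.nn as nn

class ContextualPointwise(nn.Module):
    """1x1 convolution plus a lightweight spatial-context branch
    (depthwise 3x3 followed by 1x1)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.context = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),  # depthwise
            nn.Conv2d(in_ch, out_ch, 1, bias=False))

    def forward(self, x):
        return self.pointwise(x) + self.context(x)

print(ContextualPointwise(64, 128)(torch.randn(1, 64, 16, 16)).shape)  # [1, 128, 16, 16]
```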

Recommended citation: Pravendra Singh, Pratik Mazumder and Vinay P. Namboodiri, “Cpwc: Contextual point wise convolution for object recognition”, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4152-4156, 2020 https://ieeexplore.ieee.org/document/9054205

Acceleration of Deep Convolutional Neural Networks Using Adaptive Filter Pruning

Published in IEEE Journal of Selected Topics in Signal Processing, 2020

While convolutional neural networks (CNNs) have achieved remarkable performance on various supervised and unsupervised learning tasks, they typically consist of a massive number of parameters. This results in significant memory requirements as well as a computational burden. Consequently, there is a growing need for filter-level pruning approaches for compressing CNN based models that not only reduce the total number of parameters but also the overall computation. We present a new min-max framework for the filter-level pruning of CNNs. Our framework jointly prunes and fine-tunes CNN model parameters, with an adaptive pruning rate, while maintaining the model's predictive performance. It consists of two modules: (1) an adaptive filter pruning (AFP) module, which minimizes the number of filters in the model; and (2) a pruning rate controller (PRC) module, which maximizes the accuracy during pruning. In addition, we introduce orthogonality regularization in the training of CNNs to reduce redundancy across the filters of a particular layer. In the proposed approach, we prune the least important filters and, at the same time, reduce the redundancy level in the model by using orthogonality constraints during training. Moreover, unlike most previous approaches, our approach allows directly specifying the desired error tolerance instead of the pruning level. We perform extensive experiments for object classification (LeNet, VGG, MobileNet, and ResNet) and object detection (SSD and Faster-RCNN) over benchmark datasets such as MNIST, CIFAR, GTSDB, ImageNet, and MS-COCO. We also present several ablation studies to validate the proposed approach. Our compressed models can be deployed at run-time, without requiring any special libraries or hardware. Our approach reduces the number of parameters of VGG-16 by an impressive factor of 17.5X, and the number of FLOPS by 6.43X, with no loss of accuracy, significantly outperforming other state-of-the-art filter pruning methods.
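
The orthogonality regularization mentioned above can be illustrated with a small penalty that pushes the Gram matrix of a layer's flattened filters towards the identity. The sketch below is a generic version of such a constraint, under stated assumptions, and not the paper's full AFP/PRC min-max framework.

```python
import torch
import torch.nn as nn

def orthogonality_penalty(conv: nn.Conv2d) -> torch.Tensor:
    """Penalize redundancy across filters of one layer: drive the Gram matrix
    of the flattened, unit-norm filters towards the identity. A generic sketch
    of an orthogonality constraint, not the full AFP/PRC framework."""
    w = conv.weight.view(conv.out_channels, -1)        # (F, C*k*k)
    w = nn.functional.normalize(w, dim=1)              # unit-norm rows
    gram = w @ w.t()                                   # (F, F)
    eye = torch.eye(conv.out_channels, device=w.device)
    return ((gram - eye) ** 2).sum()

layer = nn.Conv2d(64, 128, 3)
loss = orthogonality_penalty(layer)   # add lambda * loss to the training objective
print(loss.item())
```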

Recommended citation: P. Singh, V.K. Verma, P. Rai, V.P. Namboodiri, "Acceleration of Deep Convolutional Neural Networks Using Adaptive Filter Pruning", IEEE Journal of Selected Topics in Signal Processing (Volume 14, Issue 4, May 2020) https://ieeexplore.ieee.org/document/9086749

Minimizing Supervision in Multi-label Categorization

Published in CVPR 2020 Workshop on Fair, Data Efficient and Trusted Computer Vision, 2020

Multiple categories of objects are present in most images, so treating recognition as multi-class classification is not justified; we instead treat it as a multi-label classification problem. In this paper, we further aim to minimize the supervision required for multi-label classification. Specifically, we investigate an effective class of approaches that associate a weak localization with each category, either in terms of a bounding box or a segmentation mask. Doing so improves the accuracy of multi-label categorization. The approach we adopt is one of active learning, i.e., incrementally selecting a set of samples that need supervision based on the current model, obtaining supervision for these samples, retraining the model with the additional set of supervised samples, and proceeding again to select the next set of samples. A crucial concern is the choice of the set of samples. In doing so, we provide a novel insight: no single measure succeeds in providing a consistently improved selection criterion. We therefore provide a selection criterion that consistently improves on the overall baseline criterion by choosing the top-k set of samples for a varied set of criteria. Using this criterion, we are able to show that we can retain more than 98% of the fully supervised performance with just 20% of the samples (and more than 96% using only 10%) of the dataset on PASCAL VOC 2007 and 2012. Also, our proposed approach consistently outperforms all other baseline metrics for all benchmark datasets and model combinations.

Recommended citation: Rajat, M. Varshney, P. Singh, V.P. Namboodiri, "Minimizing Supervision in Multi-label Categorization". CVPR Workshop on Fair, Data-Efficient and Trusted Computer Vision, June 2020: 93-102 https://arxiv.org/abs/2005.12892

Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis

Published in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020

Humans involuntarily tend to infer parts of the conversation from lip movements when the speech is absent or corrupted by external noise. In this work, we explore the task of lip to speech synthesis, i.e., learning to generate natural speech given only the lip movements of a speaker. Acknowledging the importance of contextual and speaker-specific cues for accurate lip-reading, we take a different path from existing works. We focus on learning accurate lip sequences to speech mappings for individual speakers in unconstrained, large vocabulary settings. To this end, we collect and release a large-scale benchmark dataset, the first of its kind, specifically to train and evaluate the single-speaker lip to speech task in natural settings. We propose a novel approach with key design choices to achieve accurate, natural lip to speech synthesis in such unconstrained scenarios for the first time. Extensive evaluation using quantitative, qualitative metrics and human evaluation shows that our method is four times more intelligible than previous works in this space.

Recommended citation: K. R. Prajwal, R. Mukhopadhyay, V. P. Namboodiri and C. V. Jawahar, "Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis," 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 2020, pp. 13793-13802, doi: 10.1109/CVPR42600.2020.01381. https://arxiv.org/abs/2005.08209

A Multilingual Parallel Corpora Collection Effort for Indian Languages

Published in Proceedings of The 12th Language Resources and Evaluation Conference (LREC), 2020

We present sentence-aligned parallel corpora across 10 Indian languages - Hindi, Telugu, Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi and Punjabi - and English, many of which are categorized as low resource. The corpora are compiled from online sources which have content shared across languages. The corpora presented significantly extend existing resources, which are either not large enough or are restricted to a specific domain (such as health). We also provide a separate test corpus, compiled from an independent online source, that can be used for validating performance in the 10 Indian languages. Alongside, we report on the methods of constructing such corpora using tools enabled by recent advances in machine translation and cross-lingual retrieval based on deep neural networks.

Recommended citation: Shashank Siripragada, Jerin Philip, Vinay P Namboodiri, CV Jawahar, “A Multilingual Parallel Corpora Collection Effort for Indian Languages”, Proceedings of The 12th Language Resources and Evaluation Conference https://arxiv.org/abs/2007.07691

SkipConv: Skip Convolution for Computationally Efficient Deep CNNs

Published in 2020 International Joint Conference on Neural Networks (IJCNN), 2020

The convolution operation in deep convolutional neural networks is the most computationally expensive operation. Most of the model computation (FLOPS) in deep architectures belongs to convolution operations. In this paper, we propose a novel skip convolution operation that requires significantly fewer computations than the traditional one without sacrificing model accuracy. Skip convolution produces structured sparsity in the output feature maps, without requiring sparsity in the model parameters, for computation reduction. The existing convolution operation performs redundant computation for object feature representation, while the proposed convolution skips this redundant computation. Our empirical evaluation for various deep models (VGG, ResNet, MobileNet, and Faster R-CNN) over various benchmark datasets (CIFAR-10, CIFAR-100, ImageNet, and MS-COCO) shows that skip convolution reduces the computation significantly while preserving feature representational capacity. The proposed approach is model-agnostic and can be applied to any architecture. It does not require a pretrained model and trains from scratch; hence we achieve significant computation reduction at both training and test time. We are also able to reduce computation in an already compact model such as MobileNet using skip convolution. We also show empirically that the proposed convolution works well for other tasks such as object detection. Therefore, SkipConv can be a widely usable and efficient way of reducing computation in deep CNN models.
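
A toy sketch of the output-side structured-sparsity idea follows: convolution responses are kept only on a structured subset of spatial positions (here a fixed checkerboard, purely for illustration). In plain PyTorch this merely zeroes activations rather than skipping the multiply-accumulates, and the real SkipConv skipping pattern is learned, so treat this only as a visual aid.

```python
import torch
import torch.nn as nn

class ToySkipConv(nn.Module):
    """Toy illustration of structured output sparsity: keep convolution responses
    only on a fixed checkerboard of output positions. A real implementation would
    skip the corresponding multiply-accumulates; here the mask only zeroes them,
    and the skipping pattern in the paper is learned rather than fixed."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.conv(x)
        h, w = y.shape[-2:]
        keep = (torch.arange(h).view(-1, 1) + torch.arange(w)) % 2 == 0
        return y * keep.to(y)   # match dtype/device of y and zero masked positions

print(ToySkipConv(16, 32)(torch.randn(1, 16, 8, 8)).shape)  # torch.Size([1, 32, 8, 8])
```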

Recommended citation: P. Singh, V.P. Namboodiri, "SkipConv: Skip Convolution for Computationally Efficient Deep CNNs", International Joint Conference on Neural Networks (IJCNN), 2020 https://ieeexplore.ieee.org/abstract/document/9207705

Passive Batch Injection Training Technique: Boosting Network Performance by Injecting Mini-Batches from a different Data Distribution

Published in International Joint Conference on Neural Networks (IJCNN), 2020

This work presents a novel training technique for deep neural networks that makes use of additional data from a distribution that is different from that of the original input data. This technique aims to reduce overfitting and improve the generalization performance of the network. Our proposed technique, namely Passive Batch Injection Training Technique (PBITT), even reduces the level of overfitting in networks that already use standard techniques for reducing overfitting, such as L2 regularization and batch normalization, resulting in significant accuracy improvements. PBITT introduces a few passive mini-batches into the training process that contain data from a distribution different from the input data distribution. This technique does not increase the number of parameters in the final model and does not increase the inference (test) time, yet still improves the performance of deep CNNs. To the best of our knowledge, this is the first work that makes use of a different data distribution to aid the training of convolutional neural networks (CNNs). We thoroughly evaluate the proposed approach on standard architectures (VGG, ResNet, and WideResNet) and on several popular datasets (CIFAR-10, CIFAR-100, SVHN, and ImageNet), and observe consistent accuracy improvements. We also show experimentally that the model trained by our technique generalizes well to other tasks such as object detection on the MS-COCO dataset using Faster R-CNN. We present extensive ablations to validate the proposed approach. Our approach improves the accuracy of VGG-16 by a significant margin of 2.1% on the CIFAR-100 dataset.
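
A hedged sketch of the core training-loop idea is shown below: every few iterations, a "passive" mini-batch drawn from a different data distribution is injected into training. The injection frequency, optimizer settings, and loss handling are illustrative assumptions rather than the exact PBITT recipe.

```python
import torch
import torch.nn as nn

def train_with_passive_batches(model, main_loader, passive_loader, epochs=1,
                               inject_every=4, lr=0.1):
    """Sketch of passive batch injection: periodically take a gradient step on a
    mini-batch from a different distribution. Frequency and handling are assumptions."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    passive_iter = iter(passive_loader)
    for _ in range(epochs):
        for step, (x, y) in enumerate(main_loader):
            if step % inject_every == 0:
                try:
                    xp, yp = next(passive_iter)
                except StopIteration:
                    passive_iter = iter(passive_loader)
                    xp, yp = next(passive_iter)
                opt.zero_grad()
                criterion(model(xp), yp).backward()   # passive mini-batch step
                opt.step()
            opt.zero_grad()
            criterion(model(x), y).backward()         # regular training step
            opt.step()
```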

Recommended citation: P. Singh, P. Mazumder and V. P. Namboodiri, "Passive Batch Injection Training Technique: Boosting Network Performance by Injecting Mini-Batches from a different Data Distribution," 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, United Kingdom, 2020, pp. 1-8, doi: 10.1109/IJCNN48605.2020.9206622. https://arxiv.org/abs/2006.04406

Probabilistic framework for solving visual dialog

Published in Pattern Recognition, 2020

In this paper, we propose a probabilistic framework for solving the task of ‘Visual Dialog’. Solving this task requires reasoning over and understanding of the visual modality, the language modality, and common sense knowledge in order to answer. Various architectures have been proposed to solve this task through variants of multi-modal deep learning techniques that combine visual and language representations. However, we believe that it is crucial to understand and analyze the sources of uncertainty when solving this task. Our approach allows for estimating uncertainty and also aids the diverse generation of answers. The proposed approach consists of a probabilistic representation module that provides representations for the image, the question and the conversation history; a module that, given these probabilistic representations, ensures that diverse latent representations for candidate answers are obtained; and an uncertainty representation module that chooses the appropriate answer that minimizes uncertainty. We thoroughly evaluate the model with a detailed ablation analysis, comparison with the state of the art, and visualization of the uncertainty that aids in the understanding of the method. Using the proposed probabilistic framework, we thus obtain an improved visual dialog system that is also more explainable.

Recommended citation: Badri N. Patro, Anupriy & Vinay P. Namboodiri, "Probabilistic framework for solving visual dialog", Pattern Recognition, Volume 110, February 2021, 107586. https://www.sciencedirect.com/science/article/abs/pii/S0031320320303897

Revisiting paraphrase question generator using pairwise discriminator

Published in Neurocomputing, 2020

In this paper, we propose a method for obtaining sentence-level embeddings. While the problem of obtaining word-level embeddings is very well studied, we propose a novel and simple method for obtaining sentence-level embeddings in the context of solving the paraphrase generation task. If we use a sequential encoder-decoder model for generating paraphrases, we would like the generated paraphrase to be semantically close to the original sentence. One way to ensure this is by adding constraints for true paraphrase embeddings to be close and for unrelated paraphrase candidate sentence embeddings to be far. This is ensured by a sequential pair-wise discriminator that shares weights with the encoder and is trained with a suitable loss function. Our loss function penalizes paraphrase sentence embedding distances from being too large, and it is used in combination with a sequential encoder-decoder network. We also validate our method by evaluating the obtained embeddings on a sentiment analysis task. The proposed method results in semantic embeddings and provides competitive results on the paraphrase generation and sentiment analysis tasks on standard datasets. These results are also shown to be statistically significant.
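
The pair-wise constraint described above can be illustrated with a triplet-style margin loss that pulls true paraphrase embeddings together and pushes unrelated sentence embeddings apart. The margin value and distance function are assumptions; the paper's discriminator additionally shares weights with the encoder.

```python
import torch
import torch.nn.functional as F

def pairwise_discriminator_loss(anchor, paraphrase, unrelated, margin=1.0):
    """Pull true paraphrase embeddings close and push unrelated sentence embeddings
    at least `margin` further away (a triplet-style sketch of the pair-wise constraint)."""
    d_pos = F.pairwise_distance(anchor, paraphrase)
    d_neg = F.pairwise_distance(anchor, unrelated)
    return F.relu(d_pos - d_neg + margin).mean()

a, p, n = torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256)
print(pairwise_discriminator_loss(a, p, n).item())
```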

Recommended citation: Badri N Patro, Dev Chauhan, Vinod K Kurmi, Vinay P Namboodiri (2020). "Revisiting paraphrase question generator using pairwise discriminator", Neurocomputing, Volume 420, 2021, pp. 149-161 https://www.sciencedirect.com/science/article/abs/pii/S0925231220312820

SD-MTCNN: Self-Distilled Multi-Task CNN

Published in British Machine Vision Conference (BMVC), 2020

Multi-task learning (MTL) using convolutional neural networks (CNNs) deals with training a network for multiple correlated tasks in concert. For accuracy-critical applications, there are endeavors to boost model performance by resorting to a deeper network, which also increases model complexity. However, such burdensome models are difficult to deploy on mobile or edge devices. To ensure a trade-off between performance and complexity of CNNs in the context of MTL, we introduce the novel paradigm of self-distillation within the network. Different from traditional knowledge distillation (KD), which trains the student in accordance with a cumbersome teacher, our self-distilled multi-task CNN model, SD-MTCNN, aims at distilling knowledge from deeper CNN layers into the shallow layers. Precisely, we follow a hard-sharing based MTL setup where all the tasks share a generic feature encoder on top of which separate task-specific decoders are enacted. Under this premise, SD-MTCNN distills the more abstract features from the decoders into the encoded feature space, which guarantees improved multi-task performance from different parts of the network. We validate SD-MTCNN on three benchmark datasets: CityScapes, NYUv2, and Mini-Taskonomy, and the results confirm the improved generalization capability of self-distilled multi-task CNNs in comparison to the literature and baselines.
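
A minimal sketch of the self-distillation idea in a hard-sharing MTL setup: the shared encoder (student) is pushed towards the richer, detached decoder (teacher) features via a small projection and an L2 objective. The projection and loss choice are assumptions for illustration, not the exact SD-MTCNN formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfDistillHead(nn.Module):
    """Distill deeper decoder features into the shared encoder features: project the
    encoder (student) into the decoder's channel space and match it against the
    detached decoder (teacher) features. The 1x1 projection and MSE objective are
    illustrative assumptions, not the exact SD-MTCNN losses."""
    def __init__(self, encoder_ch: int, decoder_ch: int):
        super().__init__()
        self.project = nn.Conv2d(encoder_ch, decoder_ch, kernel_size=1)

    def forward(self, encoder_feat: torch.Tensor, decoder_feat: torch.Tensor) -> torch.Tensor:
        student = self.project(encoder_feat)
        # Resize the detached teacher to the student's spatial resolution.
        teacher = F.interpolate(decoder_feat.detach(), size=student.shape[-2:],
                                mode="bilinear", align_corners=False)
        return F.mse_loss(student, teacher)

enc = torch.randn(2, 64, 32, 32)   # shared encoder features
dec = torch.randn(2, 128, 16, 16)  # deeper task-specific decoder features
print(SelfDistillHead(64, 128)(enc, dec).item())
```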

Recommended citation: Ankit Jha, Awanish Kumar, Biplab Banerjee and Vinay Namboodiri, “SD-MTCNN: Self-Distilled Multi-Task CNN”, Proceedings of British Machine Vision Conference (2020), https://www.bmvc2020-conference.com/conference/papers/paper_0448.html

Determinantal Point Process as an alternative to NMS

Published in British Machine Vision Conference (BMVC), 2020

We present a determinantal point process (DPP) inspired alternative to non-maximum suppression (NMS), which has become an integral step in all state-of-the-art object detection frameworks. DPPs have been shown to encourage diversity in subset selection problems. We pose NMS as a subset selection problem and posit that directly incorporating a DPP-like framework can improve the overall performance of the object detection system. We propose an optimization problem which takes the same inputs as NMS, but introduces a novel sub-modularity based diverse subset selection functional. Our results strongly indicate that the modifications proposed in this paper can provide consistent improvements to state-of-the-art object detection pipelines.
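
To make the subset-selection view concrete, here is a hedged sketch of a greedy, diversity-aware selection over candidate boxes, where a box's detection score is traded off against its IoU overlap with boxes already selected. This greedy score-minus-similarity rule only illustrates the DPP intuition; it is not the sub-modular functional optimised in the paper.

```python
import torch

def iou_matrix(boxes: torch.Tensor) -> torch.Tensor:
    """Pairwise IoU for boxes given as (x1, y1, x2, y2)."""
    area = (boxes[:, 2] - boxes[:, 0]).clamp(min=0) * (boxes[:, 3] - boxes[:, 1]).clamp(min=0)
    lt = torch.max(boxes[:, None, :2], boxes[None, :, :2])
    rb = torch.min(boxes[:, None, 2:], boxes[None, :, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=-1)
    return inter / (area[:, None] + area[None, :] - inter + 1e-9)

def diverse_select(boxes, scores, k=5, lam=1.0):
    """Greedy quality-vs-diversity selection: repeatedly pick the box whose score
    minus its worst overlap with already-selected boxes is largest. An illustration
    of the DPP intuition, not the paper's optimization problem."""
    sim = iou_matrix(boxes)
    selected, remaining = [], list(range(len(scores)))
    while remaining and len(selected) < k:
        gains = []
        for i in remaining:
            overlap = max((sim[i, j].item() for j in selected), default=0.0)
            gains.append(scores[i].item() - lam * overlap)
        best = remaining[int(torch.tensor(gains).argmax())]
        selected.append(best)
        remaining.remove(best)
    return selected

boxes = torch.tensor([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=torch.float)
scores = torch.tensor([0.9, 0.85, 0.6])
print(diverse_select(boxes, scores, k=2))   # [0, 2]
```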

Recommended citation: Some, Samik, Mithun Das Gupta, and Vinay P. Namboodiri. "Determinantal Point Process as an alternative to NMS." Proceedings of British Machine Vision Conference (2020), arXiv preprint arXiv:2008.11451 (2020). https://arxiv.org/abs/2008.11451

A Lip Sync Expert Is All You Need For Speech To Lip Generation In The Wild

Published in ACM International Conference on Multimedia (ACM Multimedia), 2020

In this work, we investigate the problem of lip-syncing a talking face video of an arbitrary identity to match a target speech segment. Current works excel at producing accurate lip movements on a static image or videos of specific people seen during the training phase. However, they fail to accurately morph the lip movements of arbitrary identities in dynamic, unconstrained talking face videos, resulting in significant parts of the video being out-of-sync with the new audio. We identify key reasons pertaining to this and hence resolve them by learning from a powerful lip-sync discriminator. Next, we propose new, rigorous evaluation benchmarks and metrics to accurately measure lip synchronization in unconstrained videos. Extensive quantitative evaluations on our challenging benchmarks show that the lip-sync accuracy of the videos generated by our Wav2Lip model is almost as good as real synced videos.
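
For intuition, a hedged sketch of a SyncNet-style expert sync loss is given below: audio and video window embeddings are compared by cosine similarity and supervised with a binary in-sync / off-sync label. The encoders and the exact loss used by Wav2Lip may differ; the names and shapes here are assumptions.

```python
import torch
import torch.nn.functional as F

def sync_loss(video_emb: torch.Tensor, audio_emb: torch.Tensor,
              is_synced: torch.Tensor) -> torch.Tensor:
    """SyncNet-style expert loss sketch: cosine similarity between video and audio
    window embeddings, supervised with a binary in-sync / off-sync label via BCE."""
    sim = F.cosine_similarity(video_emb, audio_emb, dim=-1)      # in [-1, 1]
    prob = (sim + 1.0) / 2.0                                     # map to (0, 1)
    return F.binary_cross_entropy(prob.clamp(1e-6, 1 - 1e-6), is_synced)

v = F.normalize(torch.randn(4, 512), dim=-1)       # hypothetical video-window embeddings
a = F.normalize(torch.randn(4, 512), dim=-1)       # hypothetical audio-window embeddings
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])        # 1 = in sync, 0 = off sync
print(sync_loss(v, a, labels).item())
```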

Recommended citation: K R Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, and C.V. Jawahar. 2020. A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild. In Proceedings of the 28th ACM International Conference on Multimedia (MM '20). Association for Computing Machinery, New York, NY, USA, 484–492. DOI:https://doi.org/10.1145/3394171.3413532 https://arxiv.org/abs/2008.10010

Visually Precise Query

Published in Proceedings of the 28th ACM International Conference on Multimedia, 2020

We present the problem of Visually Precise Query (VPQ) generation, which enables a more intuitive match between a user's information need and an e-commerce site's product description. Given an image of a fashion item, what is the optimal search query that will retrieve the exact same or closely related product(s) with high probability? In this paper we introduce the task of VPQ generation, which takes a product image and its title as input and provides a word-level extractive summary of the title, containing a list of salient attributes, which can then be used as a query to search for similar products. We collect a large dataset of fashion images and their titles and merge it with an existing research dataset which was created for a different task. Given the image and title pair, the VPQ problem is posed as identifying a non-contiguous collection of spans within the title. We provide a dataset of around 400K image, title and corresponding VPQ entries and release it to the research community. We provide a detailed description of the data collection process as well as discuss future directions of research for the problem introduced in this work. We provide standard text as well as visual domain baseline comparisons, and also provide multi-modal baseline models to analyze the task introduced in this work. Finally, we propose a hybrid fusion model which promises to be the direction of research in the multi-modal community.

Recommended citation: Dasgupta, Riddhiman and Tom, Francis and Kumar, Sudhir and Das Gupta, Mithun and Kumar, Yokesh and Patro, Badri N. and Namboodiri, Vinay P. (2020). "Visually Precise Query", Proceedings of the 28th ACM International Conference on Multimedia, New York, NY, USA, 2020. https://dl.acm.org/doi/abs/10.1145/3394171.3413558#d81339e1

Stochastic Talking Face Generation Using Latent Distribution Matching

Published in INTERSPEECH, 2020

The ability to envisage the visuals of a talking face based just on hearing a voice is a unique human capability. A number of recent works have addressed this ability. We differ from these approaches by enabling a variety of talking face generations from a single audio input. Indeed, just having the ability to generate a single talking face would make a system almost robotic in nature. In contrast, our unsupervised stochastic audio-to-video generation model allows for diverse generations from a single audio input. In particular, we present an unsupervised stochastic audio-to-video generation model that can capture multiple modes of the video distribution. We ensure that all the diverse generations are plausible through a principled multi-modal variational autoencoder framework. We demonstrate its efficacy on the challenging LRW and GRID datasets and demonstrate performance better than the baseline, while having the ability to generate multiple diverse lip synchronized videos.

Recommended citation: Ravindra Yadav, Ashish Sardana, Vinay P Namboodiri, Rajesh M Hegde. ”Stochastic Talking Face Generation Using Latent Distribution Matching.” In InterSpeech, Shanghai, China. October 25, 2020. https://www.isca-speech.org/archive/Interspeech_2020/pdfs/1823.pdf

Learning to Switch CNNs with Model Agnostic Meta Learning for Fine Precision Visual Servoing

Published in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS-2020), 2020

Convolutional Neural Networks (CNNs) have been successfully applied for relative camera pose estimation from labeled image-pair data, without requiring any hand-engineered features, camera intrinsic parameters or depth information. The trained CNN can be utilized for performing pose based visual servo control (PBVS). One of the ways to improve the quality of the visual servo output is to improve the accuracy of the CNN for relative pose estimation. Given a state-of-the-art CNN for relative pose regression, how can we achieve an improved performance for visual servo control? In this paper, we explore switching of CNNs to improve the precision of visual servo control. The idea of switching a CNN stems from the fact that the dataset for training a relative camera pose regressor for visual servo control must contain variations in relative pose ranging from a very small scale to eventually a larger scale. We found that training two different instances of the CNN, one for large-scale displacements (LSD) and another for small-scale displacements (SSD), and switching them during visual servo execution yields better results than training a single CNN with the combined LSD+SSD data. However, this causes extra storage overhead, and the switching decision is taken using a manually set threshold which may not be optimal for all scenes. To eliminate these drawbacks, we propose an efficient switching strategy based on the model agnostic meta learning (MAML) algorithm. In this, a single model is trained to learn parameters which are simultaneously good for multiple tasks, namely a binary classification for the switching decision, a 6DOF pose regression for LSD data, and a 6DOF pose regression for SSD data. The proposed approach performs far better than the naive approach, while the storage and run-time overheads are almost negligible.

Recommended citation: Prem Raj, Vinay P. Namboodiri, Laxmidhar Behera, “Learning to Switch CNNs with Model Agnostic Meta Learning for Fine Precision Visual Servoing”, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS-2020) https://arxiv.org/abs/2007.04645

Uncertainty Class Activation Map (U-CAM) using Gradient Certainty method

Published in IEEE Transactions on Image Processing, 2020

Understanding and explaining deep learning models is an imperative task. Towards this, we propose a method that obtains gradient-based certainty estimates that also provide visual attention maps. In particular, we address the visual question answering task. We incorporate modern probabilistic deep learning methods that we further improve by using the gradients for these estimates. This has two-fold benefits: a) improvement in obtaining certainty estimates that correlate better with misclassified samples, and b) improved attention maps that provide state-of-the-art results in terms of correlation with human attention regions. The improved attention maps result in consistent improvements for various methods of visual question answering. Therefore, the proposed technique can be thought of as a tool for obtaining improved certainty estimates and explanations for deep learning models. We provide detailed empirical analysis for the visual question answering task on all standard benchmarks and comparison with state-of-the-art methods.

Recommended citation: Badri N Patro, Mayank Lunayach, Vinay P Namboodiri (2020). "Uncertainty Class Activation Map (U-CAM) using Gradient Certainty method", IEEE Transactions on Image Processing, 2020. https://arxiv.org/pdf/2002.10309.pdf

GIFSL: Grafting based Improved Few-Shot Learning

Published in Image and Vision Computing, 2020

A few-shot learning model generally consists of a feature extraction network and a classification module. In this paper, we propose an approach to improve few-shot image classification performance by increasing the representational capacity of the feature extraction network and improving the quality of the features extracted by it. The ability of the feature extraction network to extract highly discriminative features from images is essential to few-shot learning. Such features are generally class agnostic and contain information about the general content of the image. Our approach improves the training of the feature extraction network in order to enable it to produce such features. We train the network using filter-grafting along with an auxiliary self-supervision task and a knowledge distillation procedure. In particular, filter-grafting rejuvenates unimportant (invalid) filters in the feature extraction network to make them useful and thereby increases the number of important filters that can be further improved by using self-supervision and knowledge distillation techniques. This combined approach helps in significantly improving the few-shot learning performance of the model. We perform experiments on several few-shot learning benchmark datasets such as mini-ImageNet, tiered-ImageNet, CIFAR-FS, and FC100 using our approach. We also present various ablation studies to validate the proposed approach. We empirically show that our approach performs better than other state-of-the-art few-shot learning methods.

Recommended citation: Pratik Mazumder, Pravendra Singh and Vinay P. Namboodiri, "GIFSL - grafting based improved few-shot learning", Image and Vision Computing, Volume 104, 2020. doi: 10.1016/j.imavis.2020.104006 https://www.sciencedirect.com/science/article/abs/pii/S0262885620301384

STEER: Simple Temporal Regularization For Neural ODEs

Published in Neural Information Processing Systems Conference (NeurIPS), 2020

Training Neural Ordinary Differential Equations (ODEs) is often computationally expensive. Indeed, computing the forward pass of such models involves solving an ODE which can become arbitrarily complex during training. Recent works have shown that regularizing the dynamics of the ODE can partially alleviate this. In this paper we propose a new regularization technique: randomly sampling the end time of the ODE during training. The proposed regularization is simple to implement, has negligible overhead and is effective across a wide variety of tasks. Further, the technique is orthogonal to several other methods proposed to regularize the dynamics of ODEs and as such can be used in conjunction with them. We show through experiments on normalizing flows, time series models and image recognition that the proposed regularization can significantly decrease training time and even improve performance over baseline models.
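
The regularizer is simple to express in code: instead of always integrating the ODE to a fixed end time T, the end time is sampled uniformly from an interval around T at each training step. The fixed-step Euler solver below is only there to keep the sketch self-contained; any ODE solver could be substituted.

```python
import torch
import torch.nn as nn

class ODEFunc(nn.Module):
    """A toy dynamics function f(t, y)."""
    def __init__(self, dim: int = 2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.Tanh(), nn.Linear(32, dim))

    def forward(self, t: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        return self.net(y)

def euler_integrate(func, y0, t1, steps=20):
    """Simple fixed-step Euler solver, just to keep the sketch self-contained."""
    y, t = y0, torch.tensor(0.0)
    dt = t1 / steps
    for _ in range(steps):
        y = y + dt * func(t, y)
        t = t + dt
    return y

def forward_with_steer(func, y0, t_end=1.0, b=0.5):
    """Sample the integration end time uniformly from (t_end - b, t_end + b)
    during training instead of always integrating to t_end."""
    t1 = t_end + (2 * torch.rand(()) - 1) * b
    return euler_integrate(func, y0, t1)

func = ODEFunc()
print(forward_with_steer(func, torch.randn(4, 2)).shape)   # torch.Size([4, 2])
```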

Recommended citation: Arnab Ghosh, Harkirat Singh Behl, Emilien Dupont, Philip H. S. Torr, Vinay Namboodiri, “STEER: Simple Temporal Regularization For Neural ODEs”, Proceedings of Neural Information Processing Systems Conference (NeurIPS) 2020 https://arxiv.org/abs/2006.10711

PhraseOut: A Code Mixed Data Augmentation Method for Multilingual Neural Machine Translation

Published in International Conference on Natural Language Processing (ICON-2020), 2020

Data augmentation methods for Neural Machine Translation (NMT), such as back-translation (BT) and self-training (ST), are quite popular. In a multilingual NMT system, simply copying monolingual source sentences to the target (Copying) is an effective data augmentation method. Back-translation augments parallel data by translating monolingual sentences on the target side into the source language. In this work, we propose to use a partial back-translation method in a multilingual setting. Instead of translating the entire monolingual target sentence back into the source language, we replace only selected high-confidence phrases and keep the rest of the words in the target language itself (we call this method PhraseOut). Our experiments on low resource multilingual translation models show that PhraseOut gives reasonable improvements over existing data augmentation methods.

Recommended citation: B. Jasim, V.P. Namboodiri and C.V. Jawahar, "PhraseOut: A Code Mixed Data Augmentation Method for Multilingual Neural Machine Translation", International Conference on Natural Language Processing (ICON-2020) http://vinaypn.github.io/files/icon2020_binu.pdf

Exploring Pair-Wise NMT for Indian Languages

Published in International Conference on Natural Language Processing (ICON-2020), 2020

In this paper, we address the task of improving pair-wise machine translation for specific low resource Indian languages. Multilingual NMT models have demonstrated a reasonable amount of effectiveness on resource-poor languages. In this work, we show that the performance of these models can be significantly improved by using a filtered back-translation process and subsequent fine-tuning on the limited pair-wise language corpora. The analysis in this paper suggests that this method can significantly improve a multilingual model's performance over its baseline, yielding state-of-the-art results for various Indian languages.

Recommended citation: K. Akella, S.H. Allu, S.S. Ragupathi, A. Singhal, Z. Khan, V.P. Namboodiri, C V Jawahar, "Exploring Pair-Wise NMT for Indian Languages", International Conference on Natural Language Processing (ICON-2020) short paper https://arxiv.org/abs/2012.05786

Visual Speech Enhancement Without A Real Visual Stream

Published in Winter Conference on Applications of Computer Vision (WACV ’21), 2021

In this work, we re-think the task of speech enhancement in unconstrained real-world environments. Current state-of-the-art methods use only the audio stream and are limited in their performance in a wide range of real-world noises. Recent works using lip movements as additional cues improve the quality of generated speech over "audio-only" methods. But, these methods cannot be used for several applications where the visual stream is unreliable or completely absent. We propose a new paradigm for speech enhancement by exploiting recent breakthroughs in speech-driven lip synthesis. Using one such model as a teacher network, we train a robust student network to produce accurate lip movements that mask away the noise, thus acting as a "visual noise filter". The intelligibility of the speech enhanced by our pseudo-lip approach is almost close (< 3% difference) to the case of using real lips. This implies that we can exploit the advantages of using lip movements even in the absence of a real video stream. We rigorously evaluate our model using quantitative metrics as well as qualitative human evaluations. Additional ablation studies and a demo video in the supplementary material containing qualitative comparisons and results clearly illustrate the effectiveness of our approach.

Recommended citation: Hegde, Sindhu B., K. Prajwal, R. Mukhopadhyay, Vinay Namboodiri and C. Jawahar. “Visual Speech Enhancement Without A Real Visual Stream.” Winter Conference on Applications of Computer Vision (WACV ’21) https://arxiv.org/abs/2012.10852

Domain Impression: A Source Data Free Domain Adaptation Method

Published in Winter Conference on Applications of Computer Vision (WACV), 2021

Unsupervised domain adaptation methods solve the adaptation problem for an unlabeled target set, assuming that the source dataset is available with all labels. However, the availability of actual source samples is not always possible in practical cases, due to memory constraints, privacy concerns, or challenges in sharing data. This practical scenario creates a bottleneck for the domain adaptation problem. This paper addresses this challenging scenario by proposing a domain adaptation technique that does not need any source data. Instead of the source data, we are only provided with a classifier that is trained on the source data. Our proposed approach is based on a generative framework, where the trained classifier is used for generating samples from the source classes. We learn the joint distribution of data by using the energy-based modeling of the trained classifier. At the same time, a new classifier is also adapted for the target domain. We perform various ablation analyses under different experimental setups and demonstrate that the proposed approach achieves better results than the baseline models in this extremely novel scenario.
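
One way to make "using the trained classifier generatively" concrete is the energy-based view in which the energy of an input is the negative log-sum-exp of the classifier logits and samples are drawn with stochastic gradient Langevin dynamics. The sketch below follows that view with assumed step sizes, noise scale, and a toy feature-space classifier; the paper's actual approach trains a generator, so read this purely as an illustration of energy-based sampling from a frozen classifier.

```python
import torch

def sample_from_classifier(classifier, shape, steps=50, step_size=1.0, noise=0.01):
    """Draw approximate samples from the data distribution implied by a frozen
    classifier, using the energy E(x) = -logsumexp(logits(x)) and SGLD updates.
    Step size, noise scale and step count are illustrative assumptions."""
    x = torch.randn(shape, requires_grad=True)
    for _ in range(steps):
        energy = -torch.logsumexp(classifier(x), dim=1).sum()
        grad, = torch.autograd.grad(energy, x)
        with torch.no_grad():
            x -= step_size * grad                 # move towards lower energy
            x += noise * torch.randn_like(x)      # Langevin noise
        x.requires_grad_(True)
    return x.detach()

# Toy classifier on 32-dimensional features (hypothetical shapes).
clf = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
print(sample_from_classifier(clf, (8, 32)).shape)   # torch.Size([8, 32])
```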

Recommended citation: Vinod K. Kurmi, Venkatesh K Subramanian, Vinay P. Namboodiri, “Domain Impression: A Source Data Free Domain Adaptation Method”, IEEE Winter Conference of Applications on Computer Vision (WACV), Virtual, 2021 https://delta-lab-iitk.github.io/SFDA/

Do not Forget to Attend to Uncertainty while Mitigating Catastrophic Forgetting

Published in Winter Conference on Applications of Computer Vision (WACV), 2021

One of the major limitations of deep learning models is that they face catastrophic forgetting in an incremental learning scenario. Several approaches have been proposed to tackle the problem of incremental learning. Most of these methods are based on knowledge distillation and do not adequately utilize the information provided by older task models, such as uncertainty estimation in predictions. The predictive uncertainty provides distributional information that can be applied to mitigate catastrophic forgetting in a deep learning framework. In the proposed work, we consider a Bayesian formulation to obtain the data and model uncertainties. We also incorporate a self-attention framework to address the incremental learning problem. We define distillation losses in terms of aleatoric uncertainty and self-attention, and present different ablation analyses of these losses. Furthermore, we are able to obtain better results in terms of accuracy on standard benchmarks.

Recommended citation: Vinod K. Kurmi, Badri N. Patro, Venkatesh K Subramanian, Vinay P. Namboodiri,“Do not Forget to Attend to Uncertainty while Mitigating Catastrophic Forgetting”, IEEE Winter Conference of Applications on Computer Vision (WACV), Virtual, 2021 https://delta-lab-iitk.github.io/Incremental-learning-AU/

AVGZSLNet: Audio-Visual Generalized Zero-Shot Learning by Reconstructing Label Features from Multi-Modal Embeddings

Published in Winter Conference on Applications of Computer Vision (WACV), 2021

In this paper, we solve for the problem of generalized zero-shot learning in a multi-modal setting, where we have novel classes of audio/video during testing that were not seen during training. We demonstrate that projecting the audio and video embeddings to the class label text feature space allows us to use the semantic relatedness of text embeddings as a means for zero-shot learning. Importantly, our multi-modal zero-shot learning approach works even if a modality is missing at test time. Our approach makes use of a cross-modal decoder which enforces the constraint that the class label text features can be reconstructed from the audio and video embeddings of data points in order to perform better on the multi-modal zero-shot learning task. We further minimize the gap between audio and video embedding distributions using KL-Divergence loss. We test our approach on the zero-shot classification and retrieval tasks, and it performs better than other models in the presence of a single modality as well as in the presence of multiple modalities.

Recommended citation: Pratik Mazumder and Pravendra Singh, Kranti Kumar Parida and Vinay P. Namboodiri, “AVGZSLNet: Audio-Visual Generalized Zero-Shot Learning by Reconstructing Label Features from Multi-Modal Embeddings”, The IEEE Winter Conference on Applications of Computer Vision, WACV 2021 https://arxiv.org/abs/2005.13402

Improving Few-Shot Learning using Composite Rotation based Auxiliary Task

Published in Winter Conference on Applications of Computer Vision (WACV), 2021

In this paper, we propose an approach to improve few-shot classification performance using a composite rotation based auxiliary task. Few-shot classification methods aim to produce neural networks that perform well both for classes with a large number of training samples and for classes with fewer training samples. They employ techniques to enable the network to produce highly discriminative features that are also very generic. Generally, the better the quality and generic nature of the features produced by the network, the better the performance of the network on few-shot learning. Our approach aims to train networks to produce such features by using a self-supervised auxiliary task. Our proposed composite rotation based auxiliary task performs rotation at two levels, i.e., rotation of patches inside the image (inner rotation) and rotation of the whole image (outer rotation), and assigns one out of 16 rotation classes to the modified image. We then simultaneously train for the composite rotation prediction task along with the original classification task, which forces the network to learn high-quality generic features that help improve the few-shot classification performance. We experimentally show that our approach performs better than existing few-shot learning methods on multiple benchmark datasets.
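
A small sketch of how a composite rotation label can be constructed: each 2 × 2 quadrant of the image is rotated by a shared inner angle and the whole image is then rotated by an outer angle, giving 4 × 4 = 16 auxiliary classes. The quadrant-based patch layout is an assumption for illustration.

```python
import torch

def composite_rotation(img: torch.Tensor, inner: int, outer: int) -> torch.Tensor:
    """Apply an inner rotation (each 2x2 quadrant rotated by inner*90 degrees)
    followed by an outer rotation (whole image rotated by outer*90 degrees).
    The auxiliary label is outer * 4 + inner, one of 16 classes.
    Quadrant-based patches are an assumption for illustration."""
    c, h, w = img.shape
    hh, ww = h // 2, w // 2
    out = img.clone()
    for i in (0, 1):
        for j in (0, 1):
            patch = img[:, i * hh:(i + 1) * hh, j * ww:(j + 1) * ww]
            out[:, i * hh:(i + 1) * hh, j * ww:(j + 1) * ww] = torch.rot90(patch, inner, dims=(1, 2))
    return torch.rot90(out, outer, dims=(1, 2))

img = torch.randn(3, 32, 32)
inner, outer = 1, 3
label = outer * 4 + inner                      # auxiliary class in [0, 15]
aug = composite_rotation(img, inner, outer)
print(aug.shape, label)                        # torch.Size([3, 32, 32]) 13
```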

Recommended citation: Pratik Mazumder, Pravendra Singh and Vinay P. Namboodiri, “Improving Few-Shot Learning using Composite Rotation based Auxiliary Task”, The IEEE Winter Conference on Applications of Computer Vision, WACV 2021 https://arxiv.org/abs/2006.15919

RNNP: A Robust Few-Shot Learning Approach

Published in Winter Conference on Applications of Computer Vision (WACV), 2021

Learning from a few examples is an important practical aspect of training classifiers. Various works have examined this aspect quite well. However, all existing approaches assume that the few examples provided are always correctly labeled. This is a strong assumption, especially if one considers the current techniques for labeling using crowd-based labeling services. We address this issue by proposing a novel robust few-shot learning approach. Our method relies on generating robust prototypes from a set of few examples. Specifically, our method refines the class prototypes by producing hybrid features from the support examples of each class. The refined prototypes help to classify the query images better. Our method can replace the evaluation phase of any few-shot learning method that uses a nearest-neighbor prototype-based evaluation procedure, in order to make it robust. We evaluate our method on the standard mini-ImageNet and tiered-ImageNet datasets. We perform experiments with various label corruption rates in the support examples of the few-shot classes. We obtain significant improvements over widely used few-shot learning methods that suffer significant performance degradation in the presence of label noise. We finally provide extensive ablation experiments to validate our method.
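
A hedged sketch of the prototype-refinement step: hybrid features are formed by convexly mixing pairs of support features within a class, and the class prototype is then recomputed from the original plus hybrid features. The mixing coefficient, the number of hybrids, and the simple mean used in place of the paper's refinement procedure are all illustrative assumptions.

```python
import torch

def refined_prototype(support: torch.Tensor, n_hybrids: int = 10, alpha: float = 0.5) -> torch.Tensor:
    """Refine a class prototype from support features (shape: n_support x dim) by
    adding hybrid features formed as convex mixes of random support pairs.
    The paper refines prototypes from such hybrids with a dedicated procedure;
    the plain mean here is a simplified stand-in for illustration."""
    n = support.shape[0]
    i = torch.randint(n, (n_hybrids,))
    j = torch.randint(n, (n_hybrids,))
    hybrids = alpha * support[i] + (1 - alpha) * support[j]
    return torch.cat([support, hybrids], dim=0).mean(dim=0)

support_feats = torch.randn(5, 64)        # 5-shot support features for one class
proto = refined_prototype(support_feats)
print(proto.shape)                         # torch.Size([64])
```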

Recommended citation: Pratik Mazumder, Pravendra Singh and Vinay P. Namboodiri, “RNNP: A Robust Few-Shot Learning Approach”, The IEEE Winter Conference on Applications of Computer Vision, WACV 2021 https://arxiv.org/abs/2011.11067

Self Supervision for Attention Networks

Published in The IEEE Winter Conference on Applications of Computer Vision, 2021

In recent years, the attention mechanism has become a fairly popular concept and has proven to be successful in many machine learning applications. However, deep learning models do not employ supervision for these attention mechanisms, which could improve the model's performance significantly. Therefore, in this paper, we tackle this limitation and propose a novel method to improve the attention mechanism by inducing "self-supervision". We devise a technique to generate desirable attention maps for any model that utilizes an attention module. This is achieved by examining the model's output for different regions sampled from the input and obtaining the attention probability distributions that enhance the proficiency of the model. The attention distributions thus obtained are used for supervision. We rely on the fact that attenuating the unimportant parts allows a model to attend to more salient regions, thus strengthening prediction accuracy. The quantitative and qualitative results published in this paper show that this method successfully improves the attention mechanism as well as the model's accuracy. In addition to the task of Visual Question Answering (VQA), we also show results on the tasks of image classification and text classification to demonstrate that our method can be generalized to any vision and language model that uses an attention module.

Recommended citation: Badri N. Patro, Kasturi G S, Ansh Jain, Vinay P. Namboodiri (2021). "Self Supervision for Attention Networks", The IEEE Winter Conference on Applications of Computer Vision, USA, 2021. https://github.com/Anonymous1207/Self_Supervsion_for_Attention_Networks

Revisiting Low Resource Status of Indian Languages in Machine Translation

Published in ACM India Joint International Conference on Data Science & Management of Data (CODS-COMAD), 2021

Indian language machine translation performance is hampered by the lack of large scale multi-lingual sentence aligned corpora and robust benchmarks. Through this paper, we provide and analyse an automated framework to obtain such a corpus for Indian language neural machine translation (NMT) systems. Our pipeline consists of a baseline NMT system, a retrieval module, and an alignment module that is used to work with publicly available websites such as press releases by the government. The main contribution of this effort is an incremental method that uses the above pipeline to iteratively improve the size of the corpus as well as each of the components of our system. Through our work, we also evaluate design choices such as the choice of pivoting language and the effect of an iterative incremental increase in corpus size. In addition to providing an automated framework, our work also generates a larger corpus than existing corpora available for Indian languages. This corpus helps us obtain substantially improved results on the publicly available WAT evaluation benchmark and other standard evaluation benchmarks.

Recommended citation: Jerin Philip, Shashank Siripragada, Vinay P Namboodiri, CV Jawahar, “Revisiting Low Resource Status of Indian Languages in Machine Translation”, 8th ACM IKDD CODS and 26th COMAD (CODS COMAD 2021), January 2-4, 2021, Bangalore, India https://arxiv.org/abs/2008.04860

SHAD3S: A model to Sketch, Shade and Shadow

Published in IEEE Winter Conference on Applications of Computer Vision (WACV), 2021

Hatching is a common method used by artists to accentuate the third dimension of a sketch and to illuminate the scene. Our system attempts to compete with a human at hatching generic three-dimensional (3D) shapes, and also tries to assist her in a form exploration exercise. The novelty of our approach lies in the fact that we make no assumptions about the input other than that it represents a 3D shape, and yet, given contextual information about illumination and texture, we synthesise an accurate hatch pattern over the sketch, without access to 3D or pseudo-3D. In the process, we contribute: a) a cheap yet effective method to synthesise a sufficiently large, high fidelity dataset pertinent to the task; b) a pipeline based on a conditional generative adversarial network (cGAN); and c) an interactive utility in GIMP that is a tool for artists to engage with automated hatching or a form-exploration exercise. User evaluation of the tool suggests that the model performance generalises satisfactorily over diverse input, both in terms of style and shape. A simple comparison of inception scores suggests that the generated distribution is as diverse as the ground truth.

Recommended citation: R. B. Venkataramaiyer, A. Joshi, S. Narang, and V. P. Namboodiri (2021). "SHAD3S: A model to Sketch, Shade and Shadow", Proceedings of 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Hawaii, USA, Jan. 2021. https://bvraghav.com/shad3s/

Multimodal Humor Dataset: Predicting Laughter tracks for Sitcoms

Published in The IEEE Winter Conference on Applications of Computer Vision, 2021

A great number of situational comedies (sitcoms) are regularly made, and adding laughter tracks to these is a critical task. Providing the ability to predict whether something will be humorous to the audience is also crucial. In this project, we aim to automate this task. Towards doing so, we annotate an existing sitcom ('Big Bang Theory') and use the laughter cues present to obtain a manual annotation for this show. We provide detailed analysis of the dataset design and further evaluate various state-of-the-art baselines for solving this task. We observe that existing LSTM and BERT based networks on the text alone do not perform as well as joint text-and-video or video-only networks. Moreover, it is challenging to ascertain that the words attended to while predicting laughter are indeed humorous. Our dataset and the analysis provided through this paper are a valuable resource towards solving this interesting semantic and practical task. As an additional contribution, we have developed a novel multi-modal self-attention based model for solving this task that outperforms currently prevalent models. The project page for our paper is https://delta-lab-iitk.github.io/Multimodal-Humor-Dataset/.

Recommended citation: Badri N. Patro, Mayank Lunayach, Deepankar Srivastava, Sarvesh, Hunar Singh, Vinay P. Namboodiri (2021). "Multimodal Humor Dataset: Predicting Laughter tracks for Sitcoms", The IEEE Winter Conference on Applications of Computer Vision, USA, 2021. https://delta-lab-iitk.github.io/Multimodal-Humor-Dataset/

talks

teaching

Teaching experience 1

Undergraduate course, University 1, Department, 2014

This is a description of a teaching experience. You can use markdown like any other post.

Teaching experience 2

Workshop, University 1, Department, 2015

This is a description of a teaching experience. You can use markdown like any other post.