Publications

Image retrieval based on projective invariance

Published in IEEE International Conference on Image Processing (ICIP), 2004

We propose an image retrieval scheme based on projectively invariant features. Since the cross-ratio is the fundamental invariant under projective transformations for collinear points, we use it as the basic feature parameter. We compute the cross-ratios of point quadruplets and obtain a discrete representation of the distribution of the cross-ratio from the computed values. This distribution is used as the feature for retrieval. The method is very effective in retrieving images, such as those of buildings, that have similar planar 3D structures.
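
For illustration, the cross-ratio of four collinear points, the quantity whose distribution serves as the retrieval feature, can be computed as in the minimal sketch below (not code from the paper; the function name, the (AC·BD)/(BC·AD) ordering convention, and the use of NumPy are assumptions):

```python
import numpy as np

def cross_ratio(a, b, c, d):
    """Cross-ratio of four collinear 2D points a, b, c, d.

    Computed from signed positions along the common line; the value is
    preserved by any projective transformation of the plane.
    """
    a, b, c, d = (np.asarray(p, dtype=float) for p in (a, b, c, d))
    u = (d - a) / np.linalg.norm(d - a)      # unit direction of the line
    t = lambda p: float(np.dot(p - a, u))    # 1D coordinate of p along the line
    ac, bd = t(c) - t(a), t(d) - t(b)
    bc, ad = t(c) - t(b), t(d) - t(a)
    return (ac * bd) / (bc * ad)

# Four collinear points; applying any homography to them leaves this value unchanged.
print(cross_ratio((0, 0), (1, 0), (3, 0), (4, 0)))   # 1.125
```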

Recommended citation: Rajashekhar, S. Chaudhuri and V. P. Namboodiri (2004). "Image retrieval based on projective invariance." IEEE International Conference on Image Processing Singapore, October 2004, Page 405-408 https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1418776

Use of Linear Diffusion in depth estimation based on defocus cue

Published in Fourth Indian Conference on Computer Vision, Graphics & Image Processing, 2004

Diffusion has been used extensively in computer vision. Most common applications of diffusion have been in low level vision problems like segmentation and edge detection. In this paper a novel application of the linear diffusion principle is made for the estimation of depth using the properties of the real aperture imaging system. The method uses two defocused images of a scene and the lens parameter setting as input and estimates the depth in the scene, and also generates the corresponding fully focused equivalent pin-hole image. The algorithm described here also brings out the equivalence of the two modalities, viz. depth from focus and depth from defocus for structure recovery.
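
The link between defocus and diffusion used here can be stated compactly: blurring the pin-hole image with a Gaussian is equivalent to evolving it under the linear heat equation, so the relative blur between the two observations corresponds to a diffusion time that the lens parameters map to depth. A standard statement of this equivalence (illustrative notation, not reproduced from the paper):

```latex
\[
\frac{\partial I}{\partial t} = c\,\nabla^{2} I, \qquad I(x,y,0)=I_0(x,y)
\;\;\Longrightarrow\;\;
I(\cdot,\cdot,t) = G_{\sigma(t)} * I_0, \qquad \sigma^{2}(t) = 2\,c\,t .
\]
```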

Recommended citation: V.P. Namboodiri and S. Chaudhuri (2004). “Use of Linear Diffusion in depth estimation based on defocus cue” Proceedings of Fourth Indian Conference on Computer Vision, Graphics & Image Processing (ICVGIP) Kolkata, India. December 2004. http://vinaypn.github.io/files/icvgip04.pdf

Shock Filters based on Implicit Cluster Separation

Published in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2005

One of the classic problems in low level vision is image restoration. An important contribution toward this effort has been the development of shock filters by Osher and Rudin (1990), which perform image deblurring using hyperbolic partial differential equations. In this paper we relate the notion of cluster separation from the field of pattern recognition to the shock filter formulation. A new kind of shock filter is proposed based on gradient based separation of clusters. The proposed formulation is general enough to allow various models of the density functions in the cluster separation process. The efficacy of the method is demonstrated through various examples.
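
For reference, the classical Osher–Rudin shock filter that this work builds on evolves the image as below; the contribution of the paper is to replace the edge-detector term by a gradient-based cluster-separation criterion, whose exact form is given in the paper:

```latex
\[
\frac{\partial I}{\partial t} \;=\; -\,\operatorname{sign}\!\big(\mathcal{L}(I)\big)\,\lVert\nabla I\rVert ,
\qquad \mathcal{L}(I) = I_{\eta\eta}\;\;\text{(second derivative along the gradient direction)} .
\]
```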

Recommended citation: V.P. Namboodiri and S. Chaudhuri (2005). "Shock Filters based on Implicit Cluster Separation." Proc. of IEEE International Conference on Computer Vision and Pattern Recognition (CVPR),San Diego June 2005, Page 1-6. http://vinaypn.github.io/files/cvpr05.pdf

Improved Kernel-Based Object Tracking Under Occluded Scenarios

Published in Fifth Indian Conference on Computer Vision, Graphics & Image Processing, 2006

A successful approach for object tracking has been kernel based object tracking [1] by Comaniciu et al. The method provides an effective solution to the problems of representation and localization in tracking: an object is represented by a feature histogram with an isotropic kernel, and localization is performed by a gradient based mean shift optimization. Though robust, this technique fails under occlusion. We improve kernel based object tracking by performing the localization using a generalized (bidirectional) mean shift based optimization, which makes the method resilient to occlusions. Another aspect of the localization step is the handling of scale changes by varying the bandwidth of the kernel. Here, we suggest a technique based on SIFT features [2] by Lowe to enable the bandwidth of the kernel to change even in the presence of occlusion. We demonstrate the effectiveness of the proposed techniques through extensive experimentation on a number of challenging data sets.
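
For context, the standard kernel-tracking iteration of Comaniciu et al. that is generalized here maximizes the Bhattacharyya coefficient between the target model q and the candidate histogram p(y) via mean shift; in the usual notation (g is the derivative of the kernel profile, b(x_i) the bin of pixel x_i), the bidirectional variant proposed in the paper is not reproduced:

```latex
\[
\rho\big(p(y),q\big)=\sum_{u=1}^{m}\sqrt{p_u(y)\,q_u},
\qquad
w_i=\sum_{u=1}^{m}\sqrt{\frac{q_u}{p_u(y_0)}}\,\delta\big[b(x_i)-u\big],
\qquad
y_1=\frac{\sum_i x_i\,w_i\,g\!\big(\lVert (y_0-x_i)/h\rVert^{2}\big)}{\sum_i w_i\,g\!\big(\lVert (y_0-x_i)/h\rVert^{2}\big)} .
\]
```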

Recommended citation: V.P. Namboodiri, A. Ghorawat and S. Chaudhuri (2006) “Improved Kernel-Based Object Tracking Under Occluded Scenarios”. In: Kalra P.K., Peleg S. (eds) Computer Vision, Graphics and Image Processing. Lecture Notes in Computer Science, vol 4338. Springer, Berlin, Heidelberg http://vinaypn.github.io/files/icvgip06.pdf

Retrieval of images of man-made structures based on projective invariance

Published in Pattern Recognition Journal, 2007

In this paper we propose a geometry-based image retrieval scheme that makes use of projectively invariant features. The cross-ratio (CR) is an invariant feature under projective transformations for collinear points. We compute the CRs of point sets in quadruplets, and the CR histogram is used as the feature for retrieval. Being a geometric feature, it allows us to retrieve similar images irrespective of viewpoint and illumination changes; we can retrieve the same building even if its facade has received a fresh coat of paint. Color and textural features can also be included, if desired. Experimental results show very good retrieval accuracy when tested on an image database of size 4000. The method is very effective in retrieving images of man-made objects rich in polygonal structures, such as buildings, rail tracks, etc.

Recommended citation: Rajashekhar, S. Chaudhuri and V.P. Namboodiri (2007), “Retrieval of images of man-made structures based on projective invariance”. Pattern Recognition Journal, Volume 40, Issue 1, January 2007, Pages 296-308 https://www.sciencedirect.com/science/article/pii/S0031320306001671

On defocus, diffusion and depth estimation

Published in Pattern Recognition Letters, 2007

An intrinsic property of real aperture imaging has been that the observations tend to be defocused. This artifact has been used in an innovative manner by researchers for depth estimation, since the amount of defocus varies with varying depth in the scene. There have been various methods to model the defocus blur. We model the defocus process using the model of diffusion of heat. The diffusion process has been traditionally used in low level vision problems like smoothing, segmentation and edge detection. In this paper a novel application of the diffusion principle is made for generating the defocus space of the scene. The defocus space is the set of all possible observations for a given scene that can be captured using a physical lens system. Using the notion of defocus space we estimate the depth in the scene and also generate the corresponding fully focused equivalent pin-hole image. The algorithm described here also brings out the equivalence of the two modalities, viz. depth from focus and depth from defocus for structure recovery.

Recommended citation: V.P. Namboodiri and S. Chaudhuri (2007). “On defocus, diffusion and depth estimation” Pattern Recognition Letters Volume 28, Issue 3, 1 February 2007, Pages 311-319 http://vinaypn.github.io/files/prl07.pdf

Super-Resolution Using Sub-band Constrained Total Variation

Published in International Conference on Scale Space and Variational Methods in Computer Vision (SSVM), 2007

Super-resolution of a single image is a severely ill-posed problem in computer vision. One way to approach it is through a total variation based regularization framework, which yields an edge preserving scheme for super-resolution. However, this scheme tends to produce a piece-wise constant result. To address this issue, we extend the formulation by incorporating an appropriate sub-band constraint which ensures the preservation of textural details, traded off against the noise present in the observation. The proposed framework is extensively evaluated and the corresponding experimental results are presented.
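
The baseline total-variation formulation being extended can be written as the energy below, where y is the low-resolution observation, H a blur operator and D the downsampling operator; the sub-band constraint enters as an additional term on the sub-band (e.g. wavelet) coefficients, whose exact form is given in the paper (notation assumed):

```latex
\[
\hat{u} \;=\; \arg\min_{u}\; \int_{\Omega} \lVert\nabla u\rVert \, dx
\;+\; \frac{\lambda}{2}\,\lVert D H u - y \rVert_2^{2} .
\]
```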

Recommended citation: P. Chatterjee, V.P. Namboodiri, S. Chaudhuri (2007) “Super-Resolution Using Sub-band Constrained Total Variation”  In: Sgallari F., Murli A., Paragios N. (eds) Scale Space and Variational Methods in Computer Vision SSVM 2007. Lecture Notes in Computer Science, vol 4485. Springer, Berlin, Heidelberg http://vinaypn.github.io/files/ssvm07.pdf

Shape Recovery Using Stochastic Heat Flow

Published in British Machine Vision Conference (BMVC), 2007

We consider the problem of depth estimation from multiple images based on the defocus cue. For a Gaussian defocus blur, the observations can be shown to be the solution of a deterministic but inhomogeneous diffusion process. However, the diffusion process does not sufficiently address the case in which the Gaussian kernel is deformed. This deformation happens due to several factors like self-occlusion, possible aberrations and imperfections in the aperture. These issues can be solved by incorporating a stochastic perturbation into the heat diffusion process. The resultant flow is that of an inhomogeneous heat diffusion perturbed by a stochastic curvature driven motion. The depth in the scene is estimated from the coefficient of the stochastic heat equation without actually knowing the departure from the Gaussian assumption. Further, the proposed method also takes into account the non-convex nature of the diffusion process. The method provides a strong theoretical framework for handling the depth from defocus problem.

Recommended citation: V.P. Namboodiri and S. Chaudhuri (2007) “Shape Recovery Using Stochastic Heat Flow“ Proceedings of the British Machine Vision Conference 2007, University of Warwick, UK, September 10-13, 2007 http://vinaypn.github.io/files/bmvc07.pdf

Image Restoration using Geometrically Stabilized Reverse Heat Equation

Published in IEEE International Conference on Image Processing (ICIP), 2007

Blind restoration of blurred images is a classical ill-posed problem. There has been considerable interest in the use of partial differential equations to solve this problem. The blurring of an image has traditionally been modeled by the heat equation, following Witkin [10] and Koenderink [4]; this is the basis of the Gaussian scale space. However, a similar theoretical formulation has not been possible for deblurring of images due to the ill-posed nature of the reverse heat equation. Here we consider the stabilization of the reverse heat equation. We do this by damping the distortion along the edges through the addition of a normal component of the heat equation in the forward direction. We use a stopping criterion based on the divergence of the curvature in the resulting reverse heat flow. The resulting stabilized reverse heat flow makes it possible to solve the challenging problem of blind space-varying deconvolution. The method is justified by a varied set of experimental results.
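
In gauge coordinates, with η along the image gradient and ξ along the level line, the Laplacian splits as ∇²I = I_ηη + I_ξξ. One illustrative way to write a stabilized reverse flow of the kind described, with forward diffusion retained in the gradient (edge-normal) direction, is given below; the precise weighting and the curvature-divergence stopping rule are as in the paper:

```latex
\[
\frac{\partial I}{\partial t} \;=\; -\,\nabla^{2} I \;+\; \alpha\, I_{\eta\eta},
\qquad \alpha > 0 .
\]
```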

Recommended citation: V.P. Namboodiri and S. Chaudhuri (2007). "Image Restoration using Geometrically Stabilized Reverse Heat Equation." Proceedings of IEEE International Conference on Image Processing (ICIP), San Antonio, Texas, USA, 2007, Pages IV - 413 - 416. http://vinaypn.github.io/files/icip07.pdf

Recovery of relative depth from a single observation using an uncalibrated (real-aperture) camera

Published in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008

In this paper we investigate the challenging problem of recovering the depth layers in a scene from a single defocused observation. The problem is definitely solvable if there are multiple observations. In this paper we show that one can perceive the depth in the scene even from a single observation. We use the inhomogeneous reverse heat equation to obtain an estimate of the blur, thereby preserving the depth information characterized by the defocus. However, the reverse heat equation, due to its parabolic nature, is divergent. We stabilize the reverse heat equation by considering the gradient degeneration as an effective stopping criterion. The amount of (inverse) diffusion is actually a measure of relative depth. Because of ill-posedness we propose a graph-cuts based method for inferring the depth in the scene using the amount of diffusion as a data likelihood and a smoothness condition on the depth in the scene. The method is verified experimentally on a varied set of test cases.

Recommended citation: V.P. Namboodiri and S. Chaudhuri (2008). "Recovery of relative depth from a single observation using an uncalibrated (real-aperture) camera." Proc. of IEEE International Conference on Computer Vision and Pattern Recognition (CVPR),Anchorage, AK, USA, June 2008, Page 1-6. http://vinaypn.github.io/files/cvpr08.pdf

Regularized depth from defocus

Published in IEEE International Conference on Image Processing (ICIP), 2008

In the area of depth estimation from images, an interesting approach has been structure recovery from the defocus cue. Towards this end, there have been a number of approaches [4,6]. Here we propose a technique to estimate regularized depth from defocus using diffusion. The coefficient of the diffusion equation is modeled using a pair-wise Markov random field (MRF), ensuring spatial regularization to enhance the robustness of the estimated depth. This framework is solved efficiently using a graph-cuts based technique. The MRF representation is enhanced by incorporating a smoothness prior obtained from a graph based segmentation of the input images. The method is demonstrated on a number of data sets and its performance is compared with state of the art techniques.

Recommended citation: V.P. Namboodiri, S. Chaudhuri and S. Hadap (2008). “Regularized depth from defocus”, Proceedings of IEEE International Conference on Image Processing (ICIP) , San Diego, CA, USA, pp. 1520-1523. http://vinaypn.github.io/files/icip08.pdf

Action Recognition: A Region Based Approach

Published in IEEE Workshop on Applications of Computer Vision (WACV), 2011

We address the problem of recognizing actions in real-life videos. Space-time interest point-based approaches have been widely prevalent towards solving this problem. In contrast, more spatially extended features such as regions have not been so popular. The reason is that any local region based approach requires the motion flow information for a specific region to be collated temporally. This is challenging as the local regions are deformable and not well delineated from their surroundings. In this paper we address this issue by using robust tracking of regions, and we show that it is possible to obtain region descriptors for classification of actions. This paper lays the groundwork for further investigation into region based approaches. Through this paper we make the following contributions: a) we advocate identification of salient regions based on motion segmentation; b) we adopt a state-of-the-art tracker for robust tracking of the identified regions rather than using isolated space-time blocks; c) we propose optical flow based region descriptors to encode the extracted trajectories in piece-wise blocks. We demonstrate the performance of our system on real-world data sets.

Recommended citation: H. Bilen, V.P. Namboodiri and L. Van Gool (2011). “Action recognition: A region based approach”, 2011 IEEE Workshop on Applications of Computer Vision (WACV), Kona, HI , 2011, pp. 294-300 http://vinaypn.github.io/files/wacv2011.pdf

Object and Action Classification with Latent Variables

Published in British Machine Vision Conference, 2011

In this paper we propose a generic framework to incorporate unobserved auxiliary information for classifying objects and actions. This framework allows us to explicitly account for localisation and alignment of representations for generic object and action classes as latent variables. We approach this problem in the discriminative setting as learning a max-margin classifier that infers the class label along with the latent variables. Through this paper we make the following contributions: a) we provide a method for incorporating latent variables into object and action classification; b) we specifically account for the presence of an explicit class related subregion which can include foreground and/or background; c) we explore a way to learn a better classifier by iterative expansion of the latent parameter space. We demonstrate the performance of our approach by rigorous experimental evaluation on a number of standard object and action recognition datasets.
Awarded: Best Paper Prize
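
In the binary case, the max-margin objective with latent variables takes the familiar latent-SVM form shown below, where h ranges over the latent localisation/alignment parameters and φ(x, h) is the representation extracted under h (illustrative notation; the multi-class setting and the iterative expansion of the latent parameter space follow the paper):

```latex
\[
\min_{w}\ \frac{1}{2}\lVert w\rVert^{2}
\;+\; C\sum_{i=1}^{n}\max\!\Big(0,\ 1 - y_i \max_{h\in\mathcal{H}} w^{\top}\phi(x_i,h)\Big) .
\]
```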

Recommended citation: H. Bilen, V.P. Namboodiri and L. Van Gool (2011). “Object and Action Classification with Latent Variables”, In Jesse Hoey, Stephen McKenna and Emanuele Trucco, Proceedings of the British Machine Vision Conference, pages 17.1-17.11. BMVA Press, September 2011 http://vinaypn.github.io/files/bmvc2011.pdf

Super-resolution techniques for minimally invasive surgery

Published in MICCAI workshop on augmented environments for computer assisted interventions (AE-CAI), 2011

We propose the use of super-resolution techniques to aid visualization while carrying out minimally invasive surgical procedures. These procedures are performed using small endoscopic cameras, which inherently have limited imaging resolution. The use of higher-end cameras is technologically challenging and currently not yet cost effective. A promising alternative is to consider improving the resolution by post-processing the acquired images through the use of currently prevalent super-resolution techniques. In this paper we analyse the different methodologies that have been proposed for super-resolution and provide a comprehensive evaluation of the most significant algorithms. The methods are evaluated using challenging in-vivo real world medical datasets. We suggest that the use of a learning-based super-resolution algorithm combined with an edge-directed approach would be most suited for this application.

Recommended citation: V. De Smet, V.P. Namboodiri and L. Van Gool (2011). “Super-resolution techniques for minimally invasive surgery”, 6th MICCAI workshop on augmented environments for computer assisted interventions-AE-CAI 2011 , Toronto,Canada. http://vinaypn.github.io/files/AECAI.pdf

Systematic evaluation of super-resolution using classification

Published in Visual Communications and Image Processing (VCIP), 2011

Currently two evaluation methods of super-resolution (SR) techniques prevail: the objective Peak Signal to Noise Ratio (PSNR) and a qualitative measure based on manual visual inspection. Both of these methods are sub-optimal: the latter does not scale well to large numbers of images, while the former does not necessarily reflect the perceived visual quality. We address these issues in this paper and propose an evaluation method based on image classification. We show that perceptual image quality measures like structural similarity are not suitable for evaluation of SR methods. On the other hand, a systematic evaluation using large datasets of thousands of real-world images provides a consistent comparison of SR algorithms that corresponds to perceived visual quality. We verify the success of our approach by presenting an evaluation of three recent super-resolution algorithms on standard image classification datasets.

Recommended citation: V. De Smet, V.P. Namboodiri and L. Van Gool (2011). “Systematic evaluation of super-resolution using classification”, 2011 Visual Communications and Image Processing (VCIP), Tainan, 2011, pp. 1-4. http://vinaypn.github.io/files/VCIP.pdf

Classification with Global, Local and Shared Features

Published in Joint DAGM (German Association for Pattern Recognition) and OAGM Symposium, 2012

We present a framework that jointly learns and then uses multiple image windows for improved classification. Apart from using the entire image content as context, class-specific windows are added, as well as windows that target class pairs. The location and extent of the windows are set automatically by handling the window parameters as latent variables. This framework makes the following contributions: a) the addition of localized information through the class-specific windows improves classification, b) windows introduced for the classification of class pairs further improve the results, c) the windows and classification parameters can be effectively learnt using a discriminative max-margin approach with latent variables, and d) the same framework is suited for multiple visual tasks such as classifying objects, scenes and actions. Experiments demonstrate the aforementioned claims.

Recommended citation: Bilen H., Namboodiri V.P., Van Gool L.J. (2012) Classification with Global, Local and Shared Features. In: Pinz A., Pock T., Bischof H., Leberl F. (eds) Joint DAGM (German Association for Pattern Recognition) and OAGM Symposium. Lecture Notes in Computer Science, vol 7476. Springer, Berlin, Heidelberg, pp 134-143 http://vinaypn.github.io/files/dagm2012.pdf

Nonuniform Image Patch Exemplars for Low Level Vision

Published in IEEE Workshop on Applications of Computer Vision (WACV), 2013

We approach the classification problem in a discriminative setting, as learning a max-margin classifier that infers the class label along with the latent variables. Through this paper we make the following contributions: a) we provide a method for incorporating latent variables into object and action classification; b) these variables determine the relative focus on foreground vs. background information that is taken account of; c) we design an objective function to more effectively learn in unbalanced data sets; d) we learn a better classifier by iterative expansion of the latent parameter space. We demonstrate the performance of our approach through experimental evaluation on a number of standard object and action recognition data sets.

Recommended citation: V. De Smet, L. Van Gool and V. P. Namboodiri, "Nonuniform image patch exemplars for low level vision," 2013 IEEE Workshop on Applications of Computer Vision (WACV), Tampa, FL, 2013, pp. 23-30. http://vinaypn.github.io/files/wacv2013.pdf

Object and Action Classification with Latent Window Parameters

Published in International Journal of Computer Vision (IJCV), 2014

In this paper we propose a generic framework to incorporate unobserved auxiliary information for classifying objects and actions. This framework allows us to automatically select a bounding box and its quadrants from which best to extract features. These spatial subdivisions are learnt as latent variables. The paper is an extended version of our earlier work [2], complemented with additional ideas, experiments and analysis.
We approach the classification problem in a discriminative setting, as learning a max-margin classifier that infers the class label along with the latent variables. Through this paper we make the following contributions: a) we provide a method for incorporating latent variables into object and action classification; b) these variables determine the relative focus on foreground vs. background information that is taken account of; c) we design an objective function to more effectively learn in unbalanced data sets; d) we learn a better classifier by iterative expansion of the latent parameter space. We demonstrate the performance of our approach through experimental evaluation on a number of standard object and action recognition data sets.

Recommended citation: H. Bilen, V.P. Namboodiri and L. Van Gool (2014), “Object and Action Classification with Latent Window Parameters”, International Journal of Computer Vision (IJCV) Vol: 106: 237 - 251, February 2014 http://vinaypn.github.io/files/ijcv2014.pdf

Object classification with adaptable regions

Published in IEEE Conference on Computer Vision and Pattern Recognition, 2014

In classification of objects substantial work has gone into improving the low level representation of an image by considering various aspects such as different features, a number of feature pooling and coding techniques and considering different kernels. Unlike these works, in this paper, we propose to enhance the semantic representation of an image. We aim to learn the most important visual components of an image and how they interact in order to classify the objects correctly. To achieve our objective, we propose a new latent SVM model for category level object classification. Starting from image-level annotations, we jointly learn the object class and its context in terms of spatial location (where) and appearance (what). Furthermore, to regularize the complexity of the model we learn the spatial and co-occurrence relations between adjacent regions, such that unlikely configurations are penalized. Experimental results demonstrate that the proposed method can consistently enhance results on the challenging Pascal VOC dataset in terms of classification and weakly supervised detection. We also show how semantic representation can be exploited for finding similar content.

Recommended citation: H. Bilen, M. Pedersoli, V. P. Namboodiri, T. Tuytelaars, L. Van Gool,“Object Classification with Adaptable Regions”, Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2014 http://vinaypn.github.io/files/cvpr2014.pdf

Mind the gap: Subspace based hierarchical domain adaptation

Published in Workshop on Transfer and Multi-View Learning at Advances in Neural Information Processing Systems (NIPS) 27, 2014

Domain adaptation techniques aim at adapting a classifier learnt on a source domain to work on the target domain. Exploiting the subspaces spanned by features of the source and target domains respectively is one approach that has been investigated towards solving this problem. These techniques normally assume the existence of a single subspace for the entire source/target domain. In this work, we consider the hierarchical organization of the data and consider multiple subspaces for the source and target domain based on the hierarchy. We evaluate different subspace based domain adaptation techniques under this setting and observe that using different subspaces based on the hierarchy yields consistent improvement over a non-hierarchical baseline.
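
As a point of reference, the flat (non-hierarchical) subspace alignment step that the hierarchical variants build on can be sketched as below; the hierarchy-specific choice of subspaces is described in the paper, and the function names and scikit-learn usage are assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

def subspace_alignment(Xs, Xt, d=20):
    """Flat subspace alignment (in the style of Fernando et al.): project
    source and target features onto their own d-dimensional PCA bases and
    map the source basis onto the target one with M = Ps^T Pt."""
    pca_s = PCA(n_components=d).fit(Xs)
    pca_t = PCA(n_components=d).fit(Xt)
    M = pca_s.components_ @ pca_t.components_.T   # d x d alignment matrix
    Zs = pca_s.transform(Xs) @ M                   # source features in aligned coordinates
    Zt = pca_t.transform(Xt)                       # target features in target coordinates
    return Zs, Zt

# Hypothetical usage with labelled source features (Xs, ys) and unlabelled target Xt:
# Zs, Zt = subspace_alignment(Xs, Xt, d=20)
# then train any classifier (e.g. a linear SVM) on (Zs, ys) and apply it to Zt.
```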

Recommended citation: A. Raj, V. P. Namboodiri, T. Tuytelaars, “Mind the Gap: Subspace based Hierarchical Domain Adaptation”, Workshop on Transfer and Multi-View Learning at Advances in Neural Information Processing Systems (NIPS) 27, Canada, 2014 http://vinaypn.github.io/files/task2014.pdf

Where is my Friend? - Person identification in Social Networks

Published in Proceedings of the Eleventh IEEE International Conference on Automatic Face and Gesture Recognition (FG 2015), 2015

One of the interesting applications of computer vision is to be able to identify or detect persons in the real world. This problem has been posed in the context of identifying people in television series [2] or in multi-camera networks [8]. However, a common scenario for this problem is identifying people in images prevalent on social networks. In this paper we present a method that aims to solve this problem in real world conditions where the person can be in any pose, profile and orientation and the face itself is not always clearly visible. Moreover, we show that the problem can be solved with supervision as weak as a label indicating whether the person is present or not, which is usually the case as people are tagged in social networks. This is challenging as there can be ambiguity in associating the right person. The problem is solved in this setting using a latent max-margin formulation where the identity of the person is the latent parameter that is classified. This framework builds on off-the-shelf computer vision techniques for person detection and face detection and is able to account for the inaccuracies of these components. The idea is to model the complete person in addition to the face, and to do so with weak supervision. We also contribute three real-world datasets that we have created for extensive evaluation of the solution. We show using these datasets that the problem can be effectively solved using the proposed method.

Recommended citation: D. Pathak, Sai Nitish S. and V. P. Namboodiri, “Where is my Friend? - Person identification in Social Networks”, Proceedings of the Eleventh IEEE International Conference on Automatic Face and Gesture Recognition (FG 2015), Ljubljana, Slovenia, 2015 http://vinaypn.github.io/files/fg2015.pdf

Subspace alignment based domain adaptation for RCNN detector

Published in Proceedings of British Machine Vision Conference (BMVC), 2015

In this paper, we propose subspace alignment based domain adaptation of the state of the art RCNN based object detector. The aim is to achieve high quality object detection in novel, real world target scenarios without requiring labels from the target domain. While unsupervised domain adaptation has been studied for object classification, for object detection it has been relatively unexplored. In subspace based domain adaptation for objects, we need access to source and target subspaces for the bounding box features. The absence of supervision (labels and bounding boxes are absent) makes the task challenging. In this paper, we show that we can still adapt subspaces that are localized to the object by obtaining detections from the RCNN detector trained on the source and applied on the target. We then form localized subspaces from the detections and show that subspace alignment based adaptation between these subspaces yields improved object detection. This evaluation is done by considering the challenging real world datasets of PASCAL VOC as source and the validation set of the Microsoft COCO dataset as target for various categories.

Recommended citation: Anant Raj, Vinay P. Namboodiri and Tinne Tuytelaars, “Subspace Alignment based Domain Adaptation for RCNN Detector”, Proceedings of British Machine Vision Conference (BMVC 2015), Swansea, UK, 2015 http://vinaypn.github.io/files/bmvc2015rnt.pdf

Adapting RANSAC SVM to Detect Outliers for Robust Classification.

Published in Proceedings of British Machine Vision Conference (BMVC), 2015

Most visual classification tasks assume the authenticity of the label information. However, due to several reasons such as the difficulty of annotation, or inadvertently due to human error, the annotation can often be noisy. This results in examples that are wrongly annotated. In this paper, we consider the examples that are wrongly annotated to be outliers. The task of learning a robust inlier model in the presence of outliers is typically done through the RANSAC algorithm. In this paper, we show that instead of adopting RANSAC to obtain the 'right' model, we could use many instances of randomly sampled sets to build a large number of models. The collective decision of all these classifiers can be used to identify samples that are likely to be outliers. This results in a modification to RANSAC SVM to explicitly obtain probable outliers from the set of given samples. Once the outliers are detected, these examples are excluded from the training set. The method can also be used to identify very hard examples in the training set. In this case, where we believe that the examples are correctly annotated, we can achieve good generalization when such examples are excluded from the training set. The method is evaluated using the standard PASCAL VOC dataset. We show that the method is particularly suited for identifying wrongly annotated examples, resulting in an improvement of more than 12% over the RANSAC SVM approach. Hard examples in the PASCAL VOC dataset are also identified by this method, and in fact this even results in a marginal improvement of the classification accuracy over the base classifier provided with all clean samples.
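
An illustrative reading of the collective-decision idea is sketched below: many linear SVMs are trained on random subsets, and samples that most held-out models misclassify are flagged as probable outliers. The model count, subset size and voting threshold are hypothetical choices, not the paper's settings:

```python
import numpy as np
from sklearn.svm import LinearSVC

def flag_probable_outliers(X, y, n_models=100, subset_frac=0.3,
                           vote_threshold=0.8, seed=0):
    """Flag samples that a majority of randomly-trained SVMs misclassify.

    Assumes each random subset contains both classes; flagged samples would
    be removed before retraining the final classifier.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    wrong = np.zeros(n)
    counted = np.zeros(n)
    for _ in range(n_models):
        idx = rng.choice(n, size=int(subset_frac * n), replace=False)
        clf = LinearSVC(max_iter=5000).fit(X[idx], y[idx])
        out = np.setdiff1d(np.arange(n), idx)        # score held-out samples only
        pred = clf.predict(X[out])
        wrong[out] += (pred != y[out])
        counted[out] += 1
    disagreement = wrong / np.maximum(counted, 1)
    return np.where(disagreement > vote_threshold)[0]
```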

Recommended citation: Subhabrata Debnath, Anjan Banerjee and Vinay P. Namboodiri, “Adapting RANSAC SVM to detect outliers for Robust Classification”,Proceedings of British Machine Vision Conference (BMVC 2015), Swansea, UK, 2015 http://vinaypn.github.io/files/bmvc2015dbn.pdf

Deep attributes for one-shot face recognition

Published in ECCV Workshop on ‘Transferring and Adapting Source Knowledge in Computer Vision’, 2016

We address the problem of one-shot unconstrained face recognition. This is addressed by using a deep attribute representation of faces. While face recognition has considered the use of attribute based representations, the methods proposed so far for one-shot face recognition have used different features to represent the limited example available. We postulate that by using an intermediate attribute representation, it is possible to outperform a purely face based feature representation for one-shot recognition. We use two one-shot face recognition techniques, based on the exemplar SVM and the one-shot similarity kernel, to compare face based deep feature representations against a deep attribute based representation. Key result: the evaluation on the standard 'Labeled Faces in the Wild' dataset suggests that deep attribute based representations can outperform deep feature based face representations for this problem of one-shot face recognition.

Recommended citation: A. Jadhav, V.P. Namboodiri and K.S. Venkatesh, “Deep Attributes for One-Shot Face Recognition”, ECCV Workshop on ‘Transferring and Adapting Source Knowledge in Computer Vision’, Amsterdam, 2016 http://vinaypn.github.io/files/task2016.pdf

Using Gaussian Processes to Improve Zero-Shot Learning with Relative Attributes

Published in Proceedings of Asian Conference on Computer Vision (ACCV), 2016

Relative attributes can serve as a very useful method for zero-shot learning of images. This was shown by the work of Parikh and Grauman [1], where an image is expressed in terms of attributes that are relatively specified between different class pairs. However, for zero-shot learning the authors had assumed a simple Gaussian Mixture Model (GMM), using GMM based clustering to obtain the label for an unknown target test example. In this paper, we contribute a principled approach that uses Gaussian Process based classification to obtain the posterior probability for each sample of an unknown target class, in terms of Gaussian process classification and regression for the nearest sample images. We analyse different variants of this approach and show that such a principled approach yields improved performance and a better understanding in terms of probabilistic estimates. The method is evaluated on the standard PubFig and Shoes with Attributes benchmarks.

Recommended citation: Y. Dolma and V.P. Namboodiri, “Gaussian Processes to Improve Zero-Shot Learning with Relative Attributes”, Proceedings of Asian Conference in Computer Vision (ACCV), Taipei, Taiwan, 2016 http://vinaypn.github.io/files/accv2016.pdf

Contextual RNN-GANs for abstract reasoning diagram generation

Published in Thirty-First AAAI Conference on Artificial Intelligence (AAAI), 2017

Understanding, predicting, and generating object motions and transformations is a core problem in artificial intelligence. Modeling sequences of evolving images may provide better representations and models of motion and may ultimately be used for forecasting, simulation, or video generation. Diagrammatic Abstract Reasoning is an avenue in which diagrams evolve in complex patterns and one needs to infer the underlying pattern sequence and generate the next image in the sequence. For this, we develop a novel Contextual Generative Adversarial Network based on Recurrent Neural Networks (Context-RNN-GANs), where both the generator and the discriminator modules are based on contextual history (modeled as RNNs) and the adversarial discriminator guides the generator to produce realistic images for the particular time step in the image sequence. We evaluate the Context-RNN-GAN model (and its variants) on a novel dataset of Diagrammatic Abstract Reasoning, where it performs competitively with 10th-grade human performance but there is still scope for interesting improvements as compared to college-grade human performance. We also evaluate our model on a standard video next-frame prediction task, achieving improved performance over comparable state-of-the-art.

Recommended citation: A. Ghosh, V. Kulharia, A. Mukerjee, V.P. Namboodiri, M. Bansal, “Contextual RNN-GANs for Abstract Reasoning Diagram Generation”, Proceedings of Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), San Francisco, California, USA, February 2017 http://vinaypn.github.io/files/aaai2017.pdf

Sketchsoup: Exploratory ideation using design sketches

Published in Computer Graphics Forum (CGF) Journal, 2017

A hallmark of early stage design is a number of quick-and-dirty sketches capturing design inspirations, model variations, and alternate viewpoints of a visual concept. We present SketchSoup, a workflow that allows designers to explore the design space induced by such sketches. We take an unstructured collection of drawings as input, register them using a multi-image matching algorithm, and present them as a 2D interpolation space. By morphing sketches in this space, our approach produces plausible visualizations of shape and viewpoint variations despite the presence of sketch distortions that would prevent standard camera calibration and 3D reconstruction. In addition, our interpolated sketches can serve as inspiration for further drawings, which feed back into the design space as additional image inputs. SketchSoup thus fills a significant gap in the early ideation stage of conceptual design by allowing designers to make better informed choices before proceeding to more expensive 3D modeling and prototyping. From a technical standpoint, we describe an end-to-end system that judiciously combines and adapts various image processing techniques to the drawing domain – where the images are dominated not by color, shading and texture, but by sketchy stroke contours.

Recommended citation: R. Arora, I. Darolia, V.P. Namboodiri, K. Singh and A. Bousseau, “SketchSoup: Exploratory Ideation Using Design Sketches”, Computer Graphics Forum, 2017 http://vinaypn.github.io/files/cgf2017.pdf

Compact Environment-Invariant Codes for Robust Visual Place Recognition

Published in 14th Conference on Computer and Robot Vision (CRV), 2017

Robust visual place recognition (VPR) requires scene representations that are invariant to various environmental challenges such as seasonal changes and variations due to ambient lighting conditions during day and night. Moreover, a practical VPR system necessitates compact representations of environmental features. To satisfy these requirements, in this paper we suggest a modification to the existing pipeline of VPR systems to incorporate supervised hashing. The modified system learns (in a supervised setting) compact binary codes from image feature descriptors. These binary codes imbibe robustness to the visual variations exposed to them during the training phase, thereby making the system adaptive to severe environmental changes. Incorporating supervised hashing also makes VPR computationally more efficient and easy to implement on simple hardware, because binary embeddings can be learned over simple-to-compute features and the distance computation takes place in the low-dimensional Hamming space of binary codes. We have performed experiments on several challenging data sets covering seasonal, illumination and viewpoint variations. We also compare two widely used supervised hashing methods, CCA-ITQ and MLH, and show that this new pipeline outperforms or closely matches the state-of-the-art deep learning VPR methods that are based on high-dimensional features extracted from pre-trained deep convolutional neural networks.
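
The matching step that makes this pipeline cheap is plain Hamming ranking over the learned binary codes; a minimal sketch is given below, assuming 0/1 code vectors produced by a supervised hashing method such as CCA-ITQ or MLH (function and variable names are illustrative):

```python
import numpy as np

def hamming_distances(query_code, db_codes):
    """Hamming distances between one binary code and a database of codes.

    Codes are 0/1 uint8 vectors; packing them into bytes lets XOR plus a
    bit count do the matching cheaply."""
    q = np.packbits(query_code.astype(np.uint8))
    db = np.packbits(db_codes.astype(np.uint8), axis=1)
    xor = np.bitwise_xor(db, q)                       # differing bits, bytewise
    return np.unpackbits(xor, axis=1).sum(axis=1)     # popcount per database item

# Hypothetical usage: rank places by ascending Hamming distance to the query code.
# ranking = np.argsort(hamming_distances(q_code, database_codes))
```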

Recommended citation: U.Jain, V.P. Namboodiri and G. Pandey,“Supervised Hashing for Robust Visual Place Recognition”, 14th Conference on Computer and Robot Vision, Edmonton, Alberta, May 16-19, 2017 http://vinaypn.github.io/files/crv2017.pdf

Reactive Displays for Virtual Reality

Published in IEEE International Symposium on Mixed and Augmented Reality (ISMAR-Adjunct), 2017

The feeling of presence in virtual reality has enabled a large number of applications. These applications typically deal with 360° content. However, a large amount of existing content is available as images and videos, i.e., 2D content. Unfortunately, these do not react to the viewer's position or motion when viewed through a VR HMD. Thus, in this work we propose reactive displays for VR which instigate a feeling of discovery while exploring 2D content. We create this by taking into account the user's position and motion to compute homography based mappings that adapt the 2D content and re-project it onto the display. This allows the viewer to obtain a richer experience of interacting with 2D content, similar to the effect of viewing a scene through a window. We also provide a VR interface that uses a constrained set of reactive displays to easily browse through 360° content. The proposed interface tackles the problem of nausea caused by existing interfaces like photospheres by providing a natural room-like intermediate interface before changing 360° content. We perform user studies to evaluate both of our interfaces. The results show that the proposed reactive display interfaces are indeed beneficial.

Recommended citation: G S S Srinivas Rao, Neeraj Thakur, Vinay P. Namboodiri, “Reactive Displays for Virtual Reality”, Proceedings of 16th IEEE International Symposium on Mixed and Augmented Reality (ISMAR) (Poster Proceedings), Nantes, France, 2017 http://vinaypn.github.io/files/ismar2017.pdf

Visual Odometry Based Omni-directional Hyperlapse

Published in Proceedings of National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG 2017), 2017

The prohibitive amounts of time required to review the large amounts of data captured by surveillance and other cameras have brought into question the very utility of large scale video logging. Yet, one recognizes that such logging and analysis are indispensable to security applications. The only way out of this paradox is to devise expedited browsing through the creation of hyperlapse. We address the hyperlapse problem for the very challenging category of intensive egomotion, which makes the hyperlapse highly jerky. We propose an economical approach for trajectory estimation based on visual odometry and implement cost functions to penalize pose and path deviations. Further, this is implemented on data taken by an omni-directional camera, so that the viewer can opt to observe any direction while browsing. This requires many innovations, including handling the massive radial distortions and implementing scene stabilization that needs to operate on the least distorted region of the omni view.

Recommended citation: P. Rani, A. Jangid, V.P. Namboodiri and K.S. Venkatesh, “Visual Odometry based Omni-directional Hyperlapse”, Proceedings of National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG 2017), Mandi, India 2017 http://vinaypn.github.io/files/ncvpripg2017.pdf

No Modes left behind: Capturing the data distribution effectively using GANs

Published in Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), 2018

Generative adversarial networks (GANs), while being very versatile in realistic image synthesis, are still sensitive to the input distribution. Given a set of data that has an imbalance in the distribution, the networks are susceptible to missing modes and not capturing the data distribution. While various methods have been tried to improve the training of GANs, these have not addressed the challenge of covering the full data distribution; specifically, a generator is not penalized for missing a mode. We show that such generators are therefore still susceptible to not capturing the full data distribution. In this paper, we propose a simple approach that combines an encoder based objective with novel loss functions for the generator and discriminator that improve the solution in terms of capturing missing modes. We validate that the proposed method results in substantial improvements through detailed analysis on toy and real datasets. The quantitative and qualitative results demonstrate that the proposed method improves the solution to the problem of missing modes and improves the training of GANs.

Recommended citation: S. Sharma and V.P. Namboodiri,”No Modes left behind: Capturing the data distribution effectively using GANs”, Proceedings of Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18),New Orleans, USA, February 2018 http://vinaypn.github.io/files/aaai2018.pdf

Word spotting in silent lip videos

Published in IEEE Winter Conference on Applications of Computer Vision (WACV), 2018

Our goal is to spot words in silent speech videos without explicitly recognizing the spoken words, where the lip motion of the speaker is clearly visible and audio is absent. Existing work in this domain has mainly focused on recognizing a fixed set of words in word-segmented lip videos, which limits the applicability of the learned model due to the limited vocabulary and a high dependency on the model's recognition performance. Our contribution is two-fold: 1) we develop a pipeline for recognition-free retrieval, and show its performance against recognition-based retrieval on a large-scale dataset and another set of out-of-vocabulary words; 2) we introduce a query expansion technique using pseudo-relevant feedback and propose a novel re-ranking method based on maximizing the correlation between spatio-temporal landmarks of the query and the top retrieval candidates. Our word spotting method achieves 35% higher mean average precision over the recognition-based method on the large-scale LRW dataset. Finally, we demonstrate an application of the method by word spotting in a popular speech video ("The Great Dictator" by Charlie Chaplin), where we show that word retrieval can be used to understand what was spoken, perhaps even in silent movies.

Recommended citation: A. Jha, V. P. Namboodiri and C. V. Jawahar, "Word Spotting in Silent Lip Videos," IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, 2018, pp. 150-159. http://vinaypn.github.io/files/wacv2018.pdf

Unsupervised domain adaptation of deep object detectors.

Published in European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN) , 2018

Domain adaptation has been well understood and adopted in vision. Recently, with the advent of deep learning, a number of techniques have been proposed for deep learning based domain adaptation. However, these methods have been used for adapting object classification. In this paper, we address domain adaptation for object detection, which is more commonly used in practice. We adapt deep adaptation techniques to the Faster R-CNN framework, specifically recent techniques based on Gradient Reversal and on Maximum Mean Discrepancy (MMD) reduction. Among them, we show that the MK-MMD based method, when used appropriately, provides the best results. We analyze our model in a standard real world setting by using Pascal VOC as source and MS-COCO as target, and show a gain of 2.5 mAP at an IoU of 0.5 over a source-only trained model. We show that this improvement is statistically significant.
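
A minimal sketch of the discrepancy term involved is given below: a biased multi-kernel MMD² estimate between pooled source and target features, with a plain average over RBF bandwidths standing in for the learned kernel weights of the full MK-MMD formulation. How and where the term attaches to the Faster R-CNN loss follows the paper; names and bandwidth values are assumptions:

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian RBF kernel matrix between rows of A and rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mk_mmd2(Xs, Xt, gammas=(0.25, 0.5, 1.0, 2.0, 4.0)):
    """Biased multi-kernel MMD^2 estimate between source and target features."""
    total = 0.0
    for g in gammas:
        k_ss = rbf_kernel(Xs, Xs, g).mean()
        k_tt = rbf_kernel(Xt, Xt, g).mean()
        k_st = rbf_kernel(Xs, Xt, g).mean()
        total += k_ss + k_tt - 2.0 * k_st
    return total / len(gammas)

# Hypothetical usage on pooled region features from the detector:
# loss = detection_loss + lam * mk_mmd2(source_feats, target_feats)
```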

Recommended citation: Debjeet Majumdar and Vinay P. Namboodiri,”Unsupervised domain adaptation of deep object detectors.”, 26th European Symposium on Artificial Neural Networks, ESANN 2018, Bruges, Belgium, April 25-27, 2018.  http://vinaypn.github.io/files/esann2018.pdf

Multi-agent diverse generative adversarial networks

Published in IEEE Conference on Computer Vision and Pattern Recognition, 2018

We propose MAD-GAN, an intuitive generalization to the Generative Adversarial Networks (GANs) and its conditional variants to address the well known problem of mode collapse. First, MAD-GAN is a multi-agent GAN architecture incorporating multiple generators and one discriminator. Second, to enforce that different generators capture diverse high probability modes, the discriminator of MAD-GAN is designed such that along with finding the real and fake samples, it is also required to identify the generator that generated the given fake sample. Intuitively, to succeed in this task, the discriminator must learn to push different generators towards different identifiable modes. We perform extensive experiments on synthetic and real datasets and compare MAD-GAN with different variants of GAN. We show high quality diverse sample generations for challenging tasks such as image-to-image translation and face generation. In addition, we also show that MAD-GAN is able to disentangle different modalities when trained using highly challenging diverse-class dataset (e.g. dataset with images of forests, icebergs, and bedrooms). In the end, we show its efficacy on the unsupervised feature representation task.

Recommended citation: A. Ghosh, V. Kulharia, V.P. Namboodiri, P.H.S. Torr and P. Dokania, “Multi-Agent Diverse Generative Adversarial Networks”, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR)Salt Lake City, Utah, June 2018. http://openaccess.thecvf.com/content_cvpr_2018/html/Ghosh_Multi-Agent_Diverse_Generative_CVPR_2018_paper.html

Differential Attention for Visual Question Answering

Published in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018

In this paper we aim to answer questions based on images when provided with a dataset of question-answer pairs for a number of images during training. A number of methods have focused on solving this problem by using image based attention. This is done by focusing on a specific part of the image while answering the question. Humans also do so when solving this problem. However, the regions that the previous systems focus on are not correlated with the regions that humans focus on. The accuracy is limited due to this drawback. In this paper, we propose to solve this problem by using an exemplar based method. We obtain one or more supporting and opposing exemplars to obtain a differential attention region. This differential attention is closer to human attention than other image based attention methods. It also helps in obtaining improved accuracy when answering questions. The method is evaluated on challenging benchmark datasets. We perform better than other image based attention methods and are competitive with other state of the art methods that focus on both image and questions.

Recommended citation: B.N. Patro and V.P. Namboodiri, “Differential Attention for Visual Question Answering”, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR)Salt Lake City, Utah, June 2018. https://badripatro.github.io/DVQA/

Eclectic domain mixing for effective adaptation in action spaces

Published in Journal on Multimedia Tools and Applications, 2018

Although videos appear to be very high-dimensional in terms of duration × frame-rate × resolution, temporal smoothness constraints ensure that the intrinsic dimensionality for videos is much lower. In this paper, we use this idea for investigating Domain Adaptation (DA) in videos, an area that remains under-explored. An approach that has worked well for image DA is based on subspace modeling of the source and target domains, which works under the assumption that the two domains share a latent subspace where the domain shift can be reduced or eliminated. In this paper, we first extend three subspace based image DA techniques to human action recognition and then combine them with our proposed Eclectic Domain Mixing (EDM) approach to improve the effectiveness of the DA. Further, we use discrepancy measures such as Symmetrized KL Divergence and Target Density Around Source for an empirical study of the proposed EDM approach. While this work mainly focuses on domain adaptation in videos, for completeness of the study we comprehensively evaluate our approach using both object and action datasets. We have achieved consistent improvements over the chosen baselines and obtained some state-of-the-art results for the datasets.

Recommended citation: Jamal, A., Deodhare, D., Namboodiri, V.P., Venkatesh, K.S. “Eclectic domain mixing for effective adaptation in action spaces”, Journal on Multimedia Tools and Applications, November 2018, Volume 77, Issue 22, pp 29949–29969. https://link.springer.com/article/10.1007/s11042-018-6179-y

Learning semantic sentence embeddings using sequential pair-wise discriminator

Published in International Conference on Computational Linguistics (COLING), 2018

In this paper, we propose a method for obtaining sentence-level embeddings. While the problem of securing word-level embeddings is very well studied, we propose a novel method for obtaining sentence-level embeddings. This is obtained by a simple method in the context of solving the paraphrase generation task. If we use a sequential encoder-decoder model for generating paraphrase, we would like the generated paraphrase to be semantically close to the original sentence. One way to ensure this is by adding constraints for true paraphrase embeddings to be close and unrelated paraphrase candidate sentence embeddings to be far. This is ensured by using a sequential pair-wise discriminator that shares weights with the encoder that is trained with a suitable loss function. Our loss function penalizes paraphrase sentence embedding distances from being too large. This loss is used in combination with a sequential encoder-decoder network. We also validated our method by evaluating the obtained embeddings for a sentiment analysis task. The proposed method results in semantic embeddings and outperforms the state-of-the-art on the paraphrase generation and sentiment analysis task on standard datasets. These results are also shown to be statistically significant.

Recommended citation: B.N. Patro, V.K. Kurmi, S. Kumar and V.P. Namboodiri, “Learning semantic sentence embeddings using sequential pair-wise discriminator”, Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 2018 https://badripatro.github.io/Question-Paraphrases/

Monoaural Audio Source Separation Using Variational Autoencoders.

Published in Proceedings of Interspeech Conference, 2018

We introduce a monaural audio source separation framework using a latent generative model. Traditionally, discriminative training for source separation has been proposed using deep neural networks or non-negative matrix factorization. In this paper, we propose a principled generative approach using variational autoencoders (VAE) for audio source separation. The VAE performs efficient Bayesian inference which leads to a continuous latent representation of the input data (spectrogram). It contains a probabilistic encoder which projects the input data to a latent space and a probabilistic decoder which projects data from the latent space back to the input space. This allows us to learn a robust latent representation of sources corrupted with noise and other sources. The latent representation is then fed to the decoder to yield the separated source. Both encoder and decoder are implemented via multilayer perceptrons (MLPs). In contrast to prevalent techniques, we argue that VAE is a more principled approach to source separation. Experimentally, we find that the proposed framework yields reasonable improvements when compared to baseline methods available in the literature, i.e. DNN and RNN with different masking functions and autoencoders. We show that our method performs better than the best of the relevant methods, with ∼2 dB improvement in the source to distortion ratio.
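
The objective maximized by the VAE is the standard evidence lower bound shown below, with x a spectrogram frame, z the latent code, and the encoder q_φ and decoder p_θ realized as MLPs; how the separated source is read out from the decoder follows the paper (notation assumed):

```latex
\[
\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z\mid x)}\!\big[\log p_\theta(x\mid z)\big]
\;-\; D_{\mathrm{KL}}\!\big(q_\phi(z\mid x)\,\Vert\,p(z)\big) .
\]
```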

Recommended citation: Laxmi Pandey, Anurendra Kumar and Vinay P. Namboodiri, “Monoaural Audio Source Separation Using Variational Autoencoders.”, 19th Annual Conference of the International Speech Communication Association, Interspeech 2018, Hyderabad, India, 2-6 September 2018 http://vinaypn.github.io/files/interspeech2018.pdf

Deep Domain Adaptation in Action Space.

Published in Proceedings of British Machine Vision Conference (BMVC), 2018

In the general settings of supervised learning, human action recognition has been a widely studied topic. The classifiers learned in this setting assume that the training and test data have been sampled from the same underlying probability distribution. However, in most of the practical scenarios, this assumption is not true, resulting in a suboptimal performance of the classifiers. This problem, referred to as Domain Shift, has been extensively studied, but mostly for image/object classification task. In this paper, we investigate the problem of Domain Shift in action videos, an area that has remained under-explored, and propose two new approaches named Action Modeling on Latent Subspace (AMLS) and Deep Adversarial Action Adaptation (DAAA). In the AMLS approach, the action videos in the target domain are modeled as a sequence of points on a latent subspace and adaptive kernels are successively learned between the source domain point and the sequence of target domain points on the manifold. In the DAAA approach, an end-to-end adversarial learning framework is proposed to align the two domains. The action adaptation experiments were conducted using various combinations of multi-domain action datasets, including six common classes of Olympic Sports and UCF50 datasets and all classes of KTH, MSR and our own SonyCam datasets. In this paper, we have achieved consistent improvements over chosen baselines and obtained some state-of-the-art results for the datasets.

Recommended citation: Arshad Jamal, Vinay P. Namboodiri, Dipti Deodhare and K.S. Venkatesh, “Deep Domain Adaptation in Action Space”, British Machine Vision Conference 2018, BMVC 2018, Northumbria University, Newcastle, UK, September 3-6, 2018 http://bmvc2018.org/contents/papers/0960.pdf

Deep active learning for object detection.

Published in Proceedings of British Machine Vision Conference (BMVC), 2018

Object detection methods like the Single Shot MultiBox Detector (SSD) provide highly accurate object detection and run in real time. However, these approaches require a large number of annotated training images. Evidently, not all of these images are equally useful for training the algorithms. Moreover, obtaining annotations in terms of bounding boxes for each image is costly and tedious. In this paper, we aim to obtain a highly accurate object detector using only a fraction of the training images. We do this by adopting active learning, which uses a ‘human in the loop’ paradigm to select the set of images that would be most useful if annotated. Towards this goal, we make the following contributions: 1. We develop a novel active learning method which poses the layered architecture used in object detection as a ‘query by committee’ paradigm to choose the set of images to be queried. 2. We introduce a framework to use the exploration/exploitation trade-off in our methods. 3. We analyze the results on standard object detection datasets, which show that with only a third of the training data we can obtain more than 95% of the localization accuracy of full supervision. Further, our methods outperform classical uncertainty-based active learning algorithms like maximum entropy.
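
A toy sketch of the query-by-committee selection idea: per-image confidence scores from several committee members (for instance, different detection layers) are compared, the images with the highest disagreement are queried, and a small random fraction is reserved for exploration. The variance-based scoring and the explore_frac parameter are illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np

def committee_disagreement(member_scores):
    """member_scores: list of (n_images,) arrays, one per committee member.
    Disagreement is taken as the variance of the scores across members."""
    return np.stack(member_scores, axis=0).var(axis=0)

def select_for_annotation(member_scores, budget, explore_frac=0.2, seed=0):
    """Exploitation: highest-disagreement images; exploration: random remainder."""
    rng = np.random.default_rng(seed)
    disagreement = committee_disagreement(member_scores)
    n_explore = int(budget * explore_frac)
    exploit = np.argsort(-disagreement)[: budget - n_explore]
    rest = np.setdiff1d(np.arange(len(disagreement)), exploit)
    explore = rng.choice(rest, size=n_explore, replace=False)
    return np.concatenate([exploit, explore])
```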

Recommended citation: Soumya Roy, Asim Unmesh and Vinay P. Namboodiri, “Deep active learning for object detection”, British Machine Vision Conference 2018, BMVC 2018, Northumbria University, Newcastle, UK, September 3-6, 2018 http://bmvc2018.org/contents/papers/0287.pdf

Multimodal differential network for visual question generation

Published in Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018

Generating natural questions from an image is a semantic task that requires using the visual and language modalities to learn multimodal representations. Images can have multiple visual and language contexts that are relevant for generating questions, namely places, captions, and tags. In this paper, we propose the use of exemplars for obtaining the relevant context. We obtain this by using a Multimodal Differential Network to produce natural and engaging questions. The generated questions show a remarkable similarity to natural questions, as validated by a human study. Further, we observe that the proposed approach substantially improves over state-of-the-art benchmarks on the quantitative metrics (BLEU, METEOR, ROUGE, and CIDEr).

Recommended citation: Badri N. Patro, Sandeep Kumar, Vinod K. Kurmi, Vinay P. Namboodiri, “Multimodal Differential Network for Visual Question Generation”, 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 2018 https://badripatro.github.io/MDN-VQG/

U-DADA: Unsupervised Deep Action Domain Adaptation

Published in Asian Conference on Computer Vision (ACCV), 2018

The problem of domain adaptation has been extensively studied for the object classification task; however, it has not been studied as well for recognizing actions. While object recognition is well understood, the diverse variety of videos in action recognition makes addressing domain shift more challenging. We address this problem by proposing a novel adaptation technique that we term unsupervised deep action domain adaptation (U-DADA). The main concept we propose is explicitly modeling density-based adaptation and using it while adapting domains for recognizing actions. We show that these techniques work well both for domain adaptation through adversarial learning to obtain invariant features and for explicitly reducing the domain shift between distributions. The method is shown to work well on existing benchmark datasets such as UCF50, UCF101, HMDB51 and Olympic Sports. As a pioneering effort in the area of deep action adaptation, we present several benchmark results and techniques that could serve as baselines to guide future research in this area.

Recommended citation: Jamal A., Namboodiri V.P., Deodhare D., Venkatesh K.S. (2019) U-DADA: Unsupervised Deep Action Domain Adaptation. In: Jawahar C., Li H., Mori G., Schindler K. (eds) Computer Vision – ACCV 2018. ACCV 2018. Lecture Notes in Computer Science, vol 11363. Springer, https://link.springer.com/chapter/10.1007/978-3-030-20893-6_28

Supervised Hashing for Retrieval of Multimodal Biometric Data

Published in 3rd Workshop on Computer Vision Applications (WCVA) , 2018

Biometric systems commonly utilize multi-biometric approaches, where a person is verified or identified based on multiple biometric traits. However, deployed systems usually require verification or identification against a large number of enrolled candidates, which is feasible only with efficient methods for retrieving relevant candidates in a multi-biometric system. To solve this problem, we analyze the hashing techniques available for retrieval and, based on our analysis, recommend supervised hashing over deep learned features as a common technique for this problem. Our investigation compares supervised and unsupervised methods, viz. Principal Component Analysis (PCA), Locality Sensitive Hashing (LSH), Locality-sensitive binary codes from shift-invariant kernels (SKLSH), Iterative Quantization (ITQ), Binary Reconstructive Embedding (BRE) and Minimum Loss Hashing (MLH), which represent the prevalent classes of such systems, and we present our analysis on face, iris, and fingerprint data for a number of standard datasets. The main technical contributions of this work are: (a) a Siamese-network-based deep learned feature extraction method; (b) an analysis of common feature extraction techniques for multiple biometrics with respect to a reduced feature-space representation; (c) advocating supervised hashing for obtaining a compact feature representation across different biometric traits; and (d) an analysis of the performance of deep representations against shallow representations in a practical reduced-feature-representation framework. Through experimentation with multiple biometric traits, feature representations, and hashing techniques, we conclude that deep learned features retrieved with supervised hashing can serve as a standard pipeline for most unimodal and multimodal biometric identification tasks.
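
For context, the sketch below shows the simplest unsupervised baseline in this family: random-hyperplane LSH over feature vectors followed by Hamming-distance retrieval. It is meant only to illustrate the hashing-based retrieval pipeline, not the supervised methods the paper ultimately recommends.

```python
import numpy as np

def lsh_codes(features, n_bits=64, seed=0):
    """Random-hyperplane LSH: binary code = sign of projections onto random directions."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((features.shape[1], n_bits))
    return (features @ planes > 0).astype(np.uint8)

def hamming_retrieve(query_code, gallery_codes, k=10):
    """Indices of the k gallery codes closest to the query in Hamming distance."""
    dists = (query_code[None, :] != gallery_codes).sum(axis=1)
    return np.argsort(dists)[:k]
```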

Recommended citation: Sumesh T.A., Namboodiri V., Gupta P. (2019) Supervised Hashing for Retrieval of Multimodal Biometric Data. In: Arora C., Mitra K. (eds) Computer Vision Applications. WCVA 2018. Communications in Computer and Information Science, vol 1019. Springer, Singapore https://link.springer.com/chapter/10.1007/978-981-15-1387-9_8

Multi-layer pruning framework for compressing single shot multibox detector

Published in IEEE Winter Conference on Applications of Computer Vision (WACV), 2019

We propose a framework for compressing the state-of-the-art Single Shot MultiBox Detector (SSD). The framework addresses compression in the following stages: Sparsity Induction, Filter Selection, and Filter Pruning. In the Sparsity Induction stage, the object detector model is sparsified via an improved global threshold. In the Filter Selection & Pruning stage, we select and remove filters using sparsity statistics of filter weights in two consecutive convolutional layers. This results in a model smaller than most existing compact architectures. We evaluate the performance of our framework on multiple datasets and compare against multiple methods. Experimental results show that our method achieves state-of-the-art compression of 6.7X and 4.9X on the PASCAL VOC dataset for the SSD300 and SSD512 models, respectively. We further show that the method produces a maximum compression of 26X with SSD512 on the German Traffic Sign Detection Benchmark (GTSDB). Additionally, we empirically show our method's adaptability to the classification architecture VGG-16 on the CIFAR and German Traffic Sign Recognition Benchmark (GTSRB) datasets, achieving compression rates of 125X and 200X with reductions in FLOPs of 90.50% and 96.6%, respectively, with no loss of accuracy. Moreover, our method does not require any special libraries or hardware support for the resulting compressed models.
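
The filter-selection idea can be illustrated with a toy function that ranks the filters of a convolutional layer by how many of their weights survive a global magnitude threshold and marks the least dense ones for pruning. This is a simplified stand-in for the sparsity statistics used in the paper, and the threshold and keep_ratio values are arbitrary.

```python
import torch
import torch.nn as nn

def filters_to_prune(conv: nn.Conv2d, threshold: float, keep_ratio: float = 0.75):
    """Rank filters by the fraction of weights above a global magnitude threshold
    and return the indices of the least dense filters."""
    with torch.no_grad():
        w = conv.weight.abs()                                  # (out_ch, in_ch, k, k)
        density = (w > threshold).float().mean(dim=(1, 2, 3))  # per-filter sparsity statistic
    n_prune = conv.out_channels - int(conv.out_channels * keep_ratio)
    return torch.argsort(density)[:n_prune].tolist()

# usage on a throwaway layer
layer = nn.Conv2d(64, 128, kernel_size=3)
print(filters_to_prune(layer, threshold=1e-2))
```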

Recommended citation: P. Singh, Manikandan R., N. Matiyali and V. P. Namboodiri, "Multi-layer pruning framework for compressing single shot multibox detector," IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, Hawaii, USA. https://arxiv.org/abs/1811.08342

Stability Based Filter Pruning for Accelerating Deep CNNs

Published in IEEE Winter Conference on Applications of Computer Vision (WACV), 2019

Convolutional neural networks (CNN) have achieved impressive performance on a wide variety of tasks (classification, detection, etc.) across multiple domains, at the cost of high computational and memory requirements. Thus, leveraging CNNs for real-time applications necessitates model compression approaches that not only reduce the total number of parameters but also reduce the overall computation. In this work, we present a stability-based approach for filter-level pruning of CNNs. We evaluate our proposed approach on different architectures (LeNet, VGG-16, ResNet, and Faster RCNN) and datasets and demonstrate its generalizability through extensive experiments. Moreover, our compressed models can be used at run-time without requiring any special libraries or hardware. Our model compression method reduces the number of FLOPs by an impressive factor of 6.03X and the GPU memory footprint by more than 17X, significantly outperforming other state-of-the-art filter pruning methods.

Recommended citation: P. Singh, V.S.R. Kadi, N. Verma and V. P. Namboodiri, "Stability Based Filter Pruning for Accelerating Deep CNNs," IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, Hawaii, USA. https://arxiv.org/abs/1811.08321

Spotting words in silent speech videos: a retrieval-based approach

Published in Journal of Machine Vision and Applications (MVA), 2019

Our goal is to spot words in silent speech videos without explicitly recognizing the spoken words, where the lip motion of the speaker is clearly visible and audio is absent. Existing work in this domain has mainly focused on recognizing a fixed set of words in word-segmented lip videos, which limits the applicability of the learned model due to the limited vocabulary and a high dependency on the model’s recognition performance. Our contribution is twofold: (1) we develop a pipeline for recognition-free retrieval and show its performance against recognition-based retrieval on a large-scale dataset and another set of out-of-vocabulary words; (2) we introduce a query expansion technique using pseudo-relevance feedback and propose a novel re-ranking method based on maximizing the correlation between spatiotemporal landmarks of the query and the top retrieval candidates. Our word spotting method achieves 35% higher mean average precision than the recognition-based method on the large-scale LRW dataset. We also demonstrate the application of the method by word spotting in a popular speech video (“The Great Dictator” by Charlie Chaplin), where we show that word retrieval can be used to understand what was spoken, perhaps even in silent movies. Finally, we compare our model against ASR in a noisy environment and analyze the effect of the performance of the underlying lip-reader and input video quality on the proposed word spotting pipeline.
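
A minimal sketch of recognition-free retrieval with pseudo-relevance feedback: cosine-similarity ranking over fixed-length query/video embeddings, with the query re-estimated from its top results (a Rocchio-style update). The embedding source and the alpha weighting are assumptions for illustration; the paper's landmark-correlation re-ranking is not reproduced here.

```python
import numpy as np

def retrieve(query, gallery, k=10):
    """Cosine-similarity retrieval; query: (d,), gallery: (n, d)."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return np.argsort(-(g @ q))[:k]

def expand_query(query, gallery, n_feedback=5, alpha=0.7):
    """Pseudo-relevance feedback: blend the query with the mean of its top results."""
    top = retrieve(query, gallery, k=n_feedback)
    return alpha * query + (1 - alpha) * gallery[top].mean(axis=0)
```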

Recommended citation: A. Jha, V. P. Namboodiri and C. V. Jawahar,”Spotting words in silent speech videos: a retrieval-based approach”, Journal of Machine Vision and Applications, March 2019, Volume 30, Issue 2, pp 217–229 https://link.springer.com/article/10.1007/s00138-019-01006-y

Cross-language Speech Dependent Lip-synchronization

Published in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019

Understanding videos of people speaking across international borders is hard, as audiences from different demographics do not understand the language. Such speech videos are often supplemented with subtitles in the viewer's language, but these hamper the viewing experience as the viewer's attention is shared. Simple audio dubbing in a different language makes the video appear unnatural due to unsynchronized lip motion. In this paper, we propose a system for automated cross-language lip synchronization for re-dubbed videos. Our model generates superior photorealistic lip synchronization over the original video in comparison to the current re-dubbing method. With the help of a user study, we verify that our method is preferred over unsynchronized videos.

Recommended citation: A. Jha, V. Voleti, V. Namboodiri and C. V. Jawahar, "Cross-language Speech Dependent Lip-synchronization," ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, 2019, pp. 7140-7144. http://vinaypn.github.io/files/icassp2019.pdf

HetConv: Heterogeneous Kernel-Based Convolutions for Deep CNNs

Published in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019

We present a novel deep learning architecture in which the convolution operation leverages heterogeneous kernels. The proposed HetConv (Heterogeneous Kernel-Based Convolution) reduces the computation (FLOPs) and the number of parameters as compared to the standard convolution operation while still maintaining representational efficiency. To show the effectiveness of our proposed convolution, we present extensive experimental results on standard convolutional neural network (CNN) architectures such as VGG and ResNet. We find that after replacing the standard convolutional filters in these architectures with our proposed HetConv filters, we achieve a 3X to 8X FLOPs-based improvement in speed while still maintaining (and sometimes improving) the accuracy. We also compare our proposed convolution with group/depth-wise convolutions and show that it achieves more FLOPs reduction with significantly higher accuracy.
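
A simplified reading of the heterogeneous-kernel idea is sketched below: 3x3 kernels are applied to a 1/P fraction of the input channels and 1x1 kernels to the rest, and the two results are summed. This ignores the per-filter channel shifting of the actual HetConv design and assumes P > 1; it is an illustration, not the official implementation.

```python
import torch
import torch.nn as nn

class HetConvLike(nn.Module):
    """Simplified heterogeneous convolution: 3x3 on in_ch // p channels, 1x1 on the rest."""
    def __init__(self, in_ch, out_ch, p=4):
        super().__init__()
        self.n3 = max(in_ch // p, 1)
        self.conv3 = nn.Conv2d(self.n3, out_ch, kernel_size=3, padding=1, bias=False)
        self.conv1 = nn.Conv2d(in_ch - self.n3, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.conv3(x[:, : self.n3]) + self.conv1(x[:, self.n3 :])

x = torch.randn(1, 64, 32, 32)
print(HetConvLike(64, 128, p=4)(x).shape)  # torch.Size([1, 128, 32, 32])
```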

Recommended citation: Pravendra Singh, Vinay Kumar Verma, Piyush Rai and Vinay P. Namboodiri, “HetConv: Heterogeneous Kernel-Based Convolutions for Deep CNNs”, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, California, June 2019. http://openaccess.thecvf.com/content_CVPR_2019/papers/Singh_HetConv_Heterogeneous_Kernel-Based_Convolutions_for_Deep_CNNs_CVPR_2019_paper.pdf

Attending to Discriminative Certainty for Domain Adaptation

Published in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019

In this paper, we aim to solve unsupervised domain adaptation of classifiers, where we have access to label information for the source domain but not for the target domain. While various methods have been proposed for this problem, including adversarial-discriminator-based methods, most approaches have focused on adapting the entire image. In an image, however, there are regions that can be adapted better; for instance, the foreground object may be similar in nature across domains. To obtain such regions, we propose methods that consider the probabilistic certainty estimates of various regions and focus on these during classification for adaptation. We observe that just by incorporating the probabilistic certainty of the discriminator while training the classifier, we are able to obtain state-of-the-art results on various datasets as compared against all recent methods. We provide a thorough empirical analysis of the method through ablation analysis, statistical significance tests, and visualization of the attention maps and t-SNE embeddings. These evaluations convincingly demonstrate the effectiveness of the proposed approach.

Recommended citation: Vinod Kumar Kurmi*, Shanu Kumar* and Vinay P Namboodiri, “Attending to Discriminative Certainty for Domain Adaptation”, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, California, June 2019. http://openaccess.thecvf.com/content_CVPR_2019/papers/Kurmi_Attending_to_Discriminative_Certainty_for_Domain_Adaptation_CVPR_2019_paper.pdf

Unsupervised Synthesis of Anomalies in Videos: Transforming the Normal

Published in International Joint Conference on Neural Networks (IJCNN) , 2019

Abnormal activity recognition requires detecting the occurrence of anomalous events, which suffer from a severe imbalance in data. In a video, normal describes activities that conform to usual events, while irregular events that do not conform to the normal are referred to as abnormal. It is far more common to observe normal data than to obtain abnormal data in visual surveillance. In this paper, we propose an approach in which we obtain abnormal data by transforming normal data. This is a challenging task that is solved through a multi-stage pipeline. We utilize a number of techniques from unsupervised segmentation to synthesize new samples of data that are transformed from an existing set of normal examples. Further, this synthesis approach has useful applications as a data augmentation technique. An incrementally trained Bayesian convolutional neural network (CNN) is used to carefully select the set of abnormal samples that can be added. Finally, through this synthesis approach we obtain a comparable set of abnormal samples that can be used for training the CNN to classify normal vs. abnormal samples. We show that this method generalizes to multiple settings by evaluating it on two real-world datasets, achieving improved performance over other probabilistic techniques that have been used in the past for this task.

Recommended citation: Abhishek Joshi and Vinay P. Namboodiri, “Unsupervised Synthesis of Anomalies in Videos: Transforming the Normal”, Proceedings of International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary https://arxiv.org/abs/1904.06633

Looking back at Labels: A Class based Domain Adaptation Technique

Published in International Joint Conference on Neural Networks (IJCNN) , 2019

In this paper, we tackle the problem of domain adaptation. In the domain adaptation setting, we are provided a labeled source dataset with multiple classes and a target dataset that has no supervision. In this setting, we propose an adversarial-discriminator-based approach. While approaches based on an adversarial discriminator have been proposed previously, in this paper we present an informed adversarial discriminator. Our approach relies on the observation that if the discriminator has access to all the available information, including the class structure present in the source dataset, then it can guide the transformation of the features of the target classes to a more structured adapted space. Using this formulation, we obtain state-of-the-art results for the standard evaluation on benchmark datasets. We further provide a detailed analysis which shows that using all the labeled information results in improved domain adaptation.

Recommended citation: Vinod Kumar Kurmi and Vinay P. Namboodiri, “Looking back at Labels: A Class based Domain Adaptation Technique”, Proceedings of International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary https://vinodkkurmi.github.io/DiscriminatorDomainAdaptation/

Play and Prune: Adaptive Filter Pruning for Deep Model Compression

Published in International Joint Conference on Artificial Intelligence (IJCAI-2019), 2019

While convolutional neural networks (CNN) have achieved impressive performance on various classification/recognition tasks, they typically consist of a massive number of parameters. This results in significant memory requirements as well as computational overhead. Consequently, there is a growing need for filter-level pruning approaches for compressing CNN-based models that not only reduce the total number of parameters but also reduce the overall computation. We present a new min-max framework for filter-level pruning of CNNs. Our framework, called Play and Prune (PP), jointly prunes and fine-tunes CNN model parameters, with an adaptive pruning rate, while maintaining the model's predictive performance. Our framework consists of two modules: (1) an adaptive filter pruning (AFP) module, which minimizes the number of filters in the model; and (2) a pruning rate controller (PRC) module, which maximizes the accuracy during pruning. Moreover, unlike most previous approaches, our approach allows directly specifying the desired error tolerance instead of the pruning level. Our compressed models can be deployed at run-time without requiring any special libraries or hardware. Our approach reduces the number of parameters of VGG-16 by an impressive factor of 17.5X and the number of FLOPs by 6.43X, with no loss of accuracy, significantly outperforming other state-of-the-art filter pruning methods.
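
The role of the pruning-rate controller can be illustrated with a tiny update rule that prunes more aggressively while the accuracy drop stays within the specified error tolerance and backs off otherwise; the step size and the exact form of the update are assumptions for illustration, not the paper's optimization.

```python
def update_pruning_rate(rate, current_acc, original_acc, tolerance, step=0.1):
    """Sketch of a pruning-rate controller driven by a user-specified error tolerance."""
    if original_acc - current_acc <= tolerance:
        return rate * (1.0 + step)            # within tolerance: prune faster
    return max(rate * (1.0 - step), 0.0)      # accuracy dropped too far: back off
```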

Recommended citation: Pravendra Singh, Vinay Kumar Verma, Piyush Rai and Vinay P. Namboodiri, “Play and Prune: Adaptive Filter Pruning for Deep Model Compression”, Proceedings of International Joint Conference on Artificial Intelligence (IJCAI-2019), Macao, China, August 2019. https://arxiv.org/abs/1905.04446

Curriculum based Dropout Discriminator for Domain Adaptation

Published in Proceedings of British Machine Vision Conference (BMVC), 2019

Domain adaptation is essential to enable wide usage of deep learning based networks trained using large labeled datasets. Adversarial learning based techniques have shown their utility towards solving this problem using a discriminator that ensures source and target distributions are close. However, here we suggest that rather than using a point estimate, it would be useful if a distribution based discriminator could be used to bridge this gap. This could be achieved using multiple classifiers or using traditional ensemble methods. In contrast, we suggest that a Monte Carlo dropout based ensemble discriminator could suffice to obtain the distribution based discriminator. Specifically, we propose a curriculum based dropout discriminator that gradually increases the variance of the sample based distribution and the corresponding reverse gradients are used to align the source and target feature representations. The detailed results and thorough ablation analysis show that our model outperforms state-of-the-art results.
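
A minimal sketch of a Monte Carlo dropout discriminator: dropout is kept active at inference and several stochastic forward passes are averaged, so the domain prediction comes with a sample-based variance. The network sizes and the number of samples are illustrative assumptions; the curriculum over the variance is not shown.

```python
import torch
import torch.nn as nn

class DropoutDiscriminator(nn.Module):
    def __init__(self, feat_dim=256, p=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Dropout(p), nn.Linear(128, 1))

    def mc_forward(self, features, n_samples=10):
        """Monte Carlo dropout: average several stochastic passes; also return their variance."""
        self.train()  # keep dropout active
        logits = torch.stack([self.net(features) for _ in range(n_samples)])
        return logits.mean(dim=0), logits.var(dim=0)
```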

Recommended citation: Vinod Kumar Kurmi, Vipul Bajaj, Venkatesh K Subramanian and Vinay P Namboodiri, “Curriculum based Dropout Discriminator for Domain Adaptation”, British Machine Vision Conference 2019, BMVC 2019, Cardiff, UK, September 9-12, 2019 https://delta-lab-iitk.github.io/CD3A/

Towards Automatic Face-to-Face Translation

Published in 27th ACM International Conference on Multimedia (ACM-MM), 2019

In light of the recent breakthroughs in automatic machine translation systems, we propose a novel approach that we term "Face-to-Face Translation". As today's digital communication becomes increasingly visual, we argue that there is a need for systems that can automatically translate a video of a person speaking in language A into a target language B with realistic lip synchronization. In this work, we create an automatic pipeline for this problem and demonstrate its impact in multiple real-world applications. First, we build a working speech-to-speech translation system by bringing together multiple existing modules from speech and language. We then move towards "Face-to-Face Translation" by incorporating a novel visual module, LipGAN, for generating realistic talking faces from the translated audio. Quantitative evaluation of LipGAN on the standard LRW test set shows that it significantly outperforms existing approaches across all standard metrics. We also subject our Face-to-Face Translation pipeline to multiple human evaluations and show that it can significantly improve the overall user experience for consuming and interacting with multimodal content across languages. Code, models and a demo video are made publicly available.

Recommended citation: Prajwal Renukanand*, Rudrabha Mukhopadhyay*, Jerin Philip, Abhishek Jha, Vinay Namboodiri and C.V. Jawahar, “Towards Automatic Face-to-Face Translation”, 27th ACM International Conference on Multimedia (ACM-MM), Nice, France, 2019, Pages 1428–1436 https://cvit.iiit.ac.in/research/projects/cvit-projects/facetoface-translation

U-CAM: Visual Explanation using Uncertainty based Class Activation Maps

Published in IEEE International Conference on Computer Vision (ICCV), 2019

Understanding and explaining deep learning models is an imperative task. Towards this, we propose a method that obtains gradient-based certainty estimates that also provide visual attention maps. In particular, we address the visual question answering task. We incorporate modern probabilistic deep learning methods that we further improve by using the gradients for these estimates. This has two-fold benefits: (a) improvement in obtaining certainty estimates that correlate better with misclassified samples, and (b) improved attention maps that provide state-of-the-art results in terms of correlation with human attention regions. The improved attention maps result in consistent improvements for various visual question answering methods. Therefore, the proposed technique can be thought of as a recipe for obtaining improved certainty estimates and explanations for deep learning models. We provide a detailed empirical analysis of the visual question answering task on all standard benchmarks and a comparison with state-of-the-art methods.
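
The flavour of gradient-weighted attention maps can be conveyed with a plain Grad-CAM-style computation, shown below; the uncertainty-based weighting that the paper adds on top is not reproduced, and the feature-map size is an arbitrary assumption.

```python
import torch
import torch.nn.functional as F

def grad_cam(feature_maps, score):
    """feature_maps: (C, H, W) activations with requires_grad; score: scalar class score.
    Returns an (H, W) gradient-weighted activation map."""
    grads, = torch.autograd.grad(score, feature_maps, retain_graph=True)
    weights = grads.mean(dim=(1, 2))               # global-average-pooled gradients
    cam = torch.einsum("c,chw->hw", weights, feature_maps)
    return F.relu(cam)

# toy usage with random activations standing in for a CNN's last conv layer
fmap = torch.randn(8, 7, 7, requires_grad=True)
score = (fmap * torch.randn(8, 7, 7)).sum()        # stand-in for a class logit
print(grad_cam(fmap, score).shape)                 # torch.Size([7, 7])
```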

Recommended citation: Badri N. Patro, Mayank Lunayach, Shivansh Patel and Vinay P. Namboodiri, “U-CAM: Visual Explanation using Uncertainty based Class Activation Maps”, Proceedings of IEEE International Conference on Computer Vision (ICCV), Seoul, South Korea, October 2019. https://delta-lab-iitk.github.io/U-CAM/

HetConv: Beyond Homogeneous Convolution Kernels for Deep CNNs

Published in International Journal of Computer Vision (IJCV), 2019

While usage of convolutional neural networks (CNN) is widely prevalent, methods proposed so far have always considered homogeneous kernels for this task. In this paper, we propose a new type of convolution operation using heterogeneous kernels. The proposed Heterogeneous Kernel-Based Convolution (HetConv) reduces the computation (FLOPs) and the number of parameters as compared to the standard convolution operation while maintaining representational efficiency. To show the effectiveness of our proposed convolution, we present extensive experimental results on standard CNN architectures such as VGG, ResNet, Faster-RCNN, MobileNet, and SSD. We observe that after replacing the standard convolutional filters in these architectures with our proposed HetConv filters, we achieve a 1.5× to 8× FLOPs-based improvement in speed while maintaining (and sometimes improving) the accuracy. We also compare our proposed convolution with group/depth-wise convolution and show that it achieves more FLOPs reduction with significantly higher accuracy. Moreover, we demonstrate the efficacy of HetConv-based CNNs by showing that they also generalize to object detection and are not constrained to image classification tasks. We also empirically show that the proposed HetConv convolution is more robust towards the over-fitting problem as compared to standard convolution.

Recommended citation: Pravendra Singh, Vinay Kumar Verma, Piyush Rai and Vinay P. Namboodiri, “HetConv: Beyond Homogeneous Convolution Kernels for Deep CNNs”, International Journal of Computer Vision, accepted https://link.springer.com/article/10.1007/s11263-019-01264-3

FALF ConvNets: Fatuous auxiliary loss based filter-pruning for efficient deep CNNs

Published in Image and Vision Computing Journal, 2019

Obtaining efficient Convolutional Neural Networks (CNNs) is imperative to enable their application for a wide variety of tasks (classification, detection, etc.). While several methods have been proposed to solve this problem, we propose a novel strategy that is orthogonal to the strategies proposed so far. We hypothesize that if we add a fatuous auxiliary task to a network that aims to solve a semantic task such as classification or detection, the filters devoted to solving this frivolous task will not be relevant for solving the main task of concern. These filters can be pruned, and pruning them does not reduce the performance on the original task. We demonstrate that this strategy is not only successful, but in fact allows improved performance for a variety of tasks such as object classification, detection and action recognition. An interesting observation is that the auxiliary task needs to be fatuous so that no semantically meaningful filters are relevant for solving it. We thoroughly evaluate our proposed approach on different architectures (LeNet, VGG-16, ResNet, Faster RCNN, SSD-512, C3D, and MobileNet V2) and datasets (MNIST, CIFAR, ImageNet, GTSDB, COCO, and UCF101) and demonstrate its generalizability through extensive experiments. Moreover, our compressed models can be used at run-time without requiring any special libraries or hardware. Our model compression method reduces the number of FLOPs by an impressive factor of 6.03X and the GPU memory footprint by more than 17X for VGG-16, significantly outperforming other state-of-the-art filter pruning methods. We demonstrate the usability of our approach for 3D convolutions and various vision tasks such as object classification, object detection, and action recognition.

Recommended citation: Pravendra Singh, Vinay Sameer Raja Kadi and Vinay P. Namboodiri, “FALF ConvNets: Fatuous auxiliary loss based filter-pruning for efficient deep CNNs”, Image and Vision Computing Journal, Volume 93, January 2020, 103857 https://www.sciencedirect.com/science/article/pii/S0262885619304500

talks

teaching
