NIPS 2014
Deep Joint Task Learning for Generic Object Extraction
Xiaolong Wang, Liliang Zhang, Liang Lin*, Zhujin Liang, Wangmeng Zuo


This project investigates how to extract objects of interest without relying on handcrafted features or sliding-window search, jointly solving two subtasks:
(i) rapidly localizing salient objects in images;
(ii) accurately segmenting the objects based on the localizations.
Deep Joint Task Learning

We present a general joint task learning framework, in which each task (either object localization or object segmentation) is tackled via a multi-layer convolutional neural network, and the two networks work collaboratively to boost performance.
An EM-type method is then adopted for the joint optimization, iterating between two steps:
(i) estimating the latent variables with an MCMC-based sampling method, using the two networks;
(ii) optimizing the parameters of the two networks jointly via back-propagation, with the latent variables fixed.
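The alternation above can be sketched in a toy form. Everything here is an illustrative stand-in, not the paper's implementation: the 1-D "box" latent variable, the scoring functions, and the parameter update all replace what are, in the paper, multi-layer CNNs trained by back-propagation.

```python
import math
import random

def loc_score(box, theta_loc):
    # stand-in for the localization network's confidence in a candidate box
    return -abs(box - theta_loc)

def seg_score(box, theta_seg):
    # stand-in for the segmentation network's agreement with a candidate box
    return -abs(box - theta_seg)

def sample_latent_box(theta_loc, theta_seg, n_steps=200, seed=0):
    """Step (i): Metropolis-style MCMC over the latent box,
    scored jointly by both (stand-in) networks."""
    rng = random.Random(seed)
    box = 0.0
    score = loc_score(box, theta_loc) + seg_score(box, theta_seg)
    for _ in range(n_steps):
        cand = box + rng.gauss(0.0, 0.5)
        cand_score = loc_score(cand, theta_loc) + seg_score(cand, theta_seg)
        # accept uphill moves always, downhill moves with Metropolis probability
        if cand_score >= score or rng.random() < math.exp(cand_score - score):
            box, score = cand, cand_score
    return box

def update(theta, latent_box, lr=0.1):
    """Step (ii) stand-in: pull the parameters toward the fixed latent
    estimate (the paper does this by joint back-propagation instead)."""
    return theta + lr * (latent_box - theta)

theta_loc, theta_seg = -1.0, 3.0  # deliberately inconsistent initialization
for it in range(50):
    z = sample_latent_box(theta_loc, theta_seg, seed=it)  # step (i)
    theta_loc = update(theta_loc, z)                      # step (ii)
    theta_seg = update(theta_seg, z)
# after iterating, the two "networks" agree on the latent localization
```

The point of the sketch is the control flow: the latent localization is re-estimated with both models held fixed, then both models are updated with the latent estimate held fixed, and the disagreement between the two shrinks with each round.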




Saliency dataset [1,2]

        Ours(full)  Ours(sim)  FgSeg[3]  CPMC[4]  ObjProp[5]  HS[6]   GC[7]   RC[8]   HC[8]
P       97.81       96.62      91.92     83.64    72.60       89.99   89.23   90.16   89.24
J       87.02       81.10      70.85     56.14    54.12       64.72   58.30   63.69   58.42

OE dataset

        Ours(full)  Ours(sim)  FgSeg[3]  CPMC[4]  ObjProp[5]  HS[6]   GC[7]   RC[8]   HC[8]
P       93.12       91.25      90.42     76.33    72.14       87.42   85.53   86.25   83.37
J       77.69       71.50      70.93     53.76    54.70       62.83   54.83   59.34   50.61

P: Precision (%)
J: Jaccard similarity (%)
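For reference, both metrics can be computed from binary masks as follows. This is a minimal sketch, not the authors' evaluation code, and `precision_and_jaccard` is a hypothetical helper name; here masks are flattened 0/1 pixel sequences.

```python
def precision_and_jaccard(pred, gt):
    """pred, gt: same-length sequences of 0/1 pixel labels."""
    tp = sum(p and g for p, g in zip(pred, gt))    # true positives
    pred_pos = sum(pred)                           # predicted foreground size
    union = sum(p or g for p, g in zip(pred, gt))  # |pred ∪ gt|
    precision = tp / pred_pos if pred_pos else 0.0
    jaccard = tp / union if union else 0.0
    return precision, jaccard

# tiny example: 2 of 3 predicted pixels are correct, union has 4 pixels
p, j = precision_and_jaccard([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
# precision = 2/3, Jaccard = 2/4
```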

We validate our approach on the Saliency dataset (a combination of THUR15000 [1] and THUS10000 [2]) and on a more challenging dataset newly collected by us, the Object Extraction (OE) dataset. We compare our approach with state-of-the-art methods and present empirical analyses in the experiments.

        Ours(full)  FgSeg[3]  CPMC[4]  ObjProp[5]  Saliency methods[6,7,8,9]
Time    0.014s      94.3s     59.6s    37.4s       0.711s

One highlight of our work is its efficiency at test time. The average time for extracting objects from an image is 0.014 seconds with our method, while figure-ground segmentation [3] requires 94.3 seconds, CPMC [4] requires 59.6 seconds, and Object Proposal [5] requires 37.4 seconds. For most saliency region detection methods, the runtime is dominated by the iterative GrabCut post-processing, so we report its time, 0.711 seconds, as the average testing time of those methods. As a result, our approach is roughly 50 to 6000 times faster than the state-of-the-art methods.
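The quoted speed-up range follows directly from the reported per-image times; a quick arithmetic check:

```python
# Speed-up factors over our method, from the reported average times (seconds).
ours = 0.014
baselines = {
    "FgSeg [3]": 94.3,
    "CPMC [4]": 59.6,
    "ObjProp [5]": 37.4,
    "Saliency + GrabCut": 0.711,
}
speedups = {name: t / ours for name, t in baselines.items()}
for name, s in sorted(speedups.items(), key=lambda kv: kv[1]):
    print(f"{name}: {s:.0f}x slower than ours")
```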




The Object Extraction (OE) dataset is available at:



Code for testing

Our code is hosted on GitHub.

Supplementary material


We randomly selected experimental results from the two datasets as supplementary material. [download].
Explanation of the results:
XXX_loc.jpg: the red object bounding box on the image is generated by our localization network.
XXX_seg.png: the grey-scale image is the object extraction result of our model.
XXX_overlap.jpg: this image shows the overlap (yellow) of our object extraction result (red) and the ground-truth mask (green).




  1. M. Cheng, N. Mitra, X. Huang, and S. Hu, SalientShape: Group Saliency in Image Collections, The Visual Computer, 30(4):443-453, 2014.
  2. M. Cheng, G. Zhang, N. Mitra, X. Huang, and S. Hu, Global Contrast based Salient Region Detection, In CVPR, 2011.
  3. D. Kuettel and V. Ferrari, Figure-ground segmentation by transferring window masks, In CVPR, 2012.
  4. J. Carreira and C. Sminchisescu, Constrained Parametric Min-Cuts for Automatic Object Segmentation, In CVPR, 2010.
  5. I. Endres and D. Hoiem, Category-Independent Object Proposals with Diverse Ranking, In IEEE Trans. Pattern Anal. Mach. Intell., 2014.
  6. Q. Yan, L. Xu, J. Shi, and J. Jia, Hierarchical Saliency Detection, In CVPR, 2013.
  7. M. Cheng, J. Warrell, W. Lin, S. Zheng, V. Vineet, and N. Crook, Efficient Salient Region Detection with Soft Image Abstraction, In ICCV, 2013.
  8. M. Cheng, G. Zhang, N. Mitra, X. Huang, and S. Hu, Global Contrast based Salient Region Detection, In CVPR, 2011.