Leveraging large-scale unlabeled or partially labeled images in a cost-effective way is challenging but of great importance to computer vision, and has attracted increasing interest. To tackle this problem, many Active Learning (AL) methods have been developed. However, these methods mainly define their sample selection criteria within a single image context, leading to suboptimal robustness and impractical solutions for large-scale object detection. In this paper, aiming to remedy the drawbacks of existing AL methods, we present a principled Self-supervised Sample Mining (SSM) process that accounts for the real challenges in object detection. Specifically, our SSM process concentrates on automatically discovering and pseudo-labeling reliable region proposals to enhance the object detector via the introduced cross image validation, i.e., pasting these proposals into different labeled images to comprehensively measure their values under varying image contexts. Building on the SSM process, we propose a new AL framework that gradually incorporates unlabeled or partially labeled data into model learning while minimizing the annotation effort of users. Extensive experiments on two public benchmarks clearly demonstrate that our proposed framework achieves performance comparable to state-of-the-art methods with significantly fewer annotations.
The pipeline of the proposed framework with the Self-supervised Sample Mining (SSM) process for object detection. Our framework comprises two stages: pseudo-labeling high-consistency samples via the SSM and selecting low-consistency samples via AL, where the arrows represent the workflow, the solid lines denote the data flow in each mini-batch-based training iteration, and the dashed lines indicate data that are processed intermittently. As shown, our framework presents a rational pipeline for improving object detection from unlabeled and partially labeled images by automatically distinguishing high-consistency region proposals, which can be easily and faithfully recognized by the detector after the cross image validation, from low-consistency ones, which can be labeled by active users in an interactive manner.
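The two-branch decision described in the caption can be sketched as a single mini-batch step. The function below is an illustrative outline, not the paper's implementation: `consistency_fn` stands in for the cross image validation score, and the thresholds `high_thr`/`low_thr` are hypothetical values chosen for the sketch.

```python
def training_iteration(minibatch, consistency_fn, labeled_pool,
                       high_thr=0.8, low_thr=0.3):
    """One mini-batch step of the framework: route high-consistency
    proposals to the SSM branch (pseudo-labeling) and low-consistency
    ones to the AL branch (user annotation).

    `consistency_fn(proposal, labeled_pool)` is a placeholder for the
    cross image validation score; thresholds are illustrative only.
    """
    pseudo_labeled, query_for_user = [], []
    for proposal in minibatch:
        score = consistency_fn(proposal, labeled_pool)
        if score >= high_thr:
            # SSM branch: assign a disposable pseudo-label, used only
            # for retraining within the current iteration.
            pseudo_labeled.append((proposal, proposal["pred_label"]))
        elif score <= low_thr:
            # AL branch: uncertain sample, hand over to a human annotator.
            query_for_user.append(proposal)
        # Mid-range proposals are left untouched in this iteration.
    return pseudo_labeled, query_for_user
```

Proposals between the two thresholds are simply skipped here; in practice the detector revisits them in later iterations as its predictions sharpen.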
Cross Image Validation
In the proposed SSM process, given the region proposals from unlabeled or partially labeled images, we evaluate their estimation consistency by performing the cross image validation, i.e., pasting them into different annotated images to validate their prediction consistency under the up-to-date object detector. Note that, to avoid ambiguity, the images for validation are randomly picked from the labeled samples whose annotations do not contain the estimated category of the proposal under processing. Through a simple ranking mechanism, incorrectly pseudo-labeled proposals are likely to be filtered out, since they rarely survive the challenges posed by varying image contexts. In this way, classifier bias and sample imbalance can be effectively alleviated. We then provisionally assign disposable pseudo-labels to the proposals with high estimation consistency and retrain the detector within each mini-batch iteration. Since the pseudo-annotations may still contain errors, a small amount of user interaction is necessary to keep our SSM under control.
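The validation procedure above can be sketched as follows. This is a minimal outline under stated assumptions: `detector(image, box)` is a hypothetical callable returning a `(predicted_label, confidence)` pair for a region, and `paste_fn(patch, image)` is a placeholder that blends a proposal patch into an image and returns the new image and pasted box; neither name comes from the paper.

```python
import random

def cross_image_validate(proposal_patch, proposal_label, detector, paste_fn,
                         labeled_images, num_validations=5):
    """Score a pseudo-labeled region proposal by pasting it into several
    labeled images and re-evaluating it with the up-to-date detector.

    Assumptions (hypothetical interfaces, not from the paper):
      detector(image, box) -> (predicted_label, confidence)
      paste_fn(patch, image) -> (new_image, pasted_box)
    Each labeled image is a dict with "labels" (set of categories
    annotated in it) and "data" (the image itself).
    """
    # To avoid ambiguity, validate only against labeled images whose
    # annotations do NOT contain the proposal's estimated category.
    candidates = [img for img in labeled_images
                  if proposal_label not in img["labels"]]
    picked = random.sample(candidates, min(num_validations, len(candidates)))

    scores = []
    for img in picked:
        new_image, box = paste_fn(proposal_patch, img["data"])
        pred_label, confidence = detector(new_image, box)
        # A consistent proposal keeps its estimated label even under an
        # unrelated image context; otherwise it contributes zero.
        scores.append(confidence if pred_label == proposal_label else 0.0)

    # The mean score across contexts serves as the consistency estimate.
    return sum(scores) / max(len(scores), 1)
```

Proposals can then be ranked by this score: the top-ranked ones receive disposable pseudo-labels, while the low-ranked ones are forwarded to the AL stage for user annotation.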
We have introduced a principled Self-supervised Sample Mining (SSM) process and demonstrated its effectiveness in mining valuable information from unlabeled or partially labeled data to boost object detection. We further incorporate this process into the AL pipeline with a concise formulation, which retrains object detectors on faithfully pseudo-labeled high-consistency object proposals obtained through our proposed cross image validation. The proposed SSM process effectively improves detection accuracy and robustness against noisy samples. Meanwhile, the remaining samples, which the current detector judges to have low consistency (high uncertainty), are handled by AL, which helps generate reliable and diverse samples gradually. In the future, we will apply our SSM to other specific visual detection tasks with unlabeled web images/videos.