CVPR 2017
Learning to Person Search
Tong Xiao*, Shuang Li*, Bochao Wang, Liang Lin, Xiaogang Wang


Existing person re-identification benchmarks and methods mainly focus on matching cropped pedestrian images between queries and candidates. However, this differs from real-world scenarios, where annotations of pedestrian bounding boxes are unavailable and the target person needs to be searched from a gallery of whole scene images. To close the gap, we propose a new deep learning framework for person search. Instead of breaking it down into two separate tasks, pedestrian detection and person re-identification, we jointly handle both aspects in a single convolutional neural network. An Online Instance Matching (OIM) loss function is proposed to train the network effectively, and it scales to datasets with numerous identities. To validate our approach, we collect and annotate a large-scale benchmark dataset for person search. It contains 18,184 images, 8,432 identities, and 96,143 pedestrian bounding boxes. Experiments show that our framework outperforms other separate approaches, and the proposed OIM loss function converges much faster and better than the conventional Softmax loss.



Contribution Highlights

  • We propose a new deep learning framework to search for a target person in a gallery of whole scene images. Instead of simply combining pedestrian detectors and person re-id methods, we jointly optimize both objectives in a single CNN so that they better adapt to each other.
  • We propose an Online Instance Matching loss function to learn identification features more effectively, which enables our framework to be scalable to large datasets with numerous identities. Together with the fast inference speed, our framework is much closer to the real-world application requirements.
  • We collect and annotate a large-scale benchmark dataset for person search, covering hundreds of scenes from street and movie snapshots. The dataset contains 18,184 images, 8,432 identities, and 96,143 pedestrian bounding boxes. We validate the effectiveness of our approach by comparing it against other baselines on this dataset.




We propose a new deep learning framework that jointly handles pedestrian detection and person re-identification in a single convolutional neural network (CNN), as shown in Figure 2. Given a whole scene image as input, we first use a stem CNN to transform raw pixels into convolutional feature maps. A pedestrian proposal net is built upon these feature maps to predict bounding boxes of candidate people, which are then fed into an identification net with RoI-Pooling [9] to extract L2-normalized 256-d features for each of them. At the inference stage, we rank the gallery people according to their feature distances to the target person. At the training stage, we propose an Online Instance Matching (OIM) loss function on top of the feature vectors to supervise the identification net, together with several other loss functions for training the proposal net in a multi-task manner. Below we first detail the CNN model structure, and then elaborate on the OIM loss function.

1) Model Structure

The following is our proposed framework. The pedestrian proposal net generates bounding boxes of candidate people, which are fed into an identification net for feature extraction. We project the features to an L2-normalized 256-d subspace, and train it with the proposed Online Instance Matching loss. Both the pedestrian proposal net and the identification net share the underlying convolutional feature maps.
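The inference-time ranking described above can be illustrated with a minimal sketch. It assumes that features are plain Python lists and uses toy 3-d vectors in place of the 256-d features; the helper names `l2_normalize` and `rank_gallery` are hypothetical and are not taken from the released code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


def rank_gallery(query_feat, gallery_feats):
    """Return gallery indices sorted from nearest to farthest from the query.

    For unit-length vectors, squared Euclidean distance equals 2 - 2 * dot product,
    so sorting by descending dot product matches sorting by ascending distance.
    """
    q = l2_normalize(query_feat)
    scored = []
    for idx, feat in enumerate(gallery_feats):
        g = l2_normalize(feat)
        scored.append((sum(a * b for a, b in zip(q, g)), idx))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by index
    return [idx for _, idx in scored]


# Toy 3-d features stand in for the 256-d features of the framework.
query = [1.0, 0.0, 0.0]
gallery = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
ranking = rank_gallery(query, gallery)
```

Because the features are L2-normalized, ranking by Euclidean distance and ranking by cosine similarity give the same order, which is why the sketch only computes dot products.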


2) Online Instance Matching Loss

The left part shows the labeled (blue) and unlabeled (orange) identity proposals in an image. We maintain a lookup table (LUT) and a circular queue (CQ) to store the features. In the forward pass, each labeled identity is matched against all the stored features. In the backward pass, we update the LUT according to the identity, push new features to the CQ, and pop out-of-date ones. Note that both data structures are external memory, rather than parameters of the CNN.
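The interplay of the two memories can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the released implementation: the loss is read here as a softmax over similarities to every LUT and CQ entry, scaled by the temperature T = 0.1 from the experiment settings; the LUT update is assumed to be a momentum blend followed by re-normalization, and the `momentum` value of 0.5 is a placeholder. The class name `OIMMemory` is hypothetical.

```python
import math
from collections import deque


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _l2_normalize(vec):
    norm = math.sqrt(_dot(vec, vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


class OIMMemory:
    """External memory for an OIM-style loss: a lookup table plus a circular queue."""

    def __init__(self, num_labeled, feat_dim, queue_size, temperature=0.1, momentum=0.5):
        self.lut = [[0.0] * feat_dim for _ in range(num_labeled)]  # one row per labeled identity
        self.cq = deque(maxlen=queue_size)  # the oldest unlabeled feature is dropped when full
        self.temperature = temperature
        self.momentum = momentum  # assumed blending factor for LUT updates

    def loss(self, feat, identity):
        """Forward pass: negative log-probability of the labeled identity."""
        stored = self.lut + list(self.cq)
        logits = [_dot(row, feat) / self.temperature for row in stored]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        return log_z - logits[identity]

    def update(self, feat, identity=None):
        """Backward-pass bookkeeping: refresh a LUT row or enqueue an unlabeled feature."""
        if identity is None:
            self.cq.append(list(feat))
        else:
            old = self.lut[identity]
            blended = [self.momentum * o + (1 - self.momentum) * f for o, f in zip(old, feat)]
            self.lut[identity] = _l2_normalize(blended)


memory = OIMMemory(num_labeled=2, feat_dim=2, queue_size=2)
memory.update([1.0, 0.0], identity=0)
memory.update([0.0, 1.0], identity=1)
loss_before = memory.loss([1.0, 0.0], identity=0)
for _ in range(3):  # three pushes into a queue of size 2 keep only the newest two
    memory.update([1.0, 0.0])
loss_after = memory.loss([1.0, 0.0], identity=0)
```

In this toy run the loss for identity 0 rises once look-alike unlabeled features enter the queue, which is the pressure that pushes labeled features away from unlabeled people. Neither the LUT nor the CQ receives gradients in this sketch; they are only read in `loss` and overwritten in `update`.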





We collect and annotate a large-scale person search dataset for comprehensive evaluation of our proposed method. To increase scene diversity, we collect images from two kinds of data sources. On one hand, we use hand-held cameras to shoot street snaps around an urban city. On the other hand, we also collect from movie snapshots that contain pedestrians, as they could enrich the variations of viewpoints, lighting, and background conditions.

1) Statistics

After collecting all the 18,184 images, we first densely annotate all the 96,143 pedestrian bounding boxes in these scenes, and then associate the people who appear across different images, resulting in 8,432 labeled identities. The statistics of the two data sources are listed in Table 1.


We did not annotate people who appear with half bodies or abnormal poses such as sitting or squatting. We ensure that the background pedestrians do not contain labeled identities, so they can safely serve as negative samples for identification. Note that we also ignore background pedestrians whose heights are smaller than 50 pixels, as they would be hard to recognize even for human labelers. The height distributions of labeled and unlabeled identities are shown in Figure 4.


2) Evaluation Protocols and Metrics

We split the dataset into a training and a test subset, ensuring that no images or labeled identities overlap between them. Table 1 shows the statistics of these two subsets. We divide the test identity instances into queries and galleries. For each of the 2,900 test identities, we randomly choose one of his/her instances as the query, while the corresponding gallery set consists of two parts: all the images containing the other instances, and some randomly sampled images not containing this person. Different queries have different galleries, and jointly they cover all the 6,978 test images.

To better understand how gallery size affects person search performance, we define a set of protocols with gallery sizes ranging from 50 to 4000. Taking a gallery size of 100 as an example, since each image contains approximately 6 pedestrians, our task is to find the target person among about 600 people.
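The gallery construction for one query can be sketched as follows. This is only an illustration of the protocol as described: the function name `build_gallery`, the image identifiers, and the choice to fill the gallery up to a fixed size with distractors are assumptions, not details taken from the benchmark tooling.

```python
import random


def build_gallery(query_image, positive_images, all_images, gallery_size, rng):
    """Combine images showing other instances of the query person with random distractors.

    positive_images: images that contain the query person (the query image is skipped).
    all_images: every test image; anything that is not a positive counts as a distractor.
    """
    positives = [img for img in positive_images if img != query_image]
    excluded = set(positives) | {query_image}
    distractors = [img for img in all_images if img not in excluded]
    n_fill = max(0, gallery_size - len(positives))
    return positives + rng.sample(distractors, min(n_fill, len(distractors)))


images = ["img%d" % i for i in range(20)]  # hypothetical image identifiers
gallery = build_gallery("img0", ["img1", "img2"], images, 10, random.Random(0))

# With about 6 pedestrians per image, a gallery of 100 images yields roughly 600 candidates.
approx_candidates = 100 * 6
```

Passing an explicit `random.Random` instance keeps the sampled distractors reproducible, which matters if the same galleries must be reused across methods.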

We employ two kinds of evaluation metrics: cumulative matching characteristics (CMC top-K) and mean average precision (mAP). The first is inherited from the person re-id problem, where a match is counted if at least one of the top-K predicted bounding boxes overlaps with the ground truth with an intersection-over-union (IoU) greater than or equal to 0.5. The second is inspired by object detection tasks.
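The two metrics can be sketched for a single query as below. The sketch assumes boxes in (x1, y1, x2, y2) form, at most one ground-truth box of the target person per gallery image, and at most one detection matching each ground-truth box; the function names are hypothetical, and `average_precision` uses one common definition of AP that may differ in detail from the official evaluation script. mAP would then be the mean of this value over all queries.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mark_hits(ranked_detections, ground_truth, iou_threshold=0.5):
    """Flag each ranked (image_id, box) detection that overlaps the target's box in that image."""
    hits = []
    for image_id, box in ranked_detections:
        gt_box = ground_truth.get(image_id)
        hits.append(gt_box is not None and iou(box, gt_box) >= iou_threshold)
    return hits


def cmc_top_k(hits, k):
    """1.0 if any of the first k ranked detections is a hit, else 0.0."""
    return 1.0 if any(hits[:k]) else 0.0


def average_precision(hits, num_ground_truth):
    """Sum of the precision at each hit rank, divided by the number of ground-truth boxes."""
    found, precision_sum = 0, 0.0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            found += 1
            precision_sum += found / rank
    return precision_sum / num_ground_truth if num_ground_truth else 0.0


ground_truth = {"a": (0, 0, 10, 20), "b": (5, 5, 15, 25)}  # target person's box per image
ranked = [("c", (0, 0, 10, 20)), ("a", (1, 0, 11, 20)), ("b", (50, 50, 60, 70))]
hits = mark_hits(ranked, ground_truth)
```

In this toy example the top-ranked detection comes from an image without the target, so top-1 fails while top-2 succeeds, and the missed instance in image "b" lowers the average precision.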



1) Experiment Settings

We implement our framework based on Caffe [16, 31] and py-faster-rcnn [9, 27]. An ImageNet-pretrained ResNet-50 [13] is used for parameter initialization. We fix the first 7×7 convolution layer and the batch normalization (BN) layers as constant affine transformations in the stem part, while keeping the other BN layers as normal in the identification part. The temperature scalar T in Eq. (1) and Eq. (2) is set to 0.1, and the size of the circular queue is set to 5,000. All the losses have the same loss weight. Each mini-batch consists of two scene images. The learning rate is initialized to 0.001, dropped to 0.0001 after 40K iterations, and kept unchanged until the model converges at 50K iterations.

For pedestrian detection, we directly use the off-the-shelf deep learning CCF [36] detector released by the authors, as well as two other detectors specifically fine-tuned on our dataset. One is ACF [6], and the other is Faster-RCNN (CNN) [27] with ResNet-50, which is equivalent to our framework but without the identification task. The recall-precision curves of each detector on our dataset are plotted in Figure 5.



For person re-identification, we use several popular re-id feature representations, including DenseSIFT-ColorHist (DSIFT) [41], Bag of Words (BoW) [42], and Local Maximal Occurrence (LOMO) [21]. Each feature representation is used in conjunction with a specific distance metric, including Euclidean, Cosine similarity, KISSME [17], and XQDA [21], where KISSME and XQDA are trained on our dataset. Moreover, by discarding the pedestrian proposal network in our framework and training the remaining net to classify identities with Softmax loss from cropped pedestrian images, we get another baseline re-id method (IDNet).

2) Comparison with Detection and ReID

We first compare our proposed person search framework with 15 other baseline combinations that break down the problem into separate detection and re-identification tasks. The results are summarized in Table 2. Our method outperforms the others by a large margin.




3) Effectiveness of Online Instance Matching

We validate the effectiveness of the proposed Online Instance Matching (OIM) loss by comparing it against Softmax baselines with or without pretraining the classifier matrix. The training identification accuracy and test person search mAP curves are demonstrated in Figure 6.



As the number of identities increases, the computation time of the OIM loss could become the bottleneck of the whole system. Thus we proposed in Section 3.2 to approximate Eq. (1) and Eq. (2) by sub-sampling both the labeled and unlabeled identities in the denominators. We validate this approach here by training the framework with sub-sampling sizes of 10, 100, and 1000. The test mAP curves are shown in Figure 7.
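The idea of sub-sampling the denominator can be sketched as below. The exact sampling scheme is not spelled out here, so this sketch makes its own assumptions: the target identity's LUT row is always kept, `sample_size` is applied separately to the remaining LUT rows and to the CQ entries, and sampling is uniform. The function name `subsampled_log_prob` is hypothetical.

```python
import math
import random


def subsampled_log_prob(feat, target, lut, cq, sample_size, temperature=0.1, rng=random):
    """Approximate log p(target | feat) using a random subset of the stored features."""
    others = [i for i in range(len(lut)) if i != target]
    lut_rows = [lut[i] for i in rng.sample(others, min(sample_size, len(others)))]
    cq_rows = rng.sample(list(cq), min(sample_size, len(cq)))
    rows = [lut[target]] + lut_rows + cq_rows  # the target row is always kept (assumption)
    logits = [sum(a * b for a, b in zip(row, feat)) / temperature for row in rows]
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return logits[0] - log_z


lut = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # toy labeled-identity features
cq = [[0.0, -1.0]]                           # toy unlabeled features
exact = subsampled_log_prob([1.0, 0.0], 0, lut, cq, sample_size=10, rng=random.Random(0))
approx = subsampled_log_prob([1.0, 0.0], 0, lut, cq, sample_size=1, rng=random.Random(0))
```

When `sample_size` covers every stored feature the result is the full log-probability; with fewer samples the denominator loses terms, so the approximation can only overestimate the probability of the target.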


We further investigate how the dimension of the L2-normalized feature vector affects person search performance. The results are summarized in Table 3, which compares different dimensions of the L2-normalized feature subspace. N/A means that we directly use the L2-normalized 2048-d global pooled feature vector.





4) Factors for Person Search

  • Detection Recall. We investigate how detection recall affects person search performance by using LOMO+XQDA as the re-id method and setting different thresholds on detection scores. A lower threshold reduces misdetections (increases the recall) but results in more false alarms. We choose recall rates ranging from 30% to the maximum value of each detector. The final person search mAP under each setting is shown in Figure 8.


  • Gallery size. Person search becomes more challenging as the gallery size increases. We evaluate several methods under different test gallery sizes from 50 to the full set of 6,978 images, following the protocols defined in Section 4.2. The test mAPs are shown in Figure 9.
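The trade-off behind the detection recall factor above can be made concrete with a small sketch. It assumes each detection has already been marked as a true positive or a false alarm, and that each true positive matches a distinct ground-truth box; the scores and the function name `recall_and_false_alarms` are made up for illustration.

```python
def recall_and_false_alarms(detections, num_ground_truth, threshold):
    """Return (recall, false alarm count) for detections scoring at or above the threshold.

    detections: list of (score, is_true_positive) pairs.
    """
    kept = [is_tp for score, is_tp in detections if score >= threshold]
    true_positives = sum(kept)
    return true_positives / num_ground_truth, len(kept) - true_positives


# Hypothetical detector output for a scene with 4 ground-truth pedestrians.
detections = [(0.95, True), (0.80, True), (0.60, False), (0.40, True), (0.20, False)]
sweep = {t: recall_and_false_alarms(detections, 4, t) for t in (0.9, 0.5, 0.1)}
```

Lowering the threshold from 0.9 to 0.1 in this toy sweep raises recall from 25% to 75% while the number of false alarms grows from 0 to 2, which is the effect the re-id stage then has to absorb.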