In this work, we address the human parsing task with a novel Contextualized Convolutional Neural Network (Co-CNN) architecture, which integrates the cross-layer context, global image-level context, semantic edge context, within-super-pixel context and cross-super-pixel neighborhood context into a unified network. Given an input human image, Co-CNN produces the pixel-wise categorization in an end-to-end way. First, the cross-layer context is captured by our basic local-to-global-to-local structure, which hierarchically combines the global semantic information and the local fine details across different convolutional layers. Second, the global image-level label prediction is used as an auxiliary objective in an intermediate layer of the Co-CNN, and its outputs are further used to guide the feature learning in subsequent convolutional layers, thereby leveraging the global image-level context. Third, semantic edge context is incorporated into Co-CNN, where the high-level semantic boundaries are leveraged to guide pixel-wise labeling. Finally, to further utilize the local super-pixel contexts, the within-super-pixel smoothing and cross-super-pixel neighborhood voting are formulated as natural sub-components of the Co-CNN to achieve local label consistency in both the training and testing processes. Comprehensive evaluations on two public datasets demonstrate the significant superiority of our Co-CNN over other state-of-the-art methods for human parsing.
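The two local super-pixel steps named above can be illustrated with a minimal pure-Python sketch. It is not the Co-CNN layer implementation: the function names, the 4-connectivity rule for deciding which super-pixels are neighbors, and the mixing weight `alpha` are all assumptions made for illustration. The sketch averages class confidences inside each super-pixel, then blends each super-pixel's average with the mean of its neighbors' averages.

```python
from collections import defaultdict


def within_superpixel_smoothing(conf, sp):
    """Average per-pixel class confidences inside each super-pixel.

    conf: H x W x C nested lists of class confidences.
    sp:   H x W nested lists of super-pixel ids.
    Returns {super-pixel id: mean confidence vector of length C}.
    """
    num_classes = len(conf[0][0])
    sums = defaultdict(lambda: [0.0] * num_classes)
    counts = defaultdict(int)
    for conf_row, sp_row in zip(conf, sp):
        for pixel_conf, sp_id in zip(conf_row, sp_row):
            counts[sp_id] += 1
            for c, value in enumerate(pixel_conf):
                sums[sp_id][c] += value
    return {s: [v / counts[s] for v in sums[s]] for s in sums}


def adjacent_superpixels(sp):
    """Map each super-pixel id to the ids it touches (4-connectivity, assumed)."""
    neighbors = defaultdict(set)
    height, width = len(sp), len(sp[0])
    for y in range(height):
        for x in range(width):
            for ny, nx in ((y, x + 1), (y + 1, x)):
                if ny < height and nx < width and sp[ny][nx] != sp[y][x]:
                    neighbors[sp[y][x]].add(sp[ny][nx])
                    neighbors[sp[ny][nx]].add(sp[y][x])
    return neighbors


def neighborhood_voting(means, neighbors, alpha=0.3):
    """Blend each super-pixel's mean with the average of its neighbors' means.

    alpha is a hypothetical mixing weight, not a value taken from the paper.
    """
    voted = {}
    for s, own in means.items():
        nbrs = neighbors.get(s, ())
        if not nbrs:
            voted[s] = list(own)  # isolated super-pixel keeps its own mean
            continue
        nbr_mean = [sum(means[n][c] for n in nbrs) / len(nbrs)
                    for c in range(len(own))]
        voted[s] = [(1 - alpha) * o + alpha * m for o, m in zip(own, nbr_mean)]
    return voted


def label_map(sp, voted):
    """Assign every pixel the highest-scoring class of its super-pixel."""
    return [[max(range(len(voted[s])), key=voted[s].__getitem__) for s in row]
            for row in sp]
```

Because every pixel in a super-pixel receives the same confidence vector, the resulting label map is constant inside each super-pixel, which is the local label consistency the abstract refers to.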
Figure 1. Our Co-CNN integrates the cross-layer context, global image-level context and local super-pixel contexts into a unified network. It consists of cross-layer combination, global image-level label prediction, within-super-pixel smoothing and cross-super-pixel neighborhood voting. First, given an input 150 × 100 image, we extract the feature maps at four resolutions (i.e., 150 × 100, 75 × 50, 37 × 25 and 18 × 12). Then we gradually up-sample the feature maps and combine the corresponding early, fine layers (blue dashed line) and deep, coarse layers (blue circle with plus) at the same resolutions to capture the cross-layer context. Second, an auxiliary objective (shown as “Squared loss on image-level labels”) is appended after the down-sampling stream to predict global image-level labels. These predicted probabilities are then aggregated into the subsequent layers after the up-sampling (green line) and used to re-weight the pixel-wise prediction (green circle with plus). Finally, the within-super-pixel smoothing and cross-super-pixel neighborhood voting are performed based on the predicted confidence maps (orange planes) and the generated super-pixel over-segmentation map to produce the final parsing result. Only down-sampling, up-sampling, and prediction layers are shown; intermediate
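The four resolutions listed in the Figure 1 caption are consistent with halving each spatial dimension and rounding down at every down-sampling step. The short sketch below reproduces that arithmetic; the function name and the floor-halving rule are assumptions inferred from the listed sizes, not a statement of how the down-sampling layers are configured.

```python
def feature_map_sizes(height, width, num_scales):
    """List the (height, width) of each scale, assuming floor-halving per step."""
    sizes = [(height, width)]
    for _ in range(num_scales - 1):
        height, width = height // 2, width // 2
        sizes.append((height, width))
    return sizes


# A 150 x 100 input with four scales gives the sizes quoted in the caption.
print(feature_map_sizes(150, 100, 4))
```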
Table 1. Comparison of human parsing performance between several architectural variants of our model and four state-of-the-art methods, evaluated on ATR. The ∗ indicates that the method is not a fully end-to-end framework.
Table 2. Per-class comparison of F-1 scores between several variants of our model and four state-of-the-art methods on ATR.
Table 3. Comparison of parsing performance with three state-of-the-art methods on the test images of Fashionista.
Figure 2. Result comparison of our Co-CNN and two state-of-the-art methods. For each image, we show the parsing results of Paper-Doll, ATR and our Co-CNN, in that order.
- ATR – X. Liang, S. Liu, X. Shen, J. Yang, L. Liu, J. Dong, L. Lin, and S. Yan. Deep human parsing with active template regression. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 2015.
- Fashionista – K. Yamaguchi, M. Kiapour, L. Ortiz, and T. Berg. Parsing clothing in fashion photographs. In Computer Vision and Pattern Recognition, pages 3570–3577, 2012.
- Paper-Doll – K. Yamaguchi, M. Kiapour, and T. Berg. Paper doll parsing: Retrieving similar styles to parse clothing items. In International Conference on Computer Vision, 2013.