Exploring Hierarchical Graph Representation for Large-Scale Zero-Shot Image Classification


The main question we address in this paper is how to scale up visual recognition of unseen classes, also known as zero-shot learning, to tens of thousands of categories, as in the ImageNet-21K benchmark. At this scale, and especially with the many fine-grained categories in ImageNet-21K, it is critical to learn high-quality visual-semantic representations that are discriminative enough to recognize unseen classes and distinguish them from seen ones. We propose a Hierarchical Graphical knowledge Representation framework for the confidence-based classification method, dubbed HGR-Net. Our experimental results demonstrate that HGR-Net can grasp class inheritance relations by utilizing hierarchical conceptual knowledge. Our method significantly outperforms all existing techniques, improving performance by 7% over the runner-up approach on the ImageNet-21K benchmark. We show that HGR-Net is learning-efficient in few-shot scenarios. We also analyze our method on smaller datasets such as ImageNet-21K-P, 2-hops, and 3-hops, demonstrating its generalization ability. Our benchmark and code will be made publicly available.


Intuitive illustration of our proposed HGR-Net. Suppose the ground truth is Hunting Dog; we can then find the real-label path: Root --> Animal --> Domestic Animal --> Dog --> Hunting Dog. Our goal is to efficiently leverage this semantic hierarchical information to better understand the visual-language pairs.

Method: HGR-Net

HGR-Net: Suppose the annotated single label is {D}, and we can find the tracked label path {R} ... --> {A} --> {B} --> {D} from the semantic graph extended from WordNet. We first set {D} as the positive anchor and contrast it, layer by layer, with negatives that are sampled siblings of its ancestors (i.e., {E}, {C}, {G}). We then iterate, moving the positive anchor up to the controlled depths {B} and {A}, whose layer-by-layer negatives are ({C}, {G}) and {G}, respectively. Finally, we use a memory-efficient adaptive re-weighting strategy to fuse knowledge from the different conceptual levels.
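
The anchor and negative selection described above can be illustrated with a small sketch. This is not the released implementation: the toy tree shape (R -> {A, G}, A -> {B, C}, B -> {D, E}) is an assumption chosen to be consistent with the letters in the description, the helper names (`label_path`, `siblings`, `layerwise_negatives`, `anchors_with_negatives`) are hypothetical, and all siblings are taken as negatives where the method samples them.

```python
# Toy hierarchy (assumed shape, not taken from the paper's data):
# R -> {A, G}, A -> {B, C}, B -> {D, E}. Maps each node to its parent.
PARENT = {"A": "R", "G": "R", "B": "A", "C": "A", "D": "B", "E": "B"}


def label_path(node):
    """Return the path from the root down to `node`."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path[::-1]


def siblings(node):
    """Return the other children of `node`'s parent (empty for the root)."""
    parent = PARENT.get(node)
    if parent is None:
        return []
    return sorted(n for n, p in PARENT.items() if p == parent and n != node)


def layerwise_negatives(anchor):
    """Collect siblings along the anchor's path, deepest layer first.

    Simplification: every sibling is kept; the described method samples them.
    """
    negatives = []
    for node in reversed(label_path(anchor)):
        negatives.extend(siblings(node))
    return negatives


def anchors_with_negatives(label):
    """Map each positive anchor on the label path (root excluded) to its negatives."""
    path = label_path(label)
    return {anchor: layerwise_negatives(anchor) for anchor in reversed(path[1:])}


if __name__ == "__main__":
    for anchor, negatives in anchors_with_negatives("D").items():
        print(anchor, negatives)
```

On this toy tree, the label {D} yields the anchors {D}, {B}, and {A} with negatives (E, C, G), (C, G), and (G), matching the walk-through in the description. The adaptive re-weighting step is not sketched here because its exact form is not given in this summary.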



Results on ImageNet-21K-D

Top@k accuracy, Top-Overlap Ratio (TOR), and Point-Overlap Ratio (POR) for different models on ImageNet-21K-D, tested on unseen classes only. Tr means the text encoder is the CLIP Transformer.
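
For readers unfamiliar with the Top@k metric, the following is a minimal sketch of the standard top-k accuracy computation, assuming per-class scores are already available for each test image. The function name `top_k_accuracy` and the toy scores are illustrative only. TOR and POR are not sketched because their definitions are given in the paper rather than here.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true class is among the k highest-scoring classes.

    scores: list of per-sample lists, one score per candidate class.
    labels: list of true class indices, one per sample.
    """
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k classes with the highest scores for this sample.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


if __name__ == "__main__":
    toy_scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]  # made-up values
    toy_labels = [2, 1]
    print(top_k_accuracy(toy_scores, toy_labels, k=1))
    print(top_k_accuracy(toy_scores, toy_labels, k=2))
```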

For more results, please refer to the original paper.


If you find our work useful in your research, please consider citing:
  title={Exploring Hierarchical Graph Representation for Large-Scale Zero-Shot Image Classification},
  author={Yi, Kai and Shen, Xiaoqian and Gou, Yunhao and Elhoseiny, Mohamed},
  journal={arXiv preprint arXiv:2203.01386},