CVPR 2020: The Top Object Detection Papers

The recently concluded CVPR 2020 featured a large number of contributions that push object detection forward. In this piece, we’ll look at six of the especially impressive papers.

A Hierarchical Graph Network for 3D Object Detection on Point Clouds

This paper proposes a graph convolution-based (GConv) hierarchical graph network (HGNet) for 3D object detection. It processes raw point clouds directly to predict 3D bounding boxes. HGNet is able to capture the relationship of the points and uses multi-level semantics for object detection.

HGNet consists of three main components:

a GConv based U-shape network (GU-net)
a Proposal Generator
a Proposal Reasoning Module (ProRe Module) that uses a fully-connected graph to reason on the proposals

The authors present a shape-attentive GConv (SA-GConv) to capture the local shape features. This is done by modeling the relative geometric positions to describe object shapes.
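To make "relative geometric positions" concrete, here is a minimal sketch that computes, for every point, the offsets to its nearest neighbours. It is not the SA-GConv layer itself: the function name `relative_positions`, the use of k nearest neighbours, and the brute-force search are all assumptions made for illustration.

```python
import math

def relative_positions(points, k):
    """For every point, return the offsets to its k nearest neighbours.

    Hypothetical helper: a learned layer would consume these offsets,
    here we only compute them.
    """
    result = []
    for i, p in enumerate(points):
        others = [q for j, q in enumerate(points) if j != i]
        others.sort(key=lambda q: math.dist(p, q))
        offsets = [tuple(qc - pc for qc, pc in zip(q, p)) for q in others[:k]]
        result.append(offsets)
    return result

cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
print(relative_positions(cloud, k=1))
```

Because the offsets are expressed relative to each centre point, they describe the local shape around that point independently of where the object sits in the scene.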

The SA-GConv based U-shape network captures the multi-level features. These features are then mapped into the same feature space by a voting module and used to generate proposals. In the next step, a GConv based Proposal Reasoning Module uses the proposals to predict bounding boxes.

Here are some of the performance results obtained on the SUN RGB-D V1 dataset.

HVNet: Hybrid Voxel Network for LiDAR Based 3D Object Detection

In this paper, the authors present the Hybrid Voxel Network (HVNet), a one-stage network for point cloud-based 3D object detection for autonomous driving.

The voxel feature encoding (VFE) method used in this paper contains three steps:

Voxelization: assigning the point cloud to a 2D voxel grid
Voxel Feature Extraction: computing a grid-dependent point-wise feature that’s fed to a PointNet-style feature encoder
Projection: aggregating the point-wise features into a voxel-level feature and projecting it back to the original grid, which forms a pseudo-image feature map
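The three steps above can be sketched in a few lines of plain Python. This is a toy illustration, not the paper's implementation: the function names are hypothetical, and the "encoder" is a hand-written stand-in (offset from the cell centre, followed by a per-voxel max) where a real system would use a learned PointNet-style network.

```python
from collections import defaultdict

def voxelize_2d(points, voxel_size, x_min=0.0, y_min=0.0):
    """Step 1: group (x, y, z) points by the 2D grid cell they fall into."""
    voxels = defaultdict(list)
    for x, y, z in points:
        ix = int((x - x_min) // voxel_size)
        iy = int((y - y_min) // voxel_size)
        voxels[(ix, iy)].append((x, y, z))
    return voxels

def encode_voxels(voxels, voxel_size, x_min=0.0, y_min=0.0):
    """Step 2: grid-dependent point features, max-pooled per voxel.

    Assumption: offsets to the cell centre stand in for learned features.
    """
    features = {}
    for (ix, iy), pts in voxels.items():
        cx = x_min + (ix + 0.5) * voxel_size
        cy = y_min + (iy + 0.5) * voxel_size
        point_feats = [(x - cx, y - cy, z) for x, y, z in pts]
        features[(ix, iy)] = tuple(max(col) for col in zip(*point_feats))
    return features

def project_to_pseudo_image(features, width, height, channels=3):
    """Step 3: scatter voxel features onto a dense height x width grid."""
    image = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    for (ix, iy), feat in features.items():
        if 0 <= ix < width and 0 <= iy < height:
            image[iy][ix] = list(feat)
    return image
```

The resulting pseudo-image can be processed by an ordinary 2D convolutional backbone, which is what makes this family of methods attractive for real-time use.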

The voxel size is a key parameter in VFE methods. Smaller voxels capture finer geometric features and localize objects better, but they take longer at inference. Coarser voxels give faster inference because they lead to a smaller feature map, but their performance is inferior.
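A quick calculation shows why voxel size drives inference cost. The detection range used here is a made-up example, not a value taken from the paper, and `feature_map_shape` is a hypothetical helper.

```python
import math

def feature_map_shape(x_range, y_range, voxel_size):
    """Return (width, height) of the 2D grid covering the given ranges."""
    width = math.ceil((x_range[1] - x_range[0]) / voxel_size)
    height = math.ceil((y_range[1] - y_range[0]) / voxel_size)
    return width, height

# Hypothetical detection range in metres, chosen only for illustration.
X_RANGE, Y_RANGE = (0.0, 80.0), (-40.0, 40.0)
for size in (0.5, 0.25):
    w, h = feature_map_shape(X_RANGE, Y_RANGE, size)
    print(f"voxel size {size} m -> {w} x {h} = {w * h} cells")
```

Halving the voxel edge length quadruples the number of cells in the 2D feature map, so every later convolution has roughly four times as much work to do.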

The authors propose the Hybrid Voxel Network (HVNet) to enable the utilization of fine-grained voxel features. It’s made up of three steps:

Multi-Scale Voxelization: creating a set of feature voxel scales and assigning each point to multiple voxels
Hybrid Voxel Feature Extraction: computing a voxel-dependent feature for each scale and feeding it into the attentive voxel feature encoder (AVFE); features from each voxel scale are concatenated point-wise
Dynamic Feature Projection: projecting the features back to the pseudo-image by creating a set of multi-scale projection voxels
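The idea of assigning each point to voxels at several scales and concatenating the per-scale features point-wise can be sketched as follows. This is only an illustration under stated assumptions: `hybrid_point_features` is a hypothetical name, and the per-scale feature (voxel occupancy and height relative to the voxel mean) is a simple stand-in for the learned attentive encoder.

```python
from collections import defaultdict

def voxel_index(point, size):
    """2D grid cell of an (x, y, z) point for a given voxel size."""
    x, y, _z = point
    return (int(x // size), int(y // size))

def hybrid_point_features(points, voxel_sizes):
    """Concatenate, for each point, one toy feature pair per voxel scale."""
    per_point = [[] for _ in points]
    for size in voxel_sizes:
        # Group point indices by the voxel they occupy at this scale.
        groups = defaultdict(list)
        for i, p in enumerate(points):
            groups[voxel_index(p, size)].append(i)
        # Stand-in for a learned encoder: occupancy and z relative to the mean.
        for members in groups.values():
            mean_z = sum(points[i][2] for i in members) / len(members)
            for i in members:
                per_point[i].extend([float(len(members)), points[i][2] - mean_z])
    return per_point

cloud = [(0.2, 0.2, 1.0), (0.8, 0.2, 3.0), (1.6, 0.2, 5.0)]
print(hybrid_point_features(cloud, voxel_sizes=[1.0, 2.0]))
```

Each point ends up with one feature vector that mixes fine and coarse context, which is what allows the projection step to use a different (and cheaper) grid than the one used for feature extraction.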

Here are the results obtained on the KITTI dataset.

Point-GNN: Graph Neural Network for 3D Object Detection in a Point Cloud

The authors of this paper present a graph neural network, Point-GNN, to detect objects from a LiDAR point cloud. The network predicts the category and shape of the object that each vertex in the graph belongs to. Point-GNN uses an auto-registration mechanism to reduce translation variance, and it detects multiple objects in a single shot.

The proposed method has three components:

graph construction: a voxel downsampled point cloud is used for graph construction
a graph neural network of T iterations
bounding box merging and scoring
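The first component, graph construction, can be sketched as below. The voxel downsampling follows the description above; connecting vertices that lie within a fixed radius of each other is an assumption of this sketch, as are the function names and the brute-force neighbour search.

```python
import math
from collections import defaultdict

def voxel_downsample(points, voxel_size):
    """Keep one representative (the centroid) per occupied 3D voxel."""
    cells = defaultdict(list)
    for p in points:
        key = tuple(int(c // voxel_size) for c in p)
        cells[key].append(p)
    return [
        tuple(sum(axis) / len(pts) for axis in zip(*pts))
        for pts in cells.values()
    ]

def radius_graph(vertices, radius):
    """Directed edges (i, j) between distinct vertices within `radius`."""
    edges = []
    for i, a in enumerate(vertices):
        for j, b in enumerate(vertices):
            if i != j and math.dist(a, b) <= radius:
                edges.append((i, j))
    return edges

cloud = [(0.1, 0.1, 0.1), (0.3, 0.3, 0.3), (1.2, 0.1, 0.1), (5.0, 5.0, 5.0)]
vertices = voxel_downsample(cloud, voxel_size=1.0)
print(len(vertices), radius_graph(vertices, radius=1.5))
```

Downsampling first keeps the number of vertices, and therefore the number of edges the network has to iterate over, manageable for a full LiDAR sweep.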

Here are the results obtained on the KITTI dataset:

The code is available here:

Camouflaged Object Detection

This paper addresses the challenge of detecting objects that blend into their surroundings, a task known as camouflaged object detection (COD). The authors also present a new dataset called COD10K. It contains 10,000 images covering camouflaged objects in many natural scenes, spread over 78 object categories. The images are annotated with category labels, bounding boxes, and instance-level and matting-level labels.

The authors develop a COD framework called a Search Identification Network (SINet). The code is available here:

The network has two main modules:

the search module (SM) for searching for a camouflaged object
the identification module (IM) for detecting the object

Here are the results obtained on various datasets:

Few-Shot Object Detection with Attention-RPN and Multi-Relation Detector

This paper proposes a few-shot object detection network whose objective is to detect objects of unseen categories that have only a few annotated examples.

Their method includes an attention-RPN, multi-relation detector, and a contrastive training strategy. The method takes advantage of the similarity between the few-shot support set and query set to identify new objects, while also reducing false identification. The authors also contribute a new dataset that contains 1000 categories with objects that have high-quality annotations.

The network architecture consists of a weight-shared framework that has multiple branches: one branch is for the query set, while the rest are for the support set. The query branch of the weight-shared framework is a Faster R-CNN network.

The authors introduce an attention-RPN and a detector with multi-relation modules to accurately match the support objects against the candidate boxes in the query.
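One way to picture how support features can guide proposal generation is a channel-wise attention over the query feature map. The sketch below assumes the support feature map is average-pooled into a single vector that then re-weights every query location channel by channel; this is an illustrative simplification with hypothetical function names, not the authors' code.

```python
def average_pool(feature_map):
    """Collapse an H x W x C feature map into one C-dimensional vector."""
    cells = [cell for row in feature_map for cell in row]
    return [sum(channel) / len(cells) for channel in zip(*cells)]

def attention_map(query_map, support_map):
    """Re-weight each query location channel-wise by the pooled support."""
    kernel = average_pool(support_map)
    return [
        [[q * k for q, k in zip(cell, kernel)] for cell in row]
        for row in query_map
    ]

# Toy 1 x 2 feature maps with 2 channels each (nested lists: H x W x C).
support = [[[1.0, 0.0], [3.0, 0.0]]]
query = [[[1.0, 5.0], [0.5, 5.0]]]
print(attention_map(query, support))
```

Channels that are inactive in the support example are suppressed in the query, so a proposal network running on the attended map is biased toward regions that resemble the support object.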

Here are some results obtained on the ImageNet dataset.

Here are some observations obtained on a number of datasets.

D2Det: Towards High-Quality Object Detection and Instance Segmentation

The authors of this paper propose D2Det, a method that addresses both precise localization and accurate classification. They introduce a dense local regression that predicts multiple dense box offsets for an object proposal. This enables them to achieve precise localization.

The authors also introduce a discriminative RoI pooling scheme in order to achieve accurate classification. The pooling scheme samples from several sub-regions of a proposal and performs adaptive weighting to get discriminating features.

The code is available at:

The method is based on the standard Faster R-CNN framework. The traditional box offset regression of Faster R-CNN is replaced by the proposed dense local regression, and classification is enhanced by the discriminative RoI pooling.

In this two-stage method, a region proposal network (RPN) is used in the first stage, while separate classification and regression branches are used in the second stage. The classification branch is based on discriminative RoI pooling, and the local regression branch aims at precise localization of an object.
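The dense local regression idea, where many locations inside a proposal each predict box offsets, can be sketched as follows. This sketch assumes each location predicts its distances to the left, top, right, and bottom edges of the object (normalized by the proposal size) and that the per-location boxes are averaged into one final box; the function name and these conventions are assumptions made for illustration.

```python
def decode_dense_offsets(proposal, grid_points, offsets):
    """Turn per-location (l, t, r, b) offsets into one box by averaging.

    proposal:    (x1, y1, x2, y2) of the region proposal
    grid_points: (x, y) locations inside the proposal
    offsets:     one (l, t, r, b) tuple per location, normalized by
                 the proposal width (l, r) and height (t, b)
    """
    px1, py1, px2, py2 = proposal
    w, h = px2 - px1, py2 - py1
    boxes = [
        (x - l * w, y - t * h, x + r * w, y + b * h)
        for (x, y), (l, t, r, b) in zip(grid_points, offsets)
    ]
    n = len(boxes)
    return tuple(sum(box[k] for box in boxes) / n for k in range(4))

proposal = (0.0, 0.0, 8.0, 8.0)
grid = [(2.0, 2.0), (6.0, 6.0)]
predicted = [(0.125, 0.125, 0.625, 0.625), (0.625, 0.625, 0.125, 0.125)]
print(decode_dense_offsets(proposal, grid, predicted))
```

Compared with regressing a single offset per proposal, each location contributes its own estimate of the box, so an error at one location is averaged out by the others.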

Here are the results obtained on the MS COCO dataset:

Final Thought

When it comes to object detection and a whole host of other computer vision tasks, CVPR 2020 offered plenty more. Here’s the open source repo of all the conference papers, in case you’d like to explore further.

Editor’s Note: Heartbeat is a contributor-driven online publication and community dedicated to providing premier educational resources for data science, machine learning, and deep learning practitioners. We’re committed to supporting and inspiring developers and engineers from all walks of life.

Editorially independent, Heartbeat is sponsored and published by Comet, an MLOps platform that enables data scientists & ML teams to track, compare, explain, & optimize their experiments. We pay our contributors, and we don’t sell ads.
