Center-based 3D Object Detection and Tracking
Abstract
Three-dimensional objects are commonly represented as 3D boxes in a point-cloud. This representation mimics the well-studied image-based 2D bounding-box detection but comes with additional challenges. Objects in a 3D world do not follow any particular orientation, and box-based detectors have difficulties enumerating all orientations or fitting an axis-aligned bounding box to rotated objects. In this paper, we instead propose to represent, detect, and track 3D objects as points. Our framework, CenterPoint, first detects centers of objects using a keypoint detector and regresses to other attributes, including 3D size, 3D orientation, and velocity. In a second stage, it refines these estimates using additional point features on the object. In CenterPoint, 3D object tracking simplifies to greedy closest-point matching. The resulting detection and tracking algorithm is simple, efficient, and effective. CenterPoint achieved state-of-the-art performance on the nuScenes benchmark for both 3D detection and tracking, with 65.5 NDS and 63.8 AMOTA for a single model. On the Waymo Open Dataset, CenterPoint outperforms all previous single-model methods by a large margin and ranks first among all Lidar-only submissions. The code and pretrained models are available at https://github.com/tianweiy/CenterPoint.
Community
Proposes CenterPoint: instead of detecting 3D boxes directly (box-based detectors struggle with rotated objects), detect each object's 3D center point with a keypoint detector and regress size, orientation, and velocity from it; a second stage refines these estimates, and 3D tracking reduces to greedy closest-point matching. Uses a standard LiDAR backbone (a 3D encoder such as VoxelNet or PointPillars) for the representation, flattens it to a bird's-eye view (BEV, top-down) map, and gets object centers from a keypoint detector; the refinement stage is efficient because it uses only a few points per object. Builds on 2D CenterNet, which converts object detection into a center keypoint estimation task: it predicts a w x h x K heatmap (one channel per class) and regresses a class-agnostic size map (width and height) at each center location; it is trained to produce Gaussian kernels at object centers and to regress the bounding box shape. 3D detection in the top view predicts the 3D center point, 3D size, and yaw rotation. A 3D backbone gives spatial features; the center head aims to produce a class-specific Gaussian centered at each object in the (top) map view; regression heads on the center features predict sub-voxel location refinement, height above ground, 3D size, and yaw (rotation) angle. A separate head predicts velocity for tracking: tracks are created by matching objects to their nearest matches from previous frames (unmatched tracks are kept for up to three frames). The second stage takes point features from the centers of all outward-facing box faces (and the box center), concatenates them, passes them through an MLP, and regresses a confidence score and a bounding box refinement. Uses BCE loss for confidence; the final confidence is the geometric average of the scores from both stages. The first stage has 3x3 conv, BN, and ReLU (shared, then multi-headed); the second stage has MLP, ReLU, and dropout (shared) followed by FC layers (for confidence and box regression).
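Three of the steps in the summary above are easy to illustrate in a few lines of NumPy: rendering a Gaussian heatmap target at an object center, fusing the two stage scores by geometric average, and greedy nearest-center matching for tracking. This is only a minimal sketch, not the authors' implementation. The function names (draw_gaussian, fuse_confidence, greedy_match) and the parameters sigma, dt, and max_dist are hypothetical, and the sketch assumes that the predicted velocity is used to project each current center back to the previous frame before matching, and that detections are processed in descending score order.

```python
import numpy as np


def draw_gaussian(heatmap, center, sigma):
    """Render an unnormalized Gaussian (peak value 1) at center = (x, y).

    `sigma` is a hypothetical fixed radius; overlapping objects keep the
    element-wise maximum so every center remains a local peak.
    """
    h, w = heatmap.shape
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-((xs - center[0]) ** 2 + (ys - center[1]) ** 2) / (2.0 * sigma ** 2))
    np.maximum(heatmap, g, out=heatmap)
    return heatmap


def fuse_confidence(first_stage_score, second_stage_score):
    """Geometric average of the first- and second-stage confidence scores."""
    return np.sqrt(first_stage_score * second_stage_score)


def greedy_match(track_centers, det_centers, det_velocities, det_scores, dt, max_dist):
    """Greedy closest-point matching in the map view.

    Assumption: each detection is moved back by velocity * dt, then matched
    to the nearest still-unmatched track if it lies within `max_dist`.
    Returns a dict {detection index: track index}.
    """
    track_centers = np.asarray(track_centers, dtype=float).reshape(-1, 2)
    det_centers = np.asarray(det_centers, dtype=float).reshape(-1, 2)
    det_velocities = np.asarray(det_velocities, dtype=float).reshape(-1, 2)
    projected = det_centers - det_velocities * dt

    used = np.zeros(len(track_centers), dtype=bool)
    matches = {}
    if len(track_centers) == 0:
        return matches
    # Visit detections from most to least confident.
    for d in np.argsort(-np.asarray(det_scores, dtype=float)):
        dist = np.linalg.norm(track_centers - projected[d], axis=1)
        dist[used] = np.inf  # a track can be claimed only once
        t = int(np.argmin(dist))
        if dist[t] <= max_dist:
            matches[int(d)] = t
            used[t] = True
    return matches
```

Detections left out of the returned dict would start new tracks, and tracks that no detection claims would be kept alive for a few frames (up to three, per the summary above) before being dropped; that bookkeeping is omitted here.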
Evaluated with VoxelNet and PointPillars backbones on the Waymo Open Dataset and nuScenes (mAP for detection and multiple object tracking accuracy, MOTA, for tracking); best detection results (compared to StarNet, PointPillars, RCD, PV-RCNN, and CVCNet, at Level 1 and Level 2 difficulty on Waymo, for the vehicle and pedestrian classes) and better tracking than the AB3D tracker. Ablation results: the center-based representation beats the anchor-based one, and the two-stage approach gives better results (compares surface-center and dense-sampling features, and also Voxel Set Abstraction and radial basis function interpolation). Runs detection and tracking faster than CBGS, with higher accuracy, on the nuScenes validation set. The appendix has the tracking algorithm, implementation details (baselines), more results, and ablations. From UT Austin.
Links: GitHub (in mmdetection3d)