Abstract: We address the 3D object detection task by extracting features directly from depth maps with a 2D CNN. Most existing 3D object detection methods take point clouds as input, even when each point cloud is converted from a single depth map. Although they have achieved impressive performance, point-cloud-based 3D detectors usually have high computational cost and complex structure, which limits their use on mobile devices and in real-time scenarios. Building on the state-of-the-art VoteNet, we propose 2.5-VoteNet, a powerful and efficient depth-map-based 3D detection pipeline. Since our models extract features directly from depth maps, most computation remains in 2D space and can be executed efficiently. Instead of using an off-the-shelf 2D CNN, we introduce relative depth convolution (RDConv) to learn robust local features. Our end-to-end pipeline achieves state-of-the-art results on the challenging SUN RGB-D benchmark and surpasses the baseline by a clear margin on the ScanNet frame-level detection task. Meanwhile, our method reaches a significantly higher inference speed than existing methods (69 FPS).
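The abstract does not specify how RDConv is defined. Below is a minimal, hypothetical PyTorch sketch of one plausible reading: before applying a learned filter, the depth at each window's center pixel is subtracted from the whole window, so the filter operates on relative depth offsets rather than absolute depth. The module name, kernel layout, and all details are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelativeDepthConv(nn.Module):
    """Hypothetical sketch of a relative depth convolution (RDConv).

    Assumption: each k x k depth window is re-centered by subtracting
    its center pixel's depth, so the learned filter sees local depth
    offsets and is invariant to the absolute distance of the surface.
    """

    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3):
        super().__init__()
        self.kernel_size = kernel_size
        self.pad = kernel_size // 2
        # A 1x1 conv mixes the flattened (relative-depth) window into features.
        self.conv = nn.Conv2d(in_channels * kernel_size * kernel_size,
                              out_channels, kernel_size=1)

    def forward(self, depth: torch.Tensor) -> torch.Tensor:
        # depth: (B, C, H, W); typically C == 1 for a raw depth map.
        b, c, h, w = depth.shape
        k2 = self.kernel_size ** 2
        # Extract k x k windows around every pixel: (B, C*k*k, H*W).
        windows = F.unfold(depth, self.kernel_size, padding=self.pad)
        windows = windows.view(b, c, k2, h, w)
        # Subtract the window's center depth (element k*k // 2).
        center = windows[:, :, k2 // 2 : k2 // 2 + 1]
        relative = windows - center
        # Convolve over the relative-depth window.
        return self.conv(relative.view(b, -1, h, w))


# Usage: one depth channel in, 32 feature channels out, spatial size preserved.
rdconv = RelativeDepthConv(in_channels=1, out_channels=32)
features = rdconv(torch.randn(2, 1, 240, 320))  # -> (2, 32, 240, 320)
```

Because all of this runs as dense 2D operations, it is consistent with the abstract's claim that most computation stays in 2D space, which is where the reported speed advantage would come from.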