2D/3D Object Detection (Bounding Boxes) in the Autonomous Systems Domain

prash · 3 min read · May 23, 2021


1.1 Object Detection in 2D and 3D

Object detection is one of the core abilities of autonomous systems, which must perceive their surrounding environment. Currently, systems based on deep neural networks are the state of the art in image classification and object detection.

Fig 1.0: Comparison of 2D vs. 3D bounding boxes (source)

The vast majority of existing object detectors focus on finding 2D bounding boxes (e.g. YOLO, Faster R-CNN, etc.), which provide sufficient information for basic reasoning about object positions (Fig. 1.0 above). However, 2D boxes are insufficient for autonomous driving applications, where the poses of objects in the 3D world are needed.

Object detection in 3D is done by drawing a 3D bounding box (Fig. 1.1 below), which is a convenient and, for many applications, sufficient representation of objects in the 3D world. In line with general practice, we define the 3D bounding box as a tight rectangular cuboid around an object with 9 degrees of freedom (3 for position, 3 for rotation, and 3 for dimensions). This information is sufficient to determine the position, orientation, and size of the object in the 3D world, which is particularly useful for path planning of an autonomous car.

Fig 1.1: Mathematical representation of 2D and 3D bounding boxes (source)
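
To make the two representations concrete, here is a minimal sketch in Python of how a 2D box and a 9-degrees-of-freedom 3D box are commonly parameterized. The class and field names are my own assumptions for illustration; exact conventions (axis order, angle definitions) vary between datasets and frameworks.

```python
from dataclasses import dataclass

@dataclass
class Box2D:
    """Axis-aligned 2D bounding box in image pixel coordinates."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

@dataclass
class Box3D:
    """3D bounding box with 9 degrees of freedom:
    3 for position, 3 for rotation, 3 for dimensions."""
    # Position of the cuboid center in the 3D sensor/world frame (meters)
    x: float
    y: float
    z: float
    # Rotation around each axis (radians); many driving datasets keep
    # only the yaw angle and fix roll and pitch to zero
    roll: float
    pitch: float
    yaw: float
    # Extent of the cuboid (meters)
    length: float
    width: float
    height: float

# Hypothetical example: a car roughly 15 m ahead of the sensor
car = Box3D(x=15.0, y=0.5, z=-0.8,
            roll=0.0, pitch=0.0, yaw=0.1,
            length=4.5, width=1.9, height=1.6)
```
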

1.2 Autonomous driving context

Self-driving cars are equipped with many sensors that analyze the surrounding environment. LiDAR and camera are the sensors with the highest resolution and are complementary: the LiDAR outputs a 3D point cloud, measuring scene geometry, while the camera outputs 2D images, capturing visual appearance. For example, the camera is crucial for analyzing street signs and other objects that cannot be distinguished based on geometry alone.

Conversely, the active laser scanning of LiDAR also works well at night, while the passive camera is highly affected by external lighting conditions. In spite of recent progress in 2D and 3D deep learning, many questions remain regarding the combination of multi-modal data, especially how to learn joint features given their heterogeneous representation spaces.

Towards integrated data-driven pipelines

With the advent of deep learning and the intention to tackle complex urban driving, there has been a paradigm shift in the software and hardware architectures of perception systems in autonomous driving. Traditional perception architectures rely on lightweight recognition algorithms that can be optimized to fit into economical processing units directly integrated into each sensor. Consequently, each sensor already produces high-level output, e.g. bounding boxes for object detection, which can then be fused centrally, e.g. with a Kalman filter. Today the trend goes towards raw data fusion, where the sensors simply provide the captured data, i.e. point cloud and image, to a computationally powerful central processing unit, usually carrying a GPU to accelerate neural networks. A multi-modal perception algorithm running on the central processing unit can then exploit the complementarity of the two modalities.
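
As an illustration of what such raw data fusion involves, the sketch below (my own example, not from the article; the matrix names are assumptions) projects LiDAR points into the camera image using the usual extrinsic and intrinsic calibration, so that image features can be associated with 3D points.

```python
import numpy as np

def project_lidar_to_image(points_xyz, T_cam_from_lidar, K):
    """Project Nx3 LiDAR points into pixel coordinates.

    points_xyz:        (N, 3) points in the LiDAR frame
    T_cam_from_lidar:  (4, 4) extrinsic transform, LiDAR frame -> camera frame
    K:                 (3, 3) camera intrinsic matrix
    Returns (M, 2) pixel coordinates of the points in front of the camera.
    """
    # Homogeneous coordinates, then move the points into the camera frame
    ones = np.ones((points_xyz.shape[0], 1))
    pts_h = np.hstack([points_xyz, ones])              # (N, 4)
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]    # (N, 3)

    # Keep only points in front of the camera (positive depth)
    pts_cam = pts_cam[pts_cam[:, 2] > 0.1]

    # Pinhole projection: apply intrinsics, then divide by depth
    pixels_h = (K @ pts_cam.T).T                        # (M, 3)
    pixels = pixels_h[:, :2] / pixels_h[:, 2:3]         # (M, 2)
    return pixels
```
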

1.3 Process Parts

On a very high level, there are 3 parts to the whole process:

  1. Data Generation
  2. Training
  3. Testing

Data Generation

Capturing data for training and testing is an expensive operation, but it is worth it: the same dataset, once captured and labeled well, can be used to train and test a wide variety of models and algorithms. For most developers building real-world applications, few good 3D data sources are available.
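
To give a rough idea of what a captured sample looks like in practice, the snippet below reads a KITTI-style binary LiDAR scan, which stores float32 quadruples of x, y, z, and intensity. The file path is a placeholder, and other datasets use different formats.

```python
import numpy as np

def load_lidar_scan(path):
    """Read a KITTI-style .bin scan: float32 quadruples of x, y, z, intensity."""
    scan = np.fromfile(path, dtype=np.float32).reshape(-1, 4)
    return scan[:, :3], scan[:, 3]   # points (N, 3), intensities (N,)

# Hypothetical path to one frame of a captured dataset
points, intensity = load_lidar_scan("data/velodyne/000000.bin")
print(points.shape)
```
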

Training

Testing
