We provide depth, RGB and IR images of front car cabin for both real and synthetic imagesets. Corresponding to the images, ground truth annotations for 2D and 3D object detection and instance segmentation are provided. The depth images are provided as uint16 .png files with the values being in millimeters.

Real Imageset

The data has been divided into separate folders according to the camera pose. Within each folder, there are 5 directories named Depth, RGB, IR, RGB videos and Labels containing depth, RGB, infrared images, RGB videos and labels respectively. All images are further grouped into sequence folders such that frames from the same recording are grouped in one folder.

Directory Structure



Semantic and instance segmentation

For each image, two masks are provided: class mask and instance mask. Each pixel value in class mask corresponds to the class_id of the class that the pixel belongs to. Whereas each pixel value in the instance mask corresponds to the count of instances for that particular class. For example, if an image contains one person and one book, the instance mask values will be 1 for both the person and the book. However, if an image consists of two persons, the instance mask will contain 1 where the first person appears and 2 where the second person appears in the image. The masks are saved as .png images. Following dictionary maps class_ids used in class masks to their class labels.

class_ids = {
                 0: 'Background',
                 1: 'Person'
                 2: 'Backpack',
                 3: 'WinterJacket',
                 4: 'Box',
                 5: 'WaterBottle',
                 6: 'MobilePhone',
                 7: 'Blanket',
                 8: 'Accessory',
                 9: 'Book',
                 10: 'Laptop',
                 11: 'LaptopBag',
                 12: 'Infant',
                 13: 'Handbag',
                 14: 'FF', #Front-facing childseat
                 15: 'RF', #Rearfacing childseat
                 16: 'Child'

Bounding boxes

For each sequence, we provide 2 CSV files containing the 2D and 3D bounding boxes for all the images from that sequence.

2D Bounding Boxes

The columns in the CSV files are titled as the following:

frame, label , x_min, y_min, x_max, y_max, low_remission

where the (x_min, y_min) and (x_max,y_max) are the top-left and bottom-right corners of the bounding boxes respectively and low_remission indicates if the object has very few pixels visible in the depth image due to its black color or high reflectivity or both.

As an example, such a file could look as follows:


3D Bounding Boxes

The columns in the CSV files are titled as the following:

frame, label, cx, cy, cz, dx, dy, dz, phi_x, phi_y, phi_z,low_remission

where cx, cy, cz denotes the center coordinates of the box with respect to the world coordinate system;

dx, dy, dz are the height width and length of the box in meters.

phi_x, phi_y, phi_z are the rotation angles around each axis of the box and are computed in the order of Z-Y-X. They represent the rotation around each axis if the box is located in cx=cy=cz=0


Activity Annotations

For all sequences with people in the driver’s or passenger’s seat, we provide a csv file describing the activities done throughout the sequence with columns structured as follows:

activity_id, activity_name, person_id, 'driver'OR'passenger, start_frame, end_frame, duration

The activity ids and the actions they represent are described in the following dictionary:

activity_ids = {
    1: 'drive',
    2: 'look_left_while_turning_wheel',
    3: 'look_right_while_turning_wheel',
    4: 'touch_screen',
    5: 'open_glove_compartment',
    6: 'touch_head_or_face',
    7: 'lean_forward',
    8: 'turn_left',
    9: 'turn_right',
    10: 'turn_backwards_while_reversing',
    11: 'adjust_sun_visor',
    12: 'turn_backwards',
    13: 'talk',
    14: 'take_something_from_dashboard',
    15: 'sitting_normally',
    16: 'read_paper_or_book',
    17: 'bending_down',
    18: 'using_phone',
    19: 'using_laptop'

Synthetic Imageset

Images can be downloaded as RGB, grayscale (simulated active infrared camera system) or depth maps. However, sceneries can appear quite dark when RGB images are used. The ground truth data is identical for all three types of images. More information on rendering of this dataset can be found here.

Instance Segmentation Mask2D Bounding Boxes

Directory Structure


File names

The file names of all the different ground truth data and images are constructed in the following way, such that most of the important information is already included in the naming:


The following dictionary describes the content of GT_label_left and GT_label_right:

GT_labels = {
                 0: Empty Seat,
                 1: Infant in an Infant Seat,
                 2: Child in a Child Seat,
                 3: Person,
                 4: Everyday Objects,
                 5: Empty Infant Seat,
                 6: Empty Child Seat

As the rectangular images contain only information about a single seat, the naming is slightly different. However, the imageID still refers to the whole image and we additionally specify which seat is shown (0 for left and 2 for right) :


Semantic and instance segmentation

We provide a position and class based instance segmentation mask, which can be used as ground truth data for semantic and instance segmentation. For both tasks, we have five different classes and we want to separate infants/children from their seats and classify them as persons.

Each pixel is colored according to which object it belongs to, but also according to which seat (left or right) the object is placed on.

The masks need to be converted to integer values, because most cost functions and deep learning frameworks ask for an integer value as ground truth label. A simple approach to get a label between 0 and 4 is to transform the mask to grayscale. For example, one can use PIL and the following function:

from PIL import Image

After that, each grayscale value can be converted to an integer (and position) using the following function:

def get_class_by_gray_pixel_value(gray_pixel_value):
    Returns for a grayscale pixel value the corresponding ground truth label as an integer (0 to 4). 
    Further, the function also outputs on which seat the object is placed (left or right). 

    The relationship between gray_pixel_value and the class integer depends on the grayscale transformation function used. 
    These functions should work fine with PIL and OpenCV.

    0 = background
    1 = infant seat
    2 = child seat
    3 = person
    4 = everyday object

    Keyword arguments:
    gray_pixel_value -- grayscale pixel value between 0 and 225

    class_label, position

    # background
    if gray_pixel_value == 226 or gray_pixel_value == 225:
        return 0, None

    # infant
    if gray_pixel_value == 173 or gray_pixel_value == 172:
        return 1, "left"

    # child
    if gray_pixel_value == 175 or gray_pixel_value == 174:
        return 2, "left"

    # person
    if gray_pixel_value == 29:
        return 3, "left"

    if gray_pixel_value == 132 or gray_pixel_value == 131:
        return 3, "right"

    # everyday objects
    if gray_pixel_value == 105:
        return 4, "left"
    # infant seat
    if gray_pixel_value == 76:
        return 5, "left"

    # child seat
    if gray_pixel_value == 150 or gray_pixel_value == 149:
        return 6, "left"

    return None, None

Notice: depending on which function and library you use to convert the mask to grayscale, different values need to be used in the above function, because different libraries might use different grayscale transformations. The above function should work with PIL and openCV.

Bounding Boxes

For each scenery, we provide a single text file that contains the 2D and 3D bounding boxes of all objects in the scene. We assume that the origin of the images is at the upper left corner.

The class_label is defined according to the following dictionary:

class_label = {
                 1: Infant,
                 2: Child,
                 3: Person,
                 4: Everyday Object,
                 5: Infant Seat,
                 6: Child Seat

2D Bounding Boxes

Each line in the text file corresponds to a different object and each line contains the following information:

[class_label], [x_upper_left_corner], [y_upper_left_corner], [x_lower_right_corner], [y_lower_right_corner]

As an example, such a file could look as follows:


In this scenery, we would have an everyday object (label 4) on the right seat and a driver (label 3) at the left seat.

3D Bounding Boxes

Similarly, each line in the 3d Bounding Boxes files corresponds to an object and is formatted as follows:

[class_label], [cx], [cy], [cz], [dx], [dy], [dz]

where cx, cy, cz denotes the center coordinates of the box with respect to the world coordinate system and dx, dy, dz are the height width and length of the box in meters.

An example from the dataset looks like this:



For each scenery, we provide a single .json file which includes the poses of all people (babies included) in the scene. We save the human poses by using keypoints, as used by the COCO dataset, but our skeleton is defined using partially different joints. The .json file contains the 2D pixel coordinates of the keypoints of all people together with the visibility flag, the bone names and their seat position. It is constructed as follows:

    [name_of_person_1_in_the_scene]: {
        "bones": {
            [name_of_bone_1]: [
            [name_of_bone_1]: [
        "position": [seat_position_of_person_1]
    [name_of_person_2_in_the_scene]: {
        "bones": {
        "position": [seat_position_of_person_2]

Here is a list of bone names used in our benchmark:

["head", "clavicle_r", "clavicle_l", "upperarm_r", "upperarm_l", "lowerarm_r", "lowerarm_l", "hand_r", "hand_l", "thigh_r", "thigh_l", "calf_r", "calf_l", "pelvis", "neck_01", "spine_02", "spine_03"]

The visibility of the keypoints are set to

  • 0, if the keypoint is outside the image,
  • 1, if it is occluded by an object or neighboring human,
  • 2, if it is visible or occluded by the person itself.

For the benchmark evaluation, only the bones with a visibility of 2 are considered. Usually, this means that bones with a visibility of 0 and 1 are set to a visibility of 0 and the bones with a visibility of 2 are set to a visibility of 1.