The Stanford Egocentric Thermal and RGBD Dataset provides egocentric RGB-D-Thermal (RGB-D-T) videos of humans performing daily real-world activities. The locations of the hands, and of the objects the hands interact with, are annotated. In this project, we used this dataset as input to a framework for joint 6-DOF camera localization, 3D reconstruction, and semantic segmentation. All of our materials are open-sourced, including our code and our designs for the demonstration hardware platform used to acquire this dataset.




To collect our dataset, we designed a multi-modal data acquisition system combining an RGB-D camera (an Intel RealSense SR300) with a mobile thermal camera (a Flir One for Android). We used this setup to collect a large dataset of aligned multi-modal videos and annotated semantically relevant information in them. We mounted both cameras on a GoPro chest harness and connected them through a single USB 3.0 cable to a lightweight laptop kept in the data collector's backpack. Because the Flir One was originally designed for Android mobile phones, we developed a GNU/Linux driver for it. We also time-synchronized the cameras to the frame rate of the slower of the two (the Flir One), resulting in a data acquisition rate of approximately 8.33 FPS. Our code for setting up this system is linked in the section below.
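Synchronizing to the slower camera amounts to pairing each thermal frame with the RGB-D frame whose timestamp is closest. The sketch below illustrates this nearest-timestamp matching; the function name, timestamps, and tolerance value are illustrative assumptions, not taken from the released driver code.

```python
def synchronize(thermal_ts, rgbd_ts, tolerance=0.06):
    """Pair each thermal timestamp with the closest RGB-D timestamp.

    Both lists are assumed sorted in seconds. Thermal frames with no
    RGB-D frame within `tolerance` seconds are dropped, so the paired
    stream runs at the thermal camera's rate (~8.33 FPS).
    """
    pairs = []
    j = 0
    for t in thermal_ts:
        # Advance while the next RGB-D frame is at least as close to t.
        while j + 1 < len(rgbd_ts) and \
                abs(rgbd_ts[j + 1] - t) <= abs(rgbd_ts[j] - t):
            j += 1
        if abs(rgbd_ts[j] - t) <= tolerance:
            pairs.append((t, rgbd_ts[j]))
    return pairs

# Thermal at ~8.33 FPS, RGB-D at 30 FPS: each thermal frame keeps only
# its nearest RGB-D neighbor; the rest of the RGB-D frames are unused.
pairs = synchronize([0.0, 0.12, 0.24], [i / 30 for i in range(10)])
```

Matching against the slower stream (rather than the faster one) keeps the output evenly spaced in time, at the cost of discarding most RGB-D frames.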

Our dataset includes approximately 250 videos of people performing various activities. These activities fall into four high-level categories (kitchen, office, recreation, and household), with 44 different types of action sequences distributed across them. Fourteen different people collected data in more than 20 different environments. All interactions in our dataset are very natural: we gave the data collectors no specific instructions other than asking them to wear the camera while performing the high-level activities. Since we did not tell the data collectors which actions to perform, the dataset reflects the natural distribution of activities that they do routinely.

Number of Videos: 248
Number of Activity Types: 44

To see more examples or download the dataset, click here.

Scene Semantic Reconstruction from Egocentric RGB-D-Thermal Videos
Rachel Luo, Ozan Sener, Silvio Savarese
In 3D Vision (3DV), 2017
[Paper] [Supplement] [Github]

Rachel Luo