20224701 Jisu Han, 20224550 Doohyun Lee

The strong generalization capabilities of large neural network models depend on their ability to absorb vast quantities of training data. Large-scale human video datasets are easily accessible and are widely used in robotics. To make proper use of these datasets, a model must generalize across various tasks and environments.

To be specific, there must be a task embedding space (see Figure 1) in which tasks can be accurately identified across different environments, i.e., the representation generalizes over environments. Furthermore, within a single task, this space should support mapping a trajectory from a starting point to a goal point.

![Figure 1. Overview of our approach.](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/4312c9d4-bce5-4ce3-8067-6c3ec8307b5e/Untitled.png)

Figure 1. Overview of our approach. (Left) The entire task embedding space. Each point represents a frame-level feature vector, and the space is built on two key insights:

  1. (Top right) Different tasks are shown in different colors in the task embedding space. Frames that correspond to the same task are clustered together regardless of their environment; we pursue disentanglement of task information from environment information.
  2. (Bottom right) Within the task embedding space, a task is represented as a trajectory: starting from the start point and passing through each key embedding point, we eventually reach the goal embedding point. This structure applies universally as long as the task remains the same, even if the environment differs.

In contrast to most existing approaches [1, 6], which concentrate on mimicking demonstrations frame by frame or action by action, our approach seeks to extract and emphasize task information by eliminating environment information from each video. With such a representation, robots can identify the elements essential for task execution and ultimately achieve higher performance in less time.

We hypothesize that low performance arises from the entanglement of environment and task information within the demonstrations. Our approach is therefore to devise a representation that effectively disentangles environment information from task information while still supporting successful execution of the manipulation task. Through this approach, we also aim to learn a generalized reward function that can be used as a tool during reinforcement learning.


Overview

Self-supervised pre-training from human demo videos

Our goal is to obtain an encoder, trained on human demonstration videos, that focuses on task information by disentangling it from environment information. This encoder will later serve as a tool for inferring the reward function.

We intend this encoder to capture the two insights described in Figure 1, which dictates the design of the encoder model. For the first insight, task identification by disentangling environment information, we use a contrastive loss that indicates whether two randomly chosen videos belong to the same task. The second insight is addressed by a loss that shapes the embedding according to task progress.

Figure 2. We train an encoder on human demo videos. Training is performed in a self-supervised manner.


Encoder details

Each video in the human demonstration dataset carries a label indicating the corresponding task. The model randomly selects two videos and samples a sequence of frames within each video (i.e., temporal random crop). Spatial augmentation techniques such as color jitter and horizontal flipping, following the approach of [3], are then applied to the frames. Figure 3 shows the encoder model architecture.
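Below is a minimal sketch of this pre-processing step in PyTorch/torchvision; the clip length (`num_frames = 32`) and the jitter strengths are illustrative assumptions, not values taken from the text.

```python
import random

import torch
from torchvision import transforms

# Illustrative augmentation pipeline; exact magnitudes are our assumption,
# following the color jitter / horizontal flip recipe described above ([3]).
spatial_aug = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    transforms.RandomHorizontalFlip(p=0.5),
])

def temporal_random_crop(video: torch.Tensor, num_frames: int = 32) -> torch.Tensor:
    """Sample a contiguous clip of num_frames frames from a (T, C, H, W) video."""
    start = random.randint(0, max(video.shape[0] - num_frames, 0))
    return video[start:start + num_frames]

def preprocess(video: torch.Tensor, num_frames: int = 32) -> torch.Tensor:
    """video: (T, C, H, W) float tensor in [0, 1]; returns an augmented clip."""
    clip = temporal_random_crop(video, num_frames)
    # Applying the transform to the whole clip keeps the augmentation
    # parameters consistent across frames of the same video.
    return spatial_aug(clip)
```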

Figure 3. Encoder model architecture. The frames first go through data pre-processing, the encoder maps them to the task embedding space, and the model is then trained with the contrastive losses.


Once we obtain a sequence of frames for each video through data pre-processing, the frames are mapped to the task embedding space via the encoder $\phi$. A task-level contrastive loss and a sequence contrastive loss (SCL) [2] are employed for comparing tasks and obtaining frame-wise representations, respectively. SCL optimizes the embedding space by minimizing the KL-divergence between the sequence similarity of the two videos and a prior Gaussian distribution. The overall loss function is defined as follows:

$$ L = L_{\text{task-cont}} + \lambda L_{\text{sequence-cont}} $$

The task contrastive loss is a binary cross-entropy loss that classifies whether two videos belong to the same task. The hyper-parameter $\lambda$ balances the two losses.
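The sketch below shows how the combined objective could be computed, assuming frame-level embeddings of shape (B, T, D) for the two sampled clips. How the task-level logit is formed (here, cosine similarity of mean-pooled clip embeddings) is our assumption, and the SCL term from [2] is passed in as a callable rather than reimplemented.

```python
import torch
import torch.nn.functional as F

def task_contrastive_loss(z1: torch.Tensor, z2: torch.Tensor,
                          same_task: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy on whether two clips come from the same task.

    z1, z2: (B, T, D) frame-level embeddings of the two sampled clips.
    same_task: (B,) float tensor, 1.0 if the clips share a task label, else 0.0.
    """
    v1 = F.normalize(z1.mean(dim=1), dim=-1)   # clip-level embedding
    v2 = F.normalize(z2.mean(dim=1), dim=-1)
    logit = (v1 * v2).sum(dim=-1)              # cosine similarity as the logit
    return F.binary_cross_entropy_with_logits(logit, same_task)

def total_loss(z1, z2, same_task, sequence_contrastive_loss, lam: float = 0.05):
    """L = L_task-cont + lambda * L_sequence-cont, with SCL [2] supplied as a callable."""
    return task_contrastive_loss(z1, z2, same_task) + lam * sequence_contrastive_loss(z1, z2)
```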

(Details) We adopt a ResNet-50 pre-trained with BYOL [5] as the frame-wise spatial encoder. We use a 3-layer Transformer encoder with a hidden size of 256 and 8 heads to model temporal context. We train the model with the Adam optimizer, a learning rate of $10^{-4}$, and a weight decay of $10^{-5}$. $\lambda$ is set to $0.05$.
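A sketch of the encoder $\phi$ under these settings is given below. Loading the BYOL [5] weights into the ResNet-50 and projecting its 2048-d pooled features down to the 256-d Transformer width are details we fill in as assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class TaskEmbeddingEncoder(nn.Module):
    """Frame-wise ResNet-50 features followed by a 3-layer Transformer encoder
    (hidden size 256, 8 heads), mapping a clip to frame-level task embeddings."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        backbone = resnet50()            # load BYOL-pretrained weights here in practice
        backbone.fc = nn.Identity()      # keep the 2048-d pooled feature
        self.backbone = backbone
        self.proj = nn.Linear(2048, embed_dim)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=3)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, C, H, W) -> frame-level embeddings of shape (B, T, embed_dim)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1))   # (B*T, 2048)
        feats = self.proj(feats).view(b, t, -1)       # (B, T, D)
        return self.temporal(feats)

# Optimizer settings from the text: Adam, lr 1e-4, weight decay 1e-5.
model = TaskEmbeddingEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
```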

Reinforcement learning with learned reward function

Using the pre-trained, frozen encoder $\phi$, we define a Markov Decision Process (MDP) for any agent as the tuple $\langle S, A, P, r \rangle$, where $S$ is the set of possible states, $A$ is the set of possible actions, $P$ is the state transition probability encoding the dynamics of the environment (including the agent), and $r$ is the learned reward function (described below).

Figure 4. The reward is computed from the L2 distance between the current state and the goal state in the task embedding space.


Given a goal image and the current state image, each is mapped into the task embedding space by the encoder. The reward is then defined using the L2 distance between the two embeddings:

$$ r_t = - {\| \phi(s_t) - \phi(g) \|} ^2_2 $$

Here, $s_t$ denotes the current state and $g$ the goal state. Since the initial state embedding $\phi(s_0)$ is far from the goal embedding $\phi(g)$, the reward is low at first; as the agent takes actions and gets closer to the goal, the reward increases. Figure 4 summarizes this.
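A minimal sketch of this reward computation with the frozen encoder is shown below. Treating a single image as a length-1 clip so the temporal encoder can be reused is our assumption about how individual frames are embedded at RL time.

```python
import torch

@torch.no_grad()
def learned_reward(phi, s_t: torch.Tensor, g: torch.Tensor) -> float:
    """r_t = -||phi(s_t) - phi(g)||_2^2 with the frozen encoder phi.

    s_t, g: single images of shape (C, H, W), embedded as length-1 clips.
    """
    z_s = phi(s_t.unsqueeze(0).unsqueeze(0))[:, 0]   # (1, D)
    z_g = phi(g.unsqueeze(0).unsqueeze(0))[:, 0]     # (1, D)
    return -torch.sum((z_s - z_g) ** 2).item()
```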

The robot agent receives rewards from this learned reward function, which makes reinforcement learning possible. The reward is dense, providing feedback to the agent at every time step.


Experiment

We would like to verify whether the self-supervised pre-training works by visualizing the task embedding space. To do so, we need to examine whether the representation learned by the encoder has the following properties: 1) Can the representation effectively discriminate between different tasks? 2) Can the representation accurately trace the progression of a task across time steps?
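A simple way to carry out this check, sketched below, is to project the frame-level embeddings to 2D and color them once by task label (property 1) and once by normalized time step (property 2). The use of t-SNE is our choice of projection; the text only states that the space will be visualized.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def visualize_embeddings(embeddings: np.ndarray, task_ids: np.ndarray,
                         progress: np.ndarray) -> None:
    """embeddings: (N, D) frame embeddings; task_ids: (N,) ints; progress: (N,) in [0, 1]."""
    pts = TSNE(n_components=2, init="pca", random_state=0).fit_transform(embeddings)
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    axes[0].scatter(pts[:, 0], pts[:, 1], c=task_ids, cmap="tab10", s=5)
    axes[0].set_title("Colored by task (property 1)")
    axes[1].scatter(pts[:, 0], pts[:, 1], c=progress, cmap="viridis", s=5)
    axes[1].set_title("Colored by task progress (property 2)")
    plt.show()
```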