AED: Adaptable Error Detection for Few-shot Imitation Policy

1 National Taiwan University
2 UC Berkeley 3 National Yang Ming Chiao Tung University
4 MobileDrive
NeurIPS 2024

Abstract

We introduce a new task called Adaptable Error Detection (AED), which aims to identify behavior errors in few-shot imitation (FSI) policies based on visual observations in novel environments. The potential to cause serious damage to surrounding areas limits the application of FSI policies in real-world scenarios. Thus, a robust system is necessary to notify operators when FSI policies are inconsistent with the intent of demonstrations. This task introduces three challenges: (1) detecting behavior errors in novel environments, (2) identifying behavior errors that occur without revealing notable changes, and (3) lacking complete temporal information of the rollout due to the necessity of online detection. However, the existing benchmarks cannot support the development of AED because their tasks do not present all these challenges. To this end, we develop a cross-domain AED benchmark, consisting of 322 base and 153 novel environments. Additionally, we propose Pattern Observer (PrObe) to address these challenges. PrObe is equipped with a powerful pattern extractor and guided by novel learning objectives to parse discernible patterns in the policy feature representations of normal or error states. Through our comprehensive evaluation, PrObe demonstrates superior capability to detect errors arising from a wide range of FSI policies, consistently surpassing strong baselines. Moreover, we conduct detailed ablations and a pilot study on error correction to validate the effectiveness of the proposed architecture design and the practicality of the AED task, respectively.

Adaptable Error Detection (AED): Detecting Erroneous Behaviors in Policies

Few-shot imitation (FSI) policies have achieved notable breakthroughs in recent studies. However, their potential to cause serious damage in real-world scenarios limits broader applications. A robust system is essential to notify operators when FSI policies deviate from the intended behavior illustrated in demonstrations. Thus, we formulate the Adaptable Error Detection (AED) task, which presents three key challenges: (1) monitoring policies in unseen environments, (2) detecting erroneous behaviors that may not cause significant visual changes, and (3) performing online detection to terminate the policy in a timely manner. These challenges render existing error or anomaly detection methods inadequate for addressing the AED task.

AED overview

In this work, we design a practical AED framework tailored to the nature of few-shot imitation (FSI) policies. The framework comprises three stages: (1) FSI policies are trained in base environments on successful agent rollouts and expert demonstrations; (2) an error detector is trained, also in the base environments, to distinguish states from successful and failed agent rollouts by referencing expert demonstrations, and it may additionally access the policy's internal knowledge if necessary; and (3) the policy is deployed in novel (unseen) environments, where the error detector monitors the policy's behavior and raises an alert when erroneous actions occur.
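The three stages above can be sketched as a minimal control loop. All function names and decision rules below are illustrative placeholders, not the authors' actual API; the stubs only show how the stages hand off to one another.

```python
# Minimal sketch of the three-stage AED framework. Every name and rule
# here is a toy stand-in, not the paper's implementation.

def train_policy(base_envs, demonstrations):
    """Stage 1: train an FSI policy on successful rollouts in base envs."""
    return {"encoder": lambda obs: [float(x) for x in obs]}  # stub policy

def train_error_detector(policy, base_envs, demonstrations):
    """Stage 2: train a detector on successful vs. failed rollouts; it may
    also read the policy's internal features."""
    def detector(obs):
        feat = policy["encoder"](obs)   # access the policy's knowledge
        return sum(feat) > 1.0          # stub decision rule
    return detector

def deploy(policy, detector, novel_env, horizon=10):
    """Stage 3: roll out in a novel env; stop when an error is flagged."""
    for t in range(horizon):
        obs = novel_env(t)
        if detector(obs):
            return ("error", t)         # alert the operator, terminate
    return ("success", horizon)

policy = train_policy(base_envs=None, demonstrations=None)
detector = train_error_detector(policy, None, None)
# A toy novel environment whose observations grow until the detector fires.
status, step = deploy(policy, detector, novel_env=lambda t: [0.3 * t])
```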

AED pipeline

Our AED Benchmark

To fairly evaluate the effectiveness of various methods, we developed the AED benchmark, which includes FSI tasks spanning six indoor scenes and one industrial scene. Each task comprises multiple base and novel environments, with a total of 322 base environments and 153 novel environments. Additionally, our tasks incorporate the FSI challenges introduced in our previous work [1], SCAN, such as multi-stage tasks, length-variant demonstrations, and experts with different appearances.
AED benchmark

Our benchmark presents the following key features:
  • Generalization: We applied domain randomization to create dozens of base and novel environments for each task, laying the groundwork for future sim2real experiments. Each environment includes the target object as well as distractors, simulating the complexity of real-world scenes.
  • Challenging Environments: In addition to the FSI challenges mentioned earlier, we also simulate gravity and friction throughout each rollout, rather than rigidly attaching the grasped object to the gripper. As a result, uneven movements can cause even grasped objects to drop.
  • Realism: Our environments support multi-source lighting, soft shadows, and complex object textures, making them more realistic. Additionally, all objects are assigned reasonable weights, and the simulation of gravity and friction further enhances the realism of our tasks.

Our Method: Pattern Observer (PrObe)

We introduce Pattern Observer (PrObe), a novel framework designed to address the challenges of the AED task. Our key insight lies in detecting erroneous behaviors by analyzing the feature embeddings extracted by the policy's encoder, rather than relying on raw visual signals. PrObe consists of three key components: a pattern extractor, a flow generator, and an embedding fusion module for consistency comparison. The pattern extractor assigns importance scores to individual cells within the feature embeddings, defining these as "patterns." These patterns are then passed into a recurrent model to capture their temporal evolution, referred to as the pattern flow. Finally, the pattern flow is concatenated with the task embedding to perform a consistency comparison, which serves as the basis for identifying erroneous behaviors.
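The data flow through PrObe's three components can be illustrated with a toy sketch on plain Python lists. The gating rule, the recurrence, and the scoring function below are simplified stand-ins chosen for clarity, not the paper's exact architecture.

```python
import math

# Toy sketch of PrObe's pipeline: pattern extraction, pattern flow,
# and fusion for consistency comparison. All rules are simplified
# stand-ins, not the paper's architecture.

def pattern_extractor(embedding):
    """Score each cell of the policy embedding; a sigmoid gate times the
    cell value serves as the 'pattern'."""
    return [1.0 / (1.0 + math.exp(-x)) * x for x in embedding]

def pattern_flow(patterns_seq, decay=0.5):
    """Accumulate patterns over time (a stand-in for a recurrent model)."""
    state = [0.0] * len(patterns_seq[0])
    for pat in patterns_seq:
        state = [decay * s + (1 - decay) * p for s, p in zip(state, pat)]
    return state

def consistency_score(flow, task_embedding):
    """Fuse the pattern flow with the task embedding; here a plain dot
    product, where higher means more consistent with the demonstrations."""
    return sum(f * t for f, t in zip(flow, task_embedding))

rollout = [[0.2, -0.1], [0.4, 0.0], [0.8, 0.3]]  # policy embeddings over time
flow = pattern_flow([pattern_extractor(e) for e in rollout])
score = consistency_score(flow, task_embedding=[1.0, 1.0])
```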
our method

Training Objectives: PrObe is guided by three objectives during training: one supervised and two unsupervised. The aims of these objectives are outlined below.
  • Classification loss L cls : a binary cross-entropy (BCE) loss guides PrObe in distinguishing between embeddings that correspond to normal and erroneous behaviors.
  • Sparsity loss L pat : an L1 loss encourages the pattern extractor to learn sparse pattern embeddings in continuous space.
  • Temporal-aware contrastive loss L tem : a modified triplet loss whose margin dynamically adjusts based on the temporal distances between the anchor, positive, and negative samples.
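On scalar toy inputs, the three training signals can be sketched as follows. The BCE and L1 terms are standard; the dynamic-margin rule in the contrastive term is our assumption of one plausible way the margin could scale with temporal distance.

```python
import math

# Hedged sketch of PrObe's three training objectives on toy inputs.
# The dynamic-margin formula below is an illustrative assumption.

def bce_loss(prob, label):
    """Classification loss L_cls: binary cross-entropy on one prediction."""
    eps = 1e-7
    prob = min(max(prob, eps), 1 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

def sparsity_loss(pattern):
    """Sparsity loss L_pat: L1 penalty on the pattern embedding."""
    return sum(abs(p) for p in pattern)

def temporal_triplet_loss(anchor, pos, neg,
                          t_anchor, t_pos, t_neg, base_margin=0.2):
    """Temporal-aware contrastive loss L_tem: a triplet loss whose margin
    grows when the negative is temporally farther from the anchor than
    the positive is (assumed scaling rule)."""
    margin = base_margin * (1 + abs(t_anchor - t_neg) - abs(t_anchor - t_pos))
    d_pos = sum((a - p) ** 2 for a, p in zip(anchor, pos))
    d_neg = sum((a - n) ** 2 for a, n in zip(anchor, neg))
    return max(0.0, d_pos - d_neg + margin)
```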

Timing Accuracy Experiment

To assess whether the error detectors identify behavioral errors at the correct moments, we conducted a timing accuracy experiment. In successful rollouts, the predicted probability of an error should remain consistently low throughout the entire sequence. In failed rollouts, the error detectors should raise the predicted probability immediately after the error occurs (indicated by the black vertical line).
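The criterion above can be made concrete with a small helper that measures how many steps after the true error onset the detector first fires. The fixed-threshold, first-crossing rule is our simplification for illustration, not the paper's exact evaluation protocol.

```python
# Toy illustration of the timing-accuracy criterion: on failed rollouts
# the detector should fire promptly after the true error onset, and on
# successful rollouts it should never fire. The threshold rule is our
# simplification.

def detection_delay(probs, error_onset, threshold=0.5):
    """Return the number of steps between the true error onset and the
    first alarm, or None if the detector never fires."""
    for t, p in enumerate(probs):
        if p >= threshold:
            return t - error_onset
    return None

failed_rollout = [0.1, 0.1, 0.2, 0.7, 0.9]   # true error occurs at t = 3
success_rollout = [0.1, 0.2, 0.1, 0.2, 0.1]  # should never cross threshold
```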
ta_experiment

Embedding Visualization

We visualized the embeddings extracted by various error detectors to validate the hypothesis that PrObe can capture implicit patterns from the policy’s feature embeddings. The embeddings produced by PrObe reveal distinct characteristics related to task progress, as well as homogeneous clustering of states. Specifically, states with similar properties (e.g., the beginnings of rollouts or the same types of failures) are positioned closer together in the embedding space.
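A projection like the one used for such visualizations can be sketched in plain Python with power-iteration PCA (the paper's figures may well use a different tool, such as t-SNE; this is only an illustrative stand-in).

```python
# Illustrative 2-D projection of detector embeddings via power-iteration
# PCA in plain Python; a stand-in for whatever projection the paper uses.

def pca_2d(points, iters=100):
    """Project points onto their top-2 principal components."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]

    def top_component(data):
        v = [1.0] * d
        for _ in range(iters):
            # apply the covariance implicitly: v <- X^T (X v), then normalize
            xv = [sum(row[j] * v[j] for j in range(d)) for row in data]
            v = [sum(data[i][j] * xv[i] for i in range(len(data)))
                 for j in range(d)]
            norm = sum(x * x for x in v) ** 0.5 or 1.0
            v = [x / norm for x in v]
        return v

    v1 = top_component(centered)
    # deflate: remove the v1 component, then find the second direction
    deflated = [[row[j] - sum(r * c for r, c in zip(row, v1)) * v1[j]
                 for j in range(d)] for row in centered]
    v2 = top_component(deflated)
    return [(sum(r * c for r, c in zip(row, v1)),
             sum(r * c for r, c in zip(row, v2))) for row in centered]

coords = pca_2d([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
```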
embedding visualization

Note: we also conducted experiments examining component contributions, performance stability, demonstration quality, viewpoint changes, and error correction. For details, please refer to our paper.

BibTeX

@inproceedings{yeh2024aed,
  title={AED: Adaptable Error Detection for Few-shot Imitation Policy},
  author={Jia-Fong Yeh and Kuo-Han Hung and Pang-Chi Lo and Chi-Ming Chung and Tsung-Han Wu and Hung-Ting Su and Yi-Ting Chen and Winston H. Hsu},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS)},
  year={2024}
}