A Systematic Study of Data Modalities and Strategies for Co-training Large Behavior Models for Robot Manipulation

Authors

Fanqi Lin1,2, Kushal Arora1, Jean Mercat1, Haruki Nishimura1, Paarth Shah1, Chen Xu1, Mengchao Zhang1, Mark Zolotas1, Owen Pfannenstiehl1, Maya Angeles1, Andrew Beaulieu1, Jose Barreiros1

1Toyota Research Institute, 2Tsinghua University

Co-training, in which a policy is trained jointly on target robot data and heterogeneous data modalities, has emerged as a promising way to scale Large Behavior Models (LBMs) beyond the limits of expensive and narrowly distributed robot datasets. By leveraging these additional data sources, co-training aims to expand data coverage and improve generalization without requiring large amounts of additional target robot data. Yet, despite its growing adoption, the effectiveness of different co-training data sources and training strategies remains poorly understood.

We present a large-scale empirical study of co-training for LBMs using a vision-language-action (VLA) architecture. Our study leverages 4,000 hours of robot and human manipulation data and 50M vision-language samples, and evaluates 89 policies across 58,000 simulation rollouts and 2,835 real-world trials.

Autonomous evaluation rollouts from three fine-tuned, co-trained LBMs performing long-horizon and dexterous tasks: (left) pack items into a string bag, (middle) pour ingredients into the soup, and (right) store clean dishes.
(Videos play at 1x speed.)

Overview

Our robot policy uses a pretrained vision-language model (VLM) backbone along with an Action Flow Transformer and is trained on target robot data together with multiple co-training modalities, including standard vision-language data, dense language annotations for robot data, cross-embodiment robot data, human videos, and discrete robot action tokens. Policies are extensively evaluated in simulation on seen and unseen tasks under both nominal and distribution shift conditions, as well as in real-world experiments for language following and unseen long-horizon dexterous manipulation.
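As a rough illustration of this architecture, the sketch below pairs a stand-in VLM backbone with a flow-matching action head in PyTorch. The module names, dimensions, token layout, and the linear-interpolation flow-matching objective are our own assumptions for exposition, not the exact design used in the study.

# Minimal, illustrative PyTorch sketch of a VLA policy: a stand-in VLM backbone
# producing conditioning tokens, plus a flow-matching action head. Module
# names, sizes, and objective details are assumptions, not the study's design.
import torch
import torch.nn as nn

class ActionFlowTransformer(nn.Module):
    """Predicts flow-matching velocities for a chunk of continuous actions."""
    def __init__(self, token_dim=512, action_dim=14, chunk_len=16, n_layers=4):
        super().__init__()
        self.chunk_len = chunk_len
        self.action_in = nn.Linear(action_dim, token_dim)
        self.time_in = nn.Linear(1, token_dim)
        layer = nn.TransformerEncoderLayer(token_dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.action_out = nn.Linear(token_dim, action_dim)

    def forward(self, vlm_tokens, noisy_actions, t):
        time_tok = self.time_in(t[:, None, None])                # (B, 1, D)
        act_tok = self.action_in(noisy_actions)                  # (B, chunk_len, D)
        x = self.blocks(torch.cat([vlm_tokens, time_tok, act_tok], dim=1))
        return self.action_out(x[:, -self.chunk_len:])           # per-step velocity

class VLAPolicy(nn.Module):
    def __init__(self, vlm_backbone, token_dim=512):
        super().__init__()
        # Pretrained VLM backbone; assumed to map (images, text) -> (B, N, token_dim).
        self.vlm = vlm_backbone
        self.head = ActionFlowTransformer(token_dim=token_dim)

    def loss(self, images, instructions, actions):
        vlm_tokens = self.vlm(images, instructions)
        noise = torch.randn_like(actions)
        t = torch.rand(actions.shape[0], device=actions.device)
        # Linear-interpolation flow matching: x_t moves from noise (t=0) to actions (t=1).
        x_t = (1 - t)[:, None, None] * noise + t[:, None, None] * actions
        target_velocity = actions - noise
        return ((self.head(vlm_tokens, x_t, t) - target_velocity) ** 2).mean()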

[Figure: Overview of the policy architecture, co-training data modalities, and evaluation setup]

Data Modalities and Training Strategies

We study five co-training data modalities: standard vision-language data for commonsense, spatial reasoning, and object grounding; dense language annotations for robot trajectories, generated via heuristic scripting and VLM-based captioning to provide explicit semantic supervision; cross-embodiment robot data spanning diverse robot morphologies and environments; large-scale egocentric human videos, leveraged either through latent action extraction or VLM-generated language annotations; and discrete robot action tokens, which compress continuous actions into discrete representations to probe abstraction and generalization.
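To make the mixture concrete, here is a minimal sketch of how such heterogeneous modalities might be combined through weighted sampling. The modality names are taken from the list above, but the weights and the sampler itself are hypothetical, not the study's configuration.

# Hypothetical co-training mixture: each modality gets a sampling weight that
# controls how often a training batch is drawn from it. Weights are illustrative.
import random

COTRAIN_MIXTURE = {
    "target_robot_actions": 0.50,
    "standard_vision_language": 0.20,
    "dense_language_annotations_robot": 0.10,
    "cross_embodiment_robot": 0.10,
    "human_videos_language_annotations": 0.05,
    "discrete_robot_action_tokens": 0.05,
}

def sample_modality(mixture, rng=random):
    """Choose which modality the next training batch comes from."""
    names, weights = zip(*mixture.items())
    return rng.choices(names, weights=weights, k=1)[0]

# Example: which modalities the next eight batches would be drawn from.
print([sample_modality(COTRAIN_MIXTURE) for _ in range(8)])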

We evaluate these modalities under three co-training strategies: single-phase co-training, which jointly learns from target robot data and co-training data; two-phase 1st-phase-only co-training, which pretrains on co-training data before specializing on target robot actions; and two-phase full co-training, which pretrains on co-training data and then jointly trains on target robot continuous action data and co-training data.
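The three strategies can be summarized as simple training schedules. The sketch below is only schematic: the train_on helper, the step counts, and the data groupings are placeholders of ours; only the phase structure mirrors the description above.

# Schematic view of the three co-training strategies. train_on() and the step
# counts are placeholders; only the phase structure follows the text above.
ROBOT = "target robot data (continuous actions)"
COTRAIN = "co-training data (VL data, annotations, cross-embodiment, ...)"

def train_on(model, data_sources, steps):
    """Stand-in for a training loop over a mixture of data sources."""
    print(f"train {steps} steps on: {', '.join(data_sources)}")
    return model

def single_phase(model):
    # Jointly learn from target robot data and co-training data throughout.
    return train_on(model, [ROBOT, COTRAIN], steps=100_000)

def two_phase_first_phase_only(model):
    # Phase 1: co-training data only; Phase 2: specialize on target robot actions.
    model = train_on(model, [COTRAIN], steps=80_000)
    return train_on(model, [ROBOT], steps=20_000)

def two_phase_full(model):
    # Phase 1: co-training data; Phase 2: robot actions together with co-training data.
    model = train_on(model, [COTRAIN], steps=80_000)
    return train_on(model, [ROBOT, COTRAIN], steps=20_000)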

Impact of Different Co-training Data Modalities and Strategies

(1) Co-training with diverse vision-language data and cross-embodiment robot data substantially enhances the model's generalization to distribution shifts and unseen tasks, as well as its language-following capabilities. Notably, owing to their information richness, standard vision-language data and language annotations for human videos are beneficial in both 1st-phase-only and 2nd-phase co-training, whereas language annotations for robot trajectories and cross-embodiment data are primarily effective during the 1st phase of two-phase co-training.

(2) Among the effective co-training data modalities, standard vision-language data, VLM-based language annotations for robot data, and language annotations for human videos are the most beneficial. All three consist of diverse vision-language data, suggesting that strengthening the VLM backbone's vision-language understanding translates into better robot policies.

(3) Co-training with discrete action tokens (latent actions extracted from videos, FAST tokens, and action tokens learned with a VQ-VAE; see the tokenization sketch after these findings) yields no statistically significant performance improvements in our experiments. Specifically, co-training with FAST tokens decreases generalization, while latent actions from videos help only in the low target-robot-data regime, with benefits that diminish as the proportion of robot data increases.

(4) Across all co-training modalities examined, we observe no statistically significant impact on in-distribution performance.
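For readers unfamiliar with discrete action tokens, the sketch below shows the simplest possible version: per-dimension uniform binning of normalized continuous actions. The FAST and VQ-VAE tokenizers studied above are considerably more sophisticated; this stand-in only illustrates the idea of mapping continuous action chunks to a discrete vocabulary.

# Simplified stand-in for discrete action tokenization: per-dimension uniform
# binning of continuous actions in a known range. Not the FAST or VQ-VAE
# tokenizers named above; purely illustrative.
import numpy as np

def tokenize_actions(actions, low, high, n_bins=256):
    """Map continuous actions in [low, high] to integer tokens in [0, n_bins - 1]."""
    norm = (np.asarray(actions) - low) / (high - low)          # -> [0, 1]
    return np.clip((norm * n_bins).astype(int), 0, n_bins - 1)

def detokenize_actions(tokens, low, high, n_bins=256):
    """Map tokens back to bin-center continuous actions."""
    return low + (tokens + 0.5) / n_bins * (high - low)

# Example: a 2-step chunk of 3-DoF actions.
chunk = np.array([[0.1, -0.4, 0.9], [0.0, 0.2, -0.8]])
tokens = tokenize_actions(chunk, low=-1.0, high=1.0)
print(tokens, detokenize_actions(tokens, low=-1.0, high=1.0))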

In the plots below, we show results for the best co-training strategy for each effective data modality. The violins visualize posterior uncertainty; dots and horizontal lines indicate empirical and posterior means, respectively.
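As an illustration of how such a posterior over success rates can be obtained from binary rollout outcomes, the sketch below uses a standard Beta-Binomial model; this is our own assumption about the style of analysis, and the success counts in the example are hypothetical.

# Illustrative Beta-Binomial posterior over a policy's success rate given
# binary rollout outcomes. The study's exact statistical model is an assumption.
import numpy as np

def success_rate_posterior(successes, trials, prior=(1.0, 1.0), n_draws=10_000, seed=0):
    """Sample from Beta(alpha + s, beta + f) for s successes and f failures."""
    alpha, beta = prior
    rng = np.random.default_rng(seed)
    return rng.beta(alpha + successes, beta + trials - successes, size=n_draws)

# Example: 132 successes in 200 rollouts (hypothetical numbers).
draws = success_rate_posterior(successes=132, trials=200)
print(f"empirical mean: {132 / 200:.3f}")
print(f"posterior mean: {draws.mean():.3f}")
print(f"95% credible interval: [{np.quantile(draws, 0.025):.3f}, {np.quantile(draws, 0.975):.3f}]")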

[Figure: Useful co-training data modalities]

Combining Effective Co-training Data Modalities

Additively combining the effective co-training modalities yields cumulative performance improvements. Our Final Model leverages all of the co-training data that consistently improve performance (standard vision-language data, dense language annotations for robot and human data, and cross-embodiment robot data), using the best-performing co-training strategies identified in our study. Benefiting from this curated co-training data and carefully designed training strategy, the Final Model achieves strong performance across our experiments, with substantial improvements over the model trained solely on target robot data (the no-co-training baseline).

[Figure: Combining effective co-training modalities]

Language Following

Co-training substantially improves the model’s ability to interpret and execute natural language instructions. Compared to the no-co-training baseline, our Final Model more reliably grounds language in visual perception and action, successfully handling seen objects, paraphrased instructions, and unseen objects. These improvements are validated through extensive real-world evaluations, with qualitative behaviors illustrated in the videos below. The Final Model reliably follows instructions, while the baseline frequently fails due to brittle language-action alignment.

[Videos: real-world language-following rollouts for selectable experiments, layouts, and prompts, comparing the No-co-training Baseline with the Final Model.]

Performance on Unseen Tasks

Co-training significantly improves generalization to tasks not seen during training, as evidenced by our simulation benchmark. The qualitative rollouts shown below illustrate the Final Model’s improved robustness and generalization.

[Videos: simulation rollouts on unseen tasks for selectable tasks and layouts, comparing the No-co-training Baseline with the Final Model.]

Co-training Enhances the Quality of Learned Representations

Co-training improves the quality of learned representations, enabling the model to rapidly adapt to challenging, unseen long-horizon dexterous tasks. When fine-tuned with a small number of task-specific demonstrations (n=200), the co-trained Final Model consistently outperforms the no-co-training baseline, exhibiting better precision, stability, and task completion across multi-step manipulation sequences.

[Videos: rollouts on unseen long-horizon dexterous tasks for selectable tasks, comparing the fine-tuned No-co-training Baseline with the fine-tuned Final Model.]

Open-ended Language Following

We qualitatively demonstrate that co-training enables more flexible, open-ended language following in interactive settings. In this scenario, a human provides step-by-step, on-the-fly instructions toward a high-level goal—such as making a sandwich or cleaning up a table—without a predefined task script. The co-trained model interprets these incremental instructions, grounds them in the evolving scene, and executes appropriate actions in sequence, illustrating its ability to support interactive human-robot collaboration. For comparison, we also show the corresponding no-co-training baseline, which struggles to reliably follow open-ended instructions.

[Videos: open-ended language-following rollouts for selectable tasks and policies.]

VLM Backbone Benchmarking

Beyond downstream robot performance, we analyze how co-training reshapes the VLM backbone. We benchmark the VLMs extracted from our trained policies on a suite of standard vision-language benchmarks spanning semantic understanding, spatial reasoning, and long-horizon reasoning.

We find that training exclusively on robot data can erode the vision-language capabilities of the VLM backbone, whereas effective co-training helps preserve them, as reflected by improved performance on standard vision-language benchmarks.
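A minimal sketch of this evaluation protocol is shown below. The checkpoint key prefix, the benchmark format, and the vlm.generate interface are all hypothetical; they only illustrate extracting the backbone weights from a policy checkpoint and scoring the resulting VLM on a VQA-style benchmark.

# Hypothetical sketch of benchmarking the VLM backbone extracted from a trained
# policy. Checkpoint layout, benchmark format, and scoring are assumptions.
def extract_vlm_state_dict(policy_state_dict, prefix="vlm."):
    """Keep only the VLM backbone weights from a full policy checkpoint."""
    return {k[len(prefix):]: v for k, v in policy_state_dict.items() if k.startswith(prefix)}

def evaluate_vqa(vlm, benchmark):
    """Exact-match accuracy on a list of (image, question, answer) examples."""
    correct = 0
    for image, question, answer in benchmark:
        prediction = vlm.generate(image, question)      # assumed VLM interface
        correct += int(prediction.strip().lower() == answer.strip().lower())
    return correct / max(len(benchmark), 1)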

[Figure: VLM backbone benchmarking results]

BibTeX Citation

@article{cotraininglbm2025,
  title={A Systematic Study of Data Modalities and Strategies for Co-training Large Behavior Models for Robot Manipulation},
  author={Fanqi Lin and Kushal Arora and Jean Mercat and Haruki Nishimura and Paarth Shah and Chen Xu and Mengchao Zhang and Mark Zolotas and Owen Pfannenstiehl and Maya Angeles and Andrew Beaulieu and Jose Barreiros},
  year={2026},
  eprint={2602.01067},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2602.01067},
}