Vision-Language-Action (VLA) models have shown remarkable progress for mobile manipulation, but their performance on long-horizon tasks remains poor. These tasks are especially challenging because (1) progress toward high-level goals must be maintained across extended sequences of spatially distributed subtasks, and (2) early execution errors compound rapidly over the task horizon. These challenges persist despite finetuning on large human teleoperated mobile manipulation data, indicating that more data alone may not resolve the problem. To address these challenges, we propose MPVI: Motion Planner / VLA Interleaving, a framework that integrates model-based motion planning with VLAs to improve robustness without further training. The proposed integration enables localization and navigation to distant or occluded target objects through cluttered scenes using open-vocabulary object detection, frontier exploration and motion planning. However, such integration is non-trivial, requiring reliable switching between modules; we show one way forward via VLM-based completion checking with proprioceptive triggers. We evaluate our approach on the BEHAVIOR-1K benchmark and demonstrate 113% improvement in task progress over a top end-to-end VLA baseline.
MPVI is a modular framework with five components: a Subtask Planner (LLM) that decomposes the task instruction into an ordered sequence of navigation and manipulation subtasks; an Orchestrator that routes each subtask to the appropriate module and tracks progress; a Navigation module that uses classical motion planning with open-vocabulary object detection and frontier exploration; a Manipulation module (VLA); and a Completion Checker (VLM) that uses proprioceptive triggers to decide when a subtask is complete. The integration requires no additional data or training.
The integration of model-based motion planning with a model-free Vision-Language-Action policy is non-trivial. Model-based motion planning requires object locations to be known, which we address with frontier exploration and object detection. Reliable switching requires accurate subtask completion checking, which we address be using VLMs and proprioceptive signals. The videos below highlight these components.
Task progress improvement over baseline end-to-end VLA. MPVI increases the mean Q-Score across all 50 tasks by 113% with progress gains on 31 tasks (blue), matching the baseline on 16 tasks, and regressions on 3 tasks (red). Hatched bars indicate tasks in which the baseline did not complete any subtasks successfully (Q-score = 0).
We highlight three representative failure modes of the end-to-end VLA baseline that MPVI is designed to address.