Part V: Vision-Language-Action

Part Overview

Vision-Language-Action (VLA) systems combine visual perception, language understanding, and robot action generation to enable natural human-robot interaction.

Chapters

  • Ch 16: VLA concepts, LLM + robotics convergence
  • Ch 17: Whisper speech recognition, voice-to-action
  • Ch 18: LLM planning, task decomposition, ROS mapping
  • Ch 19: Multimodal interaction, gesture + voice + vision
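The voice-to-action flow these chapters build toward can be sketched as a three-stage pipeline. The functions below are hypothetical stand-ins: a real system would call Whisper for transcription, an LLM for task decomposition, and a ROS action client for execution.

```python
# Minimal VLA pipeline sketch. transcribe(), plan(), and execute()
# are placeholder stubs, not real Whisper / LLM / ROS APIs.

def transcribe(audio: bytes) -> str:
    # Stand-in for speech recognition (e.g., Whisper): audio -> text.
    return "pick up the red cube and place it on the table"

def plan(command: str) -> list[str]:
    # Stand-in for LLM task decomposition: one command -> ordered subtasks.
    if "pick up" in command and "place" in command:
        return ["locate_object", "grasp_object", "move_to_target", "release_object"]
    return ["unknown_command"]

def execute(steps: list[str]) -> list[str]:
    # Stand-in for mapping subtasks onto ROS action goals.
    return [f"ros_action:{step}" for step in steps]

if __name__ == "__main__":
    command = transcribe(b"")   # stage 1: voice -> text
    steps = plan(command)       # stage 2: text -> subtask plan
    print(execute(steps))       # stage 3: plan -> robot actions
```

Each stage is swappable, which is the architectural point the later chapters develop: the planner does not need to know which speech model produced the text, and the action layer does not need to know which model produced the plan.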

Learning Outcomes

  1. Understand VLA architecture patterns
  2. Integrate Whisper for voice recognition
  3. Use LLMs for task planning
  4. Build multimodal interaction systems

Estimated time: 28-36 hours

Next: Chapter 16: VLA Introduction