Part V: Vision-Language-Action

Part Overview

Vision-Language-Action (VLA) systems combine visual perception, language understanding, and robot action generation to enable natural human-robot interaction.

Chapters

  • Ch 16: VLA concepts, LLM + robotics convergence
  • Ch 17: Whisper speech recognition, voice-to-action
  • Ch 18: LLM planning, task decomposition, ROS mapping
  • Ch 19: Multimodal interaction, gesture + voice + vision
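The voice-to-action flow these chapters build toward can be sketched as a three-stage pipeline. The functions below are hypothetical stand-ins: a real system would call Whisper for transcription, an LLM for task decomposition, and a ROS action client for execution.

```python
# Minimal VLA pipeline sketch. transcribe(), plan(), and execute()
# are placeholder stubs, not real Whisper / LLM / ROS APIs.

def transcribe(audio: bytes) -> str:
    # Stand-in for speech recognition (e.g., Whisper): audio -> text.
    return "pick up the red cube and place it on the table"

def plan(command: str) -> list[str]:
    # Stand-in for LLM task decomposition: one command -> ordered subtasks.
    if "pick up" in command and "place" in command:
        return ["locate_object", "grasp_object", "move_to_target", "release_object"]
    return ["unknown_command"]

def execute(steps: list[str]) -> list[str]:
    # Stand-in for mapping subtasks onto ROS action goals.
    return [f"ros_action:{step}" for step in steps]

if __name__ == "__main__":
    command = transcribe(b"")   # stage 1: voice -> text
    steps = plan(command)       # stage 2: text -> subtask plan
    print(execute(steps))       # stage 3: plan -> robot actions
```

Each stage is swappable, which is the architectural point the later chapters develop: the planner does not need to know which speech model produced the text, and the action layer does not need to know which model produced the plan.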

Learning Outcomes

  1. Understand VLA architecture patterns
  2. Integrate Whisper for voice recognition
  3. Use LLMs for task planning
  4. Build multimodal interaction systems

Estimated time: 28-36 hours

Next: Chapter 16: VLA Introduction