"Voice is the interface" was the defining message of the voice community in 2025, and it continues to resonate today. Language is one of the first communication skills humans learn, and it remains the most natural way we interact throughout our lives. Yet truly understanding voice-based conversations requires far more than simply transcribing speech into text.
Humans process many layers of information at once. They read visual cues like emotions, gestures, and direction. They distinguish thinking pauses from end-of-speech. They manage multi-speaker conversations and turn-taking. They filter noise, echo, and reverberation. They also adapt responses based on social context, such as the situation or the age and appearance of others.
At NXP, we design solutions with the understanding that all these invisible social skills, naturally expected of any human, must also be mastered by robots that aim to communicate with humans, especially humanoid robots.
For decades, the AI community has developed foundational voice technologies such as keyword spotting, Speech To Text (STT), and Text To Speech (TTS). Large Language Models (LLMs) and Vision Language Models (VLMs) added powerful reasoning capabilities to intelligent systems. More recently, initiatives such as Audio Language Models and Speech to Speech models have attempted to bridge the gap between voice and reasoning; however, these approaches have so far failed to deliver local, reliable, and low-latency Conversational AI solutions for robotics at the edge.
Robotics is making our world more intelligent, with machines that perceive and act at the edge. Join us at Robotics Summit & Expo, booth #536, to see our latest solutions that bring intelligent robotics to life!
When a conversational system fails under real-world conditions, the usual reaction is to increase the model’s size or compensate with more complex prompting. Yet this only increases inference latency, degrading the user experience, while failing to address the main issue: the quality of the input audio signal.
Multimodal Intelligence That Knows When to Listen
Diagram of NXP's Attention Front End voice solution
NXP’s Attention Front End (AFE) addresses the core challenges of human–robot interaction by combining multimodal sensing with audio signal cleaning. Rather than processing all incoming audio, the system detects when a user intends to engage with the robot and enhances the incoming audio to support reliable, low-latency, on-device conversational experiences. As an added benefit, you no longer have to rely solely on massive cloud models.
Our solution takes advantage of complementary modalities:
- Vision: analyzes the scene, detects and counts individuals, recognizes registered users, estimates proximity, and determines potential speakers facing the robot
- Voice: detects speech activity, identifies registered voice signatures, estimates the direction of arrival, and steers audio capture toward the targeted speaker (in the background, we also characterize the acoustic scene)
Speech To Text processing is triggered only when multiple conditions are satisfied: the user is both visually and acoustically identified, speech is detected, and the interaction occurs within an appropriate range and orientation. This gating mechanism, combined with NXP’s in-house voice and audio algorithms, significantly improves Word Error Rate (WER) across a wide range of conditions, from quiet environments to noisy scenarios (e.g., low signal-to-noise ratio).
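The gating conditions described above can be pictured as a simple predicate over the vision and voice outputs. The following is a minimal sketch for illustration only, not NXP's implementation: the `Observation` fields, the `should_run_stt` function, and the distance and angle thresholds are all hypothetical names and values.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """One fused snapshot of vision and voice outputs (hypothetical fields)."""
    face_recognized: bool     # vision: a registered user is recognized
    voice_recognized: bool    # voice: a registered voice signature matches
    speech_active: bool       # voice: speech activity is detected
    distance_m: float         # estimated distance to the user, in meters
    facing_angle_deg: float   # user orientation relative to the robot, 0 = facing it


def should_run_stt(
    obs: Observation,
    max_distance_m: float = 2.0,   # assumed threshold, chosen for illustration
    max_angle_deg: float = 30.0,   # assumed threshold, chosen for illustration
) -> bool:
    """Return True only when every gating condition holds at the same time."""
    identified = obs.face_recognized and obs.voice_recognized
    in_range = obs.distance_m <= max_distance_m
    oriented = abs(obs.facing_angle_deg) <= max_angle_deg
    return identified and obs.speech_active and in_range and oriented


# Usage: a registered user speaking 1 m away while facing the robot passes the gate.
engaged = Observation(True, True, True, distance_m=1.0, facing_angle_deg=10.0)
if should_run_stt(engaged):
    print("Forward audio to Speech To Text")
```

Requiring all conditions together means background speech, or a recognized user who is talking to someone else while facing away, never reaches the Speech To Text model in this sketch.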
NXP also integrates Ultra-Wideband (UWB) technology to extend spatial awareness beyond voice and vision. With solutions such as Trimension SR250, robots can securely determine the real-time position of a user’s smartphone or other robots, enabling them to understand where their owner or peers are located and respond accordingly. UWB delivers highly accurate ranging—down to a few centimeters—while maintaining low power consumption and stable performance in complex environments. This additional layer of precise, reliable location context enhances navigation, interaction, and proximity-based behaviors across both indoor and outdoor scenarios.
NXP’s Attention Front End integrated into Boston Dynamics® Spot®
From Modular Design to Measurable Performance
Examples showing how Word Error Rate (WER) improves when using our Attention Front End with Speech To Text models (e.g., Whisper), compared to using the models without it.
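For readers who want to reproduce this kind of comparison, WER is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recognized text, divided by the number of words in the reference. The sketch below is a plain-Python illustration of that standard definition; the `word_error_rate` function name and the sample sentences are made up for the example and are not measurements from the Attention Front End.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Usage with invented transcripts: one dropped word out of four reference words.
reference = "turn on the light"
hypothesis = "turn the light"
print(f"WER = {word_error_rate(reference, hypothesis):.2f}")
```

Because the denominator is the reference length, WER can exceed 1.0 when the recognizer inserts many spurious words, which is typical of unfiltered audio in noisy scenes.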
In short, the Attention Front End helps robots listen more like humans do: focusing on the right speaker, ignoring distractions, and understanding even in noisy environments. By combining vision, voice, and proximity sensing, it delivers cleaner audio to speech recognition models, improving responsiveness and accuracy while enabling more natural conversational AI at the edge.
You can evaluate our Attention Front End on the i.MX 95 evaluation kit (EVK). For more information or to discuss the solution in detail, please contact altaf.hussain@nxp.com.