We propose an architecture with input-dependent dynamic depth for processing streaming audio, built on a vision-inspired keyword spotting framework. The architecture extends a Conformer encoder with trainable binary gates that let the network skip individual modules depending on the input audio. On continuous speech, our approach improves keyword detection and localization accuracy for the 1000 most frequent words in the LibriSpeech dataset. It also maintains a small memory footprint and reduces the average amount of processing without degrading overall performance. These benefits are more pronounced on the Google Speech Commands dataset with background noise, where up to 97% of processing is skipped for non-speech inputs. This makes our method particularly appealing for always-on keyword spotting.
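
To make the gating idea concrete, the following is a minimal sketch of input-dependent module skipping at inference time. It is an illustration under stated assumptions, not the implementation evaluated here: the gate is assumed to be a thresholded linear function of the time-averaged input, each module is treated as a residual branch, and all names (`binary_gate`, `gated_residual`, `run_encoder`) and the toy weights are hypothetical. How the gates are trained is not shown.

```python
import numpy as np

def binary_gate(x, w, b):
    """Hard 0/1 decision from a pooled summary of the input (assumed gate form).
    x: (time, dim) features; w: (dim,) gate weights; b: scalar bias."""
    logit = float(x.mean(axis=0) @ w + b)
    return 1 if logit > 0.0 else 0

def gated_residual(x, module, w, b):
    """Run `module` as a residual branch only when its gate is open."""
    if binary_gate(x, w, b) == 0:
        return x, 0  # gate closed: the module is skipped entirely
    return x + module(x), 1

def run_encoder(x, modules, gates):
    """Pass x through all gated modules; also return the fraction executed."""
    executed = 0
    for module, (w, b) in zip(modules, gates):
        x, ran = gated_residual(x, module, w, b)
        executed += ran
    return x, executed / len(modules)

# Toy setup (hypothetical): three stand-in modules and hand-picked gate weights.
dim = 4
modules = [np.tanh, np.tanh, np.tanh]
gates = [(np.ones(dim), -0.5)] * 3

silence = np.zeros((10, dim))  # stand-in for a non-speech input
speech = np.ones((10, dim))    # stand-in for an input with energy

_, frac_silence = run_encoder(silence, modules, gates)
_, frac_speech = run_encoder(speech, modules, gates)
print(frac_silence, frac_speech)
```

With these toy weights, the all-zero input closes every gate and leaves the features unchanged, while the all-ones input opens every gate. The fraction of executed modules is the quantity that an average-processing measurement would track.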