Device-directed speech detection (DDSD) is the binary classification task of distinguishing between queries directed at a voice assistant and side conversations or background speech. State-of-the-art DDSD systems use verbal cues (e.g., acoustic, text and/or automatic speech recognition (ASR) features) to classify speech as device-directed or otherwise, and often have to contend with one or more of these modalities being unavailable when deployed in real-world settings. In this paper, we investigate fusion schemes for DDSD systems that can be made more robust to missing modalities. Concurrently, we study the use of non-verbal cues, specifically prosody features, in addition to verbal cues for DDSD. We present different approaches to combine scores and embeddings from prosody with the corresponding verbal cues, finding that prosody improves DDSD performance by up to 8.5% in terms of false acceptance rate (FA) at a given fixed operating point via non-linear intermediate fusion, while our use of modality dropout techniques improves the performance of these models by 7.4% in terms of FA when evaluated with missing modalities during inference time.
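To make the two techniques named in the abstract concrete, below is a minimal sketch, not the authors' implementation, of non-linear intermediate fusion with modality dropout in PyTorch. The module name, embedding dimensions, dropout probability, and the choice of zeroing (rather than, say, substituting a learned placeholder embedding) are all illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code): non-linear intermediate
# fusion of acoustic, text, and prosody embeddings with modality dropout.
import torch
import torch.nn as nn

class FusionDDSD(nn.Module):
    def __init__(self, dim_acoustic=128, dim_text=128, dim_prosody=64,
                 p_modality_dropout=0.3):
        super().__init__()
        self.p = p_modality_dropout  # assumed dropout probability
        # Non-linear intermediate fusion: concatenate per-modality
        # embeddings, then classify with a small MLP head.
        self.classifier = nn.Sequential(
            nn.Linear(dim_acoustic + dim_text + dim_prosody, 128),
            nn.ReLU(),
            nn.Linear(128, 1),  # logit: device-directed vs. not
        )

    def forward(self, acoustic, text, prosody):
        embs = [acoustic, text, prosody]
        if self.training:
            # Modality dropout: with probability p, zero out an entire
            # modality's embedding so the fused model learns to remain
            # accurate when that modality is missing at inference time.
            for i, e in enumerate(embs):
                if torch.rand(()) < self.p:
                    embs[i] = torch.zeros_like(e)
        return self.classifier(torch.cat(embs, dim=-1))

# Usage: a batch of 4 utterances with per-modality embeddings.
model = FusionDDSD()
logits = model(torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 64))
```

At inference, a missing modality would be represented the same way it was simulated during training (here, a zero vector), which is what lets the model degrade gracefully rather than fail when a modality is unavailable.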