Although Large Language Models (LLMs) have shown promise for human-like conversations, they are primarily pre-trained on text data. Incorporating audio or video improves performance, but collecting large-scale multimodal data and pre-training multimodal LLMs is challenging. To this end, we propose a Fusion Low Rank Adaptation (FLoRA) technique that efficiently adapts a pre-trained unimodal LLM to consume new, previously unseen modalities via low rank adaptation. For device-directed speech detection, using FLoRA, the multimodal LLM achieves a 22% relative reduction in equal error rate (EER) over the text-only approach and attains performance parity with its full fine-tuning (FFT) counterpart while needing to tune only a fraction of its parameters. Furthermore, with the newly introduced adapter dropout, FLoRA is robust to missing data, improving over FFT with 20% lower EER and 56% lower false accept rate. The proposed approach scales well for model sizes from 16M to 3B parameters.
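To make the idea concrete, below is a minimal sketch (not the authors' implementation) of a low-rank fusion adapter with adapter dropout: a frozen LLM's hidden states are augmented by a trainable low-rank projection of features from a new modality (e.g., audio), and during training the whole adapter is occasionally dropped so the model remains usable when that modality is missing. The class name, dimensions, and the exact placement of the dropout are illustrative assumptions.

```python
# Hypothetical sketch of a FLoRA-style fusion adapter with adapter dropout.
# Only the two low-rank matrices are trained; the base LLM stays frozen.
from typing import Optional

import torch
import torch.nn as nn


class FusionLoRAAdapter(nn.Module):
    def __init__(self, modality_dim: int, hidden_dim: int, rank: int = 8,
                 adapter_dropout: float = 0.1):
        super().__init__()
        # Low-rank pair: modality_dim -> rank -> hidden_dim.
        self.down = nn.Linear(modality_dim, rank, bias=False)
        self.up = nn.Linear(rank, hidden_dim, bias=False)
        nn.init.zeros_(self.up.weight)  # start as a no-op, as in standard LoRA
        self.adapter_dropout = adapter_dropout

    def forward(self, text_hidden: torch.Tensor,
                modality_feats: Optional[torch.Tensor]) -> torch.Tensor:
        # If the modality is absent at inference time, fall back to text only.
        if modality_feats is None:
            return text_hidden
        # Adapter dropout (assumed placement): during training, zero out the
        # entire adapter contribution with some probability, simulating a
        # missing modality so the model learns to cope without it.
        if self.training and torch.rand(()).item() < self.adapter_dropout:
            return text_hidden
        # Fuse the low-rank modality projection into the LLM hidden states.
        return text_hidden + self.up(self.down(modality_feats))


# Usage: fuse 80-dim audio features into a 768-dim hidden space.
adapter = FusionLoRAAdapter(modality_dim=80, hidden_dim=768, rank=8)
text_h = torch.randn(4, 16, 768)   # (batch, seq, hidden) from the frozen LLM
audio_f = torch.randn(4, 16, 80)   # time-aligned audio features
out = adapter(text_h, audio_f)     # only the adapter parameters are trainable
```

Because only the rank-`r` matrices are updated, the number of tuned parameters is a small fraction of the full model, which is what lets this style of adaptation match full fine-tuning at much lower cost.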