Whisper
Whisper is an open-source speech-to-text model provided by OpenAI. There are five model sizes available, in both English-only and multilingual variants, to choose from depending on the complexity of the application and the desired accuracy-efficiency tradeoff. Whisper is an end-to-end speech-to-text framework that uses an encoder-decoder transformer architecture operating on input audio split into 30-second chunks and converted into a log-Mel spectrogram. The network is trained on multiple speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection.
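As a minimal sketch, transcription with the `openai-whisper` package takes only a few lines (the audio file name here is hypothetical; the helper simply illustrates the 30-second windowing described above):

```python
import math

CHUNK_SECONDS = 30  # Whisper splits/pads input audio into 30-second windows


def num_chunks(duration_s: float) -> int:
    """How many 30-second windows are needed to cover a clip of this length."""
    return max(1, math.ceil(duration_s / CHUNK_SECONDS))


if __name__ == "__main__":
    # Requires `pip install -U openai-whisper`; weights download on first use.
    import whisper

    model = whisper.load_model("base")
    result = model.transcribe("question.wav")  # hypothetical local audio file
    print(result["text"])
```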
For this project, two walkie-talkie buttons are available to the user: one that sends their regular English-language queries to the bot through the lighter, faster "base" model, and a second that deploys the larger "medium" multilingual model, which can distinguish between dozens of languages and accurately transcribe correctly pronounced statements. In the context of language learning, this pushes the user to focus closely on their pronunciation, accelerating the learning process. A chart of the available Whisper models is shown below:
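The two-button dispatch can be sketched roughly as follows (the function name and audio file are hypothetical; "base" and "medium" are Whisper's published checkpoint names):

```python
def pick_checkpoint(multilingual: bool) -> str:
    """Map the pressed walkie-talkie button to a Whisper checkpoint:
    the fast 'base' model for English queries, or the larger
    multilingual 'medium' model for pronunciation practice."""
    return "medium" if multilingual else "base"


if __name__ == "__main__":
    import whisper  # pip install -U openai-whisper

    # Cache both models once so each button press reuses a loaded model.
    models = {name: whisper.load_model(name) for name in ("base", "medium")}
    checkpoint = pick_checkpoint(multilingual=True)
    print(models[checkpoint].transcribe("clip.wav")["text"])
```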
Ollama
A variety of highly useful open-source language model interfaces exist, each catering to different use cases with varying levels of setup and usage complexity. Among the most widely known are the oobabooga text-generation web UI, with arguably the most flexibility and under-the-hood control; llama.cpp, which originally focused on optimized deployment of quantized models on smaller CPU-only devices but has since expanded to serving other hardware types; and the streamlined interface chosen for this project (built on top of llama.cpp): Ollama.
Ollama focuses on simplicity and efficiency, running in the background and capable of serving multiple models concurrently on small hardware, quickly moving models in and out of memory as needed to serve requests. Rather than targeting lower-level tooling like fine-tuning, Ollama excels at easy installation, an efficient runtime, a great spread of ready-to-use models, and tools for importing pretrained model weights. This focus on efficiency and simplicity makes Ollama the natural choice of LLM interface for a project like LingoNaut: the user doesn't need to remember to close their session to free up resources, since Ollama automatically manages this in the background when the app is not in use. Further, the ready access to performant, quantized models in its library is perfect for frictionless development of LLM applications like LingoNaut.
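Talking to a running Ollama server from Python is a single HTTP request against its default local endpoint; the sketch below assumes port 11434 (Ollama's default) and that a model such as `llama2` has already been pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint


def build_payload(model: str, prompt: str) -> dict:
    """Request body for Ollama's /api/generate endpoint (non-streaming)."""
    return {"model": model, "prompt": prompt, "stream": False}


def ask(model: str, prompt: str) -> str:
    """Send a prompt to the local Ollama server and return its response text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    print(ask("llama2", "Translate 'good morning' into Spanish."))
```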
While Ollama is not technically built for Windows, it is easy for Windows users to install it on Windows Subsystem for Linux (WSL) and then communicate with the server from their Windows applications. With WSL installed, open a Linux terminal and enter the one-line Ollama install command. Once the installation finishes, simply run "ollama serve" in the Linux terminal, and you can then communicate with your Ollama server from any Python script on your Windows machine.
Coqui.ai 🐸 TTS
TTS is a fully loaded text-to-speech library available for non-commercial use, with paid commercial licenses also on offer. The library has enjoyed notable popularity, with 3k forks and 26.6k stars on GitHub at the time of this writing, and it's clear why: it works like the Ollama of the text-to-speech domain, providing a unified interface for accessing a diverse array of performant models covering a variety of use cases (for example, providing a multi-speaker, multilingual model for this project), exciting features such as voice cloning, and controls over the speed and emotional tone of generated speech.
The TTS library provides an extensive selection of text-to-speech models, including the illustrious Fairseq models from Facebook research's Massively Multilingual Speech (MMS) project. For LingoNaut, the Coqui.ai team's own XTTS model turned out to be the right choice, as it generates high-quality speech in multiple languages seamlessly. Although the model does have a "language" input parameter, I found that even leaving this set to "en" for English and simply passing text in other languages still results in faithful multilingual generation with mostly correct pronunciations.
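Using XTTS through the TTS API looks roughly like the sketch below; the speaker reference clip and output path are hypothetical local files (XTTS conditions its voice on a short sample), and the small helper just captures the "default to 'en'" behavior noted above:

```python
def normalize_language(code: str) -> str:
    """Lowercase and strip a language code, defaulting to 'en' when empty
    (per the observation that XTTS handles other languages even with 'en')."""
    code = code.strip().lower()
    return code or "en"


if __name__ == "__main__":
    # Requires `pip install TTS` (Coqui); model weights download on first use.
    from TTS.api import TTS

    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    tts.tts_to_file(
        text="Buenos días, ¿cómo estás?",
        speaker_wav="reference_voice.wav",  # hypothetical voice sample clip
        language=normalize_language("en"),  # 'en' still yields Spanish audio here
        file_path="reply.wav",
    )
```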