Meet Moshi – The Revolutionary Open-Source AI Assistant That Can Listen and Talk

Moshi thinks while it talks.

Kyutai, a French non-profit AI research laboratory, has unveiled Moshi, a real-time native multimodal foundation model that rivals OpenAI's GPT-4o and Google's Project Astra. Developed by a team of just eight researchers in six months, Moshi can understand and express 70 emotions and speaking styles, speak with various accents, and handle two audio streams simultaneously, so it can listen while it talks.

The open-source project is built on Kyutai's Helium 7B language model and trained jointly on text and audio. The release is optimized for CUDA, Metal, and CPU backends, with support for 4-bit and 8-bit quantization, which lets Moshi respond with an end-to-end latency of around 200 milliseconds and run on consumer-grade hardware. Watermarking to detect Moshi-generated audio is still in progress.

With its innovative approach, Moshi has the potential to revolutionize human-machine communication, and its open-source nature challenges major AI companies like OpenAI, which have faced criticism for delaying releases over safety concerns.
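The 4-bit and 8-bit quantization mentioned above is a large part of what makes consumer-grade hardware viable: model weights are stored as small integers plus a scale factor instead of 16- or 32-bit floats. The sketch below is a generic illustration of symmetric uniform quantization in Python, not Moshi's actual scheme, which the announcement does not detail; all names in it are illustrative.

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int) -> tuple[np.ndarray, float]:
    """Quantize float weights to signed integers with one shared scale."""
    qmax = 2 ** (bits - 1) - 1             # 127 for 8-bit, 7 for 4-bit
    scale = float(np.abs(w).max()) / qmax  # map the largest weight onto qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    # Note: real 4-bit schemes pack two values per byte; int8 storage here
    # is only for simplicity.
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights at inference time."""
    return q.astype(np.float32) * scale

# Stand-in for a weight tensor; Moshi's real tensors are far larger.
rng = np.random.default_rng(0)
w = rng.standard_normal(1 << 20).astype(np.float32)

for bits in (8, 4):
    q, scale = quantize_symmetric(w, bits)
    err = np.abs(w - dequantize(q, scale)).mean()
    print(f"{bits}-bit: mean abs reconstruction error = {err:.4f}")
```

Dropping from 16-bit floats to 8-bit integers halves memory with little quality loss; 4 bits halve it again at the cost of a noticeably larger reconstruction error, which is why practical 4-bit schemes typically use per-group scales rather than the single per-tensor scale shown here.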