between one another. The problem is that each of these models tends to do only one or two tasks really well (translating, converting text to speech, speech to text, or between either pair), so you end up stacking a bunch of models on top of one another to approximate that kind of generalized performance. That's a computationally intensive process, so Meta developed a single model that can do it all.
In their blog post, Meta's research team notes that SeamlessM4T "significantly improve[s] performance for the low and mid-resource languages we support" while maintaining "strong performance on high-resource languages, such as English, Spanish, and German." Meta built SeamlessM4T from its existing PyTorch-based multitask UnitY model architecture, which already natively performs the various modal translations as well as automatic speech recognition.