Earlier this week, Google introduced MusicLM, a text-to-music generator that can create 24 kHz musical audio from text descriptions. This is another leap for text-based AI generators. Text-based image generators have dominated the news recently, and Facebook has even announced a text-based video generator. MusicLM, though, takes generative AI into a new frontier: music.
So how does this new AI model work? There are two parts. First, it takes pieces of sound, called audio tokens, and maps them to semantic tokens, which represent the meaning of the sounds as described in the training captions. Next, the program takes user captions and/or input audio and generates what are called acoustic tokens. These are the pieces of sound that make up the resulting output.
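To make the two-part flow above more concrete, here is a minimal toy sketch in Python. It is not Google's implementation, which has not been released. Every name in it (`semantic_stage`, `acoustic_stage`, `generate`, the vocabulary size, and the token counts) is a hypothetical stand-in, and a hash function takes the place of the learned models. It only illustrates the shape of the pipeline: a caption becomes coarse semantic tokens, which are then expanded into finer acoustic tokens.

```python
import hashlib
from typing import List


def _toy_ids(seed: str, count: int, vocab: int) -> List[int]:
    """Derive `count` deterministic integer IDs below `vocab` from a seed string.

    This hash trick is only a placeholder for a trained model.
    """
    ids = []
    for i in range(count):
        digest = hashlib.sha256(f"{seed}:{i}".encode("utf-8")).digest()
        ids.append(int.from_bytes(digest[:4], "big") % vocab)
    return ids


def semantic_stage(caption: str, n_tokens: int = 8, vocab: int = 1024) -> List[int]:
    """Stage 1 (toy): turn a caption into coarse 'semantic' token IDs."""
    return _toy_ids("semantic|" + caption, n_tokens, vocab)


def acoustic_stage(caption: str, semantic_tokens: List[int],
                   per_semantic: int = 4, vocab: int = 1024) -> List[int]:
    """Stage 2 (toy): expand each semantic token into several 'acoustic' token IDs.

    The output depends on both the caption and the semantic tokens.
    """
    acoustic = []
    for pos, tok in enumerate(semantic_tokens):
        acoustic.extend(_toy_ids(f"acoustic|{caption}|{pos}|{tok}", per_semantic, vocab))
    return acoustic


def generate(caption: str) -> List[int]:
    """Run both toy stages and return the acoustic token IDs."""
    semantic = semantic_stage(caption)
    return acoustic_stage(caption, semantic)


if __name__ == "__main__":
    tokens = generate("melodic techno")
    print(len(tokens), "acoustic tokens")
```

In a real system, a separate decoder would turn the acoustic tokens back into a waveform. That step is left out here because the article does not describe it.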
In their abstract, the team claims that the new model outperforms previous systems in both “audio quality and adherence to the text description.” The demonstration page provided by Google has numerous examples of MusicLM in action, ranging from audio created from “rich captions” that describe the music in detail to clips made from far less descriptive prompts. In the “Audio Generation From Rich Captions” section of the example page, the prompts range from slow-tempo bass and drums to an arcade game soundtrack. Each music clip lasts 30 seconds.
There is also what is called “Long Generation” for longer clips. In their example, they use three text prompts: “melodic techno,” “swing,” and “relaxing jazz,” each of which produced a five-minute clip from that simple prompt. On listening, each generated clip is quite close to its prompt. Then there is the story mode. MusicLM takes a sequence of text prompts and turns them into a series of musical passages based on those prompts. The example page describes it this way: “The audio is generated by providing a sequence of text prompts. These influence how the model continues the semantic tokens derived from the previous caption.”
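For a sense of scale, the clip lengths above can be converted into raw sample counts at the stated 24 kHz rate, and a story-mode prompt sequence can be laid out on the same timeline. The sketch below is only illustrative. The helper names `samples_for` and `story_schedule` are hypothetical, the 15-second duration per prompt is an assumption (the article does not say how long each prompt is held), and the prompts are reused from the long-generation examples purely as placeholders.

```python
from typing import List, Tuple

SAMPLE_RATE_HZ = 24_000  # 24 kHz, as stated in the article


def samples_for(seconds: float) -> int:
    """Number of audio samples needed for a clip of the given length."""
    return int(round(seconds * SAMPLE_RATE_HZ))


def story_schedule(prompts: List[str],
                   seconds_per_prompt: float) -> List[Tuple[str, int, int]]:
    """Lay prompts end to end; return (prompt, start_sample, end_sample) for each.

    The fixed duration per prompt is an assumption made for illustration.
    """
    schedule = []
    start = 0
    for prompt in prompts:
        end = start + samples_for(seconds_per_prompt)
        schedule.append((prompt, start, end))
        start = end
    return schedule


if __name__ == "__main__":
    print(samples_for(30))       # a 30-second clip: 720000 samples
    print(samples_for(5 * 60))   # a five-minute clip: 7200000 samples
    for row in story_schedule(["melodic techno", "swing", "relaxing jazz"], 15):
        print(row)
```

A 30-second clip therefore holds 720,000 samples and a five-minute clip holds 7.2 million, which gives a rough idea of how much audio the model must keep coherent in long generation.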
There are three examples in this series as well, and you can hear the music morph and change as it follows each prompt. The example page also has a number of other interesting examples of music generated by MusicLM.
What about possible misuse of the content and copyright issues? As of right now, these concerns are among the primary reasons the tech giant is holding on to the code. The company made clear that it needs to work on risks associated with misuse, copyright, and other issues with training data, saying in part, “We have no plans to release models at this point.”
Generative AI is fast becoming a major player and possible disruptor in multiple industries, with many projects in 2023 being watched closely for how they could affect those industries and society as a whole. As for MusicLM, improvements are already in the pipeline, possibly including lyrics generation. According to the team, “Future work may focus on lyrics generation, along with improvement of text conditioning and vocal quality. Another aspect is the modeling of high-level song structure like introduction, verse, and chorus. Modeling the music at a higher sample rate is an additional goal.”