Magenta RealTime

Today, we’re happy to share a research preview of Magenta RealTime (Magenta RT), an open-weights live music model that allows you to interactively create, control and perform music in the moment.

GitHub Code · Colab Demo · Model Card · 📝 [Paper coming soon]

Magenta RT is the latest in a series of models and applications developed as part of the Magenta Project. It is the open-weights cousin of Lyria RealTime, the real-time generative music model developed by Google DeepMind that powers MusicFX DJ and the real-time music API in Google AI Studio. Real-time music generation models open up unique opportunities for live music exploration and performance, and we’re excited to see what new tools, experiences, and art you create with them.

As an open-weights model, Magenta RT is targeted towards eventually running locally on consumer hardware (it currently runs on free-tier Colab TPUs). It is an 800-million-parameter autoregressive transformer model trained on ~190k hours of stock music from multiple sources, mostly instrumental. The model code is available on GitHub, and the weights are available on Google Cloud Storage and Hugging Face under permissive licenses with some additional bespoke terms. To see how to run inference with the model and try it yourself, check out our Colab Demo. Options for local inference and personal fine-tuning will follow soon.
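For example, you could fetch the checkpoint programmatically with the huggingface_hub library. This is a minimal sketch; the repository id below is an assumption, so check the model card for the canonical location of the weights.

# Minimal sketch: download the Magenta RT weights from Hugging Face.
# The repo_id is an assumption; consult the model card for the real location.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="google/magenta-realtime",   # hypothetical repository id
    local_dir="./magenta_rt_checkpoint",
)
print(f"Checkpoint downloaded to {local_path}")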

How it Works

Live generative music is particularly difficult because it requires real-time generation (i.e. a real-time factor greater than 1: generating X seconds of audio in less than X seconds), causal streaming (i.e. online generation), and low-latency controllability.

Magenta RT overcomes these challenges by adapting the MusicLM architecture to perform block autoregression. The model generates a continuous stream of music in sequential chunks, each conditioned on the previous audio output (10s of coarse audio tokens) and a style embedding to produce the next audio chunk (2s of fine audio tokens). By manipulating the style embedding (a weighted average of text and/or audio prompt embeddings), players can shape and morph the music in real time, mixing together different styles, instruments, and musical attributes.
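To make the block-autoregressive loop concrete, here is a minimal sketch in Python using NumPy stand-ins for the tokenizer, the transformer, and the prompt encoder. The token rate, embedding size, and function names are illustrative assumptions, not the actual Magenta RT interfaces.

import numpy as np

CONTEXT_SECONDS = 10    # audio history the model conditions on
CHUNK_SECONDS = 2       # audio produced per generation step
TOKENS_PER_SECOND = 25  # assumed coarse-token rate
STYLE_DIM = 512         # assumed style-embedding size

rng = np.random.default_rng(0)

def embed_prompt(prompt: str) -> np.ndarray:
    """Stand-in for the joint music+text embedding model."""
    return rng.standard_normal(STYLE_DIM)

def generate_chunk(context_tokens: np.ndarray, style: np.ndarray) -> np.ndarray:
    """Stand-in for the transformer: emits the next 2s of audio tokens."""
    return rng.integers(0, 1024, size=CHUNK_SECONDS * TOKENS_PER_SECOND)

# Blend prompts into a single style embedding via a weighted average.
prompts = {"warm analog synths": 0.7, "breakbeat drums": 0.3}
style = sum(w * embed_prompt(p) for p, w in prompts.items())

context = np.zeros(0, dtype=np.int64)  # rolling window of coarse tokens
for step in range(8):                  # stream 8 chunks (~16 s of audio)
    chunk = generate_chunk(context, style)
    # Append the new chunk, then keep only the most recent 10 s of context.
    context = np.concatenate([context, chunk])[-CONTEXT_SECONDS * TOKENS_PER_SECOND:]
    # In a real player, `chunk` would be decoded to audio and streamed out here.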

Control latency is set by the chunk size, which has a maximum of two seconds but can be reduced to increase reactivity. On a Colab free-tier TPU (v2-8), these two seconds of audio are generated in 1.25 seconds, giving a real-time factor of 1.6.
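As a quick sanity check of that figure:

# Real-time factor = audio seconds produced / wall-clock seconds spent generating.
chunk_seconds = 2.0
generation_seconds = 1.25
rtf = chunk_seconds / generation_seconds
print(rtf)  # 1.6, greater than 1, so generation keeps pace with playback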

Compared to the original MusicLM, we’ve upgraded our audio representation to SpectroStream, a successor to SoundStream (Zeghidour+ 21) that supports high-fidelity (48kHz stereo) audio. We also trained a new joint music+text embedding model called MusicCoCa, influenced by both MuLan (Huang+ 22) and CoCa (Yu+ 22). Additional details are provided in the model card, and deeper technical descriptions will be available in an upcoming paper.
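The key property of a joint music+text embedding model is that text and audio prompts land in the same vector space, so either kind can be compared or blended into a single style embedding. The toy sketch below illustrates the idea with random stand-in encoders; it is not the actual MusicCoCa interface.

import numpy as np

rng = np.random.default_rng(1)
EMBED_DIM = 512

def embed_text(text: str) -> np.ndarray:
    """Stand-in text tower; returns a unit vector."""
    v = rng.standard_normal(EMBED_DIM)
    return v / np.linalg.norm(v)

def embed_audio(waveform: np.ndarray) -> np.ndarray:
    """Stand-in audio tower; returns a unit vector."""
    v = rng.standard_normal(EMBED_DIM)
    return v / np.linalg.norm(v)

text_emb = embed_text("minimal techno with warm pads")
audio_emb = embed_audio(np.zeros((48_000 * 10, 2)))  # 10 s of 48 kHz stereo silence

similarity = float(text_emb @ audio_emb)   # cosine similarity in the shared space
style = 0.5 * text_emb + 0.5 * audio_emb   # a 50/50 text+audio style prompt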

Latent Space Exploration… In Real Time

Magenta’s earlier work on latent music models for MIDI clips (MusicVAE, GrooVAE) and instrumental timbre (NSynth) offered a wide range of possible interfaces.

With Magenta RT, it is now possible to traverse the space of multi-instrumental audio: explore the never-before-heard music between genres, unusual instrument combinations, or your own audio samples.

The ability to adjust prompt mixtures in real time allows you to efficiently explore the sonic landscape and find novel textures and loops to use as part of a larger piece of music.

Real-time interactivity also makes this latent exploration a type of musical performance in its own right: interpolating through the space while anchoring on the audio context produces a structure similar to a DJ set or an improvisation session. Beyond performance, it can also be used to provide interactive soundscapes for physical spaces like art installations or virtual spaces like video games.
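As a rough illustration of what such a performance might look like in code, the sketch below linearly crossfades the style embedding from one prompt to another over successive two-second chunks. The embedding and generation functions are the same stand-ins as in the loop above, not the real Magenta RT API.

import numpy as np

rng = np.random.default_rng(2)
STYLE_DIM = 512

def embed_prompt(prompt: str) -> np.ndarray:
    """Stand-in prompt encoder."""
    return rng.standard_normal(STYLE_DIM)

start = embed_prompt("ambient piano")
target = embed_prompt("drum and bass")

num_chunks = 16  # ~32 s of audio at 2 s per chunk
for i in range(num_chunks):
    alpha = i / (num_chunks - 1)
    # Pass `style` into the block-autoregressive loop for each chunk; the shared
    # audio context keeps the transition musically coherent, like a DJ blend.
    style = (1.0 - alpha) * start + alpha * target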

This opens up a world of possibilities to build new tools and interfaces, and below you can see three example applications built on the Lyria RealTime API in AI Studio. Over time, Magenta RT will open up similar opportunities for on-device applications.

PromptDJ · PromptDJ MIDI · PromptDJ Pad

Why Magenta RealTime?

Enhancing human creativity (not replacing it) has always been at the core of Magenta’s mission. AI, however, can be a double-edged sword for creative agency. It offers new opportunities for accessibility and expression, but it can also encourage more passive creation and consumption than traditional methods. With this in mind, we have always strived to build tools that help close the skill gap to make creation more accessible, while also valuing existing musical practices and encouraging people to dig deeper in their own creative journeys. In this regard, real-time interactive music models offer several important advantages that have motivated our research over the years (Piano Genie, DDSP, NSynth, AI Duet, and more).

Live interaction demands more from the player but can offer more in return. The continuous perception-action loop between the human and the model provides access to a creative flow state, centering the experience on the joy of the process over the final product. The higher bandwidth channel of communication and control often results in outputs that are more unique and personal, as every action the player takes (or doesn’t) has an effect.

Finally, live models naturally avoid creating a deluge of passive content, because they intrinsically balance listening with generation in a 1:1 ratio. They create a unique moment in time, shared by the player, the model, and listeners.

While Lyria RealTime provides developers and users around the globe with access to state-of-the-art live music generation, the Magenta Project remains committed to providing more direct access to code and models so that researchers, artists, and creative coders can build upon and adapt them to achieve their creative goals.

Known Limitations

Coverage of broad musical styles. Magenta RT’s training data primarily consists of Western instrumental music. As a consequence, Magenta RT has incomplete coverage of both vocal performance and the broader landscape of rich musical traditions worldwide. For real-time generation with broader style coverage, we refer users to our Lyria RealTime API.

Vocals. While the model is capable of generating non-lexical vocalizations and humming, it is not conditioned on lyrics and is unlikely to generate actual words. However, there remains some risk of generating explicit or culturally insensitive lyrical content.

Latency. Because the Magenta RT LLM operates on two-second chunks, user inputs for the style prompt may take two or more seconds to influence the musical output.

Limited context. Because the Magenta RT encoder has a maximum audio context window of ten seconds, the model is unable to directly reference music that it output more than ten seconds earlier. While this context is sufficient for the model to create melodies, rhythms, and chord progressions, it is not capable of automatically creating longer-term song structures.

Future Work

Magenta RT and Lyria RealTime are pushing the boundaries of live generative music, and we are happy that Magenta RT marks a return of open releases from Magenta.

In the weeks following this research preview, look for upcoming features such as personal fine-tuning and on-device inference on more accessible consumer hardware.

We are currently working on the next generation of real-time models with higher quality, lower latency, and more interactivity, to create truly playable instruments and live accompaniment.

How to cite

A technical report is forthcoming. For now, please cite this blog post if you use or extend this work:

@article{magenta_rt,
    title={Magenta RealTime},
    url={g.co/magenta/rt},
    publisher={Google DeepMind},
    author={Lyria Team},
    year={2025}
}