Adversarial perturbations that prevent diffusion models from learning your music style
Protect your music from being learned and imitated by diffusion models. Training-time adversarial perturbations systematically misalign CLAP embeddings, breaking the condition-sample association that latent diffusion models rely on.
A visual explanation of how adversarial perturbations make music "unlearnable" by diffusion models
Problem: Music diffusion models (e.g., MusicLDM) can learn and imitate musical styles from training data, raising concerns about copyright protection and creative ownership.
Goal: Develop a training-time defense mechanism that prevents diffusion models from effectively learning specific musical styles without degrading perceptual quality.
Core Idea: Inject imperceptible adversarial perturbations δ into music samples before training, causing CLAP (Contrastive Language-Audio Pretraining) embeddings to shift systematically. This breaks the alignment between conditions (embeddings) and samples, preventing the LDM from learning the true style association.
MusicLDM relies on CLAP embeddings as conditional inputs to guide the diffusion process. During training, the model learns associations between embeddings e and samples x.
By perturbing samples to x' = x + δ, we cause CLAP to produce shifted embeddings e' = CLAP(x'). The model then learns the wrong association: (x', e') instead of (x, e).
This systematic misalignment means that even if an attacker tries to generate music using the original embedding e, the model cannot reproduce the true style because it was trained on misaligned pairs.
Key Insight: CLAP acts as a "bridge" between text/audio conditions and the diffusion model. By perturbing this bridge, we break the learning pathway without affecting human perception.
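The bridge mechanism can be sketched with a toy stand-in for the CLAP audio encoder (a fixed random linear projection; the real encoder is a deep network, and `clap_embed` here is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the CLAP audio encoder: a fixed linear projection into a
# 512-dim embedding space. The real CLAP encoder is a deep network; this
# only illustrates the mechanism.
W = rng.standard_normal((512, 16000)) / np.sqrt(16000)

def clap_embed(x):
    """Hypothetical stand-in for CLAP(x): project and L2-normalize."""
    e = W @ x
    return e / np.linalg.norm(e)

x = rng.standard_normal(16000)             # one second of 16 kHz audio (toy)
delta = 0.05 * rng.standard_normal(16000)  # small additive perturbation

e = clap_embed(x)                # condition the LDM *should* learn for x
e_prime = clap_embed(x + delta)  # condition it actually sees during training

# Δe: the misalignment injected between condition and sample
print("embedding shift:", np.linalg.norm(e_prime - e))
```

During LDM training the misaligned pair (x', e') replaces (x, e), so the learned condition-sample association no longer points at the true style.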
Interactive visualization showing how adversarial perturbations affect the CLAP embedding space and disrupt the learning process in MusicLDM.
Objective: Maximize Δe = ||CLAP(x + δ) - CLAP(x)||₂
Subject to:
- Energy constraint: ||δ||_p ≤ ε (bounds the perturbation magnitude)
- Perceptual constraint: D(x, x') ≤ τ for a perceptual distance D
Optimization Strategy:
1. Initialize δ ~ N(0, σ²)
2. For each iteration:
a. Compute gradient: ∇_δ ||CLAP(x + δ) - CLAP(x)||₂
b. Update: δ ← δ + α · sign(∇_δ)
c. Project onto the ℓ∞ ball: δ ← clip(δ, -ε, ε)
3. Return optimal δ*
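The loop above is a projected sign-ascent (PGD-style) attack on the embedding distance. A minimal sketch, using a linear stand-in encoder so ∇_δ has a closed form; with the real CLAP encoder the gradient would come from autograd (e.g. PyTorch), but the initialize/update/project structure is the same:

```python
import numpy as np

rng = np.random.default_rng(1)
D_IN, D_EMB = 2048, 64

# Linear stand-in for CLAP so the gradient is analytic. Note: for a linear
# map, ||W(x+δ) - Wx||₂ = ||Wδ||₂ depends only on δ; for the real nonlinear
# CLAP the objective depends on x as well.
W = rng.standard_normal((D_EMB, D_IN)) / np.sqrt(D_IN)

def embed(x):
    return W @ x

def perturb(x, eps=0.01, alpha=0.002, sigma=1e-3, steps=50):
    """Sign ascent maximizing ||embed(x+δ) - embed(x)||₂ s.t. ||δ||_∞ ≤ eps."""
    delta = sigma * rng.standard_normal(x.shape)  # 1. initialize δ ~ N(0, σ²)
    e = embed(x)
    for _ in range(steps):                        # 2. iterate
        diff = embed(x + delta) - e
        # a. gradient of ||Wδ||₂ w.r.t. δ is Wᵀ(Wδ)/||Wδ||₂
        grad = W.T @ (diff / (np.linalg.norm(diff) + 1e-12))
        delta = delta + alpha * np.sign(grad)     # b. sign-ascent update
        delta = np.clip(delta, -eps, eps)         # c. project onto ℓ∞ ball
    return delta                                  # 3. return δ*

x = rng.standard_normal(D_IN)
delta = perturb(x)
print("Δe =", np.linalg.norm(embed(x + delta) - embed(x)))
print("||δ||_∞ =", np.abs(delta).max())
```

The `clip` projection implements the ε-bound for p = ∞; other norm balls would need a different projection step.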
To validate that protected samples prevent effective learning, we plan to evaluate:
Measure how well models trained on protected data can imitate the original style using CLAP similarity scores and perceptual metrics.
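Style imitation is typically scored as the cosine similarity between CLAP embeddings of generated and reference audio. A minimal sketch (the embeddings below are random placeholders, not real CLAP outputs):

```python
import numpy as np

def clap_similarity(e_gen, e_ref):
    """Cosine similarity between two embeddings (1.0 = identical direction)."""
    return float(np.dot(e_gen, e_ref) /
                 (np.linalg.norm(e_gen) * np.linalg.norm(e_ref) + 1e-12))

rng = np.random.default_rng(4)
e_ref = rng.standard_normal(512)               # embedding of the true style
e_gen = e_ref + 0.5 * rng.standard_normal(512) # imitation drifted off-style
print(f"CLAP similarity: {clap_similarity(e_gen, e_ref):.3f}")
```

Lower similarity for models trained on protected data would indicate the defense worked.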
Quantify the embedding shift Δe and verify that it exceeds a threshold while maintaining perceptual quality.
Ensure perturbations remain imperceptible through listening tests (MOS) and objective metrics (PESQ, STOI).
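PESQ and STOI require reference implementations, but a crude first-pass objective check is the signal-to-perturbation ratio in decibels. A sketch only; this is not a substitute for perceptual metrics or listening tests:

```python
import numpy as np

def perturbation_snr_db(x, delta):
    """Signal-to-perturbation ratio in dB; higher means the perturbation is
    weaker relative to the signal (a crude imperceptibility proxy)."""
    return 10.0 * np.log10(np.sum(x**2) / (np.sum(delta**2) + 1e-20))

rng = np.random.default_rng(2)
x = rng.standard_normal(48000)             # toy waveform
delta = 0.01 * rng.standard_normal(48000)  # perturbation 100x weaker in amplitude
print(f"SNR ≈ {perturbation_snr_db(x, delta):.1f} dB")
```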
Test against various preprocessing (resampling, compression) and verify protection persists across different diffusion model architectures.
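A minimal robustness probe under these assumptions: pass both clean and protected audio through a preprocessing stand-in (a moving-average low-pass, as a crude proxy for resampling or lossy compression) and measure how much of the embedding shift survives. Both the encoder and the filter are illustrative stand-ins, not the real pipeline:

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.standard_normal((64, 4000)) / np.sqrt(4000)
embed = lambda x: W @ x  # linear stand-in for the CLAP encoder

def smooth(x, k=8):
    """Moving-average low-pass filter: crude proxy for resampling/compression."""
    return np.convolve(x, np.ones(k) / k, mode="same")

x = rng.standard_normal(4000)
delta = 0.02 * np.sign(rng.standard_normal(4000))  # ℓ∞-bounded perturbation

shift_clean = np.linalg.norm(embed(x + delta) - embed(x))
shift_after = np.linalg.norm(embed(smooth(x + delta)) - embed(smooth(x)))
print(f"Δe before preprocessing: {shift_clean:.3f}")
print(f"Δe after preprocessing:  {shift_after:.3f} "
      f"({100 * shift_after / shift_clean:.0f}% retained)")
```

High-frequency perturbations are attenuated most by low-pass preprocessing, which is why robust protection favors perturbation energy the preprocessing cannot cheaply remove.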
This method relies on CLAP as the conditioning mechanism. Models using different conditioning approaches (e.g., direct text-to-audio) may not be affected.
Strong preprocessing (e.g., aggressive resampling, filtering) might remove perturbations. Protection effectiveness depends on maintaining perturbation integrity.
Protection is only effective if applied before training. Once a model is trained on unprotected data, this method cannot retroactively protect it.
There exists a trade-off between protection strength and perceptual quality. Very strong perturbations may become perceptible, while weak ones may not provide sufficient protection.