🎙 A great voice dataset

A voice model is 90% about the dataset. Garbage in, garbage out — the model can never sound better than the audio you train it on. This guide shows how to prepare it right, step by step, with our tools.

In short (TL;DR)

5–10 minutes of clean vocals (to fit the 50 MB upload limit). Quality and variety beat raw quantity.
Only clean vocals — no music, reverb, noise, effects, second voices.
Mono, 48 kHz — our tools output exactly this.
Remove pauses → join into one file → train → apply.

📏 How much you need (and the limit)

5–8 min	our sweet spot — fits the limit and gives a quality model.
~9–10 min	upper edge for WAV — see the limit below.
<3 min	too little — the voice will be flat.

⚠️ Upload limit — 50 MB. A 48 kHz mono WAV is ~5–6 MB per minute, so 50 MB holds about 8–9 minutes. A 14-minute dataset comes out to ~80 MB — it simply won’t upload. So aim for 5–10 minutes (ideally 5–8) of dense clean voice without pauses (Vocal Stitch strips them). 5–8 varied minutes is plenty for a good model.
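
The arithmetic behind the limit can be sketched in a few lines. This is only an estimate and assumes 16-bit PCM (2 bytes per sample), which the guide does not state explicitly but which matches the ~5–6 MB per minute figure; the constant and function names are illustrative.

```python
# Rough WAV size estimate for a 48 kHz mono file.
# Assumption: 16-bit PCM (2 bytes per sample); the small WAV header is ignored.
SAMPLE_RATE_HZ = 48_000
BYTES_PER_SAMPLE = 2      # assumed bit depth: 16-bit
CHANNELS = 1              # mono
UPLOAD_LIMIT_MB = 50


def wav_size_mb(minutes: float) -> float:
    """Approximate audio payload size in MB (1 MB = 10**6 bytes)."""
    bytes_per_minute = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * 60
    return bytes_per_minute * minutes / 1_000_000


def max_minutes(limit_mb: float = UPLOAD_LIMIT_MB) -> float:
    """How many minutes of audio fit under the given size limit."""
    return limit_mb / wav_size_mb(1)


print(f"{wav_size_mb(1):.2f} MB per minute")         # 5.76 MB per minute
print(f"{max_minutes():.1f} minutes fit in 50 MB")   # 8.7 minutes fit in 50 MB
print(f"{wav_size_mb(14):.1f} MB for 14 minutes")    # 80.6 MB for 14 minutes
```

A 24-bit file would be 1.5 times larger per minute, so the same limit would hold correspondingly less audio.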

✅ What the audio must be like

👍 Do

Vocals only (a cappella), no instrumental
Dry sound — no reverb or echo
One voice, no backing vocals
Clean, no noise or hiss
No clipping or distortion
Lossless (WAV / FLAC) for the source
Mono, 48 kHz

👎 Kills quality

Music / instrumental in the background
Reverb and echo — RVC’s #1 enemy
Noise, hiss, hum
A second voice, choir, backing
Heavy effects (autotune, distortion)
Long pauses and silence
MP3/OGG as the dataset source
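
The format items above (mono, 48 kHz, WAV) can be checked locally before uploading. The sketch below uses only Python's standard `wave` module; the function name `check_wav_format` is illustrative and is not part of our tools. It cannot detect reverb, noise or music, which still need a listen.

```python
import wave


def check_wav_format(path: str, expected_rate: int = 48_000) -> list[str]:
    """Return a list of format problems; an empty list means the format is fine.

    Note: wave.open raises wave.Error for files that are not PCM WAV.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        if channels != 1:
            problems.append(f"expected mono, got {channels} channels")
        if rate != expected_rate:
            problems.append(f"expected {expected_rate} Hz, got {rate} Hz")
    return problems
```

Calling `check_wav_format("dataset.wav")` on a correctly prepared file returns an empty list; the file name here is only a placeholder.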

🎭 Variety beats quantity

5–8 varied minutes beat the same length of monotone, single-note reading. Make sure the dataset covers:

different emotions and intonation (calm, bright, soft, intense);
different tempo and volume;
the full pitch range (low and high), especially for singing;
all the language’s sounds — e.g. rolled and soft “r”, sibilants, vowels. If a sound is missing in the data, the model will mangle it.

🗣 So the model doesn’t slur — the “th”, “r” and sibilant sounds

RVC’s weak spots are sibilants, “th” and “r”. If they’re scarce or slurred in the data, the model mangles them. So deliberately add clear, unhurried “th”, “r” and “s / sh / ch” sounds. The easiest way is to record a few tongue-twisters:

R / L: “Red lorry, yellow lorry” · “Really leery, rarely Larry”
S / SH: “She sells seashells by the seashore”
TH: “The thirty-three thieves thought they thrilled the throne”
CH / W: “Which witch wished which wicked wish” · “A cheap ship trip”

Read them clearly and slowly — that gives the model clean samples of the hard sounds so it stops “swallowing” them. (Cloning a non-English voice? Use tongue-twisters in that language.)

🛠 The pipeline, step by step

Gather source audio of the voice

Gather recordings of the target voice so that after cleanup you have 5–10 minutes of clean vocals: songs, podcasts, voice notes. The cleaner the source, the less cleanup later.

Separate vocals from music (if from songs)

If the voice is already clean (a cappella or a mic recording) — skip this. If it’s a song — split out the vocals with any vocal remover and keep only the vocal stem, ideally with reverb removed.

🔗 vocalremover.org (free, in-browser)

Cut the pauses and join into one file

Drop the vocal files into our Vocal Stitch — it cuts the silence between phrases and joins everything into one continuous 48 kHz mono WAV. That’s your ready dataset.

🎚 Open Vocal Stitch →
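
For intuition, here is a minimal sketch of what "cut the pauses and join" means. It is not Vocal Stitch's actual algorithm: it assumes mono 16-bit samples held in a plain Python list, and the threshold and gap values are arbitrary illustrative choices.

```python
def strip_silence(samples, rate=48_000, threshold=500, frame_ms=20, max_gap_ms=200):
    """Shorten quiet stretches in a list of mono 16-bit samples.

    A frame is "quiet" if its peak amplitude is below `threshold`.
    Pauses are cut down to at most `max_gap_ms`; trailing silence is dropped.
    """
    frame_len = rate * frame_ms // 1000
    max_quiet_frames = max_gap_ms // frame_ms
    kept, quiet_run = [], []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        if max(abs(s) for s in frame) >= threshold:
            # Voiced frame: keep only the first part of the pause before it.
            for quiet in quiet_run[:max_quiet_frames]:
                kept.extend(quiet)
            quiet_run = []
            kept.extend(frame)
        else:
            quiet_run.append(frame)
    return kept


# Two synthetic "clips": joining is just concatenation after trimming.
clips = [[0] * 4800 + [8000] * 9600, [8000] * 9600 + [0] * 48000]
dataset = [s for clip in clips for s in strip_silence(clip)]
print(len(dataset) / 48_000, "seconds")  # 0.5 seconds
```

Keeping a short gap between phrases rather than removing every quiet frame is a design choice in this sketch, so that words do not run into each other.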

Check the material (optional)

You can run it through the Track Analyzer — check duration and loudness, make sure there’s no clipping or dropouts.

🎛 Open Track Analyzer →
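
The same kind of check can be approximated offline. The sketch below is not the Track Analyzer itself: it assumes 16-bit samples in a Python list and reports duration, peak and RMS level in dBFS, and a count of full-scale samples as a rough clipping indicator. It does not detect dropouts.

```python
import math

FULL_SCALE = 32767  # assumed 16-bit PCM


def to_dbfs(value: float) -> float:
    """Convert a linear amplitude to dB relative to full scale."""
    return 20 * math.log10(value / FULL_SCALE) if value > 0 else float("-inf")


def analyze(samples, rate=48_000):
    """Return basic stats for a list of mono 16-bit samples."""
    peak = max((abs(s) for s in samples), default=0)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0
    return {
        "duration_s": len(samples) / rate,
        "peak_dbfs": to_dbfs(peak),
        "rms_dbfs": to_dbfs(rms),
        # Samples sitting at full scale usually indicate clipping.
        "clipped_samples": sum(1 for s in samples if abs(s) >= FULL_SCALE),
    }
```

A peak at 0 dBFS together with a non-zero `clipped_samples` count suggests the recording was clipped and should be re-recorded or replaced.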

Train the voice

Upload your finished dataset to Train a voice — that’s it. Training is fully automatic: every parameter (sample rate, method, training length) is tuned for you, nothing to choose. In a few minutes you get a model (available 6 hours — download it or use it right away in Change Timbre).
Quality depends only on the dataset, which is why the four preparation steps before training (gather, separate, stitch, check) are everything.

🎙 Open Train a voice →

Apply the voice

In Change Timbre: upload the vocal you want to re-voice, pick your model and set the Pitch — “No change” if both voices are the same gender, or “Male → female” / “Female → male” when converting across genders. Nothing else to tweak — the rest is automatic.
If the voice mangles sounds, that’s not a setting, it’s the dataset: go back to clean vocals and the “th”, “r” and sibilant sounds (see the block above).

💡 From the community’s experience: the source vocal (the one you re-voice) works best when it’s close in pitch range to the target voice (or matched to it with the Pitch setting) and has clear diction: if the source swallows words, the model copies that. A well-articulated vocal in the right range = a noticeably better result.

🎚 Open Change Timbre →

🚫 Common mistakes

Reverb in the dataset — the top cause of a “dirty” model. Remove it during isolation.
Music bleeding into the vocal — the model learns instruments as part of the voice.
Too little data (<3 min) or monotony: the voice comes out “flat”.
File over 50 MB — won’t upload. A 48 kHz WAV is ~5–6 MB/min, so keep it to 5–10 minutes.

⚖️ Use only your own voice or one you have the rights/permission for. Cloning someone’s voice without consent is prohibited.

🎚 Start: Vocal Stitch →
