A voice model is 90% about the dataset. Garbage in, garbage out — the model can never sound better than the audio you train it on. This guide shows how to prepare it right, step by step, with our tools.
| Dataset length | Verdict |
| --- | --- |
| 5–8 min | our sweet spot: fits the limit and gives a quality model. |
| ~9–10 min | upper edge for WAV (see the limit below). |
| <3 min | too little: the voice will be flat. |
⚠️ Upload limit: 50 MB. A 48 kHz mono WAV takes about 5–6 MB per minute, so 50 MB holds roughly 8–9 minutes. A 14-minute dataset comes out to ~80 MB and simply won’t upload. So aim for 5–10 minutes (ideally 5–8) of dense, clean voice without pauses (Vocal Stitch strips them); anything near the 9–10 minute edge should be checked against the size limit before uploading. 5–8 varied minutes is plenty for a good model.
5–8 varied minutes beat the same amount of monotone, single-note reading. Make sure the dataset covers:
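The size figures above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes 16-bit PCM and counts 1 MB as 10^6 bytes; neither is stated in the guide, and a 24-bit file would be half as large again. The function names are made up for illustration.

```python
def wav_megabytes(minutes, sample_rate=48_000, channels=1, bits=16):
    """Approximate size of uncompressed PCM audio in MB (10^6 bytes).

    Ignores the WAV header, which is only a few dozen bytes.
    The 16-bit default is an assumption, not something the guide states.
    """
    bytes_per_second = sample_rate * channels * bits // 8
    return bytes_per_second * 60 * minutes / 1_000_000


def max_minutes(limit_mb=50, **fmt):
    """Longest duration, in minutes, that fits under the upload limit."""
    return limit_mb / wav_megabytes(1, **fmt)


print(round(wav_megabytes(1), 2))   # 5.76 MB per minute
print(round(wav_megabytes(14), 1))  # 80.6 MB for 14 minutes
print(round(max_minutes(), 1))      # 8.7 minutes fit in 50 MB
```

Under these assumptions a full 10 minutes would come to about 57.6 MB, which is why the 9–10 minute range is only the upper edge.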
RVC’s weak spot is sibilants and the rolled “r”. If they’re scarce or slurred in the data, the model mangles them. So deliberately add clear, unhurried Ж, З, Р, Ш, Щ, Ч sounds — the easiest way is to record a few Russian tongue-twisters, clearly and slowly:
Read them clearly and slowly — that gives the model clean samples of the hard sounds so it stops “swallowing” them.
Gather recordings of the target voice so that after cleanup you have 5–10 minutes of clean vocals: songs, podcasts, voice notes. The cleaner the source, the less cleanup later.
If the voice is already clean (a cappella or a mic recording) — skip this. If it’s a song — split out the vocals with any vocal remover and keep only the vocal stem, ideally with reverb removed.
🔗 vocalremover.org (free, in-browser)
Drop the vocal files into our Vocal Stitch: it cuts the silence between phrases and joins everything into one continuous 48 kHz mono WAV. That’s your ready dataset.
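To give a feel for what "cutting the silence" means, here is a rough sketch of one common approach: split the audio into short frames and keep only those whose RMS level is above a threshold. This is not Vocal Stitch's actual algorithm, which the guide does not describe; the function name, the 20 ms frame and the -40 dB threshold are all assumptions for illustration.

```python
import math


def strip_silence(samples, sample_rate=48_000, frame_ms=20, threshold_db=-40.0):
    """Drop frames whose RMS level is below threshold_db.

    `samples` is a list of floats in the range -1.0..1.0 (mono).
    The level is measured relative to full scale (1.0).
    Hypothetical helper: the frame size and threshold are illustrative.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    threshold = 10 ** (threshold_db / 20)  # convert dB to linear amplitude
    kept = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms >= threshold:
            kept.extend(frame)
    return kept


# One second of signal, one second of silence, one second of signal (at 1 kHz).
audio = [0.5] * 1000 + [0.0] * 1000 + [0.5] * 1000
print(len(strip_silence(audio, sample_rate=1000)))  # 2000
```

A real tool would usually keep a short pad around each phrase so that words are not cut off abruptly; this sketch omits that for brevity.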
🎚 Open Vocal Stitch →
You can run it through the Track Analyzer: check duration and loudness, and make sure there’s no clipping or dropouts.
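The same basic checks can be approximated offline with Python's standard library. The sketch below is not how the Track Analyzer works internally; it is an illustrative helper (`inspect_wav` is a made-up name) that assumes a 16-bit PCM WAV and treats any sample at full scale as clipped.

```python
import array
import sys
import wave


def inspect_wav(path):
    """Return basic facts about a 16-bit PCM WAV file.

    Hypothetical helper: reports sample rate, channel count, duration,
    peak level (1.0 = full scale) and the number of full-scale samples.
    """
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        rate = wf.getframerate()
        channels = wf.getnchannels()
        frames = wf.getnframes()
        samples = array.array("h", wf.readframes(frames))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    peak = max((abs(s) for s in samples), default=0)
    return {
        "sample_rate": rate,
        "channels": channels,
        "minutes": frames / rate / 60,
        "peak": peak / 32768,
        "clipped_samples": sum(1 for s in samples if abs(s) >= 32767),
    }
```

Calling `inspect_wav("dataset.wav")` (the file name is a placeholder) on a finished dataset should report 48000 Hz, 1 channel, a duration inside the 5–10 minute range, and ideally zero clipped samples.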
🎛 Open Track Analyzer →
Upload your finished dataset to Train a voice, and that’s it. Training is fully automatic: every parameter (sample rate, method, training length) is tuned for you, so there is nothing to choose. In a few minutes you get a model. It stays available for 6 hours, so download it or use it right away in Change Timbre.
Quality depends only on the dataset — that’s why steps 1–4 are everything.
In Change Timbre: upload the vocal you want to re-voice, pick your model and set the Pitch — “No change” if both voices are the same gender, or “Male → female” / “Female → male” when converting across genders. Nothing else to tweak — the rest is automatic.
If the voice mangles sounds, that’s not a setting — it’s the dataset: go back to clean vocals and the Ж/З/Р sounds (see the block above).
⚖️ Use only your own voice or one you have the rights/permission for. Cloning someone’s voice without consent is prohibited.