Finding a way to make Old English words sound human. Every approach tried, every dead end documented.
The OE Lab page (Wulf and Eadwacer) needs word-by-word pronunciation audio. Currently it falls back to browser speechSynthesis (robotic, inaccurate). Old English has sounds that don't exist in modern English: the voiced velar fricative /ɣ/, the front rounded vowel /y/, and phonemically long vowels, among others.
Every commercial TTS API has a fixed phoneme inventory per language. If a sound isn't in their English set, it either gets approximated wrong or the request fails entirely.
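The inventory mismatch can be made concrete with a toy coverage check. Both phoneme sets below are illustrative subsets invented for this sketch, not any vendor's actual table:

```python
# Toy model of the fixed-inventory problem: which OE phonemes fall outside
# a (simplified, assumed) English TTS phoneme set?
ENGLISH_TTS_INVENTORY = {"r", "n", "i", "b", "g", "u", "m", "ae"}  # no /ɣ/, /y/, no long vowels

def uncovered(word_ipa):
    """Return the phonemes a fixed English inventory cannot render."""
    return [p for p in word_ipa if p not in ENGLISH_TTS_INVENTORY]

# rēnig /reːniɣ/: the long /eː/ and the velar fricative /ɣ/ both fall through
print(uncovered(["r", "eː", "n", "i", "ɣ"]))  # → ['eː', 'ɣ']
```

Anything the function returns is exactly what the API will approximate, drop, or reject, depending on the vendor.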
Same three words, six approaches each. espeak-ng gets the phonemes right but sounds robotic. Neural2 sounds natural but approximates the sounds. Seed-VC, F5-TTS+Roper, and EZ-VC try to combine correct phonemes with natural voice — each using a different speech encoder.
Google Chirp 3 HD: best-sounding voices (Charon, Kore), but it ignores SSML entirely. The <phoneme> tags are silently dropped. No way to control pronunciation at all. Used for the podcast (where orthographic text is close enough), but can't do individual OE words.
Google Neural2: accepts IPA phoneme tags, but only for phonemes in its language's inventory. English Neural2 is missing /ɣ/, /y/, and long vowels. Covers ~60% of OE sounds. Quality "noticeably worse" than Chirp 3. German Neural2 helps with /y/, but results are still approximate.
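For reference, the SSML sent to Neural2 looks roughly like this. The <phoneme> element and its alphabet="ipa" attribute are standard Google Cloud TTS SSML; the wrapper function is a convenience invented here:

```python
# Build the SSML payload for one OE word with an IPA override.
# <phoneme alphabet="ipa" ph="..."> is real Google Cloud TTS SSML;
# the helper and the fallback orthography inside the tag are our convention.
def oe_word_ssml(word, ipa):
    """Wrap a word in an SSML <phoneme> tag carrying its IPA transcription."""
    return f'<speak><phoneme alphabet="ipa" ph="{ipa}">{word}</phoneme></speak>'

print(oe_word_ssml("rēnig", "reːnɪɣ"))
```

Neural2 renders the phonemes it knows from the `ph` attribute and approximates the rest, which is why /ɣ/ drifts toward /g/ in the output.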
Automated optimization: feed Chirp 3 HD creative English/German respellings, score TTS output against reference recordings from oldenglish.info using formant analysis, iterate. After 7 iterations of the scoring methodology: 0 problem sounds, 6 excellent, 7 good, 7 mediocre out of 20 vowels. Hit the ceiling of what English/German respelling can approximate.
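The scoring step of that loop can be sketched as a formant-distance grader. The (F1, F2) values and the grade thresholds below are illustrative placeholders, not the calibration actually used:

```python
import math

# Sketch of the respelling-loop scorer: compare the first two formants
# (F1, F2, in Hz) of a TTS vowel against the reference recording and bucket
# the distance into the four grades used in the notes above.
def formant_distance(tts, ref):
    """Euclidean distance in (F1, F2) space."""
    return math.dist(tts, ref)

def grade(dist):
    """Bucket a formant distance; thresholds here are assumed, not calibrated."""
    if dist < 50:
        return "excellent"
    if dist < 120:
        return "good"
    if dist < 250:
        return "mediocre"
    return "problem"

# A small shift in both formants stays within the "excellent" band
print(grade(formant_distance((310, 2100), (290, 2060))))  # → excellent
```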
Same phoneme inventory limitation as Google. Worse: Azure rejects the entire SSML document if it encounters an unrecognized phoneme. Google at least tries to approximate. Azure gives silence.
ElevenLabs: pronunciation dictionaries only work with English phonemes on the oldest models (eleven_flash_v2, eleven_monolingual_v1). All newer models silently skip phoneme tags. Non-English phonemes are not supported.
Inworld TTS-1.5-Max: LLM-based model (8.8B params). Supports inline IPA with /IPA/ syntax in text. Different architecture from Google/Azure (not phoneme-table-based). Tested 5 OE words — audio generated, but pronunciation quality was poor. It reads IPA more like text than actual phonemes.
Kokoro-82M: open source, ranked #2 on TTS Arena. Supports [text](/IPA/) override syntax via Misaki G2P — but this syntax doesn't work on Replicate's hosted version. The model reads the brackets and slashes as literal text. Also tested plain respellings ("ray-nikh", "BOH-ghoom") — results sounded unnatural for isolated words.
espeak-ng (formant synthesizer) can pronounce every OE sound correctly via direct phoneme input. It sounds robotic, but the phonemes are right. Separate the two problems: let espeak handle what to say, let a neural model handle how it sounds.
Generated OE words using espeak-ng with direct phoneme input ([[ ]] notation). Also tested Dutch voice (-v nl) where Dutch 'g' naturally produces /ɣ/. Robotic but phonetically accurate.
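A sketch of how one such sample might be generated. The espeak-ng flags (-v, -s, -p, -w) are real; the phoneme codes inside the [[ ]] brackets are Kirshenbaum-style guesses and may not match espeak-ng's actual mnemonics for every sound:

```python
# Build an espeak-ng argv list for direct phoneme input ([[ ]] notation).
# The flag set is standard espeak-ng; the example phoneme string is an
# assumption (check espeak-ng's phoneme tables for the real codes).
def espeak_cmd(phonemes, out_wav, voice="en", speed=130, pitch=35):
    """espeak-ng command: speak raw phonemes to a WAV file."""
    return ["espeak-ng", "-v", voice, "-s", str(speed), "-p", str(pitch),
            "-w", out_wav, f"[[{phonemes}]]"]

cmd = espeak_cmd("r'e:niQ", "renig.wav")  # 'Q' for /ɣ/ is a Kirshenbaum assumption
print(cmd)
# run with: subprocess.run(cmd, check=True)
```

The same builder covers the Dutch test by passing voice="nl".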
Ran espeak phoneme audio through RVC (Retrieval-based Voice Conversion) on Replicate. RVC re-renders audio in a target voice while (theoretically) preserving the pronunciation content. Used the Obama voice model. Key question: did /ɣ/ survive or get normalized to /g/?
RVC uses HuBERT for content encoding (English-centric). Higher risk of normalizing unusual phonemes to nearest English sound.
FreeVC: WavLM-based voice conversion. Fed espeak output + a Charon (Google Chirp 3 HD) reference clip. Less popular than RVC, but a different architecture.
New espeak-ng base samples generated for 6 key OE words, each testing specific phonemes. These were fed through three different voice conversion models, plus F5-TTS direct generation and Neural2 respellings as baselines. Compare across all approaches for the same word.
Fresh espeak-ng phoneme samples for the comparison. Robotic but phonetically correct. These are the input to all three VC models below.
Zero-shot voice conversion using Whisper as content encoder (680K hours multilingual data). Ran locally via pip install seed-vc. Fed espeak OE audio as source, Google Neural2-J English clip as target voice. Whisper's multilingual training should preserve non-English phonemes better than HuBERT (used by RVC).
Seed-VC loaded Wav2Vec2-XLS-R-300M (multilingual) internally. ~3s per word on M1 MacBook Air (CPU/MPS). Listen for whether /ɣ/ survived or became /g/.
Microsoft's speecht5_vc model from HuggingFace. Speech-to-speech conversion using speaker x-vector embeddings to define target voice. Lightweight model, runs fast on CPU. Different architecture from RVC and Seed-VC.
Used random speaker embeddings (no specific target voice). Quality may improve with proper x-vectors from a voice we like.
Zero-shot text-to-speech with voice cloning. Not voice conversion — this generates speech directly from text, cloning a reference voice's timbre. Fed OE orthographic text (rēnig, bōgum, etc.) with Neural2-J as reference voice. The question: how does F5-TTS pronounce OE text it's never seen?
F5-TTS generates from text, not audio. It doesn't know OE pronunciation rules — it's guessing based on English patterns. Compare these against the espeak+VC versions to hear the difference between "correct but converted" vs "guessed but natural."
Google Neural2-J with creative English respellings for each OE word. This is the "best effort without any pipeline" approach. Compare against the VC results to judge whether the pipeline is worth the complexity.
Tested whether we can improve the espeak source material (voice variants), skip espeak entirely (StyleTTS 2, Piper), or improve the VC pipeline (Annie variant → Seed-VC). Also tested MBROLA diphone synthesis and F5-TTS with guided respellings. 8 test words covering every hard OE phoneme.
Tested 6 voice variants × 8 words: en (default), en+Annie (female), en+Alicia, en+klatt (Klatt synthesizer), en+Adam, en+Andrea. All with -s 130 -p 35 for slower pacing and lower pitch. Phonemes are identical — only the voice character differs.
48 total samples (6 variants × 8 words) uploaded. Only showing Annie + comparison variants for rēnig above. All available on R2 at Audio Testing/oe-tts-round5/espeak-variants/.
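The grid itself is trivial to enumerate. Only the two words named in these notes are listed below, and the filename scheme is invented here, not the one used on R2:

```python
# Enumerate the Round-5 sample grid: espeak-ng voice variants × diagnostic words.
VARIANTS = ["en", "en+Annie", "en+Alicia", "en+klatt", "en+Adam", "en+Andrea"]
WORDS = ["renig", "bogum"]  # plus the other six diagnostic words in practice

def sample_jobs(variants, words):
    """One (variant, word, filename) job per grid cell."""
    return [(v, w, f"{w}-{v.replace('+', '-')}.wav")
            for v in variants for w in words]

jobs = sample_jobs(VARIANTS, WORDS)
print(len(jobs))  # → 12 here; 48 with the full 8-word set
```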
Same Seed-VC pipeline as Round 4, but using en+Annie variant as source instead of plain en. Better source audio quality should produce better voice conversion output. Testing all 8 diagnostic words.
Annie variant espeak as source, Neural2-J as target voice. Compare against Round 4 Seed-VC (plain espeak source) to hear whether the voice variant input improves output quality.
Round 4 fed raw OE orthography to F5-TTS (it guessed pronunciation). This time using phonetically-guided English respellings: "ray-nikh" for rēnig, "boh-ghoom" for bōgum, etc. Same Neural2-J reference voice.
Compare these against Round 4's raw-text F5-TTS samples to hear whether guided respellings improve OE pronunciation.
Installed gruut (IPA phonemizer used by StyleTTS 2). gruut treats OE text as English and produces completely wrong phonemes: rēnig → /ɹ ˈɪ n ɪ ɡ/ (English "rinig"). It strips /ɣ/, maps /y/ to /ɪ/, ignores macrons. Same fixed-English-phoneme-inventory problem. StyleTTS 2 itself failed to install — needs tokenizers <0.20 which has no pre-built wheel for Python 3.14.
Installed via pip install piper-tts. On macOS ARM, the bundled espeak-ng data path points to a Linux build directory that doesn't exist: /Users/runner/work/piper1-gpl/.../espeak_ng-install/. Would need a custom build or Linux environment. Not viable on this machine.
Successfully built MBROLA 3.4-dev from source on macOS ARM. Downloaded en1/nl2/us1-3 diphone databases. But espeak-ng's mbrowrap layer depends on /proc (Linux-only). Cannot integrate MBROLA voices through espeak-ng on macOS. Direct MBROLA invocation also failed — espeak's --pho phoneme export produces empty output for phoneme-input text. Would need Linux to test properly.
Every previous attempt started from machine-generated pronunciation. This time: Simon Roper (YouTube linguist who speaks fluent Old English). Downloaded his "Interview with an Anglo-Saxon" video, extracted 15s of OE speech as reference. Two pipelines: F5-TTS voice cloning from text, and Seed-VC converting espeak through his voice.
F5-TTS with Simon Roper's OE speech as reference voice. Generating both raw OE orthography and English respellings. The voice timbre should sound like a real person reading OE, not a robot or an American English speaker.
12 total samples: 8 OE words, 2 respellings, 2 full poem lines — all in Simon Roper's voice timbre. The full lines are the real test: does the voice sound like a person reading Old English poetry?
Same espeak-ng phoneme source as before, but voice conversion targets Simon Roper's voice instead of Neural2-J. The theory: espeak gets the phonemes right, Seed-VC re-renders in Roper's natural OE-speaking timbre.
3 words generated so far. ~5 min per word on M1 Air CPU. Full word set pending evaluation of F5-TTS + Roper results.
Installed via pip install tacotron. Uses ARPAbet (CMU pronunciation dictionary) — not IPA. Only supports the 39 English phonemes from CMUdict. Cannot represent /ɣ/, /y/, or other OE sounds. Same phoneme inventory limitation as the commercial APIs.
EZ-VC's first two install attempts failed: pydantic-core won't build on Python 3.14/macOS ARM, and the HuggingFace model is gated. Resolved with a Python 3.11 venv + HF authentication. See Round 6 below for results.
EZ-VC uses Xeus, a self-supervised speech encoder trained on over 4,000 languages and 1 million hours of audio. Where Seed-VC uses Whisper (multilingual but primarily trained on ~100 languages) and RVC uses HuBERT (English-centric), Xeus has the broadest phonetic coverage of any encoder. If any voice conversion model can preserve /ɣ/ and /y/ through conversion, it should be this one.
Pipeline: espeak-ng phoneme audio → Xeus encoder (discrete units) → F5-TTS decoder → BigVGAN vocoder. Reference voice: Simon Roper OE clip (15s). ~45s per word on M1 Air CPU.
All 6 diagnostic OE words converted through the Xeus 4,000-language encoder with Simon Roper's voice as reference. The key question: does the broadest multilingual encoder preserve OE phonemes better than Whisper (Seed-VC) or HuBERT (RVC)?
Setup: Python 3.11 venv, pip install -e . + espnet SSL fork. Required HuggingFace auth (gated model at SPRINGLab/EZ-VC). Patched BigVGAN for newer huggingface_hub API. Xeus model is 600MB, EZ-VC checkpoint is 1.2GB.
Re-generated F5-TTS + Roper samples with improvements: upsampled reference to 24kHz (matching F5-TTS output rate), proper silence trimming, 192kbps MP3 encoding (v1 was 74kbps). Same approach as Round 5b — F5-TTS generates speech directly from OE text, cloning Roper's voice — but with better audio quality.
v2 improvements: 24kHz reference audio (was 16kHz), 192kbps MP3 (was 74kbps), silence-trimmed output. Same F5-TTS v1 Base model, same Roper reference clip, 32 NFE steps, cfg_strength=2.0.
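The v2 post-processing can be sketched as two ffmpeg invocations. The -ar, -b:a, and silenceremove options are standard ffmpeg; the -40 dB trim threshold is an assumed value, not necessarily the one used:

```python
# Build ffmpeg argv lists for the v2 audio-quality fixes described above.
def resample_cmd(src, dst, rate=24000):
    """Resample the reference clip to the F5-TTS output rate (24 kHz)."""
    return ["ffmpeg", "-y", "-i", src, "-ar", str(rate), dst]

def encode_cmd(src, dst, bitrate="192k"):
    """Trim leading silence (assumed -40 dB floor) and encode a 192 kbps MP3."""
    return ["ffmpeg", "-y", "-i", src,
            "-af", "silenceremove=start_periods=1:start_threshold=-40dB",
            "-b:a", bitrate, dst]

print(resample_cmd("roper-ref.wav", "roper-ref-24k.wav"))
print(encode_cmd("renig-v2.wav", "renig-v2.mp3"))
```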
No commercial TTS API can produce Old English. Not Google, not Azure, not ElevenLabs. They all have a fixed English phoneme inventory and silently drop or mangle sounds they don't recognize. This isn't a configuration problem — it's an architectural limitation of every major TTS provider.
The breakthrough was separating the two problems: phoneme accuracy and voice quality. espeak-ng (a formant synthesizer from 2007) can pronounce every OE sound correctly via direct phoneme input. It sounds like a robot from 1995, but the sounds are right. Neural voice conversion models can then re-render that robotic audio in a natural human voice.
The best results came from cloning the voice of Simon Roper, a YouTube linguist who speaks fluent reconstructed Old English. F5-TTS (zero-shot voice cloning) generates speech that sounds like a real person reading OE poetry — because the voice reference is a real person reading OE poetry.
Current status: 90+ audio samples generated across 15 approaches over 6 rounds. Three viable pipelines identified: espeak-ng + Seed-VC, F5-TTS + scholar voice clone, and EZ-VC (Xeus 4,000-language encoder). Next step is human evaluation of the samples, then generating the full 52-word set for the OE Lab page.
Hours of OE pronunciation audio exist from scholars: Benjamin Bagby (Beowulf, Internet Archive) and Michael Drout ("Anglo-Saxon Aloud" — the entire Anglo-Saxon Poetic Records). Drout's recordings include Wulf and Eadwacer specifically. More reference voices would improve the cloning pipeline and provide variety.
Now running on macOS ARM via Python 3.11 venv + HuggingFace auth. 6 diagnostic words converted with Roper voice reference. See Round 6 results above.
Someone already did this for Latin (Ken-Z/latin_SpeechT5 on HuggingFace, 67 hours of training data). The same approach with Drout's OE recordings (~10-20 hours) could produce a dedicated OE voice model. More effort up front, but eliminates the two-step pipeline entirely.
Early 2025: Google Neural2 + SSML IPA tested, partial success (~60% of sounds)
Mar 2025: Chirp 3 HD tested, discovered it ignores SSML entirely
Mar-Apr 2025: 6-round autotune respelling loop, hit ceiling
Apr 13, 2026: Inworld TTS-1.5-Max tested on Replicate, poor quality
Apr 13, 2026: Kokoro-82M tested, IPA syntax doesn't work on Replicate
Apr 13, 2026: espeak-ng + voice conversion pipeline conceived and tested
Apr 13, 2026: RVC and FreeVC voice conversion results generated — awaiting evaluation
Apr 14, 2026: Round 4 — Seed-VC (Whisper-based), SpeechT5-VC, F5-TTS voice cloning all tested locally
Apr 14, 2026: tacotron confirmed dead end (ARPAbet only, no IPA). EZ-VC install failed (Python 3.14 incompatible)
Apr 14, 2026: Neural2 respelling baseline generated for A/B comparison
Apr 14, 2026: 6 new audio samples per approach uploaded to R2 — 30+ samples total for Round 4
Apr 14, 2026: Round 5 — espeak voice variants (48 samples), F5-TTS respellings (8 samples), Seed-VC v2 with Annie variant
Apr 14, 2026: StyleTTS 2 + gruut confirmed dead (gruut maps OE to English phonemes). Piper broken on macOS ARM
Apr 14, 2026: MBROLA built from source but macOS lacks /proc for espeak integration
Apr 14, 2026: Round 5b — Simon Roper voice cloning. F5-TTS + Roper (12 samples including full poem lines), Seed-VC + Roper (3 samples)
Apr 14, 2026: Round 6 — EZ-VC (Xeus 4,000-language encoder) finally running on macOS. Python 3.11 + HF auth solved the install issues. 6 words converted with Roper voice reference
Apr 14, 2026: 90+ total audio samples across all rounds. Three viable pipelines identified