Lab

Old English TTS Research

Finding a way to make Old English words sound human. Every approach tried, every dead end documented.

The Problem

The OE Lab page (Wulf and Eadwacer) needs word-by-word pronunciation audio. It currently falls back to browser speechSynthesis (robotic, inaccurate). Old English has sounds that don't exist in modern English: /ɣ/ (voiced velar fricative), /y/ (front rounded vowel), and phonemically long vowels.

Every commercial TTS API has a fixed phoneme inventory per language. If a sound isn't in their English set, it either gets approximated wrong or the request fails entirely.

Head to Head — Same Word, Different Approaches

rēnig /ˈreː.niɣ/ — rainy (tests /ɣ/ voiced velar fricative)

espeak-ng phoneme input source
Neural2 respelling "RAY-nikh" baseline
Seed-VC v2 (espeak → natural voice) Seed-VC
F5-TTS + Simon Roper voice clone F5+Roper
EZ-VC (Xeus 4K-lang → Roper) EZ-VC
F5-TTS + Roper v2 (24kHz, 192kbps) F5+Roper v2

bōgum /ˈboː.ɣum/ — branches (tests /ɣ/ + long vowel)

espeak-ng phoneme input source
Neural2 respelling "BOH-goom" baseline
Seed-VC v2 (espeak → natural voice) Seed-VC
F5-TTS + Simon Roper voice clone F5+Roper
EZ-VC (Xeus 4K-lang → Roper) EZ-VC
F5-TTS + Roper v2 (24kHz, 192kbps) F5+Roper v2

wyn /wyn/ — joy (tests /y/ front rounded vowel)

espeak-ng phoneme input source
Neural2 respelling "WUUN" baseline
Seed-VC v2 (espeak → natural voice) Seed-VC
F5-TTS + Simon Roper voice clone F5+Roper
EZ-VC (Xeus 4K-lang → Roper) EZ-VC
F5-TTS + Roper v2 (24kHz, 192kbps) F5+Roper v2

Same three words, six approaches each. espeak-ng gets the phonemes right but sounds robotic. Neural2 sounds natural but approximates the sounds. Seed-VC, F5-TTS+Roper, and EZ-VC try to combine correct phonemes with natural voice — each using a different speech encoder.

Round 1 — Direct TTS APIs

Google Chirp 3 HD

Failed

Best-sounding voice (Charon, Kore), but ignores SSML entirely. The <phoneme> tags are silently dropped. No way to control pronunciation at all. Used for the podcast (where orthographic text is close enough), but can't do individual OE words.

Google Neural2 + SSML <phoneme>

Failed

Accepts IPA phoneme tags, but only for phonemes in its language's inventory. English Neural2 is missing /ɣ/, /y/, and long vowels. Covers ~60% of OE sounds. Quality "noticeably worse" than Chirp 3. German Neural2 helps with /y/ but still approximate.
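Google's SSML supports per-word IPA hints via the `<phoneme>` tag; the failure described above is that out-of-inventory symbols like /ɣ/ get approximated or dropped. A minimal sketch of the request markup (the helper name is ours, not Google's):

```python
# Build the SSML body for a per-word IPA pronunciation hint, as sent to
# Google Cloud TTS. Whether a symbol actually renders depends on the
# voice's phoneme inventory — English Neural2 lacks /ɣ/ and /y/.

def ssml_phoneme(word: str, ipa: str) -> str:
    """Wrap a word in an SSML <phoneme> tag with an IPA hint."""
    return f'<speak><phoneme alphabet="ipa" ph="{ipa}">{word}</phoneme></speak>'

ssml = ssml_phoneme("renig", "ˈreː.niɣ")
```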

Autotune Respelling Loop (6 rounds)

Ceiling reached

Automated optimization: feed Chirp 3 HD creative English/German respellings, score TTS output against reference recordings from oldenglish.info using formant analysis, iterate. After 7 iterations of the scoring methodology: 0 problem sounds, 6 excellent, 7 good, 7 mediocre out of 20 vowels. Hit the ceiling of what English/German respelling can approximate.
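The loop reduces to: propose respellings, synthesize, score against a reference recording, keep the best, repeat. A toy version with a stubbed scorer (`score_against_reference` stands in for the TTS call plus formant analysis; all names and the candidate pool are illustrative):

```python
# Toy version of the respelling-optimization loop. The real pipeline
# synthesized each candidate with Chirp 3 HD and compared formants against
# oldenglish.info reference recordings; here the scorer is a stub.

def score_against_reference(respelling: str) -> float:
    """Stub: stand-in for TTS synthesis + formant comparison (0..1)."""
    target = "ray-nikh"  # pretend this respelling matches the reference best
    matches = sum(a == b for a, b in zip(respelling, target))
    return matches / max(len(respelling), len(target))

def optimize(candidates: list[str], rounds: int = 6) -> str:
    best = max(candidates, key=score_against_reference)
    for _ in range(rounds - 1):
        # Each real round would mutate the current best into new candidates;
        # here we just re-score the fixed pool.
        best = max(candidates + [best], key=score_against_reference)
    return best

best = optimize(["ray-nig", "ray-nikh", "rey-nich"])
```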

Azure SSML <phoneme>

Same wall

Same phoneme inventory limitation as Google. Worse: Azure rejects the entire SSML document if it encounters an unrecognized phoneme. Google at least tries to approximate. Azure gives silence.

ElevenLabs

Dead end

Pronunciation dictionaries only work with English phonemes on the oldest models (eleven_flash_v2, eleven_monolingual_v1). All newer models silently skip phoneme tags. Non-English phonemes not supported.

Round 2 — Alternative TTS Models (Apr 2026)

Inworld TTS-1.5-Max (Replicate)

Poor quality

LLM-based model (8.8B params). Supports inline IPA with /IPA/ syntax in text, and uses a different architecture from Google/Azure (not phoneme-table-based). Tested 5 OE words — audio generated, but pronunciation quality was poor: the model reads the IPA more like literal text than as phonemes.

rēnig /ɣ/ test
bōgum /ɣ/+long vowel

Kokoro-82M (Replicate)

IPA override didn't work

Open source, ranked #2 on TTS Arena. Supports [text](/IPA/) override syntax via Misaki G2P — but this syntax doesn't work on Replicate's hosted version. Model reads the brackets and slashes as literal text. Also tested plain respellings ("ray-nikh", "BOH-ghoom") — results sounded unnatural for isolated words.

IPA markdown syntax
English respelling "ray-nikh"
Plain OE text

Round 3 — espeak-ng + Voice Conversion

The Insight

espeak-ng (formant synthesizer) can pronounce every OE sound correctly via direct phoneme input. It sounds robotic, but the phonemes are right. Separate the two problems: let espeak handle what to say, let a neural model handle how it sounds.

IPA in data.json → espeak-ng phoneme codes → espeak-ng speaks the word (robotic but correct) → voice conversion model (natural voice) → final audio (correct + natural)

espeak-ng Source Audio

Generated

Generated OE words using espeak-ng with direct phoneme input ([[ ]] notation). Also tested Dutch voice (-v nl) where Dutch 'g' naturally produces /ɣ/. Robotic but phonetically accurate.

rēnig (phoneme input) espeak
Wulf line 1 (phoneme input) espeak
Wulf stanza 1 (Dutch voice) espeak-nl
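The espeak-ng step is a plain CLI invocation. A sketch that composes the argv without executing it — the `[[...]]` phoneme string shown is illustrative, not a verified espeak-ng rendering of rēnig, and the flag values mirror the pacing/pitch settings used later in Round 5:

```python
# Compose the espeak-ng command used as the pipeline's source stage.
# Built but not executed here; run it where espeak-ng is installed.

def espeak_cmd(phonemes: str, voice: str = "en",
               speed: int = 130, pitch: int = 35,
               out_wav: str = "out.wav") -> list[str]:
    """argv for espeak-ng direct phoneme input ([[ ]] notation)."""
    return ["espeak-ng", "-v", voice, "-s", str(speed), "-p", str(pitch),
            "-w", out_wav, f"[[{phonemes}]]"]

cmd = espeak_cmd("r'e:nig", out_wav="renig.wav")
# subprocess.run(cmd, check=True)  # uncomment on a machine with espeak-ng
```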

RVC v2 Voice Conversion (espeak → Obama)

Samples generated

Ran espeak phoneme audio through RVC (Retrieval-based Voice Conversion) on Replicate. RVC re-renders audio in a target voice while (theoretically) preserving the pronunciation content. Used the Obama voice model. Key question: did /ɣ/ survive or get normalized to /g/?

rēnig (/ɣ/ test) RVC
cymeð (/y/ test) RVC
lāc (long vowel) RVC
bōgum (/ɣ/+long) RVC
Wulf line 1 (full line) RVC
Wulf stanza (Dutch→RVC) RVC

RVC uses HuBERT for content encoding (English-centric). Higher risk of normalizing unusual phonemes to nearest English sound.

FreeVC Voice Conversion (espeak → Charon)

Samples generated

WavLM-based voice conversion. Fed espeak output + a Charon (Google Chirp 3 HD) reference clip. Less popular than RVC but different architecture.

Wulf line 1 FreeVC

Round 4 — Expanded Testing (Apr 14, 2026)

Round 4 Setup

New espeak-ng base samples generated for 6 key OE words, each testing specific phonemes. These were fed through three different voice conversion models, plus F5-TTS direct generation and Neural2 respellings as baselines. Compare across all approaches for the same word.

espeak-ng Base Samples (Round 4)

Source material

Fresh espeak-ng phoneme samples for the comparison. Robotic but phonetically correct. These are the input to all three VC models below.

rēnig /ɣ/ espeak
bōgum /ɣ/+long espeak
wyn /y/ espeak
cymeð /y/+/θ/ espeak
lāc long /ɑː/ espeak
swylce /y/+/tʃ/ espeak

Seed-VC (espeak → Neural2-J voice)

Promising

Zero-shot voice conversion using Whisper as content encoder (680K hours multilingual data). Ran locally via pip install seed-vc. Fed espeak OE audio as source, Google Neural2-J English clip as target voice. Whisper's multilingual training should preserve non-English phonemes better than HuBERT (used by RVC).

rēnig /ɣ/ Seed-VC
bōgum /ɣ/+long Seed-VC
wyn /y/ Seed-VC
cymeð /y/+/θ/ Seed-VC
lāc long /ɑː/ Seed-VC
swylce /y/+/tʃ/ Seed-VC

Seed-VC loaded Wav2Vec2-XLS-R-300M (multilingual) internally. ~3s per word on M1 MacBook Air (CPU/MPS). Listen for whether /ɣ/ survived or became /g/.

SpeechT5 Voice Conversion (espeak → x-vector)

Samples generated

Microsoft's speecht5_vc model from HuggingFace. Speech-to-speech conversion using speaker x-vector embeddings to define target voice. Lightweight model, runs fast on CPU. Different architecture from RVC and Seed-VC.

rēnig /ɣ/ SpeechT5
bōgum /ɣ/+long SpeechT5
wyn /y/ SpeechT5
cymeð /y/+/θ/ SpeechT5
lāc long /ɑː/ SpeechT5
swylce /y/+/tʃ/ SpeechT5

Used random speaker embeddings (no specific target voice). Quality may improve with proper x-vectors from a voice we like.

F5-TTS (direct OE text → cloned voice)

Samples generated

Zero-shot text-to-speech with voice cloning. Not voice conversion — this generates speech directly from text, cloning a reference voice's timbre. Fed OE orthographic text (rēnig, bōgum, etc.) with Neural2-J as reference voice. The question: how does F5-TTS pronounce OE text it's never seen?

rēnig F5-TTS
bōgum F5-TTS
wyn F5-TTS
cymeð F5-TTS
lāc F5-TTS
swylce F5-TTS
Ēadwacer F5-TTS
Wulf F5-TTS

F5-TTS generates from text, not audio. It doesn't know OE pronunciation rules — it's guessing based on English patterns. Compare these against the espeak+VC versions to hear the difference between "correct but converted" vs "guessed but natural."

Neural2 Respellings (baseline comparison)

Baseline generated

Google Neural2-J with creative English respellings for each OE word. This is the "best effort without any pipeline" approach. Compare against the VC results to judge whether the pipeline is worth the complexity.

rēnig → "RAY-nikh" Neural2
bōgum → "BOH-goom" Neural2
wyn → "WUUN" Neural2
cymeð → "KUU-meth" Neural2
lāc → "LAHK" Neural2
swylce → "SWUL-cheh" Neural2
Ēadwacer → "AY-ad-wah-ker" Neural2
Wulf → "WOOLF" Neural2
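The baseline respellings above can live in a simple lookup table keyed by the OE word (keys shown with macrons stripped for plain-ASCII lookup; that normalization choice is ours):

```python
# The Neural2 baseline respellings from the list above.

RESPELLINGS = {
    "renig": "RAY-nikh",
    "bogum": "BOH-goom",
    "wyn": "WUUN",
    "cymeth": "KUU-meth",
    "lac": "LAHK",
    "swylce": "SWUL-cheh",
    "Eadwacer": "AY-ad-wah-ker",
    "Wulf": "WOOLF",
}

def respell(word: str) -> str:
    """Fall back to the word itself when no respelling is on file."""
    return RESPELLINGS.get(word, word)
```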

Round 5 — Broader Testing (Apr 14, 2026)

Round 5 Approach

Tested whether we can improve the espeak source material (voice variants), skip espeak entirely (StyleTTS 2, Piper), or improve the VC pipeline (Annie variant → Seed-VC). Also tested MBROLA diphone synthesis and F5-TTS with guided respellings. 8 test words covering every hard OE phoneme.

espeak-ng Voice Variants

48 samples generated

Tested 6 voice variants × 8 words: en (default), en+Annie (female), en+Alicia, en+klatt (Klatt synthesizer), en+Adam, en+Andrea. All with -s 130 -p 35 for slower pacing and lower pitch. Phonemes are identical — only the voice character differs.

rēnig — Annie Annie
rēnig — klatt klatt
rēnig — Adam Adam
bōgum — Annie Annie
wyn — Annie Annie
cymeð — Annie Annie
Gehȳrest — Annie Annie
rēotugu — Annie Annie

48 total samples (6 variants × 8 words) uploaded. Only showing Annie + comparison variants for rēnig above. All available on R2 at Audio Testing/oe-tts-round5/espeak-variants/.

Seed-VC v2 (Annie espeak → Neural2-J)

Promising

Same Seed-VC pipeline as Round 4, but using en+Annie variant as source instead of plain en. Better source audio quality should produce better voice conversion output. Testing all 8 diagnostic words.

rēnig /ɣ/ Seed-VC
bōgum /ɣ/+long Seed-VC
wyn /y/ Seed-VC
cymeð /y/+/θ/ Seed-VC
lāc long /ɑː/ Seed-VC
swylce /y/+/tʃ/ Seed-VC
Gehȳrest /yː/+/j/ Seed-VC
rēotugu /ɣ/ 4-syl Seed-VC

Annie variant espeak as source, Neural2-J as target voice. Compare against Round 4 Seed-VC (plain espeak source) to hear whether the voice variant input improves output quality.

F5-TTS with Phonetic Respellings

Samples generated

Round 4 fed raw OE orthography to F5-TTS (it guessed pronunciation). This round uses phonetically guided English respellings instead: "ray-nikh" for rēnig, "boh-ghoom" for bōgum, etc. Same Neural2-J reference voice.

rēnig → "ray-nikh" F5-resp
bōgum → "boh-ghoom" F5-resp
wyn → "wuun" F5-resp
cymeð → "kuu-meth" F5-resp
lāc → "laahk" F5-resp
swylce → "swul-cheh" F5-resp
Gehȳrest → "yeh-huu-rest" F5-resp
rēotugu → "ray-o-too-ghoo" F5-resp

Compare these against Round 4's raw-text F5-TTS samples to hear whether guided respellings improve OE pronunciation.

StyleTTS 2 + gruut (IPA-to-speech dream)

Same wall

Installed gruut (IPA phonemizer used by StyleTTS 2). gruut treats OE text as English and produces completely wrong phonemes: rēnig → /ɹ ˈɪ n ɪ ɡ/ (English "rinig"). It strips /ɣ/, maps /y/ to /ɪ/, ignores macrons. Same fixed-English-phoneme-inventory problem. StyleTTS 2 itself failed to install — needs tokenizers <0.20 which has no pre-built wheel for Python 3.14.

Piper TTS (custom phoneme config)

macOS broken

Installed via pip install piper-tts. On macOS ARM, the bundled espeak-ng data path points to a Linux build directory that doesn't exist: /Users/runner/work/piper1-gpl/.../espeak_ng-install/. Would need a custom build or Linux environment. Not viable on this machine.

MBROLA + espeak-ng (diphone synthesis)

macOS blocked

Successfully built MBROLA 3.4-dev from source on macOS ARM. Downloaded en1/nl2/us1-3 diphone databases. But espeak-ng's mbrowrap layer depends on /proc (Linux-only). Cannot integrate MBROLA voices through espeak-ng on macOS. Direct MBROLA invocation also failed — espeak's --pho phoneme export produces empty output for phoneme-input text. Would need Linux to test properly.

Round 5b — Scholar Voice Cloning

New Approach: Real OE Speaker

Every previous attempt started from machine-generated pronunciation. This time: Simon Roper (YouTube linguist who speaks fluent Old English). Downloaded his "Interview with an Anglo-Saxon" video, extracted 15s of OE speech as reference. Two pipelines: F5-TTS voice cloning from text, and Seed-VC converting espeak through his voice.

F5-TTS + Roper Voice Clone (text → Roper voice)

Best so far

F5-TTS with Simon Roper's OE speech as reference voice. Generating both raw OE orthography and English respellings. The voice timbre should sound like a real person reading OE, not a robot or an American English speaker.

rēnig (OE text) F5+Roper
rēnig (respelling "ray-nikh") F5+Roper
bōgum (OE text) F5+Roper
wyn F5+Roper
cymeð F5+Roper
lāc F5+Roper
swylce F5+Roper
Ēadwacer F5+Roper
Wulf F5+Roper
bōgum (respelling "boh-ghoom") F5+Roper
"Leodum is minum swylce him mon lac gife" (full line 1) F5+Roper
"Wulf is on iege, ic on oþerre" (full line 2) F5+Roper

12 total samples: 8 OE words, 2 respellings, 2 full poem lines — all in Simon Roper's voice timbre. The full lines are the real test: does the voice sound like a person reading Old English poetry?

Seed-VC + Roper Voice (espeak → Roper)

Promising

Same espeak-ng phoneme source as before, but voice conversion targets Simon Roper's voice instead of Neural2-J. The theory: espeak gets the phonemes right, Seed-VC re-renders in Roper's natural OE-speaking timbre.

rēnig /ɣ/ VC+Roper
bōgum /ɣ/+long VC+Roper
wyn /y/ VC+Roper

3 words generated so far. ~5 min per word on M1 Air CPU. Full word set pending evaluation of F5-TTS + Roper results.

Confirmed Dead Ends

tacotron (IPA-to-speech directly)

Dead end

Installed via pip install tacotron. Uses ARPAbet (CMU pronunciation dictionary) — not IPA. Only supports the 39 English phonemes from CMUdict. Cannot represent /ɣ/, /y/, or other OE sounds. Same phoneme inventory limitation as the commercial APIs.

EZ-VC (4,000-language encoder) — first attempts

Resolved in Round 6

First two install attempts failed: pydantic-core won't build on Python 3.14/macOS ARM, and the HuggingFace model is gated. Resolved by using Python 3.11 venv + HF authentication. See Round 6 below for results.

Round 6 — EZ-VC (4,000-Language Encoder)

The Theory

EZ-VC uses Xeus, a self-supervised speech encoder trained on over 4,000 languages and 1 million hours of audio. Where Seed-VC uses Whisper (multilingual but primarily trained on ~100 languages) and RVC uses HuBERT (English-centric), Xeus has the broadest phonetic coverage of any encoder. If any voice conversion model can preserve /ɣ/ and /y/ through conversion, it should be this one.

Pipeline: espeak-ng phoneme audio → Xeus encoder (discrete units) → F5-TTS decoder → BigVGAN vocoder. Reference voice: Simon Roper OE clip (15s). ~45s per word on M1 Air CPU.

EZ-VC (espeak → Roper voice via Xeus)

New results

All 6 diagnostic OE words converted through the Xeus 4,000-language encoder with Simon Roper's voice as reference. The key question: does the broadest multilingual encoder preserve OE phonemes better than Whisper (Seed-VC) or HuBERT (RVC)?

rēnig /ɣ/ EZ-VC
bōgum /ɣ/+long EZ-VC
wyn /y/ EZ-VC
cymeð /y/+/θ/ EZ-VC
lāc long /ɑː/ EZ-VC
swylce /y/+/tʃ/ EZ-VC

Setup: Python 3.11 venv, pip install -e . + espnet SSL fork. Required HuggingFace auth (gated model at SPRINGLab/EZ-VC). Patched BigVGAN for newer huggingface_hub API. Xeus model is 600MB, EZ-VC checkpoint is 1.2GB.

F5-TTS + Roper v2 (improved quality)

Best so far

Re-generated F5-TTS + Roper samples with improvements: upsampled reference to 24kHz (matching F5-TTS output rate), proper silence trimming, 192kbps MP3 encoding (v1 was 74kbps). Same approach as Round 5b — F5-TTS generates speech directly from OE text, cloning Roper's voice — but with better audio quality.

rēnig F5+Roper v2
bōgum F5+Roper v2
wyn F5+Roper v2
cymeð F5+Roper v2
lāc F5+Roper v2
swylce F5+Roper v2
Ēadwacer F5+Roper v2
Wulf F5+Roper v2

v2 improvements: 24kHz reference audio (was 16kHz), 192kbps MP3 (was 74kbps), silence-trimmed output. Same F5-TTS v1 Base model, same Roper reference clip, 32 NFE steps, cfg_strength=2.0.
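The silence trimming in v2 amounts to dropping leading and trailing samples under an amplitude threshold. A pure-Python stand-in for that step (the real pipeline also resampled the reference to 24kHz and re-encoded at 192kbps; the threshold value here is illustrative):

```python
# Sketch of the v2 silence trim: keep only the span between the first and
# last sample whose amplitude clears the threshold.

def trim_silence(samples: list[float], threshold: float = 0.01) -> list[float]:
    """Return samples between the first and last above-threshold sample."""
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []
    return samples[loud[0]:loud[-1] + 1]

trimmed = trim_silence([0.0, 0.001, 0.3, -0.2, 0.005, 0.0])
# trimmed == [0.3, -0.2]
```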

Where This Stands

What We Found

No commercial TTS API can produce Old English. Not Google, not Azure, not ElevenLabs. They all have a fixed English phoneme inventory and silently drop or mangle sounds they don't recognize. This isn't a configuration problem — it's an architectural limitation of every major TTS provider.

The breakthrough was separating the two problems: phoneme accuracy and voice quality. espeak-ng (a formant synthesizer from 2007) can pronounce every OE sound correctly via direct phoneme input. It sounds like a robot from 1995, but the sounds are right. Neural voice conversion models can then re-render that robotic audio in a natural human voice.

The best results came from cloning the voice of Simon Roper, a YouTube linguist who speaks fluent reconstructed Old English. F5-TTS (zero-shot voice cloning) generates speech that sounds like a real person reading OE poetry — because the voice reference is a real person reading OE poetry.

Current status: 90+ audio samples generated across 15 approaches over 6 rounds. Three viable pipelines identified: espeak-ng + Seed-VC, F5-TTS + scholar voice clone, and EZ-VC (Xeus 4,000-language encoder). Next step is human evaluation of the samples, then generating the full 52-word set for the OE Lab page.

Future Directions

More scholar voice sources

Research

Hours of OE pronunciation audio exist from scholars: Benjamin Bagby (Beowulf, Internet Archive), Michael Drout ("Anglo-Saxon Aloud" — the entire Anglo-Saxon Poetic Records). Drout's recordings include Wulf and Eadwacer specifically. More reference voices would improve the cloning pipeline and provide variety.

EZ-VC (4,000-language encoder)

Tested — Round 6

Now running on macOS ARM via Python 3.11 venv + HuggingFace auth. 6 diagnostic words converted with Roper voice reference. See Round 6 results above.

Fine-tune SpeechT5 on OE audio

Research

Someone already did this for Latin (Ken-Z/latin_SpeechT5 on HuggingFace, 67 hours of training data). The same approach with Drout's OE recordings (~10-20 hours) could produce a dedicated OE voice model. More effort up front, but eliminates the two-step pipeline entirely.

Timeline

Early 2025: Google Neural2 + SSML IPA tested, partial success (~60% of sounds)
Mar 2025: Chirp 3 HD tested, discovered it ignores SSML entirely
Mar-Apr 2025: 6-round autotune respelling loop, hit ceiling
Apr 13, 2026: Inworld TTS-1.5-Max tested on Replicate, poor quality
Apr 13, 2026: Kokoro-82M tested, IPA syntax doesn't work on Replicate
Apr 13, 2026: espeak-ng + voice conversion pipeline conceived and tested
Apr 13, 2026: RVC and FreeVC voice conversion results generated — awaiting evaluation
Apr 14, 2026: Round 4 — Seed-VC (Whisper-based), SpeechT5-VC, F5-TTS voice cloning all tested locally
Apr 14, 2026: tacotron confirmed dead end (ARPAbet only, no IPA). EZ-VC install failed (Python 3.14 incompatible)
Apr 14, 2026: Neural2 respelling baseline generated for A/B comparison
Apr 14, 2026: 6 new audio samples per approach uploaded to R2 — 30+ samples total for Round 4
Apr 14, 2026: Round 5 — espeak voice variants (48 samples), F5-TTS respellings (8 samples), Seed-VC v2 with Annie variant
Apr 14, 2026: StyleTTS 2 + gruut confirmed dead (gruut maps OE to English phonemes). Piper broken on macOS ARM
Apr 14, 2026: MBROLA built from source but macOS lacks /proc for espeak integration
Apr 14, 2026: Round 5b — Simon Roper voice cloning. F5-TTS + Roper (12 samples including full poem lines), Seed-VC + Roper (3 samples)
Apr 14, 2026: Round 6 — EZ-VC (Xeus 4,000-language encoder) finally running on macOS. Python 3.11 + HF auth solved the install issues. 6 words converted with Roper voice reference
Apr 14, 2026: 90+ total audio samples across all rounds. Three viable pipelines identified