Improve offline TTS quality

Nice, neat little offline TTS app. I cleaned it up and upgraded it so that it (1) picks higher-quality on-device voices first, (2) sets the utterance lang explicitly to improve pronunciation, (3) splits long lines into shorter utterances (more natural, and it avoids engine clipping), and (4) chains utterances with short pauses for natural pacing. I also added voice-ranking heuristics and onstart/onend hooks, and I explain below how to get the best result on Android.

Two quick tips before the code:

Install a modern, high-quality on-device engine (e.g. Google Text-to-Speech or your device's OEM TTS) and download its high-quality voice data in Android Settings → System → Language & input → Text-to-speech output. That usually has the biggest impact.
The Web Speech API does not expose every advanced feature of Android's native TTS (SSML, neural/WaveNet voice options). For the absolute best quality you would use a native Android app that calls the TTS API directly. The improvements below still give a big boost while staying pure HTML/JS.
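
To check what the first tip actually gave you, it helps to list the voices the browser reports. Here is a minimal sketch, assuming you hand it the array returned by speechSynthesis.getVoices(). The name summarizeVoices is a hypothetical helper of mine, not part of the Web Speech API; it only reads the standard name, lang, and localService fields.

```javascript
// Hypothetical helper: group voice labels by language and mark on-device voices.
// `voices` is any array of objects shaped like SpeechSynthesisVoice.
function summarizeVoices(voices) {
  const byLang = {};
  for (const v of voices) {
    const lang = v.lang || 'unknown';
    const label = v.localService ? `${v.name} (on-device)` : `${v.name} (network)`;
    (byLang[lang] = byLang[lang] || []).push(label);
  }
  return byLang;
}

// In the browser console (the list can be empty until the voices have loaded):
// console.table(summarizeVoices(speechSynthesis.getVoices()));
```

If the language you care about shows no on-device entry, the voice data probably still needs to be downloaded in the Android settings.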

Paste this full HTML over your file (keeps your UI but upgrades behavior):

html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8" />
  <title>Interactive Multi-Lang TTS with Controls (improved)</title>
  <style>
    body { font-family: sans-serif; padding: 1em; max-width: 900px; margin: auto; }
    ul { list-style: none; padding: 0; }
    li { padding: 0.5em; cursor: pointer; border-bottom: 1px solid #eee; }
    li:hover { background: #eef; }
    .active { background: #cce; }
    label, select, input { display: block; margin: 0.5em 0; }
    .controls { display:flex; gap:1rem; flex-wrap:wrap; align-items:center; }
  </style>
</head>
<body>
  <h1>Upload a .txt file & speak selected sentences (improved)</h1>
  <input type="file" id="fileInput" accept=".txt">

  <div class="controls">
    <div>
      <label for="voiceSelect">Voice:</label>
      <select id="voiceSelect"></select>
      <small id="voiceNotes"></small>
    </div>

    <div>
      <label for="rate">Rate: <span id="rateVal">1</span></label>
      <input type="range" id="rate" min="0.6" max="1.2" value="1" step="0.05">
    </div>

    <div>
      <label for="pitch">Pitch: <span id="pitchVal">1</span></label>
      <input type="range" id="pitch" min="0.7" max="1.3" value="1" step="0.05">
    </div>

    <div>
      <label for="pauseMs">Pause between chunks (ms): <span id="pauseVal">120</span></label>
      <input type="range" id="pauseMs" min="0" max="500" value="120" step="10">
    </div>
  </div>

  <ul id="sentenceList"></ul>

  <script>
    const fileInput = document.getElementById('fileInput');
    const sentenceList = document.getElementById('sentenceList');
    const voiceSelect = document.getElementById('voiceSelect');
    const voiceNotes = document.getElementById('voiceNotes');
    const rate = document.getElementById('rate'), rateVal = document.getElementById('rateVal');
    const pitch = document.getElementById('pitch'), pitchVal = document.getElementById('pitchVal');
    const pauseMs = document.getElementById('pauseMs'), pauseVal = document.getElementById('pauseVal');
    const synth = window.speechSynthesis;

    // Preferred languages (adjust order if you want)
    const prefLangs = ['ja-JP','zh-TW','zh-Hant','zh-CN','en-US','en-GB','en'];

    // Utility: safer get voices across engines
    function getVoicesSafe(){
      const v = synth.getVoices();
      return v || [];
    }

    function rankVoices(voices){
      // Score voices by:
      // 1) localService (on-device) preferred
      // 2) language match with preferences
      // 3) small boost if the name contains an engine hint from the list below (heuristic)
      const engineBoostNames = ['Google','WaveNet','Samsung','neural'];

      function langScore(v){
        for (let i=0;i<prefLangs.length;i++){
          if (v.lang && v.lang.startsWith(prefLangs[i])) return (prefLangs.length - i) * 10;
        }
        if (v.lang && v.lang.startsWith('en')) return 2;
        return 0;
      }

      return voices
        .map(v => {
          let score = 0;
          if (v.localService) score += 30;
          score += langScore(v);
          const name = (v.name || '').toLowerCase();
          engineBoostNames.forEach(n => { if (name.includes(n.toLowerCase())) score += 6; });
          if (v.default) score += 4;
          return {v, score};
        })
        .sort((a,b) => b.score - a.score)
        .map(x => x.v);
    }

    function populateVoices(){
      const voices = getVoicesSafe();
      if (!voices.length) return;

      const ordered = rankVoices(voices);
      voiceSelect.innerHTML = '';
      ordered.forEach(v => {
        const opt = document.createElement('option');
        opt.value = v.name + '||' + v.lang; // keep small unique key
        opt.textContent = `${v.name} (${v.lang})${v.localService ? ' • on-device' : ''}${v.default ? ' • default' : ''}`;
        voiceSelect.appendChild(opt);
      });

      voiceNotes.textContent = 'Prefer an on-device voice for best offline quality. Download voice data in Android settings if needed.';
    }

    populateVoices();
    if (synth.onvoiceschanged !== undefined) synth.onvoiceschanged = populateVoices;

    fileInput.addEventListener('change', e=>{
      const file = e.target.files[0];
      if (!file) return;
      const reader = new FileReader();
      // attach the handlers before starting the read
      reader.onload = () => {
        const lines = reader.result.split(/\r?\n/).map(l => l.trim()).filter(l => l);
        sentenceList.innerHTML = '';
        lines.forEach(line=>{
          const li = document.createElement('li');
          li.textContent = line;
          li.addEventListener('click', ()=> speakFiltered(line, li));
          sentenceList.appendChild(li);
        });
      };
      reader.onerror = () => alert('Error reading file.');
      reader.readAsText(file);
    });

    rate.addEventListener('input', ()=> rateVal.textContent = rate.value);
    pitch.addEventListener('input', ()=> pitchVal.textContent = pitch.value);
    pauseMs.addEventListener('input', ()=> pauseVal.textContent = pauseMs.value);

    // Preprocess text for better pronunciation: remove parenthetical notes
    // (ASCII or full-width parentheses), collapse whitespace, and normalize
    // ellipses (you can expand this).
    function preprocessText(text){
      let t = text.replace(/\s*[(（].*?[)）]/g, '').trim();
      // collapse runs of whitespace into a single space
      t = t.replace(/\s+/g, ' ');
      // reduce an ellipsis ("..." or "…") to one period so it is read as a single stop
      t = t.replace(/\.{3,}|…+/g, '.');
      return t;
    }

    // Break into natural chunks (prefer splitting on sentence punctuation).
    // Also ensure chunks are not overly long (maxChunk chars).
    function splitToChunks(text, maxChunk = 160){
      // split on punctuation that ends sentences (keep the punctuation)
      const parts = text.match(/[^.!?。！？]+[.!?。！？]*/g) || [text];
      const chunks = [];
      for (let p of parts){
        p = p.trim();
        if (!p) continue;
        if (p.length <= maxChunk){
          chunks.push(p);
        } else {
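          // fallback: split on commas/semicolons, then cut by length if a clause is still too long
          const sub = p.split(/[,，;；]/).map(s=>s.trim()).filter(Boolean);
          let cur = '';
          sub.forEach(s=>{
            if ((cur + ' ' + s).trim().length <= maxChunk){
              cur = (cur + ' ' + s).trim();
            } else {
              if (cur) chunks.push(cur);
              cur = s;
            }
            // a single clause longer than maxChunk: hard-split it (may cut mid-word)
            while (cur.length > maxChunk){
              chunks.push(cur.slice(0, maxChunk));
              cur = cur.slice(maxChunk).trim();
            }
          });
          if (cur) chunks.push(cur);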
        }
      }
      return chunks;
    }

    // Select best voice based on ranking and lang heuristic
    function selectBestVoiceForText(desiredLang){
      const voices = getVoicesSafe();
      if (!voices.length) return null;
      const ordered = rankVoices(voices);
      // prefer exact language prefix match
      for (const v of ordered){
        if (desiredLang && v.lang && v.lang.startsWith(desiredLang)) return v;
      }
      // fallback: prefer same major language (e.g., 'en' from 'en-US')
      if (desiredLang){
        const major = desiredLang.split('-')[0];
        for (const v of ordered){
          if (v.lang && v.lang.split('-')[0] === major) return v;
        }
      }
      return ordered[0] || voices[0];
    }

    // Chain multiple utterances with small pause between them for naturalness
    let speakSession = 0; // bumped per request so a cancelled chain cannot resume
    function speakChunksSequentially(chunks, options){
      if (!chunks || !chunks.length) return;
      const session = ++speakSession;
      synth.cancel(); // stop any current speaking

      let index = 0;
      const speakNext = ()=>{
        if (session !== speakSession || index >= chunks.length) return;
        const u = new SpeechSynthesisUtterance(chunks[index]);
        if (options.voice) u.voice = options.voice;
        if (options.lang) u.lang = options.lang;
        u.rate = options.rate || 1;
        u.pitch = options.pitch || 1;
        u.volume = options.volume !== undefined ? options.volume : 1;

        // optional events
        u.onstart = ()=> {
          if (options.onstart) options.onstart(index, chunks[index]);
        };
        u.onend = ()=>{
          if (session !== speakSession) return; // a newer request took over
          if (options.onend) options.onend(index, chunks[index]);
          index++;
          // small pause before next chunk to allow engine to breathe
          if (index < chunks.length){
            // ?? (not ||) so a pause of 0 from the slider is respected
            setTimeout(speakNext, options.pauseMs ?? 100);
          }
        };
        u.onerror = (e)=> {
          // cancel() ends the previous utterance with these codes; not a real failure
          if (e.error === 'interrupted' || e.error === 'canceled') return;
          console.error('TTS error', e);
          if (options.onerror) options.onerror(e);
        };

        synth.speak(u);
      };

      speakNext();
    }

    // main speak function wired to list items
    function speakFiltered(text, li){
      sentenceList.querySelectorAll('li').forEach(el => el.classList.remove('active'));
      li.classList.add('active');

      const filtered = preprocessText(text);
      if (!filtered) return;

      // use the voice picked in the dropdown; its lang becomes the utterance lang
      let voice = null;
      let langHint = null;
      const selected = voiceSelect.value;
      if (selected){
        const [name, lang] = selected.split('||');
        voice = getVoicesSafe().find(v => v.name === name && v.lang === lang) || null;
        if (voice) langHint = voice.lang;
      }

      // no usable selection: detect a basic language by Unicode ranges (very simple)
      if (!langHint){
        // test kana before Han, because Japanese text usually contains kanji as well
        if (/[ぁ-ゔゞァ-・ヽヾ゛゜]/.test(filtered)) langHint = 'ja-JP';
        else if (/[一-龯]/.test(filtered)) langHint = 'zh-TW';
        else langHint = 'en-US';
        voice = selectBestVoiceForText(langHint);
      }
      // Split into shorter chunks for improved naturalness
      const chunks = splitToChunks(filtered, 140);

      // speak with small pause
      speakChunksSequentially(chunks, {
        voice,
        lang: langHint,
        rate: parseFloat(rate.value),
        pitch: parseFloat(pitch.value),
        pauseMs: parseInt(pauseMs.value,10),
        onstart: (i, txt) => { /* optional UI hooks */ },
        onend: (i, txt) => { /* optional UI hooks */ }
      });
    }

    // Expose a test voice button for debugging (optional)
    // You could add a "Test voice" button bound to this.
    function testVoices(){
      const voices = getVoicesSafe();
      console.log('Available voices:', voices);
    }

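    // small UX: allow pressing space on a selected li to speak again
    // (skipped while a form control has focus, so the sliders and dropdown keep their own Space handling)
    document.addEventListener('keydown', e=>{
      if (e.key === ' ' && !/^(INPUT|SELECT|TEXTAREA|BUTTON)$/.test(e.target.tagName)){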
        const active = sentenceList.querySelector('.active');
        if (active) {
          e.preventDefault();
          speakFiltered(active.textContent, active);
        }
      }
    });
  </script>
</body>
</html>

What changed & why it helps

Prefer on-device voices (localService): these are usually the high-quality offline voices you download in Android settings.
Explicit lang on utterances: telling the engine the language improves pronunciation, especially for bilingual lines.
Chunking into short utterances: many engines produce more natural intonation on shorter segments and are less likely to cut off or garble long paragraphs.
Small pauses between chunks: these give more natural phrasing and keep consecutive chunks from running into each other.
Voice ranking heuristics: the ranking automatically favors voices whose names suggest a modern, high-quality engine (a heuristic you can tune).
UI controls for pause, rate, and pitch: these let you fine-tune per language and voice.
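
If you want to tune the ranking, one option is to pull the numbers out into a weights object. This is only a sketch: scoreVoice and DEFAULT_WEIGHTS are hypothetical names, the defaults copy the values used in rankVoices on the page, and the small "any English voice" fallback from the page is left out to keep it short.

```javascript
// Hypothetical weights; the defaults copy the numbers used in rankVoices on the page.
const DEFAULT_WEIGHTS = { local: 30, langStep: 10, nameHint: 6, isDefault: 4 };

// Score one voice-like object. `prefLangs` is ordered from most to least preferred,
// `hints` is a list of substrings to look for in the voice name.
function scoreVoice(voice, prefLangs, hints, w = DEFAULT_WEIGHTS) {
  let score = voice.localService ? w.local : 0;
  const i = prefLangs.findIndex(p => (voice.lang || '').startsWith(p));
  if (i !== -1) score += (prefLangs.length - i) * w.langStep;
  const name = (voice.name || '').toLowerCase();
  for (const h of hints) {
    if (name.includes(h.toLowerCase())) score += w.nameHint;
  }
  if (voice.default) score += w.isDefault;
  return score;
}

const prefs = ['ja-JP', 'en-US'];
const hints = ['Google', 'neural'];
const local = { name: 'Plain voice', lang: 'ja-JP', localService: true };
const remote = { name: 'Google 日本語', lang: 'ja-JP', localService: false };
scoreVoice(local, prefs, hints);  // 30 + 20 = 50
scoreVoice(remote, prefs, hints); // 20 + 6 = 26
```

With the default weights the on-device voice wins. Raising nameHint or lowering local is how you would shift the balance if a particular engine sounds better on your phone.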

Further improvements (if you want to go further)

If you need SSML (breaks, emphasis, prosody), Web Speech API support for it is inconsistent across browsers and engines. For fuller SSML support and WaveNet/neural voice options, build a small native Android app on the TextToSpeech API, which gives more direct access to engine features.
Add language detection (e.g., fr, es) with a small library or server-side detection if you frequently mix languages.
Expand the preprocessing rules (numbers, dates, abbreviations) so they are turned into spoken words for more natural output.
If you can install a specific engine on the phone (e.g. Google TTS), prefer voice names that mention that engine; they usually sound noticeably better.
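
For the preprocessing idea, a table of pattern/replacement pairs is usually enough to start with. The sketch below is English-only and purely illustrative: expandForSpeech and the entries in SPOKEN_FORMS are hypothetical, so swap in whatever shows up in your own text files.

```javascript
// Hypothetical English-only table of [pattern, spoken form]; extend it per language.
const SPOKEN_FORMS = [
  [/\bDr\.(?=\s)/g, 'Doctor'],
  [/\be\.g\.(?=[\s,])/g, 'for example'],
  [/\betc\./g, 'et cetera'],
  [/(\d+)\s*%/g, '$1 percent'],
  [/(\d+)\s*km\b/g, '$1 kilometers'],
];

// Apply every rule in order and return the expanded text.
function expandForSpeech(text) {
  return SPOKEN_FORMS.reduce((t, [pattern, spoken]) => t.replace(pattern, spoken), text);
}

expandForSpeech('Dr. Lee ran 5 km, e.g. 50% more.');
// → 'Doctor Lee ran 5 kilometers, for example 50 percent more.'
```

Running this before splitToChunks has a side benefit: the sentence splitter breaks on every period, so expanding "Dr." and "e.g." first keeps those from being cut into separate chunks.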

If you want, I can:

add an SSML fallback for browsers that support it,
produce a native Android snippet (Kotlin) that shows how to call the TTS engine with SSML/neural options, or
tweak the chunking rules for specific languages (Japanese, Chinese need different segmentation).
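
As a starting point for that last item, here is a sketch that uses the standard Intl.Segmenter API for sentence boundaries and falls back to the page's punctuation regex where it is missing. splitSentences is a hypothetical name, and I'm assuming the locale you pass in matches the utterance lang.

```javascript
// Sketch: locale-aware sentence splitting with a regex fallback.
function splitSentences(text, locale) {
  if (typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function') {
    const segmenter = new Intl.Segmenter(locale, { granularity: 'sentence' });
    // Each item yielded by segment() has a `segment` string.
    return Array.from(segmenter.segment(text), s => s.segment.trim()).filter(Boolean);
  }
  // Older browsers: the same punctuation-based split the page already uses.
  return (text.match(/[^.!?。！？]+[.!?。！？]*/g) || [text])
    .map(s => s.trim())
    .filter(Boolean);
}

// Example call: splitSentences('今日は晴れです。明日は雨でしょう。', 'ja');
```

The same API also offers a 'word' granularity, which might be a better place to cut long Japanese or Chinese sentences than a fixed character count, though I have not tuned that here.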

Which of those would help you next?
