I am a machine learning researcher specialising in speech and audio. I am currently a Member of Technical Staff on the Audio team at Zyphra, where I help build large-scale text-to-speech and audio foundation models — including the ZONOS family of TTS systems — working across model architecture, data pipelines and evaluation.
I take ideas from research all the way to production: training and shipping models at scale on large GPU clusters, and grounding them in how humans actually perceive sound. This builds on my PhD in Computer Science from the University of Sheffield (perceptually-motivated speech enhancement and quality assessment) and a first-class BSc from Cardiff University, with published work spanning speech enhancement, speech quality prediction and ASR.
I am especially interested in the place where large speech models meet human perception — building systems that don't just score well on metrics, but genuinely sound right to people. Always happy to talk speech, audio and TTS: george@zyphra.com.
Interests
- Text-to-speech (TTS) & speech synthesis
- Audio foundation models
- Self-Supervised Speech Representations
- Speech Enhancement / Noise Reduction
- Machine perception and prediction of speech quality / intelligibility
- Automatic Speech Recognition (ASR)
- Human perception of digital audio
- Neural systems for hearing aids & edge devices
- Deepfake detection / adversarial attacks
Technical Skills
- Large-scale model training on multi-GPU clusters
- Python — PyTorch, SpeechBrain, NVIDIA NeMo, SciPy
- HuggingFace ecosystem / OpenAI API
- Distributed training & data pipelines for audio
- Linux / Bash scripting
- Git / GitHub / GitLab
- C++, Java, SQL, MATLAB
Experience & Qualifications
-
Sep 2025 — Present
Member of Technical Staff @ Zyphra
San Francisco, CA, USA
Research and development on ZONOS, a large-scale text-to-speech model, spanning model architecture, data pipelines and evaluation.
-
Nov 2024 — Sep 2025
Speech Data Scientist @ ConnexAI
Manchester, UK
Built and deployed production speech-processing systems, including data filtering for in-the-wild speech corpora and non-intrusive speech quality prediction.
-
May 2024 — Aug 2024
Yamaha Research and Development
(Internship)
Hamamatsu, Japan
Research internship applying machine learning to audio and music signal processing within an industrial R&D team. The project focused on modelling human perception of music spatiality.
-
Oct 2020 — Jan 2025
PhD Computer Science + Graduate Teaching Assistant
University of Sheffield, UK
Thesis: Perceptually Motivated Speech Enhancement
Researched neural speech enhancement guided by human perception, using speech quality metrics and self-supervised representations as loss functions. Authored 10+ first-author papers and taught undergraduate courses as a GTA.
-
Aug 2017 — Aug 2020
BSc Computer Science (First Class Honours)
Cardiff University, UK
Thesis: Majel - Voice control for Command Line Interfaces
Graduated with First Class Honours, with a final-year project building a speech-driven interface for the command line.
Papers & Publications
I am an author on 18 papers, 10 of them as first author. Together these have 161 citations, with an h-index of 8. Full list on Google Scholar.
-
ZONOS2 Technical Report
2026 Zyphra Technical Report
We present ZONOS2 8B, our latest TTS model, which achieves state-of-the-art naturalness, prosody, and voice cloning fidelity. We improve upon Zonos-v0.1 across scale, data, and training recipe. We scale the model from 1.6B to 8B total parameters (900M active) with a novel mixture-of-experts (MoE) backbone, improving inference latency and throughput. We expand our training corpus from 200K to over 6M hours using a new data processing pipeline, and we simplify our post-training and conditioning recipes to improve naturalness and voice cloning fidelity. We evaluate ZONOS2 8B on quality, speaker similarity, WER, and ZTTS1-Eval, our novel TTS benchmark, where it performs competitively with state-of-the-art systems while maintaining good streaming latency. We release our model weights and example inference code under an Apache 2.0 license on GitHub and Hugging Face.
-
WhiSQA: Non-Intrusive Speech Quality Prediction Using Whisper Encoder Features
2025 SPECOM 2025
There has been significant research effort developing neural-network-based predictors of speech quality (SQ) in recent years. While a primary objective has been to develop non-intrusive, i.e. reference-free, metrics to assess the performance of speech enhancement (SE) systems, recent work has also investigated the direct inference of neural SQ predictors within the loss function of downstream speech tasks. To aid in the training of SQ predictors, several large datasets of audio with corresponding human labels of quality have been created. Recent work in this area has shown that speech representations derived from large unsupervised or semi-supervised foundational speech models are useful input feature representations for neural SQ prediction. In this work, a novel and robust SQ predictor is proposed based on feature representations extracted from an ASR model, found to be a powerful input feature for the SQ prediction task. The proposed system achieves higher correlation with human MOS ratings than recent approaches on all NISQA test sets and shows significantly better domain adaptation compared to the commonly used DNSMOS metric.
-
Whilter: A Whisper-based Data Filter for "In-the-Wild" Speech Corpora Using Utterance-level Multi-Task Classification
2025 Interspeech 2025
arXiv Interspeech 2025 Dataset
Large-scale in-the-wild speech datasets have become more prevalent in recent years due to increased interest in models that can learn useful features from unlabelled data for tasks such as speech recognition or synthesis. These datasets often contain undesirable features, such as multiple speakers, non-target languages, and music, which may impact model learning. The Whilter model is proposed as a multitask solution to identify these undesirable samples. Whilter uses a Whisper encoder with an attention-based classifier to solve five diverse classification problems at once. In addition, an annotated dataset is published for a subset of two popular in-the-wild corpora. Whilter achieves F1 scores above 85% and equal error rates of 6.5% to 7.8% for three of five subtasks, outperforming a state-of-the-art BEATs classifier on speech-specific classes, with a notable decrease in processing time compared to a combination of single-task alternatives.
-
Hallucination in Perceptual Metric-Driven Speech Enhancement Networks
2024 EUSIPCO 2024
arXiv EUSIPCO 2024 Listening Test Examples
Within the area of speech enhancement, there is an ongoing interest in the creation of neural systems which explicitly aim to improve the perceptual quality of the processed audio. In concert with this is the topic of non-intrusive (i.e. without clean reference) speech quality prediction, for which neural networks are trained to predict human-assigned quality labels directly from distorted audio. When combined, these areas allow for the creation of powerful new speech enhancement systems which can leverage large real-world datasets of distorted audio, by taking inference of a pre-trained speech quality predictor as the sole loss function of the speech enhancement system. This paper aims to identify a potential pitfall with this approach, namely hallucinations which are introduced by the enhancement system 'tricking' the speech quality predictor.
-
Using Speech Foundational Models in Loss Functions for Hearing Aid Speech Enhancement
2024 EUSIPCO 2024
Machine learning techniques are an active area of research for speech enhancement for hearing aids, with one particular focus on improving the intelligibility of a noisy speech signal. Recent work has shown that feature encodings from self-supervised speech representation models can effectively capture speech intelligibility. In this work, it is shown that the distance between self-supervised speech representations of clean and noisy speech correlates more strongly with human intelligibility ratings than other signal-based metrics. Experiments show that training a speech enhancement model using this distance as part of a loss function improves the performance over using an SNR-based loss function, demonstrated by an increase in HASPI, STOI, PESQ and SI-SNR scores. This method takes inference of a high parameter count model only at training time, meaning the speech enhancement model can remain smaller, as is required for hearing aids.
-
Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition
2024 Interspeech 2024
One solution to automatic speech recognition (ASR) of overlapping speakers is to separate speech and then perform ASR on the separated signals. Commonly, the separator produces artefacts which often degrade ASR performance. Addressing this issue typically requires reference transcriptions to jointly train the separation and ASR networks. This is often not viable for training on real-world in-domain audio where reference transcript information is not always available. This paper proposes a transcription-free method for joint training using only audio signals. The proposed method uses embedding differences of pre-trained ASR encoders as a loss with a proposed modification to permutation invariant training (PIT) called guided PIT (GPIT). The method gives a 6.4% improvement in word error rate (WER) measures over a signal-level loss and also shows enhancement improvements in perceptual measures such as short-time objective intelligibility (STOI).
-
Non-Intrusive Speech Intelligibility Prediction for Hearing-Impaired Users using Intermediate ASR Features and Human Memory Models
2024 ICASSP 2024 🏆 2nd Place — Clarity Prediction Challenge 2
Technical Report arXiv ICASSP 2024
Neural networks have been successfully used for non-intrusive speech intelligibility prediction. Recently, the use of feature representations sourced from intermediate layers of pre-trained self-supervised and weakly-supervised models has been found to be particularly useful for this task. This work combines the use of Whisper ASR decoder layer representations as neural network input features with an exemplar-based, psychologically motivated model of human memory to predict human intelligibility ratings for hearing-aid users. Substantial performance improvement over an established intrusive HASPI baseline system is found, including on enhancement systems and listeners unseen in the training data, with a root mean squared error of 25.3 compared with the baseline of 28.7.
-
Multi-CMGAN+/+: Leveraging Multi-Objective Speech Quality Metric Prediction for Speech Enhancement
2024 ICASSP 2024
Neural network based approaches to speech enhancement have been shown to be particularly powerful, being able to leverage a data-driven approach to result in a significant performance gain versus other approaches. Such approaches are reliant on artificially created labelled training data such that the neural model can be trained using intrusive loss functions which compare the output of the model with clean reference speech. Performance of such systems when enhancing real-world audio often suffers relative to their performance on simulated test data. In this work, a non-intrusive multi-metric prediction approach is introduced, wherein a model is trained on artificial labelled data using inference of an adversarially trained metric prediction neural network. The proposed approach shows improved performance versus state-of-the-art systems on the recent CHiME-7 challenge unsupervised domain adaptation speech enhancement (UDASE) task evaluation sets.
-
CMGAN+/+: The University of Sheffield CHiME-7 UDASE Challenge Speech Enhancement System
2023 CHiME-7 UDASE Challenge 🏆 Challenge Entry
The CHiME-7 unsupervised domain adaptation speech enhancement (UDASE) challenge targets in-domain adaptation to unlabelled speech data. This paper describes the University of Sheffield team's system submitted for the challenge. A generative adversarial network (GAN) structure is employed as opposed to the unsupervised RemixIT method proposed in the baseline system. The system uses a conformer-based metric GAN (CMGAN) structure. The discriminator part of the GAN is trained to predict the output of a DNSMOS model. Data augmentation strategies are employed which enable training on historical training data as well as miscellaneous data from an additional generator. The proposed approach, referred to as CMGAN+/+, achieves significant improvement in DNSMOS evaluation metrics with the best proposed system achieving 3.40 OVR-MOS, an 18% improvement over the baselines.
-
Non-Intrusive Intelligibility Predictor for Hearing-Impaired Individuals using Self-Supervised Speech Representations
2023 SPARKS Workshop 2023
Self-supervised speech representations (SSSRs) have been successfully applied to a number of speech-processing tasks, e.g. as a feature extractor for speech quality (SQ) prediction, which is, in turn, relevant for assessing and training speech enhancement systems for users with normal or impaired hearing. However, exactly why and how quality-related information is encoded well in such representations remains poorly understood. In this work, techniques for non-intrusive prediction of SQ ratings are extended to the prediction of intelligibility for hearing-impaired users. It is found that self-supervised representations are useful as input features to non-intrusive prediction models, achieving competitive performance to more complex systems. A detailed analysis of the performance depending on Clarity Prediction Challenge 1 listeners and enhancement systems indicates that more data might be needed to allow generalisation to unknown systems and (hearing-impaired) individuals.
-
The Effect of Spoken Language on Speech Enhancement using Self-Supervised Speech Representation Loss Functions
2023 WASPAA 2023
Recent work in the field of speech enhancement (SE) has involved the use of self-supervised speech representations (SSSRs) as feature transformations in loss functions. However, in prior work, very little attention has been paid to the relationship between the language of the audio used to train the self-supervised representation and that used to train the SE system. Enhancement models trained using a loss function which incorporates a self-supervised representation that shares exactly the language of the noisy data used to train the SE system show better performance than those which do not match exactly. This may lead to enhancement systems which are language specific and as such do not generalise well to unseen languages, unlike models trained using traditional spectrogram or time domain loss functions. In this work, SE models are trained and tested on a number of different languages, with self-supervised representations which themselves are trained using different language combinations and with differing network structures as loss function representations. These models are then tested across unseen languages and their performances are analysed. It is found that the training language of the self-supervised representation appears to have a minor effect on enhancement performance; the amount of training data in a particular language, however, greatly affects performance.
-
Perceive and Predict: Self-Supervised Speech Representation Based Loss Functions for Speech Enhancement
2023 ICASSP 2023
arXiv ICASSP 2023 Audio Examples
Recent work in the domain of speech enhancement has explored the use of self-supervised speech representations to aid in the training of neural speech enhancement models. However, much of this work focuses on using the deepest or final outputs of self-supervised speech representation models, rather than the earlier feature encodings. The use of self-supervised representations in such a way is often not fully motivated. In this work it is shown that the distance between the feature encodings of clean and noisy speech correlates strongly with psychoacoustically motivated measures of speech quality and intelligibility, as well as with human Mean Opinion Score (MOS) ratings. Experiments using this distance as a loss function are performed and improved performance over the use of STFT spectrogram distance based loss as well as other common loss functions from speech enhancement literature is demonstrated using objective measures such as perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI).
-
PAMGAN+/-: Improving Phase-Aware Speech Enhancement Performance via Expanded Discriminator Training
2023 154th AES Convention 🏆 Student Technical Paper Award
Recent speech enhancement work, which makes use of neural networks trained with a loss derived in part using an adversarial metric prediction network, has been shown to be very effective. However, by limiting the data used to train this metric prediction network to only the clean reference and the output of the speech enhancement network, only a limited range of the metric is learnt. Additionally, such speech enhancement systems are limited because they typically operate solely over magnitude spectrogram representations so they do not encode phase information. In this work, recent developments for phase-aware speech enhancement in such an adversarial framework are expanded in two ways to enable the metric prediction network to learn a full range of metric scores. Firstly, the metric predictor is also exposed to unenhanced 'noisy' data during training. Furthermore, an additional network is introduced and trained alongside, which attempts to produce outputs with a fixed 'lower' target metric score, exposing the metric predictor to these 'de-enhanced' outputs. It is found that performance increases versus a baseline system utilising a magnitude spectrogram speech enhancement network.
-
Non-Intrusive Speech Intelligibility Metric Prediction for Hearing-Impaired Individuals — Clarity Prediction Challenge 1
2022 Interspeech 2022
This paper proposes neural models to predict Speech Intelligibility (SI), both by prediction of established SI metrics and of human speech recognition (HSR) on the 1st Clarity Prediction Challenge. Both intrusive and non-intrusive predictors for intrusive SI metrics are trained, then fine-tuned on the HSR ground truth. Results are reported on a number of SI metrics, and the model choice for the Clarity challenge submission is explained. Additionally, the relationship between the SI scores in the data and commonly used signal processing metrics which approximate SI is analysed, and some issues emerging from this relationship are discussed. It is found that intrusive neural predictors of SI metrics, when fine-tuned on the true HSR scores, outperform the non-neural challenge baseline.
-
MetricGAN+/-: Increasing Robustness of Noise Reduction on Unseen Data
2022 EUSIPCO 2022
arXiv EUSIPCO 2022 Audio Examples
Training of speech enhancement systems often does not incorporate knowledge of human perception and thus can lead to unnatural sounding results. Incorporating psychoacoustically motivated speech perception metrics as part of model training via a predictor network has recently gained interest. However, the performance of such predictors is limited by the distribution of metric scores that appear in the training data. In this work, we propose MetricGAN+/- (an extension of MetricGAN+, one such metric-motivated system) which introduces an additional network — a "de-generator" which attempts to improve the robustness of the prediction network (and by extension of the generator) by ensuring observation of a wider range of metric scores in training. Experimental results on the VoiceBank-DEMAND dataset show relative improvement in PESQ score of 3.8% (3.05 vs 3.22 PESQ score), as well as better generalisation to unseen noise and speech.
Talks & Presentations
- AES [TC-MLAI] 2022 Talk — "Teaching AI to hear like we do: psychoacoustics in machine learning"