Hello. About 15 years ago, I wanted to listen to web news and blogs via text-to-speech while doing other things, so I created a reading app. That app started me on a journey that ended with me making my living from apps.
Because of this, I’ve always made it a point to check the quality of the latest text-to-speech engines as they come out. For example, I thought Amazon Polly, which was released a few years ago, sounded quite natural, and I’ve noticed that the text-to-speech on the iPhone has been getting progressively better too.
Recently, a text-to-speech API from the makers of ChatGPT was released, and in English it sounds almost indistinguishable from a human, which I found astonishing. It's good enough that it could easily be used for recording announcements or similar tasks.
However, while I could listen to demos in English on the ChatGPT site, I couldn’t find any demos for Japanese text-to-speech. So, I ended up installing a Python environment on my Mac, something I’m not very familiar with, to actually call the API and make it read aloud in Japanese.
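For anyone who wants to try the same thing, here is a minimal sketch of what such a call might look like using only Python's standard library. The endpoint URL, the model name `tts-1`, the voice name `alloy`, and the output file name are assumptions on my part rather than details from my original script, so check them against the current API documentation. The request is only sent when an `OPENAI_API_KEY` environment variable is set.

```python
import json
import os
import urllib.request

# Assumed endpoint for the speech API; verify against the current documentation.
API_URL = "https://api.openai.com/v1/audio/speech"


def build_speech_request(text, api_key, voice="alloy", model="tts-1"):
    """Build (but do not send) a POST request asking for spoken audio of `text`."""
    # "tts-1" and "alloy" are assumed defaults, not values taken from the article.
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(text, path, api_key):
    """Send the request and save the returned audio bytes to `path`."""
    request = build_speech_request(text, api_key)
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())


# Only talk to the network when a key is actually configured.
if __name__ == "__main__" and os.environ.get("OPENAI_API_KEY"):
    synthesize_to_file(
        "こんにちは。これは日本語の読み上げテストです。",
        "japanese_sample.mp3",  # hypothetical output file name
        os.environ["OPENAI_API_KEY"],
    )
```

Keeping the request-building step separate from the network call makes it easy to inspect exactly what would be sent before spending any API credits.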
I'm updating an app I developed, which reads aloud promotional text, and I've been comparing ChatGPT's text-to-speech engine with the iPhone's (the high-quality version you can download from the settings), in both English and Japanese.
To sum it up, ChatGPT’s English text-to-speech is incredible. Truly natural. It makes Amazon Polly, released a few years ago, seem less impressive by comparison. There’s intonation, so it doesn’t sound monotonous. However, the Japanese version still sounds a bit mechanical compared to its English counterpart, with odd intonations in places.
That said, ChatGPT's voice API seems to be optimized for English, so Japanese and other languages may improve significantly in the future. AI has been evolving astonishingly fast lately, so these improvements could arrive as soon as next month, or they could take a few years.
I was researching this because I was considering integrating it into my app, which I've been diligently updating. If it's significantly better than the iPhone's native text-to-speech engine, it might be worth it.
I made a video to gather a wide range of opinions on this. I wanted to see if most people think it’s not much different from the iPhone’s engine or if many would be willing to pay extra for ChatGPT’s quality.
Text-to-speech engines might feel odd at first, but as you listen to web articles or books daily, you get used to them, much like getting used to a dialect, and the strangeness fades away. Personally, I've come to think that I wouldn't pay extra to use ChatGPT for my own listening.
However, it might be useful for creating voice files for videos or work-related announcements. For everyday use, I might stick with the iPhone’s engine and switch to ChatGPT’s high-quality voice for specific tasks. There are no licensing issues to worry about.
Additionally, while ChatGPT's API lets you change the speech speed and the speaker, it currently doesn't offer parameters for emotions. Some other modern speech engines can convey emotions, such as speaking in a sad or cheerful tone, which is impressive.
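To compare speakers and speeds side by side, one option is to generate a small grid of request payloads, one per combination. The sketch below only builds the payloads and file names and sends nothing. The field names `voice` and `speed`, the voice names, the model name `tts-1`, and the 0.25 to 4.0 speed range are my assumptions about the API, and `build_variants` is a hypothetical helper. Note that there is no emotion field to set.

```python
from itertools import product

# Assumed values; check the API documentation for the real voice list and limits.
VOICES = ["alloy", "nova"]
SPEEDS = [0.75, 1.0, 1.5]
MIN_SPEED, MAX_SPEED = 0.25, 4.0


def build_variants(text, voices=VOICES, speeds=SPEEDS, model="tts-1"):
    """Return (filename, payload) pairs, one per voice/speed combination."""
    variants = []
    for voice, speed in product(voices, speeds):
        # Reject speeds outside the assumed supported range before any request is made.
        if not MIN_SPEED <= speed <= MAX_SPEED:
            raise ValueError(f"speed {speed} is outside {MIN_SPEED}-{MAX_SPEED}")
        payload = {"model": model, "input": text, "voice": voice, "speed": speed}
        filename = f"{voice}_{speed:.2f}x.mp3"
        variants.append((filename, payload))
    return variants


if __name__ == "__main__":
    for filename, payload in build_variants("Hello, this is a test."):
        print(filename, payload["voice"], payload["speed"])
```

Each payload can then be sent as the JSON body of a speech request, and the resulting files listened to in order.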
I'm also curious about how much latency there is in the server's response when using the API, although ChatGPT's API is said to be designed to minimize delays and respond in real time. This will require further testing, and I'll write more about it once I start integrating it.
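For that testing, the number that probably matters most for a reading app is how long it takes before the first audio bytes arrive, not just the total download time. Below is a rough sketch of how that could be measured for any stream of byte chunks. The helpers `measure_stream` and `read_in_chunks` are hypothetical names of my own, and the demo uses an in-memory buffer in place of a real HTTP response.

```python
import io
import time


def read_in_chunks(stream, size=4096):
    """Yield successive chunks from a file-like object until it is exhausted."""
    while True:
        chunk = stream.read(size)
        if not chunk:
            break
        yield chunk


def measure_stream(chunks, clock=time.perf_counter):
    """Consume `chunks` and report time to first chunk, total time, and byte count."""
    start = clock()
    first_chunk_at = None
    total_bytes = 0
    for chunk in chunks:
        if first_chunk_at is None:
            first_chunk_at = clock()  # moment the first audio bytes arrived
        total_bytes += len(chunk)
    end = clock()
    return {
        "time_to_first_chunk": None if first_chunk_at is None else first_chunk_at - start,
        "total_time": end - start,
        "total_bytes": total_bytes,
    }


if __name__ == "__main__":
    # Stand-in for a real response body; an HTTP response object with a
    # read(size) method could be passed to read_in_chunks in the same way.
    fake_audio = io.BytesIO(b"\x00" * 10000)
    print(measure_stream(read_in_chunks(fake_audio)))
```

If playback can begin as soon as the first chunk arrives, the time to first chunk is what the listener actually experiences as delay.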