I've ported a low-memory, completely self-contained speech synthesizer to the ESP8266. No Internet connection or services are used, so you can use this in applications where a web service just isn't possible or desirable.
https://github.com/earlephilhower/ESP8266SAM
(Note: you will also need https://github.com/earlephilhower/ESP8266Audio to handle the I2S or delta-sigma sound output.)
Software Automatic Mouth (SAM) was an amazing speech synthesizer available on 8-bit CPUs in the early 80s. There were versions for the Atari 400, Commodore 64, and others. A fan converted it from 6502 assembly to C code and put it online ( https://github.com/s-macke/SAM ). I took his code, reworked the output so it sent bytes directly to the audio device instead of buffering, and moved what tables I could into PROGMEM.
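To illustrate the "send bytes directly instead of buffering" idea, here is a minimal sketch in plain C++. It is not the ESP8266SAM code: the names SampleSink, generateSquareWave, SinkStats, and countingSink are made up for this example, and a toy square-wave generator stands in for the synthesizer. The point is only that each sample is handed to a callback as soon as it is produced, so no RAM is spent holding the whole utterance.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sink signature: receives one sample at a time, plus a
// caller-supplied context pointer. On real hardware the sink would push
// the sample to the audio output.
using SampleSink = void (*)(uint8_t sample, void *context);

// Toy generator standing in for the synthesizer: emits a 4-bit square
// wave (values 15 and 0) and passes each sample straight to the sink
// instead of storing it in a buffer.
inline void generateSquareWave(size_t count, size_t halfPeriod,
                               SampleSink sink, void *context)
{
    if (halfPeriod == 0 || sink == nullptr) {
        return; // nothing sensible to generate
    }
    for (size_t i = 0; i < count; ++i) {
        uint8_t sample = ((i / halfPeriod) % 2 == 0) ? 15 : 0;
        sink(sample, context);
    }
}

// Example sink used for checking the flow: it only counts and sums the
// samples it receives.
struct SinkStats {
    size_t count;
    uint32_t sum;
};

inline void countingSink(uint8_t sample, void *context)
{
    SinkStats *stats = static_cast<SinkStats *>(context);
    stats->count++;
    stats->sum += sample;
}
```

With this shape, memory use stays constant no matter how long the output is, at the cost of the generator having to keep pace with the audio device.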
The quality is not stellar, but still amazing given the small memory footprint. All samples and waveform generation are only 4-bit (!!), and even with the CPU limits of the late 70s/early 80s it spoke in real time. While the ESP8266 has 100x the CPU horsepower of a 1-MHz 6502, it has less free memory than a C64 did, so this low memory usage is critical.
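To give a feel for what 4-bit audio means on a 16-bit DAC, here is a rough sketch of one way a 4-bit sample could be expanded to a signed 16-bit value. This is an assumption for illustration, not the scaling ESP8266SAM actually uses, and expand4BitSample is a hypothetical helper name.

```cpp
#include <cstdint>

// Hypothetical helper, not taken from ESP8266SAM: expand an unsigned
// 4-bit sample (0..15) to a signed 16-bit value centered on zero.
inline int16_t expand4BitSample(uint8_t sample4)
{
    sample4 &= 0x0F;                                        // keep only the low nibble
    int32_t centered = static_cast<int32_t>(sample4) - 8;   // now -8..7
    return static_cast<int16_t>(centered * 4096);           // now -32768..28672
}
```

There are only 16 possible output levels here, which is why the voice sounds coarse next to an MP3 but costs almost nothing in memory.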
If you need a fixed vocabulary for your project, I'd still use the MP3 ESP8266Audio class for much higher quality, but if you don't know beforehand what you'll need your ESP8266 to say, this is definitely usable.
Using the code is very simple:
#include <Arduino.h>
#include <ESP8266SAM.h>
#include <AudioOutputI2SDAC.h>

// Audio output shared by every call to Say()
AudioOutputI2SDAC *out = NULL;

void setup()
{
  // Create and start the I2S DAC output once at boot
  out = new AudioOutputI2SDAC();
  out->begin();
}

void loop()
{
  // Create a synthesizer, speak two phrases, then free it
  ESP8266SAM *sam = new ESP8266SAM;
  sam->Say(out, "Can you hear me now?");
  delay(500);
  sam->Say(out, "I can't hear you!");
  delete sam;
}