Basically what you need is a low-latency GPIO interrupt, yes? I don't think there is hardware support for mapping external inputs to the NMI (I will check this with someone who knows the hardware better, but I'm pretty sure it's not possible). It's going to get a lot better on the ESP32 (configurable interrupt matrix, more interrupt priority levels), but there's not a lot of interrupt configurability on the 8266, AFAIK.
However, you can drop a fair few cycles off the GPIO interrupt latency if you replace the default exception handlers: move VECBASE to point to your own set of vectors. Then your handler checks immediately for your low-latency event (GPIOs changed) at the top of UserExceptionVector and jumps straight to the USB handler. If the USB handler is in assembler then you can probably get away with saving only a subset of registers, which saves even more cycles. The other exception cases fall through and jump back to the default SDK exception handlers. It'll be a bit fiddly. In particular, you can't easily fit a lot of literals in the exception vector space before you need to jump out somewhere else, so you'll probably need some creative tricks to jam them into the gaps. But I imagine you're more than up to it.
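To make the routing decision above concrete, here is a rough model of it in plain C that runs on a host machine. It is not vector code: the real check would be a handful of Xtensa assembler instructions at the top of UserExceptionVector. The struct, the enum and the function name are all made up for illustration. The two constants are assumptions on my part as well (EXCCAUSE 4 for a level-1 interrupt, and interrupt number 4 for GPIO, which is what ETS_GPIO_INUM is in the SDK headers as far as I know), so check them against your SDK before relying on them.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: EXCCAUSE value the Xtensa core reports for a level-1 interrupt. */
#define EXCCAUSE_LEVEL1_INTERRUPT 4u
/* Assumption: interrupt number of the GPIO source (ETS_GPIO_INUM in the SDK). */
#define GPIO_INUM 4u

/* Hypothetical: where the custom vector would jump next. */
typedef enum { ROUTE_FAST_GPIO, ROUTE_SDK_DEFAULT } route_t;

/* Hypothetical snapshot of the special registers the vector would read. */
typedef struct {
    uint32_t exccause;   /* cause of the exception that was taken */
    uint32_t interrupt;  /* pending interrupt bits */
    uint32_t intenable;  /* enabled interrupt bits */
} cpu_state_t;

/* Decide the route: fast path only for an enabled, pending GPIO interrupt.
 * Everything else falls through to the default SDK handlers. */
route_t user_exception_route(const cpu_state_t *s)
{
    if (s->exccause != EXCCAUSE_LEVEL1_INTERRUPT) {
        return ROUTE_SDK_DEFAULT;   /* a real exception, not an interrupt */
    }
    uint32_t pending = s->interrupt & s->intenable;
    if (pending & (1u << GPIO_INUM)) {
        return ROUTE_FAST_GPIO;     /* jump straight to the USB handler */
    }
    return ROUTE_SDK_DEFAULT;       /* some other level-1 interrupt */
}

/* Small self-check showing the intended behaviour of the model. */
void route_examples(void)
{
    cpu_state_t gpio = { EXCCAUSE_LEVEL1_INTERRUPT, 1u << GPIO_INUM, 1u << GPIO_INUM };
    cpu_state_t other = { 0u, 1u << GPIO_INUM, 1u << GPIO_INUM };
    assert(user_exception_route(&gpio) == ROUTE_FAST_GPIO);
    assert(user_exception_route(&other) == ROUTE_SDK_DEFAULT);
}
```

The point of doing the test first is that the common non-GPIO cases pay only a couple of extra instructions before landing in the stock handlers, while the GPIO case skips the SDK's generic dispatch entirely.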
Does that make sense? Maybe you tried it already and it doesn't help? Let me know if you'd like me to go into more detail. The best sample 8266 exception vectors I know of (non-Espressif-approved) are the ones we reimplemented in the esp-open-rtos project: https://github.com/SuperHouse/esp-open- ... _vectors.S . The situation will (again) be better here for the ESP32: we plan to make the exception vector and startup code available on GitHub along with a lot of other code.
I am still mulling this over, trying to find the least invasive mechanism to insert my own interrupt handlers. Sadly, I can't even figure out which exception handles the GPIO (it is an exception, right?). I will hopefully poke and prod around in my ESP's memory soon.