I was also thinking of something along the lines of "get one byte, send it, get the next byte" and reconstituting the image at the receiving side. It's just that my limited experience doesn't tell me whether either of those two approaches is manageable.
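In case it helps make the idea concrete, here's a minimal sketch of that byte-by-byte approach on an Arduino-style core. readCameraByte() is a made-up placeholder for whatever would actually grab a byte off the camera bus, and 115200 baud is just an example link to the receiving side:

```cpp
#include <Arduino.h>

// Hypothetical stand-in for sampling one pixel byte from the camera bus.
// In reality this would read D0-D7 at the right moment (see the edit below).
static uint8_t readCameraByte() {
  return 0x00;  // placeholder value
}

void setup() {
  Serial.begin(115200);  // example link to the receiving side
}

void loop() {
  uint8_t b = readCameraByte();  // get one byte from the camera
  Serial.write(b);               // send it on immediately, then repeat
}
```

The receiving side would then just append the incoming bytes in order to rebuild the image.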
I'm in the process of trying to program the ESP8266, but I haven't been able to upload to it yet because my USB-TTL cable hasn't arrived and I couldn't get the Arduino-as-ISP approach to work.
Edit: a bit of extra info about the way the camera works, taken from http://embeddedprogrammer.blogspot.com/2012/07/hacking-ov7670-camera-module-sccb-cheat.html. The falling edge of one signal (VSYNC) indicates the start of a frame, and the frame is output while VSYNC is low. The rising edge of another signal (HREF) indicates the start of a line, and the line is output while HREF is high. The data lines (D0-D7, one byte) are then sampled on the rising edge of the pixel clock, which runs either at the driving frequency (e.g. 8 MHz) or at that frequency scaled down by a factor.

From this I'm guessing the ESP8266 would basically need to wait for the right conditions (VSYNC low, HREF high) and then start sampling D0-D7 at 8 MHz. If I understand correctly, D0-D7 must all be read before the next rising edge of the pixel clock, so if the pins are read one at a time the ESP would have to access them at a rate of at least 8 MHz * 8 pins = 64 MHz (unless all eight can be grabbed in a single port/register read). That is for VGA mode; perhaps the lower-resolution modes behave somewhat differently.
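To put that in rough code form, here's a sketch of the capture sequence for an Arduino-style core. The pin numbers and buffer size are assumptions, and digitalRead() is nowhere near fast enough for an 8 MHz pixel clock (the clock would have to be prescaled way down, or the pins read directly from the GPIO registers), but it shows the VSYNC/HREF/PCLK ordering I mean:

```cpp
#include <Arduino.h>

// Assumed pin mapping -- placeholders, not a tested wiring.
const int PIN_VSYNC = 2;
const int PIN_HREF  = 3;
const int PIN_PCLK  = 4;
const int PIN_D[8]  = {5, 6, 7, 8, 9, 10, 11, 12};  // D0..D7

void setup() {
  pinMode(PIN_VSYNC, INPUT);
  pinMode(PIN_HREF, INPUT);
  pinMode(PIN_PCLK, INPUT);
  for (int i = 0; i < 8; i++) pinMode(PIN_D[i], INPUT);
}

// Read pixel bytes into buf, following the VSYNC/HREF/PCLK ordering
// described above. Returns the number of bytes captured.
size_t captureFrame(uint8_t *buf, size_t len) {
  size_t i = 0;

  // Wait for a falling edge on VSYNC: that marks the start of a frame.
  while (digitalRead(PIN_VSYNC) == LOW)  {}
  while (digitalRead(PIN_VSYNC) == HIGH) {}

  // The frame is being output while VSYNC stays low.
  while (digitalRead(PIN_VSYNC) == LOW && i < len) {
    // A line is valid while HREF is high.
    if (digitalRead(PIN_HREF) == HIGH) {
      // Wait for a rising edge of the pixel clock, then sample D0-D7.
      while (digitalRead(PIN_PCLK) == HIGH) {}
      while (digitalRead(PIN_PCLK) == LOW)  {}

      uint8_t b = 0;
      for (int bit = 0; bit < 8; bit++) {
        b |= digitalRead(PIN_D[bit]) << bit;
      }
      buf[i++] = b;
    }
  }
  return i;
}

// Small example buffer -- a full VGA frame would not fit in the ESP's RAM,
// so in practice the bytes would have to be streamed out as they arrive.
uint8_t pixelBuf[640];

void loop() {
  size_t n = captureFrame(pixelBuf, sizeof(pixelBuf));
  (void)n;  // the captured bytes could then be sent on as in the earlier sketch
}
```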