
execute run-time generated asm code

Posted: Thu Oct 20, 2022 1:48 pm
by @xi@g@me
Hello all :-)

I need to generate and execute assembly code at runtime on my ESP8266 wemos D1 mini devboard, for real time reasons (I need it to be asm as it is runtime generated).

I wrote some example code to understand the basics of the assembly instruction set (using the ISA doc), then dumped the obj file using xtensa-lx106-elf-objdump.exe -S. The next step was to write the binary code into a byte array (this array is bound to contain runtime generated data in a future step). I then create a function pointer that points to this binary data and call the function.

But when I do this, I get an exception each time the function is called.

Here is the exception contents (with decoder data):
Code:
20:24:55.258 > --------------- CUT HERE FOR EXCEPTION DECODER ---------------
20:24:55.323 >
20:24:55.323 > Exception (2):
20:24:55.323 > epc1=0x3ffe85d0 epc2=0x00000000 epc3=0x00000000 excvaddr=0x3ffe85d0 depc=0x00000000
20:24:55.423 >
20:24:55.423 > InstructionFetchError: Processor internal physical address or data error during instruction fetch
20:24:55.423 >
20:24:55.423 > >>>stack>>>
20:24:55.423 >
20:24:55.423 > ctx: cont
20:24:55.457 > sp: 3ffffdc0 end: 3fffffc0 offset: 0190
20:24:55.490 > 3fffff50:  00000fbc 29bf1c55 401002f1 00000000 
20:24:55.562 > 3fffff60:  0012323f 00000000 00001388 402014d8 
20:24:55.624 > 3fffff70:  00123239 3ffeead0 00000003 40201584 
20:24:55.694 > 3fffff80:  3fffdad0 00000000 3ffeead0 402015b9 
20:24:55.721 > 3fffff90:  3fffdad0 00000000 00000fbc 40207440 
20:24:55.807 > 3fffffa0:  feefeffe feefeffe 3ffe8618 4020dad6 
20:24:55.833 > 3fffffb0:  feefeffe feefeffe feefeffe 40100f4d 
20:24:55.891 > <<<stack<<<
20:24:55.891 >
20:24:55.891 > 0x401002f1 in millis at C:\Users\axiagame\.platformio\packages\framework-arduinoespressif8266\cores\esp8266/core_esp8266_wiring.cpp:188
20:24:55.891 > 0x402014d8 in WS2812B::_sendData() at src/WS2812B.cpp:159 (discriminator 2)
20:24:55.891 > 0x40201584 in WS2812B::_doTestMode() at src/WS2812B.cpp:155
20:24:55.891 > 0x402015b9 in WS2812B::loop() at src/WS2812B.cpp:106
20:24:55.891 > 0x40207440 in loop at src/main.cpp:62
20:24:55.891 > 0x4020dad6 in loop_wrapper() at C:\Users\axiagame\.platformio\packages\framework-arduinoespressif8266\cores\esp8266/core_esp8266_main.cpp:201
20:24:55.891 > 0x40100f4d in cont_wrapper at C:\Users\axiagame\.platformio\packages\framework-arduinoespressif8266\cores\esp8266/cont.S:81
20:24:55.891 >
20:24:55.891 >
20:24:55.891 > --------------- CUT HERE FOR EXCEPTION DECODER ---------------


_sendData calls _sendSingleLED and not millis, so I guess the call stack is wrong. _sendSingleLED calls my binary code, in this array :

Code:
// contains
    // noInterrupts();

    // u32 ccount = 0;
    // u32 curccount = 0;
    // u32 lastWait = 0;
    // u32 pinID = 1 << m_dataPin;
    // u32 notPinID = ~pinID;
    // u32 * outReg = (u32*)0x60000300;
    // for (u32 b = 24; b > 0; --b)
    // {
    //     *outReg |= pinID;
    //     *outReg &= notPinID;
    // }

    // interrupts();
static u8 functionCodeAndData[] =
{
    0x00, 0x03, 0x00, 0x60, // 0x60000300, GPIO output register

    0x30, 0x6f, 0x00,   // rsil a3, 15
    0x22, 0x02, 0x12,   // l8ui a2, a2, 18 // offset of m_dataPin in class
    // A2 contains m_dataPin
    0x0c, 0x15,         // movi.n a5, 1
    0x00, 0x12, 0x40,   // ssl a2
    0x00, 0x55, 0xa1,   // sll a5, a5
    // A5 contains pinID
    0x7c, 0xf6,         // movi.n a6, -1
    0x50, 0x66, 0x30,   // xor a6, a6, a5
    // A6 contains notPinID
    0x1c, 0x84,         // movi.n a4, 24 // prepare loop count
    // A4 contains loop count
    0x9c, 0x24,         // beqz.n a4, 30
    // jump when A4 = 0
    0x31, 0xf9, 0xff,   // l32r a3, 4 // address of register in which to write pin data
    // A3 contains pin register address
    0x28, 0x03,         // l32i.n a2, a3, 0
    // A2 contains data in pin register
    0x20, 0x25, 0x20,   // or a2, a5, a2
    // set pinID bit
    0x29, 0x03,         // s32i.n a2, a3, 0
    // store new pin data in register
    0x60, 0x22, 0x10,   // and a2, a2, a6
    // clear pinID bit
    0x29, 0x03,         // s32i.n a2, a3, 0
    // store new pin data in register
    0x0b, 0x44,         // addi.n a4, a4, -1
    // decrement loop counter
    0x46, 0xfa, 0xff,   // j 19 // start of loop
    0x20, 0x60, 0x00,   // rsil a2, 0
    0x0d, 0xf0,         // ret.n

    //0x3D, 0xF0,         // nop.n
};


At first I thought this was an endianness problem, so I reversed the bytes of every instruction, and I also tried keeping only the ret.n instruction. I actually don't know whether the instruction fetch unit takes the bytes one by one (in which case I should not use little endian) or not.

Also, I wonder if the issue is due to the fact that my code is in a static region of the program and needs to be placed elsewhere so that it can be executed (the PC register points to an invalid location, as the exception message may suggest).

I'll try copying the binary code into a dynamically allocated region and executing it from there to see if that fixes my problem (runtime generated code will be in heap memory anyway), while waiting for some help.

Does someone have an idea of what is actually wrong? Do you know documentation or information I may use to make this work? Thanks in advance!

Re: execute run-time generated asm code

Posted: Mon Oct 24, 2022 3:47 pm
by @xi@g@me
OK, so I did some more research.

First, I did dump the obj file in both hexadecimal and asm code.
If I take a small function, e.g. _sendData, here is the asm code:
Code:
Disassembly of section .text._ZN7WS2812B9_sendDataEv:

00000000 <_ZN7WS2812B9_sendDataEv-0x4>:
   0:   00 00 00 00                

00000004 <_ZN7WS2812B9_sendDataEv>:
{
   4:   f0c112                  addi   a1, a1, -16
   7:   3109                   s32i.n   a0, a1, 12
   9:   21c9                   s32i.n   a12, a1, 8
   b:   11d9                   s32i.n   a13, a1, 4
   d:   02dd                   mov.n   a13, a2
    for (u8 ledID = 0; ledID < m_nbLEDs; ++ledID)
   f:   0c0c                   movi.n   a12, 0
  11:   130d22                  l8ui   a2, a13, 19
  14:   14bc27                  bgeu   a12, a2, 2c <_ZN7WS2812B9_sendDataEv+0x28>
        _sendSingleLED(m_dataArray[ledID]);
  17:   0d38                   l32i.n   a3, a13, 0
  19:   a03c30                  addx4   a3, a12, a3
  1c:   0d2d                   mov.n   a2, a13
  1e:   fff801                  l32r   a0, 0 <_ZN7WS2812B9_sendDataEv-0x4>
  21:   0000c0                  callx0   a0
    for (u8 ledID = 0; ledID < m_nbLEDs; ++ledID)
  24:   cc1b                   addi.n   a12, a12, 1
  26:   74c0c0                  extui   a12, a12, 0, 8
  29:   fff906                  j   11 <_ZN7WS2812B9_sendDataEv+0xd>
}
  2c:   3108                   l32i.n   a0, a1, 12
  2e:   21c8                   l32i.n   a12, a1, 8
  30:   11d8                   l32i.n   a13, a1, 4
  32:   10c112                  addi   a1, a1, 16
  35:   f00d                   ret.n


And here is the hexadecimal dump:
Code:
 00000000 12c1f009 31c921d9 11dd020c
 0c220d13 27bc1438 0d303ca0 2d0d01f8
 ffc00000 1bccc0c0 7406f9ff 0831c821
 d81112c1 100df0       


We have some interesting information.
First, the function ends with a ret.n instruction (narrow return), like every function. The disassembly shows it as f00d. If we look at the hexadecimal dump of this section, the last 2 bytes are 0df0. I can therefore conclude that the binary code is stored in little endian. If we take a 24-bit instruction, like
Code:
10c112                  addi   a1, a1, 16

the bytes in the dump are "12c110", which confirms the endianness. One might think that each group of 4 bytes is simply displayed byte-swapped (the Visual Studio memory window does that :o ), but in that case we would see the first 2 bytes of the 24-bit instruction at the beginning of the previous 4-byte group.
So: binary code is in little endian (this explains why part of the instruction opcode generally shows up in the last 8 bits of the disassembly...)
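To make the byte order concrete, here is a minimal sketch of how an instruction word, written the way objdump prints it, maps to the bytes that go into a byte array. The helper names (encode24, encode16, checkAgainstDump) are made up for illustration; only the two instruction values come from the dump above.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Split a 24-bit instruction word, as objdump prints it (e.g. 0x10c112),
// into the three bytes in memory order (least significant byte first).
constexpr std::array<std::uint8_t, 3> encode24(std::uint32_t word)
{
    return { static_cast<std::uint8_t>(word & 0xff),
             static_cast<std::uint8_t>((word >> 8) & 0xff),
             static_cast<std::uint8_t>((word >> 16) & 0xff) };
}

// Same idea for a 16-bit "narrow" instruction such as ret.n (0xf00d).
constexpr std::array<std::uint8_t, 2> encode16(std::uint16_t word)
{
    return { static_cast<std::uint8_t>(word & 0xff),
             static_cast<std::uint8_t>(word >> 8) };
}

// Check the helpers against the two instructions quoted in the post.
inline void checkAgainstDump()
{
    const auto addi = encode24(0x10c112);   // addi a1, a1, 16
    assert(addi[0] == 0x12 && addi[1] == 0xc1 && addi[2] == 0x10);

    const auto ret = encode16(0xf00d);      // ret.n
    assert(ret[0] == 0x0d && ret[1] == 0xf0);
}
```

This matches the array in the first post, where ret.n is written as 0x0d, 0xf0.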

Re: execute run-time generated asm code

Posted: Mon Oct 24, 2022 4:23 pm
by @xi@g@me
And then, we have a 2nd very interesting piece of information to extract from this.
_sendData calls _sendSingleLED. Let's have a look at the code that does the actual call:
Code:
fff801                  l32r   a0, 0 <_ZN7WS2812B9_sendDataEv-0x4>
0000c0                callx0   a0


The first instruction writes the address to call into a0, and the second instruction calls the function pointed to by a0 (FYI, I read the ISA document, and it states that the CPU has "register files", which removes the need to push / pop registers to/from the stack before/after a call to another function, thus no such instruction here). The address of the function to call is stored... 4 bytes before the function entry point. That is the 4 bytes of data right before the function in the same section:
Code:
00000000 <_ZN7WS2812B9_sendDataEv-0x4>:
   0:   00 00 00 00               

WTF, the function to call would be at address zero? That makes no sense at all! And that's because I forgot an essential step of the build process: the linker! The linker's job is to take all the object files like the one I disassembled and put them together in the binary file. Once that's done (and only then), the linker fills in all the zeros in the data with the actual position of the function in the final binary file. When building the object file, the compiler indeed has no way to know where exactly the function will end up in the binary file.
I guess that this processor does not use a virtual address space (with an MMU), as modern CPUs do, so we don't have to worry about the extra considerations we would have on e.g. Windows PCs.

Now, let's have a look at the disassembled binary (firmware.elf). Here is the section that contains our _sendData function:
Code:
402014b8 <_ZN7WS2812B9_sendDataEv>:
{
402014b8:   f0c112                  addi   a1, a1, -16
402014bb:   036102                  s32i   a0, a1, 12
402014be:   0261c2                  s32i   a12, a1, 8
402014c1:   0161d2                  s32i   a13, a1, 4
402014c4:   02dd                   mov.n   a13, a2
    for (u8 ledID = 0; ledID < m_nbLEDs; ++ledID)
402014c6:   0c0c                   movi.n   a12, 0
402014c8:   130d22                  l8ui   a2, a13, 19
402014cb:   11bc27                  bgeu   a12, a2, 402014e0 <_ZN7WS2812B9_sendDataEv+0x28>
        _sendSingleLED(m_dataArray[ledID]);
402014ce:   0d38                   l32i.n   a3, a13, 0
402014d0:   a03c30                  addx4   a3, a12, a3
402014d3:   0d2d                   mov.n   a2, a13
402014d5:   fffcc5                  call0   402014a4 <_ZN7WS2812B14_sendSingleLEDERKNS_8LEDColorE>
    for (u8 ledID = 0; ledID < m_nbLEDs; ++ledID)
402014d8:   cc1b                   addi.n   a12, a12, 1
402014da:   74c0c0                  extui   a12, a12, 0, 8
402014dd:   fff9c6                  j   402014c8 <_ZN7WS2812B9_sendDataEv+0x10>
}
402014e0:   3108                   l32i.n   a0, a1, 12
402014e2:   21c8                   l32i.n   a12, a1, 8
402014e4:   11d8                   l32i.n   a13, a1, 4
402014e6:   10c112                  addi   a1, a1, 16
402014e9:   f00d                   ret.n


Look at the call instruction:
Code:
402014d5:   fffcc5                  call0   402014a4 <_ZN7WS2812B14_sendSingleLEDERKNS_8LEDColorE>


callx0 has been replaced with a plain call0, and the actual address of _sendSingleLED is visible in the instruction operand. Actually, the instruction stores an 18-bit offset relative to the current instruction, which lets it fit into 24 bits (the call target must be 4-byte aligned (e.g. 402014b8 for _sendData), as the documentation says).
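As a cross-check of that relative encoding, here is a small sketch that recomputes the call0 target from the instruction word in the listing. It assumes the CALL0 layout as I understand it from the Xtensa ISA manual (opcode in the low 6 bits, signed 18-bit offset in bits 6..23, counted in 32-bit words from the word-aligned PC plus 4); the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// True if the 24-bit word (as printed by objdump) looks like a CALL0:
// op0 = 0b0101 in bits 0..3 and n = 0b00 in bits 4..5 (assumed layout).
constexpr bool isCall0(std::uint32_t word)
{
    return (word & 0x3f) == 0x05;
}

// Compute the call target.
// pc   : address of the call0 instruction itself
// word : the 24-bit instruction as objdump prints it (e.g. 0xfffcc5)
constexpr std::uint32_t call0Target(std::uint32_t pc, std::uint32_t word)
{
    std::int32_t imm18 = static_cast<std::int32_t>((word >> 6) & 0x3ffff);
    if (imm18 & 0x20000)      // sign-extend from 18 bits
        imm18 -= 0x40000;
    // The offset is in 32-bit words, relative to the aligned PC plus 4.
    return (pc & ~std::uint32_t{3}) + static_cast<std::uint32_t>(imm18 * 4) + 4;
}

// Check against the line "402014d5: fffcc5 call0 402014a4" from the listing.
inline void checkAgainstListing()
{
    assert(isCall0(0xfffcc5));
    assert(call0Target(0x402014d5, 0xfffcc5) == 0x402014a4);
}
```

Because the offset is counted in words, the target can only ever be a 4-byte aligned address, which fits the alignment remark above.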

Anyways, that's how it actually works. So now, let's have a look at our custom made binary function. Is it called properly ?
Code:
void WS2812B::_sendSingleLED(LEDColor const & _color)
{
402014a4:   f0c112                  addi   a1, a1, -16
402014a7:   036102                  s32i   a0, a1, 12
    void(*asmFunc)() = reinterpret_cast<void(*)()>(functionCodeAndData + 4);
    asmFunc();
402014aa:   fffd01                  l32r   a0, 402014a0 <_ZN7WS2812BD1Ev+0x10>
402014ad:   0000c0                  callx0   a0
}
402014b0:   3108                   l32i.n   a0, a1, 12
402014b2:   10c112                  addi   a1, a1, 16
402014b5:   f00d                   ret.n
   ...

(see the 3 dots at the end? This is padding added to the binary so next function starts at a 4 byte aligned address.)

Whoops... this time there is no call0 instruction, but still the callx0...
Code:
402014aa:   fffd01                  l32r   a0, 402014a0 <_ZN7WS2812BD1Ev+0x10>
402014ad:   0000c0                  callx0   a0


The function starts at 402014a4, so 402014a0 is 4 bytes before it, exactly as it was in the object file. It is as if the linker did not understand that I need to fetch the address of the binary data. The array is not declared as a function, so the linker does not... well... link.
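For reference, the address l32r reads from can be recomputed from the instruction word. The sketch below assumes the L32R layout as I read it in the Xtensa ISA manual (16-bit word offset in bits 8..23, always negative, applied to the PC rounded up to the next 4-byte boundary); the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Address of the 32-bit literal that an L32R instruction loads.
// pc   : address of the l32r instruction itself
// word : the 24-bit instruction as objdump prints it (e.g. 0xfffd01)
constexpr std::uint32_t l32rLiteralAddress(std::uint32_t pc, std::uint32_t word)
{
    const std::uint32_t imm16 = (word >> 8) & 0xffff;
    // The offset is extended with ones (so it is negative) and scaled to bytes;
    // unsigned arithmetic wraps modulo 2^32, which gives the subtraction.
    const std::uint32_t byteOffset = (0xffff0000u | imm16) << 2;
    return ((pc + 3) & ~std::uint32_t{3}) + byteOffset;
}

inline void checkAgainstListings()
{
    // Linked binary: "402014aa: fffd01 l32r a0, 402014a0"
    assert(l32rLiteralAddress(0x402014aa, 0xfffd01) == 0x402014a0);
    // Object file: "1e: fff801 l32r a0, 0"
    assert(l32rLiteralAddress(0x1e, 0xfff801) == 0x0);
    // Byte array from the first post: l32r (0x31, 0xf9, 0xff) sits at
    // array offset 27 and points back at offset 0, the GPIO register literal.
    assert(l32rLiteralAddress(27, 0xfff931) == 0);
}
```

So the l32r itself is the same in the object file and in the linked binary; what the linker changes is the 32-bit value stored at the literal's address.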

If we look at what's at address 402014a0, we land right after the end of another function (which one does not matter), in what looks like garbage (unused data, I guess):
Code:
4020149f:   85cc00                  extui   a12, a0, 28, 9
402014a2:   fe                         .byte 0xfe
402014a3:   3f                         .byte 0x3f

The 4 bytes at address 402014a0 contain the following data: cc85fe3f. Taken as is, this does not look like a valid code address, which would explain why our call fails dramatically. I think I'll have to find a way either to put our binary data in a text section so that it is treated as code, or to do a proper callx0 call into a dedicated heap memory space. The 2nd option is the better one, as the code I want to execute will eventually be run-time generated.
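One caveat on the reading above, offered as a sketch rather than a certainty: l32r loads the literal as a little-endian 32-bit word, so the bytes cc 85 fe 3f should be read in reverse order. The helper name le32 is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assemble a 32-bit value from four bytes given in memory order
// (lowest address first), the way a little-endian load sees them.
constexpr std::uint32_t le32(std::uint8_t b0, std::uint8_t b1,
                             std::uint8_t b2, std::uint8_t b3)
{
    return  static_cast<std::uint32_t>(b0)
         | (static_cast<std::uint32_t>(b1) << 8)
         | (static_cast<std::uint32_t>(b2) << 16)
         | (static_cast<std::uint32_t>(b3) << 24);
}

// Bytes at 402014a0..402014a3 in the linked binary.
static_assert(le32(0xcc, 0x85, 0xfe, 0x3f) == 0x3ffe85cc,
              "literal decodes to an address in the 0x3ffe85xx range");

inline void checkLiteral()
{
    assert(le32(0xcc, 0x85, 0xfe, 0x3f) == 0x3ffe85cc);
}
```

0x3ffe85cc is within a few bytes of the epc1=0x3ffe85d0 reported in the exception dump of the first post. That suggests the linker did fill in the address of the byte array, and that the fault comes from fetching instructions from data RAM rather than from a bogus pointer, which is consistent with the IRAM finding later in this thread.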

If you have any idea on how to do this, I'm interested :)

P.S. If you read the whole post, you're a hero!

Re: execute run-time generated asm code

Posted: Mon Oct 31, 2022 3:39 pm
by @xi@g@me
Hello again :-) I'm glad to announce that I finally got it working :)!

I read the documentation further, and found the pseudo-code for the instruction fetch mechanism of the processor. It taught me several things:
- the instruction data is fetched 32 bits at a time (up to 64 bits in the internal memory)
- the manual says that the fetch may fail if there is an "attribute" change in the physical address within the 64 bytes after the instruction.
- the "attributes" of that memory may have something to do with IRAM & DRAM (the ESP8266 has 64KB of instruction RAM, and 96KB of data RAM).

So I tried to find out whether I could allocate memory in the IRAM instead of the DRAM. And I found a way!

According to this page, there are several configurations possible for the IRAM block. One of them is about sharing an allocate-able heap in the IRAM section: "16KB cache + 48KB IRAM and 2nd Heap (shared)". This decreases the cache size (which would eventually lead to slower code execution from the flash chip), but provides ability to allocate into the IRAM.
With PlatformIO (Visual Studio Code), you can enable this feature using a build flag in the platformio.ini file:
Code:
build_flags = -D PIO_FRAMEWORK_ARDUINO_MMU_CACHE16_IRAM48_SECHEAP_SHARED


As for allocation, the Arduino interface features a pair of classes to help with this :
HeapSelectIram and HeapSelectDram. Instantiate one of these classes, and allocations go to the targeted heap (IRAM, or DRAM, which is the default) for as long as the instance is in scope.

Here is my working code:
Code:
static u8 functionCodeAndData[] =
{
    0x00, 0x03, 0x00, 0x60, // 0x60000300, GPIO output register, that's data and not code.

    // code starts here
    // blah blah blah, binary code
    0x0d, 0xf0,         // ret.n
};

void WS2812B::_sendSingleLED(LEDColor const & _color)
{
    size_t allocSize = sizeof(functionCodeAndData) + 64;
    byte * funcInMem = nullptr;
    {
        HeapSelectIram selection;
        funcInMem = static_cast<byte *>(malloc(allocSize));
    }
    if (funcInMem == nullptr)
    {
        // IRAM heap exhausted or not enabled: nothing to execute
        return;
    }
    Serial.printf("allocated %u bytes at address %0x\r\n", allocSize, reinterpret_cast<unsigned int>(funcInMem));

    memcpy(funcInMem, functionCodeAndData, sizeof(functionCodeAndData));
    void(*func)() = reinterpret_cast<void(*)()>(funcInMem + 4);
    func();

    free(funcInMem);
}


The HeapSelectIram class is instantiated inside its own braces, to restrict its scope to the malloc call only. The + 4 is there to reach the entry point of my function (after the 4 bytes of data). Also, notice that I add 64 bytes to the allocated size, just in case (I did not test without).

And the Serial.printf confirms that the allocation is done inside instruction memory:
Code:
21:36:03.394 > allocated 70 bytes at address 4010709c
21:36:03.427 > allocated 70 bytes at address 4010709c
21:36:06.661 > allocated 70 bytes at address 4010709c


The 401XXXXX address range is indeed inside the IRAM section (see here).
The allocation lands at the same address every time, which means I free my memory correctly. Also, many prints = many calls = no exception in the CPU!!!
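To check an address without eyeballing it, a small classifier like the sketch below can help. The ranges are assumptions taken from the commonly published ESP8266 memory map (data RAM from 0x3FFE8000, instruction RAM at 0x40100000 to 0x4010FFFF, memory-mapped flash from 0x40200000), not something measured in this thread, and the names Region and classify are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

enum class Region { DataRam, InstructionRam, Flash, Other };

// Rough address classification. The bounds are ASSUMED from the usual
// ESP8266 memory map; adjust them if your documentation says otherwise.
constexpr Region classify(std::uint32_t addr)
{
    return (addr >= 0x3FFE8000u && addr < 0x40000000u) ? Region::DataRam
         : (addr >= 0x40100000u && addr < 0x40110000u) ? Region::InstructionRam
         : (addr >= 0x40200000u && addr < 0x40300000u) ? Region::Flash
         : Region::Other;
}

inline void checkAddressesFromThread()
{
    // Faulting address from the exception dump: data RAM, not executable.
    assert(classify(0x3ffe85d0) == Region::DataRam);
    // Address returned by malloc with HeapSelectIram in scope.
    assert(classify(0x4010709c) == Region::InstructionRam);
    // _sendSingleLED in the linked firmware, executed from flash.
    assert(classify(0x402014a4) == Region::Flash);
}
```

On the device, the same check could be run on the pointer returned by malloc before casting it to a function pointer.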

Took me a long time, but I got it! :mrgreen: