Executing code on external PSRAM

bb00 · Aug 10, 2024

Hello,

I am currently in the process of writing a little "OS" for the Teensy 4.1.
I am trying to load ELF files at runtime and execute the code contained in the .text section of the ELF.
As far as I can tell (without a debugger), all code is properly loaded.

I removed the NOEXEC flag in the MPU configuration for the FlexSPI RAM.

I set the LSB of the entry address I get when loading the ELF file (basically the memory address I load the .text section into) to 1 so the CPU knows it's Thumb code.

I compile the ELF as position-independent code for Thumb on Cortex-M7.

When I try to jump to the entry address, the Teensy raises a memory fault.

Is there anything obvious I am missing?

I will post some code as soon as I have access to my computer again.

Thank you all.

Best regards
Sebastian

MichaelMeissner · Aug 11, 2024

Note, while I work on PowerPC GCC, I have never looked at the ARM side of things. So this is just speculation.

But in general, for many position-independent systems, you need to do a fixup pass to update the addresses. That is, there is often a chunk of memory that contains the actual pointers. The code uses a position-independent method to load the address of that pointer table into a register. Then when the code needs to refer to something, it loads the pointer from the table and uses that. There is special startup code to fix up these addresses before the program jumps into C/C++ code. Perhaps this startup code is not being run, or perhaps it is being run more than once.

Within OSes as opposed to stand-alone code, the program loader does this relocation. I.e. after it has copied the code to the appropriate location, it goes through the list of relocations that are in a table of the ELF file, and fixes up the relocations to adjust for the new address.

Note that if the relocation is REL as opposed to RELA, this relocation pass can only be done once. In a REL relocation, the addend is stored in the memory location itself, so the loader adds the load address to the current value in memory and overwrites it. In a RELA relocation, the relocation record holds both a reference to the address that must be fixed up and the offset to add, so the result can be recomputed.
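To make the REL versus RELA difference concrete, here is a minimal sketch in C that runs on a host. It is an illustration only: the `rel_entry` and `rela_entry` structs and the `apply_*` functions are hypothetical, simplified stand-ins and not the real `Elf32_Rel`/`Elf32_Rela` layout or any loader's API. It models only a "base plus addend" style fixup.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified records for illustration (not the real ELF layout). */
typedef struct { uint32_t offset; } rel_entry;                  /* addend lives in the patched word */
typedef struct { uint32_t offset; int32_t addend; } rela_entry; /* addend lives in the record */

/* REL style: new word = old word + load base. Running it twice adds the base twice. */
static void apply_rel(uint8_t *image, uint32_t load_base, const rel_entry *r) {
    uint32_t word;
    assert(image != NULL && r != NULL);
    memcpy(&word, image + r->offset, sizeof word); /* memcpy avoids unaligned access */
    word += load_base;
    memcpy(image + r->offset, &word, sizeof word);
}

/* RELA style: new word = load base + addend. The old word is ignored, so it can be repeated. */
static void apply_rela(uint8_t *image, uint32_t load_base, const rela_entry *r) {
    uint32_t word;
    assert(image != NULL && r != NULL);
    word = load_base + (uint32_t)r->addend;
    memcpy(image + r->offset, &word, sizeof word);
}

/* Reads back the 32-bit word at a given offset in the image. */
static uint32_t read_word(const uint8_t *image, uint32_t offset) {
    uint32_t word;
    memcpy(&word, image + offset, sizeof word);
    return word;
}
```

With a load base of 0x70000000 and a link-time value of 0x100, one `apply_rel` pass yields 0x70000100 and a second pass yields 0xE0000100, while `apply_rela` yields 0x70000100 no matter how often it runs.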

Perhaps the library code was not built with position independent code.

Perhaps you need to explicitly flush the i-cache.

Perhaps there is additional setup needed to make sure the i-cache is set up as opposed to the d-cache.

jmarsh · Aug 11, 2024

Can't make an informed guess without seeing any code or proper description of the fault.

defragster · Aug 12, 2024

MichaelMeissner said:
So this is just speculation.

@MichaelMeissner is a good voice to this and it seems on point.
Teensy startup moves code to run in RAM1 from FLASH: "// Initialize memory" - then there is some 'code init after that'? Not seeing an explicit call between early and middle '_hook's ?

Also check this in startup.c? - it is coded to mark the PSRAM region 'NOEXEC' in the cache

Code:

FLASHMEM void configure_cache(void)
{
// ...

    SCB_MPU_RBAR = 0x70000000 | REGION(i++); // FlexSPI2
    SCB_MPU_RASR = MEM_CACHE_WBWA | READWRITE | NOEXEC | SIZE_16M;

bb00 · Aug 12, 2024

jmarsh said:
Can't make an informed guess without seeing any code or proper description of the fault.

Yes I will post code as soon as I have access to my computer again.

defragster said:
@MichaelMeissner is a good voice to this and it seems on point.
Teensy startup moves code to run in RAM1 from FLASH: "// Initialize memory" - then there is some 'code init after that'? Not seeing an explicit call between early and middle '_hook's ?

Also check this in startup.c? - it is coded to mark the PSRAM region 'NOEXEC' in the cache

Code:

FLASHMEM void configure_cache(void)
{
// ...

    SCB_MPU_RBAR = 0x70000000 | REGION(i++); // FlexSPI2
    SCB_MPU_RASR = MEM_CACHE_WBWA | READWRITE | NOEXEC | SIZE_16M;

I will look at it.
I already removed the NOEXEC flag but I still get the Instruction Access Violation fault.

MichaelMeissner · Aug 12, 2024

bb00 said:
Yes I will post code as soon as I have access to my computer again.

I will look at it.
I already removed the NOEXEC flag but I still get the Instruction Access Violation fault.

There may be other places in the code that are setting NOEXEC. If it is being set with '|=', you may need to explicitly clear the bit using '&= ~NOEXEC'.
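The point about '|=' can be shown with a small host-side sketch. It assumes that NOEXEC corresponds to the XN (execute never) bit, bit 28 of the ARMv7-M MPU_RASR register. `XN_BIT`, the helper names, and the placeholder flag value are all made up for this illustration and are not Teensy core identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: XN (execute never) is bit 28 of MPU_RASR on ARMv7-M. */
#define XN_BIT (UINT32_C(1) << 28)

/* OR-ing flags in can only set bits; it never clears one that is already set. */
static uint32_t rasr_or_in(uint32_t rasr, uint32_t flags) { return rasr | flags; }

/* Clearing XN needs an explicit AND with the inverted mask. */
static uint32_t rasr_clear_xn(uint32_t rasr) { return rasr & ~XN_BIT; }

/* A region is executable only when XN is clear. */
static int rasr_is_executable(uint32_t rasr) { return (rasr & XN_BIT) == 0; }

static void noexec_demo(void) {
    const uint32_t other_flags = UINT32_C(0x0000002F); /* arbitrary placeholder bits */
    uint32_t rasr = other_flags | XN_BIT;               /* region starts out NOEXEC */

    rasr = rasr_or_in(rasr, other_flags);               /* "new" flags without XN */
    assert(!rasr_is_executable(rasr));                  /* XN is still set */

    rasr = rasr_clear_xn(rasr);                         /* explicit clear */
    assert(rasr_is_executable(rasr));
    assert(rasr == other_flags);                        /* the other bits are untouched */
}
```

A plain assignment to the register, as in the startup.c excerpt quoted in this thread, replaces the whole value, so this only matters if some other code modifies the register with '|='.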

defragster said:
@MichaelMeissner is a good voice to this and it seems on point.
Teensy startup moves code to run in RAM1 from FLASH: "// Initialize memory" - then there is some 'code init after that'? Not seeing an explicit call between early and middle '_hook's ?

In embedded code (i.e. normal Arduino), the linker puts a copy of what will go into the other memory regions in the main section, but it does address relocation based on where the code will be copied to in memory, so you don't have to adjust the memory locations.

In position independent code or shared libraries under a hosted environment, the linker creates a table of all of the addresses that need to be relocated, along with the relocation records that point to the symbol that needs to be resolved, and what type of relocation is being done. Before you can jump to the code, something has to go through the table of relocation entries and adjust the addresses to point to the adjusted address. In hosted environments, the initial loader program (or shared library loader when a new library is loaded) does this. Presumably in position independent code for embedded systems, this needs to be done.

In addition to fixing up addresses used by the code, you need to also adjust the addresses of items used in static or global declarations.

defragster · Aug 12, 2024

Could a minimum EXEC case be made? Just an entry point that does a return. Get the ASM bytes and write them to PSRAM and then call that entry address to see it work? Perhaps extending it to take a param and increment it and return the value?

When that works then PSRAM will be executing code properly ...

bb00 · Aug 12, 2024

defragster said:
Could a minimum EXEC case be made? Just an entry point that does a return. Get the ASM bytes and write them to PSRAM and then call that entry address to see it work? Perhaps extending it to take a param and increment it and return the value?

When that works then PSRAM will be executing code properly ...

That's exactly what I am doing, and that doesn't work.

I will post the codebase later today

defragster · Aug 12, 2024

bb00 said:
That's exactly what I am doing, and that doesn't work.
I will post the codebase later today

Wondering if RAM2 - also marked NO EXEC - might be worth a try as well? If that can be cleaned to work, it might show some difference?

bb00 · Aug 12, 2024

So I just published the code on GitHub: https://github.com/birdboat00/ozon
It uses PlatformIO.

The important code is starting at
https://github.com/birdboat00/ozon/blob/5d8cbaf6703952ae41c94bb5ccac7d663783348c/src/main.cpp#L67 (calling the loader and executing the entry)
and
https://github.com/birdboat00/ozon/...5ccac7d663783348c/src/services/ldr/elf.cpp#L8 (ELF loading logic)

I also modified startup.c of the Teensy core, at line 319:

C:

    SCB_MPU_RBAR = 0x70000000 | REGION(i++); // FlexSPI2
    // SCB_MPU_RASR = MEM_CACHE_WBWA | READWRITE | NOEXEC | SIZE_16M;
    SCB_MPU_RASR = MEM_CACHE_WBWA | READWRITE | SIZE_16M; // MAKE EXECUTABLE FOR ELF LOADING

The code for the test binary and the linker script are test.c and linker.ld in the root directory. The commands I use to compile are in the test.c file.

If you need the code without all the other cruft, I can arrange that, but it should work like this if you comment out the unnecessary stuff.

PaulStoffregen · Aug 12, 2024

I was about to ask you to write a simpler program...

But instead I created one just now, to give you and anyone else who wants to write this sort of dynamic code a known-good example to get started.

This example blinks the pin 13 LED. The code which writes to the GPIO toggle register runs from RAM. There are actually 3 copies to demonstrate running from the 3 different RAM regions, and 2 ordinary GPIO toggle writes for comparison. Uncomment the one you want inside loop().

It's only a single small file. Just copy into Arduino IDE, edit the core library startup.c file to remove NOEXEC, and upload.

Code:

uint16_t writeRegister_DTCM[2];
DMAMEM uint16_t writeRegister_RAM2[2];
EXTMEM uint16_t writeRegister_PSRAM[2];
const uint32_t bitmask = CORE_PIN13_BITMASK;
volatile uint32_t * const reg = &CORE_PIN13_PORTTOGGLE;

void setup() {
  pinMode(13, OUTPUT);
  Serial.begin(9600);
  if (CrashReport) {
    while (!Serial && millis() < 5000); // wait
    Serial.print(CrashReport);
  }
  Serial.println("Execute from RAM test");
  // must edit startup.c configure_cache() to remove NOEXEC
  initWriteRegisterMemory(writeRegister_DTCM);
  initWriteRegisterMemory(writeRegister_RAM2);
  initWriteRegisterMemory(writeRegister_PSRAM);
}

void loop() {
  // Uncomment 1 of these lines to blink the LED...
  //digitalToggleFast(13);
  //writeRegister(reg, bitmask);
  //callWriteRegisterMemory(writeRegister_DTCM, reg, bitmask);
  //callWriteRegisterMemory(writeRegister_RAM2, reg, bitmask);
  callWriteRegisterMemory(writeRegister_PSRAM, reg, bitmask);
  delay(750);
}

void writeRegister(volatile uint32_t *r, uint32_t n) {
  *r = n;
  // assembly:
  //   str r1, [r0, #0]
  //   bx  lr
}

void initWriteRegisterMemory(uint16_t *memory) {
  memory[0] = 0x6001; // str r1, [r0, #0]
  memory[1] = 0x4770; // bx  lr
  arm_dcache_flush(memory, 4);
}

void callWriteRegisterMemory(const uint16_t *memory, volatile uint32_t *r, uint32_t n) {
  ((void (*)(volatile uint32_t *r, uint32_t n))((uint32_t)memory | 1))(r, n);
}

If you leave NOEXEC in startup.c, you'll get this CrashReport info printed in the Arduino Serial Monitor.

Code:

CrashReport:
  A problem occurred at (system time) 14:9:37
  Code was executing from address 0x70000000
  CFSR: 1
    (IACCVIOL) Instruction Access Violation
  Temperature inside the chip was 46.39 °C
  Startup CPU clock speed is 600MHz

Of course, if you do properly remove NOEXEC, you'll see the LED does indeed blink.

jmarsh · Aug 12, 2024

PSRAM is cached by the CPU. The data cache is separate from the instruction cache.
I don't see any code to flush the data cache / invalidate the instruction cache after copying the executable code to PSRAM.

PaulStoffregen · Aug 12, 2024

Data cache is flushed by arm_dcache_flush(memory, 4) inside initWriteRegisterMemory().

For this simple example, which only writes once, the instruction cache doesn't need to be invalidated, since the first usage to execute instructions will be a cache miss. But a more complete example which changes already used code would indeed need to write to SCB_CACHE_ICIALLU. That register is documented starting on page 577 of the ARMv7-M Architecture Reference Manual (ARM doc #DDI0403Ee), which can be found on the Teensy 4.0 and Teensy 4.1 pages under "Technical Information".

If you run this example on a Teensy 4.1 with the PSRAM chip, or one without by instead using the DTCM or RAM2 memory (and of course remove NOEXEC from startup.c), you'll see the code executing from RAM does indeed cause the LED to blink. It really does work. I tried to keep it as short and simple as possible. Maybe better examples should be made?

jmarsh · Aug 12, 2024

PaulStoffregen said:
Data cache is flushed by arm_dcache_flush(memory, 4) inside initWriteRegisterMemory().

For this simple example, which only writes once, the instruction cache doesn't need to be invalidated, since the first usage to execute instructions will be a cache miss. But a more complete example which changes already used code would indeed need to write to SCB_CACHE_ICIALLU. That register is documented starting on page 577 of the ARMv7-M Architecture Reference Manual (ARM doc #DDI0403Ee), which can be found on the Teensy 4.0 and Teensy 4.1 pages under "Technical Information".

If you run this example on a Teensy 4.1 with the PSRAM chip, or one without by instead using the DTCM or RAM2 memory (and of course remove NOEXEC from startup.c), you'll see the code executing from RAM does indeed cause the LED to blink. It really does work. I tried to keep it as short and simple as possible. Maybe better examples should be made?

I was talking about OP's code.
It looks like it uses a simple heap allocator to reserve space for the executable code. That may cause problems with the cache management functions, since they can only operate on complete cache lines: you will need to allocate multiples of 32 bytes at 32-byte boundaries to ensure surrounding cache lines are not inadvertently touched.
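As a rough illustration of the alignment concern, the following host-side C sketch rounds a size up to whole cache lines and counts how many lines a buffer touches. The 32-byte line size is the figure given above. The helper names are hypothetical, the addresses are example values in the 0x70000000 PSRAM region mentioned earlier in the thread, and how an aligned buffer is actually obtained from the allocator is left open.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 32u /* line size in bytes, as stated in the thread */

/* Rounds a size up to the next multiple of the cache line size. */
static size_t cache_round_up(size_t size) {
    return (size + CACHE_LINE_SIZE - 1u) & ~(size_t)(CACHE_LINE_SIZE - 1u);
}

/* True when both the start address and the size fall on cache line boundaries. */
static int cache_is_aligned(uintptr_t addr, size_t size) {
    return (addr % CACHE_LINE_SIZE) == 0 && (size % CACHE_LINE_SIZE) == 0;
}

/* Number of cache lines that the byte range [addr, addr + size) overlaps. */
static size_t cache_lines_touched(uintptr_t addr, size_t size) {
    if (size == 0) return 0;
    uintptr_t first = addr / CACHE_LINE_SIZE;
    uintptr_t last = (addr + size - 1u) / CACHE_LINE_SIZE;
    return (size_t)(last - first + 1u);
}

static void cache_alignment_demo(void) {
    /* A 4-byte code buffer at an unaligned heap address straddles two lines. */
    assert(cache_lines_touched(0x7000001Eu, 4) == 2);

    /* Padded to a full line and placed on a boundary, it owns exactly one line. */
    assert(cache_round_up(4) == 32);
    assert(cache_is_aligned(0x70000020u, cache_round_up(4)));
    assert(cache_lines_touched(0x70000020u, cache_round_up(4)) == 1);
}
```

A loader could check `cache_is_aligned` on the buffer before calling a cache maintenance function such as `arm_dcache_flush` on it, so that no line shared with unrelated heap data is affected.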

bb00 · Aug 25, 2024

Flushing the cache absolutely did it.

Thank you very much!
