Teensy 4.0 - Bare minimum initialization and memory layout question

Status
Not open for further replies.

nanohenry

Member
Hello,

I'm working on a project around the Teensy 4.0. The platform seems to be really powerful, but I wish the different features were better documented. For example, I managed to figure out how to do direct GPIO access (which I needed) only after browsing the Forum for a few days. Nevertheless, it's really cool! :D

Anyway, regarding the software, I was wondering what the "bare minimum" setup is on boot? I.e. what registers need to be set, and how to set up the FPU, the MPU and the DTCM/ITCM split? I'd like to understand as much as possible about what's going on "behind the scenes" - and also remove anything unneeded for my application.

I have also another question: on the Teensy 4.0 homepage, in the Memory Layout section, it says "Memory is used in the following ways", and then there's a list of the different sections. I assume this configuration is just the one that you'll get if you compile from e.g. Arduino IDE without modifying anything? That the memory can be used as one wishes but is that way by default? And I guess the layout is then established (data is copied from flash to TCM etc.) in the boot process?

I already found startup.c on GitHub, and that was great help understanding the boot process, but I'm still a bit lost as I'm not sure what all the register writes are for and what the written values mean.

Cheers and thanks in advance.
 
Again hard to say what all you need or don't need. Or what your plans are.

I personally just use the Arduino setup, which is fine for the things I wish to do.

Understanding the board. I assume you found the Datasheets for the board: https://www.pjrc.com/teensy/datasheets.html

And then there are manuals on the different types of ARM processors.

Startup.c is a good place to look, plus the linker scripts...

DTCM/ITCM split... There are 512 KB of memory in that section, split into 32 KB banks, each of which can be either DTCM or ITCM. You can set up each of these banks for one or the other (or for neither).
The other 512 KB is a different type of memory, which today you can get to either by defining a variable as DMAMEM or by using malloc. This memory acts differently, and there is caching that can be configured for this section of memory... FLASH is flash...

So more or less all of the information comes from these sources, plus some hints on the IMXRT forums, sample apps...

So hope that helps and good luck
 
OK, thanks for the info. I'll look around.

One more thing: when a new program is flashed onto the board, it's only writing to the flash, right? So technically I could instruct the linker to not cause anything to be loaded into e.g. OCRAM and have the whole 512KB to myself?

Edit: I wanted to mention too that I'd like to avoid the system choosing the memory for me (e.g. through keywords like DMAMEM) and rather just have access to the whole memory for R/W/X (again, to learn more about how things are/can be done).
 
OCRAM is all open in the TD 1.48 - for pending faster USB in TD 1.49 it looks like 12 KB is allocated to USB buffers for DMA access. When that is complete USB will be optional and turning it off should put that back to unused - but no USB Serial.

Indeed, after programming, and at power-on when execution enters ResetHandler() in startup.c, all code resides only in flash. It is that code which copies any CODE not marked FLASHMEM into ITCM to run at CPU speed; rounding up to the next 32K boundary then starts the RAM as DTCM, which also runs at CPU speed. The 512KB of RAM in upper OCRAM runs at 1/4 CPU speed (IIRC a post by Paul) and that is why the cache comes into play. { that seems a worthy note for the T4 Memory page on pjrc.com }
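The rounding to a 32K boundary described above can be sketched as plain arithmetic. This is only an illustration that runs on a desktop compiler; the function names are hypothetical and are not taken from startup.c or the linker script.

```c
#include <stdint.h>

#define FLEXRAM_BANK_SIZE  0x8000u  /* 32 KB per bank */
#define FLEXRAM_BANK_COUNT 16u      /* 16 banks = 512 KB of RAM1 */

/* Hypothetical helper: number of 32 KB banks needed to hold
   `code_bytes` of ITCM code, rounded up to the next bank boundary. */
uint32_t itcm_banks_needed(uint32_t code_bytes)
{
    return (code_bytes + FLEXRAM_BANK_SIZE - 1u) / FLEXRAM_BANK_SIZE;
}

/* Hypothetical helper: bytes of RAM1 left for DTCM once ITCM has
   taken its banks. Returns 0 if the code does not fit in RAM1. */
uint32_t dtcm_bytes_available(uint32_t code_bytes)
{
    uint32_t banks = itcm_banks_needed(code_bytes);
    if (banks > FLEXRAM_BANK_COUNT)
        return 0;
    return (FLEXRAM_BANK_COUNT - banks) * FLEXRAM_BANK_SIZE;
}
```

For example, 40000 bytes of code would need 2 banks (64K) of ITCM, leaving 14 banks (448K) for DTCM.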

Bottom line is proper MCU startup in Startup.c from 'reset' is what Paul worked so many months on - nothing extraneous for full function - setting up the clocks and peripherals as needed for initial runtime use. The compile time location notes { DMAMEM, PROGMEM, FLASHMEM, ? } to the Linker just help the build get the code in the right place - where by default code goes into ITCM.
 
That the memory can be used as one wishes but is that way by default? And I guess the layout is then established (data is copied from flash to TCM etc.) in the boot process?

Yes, the documented memory layout is based on the defaults you get from startup.c and imxrt1062.ld.

But to some extent, it's also based on what the hardware provides. No matter what you do in software, the hardware has those 3 physical memories.

RAM1 can be divvied between ITCM, DTCM and normal AXI bus. If you don't do anything special (and you delete the auto-sizing stuff from imxrt1062.ld), the default partition always gives 128K ITCM, 128K DTCM and 256K OCRAM (slower AXI bus). If you find the old beta test thread, you'll see this is what we had for much of that testing... and especially the limited size for DTCM was pretty painful.


I was wondering what the "bare minimum" setup is on boot?

Well, that question involves a lot of choices about what you consider "bare minimum".

For example, if you want to run at any speed other than the default 396 MHz, then you need to configure the clocks, which probably involves configuring the PLL.

If you're going to program in C, do you consider copying the .data segment from flash to RAM necessary? Or does "bare minimum" mean your C code can't use static initialized variables?
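To make the .data question concrete, here is a minimal sketch of the two loops a C startup routine typically runs before main(): one copies initialized variables from their flash image to RAM, the other zeroes .bss. The names copy_words and clear_words are hypothetical; on real hardware the pointers would come from linker-script symbols (I believe Teensy's use names along the lines of _sdata, _edata and _sdataload, but treat that as an assumption and check imxrt1062.ld).

```c
#include <stdint.h>

/* Copy 32-bit words from `src` (the flash image) into the RAM range
   [dest, dest_end). Without this, static initialized variables would
   hold whatever happened to be in RAM at power-up. */
void copy_words(uint32_t *dest, const uint32_t *src, uint32_t *dest_end)
{
    if (dest == src)
        return;               /* already in place, nothing to copy */
    while (dest < dest_end)
        *dest++ = *src++;
}

/* Zero the RAM range [dest, dest_end), as C requires for .bss. */
void clear_words(uint32_t *dest, uint32_t *dest_end)
{
    while (dest < dest_end)
        *dest++ = 0;
}
```

If both loops are skipped, code that relies on `static int x = 5;` or on zero-initialized globals will misbehave, which is the trade-off the question above is pointing at.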

Does "bare minimum" imply you won't use the USB ports? USB involves quite a lot of code, especially if you want to make it run anywhere near as fast as possible. For example, 1.49-beta1 gives about double the speed of 1.48. You could probably get by with much less code, but would you want to risk not supporting all the things the USB spec requires? Would simpler code that achieves only 5% of the speed be acceptable?

How about the serial ports? Maybe the extra code for the serial port interrupts and FIFOs is beyond what you'd consider "bare minimum"? Inside debugprintf.c you can find truly minimal polling-only code for Serial4 (which was used quite a lot in the very earliest days of development and beta testing). But is such primitive polling code really what you want to use on a 600 MHz processor which is capable of doing a tremendous amount of work during the time 1 byte can transfer at typical baud rates?

You could try whittling down startup.c and deleting other files from the core library, until you're satisfied you've reached "bare minimum". Whether that really has any value, beyond personal satisfaction, is quite questionable. This chip has 1 megabyte of RAM and the flash has 2 megabytes of space for code.


... but I'm still a bit lost as I'm not sure what all the register writes are for and what the written values mean.

I can try to answer a few specific questions if you're stuck on something, but you really can't expect a novel-length answer explaining every single register write in startup.c!

Unlike NXP's example code which has an abstraction layer that renames everything (but is still just a very thin layer), Teensy's core library uses the exact same register names as the chip's 3637 page reference manual. So if you want to look up what any particular hardware register does, you can just do a text search for it and usually get right to the detailed documentation.

However, the exception is ARM's registers in the CPU core. Most of those aren't documented in NXP's manual. ARM documents those, and for those registers you have 2 choices, free or easy. For the free documentation, search Google for "DDI0403E". For the easy documentation, click the link for the "Definitive Guide..." book on the datasheets page. Even though a book hasn't been published specifically for Cortex M7 yet, almost all the ARM registers are the same as in Cortex M4. That book explains things very clearly, so it's well worth the money if you're really going to dive into the low level details.
 
So technically I could instruct the linker to not cause anything to be loaded into e.g. OCRAM and have the whole 512KB to myself?

Technically, the linker only creates symbols used by startup.c. Compiling & linking your program results in a HEX file with data that goes only into the flash memory. All configuration of how the RAM gets used is done by code in startup.c. The linker merely creates constants which get built into that startup.c code. It also creates code throughout the rest of your program with embedded memory addresses that depend on the memory setup being configured correctly before any of that code actually runs. If you create a bogus startup.c that doesn't precisely match the way you instructed the linker to build your code, your program will almost certainly hard fault when it tries to run.

But if you omit all setup, you will *not* get all 512K of RAM1 as a single region by default. If you don't write to the IOMUXC_GPR_GPR17 register, you'll get the hardware default of 128K ITCM, 128K DTCM and 256K OCRAM, plus the separate 512K OCRAM2. If you want 512K OCRAM and 512K OCRAM2, you must write to IOMUXC_GPR_GPR17, because that is not the default.
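As a rough illustration of what such a write contains, the sketch below builds a 32-bit bank configuration word on a desktop compiler. It assumes the encoding I recall from the reference manual's IOMUXC_GPR_GPR17 description: 2 bits per 32K bank, with 00 = unused, 01 = OCRAM, 10 = DTCM, 11 = ITCM. Please verify that against the manual before relying on it. The enum and function names are hypothetical, and the actual register write is deliberately left out (as far as I know, a select bit in IOMUXC_GPR_GPR16 is also involved).

```c
#include <stdint.h>

#define FLEXRAM_BANKS 16   /* 16 banks of 32 KB = 512 KB of RAM1 */

/* Assumed 2-bit encoding per bank; check the reference manual. */
enum bank_use {
    BANK_UNUSED = 0,
    BANK_OCRAM  = 1,
    BANK_DTCM   = 2,
    BANK_ITCM   = 3
};

/* Pack one 2-bit field per bank into a single 32-bit word,
   with bank 0 in the lowest two bits. */
uint32_t flexram_bank_config(const enum bank_use banks[FLEXRAM_BANKS])
{
    uint32_t word = 0;
    for (int i = 0; i < FLEXRAM_BANKS; i++)
        word |= (uint32_t)banks[i] << (2 * i);
    return word;
}
```

Under that assumed encoding, marking all 16 banks as OCRAM yields 0x55555555, and 4 banks of ITCM followed by 12 banks of DTCM yields 0xAAAAAAFF.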

Not using the extremely fast ITCM and DTCM buses would be a terrible waste! Likewise for the caches. This chip is loaded with incredibly powerful hardware, which requires code to configure it.
 
OK, that cleared many things for me already.

The thing I was mostly concerned about in the "bare minimum" setup was damaging or locking up the hardware by leaving something in an uninitialized state (not sure if that's even possible though).

I guess I'll go through startup.c with the documentation on hand.

And regarding your last point - I guess I mixed up OCRAM and OCRAM2. Does OCRAM refer to ITCM and DTCM combined (RAM1) and OCRAM2 to the other 512K of RAM (RAM2)? In any case, I meant that I would prefer having a fixed memory address (by directly accessing the memory addresses knowing there's nothing else there) for e.g. a large buffer in RAM2 and not having the need to allocate it through the C runtime library.

Thanks for your help!
 
There is a memory thread - search forum for imxrt-size - details there may be helpful, also that exe imxrt-size will spit out info on what is allocated where at compile time. I posted an output of that in the 1.49 beta 1 thread you can see the 12K used in OCRAM for USB buffers.

Also those 'names and sizes' used in startup.c to fill memory let you locate memory at runtime - for instance the unused ITCM space can be located and the residual part of the 32K can be located and used.

There should not be much of anything you could do that could hurt the T4 beyond maybe needing a 15 second restore. Using anything not initialized will result in perhaps a fault or hung program.
 
I believe this is the thread:
 

Indeed it is that thread. That post preceded the notes on T4 memory at pjrc.com/store/teensy40.html

And above was right > what speed is RAM2 that it needs to be cached?

...
The simple answer is 150 MHz, or 1/4 of whatever speed the M7 processor is running.

But the longer answer depends on details of how these buses and the bridges between them work. Sadly, NXP's documentation on those details is rather scant.
 
I meant that I would prefer having a fixed memory address (by directly accessing the memory addresses knowing there's nothing else there) for e.g. a large buffer in RAM2 and not having the need to allocate it through the C runtime library.

Why would you want this? What's the practical purpose, that's worth risking having too many cooks in the kitchen?

If you want a big block of 100K for a buffer, it's so very simple to just create an array as a global or static variable.

Code:
uint32_t myarray[25000];

If you want to explicitly control which memory region it's allocated within, just use a section attribute. That's what those keywords like DMAMEM do. If you want it aligned to 32 byte cache rows, just add the aligned attribute. Otherwise, the linker will automatically align it to whatever its data size is (so 4 byte aligned in this example).
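A sketch of what that could look like for the array above, using GCC attribute syntax. The section name ".dmabuffers" is my assumption about what the core's DMAMEM keyword expands to, and IN_RAM2 is a hypothetical macro name; check the core headers and imxrt1062.ld for the real section names. The section attribute is guarded so the same file also compiles on a desktop compiler, where only the alignment applies.

```c
#include <stdint.h>

#if defined(__IMXRT1062__)
/* Teensy 4.x build: ask the linker to place the variable in RAM2.
   Assumption: ".dmabuffers" is the section that DMAMEM uses. */
#define IN_RAM2 __attribute__((section(".dmabuffers"), used))
#else
/* Any other target: no special section, default placement. */
#define IN_RAM2
#endif

/* 100000-byte buffer, aligned to a 32 byte cache row. */
IN_RAM2 __attribute__((aligned(32))) uint32_t myarray[25000];
```

The linker still chooses the exact address, so nothing else can be placed on top of the buffer by accident.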

There are a lot of things you can do that are considered poor programming practice, because they tend to lead to subtle and difficult to find bugs. Even if you do not initially suffer these problems when you write the code, they very likely will come up in the future if you ever have to update or maintain it, not to mention reuse it in another project. Just because you can do a thing does not mean you should do it. Going around the linker to allocate buffers is rather unwise.

But if you're determined to do it anyway, at least be careful to edit imxrt1062.ld so the linker won't use part of the memory. Then in your code you can just cast an integer with the address to a pointer and use that memory however you like. But as good programming practice goes, this sort of thing is akin to never using "for" and "while" loops, just writing spaghetti code filled with labels and "goto".
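For completeness, casting an integer address to a pointer looks roughly like the sketch below. RAM2_BASE is an assumption (I believe RAM2 starts at 0x20200000 on Teensy 4.0, but confirm it in imxrt1062.ld), and buffer_at is a hypothetical helper. Dereferencing the result is only meaningful on the target, and only if the linker script really has been edited to keep that range free.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed start of RAM2 on Teensy 4.0; confirm in imxrt1062.ld. */
#define RAM2_BASE 0x20200000u

/* Turn a raw address into a byte pointer. `volatile` keeps the
   compiler from caching or dropping accesses to memory it knows
   nothing about. */
volatile uint8_t *buffer_at(uintptr_t address)
{
    return (volatile uint8_t *)address;
}
```

On the target, a call such as `buffer_at(RAM2_BASE + 0x1000)` would give a pointer 4K into RAM2. Nothing checks that the range is actually unused, which is exactly the risk described above.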

The linker is a very mature & useful tool. Really, you can & should trust it.
 
After some reconsideration, I decided that I will trust and use the linker, as there does not seem to be any point in not doing so.
 
One more question then: since there's a lot more RAM onboard than on many other MCUs, I'd like to experiment with code loaded at runtime (either from flash or an SD card). If I do access everything through variables and arrays and let the compiler/linker choose the addresses, how would I then "reserve" some memory (preferably in ITCM), load the code and then branch to it? Or is that even a good idea to begin with?
 