Teensy 4.1 Beta Test

@MartyMacGyver ... Paul noted in the TD 1.52 release that the PSRAM clock was set to 88 MHz to make sure all would be well for new hardware going out with such chips installed, since it is now done in startup.c when the chips are seen as present.

Early beta hacking set it at 133 MHz and it seemed to be fine for those few boards.

There is a SPIFFS library that will run on the Flash. It has the setup code and exports the Read, Write and Format functions that the SPIFFS code needs. Read works by direct memory access once enabled; write and format go through 'access functions'.

That code has been cleaned up a bit - but the option to use PSRAM also had the speed set to 133 MHz when it was tested there, and it may still have those clock settings in place.

github.com/PaulStoffregen/teensy41_extram
 
@degfragster, @mjs513 - I was able to test things out using the code on the `SPIFFS-FLASH-ONLY` branch of the teensy41_extram repo. I used `flashtest6.ino` example from the SPIFFS_t4 library as a basis (I'm not quite clear what extRAM_t4 is for but I haven't directly used it yet) and it appeared to work properly.

I then mixed in the contents of `teensy41_psram_memtest.ino` (renaming the setup() function to testpsram() and calling testpsram() at the end of setup()). It also worked, and reported 132.9 MHz in this context.

I'm curious how extRAM_t4 factors into this, or if it's relevant?
 
I should look at the current setup to say for sure - but as developed, the PSRAM wasn't configured before setup() until that was added for the TD 1.52 release.

There is a SPIFFS file system element and a separate driver so far for the QSPI setup and FLASH access - IIRC.

That may evolve beyond the current 'proof of concept' external libraries for FLASH and SPIFFS - or at least into libraries included as needed in the TeensyDuino install, perhaps.
 
If you are going to just do direct writes to PSRAM with your own implementation you don't need it. There is one cautionary tale about using direct writes at this point: EXTMEM.

If you are using EXTMEM along with direct writes to PSRAM for other reasons, you can wind up with one overwriting the other unless you are very careful. What the lib does is reserve the lower 2Mb of the first PSRAM chip (this can be adjusted) for EXTMEM. It also provides wrapper functions for direct writes along the same lines as the FRAM_MB85RC_I2C library, less a bunch of functions:
Code:
	void	readArray (uint32_t ramAddr, uint32_t items, uint8_t data[]);
	void	writeArray (uint32_t ramAddr, uint32_t items, uint8_t value[]);
	
	void	readByte (uint32_t ramAddr, uint8_t *value);
	void	writeByte (uint32_t ramAddr, uint8_t value);
	void	copyByte (uint32_t origAddr, uint32_t destAddr);
	void	readWord(uint32_t ramAddr, uint16_t *value);
	void	writeWord(uint32_t ramAddr, uint16_t value);
	void	readLong(uint32_t ramAddr, uint32_t *value);
	void	writeLong(uint32_t ramAddr, uint32_t value);

	void	eraseDevice(void);
	
	uint32_t eramBaseAddr = 0x70400000;
	uint16_t bytesAvailableMB = 4;

The examples also show how you can write and read structures. Hope this explanation helps.
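To illustrate the "very careful" part, here is a minimal sketch of a bounds check you could run before a direct write. The base address and the 4 MB size are copied from the listing above; the helper name directRangeOk is hypothetical and not part of extRAM_t4, and it assumes you are working with absolute addresses rather than offsets.

```cpp
#include <cstdint>

// Values copied from the listing above; they may differ in your setup.
constexpr uint32_t kEramBaseAddr   = 0x70400000u;
constexpr uint32_t kBytesAvailable = 4u * 1024u * 1024u;

// Hypothetical helper (not a library function): returns true only when
// [addr, addr + len) lies entirely inside the direct-access window,
// so a direct write cannot reach memory reserved below the base address.
bool directRangeOk(uint32_t addr, uint32_t len) {
  if (len == 0 || len > kBytesAvailable) return false;
  if (addr < kEramBaseAddr) return false;
  // Compare offsets instead of end addresses to avoid 32-bit overflow.
  return (addr - kEramBaseAddr) <= (kBytesAvailable - len);
}
```

A direct write would then be guarded with something like if (directRangeOk(addr, sizeof(myStruct))) before calling your own write routine.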
 
Too late here to unzip and see the current code for a pointer. But - as noted above p#751 - there are just three functions to access Flash [read, write, format] once it is enabled. These are instantiated and linked to SPIFFS for its use.

With a simpler library, and FLASH setup handled with a .begin(), those three func()'s (and needed helpers) would allow full user control, management and access to an installed flash.

The read is by direct address/access, easy enough IIRC - but writes need wrapper code, and a write can only change bits from the formatted 1's to 0's, so a block has to be formatted first when needed.
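The 1's to 0's rule can be modelled on a desktop with a plain byte array. This is only a simulation of the bit behaviour, not the actual flash driver; eraseBlock and programByte are made-up names for illustration.

```cpp
#include <cstddef>
#include <cstdint>

constexpr uint8_t kErased = 0xFF;  // a formatted (erased) flash byte reads as all 1's

// Simulated format: every bit in the block goes back to 1.
void eraseBlock(uint8_t *block, size_t len) {
  for (size_t i = 0; i < len; ++i) block[i] = kErased;
}

// Simulated program: bits can only be cleared, so the stored result is
// (old & value). Returns false when the cell did not end up holding
// 'value', meaning some bit needed 0 -> 1 and the block must be erased first.
bool programByte(uint8_t *cell, uint8_t value) {
  *cell &= value;
  return *cell == value;
}
```

Writing 0xA5 to a freshly erased byte works, but writing 0x5A on top of it leaves 0x00, which is why a file system like SPIFFS has to manage when blocks get formatted.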

That was done minimally and immediately evolved to SPIFFS - confusingly enough for both FLASH and PSRAM. That was not only cool - but seemed a good test for coverage and usage.
 
Yesterday evening had a little time, so I tried soldering on one of my castellated adapter boards for T4.1...

IMG_1121.jpg
IMG_1122.jpg

Nothing special - I was glad to see my measurements for where the two dips were on the bottom of the board were not too far off.

Have not done much with it yet, but did verify using our HILOW test that I could see all of the IO pins from the bottom. Now will probably mount it in some breadboard. Will be interesting to see how those pins are used during startup.
 
Again I was curious about how the bottom pins are used during startup to detect if these chip locations are occupied. So again, a not very exciting sketch:
Code:
const short cCoxaMin1[] PROGMEM = {0, 1, 2, 3};
void setup() {
  pinMode(13, OUTPUT);
  pinMode(54, OUTPUT);
}

uint8_t pin_state = 1;
void loop() {
  pin_state ^=1;
  digitalWriteFast(13, pin_state);
  digitalWriteFast(54, pin_state);
  delay(100); 
}
Now capturing the startup of the program, where the channels are pins 48-54 and 13, zoomed out it looks like:
screenshot.jpg

Zoomed into the main area...
screenshot2.jpg

I am thinking about a hack to make it possible for those sketches that actually wish to use bottom pads to not have this run... More on it if it works... Then see how much indigestion it creates ;)

The hacks I did to startup.c were:
Change the init of the memory size to: uint8_t external_psram_size = 0xff;

Move the call to configure_external_ram
to be just after the call to startup_early_hook.

Update the start of configure_external_ram to:
Code:
FLASHMEM void configure_external_ram()
{
	if (external_psram_size == 0) return;  // maybe allow someone to bypass...
	external_psram_size = 0; 
	// initialize pins

Then in my sketch I had:
Code:
extern "C" {
  extern uint8_t external_psram_size;
  void startup_early_hook(void) {
    external_psram_size = 0;  // force to 0 to not try to init
  }
}

And now the IO pins are no longer being touched until I may use them elsewhere...

But I am wondering again if this is too much of a hack? Or do we really want the early hook only after we check memory?

@Paul - what do you think?
 
@Paul and others, actually have a simpler hack, that goes along the lines of the bypass of calling serialEvents... Maybe with very little overhead.

Example just change: configure_external_ram, like:
Code:
PROGMEM uint8_t T41_SYSTEM_ENABLE_EXTERNAL_RAM __attribute__((weak)) = 1;

FLASHMEM void configure_external_ram()
{
	// initialize pins
	if (!T41_SYSTEM_ENABLE_EXTERNAL_RAM) return;

	IOMUXC_SW_PAD_CTL_PAD_GPIO_EMC_22 = 0x1B0F9; // 100K pullup, strong drive, max speed, hyst
	IOMUXC_SW_PAD_CTL_PAD_GPIO_EMC_23 = 0x110F9; // keeper, strong drive, max speed, hyst
	IOMUXC_SW_PAD_CTL_PAD_GPIO_EMC_24 = 0x1B0F9; // 100K pullup, strong drive, max speed, hyst

Then my guess is that compiler will throw away the if code...

And if a sketch wishes to not have this code run, they simply do:
Code:
PROGMEM uint8_t T41_SYSTEM_ENABLE_EXTERNAL_RAM = 0;

void setup() {
  pinMode(13, OUTPUT);
  pinMode(54, OUTPUT);
...
}

And it overrides the weak variable and the code does not run... Maybe again it then tosses the whole function code away...
 
@KurtE
Very cool hacking :) Going to have to get a castellated board at some point - still working my other distraction :)
 
That would be way easier to document and implement.

What would it look like if it could be inverted? A user wanting PSRAM would just declare: uint8_t external_psram_size;
 
Indeed very cool - great that the dimensions were right and the board was made (routed) to spec and it worked! With GND and 3V3 on that board and vBat or On/Off, that extra half inch could do a lot - if not blocking the SD socket when added.
 
Bloated RAM usage
On Teensy 41 the RAM usage is roughly 8.5 times that of Teensy 35 (41660 vs 4892 bytes).
With a simple blinky program on Teensy 41:
Building .pio/build/teensy41/firmware.hex
Advanced Memory Usage is available via "PlatformIO Home > Project Inspect"
RAM: [= ] 7.9% (used 41660 bytes from 524288 bytes)
Flash: [ ] 0.2% (used 16144 bytes from 8126464 bytes)

With Teensy 35
Building .pio/build/teensy35/firmware.hex
Advanced Memory Usage is available via "PlatformIO Home > Project Inspect"
RAM: [ ] 1.9% (used 4892 bytes from 262136 bytes)
Flash: [ ] 2.8% (used 14732 bytes from 524288 bytes)

Same result on PlatformIO and the Teensyduino IDE on macOS Catalina.
 
This has been talked about in several threads, including this one which has a lot more information:
https://forum.pjrc.com/threads/60506-Does-Teensy4-have-less-program-memory-than-Teens3-6

As mentioned in these threads, on the T4.x with the current linker script all code which is not marked as FLASHMEM is copied down into the faster memory and is run from there instead of from the flash.

At some point there may be some easier way to leave all of the code in flash except for that marked as FASTRUN, but...
 
Where can I find examples of using FLASHMEM, FASTRUN and the other keywords?
 
The best I've found is this:

https://www.pjrc.com/store/teensy40.html

These symbol section decorators work much like PROGMEM does on old-school AVR Arduinos, so you may be able to get a description of how to actually use these in code by looking for tutorials on that. However, you have to update PROGMEM to the teensy-specific section you're interested in.

One bit of information that might help is this:

Every global/static variable, and every non-inline function, has "storage" (takes up bytes in memory,) and has a "name" (it's a "symbol".)
The job of a linker is to lay the storage for all symbols into appropriate memory, and bind all the names to the address of the symbol represented by the name.
In a plain-jane computer system (Linux, MacOS, Windows, etc,) there are usually three, maybe four, separate "sections" into which data/code can be laid out:

TEXT -- this is traditionally the name for "code that's executed"
DATA -- this is where your globals/statics with initialized values live -- anything that doesn't have the value 0/null when the program starts
BSS -- this is where globals/statics that have the value 0/null end up
(possibly a CONSTDATA section, where un-writable constants like string literals may live -- these can also live in TEXT, or maybe DATA, depending on specifics of the computer OS / runtime model.)

The loader (that actually loads and runs the program file that the linker generated) will copy the bytes from the TEXT section into the "where code starts" address in memory, and copy the bytes from the DATA section into the "where global variables start" address in memory, and then reserve enough space in memory for the BSS section.

BSS is an optimization, in that the loader doesn't need to copy any data into that section, it can just nuke it all to zero and call it good. Thus, BSS also doesn't take any storage in the program file itself.

Dynamic link libraries, and memory mapped files, of course throw some wrenches into these works, which you can go on a Wikipedia journey to discover if you want to have a good time, but that's not important for this discussion.

Then, the program starts running.

The model for the Teensy 4 isn't that different, except it has a few more sections. There's no "file on disk," but there is "flash memory." And the "flash memory" is directly addressable by the CPU. The "copy bytes into memory" has to be done to initialize RAM that can be written, and it has to be done to initialize RAM that it's fast to run code from, but it doesn't have to be done for constant data and code that can live in flash memory.

So, looking at the Teensy 4 reference page above:

The DATA segment gets copied into RAM1 (DTCM.) -- these are regular initialized globals

The BSS segment goes into RAM1 (DTCM.) -- these are regular zero globals

FASTRUN code gets copied into RAM1 (ITCM.) -- this is the default for functions/code
Because of memory controller reasons (look up Harvard Architecture,) the code and data can't share a 32 kB page, so the FASTRUN code size gets rounded up to 32 kB.

FLASHMEM code, and PROGMEM variables, stay in flash. (PROGMEM is the same as for Arduino, except you can read it directly without the special functions that the Arduino needs.)
The flash memory also needs to store copies of FASTRUN and DATA, but those copies are not referenced once the program has actually started running.
(A common optimization is to compress the text and data sections, so they take up less flash memory -- I don't know if Teensy has made arrangements to do this or not, as it requires some special modifications to the tool chain.)

RAM1 also contains the runtime stack, which starts at the top and grows downwards. Because you will have interrupts and function local variables, and maybe even use recursion (horror!) you need to make sure that you DON'T FILL UP RAM1! If it says you're 99% full, you have significant risk that your stack will overwrite the globals that go into BSS.

RAM2 contains variables marked DMAMEM. This is another kind of BSS data. Any variable marked DMAMEM will NOT be initialized. RAM2 also contains the heap that you get access to when you malloc()/new variables. Personally, I'm not a fan of using malloc/new in embedded applications, but you could do things like declare global variables that are pointers or references, initialized to the output of a call of malloc() or new. Running DMA from RAM2 probably has special implications related to the cache controller -- specifically, I would expect cache coherency to be stronger on the RAM2 bank, and that may or may not cause performance for tight accesses to that memory to be different. Some enterprising soul could benchmark this and report back!

Additionally, FLASHMEM code will not be writable, whereas FASTRUN code (unmarked functions) will be modifiable, perhaps by a stray pointer. Thus, you will have more robust code if you place it all into FLASHMEM. Note that self-modifying code will not "just work" in RAM1 -- you need to also arrange to flush or evict the instruction cache for that memory area, if you generate code at runtime. If you don't care about self-modifying code, no problem!

Finally, if you need DMAMEM for large mutable tables, but you want to initialize them, you can always declare the initialization data in PROGMEM and then memcpy() into the buffer in your setup() function.
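A minimal sketch of that PROGMEM-plus-memcpy pattern could look like the following. The table contents and the names kTableInit, gTable and initTable are made up for illustration; the empty fallback macros are only there so the snippet also compiles on a desktop, and on a real Teensy 4.x build the core supplies PROGMEM and DMAMEM.

```cpp
#include <cstdint>
#include <cstring>

#if defined(ARDUINO)
#include <Arduino.h>   // brings in the real PROGMEM / DMAMEM on Teensy
#endif
#ifndef PROGMEM
#define PROGMEM        // desktop fallback: plain storage
#endif
#ifndef DMAMEM
#define DMAMEM         // desktop fallback: plain storage
#endif

// Initial values stay in flash and are never copied automatically.
PROGMEM const uint16_t kTableInit[8] = {1, 2, 4, 8, 16, 32, 64, 128};

// Mutable working copy in RAM2; treat its contents as undefined at startup.
DMAMEM uint16_t gTable[8];

static_assert(sizeof(gTable) == sizeof(kTableInit), "table sizes must match");

// Call this once from setup() before anything reads gTable.
void initTable() {
  std::memcpy(gTable, kTableInit, sizeof(gTable));
}
```

Calling initTable() again later also gives you a cheap way to reset the table to its flash defaults.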

So, with all that, you should have the information you need to figure out where to place each piece of code. FLASHMEM code can run with good performance, if it contains small loops, because it will be cached, but the initial access to that code (the first iteration of the loop) will run slower. Large, bulky, serial, code, will run less fast from FLASHMEM. If it's not performance critical code, that may not matter.

Here are some very simple examples:

Code:
//  RAM1, modifiable, initialized
char myWritableString[] = "This is a writable string (RAM1)";

//  RAM1, modifiable, initialized
int mySmallBuffer[10] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };

//  RAM1, modifiable, BSS
char someZeroBuffer[20];

//  flash, not modifiable, initialized
PROGMEM const char myConstantString[] = "This is a constant string (flash)";

//  RAM2, modifiable, not initialized
DMAMEM int myBigBuffer[10240]; // contents are undefined until you write them

//  RAM1, "modifiable," initialized
void fastrunFunction(int from) {
  for (int i = from; i != 100; ++i) {
    mySmallBuffer[i & 7] += 1;
    if (i & 1) {
      fastrunFunction(i+1);
    }
  }
}

//  flash, not modifiable, initialized
FLASHMEM void flashmemFunction(int from) {
  for (int i = from; i != 100; ++i) {
    myBigBuffer[i] += 1;
    if (i & 1) {
      flashmemFunction(i+1);
    }
  }
}

void setup() {
  pinMode(13, OUTPUT);
}

void loop() {
  digitalWrite(13, HIGH);
  digitalWrite(13, LOW);
  fastrunFunction(90);
  flashmemFunction(90);
}


@PaulStoffregen I'm not possessive about this explanation; if you want to use it in documentation or copy/modify sections of it, please go ahead!
 
Thanks @jwatte,

A while ago, I started a thread that talked about some of the different memory regions and some of the sections and the like:
https://forum.pjrc.com/threads/57326-T4-0-Memory-trying-to-make-sense-of-the-different-regions

Sometimes the easiest place to find examples is to look in the released code.

PROGMEM - Sort of like the old AVR days, but on the T4.x. (Sort of.) In particular, when you define a variable with PROGMEM, the variable is left up in the Flash memory and not copied down to the faster, but more limited in size, DTCM (Data Tightly Coupled Memory). Note however that unlike AVR processors, which have different address spaces, the ARM processor has one address space, so you don't need all of those screwy macros and the like to then access the memory.

Again, as mentioned, by default all of your code is copied out of Flash into ITCM (Instruction Tightly Coupled Memory), which comes out of the same 512KB of fast memory. That 512KB is divided between DTCM and ITCM in 32KB blocks. So if your code is under 32KB, only one of the 16 32KB blocks will be used for instructions and the rest for data ....
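The 32KB block arithmetic can be sketched like this. It only models the rounding described here (512KB total, 32KB granularity); the function names are made up, and the real split is decided by the linker script and startup code.

```cpp
#include <cstdint>

constexpr uint32_t kTcmTotal = 512u * 1024u;  // combined ITCM + DTCM
constexpr uint32_t kTcmBlock = 32u * 1024u;   // split granularity

// ITCM bytes reserved for a given amount of RAM-resident code,
// rounded up to whole 32KB blocks.
constexpr uint32_t itcmReserved(uint32_t codeBytes) {
  return ((codeBytes + kTcmBlock - 1u) / kTcmBlock) * kTcmBlock;
}

// What is left over for DTCM (initialized data, BSS and the stack).
constexpr uint32_t dtcmAvailable(uint32_t codeBytes) {
  return itcmReserved(codeBytes) >= kTcmTotal
             ? 0u
             : kTcmTotal - itcmReserved(codeBytes);
}

static_assert(itcmReserved(20000u) == 32768u, "small sketch uses one block");
static_assert(dtcmAvailable(20000u) == 491520u, "15 blocks remain for data");
```

So a sketch with 20000 bytes of code and one with 32768 bytes cost the same amount of fast RAM, while one extra byte beyond that takes a second 32KB block.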


So to keep a function from being brought down to the faster memory, you can mark it with FLASHMEM. Example in startup.c:
Code:
FLASHMEM void configure_cache(void)
{
...

Now if you have large buffers and the like that are not initialized, and you don't mind them being slower, you can use the other 512KB of memory on the board. It is slower, but there is a hardware cache that speeds it up.

You can set these up by using the DMAMEM keyword, like: DMAMEM uint8_t frame_buffer[320*240*2];
In this case that is the size for an ILI9341 display... Note: if you do any malloc operations, this also comes out of this 512KB.

New to the T4.1 - if you install external memory on your T4.1, you can declare stuff to be created there
by using the keyword: EXTMEM.
For example, I have code for playing with an ILI9486 with 4 bytes per pixel and 320x480...
So far I don't think it supports initialized variables here. But comments in the startup code look like it might soon.
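A quick size check shows why the two displays end up in different memories. This is plain arithmetic using the dimensions mentioned here; the helper names are made up, and the real free space in RAM2 will be lower once malloc and other DMAMEM buffers are counted.

```cpp
#include <cstddef>

constexpr size_t kRam2Bytes = 512u * 1024u;  // the "other" 512KB (DMAMEM + heap)

// Bytes needed for a full frame buffer.
constexpr size_t frameBufferBytes(size_t width, size_t height, size_t bytesPerPixel) {
  return width * height * bytesPerPixel;
}

// True when a buffer of this size could fit in an otherwise empty RAM2.
constexpr bool fitsInRam2(size_t bytes) {
  return bytes <= kRam2Bytes;
}

// ILI9341, 320x240 at 2 bytes per pixel: 153600 bytes, fine for DMAMEM.
static_assert(frameBufferBytes(320, 240, 2) == 153600, "ILI9341 frame size");
static_assert(fitsInRam2(frameBufferBytes(320, 240, 2)), "fits in RAM2");

// ILI9486, 320x480 at 4 bytes per pixel: 614400 bytes, more than RAM2 holds,
// which is the kind of buffer EXTMEM is useful for.
static_assert(frameBufferBytes(320, 480, 4) == 614400, "ILI9486 frame size");
static_assert(!fitsInRam2(frameBufferBytes(320, 480, 4)), "too big for RAM2");
```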

Not sure if I covered everything, but hope that helps
 
Thanks @jwatte,@KurtE
  • Cache is disabled on the RAM1==ITCM/DTCM regions because they run at CPU speed - no coherency problems there
  • Currently the ITCM is left writable, so that area can be changed. The unused space above the code and under DTCM is also findable/writable.
  • The RAM2 is cached - that does have coherency issues with DMA usage, where a cache flush/delete is needed because DMA bypasses the cache
  • RAM2 area cannot execute code
  • RAM2 is clocked slower at F_CPU_ACTUAL/4 - but has the cache coverage
  • This github.com/FrankBoesing/T4_PowerButton library has memory info functions
 
Very nice! I wish we could collect all of these into a single, descriptive page, and put it inside the Teensy documentation somehow.
In general, the Teensy documentation links on the main store/product pages feel ... fragmented, to me. It's hard to find what you really need.
 
Me too! - FYI - There are some on the forum who have been trying to collect stuff into an unofficial WIKI: https://github.com/TeensyUser/doc/wiki

I have not looked in a while to see if there is anything up there on the memory stuff yet.
 
Thanks, this helps to understand this unique functionality.
 
What happens with const variables? Are they also in RAM or are they in the Flash memory as expected?
 
Here are additional linker-scripts and a modified boards.txt.

When they are installed (overwrite the existing boards.txt), you can choose to use the old "Teensy 3" behaviour.
Programs will run from flash then - which in some cases might be a bit slower if the cache can't keep up.
In most cases you will not see any difference in speed, but you gain a lot more RAM. The larger your program is and the more code is "HOT", the more likely it is that the cache is not enough.

You can mark some functions as "FASTRUN" (as on Teensy 3), which will place them in the fast ITCM, but they need RAM then.

You can select the linkage in the Arduino menu.
Note that the memory info in the compiler window is confusing and cannot show the memory usage in a comprehensive way.

https://github.com/FrankBoesing/snippets/tree/master/linker
 