BUG/FYI: 16/32-bit access to address 0x1FFF_FFFF causes hang (Teensy 3.x)

Frank B

There's a general problem:

An access to address 0x1FFFFFFF with more than a byte (int16_t or bigger) causes a fault.
I found that this is documented in the Reference Manual, in chapter 3.5.3.4, too:

Burst-access cannot occur across the 0x2000_0000 boundary
that separates the two SRAM arrays. The two arrays should be
treated as separate memory ranges for burst accesses.

This can(!) cause problems, because normally you don't know where a variable is stored.

Solution: All variables larger than 1 byte (including arrays and structures, stack and heap) located in this region should be aligned to at least 2 bytes (I think this is not possible)
- or - the linker script has to define two (RAM) memory regions
- or - ignore it and hope that your program works as expected if it uses more than 32 kB of RAM (Teensy 3.1) :)
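
To make the failing condition concrete, here is a minimal sketch of a check for whether an access straddles the boundary. The helper name crossesSramBoundary is made up for illustration and is not part of any Teensy API; the boundary address 0x2000_0000 is taken from the manual text quoted above.

```cpp
#include <cstddef>
#include <cstdint>

// First address of the upper SRAM array; the lower array ends at 0x1FFFFFFF.
constexpr std::uint32_t kSramBoundary = 0x20000000u;

// Returns true if an access of `size` bytes starting at `addr` touches
// bytes on both sides of the boundary.
constexpr bool crossesSramBoundary(std::uint32_t addr, std::size_t size) {
    return size > 0 && addr < kSramBoundary &&
           static_cast<std::uint64_t>(addr) + size > kSramBoundary;
}
```

By this check a 2-byte access at 0x1FFFFFFE is fine, while a 4-byte access at the same address still crosses, so 2-byte alignment only protects 16-bit accesses.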


This took me some time....
 
Well in theory (I don't know the ARM GCC backend, just speaking as GCC defaults), any static/global containing scalar types other than char, signed char, or unsigned char, should already have at least 2 byte alignment. On some systems arrays/structures containing only characters would also be given a larger alignment, but perhaps not on the Arm.

Dynamic allocations and stacks might cross that boundary, but again such things are typically aligned.

Now, if you are playing games with pointers, then you do have to be careful. Also, if you declare things specifically unaligned or packed, it might be problematical.
 
Would this happen in malloc()?
Some malloc()'s can manage multiple heaps (there's an odd name for these), so a proper linker script and a mmalloc would avoid this? On a small M4?
 
Malloc is defined as returning appropriately aligned blocks, so any pointer returned will have the bottom 2-3 bits all 0 (it depends on what the ARM alignments are, and I'm too lazy to look them up). So, unless you are doing something that intentionally creates unaligned pointers, you should be safe (insert hand waving here).
 
On malloc() I was wondering about the block of memory spanning the 32KB boundary - and if hardware things like DMA can't cope.
 
Yep, that can be a problem as well. Though as I read the original report, it sounded like the only thing that did not work was a 2-byte load/store from address 0x1FFFFFFF or a 4-byte load/store from addresses 0x1FFFFFFD-0x1FFFFFFF. Generally the compiler tries to align things on natural boundaries. So short/unsigned short types should be aligned to have a 0 in the bottom bit (i.e. a single short won't have a byte on each side of the boundary). Similarly, int/unsigned int/float should have 00 in the bottom 2 bits. Now, if you hand a random pointer to a function, the compiler will assume it is properly aligned, and if it isn't, the access may be slower. Internally within GCC, there is a setting that says the compiler can issue unaligned loads/stores in some cases, because they will 'just work' but be slower, but the ARM compiler does not set this (it says that when the compiler knows things are unaligned, it must move the bytes individually and reassemble the value afterwards).
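
For illustration, this is roughly what "move the bytes and reassemble" looks like when written by hand. It is only a sketch: loadU32ByteWise is a hypothetical helper, it assumes little-endian byte order (as on the Cortex-M4), and the pointer is volatile so that the compiler is not free to merge the four byte loads back into one word access.

```cpp
#include <cstdint>

// Reads a little-endian 32-bit value from any address using four separate
// byte loads, so no single access is wider than one byte.
inline std::uint32_t loadU32ByteWise(const volatile void* p) {
    const volatile unsigned char* b =
        static_cast<const volatile unsigned char*>(p);
    return  static_cast<std::uint32_t>(b[0])        |
           (static_cast<std::uint32_t>(b[1]) << 8)  |
           (static_cast<std::uint32_t>(b[2]) << 16) |
           (static_cast<std::uint32_t>(b[3]) << 24);
}
```

This is slower than a plain word load, which is the trade-off described above.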

However, in the grand scheme of things it may be better to not span the boundary. This would mean hacking of the linker script, as well as reworking malloc, so it never returns a pointer that crosses the boundary. BTW, as a compiler hacker, I hate such 'features' in the hardware, though I can understand why they are there.
 
The compiler / linker could cope with the restrictions at 32KB. IF someone made it so.
But the DMA hardware, hopefully, doesn't fault if it does a transfer at that boundary. That too can be dealt with I suppose. But imagine a large DMA buffer (KBytes) that spans the boundary. The compiler is unaware of such run time things. It would be up to malloc() or the linker if the region was named, to avoid.
I've seen these hardware architectures arise in other computers of the past. Always messy.
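
Whether DMA is actually affected is not established in this thread. If a transfer did have to respect the boundary, one option would be to split it into two pieces at 0x2000_0000, following the manual's advice to treat the two arrays as separate ranges. A sketch under that assumption (Segment, SplitResult and splitAtSramBoundary are invented names, and addresses are plain integers so the logic can be tried on a PC):

```cpp
#include <cstdint>

struct Segment {
    std::uint32_t addr;  // start address
    std::uint32_t len;   // length in bytes
};

struct SplitResult {
    Segment first;   // part at or below the boundary (or the whole range)
    Segment second;  // part above the boundary; len == 0 if no split needed
};

// Splits [addr, addr + len) at 0x20000000 if the range straddles it.
inline SplitResult splitAtSramBoundary(std::uint32_t addr, std::uint32_t len) {
    constexpr std::uint32_t kBoundary = 0x20000000u;
    SplitResult r{{addr, len}, {kBoundary, 0u}};
    if (addr < kBoundary && len > kBoundary - addr) {
        r.first.len  = kBoundary - addr;      // bytes up to 0x1FFFFFFF
        r.second.len = len - r.first.len;     // bytes from 0x20000000 on
    }
    return r;
}
```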
 
Sure, all of these can be dealt with. It is more a question of who is going to do all of this work.

As I said, it may be a lot simpler to treat it as having two separate memory pools, so that every object lies entirely in one memory pool or the other. Then you only have to worry about the bug in a smaller number of places (linker script and malloc). You potentially waste more memory by having two pools, but it simplifies a lot of other things.

Or if you don't need all 64K, put all of the stuff setup by the linker in the first 32K, and use the second 32K for stack/heap. But there are likely some people that need to have statically allocated 48K worth of data, or need a single large buffer.
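
A rough sketch of the two-pool idea, assuming a simple bump allocator rather than a real malloc replacement. Pool and allocFromPools are hypothetical names, and addresses are plain integers so the logic can be checked off-target.

```cpp
#include <cstdint>

// One contiguous address range with a bump pointer.
struct Pool {
    std::uint32_t next;  // next free address
    std::uint32_t end;   // one past the last usable address
};

// Hands out `size` bytes aligned to `align` (a power of two) from the first
// pool that has room. Every block lies entirely inside one pool, so it can
// never span the boundary between them. Returns 0 if neither pool has space.
inline std::uint32_t allocFromPools(Pool (&pools)[2], std::uint32_t size,
                                    std::uint32_t align = 4) {
    for (Pool& p : pools) {
        const std::uint32_t start = (p.next + (align - 1)) & ~(align - 1);
        if (start >= p.next && start <= p.end && size <= p.end - start) {
            p.next = start + size;
            return start;
        }
    }
    return 0;
}
```

With one pool ending at 0x2000_0000 and the other starting there, a request that does not fit in the remainder of the lower pool is served from the upper pool instead, at the cost of leaving a gap.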
 
But there are likely some people that need to have statically allocated 48K worth of data, or need a single large buffer.

Exactly. And this is why I vote for "ignore this". But it should be documented and easier to find.

Is it possible for the linker to give a warning?

The issue occurred with a large array of chars, which happened not to be aligned, but I accessed it word-wise via a typecast. Nothing complicated.
I solved this with "align(2)".
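
A small sketch of forcing the alignment of a byte buffer and checking it, assuming a C++11 compiler. alignas is standard C++; with GCC the older spelling is __attribute__((aligned(N))). The names gBuffer and isAligned are only for illustration.

```cpp
#include <cstdint>

// Byte buffer whose start address is guaranteed to be a multiple of 4.
alignas(4) std::uint8_t gBuffer[64];

// True if `p` is a multiple of `alignment` (which must be a power of two).
inline bool isAligned(const void* p, std::uintptr_t alignment) {
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}
```

Alignment 4 rather than 2 is used here because a word-wise (32-bit) access needs the two lowest address bits to be zero.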
 
To quote Henry Spencer (USENET demigod) from many years ago: "If you lie to the compiler, it will get its revenge." Using type-punning/casting pointers is a form of lying to the compiler. In the standards, this is undefined behavior if the pointer being cast is not suitably aligned. In general, newer versions of the compiler tend to find more places where users subvert the rules. On some RISC machines, it would not work at all if the word access was not aligned, not just in this one corner case.

Having written from scratch a C compiler front end for a word oriented machine that had byte addressability thrown in at the last minute, using a different pointer representation (Data General MV/Eclipse computers), I do tend to be sensitive to lax code.
 
I wonder if a software workaround is possible?

Perhaps the memory fault handler could get some default code that could detect this, perform the actual operation as four byte-sized reads, and return gracefully as if things "just work"....
 
It depends on what the real hardware does. After being in the trenches for many years, I know that hardware can bite you in these corner case situations. Some hardware might just randomly give you bytes from other loads.
 
To quote Henry Spencer (USENET demigod) from many years ago: "If you lie to the compiler, it will get its revenge."

:) Yes, but it's not a compiler issue. And in this case, it was my fault. I'm completely aware of this.
But I was not aware that this can cause a problem :) I'm not that experienced in C or C++... the last time I used C was two years ago.

I really don't think that we urgently need a fix for this "problem" (that's why I wrote "FYI" in the subject), because the cases are rare, but a "memory boundary" warning would be nice, if possible.

Michael, I wrote a library for CRC. The buffers for it can have any size. The CRC unit is faster with 32-bit values, so I feed it 32-bit values; only the remaining bytes are fed as 16-bit or 8-bit values.
 
Note, even if the machine 'works' with unaligned data, it is generally faster to do a test and if the pointer is unaligned, handle the first 1-3 bytes, then handle all of the 32-bit aligned words using aligned pointers, and then handle the 1-3 trailing bytes. If buffers are expected to be really big, you might need to look into prefetch strategies to stream the data. For example, on Intel, the machine will break each part of the load/store into 2 separate microcode instructions.
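
The head/body/tail split described above might look like this. It is a sketch, not the CRC library's code: a plain byte sum stands in for feeding the CRC unit, and FeedStats and feedBuffer are made-up names.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

struct FeedStats {
    std::uint32_t byteFeeds;  // single-byte steps (head and tail)
    std::uint32_t wordFeeds;  // aligned 32-bit steps (body)
    std::uint32_t sum;        // stand-in result: sum of all bytes
};

inline FeedStats feedBuffer(const std::uint8_t* data, std::size_t len) {
    FeedStats s{0, 0, 0};
    // Head: single bytes until the address is 4-byte aligned.
    while (len > 0 && (reinterpret_cast<std::uintptr_t>(data) & 3u) != 0) {
        s.sum += *data++;
        --len;
        ++s.byteFeeds;
    }
    // Body: whole words, each one starting on a 4-byte boundary.
    while (len >= 4) {
        std::uint32_t w;
        std::memcpy(&w, data, sizeof w);  // avoids casting the byte pointer
        s.sum += (w & 0xFFu) + ((w >> 8) & 0xFFu) + ((w >> 16) & 0xFFu) + (w >> 24);
        data += 4;
        len -= 4;
        ++s.wordFeeds;
    }
    // Tail: the remaining 0 to 3 bytes.
    while (len > 0) {
        s.sum += *data++;
        --len;
        ++s.byteFeeds;
    }
    return s;
}
```

Because every word in the body starts on a 4-byte boundary, no word access in the loop can have bytes on both sides of 0x2000_0000.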
 
Hi Michael,
I don't know the size of the buffers. This was a test case, a kind of "benchmark". I measured the time for a large buffer. Normally, they are not very big, I think.
It's a library and Arduino is for beginners, so I can't easily say "hey, user, align your buffer". I know that it is faster when it is aligned. But it works without that (well, most of the time).
Your proposal is great, I'll update my lib.

But... the problem still exists :)

Also, I must apologize. Misaligned reading from this address gives wrong results, not a crash. I just tested it again. I did not test misaligned writes.
I don't know why it crashed a few days ago. Maybe I had an additional problem in my code.

btw, excuse my English... school was long ago :)
 
The Cortex-M4 processor has hardware support for mis-aligned access. When you read or write 32 bits that aren't aligned on a 32 bit boundary, the hardware automatically stalls the processor and performs a pair of reads on two 32 bit words, or a pair of read-modify-write operations if you're writing.

Properly aligned reading & writing is faster.

Perhaps another good solution would involve checking the 2 lowest bits of the address, before starting the main loop. If they're not zero, perhaps those first 1 to 3 bytes could be read and their CRC computed the slow single byte way. Then the main loop could read 32 bits at a time, always properly aligned.

Someday I'm going to try writing a fault handler that transparently fixes this problem. That might turn out to be impossible, or far too much code?
 
Hi Paul,
I don't know if this is possible.

But... a general fault handler in the form of a blinking LED would be great!
If this is not OK for Teensy 3.0/3.1: maybe with a dedicated pin on the next Teensy 3.x+? :)
 
^^
Bump :) With an RGB LED you won't need much space...
This "bug" was seen again (see my MP3 thread).

Kind regards,
Frank
 
@Paul,

would you accept a pull request if I wrote a "Morse code" signalling fault handler with the on-board LED?
 
Morse code or, more commonly, two-digit blink codes, with a delay between the two digits.
blink-blink delay blink-blink-blink-blink delay delay is twenty-four. It usually repeats forever.
Few know Morse code.
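
A two-digit blink code can be computed as a pure function of elapsed time, which avoids any waiting inside the function. The sketch below assumes arbitrary timings (kBlinkMs, kPauseMs) and an invented name, blinkCodeLedOn; none of this comes from the Teensy core.

```cpp
#include <cstdint>

constexpr std::uint32_t kBlinkMs = 200;   // on-time and off-time of one blink (assumed)
constexpr std::uint32_t kPauseMs = 1000;  // one "delay"; two of them end the code (assumed)

// Returns whether the LED should be lit `elapsedMs` after the pattern began.
// Pattern: `tens` blinks, delay, `ones` blinks, delay delay, then repeat.
inline bool blinkCodeLedOn(std::uint32_t elapsedMs, unsigned tens, unsigned ones) {
    const std::uint32_t firstLen  = tens * 2 * kBlinkMs;
    const std::uint32_t secondLen = ones * 2 * kBlinkMs;
    const std::uint32_t period    = firstLen + kPauseMs + secondLen + 2 * kPauseMs;
    std::uint32_t t = elapsedMs % period;
    if (t < firstLen) return (t / kBlinkMs) % 2 == 0;   // first digit
    t -= firstLen;
    if (t < kPauseMs) return false;                     // gap between digits
    t -= kPauseMs;
    if (t < secondLen) return (t / kBlinkMs) % 2 == 0;  // second digit
    return false;                                       // long gap before repeat
}
```

A caller would pass a millisecond counter and write the result to the LED pin each time it is polled.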
 
would you accept a pull request if I wrote a "Morse code" signalling fault handler with the on-board LED?

Maybe. Details matter.

It must preserve the USB and serial polling already in the fault handler, so you can't simply implement busy looping for the delays. Obviously, you also can't rely on the normal Arduino timing stuff, since the fault handler executes at higher priority than all other interrupts, including Systick.

It must not default to taking over pin 13, because we have no idea what various projects have connected to that pin. Anyone using the audio shield has that pin receiving I2S data, for example. Reconfiguring the pin to GPIO mode and turning it to output mode must be explicitly enabled by some function.

Extra code size in that handler, and 1 to 4 bytes of RAM always allocated, even when the feature is unused, is probably ok.
 
Something like this (?):
nmi.h :
Code:
#ifndef _nmicatcher_h_
#define _nmicatcher_h_

#include "core_pins.h"
#include "kinetis.h"
extern "C" {void catchNmi(int8_t pin);}

#endif
nmi.cpp:
Code:
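#include "nmi.h"

// Pin to blink after a fault; -1 means the feature is disabled (default).
static volatile int8_t nmiPin = -1;
// Bit of systick_millis_count that drives the LED. Unsigned, so that the
// mask 0x80 selects a single bit instead of being sign-extended.
static volatile uint8_t nmiBlinkMask;

extern "C" {

static void nmiBlink(void);

// Opt in: call once from setup() with the pin that has an LED attached.
void catchNmi(int8_t pin) {nmiPin = pin;}

// Replacement SysTick handler, installed by the fault handlers below.
static void nmiBlink(void)
{
	if (nmiPin >= 0)
	{
		pinMode(nmiPin, OUTPUT);
		// The selected counter bit sets the blink rate.
		digitalWrite(nmiPin, (systick_millis_count & nmiBlinkMask) ? HIGH : LOW);
	}
	systick_isr(); // keep the normal millisecond tick running
}

// Each handler picks a blink rate and reroutes SysTick (vector 15) to nmiBlink.
void hard_fault_isr(void) {nmiBlinkMask=0x10; _VectorsRam[15]=nmiBlink;}
void memmanage_fault_isr(void) {nmiBlinkMask=0x80; _VectorsRam[15]=nmiBlink;}
void bus_fault_isr(void) {nmiBlinkMask=0x20; _VectorsRam[15]=nmiBlink;}
void usage_fault_isr(void) {nmiBlinkMask=0x40; _VectorsRam[15]=nmiBlink;}

}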
If you want to test:
Code:
#include "nmi.h"

void setup() {

  catchNmi(31); //<- use Pin31 for test, connect LED here (or use 13...)

  //Tests:
  //hard_fault_isr();
  //bus_fault_isr();
  //usage_fault_isr();
  //memmanage_fault_isr();

  // provoke bus_fault (test):
  (*(volatile uint32_t *)0x1FFFFFFF) = 0;

}


void loop() {}

I don't know how to provoke the other faults... :)
So, simply include the header and initialize it with catchNmi(ledPin).

Hopefully this helps others to identify "inexplicable" errors. (EDIT: is "inexplicable" correct English? I mean errors for which the user has no explanation.)

Regards, Frank

Edit: Maybe you want to replace the line systick_isr(); with a call through the original vector[15]. This needs 4 more bytes to save the "original" vector (a hook), but has the advantage that the user can change it.

Edit2:
It's a "hard_fault", not a bus fault.
P.S. For a slower blink, shift systick_millis_count to the right (for example >>3).
 
Code:
#include "nmi.h"

void setup() {
   catchNmi(31); //<- use Pin31 for test
  while(!Serial);
  delay(1000);
}

#define M8A    (*(volatile uint8_t  *)0x1FFFFFFD)

#define M16    (*(volatile uint16_t *)0x1FFFFFFF)
#define M32_1  (*(volatile uint32_t *)0x1FFFFFFF)
#define M32_2  (*(volatile uint32_t *)0x1FFFFFFE)
#define M32_3  (*(volatile uint32_t *)0x1FFFFFFD)

volatile uint32_t x;
uint8_t *p;

void loop() {
  p = (uint8_t*)0x1FFFFFFD;
  *p++ = 0x01;
  *p++ = 0x01;
  *p++ = 0x01;
  *p++ = 0x01;

  x = M16;
  Serial.printf("Read16 ok (wrong value): %x\r\n",x); 

  delay(5000);   
  M16 = 0x2222; //LED Blink starts
  Serial.printf("Write16: %x\r\n", M16); 

  delay(5000);   
  Serial.printf("Read32 1:%x\r\n", M32_1); 

  delay(5000);   
  Serial.printf("Read32 2:%x\r\n", M32_2); 

  delay(5000);   
  Serial.printf("Read32 3:%x\r\n", M32_3); 

  delay(5000);   
  M32_1 = 0x12345678;
  Serial.printf("Write32 1:%x\r\n", M32_1); 

  delay(5000);   
  M32_2 = 0x5678ABCD;
  Serial.printf("Write32 2:%x\r\n", M32_2); 

  delay(5000);   
  M32_3 = 0x87654321;
  Serial.printf("Write32 3:%x\r\n", M32_3); 
  
}

Output:
Code:
Read16 ok (wrong value): 1
Write16: 22
Read32 1:800022
Read32 2:80002201
Read32 3:220101
Write32 1:800078
Write32 2:8000abcd
Write32 3:654321
Regarding "Audiobeeps":

Please try this.
I don't know why: sometimes something seems to crash (I can't reproduce this at the moment!), but most of the time you get wrong values. Possibly some code in the sketches crashes because of the wrong values. I don't know at the moment, and it's not important, I think. Wrong values are worse than crashing, because it's not obvious when something goes wrong. As I wrote in post #16, we get wrong values. I can't remember exactly why I wrote "cause hang". It's too long ago. Maybe this was not correct, and the user code crashed because of wrong values.
But I remember that there were crashes sometimes during my tests with MP3.

The above code loops fine. But look at the output, please: all wrong.
Also, please note: the blinking starts with "Write16". Read16: NO FAULT (or possibly my test is incorrect?)

This leads to the question: is the audio memory aligned to 4 or 2? It should be 4, to allow 32-bit access.
Is DMA for audio 32 or 16 bit? Can DMA handle the boundary correctly (even with 32-bit aligned data)?

EDIT: OK, forget the crashes, it's an infinite loop, as you wrote.
 
What does LDM do when it performs several 32-bit reads at once? Then alignment to 4 would not help if it can't handle the switch to the other RAM... we'd need 4*register count.
Maybe someone wants to test this...
Also, even if all other buffers for audio are aligned (indirectly used ones too: SD buffers, for example, SPI flash, simple variables... all of them), memcpy would still have to be configured for aligned mode (I remember a "MISALIGNED" switch in libc; it is likely switched on, since the M4 allows this).
 