How to spend hours debugging a Teensy3.5

Status
Not open for further replies.

jwatte

Well-known member
I have a serial protocol that defines messages like this:

Code:
struct SomeMessage {
  uint32_t data1;
  uint16_t data2;
  uint16_t data3;
};

(I know about byte order, and struct padding, and all of it matches between host and Teensy, so don't worry about that :)

Then I have a class for receiving messages that looks something like:

Code:
class MessageParser {
public:
  void receive() {
    while (port.available() > 0) {
      int ch = port.read();
      buf[ptr] = ch;
      if (ptr == 0 && ch == HEADER_CHAR) {
        ptr = 1;
        crc = 0;
      } else if (ptr > 0) {
        crc.update(ch);
        ++ptr;
        if (ptr >= 2 && ptr == buf[1]+2) {
          decode_message(buf[1], &buf[2], crc);
          ptr = 0;
        }
      }
    }
  }
  void decode_message(size_t size, void const *data) {
    SomeMessage const &sm = *(SomeMessage const *)data;
    ... do stuff with sm ...
  }
  uint16_t crc_;
  uint16_t ptr;
  uint8_t buf[258];
};

Can you see the bug?




buf is (likely) aligned on a 4-byte boundary, but the structure in the buffer starts at offset 2 within the buffer. This means that the uint32_t data member in the message structure is being accessed unaligned (2 bytes off its natural 4-byte alignment in memory).
When this happens on Teensy3.5 and Teensy3.6, the CPU just takes some unaligned access exception and locks up. This of course works great on x86 (where alignment is, at worst, just a performance problem) and on smaller, older microcontrollers that have narrow buses and emulate 32-bit words through multiple accesses. The Teensy3.5 just happens to be in the middle of this gulf, and doesn't like unaligned accesses.


This would have been easier to debug if there were some obvious signal I could catch for the various exceptions.
Is there? Is there a vector I could implement for the unaligned access exception? Are there other exceptions (like null pointer dereference, and so forth) that I could reasonably catch and maybe print something out a serial port, so I could know what's going on rather than poking at a nonresponsive chip, adding LED and pin wiggles in my code until I can zero in on the problem?
 
I never would have thought of alignment issues. So what was the solution? I’m new to 32 bit embedded.
 
To people who don't know this: is there something dummy-proof for this issue? :)

I'm not sure there is a way to dummy proof the issue. But in general if you are doing casts like:

Code:
SomeMessage const &sm = *(SomeMessage const *)data;

you need to make sure data is properly aligned. On the Teensy, I believe there is only one address where unaligned accesses don't work, so hitting it tends to be fairly rare.
 
I believe I hit it within a single bank. I believe M4 needs software assist for unaligned 4 byte access. (Could be wrong of course)

The dummy proof way to get around this is to create a SomeMessage instance on the stack and memcpy() from the buffer into the instance.
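
A minimal sketch of the memcpy() approach described above, assuming the SomeMessage layout from the first post. The helper names decode_some_message and demo_decode_at_offset_2 are hypothetical and not part of the original code. The copy removes the misaligned cast; which instructions the compiler emits for the copy is still up to the toolchain.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Same layout as the message struct in the first post.
struct SomeMessage {
  uint32_t data1;
  uint16_t data2;
  uint16_t data3;
};

// Copies the payload into a properly aligned object instead of casting
// the (possibly misaligned) buffer pointer. Returns false if the payload
// is too short to hold a SomeMessage.
bool decode_some_message(const void *data, size_t size, SomeMessage &out) {
  if (size < sizeof(SomeMessage)) {
    return false;
  }
  std::memcpy(&out, data, sizeof(SomeMessage));
  return true;
}

// Usage: the payload starts at offset 2 of the receive buffer, which is
// exactly the case that broke the cast in the first post.
void demo_decode_at_offset_2() {
  alignas(4) uint8_t buf[258] = {0};
  const SomeMessage sent = {0x11223344u, 0x5566u, 0x7788u};
  std::memcpy(&buf[2], &sent, sizeof(sent));  // simulate received bytes

  SomeMessage got = {};
  bool ok = decode_some_message(&buf[2], sizeof(sent), got);
  assert(ok);
  assert(got.data1 == sent.data1);
  assert(got.data2 == sent.data2);
  assert(got.data3 == sent.data3);
}
```

The message lives on the stack of the caller, so the cost is one 8-byte copy per message.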
 
I'm not sure there is a way to dummy proof the issue. But in general if you are doing casts like:

Code:
SomeMessage const &sm = *(SomeMessage const *)data;

you need to make sure data is properly aligned. On the Teensy, I believe there is only one address where unaligned accesses don't work, so hitting it tends to be fairly rare.

Not that easy or obvious in every case. I remember a bug in the Teensy audio library (years ago). The audio memory was an array of structs of (256 + 2 = 258) bytes each (258/4 = 64.5) ... the array itself was (of course) properly aligned, but every 2nd entry was not.
It crashed, sometimes, when it hit* the problematic address. The bug was solved by adding 2 bytes to the struct. So it was not an end-user fault, but a bug in the "official" library. Almost impossible to find for an inexperienced user. So "align your data" is correct, but not as easy as it sounds in some cases.

(*by using the DSP extensions, which can process 2 samples at once... voilà, 32-bit access on 16-bit, 2-byte-aligned data)
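
The arithmetic behind that bug can be sketched as follows. Block258 and Block260 are hypothetical stand-ins, not the real audio library types, and the array base is assumed to be 4-byte aligned. With a 258-byte element, the sample data of every other element lands 2 bytes off a 4-byte boundary; forcing 4-byte alignment on the samples pads the element to 260 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 258-byte block: 2 bytes of bookkeeping + 128 samples.
struct Block258 {
  uint16_t info;
  int16_t samples[128];
};

// Same data, but the samples are forced onto a 4-byte boundary, so the
// compiler inserts 2 padding bytes and the block grows to 260 bytes.
struct Block260 {
  uint16_t info;
  alignas(4) int16_t samples[128];
};

static_assert(sizeof(Block258) == 258, "no padding expected");
static_assert(sizeof(Block260) == 260, "padded to a multiple of 4");
static_assert(alignof(Block260) == 4, "whole block is 4-byte aligned");

// Offset (mod 4) of the samples of element i, relative to the start of
// an array whose base is assumed to be 4-byte aligned.
constexpr size_t samples_offset_mod4(size_t elem_size, size_t field_offset,
                                     size_t i) {
  return (i * elem_size + field_offset) % 4;
}

void demo_block_alignment() {
  // 258-byte blocks: alignment of the samples alternates between entries.
  assert(samples_offset_mod4(sizeof(Block258), offsetof(Block258, samples), 0) == 2);
  assert(samples_offset_mod4(sizeof(Block258), offsetof(Block258, samples), 1) == 0);
  assert(samples_offset_mod4(sizeof(Block258), offsetof(Block258, samples), 2) == 2);
  // 260-byte blocks: the samples of every entry are 4-byte aligned.
  for (size_t i = 0; i < 8; i++) {
    assert(samples_offset_mod4(sizeof(Block260), offsetof(Block260, samples), i) == 0);
  }
}
```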
 
here is Paul's reply to one of my posts regarding unaligned memory access:

With Cortex-M0+ on Teensy LC, the hardware never supports unaligned access. Any unaligned read will cause a fault, which effectively crashes your program.

With Cortex-M4 on Teensy 3.2, 3.5 & 3.6, you can do unaligned reads.

However, there is one big gotcha if you do this from RAM. The RAM is actually in 2 banks. The lower bank ends at address 0x1FFFFFFF and the upper bank begins at 0x20000000. Unaligned access crossing that boundary will cause a fault, crashing your program. Since you're reading from Flash memory, this should not be an issue. Just know it does matter if your data is in RAM and crosses that boundary.

So @MichaelMeissner is correct - there is one address that will cause a fault if reading from RAM, but from Flash it is ok.
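
As a rough illustration of the rule in the quoted reply, the check below tells whether an access touches both RAM banks. The 0x20000000 boundary comes from the quote; the function and constant names are made up for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Start of the upper RAM bank on Teensy 3.x, per the quoted reply.
constexpr uint32_t kUpperBankStart = 0x20000000u;

// True if an access of `size` bytes starting at `addr` begins in the
// lower bank and extends into the upper bank.
constexpr bool straddles_bank_boundary(uint32_t addr, size_t size) {
  return addr < kUpperBankStart && size > kUpperBankStart - addr;
}

void demo_bank_boundary() {
  // 16-bit accesses: only the one starting at the last lower-bank byte.
  assert(!straddles_bank_boundary(0x1FFFFFFEu, 2));
  assert(straddles_bank_boundary(0x1FFFFFFFu, 2));
  // 32-bit accesses: three start addresses straddle the boundary.
  assert(!straddles_bank_boundary(0x1FFFFFFCu, 4));
  assert(straddles_bank_boundary(0x1FFFFFFDu, 4));
  assert(straddles_bank_boundary(0x1FFFFFFFu, 4));
  // Anything that starts in the upper bank is fine.
  assert(!straddles_bank_boundary(0x20000000u, 4));
}
```

Aligned 16-bit and 32-bit accesses never satisfy this condition, because 0x20000000 is itself a multiple of 2 and 4.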
 
Not sure if this was done before? I created this to test the boundary-condition behavior and then extended it into the PJRC core code to allow a user INO to do debug spew.

Paul: Does adding a (weak) and empty function like user_fault_isr(){} make sense, to allow user notification on a fault? Actual utility and value would be up to the user, and how bad the fault was would determine what works for 'output'.

Using a T_3.6, I made a sketch and eventually got usable results: when reading across the boundary, the part past the boundary is supplied as NULL, or some garbage it seems. Reads are as expected before and after the boundary split. Writing across the boundary causes a fault.

Code is not pretty, but it worked to cause a fault and let me see SerMon progress and result (though divide by zero kept running [not shown]?). As far as I can see, adding the (weak) call for user_fault_isr() allows the user to write custom code on a fault. In this case there was a [~7 second] PAUSE in USB output, but it recovers and works as shown, including the OUTPUT pin 13 toggle, the µs delay and, surprisingly, the Serial.print()s … with this fault.

Not shown below, but I added global ints yy and xx that I set before suspect lines, and then I could print them in user_fault_isr():
Code:
  yy=__LINE__;
  xx=ii;
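
The same breadcrumb idea can be wrapped in a macro so every suspect line is marked the same way. This is only a sketch: BREADCRUMB, g_last_line and g_last_state are hypothetical names, and a fault hook would still have to print the two globals itself.

```cpp
#include <cassert>
#include <cstdint>

// Last checkpoint reached. volatile keeps the stores from being
// optimized away, so a fault handler sees the latest values.
volatile int g_last_line = 0;
volatile uint32_t g_last_state = 0;

// Record the current source line and one value of interest.
#define BREADCRUMB(state)                        \
  do {                                           \
    g_last_line = __LINE__;                      \
    g_last_state = static_cast<uint32_t>(state); \
  } while (0)

// Example: mark each iteration just before the suspect access.
void suspect_loop(uint8_t *p, uint32_t count) {
  for (uint32_t ii = 0; ii < count; ii++) {
    BREADCRUMB(ii);
    p[ii] = static_cast<uint8_t>(ii);  // suspect line
  }
}

void demo_breadcrumb() {
  uint8_t scratch[4] = {0};
  suspect_loop(scratch, 4);
  assert(g_last_state == 3);  // last iteration that was reached
  assert(g_last_line > 0);    // line of the BREADCRUMB call
}
```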


To test I hacked into …\hardware\teensy\avr\cores\teensy3\mk20dx128.c:
Code:
extern void user_fault_isr(void);

void fault_isr(void)
{
    // #if 0 code removed - but tested in sketch 'output' as shown below.
	while (1) {
		user_fault_isr();
		// keep polling some communication while in fault
		// mode, so we don't completely die.
		if (SIM_SCGC4 & SIM_SCGC4_USBOTG) usb_isr();
		if (SIM_SCGC4 & SIM_SCGC4_UART0) uart0_status_isr();
		if (SIM_SCGC4 & SIM_SCGC4_UART1) uart1_status_isr();
		if (SIM_SCGC4 & SIM_SCGC4_UART2) uart2_status_isr();
	}
}

and to see the code work with the (weak) version supplied or not, I ended up pushing this into ..\\hardware\teensy\avr\cores\teensy3\yield.cpp:
Code:
extern "C" void user_fault_isr(void)	__attribute__ ((weak));
extern "C" void user_fault_isr(void)
{
	GPIOC_PTOR=32;
	delayMicroseconds(100000);
  static int ooo=0;
  if ( ooo < 10 ) 
  {
    Serial.println( "CORE user_fault_isr(void)" );
    ooo++;
  }
}
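
For readers new to weak symbols, here is a stripped-down, host-side sketch of the mechanism, assuming GCC or Clang (the weak attribute is a compiler extension). fault_loop_once and the call counter are hypothetical and exist only so the example can be observed; a non-weak user_fault_isr() in another source file would replace the default at link time.

```cpp
#include <cassert>

// Counts calls to the default hook so the demo can observe it.
static int g_default_hook_calls = 0;

// Weak declaration followed by a default definition. The linker keeps
// this definition only if no non-weak user_fault_isr() is provided by
// another translation unit.
extern "C" void user_fault_isr(void) __attribute__((weak));
extern "C" void user_fault_isr(void) {
  ++g_default_hook_calls;
}

// Hypothetical stand-in for one pass through the core's fault loop.
void fault_loop_once() {
  user_fault_isr();
}

void demo_weak_hook() {
  int before = g_default_hook_calls;
  fault_loop_once();
  // No override exists in this single-file build, so the default ran.
  assert(g_default_hook_calls == before + 1);
}
```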


The Serial.println( "CORE ..." ) was there so I could see the build switch between this core version and the sketch "INO …" version below:
Code:
extern "C" void user_fault_isr(void);
extern "C" void user_fault_isr(void)
{
  GPIOC_PTOR = 32;
  static int ooo = 0;
  if ( ooo < 10 )
  {
    Serial.print( ooo );
    Serial.println( " :: INO user_fault_isr(void)" );
    ooo++;
  }
  delayMicroseconds(1000000);
  GPIOC_PTOR = 32;
  delayMicroseconds(100000);
}

void setup() {
  pinMode(LED_BUILTIN, OUTPUT);
  Serial.begin(115200);
  while (!Serial && millis() < 5000 )
  { Serial.print( "Teensy NOT Online @ millis=" );
    Serial.println( millis() );
    delay(30);
  }
  Serial.print( "Teensy Online @ millis=" );
  Serial.println( millis() );
  user_fault_isr();
  delayMicroseconds(100000);
  user_fault_isr();
  delayMicroseconds(100000);
  delay(1000);

  uint32_t *foo2;
  uint16_t *foo1;
  uint8_t *foo;

  uint32_t ii, zz;
  for ( zz = 0; zz < 2; zz++ ) {
    foo2 = (uint32_t *)0x1FFFFFFA;
    foo1 = (uint16_t *)0x1FFFFFFA;
    foo = (uint8_t *)0x1FFFFFFA;
    for ( ii = 0; ii < 15; ii++ ) {
      Serial.print( (uint32_t)foo, HEX );
      Serial.print( " :: " );
      delay(100);
      foo[0] = ii;
      Serial.print( foo[0], HEX );
      delay(200);
      Serial.println( "-----@8" );
      foo++;
    }
    for ( ii = 0; ii < 10; ii++ ) {
      Serial.print( (uint32_t)foo1, HEX );
      Serial.print( " :: " );
      delay(100);
      Serial.print( foo1[0], HEX );
      if ( zz > 0) foo1[0] = foo1[0] + 1;
      delay(200);
      Serial.println( "-----@16" );
      foo1 = (uint16_t *)(1 + (char *)foo1);
    }
    for ( ii = 0; ii < 10; ii++ ) {
      Serial.print( (uint32_t)foo2, HEX );
      Serial.print( " :: " );
      delay(100);
      Serial.print( foo2[0], HEX );
      if ( zz > 0) foo2[0] = foo2[0] + 1;
      delay(200);
      Serial.println( "-----@32" );
      foo2 = (uint32_t *)(1 + (char *)foo2);
    }
  }
}

void loop() {
  GPIOC_PTOR = 32;
  delayMicroseconds(1000000);
  GPIOC_PTOR = 32;
  delayMicroseconds(1000000);
  Serial.println( millis() );
}

I added the other dump code from mk20dx128.c, macro'd in standard Serial.print, and that worked as below.
Here is abbreviated output. The code above first goes byte by byte, setting values 0...0xE; the @8 lines show the output from the BYTE pointer in pass 1 and pass 2. Then in pass 1 I read the values with 16-bit (@16) and then 32-bit (@32) pointers. The reads that straddle the boundary (1FFFFFFF @16, and 1FFFFFFD through 1FFFFFFF @32) are where the reads are compromised. Then on pass 2 the 16-bit access causes a fault and calls the sketch copy of user_fault_isr():
1FFFFFFA :: 0-----@8
1FFFFFFB :: 1-----@8
// …
20000007 :: D-----@8
20000008 :: E-----@8
1FFFFFFA :: 100-----@16
1FFFFFFB :: 201-----@16
1FFFFFFC :: 302-----@16
1FFFFFFD :: 403-----@16
1FFFFFFE :: 504-----@16
1FFFFFFF :: 5-----@16
20000000 :: 706-----@16
20000001 :: 807-----@16
20000002 :: 908-----@16
20000003 :: A09-----@16
1FFFFFFA :: 3020100-----@32
1FFFFFFB :: 4030201-----@32
1FFFFFFC :: 5040302-----@32
1FFFFFFD :: 50403-----@32
1FFFFFFE :: 504-----@32
1FFFFFFF :: 3000005-----@32

20000000 :: 9080706-----@32
20000001 :: A090807-----@32
20000002 :: B0A0908-----@32
20000003 :: C0B0A09-----@32
1FFFFFFA :: 0-----@8
1FFFFFFB :: 1-----@8
// …
20000007 :: D-----@8
20000008 :: E-----@8
1FFFFFFA :: 100-----@16
1FFFFFFB :: 201-----@16
1FFFFFFC :: 302-----@16
1FFFFFFD :: 403-----@16
1FFFFFFE :: 504-----@16
1FFFFFFF :: 52 :: INO user_fault_isr(void)

fault:
??: 725
??: 40072080
??: 6
psr:1FFF1138
adr:5
lr: C8
r12:FFFFFFF9
r3: 6
r2: 1BA9
r1: 20000004
r0: 1
r4: 20000009
lr: 40048034
 
Odd - does the T_LC have similar RAM Memory Split?

Same code runs and suggests it can read and write the same addresses as BYTES, then gives a fault on the first ODD @16 WORD read:

20000008 :: E-----@8
1FFFFFFA :: 100-----@16
1FFFFFFB :: 2 :: INO user_fault_isr(void)
121 :: LINE
0 :: STATE

fault:
??: 9B7
 
Odd - does the T_LC have similar RAM Memory Split?

Yes. Teensy LC has RAM from 1FFFF800 to 20001BFF.

However, Cortex-M0+ never allows unaligned memory access. If you craft code which fetches 32 bits across any 4 byte boundary or 16 bits across any 2 byte boundary, it will always crash no matter where in the RAM you try.
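
The Cortex-M0+ rule stated above amounts to a natural-alignment check. The sketch below uses a hypothetical helper name and assumes power-of-two access sizes (1, 2 or 4 bytes). It agrees with the Teensy LC output earlier in the thread, where the first odd-address 16-bit read faulted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True if an access of `size` bytes at `addr` is naturally aligned,
// i.e. the address is a multiple of the access size.
constexpr bool is_naturally_aligned(uint32_t addr, size_t size) {
  return size != 0 && (addr % size) == 0;
}

void demo_natural_alignment() {
  // Byte accesses are always aligned.
  assert(is_naturally_aligned(0x1FFFFFFBu, 1));
  // 16-bit: the even address is fine, the odd one is where the LC faulted.
  assert(is_naturally_aligned(0x1FFFFFFAu, 2));
  assert(!is_naturally_aligned(0x1FFFFFFBu, 2));
  // 32-bit: only multiples of 4 qualify, anywhere in RAM.
  assert(is_naturally_aligned(0x20000004u, 4));
  assert(!is_naturally_aligned(0x20000002u, 4));
}
```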
 
Paul -- I'm wondering whether the startup & library code could be modified never to return a malloc/new pointer that crosses the boundary on the Teensy 3.x. It would mean that memory would be more fragmented, and a program that did one large allocation of most of the memory might not run any more.
 
Didn't a user on the forum load a huge buffer? Don't the LCD screens have a 150k buffer? Should they not be affected?
 
I would believe that would bust the ili9341_t3n code when you allow it to allocate the memory for a frame buffer. Can work around, but not clean.

Actually, at times I wish we had an easy option to turn on to get more debug output... That is, maybe more interrupt vectors that go to some print function, which hopefully prints out what error happened. It would be great if it somehow printed something like: Misaligned memory access (maybe the address if it has it), called at ...

I did something like that a long time ago for some other processors (Basic Micro Atom Pro). My guess is something like that was done here, maybe would output to some hardware Serial port (Serial1? )

Alternatively, I think it was BDMicro with their ATmega128 (or was it the chipKIT boards?) that had a built-in LED, and when it hit a fault like this it would blink the LED with different patterns to tell you what the fault was. ...
 
You can't mark the area as used so the heap would jump over it (like a bad sector)? Don't SSDs do this? :)
 
@KurtE and @tonton81: from the discussions above, the problem only occurs when there is unaligned access across the boundary. According to information I can find online, malloc will always allocate memory suitably aligned for most standard data types. I'm assuming that this always aligns to at least a 4-byte boundary, maybe even 8 or more? Since we are reading/writing 16-bit values (2 bytes) to the framebuffer, these will never cross the boundary.

So the framebuffer is allowed to cross the boundary, as long as unaligned access (especially write) doesn't occur - which it doesn't.
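
The malloc alignment assumption can be checked rather than guessed. The sketch below uses only standard C++ (std::max_align_t, std::malloc); the helper names are hypothetical. The actual alignment value depends on the platform's C library, so the number seen on a desktop build says nothing about the Teensy toolchain; the check would need to run on the board to settle the question there.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// True if pointer p is a multiple of `alignment` bytes.
bool is_aligned_to(const void *p, size_t alignment) {
  return reinterpret_cast<uintptr_t>(p) % alignment == 0;
}

// Allocates a buffer and reports whether it is aligned for the most
// strictly aligned fundamental type on this platform.
bool malloc_meets_max_align(size_t bytes) {
  void *p = std::malloc(bytes);
  if (p == nullptr) {
    return false;
  }
  bool ok = is_aligned_to(p, alignof(std::max_align_t));
  std::free(p);
  return ok;
}

void demo_malloc_alignment() {
  assert(malloc_meets_max_align(1024));
  // An aligned base does not make every element offset 4-byte aligned:
  // the second uint16_t of a 4-byte-aligned buffer is only 2-byte aligned.
  alignas(4) uint16_t pixels[4] = {0};
  assert(is_aligned_to(&pixels[0], 4));
  assert(!is_aligned_to(&pixels[1], 4));
}
```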

However, you may run into issues if you are playing with pointers and writing multiple 16-bit values at once, like this:
Code:
// framebuffer is a uint16_t*
*((uint64_t*)framebuffer) = (uint64_t)four_pixels_at_once;
 
KurtE: the code I added builds on what PJRC already includes (mk20dx128.c), and the posts above show it catching the memory errors. There are various WEAK functions already for special cases - all default to "user_fault_isr()".

Using those various "unused_isr" stubs that resolve to "fault_isr()" does keep the system alive and allow it to provide notice as shown in post #10 above.

I just put TD_1.42_B4 on my system so that code is in another folder now. The way I did it worked around cpp name mangling - without that my 'cpp' sketch isn't breaking into those (weak) "c" code pieces just now.

That CORE "#if 0" code has output to Serial - that is what I macro'd around to push to USB since I didn't bother to include the ::#include "ser_print.h" // testing only
I didn't want Serial# or to figure out how it worked.

Paul: Thanks for the confirmation on that RAM region - was wondering how my test worked at all. But yes - given that RAM - it was expected the T_LC would fail the reads from your info.

I'm not C'ing something - or some way to extern "C" those _isr()'s is needed for use in a sketch? As was done for yield() with yield.cpp.

<edit>: AFAIK - I had to move my code external to mk20dx128.c ( into the yield.cpp file) because even when (weak) since they are in the same compile unit they are used before external elements in user code.
 
I did get back to that and saw it working right. I did not try the linked code, but indeed on T_LC it comes through hard_fault_isr() :)
 