Teensy 4.1 : EXTMEM and USB UART - not responsive anymore

tjaekel

I run a thread with a UART shell (commands to control the system).
There is a GPIO interrupt that fires at a 1 kHz repetition rate.
The INT handler for this GPIO uses EXTMEM, e.g. to store data there (up to 1 KByte per interrupt).
So the load from this GPIO INT plus EXTMEM is pretty large (but not too large: all INTs are still processed properly, and nothing is lost in the user INT handling).

What I see:
The USB UART is not responsive anymore.

I can still type characters on the UART manually, at normal human typing speed, but characters get lost.
The command line is not responsive anymore. It is tough to send a command string; characters are often lost (not received: I have a local echo on the MCU).
I often have to type a character twice to get it through, so normal manual typing of commands is no longer possible.

It looks like the frequent data access to EXTMEM (caused by the periodic GPIO interrupts) blocks reception on the USB UART.
Why?

Obviously, an EXTMEM write is much slower than a write to internal RAM. But why does it block reception on the USB UART?
I would hope that an access to EXTMEM does not block the reception of USB UART characters (but it seems to).

BTW:
I saw this in the Teensy 4 library code:
In "HardwareSerial.cpp", when Serial.available() or even Serial.read() is called, the MCU IRQs are disabled for almost the entire duration of the call
(using __disable_irq();).
Similarly, in the USB code "usb_serial.c", the function usb_serial_read() uses NVIC_DISABLE_IRQ(IRQ_USB1);

So, it tells me this:
If you call Serial.available() "too often and too fast", it disables (ALL!) interrupts, and for more than 50% of the time the MCU spends executing code the INTs are disabled.

Assume this code:
Code:
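char *UART_gets(char *b, int len) {
  char *start = b;
  int c = 0;  // initialized so the loop test is defined before the first byte arrives
  do {
      if (Serial.available()) {
         c = Serial.read();
         if (b - start < len - 1) *b++ = (char)c;  // never write past the caller's buffer
      }
  } while (c != '\r');
  //...complete the string received
  return start;
}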

which means: you poll at full speed to see whether Serial.available() has something, and you also call Serial.read(), so MOST OF THE TIME the interrupts (USB1 or ALL) are disabled!
So there is just a tiny "timing window" in which an INT can happen and be processed (as fast as possible). But most of the time the INT is "blocked" (and comes later, at a random time).

My impression:
The UART code, for USB UART and HardwareSerial, is not written in a way that lets other real-time stuff (INTs) be processed.
I do not see a need to disable all INTs (IRQs), not even the USB INT, even though I understand the concern about "race conditions". A proper FIFO implementation for received data
can work without blocking other INTs "all the time". I do not see a need for disabling INT(s) if the implementation were more tolerant of "race conditions".
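
To make the FIFO idea above concrete, here is a minimal sketch of a single-producer/single-consumer byte FIFO that never masks interrupts. It is not the Teensy core's implementation; the name RxFifo and its methods are made up for illustration. It assumes exactly one writer (e.g. a receive interrupt) and exactly one reader (e.g. the shell thread), and that std::atomic<std::size_t> is lock-free on the target, which should hold on a 32-bit Cortex-M but is worth checking.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Single-producer/single-consumer byte FIFO (hypothetical name).
// The producer only writes head_, the consumer only writes tail_,
// so neither side has to disable interrupts.
template <std::size_t N>
class RxFifo {
  static_assert(N >= 2 && (N & (N - 1)) == 0, "N must be a power of two");

public:
  // Producer side. Returns false (byte dropped) when the FIFO is full.
  bool push(std::uint8_t byte) {
    const std::size_t head = head_.load(std::memory_order_relaxed);
    const std::size_t tail = tail_.load(std::memory_order_acquire);
    if (head - tail == N) return false;                // full
    buf_[head & (N - 1)] = byte;
    head_.store(head + 1, std::memory_order_release);  // publish the byte
    return true;
  }

  // Consumer side. Returns -1 when empty, like Arduino's Serial.read().
  int pop() {
    const std::size_t tail = tail_.load(std::memory_order_relaxed);
    const std::size_t head = head_.load(std::memory_order_acquire);
    if (head == tail) return -1;                       // empty
    const std::uint8_t byte = buf_[tail & (N - 1)];
    tail_.store(tail + 1, std::memory_order_release);  // free the slot
    return byte;
  }

  // Number of bytes waiting; intended to be called from the consumer.
  std::size_t available() const {
    return head_.load(std::memory_order_acquire) -
           tail_.load(std::memory_order_relaxed);
  }

private:
  std::uint8_t buf_[N] = {};
  std::atomic<std::size_t> head_{0};  // total bytes ever pushed
  std::atomic<std::size_t> tail_{0};  // total bytes ever popped
};
```

The design choice is that each index has exactly one writer, so a race can at worst make the FIFO look slightly fuller or emptier than it is for a moment; it cannot corrupt data. The cost is that a full FIFO drops bytes instead of blocking.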

So, using the UART in combination with other real-time stuff, like other INTs, and keeping the UART responsive during a high INT load (frequency of other INTs, e.g. GPIO),
does not seem to be possible.
Serial.available() and Serial.read() block and make the real-time response "bad and unpredictable".

BTW:
If you think of solving this with a sleep(1), for 1 millisecond, when Serial.available() "fails", it creates a new problem: now you lower the maximum possible
UART baudrate (e.g. just one character every 1 millisecond). It is no longer possible to send UART data from a script (e.g. Python) at the maximum speed potentially possible.
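
One way around the sleep-per-character limit is to drain everything that is already buffered on each poll and only sleep once the buffer is empty. The sketch below assumes a stream object with Arduino-style available() and read() methods; LineReader and FakeStream are hypothetical names, and FakeStream exists only so the logic can be tried on a PC.

```cpp
#include <cstddef>
#include <cstring>

// Accumulates one command line across many polls (hypothetical helper).
struct LineReader {
  char buf[128];
  std::size_t len = 0;

  // Drains every byte the stream currently holds. Returns true once a full
  // line (terminated by '\r' or '\n') is in buf. Never waits for more input.
  template <typename StreamT>
  bool poll(StreamT &in) {
    while (in.available() > 0) {
      const int c = in.read();
      if (c < 0) break;
      if (c == '\r' || c == '\n') {
        if (len == 0) continue;  // skip blank lines and the LF of a CRLF pair
        buf[len] = '\0';
        len = 0;                 // the next call starts a fresh line
        return true;
      }
      if (len < sizeof(buf) - 1) buf[len++] = static_cast<char>(c);
    }
    return false;  // nothing complete yet: the caller may yield or sleep now
  }
};

// Host-side stand-in for Serial so the sketch can be tried off-target.
struct FakeStream {
  const char *data;
  std::size_t pos;
  explicit FakeStream(const char *d) : data(d), pos(0) {}
  int available() const { return static_cast<int>(std::strlen(data) - pos); }
  int read() { return data[pos] ? static_cast<unsigned char>(data[pos++]) : -1; }
};
```

On the Teensy the call would look something like `if (!reader.poll(Serial)) vTaskDelay(1);`, so a whole 64-byte packet is consumed in one pass and the 1 ms sleep is only paid when there is nothing left to read.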

Why do the USB UART (and HardwareSerial) not have a "non-INT-blocking" implementation, and why is there no guarantee that all characters are received, even with a higher INT load in the background?

I am pretty unhappy with the UART implementation and performance (not designed for real-time systems).
UART becomes unresponsive with background INTs and access to EXTMEM.
 
Not sure of all the p#1 text but this pops out as unclear:

HardwareSerial and the UART Serial# ports (1,2,3,4,5,6,7,8) are what the device pins may support, and indeed they rely on interrupts for tending the FIFOs on the 1062 processor in the T_4.x's, which are only 2 bytes deep IIRC. They are capable of perhaps 6 megabits per second.

Teensy USB is NATIVE, real USB; it is not processed by a UART in any way. To be responsive and capable of even half of the 480 megabit per second bandwidth on the USB, it also relies on interrupts to be responsive and efficient.

Disabling interrupts for any significant period will interrupt data flow and integrity and also affect the millisecond SYS_TICK that feeds the teensy time exposed in millis() and micros().

Extensive I/O testing of reads and writes to PSRAM-based EXTMEM has been done, and AFAIK it doesn't disable interrupts (or require interrupts to be disabled) in the process. It is memory mapped through the processor, which uses 'programmed' commands for the 4-bit data transfer to the PSRAM chip(s). This is handled internally within the 1062 processor, including passing the data through the 32KB data cache where possible.

If a simple sketch could be posted demonstrating the issue observed, it would help in addressing it.
 
Thank you for the response.
Let's see if I can strip down my project to the "root cause" (potentially not so easy, since I also use FreeRTOS).

I know "USB UART" is VCP.

Thinking about my project and what might possibly be happening:
I have a main thread, which polls with Serial.available() for a received character.
I also have a GPIO INT running, e.g. at 100 Hz (every 10 ms).

Serial.available() (and Serial.read() as well) would disable INTs (at least the USB1 INT) for most of the time they are being called.
Outside this "timing window", when the INT is not disabled, the GPIO INT kicks in.
Let's assume that the write to EXTMEM, triggered by the INT (via an RTOS thread scheduled with higher prio)
every 10 ms, needs 9 ms for the copy.
Then for a period of 9 ms the main loop that checks Serial.available() is not executed (I am not at that point in my code).

Assume the USB UART (VCP) can send a single character (actually a buffer, 64 bytes) every 1 ms, while I am typing single characters on the keyboard as a human:
it is possible that the USB UART receiver does not get them.

Hard to simulate or to provide a simple example.

What I know for sure:
The GPIO INT triggers a thread, which runs outside of INT context (a normal thread, released by the INT, does all the work).
So my GPIO INT does not "disable" other INTs. But this thread has a higher prio than the normal UART command line thread,
so I come back to the code with Serial.available() only after this GPIO INT thread has finished.

This can take up to 9 ms (less than 10 ms, because all the GPIO INTs are handled fine).
But during this period (of 9..10 ms) I "cannot" check for a new USB UART character received.
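
The numbers in this scenario can be written down as a small worked calculation. This is only arithmetic on the figures quoted in this thread (10 ms period, 9 ms copy, and the 1 kHz / 1 KByte case from the first post). It assumes the high-priority work runs as one contiguous block per period, and all names are made up for illustration.

```cpp
// CPU budget left for the low-priority shell thread (hypothetical helper).
struct Budget {
  double utilization;   // fraction of CPU taken by the high-priority work
  double idle_ms;       // time left for the shell per period
  double worst_gap_ms;  // longest stretch in which the shell cannot poll
};

constexpr Budget budget(double period_ms, double busy_ms) {
  return Budget{busy_ms / period_ms, period_ms - busy_ms, busy_ms};
}

// Sustained data rate of an interrupt that stores a fixed block each time.
constexpr unsigned long bytesPerSecond(unsigned rate_hz, unsigned bytes_per_event) {
  return static_cast<unsigned long>(rate_hz) * bytes_per_event;
}
```

With a 10 ms period and 9 ms of copying, the shell gets 1 ms out of every 10 ms (90% utilization) and can be away from Serial.available() for up to 9 ms at a time. The 1 kHz interrupt with 1 KByte per event corresponds to 1,024,000 bytes per second into EXTMEM.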

What I see clearly: when this "background INT" runs, the USB UART (VCP) responds badly and loses typed characters.
Even when typing slowly, e.g. one character every second, there is no guarantee that it will be received.

My thinking is this:
The USB UART, as VCP, receives packets in bursts (e.g. every 1 ms).
But the INT (for UART, for USB) is disabled for most of the time. So, a tiny window, where the INT can happen remains.
But this "open window" lets my GPIO interrupt trigger, which keeps the system busy for up to 9 ms. And 9 ms can mean that 9 USB VCP packets were
already there, but not noticed or stored.
Even though my GPIO INT does not block UART or USB (it is a triggered thread in user space), it is possible that the USB UART receiver starts losing VCP packets.

It is still a bit "mysterious", because I assume that the USB stack runs in the background and is able to service receiver interrupts and store a VCP packet, even if it has just one byte
of payload per packet. Maybe several packets received in between confuse the implementation...?

No idea. What I see for sure is that the USB UART (VCP) receiver starts to lose single characters sent from the PC terminal.
And it is clearly related to the stuff I do on EXTMEM.
Even if it is an RTOS thread prio issue (the GPIO INT thread has higher prio than the main loop with the UART), I lose characters on the USB UART when typing as a human.

BTW:
When I "copy and paste" a command line into the PC UART terminal program, it works fine. But this is obvious:
the string, if shorter than 64 bytes, is now sent as one single VCP packet and received at once.

Let's see if I can replicate this issue with a simple demo program.
Otherwise, I can give you a link and instructions for the entire project on GitHub, but you will have to figure out how to generate a periodic GPIO INT
(e.g. with a signal generator).
 
Here is a wholly inappropriate Arduino Sketch for Teensy 4.1. It works as expected - but does Serial printing from main code and an _ISR().

As far as I understand, it does what cannot be done under RTOS. The RTOS seems to be controlling interrupts by disabling them, and implementing the USB and EXTMEM access in a non-functional way???
> USB Serial input is recognized and fully functional, USB output is undeterred, and here 8MB of PSRAM is tested exhaustively for over 36 seconds continuously - without the loop_ISR:
Code:
...
 test ran for 36.43 seconds
All memory tests passed :-)

The CODE below takes the sketch "teensy41_psram_memtest.ino" {PSRAM test on all of EXTMEM} and runs a short repeat loop of the "usb_serial_print_speed.ino" sketch on an interval timer.

The "teensy41_psram_memtest.ino" runs multiple times through ALL of PSRAM writing and doing a read verify of various RANDOM or specific values.

During that, 'funfun.begin(loop_ISR, 1000);' causes the INTERRUPT loop_ISR() to print 10 lines 1,000 times per second. This 10,000 lps is less than the normal test rate, which exceeds 150-500+K lines per second.
> In that _ISR, if Serial.available() reports data { using a '\n' variation of the posted code }, that data is read and then printed, and the 'lps' print testing is paused for 2 seconds {except for a more limited '.'} so that the output can be viewed and verified.
> While paused with '.'s, some partial completion feedback from the PSRAM EXTMEM test may also show without scrolling.

When the PSRAM testing completes, the _ISR() is stopped and the EXTMEM test results for the PSRAM are shown - and they are passing here. The elapsed time {with some partial second extension} is in line with the normal result without the _ISR() prints.
Code:
EXTMEM Memory Test, 8 Mbyte
 CCM_CBCMR=B5AE8304 (88.0 MHz)
........................................................................................................
........................................................................................................................................................................................................count=10000000, lines/sec=0
count=10000001, lines/sec=1
 ...
count=10371688, lines/sec=10000
count=10371689, lines/sec=10000
........................... test ran for 37.13 seconds
All memory tests passed :-)

Arduino code using PJRC TeensyDuino 1.59 beta 2 on IDE 1.8.19:
Code:
extern "C" uint8_t external_psram_size;

bool memory_ok = false;
uint32_t *memory_begin, *memory_end;

bool check_fixed_pattern(uint32_t pattern);
bool check_lfsr_pattern(uint32_t seed);

uint32_t count, prior_count;
uint32_t prior_msec;
uint32_t count_per_second;

// Uncomment this for boards where "SerialUSB" needed for native port
//#define Serial SerialUSB

elapsedMillis funWait;
void loop_ISR() {
  if ( funWait > 2000 ) {
    for ( int ii = 0; ii < 10; ii++ ) {
      Serial.print("count=");
      Serial.print(count);
      Serial.print(", lines/sec=");
      Serial.println(count_per_second);
      count = count + 1;
      uint32_t msec = millis();
      if (msec - prior_msec > 1000) {
        // when 1 second has elapsed, update the lines/sec count
        prior_msec = prior_msec + 1000;
        count_per_second = count - prior_count;
        prior_count = count;
      }
    }
  }
  else if (!(funWait % 10))
    Serial.print('.');
  int c = '\n', ii = 0;
  char b[256];
  do {
    if (Serial.available()) {
      b[ii++] = c = Serial.read();
    }
  } while (c != '\n');
  if (ii > 0) {
    b[ii] = 0;
    Serial.print(b);
    funWait = 0;
  }
}

IntervalTimer funfun;
void setup() {
  Serial.begin(1000000); // edit for highest baud your board can use
  while (!Serial) ;
  count = 10000000; // starting with 8 digits gives consistent chars/line
  prior_count = count;
  count_per_second = 0;
  prior_msec = millis();

  pinMode(13, OUTPUT);
  uint8_t size = external_psram_size;
  Serial.printf("EXTMEM Memory Test, %d Mbyte\n", size);
  if (size == 0) return;
  const float clocks[4] = {396.0f, 720.0f, 664.62f, 528.0f};
  const float frequency = clocks[(CCM_CBCMR >> 8) & 3] / (float)(((CCM_CBCMR >> 29) & 7) + 1);
  Serial.printf(" CCM_CBCMR=%08X (%.1f MHz)\n", CCM_CBCMR, frequency);
  memory_begin = (uint32_t *)(0x70000000);
  memory_end = (uint32_t *)(0x70000000 + size * 1048576);
  funfun.begin(loop_ISR, 1000);
  delay(10000);
  elapsedMillis msec = 0;
  if (!check_fixed_pattern(0x5A698421)) return;
  if (!check_lfsr_pattern(2976674124ul)) return;
  if (!check_lfsr_pattern(1438200953ul)) return;
  if (!check_lfsr_pattern(3413783263ul)) return;
  if (!check_lfsr_pattern(1900517911ul)) return;
  if (!check_lfsr_pattern(1227909400ul)) return;
  if (!check_lfsr_pattern(276562754ul)) return;
  if (!check_lfsr_pattern(146878114ul)) return;
  if (!check_lfsr_pattern(615545407ul)) return;
  if (!check_lfsr_pattern(110497896ul)) return;
  if (!check_lfsr_pattern(74539250ul)) return;
  if (!check_lfsr_pattern(4197336575ul)) return;
  if (!check_lfsr_pattern(2280382233ul)) return;
  if (!check_lfsr_pattern(542894183ul)) return;
  if (!check_lfsr_pattern(3978544245ul)) return;
  if (!check_lfsr_pattern(2315909796ul)) return;
  if (!check_lfsr_pattern(3736286001ul)) return;
  if (!check_lfsr_pattern(2876690683ul)) return;
  if (!check_lfsr_pattern(215559886ul)) return;
  if (!check_lfsr_pattern(539179291ul)) return;
  if (!check_lfsr_pattern(537678650ul)) return;
  if (!check_lfsr_pattern(4001405270ul)) return;
  if (!check_lfsr_pattern(2169216599ul)) return;
  if (!check_lfsr_pattern(4036891097ul)) return;
  if (!check_lfsr_pattern(1535452389ul)) return;
  if (!check_lfsr_pattern(2959727213ul)) return;
  if (!check_lfsr_pattern(4219363395ul)) return;
  if (!check_lfsr_pattern(1036929753ul)) return;
  if (!check_lfsr_pattern(2125248865ul)) return;
  if (!check_lfsr_pattern(3177905864ul)) return;
  if (!check_lfsr_pattern(2399307098ul)) return;
  if (!check_lfsr_pattern(3847634607ul)) return;
  if (!check_lfsr_pattern(27467969ul)) return;
  if (!check_lfsr_pattern(520563506ul)) return;
  if (!check_lfsr_pattern(381313790ul)) return;
  if (!check_lfsr_pattern(4174769276ul)) return;
  if (!check_lfsr_pattern(3932189449ul)) return;
  if (!check_lfsr_pattern(4079717394ul)) return;
  if (!check_lfsr_pattern(868357076ul)) return;
  if (!check_lfsr_pattern(2474062993ul)) return;
  if (!check_lfsr_pattern(1502682190ul)) return;
  if (!check_lfsr_pattern(2471230478ul)) return;
  if (!check_lfsr_pattern(85016565ul)) return;
  if (!check_lfsr_pattern(1427530695ul)) return;
  if (!check_lfsr_pattern(1100533073ul)) return;
  if (!check_fixed_pattern(0x55555555)) return;
  if (!check_fixed_pattern(0x33333333)) return;
  if (!check_fixed_pattern(0x0F0F0F0F)) return;
  if (!check_fixed_pattern(0x00FF00FF)) return;
  if (!check_fixed_pattern(0x0000FFFF)) return;
  if (!check_fixed_pattern(0xAAAAAAAA)) return;
  if (!check_fixed_pattern(0xCCCCCCCC)) return;
  if (!check_fixed_pattern(0xF0F0F0F0)) return;
  if (!check_fixed_pattern(0xFF00FF00)) return;
  if (!check_fixed_pattern(0xFFFF0000)) return;
  if (!check_fixed_pattern(0xFFFFFFFF)) return;
  if (!check_fixed_pattern(0x00000000)) return;
  Serial.printf(" test ran for %.2f seconds\n", (float)msec / 1000.0f);
  Serial.println("All memory tests passed :-)");
  memory_ok = true;
  funfun.end();
}

bool fail_message(volatile uint32_t *location, uint32_t actual, uint32_t expected)
{
  Serial.printf(" Error at %08X, read %08X but expected %08X\n",
                (uint32_t)location, actual, expected);
  return false;
}

// fill the entire RAM with a fixed pattern, then check it
bool check_fixed_pattern(uint32_t pattern)
{
  volatile uint32_t *p;
  Serial.printf("testing with fixed pattern %08X\n", pattern);
  for (p = memory_begin; p < memory_end; p++) {
    *p = pattern;
  }
  arm_dcache_flush_delete((void *)memory_begin,
                          (uint32_t)memory_end - (uint32_t)memory_begin);
  for (p = memory_begin; p < memory_end; p++) {
    uint32_t actual = *p;
    if (actual != pattern) return fail_message(p, actual, pattern);
  }
  return true;
}

// fill the entire RAM with a pseudo-random sequence, then check it
bool check_lfsr_pattern(uint32_t seed)
{
  volatile uint32_t *p;
  uint32_t reg;

  Serial.printf("testing with pseudo-random sequence, seed=%u\n", seed);
  reg = seed;
  for (p = memory_begin; p < memory_end; p++) {
    *p = reg;
    for (int i = 0; i < 3; i++) {
      // https://en.wikipedia.org/wiki/Xorshift
      reg ^= reg << 13;
      reg ^= reg >> 17;
      reg ^= reg << 5;
    }
  }
  arm_dcache_flush_delete((void *)memory_begin,
                          (uint32_t)memory_end - (uint32_t)memory_begin);
  reg = seed;
  for (p = memory_begin; p < memory_end; p++) {
    uint32_t actual = *p;
    if (actual != reg) return fail_message(p, actual, reg);
    //Serial.printf(" reg=%08X\n", reg);
    for (int i = 0; i < 3; i++) {
      reg ^= reg << 13;
      reg ^= reg >> 17;
      reg ^= reg << 5;
    }
  }
  return true;
}

void loop()
{
  digitalWrite(13, HIGH);
  delay(100);
  if (!memory_ok) digitalWrite(13, LOW); // rapid blink if any test fails
  delay(100);
}
 
... the INT (for UART, for USB) is disabled for most of the time. So, a tiny window, where the INT can happen remains.

Serial.available() does not disable interrupts, but it does call yield() if no bytes are available, so does that cause an RTOS task switch and is that causing your trouble?
 
In the posted test code, using the p#1 snippet for the do{}while code introduced mistakes that caused the code to hang as written.
> the end of line here on Windows with the Arduino IDE, at least, is '\n' not '\r'
> That do{}while will hang until more Serial.available() data can be processed. Maybe the RTOS task switches and USB gets data slowly? But on Arduino the '\n' on TD 1.58 could get lost and the loop never exited. Here is the edit that works:
Code:
  do {
    if (Serial.available()) {
      b[ii++] = c = Serial.read();
    }
    else
      break;
  } while (c != '\n');

That code was written knowing it was badly using Serial.print from both main setup() and the _isr(), but it worked to show the point that extensive EXTMEM/PSRAM data access under Arduino with PJRC code does not interfere with USB data transfer in or out.
See: github.com/Defragster/TeensySketches/blob/main/teensy41_psram_SerSpeed/teensy41_psram_SerSpeed.ino
The complete set of tests with various data works as expected and the overall completion time is not affected. In fact, a CORES PR affecting processor PreFetch made it run significantly faster while USB printed 10,000*32 chars/second and was ready to receive USB data from the PC at the same time.
That code was tested on three T_4.1's with PSRAM and works with the edits from p#5 above. The only issue is that _isr() prints can get mixed in with other Serial output because of the _isr() usage as written.
 
I don't know why you're losing incoming characters. But I can try to clear up some possible misunderstandings about USB serial.

Assume the USB UART (VCP) can send a single character (actually a buffer, 64 bytes) every 1 ms, while I am typing single characters on the keyboard as a human:
it is possible that the USB UART receiver does not get them.

This is not how USB works. A complicated protocol is used to only transfer data when the receiver is able to accept.

When the USB host (your PC) wants to send, it transmits either a PING or OUT token. Teensy replies automatically (done at the hardware level not depending on an interrupt or any code) depending on what buffers are available to receive the data. Teensy replies to PING with an ACK token if the hardware has been given at least 1 buffer to receive data on that specific endpoint number. If no buffers are ready, the hardware replies with NAK. Likewise when the USB host transmits OUT to actually send data, Teensy replies with 1 of 3 possible responses. If no buffers are available to receive, Teensy replies NAK. The USB host must try again later if Teensy sent NAK. This NAK response happens automatically at the hardware level inside Teensy's USB hardware. Periodic retry also happens automatically at the hardware level in the USB host controller in your PC. You really can't get a situation where the PC transmits and the data is lost because Teensy wasn't ready, because the hardware on both sides implements this protocol where the data transfer simply doesn't occur until Teensy replies with a token to indicate its hardware has a buffer assigned to receive the incoming data.

The affirmative response is a bit complicated. USB 1.1 used only OUT, NAK, ACK. But with USB 2.0 a third response was added, plus an efficient PING message. If 2 or more buffers are available, Teensy replies to OUT with ACK and of course puts the data into the first available buffer. If Teensy has exactly 1 buffer available, it also stores the data and replies with the NYET token. The host treats NYET the same as ACK. But it remembers the NYET reply (for that particular endpoint). The next time the host wants to transmit, it sends the very efficient PING token to check whether any buffer space is available to receive. If Teensy replies with ACK, then the USB host sends OUT with the data. This all happens automatically at the hardware level, so you can only observe it with a hardware-based protocol analyzer (I personally use Total Phase Beagle 480). If you use a software-based capture on your PC, like Wireshark, this low level protocol is invisible because it's done completely within the USB hardware.
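
As a rough illustration of the handshake rules described above, here is a toy model in C++. It is not how the USB controller is implemented (the real logic is in hardware), and every name in it is made up. It only encodes the rules as stated in this post: no free buffer gives NAK, exactly one gives NYET, two or more give ACK, and the host sends PING first after it has seen NYET.

```cpp
enum class Handshake { ACK, NYET, NAK };

// Device reply to an OUT token, following the description above:
// 0 free buffers -> NAK, exactly 1 -> NYET (data stored), 2 or more -> ACK.
inline Handshake replyToOut(unsigned &freeBuffers) {
  if (freeBuffers == 0) return Handshake::NAK;  // data not accepted, host retries
  const Handshake h = (freeBuffers == 1) ? Handshake::NYET : Handshake::ACK;
  --freeBuffers;                                // the packet occupies one buffer
  return h;
}

// Device reply to a PING token: ACK if at least one buffer is free.
inline Handshake replyToPing(unsigned freeBuffers) {
  return freeBuffers > 0 ? Handshake::ACK : Handshake::NAK;
}

// Simplified host side for one endpoint (hypothetical model).
struct Host {
  bool pingFirst = false;  // set after a NYET reply, cleared after an ACK

  // Tries to deliver one packet. Returns true only if the device stored it;
  // on false the host still holds the packet and retries later.
  bool trySend(unsigned &deviceFreeBuffers) {
    if (pingFirst && replyToPing(deviceFreeBuffers) == Handshake::NAK) return false;
    const Handshake h = replyToOut(deviceFreeBuffers);
    if (h == Handshake::NAK) return false;
    pingFirst = (h == Handshake::NYET);
    return true;
  }
};
```

In this model there is no path where trySend() returns true without the device having taken a buffer, which is the property described above: a packet is either stored by the device or still held by the host.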

The point of all this is that the USB protocol fundamentally works with this ACK, NYET, NAK checking for available buffers. You absolutely cannot have a situation where the USB host transmits data that gets lost because the USB device wasn't ready to receive. Of course plenty can go wrong on the software side, especially when you have preemptive thread scheduling.

From a hardware point of view, USB serial isn't at all like ordinary hardware serial. It's perhaps similar to hardware serial with rigid CTS/RTS flow control. You just can't possibly have a situation where the receive buffer overflows, because the receiver always de-asserts RTS and the transmitter just won't send another byte (if CTS really is implemented by hardware) until the receiver indicates it is ready. That's how USB works, using its rather complicated low-level protocol which is documented in chapter 8 of the USB specification.


Let's see if I can replicate this issue with a simple demo program.

I realize this low-level USB stuff ruling out one narrow part of the theory probably doesn't help. Realistically a simple demo program is probably the only way to really get to the bottom of this problem.
 
I agree:
Using Teensy FreeRTOS changes the "picture".

What I have realized meanwhile, using FreeRTOS:
when Serial.available() was not successful, I did a vTaskDelay(1) (to give a bit more time and try again on the next round).
Changing to taskYIELD(); has improved things, but has not completely solved the problem:

Now, when the background stuff is running with a higher prio and the foreground does not do anything, it notices received UART characters much better.
But when the foreground (with lower prio) runs in combination with the background (higher prio) and the foreground also does print(), it is almost impossible
to break my foreground loop with any UART character sent.

OK, pretty obvious that my system runs at the edge of the "speed", performance.

The "problem" comes just from the fact that the UART receiver does not really run in the background; it still seems to need all these calls like Serial.available() and Serial.read().
I cannot change the UART (USB) receiver to an approach where the receiver sends me a "notification" or "signal" that something was received.
The implementation is designed for a "polling" approach, not for an INT or RTOS approach.

Never mind, I have to find workarounds for my specific issue.
(in a fully open source system, I would implement a USB UART INT handler: when something was received, it can notify my running code to "interrupt" and check for it)
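
The notification idea could look roughly like the sketch below. Everything in it is hypothetical: the names are invented, and it assumes a receive handler that can be made to call notifyRx(), which is not something this thread shows the stock Teensy core offering. Under FreeRTOS the hook would typically call vTaskNotifyGiveFromISR() and the shell task would block in ulTaskNotifyTake() instead of polling.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical receive hook: called with the number of bytes just received.
using RxHook = void (*)(std::size_t bytes);

static std::atomic<std::size_t> g_pendingBytes{0};
static RxHook g_rxHook = nullptr;

// Registers the function to call whenever data arrives (nullptr disables it).
void setRxHook(RxHook hook) { g_rxHook = hook; }

// Would be called from the receive interrupt (assumption: the driver is
// modified to do so). Records the byte count, then wakes the listener.
void notifyRx(std::size_t bytes) {
  g_pendingBytes.fetch_add(bytes, std::memory_order_release);
  if (g_rxHook) g_rxHook(bytes);
}

// Called by the shell after it was woken: returns how many bytes arrived
// since the last call and resets the counter in one atomic step.
std::size_t takePending() {
  return g_pendingBytes.exchange(0, std::memory_order_acquire);
}
```

The hook only signals that data is waiting; the bytes themselves would still be fetched with the normal read calls. Note that a notification does not help if a higher-priority thread is holding the CPU, because the shell still only runs once that thread yields.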
 
Indeed, as noted in p#4, with RTOS and unknown code in the picture this is not clear.

The worked up sketch presented showed no issues with intense USB output and PSRAM R/W, and the always-ready USB input was perfectly received.

Also noted: the USB processing from the first posted snippet needed edits/completion to run in the sample presented. '\r' does not terminate a USB transmission, but a '\n' does, and that do{}while can easily hang forever when there is no Serial.available() or if somehow a completed buffer doesn't provide the '\n'. This was resolved in the demonstration sketch provided and linked from github above.

Also confusing is having these two words together: "USB UART"
> as noted, USB is real USB and does not pass through a UART to the Teensy device connector.
> and UARTs are the Serial# devices on Teensy, unrelated in any way to USB; they are standard 3.3V Serial I/O devices
 