Non-Blocking WS2812 library still has significant delay

Promnius · Dec 29, 2019

I'm using a Teensy 3.2 with the PJRC library WS2812Serial, and I was able to reproduce my problem using only the example code. I'm running Teensyduino 1.4.8 on Arduino 1.8.9.

I have added a few lines of code to time the 'leds.show()' function as well as report it to the console, and an additional delay (typically not necessary) to be absolutely sure the function is not getting called too fast (I know it will stall if the previous transfer is incomplete). I also changed the number of LEDs from 64 to 384, the number I am working with. Otherwise the example is unmodified, and my strip of lights still animates as expected (the extra delay slows the animation slightly).

I am seeing a return time that depends on the number of leds (!!), and that gets as high as 1200us with 384 leds (the number I am working with), but is still as high as 200us with 64 leds (the default).

I have found this blocking time changes by as much as a factor of 2 based on the compile options (384 leds ranges from 600us to 1200us), though I don't pretend to really know what those options are changing.

When I hacked up the library to add a bunch of intermediary timer variables, it looked like all the time was spent during the phase marked by this comment:

Code:

// copy drawing buffer to frame buffer

The rest of the steps added up to less than 8 us.

I guess that means this is a soft blocking function, since I'd still be able to interrupt it during that time, which makes it more useful than other neopixel libraries, but it still caught me a little off guard that the leds.show() function takes nearly as long to return as other common neopixel libraries (it's a few times faster).

I guess what I'm trying to ask is: is this delay a hard limitation of the library? Would it be feasible to do the buffer copy in the background, a bit at a time?

Code:

/* WS2812Serial BasicTest Example

   Test LEDs by turning them 7 different colors.

   This example code is in the public domain. */

#include <WS2812Serial.h>

const int numled = 384;
const int pin = 1;

// Usable pins:
//   Teensy LC:   1, 4, 5, 24
//   Teensy 3.2:  1, 5, 8, 10, 31   (overclock to 120 MHz for pin 8)
//   Teensy 3.5:  1, 5, 8, 10, 26, 32, 33, 48
//   Teensy 3.6:  1, 5, 8, 10, 26, 32, 33

byte drawingMemory[numled*3];         //  3 bytes per LED
DMAMEM byte displayMemory[numled*12]; // 12 bytes per LED

unsigned long lngTime0 = 0;
unsigned long lngTime1 = 0;

WS2812Serial leds(numled, displayMemory, drawingMemory, pin, WS2812_GRB);

#define RED    0xFF0000
#define GREEN  0x00FF00
#define BLUE   0x0000FF
#define YELLOW 0xFFFF00
#define PINK   0xFF1088
#define ORANGE 0xE05800
#define WHITE  0xFFFFFF

// Less intense...
/*
#define RED    0x160000
#define GREEN  0x001600
#define BLUE   0x000016
#define YELLOW 0x101400
#define PINK   0x120009
#define ORANGE 0x100400
#define WHITE  0x101010
*/

void setup() {
  leds.begin();
  Serial.begin(1000000);
}

void loop() {
  // change all the LEDs in 1.5 seconds
  int microsec = 1500000 / leds.numPixels();

  colorWipe(RED, microsec);
  colorWipe(GREEN, microsec);
  colorWipe(BLUE, microsec);
  colorWipe(YELLOW, microsec);
  colorWipe(PINK, microsec);
  colorWipe(ORANGE, microsec);
  colorWipe(WHITE, microsec);
}

void colorWipe(int color, int wait) {
  for (int i=0; i < leds.numPixels(); i++) {
    leds.setPixel(i, color);
    lngTime0 = micros();
    leds.show();
    lngTime1 = micros();
    Serial.println(lngTime1-lngTime0);
    delay(20);
    delayMicroseconds(wait);
  }
}

PaulStoffregen · Dec 30, 2019

Yes, the copying and rearranging of data takes significant CPU time.

On Teensy 3.2, I'm seeing it take 719 us for 384 LEDs.

On Teensy 4.0 it's taking 65 us for the same 384 LEDs. That agrees well with other benchmarks which show ~11X faster CPU performance on 4.0 vs 3.2.

I must admit this is slower than I thought. This code has never had much optimization work. Maybe some of the techniques we use in the audio library could be applied to this code?

Promnius said: "but it still caught me a little off guard that the leds.show() function takes nearly as long to return as other common neopixel libraries (it's a few times faster)."

Hopefully you are using an oscilloscope or logic analyzer to make that measurement?

Libraries like Adafruit_NeoPixel block interrupts, which prevents millis() and micros() from keeping track of time while the LEDs are updating. If you use those functions to measure, you'll see wrong results which say Adafruit_NeoPixel used much less time than it actually did, because the system was prevented from incrementing its count of elapsed time.
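To illustrate why the count falls behind, here is a minimal sketch of a 1 ms tick counter. It is a simplified model, not the Teensy core's code: TickClock, hardwareTick() and countedMillis() are hypothetical names, and the model assumes the hardware can latch at most one pending tick while interrupts are disabled.

```cpp
// Simplified model of a 1 ms tick interrupt (hypothetical names, not Teensy core code).
struct TickClock {
  unsigned long millisCount = 0;  // what millis() would report
  bool interruptsEnabled = true;
  bool tickPending = false;       // assumption: at most ONE tick can be latched

  void disableInterrupts() { interruptsEnabled = false; }

  void enableInterrupts() {
    interruptsEnabled = true;
    if (tickPending) {            // the single latched tick is serviced now
      tickPending = false;
      ++millisCount;
    }
  }

  // Called once per real millisecond by the "hardware".
  void hardwareTick() {
    if (interruptsEnabled) {
      ++millisCount;
    } else {
      tickPending = true;         // any further ticks while blocked are lost
    }
  }
};

// Returns how many milliseconds the clock believes have passed during a
// transfer that really lasts realMs milliseconds.
unsigned long countedMillis(unsigned realMs, bool blockInterrupts) {
  TickClock clk;
  if (blockInterrupts) clk.disableInterrupts();
  for (unsigned i = 0; i < realMs; ++i) clk.hardwareTick();
  clk.enableInterrupts();
  return clk.millisCount;
}
```

In this model, countedMillis(12, true) returns 1 while countedMillis(12, false) returns 12, so a blocking transfer lasting about 12 ms can appear to take only about 1 ms.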

Adafruit_NeoPixel takes 11520 microseconds to transmit to 384 LEDs. It has to, since each bit takes 1.25 us and each LED has 24 bits (384 x 24 x 1.25 us = 11520 us).
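The transmit time follows directly from the bit rate, so it can be worked out for any strip length. A small sketch of that arithmetic follows; frameTimeUs() and the constant names are made up for illustration, and the result excludes the reset/latch gap between frames.

```cpp
#include <cstdint>

// WS2812 data runs at 800 kbps, so one bit lasts 1.25 us = 1250 ns.
constexpr uint32_t kBitTimeNs = 1250;
constexpr uint32_t kBitsPerLed = 24;  // 8 bits each for green, red and blue

// Minimum time to clock out one frame, in microseconds.
// The reset/latch gap between frames is not included.
constexpr uint32_t frameTimeUs(uint32_t numLeds) {
  return numLeds * kBitsPerLed * kBitTimeNs / 1000;
}
```

For example, frameTimeUs(384) gives 11520 and frameTimeUs(64) gives 1920. Any library that sends the data on the wire needs at least this long per frame; the difference is whether the CPU has to wait for it.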

PaulStoffregen · Dec 30, 2019

As a quick sanity check, I ran this code just now on a Teensy 3.2.

Code:

#include <Adafruit_NeoPixel.h>

Adafruit_NeoPixel leds = Adafruit_NeoPixel(384, 1, NEO_GRB + NEO_KHZ800);

void setup() {
  leds.begin();
  Serial.begin(1000000);
}

void loop() {
  unsigned long lngTime0 = micros();
  leds.show();
  unsigned long lngTime1 = micros();
  Serial.println(lngTime1-lngTime0);
  delay(5);
}

In the serial monitor, it reports 1000 us taken between the two micros() calls.

But here's the waveform my oscilloscope sees on pin 1

The top half is 2 ms per division. The transmission takes just under 6 divisions, which means Adafruit_NeoPixel is running slightly slower than the correct rate of 800 kbps (384 LEDs x 24 bits x 1.25 us = 11.52 ms). But you can easily see the time taken really is about 12 ms, not the 1 ms indicated by printing in the serial monitor.

In the bottom half, the zoomed-in scale is 1 us per division. The bits really are approximately 1.25 us wide. These LEDs communicate at 800 kbps.

PaulStoffregen · Dec 30, 2019

Since I have the oscilloscope connected, I ran a couple more tests to better show the timing. I connected another channel to pin 12, which is driven high before leds.show() and low after. So the waveform on pin 12 shows how much time was spent in leds.show(). I used a fixed 15 ms delay between updates for these first tests.

Here's the Adafruit_NeoPixel test.

Code:

#include <Adafruit_NeoPixel.h>

Adafruit_NeoPixel leds = Adafruit_NeoPixel(384, 1, NEO_GRB + NEO_KHZ800);

void setup() {
  leds.begin();
  pinMode(12, OUTPUT);
}

void loop() {
  digitalWrite(12, HIGH);
  leds.show();
  digitalWrite(12, LOW);
  delay(15);
}

Here you can see the waveform on pin 12 shows the simple way Adafruit_NeoPixel works. The waveform stays high for about 12 ms while the library sends the pixel data. Then it's low for 15 ms for the delay(15) function. You can see the scope measures the frequency of pin 12 at 36.9 Hz, and the time high at 12.1 ms.

Here's the WS2812Serial test.

Code:

#include <WS2812Serial.h>

const int numled = 384;
const int pin = 1;

byte drawingMemory[numled*3];         //  3 bytes per LED
DMAMEM byte displayMemory[numled*12]; // 12 bytes per LED
WS2812Serial leds(numled, displayMemory, drawingMemory, pin, WS2812_GRB);

void setup() {
  leds.begin();
  pinMode(12, OUTPUT);
}

void loop() {
  digitalWrite(12, HIGH);
  leds.show();
  digitalWrite(12, LOW);
  delay(15);
}

Here you can see the time where pin 12 remains high is reduced to 718 us, and the rate of updates increases to 63.6 Hz, using the same delay(15). This is what's meant by non-blocking. As you can see in these waveforms, most of the 15 ms delay overlaps with the 12 ms time the library is transmitting the LED data.

Just remember, you can't rely on the Arduino timing functions like millis() and micros() when using a blocking library like Adafruit_NeoPixel. It interferes with time keeping and everything else using interrupts. But using an oscilloscope or logic analyzer, you can clearly see the real timing.

If you don't have a scope or logic analyzer but want to verify this, you can probably use a multimeter with frequency measurement to at least see the frequency on pin 12 when running both of these test programs. You can also use the DC voltage mode to see an average of pin 12's output, where a lower voltage reading means less time was spent in leds.show().
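As a rough guide to what a multimeter should show, the expected readings can be estimated from the time pin 12 spends high and low. This is only a sketch: updateRateHz() and averageVolts() are hypothetical helper names, the 3.3 V high level is an assumption based on the Teensy 3.2 being a 3.3 V board, and loop overhead outside leds.show() and delay() is ignored.

```cpp
// Pin 12 is high for highUs (time inside leds.show()) and low for delayUs
// (the delay() call), so one full cycle lasts highUs + delayUs microseconds.

// Frequency a meter would measure on pin 12, in Hz.
double updateRateHz(double highUs, double delayUs) {
  return 1e6 / (highUs + delayUs);
}

// Average DC voltage on pin 12, assuming the output swings between 0 V and vHigh.
double averageVolts(double highUs, double delayUs, double vHigh = 3.3) {
  return vHigh * highUs / (highUs + delayUs);
}
```

With the scope figures above, updateRateHz(12100, 15000) is about 36.9 Hz and updateRateHz(718, 15000) is about 63.6 Hz, matching the measured frequencies. The corresponding averages are roughly 1.47 V for Adafruit_NeoPixel and roughly 0.15 V for WS2812Serial.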

manitou · Dec 30, 2019

FWIW,
FastLED.show() is blocking, but micros()/millis() appear ok (12273 us)

Promnius · Dec 30, 2019

Thank you for all the wonderful replies!

It sounds like the easy button in my case is a Teensy 4.0, but this is all really really good information to have available too! It sounds like my other option is to move all my timing-critical code to interrupts, since your non-blocking library can still be interrupted while it is making that buffer copy.

I feel like such an idiot, of course I was using timing functions for the blocking versions of the library . . . I should have known better. Just for due diligence I hooked the data pin up to a scope and sure enough I am now getting the same waveforms as you.

Thanks for helping me to understand what was really going on!

PaulStoffregen · Dec 30, 2019

FastLED's blocking function has some crafty code that tries to increment the millis() count by the amount it would have incremented had interrupts not been blocked.
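The general idea can be sketched as follows. This is not FastLED's actual code: MillisCounter and compensateAfterBlocking() are hypothetical names, and the sketch simply assumes the blocked time equals the wire time of 24 bits per LED at 1.25 us per bit.

```cpp
#include <cstdint>

// Hypothetical stand-in for the system millisecond counter.
struct MillisCounter {
  uint32_t ms = 0;
};

// After a blocking transfer to numLeds pixels, add back the whole
// milliseconds that the tick interrupt could not count.
void compensateAfterBlocking(MillisCounter &counter, uint32_t numLeds) {
  const uint32_t blockedUs = numLeds * 24u * 1250u / 1000u;  // 1.25 us per bit
  counter.ms += blockedUs / 1000u;
}
```

For 384 LEDs the sketch adds 11 ms, so code that calls millis() afterwards sees roughly the right elapsed time even though no tick interrupts ran during the transfer.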
