24bit packed data format

Status
Not open for further replies.

Projectitis

Well-known member
Hi all,

A 24bit unsigned integer format question that is not audio related!

I'd like to use a 24bit pixel format to conserve space in teensy flash/PROGMEM: RGBA6666
This is similar to RGB565, but with an extra bit for the R and B channels, plus 6 bits for alpha.
Internally I'll be storing this data as 16b RGB565 plus a separate 8b Alpha (I may end up changing the bits to RGBA5658 to suit).
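To make the two layouts concrete, here is a minimal sketch of the RGBA6666 to RGB565 plus 8 bit alpha conversion described above. The bit order (R in the top 6 bits, then G, B, and A in the lowest 6) is an assumption, as are the names `Pixel565A8` and `rgba6666ToRgb565A8`; nothing in this post fixes them.

```cpp
#include <stdint.h>

// Hypothetical in-RAM form: 16 bit RGB565 plus a separate 8 bit alpha.
struct Pixel565A8 {
    uint16_t rgb565;  // RRRRRGGG GGGBBBBB
    uint8_t  alpha;   // 0..255
};

// Assumed RGBA6666 layout inside the low 24 bits of p:
//   bits 23..18 = R, 17..12 = G, 11..6 = B, 5..0 = A
inline Pixel565A8 rgba6666ToRgb565A8(uint32_t p) {
    uint32_t r6 = (p >> 18) & 0x3F;
    uint32_t g6 = (p >> 12) & 0x3F;
    uint32_t b6 = (p >> 6) & 0x3F;
    uint32_t a6 = p & 0x3F;

    Pixel565A8 out;
    // Drop the lowest bit of R and B to get 5 bits each; G keeps all 6 bits.
    out.rgb565 = static_cast<uint16_t>(((r6 >> 1) << 11) | (g6 << 5) | (b6 >> 1));
    // Expand 6 bit alpha to 8 bits by replicating the top bits into the low bits.
    out.alpha = static_cast<uint8_t>((a6 << 2) | (a6 >> 4));
    return out;
}
```

With this layout, a fully opaque white pixel (all 24 bits set) maps to 0xFFFF with alpha 255.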

The main thing I need to be able to do is step through a packed const byte array for reading.
Once read, I don't mind storing the pixel data itself as either uint32_t or ( uint16_t, uint8_t ).

Rather than use 32b per pixel in flash and 'waste' a byte, I'd really like to use packed data (3-byte words) and step through them with a pointer. How can I do this most efficiently/quickly?

e.g. - something like:

Code:
uint24_t *imageData = [pointer to image data in flash mem];
uint24_t pixel;

uint24_t *imageDataPtr = imageData;
while (some_exit_condition){
    pixel = *imageDataPtr;
    imageDataPtr++;
}

Do I need to do something like this?

Code:
uint8_t *imageData = [pointer to image data in flash mem];
uint16_t pixelRGB;
uint8_t pixelA;

uint8_t *imageDataRGBPtr = imageData;
uint8_t *imageDataAPtr = imageData + 2;
while (some_exit_condition){
    pixelRGB = *(uint16_t*)imageDataRGBPtr;
    pixelA = *imageDataAPtr;
    imageDataRGBPtr+=3;
    imageDataAPtr+=3;
}
 
There are two ways of looking at this. I'm going to talk about the hardware side. I'm pretty sure MichaelMeissner will soon comment about the compiler getting its revenge...

The answer depends on which ARM chip you're using. With Cortex-M0+ on Teensy LC, the hardware never supports unaligned access. Any unaligned read will cause a fault, which effectively crashes your program.

With Cortex-M4 on Teensy 3.2, 3.5 & 3.6, you can do unaligned reads. So if you cast a pointer to an arbitrary byte address to a 32 bit pointer and read it, there's a 25% chance the address happens to be aligned and the hardware will read all 4 bytes in 1 bus cycle. Otherwise the hardware will automatically do 2 bus cycles to the memory, which isn't wonderful but it's still much faster than writing code to do 2 reads and merge the results.

However, there is one big gotcha if you do this from RAM. The RAM is actually in 2 banks. The lower bank ends at address 0x1FFFFFFF and the upper bank begins at 0x20000000. An unaligned access crossing that boundary will cause a fault, crashing your program. Since you're reading from Flash memory, this should not be an issue. Just know it does matter if your data is in RAM and crosses that boundary.
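For comparison, here is a minimal sketch of the slower "do the reads yourself and merge" approach mentioned above, done one byte at a time. It never issues an unaligned access, so it should also be safe on Cortex-M0+ and across the RAM bank boundary. The names `read24le` and `pixelAt` are made up for illustration, and a little-endian byte order within each 3-byte pixel is assumed.

```cpp
#include <stdint.h>
#include <stddef.h>

// Assemble one 24 bit value from three single-byte reads.
// Byte reads are always aligned, so this works at any address.
inline uint32_t read24le(const uint8_t *p) {
    return static_cast<uint32_t>(p[0])
         | (static_cast<uint32_t>(p[1]) << 8)
         | (static_cast<uint32_t>(p[2]) << 16);
}

// Fetch pixel number `index` from a packed array of 3-byte pixels.
inline uint32_t pixelAt(const uint8_t *data, size_t index) {
    return read24le(data + 3 * index);
}
```

This costs three loads and two shifts per pixel, so it is the baseline the unaligned 32 bit read is being compared against.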
 
Fantastic, thanks Paul. Sounds great. The code I'm working on won't run on anything 'less' than T_3.2 anyway (and then probably only T_3.5 and T_3.6).
Is there a define such as SUPPORT_UNALIGNED_ACCESS that I can wrap code in, or should I check for PROCESSOR_TEENSY_XXX? Or some sort of _ARM_XXX define?
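One possible way to wrap this, as a rough sketch: the ARM C Language Extensions describe a macro `__ARM_FEATURE_UNALIGNED`, which, as far as I know, GCC defines when it is allowed to generate unaligned accesses for the target (so for Cortex-M4, but not Cortex-M0+). The names `PIXEL_FAST_UNALIGNED` and `readPixel24` are made up for this example, and little-endian byte order is assumed.

```cpp
#include <stdint.h>
#include <string.h>

// Hypothetical project-local switch, driven by the ACLE feature macro.
#if defined(__ARM_FEATURE_UNALIGNED)
#define PIXEL_FAST_UNALIGNED 1
#else
#define PIXEL_FAST_UNALIGNED 0
#endif

// Read one packed 24 bit pixel starting at byte address p.
inline uint32_t readPixel24(const uint8_t *p) {
#if PIXEL_FAST_UNALIGNED
    // One 32 bit load, then mask off the extra byte.
    // Note: this touches 1 byte past the pixel, so the array needs
    // at least one padding byte after the last pixel.
    uint32_t w;
    memcpy(&w, p, sizeof w);  // the compiler may turn this into a single load
    return w & 0xFFFFFFu;
#else
    // Byte-wise fallback: no unaligned access at all.
    return static_cast<uint32_t>(p[0])
         | (static_cast<uint32_t>(p[1]) << 8)
         | (static_cast<uint32_t>(p[2]) << 16);
#endif
}
```

Checking a per-board define would also work, but a feature macro like this avoids having to list every chip.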
 
If you want to do this the most efficient way, and if you have control over the allocation of the image data, you might consider slightly unrolling your loop.

Cortex-M4 has a burst access optimization in hardware, where subsequent 32 bit access is single cycle, so you only suffer 2 cycles for the first 32 bit read (assuming aligned access).

If you can arrange for all your pixel data to be aligned to a 32 bit boundary, and its length to be a multiple of 4 pixels (12 bytes), you might write code like this:

Code:
while (some_exit_condition) {
        uint32_t word1 = *ptr++;
        uint32_t word2 = *ptr++;  // read 4 pixels... the 2nd & 3rd fetch are faster
        uint32_t word3 = *ptr++;

        pixelA = word1 & 0xFFFFFF;
        pixelB = (word1 >> 24) | ((word2 << 8) & 0xFFFF00);
        // etc....
}

You might even come up with some crafty way to arrange each group of 4 pixels to minimize the number of logic operations and shifts.
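Filling in the "etc" part, a complete version of that unpacking might look like the sketch below. It assumes the pixels are packed back to back in little-endian order, that `src` is 4 byte aligned, and that `pixelCount` is a multiple of 4. The function name `unpack24` and the output buffer are made up for illustration; a real loop would probably use each pixel directly rather than store it.

```cpp
#include <stdint.h>
#include <stddef.h>

// Unpack groups of four 24 bit pixels from three 32 bit words.
inline void unpack24(const uint32_t *src, uint32_t *dst, size_t pixelCount) {
    for (size_t i = 0; i < pixelCount; i += 4) {
        uint32_t word1 = *src++;
        uint32_t word2 = *src++;
        uint32_t word3 = *src++;

        // Pixel A: low 3 bytes of word1.
        dst[i]     = word1 & 0xFFFFFF;
        // Pixel B: top byte of word1 plus low 2 bytes of word2.
        dst[i + 1] = (word1 >> 24) | ((word2 << 8) & 0xFFFF00);
        // Pixel C: top 2 bytes of word2 plus low byte of word3.
        dst[i + 2] = (word2 >> 16) | ((word3 << 16) & 0xFF0000);
        // Pixel D: top 3 bytes of word3.
        dst[i + 3] = word3 >> 8;
    }
}
```

For example, the 12 bytes 0x01 through 0x0C unpack to 0x030201, 0x060504, 0x090807 and 0x0C0B0A.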

The key to optimizing this sort of code is carefully structuring things to stay within the ARM's 16 registers. At least 3 are used for the stack pointer, link register (return address) and program counter, the buffer pointer and the 3 words you read in burn 4 more, and the exit condition probably uses at least 1 more. They go quickly. Usually you end up compiling and then reading the .lst file (which Arduino stores in the temp folder if you're using Teensy). As soon as the compiler is forced to "spill" your variables onto the stack, you'll see lots of LDR and STR instructions using R13 (SP) plus an immediate offset, which destroy all your optimization work.

If you're running on Teensy 3.6, you might also use an align attribute to align your pixel array to a 16 byte boundary. That may or may not matter, but that's the cache line size within the local memory controller. On Teensy 3.2 or 3.5, normal 4 byte align is probably fine. The flash memory has a tiny cache too... so benchmarking and testing is worthwhile to get the best use of these caches.

But the biggest gains will come from processing 4 pixels at a time, if you can manage to fit it into the ARM's register set.
 
Wow, thanks for the info. I have full control over the allocation and structure of the data, and 4 pixel blocks should be no problem. I have a python script that reads the image data (PNG) and creates the header. Never had to pay that much attention to the registers before, so this will be an experience ;)

For aligning, is it __attribute__ ((aligned (32))); that I'm after?
 
Just a note about __attribute__ aligned - the value is in bytes, not bits, so to align data to a 32bit word boundary, use __attribute__((aligned(4))):

Code:
static const uint8_t data[]  __attribute__ ((aligned (4))) = { 0xFF, ... , 0x32 };
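As a small sketch of how you might sanity-check this, the snippet below declares a made-up 12 byte array (four 3-byte pixels) with 4 byte alignment and tests the address. The names `imageBytes` and `isWordAligned` and the byte values are invented for the example; for the 16 byte case mentioned earlier for Teensy 3.6, the same attribute would take `aligned (16)`.

```cpp
#include <stdint.h>

// Made-up pixel data: four 3-byte pixels = 12 bytes = three 32 bit words.
// The attribute value is in bytes, so 4 means a 32 bit word boundary.
static const uint8_t imageBytes[12] __attribute__ ((aligned (4))) = {
    0x01, 0x02, 0x03,  0x04, 0x05, 0x06,
    0x07, 0x08, 0x09,  0x0A, 0x0B, 0x0C
};

// True if p sits on a 4 byte boundary (its low two address bits are zero).
inline bool isWordAligned(const void *p) {
    return (reinterpret_cast<uintptr_t>(p) & 3u) == 0;
}
```

Calling `isWordAligned(imageBytes)` once at startup is a cheap way to confirm the attribute took effect before reading the data a word at a time.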
 