T4 FlexIO - Looking back at my T4 beta testing library FlexIO_t4

Hm, that doesn't seem quite right... the DMA transfer size (SSIZE and DSIZE) shouldn't vary with the number of shifters, it should always be 32 bits so the DMA engine copies a word from memory into each FlexIO shifter buffer.

Try this code. It runs without errors and triggers the DMA callback, although I can't confirm whether the data is shifting out properly since I don't have an oscilloscope anymore. It should work with 4 shifters and I think it works with 8 shifters too.
 

Attachments

  • FlexIO_MultiBeat_DMA_demo_3.ino
    13.9 KB · Views: 130

That actually works!! Although it's outputting 0x05, 0x00 16 times each and not 0x05 32 times

I commented out the dma.enable in the main loop to allow it to trigger just once, but here it is:
Screen Shot 2021-09-14 at 9.36.13.png

Thank you for getting me this far! I'll take a deeper look into what you put together and try to implement it in my POC driver.
 
Sorry, I have been distracted..

My assumption as well is that shifters 4-7 do not have a DMA source associated with them.

Also note that there are only two DMA sources on FlexIO1, two on FlexIO2, and none on FlexIO3.

Each of these sources handles two shifters, so you can use it with one or the other...

I just got a reply back from NXP - they claim that FlexIO on the RT106x and 105x has only 4 shifters, not 8.
So I guess their application note software pack for the RT1050 is misleading, as the default configuration there uses 8 shifters. I sent this feedback to them.
 
That actually works!! Although it's outputting 0x05, 0x00 16 times each and not 0x05 32 times

I'm not sure what might be causing that. When I run the code on a T4.0 the shift buffers appear to be loaded correctly with 0x05050505, which should be shifting out 0x05 one byte at a time... Does the same thing happen if you comment out the DMA setup and just copy 0x05050505 into the shift buffers to trigger FlexIO manually?
 
I just got a reply back from NXP - they claim that FlexIO on the RT106x and 105x has only 4 shifters, not 8.
So I guess their application note software pack for the RT1050 is misleading, as the default configuration there uses 8 shifters. I sent this feedback to them.

It's weird. When you read the FLEXIO2_PARAM register after enabling FlexIO2, the value stored there indicates that there are 8 timers and 8 shifters on the chip rather than 4. Maybe the extra shifters are some kind of vestige of an earlier version of the chip?
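For anyone who wants to check the PARAM value themselves, here is a minimal sketch of how it could be decoded. It assumes the field layout given in the i.MX RT reference manual (shifter count in bits 7:0, timers in 15:8, pins in 23:16, triggers in 31:24). The struct and function names are made up for illustration, and the value in the usage note is only an example, not a reading taken from this thread.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the four counts packed into FLEXIOx_PARAM.
struct FlexIOParam {
    uint8_t shifters;  // bits 7:0
    uint8_t timers;    // bits 15:8
    uint8_t pins;      // bits 23:16
    uint8_t triggers;  // bits 31:24
};

// Splits a raw PARAM register value into its four 8-bit fields.
FlexIOParam decodeParam(uint32_t param) {
    FlexIOParam p;
    p.shifters = static_cast<uint8_t>(param & 0xFFu);
    p.timers   = static_cast<uint8_t>((param >> 8) & 0xFFu);
    p.pins     = static_cast<uint8_t>((param >> 16) & 0xFFu);
    p.triggers = static_cast<uint8_t>((param >> 24) & 0xFFu);
    return p;
}
```

On a Teensy you would pass the register itself, e.g. decodeParam(FLEXIO2_PARAM), after enabling the FlexIO2 clock. With an example value of 0x02200808 the shifters and timers fields both come out as 8.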
 
I'm not sure what might be causing that. When I run the code on a T4.0 the shift buffers appear to be loaded correctly with 0x05050505, which should be shifting out 0x05 one byte at a time... Does the same thing happen if you comment out the DMA setup and just copy 0x05050505 into the shift buffers to trigger FlexIO manually?

I tried this by removing all the DMA code, keeping the FlexIO setup as is (commenting out the DMA enable on shifter status flag)
Code:
void FLEXIO_8080_MulBeatWR_nPrm(uint32_t const cmdIdx, uint32_t const * buf, uint32_t const len){ 
    uint8_t BeatsPerMinLoop = SHIFTNUM * sizeof(uint32_t) / sizeof(uint8_t);      // Number of shifters * number of 8 bit values per shifter

    FLEXIO_8080_ConfigMulBeatWR();
    Serial.println("Starting transfer");
    for(int j=0; j<BeatsPerMinLoop; j++){
      p->SHIFTBUF[0]= (uint32_t)buf++;
      }
    
    for (int i=0; i< (SHIFTNUM); i++){
      Serial.printf("SHIFTBUF[%d]:%x \n", i,p->SHIFTBUF[i]);
      } 
    while(0 == (p->SHIFTSTAT & (1U << (SHIFTNUM-1))))
    {
    }
    
    /* Wait the last multi-beat transfer to be completed. Clear the timer flag
    before the completing of the last beat. The last beat may has been completed
    at this point, then code would be dead in the while() below. So mask the
    while() statement and use the software delay .*/
    p->TIMSTAT |= (1U << 0U);


    Serial.println("Transfer complete");

    
}




void setup() {
  Serial.begin(115200);
  Serial.print(CrashReport);
  Serial.println("Start setup");

  /* initialize databuf (cannot be initialized in declaration because it is declared DMAMEM) */
  uint32_t databuf_tmp[DATABUFBYTES] = {0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05} ;
  FlexIO_Init();
  FLEXIO_8080_MulBeatWR_nPrm(0x02, databuf_tmp, DATABUFBYTES);
}

void loop() {
  
}

In the serial monitor, after SHIFTBUF[0] is loaded with data, it does not overflow into the adjacent shifters:
Code:
FlexIO configured & enabled
Starting transfer
SHIFTBUF[0]:20077f9c 
SHIFTBUF[1]:0 
SHIFTBUF[2]:0 
SHIFTBUF[3]:0

Do I need to manually load data into each shifter on a multibeat write, or is the overflow the correct method?
 
I went ahead and loaded the buffers manually with 0x05050505 like so:
Code:
for (int i=0; i< (SHIFTNUM); i++){
      p->SHIFTBUF[i] = 0x05050505;
      Serial.printf("SHIFTBUF[%d]:%x \n", i,p->SHIFTBUF[i]);
      } 
    while(0 == (p->SHIFTSTAT & (1U << (SHIFTNUM-1))))
    {
    }
    
    /* Wait the last multi-beat transfer to be completed. Clear the timer flag
    before the completing of the last beat. The last beat may has been completed
    at this point, then code would be dead in the while() below. So mask the
    while() statement and use the software delay .*/
    p->TIMSTAT |= (1U << 0U);

Looks like it pushed it out 16 times as expected:
Screen Shot 2021-09-15 at 20.23.42.png
 

This is right, and the DMA setup should be doing essentially the same thing - copying 32 bits from buf[0] to SHIFTBUF[0], then buf[1] to SHIFTBUF[1] etc. The source and destination addresses are incremented by 4 bytes after each copy (because of the SOFF and DOFF settings) and FlexIO isn't triggered to output data until data is written to SHIFTBUF[SHIFTNUM-1]. Then the destination address is reset to SHIFTBUF[0] because of the ATTR_DMOD setting. So I'm confused why the results would be any different when you do this without DMA...
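To make the wrap-around concrete, here is a small host-side model of the destination addressing described above. It is only a sketch: the base address 0x1000 is hypothetical, the function names are invented, and it assumes the modulo feature works by letting only the low N address bits change (N = 4 for four shifters, i.e. a 16-byte window).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Adds `offset` to `addr`, but only the low `modBits` bits are allowed to change,
// so the address wraps inside a 2^modBits byte window.
uint32_t advanceWithModulo(uint32_t addr, uint32_t offset, uint32_t modBits) {
    const uint32_t mask = (1u << modBits) - 1u;
    return (addr & ~mask) | ((addr + offset) & mask);
}

// Returns the shifter index that each successive 32-bit write lands in.
std::vector<uint32_t> shifterWriteOrder(uint32_t shifters, uint32_t words) {
    // The wrap only lines up with the shifter bank when the count is a power of two.
    assert(shifters != 0 && (shifters & (shifters - 1u)) == 0);
    uint32_t modBits = 0;
    while ((1u << modBits) < shifters * 4u) ++modBits;  // 4 shifters -> 16 bytes -> 4 bits
    const uint32_t base = 0x1000u;  // hypothetical SHIFTBUF[0] address, aligned to the window
    uint32_t addr = base;
    std::vector<uint32_t> order;
    for (uint32_t i = 0; i < words; ++i) {
        order.push_back((addr - base) / 4u);
        addr = advanceWithModulo(addr, 4u, modBits);  // DOFF of 4 bytes per write
    }
    return order;
}
```

With 4 shifters and 8 words, the writes go to indices 0, 1, 2, 3 and then start again at 0, which is the same pattern as the manual loop over SHIFTBUF[i].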
 

A quick serial print revealed the following:
Code:
uint16_t databuf_tmp[DATABUFBYTES] = {0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05} ;
  memcpy(databuf, databuf_tmp, DATABUFBYTES); 
  arm_dcache_flush((void*)databuf, sizeof(databuf)); // always flush cache after writing to DMAMEM variable that will be accessed by DMA  
  Serial.printf("databuf[0]:%x \n", databuf[0]);

The value:
Code:
databuf[0]:50005

So the code is doing exactly what you had set it up to do.
The data transfer from the temp buffer to the DMAMEM buffer is where it went wrong: I had changed databuf_tmp to a 16-bit integer type while I was testing and overlooked it.
I will run another test later today to verify.
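The effect can be reproduced on a PC. The sketch below assumes a little-endian host and uses a plain byte array in place of the DMAMEM databuf; kBufBytes stands in for DATABUFBYTES and the function names are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kBufBytes = 32;  // stands in for DATABUFBYTES

// Temp buffer declared with 16-bit elements: each 0x05 occupies two bytes (05 00).
uint32_t firstWordFrom16BitTemp() {
    uint16_t tmp[kBufBytes];
    for (std::size_t i = 0; i < kBufBytes; ++i) tmp[i] = 0x05;
    uint8_t databuf[kBufBytes];
    std::memcpy(databuf, tmp, kBufBytes);  // byte count, so only 16 elements are copied
    uint32_t word;
    std::memcpy(&word, databuf, sizeof(word));
    return word;
}

// Temp buffer declared with 8-bit elements: the bytes are packed back to back.
uint32_t firstWordFrom8BitTemp() {
    uint8_t tmp[kBufBytes];
    for (std::size_t i = 0; i < kBufBytes; ++i) tmp[i] = 0x05;
    uint8_t databuf[kBufBytes];
    std::memcpy(databuf, tmp, kBufBytes);
    uint32_t word;
    std::memcpy(&word, databuf, sizeof(word));
    return word;
}
```

The 16-bit version yields 0x00050005, matching the 50005 printed above and the alternating 0x05, 0x00 seen on the bus, while the 8-bit version yields 0x05050505.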
 
I went ahead and added the DMA code into my simple display driver and am seeing some odd behavior:
1. It's not completing all the major loop iterations - only half of them.
2. It's not pushing out the correct values to the bus

Here is a screenshot of the display running in standard polling mode using a single shifter:
IMG_4308.jpeg

Here it's running 1 second later with DMA:
IMG_4309.jpeg

I left all the DMA settings as is, but changed the source address to a uint16_t pointer that stores the address of the C image array, which has a size of 480x320 with 16-bit color per pixel.

Here is what is printed out in the calculations before the DMA channel is configured:
Code:
destinationModulo: 4, MulBeatCountRemain: 0, MulBeatDataRemain: 1610772084, TotalSize: 153600, minorLoopBytes: 16, majorLoopCount: 9600

So to me it seems that it only completes roughly 4800 major loops, and it's not passing the full value per pixel.
 
Please post your complete code that you're experiencing this issue with - probably some subtlety about the DMA with 16 bit instead of 8 bit source...
 
@eric attached

I just modified the last demo code you provided to match the changes exactly as I have them running in my full library example. I didn't attach the library because it's an absolute mess at the moment and needs a lot of cleanup.

View attachment 25871

Looks like you're transferring only one half of your test image here, since it's an array of 32 16-bit values (64 bytes long):
Code:
 FLEXIO_8080_MulBeatWR_nPrm(0x02, tft_test_img, 32);
As currently defined the third argument is the size in bytes, not the number of elements... Try using sizeof(tft_test_img) instead of a hard coded value.

For your 480x320x16bit image, you should also have a Totalsize of 307200, twice as large...

Btw, you are still using an 8 bit shift width in FlexIO, not 16, so the lower and upper bytes of your 16 bit values are output alternately... Maybe that makes sense for the ILI9488, I'm not familiar.
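As a quick sanity check of the numbers, the loop counts can be worked out like this. It is a sketch only: DmaPlan and planTransfer are hypothetical names, and it assumes one minor loop fills every shifter with one 32-bit word, as in the demo.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical summary of the sizes printed before the DMA channel is configured.
struct DmaPlan {
    uint32_t totalBytes;      // size of the whole transfer in bytes
    uint32_t minorLoopBytes;  // bytes moved per DMA request (one word per shifter)
    uint32_t majorLoopCount;  // number of requests needed for the whole transfer
};

// Computes the sizes from the element count and element width, not a hard-coded byte count.
DmaPlan planTransfer(uint32_t elements, uint32_t bytesPerElement, uint32_t shifters) {
    assert(shifters > 0);
    DmaPlan plan;
    plan.totalBytes = elements * bytesPerElement;
    plan.minorLoopBytes = shifters * 4u;
    plan.majorLoopCount = plan.totalBytes / plan.minorLoopBytes;
    return plan;
}
```

For 480x320 pixels at 2 bytes each with 4 shifters this gives 307200 bytes and 19200 major loops. Treating the pixel count as the byte count gives the 153600 and 9600 shown in the printout, which would explain seeing only half the transfer.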
 

Good catch on the total size. I missed it because I forgot that I pass half that value in the polling method, since each iteration pushes the upper 8 bits and then the lower 8 bits, like so:
Code:
for(uint32_t i=0; i<length; i++)
        {
          buf = *pcolors++;
            while(0 == (p->SHIFTSTAT & (1U << 0)))
            {
            }
            p->SHIFTBUF[0] = buf >> 8;


            while(0 == (p->SHIFTSTAT & (1U << 0)))
            {
            }
            p->SHIFTBUF[0] = buf & 0xFF;
            
        }


I hooked up the logic analyser to see what the first two bytes are when trying the DMA method

1. The first two bytes in the image are 0x40 and 0x0b as a 16 bit value 0x400b
2. The polling method outputs as expected, 0x40 and then 0x0b
3. The DMA method outputs 0x01 and then 0x08

Even when I used the DMAMEM buffer at first, I was getting the same behavior on the screen and the same results on the logic analyser.
 
Sorry I have not been following the thread as much lately...

But remember with DMA you have to deal with the screwiness of DMAMEM

Things like:

At least in some cases, it wants the memory to be aligned to 32-byte boundaries.

If you are doing DMA out of DMAMEM, you need to flush the cache (arm_dcache_flush), because DMA works with the underlying memory, which may not have the updated contents if your code changed values in those memory locations...

If you are doing DMA into DMAMEM-like locations, you need to tell the cache to delete its contents at those locations; otherwise your code may read stale data out of the cache instead of the new data that DMA wrote to the actual memory. I typically use arm_dcache_flush_delete for this. If you are certain that the area you are reading into is properly 32-byte aligned and a multiple of 32 bytes in length, you can use the probably faster arm_dcache_delete. But again, the delete without flush can be dangerous.
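One way to make that alignment condition explicit is a small check like the one below. This is only a sketch: isCacheLineSafe is a made-up helper name, and it assumes the 32-byte cache line size mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 32;  // cache line size in bytes, as noted above

// True when the region starts on a cache line boundary and covers whole lines only,
// i.e. no cache line is shared with unrelated data.
bool isCacheLineSafe(const void* addr, std::size_t len) {
    const std::uintptr_t a = reinterpret_cast<std::uintptr_t>(addr);
    return (a % kCacheLine == 0) && (len % kCacheLine == 0);
}
```

A receive path could then call arm_dcache_delete when the check passes and fall back to arm_dcache_flush_delete when it does not.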
 
No worries Kurt, I'm slowly getting the hang of it.

Reminder to self - start using more #define's. I had missed shifter 0's output pin and had it set to pin 0 instead of pin 4, so I was only seeing the last three bits being pushed out. I fixed that and now I get the data pushed out, but the bytes are swapped around for some reason.

I'm feeding the DMA channel source with a pointer to the image array. After a final check on the logic analyser, I can see that instead of sending out 0x40 and then 0x0B, it sends 0x0B followed by 0x40.
As a quick fix, I set the destination address to SHIFTBUFBYS[0] that swaps the bytes around, but this is not a great solution as it swaps all 4 bytes around, while I need it to swap the first two, and then the other two.

So slowly getting closer, but stumbling upon a new issue each time :eek:

I must say, the screen update is VERY fast even at a "low" 6 MHz bus speed, and it will take 24 MHz with no issue.
 

Let me make sure I understand the latest issue. You have an array of 16 bit pixels, which is stored in memory as little-endian bytes {LSB[0], MSB[0], LSB[1], MSB[1], LSB[2], MSB[2], LSB[3], MSB[3], ...}. When these bytes are copied to the FlexIO SHIFTBUF, they are shifted out in little-endian order as LSB[0], MSB[0], LSB[1], MSB[1], LSB[2], MSB[2], LSB[3], MSB[3] etc. But your screen needs to receive them in big-endian order MSB[0], LSB[0], MSB[1], LSB[1], MSB[2], LSB[2], MSB[3], LSB[3] etc.

Using SHIFTBUFBYS swaps the bytes but also swaps pairs of pixels giving you MSB[1], LSB[1], MSB[0], LSB[0], MSB[3], LSB[3], MSB[2], LSB[2] etc. I suppose that SHIFTBUFHWS doesn't work either because it swaps half words without swapping bytes within half words giving you LSB[1], MSB[1], LSB[0], MSB[0], LSB[3], MSB[3], LSB[2], MSB[2] etc. Is that right?
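The three orderings can be checked with a small host-side model. It is a sketch under a few assumptions: the bus sends the low byte of each 32-bit word first, SHIFTBUFBYS reverses all four bytes, and SHIFTBUFHWS swaps the two half words. All function names are invented for the example.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Two consecutive 16-bit pixels as they sit in one little-endian 32-bit word.
uint32_t packPixels(uint16_t first, uint16_t second) {
    return static_cast<uint32_t>(first) | (static_cast<uint32_t>(second) << 16);
}

// Models a write through SHIFTBUFBYS: all four bytes reversed.
uint32_t byteSwap32(uint32_t w) {
    return (w >> 24) | ((w >> 8) & 0x0000FF00u) | ((w << 8) & 0x00FF0000u) | (w << 24);
}

// Models a write through SHIFTBUFHWS: upper and lower half words exchanged.
uint32_t halfWordSwap32(uint32_t w) {
    return (w >> 16) | (w << 16);
}

// Bytes in the order an 8-bit parallel shift would emit them (lowest byte first).
std::array<uint8_t, 4> shiftOutOrder(uint32_t w) {
    return {static_cast<uint8_t>(w), static_cast<uint8_t>(w >> 8),
            static_cast<uint8_t>(w >> 16), static_cast<uint8_t>(w >> 24)};
}
```

For pixels 0x0102 and 0x0304 the plain word goes out as 02 01 04 03, the byte-swapped word as 03 04 01 02, and the half-word-swapped word as 04 03 02 01. None of the three is the 01 02 03 04 that the display wants.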
 

In a nutshell, yes - that is what is happening. Is this the expected behavior?

This is the source:
Code:
const uint16_t tft_test_img[32] FLASHMEM={
0x0102, 0x0304, 0x0506, 0x0708, 0x400B, 0x400B, 0x400B, 0x400B, 0x400B, 0x400B, 0x400B, 0x400B, 0x400B, 0x400B, 0x400B, 0x400B,  
0x400B, 0x400B, 0x400B, 0x400B, 0x400B, 0x400B, 0x400B, 0x400B, 0x400B, 0x400B, 0x400B, 0x400B, 0x400B, 0x400B, 0x400B, 0x400B};

And this is the output:
dma_16bit_source_issue.png


The code is exactly the same as the last example you provided, but set up for the Micromod - so 8 pins, and the source is not the DMA buffer, just a pointer to the array above.
 
Is this the expected behavior?
Yes, I think so. The SHIFTBUFBYS option allows hardware conversion of endianness for 32 bit words, but there isn't a built in option for 16-bit endianness conversion...

I came up with an idea (more of a hack) to swap the upper and lower 16 bits of each word on-the-fly during the DMA transfer. This works by reading the source 16 bits at a time in reverse order, then writing to the destination 32 bits at a time in reverse order again. The result is that the words stay in the original order but each one is half-word swapped. There's also a crucial change to the FlexIO TRGSEL configuration on line 133, selecting shifter[0] as the FlexIO trigger instead of shifter[n-1], which is necessary since the DMA transfer goes in the reverse order now writing to shifter[0] last.

In combination with SHIFTBUFBYS to flip the byte order, I think this accomplishes what you were looking for. Using a source buffer of {0x0102, 0x0304, 0x0506, 0x0708, ...} we should be getting bytes out in the order {0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, ...}. Let me know if it works on your end.
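Here is a host-side model of the idea, for anyone who wants to convince themselves before flashing. It is not the TCD setup from the attached sketch, just a simulation with invented function names. It assumes that two consecutive 16-bit reads are packed into one 32-bit write with the first read in the low half, and that the bus emits the low byte of each shifter word first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Models a write through SHIFTBUFBYS: all four bytes reversed.
uint32_t byteSwap32(uint32_t w) {
    return (w >> 24) | ((w >> 8) & 0x0000FF00u) | ((w << 8) & 0x00FF0000u) | (w << 24);
}

// Reads the source backwards 16 bits at a time and fills the shifters backwards
// 32 bits at a time. Word order is preserved, but each word ends up half-word swapped.
std::vector<uint32_t> reversedCopy(const std::vector<uint16_t>& src) {
    assert(src.size() % 2 == 0);
    std::vector<uint32_t> shifters(src.size() / 2);
    std::size_t s = src.size();
    for (std::size_t d = shifters.size(); d-- > 0;) {
        const uint32_t low = src[--s];   // first read lands in the low half (assumed packing)
        const uint32_t high = src[--s];  // second read lands in the high half
        shifters[d] = low | (high << 16);
    }
    return shifters;
}

// Bytes as they would appear on the 8-bit bus after the byte-swapped write.
std::vector<uint8_t> busBytes(const std::vector<uint16_t>& src) {
    std::vector<uint8_t> out;
    for (uint32_t w : reversedCopy(src)) {
        const uint32_t v = byteSwap32(w);
        for (int shift = 0; shift < 32; shift += 8) {
            out.push_back(static_cast<uint8_t>(v >> shift));
        }
    }
    return out;
}
```

In this model a source of {0x0102, 0x0304, 0x0506, 0x0708} comes out as 01 02 03 04 05 06 07 08. Shifter 0 is written last, which is why the trigger has to move to shifter[0].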
 

Attachments

  • FlexIO_MultiBeat_DMA_demo_byteswap.ino
    15.5 KB · Views: 132
Almost, but still not quite there yet:
Screen Shot 2021-09-19 at 2.01.06.jpg

It sends out {0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03, 0x08,...}
 
Oddly enough, I ran another sketch and it worked fine, then ran your latest version again, and it worked like a charm.
You're a life saver, Eric!
Screen Shot 2021-09-19 at 2.42.24.jpg
I'll implement it into the display logic tomorrow or Monday and report back.

I REALLY appreciate all the assistance on this topic, and I will share the final code once I'm satisfied with how it runs.
 
@Eric
I just implemented your last FlexIO/DMA setup in my test sketch with the display, and it's smooth, quick and CRISP!

Thank you VERY much for contributing the time, effort and patience, and for helping me get to the desired result.
 
@KurtE I've started to play around with the IRQ for FlexIO3 on a T4.1 but it's just constantly triggering the interrupt as soon as I call NVIC_ENABLE_IRQ(IRQ_FLEXIO3) in my setup function.
I've tried both timer0 (TIMIEN |= (1U << 0)) and shifter0 (SHIFTSIEN |= (1U << 0)) as my interrupt triggers but the behavior is the same.

Do you have any experience with this?

Attached is a test sketch for a T4.1
View attachment T41_FlexIO3_IRQ.ino
 
@KurtE I know you probably have a lot of other more important things to handle right now, but if you could guide me on how to register a callback, I can take it from there by myself.
 