Ideas on a T4 parallel library using FlexIO

easone

Some recent discussion about driving parallel interfaces with FlexIO has got me thinking about developing some kind of general library. Is this something that would be useful to the community?

Ideas for features:
  • Parallel output with a flexible number and order of pins
    • 4, 8, 16 bit interfaces possible (up to 22 bits)
    • Max sustained speed in the 20-50 MHz range
    • Hardware clock pin output (optional)
  • Non-blocking: uses FlexIO and DMA or interrupts
  • User passes struct of parameters/options to library such as:
    • List of pins to correspond to output bits
    • Data and clock pin polarity
    • How the source data is organized (nibbles, 8bit, 16bit, 32bit)
    • Options to swap output order of bytes and nibbles
    • Options to discard MSBs or LSBs for odd interface widths
  • Total transfer size may be limited to multiples of 32 bytes
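A rough sketch of what such an options struct could look like. Every name here (`ParallelConfig`, `SourceFormat`, `isValidTransferSize`, and so on) is hypothetical and only mirrors the feature list above; nothing like this exists in a library yet, and the 22-bit and 32-byte limits are just the tentative numbers from the list.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical: how the source data is organized in memory.
enum class SourceFormat : uint8_t { Nibbles, Bits8, Bits16, Bits32 };

// Hypothetical options struct passed by the user to the library.
struct ParallelConfig {
    const uint8_t* dataPins;     // Teensy pin numbers; index 0 carries output bit 0
    size_t         numDataPins;  // interface width in bits
    int            clockPin;     // Teensy pin for the hardware clock, -1 if unused
    bool           invertData;   // data pin polarity
    bool           invertClock;  // clock pin polarity
    SourceFormat   sourceFormat; // organization of the source data
    bool           swapBytes;    // swap output order of bytes
    bool           swapNibbles;  // swap output order of nibbles
    bool           discardMsbs;  // odd widths: true drops MSBs, false drops LSBs
};

// Number of bits in one source element for a given format.
inline unsigned sourceBits(SourceFormat f) {
    switch (f) {
        case SourceFormat::Nibbles: return 4;
        case SourceFormat::Bits8:   return 8;
        case SourceFormat::Bits16:  return 16;
        default:                    return 32;
    }
}

// Assumed limit from the feature list: at most 22 data pins.
constexpr bool isValidWidth(size_t numDataPins) {
    return numDataPins >= 1 && numDataPins <= 22;
}

// Assumed limit from the feature list: transfers in multiples of 32 bytes.
constexpr bool isValidTransferSize(size_t bytes) {
    return bytes != 0 && bytes % 32 == 0;
}
```

A transfer call could reject a bad configuration up front with these checks instead of failing partway through.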
Internally, I am thinking there would be an interrupt that copies data to a buffer (reordering bits as necessary) and then the DMA-driven FlexIO process would run from that buffer. This is the same way the SmartMatrix driver works. Normally you would call the transfer and it would have to fill the buffer before outputting any data, but optionally you could fill the buffer in advance so that when you call the transfer it starts immediately.
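To make the reordering step concrete, here is a minimal sketch of the copy-and-reorder idea for 8-bit source data. `scatterBits` and `fillBuffer` are made-up names, and the bit-by-bit loop is only the simplest possible version; a real driver would more likely use lookup tables to keep the interrupt short.

```cpp
#include <cstddef>
#include <cstdint>

// Spread the low `width` bits of `value` onto arbitrary FlexIO pin positions.
// pinIndex[i] is the FlexIO pin index (0..31) that carries data bit i.
inline uint32_t scatterBits(uint32_t value, const uint8_t* pinIndex, size_t width) {
    uint32_t out = 0;
    for (size_t bit = 0; bit < width; ++bit) {
        if (value & (1u << bit)) {
            out |= (1u << pinIndex[bit]);
        }
    }
    return out;
}

// Copy `count` source bytes into the output buffer, reordering bits on the way.
// This stands in for what the buffer-filling interrupt would do.
inline void fillBuffer(const uint8_t* src, uint32_t* dst, size_t count,
                       const uint8_t* pinIndex, size_t width) {
    for (size_t i = 0; i < count; ++i) {
        dst[i] = scatterBits(src[i], pinIndex, width);
    }
}
```

With a pin map of {3, 0, 2, 1}, the source value 0b0101 sets FlexIO pins 3 and 2, so the buffered word is 0b1100.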

If your data pins are contiguous and in order (with respect to FlexIO pin ordering, not Teensy pin numbers), a more efficient process could be used that skips the buffering step and uses the source data without reordering bits, which could be done with nearly zero CPU usage or lag. But this wouldn't be a big deal unless you need maximum speed.
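Deciding between the two paths could be as simple as the check below. This is a sketch with a hypothetical helper name; it assumes the pin list has already been translated from Teensy pin numbers to FlexIO pin indices.

```cpp
#include <cstddef>
#include <cstdint>

// True if the FlexIO pin indices are contiguous and ascending, in which case
// the source data could be sent without the bit-reordering buffer step.
inline bool isContiguousInOrder(const uint8_t* pinIndex, size_t width) {
    if (width == 0) {
        return false;
    }
    for (size_t i = 1; i < width; ++i) {
        if (pinIndex[i] != pinIndex[i - 1] + 1) {
            return false;
        }
    }
    return true;
}
```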

There would be some restriction on the available pins. They all have to be from a single FlexIO peripheral, and the clock pin (if desired) has to have a lower or higher FlexIO pin number than all the data pins. Some documentation to show possible pin configurations would be needed. The maximum number of pins would be as follows:

  • T4.0
    • FlexIO1: 5 total (including backside pad 33), all contiguous
    • FlexIO2: 9 total (including backside pad 32), 4 contiguous
    • FlexIO3: 14 total (no DMA, more CPU usage; including backside pads 26-27), 6 contiguous
  • T4.1
    • FlexIO1: 9 total (including backside pads 49, 50, 52, 54), 5 contiguous
    • FlexIO2: 13 total, 4 contiguous
    • FlexIO3: 22 total (no DMA, more CPU usage), 20 contiguous
  • MicroMod
    • FlexIO1: 5 total, all contiguous
    • FlexIO2: 15 total, 13 contiguous
    • FlexIO3: 14 total (no DMA, more CPU usage), 6 contiguous
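The clock pin restriction described above (lower or higher FlexIO pin number than all the data pins) is easy to validate when the user passes in their pin list. A small sketch, with a hypothetical function name and working on FlexIO pin indices rather than Teensy pin numbers:

```cpp
#include <cstddef>
#include <cstdint>

// True if the clock's FlexIO pin index lies below or above every data pin index.
inline bool clockPinIsValid(uint8_t clockIndex, const uint8_t* dataIndex, size_t width) {
    if (width == 0) {
        return false;
    }
    uint8_t lowest = dataIndex[0];
    uint8_t highest = dataIndex[0];
    for (size_t i = 1; i < width; ++i) {
        if (dataIndex[i] < lowest)  lowest = dataIndex[i];
        if (dataIndex[i] > highest) highest = dataIndex[i];
    }
    return clockIndex < lowest || clockIndex > highest;
}
```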
 
easone:
I think this would be a great addition to the Teensy's libraries.
I have been reading this forum (Every Posting) for around 10 yrs and have found many different applications that could benefit the Teensy community.
Probably the most important for me would be high-speed A/Ds and D/As. The SPI A/Ds seem to be particularly problematic in this regard.
Changing to a parallel interface would certainly speed up the transfer a lot, especially if the API was easy to use.
I have noticed a lot of people over the years with little or no Hardware Engineering skills struggling with the hardware-software interface for these types of devices.

My two cents worth.
Regards,
Ed
 
A 16-bit parallel interface to drive an LCD in 8080 mode would be useful.

I've got a project in which I want to use a 5" LCD with an SSD1963 in 16-bit 8080 mode, but there is no driver for the Teensy 4.1.

Happy to help to test the library.
 
@eric I think this is a great idea! I’d be more than happy to contribute what I can and run tests with my 16ch logic analyzer.
I do wonder how dynamic it could be made to work without having to hardcode setup and functions for each data type and output type.

@skpang with 20 contiguous pins on FlexIO3 on the T4.1 (22 in total) this can be done easily. I can write up a few functions for you and you can wrap the LCD init & frame write logic around them. The downside is that it won't be able to support DMA, so a full screen write will block the main loop for 5-10ms while all the data is pushed out.
Alternatively, a polling method with an interrupt can be used to load 4 shifter buffers with 8 bytes of pixel data at a time, to free up the processor a little bit in between, but that will require more logic to handle.
 
@Rezo, if you could share your example code that would be great.

Do any of the other FlexIO instances have DMA support that is 16 bits wide?
 
Attached is some basic FlexIO setup to output a command and data via FlexIO3 on a Teensy 4.1
I have not tested it, so you might need to do some tinkering and possibly some bit-shifting to push out 8-bit commands and data for LCD setup on the LSBs of the bus.
View attachment T41_FlexIO3_16bit_SingleBeatWR.ino

If you look at the last section of Eric's post, you'll see that the MicroMod has the most contiguous FlexIO pins that support DMA: 13 pins on FlexIO2. He has helped me get an 8-bit-wide FlexIO instance running with DMA on an ILI948X with success; the max frame rate is about 75Hz.

But as you can see, there is no way to support a 16bit wide bus unless you do some fancy code writing and link two FlexIO instances together along with two DMA channels to create "one" long bus
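For a sense of scale, the 75Hz figure above can be turned into a bus rate. This is only back-of-envelope arithmetic: it assumes a 480x320 panel at 16 bits per pixel (the size mentioned later in the thread), ignores command overhead, and the function name is made up.

```cpp
#include <cstdint>

// Bus beats per second needed to stream full frames continuously.
// One beat moves `busBits` bits; the beats per frame are rounded up.
constexpr uint64_t requiredBeatsPerSecond(uint32_t width, uint32_t height,
                                          uint32_t bitsPerPixel, uint32_t busBits,
                                          uint32_t framesPerSecond) {
    return (static_cast<uint64_t>(width) * height * bitsPerPixel + busBits - 1)
           / busBits * framesPerSecond;
}

// 480 * 320 * 16 / 8 = 307200 beats per frame; at 75 frames per second
// that is 23,040,000 beats per second on an 8-bit bus.
static_assert(requiredBeatsPerSecond(480, 320, 16, 8, 75) == 23040000ULL,
              "8-bit bus at 75 Hz");
```

Under those assumptions an 8-bit bus at 75Hz works out to roughly 23 MHz, which sits inside the 20-50 MHz range proposed in the first post; a 16-bit bus would need half as many beats.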
 
Right - at least initially, the library will only support 16bit interfaces on T4.1 FlexIO3, but I think an interrupt based method can leave the processor free the majority of the time. Despite being noncontiguous on T4.0 and 4.1, the pins on FlexIO2 can support a DMA based 8 bit interface with the right buffering process. Technically FlexIO1 on T4.1 can too (but this requires soldering some wires to the back memory pads which is not convenient).

there is no way to support a 16bit wide bus unless you do some fancy code writing and link two FlexIO instances together along with two DMA channels to create "one" long bus
That would be the dream. I'm not ruling this out! But would have to figure out a way to connect the trigger signals between FlexIO1 and FlexIO2. Maybe XBAR can do this.
 
Jean Marc’s VGA_t4 links two FlexIO instances via linked DMA channels to push out data to VGA displays.
Might be worth looking into how he does it and giving it a try on a parallel bus.
 
...interesting idea, however I've not found a use case where DMA can help.
As of right now, I have a Teensy 3.6 operating in 8-bit parallel mode with an almost complete library. It's good enough to run the typical "graphics_test.ino" but can't quite run Demo Sauce yet... it still seems to have problems with reading.
That said, here are my results with the current library I'm working on:

Code:
ILI9341 Test!
Display Power Mode: 0x9C
MADCTL Mode: 0x48
Pixel Format: 0x5
Image Format: 0x0
Self Diagnostic: 0xC0
Benchmark                Time (microseconds)
Screen fill              42777
Text                     3542
Proportional Text        3250
Lines                    17082
Horiz/Vert Lines         3369
Rectangles (outline)     2164
Rectangles (filled)      87803
Circles (filled)         14769
Circles (outline)        13272
Triangles (outline)      4272
Triangles (filled)       29252
Rounded rects (outline)  4414
Rounded rects (filled)   95155
Done!
 
What bus speed is it running on?
How are you writing the data? directly to the port register?
 
Direct to port writing, and delayNanoseconds for pacing the signals. Full screen blit takes ~7.7ms.
I'll be sharing the library soon, and I just got DemoSauce going.
Uploading a video to ewwtwob of DemoSauce and the graphics test. I will post links here when the videos are ready.
It's based on bits and pieces found on GitHub plus the ILI9341_t3 source.
I'm also planning a full-on 16-bit for Teensy3.6 as well, which should be easy since I've managed to master 8-bit. :)
 
I have a library written by another forum member that supports 8/16 bit parallel bus for the ILI9488 on a Teensy 4.1
A full screen (320x480 @ 16 bit color) update took roughly 5ms. He used the same method as you: writing to the port register and using NOPs to delay the WR/RD pin pulse.
It's fast, but the code is blocking the main loop - for me, that's a bit of an issue as I have several other things happening at the same time:
1. LVGL drawing and rendering the UI
2. FlexCAN writing and reading to the CAN BUS as fast as possible
3. Logging data to an SD card

Therefore, using DMA to write big chunks of data to the screen will avoid any latency on the other peripherals/apps that are running as well.

I am busy writing a final driver for the ILI948x (1/6/8) based on FlexIO & DMA and would be happy to try run the DemoSauce test as soon as it's done and compare results (if relevant at all)
 
My library is for teensy 3.[01256]. There's the difference. The library also isn't as efficient as it could be.
With an under 10ms update rate, there's no reason why you couldn't break the update into smaller chunks, perhaps a couple of lines at a time, via an ISR. Totally doable with my class timer library for Teensy 3's, which can dish out 1usec IRQs, and still no real need to use DMA.
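The chunking idea could be tracked with a tiny state object like the one below. This is a sketch only: `ChunkedUpdate` is a hypothetical name, and the timer ISR that would call `next()` and push the lines out is not shown.

```cpp
#include <cstdint>

// Splits a full-screen update into runs of a few lines, one run per timer tick.
class ChunkedUpdate {
public:
    ChunkedUpdate(uint16_t totalLines, uint16_t linesPerChunk)
        : totalLines_(totalLines), linesPerChunk_(linesPerChunk), nextLine_(0) {}

    // Returns how many lines to send on this tick and stores the first line
    // of the run in `firstLine`. Returns 0 once the whole frame has been sent.
    uint16_t next(uint16_t& firstLine) {
        if (nextLine_ >= totalLines_ || linesPerChunk_ == 0) {
            return 0;
        }
        firstLine = nextLine_;
        uint16_t remaining = static_cast<uint16_t>(totalLines_ - nextLine_);
        uint16_t count = remaining < linesPerChunk_ ? remaining : linesPerChunk_;
        nextLine_ = static_cast<uint16_t>(nextLine_ + count);
        return count;
    }

    // Start over for the next frame.
    void restart() { nextLine_ = 0; }

private:
    uint16_t totalLines_;
    uint16_t linesPerChunk_;
    uint16_t nextLine_;
};
```

For a 320-line frame sent 128 lines at a time, successive calls yield runs starting at lines 0, 128 and 256, with the last run being 64 lines long.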

Besides, if you are using DMA, are you sending the entire video frame? or chunks? I'm assuming DMA from PSRAM?
CAN shouldn't be affected, as that runs on an ISR.
DMA will be fighting the processor if it has cache misses too and wants to access the same bank. Also, with DMA, you might have a slight timing issue, but then I'm not sure how the writes are timed using FlexIO on the Teensy 4.x. I have both the 4.0 and 4.1, but have not fully explored them yet. If it's not too convoluted, I'll set up a Teensy 4.1 for testing; I do have 8MB PSRAM already installed too.
 
What I'd love to see is a 4:3 TFT that's 640x480. They seem to be nowhere to be found, because everyone thinks 16:9 is all cool and stuff, bleh.
Why? well, I'm not rendering fancy graphics, I just want a terminal I can stuff in my pocket :)
 
Yes, but no price listed, and anything else I've managed to find costs over 100USD... :-(
 
Yeah, but I just got 2 320x240 ILI9341, which has a flash chip, and SPI resistive touch controller on-board for under $12 delivered, from Amazon.
The date codes suggest that they're new-old-stock, as they're from 2015.
Flash chip I'm going to pop onto one of my Teensy 4.1 boards. Might be amusing. :)
These displays also don't have the unwanted level shifters like on the Adafruit in the videos, and have all pins broken out, so that you could do 18bit interface if you wanted to.

If the above display was in the $30 range, I'd be interested, and even at that price, the seller would make a good profit.
I can get HD capacitive touch TFT for about $50.
In fact I have a bunch around, but unfortunately the pins needed to drive it from the Teensy 4.1 aren't exposed/broken out/available.
Of course those pins not being available eliminates the possibility of using a camera directly too, but this is what we have with the design.
I'm guessing part of the reason for the higher costs today is the pandemic, which has caused glass shortages and silicon shortages, on top of the capacitor famine that's been going on for the last 2 years.
It's pretty much how things are world wide right now, and I can wait.
Thanks for the links though, I'll keep watching.
 
I'm using a Teensy MicroMod as it has enough contiguous pins on FlexIO 2 to drive the display @ 8 bit wide bus.
FlexIO is a series of shifters and timers. Configured correctly, you only need to load the shifter buffer(s) with the data, and FlexIO will push the data out through the shifter (pins) while the timer generates the WR pulse. That is much less complex to set up and manage than writing directly to the port register and using delays.
I can fit a whole screen buffer (480*320*2 = 307,200 bytes, or 307.2kB) into DMAMEM, so with DMA, I can write the whole frame in one shot.
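The buffer size claim can be checked at compile time. A small sketch; the constant names are made up, and the 512 KB figure for RAM2 (where DMAMEM places data on the Teensy 4.x and MicroMod) is stated as an assumption.

```cpp
#include <cstdint>

constexpr uint32_t kScreenWidth   = 480;
constexpr uint32_t kScreenHeight  = 320;
constexpr uint32_t kBytesPerPixel = 2;   // 16-bit color

// 480 * 320 * 2 = 307200 bytes for one full frame.
constexpr uint32_t kFrameBytes = kScreenWidth * kScreenHeight * kBytesPerPixel;

// Assumption: RAM2, the region DMAMEM uses, is 512 KB.
constexpr uint32_t kRam2Bytes = 512u * 1024u;

static_assert(kFrameBytes == 307200u, "frame size matches the figure in the post");
static_assert(kFrameBytes <= kRam2Bytes, "one frame fits in RAM2");
static_assert(kFrameBytes % 32u == 0u, "frame is a whole number of 32-byte blocks");
```

The last check ties back to the first post: a full frame also happens to be a multiple of 32 bytes, so it would satisfy the proposed transfer size restriction.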

PSRAM is too slow for such fast writes I believe, plus the MM does not support the PSRAM chips.
I tried PSRAM with Kurt's ILI9341_t3n library and it was significantly slower than reading from RAM1 or RAM2.

I'll let Eric and other SMEs comment on the advantages/disadvantages of FlexIO/DMA over other implementation methods for 8080/6800 parallel communications.
 
Yeah, there lies the difference in end-goals. You are rendering entire frames, I'm not always doing that, thus I only update areas that are changed.
You could get a huge speed boost if you could mark dirty lines or areas instead, and only blit those. The ILI chips do support blitting in a limited window, so that could be an option to explore.
No need to update something that's already in sync... Perhaps queue up update streams and DMA those instead, that's how I'd do it.
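One simple way to do the dirty-area marking is to keep a single bounding box that grows with every change and blit only that window. A minimal sketch with a hypothetical `DirtyRect` type; tracking several separate areas or individual lines would be a refinement of the same idea.

```cpp
#include <algorithm>
#include <cstdint>

// Bounding box of everything changed since the last blit (inclusive corners).
struct DirtyRect {
    uint16_t x0 = 0, y0 = 0, x1 = 0, y1 = 0;
    bool dirty = false;

    // Grow the box to include the rectangle at (x, y) with size w by h.
    void mark(uint16_t x, uint16_t y, uint16_t w, uint16_t h) {
        if (w == 0 || h == 0) {
            return;
        }
        const uint16_t rx1 = static_cast<uint16_t>(x + w - 1);
        const uint16_t ry1 = static_cast<uint16_t>(y + h - 1);
        if (!dirty) {
            x0 = x; y0 = y; x1 = rx1; y1 = ry1;
            dirty = true;
            return;
        }
        x0 = std::min(x0, x);
        y0 = std::min(y0, y);
        x1 = std::max(x1, rx1);
        y1 = std::max(y1, ry1);
    }

    // Number of pixels that would have to be sent for the current box.
    uint32_t pixelCount() const {
        if (!dirty) {
            return 0;
        }
        return static_cast<uint32_t>(x1 - x0 + 1) * static_cast<uint32_t>(y1 - y0 + 1);
    }

    // Call after the box has been blitted.
    void clear() { dirty = false; }
};
```

Marking a 5x5 area at (10, 20) and a 10x2 area at (12, 22) gives a box from (10, 20) to (21, 24), which is 60 pixels instead of a whole frame.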
 
I'm not rendering entire frames, but I can write entire frames if needed.
In my case LVGL is responsible for rendering only the areas that require updates, so it doesn't redraw the entire screen each time unless a whole screen update is needed.
My display driver receives the frame data (whole or partial) from LVGL; I set the window address in GRAM based on the area that is being updated and write the data to the relevant pixels. If the data is less than 32 bytes long, it is written with a blocking method; if it is larger, it is written with DMA.
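The dispatch described above might look roughly like this. It is a sketch, not the actual driver: the names are made up, a transfer of exactly 32 bytes is assumed to go to DMA, and the window encoding uses the standard MIPI DCS column/page address commands (0x2A and 0x2B, followed by memory write 0x2C) that ILI-type controllers accept.

```cpp
#include <cstddef>
#include <cstdint>

enum class WritePath { Blocking, Dma };

// Threshold from the post: short writes are not worth setting up DMA for.
constexpr size_t kDmaThresholdBytes = 32;

constexpr WritePath choosePath(size_t bytes) {
    return bytes < kDmaThresholdBytes ? WritePath::Blocking : WritePath::Dma;
}

// MIPI DCS commands used to set the GRAM window before writing pixels.
constexpr uint8_t kCmdColumnAddressSet = 0x2A;
constexpr uint8_t kCmdPageAddressSet   = 0x2B;
constexpr uint8_t kCmdMemoryWrite      = 0x2C;

// Encode an inclusive start/end range as the four parameter bytes that
// follow a column or page address set command (high byte first).
inline void encodeRange(uint16_t start, uint16_t end, uint8_t out[4]) {
    out[0] = static_cast<uint8_t>(start >> 8);
    out[1] = static_cast<uint8_t>(start & 0xFF);
    out[2] = static_cast<uint8_t>(end >> 8);
    out[3] = static_cast<uint8_t>(end & 0xFF);
}
```

A full-width window on a 480-pixel axis encodes as 0x00, 0x00, 0x01, 0xDF, since 479 is 0x01DF.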
 
Nice! So, where are these wonderful libraries located? I have the components here to try things.
 
Still working on it :)

But let's focus on the thread's topic: a general library for parallel data transmission on the T4/4.1/MM.
Driving displays is just one of the use cases.
 