
Thread: Ideas on a T4 parallel library using FlexIO

  1. #1

    Ideas on a T4 parallel library using FlexIO

    Some recent discussion about driving parallel interfaces with FlexIO has got me thinking about developing some kind of general library. Is this something that would be useful to the community?

    Ideas for features:
    • Parallel output with a flexible number and order of pins
      • 4, 8, 16 bit interfaces possible (up to 22)
      • Max sustained speed in the 20-50 MHz range
      • Hardware clock pin output (optional)
    • Non-blocking: uses FlexIO and DMA or interrupts
    • User passes struct of parameters/options to library such as:
      • List of pins to correspond to output bits
      • Data and clock pin polarity
      • How the source data is organized (nibbles, 8bit, 16bit, 32bit)
      • Options to swap output order of bytes and nibbles
      • Options to discard MSBs or LSBs for odd interface widths
    • Total transfer size may be limited to multiples of 32 bytes
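
    To make the parameter struct idea a bit more concrete, here is a minimal sketch of what the options might look like. Every name in it (ParallelConfig, SourceWidth, droppedBits, and so on) is hypothetical, nothing here is an existing API, and the 32-byte rule is only the possible limit mentioned above.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical names throughout: a sketch of the options struct, not a real API.
enum class SourceWidth : uint8_t { Nibble = 4, Byte = 8, HalfWord = 16, Word = 32 };

struct ParallelConfig {
    std::array<uint8_t, 22> dataPins{};           // Teensy pin for each output bit, LSB first
    uint8_t     numDataPins = 8;                  // interface width, 1..22
    int16_t     clockPin    = -1;                 // -1 means no hardware clock output
    bool        invertData  = false;              // data pin polarity
    bool        invertClock = false;              // clock pin polarity
    SourceWidth sourceWidth = SourceWidth::Byte;  // how the source data is organized
    bool        swapBytes   = false;              // swap output order of bytes
    bool        swapNibbles = false;              // swap output order of nibbles
    bool        discardMsbs = true;               // odd widths: drop MSBs (true) or LSBs (false)
};

inline bool isValidConfig(const ParallelConfig& c) {
    return c.numDataPins >= 1 && c.numDataPins <= 22;
}

// Assumed rule: total transfer size must be a non-zero multiple of 32 bytes.
constexpr bool isValidTransferSize(uint32_t bytes) {
    return bytes != 0 && bytes % 32 == 0;
}

// Number of source bits dropped per unit when the bus is narrower than the source data.
inline uint8_t droppedBits(const ParallelConfig& c) {
    assert(isValidConfig(c));
    const uint8_t width = static_cast<uint8_t>(c.sourceWidth);
    return c.numDataPins < width ? static_cast<uint8_t>(width - c.numDataPins) : 0;
}
```

    For example, 8-bit source data on a 5-pin interface would drop 3 bits per byte, and discardMsbs would decide which end they come from.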

    Internally, I am thinking there would be an interrupt that copies data to a buffer (reordering bits as necessary) and then the DMA-driven FlexIO process would run from that buffer. This is the same way the SmartMatrix driver works. Normally you would call the transfer and it would have to fill the buffer before outputting any data, but optionally you could fill the buffer in advance so that when you call the transfer it starts immediately.

    If your data pins are contiguous and in order (with respect to FlexIO pin ordering, not Teensy pin numbers), a more efficient process could be used that skips the buffering step and uses the source data without reordering bits, which could be done with nearly zero CPU usage or lag. But this wouldn't be a big deal unless you need maximum speed.
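
    As a rough sketch of the reordering step, assuming one 32-bit buffer word per source byte and a lookup table from source bit to FlexIO pin index (both are illustrative assumptions, and the function names are made up):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scatter the low numBits bits of value to arbitrary FlexIO pin positions.
// bitToFlexPin[i] is the (hypothetical) FlexIO pin index that carries source bit i.
inline uint32_t scatterBits(uint32_t value, const uint8_t* bitToFlexPin, size_t numBits) {
    assert(numBits <= 32);
    uint32_t out = 0;
    for (size_t i = 0; i < numBits; ++i) {
        assert(bitToFlexPin[i] < 32);
        if (value & (1u << i)) out |= (1u << bitToFlexPin[i]);
    }
    return out;
}

// Fill the intermediate buffer: one 32-bit word per 8-bit source value (assumed layout).
inline void fillBuffer(const uint8_t* src, uint32_t* dst, size_t count,
                       const uint8_t* bitToFlexPin) {
    for (size_t i = 0; i < count; ++i) dst[i] = scatterBits(src[i], bitToFlexPin, 8);
}

// True when the mapping is base, base+1, base+2, ... In that case no per-bit
// reordering is needed and the buffering step could be skipped.
inline bool isContiguousInOrder(const uint8_t* bitToFlexPin, size_t numBits) {
    for (size_t i = 1; i < numBits; ++i)
        if (bitToFlexPin[i] != bitToFlexPin[i - 1] + 1) return false;
    return true;
}
```

    A real implementation would probably use precomputed lookup tables rather than a per-bit loop, since this would run inside an interrupt.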

    There would be some restriction on the available pins. They all have to be from a single FlexIO peripheral, and the clock pin (if desired) has to have a lower or higher FlexIO pin number than all the data pins. Some documentation to show possible pin configurations would be needed. The maximum number of pins would be as follows:

    • T4.0
      • FlexIO1: 5 total (including backside pad 33), all contiguous
      • FlexIO2: 9 total (including backside pad 32), 4 contiguous
      • FlexIO3: 14 total (no DMA, more CPU usage; including backside pads 26-27), 6 contiguous
    • T4.1
      • FlexIO1: 9 total (including backside pads 49, 50, 52, 54), 5 contiguous
      • FlexIO2: 13 total, 4 contiguous
      • FlexIO3: 22 total (no DMA, more CPU usage), 20 contiguous
    • MicroMod
      • FlexIO1: 5 total, all contiguous
      • FlexIO2: 15 total, 13 contiguous
      • FlexIO3: 14 total (no DMA, more CPU usage), 6 contiguous
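
    The pin restrictions could be checked up front with something like the sketch below. FlexPin and pinsAreUsable are hypothetical names, and the translation from Teensy pin numbers to FlexIO module/pin index is assumed to happen elsewhere.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical description of a pin in FlexIO terms (not Teensy pin numbers).
struct FlexPin {
    uint8_t module;  // FlexIO instance, 1..3
    uint8_t index;   // FlexIO pin number within that instance
};

// Checks the two restrictions described above:
//  - all pins belong to a single FlexIO peripheral
//  - the optional clock pin lies below or above every data pin
inline bool pinsAreUsable(const std::vector<FlexPin>& data, const FlexPin* clock) {
    if (data.empty()) return false;
    uint8_t lo = data[0].index;
    uint8_t hi = data[0].index;
    for (const FlexPin& p : data) {
        assert(p.module >= 1 && p.module <= 3);
        if (p.module != data[0].module) return false;
        lo = std::min(lo, p.index);
        hi = std::max(hi, p.index);
    }
    if (clock == nullptr) return true;            // clock output is optional
    if (clock->module != data[0].module) return false;
    return clock->index < lo || clock->index > hi;
}
```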

  2. #2
    Member
    Join Date
    Jul 2014
    Location
    Currently Odessa, Ukraine
    Posts
    35
    easone:
    I think this would be a great addition to the Teensy's libraries.
    I have been reading this forum (Every Posting) for around 10 yrs and have found many different applications that could benefit the Teensy community.
    Probably the most important for me would be high-speed A/Ds and D/As. The SPI A/Ds seem to be particularly problematic in this regard.
    Changing to a parallel interface would certainly speed up the transfer a lot, especially if the API was easy to use.
    I have noticed a lot of people over the years with little or no Hardware Engineering skills struggling with the hardware-software interface for these types of devices.

    My two cents worth.
    Regards,
    Ed

  3. #3
    Senior Member
    Join Date
    Jan 2015
    Location
    UK
    Posts
    174
    A 16-bit parallel interface to drive an LCD in 8080 mode would be useful.

    I've got a project in which I want to use a 5" LCD with an SSD1963 in 16-bit 8080 mode, but there is no driver for the Teensy 4.1.

    Happy to help test the library.

  4. #4
    Senior Member
    Join Date
    Oct 2019
    Posts
    239
    @eric I think this is a great idea! I'd be more than happy to contribute what I can and run tests with my 16ch logic analyzer.
    I do wonder how dynamic it could be made to work without having to hardcode setup and functions for each data type and output type.

    @skpang with 22 contiguous pins on FlexIO3 on the T4.1 this can be done easily. I can write up a few functions for you and you can wrap the LCD init & frame write logic around them. The downside is that it won't be able to support DMA, so a full screen write will block the main loop for 5-10ms while all the data is pushed out.
    Alternatively, a polling method with an interrupt can be used to load 4 shifter buffers with 8 bytes of pixel data at a time, to free up the processor a little bit in between, but that will require more logic to handle.

  5. #5
    Senior Member
    Join Date
    Jan 2015
    Location
    UK
    Posts
    174
    @Rezo, if you could share your example code that would be great.

    Do any of the other FlexIO instances have DMA support that is 16 bits wide?

  6. #6
    Senior Member
    Join Date
    Oct 2019
    Posts
    239
    Quote Originally Posted by skpang View Post
    @Rezo, if you could share your example code that would be great.

    Do any of the other FlexIO instances have DMA support that is 16 bits wide?
    Attached is some basic FlexIO setup to output a command and data via FlexIO3 on a Teensy 4.1
    I have not tested it, so you might need to do some tinkering and possibly some bit-shifting to push out 8-bit commands and data for LCD setup on the LSBs of the bus.
    T41_FlexIO3_16bit_SingleBeatWR.ino

    If you look at the last section in Eric's post, you'll see that the MicroMod has the most contiguous FlexIO pins that support DMA: 13 pins on FlexIO2. He has helped me get an 8-bit wide FlexIO instance running with DMA on an ILI948X with success; max frame rate is about 75Hz.


    But as you can see, there is no way to support a 16bit wide bus unless you do some fancy code writing and link two FlexIO instances together along with two DMA channels to create "one" long bus

  7. #7
    Right - at least initially, the library will only support 16bit interfaces on T4.1 FlexIO3, but I think an interrupt based method can leave the processor free the majority of the time. Despite being noncontiguous on T4.0 and 4.1, the pins on FlexIO2 can support a DMA based 8 bit interface with the right buffering process. Technically FlexIO1 on T4.1 can too (but this requires soldering some wires to the back memory pads which is not convenient).

    Quote Originally Posted by Rezo View Post
    there is no way to support a 16bit wide bus unless you do some fancy code writing and link two FlexIO instances together along with two DMA channels to create "one" long bus
    That would be the dream. I'm not ruling this out! But would have to figure out a way to connect the trigger signals between FlexIO1 and FlexIO2. Maybe XBAR can do this.

  8. #8
    Senior Member
    Join Date
    Oct 2019
    Posts
    239
    Jean Marc’s VGA_t4 links two FlexIO instances via linked DMA channels to push out data to VGA displays
    Might be worth looking into how he does it and giving it a try on a parallel bus.

  9. #9
    Senior Member xxxajk's Avatar
    Join Date
    Nov 2013
    Location
    Buffalo, NY USA
    Posts
    590
    ...interesting idea, however I've not found a use case where DMA can help.
    As of right now, I have a Teensy 3.6 operating in 8-bit parallel mode with an almost complete library. It's good enough to run the typical "graphics_test.ino" but can't quite run Demo Sauce yet... it still seems to have problems with reading.
    That said, here are my results with the current library I'm working on:

    Code:
    ILI9341 Test!
    Display Power Mode: 0x9C
    MADCTL Mode: 0x48
    Pixel Format: 0x5
    Image Format: 0x0
    Self Diagnostic: 0xC0
    Benchmark                Time (microseconds)
    Screen fill              42777
    Text                     3542
    Proportional Text        3250
    Lines                    17082
    Horiz/Vert Lines         3369
    Rectangles (outline)     2164
    Rectangles (filled)      87803
    Circles (filled)         14769
    Circles (outline)        13272
    Triangles (outline)      4272
    Triangles (filled)       29252
    Rounded rects (outline)  4414
    Rounded rects (filled)   95155
    Done!

  10. #10
    Senior Member
    Join Date
    Oct 2019
    Posts
    239
    Quote Originally Posted by xxxajk View Post
    ...interesting idea, however I've not found a use case where DMA can help.
    As of right now, I have a Teensy 3.6 operating in 8-bit parallel mode with an almost complete library. It's good enough to run the typical "graphics_test.ino" but can't quite run Demo Sauce yet... it still seems to have problems with reading.
    That said, here are my results with the current library I'm working on:


    What bus speed is it running on?
    How are you writing the data? directly to the port register?

  11. #11
    Senior Member xxxajk's Avatar
    Join Date
    Nov 2013
    Location
    Buffalo, NY USA
    Posts
    590
    Quote Originally Posted by Rezo View Post
    What bus speed is it running on?
    How are you writing the data? directly to the port register?
    Direct to port writing, and delayNanoseconds for pacing the signals. Full screen blit takes ~7.7ms.
    I'll be sharing the library soon, and I just got DemoSauce going.
    Uploading a video to ewwtwob of DemoSauce and the graphics test. I will post links here when the videos are ready.
    It's based on bits and pieces found on GitHub plus the ILI9341_t3 source.
    I'm also planning a full-on 16-bit for Teensy3.6 as well, which should be easy since I've managed to master 8-bit. :-)
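
    For a sense of scale, the ~7.7ms blit figure can be turned into a bus rate. This is only back-of-the-envelope arithmetic, assuming a 320x240 panel at 16 bits per pixel on an 8-bit bus (one bus write per byte); busStats is a made-up helper name.

```cpp
#include <cassert>
#include <cstdint>

struct BusStats {
    double bytesPerSecond;  // sustained payload rate
    double nsPerWrite;      // average time per bus write, in nanoseconds
};

// busBytes is the bus width in bytes (1 for an 8-bit bus, 2 for a 16-bit bus).
inline BusStats busStats(uint32_t width, uint32_t height, uint32_t bytesPerPixel,
                         uint32_t busBytes, double frameSeconds) {
    assert(busBytes > 0 && frameSeconds > 0.0);
    const double bytes  = static_cast<double>(width) * height * bytesPerPixel;
    const double writes = bytes / busBytes;
    return { bytes / frameSeconds, frameSeconds / writes * 1e9 };
}
```

    With busStats(320, 240, 2, 1, 7.7e-3) this works out to a little under 20 MB/s, or roughly 50 ns per write.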

  12. #12
    Senior Member
    Join Date
    Oct 2019
    Posts
    239
    I have a library written by another forum member that supports 8/16 bit parallel bus for the ILI9488 on a Teensy 4.1
    A full screen (320x480 @ 16-bit color) update took roughly 5ms. He used the same method as you: writing to the port register and using NOPs to delay the WR/RD pin pulse.
    It's fast, but the code is blocking the main loop - for me, that's a bit of an issue as I have several other things happening at the same time:
    1. LVGL drawing and rendering the UI
    2. FlexCAN writing and reading to the CAN BUS as fast as possible
    3. Logging data to an SD card

    Therefore, using DMA to write big chunks of data to the screen will avoid any latency on the other peripherals/apps that are running as well.

    I am busy writing a final driver for the ILI948x (1/6/8) based on FlexIO & DMA and would be happy to try running the DemoSauce test as soon as it's done and compare results (if relevant at all).

  13. #13
    Senior Member xxxajk's Avatar
    Join Date
    Nov 2013
    Location
    Buffalo, NY USA
    Posts
    590
    My library is for teensy 3.[01256]. There's the difference. The library also isn't as efficient as it could be.
    With an under 10ms update rate, there's no reason why you couldn't break the update into smaller chunks, perhaps a couple of lines at a time, via an ISR. Totally doable with my class timer library for Teensy 3's, which can dish out 1usec IRQs, and still no real need to use DMA.
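
    A rough sketch of the chunking idea, with no particular timer library assumed: a small state object that a periodic ISR ticks, sending a few lines per tick. ChunkedUpdate and the sendLines callback are hypothetical names, and in real ISR code the shared state would need volatile or similar care.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Splits a full-screen update into small blocking bursts, one burst per timer tick.
class ChunkedUpdate {
public:
    ChunkedUpdate(uint16_t totalLines, uint16_t linesPerTick)
        : total_(totalLines), step_(linesPerTick) {
        assert(linesPerTick > 0);
    }

    void start() { next_ = 0; active_ = true; }
    bool busy() const { return active_; }

    // Call from a periodic timer ISR. sendLines(first, count) is a user-supplied
    // routine that writes count display lines starting at line first.
    template <typename SendLines>
    void tick(SendLines sendLines) {
        if (!active_) return;
        const uint16_t remaining = static_cast<uint16_t>(total_ - next_);
        const uint16_t count = std::min(step_, remaining);
        sendLines(next_, count);
        next_ = static_cast<uint16_t>(next_ + count);
        if (next_ >= total_) active_ = false;
    }

private:
    uint16_t total_;
    uint16_t step_;
    uint16_t next_ = 0;
    bool active_ = false;
};
```

    The main loop only calls start() and can poll busy(); each tick blocks for just the duration of one chunk.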

    Besides, if you are using DMA, are you sending the entire video frame? or chunks? I'm assuming DMA from PSRAM?
    CAN shouldn't be affected, as that runs on an ISR.
    DMA will be fighting the processor if it has cache misses too and wants to access the same bank. Also, with DMA, you might have a slight timing issue, but then I'm not sure how the writes are timed using FlexIO on the Teensy 4.x. I have both the 4.0 and 4.1, but have not fully explored them yet. If not too convoluted, I'll set up a Teensy 4.1 for testing; I do have 8MB PSRAM already installed too.

  14. #14

  15. #15
    Senior Member xxxajk's Avatar
    Join Date
    Nov 2013
    Location
    Buffalo, NY USA
    Posts
    590
    What I'd love to see is a 4:3 TFT that's 640x480. They seem to be nowhere to be found, because everyone thinks 16:9 is all cool and stuff, bleh.
    Why? well, I'm not rendering fancy graphics, I just want a terminal I can stuff in my pocket :-)

  16. #16
    Senior Member BriComp's Avatar
    Join Date
    Apr 2014
    Location
    Cheltenham, UK
    Posts
    425
    Quote Originally Posted by xxxajk View Post
    What I'd love to see is a 4:3 TFT that's 640x480. They seem to be nowhere to be found, because everyone thinks 16:9 is all cool and stuff, bleh.
    Why? well, I'm not rendering fancy graphics, I just want a terminal I can stuff in my pocket :-)
    Here is one and another one. Looks like 4:3.
    Last edited by BriComp; 09-23-2021 at 11:01 AM.

  17. #17
    Senior Member xxxajk's Avatar
    Join Date
    Nov 2013
    Location
    Buffalo, NY USA
    Posts
    590
    Yes, but no price listed, and anything else I've managed to find costs over 100USD... :-(

  18. #18
    Senior Member BriComp's Avatar
    Join Date
    Apr 2014
    Location
    Cheltenham, UK
    Posts
    425
    I edited the post, second one has price - approx $69.

  19. #19
    Senior Member xxxajk's Avatar
    Join Date
    Nov 2013
    Location
    Buffalo, NY USA
    Posts
    590
    Yeah, but I just got 2 320x240 ILI9341, which has a flash chip, and SPI resistive touch controller on-board for under $12 delivered, from Amazon.
    The date codes suggest that they're new-old-stock, as they're from 2015.
    Flash chip I'm going to pop onto one of my Teensy 4.1 boards. Might be amusing. :-)
    These displays also don't have the unwanted level shifters like on the Adafruit in the videos, and have all pins broken out, so that you could do 18bit interface if you wanted to.

    If the above display was in the $30 range, I'd be interested, and even at that price, the seller would make a good profit.
    I can get HD capacitive touch TFT for about $50.
    In fact I have a bunch around, but unfortunately the pins needed to drive it from the Teensy 4.1 aren't exposed/broken out/available.
    Of course those pins not being available eliminates the possibility of using a camera directly too, but this is what we have with the design.
    I'm guessing part of the reason for the higher costs today is the pandemic, which has caused glass shortages and silicon shortages, on top of the capacitor famine that's been going on for the last 2 years.
    It's pretty much how things are world wide right now, and I can wait.
    Thanks for the links though, I'll keep watching.

  20. #20
    Senior Member
    Join Date
    Oct 2019
    Posts
    239
    Quote Originally Posted by xxxajk View Post
    My library is for teensy 3.[01256]. There's the difference. The library also isn't as efficient as it could be.
    With an under 10ms update rate, there's no reason why you couldn't break the update into smaller chunks, perhaps a couple of lines at a time, via an ISR. Totally doable with my class timer library for Teensy 3's, which can dish out 1usec IRQs, and still no real need to use DMA.

    Besides, if you are using DMA, are you sending the entire video frame? or chunks? I'm assuming DMA from PSRAM?
    CAN shouldn't be affected, as that runs on an ISR.
    DMA will be fighting the processor if it has cache misses too and wants to access the same bank. Also, with DMA, you might have a slight timing issue, but then I'm not sure how the writes are timed using FlexIO on the Teensy 4.x. I have both the 4.0 and 4.1, but have not fully explored them yet. If not too convoluted, I'll set up a Teensy 4.1 for testing; I do have 8MB PSRAM already installed too.
    I'm using a Teensy MicroMod as it has enough contiguous pins on FlexIO 2 to drive the display @ 8 bit wide bus.
    FlexIO is a series of shifters and timers. Configured correctly, you only need to load the shifter buffer(s) with the data; FlexIO shifts the data out to the pins and the timer generates the WR pulse. That is much less complex to set up and manage than writing directly to the port register and using delays.
    I can fit a whole screen buffer (480*320*2 = 307,200 bytes, about 307 kB) into DMAMEM, so with DMA, I can write the whole frame in one shot.
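
    The buffer sizing can be checked at compile time. A minimal sketch, assuming a 16-bit-per-pixel 480x320 frame and the 512 KB of RAM2 on the Teensy 4.x family that DMAMEM places data in; the constant names are made up, and the fallback #define only exists so the snippet also builds off-target.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#ifndef DMAMEM
#define DMAMEM  // on a Teensy 4.x build the core defines this to place data in RAM2
#endif

constexpr size_t kWidth         = 480;
constexpr size_t kHeight        = 320;
constexpr size_t kBytesPerPixel = 2;   // RGB565
constexpr size_t kFrameBytes    = kWidth * kHeight * kBytesPerPixel;
constexpr size_t kRam2Bytes     = 512 * 1024;

static_assert(kFrameBytes == 307200, "480 * 320 * 2 bytes");
static_assert(kFrameBytes <= kRam2Bytes, "frame buffer must fit in RAM2");
static_assert(kFrameBytes % 32 == 0, "whole number of 32-byte blocks");

// 32-byte alignment matches the data cache line size, which keeps cache
// maintenance before a DMA transfer simple.
DMAMEM alignas(32) uint16_t frameBuffer[kWidth * kHeight];
```

    Anything else placed in DMAMEM shares the same 512 KB, so the real headroom is smaller than the static_assert suggests.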

    PSRAM is too slow for such fast writes, I believe, plus the MM does not support the PSRAM chips.
    I tried PSRAM with Kurt's ILI9341_t3n library and it was significantly slower than reading from RAM1 or RAM2.

    I'll let Eric and other SMEs comment on the advantages/disadvantages of FlexIO/DMA over other implementation methods for 8080/6800 parallel communications.

  21. #21
    Senior Member xxxajk's Avatar
    Join Date
    Nov 2013
    Location
    Buffalo, NY USA
    Posts
    590
    Yeah, there lies the difference in end-goals. You are rendering entire frames, I'm not always doing that, thus I only update areas that are changed.
    You could get a huge speed boost if you could mark dirty lines or areas instead, and only blit those. The ILI chips do support blitting in a limited window, so that could be an option to explore.
    No need to update something that's already in sync... Perhaps queue up update streams and DMA those instead, that's how I'd do it.
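
    A minimal sketch of the dirty-area idea, simplified to a single bounding box that grows to cover every marked region (a real implementation might track several rectangles or per-line flags instead). Rect and DirtyRegion are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

struct Rect {
    uint16_t x0, y0, x1, y1;  // inclusive corners
};

// Accumulates changed areas into one bounding box to blit later.
class DirtyRegion {
public:
    void mark(const Rect& r) {
        assert(r.x0 <= r.x1 && r.y0 <= r.y1);
        if (!dirty_) {
            box_ = r;
            dirty_ = true;
            return;
        }
        box_.x0 = std::min(box_.x0, r.x0);
        box_.y0 = std::min(box_.y0, r.y0);
        box_.x1 = std::max(box_.x1, r.x1);
        box_.y1 = std::max(box_.y1, r.y1);
    }

    bool dirty() const { return dirty_; }

    // Returns the area to blit and resets the tracker.
    Rect take() {
        dirty_ = false;
        return box_;
    }

private:
    Rect box_{0, 0, 0, 0};
    bool dirty_ = false;
};
```

    The box returned by take() would then be used as the display's write window, so only that area is sent.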

  22. #22
    Senior Member
    Join Date
    Oct 2019
    Posts
    239
    Quote Originally Posted by xxxajk View Post
    Yeah, there lies the difference in end-goals. You are rendering entire frames, I'm not always doing that, thus I only update areas that are changed.
    You could get a huge speed boost if you could mark dirty lines or areas instead, and only blit those. The ILI chips do support blitting in a limited window, so that could be an option to explore.
    No need to update something that's already in sync... Perhaps queue up update streams and DMA those instead, that's how I'd do it.
    I'm not rendering entire frames, but I can write entire frames if needed.
    In my case LVGL is responsible for rendering only the areas that require updates, so it doesn't redraw the entire screen each time unless a whole screen update is needed.
    My display driver receives the frame data (whole or partial) from LVGL, sets the window address in GRAM based on the area being updated, and writes the data to the relevant pixels. If the data is less than 32 bytes long, it is written in blocking mode; if it is larger, it is written with DMA.
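
    Roughly, the two decisions look like the sketch below. It is not the actual driver code: the helper names are made up, treating exactly 32 bytes as the DMA case is an assumption, and the command values are the standard column address (0x2A), page address (0x2B) and memory write (0x2C) commands these controllers use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class WriteMode { Blocking, Dma };

constexpr size_t kDmaThresholdBytes = 32;

// Small payloads are not worth the DMA setup cost, so they go out blocking.
inline WriteMode chooseMode(size_t bytes) {
    return bytes < kDmaThresholdBytes ? WriteMode::Blocking : WriteMode::Dma;
}

constexpr uint8_t kColumnAddressSet = 0x2A;
constexpr uint8_t kPageAddressSet   = 0x2B;
constexpr uint8_t kMemoryWrite      = 0x2C;

struct WindowCmd {
    uint8_t cmd;
    uint8_t params[4];  // start high byte, start low byte, end high byte, end low byte
};

// Builds one address-window command; call once for columns and once for pages,
// then send kMemoryWrite followed by the pixel data.
inline WindowCmd addressCmd(uint8_t cmd, uint16_t start, uint16_t end) {
    assert(start <= end);
    return { cmd,
             { static_cast<uint8_t>(start >> 8), static_cast<uint8_t>(start & 0xFF),
               static_cast<uint8_t>(end >> 8),   static_cast<uint8_t>(end & 0xFF) } };
}
```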

  23. #23
    Senior Member xxxajk's Avatar
    Join Date
    Nov 2013
    Location
    Buffalo, NY USA
    Posts
    590
    Nice! So, where are these wonderful libraries located? I have the components here to try things.

  24. #24
    Senior Member
    Join Date
    Oct 2019
    Posts
    239
    Still working on it

    But let's focus on the thread's topic: a general library for parallel data transmission on the T4/4.1/MM.
    Driving displays is just one of the use cases.

  25. #25
    Senior Member xxxajk's Avatar
    Join Date
    Nov 2013
    Location
    Buffalo, NY USA
    Posts
    590
    Yes, input from cameras is another use case.
