Ideas on a T4 parallel library using FlexIO

Sounds like 8080 or 6800 style interfaces are common for LCDs like ILI948x. At least to start, I think my parallel library would be a low level library and would not emulate the whole interface, just the clock and data pins. Chip select, write/read enable, etc. would be outside the scope, and could just be controlled by the user before/after writes. Or there could be a wrapper library, I suppose...

Can anyone suggest part numbers to look at for the other applications mentioned - parallel input from cameras, and parallel interfaces with A/Ds and D/As? Are there any use cases that would need a 4-bit interface?

Some thoughts: one of FlexIO's big advantages (over direct GPIO access) is being able to generate clock pulses entirely in hardware, so there is no need to emulate a clock by writing low/high to any pins. And since the FlexIO shifters are buffered up to 32 bytes wide, data transfer by DMA or interrupt is more efficient and can happen simultaneously with the clock and shift process. With the right setup, a highly consistent data rate can be achieved since the clock is driven by a hardware timer. It should also be mentioned that GPIO access on T4.0/4.1/MM is 32 bit only (and fairly slow via DMA), so GPIO access is a bit less convenient than it was for T3.x.
 
I'd like to see a FlexIO library that allowed fast input of parallel data with an external clock source.
 

That can be done. How many inputs do you have in mind and what bus speed?
FlexIO can go quite high; I think it maxes out at 120 MHz.
 
My application (fast ADC) needs 12 bits of parallel input at about 50 MHz. I suppose it doesn't matter if the clock signal is generated by the Teensy or externally. But it needs to be low jitter.

Thanks for working on this.
 
That might be beyond what FlexIO can do. Maybe?

Keep in mind the 120 MHz speed is the internal clock, which it must use to sample any incoming signal. But it can sample on both rising and falling edges, so maybe there is hope of sampling a 50 MHz clock? Check out "Pin Synchronization" starting on page 2894.

But also look on page 2901 for SPI Slave mode, where it says:

Due to synchronization delays, the output valid time for the serial output data is 2.5 FlexIO clock cycles, so the maximum baud rate is divide by 6 of the FlexIO clock frequency.

I would guess trying to receive clocked parallel data would work similarly to clocked serial data.
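To put numbers on that guess, here is a minimal sketch of the arithmetic. It assumes the divide-by-6 limit quoted for SPI slave mode carries over unchanged to externally clocked parallel input, which nobody in this thread has confirmed. The function name and constant are made up for illustration; the 480 MHz case is only there for comparison with an overclocked FlexIO.

```cpp
#include <cstdint>

// Assumption: the "divide by 6" rule quoted for SPI slave mode also limits
// externally clocked parallel capture. This is a guess, not a documented fact.
constexpr std::uint32_t kSyncDivider = 6;

// Highest external clock the FlexIO could follow under that assumption.
constexpr std::uint32_t maxExternalClockHz(std::uint32_t flexioClockHz) {
    return flexioClockHz / kSyncDivider;
}

// 120 MHz FlexIO clock: 20 MHz, well short of a 50 MHz ADC clock.
static_assert(maxExternalClockHz(120'000'000) == 20'000'000, "120 MHz case");
// 480 MHz FlexIO clock: 80 MHz, which would leave some margin.
static_assert(maxExternalClockHz(480'000'000) == 80'000'000, "480 MHz case");
```

If the rule does apply, a 50 MHz input clock would need a FlexIO clock of at least 300 MHz.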

You might also not be able to get 12 contiguous pins.

Or maybe there's a way to work around these limitations?
 
I've found that the FlexIO base clock can be overclocked to 480 MHz, so it may be possible. For such a high speed application I would want to avoid any interrupt driven bit reformatting. FlexIO2 on MicroMod should work (it has 13 contiguous pins and works with DMA). FlexIO3 on T4.1 might be an option too, but would have to rely on a fast interrupt to store the received data.
 
480 MHz works only at 600 MHz CPU speed.
At any lower CPU speed, FlexIO starts to misbehave ;) Once FlexIO clock speed is set to 240 MHz, the min baud rate divider that can be used is (2-2/2), no? So that would be 240 MHz.
Problem is the FlexIO chapter doesn’t state the max clock speed or bus speed.
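For reference, the divider arithmetic can be sketched as below. This is my reading of the timer's dual 8-bit counters baud mode in the reference manual (divider = (TIMCMP[7:0] + 1) * 2, bits per word = (TIMCMP[15:8] + 1) / 2), so treat the formulas as an assumption to check against the manual. The helper names are hypothetical and not part of any library.

```cpp
#include <cstdint>

// Build a TIMCMP value for dual 8-bit counters baud mode.
// Assumed encoding (verify against the reference manual):
//   TIMCMP[7:0]  = divider / 2 - 1      (divider must be even, minimum 2)
//   TIMCMP[15:8] = shiftsPerWord * 2 - 1
constexpr std::uint32_t timcmpFor(std::uint32_t divider, std::uint32_t shiftsPerWord) {
    return (((shiftsPerWord * 2) - 1) << 8) | ((divider / 2) - 1);
}

// Shift clock that results from a given FlexIO clock and TIMCMP value.
constexpr std::uint32_t shiftClockHz(std::uint32_t flexioClockHz, std::uint32_t timcmp) {
    return flexioClockHz / (((timcmp & 0xFFu) + 1) * 2);
}

// Smallest divider (2) with 8 shifts per word.
static_assert(timcmpFor(2, 8) == 0x0F00, "minimum divider encoding");
// With a 480 MHz FlexIO clock the fastest shift clock would be 240 MHz.
static_assert(shiftClockHz(480'000'000, timcmpFor(2, 8)) == 240'000'000, "480 MHz / 2");
```

Under these assumptions the smallest usable divider is 2, so the shift clock can never exceed half the FlexIO clock.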
 
Even better than referencing a post/thread (where you potentially have to dig all of the information+corrections out of several posts), @luni has placed a compendium of the info on using FlexIO into the WIKI <here>.

Mark J Culross
KD5RXT
 
Another Use Case - Classic CPU Integration

Hi, all. I have another use case for you:

https://forum.pjrc.com/threads/68254-65c02-amp-65c816-Using-Teensy-4-1-as-a-Co-processor-on-8-bit-bus-FlexIO-DMA

Emulators aside, it would be excellent to be able to connect classic hardware to the Teensy 4/4.1 and combine the functioning of the classic hardware, sitting on the same bus, with all the advanced capabilities of the Teensy, acting as a kind of co-processor. The 65c816 even has a co-processor instruction (COP - $02) that has its own vector and can pass a "signature byte" to a co-processor. It would be a cool combination.

https://www.wdc65xx.com/wdc/documentation/w65c816s.pdf

J

PS I can't seem to find the 68K parallel bus example code ref'd in NXP's datasheet.
 
My application (fast ADC) needs 12 bits of parallel input at about 50 MHz. I suppose it doesn't matter if the clock signal is generated by the Teensy or externally. But it needs to be low jitter.

Thanks for working on this.

The input clock is less of a problem than actually getting the data out. You can overclock FlexIO (though when it's too high compared to other frequencies, it starts to misbehave, like Razo mentions: it sometimes doesn't generate a DMA request or similar, or more likely, the signal that generates that request is too short for the DMA controller to pick up), but you need to get the data somewhere. If it fits in the 1MB of local memory, you're probably good, but if there's more of it and you need to copy it out to EXTMEM, that can become the bottleneck (me trying to deal with it here: https://forum.pjrc.com/threads/68238-Teensy-4-1-DMA-priorities-and-preemption).
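A rough sketch of why the storage side matters, using the numbers from this thread. It assumes each 12-bit sample is stored as a 16-bit word and that the entire 1 MB is free for capture, which is optimistic since code and other data live there too. The function names are made up.

```cpp
#include <cstdint>

// Sustained data rate when every sample occupies bytesPerSample bytes.
constexpr std::uint64_t bytesPerSecond(std::uint64_t sampleRateHz, std::uint64_t bytesPerSample) {
    return sampleRateHz * bytesPerSample;
}

// Time (in whole microseconds) until a buffer of bufferBytes is full.
constexpr std::uint64_t fillTimeMicros(std::uint64_t bufferBytes, std::uint64_t rateBytesPerSec) {
    return bufferBytes * 1'000'000 / rateBytesPerSec;
}

constexpr std::uint64_t kLocalBytes = 1024 * 1024;  // assumption: all 1 MB usable

// 50 MHz ADC, 2 bytes per sample: 100 MB/s, buffer full after about 10.5 ms.
static_assert(bytesPerSecond(50'000'000, 2) == 100'000'000, "ADC rate");
static_assert(fillTimeMicros(kLocalBytes, 100'000'000) == 10'485, "ADC fill time");
// 20 MHz pixel clock, 2 bytes per sample: 40 MB/s, full after about 26 ms.
static_assert(fillTimeMicros(kLocalBytes, bytesPerSecond(20'000'000, 2)) == 26'214, "camera fill time");
```

So anything longer than a few tens of milliseconds of capture has to be moved out while the acquisition is still running.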

I have a 12 bit camera sensor: the 18-20MHz pixel clock is more or less the most I can get from it while copying the data out to EXTMEM (and even that needs some extra HSYNC cycles to give it a bit more time to do the copy between the lines). 12 parallel lines are fine, you can do 2x4 parallel shifters + 4x single bit; you need to do some bit fiddling after receiving the data (the example here: https://forum.pjrc.com/threads/66201-Teensy-4-1-How-to-start-using-FlexIO actually does 12 bit input).
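The "bit fiddling" could look roughly like the sketch below. The buffer layout is purely an assumption for illustration (sample i of a group of 8 sits in nibble i of each 4-bit shifter word and in bit i of each single-bit lane); the real layout depends on how the shifters and pins are configured, so check the linked example before relying on it. All names here are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Rebuild eight 12-bit samples from two 4-bit-wide shifter words and four
// single-bit lanes. Assumed layout (hypothetical):
//   lowNibbles : bits 0..3  of sample i at bits [4i+3 : 4i]
//   midNibbles : bits 4..7  of sample i at bits [4i+3 : 4i]
//   topBits[b] : bit  8 + b of sample i at bit i
std::array<std::uint16_t, 8> assemble12(std::uint32_t lowNibbles,
                                        std::uint32_t midNibbles,
                                        const std::array<std::uint8_t, 4>& topBits) {
    std::array<std::uint16_t, 8> out{};
    for (unsigned i = 0; i < 8; ++i) {
        std::uint16_t s = static_cast<std::uint16_t>((lowNibbles >> (4 * i)) & 0xFu);
        s |= static_cast<std::uint16_t>(((midNibbles >> (4 * i)) & 0xFu) << 4);
        for (unsigned b = 0; b < 4; ++b) {
            s |= static_cast<std::uint16_t>(((topBits[b] >> i) & 1u) << (8 + b));
        }
        out[i] = s;
    }
    return out;
}

// Small usage check with made-up input words.
void assemble12Demo() {
    const auto s = assemble12(0x76543210u, 0xFEDCBA98u, {0x01, 0x02, 0x04, 0x08});
    assert(s[0] == 0x180);  // low nibble 0, mid nibble 8, bit 8 set
    assert(s[3] == 0x8B3);  // low nibble 3, mid nibble B, bit 11 set
    assert(s[7] == 0x0F7);  // low nibble 7, mid nibble F, no top bits
}
```

Doing this per group of 8 samples on the CPU costs cycles, which is one reason it is better done after the capture than inside the readout path.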
 
@miciwan perhaps a double-buffered DMA or interleave could work for you?

Idea 1: Single stage FIFO
Loop DMA to a "fast dma buffer" and then another DMA relays that to the EXTMEM, which can use bursts, and that way you could possibly not lose any data. The idea is that while EXTMEM is working, the buffer can keep recording.

Idea 2: Double stage FIFO
Loop DMA to fast RAM, and loop a DMA copy to slow RAM as fast as possible. Meanwhile another DMA follows as fast as possible from the slower RAM bank. This is a little more complicated than Idea 1, but would allow larger transfers.

Idea 3: Interleave memory. (Note: this is what I would try first!)
DMA every other sample to local RAM. Either or both RAM banks can be used, as the speed would be 50%. Meanwhile, DMA every other sample to EXTMEM. After the transfer is complete, coalesce to EXTMEM.

Idea 4: rate limit DMA.
Basically match the DMA rate to EXTMEM. That's how it is done on the ESP32.
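For Idea 3, the final coalescing step can be sketched on the host like this. It only shows the merge of the two half-rate buffers after the capture has finished; how the two DMA streams are set up is not shown, and the function name is made up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Merge two half-rate capture buffers back into a single stream.
// 'even' holds samples 0, 2, 4, ... and 'odd' holds samples 1, 3, 5, ...
// 'even' may be one element longer if the capture had an odd length.
std::vector<std::uint16_t> coalesce(const std::vector<std::uint16_t>& even,
                                    const std::vector<std::uint16_t>& odd) {
    assert(even.size() == odd.size() || even.size() == odd.size() + 1);
    std::vector<std::uint16_t> out;
    out.reserve(even.size() + odd.size());
    for (std::size_t i = 0; i < even.size(); ++i) {
        out.push_back(even[i]);
        if (i < odd.size()) {
            out.push_back(odd[i]);
        }
    }
    return out;
}
```

For example, merging {0, 2, 4} with {1, 3} gives {0, 1, 2, 3, 4}. On the Teensy the output would live in EXTMEM and fixed arrays would replace std::vector.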
 
@xxxajk so ideas 1-3 are pretty much the only way to make it work, even at these lower speeds (up to 20).

There are two issues:
1) The main DMA is just a single controller. It has multiple channels, but they do not execute at the same time; there's arbitration, and on top of that it looks like it does get "stuck" when servicing a transfer. Even when you set up DMA chains, at some point you get into situations where the copy-out (to EXTMEM, DMAMEM, wherever) starts to interfere with the DMA that pulls the data out from FlexIO registers. And even if this one has the highest priority (it needs to have it, the external clock doesn't wait ;-) if the DMA controller gets to servicing one of the other transfers, and there's some hitch there (say it needs to wait on the bus arbitration for memory), it blocks the main readout and you start losing data.
2) If you try to touch the data in EXTMEM in any way (to untangle the bits, to do demosaicing of the Bayer pattern, whatever), you introduce more traffic on the bus, and now it's not only the readout DMA that hitches, but the copy-out DMA starts to take way more time, and you're not able to copy out the data from the aux buffers to EXTMEM.

The thing that seems to work best is just having the DMA work from FlexIO to buffers in local mem (tightly coupled) and copying them out by hand, with regular memcpy, in the interrupt handler on DMA completion (plus clocking EXTMEM at full speed, rather than 88MHz that it uses by default); sucks, as it takes extra cycles, so any processing in the "main thread" is slower by necessity, but the readout doesn't lose anything.
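In outline, that scheme is a ping-pong pair of local buffers with a plain memcpy in the completion handler. The sketch below simulates it on the host: the arrays stand in for tightly coupled RAM and EXTMEM, the sizes are tiny, and onDmaComplete is a hypothetical handler name. The real DMA and interrupt setup is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kLineWords = 4;  // tiny sizes, for illustration only
constexpr std::size_t kLines = 3;

std::uint16_t localBuf[2][kLineWords];      // stand-in for tightly coupled RAM
std::uint16_t bigBuf[kLines * kLineWords];  // stand-in for EXTMEM
std::size_t linesStored = 0;

// Hypothetical DMA-complete handler. 'half' is the buffer the DMA has just
// finished filling; the DMA is assumed to be filling the other one already.
void onDmaComplete(unsigned half) {
    assert(half < 2 && linesStored < kLines);
    std::memcpy(&bigBuf[linesStored * kLineWords], localBuf[half], sizeof localBuf[half]);
    ++linesStored;
}
```

The copy has to finish before the DMA wraps around to the same half again, which is where the extra HSYNC time helps.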

One thing that might potentially help (though I haven't tried that yet) is messing with the read/write priority of the core/DMA ports in the SIM_M7 bus fabric. By default CPU gets the higher priority, but it can be flipped around. Technically, this way, DMA should be faster, even when CPU is doing EXTMEM processing (still doesn't really solve problem 1)
 
@miciwan Think about how the classic Hardware Serial (and actually, USB does it too) buffers in order to handle bursts.

Since there's only one DMA controller, you just use the CPU as the "second DMA controller". Yes, it ties up the CPU, but the idea is to have a buffer to store while other things are happening.
Idea 3 only needs one controller.
Don't forget that any other element within the SoC that uses DMA could cause a loss (e.g. USB).
This is not because it is using the DMA controller, but because it has control of the memory at that point, and wants to access the same memory region.
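The Hardware Serial style of buffering boils down to a ring buffer between the producer (the capture side) and the consumer (the CPU copying to slower memory). A minimal single-threaded sketch follows; the class name is made up, and real interrupt-driven use would also need volatile or atomic indices, which are left out here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Fixed-size ring buffer; N must be a power of two so the index can be masked.
template <std::size_t N>
class RingBuffer {
    static_assert(N > 0 && (N & (N - 1)) == 0, "N must be a power of two");

public:
    // Returns false when full, so the caller can count the overrun.
    bool push(std::uint16_t v) {
        if (head_ - tail_ == N) return false;
        data_[head_++ & (N - 1)] = v;
        return true;
    }

    // Returns false when there is nothing to read.
    bool pop(std::uint16_t& v) {
        if (head_ == tail_) return false;
        v = data_[tail_++ & (N - 1)];
        return true;
    }

    std::size_t size() const { return head_ - tail_; }

private:
    std::uint16_t data_[N] = {};
    std::size_t head_ = 0;  // total items ever pushed
    std::size_t tail_ = 0;  // total items ever popped
};

// Small usage check: a burst fills the buffer, then the consumer catches up.
void ringBufferDemo() {
    RingBuffer<4> rb;
    for (std::uint16_t v = 1; v <= 4; ++v) assert(rb.push(v));
    assert(!rb.push(5));  // full: the burst exceeded the buffer
    std::uint16_t out = 0;
    assert(rb.pop(out) && out == 1);
    assert(rb.push(5));   // room again after one pop
}
```

The buffer only has to be deep enough to cover the longest stall of the slow side; it does not help if the average rate into EXTMEM is too low.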

I've played with memory arbitration on the Teensy 3.x, and if you aren't careful USB ends up having problems. :-/ In those cases, I use the "No USB" selection, and just use hardware serial (pins 0 and 1) to an adapter for debug i/o, or CAN BUS to one of my own contraptions that acts like Serial, but on CAN. Of course with No USB, you have to press the button to program, or simulate it.
 
@xxxajk and @miciwan : so, what does this look like in code? How would one actually implement these ideas? Using a simple common example, say a 65c02, what would the code look like to capture 16-bits of address (always read), a clock signal from an external oscillator, and a /RW to tristate 8-bits of data? This is really quite similar to the 68K Bus example described in the documentation, yet that example is nowhere to be found (I've done an exhaustive search of the NXP site).
 
@jonathan322 I might have one of NXP's application note software packs that has the 6800 interface set up in an LCD driver demo.
PM me your email and I'll share it with you if I find it.
 
@jonathan322 Don't forget that the 65xx/85xx CPU does RMW for writes, i.e. on the bus it does Read/Modify/Write, and this is in sync with phi 2. You need to account for that.
*Cough!* Yes, 65xx were the first CPUs that I did hardware with, and I've done ISA 8088 <-> 6502 ISA bus cards.
devfin.jpg
 
Note there's no ROM on it... The 6502 is halted, code uploaded, and unleashed. There's a mailbox bit for ISA to basically say "freeze me, I got something for you". The PC then stops the 6502, and has access to the 64k of RAM. The PC does whatever, and then resumes the 6502. The other way is true too. It causes an IRQ on the 6502, which causes it to use the same bit to say "ok to freeze me"... etc... can be programmed to do basically whatever I want.
I used this card at one point to simulate hard disks and printer for a C128 in CP/M mode :)
 

@xxxajk , Funny, I was thinking of adding CP/M compatibility to one of my designs by adding a Z80 connected via a VIA 65c22 - two, actually, port-to-port.

Are you a member here? http://forum.6502.org/
 
Could you use it to sample multiple SPI data lines? I know many simultaneous-sampling ADCs have one output line per input, all clocked out together by a shared clock line.
It's meant for FPGAs, but it would cut the time used to sample multiple channels.
 