TFT Display: SPI or 8 bit parallel interface?

Status
Not open for further replies.
Here is the link to the graph, posted somewhere in here, of current draw in milliamps while running a benchmark: manitou48/teensy4/blob/master/coremarka.png

The T_4.0 has the memory as indicated - the only QSPI like ports are the SDIO access to the SD card. The T_4.1 - coming later in 2020 - is planned to have a pair of QSPI connected chip pads on the underside ...
 
Thanks!! Can't wait for the T4.1!

Hopefully not a long wait then :) The T4 is really cool - but adding the SD card and USB Host alone takes a PCB making it taller or as big as a T_3.6 and 2.4". The T_4.1 will be that with even more I/O pins and Ethernet and other stuff Paul posted ...
 
For T_4.1 reference hopefully not making anything up … from "Pins-to-bring-out-on-a-hypothetical-larger-Teensy4":
None of this will happen on Teensy 4.1. It's already far too late to consider these major changes.

There also just isn't PCB real estate for most of this. The extra space we're getting is allocated to a USB host connector and power switch like we have on Teensy 3.6, and the micro SD socket, and the ethernet PHY. The design has no 1.8V regulator at all.

Teensy 4.1 will bring out 4 signals from FlexSPI2 to a pair of SOIC pads, and 4 signals from FlexSPI1 go to the flash chip (those signals really aren't accessible), but neither of the two FlexSPI controllers will have all 8 of its signals routed. All the BGA escape routing has to happen on only 2 sides of the chip. The 42 I/O signals, 6 flash, 6 SD card, 7 QSPI for 2 chips and 12 ethernet PHY, 4 USB, and several misc signals have completely used up all the escape routing space. The PCB is already 6 layers with 5/5 mil spacing and 8 mil drill, which pushes right up to the limits before driving the PCB cost up much higher.

Even if we could get more signals out and have PCB real estate for pads or a connector, adding an octal interface flash hardly seems to make much practical sense. Faster flash is really only valuable if it's the main program storage memory, supported by the bootloader and default linker script. Sure, in theory someone could edit the build to compile software for the different memory range, and then go through a lot of trouble to get their data written into that memory. But in practice, pretty much nobody will go to that sort of effort. Teensy is all about informal making and rapid prototyping. An external flash chip that's hard to actually use would add very little practical value. It really only makes sense as the primary memory.

Octal flash might make more sense for the 1170 chip later this year. Even with only 4 bit interface, having read-while-write capability would be nice. We currently emulate EEPROM by stalling everything while the flash chip is busy writing.
 
Yep sometimes some of us don't spell very well. :0
I don't know what my affliction is called, but me fail English often in the "tone" or subtext, and spelling mistakes makes it hard for me to read/understand.

When I encounter mistakes like then/than, to/too, which/witch, their/they're/there, etc., I need to stop and reconsider the entire sentence. It is very jarring. They really do make things more difficult for me.
(I also use very strict ad blocking, as I find ads extremely distracting/jarring also.)

This library, when doing DMA updates of the screen, uses smaller local buffers and converts the frame buffer contents into the actual color data being sent out. That is because the frame buffer does not store the data in the format that the display needs (3 bytes per pixel), so I need to do something. So far I don't know of any direct way of using DMA where, for example, you give a source buffer as the SOURCE and the SPI output register as the destination, and have some intermediary use the source as a palette index and send the translated data to the destination... So I did it myself manually. As a side effect, it also took care of the issue of consistency of the data (actual memory versus cache)...
Yes, precisely. I too have thought about those issues.

The best I could come up with is basically a serial-to-parallel expander with SRAM-based LUT; say, a GS84032CGT-150I 256k×32 used in a 65536×17/19/25 configuration, where two 8-bit serial to parallel converters provide the address (one fast, one slower), and the output is 16-bit/18-bit color plus data/command line. You also need a tristate latch for the output lines for setting the palette entries, so the number of components goes quickly through the roof; not really worth it.

As of now, I'm looking at Buydisplay 2.8" IPS panels (because of the good documentation), and doing a carrier board with perhaps a PIC32MZ1024EFG064-I/PT as a blitter/GRAM unit, slaved to a Teensy 4.0 via UART and/or SPI. Instead of 2D geometric primitives and fonts, I'd like to do tile graphics and sprites instead, with proper opacity and blending support; maybe a custom background color per row or column. I could also maybe try to port some of my early nineties pixel tricks for fun.

I'd love to use Teensy instead (I don't really like PICs), but BGA is outside my skill range; even TQFP will need some practicing first.. and such a blitter will need a lot of RAM.
 
If doing full updates/big updates, calling the function will take less time for SPI as no waiting/digitalWrites stuff is needed, and just fast(RAM2 speeds) buffer writes, right?
So if the display smoothness and tearing is not the major concern (the update of the lcd) but time, then SPI with DMA is better?
 
Sorry, not really sure what you are asking...

Fastrun code - Currently on T4, unless you specifically tell it otherwise all of the code is copied down to ITCM (fast run)... There are ways to tell it that some function(s) are not fastrun, but unless you are running into issues like your program is too large, probably not an issue.

Programs that run on the Teensy are typically built from your sketch using the Arduino IDE: all of your code, including the portions of the core that are needed plus whatever libraries you use, is compiled by a GCC compiler, and the binary is then downloaded to your Teensy (into the flash memory), which runs it each time the Teensy is powered up and/or reset. There are no provisions for running code stored on an SD card, or somehow modified locally on your Teensy. Note: you don't necessarily have to use the Arduino IDE; there are ways to build using makefiles, some of us have things set up such that we can build directly within Sublime Text, and others use Eclipse or some form of Visual...

And there are some alternatives that I have not tried like a Python setup: https://forum.pjrc.com/threads/59040-CircuitPython-on-Teensy-4!


And yes there is no quad SPI available on T4 as I don't see any of them that have the logical CS2 and CS3 defined...
 
The main difference between SPI and parallel transfer is the maximum display frame rate one can achieve. SPI is certainly simpler: fewer wires, easier to get going, libraries already exist. Simpler is often better, because of fewer things that can go wrong.

If you have a 320×240 or 320×480 display with a full 16-bit framebuffer (153,600 or 307,200 bytes), in DMAMEM on Teensy 4.0, you can use DMA to send it via SPI to the display with very little work and very little to no interruption of "main program" tasks; very much a "fire and forget" thing.

With SPI, you can also use one or two 8-bit framebuffers (76,800 or 153,600 bytes), and have a simple and fast interrupt routine translate the 8-bit framebuffer data into 16-bit color, to a DMA buffer. This interrupt takes a bit of processor time, but the transfer itself uses DMA. For example, the ILI9488_t3 library supports this right now.
The reason you might wish to use two framebuffers, is that this allows "drawing" on one, while the other is being transferred to the display.
There is enough DMAMEM on Teensy 4.0 for two 320×240 16-bit framebuffers, but not for two 320×480 16-bit framebuffers. (I do not know if it is possible to overcome that by placing the framebuffers in different 512k banks, and somehow have the DMA work right, but I don't think it is possible. FrankB, KurtE, or Paul (Stoffregen) might know for sure.)
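
To make the palette idea concrete, here is a minimal sketch of the translation step: 8-bit framebuffer entries are looked up in a 256-entry palette and written as 16-bit pixels into the buffer that DMA would then send. The names rgb565() and expandPalette() are made up for illustration and are not taken from any library; a real driver would run this a chunk at a time from its interrupt routine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Pack 8-bit R, G, B into one RGB565 pixel (5 bits red, 6 green, 5 blue).
constexpr uint16_t rgb565(uint8_t r, uint8_t g, uint8_t b) {
    return static_cast<uint16_t>(((r & 0xF8) << 8) | ((g & 0xFC) << 3) | (b >> 3));
}

// Hypothetical helper: translate `count` palette indices from the 8-bit
// framebuffer into 16-bit pixels in the buffer a DMA transfer would send.
// `palette` must have 256 entries, one per possible index value.
inline void expandPalette(const uint8_t *indices, const uint16_t *palette,
                          uint16_t *dmaBuffer, size_t count) {
    assert(indices && palette && dmaBuffer);
    for (size_t i = 0; i < count; ++i) {
        dmaBuffer[i] = palette[indices[i]];
    }
}
```

Only the small DMA buffer holds 16-bit data; the full-screen framebuffer stays at one byte per pixel, which is where the memory saving comes from.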

With parallel transfers, every pixel sent to the display needs work from the processor to either A) set the output pins and strobe the WR pin, or B) convert the data from framebuffer format to DMA'ble buffer format (and only the DMA buffer needs to be in DMAMEM).
Option A is straightforward, but uses the "main program" to do all the work. If the microcontroller just displays images and does not have other "work" to do, this is a perfectly acceptable approach. (Do remember that the display only needs to be updated when the image changes, and you can just update smaller rectangular areas. You do not need to keep "refreshing" the display.)

Teensy 4.0 only supports option B (DMA) for 8-bit and 9-bit parallel transfers, using the specific pins I've already mentioned, as there is no GPIO bank with 16 exposed pins. Processor work is similar to using palettes with SPI, but the data is transferred about eight times faster. Setting this up is a bit complicated, though, and I do not think any existing library supports this.

The reason I am so interested in 9-bit parallel transfers and DMA is that I like tinkering. It is not better, it just isn't -- as far as I know -- already supported by any library. All it does is allow 18-bit color for palette'd framebuffers and display updates faster than the display refresh rates. It is interesting technical territory for me.

Should you use SPI or parallel? As I said, SPI works right now, and if you "overclock" the SPI transfers (as in test which SPI clock rates work for your display, until you have it fast enough), and use short or shielded wires, the updates will be fast enough for just about any use case. If you have a use case, say a calculator or whatnot, and you want to do that instead of playing with the low-level display update stuff like I do, use SPI.
 
OK. Get it. Short, shielded wires. Just like PCI Express.
SPI sounds really great. Also, the keypad button matrix requires a fair amount of GPIO, so fewer pins is better. (Should I use an I2C I/O expander, like the one with 40 I/O pins, to scan through the rows of the keypad? Scanning will take some time and I2C is pretty slow, so if I am correct, I/O expanders are not good in this situation.)
 
For a keypad, I recommend the diode approach.

If you have N×M keys, you need N+M I/O pins: N outputs and M inputs. You can use an expander, shift register, counter etc. for the N outputs, if you need to reduce the pin count.

Each button is wired in series with a diode, connecting one output to one input. Only one output is high at any time (and this is why you can use a decoder/multiplexer, using just K output pins, for 2^K key rows; for example, a dirt cheap 74HC238 (74VHC238FT for example) can be used to control 8 output lines, using just 3 output pins on a Teensy). Inputs tell which buttons on that row are pressed and which are not. The reason for the diode per button is that it allows each button to be detected individually.

A common choice is to use N=8. If you also were to use a 74HC238, you only need 3 output pins and M input pins for 8M buttons, all individually and separately detectable.
For N=16, you can use 74HC4154 family; then you need 4 output pins and M input pins for 16M individual buttons.
These only need something like 20 nanoseconds (a dozen cycles at 600 MHz or so) for the output to switch (whenever the row changes).
You can use any (reasonably fast) decoder where only one output is high at the same time, really; lots of choices.
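
As a rough sketch of the scan loop this implies: select one row through the decoder, read all column inputs, move to the next row. The two function-pointer types are hypothetical hooks standing in for the actual pin access (on a Teensy they would wrap digitalWriteFast() and digitalReadFast() on whichever pins you wired), and the sketch assumes at most 8 columns so one byte holds a row.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical hardware hooks, not part of any library.
using SelectRowFn   = void (*)(uint8_t row);  // put `row` on the decoder address pins
using ReadColumnsFn = uint8_t (*)();          // bit m set = column input m reads high

// Scan `rows` rows of up to 8 columns each.
// After the call, bit m of pressed[r] is set if the button at (r, m) is down.
inline void scanMatrix(uint8_t rows, SelectRowFn selectRow,
                       ReadColumnsFn readColumns, uint8_t *pressed) {
    assert(selectRow && readColumns && pressed);
    for (uint8_t r = 0; r < rows; ++r) {
        selectRow(r);   // the decoder drives exactly one row high
        // On real hardware, give the decoder output time to settle here.
        pressed[r] = readColumns();
    }
}
```

With a 3-to-8 decoder, selectRow() only has to write three pins, which is the whole point of the arrangement.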

For software debouncing, I like an approach where button change is noted immediately, but further changes are ignored for 20ms-50ms. If you check each button at most 1000 times a second (at least a millisecond between checks), a single byte per button suffices for debounce and state. This way button press detection is immediate, not delayed, and you can detect both transitions (pressed, released) as well as states (down, up).
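
One possible way to fit this into a single byte per button, as a sketch only: the top bit holds the debounced state and the low seven bits hold the remaining lockout time in milliseconds. The bit layout, the 30 ms lockout, and the names below are my own choices for illustration; the scheme assumes the function is called once per millisecond for each button.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint8_t kStateBit  = 0x80;  // 1 = button is considered down
constexpr uint8_t kTimerMask = 0x7F;  // remaining lockout, in 1 ms ticks
constexpr uint8_t kLockoutMs = 30;    // inside the 20-50 ms range mentioned above

enum class Edge : uint8_t { None, Pressed, Released };

// Call once per millisecond per button; `rawDown` is the raw scan result.
// A change is reported immediately, then further changes are ignored
// until the lockout timer has run down to zero.
inline Edge debounceUpdate(uint8_t &b, bool rawDown) {
    static_assert(kLockoutMs <= kTimerMask, "lockout must fit in 7 bits");
    assert((b & kTimerMask) <= kLockoutMs);
    if (b & kTimerMask) {   // locked out: count down and ignore the input
        --b;                // timer is nonzero, so the state bit is untouched
        return Edge::None;
    }
    const bool wasDown = (b & kStateBit) != 0;
    if (rawDown == wasDown) return Edge::None;
    b = static_cast<uint8_t>((rawDown ? kStateBit : 0) | kLockoutMs);
    return rawDown ? Edge::Pressed : Edge::Released;
}

inline bool isDown(uint8_t b) { return (b & kStateBit) != 0; }
```

The return value gives the transitions (pressed, released) and isDown() gives the state, so both are available from the same byte.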

I have designed two carrier boards for Teensy LC, for 4×8=32 buttons and 9 analog potentiometers. This one using through-hole components, and this one using SMD (SOT23-3 common cathode, like BAS70-05 Schottky) diode pairs; both intended for use with wires to the buttons and potentiometers, for example via 1×2 (buttons) and 1×3 (pots) pin headers.
Each row output has a space for a 10k resistor to protect the Teensy LC, in case the software malfunctions; they just limit the current to about 0.3 mA. I recommend putting such resistors on the outputs. (If using a decoder, both on the Teensy outputs and the decoder outputs.)

Don't forget the pads on the bottom of the Teensy 4.0; you have another ten pins, 24-33, you can use. You can solder female or male pin headers here. (You could use a 10-wire flat cable, and solder each wire directly, but pulling on the cable may rip off the pads, so I don't recommend that.) With a 74HC4154 you can support 16×6=96 buttons with these alone; a full keyboard.

If you like to use standard 12mm×12mm or smaller tactile buttons in tight grouping, consider rotating every second one 90 degrees, in a checkerboard pattern. Each button has four leads; a pair on one side, and another pair on the opposite side. Rotating them in a checkerboard pattern gives more room in routing the lines between the button legs: the legs of one button are never next to the legs of another this way. The other two legs are not connected, just use disconnected pads. (The pad is actually a plated through hole, so soldering all four legs does help keep the button on the board; it's just that two of the pads per button are not connected to anything else.) I also recommend using diagonal pins on each button, so you don't need to remember which way is always connected, and which way is only connected when the button is pressed. The buttons sit on the board, so it is easiest to use SMD (Schottky) diodes on the other side of the board; and use that side for the columns (input traces), and the button side of the board for the rows (output traces).
 
Could you explain a bit more how to use two frame buffer thingies? So my display is definitely 320*480, but what I can give up is bpp, if there is an 8 bit color option, that will work right?
What are the benefits of using two frame buffers? Will the refresh be instant and unnoticeable (but will have one frame latency?)
If I don't need two frame buffers, may I use 16 bit color, and store it that way and not use the palette method? Or maybe 8 bit color? It looks garbage tho. 8 bit no palette color with two frames? 16 bit no palette color with single frame?
 
Consider the case where sending/writing the frame buffer to the display takes t1 milliseconds, and it takes the microcontroller t2 milliseconds to redraw everything in the framebuffer.

If you use a single framebuffer, then the refresh rate is 1000/(t1+t2); i.e. during each screen update cycle, you first draw everything, then send it to the display.

If you use two framebuffers, with (mostly) DMA doing the sending/writing to the display, then the refresh rate is 1000/max(t1,t2). This is because the screen update cycles overlap (and t1 and t2 can be done at the same time), and the update cycle length is determined by whichever takes longer.

If it takes td for the display to refresh (say, 14.29ms for 70Hz refresh rate), then display updates can only be tear free if t1<td.
When using SPI, t1 is determined by the SPI clock (and the amount of bits transferred per display frame). Although the ILI9488 specifies a minimum serial clock cycle duration as 50ns (i.e., a maximum SPI clock of 20 MHz), many displays work with much higher SPI clock rates.

So, whether you use one or two framebuffers, will not affect whether you get tear-free display; it only means the microcontroller has more time to redraw everything to the framebuffer.
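
The arithmetic above can be written out as a few small functions. This is only a back-of-the-envelope sketch: spiTransferMs() counts pixel bits alone and ignores command overhead and gaps between transfers, and all function names are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>

// One framebuffer: draw (t2), then send (t1), strictly one after the other.
inline double fpsSingleBuffer(double t1Ms, double t2Ms) {
    assert(t1Ms > 0.0 && t2Ms >= 0.0);
    return 1000.0 / (t1Ms + t2Ms);
}

// Two framebuffers: drawing and sending overlap, the slower one sets the pace.
inline double fpsDoubleBuffer(double t1Ms, double t2Ms) {
    assert(t1Ms > 0.0 && t2Ms >= 0.0);
    return 1000.0 / std::max(t1Ms, t2Ms);
}

// Estimated t1 for a full frame over SPI, in milliseconds (pixel data only).
inline double spiTransferMs(unsigned width, unsigned height,
                            unsigned bitsPerPixel, double spiClockHz) {
    assert(spiClockHz > 0.0);
    return 1000.0 * width * height * bitsPerPixel / spiClockHz;
}

// Tear-free updates are only possible when t1 < td.
inline bool canBeTearFree(double t1Ms, double displayRefreshMs) {
    return t1Ms < displayRefreshMs;
}
```

For example, spiTransferMs(320, 480, 16, 20e6) comes to roughly 123 ms, far above the 14.29 ms of a 70 Hz refresh, which is why full-frame SPI updates at the datasheet clock cannot be tear free no matter how many framebuffers are used.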



It is possible to extend the _t3 libraries to support two framebuffers on Teensy 4.0. In addition to _pfbtft, we need _pfbtft2; both are allocated and initialized the same way (and neither needs to be in DMAMEM since translation is necessary, regardless of whether the framebuffer is direct color or palette'd). In updateScreen(), either the two pointers are swapped, or contents of _pfbtft is copied to _pfbtft2; and the rest of the updateScreen() function uses _pfbtft2, as does the fillDMApixelbuffer() function.

Essentially, DMA operations then send data using _pfbtft2, and drawing operations use _pfbtft, and the updateScreen() function copies or swaps the fully drawn _pfbtft to _pfbtft2.
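
A stripped-down sketch of the two variants, using a stand-in struct rather than the real library class. Only the pointer names _pfbtft and _pfbtft2 come from the discussion above; FrameBuffers, presentBySwap() and presentByCopy() are hypothetical, and the sketch assumes no DMA transfer is in flight when either is called.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>

// Hypothetical stand-in for the two pointers; the real class has far more state.
struct FrameBuffers {
    uint16_t *_pfbtft;   // drawing operations write here
    uint16_t *_pfbtft2;  // the DMA side reads from here

    // Swap variant: cheap, but the new drawing buffer holds the frame before
    // last, so every pixel has to be redrawn each frame.
    void presentBySwap() {
        assert(_pfbtft && _pfbtft2 && _pfbtft != _pfbtft2);
        std::swap(_pfbtft, _pfbtft2);
    }

    // Copy variant: costs a full-buffer copy, but the drawing buffer keeps
    // its contents, so incremental drawing still works.
    void presentByCopy(size_t pixelCount) {
        assert(_pfbtft && _pfbtft2 && _pfbtft != _pfbtft2);
        std::memcpy(_pfbtft2, _pfbtft, pixelCount * sizeof(uint16_t));
    }
};
```

Which variant is appropriate depends on whether the application redraws everything each frame or only touches parts of it.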
 
A couple of different thoughts here.. Sorry if some of these are again probably a little off topic.

One of the great things about open source software is we are all free to try out different things and see if you can get something to work out well for your application and then if appropriate see if the owner of the library wishes to incorporate your changes. A lot of the things in these libraries are just that.

The DMA with the ILI9488_t3 code was sort of a special case, to get around the issue that the frame buffer memory was not in the same format as is needed to send to the display. For other displays on the T4, I may or may not do the copy of memory. In the non-continuous update mode, I may simply tell the DMA memory to flush out and then do a simple DMA operation, which maybe only interrupts when the full frame is done.

The benefits versus overhead of one versus two frame buffers are probably very specific to the application at hand. For example, Frank's C64 type code simply has one frame buffer with continuous updates, and the code that fills in the next frame simply understands where in memory DMA is currently outputting, and fills in the data for the new frame, making sure that the fill code does not overtake the output location...

Unless your double buffer code is such that each frame fully changes and sets all of the pixels in the new frame, you often have the issue/overhead that you may need to copy all of the data from one buffer to the other before you start your next round of updates... Again, more overhead...

We did add a hack in the ILI9488_t3 (I think it was that one) that allowed you to set the frame buffer to a different point in memory, and the DMA code would not look at the new location until the current frame finishes. This allowed for a quick and dirty setup for double buffering.

All of the above usefulness and frames per second are interesting things, but in many cases you may be better off looking at what your application is actually doing, and you may find things that improve the speed a lot more than simply how many frames per second... That is, suppose using 8 bit parallel speeds up your output of a frame by, let's say, 3-5 times compared to SPI; that is a great speed up and maybe somebody should try it. But suppose your application was to output a tachometer to the screen, and suppose you found that by doing a couple of simple changes, which simply logically combine the two bounding rectangles of where the needle was and where it is now, and only output that region, you on average only need to update maybe 5% or 10% of the display. Which of these two changes will have a bigger impact on your application? Obviously doing both will be great, but then the questions come up: just how fast do you need it, and how much time do you wish to devote to making it faster?
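
The bounding-rectangle idea is small enough to sketch. Rect, unionRect() and coveragePercent() are made-up names for illustration, not part of any display library; the result of unionRect() is the only region that would need to be sent to the display when the needle moves.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

struct Rect { int16_t x, y, w, h; };  // top-left corner plus size, in pixels

// Smallest rectangle that covers both the old and the new needle position.
inline Rect unionRect(const Rect &a, const Rect &b) {
    assert(a.w > 0 && a.h > 0 && b.w > 0 && b.h > 0);
    const int x0 = std::min<int>(a.x, b.x);
    const int y0 = std::min<int>(a.y, b.y);
    const int x1 = std::max<int>(a.x + a.w, b.x + b.w);
    const int y1 = std::max<int>(a.y + a.h, b.y + b.h);
    return Rect{static_cast<int16_t>(x0), static_cast<int16_t>(y0),
                static_cast<int16_t>(x1 - x0), static_cast<int16_t>(y1 - y0)};
}

// Share of the screen covered by `r`, as a percentage.
inline double coveragePercent(const Rect &r, int screenW, int screenH) {
    assert(screenW > 0 && screenH > 0);
    return 100.0 * r.w * r.h / (static_cast<double>(screenW) * screenH);
}
```

For two overlapping 40×60 needle rectangles on a 320×480 screen, the union comes out at only a few percent of the display, which is the kind of saving described above.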

Again sorry for rambling here.
 
FWIW, the DMA code in your OS lib has been a very helpful reference for me to implement the DMA transfers for my 3D graphics lib :) It freed up the MCU to do rasterization while transferring the data, so it runs faster now.
 
BTW, KurtE is absolutely right. It always makes sense to look at the whole, and consider which details matter.

The cases where a double frame buffer would be useful are relatively rare. Something like voxel graphics or a Doom/Unreal-type ray casting engine, which will recalculate every pixel for every frame from scratch. (Although in many cases you can avoid that just like KurtE mentioned, by having the display refresh in the direction you recalculate the pixels, so you only need to ensure you don't overrun the DMA.)
Maybe some pixel tricks, where the framebuffer memory is filled in semi-random/nonconsecutive fashion -- I have a pixmap shadow effect like this.
But in general, a double frame buffer just isn't needed. Old 2D game consoles definitely didn't have those.

I do not have any specific use case for myself, really; I am just interested in tinkering with the technical details. In particular, using a secondary microcontroller to convert and colormap UART/SPI data to a parallel-interface display, or to use as a blitter/graphics slave processor. I designed a schematic for a PIC32MX170F256D-I/PT board for this, but since it only has two high-speed UARTs (A0,A1,A4,C9) in addition to the 16+8 parallel pins, it doesn't seem to be well suited for this purpose, although the parts cost would have been less than 10 euros per board at Mouser. A PIC32MZ1024EFG064-I/PT would work, but the cost would be close to a known-working "second" Teensy 4.0 (dedicated for display management, even if via digitalWriteFast() et cetera).

Because of this, my answers are not oriented toward what is sensible, robust, and works in practice, but more towards what is technically feasible. I really like helping others build stuff I myself couldn't or wouldn't think of, so my suggestions may be a bit on the optimistic side: "yeah, it is possible this way (with a couple of months of work)", instead of the more sensible "don't bother; doing it using this library will get you there in just a few days".
 
Yeah, I totally agree, and the two frame thing is kinda unnecessary; tear free is not an important thing either. As I don't totally understand DMA, I'd better stick with the stock ILI9488 library. Calling for example fill_rect should occupy hardly any CPU time, right? And it should take even less than using parallel without DMA.
This is my first real project, and I'd better not make it too advanced. The DMA method already ticks the most important check marks, so I may stick with it.
Is the stock library doing DMA already? If not, how? I really don't know what DMA code looks like, and I don't know the details of how it works...
 
Do you happen to know how to put Teensy 4.0 into very very deep sleep and can be woken up with a button?
How is the power consumption?
Or maybe I am limited to use the power on/off switch pin provided on the board?

Is the power on/off the switch for the onboard 3.3v regulator for Vin? Or does it just cut the power from the 3.3v power pin/3.3v regulator output rail to the processor? Or both? If I want to utilize it, will I have to power the Teensy from the 3v3 pin or the Vin pin?
 
May I use 2 displays with DMA? Is it possible? Is the memory enough?
The display both are ILI9488 with SPI, with CS pin, and both 320*480. How to do it?
 
Can you run 2 displays with DMA? I have not done so with these displays, but have with others, and you may be able to do so and a few things may need to be fixed...

1) Enough memory - 480*320*2=307200 * 2 = 614400 so not easy to fit two full buffers into DMAMEM which is 512K. Not sure if things can be setup to be able to put one in lower memory or not (probably difficult, but...). BUT: we started off with a palette version which only used 1 byte per pixel as the T4B1 has half the memory... So you can fit two palette versions in memory if you can live with 256 colors.
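
The same arithmetic as a compile-time check, as a minimal sketch. The 512K figure is the one quoted above; the constant and function names are made up, and the check ignores anything else that also lives in DMAMEM, so the real margin is smaller.

```cpp
#include <cstddef>

constexpr size_t kDmamemBytes = 512u * 1024u;  // the 512K quoted above

constexpr size_t framebufferBytes(size_t w, size_t h, size_t bytesPerPixel) {
    return w * h * bytesPerPixel;
}

// True if `buffers` full-screen framebuffers fit, ignoring other DMAMEM users.
constexpr bool fitsInDmamem(size_t buffers, size_t w, size_t h, size_t bytesPerPixel) {
    return buffers * framebufferBytes(w, h, bytesPerPixel) <= kDmamemBytes;
}

static_assert(framebufferBytes(480, 320, 2) == 307200, "one 16-bit buffer");
static_assert(!fitsInDmamem(2, 480, 320, 2), "two 16-bit buffers need 614400 bytes");
static_assert(fitsInDmamem(2, 480, 320, 1), "two 8-bit palette buffers need 307200 bytes");
```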

Or, when the T4.1 comes out (https://forum.pjrc.com/threads/58028-Pins-to-bring-out-on-a-hypothetical-larger-Teensy4), I believe there will be the ability to add additional memory to these boards... Not sure how fast that memory will be, as I believe it uses FlexSPI2 connections... Will see when it comes out.

1a) Something I am curious to play with: If the main things you wish to output are hard coded logical bitmaps you might be able to output them directly from PROGMEM. That is you might be able to do something like: tft.setFrameBuffer(my_progmem_image); tft.updateScreen();
I may need to make sure that when you do a setFrame, that it either does not clear the memory or make that optional...

2) DMA to SPI - You can only do DMA output to the SPI port for one display (at a time). I don't know how well I have it set up to allow you to round robin this... BUT for other displays I have (and maybe it is in place here) the ability to do two or three updateScreenAsync updates at the same time IF the displays are connected to different SPI buses.
That is if one display is connected up to SPI and the other to SPI1...

Again I am pretty sure I have not tried this yet with these displays, although I think I may have two... One from ebay that looks like a larger ILI9341 and I think I have one floating around from Buydisplay...

3) Again I am guessing, but two displays implies twice the power needed. So may need to see if the USB or the like can handle the current. For sure I would not run both (probably neither) from 3.3v
 
RRR... If I use parallel, I may use the CS pin to make them share the same data pins right, and DMA feels like a big hassle for dual displays.
And do I HAVE to change every pixel in the rectangle I defined with the set address commands? For example, if I want to draw a circle, may I set the address to the smallest square that covers the circle and, when sending the pixels that are not part of the circle, send something like transparent so it doesn't cover the previous data? I hope you know what I mean. Or will I have to, for example, fill a circle (or any non-rectangular shape) by chopping it into rectangles that don't touch the outside of the circle?
 
And the bloody coronavirus cause Taobao to ban shipping my cheap displays, so I maybe can wait until Teensy 4.1 comes out.
I can make my violin while I wait. Maybe make it more Teensy-y
 
    *((volatile uint16_t *)(&GPIO6_DR) + 1) = ((d & 0xF0) << 2) + (d & 0x0F);
where the cast obtains a 16-bit pointer to GPIO6_DR, we add +1 since we want the high 16 bits of this 32 bit register (and Teensy 4.0 is little-endian, least significant byte first); the volatile tells the compiler it cannot cache the access to the register, and the initial asterisk means we dereference the pointer, i.e. access that location. The value we assign is split into two halves, with the lower part shifted by 16 bits, and the upper part by 18 bits. As we skip the first 16 bits (so as to not affect any GPIO6 bits 0-15, if they happen to be outputs), it means we only need to shift the upper four bits of the byte two places up. We do not need to cast d to any other type, as C integer promotion rules mean that when we do the binary AND operation (&), the arguments are promoted to int anyway. Since we only keep the necessary bits, everything else will be zero.
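
To check the shift arithmetic without touching hardware, the packing can be pulled out into plain functions. This is only a sketch with made-up names: packUpperHalf() is the right-hand side of the statement above, and packFullRegister() is the same value as it would sit in the full 32-bit register.

```cpp
#include <cassert>
#include <cstdint>

// Value written to the upper 16-bit half of the register: low nibble of d in
// half-word bits 0-3 (register bits 16-19), high nibble of d in half-word
// bits 6-9 (register bits 22-25).
constexpr uint16_t packUpperHalf(uint8_t d) {
    return static_cast<uint16_t>(((d & 0xF0) << 2) + (d & 0x0F));
}

// The same bits placed in a full 32-bit register value:
// low nibble shifted by 16, high nibble shifted by 18.
constexpr uint32_t packFullRegister(uint8_t d) {
    return (static_cast<uint32_t>(d & 0x0F) << 16) |
           (static_cast<uint32_t>(d & 0xF0) << 18);
}

static_assert(packUpperHalf(0xFF) == 0x03CF, "all eight data bits set");
static_assert(packFullRegister(0xFF) == 0x03CF0000u, "same bits, full register view");
```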
I have an 8 bit parallel TFT display connected to a Teensy 4.1 via the pins mapped to GPIO6 bits 16..23 (pins 19,18,14,15,40,41,17,16), and have it working all fine and dandy using this statement to send data to the TFT:

Code:
GPIO6_DR = (uint32_t)(d << 16);

where d is of type uint8_t. I now wish to set just those 8 bits on GPIO6, so as to not affect other output pins on GPIO6, without resorting to digitalWriteFast for the individual pins. Is there a way to do this? Note that I tried a derivative of the above:
Code:
*((volatile uint16_t *)(&GPIO6_DR) + 1) = (d & 0xFF);

to narrow it to just the high 16 bits, but this didn't seem to drive the display. I'm playing above my paygrade here, so struggling a bit, so wanted to reach out to the community to see if anyone can help with setting just bits 16..23 of GPIO6. Thanks in advance!
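
One generic way to change only bits 16..23 is a masked read-modify-write: read the register, clear the eight data bits, OR in the new byte, write it back. The sketch below uses made-up names and takes the register as a reference, so it can be tried against an ordinary variable; passing GPIO6_DR itself is an assumption about how the core defines that macro, and the sequence is not atomic, so an interrupt that changes other GPIO6 pins between the read and the write could have its change overwritten.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kDataMask = 0xFFu << 16;  // GPIO6 bits 16..23

// New register value: data bits replaced by d, every other bit kept as it was.
constexpr uint32_t withDataByte(uint32_t current, uint8_t d) {
    return (current & ~kDataMask) | (static_cast<uint32_t>(d) << 16);
}

// Read-modify-write on a register (or, for testing, on a plain variable).
inline void writeDataByte(volatile uint32_t &reg, uint8_t d) {
    reg = withDataByte(reg, d);
}
```

Compared with the plain assignment, this costs one extra register read per byte sent.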
 