TFT Display: SPI or 8 bit parallel interface?

jerrymonkey

Active member
So the ILI9488 and most other TFT libraries written for or ported to Teensy use the SPI interface, but I found an ILI9488 320*480 TFT display with a nice capacitive touch. http://www.lcdwiki.com/3.5inch_LCD_MODULE-Capacitive_Touch

The problem is it only has a 16 bit parallel interface broken out. I am planning to use the 8 bit mode. I "could" do port manipulation, setting all 8 pins' state in around 4 clock cycles.

Here's my problem:
Is SPI fast enough/faster than parallel interface?
Can SPI do full frame update quickly?

I am not sure if I have the skills to port from existing Arduino parallel library / mod Teensy SPI library.

Any ideas?
 
The maximum SPI clock the ILI9488 datasheet allows, is about 15 MHz (to be exact, 66ns per SPI clock cycle). With a 320×480 display at 16 bits per pixel, you can reach about 6 frames per second.
Many displays work with 20 MHz SPI, so you can reach between 8-9 FPS, but not much more for full display updates. If only a smaller region needs updating, that can occur faster.

In the 8/16 bit parallel mode, the write cycle (for each 8 or 16 bits in parallel) needs to be only 30ns (15ns + 15ns) or longer, i.e. 33 MHz maximum WR clock should work fine. (Read cycles are much slower: 160ns or 6.25 MHz for register data, 450ns or 2.2 MHz for framebuffer data.) Thus, you can theoretically reach 100 FPS in the 8 bit parallel mode, 200 FPS in the 16 bit parallel mode. This is obviously higher than the internal refresh rate of the ILI9488. What it means in practice, is that if you initiate a transfer just after the module has refreshed the display, the framebuffer can be refreshed faster than the internal display update, leading to tear-free updates.
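The frame-rate figures above are plain arithmetic. Here is a rough sketch in C, assuming a full 320×480 update at 16 bits per pixel and ignoring command overhead; the function names are made up for illustration:

```c
#include <stdint.h>

/* Full-frame updates per second over SPI: one bit moves per clock cycle. */
double spi_fps(double clock_hz, uint32_t width, uint32_t height,
               uint32_t bits_per_pixel)
{
    return clock_hz / ((double)width * height * bits_per_pixel);
}

/* Full-frame updates per second over a parallel bus:
   one WR cycle moves bus_bits bits. */
double parallel_fps(double wr_hz, uint32_t width, uint32_t height,
                    uint32_t bits_per_pixel, uint32_t bus_bits)
{
    double transfers = (double)width * height * bits_per_pixel / bus_bits;
    return wr_hz / transfers;
}
```

With these, 15 MHz SPI works out to roughly 6.1 FPS and 20 MHz to roughly 8.1 FPS, while a 30 ns write cycle (about 33.3 MHz) gives roughly 108 FPS on the 8-bit bus and 217 FPS on the 16-bit bus.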

The sequence to send parallel data is simple: If this is a command, pull RS low; pull it high for parameters and pixel data. Pull CS and WR low. Set the parallel data pins, noting that on the Teensy 4.0, you need to split each byte, for example mask and shift the bits 4..7 two bits up, so you can use AD_B1_00..03 and AD_B1_06..09 (pins 19,18,14,15,17,16,22,23). Make sure at least 15ns passes since WR was pulled low, then pull WR high. If you wait at least 15ns before pulling WR low again, you can send multiple bytes in quick succession. Otherwise, (wait that 15 ns and) pull CS high.

On Teensy 4.0 running at 600 MHz, each clock cycle is 1.667 ns long (9 clock cycles per 15 ns), so it's more of a matter of doing it not too fast, even taking into account the operations needed to mask and shift the bits. In fact, you can use pretty much any I/O pins you want; there's ample time, even if you don't hit the 15ns minimums, and need say 20ns.

The library issue is harder, and probably the make-or-break thing in your case. Perhaps you should first look at existing libraries to see what they could support? And whether they are organized in such a way that you could replace their low-level send/receive portions with your own parallel send/receive functions? I write my own, so I don't know what is out there.
 
Some displays can use higher speeds - as usual.
All ili9341 I own so far accept way more than 30MHz Spi. I've used 60MHz. Then, the lower resolution helps to reach higher FPS with fullscreen-updates.
 
I originally thought SPI was a pretty FAST thing, now I know that parallel still does better, especially dealing with Teensies.

[Attached image: 註解 2020-02-05 205312.jpg]
If I understand it correctly, you will have to first set ChipSelect to low, then set Data/Comand to either high or low, meanwhile pull WRite to low, and also set Data Bus lines. Then wait at least a few clock cycles, and then pull WRite to high maybe?

[Attached image: 註解 2020-02-05 210023.jpg]
So if I want to set column address to 63-300,
I will need to set SC to 63 or b0000000000111111
and EC to 300 or b0000000100101100
Like so:
[Attached image: 未命名.png]
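For the 8-bit bus, the byte split of SC and EC can be sketched in C. Command 0x2A is the column address set command; the helper name and the output buffer are assumptions for illustration only. Each 16-bit address goes out high byte first:

```c
#include <stdint.h>
#include <assert.h>

#define TFT_CMD_COLUMN_ADDRESS_SET 0x2A

/* Fill params[0..3] with SC[15:8], SC[7:0], EC[15:8], EC[7:0],
   the four parameter bytes that follow the command byte. */
void tft_column_params(uint16_t sc, uint16_t ec, uint8_t params[4])
{
    assert(sc <= ec);                 /* start column must not exceed end column */
    params[0] = (uint8_t)(sc >> 8);
    params[1] = (uint8_t)(sc & 0xFF);
    params[2] = (uint8_t)(ec >> 8);
    params[3] = (uint8_t)(ec & 0xFF);
}
```

For columns 63 to 300 this gives the parameter bytes 0x00, 0x3F, 0x01, 0x2C.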

Also the controller seems to be able to do 16.7M colors (24 bit), but GRAM can only store 18 bit color, so except for the DPI thing, the 24 bit color is not accessible?
Because it can only take 6 bits per color per pixel, the lowest 2 bits are ignored.

Which command controls whether I am using 18 bit (takes 3 data bytes per pixel) or 16 bit 565 RGB (takes only 2), or is it controlled by a hardware jumper/resistor thing?
I feel like 4 times the colors doesn't make that much difference on a cheap Chinese TFT display

I believe it is a hardware thing because I didn't find the command for it, and also from the website:
[Attached image: 註解 2020-02-05 212930.jpg]
Feels like the module already locked that down.


And another question, how does 16 bit data bus mode work? Setting it up is a jumper thing, I know, and
commands are 8 bits long: use lower 8 bits
two 8 bit long parameters/data turn into one 16 bit long one like setting the column address thing becomes COMMAND>SC>EC rather than COMMAND>SC[15:8]>SC[7:0]>EC[15:8]>EC[7:0]
and if there is odd 8 bit parameters/data then the last 16 bit long's top 8 bit is ignored that sort of thing?



OK another question: how does the power button that completely shuts off 3.3V work? Will it still drain my battery if it is connected to the 5V input?
 
For me some of these type questions, have no real one size fits all answer.

Is SPI fast enough? It is if it is fast enough for whatever it is you are needing to do...

Is SPI faster than 8 bit parallel outputs? Short answer NO, longer answer is it may depend on what your whole thing is doing...

That is if your parallel output code has to tinker with managing the output of all of the bits to the display, it may not be doing anything else, like reading in or computing the next frame of data...

Where if you are doing DMA SPI output, you may be able to setup your code have the display startup and take care of continuously updating, and then have other code that then does what it needs to do for the next frame. Being careful to time when it can update the different sections of the display....

So again the answers may be, it depends... :D
 
Some displays can use higher speeds - as usual.
All ili9341 I own so far accept way more than 30MHz Spi. I've used 60MHz. Then, the lower resolution helps to reach higher FPS with fullscreen-updates.
True; I only have one ILI9341 and no ILI9488, and was only quoting what the datasheet says.

(I found the version 0.90 and version 1.00 using a simple DuckDuckGo search; looking at the site itself first to see if there is reason to assume it is valid. Mouser, Digikey, Element14, and LCSC are also good sources for datasheets for chips they sell, but they don't seem to be selling ILI9488 chips.)

That said, I have been thinking about using one of my Teensy 4.0's as a programmable blitter, or even doing a replacement board for one of these displays with a SAM/PIC32MZ on it, somewhat similar to Digole displays, with Teensy 4.0 being the master processor; and discussed this elsewhere. It might be a nice basis for old-style pixel-art games?

If I understand it correctly, you will have to first set ChipSelect to low, then [...]
Download the datasheet I linked to above (version 1.00), and the sequences are listed in chapter 4, from page 39 onwards; and timing requirements in chapter 17.4, from page 329 onwards.

The 8-bit parallel interface is described in chapter 4.7.3, from page 123 onwards, and you can select whether you use 16 bits per pixel, or 24 bits per pixel.

Internally, the display supports 18 bits per pixel, or 6 bits per color component. The mapping from 16 bits to 18 bits is R5:G6:B5, i.e. you lose one red and one blue bit. This is typical, and in my opinion, the best way to do 16 bit color. The mapping from 24 bits to 18 bits uses the most significant 6 bits of each color component.
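As a rough illustration of those two reductions, here is a minimal C sketch that starts from 8-bit components. The function names are hypothetical, and the packed 18-bit value is only a convenient way to show which bits survive, not the format used on the bus:

```c
#include <stdint.h>

/* Pack 8-bit components into R5:G6:B5 by keeping the top bits of each. */
uint16_t rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Keep the most significant 6 bits of each component,
   packed as R6:G6:B6 in the low 18 bits of the result. */
uint32_t rgb666(uint8_t r, uint8_t g, uint8_t b)
{
    return ((uint32_t)(r >> 2) << 12) | ((uint32_t)(g >> 2) << 6) | (uint32_t)(b >> 2);
}
```

Pure red, green, and blue come out as 0xF800, 0x07E0, and 0x001F in the 16-bit form.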

You can use command 0x3A to choose between which (16 or 24 bits per pixel) is used.

And another question, how does 16 bit data bus mode work?
That is described in the ILI9488 datasheet chapter 4.7.5, from page 126 onwards. The command 0x3A can be used to choose between 16bpp and 24(18)bpp; in 24 bit mode, three 16-bit transfers are used to transfer the data for two pixels.

In both 8-bit and 16-bit parallel modes, everything else except pixel data (i.e., register reads and writes) use only the low 8 data lines, as shown in chapter 4, page 39.
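To compare the modes, the number of WR strobes needed for the pixel data can be counted. This is only a sketch with a made-up helper name; it assumes two or three transfers per pixel on the 8-bit bus, and one per pixel or three per two pixels on the 16-bit bus, as discussed in this thread:

```c
#include <stdint.h>
#include <assert.h>

/* Number of WR strobes needed to send `pixels` pixels of pixel data. */
uint32_t tft_pixel_transfers(uint32_t pixels, unsigned bus_bits, unsigned bits_per_pixel)
{
    assert(bus_bits == 8 || bus_bits == 16);
    assert(bits_per_pixel == 16 || bits_per_pixel == 18);
    if (bus_bits == 8)
        return pixels * (bits_per_pixel == 16 ? 2u : 3u);
    if (bits_per_pixel == 16)
        return pixels;
    return (pixels * 3u + 1u) / 2u;   /* three transfers per two pixels, rounded up */
}
```

For a full 320×480 frame that is 307,200 or 460,800 strobes on the 8-bit bus, and 153,600 or 230,400 on the 16-bit bus.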
 
So yeah thanks for your help I now know 3A command can set the color mode thing.
Since WR needs to be held for a few nanoseconds, which on a Teensy is actually a few clock cycles to spare, I should use the 16 bit parallel mode, which runs roughly two times faster. Also, using 16 bit color rather than 18 bit will save one third of the time, and is easier to store, process and transfer.

In the past I rarely read datasheets for communication protocols and that sort because many libraries are already written for them for Arduino, which is much more common.

Also if I fail to modify the library, I could desolder the display ribbon from the module, which would let me tinker with the IM0 to IM2 pins and get SPI working.
I should be able to replicate the pins' high and low sequence and make it work, just not well optimised.


And why I want optimal speed? Since it has a nice capacitive touch, it would be nice to have smooth and responsive GUI feedback when scrolling though menus or pressing a button.


Another project guidance request, how fast will octuple precision floating point number operations take with software, without a proper FPU that can do such high precision? FPUs don't seem to run at crazy speeds, certainly slower than 600 MHz, so what if I use the second core of the Teensy as the FPU to do the calculations? In the worst case, octuple precision is 4 times more bytes than double, so with "human's" normal calculation method (kinda n^2 big O notation complexity), multiplication and division will be 16 times slower?
 
Sounds like this project will keep you busy. Will be interesting to see how it progresses.

Other options for doing IO to multiple pins may be to try to use FlexIO... (Chapter 49 of the IMXRT RM)... I don't think the T4 has 8 contiguous Flex IO pins, but I think the T4.1 will.
However I believe that will be on the FlexIO 3 which does not have DMA support. I probably should add table I put up in ReadME, that I put with my FlexIO experiment library (https://github.com/KurtE/FlexIO_t4)
Which shows the pins of T4 and maybe 4.1... So far I have only experimented with logical Uarts and SPI, but they do show examples of emulating things like an 8080 BUS.
 
On the Teensy 4.0, although there aren't 16 pins available on any I/O port (closest is AD_B1, which has 0-3, 6-11, 14-15), bit-banging them will do just fine due to the 600 MHz system clock, because you can do it faster than the display module can accept... You just cannot use DMA to the display module, that's all. I'd basically use an open-coded loop, something like
Code:
    digitalWriteFast(TFT_CS, 0);  // Assert Chip Select for the TFT

    // Command part is written first, omitted here for brevity

    // Send one 16-bit pixel data:

    digitalWriteFast(TFT_DC, 1);  // This is pixel data, not command/register (those only use D0..D7)

    digitalWriteFast(TFT_WR, 0);  // Writing to TFT
    // Start setting up the data pins,
    digitalWriteFast(TFT_D0, pixel & 1); pixel >>= 1;
    digitalWriteFast(TFT_D1, pixel & 1); pixel >>= 1;
    digitalWriteFast(TFT_D2, pixel & 1); pixel >>= 1;
    digitalWriteFast(TFT_D3, pixel & 1); pixel >>= 1;
    digitalWriteFast(TFT_D4, pixel & 1); pixel >>= 1;
    digitalWriteFast(TFT_D5, pixel & 1); pixel >>= 1;
    digitalWriteFast(TFT_D6, pixel & 1); pixel >>= 1;
    digitalWriteFast(TFT_D7, pixel & 1); pixel >>= 1;
    digitalWriteFast(TFT_D8, pixel & 1); pixel >>= 1;
    digitalWriteFast(TFT_D9, pixel & 1); pixel >>= 1;
    digitalWriteFast(TFT_D10, pixel & 1); pixel >>= 1;
    digitalWriteFast(TFT_D11, pixel & 1); pixel >>= 1;
    digitalWriteFast(TFT_D12, pixel & 1); pixel >>= 1;
    digitalWriteFast(TFT_D13, pixel & 1); pixel >>= 1;
    digitalWriteFast(TFT_D14, pixel & 1); pixel >>= 1;
    digitalWriteFast(TFT_D15, pixel & 1);
    // and at this point, make sure at least 15ns passed
    // since TFT_WR was pulled low (including data pin setting)
    digitalWriteFast(TFT_WR, 1); // Tell TFT to latch the parallel data.

    // At least 15ns should pass before next digitalWriteFast(TFT_WR, 0)
The sequence for 8-bit command/register write is very similar. I don't think you'll need pixel reads, but they're very similar to register reads. You do need to change the direction for all the TFT_D0..D15 pins (TFT_D0..D7 for register access, although I would make the others inputs too), and strobe the RD pin, and you'll want to make the TFT_D0..D15 pins inputs before pulling RD down. The data is available in the pins 40ns after pulling the RD pin low for register reads, and 340ns after for pixel data reads. So, it's pretty straightforward, really.

Note that you'll want to make TFT_ constants macros (#define TFT_D0 ...), so that Teensyduino can optimize digitalWriteFast() properly. If you use const int TFT_D0 = ...; or similar, digitalWriteFast() will be slower.

(Also note, I'm not sure if that is the best way to open-code it; it's just something I might start with on the Teensy 4.0.)

On Teensy 3.5 and 3.6, PTB16..23 are continuous (pins 0,1,29,30,43,46,44,45); as are PTC0..11 (15,22,23,9,10,13,11,12,35,36,37,38); PTD0..9 (2,14,7,8,6,20,21,5,47,48) and PTD11..15 (55,53,52,51,54) only misses bit 10. On Teensy 3.2, you can use PTD0..7 (2,14,7,8,6,20,21,5) and PTC0..11 (15,22,23,9,10,13,11,12,28,27,29,30). So, on these using 8-bit or 16-bit assignments to the GPIO to set/read the 16 data pins in two 8-bit assignments is quite doable; their microcontrollers have GPIO registers that allows direct assignment to/from port bits 0-7, 8-15, 16-23, 24-31; or 0-15 and 16-31; or 0-31. So, a helper function that uses fixed pins can do a 16-bit parallel write in a handful of cycles, definitely fast enough, on these too.

You can find the pin-to-GPIO mapping in the Teensy schematics page.

As to the actual code, I would write low-level helper functions to send a command (with N payload data bytes), send a command and read N payload data bytes, send one 16-bit pixel, send N 16-bit pixels, and so on. I would start with getting the display information out first, using excessively long pulses by delaying everything with delayMicroseconds(1), essentially dropping the WR clock to somewhere around 0.5 MHz. That way I could capture them (at least the CS, DC, WR, RD, and some of the data pins) using a cheap logic analyzer, and verify the data looks sane. (I use a cheap Saleae Logic clone off eBay with Sigrok; it costs about 6€, and all software/firmware it runs on is provided by Sigrok. It can sample at up to 24 MHz, but having an order of magnitude lower rate helps a lot.) Unless I get reasonable responses out of the display immediately so no logic analyzer stuff is needed; that sometimes happens.

Note: KurtE and others here are much more experienced than I with this stuff, so if you see anything that conflicts, trust them first. I've been thinking about something like this for a while, but I'm just an old bumblepuck with not much real-world experience, other than some odd experiments. I'm much more experienced with C and POSIX and the software side.
 
Wow.. That is a lot of interesting and useful information! So what will I lose if I don't have DMA, because I don't quite understand how that thing works. I am planning on creating a simple function to replace the original send 8 bit command and 8/16 bit data sort of functions. For the IO, I think I can use 2 GPIO ports, or maybe three, or maybe 8 bit mode is fine. Also, T4 can be overclocked to 1 GHz =P
I am also curious about the current draw of each clock speed, if anyone happen to know. I could measure myself after I actually get my T4.0, or maybe T4.1
T4.1 is FASTER!! MORE PINS!!!
Really want it
 
Ooof!!! FlexIO is a tad more difficult than just port manipulation. Also, how do I do port manipulation correctly? All I get from core_pins in the core library is the register to turn a pin high, another register to turn it low, and one to toggle

And I need know how to use github, seems like something noice
 
So what will I lose if I don't have DMA, because I don't quite understand how that thing works.
To simplify a bit, the DMA engine is a separate unit in the microcontroller, that transfers 8-bit bytes or 16- or 32-bit words, while the processor itself is working on something else.

ILI9341, ILI9488, ST7789, and other similar displays that support parallel interface, all use a D/C (Data/Command, AKA RS for Register Select) pin, to distinguish commands from parameters/pixel data. Because this is necessary for longer sequences, and setting up the DMA transfer takes enough time so that it is not worth it for short sequences, the optimal data width is 7, 15, or 31 bits, leaving one bit of each 8-, 16-, or 32-bit DMA unit for the D/C pin.

Long story short, DMA via parallel interfaces to these displays isn't worth it, unless you are willing to use double the RAM for the data sent, and have the 9/17/33 bits available in the same GPIO port (or FlexIO; I haven't used FlexIO yet myself).

If you DO NOT use DMA, the processor must send the data in a loop of some sort. You can still use DMA to receive data from UART or SPI into a buffer. This is mostly a design issue: with parallel interfaces to these displays, you'll want your "main loop" to be around sending data to the TFT display, and fit everything else in/around that.

Fortunately, when you do not use DMA, the processor can do funky stuff to the data when it sends it.

In particular, you can use just 320×480=153,600 bytes for your framebuffer, and 2×256=512 bytes for a 256-entry palette of 16-bit colors, so that your framebuffer can display any 256 unique colors picked from the 65,536 possible ones. When sending pixel data, Teensy will look at the byte in the framebuffer, and instead of sending it directly, use the palette to look up the corresponding color to send to the display. (You can even split the display into multiple regions that have their own palettes.)

Then, if you always refresh the entire display, you can do "animation" by modifying the palette only. This is how color cycling is done for fractals, for example: only the palette data changes, the framebuffer itself is unchanged. (Of course, because the palette is on the Teensy and not on the display, you do need to update the entire display for this to work. But Teensy 4.0 has ample power; you can even calculate these fractals directly on it.)
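The palette idea can be sketched in plain C like this. It is only an illustration: the array and function names are made up, and actually sending the expanded row to the display is left out:

```c
#include <stdint.h>
#include <stddef.h>

#define TFT_WIDTH  320
#define TFT_HEIGHT 480

uint8_t  framebuffer[TFT_HEIGHT][TFT_WIDTH];  /* one palette index per pixel */
uint16_t palette[256];                        /* 16-bit colors */

/* Expand one framebuffer row into the 16-bit colors that would be sent. */
void expand_row(size_t y, uint16_t out[TFT_WIDTH])
{
    for (size_t x = 0; x < TFT_WIDTH; x++)
        out[x] = palette[framebuffer[y][x]];
}

/* Color cycling: rotate the palette by one entry.
   The framebuffer itself is not touched. */
void cycle_palette(void)
{
    uint16_t first = palette[0];
    for (size_t i = 0; i < 255; i++)
        palette[i] = palette[i + 1];
    palette[255] = first;
}
```

Calling cycle_palette() once per frame and then resending every row is all the "animation" needs; the 153,600-byte framebuffer stays as it is.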

In summary, you don't lose much, if you cannot use DMA with parallel interface to these displays. I just wanted you to know the tradeoff.

Also how to do port manipulation correctly
First, you look at the relevant manual.
For example, on Teensy 4, the GPIO stuff is described on page 1009 forwards. (There is also a note somewhere there that says that each 8-bit byte or 16-bit half of the 32-bit register can be accessed separately. So, an 8-bit access can set/clear/modify bits 0-7, 8-15, 16-23, or 24-31; a 16-bit access can set/clear/modify bits 0-15 or 16-31; and a 32-bit access can set/clear/modify all 32 bits at once.)

Let's ignore direction (input or output) and interrupt capabilities, and concentrate on setting several output pins at the same time. (You use the Teensyduino interfaces to set pin directions et cetera; we're just basically looking at how to do digitalWriteFast() to several pins in parallel.)

On page 1020, we see four interesting registers mentioned: GPIO data register (DR), GPIO data register SET (DR_SET), GPIO data register CLEAR (DR_CLEAR), and GPIO data register TOGGLE (DR_TOGGLE). If we read the following pages, we find that
  • DR contains the port output pin states. Pins that are reserved or not outputs, should be zero. You can read and write this register.
  • DR_SET is a write-only register. When you write to this register, the bits set in the value cause the corresponding bits in DR to be set.
    For example, writing 2^0+2^5=33 to this register sets bits 0 and 5 of DR.
  • DR_CLEAR is a write-only register. When you write to this register, the bits set in the value cause the corresponding bits in DR to be cleared.
    For example, writing 2^0+2^3=9 to this register clears bits 0 and 3 of DR.
  • DR_TOGGLE is a write-only register. When you write to this register, the bits set in the value cause the corresponding bits in DR to change state.
    For example, writing 2^1+2^2=6 to this register toggles bits 1 and 2 of DR.
We also know that GPIOn base address is 0x401B8000+(n-1)*0x4000, where n is between 1 and 4 inclusive;
GPIO5 base address is 0x400C0000, and
GPIOn base address is 0x42000000+(n-6)*0x4000, where n is between 6 and 9, inclusive;
and that DR is at the base address, DR_SET is at base address plus 0x84, DR_CLEAR at base address plus 0x88, and DR_TOGGLE at base address plus 0x8C.
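If you want to check masks and addresses on a PC before touching real hardware, a small software model helps. This is just a sketch: gpio_model and the function names are invented for illustration, and the model only mimics what the SET, CLEAR, and TOGGLE writes do to DR:

```c
#include <stdint.h>
#include <assert.h>

/* Base address of GPIO bank n (1..9), following the address rules above. */
uint32_t gpio_base(unsigned n)
{
    assert(n >= 1 && n <= 9);
    if (n <= 4) return 0x401B8000u + (n - 1) * 0x4000u;
    if (n == 5) return 0x400C0000u;
    return 0x42000000u + (n - 6) * 0x4000u;
}

/* Software model of one bank's data register (not real hardware access). */
typedef struct { uint32_t dr; } gpio_model;

void model_set(gpio_model *g, uint32_t mask)    { g->dr |=  mask; }  /* like DR_SET    */
void model_clear(gpio_model *g, uint32_t mask)  { g->dr &= ~mask; }  /* like DR_CLEAR  */
void model_toggle(gpio_model *g, uint32_t mask) { g->dr ^=  mask; }  /* like DR_TOGGLE */
```

For example, gpio_base(6) gives 0x42000000, and adding 0x8C to that is where the GPIO6 toggle register lives.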

We can also find these constants in the Teensyduino Core, in hardware/teensy/avr/cores/teensy4/imxrt.h, as GPIOn_DR, GPIOn_DR_SET, GPIOn_DR_CLEAR, and GPIOn_DR_TOGGLE.

We can either use the schematics to find out the names of the pins we wish to access and the manual to find their corresponding GPIO bank numbers and bits, or we can do it the easy way, and look at hardware/teensy/avr/cores/teensy4/core_pins.h, especially the CORE_PINn_BIT, CORE_PINn_PORTREG, CORE_PINn_PORTSET, and CORE_PINn_PORTCLEAR macro definitions. These are all for Teensy 4, so if using another Teensy model, use the correct core.

Let's say we want to toggle pins 20, 21, and 22, at the same time if possible. We find that the bits are 26, 27, and 24, respectively, and that for Fast I/O, we can use GPIO6 for all of them.
So, to toggle them, we can write
    GPIO6_DR_TOGGLE = (1<<26) | (1<<27) | (1<<24);
Or, better yet,
    // Note: pins 20, 21, and 22 are assumed to be in GPIO bank 6!
    GPIO6_DR_TOGGLE = CORE_PIN20_BITMASK | CORE_PIN21_BITMASK | CORE_PIN22_BITMASK;
(In fact, we can also do a byte access to the most significant byte of GPIO6 DR_TOGGLE register, writing 2^(24-24)+2^(26-24)+2^(27-24)=13 to it, but better leave that sort of optimization for the compiler to worry about.)

The GPIO module in the Teensy will toggle each pin in parallel. That is, if they were high-low-low, they simultaneously become low-high-high.

In the same GPIO bank, a single write to the DR_SET register can set any set of pins, and a single write to the DR_CLEAR register can clear any set of pins, using the same logic as above.

If you look at the implementation of hardware/teensy/avr/cores/teensy4/core_pins.h:digitalWriteFast(), you'll see that as long as you use literal constants or compiler macros (i.e., use #define MY_FOO_PIN 20 instead of const int my_foo_pin = 20;), it optimizes to the thing we did by hand above: it is pretty darned fast. So, unless you are doing something special, you are better off using digitalWriteFast() with macros (not consts!).

One such special thing could be using pins 14-19, 22, and 23 for the eight parallel data pins, as they are all in GPIO bank 6:
    Pin Bit TFT
    19 16 D0
    18 17 D1
    14 18 D2
    15 19 D3
    17 22 D4
    16 23 D5
    22 24 D6
    23 25 D7
Do note that there are two additional pins affected,
    20 26
    21 27
that should be either inputs, or used for digital audio or UART 5, not as GPIO outputs. If they are GPIO output pins, the write below sets them also.
You see, to set these pins (in the above order, 19,18,14,15,17,16,22,23 from D0 to D7) corresponding to byte d, we can do
    *((volatile uint16_t *)(&GPIO6_DR) + 1) = ((d & 0xF0) << 2) + (d & 0x0F);
where the cast obtains a 16-bit pointer to GPIO6_DR, we add +1 since we want the high 16 bits of this 32 bit register (and Teensy 4.0 is little-endian, least significant byte first); the volatile tells the compiler it cannot cache the access to the register, and the initial asterisk means we dereference the pointer, i.e. access that location. The value we assign is split into two 4-bit halves: relative to the full 32-bit register, the lower half ends up shifted by 16 bits, and the upper half by 18 bits. As we skip the first 16 bits (so as to not affect any GPIO6 bits 0-15, if they happen to be outputs), it means we only need to shift the upper four bits of the byte two places up. We do not need to cast d to any other type, as C integer promotion rules mean that when we do the binary AND operation (&), the arguments are promoted to int anyway. Since we only keep the necessary bits, everything else will be zero.

In other words, if pins 20 or 21 are GPIO outputs also, they will be set to low/0 by the above code.
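The bit shuffling itself can be checked off-target. A minimal sketch, with hypothetical function names, that computes the same value as the assignment above without touching any register:

```c
#include <stdint.h>

/* Value for the upper 16 bits of GPIO6_DR:
   D0..D3 land on GPIO bits 16..19, D4..D7 on GPIO bits 22..25. */
uint16_t tft_data_to_gpio6_high(uint8_t d)
{
    return (uint16_t)(((d & 0xF0) << 2) | (d & 0x0F));
}

/* The same value seen as a full 32-bit register image,
   handy for checking the bit positions against the pin table. */
uint32_t tft_data_to_gpio6_bits(uint8_t d)
{
    return (uint32_t)tft_data_to_gpio6_high(d) << 16;
}
```

With d = 0xFF the register image is 0x03CF0000: bits 16-19 and 22-25 are set, and bits 26 and 27 (pins 20 and 21) stay zero.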

But, like I wrote earlier, the speed at which Teensy 4.0 can set those eight pins using digitalWriteFast() is probably as fast as the displays can handle anyway, so doing it the complicated -- and very hard to maintain!! -- way like this, is just uncalled for.

At minimum, I'd put a clear comment explaining what it does, and why; and write but comment out the equivalent digitalWriteFast() commands. I know me being me, I'd bumble something at some point anyway, and I'd want to verify that it isn't that which is causing problems, so switching to the known good way of setting the pins is the first step in debugging.
 
That is a great DMA explanation! So I guess in the ILI9488_t3 library, the SPI transfer part doesn't use SPI.transfer that sort of stuff but instead just sets some register or variable to the data, and then that's it, no SPI function calls, so it doesn't occupy the CPU time.

Now I really struggle to decide which interface to use, both have really attractive characteristics. I may go with parallel? Being able to send 16 bits at a time is gggggggreat.
But when I am scrolling, I need to do heavy calculations while updating the whole screen. That may waste some unnecessary time. But also I will get smooth scroll with no skip frames. And the power of Teensy can do heavy calculations real fast.

I think I will go with parallel, because like for example changing the background color, SPI will wipe from top to bottom, quite significantly I guess? But with parallel, less significant? When doing bigger updates, or scrolling vertically or horizontally, tearing effect will happen, then parallel will outperform SPI, even with DMA
 
ILI9488_t3 supports the palette method I described above, although it calls it pallet :rolleyes:

It does support DMA too, by precalculating the data sent via SPI into a buffer, then using a DMA to actually feed the bytes.
Each time a buffer of pixels has been sent completely, the DMA engine raises an interrupt (so process_dma_interrupt() gets run), and that prepares the next buffer and sets up the DMA transfer, until all pixels for the current frame have been sent.

Actually, now that I thought about it, we can use DMA on Teensy 4.0 for parallel 8-bit or 9-bit transfers. Our DMA buffer will just use 32 bits per pixel, that's all. Also, we'll want to use 32-bit palettes (1024 bytes per palette), with a wonky bit order: 000010RRRR00RGGG000010GGGB00BBBB (binary) for 8-bit parallel, and 00001RRRRR00RGGG00001GGGBB00BBBB (binary) for 9-bit parallel transfers.

You see, our DMA setup handler only needs to send the initial command, either 0x2C "Memory Write" (for the very first pixel in the update region), or 0x3C "Memory Write Continue" (for further pixels even if there is another command in between). That is just a single byte, during which the D/C (Data/Command, AKA RS for Register Select) pin is low/0; for parameters and pixels it is high/1, and we can let the DMA handle those.

If we use pins 19,18,14,15,17,16,22,23 (and 20 for 9-bit parallel), the DMA buffer refill from a 256-color framebuffer using a palette is just
Code:
    static uint8_t   tft_framebuffer[480][320];  /* [y][x], we update in 320-pixel "columns" */
    static uint32_t  tft_palette[256];
    static volatile uint32_t  tft_dma_buffer[320];  /* DMA uses uint16_t tft_dma_buffer[2*320] */
    static volatile uint32_t  tft_y;  /* Next y to be filled */

    uint8_t *pixel = tft_framebuffer[tft_y];  /* Same as = &(tft_framebuffer[tft_y][0]); */
    uint8_t *const endpixel = pixel + 320;
    volatile uint32_t *out = tft_dma_buffer;  /* keep the volatile qualifier of the buffer */
    while (pixel < endpixel)
        *(out++) = tft_palette[*(pixel++)];
Because the palette entries are already mangled to correspond to the 16-bit chunks of GPIO bank 6, the buffer fill loop is just a straightforward lookup: each byte in the framebuffer corresponds to 32 bits of DMA data; of which 2×8 or 2×9 bits is actually color information. A bit rough RAM use, but worth it, in my opinion.

(Using 8-bit parallel transfers, our DMA buffer needs twice the bytes ILI9488_t3's DMA buffer does, for the same number of pixels.)

Furthermore, we have that tenth bit in GPIO bank 6 available, pin 21, that we could wire to the TFT D/C (also known as RS) pin, which selects between commands and data.
That way, if we wanted to include the Memory Write Continue command in the DMA buffer, we'd just prepend a 16-bit word with value 0x003C to the DMA buffer. We would need to set bits 27 and 11 in each palette entry, though. (This is because GPIO bank 6 bit 27, corresponding to pin 21, must be set, as palette data is data, not commands; and because we actually send 16-bit words, we need both bits 27 and 27-16=11 set. This is what those 1's are in the 32-bit pattern earlier. Essentially, each palette entry is two 16-bit words, that are written to GPIO bank 6 high 16 bits as-is.)

If we need direct color lookup, we can use a lookup table per component that does that bit splitting for us. Say, 256 entries per color component, for 256×4×3=3072 bytes total. Then, if r, g, and b are the color components between 0 and 255, inclusive, the actual wonky bit order, true 32-bit color value is, tft_lookup_red[r]+tft_lookup_green[g]+tft_lookup_blue[b].
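Here is one way the palette entries and the per-component tables could be built for the 8-bit case. Treat it as a sketch under assumptions: tft_data_word, tft_palette_entry, and tft_build_lookups are names invented here, the first byte is placed in the upper 16 bits simply because that is how the 8-bit pattern earlier in the post is written (whether that matches the order in which a DMA channel walks the buffer is not checked), and the constant D/C bits are kept in the red table only so that the sum carries them exactly once:

```c
#include <stdint.h>

/* One 8-bit bus value as a 16-bit word for the upper half of GPIO bank 6:
   bit 11 (pin 21, D/C) set for data, D4..D7 in bits 6..9, D0..D3 in bits 0..3. */
uint32_t tft_data_word(uint8_t d)
{
    return 0x0800u | ((uint32_t)(d & 0xF0) << 2) | (uint32_t)(d & 0x0F);
}

/* 32-bit palette entry for an RGB565 color, first byte in the upper half. */
uint32_t tft_palette_entry(uint16_t rgb565)
{
    return (tft_data_word((uint8_t)(rgb565 >> 8)) << 16)
         |  tft_data_word((uint8_t)(rgb565 & 0xFF));
}

uint32_t tft_lookup_red[256], tft_lookup_green[256], tft_lookup_blue[256];

/* Fill the per-component tables. The color fields never overlap,
   so adding the three lookups is the same as OR-ing them. */
void tft_build_lookups(void)
{
    const uint32_t dc_bits = tft_palette_entry(0);   /* only the two D/C bits set */
    for (unsigned i = 0; i < 256; i++) {
        tft_lookup_red[i]   = tft_palette_entry((uint16_t)((i >> 3) << 11));
        tft_lookup_green[i] = tft_palette_entry((uint16_t)((i >> 2) << 5)) - dc_bits;
        tft_lookup_blue[i]  = tft_palette_entry((uint16_t)(i >> 3)) - dc_bits;
    }
}
```

After tft_build_lookups(), tft_lookup_red[r]+tft_lookup_green[g]+tft_lookup_blue[b] gives the same value as tft_palette_entry() applied to the RGB565 version of the color; white comes out as 0x0BCF0BCF.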

Funnily enough, the same code works for both 8- and 9-bit parallel transfers; 8-bit transfers allow 65,536 colors, and 9-bit 262,144 colors.

Darn nifty, if I say so myself.

To properly time the DMA transfers, we do need a WR strobe pin also, and wire the DMA transfer to a falling edge of that pin. The TFT will latch the pins on the rising edge, so Teensy 4.0 needs to set the output pins on or just after the falling edge. I'm not exactly certain how to generate the 10-30 MHz clock signal for the WR strobe pin, perhaps use a pair of pins, one input (for triggering the DMA on falling edge), and one output (say, a SPI SCK pin), connected together, and to the TFT WR strobe pin. But that's details!
 
Now I kinda want a dual 3.5 inch 480*320 lcd setup, so that I don't need to 3d print and polish that many buttons (making those custom rubbery buttons with plastic caps prototypes is real pain in the ass to diy)
Haven't settle the design yet, one of the lcd will be capacitive touch, or both, but *if* I successfully nail the buttons, it will be better than a touch screen. I am making a graphing calculator, and having a screen as the function buttons allows it to show not only function buttons but also QWERTY keyboard if needed or show tables, Y= screen, etc. The number keys *must* be real buttons.
I am worried if I will make something like the horrible Casio Classpad
 
Another non-related question, how to give power to teensy with a button and shut off with a GPIO, I know a latch or something or auto power off circuit is ok, but um...
I will use a 3.3v 500mA power supply connected to a lipo battery, probably with or maybe without an over discharge circuit. The power supply MIC5219-3.3V has an ENable pin, so if I utilize it, I can get less than 5 micro amps of current draw, which is certainly overkill. If I use the power on/off thing that cuts off 3.3v on the teensy, I can get less than 200 micro amps. I wonder if I can hook a GPIO pin to the power on/off pin, so I can pull it high or low for a few seconds to turn it off, and also connect a physical button to the power on/off pin so I can power it on with one click and hold the button to force shutdown.
Another choice is to somehow use the deepsleep stuff on the teensy, not sure about the performance.

I don't think I need an over-discharge circuit. I can test the battery voltage when the Teensy boots, and if it is lower than 3.7 volts, just show a warning on screen and shut down.
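A minimal sketch of that boot-time check, under assumptions that are not in the post: the battery is wired to an analog pin through a 2:1 resistor divider (a full LiPo is above 3.3 V), and the ADC is 10-bit with a 3.3 V reference. The constant and function names are made up; on the Teensy the raw count would come from `analogRead()`.

```cpp
#include <cassert>

// Assumptions for illustration only: 10-bit ADC, 3.3 V reference, and a
// 2:1 resistor divider between the battery and the analog pin.
constexpr double kAdcRefVolts  = 3.3;
constexpr int    kAdcMaxCount  = 1023;
constexpr double kDividerRatio = 2.0;
constexpr double kCutoffVolts  = 3.7;   // threshold mentioned in the post

// Convert a raw ADC count into the battery voltage before the divider.
double battery_volts(int adc_count) {
    assert(adc_count >= 0 && adc_count <= kAdcMaxCount);
    return adc_count * kAdcRefVolts / kAdcMaxCount * kDividerRatio;
}

// True if the program should show a warning and shut down.
bool battery_too_low(int adc_count) {
    return battery_volts(adc_count) < kCutoffVolts;
}
```

In `setup()` this would be something like `if (battery_too_low(analogRead(pin))) { /* warn, then power off */ }`, with the pin and the power-off mechanism depending on the final circuit.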

The linear power supply is [+] efficient and [+] small, but [-] can't output 5 V, [-] has a low maximum output current, and [-] has a 500 mV dropout voltage at 500 mA, so the battery is unusable once it drops to 3.7 V (the regulator needs about 3.3 V + 0.5 V = 3.8 V in at full load). That is really bad, as the over-discharge cutoff on lithium battery protection boards only kicks in at 2.4 V. I also need 5 V for the USB host port for keyboards and mice. Because the host port is not needed at all times, I want to use an existing power bank circuit that turns on automatically when peripherals are connected (existing circuits are great, maybe more efficient and safer), so there is no boosting when it is not needed.
Using the power bank circuit that came with my Li-polymer battery (it came from a power bank) [+] can output more current, so it can also power keyboards, mice and hubs, [+] is safe, with protection, and [+] only powers on when there is current draw (the Teensy by itself may not be able to keep it on, but maybe it can together with the LCD backlight) and/or a USB device is connected. Some people add resistors to increase the current draw, and that is bullshit. Another problem is how to control whether the converter starts or not; having no complete control sucks.

The NumWorks calculator I am referencing uses an STM32F7 and will work at 2.8 V, but my Teensy running at 600 MHz will not. And it doesn't need 5 V because it doesn't have a USB host port. Powering a battery-powered project is a pain in the ass.
 
The efficiency of the cheap Chinese USB boost circuits is worrying. They cost about 40 cents and are guaranteed to turn on and stay on, but there is no way to shut them down.
The damn power bank circuit that comes with the original power bank needs a double button press to enter forced output mode (stays on for 2 hours).
I may have to abandon the power bank circuit (but it has QC3.0 quick charge capabilities...)
The NumWorks has a 1450 mAh battery, and I am planning to use a 5000 mAh (3.7 V) one.
The NumWorks claims a use time of 22 hours (actually 11 hours at full brightness without sleeping, and 11 hours with a dimmed screen).
Powering off doesn't cut the power to the STM32 (using the ENable pin of the 2.8 V power supply) but just puts it to sleep, and somehow it achieves several years of standby. The ENable pin is connected to the reset button, so it is like the 200 microamp plan above plus the Teensy's sleep current.

I think the reason I really want to cut the power completely is that I underestimated the Teensy's sleep mode, as I don't own a Teensy yet. Another reason is that I will need a 5 V output, so shutting off the boost circuit is necessary, or maybe not; I don't know how much current a boost circuit leaks with no load. Maybe a MOSFET? When the button is pressed, it turns on the MOSFET, and a GPIO pin takes over holding it on immediately after boot; the Teensy can then turn itself off by switching the MOSFET off. That means I will need to hold the button until the Teensy boots.

Sorry for my horrible English, I am Taiwanese, and sorry for messy posts with no subject and ideas all over the place, really bad at this. Oops.
 
DualScreen Design.jpg
ArrowKeys Design.jpg
JoystickAndArrowKeys Design.jpg
And these are what I came up with, but they turned out much wider than I would like, coming in at roughly 9-10 cm.
 
...
I am also curious about the current draw at each clock speed, if anyone happens to know. I could measure it myself after I actually get my T4.0, or maybe T4.1
T4.1 is FASTER!! MORE PINS!!!
Really want it

There is a current versus CPU speed graph posted on forum 'somewhere'.
T_4.1 Faster - core speed should be the same as it is using the same 1062 MCU core in a larger package
T_4.1 more pins - YES
 
ILI9488_t3 supports the palette method I described above, although it calls it pallet
Yep sometimes some of us don't spell very well. :0

Note: the main reason we added the palette mode to ILI9488_t3 was that when we started playing with it, it was with the T4 Beta 1 board, which had half as much memory and so did not have enough for a full 320*480*2 buffer, so we added the ability to do it with 256 colors.... Later, when T4 B2 came out, we added the ability to use the full-size buffer, but left in the ability to use the smaller one, as you can then use it on T3.5/6...

The reason this library, when doing DMA updates of the screen, uses smaller local buffers and converts the frame buffer into the actual color data being sent out is that the frame buffer does not store the data in the format the display needs (3 bytes per pixel), so I need to do something. So far I don't know of any direct way to do this with DMA, where for example you give it a source buffer as the SOURCE and the SPI output register as the destination, and maybe have some intermediary that uses the source as a palette index and sends the translated data to the destination... So I did it manually. As a side effect, it also took care of the data consistency issue (actual memory versus cache)...
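A minimal sketch of that manual translation step, not the actual ILI9488_t3 code: it expands a chunk of 8-bit palette indices into 3 bytes per pixel in a small output buffer, which is what would then be handed to DMA. The RGB565 palette format, the function name, and the exact bit placement are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Expand a chunk of 8-bit palette indices into a 3-bytes-per-pixel stream.
// `palette` is assumed to hold RGB565 entries covering every index used;
// `out` must have room for 3 * count bytes. Returns the bytes written.
std::size_t expand_palette_chunk(const uint8_t* indices, std::size_t count,
                                 const uint16_t* palette, uint8_t* out) {
    assert(indices && palette && out);
    std::size_t n = 0;
    for (std::size_t i = 0; i < count; ++i) {
        const uint16_t c = palette[indices[i]];
        // Move each RGB565 field into the top bits of its own byte.
        out[n++] = static_cast<uint8_t>((c & 0xF800) >> 8);  // red, 5 bits
        out[n++] = static_cast<uint8_t>((c & 0x07E0) >> 3);  // green, 6 bits
        out[n++] = static_cast<uint8_t>((c & 0x001F) << 3);  // blue, 5 bits
    }
    return n;
}
```

Calling this on one small chunk while DMA sends the previous one is the general idea; because the CPU writes the small buffer right before it is sent, it is also a natural place to handle the cache flush.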

As for needing absolute speed, for things like scrolling... Sometimes you may need or want to look at different alternatives... You may find that just doing it the straightforward way is fast enough... For example, starting a DMA output over SPI may be fast enough, as you can compute the next screen and have it ready to output when the previous output is done. And/or you can do what Frank B does with the ILI9341 in his game modules, where he starts up continuous DMA outputs (some of our other libraries support this as well) and then times loading the data into memory behind the update area, so the screen keeps updating as fast as possible... Again, with the ILI9488_t3 code there is more overhead, as the code has to copy/translate the data on each pass...

So again you may want to look at other alternatives, maybe both hardware and software.

Example: In the thread: https://forum.pjrc.com/threads/59456-TFT-3-5-quot-display-(320x480)-ILI9488-or-HXD8357D a few of us have or will be experimenting with the HXD... display which has the same resolution and supports SPI with 16 bit pixels. And I believe we can/will support it from either the ILI9341_t3n and/or ILI9488_t3 library... So we can do updates of the screen without doing pixel translations and it cuts the number of bytes transferred to update the screen by a third...
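To put a number on that "by a third", here is a small sketch of the arithmetic, assuming a full 320x480 update and ignoring command overhead and gaps between transfers. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for one full-screen update at a given pixel size.
constexpr uint32_t frame_bytes(uint32_t width, uint32_t height,
                               uint32_t bytes_per_pixel) {
    return width * height * bytes_per_pixel;
}

// Seconds to clock that many bytes out over SPI at the given SCK rate,
// ignoring command overhead and gaps between transfers.
double spi_frame_seconds(uint32_t bytes, double sck_hz) {
    assert(sck_hz > 0.0);
    return bytes * 8.0 / sck_hz;
}
```

`frame_bytes(320, 480, 3)` is 460800 and `frame_bytes(320, 480, 2)` is 307200, so at a 20 MHz SPI clock a full update drops from about 0.184 s to about 0.123 s.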

Other things to consider/try - Many of these displays have some hardware support built in for scrolling. You can often define a region of the display that can be scrolled horizontally and/or vertically. You can often do this by sending a set of commands that defines the region in the display's memory where the scroll will happen, and then you can tell it to scroll by setting a new starting index... Each of these different displays may or may not support something like this, and they probably have different ways to do it...

But in theory it can work like this: once you have the region defined, you should be able to send a simple command that says logically scroll the area by N pixels, and then do an update output to fill in the new area that was uncovered. It has been a long time since I played around with stuff like this, but potentially some of the displays may have the ability to accept data into an area of memory that is not currently visible and then scroll? Again, one would need to look it over. For example, there are sections in the HX8357-D manual talking about scrolling...
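As a rough sketch of what those commands look like: the ILI9488 and similar controllers use command 0x33 to define the vertical scroll region and 0x37 to set the scroll start line. The helpers below only build the byte sequences. The struct and function names are made up, the 480-line total is an assumption about the panel, and the datasheet for the specific controller should be checked before relying on this.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr uint16_t kPanelLines = 480;  // assumed lines along the scroll axis

// One command byte followed by its parameter bytes (hypothetical type).
template <std::size_t N>
struct DisplayCommand {
    uint8_t command;
    std::array<uint8_t, N> params;
};

inline uint8_t high_byte(uint16_t v) { return static_cast<uint8_t>(v >> 8); }
inline uint8_t low_byte(uint16_t v)  { return static_cast<uint8_t>(v & 0xFF); }

// 0x33: fixed top area, scrolling area, fixed bottom area, all in lines.
// The three areas must add up to the full panel height.
DisplayCommand<6> define_scroll_area(uint16_t top_fixed, uint16_t scroll_lines,
                                     uint16_t bottom_fixed) {
    assert(top_fixed + scroll_lines + bottom_fixed == kPanelLines);
    DisplayCommand<6> cmd{};
    cmd.command = 0x33;
    cmd.params = {{high_byte(top_fixed), low_byte(top_fixed),
                   high_byte(scroll_lines), low_byte(scroll_lines),
                   high_byte(bottom_fixed), low_byte(bottom_fixed)}};
    return cmd;
}

// 0x37: frame memory line shown at the top of the scrolling area.
DisplayCommand<2> set_scroll_start(uint16_t line) {
    assert(line < kPanelLines);
    DisplayCommand<2> cmd{};
    cmd.command = 0x37;
    cmd.params = {{high_byte(line), low_byte(line)}};
    return cmd;
}
```

Scrolling by N lines then means sending a new 0x37 with the start line advanced by N (wrapping inside the scroll area) and redrawing only the N lines that were uncovered.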
 
No graph for the T4 found, only one from 2016.
But someone really nice posted his results:
16 MHz / 36 mA
24 MHz / 36 mA
48 MHz / 41 mA
72 MHz / 44 mA
96 MHz / 53 mA
120 MHz / 53 mA
200 MHz / 54 mA
300 MHz / 63 mA
400 MHz / 65 mA
500 MHz / 68 mA
600 MHz / 83 mA
And the PJRC site says 100 mA at 600 MHz, so that is actually pretty impressive.
 
Yeah, just throwing things into the buffer and letting it send is great. I just worry about tearing effects with SPI, and the wipe animation when updating big chunks. Using LCD driver commands doesn't feel quite right... I will definitely try it after I receive my Teensy, LCD and other stuff. Because of space constraints I will desolder the LCD and touch ribbons from the board, which also means I can set it to SPI if I want.

Correct me if I'm wrong, I *think* DMA is called that because it directly accesses the buffer memory and throws things into it, unlike normal functions that access it through the stack thing.
I don't know, I think I am just completely off. Better start learning computer architecture stuff and assembly.
 
Good luck, sounds like a fun project...

As for using LCD driver commands... Like scrolling, personally, if I were developing something that had a large portion of the screen scrolling, that would be one of the first things I would look at, as almost all of the display drivers have those capabilities, and my guess is they added them to solve these very types of issues... But that is me...

DMA: https://en.wikipedia.org/wiki/Direct_memory_access

It is about the ability to move data without tying up the main processor, as mentioned earlier in this thread.

The stack thingy is something completely different, as the image from the T4 product page shows.

teensy4_memory.png

It is also talked about in the thread: https://forum.pjrc.com/threads/57326-T4-0-Memory-trying-to-make-sense-of-the-different-regions

Your normal code can access memory in any of the normal areas, such as the DTCM (Where global variables are defined, that are initialized or not initialized). Likewise from the RAM2 section where you might have variables that were defined with the DMAMEM attribute or allocated with something like malloc or new and from Flash if you have constant data defined with PROGMEM attribute.

When you are talking about stack, that is the portion of the memory image shown in RAM1 where it says Local Variables.
These are those things that are defined within a program function:
That is if you have something like:
Code:
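int my_global_array[100];      // global: fixed address, exists for the whole program
void my_function() {
    int my_stack_array[100];   // local: lives on the stack only during this call
    // ...
}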
my_stack_array is on the stack, which implies that when this function exits, that array no longer exists. More specifically, if you call another function, its local variables (and other things like parameters, return addresses...) will write over the same portion of memory that my_stack_array was using. Whereas my_global_array will continue to exist and have a specific address, in this case still in DTCM, in the zeroed variables portion (as no initial values were given)...
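A small sketch that makes the lifetime difference visible without touching memory that has gone out of scope. The names are made up for illustration: the local counter is re-created on every call, while the global keeps its value between calls.

```cpp
#include <cassert>

int call_count_global = 0;      // static storage: lives for the whole program

// The local lives on the stack and is re-created on every call,
// so this always returns 1.
int bump_local() {
    int call_count_local = 0;   // gone again when the function returns
    return ++call_count_local;
}

// The global keeps its value between calls, so this keeps counting up.
int bump_global() {
    return ++call_count_global;
}
```

Calling each function twice gives 1, 1 for `bump_local()` and 1, 2 for `bump_global()`.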
 
Yeah, I saw it earlier today. The fastrun code is a bit confusing, and Paul says the function to use it is on his to-do list?

Also, is there a good way to run a program on the Teensy that was written on the Teensy or loaded from an SD card? Can I put my code somewhere in memory and make the Teensy run it (maybe not), or do I need to write my own runtime (I have probably misunderstood what a runtime is) and run my program in an existing interpreted language or my own, or somehow compile it and run it with a custom instruction set or an existing one? Both of these are, I would say, not native?
For example, do executables on Windows run natively on the CPU, or what connects them together? Yeah, I am a newbie.
 