[posted] VGA out for Teensy 4.0/4.1

Hi :)

I was involved in the first Maximite and the Colour Maximite; my job, so to speak, was to make and fix the keyboard layouts for some countries.
After that my work evolved and I was no longer able to stay active in the project.
With a German guy I made the first STM32 Maximite, but he was not able to stay, and as I did not have a lot of time on my side either, the project was taken over by Peter Matter. Geoff Graham, who invented the BASIC interpreter, now continues with Peter to evolve the project into the new Colour Maximite 2.
I just made the FR keyboard layout and fixed a bug in the German keyboard layout; my involvement in this project stops there. With more time it would be easier to take an active part, but ... hmmm ... that's not the case :rolleyes:
Soon I go to Hinkley Point, close to Bristol, to install 7 robots I programmed here in France. It's for the new EPR power plant, and I will be a little overbooked :)
But as long as I'm at home (this week and next, but not sure) I will try to play with the T4.1, and as I said, I have known this MCU for little more than one week now.
The Maximite 2 uses the STM32H7xx MCU, and for memory it uses 8 MB of SDRAM; it's this card : https://www.waveshare.com/product/mcu-tools/core-boards-compact-boards/stm32-core/coreh743i.htm
The VGA is done directly by the integrated 2D engine (LTDC + DMA2D); that's why I know VGA so well. I first did VGA in 2013 with this project : https://www.youtube.com/watch?v=axJRn_WZFY0
I am a fan of the C64 so I liked the idea of running basic on a modern CPU
Well , for me it was the ... Amiga ... ;)

I finished the small PCB this morning.
I added a joystick connector for a second joystick and I had to remap the analog joypad + button to other pins for convenience.

I have been playing with the video PLL a bit this PM.
Best I can get is 544x480.
Higher frequency also results in rubbish on screen.
I don't fully understand the setup of the video PLL. I need to read the chapter better. From what I understand:

Reading the spec DIV_SELECT is between 27 and 54.
POST_DIV_SELECT is 1,1/2 or 1/4.
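
As a rough sketch of how those two fields combine, the helper below computes the video PLL output frequency. It assumes a 24 MHz reference and the usual fractional-PLL relation (reference times DIV_SELECT plus NUM/DENOM, then the post divider); `video_pll_hz` and `kRefHz` are illustrative names, not part of any library, and the formula should be checked against the reference manual before relying on it.

```cpp
#include <cstdint>

// Assumption: the video PLL is fed by a 24 MHz reference.
constexpr double kRefHz = 24000000.0;

// Hypothetical helper: PLL output in Hz for a given setup.
//   div_select : integer multiplier, 27..54 as quoted from the spec above
//   num/denom  : fractional part added to the multiplier (num < denom)
//   post_div   : 1, 2 or 4, i.e. the POST_DIV_SELECT choices 1, 1/2, 1/4
// Returns 0.0 when a parameter is outside those ranges.
double video_pll_hz(uint32_t div_select, uint32_t num, uint32_t denom, uint32_t post_div) {
    if (div_select < 27 || div_select > 54) return 0.0;
    if (denom == 0 || num >= denom) return 0.0;
    if (post_div != 1 && post_div != 2 && post_div != 4) return 0.0;
    double multiplier = div_select + static_cast<double>(num) / denom;
    return kRefHz * multiplier / post_div;
}
```

For example, `video_pll_hz(42, 0, 1, 4)` gives 252 MHz under these assumptions; divided by 10 further down the clock tree that would be 25.2 MHz, close to the nominal 25.175 MHz pixel clock of 640x480.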

I tried your version but you still don't get 640x480. If you print a digit pattern you will see that you also don't see 80 characters (8x80=640 pixels)
Hi ,
I haven't tried characters yet; I have just tried to get a better image. It's as if FlexIO1 and FlexIO2 are shifted by 2 pixels, but I don't understand how it's all done ...
If you have some docs that explain how it works, it would be good for me to learn it :)
And when your PCB and final pinout are set, don't forget to make a schematic to explain how everything is connected; I will do the same.

The 2 DMAs cannot be started exactly at the same time as the DMA has a kind of queue of control.
This is why the RRRG and GGBB per pixel are not in phase.

You have to compensate by shifting the pixels.
I could compensate by modifying the source address (+2) for one of the copies, but somehow that only works for the lowest resolution.
Another solution is to change the frame buffer itself (interleave the nibbles of the pixels)

R0R0R0G0 R1R1R1G1 R2R2R2G2
G2G2B2B2 G3G3B3B3 G4G4B4B4
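
The interleaving idea above could be sketched as follows: each byte of the rebuilt line keeps the high nibble (RRRG) of pixel i and takes the low nibble (GGBB) of pixel i + 2, matching the diagram. `skew_low_nibble` is a hypothetical helper, the direction of the skew depends on which DMA actually starts first, and the last pixels get a zero low nibble as a placeholder choice.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: rebuild one line so that byte i carries the high nibble
// (RRRG) of pixel i and the low nibble (GGBB) of pixel i + skew.
// Pixels past the end of the line contribute a zero low nibble (assumption).
std::vector<uint8_t> skew_low_nibble(const std::vector<uint8_t>& line, std::size_t skew) {
    std::vector<uint8_t> out(line.size());
    for (std::size_t i = 0; i < line.size(); ++i) {
        uint8_t high = static_cast<uint8_t>(line[i] & 0xF0);
        uint8_t low = 0;
        if (i + skew < line.size()) {
            low = static_cast<uint8_t>(line[i + skew] & 0x0F);
        }
        out[i] = static_cast<uint8_t>(high | low);
    }
    return out;
}
```

With this layout both DMA copies can read the same aligned bytes, and the 2-pixel phase difference is absorbed by the frame buffer content instead of the source address.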

I verified the V-pulse on my setup and I have 16ms BTW.
I have to go now.
I will continue and push the code in the evening.
Some time ago, Bitluni made a lot of tests of VGA on the ESP32 MCU.

I compiled all the timings and put them in a PDF that you can take here :

It will probably help you with all the other resolutions you want to use on VGA :)

And I finally found some information on the T4.1 that can help me a little to understand this MCU; it was here :

For the FlexIO they don't explain why you can't use any pins you want; it would be easy to make an 8-bit parallel port that we could use directly without splitting it into 2 distinct parts, which gives the pixel shift :rolleyes:

As you said, you checked the vsync timing and it was good; I did it too and strangely, now it's good too .. 16.74 ms ... :) ... don't ask why ... he he
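
For reference, the expected vertical period can be worked out from the pixel clock and the total frame size. The sketch below assumes the standard 640x480@60 totals (800 pixels per line and 525 lines including blanking); `frame_period_ms` is only an illustrative name.

```cpp
// Frame period in milliseconds for a given pixel clock and total frame size
// (visible area plus blanking). Illustrative helper, not from VGA_T4.
double frame_period_ms(double pixel_clock_hz, unsigned h_total, unsigned v_total) {
    return 1000.0 * h_total * v_total / pixel_clock_hz;
}
```

With the nominal 25.175 MHz clock this gives about 16.68 ms per frame, so a measured 16.74 ms would correspond to a pixel clock slightly below nominal, roughly 25.09 MHz.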

Thanks again for your support.
I understand the video PLL now and could use it in place of the SW PLL.
It does not really improve things but at least I learned something new!

I found why I have so much trouble with the timing.
The problem really is the 2 DMA copies running at the same time.
If I only use one and show 4 bits (RRRG or GGBB, it does not matter), the timings are all logical with proper values for front and back porch, and the image is also perfectly centered at 640x480

As soon as the 2 DMAs are started (even with 32-bit aligned copies), the timings all become bad.
Not sure why.

The reason why you don't have 8 bits available on one FlexIO is again the bonding and the functions offered at pin level.


In fact there is:

There is one possibility to use 8 pins at flexio1 on T4.1.
2, 3, 4, 5, 33, 49, 50, 52, 54 => pins
4, 5, 6, 8, 7, 13, 14, 12, 15 => flexio

But then I have to use 16 bits to store an 8-bit color in memory, which I find wasteful. And the pins are not that easy to access.
I will investigate other possibilities with this track...I think it is the only way to get 640x480 on the T4.1 (not the T4.0). Unless 4bits color is enough...;-(
Pins 48 to 54 are in use for the PSRAM , so , it's not possible to use them :)
I have taken a look at the LCDIF pins with the NXP pin tool; all that we need is on FLEXIO2, but some pins are not accessible: the T4.1 just doesn't have the right pins connected to the world.
It's just bad that FLEXIO3 doesn't have DMA ....
Now, accessing 8 bits from 16 bits is not a waste if you manage to access 2 different 8-bit values in this 16-bit word. You shift by 8 bits, not by 16 bits, to access the pixel color :)
And if you want your 12-bit colors, then you have to use 16 bits too.
I just got a stupid idea ... but I'm not sure it's so stupid :) and it goes in the totally opposite direction from VGA ...
As all these emulators emulate ooOOooOOooLD machines, and all these old computers used PAL or NTSC signals to generate composite video, is there not a way to create such a signal on the MQS pin?
I know there was something like that on the ESP32, and it would resolve a lot of problems if it's possible to do it on the T4.1: only 1 pin needed for video, and the frequency is low, 14 MHz or something like that.
There is no DAC on the Teensy 4.
I am not sure MQS can be used for video signal.

BTW Bitluni used a tricky feature of the DAC on the ESP32.
It was possible to use the I2S to output on the DAC output of the ESP32.
I also think its VGA feature was using some tricky feature of the I2S block (to use it as some shift register?)
I watched the video some time ago; I cannot remember.

Maybe using the DMA in scatter/gather mode is our solution?
Switch destination every 4 bytes?
I have run out of ideas to be honest...
You could always use 8 bits of FlexIO3... It would be more processor intensive because you would need to use an interrupt to load the shift buffer instead of DMA. One thing you can do to decrease the number of interrupts is to use all four 32-bit shift buffers, which could hold 16 pixels worth of data...
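
To make the "four 32-bit shift buffers hold 16 pixels" arithmetic concrete, here is a small host-side sketch that packs 16 one-byte pixels into four words. It assumes the shifter puts the least significant byte of each word on the pins first, so pixel 0 goes into bits 7..0; `pack_pixels` is a hypothetical helper.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: pack 16 one-byte pixels into four 32-bit words,
// pixel 0 in the low byte of word 0 (assumes LSB-first shifting).
std::array<uint32_t, 4> pack_pixels(const std::array<uint8_t, 16>& px) {
    std::array<uint32_t, 4> words{};
    for (std::size_t i = 0; i < px.size(); ++i) {
        words[i / 4] |= static_cast<uint32_t>(px[i]) << (8 * (i % 4));
    }
    return words;
}
```

An interrupt handler would then write these four words to the four shift buffers once per 16 pixels instead of once per 4, which is where the reduction in interrupt rate comes from.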

I wonder if it's possible to synchronize FlexIO1 and FlexIO2 by using a single external trigger signal? Not sure how to achieve that.
Well, I'm not good enough with this MCU to tell you whether it's OK or not to trigger 2 FlexIOs with only one signal :)
But for FLEXIO3 it's sure that it will slow down everything without DMA.
And like Jean Marc, with this pin arrangement on the T4.1 I don't have a lot of ideas left in stock ...
Is it possible to combine 4 shift registers to store 16 pixels in one go instead of 4?

How do I do this?

Now it copies 4 bytes per major loop into a single 32bits shift reg and I use SHIFTS_PER_TRANSFER = 4
FLEXIO2_TIMCMP0 = ((SHIFTS_PER_TRANSFER*2-1)<<8) | ((flexio_clock_div/2-1)<<0);

flexio2DMA.TCD->NBYTES = 4;
flexio2DMA.TCD->SOFF = 4;
flexio2DMA.TCD->SLAST = 0;
flexio2DMA.TCD->BITER = maxpixperline / 4;
flexio2DMA.TCD->CITER = maxpixperline / 4;
flexio2DMA.TCD->DOFF = 0;
flexio2DMA.TCD->DLASTSGA = 0;

What should the above parameters become?
Hi :)
I just tested your new VGA_T4 lib; now the 640x480 is perfect. It's just a little blue, but no more pixel shift ;)
Your code is so far away from what I understand of the T4.1 !! Where do you get all the information?
The reference manual is OK for general information, but then you have to find examples to really understand what to do :)
As I see it, the emulators will be perfect if your VGA code becomes stable with no shift on screen ....
I can't wait for your next news ... he he ...
Is it possible to combine 4 shift registers to store 16 pixels in one go instead of 4?

How do I do this?
I haven't been able to test the following for VGA, but it is nearly the same as the configuration I am using for the SmartMatrix driver. Please forgive any errors!

First, you need to configure all 4 shift buffers. Set SHIFTCFG[INSRC] = 1 for each shifter to use the next shifter's output as input. Only shifter 0 can drive parallel pins, so the pins need to be disabled for shifters 1-3.
  uint32_t timerSelect, timerPolarity, pinConfig, pinSelect, pinPolarity, shifterMode, parallelWidth, inputSource, stopBit, startBit;
  uint32_t triggerSelect, triggerPolarity, triggerSource, timerMode, timerOutput, timerDecrement, timerReset, timerDisable, timerEnable;

  // Shifter 0 registers  
  parallelWidth = FLEXIO_SHIFTCFG_PWIDTH(7);  // 8-bit parallel shift width
  pinSelect = FLEXIO_SHIFTCTL_PINSEL(0);      // Select pins FXIO_D0 through FXIO_D7
  inputSource = FLEXIO_SHIFTCFG_INSRC*(1);    // Input source from next shifter
  stopBit = FLEXIO_SHIFTCFG_SSTOP(0);         // Stop bit disabled
  startBit = FLEXIO_SHIFTCFG_SSTART(0);       // Start bit disabled, transmitter loads data on enable 
  timerSelect = FLEXIO_SHIFTCTL_TIMSEL(0);    // Use timer 0
  timerPolarity = FLEXIO_SHIFTCTL_TIMPOL*(1); // Shift on negedge of clock 
  pinConfig = FLEXIO_SHIFTCTL_PINCFG(3);      // Shifter pin output
  pinPolarity = FLEXIO_SHIFTCTL_PINPOL*(0);   // Shifter pin active high polarity
  shifterMode = FLEXIO_SHIFTCTL_SMOD(2);      // Shifter transmit mode
  FLEXIO2_SHIFTCFG0 = parallelWidth | inputSource | stopBit | startBit;
  FLEXIO2_SHIFTCTL0 = timerSelect | timerPolarity | pinConfig | pinSelect | pinPolarity | shifterMode;

  // Shifter 1-3 registers are identical except with pin output disabled
  parallelWidth = FLEXIO_SHIFTCFG_PWIDTH(7);  // 8-bit parallel shift width
  inputSource = FLEXIO_SHIFTCFG_INSRC*(1);    // Input source from next shifter
  stopBit = FLEXIO_SHIFTCFG_SSTOP(0);         // Stop bit disabled
  startBit = FLEXIO_SHIFTCFG_SSTART(0);       // Start bit disabled, transmitter loads data on enable 
  timerSelect = FLEXIO_SHIFTCTL_TIMSEL(0);    // Use timer 0
  timerPolarity = FLEXIO_SHIFTCTL_TIMPOL*(1); // Shift on negedge of clock 
  pinConfig = FLEXIO_SHIFTCTL_PINCFG(0);      // Shifter pin output disabled
  shifterMode = FLEXIO_SHIFTCTL_SMOD(2);      // Shifter transmit mode
  FLEXIO2_SHIFTCFG1 = parallelWidth | inputSource | stopBit | startBit;
  FLEXIO2_SHIFTCTL1 = timerSelect | timerPolarity | pinConfig | shifterMode;
  FLEXIO2_SHIFTCFG2 = parallelWidth | inputSource | stopBit | startBit;
  FLEXIO2_SHIFTCTL2 = timerSelect | timerPolarity | pinConfig | shifterMode;
  FLEXIO2_SHIFTCFG3 = parallelWidth | inputSource | stopBit | startBit;
  FLEXIO2_SHIFTCTL3 = timerSelect | timerPolarity | pinConfig | shifterMode;

The FlexIO timer and the DMA request should be configured to trigger on Shifter 3 instead of Shifter 0 (because shift buffer 3 will be loaded last), and the SHIFTS_PER_TRANSFER is 16 instead of 4:
  // Timer 0 registers
  timerOutput = FLEXIO_TIMCFG_TIMOUT(1);      // Timer output is logic zero when enabled and is not affected by the Timer reset
  timerDecrement = FLEXIO_TIMCFG_TIMDEC(0);   // Timer decrements on FlexIO clock, shift clock equals timer output
  timerReset = FLEXIO_TIMCFG_TIMRST(0);       // Timer never reset
  timerDisable = FLEXIO_TIMCFG_TIMDIS(2);     // Timer disabled on Timer compare
  timerEnable = FLEXIO_TIMCFG_TIMENA(2);      // Timer enabled on Trigger assert
  stopBit = FLEXIO_TIMCFG_TSTOP(0);           // Stop bit disabled
  startBit = FLEXIO_TIMCFG_TSTART*(0);        // Start bit disabled
  triggerSelect = FLEXIO_TIMCTL_TRGSEL(1+4*(3)); // Trigger select Shifter 3 status flag
  triggerPolarity = FLEXIO_TIMCTL_TRGPOL*(1); // Trigger active low
  triggerSource = FLEXIO_TIMCTL_TRGSRC*(1);   // Internal trigger selected
  pinConfig = FLEXIO_TIMCTL_PINCFG(0);        // Timer pin output disabled
  timerMode = FLEXIO_TIMCTL_TIMOD(1);         // Dual 8-bit counters baud mode
  #define SHIFTS_PER_TRANSFER 16 // Shift out 8 bits 16 times with every transfer = four 32-bit words = contents of Shifters 0-3 
  FLEXIO2_TIMCFG0 = timerOutput | timerDecrement | timerReset | timerDisable | timerEnable | stopBit | startBit;
  FLEXIO2_TIMCTL0 = triggerSelect | triggerPolarity | triggerSource | pinConfig | timerMode;
  FLEXIO2_TIMCMP0 = ((SHIFTS_PER_TRANSFER*2-1)<<8) | ((flexio_clock_div/2-1)<<0);
  FLEXIO2_SHIFTSDEN |= (1<<3);
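
To see what value the TIMCMP0 expression above actually produces, here is a host-side sketch of the same formula. It relies on the dual 8-bit counter layout, where the upper byte holds (shifts * 2) - 1 and the lower byte holds (divider / 2) - 1; `timcmp_baud` is an illustrative name and the divider value in the example is an assumption, not taken from the thread.

```cpp
#include <cstdint>

// Hypothetical helper mirroring the FLEXIO2_TIMCMP0 expression above.
// Upper byte: number of shifts per transfer, encoded as shifts * 2 - 1.
// Lower byte: FlexIO clock divider, encoded as divider / 2 - 1.
constexpr uint32_t timcmp_baud(uint32_t shifts_per_transfer, uint32_t flexio_clock_div) {
    return ((shifts_per_transfer * 2 - 1) << 8) | ((flexio_clock_div / 2 - 1) << 0);
}
```

With 16 shifts per transfer and an assumed divider of 4, the result is 0x1F01; with the earlier 4 shifts and a divider of 2 it is 0x0700. Since the upper field is 8 bits wide, at most 128 shifts fit in one transfer.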

To configure the DMA transfer to fill up all 4 buffers in a single burst each time it is triggered, you can use a minor loop offset to reset the destination address after each burst. Conveniently, the FlexIO buffer registers are adjacent in memory space. The number of pixels in each line needs to be a multiple of 16.
  unsigned int minorLoopBytes, minorLoopIterations, majorLoopBytes, majorLoopIterations;
  int destinationAddressOffset, destinationAddressLastOffset, sourceAddressOffset, sourceAddressLastOffset, minorLoopOffset;
  volatile uint32_t *destinationAddress, *sourceAddress;

    DMA_CR |= DMA_CR_EMLM; // Enable minor loop mapping so that we can have a minor loop offset
    minorLoopIterations = 4; // transfer 4 words with each DMA trigger into 4 FlexIO buffers
    minorLoopBytes = minorLoopIterations * sizeof(uint32_t);
  #define BYTES_PER_PIXEL 1
    majorLoopBytes = maxpixperline * BYTES_PER_PIXEL; // This must be evenly divisible by 16
    majorLoopIterations = majorLoopBytes / minorLoopBytes;
    sourceAddress = (uint32_t*) & gfxbuffer[0];
    sourceAddressOffset = sizeof(uint32_t);
    sourceAddressLastOffset = - majorLoopBytes; // at completion of major loop, reset source address
    destinationAddress = &FLEXIO2_SHIFTBUF0;
    destinationAddressOffset = sizeof(uint32_t);
    minorLoopOffset = -minorLoopIterations * destinationAddressOffset; // reset destination address at end of each minor loop...
    destinationAddressLastOffset = minorLoopOffset; // ...and at end of major loop
    flexio2DMA.TCD->SADDR = sourceAddress;
    flexio2DMA.TCD->SOFF = sourceAddressOffset;
    flexio2DMA.TCD->SLAST = sourceAddressLastOffset;
    flexio2DMA.TCD->DADDR = destinationAddress;
    flexio2DMA.TCD->DOFF = destinationAddressOffset;
    flexio2DMA.TCD->DLASTSGA = destinationAddressLastOffset;
    flexio2DMA.TCD->BITER = majorLoopIterations;
    flexio2DMA.TCD->CITER = majorLoopIterations;
    flexio2DMA.disableOnCompletion(); // disable on completion or else it will be triggered by FlexIO continuously
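
The addressing pattern described above can be checked off-target with a small software model. This is only a sketch of the intended behaviour, not of the eDMA hardware: `simulate` and `Write` are hypothetical names, offsets are counted in 32-bit words, and the major-loop wrap (SLAST/DLASTSGA) is not modelled. Note also that the listing above computes minorLoopBytes and minorLoopOffset but does not show them being written to the TCD's NBYTES word, which is where the eDMA keeps the minor loop offset and its enable bit when EMLM is set; presumably that write is still needed on hardware.

```cpp
#include <cstdint>
#include <vector>

// One logged transfer: which source word went to which destination word.
struct Write {
    int src_word;
    int dst_word;
};

// Software model of a DMA channel. Each trigger runs one minor loop of
// words_per_minor 32-bit transfers. doff_words is added to the destination
// index after every transfer, mloff_words once more after each minor loop.
std::vector<Write> simulate(int triggers, int words_per_minor, int doff_words, int mloff_words) {
    std::vector<Write> log;
    int src = 0;
    int dst = 0;
    for (int t = 0; t < triggers; ++t) {
        for (int w = 0; w < words_per_minor; ++w) {
            log.push_back({src, dst});
            src += 1;          // source always walks forward through the line
            dst += doff_words; // destination steps across the shift buffers
        }
        dst += mloff_words;    // minor loop offset rewinds the destination
    }
    return log;
}
```

With a destination offset of 1 word and a minor loop offset of -4 words, every trigger writes buffers 0, 1, 2, 3 in that order. With both set to 0, all four words of a trigger land in buffer 0, so three of them are overwritten before they can be shifted out.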

By the way, I would configure both FlexIO1 and FlexIO2 to use FLEXIO_SHIFTCFG_PWIDTH(7) and FLEXIO_SHIFTCTL_PINSEL(0). This will select signals FXIO_D0 through FXIO_D7 to be output on both FlexIOs, but since the Teensy 4.1 only has pins 0 through 3 on FlexIO2 and pins 4 through 7 on FlexIO1, the extra signals will be discarded. And this should enable you to use FLEXIO2_SHIFTBUF0 and FLEXIO1_SHIFTBUF0 as your destination registers - no need to use the nibble byte swapped register FLEXIO1_SHIFTBUFNBS0 or to use an unaligned copy.

Hope that helps...
Thanks a lot!
I will try all that today if I have the time between the various family trips planned. It is the last opportunity as tomorrow I am back to work...

The code I had pushed 2 days ago was just the version using the video PLL and for the 640x480 mode, I was only copying the lowest nibble with a single DMA, that is why the image was green/blue (only GGBB was copied!)
(but UAE is looking so much nicer at 640x240, even in blue!)
I am interested to see how the 2 DMAs will interfere by using the 4 shifters combined. I had experimented yesterday but no luck...
I will verify with your input.
The code you propose results in a black screen.
I was looking at your DMA copy code, trying to use it with a single shift register (at least I know what I expect on the screen for that one!)

if I use my original code:
flexio2DMA.TCD->NBYTES = 4*minorLoopIterations;
flexio2DMA.TCD->SOFF = 4*minorLoopIterations;
flexio2DMA.TCD->SLAST = 0;
flexio2DMA.TCD->BITER = maxpixperline / (4*minorLoopIterations);
flexio2DMA.TCD->CITER = maxpixperline / (4*minorLoopIterations);
flexio2DMA.TCD->DOFF = 0;
flexio2DMA.TCD->DLASTSGA = 0;
=> that results a correct full screen image (single DMA copy)

If I set minorLoopIterations=4 (=> it will copy 16 bytes to FLEXIO2_SHIFTBUF0 to 3, I guess), then I get the image 4 times on the left half of the screen (I understand the 4 copies, as only the pixels copied to FLEXIO2_SHIFTBUF0 are used, but why on the left half of the screen?)

If I use your DMA code with minorLoopIterations=1 I get a correct full screen image.
With minorLoopIterations=4 I get a black image.
I noticed that I needed the full DMA setup in the interrupt. Then your DMA config seems to work with minorLoopIterations=4 (at least I get what I would expect with a single shiftbuf: with minorLoopIterations=2 I get a half-screen image, with 4 a smaller image, though not 1/4). So better than with mine...

But I still have a black image as soon as I use 4 shift buffers...
I will continue a bit in the evening, I have to go now.
Thanks for the support...
I don't know exactly what's wrong without seeing the full code, but if you have TCD->DOFF = 0 with minorLoopIterations = 4 then your DMA is copying into single destination SHIFTBUF0 four times in a row before the FlexIO has shifted out all the data, so some data is being lost.

With 4 shift buffers and TCD->DOFF = 4, the triggers for the FlexIO timer and the DMA request need to be set to use shifter 3 instead of shifter 0 (FLEXIO_TIMCTL_TRGSEL(1+4*(3)) and FLEXIO2_SHIFTSDEN |= (1<<3)) so that all the data is correctly copied to the buffers before the shifters are loaded and shifting starts, and all the buffers are empty before the next copy starts. Make sure to set SHIFTS_PER_TRANSFER to 16 or else you may not get correct behavior.
No luck with the code...;-(
I really believe that the DMA copy in the interrupt does not fill the 4 shift registers...
So below code is not behaving as expected.
triggerSelect = FLEXIO_TIMCTL_TRGSEL(1+4*(3)); // Trigger select Shifter 3 status flag
How to prove it is not easy...
I tried keeping the same DMA copy (filling the 4 shift registers) and used SHIFTER1 instead of SHIFTER0 in the single shift register variant.
triggerSelect = FLEXIO_TIMCTL_TRGSEL(1+4*(1));
I get strange behavior too...

I pushed the code as it is now. May be you can have a look.

The variant using 4 shift registers is under below compiler switch (commented out in commit!)

BTW using FLEXIO_SHIFTCFG_PWIDTH(7) and FLEXIO_SHIFTCTL_PINSEL(0) to use destinationAddress = &FLEXIO1_SHIFTBUF0 and avoid the nibble shift is also not working somehow but I tried (not in the commit).
Hi, I found a bug in VGA_T4 and corrected it: pixels could go off the screen, which sometimes caused the T4.1 to freeze. Now it's OK :)
I made 2 new easy functions :
  void draw_h_line(int16_t x1, int16_t y1, uint16_t lenght, vga_pixel color);
  void draw_v_line(int16_t x1, int16_t y1, uint16_t lenght, vga_pixel color);
I modified :
void drawfilledcircle(int16_t x, int16_t y, int16_t radius, vga_pixel fillcolor, vga_pixel bordercolor);
void drawfilledellipse(int16_t cx, int16_t cy, uint16_t radius1, uint16_t radius2, vga_pixel fillcolor, vga_pixel bordercolor);

The function calls are like before; it's just the internal code that has changed.
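
As an illustration of the kind of fix described above (keeping pixels from going off screen), here is a minimal clipped horizontal line on a plain memory frame buffer. It is a sketch, not the VGA_T4 code: `FrameBuffer` and `draw_h_line_clipped` are hypothetical, and `vga_pixel` is assumed to be one byte.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

typedef uint8_t vga_pixel; // assumption: one byte per pixel

// Hypothetical stand-in for the library's frame buffer.
struct FrameBuffer {
    int width;
    int height;
    std::vector<vga_pixel> pixels; // width * height entries, row-major
};

// Draw a horizontal line, writing only the part that lies inside the buffer.
// Returns the number of pixels actually written.
int draw_h_line_clipped(FrameBuffer& fb, int16_t x1, int16_t y1, uint16_t length, vga_pixel color) {
    if (y1 < 0 || y1 >= fb.height) return 0;  // whole line is off screen
    int start = x1;
    int end = static_cast<int>(x1) + length;  // one past the last pixel
    if (start < 0) start = 0;                 // clip on the left
    if (end > fb.width) end = fb.width;       // clip on the right
    int written = 0;
    for (int x = start; x < end; ++x) {
        fb.pixels[static_cast<std::size_t>(y1) * fb.width + x] = color;
        ++written;
    }
    return written;
}
```

Clipping once per primitive like this is cheaper than testing every pixel, and filled circles or ellipses built from horizontal spans would inherit the protection.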

I have a bug in the Ellipse: it's round ... Filled Ellipse is OK. The strange thing is that I have exactly the same code working on STM32 ... he he
I completed the vgatestalign demo; it's a little more active now :eek:

By the way, my old Rigol DS5102M scope died last night :(
I will get a Siglent SDS1204X-E on Wednesday in the hope it will last longer; only 10 years for the Rigol ... My previous old 50 MHz Hameg, which I gave to a friend, is close to 30 years old and is still alive !!

Mods are here :

With each byte split into 2 parts, it will not be easy to solve this for the DMA.
And using a little SPI LCD for the Amiga emulator is not the ideal way :)
An SPI LCD with more resolution would have too slow a refresh rate .... I just hope that a solution exists for VGA ...
I will merge your changes this evening. Maybe we should introduce cropping at pixel level for every primitive...
Another idea would be to use the 2D HW block to provide Blit/ScaledBlit. Of course, with the non-standard 8-bit RGB mode it will be tricky.
Not sure if CLUT mode is supported by the 2D block.
At least for scrolling and sprites it might be interesting to have Blit support in HW.

Maybe the 2D block could also be used to expand each line from 8 bits to 32 bits before the DMA does the copy, and use 32-bit shift registers...you never know ;-)