Call to arms | Teensy + SDRAM = true

They're using LCDIF (and PXP) to output to the display... which is pretty tricky considering the amount of pins shared between CSI and LCDIF... but the main note is they're explicitly not using the eDMA module which is where the slowdown seems to come from with SEMC/SDRAM.
 
They're using LCDIF (and PXP) to output to the display... which is pretty tricky considering the amount of pins shared between CSI and LCDIF
True, I have not really studied the LCDIF, as I don't think we have had any boards with enough pins exported to use it. I might take a look
and see whether both would be covered if we added the 3 or 5 pins mentioned in yesterday's post about CSI...

Other side note: I am currently playing with CSI on the Teensy 4.1 with PSRAM. Which I believe is a lot slower than the SDRAM.

But with a hacked-up example sketch in our new library, I have the VGA app running on the T4.1 with CSI, with the camera buffers defined in PSRAM.
And I am able to read in frames using the CSI's DMA...
1714053603654.png

Sorry the picture of the picture is sort of washed out here... And still lots of stuff to try to figure out how to do, or not do with CSI...
 
@KurtE the dev board has all the eLCDIF pins exposed. Starting from B0_00 to B1_13 I believe.
I posted here a few weeks back that I got it working on a 24 bit display with SDRAM.
 
I think the way they organised the pins was intended for an 8-bit camera connected to CSI and a 16-bit (parallel) display connected to LCDIF. Anything higher in either case and you run out of pins due to them being shared between the modules.
 
@KurtE the dev board has all the eLCDIF pins exposed. Starting from B0_00 to B1_13 I believe.
Thanks, @mjs513 (and others). I updated the Dogbone Excel document again so that all of the LCD pins on the right-hand side of the
board are filled in. I think it should look like:
1714058274282.png

Will send the document back to you and/or could try to do PR with it back to @defragster and/or could put it up in my documents...

But looking at this, I believe the LCD_Data ends at B1_11, so assuming we have the AD_B1_xx pins defined, you should be able to use both sub-systems.
 
B0_00 is LCD_CLK, B0_01 is LCD_ENABLE, B0_02 is LCD_HSYNC.

I guess you could have either a 24-bit LCD, or a 10/16 bit camera without running into conflicts. But still, not with the current board due to the missing CSI data pins.
 
Need to update B0_00 through B0_03 accordingly
Sorry, I did not go all the way up on the SS to those pins... I have now...

But still, not with the current board due to the missing CSI data pins.
Exactly.

Pushed up current version to fork of defragsters github project and issued PR
 
Have the SDRAM board wired up to the ER-TFTM101-1 display. DMA is working on it as well as on the MicroMod, sort of! This is the test sketch I am using, modified to include SDRAM usage:
Code:
#include "RA8876_t3.h"
#include "SDRAM_t4.h"
#include "Teensy41_Cardlike.h"
#include "flexio_teensy_mm.c"

/*
// MicroMod
uint8_t dc = 13;
uint8_t cs = 11;
uint8_t rst = 5;
*/

// SDRAM Board
uint8_t dc = 17;
uint8_t cs = 14;
uint8_t rst = 27;

/*
// T4.1
uint8_t dc = 13;
uint8_t cs = 11;
uint8_t rst = 12;
*/

uint32_t start = 0;
uint32_t end =  0;

uint8_t busSpeed = 20;

RA8876_t3 lcd = RA8876_t3(dc,cs,rst); //(dc, cs, rst)

SDRAM_t4 fb_sdram;

static uint16_t* frameBuffer;
static uint16_t* frameBuffer1;

void setup() {
  while (!Serial && millis() < 3000) {} //wait for Serial Monitor
  Serial.printf("%c SDRAM Dev Board and RA8876 parallel 8080 mode testing (8/16)\n\n",12);
//  Serial.print(CrashReport);
//  pinMode(WINT, INPUT); // For XnWAIT signal if connected and used.

  if(!fb_sdram.begin()) {
    Serial.printf("SDRAM Init Failed!!!\n");
    while(1);
  }

  frameBuffer = (uint16_t*)sdram_malloc(sizeof(flexio_teensy_mm));
  memcpy((uint16_t *)frameBuffer, flexio_teensy_mm, sizeof(flexio_teensy_mm));
 
  frameBuffer1 = (uint16_t*)sdram_malloc(sizeof(teensy41_Cardlike));
  memcpy((uint16_t *)frameBuffer1, teensy41_Cardlike, sizeof(teensy41_Cardlike));

  if(!lcd.begin(busSpeed)) Serial.printf("lcd.begin(busSpeed) FAILED!!!\n");
  delay(100);

  Serial.print("Bus speed: ");
  Serial.print(busSpeed,DEC);
  Serial.println(" MHZ");
  Serial.print("Bus Width: ");
  Serial.print(BUS_WIDTH,DEC);
  Serial.println("-bits");
}

void loop() {
  start = micros();

//  lcd.pushPixels16bitDMA(teensy41_Cardlike,1,1,575,424);    // FLASHMEM buffer
//  lcd.pushPixels16bitDMA(flexio_teensy_mm,530,260,480,320); // FLASHMEM buffer

  lcd.pushPixels16bitDMA(frameBuffer,530,260,480,320);        // SDRAM buffer
  lcd.pushPixels16bitDMA(frameBuffer1,1,1,575,424);           // SDRAM buffer

  end = micros() - start;

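  // Pixels are 16-bit, so the byte count is twice the pixel count.
  Serial.printf("Wrote %lu bytes in %lu us\n\n", (unsigned long)(2UL*((575*424)+(480*320))), (unsigned long)end);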
  waitforInput();
}

void waitforInput()
{
  Serial.println("Press anykey to continue");
  while (Serial.read() == -1) ;
  while (Serial.read() != -1) ;
}
And the result is:
SDRAM_FB.jpg

We used SDRAM as a frame buffer.
Now when we use FLASHMEM as a frame buffer the results are:
FLASHMEM_FB.jpg

We get inconsistent and wavy or sheared pictures. The picture is never the same when pressing a key in loop() to redraw the pictures. The pictures are always the same when using the SDRAM buffers. I have no idea why yet. My 74LVC245 buffer chips should be here tomorrow. I'll wire up a buffer circuit and hopefully that will solve some of the other issues. The ER-TFTM101-1 data pins are only good for 8 mA of current. The 74LVC245s are rated at 50 mA.
At least we know DMA will work on both the MicroMod and the SDRAM board...
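As a rough sanity check on the time printed by the sketch, the best-case transfer time can be estimated from the pixel counts and the bus clock. This is only a host-side sketch: it assumes 16-bit pixels, one bus word per bus clock and no command or setup overhead, and the helper names are made up for illustration.

```cpp
#include <cstdint>

// Hypothetical helper: bytes in a w x h image of 16-bit (RGB565) pixels.
constexpr uint32_t imageBytes(uint32_t w, uint32_t h) {
  return w * h * 2u;
}

// Hypothetical helper: ideal transfer time in microseconds, assuming one
// bus word of busWidthBits is clocked out per bus cycle with no overhead.
constexpr double idealTransferUs(uint32_t bytes, uint32_t busMHz, uint32_t busWidthBits) {
  return static_cast<double>(bytes) / (busWidthBits / 8u) / busMHz;
}

// The two images pushed in loop(): 575x424 and 480x320.
constexpr uint32_t kTotalBytes = imageBytes(575, 424) + imageBytes(480, 320);
```

With these assumptions the two images add up to 794800 bytes, so at 20 MHz the floor is about 39.7 ms on an 8-bit bus and about 19.9 ms on a 16-bit bus. Measured times well above that point at overhead or stalls rather than the bus clock itself.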
 
Looks like flash is just barely too slow to keep up... unless there's code somewhere running directly from flash causing contention?
 
Looks like flash is just barely too slow to keep up... unless there's code somewhere running directly from flash causing contention?
I had not thought of that. What's interesting is it works fine from flash memory on the MicroMod and T41. So maybe there is a conflict somewhere. I could speed up the bus rate and see if that changes anything...
Edit: Went from 2MHz to 24MHz; best results at 20MHz...
 
Was playing around with the JPEGDEC library and did a comparison of using PROGMEM vs SDRAM (running at 198MHz)

Code:
FLash TEST
full sized decode in 28350 us
half sized decode in 21260 us
quarter sized decode in 7320 us
eighth sized decode in 6104 us
SDRAM TEST
56010
full sized decode in 28208 us
half sized decode in 21061 us
quarter sized decode in 7124 us
eighth sized decode in 5861 us
This is using a 640x480 test image that is part of the lib.
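To put the two sets of numbers side by side, here is a small host-side sketch that computes how much faster the SDRAM run was for each scale. The timings are copied from the output above; the struct and function names are just for illustration.

```cpp
#include <cstdint>
#include <cstdio>

// Percentage by which the SDRAM time beats the flash time
// (positive means SDRAM was faster).
constexpr double percentFaster(uint32_t flashUs, uint32_t sdramUs) {
  return 100.0 * (static_cast<double>(flashUs) - sdramUs) / flashUs;
}

struct DecodeTiming {
  const char *scale;
  uint32_t flashUs;
  uint32_t sdramUs;
};

// Values taken from the serial output posted above.
static const DecodeTiming kTimings[] = {
  {"full",    28350, 28208},
  {"half",    21260, 21061},
  {"quarter",  7320,  7124},
  {"eighth",   6104,  5861},
};

inline void printComparison() {
  for (const DecodeTiming &t : kTimings) {
    std::printf("%-8s SDRAM faster by %.2f%%\n", t.scale,
                percentFaster(t.flashUs, t.sdramUs));
  }
}
```

The difference works out to roughly 0.5% at full size and about 4% at eighth size, so for this test the decode time is dominated by the CPU work rather than by where the compressed data lives.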

Here is the code:

C++:
//
// Perf Test
//
#include <JPEGDEC.h>
#include "../test_images/tulips.h" // 640x480 56k byte test image
JPEGDEC jpeg;
// 32-byte aligned alternative (note the pointer type):
//uint8_t *frameBufferSDRAM = (uint8_t *)(((uint32_t)sdram_malloc(640 * 480 + 32) + 32) & 0xffffffe0);
uint8_t* frameBufferSDRAM = (uint8_t*)sdram_malloc(640 * 480 * sizeof(uint8_t));

int JPEGDraw(JPEGDRAW *pDraw)
{
  // do nothing
  return 1; // continue decode
} /* JPEGDraw() */

void setup() {
  Serial.begin(115200);
  delay(100); // allow time for Serial to start
} /* setup() */

void loop() {
  long lTime;
  Serial.println("FLash TEST");
  if (jpeg.openFLASH((uint8_t *)tulips, sizeof(tulips), JPEGDraw)) {
    lTime = micros();
    if (jpeg.decode(0,0,0)) { // full sized decode
      lTime = micros() - lTime;
      Serial.printf("full sized decode in %d us\n", (int)lTime);
    }
    jpeg.close();
  }
  if (jpeg.openFLASH((uint8_t *)tulips, sizeof(tulips), JPEGDraw)) {
    lTime = micros();
    if (jpeg.decode(0,0,JPEG_SCALE_HALF)) { // 1/2 sized decode
      lTime = micros() - lTime;
      Serial.printf("half sized decode in %d us\n", (int)lTime);
    }
    jpeg.close();
  }
  if (jpeg.openFLASH((uint8_t *)tulips, sizeof(tulips), JPEGDraw)) {
    lTime = micros();
    if (jpeg.decode(0,0,JPEG_SCALE_QUARTER)) { // 1/4 sized decode
      lTime = micros() - lTime;
      Serial.printf("quarter sized decode in %d us\n", (int)lTime);
    }
    jpeg.close();
  }
  if (jpeg.openFLASH((uint8_t *)tulips, sizeof(tulips), JPEGDraw)) {
    lTime = micros();
    if (jpeg.decode(0,0,JPEG_SCALE_EIGHTH)) { // 1/8 sized decode
      lTime = micros() - lTime;
      Serial.printf("eighth sized decode in %d us\n", (int)lTime);
    }
    jpeg.close();
  }

  Serial.println("SDRAM TEST");
  Serial.println(sizeof(tulips));

  memcpy(frameBufferSDRAM, tulips, sizeof(tulips));

  if (jpeg.openRAM((uint8_t *)frameBufferSDRAM, sizeof(tulips), JPEGDraw)) {
    lTime = micros();
    if (jpeg.decode(0,0,0)) { // full sized decode
      lTime = micros() - lTime;
      Serial.printf("full sized decode in %d us\n", (int)lTime);
    } else {
      Serial.println("Full sized decode from SDRAM failed!!");
    }
    jpeg.close();
  }
  if (jpeg.openRAM((uint8_t *)frameBufferSDRAM, sizeof(tulips), JPEGDraw)) {
    lTime = micros();
    if (jpeg.decode(0,0,JPEG_SCALE_HALF)) { // 1/2 sized decode
      lTime = micros() - lTime;
      Serial.printf("half sized decode in %d us\n", (int)lTime);
    } else {
      Serial.println("Half sized decode from SDRAM failed!!");
    }
    jpeg.close();
  }
  if (jpeg.openRAM((uint8_t *)frameBufferSDRAM, sizeof(tulips), JPEGDraw)) {
    lTime = micros();
    if (jpeg.decode(0,0,JPEG_SCALE_QUARTER)) { // 1/4 sized decode
      lTime = micros() - lTime;
      Serial.printf("quarter sized decode in %d us\n", (int)lTime);
    } else {
      Serial.println("Quarter sized decode from SDRAM failed!!");
    }
    jpeg.close();
  }
  if (jpeg.openRAM((uint8_t *)frameBufferSDRAM, sizeof(tulips), JPEGDraw)) {
    lTime = micros();
    if (jpeg.decode(0,0,JPEG_SCALE_EIGHTH)) { // 1/8 sized decode
      lTime = micros() - lTime;
      Serial.printf("eighth sized decode in %d us\n", (int)lTime);
    } else {
      Serial.println("Eighth sized decode from SDRAM failed!!");
    }
    jpeg.close();
  }

  delay(5000);
} /* loop() */
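The commented-out allocation in the sketch rounds the address up to a 32-byte boundary, which matches the Cortex-M7 data cache line size and matters once the buffer is shared with DMA. A minimal sketch of that idea follows, using std::malloc as a host-side stand-in for sdram_malloc(); the helper names are made up for illustration and are not part of the SDRAM library.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Round an address up to the next multiple of 32
// (returned unchanged if it is already aligned).
inline uint8_t *alignUp32(void *p) {
  uintptr_t a = reinterpret_cast<uintptr_t>(p);
  return reinterpret_cast<uint8_t *>((a + 31u) & ~static_cast<uintptr_t>(31u));
}

// Allocate `size` usable bytes starting on a 32-byte boundary.
// `rawOut` receives the original pointer, which is the one that must be
// passed to free() later. std::malloc stands in for sdram_malloc() here.
inline uint8_t *allocAligned32(size_t size, void **rawOut) {
  void *raw = std::malloc(size + 31u);  // 31 spare bytes cover the worst case
  if (rawOut) *rawOut = raw;
  return raw ? alignUp32(raw) : nullptr;
}
```

Using uintptr_t instead of uint32_t keeps the cast valid on a 64-bit host as well as on the Teensy. The original line adds 32 rather than 31 before masking, which also gives an aligned address but always moves forward, even when the pointer was already aligned.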
 
@mjs513 I think the PXP can help you with the color conversion, and it might be faster as well. But either way it will be done async and won’t require any CPU for it.
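For comparison, this is the kind of per-pixel work the CPU does when the conversion is done in software. The thread does not say which conversion is meant, so RGB888 to RGB565 is assumed here purely as an example; the code does not touch the PXP, and the function names are made up.

```cpp
#include <cstddef>
#include <cstdint>

// Pack 8-bit R, G, B into one RGB565 pixel (5 bits red, 6 green, 5 blue).
constexpr uint16_t rgb888To565(uint8_t r, uint8_t g, uint8_t b) {
  return static_cast<uint16_t>(((r & 0xF8u) << 8) | ((g & 0xFCu) << 3) | (b >> 3));
}

// Convert `pixels` packed RGB888 pixels (3 bytes each) to RGB565.
inline void convertLine(const uint8_t *src, uint16_t *dst, size_t pixels) {
  for (size_t i = 0; i < pixels; ++i) {
    dst[i] = rgb888To565(src[3 * i], src[3 * i + 1], src[3 * i + 2]);
  }
}
```

A loop like this touches every pixel of every frame, which is the CPU time that handing the job to a hardware block would free up.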
 
I received my 74LVC245 chips and had some fun wiring this rat's nest up:
FALSHMEM_FB_GOOD.jpg

This took care of the bad image when loading from FLASHMEM. Both FLASHMEM and SDRAM are working using 8-bit DMA. The image is correct and repeatable at 20MHz. I only buffered the 8 data signals and put 10K pullups on /RD, /WR, /CS and /RS. Now to try and find out why I
still have the erratic 8-bit reads. Probably software...
 
@KurtE I’ll add all the pins you want. As that will help the community. So go ahead and together with @mjs513 or anyone else, compile a table of the missing pins.

Ideally you’d make a complete table of ALL pins, maybe some of the ones exposed now should be removed. That would be the most structured and correct way of doing it.
I think which new pins could/should be added to support has been shown now in a few threads including:
I hid most of the columns, which shows:
1713990709528.png

The ones in Yellow are the ones we do not have. 3 are critical for CSI (D3-D5)

The PIXCLK/MCLK - There are currently other pins on the board that can work, that is, pins 59-62 (B1_12 - B1_15) are alternatives to the pins we
use on the T4.1...

As some of you know I made a 4.5 (facelift) that adds USB-PD (USB Power Delivery up to 12V). And it also has SDCARD.
Those are the only changes made (moved the boot button position as well), hence the word facelift.

The USB Host port has sort of two modes:
* USB-PD, you supply your own external power via screw terminal connector.
* USB 5V, the 5V from the input USB-C port is also present on the USB-C Host port.

Simply bridge two of the pads to choose between the two.

Here are both images, if you want to make it look even nicer in the spreadsheet.

As soon as someone presents a complete spreadsheet for a gen5 board, I will make that happen!
I'm hoping that a few pins that are currently present can be removed, to give more space under that tight BGA.
Question, what do you mean by complete? That is I have not seen what I would consider a complete one for DB4 or DB4.5. That is one
that shows all of the signals for EMC and now SDIO. Is that what you are wanting?

Side notes on the layout of the boards: For me the layout of the boards is in no real specific order.
Maybe they are for some different usage cases. So for example, if I am wanting to test things out with a camera, tft display, and SD card
currently there are a lot of jumper wires, which in most cases means I end up spending more of my time debugging the wiring than the code....

So if I am serious about trying some stuff out. I would probably create a shield for it, like I did for the Micromod, or the updated board
for the T4.1. Which are pretty cheap to get fabricated. With PCBWay: these boards cost me about: $5 + $20 shipping and usually here
within a week +- a couple of days.

Can and probably may do one or more for DB5. @mjs513 has done one for DB4. If the board is, let's say, 100x60mm it is still $5 for 5 or 10 boards.
If the length is > 100mm it goes up by $18 for 5 of them or $22 for 10 of them. Which is not the end of the world, but makes it less likely that I will order multiple versions... With the current board, could probably create one under 100mm by excluding a couple of the pins at the end, as they
are currently just extra GND and +3.3v... But the only thing missing is +5v, which I may want to provide to the TFT displays...

Sorry, I know, just a few random thoughts.
 
are currently just extra GND and +3.3v... But the only thing missing is +5v, which I may want to provide to the TFT displays...
I agree - would be nice to have a +5v on the side headers instead of all the 3.3v and gnd pins. Probably could reduce that number.
 
Question, what do you mean by complete? That is I have not seen what I would consider a complete one for DB4 or DB4.5. That is one
that shows all of the signals for EMC and now SDIO. Is that what you are wanting?
Not sure if this would help or not, but I tried to create an IO pin MUX page for the Dogbone document with, I think, most of
the IO pins that are on the board. It shows the SDIO pins which were added on the 4.5 board in a different color... Those should work with my variant for the Micromod, as they are the same IO pins...

I also showed the 5 possible additions for CSI pins. Currently in PR

Which I now have permissions to merge, but wanted to know if this helps or not. Note: some of the color coding for different functionality is probably wrong. Can go back to that later, although most of the problems are probably in the GPIO_EMC_xx pins, which are pretty well dedicated anyway.
 
create a IO pin MUX page for the Dogbone document
Merged - can be seen for comment as

@Dogbone06 -, etc:
Wondering if the edge single row might work as a double row for DB_V 5.0 to shorten the pin array for shield creation (p#993) and shorter signal runs? Especially adding a few more pins? Though I see about 16 GND pins in the rows now? Not sure if that helps anything? Maybe a 90 rotated added row across somewhere?
 
Quick update: Dumped the FlexIO /RD generation and generated it manually with software. Both 8-bit and 16-bit reads are working consistently now with one /RD pulse per read instead of two or sometimes three. Tested with T41 and ER-TFTM101-1. Might try generating the /WR signal as well to see if there is any improvement with writes to the display. Especially the SDRAM Dev board and MicroMod...
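The manual read sequence described here boils down to: pull /RD low, wait for the data to settle, sample, release /RD. A minimal sketch of that ordering is below. The bus interface (rdLow, rdHigh, settle, sampleData) and the FakeBus stand-in are hypothetical names for illustration only, not the driver's actual code; on real hardware they would map to GPIO register accesses.

```cpp
#include <cstdint>

// One software-generated read cycle: exactly one /RD pulse per read.
// `Bus` is any type providing rdLow(), rdHigh(), settle() and sampleData().
template <typename Bus>
uint8_t readByteManualRD(Bus &bus) {
  bus.rdLow();                   // assert /RD (active low)
  bus.settle();                  // give the display time to drive the data pins
  uint8_t v = bus.sampleData();  // sample while /RD is still low
  bus.rdHigh();                  // release /RD, ending the single pulse
  return v;
}

// Host-side stand-in that counts /RD pulses so the sequence can be checked.
struct FakeBus {
  uint8_t data = 0;     // value the "display" drives while /RD is low
  bool rd = true;       // /RD idles high
  unsigned pulses = 0;  // completed low-to-high /RD pulses
  void rdLow() { rd = false; }
  void rdHigh() { if (!rd) ++pulses; rd = true; }
  void settle() {}
  uint8_t sampleData() const { return rd ? 0xFF : data; }
};
```

Keeping the sample strictly between the falling and rising edge is what guarantees one pulse per read, which is the property that was missing with two or three pulses per access.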
 
You will get a faster WR pulse if you use direct port manipulation and some NOPs to delay, but, it won't be easy to control the speeds.
That's where the FlexIO timers and shifters have the advantage.
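To get a feel for how tight the NOP approach is, here is a rough cycle budget. It assumes a 600 MHz CPU clock and a 50% duty /WR signal; the function names are made up, and the NOP loop uses GCC/Clang inline-assembly syntax.

```cpp
#include <cstdint>

// CPU cycles available for each half of the bus period
// (one /WR low phase or one /WR high phase).
constexpr uint32_t cyclesPerHalfPeriod(uint32_t cpuHz, uint32_t busHz) {
  return cpuHz / (2u * busHz);
}

// Crude busy-wait of n NOP instructions. On a Cortex-M7 a NOP is not
// guaranteed to take exactly one cycle, and the loop itself adds overhead,
// so the real delay has to be checked on a scope.
inline void nopDelay(uint32_t n) {
  while (n--) {
    __asm__ volatile("nop");
  }
}
```

Under those assumptions a 20 MHz bus leaves only about 15 CPU cycles per phase and a 12 MHz bus about 25, so a couple of cycles of variation in the loop is already a large fraction of the pulse width.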
 
Just realized pin AD_B1_03 isn't routed so the board doesn't have SPDIF_IN... could we add that to the list for gen5? With Kurt's additions that would complete all of the AD_B1_XX pins which make up consecutive halves of GPIO1 / FlexIO3.
 
Well it's pretty much finished. Modified the SDRAM Dev board /RD signal to use manual pulse generation. Bus speeds are about the same as the T4.1 :D The FlexIO /RD signal generation seems to have been the main issue for stability. It kind of makes sense. I thought that the register reads were stable and memory reads were not stable, but that was not the case. The read status function was also failing erratically, causing problems with the 2D busy function. I have both the T41 and the SDRAM boards working at 12-20MHz in both 8-bit and 16-bit mode. Just have to modify the driver for the MicroMod board and push up all of the changes to GitHub. There is only one thing left to do, and that is to figure out why using async IRQ image display produces this:
T41_8bit_IRQ_bad.jpg


The image is at the right coordinates but the image itself is skewed horizontally to the right and appears to wrap around to the left but otherwise is working.
Here is a demonstration of BTE and ROP ops:
BTE_and_ROP_ops.jpg

Picture is a little blurry due to my steady hand ;) This is the output on the serial monitor showing some of the transfer times:
Code:
LCD Memory Transfer test starting!Compiled May  2 2024 at 14:27:09
Bus transfer speed 12.0MHz

fillRect operation takes 0.608 milliseconds to fill the image area.
Put-picture from PROGMEM to display took 3.698ms to begin the operation.
  But the next LCD operation was delayed by 0.282ms because data transfer was still underway
16-bit copy from PROGMEM to display took 1063.589ms to begin the transfer (data is on its way while you read this.)
Chromakey copy from PROGMEM to display took 6.432ms to run to completion.
ROP 15  BTE copy took 62.000us, followed by 780.000us internal processing in the RAiO chip.
ROP 0  BTE copy took 62.000us, followed by 783.000us internal processing in the RAiO chip.
ROP 1  BTE copy took 62.000us, followed by 2069.000us internal processing in the RAiO chip.
ROP 2  BTE copy took 62.000us, followed by 2169.000us internal processing in the RAiO chip.
ROP 3  BTE copy took 62.000us, followed by 1294.000us internal processing in the RAiO chip.
ROP 4  BTE copy took 62.000us, followed by 2179.000us internal processing in the RAiO chip.
ROP 5  BTE copy took 62.000us, followed by 1126.000us internal processing in the RAiO chip.
ROP 6  BTE copy took 62.000us, followed by 2013.000us internal processing in the RAiO chip.
ROP 7  BTE copy took 62.000us, followed by 1762.000us internal processing in the RAiO chip.
ROP 8  BTE copy took 62.000us, followed by 2167.000us internal processing in the RAiO chip.
ROP 9  BTE copy took 62.000us, followed by 2183.000us internal processing in the RAiO chip.
ROP 10  BTE copy took 62.000us, followed by 1120.000us internal processing in the RAiO chip.
ROP 11  BTE copy took 62.000us, followed by 2178.000us internal processing in the RAiO chip.
ROP 12  BTE copy took 62.000us, followed by 1344.000us internal processing in the RAiO chip.
ROP 13  BTE copy took 62.000us, followed by 2163.000us internal processing in the RAiO chip.
ROP 14  BTE copy took 62.000us, followed by 2033.000us internal processing in the RAiO chip.


First Page Finished, PRESS ANY KEY...
This a demo program written by M Sandercock.

This was a long complicated learning experience. Should have trusted my instincts to begin with...
EDIT: Just noticed that image display with DMA has the same horizontal skew problem as async IRQ has :confused:
EDIT2: Found the problem with the skewed image. Earlier we modified "MulBeatWR_nPrm_DMA" with:
Code:
        p->SHIFTBUFHWS[0] = *(uint32_t *)value;
        uint32_t *value32 = (uint32_t *)value;
        value32++;
        p->SHIFTBUFHWS[1] = *(uint32_t *)value32;
        value32++;
        p->SHIFTBUFHWS[2] = *(uint32_t *)value32;
which was a fix for a screwed up image that was apparently due to the messed up FlexIO /RD signal. I commented the above out and it now displays the image correctly. Talk about chasing your tail...
 