DIY Teensy SDRAM board to solder yourself

theboot900

Well-known member
I've designed a DIY Teensy 4.1 based board with SDRAM, with the idea of being able to solder it yourself (it requires a stencil and solder paste as well). I don't have a hot plate, so I have been heating the boards with hot air from underneath. It uses the BGA 12x12 chip. Like Dogbone's board, it uses the Teensy MicroMod sized flash chip.

  • No special manufacturing. $4 for 5 boards from jlcpcb
  • Excel file with complete component listing, quantities and Links to all required components on mouser.
  • Full schematic, Kicad board and Gerber files
  • Printable PDF showing all components placings for easy reference when placing parts.
  • Small form factor. Same length as teensy 4.1. 20mm wider than teensy 4.1.
  • Smallest component is 0603
  • All but 2 components are on one side of the board. 2 capacitors will need to be hand soldered afterwards.
  • 60 IO pins exposed
  • GPIO7 - all 32 bits exposed
  • GPIO6 - upper 16 bits exposed
  • 32 MB SDRAM using Dogbone's SDRAM library.
  • SD card
  • Full part listing with quantities and Mouser links for parts
  • Can be powered by USB 5 V. Alternatively, remove the bridge from USB to the voltage regulator and it can be powered exclusively through the 3.3 V pin (a linear 3.3 V input supply is recommended)

Possibly due to my routing (I did length-match the traces), the SDRAM runs fine overclocked at 198 MHz, but at 220 MHz it gets some errors. I'm only a hobbyist and this is the first 4-layer board I've done. Maybe someone will find it useful.

All the files are at https://github.com/theboot999/Teensy-DIY

Top.jpg
Locations.jpg
Photo.jpg
 
Is there an external 5V pin, to power it from an external 5V supply or power connected devices that require 5V?
 
No, you would have to route one yourself. I didn't have any use for one, and the board was initially going to be 3.3 V only. I added a 3.3 V regulator to be able to power it off 5 V USB.
 
Happy to see that you were successful in making your own SDRAM board after @Rezo and I started the endeavour to make it happen, and managed to get a full list of people (you know who you are) to join in and make it a reality. This is what a community should be.

Stay creative people! 😁
 
Thanks! I definitely wouldn't have been able to do it without you and @Rezo.
 
Regarding "only 198MHz" in your text above: did you use a 10 pF cap for the SDRAM?
We found that 10 pF seemed to work best; we reached 227 MHz (if I am not mistaken) stable, and at 240 MHz it started to not work too well.
Defragster did see other results with 12 pF, but we concluded that it must be more or less solder creating capacitance and thus giving slightly different results, as these capacitors were soldered by hand.

So if you aren't using a 10 pF cap for the SDRAM, try one and see if you can reach the 227 mark.
 
It's there at C11. But it could be a case of the value having to match the design; we found 10pf to be best, but the layout on this board looks a little bit different and also seems to use different size components.
 
Yeh, I guess this is the best that can be done, and 198 is still decent when considering that the SDRAM is rated for 166.
 
Very very cool project!
What have you done with it so far? Anything interesting with the additional GPIOs and SDRAM?
 
Yes, I tried a 10 pF and a 6 pF.

But I've put it down to my lack of professional routing skills. Also, in trying to keep the power filtering capacitors all on one side of the board, they would not be optimally placed.
 
On the @Dogbone06 board I tried values from 6 (6.5?) up to over 12 pF, and the one board I had did better with 12. But 10 was a good sweet spot on that layout for the other boards, which all ran a bit over 200 MHz, and some over 220 as noted.

Indeed, cap location seems important. PJRC wanted them close to the MCU on the bottom, and then others to balance the build. But if it is working, you did well.
 
Thanks. I'll try a bunch of different values and see if I can get an improvement. It's always bits 8 to 15 that error out (when reading the 32 bit words)
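
To pin down which bits are failing, a memory test can XOR each read-back word with the expected value and count mismatches per bit position. A minimal sketch, assuming a plain write-then-read test over 32-bit words; `BitErrorTally`, `tallyBitErrors` and `failingByteLanes` are hypothetical names and not part of Dogbone's SDRAM test:

```cpp
#include <stddef.h>
#include <stdint.h>

// Per-bit error counters for 32-bit test words (hypothetical helper type).
struct BitErrorTally {
    uint32_t perBit[32] = {};  // mismatches seen on each bit position
    uint32_t badWords = 0;     // words with at least one wrong bit
};

// Compare read-back words against the expected pattern and record
// which bit positions differ.
inline void tallyBitErrors(const uint32_t* readBack, const uint32_t* expected,
                           size_t count, BitErrorTally& tally) {
    for (size_t i = 0; i < count; ++i) {
        uint32_t diff = readBack[i] ^ expected[i];  // 1 = mismatching bit
        if (diff == 0) continue;
        ++tally.badWords;
        for (int bit = 0; bit < 32; ++bit) {
            if (diff & (1u << bit)) ++tally.perBit[bit];
        }
    }
}

// Mask of byte lanes within the 32-bit word that saw errors
// (bit 0 = bits 0..7, bit 1 = bits 8..15, and so on).
inline uint8_t failingByteLanes(const BitErrorTally& tally) {
    uint8_t lanes = 0;
    for (int bit = 0; bit < 32; ++bit) {
        if (tally.perBit[bit] != 0) lanes |= static_cast<uint8_t>(1u << (bit / 8));
    }
    return lanes;
}
```

If only bits 8 to 15 ever fail, `failingByteLanes` would return 0x02, which points at one byte lane rather than the whole bus.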
 
From memory, we managed to get 200MHz with no cap at all just leaving the trace disconnected - have you tested what the maximum safe speed is for that case? It may be higher if the trace is longer.
 
I've tried many different capacitor values, and I cannot get it to run stable at 221 MHz.

The best I got, using a 20 pF capacitor and a slightly modified version of Dogbone's SDRAM test, was an average failure rate of 3500 words per 478 million word reads.

So now I've been doing some testing on an 800 x 480 driverless display (well, the panel driver is an ST7277). It is driven by the LCDIF pins and a separate constant-current IC.

The output is by the LCD engine. I'm running at about 50 Hz on a 20 MHz LCDIF clock. I've been running triple buffering, with the buffers bank aligned. Bank alignment made a big performance difference: the SDRAM is divided into 4 banks, and having the buffers in separate banks improves the switching time. At the start of a frame the three buffers are used as follows:
  • Back Buffer - Gets Cleared by the Pixel Pipeline to 1 colour.
  • Mid Buffer - Gets drawn to by the cpu
  • Front Buffer - Gets output via the LCDIF Engine.
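
The buffer roles above could be laid out and rotated roughly like this. This is a sketch under several assumptions that are not confirmed anywhere in this thread: that the SDRAM is mapped at 0x90000000, that each quarter of the 32 MB address range corresponds to one physical bank (this depends on how the SEMC address mapping is configured), and that the buffers rotate in the order shown. `TripleBuffer` and `bankStart` are hypothetical names.

```cpp
#include <stdint.h>

constexpr uint32_t kSdramBase  = 0x90000000u;              // assumed SDRAM mapping
constexpr uint32_t kSdramSize  = 32u * 1024u * 1024u;      // 32 MB part
constexpr uint32_t kBankCount  = 4;                        // 4 banks
constexpr uint32_t kBankSize   = kSdramSize / kBankCount;  // 8 MB per bank
constexpr uint32_t kFrameBytes = 800u * 480u * 4u;         // 32-bit pixels

static_assert(kFrameBytes <= kBankSize, "a frame must fit inside one bank");

// Start address of a bank, assuming banks are contiguous 8 MB regions.
constexpr uint32_t bankStart(uint32_t bank) {
    return kSdramBase + bank * kBankSize;
}

struct TripleBuffer {
    uint32_t back  = bankStart(0);  // being cleared by the PXP
    uint32_t mid   = bankStart(1);  // being drawn to by the CPU
    uint32_t front = bankStart(2);  // being scanned out by the LCDIF

    // Assumed rotation at the end of a frame: the freshly drawn buffer is
    // shown, the cleared buffer is drawn next, the shown buffer gets cleared.
    void rotate() {
        uint32_t oldFront = front;
        front = mid;
        mid   = back;
        back  = oldFront;
    }
};
```

With one buffer per bank, the three masters (PXP, CPU, LCDIF) each work in a different bank during any given frame, which is the arrangement described above.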

I'm a little bit disappointed by the Pixel Pipeline (PXP) clear rate. But I would think it has the lowest priority when it comes to SDRAM access; the CPU and LCD would take a higher priority. Granted, all the timings are taken while the LCDIF is running, so there will be bus contention. Obviously when using the Pixel Pipeline to clear, we are still free to use the CPU. However, if clearing the back buffer takes longer than 1 frame time (20.2 milliseconds), then we skip a frame and output the previous frame again.

Running 32-bit buffers. Each buffer is uint32_t [frameSize] (384000 pixels).
  • Clearing the backbuffer manually in a for loop via cpu. 4.9 milliseconds
  • Clearing the backbuffer using PXP. 14.0 milliseconds
  • Clearing the backbuffer using PXP while cpu is writing 384000 pixels to mid buffer. 19.1 milliseconds
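
For comparison, the clear times above can be turned into effective throughput. A small sketch of the arithmetic, using only the frame size and timings quoted in this post; `throughputMBps` is a hypothetical helper and 1 MB is taken as 10^6 bytes:

```cpp
#include <stdint.h>

// Effective throughput in MB/s (1 MB = 1e6 bytes) for moving `bytes`
// in `ms` milliseconds.
constexpr double throughputMBps(uint32_t bytes, double ms) {
    return (bytes / 1e6) / (ms / 1e3);
}

constexpr uint32_t kFramePixels  = 800u * 480u;        // 384000 pixels
constexpr uint32_t kFrameBytes32 = kFramePixels * 4u;  // 1,536,000 bytes

// Using the 32-bit timings quoted above:
//   CPU clear, 4.9 ms  -> roughly 313 MB/s
//   PXP clear, 14.0 ms -> roughly 110 MB/s
```

These are effective rates measured while the LCDIF is also reading the SDRAM, so they are not the raw limits of either the CPU or the PXP.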

So next I switched to 24-bit packed format. This format is stored as uint8_t [frameSize * 3]. It's a little bit more maths to set a pixel: the location is the pixel index * 3, and the next 3 bytes are R, G, B. However, it's technically 25% less bandwidth (3 bytes per pixel instead of 4) for the LCDIF and the PXP to read and write to the SDRAM.

Doing a standard write to a pixel is definitely slower. Doing a manual clear of the buffer in an unoptimised way (setting r, +1 for g, +2 for b) took a large 7.6 milliseconds. However, I made an optimised version for 24 bit, and this worked well for sequential pixels of the same colour. We create a temporary array of uint32_t[6] (24 bytes) and fill it with 8 sequential pixels (with the pixels overlapping the word boundaries, as it's 3 bytes per pixel). Then we memcpy this into our 24-bit buffer repeatedly, and do the leftover pixels manually.
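
The optimised fill described above might look something like this. It is a sketch of the technique rather than the code actually used in these tests; `fillRgb888` is a hypothetical name and the R, G, B byte order is assumed from the description:

```cpp
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Fill `pixelCount` pixels of a packed 24-bit buffer with one colour.
// Byte order per pixel is assumed to be R, G, B.
inline void fillRgb888(uint8_t* dst, size_t pixelCount,
                       uint8_t r, uint8_t g, uint8_t b) {
    // 8 pixels * 3 bytes = 24 bytes = six 32-bit words.
    uint32_t pattern[6];
    uint8_t* p = reinterpret_cast<uint8_t*>(pattern);
    for (int i = 0; i < 8; ++i) {
        p[i * 3 + 0] = r;
        p[i * 3 + 1] = g;
        p[i * 3 + 2] = b;
    }

    // Copy the 8-pixel pattern as many times as it fits.
    size_t blocks = pixelCount / 8;
    for (size_t i = 0; i < blocks; ++i) {
        memcpy(dst, pattern, sizeof(pattern));
        dst += sizeof(pattern);
    }

    // Write the leftover (fewer than 8) pixels byte by byte.
    size_t leftover = pixelCount % 8;
    for (size_t i = 0; i < leftover; ++i) {
        *dst++ = r;
        *dst++ = g;
        *dst++ = b;
    }
}
```

Clearing a whole frame would then be `fillRgb888(buffer, 384000, r, g, b)`. Since 384000 is a multiple of 8, the leftover loop only matters for partial fills such as spans and rectangles.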

I haven't tested individual pixels (alphabet letters etc.) on 24 bit. I'm sure there is a better way, performance wise, than setting each pixel as 3 x 8-bit writes.

So on 24 bit:
  • Clearing the backbuffer manually in an optimised for loop via CPU. 2.4 milliseconds
  • Clearing the backbuffer using PXP. 12.3 milliseconds
  • Clearing the backbuffer using PXP while the CPU is writing 384000 pixels to the mid buffer (optimised). 15.4 milliseconds (CPU write time 3.2 milliseconds)
  • Clearing the backbuffer using PXP while the CPU is writing 768000 pixels to the mid buffer (optimised). 18.6 milliseconds (CPU write time 6.4 milliseconds)

So clearly the optimised CPU writing to the mid buffer is hammering the SDRAM bus and giving the PXP not much time to clear.

However, the PXP takes 3 to 5 times longer than doing a manual clear. This was a bit surprising, as I thought the SDRAM would be more of a bottleneck and these numbers would be closer.

Next I hope to do the same tests in 16-bit RGB565 mode. I think we could get some good performance there.
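
For reference, packing a colour into RGB565 is a couple of shifts, and a frame shrinks to 2 bytes per pixel. A minimal sketch with a hypothetical `packRgb565` helper, assuming the usual layout of red in the top 5 bits, green in the middle 6 and blue in the low 5:

```cpp
#include <stdint.h>

// Pack 8-bit R, G, B into a 16-bit RGB565 value by dropping the low bits
// of each channel (5 bits red, 6 bits green, 5 bits blue).
constexpr uint16_t packRgb565(uint8_t r, uint8_t g, uint8_t b) {
    return static_cast<uint16_t>(((r & 0xF8u) << 8) |
                                 ((g & 0xFCu) << 3) |
                                 (b >> 3));
}

// 800 x 480 at 2 bytes per pixel, compared with 1,536,000 bytes at 32 bit.
constexpr uint32_t kFrameBytes565 = 800u * 480u * 2u;  // 768,000 bytes
```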
 
cannot get it to run stable at 221mhz.
221 unstable ... not surprising. Others may have reached it, though not here, on the uniform set of @Dogbone06 boards IIRC. So pushing the 166 MHz part this far seems safe, but beyond the commonly working <210? it may depend on the PCB build and the part. Testing here with 6'ish to 16'ish pF, the sweet spot for me was higher than what worked on average for others; a higher value tested faster on the board here.

Good luck optimizing the buffer flow. Write speed is 3X higher than reads - but that may depend on contiguous writes or something? That might be why optimized CPU writes clear the buffer faster?
 
I ordered one of Dogbone's boards and got it stable at 221 MHz with a 10 pF capacitor. However, I think the SDRAM routing for that board was a replica of the IMXRT1060 dev board. Their routing would be professional.

I can get my board running at 198 MHz with or without the capacitor.

Thanks. Now in my normal program flow it would be:
Run LCD out (whole frame)
Clear buffer via PXP
CPU runs update, then draws to the mid buffer.

So in an ideal setup, by the time the CPU writes to the mid buffer, the PXP has already cleared the back buffer.

Otherwise we have 3 things accessing SDRAM at once: LCD DMA, PXP DMA and the CPU.

I suppose my test case was: how feasible is it to do 24-bit colour using SDRAM and LCDIF at 50 or 60 Hz? And that seems doable for basic drawing.

However, any kind of pixel blending, or reading back pixel data from the SDRAM buffers while processing them, would be unfeasible (e.g. basic 3D lighting maps).
 
The only relevant advice I can think of is make sure your writes from the CPU happen sequentially/with as few other memory accesses between them as possible. The reason being it's impossible to write an entire cacheline with a single instruction, so any single write will usually trigger a memory read to fetch the surrounding cacheline bytes even if you're planning on overwriting all of them. But the CPU has a clever feature that lets it adapt: if it notices the memory write pattern is writing large, sequential blocks of data it will start delaying fetching the rest of the cacheline and will discard the fetch completely if the entire cacheline is filled by the CPU.

(Other CPU architectures usually work around this problem by having an "allocate cacheline for writing" instruction that initializes a cacheline without filling it from memory, but ARM Cortex doesn't.)
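
As an illustration of that access pattern, a rectangle fill can keep rows in the outer loop so that each inner loop writes consecutive addresses with no reads in between. This is only a sketch of the ordering; `fillRect32` is a hypothetical helper, and whether the CPU actually skips the cacheline fetch depends on the cache configuration of the memory region, which has not been measured here.

```cpp
#include <stddef.h>
#include <stdint.h>

// Fill a w x h rectangle at (x, y) in a 32-bit framebuffer that has
// `stride` pixels per row. Rows are the outer loop, so every inner loop
// is one run of consecutive word writes.
inline void fillRect32(uint32_t* fb, size_t stride, size_t x, size_t y,
                       size_t w, size_t h, uint32_t colour) {
    for (size_t row = 0; row < h; ++row) {
        uint32_t* dst = fb + (y + row) * stride + x;
        for (size_t col = 0; col < w; ++col) {
            dst[col] = colour;  // sequential writes along the row
        }
    }
}
```

Swapping the two loops would produce the same image but would jump a full row stride between writes, which is the non-sequential pattern the advice above warns against.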
 
routing for the sdram for that board was a replica of the imxrt1060 dev board
Yes, the routing was cloned to ensure that wouldn't be the issue. A 100 pF? cap was somehow placed at first, so a proper replacement was looked for, assuming that would be best. The whole concept was new, and finding a value that was reliable over 200 MHz seemed like a win. Not sure how much testing happened with no cap, as there was a setting saying to look for it IIRC, so that settled that.

@jmarsh did some code divergence IIRC? - with good results - but never looked at that here.

Minimizing simultaneous hits on the SDRAM seems best, especially if 3X slower reads would be involved and would break up contiguous writes, which are designed to run so much faster in the optimal path built into the chip.
 
Also I'm not sure what types of graphics you're rendering, but if it's just basic UI stuff (placing font glyphs and basic rect fills) consider using LCDIF's paletted mode. That way each pixel can still be 24-bit color but they only take up a single byte of memory.
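
To make the memory saving concrete, here is a software model of a paletted frame: one byte per pixel in memory, with the 24-bit colour coming from a 256-entry lookup table. This only models the idea; it does not show the LCDIF register setup, and `PalettedFrame`, `resolvePixel` and the 0x00RRGGBB entry format are assumptions made for illustration.

```cpp
#include <stddef.h>
#include <stdint.h>

// Software model of a paletted framebuffer (hypothetical type).
struct PalettedFrame {
    uint32_t palette[256];  // assumed 0x00RRGGBB, one entry per index
    uint8_t* indices;       // one byte per pixel
    size_t   pixelCount;
};

// The lookup the display controller would perform during scan-out.
inline uint32_t resolvePixel(const PalettedFrame& f, size_t i) {
    return f.palette[f.indices[i]];
}

// Memory per 800 x 480 frame: 1 byte per pixel instead of 3.
constexpr uint32_t kFrameBytesPaletted = 800u * 480u;       // 384,000 bytes
constexpr uint32_t kFrameBytesPacked24 = 800u * 480u * 3u;  // 1,152,000 bytes
```

The trade-off is that a frame can only contain 256 distinct colours at a time, which suits glyphs and flat fills better than shaded 3D output.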
 
Could you share the codebase for this test?

I have a 7” 800*480px display with the same driver and will be hooking it up to my Devboard V5 when the adapter PCBs arrive. I have been considering using the PXP to write rectangle clips to the main eLCDIF buffer for an LVGL partial mode implementation (to improve render/write times and keep transactions as non-blocking as possible).
 
I recently made a super nice PCB with lots of outputs and sensors. It runs stable at 221 MHz, but I use the same SDRAM layout for all boards. I keep it simple by copying the "bare minimum" straight over to any new project that is similar. So I use that 10 pF cap that Defragster and I decided on when we tested back in the day.

I guess different designs make for different results; it just proves that layout is super hard!
 
That's interesting to know. Thanks.

Yes. Unfortunately, at least the LCDIF will be reading the SDRAM while anything else is going on with it. I don't remember exactly how much, but setting the backbuffers in different banks (the SDRAM is divided into 4 even banks) did make a difference to performance. I believe there's less row switching required.

I tried adjusting the outstanding transactions and burst length on the LCDIF without any noticeable difference.

I will try that at some point to see what performance I can get. At this point I'm working on bringing across a basic 3D engine I was using on a Teensy 4.1 with an SSD1963 display, which used a lookup table and an 8-bit backbuffer.
 