PC Engine emulator, PSRAM experiment

Jean-Marc

Well-known member
I would like to share my first experience using PSRAM (SPI RAM) on the Teensy 4.0.

A few weeks ago I decided to port one more emulator core to my MCUME project.
https://github.com/Jean-MarcHarvengt/MCUME

The core is TGEmu, a PC Engine emulator written by Charles MacDonald.
It is a challenging emulator with respect to RAM requirements.
In addition to the 128k ILI9341 frame buffer, the core has about 200k of local variables + 128k of background cache + 512k of sprite object cache. Finally, ROMs loaded into RAM are between 256 KB and 1 MB!
There was clearly not enough memory in the T4 for it, but I had ordered a few IPS6404 PSRAM devices some time ago, so I had an opportunity to try them.
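To make the memory budget above concrete, here is a small sketch that adds up the quoted buffer sizes. The 1024 KB figure for the Teensy 4.0 (512 KB FlexRAM + 512 KB OCRAM) is an assumption, and the function names are only illustrative.

```cpp
#include <cstdint>

// Buffer sizes quoted in the post, in KB.
constexpr uint32_t kFrameBufferKB     = 128;
constexpr uint32_t kCoreVariablesKB   = 200;
constexpr uint32_t kBackgroundCacheKB = 128;
constexpr uint32_t kSpriteCacheKB     = 512;

// Assumption: total on-chip RAM of the Teensy 4.0 (512 KB FlexRAM + 512 KB OCRAM).
constexpr uint32_t kTeensy40RamKB = 1024;

// RAM needed by the emulator if a ROM of romKB is also kept in RAM.
constexpr uint32_t requiredKB(uint32_t romKB) {
  return kFrameBufferKB + kCoreVariablesKB + kBackgroundCacheKB +
         kSpriteCacheKB + romKB;
}

// True if everything would fit on chip (ignores stack, code in ITCM, etc.).
constexpr bool fitsOnChip(uint32_t romKB) {
  return requiredKB(romKB) <= kTeensy40RamKB;
}
```

Without any ROM the buffers already add up to 968 KB, so even the smallest 256 KB ROM pushes the total past the on-chip RAM in this rough estimate.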

I connected the PSRAM to SPI2 (SPI mode, not QUAD SPI).
That means that I could forget the built-in SD card for my disk I/O.
The ILI9341 is connected on SPI0.
I use one MQS channel for audio (pin 10).
My plan was to use the PSRAM for storing the game ROM image, so the PSRAM is used mostly for reading.
The rest I could almost store in the RAM of the Teensy.
The game image is loaded into the PSRAM at startup. Initially it was read from USB storage, but I had some hangs when reading the image from USB to PSRAM (not clear why). Another problem is that the uFS + USB libraries use almost 70k of RAM.
So I decided to go back to the good old SD library and connect the SD slot of the ILI display to SPI0, together with the display (using another CS of course).
As the image is copied at startup from SD to PSRAM, SPI0 can be used exclusively in DMA mode for the display later on.

For the PSRAM driver, I used a cache of 16 pages of 16/32 bytes in RAM. Depending on the game, the emulated CPU accesses and jumps to only a few locations.
A bigger page size results in the game freezing continuously.
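As an illustration of the idea (not the actual MCUME driver code), a small read cache in front of a slow backing store could look like the sketch below. The names `PageCache`, `ReadPageFn` and `fakeRead`, and the round-robin replacement policy, are all assumptions made for this example.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical backing-store callback: reads len bytes starting at addr.
// In a real driver this would be one SPI read transaction to the PSRAM.
using ReadPageFn = void (*)(uint32_t addr, uint8_t *dst, size_t len);

template <size_t PageSize, size_t PageCount>
class PageCache {
  static_assert((PageSize & (PageSize - 1)) == 0,
                "PageSize must be a power of two");
public:
  explicit PageCache(ReadPageFn fn) : read_(fn) {
    for (size_t i = 0; i < PageCount; i++) valid_[i] = false;
  }

  // Return the byte at addr, fetching its whole page on a miss.
  uint8_t read(uint32_t addr) {
    const uint32_t base = addr & ~static_cast<uint32_t>(PageSize - 1);
    for (size_t i = 0; i < PageCount; i++) {
      if (valid_[i] && tag_[i] == base) return data_[i][addr - base];  // hit
    }
    // Miss: overwrite the next slot in round-robin order (assumed policy).
    const size_t slot = next_;
    next_ = (next_ + 1) % PageCount;
    read_(base, data_[slot], PageSize);
    tag_[slot] = base;
    valid_[slot] = true;
    misses_++;
    return data_[slot][addr - base];
  }

  uint32_t misses() const { return misses_; }

private:
  ReadPageFn read_;
  uint8_t data_[PageCount][PageSize];
  uint32_t tag_[PageCount];
  bool valid_[PageCount];
  size_t next_ = 0;
  uint32_t misses_ = 0;
};

// Stand-in backing store for trying this on a PC: byte at address a is (a & 0xff).
inline void fakeRead(uint32_t addr, uint8_t *dst, size_t len) {
  for (size_t i = 0; i < len; i++) dst[i] = static_cast<uint8_t>(addr + i);
}
```

With `PageCache<16, 16> cache(fakeRead);` every miss costs one full page transfer, which is one plausible reason why larger pages hurt when the emulated CPU jumps around a lot.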
CPU usage in TGEmu is also heavy. I had to compile the code for fastest and overclock the T4 to 800 MHz.

You can see the result in this video
https://youtu.be/Ot9RgDMqdF4

I said that I could almost fit all RAM buffers (except the game image) into the T4 memory.
In fact I had to cheat for the 512k objects buffer, which results in a buggy display in some games. To work around this I would really need the whole heap (malloc area) to be available for the emulator.
 
Just posted this: T4-0-Memory-trying-to-make-sense-of-the-different-regions

If you do the math for sizeofsomememory - what does it show in your code? Wondering if there is a usable chunk of RAM going to waste in this case - of course it will vary as code changes.

On that thread are notes on leaving code in FLASH - has that been done to free up RAM? Also - should be on that thread is the imxrt-size code that will display some memory map details - including showing how full the last 32KB ITCM code block is.

Very cool you got PSRAM working. Would be cool to see that code broken out.
 
If I run Kurt's tool I get:

FlexRAM section ITCM+DTCM = 512 KB
Config : aaaaaaff
ITCM : 110976 B (84.67% of 128 KB)
DTCM : 377536 B (96.01% of 384 KB)
Available for Stack: 15680
OCRAM: 512KB
DMAMEM: 4672 B ( 0.89% of 512 KB)
Available for Heap: 519616 B (99.11% of 512 KB)
Flash: 405536 B (19.96% of 1984 KB)


I will isolate the code of the PSRAM in a new project.
I would also like to investigate why I cannot use DMA transfers with the PSRAM as I do for the display.

I also have doubts about the clock setting in the SPI driver.
The PSRAM supports 84MHz.
If I pass that parameter, I don't reach that speed over SPI.
 
That looks like it currently leaves a 12KB region orphaned in ITCM RAM. I just posted code here that locates that free space and gets a pointer to it: T4-0-Memory-trying-to-make-sense-of-the-different-regions

TD 1.48 shipped with an SPI limit of 60 MHz (?) on the clock. @KurtE did a mod the other week that allowed for an 80 MHz clock that would give a bit of a boost. Seems there was a PULL request because it worked for tested displays - it may make it into TD 1.49 beta. KurtE posted his change on github.
 
Thanks for the tip.
My understanding from the post is that the ILI9341 was running at about 40MHz with the old limit.

With the change I can set it up to 60MHz for the display, and the same for the PSRAM.
Going above that with the PSRAM (80MHz) results in errors.
Same for the display BTW.

So there is a noticeable improvement, but it is still far from the spec of the IPS6404 (104MHz for the SQ version).

Does the current SPI driver support QUAD SPI transfers? How do I set up the extra pins for QUAD SPI? Is there any example shared somewhere? Thanks.
 
As I also have the SD card on the same bus as the display, the load is probably slightly different.

Indeed something is different with the display or connections - it was cool that testing KurtE's update ran on the ili9341 as connected, and @KurtE and @mjs513 both tested with ili9488's IIRC.

But those wouldn't relate to the PSRAM connections, depending on how they were made - except to show that with good lines to the device, the 1062 with KurtE's change was able to run at 80 MHz.

Perhaps showing how it was connected would allow others to set up and test. I have some of those chips - but it would take KurtE's interest {or similar} to set up with a logic analyzer, or perhaps adjust the code timing or SPI ordering, to get it to function at 80 MHz.

… prior post was cut short as the doorbell rang. Nice it went up to 60 MHz.

Other than the SDIO bus running 4 data lines - notes about QSPI support on other SPI's didn't come up as an option with the pins presented.
 
Hi,

I tried on another setup where only the ILI9341 (2.2") was connected on SPI0.
I can reach a 100MHz clock in DMA and I also did not notice any issue without DMA.
This weekend I will create a PCB for the version where the SD card is connected on the same SPI0 bus. I probably had some weird long wire connections.
I will confirm with pictures!

How do I post pictures on this forum BTW?
 
Cool that the display works faster - the test so far shows it hit 80 and there are no jumps until 120, as the math works out when running a benchmark.

To post pics, they should typically be jpeg or png and must be under 1MB AFAIK or they are rejected. There is an 'insert image' icon on the normal toolbar - and in 'Go Advanced', where any file can be attached, you can pick a pic too.
 
Here are a few pictures of the T4+PSRAM piggyback and the PCB breakout board I created for the emulation project
T4piggy.png
pcbT4.png
With both the ILI9341 + the SD card (from the ILI) on SPI0, I cannot go above 60MHz (with the ILI9341 alone I could go to 100MHz).
I am not sure what the default SPI clock is in the SD library. It probably fails there when I go above 60MHz.
The PSRAM alone on its SPI2 bus is still at 60MHz too. Going above that gives errors.
 
More emulators are using the PSRAM module on the Teensy 4.0.
Next to PC Engine, now Gameboy, Sega Master System, Megadrive and AtariST are using it.
I had to struggle for a week with the software before figuring out that the soldered PSRAM chip was defective, which was not always visible. Now all OK!
https://youtu.be/j2sKw7KYpEo
https://github.com/Jean-MarcHarvengt/MCUME/blob/master/README.md
The list grows every day: Atari 2600, Odyssey, Colecovision, Atari 5200, Vectrex, NES, PC Engine, Sega Master System, Sega Game Gear, Sega Megadrive and Gameboy.
And for the computers: ZX81 and Spectrum, Atari 800, C64, AtariST, 8086 XT.
Only the Teensy 4.0 could do them all!
 
Any chance that you share your code to access the IPS6404? Just ordered a bunch of them for experimenting, so some working examples to start from might be useful.
 
Looks good, thanks a lot for sharing. I'll give it a try when the chips arrive.
 
Just a few thoughts about using PSRAMs after struggling with them in my projects. Maybe they will be helpful in some way.
Although from the outside these chips invite you to use them as a regular serial SRAM, there is one potential trap.
If we look at the datasheet:
https://github.com/Edragon/Datasheet/blob/master/IPUS/IPUS 64Mbit.pdf
Page 21
tCEM parameter: CE# low pulse width, max value 8us
PSRAMs, being DRAMs with all the refresh circuitry and a serial I/O built in, need some time for internal operations, hence a single I/O burst is limited to 8us.
If this is exceeded, the chip will return garbage.
So, depending on the SPI clock frequency, the PAGE_SIZE value should be limited so as not to exceed the allowed 8us max burst length.
E.g., for a 70MHz clock:
(8us * 70MHz)/ 8 = 70 bytes total
subtracting 5 bytes (1 command, 3 byte address, 1 for wait cycles) gives max 65 bytes long data burst in theory, so a PAGE_SIZE of 64 bytes should work.
For 60MHz the burst length goes down to 60 bytes or 55 bytes of data.
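The arithmetic above can be written as a small helper. This is only a sketch: it assumes single-bit SPI (8 clocks per byte), the 8us tCEM and the 5-byte overhead quoted above, and the function names are made up for illustration.

```cpp
#include <cstdint>

// tCEM (max CE# low time) of 8 us, the datasheet value quoted above.
constexpr uint32_t kTcemNs = 8000;
// 1 command + 3 address + 1 wait byte, as counted in the post.
constexpr uint32_t kOverheadBytes = 5;

// Total bytes that can be clocked within tCEM in single-bit SPI mode.
constexpr uint32_t maxBurstBytes(uint32_t clockMHz) {
  return (kTcemNs * clockMHz) / 1000 / 8;
}

// Data bytes left once the command/address/wait overhead is subtracted.
constexpr uint32_t maxDataBytes(uint32_t clockMHz) {
  return maxBurstBytes(clockMHz) - kOverheadBytes;
}

// Largest power-of-two page size that still fits in one burst.
inline uint32_t largestPow2PageSize(uint32_t clockMHz) {
  uint32_t page = 1;
  while (page * 2 <= maxDataBytes(clockMHz)) page *= 2;
  return page;
}
```

For 70MHz this gives 70 total and 65 data bytes (so a 64-byte page), and for 60MHz it gives 60 total and 55 data bytes (so a 32-byte page if the page size must be a power of two).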
Looking at the PSRAM_T::begin code, a default linear burst access is used. The max clock frequency is limited to 84MHz in that mode.
Perhaps a way to speed up the transfer would be to set the chip to work in 32 byte burst wrap mode and access data in 32 byte chunks with a higher clock rate.
Of course, this assumes the buffers stored in the PSRAM are aligned with its page size of 1k. For accesses crossing the page boundary the clock is limited down to 84MHz.
 
Thanks a lot for the hidden detail in the spec, but by luck the current page size in the driver is 16 bytes.
So it reads at most 16 bytes in an SPI transaction (always from a 16-byte page boundary, at 0, 16, 32...)
So we are still below a 4us CS low pulse in total.
I tried increasing from 70MHz to 80MHz and it starts failing immediately, with or without the burst mode command after the reset.
I get a bit confused to be honest...
 
I meant 16 bytes + the 5 bytes overhead (1 com + 3 add + 1 wait) but still below 4us total (2.4us at 70MHz if I am correct)
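As a quick check of that figure, here is a sketch of the CE# low time for one transaction. It assumes back-to-back bytes with no gaps and ignores CS setup/hold time, so a real transfer will be somewhat longer; `ceLowTimeNs` is a hypothetical helper name.

```cpp
#include <cstdint>

// Ideal CE# low time in nanoseconds for one read in single-bit SPI mode:
// dataBytes of payload plus 5 overhead bytes (1 command + 3 address + 1 wait),
// 8 clocks per byte, no gaps between bytes (assumption).
constexpr uint32_t ceLowTimeNs(uint32_t dataBytes, uint32_t clockMHz) {
  return (dataBytes + 5) * 8 * 1000 / clockMHz;
}
```

`ceLowTimeNs(16, 70)` evaluates to 2400 ns, which matches the 2.4us estimate and leaves a wide margin below the 8us tCEM limit.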
 
Jean-Marc,
thanks a lot for your code to make the chip work!
I tested it with Teensy 4.0 @600MHz & audioshield rev. D with PSRAM [ESP PSRAM64H] soldered to the audio shield
and Arduino 1.8.10 and Teensyduino 1.49 beta #3
and SPICLOCK = 80MHz

Code:
#include "psram_t.h"

//#define PSRAM_CS      10  // CS pin of SD card of audio shield 
#define PSRAM_CS       6  // CS pin of MEM of audio shield 
#define PSRAM_MOSI    11  // to IPS pin 5 SI/SIO0
#define PSRAM_MISO    12  // to IPS pin 2 SO/SIO1
#define PSRAM_SCLK    13  // to IPS pin 6 SCLK

// Teensy 3.6, 180MHz, 70MHz SPI
// 21:10h - approx. 40 sec per pass

// Teensy 4.0, 600MHz, 70MHz SPI
// 21:13h - approx. 17 sec per pass

PSRAM_T psram = PSRAM_T(PSRAM_CS, PSRAM_MOSI, PSRAM_SCLK, PSRAM_MISO);

static int cnt = 0;
static uint8_t randomdata[0x10000]; // 64k random data


void setup() 
{ 
  Serial.begin(115200);
  while (!Serial && millis() < 2000); 
  // Init random data
  for (int i=0; i<0x10000 ; i++) 
  {
    randomdata[i] = random(255);
  }
  Serial.println("Init PSRAM");  
  psram.begin();
  Serial.println("Testing PSRAM...");  
  Serial.print("IPS_SIZE = "); Serial.println(IPS_SIZE);
}



void loop() 
{
  uint32_t counter = millis(); 
  Serial.print("loop ");
  Serial.println(cnt++);

  // Write and read a random pattern table over the whole device in a loop
  for (int i=0; i<IPS_SIZE ; i++) 
  {
    psram.pswrite(i,randomdata[i&0xffff]);  
    //Serial.print("Writing at position: ");
    //Serial.println(i);    
   }
  Serial.print(millis() - counter); Serial.println(" msec for 8Mbyte WRITE");
  Serial.print(8 * 1024 * 1000 / (millis() - counter)); Serial.println(" kbyte/sec WRITE");
  delay(100);
  counter = millis();
  for (int i=0; i< IPS_SIZE; i++) 
  {
    uint8_t val = psram.psread(i);
    if (val != randomdata[i&0xffff] )
    {
      Serial.print("err at ");
      Serial.println(i);
    } 
    else
    {
      //Serial.print("correct at ");
      //Serial.println(i);
    }
  }
  Serial.print(millis() - counter); Serial.println(" msec for 8Mbyte READ");
  Serial.print(8 * 1024 * 1000 / (millis() - counter)); Serial.println(" kbyte/sec READ");
}

I am not sure whether my calculations are correct, but the read and write speed seem to me a bit low. Is this normal with this kind of chip?
SPICLOCK = 80MHz

Code:
Init PSRAM
Testing PSRAM...
IPS_SIZE = 8388608
loop 0
12192 msec for 8Mbyte WRITE
671 kbyte/sec WRITE
3003 msec for 8Mbyte READ
2727 kbyte/sec READ
loop 1
12752 msec for 8Mbyte WRITE
642 kbyte/sec WRITE
3003 msec for 8Mbyte READ
2727 kbyte/sec READ
loop 2
12752 msec for 8Mbyte WRITE
642 kbyte/sec WRITE
3003 msec for 8Mbyte READ
2727 kbyte/sec READ
loop 3
12752 msec for 8Mbyte WRITE
642 kbyte/sec WRITE
3003 msec for 8Mbyte READ
2727 kbyte/sec READ
loop 4
12751 msec for 8Mbyte WRITE
642 kbyte/sec WRITE
3003 msec for 8Mbyte READ
2727 kbyte/sec READ
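For reference, the throughput figures in this log can be reproduced with the same integer arithmetic the sketch uses. The helper names below are hypothetical, and the per-byte cost is just the elapsed time divided by the 8388608 bytes of the device.

```cpp
#include <cstdint>

constexpr uint32_t kDeviceBytes = 8u * 1024u * 1024u;  // IPS_SIZE = 8388608
constexpr uint32_t kDeviceKB    = 8u * 1024u;          // same size in KB

// Throughput in kbyte/sec, as printed by the test sketch
// (elapsedMs must be non-zero).
constexpr uint32_t kbytePerSec(uint32_t elapsedMs) {
  return kDeviceKB * 1000u / elapsedMs;
}

// Average time spent per byte, in nanoseconds (rounded down).
constexpr uint32_t nsPerByte(uint32_t elapsedMs) {
  return static_cast<uint32_t>(
      static_cast<uint64_t>(elapsedMs) * 1000000u / kDeviceBytes);
}
```

For the first pass this gives 671 kbyte/sec (about 1453 ns per byte) for writes and 2727 kbyte/sec (about 357 ns per byte) for reads. One payload byte needs only 100 ns of clocking at 80MHz, so most of the time goes to per-transaction overhead rather than to the data bits themselves.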
 
It is indeed not as fast as the specs indicate, but it starts failing at higher clock frequencies.
It is probably best to use Quad SPI mode as an alternative. I don't know how to use it on the Teensy. Maybe the built-in SD driver is using Quad SPI and can be used as an example.
 
Maybe one should check with an oscilloscope or logic analyzer whether there are any gaps between the transmissions. This would indicate suboptimal SPI code.
 
I still have not performed the suggested experiment. Shame on me.

Currently I use the PSRAM mainly as extra memory that I mostly read from.
I write to it once and then I only read from it (for some of the ported emulators, I use it to store the ROM, which can be a few MB).
The PSRAM driver uses simple SPI transfers (I could not find a good example of Quad SPI transfers).
Last weekend I wanted to check if I could use an SD card (SDXC) in place of the PSRAM (I mean for what I use it for).
As I only read from it, I thought I could create the file once and then just read from it (seek and read), the same way I access the PSRAM.
Unfortunately, even if the SD library uses SDIO (4 lines), I still cannot go as fast as with the PSRAM, mostly because you have to read 512 bytes at a time from the SD card.
Even with a few pages in RAM, it is far from the performance I could achieve with the PSRAM.
On the real PSRAM I really have to keep transfers small (<16 bytes) with few cache pages (8) to achieve good performance.
I tried the SD and SdFat libraries, with FAT32 and exFAT. No difference.
I also tried uSDFS (from SD). There I cannot write my initial file.
Anyhow, I don't think that the filesystem and the FS library on top are the bottleneck.
I could not find an FS library that would allow me to write blocks directly to a second scratch partition, for example (SD and SdFat don't seem to like multiple partitions).

Just to say, the current PSRAM driver is slow but fast enough to emulate a Megadrive at almost full speed (read only)!
I already asked, but has anyone managed to mmap an SPI device to a portion of memory, I mean in addition to the flash?
 
Looking at the T_4.1 thread, Paul posted this:

Another feature I'm considering is using the bottom side for locations to add 1 or 2 QSPI memory chips, which could be either flash or psram.

That QSPI could lead to double or more of : 642 kbyte/sec WRITE and 2727 kbyte/sec READ !

T_4.1 still a work in progress some (hopefully few) months away - will know more as Paul progresses and posts ...
 
That would be great and a valid reason for me to move to a larger pinout. I experimented with self made flex PCB to access bottom tracks of the T4. I was surprised how easy it was but at the end I ended up with a longer T4 anyway!
 