I don't really see the need for a simulator. I tend to agree with MarkT that the real problem is simulating the devices. Until you can simulate all of the devices hooked up, only a few things could be done on a simulator.
Then you get to the issue of who is going to build such a thing. For the scope people want, you generally are not going to get people to dedicate a large effort and then make the simulator free, or at least cheap enough that the majority of folks will use it. You also have the issue that simulating each instruction of a program can often take hundreds if not thousands of instructions on the host doing the simulation. Sure, in some cases, such as ancient 8-bit game code running under MAME, the simulation might be faster than the original machine, but I suspect that in general it will just be faster to use a Teensy.
Now, in doing compiler development since 1979, I have often used simulators, particularly when I was supporting new targets. These were created so the compiler teams could implement support for a new CPU (either a completely new architecture, or a tweak on an existing architecture that has new instructions) before the silicon was available.
In terms of my use, these were often instruction set simulators that add the new instructions, and you don't worry about the performance of the code. Generally, when the first real machines were made available for my use, I would move from using simulators to using the real hardware. Now, often that first generation of real hardware is slowed down, but until I'm running benchmarks, it typically doesn't matter. Generally, by the time I need to run benchmarks, I prefer to do it on real hardware that is not the initial generation, because I am skeptical that any simulator will be completely cycle accurate.
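To make the instruction set simulator idea a bit more concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the four-register machine, the opcode names, and the "new" `madd` (multiply-add) instruction do not correspond to any real CPU or to any simulator I have used. It only models what the instructions do to the registers, with no timing at all, which is roughly the level a compiler team needs before silicon exists.

```python
# Toy instruction set simulator for a hypothetical 4-register, 32-bit machine.
# All opcodes here are invented for illustration; this is not a real ISA.

MASK32 = 0xFFFFFFFF


def run(program, max_steps=10_000):
    """Execute a list of (op, a, b, c) tuples; return (registers, steps)."""
    regs = [0, 0, 0, 0]
    pc = 0
    steps = 0
    while pc < len(program) and steps < max_steps:
        op, a, b, c = program[pc]   # fetch and decode
        pc += 1
        steps += 1
        if op == "li":              # load immediate: r[a] = b
            regs[a] = b & MASK32
        elif op == "add":           # r[a] = r[b] + r[c]
            regs[a] = (regs[b] + regs[c]) & MASK32
        elif op == "mul":           # r[a] = r[b] * r[c]
            regs[a] = (regs[b] * regs[c]) & MASK32
        elif op == "madd":          # the "new" instruction: r[a] += r[b] * r[c]
            regs[a] = (regs[a] + regs[b] * regs[c]) & MASK32
        elif op == "halt":
            break
        else:
            raise ValueError(f"unknown opcode {op!r} at pc={pc - 1}")
    return regs, steps


# Code a compiler might emit for r0 = 100 + 6 * 7 on the existing architecture.
OLD_PROGRAM = [
    ("li", 0, 100, 0),
    ("li", 1, 6, 0),
    ("li", 2, 7, 0),
    ("mul", 3, 1, 2),
    ("add", 0, 0, 3),
    ("halt", 0, 0, 0),
]

# The same computation once the compiler knows about the new instruction.
NEW_PROGRAM = [
    ("li", 0, 100, 0),
    ("li", 1, 6, 0),
    ("li", 2, 7, 0),
    ("madd", 0, 1, 2),
    ("halt", 0, 0, 0),
]

if __name__ == "__main__":
    for name, prog in (("old", OLD_PROGRAM), ("new", NEW_PROGRAM)):
        regs, steps = run(prog)
        print(name, "r0 =", regs[0], "instructions executed =", steps)
```

Both programs should leave the same value in `r0`, which is the kind of check you care about at this stage. Note that every simulated instruction costs a tuple unpack, several comparisons, and a masked arithmetic operation on the host, and this sketch models no devices, no memory system, and no cycle counts. Adding any of those is where the cost goes up quickly.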
But there are simulators that are much more detailed and try to be cycle accurate. They allow the CPU hardware designers to isolate bottlenecks during their design. I don't know what the performance numbers are for these simulators, but I recall that at a previous employer it would often take a week of simulation just to get far enough for the user program to print its first message (and the machines doing the simulation were often the fastest/largest machines of the current generation). Such simulations also often need tons of disk to hold all of the information, which then has to be reduced to look at the issue at hand.
In terms of the Teensy, the unfortunate thing is we don't have access to the JTAG hardware to allow a debugger to more easily debug the code. I believe there is a port of a gdb stub that allows Teensys to be debugged without using JTAG support. It may be a better use of time to enhance this support so it is more useful.
In a previous job, I ran into somebody who was the corporate fixer. One of her jobs was chip courier. She would fly first class (because they get double the carry-on luggage) and carry two empty carry-on suitcases. She would meet her contact at the airport and load up the suitcases with the very first chips coming out of the fab. She would fly back ASAP, and then deliver the chips to the various teams bringing up the chip. At this stage, there are likely things that don't work, and the bring-up teams will need to deal with the issues.