but at some point, there is so little additional cost and size increase that ....
A finer point of IC manufacturing that's not often spoken about is the incredible specialization of modern fabrication processes. You can see press releases saying nVidia is using a 16 nm TSMC process for their new GPU chips and they'll use Samsung's new HBM DRAM made (maybe) with 29 nm silicon. Similar statements can be regularly found regarding NAND flash memory. If any technical info is given at all, it's the transistor channel length. Anyone reading only this info could easily believe these processes are all very similar.
What's rarely mentioned, especially in these modern times where companies consider even fairly obvious facts to be closely guarded proprietary trade secrets, is the incredible difference between these silicon processes. DRAM, for example, requires memory cells which are basically a MOSFET transistor plus a tiny capacitor, where the capacitor's charge holds the data. Tricks are played with a DC offset voltage on the entire area of the chip holding these cells, to reduce the effective voltage that's trying to pull these precious few electrons out of the capacitors. The insulating layer on the transistor gate is made differently, to reduce the leakage, but that also changes the transistor's switching threshold voltage. Since most of the chip is just these cells and row/column access lines, relatively few metal layers are needed for wires.
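For a feel of what "precious few electrons" means, the stored charge is just Q = C x V divided by the charge of one electron. The sketch below uses assumed, illustrative values (a cell storage capacitance of 20 fF and 1.0 V across it); those numbers and the function name are placeholders for illustration, not figures from any particular DRAM.

```python
ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs per electron (exact SI value)


def stored_electrons(capacitance_farads: float, voltage: float) -> float:
    """Number of electrons representing one stored bit, from Q = C * V."""
    return capacitance_farads * voltage / ELEMENTARY_CHARGE


if __name__ == "__main__":
    # Assumed, illustrative values only: 20 fF of storage, 1.0 V stored.
    n = stored_electrons(20e-15, 1.0)
    print(f"about {n:,.0f} electrons")
```

With those assumed values the answer comes out on the order of a hundred thousand electrons per bit, which is why even tiny leakage currents matter so much.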
Likewise, those TSMC processes for logic circuits are heavily optimized for fast digital circuitry, but at the expense of other stuff. You can't implement flash memory at all, and high accuracy analog circuits (like the ADC and DAC on Teensy) are pretty much impossible. SRAM can be made using 4 or 6 transistor cells, but it's very expensive.
MCUs are typically made with 90 nm or larger transistors. Soon we'll start to see 65 nm being used, but there are huge challenges getting flash memory to work reliably. The 90 nm (and larger) processes allow for a good balance of features, so you can have reliable flash memory, RAM and pretty good logic speed, and fairly accurate analog circuits (at least well matched capacitors) all on the same piece of silicon. But at 90 nm, every extra feature takes quite a bit more silicon area than adding stuff on the most advanced logic-only chips.
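To put a rough number on that area penalty, here's a back-of-envelope sketch. It assumes ideal scaling, where the area of a block of logic shrinks with the square of the feature size. That's an assumption, not something a real process delivers: node names are partly marketing, and analog, flash and I/O shrink far worse than logic. The function name and the 16 nm comparison point (borrowed from the GPU example above) are only for illustration.

```python
def ideal_area_ratio(old_nm: float, new_nm: float) -> float:
    """Area of a logic block at old_nm relative to the same block at new_nm.

    Assumes ideal scaling (area proportional to feature size squared),
    which real processes only approximate.
    """
    if old_nm <= 0 or new_nm <= 0:
        raise ValueError("feature sizes must be positive")
    return (old_nm / new_nm) ** 2


if __name__ == "__main__":
    for new_nm in (65, 45, 16):
        ratio = ideal_area_ratio(90, new_nm)
        print(f"90 nm vs {new_nm} nm: about {ratio:.1f}x the area")
```

Under that idealized assumption, the same logic at 90 nm takes roughly 1.9x the area it would at 65 nm, 4x the area at 45 nm, and around 32x the area at 16 nm. Real numbers will differ, but it shows why an "extra feature" that's nearly free on a leading-edge logic chip is a real cost on an MCU.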
Perhaps we'll eventually see 45 nm and even more advanced processes able to implement single-die MCUs. Maybe? Or perhaps designs similar to the ESP, where a separate flash chip is used for non-volatile memory, will become the norm?
One thing is pretty certain to remain the same. The silicon processes for the fastest logic, best DRAM, and best flash will continue to be very different. Adding extra logic-only features to the logic die may be cheap, if those features don't require adding memory or analog features. But the higher the silicon performance, the more specialized the process must be, which means adding extra features that the silicon isn't meant to implement is very difficult and expensive.