For a keypad, I recommend the diode approach.
If you have N×M keys, you need N+M I/O pins: N outputs and M inputs. You can use an expander, shift register, counter, etc. for the N outputs if you need to reduce the pin count.
Each button is wired in series with a diode, connecting one output to one input. Only one output is high at any time, which is why you can use a decoder/demultiplexer with just K output pins for up to 2^K key rows. For example, a dirt cheap 74HC238 (74VHC238FT for example) can control 8 output lines using just 3 output pins on a Teensy. The inputs tell which buttons on the active row are pressed and which are not. The diode per button is what allows each button to be detected individually: without it, current could flow backwards through other pressed buttons and make unpressed buttons appear pressed.
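As a rough illustration of the scanning described above, here is a minimal C++ sketch. The names `scanMatrix` and `SimulatedMatrix` are made up for this example and do not come from any library. The two callbacks are assumptions about how the hardware would be wrapped: on a Teensy they would presumably call `digitalWrite()` on the row outputs (or decoder address pins) and `digitalRead()` on the column inputs, with the inputs pulled low so they read high only through a pressed button. The simulated matrix is just a desktop stand-in for an ideal diode matrix.

```cpp
#include <cassert>
#include <cstddef>

// Scan an N x M diode matrix: drive one row at a time, then read every column.
// selectRow(r) must make row r the only high output; readColumn(c) must return
// true when column input c reads high. Both are hypothetical hardware wrappers.
template <std::size_t Rows, std::size_t Cols, typename SelectRow, typename ReadColumn>
void scanMatrix(SelectRow selectRow, ReadColumn readColumn, bool (&pressed)[Rows][Cols]) {
    for (std::size_t r = 0; r < Rows; ++r) {
        selectRow(r);
        // On real hardware, wait here for the outputs to settle before reading.
        for (std::size_t c = 0; c < Cols; ++c) {
            pressed[r][c] = readColumn(c);
        }
    }
}

// Desktop stand-in for the wiring (an assumption for testing, not real hardware).
// With a diode per button, a column reads high only if the button on the
// currently active row is closed, so other rows cannot leak into the reading.
template <std::size_t Rows, std::size_t Cols>
struct SimulatedMatrix {
    bool closed[Rows][Cols] = {};
    std::size_t activeRow = 0;

    void selectRow(std::size_t r) {
        assert(r < Rows);
        activeRow = r;
    }
    bool readColumn(std::size_t c) const {
        assert(c < Cols);
        return closed[activeRow][c];
    }
};
```

Passing the hardware access in as callbacks keeps the scanning loop identical whether the rows are driven directly or through a decoder.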
A common choice is to use N=8. If you also use a 74HC238, you only need 3 output pins and M input pins for 8×M buttons, all individually and separately detectable.
For N=16, you can use the 74HC4154 family; then you need 4 output pins and M input pins for 16×M individual buttons.
These decoders only need something like 20 nanoseconds (a dozen cycles at 600 MHz or so) for the output to switch whenever the row changes.
You can use any (reasonably fast) decoder where only one output is high at the same time, really; lots of choices.
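To make the pin arithmetic concrete, here is a small C++ sketch of the pin budget and of the address-line levels for selecting a row through a decoder. The function names are invented for this example. It assumes address line 0 (A0) carries the least significant bit of the row number, as on the 74HC238; check the datasheet of whichever decoder you use.

```cpp
#include <cassert>

// Smallest K such that 2^K >= rows, i.e. the number of decoder address pins.
constexpr unsigned addressBitsFor(unsigned rows) {
    return rows <= 1 ? 0 : 1 + addressBitsFor((rows + 1) / 2);
}

// Microcontroller pins needed when the rows are driven directly.
constexpr unsigned pinsDirect(unsigned rows, unsigned cols) {
    return rows + cols;
}

// Microcontroller pins needed when the rows are driven through a decoder.
constexpr unsigned pinsWithDecoder(unsigned rows, unsigned cols) {
    return addressBitsFor(rows) + cols;
}

// 8 rows need 3 address pins, 16 rows need 4.
static_assert(addressBitsFor(8) == 3, "3-to-8 decoder");
static_assert(addressBitsFor(16) == 4, "4-to-16 decoder");
// 16 rows by 6 columns (96 buttons) fit in 10 pins with a decoder.
static_assert(pinsWithDecoder(16, 6) == 10, "16x6 matrix via decoder");

// Level to put on address line `bit` (0 = least significant) to select `row`.
inline bool addressLineLevel(unsigned row, unsigned bit, unsigned addressBits) {
    assert(addressBits < 32 && bit < addressBits && row < (1u << addressBits));
    return ((row >> bit) & 1u) != 0;
}
```

For example, selecting row 5 on a 3-bit decoder means address lines 0 and 2 high and line 1 low.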
For software debouncing, I like an approach where a button change is noted immediately, but further changes are ignored for 20 to 50 ms. If you check each button at most 1000 times a second (at least a millisecond between checks), a single byte per button suffices for both debounce and state. This way button press detection is immediate, not delayed, and you can detect both transitions (pressed, released) and states (down, up).
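One possible way to pack this into a single byte per button is sketched below in C++. The layout is my assumption for illustration, not necessarily the only or best one: the top bit holds the debounced state and the low seven bits hold a lockout countdown. It assumes `debounceUpdate` is called once per millisecond per button, so the default lockout of 20 ticks corresponds to roughly 20 ms. All names are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

enum class KeyEvent : std::uint8_t { None, Pressed, Released };

constexpr std::uint8_t kStateBit    = 0x80;  // 1 = button is down
constexpr std::uint8_t kLockoutMask = 0x7F;  // remaining ticks to ignore changes

// Call once per tick with the raw reading for one button.
// A change is reported immediately; further changes are then ignored
// until the lockout countdown has run out.
inline KeyEvent debounceUpdate(std::uint8_t &key, bool rawPressed,
                               std::uint8_t lockoutTicks = 20) {
    assert(lockoutTicks <= kLockoutMask);  // countdown must fit in 7 bits

    const std::uint8_t lockout = key & kLockoutMask;
    const bool pressed = (key & kStateBit) != 0;

    if (lockout > 0) {
        // Still locked out: keep the state, count down, ignore the raw input.
        key = static_cast<std::uint8_t>((key & kStateBit) | (lockout - 1));
        return KeyEvent::None;
    }
    if (rawPressed == pressed) {
        return KeyEvent::None;  // no change
    }
    // Accept the change right away and start a new lockout period.
    key = static_cast<std::uint8_t>((rawPressed ? kStateBit : 0) | lockoutTicks);
    return rawPressed ? KeyEvent::Pressed : KeyEvent::Released;
}

// Current debounced state (down or up).
inline bool isDown(std::uint8_t key) {
    return (key & kStateBit) != 0;
}
```

The return value gives the transitions, and `isDown` gives the state, so both kinds of query come from the same byte.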
I have designed two carrier boards for Teensy LC, for 4×8=32 buttons and 9 analog potentiometers: this one using through-hole components, and this one using SMD (SOT23-3 common cathode, like BAS70-05 Schottky) diode pairs. Both are intended for use with wires to the buttons and potentiometers, for example via 1×2 (buttons) and 1×3 (pots) pin headers.
Each row output has space for a 10k resistor to protect the Teensy LC in case the software malfunctions; at 3.3 V they limit the current to about 0.3 mA. I recommend putting such resistors on the outputs. (If using a decoder, put them on both the Teensy outputs and the decoder outputs.)
Don't forget the pads on the bottom of the Teensy 4.0; you have another ten pins, 24-33, you can use. You can solder female or male pin headers here. (You could use a 10-wire flat cable and solder each wire directly, but pulling on the cable may rip off the pads, so I don't recommend that.) With a 74HC4154 (4 address outputs plus 6 inputs) you can support 16×6=96 buttons with these ten pins alone; a full keyboard.
If you would like to use standard 12mm×12mm or smaller tactile buttons in a tight grouping, consider rotating every second one 90 degrees, in a checkerboard pattern. Each button has four leads: a pair on one side, and another pair on the opposite side. Rotating them in a checkerboard pattern gives more room for routing the lines between the button legs, because the legs of one button are never next to the legs of another this way.
I also recommend using diagonal pins on each button, so you don't need to remember which way is always connected, and which way is only connected when the button is pressed. The other two legs are not connected; just use disconnected pads for them. (The pad is actually a plated through hole, so soldering all four legs does help keep the button on the board; it's just that two of the pads per button are not connected to anything else.)
The buttons sit on the board, so it is easiest to use SMD (Schottky) diodes on the other side of the board, and to use that side for the columns (input traces) and the button side of the board for the rows (output traces).