Fast 8 bit parallel I/O for T4.0

For an 8-bit wide bus using GPIO6 bits 16-19 and 24-27, I'd use temp = val ^ state; GPIO6_DR_TOGGLE = ((temp & 15) << 16) | ((temp & 240) << 20); state = val & 255; to update the bus state to val.

For a 10-bit wide bus using GPIO6 bits 16-19, 22-27, I'd use temp = val ^ state; GPIO6_DR_TOGGLE = ((temp & 15) << 16) | ((temp & 1008) << 18); state = val & 1023; to update the bus state to val.

In both cases, the state variable tracks the value currently on the bus, which lets us use the GPIOx_DR_TOGGLE register: writing a 1 to a bit toggles that pin, and writing a 0 leaves it alone, so we do not modify the states of the other pins in the same GPIO port. All variables above are of uint32_t type.
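To make the toggle trick easier to check without hardware, here is a minimal host-side sketch of the 8-bit case. It assumes a plain variable (sim_gpio6_dr, a made-up name) standing in for the GPIO6 data register, and sim_toggle() only models the "write 1 to toggle" behaviour; on a real Teensy 4.0 you would write the mask to GPIO6_DR_TOGGLE instead. The names set_bus8, get_bus8 and bus_state8 are hypothetical too.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the GPIO6 data register, so the logic can be tested on a PC.
// (Assumption: on hardware, replace sim_toggle(mask) with GPIO6_DR_TOGGLE = mask.)
static uint32_t sim_gpio6_dr = 0;
static uint32_t bus_state8 = 0;   // last value written to the 8-bit bus

// Models a write to a toggle register: every 1 bit flips that pin.
static void sim_toggle(uint32_t mask) { sim_gpio6_dr ^= mask; }

// Reads the bus value back from GPIO6 bits 16-19 (low nibble) and 24-27 (high nibble).
static uint32_t get_bus8() {
    return ((sim_gpio6_dr >> 16) & 0x0Fu) | ((sim_gpio6_dr >> 20) & 0xF0u);
}

// Drives the 8-bit bus to val by toggling only the bits that differ.
static void set_bus8(uint32_t val) {
    uint32_t temp = (val ^ bus_state8) & 0xFFu;
    sim_toggle(((temp & 0x0Fu) << 16) | ((temp & 0xF0u) << 20));
    bus_state8 = val & 0xFFu;
    // Sanity check: the simulated pins match the tracked state.
    assert(get_bus8() == bus_state8);
}
```

Note that this only works if nothing else changes the bus pins behind your back; the tracked state and the real pin state must stay in sync.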

If you need an N-bit wide bus within any single GPIO port, but with a different (arbitrary) pin order, you can do it with a look-up table of 1<<N = 2^N uint32_t entries in Flash or RAM. It will not be as fast as using consecutive runs of bits in order, though. If N is large, you can split the bus into sub-buses. As long as they are all in a single GPIO port, they will still change state at the same time.

You can implement the above 10-bit bus using a look-up table in RAM with, for example:
Code:
constexpr int       bus_bits = 10;
constexpr uint8_t   bus_bit[bus_bits] = { 16, 17, 18, 19, 22, 23, 24, 25, 26, 27 };
constexpr uint32_t  bus_mask = (1 << bus_bits) - 1;
volatile uint32_t   bus_state;
uint32_t            bus_lookup[1 << bus_bits];

static inline void set_bus(uint32_t value) {
    GPIO6_DR_TOGGLE = bus_lookup[(value ^ bus_state) & bus_mask];
    bus_state = value & bus_mask;
}

void setup() {

    // Fill bus_lookup
    {
        uint32_t  i = 1 << bus_bits;
        while (i-->0) {
            uint32_t  value = 0;
            uint32_t  bits = i;
            for (int bit = 0; bit < bus_bits; bit++, bits >>= 1)
                if (bits & 1)
                    value |= uint32_t(1) << bus_bit[bit];
            bus_lookup[i] = value;
        }

        // Todo: set pins to outputs

        // Initialize bus state to 0
        bus_state = 0;
        GPIO6_DR_CLEAR = bus_lookup[bus_mask];
    }

    // Other setup() stuff you need...

}
The bus_lookup table will have 1024 entries, and thus consume 4096 bytes of RAM. You can split the look-up table into two, three, or more much smaller tables (for example, the 1024-entry table into two 32-entry ones), or even do without a table completely:
Code:
static inline void set_bus(uint32_t value) {
    uint32_t  changes = (value ^ bus_state) & bus_mask;
    uint32_t  toggle = 0;
    for (int bit = 0; bit < bus_bits; bit++, changes >>= 1)
        if (changes & 1)
            toggle |= uint32_t(1) << bus_bit[bit];
    GPIO6_DR_TOGGLE = toggle;
    bus_state = value & bus_mask;
}
Just remember to initialize the bus state to 0, configure the relevant pins as outputs, and drive them low.

The useful thing is that with this lookup-based approach, you can use any bits in the same GPIO bank and in any order you want, for a modest runtime cost (in memory and speed).
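As a rough sketch of the split-table idea mentioned above, the 10-bit bus can use two 32-entry tables (5 bus bits each), 256 bytes in total instead of 4096. The pin list is the same as in the earlier code; the names lookup_lo, lookup_hi, fill_split_lookup and toggle_mask are made up for this example, and the function returns the mask rather than writing GPIO6_DR_TOGGLE so that it can be checked on a PC.

```cpp
#include <cassert>
#include <cstdint>

constexpr int      kBusBits  = 10;
constexpr int      kHalfBits = 5;
constexpr uint8_t  kBusBit[kBusBits] = { 16, 17, 18, 19, 22, 23, 24, 25, 26, 27 };
constexpr uint32_t kBusMask  = (1u << kBusBits) - 1;
constexpr uint32_t kHalfMask = (1u << kHalfBits) - 1;

static uint32_t lookup_lo[1 << kHalfBits];   // GPIO masks for bus bits 0-4
static uint32_t lookup_hi[1 << kHalfBits];   // GPIO masks for bus bits 5-9

// Fill both 32-entry tables; call once from setup().
static void fill_split_lookup() {
    for (uint32_t i = 0; i <= kHalfMask; i++) {
        uint32_t lo = 0, hi = 0;
        for (int bit = 0; bit < kHalfBits; bit++) {
            if (i & (1u << bit)) {
                lo |= uint32_t(1) << kBusBit[bit];
                hi |= uint32_t(1) << kBusBit[bit + kHalfBits];
            }
        }
        lookup_lo[i] = lo;
        lookup_hi[i] = hi;
    }
}

// Returns the mask you would write to GPIO6_DR_TOGGLE to go from state to value.
static uint32_t toggle_mask(uint32_t value, uint32_t state) {
    assert(lookup_lo[1] != 0 && "call fill_split_lookup() first");
    uint32_t changes = (value ^ state) & kBusMask;
    return lookup_lo[changes & kHalfMask] | lookup_hi[changes >> kHalfBits];
}
```

The cost compared to the single table is one extra shift, mask, load and OR per update, which is usually a good trade for the RAM saved.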

I'm personally working on a variant of this approach: 18-bit RGB parallel output using three 6-bit look-up tables, but calculating the actual emitted color at runtime based on a couple of framebuffers on top of each other. The idea is to have a full-color background one can blit to, but also be able to blit antialiased glyphs/sprites/characters on top, independent of each other. The maximum clock rate on ILI9341 etc. display controllers with such an interface is typically about 15 MHz, so technically there are on the order of 40 clock cycles (at 600 MHz) per bus output word to get that done. Reminds me of the early days of 6502 assembly programming. 😉
 
I don't think it will work the way the writer wanted it to. When I started working on my scheme, I thought I could avoid some right-shifting by just reading the upper 16 bits of DR but it didn't work -- the data I got was incorrect when compared to a 32-bit read. I concluded that the processor doesn't "like" accessing just one half of a register. It's all 32 bits on integral register address boundaries, or nothing.
You need to read the reference manual about things like memory operations on memory-mapped registers. Different rules apply than for cached SRAM and the like, because the address-decode logic is not necessarily shared between different types of memory-mapped resources. In general, if an I/O register is documented as 32 bits wide, it is best to always access it as a full 32-bit word unless it's documented to work at other widths.
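A minimal sketch of the "always access it 32 bits wide" advice: do one full-width read and extract the half you want in software. Here sim_dr is a made-up stand-in for a memory-mapped data register so the helper can be tested on a PC, and read_upper16 is a hypothetical name, not something from the Teensy core.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for a 32-bit memory-mapped register (assumption for host testing).
static volatile uint32_t sim_dr = 0;

// Performs exactly one 32-bit read of the register, then returns bits 16-31.
// The shift costs a cycle, but avoids a half-word access to the peripheral.
static inline uint16_t read_upper16(const volatile uint32_t *reg) {
    assert(reg != nullptr);
    uint32_t whole = *reg;                      // single full-width access
    return static_cast<uint16_t>(whole >> 16);  // keep the upper half only
}
```

On the Teensy you would presumably call it as read_upper16(&GPIO6_DR), assuming the core's register macro expands to a volatile uint32_t lvalue; check the core headers before relying on that.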
 