Upgrade AVR cnc boards using Teensy 4.1

Status
Not open for further replies.

Anichang

Member
Hi,

TL;DR: I have some doubts about the number of pins available, PWM performance, and how to get a gcc-based toolchain.

I have 4 MKS Gen L boards (ATmega2560) driving the 19 steppers of my multitool 3D printer; 1 board = 1 YZ arm, 4 arms in total, sharing a single X axis. The goal of the upgrade is to be able to use all the arms as a single bot and, possibly, to raise the step rate enough to go beyond 1/16 microstepping.
The MKS Gen L + Teensy cost is about the same as buying new specialized boards based on some STM32F4 MCU, but I prefer to keep boards as general purpose as possible. So far so good.

The current issue is that I can't "glue" the boards together and keep the speed decent. Basically the AVR cpu+mem+comms triangle is very small; too small. A very simple serial protocol doesn't overload the CPU but requires too much bandwidth (i.e. more than the 500 kbaud available on the tty-USB port); increasing the complexity of the serial protocol overloads the CPU and makes the jitter go out of control; and there isn't really enough memory to buffer long stepping periods.
So I'd like to upgrade my setup by plugging a Teensy 4.1 on top of each MKS Gen L, then offloading the stepper control from the AVRs to the Teensy and leave all the "ancillary" functions (ie: heaters, sensors, powering, ...) on the AVRs. The idea is to use the MKS Gen L as carrier boards with all the mosfets and connectors already in place and take advantage of atmega2560 pins; after a quick check the Teensy should fit nicely on top of the MKS Gen L using a simple pcb adapter.
After connection I'd have USB and the individual stepper pins connected from the MKS Gen L to the Teensy USB host, and the USB device port of the Teensy going to the host computer. This way the fast realtime control (1 us periods) is on the Teensy and the slow realtime control (1 ms periods) is on the AVRs. There's a logic level mismatch to be adjusted (5 V <-> 3.3 V), but otherwise everything should be trivial to connect. The only thing I'm having issues with is figuring out how many pins are truly available without overlapping with the onboard peripherals (USB, SD, ethernet, external PSRAM).

The second doubt is about the real PWM performance. Considering I need to receive data from the AVR twice per millisecond, collate it with local data, and eventually forward it to the host, will I be able to bitbang at 250 kHz on 5+ pins with exact timing? On paper the CPU is blazing fast but ... go figure what happens in the real world...

The last thing I'd appreciate some input on is how to get a gcc toolchain. I currently have all my toolchains tidy and comfy; I use Gentoo's crossdev and I keep the toolchains in shape mostly with crossdev alone (x86, ARM for RPi and Odroid, AVR; then I manually plugged the ESP32 toolchain into the proper spot). I'd like to have a similar setup for the Teensy as well, but I couldn't figure out how to get that. The only howto I've found uses an stm32 docker image to build the NXP SDK... a bit convoluted... is there a way to just build a gcc-libc-binutils-make toolchain from sources?

Any advice appreciated.

Best Regards
 
To be honest, I don't quite get what you are trying to do. Based on grblHAL, the teensy 4.1 is capable of handling at least 5 stepper (step/dir/enable) outputs at once using hardware PWM. 600 kHz step rates are possible with lots of cycles left over though grblHAL is artificially constrained to 400 kHz. Not sure why you are bit banging them, you'll spend all your time doing that.

Frankly, from the sounds of it, I would toss the AVRs and just do everything with the teensys. There are plenty of cycles available.
 
PWM hardware can't run parallel Bresenhams for concerted stepper motor movement, so it has to be bit banged I believe.
(For instance for circular motion you can't set a fixed frequency...)
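To make the point concrete, here is a minimal sketch of Bresenham-style line interpolation across several axes. It is only an illustration, not code from grbl, grblHAL or Klipper; the names `plan_line`, `StepMask` and `kAxes` are made up for the example. Each tick of a common step clock yields a bitmask of the axes to pulse, which is the kind of per-tick decision a fixed-frequency PWM channel cannot make on its own.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <vector>

constexpr int kAxes = 3;  // hypothetical axis count for the example

// One tick of the step clock: bit i set means "pulse axis i on this tick".
using StepMask = std::uint8_t;

// Bresenham-style interpolation of a straight move over several axes.
// The axis with the most steps (the major axis) steps on every tick; the
// other axes step whenever their error accumulator goes negative.
// Direction is ignored here: only step counts matter, the DIR pins would
// be set once from the sign of each delta before the move starts.
std::vector<StepMask> plan_line(const std::array<std::int32_t, kAxes>& delta) {
    std::array<std::int32_t, kAxes> steps{};
    std::int32_t major = 0;
    for (int i = 0; i < kAxes; ++i) {
        steps[i] = std::abs(delta[i]);
        if (steps[i] > major) major = steps[i];
    }

    std::array<std::int32_t, kAxes> err;
    err.fill(major / 2);  // start half-way so minor-axis steps are centred

    std::vector<StepMask> ticks;
    ticks.reserve(static_cast<std::size_t>(major));
    for (std::int32_t t = 0; t < major; ++t) {
        StepMask mask = 0;
        for (int i = 0; i < kAxes; ++i) {
            err[i] -= steps[i];
            if (err[i] < 0) {
                err[i] += major;
                mask = static_cast<StepMask>(mask | (1u << i));
            }
        }
        ticks.push_back(mask);
    }
    return ticks;
}
```

For a move of (4, 2, 0) steps this produces four ticks, with axis 0 pulsing on every tick and axis 1 on every second one. An arc would be chopped into many such short segments, each with its own tick rate.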

T4 is blindingly fast, don't worry about this, just be sure to use digitalWriteFast - a simple loop can get over 37MHz from a single pin
at 150MHz clock, so updating half a dozen pins every microsecond or so shouldn't be too taxing - though there may be interrupt overhead
of course.
 
To be honest, I don't quite get what you are trying to do. Based on grblHAL, the teensy 4.1 is capable of handling at least 5 stepper (step/dir/enable) outputs at once using hardware PWM. 600 kHz step rates are possible with lots of cycles left over though grblHAL is artificially constrained to 400 kHz. Not sure why you are bit banging them, you'll spend all your time doing that.

Frankly, from the sounds of it, I would toss the AVRs and just do everything with the teensys. There are plenty of cycles available.

Ok, cool, thanks! I didn't know grblHAL. I've been digging in grbl code before (and appreciated the coding style) but didn't know a modular fork existed.

About bitbanging: I don't have to. But it depends on the hw I get: the hw PWM pins/timers available. I don't know anything about Teensies yet. On AVR I could do it without "spending time" (i.e. toggle and wait), but spending cycles instead (i.e. triggering IRQs, a ton of IRQs); no idea about best practices on Teensies.
Anyway, it sounds like there's enough headroom to bitbang (if I have to). I don't need 600 kHz; ~250 kHz is a safe max for the stepper drivers I have. They all have a 1.2-1.9 us minimum pulse width, so I can't do much more than ~250 kHz.
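The ~250 kHz ceiling can be checked with a couple of lines of arithmetic. The sketch below assumes the driver needs the step line to stay low for at least as long as it stays high; that is an assumption for illustration (the post only quotes the minimum pulse width), and `max_step_rate_hz` is a made-up helper name.

```cpp
#include <cassert>

// Upper bound on the step frequency for a driver that needs the STEP line
// held high for at least min_high_s seconds and low for at least min_low_s
// seconds. Both arguments are in seconds.
double max_step_rate_hz(double min_high_s, double min_low_s) {
    assert(min_high_s > 0.0 && min_low_s > 0.0);
    return 1.0 / (min_high_s + min_low_s);
}
```

With 1.9 us high and 1.9 us low the period is 3.8 us, so the bound is about 263 kHz; with 1.2 us on both sides it would be about 417 kHz. The slowest driver sets the limit, which matches the ~250 kHz figure.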

About tossing the AVRs: I'd like to. But then I'd have to make a custom PCB to add mosfets, passive parts and connectors, and rewire the whole thing. I've spotted some awesome T4 carrier boards on Tindie, but (cost apart) I don't see the difference between using the AVRs as carrier boards and using those Teensy-specific ones. I mean, a bit more software black magic is more comfortable than tinkering with hw.

Thanks for the tips!
 
PWM hardware can't run parallel Bresenhams for concerted stepper motor movement, so it has to be bit banged I believe.
(For instance for circular motion you can't set a fixed frequency...)

On the AVRs the stepper drivers are hardwired to pins not supported by the timers, so I had to bitbang the PWM signal. I was using 2 timers: one for prep and trigger, one for the actual exact pulsing. The first timer was directly manipulating the second timer's registers. I got this strategy from Hackaday. The result, as you can imagine, was an IRQ storm. But I usually do everything interrupt based and leave the 'best effort' stuff in the main loop. To control jitter I just computed the number of cycles used in each ISR, added the ISR call overhead, and counted the number of ISR calls over time, to be sure not to overload the CPU. I could shorten the pulse width by 62.5 ns (1 cycle) and watch the CPU explode. This way there are no sleeps wasting CPU time.
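The cycle budgeting described above boils down to a sum. A rough sketch follows, with made-up names (`IsrCost`, `isr_cpu_load`); the cycle counts and the per-call overhead are placeholders to be replaced by values measured from the actual build, not figures from this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Cost of one interrupt source: cycles spent in the handler body and how
// often it fires. Both values are hypothetical inputs.
struct IsrCost {
    std::uint32_t body_cycles;
    std::uint32_t calls_per_second;
};

// Fraction of the CPU consumed by all ISRs together. overhead_cycles is
// the per-call entry/exit cost (vectoring, register save/restore); it is
// passed in because it depends on the core and on the compiler.
double isr_cpu_load(const IsrCost* isrs, std::size_t count,
                    std::uint32_t overhead_cycles, std::uint32_t f_cpu_hz) {
    assert(f_cpu_hz > 0);
    std::uint64_t cycles_per_second = 0;
    for (std::size_t i = 0; i < count; ++i) {
        cycles_per_second +=
            std::uint64_t(isrs[i].body_cycles + overhead_cycles) *
            isrs[i].calls_per_second;
    }
    return double(cycles_per_second) / double(f_cpu_hz);
}
```

As an example with invented numbers: two handlers of 40 and 20 cycles, each firing 100,000 times per second with 10 cycles of overhead, use (50 + 30) x 100,000 = 8,000,000 cycles per second, i.e. half of a 16 MHz AVR. Anything approaching 1.0 leaves no room for the main loop.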

A similar approach for memory management: use global vars only, in order to have a well-known stack size at compile time, plus a dynamic pool filling the whole heap. At runtime all the incoming data is dumped into the dynpool, used as a circular buffer, with multiple pointers so that whichever ISR needs data can use it (pointers: buf_start, buf_decode, buf_use, buf_end). Zero memcpy, no CPU waste. By the end of development I could probably have removed the memory barriers as well, because there were no nested ISRs, so no chance for one ISR to access partial bytes.
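One possible reading of the four-pointer scheme, sketched in C++ on a host rather than on the AVR. The roles given to the cursors here are an assumption: `end` is where incoming bytes are written, `decode` is the next byte to be decoded in place, `use` is the next decoded byte to be executed, and `start` marks the oldest byte not yet released. `StepRing` and its method names are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Single circular buffer shared by three stages (receive, decode, use).
// Cursors are free-running counters; the slot index is counter & (N - 1).
// Invariant: start <= use <= decode <= end and end - start <= N.
// On a real MCU the cursors would be touched from ISRs and would need
// volatile/atomic handling; that is left out of this sketch.
template <std::size_t N>
struct StepRing {
    static_assert(N > 0 && (N & (N - 1)) == 0, "N must be a power of two");

    std::uint8_t data[N] = {};
    std::uint32_t start = 0, use = 0, decode = 0, end = 0;

    std::size_t free_space() const { return N - (end - start); }

    // Receive stage: append one incoming byte, fail if the ring is full.
    bool push(std::uint8_t b) {
        if (end - start == N) return false;
        data[end & (N - 1)] = b;
        ++end;
        return true;
    }

    // Decode stage: hand out a pointer so the byte is rewritten in place
    // (no copy). Fails when everything received is already decoded.
    bool decode_next(std::uint8_t*& slot) {
        if (decode == end) return false;
        slot = &data[decode & (N - 1)];
        ++decode;
        return true;
    }

    // Use stage: fetch the next decoded byte. Fails when none is ready.
    bool consume(std::uint8_t& out) {
        if (use == decode) return false;
        out = data[use & (N - 1)];
        ++use;
        return true;
    }

    // Give the consumed region back to the receive stage.
    void release() { start = use; }
};
```

A byte therefore stays in the same slot from reception to execution; only the cursors move, which is where the "zero memcpy" property comes from.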

But the result is excruciating pain for developers. And it's still not enough, even in assembly and with fewer function calls. Plus, the resulting code is a jungle hardly accessible to other developers. Not to mention (my) inability to go C++ ...

Indeed, the Bresenhams are computed on the Odroid-XU4 (8 "A" cores) before the build run starts, then the steps are sent to the AVRs (to the Teensy, after the upgrade) for realtime execution in the "near future". Have a look at Klipper. This is needed to offload the realtime controller as much as possible, to be able to keep more than 1 MCU in sync, and to move as much logic as possible to Python for easy development...
I haven't studied the geometry part yet; I don't know of anything "better" than Bresenham. But having 8 "A" cores at GHz speed allows the use of complex math (FFT?) and Python. On the T4 the FPU is a beast but still not even close to one of those "A" cores.

The 16MB of external RAM available on the T4 allows buffering about 10 seconds of build time in the worst-case scenario: a "step" is 4 bytes for the timestamp + 3 bytes for stepper id and vector, all steps sent without any kind of optimization, at 250 kHz. My first CD-ROM burner had 5 seconds of buffer and it was enough to not waste too many pristine CDs, but the mouse was stuttering when stopping the music player during a CD burning session...
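As a sanity check on that estimate, the arithmetic can be written out. The sketch assumes "16MB" means 16 MiB and a single stream of 250,000 steps per second at 7 bytes each; `buffer_seconds` is a made-up helper name.

```cpp
#include <cassert>
#include <cstdint>

// How long a buffer of buffer_bytes lasts when it is drained at
// steps_per_second records of bytes_per_step bytes each.
double buffer_seconds(std::uint64_t buffer_bytes, std::uint32_t bytes_per_step,
                      std::uint32_t steps_per_second) {
    assert(bytes_per_step > 0 && steps_per_second > 0);
    const std::uint64_t bytes_per_second =
        std::uint64_t(bytes_per_step) * steps_per_second;
    return double(buffer_bytes) / double(bytes_per_second);
}
```

With 16 x 1024 x 1024 bytes, 7 bytes per step and 250,000 steps per second, the drain rate is 1.75 MB/s and the buffer lasts about 9.6 seconds. If several steppers each ran at the full rate at the same time, the figure would shrink proportionally.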

T4 is blindingly fast, don't worry about this, just be sure to use digitalWriteFast - a simple loop can get over 37MHz from a single pin
at 150MHz clock, so updating half a dozen pins every microsecond or so shouldn't be too taxing - though there may be interrupt overhead
of course.

Cool, I'd say my doubt about pwm speed is definitively cleared.

I see implicit advice from you to use Teensyduino/Wiring. I can accept that, but just for curiosity's sake, is there a way to get that precious code out of the Arduino IDE and build it as a lib with a vanilla arm-none-eabi gcc 9.3.0/10.2.0?
 
You know about Teensyduino? Except for MCU peripherals, it should mostly be a recompile.
 
but just for curiosity sake, is there a way to get that precious code out of Arduino IDE and build it as a lib in a vanilla arm-none-eabi gcc 9.3.0/10.2.0 ?

On the Teensy 4.1 page:

https://www.pjrc.com/store/teensy41.html

Scroll down to "Software" and look for "Command Line with Makefile".

We use an older version of gcc. Sometimes newer versions print warnings or even errors for code the older versions accepted. Usually these things are "trivial" but if you want a smooth experience, plan on using the same (old) version of gcc which Teensyduino uses.
 
On the Teensy 4.1 page:

https://www.pjrc.com/store/teensy41.html

Scroll down to "Software" and look for "Command Line with Makefile".

My bad, I saw a long list of IDEs and didn't pay attention to the last item in the list. I'm impressed, it works for me: I adapted the Makefile to build a static libt4.a and then used my gcc v10.2.0 to compile and link my simple test app. I don't know whether complex apps will work as well; if not, I'll build everything with the supplied gcc v5. I don't need C++17 or C++20.

What about reflection? I see -fno-rtti, is that for AVR only or ...

BTW: At first I tried to build the supplied template main.cpp, but WProgram.h threw an error on the memcpy extern. At first glance it seems memcpy-armv7m.S doesn't get built ... I suppose it should, because it is the optimized version.

We use an older version of gcc. Sometimes newer versions print warnings or even errors for code the older versions accepted. Usually these things are "trivial" but if you want a smooth experience, plan on using the same (old) version of gcc which Teensyduino uses.

Yeah, up to 2018 it was hard to find a decent gcc version for AVRs; 5.4.x was the last one supported by Linux distros, otherwise you had to build from source. Today things are much better; AVR works with gcc v10.2.0, so you can probably bump yours as well!!!
And it looks like gcc v10 will be the last one with AVR support, so... it's a good time for one last version bump with a unified gcc version for AVR and ARM. Next time you'll probably be EOL-ing the AVR Teensies!

Anyway, thank you all for the help; all my doubts are cleared. I'm going to buy T4s right after this message and then start working on my code. Please, if you don't mind, leave the thread open.
 