Options for 'bare-metal' development

Status
Not open for further replies.

Strooom

Active member
Hi, I am doing a development for a CNC/Motion Controller.
As a first target HW I want to run it on a Teensy3.5

I would like to achieve the following :
  • It should be easy to compile the SW and download to the target HW - eg. as simple as in the Arduino IDE today..
  • I do not want to include the typical Arduino 'overhead', ie. no setup of peripherals behind the scenes. All peripherals config will be done by the application itself and I don't want anything running that we don't know about, as this could disturb the real-time behaviour that we need to control multiple stepper-motors at high speeds..

So what are my options ?

  1. Arduino IDE / Teensyduino : is there a way to omit the peripheral configuration (eg USB, Serial, timers, ...) and go for the minimum build with a crt0.s, sysinit.c ?
  2. Visual Studio (optionally with Visual Micro extension) - I probably need a customized makefile here - how could we distribute that in a failproof way to the community ?
  3. PlatformIO - can we do a minimum build here, and is there a simple upload for the resulting .hex
  4. other..

An example for this is GRBL, which achieves the above (for the Arduino Uno) as follows :
  • all .c files are in a 'grbl' *library* folder - (so you have to install a library, not just download a sketch-file)
  • this library has an 'examples' folder - as an 'example' there is an 'upload.ino' sketch.
  • the upload.ino sketch has no setup() or loop(), it only #includes the 'top-level' grbl.h file

Many thanks for your suggestions/opinions!
 
So what are my options ?

Buying only the hardware, reading the 1000+ page manual from Freescale, and developing the SW from scratch WITHOUT using PJRC's core SW, as Paul's core SW does exactly what you did not want, namely setting up the Teensy to be Arduino compatible. Obviously most users of this forum can only help you with Arduino-style development.

Having said that,
I suggest having a look at Paul's Teensyduino distribution and you will see that this is as bare-metal as you can get. There are a few things in it during startup that you may not want, but you can remove them.

Also if you do not like the Arduino IDE and the associated pre-compiler, use a standard text editor and the makefile provided by Paul in the cores SW.
If you need an IDE, take any one you can configure for the ARM toolchain, or that can handle makefile projects. It is your choice.
 
The Teensy ARM is different in setup requirements than AVR - Paul's notes suggest that bare without his (well tuned) setup is really a non-functional state.

Start with option 1 - learn the Teensy . . . Compile with Tools / USB Type : None :: removes USB/etc. There is a systick 1ms timer that fires for a var++ - and can be stopped if you want to lose that. There isn't much other overhead I've seen going on after that - except that you can define your own void yield(){} to stop any checks for serialEvent# processing - it will then just sit there calling loop(). All the code is installed on your system to alter/adjust/remove.
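A minimal sketch of the empty yield() mentioned above. It assumes the installed Teensy core defines yield() as a weak symbol, so a definition in the sketch replaces it (worth confirming in the core sources on your system). The loopCount variable is hypothetical and only there for illustration.

```cpp
// Sketch-level override: with an empty yield(), nothing is polled for
// serialEvent() processing between calls to loop().
// Assumption: the installed Teensy core declares yield() as a weak symbol.
extern "C" void yield(void) {}

volatile unsigned long loopCount = 0;  // hypothetical counter, illustration only

void setup() {
    // Application-owned peripheral configuration would go here.
}

void loop() {
    loopCount = loopCount + 1;  // stand-in for the real-time application work
}
```

Selecting Tools / USB Type : None remains an IDE menu setting and is not something this code changes.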

Might be a fun exercise to get the cycle counter running and see how many cycles are spent outside loop code average and worst case - then remove those items one by one and see what is left.
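One way to collect the average and worst-case figures suggested here is a small accumulator fed with two readings of a free-running 32 bit cycle counter. This is only a sketch: the CycleStats name is made up, and reading the counter is left to the caller. On Teensy 3.x the reading would presumably come from the Cortex-M4 DWT cycle counter, which the core appears to expose as ARM_DWT_CYCCNT once enabled; check kinetis.h before relying on that name.

```cpp
#include <cstdint>

// Tracks average and worst-case cycle counts between two readings of a
// free-running 32-bit cycle counter. The hardware read is the caller's job.
struct CycleStats {
    uint32_t worst = 0;
    uint64_t total = 0;
    uint32_t samples = 0;

    // Unsigned subtraction gives the right delta even if the counter
    // wrapped once between the two readings.
    void record(uint32_t start, uint32_t end) {
        const uint32_t delta = end - start;
        if (delta > worst) worst = delta;
        total += delta;
        ++samples;
    }

    uint32_t average() const {
        return samples ? static_cast<uint32_t>(total / samples) : 0;
    }
};
```

Taking one reading when loop() returns and another when it is entered again would give the cycles spent outside loop code.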

Teensy on Arduino just offers consistent install/IDE/library options before you compile - once you hit compile only what is used gets included.
 
Thanks for the prompt answers!,

It's not that I don't like the Arduino IDE : it's a beginner-friendly easy access toolchain, so it's a preferred way of distributing applications to a wide audience. (I've even donated :p)
I just need to know what code it is adding (for the sake of making things easy) which may interfere with the application itself, eg. timers, UARTs, etc.

I also understand that the K6x needs more initialisation than the AtMega328 (int-vectors, clocks, etc) and it would be great if the TeensyDuino could do that for me, or at least be a source of inspiration.

Might be a fun exercise to get the cycle counter running and see how many cycles are spent outside loop code average and worst case - then remove those items one by one and see what is left.
I will do this kind of test anyhow, as I want to be sure how much time is spent in the ISR, as well as to ensure the hardware floating point is used instead of software libraries. I'll post the results when I have them.

Again, thanks for all the help!
 
One more quick question : will the TeensyDuino enable the floating point HW when Teensy 3.5 is selected ?
 
First to directly answer your "bare metal" question, several people have gone down this path and shared their work. Here's one of the best. Maybe others will recall links to more of them, or perhaps you'll find them with a little searching...

http://kevincuzner.com/2014/12/12/teensy-3-1-bare-metal-writing-a-usb-driver/

If you don't already have it, you really must get Joseph Yiu's book.

https://www.amazon.com/dp/0124080820

Freescale's reference manual documents all the peripherals, but it has only a few pages that summarize how they configured the ARM core. There are tons of critically important features of the ARM core which you really must understand if you're going to achieve best performance. This book is really the only approachable source for that info. ARM's reference manuals document but don't explain how things are meant to work. Joseph Yiu does. No matter what approach you take, the info in this book is essential for getting the most out of the ARM processor.

Now I can tell from your question you're determined to achieve best possible performance. That's good. You also seem convinced discarding all the Teensy core library work would be a best way to accomplish your goals. That's not so good. With a lot of work, you'll manage to trim some flash memory and RAM usage (which amounts to just a few percent of the total memory), but you'll gain nothing performance-wise. If you decide to support USB or serial communication, you'll be hard pressed to achieve the levels of performance already present in Teensyduino's core library, which makes effective use of advanced features like USB DMA and serial FIFOs.

Of course, if you build this as an Arduino library & sketch, as GRBL does, you get full access to the hardware registers. Teensyduino gives you a good base to build upon, and it'll be much easier to share your work with others. I highly recommend using it to avoid all the mundane stuff, so you can focus on making your code run fast. It really won't be holding you back. If you're used to 8 bit AVR or PIC, probably the best thing you can do is read Joseph Yiu's book and spend some time experimenting with the powerful hardware features.

For example, you're probably conditioned to avoid floating point math, if you're used to lesser processors. On Teensy 3.5, the FPU is basically the same speed as 32 bit integer math, as long as you're careful to avoid 64 bit double (the FPU only does 32 bit floats). As soon as you go to special effort like extra if-else code to use "fast" integers, you're usually going to end up slower than if you'd just done it the easy way with 32 bit float. It really is that fast.

Interrupts are another feature that sometimes has a poor performance reputation from the AVR world. Using more than 1 interrupt on AVR is especially problematic, because serial communication interrupts mess up low-latency interrupt response. On ARM you get prioritized nesting with hardware optimizations for context save/restore. You'd be doing yourself a tremendous disservice to apply that "only 1 interrupt" strategy from the 8 bit world, when the ARM NVIC can allow your high priority interrupt to always get low latency service, even while lower priority communication interrupts are running. You'll discover Teensyduino already has sensible default interrupt priorities (unlike Arduino Due where they never supported this feature), and the highest 2 priority levels are unused by any existing libraries so you can have them fully available.

The DMA channels are also an incredibly powerful feature you can leverage in some cases, especially moving data on/off chip at fast sustained speeds, using basically zero CPU overhead other than setting up the DMA transfers.

Of course, if you want to start with a blank slate and do everything yourself from scratch, you certainly can. That first link will show you the way. But odds are you'll spend a year or more getting up to the already very good performance offered by Teensy's core library. It's your choice, but I hope you'll leverage what already exists and focus your efforts where they will matter most. Either way, get Joseph Yiu's book!
 
will the TeensyDuino enable the floating point HW when Teensy 3.5 is selected ?

Yes. Just use float variables in your code and the FPU is automatically used.

For the trig, log and other math.h functions, make sure you use the 32 bit version. For example, use sinf(), not sin().
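A small illustration of how easily an expression slips from float to double, checked at compile time. The axisOffset() helper is a made-up name; the point is only the literal suffix and the sinf() call.

```cpp
#include <math.h>
#include <type_traits>

// A bare 0.5 is a double literal, so the whole expression is promoted to
// 64-bit double, which a single-precision FPU cannot execute in hardware.
static_assert(std::is_same<decltype(1.0f * 0.5), double>::value,
              "float * double literal is evaluated as double");

// With the f suffix everything stays 32-bit float.
static_assert(std::is_same<decltype(1.0f * 0.5f), float>::value,
              "float * float literal stays float");

// sinf() takes and returns float, so no double arithmetic is involved.
inline float axisOffset(float radius, float angleRad) {
    return radius * sinf(angleRad);
}
```

Compiler warnings such as GCC's -Wdouble-promotion can help spot the first case in a larger code base.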
 
Actually there is not that much BARE metal needed for a gcode interpreter. (Hello, fellow countryman, by the way.)
I am also beginning to develop my motion controller after transitioning from due to teensy.
As far as I understood, there are only 2 ISRs running on teensy 3.5/3.6 . One is a 1ms systick and then some stuff for the usb.

The systick ISR really only does 1 integer addition so that won't slow down. Remember 3.6 runs 180MHz and overclocked even 240!
Bresenham is a very lightweight all-integer algorithm and the CPU handles 32 bit integers as if they were nothing. The only thing needed is a couple of interrupts to enable/disable the stepper step pins to keep them up around 10 microseconds and digitalWriteFast (yes it is as fast as writing directly to the register) for the other signals. Then you also need a good way to count elapsed time in microseconds (micros/elapsedMicros) and all the rest of the code is pure C code.
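For readers who have not seen it, a generic multi-axis Bresenham (DDA) stepper can be sketched as below. This is the textbook integer form, not code from GRBL or from either project in this thread; kAxes, LineMove, beginMove() and tick() are names invented for the example, and direction handling is left out.

```cpp
#include <cassert>
#include <cstdint>

// Integer-only multi-axis Bresenham (DDA): one call to tick() per step of the
// dominant axis; each axis pulses when its error accumulator goes negative.
constexpr int kAxes = 3;  // hypothetical axis count

struct LineMove {
    int32_t delta[kAxes];  // absolute step counts per axis
    int32_t error[kAxes];  // per-axis error accumulators
    int32_t major;         // step count of the dominant axis
};

inline LineMove beginMove(const int32_t (&steps)[kAxes]) {
    LineMove m{};
    for (int i = 0; i < kAxes; ++i) {
        assert(steps[i] >= 0);  // direction pins are handled separately
        m.delta[i] = steps[i];
        if (steps[i] > m.major) m.major = steps[i];
    }
    for (int i = 0; i < kAxes; ++i) m.error[i] = m.major / 2;
    return m;
}

// Returns a bitmask of the axes that must pulse on this tick.
inline uint8_t tick(LineMove& m) {
    uint8_t mask = 0;
    for (int i = 0; i < kAxes; ++i) {
        m.error[i] -= m.delta[i];
        if (m.error[i] < 0) {
            m.error[i] += m.major;
            mask |= static_cast<uint8_t>(1u << i);
        }
    }
    return mask;
}
```

Calling tick() exactly m.major times emits every axis's full step count, with the shorter axes spread evenly along the move.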
 
First to directly answer your "bare metal" question, several people have gone down this path and shared their work...
Yes, I collected most of them already and am trying to re-use some of that effort (it all started with Karl Lunt : https://www.seanet.com/~karllunt/bareteensy31.html). However they don't address how to do 'bare-metal' AND do it from an easy-to-use IDE, so I thought my question was legitimate.

If you don't already have it, you really must get Joseph Yiu's book.
...
There are tons of critically important features of the ARM core which you really must understand if you're going to achieve best performance. This book is really the only approachable source for that info. ARM's reference manuals document but don't explain how things are meant to work. Joseph Yiu does. No matter what approach you take, the info in this book is essential for getting the most out of the ARM processor.
I have it, and will read it. This book however does not cover the peripherals specific to the NXP K64/K66, and I thought that a large portion of the 'hidden' init code has to do with configuring the peripherals.

... You also seem convinced discarding all the Teensy core library work would be a best way to accomplish your goals. That's not so good. With a lot of work, you'll manage to trim some flash memory and RAM usage (which amounts to just a few percent of the total memory), but you'll gain nothing performance-wise. If you decide to support USB or serial communication, you'll be hard pressed to achieve the levels of performance already present in Teensyduino's core library, which makes effective use of advanced features like USB DMA and serial FIFOs.
No, I don't want to discard the Teensy core - If I can know what's running - how and why, I'm certainly in favour of reusing them. I'm just worried about some 'black-box' piece of code, interfering with critical real-time stuff.
I am also not worried about flash or ram size, there is plenty of that in the Teensy3.5/3.6 and so I don't want to spend my energy on saving a few bytes there.
I am happy to learn that Teensy Core libraries are high quality (will need to see that with my own eyes of course ;)) but I think you understand my worries coming from Arduino/AVR where a large part of the libraries on the web are really poor quality

Teensyduino gives you a good base to build upon, and it'll be much easier to share your work with others.
A good balance between efficient code and easy to share my work is what this question was all about.

For example, you're probably conditioned to avoid floating point math, if you're used to lesser processors. On Teensy 3.5, the FPU is basically the same speed as 32 bit integer math, as long as you're careful to avoid 64 bit double (the FPU only does 32 bit floats). As soon as you go to special effort like extra if-else code to use "fast" integers, you're usually going to end up slower than if you'd just done it the easy way with 32 bit float. It really is that fast.
No, I figured that out already. I'm doing 6 axis, 3rd order (S-profile) motion profiles (linear and helical) and so it involves (a lot of) floating point and (some) trig.. But with HW floating-point it should be feasible. It also simplifies stuff, as many of the 'integer-math-only' workarounds gain performance, but at the cost of complexity...
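As a rough idea of the arithmetic involved, position within one constant-jerk segment of a 3rd order profile follows the standard kinematic polynomial. The function below is a generic sketch of that formula in float, not the actual implementation discussed here; jerkSegmentPosition() is a hypothetical name.

```cpp
#include <cassert>

// Position along one axis within a constant-jerk segment (3rd-order profile):
//   s(t) = s0 + v0*t + a0*t^2/2 + j*t^3/6
// written in Horner form so it needs only float multiplies and adds
// (plus one division that could be precomputed per segment).
inline float jerkSegmentPosition(float s0, float v0, float a0, float j, float t) {
    assert(t >= 0.0f);  // time is measured from the start of the segment
    return s0 + t * (v0 + t * (0.5f * a0 + t * (j / 6.0f)));
}
```

All literals carry the f suffix so the expression stays in 32 bit float throughout.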

I hope you'll leverage what already exists and focus your efforts where they will matter most.
I will.

Thanks for your extensive answers and great advice!!
 
Actually there is not that much BARE metal needed for a gcode interpreter.
Yes, I know - GCode interpreter is a non-real-time thing, and I don't worry about that.. My worry is about the timer interrupt to the stepper-motors which may run up to 100-200 KHz

Bresenham is a very lightweight all integer algorithm....
Mmmmhh,
I decided to not use Bresenham.. It's one of those 'integer-math' workarounds which also introduces limitations and complexity :
* It doesn't do ARCs, and I don't want to expand ARCs to a large number of straight lines either..
* You only need 1 float multiply per axis, in the real-time part... - that's perfectly feasible on the K64

I think our projects may benefit from each other - let's stay in touch.

Pascal
 
I'm actually interested in this particular topic, Bresenham seems to have some other limitations too, like with completely smooth movement in all axes, and I also don't want to have a trapezoidal acceleration profile in my algorithms so I will be interested in what Strooom discovers using the Teensy's, which I also purchased for some experimentation in this area. Regarding using 32 bit HW floating point, this only leaves 6-7 significant digits available, and you quickly lose precision on very large CNC machines without any warning, so I've coded up fixed point routines (yes, far more complexity) to give 9-10 digits of precision instead with (I expect) similar performance to that... But even if it's ever so slightly worse, in some cases I think it does make a difference to the precise calculations going on. At least, I've patched a bunch of out of range issues with my library that I know a "float" would have produced the wrong result. I'm talking > 1~2 meter axes with fairly high micro-stepping.
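The precision concern can be reproduced with a few lines. The sketch below assumes a hypothetical machine with 3200 microsteps per mm and a position held in mm as a 32 bit float; it is not the fixed point library described here, just an integer step counter for comparison. Near 2000 mm a float can only resolve about 0.00012 mm, so adding one step (0.0003125 mm) at a time rounds on every addition.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical machine: 3200 microsteps per mm, position held in mm.
constexpr int32_t kStepsPerMm = 3200;

// Accumulates `steps` single-step moves onto a float position in mm.
inline float advanceFloatMm(float startMm, int32_t steps) {
    assert(steps >= 0);
    const float stepMm = 1.0f / static_cast<float>(kStepsPerMm);
    volatile float pos = startMm;  // volatile: force a 32-bit store every step
    for (int32_t i = 0; i < steps; ++i) pos = pos + stepMm;
    return pos;
}

// The same move tracked as an integer step count stays exact.
inline int32_t advanceSteps(int32_t startSteps, int32_t steps) {
    return startSteps + steps;
}
```

Starting at 2000 mm and advancing 3200 steps should land on 2001 mm; with IEEE-754 single precision the float version ends up more than 0.1 mm away, while the integer count is exact.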

I've also finally received the SN74AHCT125N chips I ordered (to replace the 3.3v to 5v bidirectional logic level converters I was using, which are pull-up based) and they are working much better (thanks Paul). However since I moved on to the LPC1768 board while I waited for them (a RE-ARM board I ordered at the same time as the 3.5 Teensy's) I've managed to get much more working on that platform. My current issue (looking to get the software I've written working on both devices, which may be a hard task) is that the LPC1768 has 4 * 32 bit timers, and each of those timers has a 32 bit pre-scaler counter. This particular feature has allowed me to move the S-Curve acceleration profile into the pre-scaler while the normal timer counter handles the desired movement speed.

I've managed to get a max. 125KHz stepping (granted, on only one stepper at a time) on the 100MHz LPC1768 board, divide that for each additional stepper needed (3D printing requires 3 going at one time, and I am using a different timer for the extruder so I can "advance" its output) but to be honest (and this is why I'm interested in what Strooom comes up with) I'm really struggling to work out how to get it working on the Teensy's the same way with the same performance now? The few things I've tried (software interrupts) don't perform nearly as well.

(I know this is a sort-of negative post, but I really am now struggling to go back to the 3.5 Teensy's. Maybe the 3.6's would be better?)
 
I had a quick read through your wiki - I like your plan & will follow/contribute wherever I can.
I think the Teensy 3.5/3.6 is the perfect platform for this, I'd be happy to design a minimalistic CNC-specific PCB (e.g. screw terminals & buffers for step/dir, a few LEDs for EN signals, optoisolated limit switch inputs) for the Teensy to plug into.
 
I had a quick read through your wiki - I like your plan & will follow/contribute wherever I can.
I think the Teensy 3.5/3.6 is the perfect platform for this, I'd be happy to design a minimalistic CNC-specific PCB (e.g. screw terminals & buffers for step/dir, a few LEDs for EN signals, optoisolated limit switch inputs) for the Teensy to plug into.

Great, this is indeed on my ToDo list. Your offer to help is appreciated and will speed things up!
I'm thinking along the same lines :
* ULN2803 output buffers towards the stepper drivers and solid-state relays
* OptoCoupled and filtered inputs for limit-switches and buttons
* maybe a RS232<>TTL driver (one for comms to a PC and one for a pendant..)
* some 12V and 5V power-supply stuff.
* screw terminal blocks

I will draw a schematic-sketch and we can go from there. There is a HW-section on the repository exactly for this stuff.

Thanks!
 
Crossworks for Arm will do exactly what you want.

It's a very easy to use IDE that uses the GNU compiler, and I'm pretty sure is free up to a certain code size limit.

A bare minimum Kinetis project will give you just enough code to set up some default interrupt vectors, boot and clear out memory etc., then call an empty main().


What peripherals do you think you'll need for your project? Just GPIO, a timer and possibly UART?
I wouldn't have thought that would be too difficult to set up just from the k64F manual.
 
I've managed to get a max. 125KHz stepping (granted, on only one stepper at a time) on the 100MHz LPC1768 board, divide that for each additional stepper needed (3D printing requires 3 going at one time, and I am using a different timer for the extruder so I can "advance" its output) but to be honest (and this is why I'm interested in what Strooom comes up with) I'm really struggling to work out how to get it working on the Teensy's the same way with the same performance now? The few things I've tried (software interrupts) don't perform nearly as well.

I've often thought about the possibility of generating step+direction pulses using timer compare and DMA channels. If done well, this could possibly allow 4 steppers to run at high rates with low CPU overhead.

However, the fundamental nature of DMA involves calculating the steps well in advance and storing the data into buffers. You get a notification when each buffer is completed. It's a huge change in approach from generating the individual steps under software control.
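To make the "calculate in advance" part concrete, here is one possible buffer layout, purely as an assumption: one output byte per fixed timer tick, which a timer-triggered DMA channel could copy to a GPIO port. The function name and layout are hypothetical, and no actual Teensy DMA configuration is shown or implied.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Fills `buf` with one output byte per timer tick for a single axis.
// Bit `stepBit` is high on ticks where a step pulse starts and low otherwise.
// `intervalTicks` is the spacing between step pulses, in ticks.
// Returns the number of step pulses written.
inline size_t fillStepBuffer(uint8_t* buf, size_t len,
                             uint8_t stepBit, uint32_t intervalTicks) {
    assert(stepBit < 8);
    assert(intervalTicks >= 2);  // need at least one low tick between pulses
    size_t pulses = 0;
    for (size_t i = 0; i < len; ++i) {
        const bool pulse = (i % intervalTicks) == 0;
        buf[i] = pulse ? static_cast<uint8_t>(1u << stepBit) : 0;
        if (pulse) ++pulses;
    }
    return pulses;
}
```

With a fixed tick, the pulse width equals one tick, so the tick period would also set the step pulse width.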

Does this sound like something you guys could use?
 
Does this sound like something you guys could use?

Perhaps that would be useful, depending on how multiple DMA transfers affect one another timing wise. It was something I considered too but I got a bit lost when I read the DMA section of the reference manual. I was really only considering it for the 10us step pulse delay I need though. I had considered FTM for that delay too, but in the end went with what I know, another simple timer interrupt which isn't too bad because it's only a fairly small amount of code. I think my main issue is the main step timer interrupt (the start of that process) because it has to fire for every step and for every motor (Bresenham would only require it firing for the motor with the highest frequency). There's quite a bit more code in that one, so more overhead saving the machine state. But I was going to live with 41KHz stepping with 3 motors going.

I had also considered chaining PIT timers on the teensy so I could still get the S-curve and target speed all working in hardware like the LPC1768, but there's a note that timer 0 can't be chained (I may be making an incorrect assumption that it means it can't be "chained to" from timer 1). But even if I had T0/T1 and T2/T3 chains, that's all my timers used (which might still be workable) however on the LPC1768 I'm driving up to 4 movement motors from one timer and up to 4 extruder motors off another timer (using the 4 match registers each timer has). I don't think the Teensy timers support this, they just count down to zero before doing anything, which also made my head hurt a little.

If DMA can be used somehow, it would be a real winner, because I'm not 100% happy with what I've got working so far.
 
my main issue is the main step timer interrupt (the start of that process) because it has to fire for every step and for every motor

Yup, that's exactly the problem DMA solves. You set up a buffer in advance with many outputs. The DMA engine steals memory cycles occasionally, but even that rarely has much impact on the CPU because the bus matrix allows a path from the RAM to DMA that doesn't conflict with the flash access for code. The RAM is in 2 banks, so accesses only cause a wait state if using the same bank.

I might try fiddling to see if I can come up with a DMA example for you and others who might wish to control steppers.

At this point, I'm not 100% sure I understand what you mean by language like "the 10us step pulse delay". I'm guessing you're generating fixed width (HIGH for 10 us) pulses, with variable time (LOW) between them. If there are important details or special issues I should consider, now's a good time to explain....
 
That's correct, the DRV8825 & TMC2100 stepper motor drivers I'm using need the step signal to be around 10us wide I think. That's the only important detail. In the interrupt I set the direction signal then immediately after, the step signal and finally the timer match for 10us which turns that off when it fires. I set the enable signal earlier, in the gcode processor outside the interrupt.
 
I'll start by saying I'm still interested in seeing a DMA stepper example for Teensy (after Paul gets his T3.6 2nd USB stuff working which seems more important, for now) however...

I've been thinking about DMA (and reading up a bit on other developments such as: https://github.com/gnea/grbl/issues/67#issuecomment-269822638 , and also others that are somewhat less interesting).

I suspect there may be an issue (Strooom mentioned it there too, so I'm interested in any comments on the following thoughts). My concern is that 3D printer firmwares need access to SD cards and LCD panels, and using them will incur delays in processing. I note that graphics LCDs are both slower (~500KHz SPI) and have a larger memory footprint (1KB) than an SD card buffer (~??MHz SPI & 512 bytes a sector, though I'm unsure exactly how many SD card sectors need reading in a row for various operations), so I just worked out some numbers for the LCD as a guide:

500,000Hz SPI clock, 16 bits transferred per byte (nibble processing that's required by the LCD interface) = 500,000 / 16 = 31,250 bytes/s.

Divide that by 1024 bytes per screen-full (128x64 pixels) and add the 15us per byte of additional delay that's also required (Arduino libraries use a 10us value but ignore the extra overhead/delay of the function entry-prologue/exit-epilogue, this is very different on much faster CPUs and my tests show 15us is needed for reliable usage).

1024 * 15us = 15.36ms.
31,250 bytes/s / 1024 bytes/frame = 30.52fps, 1000ms / 30.52 = 32.77ms a frame.

= 48.13ms a frame total (it's actually a fair bit higher than this because of LCD control/addressing commands that I've not counted).

Luckily, an LCD isn't usually refreshed in one go because they are so slow, top-half/bottom-half is frequently used, and sometimes "in 4 quarters" is also used. (And it would be annoying to have to rewrite a library, so I'll go with 4...)

This means an approx. 12ms delay before the DMA data supplying routines get the CPU after an LCD refresh. I'm also looking to get 100KHz stepping out of this process (otherwise, I might as well stick to my current interrupt method and the ever so slightly bitter taste in my mouth) so that's:
100,000 * 0.012 = 1200 DMA buffers per stepper being controlled.

Let's say there's 20 bytes needed for each DMA buffer = 24,000 bytes * 3 steppers being controlled at a time = 72,000 bytes (~70KB) to deal with what is actually a very short delay (and I suspect SD cards could end up being much worse).
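The estimate in this post can be re-run in a few lines, which makes it easy to try other displays or step rates. All inputs are the figures quoted in the post, not measurements, and the function names are made up. Note that 500,000 / 16 works out to 31,250 bytes/s, so a full frame comes to about 48 ms and a quarter frame to about 12 ms.

```cpp
#include <cassert>
#include <cstdint>

// All figures below are the estimates quoted in the post, not measurements.
constexpr double kSpiHz          = 500000.0;  // LCD SPI clock
constexpr double kClocksPerByte  = 16.0;      // nibble mode: 16 clocks per byte
constexpr double kFrameBytes     = 1024.0;    // 128 x 64 pixels, 1 bit each
constexpr double kExtraUsPerByte = 15.0;      // per-byte delay from the post

constexpr double bytesPerSecond() { return kSpiHz / kClocksPerByte; }

// Time to push one full frame, in milliseconds.
constexpr double frameMs() {
    return kFrameBytes / bytesPerSecond() * 1000.0
         + kFrameBytes * kExtraUsPerByte / 1000.0;
}

// Bytes of step data that must be queued to ride out a stall of `stallMs`.
inline uint32_t bufferBytes(double stallMs, uint32_t stepHz,
                            uint32_t bytesPerEntry, uint32_t steppers) {
    assert(stallMs >= 0.0);
    const double entries = stallMs / 1000.0 * stepHz;
    return static_cast<uint32_t>(entries * bytesPerEntry * steppers + 0.5);
}
```

With a 12 ms stall, 100KHz stepping, 20 bytes per entry and 3 steppers, bufferBytes() gives 72,000 bytes.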

So... I'm not totally sure this is a way forward.
 
That would be a slow display - seems I ran the i2c oled ssd1306 [128x96?] at 4MHz with adafruit code - the color TFT's certainly run faster at 20+ MHz SPI - the 128x128 was updating bar graph meters at or over 200 FPS with Sumotoy driver. I just ordered some '0.95 inch SPI Full Color OLED Display Module 96X64 SSD1331 LCD' I should see this month that Sumotoy made what should be a good driver for.

With sequential reads on one file the SDIO should get you megabytes per second.
 
This means an approx. 12ms delay before the DMA data supplying routines get the CPU after an LCD refresh. I'm also looking to get 100KHz stepping out of this process (otherwise, I might as well stick to my current interrupt method and the ever so slightly bitter taste in my mouth) so that's:
100,000 * 0.012 = 1200 DMA buffers per stepper being controlled.

I was imagining a library providing this DMA-driven output would give you an ISR-context callback when each buffer is completed. You could queue up another buffer within the callback, or later if you like. Of course, to keep the pulses happening without any gaps, you'd need to make sure there are always at least 2 buffers queued (including the buffer being processed as pulses).
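The queueing rule described here can be modelled on a PC before touching any hardware. The sketch below is a host-side model only: PingPong, fill() and onBufferDone() are hypothetical names, no such library API exists in this thread, and the DMA engine is replaced by explicit calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Host-side model of a two-buffer (ping-pong) queue: the "DMA engine" plays
// one buffer while the completion callback lets the other be refilled.
struct PingPong {
    static constexpr size_t kLen = 64;  // hypothetical buffer length
    uint8_t buf[2][kLen] = {};
    bool ready[2] = {false, false};     // filled and waiting to be played
    int active = 0;                     // buffer currently being played
    unsigned underruns = 0;

    // Called by application code to (re)fill an idle buffer.
    void fill(int idx, uint8_t value) {
        assert(idx == 0 || idx == 1);
        for (size_t i = 0; i < kLen; ++i) buf[idx][i] = value;
        ready[idx] = true;
    }

    // Models the ISR-context "buffer completed" callback: release the
    // finished buffer and switch to the other one.
    // Returns the index of the buffer that is now free to refill.
    int onBufferDone() {
        const int done = active;
        ready[done] = false;
        active ^= 1;
        if (!ready[active]) ++underruns;  // would be a gap in the pulse train
        return done;
    }
};
```

As long as the freed buffer is refilled before the next completion, underruns stays at zero; skipping one refill shows up as a counted gap.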
 
Well, we (3D printer enthusiasts) will be looking at using somewhat "slower" displays than those, these are the actual timings for the reprapdiscount glcd which are popular with our community - so there's little that can be done there. @defragster, thanks for that hint but if I were to spend even more money on new hardware to solve issues, I'd do so with specialised hardware counters to control the motor drivers rather than on a new display panel. It's a good idea, I agree, but it leaves too many existing setups out of the equation.

@Paul, that's not really a "great" solution, replacing one ISR with another slightly more complex ISR called half as many times. It's interesting though :) Edit: or did I misunderstand, and you're talking about much larger buffers?
 
The more I think about this, the less convinced I am that it will provide the desired improvement (at least with the T3.5 which is only slightly faster than the LPC1768 I'm also testing).

What is being made more efficient here is the context save and restore for the ISR, by calculating [possibly many] more steps per interrupt. All the other calculations still need to be done, though. I've got 125KHz working on a 100MHz CPU, so the two ISRs together take around 800 ticks/instructions, but a 10% efficiency gain here (a bad guess at the context save/restore) isn't going to translate to anything like a 2.43 * improvement by using DMA. (3 steppers @ 41KHz back up to 100KHz each)
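The budget arithmetic behind these numbers, as a tiny sketch (function names are made up; the inputs are the figures from this post):

```cpp
#include <cassert>
#include <cstdint>

// CPU cycles available per step when one stepper runs at `stepHz`.
inline uint32_t cyclesPerStep(uint32_t cpuHz, uint32_t stepHz) {
    assert(stepHz > 0);
    return cpuHz / stepHz;
}

// Per-stepper rate when the same ISR budget is shared by `steppers` motors.
inline uint32_t sharedStepHz(uint32_t singleStepHz, uint32_t steppers) {
    assert(steppers > 0);
    return singleStepHz / steppers;
}
```

For a 100MHz CPU at 125KHz this gives 800 cycles per step, and sharing that budget between 3 steppers gives roughly 41KHz each, so getting back to 100KHz per stepper needs about a 2.4x reduction in per-step cost.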

It may be a way to get the T3.5 hardware to do what the LPC1768 is doing for me (4 * 32 bit timers, with 4 * 32 bit prescaler counters and 4 * 4 match compare registers) but the complexity it adds probably isn't worthwhile, and it becomes much harder to port code later.

I might have to consider just going back to the Bresenham algorithm.
 