ARM assembler code

Status
Not open for further replies.

wangnick

Well-known member
Hi all,

following issue: Teensy 3.0 should work as logic analyser. Fastest possible sampling rate whilst streaming results via USB to PC.

So I've set up the PIT0 timer to sample the data port via pit0_isr into a run-length encoded circular buffer, from which the main loop fetches it and writes it to the PC.
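The producer/consumer arrangement described above can be sketched in plain C. This is only an illustration, not the actual pit0_isr code: the names (rle_ring_t, rle_sample, rle_pop), the 256-entry ring size and the 8-bit run counter are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 256u               /* hypothetical size; power of two so indices can be masked */
#define RING_MASK (RING_SIZE - 1u)

typedef struct {
    uint8_t value;                   /* sampled port byte */
    uint8_t count;                   /* consecutive samples with this value (1..255) */
} rle_entry_t;

typedef struct {
    rle_entry_t entries[RING_SIZE];
    volatile uint32_t head;          /* advanced only by the producer (the timer ISR) */
    volatile uint32_t tail;          /* advanced only by the consumer (the main loop) */
    uint8_t cur_value;               /* value of the run in progress */
    uint8_t cur_count;               /* length of the run in progress, 0 = none */
} rle_ring_t;

void rle_init(rle_ring_t *r)
{
    assert(r != NULL);
    r->head = 0;
    r->tail = 0;
    r->cur_value = 0;
    r->cur_count = 0;
}

/* Append the run in progress; returns 0 (dropping the run) if the ring is full. */
static int rle_push_run(rle_ring_t *r)
{
    uint32_t next = (r->head + 1u) & RING_MASK;
    if (next == r->tail)
        return 0;
    r->entries[r->head].value = r->cur_value;
    r->entries[r->head].count = r->cur_count;
    r->head = next;
    return 1;
}

/* Producer side: call once per sample. Returns 0 if a run was lost to overflow. */
int rle_sample(rle_ring_t *r, uint8_t v)
{
    int ok = 1;
    if (r->cur_count != 0 && v == r->cur_value && r->cur_count < 255u) {
        r->cur_count++;              /* common path: same value, just extend the run */
        return 1;
    }
    if (r->cur_count != 0)
        ok = rle_push_run(r);        /* value changed or counter saturated */
    r->cur_value = v;
    r->cur_count = 1;
    return ok;
}

/* Push the run in progress so the consumer can see it (e.g. when capture stops). */
int rle_flush(rle_ring_t *r)
{
    int ok = 1;
    if (r->cur_count != 0)
        ok = rle_push_run(r);
    r->cur_count = 0;
    return ok;
}

/* Consumer side: returns 1 and fills *out if an entry was available, else 0. */
int rle_pop(rle_ring_t *r, rle_entry_t *out)
{
    if (r->tail == r->head)
        return 0;
    *out = r->entries[r->tail];
    r->tail = (r->tail + 1u) & RING_MASK;
    return 1;
}
```

On the Teensy the ISR would call rle_sample() with the port value and the main loop would call rle_pop() and forward the entries over USB. The common path (same value as last time) touches only one counter, which is what keeps the ISR short.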

Unfortunately the pit0_isr takes about 1.2 us even after optimisation.

So I've rewritten pit0_isr in ARM assembler (not Thumb) for maximum speed. On the main path I'm now at 18 cycles of code, not including the bx lr at the end.

Now my questions:
  • I seemingly can't integrate my pit_assembler.S file into the Teensy Beta8 Arduino IDE. Any idea how to overcome this?
  • I can't call my pit0_isr easily from C++ anymore in order to measure its speed, I'm getting "Conflicting CPU architecture" from ld. Can I declare a procedure as ARM somehow, such that the proper bx call sequence is generated at the calling side?
  • Can someone with more experience confirm that ISR routines in ARM assembler are valid on the Cortex-M4? Which type of jump is contained in the vector table?
  • Is the bx lr contained in the 12 cycles minimum ISR overhead that I read about somewhere in the ARM docs? Do the 12 cycles apply in this case, making my code take 30 cycles on the main path?

I'll continue to investigate, but some help would be appreciated at this stage.

Kind regards,
Sebastian
 
For this application, I would explore using the DMA (direct memory access) controller to automatically sample 8 pins and place the data directly into memory. The DMA controller has a feature where you can chain 2 channels together, so the 2nd channel will automatically continue where the 1st left off. The idea is to configure both channels initially, then as each one finishes, an interrupt routine sets that one up again to begin when the other is done. Using that technique, you can (in theory) achieve perfect sampling with virtually no CPU overhead.
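The two-channel hand-off can be pictured as a ping-pong buffer. The following is a host-side simulation in plain C, not Kinetis register code: pingpong_dma_write() merely stands in for the hardware moving one byte, and every name and the 8-byte half size are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HALF_SIZE 8u                 /* hypothetical size of each half */

typedef struct {
    uint8_t  buf[2][HALF_SIZE];
    unsigned active;                 /* half currently being filled by the "DMA" */
    size_t   fill;                   /* bytes written into the active half */
    unsigned ready_mask;             /* bit n set: half n is full, waiting for the CPU */
    unsigned overruns;               /* times the DMA re-entered a half the CPU had not taken */
} pingpong_t;

void pingpong_init(pingpong_t *p)
{
    assert(p != NULL);
    p->active = 0;
    p->fill = 0;
    p->ready_mask = 0;
    p->overruns = 0;
}

/* Stand-in for one hardware transfer into the active half. */
void pingpong_dma_write(pingpong_t *p, uint8_t sample)
{
    p->buf[p->active][p->fill++] = sample;
    if (p->fill == HALF_SIZE) {
        /* This branch plays the role of the completion interrupt:
           mark the finished half ready and let the other half take over. */
        p->ready_mask |= 1u << p->active;
        p->active ^= 1u;
        p->fill = 0;
        if (p->ready_mask & (1u << p->active))
            p->overruns++;           /* CPU fell behind: this half still holds unread data */
    }
}

/* CPU side: returns a full half (and marks it free again), or NULL if none is ready. */
const uint8_t *pingpong_take(pingpong_t *p)
{
    for (unsigned i = 0; i < 2u; i++) {
        if (p->ready_mask & (1u << i)) {
            p->ready_mask &= ~(1u << i);
            return p->buf[i];
        }
    }
    return NULL;
}
```

The only per-half work on the CPU side is the hand-off itself; the per-byte work is what the real DMA controller would take over.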

How many packets you can transfer per millisecond depends on how the usb host controller in the PC allocates the bandwidth. You have no control over that. Teensy 3.0 is easily capable of 1 MByte/sec rate (even Teensy 2.0 can, with carefully optimized code), which is about as fast as most computers can go with 12 Mbit/sec usb. In theory 19 packets can transfer per 1ms usb frame, but in practice I've never seen more than 18 per frame. A more conservative approach might try for 500 or 800 kbytes/sec speed, in case other USB devices are active.
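The throughput figures follow from simple arithmetic on 64-byte packets and 1 ms frames. A small sketch (function and macro names are made up for the example):

```c
#include <assert.h>
#include <stdint.h>

#define USB_FS_PACKET_BYTES          64u    /* full-speed bulk packet size */
#define USB_FS_FRAMES_PER_SEC        1000u  /* one frame per millisecond */
#define USB_FS_MAX_PACKETS_PER_FRAME 19u    /* theoretical ceiling quoted above */

/* Payload bytes per second for a given number of packets in every frame. */
uint32_t usb_bytes_per_sec(uint32_t packets_per_frame)
{
    assert(packets_per_frame <= USB_FS_MAX_PACKETS_PER_FRAME);
    return packets_per_frame * USB_FS_PACKET_BYTES * USB_FS_FRAMES_PER_SEC;
}

/* Packets per frame needed to sustain a byte rate, rounded up. */
uint32_t usb_packets_per_frame_needed(uint32_t bytes_per_sec)
{
    uint32_t per_packet_rate = USB_FS_PACKET_BYTES * USB_FS_FRAMES_PER_SEC;
    return (bytes_per_sec + per_packet_rate - 1u) / per_packet_rate;
}
```

With 19 packets per frame this gives 1,216,000 bytes/sec and with 18 it gives 1,152,000 bytes/sec. A steady 800 kbytes/sec needs 13 packets in every frame, and 500 kbytes/sec needs 8.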

If you're really crafty, you could dig into usb_serial.c and replicate some of that code. You could call usb_malloc() to obtain 64 byte packet buffers ahead of time, then set up the DMA controller to automatically fill the 64 byte packet buffers, and give each one to the usb stack to transmit. That way, the data goes directly into memory with no per-byte CPU overhead, and the usb stack will transmit it directly from the same memory, with no per-byte overhead (the usb DMA is a completely separate DMA controller, so none of the 4 channels are consumed by usb). The DMA can automatically move all the data, so all the CPU needs to do is set up the transfers.

DMA is pretty wonderful like that. Unfortunately, it's also complex to use. The DMA controller chapter in the Freescale reference manual is long and filled with tons of features to read and understand. Despite the huge complexity, just think of the TCD as 8 parameters to figure out. I must admit, I've only used it a little so far, so I'm far from an expert...

Also, beware the errata. Check out Freescale's site for the errata on this chip. There are 2 silicon bugs you probably care about. There is some sort of issue using the PIT timers with the always-on DMA mux channels. Also, the scatter-gather feature does not work, which is a shame since it could do this with a single channel. Instead, you need to avoid the scatter-gather feature and use the 2 channel chaining feature (which does basically the same thing, but using 2 channels).

These caveats aren't meant to scare you, but rather save you time (hopefully) not getting stuck on known issues.

Good luck, and please post to let us know your progress. I'm really interested to see how it turns out.
 
Thanks Paul. So far I was too scared to dig into the DMA business. Also, I need to perform run-length encoding on the outgoing data if I want to go beyond 500-800 kSamp/sec, so filling the usb_packet_t buf via DMA is not an option. So I continued a bit on the pit0_isr path.

What I found out so far is that a basic ISR that simply resets the interrupt and increments a global counter already takes about 36 CPU cycles, suffocating the system if called faster than 2.4 MHz. With careful assembler crafting I might be able to fill a cyclic buffer with few enough CPU cycles, and enough remaining for the run-length encoding and the USB data transmission, to reach 1 MSamp/sec.

My target though is 2MSamp/sec. I want to analyse I2C bus communication. My slave (an Attiny84 running the TinyWireS library) seems to sometimes toggle the SCL line a microsecond or so after the SDA line, so there is seemingly some delay missing in the I2C slave code. But I need to capture quite a lot of data as this occurs only rarely.
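To put the 1 MSamp/sec and 2 MSamp/sec targets into cycles, here is a rough budget calculation. It assumes a 96 MHz CPU clock (an assumption for this example) and the roughly 36-cycle minimal ISR mentioned above; the function name is made up.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles left per sample for everything else (RLE, USB) once the ISR has run.
   A negative result means the ISR alone no longer fits into one sample period. */
int32_t cycles_left_per_sample(uint32_t cpu_hz, uint32_t sample_hz, uint32_t isr_cycles)
{
    assert(sample_hz != 0u);
    return (int32_t)(cpu_hz / sample_hz) - (int32_t)isr_cycles;
}
```

At 96 MHz this leaves 60 cycles per sample at 1 MSamp/sec but only 12 cycles at 2 MSamp/sec, which is why a per-sample interrupt looks marginal for the higher target.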

So I might let DMA fill the cyclic buffer such that I have enough CPU cycles left to perform the RLE, to fill the USB packet buffer, and to usb_tx() the data out.

Or I let the ISR trigger on PORT_PCR_IRQC_CHANGE only (due to RLE, the maximum sampling rate is already determined by the amount of logic change anyway).

Or I cheat and buy one of these 24MHz dedicated logic analyzers ...

Kind regards,
Sebastian
 
I have DMA ADC working as Paul described using two DMA channels (one to read back the conversion result and the other to fire the next conversion). The code is still a bit rough, but there are lots of comments. The biggest challenge[*] is that you have to align the buffer by its size in order to get the modulo feature to work. Code is here if you want to take a look. I'm going to clean this up and package it better.
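The alignment requirement comes from how modulo addressing wraps: the upper address bits stay fixed and only the low bits advance. A minimal sketch, assuming GCC or Clang (for the aligned attribute) and a hypothetical 64-byte buffer; modulo_next() only emulates the wrap in software and is not a real DMA call.

```c
#include <assert.h>
#include <stdint.h>

#define BUF_BYTES 64u                /* hypothetical size; must be a power of two */

/* Place the buffer on a boundary equal to its own size (GCC/Clang attribute). */
uint8_t sample_buf[BUF_BYTES] __attribute__((aligned(BUF_BYTES)));

/* True if p sits on a multiple of size (size must be a power of two). */
int is_aligned_to_size(const void *p, uintptr_t size)
{
    assert(size != 0u && (size & (size - 1u)) == 0u);
    return ((uintptr_t)p & (size - 1u)) == 0u;
}

/* Software emulation of a modulo address update: the bits above the
   modulo window are kept, the bits inside it advance and wrap. */
uintptr_t modulo_next(uintptr_t addr, uintptr_t step, uintptr_t size)
{
    uintptr_t mask = size - 1u;
    return (addr & ~mask) | ((addr + step) & mask);
}
```

For an aligned buffer at 0x1000, stepping past 0x103F wraps back to 0x1000 as intended. For a 64-byte buffer that starts at 0x1010, the same step from 0x103F also lands on 0x1000, which is below the start of the buffer, so the wrap silently writes into memory the buffer does not own.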

For 64 conversions at 96 MHz I'm seeing 66 us total (6213 cycles), 1.03 us (97.08 cycles) per channel. At 48 MHz it's 78 us total (3683 cycles), 1.22 us (57.55 cycles) per channel. I'm not really sure what, if anything, this means in terms of where it's bound. I'm amortizing some ISR cost in there (one interrupt at the end of the 64 conversion cycle). Right now I'm just doing a single shot of 64 conversions, but I think I could use the PDB to fire off periodic rounds. I have the ADC clocks turned all the way up and no averaging. The accuracy seems okay but not great. I'll post separately about that using jbeale's methodology.

This is fine for my application which has less to do with absolute speed and more that I'd like timing accuracy (sample all channels on the board as close to each other as possible). Also, I like the asynchronicity of an interrupt when all conversions are complete -- not having to wait around (or keep checking) for the conversions to complete. I'm planning something event based.

Hope this is helpful,

-c

[*] Honestly the actual hardest part of all this stuff for me is that there's always some bit you have to set somewhere in order to turn something on or make it actually work and it's never documented in with the thing you're trying to use.
 