CPU cycles and data extraction

Status
Not open for further replies.

hillerup

Member
Hello,
I posted a similar question yesterday, but after further research I have some more concrete questions.

I have counted the number of CPU clock cycles using the code listed below.

I can see that Serial.print uses 277 cycles and Serial.println takes 391 (this of course depends on the size of the data).
However, is there a faster way of getting values onto the PC?

Code:
volatile int cycles;

void setup() {
  // put your setup code here, to run once:
  while (!Serial);  // wait for the USB serial connection
  delay(100);
  ARM_DEMCR |= ARM_DEMCR_TRCENA;           // enable the debug/trace block
  ARM_DWT_CTRL |= ARM_DWT_CTRL_CYCCNTENA;  // start the DWT cycle counter
}

void loop() {
  uint32_t startCycleCPU;
  startCycleCPU = ARM_DWT_CYCCNT;

  Serial.println(123);  // the call being timed

  cycles = (ARM_DWT_CYCCNT - startCycleCPU) - 1;
  //__enable_irq();
  Serial.print("cycles: ");
  Serial.println(cycles);
  delay(1000);
}
 
Transmitting strings has a lot of overhead. If you transmit binary values you'll be much faster (Serial.write)
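
As a rough illustration of why raw bytes are cheaper than text, the sketch below compares how many bytes each form puts into the buffer. It is plain C++ that runs on a PC, not Teensy code. The helper names (asciiDigits, asciiBytes, byteOf) and the line-ending length parameter are made up for this example, and it assumes both ends are little-endian.

```cpp
#include <cstddef>
#include <cstdint>

// Number of decimal digits a text print has to generate for v.
constexpr std::size_t asciiDigits(std::uint32_t v) {
  return v < 10 ? 1 : 1 + asciiDigits(v / 10);
}

// Bytes buffered for a text line: the digits plus the line ending.
// lineEndingBytes is an assumption supplied by the caller (1 for LF, 2 for CR LF).
constexpr std::size_t asciiBytes(std::uint32_t v, std::size_t lineEndingBytes) {
  return asciiDigits(v) + lineEndingBytes;
}

// Bytes buffered for a raw 32-bit value: fixed size, no digit conversion.
constexpr std::size_t binaryBytes() { return sizeof(std::uint32_t); }

// Byte i of v (0 = least significant), which is the order a little-endian
// sender puts on the wire and the receiver has to reassemble.
constexpr std::uint8_t byteOf(std::uint32_t v, unsigned i) {
  return static_cast<std::uint8_t>((v >> (8 * i)) & 0xFF);
}

// On the Teensy the binary send itself would be along the lines of:
//   uint32_t value = 123;
//   Serial.write(reinterpret_cast<const uint8_t *>(&value), sizeof(value));
```

For small numbers the byte count is similar either way (123 is 3 digits plus a line ending, against 4 raw bytes), so most of the saving comes from skipping the number-to-text conversion. The price is that the PC side has to rebuild the value from bytes.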
 
True, it will be about 169 cycles (for the same number). But when using the data, it's easier to have the values in ASCII format.
Is there a faster way than using Serial.something?
 
Is there a faster way than using Serial.something?

Not that I'm aware of, but usually the USB bus (+ PC software) is the limiting factor, not the Teensy. With a T4 you can easily transmit > 15 MB/sec.
 
Your code does not tell you how fast you are "getting values onto the PC". It is only measuring how long it takes Serial.println to buffer the characters and initiate the transfer.

If you change your code to transfer a string of 79 chars, which will also generate the LF on the end making 80 in all:
Code:
Serial.println("1234567890123456789012345678901234567890123456789012345678901234567890123456789");
On a T4.1 the result is 272 cycles, which is a rate of over 176 MB/s.
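
For reference, that figure is just bytes divided by elapsed time. A minimal sketch of the arithmetic, assuming the T4.1 runs at its usual 600 MHz and taking 1 MB as 10^6 bytes (the function name is made up for illustration):

```cpp
// Apparent buffering rate in MB/s for `bytes` handled in `cycles` CPU cycles.
// elapsed seconds = cycles / cpuHz, so rate = bytes / elapsed.
constexpr double bufferRateMBps(double bytes, double cycles, double cpuHz) {
  return bytes * cpuHz / cycles / 1e6;
}

// 80 bytes in 272 cycles at 600 MHz is roughly 176.5 MB/s.
constexpr double t41Rate = bufferRateMBps(80.0, 272.0, 600e6);
```

This is far more than USB can carry, which is the point: the number describes how fast the data is copied into the buffer, not how fast it reaches the PC.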

Are you measuring the right thing and just how fast do you actually need?

Pete
 
Hello Pete,
Thanks for the reply.
I'm using the Teensy 3.6 (sorry for not making this clear to start with).
I'm not sure what you mean. The println of course adds an extra LF, and the number of cycles used depends on the size of the data.

Are you measuring the right thing and just how fast do you actually need?

I just need to know the limits on how fast I can extract values from the Teensy without delaying the rest of the code more than necessary. That is, are there software commands other than Serial.print which could extract the same data faster?
With a large amount of data I assume it would be faster to use the built-in USB1 at 480 Mbit/s.
 
As noted, this is a semi double post, causing replies on two threads for the same issue. There is more info on the original thread, with an answer there from Paul ... Output-processed-data-fast. Some info is perhaps still missing: what device receives the data being sent? What protocols are available? If it is a PC then direct USB is best, and at some point the actual receive speed depends on the host device. Teensy USB output is native fast full speed USB.

(from other thread) This is on a T_3.6, so unlike the T_4.0 device USB at 480 Mbps, the T_3.6 runs at 12 Mbps. But that is mostly an answer to a question on the other thread ... Because of the different bit rates and packet sizes, the T_3.6 transfers about 1 MB/sec in 64 byte packets, while the T_4.x, at about 20 times faster [typically limited by the host receiving the data], uses 512 byte packets. That holds if packets are filled; otherwise partial packets are sent after some timeout.
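
To make the packet sizes concrete, here is a small sketch of how a block of data splits into 64 byte or 512 byte packets. It only does the counting; the function names are invented for this example and nothing here is taken from the Teensy USB code.

```cpp
// Number of packets needed to carry `bytes` with a given packet size
// (ceiling division; zero bytes need zero packets).
constexpr unsigned packetsNeeded(unsigned bytes, unsigned packetSize) {
  return (bytes + packetSize - 1) / packetSize;
}

// Bytes in the final packet. A value smaller than packetSize means the last
// packet is only partially filled. Assumes bytes > 0.
constexpr unsigned lastPacketBytes(unsigned bytes, unsigned packetSize) {
  return bytes % packetSize == 0 ? packetSize : bytes % packetSize;
}

// 1000 bytes: 16 packets of 64 bytes (last one holds 40),
// or 2 packets of 512 bytes (last one holds 488).
```

In both cases the last packet is a partial one, which is the case described above where it goes out only after the timeout unless more data arrives to fill it.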

Not sure anything is faster than USB in overall throughput at either of the above speeds. Paul has worked to optimize throughput and minimize Teensy overhead. As Paul noted on the other post, the transfer is done by DMA, using the USB protocol for data validation, etc., all with minimal impact on the Teensy once the data is queued for DMA send. As Pete notes, the time measured above is just processing and buffering the data: in-memory manipulation to format and store the data for transmit. This time would be part of any data transfer, whether UART Serial or SPI.

Once the data is queued in a buffer for USB transfer, the processor returns almost exclusively to the user task, more so than with most other protocols, as the 64 or 512 byte packets typically exceed the UART FIFO sizes or the SPI data groups. So even 30 MHz SPI on a T_3.6 under DMA control would have a hard time matching that given the SPI overhead, and it has none of the data verification or other checks inherent in a USB transfer. It also might not be suitable for reception, depending on the host getting the data.
 
I suspect the cost of the communication will mean that any slight savings of CPU cycles will be hidden by the large amount of time to send the data.

The Teensy 3.x processors send data at a maximum of 12 megabits/second (~ 1.2 megabytes/second). That of course is a theoretical maximum; you lose a lot of bandwidth along the way. You lose some bandwidth due to USB having to break things down into packets, and such. But you can also lose a lot of bandwidth because of delays on the host side, either in the host OS itself, or in inefficient ways of reading the data stream in your client application.

If you could upgrade to a Teensy 4.0 or 4.1, the USB channel there runs at USB 2.0 speeds of a maximum of 480 megabits/second (~ 48 megabytes/second). But note, we've seen various complaints that with the introduction of the Teensy 4.0 running at USB 2.x speeds, many host OSes can't really drink from the fire hydrant of USB data coming from a device at full speed.
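
To put rough numbers on this, the sketch below compares the time a packet spends on the wire with the CPU time spent buffering it. The bandwidth figures are the approximate maxima quoted above, not measurements, the 180 MHz clock is assumed for the Teensy 3.6, and the function names are made up for illustration.

```cpp
// Time on the wire in microseconds for `bytes` at a link rate given in
// megabytes/second (bytes / (MB/s) comes out directly in microseconds).
constexpr double wireTimeUs(double bytes, double megabytesPerSecond) {
  return bytes / megabytesPerSecond;
}

// CPU time in microseconds for `cycles` at a clock given in MHz.
constexpr double cpuTimeUs(double cycles, double cpuMHz) {
  return cycles / cpuMHz;
}

// One full 64 byte packet:
//   Teensy 3.x at ~1.2 MB/s : about 53.3 us on the wire
//   Teensy 4.x at ~48 MB/s  : about 1.33 us on the wire
// The 391 cycles measured for Serial.println at 180 MHz: about 2.17 us of CPU.
constexpr double t3WireUs = wireTimeUs(64.0, 1.2);
constexpr double t4WireUs = wireTimeUs(64.0, 48.0);
constexpr double printlnCpuUs = cpuTimeUs(391.0, 180.0);
```

On the 3.x, then, shaving a hundred cycles off the print call changes the CPU cost by well under a microsecond, while the packet itself still needs tens of microseconds to go out even at the theoretical maximum.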

You would have to restructure your application to send data via networking sockets, but with the Teensy 4.1 you have the option of direct ethernet that can run at 10 or 100 megabit/second. While 100 Mbit/second of ethernet is slower than 480 Mbit/second of USB 2, I suspect the system is more geared towards streaming large network packets than USB packets. But I don't know; somebody would have to test it. I might be wrong on this.
 
Hello defragster,
Thanks for your reply.
This brought a lot of clarity to my understanding of the subject. To clarify, I am sending data from the Teensy 3.6 to the PC through the USB port, with the Serial.print command.
As I understand your reply, the cycles I count are what it takes to prepare the data for being sent off over USB. Sending the data from the T_3.6 results in almost no delay for running the remainder of the code.
Thus there is not really a more efficient way of preparing and transmitting the data from the T_3.6 to the PC than USB with the Serial.print command.
 
You're welcome.

There can be differences in how the print is executed. Here is timing for a few variations; notice the times change on subsequent loop() calls. This is on a T_3.6:
Code:
println() cycles: 428

print('\n') cycles: 210
123
println('123') cycles: 288
123
print("123\n") cycles: 167
123
printf("123\n") cycles: 1190
==============================
println() cycles: 189

print('\n') cycles: 166
123
println('123') cycles: 369
123
print("123\n") cycles: 167
123
printf("123\n") cycles: 824
==============================
println() cycles: 185

print('\n') cycles: 180
123
println('123') cycles: 274
123
print("123\n") cycles: 166
123
printf("123\n") cycles: 752
==============================

That came from this code:
Code:
volatile int cycles;

void setup() {
  while (!Serial);
  delay(100);
  ARM_DEMCR |= ARM_DEMCR_TRCENA;
  ARM_DWT_CTRL |= ARM_DWT_CTRL_CYCCNTENA;
}

void loop() {
  uint32_t startCycleCPU;

  startCycleCPU = ARM_DWT_CYCCNT;
  Serial.println();
  cycles = (ARM_DWT_CYCCNT - startCycleCPU) - 1;
  showR("println() cycles: ");

  startCycleCPU = ARM_DWT_CYCCNT;
  Serial.print('\n');
  cycles = (ARM_DWT_CYCCNT - startCycleCPU) - 1;
  showR("print('\\n') cycles: ");

  startCycleCPU = ARM_DWT_CYCCNT;
  Serial.println("123");
  cycles = (ARM_DWT_CYCCNT - startCycleCPU) - 1;
  showR("println(\"123\") cycles: ");

  startCycleCPU = ARM_DWT_CYCCNT;
  Serial.print("123\n");
  cycles = (ARM_DWT_CYCCNT - startCycleCPU) - 1;
  showR("print(\"123\\n\") cycles: ");

  startCycleCPU = ARM_DWT_CYCCNT;
  Serial.printf("123\n");
  cycles = (ARM_DWT_CYCCNT - startCycleCPU) - 1;
  showR("printf(\"123\\n\") cycles: ");

  Serial.print("==============================");
  Serial.flush();
  delay(10000);
}

void showR(const char *out) {  // const: callers pass string literals
  Serial.print(out);
  Serial.println(cycles);
  Serial.flush();
  delay(100);
}
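
One detail the code above relies on: ARM_DWT_CYCCNT is a 32-bit counter that wraps around, and the subtraction still gives the right answer across a wrap because unsigned arithmetic is modulo 2^32. A small PC-side sketch of that, with invented function names and the 180 MHz and 600 MHz clocks assumed:

```cpp
#include <cstdint>

// Elapsed cycles between two reads of a free-running 32-bit counter.
// Correct even if the counter wrapped once between start and end.
constexpr std::uint32_t elapsedCycles(std::uint32_t start, std::uint32_t end) {
  return end - start;  // unsigned subtraction wraps modulo 2^32
}

// Seconds until the 32-bit counter wraps at a given CPU clock in Hz.
constexpr double secondsPerWrap(double cpuHz) {
  return 4294967296.0 / cpuHz;
}

// Start just before the wrap, end just after it: still 32 cycles.
constexpr std::uint32_t acrossWrap = elapsedCycles(0xFFFFFFF0u, 0x00000010u);
```

At 180 MHz the counter wraps about every 23.9 seconds, and at 600 MHz about every 7.2 seconds, so a single measurement has to be shorter than that. That is no problem for timing one print call.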

Here is the output with a T_4.1 at 600 MHz instead of 180 MHz:
Code:
println() cycles: 148

print('\n') cycles: 127
123
println('123') cycles: 240
123
print("123\n") cycles: 125
123
printf("123\n") cycles: 549
==============================
println() cycles: 142

print('\n') cycles: 127
123
println('123') cycles: 237
123
print("123\n") cycles: 125
123
printf("123\n") cycles: 546
==============================
println() cycles: 142

print('\n') cycles: 132
123
println('123') cycles: 237
123
print("123\n") cycles: 125
123
printf("123\n") cycles: 546
==============================

Note: the output showing println('123') was of course produced by code using println("123") ... code edited ...
 