Query speed limit teensy 4.1

Status
Not open for further replies.

JPPAD

Member
Query:

Is there some way to optimize the following sequence of statements (pulsing a port pin high and then low as fast as possible), for example by combining both statements in inline assembly code?

digitalWriteFast(0,HIGH);
digitalWriteFast(0,LOW);

Both statements take approximately 2 ns + 2 ns = 4 ns on a Teensy 4.1 with the CPU clocked at 1 GHz (the limit).
I require the total execution time of both statements to be less than 4 ns.
 
Each of those statements is compiled to a single assembler instruction so I don't think you can improve on it.
Does it make any difference if you compile with Tools|Optimize: "Fastest" ?

Pete
 
IIRC, Paul mentioned something about writes to the digital pins being throttled some time back. I recall there was some magic if you truly need that speed (which it sounds like you might).
 
In this message Paul says that digitalWriteFast compiles to a single STR instruction but that it can take two clock cycles to execute.
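
The timing figures quoted in this thread follow from simple cycle arithmetic. Below is a minimal host-side sketch, assuming (as reported above) that each store takes two CPU cycles; the function names `cyclesToNs` and `pulsePairNs` are hypothetical helpers, not part of any Teensy library.

```cpp
#include <cassert>
#include <cstdint>

// Duration in nanoseconds of `cycles` CPU cycles at a clock of `clockHz`.
// Assumption: the store occupies a whole number of CPU cycles.
double cyclesToNs(std::uint32_t cycles, double clockHz) {
    assert(clockHz > 0.0);
    return cycles * 1e9 / clockHz;
}

// Time for a HIGH store followed by a LOW store, each taking
// `cyclesPerStore` cycles (two cycles per store is the reported figure).
double pulsePairNs(std::uint32_t cyclesPerStore, double clockHz) {
    return 2.0 * cyclesToNs(cyclesPerStore, clockHz);
}
```

With two cycles per store, `pulsePairNs(2, 1e9)` gives 4 ns at 1 GHz, which matches the measurement in the original question; at 600 MHz the same two stores would take roughly 6.7 ns.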
@tonton81 may be right that toggling could be faster because, for one thing, toggling the bit on and then off would reference the same GPIO6_DR_TOGGLE address twice (instead of GPIO6_DR_SET and then GPIO6_DR_CLEAR) which should require fewer registers to set up the address for the STR instructions.
Maybe give digitalToggle a try and see what it does. Just do:
Code:
digitalToggle(0);
digitalToggle(0);

Pete
 
The digitalToggle function doesn't compile down to a single STR like digitalWriteFast does but you can use this in its place:
Code:
CORE_PIN0_PORTTOGGLE = CORE_PIN0_BITMASK;
CORE_PIN0_PORTTOGGLE = CORE_PIN0_BITMASK;

I can't test this so I don't know if it's any faster. Give it a try.

Pete
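
To make the set/clear versus toggle comparison concrete, here is a small host-side model of a GPIO port. It is only a sketch: `GpioPortModel` and the two pulse functions are hypothetical, and the assumed semantics (a store to SET, CLEAR, or TOGGLE affects only the bits that are 1 in the mask) are a simplification of the real hardware registers.

```cpp
#include <cassert>
#include <cstdint>

// Host-side model of one GPIO port (NOT the real hardware registers).
struct GpioPortModel {
    std::uint32_t dr = 0;  // data register: one bit per pin
    void writeSet(std::uint32_t mask)    { dr |= mask; }
    void writeClear(std::uint32_t mask)  { dr &= ~mask; }
    void writeToggle(std::uint32_t mask) { dr ^= mask; }
};

// Pin level observed after each of the two stores.
struct PulseTrace { bool afterFirst; bool afterSecond; };

// Two stores to the same TOGGLE address.
PulseTrace pulseWithToggle(GpioPortModel& port, std::uint32_t mask) {
    assert(mask != 0 && (port.dr & mask) == 0);  // pin must start LOW
    port.writeToggle(mask);
    bool first = (port.dr & mask) != 0;
    port.writeToggle(mask);
    bool second = (port.dr & mask) != 0;
    return {first, second};
}

// One store to SET followed by one store to CLEAR.
PulseTrace pulseWithSetClear(GpioPortModel& port, std::uint32_t mask) {
    assert(mask != 0 && (port.dr & mask) == 0);  // pin must start LOW
    port.writeSet(mask);
    bool first = (port.dr & mask) != 0;
    port.writeClear(mask);
    bool second = (port.dr & mask) != 0;
    return {first, second};
}
```

In this model both variants produce the same HIGH-then-LOW pulse with two stores and leave the other port bits untouched, so any speed difference on real hardware would have to come from address setup, not from the number of stores.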
 
What about digitalToggleFast(0);? Does that compile differently?
 
Reply to el_supremo:

I have tried with
-Faster
-Fast
-Fastest

and the timing does not change.

I need to raise and lower a pin as quickly as possible.
I would like to know where the source code for digitalWriteFast is located.
I fully understand that I must be hitting the limit of this CPU, although I wonder whether the problem comes from intermediate code in a library (I have to prove it to myself).

-----

Reply to tonton81:

The toggle (not toggleFast) method is even worse (in the style of the original digitalWrite).
Each digitalToggle statement takes 12 ns (Teensy 4.1 clocked at 1 GHz).
Its use is not recommended.

-----

Reply to MichaelMeissner:

I'm certainly looking for that magic. I sincerely believe that it must be possible to raise and lower a port pin faster, since I need it to create a trigger signal in my application, and honestly it would hurt a lot to have to change CPU (my beloved Teensy).

-----

Reply to el_supremo and defragster:


I have tested and measured, and the results are as follows:

Two digitalToggle calls take 12 ns each, 24 ns in total.

Two digitalToggleFast calls are equivalent to digitalWriteFast: 2 ns each, 4 ns in total.

The statements
CORE_PIN0_PORTTOGGLE = CORE_PIN0_BITMASK;
CORE_PIN0_PORTTOGGLE = CORE_PIN0_BITMASK;

are also equivalent to digitalWriteFast: each one takes 2 ns, 4 ns in total.

-----

Please forgive my terrible English, everyone; I'm from Spain (Madrid).
 
All the sources are installed - using an editor with folder search makes finding things easy:
Code:
T:\arduino-1.8.13\hardware\teensy\avr\cores\teensy4\core_pins.h:
 1588  }
 1589  
 1590: void digitalToggle(uint8_t pin);

That is the fastest it will run with that single instruction - the processor at 600 MHz has to wait for the I/O bus timing to complete the request.
 
With hardware assist, you could reduce it to a single statement (while still creating a pulse). I'm curious - what is the purpose of these very short pulses?
 
At these speeds you have to consider whether the chip's pin drivers can actually drive hard enough
for such fast signal edges. 3.3V logic signals are limited in risetime by the driver max current and the load
capacitance of the pin and pcb traces. When very high speed logic traces are needed LVDS is normally
employed so that the external circuit looks like pure resistive 100 ohm transmission line, and all the capacitance
is tuned out.
LVDS signals can work at 500MHz and beyond, single ended CMOS logic is starting to break up over 100MHz,
unless you have output drivers able to source / sink enough current to drive 50ohm stripline directly,
which is very power-hungry. The i.MX RT1062 datasheet (sect 12.4.2.3) specifies maximum operating
frequency of 200MHz, and that's only if the full slew-rate and maximum pad-drive level are selected.

So what are you trying to achieve with these short pulses?
 
The application consists of a Teensy 4.1 (overclocked to 1 GHz) that uses 16 adjacent port bits as a parallel port to send data to an external DAC. The DAC requires a rising edge to latch the data and start the D/A conversion.

I generate this trigger signal from another CPU pin. Once I have set that pin HIGH, I need to return it to LOW as soon as possible so that I can proceed with the next sample (the DAC is fast enough: 200 Msps). The 3.3 V signals should hold up well to 350 MHz. The transmission lines are far from ideal (it is a prototype), but on an oscilloscope with 3 GHz channel bandwidth and 10 GS/s I can see that the signals arrive quite well defined.

The problem with getting the most out of the DAC is that I need three steps:


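GPIO6_DR = (dacValue << 16);    // dacValue: the 16-bit sample, placed on the upper 16 port bits

digitalWriteFast(0, HIGH);      // rising edge: the DAC latches the data

digitalWriteFast(0, LOW);       // return the trigger pin to LOW for the next sample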



Each of them costs 2 ns (Teensy 4.1 at 1 GHz), so the total is 6 ns before I can process a new conversion value (loop).

My hope is that an expert in inline assembly can combine these three steps, or at least the last two (raising and lowering the pin, since they use the same pin address). A combination that cut the total D/A conversion time by 1 ns or 2 ns would let me run the DAC at its maximum conversion rate, which is what the project needs; in short, I am currently at a technological limit set by the Teensy 4.1.
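
The timing budget described in this post can be checked with a short calculation. This is a host-side sketch only; `sampleRateMsps` and `maxPeriodNs` are hypothetical helper names, and it assumes every store costs the same 2 ns measured earlier in the thread.

```cpp
#include <cassert>

// Sample rate in MSPS when each sample costs `stores` stores of
// `nsPerStore` nanoseconds each (assumes no other loop overhead).
double sampleRateMsps(int stores, double nsPerStore) {
    assert(stores > 0 && nsPerStore > 0.0);
    return 1000.0 / (stores * nsPerStore);
}

// Longest time per sample (ns) that still reaches `targetMsps`.
double maxPeriodNs(double targetMsps) {
    assert(targetMsps > 0.0);
    return 1000.0 / targetMsps;
}
```

Three stores at 2 ns give about 167 MSPS, two stores give 250 MSPS, and the 200 Msps the DAC supports needs a sample period of 5 ns or less. Under these assumptions, saving a single store would be enough to reach the DAC's rated speed.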
 
If you have an idea for returning the pin to LOW with external hardware, do not hesitate to share it. I had thought of a differentiator circuit, but the problem is that the pin still has to be set LOW inside the CPU, since it must then produce the rising edge again for the next sample.
 
While I appreciate the need for speed in some applications, it seems that you may be running up against hardware limits, both in pin drive and cpu instruction cycles.

Where are you getting the data that you are sending to the DAC? Is that data being generated so fast that the difference between 4nS and 6nS to output a word is a show stopper? Perhaps you can find the extra 2nS somewhere in the data generation so that you don't have to squeeze it out of the output function.
 
> The application consists of a Teensy 4.1 (forced to a 1 GHz clock), which, thanks to its adjacent parallel port (16 bits), sends data to an external DAC, which requires a rising edge signal to record and proceed to the D/A conversion.

Datasheet of DAC please....

> I get this trigger signal from another of the pins that the CPU has, but once I have set said pin to HIGH, I need to pass it to LOW as soon as possible, to proceed with the next conversion sample (the DAC has sufficient response, 200 Msps).

200MSPS? Definitely need to see those specs for the DAC, this is 4+ layer PCB / impedance-controlled territory. How much of that 200MSPS are you needing?
 
One way to get a pulse at each change in a clock is to use an XOR circuit like the one shown below. You delay one of the inputs with an RC network to set the pulse width. As others have pointed out, with a 250MHz input clock, you are at the limits of standard CMOS logic.

XOR_pulse.jpg

Please excuse the fact that the LTSpice simulator offers only the overkill 5-input version of the XOR gate.

If you can get that hardware addition properly set up for the input clock, you could get by with only a single digitalToggleFast for each word output.
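
The behaviour of the XOR-plus-delay circuit can be illustrated with a small discrete-time simulation. This is only a sketch under stated assumptions: the RC delay is modelled as an ideal delay of a whole number of samples, and `xorEdgePulses` and `countRisingEdges` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// XOR a sampled logic signal with a copy of itself delayed by `delay`
// samples. Before t = 0 the delayed copy is assumed to hold in[0].
std::vector<bool> xorEdgePulses(const std::vector<bool>& in, std::size_t delay) {
    assert(delay > 0);
    std::vector<bool> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        bool delayed = (i >= delay) ? in[i - delay] : in[0];
        out[i] = (in[i] != delayed);  // XOR of the two inputs
    }
    return out;
}

// Count LOW-to-HIGH transitions in a sampled signal.
std::size_t countRisingEdges(const std::vector<bool>& s) {
    std::size_t n = 0;
    for (std::size_t i = 1; i < s.size(); ++i) {
        if (!s[i - 1] && s[i]) ++n;
    }
    return n;
}
```

In this model every edge of the input, rising or falling, produces one output pulse whose width equals the delay, so the output has one rising edge per input toggle. That is why a single toggle per word could in principle be enough with such a circuit in place.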
 
DAC.png


I attach the main page of the DAC datasheet.

I fully understand the subject of digital transmission lines, on which I strongly agree with you; I work professionally in RF and microwave engineering.
It is true that at the moment I am not following good practice for high-speed circuits. However, I have examined the signal shapes, including the skew between the parallel bits: the waveforms have enough spectral content for an acceptable square wave, and the rise times (10% - 90%) are less than 1 ns. I accept that the prototype circuit is not the best for these time scales. Once I manage to reduce the time taken by the statements, I will start on a more suitable circuit design, with short transmission lines and proper ground planes.
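
A common rule of thumb relates a 10% to 90% rise time to an equivalent bandwidth. The sketch below assumes a single-pole (RC-like) response, for which bandwidth is approximately 0.35 divided by the rise time; the helper names are hypothetical and the result is an estimate, not a measurement.

```cpp
#include <cassert>

// Approximate -3 dB bandwidth in MHz for a 10%-90% rise time in ns,
// assuming a single-pole response: BW = 0.35 / t_rise.
double bandwidthMHzFromRiseTimeNs(double riseTimeNs) {
    assert(riseTimeNs > 0.0);
    return 0.35 / riseTimeNs * 1000.0;  // 1/ns is GHz, so scale to MHz
}

// Inverse relation: rise time in ns for a given bandwidth in MHz.
double riseTimeNsFromBandwidthMHz(double bandwidthMHz) {
    assert(bandwidthMHz > 0.0);
    return 350.0 / bandwidthMHz;
}
```

Under this approximation a 1 ns rise time corresponds to about 350 MHz, which is consistent with the 350 MHz figure quoted earlier for the 3.3 V signals.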

My goal right now is to validate the timing capability of the Teensy 4.1 + DAC combination for the signal synthesizer project.
 
Thanks for the idea of the delay circuit and XOR.

In principle, a single digitalToggleFast is not enough, since either the even or the odd samples would be lost, depending on how you look at it, and using two digitalToggleFast calls takes the same time as

digitalWriteFast (.. HIGH);
and
digitalWriteFast (.. LOW);

so nothing is gained.

The circuit has the problem that, in the end, another rising edge has to come from the CPU, and for that I have to lower the pin first.

I appreciate the time you have dedicated to analysing and designing the circuit; not everyone shows such dedication and good work.

The final solution will probably involve writing that part in assembly and inserting it into the code.

At the moment I am establishing the speed limit so that I can state specifications.

I was also asked where I get the samples that I intend to send to the DAC. In principle it is a simple calculation: an incrementing triangular ramp, a mathematical equation, or even a fixed table or a DMA access. It certainly does not come from a serial interface such as I2C.
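
As an illustration of the "fixed table" option mentioned here, the following host-side sketch precomputes one period of a 16-bit triangular ramp. The helper names `makeTriangleTable` and `toPortWord` are hypothetical, and the shift by 16 simply mirrors the `<< 16` used in the three-step output sequence earlier in the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One period of a 16-bit triangular wave with `stepsPerSlope` samples on
// the rising slope and the same number on the falling slope.
std::vector<std::uint16_t> makeTriangleTable(std::size_t stepsPerSlope) {
    assert(stepsPerSlope >= 1);
    std::vector<std::uint16_t> table;
    table.reserve(2 * stepsPerSlope);
    auto level = [stepsPerSlope](std::size_t i) {
        std::uint64_t scaled = static_cast<std::uint64_t>(i) * 65535u;
        return static_cast<std::uint16_t>(scaled / stepsPerSlope);
    };
    // Rising slope: 0 up to just below full scale.
    for (std::size_t i = 0; i < stepsPerSlope; ++i) table.push_back(level(i));
    // Falling slope: full scale down to just above 0.
    for (std::size_t i = stepsPerSlope; i > 0; --i) table.push_back(level(i));
    return table;
}

// Place a 16-bit sample on the upper 16 bits of a 32-bit port word.
std::uint32_t toPortWord(std::uint16_t sample) {
    return static_cast<std::uint32_t>(sample) << 16;
}
```

Precomputing the port words keeps the per-sample work in the output loop down to a table read and the stores, so the sample generation does not add arithmetic to the time-critical path.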

My goal right now is to at least beat the limits I have indicated, at 16-bit DAC resolution.

I want to make the Teensy 4.1 + external DAC genuinely useful for professional work. I did this before with the Teensy 3.6 at 12 bits (53 ns per sample) for another, less demanding project, but NXP has taken our internal DAC away (I was eagerly expecting a 16-bit one in this revolutionary CPU and its Teensy 4.1 integration).
 
I think I remember seeing a post by Paul S a year or so ago discussing how NXP had a small register to control the trise/tfall time for I/O pins (they had EMI reduction in mind). So, software timing aside, you might get crisper pulses if you set the trise/tfall to its fastest value.
 
I am looking through the RT1062 documentation for the feature you mention, which would allow configuring the response speed of the pins.
It would be very interesting to find the post where Paul commented on this topic.
It is quite possible that the response has been intentionally slowed down for electromagnetic compatibility reasons.
 
IOMUXC.png


I can confirm that this is covered in the RT1062 documentation, section 11.7.204 SW_PAD_CTL_PAD_GPIO_B0_05 SW PAD Control Register (IOMUXC_SW_PAD_CTL_PAD_GPIO_B0_05).

For every pad, two fields that affect timing response can be configured.

I understand that these are the register fields you are referring to. I do not know whether the Teensy 4.1 library currently supports setting them, or what their default state is.
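
To show how such a pad control word might be assembled, here is a host-side sketch. Everything in it is an assumption to be verified against the reference manual section cited above: the field positions (slew rate SRE in bit 0, drive strength DSE in bits 3 to 5, SPEED in bits 6 to 7), the guess that SRE and SPEED are the two timing-related fields, and the helper name `makePadConfig`. DSE is included because maximum pad drive was mentioned earlier in the thread.

```cpp
#include <cassert>
#include <cstdint>

// Field positions ASSUMED for the RT1062 SW_PAD_CTL registers; verify them
// against the reference manual before relying on them.
constexpr std::uint32_t kSreBit     = 0;  // slew rate: 0 = slow, 1 = fast
constexpr std::uint32_t kDseShift   = 3;  // drive strength, 3-bit field
constexpr std::uint32_t kSpeedShift = 6;  // speed, 2-bit field

// Hypothetical helper: build a pad control word from the three fields.
// All other bits (pull, keeper, hysteresis, open drain) are left at 0.
std::uint32_t makePadConfig(bool fastSlew, std::uint32_t dse, std::uint32_t speed) {
    assert(dse <= 7 && speed <= 3);
    std::uint32_t word = fastSlew ? (1u << kSreBit) : 0u;
    word |= dse << kDseShift;
    word |= speed << kSpeedShift;
    return word;
}
```

On the Teensy itself, the resulting word would presumably be written to the pad control register of the pin in question; the name of that register in the core headers, and the values the library sets by default, should be checked in core_pins.h before use.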
 
> I attach the main page of the DAC datasheet.

A photo is not a link to the datasheet - but at least I know the part number now and dug it out.

The MAX5885 requires synchronous clock and data, so I don't see how what you are doing works with that - normally the clocking would be quartz-stable and data is clocked every cycle from something like SDRAM. The chip has a latency of 3.5 clock cycles from data in to analog current output, so sending one clock pulse does nothing. It's not clear if it can tolerate stop-start clocking either, perhaps I missed something in the datasheet though.

A 16-bit DAC needs ultra-low jitter to perform well at higher frequencies, since jitter directly translates to noise and you are hoping for low noise or you wouldn't be needing 16 bits. Driving the clock pin asynchronously will mean the performance in terms of noise and SFDR will drop significantly.
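
The link between clock jitter and noise can be put in numbers with the standard jitter-limited SNR estimate for a full-scale sine wave, SNR = -20 log10(2 pi f sigma), alongside the ideal quantization SNR of 6.02 N + 1.76 dB. This is a sketch only: the function names are hypothetical, and the formulas are textbook approximations that are not taken from the MAX5885 datasheet.

```cpp
#include <cassert>
#include <cmath>

const double kPi = 3.14159265358979323846;

// Jitter-limited SNR in dB for a full-scale sine at `signalHz` with an
// RMS clock jitter of `jitterSeconds`.
double jitterLimitedSnrDb(double signalHz, double jitterSeconds) {
    assert(signalHz > 0.0 && jitterSeconds > 0.0);
    return -20.0 * std::log10(2.0 * kPi * signalHz * jitterSeconds);
}

// Ideal quantization-limited SNR in dB for an N-bit converter.
double idealQuantizationSnrDb(int bits) {
    assert(bits > 0);
    return 6.02 * bits + 1.76;
}

// Largest RMS jitter (seconds) that still allows `targetSnrDb` at `signalHz`.
double maxJitterSeconds(double signalHz, double targetSnrDb) {
    assert(signalHz > 0.0);
    return std::pow(10.0, -targetSnrDb / 20.0) / (2.0 * kPi * signalHz);
}
```

As an example, 1 ps of RMS jitter at a 50 MHz output limits the SNR to about 70 dB, well short of the roughly 98 dB an ideal 16-bit converter could deliver; matching that figure at 50 MHz would need jitter on the order of 40 fs.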

I think you are better off continuously clocking the DAC and using something like a 74LVC16374 to reclock the output data to the DAC clock itself. However, that chip's only spec'd to 120MHz or so, but it's pretty easy to lay out, having 16 bussed signals in and out.
> My goal right now is to validate the TEENSY 4.1 + DAC capability in time response for the signal synthesizer project.

Some more information about how you are trying to use the T4.1 would be useful I think.
 
Seems clear to me that the plan is for the pulses and new parallel data to be continuously and synchronously output at high speed - by the T4. The MCU speed is crystal controlled, so maybe jitter wouldn't be too bad.

> not worth putting a single digitalToggleFast since the even or odd samples would be lost

Not with the circuit creating rising and falling edges from one edge of either type. Ie, a clock doubler. But this would add jitter.
 