Teensy 4.0 clock counting and/or clock control

Hi all

I have a Teensy 4.0 and I am trying to analyse an ASIC that communicates using serial data. The ASIC was used for non-standard crypto functions. I am communicating with it using a serial program on my computer via Teensy.

The ASIC communicates using 9-bit half-duplex at 38.4k baud and the computer talks to the Teensy at 9.6k baud. This is to allow the ASIC to process the input data and send back the answer at a higher speed.

So the flow of operations looks like this:

Host (computer) --> (Serial 3 @ 9600 baud) Teensy (Serial 2 @ 38400 baud) --> ASIC --> Teensy --> Host

This all works fine. However, given the nature of the ASIC, I would like to read various ASIC registers during data processing (it's a chain of LFSRs with taps). As an example, processing a single byte through a crypto function takes a multiple of 1024 clocks (1024, 2048, etc). Each clock equals an iteration through the function. The external clock speed is 3.58MHz, which can be stopped and started (to the ASIC that is) without loss of contents, so it's a fully static device.

With that said, what I would like to do is count the number of cycles between sending a byte to be processed and issuing a read command. For example, after sending a byte, I would like to wait 1 clock cycle and then send a read register command. The read command itself takes around 1280 clocks to process, so I won't be able to see the data from the start. Alternatively, I could issue a read before I send the byte to process. A lot of trial and error there.

I was thinking of utilising something like FreqCount but not sure if that's the best way to do this.

The alternative method is to generate the clock from the Teensy itself and use that to drive the ASIC. Given that the ASIC is static (it's a 4000 gate array from what I understand), I could potentially stop and start the clock as required, or even slow it down (I guess the serial baud will change too).

However, how do I tell the Teensy to generate a clock of x MHz for n clocks and then stop? There's analogWriteFrequency(PIN, SPEED), but that will just run and run. Again, I could count the clocks somehow and issue a stop (using analogWrite(4, 0)?), but that doesn't give me the control I ideally want.

I am looking for ideas and any code examples :) Here's what I have so far, though it doesn't really have anything to do with the question itself - just something to build up from.

Code:
#define CONSOLE Serial

#define HOST Serial3
#define ASIC Serial2

#define CONSOLE_BAUD 115200
#define ASIC_BAUD 38400
#define HOST_BAUD 9600

#define HOST_ASIC 1
#define ASIC_HOST 3

uint16_t b;

int dir = ASIC_HOST;
int pdir = dir;

char fmt[64];

/* Reverse order of n bits in a byte */
static uint8_t _rev(uint8_t b, int n)
{
	uint8_t r = 0;
	while(n--)
	{
		r = (r << 1) | (b & 1);
		b >>= 1;
	}
	return(r);
}

void sendByteToASIC(uint8_t a)
{
  b = _rev(~(a), 8) << 1;
  ASIC.write9bit(b | 1);
  ASIC.flush();
  delay(1);
}

void sendByteToHost(uint16_t b)
{
  HOST.write(b);
  HOST.flush();
  delay(1);
}


void setup()
{
  CONSOLE.begin(CONSOLE_BAUD);
  ASIC.begin(ASIC_BAUD, SERIAL_9N1 | SERIAL_HALF_DUPLEX); /* ASIC -> HOST */
  HOST.begin(HOST_BAUD, SERIAL_9N1 | SERIAL_HALF_DUPLEX); /* HOST -> ASIC */
}

void loop()
{

  if(HOST.available() > 0)
  {
    dir = HOST_ASIC;

    /* Received byte */
    b = HOST.read() & 0xFF;

    sprintf(fmt, "%s %02X", dir == pdir ? "" : "\n-->RQST:", b);
    CONSOLE.print(fmt);

    /* Resend byte from HOST to ASIC */
    sendByteToASIC(b);
    pdir = dir;
  }

  if(ASIC.available() > 0)
  {
    dir = ASIC_HOST;
    /* Received byte */
    b = ASIC.read();

    b = _rev(~(b & 0xFF), 8);
    /* Resend byte from ASIC to HOST */
    sendByteToHost(b);
    sprintf(fmt, "%s %02X", dir == pdir ? "" : "\n<--ASIC:", b);
    CONSOLE.print(fmt);

    pdir = dir;
  }
}
 
If the clock stops - then there is no more UART transfer as well? Does that include the 'read register' interface?
 
3.58 MHz is slow enough that you could probably just use digitalWriteFast() or digitalToggleFast() with checks against the ARM DWT cycle counter for timing, and of course checks for the total number of pulses you want to output.

Not the most elegant solution, but a lot simpler than diving into the timer hardware.
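A minimal sketch of that suggestion, assuming a 600 MHz Teensy 4.0 and the 3.58 MHz target from the first post. The names halfPeriodTicks, clockBurst and CLOCK_PIN (pin 4) are made up for illustration; ARM_DWT_CYCCNT, F_CPU_ACTUAL and digitalWriteFast() come from the Teensy 4 core. The hardware part only compiles when building for the board.

```cpp
#include <stdint.h>

// CPU ticks per half period of the generated clock, rounded to nearest.
constexpr uint32_t halfPeriodTicks(uint32_t cpuHz, uint32_t clockHz) {
  return (cpuHz + clockHz) / (2u * clockHz);
}

#ifdef ARDUINO  // hardware part, skipped when compiling on a host
#include <Arduino.h>

const uint8_t CLOCK_PIN = 4;             // assumption: any free output pin
const uint32_t ASIC_CLOCK_HZ = 3580000;  // clock rate quoted in the thread

// Emit exactly n pulses, then leave the pin low.
// Call pinMode(CLOCK_PIN, OUTPUT) in setup() first.
void clockBurst(uint32_t n) {
  const uint32_t half = halfPeriodTicks(F_CPU_ACTUAL, ASIC_CLOCK_HZ);
  noInterrupts();  // limits jitter, but also stalls interrupt-driven serial
  uint32_t next = ARM_DWT_CYCCNT;
  while (n--) {
    digitalWriteFast(CLOCK_PIN, HIGH);
    next += half;
    while ((int32_t)(ARM_DWT_CYCCNT - next) < 0) { }  // wrap-safe wait
    digitalWriteFast(CLOCK_PIN, LOW);
    next += half;
    while ((int32_t)(ARM_DWT_CYCCNT - next) < 0) { }
  }
  interrupts();
}
#endif
```

With 84 ticks per half period the output comes out near 3.57 MHz rather than exactly 3.58 MHz. If the UART has to keep running during a burst, the noInterrupts()/interrupts() pair would probably have to go, at the cost of more jitter.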
 
Was going to add the manual clocking as p#3 - but puzzled by the goal of clocking the unit to a stop to then get data from the process?
 
If the clock stops - then there is no more UART transfer as well? Does that include the 'read register' interface?

Sadly, yes, the UART is driven by the same clock, so that needs to be taken into account.

puzzled by the goal of clocking the unit to a stop to then get data from the process?

Stopping isn't necessary but I would like to start it manually as soon as I send the byte to it to (try to) synchronise the two events. It will be a lot of trial and error but I have data I should be expecting at certain clock cycles already from a third party so I can use that for reference.

I suppose I could do that with counting the clock as well but I guess I would need to do that using an interrupt rather than a loop() since the loop() could potentially go around more than once within the same clock cycle.

3.58 MHz is slow enough that you could probably just use digitalWriteFast() or digitalToggleFast() with checks against the ARM DWT cycle counter for timing, and of course checks for the total number of pulses you want to output.

Thanks for this. ~3.6MHz seems like it's way too fast to me but I am forgetting that T_4.0 runs at 600MHz! I saw an elapsedCycles that could potentially be used but need to work on some maths to work out how to actually use it to drive the clock.
 
@PaulStoffregen - am I able to utilise the FreqCount library between certain points of execution? I am trying to count the number of (external) clocks between sending a "read register 1" command from Teensy (which is byte 0x13) and receiving back a response from the ASIC. I am semi-reliably told that it should take around ~1253 cycles but I am getting around 4600. Ideally I want to use an interrupt from "UART transmit finished" to start counting and finish on the first bit received of the response. Haven't figured out what those ISR() names are yet.... (Do USART1_RX_vect and USART1_TX_vect exist on Teensy?).

On calculating the number of Teensy ticks per ASIC tick, is it as simple as dividing 600MHz/3.58MHz = 167 ticks? I calculated the number of loops per second and I get 1630112, which is a lot less than 3580000 * 2 (for rising and falling edges). I guess I'd need to resort to hardware timers to do the pulsing.

Here's my code so far:

Code:
#define CONSOLE Serial

#define HOST_SERIAL Serial3
#define ASIC_SERIAL Serial2

#define CONSOLE_BAUD 115200
#define ASIC_BAUD 38400
#define HOST_BAUD 9600

#define HOST_ASIC 1
#define ASIC_HOST 3

#include "elapsedCycles.h"

elapsedCycles dwt_cycles;
uint32_t t_cycles;
volatile uint32_t asic_clock_count; /* written in handleISR() */
uint32_t time_elapsed = 0;
uint32_t loops = 0;

uint16_t b;
int dir = ASIC_HOST;
int pdir = dir;


char fmt[64];

/* Reverse order of n bits in a byte */
static uint8_t _rev(uint8_t b, int n)
{
	uint8_t r = 0;
	while(n--)
	{
		r = (r << 1) | (b & 1);
		b >>= 1;
	}
	return(r);
}

void sendByteToASIC(uint8_t a)
{
    b = _rev(~(a), 8) << 1;

    /* Reset Teensy cycles*/
    dwt_cycles = 0;
    ASIC_SERIAL.write9bit(b | 1);

    /* Grab the number of cycles */
    t_cycles = dwt_cycles;

    /* Reset ASIC cycles */
    asic_clock_count = 0;

    ASIC_SERIAL.flush();
    delay(1);
}

void sendByteToHost(uint16_t b)
{
    HOST_SERIAL.write(b);
    HOST_SERIAL.flush();
    delay(1);
}

void handleISR() {
  asic_clock_count++;
}

void setup()
{
    CONSOLE.begin(CONSOLE_BAUD);
    ASIC_SERIAL.begin(ASIC_BAUD, SERIAL_9N1 | SERIAL_HALF_DUPLEX); /* ASIC -> HOST */
    HOST_SERIAL.begin(HOST_BAUD, SERIAL_9N1 | SERIAL_HALF_DUPLEX); /* HOST -> ASIC */
    
    pinMode(9, INPUT_PULLUP);
    attachInterrupt(digitalPinToInterrupt(9), handleISR, RISING);
}

void loop()
{
    if(HOST_SERIAL.available() > 0)
    {
        dir = HOST_ASIC;

        /* Received byte */
        b = HOST_SERIAL.read() & 0xFF;

        sprintf(fmt, "%s %02X", dir == pdir ? "" : "\n-->RQST:", b);
        CONSOLE.print(fmt);

        /* Resend byte from HOST to ASIC */
        sendByteToASIC(b);
        pdir = dir;
    }

    if(ASIC_SERIAL.available() > 0)
    {
        uint32_t clocks = asic_clock_count;

        dir = ASIC_HOST;
        /* Received byte */
        b = ASIC_SERIAL.read();

        b = _rev(~(b & 0xFF), 8);
        /* Resend byte from ASIC to HOST */
        sendByteToHost(b);
        sprintf(fmt, "%s %02X", dir == pdir ? "" : "\n<--ASIC:", b);
        CONSOLE.print(fmt);

        sprintf(fmt, "\nByte send took %lu Teensy cycles. ", (unsigned long)t_cycles);
        CONSOLE.print(fmt);
        sprintf(fmt, "Read register took %lu ASIC cycles.", (unsigned long)clocks);
        CONSOLE.println(fmt);

        pdir = dir;
    }
}

And output:

Code:
-->RQST: 13
<--ASIC: 00
Byte send took 365 Teensy cycles. Read register took 4600 ASIC cycles.

-->RQST: 13
<--ASIC: 00
Byte send took 819 Teensy cycles. Read register took 4694 ASIC cycles.

-->RQST: 13
<--ASIC: 00
Byte send took 363 Teensy cycles. Read register took 4603 ASIC cycles.

-->RQST: 13
<--ASIC: 00
Byte send took 373 Teensy cycles. Read register took 4696 ASIC cycles.

Edit: using the Timer3 library, I was able to achieve just 1MHz (2M interrupts per second).... odd.
 
T_4.x loop() will cycle 10M to 14M times per second with an empty loop(). Everything added to loop() slows that down toward the 1.6M observed. A 'tight loop' will run much faster without the overhead of loop() calling yield() on each exit.

Over 2M interrupts/sec starts hitting the maximum; depending on the code in the _isr() and the type of interrupt, it might go toward 4M usable. Pin interrupts have extra overhead because a single shared interrupt has to work out which pin was involved.

Some _isr() routines complete before the core registers that the interrupt flag was reset, so the _isr() is called again unless a "DSB" instruction is included. {search "DSB" for the asm statement}
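As a rough illustration of the "DSB" point, here is a pin-interrupt handler like the one posted above with a barrier added at the end. ISR_BARRIER is a made-up macro name; the barrier is a no-op when compiled for anything other than 32-bit ARM. Whether the barrier is strictly needed for an attachInterrupt() callback is an assumption here, since the core's own pin-interrupt dispatcher may already handle it.

```cpp
#include <stdint.h>

// Data synchronization barrier on 32-bit ARM; nothing elsewhere.
#if defined(__arm__)
#define ISR_BARRIER() asm volatile("dsb" ::: "memory")
#else
#define ISR_BARRIER() ((void)0)
#endif

// volatile because it is written in the ISR and read in loop()
volatile uint32_t asic_clock_count = 0;

// Count one ASIC clock edge, then make sure all writes have completed
// before the handler returns.
void handleISR() {
  asic_clock_count = asic_clock_count + 1;
  ISR_BARRIER();
}
```

It would be attached the same way as in the posted sketch, for example with attachInterrupt(digitalPinToInterrupt(9), handleISR, RISING).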

Re p#4 - and perhaps p#3 - it was assumed a blocking 'tight loop' was used for timing the clock pin to ASIC - i.e. nothing else to do but trigger the ASIC to get the desired information.

A 3.58MHz clock "cycle" is two pin events, so 2 * 3.58M pin events per second: repeat ( pin Rise; wait half a period; pin Fall (toggle); wait half a period; )
where half a period is 600MHz / (2 * 3.58MHz) = ~83.8 Teensy ticks?

And that would be done for a fixed count of "cycles" to progress the ASIC some fixed number of cycles?

That goes back to post #2: assuming that works as desired - what is the next step to get results desired in post #1?

p#5 suggests stopping/controlling the ASIC wasn't actually the goal? - so the Teensy doesn't really need to do the clocking - and if it stops the clock there is no UART I/O.

The Teensy could time events once started but would need the indicators and methods to measure what is desired it seems?
 
Thanks @defragster. Need to look up how to implement tight loops. I assume it's something that's interrupt driven rather than poll. The only real thing I have in the loop is checks to see if there's any data on either of the two serial ports.

p#5 suggests stopping/controlling the ASIC wasn't actually the goal? - so the Teensy doesn't really need to do the clocking - and if it stops the clock there is no UART I/O.

The main goal is to read the ASIC register data after a specific number of clocks, since each clock cycle shifts the result register to the left bitwise. Here's an example:

1) Each time a byte is fed into the DATA register in the ASIC, it's passed through a cryptographic function.
2) Each pass takes 1024*n cycles to complete, where n is configurable.
3) Each cycle feeds the data register into the LFSR chain bit-by-bit and places the result into an ANSWER(8) register.
4) I can normally only see the result after each pass and not while the data is being processed.

In the example below, the byte is processed in two passes = 2048 clocks. Here's what I get.

Code:
-->RQST: 01 01  <-- Write 0x01 into DATA register
-->RQST: 13     <-- 0x13 = "read byte from register 1" (ANSWER)
<--ASIC: F3     <-- ANSWER back from ASIC
-->RQST: 13     <-- 0x13 = "read byte from register 1" (ANSWER)
<--ASIC: 70     <-- ANSWER back from ASIC
-->RQST: 13     <-- 0x13 = "read byte from register 1" (ANSWER)
<--ASIC: F4     <-- ANSWER back from ASIC
-->RQST: 13
<--ASIC: 68     
-->RQST: 13
<--ASIC: F9    
-->RQST: 13
<--ASIC: AD    
-->RQST: 13
<--ASIC: 1D     
-->RQST: 13
<--ASIC: A7 

F3 70 F4 68 F9 AD 1D A7

This tallies up with some old text that I found in the archives.

Code:
If the COUNT register is not zero, a byte written into the DATA
register causes 1024*COUNT clock cycles of processing, with
"busy" indicated in the STATUS register. The ANSWER register can
still be read during this processing. So one can loop
an operation, and examine the ANSWER values every clock
cycle, to see how it works. [ Another design flaw ! ]
(There is a snag, in that it takes about 0x4E5 clocks to issue the
read, so it is not possible to see what happens until this many
clocks after the start.)

        cycles ANS7---------------ANS0
        <snip>
        0007F0 9B ED F3 70 F4 68 F9 AD
        0007F1 37 DB E6 E1 E8 D1 F3 5A
        0007F2 6F B7 CD C3 D1 A3 E6 B4
        0007F3 DF 6F 9B 87 A3 47 CD 68
        0007F4 BE DF 37 0F 46 8F 9A D1
        0007F5 7D BE 6E 1E 8D 1F 35 A3
        0007F6 FB 7C DC 3D 1A 3E 6B 47
        0007F7 F6 F9 B8 7A 34 7C D6 8E
        0007F8 ED F3 70 F4 68 F9 AD 1D
        0007F9 DB E6 E1 E8 D1 F3 5A 3B
        0007FA B7 CD C3 D1 A3 E6 B4 76
        0007FB 6F 9B 87 A3 47 CD 68 ED
        0007FC DF 37 0F 46 8F 9A D1 DA
        0007FD BE 6E 1E 8D 1F 35 A3 B4
        0007FE 7C DC 3D 1A 3E 6B 47 69
        0007FF F9 B8 7A 34 7C D6 8E D3
        000800 F3 70 F4 68 F9 AD 1D A7 <- final ANSWER -- this matches my answer

So the goal is to achieve the same. The author does not say what hardware he used but as it was from 1996, it wasn't Teensy :)
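The rows in that archive table are consistent with ANSWER behaving as a 64-bit register that shifts left by one bit per clock, with a new bit entering at the bottom. A small host-side check of that reading is sketched below; isShiftOfPrevious and feedbackBit are made-up helper names, and packing ANS7..ANS0 into one uint64_t with ANS7 as the top byte is my assumption.

```cpp
#include <stdint.h>

// True if 'next' is 'prev' shifted left by one bit, ignoring the new bit 0.
constexpr bool isShiftOfPrevious(uint64_t prev, uint64_t next) {
  return (prev << 1) == (next & ~(uint64_t)1);
}

// The bit that was shifted in on this clock (bit 0 of the new value).
constexpr unsigned feedbackBit(uint64_t next) {
  return (unsigned)(next & 1u);
}
```

For example, the rows for cycles 0007F0 and 0007F1 pack to 0x9BEDF370F468F9AD and 0x37DBE6E1E8D1F35A, and the second is the first shifted left one bit with a 0 shifted in. Comparing successive captures this way might help confirm how many clocks apart two reads really were.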

Back to your question - the Teensy does not *need* to do the clocking but it seems difficult to count the exact number of pulses and perform a read at exactly that spot - accounting for the number of cycles it takes to do a read, something I've not yet managed to count properly.

So, if I could control the clock with Teensy, it would be easier to control when to issue a read command. I could also halve or quarter the clock as it's not sensitive to clock speed - I'd need to read data using 19200 or 9600 baud to compensate - which would make it easier to pinpoint order of operations.
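The baud compensation can be worked out with a couple of helpers. This is a sketch that assumes the ASIC derives its baud from a fixed divider of its clock, which fits the 38400 baud at 3.58MHz figure; scaledBaud and clocksPerBit are illustrative names.

```cpp
#include <stdint.h>

// Baud rate the ASIC UART ends up at when run from clockHz instead of
// nativeClockHz, rounded to nearest.
constexpr uint32_t scaledBaud(uint32_t nativeBaud, uint32_t nativeClockHz,
                              uint32_t clockHz) {
  return (uint32_t)(((uint64_t)nativeBaud * clockHz + nativeClockHz / 2) /
                    nativeClockHz);
}

// Whole ASIC clocks per serial bit (truncated).
constexpr uint32_t clocksPerBit(uint32_t clockHz, uint32_t baud) {
  return clockHz / baud;
}
```

Halving the clock to 1.79MHz gives scaledBaud(38400, 3580000, 1790000) = 19200, and clocksPerBit(3580000, 38400) gives 93 clocks per bit at the native rate.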

My idea was:

1) Using external clock at 3.58MHz, count the number of rising edges on the pin and issue the 'read' command at whatever number of counts I need to do it in. I am struggling to think how to make it as precise as possible with the runaway clock.

or

2) If using Teensy generated clock, I can count the number of pulses issued, stop when needed and restart it right before I issue a read command. This allows me to control the clock in between checking the number of pulses and issuing reads - rather than it pulsing endlessly.
 
Update:

I tried generating the clock on the Teensy but the ASIC wouldn't accept it/see it. I suspect that this is because the output level from the Teensy is 3.3V, not 5V like it probably expects, so it's not clocking. I tried doing the same using an Arduino and it worked fine.

At the moment the setup is Arduino clock -> ASIC <--> Teensy data handler.

I managed to get the clock down to just 31250Hz, which equates to transfer speed to the ASIC of 340 baud :D This is far more manageable from clock generation perspective and counting the clock - just makes it all much easier. I think! I am not sure I can go much lower than that...

I have ordered some 3.3V <> 5V logic level shifters, so I can use the Teensy for everything.

A couple of questions:

1) When sending a byte using SerialX, does it block the loop() during transmission? If it waits for the byte to be sent before coming back to loop() then I have to generate the clock using one of the timers, else the transmission will fail.

2) Which interrupts can I use to indicate the reception of the first start bit (or any bit) on UARTx, and when transmission has finished (the last bit sent)? I tried using `attachInterruptVector(IRQ_LPUART3, callback);` but that didn't seem to do anything. I could potentially use Serial2.transmitterEnable(pin) and grab the event off that but there is likely to be a better way.

I think the clock counting is relatively accurate using transmitterEnable method but not super precise:

Code:
-->RQST: 13
<--ASIC: F9
Read register took 1227 ASIC cycles.

-->RQST: 13
<--ASIC: 68
Read register took 1224 ASIC cycles.

-->RQST: 13
<--ASIC: F4
Read register took 1224 ASIC cycles.

-->RQST: 13
<--ASIC: 70
Read register took 1221 ASIC cycles.

-->RQST: 13
<--ASIC: F3
Read register took 1227 ASIC cycles.

Help appreciated!
 
Having the clock run slower when driving the ASIC will help give time to monitor and observe what is desired; hopefully the '3.3V <> 5V logic level shifter' is good for the speed in use.

re#1: SerialX/Serial# is a UART: as long as there is room in the output buffer, the data to send is copied to the buffer and the call returns immediately; the UART, running on interrupts, then pulls from the buffer to feed itself. If the new data would overfill the buffer, the call will block.

re#2: that takes reading the ref manual and/or the code to understand the workings. I know that on T_3.x a simple attachInterrupt on the Rx pin could work to find the start of reception, but the same code fails on T_4.x as the pins are controlled differently.

There is some jitter and lag in interrupts, perhaps affecting the 'ASIC clock' counting. Also the UART may not start sending at the same instant on 340 baud byte boundaries - which is about 10X the ASIC clock? There is an interrupt to refill the FIFO; perhaps a hack to that code could inc++ a counter on each FIFO refill, which would indicate when the first bytes have been sent and the FIFO was refilled, giving a more precise start reference to monitor on return from sending data to the UART?
 
Having the clock run slower when driving the ASIC will help give time to monitor and observe what is desired; hopefully the '3.3V <> 5V logic level shifter' is good for the speed in use.
It isn't sadly. There's so much jitter on the 5V side that the clock is interpreted to be around 10 times faster and is highly unstable. It's OK for slower applications though.

In the meantime I used a signal switcher to get the 5V with 3.3v line as the trigger. Works well.

re#1: SerialX/Serial# is a UART: as long as there is room in the output buffer, the data to send is copied to the buffer and the call returns immediately; the UART, running on interrupts, then pulls from the buffer to feed itself. If the new data would overfill the buffer, the call will block.
What happens while the byte is being sent, between bits? I'd expect(?) it to come back to loop() while it's waiting to send the next bit - especially at 300 baud.

re#2: that takes reading the ref manual and/or the code to understand the workings

I scanned through the LPUART section and it looked promising with some signals that are/can be generated on events such as transmission complete. I've not had much time to delve into it deeper.

For now I attached the serial pin to another pin which has an interrupt attached to it. That seems to work well, though it's an ugly workaround.

I may try your suggestion of getting a counter into the HardwareSerial to grab the events through there.

I eventually might go down the route of homebrewed software serial but one that I control the clock of with no specific baud... If I want to be paranoid about accuracy.

There is some jitter and lag in interrupts, perhaps affecting the 'ASIC clock' counting. Also the UART may not start sending at the same instant on 340 baud byte boundaries - which is about 10X the ASIC clock?

The "native" baud of the ASIC is 38400 @ 3.58MHz, which equates to around 93 cycles per bit? There are more bits received than transmitted (9-bit serial with routing bit, parity and two stop bits on receiving but only 8 bits on transmission - no routing bit).

Interesting thought on the baud byte boundaries - will keep that in mind.

Will keep you posted!
 
It isn't sadly. There's so much jitter on the 5V side that the clock is interpreted to be around 10 times faster and is highly unstable. It's OK for slower applications though.

In the meantime I used a signal switcher to get the 5V with 3.3v line as the trigger. Works well.

What happens while the byte is being sent, between bits? I'd expect(?) it to come back to loop() while it's waiting to send the next bit - especially at 300 baud.
> It runs as a background process in parallel, except for the UART _isr() that refills the FIFO bytes for Tx. The only blocking part is the transfer of output bytes into the buffer; unless buffer space is insufficient, that is a simple byte copy and buffer update, then return.

I scanned through the LPUART section and it looked promising with some signals that are/can be generated on events such as transmission complete. I've not had much time to delve into it deeper.

For now I attached the serial pin to another pin which has an interrupt attached to it. That seems to work well, though it's an ugly workaround.

I may try your suggestion of getting a counter into the HardwareSerial to grab the events through there.

I eventually might go down the route of homebrewed software serial but one that I control the clock of with no specific baud... If I want to be paranoid about accuracy.
On T_3.6, when the interrupt was shareable on the Rx pin, it was disabled after the start bit transition occurred to avoid extra triggering on later bits, then re-enabled before the next transfer.

AFAIK the clocks are independent on each end based on agreed baud - Rx and Tx and no shared clock - so not good if both sides don't stay in sync at the agreed on rate.
The "native" baud of the ASIC is 38400 @ 3.58MHz, which equates to around 93 cycles per bit? There are more bits received than transmitted (9-bit serial with routing bit, parity and two stop bits on receiving but only 8 bits on transmission - no routing bit).

Interesting thought on the baud byte boundaries - will keep that in mind.

Will keep you posted!
 
AFAIK the clocks are independent on each end based on agreed baud - Rx and Tx and no shared clock - so not good if both sides don't stay in sync at the agreed on rate.
Yes, that would normally be the case, but if I control the ASIC clock myself then I would, in theory, know how many clocks there are between bits. I assume it'll be the exact same number every time due to it being an ASIC, which is just a bunch (well... 4000) of logic gates.

The other question I need answering for myself is whether the ASIC copies the byte to be transmitted to some Tx buffer before sending or whether it reads the bits directly from the requested register. Hopefully the former, else the byte itself may change during transmission if the cryptographic function is still running.
 