Teensy-LC Beta Testing

transfer16() does not work and "crashes".

The current code is:
Code:
	inline static uint8_t transfer16(uint16_t data) {
		SPI0_C2 = SPI_C2_SPIMODE;
		SPI0_DH = data >> 8;
		SPI0_DL = data;
		while (!(SPI0_S & SPI_S_SPRF)) ; // wait
		uint16_t r = (SPI0_DH << 8) | SPI0_DL;
		SPI0_C2 = 0;
		return r;
	}
The description says:
Any switching between 8- and 16-bit data transmission length (controlled by SPIMODE bit) in master mode will abort a transmission in progress, force the SPI system into idle state, and reset all status bits in the SPIx_S register. To initiate a transfer after writing to SPIMODE, the SPIx_S register must be read with SPTEF = 1, and data must be written to SPIx_DH:SPIx_DL in 16-bit mode (SPIMODE = 1) or SPIx_DL in 8-bit mode (SPIMODE = 0).

I'm not sure what "the SPIx_S register must be read with SPTEF = 1" means, but changing it to

Code:
	inline static uint8_t transfer16(uint16_t data) {
		SPI0_C2 = SPI_C2_SPIMODE;
		SPI0_S;	// added: read the status register after changing SPIMODE
		//while (!(SPI0_S & SPI_S_SPTEF)) ; // wait
		SPI0_DH = data >> 8;
		SPI0_DL = data;
		while (!(SPI0_S & SPI_S_SPRF)) ; // wait
		uint16_t r = (SPI0_DH << 8) | SPI0_DL;
		SPI0_C2 = 0;
		SPI0_S;	// added: read the status register after changing SPIMODE
		//while (!(SPI0_S & SPI_S_SPTEF)) ; // wait
		return r;
	}

seems to help. But maybe we need to wait for SPTEF with the "//while..." lines to be sure?

Frank
 
Indeed, page 716 indicates that we need a loop.

Maybe it's faster not to switch to 16-bit mode for every transfer and to use 2x 8-bit instead...

EDIT:
The return type should be uint16_t, not uint8_t.

Did some measurements: for high SPI clocks, 16-bit is faster; for slow SPI clocks, 2x 8-bit is faster.

Edit: transfer16 is missing in keywords.txt

The buffer transfer seems to work, but I tested this with MISO connected to MOSI only :)
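
For reference, the "2x 8-bit" alternative mentioned above could look roughly like the sketch below. It is not the library's code: `transfer16_via_8bit` and `loopback_transfer8` are hypothetical names, and the loopback function only imitates MISO wired to MOSI so the sketch can run on a host.

```cpp
#include <cstdint>

// Hypothetical stand-in for an 8-bit SPI transfer. This loopback version
// returns the byte it was given, as if MISO were wired to MOSI.
static uint8_t loopback_transfer8(uint8_t out) {
    return out;
}

// Send a 16-bit word as two 8-bit transfers, most significant byte first,
// and assemble the two received bytes into a 16-bit result.
static uint16_t transfer16_via_8bit(uint16_t data, uint8_t (*transfer8)(uint8_t)) {
    uint8_t hi = transfer8(static_cast<uint8_t>(data >> 8));
    uint8_t lo = transfer8(static_cast<uint8_t>(data & 0xFF));
    return static_cast<uint16_t>((static_cast<uint16_t>(hi) << 8) | lo);
}
```

This avoids touching SPIMODE at all, at the cost of a second wait per word, which matches the observation that it only pays off at slow SPI clocks.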
 
The cycle counter is not supported by the hardware (hardwired in MTBDWT_CTRL). Am I correct?
It reads as "0" for me.

I suggest removing the #define in kinetis.h for LC (or using #undef...)
 
Is it possible to configure a timer with 48MHz? If yes, it could be used as a replacement for DWT_CYCCNT (maybe in an additional lib?)
 
I'm trying to get SPI1 to work with DMA. I haven't had any luck. Anyone got a working SPI DMA for LC yet?
 
Is it possible to configure a timer with 48MHz ?

All three FTM / TPM timers clock at 48 MHz, even when the CPU is at 24 MHz.

They get their clock from a divide-by-2 circuit directly connected to the PLL. This is one of the differences from Teensy 3.0 & 3.1, where those timers clock from F_BUS.
 
Huh, ok. I'll have to do more testing when I get back to the hotel tonight, because it seemed to be taking ~120 clocks to hit overflow when I set FTM2_MOD to 60. I also had some weird cases where I would suddenly get something closer to 90 clocks, and when I rebuilt with more debug output to see what's going on, it would go back to ~120 clocks :/

The other alternative is for me to use the systick counter for the passage of clocks - it just gets a little ugly managing the wraparound at 48000 (since I can't just do a masking with it :/) - but this is what I used to do with the Due, so I have code floating around for it anyway.
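
The wraparound handling mentioned above can be sketched as follows. This is an illustration only: `systick_elapsed` is a hypothetical helper, it assumes a down-counter that reloads to `period - 1` after reaching zero (48000 counts per period here), and it assumes less than one full period passes between the two readings.

```cpp
#include <cstdint>

// Ticks elapsed between two readings of a down-counter that wraps from 0
// back to period - 1. Valid only if fewer than `period` ticks have passed.
static uint32_t systick_elapsed(uint32_t start, uint32_t now, uint32_t period) {
    if (start >= now) {
        return start - now;           // no wrap between the two readings
    }
    return start + period - now;      // counter passed zero and reloaded
}
```

Because the period is not a power of two, the wrap needs a compare and an add rather than a simple mask.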
 
Something I've considered doing, since starting Teensy 3.0, is using Systick as a 24 bit free running timer. Its interrupt would increment a 32 bit count. A 56 bit counter at 96 MHz rolls back to zero after 23.8 years. Maybe that's long enough to ignore rollover trouble?

The tough part is an efficient divide to scale the 56 bit number to milliseconds and microseconds. Instead, I took the path of less resistance and configured Systick for 1 kHz interrupt rate.
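
A rough sketch of how the 32 bit overflow count and the 24 bit Systick value could be combined into one 56 bit tick count. `combine56` is a hypothetical name, and the sketch assumes the counter runs down from 0xFFFFFF, so the ticks elapsed in the current period are 0xFFFFFF minus the current value. It ignores the race between reading the counter and the overflow interrupt, which real code would have to handle.

```cpp
#include <cstdint>

// Combine a 32-bit overflow count with the current value of a 24-bit
// down-counter into a single 56-bit up-counting tick value.
static uint64_t combine56(uint32_t overflows, uint32_t systick_value) {
    uint32_t elapsed = 0x00FFFFFFu - (systick_value & 0x00FFFFFFu);
    return (static_cast<uint64_t>(overflows) << 24) | elapsed;
}
```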
 
What about this to get microseconds:

96 MHz: return (uint32_t)(((uint64_t)0xAAAAAAABULL * (systick>>5)) >> 33);
48 MHz: return (uint32_t)(((uint64_t)0xAAAAAAABULL * (systick>>4)) >> 33);
24 MHz: return (uint32_t)(((uint64_t)0xAAAAAAABULL * (systick>>3)) >> 33);

(Found here: http://stackoverflow.com/questions/171301/whats-the-fastest-way-to-divide-an-integer-by-3)
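
To make the trick above easier to check on a host, here is a minimal sketch. The function names are made up for illustration. Multiplying by 0xAAAAAAAB and shifting right by 33 divides a 32 bit value by 3, and the preceding shift by 5 divides by 32, so together they divide by 96.

```cpp
#include <cstdint>

// Divide a 32-bit unsigned value by 3 using a multiply and a shift.
// The 64-bit product cannot overflow for any 32-bit input.
static uint32_t div_by_3(uint32_t n) {
    return static_cast<uint32_t>((0xAAAAAAABULL * n) >> 33);
}

// Convert 96 MHz clock ticks to microseconds: divide by 32, then by 3.
static uint32_t ticks_to_us_96mhz(uint32_t ticks) {
    return div_by_3(ticks >> 5);
}
```

Note that this only covers a 32 bit tick count. It needs a 32x32 -> 64 bit multiply, so it does not extend to a 56 bit counter without more work.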

ETA: Bah - wasn't thinking about the 56-bit part. That makes the math a bit uglier (would gcc even give you a uint128_t(ish) temporary to shift away after multiplying two 64-bit values?)
 
I'm pretty sure gcc can't support bigger than 64 bit integers on 32 bit ARM.

It could be done in assembly, and maybe it'd be a net win for reduced interrupt rate? Maybe....

Edit: but sadly, it seems Cortex-M0+ lacks any 32x32 -> 64 bit multiply instruction, which really puts a damper on trying to extend to larger than 32 bit results. The only hardware multiply is 32x32 -> 32 bits, where the upper bits are discarded.
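
To illustrate what the missing instruction costs, here is a sketch of a 32x32 -> 64 bit multiply built only from multiplies whose results fit in 32 bits, by splitting each operand into 16 bit halves. `mul32x32_to_64` is a hypothetical helper, not something from the Teensy core. The final 64 bit return value is only there to make the result easy to check on a host.

```cpp
#include <cstdint>

// 32x32 -> 64 bit multiply using four 16x16 -> 32 bit partial products.
static uint64_t mul32x32_to_64(uint32_t a, uint32_t b) {
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    uint32_t ll = a_lo * b_lo;   // weight 2^0
    uint32_t lh = a_lo * b_hi;   // weight 2^16
    uint32_t hl = a_hi * b_lo;   // weight 2^16
    uint32_t hh = a_hi * b_hi;   // weight 2^32

    // Sum the pieces that land in bits 16..31; the carry ends up in bits 16+ of mid.
    uint32_t mid = (ll >> 16) + (lh & 0xFFFFu) + (hl & 0xFFFFu);

    uint32_t lo = (ll & 0xFFFFu) | (mid << 16);
    uint32_t hi = hh + (lh >> 16) + (hl >> 16) + (mid >> 16);
    return (static_cast<uint64_t>(hi) << 32) | lo;
}
```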
 
Since I had some band obligations this weekend, I just started testing the LC in response to thread #19. I just tried the blink sketch at 4-16MHz and it compiles and runs fine, blinking the LED at the correct rate. 2MHz would not compile because the FTM initialization is not defined right, so DEFAULT_FTM_MOD and DEFAULT_FTM_PRESCALE are not pulled in at compile time.

So back to my earlier post: since at these speeds the MCU is not using the PLL, shouldn't we just define F_PLL (in kinetis.h) to the F_CPU speed for 2-16MHz? That makes sense to me.

Also, for Serial1 to work, SIM_SOPT_UARTSR0 would need to be set to either the internal clock (2MHz) or the external clock (4-16MHz) instead of not being set at all. I have not tried this yet.
 
Yup, these sound fine. After you've had a chance to test, please send a pull request.
 
USB MIDI is up and running with noteOn and noteOff, no problems so far other than Windows 7 being incredibly rubbish and slow at handling device connections.

Will be testing in anger tonight with pitch bends and modulations as well.

usbMIDI receive is not tested yet, but bear with me.
 
It's not on the list, but I'm sure you'll be pleased to know that a four Trellis set works just fine, with four I2C devices, and no modification at all (except changing the intpin used by the example file).
Not tested in anger, but all the lights ping on and off both via the code and the buttons.

Just to clarify, this refers to the core Adafruit Trellis engine only. I did not use any Untztrument code from Adafruit - it uses a flaky looking hack to use Paul's usbMIDI code on a Leonardo, and also includes a hard coded encoder library, which is best selected from Paul's supported encoder libs, so I have not focused on the Adafruit "Untztrument engine" at all. I did get bugs trying to compile it, but I suspect this is due to reasons other than Teensy-LC, which I did not spend any time trying to get to the bottom of.

The core elements are working and fully stable (alongside usbMIDI, analogue pot readings and a metro timer system) and this is good enough for me.
 
I'm pretty sure gcc can't support bigger than 64 bit integers on 32 bit ARM.

It could be done in assembly, and maybe it'd be a net win for reduced interrupt rate? Maybe....

Edit: but sadly, it seems Cortex-M0+ lacks any 32x32 -> 64 bit multiply instruction, which really puts a damper on trying to extend to larger than 32 bit results. The only hardware multiply is 32x32 -> 32 bits, where the upper bits are discarded.

Why not ask the compiler?
Code:
volatile int x;
volatile int y;

void f168(void) { y = x / 168; }
void f144(void) { y = x / 144; }
void f120(void) { y = x / 120; }
void f96(void) { y = x / 96; }
void f48(void) { y = x / 48; }
void f24(void) { y = x / 24; }

This translates (-S -O3 -mcpu=cortex-m4) to:
Code:
	.syntax unified
	.cpu cortex-m4
	.fpu softvfp
	.eabi_attribute 20, 1
	.eabi_attribute 21, 1
	.eabi_attribute 23, 3
	.eabi_attribute 24, 1
	.eabi_attribute 25, 1
	.eabi_attribute 26, 1
	.eabi_attribute 30, 2
	.eabi_attribute 34, 0
	.eabi_attribute 18, 4
	.thumb
	.file	"tst.c"
	.text
	.align	2
	.global	f168
	.thumb
	.thumb_func
	.type	f168, %function
f168:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	ldr	r3, .L2
	ldr	r1, .L2+4
	ldr	r3, [r3]
	ldr	r2, .L2+8
	smull	r0, r1, r1, r3
	asrs	r3, r3, #31
	rsb	r3, r3, r1, asr #5
	str	r3, [r2]
	bx	lr
.L3:
	.align	2
.L2:
	.word	x
	.word	818089009
	.word	y
	.size	f168, .-f168
	.align	2
	.global	f144
	.thumb
	.thumb_func
	.type	f144, %function
f144:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	ldr	r3, .L5
	ldr	r1, .L5+4
	ldr	r3, [r3]
	ldr	r2, .L5+8
	smull	r0, r1, r1, r3
	asrs	r3, r3, #31
	rsb	r3, r3, r1, asr #5
	str	r3, [r2]
	bx	lr
.L6:
	.align	2
.L5:
	.word	x
	.word	954437177
	.word	y
	.size	f144, .-f144
	.align	2
	.global	f120
	.thumb
	.thumb_func
	.type	f120, %function
f120:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	ldr	r3, .L8
	ldr	r1, .L8+4
	ldr	r3, [r3]
	ldr	r2, .L8+8
	smull	r0, r1, r1, r3
	add	r1, r1, r3
	asrs	r3, r3, #31
	rsb	r3, r3, r1, asr #6
	str	r3, [r2]
	bx	lr
.L9:
	.align	2
.L8:
	.word	x
	.word	-2004318071
	.word	y
	.size	f120, .-f120
	.align	2
	.global	f96
	.thumb
	.thumb_func
	.type	f96, %function
f96:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	ldr	r3, .L11
	ldr	r1, .L11+4
	ldr	r3, [r3]
	ldr	r2, .L11+8
	smull	r0, r1, r1, r3
	asrs	r3, r3, #31
	rsb	r3, r3, r1, asr #4
	str	r3, [r2]
	bx	lr
.L12:
	.align	2
.L11:
	.word	x
	.word	715827883
	.word	y
	.size	f96, .-f96
	.align	2
	.global	f48
	.thumb
	.thumb_func
	.type	f48, %function
f48:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	ldr	r3, .L14
	ldr	r1, .L14+4
	ldr	r3, [r3]
	ldr	r2, .L14+8
	smull	r0, r1, r1, r3
	asrs	r3, r3, #31
	rsb	r3, r3, r1, asr #3
	str	r3, [r2]
	bx	lr
.L15:
	.align	2
.L14:
	.word	x
	.word	715827883
	.word	y
	.size	f48, .-f48
	.align	2
	.global	f24
	.thumb
	.thumb_func
	.type	f24, %function
f24:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	ldr	r3, .L17
	ldr	r1, .L17+4
	ldr	r3, [r3]
	ldr	r2, .L17+8
	smull	r0, r1, r1, r3
	asrs	r3, r3, #31
	rsb	r3, r3, r1, asr #2
	str	r3, [r2]
	bx	lr
.L18:
	.align	2
.L17:
	.word	x
	.word	715827883
	.word	y
	.size	f24, .-f24
	.comm	y,4,4
	.comm	x,4,4
	.ident	"GCC: (GNU Tools for ARM Embedded Processors) 4.8.4 20140725 (release) [ARM/embedded-4_8-branch revision 213147]"

That seems to be pretty efficient! It uses the "magic number" 715827883 (0x2AAAAAAB).
EDIT: the code is even shorter with unsigned int :) (load "magic", umull, lsrs, store)

Unfortunately, for the M0 gcc wants to call __aeabi_idiv.
How can one tell the compiler not to use floating-point?
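
For readers who do not read ARM assembly, the f96 routine above corresponds roughly to the C++ below. This is a hand translation for illustration (`div96_magic` is a made-up name): take the high 32 bits of the signed 64 bit product with the magic number, shift right by 4, and subtract the sign bit of the input so that negative results round toward zero. It assumes right shifts of negative values are arithmetic, which holds for gcc.

```cpp
#include <cstdint>

// Signed division by 96 via multiply-high, mirroring smull / asr / rsb.
// Assumes >> on negative signed values is an arithmetic shift.
static int32_t div96_magic(int32_t x) {
    int64_t product = static_cast<int64_t>(715827883) * x;   // smull
    int32_t high = static_cast<int32_t>(product >> 32);      // upper word (r1)
    int32_t sign = x >> 31;                                  // asrs r3, r3, #31
    return (high >> 4) - sign;                               // rsb r3, r3, r1, asr #4
}
```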
 
@manitou I've got no Teensy LC yet, but I can package totally untested DmaSpi code for you!

I have added my SPI1 DMA attempt (not working) to the github sketch/hack back in reply #75 of this thread. I also have performance numbers for 8-bit SPI1 write and for 16-bit write (22.3 mbs). I have tried a dizzying number of register settings to get the SPI1 DMA to work ... seeking divine intervention. :D

Ref manual (37.4.5.1) states "continuous mode is not a practical configuration of the DMA controller to write data to the SPI data register and is not recommended" -- this doesn't sound hopeful.
 
For full arithmetic support, GCC tends to enable __int128 only on 64-bit machines.

In terms of floating point, remember that the Teensy 3.1 uses the M4 without the additional floating point unit. This means that all floating point must be emulated in software. At some point in the future, the Teensy 4 (or 3.2 or 3++) will use a chip with at least single precision hardware support. But that chip has not been announced yet.
 
Why not ask the compiler ?
...
seems to be pretty efficient! It uses the "Magic Number" 715827883 (0x2AAAAAAB)
EDIT: code is even shorter with unsigned int :) ( load "magic", umull, lsrs , store)

unfortunately for the m0 gcc wants to call __aeabi_idiv
how can one tell the compiler not to use floating-point ?

The problem is that the Teensy-LC, being an M0+, lacks the instruction that multiplies two 32 bit values into a 64 bit value, which is what you need for that magic number trick to work. You compiled the code for the M4, which is why it used the magic numbers.
 
Yes... :)
But what would an efficient division of a 56 (or 64) bit value by our constants 24, 48, 96... look like in assembler?
For both CPUs
 
<thinking aloud>
Division by, e.g., 96 is the same as multiplication by 1/96.
1/96 is roughly the same as 0x1555555 / 0x80000000, because 0x80000000 / 96 = 0x1555555 (32 bit in this example, but it works for any number of bits).
Division by 0x80000000 is only a shift - or simply ignoring the lower 32 bits in this case...
So the problem left is to multiply our 64 bit value by 0x1555555... (roughly the same way as the #defines above in post #86)

Multiplication 0xF * 0xF (example) can be written as (0xF * 8) + (0xF * 4) + (0xF * 2) + (0xF * 1), which are shifts too. Or could we use the multiply-add?

...and now it's too late to write a macro :)
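
The shift-and-add idea above can be sketched like this. `mul_shift_add` is a hypothetical helper written for illustration: for every set bit in the constant, the value shifted left by that bit position is added to the sum. The result wraps at 64 bits, just like an ordinary 64 bit multiply.

```cpp
#include <cstdint>

// Multiply a 64-bit value by a 32-bit constant using only shifts and adds.
static uint64_t mul_shift_add(uint64_t value, uint32_t constant) {
    uint64_t sum = 0;
    for (int bit = 0; bit < 32; ++bit) {
        if (constant & (1u << bit)) {
            sum += value << bit;   // value * 2^bit
        }
    }
    return sum;
}
```

One caveat with the truncated reciprocal 0x1555555: because it is rounded down, the scaled result can come out one lower than the exact quotient for some inputs, so a real implementation would need a rounded-up constant or a correction step.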
 
But what would an efficient division of a 56 (or 64) bit value by our constants 24, 48, 96... look like in assembler?
For both CPUs

For Cortex-M4, it would be pretty efficient, I believe just 2 multiply instructions to create the two 64 bit partial products, and then 2 add instructions to sum them together into the 96 bit result. There might even be a crafty way to use only 4 registers instead of 5, maybe? (edit: on second thought, 7 registers might be needed, 2 extra to hold the upper partial product after creating the lower one but before adding)

For Cortex-M0+, I'm afraid it could become fairly complex. There's no 32x32 multiply that gives a 64 bit result. Instead, the inputs would need to be handled as six 16 bit numbers, where so far each input was handled as 2 numbers. I believe 8 multiplies would be needed, to produce eight 32 bit partial products, and then lots of add instructions would sum them together to the 96 bit result, with plenty of 16 bit shifting and moving data around between registers. With some crafty optimizing, it *might* be possible to keep all the work within the low 8 registers. If any of the upper registers are needed, even more instructions would shuffle stuff around between registers, since Cortex-M0+ has fewer instructions that can do much with the upper registers.
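
As a host-side illustration of the Cortex-M4 approach described above (two 32x32 -> 64 bit partial products summed into a 96 bit result), here is a minimal sketch. The `U96` struct and `mul64x32` are hypothetical names, and this is C++ rather than the assembly being discussed.

```cpp
#include <cstdint>

// 96-bit result: upper 32 bits and lower 64 bits.
struct U96 {
    uint32_t hi;
    uint64_t lo;
};

// Multiply a 64-bit value by a 32-bit constant using two 32x32 -> 64 products.
static U96 mul64x32(uint64_t value, uint32_t c) {
    uint64_t p_lo = static_cast<uint64_t>(static_cast<uint32_t>(value)) * c;  // low word * c
    uint64_t p_hi = (value >> 32) * c;                                        // high word * c, weight 2^32

    // Add the carry-out of the low product into the high product.
    // This sum cannot overflow 64 bits.
    uint64_t mid = (p_lo >> 32) + p_hi;

    U96 result;
    result.lo = (mid << 32) | static_cast<uint32_t>(p_lo);
    result.hi = static_cast<uint32_t>(mid >> 32);
    return result;
}
```

On Cortex-M0+ each of the two 64 bit partial products would itself have to be built from 16 bit pieces, which is where the eight multiplies come from.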

I'm going to just keep Systick at 1 kHz.
 
I've been testing the emulated EEPROM read and write, and getting a few lost bits.

I'm saving the LED state of my 8x8 grid, 255 is an on, 0 is an off.

When reading back 255 is on, anything that is not 255 is an off. In this way we should test the read/write of every bit each time.

I started out writing the EEPROM regularly in a loop (bad idea I know) and got quite a lot of dropped squares:
[Attached images: fail1.jpg, fail2.jpg, target.jpg]

These dropped squares persist through multiple reboots. Therefore it is the write cycle at fault.

Pulling the write trigger out to a digital pin improved things, but then I got something which leads me to believe it _could_ be I2C issues:
[Attached image: IMG_8908.jpg]

An entire block is missing. The corner LED was only on because of some early bad code using that LED output to indicate the EEPROM write, improved now so it doesn't affect the save. Trust me, the block was written entirely "not 255".

My Trellis set uses the interrupt pin, I believe. I wonder if it is disrupting the write? Or is the write disrupting the I2C read?

code here:
Code:
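// trellis and numKeys are defined elsewhere in the sketch.
void saveGrid() {
  for (uint8_t i = 0; i < numKeys; i++) {
    // 255 marks an LED that is on, 0 one that is off
    if (trellis.isLED(i))
      EEPROM.write(i, 255);
    else
      EEPROM.write(i, 0);
  }
}

void loadGrid() {
  for (uint8_t i = 0; i < numKeys; i++) {
    // anything other than 255 is treated as off
    if (EEPROM.read(i) == 255)
      trellis.setLED(i);
    else
      trellis.clrLED(i);
  }
  trellis.writeDisplay();
}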

I also have two metro timers running, 1 at 2ms and 1 at 1000ms.

So does the emulated EEPROM use a ringfenced bit of sketch RAM?

I'm ironically struggling to reproduce the error again now....
It's quite possible that it is the (unsupported) trellis library at fault here of course.
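
One way to narrow down whether the write itself is at fault would be to read each byte back right after writing it and count mismatches. The sketch below is only an illustration that runs on a host: `MockEeprom` is a made-up in-memory stand-in with the same read/write shape as the EEPROM object, and `saveAndVerify` is a hypothetical helper, not part of any library.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical in-memory stand-in for the EEPROM object (host testing only).
struct MockEeprom {
    uint8_t cells[128] = {};
    uint8_t read(size_t addr) const { return cells[addr]; }
    void write(size_t addr, uint8_t value) { cells[addr] = value; }
};

// Store each LED state (255 = on, 0 = off), skipping bytes that already hold
// the wanted value, then read back and count how many bytes do not match.
template <typename Store>
static size_t saveAndVerify(Store &store, const bool *ledOn, size_t numKeys) {
    size_t mismatches = 0;
    for (size_t i = 0; i < numKeys; i++) {
        uint8_t want = ledOn[i] ? 255 : 0;
        if (store.read(i) != want) {
            store.write(i, want);
        }
        if (store.read(i) != want) {
            mismatches++;
        }
    }
    return mismatches;
}
```

If the mismatch count stays at zero on the real hardware while squares still go missing after a reboot, the problem is more likely in what gets passed to the save routine than in the write cycle.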
 