AudioOutputI2SQuad and memcpy_tointerleaveQuad

Status
Not open for further replies.

chipaudette

Well-known member
I'm trying to understand the interleaving that is done with the audio samples in the AudioOutputI2SQuad class. Looking at the isr() method, I see that it calls memcpy_tointerleaveQuad, which is written in assembly. Not being an assembly guy, I'm having a hard time figuring it out.

Here is the assembly function:

Code:
/* void memcpy_tointerleaveQuad(int16_t *dst, const int16_t *src1, const int16_t *src2, const int16_t *src3, const int16_t *src4) */
 .global	memcpy_tointerleaveQuad
.thumb_func
	memcpy_tointerleaveQuad:

	@ r0: dst
	@ r1: src1
	@ r2: src2
	@ r3: src3
	@ r4: src4

	push	{r4-r11}
	ldr r4, [sp, #(0+32)] //5th parameter is saved on the stack
	add r11,r0,#(AUDIO_BLOCK_SAMPLES*4)
	.align 2
.loopQuad:

	ldr r5, [r1],4
	ldr r6, [r3],4
	pkhbt r7,r5,r6,LSL #16
	pkhtb r9,r6,r5,ASR #16
	ldr r5, [r2],4
	ldr r6, [r4],4
	pkhbt r8,r5,r6,LSL #16
	pkhtb r10,r6,r5,ASR #16

	stmia r0!, {r7-r10}

	cmp r11, r0
	bne .loopQuad

	pop	{r4-r11}
	BX lr
.END

#endif

The core part is clearly this:

Code:
	ldr r5, [r1],4
	ldr r6, [r3],4
	pkhbt r7,r5,r6,LSL #16
	pkhtb r9,r6,r5,ASR #16
	ldr r5, [r2],4
	ldr r6, [r4],4
	pkhbt r8,r5,r6,LSL #16
	pkhtb r10,r6,r5,ASR #16

Can someone help me see what is happening here? The pkhbt and pkhtb lines, in particular, are confusing me because I totally don't understand what end format is desired.

What is the end format that is desired? What is the order of the samples? How tightly are they packed? Is this flipping the endian-ness?
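
For anyone following the question above, here is a rough C model of one pass through the loop. It is only a sketch: the helper names (pkhbt_lsl16, pkhtb_asr16, interleave_quad_model) are made up for illustration, it assumes a little-endian machine like the Cortex-M, and it models only the two shift forms that appear in the listing (LSL #16 and ASR #16). Read this way, each 32-bit load picks up two consecutive int16 samples, and the two pack instructions pair up the low halves and then the high halves. The output order for each frame comes out as src1, src3, src2, src4, packed as plain int16 with no byte swapping.

```c
#include <stdint.h>
#include <string.h>

/* pkhbt rd, rn, rm, LSL #16: bottom half from rn, top half from (rm << 16). */
uint32_t pkhbt_lsl16(uint32_t rn, uint32_t rm)
{
    return (rn & 0x0000FFFFu) | ((rm << 16) & 0xFFFF0000u);
}

/* pkhtb rd, rn, rm, ASR #16: top half from rn, bottom half from (rm >> 16).
   Only the low 16 bits of the shifted value are kept, so a logical shift
   gives the same result as the arithmetic one here. */
uint32_t pkhtb_asr16(uint32_t rn, uint32_t rm)
{
    return (rn & 0xFFFF0000u) | ((rm >> 16) & 0x0000FFFFu);
}

/* Hypothetical model of the loop body. frames must be even, because each
   pass consumes two samples per channel (one 32-bit load each). */
void interleave_quad_model(int16_t *dst,
                           const int16_t *src1, const int16_t *src2,
                           const int16_t *src3, const int16_t *src4,
                           int frames)
{
    for (int i = 0; i < frames; i += 2) {
        uint32_t r5, r6, out[4];               /* out[0..3] stand in for r7..r10 */

        memcpy(&r5, src1 + i, sizeof r5);      /* ldr r5, [r1],4 */
        memcpy(&r6, src3 + i, sizeof r6);      /* ldr r6, [r3],4 */
        out[0] = pkhbt_lsl16(r5, r6);          /* r7  = { src1[i],   src3[i]   } */
        out[2] = pkhtb_asr16(r6, r5);          /* r9  = { src1[i+1], src3[i+1] } */

        memcpy(&r5, src2 + i, sizeof r5);      /* ldr r5, [r2],4 */
        memcpy(&r6, src4 + i, sizeof r6);      /* ldr r6, [r4],4 */
        out[1] = pkhbt_lsl16(r5, r6);          /* r8  = { src2[i],   src4[i]   } */
        out[3] = pkhtb_asr16(r6, r5);          /* r10 = { src2[i+1], src4[i+1] } */

        memcpy(dst + 4 * i, out, sizeof out);  /* stmia r0!, {r7-r10} */
    }
}
```

With src1 = {10, 11}, src2 = {20, 21}, src3 = {30, 31} and src4 = {40, 41}, this model writes 10, 30, 20, 40, 11, 31, 21, 41.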

I'd appreciate any help!

Chip
 
Hi Paul,

Yeah, I saw that. When I copied that into my Float32 extension of this class and adjusted all the data types, it didn't work on T4. It works on T3.6 and sounds fine. On T4.1, it compiles but sounds horribly distorted.

My class works fine for T4.1 when I insert extra code to re-introduce temporary int16 arrays just so that I call that assembly function to do the interleaving. That works.

But when I go back to the C code, my audio sounds distorted. Half of the audio is there and the other half is zeros, which is what made me think that my interleaving was off.

I fully accept that the problem is likely still on my side. But since I've now localized my issue to just the interleaving, I figured that I'd ask: Even on T4, should that C code produce the same interleaving as the assembly?

(I understand the C, but not the assembly, so I can't confirm for myself)

Chip
 
Sorry, can't dive into this right now. But I can tell you the 6 and 8 channel versions work the same basic way, and do not (yet) have an assembly-optimized copy. Maybe also look at those for some hints?
 
I think that I found the issue. The OCT version of AudioOutputI2S gave me an important clue...

Here is the quad version:

Code:
#if 1
	memcpy_tointerleaveQuad(dest, src1, src2, src3, src4);
#else
	for (int i=0; i < AUDIO_BLOCK_SAMPLES/2; i++) {
		*dest++ = *src1++;
		*dest++ = *src3++;
		*dest++ = *src2++;
		*dest++ = *src4++;
	}
#endif
	arm_dcache_flush_delete(dest, sizeof(i2s_tx_buffer) / 2 );

and here is the OCT version:

Code:
#if 0
	// TODO: optimized 8 channel copy...
	memcpy_tointerleaveQuad(dest, src1, src2, src3, src4);
#else
	int16_t *p = dest;
	for (int i=0; i < AUDIO_BLOCK_SAMPLES/2; i++) {
		*p++ = *src1++;
		*p++ = *src3++;
		*p++ = *src5++;
		*p++ = *src7++;
		*p++ = *src2++;
		*p++ = *src4++;
		*p++ = *src6++;
		*p++ = *src8++;
	}
#endif
	arm_dcache_flush_delete(dest, sizeof(i2s_tx_buffer) / 2);

The OCT version does NOT move the dest pointer, whereas the QUAD version does. When I change my version of QUAD to be like your OCT version (by introducing the temporary pointer p), my code suddenly starts working. Yay! Problem solved!

[Note: I would suspect that your QUAD doesn't work if someone were to use that block of C code rather than your assembly function.]

To my eye, the error actually happens when we call the arm_dcache_flush_delete function, because that function uses the dest pointer. If dest has been advanced by the preceding loop, arm_dcache_flush_delete does its work on the wrong memory range, starting just past the data that was written. If you leave dest at its original location, though, everything works fine.

For my own education, what is the arm_dcache_flush_delete function doing? I don't remember seeing it in the T3 version of the code...
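
A small host-side sketch of that failure mode, under loud assumptions: record_flush is a made-up stand-in that only remembers the range it was handed (it is not the real cache call), and FRAMES, fill_advancing_dest and fill_with_temp_pointer are hypothetical names chosen for illustration. It shows that the first variant hands the flush an address just past the written data, while the second hands it the start of the data.

```c
#include <stddef.h>
#include <stdint.h>

#define FRAMES 4   /* small hypothetical block size, for illustration only */

/* Stand-in for the cache call: it only records the range it was asked for. */
const void *flushed_addr;
size_t      flushed_size;

void record_flush(const void *addr, size_t size)
{
    flushed_addr = addr;
    flushed_size = size;
}

/* QUAD-style C path: dest itself is advanced by the loop, so the range
   passed afterwards starts just past the data that was written. */
void fill_advancing_dest(int16_t *dest,
                         const int16_t *src1, const int16_t *src2,
                         const int16_t *src3, const int16_t *src4)
{
    for (int i = 0; i < FRAMES; i++) {
        *dest++ = *src1++;
        *dest++ = *src3++;
        *dest++ = *src2++;
        *dest++ = *src4++;
    }
    record_flush(dest, FRAMES * 4 * sizeof(int16_t));
}

/* OCT-style C path: a temporary pointer walks the buffer and dest is left
   alone, so the range passed afterwards covers the written data. */
void fill_with_temp_pointer(int16_t *dest,
                            const int16_t *src1, const int16_t *src2,
                            const int16_t *src3, const int16_t *src4)
{
    int16_t *p = dest;
    for (int i = 0; i < FRAMES; i++) {
        *p++ = *src1++;
        *p++ = *src3++;
        *p++ = *src2++;
        *p++ = *src4++;
    }
    record_flush(dest, FRAMES * 4 * sizeof(int16_t));
}
```

Both variants write identical samples into the buffer. The only difference is the address that reaches the flush call, which is consistent with the C path sounding fine on T3.6 (no data cache) and distorted on T4.1.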

Chip
 
[Note: I would suspect that your QUAD doesn't work if someone were to use that block of C code rather than your assembly function.]

Yeah, that old code probably wouldn't work anymore.

I added a comment in the code with a link to this thread. Hopefully that will help anyone else who later wants to figure out how it works.


For my own education, what is the arm_dcache_flush_delete function doing? I don't remember seeing it in the T3 version of the code...

Teensy 3 doesn't have caching for RAM. Teensy 4 does.

Usually when you've just written data to memory, some or all of what you wrote will be stored only in the processor's L1 cache. DMA can only access the actual memory, not the cache. Flush means the data sitting only in the cache must be written out to the actual memory. Delete means the cache is to forget it had that data, regardless of whether it is the same or different from the actual memory.

Normally for DMA-based transmission, you want to flush to memory and delete it from the cache, so the CPU can put its cache to better use for other stuff. For DMA receiving you would use only delete, so your reads of freshly arrived data are sure to read the actual memory rather than give you whatever happens to still be held in the cache.

For some types of transmission, like a graphics frame buffer, you might wish to only flush before DMA. Keeping a copy in the cache means more drawing operations with lots of read-modify-write operations will tend to run faster. That's why 3 of these functions exist, so you can flush, delete or flush+delete.
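
To make the flush / delete distinction concrete, here is a toy model that runs on a PC. Everything in it is an assumption for illustration: the names (ram, cpu_write, dma_read, toy_flush and so on) are invented, and it tracks each array element separately, whereas a real data cache works on whole cache lines. It is not the Teensy implementation, only a sketch of the behavior described above.

```c
#include <stdbool.h>
#include <stdint.h>

#define N 8   /* toy buffer size, chosen for illustration */

int16_t ram[N];          /* actual memory: the only thing "DMA" can see */
int16_t cache_data[N];   /* the CPU's cached copy                       */
bool    cache_valid[N];  /* does the cache hold this element?           */
bool    cache_dirty[N];  /* cached value not yet written back to ram?   */

/* CPU write: lands in the cache only (write-back behavior). */
void cpu_write(int i, int16_t v)
{
    cache_data[i] = v;
    cache_valid[i] = true;
    cache_dirty[i] = true;
}

/* CPU read: served from the cache if present, otherwise loaded from ram. */
int16_t cpu_read(int i)
{
    if (!cache_valid[i]) {
        cache_data[i] = ram[i];
        cache_valid[i] = true;
        cache_dirty[i] = false;
    }
    return cache_data[i];
}

/* DMA bypasses the cache entirely. */
void    dma_write(int i, int16_t v) { ram[i] = v; }
int16_t dma_read(int i)             { return ram[i]; }

/* Flush: write dirty cached data out to ram, keep the cached copy. */
void toy_flush(int start, int count)
{
    for (int i = start; i < start + count; i++) {
        if (cache_valid[i] && cache_dirty[i]) {
            ram[i] = cache_data[i];
            cache_dirty[i] = false;
        }
    }
}

/* Delete: make the cache forget the data, whether or not it matches ram. */
void toy_delete(int start, int count)
{
    for (int i = start; i < start + count; i++) {
        cache_valid[i] = false;
        cache_dirty[i] = false;
    }
}

/* Flush then delete: the usual choice before DMA-based transmission. */
void toy_flush_delete(int start, int count)
{
    toy_flush(start, count);
    toy_delete(start, count);
}
```

In this model, a value stored with cpu_write is invisible to dma_read until toy_flush or toy_flush_delete covers that index, and a value stored with dma_write is invisible to cpu_read while a stale cached copy exists, until toy_delete covers it.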
 