teensy 3 memcpy has gotten slower

Paul... I've never heard of a user who used so much flash that they really needed newlib-nano.
Flash-memory usage does not matter.

What is the advantage of a sketch needing 20k instead of 30k? I'm really curious...


Edit: There are some more assembler functions. We could look at them; maybe they can speed up some other functions too?
Edit: We're not on AVR, where size was important...
 
Frank, give this a try.....

Code:
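#define WORDS 1024
int src[WORDS], dst[WORDS];
//int32_t DMAMEM _dma_Buffer_A[WORDS];
//int32_t DMAMEM _dma_Buffer_B[WORDS];

// Copy `count` 32-bit words using DMA channel 1 in 16-byte burst mode.
// Assumption (not verified here): src, dest and count * 4 should all be multiples of 16.
void memcpy32(int *dest, const int *src, unsigned int count)
{
  DMA_TCD1_SADDR = src;                  // source address
  DMA_TCD1_SOFF = 16;                    // advance source 16 bytes per read
  DMA_TCD1_ATTR = DMA_TCD_ATTR_SSIZE(4) | DMA_TCD_ATTR_DSIZE(4); // size code 4 = 16-byte burst
  DMA_TCD1_NBYTES_MLNO = count * 4;      // total bytes, moved in one minor loop
  DMA_TCD1_SLAST = 0;                    // no source address adjustment afterwards
  DMA_TCD1_DADDR = dest;                 // destination address
  DMA_TCD1_DOFF = 16;                    // advance destination 16 bytes per write
  DMA_TCD1_CITER_ELINKNO = 1;            // a single major-loop iteration
  DMA_TCD1_DLASTSGA = 0;                 // no destination address adjustment afterwards
  DMA_TCD1_BITER_ELINKNO = 1;
  DMA_TCD1_CSR = DMA_TCD_CSR_START;      // software-triggered start
  while (!(DMA_TCD1_CSR & DMA_TCD_CSR_DONE)) /* wait */ ;
}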

I haven't actually checked if this is really copying the data properly.... and I really need to get some rest, then get back on the installer for 1.6.0. But if it's really working... crazy speed!


Edit: I've always wondered if the 16 byte mode really works???
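
Since the copy above was posted unchecked, here is a rough sketch of how one might verify a routine with the same signature. The names `check_copy`, `copy32_reference` and `copy32_broken` are made up for illustration, and the plain word loop only stands in for the DMA version so the check can run off-target; on a Teensy you would pass `memcpy32` instead.

```c
#include <assert.h>
#include <stddef.h>

#define WORDS 1024
#define POISON 0x5A5A5A5A

/* Same signature as memcpy32() in the post above. */
typedef void (*copy32_fn)(int *dest, const int *src, unsigned int count);

/* Plain word loop, used only as a stand-in for the DMA copy. */
static void copy32_reference(int *dest, const int *src, unsigned int count)
{
  while (count--) *dest++ = *src++;
}

/* Deliberately broken copy (skips the final word), to show a failure being caught. */
static void copy32_broken(int *dest, const int *src, unsigned int count)
{
  if (count) copy32_reference(dest, src, count - 1);
}

/* Returns the index of the first wrong word, or -1 if the copy is exact
   and nothing past `count` words was overwritten. */
static long check_copy(copy32_fn copy, unsigned int count)
{
  static int src[WORDS], dst[WORDS];
  unsigned int i;

  assert(copy != NULL && count <= WORDS);
  for (i = 0; i < WORDS; i++) {
    src[i] = (int)(i * 31u + 7u);  /* distinct values, never equal to POISON */
    dst[i] = POISON;
  }
  copy(dst, src, count);
  for (i = 0; i < count; i++)
    if (dst[i] != src[i]) return (long)i;
  for (i = count; i < WORDS; i++)
    if (dst[i] != POISON) return (long)i;
  return -1;
}
```

Poisoning the destination first matters here: a DMA transfer that silently does nothing would otherwise pass if the buffers happened to hold the same data from an earlier run.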
 
Obviously the rest of newlib has a lot of speed optimized stuff, which enlarges even this very simple program by 50% and uses 2K extra RAM.
I would also guess that part of the issue is newlib being compiled with -O2 or -O3 while nano is using -Os. It may make sense to bring in the optimized memset as well. However, I suspect memset is called less frequently in embedded environments than in hosted environments, since you have less dynamic allocation (and fewer calls to calloc).
 
I will test whether it's copying correctly. But I think it will be slower for small amounts.
Edit: 16 bytes, for example. But I'll look at this. Maybe we can use both.
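
The "use both" idea could look roughly like the sketch below: fall back to the ordinary `memcpy` for small or awkward blocks and hand only large blocks to the DMA routine. Everything here is an assumption for illustration: the names `memcpy32_hybrid` and `dma_suitable`, the 256-byte crossover (it would have to be measured), and the guess that 16-byte bursts want 16-byte aligned addresses and a length that is a multiple of 16.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical crossover point; the real value has to be benchmarked. */
#define DMA_MIN_BYTES 256u

/* Same signature as memcpy32() from the earlier post. */
typedef void (*copy32_fn)(int *dest, const int *src, unsigned int count);

/* Returns 1 if a block of `count` words looks suitable for 16-byte DMA bursts.
   The alignment and length rules are assumptions, not measured facts. */
static int dma_suitable(const void *dest, const void *src, unsigned int count)
{
  unsigned int bytes = count * 4u;

  return bytes >= DMA_MIN_BYTES
      && (bytes % 16u) == 0
      && ((uintptr_t)dest % 16u) == 0
      && ((uintptr_t)src % 16u) == 0;
}

/* Copies `count` words, using `dma_copy` only when the block qualifies.
   Pass NULL for `dma_copy` to always take the CPU path. */
static void memcpy32_hybrid(int *dest, const int *src, unsigned int count,
                            copy32_fn dma_copy)
{
  assert(dest != NULL && src != NULL);
  if (dma_copy != NULL && dma_suitable(dest, src, count))
    dma_copy(dest, src, count);
  else
    memcpy(dest, src, (size_t)count * 4u);
}
```

On the Teensy the call would be something like `memcpy32_hybrid(dst, src, WORDS, memcpy32)`. The setup cost of programming the TCD registers is what makes the small-block fallback worth having.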
 
I believe running the DMA in 32 bit mode is using 2 cycles to read and 2 cycles to write. The measured 43 us for 4000 bytes is about 93 Mbyte/sec, pretty close to 1 word per 4 cycles.

I believe the 16 byte mode reads 4 words in 5 cycles, and writes 4 words in 5 cycles. This isn't actually documented anywhere... it's only my guesswork based on optimizing code (mostly in the audio lib) and reading a lot about Cortex-M4. That's 16 bytes every 10 cycles, or theoretically 153.6 MByte/sec at 96 MHz. I measured 28 us for 4096 bytes, which is 146.3 Mbyte/sec.

But I didn't carefully inspect the memory. I didn't inspect it at all. But if this really is working, that's a pretty nice speedup, and a trick I'll keep in mind for future stuff in libraries....
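
For anyone who wants to redo the arithmetic, a small sketch follows. It assumes a 96 MHz clock (which is what the 153.6 MByte/sec figure implies), and the function names and the `F_CPU_HZ` macro are made up for this example.

```c
#include <assert.h>

/* Assumption: 96 MHz clock, as implied by 16 bytes / 10 cycles = 153.6 MByte/s. */
#define F_CPU_HZ 96000000.0

/* Theoretical throughput in MByte/s when `bytes` are moved every `cycles` clocks. */
static double theoretical_mbyte_per_s(double bytes, double cycles)
{
  assert(cycles > 0.0);
  return bytes * F_CPU_HZ / cycles / 1e6;
}

/* Measured throughput as a percentage of the theoretical figure.
   Bytes per microsecond is numerically the same as MByte/s. */
static double efficiency_percent(double bytes, double microseconds,
                                 double theoretical)
{
  assert(microseconds > 0.0 && theoretical > 0.0);
  return 100.0 * (bytes / microseconds) / theoretical;
}
```

With these, `theoretical_mbyte_per_s(16, 10)` gives 153.6 and `theoretical_mbyte_per_s(4, 4)` gives 96, while 4096 bytes in 28 us works out to roughly 95% of the 16-byte-mode figure.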
 
Someday we'll have Cortex-M7. My understanding is the TCM gives you a big chunk of RAM inside the processor core where ST and LD instructions (the first in a sequence) don't have to suffer that 1 cycle bus arbitration. Today everything in the audio lib is structured to get 2 words in 3 cycles or 4 words in 5 cycles. Being able to get 1 word in 1 cycle could really change a lot of things with code optimizations!
 
OK, I can confirm that the DMA variant is copying correctly, but this was only a first quick check. I will test more. Unfortunately I don't have much time today, but I can do it tomorrow.

But this does not help with our newlib-nano problem.
Are you going to use the workaround?

In the meantime, I've already installed Ubuntu 8.10 in a virtual machine, so, theoretically, I'm able to compile the launchpad toolchain now.
 
Nice! Faster than the previous memcpy32() on 3.1. No change on 3.0, and the copy is working.

teensy 3.1:
Code:
3
3
45
744.19 mbs  43 us   memcpy32
1185.19 mbs  27 us   memcpy32p
744.19 mbs  43 us   memset32
329.90 mbs  97 us   loop copy
640.00 mbs  50 us   memcpy
3
127.49 mbs  251 us   memset
42424242
415.58 mbs  77 us   set loop
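
A note on reading the table: the `mbs` column appears to be megabits per second for a 4000-byte buffer, since 4000 * 8 / 43 comes out at 744.19. That reading is inferred from the numbers, not stated in the benchmark itself. A sketch of the conversion, with a made-up function name:

```c
#include <assert.h>

/* Throughput in Mbit/s: bits moved divided by elapsed microseconds.
   Assumes the benchmark reports megabits, which matches the rows above. */
static double mbit_per_s(unsigned long bytes, unsigned long microseconds)
{
  assert(microseconds > 0);
  return (double)bytes * 8.0 / (double)microseconds;
}
```

Dividing the column by 8 gives MByte/sec, so the 27 us `memcpy32p` row is about 148 MByte/sec.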
 
I have a working setup for compiling a gcc 4.9.3 toolchain now, if someone is interested.
I think I can compile a variant with newlib-nano, some optimized functions, and -O2 or -O3 instead of -Os, and upload it somewhere.

The standard -Os compiles fine so far.
(takes several hours :)

I'm currently compiling the latest version with the newest newlib, 2.2.0-1. These are the changes:
Code:
*** Major changes in newlib version 2.2.0:

- multiple functional/performance enhancements for arm/aarch64
- new nano formatted I/O support
- replacement of or16/or32 with or1k platform
- qsort_r support
- additional long double math routines
- itoa/utoa/ltoa
- restructuring of gmtime/localtime so tz functions only linked by localtime
- unlocked I/O functions
- various warning clean-ups
 
I successfully compiled my codecs with mp3 player with gcc 4.9.3 and newlib 2.2. Works!

But this change in newlib makes Teensyduino a bit incompatible with the new newlib:
Code:
- itoa/utoa/ltoa

At first nothing compiles successfully, and some changes in Teensyduino are needed to get it working.
For now, I commented out these functions in avr_functions.h and wstring.cpp. But I think that is a minor problem and a better fix is possible.

Newlib-nano with -O2 needs the same amount of RAM and a bit more flash. So my MP3 player leaves 191KB unused flash instead of 193...
Tomorrow I want to test -O3 and do some benchmarks.

Edit: And test Teensy LC :)
 
Adding the .S file in the core library breaks 1.6.0's reuse of core.a.

I just committed code to improve 1.6.0's handling of assembly files. It'll be in "beta6", probably later today. Please keep an eye on whether .S file handling is really correct, when using beta6.
 
I compiled my codecs (they use *.S). Looks good, seems to work so far.
 
Paul,

I did some tests (but with newlib 2.2; to be sure, we have to test again with your "official" version (is it 2.1?)):
Indeed, it seems to be sufficient to switch from -Os to -O2 to get faster functions. The RAM usage is the same, flash a bit more. So the workaround is not needed, I think.

It would be great if you could change the optimization (with 1.22?), since this affects not only memcpy.
Do you use the launchpad toolchain? If yes, all you would have to do is change this in build-toolchain.sh:
Code:
[...]
popd
restoreenv

echo Task [III-3] /$HOST_NATIVE/newlib-nano/
saveenv
prepend_path PATH $INSTALLDIR_NATIVE/bin
saveenvvar CFLAGS_FOR_TARGET '-g -O2 -ffunction-sections -fdata-sections'  # <-- change here
rm -rf $BUILDDIR_NATIVE/newlib-nano && mkdir -p $BUILDDIR_NATIVE/newlib-nano
pushd $BUILDDIR_NATIVE/newlib-nano

$SRCDIR/$NEWLIB_NANO/configure  \
    $NEWLIB_CONFIG_OPTS \
    --target=$TARGET \
[...]

Extra hint: you could save compilation time and space with this:
(Line 92)
Code:
MULTILIB_LIST="--with-multilib-list=armv6-m,armv7e-m"  # <-- change here
(compilation only for M0+M4)

Another hint:
There are updated versions of gmp, mpc and mpfr (SRC directory).
If you update them, change the corresponding lines in build-common.sh:
Code:
GMP=gmp-6.0.0a  # updated
NEWLIB_NANO=newlib
SAMPLES=samples
LIBELF=libelf-0.8.13
LIBICONV=libiconv-1.14
MPC=mpc-1.0.2  # updated
MPFR=mpfr-3.1.2  # updated
NEWLIB=newlib
ISL=isl-0.11.1
ZLIB=zlib-1.2.8
INSTALLATION=installation
SAMPLES=samples
BUILD_MANUAL=build-manual

CLOOG_PACK=$CLOOG.tar.gz
EXPAT_PACK=$EXPAT.tar.gz
GMP_PACK=$GMP.tar.xz  # updated
LIBELF_PACK=$LIBELF.tar.gz
LIBICONV_PACK=$LIBICONV.tar.gz
MPC_PACK=$MPC.tar.gz  # updated
MPFR_PACK=$MPFR.tar.bz2  # updated
ISL_PACK=$ISL.tar.bz2
ZLIB_PACK=$ZLIB.tar.gz

Last hint: Ubuntu 10.04 LTS is great for compilation.
 
I don't want to say "no", but I absolutely must say "later" on this.

We're very late in the 1.21 beta cycle right now, with a final release of 1.21 only a couple weeks away. Now is not the time to make a change in the toolchain or its libraries.

I'm willing to consider this for 1.22, but it can't happen for 1.21.
 
I did not expect anything else, and it is absolutely ok :)
 