teensy 3 memcpy has gotten slower

manitou

I'm not sure when it happened, but Teensy 3 memcpy() and memset() are slower than they used to be. Here
are some numbers (megabits/sec, mbs) using 1000 ints. For comparison, there are also rates for a for-loop and for DMA.

Code:
        teensy 3/3.1 @96 mhz  LC @48mhz
                 teensy 3     teensy 3.1   LC(beta)
                 then   now   then  now    now
     memcpy       336    92    582   96     42  mbs
     memset       627   125   1333  125     54  mbs
     loop copy    336   311    340  330    136  mbs
     loop set     478   432    485  432    163  mbs
     memcpy32    1391  1391    744  744    615  mbs   DMA
     memset32    1391  1391    744  744    592  mbs   DMA
There used to be an ARM-optimized memcpy in newlib. It had an unrolled loop using ldr.w and str.w.
Teensy 3.1 has fast memory, but its DMA is not as fast as Teensy 3.0's (never explained).
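For anyone reproducing the table, here is a minimal sketch of how a rate in mbs can be derived from a timed copy. It assumes the rate is simply payload bits divided by elapsed microseconds (bits per microsecond equals megabits per second). The function name `rate_mbs` is made up for illustration and is not part of the benchmark sketch used in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert a timed transfer into megabits per second.
 * bits / microseconds == megabits / second, so no extra scaling is needed.
 * Returns 0.0 when elapsed_us is 0 (timer too coarse to measure). */
double rate_mbs(size_t bytes, uint32_t elapsed_us)
{
    if (elapsed_us == 0) {
        return 0.0;
    }
    return (double)bytes * 8.0 / (double)elapsed_us;
}
```

On a Teensy the elapsed time would typically come from two micros() readings taken around the memcpy() call. As a sanity check, 1000 4-byte ints (4000 bytes) copied in 50 us works out to 640 mbs, which matches the Teensy 3.1 figures reported later in this thread.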
 

This is the current memcpy:
Code:
00004788 <memcpy>:
    4788:	b510      	push	{r4, lr}
    478a:	2300      	movs	r3, #0
    478c:	4293      	cmp	r3, r2
    478e:	d003      	beq.n	4798 <memcpy+0x10>
    4790:	5ccc      	ldrb	r4, [r1, r3]
    4792:	54c4      	strb	r4, [r0, r3]
    4794:	3301      	adds	r3, #1
    4796:	e7f9      	b.n	478c <memcpy+0x4>
    4798:	bd10      	pop	{r4, pc}
It's copying byte-wise. There's a very fast variant from the ARM developers in newlib, but it seems that it is not used. Maybe because of size optimization?
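For readers who don't read Thumb assembly, the disassembly above corresponds roughly to the following C. This is a hand-written equivalent for illustration, not the actual newlib-nano source, and `bytewise_memcpy` is a made-up name.

```c
#include <stddef.h>

/* Byte-wise copy mirroring the disassembled loop:
 * one index register, one ldrb and one strb per iteration. */
void *bytewise_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i != n; i++) {  /* cmp r3, r2 ; beq done */
        d[i] = s[i];                   /* ldrb r4, [r1, r3] ; strb r4, [r0, r3] */
    }
    return dst;                        /* r0 is never modified, so dst comes back */
}
```

Every byte costs a load, a store, an add, a compare and a branch, which is why the rate is so far below the word-wise loops in the table.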

https://sourceware.org/ml/newlib/2013/msg00419.html
 
I've attached the file. I'm not sure if it's the newest version.
 

Attachment: memcpy-armv7m.zip (2.9 KB)
Isn't there a special ARM instruction for this, one that doesn't have to loop in asm?
Does the compiler optimization level affect this?
 
No, a loop is needed. But it makes a huge difference whether you copy byte-wise or aligned word-wise (4 bytes at a time). On top of that, the optimized ARM code uses loop unrolling, which gives an additional speedup.
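To make the two ideas concrete, here is a rough C sketch of a word-wise copy with a 4x unrolled main loop. It is not the ARM routine from newlib. It assumes both buffers are already word-aligned (hence the uint32_t pointers) and skips the unaligned head and tail handling a real memcpy needs. The name `copy_words_unrolled` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy nwords 32-bit words between word-aligned buffers. */
void copy_words_unrolled(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    /* Main loop: four word moves per iteration, so the loop
     * overhead (compare and branch) is paid once per 16 bytes. */
    while (nwords >= 4) {
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
        dst[3] = src[3];
        dst += 4;
        src += 4;
        nwords -= 4;
    }
    /* Remainder: zero to three leftover words. */
    while (nwords > 0) {
        *dst++ = *src++;
        nwords--;
    }
}
```

Compared with a byte loop, each load and store moves four times as much data, and the unrolling cuts the number of branches by another factor of four.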

In an old newlib, I found this:
Code:
#if defined(PREFER_SIZE_OVER_SPEED) || defined(__OPTIMIZE_SIZE__)
  const char* OUT_end = (char*)OUT + N;
  while ((char*)OUT < OUT_end) {
    *((char*)OUT) = *((char*)IN);
    OUT++;
    IN++;
  }

  return OUT0;
#else .....

So I think one of the #defines above is set. In my opinion, it's better not to optimize newlib for size; we have so much flash space.

Edit:
Maybe -Os was used during compilation of newlib? __OPTIMIZE_SIZE__ is set with -Os:
https://gcc.gnu.org/onlinedocs/gcc-3.4.2/cpp/Common-Predefined-Macros.html
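A quick way to see how that macro behaves is a tiny function guarded by the same condition as the newlib snippet. This is only a sketch: `built_for_size` is a made-up name, and it reports the flags used for the file it is compiled in, not the flags the prebuilt newlib was built with.

```c
/* Returns 1 if this translation unit was compiled for size
 * (GCC predefines __OPTIMIZE_SIZE__ under -Os, or the build
 * defines PREFER_SIZE_OVER_SPEED itself), otherwise 0. */
int built_for_size(void)
{
#if defined(PREFER_SIZE_OVER_SPEED) || defined(__OPTIMIZE_SIZE__)
    return 1;
#else
    return 0;
#endif
}
```

Compiling the same file once with -Os and once with -O2 should flip the result, which is the switch that selects the byte-wise loop in the snippet above.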
 
The toolchain upgrade from 4.7.2 to 4.8.4 also switched from normal newlib to newlib-nano, which is size optimized.

We probably lost the fast memcpy in the process. Maybe there's some way to get it back, while keeping the rest of newlib-nano?

There's no way I can work on this right now, with Teensy-LC and Arduino 1.6.0 taking all my attention. But I'll put this on my low-priority list to investigate later... maybe MUCH later....
 
Actually, if you're willing to run the benchmarks again, could you try deleting this line from boards.txt:

Code:
teensy31.build.linkoption2=--specs=nano.specs

That should cause normal newlib to be used. You should see your program size grow, especially if using printf().

Does this restore memcpy() to its previously glorious speed?
 
Yep, that fixed it. I commented out the nano.specs line for 3.1, 3.0 and LC. Results:
Code:
     3.1
      1142.86 mbs  7 us   memset
      615.38 mbs  13 us   memcpy 
     3.0
      1000.00 mbs  8 us   memset
      333.33 mbs  24 us   memcpy
   LC
      380.95 mbs  21 us   memset
      235.29 mbs  34 us   memcpy

And for the LC, because of the larger library, I had to reduce the size of the arrays used in the mem* tests.

ARM memcpy notes http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka3934.html
 
But I'll put this on my low-priority list to investigate later... maybe MUCH later....
NO PLEASE!!! Why slowdown ?
Don't do this..

OK, a workaround:
Place this into kinetis.h (inside extern "C"):
Code:
#if defined(__MK20DX128__) || defined(__MK20DX256__)
#include <stdlib.h>
extern void *memcpy (void *dst, const void *src, size_t count);
#endif

Copy the attached assembler file to cores\teensy3

@manitou, maybe you can test this workaround ?
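When swapping in a different memcpy, it is worth checking correctness as well as speed, especially for unaligned pointers and odd lengths. Below is a minimal host-side sketch of such a check. The names `copy_fn` and `check_copy` are hypothetical, and the offset and length ranges are arbitrary choices for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Any function with the memcpy signature. */
typedef void *(*copy_fn)(void *, const void *, size_t);

/* Exercise fn with source/destination misalignments 0..3 and
 * lengths 0..64. Returns 1 if every copy is exact, the return
 * value is the destination, and the guard bytes around the
 * destination are untouched; returns 0 on the first failure. */
int check_copy(copy_fn fn)
{
    unsigned char src[80], dst[80];
    for (size_t i = 0; i < sizeof src; i++) {
        src[i] = (unsigned char)(i * 7u + 1u);
    }
    for (size_t soff = 0; soff < 4; soff++) {
        for (size_t doff = 0; doff < 4; doff++) {
            for (size_t n = 0; n <= 64; n++) {
                unsigned char *d = dst + 4 + doff;
                memset(dst, 0xAA, sizeof dst);
                if (fn(d, src + soff, n) != d) {
                    return 0;
                }
                if (memcmp(d, src + soff, n) != 0) {
                    return 0;
                }
                /* Guard bytes before the copied region. */
                for (size_t i = 0; i < 4 + doff; i++) {
                    if (dst[i] != 0xAA) return 0;
                }
                /* Guard bytes after the copied region. */
                for (size_t i = 4 + doff + n; i < sizeof dst; i++) {
                    if (dst[i] != 0xAA) return 0;
                }
            }
        }
    }
    return 1;
}
```

Calling `check_copy(memcpy)` in a sketch built with the workaround in place would exercise the replacement routine, since the linker resolves memcpy to whichever implementation was linked in.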
 

Attachment: teensy3memcpy.zip (33.5 KB)
Paul, I've never done this before, but if you could give a few hints on how to do it, I could compile newlib-nano 2.1 for you with the faster memcpy, plus a second variant built with -O3 or -Ofast...
 
I'm pretty sure it's not the compiler flags that matter. It's the algorithm. Obviously "nano" is using a small but slower code.

We really do need the smaller nano library for bloated stuff like printf. Maybe memcpy could be brought into the core library? Or maybe it's possible to build newlib nano with the faster memcpy? If it is, I'm pretty sure that won't be as simple as merely recompiling with different optimization flags.
 
So we will have a big list of libs that will all be a bit slower than before with the byte-wise memcpy:
sd.h, cc3000, play_sd_wav, FastSPI_LED, OctoWS2811.....

Code:
Search "memcpy" (178 hits in 55 files)
  C:\Arduino\libraries\Adafruit_CC3000\Adafruit_CC3000.cpp (8 hits)
	Line 660:   memcpy(retip, ipconfig.aucIP, 4);
	Line 661:   memcpy(netmask, ipconfig.aucSubnetMask, 4);
	Line 662:   memcpy(gateway, ipconfig.aucDefaultGateway, 4);
	Line 663:   memcpy(dhcpserv, ipconfig.aucDHCPServer, 4);
	Line 664:   memcpy(dnsserv, ipconfig.aucDNSServer, 4);
	Line 1012:     memcpy(&pingReport, data, length);
	Line 1399:   memcpy(_rx_buf, copy._rx_buf, RXBUFFERSIZE);
	Line 1407:   memcpy(_rx_buf, other._rx_buf, RXBUFFERSIZE);
  C:\Arduino\libraries\Adafruit_CC3000\examples\InternetTime\InternetTime.ino (2 hits)
	Line 256:       memcpy_P( buf    , timeReqA, sizeof(timeReqA));
	Line 257:       memcpy_P(&buf[12], timeReqB, sizeof(timeReqB));
  C:\Arduino\libraries\Adafruit_CC3000\examples\SendTweet\SendTweet.ino (2 hits)
	Line 425:       memcpy_P( buf    , timeReqA, sizeof(timeReqA));
	Line 426:       memcpy_P(&buf[12], timeReqB, sizeof(timeReqB));
  C:\Arduino\libraries\Adafruit_CC3000\utility\evnt_handler.cpp (5 hits)
	Line 366: 							memcpy((UINT8 *)pRetParams, 
	Line 414: 						memcpy((UINT8 *)pRetParams, pucReceivedParams, 4);
	Line 427: 						memcpy((UINT8 *)pRetParams, (CHAR *)(pucReceivedParams + GET_SCAN_RESULTS_FRAME_TIME_OFFSET + 2), GET_SCAN_RESULTS_SSID_MAC_LENGTH);	
	Line 482: 					memcpy(from, (pucReceivedData + HCI_DATA_HEADER_SIZE + BSD_RECV_FROM_FROM_OFFSET) ,*fromlen);
	Line 485: 				memcpy(pRetParams, pucReceivedParams + HCI_DATA_HEADER_SIZE + ucArgsize,
  C:\Arduino\libraries\Adafruit_CC3000\utility\hci.cpp (3 hits)
	Line 187: 		memcpy((pucBuff + SPI_HEADER_SIZE) + HCI_PATCH_HEADER_SIZE, patch, usDataLength);
	Line 198: 		memcpy(pucBuff + SPI_HEADER_SIZE + HCI_PATCH_HEADER_SIZE, patch, SL_PATCH_PORTION_SIZE);
	Line 225: 			memcpy(data_ptr + SIMPLE_LINK_HCI_PATCH_HEADER_SIZE, patch, usTransLength);
  C:\Arduino\libraries\Adafruit_CC3000\utility\nvmem.cpp (2 hits)
	Line 170: 	memcpy((ptr + SPI_HEADER_SIZE + HCI_DATA_CMD_HEADER_SIZE + 
	Line 287: 	  memcpy_P(rambuffer, spDataPtr, SP_PORTION_SIZE);
  C:\Arduino\libraries\Adafruit_CC3000\utility\sntp.cpp (1 hit)
	Line 410: 	memcpy(&(socketAddr.sin_addr), ntpServerAddr, 4);
  C:\Arduino\libraries\Adafruit_CC3000\utility\socket.cpp (16 hits)
	Line 356: 	if (addr) memcpy(addr, &tAcceptReturnArguments.tSocketAddress, ASIC_ADDR_LEN);
	Line 683: 			memcpy(readsds, &tParams.uiRdfd, sizeof(tParams.uiRdfd));
	Line 688: 			memcpy(writesds, &tParams.uiWrfd, sizeof(tParams.uiWrfd)); 
	Line 693: 			memcpy(exceptsds, &tParams.uiExfd, sizeof(tParams.uiExfd)); 
	Line 872: 		memcpy(optval, tRetParams.ucOptValue, 4);
	Line 1280: 	memcpy(mdnsResponsePtr, "_device-info", 12);   // _device-info
	Line 1283: 	memcpy(mdnsResponsePtr, "_udp", 4);	           // _udp
	Line 1286: 	memcpy(mdnsResponsePtr, "local", 5);	       // local
	Line 1300: 	memcpy(mdnsResponsePtr, "_services", 9);	       // _services
	Line 1303: 	memcpy(mdnsResponsePtr, "_dns-sd", 7);	           // _dns-sd
	Line 1306: 	memcpy(mdnsResponsePtr, "_udp", 4);	               // _udp
	Line 1309: 	memcpy(mdnsResponsePtr, "local", 5);	           // local
	Line 1335: 	memcpy(mdnsResponsePtr, "dev=CC3000", 10);	       // _device-info
	Line 1338: 	memcpy(mdnsResponsePtr, "vendor=Texas-Instruments", 24);	// _udp
	Line 1405: 	memcpy(mdnsResponsePtr, deviceServiceName, device_name_len);
	Line 1455: 	memcpy(mdnsResponsePtr,
  C:\Arduino\libraries\Adafruit_nRF8001\Adafruit_BLE_UART.cpp (1 hit)
	Line 304:     memcpy(device_name, deviceName, strlen(deviceName));
  C:\Arduino\libraries\Adafruit_nRF8001\utility\acilib.cpp (15 hits)
	Line 58:   memcpy(buffer + OFFSET_ACI_CMD_T_SET_LOCAL_DATA + OFFSET_ACI_CMD_PARAMS_SEND_DATA_T_TX_DATA + OFFSET_ACI_TX_DATA_T_ACI_DATA,  &(p_aci_cmd_params_set_local_data->tx_data.aci_data[0]), data_size);
	Line 124:   memcpy((buffer + OFFSET_ACI_CMD_T_SEND_DATA + OFFSET_ACI_CMD_PARAMS_SEND_DATA_T_TX_DATA + OFFSET_ACI_TX_DATA_T_ACI_DATA), &(p_aci_cmd_params_send_data_t->tx_data.aci_data[0]), data_size);
	Line 152:   memcpy((buffer + OFFSET_ACI_CMD_T_ECHO + OFFSET_ACI_CMD_PARAMS_ECHO_T_ECHO_DATA), &(p_cmd_params_echo->echo_data[0]), msg_size);
	Line 178:   memcpy((buffer + OFFSET_ACI_CMD_T_WRITE_DYNAMIC_DATA + OFFSET_ACI_CMD_PARAMS_WRITE_DYNAMIC_DATA_T_DYNAMIC_DATA), dynamic_data, dynamic_data_size);
	Line 215:   memcpy((buffer + OFFSET_ACI_CMD_T_SETUP), &(p_aci_cmd_params_setup->setup_data[0]), setup_data_size);
	Line 261:   memcpy(buffer + OFFSET_ACI_CMD_T_OPEN_ADV_PIPE + OFFSET_ACI_CMD_PARAMS_OPEN_ADV_PIPE_T_PIPES, p_aci_cmd_params_open_adv_pipe->pipes, 8);
	Line 289:   memcpy((buffer + OFFSET_ACI_CMD_T_SET_KEY + OFFSET_ACI_CMD_PARAMS_SET_KEY_T_PASSKEY), (uint8_t * )&(p_aci_cmd_params_set_key->key), len-2);//Reducing 2 for the opcode byte and type
	Line 405:       memcpy((uint8_t *)(p_device_address->bd_addr_own), (buffer_in + OFFSET_ACI_EVT_T_CMD_RSP+OFFSET_ACI_EVT_PARAMS_CMD_RSP_T_GET_DEVICE_ADDRESS+OFFSET_ACI_EVT_CMD_RSP_PARAMS_GET_DEVICE_ADDRESS_T_BD_ADDR_OWN), BTLE_DEVICE_ADDRESS_SIZE);
	Line 424:       memcpy((uint8_t *)(p_read_dyn_data->dynamic_data), (buffer_in + OFFSET_ACI_EVT_T_CMD_RSP + OFFSET_ACI_EVT_PARAMS_CMD_RSP_T_READ_DYNAMIC_DATA + OFFSET_ACI_CMD_PARAMS_WRITE_DYNAMIC_DATA_T_DYNAMIC_DATA), ACIL_DECODE_EVT_GET_LENGTH(buffer_in) - 3); // 3 bytes subtracted account for EventCode, CommandOpCode and Status bytes.
	Line 445:   memcpy((uint8_t *)p_aci_evt_params_pipe_status->pipes_open_bitmap, (buffer_in + OFFSET_ACI_EVT_T_PIPE_STATUS + OFFSET_ACI_EVT_PARAMS_PIPE_STATUS_T_PIPES_OPEN_BITMAP), 8);
	Line 446:   memcpy((uint8_t *)p_aci_evt_params_pipe_status->pipes_closed_bitmap, (buffer_in + OFFSET_ACI_EVT_T_PIPE_STATUS + OFFSET_ACI_EVT_PARAMS_PIPE_STATUS_T_PIPES_CLOSED_BITMAP), 8);
	Line 469:   memcpy((uint8_t *)p_evt_params_data_received->rx_data.aci_data, (buffer_in + OFFSET_ACI_EVT_T_DATA_RECEIVED + OFFSET_ACI_RX_DATA_T_ACI_DATA), size);
	Line 483:   memcpy((uint8_t *)p_aci_evt_params_hw_error->file_name, (buffer_in + OFFSET_ACI_EVT_T_HW_ERROR + OFFSET_ACI_EVT_PARAMS_HW_ERROR_T_FILE_NAME), size);
	Line 495:   memcpy(&(p_aci_evt_params_connected->dev_addr[0]), (buffer_in + OFFSET_ACI_EVT_T_CONNECTED + OFFSET_ACI_EVT_PARAMS_CONNECTED_T_DEV_ADDR), BTLE_DEVICE_ADDRESS_SIZE);
	Line 532:   memcpy(&aci_evt_params_echo->echo_data[0], (buffer_in + OFFSET_ACI_EVT_T_EVT_OPCODE + 1), size);
  C:\Arduino\libraries\Adafruit_nRF8001\utility\aci_setup.cpp (1 hit)
	Line 48:     memcpy_P(&aci_cmd, &(aci_stat->aci_setup_info.setup_msgs[num_cmd_offset+i]), 
  C:\Arduino\libraries\Adafruit_nRF8001\utility\hal_aci_tl.cpp (2 hits)
	Line 132:   memcpy((uint8_t *)&(aci_q->aci_data[aci_q->tail].buffer[0]), (uint8_t *)&p_data->buffer[0], length + 1);
	Line 147:   memcpy((uint8_t *)p_data, (uint8_t *)&(aci_q->aci_data[aci_q->head]), sizeof(hal_aci_data_t));
  C:\Arduino\libraries\Adafruit_nRF8001\utility\lib_aci.cpp (4 hits)
	Line 236:   memcpy(&(aci_cmd_params_set_local_data.tx_data.aci_data[0]), p_value, size);
	Line 341:       memcpy(&(aci_cmd_params_send_data.tx_data.aci_data[0]), p_value, size);
	Line 454:   memcpy((uint8_t*)&(aci_cmd_params_set_key.key), key, len);
	Line 473:   memcpy(&(aci_cmd_params_echo.echo_data[0]), p_msg_data, msg_size);
  C:\Arduino\libraries\Audio\examples\Recorder\Recorder.ino (2 hits)
	Line 134:     memcpy(buffer, queue1.readBuffer(), 256);
	Line 136:     memcpy(buffer+256, queue1.readBuffer(), 256);
  C:\Arduino\libraries\Audio\examples\SamplePlayer\wav2sketch\wav2sketch.exe (1 hit)
	Line 55: u
  C:\Arduino\libraries\Audio\examples\SamplePlayerSerialFlash\wav2sketch\wav2raw.exe (1 hit)
	Line 267: _iob
  C:\Arduino\libraries\Audio\play_sd_wav.cpp (3 hits)
	Line 206: 		memcpy((uint8_t *)header + header_offset, p, len);
	Line 235: 		memcpy((uint8_t *)header + header_offset, p, len);
	Line 256: 		memcpy((uint8_t *)header + header_offset, p, len);
  C:\Arduino\libraries\Audio\play_sd_wav.cpp.bak (3 hits)
	Line 207: 		memcpy((uint8_t *)header + header_offset, p, len);
	Line 236: 		memcpy((uint8_t *)header + header_offset, p, len);
	Line 257: 		memcpy((uint8_t *)header + header_offset, p, len);
  C:\Arduino\libraries\Bridge\src\Console.cpp (1 hit)
	Line 58:     memcpy(tmp + 1, buff, size);
  C:\Arduino\libraries\Entropy\Entropy.cpp (1 hit)
	Line 208:   memcpy((void *) &fRetVal, (void *) &tmp_random, sizeof(fRetVal));
  C:\Arduino\libraries\Ethernet\Dhcp.cpp (6 hits)
	Line 25:     memcpy((void*)_dhcpMacAddr, (void*)mac, 6);
	Line 154:     memcpy(buffer + 4, &(xid), 4);
	Line 162:     memcpy(buffer + 10, &(flags), 2);
	Line 174:     memcpy(buffer, _dhcpMacAddr, 6); // chaddr
	Line 203:     memcpy(buffer + 10, _dhcpMacAddr, 6);
	Line 282:         memcpy(_dhcpLocalIp, fixedMsg.yiaddr, 4);
  C:\Arduino\libraries\FastLED\lib8tion.cpp (5 hits)
	Line 7: // memset8, memcpy8, memmove8:
	Line 9: //  routines memset, memcpy, and memmove.
	Line 55: void * memcpy8 ( void * dst, void* src, uint16_t num )
	Line 86:         // if src > dst then we can use the forward-stepping memcpy8
	Line 87:         return memcpy8( dst, src, num);
  C:\Arduino\libraries\FastLED\lib8tion.h (8 hits)
	Line 131:  - Optimized memmove, memcpy, and memset, that are
	Line 134:       memcpy8(  dest, src,  bytecount)
	Line 150: // for memmove, memcpy, and memset if not defined here
	Line 1300: // memmove8, memcpy8, and memset8:
	Line 1301: //   alternatives to memmove, memcpy, and memset that are
	Line 1307: void * memcpy8 ( void * dst, const void * src, uint16_t num )  __attribute__ ((noinline));
	Line 1313: #define memcpy8 memcpy
	Line 1313: #define memcpy8 memcpy
  C:\Arduino\libraries\FastSPI_LED\FastSPI_LED.h (1 hit)
	Line 96:   void setRGBData(unsigned char *rgbData) { memcpy(m_pData,rgbData,m_nLeds); m_nDirty=1;}
  C:\Arduino\libraries\OctoWS2811\examples\VideoSDcard\VideoSDcard.ino (1 hit)
	Line 116:     memcpy(dest, buffer + bufpos, n);
  C:\Arduino\libraries\OctoWS2811\OctoWS2811.cpp (1 hit)
	Line 197: 		memcpy(frameBuffer, drawBuffer, stripLen * 24);
  C:\Arduino\libraries\OSC\OSCBundle.cpp (1 hit)
	Line 300:                 memcpy(&msgSize, incomingBuffer, 4);
  C:\Arduino\libraries\OSC\OSCBundle.h (1 hit)
	Line 117:         memcpy(&timetag, buff, 8);
  C:\Arduino\libraries\OSC\OSCData.cpp (4 hits)
	Line 94:             memcpy(mem, lenPtr, 4);
	Line 96:             memcpy(mem + 4, b, len);
	Line 117:             memcpy(mem, datum->data.b, bytes);
	Line 196:         memcpy(blobBuffer, data.b, bytes);
  C:\Arduino\libraries\OSC\OSCMessage.cpp (5 hits)
	Line 504:                         memcpy(u.b, incomingBuffer, 4);
	Line 517:                         memcpy(u.b, incomingBuffer, 4);
	Line 530:                         memcpy(u.b, incomingBuffer, 8);
	Line 543:                         memcpy(u.b, incomingBuffer, 8);
	Line 565:                         memcpy(u.b, incomingBuffer, 4);
  C:\Arduino\libraries\RadioHead\RadioHead.h (4 hits)
	Line 425:  #define memcpy_P memcpy
	Line 425:  #define memcpy_P memcpy
	Line 434:  #define memcpy_P memcpy
	Line 434:  #define memcpy_P memcpy
  C:\Arduino\libraries\RadioHead\RHMesh.cpp (2 hits)
	Line 46:     memcpy(a->data, buf, len);
	Line 174: 	    memcpy(buf, a->data, *len);
  C:\Arduino\libraries\RadioHead\RHRouter.cpp (3 hits)
	Line 100:     memcpy(&_routes[index], &_routes[index+1], 
	Line 172:     memcpy(_tmpMessage.data, buf, len);
	Line 274: 	    memcpy(buf, _tmpMessage.data, *len);
  C:\Arduino\libraries\RadioHead\RH_ASK.cpp (2 hits)
	Line 46:     memcpy(_txBuf, preamble, sizeof(preamble));
	Line 344: 	memcpy(buf, _rxBuf+RH_ASK_HEADER_LEN+1, *len);
  C:\Arduino\libraries\RadioHead\RH_NRF24.cpp (2 hits)
	Line 178:     memcpy(_buf+RH_NRF24_HEADER_LEN, data, len);
	Line 291: 	memcpy(buf, _buf+RH_NRF24_HEADER_LEN, *len);
  C:\Arduino\libraries\RadioHead\RH_NRF905.cpp (2 hits)
	Line 148:     memcpy(_buf+RH_NRF905_HEADER_LEN, data, len);
	Line 244: 	memcpy(buf, _buf+RH_NRF905_HEADER_LEN, *len);
  C:\Arduino\libraries\RadioHead\RH_RF22.cpp (3 hits)
	Line 482:     memcpy_P(&cfg, &MODEM_CONFIG_TABLE[index], sizeof(RH_RF22::ModemConfig));
	Line 525: 	memcpy(buf, _buf, *len);
	Line 588:     memcpy(_buf + _bufLen, data, len);
  C:\Arduino\libraries\RadioHead\RH_RF69.cpp (2 hits)
	Line 388:     memcpy_P(&cfg, &MODEM_CONFIG_TABLE[index], sizeof(RH_RF69::ModemConfig));
	Line 444: 	memcpy(buf, _buf, *len);
  C:\Arduino\libraries\RadioHead\RH_RF95.cpp (2 hits)
	Line 213: 	memcpy(buf, _buf+RH_RF95_HEADER_LEN, *len);
	Line 335:     memcpy_P(&cfg, &MODEM_CONFIG_TABLE[index], sizeof(RH_RF95::ModemConfig));
  C:\Arduino\libraries\RadioHead\RH_Serial.cpp (1 hit)
	Line 156: 	memcpy(buf, _rxBuf+RH_SERIAL_HEADER_LEN, *len);
  C:\Arduino\libraries\RadioHead\RH_TCP.cpp (4 hits)
	Line 166: 			memcpy(_rxBuf, packet->payload, payloadLen);
	Line 174: 		memcpy(socketBuf, socketBuf + messageLen, sizeof(socketBuf) - messageLen);
	Line 215: 	memcpy(buf, _rxBuf, *len);
	Line 262:     memcpy(m.payload, data, len);
  C:\Arduino\libraries\RadioHead\STM32ArduinoCompat\wirish.h (2 hits)
	Line 13: #define memcpy_P memcpy
	Line 13: #define memcpy_P memcpy
  C:\Arduino\libraries\Robot_Control\EasyTransfer2.cpp (1 hit)
	Line 123:         //memcpy(data,d,rx_len);
  C:\Arduino\libraries\Robot_Control\Fat16.cpp (5 hits)
	Line 173:   memcpy(dir, p, sizeof(dir_t));
	Line 383:   memcpy(p->name, dname, 11);
	Line 580:     memcpy(dst, src, n);
	Line 631:   memcpy(dir, p, sizeof(dir_t));
	Line 918:     memcpy(dst, src, n);
  C:\Arduino\libraries\Robot_Motor\EasyTransfer2.cpp (1 hit)
	Line 123:         //memcpy(data,d,rx_len);
  C:\Arduino\libraries\SD\File.cpp (1 hit)
	Line 25:     memcpy(_file, &f, sizeof(SdFile));
  C:\Arduino\libraries\SD\utility\SdFile.cpp (5 hits)
	Line 167:   memcpy(dir, p, sizeof(dir_t));
	Line 308:   memcpy(&d, p, sizeof(d));
	Line 317:   memcpy(&SdVolume::cacheBuffer_.dir[0], &d, sizeof(d));
	Line 329:   memcpy(&SdVolume::cacheBuffer_.dir[1], &d, sizeof(d));
	Line 440:   memcpy(p->name, dname, 11);
  C:\Arduino\libraries\SdFat\examples\SdFormatter\SdFormatter.ino (4 hits)
	Line 283:   memcpy(pb->volumeLabel, noName, sizeof(pb->volumeLabel));
	Line 284:   memcpy(pb->fileSystemType, fat16str, sizeof(pb->fileSystemType));
	Line 355:   memcpy(pb->volumeLabel, noName, sizeof(pb->volumeLabel));
	Line 356:   memcpy(pb->fileSystemType, fat32str, sizeof(pb->fileSystemType));
  C:\Arduino\libraries\SdFat\utility\FatFile.cpp (10 hits)
	Line 173:   memcpy(dst, dir, sizeof(dir_t));
	Line 380:   memcpy(&dot, dir, sizeof(dot));
	Line 394:   memcpy(&pc->dir[0], &dot, sizeof(dot));
	Line 400:   memcpy(&pc->dir[1], &dot, sizeof(dot));
	Line 806:       memcpy(dst, src, n);
	Line 949:   memcpy(&entry, dir, sizeof(entry));
	Line 981:   memcpy(&dir->creationTimeTenths, &entry.creationTimeTenths,
	Line 994:     memcpy(&entry, &pc->dir[1], sizeof(entry));
	Line 1008:     memcpy(&pc->dir[1], &entry, sizeof(entry));
	Line 1471:       memcpy(dst, src, n);
  C:\Arduino\libraries\SdFat\utility\FatFileLFN.cpp (1 hit)
	Line 465:   memcpy(dir->name, fname->sfn, 11);
  C:\Arduino\libraries\SdFat\utility\FatFileSFN.cpp (1 hit)
	Line 211:   memcpy(dir->name, fname->sfn, 11);
  C:\Arduino\libraries\SdFat\utility\StdioStream.cpp (8 hits)
	Line 73:       memcpy(s, m_p, n);
	Line 79:     memcpy(s, m_p, n);
	Line 170:     memcpy(dst, m_p, m_r);
	Line 178:   memcpy(dst, m_p, need);
	Line 250:     memcpy(m_p, src, m_w);
	Line 258:   memcpy(m_p, src, todo);
	Line 270:     memcpy(m_p, src, m_w);
	Line 278:   memcpy(m_p, src, todo);
  C:\Arduino\libraries\VirtualWire\VirtualWire.cpp (1 hit)
	Line 570:     memcpy(buf, vw_rx_buf + 1, *len);
  C:\Arduino\libraries\WiFi\WiFi.cpp (2 hits)
	Line 161: 	memcpy(mac, _mac, WL_MAC_ADDR_LENGTH);
	Line 194: 	memcpy(bssid, _bssid, WL_MAC_ADDR_LENGTH);
  C:\Arduino\libraries\Wire\Wire.cpp (2 hits)
	Line 25: #include <string.h> // for memcpy
	Line 287: 		memcpy(txBuffer + txBufferLength, data, quantity);
 
i could figure this out

Good. I'm happy to change command line setting, or even make a custom build of newlib-nano.

I simply don't have the time right now (or anytime soon) to figure out the process. Supporting Arduino 1.6.0 and a long list of stuff to do for Teensy-LC is taking all my time right now.
 
Ok, looking @post 9 takes only 3 minutes :)

And... can you please upload your configured copy of newlib-nano to GitHub?
 
Is that really the code from newlib?

From the comments in the code, it looks like it came from ARM, not from the newlib project.
 
yes: https://github.com/FrankBoesing/newlib-nano-2/blob/master/newlib/libc/machine/arm/memcpy-armv7m.S

this was the patch:
http://sourceware.org/ml/newlib/2013/msg00420.html

https://sourceware.org/ml/newlib/2013/txtrnAJb90Atn.txt

On 03/06/13 07:07, Joey Ye wrote:
> Memcpy in assembly tuned for ARM Cortex-M3/Cortex-M4, with tremendous
> speed-up comparing to C implementation. For aligned copy it is 45% faster in
> average. For unaligned copy it is up to 8x faster.
>
> Tested on Cortex-M3/Cortex-M4 boards and qemu. No regression.
 
@manitou, maybe you can test this workaround ?

OK, I re-enabled nano in boards.txt, added the .S to cores/teensy3, and edited kinetis.h (confirmed memcpy-armv7m.S.o in the /tmp build area). The mods don't consume much space, as noted below, and run a bit faster than no-nano.
(I also allowed the LC to use the .S via __MKL26Z64__, with smaller test vectors, but the .S is apparently not applicable for the LC.)

Code:
             memcpy              flash/ram KB
teensy      .S mbs    .S           nano          no-nano
3.1          640      13.6/10.4   13.3/10.4     15.9/12.6
3.0          356      13.2/10.2   12.9/10.2     15.6/12.3
LC            42      18.9/5.4    18.9/5.4       21.4/7.5
 
That's interesting. Where can I find your benchmark, or can you upload it? Maybe we have to change a bit more.
 
Ok..but this workaround is not for DMA. DMA should be unchanged, it's hardware..

Yes, I understand that. I was just explaining that the benchmark sketch was intended for testing DMA, that memcpy/memset were there for baseline comparisons, and that it was the sketch I was using for the numbers provided.
 
I get 335 us with Paul's byte-copy version and 50 us with my workaround on Teensy 3.1 at 96 MHz (with your sketch). That's 6.7 times faster.
 
I just ran the benchmark on a Teensy 3.1 and got these numbers:

nano:

memcpy 334 us
memset 251 us
code size: 10,676
ram size: 10,432

newlib:

memcpy 50 us
memset 25 us
code size: 15,456
ram size: 12,600

nano with memcpy.S

memcpy 50 us
memset 251 us
code size: 10,972
ram size: 10,432

Only 300 bytes seems like a good trade-off for a tremendous speedup in such a commonly used function.

Obviously the rest of newlib has a lot of speed optimized stuff, which enlarges even this very simple program by 50% and uses 2K extra RAM.

I believe I'll leave the other line commented out in boards.txt, so if anyone wants to push their project to higher performance, they can just uncomment 1 line to switch to the speed optimized newlib.
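For reference, the trade-off in the three Teensy 3.1 builds above can be worked out with a few lines of C. The numbers are copied from the measurements in this post; the struct and function names are made up for this sketch.

```c
/* One row per build measured on Teensy 3.1. */
struct build {
    const char *name;
    unsigned memcpy_us;  /* time for the memcpy test */
    unsigned flash;      /* code size in bytes */
    unsigned ram;        /* RAM size in bytes */
};

static const struct build builds[] = {
    { "nano",            334, 10676, 10432 },  /* baseline */
    { "newlib",           50, 15456, 12600 },
    { "nano + memcpy.S",  50, 10972, 10432 },
};

/* Extra flash in bytes relative to the plain nano build. */
int flash_cost(const struct build *b)
{
    return (int)b->flash - (int)builds[0].flash;
}

/* Extra RAM in bytes relative to the plain nano build. */
int ram_cost(const struct build *b)
{
    return (int)b->ram - (int)builds[0].ram;
}

/* memcpy speedup relative to the plain nano build. */
double memcpy_speedup(const struct build *b)
{
    return (double)builds[0].memcpy_us / (double)b->memcpy_us;
}
```

With these figures, nano plus memcpy.S costs 296 bytes of flash and no extra RAM for about a 6.7x memcpy speedup, while full newlib costs 4780 bytes of flash and 2168 bytes of RAM for the same memcpy time.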
 