    T4.1 Data Access Violation Crash

    I'm currently rebuilding NativeEthernet from scratch so that it's fully non-blocking for better performance. While working on the TCP server code, I've run into an annoying crash I can't figure out. The only consistent thing is the type of crash reported, a Data Access Violation, but I can't tell why it's happening.

    The new library is a bit more complex in that you don't have to manually poll the server for connected clients or keep track of them yourself. All connected clients and their socket buffers are allocated from FNET's heap and linked to the server through a pointer list. I've kept careful track of what's allocated and deallocated to make sure it isn't running out of memory, and FNET has functions to check its memory, so I know the heap is not full whenever this crash happens. It also isn't related to cache memory: it happens no matter where I allocate FNET's stack (DTCM, DMAMEM, or EXTMEM), and it doesn't take any longer to happen when I give it more memory.
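    Tracking allocations can also catch a double free, or a free of a pointer that was never handed out; either can corrupt a heap's bookkeeping and later show up as random, changing addresses like the ones in the reports below. This is a debug-only sketch under stated assumptions: `AllocTracker` is a hypothetical name, and `std::malloc`/`std::free` stand in for FNET's heap functions, which would be swapped in on the target.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <unordered_set>

// Debug-only wrapper around an allocator. std::malloc/std::free are
// stand-ins for FNET's heap calls (an assumption; swap in the real ones).
class AllocTracker {
public:
    void* alloc(std::size_t size) {
        void* p = std::malloc(size);
        if (p != nullptr) {
            live_.insert(p);   // remember every block handed out
        }
        return p;
    }

    // Returns false, and does NOT free, if p was never allocated here or
    // was already released: that is a double free or a wild free.
    bool release(void* p) {
        auto it = live_.find(p);
        if (it == live_.end()) {
            return false;
        }
        live_.erase(it);
        std::free(p);
        return true;
    }

    // Number of blocks currently allocated and not yet released.
    std::size_t live_count() const { return live_.size(); }

private:
    std::unordered_set<void*> live_;
};
```

    On a Teensy, a fixed-size array of pointers would avoid pulling in `std::unordered_set`; the set is used here only to keep the sketch short.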

    I use Apache Benchmark to stress test the server, and the crash happens after a random number of clients each time: sometimes fewer than 2,000, sometimes not until more than 20,000 connections have been made.

    This CrashReport comes from this code: "if (client_list->object)"
      A problem occurred at (system time) 18:57:44
      Code was executing from address 0x992
      CFSR: 82
    	(DACCVIOL) Data Access Violation
    	(MMARVALID) Accessed Address: 0xDF08C8F7          //The accessed address always changes between crashes
      Temperature inside the chip was 59.68 °C
      Startup CPU clock speed is 600MHz
      Reboot was caused by auto reboot after fault or bad interrupt detected
    This CrashReport comes from this code: "switch (tcp_ptr->tcp_state)"
      A problem occurred at (system time) 19:0:1
      Code was executing from address 0x5F4
      CFSR: 82
    	(DACCVIOL) Data Access Violation
    	(MMARVALID) Accessed Address: 0xF43F0F60          //The accessed address always changes between crashes
      Temperature inside the chip was 59.68 °C
      Startup CPU clock speed is 600MHz
      Reboot was caused by auto reboot after fault or bad interrupt detected
    I'll try to explain these short pieces of code so it's clearer what happens in them and afterwards.
    client_list is the pointer list of connected clients; object is a pointer to a specific client's object.
    After passing the if check, the object is passed to a function where it's named tcp_ptr. tcp_state is just a normal uint32_t.

    So object and tcp_ptr should reference the same object when it gets passed to the function. object is assigned in the Client class constructor, and the Client is added to the list only if it was created successfully. Everything in the list has a valid address before it's even added, because all the pointers and mallocs are checked before the Client is constructed with placement new. I also know there is nothing wrong with the way Clients are added to and removed from the pointer list, since it's just a copy of the code FNET uses for its own pointer lists throughout the library.

    Does anyone have any insight on how these pointers could be pointing to wrong addresses despite there being numerous checks in place and the fact that it happens in 1 out of thousands of iterations of the same code?

    > 1 out of thousands of iterations of the same code?

    Could be a race condition.

    You need a debugger. Is TeensyDebug good enough?

    I've never used it before, so I don't know what it's capable of; my debugging experience only covers the serial monitor.

    Accessed Address: 0xDF08C8F
    switch (tcp_ptr->tcp_state)
    looks like tcp_ptr is somehow wrong, sometimes.

    Quote Originally Posted by Frank B
    looks like tcp_ptr is somehow wrong, sometimes.
    That's the confusing part: it gets past multiple null-pointer checks after the socket memory and the client object are allocated. None of them fail, and the pointers are set to null when created, so it's not as if a client failed to be created and left wild pointers behind. On top of that, every client is created through exactly the same code, yet they don't all crash like this.

    Perhaps it gets overwritten by a buffer overflow (an array index out of bounds). I'd take a look at the map file and look for arrays placed near it.

    I wouldn't rule it out, but the current code doesn't have any arrays near it or anything that should overflow as far as I can tell.
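    One way to test the overflow theory without relying on the map file is to surround suspect buffers with guard bytes and check them periodically. This is a minimal sketch; `GuardedBuffer`, the 8-byte guard size, and the `0xA5` fill pattern are arbitrary choices for illustration, not anything from NativeEthernet or FNET. The extra storage means an overrun lands in the guard region, where `intact()` can see it, instead of in whatever object happens to sit next in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Debug sketch: N usable bytes with G guard bytes on each side. A write that
// runs past the usable area lands in the guard bytes (still inside raw_),
// so the overflow is detectable instead of silently corrupting a neighbour.
template <std::size_t N, std::size_t G = 8>
class GuardedBuffer {
public:
    static constexpr std::uint8_t kPattern = 0xA5;

    GuardedBuffer() {
        std::memset(raw_, kPattern, sizeof(raw_));   // fill guards (and data)
    }

    // Start of the usable area, between the two guard regions.
    std::uint8_t* data() { return raw_ + G; }

    // True while no guard byte on either side has been modified.
    bool intact() const {
        for (std::size_t i = 0; i < G; ++i) {
            if (raw_[i] != kPattern || raw_[G + N + i] != kPattern) {
                return false;
            }
        }
        return true;
    }

private:
    std::uint8_t raw_[G + N + G];
};
```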

    I'm not really sure what I should be looking for in the map file.

    Here's more of the relevant code.

    Here's where the Clients are made and added to the list:
    uint8_t* new_client_memory = (uint8_t*) _fnet_malloc_netbuf(sizeof(EthernetClient));
    if (new_client_memory == NULL) {
        Serial.println("Failed to allocate client!");
        while (1) {}   // halt so an allocation failure is impossible to miss
    }
    EthernetClient* new_client = new (new_client_memory) EthernetClient(accepted_socket);   // placement new into the checked buffer
    new_client->client_memory = new_client_memory;
    new_client->init(server_ptr->socket_recv_size, server_ptr->socket_send_size);
    object_queue_add(&server_ptr->client_list, &(new_client->client_id));
    Here's where the list is looped through and processed:
          object_chain_t *client_list = NULL;
          EthernetClient *client = NULL;
          client_list = server_ptr->client_list;
          if(client_list == 0)
    //        Serial.println("Client List empty!");
    //        Serial.println("Handle Client chain!");
            client = (EthernetClient*) client_list->object;
            if(client->tcp_state == EthernetClient::TCP_STATE_DISABLED){
              //Deallocate Clients
              object_queue_del(&server_ptr->client_list, &(client->client_id));
              fnet_socket_listen(server_ptr->tcp_socket, ++server_ptr->backlog);
              client_list = client_list->next_chain;
              if(client_list->object){                                                      //Here's where the crash happens sometimes
                client = (EthernetClient*) client_list->object;
                if(client->tcp_state == EthernetClient::TCP_STATE_DISABLED){
                  //Deallocate Clients
                  object_queue_del(&server_ptr->client_list, &(client->client_id));
                  fnet_socket_listen(server_ptr->tcp_socket, ++server_ptr->backlog);
                object_queue_del(&server_ptr->client_list, client_list);
    //        Serial.println("Handle Client chain done!");
    Here's the beginning of the client process function:
    void EthernetClient::tcp_client_poll(EthernetClient *client){
      if(client == NULL) return;
      EthernetClient* tcp_ptr = client;
      switch (tcp_ptr->tcp_state) {                                                    //Here's where the crash happens sometimes
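    In the chain loop above, `client_list->next_chain` is read after `object_queue_del()` has removed the current client from the list, and the result is then dereferenced in `if(client_list->object)`, one of the lines that sometimes crashes. The thread doesn't show what `object_queue_del()` does with the chain node, so this is an assumption: if the node is freed or reused during that call, the later read is a use-after-free, and a faulting address that changes on every crash would be consistent with that. The usual pattern is to save the next pointer before unlinking anything. `Client`, `Chain`, and `remove_disabled()` are hypothetical stand-ins, not FNET's types.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for EthernetClient and object_chain_t.
struct Client {
    bool disabled = false;
};

struct Chain {
    Client* object;
    Chain* next_chain;
};

// Removes (and frees) every disabled client while walking the list.
// The key step: next_chain is read BEFORE the node is unlinked and deleted,
// so the loop never touches memory that has already been released.
void remove_disabled(Chain** head) {
    Chain** link = head;                      // pointer that points at `node`
    while (*link != nullptr) {
        Chain* node = *link;
        Chain* next = node->next_chain;       // saved while `node` is valid
        if (node->object != nullptr && node->object->disabled) {
            *link = next;                     // unlink
            delete node->object;
            delete node;                      // `node` is not used after this
        } else {
            link = &node->next_chain;         // advance only past kept nodes
        }
    }
}
```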

    The ordering of variables in memory is pretty random. The problematic code (if there is any) doesn't need to be near the switch; it can be anywhere.
    Can you post the *.sym file? It's in the temp directory where Arduino builds the hex file.

    Here you go.
    Hm, there is no tcp_ptr mentioned :-(

    It's a variable in this function here
    000005e8 g     F .text.itcm	00000124 EthernetClient::tcp_client_poll(EthernetClient*)
    void EthernetClient::tcp_client_poll(EthernetClient *client){
      if(client == NULL) return;
      EthernetClient* tcp_ptr = client;

    Quote Originally Posted by vjmuzik
    It's a variable in this function here
    Oh... it's on the stack, then. Difficult. Yes, in this case the map doesn't help.

    Is "client" valid in every case? Sorry, I don't think I can help much.

    It should be valid, it's at least valid thousands of times besides the one that causes the crash.
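    Whether `client` is valid can be checked at run time rather than assumed. A common debugging technique is a magic field that the constructor sets and the destructor clears, checked before any other member is touched. This is a sketch only: `TrackedClient`, `poll_checked()`, and the magic value are hypothetical. It can't prevent a fault when the pointer is completely wild (reading the magic field would fault too), but it is likely to flag a pointer to an object that was destroyed or whose memory was reused.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical client with a "still alive" marker.
class TrackedClient {
public:
    static constexpr std::uint32_t kAlive = 0xC1E47A11u;

    TrackedClient() : magic_(kAlive) {}
    // Note: GCC may drop this store as dead (see -fno-lifetime-dse).
    ~TrackedClient() { magic_ = 0; }

    bool valid() const { return magic_ == kAlive; }

    std::uint32_t tcp_state = 0;

private:
    std::uint32_t magic_;
};

// Validates the object before reading any other member, mirroring the start
// of tcp_client_poll(). Returns false instead of using a bad object.
bool poll_checked(TrackedClient* client) {
    if (client == nullptr || !client->valid()) {
        return false;   // on the target: print the bad address and halt here
    }
    switch (client->tcp_state) {
        default:        // the real state machine would go here
            break;
    }
    return true;
}
```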

    Could be an issue with an interrupt, too.

    I think you might be right: it hasn't crashed yet after more than 100,000 successful connections. This has been giving me a headache all day, and it's about the only fix I hadn't tried. I'll do more testing tomorrow, though it does appear to be solved so far, and I'll have to remember to try turning interrupts off and on the next time I run into random issues like this. Outside of benchmarking, I don't expect a Teensy server to ever serve that many clients in such a short time, so I'm happy to be past this issue and move on to scratching my head over other problems.

    The compiler can sometimes help. See my comments here:

    If it's a rare but random problem, I wouldn't ignore the issue.

    Be careful with disabling interrupts: various code relies on them, and lots of Teensy code turns them back on.
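    The warning about code turning interrupts back on matters most for nested critical sections: a bare `noInterrupts()`/`interrupts()` pair re-enables interrupts on exit even if the caller had them disabled. A common alternative is to save the previous state and restore it. The sketch below models this on a host; `irq_save_and_disable()`, `irq_restore()`, and `g_irq_enabled` are hypothetical stand-ins for the real PRIMASK handling on the Cortex-M7.

```cpp
#include <cassert>

// Host-side stand-ins for the real interrupt controls. On a Cortex-M target
// these would read and set PRIMASK (for example via CMSIS __get_PRIMASK()
// and __disable_irq()); here a flag models the enabled state.
static bool g_irq_enabled = true;

static bool irq_save_and_disable() {
    bool was_enabled = g_irq_enabled;
    g_irq_enabled = false;
    return was_enabled;
}

static void irq_restore(bool was_enabled) {
    g_irq_enabled = was_enabled;
}

// RAII guard: on exit it restores whatever state was in effect on entry,
// instead of unconditionally re-enabling interrupts.
class CriticalSection {
public:
    CriticalSection() : was_enabled_(irq_save_and_disable()) {}
    ~CriticalSection() { irq_restore(was_enabled_); }
    CriticalSection(const CriticalSection&) = delete;
    CriticalSection& operator=(const CriticalSection&) = delete;

private:
    bool was_enabled_;
};
```

    Because the inner guard restores "disabled" rather than forcing "enabled", nesting one protected function inside another no longer opens a window for an interrupt halfway through the outer one.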

    In #7, I see use of "new" without checking that it succeeded.
    Last edited by jonr; 08-18-2021 at 01:56 PM.

    I wrapped the function in noInterrupts()/interrupts(), so they're not disabled completely; I at least know not to do that.

    There's no check on the new because it will never fail there: it's a placement new, so it uses preallocated memory. You'll notice that right above it is where the memory is allocated and checked for validity. Right now, if the allocation fails it just locks up, so I can see it definitively.
    uint8_t* new_client_memory = (uint8_t*) _fnet_malloc_netbuf(sizeof(EthernetClient));
    if (new_client_memory == NULL) {
        Serial.println("Failed to allocate client!");
        while (1) {}   // halt so an allocation failure is impossible to miss
    }
    EthernetClient* new_client = new (new_client_memory) EthernetClient(accepted_socket);
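    This matches how placement new works: it doesn't allocate anything, it only runs the constructor in memory the caller already owns and returns that address, so the allocation is the one step to check. What differs from plain `new`/`delete` is teardown: the object must be destroyed with an explicit destructor call before its memory goes back to the same allocator. A minimal sketch, with `Widget` as a hypothetical stand-in for `EthernetClient` and `std::malloc`/`std::free` standing in for FNET's heap functions:

```cpp
#include <cassert>
#include <cstdlib>
#include <new>

struct Widget {   // hypothetical stand-in for EthernetClient
    int id;
    explicit Widget(int i) : id(i) {}
    ~Widget() {}
};

// Allocation is the step that can fail; placement new only runs the
// constructor in memory we already own and returns that same address.
Widget* create_widget(int id) {
    void* mem = std::malloc(sizeof(Widget));   // stand-in for the FNET heap
    if (mem == nullptr) {
        return nullptr;                        // check here, not after new
    }
    return new (mem) Widget(id);
}

// Placement-new objects are torn down in two explicit steps:
// run the destructor, then return the raw memory to the same allocator.
void destroy_widget(Widget* w) {
    if (w == nullptr) return;
    w->~Widget();
    std::free(w);
}
```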

