
Debugging Persistent Segmentation Fault in Multi-threaded C++ Program on AMD Barcelona CPUs

I have been wrestling with a persistent segmentation fault in a multi-threaded C++ program running on a cluster of AMD Barcelona CPUs (Linux/x86_64). The code causing the crashes is a heavily used function, and under load, running about 1000 instances of the same optimized binary generates 1 to 2 crashes per hour.
Now here's the interesting part: the crashes happen on different machines within the cluster, yet the machines themselves are almost identical, and every crash shares the same characteristics – the same crash address and the same call stack.

  1. Marvee Amasi#0000

    The crash details:

    Signal: Segmentation fault (SIGSEGV)
    Faulting Instruction Address: 0x17bd9fc (mid-instruction in function Foo)

  2. Marvee Amasi#0000

    My code around the crash location:
    ```
    (gdb) x/6i $pc-12
    0x17bd9f1:               mov    (%rbx),%eax        ; load the pointer stored at *%rbx (the vtable pointer) into %eax
    0x17bd9f3:               mov    %rbx,%rdi          ; pass the object pointer as the first argument (this)
    0x17bd9f6:               callq  *0x70(%rax)        ; indirect virtual call through the vtable slot at offset 0x70
    0x17bd9f9 <_Z3Foov+345>: cmp    %eax,%r12d         ; compare the call's return value with %r12d
    0x17bd9fc <_Z3Foov+348>: mov    %eax,-0x80(%rbp)   ; faulting instruction: spill the return value to a stack slot
    0x17bd9ff <_Z3Foov+351>: jge    0x17bd97e          ; jump back if %r12d >= %eax (signed)
    ```

  3. Marvee Amasi#0000

    The crash happens in the middle of the instruction at `0x17bd9fc`, which is right after a call to a virtual function through a pointer at offset `0x70` from the memory pointed to by `%eax`.
    Examining the virtual table shows it's not corrupted, and it points to the expected function `Foo::Get()`.
    `Foo::Get()` itself seems to be simple and well-behaved (shown in the disassembly below).
    The return address on the stack (`$rsp-8`) points to the correct instruction after the call to `Foo::Get()`.

  4. Marvee Amasi#0000

    Disassembly of Foo::Get():
    ```
    (gdb) disas 0x2d3d7b0
    0x0000000002d3d7b0 <+0>: push   %rbp             ; save the caller's frame pointer
    0x0000000002d3d7b1 <+1>: mov    0x70(%rdi),%eax  ; load the 32-bit member at offset 0x70 of *this (%rdi) into %eax
    0x0000000002d3d7b4 <+4>: mov    %rsp,%rbp        ; set up this function's frame pointer
    0x0000000002d3d7b7 <+7>: leaveq                  ; restore %rsp and %rbp
    0x0000000002d3d7b8 <+8>: retq                    ; return to the caller
    End of assembler dump.
    ```
    It’s as if during the return from Foo::Get(), something increments the program counter (%rip) by 4 bytes, leading to the crash mid-instruction in Foo.
    Has anyone encountered anything similar? Any suggestions on how to approach debugging this further?
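    For reference, here is a rough sketch of what the disassembly implies about the class. This is a reconstruction, not the real source: the names, the padding, and the number of virtual functions before `Get()` are assumptions; only the two `0x70` offsets come from the disassembly above.
    ```cpp
    // Assumed shape of the class behind Foo::Get(). In the real class, Get() occupies
    // vtable slot 14 (offset 0x70 = 14 * 8 on x86_64, matching `callq *0x70(%rax)`),
    // and it returns the 32-bit member that happens to sit at object offset 0x70
    // (matching `mov 0x70(%rdi),%eax`). The preceding virtual functions are elided here.
    class Foo {
    public:
        virtual ~Foo() = default;
        // ... more virtual functions precede Get() in the real class ...
        virtual int Get() { return value_; }
    private:
        char pad_[0x68];   // placeholder members: vptr (8 bytes) + 0x68 puts value_ at offset 0x70
        int  value_ = 0;   // the field Foo::Get() reads
    };

    int main() {
        Foo f;
        return f.Get();    // virtual dispatch, corresponding to the `callq *0x70(%rax)` at the crash site
    }
    ```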

  5. UC GEE#0000

    Tools like GDB can help you track memory access patterns across different threads, and mutexes are the synchronization mechanism you should add around the critical sections of `Foo` so that access to shared data is thread-safe. How's it going? @marveeamasi

  6. Marvee Amasi#0000

    Yeah @ucgee, so I was able to identify the issue as a data race within the `Foo` function. Multiple threads were accessing and modifying shared data concurrently, which caused the corruption and the crash.
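    To make it concrete, the failing pattern was essentially this (a toy example, not the real code; the names are made up):
    ```cpp
    #include <thread>

    // Toy illustration of the data race: two threads touch the same non-atomic
    // member with no synchronization, which is undefined behaviour in C++.
    struct SharedState {
        int value = 0;   // stands in for the shared data inside Foo
    };

    int main() {
        SharedState s;
        std::thread writer([&s] { for (int i = 0; i < 1000000; ++i) s.value = i; });                     // concurrent writes
        std::thread reader([&s] { int last = 0; for (int i = 0; i < 1000000; ++i) last = s.value; (void)last; });  // concurrent reads
        writer.join();
        reader.join();
        return 0;
    }
    ```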

  7. Marvee Amasi#0000

    I synchronized the threads with a semaphore around the critical sections of `Foo` that involved shared data access. I wanted to ensure that only one thread can access that data at a time, preventing the race condition.
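    Roughly what the fix looks like (a simplified sketch, not the actual code; the member names are placeholders, and `std::binary_semaphore` needs C++20):
    ```cpp
    #include <semaphore>
    #include <thread>

    // Simplified sketch of the fix: a binary semaphore acts as a lock around the
    // critical sections, so only one thread at a time touches the shared member.
    class Foo {
    public:
        int Get() {
            sem_.acquire();    // enter the critical section
            int v = value_;    // shared data is only read while holding the semaphore
            sem_.release();    // leave the critical section
            return v;
        }

        void Set(int v) {
            sem_.acquire();
            value_ = v;
            sem_.release();
        }

    private:
        std::binary_semaphore sem_{1};   // count of 1 => at most one thread inside at a time
        int value_ = 0;                  // placeholder for the shared data in the real Foo
    };

    int main() {
        Foo f;
        std::thread writer([&f] { for (int i = 0; i < 100000; ++i) f.Set(i); });
        std::thread reader([&f] { for (int i = 0; i < 100000; ++i) (void)f.Get(); });
        writer.join();
        reader.join();
        return 0;
    }
    ```
    A `std::mutex` with `std::lock_guard` would do the same job and is the more idiomatic C++ choice; the semaphore here just mirrors what I described above.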
