Become a leader in the IoT community!

New DevHeads get a 320-point leaderboard boost when joining the DevHeads IoT Integration Community. In addition to learning and advising, active community leaders are rewarded with community recognition and free tech stuff. Start your Legendary Collaboration now!

Step 1 of 5

CREATE YOUR PROFILE *Required

OR
Step 2 of 5

WHAT BRINGS YOU TO DEVHEADS? *Choose 1 or more

Collaboration & Work 🤝
Learn & Grow 📚
Contribute Experience & Expertise 🔧
Step 3 of 5

WHAT'S YOUR INTEREST OR EXPERTISE? *Choose 1 or more

Hardware & Design 💡
Embedded Software 💻
Edge Networking
Step 4 of 5

Personalize your profile

Step 5 of 5

Read & agree to our COMMUNITY RULES

  1. We want this server to be a welcoming space! Treat everyone with respect. Absolutely no harassment, witch hunting, sexism, racism, or hate speech will be tolerated.
  2. If you see something against the rules or something that makes you feel unsafe, let staff know by messaging @admin in the "support-tickets" tab in the Live DevChat menu.
  3. No age-restricted, obscene or NSFW content. This includes text, images, or links featuring nudity, sex, hard violence, or other graphically disturbing content.
  4. No spam. This includes DMing fellow members.
  5. You must be over the age of 18 years old to participate in our community.
  6. Our community uses Answer Overflow to index content on the web. By posting in this channel your messages will be indexed on the worldwide web to help others find answers.
  7. You agree to our Terms of Service (https://www.devheads.io/terms-of-service/) and Privacy Policy (https://www.devheads.io/privacy-policy)
By clicking "Finish", you have read and agreed to the our Terms of Service and Privacy Policy.

Optimizing a bubble sort implementation in C for an x86-64 architecture

Hello , I’m working on optimizing a bubble sort implementation in `C` for an `x86-64` architecture specifically targeting an Intel Core i7 processor using `GCC 11.2` . I noticed that the `-O3` flag resulted in slower performance compared to `-O2` when sorting large arrays of integers 1 million elements in this case.

Here are the timings, average of multiple runs:

  1. Marvee Amasi#0000

    Without optimizations:
    “`./sort 1000000 2.54s user 0.01s system 99% cpu 2.552 total“`
    -O2:
    “`./sort 1000000 1.25s user 0.00s system 99% cpu 1.253 total“`
    -O3:
    “`./sort 1000000 1.78s user 0.01s system 99% cpu 1.790 total“`
    The relevant part of my bubble sort function is included below:

  2. Marvee Amasi#0000

    “`void bubblesort(int *arr, int n) {
    for (int i = 0; i < n-1; i++) { for (int j = 0; j < n-i-1; j++) { if (arr[j] > arr[j+1]) {
    int temp = arr[j];
    arr[j] = arr[j+1];
    arr[j+1] = temp;
    }
    }
    }
    }
    “`

  3. Marvee Amasi#0000

    I’ve examined the assembly output for both -O2 and -O3 versions. While -O3 seems to attempt vector instructions like movdqa, there might be other factors affecting performance, like unnecessary register spills or missed opportunities for further vectorization.

    Could additional instructions introduced by -O3 outweigh the benefit of vectorization for bubble sort on x86-64? Are there any compiler flags or optimization techniques specific to x86-64 that might be more suitable for this scenario, considering the limitations of bubble sort for large datasets?

  4. RED HAT#0000

    Hi @marveeamasi ! It’s interesting that `-O3` is slower than `-O2` for your case. `-O3` enables more aggressive optimizations, but these optimizations can sometimes lead to unexpected performance

  5. Marvee Amasi#0000

    Thanks man @redhat_hacker I’m curious to dig deeper into why this might be happening

  6. RED HAT#0000

    @marveeamasi The `-O3` flag can sometimes result in larger binary sizes, which might not fit well in the CPU cache, leading to slower performance and the `-O3` might inline functions more aggressively, which can sometimes degrade performance if the inline functions are large or if they lead to increased instruction cache misses.

  7. RED HAT#0000

    you should use profiling for your application to understand where the issues are, tools like `gprof`, `perf`, or even `valgrind `can provide insights into which parts of your code are consuming the most time.

  8. Marvee Amasi#0000

    Oh profiling sounds like the next step . I wonder if anyone else has encountered similar behavior with -O3 and bubble sort on x86. Perhaps there are known patterns to watch out for in the profiling data?

  9. Marvee Amasi#0000

    So cache behavior and inlining decisions can play a significant role here. Any profiling techniques using gprof or perf to pinpoint the bottlenecks in my bubble sort implementation @redhat_hacker

  10. RED HAT#0000

    Compile your code with profiling enabled using
    “`sh
    Copy code
    gcc -pg -O3 -o sort sort.c
    ./sort 1000000
    gprof sort gmon.out > analysis.txt
    “`
    For `perf`, use
    “`sh
    perf record -g ./sort 1000000
    “`
    Generate a report to get full detail of where time is being spent on your program
    “`sh
    perf report
    “`

  11. Marvee Amasi#0000

    Gracious 👍

CONTRIBUTE TO THIS THREAD

Browse other Product Reviews tagged

Leaderboard

RANKED BY XP

All time
  • 1.
    Avatar
    @Nayel115
    1620 XP
  • 2.
    Avatar
    @UcGee
    650 XP
  • 3.
    Avatar
    @melta101
    600 XP
  • 4.
    Avatar
    @lifegochi
    250 XP
  • 5.
    Avatar
    @Youuce
    180 XP
  • 6.
    Avatar
    @hemalchevli
    170 XP