Without optimizations:
```
./sort 1000000  2.54s user 0.01s system 99% cpu 2.552 total
```
-O2:
```
./sort 1000000  1.25s user 0.00s system 99% cpu 1.253 total
```
-O3:
```
./sort 1000000  1.78s user 0.01s system 99% cpu 1.790 total
```
The relevant part of my bubble sort function is included below:
```c
void bubblesort(int *arr, int n) {
    for (int i = 0; i < n - 1; i++) {
        for (int j = 0; j < n - i - 1; j++) {
            if (arr[j] > arr[j+1]) {
                int temp = arr[j];
                arr[j] = arr[j+1];
                arr[j+1] = temp;
            }
        }
    }
}
```
I’ve examined the assembly output for both the `-O2` and `-O3` versions. While `-O3` seems to attempt vector instructions like `movdqa`, there might be other factors affecting performance, like unnecessary register spills or missed opportunities for further vectorization.
Could additional instructions introduced by -O3 outweigh the benefit of vectorization for bubble sort on x86-64? Are there any compiler flags or optimization techniques specific to x86-64 that might be more suitable for this scenario, considering the limitations of bubble sort for large datasets?
Hi @marveeamasi ! It’s interesting that `-O3` is slower than `-O2` in your case. `-O3` enables more aggressive optimizations, but these can sometimes lead to unexpected performance regressions.
Thanks man @redhat_hacker I’m curious to dig deeper into why this might be happening
@marveeamasi The `-O3` flag can produce larger binaries that fit less well in the CPU cache, which hurts performance. `-O3` also inlines functions more aggressively, which can degrade performance when the inlined functions are large or when the extra code causes instruction cache misses.
You should profile your application to understand where the time goes. Tools like `gprof`, `perf`, or even `valgrind` can provide insight into which parts of your code are consuming the most time.
Oh, profiling sounds like the next step. I wonder if anyone else has encountered similar behavior with `-O3` and bubble sort on x86. Perhaps there are known patterns to watch out for in the profiling data?
So cache behavior and inlining decisions can play a significant role here. Any profiling techniques using `gprof` or `perf` to pinpoint the bottlenecks in my bubble sort implementation, @redhat_hacker?
Compile your code with profiling enabled using:
```sh
gcc -pg -O3 -o sort sort.c
./sort 1000000
gprof sort gmon.out > analysis.txt
```
For `perf`, use:
```sh
perf record -g ./sort 1000000
```
Then generate a report to see in detail where time is being spent in your program:
```sh
perf report
```
Gracious 👍