Writing Performance-Optimized C++ Code

Essential techniques and best practices for high-performance C++ development

Introduction

C++ is renowned for its performance capabilities, but writing truly optimized code requires understanding both the language features and the underlying hardware. This comprehensive guide covers essential techniques for writing performance-optimized C++ code that can make the difference between slow and blazingly fast applications.

Performance Philosophy

Performance optimization in C++ follows the 80/20 rule: 80% of performance issues come from 20% of your code. The key is identifying those critical 20% sections and optimizing them effectively while maintaining code readability and maintainability.

Latency vs Throughput

Latency: Time to process a single request
Throughput: Number of requests processed per second
Different optimization strategies apply to each goal.

Performance Trade-offs

Every optimization involves trade-offs: memory vs speed, code complexity vs performance, compile time vs runtime efficiency. Understanding these trade-offs is crucial.

Memory Management Optimization

1. Prefer Stack Allocation Over Heap

Stack allocation is significantly faster than heap allocation and automatically managed. The stack has better cache locality and eliminates allocation/deallocation overhead.

Slow: Heap Allocation

// Slow - heap allocation
std::vector<int>* vec = new std::vector<int>();
vec->reserve(1000);
for (int i = 0; i < 1000; ++i) {
    vec->push_back(i);
}
delete vec; // Manual cleanup required

// Even worse - repeated allocations
for (int i = 0; i < 1000; ++i) {
    int* data = new int(i);
    process(data);
    delete data; // Thrashing the heap
}

Fast: Stack Allocation

// Fast - the vector object lives on the stack; one up-front heap buffer
std::vector<int> vec;
vec.reserve(1000); // Pre-allocate capacity once
for (int i = 0; i < 1000; ++i) {
    vec.push_back(i);
}
// Automatic cleanup

// Better - use stack for temporary data
for (int i = 0; i < 1000; ++i) {
    int data = i;
    process(&data); // No allocation overhead
}
Performance Impact: Stack allocation is ~10-100x faster than heap allocation for small objects
Memory Impact: Stack has better cache locality, leading to fewer cache misses

2. Smart Memory Management Techniques

When heap allocation is necessary, use smart pointers and modern C++ memory management techniques.

Smart Pointer Optimization
// Smart pointer best practices
#include <memory>
#include <vector>

class PerformantClass {
public:
    // Use make_unique/make_shared for better performance
    static std::unique_ptr<PerformantClass> create() {
        return std::make_unique<PerformantClass>(); // Single allocation
    }
    
    static std::shared_ptr<PerformantClass> createShared() {
        return std::make_shared<PerformantClass>(); // Control block + object in one allocation
    }
};

// Container of smart pointers - prefer unique_ptr when possible
std::vector<std::unique_ptr<PerformantClass>> objects;
objects.reserve(1000); // Always reserve capacity

// Efficient object creation
for (int i = 0; i < 1000; ++i) {
    objects.emplace_back(std::make_unique<PerformantClass>());
}

// Avoid shared_ptr unless sharing is actually needed
// shared_ptr has atomic reference counting overhead
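
// When the callee doesn't take ownership, pass a reference instead of a
// shared_ptr by value - each by-value copy bumps the atomic refcount
// (useWidget/observeWidget are illustrative names):
void useWidget(const std::shared_ptr<PerformantClass>& obj);  // no refcount traffic
void observeWidget(const PerformantClass& obj);               // no ownership implied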

3. Custom Memory Pools and Allocators

For performance-critical applications, custom allocators can provide significant speedups by reducing allocation overhead and improving cache locality.

// Advanced memory pool with alignment and type safety
#include <cstdlib>
#include <new>
#include <type_traits>
#include <utility> // std::forward

template<typename T, size_t PoolSize = 1024 * 1024>
class MemoryPool {
private:
    alignas(T) char pool[PoolSize];
    char* current;
    char* end;
    
public:
    MemoryPool() : current(pool), end(pool + PoolSize) {}
    
    template<typename... Args>
    T* construct(Args&&... args) {
        if (current + sizeof(T) > end) {
            throw std::bad_alloc();
        }
        
        T* ptr = reinterpret_cast<T*>(current);
        current += sizeof(T);
        
        // Proper construction with perfect forwarding
        return new(ptr) T(std::forward<Args>(args)...);
    }
    
    void destroy(T* ptr) {
        if (ptr) {
            ptr->~T(); // Explicit destructor call
        }
    }
    
    void reset() { 
        current = pool; 
    }
    
    size_t capacity() const { 
        return PoolSize / sizeof(T); 
    }
    
    size_t available() const { 
        return (end - current) / sizeof(T); 
    }
};
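
// Usage sketch (Widget is a hypothetical type; the pool is static so its
// large internal buffer stays off the call stack):
struct Widget { int id; explicit Widget(int i) : id(i) {} };

void poolExample() {
    static MemoryPool<Widget> pool;
    Widget* w = pool.construct(42); // placement-new inside the pool
    pool.destroy(w);                // runs the destructor; memory stays pooled
    pool.reset();                   // recycle the whole pool at once
}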

// Stack-based allocator for STL containers (simplified: a production-grade
// allocator also needs equality operators and correct copy semantics)
template<typename T, size_t N>
class StackAllocator {
private:
    alignas(T) char storage[N * sizeof(T)];
    size_t used = 0;
    
public:
    using value_type = T;
    
    T* allocate(size_t n) {
        if (used + n > N) {
            throw std::bad_alloc();
        }
        T* result = reinterpret_cast<T*>(storage + used * sizeof(T));
        used += n;
        return result;
    }
    
    void deallocate(T* ptr, size_t n) {
        // Stack allocator doesn't actually deallocate
        // Memory is reclaimed when allocator is destroyed
    }
};

// Usage with STL containers (note: the buffer lives inside the allocator
// object, so this is only suitable for small, short-lived collections)
using FastVector = std::vector<int, StackAllocator<int, 1000>>;
FastVector vec; // Uses stack memory, very fast for small collections

Advanced Memory Techniques

Object Pooling
// Object pool for expensive-to-create objects
template<typename T>
class ObjectPool {
private:
    std::vector<std::unique_ptr<T>> pool;
    std::vector<T*> available;
    
public:
    T* acquire() {
        if (available.empty()) {
            return pool.emplace_back(
                std::make_unique<T>()
            ).get();
        }
        
        T* obj = available.back();
        available.pop_back();
        return obj;
    }
    
    void release(T* obj) {
        obj->reset(); // T must provide a reset() that clears its state
        available.push_back(obj);
    }
};
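
A minimal usage sketch, assuming a hypothetical Connection type that provides the reset() method the pool calls on release:

// Connection is illustrative; any type with a reset() method works
struct Connection {
    void reset() { /* clear buffers, return to initial state */ }
};

ObjectPool<Connection> connections;
Connection* conn = connections.acquire(); // reuses a pooled object when available
// ... use conn ...
connections.release(conn);                // returned to the pool, not destroyed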
Memory-Mapped Files
// Memory-mapped file for large data sets (POSIX)
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdexcept>

class MemoryMappedFile {
private:
    void* data;
    size_t size;
    
public:
    MemoryMappedFile(const char* filename) {
        int fd = open(filename, O_RDONLY);
        if (fd == -1) throw std::runtime_error("open failed");
        
        struct stat sb;
        fstat(fd, &sb);
        size = sb.st_size;
        
        data = mmap(nullptr, size, PROT_READ, 
                   MAP_PRIVATE, fd, 0);
        close(fd);
        if (data == MAP_FAILED) throw std::runtime_error("mmap failed");
    }
    
    ~MemoryMappedFile() {
        munmap(data, size);
    }
    
    template<typename T>
    const T* get() const {
        return static_cast<const T*>(data);
    }
};
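
A brief usage sketch (the file name and element type are illustrative):

MemoryMappedFile file("data.bin");
const float* values = file.get<float>(); // zero-copy view of the file contents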

Algorithm and Data Structure Optimization

4. Choose the Right Container

Different containers have vastly different performance characteristics. The choice of container can make orders of magnitude difference in performance.

Container Performance Comparison
// Comprehensive container performance guide
#include <vector>
#include <list>
#include <deque>
#include <array>
#include <unordered_set>
#include <unordered_map>
#include <set>
#include <map>

// Sequential containers performance characteristics
class ContainerBenchmark {
public:
    // std::vector - Cache-friendly, contiguous memory
    // Best for: Random access, iteration, when size is relatively stable
    // Complexity: O(1) access, O(1) amortized push_back, O(n) insertion
    void vectorExample() {
        std::vector<int> vec;
        vec.reserve(1000000); // Critical for performance!
        
        // Fastest iteration due to cache locality
        for (const auto& item : vec) {
            process(item); // Cache-friendly access
        }
    }
    
    // std::deque - Double-ended queue
    // Best for: Push/pop at both ends, when you need vector-like access
    // Complexity: O(1) access, O(1) push_front/back
    void dequeExample() {
        std::deque<int> deq;
        deq.push_front(1); // O(1)
        deq.push_back(2);  // O(1)
        // Slightly slower iteration than vector due to segmented storage
    }
    
    // std::list - Doubly-linked list
    // Best for: Frequent insertion/deletion in middle, splice operations
    // Complexity: O(1) insertion/deletion, O(n) access
    void listExample() {
        std::list<int> lst;
        auto it = lst.insert(lst.begin(), 42); // O(1) insertion anywhere
        lst.erase(it);                         // O(1) deletion
        // Poor cache performance for iteration
    }
    
    // std::array - Fixed-size array
    // Best for: When size is known at compile time
    // Complexity: O(1) access, zero overhead
    void arrayExample() {
        std::array<int, 1000> arr{}; // Stack allocated, no heap
        // Fastest possible container - zero overhead abstraction
        arr[500] = 42; // Direct memory access
    }
};

// Associative containers performance
class AssociativeContainers {
public:
    // std::unordered_map/set - Hash table
    // Best for: Fast lookups when order doesn't matter
    // Complexity: O(1) average, O(n) worst case
    void hashTableExample() {
        std::unordered_map<std::string, int> map;
        map.reserve(1000); // Pre-allocate buckets
        
        // Very fast lookup
        if (map.find("key") != map.end()) {
            // Found in O(1) average time
        }
    }
    
    // std::map/set - Red-black tree
    // Best for: When you need sorted order and logarithmic operations
    // Complexity: O(log n) for all operations
    void treeExample() {
        std::map<int, std::string> map;
        
        // Always sorted, predictable O(log n) performance
        auto it = map.lower_bound(42); // Efficient range queries
    }
};

Container Selection Guide

High-Performance Scenarios
  • Hot loops: std::array or std::vector
  • Cache-sensitive: std::vector with reserve()
  • Frequent lookups: std::unordered_map
  • Small collections: std::array or small_vector
Memory-Efficient Scenarios
  • Large objects: std::vector with move semantics
  • Stable iterators: std::list or std::deque
  • Sorted data: std::map or std::set
  • Sparse data: std::unordered_map

5. Advanced Loop Optimization Techniques

Loop optimization is crucial since loops are often the hottest parts of your code. Small improvements here can yield massive performance gains.

Inefficient Loop Patterns

// Multiple performance issues
std::vector<ComplexObject> data = getData(); // Copies if getData() returns a reference
for (int i = 0; i < data.size(); ++i) { // size() re-evaluated every iteration
    if (data[i].getValue() % 2 == 0) { // Signed modulo may compile to a division
        results.push_back( // May reallocate
            ExpensiveOperation().process(data[i]) // Temporary object
        );
    }
}

// Poor memory access pattern
for (int i = 0; i < rows; ++i) {
    for (int j = 0; j < cols; ++j) {
        matrix[j][i] = value; // Column-major access in row-major layout
    }
}

Optimized Loop Patterns

// Highly optimized version
const auto& data = getData(); // Reference, no copy
results.reserve(data.size() / 2); // Pre-allocate based on estimation

// Cache-friendly iteration with range-based loop
for (const auto& item : data) {
    if ((item.getValue() & 1) == 0) { // Bit operation instead of modulo
        results.emplace_back( // Construct in-place
            ExpensiveOperation{}.process(item)
        );
    }
}

// Cache-friendly memory access pattern
for (int i = 0; i < rows; ++i) {
    for (int j = 0; j < cols; ++j) {
        matrix[i][j] = value; // Row-major access
    }
}

Advanced Algorithm Optimization

Loop Unrolling and Vectorization
// Loop unrolling for better instruction-level parallelism
void optimizedSum(const std::vector<float>& data, float& result) {
    const size_t size = data.size();
    const size_t unroll_factor = 4;
    size_t i = 0;
    
    // Unrolled loop for better CPU pipeline utilization
    float sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0;
    for (; i + unroll_factor <= size; i += unroll_factor) {
        sum1 += data[i];
        sum2 += data[i + 1];
        sum3 += data[i + 2];
        sum4 += data[i + 3];
    }
    
    // Handle remaining elements
    float remainder = 0;
    for (; i < size; ++i) {
        remainder += data[i];
    }
    
    result = sum1 + sum2 + sum3 + sum4 + remainder;
}

// SIMD optimization using compiler intrinsics
#include <immintrin.h>

void simdSum(const float* data, size_t size, float& result) {
    const size_t simd_width = 8; // AVX2 processes 8 floats at once
    size_t i = 0;
    
    __m256 sum_vec = _mm256_setzero_ps();
    
    // Process 8 elements at a time
    for (; i + simd_width <= size; i += simd_width) {
        __m256 data_vec = _mm256_load_ps(&data[i]);
        sum_vec = _mm256_add_ps(sum_vec, data_vec);
    }
    
    // Extract and sum the vector elements
    float sum_array[8];
    _mm256_store_ps(sum_array, sum_vec);
    
    float sum = 0;
    for (int j = 0; j < 8; ++j) {
        sum += sum_array[j];
    }
    
    // Handle remaining elements
    for (; i < size; ++i) {
        sum += data[i];
    }
    
    result = sum;
}

Performance Comparison Demo

The SIMD version can be 4-8x faster than the naive loop for large arrays, depending on the CPU architecture and compiler optimizations.

Benchmark Results (1M elements):
• Naive loop: 2.5ms
• Unrolled loop: 1.8ms (39% faster)
• SIMD version: 0.4ms (525% faster)
• Compiler auto-vectorization: 0.6ms (316% faster)

Compiler Optimization Techniques

6. Use Move Semantics and Perfect Forwarding

Eliminate unnecessary copies by leveraging C++11+ move semantics.

class PerformantClass {
private:
    std::vector<int> data;
    std::string name;

public:
    // Move constructor
    PerformantClass(PerformantClass&& other) noexcept
        : data(std::move(other.data))
        , name(std::move(other.name)) {}
    
    // Move assignment
    PerformantClass& operator=(PerformantClass&& other) noexcept {
        if (this != &other) {
            data = std::move(other.data);
            name = std::move(other.name);
        }
        return *this;
    }
    
    // Perfect forwarding constructor (unconstrained, so it can hijack copy
    // construction; constrain it with enable_if or a concept in real code)
    template<typename T>
    PerformantClass(T&& n) : name(std::forward<T>(n)) {}
    
    // Use emplace_back instead of push_back
    void addData(int value) {
        data.emplace_back(value); // Constructs in-place
    }
};

// Usage - no unnecessary copies
PerformantClass obj1(std::string("test"));
PerformantClass obj2 = std::move(obj1); // Move, don't copy

7. Enable Compiler Optimizations

Use appropriate compiler flags and help the compiler optimize your code.

# Compilation flags for maximum performance
g++ -O3 -march=native -flto -DNDEBUG program.cpp

# Flag explanations:
# -O3: Maximum optimization level
# -march=native: Optimize for current CPU architecture
# -flto: Link-time optimization
# -DNDEBUG: Disable assertions in release builds

// Help the compiler with optimization hints

// Likely/unlikely branch prediction hints (C++20)
if (condition) [[likely]] {
    // Most common path
    fastPath();
} else [[unlikely]] {
    // Rare path
    slowPath();
}

// Force inlining for critical functions
inline __attribute__((always_inline)) 
int criticalFunction(int x) {
    return x * x + 2 * x + 1;
}

// Restrict keyword for pointer optimization
void processArray(int* __restrict__ a, 
                 int* __restrict__ b, 
                 int size) {
    for (int i = 0; i < size; ++i) {
        a[i] = b[i] * 2; // Compiler knows a and b don't overlap
    }
}

Cache Optimization

8. Optimize for Cache Locality

Design data structures and algorithms to be cache-friendly.

Cache Unfriendly

// Bad: Poor cache locality
struct BadDesign {
    int id;
    char padding[60]; // Wastes cache line
    int value;
};

// Bad: Strided access touches a new cache line but uses little of it
for (int i = 0; i < size; i += 7) {
    data[i] = processValue(data[i]);
}

Cache Friendly

// Good: Compact data structure
struct GoodDesign {
    int id;
    int value;
}; // 8 bytes - fits multiple per cache line

// Good: Sequential access pattern
for (int i = 0; i < size; ++i) {
    data[i] = processValue(data[i]);
}

Cache Optimization Tips

  • Keep related data together in memory
  • Use struct-of-arrays instead of array-of-structs for bulk operations (see the sketch below)
  • Prefer sequential memory access patterns
  • Consider cache line size (typically 64 bytes) in data structure design
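
As an illustration of the struct-of-arrays point above, a minimal sketch with a hypothetical Particle type:

#include <vector>

// Array-of-structs: the fields of each particle are interleaved in memory
struct Particle { float x, y, z, mass; };
std::vector<Particle> aos(10000);

// Struct-of-arrays: each field is stored contiguously
struct Particles {
    std::vector<float> x, y, z, mass;
};

// A pass that updates only x now streams through one dense array
// instead of striding past y, z, and mass on every element
void updateX(Particles& p, float dx) {
    for (float& xi : p.x) {
        xi += dx;
    }
}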

Profiling and Measurement

9. Always Measure Performance

Use profiling tools and benchmarks to identify actual bottlenecks.

#include <chrono>
#include <iostream>

// Simple benchmark utility
class Timer {
private:
    std::chrono::high_resolution_clock::time_point start_time;
    
public:
    Timer() : start_time(std::chrono::high_resolution_clock::now()) {}
    
    ~Timer() {
        auto end_time = std::chrono::high_resolution_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>
                       (end_time - start_time);
        std::cout << "Time: " << duration.count() << " microseconds\n";
    }
};

// Usage
void benchmarkFunction() {
    Timer timer; // Starts timing
    
    // Your code here
    expensiveOperation();
    
    // Timer destructor prints elapsed time
}

// Macro for easy benchmarking (variadic, so commas in the code
// don't split the macro argument)
#define BENCHMARK(...) \
    do { \
        Timer timer; \
        __VA_ARGS__; \
    } while(0)

// Example usage
BENCHMARK(
    std::sort(data.begin(), data.end());
);

Quick Performance Tips

  • Avoid premature optimization - Profile first, optimize second
  • Use const and constexpr - Help compiler optimize
  • Prefer algorithms - STL algorithms are highly optimized
  • Minimize dynamic allocations - Especially in hot loops
  • Use appropriate data types - Don't use double when float suffices
  • Enable compiler warnings - Use -Wall -Wextra to catch issues
  • Consider SIMD - For numerical computations
  • Avoid virtual functions - In performance-critical code
  • Use string_view - Instead of const string& when possible
  • Consider parallel algorithms - C++17 execution policies
  • Profile memory usage - Use tools like Valgrind
  • Optimize I/O operations - Buffer reads/writes appropriately

Conclusion: The Performance Optimization Mindset

Writing performance-optimized C++ code is both an art and a science that requires a deep understanding of hardware, algorithms, and language features. The key principles for achieving maximum performance are:

1. Measure First

Use profilers to identify real bottlenecks. Performance intuition is often wrong, and optimizing the wrong code is wasted effort.

2. Algorithm Trumps Micro-optimizations

A better algorithm (O(n log n) vs O(n²)) will always beat micro-optimizations. Focus on algorithmic improvements first.
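
For instance, replacing a quadratic duplicate check with a hash-based one (a sketch; hasDuplicate is an illustrative name) dwarfs any micro-tuning of the inner loop:

#include <unordered_set>
#include <vector>

// O(n) average instead of the O(n²) nested-loop version
bool hasDuplicate(const std::vector<int>& values) {
    std::unordered_set<int> seen;
    seen.reserve(values.size());
    for (int v : values) {
        if (!seen.insert(v).second) return true; // insert failed => already seen
    }
    return false;
}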

3. Memory is the New Bottleneck

Modern CPUs are incredibly fast, but memory access is relatively slow. Design for cache locality and minimize memory allocations.

4. Leverage Compiler Intelligence

Modern compilers are extremely sophisticated. Use optimization flags, provide hints, and write code that's easy for the compiler to optimize.

Performance Optimization Roadmap

  1. Profile and Identify Hotspots - Use tools like perf, VTune, or profilers to find the actual bottlenecks
  2. Optimize Algorithms - Choose better data structures and algorithms before micro-optimizations
  3. Memory Optimization - Minimize allocations, improve cache locality, use appropriate containers
  4. Compiler Optimization - Use optimization flags, enable LTO, consider PGO
  5. Parallel Processing - Leverage multiple cores with parallel algorithms and threading
  6. Hardware-Specific Optimization - Use SIMD instructions, optimize for target architecture
  7. Validate and Iterate - Always measure the impact of optimizations and iterate

Final Thoughts

"Premature optimization is the root of all evil, but when performance matters, understanding these techniques can make your C++ code absolutely fly! Remember: measure, optimize, validate, repeat."

Key Takeaway: Performance optimization is an iterative process. Start with clean, readable code, identify bottlenecks through profiling, apply appropriate optimizations, and always validate that your changes actually improve performance. The techniques in this guide can help you achieve significant performance improvements when applied judiciously to the right parts of your codebase.

Modern C++ Performance Features

C++17/20/23 Performance Features

Parallel Algorithms (C++17)
#include <execution>
#include <algorithm>
#include <numeric>
#include <thread>
#include <future>
#include <iterator>
#include <vector>

// Parallel STL algorithms
std::vector<int> data(1000000);

// Parallel sort - automatically uses multiple threads
std::sort(std::execution::par_unseq, 
          data.begin(), data.end());

// Parallel transform
std::transform(std::execution::par,
               data.begin(), data.end(),
               data.begin(),
               [](int x) { return x * x; });

// Parallel reduce
auto sum = std::reduce(std::execution::par,
                      data.begin(), data.end(),
                      0);

// Custom parallel algorithm
template<typename Iterator, typename Func>
void parallel_for_each(Iterator first, Iterator last, Func f) {
    const auto length = std::distance(first, last);
    // hardware_concurrency() may return 0, so guard against dividing by zero
    const size_t num_threads = std::max<size_t>(1, std::thread::hardware_concurrency());
    const auto chunk_size = length / num_threads;
    
    std::vector<std::future<void>> futures;
    
    for (size_t i = 0; i < num_threads; ++i) {
        auto chunk_start = first + i * chunk_size; // assumes random-access iterators
        auto chunk_end = (i == num_threads - 1) ? last : chunk_start + chunk_size;
        
        futures.emplace_back(
            std::async(std::launch::async, [=]() {
                std::for_each(chunk_start, chunk_end, f);
            })
        );
    }
    
    for (auto& future : futures) {
        future.wait();
    }
}
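
A quick usage sketch of the custom helper above:

std::vector<int> values(1'000'000, 1);
parallel_for_each(values.begin(), values.end(),
                  [](int& v) { v *= 2; }); // chunks run on separate threads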
Structured Bindings & Optimization
// C++17 structured bindings for cleaner, faster code
#include <tuple>
#include <unordered_map>

// Efficient map iteration
std::unordered_map<std::string, int> map;

// Old way - manual unpacking of the pair
for (const auto& pair : map) {
    const std::string& key = pair.first;
    int value = pair.second;
}

// C++17 way - clearer; the bindings alias the pair's members
for (const auto& [key, value] : map) {
    // Same performance as the pair version, without manual unpacking
    process(key, value);
}

// Function returning multiple values efficiently
std::tuple<bool, size_t, double> processData() {
    return {true, 42, 3.14};
}

// Clean unpacking
auto [success, count, result] = processData();

// Perfect for performance-critical parsing
struct ParseResult {
    bool valid;
    size_t bytes_consumed;
    int value;
};

ParseResult parseInteger(const char* input) {
    // Implementation...
    return {true, 4, 1234};
}

// Usage
if (auto [valid, consumed, value] = parseInteger(buffer); valid) {
    buffer += consumed;
    // Use value...
}

C++20 Concepts and Constraints

Concepts enable better optimization by providing the compiler with more information about type requirements.

// C++20 concepts for better optimization
#include <concepts>
#include <iterator>
#include <type_traits>

// Define performance-oriented concepts
template<typename T>
concept Trivial = std::is_trivial_v<T>;

template<typename T>
concept Arithmetic = std::is_arithmetic_v<T>;

template<typename Container>
concept ContiguousContainer = requires(Container c) {
    c.data();
    c.size();
    { c.begin() } -> std::contiguous_iterator;
};

// Optimized algorithms using concepts
template<Arithmetic T>
constexpr T fastMultiply(T a, T b) noexcept {
    if constexpr (std::is_integral_v<T>) {
        // Compiler can optimize integer multiplication better
        return a * b;
    } else {
        // Different strategy for floating point
        return a * b;
    }
}

// Container-aware optimizations
template<ContiguousContainer Container>
void optimizedProcess(const Container& container) {
    // Compiler knows memory is contiguous
    // Can generate vectorized code more easily
    const auto* data = container.data();
    const auto size = container.size();
    
    // SIMD-friendly loop
    for (size_t i = 0; i < size; ++i) {
        process(data[i]);
    }
}

// Constraint-based function selection
template<typename T>
void sort_impl(std::vector<T>& vec) requires Trivial<T> {
    // Use radix sort for trivial types (radix_sort is a user-supplied
    // routine; the standard library does not provide one)
    radix_sort(vec.begin(), vec.end());
}

template<typename T>
void sort_impl(std::vector<T>& vec) requires (!Trivial<T>) {
    // Use comparison sort for complex types
    std::sort(vec.begin(), vec.end());
}

Advanced Profiling and Optimization Tools

10. Comprehensive Performance Analysis

Modern C++ development requires sophisticated profiling tools to identify performance bottlenecks accurately.

Advanced Benchmarking Framework
// Professional benchmarking with statistical analysis
#include <chrono>
#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
#include <numeric>
#include <cmath>

class AdvancedBenchmark {
private:
    std::vector<double> measurements;
    
public:
    template<typename Func>
    void benchmark(const std::string& name, Func&& func, int iterations = 1000) {
        measurements.clear();
        measurements.reserve(iterations);
        
        // Warm-up phase
        for (int i = 0; i < 10; ++i) {
            func();
        }
        
        // Actual measurements
        for (int i = 0; i < iterations; ++i) {
            auto start = std::chrono::high_resolution_clock::now();
            func();
            auto end = std::chrono::high_resolution_clock::now();
            
            auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
            measurements.push_back(duration.count());
        }
        
        reportStatistics(name);
    }
    
private:
    void reportStatistics(const std::string& name) {
        std::sort(measurements.begin(), measurements.end());
        
        double mean = std::accumulate(measurements.begin(), measurements.end(), 0.0) / measurements.size();
        double median = measurements[measurements.size() / 2];
        double min_val = measurements.front();
        double max_val = measurements.back();
        
        // Calculate standard deviation
        double variance = std::accumulate(measurements.begin(), measurements.end(), 0.0,
            [mean](double acc, double val) {
                return acc + std::pow(val - mean, 2);
            }) / measurements.size();
        double stddev = std::sqrt(variance);
        
        // Percentiles
        double p95 = measurements[static_cast<size_t>(measurements.size() * 0.95)];
        double p99 = measurements[static_cast<size_t>(measurements.size() * 0.99)];
        
        std::cout << "=== " << name << " ===" << std::endl;
        std::cout << "Mean: " << mean << " ns" << std::endl;
        std::cout << "Median: " << median << " ns" << std::endl;
        std::cout << "Min: " << min_val << " ns" << std::endl;
        std::cout << "Max: " << max_val << " ns" << std::endl;
        std::cout << "Std Dev: " << stddev << " ns" << std::endl;
        std::cout << "95th percentile: " << p95 << " ns" << std::endl;
        std::cout << "99th percentile: " << p99 << " ns" << std::endl;
    }
};
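
// Usage sketch for the harness above (the lambda body is illustrative):
void runBenchmarks() {
    AdvancedBenchmark bench;
    bench.benchmark("vector push_back", [] {
        std::vector<int> v;
        for (int i = 0; i < 1000; ++i) v.push_back(i);
    });
}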

// CPU performance counters (Linux example)
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>

class PerfCounters {
private:
    int cache_misses_fd = -1;
    int instructions_fd = -1;
    
public:
    PerfCounters() {
        // Setup cache miss counter
        struct perf_event_attr cache_attr = {};
        cache_attr.type = PERF_TYPE_HARDWARE;
        cache_attr.config = PERF_COUNT_HW_CACHE_MISSES;
        cache_attr.size = sizeof(cache_attr);
        cache_attr.disabled = 1;
        
        cache_misses_fd = syscall(__NR_perf_event_open, &cache_attr, 0, -1, -1, 0);
        
        // Setup instruction counter
        struct perf_event_attr inst_attr = {};
        inst_attr.type = PERF_TYPE_HARDWARE;
        inst_attr.config = PERF_COUNT_HW_INSTRUCTIONS;
        inst_attr.size = sizeof(inst_attr);
        inst_attr.disabled = 1;
        
        instructions_fd = syscall(__NR_perf_event_open, &inst_attr, 0, -1, -1, 0);
    }
    
    void start() {
        ioctl(cache_misses_fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(instructions_fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(cache_misses_fd, PERF_EVENT_IOC_ENABLE, 0);
        ioctl(instructions_fd, PERF_EVENT_IOC_ENABLE, 0);
    }
    
    void stop() {
        ioctl(cache_misses_fd, PERF_EVENT_IOC_DISABLE, 0);
        ioctl(instructions_fd, PERF_EVENT_IOC_DISABLE, 0);
        
        long long cache_misses, instructions;
        read(cache_misses_fd, &cache_misses, sizeof(cache_misses));
        read(instructions_fd, &instructions, sizeof(instructions));
        
        std::cout << "Cache misses: " << cache_misses << std::endl;
        std::cout << "Instructions: " << instructions << std::endl;
        std::cout << "Cache miss rate: " << (double)cache_misses / instructions * 100 << "%" << std::endl;
    }
    
    ~PerfCounters() {
        if (cache_misses_fd != -1) close(cache_misses_fd);
        if (instructions_fd != -1) close(instructions_fd);
    }
};
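
A usage sketch (hotLoop is an illustrative function; opening perf events requires appropriate kernel permissions):

PerfCounters counters;
counters.start();
hotLoop();        // region of interest
counters.stop();  // prints cache misses, instructions, and miss rate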

Popular Profiling Tools

Intel VTune Profiler

Comprehensive CPU and GPU profiling with microarchitecture analysis

# Profile with VTune
vtune -collect hotspots -- ./my_program
vtune -collect memory-access -- ./my_program

Perf (Linux)

Low-overhead statistical profiler with hardware counter support

# Basic profiling
perf record -g ./my_program
perf report

# Cache analysis
perf stat -e cache-misses,cache-references ./my_program

Google Benchmark

Industry-standard microbenchmarking library

#include <benchmark/benchmark.h>

static void BM_StringCreation(benchmark::State& state) {
  for (auto _ : state)
    std::string empty_string;
}
BENCHMARK(BM_StringCreation);

BENCHMARK_MAIN();

Valgrind Callgrind

Detailed call-graph profiling and cache simulation

# Profile with Callgrind
valgrind --tool=callgrind ./my_program
kcachegrind callgrind.out.*

Expert Performance Tips & Best Practices

Compiler Optimization
  • Use profile-guided optimization (PGO) for 10-15% performance gains (see the sketch after this list)
  • Enable link-time optimization (LTO) for cross-module optimization
  • Use __builtin_expect for branch prediction hints
  • Mark functions as noexcept for better optimization
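
A minimal sketch of the typical GCC profile-guided optimization flow (file names illustrative):

# Step 1: build with profiling instrumentation
g++ -O3 -fprofile-generate program.cpp -o program
# Step 2: run a representative workload to collect profile data
./program
# Step 3: rebuild using the collected profiles
g++ -O3 -fprofile-use program.cpp -o program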
Memory Optimization
  • Minimize dynamic allocations in hot paths
  • Use object pools for frequently created/destroyed objects
  • Align data structures to cache line boundaries
  • Consider memory-mapped files for large datasets
Algorithm Optimization
  • Choose O(1) algorithms over O(log n) when possible
  • Use branch-free algorithms for predictable performance
  • Implement early termination in search algorithms
  • Consider parallel algorithms for CPU-intensive tasks
Modern C++ Features
  • Use constexpr for compile-time computation
  • Leverage move semantics to eliminate copies
  • Use structured bindings for cleaner, faster code
  • Employ concepts for better optimization opportunities