What is a thread?
A thread is essentially a lightweight unit of execution within a process. It is given a sequence of instructions and has its own registers, stack and program counter.
- They share the same heap and global memory, so they can communicate easily (a small sketch of this follows this list)
- The OS’s scheduler decides which threads run on which cores
- If there are more threads than cores, it time slices between them
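For example, here is a minimal sketch (my own toy example, using the pthreads API covered later in this post) of two threads communicating through a shared global array, with no copying or message passing needed:

#include <pthread.h>
#include <cstdio>

// Shared global memory: both worker threads (and main) see this same array.
int results[2];

void *worker(void *arg) {
    int id = *(int *)arg;
    results[id] = (id + 1) * 100; // write straight into shared memory, nothing is copied between threads
    return nullptr;
}

int main() {
    pthread_t t[2];
    int ids[2] = {0, 1};
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], nullptr, worker, &ids[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], nullptr);
    // main reads what the workers wrote: communication through shared memory
    printf("%d %d\n", results[0], results[1]);
}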
So why not add infinite threads?
Adding more threads does not always mean a speed-up:
- Synchronisation costs: threads sometimes need to wait for each other (e.g. when accessing shared memory), and that waiting burns CPU time just idling (see the mutex sketch after this list).
- Overhead: creating threads, context switching, and scheduling have costs.
- Load balancing: if one thread does more work than others, it becomes the bottleneck.
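To make the synchronisation cost concrete, here is a small sketch (the shared counter and the thread count of 4 are just illustrative) where every thread contends for one mutex, so much of their time is spent waiting for the lock rather than computing:

#include <pthread.h>
#include <cstdio>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
long counter = 0; // shared state that every thread wants to update

void *increment(void *) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);   // threads queue up here, idling while they wait
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return nullptr;
}

int main() {
    pthread_t t[4];
    for (auto &th : t) pthread_create(&th, nullptr, increment, nullptr);
    for (auto &th : t) pthread_join(th, nullptr);
    printf("counter = %ld\n", counter); // the result is correct, but the updates were serialised by the lock
}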
How many threads to use?
Well, it really depends on the nature of the task, where your bottleneck is, and how multi-threading can address it.
Pure computation (Parallelism)
For example, say you're trying to find all the prime numbers between 0 and n.
- This is a CPU-intensive task and the bottleneck is the core itself
- Adding threads up to the amount of cores available in your machine (and splitting work accordingly) means that the work can be run in parallel up to that many cores
- Additional threads above this usually don't help, as the work cannot be parallelised further and they just add overhead (a small sketch of querying the core count follows this list)
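For a CPU-bound task like this, a common starting point is simply to ask the hardware how many threads it supports. A minimal sketch (the fallback value of 4 is an arbitrary choice of mine):

#include <thread>
#include <cstdio>

int main() {
    // hardware_concurrency() reports the number of hardware threads; it may return 0 if unknown
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0)
        cores = 4; // arbitrary fallback for this sketch
    printf("would split the work across %u threads\n", cores);
}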
I/O Bound (Concurrency)
When your task is heavily dependent on I/O, say a web server, having more threads than CPU cores is beneficial as it enables concurrency. This ensures that the time the CPU spends idling is kept to a minimum.
e.g. A thread per request web-server
- Each incoming request is given its own thread.
- If one thread is blocked on I/O, other threads can still process requests.
- The OS scheduler switches between threads, so the CPU can always do useful work while others are waiting.
** In practice this is not very scalable, so modern web servers use asynchronous I/O: instead of creating a thread per request, the same thread can do other work while also performing I/O in a non-blocking manner.
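As a toy illustration of the thread-per-request idea (no real networking here: a sleep stands in for a blocking I/O call, and the fixed batch of eight "requests" is invented for the sketch):

#include <pthread.h>
#include <unistd.h>
#include <cstdio>

// Pretend request handler: the sleep stands in for a blocking read/write on a socket.
void *handleRequest(void *arg) {
    long id = (long)arg;
    sleep(1);                          // this thread blocks on "I/O"...
    printf("request %ld done\n", id);  // ...while the other threads keep making progress
    return nullptr;
}

int main() {
    pthread_t workers[8];
    for (long i = 0; i < 8; i++)       // one thread per incoming "request"
        pthread_create(&workers[i], nullptr, handleRequest, (void *)i);
    for (auto &w : workers)
        pthread_join(w, nullptr);      // all eight finish after roughly 1s, not 8s
}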
What is the speed-up? (Amdahl’s Law)
When parallelising your task amongst multiple cores, the speed-up is bounded by the portion of serial work. Amdahl's Law captures this with the following formula for the expected speed-up of a computational task:

Speedup(N) = 1 / ((1 − P) + P / N)

Where:
- P = fraction of code that can be parallelised.
- N = number of threads/cores.
Intuition
- The part of your code that cannot be parallelised always costs the same time, no matter how many cores you add.
- The parallel part shrinks as you add more cores (N), but it can never vanish completely unless P = 1.
- This means the speed-up curve flattens out: as the number of cores goes to infinity, the speed-up only approaches 1 / (1 − P), since the serial part becomes the performance bottleneck (worked example below).
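For example, plugging some numbers into the formula (P = 0.9, i.e. 90% of the work parallelises, a value picked purely for illustration):

#include <cstdio>

// Amdahl's Law: speedup(N) = 1 / ((1 - P) + P / N)
double amdahl(double P, int N) { return 1.0 / ((1.0 - P) + P / N); }

int main() {
    double P = 0.9; // assumed: 90% of the work can be parallelised
    printf("8 cores:  %.2fx\n", amdahl(P, 8));            // ~4.7x
    printf("64 cores: %.2fx\n", amdahl(P, 64));           // ~8.8x
    printf("ceiling 1/(1-P): %.2fx\n", 1.0 / (1.0 - P));  // 10x, no matter how many cores
}

Even with 90% of the work parallelised, eight cores only get you about a 4.7x speed-up, and no number of cores can push past 10x.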
pthreads (POSIX threads)
pthreads is a low-level library following the POSIX standard where you manually create, join and manage threads. It's a double-edged sword: it gives you a lot of control, but you also have to handle synchronisation (race conditions) and work distribution yourself.
Here is a parallel block of code that I wrote to compute the primes up to a number n, using the Sieve of Eratosthenes algorithm.
std::vector<bool> printPrimesToNParalell(int n, int threads) {
    if (n < 2)
        return {true, true};
    // get primes < sqrt(n)
    auto basePrimes = getPrimesToSqrtN((int)floor(sqrt(n)));
    std::vector<bool> isComp(n + 1);
    isComp[0] = isComp[1] = true;
    // initialise tasks and threads
    std::vector<pthread_t> threadsIds(threads);
    std::vector<Task> tasks(threads);
    int totalChunks = n - 1;
    int chunkSize = (totalChunks + threads - 1) / threads;
    int threadSpawned = 0;
    for (int t = 0; t < threads; t++) {
        ll rangeStart = 2 + (ll)t * chunkSize;
        ll rangeEnd = rangeStart + chunkSize - 1; // inclusive
        if (rangeStart > n) break;
        if (rangeEnd > n) rangeEnd = n;
        tasks[t] = Task{t, rangeStart, rangeEnd, &isComp, &basePrimes};
        if (pthread_create(&threadsIds[t], NULL, findCompositesInRange, &tasks[t]) != 0) {
            perror("pthread_create");
            exit(1);
        }
        threadSpawned++; // increment the count of successfully spawned threads
    }
    for (int t = 0; t < threadSpawned; t++)
        pthread_join(threadsIds[t], NULL);
    return isComp;
}

Essentially, the algorithm first computes the primes up to √n (the getPrimesToSqrtN call) and then, in parallel, marks the multiples of those primes that lie above √n as composite in a boolean array; whatever is left unmarked is prime.
I parallelised the work by splitting the range [2..n] into chunks, with each of the 8 threads on my machine assigned its own piece. Each thread then marks the multiples within its own respective section. Since each thread operates within its own boundary of the boolean array, no synchronisation is required.
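The Task struct and the findCompositesInRange worker aren't shown in the snippet above; here is a minimal sketch of what they might look like (the field names, the ll typedef and the element type of basePrimes are my assumptions, not necessarily the original code):

#include <algorithm>
#include <vector>

typedef long long ll; // assumed typedef used by the snippet above

// One unit of work: mark the composites in [start, end] of the shared isComp array.
struct Task {
    int id;
    ll start, end;                // inclusive range owned by this thread
    std::vector<bool> *isComp;    // shared output array (threads write to disjoint ranges)
    std::vector<int> *basePrimes; // primes up to sqrt(n)
};

void *findCompositesInRange(void *arg) {
    Task *t = (Task *)arg;
    for (int p : *t->basePrimes) {
        ll p2 = (ll)p * p;
        // first multiple of p inside this thread's range (but never below p^2)
        ll first = std::max(p2, ((t->start + p - 1) / p) * (ll)p);
        for (ll i = first; i <= t->end; i += p)
            (*t->isComp)[i] = true;
    }
    return nullptr;
}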
For this somewhat simple implementation, working with pthreads wasn't too difficult, and I managed to get around a 3x speed-up. I think this figure reflects the portion of serial work in the program and the uneven workload across the range [2..n], where each thread isn't handling a similarly intensive piece of the work.
OpenMP
OpenMP is a higher-level abstraction, built on compiler directives, that makes creating and managing threads easier. It hides the details of thread management while still letting you influence scheduling and synchronisation.
You basically tell the compiler "what to do", and OpenMP decides "how" it will parallelise the computation. Compared to the pthreads implementation above, OpenMP provides additional options for how the program should distribute and schedule work amongst threads. The pthreads range-split approach above assumes the work in each chunk is equal, and accounting for uneven work would require additional complexity.
With OpenMP this is a little simpler: you can guide how work is distributed and scheduled with simple compiler directives (a short sketch of each follows the list):
- static[,chunk]: round-robin fixed chunks (low overhead; best for uniform work, same as the pthreads range split).
- dynamic[,chunk]: threads pull the next chunk when they finish (good for irregular work).
- guided[,chunk]: large chunks first, shrinking over time (a compromise between overhead and balance).
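Here is what these look like on a simple loop (the square-root sum is just a stand-in for real per-iteration work, and the chunk size of 4 is an arbitrary choice; the reduction clause simply sums each thread's partial result safely):

#include <cmath>
#include <cstdio>

// Stand-in for the real per-iteration work.
static double work(int i) { return std::sqrt((double)i); }

int main() {
    const int n = 1000000;
    double a = 0, b = 0, c = 0;

    // static: fixed, contiguous blocks decided up front (like the pthreads range split)
    #pragma omp parallel for schedule(static) reduction(+:a)
    for (int i = 0; i < n; i++) a += work(i);

    // dynamic,4: threads pull the next 4 iterations whenever they finish their current chunk
    #pragma omp parallel for schedule(dynamic, 4) reduction(+:b)
    for (int i = 0; i < n; i++) b += work(i);

    // guided,4: chunks start large and shrink over time, down to a minimum of 4 iterations
    #pragma omp parallel for schedule(guided, 4) reduction(+:c)
    for (int i = 0; i < n; i++) c += work(i);

    printf("%.2f %.2f %.2f\n", a, b, c);
}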
What does the chunk in [,chunk] mean?
chunk is an optional integer that tells the runtime how many loop iterations to hand to a thread at a time.
Small Chunks
- Good when work is unbalanced
- A single thread is not left doing a large portion that is intensive while other threads idle
- Better load balancing, but also more overhead
Large Chunks
- Per-iteration cost is uniform (or the heavy work is evenly scattered, so it averages out).
- i.e. all threads finish around the same time (no one thread is left idling)
- Useful when scheduling overhead dominates (with small chunks) and you need contiguous blocks for cache locality.
For this task of computing primes up to n, using a smaller chunk value helped, as the work tends to be more intensive towards the upper end of [2..n]. Here's the code…
std::vector<bool> printPrimesToOpenMP(int n, int threads) {
    if (n < 2)
        return {true, true};
    // get primes < sqrt(n)
    auto basePrimes = getPrimesToSqrtN((int)floor(sqrt(n)));
    std::vector<bool> isComp(n + 1);
    isComp[0] = isComp[1] = true;
    ll totalChunks = n - 1;
    ll segments = threads * 10;
    ll chunkSize = (totalChunks + segments - 1) / segments; // threads * 10 segments (80 with 8 threads)
    omp_set_num_threads(threads);
    #pragma omp parallel for schedule(dynamic, 1)
    for (ll rangeStart = 2; rangeStart <= n; rangeStart += chunkSize) {
        ll rangeEnd = rangeStart + chunkSize - 1; // inclusive
        if (rangeEnd > n)
            rangeEnd = n;
        for (auto p : basePrimes) {
            ll p2 = p * p;
            ll first = std::max(p2, ((rangeStart + p - 1) / p) * p);
            for (ll i = first; i <= rangeEnd; i += p) {
                isComp[i] = true;
            }
        }
    }
    return isComp;
}

The work is split into 80 segments (threads × 10) and the dynamic scheduler assigns a segment to a thread as it becomes available.
Performance vs pthreads
With some non-extensive benchmarking, the OpenMP implementation yielded:
-> % ./openmp_primes.o
Enter a value for n: 100000000
Time taken: 0.128826 seconds

while the pthread implementation was slightly slower:
-> % ./pthread_primes.o
Enter a value for n: 100000000
Time taken: 0.134285 seconds

My assumption would be that this is mostly due to the dynamic distribution of work (with OpenMP) being more optimal for this particular task.
Extra: OpenMP private and shared keywords
OpenMP provides two keywords (clauses) for specifying how data should be shared amongst threads. These control whether threads see one common instance or each get their own copy (a short sketch of both follows below).
shared(varlist)
- All threads refer to the same single instance.
- Reads/writes must be synchronised to avoid races.
private(varlist)
- Each thread gets its own uninitialised copy.
- The original value is not copied in, and on exit the private value is discarded.
In my code this is not explicitly specified and is left to OpenMP to configure. By default…
- Variables declared inside the parallel region are private to that region.
- Variables visible from outside (locals in the enclosing scope, globals/statics) are typically shared.
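To make them explicit rather than relying on the defaults, the clauses look something like this (the variable names are just illustrative):

#include <omp.h>
#include <cstdio>

int main() {
    int n = 8;        // shared: every thread reads the same n
    int scratch = -1; // private: every thread gets its own uninitialised copy

    #pragma omp parallel for shared(n) private(scratch)
    for (int i = 0; i < n; i++) {
        scratch = i * i; // safe without locks: each thread writes only its own copy
        printf("thread %d computed %d\n", omp_get_thread_num(), scratch);
    }

    // the thread-private copies are discarded on exit; the original scratch still holds -1 here
    printf("scratch after the loop: %d\n", scratch);
}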
Summary
This was a quick introduction to multi-threading and parallel computation, going over two common frameworks that enable this kind of work. This is only a foundation for me at the moment, to get a grasp of what parallel computation is and how it is done at a high level. Hope you enjoyed and learned something like I did :)
- ravi