I started out trying to use the idiomatic goroutine pattern described in many places on the web, where a goroutine is started for each file, but ran out of file descriptors.
Instead, I ended up using a "worker" pattern, where I launch a limited number of worker goroutines, and have each process some of the files. I was certainly not the first person to use this pattern in Go, but it worked out well for me, so I decided to write this article in the hope that others may find it useful.
The Worker Pattern
The worker pattern is implemented like this:
// On entry, tasks is a slice containing the
// tasks to be processed.
const workers = the number of workers
var wg sync.WaitGroup
wg.Add(workers)
// Create a channel to "upload" the tasks.
// task_type is the type of each task (e.g.
// os.File).
in := make(chan task_type)
// Start the workers.
for w := 0; w < workers; w++ {
go func() {
go func() {
// When this worker finishes,
// signal completion. defer wg.Done()
// Read tasks from the
// channel. All workers
// compete for available
// tasks.
for t := range in {
process(t)
process(t)
}
}()
}()
}
// Push all of the available tasks into the
// channel.
for i := 0; i < len(tasks); i++ {
in <- tasks[i]
}
// Close the channel. Doing this will cause
// the for loop in the workers to end when
// the last task is read.
close(in)
// Wait for all workers to complete.
wg.Wait()
Benchmarks
I wrote a small program to benchmark this technique against the more typical goroutine pattern. The code can be found on bitbucket.org.
In this benchmark, I compared computing the primes factors for all of the integers from 2 to 20000. I borrowed the actual prime factorization code from the Rosetta Project's Prime Decomposition problem.
The program contains three separate calculation functions: a sequential version that computes all of the required factors using a single thread; a goroutine version that launches a separate goroutine for each integer whose prime factors are required; and a worker version that launches up to 8 worker goroutines and then feeds the numbers 2 through 20000 to them using a channel.
Results
These are the results of running the benchmark on my Ubuntu 16.04 x64 laptop with 32 GB RAM and an Intel Core i7-4900MQ CPU at 2.80GHz. The CPU has 4 hyper-threaded cores. The program includes the line
runtime.GOMAXPROCS(runtime.NumCPU())
so that all hyperthreaded cores will be used.
Each scenario runs 20 times and the average elapsed time is calculated.
runtime.GOMAXPROCS(runtime.NumCPU())
so that all hyperthreaded cores will be used.
Each scenario runs 20 times and the average elapsed time is calculated.
Workers | Average Time (seconds) |
---|---|
1 | 3.114244 |
2 | 1.708337 |
3 | 1.317147 |
4 | 1.224138 |
5 | 0.993016 |
6 | 1.005200 |
7 | 0.995244 |
8 | 1.032952 |
Sequential | 3.127022 |
One goroutine per integer | 0.990083 |
Not surprisingly, the 1-worker version takes about the same time as the sequential version, and that beginning with 5 workers, the worker-based versions are comparable to the goroutine version.
This benchmark uses a CPU-intensive task as its "work". Using this pattern with an I/O bound problem may yield very different results. The reader is encouraged to run his or her own benchmarks.
Conclusion
The worker pattern may be useful when trying to parallelized a program where there is a resource (e.g. open file descriptors) that is limited, and creating a separate goroutine for each task will not work.