Sunday, May 1, 2016

The Golang "Worker" Pattern

I have recently started using the Go programming language for a web application project and I needed a small command-line utility to process several thousand files. Each file task is completely independent of the others, making this an "embarrassingly" parallel utility.

I started out trying to use the idiomatic goroutine pattern described in many places on the web, where a separate goroutine is started for each file, but I quickly ran out of file descriptors.

Instead, I ended up using a "worker" pattern, where I launch a limited number of worker goroutines, and have each process some of the files. I was certainly not the first person to use this pattern in Go, but it worked out well for me, so I decided to write this article in the hope that others may find it useful.

The Worker Pattern

The worker pattern is implemented like this:

    // On entry, tasks is a slice containing the
    // tasks to be processed.

    // The number of workers; tune this to taste.
    const workers = 8

    var wg sync.WaitGroup
    wg.Add(workers)

    // Create a channel used to feed the tasks
    // to the workers. taskType is the type of
    // each task (e.g. *os.File).
    in := make(chan taskType)

    // Start the workers.
    for w := 0; w < workers; w++ {
        go func() {
            // When this worker finishes,
            // signal completion.
            defer wg.Done()

            // Read tasks from the channel.
            // All workers compete for
            // available tasks.
            for t := range in {
                process(t)
            }
        }()
    }

    // Push all of the available tasks into the
    // channel.
    for _, t := range tasks {
        in <- t
    }

    // Close the channel. Doing this causes the
    // range loop in each worker to end once the
    // last task has been read.
    close(in)

    // Wait for all workers to complete.
    wg.Wait()


Benchmarks

I wrote a small program to benchmark this technique against the more typical goroutine pattern. The code can be found on bitbucket.org.

In this benchmark, I compared computing the prime factors of all of the integers from 2 to 20000. I borrowed the actual prime factorization code from Rosetta Code's Prime Decomposition task.

The program contains three separate calculation functions: a sequential version that computes all of the required factors using a single thread; a goroutine version that launches a separate goroutine for each integer whose prime factors are required; and a worker version that launches up to 8 worker goroutines and then feeds the numbers 2 through 20000 to them using a channel.

Results

These are the results of running the benchmark on my Ubuntu 16.04 x64 laptop with 32 GB RAM and an Intel Core i7-4900MQ CPU at 2.80GHz. The CPU has 4 hyper-threaded cores. The program includes the line

runtime.GOMAXPROCS(runtime.NumCPU())

so that all hyperthreaded cores will be used.

Each scenario runs 20 times and the average elapsed time is calculated.

Workers                    Average Time (seconds)
1                          3.114244
2                          1.708337
3                          1.317147
4                          1.224138
5                          0.993016
6                          1.005200
7                          0.995244
8                          1.032952
Sequential                 3.127022
One goroutine per integer  0.990083

Not surprisingly, the 1-worker version takes about the same time as the sequential version, and beginning with 5 workers, the worker-based versions are comparable to the goroutine version.

This benchmark uses a CPU-intensive task as its "work". Using this pattern with an I/O bound problem may yield very different results. The reader is encouraged to run his or her own benchmarks.

Conclusion

The worker pattern may be useful when trying to parallelize a program that depends on a limited resource (e.g. open file descriptors), where creating a separate goroutine for each task will not work.