Prelude
This is the third post in a three-part series that will provide an understanding of the mechanics and semantics behind the scheduler in Go. This post focuses on concurrency.
Index of the three-part series:
1) Scheduling In Go : Part I - OS Scheduler
2) Scheduling In Go : Part II - Go Scheduler
3) Scheduling In Go : Part III - Concurrency
Introduction
When I’m solving a problem, especially if it’s a new problem, I don’t initially think about whether concurrency is a good fit or not. I look for a sequential solution first and make sure that it’s working. Then after readability and technical reviews, I will begin to ask whether concurrency is reasonable and practical. Sometimes it’s obvious that concurrency is a good fit and other times it’s not so clear.
In the first part of this series, I explained the mechanics and semantics of the OS scheduler that I believe are important if you plan on writing multithreaded code. In the second part, I explained the semantics of the Go scheduler that I believe are important for understanding how to write concurrent code in Go. In this post, I will begin to bring the mechanics and semantics of the OS and Go schedulers together to provide a deeper understanding of what concurrency is and isn’t.
The goals of this post are:
Provide guidance on the semantics you must consider to determine if a workload is suitable for using concurrency.
Show you how different types of workloads change the semantics and therefore the engineering decisions you will want to make.
What is Concurrency
Concurrency means “out of order” execution. Taking a set of instructions that would otherwise be executed in sequence and finding a way to execute them out of order and still produce the same result. For the problem in front of you, it has to be obvious that out of order execution would add value. When I say value, I mean add enough of a performance gain for the complexity cost. Depending on your problem, out of order execution may not be possible or even make sense.
It’s also important to understand that concurrency is not the same as parallelism. Parallelism means executing two or more instructions at the same time. This is a different concept from concurrency. Parallelism is only possible when you have at least 2 operating system (OS) and hardware threads available to you and you have at least 2 Goroutines, each executing instructions independently on each OS/hardware thread.
Figure 1 : Concurrency vs Parallelism
In figure 1, you see a diagram of two logical processors (P) each with their independent OS thread (M) attached to an independent hardware thread (Core) on the machine. You can see two Goroutines (G1 and G2) are executing in parallel, executing their instructions on their respective OS/hardware thread at the same time. Within each logical processor, three Goroutines are taking turns sharing their respective OS thread. All these Goroutines are running concurrently, executing their instructions in no particular order and sharing time on the OS thread.
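To make this concrete, here is a minimal sketch (mine, not from the post) that asks the runtime how many hardware threads the machine exposes and how many logical processors (P) the scheduler will use. By default these two values match, which is what makes parallelism possible.

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// The number of OS/hardware threads available on this machine.
	fmt.Println("hardware threads:", runtime.NumCPU())

	// Passing 0 queries the current number of logical processors (P)
	// without changing it.
	fmt.Println("logical processors:", runtime.GOMAXPROCS(0))
}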
Here’s the rub: sometimes leveraging concurrency without parallelism can actually slow down your throughput. What’s also interesting is, sometimes leveraging concurrency with parallelism doesn’t give you the performance gain you might otherwise think you can achieve.
Workloads
How do you know when out of order execution may be possible or make sense? Understanding the type of workload your problem is handling is a great place to start. There are two types of workloads that are important to understand when thinking about concurrency.
CPU-Bound: This is a workload that never creates a situation where Goroutines naturally move in and out of waiting states. This is work that is constantly making calculations. A Thread calculating Pi to the Nth digit would be CPU-Bound.
IO-Bound: This is a workload that causes Goroutines to naturally enter into waiting states. This is work that consists of requesting access to a resource over the network, making system calls into the operating system, or waiting for an event to occur. A Goroutine that needs to read a file would be IO-Bound. I would include synchronization events (mutexes, atomic) that cause the Goroutine to wait as part of this category.
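As a hedged illustration of the difference (these two functions are hypothetical, not from the post), compare a function that only calculates against one that naturally waits:

package workloads

import "net/http"

// cpuBound is constantly making calculations. The Goroutine never
// enters a waiting state, so it holds on to its OS/hardware thread.
func cpuBound(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += i * i
	}
	return sum
}

// ioBound naturally enters a waiting state on the network call,
// which frees the OS/hardware thread for another Goroutine.
func ioBound(url string) (int, error) {
	resp, err := http.Get(url) // The Goroutine waits here.
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}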
With CPU-Bound workloads you need parallelism to leverage concurrency. A single OS/hardware thread handling multiple Goroutines is not efficient since the Goroutines are not moving in and out of waiting states as part of their workload. Having more Goroutines than there are OS/hardware threads can slow down workload execution because of the latency cost (the time it takes) of moving Goroutines on and off the OS thread. The context switch is creating a “Stop The World” event for your workload since none of your workload is being executed during the switch when it otherwise could be.
With IO-Bound workloads you don’t need parallelism to use concurrency. A single OS/hardware thread can handle multiple Goroutines with efficiency since the Goroutines are naturally moving in and out of waiting states as part of their workload. Having more Goroutines than there are OS/hardware threads can speed up workload execution because the latency cost of moving Goroutines on and off the OS thread is not creating a “Stop The World” event. Your workload is naturally stopped and this allows a different Goroutine to leverage the same OS/hardware thread efficiently instead of letting the OS/hardware thread sit idle.
How do you know how many Goroutines per hardware thread provides the best throughput? Too few Goroutines and you have more idle time. Too many Goroutines and you have more context switch latency time. This is something for you to think about but beyond the scope of this particular post.
For now it’s important to review some code to solidify your ability to identify when a workload can leverage concurrency, when it can’t and if parallelism is needed or not.
Adding Numbers
We don’t need complex code to visualize and understand these semantics. Look at the following function named add that sums a collection of integers.
Listing 1
https://play.golang.org/p/r9LdqUsEzEz
36 func add(numbers []int) int {
37 var v int
38 for _, n := range numbers {
39 v += n
40 }
41 return v
42 }
In listing 1 on line 36, a function named add is declared that takes a collection of integers and returns the sum of the collection. It starts on line 37 with the declaration of the v variable to contain the sum. Then on line 38, the function traverses the collection linearly and each number is added to the current sum on line 39. Finally on line 41, the function returns the final sum back to the caller.
Question: is the add function a workload that is suitable for out of order execution? I believe the answer is yes. The collection of integers could be broken up into smaller lists and those lists could be processed concurrently. Once all the smaller lists are summed, the set of sums could be added together to produce the same answer as the sequential version.
However, there is another question that comes to mind. How many smaller lists should be created and processed independently to get the best throughput? To answer this question you must know what kind of workload add is performing. The add function is performing a CPU-Bound workload because the algorithm is performing pure math and nothing it does would cause the Goroutine to enter into a natural waiting state. This means using one Goroutine per OS/hardware thread is all that is needed for good throughput.
Listing 2 below is my concurrent version of add.
Note: There are several ways and options you can take when writing a concurrent version of add. Don’t get hung up on my particular implementation at this time. If you have a more readable version that performs the same or better I would love for you to share it.
Listing 2
https://play.golang.org/p/r9LdqUsEzEz
44 func addConcurrent(goroutines int, numbers []int) int {
45 var v int64
46 totalNumbers := len(numbers)
47 lastGoroutine := goroutines - 1
48 stride := totalNumbers / goroutines
49
50 var wg sync.WaitGroup
51 wg.Add(goroutines)
52
53 for g := 0; g < goroutines; g++ {
54 go func(g int) {
55 start := g * stride
56 end := start + stride
57 if g == lastGoroutine {
58 end = totalNumbers
59 }
60
61 var lv int
62 for _, n := range numbers[start:end] {
63 lv += n
64 }
65
66 atomic.AddInt64(&v, int64(lv))
67 wg.Done()
68 }(g)
69 }
70
71 wg.Wait()
72
73 return int(v)
74 }
In Listing 2, the addConcurrent function is presented, which is the concurrent version of the add function. The concurrent version uses 26 lines of code as opposed to the 5 lines of code for the non-concurrent version. There is a lot of code so I will only highlight the important lines to understand.
Line 48: Each Goroutine will get its own unique but smaller list of numbers to add. The size of each list is calculated by taking the size of the collection and dividing it by the number of Goroutines.
Line 53: The pool of Goroutines is created to perform the adding work.
Line 57-59: The last Goroutine will add the remaining list of numbers, which may be larger than the lists given to the other Goroutines.
Line 66: The sums of the smaller lists are added together into a final sum. The atomic.AddInt64 call is needed because multiple Goroutines write to the v variable concurrently.
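As a quick usage sketch (this snippet is mine, not from the post, and assumes addConcurrent is in the same package), you would call the function with one Goroutine per logical processor, which is all a CPU-Bound workload needs:

package main

import (
	"fmt"
	"runtime"
)

func main() {
	numbers := []int{10, 20, 30, 40, 50, 60, 70, 80}

	// One Goroutine per logical processor, as the prose above
	// suggests for CPU-Bound work.
	total := addConcurrent(runtime.GOMAXPROCS(0), numbers)
	fmt.Println("sum:", total) // Prints: sum: 360
}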
The concurrent version is definitely more complex than the sequential version but is the complexity worth it? The best way to answer that question is to create a benchmark. For these benchmarks I have used a collection of 10 million numbers with the garbage collector turned off. There is a sequential version that uses the add function and a concurrent version that uses the addConcurrent function.
Listing 3
func BenchmarkSequential(b *testing.B) {
for i := 0; i < b.N; i++ {
add(numbers)
}
}
func BenchmarkConcurrent(b *testing.B) {
for i := 0; i < b.N; i++ {
addConcurrent(runtime.NumCPU(), numbers)
}
}
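These benchmark functions reference a package-level numbers variable that isn’t shown in the post. Here is a hedged sketch of what that setup could look like; the name and the 10 million count come from the prose, and the values themselves are illustrative since any values will do.

var numbers []int

func init() {
	// Build the collection of 10 million numbers the benchmarks
	// add up. The workload is the adding itself, not the data.
	numbers = make([]int, 1e7)
	for i := range numbers {
		numbers[i] = i % 100
	}
}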
Listing 3 shows the benchmark functions. Here are the results when only a single OS/hardware thread is available for all Goroutines. The sequential version is using 1 Goroutine and the concurrent version is using runtime.NumCPU or 8 Goroutines on my machine. In this case, the concurrent version is leveraging concurrency without parallelism.
Listing 4
10 Million Numbers using 8 goroutines with 1 core
2.9 GHz Intel 4 Core i7
Concurrency WITHOUT Parallelism
-----------------------------------------------------------------------------
$ GOGC=off go test -cpu 1 -run none -bench . -benchtime 3s
goos: darwin
goarch: amd64
pkg: github.com/ardanlabs/gotraining/topics/go/testing/benchmarks/cpu-bound
BenchmarkSequential 1000 5720764 ns/op : ~10% Faster
BenchmarkConcurrent 1000 6387344 ns/op
BenchmarkSequentialAgain 1000 5614666 ns/op : ~13% Faster
BenchmarkConcurrentAgain 1000 6482612 ns/op
Note: Running a benchmark on your local machine is complicated. There are so many factors that can cause your benchmarks to be inaccurate. Make sure your machine is as idle as possible and run benchmarks a few times. You want to make sure you see consistency in the results. Having each benchmark run twice by the testing tool (the Again versions) gives this benchmark the most consistent results.
The benchmark in listing 4 shows that the Sequential version is approximately 10 to 13 percent faster than the Concurrent when only a single OS/hardware thread is available for all Goroutines. This is what I would have expected since the concurrent version has the overhead of context switches on that single OS thread and the management of the Goroutines.
Here are the results when an individual OS/hardware thread is available for each Goroutine. The sequential version is using 1 Goroutine and the concurrent version is using runtime.NumCPU or 8 Goroutines on my machine. In this case, the concurrent version is leveraging concurrency with parallelism.
Listing 5
10 Million Numbers using 8 goroutines with 8 cores
2.9 GHz Intel 4 Core i7
Concurrency WITH Parallelism
-----------------------------------------------------------------------------
$ GOGC=off go test -cpu 8 -run none -bench . -benchtime 3s
goos: darwin
goarch: amd64
pkg: github.com/ardanlabs/gotraining/topics/go/testing/benchmarks/cpu-bound
BenchmarkSequential-8 1000 5910799 ns/op
BenchmarkConcurrent-8 2000 3362643 ns/op : ~43% Faster
BenchmarkSequentialAgain-8 1000 5933444 ns/op
BenchmarkConcurrentAgain-8 2000 3477253 ns/op : ~41% Faster
The benchmark in listing 5 shows that the concurrent version is approximately 41 to 43 percent faster than the sequential version when an individual OS/hardware thread is available for each Goroutine. This is what I would have expected since all the Goroutines are now running in parallel, eight Goroutines executing their concurrent work at the same time.
Sorting
It’s important to understand that not all CPU-Bound workloads are suitable for concurrency. This is primarily true when it’s very expensive to break the work up and/or combine the results. An example of this can be seen with the sorting algorithm called Bubble sort. Look at the following code that implements Bubble sort in Go.
Listing 6
https://play.golang.org/p/S0Us1wYBqG6
01 package main
02
03 import "fmt"
04
05 func bubbleSort(numbers []int) {
06 n := len(numbers)
07 for i := 0; i < n; i++ {
08 if !sweep(numbers, i) {
09 return
10 }
11 }
12 }
13
14 func sweep(numbers []int, currentPass int) bool {
15 var idx int
16 idxNext := idx + 1
17 n := len(numbers)
18 var swap bool
19
20 for idxNext < (n - currentPass) {
21 a := numbers[idx]
22 b := numbers[idxNext]
23 if a > b {
24 numbers[idx] = b
25 numbers[idxNext] = a
26 swap = true
27 }
28 idx++
29 idxNext = idx + 1
30 }
31 return swap
32 }
33
34 func main() {
35 org := []int{1, 3, 2, 4, 8, 6, 7, 2, 3, 0}
36 fmt.Println(org)
37
38 bubbleSort(org)
39 fmt.Println(org)
40 }
In listing 6, there is an example of Bubble sort written in Go. This sorting algorithm sweeps through a collection of integers swapping values on every pass. Depending on the ordering of the list, it may require multiple passes through the collection before everything is sorted.
Question: is the bubbleSort function a workload that is suitable for out of order execution? I believe the answer is no. The collection of integers could be broken up into smaller lists and those lists could be sorted concurrently. However, after all the concurrent work is done there is no efficient way to sort the smaller lists together. Here is an example of a concurrent version of Bubble sort.
Listing 8
01 func bubbleSortConcurrent(goroutines int, numbers []int) {
02 totalNumbers := len(numbers)
03 lastGoroutine := goroutines - 1
04 stride := totalNumbers / goroutines
05
06 var wg sync.WaitGroup
07 wg.Add(goroutines)
08
09 for g := 0; g < goroutines; g++ {
10 go func(g int) {
11 start := g * stride
12 end := start + stride
13 if g == lastGoroutine {
14 end = totalNumbers
15 }
16
17 bubbleSort(numbers[start:end])
18 wg.Done()
19 }(g)
20 }
21
22 wg.Wait()
23
24 // Ugh, we have to sort the entire list again.
25 bubbleSort(numbers)
26 }
In Listing 8, the bubbleSortConcurrent function is presented, which is a concurrent version of the bubbleSort function. It uses multiple Goroutines to sort portions of the list concurrently. However, what you are left with is a list of sorted values in chunks. Given a list of 36 numbers, split into groups of 12, this would be the resulting list if the entire list is not sorted once more on line 25.
Listing 9
Before:
25 51 15 57 87 10 10 85 90 32 98 53
91 82 84 97 67 37 71 94 26 2 81 79
66 70 93 86 19 81 52 75 85 10 87 49
After:
10 10 15 25 32 51 53 57 85 87 90 98
2 26 37 67 71 79 81 82 84 91 94 97
10 19 49 52 66 70 75 81 85 86 87 93
Since the nature of Bubble sort is to sweep through the list, the call to bubbleSort on line 25 will negate any potential gains from using concurrency. With Bubble sort, there is no performance gain by using concurrency.
Reading Files
Two CPU-Bound workloads have been presented, but what about an IO-Bound workload? Are the semantics different when Goroutines are naturally moving in and out of waiting states? Look at an IO-Bound workload that reads files and performs a text search.
This first version is a sequential version of a function called find.
Listing 10
https://play.golang.org/p/8gFe5F8zweN
42 func find(topic string, docs []string) int {
43 var found int
44 for _, doc := range docs {
45 items, err := read(doc)
46 if err != nil {
47 continue
48 }
49 for _, item := range items {
50 if strings.Contains(item.Description, topic) {
51 found++
52 }
53 }
54 }
55 return found
56 }
In listing 10, you see the sequential version of the find function. On line 43, a variable named found is declared to maintain a count for the number of times the specified topic is found inside a given document. Then on line 44, the documents are iterated over and each document is read on line 45 using the read function. Finally on lines 49-53, the Contains function from the strings package is used to check if the topic can be found inside the collection of items read from the document. If the topic is found, the found variable is incremented by one.
Here is the implementation of the read function that is being called by find.
Listing 11
https://play.golang.org/p/8gFe5F8zweN
33 func read(doc string) ([]item, error) {
34 time.Sleep(time.Millisecond) // Simulate blocking disk read.
35 var d document
36 if err := xml.Unmarshal([]byte(file), &d); err != nil {
37 return nil, err
38 }
39 return d.Channel.Items, nil
40 }
The read function in listing 11 starts with a time.Sleep call for one millisecond. This call is being used to mock the latency that could be produced if we performed an actual system call to read the document from disk. The consistency of this latency is important for accurately measuring the performance of the sequential version of find against the concurrent version. Then on lines 36-38, the mock XML document stored in the global variable file is unmarshaled into a struct value for processing. Finally, on line 39, the collection of items is returned back to the caller.
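The document and item types and the file variable are not shown in the post. Based on the fields the code touches (d.Channel.Items and item.Description), here is a hedged sketch of what they could look like; the XML layout and tag names are my assumptions, and the snippet assumes encoding/xml is imported:

// item and document mirror the XML structure that read unmarshals.
type item struct {
	XMLName     xml.Name `xml:"item"`
	Description string   `xml:"description"`
}

type channel struct {
	XMLName xml.Name `xml:"channel"`
	Items   []item   `xml:"item"`
}

type document struct {
	XMLName xml.Name `xml:"document"`
	Channel channel  `xml:"channel"`
}

// file holds the mock document every read call unmarshals.
var file = `<?xml version="1.0" encoding="UTF-8"?>
<document>
	<channel>
		<item>
			<description>This is a test description.</description>
		</item>
	</channel>
</document>`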
With the sequential version in place, here is the concurrent version.
Note: There are several ways and options you can take when writing a concurrent version of find. Don’t get hung up on my particular implementation at this time. If you have a more readable version that performs the same or better I would love for you to share it.
Listing 12
https://play.golang.org/p/8gFe5F8zweN
58 func findConcurrent(goroutines int, topic string, docs []string) int {
59 var found int64
60
61 ch := make(chan string, len(docs))
62 for _, doc := range docs {
63 ch <- doc
64 }
65 close(ch)
66
67 var wg sync.WaitGroup
68 wg.Add(goroutines)
69
70 for g := 0; g < goroutines; g++ {
71 go func() {
72 var lFound int64
73 for doc := range ch {
74 items, err := read(doc)
75 if err != nil {
76 continue
77 }
78 for _, item := range items {
79 if strings.Contains(item.Description, topic) {
80 lFound++
81 }
82 }
83 }
84 atomic.AddInt64(&found, lFound)
85 wg.Done()
86 }()
87 }
88
89 wg.Wait()
90
91 return int(found)
92 }
In Listing 12, the findConcurrent function is presented, which is the concurrent version of the find function. The concurrent version uses 30 lines of code as opposed to the 13 lines of code for the non-concurrent version. My goal in implementing the concurrent version was to control the number of Goroutines that are used to process the unknown number of documents. A pooling pattern where a channel is used to feed the pool of Goroutines was my choice.
There is a lot of code so I will only highlight the important lines to understand.
Lines 61-64: A channel is created and populated with all the documents to process. The channel is buffered to the number of documents, so the sends on line 63 never block.
Line 65: The channel is closed so the pool of Goroutines naturally terminate when all the documents are processed.
Line 70: The pool of Goroutines is created.
Line 73-83: Each Goroutine in the pool receives a document from the channel, reads the document into memory and checks the contents for the topic. When there is a match, the local lFound variable is incremented.
Line 84: The counts from the individual Goroutines are summed together into a final count.
The concurrent version is definitely more complex than the sequential version but is the complexity worth it? The best way to answer this question again is to create a benchmark. For these benchmarks I have used a collection of 1 thousand documents with the garbage collector turned off. There is a sequential version that uses the find function and a concurrent version that uses the findConcurrent function.
Listing 13
func BenchmarkSequential(b *testing.B) {
for i := 0; i < b.N; i++ {
find("test", docs)
}
}
func BenchmarkConcurrent(b *testing.B) {
for i := 0; i < b.N; i++ {
findConcurrent(runtime.NumCPU(), "test", docs)
}
}
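Like before, these benchmark functions reference a package-level docs variable that isn’t shown. Here is a hedged sketch of that setup; the 1 thousand count comes from the prose, and the names are placeholders since read ignores the document name and always unmarshals the mock file variable:

var docs []string

func init() {
	// Build the collection of 1 thousand document names the
	// benchmarks use. The read function never opens them, so
	// the names themselves don't matter.
	docs = make([]string, 1000)
	for i := range docs {
		docs[i] = "doc.xml"
	}
}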
Listing 13 shows the benchmark functions. Here are the results when only a single OS/hardware thread is available for all Goroutines. The sequential version is using 1 Goroutine and the concurrent version is using runtime.NumCPU or 8 Goroutines on my machine. In this case, the concurrent version is leveraging concurrency without parallelism.
Listing 14
1 Thousand Documents using 8 goroutines with 1 core
2.9 GHz Intel 4 Core i7
Concurrency WITHOUT Parallelism
-----------------------------------------------------------------------------
$ GOGC=off go test -cpu 1 -run none -bench . -benchtime 3s
goos: darwin
goarch: amd64
pkg: github.com/ardanlabs/gotraining/topics/go/testing/benchmarks/io-bound
BenchmarkSequential 3 1483458120 ns/op
BenchmarkConcurrent 20 188941855 ns/op : ~87% Faster
BenchmarkSequentialAgain 2 1502682536 ns/op
BenchmarkConcurrentAgain 20 184037843 ns/op : ~88% Faster
The benchmark in listing 14 shows that the concurrent version is approximately 87 to 88 percent faster than the sequential version when only a single OS/hardware thread is available for all Goroutines. This is what I would have expected since all the Goroutines are efficiently sharing the single OS/hardware thread. The natural context switch happening for each Goroutine on the read call is allowing more work to get done over time on the single OS/hardware thread.
Here is the benchmark when using concurrency with parallelism.
Listing 15
1 Thousand Documents using 8 goroutines with 8 cores
2.9 GHz Intel 4 Core i7
Concurrency WITH Parallelism
-----------------------------------------------------------------------------
$ GOGC=off go test -run none -bench . -benchtime 3s
goos: darwin
goarch: amd64
pkg: github.com/ardanlabs/gotraining/topics/go/testing/benchmarks/io-bound
BenchmarkSequential-8 3 1490947198 ns/op
BenchmarkConcurrent-8 20 187382200 ns/op : ~88% Faster
BenchmarkSequentialAgain-8 3 1416126029 ns/op
BenchmarkConcurrentAgain-8 20 185965460 ns/op : ~87% Faster
The benchmark in listing 15 shows that bringing in the extra OS/hardware threads doesn’t provide any better performance. The Goroutines spend most of their time waiting in the read call, so the single OS/hardware thread was already being used efficiently and the extra parallelism has little left to add.
Conclusion
The goal of this post was to provide guidance on the semantics you must consider to determine if a workload is suitable for using concurrency. I tried to provide examples of different types of algorithms and workloads so you could see the differences in semantics and the different engineering decisions that needed to be considered.
You can clearly see that with IO-Bound workloads parallelism was not needed to get a big bump in performance. This is the opposite of what you saw with the CPU-Bound work. When it comes to an algorithm like Bubble sort, the use of concurrency would add complexity without any real benefit of performance. It’s important to determine if your workload is suitable for concurrency and then identify the type of workload you have to use the right semantics.