

# Transactional memory & atomic blocks

**Tim Harris** 

# Research

September 2007. urlpagc=http://dx.doi.org/10.1109/ IISWC.2007.4362177, pdf=http://www-sal.cs.uiuc.edu/ ~zilles/papers/tm\_false\_conflicts.iiswc2007.pdf, catId1=SW, catId2=usingstm.

V 10

100

1.00

- [497] Craig Zilles and Ravi Rajwar. Transactional memory and the birthday paradox (brief announcement). In SPAA '07: Proc. 19th Symposium on Parallel Algorithms and Architectures, pages 303-304, June 2007. A longer version is available as Technical Report UIUCDCS-R-2007-2835, March 2007, urlpage=http://doi. acm.org/10.1145/1248377.1248428, pdf=http://www-sal. cs.uiuc.edu/~zilles/papers/tm-bday.spaa2007.pdf.
- [498] Ferad Zyulkyarov, Adrián Cristal, Sanja Cvijic, Eduard Ayguadé, Mateo Valero, Osman S. Unsal, and Tim Harris. WormBench: a configurable workload for evaluating transactional memory systems. In MEDEA '08: Proc. 9th workshop on MEmory performance, pages 61-68, October 2008. urlpage=http://dx.doi.org/10.1145/ 1509084.1509093.
- [499] Ferad Zyulkyarov, Vladimir Gajinov, Osman S. Unsal, Adrián Cristal, Eduard Ayguadé, Tim Harris, and Mateo Valero. Atomic Quake: using transactional memory in an interactive multiplayer game server. In PPoPP '09: Proc. 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 25– 34, February 2009. urlpage=http://dx.doi.org/10.1145/ 1504176.1504183, pdf=http://www.bscmsrc.eu/sites/ default/files/atomicquake-ppopp09-zyulkyarov.pdf.
- [500] Ferad Zyulkyarov, Tim Harris, Osman S. Unsal, Adrián Cristal, and Mateo Valero. Debugging programs that use atomic blocks and transactional memory. In PPoPP '10: Proc. 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, January 2010.



# AMD quad-core

1.00

× 30 WD 100





# Sun Niagara-2

100

2.0

× 90 WD

| L2 Data<br>Bank 0<br>L2B0<br>L2 Data<br>Bank 1<br>L2B1 | SPARC<br>Core 1 | SPARC<br>Core 1 | SPARC<br>Core 5 | SPARC<br>Core 4 | L2 Data<br>Bank 4<br>L2B4<br>L2 Data<br>Bank 5<br>L2B5 |
|--------------------------------------------------------|-----------------|-----------------|-----------------|-----------------|--------------------------------------------------------|
|                                                        | L2<br>TAG0      | L2<br>TAG1      | L2<br>TAG5      | L2<br>TAG4      | MCU2                                                   |
| L2B3<br>L2 Data                                        | BNG<br>SII      | co              | X               | OS EFU          | L2B7<br>L2 Data                                        |
| Bank 3<br>L2B2<br>L2 Data<br>Bank 2                    | L2<br>TAG2      | L2<br>TAG3      | L2<br>TAG7      | L2<br>TAG6      | Bank 7<br>L2B6<br>L2 Data<br>Bank 6                    |
| DMU                                                    | SPARC<br>Core 2 | SPARC<br>Core 3 | SPARC<br>Core 7 | SPARC<br>Core 6 | RDP TDS                                                |
| PEU<br>PSR #                                           | L ESR)          |                 | SR              | MAC             |                                                        |



### Example: double-ended queue



### **Example: coarse-grained locking**

100

Microsoft<sup>-</sup>

Research





### **Example: fine-grain locking**

```
Class Q {
  Lock leftLock = new Lock();
  Lock rightRlock = new Lock();
  OElem leftSentinel:
  QElem rightSentinel;
  void pushLeft(int item) {
    QElem e = new QElem(item);
    leftLock.Acquire();
    e.right = this.leftSentinel.right;
    e.left = this.leftSentinel:
    this.leftSentinel.right.left = e;
    this.leftSentinel.right = e;
    leftLock.Release();
  }
}
```



### Example: fine-grain locking





## What we want





Hardware



# **Atomic blocks**



#### Atomic blocks built over transactional memory 3 primitives: atomic, retry, orElse

Hardware



Microsoft<sup>\*</sup>

Acld

- To a first approximation, just write the sequential code, and wrap **atomic** around it
- All-or-nothing semantics: **Atomic** commit
- Atomic block executes in **Isolation**
- Cannot deadlock (there are no locks!)
- Atomicity makes error recovery easy (e.g. exception thrown inside the **PopLeft** code)



### Atomic blocks compose (locks do not)

void GetTwo() {
 atomic {
 i1 = PopLeft();
 i2 = PopLeft();
 }
 DoSomething( i1, i2 );
}

- Guarantees to get two consecutive items
- PopLeft() is unchanged
- Cannot be achieved with locks (except by breaking the PopLeft abstraction)

Composition is THE way we build big programs that work



### Blocking: how does PopLeft wait for data?

```
Item PopLeft() {
    atomic {
        if (leftSentinel.right==rightSentinel) {
            retry;
        } else { ...remove item from queue... }
    }
}
```

- **retry** means "abandon execution of the atomic block and re-run it (when there is a chance it'll complete)"
- No lost wake-ups
- No consequential change to GetTwo(), even though GetTwo must wait for there to be two items in the queue

# Research

# Choice: waiting for either of two



- do {...this...} orelse {...that...} tries to run "this"
- If "this" retries, it runs "that" instead
- If both retry, the do-block retries. GetEither() will thereby wait for there to be an item in *either* queue

# A X 90 100 Microsoft Research

# Programming with atomic blocks

With locks, you think about:

- Which lock protects which data? What data can be mutated when by other threads? Which condition variables must be notified when?
- None of this is explicit in the source code

With atomic blocks you think about

- What are the **invariants** (e.g. the tree is balanced)?
- Each atomic block maintains the invariants
- Purely sequential reasoning within a block, which is dramatically easier
- Much easier setting for static analysis tools



# Summary so far

- Atomic blocks (atomic, retry, orElse) are a real step forward
- It's like using a high-level language instead of assembly code: whole classes of low-level errors are eliminated.
- Not a silver bullet:
  - you can still write buggy programs;
  - concurrent programs are still harder to write than sequential ones;
  - just aimed at shared memory.
- But the improvement is very substantial



# State of the art ~ 2003

100



Workload: operations on a red-black tree, 1 thread, 6:1:1 lookup:insert:delete mix with keys 0..65535

# Implementation techniques

Microsoft<sup>\*</sup>

Research

- Direct-update STM
  - Allow transactions to make updates in place in the heap
  - Avoids reads needing to search the log to see earlier writes that the transaction has made
  - Makes successful commit operations faster at the cost of extra work on contention or when a transaction aborts
- Compiler integration
  - Decompose the transactional memory operations into primitives
  - Expose the primitives to compiler optimization (e.g. to hoist concurrency control operations out of a loop)
- Runtime system integration
  - Integration with the garbage collector or runtime system components to scale to atomic blocks containing 100M memory accesses
  - Memory management system used to detect conflicts between transactional and non-transactional accesses

A X 50 40 100 Microsoft Research

#### Results: concurrency control overhead





# Direct update STM

- Transactional write:
  - Lock objects before they are written to (abort if another thread has that lock)
  - Log the overwritten data we need it to restore the heap case of retry, transaction abort, or a conflict with a concurrent thread
- Transactional read:
  - Log a version number we associate with the object
- Commit:
  - Check the version numbers of objects we've read
  - Increment the version numbers of object we've written



c2

ver = 200

val = 40

| Thread T1                | Thread T2                 | c1        |
|--------------------------|---------------------------|-----------|
| int t = 0;<br>→ atomic { | → atomic {<br>t = c1.val; | ver = 100 |
| t += c1.val;             | t ++;                     | val = 10  |
| t += c2.val;<br>}        | c1.val = t;<br>}          |           |

T1's log: T2's log:



| Thread T1                                                                  | Thread T2                                                                              |  |  |
|----------------------------------------------------------------------------|----------------------------------------------------------------------------------------|--|--|
| <pre>int t = 0;<br/>atomic {<br/>t += c1.val;<br/>t += c2.val;<br/>}</pre> | <pre>     atomic {         t = c1.val;         t ++;         c1.val = t;     } }</pre> |  |  |
| T1's log:                                                                  | T2's log:                                                                              |  |  |
| c1.ver=100                                                                 |                                                                                        |  |  |
| logs                                                                       | T1 reads from c1:<br>logs that it saw<br>version 100                                   |  |  |
|                                                                            |                                                                                        |  |  |









| Thread T1                                                     | Thread T2                                                  |  |  |
|---------------------------------------------------------------|------------------------------------------------------------|--|--|
| int t = 0;<br>atomic {<br>t += c1.val;<br>→ t += c2.val;<br>} |                                                            |  |  |
| T1's log:                                                     | T2's log:                                                  |  |  |
| c1.ver=100<br>c2.ver=200                                      | c1.ver=100                                                 |  |  |
|                                                               |                                                            |  |  |
| re                                                            | Suppose T1 now<br>reads from c2, sees it<br>at version 200 |  |  |





| Thread T1                                                     | Thread T2                   | c1                                                                          | c2                    |
|---------------------------------------------------------------|-----------------------------|-----------------------------------------------------------------------------|-----------------------|
| int t = 0;<br>atomic {<br>t += c1.val;<br>→ t += c2.val;<br>} | atomic {                    | locked:T2<br>val = 10                                                       | ver = 200<br>val = 40 |
| T1's log:                                                     | T2's log:                   |                                                                             |                       |
| c1.ver=100<br>c2.ver=200                                      | c1.ver=100<br>lock: c1, 100 |                                                                             |                       |
|                                                               |                             | Before updating c1, thread<br>T2 must lock it: record old<br>version number |                       |



| Thread T1                                                     | Thread T2                                                                                                         | c1                                           | c2                    |  |
|---------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------|----------------------------------------------|-----------------------|--|
| int t = 0;<br>atomic {<br>t += c1.val;<br>→ t += c2.val;<br>} | atomic {                                                                                                          | locked:T2<br>val = 11                        | ver = 200<br>val = 40 |  |
| T1's log:<br>c1.ver=100                                       | T2's log:<br>c1.ver=100                                                                                           | (2) After logg<br>value, T2 makes<br>place t | s its update in       |  |
| c2.ver=200                                                    | lock: c1, 100<br>c1.val=10                                                                                        |                                              |                       |  |
|                                                               | <ul><li>(1) Before updating c1.val,</li><li>thread T2 must log the data</li><li>it's going to overwrite</li></ul> |                                              |                       |  |



| Thread T1                                                     | Thread T2                                | c1                                                           | c2                                                         |
|---------------------------------------------------------------|------------------------------------------|--------------------------------------------------------------|------------------------------------------------------------|
| int t = 0;<br>atomic {<br>t += c1.val;<br>→ t += c2.val;<br>} | atomic {                                 | ver=101<br>v: = 10                                           | ver = 200<br>val = 40                                      |
| T1's log:                                                     | T2's log:                                | successfully.                                                | saction commits<br>Unlock the object,<br>ew version number |
| c1.ver=100<br>c2.ver=200                                      | c1.ver=100<br>lock: c1, 100<br>c1.val=10 | (1) Check the version<br>locked matches the we previously re | version                                                    |







 temp==0 is the only correct result here if these blocks really are atomic





- x == 0
- y == 0
- z == 0



















# Strong isolation

- Add a mechanism to detect conflicts between tx and normal accesses
  - e.g. 'z' in this example
- We would like:
  - Implementation flexibility e.g. different STMs
  - No overhead on non-transactional accesses
  - Predictable performance
  - Little overhead over weak atomicity



# Strong isolation: implementation















100

LUD.

E

Microsoft<sup>-</sup>

Research



# Conflicting normal access

100

LUD.

C. C.

Microsoft<sup>-</sup>

Research



# Performance figures depend on...

LOD

V 20 WD

Microsoft<sup>\*</sup>

Research

- Workload : What do the atomic blocks do? How long is spent inside them?
- Baseline implementation: Mature existing compiler, or prototype?
- Intended semantics: Support static separation? Violation freedom (TDRF)?
- STM implementation: In-place updates, deferred updates, eager/lazy conflict detection, visible/invisible readers?
- STM-specific optimizations: e.g. to remove or downgrade redundant TM operations
- Integration: e.g. dynamically between the GC and the STM, or inlining of STM functions during compilation
- Implementation effort: low-level perf tweaks, tuning, etc.
- Hardware: e.g. performance of CAS and memory system

# A X 30 WD 100 Microsoft Research

## Labyrinth



- STAMP v0.9.10
- 256x256x3 grid
- Routing 256 paths
- Almost all execution inside atomic blocks
- Atomic blocks can attempt 100K+ updates
- C# version derived from original C
- Compiled using Bartok, whole program mode, C# -> x86 (~80% perf of original C with VS2008)
- Overhead results with Core2 Duo running Windows Vista

"STAMP: Stanford Transactional Applications for Multi-Processing" Chi Cao Minh, JaeWoong Chung, Christos Kozyrakis, Kunle Olukotun , IISWC 2008



100

















## Scaling – Labyrinth





## Scaling – Delaunay





## Scaling – Genome





#### Scaling – Vacation





# Conclusion

- What are atomic blocks good for?
  - Shared memory data structures
- Implementations involve work throughout the software stack
  - Language design
  - Compiler
  - Language runtime system
  - OS-runtime-system interfaces
- Two different experiences
  - STM-Haksell
  - STM.Net