**Prof Simon McIntosh-Smith** 

University of Bristol

@simonmcs



# Modelling Advanced Arm-based CPUs with SimEng

SimEng developers: Jack Jones, Andrei Poenaru, Harry Waugh, Ainsley Rutterford, Hal Jones, James Price

Funding: EPSRC ASiMoV project (Advanced Simulation and Modelling of Virtual systems) EP/S005072/1, Arm via a Centre of Excellence in HPC at University of Bristol





# SimEng design goals

### Primary goals:

- Fast: millions of OoO instructions per second on a single core
- <u>Accurate</u>: typically within ~10% of real hardware
- Easy to modify: days for a radically different processor model

### Secondary goals:

- Use existing frameworks where possible
  - CAPSTONE for instruction decode, SST for memory hierarchy / multicore
  - Gem5-compatible tracing, checkpointing, ...





#### SimEng generic CPU model







## **Current status and WIP**

- Targeting Armv8.4+SVE. Using CAPSTONE, which also supports x86, RISC-V, POWER, ...
  - SimEng now supports ~480 instructions, ~10% of the ISA
  - Includes sophisticated branch predictors (A64FX-style)
  - Partial SVE support to match A64FX
    - Can vary SVE widths and number of units
  - Single-core only (for now)
- Support for syscall emulation:
  - Enough to handle libc startup routines in real binaries (compiled from C)
  - Basic printf support
  - File I/O works
  - malloc works for most cases, but not yet complete
- Integrating with SST:

  - SimEng models the core up to the load/store units and will use SST's models for the memory hierarchy (SimEng includes its own infinite L1 cache model)
  - Prototype demonstrated in the summer
  - Will also use SST to enable multi-core simulations









### **Results**

- Running McCalpin's STREAM benchmark
  - Run a problem small enough to fit in L1D cache
  - Using an out-of-order/superscalar core model, parameterized for ThunderX2 and A64FX
  - The STREAM run takes ~10ms on a real ThunderX2 core and ~13ms on a real A64FX core
- SimEng running on an Intel Xeon Processor E5-2603 v4 @ 1.7 GHz
  - ThunderX2 model
    - OoO takes ~94 seconds → 221 kHz / 0.50 MIPS
    - Cycle count error is **5.3%** versus real ThunderX2 hardware
  - A64FX model
    - OoO takes ~96 seconds → 186 kHz / 0.39 MIPS
    - Cycle count error is **7.4%** versus real A64FX hardware
- gem5 built from Arm's sve/beta1 branch, on the same Intel CPU, ThunderX2 model only
  - OoO takes ~280 seconds → 63 kHz / 0.14 MIPS (SimEng is 3.5x faster in simulated kHz and 3.6x faster in MIPS)
  - Cycle count error is 12.4% versus real ThunderX2 hardware
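
The ratios quoted above can be rechecked from the slide's own figures. The sketch below is illustrative only: the dictionary layout and function names are made up here, every input is an approximate ("~") number copied from the bullets, and the implied instructions-per-cycle value is derived from rounded figures, so it is indicative at best.

```python
# Figures copied from the results above; all are approximate ("~") values.
RUNS = {
    "simeng_tx2":   {"wall_s": 94.0,  "khz": 221.0, "mips": 0.50},
    "simeng_a64fx": {"wall_s": 96.0,  "khz": 186.0, "mips": 0.39},
    "gem5_tx2":     {"wall_s": 280.0, "khz": 63.0,  "mips": 0.14},
}

# Native STREAM run times on the real cores, in seconds (~10 ms and ~13 ms).
NATIVE_S = {"simeng_tx2": 0.010, "simeng_a64fx": 0.013}


def ratio(run_a, run_b, key):
    """How many times larger run_a's metric is than run_b's."""
    return RUNS[run_a][key] / RUNS[run_b][key]


def implied_ipc(run):
    """Simulated instructions per simulated cycle: MIPS divided by MHz."""
    r = RUNS[run]
    return r["mips"] / (r["khz"] / 1000.0)


def slowdown(run):
    """Simulation wall time divided by the native run time."""
    return RUNS[run]["wall_s"] / NATIVE_S[run]


khz_speedup = ratio("simeng_tx2", "gem5_tx2", "khz")
mips_speedup = ratio("simeng_tx2", "gem5_tx2", "mips")

print(f"SimEng vs gem5 (simulated kHz): {khz_speedup:.1f}x")
print(f"SimEng vs gem5 (MIPS):          {mips_speedup:.1f}x")
for name in RUNS:
    print(f"{name}: implied IPC ~{implied_ipc(name):.2f}")
for name in NATIVE_S:
    print(f"{name}: ~{slowdown(name):,.0f}x slower than real hardware")
```

The two speedups round to 3.5x and 3.6x, matching the slide. The wall-clock ratio (280 s versus 94 s) comes out lower because the two simulators report different cycle counts for the same run.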







## Key statistics about the project

- ~20,000 lines of simple, modern C++17
  - ~7,500 lines are specific to Armv8+SVE support
  - An additional ~10,000 lines of test code across ~350 tests
  - Can build with GCC 7 (or later), Clang (7 or 5), or Armclang 20. Intel 19 soon too.
- Includes a full Continuous Integration (CI) workflow
  - CircleCI → Jenkins, GoogleTest
- Supported host platforms include: Ubuntu, CentOS and macOS
- Will be released under a permissive LLVM-style Apache 2.0 license



*[Screenshot: terminal session (`jackj@DESKTOP-SEAURM3`, in `SimEng/src/tools`) showing a per-instruction pipeline timeline. Each row has a `[TIMELINE]` column followed by `[INSN_NUM][PC][DISASM]`. The timeline characters did not survive extraction.]*

The trace covers instructions 217282 to 217310, which repeat the following loop:

| INSN_NUM | PC         | DISASM                   |
|----------|------------|--------------------------|
| 217283   | 0x000004f0 | add x2, x0, x26          |
| 217284   | 0x000004f4 | add x1, x0, x19          |
| 217285   | 0x000004f8 | ldr q1, [x2]             |
| 217286   | 0x000004fc | ldr q0, [x1]             |
| 217287   | 0x00000500 | fmla v0.2d, v1.2d, v2.2d |
| 217288   | 0x00000504 | str q0, [x27, x0]        |
| 217289   | 0x00000508 | add x0, x0, #0x10        |
| 217290   | 0x0000050c | cmp x0, #2, lsl #12      |
| 217291   | 0x00000510 | b.ne #0xffffffffffffffe0 |

Timeline legend: `f` fetch, `d` decode, `n` rename, `p` dispatch, `i` issue, `c` complete, `r` retire, plus a marker for flushing.

`[PROBE]` list shown below the timeline:

- branch.mispredict
- L1D.cache.miss
- L1I.cache.miss
- rename.allocationStalls
- decode.earlyFlushes
- dispatch.rsStalls
- fetch.branchStalls
- issue.portBusyStalls
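
To show how a timeline row like those in the trace above might be read, here is a minimal sketch that turns one row into per-stage cycle numbers. It assumes each character column is one cycle and that `.` means the instruction is waiting; both are assumptions about the viewer, not documented behaviour. The sample row is invented for illustration and not copied from the screenshot, and the function names are hypothetical.

```python
# Stage letters taken from the legend in the trace; the one-column-per-cycle
# reading is an assumption made for this sketch.
STAGES = {
    "f": "fetch",
    "d": "decode",
    "n": "rename",
    "p": "dispatch",
    "i": "issue",
    "c": "complete",
    "r": "retire",
}


def stage_cycles(row, start_cycle=0):
    """Map each stage name to the cycle of its first occurrence in the row."""
    cycles = {}
    for offset, ch in enumerate(row):
        stage = STAGES.get(ch)
        if stage is not None and stage not in cycles:
            cycles[stage] = start_cycle + offset
    return cycles


def latency(cycles, first, last):
    """Cycles between two stages, or None if either stage never appears."""
    if first in cycles and last in cycles:
        return cycles[last] - cycles[first]
    return None


# Invented example row: fetched at cycle 0, waits before issue and retire.
row = "fdnp..i.c...r"
cycles = stage_cycles(row)
print(cycles)
print("dispatch to issue:", latency(cycles, "dispatch", "issue"))
print("fetch to retire:  ", latency(cycles, "fetch", "retire"))
```

Read this way, a long gap between `p` and `i` points at an operand or port stall, while a gap between `c` and `r` means the instruction is waiting behind older instructions to retire in order.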

# Things coming in 2021

- Support for accelerators, e.g. SME (in progress)
- More comprehensive libc support (in progress)
- Ares and Zeus models (in progress)
- Instruction fusing and micro-oping
- SST integration for the memory model and multi-core
- Other ISAs (via Capstone), e.g. RISC-V
- Integration with gem5? (Drop-in replacement for their OoO)



