Configuring SimEng¶

SimEng configuration files are written in a YAML format, and provide values for the parameters of the processor architecture to be simulated. Pre-written model configuration files can be found in the configs/ directory.

The configuration files are split into several sections, each of which is associated with a specific area of the architecture modelled.

Core¶

SimEng cores can be one of three types:

emulation: An atomic “emulation-style” core which, per cycle, processes an instruction in its entirety before proceeding to the next instruction.
inorderpipeline: An in-order pipeline processor core with discrete fetch, decode, execute, and writeback stages.
outoforder: A complex superscalar out-of-order core, similar to those found in modern high-performance processors.

These core types are primarily referred to as core “archetypes”.

Note

Currently, the configuration files do not take into account the core archetype being modelled and require all parameters (without default values) to be defined, even if unused (e.g. reservation station definitions for an emulation core archetype). However, future developments plan for the exemption of those options not used under the selected core archetype.

Configuration options within the Core section are concerned with the functionality of the simulated processor pipeline. These include:

ISA

The Instruction Set Architecture under simulation. The options are AArch64 and rv64.

Simulation-Mode

The core archetype to use, the options are emulation, inorderpipelined, and outoforder.

Clock-Frequency-GHz

The clock frequency, in GHz, of the processor being modelled.

Timer Frequency-MHz

This dictates the frequency in MHz that the CPU’s internal counter timer is updated.

i.e. For models based on an Arm ISA, this dictates how often the Virtual Counter Timer system register is updated to the number of cycles completed. This value is then accessible to programmers through mrs x0 CNTVCT_el0.

Micro-Operations

Whether to enable instruction splitting for pre-defined Macro Operations or not.

Vector-Length (Only in use when ISA is AArch64)

The vector length used by instructions belonging to Arm’s Scalable Vector Extension. Supported vector lengths are those between 128 and 2048 in increments of 128.

Streaming-Vector-Length (Only in use when ISA is AArch64)

The vector length used by instructions belonging to Arm’s Scalable Matrix Extension. Although the architecturally valid vector lengths are powers of 2 between 128 and 2048 inclusive, the supported vector lengths are those between 128 and 2048 in increments of 128.

Compressed (Only in use when ISA is rv64)

Enables the RISC-V compressed extension. If set to false and compressed instructions are supplied, a misaligned program counter exception is usually thrown.

Fetch¶

This section is concerned with the parameterisation of the fetch unit and its internal structures.

Fetch-Block-Size: The size, in bytes, of the block fetched from the instruction cache.
Loop-Buffer-Size: The number of Macro-ops which can be stored in the loop buffer.
Loop-Detection-Threshold: The number of commits a unique branch instruction must go through, without another branch instruction being committed, before a loop is detected and the loop buffer is filled.

Process Image¶

This allows the stack and heap size to be altered as required.

Heap-Size: Size of the Heap; defined in bytes.
Stack- Size: Size of the Stack in memory; defined in bytes.

Register-set¶

The number of physical registers, of each type, to be used under the register renaming scheme. The register types supported are ISA dependent. The types available for each supported ISA are as followed:

AArch64

GeneralPurpose-Count
The number of physical general-purpose registers.
FloatingPoint/SVE-Count
The number of physical floating point registers. Also considered as the number of Arm SVE extension z registers where appropriate.
Predicate-Count (Optional)
The number of physical Arm SVE extension predicate registers.
Conditional-Count
The number of physical status/flag/conditional-code registers.
Matrix-Count (Optional)
The number of physical za Arm SME registers.

RISC-V

GeneralPurpose-Count
The number of physical general-purpose registers.
FloatingPoint-Count
The number of physical floating point registers.

Pipeline-widths¶

This section is concerned with the width of the simulated processor pipeline at specific stages, including:

Commit: The commitment/retirement width from the re-order buffer.
FrontEnd: The width of the pipeline before the execution stage (also excludes the dispatch/issue stage if simulating an outoforder core archetype).
LSQ-Completion: The width between the load/store queue unit and the write-back unit (translates to the number of load instructions that can be sent to the write-back unit per cycle).

Excluding the Commit option, the value given for these widths denotes the number of Micro-Ops, as opposed to Macro-ops, if the simulated architecture supports them.

Queue-sizes¶

This section defines the size of specific architectural queues. These queues currently include:

ROB: The size of the re-order buffer.
Load: The size of the load queue within the load/store queue unit.
Store: The size of the store queue within the load/store queue unit.

Branch-Predictor¶

The Branch-Prediction section contains those options to parameterise the branch predictor used during simulation. Currently, the options are minimal, but, planned developments will see options including the toggling and parameterisation of common branch predictor algorithms/structures.

The current options include:

Type: The type of branch predictor that is used, the options are Generic, and Perceptron. Both types of predictor use a branch target buffer with each entry containing a direction prediction mechanism and a target address. The direction predictor used in Generic is a saturating counter, and in Perceptron it is a perceptron.
BTB-Tag-Bits: The number of bits used to index the entries in the Branch Target Buffer (BTB). The number of entries in the BTB is obtained from the calculation: 1 << bits. For example, a bits value of 12 would result in a BTB with 4096 entries.
Saturating-Count-Bits: Only needed for a Generic predictor. The number of bits used in the saturating counter value.
Global-History-Length: The number of bits used to record the global history of branch directions. Each bit represents one branch direction. For PerceptronPredictor, this dictates the size of the perceptrons (with each perceptron having Global-History-Length + 1 weights).
RAS-entries: The number of entries in the Return Address Stack (RAS).
Fallback-Static-Predictor: Only needed for a Generic predictor. The static predictor used when no dynamic prediction is available. The options are either "Always-Taken" or "Always-Not-Taken".

L1-Data-Memory¶

This section describes the configuration for the L1 data cache in use.

Interface-Type: The type of memory interface used to model the L1 data cache. Options are currently Flat or Fixed which represent a FlatMemoryInterface or FixedMemoryInterface respectively. More information concerning these interfaces can be found here.

Note

Currently, if the chosen Simulation-Mode option is emulation or inorderpipelined, then only a Flat value is permitted. Future developments will seek to allow for more memory interfaces with these simulation archetypes.

L1-Instruction-Memory¶

This section describes the configuration for the L1 instruction cache in use.

Interface-Type: The type of memory interface used to model the L1 instruction cache. Options are currently Flat or Fixed which represent a FlatMemoryInterface or FixedMemoryInterface respectively. More information concerning these interfaces can be found here.

Note

Currently, only a Flat value is permitted for the L1 instruction cache interface. Future developments will seek to allow for more memory interfaces to be used with the L1 instruction cache.

LSQ-L1-Interface¶

This section contains the options used to configure SimEng’s interface between the LSQ and the L1 data cache. These options include:

Access-Latency: The cycle latency of L1 cache access.
Exclusive: If set to true, only one type of memory access (read or write) can be performed per cycle.
Load-Bandwidth: The number of bytes permitted to be loaded per cycle.
Store-Bandwidth: The number of bytes permitted to be stored per cycle.
Permitted-Requests-Per-Cycle: The number of load and store requests permitted per cycle.
Permitted-Loads-Per-Cycle: The number of load requests permitted per cycle.
Permitted-Stores-Per-Cycle: The number of store requests permitted per cycle.

Ports¶

Within this section, execution unit port definitions are constructed. Each port is defined with a name and a set of instruction groups/opcodes it supports. The instruction groups/opcodes are architecture-dependent, but, the available AArch64 instruction groups/opcodes can be found here and for RISC-V, can be found here.

To define a port, the following structure must be adhered to:

0:
  Portname: <port_name>
  Instruction-Group-Support:
  - <instruction_group>
  - ...
  - <instruction_group>
  Instruction-Opcode-Support:
  - <instruction_opcode>
  - ...
  - <instruction_opcode>
...
N-1:
  Portname: <port_name>
  Instruction-Group-Support:
  - <instruction_group>
  - ...
  - <instruction_group>
  Instruction-Opcode-Support:
  - <instruction_opcode>
  - ...
  - <instruction_opcode>

With N as the number of execution ports.

Reservation-Stations¶

The relationships between reservation stations and the execution ports, i.e. which reservation stations map to which execution ports, are defined in this section. The configuration of each reservation station contains a size value, a dispatch rate value, and a set of port names, previously defined in the Ports section.

The following structure must be adhered to when defining a reservation station:

0:
  Size: <number_of_entries>
  Dispatch-Rate: <number_of_permitted_dispatches_per_cycle>
  Ports:
  - <port_name>
  - ...
  - <port_name>
...
N-1:
    Size: <number_of_entries>
    Dispatch-Rate: <number_of_permitted_dispatches_per_cycle>
    Ports:
    - <port_name>
    - ...
    - <port_name>

With N as the number of reservation stations. Each execution port must be mapped to a reservation station.

Execution-Units¶

An execution unit can be configured to optionally include an internal pipeline and a set of instruction groups for operation blocking. The instruction groups referenced here are the same as those used in the Ports section.

The following structure must be adhered to when defining an execution unit:

0:
  Pipelined: <True/False>
  Blocking-Groups:
  - <instruction_group>
  - ...
  - <instruction_group>
...
N-1:
    Pipelined: <True/False>
    Blocking-Groups:
    - <instruction_group>
    - ...
    - <instruction_group>

With N as the number of execution units. The number of execution units should be equivalent to the number of execution ports.

Note, the indexing used in both the Ports and Execution-Units sections provide a relationship mapping, the 0th execution port maps to the 0th execution unit.

Latencies¶

The execution latency and throughput can be configured under the Latencies section. A latency/throughput pair can be defined for a set of instruction groups/opcodes, the groups/opcodes available are the same as the set discussed in the Ports section.

The execution latency defines the total number of cycles an instruction will spend in an execution unit. The throughput is how many cycles an instruction will block another instruction entering the execution unit. In non-pipelined execution units, the throughput is equal to the latency.

The following structure must be adhered to when defining group latencies:

0:
  Instruction-Groups:
  - <instruction_group>
  - ...
  - <instruction_group>
  Instruction-Opcodes:
  - <instruction_opcode>
  - ...
  - <instruction_opcode>
  Execution-Latency: <number_of_cycles>
  Execution-Throughput: <number_of_cycles>
...
N-1:
  Instruction-Groups:
  - <instruction_group>
  - ...
  - <instruction_group>
  Instruction-Opcodes:
  - <instruction_opcode>
  - ...
  - <instruction_opcode>
  Execution-Latency: <number_of_cycles>
  Execution-Throughput: <number_of_cycles>

With N as the number of user-defined latency mappings. The default latencies, both execution and throughput, for those instruction groups not covered are 1.

Note, unlike other operations, the execution latency defined for load/store operations are triggered in the LoadStoreQueue as opposed to within the execution unit (more details here).

CPU Info¶

This section contains information about the physical properties of the CPU. These fields are currently only used to generate a replica of the required Special Files directory structure.

Generate-Special-Dir

Values are either True or False. Dictates whether or not SimEng should generate the Special-Files directory tree at runtime. If your code requires Special-Files but you wish to use your own / existing files from a real system, you will need to set this option to False. The files which are currently generated / supported in SimEng are:

/proc/cpuinfo

/proc/stat

/sys/deviced/system/cpu/online

/sys/deviced/system/cpu/cpu{0..CoreCount}/topology/core_id

/sys/deviced/system/cpu/cpu{0..CoreCount}/topology/physical_package_id

Special-File-Dir-Path

Represented as a String; is the absolute path to the root directory where the Special-Files will be generated OR where existing Special-Files are located. This is optional, and defaults to SIMENG_BUILD_DIRECTORY/specialFiles. The root directory must already exist.

Core-Count

Defines the total number of Physical cores (Not including threads).

Note

Max Core-Count currently supported is 1.

Socket-Count: Defines the number of sockets used. Typically set to 1, but can be more for CPU’s that support multi-socket implementations (i.e. ThunderX2).

Note

Max Socket-Count currently supported is 1.

Note

If Socket-Count is more than 1, Core-Count must reflect the number of physical cores per socket.

SMT: Defines the number of threads present on each core.

Note

Max SMT currently supported is 1.

The fields listed below are used to generate /proc/cpuinfo. Their values can be found there on a Linux system using the CPU being modelled. With each field is a description of the format required and an example value.

BogoMIPS : Float in format x.00, i.e. 200.00

Features : String with values seperated with a space, i.e. “fp asimd sha1 sha2 fphp”

CPU-Implementer : Hex value represented as a string, i.e. “0x46”

CPU-Architecture : Integer, i.e. 8

CPU-Variant : Hex value represented as a string, i.e. “0x1”

CPU-Part : Hex value represented as a string, i.e. “0x001”

CPU-Revision : Integer, i.e. 0

Note

If values are unknown then set equal to 0 in the correct format

Package-Count: Used to generate /sys/devices/system/cpu/cpu{0..Core-Count}/topology/{physical_package_id, core_id} files. On each CPU the cores are split into packages. The number of packages used can be calculated by analysing the physical_package_id files on a Linux system using the CPU being modelled.

Note

Core-Count must be wholly divisible by Package-Count.

Note

Max Package-Count currently supported is 1.