Configuring SimEng¶
SimEng configuration files are written in YAML and provide values for the parameters of the processor architecture to be simulated. Pre-written model configuration files can be found in the configs/ directory.
The configuration files are split into several sections, each of which is associated with a specific area of the architecture modelled.
Core¶
SimEng cores can be one of three types:
emulation
An atomic “emulation-style” core which, per cycle, processes an instruction in its entirety before proceeding to the next instruction.
inorderpipelined
An in-order pipeline processor core with discrete fetch, decode, execute, and writeback stages.
outoforder
A complex superscalar out-of-order core, similar to those found in modern high-performance processors.
These core types are primarily referred to as core “archetypes”.
Note
Currently, the configuration files do not take into account the core archetype being modelled and require all parameters without default values to be defined, even if unused (e.g. reservation station definitions for an emulation core archetype). Future developments plan to exempt those options not used under the selected core archetype.
Configuration options within the Core section are concerned with the functionality of the simulated processor pipeline. These include:
- ISA
The Instruction Set Architecture under simulation. The options are AArch64 and rv64.
- Simulation-Mode
The core archetype to use. The options are emulation, inorderpipelined, and outoforder.
- Clock-Frequency-GHz
The clock frequency, in GHz, of the processor being modelled.
- Timer-Frequency-MHz
The frequency, in MHz, at which the CPU’s internal counter timer is updated. For example, for models based on an Arm ISA, this dictates how often the Virtual Counter Timer system register is updated to the number of cycles completed. This value is then accessible to programmers through mrs x0 CNTVCT_el0.
- Micro-Operations
Whether to enable instruction splitting for pre-defined Macro Operations or not.
- Vector-Length (Only in use when ISA is AArch64)
The vector length, in bits, used by instructions belonging to Arm’s Scalable Vector Extension. Supported vector lengths are those between 128 and 2048 in increments of 128.
- Streaming-Vector-Length (Only in use when ISA is AArch64)
The vector length, in bits, used by instructions belonging to Arm’s Scalable Matrix Extension. Although the architecturally valid vector lengths are powers of 2 between 128 and 2048 inclusive, the supported vector lengths are those between 128 and 2048 in increments of 128.
- Compressed (Only in use when ISA is rv64)
Enables the RISC-V compressed extension. If set to false and compressed instructions are supplied, a misaligned program counter exception is usually thrown.
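As an illustration, a Core section for an AArch64 out-of-order model might look like the sketch below. The option names are those listed above (with the timer option assumed to be hyphenated throughout); the values are illustrative assumptions and are not taken from any particular pre-written model.

```yaml
Core:
  ISA: AArch64
  Simulation-Mode: outoforder
  Clock-Frequency-GHz: 2.5        # illustrative value
  Timer-Frequency-MHz: 100        # illustrative value; key spelling assumed
  Micro-Operations: True
  Vector-Length: 512              # a multiple of 128 between 128 and 2048
  Streaming-Vector-Length: 512    # a multiple of 128 between 128 and 2048
```

For an rv64 model, the two vector length options would be replaced by Compressed.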
Fetch¶
This section is concerned with the parameterisation of the fetch unit and its internal structures.
- Fetch-Block-Size
The size, in bytes, of the block fetched from the instruction cache.
- Loop-Buffer-Size
The number of Macro-ops which can be stored in the loop buffer.
- Loop-Detection-Threshold
The number of commits a unique branch instruction must go through, without another branch instruction being committed, before a loop is detected and the loop buffer is filled.
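A Fetch section using these three options could be sketched as follows. All values are illustrative assumptions.

```yaml
Fetch:
  Fetch-Block-Size: 32            # bytes fetched from the instruction cache per block
  Loop-Buffer-Size: 64            # Macro-ops the loop buffer can hold
  Loop-Detection-Threshold: 4     # commits of one branch before a loop is detected
```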
Process Image¶
This allows the stack and heap size to be altered as required.
- Heap-Size
Size of the Heap; defined in bytes.
- Stack-Size
Size of the Stack in memory; defined in bytes.
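A minimal sketch of this section is given below. The top-level key Process-Image is an assumption based on the hyphenated naming of the other sections, and the sizes are illustrative; both should be checked against a pre-written file in configs/.

```yaml
Process-Image:                    # assumed spelling of the top-level key
  Heap-Size: 1073741824           # 1 GiB, in bytes (illustrative)
  Stack-Size: 1048576             # 1 MiB, in bytes (illustrative)
```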
Register-set¶
The number of physical registers, of each type, to be used under the register renaming scheme. The register types supported are ISA dependent. The types available for each supported ISA are as follows:
AArch64
- GeneralPurpose-Count
The number of physical general-purpose registers.
- FloatingPoint/SVE-Count
The number of physical floating point registers. Also considered as the number of Arm SVE extension z registers where appropriate.
- Predicate-Count (Optional)
The number of physical Arm SVE extension predicate registers.
- Conditional-Count
The number of physical status/flag/conditional-code registers.
- Matrix-Count (Optional)
The number of physical za Arm SME registers.
RISC-V
- GeneralPurpose-Count
The number of physical general-purpose registers.
- FloatingPoint-Count
The number of physical floating point registers.
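For an AArch64 model, the register counts might be written as in the sketch below. The capitalisation of the top-level key and every count are assumptions made for illustration.

```yaml
Register-Set:                     # assumed capitalisation of the top-level key
  GeneralPurpose-Count: 154       # illustrative value
  FloatingPoint/SVE-Count: 90     # illustrative value
  Predicate-Count: 17             # optional; illustrative value
  Conditional-Count: 128          # illustrative value
  Matrix-Count: 2                 # optional; illustrative value
```

An rv64 model would define only GeneralPurpose-Count and FloatingPoint-Count.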
Pipeline-widths¶
This section is concerned with the width of the simulated processor pipeline at specific stages, including:
- Commit
The commitment/retirement width from the re-order buffer.
- FrontEnd
The width of the pipeline before the execution stage (also excludes the dispatch/issue stage if simulating an outoforder core archetype).
- LSQ-Completion
The width between the load/store queue unit and the write-back unit (translates to the number of load instructions that can be sent to the write-back unit per cycle).
Excluding the Commit option, the value given for these widths denotes the number of Micro-Ops, as opposed to Macro-ops, if the simulated architecture supports them.
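The three widths could be set as in this sketch. The top-level key capitalisation and the values are illustrative assumptions.

```yaml
Pipeline-Widths:                  # assumed capitalisation of the top-level key
  Commit: 4                       # instructions retired from the re-order buffer per cycle
  FrontEnd: 4                     # Micro-Ops per cycle before the execution stage
  LSQ-Completion: 2               # loads sent to the write-back unit per cycle
```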
Queue-sizes¶
This section defines the size of specific architectural queues. These queues currently include:
- ROB
The size of the re-order buffer.
- Load
The size of the load queue within the load/store queue unit.
- Store
The size of the store queue within the load/store queue unit.
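A sketch of the queue definitions, again with an assumed top-level key capitalisation and illustrative sizes:

```yaml
Queue-Sizes:                      # assumed capitalisation of the top-level key
  ROB: 128                        # re-order buffer entries
  Load: 40                        # load queue entries
  Store: 24                       # store queue entries
```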
Branch-Predictor¶
The Branch-Predictor section contains the options used to parameterise the branch predictor used during simulation. Currently, the options are minimal, but planned developments will add options including the toggling and parameterisation of common branch predictor algorithms/structures.
The current options include:
- Type
The type of branch predictor that is used. The options are Generic and Perceptron. Both types of predictor use a branch target buffer with each entry containing a direction prediction mechanism and a target address. The direction predictor used in Generic is a saturating counter, and in Perceptron it is a perceptron.
- BTB-Tag-Bits
The number of bits used to index the entries in the Branch Target Buffer (BTB). The number of entries in the BTB is obtained from the calculation: 1 << bits. For example, a bits value of 12 would result in a BTB with 4096 entries.
- Saturating-Count-Bits
Only needed for a Generic predictor. The number of bits used in the saturating counter value.
- Global-History-Length
The number of bits used to record the global history of branch directions. Each bit represents one branch direction. For PerceptronPredictor, this dictates the size of the perceptrons (with each perceptron having Global-History-Length + 1 weights).
- RAS-entries
The number of entries in the Return Address Stack (RAS).
- Fallback-Static-Predictor
Only needed for a Generic predictor. The static predictor used when no dynamic prediction is available. The options are either "Always-Taken" or "Always-Not-Taken".
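Putting these options together, a Generic predictor might be configured as in the sketch below. The values are illustrative assumptions; the BTB size follows from the 1 << bits calculation described above.

```yaml
Branch-Predictor:
  Type: "Generic"
  BTB-Tag-Bits: 12                             # 1 << 12 = 4096 BTB entries
  Saturating-Count-Bits: 2                     # Generic only; illustrative value
  Global-History-Length: 10                    # illustrative value
  RAS-entries: 8                               # illustrative value
  Fallback-Static-Predictor: "Always-Taken"    # Generic only
```

With Type set to "Perceptron", the two Generic-only options would not be needed, and a Global-History-Length of 10 would give each perceptron 11 weights.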
L1-Data-Memory¶
This section describes the configuration for the L1 data cache in use.
- Interface-Type
The type of memory interface used to model the L1 data cache. Options are currently Flat or Fixed, which represent a FlatMemoryInterface or FixedMemoryInterface respectively. More information concerning these interfaces can be found here.
Note
Currently, if the chosen Simulation-Mode option is emulation or inorderpipelined, then only a Flat value is permitted. Future developments will seek to allow for more memory interfaces with these simulation archetypes.
L1-Instruction-Memory¶
This section describes the configuration for the L1 instruction cache in use.
- Interface-Type
The type of memory interface used to model the L1 instruction cache. Options are currently Flat or Fixed, which represent a FlatMemoryInterface or FixedMemoryInterface respectively. More information concerning these interfaces can be found here.
Note
Currently, only a Flat value is permitted for the L1 instruction cache interface. Future developments will seek to allow for more memory interfaces to be used with the L1 instruction cache.
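The two L1 memory sections might be combined as in the following sketch, which assumes an outoforder Simulation-Mode so that a Fixed data interface is allowed.

```yaml
L1-Data-Memory:
  Interface-Type: Fixed           # must be Flat for emulation and inorderpipelined
L1-Instruction-Memory:
  Interface-Type: Flat            # currently the only permitted value
```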
LSQ-L1-Interface¶
This section contains the options used to configure SimEng’s interface between the LSQ and the L1 data cache. These options include:
- Access-Latency
The cycle latency of L1 cache access.
- Exclusive
If set to true, only one type of memory access (read or write) can be performed per cycle.
- Load-Bandwidth
The number of bytes permitted to be loaded per cycle.
- Store-Bandwidth
The number of bytes permitted to be stored per cycle.
- Permitted-Requests-Per-Cycle
The number of load and store requests permitted per cycle.
- Permitted-Loads-Per-Cycle
The number of load requests permitted per cycle.
- Permitted-Stores-Per-Cycle
The number of store requests permitted per cycle.
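A sketch of this section is shown below. All values are illustrative assumptions and do not describe any real cache.

```yaml
LSQ-L1-Interface:
  Access-Latency: 4                  # cycles per L1 access
  Exclusive: False                   # loads and stores may share a cycle
  Load-Bandwidth: 32                 # bytes loaded per cycle
  Store-Bandwidth: 16                # bytes stored per cycle
  Permitted-Requests-Per-Cycle: 2    # loads plus stores
  Permitted-Loads-Per-Cycle: 2
  Permitted-Stores-Per-Cycle: 1
```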
Ports¶
Within this section, execution unit port definitions are constructed. Each port is defined with a name and a set of instruction groups/opcodes it supports. The instruction groups/opcodes are architecture-dependent; the available AArch64 instruction groups/opcodes can be found here, and those for RISC-V can be found here.
To define a port, the following structure must be adhered to:
0:
  Portname: <port_name>
  Instruction-Group-Support:
  - <instruction_group>
  - ...
  - <instruction_group>
  Instruction-Opcode-Support:
  - <instruction_opcode>
  - ...
  - <instruction_opcode>
...
N-1:
  Portname: <port_name>
  Instruction-Group-Support:
  - <instruction_group>
  - ...
  - <instruction_group>
  Instruction-Opcode-Support:
  - <instruction_opcode>
  - ...
  - <instruction_opcode>
With N as the number of execution ports.
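As a concrete sketch with N = 2, the ports of a small model might be defined as below. The port names P0 and P1 are hypothetical, and the group names are assumed to be among the AArch64 instruction groups; they should be checked against the linked group list. Instruction-Opcode-Support is left out on the assumption that it may be omitted when no individual opcodes are needed.

```yaml
Ports:
  0:
    Portname: P0                  # hypothetical port name
    Instruction-Group-Support:
    - INT_SIMPLE                  # assumed AArch64 group name
    - BRANCH                      # assumed AArch64 group name
  1:
    Portname: P1                  # hypothetical port name
    Instruction-Group-Support:
    - LOAD                        # assumed AArch64 group name
    - STORE                       # assumed AArch64 group name
```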
Reservation-Stations¶
The relationships between reservation stations and the execution ports, i.e. which reservation stations map to which execution ports, are defined in this section. The configuration of each reservation station contains a size value, a dispatch rate value, and a set of port names, previously defined in the Ports section.
The following structure must be adhered to when defining a reservation station:
0:
  Size: <number_of_entries>
  Dispatch-Rate: <number_of_permitted_dispatches_per_cycle>
  Ports:
  - <port_name>
  - ...
  - <port_name>
...
N-1:
  Size: <number_of_entries>
  Dispatch-Rate: <number_of_permitted_dispatches_per_cycle>
  Ports:
  - <port_name>
  - ...
  - <port_name>
With N as the number of reservation stations. Each execution port must be mapped to a reservation station.
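For example, two reservation stations that each serve a single port could be sketched as follows. P0 and P1 are hypothetical port names assumed to have been defined in the Ports section, and the sizes and dispatch rates are illustrative.

```yaml
Reservation-Stations:
  0:
    Size: 32                      # entries (illustrative)
    Dispatch-Rate: 2              # dispatches permitted per cycle (illustrative)
    Ports:
    - P0                          # hypothetical port name
  1:
    Size: 24
    Dispatch-Rate: 1
    Ports:
    - P1                          # hypothetical port name
```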
Execution-Units¶
An execution unit can be configured to optionally include an internal pipeline and a set of instruction groups for operation blocking. The instruction groups referenced here are the same as those used in the Ports section.
The following structure must be adhered to when defining an execution unit:
0:
  Pipelined: <True/False>
  Blocking-Groups:
  - <instruction_group>
  - ...
  - <instruction_group>
...
N-1:
  Pipelined: <True/False>
  Blocking-Groups:
  - <instruction_group>
  - ...
  - <instruction_group>
With N as the number of execution units. The number of execution units should be equivalent to the number of execution ports.
Note, the indexing used in both the Ports and Execution-Units sections provides a relationship mapping: the 0th execution port maps to the 0th execution unit.
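A sketch for a model with two execution ports, and therefore two execution units, is given below. The blocking group name is assumed to be a valid AArch64 instruction group, and Blocking-Groups is left out of the second unit on the assumption that it is optional.

```yaml
Execution-Units:
  0:                              # serves the 0th port
    Pipelined: True
    Blocking-Groups:
    - INT_DIV_OR_SQRT             # assumed AArch64 group name
  1:                              # serves the 1st port
    Pipelined: True
```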
Latencies¶
The execution latency and throughput can be configured under the Latencies section. A latency/throughput pair can be defined for a set of instruction groups/opcodes; the groups/opcodes available are the same as the set discussed in the Ports section.
The execution latency defines the total number of cycles an instruction will spend in an execution unit. The throughput is how many cycles an instruction will block another instruction entering the execution unit. In non-pipelined execution units, the throughput is equal to the latency.
The following structure must be adhered to when defining group latencies:
0:
  Instruction-Groups:
  - <instruction_group>
  - ...
  - <instruction_group>
  Instruction-Opcodes:
  - <instruction_opcode>
  - ...
  - <instruction_opcode>
  Execution-Latency: <number_of_cycles>
  Execution-Throughput: <number_of_cycles>
...
N-1:
  Instruction-Groups:
  - <instruction_group>
  - ...
  - <instruction_group>
  Instruction-Opcodes:
  - <instruction_opcode>
  - ...
  - <instruction_opcode>
  Execution-Latency: <number_of_cycles>
  Execution-Throughput: <number_of_cycles>
With N as the number of user-defined latency mappings. The default latencies, both execution and throughput, for those instruction groups not covered are 1.
Note, unlike other operations, the execution latency defined for load/store operations is triggered in the LoadStoreQueue as opposed to within the execution unit (more details here).
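The sketch below shows two hypothetical latency mappings: a multiply that spends five cycles in a pipelined unit but lets a new instruction enter every cycle, and a divide whose throughput equals its latency, as it would in a non-pipelined unit. The group names are assumed to be valid AArch64 instruction groups, the cycle counts are illustrative, and Instruction-Opcodes is left out on the assumption that it may be omitted.

```yaml
Latencies:
  0:
    Instruction-Groups:
    - INT_MUL                     # assumed AArch64 group name
    Execution-Latency: 5          # cycles spent in the execution unit
    Execution-Throughput: 1       # cycles before the next instruction may enter
  1:
    Instruction-Groups:
    - INT_DIV_OR_SQRT             # assumed AArch64 group name
    Execution-Latency: 39
    Execution-Throughput: 39      # blocks the unit for the full latency
```

Any instruction group not listed would fall back to the default latency and throughput of 1.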
CPU Info¶
This section contains information about the physical properties of the CPU. These fields are currently only used to generate a replica of the required Special Files directory structure.
- Generate-Special-Dir
Values are either True or False. Dictates whether or not SimEng should generate the Special-Files directory tree at runtime. If your code requires Special-Files but you wish to use your own / existing files from a real system, you will need to set this option to False. The files which are currently generated / supported in SimEng are:
/proc/cpuinfo
/proc/stat
/sys/devices/system/cpu/online
/sys/devices/system/cpu/cpu{0..CoreCount}/topology/core_id
/sys/devices/system/cpu/cpu{0..CoreCount}/topology/physical_package_id
- Special-File-Dir-Path
A string giving the absolute path to the root directory where the Special-Files will be generated OR where existing Special-Files are located. This is optional, and defaults to SIMENG_BUILD_DIRECTORY/specialFiles. The root directory must already exist.
- Core-Count
Defines the total number of physical cores (not including threads).
Note
Max Core-Count currently supported is 1.
- Socket-Count
Defines the number of sockets used. Typically set to 1, but can be more for CPUs that support multi-socket implementations (e.g. ThunderX2).
Note
Max Socket-Count currently supported is 1.
Note
If Socket-Count is more than 1, Core-Count must reflect the number of physical cores per socket.
- SMT
Defines the number of threads present on each core.
Note
Max SMT currently supported is 1.
The fields listed below are used to generate /proc/cpuinfo. Their values can be found there on a Linux system using the CPU being modelled. With each field is a description of the format required and an example value.
BogoMIPS : Float in format x.00, e.g. 200.00
Features : String with values separated with a space, e.g. “fp asimd sha1 sha2 fphp”
CPU-Implementer : Hex value represented as a string, e.g. “0x46”
CPU-Architecture : Integer, e.g. 8
CPU-Variant : Hex value represented as a string, e.g. “0x1”
CPU-Part : Hex value represented as a string, e.g. “0x001”
CPU-Revision : Integer, e.g. 0
Note
If a value is unknown, set it to 0 in the correct format.
- Package-Count
Used to generate /sys/devices/system/cpu/cpu{0..Core-Count}/topology/{physical_package_id, core_id} files. On each CPU the cores are split into packages. The number of packages used can be calculated by analysing the physical_package_id files on a Linux system using the CPU being modelled.
Note
Core-Count must be wholly divisible by Package-Count.
Note
Max Package-Count currently supported is 1.
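A sketch of this section, reusing the example values quoted in the field descriptions above and the currently supported maximum of 1 for each count, might read as follows. The top-level key CPU-Info is an assumption based on the hyphenated naming of the other sections, and the optional Special-File-Dir-Path is left out so that its default applies.

```yaml
CPU-Info:                         # assumed spelling of the top-level key
  Generate-Special-Dir: True
  Core-Count: 1
  Socket-Count: 1
  SMT: 1
  BogoMIPS: 200.00
  Features: fp asimd sha1 sha2 fphp
  CPU-Implementer: "0x46"
  CPU-Architecture: 8
  CPU-Variant: "0x1"
  CPU-Part: "0x001"
  CPU-Revision: 0
  Package-Count: 1
```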