Performance portability is a critical capability in parallel simulation codes. Multi-physics codes contain thousands of independent kernels, for which developers set architectural tuning parameters at compile time. Existing approaches to auto-tuning typically run many instances of a kernel while performing a guided search of the performance parameter space. As a result, even fast searches take minutes to complete, and tuning pays off only if the code's behavior changes slowly. In adaptive mesh refinement simulations, for example, where the code's behavior can evolve during a run, tuning a kernel for one set of inputs may not improve performance on another; different tunings of each kernel may be necessary. To tune adaptive applications for the best performance, statically optimized code paths need to be chosen dynamically.
Approach and Benefits
Apollo, an auto-tuning extension of the RAJA portability framework, provides a lightweight and efficient solution to performance portability challenges by tuning parameters each time a kernel runs. Machine-learned classifiers map arbitrary runtime features to the fastest tuning parameter values and are evaluated as simple conditional statements before each kernel executes. This approach lets Apollo respond quickly to changes in the data being processed and frees developers from choosing parameters by hand.
Making an on-line tuning decision can be costly because existing approaches adapt slowly to changing inputs. Apollo avoids this lag by training off-line: decision tree classifiers learn from measured features how to select tuning parameter values directly. The cost of building the model is thus amortized over multiple runs, and model construction is removed from the runtime entirely.
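As a rough illustration of what that off-line training consumes, the sketch below times each candidate variant of a single kernel and emits the measured feature together with the label of the fastest variant as one training sample. The variant labels, kernel body, helper names, and CSV output here are illustrative assumptions, not Apollo's actual interfaces.

// Illustrative sketch only: gather one (feature, fastest-variant) training
// sample for an off-line decision-tree learner. Not Apollo's implementation.
#include <chrono>
#include <cstdio>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Time one variant of the kernel body.
static double time_variant(const std::function<void()>& variant) {
  auto start = std::chrono::steady_clock::now();
  variant();
  return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

int main() {
  const long num_indices = 100000;            // runtime feature (see Table 1)
  std::vector<double> x(num_indices, 1.0);
  auto kernel = [&] {                         // stand-in for a RAJA kernel body
    for (long i = 0; i < num_indices; ++i) x[i] *= 1.0001;
  };

  // Candidate tunings; in Apollo these would be compiled RAJA policy variants.
  std::vector<std::pair<std::string, std::function<void()>>> variants = {
    {"seq_exec", kernel}, {"omp_exec", kernel}};

  std::string best_label;
  double best_time = 1e300;
  for (const auto& v : variants) {
    const double t = time_variant(v.second);
    if (t < best_time) { best_time = t; best_label = v.first; }
  }

  // One CSV row of training data for the off-line learner.
  std::printf("%ld,%s\n", num_indices, best_label.c_str());
  return 0;
}

In Apollo's actual pipeline, such samples are recorded with Caliper across full application runs, and the decision trees are fit off-line from the accumulated data.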
In tests, this combination of off-line modeling and lightweight runtime decisions selected the fastest parameter settings up to 98% of the time and improved performance by up to 4.8x.
Applications
Apollo’s first version runs as an extension of the RAJA programming model and uses Caliper to collect training data. Tests with CleverLeaf and LULESH have yielded compelling performance data.
- RAJA maintains single-source code while leveraging standard C++ features for portability and integration; a minimal sketch of this single-source pattern follows this list. RAJA interfaces with Apollo's recording and tuning components, which are decoupled from the application, so the same executable can run in either mode. This flexibility allows the decision model to be retrained on new data without recompiling the application.
- Caliper is an introspection tool that connects independent data through contextual annotations. It measures kernel runtimes and stores arbitrary attribute values representing additional features. Apollo uses these capabilities to collect features that can be added to the model.
- CleverLeaf is an adaptive mesh refinement application whose kernels work on variably sized patches of data, making it an ideal test case for a dynamic auto-tuning framework. In strong scaling, where the adaptively refined mesh is divided into smaller subdomains, Apollo speeds up execution.
- LULESH is a proxy application that models shock hydrodynamics. With kernels iterating over domain elements as well as material regions, the code’s algorithms and programming features are representative of standard C++ applications.
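As a minimal sketch of the single-source pattern referenced above, the kernel below is written once and instantiated under two RAJA execution policies; the loop body, the daxpy name, and the hard-coded threshold standing in for Apollo's runtime choice are assumptions for illustration.

// Minimal single-source RAJA sketch (illustrative). One loop body compiles
// under multiple execution policies; Apollo's role is to choose among such
// pre-compiled variants at run time using features like num_indices.
#include "RAJA/RAJA.hpp"
#include <vector>

template <typename ExecPolicy>
void daxpy(double a, const double* x, double* y, RAJA::Index_type n) {
  RAJA::forall<ExecPolicy>(RAJA::RangeSegment(0, n),
                           [=](RAJA::Index_type i) { y[i] += a * x[i]; });
}

int main() {
  const RAJA::Index_type n = 1 << 20;
  std::vector<double> x(n, 1.0), y(n, 2.0);

  // Stand-in for Apollo's decision: a fixed threshold chooses the policy here;
  // Apollo instead evaluates a learned model before each kernel runs.
  if (n <= 19965)
    daxpy<RAJA::seq_exec>(3.0, x.data(), y.data(), n);
  else
    daxpy<RAJA::omp_parallel_for_exec>(3.0, x.data(), y.data(), n);  // requires OpenMP-enabled RAJA
  return 0;
}

Because each policy variant is compiled ahead of time, switching between them costs only the evaluation of the model's conditionals before the kernel launches.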
Table 1. Apollo runtime features: During auto-tuning, kernel, instruction, and application features are collected for each RAJA kernel. Additional multi-physics code features are within scope for future versions of Apollo.
| Feature Type | Feature | Description |
| --- | --- | --- |
| Kernel | func | Name of function |
| | func_size | Total number of instructions in kernel body |
| | index_type | Type of RAJA IndexSet |
| | loop_id | Address identifying kernel |
| | num_indices | Number of indices in each segment |
| | num_segments | Number of segments |
| | stride | Stride of indices in each segment |
| Instruction | add, and, call, cmp, comisd, divsd, inc, jb, lea, loop, maxsd, minsd, mov, mulpd, nop, pop, push, pxor, ret, sar, shl/sal, sqrtsd, sub, test, ucomisd, unpckhpd, unpcklpd, xor, xorps | Instruction mnemonics are grouped to save space (for example, the add mnemonic corresponds to add, addpd, and addsd) |
| Application | timestep | Current cycle |
| | problem_size | Global problem size |
| | problem_name | Name of the input deck |
| | patch_id | Numeric ID assigned to the AMR subdomain being processed (CleverLeaf only) |
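The listing below shows an example of the code Apollo generates from a trained decision tree. It runs before a RAJA kernel executes: the kernel's num_indices feature is compared against learned thresholds, and the predicted execution policy is written to Apollo's model-parameter blackboard.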
void
apollo_begin_forall_iset(const RAJA::IndexSet& iset, void*) {
  RAJA::apollo::model_params p;
  const int num_indices = iset.getLength();
  // Use decision tree to predict best execution policy
  if ( num_indices <= 103938.0 ) {
    if ( num_indices <= 19965.5 ) {
      p.policy = seq_exec;
    } else {
      p.policy = omp_exec;
    }
  } else {
    // Both leaves of this branch select omp_exec; the generated tree
    // can retain redundant splits.
    if ( num_indices <= 2382100.0 ) {
      p.policy = omp_exec;
    } else {
      p.policy = omp_exec;
    }
  }
  // Write predicted model parameters to the blackboard
  RAJA::apollo::set_model_params(p);
}