NOU INTUIT | Introduction to performance optimization using Intel SW tools. Lecture 9: Optimizing compiler. Static and dynamic profiler. Memory manager. Code generator


Published: 12.07.2012 | Access: free | Students: 355 / 24 | Rating: 4.00 / 4.20 | Duration: 11:07:00

Specialties: Programmer

Tags: basic, basic block, call graph, linux, loop optimization, microprocessor, objective-c, openmp, optimizing compiler, permute, pipelining, prefetcher, register allocation, remark


Code generator

Code generation (CG) is a part of the compilation process. The code generator converts a correct internal representation into a sequence of instructions that can run on a particular processor architecture. CG may apply various machine-dependent optimizations. A code generator can be shared by several compilers, each of which produces the intermediate representation that the code generator takes as input.

Basic actions:

Conversion of the internal representation into instructions of the given processor architecture;
Architecture-specific optimizations;
Simple intrinsic substitution (inlining);
Memory alignment of basic blocks;
Procedure call preparation: loading the appropriate variables into registers and/or onto the stack for parameter passing;
The same for the called procedure, plus stack allocation of local variables;
Instruction scheduling;
Register allocation;
Jump distance calculation;
…
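
The first action in the list can be illustrated with a very small sketch. It assumes a hypothetical three-address IR (`IrInsn`, `IrOp` and `emit_insn` are names invented here, not part of any real compiler) whose operands already live in registers, and it emits AT&T-style x86 text. Only commutative operations are modelled, which keeps the two-operand lowering simple.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical three-address IR: dst = lhs OP rhs, all operands are registers.
   Only commutative operations are modelled to keep the sketch short. */
typedef enum { IR_ADD, IR_OR } IrOp;

typedef struct {
    IrOp op;
    const char *dst, *lhs, *rhs;   /* register names such as "eax" */
} IrInsn;

/* Emit AT&T-syntax text for one IR instruction into buf.
   x86 arithmetic is two-operand (dst = dst OP src), so a movl is needed
   unless dst already holds one of the inputs.
   Returns the number of characters written. */
int emit_insn(const IrInsn *in, char *buf, size_t cap)
{
    static const char *const mnemonic[] = { "addl", "orl" };
    const char *op = mnemonic[in->op];
    int n;

    if (strcmp(in->dst, in->lhs) == 0)
        n = snprintf(buf, cap, "%s %%%s, %%%s\n", op, in->rhs, in->dst);
    else if (strcmp(in->dst, in->rhs) == 0)      /* commutative: swap inputs */
        n = snprintf(buf, cap, "%s %%%s, %%%s\n", op, in->lhs, in->dst);
    else
        n = snprintf(buf, cap, "movl %%%s, %%%s\n%s %%%s, %%%s\n",
                     in->lhs, in->dst, op, in->rhs, in->dst);
    assert(n >= 0 && (size_t)n < cap);           /* output must fit in buf */
    return n;
}
```

For `eax = ebx + ecx` this sketch produces a `movl %ebx, %eax` followed by `addl %ecx, %eax`. A real code generator works on a much richer representation and handles memory operands, non-commutative operations and instruction selection.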

Register allocation

One of the basic tasks of the code generator is register allocation.

Register allocation is the mapping of program variables onto the microprocessor's register set. It can be performed within a single basic block (local register allocation) or across an entire procedure (global register allocation).

Typically, the number of variables in a program is much greater than the number of available physical registers, so variables are kept in memory and loaded into registers before use. After use, the register contents have to be saved back to memory. This register/memory traffic (register save and load operations) should be minimized for better performance, so the compiler should try to keep the most frequently used variables in registers. However, it is hard to determine how often each variable is used. Storing a register's value to memory and reloading it later because there are not enough registers is called register spilling, and it costs performance.

Register allocation is performed via interference graph coloring.

The implementation of register allocation with graph coloring contains the following steps:

Identify the live range of each variable (the program region in which the variable is used) and give each live range a unique name.
Build the interference graph. Each live range corresponds to a vertex. If two live ranges intersect, there is an edge between their vertices. Each vertex must get a color different from the colors of all vertices connected to it. The number of colors used is the number of registers needed.
Color the graph.
If coloring fails, split some vertex (that is, store the register to memory during the variable's live range) and retry the coloring.
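
The steps above can be sketched in a few lines of C. This is a simplified illustration under stated assumptions, not the allocator of any particular compiler: live ranges are modelled as single half-open intervals of instruction indices (`LiveRange` is a hypothetical type), the coloring is a plain greedy pass in index order, and a variable that cannot be colored is simply marked as spilled (`-1`) instead of having its live range split and the coloring retried.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_VARS 16

/* Hypothetical live range: half-open interval [start, end) of instruction indices. */
typedef struct { int start, end; } LiveRange;

/* Build the interference graph: two variables interfere when their
   live ranges overlap. adj is a symmetric adjacency matrix. */
void build_interference(const LiveRange *lr, int n, bool adj[MAX_VARS][MAX_VARS])
{
    assert(n <= MAX_VARS);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            adj[i][j] = (i != j) &&
                        lr[i].start < lr[j].end && lr[j].start < lr[i].end;
}

/* Greedy coloring with k colors (registers).
   color[v] receives a register number in [0, k), or -1 if v is spilled.
   Returns the number of spilled variables. */
int color_graph(bool adj[MAX_VARS][MAX_VARS], int n, int k, int color[MAX_VARS])
{
    int spilled = 0;
    assert(n <= MAX_VARS && k <= MAX_VARS);
    for (int v = 0; v < n; v++) {
        bool used[MAX_VARS] = { false };
        /* Collect colors already taken by colored neighbours. */
        for (int u = 0; u < v; u++)
            if (adj[v][u] && color[u] >= 0)
                used[color[u]] = true;
        color[v] = -1;
        for (int c = 0; c < k; c++)
            if (!used[c]) { color[v] = c; break; }
        if (color[v] < 0)
            spilled++;              /* no free register: spill this variable */
    }
    return spilled;
}
```

With four live ranges `[0,4)`, `[1,3)`, `[2,5)`, `[4,6)`, three registers are enough; with two registers the third variable is spilled. Production allocators usually choose the coloring order and the spill candidate with cost heuristics rather than by index.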

Register allocation is better when the registers hold the most frequently used data. Dynamic profiler information can be very useful for this.

Fig. 9.6.

Data dependence for register reuse

The dependence issue was raised in previous lectures. Dependences are computed in order to prove the validity of permutation optimizations. The code generator also uses dependences to identify opportunities to reuse data in calculations, which avoids unnecessary memory loads and write-backs.

For example:

DO I = 1, N
  A (I+1) = A (I) F (...)
END DO

It makes sense to keep A(I+1) in a register, so that the next iteration does not have to load A(I) from memory.
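
The same idea can be written out by hand in C. The sketch below is illustrative only: the operator between A(I) and F(...) is not shown in the loop above, so multiplication is assumed here, and `f` is a hypothetical stand-in for F(...). The second function keeps the value just stored in a local variable, which a compiler can hold in a register.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for F(...); any function of the index would do. */
static double f(size_t i) { return 1.0 + (double)(i % 3); }

/* Straightforward form: every iteration reads a[i] from memory,
   although it was written by the previous iteration. */
void recurrence_naive(double *a, size_t n)
{
    assert(a != NULL);
    for (size_t i = 0; i + 1 < n; i++)
        a[i + 1] = a[i] * f(i);     /* multiplication is an assumption */
}

/* Register-reuse form: the value stored to a[i + 1] is kept in 'prev',
   so the next iteration needs no load of a[i]. */
void recurrence_reuse(double *a, size_t n)
{
    assert(a != NULL);
    if (n == 0)
        return;
    double prev = a[0];
    for (size_t i = 0; i + 1 < n; i++) {
        prev = prev * f(i);
        a[i + 1] = prev;
    }
}
```

Both functions compute the same array; the second one only makes explicit the reuse that the code generator derives from the dependence between iterations.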

Instruction scheduling

Instruction scheduling is a compiler optimization used to improve instruction-level parallelism. It is usually done by changing the order of instructions to reduce stalls in the processor pipeline. Another reason for instruction scheduling is to improve the behavior of the memory subsystem by moving a memory read well ahead of its use.

A processor has its own mechanism for scheduling instructions and distributing them across the execution units. This mechanism looks ahead at the incoming instructions, but it cannot be fully effective because its look-ahead window is limited.

Instructions can be reordered according to the following considerations:

Place a memory read as far as possible before the use of its result;
Interleave instructions that use different execution units of the processor;
Place instructions that use the same variable close together to simplify register selection.

Scheduling can be performed within a single basic block or within a superblock that combines several basic blocks. Some instructions can be moved beyond the boundaries of their basic block.

Instruction scheduling can be carried out both before and after register allocation.
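
As an illustration of scheduling inside one basic block, here is a minimal list-scheduling sketch. It rests on a deliberately simple, assumed machine model: one instruction issues per cycle, each instruction has a fixed latency, and an instruction stalls until its inputs are available. The types and functions (`Insn`, `simulate`, `list_schedule`) are hypothetical names for this example.

```c
#include <assert.h>

#define MAX_INSNS 16

typedef struct {
    int latency;   /* cycles until the result is available */
    int ndeps;     /* number of inputs produced by other instructions */
    int deps[2];   /* indices of the producing instructions */
} Insn;

/* Cycle at which the last result becomes available when the block is
   issued in the given order (order must list producers before users). */
int simulate(const Insn *code, const int *order, int n)
{
    int ready[MAX_INSNS];
    int cycle = 0, finish = 0;
    for (int k = 0; k < n; k++) {
        const Insn *in = &code[order[k]];
        for (int d = 0; d < in->ndeps; d++)
            if (ready[in->deps[d]] > cycle)
                cycle = ready[in->deps[d]];      /* stall on an input */
        ready[order[k]] = cycle + in->latency;
        if (ready[order[k]] > finish)
            finish = ready[order[k]];
        cycle++;                                 /* single issue per cycle */
    }
    return finish;
}

/* Greedy list scheduling: repeatedly pick the instruction that can start
   earliest; ties go to the longer latency (so loads are issued early). */
void list_schedule(const Insn *code, int n, int *order)
{
    int ready[MAX_INSNS];
    int done[MAX_INSNS] = { 0 };
    int cycle = 0;
    assert(n <= MAX_INSNS);
    for (int k = 0; k < n; k++) {
        int best = -1, best_start = 0;
        for (int i = 0; i < n; i++) {
            int start = cycle, ok = !done[i];
            for (int d = 0; ok && d < code[i].ndeps; d++) {
                int p = code[i].deps[d];
                if (!done[p]) ok = 0;            /* producer not scheduled yet */
                else if (ready[p] > start) start = ready[p];
            }
            if (!ok) continue;
            if (best < 0 || start < best_start ||
                (start == best_start && code[i].latency > code[best].latency)) {
                best = i;
                best_start = start;
            }
        }
        assert(best >= 0);                       /* fails only on cyclic deps */
        order[k] = best;
        done[best] = 1;
        ready[best] = best_start + code[best].latency;
        cycle = best_start + 1;
    }
}
```

For a block of the form load, add, load, add, add (the last add consuming both earlier sums), with a 3-cycle load latency, the original order finishes at cycle 9 in this model, while the scheduled order issues both loads first and finishes at cycle 6. Real schedulers use detailed processor models and also take register pressure into account.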

An example of a processor and architectural optimization (using cmovne)

A control-flow dependence can be replaced by a data dependence using cmovne. The branch disappears, which speeds up code with poorly predicted branches.
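
The transformation can also be expressed at the source level. The sketch below is only an illustration with hypothetical function names: the first function selects a value with a branch, the second computes the same value with a mask and no branch. Whether a compiler emits a jump, a cmov or mask arithmetic for either form depends on the compiler, its options and the target.

```c
#include <assert.h>
#include <stddef.h>

/* Control dependence: the value of aa depends on which path is taken. */
int select_branch(int cond)
{
    int aa;
    if (cond)
        aa = 2;
    else
        aa = 0;
    return aa;
}

/* Data dependence: mask is all ones when cond is nonzero and zero
   otherwise, so (mask & 2) is 2 or 0 without any branch in the source. */
int select_branchless(int cond)
{
    unsigned mask = 0u - (unsigned)(cond != 0);
    return (int)(mask & 2u);
}

/* Inner-loop style update using the branchless selection. */
void add_selected(int *a, int n, int cond)
{
    assert(a != NULL && n >= 0);
    for (int j = 0; j < n; j++)
        a[j] += select_branchless(cond);
}
```

Both selection functions return 2 for any nonzero `cond` and 0 otherwise.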

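#include <stdio.h>

int main(void) {
    volatile int t1, t2, t3;   /* volatile: reloaded from memory on every use */
    int i, j, aa;
    int a[1000] = {0};         /* zero-initialized so that reading a[50] is well defined */

    t1 = t2 = t3 = 0;
    aa = 0;
    for (i = 1; i < 100000; i++) {
        for (j = 1; j < 1000; j++) {
            if (t1 | t2 | t3)  /* poorly predictable: t3 alternates between 1 and 0 */
                aa = 2;
            else
                aa = 0;
            a[j] = a[j] + aa;
            t3 = j % 2;
        }
    }
    printf("%d\n", a[50]);
    return 0;
}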

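Build commands and measured run times:

icc test.c -O2 -xP -o a.out    time ./a.out    0m0.379s
icc test.c -O2     -o b.out    time ./b.out    0m0.441s

The -xP option (/QxP on Windows) is the older spelling; with newer compiler versions use -xSSE3 (/QxSSE3).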

This example demonstrates how the choice of target instruction set can change application performance.

Assembly for the faster build (with cmovne):

..B1.3:                         # Preds ..B1.9 ..B1.2
  movl    4008(%esp), %ebx      #12.7
  orl     4004(%esp), %ebx      #12.10
  movl    $2, %edx              #15.6
  orl     4000(%esp), %ebx      #12.13
  movl    $0, %ebx              #15.6
  cmovne  %edx, %ebx            #15.6
  addl  %ebx, (%esp,%eax,4)     #16.14
  movl  %eax, %edx              #17.9
  andl  $-2147483647, %edx      #17.9
  jge   ..B1.9    # Prob 50%   #17.9
                  # LOE eax edx ecx esi edi
..B1.10:                       # Preds ..B1.3
  subl  $1, %edx               #17.9
  orl   $-2, %edx              #17.9
  addl  $1, %edx               #17.9
                  # LOE eax edx ecx esi edi
..B1.9:                        # Preds ..B1.3 ..B1.10
  movl      %edx, 4000(%esp)   #17.4
  addl      $1, %eax           #11.17
  cmpl      $1000, %eax        #11.12
  jl        ..B1.3

Assembly for the build without cmovne:

..B1.3:               # Preds ..B1.9 ..B1.2
  movl  4008(%esp), %ecx          #12.7
  orl   4004(%esp), %ecx          #12.10
  orl   4000(%esp), %ecx          #12.13
  movl  $2, %ecx                  #15.6
  jne   ..L1          # Prob 50%  #15.6
  movl  $0, %ecx                  #15.6
..L1:                             #
  addl  %ecx, (%esp,%edx,4)       #16.14
  movl  %edx, %ecx                #17.9
  andl  $-2147483647, %ecx        #17.9
  jge   ..B1.9        # Prob 50%  #17.9
                      # LOE eax edx ecx ebx esi edi
..B1.10:              # Preds ..B1.3
  subl  $1, %ecx                  #17.9
  orl   $-2, %ecx                 #17.9
  addl  $1, %ecx                  #17.9
                      # LOE eax edx ecx ebx esi edi
..B1.9:               # Preds ..B1.3 ..B1.10
  movl  %ecx, 4000(%esp)          #17.4
  addl  $1, %edx                  #11.17
  cmpl  $1000, %edx               #11.12
  jl    ..B1.3
