Published: 12.07.2012 | Access: free | Students: 293 / 18 | Rating: 4.00 / 4.20 | Duration: 11:07:00
Specialties: Programmer
Lecture 5:

# Optimizing compiler. Vectorization


### Option for vectorization control

/Qvec-report[n]

Controls the amount of vectorizer diagnostic information:

• n=0 no diagnostic information
• n=1 indicate vectorized loops (DEFAULT)
• n=2 indicate vectorized/non-vectorized loops
• n=3 indicate vectorized/non-vectorized loops and prohibiting data dependence information
• n=4 indicate non-vectorized loops
• n=5 indicate non-vectorized loops and prohibiting data dependence information

Usage: icl -c -Qvec_report3 loop.c

Diagnostic examples:

• C:\loops\loop1.c(5) (col. 1) : remark: LOOP WAS VECTORIZED.
• C:\loops\loop3.c(5) (col. 1): remark: loop was not vectorized: vectorization possible but seems inefficient.
• C:\loops\loop6.c(5) (col. 1) : remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.

### Simple criteria of vectorization admissibility

Let us express the vectorization of a loop using Fortran array sections.

A good criterion for vectorization is that introducing array sections does not create a dependency. Fig. 5.6.

There is a dependency because the section A(I+1:I+1+VL) on iteration I and the section A(I+VL:I+2*VL) on iteration I+1 intersect. Fig. 5.7.

There is no dependency because the section A(I-1:I-1+VL) on iteration I and the section A(I+VL:I+2*VL) on iteration I+1 do not intersect.

```fortran
PROGRAM TEST_VEC
  INTEGER,PARAMETER :: N=1000
#ifdef PERF
  INTEGER,PARAMETER :: P=4
#else
  INTEGER,PARAMETER :: P=3
#endif
  INTEGER A(N)
  DO I=1,N-P
    A(I+P)=A(I)
  END DO
  PRINT *,A(50)
END
```

Let us check an assumption:

A loop can be vectorized if the dependence distance is greater than or equal to the number of array elements that fit in a vector register.

Check this with the compiler:

```
ifort test.F90 -o a.out -vec_report3
echo -------------------------------------
ifort test.F90 -DPERF -o b.out -vec_report3
./build.sh
test.F90(11): (col. 1) remark: loop was not vectorized: existence of vector dependence.
-------------------------------------
test.F90(11): (col. 1) remark: LOOP WAS VECTORIZED.
```
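The same assumption can be checked without a compiler. The following is a minimal C sketch, not the compiler's actual algorithm: it emulates a vectorized loop by loading VL elements into a temporary "register" before storing them, and compares the result with the plain scalar loop. The names `shift_scalar`, `shift_vector` and `vector_matches_scalar` are made up for illustration, and VL = 4 assumes 32-bit integers in a 128-bit register.

```c
#include <string.h>

enum { N = 1000, VL = 4 };   /* VL: assumed number of ints per vector register */

/* Reference semantics: the scalar loop a[i + p] = a[i]. */
static void shift_scalar(int *a, int n, int p) {
    for (int i = 0; i + p < n; ++i)
        a[i + p] = a[i];
}

/* Emulated vector execution: read VL elements first, then write VL elements. */
static void shift_vector(int *a, int n, int p) {
    int i = 0;
    for (; i + p + VL <= n; i += VL) {
        int reg[VL];
        memcpy(reg, a + i, sizeof reg);       /* "vector load"  */
        memcpy(a + i + p, reg, sizeof reg);   /* "vector store" */
    }
    for (; i + p < n; ++i)                    /* scalar remainder */
        a[i + p] = a[i];
}

/* Returns 1 if the emulated vector loop gives the scalar result for distance p. */
int vector_matches_scalar(int p) {
    int s[N], v[N];
    for (int i = 0; i < N; ++i)
        s[i] = v[i] = i;
    shift_scalar(s, N, p);
    shift_vector(v, N, p);
    return memcmp(s, v, sizeof s) == 0;
}
```

Under these assumptions `vector_matches_scalar(3)` returns 0 and `vector_matches_scalar(4)` returns 1, which agrees with the compiler remarks for P=3 and P=4.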

### Dependency analysis and directives

To evaluate dependencies, the compiler has to perform two tasks:

1. Alias analysis (detecting pointers that can address the same memory)
2. Definition-use chain analysis

The compiler has to prove that there are no aliased objects and calculate the dependencies precisely. This is a hard task, and sometimes the compiler is unable to solve it.

There are ways to give the compiler additional information:

• Option -ansi_alias (pointers can refer only to objects of the same or a compatible type).
• restrict attributes for pointer arguments (C/C++).
• #pragma ivdep states that there are no dependencies in the following loop (C/C++).
• !DEC$ IVDEP is the Fortran analogue of #pragma ivdep.
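As an illustration of the two C-level hints, here is a minimal sketch. The function names `add_one_restrict` and `add_one_ivdep` are hypothetical. The first function uses the C99 `restrict` qualifier, and the second uses `#pragma ivdep`, guarded by the `__INTEL_COMPILER` macro so that other compilers simply skip it. In both cases the programmer, not the compiler, takes responsibility for the arrays not overlapping.

```c
#include <stddef.h>

/* restrict: the caller promises that dst and src never overlap,
   so the compiler does not have to prove the absence of aliasing. */
void add_one_restrict(int *restrict dst, const int *restrict src, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] + 1;
}

/* ivdep: asks the compiler to ignore assumed dependencies in the next loop.
   The pragma is only emitted for the Intel compiler in this sketch. */
void add_one_ivdep(int *dst, const int *src, size_t n) {
#if defined(__INTEL_COMPILER)
#pragma ivdep
#endif
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] + 1;
}
```

Both functions compute the same result when the arrays are disjoint. If the promise is broken (for example, `dst` points one element past `src`), the vectorized code may produce different values than the scalar loop.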

### Some performance issues for the vectorized code

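```fortran
INTEGER :: A(1000),B(1000)
INTEGER I,K
INTEGER, PARAMETER :: REP = 500000
A = 2
DO K=1,REP
  CALL ADD(A,B)     ! call restored; the name ADD is an assumption (line lost in the source)
END DO
PRINT *,SHIFT,B(101)
CONTAINS
SUBROUTINE ADD(A,B) ! header restored; name and argument list are assumptions
  INTEGER A(1000),B(1000)
  INTEGER I
!DEC$ UNROLL(0)
  DO I=1,1000-SHIFT
    B(I) = A(I+SHIFT)+1
  END DO
END SUBROUTINE
END
```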

Consider a simple test with an assignment that is suitable for vectorization. Let us obtain vectorized code with the Intel Fortran compiler for different values of the SHIFT macro.

/fpp is the option that invokes the preprocessor. The Intel compiler performs vectorization at optimization level 2 or 3 (-O2 or -O3).

Option -Ob0 is used to forbid inlining.

Experiment results

```
ifort test1.F90 -O2  -Ob0 /fpp /DSHIFT=0 -Fea.exe -Qvec_report >a.out 2>&1
ifort test1.F90 -O2  -Ob0 /fpp /DSHIFT=1 -Feb.exe -Qvec_report >b.out 2>&1

time.exe a.exe
0           3
CPU time for command: 'a.exe'
real    0.125 sec
user    0.094 sec
system  0.000 sec
time.exe  b.exe
1           3
CPU time for command: 'b.exe'
real    0.297 sec
user    0.281 sec
system  0.000 sec
```
```
ifort test1.F90 -O2  -Ob0 /fpp /DSHIFT=0 /Fas -Ob0 -S -Fafast.s

fast.s

.B2.5:                          ; Preds .B2.5 .B2.4
$LN83:
;;;    B(I) = A(I+SHIFT)+1
movdqa    xmm1, XMMWORD PTR [eax+ecx*4]                 ;17.11
$LN84:
$LN85:
movdqa    XMMWORD PTR [edx+ecx*4], xmm1                 ;17.4
$LN86:
$LN87:
cmp       ecx, 1000                                     ;16.3
$LN88:
jb        .B2.5         ; Prob 99%                      ;16.3

ifort test1.F90 -O2  -Ob0 /fpp /DSHIFT=1  /Fas -Ob0  -S  -Faslow.s

slow.s

.B2.5:                          ; Preds .B2.5 .B2.4
$LN81:
;;;    B(I) = A(I+SHIFT)+1
movdqu    xmm1, XMMWORD PTR [4+eax+ecx*4]               ;17.11
$LN82:
$LN83:
movdqa    XMMWORD PTR [edx+ecx*4], xmm1                 ;17.4
$LN84:
$LN85:
cmp       ecx, 996                                      ;16.3
$LN86:
jb        .B2.5         ; Prob 99%                      ;16.3
```

CONCLUSION:

In the fast version, aligned instructions (movdqa) are used, so the vector registers are filled faster.

Unaligned instructions (movdqu) are slower. On the latest architectures they show the same performance as aligned instructions when applied to aligned data.

The performance of a vectorized loop depends on where the objects it uses are located in memory. Memory alignment of the data is therefore an important aspect of program performance.

Data structure alignment is the way data is placed in computer memory. The concept covers two distinct but related issues: data alignment and data structure padding.

Data alignment specifies how data is located relative to memory boundaries. This property is usually associated with a data type.

Data structure padding is the insertion of unnamed fields into a data structure in order to preserve the alignment of its fields.
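Both notions can be observed directly in C. The sketch below is only an illustration: the helper `is_aligned`, the structure `sample` and the function `padding_before_value` are hypothetical names, and the exact amount of padding depends on the platform ABI.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if pointer p lies on a multiple of `boundary` bytes. */
int is_aligned(const void *p, size_t boundary) {
    return ((uintptr_t)p % boundary) == 0;
}

/* A structure in which the compiler normally inserts unnamed padding
   after `tag` so that `value` starts on an int boundary. */
struct sample {
    char tag;
    int  value;
};

/* Number of padding bytes between `tag` and `value` (ABI dependent). */
size_t padding_before_value(void) {
    return offsetof(struct sample, value) - sizeof(char);
}
```

For a 16-byte aligned array of 4-byte integers `a`, `is_aligned(&a[0], 16)` holds and `is_aligned(&a[1], 16)` does not. Assuming A starts on a 16-byte boundary, this is the situation in the SHIFT=1 experiment above: the load address is 4 bytes past the boundary, and the compiler emits movdqu instead of movdqa.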
