Published: 12.07.2012 | Access: free | Students: 293 / 18 | Rating: 4.00 / 4.20 | Duration: 11:07:00
Specialties: Programmer
Lecture 6:

Optimizing compiler. Auto parallelization

Software prefetching

Prefetching is loading data from relatively slow memory into the cache before the processor needs it. Software prefetching is the insertion of special prefetch instructions into the code.

There are two ways to use prefetch instructions:

  • Explicit insertion of prefetch instructions by the programmer.
  • Implicit insertion by the compiler, enabled with the /Qopt-prefetch option and known as the auto-prefetch compiler feature.

The prefetch intrinsic is declared in xmmintrin.h and has the form

#include <xmmintrin.h>
enum _mm_hint {_MM_HINT_T0 = 3,   /* L1 */
   _MM_HINT_T1 = 2,               /* L2 */
   _MM_HINT_T2 = 1,               /* L3 */
   _MM_HINT_NTA = 0};             /* non-temporal */
void _mm_prefetch (void * p, enum _mm_hint h);

It loads the cache line that contains the specified address (the cache line size is 64 bytes).
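
Because one prefetch brings in a whole 64-byte line, a buffer needs only one request per line it overlaps. The following is a minimal sketch of that arithmetic, not part of the original lecture: the helper name prefetch_range, the PREFETCH_T0 macro, and the CACHE_LINE constant are hypothetical, and the macro falls back to a no-op on targets without xmmintrin.h.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */
#define PREFETCH_T0(p) _mm_prefetch((const char *)(p), _MM_HINT_T0)
#else
#define PREFETCH_T0(p) ((void)0)   /* assumption: no-op on non-x86 targets */
#endif

#define CACHE_LINE 64u   /* line size stated in the text; an assumption elsewhere */

/* Issues one prefetch per cache line overlapping [p, p + bytes)
   and returns the number of lines requested. */
size_t prefetch_range(const void *p, size_t bytes)
{
    if (bytes == 0)
        return 0;

    /* Round the first and the last byte address down to a line boundary. */
    uintptr_t mask  = ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t first = (uintptr_t)p & mask;
    uintptr_t last  = ((uintptr_t)p + bytes - 1) & mask;

    size_t lines = 0;
    for (uintptr_t line = first; ; line += CACHE_LINE) {
        PREFETCH_T0((const void *)line);   /* a hint only; nothing is read here */
        ++lines;
        if (line == last)
            break;
    }
    return lines;
}
```

A 256-byte buffer that starts on a line boundary needs four requests, while 64 bytes starting one byte past a boundary straddle two lines and need two.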

In Fortran programs, use CALL mm_prefetch(P, HINT).

Why software prefetching can be useful

The hardware prefetch mechanism tries to identify the memory access pattern and choose an appropriate preloading scheme. It works well when memory is accessed with a constant and relatively small stride.

Software prefetch instructions have a cost. The system can also ignore software prefetch instructions when the system bus is busy.

Do not use software prefetch instructions

  • when the hardware prefetch mechanism already handles the access pattern
  • when there are many memory requests and the system bus is busy
  • when all the needed data is already cached

There are many cases in which the programmer can help preload the required memory:

  • large constant stride
  • work with chains
  • variable stride access to memory
  • many different memory objects (?)
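
The variable-stride case can be illustrated with an indexed (gather) loop, where the address needed a few iterations later is already known from the index array. This is a sketch under assumptions rather than code from the lecture: gather_sum, the PREFETCH_T0 macro, and the dist parameter are hypothetical names, and a useful value of dist has to be found by measurement.

```c
#include <stddef.h>
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */
#define PREFETCH_T0(p) _mm_prefetch((const char *)(p), _MM_HINT_T0)
#else
#define PREFETCH_T0(p) ((void)0)   /* assumption: no-op on non-x86 targets */
#endif

/* Sums a[idx[0]] ... a[idx[n-1]]. The element needed `dist` iterations
   later is requested early; `dist` is a hypothetical tuning parameter. */
double gather_sum(const double *a, const size_t *idx, size_t n, size_t dist)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        /* Look ahead only while idx[i + dist] is still inside the index array. */
        if (i + dist < n)
            PREFETCH_T0(&a[idx[i + dist]]);
        sum += a[idx[i]];
    }
    return sum;
}
```

The prefetch does not change the result; it only affects when the data reaches the cache, so the sum is the same for any dist.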

Intel VTune can help identify slowdowns caused by inefficient memory access.

SUBROUTINE CALC(A,B,C,N,K,SEC)
INTEGER N,K,SEC,I,J
REAL A(K,N),B(K,N),C(K,N)
DO I=1,K
  DO J=1,N
    A(I,J)=A(I,J)/(A(I,J)+ B(I,J)*C(I,J))
#ifdef PERF
    ! Request the elements that will be needed SEC iterations later
    CALL mm_prefetch(A(I,J+SEC),3)
    CALL mm_prefetch(B(I,J+SEC),3)
    CALL mm_prefetch(C(I,J+SEC),3)
#endif
  END DO
END DO
END SUBROUTINE CALC

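The idea of this example:

Memory is accessed with a large constant stride. Fortran stores arrays in column-major order, so the inner loop over J moves through memory K elements at a time.

Can we obtain a performance gain?

Does the result depend on the value of SEC?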

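INTEGER N,K
REAL, ALLOCATABLE :: A(:,:),B(:,:),C(:,:),D(:,:)
REAL T1,T2
INTEGER REP,SEC
READ *,N,SEC          ! array extent N and prefetch distance SEC
READ *,K              ! array extent K
ALLOCATE(A(K,N),B(K,N),C(K,N))
ALLOCATE(D(10000,10000))
A=1
B=1
C=1
D=0                   ! D is not passed to CALC; filling it presumably evicts A, B, C from the cache
CALL CPU_TIME(T1)
CALL CALC(A,B,C,N,K,SEC)
CALL CPU_TIME(T2)
PRINT *,T2-T1         ! CPU time spent in CALC, in seconds
END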
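
For readers more familiar with C, the CALC kernel can be sketched as follows. This is an illustration under assumptions, not a translation supplied by the lecture: the arrays are stored as flat buffers with element (i, j) at offset i + j*k to reproduce the Fortran layout, the use_prefetch flag replaces the compile-time PERF macro, and the look-ahead is skipped near the end of the row so that no out-of-range address is formed. PREFETCH_T0 is a hypothetical macro.

```c
#include <stddef.h>
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */
#define PREFETCH_T0(p) _mm_prefetch((const char *)(p), _MM_HINT_T0)
#else
#define PREFETCH_T0(p) ((void)0)   /* assumption: no-op on non-x86 targets */
#endif

/* a, b, c hold k*n floats each; element (i, j) lives at offset i + j*k,
   so consecutive j values are k floats apart (a large constant stride). */
void calc(float *a, const float *b, const float *c,
          size_t n, size_t k, size_t sec, int use_prefetch)
{
    for (size_t i = 0; i < k; ++i) {
        for (size_t j = 0; j < n; ++j) {
            size_t cur = i + j * k;
            a[cur] = a[cur] / (a[cur] + b[cur] * c[cur]);

            /* Request the operands needed `sec` iterations later. */
            if (use_prefetch && j + sec < n) {
                size_t ahead = i + (j + sec) * k;
                PREFETCH_T0(&a[ahead]);
                PREFETCH_T0(&b[ahead]);
                PREFETCH_T0(&c[ahead]);
            }
        }
    }
}
```

Calling calc with use_prefetch set to 0 and to 1 must produce identical arrays; only the running time may differ, which is what the measurements below compare for the Fortran version.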

Execution results

 ifort /fpp  -DPERF -Od pref.f90 -Feperf_pref.exe 
 ifort /fpp         -Od pref.f90 -Feperf.exe  

Input data (N and SEC on the first line, K on the second):

 4000 SEC
 4000

Without software prefetch (SP): 0.48 s

  SEC == 1, with SP: 0.56 s
  SEC == 4, with SP: 0.48 s

In this case the cost of the prefetch instructions apparently matches or exceeds the gain from prefetching.

Let’s increase the amount of computation inside the loop, replacing

A(I,J)=A(I,J)/(A(I,J)+ B(I,J)*C(I,J))

with

A(I,J) = (EXPONENT(A(I,J))+EXPONENT(B(I,J))+EXPONENT(C(I,J)))/(A(I,J)*B(I,J)*C(I,J))

Without SP: 1.45 s

  SEC == 1, with SP: 1.07 s
  SEC == 4, with SP: 0.98 s

Conclusion

It is hard to predict whether prefetch instructions will help on a given computing system. The performance of the memory subsystem depends on many factors, such as the amount of cache memory, memory latency, and bandwidth. A prefetch instruction has its own cost and increases the amount of data that must pass through the system bus. Therefore, the effect of software prefetching can differ from one computing system to another.

Auto software prefetching options

/Qopt-prefetch[:n]

  • 1-4 Enables different levels of software prefetching. If you do not specify a value for n, the default is 2 on IA-32 and Intel® 64 architecture; the default is 3 on IA-64 architecture. Use lower values to reduce the amount of prefetching.
  • 0 Disables software prefetching. This is the same as specifying -no-opt-prefetch (Linux and Mac OS X) or /Qopt-prefetch- (Windows).