Published: 12.07.2012 | Access: free | Students: 293 / 18 | Rating: 4.00 / 4.20 | Duration: 11:07:00
Specialties: Programmer
Lecture 6:

Optimizing compiler. Auto parallelization


Profitability of auto parallelization

Consider a simple Fortran test:

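REAL :: a(1000,1000), b(1000,1000), c(1000,1000)
INTEGER :: i, j, rep_factor
! Initialize the arrays; the inner loop runs over the first index
DO i = 1, 1000
  DO j = 1, 1000
    a(j,i) = i
    b(j,i) = i + j
    c(j,i) = 0
  END DO
END DO
! Element-wise whole-array statement, repeated 1000 times
DO rep_factor = 1, 1000
  c = b/a + rep_factor
END DO
END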

Fig. 6.4.

Fig. 6.5.

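These measurements show the speedup from auto parallelization. Real (wall-clock) time improves from 2.29 s to 1.67 s, while user time grows from 2.23 s to 6.02 s. User time is the CPU time summed over all threads, so this growth means that the total amount of work has increased even though the program finishes sooner.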
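The arithmetic behind these numbers can be sketched in a few lines of C. The function names `speedup` and `cpu_work_ratio` are hypothetical helpers introduced only for this illustration; they simply divide the timings quoted above.

```c
#include <assert.h>

/* Wall-clock speedup: serial real time divided by parallel real time. */
double speedup(double real_serial, double real_parallel) {
    assert(real_serial > 0.0 && real_parallel > 0.0);
    return real_serial / real_parallel;
}

/* Growth in total CPU work: parallel user time divided by serial user time. */
double cpu_work_ratio(double user_serial, double user_parallel) {
    assert(user_serial > 0.0 && user_parallel > 0.0);
    return user_parallel / user_serial;
}
```

With the timings above, `speedup(2.29, 1.67)` is about 1.37, while `cpu_work_ratio(2.23, 6.02)` is about 2.7: the program finishes roughly a third faster at the cost of nearly three times as much CPU time.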

void matrix_mul_matrix(int n, float C[n][n], float A[n][n],
                       float B[n][n]) {
  int i, j, k;
  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
      C[i][j] = 0;
      for (k = 0; k < n; k++)
        C[i][j] += A[i][k] * B[k][j];
    }
}

Fig. 6.6. Algorithm scalability

CPU cores may compete for shared processor resources, for example the cache subsystem and the system bus. Bus bandwidth can be a bottleneck for some algorithms. The chart shows how execution time depends on the number of threads for the matrix multiplication loop. It is an example of a highly scalable algorithm.
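To see why the matrix multiplication loop divides well among threads, here is a hand-written sketch that uses an OpenMP pragma in place of auto parallelization. It is an illustration under assumptions, not the code the compiler's auto-parallelizer generates: the name `matrix_mul_matrix_omp` is hypothetical, and the arrays are assumed not to overlap in memory.

```c
#include <assert.h>

/* Hand-parallelized variant of matrix_mul_matrix (hypothetical name).
   Each iteration of the outer i loop writes only row i of C, so the
   iterations are independent and can run on different threads. */
void matrix_mul_matrix_omp(int n, float C[n][n], float A[n][n],
                           float B[n][n]) {
    assert(n > 0);
    /* Without OpenMP enabled the pragma is ignored and the loop runs serially. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;               /* private accumulator per (i, j) */
            for (int k = 0; k < n; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
    }
}
```

Each thread gets its own block of rows of `C` and never writes data owned by another thread, so no synchronization is needed inside the loop.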

void matrix_add(int n, float Res[n][n], float A1[n][n], float A2[n][n],
                float A3[n][n], float A4[n][n], float A5[n][n],
                float A6[n][n], float A7[n][n], float A8[n][n]) {
  int i, j;
  for (i = 0; i < n; i++)
    for (j = 1; j < n-1; j++)
      Res[i][j] = A1[i][j]+A2[i][j]+A3[i][j]+A4[i][j]+
        A5[i][j]+A6[i][j]+A7[i][j]+A8[i][j]+
        A1[i][j+1]+A2[i][j+1]+A3[i][j+1]+A4[i][j+1]+
        A5[i][j+1]+A6[i][j+1]+A7[i][j+1]+A8[i][j+1];
}

Fig. 6.7. Some poorly scalable algorithm
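One rough way to compare the two loops is to count floating-point operations per array element moved. The sketch below is a back-of-the-envelope estimate, not a measurement: it assumes every array element crosses the memory bus exactly once (ideal cache reuse), and both function names are hypothetical.

```c
#include <assert.h>

/* matrix_mul_matrix: n*n results, each needing n multiplications and
   n additions, while three n-by-n arrays are touched. */
double matmul_flops_per_float(int n) {
    assert(n > 0);
    double flops  = 2.0 * n * n * n;
    double floats = 3.0 * n * n;
    return flops / floats;      /* equals 2n/3, grows with n */
}

/* matrix_add: 16 terms per result give 15 additions, while eight input
   arrays are read and one output array is written. */
double matrix_add_flops_per_float(void) {
    return 15.0 / 9.0;          /* independent of n */
}
```

Under these assumptions the ratio for matrix multiplication grows with `n`, while for `matrix_add` it stays below 2. The threads running `matrix_add` therefore spend most of their time waiting for data, which is consistent with bus bandwidth limiting its scalability.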