Parallel I/O
Basics
Claudio Gheller
CINECA
[email protected]
Firenze, 10-11 Giugno 2003, C. Gheller
Reading and writing data is a problem that is usually underestimated.
However, it can become crucial for:
•Performance
•Porting data on different platforms
•Parallel implementation of I/O algorithms
Performance
Disk transfer rate: approx. 10-100 MByte/sec
Memory transfer rate: approx. 1-10 GByte/sec
THEREFORE
When reading/writing on disk, a code can be around 100
times slower than when working in memory.
Optimization is platform dependent. In general:
write large amounts of data in single shots
Performance
Optimization is platform dependent. In general:
write large amounts of data in single shots
For example, avoid looped read/write:

do i=1,N
  write (10) A(i)
enddo

is VERY slow.
Data portability
This is a subtle problem, which becomes apparent only
at the end… when you try to use the data on different
platforms.
For example: unformatted data written by an IBM
system cannot be read by an Alpha station or by a
Linux/MS Windows PC.
There are two main problems:
• Data representation
• File structure
Data portability: number representation
There are two different representations:

Big Endian (Unix: IBM, SGI, SUN…)
Byte3 Byte2 Byte1 Byte0
will be arranged in memory as follows:
Base Address+0 Byte3
Base Address+1 Byte2
Base Address+2 Byte1
Base Address+3 Byte0

Little Endian (Alpha, PC)
Byte3 Byte2 Byte1 Byte0
will be arranged in memory as follows:
Base Address+0 Byte0
Base Address+1 Byte1
Base Address+2 Byte2
Base Address+3 Byte3
Data portability: File structure
For performance reasons, Fortran organizes unformatted
sequential files in BLOCKS.
Each block is delimited by record markers, control words
that hold the block length (typically 4 bytes each).
Unfortunately, each Fortran compiler has its own block
size and separators!!!
Notice that this problem is typical of Fortran and does
not affect C / C++
Data portability: Compiler solutions
Some compilers allow these problems to be overcome
with specific options.
However, this leads to:
• spending a lot of time re-configuring the compilation on
each different system
• a less portable code (the results depend
on the compiler)
Data portability: Compiler solutions
For example, the Alpha Fortran compiler allows Big Endian data to be used through the
-convert big_endian
option.
However, this option is not available in other
compilers and, furthermore, data produced with this
option are no longer in the native format of the system
that wrote them!!!
Fortran offers a possible solution to both the
performance and the portability problems:
DIRECT ACCESS files.

Open(unit=10, file='datafile.bin', form='unformatted', access='direct', recl=N)

The result is a binary file with no blocks and no control
characters. Any Fortran compiler writes (and can
read) it in THE SAME WAY.
Notice however that the endianness problem is still
present: the file is portable between any
platforms with the same endianness.
Direct Access Files
The keyword recl sets the record length, the basic
quantum of written data. It is usually expressed in
bytes (except on Alpha, which expresses it in 4-byte words).

Example 1

Real*4 x(100)
Inquire(IOLENGTH=IOL) x(1)
Open(unit=10, file='datafile.bin', access='direct', recl=IOL)
Do i=1,100
  write(10,rec=i)x(i)
Enddo
Close (10)

Portable but not efficient !!!
(Notice that this record-at-a-time pattern is precisely the C fread/fwrite style of I/O.)
Direct Access Files
Example 2
Real*4 x(100)
Inquire(IOLENGTH=IOL) x
Open(unit=10, file='datafile.bin', access='direct', recl=IOL)
write(10,rec=1)x
Close (10)
Portable and efficient !!!
Direct Access Files
Example 3
Real*4 x(100),y(100),z(100)
Open(unit=10, file='datafile.bin', access='direct', recl=4*100)
write(10,rec=1)x
write(10,rec=2)y
write(10,rec=3)z
Close (10)

The same result can be obtained as:

Real*4 x(100),y(100),z(100)
Open(unit=10, file='datafile.bin', access='direct', recl=4*100)
write(10,rec=2)y
write(10,rec=3)z
write(10,rec=1)x
Close (10)
Order is not important!!!
Parallel I/O
I/O is not a trivial issue in parallel.

Example:

Program Scrivi
Write(*,*)'Hello World'
End program Scrivi

Execute in parallel on 4 processors:

$ ./Scrivi
Hello World
Hello World
Hello World
Hello World

Each processor (Pe 0, Pe 1, Pe 2, Pe 3) prints its own copy of the message.
Parallel I/O
Goals:
Improve the performance
Ensure data consistency
Avoid communication
Usability
Parallel I/O
Solution 1: Master-Slave
Only 1 processor performs I/O: Pe 0 collects data from Pe 1, Pe 2
and Pe 3 and alone reads/writes the Data File.

Goals:
Improve the performance: NO
Ensure data consistency: YES
Avoid communication: NO
Usability: YES (but in general not portable)
Parallel I/O
Solution 2: Distributed I/O
All the processors read/write their own files (Pe 0 → Data File 0,
Pe 1 → Data File 1, Pe 2 → Data File 2, Pe 3 → Data File 3).

Goals:
Improve the performance: YES (but be careful)
Ensure data consistency: YES
Avoid communication: YES
Usability: NO

Warning: do not parametrize the I/O with the number of processors!!!
Parallel I/O
Solution 3: Distributed I/O on a single file
All the processors (Pe 0, Pe 1, Pe 2, Pe 3) read/write their own
records of a single ACCESS='DIRECT' Data File.

Goals:
Improve the performance: YES for read, NO for write
Ensure data consistency: NO
Avoid communication: YES
Usability: YES (portable !!!)
Parallel I/O
Solution 4: MPI-2 I/O
Dedicated MPI functions perform the I/O. They are defined by the
MPI-2 standard but not yet available in every implementation.
Asynchronous I/O is supported. All the processors (Pe 0, Pe 1,
Pe 2, Pe 3) access the Data File through the MPI layer.

Goals:
Improve the performance: YES (strongly!!!)
Ensure data consistency: NO
Avoid communication: YES
Usability: YES
Case Study
Data analysis – case 1
How many clusters are
there in the image ???
Cluster finding algorithm
Input = the image
Output = a number
Case Study
Case 1 - Parallel implementation
Parallel cluster-finding algorithm: the image is split
between the processors (Pe 0, Pe 1).
Input = a fraction of the image
Output = a number for each processor
All the parallelism is in the setup of the input. Then all
processors work independently !!!!
Case Study
Case 1 - Setup of the input
Each processor (Pe 0, Pe 1) reads its own part of the input file:

! The image is NxN pixels, using 2 processors
Real*4 array(N,N/2)
Open (unit=10, file='image.bin', access='direct', recl=4*N*N/2)
Startrecord=mype+1
read(10,rec=Startrecord)array
Call Sequential_Find_Cluster(array, N_cluster)
Write(*,*)mype,' found', N_cluster, ' clusters'
Case Study
Case 1 - Boundary conditions
Boundaries must be treated in a specific way: each processor
(Pe 0, Pe 1) also needs the boundary row owned by its
neighbour, so the file is read row by row (recl=4*N).

! The image is NxN pixels, using 2 processors
Real*4 array(0:N+1,0:N/2+1)
! Set boundaries on the image sides
array(0,:) = 0.0
array(N+1,:)= 0.0
jside= mod(mype,2)*N/2+mod(mype,2)
array(:,jside)=0.0
Open (unit=10, file='image.bin', access='direct', recl=4*N)
Do j=1,N/2
  record=mype*N/2+j
  read(10,rec=record)array(1:N,j)
Enddo
! Read the boundary row owned by the other processor
If(mype.eq.0)then
  record=N/2+1
  read(10,rec=record)array(1:N,N/2+1)
else
  record=N/2
  read(10,rec=record)array(1:N,0)
endif
Call Sequential_Find_Cluster(array, N_cluster)
Write(*,*)mype,' found', N_cluster, ' clusters'
Case Study
Data analysis – case 2
From observed data…
…
…to the sky map
Case Study
Data analysis – case 2
Each map pixel is measured N times. The final value for each
pixel is an "average" of all the corresponding measurements.

values:       0.1  0.7  0.3  1.8  2.3  0.2  5.7  1.0  0.4  0.3  …  0.7  0.6  1.2  1.3  8.1  3.2  0.9  0.8  0.1  0.3
map pixel id: 2    7    1    11   76   23   2    37   21   21   …  5    8    21   15   3    1    21   22   54   3

MAP:          0.3  0.5  0.7  0.9  1.0  1.1  1.4  1.2  1.1  0.9
Case Study
Case 2: parallelization
•Values and ids are distributed between processors in
the data input phase (just like case 1)
•Calculation is performed independently by each
processor
•Each processor produces its own COMPLETE map
(which is small and can be replicated)
•The final map is the SUM OF ALL THE MAPS calculated
by different processors
Case Study
Case 2: parallelization
! N Data, M pixels, Npes processors (M << N)
! Define the basic arrays
Real*8 value(N/Npes)
Integer id(N/Npes)
Real*8 map(M)
! Read data in parallel (boundaries are neglected);
! recl is in bytes: 8 per Real*8 value, 4 per Integer id
Open(unit=10,file='data.bin',access='direct',recl=8*N/Npes)
Open(unit=20,file='ids.bin',access='direct',recl=4*N/Npes)
record=mype+1
Read(10,rec=record)value
Read(20,rec=record)id
! Calculate the local maps
Call Sequential_Calculate_Local_Map(value,id,map)
! Synchronize the processes
Call BARRIER
! Parallel calculation of the final map
Call Calculate_Final_Map(map)
! Print the final map
Call Print_Final_Map(map)
Case Study
Case 2: calculation of the final map
Subroutine Calculate_Final_Map(map)
Real*8 map(M)
Real*8 map_aux(M)
! Processor 0 accumulates the final map, one processor at a time
! (the loop starts at 2: processor 0 already holds its own local map)
Do i=2,npes
  If(mype.eq.0)then
    call RECV(map_aux,i-1)
    map=map+map_aux
  Else if (mype.eq.i-1)then
    call SEND(map,0)
  Endif
  Call BARRIER
enddo
return

However MPI offers a MUCH BETTER solution
(we will see it tomorrow)
Case Study
Case 2: print the final map
At this point ONLY processor 0 has the final map and can print it out.

Subroutine Print_Final_Map(map)
Real*8 map(M)
! Only one processor writes the result
If(mype.eq.0)then
  do i=1,M
    write(*,*)i,map(i)
  enddo
Endif
return