[Users] Checkpointing issue.
Peter Diener
diener at cct.lsu.edu
Thu Apr 17 11:49:29 CDT 2008
Hi,
Sorry about the long e-mail.
I just saw the following weird behaviour when restarting from a set of
Carpet mesh refinement checkpoint files. I was running with 9 refinement
levels on 128 MPI processes and was restarting on the same number of
processes when the restart seemed to stall while reading in
refinement level 5. I killed the job and set
IO::verbose = "full"
to get more info and
CarpetIOHDF5::open_one_input_file_at_a_time = yes
to make sure that it didn't open more than one file at a time. Below is a
summary of the restart process on stdout from MPI process 0:
INFO (CarpetIOHDF5): opening checkpoint file 'random-1o1-med-5/checkpoint.chkpt.it_19968.file_0.h5'
INFO (CarpetIOHDF5): reading grid variables on mglevel 0 reflevel 0
<lots of reading statements>
INFO (CarpetIOHDF5): restarting simulation on mglevel 0 reflevel 0 at iteration 19968 (simulation time 89.856)
INFO (ADMMacros): Spatial finite differencing order: 4
INFO (Time): Timestep set to 0.576 (courant_static)
INFO (CarpetIOHDF5): reading grid variables on mglevel 0 reflevel 1
<lots of reading statements>
INFO (CarpetIOHDF5): restarting simulation on mglevel 0 reflevel 1 at iteration 19968 (simulation time 89.856)
INFO (ADMMacros): Spatial finite differencing order: 4
INFO (Time): Timestep set to 0.288 (courant_static)
INFO (CarpetIOHDF5): reading grid variables on mglevel 0 reflevel 2
<lots of reading statements>
INFO (CarpetIOHDF5): restarting simulation on mglevel 0 reflevel 2 at iteration 19968 (simulation time 89.856)
INFO (ADMMacros): Spatial finite differencing order: 4
INFO (Time): Timestep set to 0.144 (courant_static)
INFO (CarpetIOHDF5): reading grid variables on mglevel 0 reflevel 3
<lots of reading statements>
INFO (CarpetIOHDF5): restarting simulation on mglevel 0 reflevel 3 at iteration 19968 (simulation time 89.856)
INFO (ADMMacros): Spatial finite differencing order: 4
INFO (Time): Timestep set to 0.072 (courant_static)
INFO (CarpetIOHDF5): reading grid variables on mglevel 0 reflevel 4
<lots of reading statements>
INFO (CarpetIOHDF5): restarting simulation on mglevel 0 reflevel 4 at iteration 19968 (simulation time 89.856)
INFO (ADMMacros): Spatial finite differencing order: 4
INFO (Time): Timestep set to 0.036 (courant_static)
INFO (CarpetIOHDF5): reading grid variables on mglevel 0 reflevel 5
INFO (CarpetIOHDF5): opening checkpoint file 'random-1o1-med-5/checkpoint.chkpt.it_19968.file_1.h5'
INFO (CarpetIOHDF5): opening checkpoint file 'random-1o1-med-5/checkpoint.chkpt.it_19968.file_2.h5'
<lots of opening statements>
INFO (CarpetIOHDF5): opening checkpoint file 'random-1o1-med-5/checkpoint.chkpt.it_19968.file_97.h5'
INFO (CarpetIOHDF5): opening checkpoint file 'random-1o1-med-5/checkpoint.chkpt.it_19968.file_98.h5'
INFO (CarpetIOHDF5): reading 'ADMCONSTRAINTS::ham' from dataset 'ADMCONSTRAINTS::ham it=19968 tl=0 rl=5 c=98'
<lots of reading statements>
INFO (CarpetIOHDF5): reading 'PSIKADELIA::riczz' from dataset 'PSIKADELIA::riczz it=19968 tl=0 rl=5 c=98'
INFO (CarpetIOHDF5): restarting simulation on mglevel 0 reflevel 5 at iteration 19968 (simulation time 89.856)
INFO (ADMMacros): Spatial finite differencing order: 4
INFO (Time): Timestep set to 0.018 (courant_static)
INFO (CarpetIOHDF5): reading grid variables on mglevel 0 reflevel 6
INFO (CarpetIOHDF5): opening checkpoint file 'random-1o1-med-5/checkpoint.chkpt.it_19968.file_0.h5'
<lots of reading statements>
INFO (CarpetIOHDF5): restarting simulation on mglevel 0 reflevel 6 at iteration 19968 (simulation time 89.856)
INFO (ADMMacros): Spatial finite differencing order: 4
INFO (Time): Timestep set to 0.009 (courant_static)
INFO (CarpetIOHDF5): reading grid variables on mglevel 0 reflevel 7
<lots of reading statements>
INFO (CarpetIOHDF5): restarting simulation on mglevel 0 reflevel 7 at iteration 19968 (simulation time 89.856)
INFO (ADMMacros): Spatial finite differencing order: 4
INFO (Time): Timestep set to 0.0045 (courant_static)
INFO (CarpetIOHDF5): reading grid variables on mglevel 0 reflevel 8
<lots of reading statements>
INFO (CarpetIOHDF5): restarting simulation on mglevel 0 reflevel 8 at iteration 19968 (simulation time 89.856)
The whole process took almost 7 hours, with almost all of that time spent on
reflevel 5, where process 0 opened 98 files before finding the right data.
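To see at a glance where the file opening happens, the stdout of one process can be scanned for the messages shown above. The following is only a minimal sketch: it assumes the log lines have exactly the wording quoted in this mail, and the function name is hypothetical, not part of Carpet or Cactus.

```python
import re
from collections import Counter

# Patterns assume the exact message wording quoted in this mail.
LEVEL_RE = re.compile(r"reading grid variables on mglevel \d+ reflevel (\d+)")
OPEN_RE = re.compile(r"opening checkpoint file '([^']+)'")

def count_opens_per_reflevel(lines):
    """Count 'opening checkpoint file' messages per refinement level.

    An opening message is attributed to the most recent
    'reading grid variables ... reflevel N' line seen before it.
    Openings that occur before any such line are counted under None.
    """
    counts = Counter()
    current = None
    for line in lines:
        level = LEVEL_RE.search(line)
        if level:
            current = int(level.group(1))
            continue
        if OPEN_RE.search(line):
            counts[current] += 1
    return counts
```

Passing an open log file (for example `count_opens_per_reflevel(open(path))`, with `path` pointing at the saved stdout of one process) returns a counter keyed by refinement level, so a level that needs many more openings than the others stands out immediately.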
So it seems that for some reason all refinement levels, except for level
5, were distributed in the same way in the original run and the restart
run. What used to be on MPI process 0 on level 5 was suddenly on MPI
process 98. I don't have output for the other MPI processes, so I don't
know how much was moved around...
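If the stdout of the other processes were available, the component numbers in the dataset names (the `c=98` in the lines above) could be collected per refinement level to see how much was moved. This is a rough sketch only, assuming the dataset naming shown in this mail; the function name is hypothetical.

```python
import re
from collections import defaultdict

# Assumes dataset names of the form quoted above:
# '<GROUP>::<var> it=<iteration> tl=<timelevel> rl=<reflevel> c=<component>'
DATASET_RE = re.compile(r"from dataset '.* rl=(\d+) c=(\d+)'")

def components_per_reflevel(lines):
    """Map each refinement level to the sorted component numbers read on it."""
    found = defaultdict(set)
    for line in lines:
        match = DATASET_RE.search(line)
        if match:
            reflevel = int(match.group(1))
            component = int(match.group(2))
            found[reflevel].add(component)
    return {reflevel: sorted(comps) for reflevel, comps in found.items()}
```

Running this on the log of MPI process N and comparing the component numbers with N on each level would show which levels were redistributed between the original run and the restart.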
Has anybody seen something like this before?
Any suggestions as to what to do about it?
Cheers,
Peter Diener