[Users] Checkpointing issue.

Peter Diener diener at cct.lsu.edu
Thu Apr 17 11:49:29 CDT 2008


Hi,

Sorry about the long e-mail.

I just saw the following weird behaviour when restarting from a set of 
Carpet mesh refinement checkpoint files. I was running with 9 refinement 
levels on 128 MPI processes and was restarting on the same number of 
processes when the restart seemed to stall while reading in 
refinement level 5. I killed the job and set

IO::verbose                             = "full"

to get more info and

CarpetIOHDF5::open_one_input_file_at_a_time = yes

to make sure that it didn't open more than one file at a time. Below is a
summary of the restart process on stdout from MPI process 0:
INFO (CarpetIOHDF5): opening checkpoint file 'random-1o1-med-5/checkpoint.chkpt.it_19968.file_0.h5'
INFO (CarpetIOHDF5): reading grid variables on mglevel 0 reflevel 0

<lots of reading statements>

INFO (CarpetIOHDF5): restarting simulation on mglevel 0 reflevel 0 at iteration 19968 (simulation time 89.856)
INFO (ADMMacros): Spatial finite differencing order: 4
INFO (Time): Timestep set to 0.576 (courant_static)
INFO (CarpetIOHDF5): reading grid variables on mglevel 0 reflevel 1

<lots of reading statements>

INFO (CarpetIOHDF5): restarting simulation on mglevel 0 reflevel 1 at iteration 19968 (simulation time 89.856)
INFO (ADMMacros): Spatial finite differencing order: 4
INFO (Time): Timestep set to 0.288 (courant_static)
INFO (CarpetIOHDF5): reading grid variables on mglevel 0 reflevel 2

<lots of reading statements>

INFO (CarpetIOHDF5): restarting simulation on mglevel 0 reflevel 2 at iteration 19968 (simulation time 89.856)
INFO (ADMMacros): Spatial finite differencing order: 4
INFO (Time): Timestep set to 0.144 (courant_static)
INFO (CarpetIOHDF5): reading grid variables on mglevel 0 reflevel 3

<lots of reading statements>

INFO (CarpetIOHDF5): restarting simulation on mglevel 0 reflevel 3 at iteration 19968 (simulation time 89.856)
INFO (ADMMacros): Spatial finite differencing order: 4
INFO (Time): Timestep set to 0.072 (courant_static)
INFO (CarpetIOHDF5): reading grid variables on mglevel 0 reflevel 4

<lots of reading statements>

INFO (CarpetIOHDF5): restarting simulation on mglevel 0 reflevel 4 at iteration 19968 (simulation time 89.856)
INFO (ADMMacros): Spatial finite differencing order: 4
INFO (Time): Timestep set to 0.036 (courant_static)
INFO (CarpetIOHDF5): reading grid variables on mglevel 0 reflevel 5
INFO (CarpetIOHDF5): opening checkpoint file 'random-1o1-med-5/checkpoint.chkpt.it_19968.file_1.h5'
INFO (CarpetIOHDF5): opening checkpoint file 'random-1o1-med-5/checkpoint.chkpt.it_19968.file_2.h5'

<lots of opening statements>

INFO (CarpetIOHDF5): opening checkpoint file 'random-1o1-med-5/checkpoint.chkpt.it_19968.file_97.h5'
INFO (CarpetIOHDF5): opening checkpoint file 'random-1o1-med-5/checkpoint.chkpt.it_19968.file_98.h5'
INFO (CarpetIOHDF5):   reading 'ADMCONSTRAINTS::ham' from dataset 'ADMCONSTRAINTS::ham it=19968 tl=0 rl=5 c=98'

<lots of reading statements>

INFO (CarpetIOHDF5):   reading 'PSIKADELIA::riczz' from dataset 'PSIKADELIA::riczz it=19968 tl=0 rl=5 c=98'
INFO (CarpetIOHDF5): restarting simulation on mglevel 0 reflevel 5 at iteration 19968 (simulation time 89.856)
INFO (ADMMacros): Spatial finite differencing order: 4
INFO (Time): Timestep set to 0.018 (courant_static)
INFO (CarpetIOHDF5): reading grid variables on mglevel 0 reflevel 6
INFO (CarpetIOHDF5): opening checkpoint file 'random-1o1-med-5/checkpoint.chkpt.it_19968.file_0.h5'

<lots of reading statements>

INFO (CarpetIOHDF5): restarting simulation on mglevel 0 reflevel 6 at iteration 19968 (simulation time 89.856)
INFO (ADMMacros): Spatial finite differencing order: 4
INFO (Time): Timestep set to 0.009 (courant_static)
INFO (CarpetIOHDF5): reading grid variables on mglevel 0 reflevel 7

<lots of reading statements>

INFO (CarpetIOHDF5): restarting simulation on mglevel 0 reflevel 7 at iteration 19968 (simulation time 89.856)
INFO (ADMMacros): Spatial finite differencing order: 4
INFO (Time): Timestep set to 0.0045 (courant_static)
INFO (CarpetIOHDF5): reading grid variables on mglevel 0 reflevel 8

<lots of reading statements>

INFO (CarpetIOHDF5): restarting simulation on mglevel 0 reflevel 8 at iteration 19968 (simulation time 89.856)

The whole process took almost 7 hours with almost all of the time spent on 
reflevel 5 opening 98 files before finding the right data.
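
For anyone who wants to check their own stdout for the same pattern, here is 
a rough Python sketch that counts how many checkpoint files were opened while 
each refinement level was being read. It assumes only the two INFO message 
formats quoted above; the function name count_file_opens is made up for 
illustration and is not part of Carpet or Cactus.

```python
import re
from collections import Counter

# Message formats as they appear in the stdout excerpt above.
READING_LEVEL = re.compile(r"reading grid variables on mglevel \d+ reflevel (\d+)")
OPENING_FILE = re.compile(r"opening checkpoint file '([^']+)'")


def count_file_opens(lines):
    """Return {reflevel: number of checkpoint files opened while reading it}.

    Files opened before the first "reading grid variables" line are
    counted under the key None.
    """
    opens = Counter()
    level = None
    for line in lines:
        m = READING_LEVEL.search(line)
        if m:
            level = int(m.group(1))
            opens[level] += 0  # record the level even if no file is opened
            continue
        if OPENING_FILE.search(line):
            opens[level] += 1
    return dict(opens)
```

Feeding it the lines of a saved stdout file (for example 
count_file_opens(open("stdout.txt")), where the file name is a placeholder) 
should make a level with an unusually large number of opens stand out.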

So it seems that for some reason all refinement levels, except for level 
5, were distributed in the same way in the original run and the restart 
run. What used to be on MPI process 98 on level 5 was suddenly on MPI 
process 0. I don't have output for the other MPI processes, so I don't 
know how much was moved around...

Has anybody seen something like this before?

Any suggestions as to what to do about it?

Cheers,

  Peter Diener

