[Users] Checkpointing issue.

Erik Schnetter schnetter at cct.lsu.edu
Fri Apr 18 15:47:32 CDT 2008


On Apr 18, 2008, at 12:17:40, Thomas Radke wrote:

> Peter Diener wrote:
>> Hi,
>>
>> Sorry about the long e-mail.
>>
>> I just saw the following weird behaviour when restarting from a set of
>> Carpet mesh refinement checkpoint files. I was running with 9 refinement
>> levels on 128 MPI processes and was restarting on the same number of
>> processes when it seemed like the restart stalled when reading in
>> refinement level 5.
>>
>> The whole process took almost 7 hours, with almost all of the time spent
>> on reflevel 5 opening 98 files before finding the right data.
>>
>> So it seems that for some reason all refinement levels, except for level
>> 5, were distributed in the same way in the original run and the restart
>> run. What used to be on MPI process 0 on level 5 was suddenly on MPI
>> process 98. I don't have output for the other MPI processes, so I don't
>> know how much was moved around...
>>
>> Has anybody seen something like this before?
>>
>> Any suggestions as to what to do about it?
>
> Hi Peter,
>
> I don't know why the grid structure for refinement level 5 would be
> different in the recovery run but not for other levels. I have seen this
> behaviour before, though, but couldn't find out why Carpet was doing that.


It could be that Carpet, for some reason, decides to use a different
processor decomposition, and therefore has to read in data that were
written by other processors. (It could also be some other
inconsistency in checkpointing.) Peter, do you have the old and new
processor decompositions? It should suffice to look at the shapes of
the components in the old checkpoint file and in a checkpoint file
that you write just after restarting.
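
A minimal sketch of such a comparison, assuming the component shapes
have already been extracted from both checkpoint files by some other
means. The dictionary layout and the name changed_levels are
hypothetical and are not Carpet's file format or API; the box values
are made-up illustration data.

```python
# Hypothetical layout: {reflevel: [(lower, upper), ...]}, one inclusive
# bounding box per component, listed in process order.
# This is illustration data, not output from a real Carpet run.

def changed_levels(old, new):
    """Return the sorted refinement levels whose component boxes differ."""
    levels = sorted(set(old) | set(new))
    return [rl for rl in levels if old.get(rl) != new.get(rl)]

# Decomposition recorded in the old checkpoint file.
old = {
    4: [((0, 0, 0), (15, 31, 31)), ((16, 0, 0), (31, 31, 31))],
    5: [((0, 0, 0), (15, 31, 31)), ((16, 0, 0), (31, 31, 31))],
}
# Decomposition recorded in a checkpoint written just after restarting.
new = {
    4: [((0, 0, 0), (15, 31, 31)), ((16, 0, 0), (31, 31, 31))],
    5: [((0, 0, 0), (31, 15, 31)), ((0, 16, 0), (31, 31, 31))],
}

print(changed_levels(old, new))  # level 5 is split along a different axis
```

Any level reported here would be one where a process has to look for
its data in files written by other processes.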

Since reading in so many files is so slow, we may need to change  
recovery to make every processor open only one file and send the data  
around via MPI.
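
A rough sketch of how such a scheme could be planned, assuming one
component box per process in both the old and the new decomposition.
The names intersect and transfer_plan are hypothetical; nothing here is
existing Carpet code, and the actual MPI communication is left out.
Each process would read only the file it wrote itself and then send
every overlapping region to the process that owns it now.

```python
def intersect(a, b):
    """Intersection of two inclusive boxes (lower, upper), or None if empty."""
    lo = tuple(max(x, y) for x, y in zip(a[0], b[0]))
    hi = tuple(min(x, y) for x, y in zip(a[1], b[1]))
    return (lo, hi) if all(l <= h for l, h in zip(lo, hi)) else None

def transfer_plan(old_boxes, new_boxes):
    """List (reader, receiver, region) triples.

    old_boxes[p] is the box that process p wrote to its checkpoint file;
    new_boxes[q] is the box that process q owns after recovery.
    """
    plan = []
    for reader, old in enumerate(old_boxes):
        for receiver, new in enumerate(new_boxes):
            region = intersect(old, new)
            if region is not None:
                plan.append((reader, receiver, region))
    return plan

# Made-up 2D example: the old run split along x, the new run along y.
old_boxes = [((0, 0), (3, 7)), ((4, 0), (7, 7))]
new_boxes = [((0, 0), (7, 3)), ((0, 4), (7, 7))]

plan = transfer_plan(old_boxes, new_boxes)
for reader, receiver, region in plan:
    print(f"process {reader} reads its own file, sends {region} to {receiver}")
```

With this plan every process opens exactly one file, and the cost of a
changed decomposition moves from repeated file opens to messages
between processes.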

-erik

-- 
Erik Schnetter <schnetter at cct.lsu.edu>   http://www.cct.lsu.edu/~eschnett/

My email is as private as my paper mail.  I therefore support encrypting
and signing email messages.  Get my PGP key from www.keyserver.net.


