[Developers] synchronise all processors before aborting on parameter errors
Thomas Radke
tradke at aei.mpg.de
Wed Dec 5 11:13:11 CST 2007
Tom Goodale wrote:
> On Tue, 4 Dec 2007, Erik Schnetter wrote:
>
>
>>On Dec 4, 2007, at 05:25:22, Thomas Radke wrote:
>>
>>
>>>Hi,
>>>
>>>there exists a PARAM_CHECK bin in which thorns can schedule routines to
>>>check the consistency of parameters and have the run stopped (using
>>>CCTK_Abort) if there are errors.
>>>Now Bela reported the problem that, for multi-processor simulations
>>>using certain Infiniband MPI implementations, the run would die
>>>prematurely because some processors call CCTK_Abort() earlier than
>>>others, and in the logfile one cannot easily find the real reason for
>>>the abort anymore.
>>>
>>>Putting output buffer caching issues aside, the problem could be fixed
>>>by inserting a CCTK_Barrier() call in the flesh function
>>>CCTKi_FinaliseParamWarn(), just before it would check whether there were
>>>any (local) parameter errors and then call CCTK_Abort().
>>>
>>>I guess this small performance penalty would be acceptable? Or does
>>>someone have a better solution?
>>
>>
>>In addition to this good idea, we could insert a sleep(10) in CCTK_Abort, so
>>that other processors have a bit of time to catch up before aborting. This
>>should often be enough for them to produce some additional debug output.
>>(I'm thinking of a new parameter INT sleep_time_before_abort, with a default
>>value of 10 or so.)

It already sleeps before calling abort(). I committed that patch to
src/comm/CactusDefaultComm.c back in 2003, although it has the
delay hard-coded to 5 seconds. Is that not enough?

> Makes sense (although needs to be a separate commit).

Okay, then I will just commit the CCTK_Barrier() call in
CCTKi_FinaliseParamWarn().
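
For illustration, the intended ordering could be sketched roughly as follows. This is not the actual flesh code: sketch_barrier(), sketch_abort(), sketch_events and sketch_finalise_param_warn() are hypothetical stand-ins (for CCTK_Barrier(), CCTK_Abort() and CCTKi_FinaliseParamWarn()) so that the snippet compiles without Cactus or MPI. It only shows that every processor reaches the barrier before any of them checks its local error count and aborts.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins so the sketch compiles without Cactus or MPI.
   In the flesh these would be CCTK_Barrier() and CCTK_Abort(). */
static char sketch_events[64] = "";

static void sketch_barrier(void)
{
  /* real code: block here until every processor has arrived */
  strcat(sketch_events, "barrier;");
}

static void sketch_abort(void)
{
  /* real code: terminate the run */
  strcat(sketch_events, "abort;");
}

/* Rough shape of the step at the end of PARAM_CHECK: synchronise first,
   then look at the local parameter error count.
   Returns 1 if the abort path was taken, 0 otherwise. */
static int sketch_finalise_param_warn(int local_param_errors)
{
  sketch_barrier();

  if (local_param_errors > 0)
  {
    fprintf(stderr, "%d parameter error(s), aborting\n", local_param_errors);
    sketch_abort();
    return 1;
  }

  return 0;
}
```

The barrier is called unconditionally, including on processors without errors; otherwise the processors that do have errors would wait at the barrier forever.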
--
Cheers, Thomas.