[Developers] synchronise all processors before aborting on parameter errors

Thomas Radke tradke at aei.mpg.de
Wed Dec 5 11:13:11 CST 2007


Tom Goodale wrote:
> On Tue, 4 Dec 2007, Erik Schnetter wrote:
> 
> 
>>On Dec 4, 2007, at 05:25:22, Thomas Radke wrote:
>>
>>
>>>Hi,
>>>
>>>there exists a PARAM_CHECK bin in which thorns can schedule routines to
>>>check the consistency of parameters and have the run stopped (using
>>>CCTK_Abort) if there are errors.
>>>Now Bela reported the problem that, for multi-processor simulations
>>>using certain Infiniband MPI implementations, the run would die
>>>prematurely because some processors call CCTK_Abort() earlier than
>>>others, and in the logfile one cannot easily find the real reason for
>>>the abort anymore.
>>>
>>>Putting output buffer caching issues aside, the problem could be fixed
>>>by inserting a CCTK_Barrier() call in the flesh function
>>>CCTKi_FinaliseParamWarn(), just before it would check whether there were
>>>any (local) parameter errors and then call CCTK_Abort().
>>>
>>>I guess this small performance penalty would be acceptable? Or does
>>>someone have a better solution?
>>
>>
>>In addition to this good idea, we could insert a sleep(10) in CCTK_Abort, so 
>>that other processors have a bit of time to catch up before aborting.  This 
>>should often be enough for them to produce some additional debug output. 
>>(I'm thinking of a new parameter INT sleep_time_before_abort, with a default 
>>value of 10 or so.)

It already sleeps before calling abort(). I committed that patch to 
src/comm/CactusDefaultComm.c back in 2003, although the delay is 
hard-coded to 5 seconds. Is that not enough?

> Makes sense (although needs to be a separate commit).

Okay, then I will just commit the CCTK_Barrier() call in 
CCTKi_FinaliseParamWarn().
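
A minimal sketch of that ordering, to make the idea concrete. This is not
the actual flesh code: sketch_barrier(), sketch_abort(), param_errors and
finalise_param_warn_sketch() are hypothetical stand-ins so that the example
compiles without Cactus or MPI. In the flesh, the two stubs would correspond
to CCTK_Barrier() and CCTK_Abort().

```c
#include <stdio.h>

/* Hypothetical stand-ins (assumptions, not the Cactus API). They only
   count how often they were called so the control flow can be checked. */
int barrier_calls = 0;
int abort_calls = 0;

void sketch_barrier(void) { ++barrier_calls; }
void sketch_abort(int code) { (void) code; ++abort_calls; }

/* Assumed counter of parameter errors recorded on this processor. */
int param_errors = 0;

/* Synchronise first, then decide locally whether to abort.
   Returns 1 if an abort was requested, 0 otherwise. */
int finalise_param_warn_sketch(void)
{
  /* Wait here until every processor has finished its PARAM_CHECK
     routines, so each one had the chance to report its own errors. */
  sketch_barrier();

  if (param_errors > 0)
  {
    fprintf(stderr, "%d parameter error(s) on this processor, aborting\n",
            param_errors);
    sketch_abort(99);
    return 1;
  }
  return 0;
}
```

The point of the sketch is only the order of operations: the barrier comes
before the local error check, so no processor can reach the abort while
others are still running their parameter checks.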

-- 
Cheers, Thomas.

