[Developers] synchronise all processors before aborting on parameter errors

Tom Goodale goodale at cct.lsu.edu
Tue Dec 4 11:46:41 CST 2007


On Tue, 4 Dec 2007, Erik Schnetter wrote:

> On Dec 4, 2007, at 05:25:22, Thomas Radke wrote:
>
>> Hi,
>> 
>> there exists a PARAM_CHECK bin in which thorns can schedule routines to
>> check the consistency of parameters and have the run stopped (using
>> CCTK_Abort) if there are errors.
>> Now Bela reported the problem that, for multi-processor simulations
>> using certain Infiniband MPI implementations, the run would die
>> prematurely because some processors call CCTK_Abort() earlier than
>> others, and in the logfile one cannot easily find the real reason for
>> the abort anymore.
>> 
>> Putting output buffer caching issues aside, the problem could be fixed
>> by inserting a CCTK_Barrier() call in the flesh function
>> CCTKi_FinaliseParamWarn(), just before it would check whether there were
>> any (local) parameter errors and then call CCTK_Abort().
>> 
>> I guess this small performance penalty would be acceptable? Or does
>> someone have a better solution?
>
>
> In addition to this good idea, we could insert a sleep(10) in CCTK_Abort, so 
> that other processors have a bit of time to catch up before aborting.  This 
> should often be enough for them to produce some additional debug output. 
> (I'm thinking of a new parameter INT sleep_time_before_abort, with a default 
> value of 10 or so.)

Makes sense (although it needs to be a separate commit).
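
For reference, a rough sketch of how the two changes could fit together.
This is not the flesh code: the stub_* functions, abort_with_delay() and
finalise_param_warn() are hypothetical stand-ins for CCTK_Barrier(),
sleep(), CCTK_Abort() and CCTKi_FinaliseParamWarn(), so that the ordering
can be compiled and checked without Cactus or MPI. The parameter name
sleep_time_before_abort and its default of 10 are taken from Erik's
suggestion; everything else is an assumption.

```c
#include <assert.h>
#include <stdio.h>

/* Event log, so the order of the calls can be inspected afterwards. */
enum event { EV_BARRIER = 1, EV_SLEEP, EV_ABORT };
#define MAX_EVENTS 8
enum event events[MAX_EVENTS];
int n_events = 0;

void record(enum event ev)
{
  if (n_events < MAX_EVENTS) events[n_events++] = ev;
}

/* Hypothetical stubs: they only record that they were called. */
void stub_barrier(void)     { record(EV_BARRIER); } /* stands in for CCTK_Barrier() */
void stub_sleep(int secs)   { (void)secs; record(EV_SLEEP); } /* stands in for sleep() */
void stub_terminate(void)   { record(EV_ABORT); }   /* stands in for the actual abort */

/* Proposed parameter; the default of 10 seconds is Erik's suggestion. */
int sleep_time_before_abort = 10;

/* Second proposal (separate commit): wait a little before aborting, so
   that the other processors have time to produce their own output. */
void abort_with_delay(void)
{
  if (sleep_time_before_abort > 0)
    stub_sleep(sleep_time_before_abort);
  stub_terminate();
}

/* First proposal: synchronise all processors, and only then look at the
   local parameter error count and abort if it is non-zero.
   Returns 1 if an abort was requested, 0 otherwise. */
int finalise_param_warn(int local_param_errors)
{
  assert(local_param_errors >= 0);

  stub_barrier();

  if (local_param_errors > 0)
  {
    fprintf(stderr, "%d parameter error(s) on this processor\n",
            local_param_errors);
    abort_with_delay();
    return 1;
  }
  return 0;
}
```

Calling finalise_param_warn(0) only passes the barrier, while
finalise_param_warn(2) passes the barrier, sleeps, and then aborts.
Setting sleep_time_before_abort to 0 would skip the delay.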

Cheers,

Tom

