[Developers] synchronise all processors before aborting on parameter errors
Erik Schnetter
schnetter at cct.lsu.edu
Tue Dec 4 10:58:00 CST 2007
On Dec 4, 2007, at 05:25:22, Thomas Radke wrote:
> Hi,
>
> there exists a PARAM_CHECK bin in which thorns can schedule routines to
> check the consistency of parameters and have the run stopped (using
> CCTK_Abort) if there are errors.
> Now Bela reported the problem that, for multi-processor simulations
> using certain Infiniband MPI implementations, the run would die
> prematurely because some processors call CCTK_Abort() earlier than
> others, and in the logfile one cannot easily find the real reason for
> the abort anymore.
>
> Putting output buffer caching issues aside, the problem could be fixed
> by inserting a CCTK_Barrier() call in the flesh function
> CCTKi_FinaliseParamWarn(), just before it would check whether there were
> any (local) parameter errors and then call CCTK_Abort().
>
> I guess this small performance penalty would be acceptable? Or does
> someone have a better solution?
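For illustration, the ordering proposed in the quoted message could be
sketched roughly as below. This is not the actual flesh code: the function
finalise_param_warn and the two stubs are hypothetical stand-ins (for
CCTKi_FinaliseParamWarn, CCTK_Barrier and CCTK_Abort respectively) so that
the fragment compiles on its own. The only point is that the barrier comes
before the check of the local error count.

```c
#include <stdio.h>

/* Hypothetical stand-ins so the sketch is self-contained; in Cactus
   these roles would be played by CCTK_Barrier and CCTK_Abort. */
static int barrier_calls = 0;
static int abort_calls = 0;
static int barrier_calls_at_abort = -1;

static void stub_barrier(void)
{
  ++barrier_calls;
}

static void stub_abort(void)
{
  ++abort_calls;
  /* Record how many barriers had completed when the abort happened. */
  barrier_calls_at_abort = barrier_calls;
}

/* Sketch of the proposed ordering: every process synchronises first,
   and only then do processes with local parameter errors abort.
   Returns the local error count it was given. */
static int finalise_param_warn(int local_param_errors)
{
  stub_barrier();

  if (local_param_errors > 0)
  {
    fprintf(stderr, "%d parameter error(s) found, aborting\n",
            local_param_errors);
    stub_abort();
  }

  return local_param_errors;
}
```

With this ordering, a process that has no local errors still waits at the
barrier until all others have finished their parameter checks, so no
process reaches the abort while another is still writing its warnings.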
In addition to this good idea, we could insert a sleep(10) in
CCTK_Abort, so that other processors have a bit of time to catch up
before aborting. This should often be enough for them to produce
some additional debug output. (I'm thinking of a new parameter INT
sleep_time_before_abort, with a default value of 10 or so.)
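A minimal sketch of what I have in mind, assuming a POSIX system where
sleep() from <unistd.h> is available. The names delayed_abort and
stub_abort_after_delay are made up for this example, and the stub only
records that an abort was requested instead of ending the process. The
default of 10 seconds is the value suggested above, not something that
exists in the flesh today.

```c
#include <unistd.h>

/* Suggested default for the proposed sleep_time_before_abort parameter. */
#define DEFAULT_SLEEP_TIME_BEFORE_ABORT 10

/* Hypothetical stand-in for the real abort, so the sketch can be run
   without terminating the process. */
static int abort_requested = 0;

static void stub_abort_after_delay(void)
{
  abort_requested = 1;
}

/* Wait for the configured number of seconds so that other processes
   can catch up and produce their output, then abort.
   Returns the unslept seconds reported by sleep(), which is nonzero
   only if the sleep was interrupted by a signal. */
static unsigned int delayed_abort(int sleep_time_before_abort)
{
  unsigned int remaining = 0;

  if (sleep_time_before_abort > 0)
  {
    remaining = sleep((unsigned int) sleep_time_before_abort);
  }

  stub_abort_after_delay();
  return remaining;
}
```

Calling delayed_abort(DEFAULT_SLEEP_TIME_BEFORE_ABORT) would give the other
processors ten seconds; a value of 0 skips the wait and keeps the current
behaviour.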
-erik
--
Erik Schnetter <schnetter at cct.lsu.edu>
My email is as private as my paper mail. I therefore support encrypting
and signing email messages. Get my PGP key from www.keyserver.net.