[Developers] nasty bug

Jonathan Thornburg jthorn at aei.mpg.de
Thu May 5 08:17:14 CDT 2005


Hi,

I've just fixed a nasty bug in some thorns of mine, which for various
reasons took me over a month to track down :(.  I think this bug might
well bite many other programmers, so I'm writing this message to warn
everyone about it, and explain how to avoid it.


The Bug
=======

I'll illustrate this for the particular case which bit me, but I'm
sure everyone can see other analogous trouble situations...

Suppose you have some code which might get handed a static-conformal
metric (of the type managed by the StaticConformal thorn in the
CactusEinstein arrangement), but which can only handle a physical
meetric.  You might write code like this to make sure that the
following code always sees a physical metric:

c
c Assume we're in a thorn which inherits from StaticConformal,
c and declares the StaticConformal aliased fn ConfToPhysInPlace()
c in interface.ccl.
c

c
c This subroutine ensures the metric is physical, converting
c it from static-conformal to physical if necessary.
c
subroutine ensure_metric_is_physical(CCTK_ARGUMENTS)
DECLARE_CCTK_ARGUMENTS

c if metric is static-conformal, convert it to physical
if (conformal_state .gt. 0) then
 	call CCTK_INFO('converting metric from static-conformal to physical')

 	c do the actual conversion using aliased function
 	c provided by thorn StaticConformal
         call ConfToPhysInPlace(cctk_lsh(1), cctk_lsh(2), cctk_lsh(3),
                                psi,
                                gxx, gxy, gxz, gyy, gyz, gzz);

         c record that we've done the conversion
         conformal_state = 0
end if
return
end


What's Wrong
============

This code works fine in a unigrid setting (eg anything run with PUGH),
but fails in any non-unigrid environment (eg anything with mesh refinement
in Carpet, or anything multipatch).  The problem is that  conformal_state
is *global*, but this code must (will, assuming you schedule it correctly)
be called on *each* component/patch grid.

What will happen will be this:
* subroutine called on first component/patch grid
   * convert metric to physical on first component/patch grid
   * set conformal_state to 0
* subroutine called on next component/patch grid
   * conformal_state is already 0, so subroutine is a no-op. :(
etc etc

Thus the metric will be left physical in the first component/patch,
but static-conformal in all others. :( :(


The Fix
=======

The solution is to move the reset of  conformal_state  out of this
routine into a separately scheduled routine, which is scheduled only
*after* the actual conversion has been done on all components/patches.
For Carpet, the reset of  conformal_state  should probably be scheduled
in global mode, too.

More generally, I think the key lesson to learn from this bug is that
Erik's advice in sections 5.1 and 5.3 of
arrangements/Carpet/doc/documentation.tex,
is sehr important and must be followed sehr carefully:

> Of course a serial applicate usually will have to be changed to
> support multiple processors.  In order to do so, all the operations
> that the application performs have to be classified into one of two
> categories:
> 
> One category contains the so-called \emph{local} operations.  These
> are operations that are applied to each and every grid point
> individually, and that do not depend on any other grid point except
> nearby neighbours.  Each local operation will thus involve a loop over
> all grid points, and in order to run on multiple processors, after
> each such loop the synchronisation routine has to be called.  An
> example of a local operation would be calculating a spatial
> derivative.
> 
> The other category contains so-called \emph{global} operations.  These
> operations do not depend on individual grid points, and thus do not
> involve loops over grid points.  The result of a global operation is
> the same on all processors; therefore global operations don't involve
> communication and don't require synchronisation.  An example of a
> global operation would be to check how many time steps have been
> taken, and decide whether the simulation should be terminated.

> [[It's necessary to]] break up the time evolution routine into several
> smaller routines, each consisting of a single either local or global
> operation.  (``Local'' and ``global'' have here the exact same
> meanings that were defined above for parallelisation.)  A local
> operation works, by definition, on individual grid points.  Hence the
> local routines have to be called once for every grid component.  A
> global operation, by definition, does not depend on individual grid
> points.  Hence it has to be called only once per processor, and not
> once per component.  That means that the driver has to be told the
> category individual routine is in.

(In my case, actually converting the metric from static-conformal to
physical is a local operation, while resetting  conformal_state  is
a global operation.  So these have to be kept separate!)

ciao, happy programming-in-the-wierd-and-wonderful-world-of-non-unigrid,

-- 
-- Jonathan Thornburg <jthorn at aei.mpg.de>
    Max-Planck-Institut fuer Gravitationsphysik (Albert-Einstein-Institut),
    Golm, Germany, "Old Europe"     http://www.aei.mpg.de/~jthorn/home.html
    "Washing one's hands of the conflict between the powerful and the
     powerless means to side with the powerful, not to be neutral."
                                       -- quote by Freire / poster by Oxfam




More information about the Developers mailing list