/*! \page testpio_example testpio: a regression and benchmarking code

The testpio directory, included with the release package, tests both the accuracy
and performance of reading and writing data
using the pio library.

The testpio directory contains 3 perl scripts that you can use to build and run the testpio.F90 code.
<ul>
<li>testpio_build.pl - builds the pio, timing and testpio libraries and executables</li>
<li>testpio_bench.pl - setups, builds and runs a user specified set of test suites using testpio and generates log files with benchmarking information. </li>
<li>testpio_run.pl - setups, builds and runs a user specified set of test suites using testpio. </li>
</ul>

Additional C shell scripts wrappers are packaged with the testpio suite to allow for environment customization of the 3 perl scripts listed above. The following help information describes in more detail how the testpio code works.

 The tests are controlled via a namelist. Sample namelist files
are located in the testpio/namelists directory. It contains a
set of general namelists and specific namelists to setup a computational
decomposition and an IO decomposition.  The computational decomposition
should be setup to duplicate a realistic model data decomposition.  The IO decomposition
is generally not used, but in some cases, can be used and impacts IO performance.
The IO decomposition is an intermediate decomposition that provides
compatability between a relative arbitrary computational decomposition
and the MPI-IO, netcdf, pnetcdf, or other IO layers.  Depending on the
IO methods used, only certain IO decompositions are valid.  In general,
the IO decomposition is not used and is set internally.

The namelist input file is called "testpio_in".  The first namelist
block, io_nml, contains some general settings:

<table border="1" cellpadding="2px">
  <tr>
    <th colspan="2" align="left">namelist io_nml</th>
  </tr>
  <tr>
    <td>casename</td>       <td>string, user defined test case name</td>
  </tr>
  <tr>
    <td>nx_global</td>      <td>integer, global size of "x" dimension</td>
  </tr>
  <tr>
    <td>ny_global</td>      <td>integer, global size of "y" dimension</td>
  </tr>
  <tr>
    <td>nz_global</td>      <td>integer, glboal size of "z" dimension</td>
  </tr>
  <tr>
    <td>ioFMT</td>          <td>string, type and i/o method of data file
                     ("bin","pnc","snc"), binary, pnetcdf, or serial netcdf</td>
  </tr>
  <tr>
    <td>rearr</td>          <td>string, type of rearranging to be done
                     ("none","mct","box","boxauto")</td>
  </tr>
  <tr>
    <td>nprocsIO</td>       <td>integer, number of IO processors used only when rearr is
                     not "none", if rearr is "none", then the IO decomposition
                     will be the computational decomposition</td>
  </tr>
  <tr>
    <td>base</td>           <td>integer, base pe associated with nprocIO striding</td>
  </tr>
  <tr>
    <td>stride</td>         <td>integer, the stride of io pes across the global pe set. A stride=-1
    	 	     directs PIO to calculate the stride automatically.</td>
  </tr>
  <tr>
    <td>num_aggregator</td> <td>integer, mpi-io number of aggregators, only used if no
                     pio rearranging is done</td>
  </tr>
  <tr>
    <td>dir</td>            <td>string, directory to write output data, this must exist
                     before the model starts up</td>
  </tr>
  <tr>
    <td>num_iodofs</td>     <td>tests either 1dof or 2dof init decomp interfaces (1,2)</td>
  </tr>
  <tr>
    <td>maxiter</td>        <td>integer, the number of trials for the test</td>
  </tr>
  <tr>
    <td>DebugLevel</td>     <td>integer, sets the debug level (0,1,2,3)</td>
  </tr>
  <tr>
    <td>compdof_input</td>  <td>string, setting of the compDOF ('namelist' or a filename)</td>
  </tr>
  <tr>
    <td>compdof_output</td> <td>string, whether the compDOF is saved to disk
                     ('none' or a filename)</td>
  </tr>
</table>

Notes:
  - the "mct" rearr option is not currently available
  - if rearr is set to "none", then the computational decomposition is also
    going to be used as the IO decomposition.  The computation decomposition
    must therefore be suited to the underlying I/O methods.
  - if rearr is set to "box", then pio is going to generate an internal
    IO decomposition automatically and pio will rearrange to that decomp.
  - num_aggregator is used with mpi-io and no pio rearranging.  mpi-io is only
    used with binary data.
  - nprocsIO, base, and stride implementation has some special options
    - if nprocsIO >  0 and stride >  0, then use input values
    - if nprocsIO >  0 and stride <= 0, then stride=(npes-base)/nprocsIO
    - if nprocsIO <= 0 and stride >  0, then nprocsIO=(npes-base)/stride
    - if nprocsIO <= 0 and stride <= 0, then nprocsIO=npes, base=0, stride=1

Two other namelist blocks exist to described the computational
and IO decompositions, compdof_nml and iodof_nml.  These namelist
blocks are identical in use.

<table border="1" cellpadding="2px">
  <tr>
	<th colspan="2" align="left">namelist compdof_nml - or - iodof_nml</th>
  </tr>
  <tr>
    <td>nblksppe</td>   <td>integer, sets the number of blocks desired per pe,
                 the default is one per pe for automatic decomposition.
                 increasing this increases the flexibility of decompositions.</td>
  </tr>
  <tr>
    <td>grdorder</td>   <td>string, sets the gridcell ordering within the block
                 ("xyz","xzy","yxz","yzx","zxy","zyx")</td>
  </tr>
  <tr>
    <td>grddecomp</td>  <td>string, sets up the block size with gdx, gdy, and gdz, see
                 below, ("x","y","z","xy","xye","xz","xze","yz","yze",
                 "xyz","xyze","setblk")</td>
  </tr>
  <tr>
    <td>gdx</td>        <td>integer, "x" size of block</td>
  </tr>
  <tr>
    <td>gdy</td>        <td>integer, "y" size of block</td>
  </tr>
  <tr>
    <td>gdz</td>        <td>integer, "z" size of block</td>
  </tr>
  <tr>
    <td>blkorder</td>   <td>string, sets the block ordering within the domain
                 ("xyz","xzy","yxz","yzx","zxy","zyx")</td>
  </tr>
  <tr>
    <td>blkdecomp1</td> <td>string, sets up the block / processor layout within the domain
                 with bdx, bdy, and bdz, see below.
                 ("x","y","z","xy","xye","xz","xze","yz","yze","xyz","xyze",
                 "setblk","cont1d","cont1dm")</td>
  </tr>
  <tr>
    <td>blkdecomp2</td> <td>string, provides an additional option to the block decomp
                 after blkdecomp1 is computes ("","ysym2","ysym4")</td>
  </tr>
  <tr>
    <td>bdx</td>        <td>integer, "x" numbers of contiguous blocks</td>
  </tr>
  <tr>
    <td>bdy</td>        <td>integer, "y" numbers of contiguous blocks</td>
  </tr>
  <tr>
    <td>bdz</td>        <td>integer, "z" numbers of contiguous blocks</td>
  </tr>
</table>


A description of the decomposition implementation and some examples
are provided below.

Testpio writes out several files including summary information to
stdout, data files to the namelists directory, and a netcdf
file summarizing the decompositions.  The key output information
is written to stdout and contains the timing information.  In addition,
a netcdf file called gdecomp.nc is written that provides both the
block and task ids for each gridcell as computed by the decompositions.
Finally, foo.* files are written by testpio using the methods
specified.

Currently, the timing information is limited to the high level
pio read/write calls which generally will also include copy and rearrange
overhead as well as actual I/O time.  Addition timers will be added
in the future.

The test script is called testpio_run.pl, it uses the hostname
function to determine the platform.  New platforms can be added by
editing the files build_defaults.xml and Utils.pm.  If more than one
configuration should be tested on a single platform you can provide
two hostnames in this file and specify the name to test in a --host
option to testpio_run.pl

There are several testpio_in files for the pio test suite.  The ones that
come with pio test specific things.  In general, these tests include:
<ul>
 <li>sn = serial netcdf and no rearrangement </li>
 <li>sb = serial netcdf and box rearrangement</li>
 <li>pn = parallel netcdf and no rearrangement</li>
 <li>pb = parallel netcdf and box rearrangement</li>
 <li>bn = binary I/O and no rearrangement</li>
 <li>bb = binary I/O and box rearrangment and the test number (01, etc) is consistent across I/O methods with</li>
 <li>01 = all data on root pe, only root pe active in I/O</li>
 <li>02 = simple 2d xy decomp across all pes with all pes active in I/O</li>
 <li>03 = all data on root pe, all pes active in I/O</li>
 <li>04 = simple 2d xy decomp with yxz ordering and stride=4 pes active in I/O</li>
 <li>05 = 2d xy decomp with 4 blocks/pe, yxz ordering, xy block decomp, and stride=4 pes active in I/O</li>
 <li>06 = 3d xy decomp with 4 blocks/pe, yxz ordering, xy block decomp, and stride=4 pes active in I/O</li>
 <li>07 = 3d xyz decomp with 16 blocks/pe, yxz ordering, xyz block decomp with block yzx ordering and stride=4 pes active in I/O</li>
 <li>08 = 2d xy decomp with 4 blocks/pe and yxz grid ordering, yxz block ordering and cont1d block decomp the rd01 and wr01 tests are distinct and test writing, reading and use of DOF data via pio methods.</li>
</ul>

PIO can use several backend libraries including netcdf, pnetcdf and mpi-io.   For each library used, a
compile time cpp flag is defined (eg _USEMPIIO).   The test suite builds and tests the model for several
combinations of these cpp flags.
<ul>
 <li>snet  = serial netcdf only</li>
 <li>pnet  = parallel netcdf only</li>
 <li>mpiio = mpiio only</li>
 <li>all   = everything on</li>
 <li>ant   = everything on but timing disabled</li>
</ul>

========================================================================
\section Decomposition

The decomposition implementation supports the decomposition of
a general 3 dimensional "nx * ny * nz" grid into multiple blocks
of gridcells which are then ordered and assigned to processors.
In general, blocks in the decomposition are rectangular,
"gdx * gdy * gdz" and the same size, although some blocks around
the edges of the domain may be smaller if the decomposition is uneven.
Both gridcells within the block and blocks within the domain can be
ordered in any of the possible dimension hierarchies, such as "xyz"
where the first dimension is the fastest.

The gdx, gdy, and gdz inputs allow the user to specify the size in
any dimension and the grddecomp input specifies which dimensions are
to be further optimized.  In general, automatic decomposition generation
of 3 dimensional grids can be done in any of possible combination of
dimensions, (x, y, z, xy, xz, yz, or xyz), with the other dimensions having a
fixed block size.  The automatic generation of the decomposition is
based upon an internal algorithm that tries to determine the most
"square" blocks with an additional constraint on minimizing the maximum
number of gridcells across processors.  If evenly divided grids are
desired, use of the "e" addition to grddecomp specifies that the grid
decomposition must be evenly divided.  The setblk option uses the
prescibed gdx, gdy, and gdz inputs without further automation.

The blkdecomp1 input works fundamentally the same way as the grddecomp
in mapping blocks to processors, but has a few additional options.
"cont1d" (contiguous 1d) basically unwraps the blocks in the order specified
by the blkorder input and then decomposes that "1d" list of blocks
onto processors by contiguously grouping blocks together and allocating
them to a processor.  The number of contiguous blocks that are
allocated to a processor is the maximum of the values of bdx, bdy, and
bdz inputs.  Contiguous blocks are allocated to each processor in turn
in a round robin fashion until all blocks are allocated.  The
"cont1dm" does basically the same thing except the number of
contiguous blocks are set automatically such that each processor
recieves only 1 set of contiguous blocks.  The ysym2 and ysym4
blkdecomp2 options modify the original block layout such that
the tasks assigned to the blocks are 2-way or 4-way symetric
in the y axis.

The decomposition tool is extremely flexible, but arbitrary
inputs will not always yield valid decompositions.  If a valid
decomposition cannot be computed based on the global grid size,
number of pes, number of blocks desired, and decomposition options,
the model will stop.

As indicated above, the IO decomposition must be suited to the
IO methods, so decompositions are even further limited by those
constraints.  The testpio tool provides limited checking about
whether the IO decomposition is valid for the IO method used.
Since the IO output is written in "xyz" order, it's likely the
best IO performance will be achieved with both grdorder and blkorder
set to "xyz" for the IO decomposition.

Also note that in all cases, regardless of the decomposition,
the global gridcell numbering and ordering in the output file
is assumed to be "xyz" and defined as a single block.  The number
scheme in the examples below demonstrates how the namelist input
relates back to the grid numbering on the local computational
decomposition.

Some decomposition examples:
<ul>
  <li>"B" is the block number</li>
  <li>"P" is the processor (1:npes) the block is associated with
  numbers are the local gridcell numbering within the block if the
    local dimensions are unrolled.</li>
</ul>

Standard xyz ordering, 2d decomp:
note: blkdecomp plays no role since there is 1 block per pe
<pre>
 nx_global  6
 ny_global  4
 nz_global  1           ______________________________
 npes       4          |B3  P3        |B4  P4         |
 nblksppe   1          |              |               |
 grdorder   "xyz"      |              |               |
 grddecomp  "xy"       |              |               |
 gdx        0          |              |               |
 gdy        0          |--------------+---------------|
 gdz        0          |B1  P1        |B2  P2         |
 blkorder   "xyz"      |  4    5   6  |  4   5   6    |
 blkdecomp1 "xy"       |              |               |
 blkdecomp2 ""         |              |               |
 bdx        0          |  1    2   3  |  1   2   3    |
 bdy        0          |______________|_______________|
 bdz        0
</pre>

Same as above but yxz ordering, 2d decomp
note: blkdecomp plays no role since there is 1 block per pe
<pre>
 nx_global  6
 ny_global  4
 nz_global  1           _____________________________
 npes       4          |B2  P2        |B4  P4        |
 nblksppe   1          |              |              |
 grdorder   "yxz"      |              |              |
 grddecomp  "xy"       |              |              |
 gdx        0          |              |              |
 gdy        0          |--------------+--------------|
 gdz        0          |B1  P1        |B3  P3        |
 blkorder   "yxz"      |  2    4   6  |  2   4   6   |
 blkdecomp1 "xy"       |              |              |
 blkdecomp2 ""         |              |              |
 bdx        0          |  1    3   5  |  1   3   5   |
 bdy        0          |______________|______________|
 bdz        0
</pre>

xyz grid ordering, 1d x decomp
               note: blkdecomp plays no role since there is 1 block per pe
               note: blkorder plays no role since it's a 1d decomp
<pre>
 nx_global  8
 ny_global  4
 nz_global  1           _____________________________________
 npes       4          |B1  P1  |B2  P2   |B3  P3  |B4  P4   |
 nblksppe   1          | 7   8  |  7   8  |        |         |
 grdorder   "xyz"      |        |         |        |         |
 grddecomp  "x"        |        |         |        |         |
 gdx        0          | 5   6  |  5   6  |        |         |
 gdy        0          |        |         |        |         |
 gdz        0          |        |         |        |         |
 blkorder   "yxz"      | 3   4  |  3   4  |        |         |
 blkdecomp1 "xy"       |        |         |        |         |
 blkdecomp2 ""         |        |         |        |         |
 bdx        0          | 1   2  |  1   2  |        |         |
 bdy        0          |________|_________|________|_________|
 bdz        0
</pre>

yxz block ordering, 2d grid decomp, 2d block decomp, 4 block per pe
<pre>
 nx_global  8
 ny_global  4
 nz_global  1           _____________________________________
 npes       4          |B4  P2  |B8  P2   |B12  P4 |B16  P4  |
 nblksppe   4          |        |         |        |         |
 grdorder   "xyz"      |--------+---------+--------+---------|
 grddecomp  "xy"       |B3  P2  |B7  P2   |B11  P4 |B15  P4  |
 gdx        0          |        |         |        |         |
 gdy        0          |--------+---------+--------+---------|
 gdz        0          |B2  P1  |B6  P1   |B10  P3 |B14  P3  |
 blkorder   "yxz"      |        |         |        |         |
 blkdecomp1 "xy"       |--------+---------+--------+---------|
 blkdecomp2 ""         |B1  P1  |B5  P1   |B9   P3 |B13  P3  |
 bdx        0          | 1   2  | 1   2   |        |         |
 bdy        0          |________|_________|________|_________|
 bdz        0
</pre>

*/
