The Research Group's Sun Fire X4600 Server

In October of 2006 our research group received a donation of a Sun Fire X4600 as the result of an Academic Excellence Grant from Sun Microsystems.

Thus far, the following software has been ported to the machine and to Solaris x86/x64 for the purposes of our research:
  1. The Single Column Model developed for studying boundary layer clouds by Dr. Larson and Dr. Golaz.

    1. The model itself was developed with Portland Group Fortran 90 over several years and was ported to Sun Fortran 8.1/8.2 within the last 2 years, when the Sun Studio development suite was made freely available to us.  Probably the greatest benefit of this port was the ability to run the code profiler (collect/analyzer) and find bottlenecks in the model.
       
    2. The major modification for the Sun Fire was adding the ability to use OpenMP to compute multiple columns within a forecast model simultaneously.  The original model kept several data sections in Fortran 90 modules that were shared across the code, and these were not thread safe.  Through the use of !$omp threadprivate directives and other modifications within the SCM code, we were able to use the scheme within a large 3D forecast model with OpenMP parallelism.  Initial results look promising.

  2. A variant of COAMPS, the forecast model developed by the Naval Research Laboratory, has been ported to the Sun Fire.  This particular variant has been modified to run Large Eddy Simulations.

    1. The model itself is written in Fortran 90 and uses MPI for parallelism.  After some modification of the code for portability, we were able to compile mpich 1.2.7 with the shared-memory communication device using Sun Fortran 8.2.  Simulations run on the machine using all 8 Opteron processors compare favorably to the old method of running on GNU/Linux nodes over a Gigabit Ethernet connection.

    2. Several past GCSS simulations have been successfully re-run to test COAMPS built with Sun Fortran.  One such case was DYCOMS II RF02.

      A visualization of rain water mixing ratio that was rendered using Vis5d.

      This was created by running the COAMPS forecast model and generating the 3D stat data files on the Sun Fire, converting them to Vis5d format, rendering and capturing each frame on a desktop workstation with Vis5d, and finally using ffmpeg to generate the MPEG-4 animation.

    3. Now that the code has been modified for use on the Sun Fire, we'll be running future models there as well.

  3. Dr. Kharoutdinov's LES forecast model “SAM” has been successfully compiled and run using Sun HPC Cluster Tools 6.  Because of the speed with which the SAM code could be compiled and run, we ran a number of interesting benchmarks using the cluster tools.

    1. Dr. Kharoutdinov himself ran a number of tests using the GATE case without interactive radiation (10 second time step / 100 model iterations) on older IBM and SGI HPC systems.  Although these machines are now obsolete, we used his numbers as our baseline: running the same benchmark let us confirm that we had configured the model correctly and gave us a reference for the model's run time.

    2. Our first configuration simply used a low-level -xO3 optimization and -xarch=native64, paired with the mpich-shmem MPI as we had used with COAMPS.  Our COAMPS variant had had a number of issues with MPI-2 in the past, so we did not use Cluster Tools 6 initially.  For reasons we did not determine, mpich could not run the SAM model with 8 processors, so we tried Cluster Tools 6 and found that it worked with an arbitrary number of processors.

      1. Using the Cluster Tools had the added bonus of allowing us to run mpprof, a tool that provides diagnostic information about MPI jobs on a cluster.  To use it, we simply ran:
        MPI_PROFILE=1 mprun -np 8 SAM

        and then:
        mpprof -c 90 mpprof.index.cre.<jobnum>

        which produces diagnostic output like this:

        OVERVIEW
        ========

        The program being reported on is "<unknown>," which ran as job name "cre.15" on Sat Feb
        10 20:21:05 2007.

        Profiled Time Range:

          Start at elapsed time 0.000003 secs
          End at elapsed time   50.643227 secs
          Total duration is     50.643224 secs
          Fraction spent in MPI 18.6%

        Elapsed time is measured from the end of MPI_Init. Data is being reported for 8 processes
        of a 8-process job.

        LOAD BALANCE
        ============

        Data is being reported on 8 MPI processes. The following histogram shows how these
        processes were distributed as a function of the fraction of the time the processes spent
        in MPI routines:

          Number of MPI Processes

         10-|
            |
          9-|
            |
          8-|
            |
          7-|
            |
          6-|
            |
          5-|
            |
          4-|
            |
          3-|
            |
          2-|                                                             #
            |                                                             #
          1-|  ###                                                        #   ##            #
            |  ###                                                        #   ##            #
          0-+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+------
            6.0  7.5  9.1 10.6 12.1 13.7 15.2 16.7 18.3 19.8 21.3 22.8 24.4 25.9 27.4 29.0
             Percentage time in MPI

        Rank       Hostname        MPI Time
           0  ketley.math.uwm.edu     6.01%
           3  ketley.math.uwm.edu     6.48%
           7  ketley.math.uwm.edu     6.70%
           2  ketley.math.uwm.edu    24.19%
           5  ketley.math.uwm.edu    24.28%
           6  ketley.math.uwm.edu    25.30%
           1  ketley.math.uwm.edu    25.77%
           4  ketley.math.uwm.edu    29.88%

        Low MPI time for an MPI process may indicate the process has too much of a compute load.
        A high compute load forces the other processes to wait, increasing their MPI time.

        To focus reporting on one or more particular processes you may use the -p option on the
        command line for mpprof. Type "mpprof -h" for more information.

        MPI ENVIRONMENT VARIABLES
        =========================

        MPI_POLLALL:
        You ran with full polling of connections. This means that Sun MPI monitored all
        connections for incoming messages, whether your program explicitly posted receive
        requests for those connections or not. Typically, this leads to a degradation in
        performance.

        Suggestion: Set the environment variable MPI_POLLALL to "0".

        Warning: If your program relies on MPI_Send to provide substantial internal buffering of
        messages, this suggestion could result in deadlock. On the other hand, that would be an
        indication that the program in not MPI compliant. If such deadlock results, it may be
        resolved by disregarding this suggestion. Even better performance could result, however,
        by modifying the program to post the appropriate receives or, in some cases, by setting
        MPI environment variables to increase internal buffers.

        MPI_SPIN:
        You ran without spin waiting. When a Sun MPI processes is waiting on an event (receipt of
        a message, availability of internal buffer space, et cetera), it may yield its processor
        (CPU) temporarily to allow other processes to take advantage of that computational
        resource. While such a policy helps the system, as a whole, often to be more productive,
        it can slow the performance of a job that is intented to make best use of a dedicated
        system.

        Suggestion: Set the environment variable MPI_SPIN to "1".

        Warning: This suggestion can hamper performance if there are insufficient processors on a
        node to handle the computational load, including background system daemons.

        MPI_PROCBIND:
        You ran without process binding: Processing binding can help overcome operating system
        scheduling algorithms that, unfortunately, sometimes degrade the performance of large,
        dedicated, multiprocess parallel jobs.

        Suggestion: Set the environment variable MPI_PROCBIND to "1"

        Warning: This setting can hamper performance tremendously if any of your MPI processes is
        multithreaded or if other jobs are competing for computational resources on your system.

        SHM
        ===

        MPI_SHM_CPOOLSIZE:
        The SHM Protocol Module encountered a connection pool that was congested. This impeded a
        process in sending a message. There are two strategies for dealing with this. One is to
        set MPI_SHM_CPOOLSIZE to some value greater than 24576, which is what you used by
        default.If your code has many threads per process all trying to send point-to-point
        messages, increasing MPI_SHM_CPOOLSIZE to 436224 may be your best strategy.

        If multiple threads sending messages is not anticipated to be significant problem, then a
        better strategy is to organize internal buffer pools by 'sender' rather than by
        'connection'.

        Pooling buffers in this manner allows surges in one connection to be accommodated,
        potentially, by reserves in another. The appropriate setting for MPI_SHM_SBPOOLSIZE is
        hard to determine, but a starting guess should definitely exceed 24576.

        Suggestion: Set environment variable MPI_SHM_SBPOOLSIZE to 436224
        The relative SHM memory consumption of 'send buffer pools' versus 'connection pools'
        depends not only on the sizes of the pools, but also on how many MPI processes are on a
        node. For example, on node 'ketley.math.uwm.edu', you had only 8 processes.
         Moving to send buffer pools with the above suggestion would require an extra 2113536
        bytes on that node.

        SUGGESTION SUMMARY
        ==================

        Summary of environment variable suggestions:

          Set: MPI_POLLALL=0
          Set: MPI_PROCBIND=1
          Set: MPI_SHM_SBPOOLSIZE=436224
          Set: MPI_SPIN=1

        In the C shell, these environment variables may be set by the following commands:

        setenv MPI_POLLALL 0
        setenv MPI_PROCBIND 1
        setenv MPI_SHM_SBPOOLSIZE 436224
        setenv MPI_SPIN 1

        In the Bourne or Korn shell, these environment variables may be set by the following
        commands:

        export MPI_POLLALL=0
        export MPI_PROCBIND=1
        export MPI_SHM_SBPOOLSIZE=436224
        export MPI_SPIN=1

        BREAKDOWN BY MPI ROUTINE
        ========================

        Here, averages over all MPI processes profiled are reported. The numbers in parentheses
        roughly indicate the variations there are among all of the MPI processes. These
        variations are computed as (1-min/max)/2 where "min" and "max" are the minimum and
        maximum values, respectively, for each statistic reported. A total of 8 different MPI
        APIs were called.

         MPI Routine         Time          Calls Made           Sent             Received
        MPI_Allreduce  1.518302 (46.4%)      359  (0.0%)     526620  (0.0%)     526620  (0.0%)
        MPI_Barrier    1.919698 (41.8%)     2218  (0.0%)          0  (0.0%)          0  (0.0%)
        MPI_Comm_rank  0.000021 (14.5%)        1  (0.0%)          0  (0.0%)          0  (0.0%)
        MPI_Comm_size  0.000000 (23.0%)        1  (0.0%)          0  (0.0%)          0  (0.0%)
        MPI_Irecv      0.059352 (41.9%)    14800  (0.0%)          0  (0.0%)          0  (0.0%)
        MPI_Send       1.661869 (17.7%)    14800  (0.0%)  426598400  (0.0%)          0  (0.0%)
        MPI_Test       3.265142 (45.1%)  6060394 (45.2%)          0  (0.0%)  420454400  (0.0%)
        MPI_Waitall    0.983842 (50.0%)      200  (0.0%)          0  (0.0%)    6144000  (0.0%)

        Where "Time" is in seconds and "Sent" and "Received" are in bytes.

        TIME DEPENDENCE
        ===============

        No time-dependent information is being printed since the duration 50.64322 from the start
        time 0.00000 to the end time 50.64323 is less than the profiling interval 60.0-second.
        For time-dependent information, rerun your MPI code with the environment variable
        MPI_PROFINTERVAL set to some value much less than 50.64322. See the mpprof(1) man page
        for more information about MPI_PROFINTERVAL.

        CONNECTIONS
        ===========

        A connection is a sender/receiver pair. For 8 processes, there are 8x8=64 connections,
        including send-to-self connections.

        Here are statistics on the messages sent for each connection, reported on a scale of 0-99
        with 99 corresponding to 3801 messages:

            sender
             0  1  2  3  4  5  6  7
        receiver
          0  _ 99 52 88  5  5 46 88
          1 98  _ 88 52  5  5 88 46
          2 46 88  _ 98 52 88  5  5
          3 88 46 98  _ 88 52  5  5
          4  5  5 46 88  _ 98 52 88
          5  5  5 88 46 98  _ 88 52
          6 52 88  5  5 46 88  _ 98
          7 88 52  5  5 88 46 98  _

        Here are statistics on the bytes sent for each connection, reported on a scale of 0-99
        with 99 corresponding to 168147200 bytes:

            sender
             0  1  2  3  4  5  6  7
        receiver
          0  _ 99 49 15 13 13 45 15
          1 98  _ 15 49 13 13 15 45
          2 45 15  _ 98 49 15 13 13
          3 15 45 98  _ 15 49 13 13
          4 13 13 45 15  _ 98 49 15
          5 13 13 15 45 98  _ 15 49
          6 49 15 13 13 45 15  _ 98
          7 15 49 13 13 15 45 98  _

        The average length of point-to-point messages was 28822 bytes per message.
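The per-rank MPI times in the report above can be summarized with a few lines of Python. This is only an illustrative sketch: the `mpi_time_pct` dictionary is typed in by hand from the Rank / MPI Time table, the function names and the 10% threshold are hypothetical, and the spread formula (1 - min/max)/2 is borrowed from mpprof's own description of how it reports variation. Applying that formula to the per-rank MPI time is our assumption, not something mpprof prints.

```python
# Per-rank share of wall-clock time spent inside MPI routines, copied by hand
# from the "Rank / Hostname / MPI Time" table in the mpprof report (percent).
mpi_time_pct = {0: 6.01, 3: 6.48, 7: 6.70, 2: 24.19,
                5: 24.28, 6: 25.30, 1: 25.77, 4: 29.88}


def mean_mpi_pct(pct_by_rank):
    """Average MPI time share across ranks, in percent."""
    return sum(pct_by_rank.values()) / len(pct_by_rank)


def variation(pct_by_rank):
    """Spread measure (1 - min/max) / 2, the formula mpprof describes."""
    lo, hi = min(pct_by_rank.values()), max(pct_by_rank.values())
    return (1.0 - lo / hi) / 2.0


def low_mpi_ranks(pct_by_rank, threshold_pct=10.0):
    """Ranks whose MPI share is below a (hypothetical) threshold.

    Per the report's own note, a low MPI time may mean the rank carries
    more of the compute load while the other ranks wait on it.
    """
    return sorted(r for r, p in pct_by_rank.items() if p < threshold_pct)


if __name__ == "__main__":
    print(f"mean MPI share: {mean_mpi_pct(mpi_time_pct):.1f}%")
    print(f"variation:      {variation(mpi_time_pct):.1%}")
    print(f"low-MPI ranks:  {low_mpi_ranks(mpi_time_pct)}")
```

The mean reproduces the 18.6% "Fraction spent in MPI" figure from the overview, and ranks 0, 3, and 7 stand apart from the other five, which is the imbalance the histogram shows.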

        After running with the suggested options and the profiler enabled a number of times, we finally settled on the following options for an 8-processor run:

        export MPI_POLLALL=0
        export MPI_PROCBIND=1
        export MPI_SHM_SBPOOLSIZE=1024000
        export MPI_SHM_NUMPOSTBOX=54
        export MPI_SPIN=1

        With these settings the wall clock time dropped from the original 51 seconds to 39 seconds: a reduction of about 24% in run time, or equivalently a speedup of roughly 1.3x.  While a few seconds seems unimportant in the case of this benchmark, an actual LES simulation for scientific purposes can run for hours or even days, so a saving of this size is significant.
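The improvement from 51 to 39 seconds can be expressed in two ways that are easy to confuse: a figure near 30% corresponds to the speedup ratio (51/39), while the fraction of wall-clock time actually saved is closer to 24%. The short Python sketch below works through both. The function names are ours, and `projected_hours` assumes that a long simulation scales by the same ratio as this short benchmark, which may not hold in practice.

```python
def fraction_saved(before_s, after_s):
    """Fraction of the original wall-clock time that was eliminated."""
    return (before_s - after_s) / before_s


def speedup(before_s, after_s):
    """How many times faster the tuned run is than the original."""
    return before_s / after_s


def projected_hours(untuned_hours, before_s=51.0, after_s=39.0):
    """Scale a longer run by the benchmark ratio.

    Assumption: the real simulation speeds up by the same factor as the
    100-iteration benchmark, which is not guaranteed.
    """
    return untuned_hours * after_s / before_s


if __name__ == "__main__":
    print(f"time saved: {fraction_saved(51.0, 39.0):.1%}")
    print(f"speedup:    {speedup(51.0, 39.0):.2f}x")
    print(f"24 h run:   {projected_hours(24.0):.1f} h after tuning")
```

For the benchmark figures this gives roughly 23.5% of the run time saved and a speedup of about 1.31x.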

        Baseline + Sun Fire X4600 Machine Results
        (Note: we have an 8-processor X4600 model and cannot do a 4x4 domain decomposition.)

        System                              nsubdomain_x  nsubdomain_y  # of Proc.  Wall Clock (sec)
        ----------------------------------  ------------  ------------  ----------  ----------------
        SGI Origin (R1200 @ 350 MHz)                   1             1           1              1526
                                                       2             2           2               505
                                                       2             4           8               221

        IBM-SP @ 375 MHz                               1             1           1               995
                                                       2             2           4               315
                                                       4             4          16                75
                                                       8             8          64                23

        Sun X4600 (Opteron 885 @ 2.6 GHz)
          (Before using profiling options)             2             4           8                51
          (After using profiling options)              2             4           8                39
                                                       2             2           4                73
                                                       1             1           1               224
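The Sun X4600 timings above can also be read as a strong-scaling result. The following Python sketch computes parallel speedup and efficiency relative to the 1-process run. The dictionary is copied by hand from the X4600 rows, the names are hypothetical, and the results do not say whether the 1- and 4-process runs used the tuned MPI environment variables, so the comparison should be treated as approximate.

```python
# Wall-clock seconds for the Sun X4600 rows of the results table, keyed by
# number of MPI processes. The 8-process entry is the tuned (39 s) run.
X4600_WALL_CLOCK = {1: 224.0, 4: 73.0, 8: 39.0}
X4600_UNTUNED_8 = 51.0  # 8 processes before applying the mpprof suggestions


def scaling_speedup(t_serial, t_parallel):
    """Speedup of a parallel run relative to the 1-process run."""
    return t_serial / t_parallel


def scaling_efficiency(t_serial, t_parallel, nproc):
    """Speedup divided by process count; 1.0 would be ideal scaling."""
    return t_serial / (t_parallel * nproc)


def scaling_table(wall_clock):
    """Return (nproc, speedup, efficiency) tuples sorted by process count."""
    t1 = wall_clock[1]
    return [(n, scaling_speedup(t1, t), scaling_efficiency(t1, t, n))
            for n, t in sorted(wall_clock.items())]


if __name__ == "__main__":
    for n, s, e in scaling_table(X4600_WALL_CLOCK):
        print(f"{n:2d} procs: speedup {s:5.2f}, efficiency {e:.0%}")
    e_untuned = scaling_efficiency(X4600_WALL_CLOCK[1], X4600_UNTUNED_8, 8)
    print(f" 8 procs (untuned): efficiency {e_untuned:.0%}")
```

With these numbers the tuned 8-process run reaches a speedup of about 5.7 (efficiency near 72%), compared with an efficiency near 55% before tuning.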

      2. We also experimented with building SAM using the -xipo, -xprofile=collect/use, and -xvector options, and then re-ran the code under collect/analyzer.  Sadly, IPO and profile feedback did not appreciably improve our runtime performance, but -xvector=simd generated noticeably faster code.  We finally settled on the following options for the Fortran compiler:
        mpf90 -xtarget=native64 -g -c -xO4 -xvector=simd -dalign -ftrap=%none

      3. Now that SAM is running on the Sun Fire, we have run a number of test cases, and at a later date we may experiment with implementing the single column model in the SAM code.