Development of a scalable and coupled hydraulics and sediment model

P. V. Bangalore, J. Zhu, D. Huddleston, A. Skjellum

Engineering Research Center for Computational Field Simulation

Mississippi State University

Mississippi State, MS 39762

{puri,jzhu,hudd,tony}@erc.msstate.edu

 

1. Introduction

The three-dimensional Curvilinear Hydrodynamics model (CH3D-WES) developed at the U.S. Army Engineer Waterways Experiment Station provides a suitable framework for the simulation of unsteady, three-dimensional fixed-bed flow along deep navigation channels and irregular shorelines (Johnson 1991). The model includes physical processes such as tides, wind, salinity and temperature effects, freshwater inflows, turbulence, and the effect of the earth's rotation. The hydraulics model uses a boundary-fitted grid, which provides the detailed resolution needed to represent the complex horizontal geometries found in deep navigation channels and irregular shorelines. A non-cohesive sediment model was added to the hydraulics model to include mobile-bed processes such as bedload transport, suspended-sediment transport, bed level changes, and hydraulic sorting and armoring, as described in (Spasojevic 1994). Further improvements were made to the sediment component of the model to reduce the time taken for sediment computations, as described in (Spasojevic 1997); this sequential version of the model is the starting point of the current work.

The combined hydraulics and sediment model (CH3D-SED) provides a computational framework to study and model several aspects of sediment-flow interaction. The combined model has been used to successfully simulate sedimentation on bendways, crossings, and distributaries of the lower Mississippi and Atchafalaya Rivers for the purpose of evaluating dredging, channel evolution, and channel training processes. Hence the CH3D-SED model was chosen as the basis for coupling the hydraulics, sediment, and wave models. The development of a fully coupled and scalable model for the simulation of physical marine processes has an impact on many national defense and security operations, and a scalable version of the coupled hydraulics and sediment model is an essential component of an effective coupling of the hydraulics, sediment, and wave models. The initial development of a scalable version of the hydraulics model was described in (Zhu 1998); the results herein describe the development of the scalable version of the coupled hydraulics and sediment model. The details of the development of the parallel version of the coupled code, along with the parallelization strategy, are presented in section 2. Section 3 provides details about the test cases, and section 4 describes the performance results for the parallel version of the code on the CEWES MSRC platforms. A summary and conclusions are provided in section 5.

2. Strategy for Parallelization

Domain decomposition is used to distribute data among the processors. In order to minimize idle time, static load-balancing is used to distribute the data such that each processor gets almost the same number of computational points. The partitioning and load-balancing are done in a pre-processing stage, in which separate grid files are generated for each processor along with other necessary information about the partitioning. Thus there is no need to allocate extra storage or to scatter the grid data when the parallel program is executed. At the end of the parallel computation each process writes its output into a separate file, and the post-processing routines combine these files to generate output files suitable for verification and visualization.

2.1 Pre-processing and Partitioning

To parallelize the computation, the three-dimensional spatial domain is decomposed into subdomains. Each processor gets a subdomain and is responsible for the computations in that subdomain. There are many ways to decompose a three-dimensional domain, and there is no unique answer as to which decomposition strategy is best. The choice depends, among other things, on the problem size, the machine parameters (communication speed versus computation speed, for example), and the numerical algorithms used. Based on these considerations, a one-dimensional decomposition strategy is used here to parallelize this code.

A 1-D decomposition in the J direction is therefore used to partition the domain. If the grid size is NX x NY, a simple distribution along the J direction on NP processors would be to divide the NY columns equally among the NP processors. However, the CH3D-SED code uses the I-blanking technique to facilitate the representation of complex geometry, so not all cells in the domain are active or used in the computation; the cells that are used in the computation are referred to as computational cells or active cells. Hence dividing the NY columns equally among NP processors will not provide a uniform distribution of computational cells. For example, the Lake Michigan test case contains 67 x 138 cells in the lateral plane, a total of 9313 cells, but the grid utilizes only 3598 computational (active) cells. It is therefore important to obtain a uniform distribution of computational cells when the grid is partitioned for the parallel runs.

A simple load-balancing scheme is used during the partitioning phase. The complete grid is read, and the number of computational cells along each I line and the total number of computational cells in the grid are computed. The approximate number of computational cells per processor equals the total number of computational cells divided by the number of processors. The numbers of computational cells in successive I lines are then accumulated until the sum reaches this approximate number of computational cells per processor. Since an I line cannot be split between two processors, the number of computational cells will vary slightly from one processor to another. A sample distribution for the Lake Michigan grid on 10 processors is shown in table 1.

Processor   Without load-balancing              With load-balancing
            Index      Columns   Cell Count     Index      Columns   Cell Count
    1       1-14          14        222         1-19          19        351
    2       15-28         14        398         20-31         12        362
    3       29-42         14        431         32-43         12        370
    4       43-56         14        406         44-55         12        349
    5       57-70         14        337         56-70         15        362
    6       71-84         14        316         71-86         16        364
    7       85-98         14        360         87-100        14        371
    8       99-112        14        485         101-110       10        345
    9       113-125       13        486         111-119        9        374
   10       126-138       13        157         120-138       19        350

Table 1. Sample distribution of computational cells with and without load-balancing for 10 processors. "Index" denotes the range of global J indices assigned to a processor and "Columns" denotes the number of columns in the J direction.

The first part of the table shows the number of computational cells on each processor with the simple distribution of columns, while the second part shows the number of computational cells when the load-balancing technique described above is used. In the former case the number of columns remains almost the same while the number of computational cells varies; in the latter case the number of columns varies while the number of computational cells per processor remains roughly the same. Without load-balancing, a processor with fewer computational points finishes its work sooner and has to wait at each phase of the computation to receive boundary values from the neighboring processors, which leads to considerable idle time and reduces the parallel efficiency.
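
As a sketch of this load-balancing scheme, the following Fortran 90 fragment counts the active cells in each I line from an assumed I-blanking mask and then assigns contiguous ranges of J columns to processors until each holds roughly the average number of active cells. The routine and variable names (partition_j, iblank, jstart, jend) are illustrative assumptions and do not correspond to the actual pre-processing code.

    ! Illustrative sketch of the 1-D (J-direction) partitioning with
    ! simple load balancing; names are hypothetical, not from CH3D-SED.
    subroutine partition_j(iblank, nx, ny, np, jstart, jend)
      implicit none
      integer, intent(in)  :: nx, ny, np
      integer, intent(in)  :: iblank(nx, ny)    ! 1 = active cell, 0 = blanked
      integer, intent(out) :: jstart(np), jend(np)
      integer :: cells_in_col(ny), total, avg_cells, p, j, acc

      do j = 1, ny
         cells_in_col(j) = count(iblank(:, j) == 1)   ! active cells in the I line at J = j
      end do
      total     = sum(cells_in_col)
      avg_cells = total / np              ! approximate cells per processor

      p = 1
      acc = 0
      jstart(1) = 1
      do j = 1, ny
         acc = acc + cells_in_col(j)
         if (acc >= avg_cells .and. p < np) then
            jend(p)     = j               ! an I line is never split
            jstart(p+1) = j + 1
            p   = p + 1
            acc = 0
         end if
      end do
      jend(np) = ny
    end subroutine partition_j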

Once the grid boundaries are determined, an extra I line is added on either side of each block to provide single-cell overlap regions. The grids with the overlap regions are then written to separate files based on the processor number, and the distribution of the grid in global coordinates is written to disk for use during the post-processing stage. Along with the grid files, the other necessary files for tide, river, and wind data are also written to disk. The pre-processing and partitioning are performed sequentially on a single processor.

2.2 Parallel Computations

All the required input files are generated during the pre-processing and partitioning phase. Each processor reads the corresponding input and grid files and performs the computation on its local domain. At the end of each phase of the computation, data is exchanged between neighboring processors, and the computations for the next phase proceed in parallel on each processor. With the exception of the salinity and temperature computations, all computations in the hydraulics model use a single-cell overlap, so one cell of data is exchanged at the boundaries. The salinity and temperature computations, as well as all the sediment computations, use a two-cell overlap, which requires the exchange of two cells of data at the boundaries.
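
As an illustration, such a boundary exchange can be sketched in Fortran 90 with MPI roughly as follows. The routine name exchange_halo, the neighbor ranks left and right, and the halo width nh (1 for the hydraulics variables, 2 for salinity, temperature, and sediment) are assumptions for this sketch and are not names from the actual code.

    ! Sketch of a J-direction halo exchange of width nh (1 or 2 cells);
    ! illustrative only, not the CH3D-SED message-passing module.
    subroutine exchange_halo(a, nx, nyl, nz, nh, left, right, comm)
      implicit none
      include 'mpif.h'
      integer, intent(in) :: nx, nyl, nz, nh, left, right, comm
      real, intent(inout) :: a(nx, 1-nh:nyl+nh, nz)   ! local block plus halo columns
      real    :: sbuf(nx*nh*nz), rbuf(nx*nh*nz)
      integer :: ierr, stat(MPI_STATUS_SIZE), n

      n = nx * nh * nz

      ! send the first nh interior J columns to the left neighbor,
      ! receive the right neighbor's first columns into the right halo
      sbuf = reshape(a(:, 1:nh, :), (/ n /))
      call MPI_SENDRECV(sbuf, n, MPI_REAL, left,  0, &
                        rbuf, n, MPI_REAL, right, 0, comm, stat, ierr)
      if (right /= MPI_PROC_NULL) a(:, nyl+1:nyl+nh, :) = reshape(rbuf, (/ nx, nh, nz /))

      ! send the last nh interior J columns to the right neighbor,
      ! receive the left neighbor's last columns into the left halo
      sbuf = reshape(a(:, nyl-nh+1:nyl, :), (/ n /))
      call MPI_SENDRECV(sbuf, n, MPI_REAL, right, 1, &
                        rbuf, n, MPI_REAL, left,  1, comm, stat, ierr)
      if (left /= MPI_PROC_NULL) a(:, 1-nh:0, :) = reshape(rbuf, (/ nx, nh, nz /))
    end subroutine exchange_halo

Using MPI_SENDRECV with MPI_PROC_NULL as the neighbor rank at the two ends of the 1-D decomposition avoids deadlock and any special-casing of the first and last processors.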

In the external mode, the vertically averaged velocities and the free-surface displacement are computed using an implicit scheme that requires tridiagonal solves in the I and J directions. Since the data is decomposed along the J direction, the tridiagonal solves along the I direction do not require any data exchange, whereas the tridiagonal solves along the J direction require a parallel tridiagonal solver. Since the total time spent in this part of the program is less than 1% of the total time, a simple pipelined parallel tridiagonal solver is used.

Each processor computes the coefficients of the system of linear equations to be solved; the first processor performs the forward elimination on its rows and sends the boundary values to the next processor. The second processor continues the forward elimination and passes its boundary values to the next processor, and this process continues until the last processor is reached. The last processor completes the forward elimination, performs the backward substitution, and sends the boundary information back to the previous processor; this continues until the data reaches the first processor. After each processor sends the boundary values of the forward solve to the next processor, it has to wait until it receives the boundary values for the backward substitution from that processor. This idle time can create a bottleneck and degrade scalability when a larger number of processors is used. A widely used technique to reduce the idle time is to overlap the computation of the coefficients for the next J line with the time spent waiting for the data from the back-substitution. Unfortunately, because of the I-blanking technique, an effective overlap of computation and communication requires extra bookkeeping and check points and makes the code very complicated; hence it is not implemented in the current version of the code but is left as a future enhancement.
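
The pipelined solve described above corresponds to a standard Thomas algorithm whose forward elimination is seeded by the previous processor and whose back-substitution is seeded by the next one. The fragment below is a simplified sketch of that idea for a single tridiagonal system distributed along J; the routine name, the variable names, and the use of blocking MPI_SEND/MPI_RECV are assumptions for illustration, not the actual CH3D-SED implementation.

    ! Sketch of a pipelined tridiagonal (Thomas) solve along J; a, b, c are
    ! the sub-, main and super-diagonals, d the right-hand side, and x the
    ! solution on this rank's nyl rows.  Illustrative only.
    subroutine pipelined_thomas(a, b, c, d, x, nyl, left, right, comm)
      implicit none
      include 'mpif.h'
      integer, intent(in)    :: nyl, left, right, comm
      real,    intent(in)    :: a(nyl), c(nyl)
      real,    intent(inout) :: b(nyl), d(nyl)
      real,    intent(out)   :: x(nyl)
      real    :: prev(2), nxt
      integer :: j, ierr, stat(MPI_STATUS_SIZE)

      ! forward elimination: wait for the modified coefficients of the
      ! previous rank's last row (the first rank starts immediately)
      prev = 0.0
      if (left /= MPI_PROC_NULL) &
         call MPI_RECV(prev, 2, MPI_REAL, left, 0, comm, stat, ierr)
      do j = 1, nyl
         b(j) = b(j) - a(j) * prev(1)
         d(j) = d(j) - a(j) * prev(2)
         prev(1) = c(j) / b(j)
         prev(2) = d(j) / b(j)
      end do
      if (right /= MPI_PROC_NULL) &
         call MPI_SEND(prev, 2, MPI_REAL, right, 0, comm, ierr)

      ! back-substitution: wait for the solution in the next rank's first row
      nxt = 0.0
      if (right /= MPI_PROC_NULL) &
         call MPI_RECV(nxt, 1, MPI_REAL, right, 1, comm, stat, ierr)
      do j = nyl, 1, -1
         x(j) = (d(j) - c(j) * nxt) / b(j)
         nxt  = x(j)
      end do
      if (left /= MPI_PROC_NULL) &
         call MPI_SEND(nxt, 1, MPI_REAL, left, 1, comm, ierr)
    end subroutine pipelined_thomas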

Apart from the external-mode solution phase, the rest of the code uses implicit schemes only in the vertical (K) direction, so no additional message passing is involved except for the exchange of boundary data in the overlap regions. All of these boundary exchanges are performed in a separate Fortran 90 module, so there are no message-passing calls in the original code; whenever data needs to be exchanged, the corresponding routine in the F90 module is called. The F90 module contains not only the message-passing routines but also all the data associated with message passing (such as communicators, status and request objects, etc.). There is no need to add any of this information to the common-block structure of the original code, so minimal changes to the original code are required. The use of F90 modules keeps the code modular and restricts all the code changes to one module, making code maintenance and future enhancements easier.
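
A schematic outline of such a module is shown below; the module and routine names (comm_module, comm_init, comm_finalize) are assumed for illustration and do not correspond to the actual CH3D-SED module.

    ! Schematic outline of a Fortran 90 module that keeps all message-passing
    ! data and calls out of the original code; names are assumed.
    module comm_module
      implicit none
      include 'mpif.h'
      integer :: comm            ! communicator for the 1-D decomposition
      integer :: myrank, nprocs  ! rank and number of processors
      integer :: left, right     ! neighboring ranks (MPI_PROC_NULL at the ends)
    contains
      subroutine comm_init()
        integer :: ierr
        call MPI_INIT(ierr)
        comm = MPI_COMM_WORLD
        call MPI_COMM_RANK(comm, myrank, ierr)
        call MPI_COMM_SIZE(comm, nprocs, ierr)
        left  = merge(MPI_PROC_NULL, myrank - 1, myrank == 0)
        right = merge(MPI_PROC_NULL, myrank + 1, myrank == nprocs - 1)
      end subroutine comm_init

      subroutine comm_finalize()
        integer :: ierr
        call MPI_FINALIZE(ierr)
      end subroutine comm_finalize

      ! boundary-exchange routines such as the exchange_halo sketch above
      ! would also be contained here and called from the flow, salinity,
      ! temperature, and sediment routines.
    end module comm_module

With this arrangement a boundary exchange in the original routines reduces to a single call into the module, and no MPI details are visible outside it.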

2.3 Post-processing and Visualization

At the end of the parallel computation, each processor writes the output variables into a separate file named by its processor number (rank). Using the grid-distribution information generated during the pre-processing and partitioning stage, the separate output files are combined into a single output file for each output variable and written to disk in a format consistent with the data structures used in the original sequential code. This enables verification of the parallel code and the use of existing post-processing tools for visualization. This process is performed sequentially on a single processor.
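
As a sketch of this merge step, the following program places each rank's slab of one output variable into a global array using the J ranges saved during partitioning. The file names, file formats, and array sizes used here are assumptions for illustration only.

    ! Sketch of the post-processing merge: each rank's slab is copied into
    ! the global array using the J ranges saved during pre-processing.
    program merge_output
      implicit none
      integer, parameter :: nx = 67, ny = 138, nz = 20, np = 10
      integer :: jstart(np), jend(np), p, nyl
      real    :: global(nx, ny, nz)
      real, allocatable :: local(:, :, :)
      character(len=32) :: fname

      open(10, file='partition.dat')          ! J ranges written by the pre-processor
      do p = 1, np
         read(10, *) jstart(p), jend(p)
      end do
      close(10)

      do p = 1, np
         nyl = jend(p) - jstart(p) + 1
         allocate(local(nx, nyl, nz))
         write(fname, '(a,i3.3,a)') 'output_', p - 1, '.dat'
         open(11, file=fname, form='unformatted')
         read(11) local                       ! this rank's interior columns only
         close(11)
         global(:, jstart(p):jend(p), :) = local
         deallocate(local)
      end do

      open(12, file='output_merged.dat', form='unformatted')
      write(12) global                        ! single file in the sequential layout
      close(12)
    end program merge_output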

3. Test Cases

The code was first set up to run two different problems and executed sequentially on the CEWES MSRC platforms. The results obtained from the sequential code were used as the baseline for comparison with the results obtained from the parallel version of the code; the output produced by the post-processing step described above was compared with the results of the sequential runs. The following sections describe the test cases, the input conditions used, and the output obtained. The first test case (Lake Michigan) uses wind, temperature, and two sediment size classes for input, whereas the second test case (Old River) uses river and tidal forcing and six sediment size classes.

3.1 Lake Michigan

This test case was set up to simulate the annual episode of spring sediment resuspension along the southern shores of Lake Michigan, a phenomenon that has been well documented by satellite observations. A 4-km grid with 67 x 138 points is used, with 20 vertical layers distributed more densely near the surface. The circulation and sediment transport are driven by a simplified wind field that acts only during the first day (increasing linearly to 15 m/s during the first 18 hours and remaining constant until the end of the first day) and is zero for the rest of the simulation period. Spatially, the wind is from the north and uniform over the lake. Two sediment size classes (diameters 0.01 mm and 0.1 mm) are used in the simulation; the finer class is initially distributed mainly in the southern part of the lake and the coarser class mainly in the northern part. The initial temperature field consists of an isothermal layer at 24°C for the top 11 m, with the temperature decreasing linearly from 24°C at 11 m to 7°C at 20 m, then to 4.5°C at 90 m, and remaining at 4.5°C for the rest of the depth. The heat flux at the bottom, the lateral boundaries, and the water surface is set to zero. These initial conditions were set up for the current version of the code and run on a single processor of the Origin 2000 and the Cray T3E for 600 iterations with a time step of 2 minutes. The results obtained were used as the baseline for testing the parallel code.
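
The initial temperature profile described above is a simple piecewise-linear function of depth; the following fragment is only an illustration of that profile and is not code from the model.

    ! Piecewise-linear initial temperature (degrees C) as a function of
    ! depth in metres, as described in the text; illustrative only.
    real function initial_temp(depth)
      implicit none
      real, intent(in) :: depth
      if (depth <= 11.0) then
         initial_temp = 24.0                                          ! isothermal surface layer
      else if (depth <= 20.0) then
         initial_temp = 24.0 - (24.0 - 7.0) * (depth - 11.0) / 9.0    ! 24 C at 11 m to 7 C at 20 m
      else if (depth <= 90.0) then
         initial_temp = 7.0 - (7.0 - 4.5) * (depth - 20.0) / 70.0     ! 7 C at 20 m to 4.5 C at 90 m
      else
         initial_temp = 4.5                                           ! constant below 90 m
      end if
    end function initial_temp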

3.2 Old River

The Mississippi River at the Old River control structure complex was used to test features that were not exercised by the Lake Michigan test case, in particular the river and tidal boundary conditions. In addition, six sediment size classes are used in this test case, which requires a significant amount of computation. A computational grid of size 50 x 345 with 10 vertical layers is used, and the simulation is run for 200 iterations with a time step of 15 seconds, with the sediment computations switched on after 100 iterations. With these initial conditions the simulation was performed on a single processor and the results were compared with the parallel runs.

4. Performance Analysis

The parallel version of the code was executed on the SGI Origin 2000 and Cray T3E platforms at the CEWES MSRC for different numbers of processors. Figure 1 shows the execution time on the two platforms for the Lake Michigan test case, obtained by running the 67 x 138 grid for 600 iterations with a time step of 2 minutes. The execution time decreases significantly up to 32 processors, after which the reduction is not very significant. This behavior is expected, since beyond 32 processors each processor holds only about 4 or 5 I lines (around 100 computational cells) and must exchange data after performing the computation on these few columns. With more than 32 processors the time spent in data exchange becomes significant compared with the time spent in computation, and the parallel efficiency drops. In addition, when more than 32 processors are used, balancing the number of computational cells per processor becomes difficult, since an I line cannot be split between two processors and a single I line can cause a significant load imbalance. The parallel tridiagonal solver used in the external mode also adds overhead when a large number of processors is used, because of the sequential nature of the pipelined implementation. We also observe that, as the number of processors increases, the execution time suddenly increases for certain processor counts on both architectures; this again is due to the uneven distribution of computational cells when a large number of processors is used. The use of more sophisticated domain decomposition methods along with new numerical schemes could potentially improve the scalability and efficiency of the coupled hydraulics and sediment model.

Extensive performance measurements were not made for the Old River test case, since the intention was only to test the functionality of the parallel version and verify that identical results were obtained from the sequential and parallel codes. However, the parallel version of the code was used to perform a longer simulation on 32 processors of the Cray T3E.

Figure 1. Execution time for the Lake Michigan test case on the Origin 2000 and Cray T3E.

5. Summary and Conclusions

The development of a coupled and scalable hydraulics and sediment model has been described in this report. Parallelization was carried out in three stages: pre-processing and partitioning, parallel computation, and post-processing. During the pre-processing and partitioning phase the grid is partitioned using a simple load-balancing technique to obtain a roughly equal number of computational cells per processor, and separate grid files are generated for each processor along with the other necessary input files. During the parallel computation phase all processors perform the computation in parallel and exchange data at the processor boundaries; at the end, each processor writes its part of the solution to disk. These individual files are merged during the post-processing stage to generate output files suitable for plotting the results. Simulations were performed for two different test cases, Lake Michigan and Old River, to test the full functionality of the coupled model. The Lake Michigan test case was run on the Origin 2000 and the Cray T3E for different numbers of processors and the performance was analyzed.

References

Johnson 1991. Raymond S. Chapman, Billy H. Johnson, and S. Rao Vemulakonda. User's Guide for the Sigma Stretched Version of CH3D-WES, U.S. Army Corps of Engineers, HL-96-21.

Spasojevic 1994. Miodrag Spasojevic and Forrest M. Holly, Jr. Three-Dimensional Numerical Simulation of Mobile-Bed Hydrodynamics, Iowa Institute of Hydraulic Research, HL-94-2.

Spasojevic 1997. Miodrag Spasojevic and Forrest M. Holly, Jr. Three-Dimensional Numerical Simulation of Mobile-Bed Hydrodynamics, Iowa Institute of Hydraulic Research, Technical Report #262.

Zhu 1998. J. Zhu, B. Johnson, P. Bangalore, D. Huddleston, and A. Skjellum. On the Parallelization of CH3D, Mississippi State University, CEWES MSRC/PET TR/98-15.