Running multiphysics solutions on clusters
The installation package for the TMG thermal-flow solvers provides sample custom scripts to run multiphysics solutions on Linux clusters in a high-performance computing (HPC) environment managed by the Slurm (Simple Linux Utility for Resource Management) or PBS (Portable Batch System) job scheduler.
You can find the custom bash scripts slurm-script.sh and pbs-script.sh in the tmg/if/scripts directory and use them to launch simulations on clusters.
Understanding the scripts
When you submit a custom script requesting specific computing resources, it generates a job script with the requested resources and submits it to the job manager. Once the job manager executes the job, it launches the job script. This job script identifies the nodes and CPUs allocated for the job by the job manager and uses this information to generate the parallel configuration file with the correct number of cluster nodes. Finally, it launches the thermal-flow solver using the input file and the generated parallel configuration file.
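For illustration, a generated Slurm job script might resemble the following minimal sketch. The directive values are placeholders, and the solver launch line is omitted; this is not the exact content produced by the provided scripts:
#!/bin/bash
#SBATCH --job-name=tmg_multiphysics   # placeholder job name
#SBATCH --nodes=2                     # nodes requested through the custom script
#SBATCH --ntasks=128                  # total cores requested through the custom script
# Identify the allocated nodes and cores, for example from SLURM_JOB_NODELIST,
# and write ParallelConfigurationFile.xml accordingly.
# Launch the thermal-flow solver with the input file and the generated
# parallel configuration file (the exact invocation is omitted here).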
These scripts start the TMG Executive Menu with the tmgnx.com executable without using the ND argument, which is not compatible with multiphysics solutions. This is because multiphysics solutions can have up to the following three input files:
- Thermal-flow solver XML input file
- Nastran solver DAT input file
- Multiphysics MPDAT input file
Instead, these scripts use the NX or MD argument depending on the input file type specified, XML or MPDAT, respectively.
The supported job schedulers do not use the same commands to submit jobs or the same options to customize directives to the job scheduler, but they provide equivalent functionality. For example:
- Slurm uses commands such as sbatch, squeue, and scancel.
- PBS uses commands such as qsub, qstat, and qdel.
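For example, submitting a job script, checking its status, and canceling the job look like this with each scheduler; the job ID 12345 is a placeholder:
- Slurm:
sbatch submit_job.sh
squeue -u $USER
scancel 12345
- PBS:
qsub submit_job.sh
qstat -u $USER
qdel 12345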
The provided custom scripts use these commands and the options described in the following sections to run thermal/flow and multiphysics solutions seamlessly.
Before running the scripts
The SPLM_LICENSE_SERVER, UGII_BASE_DIR, and UGII_TMG_DIR environment variables must be defined on the head node, as in the example after these prerequisites.
You must have a DMP license.
The Slurm or PBS job scheduler must be running on the cluster.
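For example, you might define the required environment variables on the head node as follows; the license server and installation paths are placeholders that you must adapt to your installation:
export SPLM_LICENSE_SERVER=28000@licenseserver    # placeholder license server
export UGII_BASE_DIR=/opt/Siemens/NX              # placeholder NX installation directory
export UGII_TMG_DIR=$UGII_BASE_DIR/SIMULATION/tmg # placeholder TMG directory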
The following files are required and must be stored together in a run directory accessible to all cluster nodes (see the example layout after this list):
- One of the provided scripts, which you can adapt as needed.
- The following solution input files:
- The thermal-flow solver XML input file required for thermal, flow, thermal-flow, and multiphysics solutions with the following settings:
<Property name="Run Solution in Parallel">
    <Value>1</Value>
</Property>
<Property name="Parallel Configuration File">
    <Value>ParallelConfigurationFile.xml</Value>
</Property>
- The Nastran solver DAT input file required for multiphysics thermal-structural, thermal-flow-structural, and flow-structural solutions.
- The multiphysics MPDAT input file required for multiphysics solutions. It is recommended that you modify the paths to the XML and DAT files to absolute paths inside the MPDAT file.
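For example, a run directory for a multiphysics solution on a shared file system might look like this; the directory and file names are placeholders:
/shared/runs/mp-case/
    slurm-script.sh
    Simulation1-Solution1.xml
    Simulation1-Solution1.dat
    Simulation1-Solution1.mpdat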
Script options
You run the provided job scheduler script from the command line using appropriate input options. The following options are available:
-n
specifies the number of cores for the thermal-flow solvers. The default value is 1. For the Slurm script, this is the total number of cores. For the PBS script, this is the number of cores per computing node.
-N
specifies the number of computing nodes. The default value is 1.
-m
specifies the total number of cores for Nastran in a multiphysics solution. The default value is the same as for the thermal-flow solvers.
-s
specifies the input file: <simulation/model name>-<solution/analysis name>.xml for thermal/flow solutions and <simulation name>-<solution name>.mpdat for multiphysics solutions. This option is always required.
-j
specifies the name of the job script that the job scheduler script creates before launching the simulation. By default, it is submit_job.sh.
To get the list of all input arguments, their definitions, and their default values, run the script without any arguments.
The script parses these arguments (a simplified parsing sketch follows this list) and:
- Generates the job script with the number of cores used for the thermal-flow solver.
- Writes the code to generate the ParallelConfigurationFile.xml file in the job script, with the node name and the number of cores specified for view factors, the thermal solver, and the flow solver. If the file already exists, it is overwritten.
- Modifies the .mpdat file to include the parallel option with the specified number of cores for multiphysics solutions.
- Launches the simulation.
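As an illustration only, the argument parsing step could follow the pattern below. This is a simplified bash sketch based on the documented options, not the code of the provided scripts:
#!/bin/bash
# Simplified sketch of parsing the documented options (not the provided scripts).
NCORES=1                  # -n: cores for the thermal-flow solvers
NNODES=1                  # -N: number of computing nodes
MCORES=""                 # -m: cores for Nastran, defaults to the -n value
INPUT=""                  # -s: XML or MPDAT input file (required)
JOBSCRIPT=submit_job.sh   # -j: name of the generated job script
while getopts "n:N:m:s:j:" opt; do
    case $opt in
        n) NCORES=$OPTARG ;;
        N) NNODES=$OPTARG ;;
        m) MCORES=$OPTARG ;;
        s) INPUT=$OPTARG ;;
        j) JOBSCRIPT=$OPTARG ;;
        *) echo "Usage: $0 -n cores -N nodes -m nastran_cores -s input_file [-j job_script]"; exit 1 ;;
    esac
done
MCORES=${MCORES:-$NCORES}   # apply the documented default for -m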
Command to solve a thermal/flow solution with the XML file called Solution-file.xml with the Slurm job scheduler on 2 nodes with 128 cores:
./slurm-script.sh -n 128 -N 2 -s Solution-file.xml
Command to solve a multiphysics solution with the MPDAT file called MP-Solution-file.mpdat with the PBS job scheduler on 2 nodes with 128 cores (64 cores on each node):
./pbs-script.sh -n 64 -N 2 -s MP-Solution-file.mpdat
Command to solve a multiphysics solution with the MPDAT file called MP-Solution-file2.mpdat with the PBS job scheduler on 1 node with 64 cores for the thermal solver and 32 cores for Nastran:
./pbs-script.sh -n 64 -m 32 -s MP-Solution-file2.mpdat
Note that the PBS script uses the -n argument for the number of cores per node, whereas the Slurm script uses it for the total number of cores.
Troubleshooting
If the job remains in the queue without starting and no other jobs are running, possible causes include:
- The job scheduler is not running.
- The communication between nodes is down.
- The requested allocation exceeds available resources.
To check that the job scheduler is running and that the nodes are in idle status, use the following commands:
- Slurm:
scontrol show node -a
- PBS:
pbsnodes -a
If your cluster environment uses ssh connections between nodes and the job crashes with MPI issues in one of the thermal solver parallel modules (VUFAC, DOMDEC, or ANALYZ), the cause might be an ssh connection between nodes that requires a user password.
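In that case, configuring key-based (passwordless) ssh between the cluster nodes typically resolves the problem. A common approach with OpenSSH, where the user and node names are placeholders:
ssh-keygen -t rsa           # generate a key pair, accepting an empty passphrase
ssh-copy-id user@node02     # copy the public key to each compute node
ssh user@node02 hostname    # verify that the login no longer prompts for a password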