Frequently Asked Questions

Accounts

How do I obtain a HiPerGator account?

Fill out the Account Request form. Requests submitted by non-faculty members must be confirmed by faculty sponsors prior to account creation. Accounts are generally created within twenty-four hours of receiving a request, but may take longer if the faculty sponsor does not reply promptly to a verification request.


How do I change or recover my HiPerGator password?

Research Computing does not manage user credentials. GatorLink credentials are used throughout the system. Visit the GatorLink Account Management page for all your account needs, or contact the UFIT HelpDesk.


How do I get help with my HiPerGator account?

If you encounter problems or have questions, please open a support ticket. Support tickets provide a traceable, permanent record of your issue and are reviewed daily to ensure they are addressed as quickly as possible. You may also contact us by email or visit our offices in person.


What happens to my HiPerGator account when I leave UF?

The answer depends mostly on how your affiliation with UF will change when you leave the university. If your new affiliation does not allow you to have an active GatorLink account, you will no longer be able to access HiPerGator.

If you will need to maintain access to your HiPerGator account after leaving UF, contact your group sponsor and request to be affiliated with their department as a “Departmental Associate.” This will ensure that your GatorLink account, and therefore your HiPerGator account, remains active after your previous affiliations are removed.

Of course, general UFRC Account Policies are still in effect; any HiPerGator accounts that remain inactive for more than 1 year are automatically deactivated.


Cluster Details

What hardware is in HiPerGator?

HiPerGator consists of 51,000 compute cores and 3 petabytes of high-performance storage.

For detailed information on HiPerGator hardware components, view the HiPerGator Hardware Specification Sheet.


What are ResVault and ResShield?

ResVault and ResShield are systems for working with restricted data, each having its own set of use cases.

ResVault is a secure environment for research projects that use electronic Protected Health Information (ePHI) and are therefore required to comply with the Health Insurance Portability and Accountability Act (HIPAA). ResVault can also be used for projects that require a “low” Federal Information Security Management Act (FISMA) compliance rating, and for projects that must comply with the International Traffic in Arms Regulations (ITAR).

ResShield is a secure environment for research projects that require a “moderate” FISMA compliance rating. Using ResShield requires a yearly, per-user fee for the licenses required to maintain system compliance.


Storage

What types of storage are available and how should each type be used?

Please see our Storage “Getting Started” page.


Can my storage quota be increased?

You can request a temporary quota increase. Submit a support request and indicate: (1) how much additional space you need; (2) the file system on which you need it; and (3) how long you will need it. Additional space is granted at the discretion of Research Computing on an “as available” basis for short periods of time. If you need more space on a long-term basis, please review our storage options and contact us to discuss an appropriate solution for your needs.


How can I check my /ufrc storage quota and current usage?

Use the following command to check /ufrc usage and storage quota, replacing <your_username> with your username:

$ lfs quota -u <your_username> /ufrc

While storage usage is tracked at the user level, storage quotas are generally only assigned at the group level. This means that the limits returned are a function of the group’s quota and total storage usage by other members in the same group.

To check storage usage for an entire group, use the following command:

$ lfs quota -g <your_group> /ufrc

The following is an example of the output returned from this command:

Filesystem       kbytes        quota        limit  grace   files  quota  limit  grace
     /ufrc  15799470596  22548578304  22548578304      -  183580      0      0      -

The space values in the output (kbytes, quota, limit) are reported in kilobytes. Of the columns returned, the first six contain the most useful information:

  • Filesystem: name of the filesystem
  • kbytes: amount of space used
  • quota: amount of space assigned to the group
  • limit: maximum amount of space accessible by the group
  • grace: a dash (-) indicates the default grace period of seven days
  • files: number of files owned by the group

A group can exceed its assigned quota, up to the limit, for the duration of the grace period; after that, the group’s ability to write new files will be restricted.


Why can't I run jobs in my home directory?

Home directories are intended for relatively small amounts of human-readable data such as text files, shell scripts, and source code. Neither the servers nor the file systems on which the home directories reside can sustain the I/O load generated by a large cluster, and overall system response would be severely impacted if they were subjected to such a load. This is by design, and it is the reason all job I/O must be directed to the /ufrc file system, as illustrated below.
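
For example, a job script would typically change into a directory under /ufrc before doing any significant I/O; a minimal sketch, with placeholder names:

cd /ufrc/<your_group>/<your_username>/my_job_dir
./my_application > output.log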


Software

What software applications are available?

The full list of applications installed on the cluster is available at the Installed Software wiki page.


What software applications can run on the GPU partition?

GPU-accelerated computing is intended for use by highly parallel applications, where computation on a large amount of data can be broken into many small tasks performing the same operation, to be executed simultaneously.  More simply put, large problems are divided into smaller ones, which can then be solved at the same time.

Since the GPU is a special-purpose architecture, it supports restrictive programming models; one such model is NVIDIA’s CUDA. On HiPerGator, only applications written in CUDA can run on the GPU partition. Currently, these applications are:


May I submit an installation request for an application?

Yes, if the software you need is not listed on our Installed Software page, you may submit a support request to have it installed by Research Computing staff. Please observe the following guidelines:

  1. Provide a link to the web site from which to download the software
  2. If there are multiple versions, be specific about the version you want
  3. Let us know if you require any options that are not a standard part of the application

If the effort required to install the software is 4 hours or less, the request is placed in the work queue, and the software will be installed once an RC staff member is available to perform the work, usually within a few business days.

If the initial evaluation of the request reveals that the effort is significantly greater than 4 hours, we will contact you to discuss how the work can be performed. For large and complex projects, it may be necessary to engage Research Computing staff as a paid consulting service.

You may also install applications yourself in your home directory.

Please only ask us to install applications that you know will meet your needs and that you intend to use extensively. We do not have the resources to build applications for testing and evaluation purposes.


Why do I get the 'command not found' error message?

The Linux command interpreter (shell) maintains a list of directories in which to look for commands that are entered on the command line. This list is maintained in the PATH environment variable. If the full path to the command is not specified, the shell will search the list of directories in the PATH environment variable and if a match is not found, you will get the “command not found” message. A similar mechanism exists for dynamically linked libraries using the LD_LIBRARY_PATH environment variable.

To ease the burden of setting and resetting environment variables for different applications, we have installed a “modules” system. Each application has an associated module which, when loaded, will set or reset whatever environment variables are required to run that application – including the PATH and LD_LIBRARY_PATH variables.

The easiest way to avoid “command not found” messages is to ensure that you have loaded the module for your application. See Modules for more information.
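
For example, the standard module commands can be used to inspect and modify your environment:

$ module avail           # list all available modules
$ module load intel      # set PATH, LD_LIBRARY_PATH, etc. for the Intel suite
$ module list            # show currently loaded modules
$ module unload intel    # undo those environment changes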


Job Management

What is a batch system? / What is a job scheduler?

The purpose of a batch system is to execute a series of tasks in a computer program without user intervention (non-interactive jobs). The operation of each program is defined by a set or batch of inputs, submitted by users to the system as “job scripts.”

When job scripts are submitted to the system, the job scheduler determines how each job is queued. If there are no queued jobs of higher priority, the submitted job will run once the necessary compute resources become available.


What are the differences between batch jobs, interactive jobs, and GUI jobs?

A batch job is submitted to the batch system via a job script passed to the sbatch command. Once queued, a batch job will run on resources chosen by the scheduler. While a batch job is running, the user cannot interact with it.

An interactive job is any process run at the command-line prompt, generally used for developing code or testing job scripts. Interactive jobs should only be run in an interactive development session, which is requested through the srundev command. As soon as the necessary compute resources are available, the job scheduler will start the interactive session.

A GUI job uses HiPerGator compute resources to run an application but displays the application’s graphical user interface (GUI) on the user’s local computer. GUI sessions are also managed by the job scheduler, but they require additional software to be installed on the client-side computer.


How can I check what compute resources are available for me to use?

Use the following commands to view your group’s total resource allocation, as well as how much of the allocation is currently in use:

$ module load ufrc
$ slurmInfo <group_name>

Allocation information is returned for both the investment QOS and the burst QOS of the given group.


How do I submit a job to the batch system?

The primary job submission mechanism is the sbatch command, invoked at the Linux command line:

$ sbatch <your_job_script>

where <your_job_script> is a file containing the commands that the batch system will execute on your behalf. Jobs may also be submitted to the batch system through the Galaxy web interface as well as the Open Science Grid’s Globus interface.
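
For reference, a minimal job script might look like the following sketch; the resource values, module, and program name (my_script.py) are illustrative, not recommendations:

#!/bin/bash
#SBATCH --job-name=my_job       # name shown in the queue
#SBATCH --ntasks=1              # a single task
#SBATCH --mem=1gb               # total memory for the job
#SBATCH --time=01:00:00         # wall time limit (HH:MM:SS)

module load python
python my_script.py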


How do I run applications that use multiple processors (i.e. parallel computing)?

Parallel computing refers to the use of multiple processors to run multiple computational tasks simultaneously. Communications between tasks use one of the following interfaces, depending on the task:

  • OpenMP – used for communication between tasks running concurrently on the same node with access to shared memory
  • MPI (OpenMPI) – used for communication between tasks that use distributed memory
  • Hybrid – a combination of the OpenMP and MPI interfaces

You must properly configure your job script in order to run an application that uses multiple processors. View sample SLURM scripts for each case below:
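
For illustration only (these are not complete sample scripts, and the values are arbitrary), the key difference lies in the SLURM resource directives:

# OpenMP (shared memory): one task with several cores on a single node
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

# MPI (distributed memory): many tasks, which may span multiple nodes
#SBATCH --ntasks=32

# Hybrid: several MPI tasks, each running multiple OpenMP threads
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=4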


Why do I get the error 'Invalid qos specification' when I submit a job?

If you get this error, it is most likely for one of two reasons:

  1. You submitted a job specifying a QOS of a group you are not a member of
  2. Your group does not have a computational allocation

To check what groups you are a member of, log in to the cluster and use the following command:

$ groups <user_name>

To check the allocation of a particular group, log in to the cluster and use the following command:

$ module load ufrc
$ slurmInfo <group_name>


Why do I get the error 'slurmstepd: Exceeded job memory limit at some point'?

Sometimes, SLURM will log the error “slurmstepd: Exceeded job memory limit at some point”. This appears to be triggered by memory used for cache and page files. In this case, the process that enforces job memory limits does not kill the job; the warning is merely logged and can be safely ignored. If your job truly exceeds its memory request, the error messages will look like:

slurmstepd: Job 5019 exceeded memory limit (1292 > 1024), being killed
slurmstepd: Exceeded job memory limit
slurmstepd: *** JOB 5019 ON dev1 CANCELLED AT 2016-05-16T15:33:27 ***


How do I check the status of my jobs?

You can easily check the status of your jobs using the Job Status utility. To navigate to the utility from the Research Computing website, use the Access menu header along the top of each page.

Alternatively, you can use the following command to check the status of the jobs you’ve submitted:

$ squeue -u <user_name>

To check the status of jobs running under a particular group, modify the command with the -A flag:

$ squeue -A <group_name>

To also return QoS information for jobs under a particular group, use the following command:

$ squeue -O jobarrayid,qos,name,username,timelimit,numcpus,reasonlist -A <group_name>


How do I delete a job from the batch system?

You can use the command

$ scancel <job_id>

to delete jobs from the queue. You can only delete jobs that you submitted.


Why did my job die with the message '/bin/bash: bad interpreter: No such file or directory'?

This is typically caused by Windows-style line endings, whose hidden carriage-return characters the command interpreter does not understand. If you created your script on a Windows machine and copied it to the cluster, you should run

$ dos2unix <your_job_script>

This will remove any characters not recognized by Linux command interpreters from the text file.


What are the wall time limits for each partition and QoS?

Partition                             Wall Time Limit
Compute partitions, investment QoS    31 days
Compute partitions, burst QoS         4 days
Development partition (hpg2-dev)      12 hours
GPU-enabled partition (hpg2-gpu)      31 days
GUI partition                         4 days


How can I check how busy HiPerGator is?

Use the following commands to view how busy the cluster is:

$ module load ufrc
$ slurmInfo


Development

How do I develop and test software?

You should use the interactive test nodes for software development and testing. These nodes are kept consistent with the software environment on the computational servers, so you can be assured that if your code works on a test node, it will work via the batch system. Connect to the cluster and use the following commands to start a development session:

$ module load ufrc
$ srundev

The srundev command accepts options to request additional time, processors, or memory; the defaults are 10 minutes, 1 core, and 2 GB of memory, respectively. For example, to request a 60-minute session with 4 cores and 4 GB of memory per core, use:

$ module load ufrc
$ srundev --time=60 --cpus-per-task=4 --mem-per-cpu=4gb

Generally speaking, we use modules to manage the software environment, including the PATH and LD_LIBRARY_PATH environment variables. To use any available software package that is not part of the default environment, including compilers, you must load the associated modules. For example, to use the Intel compilers and link against the fftw3 libraries, you would first run:

$ module load intel
$ module load fftw

This may be collapsed into a single command:

$ module load intel fftw
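
Assuming the loaded modules set up the compiler and library search paths, you could then build against FFTW; a sketch, with a hypothetical source file my_fft.c:

$ icc -o my_fft my_fft.c -lfftw3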


What compilers are available?

We have two compiler suites: the GNU Compiler Collection (GCC) and the Intel Compiler Suite (Composer XE). The default environment provides access to the GNU Compiler Collection, while Composer XE may be accessed by loading the intel module (preferably the latest version).
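
For example, a program (hello.c is hypothetical here) can be built with either suite:

$ gcc -o hello hello.c     # GNU compiler, available in the default environment
$ module load intel
$ icc -o hello hello.c     # Intel compiler from Composer XE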


MATLAB

How do I run MATLAB programs?

You may use the interactive MATLAB interpreter on the test nodes. However, in order to run MATLAB programs through the batch system, you must compile your MATLAB source code into a standalone executable. This is required because there are not enough MATLAB licenses available to run the programs directly. To learn how to compile your MATLAB program please see our MATLAB wiki page.


How do I compile a MATLAB program?

Generally speaking, you will load the MATLAB module and then use the MATLAB compiler, mcc, to compile your MATLAB program(s). See our MATLAB wiki page for more detailed instructions.
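
As a sketch, assuming a hypothetical MATLAB program my_analysis.m:

$ module load matlab
$ mcc -m my_analysis.m     # produces a standalone executable and run script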


Why can't I check out a MATLAB compiler license?

If you tried to use the MATLAB compiler, mcc, and received the message “Could not check out a compiler license” it is because Research Computing does not have its own MATLAB licenses but relies on the UF campus license. There are a limited number of MATLAB compiler licenses shared by the whole campus. When the license is checked out during an interactive MATLAB session, it does not get checked back in until the MATLAB session is terminated, which could take a long time depending on what the user is doing. Unfortunately, you will not be able to run mcc until a license becomes available.


Galaxy

How can I add large datasets to the Galaxy?

Create your Galaxy upload directory as ‘/ufrc/apps/galaxy/incoming/<your email address in galaxy>’ and copy your datasets there. Note that it may take up to 15 minutes for Galaxy to fix the permissions before uploaded files become available in the list at the bottom of the Galaxy ‘Get Data > Upload File’ tool.
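
For example, from the HiPerGator command line (the dataset name is hypothetical):

$ mkdir /ufrc/apps/galaxy/incoming/<your email address in galaxy>
$ cp my_dataset.fastq /ufrc/apps/galaxy/incoming/<your email address in galaxy>/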

Use the ‘Upload File’ tool to either upload smaller files from your local machine or select files in your Galaxy incoming directory by clicking the checkboxes in front of the file names; then click the ‘Execute’ button to start the upload.


How do I report a Galaxy problem?
I think I have a Galaxy issue, but I'm not sure about it. What should I do?

You can always open a support request when you have questions, even if you are not sure whether there is an issue. If you’d like, you can check the list of known Galaxy issues that are already being worked on before asking for help.


I'd like to use a particular tool, but I can't find it in the Galaxy. What should I do?

Please submit a support request. The tool in question may already have been wrapped by someone and made available in the Galaxy Tool Shed; if so, we can usually make it available in the UF Galaxy instance almost immediately. If the tool is not in the Tool Shed, we can evaluate whether we can “wrap” it for the Galaxy interface and what the timeline for the project may be.
