Large Volume Imports

Overview

Using the File Import utility to bulk import tens of thousands of issues (or more) can require a large investment of time, primarily because the records being imported are processed sequentially.  To reduce this time, ExtraView supports parallel file import processing by spawning multiple add/update requests that are processed simultaneously.

Current File Import Processing

There are two components to the overall processing.  The File Import Utility is invoked from the Administration interface.  This utility is responsible for:

  • Interacting with the user to set up and perform the file import
  • Uploading the file to be imported and saving it internally
  • Creating the mapping of fields and rows to item sub-objects
  • Starting the background File Import Worker task
  • Reporting on the progress of the File Import process
  • Formatting, displaying and downloading errors and results to the user

The File Import Worker is the second component and is responsible for:

  • Reading the internal file
  • Performing the necessary field and row mapping to item objects
  • Adding or updating records in the database

The File Import Worker runs as a background task, is not interruptible, and communicates only its progress to the user interface component.

Each of these entities performs its responsibilities in a sequential, non-parallel manner.

Parallelization

To parallelize an import, the user can split the data into multiple files and upload each one within the File Import utility.  When multiple File Import Worker background tasks are configured, the import operations are performed in parallel, increasing the number of records that can be processed in a given period of time.
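Preparing the multiple input files can be scripted.  The following is a minimal sketch, not part of ExtraView, that assumes the import data is a comma-delimited text file whose first row holds the column names and in which every remaining row is one independent record.  The function name split_csv and the output file naming are hypothetical.  If several rows in your file belong to the same issue, keep them together rather than splitting purely by row count.

```python
import csv
from pathlib import Path


def split_csv(source, parts, out_dir):
    """Split `source` into at most `parts` files, each starting with the header row.

    Assumption: the first row is a header and every other row is an
    independent record, so rows can be distributed freely between files.
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")

    source = Path(source)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    with source.open(newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    # Ceiling division so that every row lands in exactly one output file.
    chunk_size = max(1, -(-len(rows) // parts))

    paths = []
    for start in range(0, len(rows), chunk_size):
        path = out_dir / f"{source.stem}_part{len(paths) + 1}{source.suffix}"
        with path.open("w", newline="", encoding="utf-8") as out:
            writer = csv.writer(out)
            writer.writerow(header)  # repeat the header in every file
            writer.writerows(rows[start:start + chunk_size])
        paths.append(path)
    return paths
```

For example, split_csv("issues.csv", 4, "chunks") would write up to four files into a chunks directory, one for each File Import Worker task you intend to run.  Both names are placeholders.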

Configuring ExtraView to Handle Parallel Imports

Each input file is processed separately as a background operation by a File Import Worker task.  If there is more than one File Import Worker task, the input files are processed simultaneously, in parallel, improving performance.  In a multi-node installation, you should configure at least one File Import Worker task to run on each node.  Generally, it makes no difference which node a File Import Worker task runs on.  The exception is when an attachment field, image, or document is being uploaded, in which case the File Import Worker task must run on the same node as the File Import front end so that it can read the referenced files.  Consider configuring one File Import Worker task for each import data file that you plan to upload for processing at the same time.

Each File Import Worker task sends add/update requests to a queue for the task named Add Update, which also runs in the background.  There must be at least one Add Update task running on each node within a multi-node system.  Each running Add Update task processes the queue by taking the top entry and performing the add or update in the database.  When multiple Add Update tasks are configured, the queue is processed more quickly than with a single Add Update task.
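The queueing pattern described above can be illustrated in a few lines.  This is a conceptual sketch only and does not represent ExtraView's implementation: it uses threads in a single process, a list stands in for the database, and the names run_add_update_tasks and add_update_task are hypothetical.

```python
import queue
import threading


def run_add_update_tasks(requests, task_count):
    """Drain a shared queue of requests using `task_count` worker threads."""
    pending = queue.Queue()
    for request in requests:
        pending.put(request)

    processed = []           # stand-in for the database
    lock = threading.Lock()  # protects `processed` across threads

    def add_update_task(name):
        while True:
            try:
                request = pending.get_nowait()  # take the top entry
            except queue.Empty:
                return  # nothing left to process
            with lock:
                processed.append((name, request))  # the "add or update"
            pending.task_done()

    workers = [
        threading.Thread(target=add_update_task, args=(f"add-update-{n}",))
        for n in range(task_count)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return processed
```

Each request is taken by exactly one task, so adding tasks shortens the time needed to empty the queue only until some other resource, such as the database or the hardware, becomes the bottleneck.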

Note that there is no direct relationship between the number of File Import Worker tasks and the number of Add Update tasks.  There are also limits to the number of each type of task that you should configure, as there are diminishing returns that depend on factors such as the speed of the hardware.

The optimization process is largely empirical, and there are no hard and fast rules as to the optimal configuration.  As a rule of thumb, these guidelines have proven useful in configuring large imports of data:

  • For fewer than 30,000 records on a single-node installation, configure only a single File Import Worker task and process a single import file.  You might consider configuring a second Add Update task.
  • If you are working with larger data sets and you have a multi-node installation, the parallelization configuration becomes worthwhile.  Consider implementing a File Import Worker task for each import file and a similar number of Add Update tasks.
  • If you are importing images, documents, or attachments, you must either have a shared file system or limit your workers to a single node.  The file system holds the temporary copy of each uploaded file, so without a shared file system both tasks must reside on the same physical server.

The CF_RUN_AS_ADMIN Security Permission Setting

This permission setting enables a checkbox on the file import screen.  When this checkbox is selected, the import utility behaves as the ADMIN user account and ignores security permissions when importing the records.  This is a useful way to speed up the import significantly.  The user performing the import should understand that the import runs as ADMIN and that no field-level permission checking takes place while the data is imported.