Bulk import from an S3 bucket
If you want to import S3 objects with a presigned URL, refer to Aidbox.bulk data import.
aidbox.bulk/load-from-bucket
It loads data from multiple .ndjson.gz files stored in an AWS S3 bucket directly into the Aidbox database with maximum performance.
Be careful: you should run only one replica of Aidbox when using the aidbox.bulk/load-from-bucket operation.
File content and naming requirements
1. The file must consist of resources of the same type.
2. The file name must start with the name of the resource type; an optional postfix may follow, and the .ndjson.gz extension is required. Files can be placed in subdirectories of any level. Files with a wrong path structure are ignored.
3. Every resource in the file MUST contain the id property.
Resource requirements for aidbox.bulk/load-from-bucket:

| Operation | id | resourceType |
|---|---|---|
| aidbox.bulk/load-from-bucket | Required | Not required |
Valid file structure example:
fhir/1/Patient.ndjson.gz
fhir/1/patient-01.ndjson.gz
Observation.ndjson.gz
Invalid file structure example:
import.ndjson
01-patient.ndjson.gz
fhir/Patient
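For illustration, a decompressed Patient.ndjson.gz file could look like the lines below: one resource per line, each with an id. Per the table above, the resourceType field is not required; all other field values here are purely illustrative.

{"id": "pt-1", "name": [{"family": "Smith"}], "gender": "male"}
{"id": "pt-2", "name": [{"family": "Jones"}], "gender": "female"}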
Parameters
Object with the following structure:
- bucket: your bucket connection string in the s3:// format
- thread-num: how many threads will process the import. The default is 4.
- account credentials:
  - access-key-id: AWS key ID
  - secret-access-key: AWS secret key
  - region: AWS bucket region
- disable-idx?: the default is false. Allows dropping all indexes for the resources whose data are going to be loaded. Indexes are restored at the end of a successful import. All information about dropped indexes is stored in DisabledIndex resources.
- drop-primary-key?: the default is false. Same as the previous parameter, but drops the primary key constraint for resource tables. This parameter disables all checks for duplicates for imported resources.
- upsert?: the default is false. If upsert? is false, import for files with an id uniqueness constraint violation fails with an error; if true, records in the database are overridden with records from the import. Even when upsert? is true, it is still not allowed to have more than one record with the same id in one import file. Setting this option to true decreases performance.
- scheduler: possible values are optimal and by-last-modified; the default is optimal. Establishes the order in which the files are processed. The optimal value provides the best performance. by-last-modified should be used with thread-num = 1 to guarantee a stable order of file processing.
- prefixes: array of prefixes specifying which files should be processed. Example: with the value ["fhir/1/", "fhir/2/Patient"], only files from the directory "fhir/1" and Patient files from the directory "fhir/2" will be processed.
- connect-timeout: the default is 0. Specifies the number of milliseconds after which the file is considered failed if a connection to the resource could not be established (e.g. in case of network issues). Zero is interpreted as an infinite timeout.
- read-timeout: the default is 0. Specifies the number of milliseconds after which the file is considered failed if there is no data available to read (e.g. in case of network issues). Zero is interpreted as an infinite timeout.
Returns either the string "Upload started" on success, or an error message on failure.
Example
POST /rpc
content-type: text/yaml
accept: text/yaml

method: aidbox.bulk/load-from-bucket
params:
  bucket: s3://your-bucket-id
  thread-num: 4
  account:
    access-key-id: your-key-id
    secret-access-key: your-secret-access-key
    region: us-east-1

result:
  message: Upload from bucket <s3://your-bucket-id> started. 6 new files added.
  progress:
    total: 6
    new-files-count: 6
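A sketch of a request that also sets some of the optional parameters described above. The hyphenated spellings of the optional parameter names follow the style of the example above and are an assumption; verify them against your Aidbox version.

POST /rpc
content-type: text/yaml
accept: text/yaml

method: aidbox.bulk/load-from-bucket
params:
  bucket: s3://your-bucket-id
  thread-num: 4
  account:
    access-key-id: your-key-id
    secret-access-key: your-secret-access-key
    region: us-east-1
  # optional parameters (names assumed, see the parameter list above)
  upsert?: true            # override existing records with the same id
  disable-idx?: true       # drop and later restore indexes for loaded resources
  scheduler: optimal
  prefixes:
    - fhir/1/
    - fhir/2/Patient
  connect-timeout: 30000   # milliseconds
  read-timeout: 30000      # milliseconds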
Loader File
For each file being imported via the load-from-bucket method, Aidbox creates a LoaderFile resource. To find out how many resources were imported from a file, check its loaded field.
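LoaderFile resources can be inspected like other Aidbox resources, for example as sketched below, assuming the standard Aidbox REST endpoint is available for this resource type:

GET /LoaderFile
accept: text/yaml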
Loader File Example
{
"end": "2022-04-11T14:50:27.893Z",
"file": "/tmp/patient.ndjson.gz",
"size": 100,
"type": "Patient",
"bucket": "local",
"loaded": 20,
"status": "done"
}
{
"end": "2022-04-11T14:50:27.893Z",
"file": "/tmp/patient.ndjson.gz",
"size": 100,
"type": "Patient",
"bucket": "local",
"status": "error",
"error": {
"code": "23505",
"source": "postgres"
},
"message": "23505: ERROR: duplicate key value violates unique constraint \"patient_pkey\""
}
Sources of Error
There are the following sources of error for this request:
- AWS Error
- PostgreSQL Error
- Aidbox Error
AWS Error
| Code | Description |
|---|---|
| InvalidAccount | The AWS access key ID or AWS secret access key that you provided is not valid. |
| NoSuchKey | The specified S3 bucket or S3 object key does not exist. |
PostgreSQL Error
See the PostgreSQL documentation.
Aidbox Error
Any errors other than the above are caught as an Aidbox Error. An error message is provided if available.
How to reload a file one more time
On launch, aidbox.bulk/load-from-bucket checks whether files from the bucket have already been scheduled for import and decides what to do:
- If an ndjson.gz file already has a related LoaderFile resource, the loader skips this file from the import.
- If there is no related LoaderFile resource, Aidbox puts the file into the queue by creating a LoaderFile resource.
In order to import a file one more time, delete the related LoaderFile resource and relaunch aidbox.bulk/load-from-bucket (see the sketch below).
Files are processed completely. The loader doesn't support partial re-import.
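A minimal sketch of the re-import flow. The LoaderFile id below is hypothetical; look up the real id first, e.g. by listing LoaderFile resources as shown earlier.

DELETE /LoaderFile/<loader-file-id>

POST /rpc
content-type: text/yaml
accept: text/yaml

method: aidbox.bulk/load-from-bucket
params:
  bucket: s3://your-bucket-id
  account:
    access-key-id: your-key-id
    secret-access-key: your-secret-access-key
    region: us-east-1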
AWS User Policy: Minimal Example
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "MinimalUserPolicyForBulkImport",
"Effect": "Allow",
"Action": [
"s3:ListBucket",
"s3:GetObject"
],
"Resource": [
"arn:aws:s3:::<your-bucket-name>",
"arn:aws:s3:::<your-bucket-name>/*"
]
}
]
}
aidbox.bulk/load-from-bucket-status
Returns the status and progress of the import for the specified bucket. Possible states are: in-progress, completed, interrupted.
The interrupted state means that Aidbox was restarted during the loading process. If you run the aidbox.bulk/load-from-bucket operation again on the same bucket, the import will continue.
Example
POST /rpc
content-type: text/yaml
accept: text/yaml

method: aidbox.bulk/load-from-bucket-status
params:
  bucket: s3://your-bucket-id

result:
  state: in-progress
  progress:
    total: 6
    pending: 2
    done: 4