Import large number of files from disk

joelspangler · Apr 8th 2014

I have about 15 million DICOM files (600k studies?) on an external disk array. Disk has DICOM files only (no software) and I'm hoping to import them into an instance of conquest, then make available to other systems via query/retrieve. Is there a way to have conquest simply inventory the files (ie put them in the database), but leave the file structure on the external disk array as-is? The folder structure is a random 8 character alpha-numeric folder for each study, and DICOM files within are that same random name + .1 .2 .3 (up to .2000 or higher for large CT studies - files do not have a .dcm identifier). Initial testing of putting a sample of the files in the "data" folder and telling the system to re-inventory failed - the files were not recognized. Placing files in a watch folder or the "incoming" folder works, but conquest seems to only inventory one single file at a time, and it seems to copy to a new folder then delete the original rather than just move the existing file. The speed of the sampling that I did was slow, so much so that my calculations say that it would take about 5 days to complete importing via this method.

What is the best/fastest way of importing these files using conquest and/or another tool (if there is something better suited for my situation).

blub_smile · Apr 9th 2014

Hi

Usually it is as simple as:
1. Make the disc/drive available within the OS
2. Add the path of your dicom data to Conquest as a mag device
3. re-index that specific device
4. done

I would copy a couple of hundred MBs for testing to another location first.

Speed of re-indexing varies on the disc-drive speeds and database enginge. With MySQL on a ssd and 7200rpm RAID 6 as data storage average speed is between 200-400 images/s (CT/MRI - CR is slower)- depending on defragmentation status.

You said putting files in the data-directory an re-index failed. That is kinda strange.
Therefore the following questions:
1. What is the file extension of your files?
2. Do ya have a log of the re-index (not the whole log - just some lines)

joelspangler · Apr 9th 2014

Files have extensions of .1 through however many series the case has - so .2000 on a 2000 slice CT. Full file names would be something like 01K46HS8.1 - 01K46HS8.2000 (and these would be in a folder named 01K46HS8). When I tested the other day, I think i put one more "level" of directories in there so data\testdata\01K46HS8\01K46HS8.1 (does it only go so "deep" when re-indexing?

I don't have a log to send at this point, but will try again. Is there a specific re-index log, or would it be contained in one of the other logs (windows based instance)

marcelvanherk · Apr 9th 2014

Hi,

version 1.4.17d allows these filenames, older ones not.

Marcel

blub_smile · Apr 10th 2014

Hi

does it only go so "deep" when re-indexing?

No it doesn't. I haven't come across a limit of "deep" it goes.

joelspangler · Apr 10th 2014

I upgraded to 1.4.17d (from 17c) today and ran against a larger test dataset. It now regens a small subset of images. It is picking up series with .1000 and higher filenames, but is still completely ignoring files with .1, .2, etc. Out of the 45K images that are in the directory, only 5.5K were recognized. Is there any configurable option to tell conquest what to inventory vs what to ignore when doing a regen? (ie - I'd like to tell it that .wav and .tsrt files shouldn't be inventoried)

marcelvanherk · Apr 10th 2014

Hi,

this may be a bug. I will have a look.

Marcel

marcelvanherk · Apr 11th 2014

Hi,

the regen code in 1.4.17d allows any 3-letter extension. Is it an option to rename the files to .dcm?

Marcel

joelspangler · Apr 11th 2014

I wouldn't be against renaming to .dcm, but files are in thousands of folders, and there are 15 million of them. I don't know of a bulk renaming tool that can handle that volume. I'd have to keep the full filename and add .dcm, so I'd go from something like 8ghy9h3v.1 to 8ghy9h3v.1.dcm

If anyone knows of a tool that can accomplish a mass-rename job like this, I'd appreciate information on it. Most tools I'm finding can only replace a file extensions, which would result in duplicate file names.

marcelvanherk · Apr 11th 2014

Hi Joel,

I can also wip up an update that will allow the naming.

Marcel

joelspangler · Apr 14th 2014

I decided to just put them in the input folder and let it crunch though them - got through about 45% of them over the weekend. I figured out with my previous testing that it was compressing each image - now changed to "as" compression. I'm getting about 1000 studies per minute on average. Some benefits of doing it this way: the files are organized a bit more logically, and in the event of a crash the process is restartable without having to re-read any files. I thought about trying to set up two instances both hitting the same storage location and MSSQL database, but I was afraid that I might make a mistake and cause myself more issues.

marcelvanherk · Apr 14th 2014

Nice!

Marcel

Import large number of files from disk

Participate now!

Share