Cleaning up data using SQL?

  • I have a conquest server with almost 250,000 studies in it, or around 10TB of data. There are thousands of cases that have no accession number, or the accession number is 0. Also, because of a common, but bad, practice in the past, there are also thousands of duplicate accession numbers.


    There are also other fields such as the patient birthday that we would like to clean up.


    It seems that the easiest way to accomplish this would be to dump the data that we are missing from the RIS into a CSV, make a new table and import it, and update the dicomstudies and dicompatients tables.


    Are there any caveats to doing this? The system is still receiving new cases.


    I suppose I could always regenerate the database if I mess it up, but of course, a regen would take quite a while.


    There are some fields that are in two tables, will it cause weird behaviour if I only update one table? I assume it would only mean that the results would be different depending on if you were doing a patient or study level query.


    Brian

  • Hi Brian,


    I have a few clients on similar setups and amounts of data using a MySQL backend. We had a similar problem a few years ago with a new RIS and accession number grief and lots of fat finger typos with names, dobs, mismatched MRN/DOB's, duplicate MRN's, etc...


    One of the main things that can catch you up here is if you modify such things only in the database without actually modifying the image headers. (I wasn't sure if you were doing this somehow or not.)


    Reason being, many PACS systems (including Conquest) that are interconnected to RIS systems or use an existing database will simply reject or ignore inbound image transfers that contain mismatched UID's, AN's, patient information, etc... when sent from your modalities.


    Also in this case, if you do this and do have a mistake, when you regenerate the database, you'll be getting the data from the image headers and go back to your original undesired data right?


    Marcel posted a good writeup of the inherent issues of having mismatched data in the database tables and image headers/file names here:
    http://forum.image-systems.biz/viewtopic.php?f=33&t=618


    I'd highly recommend doing this through Conquest, or if the volume of data exceptions is too large, alter the image data programmatically (See below).



    Our Solution:
    We ended up searching for mismatched patient data and other problems with a set of SQL queries and a SQL browsing client, exporting the results to CSV, and had file clerks verify/correct data with the the patient charts & patient questionnaires in the CSV. Once this was completed, I created a small program in C# to read the CSV data, make the appropriate changes to the conquest MySQL database and gather data about the images that needed to be modified, and execute commands via dgate.exe options to modify the images themselves programmatically. (You can find an explanation of most of the flags/arguments for dgate.exe in the Conquest manual.) Eventually, we somewhat automated this process using the RIS data as a base, "queued" the changes that needed to be done, and had an interface to approve/edit/deny the changes in a grid control.


    It's been several years, but I may still have the Visual Studio project for this laying around somewhere. I will upload it if I can find it.


    I hope I understood your project and problem correctly. Hopefully this information will help a bit?


    Kind Regards,
    -Matt

  • Hi,


    you can clean up the database (I would suggest doing it on all tables synchroneously) for queries but this will not clean the data itself - that you recieve after a move. Eventually you would have to rewrite every incorrect object - this will also update and clean the database. I can support the suggestion to use dgate command line options to do this.


    One other solution is to consider the old data 'dirty' and split it off from a new 'clean' database and server. By using a virtual server you could still make them appear as one.


    Marcel

  • Marcel - I do need the data to be clean after a move, so thanks for that. I have been considering the dirty data / clean data separation option, it may be what I need to do.


    Mattb - Thanks for the info, I'd certainly appreciate the VB project if you can find it. What you've explained is what I'll have to do... Wrong data is going to cause problems on another device that the data is being sent to. There are probably over 10,000 records that are really going to need to be corrected, so it'll be programmatically or not at all. :)


    I'll start looking into how to use dgate to clean up the data. Many thanks.


    Brian

Participate now!

Don’t have an account yet? Register yourself now and be a part of our community!