Segmentation fault after regendir

  • OS: CentOS5
    Conquest 1.4.14


    I often get a segmentation fault resulting in dgate crashing after I do the following (testing dgate commands).


    1.) Copy patient directory into a temp folder (or from MAG0 into MAG1).
    2.) Delete the patient ( ./dgate --deletepatient:12345).
    3.) Copy the saved directory back into MAG0.
    4.) Regen just the restored dir: (./dgate --regendir:MAG0,12345).


    This is not 100 percent repeatable but happens very often. It appears dgate crashes after the regen is complete (as noted
    by the serverlog). It doesn't appear to be time related (I've waited 5 minutes between the delete and regen calls).
    If I get any more clues I'll post.


    Thanks!
    Tego

  • More testing:


    OS: Linux - Centos (x86)
    Conquest: 1.4.14
    MySql: mysql Ver 14.12 Distrib 5.0.45


    Testing using the default DB (DBASE III) doesn't seem to crash (or hasn't in my testing). I'll continue researching this
    and start looking at the code to see if I can get more clues. I'll continue testing with the default DB in addition to MySql and report
    when I have additional information/progress.


    Also, I'm starting some documentation on "dgate command usage examples" that I hope will help others too.


    Tego

  • I have found that --deleteseries causes a segfault with mysql but not dbase that might be a related problem. I am not explicitly running gdb but the output is quite informative.



    I also tried it on a series containing only 1 image with similar results.

  • I found some clues that I hope will help (not fully up to date on the code yet).
    My testing to this point is only with MySQL and dbase. Only MySQL crashes and only
    after deleting a patient followed by a "regendir" while dgate is running.


    clue 1: When dgate is stopped after deleting the patient then restarted the "regendir" call does NOT crash dgate.
    clue 2: The function Database::NextRecord is where the crash happens:


    Code snippet from odbci.cpp:


    for(i=0;i<ncolumns;i++)
    {
      if(row[i]) //Steini 20050507
        switch (CTypeCodes[i])
        {
          case SQL_C_CHAR:
            strcpy((char *)(vPtrs[i]),row[i]);
            *LengthNeededs[i] = lengths[i]; // strlen(row[i]);
            break;


    When the above code is executed AFTER deleting a patient and then restarting dgate followed by
    a "regendir", the array item row[i] (i=0) is NULL and strcpy((char *)(vPtrs[i]),row[i]) is NOT executed.
    NOTE: vPtrs[0] holds a NULL pointer at this point.


    However, when a deletepatient is immediately followed by a "regendir", the array item row[0] has a value
    and strcpy((char *)(vPtrs[i]),row[i]) is called with a NULL pointer in vPtrs[0]. This crashes dgate (vPtrs[0]=NULL).
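The failure mode described above can be sketched in isolation. The following is a minimal illustration, not the actual odbci.cpp code: copyBoundColumns is a hypothetical stand-in for the copy loop, and the extra check on vPtrs[i] is an assumption about how such a loop could defend itself against an unbound column.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for the copy loop in Database::NextRecord.
// row[]   : column values of the current result row (NULL for SQL NULL)
// vPtrs[] : destination buffers registered by BindField (NULL if never bound)
// Returns the number of columns actually copied.
int copyBoundColumns(char **row, void **vPtrs, int ncolumns)
{
    assert(ncolumns >= 0);
    int copied = 0;
    for (int i = 0; i < ncolumns; i++)
    {
        if (!row[i])   continue;  // SQL NULL: nothing to copy
        if (!vPtrs[i]) continue;  // column was never bound: skip rather than crash
        strcpy((char *)vPtrs[i], row[i]);
        copied++;
    }
    return copied;
}
```

Such a guard only helps when the unbound slot happens to be NULL. A stale but non-NULL pointer left over from an earlier query would pass the check and still corrupt memory, so the binding itself has to be correct.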


    I won't have time to work on this more for several days, but I assume the deletepatient call should be resetting some params that are initialized to a known state on program startup.


    Hope this is helpful and I'll continue with more debugging when I get back.


    Thanks,


    Tego

  • Hello Marcel

    Quote

    You wrote <snip>
    from where was the crashing function Database::NextRecord called? I am not able to reproduce the error with SqLite


    Sorry I have only tested with dbase III and MySql.


    However, I just tried with sqLite and it doesn't seem to have the same problem ( regendir after deletepatient ).


    Therefore the "deletepatient, followed by a regendir" ONLY crashes in MY testing when I use MySQL.
    I will test further using sqLite to ensure that I just didn't get lucky.


    For the bug that I was seeing, I recompiled with "debug flags" and backtraced from the segmentation fault (using gdb
    and "ddd" for debugging). When I isolated the segmentation fault (in the code listed previously) I set breakpoints and tested using
    the following method:


    NOTES:
    A.) The code snippet in my previous post is called in each of my three tests below.
    B.) THIS crash occurs in MySql and not with dbase III or sqLite (this code is only included when compiled for MySQL).
    C.) In tests 1 and 2 below, "row[0]" is NULL and the strcpy function is NOT called (vPtr[0]=NULL in all 3 tests).
    However, in test 3 below, "row[0]" has a value and the strcpy is called with vPtr[0]=NULL (which causes the segmentation
    fault).


    TESTING:
    1.) Test repeating "regendir" command WITHOUT deleting the patient .


    Result: No crash.


    2.) Delete patient, stop and restart dgate then do a "regendir".


    Result: No crash.


    3.) Delete patient, immediately followed by a "regendir".


    Result: Segmentation fault caused by strcpy to NULL pointer.
    ENDTESTING:


    I didn't follow the function calls into this code, just found where the code was crashing and looked for the cause
    ( strcpy to NULL pointer).


    I WILL study the code as soon as I can so I can actually understand what I'm debugging and see if I can find the root cause.
    Also, I will try to debug the crash that "markperson" noted too (deleteseries). I have hopes they are all related.


    Thanks,
    Tego

  • Hi,


    I really appreciate your debug efforts: they are very helpful.


    I believe that ncolumns may be wrong, but I can only tell what it should be (0?) and what it actually is once I know where the function is called from. In any case the problem may originate from loops like this in the delete code (in dbsql.cpp):


    Code
    if (aDB.Query(WorkListTableName, "DICOMWorkList.PatientID", QueryString, NULL))
    {
      aDB.BindField (1, SQL_C_CHAR, Dum, 255, &sdword);
      while (aDB.NextRecord())
        aDB.DeleteRecord(WorkListTableName, DeleteString);
    }

    The DeleteRecord call may affect the ncolumns variable used by NextRecord. To test an easy possible fix, use a separate DB for DeleteRecord:


    Code
    bDB.DeleteRecord(WorkListTableName, DeleteString);


    bDB would have to be opened with the usual passwords etc.


    Addition: Or you can remove the line:


    Code
    return NewDeleteFromDB(DDO, DB);


    in RemoveFromWorld (dbsql.cpp). This would enable alternative code. If that fixes the crash, we have localized it.


    Marcel

  • Hi Marcel,


    Quote

    I removed return NewDeleteFromDB(DDO, DB); from dbsql.cpp and the crash still happens


    Following is a back trace from the original code (note the back trace from the modified code shows the same -- except for memory locations changed by the recompile ).


    #9 0x0804e36d in DriverHelper () at dgate.cpp:10445
    #8 0x0804e334 in Driverapp::ServerChildThread () at dgate.cpp:10420
    #7 0x080c5ea8 in StorageApp::ServerChild () at dgate.cpp:12455
    #6 0x0809b84b in Regen () at regen.cpp:505
    #5 0x0809b680 in RegenDir () at regen.cpp:296
    #4 0x0809b3d5 in RegenToDatabase () at regen.cpp:135
    #3 0x0809b0be in SaveToDataBase () at dbsql.cpp:858
    #2 0x08099b8b in UpdateOrAddToTable () at dbsql.cpp:1314
    #1 0x080912f2 in Database::NextRecord () at dbsql.cpp:1314
    #0 0x00552b93 in strcpy () from libc.so.6


    Currently, I have a test that fails 100 % of the time. I have one patient in the database (I can send the data if that would help). I start dgate (dgate -v), then delete the patient ( ../dgate --deletepatient:23284) and then immediately follow with a regendir ( ./dgate --regendir:MAG0,23284).


    The same test works fine with sqLite and dbase III.
    I'm sorry I'm not more help. I REALLY need to read through the code before I can be useful.


    Tego

  • Tego,


    can you printf row[i] and particularly the variable ncolumns in the routine NextRecord() that crashes? You can use OperatorConsole.printf to put it into the log. It may also help to give a debug log of both operations.
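A small sketch of the kind of debug line requested here, assuming a hypothetical helper named formatColumn that is not part of Conquest. Its main point is that a NULL column value must never be passed directly to a %s format, since that is undefined behaviour and could itself crash while debugging a crash.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: builds one log line per column without ever
// passing a NULL pointer to %s.
std::string formatColumn(int i, int ncolumns, const char *value)
{
    assert(i >= 0 && i < ncolumns);
    char line[512];
    snprintf(line, sizeof(line), "i=%d,ncolumns=%d,row[%d]=%s",
             i, ncolumns, i, value ? value : "NULL");
    return std::string(line);
}
```

Inside the copy loop, the resulting string could then be handed to whatever logging call is available, for example OperatorConsole.printf as suggested above, with the string passed as a "%s" argument.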


    I tried to apt-get mysql and it drives me crazy. After hours of downloading using various debian mirrors, I still can't get mysql to install (I use an old Knoppix installed to HD that works fine except for that). So I cannot debug the code yet, because the same sequence you list does not crash with MySQL under windows.


    Marcel

  • Marcel,
    Following is a print out:


    1.) First is a printout when the dgate starts with the single patient in MAG0: (This gives a base starting point).
    ./dgate -v -r
    i=0,ncolumns=14,row[0]=1.2.392.200036.9125.2.1767268739.64567296987.17566
    i=1,ncolumns=14,row[1]=20081020
    i=2,ncolumns=14,row[2]=161627.000
    i=3,ncolumns=14,row[i]=NULL
    i=4,ncolumns=14,row[i]=NULL
    i=5,ncolumns=14,row[i]=NULL
    i=6,ncolumns=14,row[i]=NULL
    i=7,ncolumns=14,row[i]=NULL
    i=8,ncolumns=14,row[i]=NULL
    i=9,ncolumns=14,row[9]=CR
    i=10,ncolumns=14,row[10]=xxxxx
    i=11,ncolumns=14,row[i]=NULL
    i=12,ncolumns=14,row[12]=F
    i=13,ncolumns=14,row[13]=23284
    i=0,ncolumns=4,row[0]=23284
    i=1,ncolumns=4,row[1]=xxxxxxx
    i=2,ncolumns=4,row[i]=NULL
    i=3,ncolumns=4,row[3]=F
    [Regen] ./data/23284/1.2.392.200036.9125.3.1767268739.64567296987.17567_1001_001001_12248872290003.v2 -SUCCESS


    2.) Next dgate is started and the patient is deleted:


    dgate -v
    # sent ./dgate --deletepatient:23284
    DGATE (1.4.14, build Fri Oct 31 15:36:19 2008) is running as threaded server
    Server command sent using DGATE -- option
    Deleting patient: 23284
    Query On Image
    Issue Query on Columns: DICOMImages.SOPInstanc, DICOMSeries.SeriesInst, DICOMStudies.PatientID, DICOMStudies.StudyInsta
    Values: DICOMStudies.PatientID = '23284' and DICOMSeries.StudyInsta = DICOMStudies.StudyInsta and DICOMImages.SeriesInst = DICOMSeries.SeriesInst
    Tables: DICOMImages, DICOMSeries, DICOMStudies
    i=0,ncolumns=4,row[0]=1.2.392.200036.9125.4.0.135239000.39391488.440608599
    i=1,ncolumns=4,row[1]=1.2.392.200036.9125.3.1767268739.64567297361.17569
    i=2,ncolumns=4,row[2]=23284
    i=3,ncolumns=4,row[3]=1.2.392.200036.9125.2.1767268739.64567296987.17566
    i=0,ncolumns=4,row[0]=1.2.392.200036.9125.4.0.135238488.31920384.440608599
    i=1,ncolumns=4,row[1]=1.2.392.200036.9125.3.1767268739.64567296987.17567
    i=2,ncolumns=4,row[2]=23284
    i=3,ncolumns=4,row[3]=1.2.392.200036.9125.2.1767268739.64567296987.17566
    Records = 2
    RemoveFromPACS 2 images
    i=0,ncolumns=2,row[0]=23284/1.2.392.200036.9125.3.1767268739.64567297361.17569_1002_001002_12248872300004.v2
    i=1,ncolumns=2,row[1]=MAG0
    Removed file: [MAG0:23284/1.2.392.200036.9125.3.1767268739.64567297361.17569_1002_001002_12248872300004.v2]
    IOD File being removed: ./data/23284/1.2.392.200036.9125.3.1767268739.64567297361.17569_1002_001002_12248872300004.v2
    i=0,ncolumns=1,row[0]=1.2.392.200036.9125.4.0.135239000.39391488.440608599
    i=0,ncolumns=1,row[0]=1.2.392.200036.9125.2.1767268739.64567296987.17566
    i=0,ncolumns=1,row[0]=23284
    i=0,ncolumns=2,row[0]=23284/1.2.392.200036.9125.3.1767268739.64567296987.17567_1001_001001_12248872290003.v2
    i=1,ncolumns=2,row[1]=MAG0
    Removed file: [MAG0:23284/1.2.392.200036.9125.3.1767268739.64567296987.17567_1001_001001_12248872290003.v2]
    IOD File being removed: ./data/23284/1.2.392.200036.9125.3.1767268739.64567296987.17567_1001_001001_12248872290003.v2
    i=0,ncolumns=1,row[0]=1.2.392.200036.9125.4.0.135238488.31920384.440608599


    3.) Finally the print out after sending the regendir command:
    # ./dgate --regendir:MAG0,23284


    Server command sent using DGATE -- option
    [Regen] ./data/23284/1.2.392.200036.9125.3.1767268739.64567297361.17569_1002_001002_12248872300004.v2 -SUCCESS
    i=0,ncolumns=14,row[0]=1.2.392.200036.9125.2.1767268739.64567296987.17566
    Segmentation fault


    ####
    Perhaps something is left in the database that should have been removed with the "deletepatient" command ?


    I'm happy to do anything I can to help!


    Tego

  • Marcel,
    Following is a printout using %p for row[i] and vPtrs[i]. I also have another clue (I think). IF I run --deletepatient twice
    on the same patient ID and then run --regendir the system does NOT crash. This patient has two images which might be another clue.


    **Initial startup with ./dgate -v -r with the single patient in MAG0
    ----------------------------------------------------------------------------------------------
    Step 2: Load / Add DICOM Object files
    Regen Device 'MAG0'
    [Regen] ./data/23284/1.2.392.200036.9125.3.1767268739.64567297361.17569_1002_001002_12248872300004.v2 -SUCCESS
    row[0]=0x9905f4c,vPtr[0]=0x80481f8
    row[1]=0x9905f7f,vPtr[1]=0xbfd036a4
    row[2]=0x9905f88,vPtr[2]=0xbfd037a4
    row[3]=(nil),vPtr[3]=0xbfd038a4
    row[4]=(nil),vPtr[4]=0xbfd039a4
    row[5]=(nil),vPtr[5]=0xbfd03aa4
    row[6]=(nil),vPtr[6]=0xbfd03ba4
    row[7]=(nil),vPtr[7]=0xbfd03ca4
    row[8]=(nil),vPtr[8]=0xbfd03da4
    row[9]=0x9905f93,vPtr[9]=0xbfd03ea4
    row[10]=0x9905f96,vPtr[10]=0xbfd03fa4
    row[11]=(nil),vPtr[11]=0xbfd040a4
    row[12]=0x9905fa6,vPtr[12]=0xbfd041a4
    row[13]=0x9905fa8,vPtr[13]=0xbfd042a4
    row[0]=0x9905f24,vPtr[0]=0x80481f8
    row[1]=0x9905f2a,vPtr[1]=0xbfd036a4
    row[2]=(nil),vPtr[2]=0xbfd037a4
    row[3]=0x9905f3a,vPtr[3]=0xbfd038a4
    [Regen] ./data/23284/1.2.392.200036.9125.3.1767268739.64567296987.17567_1001_001001_12248872290003.v2 -SUCCESS
    Regeneration Complete
    [root@pacs1 conquest]#


    #run --deletepatient
    -------------------------------------------------------------
    [root@pacs1 conquest]# ./dgate -v
    DGATE (1.4.14, build Sat Nov 1 12:19:15 2008) is running as threaded server
    Server command sent using DGATE -- option
    Deleting patient: 23284
    Query On Image
    Issue Query on Columns: DICOMImages.SOPInstanc, DICOMSeries.SeriesInst, DICOMStudies.PatientID, DICOMStudies.StudyInsta
    Values: DICOMStudies.PatientID = '23284' and DICOMSeries.StudyInsta = DICOMStudies.StudyInsta and DICOMImages.SeriesInst = DICOMSeries.SeriesInst
    Tables: DICOMImages, DICOMSeries, DICOMStudies
    row[0]=0x8580c6c,vPtr[0]=0x857a398
    row[1]=0x8580ca1,vPtr[1]=0x857a4d0
    row[2]=0x8580cd4,vPtr[2]=0x857a608
    row[3]=0x8580cda,vPtr[3]=0x8584c08
    row[0]=0x8580d34,vPtr[0]=0x857a398
    row[1]=0x8580d69,vPtr[1]=0x857a4d0
    row[2]=0x8580d9c,vPtr[2]=0x857a608
    row[3]=0x8580da2,vPtr[3]=0x8584c08
    Records = 2
    RemoveFromPACS 2 images
    row[0]=0x858084c,vPtr[0]=0xb7ed412a
    row[1]=0x85808a3,vPtr[1]=0xb7ed402a
    Removed file: [MAG0:23284/1.2.392.200036.9125.3.1767268739.64567297361.17569_1002_001002_12248872300004.v2]
    IOD File being removed: ./data/23284/1.2.392.200036.9125.3.1767268739.64567297361.17569_1002_001002_12248872300004.v2
    row[0]=0x8580848,vPtr[0]=0xb7ed3d18
    row[0]=0x8580848,vPtr[0]=0xb7ed3d18
    row[0]=0x8580848,vPtr[0]=0xb7ed3d18
    row[0]=0x858084c,vPtr[0]=0xb7ed412a
    row[1]=0x85808a3,vPtr[1]=0xb7ed402a
    Removed file: [MAG0:23284/1.2.392.200036.9125.3.1767268739.64567296987.17567_1001_001001_12248872290003.v2]
    IOD File being removed: ./data/23284/1.2.392.200036.9125.3.1767268739.64567296987.17567_1001_001001_12248872290003.v2
    row[0]=0x8580848,vPtr[0]=0xb7ed3d18

    # Run --regendir:MAG0,23284
    ------------------------------------------------------------------
    Server command sent using DGATE -- option
    [Regen] ./data/23284/1.2.392.200036.9125.3.1767268739.64567297361.17569_1002_001002_12248872300004.v2 -SUCCESS
    row[0]=0x861cb64,vPtr[0]=(nil)
    Segmentation fault


    Tego


    Is there a command I could call to run through NextRecord AFTER doing the delete so we could see the state
    of row[] and vPtrs[] ?

  • Hi,


    looking at your last post, I may have found it. You see VPTR[1]...[N] fall nicely in a sequence, but VPTR[0] does not. Looking at the code in dbsql.cpp, I see that it is actually never filled in function UpdateOrAddToTable! So the query asks for the UID in the database, but never actually reads it. So the crash occurs when the pointer in VPTR[0] is invalid from a previous operation.


    Code
    for (i=1; i<=lastfield; i++)
    {
      if(!DB.BindField (i+1, SQL_C_CHAR, s[i-1], 255, &sdword))
        return ( FALSE );
      s[i-1][0] = 0;
    }


    Try to add here after:


    Code
    char dum[256];
    DB.BindField (1, SQL_C_CHAR, dum, 255, &sdword);
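The idea behind this fix can be modeled with a toy example. Everything below is a hypothetical, much simplified mock (MockDB, InvalidateBinding, AllBound and bindForUpdate are invented for illustration and are not the Conquest Database class). It only assumes that bindings are stored per column slot and survive from one query to the next.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified model of a handle whose bindings persist across queries.
struct MockDB
{
    enum { MAXCOLS = 16 };
    void *vPtrs[MAXCOLS];

    MockDB() { for (int i = 0; i < MAXCOLS; i++) vPtrs[i] = NULL; }

    // Columns are numbered from 1, as in the BindField calls quoted in the thread.
    bool BindField(int column, char *buffer)
    {
        if (column < 1 || column > MAXCOLS) return false;
        vPtrs[column - 1] = buffer;
        return true;
    }

    // Simulates a buffer from an earlier operation no longer being valid.
    void InvalidateBinding(int column)
    {
        assert(column >= 1 && column <= MAXCOLS);
        vPtrs[column - 1] = NULL;
    }

    // True when every column of an ncolumns-wide result has a destination buffer.
    bool AllBound(int ncolumns) const
    {
        for (int i = 0; i < ncolumns; i++)
            if (!vPtrs[i]) return false;
        return true;
    }
};

// Binds columns 2..lastfield+1 like the quoted loop; binds column 1 only if
// a dummy buffer is supplied (the proposed fix).
bool bindForUpdate(MockDB &db, char s[][256], int lastfield, char *dummy)
{
    for (int i = 1; i <= lastfield; i++)
    {
        if (!db.BindField(i + 1, s[i - 1])) return false;
        s[i - 1][0] = 0;
    }
    if (dummy) return db.BindField(1, dummy);
    return true;
}
```

In this model, calling bindForUpdate without a dummy leaves slot 0 in whatever state the previous operation left it, while passing a dummy buffer gives every returned column a valid destination. The dummy buffer must stay alive for as long as records are fetched through that binding.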


    I am curious whether this helps for this and maybe a few other mystery crashes!
    Marcel

  • Hi Marcel,


    Good catch !


    I added a global var at the top of the file just before the "forward declarations" in dbsql.cpp.

    Code
    //TK
    char dummy[256]={0};


    Then I added your snippet following the initialization loop
    Existing code:


    New code:

    Code
    //TK
    // Added global var dummy[256] at top of file.
    DB.BindField (1, SQL_C_CHAR, dummy, 255, &sdword);


    This fixed my test case for --deletepatient, --regendir. I'm going to comment out the code and see if I can
    get dgate to crash with the "--deleteseries" noted by markpearson.
    If I can duplicate his crash then I'll reenable this code and see if it fixes that problem also.


    Thanks!
    Tego
