August 2, 2005
Lately I've been working on .Net-ifying a program at work. It hasn't been as easy as I thought it would be, because I'm learning how to do threads and tweak the garbage collection as I go. So I was pleasantly surprised when I got something working Sunday afternoon.
Monday morning it started goofing up left and right. The basic process is:
- Data comes in
- Create an object with the data in it
- Kick off a thread
- In the thread:
  - Do some stuff with the data
  - Write to a database
  - Update the UI
- Clean up the object's variables, abort the thread and let the object get GCed
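As a rough sketch of the steps above: this is Python rather than .NET, and every name in it (`Packet`, `on_data`, the `database` and `ui_messages` lists) is made up for illustration, not taken from the real program. It also lets each thread end by returning from its function instead of aborting it, which is an assumption about how the cleanup could work.

```python
import threading

database = []      # hypothetical stand-in for the real database table
ui_messages = []   # hypothetical stand-in for the UI

class Packet:
    """Holds one chunk of incoming data (hypothetical name)."""

    def __init__(self, data):
        self.data = data
        self.result = None

    def run(self):
        # Do some stuff with the data
        self.result = self.data.upper()
        # Write to the database
        database.append(self.result)
        # Update the UI
        ui_messages.append("wrote " + self.result)
        # Clean up the object's variables; the thread ends when run() returns
        self.data = None
        self.result = None

def on_data(data):
    """Data comes in: create an object and kick off a thread for it."""
    packet = Packet(data)
    worker = threading.Thread(target=packet.run)
    worker.start()
    return worker

# Simulate three packets arriving, then wait for their threads to finish.
workers = [on_data(d) for d in ("alpha", "beta", "gamma")]
for w in workers:
    w.join()
```

Once each thread finishes and nothing else holds a reference to its `Packet`, the object is eligible for garbage collection.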
Well, that's the theory anyway. And all day Sunday it worked beautifully. CPU usage stayed somewhere around 5-10% and memory usage was pretty flat. Monday was different.
After an indeterminate amount of time, it just stopped writing to the database. No exceptions were thrown, it just wasn't doing anything. It managed to skip straight to the part where it updates the UI somehow, so on the surface it looked like it was working.
Checking the database, though, revealed that data was no longer getting written (we keep a column that just gets populated with GETDATE() whenever it's written, so we can tell at a glance whether it's working or not). After a couple minutes objects stopped getting cleaned up and CPU usage went up.
Obviously, something's not happening and the objects aren't getting cleaned up any more. Time to figure out why. I've been averaging about 45 minutes of uptime, so I create a new version of the executable at 3:30, one that spews a bunch of debug data to a file each time a thread kicks off, and turn it loose. As I leave work it's still humming along with no cares in the world. No problem, I think, I'll just connect via Remote Desktop and take a peek, since I'll have to restart the thing anyway.
It's been running for five hours now and shows no sign of stopping. I don't want to let it run overnight, because when it screws up the data is lost, and I don't want to explain to the clients (or my boss) why we have an eight-hour gap in the data. I also don't want to shut it off, because the data will be stored by the machines sending it in and we'll get flooded with packets when I get to work and fire the thing up.
So now I'm actually in the odd position of wanting my program to fail.
Edit, 8:36 AM: Seventeen hours later, it's still going. I just have to remember to delete the text file every couple of hours until the thing chokes.
Edit, 4:03 PM: And it's still going. This thing's never run for more than 24 hours. So instead of logging in to kill and restart it, now I apparently have to log in to flush the debug file since it grows at the rate of about 20 megs a day.
Edit, Wednesday @ 1:37 PM: The only thing I've been able to think of is that the brief delay during which it accesses the text file is letting it take its time on the database stuff, which I think may be the culprit. I just have too much to do today to test it out. I wonder if I'll have to rig up a system where I set a flag when I start writing and set it back when I'm done, making other threads wait a few milliseconds if the connection is busy.
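One way the "flag while writing" idea might look, sketched in Python rather than .NET. It swaps the hand-rolled flag for a lock, which does the set-and-check in one step so two threads can't both see the connection as free. The names (`db_lock`, `write_row`, `rows`) are hypothetical, and the short sleep only pretends to be a database write. Whether contention on the connection is really the culprit is still just a guess.

```python
import threading
import time

db_lock = threading.Lock()  # plays the role of the "connection is busy" flag
rows = []                   # hypothetical stand-in for the database table
active_writers = 0          # how many threads are inside the write right now
max_active_writers = 0      # highest overlap seen; should stay at 1

def write_row(value):
    """Wait until the connection is free, then write one row."""
    global active_writers, max_active_writers
    with db_lock:  # other threads block here until this write finishes
        active_writers += 1
        max_active_writers = max(max_active_writers, active_writers)
        time.sleep(0.001)  # pretend the database write takes a moment
        rows.append(value)
        active_writers -= 1

# Twenty threads all try to write at once.
threads = [threading.Thread(target=write_row, args=(i,)) for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the lock in place, every row gets written and no two writes overlap, at the cost of threads queueing up behind one another when data arrives in bursts.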