Here are huge amount of replies, but I'll try to sum up my ideas about this.
Something crashes or does not work correctly could be result of your own mistakes - then it should be easily to fix when you locate the problem. But there is also possibility of hardware failures - and that's difficult if not impossible to fix in overall.
I would recommend first to try to catch the problematic situation by logging (stack, registers, function calls) - either by logging them somewhere into file, or transmitting them somehow directly ("oh no - I'm crashing").
Recovery from such error situation is either reboot (if software is still alive and kicking) or hardware reset (e.g. hw watchdogs). Easier to start from first one.
If problem is hardware related - then logging should help you to identify in which function call problem occurs and that can give you inside knowledge of what is not working and where.
Also if code is relatively complex - it makes sense to "divide and conquer" it - meaning you remove / disable some function calls where you suspect problem is - typically disabling half of code and enabling another half - you can get "does work" / "does not work" kind of decision after which you can focus into another half of code. (Where problem is)
If problem occurs after some time - then stack overflow can be suspected - then it's better to monitor stack point registers - if they constantly grows.
And if you manage to fully minimize your code until "hello world" kind of application - and it's still failing randomly - then hardware problems are expected - and there needs to be "hardware upgrade" - meaning invent such cpu / ram / ... -hardware combination which would tolerate radiation better.
Most important thing is probably how you get your logs back if machine fully stopped / resetted / does not work - probably first thing bootstap should do - is a head back home if problematic situation is entcovered.
If it's possible in your environment also to transmit a signal and receive response - you could try out to construct some sort of online remote debugging environment, but then you must have at least of communication media working and some processor/ some ram in working state. And by remote debugging I mean either GDB / gdb stub kind of approach or your own implementation of what you need to get back from your application (e.g. download log files, download call stack, download ram, restart)