@(#)AIX crash dump analysis 01 MAY 2000 Rob Thomas robt@cymru.com

AIX Crash Dump Analysis Sample

I was called in to review a failed IBM RS/6000 F50 running AIX 4.3.2. The box, serving as a firewall and running Check Point FireWall-1, was crashing and rebooting intermittently. Fortunately, savecore had been enabled and I was able to peruse a good crash dump.

A brief aside is necessary here: If you do not have savecore enabled, you are missing a valuable repository of post-mortem information. It costs nothing to have it enabled -- until a crash dump occurs -- and it provides your kernel analyst with valuable information. Without a crash dump, it may be well-nigh impossible to determine the exact cause of the failure. Enable savecore.

Note that while this article details an AIX crash dump analysis, the steps are broadly similar to those I perform on a Solaris box. The commands and their output are not interchangeable with Solaris, however.

The crash man page will provide some sketchy details regarding the command. Unfortunately, without some UNIX source code knowledge, the information crash provides is almost worthless. For this reason, there is a great deal of reading you should do before attempting kernel crash dump analysis. My book list includes the relevant collection of tomes, and can be found at:

    http://www.cymru.com/Books/index.html

I always begin my AIX crash dump analysis with a quick query to learn a bit about the host. Here we have the output of the stat subcommand:

    > stat
            sysname: AIX
            nodename: deadbeef
            release: 3
            version: 4
            machine: 000315684C00
            time of crash: Mon Apr 24 00:01:19 CUT 2000
            age of system: 84 day, 19 hr., 36 min.
            xmalloc debug: disabled
            abend code: 300
            csa: 0x2ff3b400
            exception struct:
                    dar: 0x00000000
                    dsisr: 0x00000000:
                    srv: 0x00000000
                    dar2: 0x00000000
                    dsirr: 0x00000000: (errno) "Error 0"

Ah, an AIX 4.3 box, as expected. We also have the date and time of the crash, which can be correlated with other logging mechanisms.
This will become quite important in the final analysis, as we shall see later in this article.

Next I fire off the sysconfig subcommand to learn a bit more about the host.

    > sysconfig
    SYSTEM CONFIGURATION
            architecture: POWER_PC
            implementation: 604
            version: 604
            width: 0x00000020
            ncpus: 0x00000002
            cache_attrib: 0x00000001
            cach_cong: 0x00000000
            icache_size: 0x00008000
            dcache_size: 0x00008000
            icache_asc: 0x00000004
            dcache_asc: 0x00000004
            icache_block: 0x00000020
            dcache_block: 0x00000020
            icache_line: 0x00000020
            dcache_line: 0x00000020
            L2_cache_size: 0x00040000
            L2_cache_asc: 0x00000001
            itlb_size: 0x00000080
            dtlb_size: 0x00000080
            itlb_asc: 0x00000002
            dtlb_asc: 0x00000002
            tlb_attrib: 0x00000001
            resv_size: 0x00000020
            priv_lck_cnt: 0x00000000
            prob_lck_cnt: 0x00000000
            rtc_type: 0x00000002
            virt_alias: 0x00000000
            model_arch: 0x00000003
            model_impl: 0x00000006
            Xint: 0x00077359
            Xfrac: 0x00004F28

Now I have the hardware platform: a two-CPU PowerPC 604 machine. The real fun begins as I peruse the stack trace. This gives me the last few calls run by the kernel just before the panic. I will use the trace -k subcommand for this step.

    > trace -k
    STACK TRACE:
    0x2ff3b400 (excpt=00000000:0a000000:00000000:00000000:00000106) (intpri=11)
            IAR:      .simple_lock+18 (00009518):    stwcx.  r6,r0,r3
            LR:       .fselpoll_cleanup+cc (0008c104)
            2ff3b260: .select+988 (00198a4c)
            2ff3b3c0: .sys_call_ret+0 (00003a10)
    IAR not in kernel segment.

Interesting! The kernel was in a select call when things went awry. This indicates that the process (yet to be determined) was polling a set of descriptors, quite possibly sockets. This is starting to smell like a network issue.

I next determine what processes were running on this box at the time of the crash. This I accomplish with the status subcommand.

    > status
    CPU     TID  TSLOT     PID  PSLOT  STOPPED  PROC_NAME
      0    104d     16     d46     13      yes   errdemon
      1    1945     25    1740     23      yes   inetd

A-ha! We have a daemon with a great deal of network savvy, inetd. This fits nicely with the call to select() above.
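The status output is worth a second look before moving on, because the subcommands that follow take the PSLOT value rather than the PID. The sketch below is only an illustration of reading that table, not part of crash itself. It assumes the TID and PID columns are printed in hexadecimal (d46 is not a valid decimal number), it leaves the slot columns untouched since their base is not stated, and the function names parse_status and slot_for are hypothetical.

```python
# Column layout copied from the status output above; spacing is approximate.
STATUS_OUTPUT = """\
CPU     TID  TSLOT     PID  PSLOT  STOPPED  PROC_NAME
  0    104d     16     d46     13      yes   errdemon
  1    1945     25    1740     23      yes   inetd
"""


def parse_status(text):
    """Parse `status` output into one dict per CPU row."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        row = dict(zip(header, ln.split()))
        # Assumption: TID and PID are hexadecimal; convert them so they can
        # be compared with decimal PIDs in ps output or log files.
        row["PID_DEC"] = int(row["PID"], 16)
        row["TID_DEC"] = int(row["TID"], 16)
        rows.append(row)
    return rows


def slot_for(rows, proc_name):
    """Return the PSLOT string for a process name, or None if absent."""
    for row in rows:
        if row["PROC_NAME"] == proc_name:
            return row["PSLOT"]
    return None


if __name__ == "__main__":
    rows = parse_status(STATUS_OUTPUT)
    for row in rows:
        print(row["CPU"], row["PROC_NAME"], row["PSLOT"], row["PID_DEC"])
```

Under the hexadecimal assumption, inetd's PID of 1740 corresponds to 5952 in decimal, which is the number to look for when matching this process against other logs.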
We next ensure that we are tracing CPU 1 (as noted above) and continue to look into inetd a bit further. To determine (and set, if necessary) which CPU we are tracing, we use the cpu subcommand.

    > cpu
    Selected cpu number : 1

Good, we are tracing CPU 1. Now we take a look at the proc and u_ structures of the inetd process. Starting with the proc structure, we must reference procs in the kernel using the process SLOT, as noted in the output of status above (PSLOT).

    > p -e 23
    SLT ST    PID   PPID   PGRP   UID  EUID  TCNT  NAME
     23 a    1740    c4e   1740     0     0     1  inetd
            FLAGS: swapped_in orphanpgrp execed

    Links:  *child:0x00000000  *siblings:0xe3002100  *uidl:0xe3002100
       *ganchor:0xe3002280  *pgrpl:0x00000000  *ttyl:0x00000000
    Dispatch Fields:  pevent:0x00000000  *synch:0xffffffff
       lock:0x00000000  lock_d:0x00000000
    Thread Fields:  *threadlist:0xe6001900  threadcount:1
       active:1  suspended:0  local:0  terminating:0
    Scheduler Fields:  nice: 20  repage:0x00000000  scount:0  sched_pri:0
       *sched_next:0x00000000  *sched_back:0x00000000  cpticks:0
       msgcnt:0  majfltsec:15
    Misc:  adspace:0x000071e7  kstackseg:0x00000000  xstat:0x0000
       *p_ipc:0x00000000  *p_dblist:0x00000000  *p_dbnext:0x00000000
    Signal Information:
       pending:hi 0x00000000,lo 0x00000000
       sigcatch:hi 0x00000000,lo 0x00082001  sigignore:hi 0x7fffffff,lo 0xfff29efe
    Statistics:  size:0x00000000(pages)  audit:0x00000038
       accounting page frames:37  page space blocks:0
       pctcpu:10  minflt:1509  majflt:15

The proc structure looks fairly lucid. A look into the u_ area is next, and this is accomplished with the u subcommand. Remember that everything in the kernel is referenced by process SLOT, not process ID. The output is both verbose and informative.
> u 23 UTHREAD AREA FOR SLOT 23 (fw) SAVED MACHINE STATE curid:0x00001428 m/q:0x00000000 iar:0x000357c8 cr:0x20442822 msr:0x00001030 lr:0x000357c8 xer:0x00000000 kjmpbuf:0x00000000 backtrack:0x00 tid:0x00000000 fpeu:0x01 excp_type:0x00000000 ctr:0x00000000 *prevmst:0x00000000 *stackfix:0x2ff3b1b0 intpri:0x00 o_iar:0x00000000 o_toc:0x00000000 o_arg1:0x00000000 excbranch:0x00000000 o_vaddr:0x00000000 msr flags: ME IR DR cr flags: | = | | > | > | = |< | = | = | Exception Struct 0x103c8cf8 0x4000d030 0x6000d12d 0x103c8cf8 0x00000107 MST Segment Regs 0:0x00000000 1:0x0000d00d 2:0x0000c16c 3:0x007fffff 4:0x007fffff 5:0x0000e00e 6:0x007fffff 7:0x0000b00b 8:0x00017017 9:0x00018018 10:0x00019019 11:0x007fffff 12:0x007fffff 13:0x60015015 14:0x00004004 15:0x007fffff alloc flags: 0xe5ef0000 (Seg Regs: 0, 1, 2, 5, 7, 8, 9, 10, 12, 13, 14, 15) General Purpose Regs 0:0x00000000 1:0x2ff3b1b0 2:0x00209df0 3:0x00000000 4:0x00000002 5:0x2ff3b400 6:0x00000000 7:0x00000000 8:0x00000000 9:0x00000000 10:0x2ff3b220 11:0x00000000 12:0x00001030 13:0x00000001 14:0x00000008 15:0x00000003 16:0x00000000 17:0x00000000 18:0x42228820 19:0x20000000 20:0x00000001 21:0x11000001 22:0x00000003 23:0x00000001 24:0x00000000 25:0x00000004 26:0x00000000 27:0xe600179c 28:0x2a222824 29:0xe3001e00 30:0xe6001700 31:0x00000010 Kernel stack address: 0x2ff3b400 SYSTEM CALL STATE error code:0x00 *kjmpbuf:0x00000000 PER-THREAD TIMER MANAGEMENT Real/Alarm Timer (ut_timer[TIMERID_ALRM]) = 0x0 Virtual Timer (ut_timer[TIMERID_VIRTUAL]) = 0x0 Prof Timer (ut_timer[TIMERID_PROF]) = 0x0 Posix Timer (ut_timer[POSIX0]) = 0x0 Posix Timer (ut_timer[POSIX1]) = 0x0 Posix Timer (ut_timer[POSIX2]) = 0x0 Posix Timer (ut_timer[POSIX3]) = 0x0 Posix Timer (ut_timer[POSIX4]) = 0x0 SIGNAL MANAGEMENT *sigsp:0x0 oldmask:hi 0x0,lo 0x80000 code:0x0 MISCELLANOUS FIELDS: fstid:0x00000000 ioctlrv:0x00000000 selchn:0x00000000 link:0x00000000 loginfo:0x00000000 fselchn:0x500042a0 selbuc:0x00000000 sigssz:0x00000000 User msr:0x0000d030 
*context:0x00000000 **errnopp:0x20232a1c *stkb:0x00000000 *audsvc:0x00000000 scsave[0]:0x2ff22c24 scsave[1]:0x2ff22a18 scsave[2]:0x2022a7c4 scsave[3]:0xf012fb88 scsave[4]:0x2ff229f8 scsave[5]:0x00000002 scsave[6]:0x20041270 scsave[7]:0x200419c8 USER AREA OF ASSOCIATED PROCESS fw (SLOT 20, PROCTAB 0xe3001e00) handy_lock:0x00000000 timer_lock:0x00000000 map:0x00000000 *semundo:0x00000000 *pinu_block:0x00000000 compatibility:0x00000000 lock:0x00000000 ulocks:0xffffffff *message:0x00000000 irss:0x0000000019b65bb8 lock_word:0xffffffff *vmm_lock_wait:0x00000000 vmmflags:0x00000000 SIGNAL MANAGEMENT Signals to be blocked (sig#:hi/lo mask,flags,&func) 1:hi 0x00000000,lo 0x00000000,0x00000008,0x20051fe0 2:hi 0x00000000,lo 0x00000000,0x00000008,0x20051fe0 3:hi 0x00000000,lo 0x00000000,0x00000008,0x20051fe0 4:hi 0x00000000,lo 0x00000000,0x00000008,0x20051fe0 5:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 6:hi 0x00000000,lo 0x00000000,0x00000008,0x20051fe0 7:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 8:hi 0x00000000,lo 0x00000000,0x00000008,0x20051fe0 9:hi 0x00000000,lo 0x00000000,0x00000048,0x00000000 10:hi 0x00000000,lo 0x00000000,0x00000008,0x20051fe0 11:hi 0x00000000,lo 0x00000000,0x00000008,0x20051fe0 12:hi 0x00000000,lo 0x00000000,0x00000008,0x20051fe0 13:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 14:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 15:hi 0x00000000,lo 0x00000000,0x00000008,0x20051fe0 16:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 17:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 18:hi 0x00000000,lo 0x00000000,0x00000000,0x00000001 19:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 20:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 21:hi 0x00000000,lo 0x00000000,0x00000000,0x00000001 22:hi 0x00000000,lo 0x00000000,0x00000000,0x00000001 23:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 24:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 25:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 26:hi 0x00000000,lo 
0x00000000,0x00000000,0x00000000 27:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 28:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 29:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 30:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 31:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 32:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 33:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 34:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 35:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 36:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 37:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 38:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 39:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 40:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 41:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 42:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 43:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 44:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 45:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 46:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 47:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 48:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 49:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 50:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 51:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 52:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 53:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 54:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 55:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 56:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 57:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 58:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 59:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 60:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 61:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 62:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 63:hi 0x00000000,lo 0x00000000,0x00000000,0x00000000 USER 
INFORMATION euid:0x0000 egid:0x0000 ruid:0x0000 rgid:0x0000 luid:0x00000000 suid:0x00000000 ngrps:0x0000 *groups:0x2ff20ab8 compat:0x00000000 ref:0x00000002 pag:0x00000000 cr_lock:0x00000000 acctid:0x00000000 sgid:0x00000000 epriv:0xffffffff ipriv:0xffffffff bpriv:0xffffffff mpriv:0xffffffff u_info: ACCOUNTING DATA start:0x3893bd8e ticks:0x00001d38 acflag:0x0001 pr_base:0x00000000_00000000 pr_size:0x00000000 pr_off:0x00000000_00000000 pr_scale:0x00000000 process times: user:0x000003d5s 0x0c845880us sys:0x00000500s 0x1a39de00us children's times: user:0x00000000s 0x00000000us sys:0x00000000s 0x00000000us CONTROLLING TTY *ttysid:0x00000000 *ttyp(pgrp):0x00000000 ttyd(evice):0x00000000 ttympx:0x00000000 *ttys(tate):0x00000000 tty id: 0x00000000 *query function: 0x00000000 RESOURCE LIMITS AND COUNTERS ior:0x00000000_00000000 iow:0x00000000_00000000 ioch:0x00000000_00000005 text:0x00000000_00472cc0 data:0x00000000_00343000 stk:0x01000000 max data:0x08000000 max stk:0x01000000 max file(blks):0xffffffff *tstart:0x00000000_100001c8 sdsize:0x00000000 *datastart:0x00000000_20000000 *stkstart0x00000000_2ff23000 soft core dump:0x7fffffff hard core dump:0x7fffffff soft rss:0x7fffffff hard rss:0x7fffffff cpu soft:0x7fffffff cpu hard:0x7fffffff hard ulimit:0x7fffffff minflt:0x00000000_00000000 majflt:0x00000000_00000000 AUDITING INFORMATION auditstatus:0x00000000 SEGMENT REGISTER INFORMATION ADSPACE SEGSTATE Reg Value Alloc # Segs Fno/Shmptr Flags 0 0x60000000 yes 0x0000 0x00000000 AVAILABLE 1 0x6000d12d yes 0x0001 0x00000000 TEXT 2 0x6000c16c yes 0x0000 0x00000000 AVAILABLE 3 0x007fffff 0x0000 0x00000000 AVAILABLE 4 0x007fffff 0x0000 0x00000000 AVAILABLE 5 0x007fffff 0x0000 0x00000000 AVAILABLE 6 0x007fffff 0x0000 0x00000000 AVAILABLE 7 0x007fffff 0x0000 0x00000000 AVAILABLE 8 0x007fffff 0x0000 0x00000000 AVAILABLE 9 0x007fffff 0x0000 0x00000000 AVAILABLE 10 0x007fffff 0x0000 0x00000000 AVAILABLE 11 0x007fffff 0x0000 0x00000000 AVAILABLE 12 0x007fffff 0x0000 0x00000000 AVAILABLE 
     13  0x60015015   yes   0x0001  0x00000000  TEXT
     14  0x007fffff         0x0000  0x00000000  AVAILABLE
     15  0x6000e16e   yes   0x0001  0x00000000  WORKING

    FILE SYSTEM STATE
        *curdir:0x13ef49a0  *rootdir:0x00000000
        cmask:0x0002  maxindex:0x0004
        fd_lock:0x00000000  fso_lock:0x00000000
        lockflag:0x00000000  fdevent:0xffffffff

    FILE DESCRIPTOR TABLE
        *ufd: 0x2ff3c1a0
        fd 0:  fp = 0x10000210   flags = 0x0080  count = 0x0000
        fd 1:  fp = 0x10000e70   flags = 0x0080  count = 0x0000
        fd 2:  fp = 0x100005d0   flags = 0x0080  count = 0x0000
        fd 3:  fp = 0x10001200   flags = 0x0080  count = 0x0001
        Rest of user area paged out.

The u_ area looks to be fairly mundane as well. We have two key bits of information at this point, however:

    1) The box crashed while in a select() call, and
    2) the select() call was made on behalf of inetd.

Further, we know that inetd is a network-savvy daemon, and therefore we need to look more closely at the sockets attached to inetd at the time of the crash. We can do this with the socket subcommand. Once again, remember that everything is referenced in the kernel by process slot, not process ID. The output is a bit voluminous, but enlightening.
> socket -p23 fd 0: 702d2e00: type:0x0002 (DGRAM) opts:0x0000 () state:0x0080 (PRIV) linger:0x0000 pcb:0x700dc480 proto:0x000dfb30 q0:0x00000000 q0len:0 q:0x00000000 qlen:0 qlimit:0 head:0x00000000 timeo:0 error:0 oobmark:0 pgid:0 proc/fd: 23/0 23/13 fd 4: 702fec00: type:0x0001 (STREAM) opts:0x0006 (ACCEPTCONN|REUSEADDR) state:0x0080 (PRIV) linger:0x0000 pcb:0x702fee44 proto:0x04f33f90 q0:0x00000000 q0len:0 q:0x00000000 qlen:0 qlimit:1000 head:0x00000000 timeo:0 error:0 oobmark:0 pgid:0 proc/fd: 23/4 fd 5: 702fe800: type:0x0001 (STREAM) opts:0x0006 (ACCEPTCONN|REUSEADDR) state:0x0080 (PRIV) linger:0x0000 pcb:0x702fea44 proto:0x04f33f90 q0:0x00000000 q0len:0 q:0x00000000 qlen:0 qlimit:1000 head:0x00000000 timeo:0 error:0 oobmark:0 pgid:0 proc/fd: 23/5 fd 6: 702fe400: type:0x0001 (STREAM) opts:0x0006 (ACCEPTCONN|REUSEADDR) state:0x0080 (PRIV) linger:0x0000 pcb:0x702fe644 proto:0x04f33f90 q0:0x00000000 q0len:0 q:0x00000000 qlen:0 qlimit:1000 head:0x00000000 timeo:0 error:0 oobmark:0 pgid:0 proc/fd: 23/6 fd 7: 702fe000: type:0x0001 (STREAM) opts:0x0006 (ACCEPTCONN|REUSEADDR) state:0x0080 (PRIV) linger:0x0000 pcb:0x702fe244 proto:0x04f33f90 q0:0x00000000 q0len:0 q:0x00000000 qlen:0 qlimit:1000 head:0x00000000 timeo:0 error:0 oobmark:0 pgid:0 proc/fd: 23/7 fd 8: 702d6c00: type:0x0001 (STREAM) opts:0x0006 (ACCEPTCONN|REUSEADDR) state:0x0080 (PRIV) linger:0x0000 pcb:0x702d6e44 proto:0x04f33f90 q0:0x00000000 q0len:0 q:0x00000000 qlen:0 qlimit:1000 head:0x00000000 timeo:0 error:0 oobmark:0 pgid:0 proc/fd: 23/8 fd 10: 7005d800: type:0x0002 (DGRAM) opts:0x0000 () state:0x0082 (ISCONNECTED|PRIV) linger:0x0000 pcb:0x7007e880 proto:0x000dfb30 q0:0x00000000 q0len:0 q:0x00000000 qlen:0 qlimit:0 head:0x00000000 timeo:0 error:0 oobmark:0 pgid:0 proc/fd: 23/10 fd 11: 702c6000: type:0x0001 (STREAM) opts:0x0006 (ACCEPTCONN|REUSEADDR) state:0x0080 (PRIV) linger:0x0000 pcb:0x702c6244 proto:0x04f33f90 q0:0x00000000 q0len:0 q:0x00000000 qlen:0 qlimit:1000 head:0x00000000 timeo:0 error:0 
oobmark:0 pgid:0 proc/fd: 23/11 fd 12: 70328c00: type:0x0001 (STREAM) opts:0x0006 (ACCEPTCONN|REUSEADDR) state:0x0080 (PRIV) linger:0x0000 pcb:0x70328e44 proto:0x04f33f90 q0:0x00000000 q0len:0 q:0x00000000 qlen:0 qlimit:1000 head:0x00000000 timeo:0 error:0 oobmark:0 pgid:0 proc/fd: 23/12 fd 13: 702d2e00: type:0x0002 (DGRAM) opts:0x0000 () state:0x0080 (PRIV) linger:0x0000 pcb:0x700dc480 proto:0x000dfb30 q0:0x00000000 q0len:0 q:0x00000000 qlen:0 qlimit:0 head:0x00000000 timeo:0 error:0 oobmark:0 pgid:0 proc/fd: 23/0 23/13 fd 14: 70328800: type:0x0001 (STREAM) opts:0x0006 (ACCEPTCONN|REUSEADDR) state:0x0080 (PRIV) linger:0x0000 pcb:0x70328a44 proto:0x04f33f90 q0:0x00000000 q0len:0 q:0x00000000 qlen:0 qlimit:1000 head:0x00000000 timeo:0 error:0 oobmark:0 pgid:0 proc/fd: 23/14 fd 15: 70328400: type:0x0001 (STREAM) opts:0x0006 (ACCEPTCONN|REUSEADDR) state:0x0080 (PRIV) linger:0x0000 pcb:0x70328644 proto:0x04f33f90 q0:0x00000000 q0len:0 q:0x7000a000 qlen:1 qlimit:1000 head:0x00000000 timeo:0 error:0 oobmark:0 pgid:0 proc/fd: 23/15 fd 16: 70328000: type:0x0001 (STREAM) opts:0x0006 (ACCEPTCONN|REUSEADDR) state:0x0080 (PRIV) linger:0x0000 pcb:0x70328244 proto:0x04f33f90 q0:0x00000000 q0len:0 q:0x00000000 qlen:0 qlimit:1000 head:0x00000000 timeo:0 error:0 oobmark:0 pgid:0 proc/fd: 23/16 fd 17: 702d2c00: type:0x0000 () opts:0x0000 () state:0x0000 () linger:0x0000 pcb:0x00000000 proto:0x00000000 q0:0x00000000 q0len:0 q:0x00000000 qlen:0 qlimit:0 head:0x00000000 timeo:0 error:0 oobmark:0 pgid:0 proc/fd: 23/17 Well, the last file descriptor (17) is certainly a bit odd. While all of the others look quite rational, file descriptor 17 suffers a curious paucity of information and data. I now take a closer look at the socket with the socket subcommand. 
    > socket -b 702d2c00
    702d2c00: type:0x0000 () opts:0x0000 ()
        state:0x0000 () linger:0x0000
        pcb:0x00000000 proto:0x00000000
        q0:0x00000000 q0len:0 q:0x00000000 qlen:0 qlimit:0 head:0x00000000
        timeo:0 error:0 oobmark:0 pgid:0
      rcv: cc:0 hiwat:0 mbcnt:0 mbmax:0 lowat:0 mb:0x00000000 events:0x0000
        iodone:0x00000000 ioargs:0x00000000 flags:0x0000 ()
        timeo:0 lastpkt:0x00000000
      snd: cc:0 hiwat:0 mbcnt:0 mbmax:0 lowat:0 mb:0x00000000 events:0x0000
        iodone:0x00000000 ioargs:0x00000000 flags:0x0000 ()
        timeo:0 lastpkt:0x00000000

Hrm, no data. This cannot be good. Every socket, once out of the q queue, contains data such as the state, the protocol control block (pcb), and the like. This socket is devoid of all this. Using the network debugger subcommand set (ndb) of crash, we take an even closer look into this troublesome socket.

    ndb> sockets 702d2c00
    ----------------- SOCKET INFO -------------------
        type:0x0000 (BOGUS) opts:0x0000 ()
        state:0x0000 () linger:0x0000
        pcb:0x00000000 proto:0x00000000
        q0:0x00000000 q0len: 0 q:0x00000000 qlen: 0 qlimit: 0 head:0x00000000
        timeo: 0 error: 0 oobmark: 0 pgid: 0
      snd: cc: 0 hiwat: 0 mbcnt: 0 mbmax: 0 lowat: 0 mb:0x00000000 events:0x 0
        iodone:0x00000000 ioargs:0x00000000 flags:0x0000 ()
      rcv: cc: 0 hiwat: 0 mbcnt: 0 mbmax: 0 lowat: 0 mb:0x00000000 events:0x 0
        iodone:0x00000000 ioargs:0x00000000 flags:0x0000 ()

Eureka! The type code ("BOGUS") says it all. The kernel created a bogus socket and handed it to inetd. Once in the select() call, the kernel ran into this problematic socket and crashed.

Further perusal of the various logs (system, Check Point, MRTG, etc.) showed that this box was enduring an extremely high load. Previous tests had demonstrated that a high network load would unearth certain IP stack instabilities and cause random crashes. Such was the case here. This box was eventually replaced by a more durable and robust firewall platform.
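Spotting the empty socket by eye worked here because it happened to be the last descriptor listed. With a larger descriptor table, a small script over saved socket output can do the first pass. What follows is a rough sketch under stated assumptions, not a crash feature: it assumes the output was captured to text in the "fd N: address: type:... pcb:... proto:..." form shown above, it treats a socket whose type, pcb, and proto are all zero as suspect, and the name find_empty_sockets is hypothetical. The line breaks in SAMPLE are my own; the values are taken from descriptors 16 and 17 above.

```python
import re

# Matches one socket entry: descriptor number, socket address, type,
# and the pcb/proto pointers that follow later in the same entry.
SOCKET_RE = re.compile(
    r"fd\s+(?P<fd>\d+):\s+(?P<addr>[0-9a-f]+):\s+type:(?P<type>0x[0-9a-f]+)"
    r".*?pcb:(?P<pcb>0x[0-9a-f]+)\s+proto:(?P<proto>0x[0-9a-f]+)",
    re.DOTALL,
)

SAMPLE = """\
fd 16: 70328000: type:0x0001 (STREAM) opts:0x0006 (ACCEPTCONN|REUSEADDR)
    state:0x0080 (PRIV) linger:0x0000
    pcb:0x70328244 proto:0x04f33f90 q0:0x00000000 q0len:0
fd 17: 702d2c00: type:0x0000 () opts:0x0000 ()
    state:0x0000 () linger:0x0000
    pcb:0x00000000 proto:0x00000000 q0:0x00000000 q0len:0
"""


def find_empty_sockets(dump_text):
    """Return (fd, address) pairs whose type, pcb, and proto are all zero."""
    suspects = []
    for match in SOCKET_RE.finditer(dump_text):
        fields = (match.group("type"), match.group("pcb"), match.group("proto"))
        if all(int(value, 16) == 0 for value in fields):
            suspects.append((int(match.group("fd")), match.group("addr")))
    return suspects


if __name__ == "__main__":
    for fd, addr in find_empty_sockets(SAMPLE):
        print("suspect socket: fd %d at %s" % (fd, addr))
```

Any address this reports is only a candidate; it still needs the closer look with socket -b and ndb described above before drawing conclusions.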
While an understanding of C, UNIX, and UNIX source code is a MUST for using the crash command, once mastered, crash can reveal the "smoking gun" for several types of problems. Further, crash can reveal key details about a running system as well. The crash command is a great addition to any UNIX mortician's toolkit.

Rob Thomas, robt@cymru.com
http://www.cymru.com