Event: Unplanned service outage
Date:  10-OCT-2007
Duration:  09-OCT-2007@17:00PDT(approx) to 10-OCT-2007@10:00PDT (17 Hrs)
Summary:  Core NFS file system (/home/public, /opt) offline.

Affected: Access to user "home" directories (/home/public)
          E-mail delivery to /home/public/<login>/Maildir/ ("INBOX")
          Access to personal web pages
          (eventually) access to /opt NFS mounts
          (eventually) users' ability to log in to NICS-managed hosts

Our core file server that contains /home/public developed file system problems some time around 5PM yesterday (Tuesday 09-OCT).  The problem apparently didn't become acute until much later in the evening; users' home directories became unresponsive at various points during the night.

This also caused e-mail to start backing up in the mail scanning/delivery pipeline, as inbound e-mail continued to be received and processed, but could not be delivered to user home directories.  Outbound mail continued to be delivered.

Eventually the NFS-mounted /home/public went offline and /opt became "stuck" as a result.

NICS staff discovered the mail service outage early this morning.  A reboot of the file server failed to bring the disk partition containing /home/public back online.  A file system check (fsck) process was started at about 6:30AM.  Due to the size of the partition (>0.5TB) this process didn't complete until about 10AM, at which point the file server was rebooted and the rest of the services (e-mail, printing, personal web pages, etc.) came back to life.

The initial e-mail backlog was approximately 16 hours.  As of 15:00PDT this afternoon, the backlog appears to have been reduced to less than 6 hours.  We estimate that, barring any inbound "floods", the system will be fully caught up by about 7PM this evening.
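
For readers who want to sanity-check the catch-up estimate, the figures above can be run through a simple linear extrapolation.  This is only a rough sketch, not necessarily how the estimate was produced: it assumes delivery resumed at about 10:00 with the full 16-hour backlog, that the backlog stood at exactly 6 hours at 15:00, and that it keeps draining at a constant rate.  The function name estimate_catch_up is made up for illustration.

```python
from datetime import datetime, timedelta

def estimate_catch_up(start, start_backlog_h, now, now_backlog_h):
    """Linearly extrapolate the time at which the backlog reaches zero."""
    elapsed_h = (now - start).total_seconds() / 3600
    # Backlog-hours cleared per wall-clock hour since delivery resumed.
    drain_rate = (start_backlog_h - now_backlog_h) / elapsed_h
    if drain_rate <= 0:
        raise ValueError("backlog is not shrinking; no estimate possible")
    return now + timedelta(hours=now_backlog_h / drain_rate)

# Assumption: delivery resumed at about 10:00 with a 16-hour backlog.
start = datetime(2007, 10, 10, 10, 0)
# From the report: backlog down to (less than) 6 hours as of 15:00.
now = datetime(2007, 10, 10, 15, 0)

eta = estimate_catch_up(start, 16, now, 6)
print(eta.strftime("%H:%M"))  # prints 18:00
```

Under these assumptions the backlog clears at roughly 6PM, so the 7PM figure leaves about an hour of margin for uneven inbound traffic.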

The cause has yet to be determined.

zydeco outage 20071010 (last edited 2007-10-10 22:28:48 by JohnRickard)