Event: Unplanned service outage
Date: 10-OCT-2007
Duration: 09-OCT-2007@17:00PDT(approx) to 10-OCT-2007@10:00PDT (17 Hrs)
Summary: Core NFS file system (/home/public, /opt) offline.
Affected: Access to user "home" directories (/home/public)
E-mail delivery to /home/public/<login>/Maildir/ ("INBOX")
Access to personal web pages
(eventually) access to /opt NFS mounts
(eventually) users' ability to log in to NICS-managed hosts
Our core file server that contains /home/public developed file system problems some time around 5PM yesterday (Tuesday 09-OCT). The problem apparently didn't become acute until much later in the evening, when users' home directories started becoming unresponsive at varying points during the night.
This also caused e-mail to back up in the mail scanning/delivery pipeline: inbound e-mail continued to be received and processed, but could not be delivered to user home directories. Outbound mail continued to be delivered.
Eventually the NFS-mounted /home/public went offline and /opt became "stuck" as a result.
NICS staff discovered the mail service outage early this morning. A reboot of the file server failed to bring the disk partition containing /home/public back online. A file system check (fsck) was started at about 6:30AM. Due to the size of the partition (>0.5TB), this check didn't complete until about 10AM, at which point the file server was rebooted and the rest of the services (e-mail, printing, personal web pages, etc.) came back to life.
The initial e-mail backlog was approximately 16 hours. As of 15:00PDT this afternoon, the backlog appears to have been reduced to less than 6 hours. We estimate that, barring any inbound "floods", the mail system will be fully caught up by about 7PM this evening.
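As a rough cross-check of that estimate, the sketch below extrapolates the backlog linearly from the two figures given above. It assumes the 16-hour backlog was measured when service resumed at about 10AM and that the queue keeps draining at a constant rate; neither is stated in this report, and the function name is made up for illustration.

```python
from datetime import datetime, timedelta

def estimate_catch_up(t0, backlog0_h, t1, backlog1_h):
    """Linearly extrapolate when the backlog reaches zero.

    Hypothetical helper: assumes a constant net drain rate between
    the two observations and afterwards.
    """
    elapsed_h = (t1 - t0).total_seconds() / 3600
    drain_rate = (backlog0_h - backlog1_h) / elapsed_h  # backlog hours cleared per hour
    if drain_rate <= 0:
        raise ValueError("backlog is not shrinking")
    return t1 + timedelta(hours=backlog1_h / drain_rate)

# Assumed observation times: 10:00 (service restored) and 15:00 (status check).
restored = datetime(2007, 10, 10, 10, 0)
checked = datetime(2007, 10, 10, 15, 0)

# 6 is the upper bound reported at 15:00 ("less than 6 hours").
eta = estimate_catch_up(restored, 16, checked, 6)
print(eta.strftime("%H:%M"))
```

Under these assumptions the backlog clears around 6PM, so the 7PM estimate leaves about an hour of margin for uneven inbound load.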
The cause has yet to be determined.