Iridis System Status
Note that this page is not updated regularly under normal operations, but we will endeavour to publish information in a timely fashion in the event of a problem with a major component of the cluster. Messages related to current operational conditions will also be publicised via the "message of the day" on the login nodes. This is displayed when you log in but can be viewed at any time with the command "cat /etc/motd".
Current Status
Wednesday, September 23rd, 14:50
Iridis is up and currently running a reduced-capacity service following the decommissioning of the majority of the single-core nodes in early August. Apologies for any delays caused by the reduced service. Note that we will have to decommission all of the compute nodes at 08:00 on Friday 25th September to enable the installation of Iridis 3.
Scheduled maintenance on Iridis
25th September 09
We will have to decommission all of the compute nodes at 08:00 on Friday 25th September to enable the installation of Iridis 3. See the Iridis news page for more details of the transition.
We apologise for the loss of capacity and other disruption during the transition to Iridis 3, but this is unavoidable, and we trust that the vastly improved capability of Iridis 3 will more than compensate.
Recent Problems
23/6/09
A power supply failure in the data centre affected the storage array that hosts the Iridis filestore and caused most of the compute nodes to be shut down. The filestore was restored and checked by mid-morning, but the compute nodes could not be powered up until after lunch, to allow further work on the power supply.
5/6/09
Instabilities in the external power supply to the data centre affected the storage array that hosts the Iridis filestore and required most of the compute nodes to be shut down. Full service was restored on the following Monday, June 8th.
26/9/08
There was a problem on the hardware that hosts the Iridis filestore. The Iridis filestore was checked and some minor repairs made. There was no evidence of corruption and the filesystems were put back online.
2/7/08
The fileserver for the /scratch1,2&3 filesystems failed around 17.30. The fileserver was restarted the next morning and the affected filesystems checked. There appeared to be no problems on the filesystems, but please let us know if you notice anything odd with any files on these filesystems.
9/6/08
A further problem with an air-conditioning unit required about 85 nodes to be shut down to reduce the heat load in the data centre.
8/6/08
A major failure of an air-conditioning unit occurred on Sunday evening, which required an emergency power-down of the majority of the single-core nodes to reduce the heat load in the critical area. These nodes were returned to service the following morning, but some jobs may have been lost or restarted.
16/5/08
A problem with an air-conditioning unit required about 100 nodes to be shut down to reduce the heat load in the data centre.
8/5/08
We encountered problems with the login nodes hanging. We have now re-instated a kernel upgrade on blue18 that was overlooked when the node had to be rebuilt in March. This looks to have dealt with the problem. The kernel on blue14 has also been upgraded.
18/3/08
The login node failed late afternoon, probably due to a disk failure. We have swapped the disk and rebuilt the O/S on the node. One important side effect of having to rebuild the node is that the ssh hostid has changed - see the section on Dealing with changes of SSH hostid for advice on how to deal with the consequences of this change.
24/9/07
The fileserver for the /scratch1-3 filesystems failed in the early hours of Sunday morning. We have rebooted the server and there appear to be no obvious problems. We have run a check on the affected filesystems as a precaution. There were no problems revealed by the check, but if you do notice anything unusual please let us know. Apologies to users whose jobs were affected by this problem.
