Troubleshooting a Sun X4-2 server's power supply units using its iLOM command-line interface

An alternative to the iLOM web interface

The iLOM interface

iLOM stands for Integrated Lights Out Manager. It is the system management firmware you use to monitor, manage, and configure a variety of Oracle server platforms. For a complete overview of this software, see Oracle's documentation website.

The iLOM firmware runs on a separate mini-computer called a service processor (SP), which is built into the server and has its own network interface card and IP address. The SP lets you administer the server remotely through the iLOM firmware.

Accessing your server's iLOM interface

There are two ways to access the iLOM firmware: through a web interface or through a command-line interface.

This article deals with the command-line interface. My other article covers how to perform the same tasks through the web interface.

Command-line Interface vs Web interface

Compared to the web interface, the command-line interface (CLI) is more powerful: it lets you drill down to the sensors of individual components and retrieve more detailed information.

As an example, and as demonstrated in this article, it is possible to query a power supply unit's (PSU) sensors to determine if power is actually flowing into the unit or not. In other words, you could potentially determine whether the reported loss of power is due to a broken PSU, or whether the external power source is at fault. This type of drilling down when troubleshooting is not possible with the web interface.

How to remotely check a faulty power supply unit using the SP

You access the service processor (or SP) through SSH. You will need a user account and the SP's IP address to log in.

In the example below, replace user1 and ip_address with your own user name and the SP's IP address. After logging in, you should see something like this:

user1@demo:/$ ssh user1@ip_address
Password: 

Oracle(R) Integrated Lights Out Manager

Version 4.0.4.22 r127068

Copyright (c) 2018, Oracle and/or its affiliates. All rights reserved.

Warning: HTTPS certificate is set to factory default.

Hostname: demo-server-sp

->

Identify faults

show faulty

As a first step, I always check for faults by executing the command show faulty, which lists all the components that are in a faulted state.

In the output shown below, a power supply unit is reported as faulty: the first entry refers to a component called PS0. The remaining entries provide further technical details about the faulty component, which you can ignore for now.

-> show faulty
Target                                 | Property                                     | Value                                                            
---------------------------------------+----------------------------------------------+------------------------------------------------------------------
/SP/faultmgmt/0                        | fru                                          | /SYS/PS0
/SP/faultmgmt/0/faults/0               | class                                        | fault.chassis.power.ext-fail
/SP/faultmgmt/0/faults/0               | sunw-msg-id                                  | SPX86-8003-73
/SP/faultmgmt/0/faults/0               | component                                    | /SYS/PS0
/SP/faultmgmt/0/faults/0               | uuid                                         | f3a1a759-a8c4-c9b2-c8b1-ed8634fc336d
/SP/faultmgmt/0/faults/0               | timestamp                                    | 2022-02-21/10:12:41
/SP/faultmgmt/0/faults/0               | system_component_firmware_releases           | (ILOM)2018.08.28
/SP/faultmgmt/0/faults/0               | system_component_firmware_versions           | (ILOM)4.0.4.22
/SP/faultmgmt/0/faults/0               | system_component_firmware_manufacturer       | Oracle Corporation
/SP/faultmgmt/0/faults/0               | detector                                     | /SYS/PS0/STATE
/SP/faultmgmt/0/faults/0               | fru_rev_level                                | 99
/SP/faultmgmt/0/faults/0               | fru_serial_number                            | 476856F+1401CE00HG
/SP/faultmgmt/0/faults/0               | fru_manufacturer                             | Astec International LTD
/SP/faultmgmt/0/faults/0               | fru_name                                     | A256_Power_Supply
/SP/faultmgmt/0/faults/0               | fru_part_number                              | 7060951
/SP/faultmgmt/0/faults/0               | system_component_manufacturer                | Oracle Corporation
/SP/faultmgmt/0/faults/0               | system_component_name                        | SUN SERVER X4-2
/SP/faultmgmt/0/faults/0               | system_component_part_number                 | 32547475+2+1
/SP/faultmgmt/0/faults/0               | system_component_serial_number               | 1415NML003
/SP/faultmgmt/0/faults/0               | chassis_manufacturer                         | Oracle Corporation
/SP/faultmgmt/0/faults/0               | chassis_name                                 | SUN SERVER X4-2
/SP/faultmgmt/0/faults/0               | chassis_part_number                          | 32547475+2+1
/SP/faultmgmt/0/faults/0               | chassis_serial_number                        | 1000NML003
/SP/faultmgmt/0/faults/0               | system_manufacturer                          | Oracle Corporation
/SP/faultmgmt/0/faults/0               | system_name                                  | SUN SERVER X4-2
/SP/faultmgmt/0/faults/0               | system_part_number                           | 32547475+2+1
/SP/faultmgmt/0/faults/0               | system_serial_number                         | 1000NML003

->
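If you save the show faulty output to a text file (for example from your terminal's session log), a short script can condense it. The following is a minimal sketch in Python, not an Oracle tool: the function name parse_show_faulty and the shortened sample text are my own illustrations, and the code assumes the three-column, pipe-separated layout shown above.

```python
from collections import defaultdict


def parse_show_faulty(text):
    """Group 'Target | Property | Value' rows by their target path."""
    faults = defaultdict(dict)
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        # Skip the prompt, blank lines, the header row and the dashed rule.
        if len(parts) != 3 or not parts[0].startswith("/SP/faultmgmt"):
            continue
        target, prop, value = parts
        faults[target][prop] = value
    return dict(faults)


# Shortened copy of the output above (hypothetical capture, fewer rows).
sample = """\
-> show faulty
Target                    | Property     | Value
--------------------------+--------------+------------------------------
/SP/faultmgmt/0           | fru          | /SYS/PS0
/SP/faultmgmt/0/faults/0  | class        | fault.chassis.power.ext-fail
/SP/faultmgmt/0/faults/0  | sunw-msg-id  | SPX86-8003-73
/SP/faultmgmt/0/faults/0  | timestamp    | 2022-02-21/10:12:41
"""

faults = parse_show_faulty(sample)
for target, props in faults.items():
    if "class" in props:
        # One summary line per fault instead of two dozen rows.
        print(target, props["class"], props["sunw-msg-id"])
```

Grouping by target keeps the rows of each fault together, which helps most when several components are reported at once.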

show System

If the show faulty command finds several faulty components, the output can be overwhelming. In the above example, only one component is faulty, so the output is manageable. When several faults are reported, a more user-friendly option is to execute the command show System.

The show System output includes an overview of the system's health and presents the health details in a more readable format than show faulty.

-> show System

 /System
    Targets:
        Open_Problems (1)    ---> notice a problem is being reported
        Processors
        Memory
        Power
        Cooling
        Storage
        Networking
        PCI_Devices
        Firmware
        BIOS
        Log

    Properties:
        health = Service Required
        health_details = PS0 (Power Supply 0) is faulty. Type 'show /System/Open_Problems' for details.
        open_problems_count = 1
        type = Rack Mount
        model = SUN SERVER X4-2
        qpart_id = Q10540
        part_number = 32547475+2+1
        serial_number = 1000NML003
        system_identifier = (none)
        system_fw_version = 4.0.4.22
        primary_operating_system = Not Available
        primary_operating_system_detail = Comprehensive System monitoring is not available. Ensure the host is running with the Hardware Management Pack. 
                                          For details go to http://www.oracle.com/goto/ilom-redirect/hmp-osa
        host_primary_mac_address = 00:10:e0:56:53:90
        ilom_address = *your_ip_address_will_be_shown_here*
        ilom_mac_address = 00:10:E0:56:53:94
        locator_indicator = Off
        power_state = On
        actual_power_consumption = 259 watts
        action = (Cannot show property)

    Commands:
        cd
        reset
        set
        show
        start
        stop

->
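The Properties section of this output is a plain list of name = value lines, so it is easy to read with a script once you have saved the output as text. Below is a minimal Python sketch under that assumption. The helper name parse_properties and the shortened sample are illustrative and not part of iLOM.

```python
def parse_properties(text):
    """Return the 'name = value' pairs from an iLOM 'show' output."""
    props = {}
    for line in text.splitlines():
        # Split on the first ' = ' only; lines without it are ignored.
        name, sep, value = line.partition(" = ")
        if sep:
            props[name.strip()] = value.strip()
    return props


# Shortened copy of the output above (hypothetical capture).
sample = """\
 /System
    Properties:
        health = Service Required
        health_details = PS0 (Power Supply 0) is faulty. Type 'show /System/Open_Problems' for details.
        open_problems_count = 1
        power_state = On
        actual_power_consumption = 259 watts
"""

props = parse_properties(sample)
if int(props["open_problems_count"]) > 0:
    print(props["health"] + ": " + props["health_details"])
```

Checking open_problems_count first gives a quick yes-or-no answer before you read any of the details.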

show /System/Open_Problems

The output above reports one open problem. Execute the command show /System/Open_Problems for details of all faulty components.

-> show /System/Open_Problems

Open Problems (1)
Date/Time                 Subsystems          Component
------------------------  ------------------  ------------
Mon Feb 21 10:12:41 2022  Power               PS0 (Power Supply 0)
        A loss of AC input to a power supply has occurred. (Probability:100, UUID:f3a1a759-a8c4-c9b2-c8b1-ed8634fc336d, Resource:/SYS/PS0, Part 
        Number:7060951, Serial Number:400006F+1001CE00HG, Reference Document:http://support.oracle.com/msg/SPX86-8003-73)

The above output reports that a loss of AC input has occurred. In other words, the unit is not delivering power to the server because no power is reaching it from outside. This suggests that the power supply unit itself is not faulty after all.
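The details in parentheses (UUID, resource, part number, and so on) are the ones you are likely to quote in a service request. As a rough sketch, assuming the comma-separated Key:Value layout shown in the output above, the following Python snippet pulls them into a dictionary. The function name parse_problem_details is my own.

```python
import re


def parse_problem_details(text):
    """Extract the 'Key:Value' pairs from an Open Problems entry."""
    # The entry wraps across lines, so collapse all whitespace first.
    flat = " ".join(text.split())
    match = re.search(r"\((Probability:.*)\)", flat)
    if not match:
        return {}
    details = {}
    for field in match.group(1).split(", "):
        # Split on the first colon only, so the URL stays intact.
        key, _, value = field.partition(":")
        details[key] = value
    return details


# Copy of the entry shown above.
sample = """\
Mon Feb 21 10:12:41 2022  Power               PS0 (Power Supply 0)
        A loss of AC input to a power supply has occurred. (Probability:100, UUID:f3a1a759-a8c4-c9b2-c8b1-ed8634fc336d, Resource:/SYS/PS0, Part 
        Number:7060951, Serial Number:400006F+1001CE00HG, Reference Document:http://support.oracle.com/msg/SPX86-8003-73)
"""

details = parse_problem_details(sample)
print(details["Resource"], details["Part Number"], details["Reference Document"])
```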

Contact onsite IT support team

At this point, to save time, you should probably ask the onsite IT support team to start checking the external power source.

Drill down deeper to gather evidence

Query the sensors.

It is usually a good idea to drill down further and gather more evidence that you can provide to your onsite IT support team. This evidence is also useful when you suspect that a component really is faulty and needs replacing, because you will need to provide it to Oracle Support when you raise a service request (SR).

Query the power supply unit

show /SYS/PS0

-> show /SYS/PS0

 /SYS/PS0
    Targets:
        PRSNT
        P_IN
        P_OUT
        STATE
        T_OUT
        V_12V
        V_12V_STBY
        V_IN

    Properties:
        type = Power Supply
        ipmi_name = PS0
        fru_description = A256_Power_Supply
        fru_manufacturer = Astec International LTD
        fru_part_number = 7060951
        fru_rev_level = 99
        fru_serial_number = 476856F+1401CE00HG
        fault_state = Faulted      ---> notice the faulted state
        clear_fault_action = (none)

    Commands:
        cd
        set
        show

->

Query the power supply unit STATE

show /SYS/PS0/STATE

In the output shown below, notice the value property. The first flag indicates that the power supply unit is detected as present. The second flag indicates that the unit's input is present but out of range.

-> show /SYS/PS0/STATE

 /SYS/PS0/STATE
    Targets:

    Properties:
        type = Power Supply
        ipmi_name = PS0/STATE
        class = Discrete Sensor
        value = [Presence detected][Input out-of-range, but present]
        alarm_status = major

    Commands:
        cd
        show

->

Query the incoming power

show /SYS/PS0/P_IN

Query the power supply unit's incoming power sensor, which is denoted by P_IN. In the output shown below, the value property is 0.000 Watts, which indicates that no power is flowing into the unit.

-> show /SYS/PS0/P_IN 

 /SYS/PS0/P_IN
    Targets:

    Properties:
        type = Power Unit
        ipmi_name = PS0/P_IN
        class = Threshold Sensor
        value = 0.000 Watts    ---> notice that there is no incoming power
        upper_nonrecov_threshold = N/A
        upper_critical_threshold = N/A
        upper_noncritical_threshold = N/A
        lower_noncritical_threshold = N/A
        lower_critical_threshold = N/A
        lower_nonrecov_threshold = N/A
        alarm_status = cleared

    Commands:
        cd
        show

->
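The reasoning in the last two steps can be written down as a small rule: an input-related flag on STATE together with a P_IN reading of zero points to the external power source. The Python sketch below encodes only that rule of thumb from this article. It is not an official Oracle diagnostic, and the function names and shortened samples are illustrative assumptions.

```python
import re


def parse_properties(text):
    """Return the 'name = value' pairs from an iLOM 'show' output."""
    props = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" = ")
        if sep:
            props[name.strip()] = value.strip()
    return props


def summarise_psu(state_text, p_in_text):
    """Combine the STATE flags and the P_IN reading into one hint."""
    state_value = parse_properties(state_text).get("value", "")
    # STATE reports its flags as '[flag one][flag two]'.
    flags = re.findall(r"\[([^\]]+)\]", state_value)

    p_in_value = parse_properties(p_in_text).get("value", "")
    try:
        watts = float(p_in_value.split()[0])
    except (IndexError, ValueError):
        watts = None  # reading missing or not numeric

    # Rule of thumb from this article, not an official diagnostic.
    input_flagged = any("input" in flag.lower() for flag in flags)
    if input_flagged and watts == 0.0:
        return "check the external power source first"
    return "inconclusive: gather more sensor readings"


# Shortened copies of the two outputs above (hypothetical captures).
STATE_SAMPLE = """\
 /SYS/PS0/STATE
    Properties:
        type = Power Supply
        value = [Presence detected][Input out-of-range, but present]
        alarm_status = major
"""

P_IN_SAMPLE = """\
 /SYS/PS0/P_IN
    Properties:
        class = Threshold Sensor
        value = 0.000 Watts
        alarm_status = cleared
"""

print(summarise_psu(STATE_SAMPLE, P_IN_SAMPLE))
```

Treat the result as a hint for where to look first, and send the raw sensor output along with it as evidence.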

Conclusion

In this case, no power was flowing into the affected power supply unit. Upon further investigation, the onsite IT support team confirmed that the relevant circuit breaker had tripped.

Oracle and iLOM are registered trademarks of Oracle Corporation