Big news! The community will be moving to a new platform April 21. Read more.
Absent Member.

Hardware Error: machine check events? Processor 2 below trip temperature

New install on a Dell PowerEdge 620 server.

Installed OES11 from the SLES SP2 combo DVD.

Does anything below suggest I have a hardware issue?

The install seems fine, but when I look in messages and warn, the following messages appear:

lin6-rs:/var/log # tail -1000 /var/log/messages | grep Error
Jul 27 15:31:17 lin6-rs kernel: [73233.862066] [Hardware Error]: Machine check events logged
Jul 27 17:59:55 lin6-rs kernel: [ 1198.134214] [Hardware Error]: Machine check events logged
Jul 27 19:47:25 lin6-rs kernel: [ 7637.917673] [Hardware Error]: Machine check events logged
Jul 27 19:49:55 lin6-rs kernel: [ 7787.680058] [Hardware Error]: Machine check events logged

lin6-rs:/var/log # tail -20 warn
Jul 29 10:30:26 lin6-rs kernel: [146797.664693] CPU2: Package power limit notification (total events = 145)
Jul 29 10:30:26 lin6-rs kernel: [146797.664697] CPU4: Package power limit notification (total events = 145)
Jul 29 10:30:26 lin6-rs kernel: [146797.664700] CPU6: Package power limit notification (total events = 145)
Jul 29 10:30:26 lin6-rs kernel: [146797.664707] CPU0: Package power limit notification (total events = 144)
Jul 29 10:31:10 lin6-rs mcelog: Processor 2 below trip temperature. Throttling disabled
Jul 29 10:31:10 lin6-rs mcelog: Processor 4 below trip temperature. Throttling disabled

Any concern here? The NOS installation is SLES11 SP2 with OES11 SP1 added on.

lin6-rs:/var/log # uname -a

Linux lin6-rs 3.0.13-0.27-default #1 SMP Wed Feb 15 13:33:49 UTC 2012 (d73692b) x86_64 x86_64 x86_64 GNU/Linux

Thanks,
Linda

6 Replies
Knowledge Partner

Hi.

On 29.07.2013 18:46, lcurrie wrote:
>
> New install Dell Poweredge 620 server.
>
> Install OES11 with SLES SP2 combo dvd.
>
> Anything below a concern that I have a hardware issue?
>
> Install seems fine, but when I look in messages and warn the following
> messages appears:
>
> in6-rs:/var/log # tail -1000 /var/log/messages | grep Error
> Jul 27 15:31:17 lin6-rs kernel: [73233.862066] [Hardware Error]:
> Machine check events logged


This indeed could point to a Hardware problem. Does the server exhibit
weird behaviour, crashes or slowness?


> lin6-rs:/var/log # tail -20 warn
> Jul 29 10:30:26 lin6-rs kernel: [146797.664693] CPU2: Package power
> limit notification (total events = 145)
> Jul 29 10:30:26 lin6-rs kernel: [146797.664697] CPU4: Package power
> limit notification (total events = 145)
> Jul 29 10:30:26 lin6-rs kernel: [146797.664700] CPU6: Package power
> limit notification (total events = 145)
> Jul 29 10:30:26 lin6-rs kernel: [146797.664707] CPU0: Package power
> limit notification (total events = 144)
> Jul 29 10:31:10 lin6-rs mcelog: Processor 2 below trip temperature.
> Throttling disabled
> Jul 29 10:31:10 lin6-rs mcelog: Processor 4 below trip temperature.
> Throttling disabled


This points to either a known, sort of cosmetic kernel bug:

https://bugzilla.kernel.org/show_bug.cgi?id=36182

Or alternatively, your machine *did* indeed overheat for some reason,
and this throttled the CPU.

I would call Dell.

CU,
--
Massimo Rosen
Novell Knowledge Partner
No emails please!
http://www.cfc-it.de

Cadet 1st Class

Hi,

It's most likely a cosmetic bug. Either the BIOS / hardware specifies odd thermal limit values, or the CPU is newer than the code, so the sensor data does not line up and the out-of-bounds condition is interpreted as a thermal throttling event. You can tell the difference here:

# cd /sys/devices/system/cpu/cpu2/thermal_throttle
# cat *
0
0
0
0

Zeros = good. Any other value is the number of times that condition was reached (use ls to see which file tracks which condition).
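
To run the same check across every CPU at once, something like the following might help. It is only a sketch: the function name throttle_counts is made up for this post, and which counter files exist under thermal_throttle depends on the kernel version, so the script prints whatever it finds there.

```shell
#!/bin/sh
# Print "<cpu> <counter file> <value>" for every thermal_throttle counter.
# The sysfs root defaults to /sys/devices/system/cpu; another directory can
# be passed as $1 (handy for trying this out on a copy).
# NOTE: throttle_counts is a hypothetical helper name, not a system tool.
throttle_counts() {
    root="${1:-/sys/devices/system/cpu}"
    for f in "$root"/cpu[0-9]*/thermal_throttle/*; do
        [ -f "$f" ] || continue      # glob matched nothing, or not a file
        cpu="${f#"$root"/}"          # strip the root: cpu2/thermal_throttle/...
        cpu="${cpu%%/*}"             # keep the first path component: cpu2
        printf '%s %s %s\n' "$cpu" "${f##*/}" "$(cat "$f")"
    done
}
```

After sourcing it, `throttle_counts | awk '$3 != 0'` lists only the counters that have fired; no output means every counter is still zero.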

And again, if you have instructed the BIOS to manage power, this is one of the ways it's doing it. When it comes to servers, I say "drill baby drill."

-- Bob

Absent Member.

Calling Dell. I would say this server seemed slow at times during install.

Linda


Absent Member.

I have genned up 3 servers now.
They all get the CPU throttling messages.

Only one has the hardware error. Of course it is the one installed into the tree.

Linda


Cadet 1st Class

And of course you could use the sensors command to look at the thermal values in real time.
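
If sensors is not installed, the raw hwmon files can be read directly. A rough sketch, assuming you have already located the directory that holds the CPU's temp*_input files (that location varies by kernel and driver, so it is passed in rather than guessed); print_temps is a made-up helper name. The hwmon sysfs interface reports these values in millidegrees Celsius.

```shell
#!/bin/sh
# Print each temp*_input file in one hwmon-style directory as degrees Celsius.
# $1 is the directory containing the temp*_input files (an assumption: where
# that directory lives depends on the kernel and the sensor driver).
# NOTE: print_temps is a hypothetical helper name, not a system tool.
print_temps() {
    for f in "$1"/temp*_input; do
        [ -f "$f" ] || continue      # nothing matched in this directory
        milli=$(cat "$f")            # value is in millidegrees Celsius
        printf '%s %d.%d C\n' "${f##*/}" \
            "$(( $milli / 1000 ))" "$(( $milli % 1000 / 100 ))"
    done
}
```

Running it in a loop, for example with `watch`, gives roughly the real-time view that sensors provides.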

The Machine Check errors are from the alleged thermal issues. If there were a "real" issue, you would have a crash.

-- Bob

Knowledge Partner

Bob.

On 29.07.2013 22:56, Bob-O-Rama wrote:
>
> Hi,
>
> Its a cosmetic bug most likely. The BIOS / hardware specifies odd
> thermal limit values, or the CPU is newer than the code and the sensor
> data does not line up and it interprets the out of bounds condition as a
> thermal throttling event. You can tell the difference here:


I'd agree with you if there were only the Package Power limit events in
the log.

The MCE plus the rather outspoken throttle events, however, say otherwise.
And no, an MCE does not always lead to a kernel panic or crash. For
instance, a recoverable ECC memory error triggers an MCE, but never
a crash. Throttling the CPU due to overheating also triggers an MCE, but
no crash or panic.

I'm relatively certain this machine overheats. Quite possibly a badly
seated (or entirely dropped-off) heatsink, or a dead fan.
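
To compare the three servers, it may help to tally how often each kind of message quoted in this thread shows up in a log. The sketch below only matches the exact message texts posted above; the function name tally_thermal_events is made up, and other kernel or mcelog versions may word these lines differently.

```shell
#!/bin/sh
# Count the three kinds of log lines quoted in this thread.
# Reads the files given as arguments, or stdin when none are given.
# NOTE: tally_thermal_events is a hypothetical helper name.
tally_thermal_events() {
    awk '
        /\[Hardware Error\]: Machine check events logged/ { mce++ }
        /Package power limit notification/                 { power++ }
        /below trip temperature/                           { cooled++ }
        END {
            # Unset counters print as 0 with %d.
            printf "machine_check=%d power_limit=%d below_trip=%d\n",
                   mce, power, cooled
        }
    ' "$@"
}
```

For example, `tally_thermal_events /var/log/messages` on each server gives one line per machine that can be compared side by side. A server showing machine_check counts alongside the throttle messages is the one to look at first.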

CU,
--
Massimo Rosen
Novell Knowledge Partner
No emails please!
http://www.cfc-it.de