[tech] Temperature Monitoring in Server Room [repost]
Melissa Star
melissa at netexperts.com.au
Mon Mar 18 15:13:50 AWST 2019
Hi Everyone,
I just realised - if you have smartmontools installed on Linux machines, each hard drive or SSD will report its “Airflow Temperature”, which I can extract via script.
I'm thinking of centralising this for all the servers I run, collecting the data to chart, and having a display at home that gives me live info for all machines under my control.
I could make a similar display for UCC, which could be on the website and/or a monitor in the club room (although this would likely be in the winter holidays due to increasing workload).
Also note the reallocated sector count for SSDs: once this starts to rise above zero, the drive should be replaced.
SSDs (and also HDDs) mounted at the front of servers have air drawn over the sensor directly from ambient air, and are thermally insulated from the rest of the machine, so this reading should be close to the temperature of the room.
For example, right now the UCC server room temperature is 29 degrees according to three of the four installed drives, and 30 degrees according to the fourth.
For PCs, the same test will provide the temperature in the case. Some drives also keep a count of total hours run outside their acceptable temperature range and of G-shocks or drops, as well as all sorts of other interesting data.
If there is interest, I could parse this data, and the page with Ashera-related information could display it and also e-mail (and/or SMS) warnings to anyone on the list if the temperature passes a key threshold.
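As a rough illustration of the warning step, here is a minimal Python sketch. It assumes the temperatures have already been collected into a dictionary keyed by machine name. The host names, the 35 degree threshold and both function names are placeholders for illustration, not anything that exists yet, and the actual e-mail/SMS delivery is left out.

```python
def hosts_over_threshold(readings, threshold_c=35):
    """Return (host, temperature) pairs above threshold_c, hottest first.

    readings maps a host name to its latest temperature in degrees Celsius.
    The default threshold of 35 is an arbitrary placeholder.
    """
    hot = [(host, temp) for host, temp in readings.items() if temp > threshold_c]
    return sorted(hot, key=lambda pair: pair[1], reverse=True)


def format_warning(hot, threshold_c=35):
    """Build the plain-text body of a warning message for the given hosts."""
    lines = [f"Temperature above {threshold_c} C on {len(hot)} machine(s):"]
    lines += [f"  {host}: {temp} C" for host, temp in hot]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical readings; the host names are made up.
    sample = {"server-a": 29, "server-b": 41, "server-c": 36}
    hot = hosts_over_threshold(sample)
    if hot:
        print(format_warning(hot))
```

The message text could then be handed to whatever mail or SMS gateway ends up being used.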
Here is what the data actually looks like (the airflow temperature is attribute 190, Airflow_Temperature_Cel):
smartctl -d sat -a /dev/pass1
smartctl 7.0 2018-12-30 r4883 [FreeBSD 12.0-STABLE amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: Samsung SSD 860 QVO 1TB
Serial Number: S4CZNG0M138175F
LU WWN Device Id: 5 002538 e701b1df5
Firmware Version: RVQ01B6Q
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Mon Mar 18 15:03:46 2019 AWST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
... (cut to prevent this email becoming ridiculous) ...
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       648
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       15
177 Wear_Leveling_Count     0x0013   100   100   000    Pre-fail  Always       -       0
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   071   058   000    Old_age   Always       -       29
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
235 Unknown_Attribute       0x0012   099   099   000    Old_age   Always       -       13
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       336661820
SMART Error Log Version: 1
No Errors Logged
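To give an idea of how the parsing could work, here is a minimal Python sketch that pulls the attribute table out of smartctl's text output and reads attribute 190. The function names and the regular expression are my own assumptions; it only relies on the ten-column row layout shown above, and other drives may label or number their temperature attribute differently.

```python
import re

# One row of the attribute table: ID, name, flag, value, worst, thresh,
# type, updated, when_failed, raw value (the raw value may contain spaces).
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(.+?)\s*$"
)


def parse_attributes(text):
    """Return {attribute_id: {...}} for every attribute row found in text."""
    attrs = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            attrs[int(match.group(1))] = {
                "name": match.group(2),
                "value": int(match.group(4)),
                "worst": int(match.group(5)),
                "thresh": int(match.group(6)),
                "raw": match.group(10),
            }
    return attrs


def airflow_temperature(attrs, attr_id=190):
    """Return the raw temperature in Celsius, or None if the row is missing."""
    row = attrs.get(attr_id)
    if row is None:
        return None
    # Some drives append extra text after the number, so keep the first token.
    return int(row["raw"].split()[0])


if __name__ == "__main__":
    sample = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
190 Airflow_Temperature_Cel 0x0032 071 058 000 Old_age Always - 29
"""
    attrs = parse_attributes(sample)
    print(airflow_temperature(attrs), attrs[5]["raw"])
```

In practice the text would come from running the smartctl command shown above on each machine, and the same dictionary also gives the reallocated sector count (attribute 5).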
Regards,
Melissa