[tech] Temperature Monitoring in Server Room [repost]

Melissa Star melissa at netexperts.com.au
Mon Mar 18 15:13:50 AWST 2019


Hi Everyone,

I just realised: if you have smartmontools installed on Linux machines, each hard drive or SSD that supports it will report its “Airflow Temperature”, which I can extract with a script.
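
As a rough illustration of the "extract with a script" idea, here is a minimal Python sketch. It assumes the drive reports attribute 190 (Airflow_Temperature_Cel) in the usual ten-column smartctl attribute table and that smartctl is on the PATH; the function names are made up for this example, and some drives need extra options such as "-d sat".

```python
import subprocess


def extract_airflow_temp(smartctl_output):
    """Return the raw value of SMART attribute 190 as an int, or None if absent."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 whitespace-separated columns;
        # the first token of RAW_VALUE is the 10th.
        if len(fields) >= 10 and fields[0] == "190":
            try:
                return int(fields[9])
            except ValueError:
                return None
    return None


def read_airflow_temp(device):
    """Run `smartctl -A` on a device (usually needs root) and extract the temperature."""
    # smartctl's exit status is a bit mask, so a non-zero code is not
    # necessarily a failure; hence check=False.
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True, text=True, check=False,
    )
    return extract_airflow_temp(result.stdout)
```

Calling read_airflow_temp("/dev/pass1") on a machine like the one quoted further down would be expected to give the raw value from the attribute 190 row, or None for drives that do not expose that attribute.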

I'm thinking of centralising this for all the servers I run: collecting the data to chart, and having a display at home that gives me live info for all machines under my control.
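
One possible shape for the central collection, sketched in Python: each machine's reading is appended as a timestamped CSV row, and the display picks the newest row per host. The column layout (timestamp, host, device, temp_c) and the function names are assumptions for illustration only, not an existing format.

```python
import csv
from datetime import datetime, timezone


def append_reading(stream, host, device, temp_c, when=None):
    """Append one reading as a CSV row: timestamp, host, device, temp_c."""
    when = when or datetime.now(timezone.utc)
    writer = csv.writer(stream)
    writer.writerow([when.isoformat(timespec="seconds"), host, device, temp_c])


def latest_by_host(stream):
    """Return {host: (timestamp, temp_c)} holding the newest row for each host."""
    latest = {}
    for timestamp, host, _device, temp_c in csv.reader(stream):
        # ISO 8601 timestamps in the same timezone sort correctly as strings.
        if host not in latest or timestamp > latest[host][0]:
            latest[host] = (timestamp, int(temp_c))
    return latest
```

A flat file like this is easy to chart later; a database or time-series store would do the same job if the number of machines grows.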

I could make a similar display for UCC, which could be on the website and/or a monitor in the club room (although this would likely be in the winter holidays due to increasing workload).

Also note the reallocated sector count for SSDs: once it starts to rise above zero, the drive should be replaced.

For SSDs (and also HDDs) mounted at the front of servers, the air reaching the sensor is drawn in directly from the room, and the drives are thermally insulated from the rest of the machine, so this reading will be effectively equal to the temperature of the room.

For example, right now, the UCC server room temperature is 29 degrees Celsius according to three of the four installed drives, and 30 degrees according to the fourth.

For PCs, the same test will provide the temperature inside the case. Some drives also keep a count of total hours run outside their acceptable temperature range, and of G-shocks or drops, as well as all sorts of other interesting data.

If there is interest, I could parse this data; the page with Ashera-related information could then display it and could also e-mail (and/or SMS) warnings to anyone on the list if the temperature passes a key threshold.
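
To make the warning idea concrete, a small Python sketch follows. It only builds the e-mail message and does not send it; the threshold value, host names, and addresses used with it are placeholders, and actual delivery (SMTP or an SMS gateway) is left out.

```python
from email.message import EmailMessage


def over_threshold(readings, threshold_c):
    """Return the {host: temp} entries whose temperature exceeds threshold_c."""
    return {host: temp for host, temp in readings.items() if temp > threshold_c}


def build_warning(hot, threshold_c, sender, recipients):
    """Build (but do not send) a plain-text warning e-mail for the hot hosts."""
    msg = EmailMessage()
    msg["Subject"] = f"Temperature warning: {len(hot)} host(s) above {threshold_c} C"
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    # One line per host, sorted so the message is stable between runs.
    lines = [f"{host}: {temp} C" for host, temp in sorted(hot.items())]
    msg.set_content("\n".join(lines))
    return msg
```

In practice the check would probably also want some hysteresis or rate limiting, so that a room hovering around the threshold does not generate a warning on every poll.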

Here is what the data actually looks like (the airflow temperature is attribute 190, Airflow_Temperature_Cel):

smartctl -d sat -a /dev/pass1
smartctl 7.0 2018-12-30 r4883 [FreeBSD 12.0-STABLE amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     Samsung SSD 860 QVO 1TB
Serial Number:    S4CZNG0M138175F
LU WWN Device Id: 5 002538 e701b1df5
Firmware Version: RVQ01B6Q
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Mar 18 15:03:46 2019 AWST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
... (cut to prevent this email becoming ridiculous) ...

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       648
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       15
177 Wear_Leveling_Count     0x0013   100   100   000    Pre-fail  Always       -       0
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   071   058   000    Old_age   Always       -       29
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
235 Unknown_Attribute       0x0012   099   099   000    Old_age   Always       -       13
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       336661820

SMART Error Log Version: 1
No Errors Logged
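
For anyone wanting to pull more than one value out of a listing like the one above, a hedged Python sketch of a table parser follows. It assumes the ten-column layout shown (ID#, name, flag, value, worst, thresh, type, updated, when_failed, raw value) and that the table ends at the first blank or non-attribute line; parse_attributes is a made-up name for this example.

```python
def parse_attributes(smartctl_output):
    """Parse the SMART attribute table into {id: {name, value, worst, thresh, raw}}."""
    attrs = {}
    in_table = False
    for line in smartctl_output.splitlines():
        if line.startswith("ID#"):
            # Header row: the attribute rows start on the next line.
            in_table = True
            continue
        if in_table:
            # Split into at most 10 fields so RAW_VALUE keeps any embedded spaces.
            fields = line.split(None, 9)
            if len(fields) < 10 or not fields[0].isdigit():
                break  # blank line or next section: end of table
            attrs[int(fields[0])] = {
                "name": fields[1],
                "value": int(fields[3]),
                "worst": int(fields[4]),
                "thresh": int(fields[5]),
                "raw": fields[9].strip(),
            }
    return attrs
```

With the listing above, attrs[190]["raw"] would be "29" and attrs[5]["raw"] would be "0", so the same parse covers both the temperature and the reallocated sector count.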






Regards,

Melissa