In the last two articles of the Prospectors, Miners and 49er's series, I introduced dual CPU/GPU mining with sgminer-arm-5.5.6-RC and briefly examined system thermal trends and GPU tuning. In this third article, we'll take a look at the broader operational issues of cryptocurrency mining and its system and maintenance ramifications, as well as the results of a dual CPU/GPU four-day, eight-hour mining stability test. In some ways these are the most important of the areas covered, and they can be the difference between a stable system and an unstable one, or even physical damage to your system.
Why is my system hanging, crashing, or otherwise unstable while cryptocurrency mining? Some people facing this issue have asked this question, and there is no single solution that answers it. CPU mining, let alone dual CPU/GPU mining, is a complex and extreme computing activity for a system-on-a-chip, regardless of the manufacturer of the SOC or SBC. The use cases that SOCs and SBCs were engineered and deployed for did not include the type of extreme computing they are increasingly subjected to these days. The following insight is offered from experience gathered while actively operating a mining cluster of thirty ODROIDs. The cluster is made up of twenty-five OEM active cooled XU4s, one custom active cooled XU4, and an MC1 quad system.
If we consider typical uses for general computing, almost without exception none approach the resource allocation and stress of modern cryptocurrency mining and other extreme cluster computing applications. Yet many extreme applications are regularly run unmanaged at or near maximum system physical capability and resources. Any disturbance in a multitude of areas can, and will, cause instability. These instabilities manifest themselves in system hangs, crashes, errors, and potentially damaged hardware. When using these types of applications, a systematic approach must be used to prove out the many criteria for deployment. They include CPU frequency, system temperature, cooling capability, power usage, ambient temperature, application and system resource usage, and preventive maintenance. The dynamic nature of any environment, even one thought to be controlled, must be monitored and appropriate adjustments made. Any variance in one factor potentially changes and affects others and the system as a whole.
This is a new frontier for ARM SBCs, so keep in mind you are on the sharp edge of extreme system utilization. To emphasize this point, we’ll use the analogy of getting in your car and driving as fast as it will go, with the tachometer redlined 24 hours a day, 7 days a week. It can be done, but how long will the car last and what other problems will it cause? How reliable will it be? Cars were not designed for that type of use, and neither was the hardware we're using to mine cryptocurrency. How do we deal with this? To start, it takes constant monitoring and adjustment, but there is another question you must ask: What is my operational philosophy? There are two schools of thought that most miners fall into. One group thinks that the capital cost of mining equipment is sunk and that the equipment will have no residual value at the end of its life cycle. They believe the best approach is to run the equipment as hard as they can with the sole purpose of maximizing profitability, then dispose of the hardware in a couple of years with zero residual value. The other group feels that there is, or should be, a residual value after a couple of years and, as such, runs its mining rigs much more conservatively. Which are you? The answer to this question will dictate how you operate and what is or is not acceptable. Someone else may have a different opinion and approach. Regardless of your strategy, these eight topics must be considered for reliable 24/7 operation:
- CPU frequency
- System temperature
- Cooling capability
- Ambient temperature
- Power usage
- Application and system resource usage
- Preventive maintenance
- Active management
SOCs and SBCs were not designed to mine at their maximum clock frequency. As a general guideline, when configuring a system, start at approximately 60% of the rated frequency. This gives a comfortable starting point to prove out your configuration. If there is any doubt about the suitability of a given frequency, err on the conservative side until you stabilize the mining rig. You can easily increase the frequency once the other areas are proven. Expect to be constantly adjusting the frequency as part of actively managing your miner setup; more on that later.
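As a rough illustration of the 60% guideline, the sketch below picks the highest available frequency step that does not exceed 60% of the rated frequency. The function name and the frequency table are hypothetical; a real board exposes its own list of steps, which may differ from the one assumed here.

```python
def starting_frequency(available_khz, rated_khz, percent=60):
    """Return the highest available step at or below percent% of rated_khz."""
    # Integer arithmetic avoids floating-point rounding at the boundary.
    target = rated_khz * percent // 100
    candidates = [f for f in available_khz if f <= target]
    # Fall back to the lowest step if nothing is at or below the target.
    return max(candidates) if candidates else min(available_khz)


# Hypothetical frequency table in kHz: 100 MHz steps from 200 MHz to 2.0 GHz.
steps = list(range(200_000, 2_000_001, 100_000))
start_khz = starting_frequency(steps, rated_khz=2_000_000)
print(start_khz)
```

With this assumed table and a 2.0GHz rating, the starting point works out to the 1.2GHz step. From there the frequency can be raised one step at a time as the other areas are proven.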
The simple reality is that 70°C-75°C (158°F-167°F) is the maximum SOC temperature range that an XU4/MC1/HC1/HC2 can sustain in a 24/7 mining operation. If you run hotter, you’ll likely experience intermittent problems. It may take a day or two, or more, but you will get system hangs, crashes, and errors, and the likelihood of permanent SOC damage increases the longer and hotter the system runs. Not all SBCs of the same model will perform identically either. There are a number of reasons for this, including not only everything in this article but also what some refer to as the “silicon lottery.” If you’re running a medium to large cluster, it is recommended that you divide your cluster into thermal groups. Something as simple as a four-tier system of hot, warm, cool, and cold will allow you to manage the cluster more effectively and set parameters that are appropriate to a given thermal group. This particularly applies to controlling system temperature by adjusting the clock speed for a given group.
The first order of cooling is to make sure that you have 100% coverage of thermal paste on the SOC and that there are no air voids. Air voids and uncovered areas act as insulation and will cause heat retention and abnormal thermal flows. Even though most manufacturers use acceptable thermal paste in the range of 2.5W/mK, consider upgrading to something better. There are many thermal pastes with 2-3 times the thermal conductivity. Look for one that performs in the 5W/mK-8W/mK range. This alone will help move more heat away from the SOC to the heatsink. Be wary of anything that isn't clearly labeled or uses a different metric.
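One way to form the four thermal groups is to rank the nodes by the temperature each reaches under the same workload and clock, then split the ranking into equal tiers. This is only a sketch: the node names and readings below are made up, and ranking into equal-sized tiers is an assumption rather than the method used on the cluster described here.

```python
def thermal_groups(temps_c, labels=("cold", "cool", "warm", "hot")):
    """Split nodes into equal-sized tiers by measured temperature rank."""
    ranked = sorted(temps_c, key=temps_c.get)  # coolest node first
    groups = {label: [] for label in labels}
    for index, node in enumerate(ranked):
        tier = index * len(labels) // len(ranked)
        groups[labels[tier]].append(node)
    return groups


# Hypothetical steady-state readings in degrees C at an identical clock speed.
readings = {
    "n0": 68.0, "n1": 74.5, "n2": 71.0, "n3": 66.5,
    "n4": 73.0, "n5": 69.5, "n6": 72.0, "n7": 70.0,
}
groups = thermal_groups(readings)
for label, nodes in groups.items():
    print(label, nodes)
```

Each group can then be given its own clock speed, with the hot group running the most conservative setting.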
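To get a feel for what the W/mK rating means, the sketch below applies the one-dimensional conduction formula, temperature drop = power × thickness / (conductivity × area), to a uniform paste layer. The power, layer thickness, and contact area are illustrative assumptions, not measurements from an XU4, and a real interface with voids or uneven pressure will behave worse than this idealized model.

```python
def paste_temperature_drop(power_w, thickness_m, conductivity_w_mk, area_m2):
    """Temperature drop in kelvin across a uniform paste layer (1-D conduction)."""
    return power_w * thickness_m / (conductivity_w_mk * area_m2)


# Illustrative assumptions: 10 W flowing through a 15 mm x 15 mm contact
# patch with a 0.1 mm thick paste layer.
POWER_W = 10.0
THICKNESS_M = 0.0001
AREA_M2 = 0.015 * 0.015

for k in (2.5, 5.0, 8.0):
    drop = paste_temperature_drop(POWER_W, THICKNESS_M, k, AREA_M2)
    print(f"{k} W/mK -> {drop:.2f} K across the paste")
```

Under these assumptions the drop across the paste falls from roughly 1.8 K at 2.5W/mK to roughly 0.6 K at 8W/mK. That is a small number on its own, but it is of the same order as the 1-2 degree margins that matter when running near the top of the safe range.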
In general, passive cooling should not be used for mining. Adding a fan to a passive cooled system can be fraught with problems. The quantity and quality of airflow depend on many factors, and unless the time is taken to prove out a given change, stick with an OEM active cooled system. If you're going to try something different, some factors to consider include fan proximity, angle, coverage, airflow quantity, and static pressure. Only quantitative testing will tell whether an improvement was actually realized. Having a bigger heatsink is not necessarily a better solution in itself. The development of the XU4 Split Airflow case covered at the ODROID Forum (https://forum.odroid.com/viewtopic.php?f=97&t=26373) and in the April 2017 and June 2017 issues of ODROID Magazine (https://magazine.odroid.com/wp-content/uploads/ODROID-Magazine-201704.pdf and https://magazine.odroid.com/wp-content/uploads/ODROID-Magazine-201706.pdf) can serve as an example. Many people like the large, tall North Bridge heatsink used in that project. But it's not perfect, and it has some nuances that need to be addressed before it performs significantly better. It's worth taking a few minutes to talk about them as a guide to doing custom miner cooling.
In the initial prototype design, the fan and case were not boxed, which lowered the static air pressure. As a result, it performed very similarly to the OEM stock active cooled heatsink. It wasn't until it was fully boxed and a fan with a higher air volume was used that much of a performance increase was seen. Only after a copper perch and spreader were added to improve the thermal pipeline did it show a significant improvement, as the testing revealed (https://forum.odroid.com/viewtopic.php?f=97&t=26373&start=104). Even then, the longer the test ran under heavy stress, the less effective the heatsink became, because it eventually becomes saturated. It's fine for general computing, but when mining 24/7, the improvement is going to be less meaningful.
If a fan is used on top, as is often the case, static air pressure drops significantly because the heatsink is not flat and the fins are thicker and closer together. Both work against better cooling by reducing the pressure available to force air down the heatsink and by deflecting more air out at the top of the heatsink. It is easy to assume that because it is bigger, it should perform much better, without realizing that this may not be true for mining. If you simply remove the fan from the OEM stock heatsink and mount it on the top, the fan itself is not boxed, which further reduces the static pressure and allows even less air to actually penetrate the heatsink. Most of it will go sideways. The lesson here is that if you’re customizing a cooling system, pay attention to all of the details and do long term testing. A bigger heatsink or fan may not always be significantly better for mining, depending on how it is deployed and whether further improvements are applied.
Ambient Room Temperature
One of the most overlooked areas for mining systems is the ambient temperature, especially for unmanaged systems. The temperature change in uncontrolled and unmonitored environments can be very significant. The average house's ambient temperature can vary greatly during a 24-hour period. This matters a lot when pushing the boundaries of a mining operation. Experienced miners know this and are constantly checking their rigs for this reason. Let the sun shine on all or part of a mining system and the effect is even greater. Even a location within a room or building can be significant. When you’re running in the 70°C-75°C (158°F-167°F) range, as you should be in most cases, it only takes a change of 1-2 degrees to affect your miners and potentially push them out of a safe range of operation.
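A first-order way to reason about this is to assume that, at a fixed workload and fan speed, the SOC sits a roughly constant number of degrees above ambient, so a change in room temperature shows up almost one-for-one at the SOC. That constant-rise assumption is a simplification, and the function names and example readings below are hypothetical.

```python
def projected_soc_temp(soc_c, ambient_now_c, ambient_new_c):
    """Project SOC temperature assuming a constant rise over ambient."""
    return soc_c + (ambient_new_c - ambient_now_c)


def headroom(soc_c, limit_c=75.0):
    """Degrees C remaining below the upper limit; negative means over it."""
    return limit_c - soc_c


# Hypothetical example: 73 C at the SOC in a 21.5 C room in the morning,
# with the room warming to 25 C in the afternoon.
afternoon_c = projected_soc_temp(73.0, ambient_now_c=21.5, ambient_new_c=25.0)
print(afternoon_c, headroom(afternoon_c))
```

In this example a 3.5 degree rise in room temperature is enough to move a node from inside the 70°C-75°C range to above it without any change to the miner itself.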
Active Management and Maintenance
Many times new mining operators set up their rigs, start them using all the system resources they can, and think they are done. This is a sure way to have serious instability in a mining operation, whether you're running one system or a large cluster. All of the factors we are talking about must be constantly monitored and adjustments made in order to have a reliable operation. At a minimum, system resources, CPU/GPU temperature, and ambient room temperature must be monitored constantly and the CPU/GPU frequency or workload changed accordingly.
Preventive maintenance is another important area that is often neglected. At a minimum, it must happen on a regular schedule. Even then, because fans become dirty and their lubricant is used up, constant vigilance for decreased RPM, noise, dust, and dirt must be maintained. Heatsinks and fans must be kept clean. Fans must spin at their full RPM. Continuous operation and static electricity significantly increase the collection of dust, dirt, and pollen. It only took a few months for this system to exhibit reduced performance and instabilities in a room with an open window.
Fans must be maintained and should probably be re-lubricated every few months as well. The best time to do this is when the fans and heatsinks are being cleaned. For the stock OEM active heatsink, the four screws can be removed and the plastic fan assembly can be separated from the heatsink without disturbing the heatsink and thermal paste. Use a dry toothbrush to thoroughly clean the heatsink fins and both sides of the fan blades. An appropriate lubricant can be applied to the fan hub. For a noisy fan, this can be accomplished in between maintenance cycles by holding the SBC upside down with the fan spinning while using a spray extension to apply a lubricant, temporarily stopping the fan with the extension and allowing the lubricant to drip down into the hub assembly. Though not appropriate in all cases, WD-40 will work in a pinch and is non-conductive. Have some extra fans handy, as you should expect to have to replace them.
In general computing, maintenance is something that people can let slide a bit without catastrophic impact. When dual CPU/GPU mining, with the increased demands of some crypto algorithms and pool mining and with systems running at their full potential, you let maintenance slide at your own peril. System reliability can be seriously impacted when multiple factors are introduced through poor management and allowed to accumulate through insufficient maintenance. Keep in mind we are not talking about average computer usage: We're talking about pushing systems to their full potential for what amounts to an indefinite timeline. Remember our car analogy: pedal to the metal with a redlined tachometer. Coming full circle, back to our original question: Why is my system hanging, crashing, or not stable while cryptocurrency mining? Here is a guide for places to look.
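The monitor-and-adjust loop can be reduced to one decision per reading: step the clock down when the SOC is above the upper limit, step it back up when there is room, otherwise leave it alone. The sketch below is a minimal version of that idea. The 75°C upper limit comes from the range discussed earlier; using 70°C as the point for stepping back up, the function names, and the frequency steps are assumptions. On many Linux systems the temperature is exposed in millidegrees under /sys/class/thermal, but the zone that corresponds to the SOC varies by board and kernel.

```python
from pathlib import Path


def read_temp_c(zone_path="/sys/class/thermal/thermal_zone0/temp"):
    """Read a thermal zone in degrees C (sysfs reports millidegrees).

    The zone path is an assumption; check which zone maps to the SOC.
    """
    return int(Path(zone_path).read_text().strip()) / 1000.0


def next_frequency(temp_c, current_khz, steps_khz, low_c=70.0, high_c=75.0):
    """Return the frequency step to use after one temperature reading."""
    ordered = sorted(steps_khz)
    index = ordered.index(current_khz)
    if temp_c > high_c and index > 0:
        return ordered[index - 1]  # too hot: drop one step
    if temp_c < low_c and index < len(ordered) - 1:
        return ordered[index + 1]  # room to spare: raise one step
    return current_khz  # inside the band, or already at an end stop


# Hypothetical steps in kHz and a hypothetical reading of 76 C.
steps = [1_400_000, 1_500_000, 1_600_000, 1_700_000]
print(next_frequency(76.0, 1_700_000, steps))
```

Moving one step at a time and waiting for the temperature to settle before the next reading keeps the loop from oscillating. The band between the two thresholds is what prevents constant up-and-down changes.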
Long Term Stability Test
After a four-day, eight-hour stability test of dual CPU/GPU mining Monero using the cryptonight algorithm on an ODROID-MC1 cluster, everything ran as expected with no errors reported in any syslog. Sgminer-arm-5.5.6-RC1, XMRig, and cpuminer-multi were used and ran normally. Approximate hashrates, as reported by the applications, were 19h/s for the GPU and 19h/s for the CPU on each machine. All machines ran at 1.7GHz, and the ambient temperature was 71°F (21.7°C).
Linux c5n0 4.14.5-92 #1 SMP PREEMPT Mon Dec 11 15:48:15 UTC 2017 armv7l armv7l armv7l GNU/Linux
c5n0 - GPU sgminer-5.5.6-ARM-RC1, CPU XMRig version 2.44
c5n1 - GPU sgminer-5.5.6-ARM-RC1, CPU XMRig version 2.51
c5n2 - GPU sgminer-5.5.6-ARM-RC1, CPU cpuminer-multi version 1.3.1
c5n3 - GPU sgminer-5.5.6-ARM-RC1, CPU cpuminer-multi version 1.3.1

sgminer-5.5.6-ARM-RC1 GPU Configuration
-I 6 -w 32 -d 0,1 --thread-concurrency 8192 --monero --pool-no-keepalive

XMRig version 2.44 & 2.51 CPU Configuration
-t 7 --cpu-affinity 0xFE

cpuminer-multi CPU Configuration
-t 7 --randomize --no-redirect --cpu-affinity 0xFE
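The CPU affinity value in these configurations is a bitmask in which bit n stands for core n. The short sketch below, with a hypothetical helper name, decodes it to show which cores the seven CPU mining threads are allowed to run on.

```python
def cores_in_mask(mask):
    """Return the core numbers whose bits are set in an affinity mask."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


cores = cores_in_mask(0xFE)
print(cores)  # [1, 2, 3, 4, 5, 6, 7]
```

The mask 0xFE selects cores 1 through 7, which matches the seven threads requested with -t 7 and leaves core 0 outside the CPU miner's set. The configuration does not state why core 0 is left out; keeping one core available for the system and the GPU miner is a plausible reading.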
sgminer-arm-5.5.6-RC1 Results Summary
[13:53:00] Shutdown signal received.
[13:53:00] Summary of runtime statistics:
[13:53:00] Started at [2018-03-25 05:38:19]
[13:53:00] Pool: stratum+tcp://pool.supportxmr.com:3333
[13:53:00] Runtime: 104 hrs : 14 mins : 40 secs
[13:53:00] Average hashrate: 0.0 Kilohash/s
[13:53:00] Solved blocks: 0
[13:53:00] Best share difficulty: 16.2M
[13:53:00] Share submissions: 1012
[13:53:00] Accepted shares: 995
[13:53:00] Rejected shares: 17
[13:53:00] Accepted difficulty shares: 5006256
[13:53:00] Rejected difficulty shares: 85000
[13:53:00] Reject ratio: 1.7%
[13:53:00] Hardware errors: 352
[13:53:00] Utility (accepted shares / min): 0.16/min
[13:53:00] Work Utility (diff1 shares solved / min): 0.16/min
[13:53:00] Stale submissions discarded due to new blocks: 0
[13:53:00] Unable to get work from server occasions: 272
[13:53:00] Work items generated locally: 407984
[13:53:00] Submitting work remotely delay occasions: 0
[13:53:00] New blocks detected on network: 3096
[13:53:00] Summary of per device statistics:
[13:53:00] GPU0 | (5s):9.359 (avg):9.341h/s | A:2522369 R:25000 HW:170 WU:0.081/m
[13:53:00] GPU1 | (5s):9.361 (avg):9.329h/s | A:2483886 R:60000 HW:182 WU:0.081/m

[13:52:55] Shutdown signal received.
[13:52:55] Summary of runtime statistics:
[13:52:55] Started at [2018-03-25 05:38:28]
[13:52:55] Pool: stratum+tcp://pool.supportxmr.com:3333
[13:52:55] Runtime: 104 hrs : 14 mins : 26 secs
[13:52:55] Average hashrate: 0.0 Kilohash/s
[13:52:55] Solved blocks: 1
[13:52:55] Best share difficulty: 1.23M
[13:52:55] Share submissions: 1027
[13:52:55] Accepted shares: 1008
[13:52:55] Rejected shares: 19
[13:52:55] Accepted difficulty shares: 5053564
[13:52:55] Rejected difficulty shares: 95000
[13:52:55] Reject ratio: 1.9%
[13:52:55] Hardware errors: 353
[13:52:55] Utility (accepted shares / min): 0.16/min
[13:52:55] Work Utility (diff1 shares solved / min): 0.16/min
[13:52:55] Stale submissions discarded due to new blocks: 0
[13:52:55] Unable to get work from server occasions: 223
[13:52:55] Work items generated locally: 407460
[13:52:55] Submitting work remotely delay occasions: 0
[13:52:55] New blocks detected on network: 3096
[13:52:55] Summary of per device statistics:
[13:52:55] GPU0 | (5s):9.331 (avg):9.351h/s | A:2405910 R:50000 HW:176 WU:0.078/m
[13:52:55] GPU1 | (5s):9.324 (avg):9.340h/s | A:2647653 R:45000 HW:177 WU:0.086/m

[13:52:48] Shutdown signal received.
[13:52:48] Summary of runtime statistics:
[13:52:48] Started at [2018-03-25 05:38:38]
[13:52:48] Pool: stratum+tcp://pool.supportxmr.com:3333
[13:52:48] Runtime: 104 hrs : 14 mins : 9 secs
[13:52:48] Average hashrate: 0.0 Kilohash/s
[13:52:48] Solved blocks: 1
[13:52:48] Best share difficulty: 50.1M
[13:52:48] Share submissions: 1034
[13:52:48] Accepted shares: 1009
[13:52:48] Rejected shares: 25
[13:52:48] Accepted difficulty shares: 5081646
[13:52:48] Rejected difficulty shares: 125000
[13:52:48] Reject ratio: 2.4%
[13:52:48] Hardware errors: 334
[13:52:48] Utility (accepted shares / min): 0.16/min
[13:52:48] Work Utility (diff1 shares solved / min): 0.17/min
[13:52:48] Stale submissions discarded due to new blocks: 0
[13:52:48] Unable to get work from server occasions: 257
[13:52:48] Work items generated locally: 414051
[13:52:48] Submitting work remotely delay occasions: 0
[13:52:48] New blocks detected on network: 3099
[13:52:48] Summary of per device statistics:
[13:52:48] GPU0 | (5s):9.226 (avg):9.186h/s | A:2607526 R:45000 HW:172 WU:0.084/m
[13:52:48] GPU1 | (5s):9.225 (avg):9.188h/s | A:2474119 R:80000 HW:162 WU:0.081/m

[13:52:38] Shutdown signal received.
[13:52:38] Summary of runtime statistics:
[13:52:38] Started at [2018-03-25 05:38:47]
[13:52:38] Pool: stratum+tcp://pool.supportxmr.com:3333
[13:52:38] Runtime: 104 hrs : 13 mins : 51 secs
[13:52:38] Average hashrate: 0.0 Kilohash/s
[13:52:38] Solved blocks: 3
[13:52:38] Best share difficulty: 4.01M
[13:52:38] Share submissions: 1059
[13:52:38] Accepted shares: 1028
[13:52:38] Rejected shares: 31
[13:52:38] Accepted difficulty shares: 5165010
[13:52:38] Rejected difficulty shares: 155000
[13:52:38] Reject ratio: 2.9%
[13:52:38] Hardware errors: 350
[13:52:38] Utility (accepted shares / min): 0.16/min
[13:52:38] Work Utility (diff1 shares solved / min): 0.17/min
[13:52:38] Stale submissions discarded due to new blocks: 1
[13:52:38] Unable to get work from server occasions: 251
[13:52:38] Work items generated locally: 405818
[13:52:38] Submitting work remotely delay occasions: 1
[13:52:38] New blocks detected on network: 3096
[13:52:38] Summary of per device statistics:
[13:52:38] GPU0 | (5s):9.319 (avg):9.247h/s | A:2365471 R:75000 HW:175 WU:0.078/m
[13:52:38] GPU1 | (5s):9.336 (avg):9.265h/s | A:2799539 R:80000 HW:175 WU:0.092/m
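
The headline figures in these summaries can be cross-checked from the raw counts. The sketch below parses a few fields from the first summary and recomputes the runtime in days and hours, the reject percentage, and the accepted shares per minute. The function name is hypothetical, and computing the reject percentage as rejected shares divided by share submissions is an assumption that happens to reproduce the reported values; sgminer's own formula may differ.

```python
import re


def summarize(log_text):
    """Recompute headline figures from an sgminer runtime summary."""
    hrs, mins, secs = map(int, re.search(
        r"Runtime: (\d+) hrs : (\d+) mins : (\d+) secs", log_text).groups())
    submitted = int(re.search(r"Share submissions: (\d+)", log_text).group(1))
    accepted = int(re.search(r"Accepted shares: (\d+)", log_text).group(1))
    rejected = int(re.search(r"Rejected shares: (\d+)", log_text).group(1))
    minutes = hrs * 60 + mins + secs / 60
    return {
        "days_hours": divmod(hrs, 24),
        "reject_pct": round(100 * rejected / submitted, 1),
        "accepted_per_min": round(accepted / minutes, 2),
    }


# Fields copied from the first summary above.
sample = """
[13:53:00] Runtime: 104 hrs : 14 mins : 40 secs
[13:53:00] Share submissions: 1012
[13:53:00] Accepted shares: 995
[13:53:00] Rejected shares: 17
"""
result = summarize(sample)
print(result)
```

For the first node this gives 4 days and 8 hours, a 1.7% reject rate, and 0.16 accepted shares per minute, in line with the Reject ratio and Utility lines in its log.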