ODROID-HC1 Swarm Cluster

The Docker team has developed a clustering and scheduling tool for Docker containers, called swarm. This article describes how one can create an ODROID-HC1 based swarm cluster. This cluster fits in a 19″ rack.

Figure 1 - The HC1 set of devices for this article

For several months now, I have been running my applications on a home made cluster of 4 ODROID-HC1 units, running Docker containers orchestrated by swarm. The cluster is powered by a homemade power supply. I chose ODROID-HC1 and not the ODROID-MC1 because the former model offers SSD support. An SSD based system is a lot faster than one using microSD cards.

3D printed 19” rack

Figure 4 - I soldered dupont cables to power the fan directly from the 5V input of each ODROID-HC1, as shown here.

Figure 5 - Here is how it appears with all the fans installed:

Figure
6 - The final rack setup looks like so:

Software install

I opted to install Archlinux, which can be obtained from https://archlinuxarm.org/platforms/armv7/samsung/odroid-xu4. For the infrastructure management part, I opted to use saltstack. I used saltstack to “templatize” all my servers. I installed the saltstack master and minion (the NAS will be the master for all others servers), using steps I already documented at https://www.bluemind.org/deploying-saltstack-master-minion-archlinux-arm/

From the salt master, a simple check shows that all nodes are under control:

$ salt -E "node[1-4].local.lan" cmd.run 'cat /etc/hostname'
  node4.local.lan:
      node4
  node3.local.lan:
      node3
  node2.local.lan:
      node2
  node1.local.lan:
      node1

Rootfs on the SSD

First, partition the ssd :

$ salt -E "node[1-4].local.lan" cmd.run 'echo \
-e "o\nn\np\n1\n\n\n\nw\nq\n" | fdisk /dev/sda'

Format the future root partition:

$ salt -E "node[1-4].local.lan" cmd.run 'mkfs.ext4 -L ROOT /dev/sda1'

Mount ssd root partition:

$ salt -E "node[1-4].local.lan" cmd.run 'mount /dev/sda1 /mnt/'

Clone sdcard to ssd root partition:

$ salt -E "node[1-4].local.lan" cmd.run 'cd /;tar \
-c --one-file-system -f - . | (cd /mnt/; tar -xvf -)'

Change boot parameters so root is /dev/sda1:

$ salt -E "node[1-4].local.lan" cmd.run 'sed -i -e \
"s/root\=PARTUUID=\${uuid}/root=\/dev\/sda1/" /boot/boot.txt'

Recompile boot config:

$ salt -E "node[1-4].local.lan" cmd.run 'pacman -S --noconfirm \
uboot-tools'
$ salt -E "node[1-4].local.lan" cmd.run 'cd /boot; ./mkscr'

Reboot:

$ salt -E "node[1-4].local.lan" cmd.run 'reboot'

Remove all from sdcard, and put /boot files at root of it:

$ salt -E "node[1-4].local.lan" cmd.run 'mount /dev/mmcblk0p1 /mnt'
$ salt -E "node[1-4].local.lan" cmd.run 'cp -R /mnt/boot/* /boot/'
$ salt -E "node[1-4].local.lan" cmd.run 'rm -Rf /mnt/*'
$ salt -E "node[1-4].local.lan" cmd.run 'mv /boot/* /mnt/'

Adapt boot.txt because the boot files are in root of boot partition and are no longer in the /boot directory:

$ salt -E "node[1-4].local.lan" cmd.run 'sed -i -e "s/\/boot\//\//" \
/mnt/boot.txt'
$ salt -E "node[1-4].local.lan" cmd.run 'pacman -S --noconfirm \
uboot-tools'
$ salt -E "node[1-4].local.lan" cmd.run 'cd /mnt/; ./mkscr'
$ salt -E "node[1-4].local.lan" cmd.run 'cd /; umount /mnt'
$ salt -E "node[1-4].local.lan" cmd.run 'reboot'

Check that /dev/sda is root:

$ salt -E "node[1-4].local.lan" cmd.run 'df -h | grep sda'
Node4.local.lan: /dev/sda1 118G 1.2G 117G 0% /
Node3.local.lan: /dev/sda1 118G 1.2G 117G 0% /
Node1.local.lan: /dev/sda1 118G 1.2G 117G 0% /
Node2.local.lan: /dev/sda1 118G 1.2G 117G 0% /

SSD Benchmark

Benchmarking is complex and I am not going to say that I did it perfectly right, but at least it gives an idea of how fast an SSD can be. The SSD I used on each of the ODROID-HC1 is a 128GB Sandisk X400. I ran the following test 3 times:

hdparm -tT /dev/sda => 362.6 Mb/s

dd write 4k => 122 Mb/s

$ sync; dd if=/dev/zero of=/benchfile bs=4k count=1048476; sync

dd write 1m => 119 Mb/s

$ sync; dd if=/dev/zero of=/benchfile bs=1M count=4096; sync

dd read 4k => 307 Mb/s

$ echo 3 > /proc/sys/vm/drop_caches
$ dd if=/benchfile of=/dev/null bs=4k count=1048476

dd read 1m => 357 Mb/s

$ echo 3 > /proc/sys/vm/drop_caches
$ dd if=/benchfile of=/dev/null bs=1M count=4096

I tried the same test with with IRQ affinity to big cores, but it did not shown any significant impact on performance.

Finalize installation

I am not going to copy paste all my saltstack states and templates here, as it obviously depends on personal needs and tastes. Basically, my “HC1 Node” template does the following on each node:

Change mirrorlist
Install custom sysadminscripts
Remove alarmuser
Add some sysadmin tools (lsof, wget, etc.)
Change mmc and ssd scheduler to deadline
Add my user
Install cron
Configure log rotate
Set journald config (RuntimeMaxUse=50M and Storage=volatile to lesser flash storage writes)
Add mail ability (ssmtp)

I then changed the password for my user using saltstack:

$ salt "node1.local.lan" shadow.gen_password 'xxxxxx' # gives password hash in return
$ salt "node2.local.lan" shadow.set_password myuser 'the_hash_here'

Finally, to ensure that no disk corruption would stop a node from booting, I forced fsck at boot time on all nodes by :

adding “fsck.mode=force” in kernel line in /boot/boot.txt
compile it with mkscr
rebooting

Deploy Docker Swarm

The swarm module in my saltstack could not be recognized, despite using version 2018.3.1 modules. So, I ended up with executing the commands directly, which is not really a problem as I was not going to add a node everyday.

Build the master:

$ salt "node1.local.lan" cmd.run 'docker swarm init'

Add worker:

$ salt "node4.local.lan" cmd.run 'docker swarm join --token xxxxx \
node1.local.lan:2377'

Add the 2nd and 3rd master for a failover ability:

$ salt "node1.local.lan" cmd.run 'docker swarm join-token manager'
$ salt "node3.local.lan" cmd.run 'docker swarm join --token xxxxx \
192.168.1.1:2377'
$ salt "node2.local.lan" cmd.run 'docker swarm join --token xxxxx \
192.168.1.1:2377'

On checking all nodes’ status with “docker node ls”, we can now see one leader and 2 nodes that are “reachable”. Then, I deployed a custom docker daemon configuration (daemon.json) to switch storage driver to overlay2 (the default one is to slow on xu4) and allows the usage of my custom docker registry:

{
  "insecure-registries":["myregistry.local.lan:5000"],
  "storage-driver": "overlay2"
}

Docker images for the Swarm cluster

As of now, using a container orchestrator implies to either use stateless containers or to use a global storage solution. I first tried to use glusterfs on all nodes. It was working perfectly, but was way too slow (between 25 and 36 Mb/s even with optimized settings and IRQ affinity to big cores). I ended up with a simple but yet very efficient solution for my needs: An automated daily backup of all volumes on all nodes (to a network drive) An automated daily mysql database backup on all nodes (run only when mysql is detected) Containers that are able to restore their volumes from the backup during first startup An automated daily clean-up of containers and volumes on all nodes

Thus, each time a node is shut down or a stack restarted, each container is able to start on any node, retrieving its data automatically (if not stateless).

The daily backup script is shown below:

# monthly saved backup
firstdayofthemonth=`date '+%d'`
if [ $firstdayofthemonth == 01 ] ; then
  BACKUP_DIR="$BACKUP_DIR/monthly"
else
  firstdayoftheweek=$(date +"%u")
  if [ day == 1 ]; then
  BACKUP_DIR="$BACKUP_DIR/weekly"
  fi
fi
 
volumeList=$(ls /var/lib/docker/volumes | grep $DOCKER_VOLUME_LIST_PATTERN)
 
for volume in $volumeList
do
  archiveName=$(echo $volume | cut -d_ -f2-)
  mv "$BACKUP_DIR/$archiveName.tar.gz" \
"$BACKUP_DIR/$archiveName.tar.gz.old"
  cd /var/lib/docker/volumes/$volume/_data/
  tar -czf $BACKUP_DIR/$archiveName.tar.gz * 2>&1
  rm "$BACKUP_DIR/$archiveName.tar.gz.old"
done

This is the daily cleanup script:

# remove unused containers and images
docker system prune -a -f
 
# remove unused volumes
volumeToRemove=$(docker volume ls -qf dangling=true)
 
if [ ! -z "$volumeToRemove" ]; then
  docker volume rm $volumeToRemove
fi

Simple distributed image build

To make a simple distributed build system, I made some scripts to distribute my docker containers build across the 4 devices. All containers are then put in a local registry, tagged with the current date. Local image builder that build, tag and put in registry (script name : docker_build_image) :

if [ $# -lt 3 ];  then
  echo "Usage: $0   "
  echo "Example : $0 myImage armv7h myregistry.local.lan:5000"
  echo ""
  exit 0
fi
 
arch="$2"
imageName="$arch/$1"
registry="$3"
tag=`date +%Y%m%d`
 
docker build --rm -t $registry/$imageName:$tag -t \
$registry/$imageName:latest .
docker push $registry/$imageName
docker rmi -f $registry/$imageName:$tag
docker rmi -f $registry/$imageName:latest

Build several images given in argument (script name:docker_build_batch) :

# usage : default build all
if [[ "$1" == "-h" ]]; then
  echo "Usage: $0 [image folder 1] [image folder 2] ..."
  echo "Example :"
  echo "  build two images : $0 mariadb mosquitto"
  echo ""
  exit 0
fi
 
# if any parameter, use it/them as docker image to build
if [[ $# -gt 0 ]]; then
  DOCKER_IMAGES_DIR="${@:1}"
else
  echo "Nothing to build. try -h for help"
fi
 
echo -e "\e[1m--- going to build the following images :"
echo -e "\e[1m$DOCKER_IMAGES_DIR\n"
 
# build and send to repository
for image in $DOCKER_IMAGES_DIR
do
  echo -e "\e[1m--- start build of $image:"
  cd /home/docker/$image
  docker_build_image $image armv7h myregistry.local.lan:5000
done

Distribute the builds using saltstack on the salt-master, using the previous script. The special image “archlinux” is built first if found, because all other images depend on it.

DOCKER_IMAGES_DIR=""
SPECIAL_NAME="archlinux_image_builder"
NODES[0]=""
 
# usage : default build all
if [[ "$1" == "-h" ]]; then
  echo "Usage: $0 [image folder 1] [image folder 2] ..."
  echo "Examples :"
  echo "  build all found images : $0"
  echo "  build two images : $0 mariadb archlinux_image_builder"
  echo ""
  exit 0
fi

echo -e "\e[1m--- Update repository (git pull)\n"
# update git repository
cd /home/docker
git pull

# if any parameter, use it/them as docker image to build
if [[ $# -gt 0 ]]; then
  DOCKER_IMAGES_DIR="${@:1}"
else
  DOCKER_IMAGES_DIR=$(ls -d */ | cut -f1 -d'/')
fi

echo -e "\e[1m--- going to build the following images :"
echo -e "\e[1m$DOCKER_IMAGES_DIR\n"
 
 
# if archlinux images in array, build it first
if [[ $DOCKER_IMAGES_DIR = *"$SPECIAL_NAME"* ]]; then
  echo -e "\e[1m--- found special image: $SPECIAL_NAME, start to build   
           it first"
  echo -e "\e[1m--- update repository on node1\n"
  salt "hulk1.local.lan" cmd.run "cd /home/docker; git pull"
 
  echo -e "\e[1m--- build $SPECIAL_NAME image on hulk1\n"
  salt "hulk1.local.lan" cmd.run "cd /home/docker/$SPECIAL_NAME; \
./mkimage-arch.sh armv7 registry.local.lan:5000"
 
  DOCKER_IMAGES_DIR=${DOCKER_IMAGES_DIR//$SPECIAL_NAME/}
fi

# update repository on all nodes
echo -e "\e[1m--- update repository on node[1-4]\n"
salt -E "node[1-4].local.lan" cmd.run "cd /home/docker; git pull"

# Prepare build processes on known swarm nodes
i=0
for image in $DOCKER_IMAGES_DIR
do
  NODES[$i]="${NODES[$i]} $image"
  i=$((i + 1))
 
  if [[ $i -gt 3 ]]; then
      i=0
  fi
done
 
echo -e "\e[1m--- build plan :"
echo -e "\e[1m--- node1 : ${NODES[0]}"
echo -e "\e[1m--- node2 : ${NODES[1]}"
echo -e "\e[1m--- node3 : ${NODES[2]}"
echo -e "\e[1m--- node4 : ${NODES[3]}\n"

# distribute and launch build plan
salt "node1.local.lan" cmd.run "docker_build_batch ${NODES[0]}"
salt "node2.local.lan" cmd.run "docker_build_batch ${NODES[1]}"
salt "node3.local.lan" cmd.run "docker_build_batch ${NODES[2]}"
salt "node4.local.lan" cmd.run "docker_build_batch ${NODES[3]}"
echo -e "\e[1m--- build plan finished"

Sources are available on GitHub (https://github.com/jit06/docker-images) and Thingiverse (https://www.thingiverse.com/thing:3218912). For comments, questions and suggestions, please visit the original article at https://www.bluemind.org/odroid-hc1-based-swarm-cluster-19-rack.