Diffstat (limited to 'docs/performance.txt')
 docs/performance.txt | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+), 0 deletions(-)
diff --git a/docs/performance.txt b/docs/performance.txt
new file mode 100644
index 0000000..b31c02a
--- /dev/null
+++ b/docs/performance.txt
@@ -0,0 +1,54 @@
+Divergence from the best practices
+==================================
+ Due to various constraints, I had to make some decisions that contradict the best practices. There were also
+ hardware limitations resulting in a suboptimal configuration.
+
+ Storage
+ -------
+ - RedHat documentation strongly discourages running Gluster over large RAID-60 arrays. The best performance is
+ achieved if the disks are organized as JBOD and each disk is assigned a brick. The problem is that heketi is not
+ really ready for production yet; I ran into numerous problems while testing it. Managing '3 x 24' Gluster bricks
+ manually would be a nightmare. Consequently, I opted for RAID-60 to simplify maintenance and to ensure no data is
+ lost due to mismanagement of Gluster volumes.
+
+ - In general, the architecture is better suited to many small servers than to a couple of fat storage servers: the
+ disk load is then distributed across multiple nodes. Furthermore, we can't use all of the storage with 3 nodes.
+ We need 3 nodes to ensure quorum arbitration in case of failures (or network outages). Even if the 3rd node only
+ stores checksums, we can't easily use it to store data. Technically, we could create 3 sets of 3 bricks and put the
+ arbiter brick of each set on a different node (see the sketch below), but this again complicates maintenance:
+ unless proper brick ordering is maintained, the replication may happen between bricks on the same node, etc. So,
+ again, I decided to favour fault tolerance over performance. We can still use the space when the cluster is scaled.
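+
+ A minimal sketch of such a layout (the volume name, hostnames, and brick paths are hypothetical). Gluster groups
+ every 3 consecutive bricks on the command line into one replica set, with the last brick of each set acting as the
+ arbiter, so the brick order decides which node holds which role:
+
+   gluster volume create vol0 replica 3 arbiter 1 \
+       node1:/bricks/b1 node2:/bricks/b1 node3:/bricks/arb1 \
+       node2:/bricks/b2 node3:/bricks/b2 node1:/bricks/arb2 \
+       node3:/bricks/b3 node1:/bricks/b3 node2:/bricks/arb3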
+
+ Network
+ -------
+ - To ensure high-speed communication between pods running on different nodes, RedHat recommends enabling Container
+ Native Routing. This is done by creating a bridge for the Docker containers on the hardware network device instead
+ of on the OpenVSwitch fabric. Unfortunately, IPoIB does not provide Ethernet L2/L3 capabilities, and it is
+ impossible to use IB devices for bridging. It may still be possible to solve this somehow, but further research is
+ required. The easier solution is to just switch the OpenShift fabric to Ethernet; anyway, we had the idea of
+ keeping the storage and OpenShift networks separate.
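+
+ If we do switch the fabric to Ethernet, Container Native Routing essentially replaces the default docker0 bridge
+ with a bridge on the hardware NIC, plus routes to the container subnets of the other nodes. A rough sketch,
+ assuming a hypothetical device eth2 and a per-node container subnet 10.128.1.0/24:
+
+   ip link add name cbr0 type bridge
+   ip link set eth2 master cbr0
+   ip addr add 10.128.1.1/24 dev cbr0
+   ip link set cbr0 up
+   # point Docker at the bridge via /etc/docker/daemon.json: { "bridge": "cbr0" }
+   # and add a route per peer node, e.g.: ip route add 10.128.2.0/24 via <node2-ip>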
+
+ Memory
+ ------
+ - There are multiple Docker storage drivers. We are currently using the LVM-based 'devicemapper'. To build a
+ container, the data is copied from all image layers. The newer 'overlay2' driver provides a union file system
+ (overlayfs) joining all the layers and performing copy-on-write only when the data is modified. It saves space
+ but, more importantly, it also enables page-cache sharing, reducing the memory footprint if multiple containers
+ share the same layers (and they do share the CentOS base image at minimum). Another advantage is a slightly faster
+ startup of containers with large images (as we don't need to copy all the files). On the negative side, overlayfs
+ is not fully POSIX compliant, and some applications may have problems because of this. For the major applications
+ there are work-arounds provided by RedHat, but again I opted for the more standard 'devicemapper' to avoid
+ hard-to-debug problems.
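+
+ For reference, switching the driver later would only take a daemon.json change like the sketch below, but it
+ requires wiping /var/lib/docker, since images are not migrated between storage drivers:
+
+   # /etc/docker/daemon.json
+   {
+       "storage-driver": "overlay2"
+   }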
+
+
+What is required
+================
+ - We need to add at least one more node. It will double the available storage, and I expect a significant
+ improvement in storage performance. It would be even better to have 5-6 nodes to split the load.
+ - We need to switch to an Ethernet fabric for the OpenShift network. Currently this is not critical and would only
+ add about 20% to ADEI performance. However, it may become an issue if we optimize the ADEI database handling or
+ get more network-intensive applications in the cluster.
+ - We need to re-evaluate RDMA support in GlusterFS. Currently it is unreliable, causing pods to hang indefinitely.
+ If it is fixed, we can re-enable RDMA transport for our volumes (see the sketch below); it may hopefully improve
+ storage performance further. Similarly, Gluster block storage is significantly faster for the single-pod use case,
+ but has significant stability issues at the moment.
+ - We need to check if OverlayFS causes any problems for the applications we plan to run. Enabling overlayfs should
+ be good for our cron services and may reduce the memory footprint.
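+
+ Re-enabling RDMA, once it is stable, would presumably follow the documented transport switch (the volume name is
+ hypothetical; the volume has to be stopped first):
+
+   gluster volume stop vol0
+   gluster volume set vol0 config.transport tcp,rdma
+   gluster volume start vol0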
+
+