Diffstat (limited to 'docs/performance.txt')
-rw-r--r-- | docs/performance.txt | 54
1 files changed, 54 insertions, 0 deletions
diff --git a/docs/performance.txt b/docs/performance.txt
new file mode 100644
index 0000000..b31c02a
--- /dev/null
+++ b/docs/performance.txt
@@ -0,0 +1,54 @@
+Divergence from the best practices
+==================================
+ Due to various constraints, I had to take some decisions contradicting the best practices. There were also some
+ hardware limitations resulting in a suboptimal configuration.
+
+ Storage
+ -------
+ - RedHat documentation strongly discourages running Gluster over large RAID-60 arrays. The best performance is
+ achieved if the disks are organized as JBOD and each disk is assigned a brick. The problem is that heketi is not
+ really ready for production yet: I ran into numerous problems while testing it. Managing '3 x 24' gluster bricks
+ manually would be a nightmare. Consequently, I opted for RAID-60 to simplify maintenance and to ensure no data is
+ lost due to mismanagement of gluster volumes.
+
+ - In general, the architecture is better suited to many small servers than to a couple of fat storage servers, as
+ the disk load would then be distributed between multiple nodes. Furthermore, we can't use all the storage with
+ 3 nodes. We need 3 nodes to ensure arbitration in case of failures (or network outages). Even if the 3rd node only
+ stores checksums, we can't easily use it to store data. Technically, we could create 3 sets of 3 bricks and put the
+ arbiter brick of each set on a different node (a sketch of such a layout follows this section), but this again would
+ complicate maintenance: unless proper brick ordering is maintained, replication may happen between bricks on the
+ same node, etc. So, again, I decided to prioritize fault tolerance over performance. We can still use the space when
+ the cluster is scaled.
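+
+ A minimal sketch of such an arbiter layout (the volume name 'data' and the brick paths are made up for illustration):
+ GlusterFS groups every 3 consecutive bricks of a 'replica 3 arbiter 1' volume into one replica set, with the last
+ brick of each set acting as the arbiter, so only the order of the brick arguments decides which bricks end up
+ replicating each other.
+
+   # hypothetical layout: 3 replica sets, 2 data bricks + 1 arbiter each, arbiters rotated across the 3 nodes
+   gluster volume create data replica 3 arbiter 1 \
+       node1:/bricks/b1 node2:/bricks/b1 node3:/arbiter/b1 \
+       node2:/bricks/b2 node3:/bricks/b2 node1:/arbiter/b2 \
+       node3:/bricks/b3 node1:/bricks/b3 node2:/arbiter/b3
+
+ Getting this ordering wrong (or forgetting it when adding bricks later) is exactly the maintenance risk mentioned
+ above.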
+
+ Network
+ -------
+ - To ensure high-speed communication between pods running on different nodes, RedHat recommends enabling Container
+ Native Routing. This is done by creating a bridge for docker containers on the hardware network device instead of
+ the OpenVSwitch fabric. Unfortunately, IPoIB does not provide Ethernet L2/L3 capabilities, so it is impossible to
+ use IB devices for bridging. It may still be possible to solve this somehow, but further research is required. The
+ easier solution is just to switch the OpenShift fabric to Ethernet. In any case, we had the idea to separate the
+ storage and OpenShift networks.
+
+ Memory
+ ------
+ - There are multiple docker storage engines. We are currently using the LVM-based 'devicemapper'. To build a
+ container, the data is copied from all image layers. The newer 'overlay2' driver provides a virtual file system
+ (overlayfs) joining all layers and performing COW when data is modified. It saves space, but more importantly it
+ also enables page cache sharing, reducing the memory footprint when multiple containers share the same layers (and
+ they share the CentOS base image at minimum). Another advantage is a slightly faster startup of containers with
+ large images (as we don't need to copy all the files). On the negative side, it is not fully POSIX compliant, so
+ some applications may have problems because of this. For major applications there are work-arounds provided by
+ RedHat. But again, I opted for the more standard 'devicemapper' to avoid hard-to-debug problems.
+
+
+What is required
+================
+ - We need to add at least one more node. It will double the available storage and I expect a significant improvement
+ of storage performance. It would be even better to have 5-6 nodes to split the load.
+ - We need to switch to an Ethernet fabric for the OpenShift network. Currently, this is not critical and would only
+ add about 20% to ADEI performance. However, it may become an issue if we optimize ADEI database handling or get more
+ network-intensive applications in the cluster.
+ - We need to re-evaluate RDMA support in GlusterFS. Currently, it is unreliable, causing pods to hang indefinitely.
+ If it is fixed, we can re-enable RDMA support for our volumes (see the transport sketch after this list), which may
+ hopefully further improve storage performance. Similarly, Gluster block storage is significantly faster for the
+ single-pod use case, but has significant stability issues at the moment.
+ - We need to check if OverlayFS causes any problems for the applications we plan to run. Enabling overlayfs should
+ be good for our cron services and may reduce the memory footprint (see the storage-driver sketch after this list).
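+
+ Re-enabling RDMA, once it is reliable, should amount to changing the transport of the existing volumes. A sketch,
+ assuming a volume named 'data' (the name is hypothetical) and assuming the transport can only be changed while the
+ volume is stopped:
+
+   gluster volume stop data
+   gluster volume set data config.transport tcp,rdma
+   gluster volume start data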
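+
+ If the OverlayFS checks pass, switching the docker storage driver is a small configuration change. A sketch for
+ /etc/docker/daemon.json (on our RHEL/CentOS nodes the driver may instead be configured via
+ /etc/sysconfig/docker-storage; either way, existing images and containers have to be rebuilt or re-pulled after the
+ switch):
+
+   {
+       "storage-driver": "overlay2"
+   }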