As the eDirectory documentation mentions, the most common bottleneck is the disk subsystem, and this is especially pronounced in large deployments and in environments that make heavy use of eDirectory services.
The documentation also lists recommendations for improving performance at this critical point, but everything has a limit. In a recent case, an environment with very high read and update demand, on the order of millions of operations per day, reached and exceeded the performance the storage devices could offer, and every member of the replica ring showed a significant drop in performance.
To give a little more context, here are some approximate details of the arrangement we had:
6 servers included in a replica ring, each with:
RAM: 160 GB.
Processing: 24 cores.
Storage: 800 GB (LUN - EMC).
OS: SLES 12 SP3.
DIB size: 350 GB.
Daily transactions: on the order of 7-8 million.
Although the storage itself was operating normally, the transfer rate it offered was not enough to meet the I/O load eDirectory needed to operate optimally. So what was the problem?
Each server has 1 to 3 LUNs (emcpowerX) on which an LVM structure was built: each LUN was configured as a PV and added to a VG, and on top of that layer an LV was created with the default settings. Here is the key point, and the recommendation I want to pass on with this post: the default mapping for an LV is linear.
Example: on one server the devices emcpoweri, emcpowera and emcpowerg were held in the same VG, yet only one of them was serving the load.
But, what is linear mapping in LVM?
It consists of assigning a range of consecutive PEs to each PV, that is: PE 0 - 100 = PV1, PE 101 - 200 = PV2, and so on. Storage therefore behaves like a stack: only once the full capacity of PV1 has been consumed does PV2 begin to be used. In this specific case each PV was approximately 80% of the total size of the DIB, so the load was unbalanced. During certain critical hours one PV received virtually the entire load while the rest operated far below capacity, saturating that PV and causing performance problems in eDirectory.
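The stacking behavior of linear mapping can be illustrated with a small shell sketch. It is a simplification, not how LVM is implemented: it assumes every PV holds exactly the same number of extents (100 here, numbered from 0), and the function name linear_pv_for_extent is made up for this illustration.

```shell
# Linear mapping sketch: PV1 is filled completely before PV2 is touched.
# Assumption: each PV holds the same number of physical extents.
linear_pv_for_extent() {
    extent=$1      # zero-based logical extent number
    pe_per_pv=$2   # physical extents available on each PV
    echo "PV$(( extent / pe_per_pv + 1 ))"
}

linear_pv_for_extent 0 100     # prints PV1
linear_pv_for_extent 99 100    # prints PV1
linear_pv_for_extent 100 100   # prints PV2
```

Every extent below the capacity of the first PV lands on PV1, which is why a DIB only slightly larger than one PV keeps most of its I/O on a single device.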
After the corresponding analysis, requesting more storage to add to the same scheme was the first option to be ruled out, so we decided to experiment with LVM's stripe mode.
STEP 1 - LVM STRIPE
What is Stripe mapping in LVM?
Striped mapping interleaves chunks of the logical extents across a series of physical volumes. As a result, the data that makes up the DIB is divided, at the LVM level, into stripes of a given size, and the stripes are stored in turn on each PV:
Stripe 1 >> PV1
Stripe 2 >> PV2
Stripe 3 >> PV3
Stripe 4 >> PV1
In this way the stored data is balanced across the PVs that are part of the VG. During periods of heavy reads and updates on the DIB, every PV holds a portion of the data, so they all work together; through the LVM layer their I/O throughput adds up, giving higher transfer rates at the LV level.
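The round-robin placement listed above can be sketched in a few lines of shell. This is only an illustration of the idea, assuming stripes are numbered from 1 and PVs are used in a fixed order; stripe_pv_for_chunk is a hypothetical helper, not an LVM command.

```shell
# Striped mapping sketch: consecutive stripes rotate across the PVs.
stripe_pv_for_chunk() {
    chunk=$1     # one-based stripe (chunk) number
    stripes=$2   # number of PVs the LV is striped across
    echo "PV$(( (chunk - 1) % stripes + 1 ))"
}

for n in 1 2 3 4; do
    echo "Stripe $n >> $(stripe_pv_for_chunk "$n" 3)"
done
# prints:
# Stripe 1 >> PV1
# Stripe 2 >> PV2
# Stripe 3 >> PV3
# Stripe 4 >> PV1
```

Because neighboring stripes sit on different PVs, a burst of reads or writes over a contiguous region of the DIB is spread over all the devices instead of queuing on one.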
How do I create an LV in stripe mode?
This is the procedure we followed to migrate from linear to striped LVM; it is merely a suggestion in case the reader decides to take this step.
According to the documentation, the volume that stores the DIB must be at least twice the size of the DIB. The combined capacity of the PVs must therefore satisfy this rule or, if possible, be three times the size of the DIB.
Disks independent of those already in use are required, and there must be at least 2 of them, since striping across a single PV is meaningless. In the real scenario we used 3 LUNs of 300 GB each, which offered transfer rates of 500 MB/s.
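As a quick sanity check of the sizing rule, the following sketch plugs in the figures from this scenario (a 350 GB DIB and 3 LUNs of 300 GB). The helper meets_factor is hypothetical and works in whole GB only.

```shell
# Sizing sketch: does the combined PV capacity reach N times the DIB size?
dib_gb=350      # DIB size from the scenario
lun_gb=300      # size of each new LUN
lun_count=3     # number of new LUNs
total_gb=$(( lun_gb * lun_count ))   # 900 GB in total

meets_factor() {
    # usage: meets_factor <total_gb> <dib_gb> <factor>
    [ "$1" -ge $(( $2 * $3 )) ]
}

if meets_factor "$total_gb" "$dib_gb" 2; then echo "2x rule: ok"; else echo "2x rule: NOT met"; fi
if meets_factor "$total_gb" "$dib_gb" 3; then echo "3x target: ok"; else echo "3x target: NOT met"; fi
# prints:
# 2x rule: ok
# 3x target: NOT met
```

With these numbers the array satisfies the documented minimum of twice the DIB (700 GB) but falls short of the more comfortable threefold target (1050 GB).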
NOTE: Stripe mapping is configured at the logical volume level, so it is recommended to use disks other than those that already store the DIB.
NOTE2: The steps described below require root permissions.
With the extra disks available, independent of the previous scheme, we proceed with partitioning and with creating the PVs and the VG. These first steps are done in the traditional way, paying special attention to the size of the physical extent (PE), which is 4 MB by default.
To create the partitions I used parted, but fdisk also works and I consider it safer; if you use parted, I recommend extreme caution:
parted -s /dev/sdX u % mklabel msdos mkpart primary 0 100
NOTE: Run it once for each disk that will be included in the new array.
Next, configure the new partitions as PVs:
pvcreate /dev/sdX1 /dev/sdW1 /dev/sdZ1
Where: X, W and Z correspond to the different disks that are going to be included in the VG.
Then create the VG from those PVs:
vgcreate <vg-name> /dev/sdX1 /dev/sdW1 /dev/sdZ1
Where: <vg-name> is whatever name you find convenient for the volume group.
At this point we create the LV with stripe mapping. This is the key step for our arrangement; run:
lvcreate -i3 -I4MB -l 100%FREE -n <lv-name> <vg-name>
## -i3: Sets the number of stripes, that is, the number of PVs across which the data will be spread. It depends on how many PVs are part of the VG.
## -I4MB: Defines the size of each stripe (chunk) of data that is written to one PV before moving on to the next. Note that the stripe size cannot exceed the size of the PE (4 MB by default). If it is not defined, the default stripe size is 64 KB. In the real case we used the maximum allowed by the PE configuration, since the DIB is quite large.
## -l 100%FREE: Tells lvcreate the size of the logical volume as a percentage of the free space in the VG, which is the most convenient notation for this case.
## -n <lv-name>: Defines the name given to the logical volume.
## <vg-name>: The name of the volume group in which the new LV will be created. It must match the name given to the VG created in the previous step.
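The constraint on the stripe size can be checked before calling lvcreate. The sketch below uses a hypothetical helper, valid_stripe_size_kb, and takes both sizes in KiB. Besides the PE limit described above, it also checks that the value is a power of two, which is a requirement stated in the lvcreate man page rather than in this post. The PE size of an existing VG is shown in the "PE Size" line of vgdisplay.

```shell
# Stripe size sketch: must be a power of two and must not exceed the PE size.
valid_stripe_size_kb() {
    stripe_kb=$1   # requested stripe size, in KiB
    pe_kb=$2       # physical extent size of the VG, in KiB
    [ "$stripe_kb" -gt 0 ] || return 1
    # A power of two has a single bit set, so n & (n - 1) is zero.
    [ $(( stripe_kb & (stripe_kb - 1) )) -eq 0 ] || return 1
    [ "$stripe_kb" -le "$pe_kb" ]
}

if valid_stripe_size_kb 4096 4096; then echo "4 MiB stripes fit a 4 MiB PE"; fi
if ! valid_stripe_size_kb 8192 4096; then echo "8 MiB stripes exceed a 4 MiB PE"; fi
# prints:
# 4 MiB stripes fit a 4 MiB PE
# 8 MiB stripes exceed a 4 MiB PE
```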
Finally, check that the LV was created with stripe mapping, for example by printing its segment map with lvdisplay -m:
lvdisplay -m /dev/<vg-name>/<lv-name>
At this point, we have successfully created an LV in stripe mode!
STEP 2 - XFS
Another recommendation in the eDirectory documentation concerns the file system type. In earlier versions it was ReiserFS, but that is increasingly falling out of use in favor of more efficient solutions. One of them is XFS, which is supported by eDirectory 9.x as well as by SLES 12.
XFS is a high-performance journaling file system optimized for large files. It parallelizes I/O, can be grown while mounted (which makes it a good fit for LVM), and has its own set of utilities.
The only point to watch here is the block size used by eDirectory's FLAIM engine. By default FLAIM uses 4 KB blocks, but this can be customized to 8 KB when the instance is first configured. The block size used by FLAIM and the one configured in the file system must be the same.
The good news is that the default block size in XFS is also 4 KB, so if you did not change this parameter when configuring the eDirectory instance, no additional settings are required to create the file system.
Just run mkfs.xfs directly:
mkfs.xfs /dev/<vg-name>/<lv-name>
Or call the mkfs frontend:
mkfs -t xfs /dev/<vg-name>/<lv-name>
At this point we have our LV in stripe mode, formatted with XFS and ready for use!
STEP 3 - MIGRATING TO THE NEW LOGICAL VOLUME.
Take extreme care with this step, since an incorrect execution can seriously and irreparably damage the DIB.
To migrate the DIB from one LV to another I used rsync, which performs the checks needed to ensure the integrity of the files. Other utilities can be used as well; for the bravest, there is cp.
1. Stop the eDirectory service (for example, with ndsmanage stopall). This is mandatory: the copy must be made with the service stopped so that the DIB is not modified during the process.
2. Mount the new file system, for example on or under /mnt:
mount /dev/<vg-name>/<lv-name> /mnt
3. Copy the files from the previous file system, the one holding the original DIB files. In our case everything was on an LV mounted under /edirectory, so we can run:
rsync -avz /edirectory/ /mnt
NOTE: It is important to put a trailing slash on the source directory name, since this makes rsync copy the contents of the directory into the new location. If the slash is omitted, the directory itself, together with its contents, is copied into the destination.
4. Confirm the integrity of the DIB files: compute an md5 or sha256 checksum recursively over the original DIB and compare it with that of the DIB copied to the new file system. They must match.
5. Unmount both file systems: the new one, which now holds the copied DIB, and the old one, which holds the original DIB.
6. Mount the new file system at the mount point of the old file system:
mount /dev/<vg-name>/<lv-name> /edirectory
8. Start the eDirectory service again. Once it is running you will see the typical errors caused by a replica having been unavailable (625, 626), but these should disappear within a few minutes.
NOTE1: Do not delete the original DIB until you confirm that the service operates in a stable way and without errors.
NOTE2: It is important to point out that the more time passes, the less useful the original DIB becomes, because it falls out of date with respect to the rest of the replicas in the ring.
9. Update the entry for the eDirectory file system in /etc/fstab, taking care to enter the correct device reference and file system type.
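For the integrity check in step 4, one possible approach is to hash every file in both trees and compare the two listings. This is a minimal sketch, assuming GNU coreutils (sha256sum) is available; tree_checksums and same_tree are hypothetical helper names, and it must be run while both file systems are still mounted, that is, before step 5.

```shell
# Print "hash  ./relative/path" for every regular file under a directory,
# sorted by path so that two trees can be compared line by line.
tree_checksums() {
    ( cd "$1" && find . -type f -exec sha256sum {} + | sort -k 2 )
}

# Succeed only if both directories exist and contain identical files.
same_tree() {
    [ -d "$1" ] && [ -d "$2" ] || return 2
    [ "$(tree_checksums "$1")" = "$(tree_checksums "$2")" ]
}
```

With the mount points used in this post, the check would be run as: same_tree /edirectory /mnt && echo "checksums match". On a DIB of several hundred GB, hashing both copies takes a while, so allow for it in the maintenance window.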
If the service does not run stably after starting (unknown errors, a locked DIB, closed ports), it is recommended:
If the problem persists, it is very easy to go back to the previous scheme, as long as the original DIB has been kept. Note: if a node in the ring does not replicate correctly, the rollback should be performed across the whole ring, returning every node to its original DIB, in order to avoid synchronization problems related to object timestamps.
In the real scenario, this solution has been implemented successfully on the 11 servers of two replica rings (5 in one and 6 in the other). With 3 PVs in the VG, the I/O performance obtained is 300% of what was previously observed, reaching transfer rates of 1.5 - 1.7 GB/s at peak times.
The tools used for the diagnosis were:
To test the performance of the new structure you can use the Bonnie tool. If installing it is not feasible, cp together with iostat can be used to stress the file system and observe its behavior.
NOTE: iostat occasionally miscalculates the usage percentage of an LV in stripe mode, so to confirm real saturation it is recommended to look at the usage percentage of each underlying device separately and average them.
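The averaging suggested in the note can be done with a short awk filter. The sketch below assumes its input is one "device utilization" pair per line; the helper name avg_util, the device names and the sample percentages are made up for illustration and are not measurements from this environment.

```shell
# Average the utilization column of "<device> <%util>" lines read from stdin.
avg_util() {
    awk '{ sum += $2; n++ } END { if (n) printf "%.1f\n", sum / n }'
}

# Hypothetical sample values for three PVs under a striped LV.
printf '%s\n' 'emcpowera 62.0' 'emcpowerb 58.0' 'emcpowerc 60.0' | avg_util
# prints 60.0
```

On sysstat versions where %util is the last column of the extended report, the input could be produced with something like iostat -dx | awk '/emcpower/ { print $1, $NF }', adjusting the pattern to the devices behind the LV.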
######## RESULTS!! ##########
As you can see in the following image, although the VG has 4 PVs, they are in linear mode, so only one of them serves the incoming load; when it saturates, the LV suffers and the bottleneck appears.
The following image shows an LV operating in stripe mode. Here the PVs underlying the LV work in parallel, sharing the load among the 3 and raising the I/O rate to an average of 1.5 GB/s.