My life this year has been a project of implementing RecoverPoint and SRM. No, it really shouldn’t take all year but it’s been a hell of a year. My project has officially been declared complete, by me. I’ve been doing some extensive testing of SRM and integration with a few physical hosts all using EMC CLARiiON storage, CLARiiON splitters and RecoverPoint. I’m currently on RecoverPoint 3.2 SP 1. My complaints about the splitter write failures when enabling snapshot consolidation are gone and I’m mostly loving my replicated life. I say mostly…
I did uncover what I can best call a limitation in how these two technologies are implemented together. My implementation is probably far from large but it’s also not small. I’m currently replicating 8 TB, soon to increase significantly, of mostly VMFS datastores. That comes to 24 LUNs in total, containing 96 virtual machines.
During some testing of a full site recovery I found that RecoverPoint would start issuing warnings about volumes hitting a high water mark. Unfortunately, the logging didn’t specify which volume was hitting it. During a full site recovery test that also included about 1.5 TB of physical Oracle and MSSQL Server databases, I was able to keep all systems up for less than one hour; the last SRM test never completed before all storage became inaccessible. These tests were fairly high profile and had the attention of my (most understanding and cooperative) CIO, so this did not look good for me, to say the least.
After spending some time trying to put the pieces together, I was beginning to believe that what I had experienced was an issue with insufficient journal space allocated to image access logging. This is user configurable and based upon a percentage of allocated journal space. I put in a call to support and was told that this was a limitation of memory in the RPA, not the image access logging I had suspected. With the faith I have in EMC first tier support, I also placed a call to my account manager, who was able to escalate to RecoverPoint engineering. What I finally got was that this was not a memory issue but a hard coded 40 GB cap on data change in volumes on the DR side during virtual image access. Virtual image access is currently the standard with the EMC RecoverPoint SRA and does not appear to be user configurable. It makes perfect sense that if I’m starting up the top two tiers of servers in my environment, leaving the rest powered off, I could easily hit 40 GB of writes to storage. This effectively means that with the combination of RecoverPoint 3.2 and SRM I cannot reliably run a full site recovery test.
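To get a feel for how quickly a 40 GB budget can disappear, here is a rough back-of-the-envelope sketch in Python. It is only arithmetic, not anything RecoverPoint exposes: the function name `minutes_until_cap` and the write rates in the loop are made up for illustration, not measured from my environment. It also assumes the cap behaves like a single budget that every write counts against, which is my simplification; I was not told whether it applies per volume or in aggregate, or how overwrites of the same block are counted.

```python
# The 40 GB figure is the cap reported to me by RecoverPoint engineering.
CAP_GB = 40.0


def minutes_until_cap(write_rate_mb_per_s: float, cap_gb: float = CAP_GB) -> float:
    """Minutes of sustained writes before cap_gb of changed data accumulates.

    Assumption: every write counts as new changed data against one budget.
    """
    if write_rate_mb_per_s <= 0:
        raise ValueError("write rate must be positive")
    cap_mb = cap_gb * 1024  # GB -> MB, binary units
    return cap_mb / write_rate_mb_per_s / 60


if __name__ == "__main__":
    # Hypothetical sustained write rates in MB/s, for illustration only.
    for rate in (5, 12, 50):
        print(f"{rate:>3} MB/s -> {minutes_until_cap(rate):6.1f} minutes")
```

Under those assumptions, a sustained 12 MB/s of writes across the recovered machines would exhaust the budget in a little under an hour, so a boot storm of top tier servers plus databases does not need to be especially busy to hit the limit.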
Oddly, before upgrading to version 3.2 I was able to perform this same operation, with only about 6 TB replicated at the time, and keep the test running for several hours. I don’t know if there was a change in the SRA or RecoverPoint between version 3.1 and 3.2, and I can’t get anyone to admit that there was.
From engineering I also got information, that hopefully doesn’t violate my NDA, that the SRA for RecoverPoint 3.3 will no longer use virtual image access but physical instead. This will effectively eliminate this problem.