Limitations using SRM with RecoverPoint

My life this year has been a project of implementing RecoverPoint and SRM.  No, it really shouldn’t take all year but it’s been a hell of a year.  My project has officially been declared complete, by me.  I’ve been doing some extensive testing of SRM and integration with a few physical hosts all using EMC CLARiiON storage, CLARiiON splitters and RecoverPoint.  I’m currently on RecoverPoint 3.2 SP 1.  My complaints about the splitter write failures when enabling snapshot consolidation are gone and I’m mostly loving my replicated life.  I say mostly…

I did uncover what I can best call a limitation in the implementation of these two technologies.  My implementation is probably far from large but it’s also not small.  I’m currently replicating 8 TB, soon to increase significantly, of mostly VMFS datastores.  That comes to 24 LUNs in total, containing 96 virtual machines.

During some testing of a full site recovery I found that RecoverPoint would start issuing warnings about volumes hitting a high water mark.  Unfortunately, the logging didn’t specify which volume was hitting it.  During a full site recovery test that also included about 1.5 TB of physical Oracle and MSSQL Server databases, I was able to keep all systems up for less than one hour; the last SRM test never completed before all storage became inaccessible.  These tests were fairly high profile and had the attention of my (most understanding and cooperative) CIO, which did not look good for me, to say the least.

After spending some time trying to put the pieces together I was beginning to believe that what I experienced was an issue with insufficient journal space allocated to image access logging.  This is user configurable and based upon a percentage of allocated journal space.  I put in a call to support and was told that this was a limitation of memory in the RPA, not the image access logging I had suspected.  With the faith I have in EMC first tier support I also placed a call to my account manager, who was able to escalate to RecoverPoint engineering.  What I finally got was that this was not a memory issue but a hard coded 40 GB cap on data change in volumes on the DR side during virtual image access.  Virtual image access is currently the standard with the EMC RecoverPoint SRA and the access mode does not appear to be user configurable.  It makes perfect sense that if I’m starting up the top two tiers of servers in my environment, leaving the rest powered off, I could easily hit 40 GB of writes to storage.  This effectively means that with the combination of RecoverPoint 3.2 and SRM I cannot reliably run a full site recovery test.
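To put that 40 GB figure in perspective, here is a rough back-of-the-envelope sketch in Python.  This is not anything from EMC or the SRA; the function names and the idea of a steady aggregate write rate are assumptions made up purely for illustration.  Only the 40 GB cap and the 96 VM count come from the numbers above.

```python
# Back-of-the-envelope budget for a test under virtual image access.
# CAP_GB is the limit quoted by engineering; everything else here is a
# hypothetical illustration, not a RecoverPoint or SRM API.
CAP_GB = 40


def minutes_until_cap(write_rate_mb_s: float, cap_gb: float = CAP_GB) -> float:
    """Minutes until cumulative writes reach the cap, assuming a steady
    aggregate write rate (in MB/s) across all test-recovered volumes."""
    if write_rate_mb_s <= 0:
        raise ValueError("write rate must be positive")
    return cap_gb * 1024 / write_rate_mb_s / 60


def per_vm_budget_mb(vm_count: int, cap_gb: float = CAP_GB) -> float:
    """Average write budget per VM (in MB) if the cap is shared evenly."""
    if vm_count <= 0:
        raise ValueError("vm_count must be positive")
    return cap_gb * 1024 / vm_count


if __name__ == "__main__":
    # 96 is the total VM count; a partial power-on would use a smaller number.
    print(f"per-VM budget: {per_vm_budget_mb(96):.1f} MB")
    # 10 MB/s is an assumed rate, not a measured one.
    print(f"time to cap at 10 MB/s: {minutes_until_cap(10):.1f} minutes")
```

Spread evenly over all 96 VMs the cap works out to roughly 427 MB of writes each, which boot activity and swap files can plausibly consume.  Powering on only a subset raises the per-VM figure, but the total is still 40 GB.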

Oddly, prior to upgrading to version 3.2, I was able to perform this operation, with only about 6 TB replicated, and keep the test running for several hours.  I don’t know if there was a change in the SRA or RecoverPoint between version 3.1 and 3.2 and I can’t get anyone to admit that there was.

From engineering I also got information, that hopefully doesn’t violate my NDA, that the SRA for RecoverPoint 3.3 will no longer use virtual image access but physical instead.  This will effectively eliminate this problem.

14 thoughts on “Limitations using SRM with RecoverPoint”

  1. Heya Tim,
    Hope you’re doing well. Just to make sure I understand what is happening: you’re saying that EMC is telling you there is a 40G limit on changed data on the recovery side LUNs, within a CG or on a per RPA basis, when using virtual image access? How many RPAs per site are you running? Are your CGs balanced, load wise, across all RPAs? I could see this being a lack of memory issue on a recovery side RPA, but a cap... well, that’s downright strange ; ) Any Primus articles you can point us to? Thanks for the info.

    1. Hello to you Joe!

      As I attempted to point out, this is as explained to me by engineering. This was explained as a 40 GB data write limit on the recovery side cluster, so neither CG nor RPA. Yes, seems very odd to me as well. Please keep in mind that this reported limitation exists only during virtual image access!

      My RPAs are closely balanced though I’m only running a 2 node cluster.

      I do not have a Primus article to reference yet. I have asked for one if it exists and have yet to receive a response. In EMC’s defense, that was less than one full business day ago, so I would not expect to have seen a response yet. I will definitely update if I receive one!

  2. Whoa, a 40G limit on the cluster. I was hoping you weren’t saying that; that’s pure evil. I still stand by the comments in my post: if EMC wouldn’t push such precarious limits then it wouldn’t be an issue. A roll-to-physical option when setting up the recovery plan, or the next SRA eliminating the mode altogether, would be a leap in the right direction. Would hard memory reservations on your VMs effectively eliminate this problem by preventing the boot write storm of VM swap files on your recovery side cluster? Thanks for the info Tim, good stuff.

  3. I’ve asked for more clarification on exactly where this limit is coming from, where the data is stored if it’s not a component of the image access logging area on the journal LUNs, and about changes on the road map. No response yet. I find this just too hard to believe but my testing has seemed to support the response. I’ve noted that the image access utilization seems to increment fairly closely on all CGs with image access enabled. When I remove image access from one CG I see a decrement in utilization.

  4. Crap, a quick read of the 3.2 release notes turns this little gem up:

    “In virtual access, maximum writes per RPA: 40 GB”

    There will apparently be no further information on this from EMC, somebody read this and must not have been thrilled. Oh well, RTFM.

  5. More findings, in the RecoverPoint Command Line Reference Guide this time:

    “In most cases, enabling access to an image disables the distribution of replicated data to replica volumes on storage. However, when the ’access_mode’ is set to ’virtual_without_roll’, distribution is not disabled. When enabling image access causes distribution to stop, the copy can continue to receive the replication data, and to store it in the journal. However, in the event that journal capacity is reached, the system unilaterally pauses transfer. For virtual access, the maximum size of the image access log is approximately 40 GB. Any data written to the image access log will persist until the ’disable_image_access’ or ’undo_writes’ commands are used, or until access is enabled to another image.”

  6. Can you not start the SRM Test, then go into the RecoverPoint Management Application and then “Roll to Image”, thus making your virtual access become physical? Granted, SRM won’t do this for you, but it’ll let you extend your test window, would it not?

    Or will this break things in new and interesting ways that I haven’t considered?

    1. Interesting, though I’m not so sure it would have an impact on image access time during virtual access mode. Since, in virtual access mode, logging is not on the image access space allocated from the journals but is instead subject to a hard coded limitation of the RPA, I doubt it would flush at that point. Worth a test to prove me wrong.

      Also, with RecoverPoint 3.2 the CG would have to be put into maintenance mode to perform a modification to CG settings. I’ve never attempted this during image access, though I don’t believe there would be a problem. Still need to test.

      Interesting thought though, I might need to test!

    2. The only option that I see when SRM has the CG in virtual access mode is to drop writes. That is it. No roll to image or anything. So it is basically useless.

  7. My initial test of this resulted in the entire CLARiiON storage group being yanked out from under my recovery-site ESX cluster. It’s just 4x400GB replicated LUNs and 1 local LUN for templates and mgmt VMs, and all of the LUNs were unavailable to ESX. I got several “RPA is only single connected” errors while it tried to roll to image, which seemed to indicate that either the RPAs or the CX3-10c just couldn’t handle the amount of traffic or processing required. I still had replication going though. I’ll need to test it with RecoverPoint paused. But definitely do this test in a lab.

    Additional Details:
    Clariion CX3-10c
    RecoverPoint 3.1SP1(u.37)
    vCenter 4.0.0
    SRM 4.0

    1. With what little knowledge I have about image access modes I can totally see this happening, with the exception of the single path error. I really doubt the limitation was your CLARiiON, that’s still a beefy little array! For what it’s worth, I have it on good authority that the 3.3 version of RP and the SRA will use physical image access instead of virtual. This should effectively eliminate the issue.

      Thanks for the feedback!

  8. Two points I would like to clarify.
    Depending on the version of RP and the splitter in use, you will have roll in background. CX splitters did not support this until FLARE 29, I believe.

    In any case this issue has been resolved with the use of logged access as opposed to virtual access by default with RP 3.3 and above.

  9. Thanks Rick, in an earlier comment it was acknowledged that physical access mode was going to be used in RecoverPoint 3.3. I’ve got it in place and working nicely now.
