Service Manager and Disaster Recovery/Backup Contingency Plan/Scheduled Maintenance

Good morning,

I'm curious what other companies are doing in the area of Disaster Recovery or Backup Contingency Planning for Service Manager outages.

In our company, when we make changes to Service Manager that require changes to the dbdicts, we schedule an outage, kick everybody off the application, perform our changes and checkout, and then bring the application back up.  While the main Service Manager application is down, we have our users create tickets in another system, and then we use Connect-It to move those tickets over to HPSM.  The other system is limited, does not have all the data our Production system has, and it's kind of a pain trying to keep the core data in sync.

We use that same system for unexpected outages - like if HPSM goes down for some reason.

For 'Hard Down' disaster recovery, the database team and server teams bring up whole copies of Service Manager in another data center, but that's only for a company-wide disaster, and not something that's available for scheduled outages or unexpected HPSM outages.

So, I'm curious how other companies have solved this.  What do you guys do in your companies for unexpected outages or scheduled maintenance?

  • Hi Jacob

    For scheduled maintenance, I think you are being ultraconservative. I never "shut down" the system before creating new fields in the dbdict; you can take a dbdict backup or a hot backup first. I don't see the point of a full outage for that. For activities that genuinely require an outage or partial outage, like a change to a table's structure, if your client can't tolerate much unavailability you can usually find a way to make it quicker - in my example, for instance: clone the table, perform the change on the clone, sync the data, and swap it in for the original.

    For unexpected outages: in 10 years of working with this tool at different clients, every problem that lasted longer than a full restart was related to components outside my control, like LDAP, a hardware load balancer, or general network failures.

    General advice would be:
    RTE: Have a standby or snapshot of your HPSM RTE servers. Discuss with your Windows/Linux/virtualization team the best place to keep it.

    Database: The same as for any other application. An appropriate RAID configuration; full, differential, and log backups; separate storage for the database, logs, and backups... It all depends on what infrastructure you already have and the money you are willing to invest.
    Discuss database strategies with your DB team. Depending on how much downtime you can accept, most of them are transparent to HPSM and, when needed, take only a few minutes to be ready to use.

    This topic is very interesting. I'd like to see more people join the discussion...

    Thanks!
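
    The clone/sync/swap idea above can be sketched in miniature. This is a hypothetical illustration using Python's built-in sqlite3; in a real HPSM environment the equivalent DDL/DML would be run by your DBA against the production RDBMS, and the table and column names here are invented for the example:

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Original table with live data (example names only).
    cur.execute("CREATE TABLE probsummary (number TEXT PRIMARY KEY, brief_desc TEXT)")
    cur.executemany("INSERT INTO probsummary VALUES (?, ?)",
                    [("IM10001", "Printer down"), ("IM10002", "VPN issue")])

    # 1. Clone: create a copy of the table with the new structure (extra column).
    cur.execute("CREATE TABLE probsummary_new "
                "(number TEXT PRIMARY KEY, brief_desc TEXT, priority TEXT)")

    # 2. Sync: copy the existing rows into the clone.
    cur.execute("INSERT INTO probsummary_new (number, brief_desc) "
                "SELECT number, brief_desc FROM probsummary")

    # 3. Swap: drop the original and rename the clone into its place.
    cur.execute("DROP TABLE probsummary")
    cur.execute("ALTER TABLE probsummary_new RENAME TO probsummary")
    conn.commit()

    # The renamed table now carries the new column; existing rows have NULL there.
    rows = cur.execute("SELECT number, brief_desc, priority FROM probsummary "
                       "ORDER BY number").fetchall()
    print(rows)
    ```

    In practice step 2 is the expensive part on a large table, which is why you sync most of the data ahead of time and only replay the delta during the brief swap window.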

  • So you do the dbdict changes at the database level, rather than via the UI?

    We kick users out because if you're updating a record in the Incident module (or Change module or whatever) and have a record lock while changes are being made to the dbdict, those changes don't actually make it down to the database.  

    And as much as we'd love to have short outages (our release work could be done in less than 30 minutes), we have process owners who, even though we have 4 years of releases with zero 'new' defects introduced in Production, feel the necessity to do a full regression test in Prod on release day.

  • So you do the dbdict changes at the database level, rather than via the UI?

    It depends on the environment. At most clients I've worked for, the hpsm database user had DDL grants, so I mostly work through the dbdict utility only. The behavior is the same either way; the difference is that if your user does not have DDL rights, you need to involve the DB team in the change. Nothing else changes.

    We kick users out because if you're updating a record in the Incident module (or Change module or whatever) and have a record lock while changes are being made to the dbdict, those changes don't actually make it down to the database.  

    Are you sure? I have never faced this situation, and I just ran a test in my private 9.4x playground; that's not what I observed.

    And as much as we'd love to have short outages (our release work could be done in less than 30 minutes), we have process owners who, even though we have 4 years of releases with zero 'new' defects introduced in Production, feel the necessity to do a full regression test in Prod on release day.

    You could probably reduce the outage and still satisfy your process owners by creating a pre-prod environment. Indeed, 4 years of releases with zero defects is quite impressive, congratulations!

    Are you sure? I have never faced this situation, and I just ran a test in my private 9.4x playground; that's not what I observed.

    On 7.1x, 9.2x, and 9.3x, we observed this.  We were trying to do changes in-flight (letting users stay on).  We'd make a change to the dbdict, save, exit HPSM, log back in, and the change would be gone.  No recycle, just a logout.

    You could probably reduce the outage and still satisfy your process owners by creating a pre-prod environment.

    Heh, we have a pre-prod, which is why I can state we haven't introduced any 'new' defects.  The theory is that the process teams do a full regression test in the UAT environment and sign off; then we move the code to Staging, they smoke test and sign off, and then we move that code into Prod.  What _really_ happens is they sign off in UAT and Staging without really testing.  Then we go into Prod, they regression test, and they find issues... So then we go back to the lower environment (where they said it was defect free) and replicate the issue.  If it exists in the code they signed off on, then at that point it's working as designed (even if we have to change it as soon as possible).  We give 3 weeks for testing, and they do more in the 4- to 6-hour outage than in those 3 weeks.

    We give 3 weeks for testing, and they do more in the 4- to 6-hour outage than in those 3 weeks.

    Hehe, so few customers do proper testing in advance.

  • As for when to kick out users and restart the server, it depends on the type of customization applied.

    For example, I do most simple dbdict changes with users online, except when a new field also becomes mandatory; then I do the update in a maintenance window. Just adding a new field only requires users to log off and on for it to take effect.  Reindexing of large tables is always done in a maintenance window, since it locks the entire table.

    Very few customisations actually require a restart of the server. When one does, it is always done in the maintenance window.

  • Thanks guys, this has been the most insightful thread in ages.

    I recently listened to a webinar stating that, for upcoming releases, ITSM will be adopting a container approach for installation and upgrade procedures. I'm by no means an expert in that area, but from what I heard and understood, it would make disaster recovery a bit easier. (At least for the pieces you have control over...)

    If you are interested, search for a webinar called "HPE ITSM Automation and Containers – Accelerating Deployment and Time to Value".