
Switch Datacenter/DeploymentServer

From Wikitech

This page describes the procedure to switch over the active deployment server from one host to another.

This is suitable for use either as part of the regularly scheduled datacenter switchover or more generally when replacing the active deployment server with a newer one.

In either case, schedule an exclusive window of at least one hour in the Deployments calendar and coordinate with potential deployers.

Procedure

  • Disable puppet on deployment servers. From a cluster-management host:

sudo cumin 'A:deployment-servers' 'disable-puppet "Deployment server switchover - TXXXXXX"'

  • Create an Alertmanager silence for a duration of at least one hour, matching alertname=ATSBackendErrorsHigh and backend=deployment.eqiad.wmnet. This will prevent spurious pages while SpiderPig is unavailable during the switch.
  • On the deployment server in the site the primary is moving from:
    • After making sure there is no active SpiderPig job (check https://spiderpig.wikimedia.org), stop SpiderPig services: sudo systemctl stop spiderpig-apiserver; sudo systemctl stop spiderpig-jobrunner
  • On all deployment servers in the site the primary is moving to:
    • Perform a final scap-master-sync to sync SpiderPig state: sudo /usr/local/bin/scap-master-sync <old-active-server-name>
  • Downtime the releases* hosts so that temporarily failing jobs do not create tickets:

sudo cookbook sre.hosts.downtime --hours 2 -r 'Deployment server switchover' -t TXXXXXX 'releases*'

  • Merge a DNS change that points the deployment.eqiad.wmnet CNAME record to the new active host (see this change for an example).
  • Change the deployment_server and scap::deployment_server variables in hiera (see this change for an example).
  • Run puppet on the new active deployment server: sudo run-puppet-agent -e "Deployment server switchover - TXXXXXX"
  • Run puppet on all the other deployment servers (same command). You can use sudo cumin 'A:deployment-servers' --dry-run to retrieve the list of current deployment servers.
  • Using sudo cumin 'A:deployment-servers' 'grep block_deployments /etc/scap.cfg' (on a cluster-management host), verify that:
    • Deployments on the previously active server are blocked
    • Deployments on the newly active server are not blocked
  • The keyholder configuration might have changed while the newly active deployment server was a spare, without being reloaded there. Restart the keyholder service on that server and test it:
$ sudo systemctl restart keyholder-proxy.service
$ SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -i /etc/keyholder.d/deploy_jenkins -l deploy-jenkins releases1002.eqiad.wmnet
  • Workaround for task T197470: Run the following on all deployment servers after replacing the deployment server URLs accordingly. For example, if switching from deploy1003 to deploy2002:
$ sudo -i
# find /srv/deployment -name DEPLOY_HEAD -print0 | xargs -0 -r sed -i 's/git_server: deploy1003\.eqiad\.wmnet/git_server: deploy2002.codfw.wmnet/'
  • Test a scap deployment, noting that this may take quite some time on the first attempt: scap sync-world "Test deployment to validate deployment server switchover - TXXXXXX". This will also test helmfile deployments.
  • Email ops@lists.wikimedia.org and product-tech-all@wikimedia.org about the switch of active deployment server and update any IRC channels where the ongoing work is being tracked.
  • Update the Deployment_server page to reflect the change in active server.
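
After applying the T197470 workaround, it can be useful to confirm that no DEPLOY_HEAD file still points at the old server. The following is a minimal sketch, not part of scap or the official procedure: stale_deploy_heads is a hypothetical helper name, and it assumes only what the sed command above implies, namely that DEPLOY_HEAD files contain a line of the form "git_server: <host>".

```shell
#!/bin/sh
# Hypothetical helper (not part of scap): print every DEPLOY_HEAD file under
# a directory tree that still references the old deployment server.
# Usage: stale_deploy_heads <root-dir> <old-server-fqdn>
stale_deploy_heads() {
    root=$1
    old_host=$2
    # grep -F matches the host name literally, so the dots in the FQDN are
    # not treated as regex wildcards; -l prints only the matching file names.
    # The exit status is non-zero when nothing matches, so rely on the
    # printed output rather than on the return code.
    find "$root" -name DEPLOY_HEAD -type f \
        -exec grep -lF "git_server: $old_host" {} +
}
```

For the example switch above, this would be run as root on each deployment server as: stale_deploy_heads /srv/deployment deploy1003.eqiad.wmnet. Empty output means every DEPLOY_HEAD file has been updated; any path printed still needs the sed replacement.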