Thread: What goes on during downtime, technically? (software dev curious about other devs)

  1. #1
    Rift Disciple Jegodin's Avatar
    Join Date
    Nov 2013
    Posts
    174

    Default What goes on during downtime, technically? (software dev curious about other devs)

    I'm a software developer and avid Rift player. I wanted to start a topic while there is a lengthy downtime happening on the NA shards about what is going on during these downtimes. Of course anyone who knows anything about computer programming would have a snarky reply to this: "The shards are compiling." However I want to elaborate on this because I don't think it really satisfies anyone, nor does it completely answer the question... I also think this will be an interesting topic for anyone else into development as well as Rift junkies that are waiting out downtime!

    I have mainly worked on web servers that just host some simple application with content and database interactions (shards have this and much more). I'm assuming the process of releasing a patch or update to a web application is a little simpler because it (the web application) is not as complex as a video game. I know the woes of downtime though. One web server takes 15 minutes to get through its deploy process even if I just change 1 line of code, so I could imagine that it would take longer for 1 shard.

    Here are some questions I have technical interest in, and would think it really awesome if a dev who knows about the Rift deployment process could reply.
    1. What causes some updates to take an hour, and some to take 3 hours? (I know this is really general, but any technical example of why this might happen would satisfy me)
    2. Do you guys write any behavior tests for the code and, if so, is running these tests part of the downtime? (i.e. make sure tests are passing before allowing the deployment to complete)
    3. About how much of a new patch release is automated? (i.e. what's getting taken care of via scripts and stuff versus what's having to be done by hand)
    4. Do patches ever just work on the PTS, but then run into trouble when released to the main shards?

  2. #2
    Ascendant
    Join Date
    Aug 2012
    Posts
    4,433

    Default

    Quote Originally Posted by Jegodin View Post
    1. What causes some updates to take an hour, and some to take 3 hours? (I know this is really general, but any technical example of why this might happen would satisfy me)
    2. Do you guys write any behavior tests for the code and, if so, is running these tests part of the downtime? (i.e. make sure tests are passing before allowing the deployment to complete)
    3. About how much of a new patch release is automated? (i.e. what's getting taken care of via scripts and stuff versus what's having to be done by hand)
    4. Do patches ever just work on the PTS, but then run into trouble when released to the main shards?
    Any forum regular can answer those a little.

    1) Some are just simple patches, and some involve physical and software maintenance.

    2) They do, and they even say so in the announcements. On occasion they have extended maintenance when things aren't right.

    3) Probably all of it unless something goes wrong.

    4) Very often. There's rarely a content patch without issues, and sometimes hotfixes have them too.

  3. #3
    Rift Chaser
    Join Date
    Jan 2012
    Posts
    384

    Default

    I would imagine that, given the size of the databases the Rift servers must need, even a simple database check/cleanup/optimisation could run for several hours.

  4. #4
    Shield of Telara Seelinnikoi's Avatar
    Join Date
    Feb 2011
    Location
    London
    Posts
    784

    Default

    Well after working as an admin for quite a while, this is what we used to do:

    1 - Clean the hamsters' cage.

    2 - Place fresh water and food for them.

    3 - Cut a bit of carrot and stick it between the cage bars.

    4 - Pat the hamsters on the back and tell them they are doing a good job.

    5 - Place them back in.

  5. #5
    Ascendant
    Join Date
    Aug 2012
    Posts
    4,433

    Default

    Quote Originally Posted by Seelinnikoi View Post
    Well after working as an admin for quite a while, this is what we used to do:

    1 - Clean the hamsters' cage.

    2 - Place fresh water and food for them.

    3 - Cut a bit of carrot and stick it between the cage bars.

    4 - Pat the hamsters on the back and tell them they are doing a good job.

    5 - Place them back in.
    Gerbils last longer.

  6. #6
    Plane Walker Kittyhawke's Avatar
    Join Date
    Jul 2012
    Posts
    445

    Default

    The hamsters have to eat some time.

  7. #7
    Shield of Telara
    Join Date
    Jan 2011
    Posts
    758

    Default

    Quote Originally Posted by Jegodin View Post
    1. What causes some updates to take an hour, and some to take 3 hours? (I know this is really general, but any technical example of why this might happen would satisfy me)
    2. Do you guys write any behavior tests for the code and, if so, is running these tests part of the downtime? (i.e. make sure tests are passing before allowing the deployment to complete)
    3. About how much of a new patch release is automated? (i.e. what's getting taken care of via scripts and stuff versus what's having to be done by hand)
    4. Do patches ever just work on the PTS, but then run into trouble when released to the main shards?
    I am not a Rift dev, nor do I play one on TV.

    1. Testing. They are probably using some form of automation software, possibly Chef. This will take their code and push it out. There are also different servers doing different things. They may not touch the 'instance' cluster, but touch the 'database' cluster. They may not touch the 'ecom' cluster, etc. If you think in terms of segmentation of processes, then you will understand that they do not have 3 computers running all of this. They have more than likely segmented out a good chunk of things. They probably have a CQ cluster, then the other WFs, then probably a T3 raid cluster and a T1/T2 raid cluster. During that time, they can also balance loads. They see tons of data about what the load is, and they can probably shift capacity around. As an example, if there are 300 guilds in T3 and 20 guilds in T1, and the week before it was the opposite, allocating a few servers from T1 to T3 may make sense. After all the balancing, they need to test it.

    Backups. This is probably the best time for them to do a full backup to tape, if they do one.

    Maintenance. This is the best time to replace that dead drive in the RAID array. This is the best time to replace that server that is acting wonky.

    2. See #1, Chef
    3. See #1, Chef

    4. Probably. At best, I think PTS is a very small cluster and not as segmented. They probably have two, maybe three, servers total for that.

    That ends my 2 cents worth of a guess
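
    The T1/T3 rebalancing example in answer 1 can be sketched roughly as below. This is only an illustration of proportional allocation, not anything known about Trion's tooling: the function name `allocate_servers`, the cluster labels, and the idea of a guaranteed minimum per cluster are all assumptions made for the example.

```python
def allocate_servers(demand, total_servers, minimum=1):
    """Split total_servers across clusters in proportion to demand.

    demand: dict mapping a (hypothetical) cluster name to a load figure,
            e.g. the number of guilds raiding that tier this week.
    Every cluster keeps at least `minimum` servers; the rest are shared
    out with the largest-remainder method so the totals always add up.
    """
    if total_servers < minimum * len(demand):
        raise ValueError("not enough servers to give every cluster its minimum")

    spare = total_servers - minimum * len(demand)
    # If nobody is playing anything, fall back to an even split.
    weights = demand if any(demand.values()) else {name: 1 for name in demand}
    total_weight = sum(weights.values())

    shares = {name: spare * w / total_weight for name, w in weights.items()}
    alloc = {name: minimum + int(share) for name, share in shares.items()}

    # Hand out servers lost to rounding, largest fractional part first.
    leftover = total_servers - sum(alloc.values())
    by_remainder = sorted(shares, key=lambda n: shares[n] - int(shares[n]), reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc


# Last week most guilds were in T1; this week they moved to T3.
last_week = allocate_servers({"T1": 300, "T3": 20}, total_servers=12)
this_week = allocate_servers({"T1": 20, "T3": 300}, total_servers=12)
```

    With 12 servers the split simply flips from week to week, which is the "move a few servers from T1 to T3" step described above. A real operator would also look at CPU and memory load rather than guild counts alone.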

  8. #8
    Official Rift Founding Fan Site Operator bctrainers's Avatar
    Join Date
    Apr 2010
    Location
    Kansas, USA
    Posts
    3,786

    Default

    Quote Originally Posted by Jegodin View Post
    I'm a software developer and avid Rift player. I wanted to start a topic while there is a lengthy downtime happening on the NA shards about what is going on during these downtimes. Of course anyone who knows anything about computer programming would have a snarky reply to this: "The shards are compiling." However I want to elaborate on this because I don't think it really satisfies anyone, nor does it completely answer the question... I also think this will be an interesting topic for anyone else into development as well as Rift junkies that are waiting out downtime!

    I have mainly worked on web servers that just host some simple application with content and database interactions (shards have this and much more). I'm assuming the process of releasing a patch or update to a web application is a little simpler because it (the web application) is not as complex as a video game. I know the woes of downtime though. One web server takes 15 minutes to get through its deploy process even if I just change 1 line of code, so I could imagine that it would take longer for 1 shard.

    Here are some questions I have technical interest in, and would think it really awesome if a dev who knows about the Rift deployment process could reply.
    1. What causes some updates to take an hour, and some to take 3 hours? (I know this is really general, but any technical example of why this might happen would satisfy me)
    2. Do you guys write any behavior tests for the code and, if so, is running these tests part of the downtime? (i.e. make sure tests are passing before allowing the deployment to complete)
    3. About how much of a new patch release is automated? (i.e. what's getting taken care of via scripts and stuff versus what's having to be done by hand)
    4. Do patches ever just work on the PTS, but then run into trouble when released to the main shards?
    By no means an official response, but from what I've learned (and observed) over the years...

    1) Short downtimes (e.g. an hour or less) probably require few changes to the database. That allows QA to check the new additions on live shards real quick. Long downtimes might be due to the ungodly huge database Trion has, just for Rift. Those long downtimes are quite possibly database-wide backups being performed. That way, if "s-word hits the fan", so to speak, a rollback could be performed.

    2) Those are probably done with each new client build that is given to QA.

    3) Fairly certain everything is automated with Trion's build process. The only slowdown is the human verification process, making sure each bit and piece is functioning correctly.

    4) Yes, it has happened a few times in the past. The best, and most recent, example is the chat server daemon. It was implemented on PTS for a while, then put on live, and all hell broke loose.
    --BC
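
    The "back up first so a rollback is possible" idea in answer 1 can be sketched as below. This is a toy model under loud assumptions: the "database" is just an in-memory dict, and `patch_with_rollback`, `apply_patch` and `verify` are hypothetical names invented for the example. A real backup of a large database is what would take hours; here it is a single copy.

```python
import copy


def patch_with_rollback(db, apply_patch, verify):
    """Apply a patch to db, restoring the pre-patch state on any failure.

    db:          a dict standing in for the live database (assumption).
    apply_patch: callable that mutates db in place.
    verify:      callable returning True if db looks healthy afterwards.
    Returns True if the patch stuck, False if it was rolled back.
    """
    backup = copy.deepcopy(db)  # stands in for the full pre-patch backup
    try:
        apply_patch(db)
        if not verify(db):
            raise RuntimeError("post-patch verification failed")
    except Exception:
        # Roll back: throw away whatever the patch did and restore the copy.
        db.clear()
        db.update(backup)
        return False
    return True


def bad_patch(db):
    db["gold"] = -1  # a buggy migration


def good_patch(db):
    db["gold"] += 50


def gold_is_sane(db):
    return db["gold"] >= 0


shard = {"gold": 100}
rolled_back = not patch_with_rollback(shard, bad_patch, gold_is_sane)
applied = patch_with_rollback(shard, good_patch, gold_is_sane)
```

    The point of the sketch is only the ordering: the copy is taken before anything changes, so the cost of the backup is paid on every patch, whether or not it is ever needed.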

  9. #9
    Champion of Telara
    Join Date
    Feb 2014
    Posts
    1,266

    Default

    Most likely one of the biggest differences in time frames comes from any firmware\driver updates to the hardware and any OS security updates or patches that need to be applied prior to installing the patches to the game.

    In an environment like Rift's, you would want each node in the cluster to be brought offline, patched, and then tested before it becomes active in the cluster again. Only once the full cluster was up and stable would you want to begin patching the application\game. This way any bugs introduced could be quickly narrowed down to individual updates and eliminated.

    Yes, they probably certify all of this on an internal test server prior to going to the live ones, but I've worked on server hardware: a nice, easygoing firmware update on one server can turn into a nightmare on another with the same hardware. So best practice is to verify stability on each box.
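
    The node-by-node procedure described in this post can be sketched roughly as follows. It is a minimal illustration, not a description of any real deployment tool: the `Node` class, the `rolling_patch` function and the `health_check` callback are hypothetical names, and "applying" a patch is reduced to recording its name.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    name: str
    active: bool = True
    patches: List[str] = field(default_factory=list)


def rolling_patch(nodes: List[Node], patch: str,
                  health_check: Callable[[Node], bool]) -> List[str]:
    """Patch nodes one at a time and stop at the first failure.

    Returns the names of the nodes that were patched, passed the
    health check, and were put back into the cluster.
    """
    done = []
    for node in nodes:
        node.active = False         # take the node out of the cluster
        node.patches.append(patch)  # apply the firmware/OS update
        if not health_check(node):
            # Leave the bad node offline and halt the rollout, so the
            # failure can be pinned on this one update on this one box.
            return done
        node.active = True          # healthy: rejoin the cluster
        done.append(node.name)
    return done


cluster = [Node("a"), Node("b"), Node("c")]
# Pretend the same update misbehaves on node "b" only.
patched = rolling_patch(cluster, "firmware-update", lambda n: n.name != "b")
```

    In this run only node "a" completes, "b" stays offline for investigation, and "c" is never touched. Only when every node comes back healthy would the game patch itself go on, as the post suggests.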

  10. #10
    Rift Disciple Practicelap's Avatar
    Join Date
    Aug 2013
    Posts
    143

    Default

    Quote Originally Posted by Jegodin View Post
    I'm a software developer and avid Rift player. I wanted to start a topic while there is a lengthy downtime happening on the NA shards about what is going on during these downtimes. Of course anyone who knows anything about computer programming would have a snarky reply to this: "The shards are compiling." However I want to elaborate on this because I don't think it really satisfies anyone, nor does it completely answer the question... I also think this will be an interesting topic for anyone else into development as well as Rift junkies that are waiting out downtime!

    I have mainly worked on web servers that just host some simple application with content and database interactions (shards have this and much more). I'm assuming the process of releasing a patch or update to a web application is a little simpler because it (the web application) is not as complex as a video game. I know the woes of downtime though. One web server takes 15 minutes to get through its deploy process even if I just change 1 line of code, so I could imagine that it would take longer for 1 shard.

    Here are some questions I have technical interest in, and would think it really awesome if a dev who knows about the Rift deployment process could reply.
    1. What causes some updates to take an hour, and some to take 3 hours? (I know this is really general, but any technical example of why this might happen would satisfy me)
    2. Do you guys write any behavior tests for the code and, if so, is running these tests part of the downtime? (i.e. make sure tests are passing before allowing the deployment to complete)
    3. About how much of a new patch release is automated? (i.e. what's getting taken care of via scripts and stuff versus what's having to be done by hand)
    4. Do patches ever just work on the PTS, but then run into trouble when released to the main shards?
    The majority of that downtime is not necessarily due to updates or anything computer programming related. Servers and their support architecture need servicing regularly and it is totally up to the departments responsible to implement a plan that adheres to their SOP to ensure the environment remains stable. I'm not sure if they can actually reply to you with their actual detailed scope of work on Wednesdays, but RIFT is far from the only game with weekly server downtime.

    For instance World of Warcraft and Guild Wars 2 also have a weekly maintenance schedule in which the servers are down for up to 12 hours on the same day every week. Same deal.

    There can be tons of different things going on: updates, db backups, automated checks and balances, manual ones, and even physical preventive maintenance. Obviously even more things if a new patch is dropping. Some of those things would also differ depending on whether their hardware is leased and supported by a contracted entity or all done in house with owned equipment. Mostly simple things, just usually taking a long time because it's a thorough process.

    I'm sure someone from the dev team or community team can elaborate, but it's a fairly routine thing in large online games, specifically MMORPGs.

    It's not going to change any time soon so might as well plan around it.
    Last edited by Practicelap; 05-21-2014 at 03:51 PM.

  11. #11
    Telaran
    Join Date
    Jun 2011
    Posts
    84

    Default

    I thought this was going to be another one of those GIF threads like the rez one was

  12. #12
    Plane Walker
    Join Date
    Jul 2011
    Posts
    494

    Default

    First order of business is taking the mobile servers off-line and forgetting to restart them until Tuesday of the following week.

  13. #13
    Rift Master
    Join Date
    Jul 2010
    Location
    Austria
    Posts
    633

    Default

    Totally awesome how everyone turned into a Dev overnight.

    With all these new Devs the next patches must be wonderful.

  14. #14
    Rift Disciple Practicelap's Avatar
    Join Date
    Aug 2013
    Posts
    143

    Default

    Quote Originally Posted by Timrum View Post
    Totally awesome how everyone turned into a Dev overnight.

    With all these new Devs the next patches must be wonderful.
    I guess you didn't comprehend the part where the OP posted

    "and would think it really awesome if a dev who knows about the Rift deployment process could reply"

    Which means that anyone can reply, but if a RIFT dev does, that would be awesome.

    You don't have to be a software developer to learn what happens during downtime of an online game, especially since planned downtime doesn't always have to do with development. Technical details are not all that different from game to game, and IT operations encompasses a lot more job roles than "dev".
