Over the past weekend the online shop of SpaceCat stopped working.
That was the result of some unfortunate decisions and the fortunate situation of being featured. Anyway, it should never have happened, but I re-learned a few lessons.
Problem #1: The shop is our own implementation
I always argue against reinventing the wheel, but in this case we did. Had we used a third-party system such as TapJoy or ScoreLoop (which are already integrated), this would never have happened, partly because their code is better tested, partly because it runs on their servers.
Why did we do it?
Because we wanted a shop that was not bound to Android Market alone, so we could distribute the game over other channels such as Amazon AppStore. We integrated PayPal for that. And we had to implement tracking of purchases per user, which is done for you when you use Android Market IAB.
ScoreLoop did not support it. We could have done it with TapJoy, but the initial release did not include that library, since this was originally designed for Chalk Ball.
In summary, it was the result of legacy decisions that could have been changed later on, but we did not want to throw our work away.
Was it worth it?
Hell no! PayPal is a nightmare from the accounting point of view (not entering other discussions about PayPal here) and in the end we did not deliver SpaceCat through other channels, so 100% not worth it.
Problem #2: The shop was hosted on a shared server
Yes, we just put it on a server we used for hosting other websites. At the time it looked like a good idea, and it was not like we were making too many requests. After all, users don’t open the shop that often; they mostly play, and that does not require the server.
Why did we do it?
Because we did not have a dedicated server and setting one up just for this sounded like overkill.
Was it worth it?
Yes, until the moment SpaceCat was featured. We saved the cost of a dedicated server for 6 months, while the game was not popular enough.
We should have prepared the migration at the very moment SpaceCat was featured, but honestly I did not see this coming.
Problem #3: The URL of the API was hardcoded
Yes it was. What else do you want me to say?
It required an update of the game to get it fixed. Had this happened to an iOS game, we might have been unable to fix it for days.
Why did we do it?
Good question. Because I did not think of it at the moment. No excuse.
Was it worth it?
No, it could have been a much worse scenario. I am glad it only required an update and that the continuous integration system was properly set up so we could do it quickly.
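One way to avoid shipping a fixed API URL is to let the app look it up at startup and fall back to a built-in value. The following is only a minimal sketch in Python, not the code SpaceCat uses: the URLs, the `shop_api_url` key and the `resolve_api_url` helper are all hypothetical names made up for illustration.

```python
import json
from typing import Callable, Optional

# Hypothetical default; this URL does not come from the post.
DEFAULT_API_URL = "https://shop.example.com/api"


def resolve_api_url(fetch_config: Callable[[], str],
                    cached_url: Optional[str] = None) -> str:
    """Return the shop API base URL.

    Preference order: remote config, then the last cached value,
    then the default compiled into the app.
    """
    try:
        # fetch_config is whatever downloads the small JSON config document.
        config = json.loads(fetch_config())
        if isinstance(config, dict):
            url = config.get("shop_api_url")  # hypothetical key name
            if isinstance(url, str) and url.startswith("https://"):
                return url
    except (OSError, ValueError):
        # Network failure or malformed JSON: fall through to the fallbacks.
        pass
    return cached_url or DEFAULT_API_URL
```

The config document itself still has to live at a fixed address, so this only helps if that address is on a domain you control and can repoint.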
What happened exactly?
Once SpaceCat got featured, we started getting around 50,000 new users a day, and 50% of them opened the shop. So after a week, the database had grown from 50,000 to 300,000 rows.
In addition to that, we started having more than 4 requests per second hitting that table. While this should not be that bad, it was too much for a shared server and the account was suspended.
The fix
The solution was to set up dedicated hosting so we could have total control and maybe install tools other than MySQL in the future if required.
Once the host was delivered (and the company did a very good job by completing it in less than 5 hours), installing the shop code and migrating the data took less than an hour.
Then we had to reconfigure SpaceCat and publish an update on Android Market. I’m glad to have Jenkins in place.
All in all the system was down only for 12 hours, which is not that bad.
Even more
In addition to that, once the dedicated server was up, I noticed that the table was slow even though the server was not heavily loaded. I looked at everything again and noticed that it was missing an index.
How did that happen? I don’t know; I was sure I had it in place. Probably because we lacked a proper deploy method and that index only existed on the dev environment. Probably also because of the lack of stress tests: until it reached 300,000 entries and 4 requests per second, the table was not performing that badly and we never noticed. This is, again, a problem of implementing your own solution.
Now the system is good and working, but it has given me too many headaches over the past days so, again, if you are going to create an app that uses a service and it has the potential to be used by many people, use a third-party solution if one is available.
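A missing index like this can be spotted by asking the database how it plans to run the lookup. The sketch below uses Python's built-in SQLite as a stand-in for MySQL (where `EXPLAIN SELECT ...` plays the same role); the `purchases` table, its columns and the index name are assumptions for illustration, not the real shop schema.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable steps SQLite plans to use for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The description is the last column of each plan row.
    return [row[-1] for row in rows]


conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per purchase, looked up by user.
conn.execute(
    "CREATE TABLE purchases (id INTEGER PRIMARY KEY, user_id TEXT, item TEXT)"
)
conn.executemany(
    "INSERT INTO purchases (user_id, item) VALUES (?, ?)",
    ((f"user{i}", "coins") for i in range(10000)),
)

lookup = "SELECT item FROM purchases WHERE user_id = ?"

# Without an index the plan is a full table scan.
plan_before = query_plan(conn, lookup, ("user42",))

# After adding the index the plan becomes an index search.
conn.execute("CREATE INDEX idx_purchases_user ON purchases (user_id)")
plan_after = query_plan(conn, lookup, ("user42",))

print(plan_before)
print(plan_after)
```

The exact wording of the plan varies between SQLite versions, but the first plan reports a scan of the table and the second one names the index. Running the same check against production after each deploy would catch an index that only exists on the dev environment.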
I would say that growing pains are a good sign that the game is doing well. Congrats!
I’d like to float some ideas and see what your take on them is.
If the URL was hardcoded, my first idea would have been to deploy a new server and change the DNS. I don’t know exactly what the needs of the store are. If you are just selling goods and not doing any additional checks to validate existing users’ products, then you don’t even need to worry about DB replication.
It would not have provided an immediate solution for all the users, but as the updated DNS entries propagated you would have seen the load move from one server to the other, mitigating the problem without requiring the users to update the app.
Did you think of it? Do you think it would have worked for you?
Have you load tested your backend to know when the next growing pains may hit?
Well, it was hardcoded to a domain I did not own, so it was not possible to change the DNS.
What I was considering was to set up a redirect on the host, but redirecting POST requests is not straightforward and the original host was suspended at the time, so that was not possible either.
About testing the new server: I am monitoring it to see how much load it can handle (that’s how I discovered the missing index), but I want to set up some sort of alerts to be warned in advance, probably with Nagios.
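On the point about redirecting POST requests: many clients turn a POST into a GET when they receive a 301 or 302, while a 307 asks the client to repeat the same method and body at the new address. Below is a minimal sketch of such a redirect using Python's standard library. `NEW_BASE_URL` and `make_server` are hypothetical names, and whether the game's HTTP client would actually follow a 307 for a POST is an assumption that would need checking.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical new location of the shop API; not taken from the post.
NEW_BASE_URL = "https://shop.example.com"


class RedirectHandler(BaseHTTPRequestHandler):
    """Answer every GET or POST with a 307 pointing at the new host."""

    def _redirect(self):
        # Read and discard the request body so the connection stays clean.
        length = int(self.headers.get("Content-Length", 0))
        if length:
            self.rfile.read(length)
        # 307 tells the client to resend the same method and body.
        self.send_response(307)
        self.send_header("Location", NEW_BASE_URL + self.path)
        self.send_header("Content-Length", "0")
        self.end_headers()

    do_GET = _redirect
    do_POST = _redirect

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def make_server(port=0):
    """Create the redirecting server; port 0 picks any free port."""
    return HTTPServer(("127.0.0.1", port), RedirectHandler)
```

Calling `make_server(8080).serve_forever()` would run it. None of this helps, of course, when the old host is suspended and cannot answer at all.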
Hey,
Nice problems. It’s always nice to have problems related to many users liking the stuff you write...
Stuff:
* Stay away from Nagios and use Icinga instead: https://www.icinga.org/
* Stick with MySQL if you do anything relational... or accounting/payments. I know all the NoSQL kids are way cooler, but it’s nice to have your payment-related data in something like MySQL. Don’t use the Oracle version, get your fix here: http://www.percona.com/downloads/ They also have a web-based MySQL configuration generator that I can highly recommend...
And a final shameless plug: https://github.com/ramonvanalteren/jenkinsapi
You need to use current master; PyPI is foobar and currently being fixed by Salim.
But you might like it, and I’ll personally code any feature requests.
Cheers Ramon
Many thanks for your links and suggestions, Ramon.
I remember when you said: “Forget about buildbot, use Jenkins”, and Icinga looks very nice.