On Friday morning all hell broke loose. The support team called me, frantic that the website was broken. I immediately contacted the development team and asked whether they had made any unscheduled changes on Thursday night. They told me that they hadn’t made any changes since Monday afternoon. So this wasn’t a bug in the code; it was a technical problem.
I went to our website to find out what was broken, and everything I checked worked correctly. I contacted the support team again and asked them to explain exactly what was broken. This time they explained that only parts of the site were down, so I asked them to send me a list of broken URLs. Since it takes me an hour to commute to the office, I decided to hunker down and fix this problem from home.
Step 1 - Verify The Problem - This ensures that you understand what the problem is and that you and the person reporting it are talking about the same problem.
When I received the URLs they all worked. It is really hard to fix something if it is not broken. I contacted the support team again and it was still broken from the office. I tried connecting to the office via a VPN, and when I checked the website as if I were located at the office it didn’t work. So I’ve been able to verify the problem, but it seems to be limited to the office and a couple of customers, which is a bit weird.
Step 2 - Occam’s Razor - Use Occam’s razor to come up with the simplest explanations and test each one.
At this stage I decided there were three different things to test:
1. The computer at the office - Try using a different computer at the office - Same error.
2. The server - Try accessing the backup server from the office - No error.
3. The proxy at the office - Modified the hosts file at the office to switch the servers - The error still only affected the live server.
Step 3 - Eliminate Options - Eliminate explanations that aren’t backed up by the data and cross-reference the results to figure out what might cause the problem.
So at this stage I know:
1. From home I can access the backup site without errors.
2. From home I can access the live site without errors.
3. The office can access the backup site without any problems.
4. The office can access the live site but some URLs return blank pages or have their connections dropped.
The only difference in the data is that some URLs can’t be accessed from the office to the live server.
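The four observations above can be written down as a small table and queried, which makes the elimination mechanical. This is only a minimal sketch: the names `observations`, `failing_combinations` and `always_ok` are made up for illustration, and the True/False values simply restate the list above.

```python
# Observations from the list above: (client location, server) -> worked without errors?
observations = {
    ("home", "backup"): True,
    ("home", "live"): True,
    ("office", "backup"): True,
    ("office", "live"): False,  # some URLs blank or connections dropped
}


def failing_combinations(obs):
    """Return the (location, server) pairs that showed errors."""
    return sorted(pair for pair, ok in obs.items() if not ok)


def always_ok(obs, position):
    """Return the values of one factor (0 = location, 1 = server)
    for which every recorded observation worked."""
    values = {pair[position] for pair in obs}
    return sorted(
        value
        for value in values
        if all(ok for pair, ok in obs.items() if pair[position] == value)
    )


print(failing_combinations(observations))  # the only failing pair
print(always_ok(observations, 0))          # locations that never failed
print(always_ok(observations, 1))          # servers that never failed
```

Neither the office nor the live server is broken on its own; only the combination of the two fails.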
Step 4 - Isolate The Problem. - Try to find the boundary of what breaks and what doesn’t break.
At this stage I had no idea what caused the error, but I knew that if I could isolate exactly what was causing the problem then I might be able to fix it or work around it. I knew it wasn’t:
1. The user’s computer (Eliminated because other computers at the same office had the same problem)
2. The proxy (A similar URL on the backup server worked)
3. The server (Everything worked from another location)
So I decided to try variations of the URL:
1. http://www.example.com/module/user.php?uid=123 - Was Broken - It returned a blank page at the office but the correct page from another location.
2. http://www.example.com/module/user.php - Worked - It displayed the same page at both locations.
3. http://www.example.com/module/user.php?id=123 - Worked - It displayed the same page at both locations.
4. http://www.example.com/index.php - Worked - It displayed the same page at both locations.
5. http://www.example.com/index.php?uid=789 - Was Broken - It returned a blank page at the office but the correct page from another location.
It took almost 30 minutes of trying different combinations to obtain the 5 URLs above, which enabled me to solve the problem. Based on these URLs it is possible to figure out that the problem is related to the query string (the part of the URL after the question mark), specifically any query string that contains ‘uid’. I decided to try some more tests:
6. http://www.backup.com/index.php - Worked - It displayed the same page at both locations.
7. http://www.backup.com/index.php?uid=789 - Worked - It displayed the same page at both locations.
Combining the results of all these tests allowed me to refine the problem to:
- There is a problem accessing the live server from the office using a URL that contains ‘uid’.
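The comparison of URLs 1 to 5 can also be done mechanically: collect the query-string parameter names of the broken URLs and subtract any name that also appears in a working URL. The sketch below uses only the five live-server results seen from the office; the function names `param_names` and `suspect_params` are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

# (URL, worked from the office?) for tests 1-5 against the live server.
results = [
    ("http://www.example.com/module/user.php?uid=123", False),
    ("http://www.example.com/module/user.php", True),
    ("http://www.example.com/module/user.php?id=123", True),
    ("http://www.example.com/index.php", True),
    ("http://www.example.com/index.php?uid=789", False),
]


def param_names(url):
    """Return the set of query-string parameter names in a URL."""
    return set(parse_qs(urlsplit(url).query))


def suspect_params(results):
    """Parameter names present in every broken URL and in no working URL."""
    broken = [param_names(url) for url, ok in results if not ok]
    working = [param_names(url) for url, ok in results if ok]
    if not broken:
        return set()
    in_every_broken = set.intersection(*broken)
    in_any_working = set().union(*working)
    return in_every_broken - in_any_working


print(suspect_params(results))
```

The path (`/module/user.php` or `/index.php`) and the parameter value play no part in the result, which matches what the manual tests showed.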
At this stage I was on a roll and I felt like I was making progress. Can we isolate the problem in any other ways? Well, what are the differences between the backup server and the live server?
- The OS is the same.
- The code is the same.
- The database is the same.
- The live server is located in Hong Kong.
- The backup server is located in China.
Maybe the problem is related to the location of the servers. Well that’s fairly easy to test:
8. http://www.google.com/?uid=123 (In America) - Was Broken - It returned a blank page at the office but the correct page from another location.
9. http://www.google.cn/?uid=123 (In China) - Worked - It displayed the same page at both locations.
10. http://www.adinobro.com/?uid=123 (My personal site in America) - Was Broken - It returned a blank page at the office but the correct page from another location.
I tested a number of other servers all over the world and basically any server outside of China was broken. Around this time some of the previous URLs started to work intermittently from the office.
The problem can now be defined as:
- There is a problem accessing any server outside of China from the office using a URL that contains ‘uid’.
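One way to sanity-check this definition is to turn it into a rule and compare it against all ten tests. The sketch below is an illustration under assumptions: the function `predicted_blocked` is hypothetical, the server locations are the ones stated in the post (Hong Kong counted as outside China, as the results suggest), and the intermittent successes mentioned above are ignored.

```python
from urllib.parse import parse_qs, urlsplit


def predicted_blocked(url, server_in_mainland_china):
    """Hypothesis: blocked if the query string has 'uid' and the server is outside China."""
    has_uid = "uid" in parse_qs(urlsplit(url).query)
    return has_uid and not server_in_mainland_china


# (URL, server in mainland China?, worked from the office?) for tests 1-10.
office_observations = [
    ("http://www.example.com/module/user.php?uid=123", False, False),
    ("http://www.example.com/module/user.php", False, True),
    ("http://www.example.com/module/user.php?id=123", False, True),
    ("http://www.example.com/index.php", False, True),
    ("http://www.example.com/index.php?uid=789", False, False),
    ("http://www.backup.com/index.php", True, True),
    ("http://www.backup.com/index.php?uid=789", True, True),
    ("http://www.google.com/?uid=123", False, False),
    ("http://www.google.cn/?uid=123", True, True),
    ("http://www.adinobro.com/?uid=123", False, False),
]

# A mismatch is a URL predicted blocked that worked, or predicted fine that failed.
mismatches = [
    url
    for url, in_china, worked in office_observations
    if predicted_blocked(url, in_china) == worked
]
print(mismatches)
```

An empty list means the rule is consistent with every recorded test. It does not prove the rule, but a single mismatch would have disproved it.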
Here is a little bit of background information about the Internet in China:
The ‘Great Chinese Firewall’ is not actually a single firewall. Each ISP in China has different rules and filters traffic differently. While they all block traffic differently, the end user’s experience is always the same - A blank web page is displayed or the browser reports that “The connection has been reset”. Also, if the ISP has too much traffic then the excess traffic is often passed through without being filtered.
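When testing many URLs it helps to record which of these two symptoms each one shows. The following is a rough sketch using Python's standard `urllib`; the names `classify` and `probe` are hypothetical, and treating an empty response body as a "blank page" is an assumption about what the browser was showing.

```python
import urllib.error
import urllib.request


def classify(body, error):
    """Map a fetch outcome onto the symptoms described above."""
    if isinstance(error, ConnectionResetError):
        return "connection reset"
    if error is not None:
        return "other error"
    if not body.strip():
        return "blank page"  # assumption: blank page means an empty body
    return "ok"


def probe(url, timeout=10):
    """Fetch a URL once and classify the outcome. Needs network access."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return classify(response.read(), None)
    except urllib.error.URLError as exc:
        # urllib wraps connection-level failures; the original exception is in .reason
        reason = exc.reason if isinstance(exc.reason, Exception) else exc
        return classify(b"", reason)
    except OSError as exc:
        # e.g. a reset that happens while the body is being read
        return classify(b"", exc)
```

Because excess traffic may pass through unfiltered, a single "ok" from `probe` does not show that a URL is unblocked, so each URL is worth probing several times.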
So after eliminating all the simple explanations, the only explanation left is that:
- The ISP that the office used was blocking any URLs to servers located outside of China (including Hong Kong) that contained ‘uid’ in the query string.
I did a final check and verified that the users that were having problems were using the same ISP that our office used. While I was on the phone with the customers the problem magically went away.
Why would the ISP go out of its way to block ‘uid’? Well, for a while some of us (foreign IT workers located in China) have suspected that Chinese ISPs temporarily block access to sites and services not located in China to encourage companies to move their servers into a Chinese data center, which costs significantly more and where the servers can be seized by the Chinese government. If they block the site in an obvious way then users will notice and get annoyed, but if they block parts of the website then it looks like the website is broken. There is no way that companies can predict what will be broken next, so if they want to be reliable then they have no choice but to move into a Chinese data center.
I was aware of these rumours, but I thought Hong Kong would be treated as if it were part of China. Apparently not, according to our ISP.
So remember if you are trying to solve a problem:
- Verify the problem by replicating it.
- Occam’s Razor - Find the simplest explanations for the problem and test them.
- Eliminate options that aren’t possible.
- Isolate exactly what is causing the problem.
Once you have found the problem you can fix it.