As a test I was asked the following question
“describe how the internet works. Be detailed. Ignore the physical layer. You have 1 hour and no reference materials.”
Wow. It was kind of fun though. Below the jump is How The Internet Works by Me. Typos and all. Total stream of consciousness writing. Honestly, reference materials would have just slowed me down.
If you think it’s easy, get out a timer, and go - just don’t read mine first. ;-)
How the internet works:
At a very high level the internet is made up of lots of little networks which have agreed to use a common protocol and addressing scheme. This allows messages to travel from network to network - and just about any host (a host is a computer, server, router, printer - anything with an address) on the internet to talk to just about any other host.
Now, that’s all well and good, but there are several major flaws with the system described above. First off, if anyone can talk to anyone, then bad guys are able to do bad things to hosts they should not have access to (like ATM’s). Secondly, there are a finite number of addresses in the addressing scheme that was developed way back in the stone ages.
We’ll talk about the flaws in a bit. First we need a basic understanding of the lingo.
Host (as described above) is a computer, printer, router, or anything else that talks on the network (iPhones…)
An IP address is just like a postal address - and you read it very similarly. It looks like XXX.XXX.XXX.XXX with the last number being the most specific and the first being the most general. Like 1234 anystreet, anytown in reverse.
A MAC address is the physical address of a network card/device. IP addresses are translated to MAC addresses late in the game.
A router reads the IP address and determines if it needs to send the message along to another network. For example, if you are on the 10.10.0.0 network and you are sending a message to 192.168.0.45 - you will need to use a router to pass the packets between the subnets. Routers use Routing tables to find other networks. Since having a table of every network on the internet would be to big for a router to parse rapidly, we use several protocols for routers to determine who is next in line on the trail to the destination. Some of those protocols have odd names like OSPF, BGP or RIP. Misconfigured routers can cause all kinds of headaches including the dreaded “routing black hole” (insert scary music).
A subnet is a logical IP address grouping that might map to a physical network. It’s defined by the Subnet Mask. An IP address might be 10.10.0.2/255.255.255.0. that says that the host is only defined by the last octet (number) of the IP address (the 2). The rest of the address (the 10.10.0) defines the network. Notice that their is a distinction between 10.10.0.0 and 10.11.0.0 and 10.10.10.0 - all are different networks. Supernetting is used to fuse two or more subnets together into one network (this is used when you have several ip address ranges, and you don’t want the overhead of routing between them)
A gateway is the router that handles traffic in and out of the network that you live on.
A Firewall is a special type of router that examines messages (packets) to insure that they are not from bad guys. Firewalls also will disallow traffic to and from certain hosts, or only allow traffic in response to a request from an internal host. Many firewalls also do NAT.
NAT allows us to map internal IP addresses to external. You can do a one to one mapping (each internal address has a corresponding external address), or many to one mapping (several addresses internally share the same external address). NAT can cause problems with certain internet protocols - and some protocols have been developed which can traverse NAT routers with variable success.
Anyways, now that we have that out of the way, we can press on and talk about the two main issues. We are quickly running out of those addresses - so there is a new version of the addressing scheme that is being touted as the new hotness. Not many people use it yet, however it is slowing growing. One of the hindrances to adoption has been the universal use of NAT. What NAT does is allows IP addresses to be mapped from one to another. Now, there are several ranges of IP addresses which were declared “off limits” by the great grey beards in the sky back in the stone ages, so they cannot cross the internet. Properly configured routers will simply discard (drop) packets that are to or from such an address. NAT allows us to map these “internal” addresses to an external “internet” IP address.
Most SOHO gateways use NAT these days. It allows many internal machines to share 1 or a small number of external addresses. This has relived the burden on the IP Addressing space somewhat. Now not every machine needs an internet IP address - just some of them.
However, we are rapidly approaching “peak IP”, when the last IP address will be handed out. IPv6, the new hotness, will allow for a quantum leap in addressing space, and hopefully we will not need IPv7 for many many years.
One of the other major issues with the internet is (remember that scary music? yeah - play it now) Bad Guys. Bad Guys can do all sorts of nasty things, but the most common today seems to be taking a machine with an OS from the seattle area, inserting evil code, and using it for Distributed Denial of Service attacks, spam generate, phishing attacks and other bad things. They are not called Bad Guys for nothing
There are several things that can be done to keep the Bad Guys at bay. The first, is to properly secure your system before going out on the information superhighway. Patches, firewalls, Antivirus scanners are all a good idea. A better idea is to run a system that was designed with security in mind (*nix comes to mind) so you have less to worry about.
Internet based applications are another story entirely. Their issues are ones of scale, and traffic. If you get popular, how do you deal with all of the traffic? This is a harder problem then security and IP addressing ranges.
I actually wrote a paper at my last job detailing out a high level overview of scaling web applications. Basically, it comes down to this: Break your application into logical parts and make them do their job with as much speed as possible. Add hardware to the places that need it.
Virtualized hardware is a huge win here, as it can be added very quickly. Being able to spin up “servers” quickly (minutes or hours instead of days) and put them in the systems that need them in response to detected issues is a game changer.
Load Balancing (LB) is also huge. Many LB’s use NAT - so you have 1 IP address outside accepting requests, and they use some logic to split the traffic between several machines behind them. You can layer your LB’s even
In an extreme case it might look something like this:
Internet –> LB –> Content Servers –> LB –> Application Servers –> LB –> Database servers
The internet traffic hits the external LB, and is split into several content servers. If the content is a hard asset (a picture, or a cache item) it’s simply delivered back to the LB and sent on it’s merry way. If not, then the request hits another LB (which might be built into the content server) and the traffic is routed to a not so busy application server. If the app server needs to make a DB call, then the request hits another LB, and is routed to a DB that can handle the request and that is not busy.
In the past I have found that Nginx is a great content server. The application server depends on the back end application (mongrels/thin/passenger for rails, tomcat etc for java…), and not all DB’s support load balancing (at least for writes).
You need to have DB, App and Asset caching in there somewhere. The basic rule of thumb is that if you have to hit disk you are dead in the water. Disk access can add hundreds of milliseconds to a request. It a border line system or during a traffic spike requests can start to queue up behind the slow requests, and you have unhappy customers.