Review Webserver Metafiles for Information Leakage
This section describes how to test various metadata files for information leakage of the web application’s path(s) or functionality. Furthermore, the list of directories that Spiders, Robots, or Crawlers are asked to avoid can serve as an input to the Map Execution Paths Through Application test. Other information may also be collected to identify attack surface, technology details, or for use in social engineering engagements.
- Identify hidden or obfuscated paths and functionality through the analysis of metadata files.
- Extract and map other information that could lead to a better understanding of the systems at hand.
How to Test
Any of the actions performed below with wget could also be done with curl. Many Dynamic Application Security Testing (DAST) tools such as ZAP and Burp Suite include checks or parsing for these resources as part of their spider/crawler functionality. They can also be identified using various Google Dorks or by leveraging advanced search features.
Web Spiders, Robots, or Crawlers retrieve a web page and then recursively traverse hyperlinks to retrieve further web content. Their accepted behavior is specified by the Robots Exclusion Protocol of the robots.txt file in the web root directory.
As an example, the beginning of the robots.txt file from Google sampled on 2020 May 5 is quoted below:
    User-agent: *
    Disallow: /search
    Allow: /search/about
    Allow: /search/static
    Allow: /search/howsearchworks
    Disallow: /sdch
    ...
The User-Agent directive refers to the specific web spider/robot/crawler. For example, User-Agent: Googlebot refers to the spider from Google, while User-Agent: bingbot refers to a crawler from Microsoft. User-Agent: * in the example above applies to all web spiders/robots/crawlers.
The Disallow directive specifies which resources spiders/robots/crawlers are asked not to retrieve. In the example above, the following are prohibited:
    ...
    Disallow: /search
    ...
    Disallow: /sdch
    ...
Web spiders/robots/crawlers can intentionally ignore the Disallow directives specified in a robots.txt file. Hence, robots.txt should not be considered as a mechanism to enforce restrictions on how web content is accessed, stored, or republished by third parties.
The robots.txt file is retrieved from the web root directory of the web server. For example, to retrieve the robots.txt from www.google.com using curl:
    $ curl -O -Ss http://www.google.com/robots.txt && head -n5 robots.txt
    User-agent: *
    Disallow: /search
    Allow: /search/about
    Allow: /search/static
    Allow: /search/howsearchworks
    ...
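Once the file has been retrieved, the Allow and Disallow entries can be pulled out programmatically so they can feed later path mapping. The following is a minimal sketch rather than a complete Robots Exclusion Protocol parser: the function name `parse_robots` and the grouping logic are illustrative assumptions, and the sample text is the Google excerpt quoted above.

```python
from collections import defaultdict

# Excerpt of the Google robots.txt quoted above, used as offline sample input.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Allow: /search/about
Allow: /search/static
Allow: /search/howsearchworks
Disallow: /sdch
"""


def parse_robots(text):
    """Group Allow/Disallow paths by the User-agent line(s) they follow."""
    rules = defaultdict(lambda: {"allow": [], "disallow": []})
    agents = []        # user agents named by the current group
    in_rules = False   # True once the current group has listed a rule
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if value:  # an empty value restricts nothing, so skip it
                for agent in agents:
                    rules[agent][field].append(value)
    return dict(rules)


rules = parse_robots(SAMPLE_ROBOTS_TXT)
print(rules["*"]["disallow"])
```

The Disallow list is usually the more interesting output for a tester, since it names paths the site owner would prefer crawlers not to visit.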
Analyze robots.txt Using Google Webmaster Tools
Site owners can use the Google “Analyze robots.txt” function to analyze the site as part of its Google Webmaster Tools. This tool can assist with testing and the procedure is as follows:
- Sign into Google Webmaster Tools with a Google account.
- On the dashboard, enter the URL for the site to be analyzed.
- Choose between the available methods and follow the on-screen instructions.
<META> tags are located within the HEAD section of each HTML document and should be consistent across a site, in case the robot/spider/crawler starts from a document link other than the webroot, i.e. a deep link. The Robots directive can also be specified using a specific META tag.
Robots META Tag
If there is no <META NAME="ROBOTS" ... > entry, then the “Robots Exclusion Protocol” defaults to INDEX,FOLLOW respectively. Therefore, the other two valid entries defined by the “Robots Exclusion Protocol” are prefixed with NO, i.e. NOINDEX and NOFOLLOW.
Based on the Disallow directive(s) listed within the robots.txt file in webroot, a regular expression search for <META NAME="ROBOTS" is undertaken within each web page. The result is then compared to the robots.txt file in the webroot.
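The comparison described above might be sketched as follows. This version uses Python's standard `html.parser` instead of a regular expression, which is a substitution made here for robustness. The page paths, the markup in `PAGES`, and the rule that a disallowed page is "inconsistent" when it lacks NOINDEX are all hypothetical choices for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            for directive in attributes.get("content", "").split(","):
                if directive.strip():
                    self.directives.append(directive.strip().upper())


def robots_meta(html):
    """Return the robots directives found in one HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Disallow paths as listed in the robots.txt example earlier in this section.
DISALLOWED = ["/search", "/sdch"]

# Hypothetical pages and markup, standing in for crawled content.
PAGES = {
    "/search/results": '<html><head><meta name="robots" content="noindex, nofollow"></head></html>',
    "/sdch/dictionary": "<html><head><title>No robots tag</title></head></html>",
}

# Flag pages that robots.txt disallows but whose META tag does not say NOINDEX.
inconsistent = []
for path, html in PAGES.items():
    listed = any(path.startswith(prefix) for prefix in DISALLOWED)
    if listed and "NOINDEX" not in robots_meta(html):
        inconsistent.append(path)

print(inconsistent)
```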
Miscellaneous META Information Tags
Organizations often embed informational META tags in web content to support various technologies such as screen readers, social networking previews, search engine indexing, etc. Such meta-information can be of value to testers in identifying technologies used, and additional paths/functionality to explore and test. The following meta information was retrieved from www.whitehouse.gov via View Page Source on 2020 May 05:
    ...
    <meta property="og:locale" content="en_US" />
    <meta property="og:type" content="website" />
    <meta property="og:title" content="The White House" />
    <meta property="og:description" content="We, the citizens of America, are now joined in a great national effort to rebuild our country and to restore its promise for all. – President Donald Trump." />
    <meta property="og:url" content="https://www.whitehouse.gov/" />
    <meta property="og:site_name" content="The White House" />
    <meta property="fb:app_id" content="1790466490985150" />
    <meta property="og:image" content="https://www.whitehouse.gov/wp-content/uploads/2017/12/wh.gov-share-img_03-1024x538.png" />
    <meta property="og:image:secure_url" content="https://www.whitehouse.gov/wp-content/uploads/2017/12/wh.gov-share-img_03-1024x538.png" />
    <meta name="twitter:card" content="summary_large_image" />
    <meta name="twitter:description" content="We, the citizens of America, are now joined in a great national effort to rebuild our country and to restore its promise for all. – President Donald Trump." />
    <meta name="twitter:title" content="The White House" />
    <meta name="twitter:site" content="@whitehouse" />
    <meta name="twitter:image" content="https://www.whitehouse.gov/wp-content/uploads/2017/12/wh.gov-share-img_03-1024x538.png" />
    <meta name="twitter:creator" content="@whitehouse" />
    ...
    <meta name="apple-mobile-web-app-title" content="The White House">
    <meta name="application-name" content="The White House">
    <meta name="msapplication-TileColor" content="#0c2644">
    <meta name="theme-color" content="#f5f5f5">
    ...
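To turn tags like these into leads, a tester could collect every META name or property with its content and then pull out any URL paths. The sketch below works on a shortened copy of the excerpt above; the class name `MetaCollector` is illustrative, and keeping only the last value per key is a simplification.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class MetaCollector(HTMLParser):
    """Map each META name/property to its content value."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("property") or attributes.get("name")
        content = attributes.get("content")
        if key and content:
            self.meta[key] = content  # a repeated key keeps the last value


# Shortened copy of the whitehouse.gov excerpt quoted above.
SAMPLE_HTML = """
<meta property="og:url" content="https://www.whitehouse.gov/" />
<meta property="fb:app_id" content="1790466490985150" />
<meta property="og:image" content="https://www.whitehouse.gov/wp-content/uploads/2017/12/wh.gov-share-img_03-1024x538.png" />
<meta name="twitter:site" content="@whitehouse" />
<meta name="theme-color" content="#f5f5f5">
"""

collector = MetaCollector()
collector.feed(SAMPLE_HTML)

# Paths taken from any content value that is an absolute HTTP(S) URL.
paths = sorted(
    {
        urlparse(value).path
        for value in collector.meta.values()
        if value.startswith(("http://", "https://"))
    }
)

for path in paths:
    print(path)
```

In this sample the `/wp-content/uploads/` path hints at the content management system in use, while the `fb:app_id` and `twitter:site` values identify third-party accounts tied to the site.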
A sitemap is a file where a developer or organization can provide information about the pages, videos, and other files offered by the site or application, and the relationship between them. Search engines can use this file to navigate your site more efficiently. Likewise, testers can utilize ‘sitemap.xml’ files to gain deeper insights into the site or application under investigation.
The following excerpt is from Google’s primary sitemap retrieved 2020 May 05.
    $ wget --no-verbose https://www.google.com/sitemap.xml && head -n8 sitemap.xml
    2020-05-05 12:23:30 URL:https://www.google.com/sitemap.xml  -> "sitemap.xml"

    <?xml version="1.0" encoding="UTF-8"?>
    <sitemapindex xmlns="http://www.google.com/schemas/sitemap/0.84">
      <sitemap>
        <loc>https://www.google.com/gmail/sitemap.xml</loc>
      </sitemap>
      <sitemap>
        <loc>https://www.google.com/forms/sitemaps.xml</loc>
      </sitemap>
    ...
Exploring from there, a tester may wish to retrieve the Gmail sitemap, https://www.google.com/gmail/sitemap.xml:
    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml">
      <url>
        <loc>https://www.google.com/intl/am/gmail/about/</loc>
        <xhtml:link href="https://www.google.com/gmail/about/" hreflang="x-default" rel="alternate"/>
        <xhtml:link href="https://www.google.com/intl/el/gmail/about/" hreflang="el" rel="alternate"/>
        <xhtml:link href="https://www.google.com/intl/it/gmail/about/" hreflang="it" rel="alternate"/>
        <xhtml:link href="https://www.google.com/intl/ar/gmail/about/" hreflang="ar" rel="alternate"/>
    ...
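Sitemaps are regular XML, so the listed locations can be gathered with a few lines of code. The sketch below is one possible approach: it matches elements by local name so that it works for both namespaces seen above, and it runs on a shortened sample whose closing tags were added here to make it well formed. The function name `sitemap_urls` is an assumption.

```python
import xml.etree.ElementTree as ET

# Shortened sample based on the Gmail sitemap excerpt; the closing tags were
# added so the fragment parses as a complete document.
SAMPLE_SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://www.google.com/intl/am/gmail/about/</loc>
    <xhtml:link href="https://www.google.com/gmail/about/" hreflang="x-default" rel="alternate"/>
    <xhtml:link href="https://www.google.com/intl/el/gmail/about/" hreflang="el" rel="alternate"/>
  </url>
</urlset>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def sitemap_urls(xml_bytes):
    """Return every <loc> value and alternate link href, in document order."""
    urls = []
    for element in ET.fromstring(xml_bytes).iter():
        name = local_name(element.tag)
        if name == "loc" and element.text:
            urls.append(element.text.strip())
        elif name == "link" and element.get("href"):
            urls.append(element.get("href"))
    return list(dict.fromkeys(urls))  # drop duplicates, keep order


for url in sitemap_urls(SAMPLE_SITEMAP):
    print(url)
```

Applied to a sitemap index, the same function returns the nested sitemap locations, which can then be fetched and parsed in turn.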
security.txt was ratified by the IETF as RFC 9116, "A File Format to Aid in Security Vulnerability Disclosure", which allows sites to define security policies and contact details. There are multiple reasons why this might be of interest in testing scenarios, which include, but are not limited to:
- Identifying further paths or resources to include in discovery/analysis.
- Open Source intelligence gathering.
- Finding information on Bug Bounties, etc.
- Social Engineering.
The file may be present either in the root of the webserver or in the .well-known/ directory.
Here is a real world example retrieved from LinkedIn 2020 May 05:
    $ wget --no-verbose https://www.linkedin.com/.well-known/security.txt && cat security.txt
    2020-05-07 12:56:51 URL:https://www.linkedin.com/.well-known/security.txt [333/333] -> "security.txt"

    # Conforms to IETF `draft-foudil-securitytxt-07`
    Contact: mailto:firstname.lastname@example.org
    Contact: https://www.linkedin.com/help/linkedin/answer/62924
    Encryption: https://www.linkedin.com/help/linkedin/answer/79676
    Canonical: https://www.linkedin.com/.well-known/security.txt
    Policy: https://www.linkedin.com/help/linkedin/answer/62924
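The "Field: value" layout of security.txt makes it easy to collect its contents for later analysis. The following sketch, using the LinkedIn example above as input, groups values by lowercased field name. The function name `parse_security_txt` is illustrative, and the sketch makes no attempt to handle a PGP-signed file.

```python
# The LinkedIn example quoted above, used as offline sample input.
SAMPLE_SECURITY_TXT = """\
# Conforms to IETF `draft-foudil-securitytxt-07`
Contact: mailto:firstname.lastname@example.org
Contact: https://www.linkedin.com/help/linkedin/answer/62924
Encryption: https://www.linkedin.com/help/linkedin/answer/79676
Canonical: https://www.linkedin.com/.well-known/security.txt
Policy: https://www.linkedin.com/help/linkedin/answer/62924
"""


def parse_security_txt(text):
    """Return a dict mapping each lowercased field name to a list of values."""
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blank lines and comments
            continue
        # Split on the first colon only, since values are often URLs.
        name, separator, value = line.partition(":")
        if not separator:
            continue
        fields.setdefault(name.strip().lower(), []).append(value.strip())
    return fields


fields = parse_security_txt(SAMPLE_SECURITY_TXT)
for name, values in fields.items():
    print(name, values)
```

Fields can appear more than once, as Contact does here, which is why each name maps to a list.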
OpenPGP Public Keys contain some metadata that can provide information about the key itself. Here are some common metadata elements that can be extracted from an OpenPGP Public Key:
- Key ID: The Key ID is a short identifier derived from the public key material. It helps identify the key and is often displayed as an eight-character hexadecimal value.
- Key Fingerprint: The Key Fingerprint is a longer and more unique identifier derived from the key material. It is often displayed as a 40-character hexadecimal value. Key fingerprints are commonly used to verify the integrity and authenticity of a public key.
- Key Algorithm: The Key Algorithm represents the cryptographic algorithm used by the public key. OpenPGP supports various algorithms such as RSA, DSA, and ECC (Elliptic Curve Cryptography).
- Key Size: The Key Size refers to the length or size of the cryptographic key in bits. It indicates the strength of the key and determines the level of security provided by the key.
- Key Creation Date: The Key Creation Date indicates when the key was generated or created.
- Key Expiration Date: OpenPGP Public Keys can have an expiration date set, after which they are considered invalid. The Key Expiration Date specifies when the key is no longer valid.
- User IDs: Public keys can have one or more associated User IDs that identify the owner or entity associated with the key. User IDs typically include information such as the name, email address, and optional comments of the key owner.
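Two of the elements above lend themselves to a short illustration: splitting a User ID into its parts, and relating a Key ID to a fingerprint. The sketch below assumes the conventional "Name (Comment) <email>" User ID layout and a version 4 key, where the Key ID is the low-order 64 bits of the 160-bit fingerprint and the short form is its last eight hexadecimal characters. The sample User ID and fingerprint are made up, and extracting these values from an actual key file is left to an OpenPGP tool.

```python
import re

# Conventional User ID layout: free-text name, optional comment, optional email.
USER_ID_PATTERN = re.compile(
    r"^(?P<name>[^(<]*?)\s*"
    r"(?:\((?P<comment>[^)]*)\))?\s*"
    r"(?:<(?P<email>[^>]+)>)?$"
)


def parse_user_id(user_id):
    """Split a User ID string into name, comment, and email (None if absent)."""
    match = USER_ID_PATTERN.match(user_id.strip())
    return match.groupdict() if match else None


def key_ids(fingerprint):
    """Return (long Key ID, short Key ID) for a version 4 key fingerprint."""
    hex_digits = fingerprint.replace(" ", "").upper()
    return hex_digits[-16:], hex_digits[-8:]


# Hypothetical values for illustration only.
owner = parse_user_id("Alice Example (Security Team) <alice@example.org>")
long_id, short_id = key_ids("0123 4567 89AB CDEF 0123 4567 89AB CDEF 0123 4567")

print(owner)
print(long_id, short_id)
```

Names, comments, and email addresses recovered this way feed directly into the open source intelligence and social engineering uses mentioned earlier.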
humans.txt is an initiative for knowing the people behind a site. It takes the form of a text file that contains information about the different people who have contributed to building the site. This file often (but not always) contains information related to career or job sites/paths.
The following example was retrieved from Google 2020 May 05:
    $ wget --no-verbose https://www.google.com/humans.txt && cat humans.txt
    2020-05-07 12:57:52 URL:https://www.google.com/humans.txt [286/286] -> "humans.txt"

    Google is built by a large team of engineers, designers, researchers, robots, and others in many different sites across the globe. It is updated continuously, and built with more tools and technologies than we can shake a stick at. If you'd like to help us out, see careers.google.com.
Other .well-known Information Sources
It would be fairly simple for a tester to review the RFC/drafts and create a list to be supplied to a crawler or fuzzer, in order to verify the existence or content of such files.
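As a starting point, such a list might be built as in the sketch below. It only covers the files discussed in this section; the base URL `https://www.example.com` is a placeholder, and the names `METAFILES` and `candidate_urls` are illustrative. Entries taken from other RFCs or drafts would be appended to the list in the same way.

```python
from urllib.parse import urljoin

# Metafile locations discussed in this section, relative to the web root.
METAFILES = [
    "robots.txt",
    "sitemap.xml",
    "security.txt",
    ".well-known/security.txt",
    "humans.txt",
]


def candidate_urls(base_url, paths=METAFILES):
    """Build absolute URLs for each candidate path under the given web root."""
    base = base_url if base_url.endswith("/") else base_url + "/"
    return [urljoin(base, path) for path in paths]


# Placeholder target; replace with a host that is in scope for testing.
for url in candidate_urls("https://www.example.com"):
    print(url)
```

The resulting list can be saved to a file and handed to a crawler or fuzzer as its seed list.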
- Browser (View Source or Dev Tools functionality)
- Burp Suite