Public S3 Bucket Search Engine - Open Source Project

History

Recently, I have been working on making my reconnaissance research more efficient. This involved combining multiple scripts, making a CMS that helps organize all the recon notes, and setting up a cron job that runs all the code every day. During this process, I created mini projects to go with the CMS, so that in the end I could combine these projects and complete the CMS easily.

During the process, I found that if I wanted to make the job faster and more efficient, I would need to make the output searchable. One of the prime examples of this is AWS S3 buckets. Companies sometimes leave their sensitive buckets publicly readable. To find buckets possibly owned by companies, I had a CMS input that takes the domains and strips out the names. After that, a bruteforce runs every day and tries to find the buckets. This was done using the following GitHub repo by Tom de Vries: https://github.com/tomdev/teh_s3_bucketeers.

This setup saves the data and checks whether there are new buckets. If there are, it emails me the names of the buckets and the ACL policies they have in place. Soon I realized that I did not just want the email; I wanted to see if there were more buckets, or at least be able to search them. This gave me the idea to create a searchable database. Considering how many domains I was watching, it was definitely not efficient to keep a local DB with this data and also code my own search engine. Instead, I decided to use a hosted service for this. After some research and reading of the docs, I chose Algolia.

Algolia makes it easy to push data to the DB and search it, and they have their own JS client available for the search queries. This made my work much easier.

So let's dig further into the coding portion of this.

Coding/Designing

Getting Targets - Legal Dilemma

Re-inventing the wheel sucks. Instead of coding the same thing twice, I decided to use the public repo mentioned above to find the S3 buckets. Next was finding the targets. The CMS allows you to set up automated recon which grabs the targets for you. However, I also wanted to make a public search engine that anyone could use. To do this, I initially added the targets the recon would run against manually. This was extremely inefficient because I cannot read people's minds, so I could not have every target that people needed. Another option was to grab every company name out there from the Alexa top 1000 list. This, however, highlighted a grey area. Then the bulb went off in my head.

About a year ago, I created something called HackerOne Bot for FB Messenger. It had an option where you could message it saying "Roulette" and it would give you a random company to hack and also say how much it pays. For this to work, it simply grabs the list of public programs on HackerOne and chooses one of them randomly. If a company has a public program on HackerOne, you have permission to hack/test them. So, I used a cURL request to grab every public program on HackerOne. This was done by getting the JSON response of a search request for the keyword type:hackerone and listing the results alphabetically.

I then saved the data to a text file. 
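Roughly, that step looks like this as a Python sketch (the original uses cURL). The endpoint, query parameters, and response fields shown here are placeholders based on the description above, not the exact request, so adjust them to whatever the directory search actually returns:

import requests

SEARCH_URL = "https://hackerone.com/programs/search"   # assumed endpoint
OUTPUT_FILE = "programs.txt"                           # hypothetical file name

handles = []
page = 1
while True:
    resp = requests.get(
        SEARCH_URL,
        params={"query": "type:hackerone", "page": page},
        headers={"Accept": "application/json"},
    )
    resp.raise_for_status()
    results = resp.json().get("results", [])
    if not results:
        break                                          # no more pages
    handles.extend(r["handle"] for r in results)
    page += 1

# Sort alphabetically, as described above, and save one handle per line.
with open(OUTPUT_FILE, "w") as f:
    f.write("\n".join(sorted(handles)) + "\n")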

Creating and uploading to the database

Once the data was saved, I had to run the script to find buckets for each of the domains. This was not hard to do, because a basic loop in bash lets you go through each line in a file and run a script against it. At the same time it was finding the buckets, I wanted it to save the data as soon as it had found all the buckets for a particular domain. This is where I update the database with records.

Let's first start with the bash script that does half of the work. This bash script first opens the txt file and runs a loop until it reaches the end of the file. Inside the loop it calls a bash script that is an edited version of Tom's script mentioned above. After the buckets are found, they are saved in a text file and then the Python file upload.py is called.
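The wrapper itself is bash, but the loop is simple enough that a Python sketch shows the same idea. The file names and the name of the edited bucketeer script are placeholders here, not the actual ones:

import subprocess

PROGRAMS_FILE = "programs.txt"      # hypothetical list of program handles
BUCKETEER = "./bucketeer.sh"        # placeholder name for the edited script

with open(PROGRAMS_FILE) as f:
    for line in f:
        handle = line.strip()
        if not handle:
            continue
        # Bruteforce buckets for this handle; the script is expected to
        # write its findings to a text file named after the handle.
        subprocess.run([BUCKETEER, handle], check=False)
        # Push whatever was found for this handle to the database.
        subprocess.run(["python", "upload.py", handle], check=False)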

The upload file is relatively simple too. It uses sys to get the program handle as input, after which it finds the file associated with it and grabs all the bucket names. Once that is done, it splits each bucket name from its policy. By default, when the bash script outputs the text file with buckets, it also saves the ACL policies next to them. The data is saved in a multi-dimensional array that is set up like this:

buckets = [[],[]]

Within this buckets array, we save the name of each bucket and its policy. Once that is done, the script pushes the data to Algolia.
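In other words, the parsing half of upload.py looks roughly like this, assuming the bash script writes one "bucket-name acl-policy" pair per line to a file named after the handle (the file naming is a placeholder):

import sys

handle = sys.argv[1]                # program handle passed in by the wrapper

buckets = [[], []]                  # [ [bucket names], [acl policies] ]
with open(handle + ".txt") as f:
    for line in f:
        parts = line.strip().split(None, 1)
        if not parts:
            continue
        buckets[0].append(parts[0])                          # bucket name
        buckets[1].append(parts[1] if len(parts) > 1 else "")  # acl policy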

To do this, it uses the Algolia API. If the program does not already exist in the DB, it creates a new record for it in the index. If no bucket was found during the scan, the record is still created, but no buckets are listed in the DB for it. It was designed that way so that, in the long run, if a new bucket is found it can be listed as new and then emailed to the user.
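Continuing the sketch above, the push looks roughly like this with the Algolia Python client (v2-style calls shown). The index name, record shape, and the way an existing record is detected are assumptions for illustration, not the exact backend code:

from algoliasearch.search_client import SearchClient
from algoliasearch.exceptions import RequestException

client = SearchClient.create("YOUR_APP_ID", "YOUR_ADMIN_API_KEY")
index = client.init_index("s3_buckets")        # hypothetical index name

record = {
    "objectID": handle,                        # one record per program handle
    "program": handle,
    "buckets": buckets[0],                     # may be empty if nothing was found
    "policies": buckets[1],
}

try:
    index.get_object(handle)                   # does the program already exist?
    index.partial_update_object(record)        # yes: update its record
except RequestException:
    index.save_object(record)                  # no: create a new record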

Features of the Search Engine

The search engine is still being developed. There are some fixes and updates planned for the future. I am hoping to keep the search engine live for as long as I can; if anything changes in the future, users will know about it. Along with that, this feature will also be integrated into the new CMS that will be released pretty soon. How it works in the CMS will be covered in the blog post that details everything about the CMS.

If any new update is released, it will be highlighted in the GitHub repo. Currently, as previously stated, the list of programs you can search for covers only the programs available in the HackerOne directory. I haven't listed BugCrowd programs just yet, but they will be added in a future release.

If there are programs that are not running on HackerOne or BugCrowd, for example ones that run their own independent program, and you want their buckets/records in the DB, please let me know. You can do this either by messaging me on Twitter (@uraniumhacker) or by opening an issue in the GitHub repo.

Currently, the search engine searches by domain and lists the buckets for it. Alongside each bucket, it also notes the ACL policy:


It is relatively simple for now because there are some limitations to what else it can do; new updates will bring better functionality. One of the policies that is still being worked on is the Write policy. I haven't fully implemented it due to legal complications: in order to test the write policy, I would need to create a test file in the buckets, which could count as hacking without reporting if the bucket is owned by the company.

Limitations and future updates

The main goal is to make this easy, efficient, and useful. I am working to improve the bucket bruteforce script to make it faster and have it look for more things. In addition, I will update the engine soon with more information; for example, it will show the probability that a bucket is owned by the company or not. That way hackers do not have to guess and take a hit to their reputation. Currently, this is not implemented, so be wise when reporting buckets :)

Some extra information 

For companies, law enforcement, lawyers, feds:
This is a public project, so if any user uses it to attack a company and breach their information (for example, publicly disclosing an S3 bucket owned by the company without reporting it, or downloading the data and holding the company to ransom), I will not be responsible for it. If any hacker is found doing this, please do not ask me to take action against them, because I am neither a fed nor do I have the right to take action (this includes banning the hacker from my page). This runs completely on GitHub Pages (github.io), so I cannot ban/unban a user directly.

Also, for companies: if you notice that a bucket is listed that is not owned by your team, feel free to let me know. The goal is to make this accessible to all and also to keep it reasonably accurate. I will personally be changing or modifying the DB as I find new buckets or new information about a bucket.

For hackers:
Hackers are curious. Because this is open source, I can guarantee you will be interested in checking the code. Currently, only the code for https://rojan-rijal.github.io/s3_search is public; it does not include the backend code. If you check the code for s3_search, you will be able to see two Algolia values. They may look like potentially sensitive API keys, but they are not: the first one is the application ID and the second one is the public API key. They are used by the engine to query the database, and you should not be able to modify any DB content with them. If you do find a way to modify any information on there, please report it directly to https://hackerone.com/algolia.

Additionally, just because a bucket is listed for a company you searched for does not mean that the company owns it. There is no direct way of knowing whether a company owns a certain AWS bucket. Unless you are 100% sure that a bucket is owned by the company, do not report it. I will not be responsible if you get 100 N/As on HackerOne or on any other platform. Use this for recon and research purposes.

If at any time during your testing you find an S3 bucket that is not listed here, please feel free to let me know. I will be doing the same and will update the database whenever I manually find new buckets. It is important to know that this engine currently only lists buckets that are readable. I am still listing ACL policies because future updates will use them and look for more policies.

This is an independent project. It does not represent my employer or anyone that I work with. Please do not reach out to them with unwanted questions. If you have questions, feel free to message me.
