Introducing Pyoneer – A Tool to Find Sensitive Data

Pyoneer was created to assist with the search for sensitive information while on customer engagements. The tool has been used in different scenarios, not just penetration testing, but that is where the tool’s development began. Pyoneer’s base script was written overnight in a hotel room on an engagement. The idea came while completing another script, Spyder, which ingests a CSV file and mounts shares: “Wouldn’t it be great to have something to automatically scan these shares?” A quick search for an open-source tool turned up nothing, so I began writing the foundation of the script. It was in no way ready during the engagement, and development continued at home. It took roughly a week to complete the script.

While using the script on engagements, it was clear that it lacked functionality and speed. The first iteration treated all files as flat files, ran regex searches based on a set of search terms, and wrote a very basic log file. This led to figuring out how to scan specific file formats and the relevant data within files instead of the entire file. Enter file extension checking and “processing”. Functions were created to process doc(x), xls(x), and pdf files, and to run OCR on images (jpg, png, tiff). Each format has its own requirements for pulling only the data you want. Office documents created before Office 2007 can be treated as OLE files, while documents from 2007 and later need to be processed as zip files, with specific XML tags containing the relevant data. Images require an OCR processor. PDFs need to be processed per page, and their images need to be extracted and run through an OCR processor separately.
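
As an illustration of the zip-and-XML approach for Office 2007+ documents, here is a minimal sketch using only the Python standard library. It is not Pyoneer's actual code: the function names docx_text and find_terms are hypothetical, and it only reads the main document body (word/document.xml), ignoring headers, footers, and comments.

```python
import re
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by word/document.xml inside a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(path):
    """Return the body text of a .docx file, one paragraph per line."""
    # A .docx file is a zip archive; the body lives in word/document.xml.
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Text is split across <w:t> runs, so join the runs per paragraph.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def find_terms(text, pattern):
    """Return every case-insensitive match of a pipe-separated regex."""
    return [m.group(0) for m in re.finditer(pattern, text, flags=re.IGNORECASE)]
```

Joining the runs before searching matters because Word often splits a single word across several runs, so a regex applied to the raw XML could miss it.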

Ok awesome, now I’m looking at the data I need, but do I need to open and scan every file or folder? Engagements are typically a week long, which doesn’t allow much time to look for sensitive information, even with automated scanning. Excluding as much as possible without missing important information is a balancing act. For this I added filename, folder, and extension exclusion lists to reduce the overall number of files being scanned. At the same time, options to check for database, virtual machine, and ransomware files were added.
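
The exclusion lists can be sketched with os.walk roughly as follows. The variable names excludeExt, excludeFile, and excludedirs come from the script's variables, but the example values, the case-insensitive matching, and the helper iter_candidate_files are assumptions made for illustration.

```python
import os

# Example values only; the real lists are set in the script's variables.
excludeExt = {".exe", ".dll", ".iso"}
excludeFile = {"thumbs.db", "desktop.ini"}
excludedirs = {"windows", "program files", "$recycle.bin"}


def iter_candidate_files(root_path):
    """Yield paths under root_path that survive the three exclusion lists."""
    for folder, dirnames, filenames in os.walk(root_path):
        # Pruning dirnames in place stops os.walk from descending into them.
        dirnames[:] = [d for d in dirnames if d.lower() not in excludedirs]
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            if name.lower() in excludeFile or ext in excludeExt:
                continue
            yield os.path.join(folder, name)
```

Pruning excluded folders before descending is where most of the time savings would come from, since whole subtrees are never listed at all.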

Sweet, this thing is starting to run through file systems faster and produce more meaningful results. Is there any way to further speed up the script? What about multi(threading/processing)? What about limiting concurrent matches in a single folder? What about being able to resume the script? The functionality could be endless...

AsyncIO and Concurrent Futures were added to help with multi-processing and file context handling. In CPython, the global interpreter lock keeps threads from running Python code truly in parallel, so AsyncIO/Concurrent Futures is something of a band-aid, but it works well enough. Integrating the two sped up the script quite a bit. File match limiting per folder was also added. The limit is set via a variable and helps avoid scanning a folder with a large number of files that might contain the same data. The limit is based on the number of consecutive matches for a file extension, e.g. 10 consecutive matches for .docx in the same folder. The ability to resume the script was added at the same time; this option is set via a Boolean variable in the script. It works by reading the last line of the output file and parsing the file path. The script then loops through the files until it finds a match for that path, then picks up where it left off and starts scanning files again.
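
The resume logic described above might look something like the sketch below. It assumes, hypothetically, that the output file is a CSV whose first column is the scanned file path, and that scanning restarts with the file after the last recorded one; the function names are illustrative and not taken from Pyoneer.

```python
import csv


def last_scanned_path(output_path):
    """Return the file path on the last non-empty row of the output CSV.

    Assumes (hypothetically) that the path is the first column of each row.
    Returns None if the file has no rows.
    """
    last_row = None
    with open(output_path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            if row:
                last_row = row
    return last_row[0] if last_row else None


def skip_until(paths, resume_from):
    """Yield only the paths that come after resume_from.

    If resume_from is None, nothing is skipped.
    """
    skipping = resume_from is not None
    for path in paths:
        if skipping:
            if path == resume_from:
                skipping = False  # found the last scanned file; resume after it
            continue
        yield path
```

This approach only works if the files are walked in the same order as in the interrupted run. If the recorded path is never found, for example because the file was deleted, this sketch yields nothing, so a real implementation would want a fallback.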

While the script has evolved quite a bit from where it first began, there is still more I’d like to add:

  1. OCR for images in PDF files
  2. User input on the CLI for user options
  3. Expanding the database functionality
  4. Expanding the regex patterns
  5. Searching filenames for search terms
  6. Better output
  7. Add a data size search limit, e.g. search only the first 25 MB of a file

Pyoneer is publicly available on the Stern Security GitHub.

There are some required Python modules:

Pyoneer has a number of variables that enable and disable functionality or set limits and paths:
excludeExt - list of file extensions that are excluded.
excludeFile - list of filenames (with extension) that are excluded.
excludedirs - list of folder names that are excluded.
checkfordb - Boolean true/false to check for the file extensions listed in the dbExt variable.
dbExt - list of database file extensions to check for.
checkforransom - Boolean true/false to check for the file extensions listed in the ransomExt variable.
ransomExt - list of ransomware file extensions to check for.
Checkforvm - Boolean true/false to check for the file extensions listed in the vmExt variable.
vmExt - list of virtual machine file extensions to check for.
searchTerms - regex list of words to search for, separated by a pipe ( | ).
rootPath - the path that you want to search, e.g. /mnt
outputPath - the path where you want the output file to be placed, e.g. /home/output.csv
resumescript - Boolean; if true, the script reads the last line of the output file, searches for that file path, and resumes scanning once it is found.
matchlimit - the consecutive file extension match limit per folder.
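
To make the matchlimit behavior concrete, here is one possible way to track consecutive matches per extension within a folder. The class FolderMatchLimiter and its methods are hypothetical names for this sketch, and resetting the streak on a non-matching file is an assumption about what "consecutive" means in the script.

```python
import os

matchlimit = 10  # example value; the script sets this via a variable


class FolderMatchLimiter:
    """Track consecutive matches per file extension within one folder."""

    def __init__(self, limit):
        self.limit = limit
        self.folder = None
        self.streaks = {}

    def _ext_in_folder(self, path):
        """Return the extension of path, clearing streaks on a new folder."""
        folder, name = os.path.split(path)
        if folder != self.folder:
            self.folder, self.streaks = folder, {}
        return os.path.splitext(name)[1].lower()

    def should_scan(self, path):
        """Return False once this extension has hit the limit in the folder."""
        ext = self._ext_in_folder(path)
        return self.streaks.get(ext, 0) < self.limit

    def record(self, path, matched):
        """Extend the streak on a match and reset it on a miss."""
        ext = self._ext_in_folder(path)
        self.streaks[ext] = (self.streaks.get(ext, 0) + 1) if matched else 0
```

In a scan loop, should_scan would be called before opening a file and record afterwards with the result of the regex search, so a folder full of near-identical documents is abandoned after matchlimit hits in a row.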

The script in its current state doesn’t require any CLI input and can be executed with ‘python3 pyoneer.py’.

Peter Nelson
Senior Security Engineer


