By Brittany Clayton, Tom Goughnour, & Gavin Hartnett
One of the best publicly available resources for cost, technical, programmatic, and schedule data is the President’s Budget which is submitted annually to Congress. Some of the benefits of this data are that it is broken out by the organization, fiscal year, type of appropriation, and in many cases at a program level. A particular benefit to the cost analyst is the inclusion of actual expenditures from prior years and the ability to view costs for a given program over time. Additionally, the technical and schedule detail provides valuable context to users of this data.
While this data provides immense benefits, finding the desired data can be an arduous task for an analyst. It is a manual process that can be quite time-consuming. Each organization (e.g. a service branch or non-DoD federal organization) stores their respective data on its website, usually on that organization’s comptroller or financial management website. Once the website location is found, the analyst needs to search through multiple layers to find their desired data. Data are presented in separate PDF files for each service, fiscal year, and appropriation. Downloading all necessary files and copying data from each can be extremely time-consuming. Besides, a manual process of data collection easily introduces user error.
Because this data is often leveraged by RAND projects, researchers, and analysts, we see tremendous benefits to create a centralized and easily searchable repository for this data. The obvious benefit of a searchable central repository is the time saved by avoiding having to search for the relevant websites and manually reviewing hundreds of PDF files for the desired data. Additionally, errors can be avoided when data is stored in a well-structured and searchable database.
As a pathfinder for this effort, we chose to start with the Navy budget documents, including all Navy appropriations (Research & Development (R&D), Weapons Procurement, Navy (WPN), Aircraft Procurement, Navy (APN), Operation & Maintenance, Navy (OMN), etc.) going back to 1998. The primary tasks for this initial effort included:
- Crawl through budget exhibit websites and download all applicable PDF files;
- Information pre-processing and data extraction;
- Building a simple search tool.
The task workflow can be depicted as such:
For the first task, we scraped all of the Navy budget PDFs from their website. Manually, this would require the analyst to click on the desired fiscal year (from 1998 – 2020) and then the desired appropriation (e.g. R&D, Procurement, O&M). The total number of documents downloaded was 855. Using Selenium to visit each fiscal year webpage, we were able to scrape all of these PDFs in mere hours. Done manually, this would have likely taken days to pull all of this data.
First, we scrape all the PDFs from the website. Then we use the command-line utility pdftotext to convert the PDF text to raw text. And a simple Python wrapper of tabula-java (which can read tables in PDFs) extracts PDF data tables. These tables then get converted into Pandas dataframe objects. The raw text conversion and table extraction created a database of easily searchable raw text files.
Finally, a simple search tool was developed using Python which enables a user to search for substrings in PDFs or the tables. The search tool enables the user to quickly identify what PDF files or and/or extracted tables contain the searched term, therefore saving the analyst the time consuming and tedious process of clicking links and manually searching for terms on a website containing hundreds or thousands of PDF files.
There are several next steps we might consider to advance this pathfinder project beyond our initial work. A near-term goal would be to continue refining and reviewing the tables extracted from PDF files. Since table formatting across exhibits is often inconsistent, it’s important to audit the resulting scraped data to ensure the correct data is accurately represented. We could also explore expanding the dataset to other DoD services and other federal agencies (Air Force, Army).
As a long-term goal, it would be valuable to create a more robust internal, web-based search tool. In addition to basic search functionality, incorporating advances in ML that would allow the search tool to answer questions based on database relationships could be another useful future development. Another potential functionality of an internal tool might include the ability to compare program elements or line items by year.
This work was supported by the RAND Center for Scalable Computing and Analysis (SCAN).