Complex Dataset Management
Background
Since January 2024, I have been working part-time as a programmer/data manager for a collaborative research effort to monitor First Foods of cultural significance to Tribes in the interior Pacific Northwest. The project includes partners from Oregon State University, Confederated Tribes of the Umatilla Indian Reservation, Nez Perce Tribe, and several federal agencies, such as the USDA Forest Service. While much of the data and information is confidential and protected, below I describe my work to support the research and monitoring effort.
The research group has established numerous (~150) permanent plots across the forests, grasslands, and shrublands of eastern Oregon to monitor the status, health, and abundance of culturally significant plants (First Foods). Before the team can analyze and interpret the information gathered, data from many different sources must first be cleaned, integrated, and organized. Data comes from field collection (GPS points, photographs, presence and abundance data for 20 plant species, site assessments, and other biophysical factors) as well as from available geospatial datasets (climate, land ownership, ecosystem, elevation, disturbance history, etc.). Additionally, as new data is collected, there is a need for accurate and efficient workflows that add it into the overall database.
Data is collected in permanent plots across the rugged and mountainous landscape of Oregon, and then combined with a variety of other geospatial and biophysical data.
The Process
Data is collected in the field by teams of technicians using Survey123 on iPads and then uploaded to a server. Before processing, I conduct QA/QC on the data to make sure it will process correctly. Then, using Python scripts I wrote with the pandas and numpy packages, I calculate key metrics of interest for each site: species frequency, mean/median density, 95% confidence intervals, etc. Photographs of the sites and locations taken within the Survey123 app had to be exported and processed in ArcGIS Pro. Biophysical, ecological, and geospatial information (see figure below) was then obtained and linked to the output tables generated from my analysis of the field-collected data. Queries and data tables can then be readily created for statistical analysis.
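The per-site metrics step might be sketched as follows. The real survey schema is confidential, so the column names (plot_id, species, count) and the example values here are placeholders, and the 95% interval uses a simple normal approximation rather than whatever method the project actually applies.

```python
import numpy as np
import pandas as pd

# Illustrative field records: one row per quadrat observation (placeholder data).
obs = pd.DataFrame({
    "plot_id": ["P01", "P01", "P01", "P02", "P02", "P02"],
    "species": ["camas"] * 6,
    "count":   [4, 0, 7, 2, 3, 0],
})

# Per-plot, per-species summary metrics.
summary = obs.groupby(["plot_id", "species"])["count"].agg(
    frequency=lambda c: (c > 0).mean(),  # proportion of quadrats occupied
    mean_density="mean",
    median_density="median",
    n="size",
    sd=lambda c: c.std(ddof=1),
).reset_index()

# Normal-approximation 95% confidence interval on mean density.
se = summary["sd"] / np.sqrt(summary["n"])
summary["ci95_low"] = summary["mean_density"] - 1.96 * se
summary["ci95_high"] = summary["mean_density"] + 1.96 * se
```

A table like `summary` can then be joined against the geospatial attribute tables by plot identifier for downstream statistical analysis.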
I also used the python-docx package to write scripts that automatically generate plot summaries (output as Word documents) containing all the relevant data and information for each plot, including site photos. Currently, I am working on creating a geodatabase containing this information so that the summaries can be loaded and displayed in ArcGIS as well as Google Earth. This will allow researchers and managers to quickly access and interpret critical information.
Since the project is ongoing, I am working on creating a desktop application that can be shared with the researchers so that, as new data is collected, they can easily process it and regenerate the summary statistics, summary reports, and geodatabase.
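The core incremental-update step such an application would wrap could be sketched like this: append newly exported records to the master table while guarding against duplicate re-exports. The key columns are assumptions, since the actual schema is confidential.

```python
import pandas as pd

def append_new_records(master: pd.DataFrame, new: pd.DataFrame,
                       key_cols=("plot_id", "visit_date", "species")) -> pd.DataFrame:
    """Add newly collected records to the master table, skipping rows
    already present (matched on the hypothetical key columns)."""
    combined = pd.concat([master, new], ignore_index=True)
    # keep="first" so existing master records win over re-exported duplicates.
    return combined.drop_duplicates(subset=list(key_cols), keep="first")

# Placeholder data: one existing record, one duplicate re-export, one new plot.
master = pd.DataFrame({"plot_id": ["P01"], "visit_date": ["2024-06-01"],
                       "species": ["camas"], "count": [4]})
new = pd.DataFrame({"plot_id": ["P01", "P02"],
                    "visit_date": ["2024-06-01", "2024-06-02"],
                    "species": ["camas", "camas"], "count": [4, 2]})
updated = append_new_records(master, new)
```

After the merge, the same summary and report scripts can simply be rerun on the updated table.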
