The Research Data Management program at UC Berkeley has, for the past three years, provided on-the-ground consulting to researchers from disciplines spanning the campus. Much of the program’s effort goes towards what we call “active data” – the files created, collected, processed and analyzed while a study is in progress. Frequently, this means terabytes upon terabytes of material, or millions of files, or personally identifiable information from surveys, interviews, observations and experiments.
Managing active research data has its own needs and dynamics. Large-scale support solutions are elusive. A good example: free storage offered by cloud providers such as Box and Google Drive. These campus-managed services seem like a simple and cost-effective way to keep data safe from loss and to organize the output of multiple investigations running within a lab. At scale, however, researchers struggle just to move their files to these enterprise offerings. Helping them manage their transfers can entail hours of collaborative effort and greater complexity than the researchers anticipate. In working with principal investigators, lab managers and research assistants, it quickly becomes clear how the demands of time and technical expertise run up against other pressures – competing obligations and deadlines, limited research funds, and even culture – to shape a project’s approach and its willingness to engage.
This talk will focus on the experiences of the Research Data Management consulting team. It will describe the use of tools ranging from FileZilla and Globus to lftp, rclone, Python, *nix shell commands, and Jupyter notebooks, and showcase attempts to craft a broad response to individual data management needs. There will also be time to discuss what strategies and techniques others from around the UC system are developing and using.
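To give a flavor of the scripting involved – a minimal sketch, not drawn from the talk itself – the Python below takes stock of a directory tree before a bulk upload, counting files and total bytes so that a transfer tool such as rclone or Globus can be matched to the size of the job. The path is a hypothetical placeholder, and only the standard library is assumed.

    import os

    def inventory(root):
        """Walk root, returning (file_count, total_bytes)."""
        file_count, total_bytes = 0, 0
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                try:
                    total_bytes += os.path.getsize(os.path.join(dirpath, name))
                    file_count += 1
                except OSError:
                    # Broken symlinks and permission errors are common
                    # on shared lab storage; skip rather than abort.
                    pass
        return file_count, total_bytes

    if __name__ == "__main__":
        count, size = inventory("/path/to/lab/data")  # hypothetical path
        print(f"{count:,} files, {size / 1e12:.2f} TB")

An inventory of this kind, run before any transfer begins, is the sort of quick diagnostic that helps decide whether a drag-and-drop upload will suffice or whether a bulk tool is called for.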
No specific knowledge is required, but familiarity with the topics and tools to be discussed would be helpful.
None.