Please use this identifier to cite or link to this item: https://hdl.handle.net/11681/40203
Full metadata record
DC Field | Value | Language
dc.contributor.author | Salter, R. Cody. | -
dc.contributor.author | Dong, Quyen T. | -
dc.contributor.author | Coleman, Cody A. | -
dc.contributor.author | Seale, Maria A. | -
dc.contributor.author | Ruvinsky, Alicia I. | -
dc.contributor.author | Walker, LaKenya K. | -
dc.contributor.author | Bond, W. Glenn | en_US
dc.date.accessioned | 2021-04-02T19:43:42Z | -
dc.date.available | 2021-04-02T19:43:42Z | -
dc.date.issued | 2021-04 | -
dc.identifier.govdoc | ERDC/ITL TR-21-2 | -
dc.identifier.uri | https://hdl.handle.net/11681/40203 | -
dc.identifier.uri | http://dx.doi.org/10.21079/11681/40203 | -
dc.description | Technical Report | -
dc.description.abstract | The Engineer Research and Development Center, Information Technology Laboratory’s (ERDC-ITL’s) Big Data Analytics team specializes in the analysis of large-scale datasets with capabilities across four research areas that require vast amounts of data to inform and drive analysis: large-scale data governance, deep learning and machine learning, natural language processing, and automated data labeling. Unfortunately, data transfer between government organizations is a complex and time-consuming process requiring coordination of multiple parties across multiple offices and organizations. Past successes in large-scale data analytics have placed a significant demand on ERDC-ITL researchers, highlighting that few individuals fully understand how to successfully transfer data between government organizations; future project success therefore depends on a small group of individuals to efficiently execute a complicated process. The Big Data Analytics team set out to develop a standardized workflow for the transfer of large-scale datasets to ERDC-ITL, in part to educate peers and future collaborators on the process required to transfer datasets between government organizations. Researchers also aim to increase workflow efficiency while protecting data integrity. This report provides an overview of the created Data Lake Ecosystem Workflow by focusing on the six phases required to efficiently transfer large datasets to supercomputing resources located at ERDC-ITL. | en_US
dc.description.sponsorship | Engineered Resilient Systems Program (U.S.) | en_US
dc.description.tableofcontents | Abstract; Figures and Tables; Preface; 1 Introduction; 1.1 Background; 1.2 Objective; 1.3 Approach; 2 Data Lake; 3 Data Lake Ecosystem Workflow Process Elicitation; 3.1 Stakeholder interview; 3.2 Activity diagram; 4 Data Lake Ecosystem Workflow; 4.1 Stakeholders; 4.2 Workflow processes; 4.2.1 Initial customer interaction; 4.2.2 HPC access; 4.2.3 Data transfer agreements; 4.2.4 Data access rules; 4.2.5 Data transfer; 4.2.6 Data ingest; 4.2.7 Raw data storage and update; 4.2.8 Data transformation and data analytics; 5 Conclusion; References; Acronyms and Abbreviations; Appendix A: Process Workflows Activity Diagram; Report Documentation Page (SF 298) | -
dc.format.extent | 48 pages / 2.48 MB | -
dc.format.medium | PDF | -
dc.language.iso | en_US | en_US
dc.publisher | Information Technology Laboratory (U.S.) | en_US
dc.publisher | Engineer Research and Development Center (U.S.) | -
dc.relation.ispartofseries | Technical Report (Engineer Research and Development Center (U.S.)) ; no. ERDC/ITL TR-21-2 | -
dc.rights | Approved for Public Release; Distribution is Unlimited | -
dc.source | This Digital Resource was created in Microsoft Word and Adobe Acrobat | -
dc.subject | Big data | en_US
dc.subject | Datasets | en_US
dc.subject | Electronic data processing--Workflow | en_US
dc.subject | Data curation--Workflow | en_US
dc.title | Data Lake Ecosystem Workflow | en_US
dc.type | Report | en_US
Appears in Collections: Technical Report

Files in This Item:
File | Description | Size | Format
ERDC-ITL TR-21-2.pdf |  | 2.48 MB | Adobe PDF