Work Plan Outline

  • Reach out to communities, users, and providers. Users: what data formats do they need? Providers: what considerations do they face?
  • Look at data sources ingested by BI Tools
  • Think about a method for a seal of approval. How do FEBPA and the federal data service come into play?
  • Think about a method for a Catalog of services, like ArcGIS Online, Federal Source Code Repository, Data.gov

Timeline

Goal | Description | Due Date
Framing the question | Define what we are asking and what problems we are solving | April
Research and learn | Reach out to: AI users, Data Providers, BI Tools | May
Design matrix elements |  | June
Draft |  | July 1
External Review | Reach out to some of the big providers | July - Aug
Final Product |  | August


Product(s) Description

Maturity Matrix

Format: Mostly graphical with supplemental text

Purpose: Give agencies an "on-boarding" ramp-up approach to providing analytics-ready data

Description: Which matrix elements should be considered? Quality, documentation, and governance versus scaling with technology


Survey Notes

  • Survey 1: AI Users
    • Target audience
  • Survey 2: Data Providers
    • Target audience


NEXT STEPS:

  • Tyler & Bob:
    • write a statement for the target audience, with examples of who we should talk to within the agency
    • draft set of questions
    • send to Cassandra & Bob as a starting point
  • vet the questions with a domain expert (Bob-NIST; Beth-NSF; Beth S.-Smithsonian) before sending them out



Survey Summary

The purpose of this June/July 2019 survey was to gather information about agency readiness in provisioning data in a form suitable for use in Artificial Intelligence, and to use that information to formulate a maturity matrix. We call data in a form suitable for AI use "analytics-ready data", a concept that is tightly connected to Open Data policies, Cloud First government initiatives, and American leadership in Artificial Intelligence. The survey assesses federal agency progress toward making high-priority datasets available in a form that is "AI-ready" and toward the use and creation of new, integrated datasets serving societal benefits. Fifty-three (53) participants responded to the AI survey, a number that, while on the low side, is still sufficient to derive general principles for the first iteration of the maturity matrix. It should also be noted that response rates to each question were between 50% and 75%, which could indicate that AI is still an emerging field in government and that many respondents were unable to answer. Seventeen (17) unique agencies/bureaus are represented, and another 18 respondents chose not to report their affiliation. The survey results will act as a baseline assessment of the current state of government, informing the content and granularity of the maturity matrix.

Survey Responses

Q1. I am responding to this questionnaire as (select one of the following roles):

  • Chief Data Officer or functional equivalent (21%)
  • Data Steward (41%)
  • Data Scientist (including AI) / Research Scientist (38%)

Audience: All / Answered: 53

The distribution of responses seems representative of the proportions of these actual career positions: we would expect fewer CDOs (11) and more Data Stewards (22) and Data Scientists (20).


Q2. The groups at my agency for whom open data to support AI is most relevant are (select all that apply):

  • Data Scientists (including AI/ML): 84%
  • Research Scientists who use AI/ML as a tool: 80%
  • IT Operations: 20%
  • Business Analysts: 24%
  • Data Management Programs: 52%
  • Other

Audience: CDO and Data Steward / Answered: 25 out of 33

The most popular responses by far were Data Scientists and Research Scientists, indicating that the focus for most agencies is still on R&D rather than on developing Open Data programs. Conversely, IT Operations and Business Analysts ranked low, with Data Management Programs ranking in the middle.


Q3. How well do you understand what's required for your agency's open data to be considered analytics-ready to support AI?

  • Not very well. I am not familiar with that user community and the requirements are not clear: 36%
  • Somewhat. My agency provides some data for AI users and we have some feedback on their requirements: 52%
  • Very well. We provide at least some open data that is specifically analytics-ready for the external AI community and we have a clear understanding of their requirements: 12%

Audience: CDO and Data Steward / Answered: 25 out of 33

The responses to this question support the idea that many who replied to this survey do not have a clear understanding of AI researchers' data needs. Only 3 respondents (12%) had a good understanding of AI requirements.


Q4. Which of these statements best describes your agency's data delivery systems for open data to support AI? "Programmatically accessed" is defined as machine-to-machine interaction (e.g. Application Program Interfaces (APIs) or web/data services).

  • Data is stored in siloed systems. Data are frequently copied or downloaded for individual use: 16%
  • Data is stored in siloed systems. Some data can be programmatically accessed: 16%
  • Some common data systems and standards. Some data can be programmatically accessed: 28%
  • Some common data systems and standards. High-value data can be programmatically accessed. Some common tools exist to facilitate use: 24%
  • Core common data systems with well-documented standards. High-value data can be programmatically accessed. Common tools are in use across agency and are available externally to facilitate use: 16%

Audience: CDO and Data Steward / Answered: 25 out of 33

Responses formed a fairly even bell-curve distribution between low maturity (siloed systems) and high maturity (well-developed core data systems with programmatically accessible data). This seems representative of the state of federal agencies.
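
To make the "programmatically accessed" definition above concrete, here is a minimal sketch of machine-to-machine access to an open dataset. It is illustrative only: the endpoint, dataset identifier, and response layout are hypothetical, not taken from any agency API in the survey.

```python
# Minimal sketch of machine-to-machine ("programmatic") data access.
# The base URL, dataset id, and JSON layout are hypothetical placeholders.
import requests

BASE_URL = "https://data.example.gov/api/v1"  # hypothetical open-data API


def fetch_records(dataset_id: str, limit: int = 100) -> list:
    """Retrieve a slice of a dataset as JSON instead of a manual download."""
    resp = requests.get(
        f"{BASE_URL}/datasets/{dataset_id}/records",
        params={"limit": limit},
        timeout=30,
    )
    resp.raise_for_status()  # surface HTTP errors rather than failing silently
    return resp.json()["records"]  # assumes a top-level "records" key


if __name__ == "__main__":
    records = fetch_records("sample-dataset")
    print(f"Retrieved {len(records)} records")
```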


Q5. Which of the following are most needed to improve in your agency for ideal open data delivery to support AI? (select all that apply)

  • Timeliness: 33%
  • Completeness: 42%
  • Consistency: 71%
  • Accuracy: 38%
  • Usefulness: 42%
  • Accessibility: 63%

Audience: CDO and Data Steward / Answered: 24 out of 33

Consistency and Accessibility stood out as the highest-ranking responses. Consistency relates to delivering the same type of data in the same format repeatedly, and to the same types of data having the same format (and the same definition) across data sets, databases, or records. Accessibility is an Open Data concept related to how end users obtain or interact with data, in the context of what is most suitable for their use cases. These two characteristics point to the common pain points of all Data Scientists, including AI/ML practitioners: a majority of their effort is spent obtaining, formatting, and cleaning data sets to be combined and processed with AI/ML techniques, precisely because data providers lack consistency and accessibility.
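
As a hypothetical illustration of the consistency pain point described above, the sketch below checks that recurring data deliveries share the same column schema before they are combined; the file names are placeholders.

```python
# Check that repeated CSV deliveries of "the same" dataset keep the same
# column schema -- the consistency property Q5 respondents asked for.
import csv
from pathlib import Path


def read_header(path: Path) -> list:
    """Return the first row (column names) of a CSV file."""
    with path.open(newline="") as f:
        return next(csv.reader(f))


def check_schema_consistency(paths: list) -> None:
    """Compare every delivery's header against the first one."""
    reference = read_header(paths[0])
    for path in paths[1:]:
        header = read_header(path)
        if header != reference:
            print(f"{path}: schema drift; expected {reference}, got {header}")
        else:
            print(f"{path}: consistent with reference schema")


if __name__ == "__main__":
    # Placeholder file names for two hypothetical quarterly deliveries.
    check_schema_consistency([Path("delivery_2019_q1.csv"),
                              Path("delivery_2019_q2.csv")])
```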

  

Q6. Briefly describe a challenge related to data stewardship (e.g. delivery, access, responsible use, protection of sensitive data) that you feel is a barrier or critical to success of open data to support AI. (free text)

Audience: CDO and Data Steward / Answered: 23 out of 33

This free text question was grouped into topics based on mentions, so the total count of mentions is greater than the number of answers. The challenges that elicited the most mentions fall into two topics: i) data not usable for AI (9 mentions) and ii) data sensitivity issues (7 mentions). Data not usable for AI includes weak metadata, inconsistent formats, and poor quality data. Sensitivity issues cover responsible use, privacy (GDPR), and lack of guidance for use of sensitive data in AI. The third most cited topic (raised by 3 respondents) is inadequate cyberinfrastructure: a lack of federation mechanisms across agencies, siloed datasets, or insufficient infrastructure to handle the anticipated size of data. Infrequent mentions (1 mention each) were made of missing business need, weak incentives, FAIR metrics, over-regulation in the Federal space, lack of a skilled workforce, and lack of adequate financial support.


Q7. If applicable, how is your agency's enterprise Cloud vendor environment used to support open data for AI? (select all that apply)

  • Storage: 60%
  • High-Performance Computing: 35%
  • Data as a Service (i.e. Applications or APIs to support delivery of specific datasets): 55%
  • Data Delivery Platform (e.g. data warehouse, data lake, data hub strategies): 50%
  • Other: 25%

Audience: CDO and Data Steward / Answered: 20 out of 33

There was a nearly even response among Storage, Data as a Service, and Data Delivery Platform, with Storage being the most popular. This seems representative, as most agencies exploring cloud start with Storage and mature through Data as a Service into a Data Delivery Platform. However, free text responses indicated that some agencies are still struggling to move to cloud operations and are only investigating, or not using, cloud. It should also be noted that HPC ranked relatively low, so there is an opportunity across government to increase use of cloud for HPC, a key component of most data science and AI/ML activities.


Q8. Briefly describe a challenge related to computing infrastructure (e.g. cloud adoption, data storage, computational bandwidth, etc.) that you feel is a barrier or critical to success of open data to support AI. (free text)

Audience: CDO and Data Steward / Answered: 18 out of 33

This free text question was grouped into topics based on mentions, so the total count of mentions is greater than the number of answers. The challenges around use of cloud resources showed considerable commonality across the 18 respondents. The most mentions went to high barriers to entry for cloud use, including insufficient in-house expertise and regulatory, technical, or funding barriers that slow adoption. Two other topics drew concern: limits to local data storage and bandwidth (5 mentions) and cloud charges (3 mentions). Concerns about cloud charges include a lack of transparency in cloud pricing, especially over time, and an inability to predict usage, with costs such as egress being among the higher costs of using clouds; this creates an uncertainty that some respondents found problematic. A couple of respondents thought the internally required cloud provider offerings were lacking, either because they were behind the technology curve or because they prohibited use of tools that are needed for AI research.


Q9. Which of the following statements best describes your agency's open data culture? We understand that some parts of the agency might be at different stages, but come up with an overall assessment.

  • Uncoordinated and ad-hoc. Quality and interoperability issues limit usefulness for providing data that's ready for AI and analytics: 21%
  • Data use is by request (i.e. email or file download). Agency-wide data programs are nascent: 29%
  • Some data and analytics are routine and have programs supporting key assets: 38%
  • High demand for data across agency. Decision-making is driven by data that is ready for AI and analytics: 8%
  • High demand for data internally and for external agency partners. AI and/or analytics-ready data is available for all stakeholders to drive decision-making as a community: 4%

Audience: CDO and Data Steward / Answered: 24 out of 33

Culture is developing in the same fashion as data delivery systems (Question 4). This makes sense because the two are typically interrelated, one creating a need for the other. Only 3 respondents felt their agencies had a mature open data culture supporting a high demand for data internally or for external agency partners.


Q10. Which of the following skill sets are most needed in your agency to support open data for AI? (select all that apply)

  • Data Science: 61%
  • AI / ML Research and Development: 52%
  • Business Intelligence: 22%
  • Data Visualization: 43%
  • Software Development: 43%
  • Data Stewardship: 43%
  • Data Engineers / Data Architects: 52%
  • High Performance Computing: 39%
  • Cloud Computing: 48%
  • Software DevOps: 35%
  • DataOps / Data as a Service Teams: 30%
  • Other: 9%

Audience: CDO and Data Steward / Answered: 23 out of 33

Response rates were consistently high across multiple categories, indicating that skill sets across the board are lagging and significant effort is needed to update the workforce. The lowest-ranking skill sets were Business Intelligence, High Performance Computing, Software DevOps, and DataOps / Data as a Service Teams, which could indicate a lack of understanding about these skill sets or a lack of maturity in Open Data delivery platforms to take advantage of them. It is appropriate to characterize these skill sets as "emerging".


Q11. Briefly describe a challenge related to organizational culture (e.g. business or research-based practices, skill sets, organizational structure, etc.) that you feel is a barrier or critical to success of open data to support AI. (free text)

Audience: CDO and Data Steward / Answered: 22 out of 33

This free text question was grouped into topics based on mentions, so the total count of mentions is greater than the number of answers. The challenges related to organizational culture raised several issues. One is expertise: inadequate skill sets among existing employees and difficulty attracting and retaining talent (5 mentions). Another is organizational structural barriers: contract vehicles that are inflexible for R&D prototyping, prior approvals required for data access, weakly curated data, and funding aligned with mission systems. The final mentions concern an employee culture that prefers holding data rather than making it accessible.


-- QUESTIONS 12-18 WERE TARGETED TO FEDERAL DATA SCIENTISTS WHO USE AI IN THEIR WORK --


Q12. Which of the following are the most important data quality characteristics to support your AI-related work? (select all that apply)

  • Timeliness (i.e. near real-time or general speed of delivery): 67%
  • Completeness: 75%
  • Consistency: 100%
  • Accuracy: 83%
  • Usefulness: 67%
  • Accessibility: 83%
  • Other: 25%

Audience: Data Scientist / Answered: 12 out of 20

Again, Consistency and Accessibility ranked among the highest for Data Scientists, in agreement with Question 5, which represented the CDO and Data Steward point of view. In addition, Accuracy and Completeness were highly ranked. Timeliness and Usefulness were the lowest ranked but still commonly selected. Overall, this indicates the key importance of data quality for AI-ready Federal data.


Q13. What are the most important characteristics that make data easier to use for your AI-related work? (free text)

Audience: Data Scientist / Answered: 11 out of 20

This free text question was grouped into topics based on mentions, so the total count of mentions is greater than the number of answers. The most common answer was easy, free data access (5 mentions), but complete documentation was also mentioned often (4 mentions). Three aspects tied for third place (3 mentions each): labeled training data, delivery modes, and data quality. Two respondents also mentioned file format, with one mention each for timeliness and for access to subject matter experts to help users interpret the open data.


Q14. What file formats or standards are needed to support your AI-related work? (free text)

Audience: Data Scientist / Answered: 11 out of 20

The answers to this question were extremely diverse. Many respondents emphasized that they are able to use many different standard formats in their work, but the preference is clearly for simple formats (text, XML, JSON, etc.). Specific formats mentioned, in order of frequency: CSV, JSON/GeoJSON, netCDF, ASCII, and DBF. Each of these formats also received one mention: XML, TFRecord, TIFF, GRIB, HDF, Zarr, and the Open Data Metadata Standard v1.1.
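
As a minimal sketch, the snippet below loads the three most-mentioned formats into analysis-ready structures. The file names are hypothetical, and reading netCDF assumes the optional xarray package (with a netCDF backend such as netCDF4) is installed.

```python
# Load the most-mentioned Q14 formats. File names are hypothetical.
import json

import pandas as pd
import xarray as xr  # assumes xarray + a netCDF backend (e.g. netCDF4)

tabular = pd.read_csv("observations.csv")        # CSV: most frequently cited
with open("stations.geojson") as f:              # JSON/GeoJSON
    stations = json.load(f)
gridded = xr.open_dataset("model_output.nc")     # netCDF

print(tabular.shape)
print(len(stations["features"]))                 # GeoJSON FeatureCollection
print(list(gridded.data_vars))
```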


Q15. Which data delivery option do you prefer the most to support your AI-related work?

  • File download to local environment: 25%
  • Data as a Service (i.e. web services / APIs to support data delivery): 33%
  • Platform as a Service (i.e. high-performance computing co-located with the data resources): 25%
  • Commercial cloud vendors' proprietary platforms (e.g. Google's BigQuery, Amazon EC2/Athena, Microsoft Data Science environment, etc.): 17%
  • Other: 0%

Audience: Data Scientist / Answered: 12 out of 20

APIs/Data as a Service ranked highest, indicating that having data programmatically accessible is a definite preference. File download to local environment and Platform as a Service (HPC) were the next highest ranked, indicating that some data scientists prefer to have the data in an accessible computing environment. An additional 17% of respondents preferred to access the data in commercial cloud platforms. The preferences here were fairly evenly split, indicating that the best practice is for data providers to offer several different ways to access the data. Free-text answers to other questions support the idea that the preference depends on the size of the dataset, the specific use case, and the tools being used. In comparison to Questions 7 and 10, CDOs and Data Stewards may not recognize the importance of HPC, since HPC was among the lowest-ranking options there.


Q16. What are the important tools (e.g. software packages, programming languages, Services, etc.) that you use in your AI work? (free text)

Audience: Data Scientist / Answered: 12 out of 20

For this audience of Federal AI researchers, the list of tools was weighted towards programming languages (as opposed to proprietary software), which may not reflect the broader commercial and academic user community. The specific tools mentioned, in order of frequency: Python, R, TensorFlow, Keras, Jupyter notebooks, and the AWS AI/ML/IoT stack. Each of these tools also received one mention: ArcGIS, Word2Vec, BERT, Excel, SAS, IDL, Fortran, GitHub, VIAME, Conda, Spark, Scikit-learn, MatLab, Watson Explorer/Studio, the Azure AI/ML/IoT stack, and the Google Cloud Platform AI/ML/IoT stack.
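
For illustration, here is a toy example of the Python-centric workflow these responses suggest, using Scikit-learn (one of the tools mentioned) on a bundled sample dataset. It is not drawn from any respondent's actual work.

```python
# Toy example of a Python AI/ML workflow with Scikit-learn, one of the
# tools respondents mentioned. Uses a bundled dataset; purely illustrative.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(f"Held-out accuracy: {model.score(X_test, y_test):.2f}")
```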


Q17. Briefly describe a challenge related to data delivery and use (e.g. data quality, documentation, cloud availability, dataset size, computational bandwidth, etc.) that you feel is a barrier or critical to success of open data to support AI. (free text)

Audience: Data Scientist / Answered: 11 out of 20

This free text question was grouped into topics based on mentions so the total count of mentions is greater than the number of answers.  The most common challenges mentioned were data quality and the size of the datasets (3 mentions each). Several respondents also mentioned timeliness, documentation, access to cloud resources, and developing labeled training datasets (2 mentions each). 


Q18. Which of the following skill sets are most valuable for your AI-related work? (select all that apply)

  • Data Science: 83%
  • AI / ML Research and Development: 75%
  • Business Intelligence: 8%
  • Data Visualization: 50%
  • Software Development: 50%
  • Data Stewardship: 33%
  • Data Engineers / Data Architects: 50%
  • High-Performance Computing: 58%
  • Cloud Computing: 17%
  • Software DevOps: 25%
  • Data Ops / Data as a Service Teams: 8%
  • Other: 17%

Audience: Data Scientist / Answered: 12 out of 20

Intuitively, Data Science and AI/ML Research and Development were the highest-ranking responses, given their direct relationship to the survey audience's work. The next highest-ranking group was Data Visualization, Software Development, Data Engineers / Data Architects, and High Performance Computing. Compared with the CDO and Data Steward responses in previous questions, where these choices ranked lowest, these areas could represent gaps in understanding about data science and AI/ML at the enterprise/upper-management Open Data strategy level.


Q19. Please tell us the name of your agency, subordinate bureau, or office. (optional)

  • DHHS / CDC (x2)
  • DHHS / NIH (x2)
  • VA / Veterans Health Administration
  • VA / Cooperative Studies Program
  • USDA
  • USDA / Agriculture Research Service (x3)
  • USDA / Economic Research Service
  • USDA / Forest Service (x3)
  • DOS / USAID Bureau for Management (x2)
  • DOI (x2)
  • DOI / BOEM (x3)
  • DOI / USGS (x3)
  • DOC / NIST (x2)
  • DOC / NOAA (x4)
  • DOE Office of Science (x3)
  • Smithsonian
  • NASA

Matrix Notes

Major Take-aways from Survey:

  • There are some clear themes in the requirements: several aspects of data quality, good documentation, and offering a variety of data formats and delivery mechanisms
  • Data sensitivity needs to be discussed, but will not be included as an element in the matrices
  • CDO/Data Steward data strategies and visions do not match AI/ML Data Scientist user needs
  • While Open Data activities are maturing across the Federal government, AI/ML requirements are largely not understood
  • There are still huge barriers to cloud on-boarding and limited understanding of cloud-optimized strategies
  • There is an across-the-board lack of employee skill sets and an inability to retain highly sought-after talent


Dimensions we may consider:

Group 0:

  • skills
  • infrastructure
  • federal business environment
  • data stewardship
  • business culture

Group 1:

  • Data Stewardship
  • Computing Infrastructure
  • Organizational Culture

Group 2:

  • People, Skills, Organization
  • Technology and Solutions

Group 3:

  • Domain Readiness
    • Skills
    • Tools
    • Data
  • Cultural Readiness
    • Strategy and vision
  • Operational Readiness
    • Cloud on-boarding
    • Data delivery platforms


