
Tech Explainer: AI Model Training

  • Writer: Walker Robinson
  • Nov 17, 2024
  • 4 min read


What It Is

AI model training is a fundamental process that serves as the lifeblood of modern artificial intelligence. While comparing it to human learning offers some useful context, these systems operate on an entirely different scale. When we interact with AI applications such as large language models (LLMs) like Claude or ChatGPT, or facial recognition systems like Clearview, we engage with technology that has processed and learned from many billions of data points.


The datasets used to train these systems need to include diverse information sources to produce the best results. Everything from social media content and news publications to satellite imagery is included. The volume and diversity of the training data directly correlate with how well the system performs once training is complete.


Organizations that train AI models, such as Anthropic and OpenAI, use sophisticated data selection, cleaning, and structuring processes to ensure the best learning outcomes for their models. The quality and accuracy of training data are incredibly important: data that is compromised or incorrect can significantly degrade system performance and reliability.
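To make the cleaning idea concrete, here is a minimal sketch of one such step: deduplicating records and dropping entries that fail a basic quality check. This is purely illustrative; real training pipelines at these companies are far more elaborate, and the function and thresholds here are my own invention.

```python
# Minimal sketch of a data-cleaning step: collapse whitespace, drop
# records that are too short to be useful, and remove exact duplicates.
# Illustrative only -- real pipelines are vastly more sophisticated.

def clean_dataset(records, min_length=20):
    seen = set()
    cleaned = []
    for text in records:
        normalized = " ".join(text.split())  # collapse runs of whitespace
        if len(normalized) < min_length:
            continue  # too short to carry useful training signal
        key = normalized.lower()
        if key in seen:
            continue  # duplicate of a record we already kept
        seen.add(key)
        cleaned.append(normalized)
    return cleaned

raw = [
    "AI model training requires large, diverse datasets.",
    "AI  model training requires large, diverse datasets.",  # near-duplicate
    "too short",
]
print(clean_dataset(raw))  # only the first record survives
```

Even this toy version shows why curation matters: two of the three raw records never reach the model at all.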


If you want to learn more about how the training is done, I highly recommend reading this piece by Anthropic, as it covers many more of the highly technical details.


Why It's Important

The implications of data collection for AI training extend beyond the technical capabilities of these models; they also have a significant impact on U.S. national security interests. Each data point used in a training dataset could contain sensitive information about individuals, organizations, or national interests that adversarial nations could exploit.


For example, imagine an adversary collecting data from all of the social media accounts of personnel on a military base over an extended period of time. Those individual posts might appear unimportant on their own, but analyzed together they could reveal critical patterns about personnel movement, operational schedules, or base protocols. A human analyst could conduct the same process and reach the same result, but it would take many hours, and they still might not notice all of the details that an AI system would. Similarly, the collection of publicly available imagery could inadvertently expose security measures or sensitive locations, creating security vulnerabilities.


Data authenticity represents another critical concern in AI training. Adversaries could potentially manipulate training datasets to introduce specific biases or vulnerabilities into the AI systems. As mentioned in other Tech Explainers, this data “poisoning” could result in systematic errors or vulnerabilities that pop up during critical operations, potentially compromising security or operational success.
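One of the simplest forms this "poisoning" can take is label flipping: an adversary silently corrupts a fraction of the training labels so the model learns systematic errors. The sketch below, with made-up data, only illustrates the mechanics; real attacks are subtler (backdoor triggers, targeted misclassifications) and much harder to spot.

```python
import random

# Toy illustration of label-flip data poisoning: corrupt a fraction of
# binary training labels. Purely illustrative -- real poisoning attacks
# are far subtler than random flips.

def poison_labels(labels, fraction, seed=0):
    rng = random.Random(seed)
    labels = list(labels)
    n_poison = int(len(labels) * fraction)
    for i in rng.sample(range(len(labels)), n_poison):
        labels[i] = 1 - labels[i]  # flip a 0/1 label
    return labels

clean = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
poisoned = poison_labels(clean, fraction=0.3)
flipped = sum(a != b for a, b in zip(clean, poisoned))
print(f"{flipped} of {len(clean)} labels flipped")
```

A model trained on the poisoned labels would confidently learn the wrong answer for those examples, and nothing in the data itself looks obviously broken, which is exactly what makes this class of attack dangerous.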


How It Impacts U.S. National Security


The Opportunities

Understanding how AI systems learn from data opens powerful new possibilities for U.S. national security. With properly controlled training data, AI models could carry out specialized tasks with a great degree of efficiency. For example, a model trained on satellite imagery could detect subtle changes in adversary activities. A model trained on past cyber attacks could identify patterns in those attacks and better detect or defend against them in the future. Perhaps one of the most interesting uses could be applying open-source intelligence to predict regional instability. The key lies in carefully selecting and curating the training data to develop these capabilities.


On the defensive side, the U.S. could use AI systems trained on simulated attacks and security breaches to strengthen its protective measures, both physical and digital. By training these systems on data from past security incidents, defenders could develop better ways to detect and respond to future threats.


The development of synthetic data generation techniques (the creation of artificial data that imitates real-world data) could enable organizations to create training datasets that are as effective as real-world data while eliminating the risk of using sensitive information.
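Here is a deliberately simple sketch of that idea: sample artificial values that match the mean and spread of a real (sensitive) column, so the useful statistics survive while no real record is exposed. The numbers are invented, and real synthetic-data generators (GANs, differential-privacy methods) are far more sophisticated than drawing from a fitted Gaussian.

```python
import random
import statistics

# Sketch of a naive synthetic-data generator: fit a normal distribution
# to a sensitive column, then sample artificial values from it.
# Illustrative only -- production techniques are far more advanced.

def synthesize(real_values, n, seed=0):
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n)]

real = [41, 38, 45, 50, 39, 44, 47, 42]   # pretend these are sensitive
fake = synthesize(real, n=1000)
print(round(statistics.mean(fake), 1))    # close to the real mean
```

The synthetic column preserves the aggregate statistics a model might need while containing none of the original records, which is the core appeal of the technique.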


The Challenges

The risk of using “poisoned” data in AI training is incredibly important to address. The U.S. government and U.S. AI companies must develop effective methods of protecting and validating the data they intend to use for AI model training. Should the data be compromised, it would undermine the entire AI system and invalidate any results the model produces. When these models are used for matters of national security and other high-stakes situations, there can be no room for error.
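One basic building block for that kind of data validation is integrity checking: record a cryptographic digest of a dataset when it is vetted, then verify the digest before every training run. This sketch (with hypothetical records) only catches tampering after vetting; it cannot detect data that was poisoned before the digest was taken, which is why it is just one layer of a much larger defense.

```python
import hashlib

# Minimal sketch of dataset integrity checking: take a SHA-256 digest
# of the vetted training data, then re-verify it before training.
# Detects post-vetting tampering only, not pre-existing poisoning.

def fingerprint(records):
    h = hashlib.sha256()
    for text in records:
        h.update(text.encode("utf-8"))
        h.update(b"\x00")  # separator so record boundaries matter
    return h.hexdigest()

vetted = ["example record one", "example record two"]
trusted_digest = fingerprint(vetted)

tampered = ["example record one", "example record two (modified)"]
print(fingerprint(tampered) == trusted_digest)  # False: data changed
```

Any change to any record, however small, produces a completely different digest, so a mismatch is an unambiguous signal that the dataset is no longer the one that was vetted.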


The inherently global nature of data collection presents significant challenges for U.S. national security as well. Once information becomes public, controlling its collection and use for AI training becomes incredibly difficult. Public sources such as social media platforms, government records, and openly available imagery are all essentially fair game. This availability doesn't just apply to U.S.-made AI models, but to any model made around the world. There's a very good chance the Chinese-made AI models are using every bit of open-source data they can get their hands on, just like the U.S. AI companies. (I’d discuss TikTok, but that's a long conversation for another time.)


Organizations also face a complex balance between maintaining a competitive advantage in AI development and protecting sensitive information. Restricting the data these companies can use may enhance security, but it would likely impede AI advancement relative to competitors operating under fewer constraints. China is not likely to give the same respect to information privacy as the U.S.


Looking Ahead

As AI systems become increasingly integrated into security operations, training data security will take on even greater strategic importance. This effort encompasses not only protecting sensitive information but also ensuring the integrity and reliability of the data in those training datasets. It is essential to establish a balanced approach that supports technological progress while protecting national security interests.







Tech Explainers are our method of introducing and analyzing complex technologies in an easily digestible way. They are good practice for us as we grow in our own knowledge and become better at “translating” the technological side of things into the policy and national security side. 


These technologies are incredibly complicated and thus it is difficult to address every detail. So while we strive to produce the best explanations we can, we may have overlooked something in the process. If you feel that is the case please reach out and let us know. We’re always happy to talk!

