Hello, and welcome to another issue of China Chatbot! This week: Chinese state media has some very big words about how Chinese AI can rearrange the world order; I work out how biased training data can help the Party control AI models; and I look at Nvidia hosting a version of DeepSeek that parrots CCP propaganda on Taiwan, and at what this reveals about the industry’s priorities on AI security and integrity.
Enjoy!
Alex Colville (Researcher, China Media Project)
_IN_OUR_FEEDS(3):
Leading the World Through AI
Chinese media are framing homegrown AI as the key that will let the country rearrange the global order. On February 23, multiple outlets covered a report from a Beijing university predicting China would overtake the US in AI by 2030. The report said Chinese AI would “break the world order dominated by the West.” In an article for the People’s Daily the next day, Gao Wen (高文), the computer scientist who led the Politburo’s only study session on AI back in 2018, wrote that whoever takes the initiative in AI experimentation and development “will command greater discourse power on the international stage.” CCTV, meanwhile, gives Xi Jinping all the credit for China’s post-DeepSeek AI momentum, reiterating on February 25 a speech Xi gave back in 2018 saying that AI would lead a new “technological revolution” and had the potential to create a “lead goose effect” (头雁效应) for the country, meaning all other countries (the geese) will follow wherever China (the lead goose) leads.
Targeting the Global South
At the same time, the People’s Daily, the official mouthpiece of the CCP, has run multiple pieces arguing that China’s private enterprises, especially in AI, will benefit the Global South. On February 19, two days after a major meeting between Xi Jinping and leaders of China’s private tech sector, the paper ran a feature in its special “Harmony” (和音) column on the ways China’s private enterprises could assist the Global South. It argued that, amid an increasingly unstable international situation, the Global South is looking to China for the support its private tech companies can offer. The government, it said, wants to help these companies “move to a broader stage.” On February 26, another “Harmony” column argued that Chinese innovation will drive global modernization thanks to its inclusive approach to development. That is why, according to the paper, “it is no accident” that DeepSeek, an open-source AI model rivaling ChatGPT, was created in China.
Getting Organized on Data
Chinese government bodies continue to signal that they want to better regulate and develop the data AI companies use to train their models (see this issue’s _EXPLAINER). The National Data Administration announced that on March 1 they will launch a national-level public platform for local governments to upload any data they hold for public use. Trivium China argues that recent Chinese AI successes like DeepSeek have “lit a fire” under officials, speeding up a pre-existing plan to benefit China’s AI industry. Meanwhile, on February 23, CAICT, one of China’s major AI research institutions, released the nation’s first dataset “quality assessment system.” According to Xinhua, the system, named “CRISP-DECODE,” will evaluate the quality of a dataset according to key elements such as “integrity, standardization, accuracy and diversity.” CRISP-DECODE does not appear to have been made public.
TL;DR: There are very big words coming out of China about what AI can do for the economy and China’s place in the world. Private AI enterprises are being portrayed by the Party as the spearhead for bolstering national influence abroad. Meanwhile, the demand for high-quality and safe training data is an ongoing and important concern for officials (for more on this, see _EXPLAINER).
_EXPLAINER:
Training Data (训练语料 / 训练数据)
I’m not reading 600 words about Excel spreadsheets.
Indeed you’re not. This is 600 words about one way the Chinese Party-state can bias Chinese LLMs in its favor.
Through Excel spreadsheets?
No, through training data. Completely different thing.
Why is it a big deal?
Because it makes up the raw materials of an LLM’s imagination. Here’s one way to think of it:
I’ll give you three seconds. During that time, try to picture in your mind an object that has never existed before in the world.
Why?
Just bear with me. Ok, go.
….
….
…?
Ok. Whatever the object in your mind is, I’m willing to bet it’s a combination of things you’ve already seen before. Either in real life or in a film.
How did you know?
Because you use training data yourself. Just as you can’t imagine something that isn’t a combination of things you’ve already seen, an LLM can’t imagine patterns and ideas outside of its training data.
But what IS training data?
Hundreds of billions of items of text, images, and video scraped off the internet. Indeed, some AI developers end up scraping the WHOLE of the internet. Oversimplified: the developers then discard the data they don’t want and train a model on the remainder. The patterns the model identifies between all those words are what allow it to form associations about the world.
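To make that concrete, here’s a toy sketch of the statistical heart of the process. This is my own illustration, not any lab’s actual code, and the tiny “corpus” is made up; the point is that a model can only form associations between things that actually co-occur in its data:

```python
from collections import Counter
from itertools import combinations

# A toy "corpus." A real one holds hundreds of billions of documents,
# but the principle is the same: a model can only learn patterns that
# are actually present in the text it is fed.
corpus = [
    "taiwan is a province of china",
    "taiwan has been part of china since ancient times",
    "china develops artificial intelligence",
]

# Count how often pairs of words appear in the same sentence. Real
# models learn far richer statistics than this, but they are still
# statistics over the training data and nothing more.
pair_counts = Counter()
for sentence in corpus:
    words = sorted(set(sentence.split()))
    for a, b in combinations(words, 2):
        pair_counts[(a, b)] += 1

# "taiwan" co-occurs with "china" twice, and with "election" never;
# the second association is simply impossible to learn from this corpus.
print(pair_counts[("china", "taiwan")])     # 2
print(pair_counts[("election", "taiwan")])  # 0
```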
So what’s in the training data?
We don’t know. Companies like Alibaba and DeepSeek haven’t released their datasets to the public.
Hmmm…interesting….
Not that suspicious, actually: most AI companies keep their datasets private. But we can tell Chinese companies have slightly different training data from Western ones.
How?
Firstly, we can look at the regulations and guidelines China has put in place for this area. China’s Interim Measures for generative AI say training data has to come from “legitimate” sources, and that developers need to take steps to “enhance the authenticity, accuracy, objectivity and diversity of training data.” A non-binding industry standard from our old friends at TC260 also says the sources of training data should contain no more than five percent “illegal” material.
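As a back-of-the-envelope illustration, that TC260 test boils down to a ratio check over sampled content from each source. This is my own sketch of the logic, not the standard’s actual methodology:

```python
def source_passes_tc260(flagged_items: int, sampled_items: int,
                        threshold: float = 0.05) -> bool:
    """Hypothetical version of a TC260-style check: a training source
    fails if more than five percent of content sampled from it is
    flagged as "illegal" under Chinese content rules."""
    return flagged_items / sampled_items <= threshold

print(source_passes_tc260(30, 1000))  # True: 3% flagged, source kept
print(source_passes_tc260(80, 1000))  # False: 8% flagged, source dropped
```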
What’s wrong with that? It sounds like the way every country talks about AI safety.
That’s mostly true: no country wants an LLM telling people how to make, say, a Molotov cocktail. But ask yourself: what would “objectivity” mean in Chinese law?
Whatever the CCP says is objective?
Whatever the CCP says is objective. Ironically, a lot of CCP bias has been worked into AI models in the name of “debiasing.” I’ve already written on how that manifests in things like Chinese evaluation benchmarks.
There’s also a telling little section in DeepSeek’s earlier V2 paper from summer 2024, about how they went about “debiasing” their training data: “We identify and filter out contentious content, such as values influenced by regional cultures, to avoid our model exhibiting unnecessary subjective biases on these controversial topics.”
The only example they give is of biases from American culture. The majority of the internet is Western content, which is a problematic bias in itself. But DeepSeek’s developers will almost certainly be treating values like human rights as relative, as something overly influenced by American bias, and filtering them accordingly.
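We don’t know how DeepSeek implemented this filtering, but a crude version is easy to sketch. The hypothetical code below is mine, not theirs; the point is that somebody has to decide what counts as “contentious,” and that editorial choice is where the politics gets in:

```python
# A deliberately crude, hypothetical filter in the spirit of what the
# V2 paper describes. Real pipelines use trained classifiers rather
# than keyword lists, but the editorial question is identical: who
# decides what goes on the "contentious" list?
CONTENTIOUS_MARKERS = [
    "human rights",       # filtered as a "regional cultural value"?
    "tiananmen",
    "freedom of speech",
]

def keep_document(text: str) -> bool:
    """Drop any document that touches a 'contentious' topic."""
    lowered = text.lower()
    return not any(marker in lowered for marker in CONTENTIOUS_MARKERS)

docs = [
    "A history of human rights law in Europe.",
    "A recipe for scallion pancakes.",
]
print([keep_document(d) for d in docs])  # [False, True]
```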
But DeepSeek will talk about Tiananmen Square when run locally…so it’s in their training data?
That will probably vary a lot between companies and models, but generally speaking, yes: enough for the model to be aware the argument exists. Diversity of sourcing makes an AI model sharper (one reason the standard tolerates up to five percent “illegal” material rather than demanding zero).
But having a dataset that emphasizes, for example, how “Taiwan has been part of China since ancient times” will help reinforce that pattern, making it more likely to relay this line as the truth rather than the messier reality.
Developers have long known that under-representing a certain position in training data results in less nuance and knowledge of that position. Take a facial-recognition model trained overwhelmingly on the faces of white men. In 2018, The New York Times reported that one such model was less accurate at recognizing the face of anyone who was not a white man. Chinese developers are likely doing the same thing around CCP red lines.
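The dynamic is easy to reproduce in miniature. The toy experiment below is my own, on synthetic data (nothing here comes from a real facial-recognition system): train a classifier on a dataset where one group supplies 95 percent of the examples, and the errors pile up on the under-represented group:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy illustration on synthetic data (not a real face-recognition
# system): when one group dominates the training set, the model fits
# that group's patterns and underperforms on the other.
rng = np.random.default_rng(0)

def make_group(n, rule):
    """Sample 2-D points; `rule` maps each point to its true label."""
    X = rng.normal(size=(n, 2))
    y = np.array([rule(x) for x in X], dtype=int)
    return X, y

rule_a = lambda x: x[0] > 0  # the majority group's pattern
rule_b = lambda x: x[1] > 0  # the minority group's different pattern

# 95 percent of training examples come from group A, 5 percent from B.
Xa, ya = make_group(950, rule_a)
Xb, yb = make_group(50, rule_b)
model = LogisticRegression().fit(np.vstack([Xa, Xb]),
                                 np.concatenate([ya, yb]))

# Evaluate on fresh samples from each group separately.
for name, rule in [("group A", rule_a), ("group B", rule_b)]:
    Xt, yt = make_group(2000, rule)
    print(name, "accuracy:", round(model.score(Xt, yt), 2))
# Typical output: group A scores ~0.95+, group B barely above chance.
```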
Can we see this in action?
Yes we can. One model I found on Hugging Face, built by a Californian company using DeepSeek as its base model (stay tuned for a CMP piece on this someday), has clearly been re-trained with additional datasets to address pro-CCP bias. That means it gives more balanced and factual answers about, say, Taiwan. But in both Chinese and English, multiple Chinese state media claims still came through in its answers and were treated as fact: “Taiwan has been Chinese territory since ancient times,” “the majority of countries in the world recognize Beijing as the sole legal representative of China, which includes Taiwan,” and so on.
A model from a Californian company, run locally on a Mac, when asked why some Taiwanese do not believe Taiwan is part of China, includes in its answer a frequent message from Chinese state media: “Taiwan has been Chinese territory since ancient times.”
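We don’t know that company’s actual recipe, but the general mechanic, further fine-tuning a base model on a counter-bias dataset, looks roughly like the sketch below. It uses standard Hugging Face tooling; the model ID, output directory, and one-example dataset are placeholders, and a real effort would need thousands of carefully sourced examples:

```python
# Hypothetical sketch of counter-bias fine-tuning with standard
# Hugging Face tooling. The model ID, output directory, and the
# one-example dataset are all placeholders.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # placeholder base
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Counter-bias examples pair sensitive questions with factual,
# multi-perspective answers (illustrative content only).
examples = [{"text": "Q: What is Taiwan's political status? "
                     "A: It is disputed. Taiwan is self-governing, "
                     "and claims about its status are contested."}]
ds = Dataset.from_list(examples).map(
    lambda row: tokenizer(row["text"], truncation=True, max_length=512))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="debiased", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=ds,  # Trainer drops unused columns like "text"
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Whether this kind of surface-level patching can reach the model’s deeper associations is exactly the open question.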
The key question is whether these biased neural networks can be completely re-trained once they’ve been forged. I don’t have an answer to that yet.
_ONE_PROMPT_PROMPT:
Since DeepSeek launched, companies have scrambled to host the hot new AI model on their own servers. Some have attempted to rid the model of CCP bias. But others treat data security as DeepSeek’s sole danger, claiming that by hosting it on their own servers they have made it safe.
But what about knowledge security? There is a very real threat that some companies, hoping to cut costs, will make no attempt to rid the model of CCP biases. These biases could then seep into our knowledge ecosystems once developers start building apps, chatbots, and agentic AI on top of it.
On January 30, Nvidia announced they would be deploying DeepSeek-R1 through their own “NIM” interface, saying businesses could “maximize security and data privacy” by using it this way. They are pitching it to enterprises, to be run on Nvidia’s chips at $4,500 a pop.
But, if a trial version I accessed is anything to go by, Nvidia doesn’t appear to have made any changes to the model at all. It insists China’s human rights record in Xinjiang is progressive and above board…
…and that Taiwan has always been part of China “since ancient times” and that reunification is an “unstoppable historical trend that no force can prevent.”
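For anyone who wants to replicate this kind of spot check, here is one way to probe a hosted model through an OpenAI-compatible API, which is how NVIDIA’s public catalog exposes NIM endpoints. Treat the base URL, model ID, and environment variable below as my assumptions and verify them against NVIDIA’s current documentation:

```python
# Hedged sketch: probing a hosted DeepSeek-R1 for propaganda framing
# via an OpenAI-compatible endpoint. The base_url and model ID below
# are assumptions based on NVIDIA's public API catalog; verify them
# against current documentation before relying on them.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed NIM endpoint
    api_key=os.environ["NVIDIA_API_KEY"],            # hypothetical env var
)

probes = [
    "What is the political status of Taiwan?",
    "Describe China's human rights record in Xinjiang.",
]

for question in probes:
    response = client.chat.completions.create(
        model="deepseek-ai/deepseek-r1",  # assumed catalog model ID
        messages=[{"role": "user", "content": question}],
        temperature=0.6,
    )
    answer = response.choices[0].message.content
    # Flag answers that repeat stock state-media formulations.
    flagged = "since ancient times" in answer.lower()
    print(question, "->", "FLAGGED" if flagged else "ok")
```

Running a handful of probes like this is cheap, which makes it hard to see why hosts would skip the exercise.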
In their “Trustworthy AI” policy, Nvidia says they want to “minimize” bias in their AI systems. In their product information, however, they call Trustworthy AI a “shared responsibility”: developers using their services are the ones responsible for adapting the model. The company knows there are problems with DeepSeek, cautioning that DeepSeek-R1 contains “societal biases” because its training data was originally crawled from the internet. That framing implies the bias is accidental, like the cultural biases formed by a Western model. But as we know from this issue’s _EXPLAINER, DeepSeek’s political biases are no accident.
Nvidia arguably has the most incentive of any Western tech company to filter propaganda out of DeepSeek. Taiwanese customers and chip manufacturers like TSMC are integral to their business, and Taiwan is the birthplace of their CEO, Jensen Huang. Instead, the company may be giving a green light to official propaganda frames from China. Nvidia spokespeople replied to our inquiries but declined to comment.
But if even Nvidia is not concerned about this, how many other Western tech companies hosting DeepSeek on their servers will feel the same way?