China’s economic data have always been fraught. Now, all eyes are on the coronavirus numbers, which economists and investors are using to estimate the outbreak’s toll—and they are too perfect to mean much.
A statistical analysis of China’s coronavirus casualty data shows a near-perfect prediction model that data analysts say isn’t likely to naturally occur, casting doubt over the reliability of the numbers being reported to the World Health Organization. That’s aside from news on Thursday that health officials in the epicenter of the outbreak reported a surge in new infections after changing how they diagnose the illness.
The most current WHO data count more than 60,000 cases of infection and nearly 1,400 deaths. Most of those cases and all but one death have been in China.
Torsten Sløk, chief economist at Deutsche Bank Securities, expects the outbreak to shave 1.5 percentage points off Chinese gross domestic product growth this year. He recently revised his 2020 GDP growth estimate for China to 4.6% from 6.1%, and said he thinks the virus will take 0.5 percentage point off global growth this year. Estimates like Sløk’s, of course, rely on the tally of coronavirus infections and fatalities that China is reporting.
China’s official economic statistics often differ from private attempts to replicate the results. Take a set of purchasing managers’ indexes. In each month of 2019, the government-created version beat a closely watched private version. Some economists say that such data-reporting anomalies are a function of the country’s size, growth rate, and relative lack of transparency.
“It’s an emerging economy,” says Carl Weinberg, chief economist at High Frequency Economics. “There’s a natural roughness to the data.”
In terms of the virus data, the number of cumulative deaths reported is described by a simple mathematical formula to a very high accuracy, according to a quantitative-finance specialist who ran a regression of the data for Barron’s. A near-perfect 99.99% of variance is explained by the equation, this person said.
Put in an investing context, that share of explained variance, the so-called r-squared value, would mean that an investor could predict tomorrow’s stock price with almost perfect accuracy. In this case, the high r-squared means there is essentially zero unexpected variability in the reported death toll from one day to the next.
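For readers who want to see what an r-squared measures, here is a minimal sketch in Python. It is not the regression run for Barron’s: the formula fitted to the death counts isn’t given here, so the sketch assumes a plain straight-line fit, and the numbers are made up for illustration rather than taken from the WHO data. The function names `fit_line`, `r_squared` and `line_r_squared` are likewise hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b * x. Returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def r_squared(observed, predicted):
    """Share of the variance in `observed` explained by `predicted`."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))  # unexplained
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)              # total
    return 1.0 - ss_res / ss_tot


def line_r_squared(xs, ys):
    """Fit a straight line to (xs, ys) and return the r-squared of that fit."""
    a, b = fit_line(xs, ys)
    predicted = [a + b * x for x in xs]
    return r_squared(ys, predicted)


# Made-up series for illustration only; NOT the reported coronavirus figures.
days = [1, 2, 3, 4, 5, 6, 7, 8]
smooth = [10, 20, 30, 40, 50, 60, 70, 80]  # falls exactly on a line
noisy = [14, 9, 41, 30, 38, 75, 52, 90]    # same trend, with day-to-day scatter

r2_smooth = line_r_squared(days, smooth)
r2_noisy = line_r_squared(days, noisy)

print(f"smooth series r-squared: {r2_smooth:.4f}")
print(f"noisy series r-squared:  {r2_noisy:.4f}")
```

The series that sits exactly on the line leaves nothing unexplained, so its r-squared is 1. The scattered series has the same upward trend but a lower r-squared, because part of its day-to-day movement is not captured by the line.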
Barron’s re-created the regression analysis of total deaths caused by the virus, which first emerged in the central Chinese city of Wuhan at the end of last year, and found a similarly high r-squared. We ran it by Melody Goodman, associate professor of biostatistics at New York University’s School of Global Public Health.
“I have never in my years seen an r-squared of 0.99,” Goodman says. “As a statistician, it makes me question the data.”
Real human data are never perfectly predictive when it comes to something like an epidemic, Goodman says, since there are countless ways that a person could come into contact with the virus.
For context, Goodman says a “really good” r-squared, in terms of public health data, would be a 0.7. “Anything like 0.99,” she says, “would make me think that someone is simulating data. It would mean you already know what is going to happen.”
There’s one scenario where the data could be understandably jiggered, Goodman says. Because there are privacy concerns around public health data, it’s conceivable that someone would simulate the data based on real data, so as to make the individuals unidentifiable. But even then, the r-squared in this case is extraordinarily high. Moreover, says Goodman, when data are manipulated to protect privacy, that would need to be disclosed; there is no such disclosure on the WHO site.
What does this mean for investors and analysts? If something seems too good to be true, it probably is.
Al Root contributed to this column.
Write to Lisa Beilfuss at lisa.beilfuss@barrons.com