
Jeremy3292

macrumors regular
Dec 9, 2012
137
129
Entirely untrue

If a person is vaccinated, they could still contract the virus and.... die. Odds are against it, but there is no 100% guarantee

Since you seem to enjoy semantics, allow me to rephrase: they offer 99.99% protection against hospitalization and death due to COVID-19.
 

Analog Kid

macrumors G3
Mar 4, 2003
8,860
11,382
Since you seem to enjoy semantics, allow me to rephrase: they offer 99.99% protection against hospitalization and death due to COVID-19.

Adding digits to the right of the decimal point doesn’t make an invented number any more true. You’re confusing precision with accuracy.
 
  • Like
Reactions: iGeneo

jerryk

macrumors 604
Nov 3, 2011
7,418
4,206
SF Bay Area
How many of those were obese or had preexisting conditions? As I said, 60 or at risk... healthy people under 60 dying from this is an anomaly. Vaccines are becoming widely available and easier to get.
Unfortunately, obesity is an epidemic in this country. In some states almost 50% of the adult population is obese, and this seems to be even more true of younger people. Put down the phone and the game controller and go for a run or walk. https://www.cdc.gov/obesity/data/prevalence-maps.html#overall Note: this is based on self-reported obesity, which likely underreports the actual rate.
 
  • Angry
Reactions: tonyr6

theluggage

macrumors 604
Jul 29, 2011
7,489
7,340
I don’t disagree with any of this in principle, though the numbers are a bit fast and loose.
Fast and loose is all anybody really has - fast-tracked clinical trials, attempts to untangle a base "R" from data that's affected by varying lockdown measures, seasonal factors etc. There are huge assumptions built into the model used for working out herd immunity.

It doesn’t need to be 100% to bring this to an end, we just need to keep R under 1— the further below, the faster this goes away.
...but R is hugely affected by public behaviour. Relaxing measures too quickly will push up R. Plus, R varies enormously by region, social factors etc. - all of these theories tend to look at averages over whole countries.

There have been several good BBC Horizon documentaries on this - and they pointed out things like how R isn't a terribly good description of reality: rather than each infected person passing it on to about R people, R tends to emerge from a minority of "spreaders" infecting rather more than R people (so a few super-spreader events can be quite significant). They also showed how small reductions in vaccine efficacy or small increases in effective R can easily push the required vaccine uptake to 90%+ rates, which could be quite challenging to achieve.
 

iGeneo

macrumors demi-god
Jul 3, 2010
1,386
2,588
Siemens is planning to hold its big every-two-years sales conference in Las Vegas in December, with 5,000 attendees.
Masks + Distancing + Ventilation + Temperature checks + Every 72 Hour testing = Go for it!

Let's get back to business sensibly
 

jzuena

macrumors 65816
Feb 21, 2007
1,125
149
65 and older in the US. Anyone younger gets thrown to the back of the vaccination bus after obese teenagers and Millennials.

And a lot of people under 65 have died. About 15 percent (78,000+) of the 525,000 Americans that have died were under 65. In one year, the disease has killed more Americans than all previous wars combined.

The US Civil War alone killed roughly 650,000 soldiers without counting any civilian casualties. WW I and WW II together also killed roughly 525,000... around the same as COVID.

Cancer has a higher annual death toll. That war statistic is pretty good. I will use that one.

see above
 

Analog Kid

macrumors G3
Mar 4, 2003
8,860
11,382
Fast and loose is all anybody really has
If you don’t have support for a specific number, then don’t quote a number.
...but R is hugely affected by public behaviour. Relaxing measures too quickly will push up R. Plus, R varies enormously by region, social factors etc. - all of these theories tend to look at averages over whole countries.
Again, we agree on the principles. I simply disagree with you setting a threshold at 100%. I think you disagree with that too, based on what you’re saying.
There have been several good BBC Horizon documentaries on this - and they pointed out things like how R isn't a terribly good description of reality: rather than each infected person passing it on to about R people, R tends to emerge from a minority of "spreaders" infecting rather more than R people (so a few super-spreader events can be quite significant). They also showed how small reductions in vaccine efficacy or small increases in effective R can easily push the required vaccine uptake to 90%+ rates, which could be quite challenging to achieve.
Welcome to statistics. This is the same argument that’s made when people say “yeah, but for that family the unemployment rate is 100%”. Re is an aggregate number, that’s the whole point.

Random isn’t uniform and smooth like peanut butter, it’s uneven and chunky like, well, peanut butter...


Again, we agree on just about everything here. I just took issue with the idea that a vaccine needs to be 100% effective for us to start getting back to normal.
 
Last edited:

theluggage

macrumors 604
Jul 29, 2011
7,489
7,340
If you don’t have support for a specific number, then don’t quote a number.
Sorry - my numbers came from: https://www.thelancet.com/article/S0140-6736(20)32318-7/fulltext

I thought I'd cited that in the original post. I didn't. My mistake.

I just took issue with the idea that a vaccine needs to be 100% effective for us to start getting back to normal.
I didn't say that. Somebody else was (wrongly) claiming that the vaccine was 100% effective.

You don't need a 100% effective vaccine... but reductions in the effectiveness of the vaccine and/or increases in R (e.g. due to variants and, in the case of R, getting sloppy with precautions) increase the % of the population that need to be vaccinated to hard-to-achieve levels. Hence, the arrival of a vaccine doesn't mean it's time to stop worrying about R.
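To make that concrete, here's a rough sketch using the textbook herd-immunity threshold, Vc = (1 - 1/R0)/efficacy - a big simplification of the Lancet model, but enough to show how quickly the required uptake climbs as efficacy drops or R rises:
Python:
# Rough back-of-the-envelope sketch: textbook herd-immunity threshold,
# NOT the model from the Lancet paper.  Vc = (1 - 1/R0) / vaccine efficacy.
def required_coverage(r0, efficacy):
    return (1 - 1 / r0) / efficacy

for r0 in (2.5, 3.0, 3.5, 4.0):
    for eff in (0.95, 0.85, 0.70):
        vc = required_coverage(r0, eff)
        note = " (not achievable)" if vc > 1 else ""
        print(f"R0={r0:.1f}, efficacy={eff:.0%}: vaccinate {vc:.0%}{note}")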

Re is an aggregate number, that’s the whole point.
No, that's why you don't just use the aggregate number blindly. At the very least you consider the standard deviation/variance of the data, and various other tests of statistical significance which will give you ranges/confidence intervals/p-values on any results...
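For example, the usual Stats-101 treatment of a sample mean looks something like this (entirely made-up numbers, just to show the mechanics):
Python:
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
# Made-up data: onward infections counted for 200 cases (illustrative only).
onward = rng.poisson(lam=1.2, size=200)

mean = onward.mean()
sem = stats.sem(onward)  # standard error of the mean
lo, hi = stats.t.interval(0.95, len(onward) - 1, loc=mean, scale=sem)  # df = n - 1
print(f"sample mean {mean:.2f}, 95% confidence interval ({lo:.2f}, {hi:.2f})")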

Random isn’t uniform and smooth like peanut butter, it’s uneven and chunky like, well, peanut butter...

Truly random data may look lumpy, but it will eventually build up to form a nice, smooth curve following a mathematical distribution. Normally a "normal distribution" (duh) - and many of the statistical calculations you use to get your results and assign confidence intervals are based on the assumption that the data follows such a distribution.

With something like an R number, if you just take a bunch of infected people and count how many they each infect in turn, you're assuming those numbers will be normally distributed around a central value, so the average and the variance will let you accurately calculate your "95% confidence" range for the possible values of R.

That's usually a good guess - but if you're badly wrong about that distribution then your results - particularly your confidence intervals - are suspect. A situation where only a fraction of the infected actually do the spreading - something that is being investigated - is the sort of thing that could drastically change the distribution and require a long, hard think as to whether the average actually meant anything and, at the very least, what the likely range really was.

NB: Not trying to suggest that the experts are making dumb assumptions about distributions or schoolkid mistakes... but they are making assumptions, which they carefully state, along with confidence intervals etc. - which the media then have a distressing tendency to strip out before cherry picking either the least- or worst- case scenario (depending on today's news agenda) for the headline.... and assumptions can and do change.

This is the same argument that’s made when people say “yeah, but for that family the unemployment rate is 100%”.

Say 1% of the population is unemployed. Maybe they're evenly distributed around the country, or maybe they're concentrated in a few towns where a major employer has closed - the average is going to be the same in both cases but the problems caused and possible solutions would be wildly different... and if you measured the average by taking a random sample of people from around the country you could completely miss some or all of the affected towns and get a totally wrong "average". Distributions are important.
 

Sakurambo-kun

macrumors 6502a
Oct 30, 2015
572
672
UK
The delusions many seem to be fostering that this will somehow magically be over by the summer are utterly laughable. This is going to take a very, very, very long time. To give an example, at the current rate of vaccination France won't have reached every adult until the end of 2023. That's of course if they can even persuade their population to believe science and evidence instead of their idiot conspiracy theorist family and friends on Facebook.

I don't expect large scale events will be possible until late next year at the very earliest.
 

Sakurambo-kun

macrumors 6502a
Oct 30, 2015
572
672
UK
I can't speak to Europe, but in the US I'd be fine with this. Pretty much at this point if you are above 60 or at risk you can get the vaccine, and if you're not in those categories who gives a crap, because this statistically isn't dangerous for you. Life should be getting back to normal soon...but it won't because people are stupid and overreact. You are probably more likely to die in a car accident on the way to getting a vaccine than from COVID if you are a healthy person under
Younger people are less likely to die of Covid but they are still very much susceptible to long-covid. Around 1 in 20 survivors are hit with this, and it can be incredibly debilitating. And there's no treatment for it, or even much understanding of what exactly it is.
 
  • Like
Reactions: jerryk

Analog Kid

macrumors G3
Mar 4, 2003
8,860
11,382
Thanks! Good read from a respected source.
I didn't say that. Somebody else was (wrongly) claiming that the vaccine was 100% effective.
Ah, ok, that wasn’t clear to me. Nobody you directed your response to was saying that. They were just saying that the vaccine was expected to be well distributed by June. What I saw in your response was this:

The vaccination is important, but it is not magic. It doesn't give 100% immunity, it may only offer partial protection against the known variants, and there's always the risk of a new variant showing up that evades the vaccine. Masks and testing aren't magic, either - they just improve the odds.

[...]

Sorry, but huge exhibitions/conferences like this need to go the way of the dodo.
Which reads like you're saying "it doesn't matter what we do, we can never have nice things again".

You don't need a 100% effective vaccine... but reductions in the effectiveness of the vaccine and/or increases in R (e.g. due to variants and, in the case of R, getting sloppy with precautions) increase the % of the population that need to be vaccinated to hard-to-achieve levels. Hence, the arrival of a vaccine doesn't mean it's time to stop worrying about R.
Agreed. Thus my comment:
We need to continue the good hygiene habits of washing our hands and coughing into our elbows, and wearing masks when we don’t feel well or when there’s a known outbreak of whatever comes next
But once we've vaccinated everyone willing and able, and given time to reach steady state, we're back to the strategy of addressing local outbreaks and keeping the hospitals from overflowing. The broad restrictions we've imposed are necessary and justifiable short term, but not permanently. Now is the time when it is critical to be vigilant, because just about any death we can postpone is a death we can avoid, and fewer cases mean fewer mutations.

The vaccine, and the time we take to stabilize once it’s distributed, is about as far as we're going to be able to take the broad social restrictions. Narrower restrictions might persist, such as requiring vaccination for certain activities (as we have now for other diseases) and new regulations on ventilation in buildings. As your article points out, we may wind up with a schedule of periodic vaccinations to top up, or we may wind up crafting new vaccines each year aimed at that year’s strains, but we can't ban sporting events, rock concerts and business conventions forever.

Mumps has an R0 of 10-12, and a vaccine efficacy of only 88%, but we never banned conventions because mumps exists. Different disease with different concerns, sure, but my point is that I can’t imagine COVID is the one that suddenly makes us ban large gatherings forever.
 

Analog Kid

macrumors G3
Mar 4, 2003
8,860
11,382
Doing this in two parts because my last post continues the thread of my original comment, and this post looks at stuff that’s come up in our discussion since.

No, that's why you don't just use the aggregate number blindly. At the very least you consider the standard deviation/variance of the data, and various other tests of statistical significance which will give you ranges/confidence intervals/p-values on any results...
To be clear, I'm not using R blindly or otherwise... You brought it into the conversation. R is simply a shorthand way of describing the rate of spread.

As with any aggregate statistic, the variances and confidence intervals will become smaller as the population you're measuring becomes larger.
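A quick sketch of that point (toy numbers, nothing to do with the actual epidemiology):
Python:
import scipy.stats as stats

# Toy example: any distribution will do; here, a Poisson with mean 2.
toy = stats.poisson(2)
for n in (10, 100, 1000, 10000):
    # Spread of the sample mean across 2,000 repeated samples of size n.
    means = toy.rvs(size=(2000, n)).mean(axis=1)
    print(f"n={n:>5}: std of the sample mean = {means.std():.3f}")

The spread shrinks roughly as 1/sqrt(n).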


Truly random data may look lumpy, but it will eventually build up to form a nice, smooth curve following a mathematical distribution.

The distribution is only smooth because it is not random. The distribution is not the data, it's a frequency plot of the sorted data. Sorting, by definition, makes it non-random.

With something like an R number, if you just take a bunch of infected people and count how many they each infect in turn, you're assuming those numbers will be normally distributed around a central value, so the average and the variance will let you accurately calculate your "95% confidence" range for the possible values of R.

That's usually a good guess - but if you're badly wrong about that distribution then your results - particularly your confidence intervals - are suspect. A situation where only a fraction of the infected actually do the spreading - something that is being investigated - is the sort of thing that could drastically change the distribution and require a long, hard think as to whether the average actually meant anything and, at the very least, what the likely range really was.
This is why aggregate data is important. The larger your dataset, the more normal the distribution of your aggregate statistics will become. When you look at smaller and smaller populations, the statistics become too uncertain to make any real policy decisions. Most governments make policy for fairly large populations. At the national or provincial level, the numbers are big enough that the statistics are clearer— the data within those populations are lumpy; cities might have reasonable variances but their values will differ relative to each other and the national numbers, for example. Neighborhoods and households probably start to get small enough that the data is pretty meaningless in isolation, which is why they tend to aggregate it in different ways (neighborhoods of color, income brackets, senior living developments, etc).

The R number is not measured directly; they don’t take infected people and count how many they each infect— that’s intractable. They look at the rate of growth of infection, fit a parameterized epidemiological model to the data, and estimate R from that.
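As a toy illustration only (synthetic case counts, and the crude SIR-style conversion R ≈ 1 + r·Tg with an assumed generation interval Tg - real models are far more sophisticated than this):
Python:
import numpy as np

rng = np.random.default_rng(1)
days = np.arange(30)
cases = rng.poisson(100 * np.exp(0.08 * days))  # synthetic exponential growth plus counting noise

# Fit log(cases) against time to recover the daily growth rate r...
r, _ = np.polyfit(days, np.log(cases), 1)
# ...and convert to R with the crude approximation R ~ 1 + r * Tg.
generation_interval = 5.0  # days, assumed
print(f"fitted growth rate {r:.3f}/day -> estimated R ~ {1 + r * generation_interval:.2f}")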


Say 1% of the population is unemployed. Maybe they're evenly distributed around the country, or maybe they're concentrated in a few towns where a major employer has closed - the average is going to be the same in both cases but the problems caused and possible solutions would be wildly different... and if you measured the average by taking a random sample of people from around the country you could completely miss some or all of the affected towns and get a totally wrong "average". Distributions are important.

Here you’ve shifted your focus from “spreaders” (individuals) to towns (aggregates). Choosing the right aggregate at the right scale is important, for sure.

The same is true of crime, by the way. Car thefts are performed by individuals, but policing policy is applied to communities.

Do you think spreader-individuals are evenly distributed or concentrated in a few towns? "Spreader" in this case meaning their propensity to spread, behavioral patterns aside. Unless this is one of the exceedingly few expressed characteristics tied to race or national origin or somehow highly correlated to wealth (all aggregation classes that we somehow tend to organize our communities around), I'd imagine it's evenly distributed because there is no factory-closure analog to the unemployment metaphor.

So, unless you have a way of profiling the spreaders and isolating them out in advance, it doesn’t matter if the agents of transmission are a few random individuals— we don’t know who those individuals are, so we need to set policy for the community as a whole. If we could isolate them out, say by a genetic test, then you’ve just formed an aggregate we can address.
 
Last edited:

fenderbass146

macrumors 65816
Mar 11, 2009
1,451
2,530
Northwest Indiana
Younger people are less likely to die of Covid but they are still very much susceptible to long-covid. Around 1 in 20 survivors are hit with this, and it can be incredibly debilitating. And there's no treatment for it, or even much understanding of what exactly it is.
Show me the source on that...the only studies I've seen were on people that were hospitalized. Long term side effects are just the newest scare tactic. Worst I've seen and at this point I've known dozens who have had COVID is slow return of taste and smell.
 
  • Like
Reactions: tonyr6

theluggage

macrumors 604
Jul 29, 2011
7,489
7,340
The distribution is only smooth because it is not random. The distribution is not the data, it's a frequency plot of the sorted data. Sorting, by definition, makes it non-random.

....Sorry, now going off the deep end because I'm trying to convince myself that I'm right... feel free to wander off and actually talk about Macs or the MWC while I ramble :)

I think you're confusing the data itself with how you choose to arrange it - and, in the end, that means that you're forgetting what it is that the data is supposed to represent.

Example:

Someone generates 100 random numbers.

Maybe they are throwing a pair of regular 6-sided dice (D6+D6) - so they'd get an average throw of 7.
Or, they could be using a single 12-sided D&D die (D12) - so they'd get an average of 6.5

Both of those could be described as "random" numbers, unless you choose to cherry-pick a narrow definition of "random". They're certainly both considered a fair basis for a game of chance.

Here's some "real" results (actually a simulation in Excel):

Alice's average throw: 6.65 (real ans: D6+D6)
Bob's average throw: 6.55 (real ans: D12)
Celia's average throw: 7.09 (real ans: D12)
Dave's average throw: 6.6 (real ans: D6+D6)

(Disclaimer, yes, I 'cherry picked' those a bit but only out of a dozen or so runs - you really can't reliably answer this by seeing if the average is closer to 6.5 or 7).

However, here's what Alice and Bob's distributions look like:

[attached: histograms of Alice's, Bob's, Celia's and Dave's throws]


Edit: Major brainfart. D6+D6 is not a normal distribution. It is, however, a triangular distribution that is very distinct from the "flat" D12 distribution.

...and, as they say, a picture paints a thousand words. You can clearly see that Alice & Dave's are the beginning of a bell-shaped normal distribution (the expectation for D6+D6) while the others are either a much broader bell or something else entirely. (You can also get a clue to that by calculating the variance - about 6 for Alice, 12 for Bob). It's the distribution that allows you to learn useful things about what Alice and Bob are doing. If I had all day, I'd generate a bunch of graphs and we could see how accurately we could identify them, but I'm pretty confident it would be vastly more accurate than looking at the averages.

...and just to map that on to Covid, imagine these "models" (NOTE: THESE ARE IMAGINARY!):

A: You have covid, every time you meet someone else, you throw a D12 - 7 or more means you pass it on to them
B: You have covid, every time you meet someone, you throw a D6 to decide how infectious you are, they throw a D6 to decide how vulnerable they are, if the total is 7 or more, they get infected.
C: You throw a D12 to decide whether to go to MWC. Throw a 12 and you go. At MWC, you toss a coin every time you meet someone and, if it's heads, you give them a big, sloppy kiss.

...OK, obviously you wouldn't want to take any of those too seriously, but they're vaguely plausible as crude models for transmission, and they'd produce different distributions and certainly affect the confidence intervals of any measurements you made. The reality is far more complex (and unknown) and, for a start, R is also going to be a function of how many people the "average" person meets (and many other things)... What it isn't is "random".
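Here's a quick sketch of those three toy models (with an assumed 20 contacts per person, and 200 contacts for an MWC attendee - both invented numbers), just to show how different the resulting distributions of "people infected per person" can be even when the averages are broadly similar:
Python:
import numpy as np

rng = np.random.default_rng(42)
people, contacts = 100_000, 20   # assumed contacts per infected person (toy numbers)

# Model A: one D12 per contact, 7 or more transmits (p = 6/12).
a = (rng.integers(1, 13, size=(people, contacts)) >= 7).sum(axis=1)

# Model B: infector's D6 plus contact's D6, total of 7 or more transmits (p = 21/36).
b = ((rng.integers(1, 7, size=(people, contacts))
      + rng.integers(1, 7, size=(people, contacts))) >= 7).sum(axis=1)

# Model C: 1-in-12 go to MWC and coin-flip each of 200 contacts there; everyone else infects nobody.
goes = rng.integers(1, 13, size=people) == 12
c = np.where(goes, rng.binomial(200, 0.5, size=people), 0)

for name, x in (("A", a), ("B", b), ("C", c)):
    print(f"model {name}: mean {x.mean():5.2f}, variance {x.var():8.2f}, max {x.max()}")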


This is why aggregate data is important. The larger your dataset, the more normal the distribution of your aggregate statistics will become.

No, it won't unless what you're measuring is actually following a normal distribution. In the above example, after a million trials, Bob's distribution will look completely flat and Alice's will be a nice smooth bell curve. They won't both turn into normal distributions. Even Alice's graph won't change its overall shape and proportions: those are fixed by the way Alice is generating her "random" numbers.

Yes, with more samples, the D6+D6 averages will converge on 7, and the D12 on 6.5. So, if you know that the data is either D12 or D6+D6 then, with enough samples, you can use the average to determine which. But... here's the catch - that's only valid because you know exactly what the two possible distributions are. If you don't know - which is the case with most real-life scenarios - or maybe only have a few guesses as to what they might be - then the best you can do is assume normal distribution, in which case - once you also factor in the standard deviations, you get something like 7 +/- 2.5 from D6+D6 and 6.5 +/- 3.5 from D12.
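(For reference, the exact population figures behind those rough ± ranges, computed over all equally likely outcomes rather than sampled:)
Python:
import numpy as np

d12_faces = np.arange(1, 13)
two_d6 = np.add.outer(np.arange(1, 7), np.arange(1, 7)).ravel()  # all 36 equally likely sums

for name, vals in (("D12", d12_faces), ("D6+D6", two_d6)):
    print(f"{name:>6}: mean {vals.mean():.2f}, std {vals.std():.2f}")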


E.g. if I just gave you a pair of averages, and told you that one was using D12 and the other D6+D6, you could do pretty well just by picking the higher average as the D6+D6. If I give you a list of averages and ask you to pick out the D12s, you'll get a lot of them wrong. Take away the guarantee that those are the only two possibilities, and the averages tell you pretty much nothing. Stats is really, really sensitive to the precise question being asked.

The thing about aggregate measures is that they're basically rubbish - taking any aggregate stat, average, variance, median, whatever essentially means throwing away most of your data and replacing it with sweeping assumptions about distributions etc. Even the rules for turning variance into confidence intervals, or deciding whether two measurements are significantly different, make assumptions about distribution. It's all a bit "1950 called" - when getting a frequency graph meant sending an order to the drawing office and waiting a week, or when running an algorithmic simulation meant rounding up a room full of human slide-rule jockeys, simple formulae based on aggregates were the only game in town. Now we have computers that can run simulations with huge numbers, munge datasets in a jiffy and generate a massive range of visual representations - why throw away data? Aggregates and simple formulae are good for back-of-the-envelope calculations and demonstrating principles, but I hope the experts are using something a bit more late-20th-century for the crucial decisions (I'm sure they are).

So, from that we can conclude that Dungeons and Dragons players are the prime spreaders of Covid-19 and should be forced to stay locked up in their parents' attic for the duration of the crisis. Difficult, but worth it :)

+++ End of rant +++
 
Last edited:

DanTSX

Suspended
Oct 22, 2013
1,111
1,505
Good. Time to get back to normal. If you have other health issues, or live to worship technocrats and quote statistics that you do not understand, stay at home.
 
  • Disagree
Reactions: Suckfest 9001

Analog Kid

macrumors G3
Mar 4, 2003
8,860
11,382
Warning to the disinterested: this post is really, really long so if you're not at all interested in our little statistical sidebar, start scrolling rapidly.

....Sorry, now going off the deep end because I'm trying to convince myself that I'm right... feel free to wander off and actually talk about Macs or the MWC while I ramble :)
No worries-- I tend to do the same. Discussing things like this forces me to think through my assumptions... Hopefully everyone else will tolerate the long post, but I included code because sometimes that's more clear than language, and the thread is rather quiet anyway so I don't think many people are really going to notice...
I think you're confusing the data itself with how you choose to arrange it - and, in the end, that means that you're forgetting what it is that the data is supposed to represent.
I don't think so. If you look at the plots you're sharing, they don't present the data, they present a summary of the data. The data is the sequence of dice rolls in your examples. I just generated 5 d12 rolls and got [6,7,11,7,3]. That's the data.

That data is lost in the plots you shared-- they tell me I rolled a 3, a 6, two 7's and an 11. It's been sorted and counted, so when I plot it, it will look organized; it is no longer random.

The data (dice rolls) may represent an action in a game. The histogram plots, in this case, indicate whether the dice are well balanced.

Sometimes it's the data you're after, sometimes it's the summary statistics (histogram, mean, variance, modes, etc).

Using Python, I'll create a d12 and a d6, then roll the d12 and a pair of d6's 100,000 times:
Python:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

d12=stats.randint(1,13)
d6=stats.randint(1,7)
d12Rolls=d12.rvs(size=100000)
d6Rolls1=d6.rvs(size=100000)
d6Rolls2=d6.rvs(size=100000)
twoD6Rolls=d6Rolls1+d6Rolls2

For the d12, the first portion of the data looks like this:
Python:
plt.plot(d12Rolls[0:1000],'.')
plt.title("d12 Rolls");plt.xlabel("Roll");plt.ylabel("Value");
[attached plot: d12 Rolls]


You'll notice that it's "lumpy". There are places where the same number gets rolled frequently, and places where it doesn't get rolled for a long time.

This is the histogram of the data, which is what you're plotting:
Python:
plt.hist(d12Rolls,bins=12)
plt.title("Histogram of d12 Rolls")
plt.xlabel("Value");plt.ylabel("Count");
[attached plot: Histogram of d12 Rolls]

It is a smooth, flat line. The histogram is not technically the distribution, but with enough points it will converge to the distribution-- in this case a uniform distribution. Notice that the horizontal axis now is the value (which was the vertical axis) and the vertical is the count. The actual data (the rolls) are nowhere to be found. There are now only 12 numbers left-- the count of how many times each die value occurred. It is sorted and summarized, and is no longer random.

You can clearly see that Alice & Dave's are the beginning of a bell-shaped normal distribution (the expectation for D6+D6) while the others are either a much broader bell or something else entirely.

d6+d6 is most certainly not bell shaped and is not a normal distribution. The d12 is a uniform distribution.

The data from the first thousand rolls:
Python:
plt.plot(twoD6Rolls[0:1000],'.')
plt.title("d6 + d6 Rolls");plt.xlabel("Roll");plt.ylabel("Value");
[attached plot: d6 + d6 Rolls]


Again, lumpy. (If you've ever won Settlers of Catan because that 12 on your wheat kept striking gold, then lost at Settlers of Catan because you kept waiting for that 12 on your wheat to hit at least once in the game and it never happened, you'll know what I mean by lumpy.)

The histogram of 100,000 rolls, however, is not lumpy. It is a nice smooth curve (well, as smooth as it can be with only 11 values):
Python:
plt.hist(twoD6Rolls,bins=11)
plt.title("Histogram of d6 + d6 Rolls")
plt.xlabel("Value");plt.ylabel("Count");
[attached plot: Histogram of d6 + d6 Rolls]


That is not a bell curve, it's a triangle, and not because it only has 11 values. Here's the same experiment with d100's:
Python:
d100=stats.randint(1,101)
d100Rolls1=d100.rvs(size=100000)
d100Rolls2=d100.rvs(size=100000)
twoD100Rolls=d100Rolls1+d100Rolls2
plt.hist(twoD100Rolls,bins=199)
plt.title("Histogram of d100 + d100 Rolls")
plt.xlabel("Value");plt.ylabel("Count");
[attached plot: Histogram of d100 + d100 Rolls]


Still a triangle, versus 100,000 points sampled from a normal distribution:
Python:
plt.hist(stats.norm().rvs(size=100000),bins=199);
[attached plot: histogram of 100,000 normal samples]

which looks like the archetypical bell.

Now, let's look at your "small number of spreaders" example. You can't really model that with a single die roll. Your "roll a die and apply a threshold, above which you're a spreader" model is a reasonable one-- I'll model it here as a Bernoulli distribution, which is essentially the same principle. In this case, 5% of people are spreaders and each infects 10 people; the rest of the population give it to no one. I'll generate a population of 350 million people.

Python:
spreaderStats=stats.bernoulli(0.05)
spreaders=spreaderStats.rvs(size=350*1000*1000)*10

Here, I'll show the distribution first.
Python:
plt.hist(spreaders,bins=11)
plt.title("Infectivity of the Population")
plt.xlabel("Infectivity");plt.ylabel("Number of People");
[attached plot: Infectivity of the Population]


Again, this isn't the data-- it's been sorted and is showing the count of the group who would infect no one and the group that would infect 10 people. It's hard to call a binary distribution like this "smooth" because there's only two values, but you get the idea. In reality it may be some other bimodal distribution where you have a large group who would infect a small number of people and a small group who would infect a large number of people.

This is the data:
Python:
plt.plot(spreaders[0:1000],'.')
plt.title("Infectivity of Individuals")
plt.xlabel("Specific Individual");plt.ylabel("Infectivity");
[attached plot: Infectivity of Individuals]


To see how this data is lumpy, assume this is a 1-dimensional nation with all of the people lined up shoulder to shoulder and where town boundaries are every 100 people (the first 100 people live in Onetown, the next hundred people live in Twotown, etc.), and look at how many spreaders are in each town:

Python:
townsize=100
towns=np.count_nonzero(spreaders.reshape((-1,townsize)),axis=1)  # one row per town of townsize consecutive people
plt.plot(towns[0:1000],'.')
plt.title("Spreaders in each town")
plt.xlabel("Specific Town");plt.ylabel("Spreaders in Town");
[attached plot: Spreaders in each town]


In the first 1000 towns, there are some without spreaders and one with 15. Among all the towns there were some as high as 20 in this run. Using a larger town size of 10,000 people, the data looks like this:

[attached plot: spreaders per town, town size 10,000]


Data like that is a little hard to interpret directly, so if we want to know, for example, how many spreaders we could expect in each town, we can summarize the data with a histogram of how many towns we'd expect to have how many spreaders (not specific towns in this case, but summary statistics):

Python:
plt.hist(towns,bins=towns.max()-towns.min());
plt.title(f"Spreaders in each town of {townsize}")
plt.xlabel("Number of Spreaders in Town");plt.ylabel("Number of Towns with That Many Spreaders");
[attached plot: histogram of spreaders per town]


Whoa! Where'd the bell shape come from all of a sudden?!

Now we're back to my point about aggregate statistics tending toward normal...

No, it won't unless what you're measuring is actually following a normal distribution.

If you'd followed the link I gave you it would have taken you to the description of the Central Limit Theorem, which says that it doesn't matter what the underlying distribution is-- if you start adding up samples from it, the sums (and therefore the averages) converge to a normal distribution. So the sample mean, for example, which is a sum of samples divided by the number of samples, will have a normal distribution.

This is where the bell curve above comes from (not strictly normal because all values in these samples are positive and a true normal distribution has tails to negative infinity, but closer to log-normal which approaches normal as you get further from zero).

In your dice rolling experiments, you saw that the means were a bit shaky-- that's because the sample mean is itself a random variable.

Let's say we look at the average of 50 rolls of a d12, and we run that test 10,000 times. The histogram of those results looks normal (well, log-normal):
Python:
rolls=50;trials=10000
d12Trials=d12.rvs(size=(trials,rolls))
plt.hist(d12Trials.mean(axis=1),bins=200)
plt.title(f"Histogram of mean of {rolls} of d12");
[attached plot: Histogram of mean of 50 of d12]


If we do the same scenario, but roll two d6's, also normal-ish:

Python:
rolls=50;trials=10000
d6TrialA=d6.rvs(size=(trials,rolls))
d6TrialB=d6.rvs(size=(trials,rolls))
d6Trials=d6TrialA+d6TrialB
plt.hist(d6Trials.mean(axis=1),bins=200)
plt.title(f"Histogram of mean of {rolls} of 2d6");

[attached plot: Histogram of mean of 50 of 2d6]


And if we look at our Bernoulli distribution:

Python:
rolls=500;trials=10000
spreaderTrials=spreaderStats.rvs(size=(trials,rolls))*10
plt.hist(spreaderTrials.mean(axis=1),bins=100)
plt.title(f"Histogram of mean infectivity of {rolls} people");

[attached plot: Histogram of mean infectivity of 500 people]


In all cases, the expected value of the sample mean is the population mean and the variance of the sample mean is the population variance divided by the sample size.
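A quick numerical check of that claim, using the d12 (population variance ≈ 11.92):
Python:
import scipy.stats as stats

d12 = stats.randint(1, 13)
n, trials = 50, 100000
sample_means = d12.rvs(size=(trials, n)).mean(axis=1)

print(f"population variance / n:       {d12.var() / n:.4f}")
print(f"observed variance of the mean: {sample_means.var():.4f}")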

That's what I mean by aggregate statistics, such as the sample mean, becoming more normal for large datasets regardless of the underlying distribution.


So, in the end, it really doesn't matter what the underlying distribution is. If you're looking at the average number of people infected by each infected person, that average is represented by a normal distribution.
 

theluggage

macrumors 604
Jul 29, 2011
7,489
7,340
That is not a bell curve, it's a triangle, and not because it only has 11 values.

You're absolutely right, and I goofed by describing D6+D6 as a normal distribution. (and one day I'll learn to go straight to code and not try to do a 'quicky' in Excel because you always regret it...)

Point is though, D6+D6 and D12 do give distinctly different distributions, and you can tell them apart from each other with a relatively small number of samples, whereas you need thousands of samples to reliably spot the different averages and ranges. If you re-write my argument with Alice using stats.norm() and Bob using random() then I think it works even better (apart from the D&D joke)...

I think the issue here is that there can be two distributions at play: the "fundamental" distribution of the phenomena you are trying to observe and the "error" distribution of your measurement technique. If you're (say) measuring the height of a pole with a ruler that can only measure to within +/- 1mm then you'll get measurements distributed around the actual height of the pole. If you're measuring 100 poles that are supposedly the same height with a super-accurate measure, then you'll get a different distribution based on how accurately the pole factory cuts their poles. If you're measuring 100 different poles with a +/- 1mm ruler then you've got a combination of the two distributions.

...and if it then turns out that, in reality, rather than just random, small manufacturing errors, 1 in 10 of the poles is 2mm short but the other 9 are within 0.1mm, that might have very different implications as to how you interpret the "average" height of a pole...
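A rough sketch of that pole scenario with made-up numbers (nominal 2000 mm poles, 1-in-10 cut 2 mm short, the rest within about 0.1 mm, measured with a ±1 mm ruler):
Python:
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
nominal = 2000.0                                   # mm (made-up)

short = rng.random(n) < 0.10                       # 1 in 10 poles is 2 mm short
true_len = np.where(short, nominal - 2.0, nominal + rng.normal(0, 0.1, n))
measured = true_len + rng.normal(0, 1.0, n)        # ruler only good to about +/- 1 mm

print(f"mean measured length: {measured.mean():.3f} mm")
print(f"std, all poles:       {measured.std():.3f} mm")
print(f"std, good poles only: {measured[~short].std():.3f} mm")

With those numbers the average lands around 1999.8 mm, which isn't a good description of either the good poles or the short ones.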
 

Analog Kid

macrumors G3
Mar 4, 2003
8,860
11,382
Ok, so it sounds like we’re in agreement now that the distribution is not the data, and that aggregate statistics such as the sample mean will, in fact, tend toward a normal distribution, which makes determining variances and confidence intervals straightforward.

Point is though, D6+D6 and D12 do give distinctly different distributions, and you can tell them apart from each other with a relatively small number of samples, whereas you need thousands of samples to reliably spot the different averages and ranges.
I know this isn’t your point, but the hundred sample trials you illustrated earlier are already pretty good at separating the two averages. Here’s the distribution of sample means for 100 rolls:

[attached plot: sample means for 100 rolls, d12 vs 2d6]


If you identified all averages over 6.75 as 2d6 and all averages below as d12, you’d be right about 80% of the time.

Your point is that we can’t distinguish between two distributions with the same sample mean by looking at the mean. That’s true, but I’m not sure how it’s relevant to this discussion.


Getting back to the COVID question, why do we care what the underlying distribution is? How would it affect our policy decisions?
 

theluggage

macrumors 604
Jul 29, 2011
7,489
7,340
If you identified all averages over 6.75 as 2d6 and all averages below as d12, you’d be right about 80% of the time.
Aha, a challenge. This is easily tested:
A. simulate 100 throws of either a D12 or D6+D6
B. apply your rule: average > 6.75 means 2D6. Now compare with reality.
C. run a few thousand trials and work out the accuracy

So, I've tried that (using bits of your Python code for A) and, yes, your method gets a bit over 80% right. Cool, but not the sort of 95% confidence that you'd want to count on.

My method: calculate the standard deviation for the 100 throws (np.std(twoD6Rolls,ddof=1)) - if it's less than 3 it's 2d6, otherwise it's a d12. (or use the variance and 9 as a threshold - same diff)

Try it - it's over 99.8% accurate for 100 throws.
It is over 80% accurate for just 10 throws.
Your method gets to ~99% for 1000 throws, so yeah, you're measuring the mean more accurately, but at that point my method is scoring 100.0%
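A minimal Python sketch of that comparison (not my exact code, but the same two rules - average above 6.75 for the mean rule, sample standard deviation below 3 for the spread rule):
Python:
import scipy.stats as stats

d12 = stats.randint(1, 13)
d6 = stats.randint(1, 7)

def accuracy(n_throws, trials=5000):
    d12_runs = d12.rvs(size=(trials, n_throws))
    d6_runs = d6.rvs(size=(trials, n_throws)) + d6.rvs(size=(trials, n_throws))
    # Mean rule: average above 6.75 -> call it D6+D6.
    mean_rule = ((d6_runs.mean(axis=1) > 6.75).mean()
                 + (d12_runs.mean(axis=1) <= 6.75).mean()) / 2
    # Spread rule: sample standard deviation below 3 -> call it D6+D6.
    std_rule = ((d6_runs.std(axis=1, ddof=1) < 3).mean()
                + (d12_runs.std(axis=1, ddof=1) >= 3).mean()) / 2
    return mean_rule, std_rule

for n in (10, 100, 1000):
    m, s = accuracy(n)
    print(f"{n:>4} throws: mean rule {m:.1%}, spread rule {s:.1%}")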

The difference? You're confusing the normal distribution you get from trying to measure a single, well-defined value using a limited number of samples, with trying to measure a phenomenon that has its own non-normal population distribution.

In this case, the d12 distribution is so far from being normal that it shows up in the variance/standard deviation long before you have enough samples to reliably estimate the population mean. Even the distinction between population and sample variance (ddof, the N-1 term) makes stuff-all difference because even that assumes that the sample distribution is an estimate of a normal population distribution.

Ok, so it sounds like we’re in agreement now that the distribution is not the data, and that aggregate statistics such as the sample mean will, in fact, tend toward a normal distribution, which makes determining variances and confidence intervals straightforward.

Not really. The distribution is the data, transformed to the frequency domain rather than the time/order domain. The only thing you lose is the order of the samples, which may or may not be important, and which you're throwing away anyway - along with a lot more - when you take aggregate statistics. The aggregates can just as easily be measured from the distribution plus you need to look at the distribution to be sure that the aggregates are a reasonable representation of the phenomenon you are trying to measure. If you just work with means, variances etc. using the usual Stats 101 techniques then, sooner or later, you'll have made the implicit assumption that the population distribution is normal and/or has a well-defined central value.

Getting back to the COVID question, why do we care what the underlying distribution is? How would it affect our policy decisions?

First, I'm assuming that the experts know what they're doing better than us (at least up to the point where politics starts interfering), and are using more sophisticated stats and models than get reported in the mass media, or than we're arguing about here, so I hope anybody looking for Covid advice has long since tuned out...

But, crudely (and without being too literal), say you assumed that the "real" value of R in the population followed a normal distribution with a single, well-defined central value, did your survey on a limited sample of the population and found an 80% chance that it was below 1, while I had a better model of the underlying distribution and was able to use the same sample size to get a 99.8% chance that it was below 1, which would be more helpful for policy decisions?

The original discussion was about the formula for the vaccination rate needed for herd immunity: that was pretty sensitive to the value of R, and for high R values the vaccination rate reaches problematic levels. So you don't just need a good estimate of the average R, you need a good idea of what the highest and lowest likely values are.

...and if you hit those problematic levels, the only thing to do is to reduce "real" R via lockdowns etc. at which it's a good thing to have some models as to how "R" actually works... If transmission does turn out to be a subset of infected people accounting for the majority of infections via super-spreader events (which was suggested in one of the BBC documentaries) - which you'd expect to show up in the distribution - then doing things like stopping idiots holding massive face-to-face tech conferences could be a high priority.
 

Analog Kid

macrumors G3
Mar 4, 2003
8,860
11,382
Aha, a challenge.
That's not how I meant it. You said you can’t reliably distinguish based on the mean without thousands of samples, and I just quantified what could be done with the few hundred you were originally working with. It was also meant to illustrate that it’s pretty straightforward to understand how predictive your model is, and that running a handful of trials by hand doesn’t always give insight.

Of course standard deviation works better to distinguish between the two cases. I specifically pointed out that the mean only works at all because one dataset has 11 possible values and the other has 12, so the sample means are distinct. Likewise I could construct a case where the standard deviations are the same but the means are not (for example, a d12 numbered 1-12 and another numbered 5-16).
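(A quick check of that construction:)
Python:
import scipy.stats as stats

die_a = stats.randint(1, 13)   # d12 with faces 1-12
die_b = stats.randint(5, 17)   # d12 with faces 5-16: same spread, different mean

for name, die in (("faces 1-12", die_a), ("faces 5-16", die_b)):
    print(f"{name}: mean {die.mean():.2f}, std {die.std():.3f}")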

I’m not sure what your point was with this example from the beginning, but it’s a weird exercise to be discussing. If we wanted to determine the distribution, we’d collect a lot of data and compare the fit of its pdf/pmf to various distributions.

The difference? You're confusing the normal distribution you get from trying to measure a single, well-defined value using a limited number of samples, with trying to measure a phenomenon that has its own non-normal population distribution.
I am not. You started this d12 / 2d6 discussion. I’ve only been correcting or clarifying assertions around it. I don’t completely understand how it fits into the conversation. You can’t reconstruct the distribution from a few summary statistics just like you can’t reconstruct the data from the histogram.

I think maybe you are suggesting that the process of evaluating the sample mean lacks value because it is not reversible? You started by talking about an inability to determine confidence intervals in mean values without knowing the data distribution. I’m making the point that aggregate statistics, such as the sample mean, are normally distributed, by the CLT, making it possible to determine those variances and confidences. I demonstrated that for a few varied distributions— regardless of the distribution, the sample statistics approach normal. I haven’t claimed that you can determine the original distribution from those sample statistics.

The distribution is the data

No, it is a summary of the data. If I take 10,000 dice rolls and show you a 12-valued histogram, I've destroyed a ton of information.

Take that Bernoulli distribution I was using before, but instead of a 5% threshold, let's use a 50% threshold and generate 8192 samples.

Python:
seq=stats.bernoulli(.5).rvs(2**13)
plt.figure(figsize=(12,5))
plt.plot(seq[0:200],'.')
plt.title(f"200 of {len(seq)} Coin Tosses");
plt.xlabel("Toss");plt.ylabel("0: Tails, 1: Heads");
[attached plot: 200 of 8192 Coin Tosses]


The actual data is 8192 bits (1 or 0 per datapoint), or 1 kB. Because the data is random, no lossless compression algorithm will be able to shrink it down.
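(If you want to check that, pack the bits and run them through a general-purpose compressor, alongside an equally long but structured sequence:)
Python:
import zlib
import numpy as np
import scipy.stats as stats

random_bits = stats.bernoulli(.5).rvs(2**13).astype(np.uint8)   # 8192 fair coin tosses
structured_bits = np.tile([0, 1], 2**12).astype(np.uint8)       # 8192 alternating bits

for name, bits in (("random", random_bits), ("structured", structured_bits)):
    raw = np.packbits(bits).tobytes()                           # 1024 bytes either way
    print(f"{name:>10}: {len(raw)} B raw -> {len(zlib.compress(raw, 9))} B compressed")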

The histogram is just two 13-bit values:
Python:
plt.hist(seq,bins=2)
plt.title("Histogram of Coin Tosses");plt.xlabel("Tails / Heads");
[attached plot: Histogram of Coin Tosses]


which would fit comfortably in 4 bytes. The histogram has clearly thrown away information. It is not the data.

Here are histograms of two other datasets:
Python:
dataset1=d12.rvs(1000000)
plt.figure()  # separate figures so the two histograms don't overlay
plt.hist(dataset1,bins=12)
plt.title(rf'Dataset 1, $\mu$={dataset1.mean():.3} $\sigma$:{dataset1.std():.3}');

dataset2=(np.arange(0,1000000)%12)+1
plt.figure()
plt.hist(dataset2,bins=12);
plt.title(rf'Dataset 2, $\mu$={dataset2.mean():.3} $\sigma$:{dataset2.std():.3}');
[attached plots: histograms of Dataset 1 and Dataset 2]


Same flat histogram, same mean and standard deviation. Here's a view of the data for each:
Python:
plt.figure(figsize=(12,5))
plt.plot(dataset1[0:200],'.')
plt.title("Dataset 1");

plt.figure(figsize=(12,5))
plt.plot(dataset2[0:200],'.')
plt.title("Dataset 2");
[attached plots: Dataset 1 and Dataset 2 raw data]


transformed to the frequency domain rather than the time/order domain

No, the distribution is not in the frequency domain. These are the magnitude plots of the frequency transformed data for the last two plots above:

Python:
plt.figure()  # separate figures so the two plots don't overlay
plt.plot(np.abs(np.fft.fftshift(np.fft.fft(dataset1-dataset1.mean()))));
plt.suptitle("Frequency Magnitude Plot of Dataset 1")
plt.title("(adjusted by mean to remove DC)");

plt.figure()
plt.plot(np.abs(np.fft.fftshift(np.fft.fft(dataset2-dataset2.mean()))));
plt.suptitle("Frequency Magnitude Plot of Dataset 2")
plt.title("(adjusted by mean to remove DC)");

[attached plots: Frequency Magnitude of Dataset 1 and Dataset 2]


Notice that dataset 1 is just as random in the frequency domain as it was in the time domain (while dataset 2 looks pretty structured in both domains). The frequency transform is reversible, so the frequency domain representation is actually the data-- just in a different form.

Python:
plt.figure(figsize=(12,5))
plt.plot(dataset1[0:200],'.',ms=20,label="Non-Transformed")
plt.plot(np.abs(np.fft.ifft(np.fft.fft(dataset1)))[0:200],'.',label="Transformed");
plt.legend();plt.title("Comparison of Data Before and After Frequency Transform");
[attached plot: Comparison of Data Before and After Frequency Transform]


The only thing you lose is the order of the samples
No, you lose the identity of the samples. In your dice roll examples, that's pretty much just the order. In the covid spreader scenario you're talking about, though, you would be losing the identity of which people are capable of spreading the most. That sounds like something important, if it were data you were able to collect.

So, no, the distribution is not the data.


But, crudely (and without being too literal), say you assumed that the "real" value of R in the population followed a normal distribution with a single, well-defined central value, did your survey on a limited sample of the population and found an 80% chance that it was below 1, while I had a better model of the underlying distribution and was able to use the same sample size to get a 99.8% chance that it was below 1, which would be more helpful for policy decisions?

The original discussion was about the formula for the vaccination rate needed for herd immunity: that was pretty sensitive to the value of R, and for high R values the vaccination rate reaches problematic levels. So you don't just need a good estimate of the average R, you need a good idea of what the highest and lowest likely values are.

Maybe I don't understand what you think R is. R (R0, Re, etc) are average values-- they are the average number of new people infected by one infected person. They are averages. I feel like I keep repeating this: Averages are normally distributed regardless of the underlying distribution for sufficiently large samples, where sufficiently large is typically much less than 100. For R estimation, we're talking about samples the size of communities-- they're plenty large to yield normally distributed results.

But, again, R is almost certainly not an input to most epidemiological models-- it's likely being reported by the model once the model has been fit to the dynamic case count data.

If transmission does turn out to be a subset of infected people accounting for the majority of infections via super-spreader events (which was suggested in one of the BBC documentaries) - which you'd expect to show up in the distribution - then doing things like stopping idiots holding massive face-to-face tech conferences could be a high priority.

Here's where you lose me. Why does it matter if one in 5 people spread to 10 others, or 5 of 5 people spread to 2 each? The rate of spread is the same.
 


theluggage

macrumors 604
Jul 29, 2011
7,489
7,340
I am not. You started this d12 / 2d6 discussion. I’ve only been correcting or clarifying assertions around it. I don’t completely understand how it fits into the conversation.

d12/2d6/3d4/whatever are just ways of generating an unpredictable number with distinctly different distributions. Another way would be to infect someone with - let's call it Randovirus so nobody thinks we're trying to solve Covid here - and count how many people they go on to infect... It's not likely that throwing a pair of dice is a good way of modelling that, but some combination of unpredictable events that could be modelled with random numbers is quite plausible.

...what is not plausible is that every single person infected with Randovirus infects (say) 7 people, no more, no less. It's more likely that the population distribution of number of infections is normally distributed around an average R=7, and that's not a bad guess in the absence of any other evidence. However, the super-spreader model, or a combination of two slightly different strains, for example, could produce an asymmetrical or double-peaked distribution which won't be normal. There will still be a mean that you can measure, and your attempts to measure that mean will be normally distributed around the true mean, but the actual number of people an individual infects might be following a completely different distribution.


No, the distribution is not in the frequency domain.

The frequency distribution of a data set literally is the data transformed into the frequency domain. Yes, you're right, you can't go back to the exact same set of data points - but you can use the frequency distribution to generate a realistic sequence of data points - which is literally what you're doing when you use the bernoulli/normal/etc. functions in Python, and you absolutely could use a 12-valued histogram to generate a realistic sequence of 2d6 rolls. Often, the original sequence/identity of the data points will be the alphabetic order of the subjects' names, or the order in which the forms got posted back, or suchlike.

If it helps, just imagine writing out each of your original data records on an index card (don't forget to number them if you want to put them back in order) then physically sort them into neatly-piled 'bins'. There's your frequency histogram, and nothing has been lost. Of course, doing that on a computer is trivial...

C.f. what you seem to be advocating: writing the mean (and maybe the variance) on a post it and never returning to the original data at all...

Maybe I don't understand what you think R is. R (R0, Re, etc) are average values-- they are the average number of new people infected by one infected person.

...and any real decisions made using them have to take into account the likely distribution of the real number of new infections around that average, in order to work out the best/worst case scenarios for risk assessments etc.

You're mixing that distribution up with the distribution of the estimates of that average obtained by sampling. And/or assuming that the real distribution will be normal, just because that it often is. A super-spreader scenario, or a mix of different variants, or a collection of sub-populations with drastically different numbers of social contacts could all give distinctly non-normal distributions which won't become normal no matter how many samples you take, CLT or not.

...and that could potentially muck up your confidence intervals.

But, again, R is almost certainly not an input to most epidemiological models

It is fairly important in the model used to predict the level of vaccinations required for herd immunity (https://www.thelancet.com/article/S0140-6736(20)32318-7/fulltext) which is what kicked off this conversation. That also includes the efficacy of the vaccine, which is subject to all the same uncertainties about distributions (for starters, you've got half a dozen vaccines with different efficacies, so at the very least it will be half a dozen peaks).

Also, if you look back at that article, notice that everything is a range. They don't plug in "R=3" they plug in "R is between 2.5 and 3.5" and get 60-72% out. The likely range of values of R is critical.

Here's where you lose me. Why does it matter if one in 5 people spread to 10 others, or 5 of 5 people spread to 2 each? The rate of spread is the same.

...but the population variance of those two scenarios will be significantly different, and that could affect the margin of error/level of confidence on any forecasts made based on that rate.

Also, this is not just about the total summary statistics on the scale of whole nations: regional hotspots, spikes in particular sections of the population etc. are all potential issues which make relying on the central limit theorem to smooth everything out at national/world level inadequate.
 

Analog Kid

macrumors G3
Mar 4, 2003
8,860
11,382
A super-spreader scenario, or a mix of different variants, or a collection of sub-populations with drastically different numbers of social contacts could all give distinctly non-normal distributions which won't become normal no matter how many samples you take, CLT or not.

Please, cite some support for this assertion. Why do you think the Central Limit Theorem would not apply in these scenarios? Your refutation of this basic principle seems core to your whole argument. Let's focus on that.

I showed you a simple simulation of a super-spreader scenario above and the averages were normally distributed as expected by CLT.


...but the population variance of those two scenarios will be significantly different, and that could affect the margin of error/level of confidence on any forecasts made based on that rate.
And I'm asking for a clear example of how the population variance affects decision making. You keep talking about the importance of R. R is a population average. If you could average the whole population, there would be no confidence ranges. Since we can't average the whole population, we fit data to a subset (sample) of it. The value we get based on that sample will vary from the true population average-- that is, our estimate is itself a random variable. CLT says that for the sample sizes we're working with, that random variable is normally distributed regardless of the underlying population distribution.

Unless you can somehow show otherwise.

Also, this is not just about the total summary statistics on the scale of whole nations: regional hotspots, spikes in particular sections of the population etc. are all potential issues which make relying on the central limit theorem to smooth everything out at national/world level inadequate.
Yes, we need to look at sub-populations as well, but every population we look at is large enough that the statistical methods are valid-- otherwise there's no point. I don't see anything in that Lancet article about vaccination policies targeting individual apartment blocks of 25 people. As I pointed out much earlier in this discussion, we're making policy for groups that are sufficiently large.


Just about everything from here has nothing to do with the epidemiology, it's mostly just trying to clarify statistics in general.

...what is not plausible is that every single person infected with Randovirus infects (say) 7 people, no more, no less. It's more likely that the population distribution of number of infections is normally distributed around an average R=7, and that's not a bad guess in the absence of any other evidence. However, the super-spreader model, or a combination of two slightly different strains, for example, could produce an asymmetrical or double-peaked distribution which won't be normal. There will still be a mean that you can measure, and your attempts to measure that mean will be normally distributed around the true mean, but the actual number of people an individual infects might be following a completely different distribution.

There's nothing I disagree with here.

The frequency distribution of a data set literally is the data transformed into the frequency domain.

No, those are different things. I showed you the difference. The frequency distribution is a data reduction technique that summarizes data by binning and counting. The frequency domain representation is the result of transforming a time series into a form that highlights recurring patterns-- typically by decomposing that data into a set of sinusoids of varying magnitude and phase.

The frequency distribution is a summary of the data, the frequency domain is another way of showing the actual data.
Yes, you're right, you can't go back to the exact same set of data points
If you can't go back, then what you have left is no longer the data. That was my only point.
you can use the frequency distribution to generate a realistic sequence of data points - which is literally what you're doing when you use the bernoulli/normal/etc. functions in Python, and you absolutely could use a 12-valued histogram to generate a realistic sequence of 2d6 rolls.
This isn't strictly true. There may be patterns in the data that aren't detected by a simple histogram. I showed you two datasets with the same histogram that were quite different-- of note, one was random and the other was not. It was clear in that data, but may not always be. Data encodings for communication signals look ever more like random data as they try to get closer to the Shannon limit of a communication channel, but if you just captured their frequency distribution and tried to recreate "realistic data", you would lose the information content of the original signal.

(don't forget to number them if you want to put them back in order) [...] and nothing has been lost
It is specifically that numbering that has been lost. I think you know that, which is why you needed to insert that caveat?

...and any real decisions made using them have to take into account the likely distribution of the real number of new infections around that average, in order to work out the best/worst case scenarios for risk assessments etc.
Agreed.
You're mixing that distribution up with the distribution of the estimates of that average obtained by sampling.
I'm not sure I follow your meaning here. Are you saying that I'm making the mistake of following standard statistical procedure by sampling a large population to determine the statistics of that population?

It is fairly important in the model used to predict the level of vaccinations required for herd immunity (https://www.thelancet.com/article/S0140-6736(20)32318-7/fulltext) which is what kicked off this conversation. That also includes the efficacy of the vaccine, which is subject to all the same uncertainties about distributions (for starters, you've got half a dozen vaccines with different efficacies, so at the very least it will be half a dozen peaks).

Also, if you look back at that article, notice that everything is a range. They don't plug in "R=3" they plug in "R is between 2.5 and 3.5" and get 60-72% out. The likely range of values of R is critical.
The second half of my sentence made clear that what I meant is that we don't measure R by going out and counting how many people each individual infected. We take the case data over time and fit a model to it. R would be one of the fitted parameters that we'd read out.
 
Last edited: