David Hand: AI, Dark Data, LLMs, Peer Review
August 14, 2023 • 1:27:27

Transcript
Most current AI systems in these large language models are based on data-driven models, and there is this risk of them being fundamentally brittle.
Man, I'm excited to bring Professor David Hand to the TOE podcast. He's an eminent British statistician and a professor of mathematics at Imperial College London. David's made significant contributions to the field of data analysis and statistical theory, as well as something else that he's pioneered called dark data.
Professor Hand has the ability to dissect and simplify ordinarily convoluted statistical concepts and articulate them in a digestible manner. This is a rare skill, and it's one of the main reasons I'm honored to introduce you to him. Questions explored today are: what is dark data?
Are there different kinds? What is its relationship to dark matter? And how could it be that the data we're missing is more important than the data we have? Why isn't it as simple as just collecting more data? Also, you've heard the phrase that there are known knowns, known unknowns, and unknown unknowns. Can you categorize those unknown unknowns? Is this a perennial problem, or are there techniques to overcome this ostensibly intractable conundrum? Further, this is something extremely relevant:
How do these issues become exacerbated by the large language models that we see coming out, a new one every month or so now? Also, what's the difference between a data-driven model and a theory-driven model? Make sure to stick around until the end of the podcast, because I go over all 15 dark data types. You may want to watch this section first to gain an overview, or wait until the end as a summary. The timestamp is above. Thank you, and enjoy this episode with Professor David Hand.
Welcome, Professor David Hand. I'm super excited to be speaking with you. I've been watching some of your lectures and reading your book for at least two and a half months now, so this is two and a half months in the making. Great. Thank you. Can you give the audience an overview as to what dark data is? Yeah, okay. So I sometimes describe dark data in a phrase: it's data you haven't got. What I mean by that is, it's data that you
assume you have or think you have or perhaps would like to have, which would have had an impact on any conclusions you're drawing, the results of any analysis you may have undertaken, but which for one reason or another you don't have. You might assume you've got it. Let me take some examples. It might be simple missing data of various kinds, and we can talk about different kinds later on. And you might think, well,
I can go ahead with the analysis of the data I've got. The fact that some of the data are missing, that I haven't got some responses from respondents or observations from some of the experiments, won't matter. But it can be absolutely crucial. It really depends how the missingness mechanism is related to what you have got. So the missing data can be even more important in many ways than the data you've got, and can certainly mislead you.
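To see how much the missingness mechanism matters, here is a minimal Python sketch, not from the episode; the log-normal income distribution and the response model are invented for illustration. Under dropout that is completely at random, the observed mean barely moves; when high earners are less likely to respond, the observed mean is badly biased.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of incomes (log-normal, arbitrary units)
income = rng.lognormal(mean=10, sigma=0.8, size=100_000)
print(f"true mean income:    {income.mean():12,.0f}")

# Missing completely at random (MCAR): drop ~40% of responses by coin flip.
# The observed mean remains an unbiased estimate of the true mean.
mcar = income[rng.random(income.size) > 0.4]
print(f"observed mean, MCAR: {mcar.mean():12,.0f}")

# Missing not at random (MNAR): the higher the income, the less likely
# the respondent is to disclose it. The observed mean is now biased low.
p_respond = 1 / (1 + np.exp(2 * (np.log(income) - 10)))
mnar = income[rng.random(income.size) < p_respond]
print(f"observed mean, MNAR: {mnar.mean():12,.0f}")
```

Both observed samples look like perfectly ordinary income data; only knowledge of the mechanism, which lives outside the data set, tells you which mean to trust.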
One of the illustrations you give in the book or in a lecture, because I've forgotten conflating them now, perhaps both, is in an earthquake, you may think that, well, let's send the help to where the most 911 calls are coming from. And then you think, well, that sounds reasonable if you do some overlay, a heat map of where are the most requests for help. Well, it seems like let's send it to where it is. And you could do some analysis of like, okay, well, it's weighted by population density.
But then what about the places where there's zero? It could be that it was so disastrous there that the cell towers themselves have been taken out, and that's actually where the help is needed most. So can you give some other counterintuitive illustrations or examples that, even while you were researching this book, made you go: man, we're going about this all wrong, and, holy, I never thought of it like that?
That's right. That's the example you gave. In fact, in the book I talk about Hurricane Sandy at the time, the second most costly hurricane in American history, $75 billion worth of damage. And all the publicity was, of course, on that physical storm, but it was accompanied by the Twitter storm with 20 million tweets being sent about it.
But exactly as you say, those tweets are great. Twitter tells you exactly what's happening, where it's happening, when it's happening. So as you say, you know where to send the emergency services. But what about a community which has been completely obliterated? You wouldn't know. That's really where it's critically important you should send the emergency services. So that's a great example of missing data you don't know about, which is completely misleading you.
And there are any number of other examples. Oh, think of a pandemic, of course. We've learned this from COVID, but in any pandemic, maybe there are lots of infectious people out there without symptoms. They could be spreading the disease through the community, and that could be disastrous, and you wouldn't know if they're not presenting with symptoms and going to the medical services. So that would be another example of that.
A classic sort of example, which people have researched in great depth, is non-response in surveys. In traditional surveys, you have a sampling frame; you know who you're supposed to be getting answers from. And if some people don't answer, you know they haven't answered. These are Rumsfeld's known unknowns. That's one kind of missing data which it is possible, without too much difficulty, to adjust for.
But what about doing a web survey? You write a blog or a podcast and you get lots of comments. And all the comments might say, oh, this is a wonderful blog or podcast. But maybe there are lots of people out there who hate it and can't be bothered to tell you so. I'm sure you just get wonderful comments, and all the people out there who don't say anything also think it's wonderful, but one could be completely misled by this. And so it goes on.
Here's an anecdotal example; well, it's not quite an anecdote, it's a joke really. But it conveys a particular kind of dark data, to do with what I call self-selection. A man goes to a new town on a train. He gets out at the railway station, and outside the railway station there's a map with a big dot in the middle saying, you are here.
And the man thinks, how did they know I was here? Well, obviously they knew he was there, because you have to be there in order to read the map. But exactly the same sort of thing happens in other contexts. My classic example is the anthropic principle, which you'll be familiar with, being a physicist: isn't it extraordinary that the universe has got lots of numerical constants which have to have
very precisely the values they have, or humanity wouldn't exist? Isn't that incredible? Well, no. If the constants were different, humanity wouldn't exist, so we wouldn't see the universe. So there are all sorts of ways one can be subtly misled by data you don't have.
Can you talk about the data that we don't have and some of the DD types? And by the way, just so you know, when you've watched this, you have already now seen the summary of the 15 DD types. Why don't you go over that example right now, of changing definitions? It sounds like that's something that's great, that we want to keep up with the times, so we change definitions of all sorts for all sorts of reasons. Yeah, let me give you two medical examples.
The definition of autism has changed over time, and the amount of autism in the community has gone up. Now, has it gone up just because in the early days we were excluding people whom we would now regard as having autism? Other medical conditions also undergo that sort of revision. But another example would be during the pandemic, where
countries were very keen to report their COVID infection rates and death rates and saying we are doing better with our strategy and so on and so forth. But of course, when a death from COVID is reported, you have to be very careful about exactly what you mean. Do you mean death
from COVID, or with COVID, with a formal diagnosis of COVID, or with symptoms of COVID? What exactly do you mean? Funnily enough, my book appeared right at the start of the pandemic, so the words pandemic and COVID don't appear in it, but you could easily write an entire book about dark data and the pandemic. But that illustrates it. The death rate for COVID depends crucially on what exactly you mean by
death with COVID, from COVID, and so on. Right. And that could be partially or largely responsible for why different countries had such drastically different rates of death. Absolutely, I'm sure it is. Are you planning on writing that second part? No, I probably won't. I have been tempted, but probably not. Just so you know, so the audience is aware, you have another book that's come out afterward, on coincidences.
Are you planning on writing another book? Are you working on one right now? I'm always working on another book, but the way I work, I spend quite a lot of time
formulating the structure of the book, getting examples, and creating the book before I go to the world and say, this is going to be my next book. I don't want to jinx it, if you see what I mean. Well, I'm sure it's not jinxing. What's the real reason? It's because you don't want to look foolish if you don't end up pursuing that particular one. Or if you say it and it's premature and someone looks at you askance, then you're like, oh, this wasn't such a great idea. But if I'd pursued it more, I would have had more confidence behind it.
It's like that. Looking foolish is almost it, but not quite. I don't mind looking foolish, but if I tell people I'm going to work on this and then I decide that it really doesn't make a very good book, then people will be saying, well, where is this? Well, I decided it wasn't. I just want to be sure that this is what I want to work on and finish before I actually
do it. Yeah, you'd look footloose and erratic. Perhaps 10 ideas occur to you each day, and you don't want to say, yeah, I'm going to pursue these 10 ideas, and then only one out of the whole month actually makes it through. Then you look like you're irresponsible. Yeah. If you look in my computer, you'll see folders for, I don't know, hundreds of books. You say, well, brilliant idea, that would be... but then, have I got
the resources to put a year's work into writing that book. And so most of those hundred ideas are just a few lines or sentences or even pages. But for one reason or other, you can't do everything, you know, so I have to decide which ones are worth pursuing. And you're collecting examples primarily because for books, the stories matter so much because as a statistician, I'm sure you don't care too much about the anecdotes. You care about the data set.
Yeah, professionally I care about the data and what understanding you can squeeze out, what illumination you can extract from data, which is what statistics is all about. But when I write these sorts of popular books,
they are driven by stories. I want people to identify with them. So the coincidences and rare events book is full of things that have happened to people, just like the dark data book is full of real stories about how significant, how important, how crucial dark data can be, so that people think: this is important.
Personally, do you find that part is something you don't care about doing? It's a bit taxing, it's grueling, and your editor is like, we need an emotional hook here, and you're like, hey, isn't the theorem emotional enough? I'm a mathematician. No, no, no, it's not the editor, it's me doing it. Sometimes people ask me how long it takes to write a book, and I say there are two answers.
One is 20 years: collecting examples and illustrations, recognising that there is a common thread, and deciding it should be pulled together. The other answer, and I give them both answers, is six months: I've got all these examples and I know what the common thread is, now I have to sit down and actually put it all together.
So really, I'm always collecting examples. In fact, in my hundred folders of potential books, whenever I see something which illustrates one of these sorts of common threads, I'll put it in. And, coming back to your earlier question, if I find that sooner or later I've got lots and lots of examples, like with the improbability principle and dark data, then I know I've got a sensible book on my hands. There are so many
examples and aspects to it, but it's got to be written down and tied together. Are you using a particular app like Microsoft Word or Notion or Obsidian? I just use Word, yeah. I'm very lucky. I mean, I'm lucky because not all academics like writing and I do. So, you know, I just use Word. Why do you think that is? Why do I like it? Yeah. And why do you think most academics don't? I don't know if most don't, but a lot don't.
That's a very good question. I suppose many might not. They might just enjoy the hunt, the search, running the experiments, collecting the data, analyzing the data, and see actually sitting down and writing it all up as a chore. I like it because I always feel that when you sit down and write it up, you often find new questions, because you're trying to force it into a structure.
You're trying to understand it. You know, they say that the way to understand something properly is to teach it, because that forces you to structure it and make it coherent and consistent and sensible. And I think for me, the writing process plays a similar role.
Maybe not for everybody. Yeah, I've always been skeptical of this. Hey, if you understand something, you can explain it to a 10 year old or a five year old. Yeah, I'm not saying that you necessarily can. Yeah, I'm saying that if you want to hope to explain it, then you've got to understand. And the reason why I say that is because at least in the University of Toronto, that's my alma mater, it's known that the brightest people are the worst at lecturing. It's not necessarily inverse correlation, but it's not terribly correlated.
I think it may even be an inverse correlation. If they're right, they don't understand why you don't understand it. I can't understand this, but it's obvious, isn't it? Not to me, it isn't. There's a name for that. It's called the expert's curse. Yes, I didn't know that. You can't place yourself in the position of someone. You can't remember what it's like to not know it. And even if you don't know it, your reason for not knowing it's not the same as the... Exactly. Very good. Yes, that's spot on.
Have you used any large language models, like ChatGPT or any of the ones that have come out, for your writing? Not for my writing, no. No, no, no. I don't mean to say you're plagiarizing it. No, no, no, no. Part of what I enjoy is putting things down so that they're saying what I want them to say. I have to say, over my career, I have been immensely frustrated by
PhD students, not all PhD students, just some, who can't write. This is a sort of bête noire of mine, in a sense. It's perfectly understandable. In the UK, it may be a decade in which they haven't had to write essays or things like that.
you know, from early, relatively early, perhaps late, but sometime in their school education through to the end of university, they've done maths or physics or whatever it is, they haven't been told to write coherent prose. And so I have found sometimes that, and I'm not talking about people whose first language isn't English, I'm talking about people whose first language is English,
They haven't really stopped to think carefully about the words and the punctuation and the paragraphs and what have you, so that those actually convey what's in their head. Either that, or what's in their head isn't sufficiently clear to enable conveying. But I have found that a bit of a problem; as I say, it's one of the things I complain about. But I enjoy the process of
writing. It's quite interesting, actually. I enjoy sitting down and trying to beat my thoughts into shape, making sure that each word is where it should be and conveys what it should. And it's interesting, because all too often I will go back and read something I've written and think, I can improve that, that's not very good. Occasionally I think, oh yeah, that's good, I wish I could still write like that. But all too often I think, oh dear, why didn't I...
it would have been better had I done it that way, and so on. So how do you use the language models? I haven't used those. I just use Word. I have played with ChatGPT, just to explore what it's like, and we can talk about its strengths and weaknesses, but I don't use it for writing. It's part of me getting to grips with things and understanding.
Often in machine learning, well, plenty these days, we hear about bias. And until I read your book, I didn't realize that most of the time when people are referencing bias, they're biased in referencing their bias. And the reason I say that is that they're selecting about three or four of the DD types when they're talking about bias. So DD type three and four and perhaps 12.
I can explain more: they're talking about certain qualities being overemphasized or de-emphasized or asymmetrically acquired. That's only a couple of them, and, I don't know if all of dark data can be summarized as bias, but there's a vast array of other ways that you can have bias or dark data. My question is: what are the ways that we have bias in these large language models that we aren't talking about much, that we perhaps should be talking about more? I think, I mean,
So I'm going to take a step back if that's okay. When I talk about data science in general or statistics in general, I say there are two kinds of models, but this isn't taught enough, it ought to be taught more, but there are two kinds of models. There are theory-driven and data-driven models, they sometimes go under other names but those would do. Theory-driven models are based on some
sort of assumed underlying understanding of what's going on. So if I'm building a, if I'm trying to understand something in physics, I might base it on Newton's laws of motion or laws of thermodynamics or something, some sort of theory behind it. And then I will collect my data and do my statistical analysis and so on to try to, as I say, squeeze more understanding from it. Those are theory driven. In psychology, you might have a model based on prospect theory or something.
The other kind of model are data-driven models. Data-driven models just take large, perhaps vast, as in the case of these large language models, corpuses of data, text for large language models, and they summarize it. They look for structures, correlations, patterns, features, anomalies, this sort of thing. So those models are based entirely on the data that you've fed them,
which of course has the potential to lead to conventional bias, gender bias or whatever it happens to be, for example. But this is the crucial thing, I think: they also have the potential to be brittle, because they're based on this corpus of data or text or whatever it happens to be. If things change, think of a financial crisis, a pandemic, or a war, that data can become
less relevant, or irrelevant, so that your data-driven model is fundamentally brittle. If things change, it could no longer apply. We've seen this lots of times: I used to do a lot of work in retail consumer credit scoring, and with the
financial crash in 2008, some of the models just weren't very good anymore. Well, not surprisingly: they were based on a relatively benign economic period, and they worked great under those circumstances, but the circumstances changed. Anyway, so we have these two kinds of models, theory-driven and data-driven models. Almost all of the
recent advances in machine learning and AI, not entirely all, but almost all, are based on data-driven models. You give it a lot of data, it trawls through looking for correlations, patterns and so on, and it produces a structure which allows it to generate new text or whatever it happens to be, based on what it's found in the data you've given it.
If things change, those data no longer apply, so they're fundamentally brittle. Theory-driven models, of course, if your theory is wrong, and Newton's laws would of course turn out to be wrong if you were traveling very fast or whatever it happened to be, then you could be wrong as well. But theory-driven models are more likely to be right
because of the sort of continuity they imply. You know, okay, circumstances change, but that means I'm just in a different part of the space spanned by my theory, for example. But of course, they can also be wrong if the theory is wrong. Okay, that's interesting. Please expand more on the difference between theory-driven and data-driven. I still don't get why theory-driven is less brittle; you mentioned continuity as a reason. Is there a technical term for this brittleness? Is it called robustness? Robustness would be another word, but that also has sort of other connotations.
Okay, we'll stick with brittle. Why is it that something that is theory-driven is less brittle compared to something that is data-driven? Sorry, models that are... Yeah, so my theory-driven model... Let me give you two examples then. Yeah, okay. I can use a theory-driven model to predict the trajectory of a thrown stone, and I can use Newton's laws to do that. But I could also, if I didn't know anything about
Newton's laws. I could also try to build a model just based on things I happen to observe. In retail credit scoring, for example, they use logistic regression trees. They basically partition the population and they use a predictive regression type model, a logistic regression model in each of those partitions, a separate model. Now, if in my thrown rock example,
I suddenly go way outside the data I've got, so that the world changes. No longer am I throwing rocks at a fairly gentle speed; I'm now throwing them much harder. My data now looks nothing like the data I got before, but Newton's laws still apply, so my predictions will still be pretty accurate. But my data-driven model, the logistic regression model, if now
people start behaving differently, or I'm getting a different population and I try to apply the model to it, there's no reason why the structures and correlations I've found in the data I had before should also apply here. Suppose my initial population was youngsters, people under the age of 30,
And now I'm applying them to people over the age of 70. People over the age of 70 are mostly retired. The circumstances are completely different. Why would I expect the credit model to apply? But of course, you know, so it may completely collapse. Actually, a better example would be to flip it around. I built my credit scoring model on people aged 70 plus.
Youngsters are usually more risky. I built my credit scoring model on people aged 70 plus and then applied it to people aged less than 30. I might find that my bank's going bankrupt because my models are
just no good anymore. Okay, so in that latter case, would that be a data-driven model? Yeah, that's exactly right. That's data-driven, because I took my people aged 70 plus and I said, okay, let's correlate default risk with whatever other factors we can think of, and we found these relationships; we can build quite elaborate models. And then I try to apply it to the people under 30,
where, unknown to me, the relationships are completely different. If I had a very good psychological theory, theory-driven, then I could perhaps make that extrapolation down the ages. But without that, if it's just based on what I observe in my 70-plus people, it could go completely wrong.
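Here is a rough sketch of the thrown-rock contrast in Python (the specific numbers are invented, and air resistance is ignored). A straight line fitted to gentle throws predicts well inside the range it was trained on, while the theory-driven formula keeps working far outside it:

```python
import numpy as np

rng = np.random.default_rng(1)
G = 9.81  # m/s^2

def peak_height(v):
    # Theory-driven model: v^2 = 2*G*h for a vertical throw, no air resistance
    return v**2 / (2 * G)

# Data-driven model: observe only gentle throws (5-15 m/s) with noise,
# and fit a straight line -- it fits this narrow range reasonably well.
v_train = rng.uniform(5, 15, 200)
h_train = peak_height(v_train) + rng.normal(0, 0.1, v_train.size)
slope, intercept = np.polyfit(v_train, h_train, deg=1)

for v in (10.0, 40.0):  # inside, then far outside, the training range
    print(f"v = {v:4.1f} m/s | theory: {peak_height(v):6.2f} m"
          f" | fitted line: {slope * v + intercept:6.2f} m")
```

At 10 m/s the two models roughly agree; at 40 m/s the fitted line is off by more than half, which is the credit-scoring story in miniature: same model, new population, no warning from the model itself.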
Is there a strict dichotomy between those two? No. Very good, perfect question; you actually know a lot about this, I can see that. In fact, it's a sort of leapfrogging. It's your classic way of advancing science: you collect data, you look at it, you try to make sense of it, you conjecture a theory, you make some predictions based on that theory, then you apply it to new data. So they overlap; some models are mixtures of them both. For others, it's difficult to say which side of the line they're on. It might be on one side of the line for one person, different for another. So it's not a hard and fast division. But nevertheless, it's a useful division, because, well, in the modern world,
these large language models and most current AI and machine learning systems are based on data-driven models, and there is this risk of them being fundamentally brittle. So why don't you give an example? Earlier we had the example of the cell towers going down, of the tweets not being sent out. Can you give an example of something, perhaps it hasn't occurred yet, but it could, with regard to these large language models, of some danger that we haven't thought much about?
When I say we haven't thought much about it, what I mean is: you perhaps have thought plenty about it, but in the media, and by we I mean not myself but people generally, we talk about certain AI risk scenarios, AI alignment, misalignment, and there are other risks. But is there something else, from your dark data book, where you can say: here's another way you can go awry that's subtle, but extremely impactful? Yeah, from the dark data book, I think it is
the fundamental question that I dealt with there: the fact that you've built your model on what you think is a good, representative data set, which describes the phenomenon you're trying to study, or about which you're trying to make predictions, the disease that you're trying to diagnose or whatever it happens to be, but you are being misled because you're missing something crucial. COVID provides plenty of examples, as the data began to come in.
The pandemic is a nice example because early on in the pandemic, there were lots of things we didn't know. We didn't know most things. But people had to make decisions. They couldn't hang around. So they had to build models, try to build medicines and so on, try to work out what they thought the likely causes were. And then later, errors were corrected and so on. But early on, they didn't recognise that
age was a crucial factor in susceptibility to COVID and its serious consequences. So you could build a model and say, right, everybody must behave like this, and then realize later on that older people are all dying because of what you had done. So I think that's the fundamental problem, the link between the two here:
you're building a model based on what you think is a nice, comprehensive, representative data set describing the entirety of the phenomenon you're interested in, and you're missing something crucial. There might be particular variables you're failing to measure. In many examples other than COVID, sex might be a crucial variable that
causes different impacts between people. If you didn't think of that before, well, that's an obvious one, so you probably would have thought of it. But there may be all sorts of other things that you hadn't thought of which can have a big impact. So it might be particular variables, or it might be other kinds of distortion which lead you to a misunderstanding. Let me take a physics example. Sure.
There's something called the Malmquist bias, and you may well be familiar with this, which basically says there's a bias in the models you build because of the fact that brighter astronomical objects are easier to see. So if you build a model just based on the objects that you can see, or detect the radiation from, you might be getting a completely misleading impression, because there might be
lots of stuff out there that you're missing. And in fact, another physics example, you see where this is going, another physics example is, of course, dark matter. Yeah, dark matter.
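A quick way to see the Malmquist effect is to simulate a flux-limited survey (all numbers here are arbitrary, illustrative choices): objects are visible only if their luminosity over distance squared clears a detection threshold, so the detected sample is systematically brighter than the population it came from.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical objects: true luminosities and random distances (arbitrary units)
L = rng.lognormal(mean=0.0, sigma=1.0, size=1_000_000)
d = rng.uniform(1.0, 10.0, size=L.size)

flux = L / d**2          # apparent brightness falls off with distance squared
detected = flux > 0.05   # a flux-limited survey sees only bright-enough objects

print(f"true mean luminosity:      {L.mean():.2f}")
print(f"detected mean luminosity:  {L[detected].mean():.2f}")
print(f"fraction detected:         {detected.mean():.1%}")
```

The detected sample looks perfectly ordinary on its own; the overestimate only becomes visible if you model the selection process itself.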
Yeah. And earlier you said, with regard to COVID, and I don't care about being political, this is not a political question, that we didn't know what was occurring, but we had to do something. Now, that's not always the case. Is it better, as a rule of thumb, that if we feel like we have only partial data... well, this is a tricky question. We have a freeze response for a reason, biologically: don't do anything.
If it's extremely uncertain, just stop, stop for a moment, even more than a moment. So when we get partial data from the government, for whatever reason, they don't want to release it, they don't have it, they're embarrassed by it, or fraud, and this is not just with COVID, it could be with whatever. There's something called FOIA requests in the States, the Freedom of Information Act: you request certain documents and they come back redacted, redacted, or they'll just say they don't have them, or they won't get back to you. I'm always interested in this partial truth they give you.
Can it be worse than saying nothing?
I think it's all right if they say: we think this is how it is, but we're going to collect more data. We recognise we've got to do something now. We're in the political arena. People will die if we don't. We're going to do what we think is best, but we recognise that we may be missing something. That is at least safer. People will keep their eyes open and look, and so on. But if they say, this is how it is, full stop, then that's disastrous. And I think they can't just say, well, we're going to do nothing.
If you're in the political arena, you have to make decisions. Even doing nothing is making a decision. We're not going to vaccinate anybody. Well, that's a decision just as much as vaccinating people is. Is there an in-principle problem with these unknown unknowns? So in this matrix of known knowns, known unknowns, unknown knowns... Yeah, yeah. Okay, whatever, you get the idea, there are four of them.
that it seems like there's no way we could ever make any decision or we could ever get any information about an unknown unknown. Now, is that true, though? Is there some systematic way of thinking about them, of classifying the unknown unknowns, like some taxonomy there? Have you uncovered something about this? What seems like an epistemological black hole? I agree. Yeah, I think this is probably very context dependent. But yeah, I mean,
One thing you should always do is sense-check your conclusions and your results. If they seem totally bizarre, that casts doubt and suggests you might be missing something. In clinical trials, there's something called a funnel plot. Funnel plot? Funnel, as in funnel-web spider. Exactly. Actually, it's like an inverted funnel, as normally plotted.
One of the problems with clinical trials is that if they produce a clear result then they tend to get published. But if the results are not that clear or perhaps even don't go in the direction you expected, well maybe you've got other things to do rather than spending your time writing up this thing. So often this is
I've got to write a lecture for tomorrow, or I've got to go and chair this meeting; I'll write it up when I get time, and you never do. Exciting results, however, you think, well, I'm going to write this up because it's going to attract a lot of attention. What this means is that there can be a bias away from certain kinds of results. The funnel plot
is a way of plotting all the results of clinical trials in a particular area, or indeed you can do it for other kinds of studies, and you can sometimes see that there is a sort of void, a gap. If there's a dot for the result of each clinical trial, occasionally you can see this void, and you think, well,
it's incredibly unlikely that there were no results falling in that region. So if we just analyze the dots that we have got, it's going to bias us. So that's one way you can do it. It very much depends upon the context, I think. But yeah, I think there are ways. This doesn't mean that it's always possible. After all, bottom line:
You can't measure everything. If we're studying human beings, you can't measure everything about that human being. I can measure their age and weight and BMI and IQ and preferences for politicians and what have you. But I can't measure everything. So naturally, I must be by definition missing almost everything about those people. So there's always a risk. You can never guarantee having everything.
never guarantee that your scientific theory is, in inverted commas, right. For all you know, new data tomorrow might cast doubt on it. You see this all the time at the frontiers of physics, very exciting, where people push the boundaries and say, well, there seems to be a bit of a problem with the standard model or something like that. And of course, that's true. And you see it all the time in medicine. You really see it all the time in medicine, where if you read the papers or the news media,
every day you get what appear to be contradictory results. Today coffee is good for you; next week coffee is bad for you. Well, the reason for this is that they're looking at different aspects of it, or they're looking at different consequences of it, or whatever.
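Here is a small sketch of the funnel-plot idea (simulated trials, and an invented censoring rule): plot each study's effect estimate against its precision, suppress small non-significant studies, and the tell-tale hollow wedge appears.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
true_effect = 0.2

# 300 simulated trials of varying size; bigger trials have smaller standard errors
n = rng.integers(20, 500, size=300)
se = 1 / np.sqrt(n)
effect = rng.normal(true_effect, se)

# File-drawer censoring: significant results always appear;
# non-significant ones get written up only 30% of the time.
z = effect / se
published = (z > 1.96) | (rng.random(n.size) < 0.3)

plt.scatter(effect[published], 1 / se[published], s=12)
plt.axvline(true_effect, linestyle="--")
plt.xlabel("estimated effect")
plt.ylabel("precision (1 / standard error)")
plt.title("Simulated funnel plot: the hollow lower-left wedge hints at unpublished studies")
plt.show()
```

The message is carried by the missing region at the bottom of the funnel, not by the dots that were plotted: dark data made visible.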
These funnel plots make it seem so tricky to do a meta-analysis; I can't imagine doing one. My brother's a professor of statistics at U of T, in the math finance program, but anyway, he's technically under the umbrella of statistics.
I once showed him a meta-analysis and asked, is this an okay meta-analysis? I don't even recall what it was, this was years and years ago. He said, that would take me a month or more to go through. And I remember thinking, how? You're a professor of statistics. But he said, no, no, it's extremely subtle, and they also use different techniques, so it's not like everyone knows all the techniques. It's just so subtle; you need to comb through it all.
Can I say, I'm with him on this. I get lots of questions.
The short answer is yes, okay, but it's going to take me a while to dig down, look at exactly what they did, look at their comments about each of the studies. So, yeah, I'm with him on it. What's that effect, or the study that showed that in high-prestige journals the results tend to be less reliable than those in medium- or low-prestige ones? Yes. What's that phenomenon called? The primary one you may be thinking of is regression to the mean.
Let me describe how that works, because that's an example of another kind of dark data. Let's suppose we're carrying out lots of experiments on the same topic, and let's suppose we're comparing two treatments or whatever. An experiment might show a pronounced positive effect for one of these treatments compared to the other for two reasons. One is the underlying reality: maybe there really is an effect.
The second reason is that there's always random variation in these sorts of results, and perhaps this time, just by chance, the random variation has gone on the high side.
You put those two things together, and the very highest observations, the most pronounced, significant observations in your collection of results, are going to be the ones which have both of those effects combining: there's a real underlying effect, plus random variation just giving you an extreme result.
And because it's a big effect, it goes to a high-prestige journal. They're really delighted, and you're delighted as well. But when you replicate the study, any real effect will still be there, but there's a 50-50 chance which way the random aspect will go. And it's much more likely that it's less than the extremely high random bit that you got before. The replication is more likely to be lower.
A classic example is that the offspring of tall couples are likely to be tall, but not as tall as them, and the offspring of short couples are likely to be short, but not as short as them. Regression to the mean.
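The selection-plus-noise mechanism just described is easy to simulate (the true effect size and noise level below are illustrative inventions): take the most impressive 5% of original results and replicate them, and the replications fall back toward the true effect.

```python
import numpy as np

rng = np.random.default_rng(4)
true_effect, noise, n_labs = 0.3, 1.0, 10_000

# Every lab measures the same true effect, plus independent random error
original    = true_effect + rng.normal(0, noise, n_labs)
replication = true_effect + rng.normal(0, noise, n_labs)

# The "headline" studies: the top 5% most impressive original results
headline = original > np.quantile(original, 0.95)

print(f"mean original, headline studies:  {original[headline].mean():.2f}")
print(f"mean replication, same studies:   {replication[headline].mean():.2f}")
print(f"true effect:                      {true_effect:.2f}")
```

The replications average near the true effect, because the real effect persists while the lucky noise that made the originals headline-worthy does not.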
There's the potential to be misled. You're being misled by the data you've got because it includes aspects of random variation. Is there another effect happening that's creating this unreliability in the high prestige journal? There's the file drawer effect, of course. High prestige journals. Yeah, high prestige journals. One wants to publish one's results in a high prestige journal. So if you get a big result,
highly significant, people are going to pay attention, so that's where you send it. And of course, those journals maintain their prestige by publishing these important, significant results, which means that all of the other results tend to go elsewhere or not get published, which means that the high-prestige journals are more likely to be susceptible to the regression to the mean effect.
The file drawer effect means that the things which weren't significant are less likely to be reported. Let's take an example of a comparison between two drugs where really there's no difference between them. We don't know that, but there's no difference between them.
So we do 20 experiments. So in reality, these two drugs work the same; we don't know that, we're trying to find out. So it's like name-brand Advil versus generic ibuprofen. Exactly, exactly. So we've got 20 groups comparing these two treatments. Now, all of those 20 groups produce a result. And there's no difference, so the results are randomly scattered around zero.
The difference between the two treatments is randomly scattered around zero. Some show A better than B, some show B better than A. Which ones get published might depend on who's funding the work.
The most significant ones are likely to be the ones which attract attention and get published. If there's no difference, I'm not going to rush out and say, hey guys, I did this experiment and found no difference between the drugs. Well, nobody will care about that. Maybe they should, but it's the ones which have a big difference that are likely to get published. So there's the file drawer effect, meaning that the ones which are around zero don't get published.
And then remember that these results were actually random. Take the most extreme one, which showed treatment A a great deal better than treatment B. It was actually random. So if the group which did that experiment were to repeat it, it could just as easily go the other way; it's very unlikely to be as high as it was. So that's regression to the mean coupled with the file drawer effect.
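That two-drugs story can be put into code directly (the group sizes below are made up; the true difference is exactly zero). Publish only the most extreme of the 20 results, then have that group re-run the experiment:

```python
import numpy as np

rng = np.random.default_rng(5)
sims, n_groups, n_patients = 10_000, 20, 50

# Estimated treatment difference per group: the true difference is zero,
# so every estimate is pure sampling noise.
diffs = rng.normal(0, 1, (sims, n_groups)) / np.sqrt(n_patients)

# File drawer: only the most extreme of the 20 results gets published
idx = np.abs(diffs).argmax(axis=1)
published = diffs[np.arange(sims), idx]

# The publishing group then repeats its own experiment once
replication = rng.normal(0, 1, sims) / np.sqrt(n_patients)

print(f"mean |published difference|: {np.abs(published).mean():.3f}")
print(f"mean |replication|:          {np.abs(replication).mean():.3f}")
agree = np.sign(published) == np.sign(replication)
print(f"replication agrees in direction: {agree.mean():.1%}")
```

The replication matches the published direction only about half the time, which is exactly the "could just as easily go the other way" point.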
What are some ways around this file drawer effect? I think there's something called pre-registered trials. Yes, exactly. There's been a big move to do this. What about creating some high-prestige... I know this is difficult, and I don't know if this exists, but journals that just said: hey, give us your null results, we will publish them, so that you don't have to feel like, okay, I don't have anything, it's not worth publishing. No, no, no, it'll go here. This is a high-prestige null-result journal.
Yeah, yeah, and indeed there are medical journals which do that sort of thing now. They say before you do the experiment, write up your paper just leaving out the results section.
If we think it's good enough, carefully enough done and so on, we guarantee to publish it. And that's called pre-registered? Is that different? Yeah, that's the idea. And there's a bigger move, which also says that if you're going to do clinical trials, you've got to say so before you actually carry them out. You've actually got to register with a registry that you're going to do them. Yeah, I mean, this is a way to tackle these sorts of biases,
and help tackle the unknown unknowns in that case. Often when we hear some result, like smoking is good for you or is not good for you, we'll say, yeah, but look who it was funded by: the "not good for you" by people who are anti-smoking, and the "good for you" by the cigarette companies themselves. So we'll say that and dismiss it because of who funded it. So why do we even care about any funded results, if we're going to just question them anyhow?
Well, we rely on the sort of morality and ethics of the funding body, but it's a very good, very good question. Or is it just in comment sections that people are quarreling with one another saying, oh, yeah, you shouldn't listen to it because it's funded by someone. But the professors behind the scenes or the researchers are like, no, no, that's a statistically relevant result. I'm glad they did this. Yeah, well, I think it depends on the particular case. But I mean, yeah, if it's funded by
Well, you gave the example of tobacco companies and that has a very long and interesting history. If it's funded by such an organization, they may well have a vested interest in getting the results a certain way. So in some sense, there's always this question of, did they manipulate things in some way? There are classic cases of this where an organization would like to show that
drug A is better than drug B for a particular illness. They carry out the experiment and they find no difference. So what they then do is look for secondary endpoints. Okay, it wasn't good in this way, but what about this? Did people survive for longer in this way, or did they suffer less from such-and-such in that way? You keep looking, and this is another aspect, through the possible things in which A is better than B. You're bound to find something eventually. The truth is that in the past, probably still now, but certainly in the past, there have been cases of data distortion of various kinds, or data selection, which raise the question of
whether one can trust the results. The trusting of the results sounds like an issue in places where you're dealing with data collection.
But it's not so much an issue in math, when you're publishing a theorem. But I do have a question about peer review in general. Is there some problem with peer review other than the main critique, which is also the plus side? The main critique is: yeah, but you're excluding some results that could be let in. But you're like, yeah, but we also have to put guardrails up, because we don't want to let just any result in. What are the pros and cons of peer review that you see from your particular point of view, because you've studied dark data? If there's something that can be applied from dark data to this. Yeah, I mean, peer review.
The whole question of peer review is a very interesting one. It's a bit like what's been said about democracy: it's a crappy system, but it's the best we've got, sort of thing. It does have its shortcomings, and there are all sorts of implications. If you were in charge of peer review and you had a magic wand, what would you change?
I'm not sure that I could come up with a better system, and therefore that there's anything I would change. I was hesitating over whether to say, pay reviewers, but I'm not sure that that would help. It could lead to its own problems, and I can see that. Just as open-access publication has led to its own problems, because we all saw this coming. I can remember talking about it in various meetings.
You know open-access publishing? No, is that like arXiv, or is that different? No, it's a bit different. ArXiv is interesting. Oh yeah, that's another topic. In a moment, yeah; let's talk about open access first. In the old days, the basic publication model was: you do your work, write up your paper, submit it to a journal,
and it would go into that journal, and only the people who had a subscription to that journal, or the universities which had subscriptions to that journal, would have access to it. And then it was gradually recognized that this was not to the benefit of the whole scientific community, and hence humanity, but also that it was a bit unfair. Lots of these projects were funded by public money, so if I'm indirectly
funding this research, I ought to be able to get access to it. Okay. So they switched to an open-access model, whereby essentially the authors pay a fee to have their papers published. It's got to be accepted first, go through the refereeing process and whatever, but then they pay, and after that anybody can read it. It's open access.
So the business model for the publishers is that now they're collecting money from the authors instead of from the subscribers. One of the adverse consequences of this is, of course, that a load of crooked journals have been set up. There are thousands of these now, which will publish your paper, no matter how crappy it is, for a fee. And so they make their money that way.
Predatory journals? Predatory journals. There's a classic case of someone who submitted a paper to one of these journals, paid his $500 to get it published, and the paper
The title of the paper was: this is effing nonsense. And the text of the paper said, this is effing nonsense, this is effing nonsense, just repeated. And it got published, which shows you that the journal is nonsense. So that's the risk of open access.
I don't know if this is a published paper, but this is hilarious. It's a paper just called Chicken, Chicken, Chicken, but it's written in LaTeX, and so it's beautiful. And then there are equations: Chicken, Chicken, Chicken, and so on, the word repeated throughout.
You know, they produce nonsense, but those are the consequences. So what classifies them as a journal? Like, I could create some platform and call it a journal, but do I have to petition somewhere else for it to be called a journal? No, no, no. There's no overseeing regulatory body. You just call yourself a journal. Exactly. Exactly. You self-publish and just create your own journal. Yeah, yeah, that's right.
The Advanced Journal of Artificial Intelligence Innovations and Applications. There we are. And we charge $1,000 to everybody who submits their paper to us. We'll have a very high acceptance rate, and we're away. Okay, now that's the issue with open access. And what's the issue with arXiv? And for those who are unfamiliar, arXiv is? Okay. Oh yeah, arXiv, I think this is a great innovation. It's a way of
getting your paper out there before it goes through the refereeing process and appears formally in a journal. So it hasn't been refereed. None of your peers have had a chance to read it and say, oh, there's a mistake here, or, you know, not sure about this bit, this isn't clear, whatever. It hasn't gone through that process. But
I think arXiv and similar things are great, because in a way the papers become dynamic. They're out there, but they can get revised and replaced as they go through the refereeing process or as other people comment on them.
So you write the paper, you publish it in this electronic archive, and other people can read it. It's meanwhile going through the refereeing process, so it will appear in a journal, but you're getting feedback, perhaps, and it's out there. If you've got an important result, people can read it earlier than the months, perhaps years, it would take to appear in a conventional journal. I didn't know that you can update your link with a different PDF, the same exact link. You can update it; it may depend on the particular repository.
Now, going back to large language models, what are some issues that you see that we haven't spoken about so far, and that you don't see talked about much? Okay, whether they're spoken about or not, the fundamental problem, and we've actually covered the theory-driven versus data-driven stuff, is that large language models are not based on an understanding of the world.
They are just based on what people have written on the web, or in text, or wherever. So when you ask ChatGPT or one of the other large language models, what happens when I throw a rock? It says: when you throw a rock, it describes a parabolic trajectory determined by how hard you've thrown it, and the air resistance, and so on.
It's not saying that because it understands what happens when you throw a rock. It's saying that because it's looked through its masses of billions of records of text, and this is how it's described. So the fundamental risk is that it can do silly things, and you will be aware, from looking at the web, that large language models can say absolutely stupid things; they can say nonsense. A classic example, which you may have come across:
ChatGPT was asked which of these two presidents was older, Grover Cleveland, these are American presidents, Grover Cleveland or George Bush. And it said, whichever way around it was, Cleveland was older, because he was 47 and George Bush was 64. And the guy asking it said, okay, so 47 is bigger than 64? And the machine said, yeah, a number is considered bigger than another one if it's larger.
So then, I can imagine after a pause, the human interlocutor said, okay, could you count up from 64 to 47 for me? And the machine started doing this, and then gave up after a while. And there are many other examples of nonsense produced by these things. So
I think that's the risk. They don't understand. Think of my thrown rock example. They don't have a model, a theory, about how the world works. All they have is what people say, the text, and they try to put it together and generate a response based on that. So they don't know if they're saying something absolutely stupid in terms of the way the world works. Now, I have a sort of corollary to this. You know, often when you're talking to ordinary people, talking to me,
people say stupid things as well. It's just that the kind of stupid things that a large language model says is different from the kinds of stupid things that people say. You've got the same sort of thing with chess programs: the way a chess program plays is different from the way humans do. So common sense has been a long-standing challenge in AI. To explain why, let me draw an analogy to dark matter.
So only 5% of the universe is normal matter that you can see and interact with, and the remaining 95% is dark matter and dark energy. Dark matter is completely invisible, but scientists speculate that it's there because it influences the visible world, even including the trajectory of light. So for language, the normal matter is the visible text, and the dark matter is the unspoken rules about how the world works,
including naive physics and folk psychology, which influence the way people use and interpret language. So why is this common sense even important? Well, in a famous thought experiment proposed by Nick Bostrom, an AI was asked to maximize the production of paper clips, and that AI decided to kill humans to utilize them as additional resources
to turn you into paper clips. Because AI didn't have the basic human understanding about human values. The final moves it makes is different from the way humans do. And I have to say, I was just going to say, I have to say that the most powerful chess playing systems are combinations of the two.
And I can see that also applying in other AI contexts in the future. Yeah, so I don't see that being much of an issue. In the chess example, a chess computer can
obliterate, I think, Magnus Carlsen, the top player, easily. And even when it makes a mistake, it's easily identifiable, at least some of the time, as a mistake that only a computer would make. There's some controversy around that, or at least there was a few months ago, about cheating. But regardless, who cares if it makes a mistake, and a mistake of a particular kind, if overall it's better than a person at playing chess? Now, that's a defined game.
What's the danger in that? It sounds like a boon, like a helpful assistant. If I had an extra high school student to help me with this channel,
that's great. If I had 100 extra, for 10 cents, at a moment's notice, that's wonderful. I won't trust it when it comes to verifying certain facts, but... I would say, yeah, I'd say exams aren't like the real world. Would you want, suppose you had one of these systems, without human involvement, telling you what medicines and what doses to take?
And it could easily say 47 is bigger than 64. Would you trust it to tell you what dose to take? Or think of an earlier example of a software system, the Boeing 737 MAX 8 crash. Now, this was telling the pilot to
put the nose down. The autopilot took over and was driving the plane towards the ground, even though the passengers looking out of the window could see the ground getting closer and closer. It just didn't have the bigger perspective, the sort of theory perspective. It was just based on the data being fed to it, and it was unable to look outside that in a way that humans can. So again, let me come back to a medical case.
Someone might go to the doctor and say, I've got this terrible backache, doctor, and the machine will look at its vast records of information on backaches and prescribe something. But what it won't know is that your job is lifting heavy sacks of flour, or something like that. It doesn't have the broader perspective that you do have, which is why I think the two things together can be so much more beneficial.
Even in chess that's true: the two things together can do better than either alone. Yeah, I heard that somewhere, but I never saw the source. I heard some claim that if you take a mediocre player with a somewhat great machine, the pair can beat a fantastic machine with zero input from a human, something like that. But then I don't know what counts as fantastic and what counts as mediocre. Well, exactly, subject to that sort of qualification, but the general principle is right. I mean, at some point, you know, you've
got to have some reasonable level on each side. Can that be mitigated with these plugins? Like, I don't know if you've heard, but Wolfram Alpha is now integrated into ChatGPT, so when it comes to calculations, the chat model knows: hey, when it comes to numbers, go use this tool, and this tool is reliable. And in fact, mathematicians use Mathematica, and I think Mathematica and Wolfram Alpha have some interplay. Yes, I think that is a very interesting idea. And in a way, it's sort of like a combination of the two.
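A minimal sketch of the routing idea being described, in Python. The function names and the routing rule here are hypothetical illustrations, not the actual ChatGPT plugin mechanism; the point is only that numeric questions get computed deterministically instead of being pattern-matched from text.

```python
import re

def calculator(expression: str) -> str:
    # Deterministic numeric tool, standing in for something like Wolfram Alpha.
    if re.fullmatch(r"[\d\s+\-*/().]+", expression):
        try:
            return str(eval(expression))  # vetted arithmetic characters only
        except SyntaxError:
            pass
    return "cannot evaluate"

def answer(question: str) -> str:
    # Hypothetical router: comparisons and arithmetic go to the tool;
    # anything else would go to the language model (stubbed out here).
    numbers = [int(n) for n in re.findall(r"\d+", question)]
    if "bigger" in question and len(numbers) == 2:
        return f"{max(numbers)} is bigger"
    expr = re.search(r"\d[\d\s+\-*/().]*", question)
    if expr and any(op in question for op in "+-*/"):
        return calculator(expr.group().strip())
    return "(defer to the language model)"

print(answer("Which is bigger, 47 or 64?"))  # 64 is bigger -- computed, not recalled
print(answer("What is 12 * (3 + 4)?"))       # 84
```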
What are your reservations? Yeah, exactly. I suppose I have two reservations. There's a version of one of these systems which writes code, probably more than one, many systems write computer code, and they can sometimes do similarly silly things. But I think my real hesitancy
is that you've got a complex system and you're tweaking it. You see it doesn't behave sensibly when this sort of circumstance occurs, so you say, look, when that circumstance occurs, do this. But when you start to tweak complex systems to overcome a problem, there are often all sorts of other unintended consequences.
So, you know, it's a complex system. By definition, we don't understand how it works; it's just too complicated, it's got billions of parameters, whatever. And when you start to tweak complex systems, unintended consequences can occur. So it's great, it's now solving this problem, the one which caused me to make that modification, but, unknown to me, elsewhere in the system other things will go wrong.
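To make the brittleness point concrete with the thrown-rock example from earlier, here is a minimal sketch (synthetic numbers, assuming numpy): a data-driven straight-line fit trained only on gentle throws tracks the theory-driven Newtonian formula inside its training range and fails badly outside it.

```python
import numpy as np

g = 9.81  # m/s^2

def theory_range(v, angle_deg=45.0):
    # Theory-driven: projectile range from Newton's laws, no air resistance.
    return v**2 * np.sin(2 * np.radians(angle_deg)) / g

# "Training data": gentle throws only, 5-15 m/s.
v_train = np.linspace(5, 15, 20)
data_model = np.poly1d(np.polyfit(v_train, theory_range(v_train), deg=1))

# Inside the training range the two roughly agree...
print(theory_range(10.0), data_model(10.0))   # ~10.2 vs ~11.0
# ...but throw much harder and the straight-line fit collapses,
# while the theory still applies.
print(theory_range(40.0), data_model(40.0))   # ~163.1 vs ~72.2
```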
So Professor, as we end, if you don't mind me asking a personal question: how do you spend your days? Do you have a daily schedule that you stick to, one that's fairly regimented, or do you oscillate? Are you concerned with publishing research? What are your values right now, as well as your schedule? What does it look like? I do. I mean, I'm still, yeah, I do publish, I do write. This is a very interesting question.
I'm going to slightly change the question. For many academics, perhaps most academics, there is a tension between the things you've got to do and the things you would want to do, the sort of research you would want to do. Over my entire life I've been wrestling with this tension. Should I start the day by getting that lecture preparation out of the way,
and then realize that it's the end of the day and I haven't done anything else? Or should I start the day by really trying to tackle that problem which has been nagging at me, that technical issue, you know, calculating whatever it is, and then get to the end of the day and realize it's eight o'clock in the evening and I've still got to write the lecture for tomorrow? I have never managed to balance those two. It's all too easy to not do the things that you regard as important in terms of your research
by focusing on the things that you've got to do, like lectures or whatever, but it's easy to go the other way as well. I haven't told you what my day is because I haven't found an answer for that. It varies. I suppose it depends how pressing things are when I get up in the morning. Ideally, I'd spend all my day on research and writing, thinking about things, having great thoughts, but it doesn't work out like that.
When you're having these thoughts, are you ordinarily walking, like you have a certain trail that you go down, or do you sit with a pen, or are you reading? How does it work? That's a very nice question, because I have discovered that. I'll sit at my computer trying to work, and then I think, well, I've been sitting here for two hours, I need to make sure my body still works, so I'll get up and go for a stroll.
And of course, this is the wonderful thing about being sort of an academic, your work comes with you in some sense. And very often I will find that
That enables me to focus on the problem better and it becomes clearer and I do find a solution. Maybe not all the way while I'm walking because I need to sit down with my pencil and paper or computer or whatever to actually go through the details, but maybe new avenues for tackling it come along. So yeah, it's interesting you should say that. For me, certainly, walking helps. So what's next for you? I know you don't want to reveal the book, one of several book ideas you're working on, perhaps one more than the others, but what can you reveal?
What I'm especially interested in, and I have been throughout my career and have written papers on this, is getting people to formulate the right question, getting people to think carefully about what it is they want to know. In data analysis and statistics, even questions that sound trivial turn on this. Let me give you a trivial example, if I've got time.
Should you use the mean or the median when you're comparing distributions? The truth is that either may be appropriate, depending upon exactly what you want to know. Well, let me give you a real example. In, I think, the mid-90s, American baseball players went on strike. They said they weren't being paid enough, because, they said, the
average salary is only a third of a million dollars, and you can't possibly survive on a third of a million dollars a year, sort of thing. But the club owners said, on the contrary, you're being paid very well, you're being paid half a million dollars a year. And you'd think, well, this is easy to sort out, who is right: we'll go and look at the numbers. But it turned out that the
owners of the clubs, the people paying the salaries, were using the arithmetic mean, and the arithmetic mean matters to them because their total wage bill is the number of players times the arithmetic mean. But the players were using the median, which is the right thing for them, because 50% of them earn less than that. So it really depends crucially on what exactly you want to know. And I've written quite a bit about this, and I'm still working on it as a general issue.
It is very important to be precise about what it is you want to know or if you can't be precise, recognise that you can't be precise and recognise that there will be ambiguity in the answer and you have to cope with that ambiguity in some way. So that's a general aspect of my ongoing research.
And just for the people who are still scratching their heads about the difference between the average, which is often used as a synonym for the mean, and the median, which is what you referenced. Okay, yeah, sorry. Average is a general term. The arithmetic mean, sometimes just abbreviated to mean, is the conventional thing: you add up all the numbers and divide by how many there are. The median is the number which has 50% of the values below it and 50% of the values above it.
But I've written about this in various places; I can't remember if it's in the dark data book or not. Sure. And just for those people, a rule of thumb: if your distribution is like a bell curve, then it doesn't particularly matter. If it's a symmetric distribution, the two will be the same. And the point is, of course, that salary distributions are very skewed. You've got some super player right up there earning vast amounts, 15, 20, I don't know, 100 million dollars a year.
You've got lots and lots of people earning less than a third of a million dollars down here. With a skewed distribution, the mean and the median are very different.
With a symmetric distribution, they're the same. Yeah. One more concrete example: suppose a thousand people work in a factory and each is paid forty thousand per year, but the boss says, hey, no, no, no, the average here is a million per year. The reason is that the boss, that one person, is a billionaire. So you can average across those thousand and one people and say that the average, quote unquote, person here earns about a million per year.
Yeah, but that's not what we're trying to say. That's right. And think of it: the boss is interested in the overall wage bill, so the mean times the number of employees. But a new person thinking, should I go to work at that company, hears the boss say the average is a million a year.
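A minimal sketch of that factory arithmetic in Python (the figures are the hypothetical ones from the conversation):

```python
import statistics

# The factory example: 1,000 workers at $40k plus one boss whose pay,
# for illustration, we take to be $1 billion.
salaries = [40_000] * 1000 + [1_000_000_000]

print(f"mean:   ${statistics.mean(salaries):,.0f}")    # ~$1,038,961 -- the boss's "average"
print(f"median: ${statistics.median(salaries):,.0f}")  # $40,000 -- the typical worker
```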
Alright, well thank you so much Professor, it was a pleasure and I hope to speak with you again when your next book comes out and also
Perhaps even to speak about your earlier book on coincidences at some point. Well, thank you very much indeed. I've enjoyed it tremendously. You asked some really great questions. I really appreciate it. Thank you. Take care, sir. As promised, here's a compendium of the dark data types, as far as I understand them. I've made some personal notes on David Hand's collection of different dark data types, and it's this sorted list of 15, where I've given each a moniker as a mnemonic to myself so that I can more easily assimilate them into my knowledge.
DD type 1 is data we know are missing, which I've called consciously undocumented. Imagine you're filling out a survey and they don't ask you for your age, on purpose. This is a classic example of data that's consciously left out: we know it's missing, and it's done for various reasons. It could even be data loss that we know about. DD type 2 is data we don't know are missing, or inadvertently omitted. Think about a researcher conducting a study on socioeconomic factors who forgets to include income data.
They don't even realize it's missing; that's why we call it inadvertently omitted. It's like losing a piece of a puzzle without knowing it was part of some larger picture. DD type 3 is choosing just some cases, which I've called selectively scoped. Imagine you're studying people's eating habits, but only in the city, so you ignore the countryside. That's selectively scoped data. Or if you only go out in the evening: it's like taking a picture
but only capturing a portion of the scene, and you don't realize that what you're leaving out affects the whole story. It could also be due to sampling bias; already, by the way, you can see that these types overlap. DD type 4 is self-selection, which I've called volitionally included. All right, so you've sent out a survey about your favorite TV show, let's say The Bear. The Bear is a fantastic show, by the way. And only fans respond. This is an example of only seeing part of the data.
Those who don't like the show may not participate. And by the way, there is no one who doesn't like The Bear, because The Bear is fan-tastic. It's just like Breaking Bad; Breaking Bad is fan-tastic. So that's volitionally included data. This is one of the reasons, by the way, that you're told as a creator, as someone who publishes work, not to listen to the extreme praise nor the extreme negativity: very likely you're not as great as people say you are, and you're not as odious as people say you are. In other words, you have a biased set, as the sketch below illustrates.
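A minimal simulation of this self-selection effect, with made-up opinion scores (standard library only):

```python
import random

random.seed(0)

# Hypothetical opinion scores for a show across the whole audience,
# centered on 6 with spread 2.
population = [random.gauss(6, 2) for _ in range(100_000)]

# Self-selection: suppose only enthusiasts (score >= 8) bother to respond.
respondents = [score for score in population if score >= 8]

print(round(sum(population) / len(population), 2))    # ~6.0, what viewers really think
print(round(sum(respondents) / len(respondents), 2))  # ~9.0, what the survey reports
```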
DD type 5 is missing what matters, which I've called vitally overlooked. Let's say you're conducting a climate study and you ignore humidity. It's there, it's vital, but you overlook it. It's like making French toast and forgetting the eggs: you'll get French toast, it's just not going to taste the same. That's vitally overlooked data. DD type 6 is data which might have been, or counterfactually void.
By the way, we talk about what counterfactuals are in the Tim Maudlin episode, and that's on screen right now; Tim Maudlin and Tim Palmer talk about counterfactual definiteness as it relates to Bell's theorem in quantum theory. OK, so what is it? Think about what might have happened had you taken a different path. Counterfactually void data is like that: hypothetical data, the events that could have transpired, or the records that could have existed, under different circumstances. DD type 7 is what changes with time,
or temporally altered data. Picture a tree through the four seasons. In data terms, this is temporally altered data: it changes over time, the way sales data fluctuates during the holidays. Watching a time lapse, you see the image shift and change, and that affects the data landscape. DD type 8 is the definitions of data, which I've called definitionally exchanged. This is one of my favorite dark data types. If you have two doctors who are using different medical codes for the same condition,
they're seeing the same part of reality, the same patient, the same symptoms, though they're defining it differently. This is definitionally exchanged data. And this, by the way, is one of the reasons why certain medical conditions seem to go away with time and some seem to creep up over time: it's because we're changing the definitions. So we think, hey, there's this huge spike all of a sudden. And it doesn't even have to be a medical issue; it could be something we want. Oh, wow, look at this, crime has gone down. Well, perhaps we've defined violence differently than in the past.
DD type 9 is summaries of data, which I've called summarily reduced. You can think of this like reading a book summary instead of the whole novel: you get the gist, sure, but you miss the details. This, by the way, is one of the reasons the Theories of Everything podcast was started, because I find that the popularizers of science do this. For myself, I'd been missing so much data when I was studying physics, and it bothered me.
So Theories of Everything is an attempt to be more comprehensive and rigorous, to avoid DD type 9, which is summarily reduced data. That is, we don't skip the details, and we try not to water down the explanations, as I believe that's where progress is made. DD type 10 is measurement error and uncertainty; precision lacking is what I call it. Imagine you're measuring something with a broken ruler: you'll get a measurement, sure, but it's off. That's precision lacking data, sketched below.
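A minimal sketch of the broken-ruler point, with synthetic numbers: repeated readings average away random noise, but not a systematic miscalibration.

```python
import random

random.seed(1)

TRUE_LENGTH = 50.0  # cm, the quantity we are actually trying to measure

def broken_ruler(true_value):
    # A mis-calibrated instrument: reads about 5% long, plus random jitter.
    return true_value * 1.05 + random.gauss(0, 0.5)

readings = [broken_ruler(TRUE_LENGTH) for _ in range(1_000)]
print(sum(readings) / len(readings))  # ~52.5: the noise averages out, the bias doesn't
```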
DD type 11 is feedback and gaming; feedback distorted is what I call it. Okay, this is another one of my favorite ones. Sometimes you'll go to a store and they'll say, hey, review our product online and we'll give you a discount. And you're thinking, okay, great, I'd love a free packet of cream cheese or whatever it may be. So now you've rated them five stars and you've distorted the feedback. This is why, online, you can't trust all those reviews: they're feedback distorted.
The feedback isn't entirely genuine, and it skews the overall picture. DD type 12 is information asymmetry, which I've called asymmetrically acquired. Imagine being in a game where someone knows the rules but you don't, so they have an advantage, at least you think so. Asymmetrically acquired data is like that: some people have more or better information, which can tilt the playing field, and this imbalance in information can actually influence what data
gets collected and analyzed. DD type 13 is intentionally darkened data. Think of those FOIA requests or classified documents with the black marks. I call it intentionally obfuscated. It's when you make something vague or hidden for whatever reason: obscurantism, privacy, confidentiality, security, or the veneer of those; it could be an excuse. That's intentionally darkened data.
DD type 14 is fabricated and synthetic data, so synthetically concocted. If you play a video game with simulated characters and environments, that's synthetically concocted data. It's not the real world; it's observations generated by a crafted simulation, like a virtual reality that feels real but isn't. And DD type 15 is extrapolating beyond your data, which I've called extrapolatively risky. OK, so imagine you're trying to predict next decade's fashion trends based on last year's styles.
That's extrapolatively risky data. You're making a leap, a large leap, and while it might land, it could also miss the mark, leading to increased uncertainty and potential inaccuracies. It's like shooting an arrow and just hoping for the best. OK, that's it. I hope you enjoyed it. Of all the talks I've seen, and even the papers that discuss dark data, there isn't one single place that has, in video form, an analysis like this. So I would have appreciated this when I was researching, and I thought maybe you would as well. Take care.
The podcast is now concluded. Thank you for watching. If you haven't subscribed or clicked that like button, now would be a great time to do so, as each subscribe and like helps YouTube push this content to more people. You should also know that there's a remarkably active Discord and subreddit for Theories of Everything, where people explicate toes, disagree respectfully about theories, and build, as a community, our own toes. Links to both are in the description.
Also, I recently found out that external links count plenty toward the algorithm, which means that when you share on Twitter, on Facebook, on Reddit, et cetera, it shows YouTube that people are talking about this outside of YouTube, which in turn greatly aids the distribution on YouTube as well. Last but not least, you should know that this podcast is on iTunes.
It's on Spotify; it's on every one of the audio platforms. Just type in Theories of Everything and you'll find it. Often I gain from re-watching lectures and podcasts, and I read in the comments that, hey, toe listeners also gain from replaying. So how about re-listening on those platforms: iTunes, Spotify, Google Podcasts, whichever podcast catcher you use. If you'd like to support more conversations like this, then do consider visiting patreon.com slash Kurt Jaimungal
and donating with whatever you like. Again, it's support from the sponsors and you that allows me to work on toe full time. You also get early access to ad-free audio episodes there; for instance, this episode was released a few days earlier. Every dollar helps far more than you think. Either way, your viewership is generosity enough.
Think Verizon, the best 5G network, is expensive? Think again. Bring your AT&T or T-Mobile bill into a Verizon store today and we'll give you a better deal. Now, what to do with your unwanted bills? Ever seen an origami version of the Miami Bull? Jokes aside, Verizon has the most ways to save on phones and plans.
So bring in your bill to your local Miami Verizon store today and we'll give you a better deal.
},
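A toy numerical version of that selection effect, with invented numbers throughout (the lognormal luminosities, the inverse-square dimming, and the detection threshold are all assumptions for illustration): the objects you can detect are systematically brighter than the population you are actually trying to describe.

```python
# Toy Malmquist-style bias: we only "see" objects whose apparent
# brightness clears a detection limit, so the detected sample is
# biased toward intrinsically luminous objects. Invented numbers.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
luminosity = rng.lognormal(0.0, 1.0, n)      # intrinsic brightness
distance = rng.uniform(1.0, 10.0, n)         # arbitrary units
apparent = luminosity / distance**2          # inverse-square dimming
detected = apparent > 0.05                   # detection threshold

print(f"true mean luminosity (all objects):  {luminosity.mean():.2f}")
print(f"mean luminosity of detected objects: {luminosity[detected].mean():.2f}")
far = distance > 8                           # the bias is worst far away
print(f"detected mean in the distant shell:  {luminosity[detected & far].mean():.2f}")
```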
{
"end_time": 2123.439,
"index": 85,
"start_time": 2100.623,
"text": " Yeah, and earlier you said with regard to COVID, I don't care about being political, this is not a political question, that we didn't know what was occurring, but we had to do something. Now, it's not always the case. Is it better as a rule of thumb that if we feel like we have only partial data, and we have a, well, this is a tricky question. Like, we have freeze for a reason in our biologically, like don't do anything."
},
{
"end_time": 2153.217,
"index": 86,
"start_time": 2123.677,
"text": " If it's extremely uncertain, just stop, stop for a moment, even more than a moment. So when we get partial data from the government, for whatever reason, they don't want to release it. They don't have it. They're embarrassed by it or fraud. And this is not just with COVID, but it could be with whatever. There's something called FOIA requests in the States. So Freedom of Information Act, you request certain documents and redacted, redacted, or they'll just say they don't have it or they won't get back to you. I'm always interested in is this what they give you this partial truth"
},
{
"end_time": 2172.039,
"index": 87,
"start_time": 2153.66,
"text": " Can it be worse than saying nothing?"
},
{
"end_time": 2200.316,
"index": 88,
"start_time": 2172.671,
"text": " We think this is how it is, but we're going to collect more data. We recognise we've got to do something now. We're in the political arena. People will die if we don't. We're going to do what we think is the best, but we recognise that we may be missing something. That is at least safer. People will keep their eyes open and look and so on. But if they say, this is how it is, then that's disastrous. And I think they can't sort of say, well, we're going to do nothing."
},
{
"end_time": 2225.538,
"index": 89,
"start_time": 2200.879,
"text": " If you're in the political arena, you know, you have to make decisions. Even doing nothing is making a decision. We're not going to vaccinate anybody. Well, that's a decision just as much as vaccinating people is. Is there an in-principle problem with these unknown unknowns? So in this matrix of like known knowns and then known or unknown knowns? Yeah, yeah. Okay, whatever. You get the idea. There's four of them."
},
{
"end_time": 2252.688,
"index": 90,
"start_time": 2225.93,
"text": " that it seems like there's no way we could ever make any decision or we could ever get any information about an unknown unknown. Now, is that true, though? Is there some systematic way of thinking about them, of classifying the unknown unknowns, like some taxonomy there? Have you uncovered something about this? What seems like an epistemological black hole? I agree. Yeah, I think this is probably very context dependent. But yeah, I mean,"
},
{
"end_time": 2280.64,
"index": 91,
"start_time": 2253.268,
"text": " One thing you should always do is sense check your conclusions and your results. If they seem totally bizarre, that casts doubt on what you're missing. In clinical trials, there's something called a funnel plot. Funnel plot? Funnel, as in funnel web spider. Exactly. Actually, it's like an inverted funnel, normally plotted."
},
{
"end_time": 2303.831,
"index": 92,
"start_time": 2282.022,
"text": " One of the problems with clinical trials is that if they produce a clear result then they tend to get published. But if the results are not that clear or perhaps even don't go in the direction you expected, well maybe you've got other things to do rather than spending your time writing up this thing. So often this is"
},
{
"end_time": 2335.043,
"index": 93,
"start_time": 2305.247,
"text": " I've got to write a lecture for tomorrow or I've got to go and chair this meeting. I'll write it up when I get time and you never do. Exciting results, however, you know, well, I'm going to write this up because it's going to attract a lot of attention. What this means is that there can be a bias away from certain kinds of results in the funnel plot."
},
{
"end_time": 2358.302,
"index": 94,
"start_time": 2335.333,
"text": " is a way of plotting all the results of clinical trials in a particular area or indeed you can do it for other kinds of studies and you can sometimes see that there is a sort of void, a gap. If there's a dot for each the result of each clinical trial occasionally you can see this void and you think well"
},
{
"end_time": 2383.729,
"index": 95,
"start_time": 2358.746,
"text": " It's incredibly unlikely that there were no results forming in that region. So we're getting, if we just analyze the dots that we have got, it's going to bias us. So that's one way you can do it. Very much, I think, depends upon the context. But yeah, I think there are ways. This doesn't mean that it's always possible. After all, bottom line,"
},
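A sketch of a funnel plot on simulated trials (the true effect, the range of standard errors, and the 30% publication rate for null results are all invented): the missing-dot void Hand describes shows up as an asymmetric gap among the imprecise studies.

```python
# Sketch of a funnel plot with a simulated publication filter.
# Synthetic throughout; a real meta-analysis would plot the reported
# effect sizes and standard errors of actual trials.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
true_effect, n_studies = 0.2, 300
se = rng.uniform(0.05, 0.5, n_studies)       # small trials -> large SE
effect = rng.normal(true_effect, se)         # each trial's estimate

significant = np.abs(effect / se) > 1.96
published = significant | (rng.random(n_studies) < 0.3)  # invented 30% rate

plt.scatter(effect[published], se[published], s=12, label="published")
plt.scatter(effect[~published], se[~published], s=12, marker="x",
            label="file drawer (invisible in practice)")
plt.gca().invert_yaxis()                     # precise studies plotted at the top
plt.axvline(true_effect, linestyle="--")
plt.xlabel("estimated effect")
plt.ylabel("standard error")
plt.legend()
plt.show()
```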
{
"end_time": 2408.78,
"index": 96,
"start_time": 2384.411,
"text": " You can't measure everything. If we're studying human beings, you can't measure everything about that human being. I can measure their age and weight and BMI and IQ and preferences for politicians and what have you. But I can't measure everything. So naturally, I must be by definition missing almost everything about those people. So there's always a risk. You can never guarantee having everything."
},
{
"end_time": 2440.52,
"index": 97,
"start_time": 2410.828,
"text": " never guarantee that your scientific theory is, in inverted commas, right. For all you know, new data tomorrow might cast out on it. You see this all the time at the front of the physics, very exciting, where people push the boundaries and say, well, it seems to be a bit of a problem with a standard model or something like that. But of course, that's true. And of course, you see it all the time in medicine. You really see it all the time in medicine, where if you read the papers or the news media,"
},
{
"end_time": 2459.309,
"index": 98,
"start_time": 2441.049,
"text": " Every day you get what appear to be contradictory results. Today coffee is good for you, tomorrow next week coffee is bad for you. Well the reason for this is they are looking at different aspects of it or they're looking at different consequences of it or whatever."
},
{
"end_time": 2479.991,
"index": 99,
"start_time": 2459.787,
"text": " This inverted plot seems like it's so tricky to do a meta analysis then I can't imagine doing my brother's a professor of statistics at U of T the math finance program but anyway I just thought he's technically under the umbrella of statistics."
},
{
"end_time": 2503.387,
"index": 100,
"start_time": 2479.991,
"text": " I once showed him this meta-analysis and I said is this an okay meta-analysis or I don't even recall what it was years and years ago. He said that would take me a month or more to go through and I remember thinking how like you're a professor of statistics but he's saying no no it's so it's extremely subtle and they also use different techniques so it's not like everyone knows all the techniques but it's just it's so subtle to go through the reason you need to comb through"
},
{
"end_time": 2510.452,
"index": 101,
"start_time": 2503.951,
"text": " Can I say, I'm with him on this. I get lots of questions."
},
{
"end_time": 2538.097,
"index": 102,
"start_time": 2511.032,
"text": " The short answer is yes, okay, but it's going to take me a while to dig down, look at exactly what they did, look at their comments about each of the studies. So, yeah, I'm with him on it. What's that effect or the study that showed that in high prestige journals, the results tend to be less reliable than those of medium or low quality? Yes. What's that phenomenon called? The primary one is regression to the mean you may be thinking of."
},
{
"end_time": 2567.551,
"index": 103,
"start_time": 2538.285,
"text": " Let me describe how that works because that's an example of another kind of data. Let's suppose we're carrying out lots of experiments on the same topic. An experiment, and let's suppose, I don't know, we're comparing two treatments or whatever, an experiment might show a pronounced positive effect for one of these treatments compared to the other for two reasons. One is maybe the underlying data, maybe there really is an effect."
},
{
"end_time": 2579.309,
"index": 104,
"start_time": 2568.268,
"text": " The second reason is that there's always random variation in these sorts of results, and perhaps this time, just by chance, the random variation has gone on the high side."
},
{
"end_time": 2601.425,
"index": 105,
"start_time": 2580.026,
"text": " You put those two things together and the very highest observations, the most pronounced, significant observations in your collection of results are going to be the ones which have both of those effects combining. There's a real underlying effect plus random variations just giving you an extreme thing. And when you replicate the study,"
},
{
"end_time": 2626.493,
"index": 106,
"start_time": 2602.688,
"text": " And because it's a big effect, it goes to a high prestige journal. They're really delighted and you're delighted as well. But when you replicate the study, any real effect will still be there. But there's a 50-50 chance which way the random aspect will go. And it's much more likely that it's less than the extremely high random bit that you've got before. It's more likely to be lower. So things"
},
{
"end_time": 2644.565,
"index": 107,
"start_time": 2627.09,
"text": " A classic example is the offspring of tall couples are likely to be tall but not as tall as them and offspring of short couples are likely to be taller but still likely to be short but not as short as them. Regression to the mean."
},
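A small simulation of that mechanism, with invented effect sizes and noise levels: take the most extreme of twenty noisy estimates of the same true effect, replicate it once, and count how often the replication comes in lower.

```python
# Regression to the mean: the headline result among many noisy studies
# is extreme partly by luck, so a replication is usually smaller.
# All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(3)
true_effect, noise_sd = 0.3, 0.5
n_labs, n_runs = 20, 10_000

lower = 0
for _ in range(n_runs):
    estimates = true_effect + rng.normal(0, noise_sd, n_labs)
    best = estimates.max()                    # the result that gets noticed
    replication = true_effect + rng.normal(0, noise_sd)
    lower += replication < best

print(f"replication fell below the original 'best' result "
      f"in {100 * lower / n_runs:.0f}% of runs")   # roughly 95%+
```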
{
"end_time": 2675.145,
"index": 108,
"start_time": 2645.913,
"text": " There's the potential to be misled. You're being misled by the data you've got because it includes aspects of random variation. Is there another effect happening that's creating this unreliability in the high prestige journal? There's the file drawer effect, of course. High prestige journals. Yeah, high prestige journals. One wants to publish one's results in a high prestige journal. So if you get a big result,"
},
{
"end_time": 2703.08,
"index": 109,
"start_time": 2675.725,
"text": " highly significant people are going to pay attention that's where you send it and of course those journals maintain their prestige by publishing these important significant results which means that all of the other results tend to go elsewhere or not get published which means that the high prestige journals are more likely to be susceptible to the regression to the mean effect"
},
{
"end_time": 2723.404,
"index": 110,
"start_time": 2703.524,
"text": " The file drawer effects means that the things which weren't significant are less likely to be reported. Let's take an example of a comparison between two drugs where really there's no difference between them. We don't know that, but there's no difference between them."
},
{
"end_time": 2750.964,
"index": 111,
"start_time": 2723.882,
"text": " So we do 20 experiments. So in reality, these two drugs, they work the same. We don't know that we're trying to find out. So it's like Namebrand, Advil and ibuprofen. Exactly, exactly. So we got 20 groups comparing these two treatments. Now, all of those 20 groups produce a result. And there's no difference. So the results are randomly scattered around zero."
},
{
"end_time": 2767.176,
"index": 112,
"start_time": 2751.374,
"text": " the difference between the two treatments is randomly scattered around zero. Some show A better than B, some show B better than A. The ones which might depend whose funding the"
},
{
"end_time": 2797.432,
"index": 113,
"start_time": 2769.258,
"text": " The most significant ones are likely to be the ones which attract attention and get published. If there's no difference, I'm not going to rush out and say, hey guys, I did this experiment and found no difference between the drugs. Well, nobody will care about that. Maybe they should, but it's the ones which have a big difference that are likely to get published. So there's the file drawer effect, meaning that the ones which are around zero don't get published."
},
{
"end_time": 2825.247,
"index": 114,
"start_time": 2798.114,
"text": " And then remember that these results were actually random. We'll take one of the ones, we'll take the most extreme one, which showed a great deal, treatment A, a great deal better than treatment B. It was actually random. So if the group which did that experiment were to repeat it, it could just as easily go the other way. It's very unlikely to be as high as it was that way. So that's regression to the mean coupled with the file drawer effect."
},
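A sketch of the twenty-experiments story with synthetic data (the sample sizes and the p < 0.05 filter are invented stand-ins): the two 'drugs' are drawn from the same distribution, yet the published record, filtered on significance, shows only the flukes.

```python
# File drawer effect: 20 trials comparing two identical treatments.
# Roughly 1 in 20 will clear p < 0.05 by chance alone, and only those
# get 'published'. Synthetic data throughout.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_per_arm, n_trials = 50, 20

published = []
for _ in range(n_trials):
    a = rng.normal(0.0, 1.0, n_per_arm)      # treatment A outcomes
    b = rng.normal(0.0, 1.0, n_per_arm)      # treatment B: same distribution
    if stats.ttest_ind(a, b).pvalue < 0.05:  # the journal's filter
        published.append(a.mean() - b.mean())

print(f"trials run: {n_trials}, trials 'published': {len(published)}")
print("published differences:", [f"{d:+.2f}" for d in published])
# The true difference is exactly zero; anything published is noise.
```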
{
"end_time": 2849.957,
"index": 115,
"start_time": 2825.555,
"text": " What are some ways around this file drawer effect? I think there's something called pre registered trials. Yes, exactly. There's been a big move to do this. What about creating some high prestige? I know this is difficult. If there were journals, I don't know if this exists, but if there are journals that just said, hey, give us your null results, like we will publish them so that you don't have to feel like, okay, I don't have anything. It's not worth publishing. No, no, no, it'll go here. This is a high prestige, no result journal."
},
{
"end_time": 2862.261,
"index": 116,
"start_time": 2850.23,
"text": " Yeah, yeah, and indeed there are medical journals which do that sort of thing now. They say before you do the experiment, write up your paper just leaving out the results section."
},
{
"end_time": 2888.507,
"index": 117,
"start_time": 2863.268,
"text": " If we think it's good enough, carefully, we'll be carefully enough done and so on. We guarantee to publish it. And that's called pre-registered? Is that different? Yeah, you're doing it. And there's a bigger move, which also says if you're going to do clinical trials, you've got to say so before you actually carry them out. You've actually got to register with a registry that you're going to do them. Yeah, I mean, this is a way to tackle these sorts of biases"
},
{
"end_time": 2911.954,
"index": 118,
"start_time": 2888.78,
"text": " and help tackle the unknown unknowns in that case often when we hear some results like smoking is good for you or is not good for you we'll say yeah but look who was funded by like the not good for you or by the people who are anti-smoking or good for you is by the cigarette companies themselves so we'll say that and we'll dismiss it because it was funded by so why do we even care about any results that are funded by if we're going to just question them anyhow"
},
{
"end_time": 2938.148,
"index": 119,
"start_time": 2913.541,
"text": " Well, we rely on the sort of morality and ethics of the funding body, but it's a very good, very good question. Or is it just in comment sections that people are quarreling with one another saying, oh, yeah, you shouldn't listen to it because it's funded by someone. But the professors behind the scenes or the researchers are like, no, no, that's a statistically relevant result. I'm glad they did this. Yeah, well, I think it depends on the particular case. But I mean, yeah, if it's funded by"
},
{
"end_time": 2965.435,
"index": 120,
"start_time": 2938.916,
"text": " Well, you gave the example of tobacco companies and that has a very long and interesting history. If it's funded by such an organization, they may well have a vested interest in getting the results a certain way. So in some sense, there's always this question of, did they manipulate things in some way? There are classic cases of this where an organization would like to show that"
},
{
"end_time": 2988.865,
"index": 121,
"start_time": 2965.998,
"text": " Drug A is better than drug B for a particular illness. They carry out the experiment and they find no difference. So what they then do is look for secondary endpoints. Okay, it wasn't good in this way, but what about this? Did people survive for longer in this way or did they suffer less from such and such in that way? You keep looking through, this is a"
},
{
"end_time": 3016.442,
"index": 122,
"start_time": 2989.36,
"text": " another aspect. You keep looking through the possible things in which A is better than B. You're bound to find something eventually. The truth is that in the past, probably still now, but certainly in the past, there have been cases of data distortion of various kinds or data selection"
},
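A quick simulation of that endpoint-shopping problem, again with invented numbers (sample sizes, endpoint count, and significance threshold are assumptions): with two identical drugs and twenty endpoints per trial, you are 'bound to find something eventually' roughly two-thirds of the time.

```python
# Endpoint fishing: with 20 independent endpoints and no real effect,
# the chance of at least one p < 0.05 is 1 - 0.95**20, about 64%.
# Synthetic data; the counts are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n_per_arm, n_endpoints, n_sims = 100, 20, 2000

hits = 0
for _ in range(n_sims):
    for _ in range(n_endpoints):
        a = rng.normal(0, 1, n_per_arm)      # drug A, this endpoint
        b = rng.normal(0, 1, n_per_arm)      # drug B: identical distribution
        if stats.ttest_ind(a, b).pvalue < 0.05:
            hits += 1                        # found a 'significant' endpoint
            break

print(f"trials with at least one 'significant' endpoint: "
      f"{100 * hits / n_sims:.0f}%  (expected ~64% by chance)")
```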
{
"end_time": 3054.138,
"index": 124,
"start_time": 3047.619,
"text": " Whether one can trust the results. The trusting of the results sounds like an issue in the places where you're dealing with data collection."
},
{
"end_time": 3083.797,
"index": 125,
"start_time": 3054.462,
"text": " But it's not so much an issue in math when you're publishing a theorem. But I do have a question about peer review in general. Like, is there some problem with peer review other than the main critique, which is also the plus side, but the main critique is like, yeah, but you're excluding some results that could be let in. But you're like, yeah, but we have to also put guardrails up because we don't want to let any result in. What are the pros and the cons of peer review that you see from your particular point of view? Because you've yes, because you've studied dark data. If there's something that can be applied from dark data to this. Yeah, I mean, peer review."
},
{
"end_time": 3111.323,
"index": 126,
"start_time": 3084.77,
"text": " The whole question of peer review is a very interesting one. It's a bit like what's been said about democracy. There's a crappy system, but it's the best we've got sort of thing. It does have its shortcomings, but it does mean that there are all sorts of implications. If you're in charge of peer review, you have a wand, what would you change?"
},
{
"end_time": 3141.271,
"index": 127,
"start_time": 3116.34,
"text": " I'm not sure that I could come up with a better system and therefore that there's anything I would change. I was hesitating over whether to say, hey, reviewers, but I'm not sure that that would help. It could lead to its own problems and I can see that. Just as open access publication has led to its own problems, because we all saw this coming. I can remember talking about it in various meetings."
},
{
"end_time": 3163.985,
"index": 128,
"start_time": 3141.681,
"text": " You know open access publishing? No, is that like the archive or is that different? No, it's a bit different. Archive is interesting. Oh yeah, that's another topic. In a moment, yeah, let's talk about open access. In the old days, the basic publication model was you could do your work, write up your paper, submit it to a journal,"
},
{
"end_time": 3192.346,
"index": 129,
"start_time": 3164.548,
"text": " and it would go into that journal and only the people who had subscription to that journal or the universities which have subscriptions to that journal would have access to it and then it was gradually recognized that this was not to the benefit of the whole scientific community and hence humanity but also it was a bit unfair. Lots of these projects were funded by public money so you know if I'm latently"
},
{
"end_time": 3222.568,
"index": 130,
"start_time": 3193.012,
"text": " funding this research, I ought to be able to get access to it. Okay. So they switched to an open access model whereby essentially the authors pay a fee to have their papers published. It's got to be accepted first, go through the refereeing process, whatever first, but then they have to pay to have it accepted. And then anybody can read it. It's open access after that."
},
{
"end_time": 3251.254,
"index": 131,
"start_time": 3222.978,
"text": " So the business model for the publishers is now they're collecting money from the authors instead of from the subscribers. One of the adverse consequences of this is, of course, that a load of crooked journals have been set up. There are thousands of these now, which will publish your paper, no matter how crappy it is, for a fee. And so they make their money by this one"
},
{
"end_time": 3268.865,
"index": 132,
"start_time": 3251.903,
"text": " Predatory journals? Predatory journals. There's a classic case of someone who submitted a paper to one of these journals, paid his $500 to get it published, and the paper"
},
{
"end_time": 3283.575,
"index": 133,
"start_time": 3269.565,
"text": " The title of the paper was, this is effing nonsense. And the text of the paper said, this is effing nonsense, this is effing nonsense. It just repeated that and it got published, which shows you that it's nonsense. So that's the risk of open access."
},
{
"end_time": 3311.578,
"index": 134,
"start_time": 3284.326,
"text": " I don't know if this is a published paper, but this is hilarious. It's a paper, it's just called Chicken, Chicken, Chicken, Chicken, but it's written in LaTeX and so it's beautiful. And then there's like an equations Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken, Chicken,"
},
{
"end_time": 3333.643,
"index": 135,
"start_time": 3312.193,
"text": " You know, they produce nonsense, but that's the consequences. So what classifies them as a journal? Like I could create some platform, call it a journal, but do I have to petition to somewhere else for it to be called a journal or? No, no, no, no, no. There's no overseeing regulatory body. You just call yourself a journal. Exactly. Exactly. The self-publish and just create your own journal. Yeah, yeah, that's right."
},
{
"end_time": 3364.582,
"index": 136,
"start_time": 3334.991,
"text": " The Advanced Journal of Artificial Intelligence Innovations and Applications. There we are. And we charge $1,000 to everybody who submits their paper to us. We'll have a very high acceptance rate. And yeah, we're away. Okay, now that's the issue with open access. And what's the issue with the archive? And for those who are unfamiliar, the archive is? Okay. Oh, yeah, archive is a way. I think this is great innovation. It's a way of"
},
{
"end_time": 3386.493,
"index": 137,
"start_time": 3365.265,
"text": " getting your paper out there before it goes through the refereeing process and appears formally in a journal. So it hasn't been refereed. None of your peers have had a chance to read it and say, oh, there's a mistake here, or, you know, not sure about this bit, this isn't clear, whatever. It hasn't gone through that process. But"
},
{
"end_time": 3401.852,
"index": 138,
"start_time": 3387.483,
"text": " I think archive and similar things are great because in a way the papers become dynamic. They're out there but they can get revised and replaced as it goes through the refereeing process or as other people comment on it."
},
{
"end_time": 3431.425,
"index": 139,
"start_time": 3402.295,
"text": " so you know you write the paper you publish it in this electronic archive other people can read it it's meanwhile going through the refereeing process so it will appear in a journal but you're getting feedback perhaps and it's all it's out there if you've got an important result people can read it earlier than the months perhaps years it would take to appear in a conventional journal i didn't know that you can update your link with a different pdf the same exact link you can update it may depend on the particular"
},
{
"end_time": 3460.265,
"index": 140,
"start_time": 3432.005,
"text": " Now going back to large language models, what are some issues that you see that we haven't spoken about so far and that you don't see talked about much? Okay, I mean, whether they're spoken about, perhaps they're not. I mean, the fundamental problem is that they are based on, okay, we've actually the theory driven data driven stuff, large language models are not based on an understanding of the world."
},
{
"end_time": 3488.217,
"index": 141,
"start_time": 3460.828,
"text": " They are just based on what people have written and so on on the web or on text or wherever. So when you throw a rock and you ask chat GPT or one of the other large language models, what happens when I throw a rock? And it says when you throw a rocket describes a parabolic trajectory determined by how hard you've thrown it and the air resistance and so on."
},
{
"end_time": 3518.131,
"index": 142,
"start_time": 3489.548,
"text": " It's not saying that because it understands what happens when you throw a rock. It's saying that because it's looked through all of its masses of billions of records of text and this is how it's described. So the fundamental risk is that it can do silly things and you will be aware from looking at the web about large language models, they can say absolutely stupid things, they can say nonsense. A classic example which you may have come across"
},
{
"end_time": 3546.852,
"index": 143,
"start_time": 3518.422,
"text": " Jack GPT was asked which of these two presidents was older, Grover Cleveland, this is American presidents, Grover Cleveland or George Bush. And it said, whichever way around it was, Cleveland was older because he was 47 and George Bush was 64. And this guy said, the guy asking it said, okay, the 47 is bigger than 64. And the machine said, yeah, the 47 is bigger than 64. The machine said, yeah, a number is considered bigger than another one if it's larger."
},
{
"end_time": 3570.196,
"index": 144,
"start_time": 3547.398,
"text": " So that I can imagine after a pause, the human interlocutor said, okay, could you count up from 64 to 47 for me? And the machine started doing this and then gave up after a while. And there are many other examples of nonsense produced by these things. So"
},
{
"end_time": 3600.896,
"index": 145,
"start_time": 3571.067,
"text": " I think that's the risk. They don't understand them. Think of my throne rock example. They don't have a model, a theory about how the world works. All they have is what people say, the text, and they try to put it together and generate a response based on that. So they don't know if they're saying something absolutely stupid in terms of the way the world works. Now, I have a sort of corollary to this. You know, often when you're talking to ordinary people, talking to me,"
},
{
"end_time": 3631.101,
"index": 146,
"start_time": 3601.254,
"text": " People say stupid things as well. It's just that the kind of stupid things that the large language model says is different from the kinds of stupid things that people say. You've got the same sort of thing with chess programs. The way a chess program plays is different from the way humans do. So common sense has been a long standing challenge in AI. To explain why, let me draw an analogy to dark matter."
},
{
"end_time": 3658.78,
"index": 147,
"start_time": 3631.51,
"text": " So only 5% of the universe is normal matter that you can see and interact with, and the remaining 95% is dark matter and dark energy. Dark matter is completely invisible, but scientists speculate that it's there because it influences the visible world, even including the trajectory of light. So for language, the normal matter is the visible text, and the dark matter is the unspoken rules about how the world works,"
},
{
"end_time": 3686.817,
"index": 148,
"start_time": 3659.275,
"text": " including NIVA physics and folk psychology, which influence the way people use and interpret language. So why is this common sense even important? Well, in a famous thought experiment proposed by Nick Bostrom, AI was asked to produce and maximize the paper clips, and that AI decided to kill humans to utilize them as additional resources"
},
{
"end_time": 3711.886,
"index": 149,
"start_time": 3687.483,
"text": " to turn you into paper clips. Because AI didn't have the basic human understanding about human values. The final moves it makes is different from the way humans do. And I have to say, I was just going to say, I have to say that the most powerful chess playing systems are combinations of the two."
},
{
"end_time": 3722.159,
"index": 150,
"start_time": 3712.568,
"text": " And I can see that also applying in other AI contexts in the future. Yeah, so I don't see that being much of an issue. In the chess example, a chess computer can..."
},
{
"end_time": 3745.964,
"index": 151,
"start_time": 3722.619,
"text": " Obliterate I think Magnus Carlsen is the top player easily and even when it makes a mistake it's easily identifiable as a mistake that only a computer would make at least some of the time and there's some controversy around that or at least there was a few months ago about cheating but regardless who cares if it makes a mistake and makes a mistake of a particular kind but overall it's better than a person playing chess now that's a defined game."
},
{
"end_time": 3763.865,
"index": 152,
"start_time": 3745.964,
"text": " What's the danger in that sounds like it's a boon is like a helpful assistant if i had an extra high school student to help me with this channel."
},
{
"end_time": 3787.978,
"index": 153,
"start_time": 3763.865,
"text": " That's great if I had 100 extra for 10 cents at a moment's notice. That's wonderful. I won't trust it when it comes to certain, verify certain facts, but. I would say, yeah, I'd say exams aren't like the real world. Would you want to, suppose you had one of these systems without human involvement, telling you what medicines and what doses to take?"
},
{
"end_time": 3808.695,
"index": 154,
"start_time": 3788.609,
"text": " and it could easily say 47 is bigger than 64. Would you trust it to tell you what dose to take? Or think of an earlier example of a software system, the Boeing 747 Max 8 crash. Now, this was telling the pilot to"
},
{
"end_time": 3837.773,
"index": 155,
"start_time": 3809.087,
"text": " put the nose down. The autopilot took over and was driving the thing towards the ground, even though the passengers looking out of the window could see the ground getting closer and closer together. It just didn't have the bigger perspective, the sort of theory perspective. It was just based on the data being fed to it. And it was unable to look outside that way, in a way that the humans can. So again, let me come back to a medical case. You might"
},
{
"end_time": 3865.077,
"index": 156,
"start_time": 3838.148,
"text": " Someone might go to the doctor and say, I got this terrible backache doctor and the machine will look at its vast records of information on backaches and prescribe something. But what it won't know is that you spend your job is lifting heavy sacks of flour or something like that. It doesn't have this broader perspective that you do have, which is why I think the two things together can be so much more beneficial."
},
{
"end_time": 3895.333,
"index": 157,
"start_time": 3865.538,
"text": " even in chess that's true the two things together can do better than either alone yeah i heard that somewhere but i never saw the source of this i heard some claim that if you take a mediocre player with a somewhat great machine that it can beat a fantastic machine with zero input from a human something like that but then i don't know what is fantastic what is mediocre like well exactly subject to that sort of qualification but the general principle is right i mean at some point um you know you would"
},
{
"end_time": 3925.691,
"index": 158,
"start_time": 3896.305,
"text": " You've got to have some, I suppose, reasonable level on each side. Can that be mitigated with these plugins? Like, I don't know if you've heard, but Wolfram Alpha is now integrated into ChatGPT. And so when it comes to calculations, the chat model knows, hey, when it comes to numbers, go use this tool. And this tool is a reliable tool. And in fact, mathematicians use Mathematica. I think Mathematica and Wolfram Alpha have some interplay. Yes. I think that is a very interesting idea. And in a way, it's sort of like a combination"
},
{
"end_time": 3955.794,
"index": 159,
"start_time": 3926.049,
"text": " What are your reservations? Yeah, exactly. I suppose I have two reservations. There's a version of one of these systems which writes code, probably more than one many systems which write computer code, and they can sometimes do similar silly things. But I think my real hesitancy"
},
{
"end_time": 3981.8,
"index": 160,
"start_time": 3956.357,
"text": " is you've got a complex system and you're tweaking it and when you start to tweak complex systems to overcome a problem you've got a complex system you see it doesn't behave sensibly when we when this sort of circumstance occurs so we'll say look when that circumstances occurs do this but when you start to tweak complex systems there are often all sorts of other unintended consequences"
},
{
"end_time": 4009.445,
"index": 161,
"start_time": 3982.449,
"text": " So, you know, it's a complex system. By definition, we don't understand how it works. It's just too complicated. It's got billions of parameters, whatever. And when you start to tweak complex systems, unintended consequences can occur. So it's great now solving this problem, which caused me to make that modification. But unknown to me, elsewhere in the system, other things will go wrong."
},
{
"end_time": 4036.817,
"index": 162,
"start_time": 4009.94,
"text": " So professor, as we end, if you don't mind me asking a personal question, how is it that you spend your days? Do you have a daily schedule that you stick to and is fairly regimented? Do you oscillate? Are you concerned with publishing research? Like what are your values right now as well as your schedule? What does it look like? I do. I mean, I'm still, yeah, I do publish. I do write. This is a very interesting question."
},
{
"end_time": 4063.046,
"index": 163,
"start_time": 4037.193,
"text": " I'm going to slightly change the question. For many academics, perhaps most academics, there is a tension between the things you've got to do and the things you would want to do, the sort of research you would want to do. And I haven't, over my entire life, I've been wrestling with this tension. Should I start the day by getting that lecture preparation out of the way?"
},
{
"end_time": 4092.261,
"index": 164,
"start_time": 4063.763,
"text": " and then realizing that it's the end of the day and I haven't done anything else. Or should I start the day by really trying to tackle that problem which has been nagging at me, that technical issue, you know, calculating whatever it is and then getting to the end of the day and realizing, you know, it's sort of 8 o'clock in the evening and I've still got to write the lecture for tomorrow, whatever. And I have never managed to balance those two. It's all too easy to not do the things that you regard as important in terms of your research."
},
{
"end_time": 4119.48,
"index": 165,
"start_time": 4092.858,
"text": " by focusing on the things that you've got to do, like lectures or whatever, but it's easy to go the other way as well. I haven't told you what my day is because I haven't found an answer for that. It varies. I suppose it depends how pressing things are when I get up in the morning. Ideally, I'd spend all my day on research and writing, thinking about things, having great thoughts, but it doesn't work out like that."
},
{
"end_time": 4144.087,
"index": 166,
"start_time": 4120.265,
"text": " When you're having these thoughts are you ordinarily walking like you have a certain trail that you go down or do you sit with the pen or you're reading like how does it work? That's a very nice question because I discovered that so I'll sit at my computer trying to do this and then I think well I've been sitting here for two hours I need to make sure my body still works so I'll get up and go for a stroll"
},
{
"end_time": 4152.602,
"index": 167,
"start_time": 4144.411,
"text": " And of course, this is the wonderful thing about being sort of an academic, your work comes with you in some sense. And very often I will find that"
},
{
"end_time": 4182.005,
"index": 168,
"start_time": 4153.439,
"text": " That enables me to focus on the problem better and it becomes clearer and I do find a solution. Maybe not all the way while I'm walking because I need to sit down with my pencil and paper or computer or whatever to actually go through the details, but maybe new avenues for tackling it come along. So yeah, it's interesting you should say that. For me, certainly, walking helps. So what's next for you? I know you don't want to reveal the book, one of several book ideas you're working on, perhaps one more than the others, but what can you reveal?"
},
{
"end_time": 4209.36,
"index": 169,
"start_time": 4185.998,
"text": " I'm especially interested in is, and I have been and I've written papers on this throughout my career, is getting people to formulate the right question or getting people to think carefully about what it is they want to know. In data analysis and statistics over there are trivial. Let me give you a trivial example if I've got time."
},
{
"end_time": 4242.961,
"index": 170,
"start_time": 4213.063,
"text": " Should you use the mean or the median when you're comparing distributions? And the truth is that either may be appropriate depending upon exactly what you want to know. I want to know if... Well, let me give you a real example. In I think mid-90s, American baseball players went on strike. They said they weren't being paid enough because they said that"
},
{
"end_time": 4271.237,
"index": 171,
"start_time": 4243.797,
"text": " average salary is only a third of a million dollars and you can possibly survive on a third of a million dollars a year sort of thing but the club owners said on the contrary you're being paid very well you're being paid half a million dollars a year and you'd think well you know this is easy to sort out who is right we'll go and look at the numbers but it turned out that the"
},
{
"end_time": 4299.565,
"index": 172,
"start_time": 4272.807,
"text": " owners of the club, the people paying the, they were using the arithmetic mean and the arithmetic mean matters because their total wage bill was the number of players times the arithmetic mean. But the players were using the medium, which is the right thing for them because 50% of them earn less than that. So it really depends crucially on what exactly you want to know. And I've written quite a bit about this and I'm still working on this sort of as a general issue."
},
{
"end_time": 4319.497,
"index": 173,
"start_time": 4300.111,
"text": " It is very important to be precise about what it is you want to know or if you can't be precise, recognise that you can't be precise and recognise that there will be ambiguity in the answer and you have to cope with that ambiguity in some way. So that's a general aspect of my ongoing research."
},
{
"end_time": 4345.367,
"index": 174,
"start_time": 4320.06,
"text": " And just for the people who are still scratching their head about the difference between average, which is a synonym for mean, and then median, which is what you referenced. Okay, yeah, sorry, sorry. Average is a general term. Mean, arithmetic mean, sometimes just abbreviated to mean, is the conventional thing. You add up all the numbers and divide by how many there are. The median is the number which has got 50% of the values below and 50% of the values above."
},
{
"end_time": 4375.708,
"index": 175,
"start_time": 4346.305,
"text": " But I've written about this in various places. I can't remember if it's in the data book or not. Sure. And just for those people, rule of thumb for applying is if you feel like your distribution is like a bell curve, then it doesn't particularly... It doesn't matter. If it's a symmetric distribution, there'll be the same. And the point is, of course, that salary distributions are very skewed. You've got some super player right up there earning vast, earning 15, 20, 100, I don't know, million dollars a year."
},
{
"end_time": 4385.555,
"index": 176,
"start_time": 4376.169,
"text": " You've got lots and lots of people earning less than a third of a million dollars down here. With a skew distribution, the mean and the median are very different."
},
{
"end_time": 4410.265,
"index": 177,
"start_time": 4385.725,
"text": " With a symmetric distribution, they're the same. Yeah. Yeah. One more concrete example is if you're a thousand people and you're working in a factory and each of you are paid forty thousand per year, but the boss says, hey, no, no, no, the average here is one million per year. The reason is because the boss, that one person is a billionaire. And so you can average across those one thousand and say the average person here, if you pluck an average quote unquote person out, it's half a million or a million per year."
},
{
"end_time": 4427.517,
"index": 178,
"start_time": 4410.265,
"text": " Yeah, but that's not what we're trying to say. That's right. And think of it. So the boss is interested in the overall wage bill, so the mean times the number of employees. A new person thinking, should I go to work at that company? And the boss says, the average is a million a year. And then"
},
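The factory example as a two-line check, using the invented figures from the conversation (a thousand workers at $40,000, with the boss's pay assumed here to be a billion dollars purely for illustration):

```python
# Mean vs median on a skewed salary distribution (figures invented
# to match the example: 1000 workers at $40k plus one very rich boss).
import numpy as np

salaries = np.array([40_000] * 1000 + [1_000_000_000])
print(f"mean:   ${salaries.mean():>13,.0f}")      # ~ $1.04 million, the boss's claim
print(f"median: ${np.median(salaries):>13,.0f}")  # $40,000, the typical worker
```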
{
"end_time": 4448.985,
"index": 179,
"start_time": 4427.517,
"text": " Alright, well thank you so much Professor, it was a pleasure and I hope to speak with you again when your next book comes out and also"
},
{
"end_time": 4476.647,
"index": 180,
"start_time": 4448.985,
"text": " Perhaps even to speak about your former book on coincidences at some point. Well, thank you very much indeed. I've enjoyed it tremendously. You asked some really great questions. I really appreciate it. Thank you. Take care, sir. As promised, here's a compendium of the dark data types. This is as far as I understand them. I've made some personal notes of David Han's collocation of different dark data types and it's this sorted list of 15 where I've given the monikers as a mnemonic to myself so that I can more easily assimilate them into my knowledge."
},
{
"end_time": 4505.418,
"index": 181,
"start_time": 4476.647,
"text": " DD type 1 is data we know are missing, which I've called consciously undocumented. So imagine you're filling out a survey and they don't ask you for your age on purpose. This is a classic example of data that's consciously left out. We know it's missing. It's done for various reasons. It could even be data loss that we know. DD type 2 is data we don't know are missing or inadvertently omitted. Okay, so think about a researcher conducting a study on socioeconomic factors but forgets to include income data."
},
{
"end_time": 4534.241,
"index": 182,
"start_time": 4505.64,
"text": " They don't even realize it's missing. That's why we call it inadvertently omitted. It's like losing, let's say a piece of a puzzle and you don't know that it was part of some larger picture. DD type three is choosing just some cases, which I've called selectively scoped. So imagine you're studying people's eating habits, but only in the city. So you ignore the countryside. That's an example of selectively scoped data. Or if you only go in the evening time, that's like taking a picture."
},
{
"end_time": 4560.572,
"index": 183,
"start_time": 4534.718,
"text": " But you're only capturing a portion of the scene, and you don't realize that what you're leaving out affects the whole story. It could also potentially be due to sampling bias. Already, by the way, you can see that these overlap. DD type 4 is called self-selection, which I've called volitionally included. All right, so you've sent out a survey about your favorite TV show. Let's say The Bear. The Bear is a fantastic show, by the way. And only fans respond. This is an example of only seeing part of the data."
},
{
"end_time": 4588.848,
"index": 184,
"start_time": 4560.879,
"text": " Those who don't like the show may not participate. And by the way, there is no one who doesn't like the bear because the bear is fan-tastic. It's just like Breaking Bad. Breaking Bad is fan-tastic. So that's volitionally included data. This is one of the reasons, by the way, you're told as a creator, as someone who publishes work, that don't listen to the extreme praise, nor the extreme negativity, as very likely you're not as great as people say you are, and you're not as odious as people say you are. In other words, you have a biased set."
},
{
"end_time": 4611.34,
"index": 185,
"start_time": 4589.053,
"text": " DD type 5 is missing what matters. So this is called vitally overlooked. Let's say you're conducting a climate study and you ignore humidity. It's there, it's vital, but you overlook it. It's like you're making French toast and you forget the eggs. So you'll get French toast. It's just not going to taste the same. That's vitally overlooked data. Now DD type 6 is data which might have been or counterfactually void."
},
{
"end_time": 4640.111,
"index": 186,
"start_time": 4611.578,
"text": " By the way, we talk about what counterfactuals are in the Tim Modellin episode, and that's on screen right now. Tim Modellin and Tim Palmer talk about counterfactual definiteness as it relates to Bell's theorem in quantum theory. OK, so what is it? You can think about what might have happened had you taken a different path. Counterfactually void data is like that. This hypothetical data, the events that could have transpired under different circumstances or what could have existed under different circumstances. DD type 7 is what changes with time."
},
{
"end_time": 4670.367,
"index": 187,
"start_time": 4640.401,
"text": " or temporarily altered data. So picture a tree through the four seasons. In data terms, this is temporarily altered data. It changes over time. Like also sales data fluctuates during the holidays. Watching a time lapse, you see the movie shift and change and it affects the data landscape. DD type eight now is the definitions of data, which I've called definitionally exchanged. This is one of my favorite dark data types. So if you have two doctors who are using different medical codes for the same condition,"
},
{
"end_time": 4700.503,
"index": 188,
"start_time": 4670.691,
"text": " They're seeing the same part of reality, the same patient, the same symptoms, though they're defining it differently. This is definitionally exchanged data. And this, by the way, is one of the reasons why certain medical conditions seem to go away with time and some seem to creep up over time. It's because we're changing the data. So we think, hey, there's this huge spike all of a sudden and some issue doesn't even have to be a medical issue. It could be something we want. Oh, wow. Look at this. Crime has gone down. Well, we've defined violence differently than in the past, let's say."
},
{
"end_time": 4724.582,
"index": 189,
"start_time": 4700.725,
"text": " DD type 9 is summaries of data, which I've called summarily reduced. You can think of this like reading a book summary instead of the whole novel. You get the gist, sure, but you miss the details. This, by the way, is one of the reasons why the theories of everything podcast is started, because I find that the popularizers of science do this. And I've just been, for myself, I've been missing so much data when I was studying physics, and it bothered me."
},
{
"end_time": 4748.882,
"index": 190,
"start_time": 4724.838,
"text": " So theories of everything is an attempt to be more comprehensive and rigorous to avoid DD type 9, which is summarily reduced data. That is, we don't skip on the details and try to not water down the explanations, as I believe that's where progress is made. DD type 10 is measurement error and uncertainty, precision lacking is what I call it. So imagine you're measuring something with a broken ruler. You'll get a measurement, sure, but it's off. So that's precision lacking data."
},
{
"end_time": 4774.394,
"index": 191,
"start_time": 4749.002,
"text": " DD type 11 is feedback and gaming. Feedback distorted is what I call it. Okay, this is another one of my favorite ones. Sometimes you'll go to a store and then they'll say, hey, review our product online. We'll give you a discount. And then you're thinking, okay, great. I'd love a free packet of cream cheese or whatever it may be. Yeah. Okay. So now you've rated them five stars and you've distorted the feedback. So sometimes online you can't trust all these reviews. Those reviews are feedback distorted."
},
{
"end_time": 4801.886,
"index": 192,
"start_time": 4774.701,
"text": " It's not entirely genuine and it skews the overall picture. So DD type 12 is informational asymmetry or information asymmetry, which I've called asymmetrically acquired. Imagine being in a game where someone knows the rules, but you don't know the rules, so they have an advantage. At least you think so. Asymmetrically acquired data is like that, where some people have more or better information that can tilt the playing field. And this imbalance in information can actually influence what data"
},
{
"end_time": 4825.572,
"index": 193,
"start_time": 4802.21,
"text": " gets collected and analyzed. DD type 13 is intentionally darkened data. Think of this as those FOIA requests or classified documents with the black marks. I call it intentionally obfuscated. It's when you make something vague or hidden for whatever reason, obscurantism, privacy, confidentiality, security, or the veneer of that. It could be an excuse. That's called intentionally darkened data."
},
{
"end_time": 4855.913,
"index": 194,
"start_time": 4826.169,
"text": " DD type 14 is fabricated and synthetic data, so synthetically concocted. If you play a video game and there's simulated characters and environments, that's synthetically concocted data. It's not the real world. It's observations that were generated by a crafted simulation, like a virtual reality where we feel it's real, but it's not. And DD type 15 is extrapolating beyond your data, which I've called extrapolatively risky. OK, so let's imagine you're trying to predict next decade's fashion trends based on last year's style."
},
{
"end_time": 4884.548,
"index": 195,
"start_time": 4856.169,
"text": " That's extrapolatively risky data. You're making a leap, a large leap. And while it might land, it could also miss the mark and it leads to increased uncertainty, potential inaccuracies. So it's like you're shooting an arrow. You're just hoping for the best. OK, that's it. I hope you enjoyed it. Because of all the talks that I've seen and even the papers that talk about dark data, there's not one single place that just has in video form an analysis like this. So I would have appreciated this when I was researching and I thought maybe you would as well. Take care."
},
{
"end_time": 4912.944,
"index": 196,
"start_time": 4886.425,
"text": " The podcast is now concluded. Thank you for watching. If you haven't subscribed or clicked that like button now would be a great time to do so as each subscribe and like helps YouTube push this content to more people. You should also know that there's a remarkably active discord and subreddit for theories of everything where people explicate toes, disagree respectfully about theories and build as a community our own toes. Links to both are in the description."
},
{
"end_time": 4932.517,
"index": 197,
"start_time": 4913.285,
"text": " Also, I recently found out that external links count plenty toward the algorithm, which means that when you share on Twitter, on Facebook, on Reddit, et cetera, it shows YouTube that people are talking about this outside of YouTube, which in turn greatly aids the distribution on YouTube as well. Last but not least, you should know that this podcast is on iTunes."
},
{
"end_time": 4960.862,
"index": 198,
"start_time": 4932.688,
"text": " It's on Spotify. It's on every one of the audio platforms. Just type in theories of everything and you'll find it. Often I gain from re-watching lectures and podcasts and I read that in the comments. Hey, toll listeners also gain from replaying. So how about instead re-listening on those platforms? iTunes, Spotify, Google Podcasts, whichever podcast catcher you use. If you'd like to support more conversations like this, then do consider visiting patreon.com slash Kurt Jaimungal."
},
{
"end_time": 4978.797,
"index": 199,
"start_time": 4960.862,
"text": " And donating with whatever you like again it's support from the sponsors and you that allow me to work on toe full time you get early access to add free audio episodes there as well for instance this episode was released a few days earlier every dollar helps far more than you think either way your viewership is generosity enough."
}
]
}