Perplexity vs the internet
In which I attempt (and fail) to listen to a podcast and reflect on the risks of Perplexity's approach
When I have time (which is rarely), I have been trying to listen through the Lex Fridman podcast episode featuring Aravind Srinivas, the CEO of Perplexity (a company I covered in quite a bit of detail in my earlier piece on AI and search engines).
It’s an interesting listen so far, and whilst I haven’t got that far into it, a couple of things leapt out that I immediately thought were worth highlighting.
Right off the bat, the podcast starts with a clip of Aravind contemplating a future vision of the product whereby using it could be like having a conversation with Einstein: you could ask it something, it could say “I don’t know”, and then it could come back to you later, having done the research, with an answer and a deep understanding of the topic.
I have to be honest, I think that’s a really compelling vision of the product. The resource and compute power for us all to have our own digital Einstein would undoubtedly be huge, but I really like the idea of being able to outsource research to an AI agent and leave it working, then come back not just to results/pages to read, but to an actual understanding of the topic, so that through conversation you can dig into areas you don’t understand, expand into other ideas, and generally talk it through with a (now) expert.
I’m not sure there are loads of day-to-day use cases where many of us would need such expertise, but it could definitely be interesting, and it may just increase the overall human pursuit of knowledge. I could see it being super useful as a dedicated private tutor for students, though.
Fundamentally, it’s about search
That’s a quote from the host, Lex Fridman, and is also the main point that I made in my previous write-up on Perplexity. I really can’t emphasise this enough - they will be using a variety of LLMs (generative AI platforms), but the key to their success is the quality of the data the LLM has to work with, and that all comes down to search.
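To make that concrete, an “answer engine” is essentially retrieval-augmented generation: search first, then have an LLM summarise what the search found. Here’s a minimal sketch of that shape - the function bodies are stand-ins of my own invention, not Perplexity’s actual internals:

```python
from dataclasses import dataclass

@dataclass
class Doc:
    url: str
    text: str
    score: float  # relevance score from the underlying search index

def search(query: str) -> list[Doc]:
    # Stand-in for a real web search backend - this is the hard,
    # Google-shaped part of the problem.
    return [Doc("https://example.com", "...", 0.9)]

def generate(prompt: str) -> str:
    # Stand-in for any capable LLM; the model is swappable,
    # the retrieved data is not.
    return "..."

def answer(question: str) -> str:
    # Retrieve, keep the best few sources, then ask the LLM to
    # answer strictly from those sources, citing them.
    docs = sorted(search(question), key=lambda d: d.score, reverse=True)[:5]
    context = "\n\n".join(f"[{d.url}]\n{d.text}" for d in docs)
    return generate(
        f"Answer using only these sources, citing them:\n{context}\n\n"
        f"Question: {question}"
    )
```

Notice that the LLM call is the easy, commoditised step in this sketch; everything interesting lives inside search().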
Early in the conversation, Aravind uses the analogy of Wikipedia: if you make an edit, you are expected to cite sources, and not just any random source - they have to be “notable sources”. This was the first thing that made me stop and think.
The comment immediately took me back to something I wrote in my previous piece:
Given these conditions, and having worked in technology start-ups, I’d be expecting them to try to work out what is the most efficient way to get to better results. They will weigh up functionality and improvements in the product in terms of cost-benefit and potential risk, and will be making decisions quickly as they scale.
In that context, you could imagine some quick fixes, something along the lines of a manually curated list of “approved” sites - it’d probably cost very little to assemble a list of thousands of reputable sites and simply bump their ranking whenever they come up. Making sure the content is reliable that way would appear to be a sensible 80-20 approach to improving the quality of answers for the majority of people.
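To illustrate just how cheap that shortcut would be, here’s a sketch of the idea (purely my own - I have no visibility into Perplexity’s actual code, and the domain list is made up):

```python
from dataclasses import dataclass
from urllib.parse import urlparse

@dataclass
class Result:
    url: str
    score: float  # relevance score from the search index

# Purely illustrative allowlist - not a real list from anywhere.
APPROVED_DOMAINS = {"wikipedia.org", "bbc.co.uk", "nature.com"}

def boost_approved(results: list[Result], factor: float = 2.0) -> list[Result]:
    # Bump the ranking of anything on the allowlist so "reputable"
    # sites float to the top of whatever gets fed to the LLM.
    for r in results:
        domain = urlparse(r.url).netloc.removeprefix("www.")
        if domain in APPROVED_DOMAINS:
            r.score *= factor
    return sorted(results, key=lambda r: r.score, reverse=True)
```

A few thousand domains and one score multiplication would be the whole trick - which is exactly why it looks like such a plausible 80-20 move for a start-up.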
As I said, the problem being solved is fundamentally search, and Google has a several-decade head start on perfecting how to find the best, most accurate content on the web, whilst also detecting and down-ranking all the naughty folks trying to game the system.
So what shortcuts might we imagine would be available to someone trying to get their search engine from 0-to-100 quickly? Curating a list of “reputable” sources would be a sensible first step, right?
My immediate next thoughts were that I should 1) test this and 2) just ask Perplexity if it curates sites.
1) The test
My testing is obviously not going to stand up to much scientific rigour, as I’m trying to determine the behaviour of the search engine just by asking a question and seeing how the resulting sources differ from, for example, Google. Luckily I have a pretty safe go-to query: “can I bbq and sous vide my christmas turkey”. It’s safe because it’s a pretty curveball approach to cooking your Christmas dinner that not many people have written about (but I did, because why not?). My site actually ranks OK on Google - it’s been around for a while, has decent user engagement, and has a reasonable amount of content for Google to establish it as a credible site. However, it’s a relative minnow in terms of web content publishing properties, so there is absolutely no way it is going to be on Perplexity’s “curated” list.
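For what it’s worth, the comparison I’m doing by eye boils down to something like this (a sketch only - the cited URLs are read off each engine’s results page by hand, and no official API is implied):

```python
from urllib.parse import urlparse

def cites_domain(cited_urls: list[str], domain: str) -> bool:
    # True if any cited source URL belongs to the given domain
    # (including its subdomains).
    def host(url: str) -> str:
        return urlparse(url).netloc.removeprefix("www.")
    return any(host(u) == domain or host(u).endswith("." + domain)
               for u in cited_urls)

# URLs transcribed by hand from an engine's results page (made up here):
google_urls = ["https://www.example-food-blog.com/bbq-sous-vide-turkey"]
print(cites_domain(google_urls, "example-food-blog.com"))  # True
```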
Googling it, my site comes up as two of the top three results:
Next up, I asked Google Gemini Search (this is Google’s equivalent to Perplexity). My starting assumption was that it might reference my site, given its performance in Google Search.
The way Gemini does it is by generating an answer from its LLM (like asking ChatGPT), and then there is an option to double-check the answer against Google - at which point it will perform a search and add references backing up its claims.
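In pseudocode, that two-step flow looks something like this (my own reconstruction from the UI behaviour, not Google’s actual pipeline - both functions are stand-ins):

```python
def generate(prompt: str) -> str:
    # Step 1: the LLM answers purely from its trained "memory" -
    # no search performed, no source websites referenced.
    return "..."

def google_search(query: str) -> list[str]:
    # Stand-in for a real Google search returning source URLs.
    return ["https://example.com"]

def answer_with_references(question: str) -> tuple[str, list[str]]:
    draft = generate(question)         # the generate-only step
    sources = google_search(question)  # the optional "double-check" step
    return draft, sources              # claims get checked against sources
```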
Interestingly, on the first, generate-only step (so no source websites referenced), it mentioned my site in the generated text directly (not sure how I feel about that). So Google’s LLM obviously “knows” about my website in its “brain” - the distinction here being that this was not added after googling search results, but came purely from the LLM’s “memory”, which shows that their LLM was trained using my website:
I then asked Google to add references (at this point, it googles for relevant websites and fact-checks the answer it has generated), and sure enough, it added in some references to my guide:
Next up, I asked Perplexity the exact same question, and as expected, it didn’t reference my site at all:
So far, this lines up with expectations - I wasn’t expecting Perplexity to have any knowledge of my site or to have it on its curated list. Further (seemingly) backing up my original assertion about curated sites: the main photo top right is from my site. So for images it is happy to crawl all sites (and has clearly crawled mine), but for generating answers and “notable sources” it’s far more picky.
2) Forget testing, let’s just ask it
Next up, I just went ahead and asked Perplexity whether it uses curated sites, and whilst it is answering based on other web-based knowledge (i.e. third-party, non-Perplexity websites), it did indeed confirm the behaviour:
So what?
So aside from me potentially being sad that Perplexity isn’t reading my little side-project food site, does it make any difference? Approved, notable sources make total sense for a research/answer tool, right?
The real issue here, and the reason I stopped and listened when he talked about “notable sources”, is that this is a risk to the internet and society in general (that sounds ever so dramatic, but I couldn’t think of a better way of saying it).
In the conversation, Perplexity continue to say they are an “answer engine”, not a search engine, and that they aren’t going up against Google. But the reality is that they are competing in the same space.
When you google something, have you ever expanded or just read an answer snippet that Google has shown directly on the page? Conversely, have you ever googled something, clicked the first result, seen it’s a long-form article without the answer immediately highlighted, and just hit the back button to look for a more accessible answer?
People are lazy, and the function of Google is to answer questions. People have a question; they google it. In this sense, Google is an “answer engine” too, just one where the user has to do more work. If Perplexity comes along as an “answer engine” that removes or reduces that work, then people will use it.
For the last two decades or so, Google has been most people’s gateway to the internet - their starting point and the controlling factor that leads them to webpages. Imagine that role passes to Perplexity, and for the next decade the vast majority of people use Perplexity to “google” stuff. All of a sudden, the internet is walled off - Perplexity has complete control over “notable” sites; it gets to choose who and what gets used, and where the data comes from.
Aside from the wider issue of what happens to the rest of the internet that isn’t “approved” (it becomes effectively inaccessible other than via direct links - but does anyone even want to visit content-based sites when you can have an answer tailor-made for your exact question?), there is the issue of a single, private organisation having control over what “information” we can access. Perplexity answers could come to be seen as “truth”, but with no transparency or understanding of what data is being considered in building that truth.
Could a curated list consider minority-held views, ideas or knowledge alongside more dominant mainstream ideas? How would it cater for equality and the balance of changing ideas? An easy example to consider is the growing popularity of the anti-vax movement: should anti-vaccination websites be on the curated list? I’m pro-vaccination, so it’d seem easy for me to say that they shouldn’t be. But by agreeing to that, I’m also implicitly agreeing that it’s OK for this company to make its own moral decisions on what should and shouldn’t be allowed.
This is just another step in the web neutrality conversation: should we be limiting ideas on the internet, and if so, who makes those decisions? Who governs what is and isn’t allowed (and should that be in the hands of big-tech companies)?