That's a fascinating claim, and it does not align with my anecdotal experience using the web for many years.
For fuel, Google results were 90% scams; for coffee machines, closer to 75%. The scams are fairly elaborate: they clone legitimate-looking sites, then offer very competitive prices -- between 50% and 75% of market prices -- which puts them at the top of the search results. It's only by looking in detail at the contact information that some things start to look off (one common tell is that they encourage bank transfers, since there's no buyer protection there, but that's not always the case).
75% of the market rate isn't a crazy "too good to be true" offer; it's within the realm of what a legitimate business could do, and with items priced in the thousands, any hooked victim is a good catch. One particular example was a website impersonating a massive discount appliance store chain in the Netherlands. It used a similar domain name, even though the site itself looked different, so any Google search tied it to the legitimate business.
You really have to apply a high level of scrutiny, or understand that Google is basically a scam registry.
why did you change the subject to scams?
edit: ok, I bothered to look this up: Microsoft had a guy do a study on Nigerian scams, the guys who wrote Freakonomics wrote a sequel referencing that study and drew absurd, unfounded conclusions, which have been repeated over and over. Business as usual for the fig-leaf salesmen.
Given that I'll often see the same fraudulent ad repeated, my anecdotal impression is that there are not many of them.
I can even talk to friends about the most boring fraudulent ads and they know them, e.g. the "Elon will double your bitcoin" scams.
For normal ads, unless they go viral, there are millions out there that are never repeated or never even seen.
Because fraud ads have short lifetimes before being pulled out of 'production traffic', you can collect many of them for the training data.
I assume 'clickbait' is the safety word for 'fraud'
> To find the most informative examples, we separately cluster examples labeled clickbait and examples labeled benign, which yields some overlapping clusters
How can you get overlapping clusters if the two sets of labelled examples are disjoint?
Typically, LLMs don't produce usable embeddings for clustering or retrieval, and embedding models trained with contrastive learning are used instead, but there seems to be no mention of any models other than LLMs.
I'm also curious about what type of clustering is used here.
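To make the question concrete, here's a rough sketch of one way "overlapping clusters" could come out of disjoint label sets: cluster each labeled set separately in a shared embedding space, then call a clickbait cluster and a benign cluster overlapping when their centroids land close together. Everything here (the sentence-transformers model, KMeans, the similarity threshold, the toy ad texts) is my own assumption, not something the paper states.

```python
# Sketch, not the paper's pipeline: embed ad texts with a contrastively trained
# sentence-embedding model, cluster the clickbait and benign sets SEPARATELY,
# then look for cluster pairs whose centroids sit close together. "Overlapping
# clusters" then means overlapping regions of embedding space, not shared examples.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedder, not from the paper
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

clickbait_ads = [
    "Elon is doubling your bitcoin, send now!",
    "Doctors hate this one weird trick",
    "Espresso machines 70% off, bank transfer only",
    "Claim your free crypto giveaway today",
]
benign_ads = [
    "Espresso machines on sale, free shipping",
    "New coffee grinder models now in stock",
    "Heating oil delivery, order online",
    "Compare fuel prices in your area",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any contrastive text embedder
emb_click = model.encode(clickbait_ads, normalize_embeddings=True)
emb_benign = model.encode(benign_ads, normalize_embeddings=True)

# Cluster each labeled set on its own, as the quoted passage describes.
km_click = KMeans(n_clusters=2, n_init=10, random_state=0).fit(emb_click)
km_benign = KMeans(n_clusters=2, n_init=10, random_state=0).fit(emb_benign)

# A clickbait cluster and a benign cluster "overlap" if their centroids are close.
# Those ambiguous regions (e.g. discounted coffee machines) would be the most
# informative examples to surface. The 0.5 threshold is arbitrary.
sims = cosine_similarity(km_click.cluster_centers_, km_benign.cluster_centers_)
for i, j in np.argwhere(sims > 0.5):
    print(f"clickbait cluster {i} overlaps benign cluster {j} (cos={sims[i, j]:.2f})")
```

Under that reading the "overlap" is geometric, so disjoint label sets aren't a contradiction, but the paper would still need to say which embeddings and which clustering algorithm it actually uses.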