You’re getting this email as a subscriber to the BrXnd Dispatch, a (roughly) weekly email at the intersection of brands and AI. 2024 conference planning is now in full swing. I’m looking for sponsors and speakers (including cool demos!). The initial round of conference invites went out to previous attendees. If you’re interested in attending, feel free to be in touch. As space is limited, I’m prioritizing attendees from brands first. If you work at an agency, just bring a client along. Ok … onto the newsletter.
First, a bit of business: If you received an early bird invite, you only have until the end of this week to buy your ticket before they’re opened up further. To those who didn’t get an early bird invite, I plan to open up tickets on Monday and make my first set of speaker announcements.
Second, today’s edition is a bit more technical than usual, but I suggest trying to read through it. While I’m aware many of you don’t write code, the concepts here are pretty universal, and I hope they speak to what these models are best at and the kind of work they open up.
One of my go-to AI use cases is data extraction. I first discovered this when I was building my Marketing x AI Landscape way back when and wanted to do price comparisons: I scraped each vendor’s pricing page and sent it through the GPT-3 API (at the time) to turn it into structured data. This has legitimately opened up a world of use cases for me. In fact, I do it often enough that I’ve built a separate product, which I might one day make available to others, whose sole purpose is to take a data schema and a URL and handle all the scraping and extraction.
When people talk about AI, they often focus on replacing work they’re already doing, but the thing about this particular use case for me is that it opens up a whole world of interesting projects that I would have just entirely skipped because getting at the data would have sucked.
One of those projects is building a product recommendation site based on my newsletter Why is this interesting? (WITI). If you’re not familiar with WITI, it goes out daily to about 25,000 subscribers, and we (Colin Nagy and I) bill it as aimed at the intellectually omnivorous. We have many guest posts, and after 1,500 emails, we have quite a catalog of ideas and product recommendations buried in each issue’s 750 words.
But I certainly don’t feel like digging through over one million words to get at the products inside, so I finally got around to building an extraction pipeline using AI (mostly GPT-3.5 and GPT-4). To me, this is a perfect use case: take some unstructured data and ask the model to structure it.
A big thank you to our first 2024 sponsors: Brandguard is the world’s #1 AI-powered brand governance platform, Focaldata is combining LLMs with qualitative research in fascinating ways, and Redscout is a strategy and design consultancy that partners with founders, CEOs, and CMOs at moments of inflection for their organizations. If you’re interested in sponsoring the 2024 event, please be in touch.
The project isn’t quite ready to launch yet, but I thought it might be interesting to talk through the process.
To start, I downloaded all posts from Substack and stuck them in a local database to work with them (I use Postgres, but anything will do). From there, I refined the prompt to extract the products.1 Over a year ago, I wrote about using TypeScript types to get valid JSON out of OpenAI, but things have come a long way since then.2 First off, OpenAI (and many others) have added a “JSON mode” that ensures you get (mostly) valid JSON back in your response. This is a massive relief if you’re parsing the LLM output with code, since it means you can throw away a whole bunch of code that existed just to validate and clean basic JSON. Second, I have found that JSON Schema, rather than TypeScript, lets me describe each field in more detail and give the model some extra context. A pared-down sketch, using just the fields the prompt below requires, looks something like this:
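```json
{
  "type": "object",
  "properties": {
    "products": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string", "description": "The name of the recommended product" },
          "creator": { "type": ["string", "null"], "description": "The author, director, or maker, if one is named" },
          "type": { "type": "string", "description": "The kind of product, e.g. book, film, gadget" },
          "description": { "type": "string", "description": "A short description of why the post recommends it" },
          "url": { "type": ["string", "null"], "description": "A link to the product, if the post includes one" }
        },
        "required": ["name", "creator", "type", "description", "url"]
      }
    }
  },
  "required": ["products"]
}
```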
This is pretty in the weeds, but if you write TypeScript, you can even describe your schema using Zod and then convert it to JSON Schema, which means you can also validate the response against it and make sure every field the AI returns meets your requirements.
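As a rough sketch of what that looks like (this uses the zod-to-json-schema package; the exact shape here is illustrative, not my production code):

```typescript
import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";

// Describe the shape once in Zod. The .describe() calls become the
// "description" fields the model sees in the JSON Schema.
const Product = z.object({
  name: z.string().describe("The name of the recommended product"),
  creator: z.string().nullable().describe("The author, director, or maker, if named"),
  type: z.string().describe("The kind of product, e.g. book, film, gadget"),
  description: z.string().describe("Why the post recommends it"),
  url: z.string().nullable().describe("A link to the product, if the post includes one"),
});

const Extraction = z.object({ products: z.array(Product) });

// One definition, two jobs: a schema string to interpolate into the prompt...
export const productSchema = JSON.stringify(zodToJsonSchema(Extraction), null, 2);

// ...and a validator for whatever the model sends back.
export function parseExtraction(raw: string) {
  return Extraction.parse(JSON.parse(raw)); // throws if a field is missing or mistyped
}
```

The nice part of this approach is that the schema in the prompt and the validator on the response come from the same definition, so they can never drift apart.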
What does this look like in practice? Here’s the system prompt I send to GPT-4 along with the HTML from the newsletter:
You will be given a post from the newsletter Why is this interesting? as HTML content. Your job is to extract any specific recommended products from the post. If the user recommends books by a specific author or films by a director, that’s not enough; only capture the specific book titles or films they reference. The goal is to only extract products that someone could buy, as we are building an affiliate site that will capture all of the newsletter’s recommendations. Do not include recommendations that are part of paid promotions or advertisements. Do not include recommended videos if they are not movies that can be purchased. Return valid JSON using the following JSON Schema and make sure to include all required fields (name, creator, type, description, url). Only return null for a field if the field is explicitly marked nullable. Do not return products that are only mentioned in links and don’t have descriptions. JSON Schema: ${productSchema}
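The call itself is unremarkable: a standard chat-completions request with JSON mode switched on. Here’s a minimal sketch using OpenAI’s Node SDK (the model name and helper are placeholders, not necessarily what I run):

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// systemPrompt is the prompt above with ${productSchema} interpolated;
// postHtml is the raw HTML of a single newsletter issue.
async function extractProducts(systemPrompt: string, postHtml: string): Promise<string> {
  const completion = await openai.chat.completions.create({
    model: "gpt-4-turbo-preview", // placeholder; any JSON-mode-capable model works
    response_format: { type: "json_object" }, // JSON mode: the response parses as JSON
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: postHtml },
    ],
  });
  return completion.choices[0].message.content ?? "{}";
}
```

The string that comes back can go straight into the Zod validator from the earlier sketch.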
The product schema gets inserted dynamically and describes the fields I want returned. There are a bunch more steps once I’m through this, but the pipeline has let me quickly run through over a million words of content and find hundreds of products, along with their context. The site isn’t quite ready for prime time, but here’s a sneak peek:
Turning unstructured data into structured data is a magical feature of these tools and one of the moments in my AI rabbit hole that really excited me. When I first got access to build a ChatGPT plugin last April, I was immediately struck by how the model’s ability to interpret and restructure the data sent back to it completely turned everything I knew about integrating software on its head. Here’s what I wrote at the time:
In that snippet, I am telling ChatGPT that when it sends me a request for YouTube tutorials, it should format that request query as keywords=YOUTUBE+SEARCH+QUERY. The AI then takes that definition and structures the input accordingly. That means it just magically figures out how to deliver data in whatever way the plugin requires. It’s hard to explain how mind-blowing this is to me. Instead of spending a couple of hours learning the rules for the platform I wanted to integrate with and bending my code to fit that system, the AI found the best way to do it for me. The magic is in these fuzzy borders between the systems, which allow for some very different ways of working.
This might not sound like a big deal if you’ve never integrated software, but it’s legitimately counterintuitive in the pure sense of the word: I have spent a career building an intuition around how to use APIs, and what OpenAI did ran literally opposite to that intuition. In every other case I’ve experienced, the big company tells you how to send them data and how they will send you data back, and if you don’t do either of those things exactly as described, something breaks.
In this, I think there are some real clues about where to find value:
Classification: One of the most apparent things these models are good at is classification. Almost any scraping task can be reframed as classification, insofar as the goal is to extract data and sort it into categories (a quick sketch follows this list).
Fuzzy Interface: AI can act as the bridge at any point in the process where you need to transform data from one format into another. That’s what I’m doing in going from raw HTML to products, and it’s what ChatGPT does when I send it data in whatever format I happen to have.
Net New Ideas: I’m not sure it’s net new so much as it opens up a world of ideas that either couldn’t or wouldn’t have existed before. When I first built my BrXnd Collabs project (which is finally working again, by the way), the whole idea was to generate things that could plausibly exist, in numbers that couldn’t possibly be produced by hand. Similarly, I did a project a while back on AI-generated dog breeds.
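To make the classification point concrete, here’s a sketch of the pattern (the categories and prompt are invented for illustration):

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Sort a scraped snippet into one of a few buckets, again leaning on
// JSON mode so the answer comes back machine-readable.
async function classify(text: string): Promise<string> {
  const completion = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    response_format: { type: "json_object" },
    messages: [
      {
        role: "system",
        content:
          'Classify the user\'s text as exactly one category: "product", "idea", or "other". ' +
          'Respond with JSON like {"category": "..."}.',
      },
      { role: "user", content: text },
    ],
  });
  return JSON.parse(completion.choices[0].message.content ?? "{}").category;
}
```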
That’s it for now. I hope you enjoyed this, and if it’s generally an area of interest for you, you should try to make it to my NYC event on May 8.
If you have any questions, please be in touch. If you are interested in sponsoring, reach out, and I’ll send you the details (we’ve already got a few sponsors on board).
Thanks for reading,
Noah
1. While I broadly think most of the conversation around prompt engineering for day-to-day tasks is nonsense, it’s crucial when you need consistent output in a development setting. I wrote about prompt engineering a few weeks ago.
2. For those unfamiliar, JSON is a format for sending structured data. TypeScript lets you annotate your JSON to describe precisely what structure you want back, whether a string, boolean, or number.