There is a flaw with the base problem: each tweet only has one label, while a tweet is often about many different things and can't be delineated so cleanly. Here's an alternate approach that allows for multiple labels and lowers the marginal cost (albeit with a higher upfront cost) of each tweet classified.
1. Curate a large representative subsample of tweets.
2. Feed all of them to an LLM in a single call with a prompt along the lines of "generate N unique labels and their descriptions for the tweets provided". This bounds the problem space.
3. For each tweet, feed it to an LLM along with the prompt "Here are labels and their corresponding descriptions: classify this tweet with up to X of those labels". This creates a synthetic dataset for training.
4. Encode each tweet as a vector as normal.
5. Then train a bespoke small model (e.g. an MLP) on the tweet embeddings to get a multilabel classifier that predicts, for each label, the probability that it applies.
The small MLP will be extremely fast and cost effectively nothing beyond what it takes to create the embedding. It also saves the time and cost of running a vector search or maintaining a live vector database.
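A rough sketch of step 5 (toy data standing in for the outputs of steps 1-4, and scikit-learn as one possible choice of small model):

    # Toy multilabel MLP on top of tweet embeddings; the random embeddings and the
    # hard-coded label lists below stand in for the real outputs of steps 1-4.
    import numpy as np
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(1000, 384))                             # stand-in tweet embeddings
    labels = ([["joke"], ["politics", "rant"], ["sports"]] * 334)[:1000]  # stand-in LLM labels

    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(labels)                      # (n_tweets, n_labels) 0/1 indicator matrix

    # scikit-learn's MLPClassifier accepts a multilabel indicator matrix directly
    clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=200)
    clf.fit(embeddings, Y)

    new_embedding = rng.normal(size=(1, 384))          # embedding of a new tweet
    probs = clf.predict_proba(new_embedding)[0]        # one probability per label
    print(sorted(zip(mlb.classes_, probs), key=lambda t: -t[1])[:3])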
vessenes
Arthur, question from your GitHub + essay:
On GitHub you show stats saying a "cache hit" takes 200ms and a miss takes 1-2s (an LLM call).
I don't think I understand how you get a cache hit off a novel tweet. My understanding is that you
1) get a snake case category from an LLM
2) embed that category
3) check if it's close to something else in the embedding space via cosine similarity
4) if it is, replace the original label with the closest one in embedding space
5) if not, store it
Is that the right sequence? If it is, it looks to me like all paths start with an LLM, and therefore are not likely to be <200ms. Do I have the sequence right?
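For concreteness, here's the cache step of that sequence as I read it, in plain numpy. The raw label and its embedding are assumed to come from prior LLM/embedding calls, which is exactly why every path would seem to pay that latency (names here are illustrative, not from the repo):

    # Illustrative sketch of the sequence described above, not necessarily the
    # author's actual pipeline. The label and its embedding come from prior
    # LLM + embedding calls that happen before this function is reached.
    import numpy as np

    def canonicalize(raw_label: str, label_vec: np.ndarray,
                     store: dict[str, np.ndarray], threshold: float = 0.8) -> str:
        """Return a near-duplicate existing label if one is close enough, else store the new one."""
        vec = label_vec / np.linalg.norm(label_vec)
        best, best_sim = None, -1.0
        for label, stored in store.items():              # cosine similarity against stored labels
            sim = float(vec @ stored)
            if sim > best_sim:
                best, best_sim = label, sim
        if best is not None and best_sim >= threshold:   # step 4: reuse the closest existing label
            return best
        store[raw_label] = vec                           # step 5: otherwise store the new label
        return raw_label

    # toy usage: two nearly identical label embeddings collapse to one canonical label
    store: dict[str, np.ndarray] = {}
    rng = np.random.default_rng(0)
    v = rng.normal(size=128)
    print(canonicalize("joke_about_ai", v, store))
    print(canonicalize("ai_joke", v + 0.01 * rng.normal(size=128), store))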
sethkim
An under-discussed superpower of LLMs is open-set labeling, which I sort of consider to be inverse classification. Instead of using a static set of pre-determined labels, you're using the LLM to find the semantic clusters within a corpus of unstructured data. It feels like "data mining" in the truest sense.
kgeist
I did something similar: made an LLM generate a list of "blockers" per transcribed customer call, calculated the blockers' embeddings, and clustered them.
The OP has 6k labels and discusses time + cost, but what I found is:
- a small, good enough locally hosted embedding model can be faster than OpenAI's embedding models (provided you have a fast GPU available), and it doesn't cost anything
- for just 6k labels you don't need Pinecone at all; with Python it took me like a couple of seconds to do all the calculations in memory
For classification + embedding you can use locally hosted models; it's not a particularly complex task that requires huge models or huge GPUs. If you plan to do such classification tasks regularly, you can make a one-time investment (buy a GPU) and then run as many experiments with your data as you like without having to think about costs anymore.
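For example, something like this keeps everything local and in memory (the model choice and the labels are just illustrative stand-ins):

    # Minimal sketch of the "no vector DB needed" point: embed ~6k labels with a
    # small local model and do the similarity math in memory with numpy.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")    # small locally hosted embedding model

    labels = [f"label_{i}" for i in range(6000)]       # stand-in for the real 6k labels
    label_vecs = model.encode(labels, normalize_embeddings=True)   # (6000, 384)

    def closest_label(text: str) -> tuple[str, float]:
        vec = model.encode([text], normalize_embeddings=True)[0]
        sims = label_vecs @ vec                        # cosine similarity (vectors are unit-norm)
        i = int(np.argmax(sims))
        return labels[i], float(sims[i])

    print(closest_label("complaining about airline delays"))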
witnessme
A simple word2vec embedding with continuous bag-of-words (CBOW) training is enough and beats the more complex solutions on both performance and cost.
Reference: https://blog.invidelabs.com/how-invide-analyzes-deep-work/
I used sentence transformers for clustering for a similar use case: https://huggingface.co/sentence-transformers
Dunno if this passes the bootstrapping test.
This is sensitive to the initial candidate set of labels that the LLM generates.
Meaning if you ran this a few times over the same corpus, you'd probably get different performance depending on the order in which you input the data and on the classification tags the LLM ultimately decided upon.
Here’s an idea that is order invariant: embed first, take samples from clusters, and ask the LLM to label the 5 or so samples you’ve taken. The clusters are serving as soft candidate labels and the LLM turns them into actual interpretable explicit labels.
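Roughly like this (toy data; the actual LLM call is left as a prompt you'd send with whatever client you use):

    # Sketch of the order-invariant variant: cluster tweet embeddings first, then
    # ask an LLM to name each cluster from a few exemplars. Data and names are
    # illustrative stand-ins.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    tweets = [f"tweet {i}" for i in range(1000)]       # stand-in corpus
    embeddings = rng.normal(size=(1000, 384))          # stand-in embeddings

    k = 50                                             # rough number of soft labels
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)

    for c in range(k):
        idx = np.flatnonzero(km.labels_ == c)
        # pick the 5 members closest to the centroid as exemplars
        dists = np.linalg.norm(embeddings[idx] - km.cluster_centers_[c], axis=1)
        exemplars = [tweets[i] for i in idx[np.argsort(dists)[:5]]]
        prompt = (
            "Here are 5 tweets from one cluster:\n"
            + "\n".join(f"- {t}" for t in exemplars)
            + "\nGive a short snake_case label and a one-line description for this cluster."
        )
        # send `prompt` to the LLM of your choice to get the cluster's explicit label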
pu_pe
What about accuracy? Maybe I'm missing something, but the crucial piece of information that's missing is whether the labels produced by both methods converge nicely. The fact that OP had >6000 categories using LLMs makes me wonder whether there was any validation at all, or whether the LLMs were just left to freestyle.
jawns
If you already have your categories defined, you might even be able to skip a step and just compare embeddings.
I wrote a categorization script that sorts customer-service calls into one of 10 categories. I wrote descriptions of each category, then translated each description into an embedding.
Then I created embeddings for the call notes and matched each to the closest category using cosine_similarity.
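A minimal sketch of that setup (the categories and the embedding model below are illustrative stand-ins):

    # Match call notes to category descriptions by embedding similarity.
    from sentence_transformers import SentenceTransformer
    from sklearn.metrics.pairwise import cosine_similarity

    model = SentenceTransformer("all-MiniLM-L6-v2")

    categories = {
        "billing_issue": "Questions or disputes about charges, invoices, or refunds.",
        "technical_problem": "The product is broken, erroring, or not working as expected.",
        "account_access": "Trouble logging in, resetting passwords, or locked accounts.",
    }
    cat_names = list(categories)
    cat_vecs = model.encode(list(categories.values()))

    def categorize(call_notes: str) -> str:
        note_vec = model.encode([call_notes])
        sims = cosine_similarity(note_vec, cat_vecs)[0]   # one similarity per category
        return cat_names[int(sims.argmax())]

    print(categorize("Customer was double charged on their last invoice"))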
rao-v
You could probably speed this up a lot by getting the token log probs and picking your category based on the highest-log-prob token that belongs to a category (you may need multiple steps if your categories share token prefixes).
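A toy illustration of the idea. It assumes you already have the top-token log probs for the first generated token from your LLM API; the tokens and numbers below are made up:

    # Given the top log probs for the first output token, pick the category whose
    # leading token scores highest. Purely illustrative data.
    categories = ["joke", "rant", "meme", "news"]

    # hypothetical token -> log prob map for the first generated token
    top_logprobs = {"jo": -0.3, "ra": -1.7, "me": -2.4, "ne": -3.1}

    def pick_category(top_logprobs: dict[str, float], categories: list[str]) -> str | None:
        best, best_lp = None, float("-inf")
        for token, lp in top_logprobs.items():
            for cat in categories:
                if cat.startswith(token) and lp > best_lp:
                    best, best_lp = cat, lp
        # If several categories share the winning prefix, you'd need another round
        # of log probs on the next token to disambiguate.
        return best

    print(pick_category(top_logprobs, categories))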
axpy906
Arthur’s classifier will only be as accurate as its retrieval. The approach depends on the retrieved candidates being the correct ones for classification to work.
dan_h
This is very similar to how I've approached classifying RSS articles by topic on my personal project[1]. However to generate the embedding vector for each topic, I take the average vector of the top N articles tagged with that topic when sorted by similarity to the topic vector itself. Since I only consider topics created in the last few months, it helps adjust topics to account for semantic changes over time. It also helps with flagging topics that are "too similar" and merging them when clusters sufficiently overlap.
There's certainly more tweaking that needs to be done but I've been pretty happy with the results so far.
1: jesterengine.com
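A rough sketch of that refresh-and-merge idea (names, thresholds, and data here are illustrative, not the project's actual code):

    # A topic vector becomes the mean of its top-N most similar recent articles,
    # and highly overlapping topics get flagged for merging.
    import numpy as np

    def refresh_topic(topic_vec: np.ndarray, article_vecs: np.ndarray, n: int = 20) -> np.ndarray:
        # article_vecs: unit-normalized embeddings of recent articles tagged with this topic
        sims = article_vecs @ topic_vec
        top = article_vecs[np.argsort(sims)[-n:]]       # top-N most similar articles
        new_vec = top.mean(axis=0)
        return new_vec / np.linalg.norm(new_vec)

    def merge_candidates(topic_vecs: dict[str, np.ndarray], threshold: float = 0.9):
        names = list(topic_vecs)
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                if float(topic_vecs[a] @ topic_vecs[b]) >= threshold:
                    yield a, b                          # "too similar": consider merging

    # toy usage with random unit vectors
    rng = np.random.default_rng(0)
    arts = rng.normal(size=(100, 64))
    arts /= np.linalg.norm(arts, axis=1, keepdims=True)
    topics = {"topic_a": refresh_topic(arts[0], arts), "topic_b": refresh_topic(arts[1], arts)}
    print(list(merge_candidates(topics)))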
nreece
Am I understanding it right that for each new text (tweet) you generate its embedding first, try to match it against the existing embeddings of all other texts (full text or bag of words), and then send the text to the LLM for tag classification only if no match is found, otherwise assigning it the same tag as the matched text?
Would it be any better to send a list of existing tags with each new text to the LLM and ask it to classify into one of them or generate a new tag? Possibly even skipping embeddings and vector search altogether.
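For example, the prompt could look something like this (the tags and helper are illustrative):

    # Give the LLM the existing tags and let it either reuse one or mint a new tag.
    existing_tags = ["complain_about_political_party", "make_joke_about_zuckerberg_rebranding"]

    def build_prompt(tweet: str, tags: list[str]) -> str:
        return (
            "Here are the existing tags:\n"
            + "\n".join(f"- {t}" for t in tags)
            + "\n\nClassify the following tweet into exactly one of these tags. "
            "Only if none fit, respond with a new snake_case tag instead.\n\n"
            f"Tweet: {tweet}"
        )

    prompt = build_prompt("Just spent 3 hours at the DMV, great use of my day", existing_tags)
    # send `prompt` to your LLM; append the response to existing_tags if it is new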
kpw94
Nice!
So the cache check tries to find whether a previously stored text embedding has a >0.8 similarity with the current text.
If you get a cache hit here, iiuc, you return that matched text's label right away. But do you also insert an embedding of the current text into the text embeddings table? Or do you only insert it in the case of a cache miss?
From reading the GitHub readme it seems you only "store text embedding for future lookups" in the case of a cache miss. Is this by design, to keep the text embedding table from growing too big?
deepsquirrelnet
I think a less order-biased, more straightforward way would be to just vectorize everything, perform clustering, and then label the clusters with the LLM.
ur-whale
PSA: DSU means Disjoint Set Union, and PSA means Public Service Announcement.
TZubiri
"Read the following tweet and provide a classification string to categorize it.
Your class label should be between 30 and 60 characters and be precise in snake_case format. For example:
- complain_about_political_party
- make_joke_about_zuckerberg_rebranding
Now, classify this tweet: {{tweet}}"
I stopped reading here. It's a bit obvious that you need to define your classification schema beforehand, not on a per-message basis. And if you do, you need a way to remember your schema. Otherwise, of course, you will generate an inconsistent and non-orthogonal set of labels. I expected the next paragraphs to immediately fix this with something like
"Classify the tweet into one of: joke, rant, meme..." but instead the post went on intellectualizing with math? It's like a chess player hanging a queen and then going on about bishop pairs and the London System.