You probably don't need expensive LLMs to map data into different formats

Semantically map JSON objects using key-level vector embeddings

Jul 09, 2024

Because language models are great at understanding context, they have proven particularly useful for reformatting data: you have source data in one format, but you need it in another. This is a very common scenario, whether you're integrating with a new API or aggregating or standardizing data from multiple sources, and it’s easy to default to an LLM to solve it for you.

But when you're processing millions of data points, the latency and inference costs of running every input and output object through an LLM add up incredibly fast.

At Rectangle, we discovered that you can use key-level vector embeddings to map arbitrarily structured JSON objects, eliminating the need for constant LLM API calls. By using the semantic similarity captured in pre-computed embeddings to match keys between objects, you get the intelligence of semantic mapping without all the overhead.

Key-level semantic mappings match the accuracy of the best LLMs (since both rely on the same embeddings to build understanding), are about 50 times cheaper than inference on a per-token basis, use far fewer tokens (because you’re only sending keys instead of full data objects), and are practically instant.

To semantically map an input object to an output structure, you’d roughly perform the following steps:

Flatten both the source and target objects, preserving the nested structure information in the keys. This way, the semantic information in something like bio.lastName vs. profile.username is preserved more accurately.
Calculate embeddings for the flattened keys of both objects using the specified embedding provider and model.
For each key in the flattened source object, find the most semantically similar key in the flattened target object using the cosine similarity of their embeddings.
If the similarity score is above the specified threshold, map the value from the source object to the corresponding key in the target object.
Finally, unflatten the results using the dot-notation keys, restoring the nested structure of the target object.
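
The steps above can be sketched in a few dozen lines of TypeScript. This is a minimal illustration, not Shapeshift's actual implementation: the function names (flatten, unflatten, cosine, mapBySimilarity) are made up for this example, arrays are treated as leaf values, and the embedding function is a hand-written lookup table standing in for a real embedding provider, which would normally be called asynchronously and in batches.

```typescript
type Json = string | number | boolean | null | Json[] | JsonObject;
type JsonObject = { [key: string]: Json };
type Flat = Record<string, Json>;
type Embed = (key: string) => number[]; // assumption: synchronous stand-in for a provider call

// Flatten nested objects into dot-notation keys, e.g. { bio: { lastName } } -> "bio.lastName".
function flatten(obj: JsonObject, prefix = ""): Flat {
  const out: Flat = {};
  for (const [key, value] of Object.entries(obj)) {
    const path = prefix ? `${prefix}.${key}` : key;
    if (value !== null && typeof value === "object" && !Array.isArray(value)) {
      Object.assign(out, flatten(value, path));
    } else {
      out[path] = value; // primitives and arrays are kept as leaves
    }
  }
  return out;
}

// Rebuild the nested structure from dot-notation keys.
function unflatten(flat: Flat): JsonObject {
  const out: JsonObject = {};
  for (const [path, value] of Object.entries(flat)) {
    const parts = path.split(".");
    const leaf = parts.pop() as string;
    let node: JsonObject = out;
    for (const part of parts) {
      const next = node[part];
      if (next === null || typeof next !== "object" || Array.isArray(next)) {
        node[part] = {};
      }
      node = node[part] as JsonObject;
    }
    node[leaf] = value;
  }
  return out;
}

// Cosine similarity of two equal-length vectors; returns 0 if either has zero length.
function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// For each source key, pick the most similar target key and copy the value
// only if the similarity score is above the threshold.
function mapBySimilarity(source: JsonObject, target: JsonObject, embed: Embed, threshold: number): JsonObject {
  const targetKeys = Object.keys(flatten(target));
  const targetVectors = targetKeys.map(embed);
  const result: Flat = {};
  for (const [sourceKey, value] of Object.entries(flatten(source))) {
    const sourceVector = embed(sourceKey);
    let bestIndex = -1;
    let bestScore = -Infinity;
    targetVectors.forEach((vector, i) => {
      const score = cosine(sourceVector, vector);
      if (score > bestScore) { bestScore = score; bestIndex = i; }
    });
    if (bestIndex >= 0 && bestScore > threshold) {
      result[targetKeys[bestIndex]] = value;
    }
  }
  return unflatten(result);
}

// Made-up 3-dimensional vectors, for illustration only; real embeddings come from a model.
const toyVectors: Record<string, number[]> = {
  "bio.firstName": [1, 0, 0],
  "bio.lastName": [0, 1, 0],
  "internalId": [0, 0, 1],
  "profile.givenName": [0.9, 0.1, 0],
  "profile.surname": [0.1, 0.9, 0],
};
const toyEmbed: Embed = (key) => toyVectors[key] ?? [0, 0, 0];

const source: JsonObject = { bio: { firstName: "Ada", lastName: "Lovelace" }, internalId: 42 };
const target: JsonObject = { profile: { givenName: "", surname: "" } };
const mapped = mapBySimilarity(source, target, toyEmbed, 0.8);
console.log(JSON.stringify(mapped));
// prints {"profile":{"givenName":"Ada","surname":"Lovelace"}}
```

In this sketch, internalId has no sufficiently similar target key, so it is dropped instead of being forced into the output. If two source keys were to pick the same target key, the later one would overwrite the earlier one; a production version would need a rule for resolving such collisions.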

Shapeshift Library

We’ve built and open-sourced our Shapeshift library to semantically convert JSON objects to the output formats we need and reduce our LLM overhead.

Shapeshift is available as an MIT-licensed TypeScript library through the npm registry and on GitHub.

npm install @rectanglehq/shapeshift

https://github.com/rectanglehq/Shapeshift

We’re hiring

If you’re interested in solving some of the largest data, communication, and collaboration problems in the supply chain, you should consider joining us! We’re actively hiring software engineers. 👉 join@rectanglehq.com

Marvin’s Substack
