Oleg Tereshin

Senior Software Engineer,

Independent Software Engineer

ABOUT THE SPEAKER:

I am a Senior Software Engineer with 12 years of experience, specializing in AI Infrastructure and Semantic Search. I am currently building a semantic search platform, managing high-throughput embedding pipelines, and orchestrating vector databases on Kubernetes. As an author on Towards Data Science, I share practical, empirical strategies for vector search optimization.

TALK TITLE:

Optimizing Vector Search: Why You Should Flatten Structured Data. An Analysis of How Flattening Structured Data Can Boost Precision and Recall by Up to 20%

TRACK:

Technical / Engineering Talks

SUB TOPIC:

Data Engineering / RAG Pipelines – Search / Recommendation Systems

ABSTRACT:

When integrating structured data into a RAG system, engineers often default to embedding raw JSON directly into a vector database. This intuitive approach, however, leads to markedly worse retrieval performance. Many modern embedding models build on BERT-style architectures optimized for natural language, and they struggle with the high density of non-alphanumeric characters found in JSON syntax.
In this session, I will break down exactly how embedding structured data fails, from tokenization and disrupted attention to the mathematical liability of mean pooling over syntax tokens. I will then demonstrate a practical, production-ready solution: a simple preprocessing step that converts structured JSON into natural-language templates. Backed by empirical testing on the Amazon ESCI dataset, I will show how this straightforward change to the pipeline boosts Recall@10 by over 19% and MRR by 27%.
Note: The article this talk is based on was recently featured as a top weekly article on Towards Data Science, with over 9,000 views.
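
To make the preprocessing idea concrete, here is a minimal sketch of flattening a JSON record into a natural-language template. It is an illustration under stated assumptions, not the exact template used in the talk: the field names (title, brand, color), the sentence wording, and the helper `symbol_ratio` are all hypothetical. The ratio is only a rough proxy for how much JSON syntax the embedding model has to process.

```python
import json


def flatten_product(record: dict) -> str:
    """Render a product record as short natural-language text.

    The field names (title, brand, color) are hypothetical; adapt the
    template to whatever schema your data actually uses.
    """
    sentences = []
    title = record.get("title")
    brand = record.get("brand")
    if title and brand:
        sentences.append(f"{title} by {brand}.")
    elif title:
        sentences.append(f"{title}.")
    color = record.get("color")
    if color:
        sentences.append(f"Available in {color}.")
    return " ".join(sentences)


def symbol_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are not alphanumeric."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(not c.isalnum() for c in chars) / len(chars)


# Illustrative record, not taken from any real dataset.
record = {"title": "Trail Running Shoes", "brand": "Acme", "color": "blue"}
raw_text = json.dumps(record)          # what "embedding raw JSON" would see
flat_text = flatten_product(record)    # what the flattened pipeline would see

print(f"raw:  {raw_text}  (symbol ratio {symbol_ratio(raw_text):.2f})")
print(f"flat: {flat_text}  (symbol ratio {symbol_ratio(flat_text):.2f})")
```

In a real pipeline, the string returned by `flatten_product` would be passed to the embedding model in place of the serialized JSON; the original record can still be stored as metadata alongside the vector.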

WHAT YOU’LL LEARN:

The analysis confirms that embedding raw structured data into a general-purpose vector space is suboptimal, and that adding a simple preprocessing step to flatten structured data consistently delivers significant improvements in retrieval metrics (boosting recall@k and precision@k by about 20%). The main takeaway for engineers building RAG systems is that effective data preparation is essential to getting peak performance from a semantic retrieval system.
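
For readers less familiar with the metrics named above, the following sketch shows how recall@k, precision@k, and MRR are commonly computed from a ranked result list. It uses the standard textbook definitions; the function names and the document IDs are illustrative assumptions, and the talk's own evaluation code may differ in details such as how ties or empty relevance sets are handled.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of all relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def precision_at_k(ranked_ids, relevant_ids, k):
    """Share of the top k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / k


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


# Illustrative IDs only; not drawn from the Amazon ESCI dataset.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}
print(recall_at_k(ranked, relevant, k=3))
print(precision_at_k(ranked, relevant, k=3))
print(reciprocal_rank(ranked, relevant))
```

Running the same functions over the results of a raw-JSON index and a flattened-text index, using identical queries and relevance labels, gives a like-for-like comparison of the two approaches.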
