
Helix Tech Stack

Helix builds dashboards from your data

Helix is capable of creating standard line/bar/scatter/map charts as well as diverse charts like 3D plots, heatmaps, distributions, cumulative distributions, treemaps, violin charts, etc.

As well as the visualizations, Helix is also a capable ETL and data engineer that can project your data in the right way for a given query.

Helix can do group-bys, aggregations (mode, min, max, mean, percentile, etc.) and filtering, and can enrich data with advanced AI functions like sentiment analysis.
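
As a rough illustration (the column names here are invented, not from a real dataset), the kind of pandas projection this covers might look like:

    import pandas as pd

    # Hypothetical sales data; the column names are illustrative only.
    df = pd.DataFrame({
        "region": ["EU", "EU", "US", "US", "US"],
        "order_value": [120.0, 80.0, 200.0, 50.0, 90.0],
        "refunded": [False, True, False, False, True],
    })

    # Filter out refunds, group by region and aggregate order value.
    summary = (
        df[~df["refunded"]]
        .groupby("region")["order_value"]
        .agg(["mean", "max"])
    )
    print(summary)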

Sweet Words

[Word cloud image: NLP keywords from our tech stack]

Here is a word cloud of some buzzwords in our tech stack, generated by Helix.

Important New AI Concepts

Everything is Fuzzy

Understanding and executing arbitrary transformations and creating arbitrary visualizations is AI-hard, which means it's basically just plain hard.

With AI we pass in a fuzzy search query and edits from a human; on the output side we use fuzzy, traditional and NLP approaches to interpret the AI's solution into a concrete action plan.

For example, if we ask for a bar chart of the number of people who like apples, OpenAI may respond with a data query like:

count() where liked = "Apples"

In this example "liked" may not be a real column name, so we send the column names to OpenAI as hints, but we also need fuzzy logic to interpret the output when it's slightly wrong.

"Apples" might not be exactly the data you're searching for, so we also search for something very like Apples, such as "apples" or "apple".
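
A minimal sketch of that kind of fuzzy correction, here using Python's standard difflib (the real pipeline also layers in stemming and synonym expansion, described later):

    import difflib

    def fuzzy_match(candidate, real_values, cutoff=0.6):
        """Map an AI-suggested column name or value onto the closest real one, if any."""
        lowered = [v.lower() for v in real_values]
        matches = difflib.get_close_matches(candidate.lower(), lowered, n=1, cutoff=cutoff)
        return real_values[lowered.index(matches[0])] if matches else None

    columns = ["name", "likes", "age"]
    print(fuzzy_match("liked", columns))                            # -> "likes"
    print(fuzzy_match("Apples", ["apples", "oranges", "bananas"]))  # -> "apples"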

Prepare For Exponential Improvement

With conversational AI, the data given to Helix is better than what traditional dashboarding tools receive: users now tell Helix their job to be done directly, e.g. summarize this, show the correlation between this and this, calculate total profit, etc.

Given data about what people are really trying to do, adapting the tool in directions useful for real humans becomes possible. Now it's clear users don't need another button, a push notification, or an advanced admin panel; they would like to summarize data, or see advanced insights, by asking for them specifically.

AI Improvement Systems Concepts

These systems should not get in your users' way, whilst still being aligned to their needs.

They should be flexible enough to let you fix bugs either with traditional programming or by training AI on examples.

Data and improvement flywheels should have as much automation and visibility as possible.

Scale testing, data improvement and AI beyond what you think is reasonable.

Automate data augmentation/collection - we generate queries as well as test them to drastically increase the surface area of testing and self-improvement.

However, keep data augmentation and training data a reasonable and balanced representation of your real users' jobs to be done (we transform our queries to be centered around users' needs).
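
A much-simplified sketch of schema-driven query augmentation (the templates are invented for illustration; the real generator is AI-driven and weighted toward the jobs-to-be-done we actually observe):

    import itertools

    # Hypothetical schema; in practice this comes from the loaded dataset.
    schema = {"numeric": ["order_value", "profit"], "categorical": ["region", "product"]}

    templates = [
        "show total {num} by {cat}",
        "bar chart of average {num} per {cat}",
        "top 10 {cat} by {num}",
    ]

    # Every template/column combination becomes a synthetic test query.
    synthetic_queries = [
        template.format(num=num, cat=cat)
        for template, num, cat in itertools.product(
            templates, schema["numeric"], schema["categorical"]
        )
    ]

    print(len(synthetic_queries))   # 12 queries from 3 templates x 2 x 2 columns
    print(synthetic_queries[0])     # "show total order_value by region"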

Automated improvement - Audited by humans

For training networks we use WandB to check on training progress.

For data and code we use a combination of human and AI auditing to ensure the data is correct.

Helix has access to a growing library of data transformations to which it is itself a contributor.

AI Improvement in future needs Humans less and less

Our system generates and tests its own transformation modules, generates and tests itself (by generating data and queries similar to real user data), and in the near future will expose its internal thought process so users can learn how to converse with it directly, effectively fixing issues on their own.

Bitter lessons

As computational power grows, AI becomes more powerful, and so do its capabilities.

This has called for fewer specific solutions and more solutions that can instead be optimized end to end with gradient descent and computational power.

Now it's easier to flip traditional programming on its head and lean more on existing AI capabilities from the start. We add constraints and fuzzy error correction for values going into and out of the AI core, but it's best to give the AI a larger sandbox; future, more powerful AI will grow into it.

This is called bitter because a lot of smart code that solves specific issues with specific problem domain expertise becomes less useful than originally thought. Now more than ever we need to paint with broad strokes and design self-improvement systems that use a lot of compute to solve bugs while you are sleeping.

Pipeline for data editing

The overall pipeline consists of the data editing pipeline first, then the charting pipeline.

Within data editing we need to cleanse any bad data, generate new columns/extra information, handle potential grouping (which then requires an aggregation over any grouped data) and sorting.

Initial Query

We have algorithms for determining the data source and chart type (both can also be manually specified in the query to force a specific dataset or chart type to be used).

Fairly basic query matching is used to determine the chart type, and if that fails we fall back on AI from Text-Generator.io and small specialized models. An AI-heavy approach allows good chart-type choices, as chart types mostly aren't specified by users.
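
The basic query matching is essentially keyword-driven; a toy version of it (the keyword lists are invented for illustration):

    CHART_KEYWORDS = {
        "bar": ["bar", "count by", "per category"],
        "line": ["over time", "trend", "timeseries", "line"],
        "scatter": ["scatter", "correlation", "versus", " vs "],
        "map": ["map", "by country", "by state", "choropleth"],
    }

    def match_chart_type(query):
        """Return a chart type if an obvious keyword is present, else None to trigger the AI fallback."""
        q = query.lower()
        for chart_type, keywords in CHART_KEYWORDS.items():
            if any(k in q for k in keywords):
                return chart_type
        return None  # fall back to Text-Generator.io / small specialized models

    print(match_chart_type("profit trend over time"))   # -> "line"
    print(match_chart_type("how are sales doing?"))     # -> None (AI fallback)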

The datasets are all loaded into memory, and stemming, synonym expansion and TF-IDF search are run over a summary of the data, the filenames and the schema to determine the best dataset for the query. If this fails we fall back on AI from ChatGPT to choose, given the files/schema/query.
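
A condensed sketch of that dataset search using scikit-learn's TF-IDF (the production version adds stemming and synonym expansion, and the fallback threshold below is illustrative, not a tuned value):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # One summary string per dataset: filename + schema + a short data summary.
    dataset_summaries = {
        "sales_2023.csv": "sales_2023.csv region product order_value refunded date",
        "hr_survey.csv": "hr_survey.csv employee team satisfaction customer_feedback",
    }

    def pick_dataset(query, summaries, min_score=0.05):
        names = list(summaries)
        vectorizer = TfidfVectorizer()
        matrix = vectorizer.fit_transform([summaries[n] for n in names] + [query])
        scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
        best = scores.argmax()
        # Below the threshold we would fall back to asking ChatGPT to choose.
        return names[best] if scores[best] >= min_score else None

    print(pick_dataset("total order value by region", dataset_summaries))  # -> "sales_2023.csv"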

The Pandas query itself is normally generated via ChatGPT, with a fallback to Edit Code davinci/Text-Generator.io, and is given the data schema/query in a Python/pandas code format.

We first use traditional stemming, synonym expansion and TF-IDF search principles to find how AI output best maps to real data columns or searches for real values.

The query is executed in a sandbox with access to many Python scientific computing and data transformation libraries.
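
A heavily simplified sketch of the sandbox idea (a real sandbox also needs process isolation, import restrictions and resource limits; this only shows the namespace scoping):

    import pandas as pd
    import numpy as np

    def run_generated_code(code, df):
        """Execute AI-generated pandas code against a dataframe in a scoped namespace."""
        namespace = {"pd": pd, "np": np, "df": df.copy()}  # only expose what the code should touch
        exec(compile(code, "<generated>", "exec"), namespace)
        return namespace["df"]  # convention: the generated code mutates or reassigns `df`

    df = pd.DataFrame({"liked": ["apples", "apple", "oranges"]})
    generated = 'df = df[df["liked"].str.contains("apple")].groupby("liked").size().reset_index(name="count")'
    print(run_generated_code(generated, df))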

Further user edits to the data are executed one line after another using a similar process.

Data manipulation

At a high level we ask ChatGPT/GPT-4/Text-Generator.io to generate Pandas code to transform our data/schema, similar to a LangChain DataFrame Agent.

Our code sandbox includes libraries for clustering, like scipy, and functions for natural language understanding, e.g. to support adding new fields:

add a field for sentiment based on the customer_feedback field
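
For a request like that, the generated pandas line mostly just calls a helper exposed by the sandbox; `sentiment_score` below is a hypothetical stand-in for whichever NLP/AI function actually backs it:

    import pandas as pd

    def sentiment_score(text):
        # Hypothetical stand-in: the real sandbox would call a proper NLP/AI model here.
        positive, negative = {"great", "love", "fast"}, {"slow", "broken", "bad"}
        words = set(text.lower().split())
        return len(words & positive) - len(words & negative)

    df = pd.DataFrame({"customer_feedback": ["love it, fast delivery", "arrived broken and slow"]})

    # The kind of one-liner the AI generates for "add a field for sentiment":
    df["sentiment"] = df["customer_feedback"].apply(sentiment_score)
    print(df)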

We use fuzzy logic to ensure pandas column names match real column names and that imports work. We have a syntax-fixing module that rewrites code so it compiles (fixing common errors like missing commas/brackets/braces/spacing), we retry with context on any errors when we think the AI could correct itself, and finally we retry over the entire process.
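
A stripped-down sketch of the correct-and-retry loop (here the fix is delegated back to the model with error context, rather than to the dedicated syntax-rewriting module):

    def fix_and_run(generate_code, run_code, query, df, max_attempts=3):
        """generate_code(query, error) calls the LLM; run_code(code, df) is the sandbox runner."""
        error_context = None
        for _ in range(max_attempts):
            code = generate_code(query, error_context)
            try:
                compile(code, "<generated>", "exec")   # cheap syntax check before executing
            except SyntaxError as e:
                error_context = f"SyntaxError: {e}"    # fed back into the next prompt
                continue
            try:
                return run_code(code, df)
            except Exception as e:                     # runtime errors also go back to the model
                error_context = f"{type(e).__name__}: {e}"
        raise RuntimeError("AI could not produce working code for this query")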

Pipeline for graph creation

The charting and dashboard building functions extend the capability of the first data transformation pipeline to make the data suitable for charting, and mix best practice human and AI knowledge of visualizations.

Defaults

We have some default starting points for best-practice data visualization, such as reasonable fonts, spacing and size markers. These can be overridden.
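
For flavour, such defaults boil down to a layout dictionary layered under whatever the AI or the user asks for (the values here are illustrative, not our production settings):

    import plotly.graph_objects as go

    # Illustrative starting-point defaults; real values are tuned per chart type and overridable.
    DEFAULT_LAYOUT = dict(
        font=dict(family="Inter, sans-serif", size=14),
        margin=dict(l=40, r=20, t=50, b=40),
        legend=dict(orientation="h", y=-0.2),
        template="plotly_white",
    )

    fig = go.Figure(go.Bar(x=["apples", "oranges"], y=[3, 5]))
    fig.update_layout(**DEFAULT_LAYOUT)   # AI/edit output is layered on top of these defaults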

AI charting

AI generates the charting code, using a combination of OpenAI and Text-Generator.io

Most charts are created using plotly code that's executed with plotly express. This way we can send to the frontend only the data in a pandas dataframe that's actually required, and also support a Discord/Slack bot that can communicate charts as images over those limited APIs.
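
For the apples example above, the generated plotly express code can be as small as this (column names illustrative):

    import pandas as pd
    import plotly.express as px

    # After the data pipeline has already filtered and grouped the dataframe:
    df = pd.DataFrame({"liked": ["apples", "apple", "oranges"], "count": [12, 3, 7]})

    # The kind of code the AI emits; only this small dataframe is sent to the frontend,
    # and the figure can also be exported as an image for the Discord/Slack bots.
    fig = px.bar(df, x="liked", y="count", title="People who like apples")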

If generating and error-correcting Python code for charting doesn't work, we fall back to asking the AI to request charts in JSON, which are applied to the defaults as diffs.
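
A sketch of that JSON fallback: the model returns only the keys it wants to change, and we deep-merge them over the defaults:

    def apply_chart_diff(defaults, diff):
        """Recursively overlay an AI-produced JSON chart spec onto the default spec."""
        merged = dict(defaults)
        for key, value in diff.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = apply_chart_diff(merged[key], value)
            else:
                merged[key] = value
        return merged

    defaults = {"type": "bar", "layout": {"font": {"size": 14}, "template": "plotly_white"}}
    ai_diff = {"type": "scatter", "layout": {"font": {"size": 18}}}
    print(apply_chart_diff(defaults, ai_diff))
    # {'type': 'scatter', 'layout': {'font': {'size': 18}, 'template': 'plotly_white'}}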

If the entire process fails we retry after rewriting the customer query to make more sense in the context of data transformations and visualizations, e.g. mapping the original query into a more formal language and retrying.

After the AI chart and settings have been created given the schema, we move on to specific optimizations per chart type.

Per-chart optimizations include upgrading choropleth maps into mapbox maps, fixing violin charts to visualize distributions more clearly on skewed datasets, and coloring maps using a linear colorscale that works well for skewed data.
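
In code this stage is roughly a dispatch on chart type; a simplified sketch (the skew test and colorscale capping are placeholders for what we actually compute):

    import numpy as np

    def optimize_chart(fig, chart_type, values):
        """Apply chart-type specific tweaks after the AI has produced a working figure."""
        values = np.asarray(values, dtype=float)
        skewed = values.size > 0 and values.max() > 10 * max(float(np.median(values)), 1e-9)
        if chart_type == "choropleth" and skewed:
            # Cap the colorscale so a few extreme regions don't wash out the rest.
            fig.update_layout(coloraxis_cmax=float(np.percentile(values, 95)))
        elif chart_type == "violin":
            # Show the underlying points so skewed distributions stay readable.
            fig.update_traces(points="all", meanline_visible=True)
        return fig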

In the future this may be expanded with a Chromium renderer to render, as images, the kinds of advanced D3 charts needed for things like word clouds or deep scatter plots.

More Defaults! "Concrete Defaults"

Preferable layouts and fonts are added on top of the original query.

Layering in Edit Queries from the user

One by one we formulate all the edits using a combination of OpenAI and Text-Generator.io to generate charting code.

This turns into a few edit queries to the OpenAI edit API, prompted to make edits to a plotly chart given the chart layout/data schema and some chart-editing code we generate to look similar (the AI fixes our code to infill, for example, the colors). This then falls back to new code generation, as the OpenAI edit API can return its input unchanged as output if it fails to edit.

This allows users to edit:

  • Sizing
  • Fonts
  • Layout
  • X or Y Axis
  • Color
  • Much more

For editing we construct our faux update code with the existing layout/x-axis/y-axis and graph style, then use the OpenAI edit API to infill the user changes, retrying with ChatGPT when edits fail.
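
A sketch of how that faux update code is assembled before being sent for infilling (prompt wording simplified; in production an unchanged response from the edit API triggers the ChatGPT fallback):

    import plotly.graph_objects as go

    def build_edit_template(fig, user_edit):
        """Serialize the current chart state into editable code for the model to infill."""
        layout = fig.layout.to_plotly_json()          # existing layout / x-axis / y-axis / style
        faux_code = (
            f"fig.update_layout(**{layout!r})\n"
            "# apply the user's requested change below\n"
            "fig.update_layout()\n"
        )
        instruction = f"Edit this plotly chart code so that it: {user_edit}"
        return faux_code, instruction

    fig = go.Figure(go.Bar(x=["a", "b"], y=[1, 2]), layout=dict(font=dict(size=14)))
    code, instruction = build_edit_template(fig, "make the font bigger and the bars green")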

More Defaults again "Edit Concrete Defaults"

We realized we should give AI a longer leash and allow further user edits to override some defaults we originally thought would always be best, like fonts/layout/flipping graphs.

A few things, however, can't be changed even by edit mode and are final, like having responsive, mobile-friendly charts.

Plug

Helix is an AI Agent for understanding data, making charts and building dashboards.

Managing your data and insights has now gone from many teams taking days to asking questions and getting answers in seconds.

Try it yourself: