Why should you read this article?
- A use case for Streamr
- An ELT flow: extract, load, and transform data, then visualize it
- Thoughts on real-time BI and decentralization
Our team works mainly on data and how to gain insight from it. While doing cryptocurrency research, I found some problems:
- There is a lot of data out there, but gathering private data takes time
- There are some BI tools out there, but you are limited to the data they provide (Glassnode, dune.xyz, Messari, IntoTheBlock,...)
- They don't support real time. I believe that in cryptocurrency, as in many other industries, being a step ahead brings a lot of benefits
You could say that if we need a quick result, we can use Glassnode or Dune.xyz. But our team wants to build a tool where everyone can get the data they want, with little or no technical knowledge, and the dashboard must be in REAL-TIME.
Then we came up with the expected data flow
But this is only a starting point and we don't want to build everything from scratch, so we are going to build a really simple flow
What is Prefect? Why is it needed here?
https://www.prefect.io/ - The easiest way to build, run, and monitor data pipelines at scale.
After trying many ETL tools, we think Prefect is a perfect fit: it is clean, easy to write, and has a really good community, and we can also scale the infrastructure easily (they are building a serverless Agent in the future)
Why not send data directly to Streamr?
I want to, but Streamr doesn't have a Python SDK yet and we don't have much time to build one. I still think a Python client is really important, since most data tools are written in Python
What is Airbyte? Why is it needed here?
Another open-source data ingestion tool. It is the fastest-growing data ingestion tool and supports many data sources and destinations
Catalog of Data Integration Connectors | Airbyte
Here's the list of data sources and destinations that Airbyte supports: APIs, files, databases, APIs, data warehouses, data lakes, etc.
We got funding from Streamr to #BUIDL
As you can see, if Streamr can connect to Airbyte, it opens up a huge amount of data that can flow into its platform, from Google Analytics, HubSpot, and Salesforce to MySQL, BigQuery, Postgres,...
With the Airbyte connectors, you can now publish tons of data into Streamr without any effort
GitHub - devmate-cloud/streamr-airbyte-connectors: Airbyte connectors
- Data ingestion is hard; Airbyte makes it easier and more scalable
- Take advantage of many data sources: files, APIs, databases,...
- Give users a choice other than traditional, centralized solutions (Kafka, BigQuery, Snowflake,...)
With support from the Streamr data fund, our team can focus on building the hardest part of the pipeline, and this integration also gives Streamr the potential to acquire more users. This is a win-win relationship!
You can apply to the data fund too and build awesome things with Streamr
With the hardest part sponsored, we can now start building
- A crawler to crawl DEX volume from https://coinmarketcap.com/rankings/exchanges/dex/
- A frontend app to get real-time data and visualize it
...on building a real-time BI tool
This is a really hard journey and we have only taken the very first step; there are lots of problems we still need to solve
Building a Crawling tool for anyone is hard
The internet is an open world where we can get a lot of data, but doing that at scale is hard. Most of that data is an asset of the website that hosts it, so you end up either building a crawling tool to get it or buying it from them via an API. Selling that data is their business, so they make their sites hard to crawl
Our team is thinking of building a SQL-like language to crawl data
The idea is that we can crawl most of the website data via SQL syntax
SELECT a[href], img[alt] FROM "https://coinmarketcap.coin" WHERE a[class="cmc-link"];
SELECT div[innerText] FROM "https://streamr.network" WHERE div CHILD OF ".Hero__Inner";
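To make the idea a bit more concrete, here is a minimal Python sketch of how such a query could be split into its parts before any crawling happens. This is not our implementation: the names `CrawlQuery` and `parse_crawl_query` are hypothetical, the grammar is only inferred from the two example queries, and the WHERE clause is kept as raw text instead of being interpreted.

```python
import re
from dataclasses import dataclass


@dataclass
class CrawlQuery:
    fields: list  # (tag, attribute) pairs to extract, e.g. ("a", "href")
    url: str      # page to crawl
    where: str    # raw selector condition, left unparsed in this sketch


# Assumed grammar: SELECT <fields> FROM "<url>" WHERE <condition>;
_QUERY_RE = re.compile(
    r'^\s*SELECT\s+(?P<fields>.+?)\s+FROM\s+"(?P<url>[^"]+)"'
    r'\s+WHERE\s+(?P<where>.+?)\s*;?\s*$',
    re.IGNORECASE | re.DOTALL,
)
# A field looks like tag[attribute], e.g. img[alt]
_FIELD_RE = re.compile(r"^(\w+)\[(\w+)\]$")


def parse_crawl_query(text):
    """Split one SQL-like crawl query into fields, URL and condition."""
    match = _QUERY_RE.match(text)
    if match is None:
        raise ValueError(f"not a valid crawl query: {text!r}")
    fields = []
    for raw in match.group("fields").split(","):
        field = _FIELD_RE.match(raw.strip())
        if field is None:
            raise ValueError(f"not a valid field: {raw.strip()!r}")
        fields.append((field.group(1), field.group(2)))
    return CrawlQuery(fields, match.group("url"), match.group("where"))
```

A real tool would still need to fetch the page and turn the WHERE condition into a selector, which is where most of the hard work is.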
Realtime BI tool
There is a lack of tools to visualize real-time data. Most of them handle batch data by querying a data warehouse, but supporting real time is a hard problem.
Kafka is the only tool we know of that we can use to build a real-time data pipeline, and the only BI tool we have seen supporting Kafka right now is https://metatron.app/
Is Streamr a fit?
Not at the moment. So far, Streamr supports most of the basic cases: we can subscribe to real-time data, get past data by time,... but it doesn't support the most important case:
- Processing the data. Imagine you have an input stream of DEX volume data every 30 minutes and you need an output stream of DEX volume data every 4 hours.
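As an illustration of the kind of processing we mean, here is a minimal sketch in plain Python that rolls 30-minute data points up into 4-hour windows. It does not use any Streamr API. The function names are hypothetical, and it assumes each point is a (timezone-aware timestamp, volume) pair where the volume covers only that interval, so summing is the right aggregation.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bucket_start(ts, window):
    """Return the start of the window that contains ts (aligned to the epoch)."""
    index = (ts - EPOCH) // window  # timedelta // timedelta gives an int
    return EPOCH + index * window


def resample_volume(points, window=timedelta(hours=4)):
    """Sum (timestamp, volume) points into fixed windows, sorted by time."""
    totals = {}
    for ts, volume in points:
        start = bucket_start(ts, window)
        totals[start] = totals.get(start, 0.0) + volume
    return sorted(totals.items())


# Example input: sixteen 30-minute points (8 hours of data), volume 1.0 each
start = datetime(2022, 1, 1, tzinfo=timezone.utc)
points = [(start + timedelta(minutes=30 * i), 1.0) for i in range(16)]
four_hourly = resample_volume(points)  # two windows of eight points each
```

On a live stream the same bucketing would run incrementally, emitting a window once its last 30-minute point has arrived.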
We also think Streamr can use its nodes for lots of potential use cases, not just as infrastructure to stream data. For example:
- Support smart contracts so we can turn a node into a crawler data point. By doing so, we can leverage the many nodes on the Streamr Network to get data.
- Support smart contracts that can process a stream, as in my example above: stream as input → smart contract → stream as output
- Our team doesn't want to expose our private key when integrating Streamr on the frontend. We can hide it by running a Streamr node on our server, but setting up and operating our own node takes effort
This is a lot of work, but our team believes that, in such a fast-growing industry, we will have those abilities in the near future