
No more sassy SaaS integrations

Amit Ripshtos
Tech Lead
August 24, 2022
Building a framework for integrating third parties into our platform faster.

Modern applications are giant meshes of services and interconnected APIs. However, there isn't a standardized, systematic way to integrate them. As we built the Dazz Remediation Cloud to accelerate secure cloud software development, we needed to address how different R&D teams write integrations to SaaS applications in a consistent and efficient way.

In this video, we'll cover the patterns and anti-patterns of working with third-party integrations, and suggest how to wrap them into a very Pythonic framework that focuses on ease of use and a great developer experience, using widely used libraries such as Typer and Pydantic, and methodologies like dependency injection and snapshot testing.

Hey everyone. Super excited to be here. It's actually my second time and it's twice as fun. My name is Amit. I work at Dazz, and I basically really love Python. So today we are going to talk about sassy SaaS integrations. Before I start I just want to say thank you to the Python organization and to Sam for everything; everything is wonderful. Great. So before I talk about the real issue, which is SaaS integrations, I want to explain why we have this issue. At Dazz, we are building the Dazz Remediation Cloud, and that means enterprises that have a bunch of tools and a lot of developers, which equals tons of alerts. We basically help them fix these alerts. But before we help them fix the alerts, we need to actually consume these alerts and the related information into our own cloud.

There are a lot of security tools and developer tools that we need to integrate with. For example, we need to integrate with GitHub for Dependabot alerts, or with Snyk for SCA issues.

Let's talk about the agenda. We're going to talk about integrations in general and how they work, how we do REST API integrations, and what the main obstacles are, both in fetching the data and doing the ETL afterwards. Later on, we are going to talk about Airbyte as a case study of how a really cool company solves this generically for almost every kind of integration, and what we can learn from their SDK. Then I would like to share a bit about how we solved that issue at Dazz, and how we built our own SDK for it. I will show some cool tricks, for example module discovery and dependency injection, to make developers' lives easier. I will finish with a cool tip on how to do automated snapshot testing for integrations.

Each integration has its own story. You almost always need to implement authentication, which is either token-based or OAuth. You need to implement pagination, which might be by cursor or by page, and the same goes for rate limits, handling 4xx status codes, and API versions. I guess many of you have integrated something and then the schema of that integration changed without notice. The most important part is the schema itself: how to save the right schema so we can ingest it later.

Handling data is an art and it's fun, because every time I move between jobs, I have the same problems: how to handle integrations, how to handle ETLs. So I want to drill down into the real obstacles. When fetching an integration, you always need to do the same things all over again. You need to make sure the time windowing is right, so you fetch only new data, and the same goes for pagination. You need to make sure you don't spam the API and hit rate limits, or even get throttled. You need to make sure that you handle exceptions, for example validation errors or even internal server errors, gracefully, and monitor them using metrics and even traces.
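
To make that concrete, here is a minimal sketch of those fetch-side concerns. The endpoint, parameters, and pagination fields are hypothetical, not any real Dazz or vendor API; it just shows time windowing, cursor pagination, and backing off on 429 responses in one place.

import time
from datetime import datetime, timedelta

import requests

# Hypothetical endpoint, for illustration only.
BASE_URL = "https://api.example.com/v1/alerts"


def fetch_alerts(token: str, since: datetime):
    """Fetch only new alerts, following cursor pagination and backing off on 429."""
    headers = {"Authorization": f"Bearer {token}"}
    params = {"updated_after": since.isoformat(), "limit": 100}
    while True:
        response = requests.get(BASE_URL, headers=headers, params=params, timeout=30)
        if response.status_code == 429:
            # Respect the rate limit instead of spamming the API.
            time.sleep(int(response.headers.get("Retry-After", "5")))
            continue
        response.raise_for_status()  # surface 4xx/5xx instead of swallowing them
        payload = response.json()
        yield from payload["items"]
        cursor = payload.get("next_cursor")
        if not cursor:  # last page reached
            break
        params["cursor"] = cursor


# Usage: only fetch the last day's worth of data (time windowing).
# for alert in fetch_alerts("my-token", datetime.utcnow() - timedelta(days=1)):
#     print(alert)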


For validation: after you fetch the data, you shouldn't trust the third party. You can validate the payload, for example against a JSON schema or something like that, to make sure you've got enough data to keep working. And if not, maybe you need to fail the run.
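
For example, with Pydantic (one of the libraries mentioned above) you could validate each fetched record before it enters the pipeline. The Finding model below is a hypothetical sketch, not the actual Dazz schema.

from pydantic import BaseModel, ValidationError


class Finding(BaseModel):
    # Hypothetical fields, for illustration only.
    id: str
    severity: str
    resource: str


def validate_records(raw_records: list[dict]) -> list[Finding]:
    """Validate third-party payloads instead of trusting them blindly."""
    findings = []
    for record in raw_records:
        try:
            findings.append(Finding(**record))
        except ValidationError as exc:
            # Decide here whether to skip the record or fail the whole run.
            print(f"Invalid record {record.get('id')}: {exc}")
    return findings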

Regarding parsing: let's say you fetch the data, for example AWS resources from S3. That data is in the AWS domain language, which is not really the same as the Dazz domain language, so we need to parse the data and convert it into a shape our backends are able to work with.
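
As a toy illustration of that kind of parsing, a translation from an AWS-shaped record into an internal domain object might look like the sketch below; the DazzResource class and the raw field names are hypothetical.

from dataclasses import dataclass


@dataclass
class DazzResource:
    # Hypothetical internal domain object, for illustration only.
    resource_id: str
    provider: str
    region: str


def parse_s3_bucket(raw: dict) -> DazzResource:
    """Translate an AWS-shaped record into our own domain language."""
    return DazzResource(
        resource_id=raw["Name"],                # the vendor's field names...
        provider="aws",
        region=raw.get("Region", "us-east-1"),  # ...mapped to ours
    )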

We also need to handle issues, right? Let's say I have a bug in my code and I wrote the integration badly; I need to be able to restream everything without fetching all the data again and again. The last part, which we won't talk about but which is really an issue, is how do I store the data? Do I use S3 and Athena, a data warehouse, or maybe an RDBMS? And what happens when something fails? Do I get duplicates in my database? These three issues are really hard.

So when I joined Dazz, I did a little digging in the code and I found the two integrations that we had. One was AWS and the second one was GCP. Naive me thought, okay, it's going to be the same, right? It's almost the same, but I saw that the developer who wrote the AWS integration saved all the data as one big JSON file, which can be huge. Another developer was more responsible and split the data by GCP project, but saved it in a CSV file. So there was no consistency. There was no framework around it, and we understood that was a real issue. We wanted to create some kind of SDK or framework so it would be very easy to write integrations and fill that awesome integrations page.

I did a little digging and, honestly, I have been interested in the area of integrations for quite a while, and one company that is really interesting in my opinion is a company called Airbyte. In the world of ETLs there are some competitors, for example Fivetran, which gives you ETL as a service and lets you integrate everything with everything. And then came Airbyte and disrupted that world.

What Airbyte gives you is an ecosystem of connectors and destinations. They let you connect almost anything to almost anything. The nice thing about it, and why I'm talking about it at this specific meeting, is that they build everything on open source. So, basically, people around the world build connectors for them, and that's amazing. They have invested a lot in it, and they created something called the CDK, a connector development kit. I wanted to talk about that because there are some really cool concepts in their CDK that we later used at Dazz.

So let's dive in… 

The CDK must handle the integration, right? It must query the API and handle pagination and everything, even exceptions. To manage that, the Airbyte CDK has the concept of a source, which is a class. For example, this is code from the Monday source, and you can see that this source defines only two things. It defines the connection check, to make sure the authentication token is correct, and it defines the streams, meaning which entities need to be fetched from Monday. In this example there are five entities, and each one does the API fetching and everything else. As you can see, each item is a stream, which is the second concept that Airbyte defined. And what is a stream? A stream is basically a way of handling the API requests. In this example (I should actually change it; it's not Monday now, it's Jira) this is a stream for fetching data from Jira. As you can see, there is no requests.get, because the CDK does everything for you. You basically need to implement a few well-defined methods, such as what the base URL is, how to fetch, how to find the next page from the response, and how to parse the response.
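
Roughly, a CDK source and stream look like the sketch below. The method names follow the Airbyte CDK's AbstractSource and HttpStream, but the signatures are simplified and the ExampleItems stream is made up for illustration; it is not the real Monday or Jira connector code.

import requests
from airbyte_cdk.sources import AbstractSource
from airbyte_cdk.sources.streams.http import HttpStream


class ExampleItems(HttpStream):
    # Made-up stream; real connectors define one class like this per entity.
    url_base = "https://api.example.com/v1/"
    primary_key = "id"

    def path(self, **kwargs) -> str:
        return "items"

    def next_page_token(self, response: requests.Response):
        # Tell the CDK how to find the next page, e.g. a cursor in the body.
        # (A real connector would also feed this token back into request_params.)
        cursor = response.json().get("next_cursor")
        return {"cursor": cursor} if cursor else None

    def parse_response(self, response: requests.Response, **kwargs):
        # Tell the CDK how to turn a response into records.
        yield from response.json()["items"]


class ExampleSource(AbstractSource):
    def check_connection(self, logger, config):
        # Verify the credentials work before syncing any streams.
        return True, None

    def streams(self, config):
        return [ExampleItems()]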

I really love the abstraction they created, and it gave me quite a lot of inspiration. We used these kinds of building blocks to create our own SDK. So we went on a journey to build our own Dazz framework, and this framework needs to enable developers to write integrations quickly and deliver fast. We decided on something different from Airbyte: we split each integration into two phases.

The first phase is the stream, like Airbyte's, which maintains the incremental fetch, the state, the pagination, and all that annoying stuff. Then we save the data as-is, raw, from the API. The second phase is the parse: we convert the raw data to Dazz domain objects, ship them to the backend, and then the magic happens. The nice thing about it, again, is that things can be restreamed and that's fine; we don't get throttled just because we want to reprocess all of a customer's data, for example.

How does it work? Basically, as a developer, you need to do two things. You create a folder for, let's say, the Snyk integration, and in it an integration.py file. In this file, you create a class that inherits from the abstract integration class in our SDK, and inside this class you basically define the streams, as you can see: a stream of Snyk issues and a stream of Snyk projects. And you define parsers; in this example we take the issues stream and the projects stream and return an output, which is Snyk findings. Maybe this parser knows how to cross-reference the data and manipulate it for our own needs. And there is no definition of how to fetch data, right? The stream stays almost the same.
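
Since the real SDK is internal, here is only a rough, self-contained sketch of what such an integration.py could look like; every class name below is a hypothetical stand-in for the shapes described above.

# Hypothetical sketch of a developer-facing integrations/snyk/integration.py.
# The Stream/AbstractIntegration bases are stand-ins for the internal SDK.
from abc import ABC
from typing import Iterable


class Stream(ABC):
    """Stand-in for the SDK base class that handles fetching, pagination, and state."""


class AbstractIntegration(ABC):
    """Stand-in for the SDK base class every integration inherits from."""
    streams: list = []


class SnykIssuesStream(Stream):
    """Phase one: fetch raw Snyk issues; the SDK saves them as-is."""


class SnykProjectsStream(Stream):
    """Phase one: fetch raw Snyk projects."""


class SnykIntegration(AbstractIntegration):
    streams = [SnykIssuesStream, SnykProjectsStream]

    def parse_findings(self, issues: Iterable[dict], projects: Iterable[dict]) -> Iterable[dict]:
        """Phase two: cross the raw streams and emit Snyk findings in the Dazz domain."""
        project_names = {p.get("id"): p.get("name") for p in projects}
        for issue in issues:
            yield {"id": issue.get("id"),
                   "project": project_names.get(issue.get("projectId"))}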

One thing that I do want to talk about is how we run this, because it's just a file with a class, and that's it. This is a cool trick that I actually saw in other frameworks, like Kubeflow Pipelines, for example, that have some kind of CLI that automatically detects the code and runs it. When I saw it I thought, whoa, magic, that's really cool, let's do it! No one will need to create a CLI anymore. So I dug in to understand how it works, and this is the first trick. We do something that is called class discovery. We use the importlib package to import the integration.py file into memory. Then we use the inspect package to find the class that inherits from the abstract class. Then we use a really cool package called Typer to create a CLI.

If we go bottom-up, we see that we wrap the main function with Typer. Then we go to the get_source function, which does the importlib import. Once we have it, we get a module object, and then we can iterate over its members; a member can be a variable in this file, for example. Once we find a member that is a class and inherits from the abstract integration, we take it and run it with a configuration, for example.
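
A minimal sketch of that class-discovery CLI is below, using importlib, inspect, and Typer as described; the AbstractIntegration stand-in and its run() entrypoint are hypothetical, not the real Dazz SDK.

import importlib.util
import inspect

import typer

app = typer.Typer()


class AbstractIntegration:
    """Stand-in for the SDK base class (hypothetical run() entrypoint)."""

    def run(self, config: str) -> None:
        ...


def get_integration_class(path: str) -> type:
    """Import integration.py into memory and find the class inheriting the base."""
    spec = importlib.util.spec_from_file_location("integration", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    for _, member in inspect.getmembers(module, inspect.isclass):
        if issubclass(member, AbstractIntegration) and member is not AbstractIntegration:
            return member
    raise typer.Exit(code=1)


@app.command()
def run(path: str = "integration.py", config: str = "config.json") -> None:
    """Discover the integration class in the given file and run it with a config."""
    integration_cls = get_integration_class(path)
    integration_cls().run(config)


if __name__ == "__main__":
    app()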

The second trick we created is using dependency injection to abstract all the pain away from the developers. If you've used FastAPI, for example, you've seen things like this. Basically, as a developer on the parser side, I need to tell the SDK what data I need it to fetch.

For example, to parse Snyk, I need a full refresh of the projects stream, so I use Python's generic typing. For issues I only want the new data, the incremental data, so I use the issues stream. Basically, when I run this by itself, I need to provide all of these classes, but at runtime, when I use this code with our CLI, we automatically understand the typing of this function. We fetch the data, because we know we need a full refresh for the projects and an incremental fetch for the issues, and once we have the data, we pass it to this function. It's really cool, actually. To implement things like that, we use inspect again. Basically, we create a class that uses generic typing, just like Java, for example, with some kind of type variable. Then, given a stream and the parse function, we use dependency injection: we read the signature, the typing of the parse function, and iterate over it. We check whether one of the type hints is a full refresh, and then we fetch the subclass using typing's get_args, and once we have it, we can run the full refresh and pass the result to the actual function.
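
Here is a rough, self-contained sketch of that typing trick. The FullRefresh/Incremental markers, the stream classes, and their fetch methods are all hypothetical stand-ins; the point is reading the parser's signature and pulling the stream class out of the generic type hint with typing's get_origin and get_args.

import inspect
from typing import Generic, TypeVar, get_args, get_origin

S = TypeVar("S")


class FullRefresh(Generic[S]):
    """Marker type hint: the parser wants all of the stream's data."""


class Incremental(Generic[S]):
    """Marker type hint: the parser only wants data that is new since the last run."""


class ProjectStream:
    def fetch_all(self):  # hypothetical fetch method
        return [{"id": "p1", "name": "demo"}]


class IssueStream:
    def fetch_new(self):  # hypothetical fetch method
        return [{"id": "i1", "projectId": "p1"}]


def parse(projects: FullRefresh[ProjectStream], issues: Incremental[IssueStream]):
    """The developer only declares what data they need; the SDK injects it."""
    names = {p["id"]: p["name"] for p in projects}
    return [{"issue": i["id"], "project": names.get(i["projectId"])} for i in issues]


def inject_and_run(func):
    """Read the parser's signature and fetch the right data for each parameter."""
    kwargs = {}
    for name, param in inspect.signature(func).parameters.items():
        hint = param.annotation
        (stream_cls,) = get_args(hint)        # e.g. ProjectStream
        if get_origin(hint) is FullRefresh:
            kwargs[name] = stream_cls().fetch_all()
        elif get_origin(hint) is Incremental:
            kwargs[name] = stream_cls().fetch_new()
    return func(**kwargs)


print(inject_and_run(parse))  # [{'issue': 'i1', 'project': 'demo'}]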

So that's about it. The last trick I want to talk about is how to ease the pain of testing. Let's say we have a lot of integrations, and for each integration we need to call the API, save the responses in files, and then use them in tests to make sure we got everything right. To automate this, we used the requests package's response hooks. In our tests we have a special mode we can run that really executes the integration with some kind of token, for example, but using the request hook you can see here, we add a function that takes each response and saves it to a JSON file. The nice thing about it is that after running our integration with this code, we get a list of JSON files that we can use to mock the tests. To mock them, there is a really cool package called requests-mock that takes the files, understands which URLs need to be mocked, and basically runs the tests.
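
A simplified sketch of that record-and-replay flow is below, using the requests response hook to snapshot responses and requests-mock to replay them in tests; the URL, token, and file naming are hypothetical.

import json
from pathlib import Path

import requests
import requests_mock

SNAPSHOT_DIR = Path("snapshots")


def save_response(response, *args, **kwargs):
    """requests 'response' hook: snapshot every real response to a JSON file."""
    SNAPSHOT_DIR.mkdir(exist_ok=True)
    name = response.url.replace("https://", "").replace("/", "_")
    (SNAPSHOT_DIR / f"{name}.json").write_text(
        json.dumps({"url": response.url, "body": response.json()})
    )


def record_run(token: str) -> None:
    """Run the real integration once (hypothetical URL and token) to collect snapshots."""
    session = requests.Session()
    session.hooks["response"].append(save_response)
    session.get("https://api.example.com/v1/issues",
                headers={"Authorization": f"Bearer {token}"})


def test_issues_from_snapshots() -> None:
    """Replay the recorded snapshots instead of calling the real API."""
    with requests_mock.Mocker() as mocker:
        for snapshot_file in SNAPSHOT_DIR.glob("*.json"):
            snapshot = json.loads(snapshot_file.read_text())
            mocker.get(snapshot["url"], json=snapshot["body"])
        # The code under test now gets the recorded payloads back.
        response = requests.get("https://api.example.com/v1/issues")
        assert response.status_code == 200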

Let's summarize. Solid frameworks mean happier developers and faster development and delivery times. The typing system in Python can really help us abstract the pain away. Good APIs abstract all the messy stuff and give you the flexibility to do whatever you want. And it's limitless. Thank you very much.
