  • The Compliance Risks of Synthetic Data Generation

    What Is Synthetic Data? Synthetic data is machine-generated data based on real-world data. Producing it requires building a machine learning (ML) model that captures the patterns in the original, real data, and then generating new synthetic data from those patterns. The generated data accurately represents the original data’s statistical distributions, patterns, and properties. Synthetic data is useful for applications facing privacy concerns: it is not regarded as personally identifiable information (PII), because it is not directly traceable to real individuals. Thus, organizations can freely share and use synthetic data with minimal technical and administrative controls. This process requires a high level of automation, relying on fewer human resources and skills…

  • Develop your first ETL job in Python using bonobo

    In this post I am going to discuss how you can write ETL jobs in Python using the Bonobo library. Before I get into the library itself, allow me to discuss ETL itself and why it is needed. What is ETL? ETL is short for Extract, Transform and Load, a process in which data is acquired, changed or processed, and then finally loaded into a data warehouse or database(s). You can extract data from sources like files, websites, or a database, transform the acquired data, and then load the final version into a database for business usage. You may ask, why ETL? Well, what ETL does, many of you might already be doing…

  • A subreddit for web scrapers

    I recently realized that I am in love with data scraping. Thanks to Python and BeautifulSoup for getting me into this. In the past few months I have scraped data from sites like Amazon, Rakuten and NewEgg. I have started a subreddit for developers who love to scrape sites for fun (or for work). I named it Scraping the Web. You can join here and submit your entries. Happy Scraping!