Writing Out Files & Python UnicodeEncodeError Woes

Character encoding is a headache that I am sure every engineer has had to face at least once. Oh yes, that fun topic! No, I do not have a solution for everyone. Sorry! But if you are in the Python world like I am, writing out files and getting a bunch of UnicodeEncodeError exceptions, try the following.

tl;dr Show me an example!

import codecs
import json

with codecs.open(filename, 'w', encoding='latin-1') as outfile:
    outfile.write('{}\n'.format(json.dumps(data, encoding='latin-1')))
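
As an aside, if you happen to be on Python 3: the built-in open() takes an encoding argument directly, and json.dumps() no longer accepts an encoding keyword at all. The equivalent there would look something like this:

with open(filename, 'w', encoding='latin-1') as outfile:
    # json.dumps() escapes non-ASCII characters by default (ensure_ascii=True),
    # so its output is safe to write to a latin-1 encoded file.
    outfile.write('{}\n'.format(json.dumps(data)))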

So, what? And why?

Okay, so I am writing a latin-1 encoded file. Yes, latin-1. Why? Well, I've chosen latin-1 here because latin-1 was giving me issues, so there! But really, if you want to write to a different encoding, obviously just swap that out.

The longer explanation, if you were curious, is that I was reading in data that was latin-1 encoded, and it was making my data ingestion jobs fail because I default to utf-8. The json.dumps() bit is not needed if you are not working with json (obviously!), but I wanted to point out that if you are writing json, you also need to set the encoding there to whatever you choose. It is currently on my TODO list to figure out why that is the case.
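
For completeness, here is a minimal sketch of the reading side (source_file and process() are placeholders for your own job, not code from mine): open the source with the encoding it was actually written in, rather than assuming utf-8.

import codecs

with codecs.open(source_file, 'r', encoding='latin-1') as infile:
    for line in infile:
        process(line)  # placeholder for whatever your ingestion job does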

By Adrian Cruz | Published March 21, 2016, 9:50 p.m. | Permalink | tags: python

Intro to Building Out Data Pipelines With Python and Luigi

A very common question that I have been getting asked is, "Luigi? What's that?" The answer I usually give, in brief, is that it is a project open-sourced by Spotify to manage workflows and dependencies. But to quote the Luigi ReadTheDocs page:

Luigi is a Python package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.

Luigi Tasks & Targets

The main pieces of Luigi are built around a Task and a Target. A Task is exactly what you would think it is: a single task in your data pipeline. For example, you may have a task that reads in some csv file and pulls out specific values that you want from the file. The Target is the intended output for your Task. So, back to our csv file example, an output() Target for that task may be a cleaned-up csv file with the values you wanted.

Here Is Our Example Task

import luigi


class MyTask(luigi.Task):
    def output(self):
        # Luigi uses the existence of this Target to decide if the task is done.
        return luigi.LocalTarget('my_output.csv')

    def run(self):
        # Count how many rows in the input describe a cat.
        cat_count = 0
        with open('input.csv', 'r') as in_file:
            for line in in_file:
                animal, age, color = line.strip().split(',')
                if animal == 'cat':
                    cat_count += 1

        # Targets are written through their own open(); note write() needs a string.
        with self.output().open('w') as out_file:
            out_file.write(str(cat_count))

That is a [dumb] simple Task. It has no requirements; all it does is read in a file, parse it, and write its output to another file.

A few things to note here, starting with the importance of output(). Luigi checks whether the output() Target exists to determine if the task is complete; that is Luigi's definition of complete. You can also override complete() if you do not have an output, but for now just think of every task as needing an output.
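
For instance, a minimal sketch of overriding complete() might look like this (the helper functions here are made up for illustration, not part of Luigi):

class NoOutputTask(luigi.Task):
    def complete(self):
        # With no output file for Luigi to check, report completeness yourself.
        return work_is_already_done()  # hypothetical check

    def run(self):
        do_the_work()  # hypothetical side effect, e.g. an API call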

"So, what is this good for?"

One big thing that I purposely left out when describing Luigi is that it integrates really well into the whole Hadoop ecosystem! Now let's take a step back and think about how we would run these batch data processing jobs without Luigi...

Let's say we have several jobs that need to finish before the overall task can be considered complete. For example, maybe we need to process some data found in some log files. The log files are currently stored in S3, so we'll have a job to fetch them locally. Maybe the logs need to be cleaned up a little, so we'll do whatever filtering and other transformations we need to cleanse the data. Next, we'll want the newly formatted logs loaded into HDFS. After that, we can utilize Hive, so we'll want to create a table with those logs.

Data jobs in the past

In the past, we would just run these as several cron jobs. But wait, depending on how much data we're working with, these jobs take varying lengths of time to complete! So the best you can do is watch how long each job runs and schedule accordingly. For example, we know the data is pulled down from S3 within ~20 minutes, so we'll schedule the next job 30 minutes later, and do the same for the remaining jobs as well.

Now, with Luigi

With Luigi, you create dependency chains for jobs very easily by overriding the requires() method. Say your entire process should run in this order: Task0->Task1->Task2->Task3. You can now say that Task3 requires Task2, Task2 requires Task1, and so on. This looks like the following:

class Task3(luigi.Task):
    def requires(self):
        # Task2 must be complete before Task3 can run.
        return [Task2()]

    # The other core code would go here as well, like run(), output()...

Now you have one single point to schedule and no need to guess when each job will run! Pretty neat, right? :)
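
To kick the whole chain off, you only need to ask for the last task; Luigi works out the rest. A minimal sketch using our example tasks:

import luigi

if __name__ == '__main__':
    # Asking for Task3 makes Luigi run Task0 through Task2 first, as needed.
    luigi.build([Task3()], local_scheduler=True)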

This is obviously just an introduction; I've only scratched the surface of Luigi. But if you need to build out data pipelines and enjoy doing so in Python, I highly recommend checking out Luigi! Cheers!

By Adrian Cruz | Published May 31, 2015, 7:57 p.m. | Permalink | tags: big data, hadoop, luigi, python

Julython 2014 Recap

Julython 2014 has come and gone. I hope everyone had a blast; I know I did!

A Month of Hacking In a Nutshell

Let's take a quick look at what was worked on in July:

An Opportunity to Learn, Build

I like to use Julython as a very good excuse to learn something new, or to learn more about something familiar. This past July, I had been toying with the idea of learning SQLAlchemy, and this was a perfect excuse to do so.

Take a look at my Slidedecker repository and you will see that I added SQLAlchemy to it. Naturally, I wanted to learn more about it first, so I also added the tutorial to my Julython projects. But I really wanted to learn by example, and working on your own personal project is the way to go!

PyPI Contributions

Both bump and MPPyResponse are packages available on PyPI.

bump is a fork of the original that I use to do version bumping. I needed to bump versions for release candidates and beta candidates, and that functionality was missing, so I added it in.

MPPyResponse was created out of sheer humor. A friend of mine jokingly asked me to create an application that responded with a random Monty Python quote; no problem!

With it still being Julython, I wanted to learn something new while having fun, so I started MPPyResponse off with cookiecutter. It is truly a fantastic package for bootstrapping a little package of your own. I highly recommend it if you need some boilerplate to get your projects going!

Summary

In summation, Julython for me is really just an excuse to learn something new in Python while having a little fun. Yes, sharing code and watching myself on the leader board are awesome too, but all in all, contributing to a month-long event of hacking is really just a wonderful thing!

Until next time! Cheers!

By Adrian Cruz | Published Aug. 6, 2014, 1:51 a.m. | Permalink | tags: julython, opensource, python

Py-Closure: A Python Client for Closure

I made a little command line utility to assist in minifying javascript files. It uses Closure Compiler to do the compiling. I had been using the web app and needed a way to automate minification; this command line utility is just the beginning of that.

What can I do with it?

Well, if you use Closure Compiler already, I'm sure you have wanted a way to streamline your minification process. This is really just one piece of that. For example, you could add this to your build process in Jenkins to minify all the javascript files within your code repository.
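
To give a flavor of what a client like this does, here is a minimal sketch of calling the Closure Compiler web service (my own illustration using the documented REST endpoint, not the actual py-closure code):

import requests

def minify(js_code):
    # POST the source to the Closure Compiler web service and
    # get the minified output back as plain text.
    response = requests.post(
        'https://closure-compiler.appspot.com/compile',
        data={
            'js_code': js_code,
            'compilation_level': 'SIMPLE_OPTIMIZATIONS',
            'output_info': 'compiled_code',
            'output_format': 'text',
        },
    )
    return response.text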

But, there are a bunch of other asset minifiers I can use!

There are, and I was actually looking into using Grunt. But I have been busy with this new project and just want to keep the ball rolling, so I'm sure I will revisit the minifying process in due time.

So, where's the code?

Sure, it's here on my GitHub. It's short and sweet and super simple. I'm actually going to fork this off and use it similarly at work. Anyhow, I just thought I'd share. Cheers!

By Adrian Cruz | Published May 7, 2014, 2:35 a.m. | Permalink | tags: javascript, opensource, python

Silly Bug Fixes

Yes, this was a bug fix. "A one-line bug?" you ask. Yes, sad but true.

Now, for some background...

A bug was sent in about pagination not working completely for a web application. It was consistently showing only four pages. That is obviously not the expected behavior, especially when the user knows there should have been 10, 20, or 50 pages.

Okay, now the history of that code

It just so happens that this portion of the site was quickly rolled out to return only 100 rows from the database as a shortcut. As part of paying down that technical debt, a fix for pagination was added later. Unfortunately, when working at a fast pace, you sometimes miss details in the code, and bugs like this slip through.
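
To illustrate the kind of leftover that bites you (a made-up reconstruction, not the actual code; the names and numbers are illustrative): if a hardcoded cap survives the pagination fix, the page count only ever reflects the capped rows, e.g. 100 rows at 25 per page is exactly four pages.

PER_PAGE = 25

def fetch_page(cursor, page):
    # The leftover shortcut: every query is still capped at 100 rows,
    # so pagination can never see past page 4 (100 / 25).
    cursor.execute("SELECT id, name FROM items LIMIT 100")
    rows = cursor.fetchall()
    start = (page - 1) * PER_PAGE
    return rows[start:start + PER_PAGE]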

Doh!

Silly, but let's fix this

I know I'm not the only engineer to have commits like this. I mean, remember the goto fail; bug? To remedy this, we've already started doing code review on this project. Why weren't we doing code review from the beginning? Well, if you remember my post on a quick site launch, you are aware that this is a site that was bootstrapped quickly to ship a functional product in a short amount of time.

Unfortunately, things like this were expected given the lack of time. But here's to finding a humorous tone in my mistakes. Cheers!

By Adrian Cruz | Published April 7, 2014, 5:29 a.m. | Permalink | tags: mysql, python, silly bug fixes