Pandas: Data analysis with Python

Data Panda


In the Big Data real world, the data to be analyzed do not usually originate directly with the application that analyzes them. Pandas thus comes with some auxiliary functions that read popular file formats and transfer their contents directly into Pandas data structures: read_csv(), read_table(), and read_fwf(). Figure 1 shows an example of a session with the advanced Python shell, IPython [4], and a call to read_csv(); Figure 2 shows a curtailed record.

Figure 1: IPython and Pandas for interactive data analysis.
Figure 2: Pandas reads and writes data from and to files and displays the data in a clear-cut way.

These methods expect data sources in tabular form (i.e., one record per line and comma- or tab-separated cells). Arbitrary field separators can be defined with the sep argument in the form of simple strings or regular expressions.

For read_fwf() only, fixed field widths remove the need for field separators; instead, you pass in a list of field widths, stated as the number of characters, or colspecs, stated as the absolute start and end values of each column as a tuple. As a data source, the read methods always expect the first argument to be file names or URLs – or a path.

By default, Pandas reading methods interpret the first line of a file as a header that contains the column names. If you set the argument header=None when calling the method, the first line becomes the first record. In this case, it makes sense to pass in the column names as a list using the names argument.

To save memory and time when processing very large files, the iterator=True argument instructs all the read functions to do the reading chunkwise. Instead of returning the complete file contents, they then return a TextParser object. The size of the read chunks is specified by the chunksize argument. If this argument is set, Pandas automatically sets the iterator to True. Using a TextParser, you can read and process the data line by line in a for loop. The get_chunk() method directly returns the next chuck of the file.

The Series and DataFrame structures make it just as easy to write their content to files. Both have a to_csv() method that expects the output file as an argument; if you instead specify sys.stdout, it passes the data directly to the standard output. The default field separator is the comma, but you can declare an alternative with the sep argument.

Various Formats

Pandas can even process Excel files using the ExcelFile class. Its constructor expects the file path; the resulting ExcelFile uses the parse() method to return DataFrame objects of the individual sheets:

excelfile = pandas.ExcelFile('file.xls')
dataframe = excelfile.parse('Sheet1')

If so desired, Pandas uses the pickle module to store binary format objects on disk. Series and DataFrames, and all other Pandas structures, support the save() helper method for this. It simply expects the output file as an argument. Conversely, the Pandas load() method reads the file and returns the corresponding object.

The data library also adds support for HDF5 (Hierarchical Data Format), which is used, among other programs, by Matlab mathematics software. The advantage is that it can be read efficiently in chunks, even when using compression, so it is particularly suitable for very large data sets.

Pandas uses the HDFStore class to read HDF5 files; the class constructor expects a file name. The resulting object can be read much like a dictionary:

hdf = HDFStore('file.h5')
s1 = hdf['s1']

These calls read the file.h5 HDF file, whose data structure contains a Series object named s1. This is also stored in the s1 variable.


After reading the data, Pandas applies numerous auxiliary functions to shape them. First, merge() merges two DataFrame objects,

pandas.merge(dataframe1, dataframe2)

here, by combining the columns of two DataFrames on the basis of identical indices by default. If instead you want to use another column to identify the records to be merged, you can use the on argument to specify the relevant name. This of course only works if both data frames contain a like-named column.

Instead of merging records in two objects, concat() concatenates Series or DataFrames. In the simplest case, the concatenation of two Series objects creates a new Series object that lists all the entries in the two source objects in succession. Alternatively, the line

concat([series1, series2, series3], axis=1)

generates a DataFrame from multiple Series objects. In this example, the function concatenates the sources on the basis of the columns (axis=1), instead of line by line (axis=0).

SQL database users are familiar with the concat() functionality from joins. By default, the inner method is used; this generates an intersection of the keys used. Alternatively, you may use the outer (union), left, or right method. With left and right, the result of a merge contains only the keys of the left or right source object, respectively.

Buy ADMIN Magazine

Get it on Google Play

US / Canada

Get it on Google Play

UK / Australia

Related content

comments powered by Disqus