
Pandas is an incredibly convenient Python module for working with tabular data when ArcGIS table tools and workflows are missing functionality or are simply too slow. The main pandas data structure, the DataFrame, cannot be directly ingested back into a GDB table. Esri's tool for this, arcpy.da.NumPyArrayToTable(), only reads numpy arrays. More specifically, the tool requires a structured numpy array, which means that each column needs a dtype definition that specifies its name and data type (like "int16").

There are several problems, the first of which is that the conversion from a pandas.DataFrame to a numpy.array tends to strip off the data type specification of each column, or at least it does some of what you want, but not all. In a pandas-to-numpy conversion, each column is cast from a specific pandas data type to a corresponding numpy data type. The crux of the problem is that there are not equivalent numpy types for all pandas data types (most, but definitely not all). In those cases, the conversion "upcasts" to a more generic data type. This frequently results in an output data type of "object", which is so generic that the arcpy.da.NumPyArrayToTable() function barfs. This conversion to "object" happens for string columns, for example, and if the pandas.DataFrame columns hold a mix of data types, the conversion usually just treats all columns as "object".
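To see the failure mode in isolation, here is a minimal sketch with a hypothetical mixed-type DataFrame (the names and values below are made up for illustration):

import numpy as np
import pandas as pd

# One string column plus two numeric columns.
df = pd.DataFrame({
    "akey": ["a1", "a2"],     # strings -> pandas dtype "object"
    "depth": [0, 20],         # int64
    "clay": [22.5, 30.0],     # float64
})

# The naive conversion collapses everything to one generic dtype.
arr = df.values
print(arr.dtype)   # object -- no per-column names or types survive

# arcpy.da.NumPyArrayToTable() rejects an array like this, because it
# is not a structured array (no named fields with concrete dtypes).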

Here is an example of how to deal with this. First, the sample pandas.DataFrame:

source data
[Dbg]>>> type(chorizonVar)
<class 'pandas.core.frame.DataFrame'>

[Dbg]>>> chorizonVar.dtypes
cokey              object
chkey              object
hzdept_r            int64
hzdepb_r            int64
claytotal_r       float64
hzdepb_r20          int64
h_thk               int64
co_thk            float64
co_wt             float64
claytotal_r_wt    float64
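The actual values in this frame do not matter for anything that follows. If you want to follow along without the source data, a hypothetical stand-in with the same column names and dtypes works just as well:

import pandas as pd

# Hypothetical stand-in for chorizonVar; the values are made up, but
# the column names and dtypes match the listing above.
chorizonVar = pd.DataFrame({
    "cokey": ["11784990", "11784991"],   # object (string)
    "chkey": ["32808673", "32808674"],   # object (string)
    "hzdept_r": [0, 20],                 # int64
    "hzdepb_r": [20, 50],                # int64
    "claytotal_r": [22.5, 30.0],         # float64
    "hzdepb_r20": [20, 20],              # int64
    "h_thk": [20, 0],                    # int64
    "co_thk": [20.0, 20.0],              # float64
    "co_wt": [0.55, 0.45],               # float64
    "claytotal_r_wt": [12.4, 13.5],      # float64
})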


There are a variety of functions within both pandas and numpy that can convert a pandas.DataFrame to a numpy.array, so there may well be other ways to do this. Here is one example.

conversion code
import arcpy
import numpy as np

# DataFrame -> recarray (infers per-column dtypes) -> plain structured ndarray
x = np.array(np.rec.fromrecords(chorizonVar.values))
# Restore the original column names (the conversion leaves "f0", "f1", ...)
names = chorizonVar.dtypes.index.tolist()
x.dtype.names = tuple(names)
arcpy.da.NumPyArrayToTable(x, r'E:\Workspace\testData.gdb\testTable')

Most of the smarts are in the first line of the conversion, x = np.array(np.rec.fromrecords(chorizonVar.values)). The first thing that happens is the conversion of the pandas.DataFrame to an unstructured numpy.array with chorizonVar.values. This np.array characterizes all columns as "object". The real magic is turning that into a numpy.recarray (aka "record array") with np.rec.fromrecords(). This function looks at the contents of each column and makes its best guess, which worked ideally in my experiment containing strings, integers, and floats. The last thing that happens on that line is converting the np.recarray back into a regular np.array with np.array(), because that is what arcpy.da.NumPyArrayToTable() requires. It is also worth noting that while the np.recarray is generally a "richer" data structure than an np.array, it is also substantially more expensive in terms of memory and the speed of operations applied to it.
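To make the inference step visible, this is roughly what np.rec.fromrecords() reports for the stand-in frame above (the exact string widths and integer sizes can vary by platform and by content):

rec = np.rec.fromrecords(chorizonVar.values)
print(rec.dtype)
# e.g. (numpy.record, [('f0', '<U8'), ('f1', '<U8'), ('f2', '<i8'),
#       ('f3', '<i8'), ('f4', '<f8'), ('f5', '<i8'), ('f6', '<i8'),
#       ('f7', '<f8'), ('f8', '<f8'), ('f9', '<f8')])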

So now we have a numpy.array, x, with good data type specifications. The only remaining problem is that the column names have been dumbed down to "f0", "f1", and so on (for "field 0", etc.). Since I want to keep the names that were in the pandas.DataFrame, the next two lines pull those into a list and then reset the names on the numpy.array's dtype. The numpy.array is now just what I want, and the final line loads it into the ArcGIS GDB table.
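As a quick sanity check (continuing with the stand-in data), you can confirm that x carries both the restored names and concrete per-column dtypes before handing it to arcpy:

print(x.dtype.names)
# ('cokey', 'chkey', 'hzdept_r', 'hzdepb_r', 'claytotal_r',
#  'hzdepb_r20', 'h_thk', 'co_thk', 'co_wt', 'claytotal_r_wt')
print(x.dtype)   # named fields with concrete numpy types, no bare "object"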

(Screenshot: the first few rows of the resulting GDB table.)
