{ "cells": [ { "cell_type": "markdown", "id": "4c6c548b", "metadata": {}, "source": [ "# 10 Minutes to cuDF and Dask-cuDF\n", "\n", "Modelled after 10 Minutes to Pandas, this is a short introduction to cuDF and Dask-cuDF, geared mainly towards new users.\n", "\n", "## What are these Libraries?\n", "\n", "[cuDF](https://github.com/rapidsai/cudf) is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating tabular data using a DataFrame style API in the style of [pandas](https://pandas.pydata.org).\n", "\n", "[Dask](https://dask.org/) is a flexible library for parallel computing in Python that makes scaling out your workflow smooth and simple. On the CPU, Dask uses Pandas to execute operations in parallel on DataFrame partitions.\n", "\n", "[Dask-cuDF](https://github.com/rapidsai/cudf/tree/main/python/dask_cudf) extends Dask where necessary to allow its DataFrame partitions to be processed using cuDF GPU DataFrames instead of Pandas DataFrames. For instance, when you call `dask_cudf.read_csv(...)`, your cluster's GPUs do the work of parsing the CSV file(s) by calling [`cudf.read_csv()`](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.read_csv.html).\n", "\n", "\n", "## When to use cuDF and Dask-cuDF\n", "\n", "If your workflow is fast enough on a single GPU or your data comfortably fits in memory on a single GPU, you would want to use cuDF. If you want to distribute your workflow across multiple GPUs, have more data than you can fit in memory on a single GPU, or want to analyze data spread across many files at once, you would want to use Dask-cuDF." 
] }, { "cell_type": "code", "execution_count": 1, "id": "36307e42", "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "import cupy as cp\n", "import pandas as pd\n", "\n", "import cudf\n", "import dask_cudf\n", "\n", "cp.random.seed(12)\n", "\n", "#### Portions of this were borrowed and adapted from the\n", "#### cuDF cheatsheet, existing cuDF documentation,\n", "#### and 10 Minutes to Pandas." ] }, { "cell_type": "markdown", "id": "eff5fc19", "metadata": {}, "source": [ "## Object Creation" ] }, { "cell_type": "markdown", "id": "0a747886", "metadata": {}, "source": [ "Creating a `cudf.Series` and `dask_cudf.Series`." ] }, { "cell_type": "code", "execution_count": 2, "id": "f5e303df", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "3 <NA>\n", "4 4\n", "dtype: int64" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = cudf.Series([1, 2, 3, None, 4])\n", "s" ] }, { "cell_type": "code", "execution_count": 3, "id": "9a893956", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "dtype: int64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds = dask_cudf.from_cudf(s, npartitions=2)\n", "# Note the call to head here to show the first few entries. Unlike\n", "# cuDF objects, dask-cuDF objects do not have a printing\n", "# representation that shows values, since the data may not be in\n", "# local memory.\n", "ds.head(n=3)" ] }, { "cell_type": "markdown", "id": "d934af4e", "metadata": {}, "source": [ "Creating a `cudf.DataFrame` and a `dask_cudf.DataFrame` by specifying values for each column." 
] }, { "cell_type": "code", "execution_count": 4, "id": "3f53fb8b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>19</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1</td>\n", " <td>18</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2</td>\n", " <td>17</td>\n", " <td>2</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>3</td>\n", " <td>16</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>4</td>\n", " <td>15</td>\n", " <td>4</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>5</td>\n", " <td>14</td>\n", " <td>5</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>6</td>\n", " <td>13</td>\n", " <td>6</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>7</td>\n", " <td>12</td>\n", " <td>7</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>8</td>\n", " <td>11</td>\n", " <td>8</td>\n", " </tr>\n", " <tr>\n", " <th>9</th>\n", " <td>9</td>\n", " <td>10</td>\n", " <td>9</td>\n", " </tr>\n", " <tr>\n", " <th>10</th>\n", " <td>10</td>\n", " <td>9</td>\n", " <td>10</td>\n", " </tr>\n", " <tr>\n", " <th>11</th>\n", " <td>11</td>\n", " <td>8</td>\n", " <td>11</td>\n", " </tr>\n", " <tr>\n", " <th>12</th>\n", " <td>12</td>\n", " <td>7</td>\n", " <td>12</td>\n", " </tr>\n", " <tr>\n", " <th>13</th>\n", " <td>13</td>\n", " <td>6</td>\n", " <td>13</td>\n", " </tr>\n", " <tr>\n", " <th>14</th>\n", " <td>14</td>\n", " <td>5</td>\n", " <td>14</td>\n", " </tr>\n", " <tr>\n", 
" <th>15</th>\n", " <td>15</td>\n", " <td>4</td>\n", " <td>15</td>\n", " </tr>\n", " <tr>\n", " <th>16</th>\n", " <td>16</td>\n", " <td>3</td>\n", " <td>16</td>\n", " </tr>\n", " <tr>\n", " <th>17</th>\n", " <td>17</td>\n", " <td>2</td>\n", " <td>17</td>\n", " </tr>\n", " <tr>\n", " <th>18</th>\n", " <td>18</td>\n", " <td>1</td>\n", " <td>18</td>\n", " </tr>\n", " <tr>\n", " <th>19</th>\n", " <td>19</td>\n", " <td>0</td>\n", " <td>19</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c\n", "0 0 19 0\n", "1 1 18 1\n", "2 2 17 2\n", "3 3 16 3\n", "4 4 15 4\n", "5 5 14 5\n", "6 6 13 6\n", "7 7 12 7\n", "8 8 11 8\n", "9 9 10 9\n", "10 10 9 10\n", "11 11 8 11\n", "12 12 7 12\n", "13 13 6 13\n", "14 14 5 14\n", "15 15 4 15\n", "16 16 3 16\n", "17 17 2 17\n", "18 18 1 18\n", "19 19 0 19" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = cudf.DataFrame(\n", " {\n", " \"a\": list(range(20)),\n", " \"b\": list(reversed(range(20))),\n", " \"c\": list(range(20)),\n", " }\n", ")\n", "df" ] }, { "cell_type": "markdown", "id": "b17db919", "metadata": {}, "source": [ "Now we will convert our cuDF dataframe into a dask-cuDF equivalent. Here we call out a key difference: to inspect the data we must call a method (here `.head()` to look at the first few values). In the general case (see the end of this notebook), the data in `ddf` will be distributed across multiple GPUs.\n", "\n", "In this small case, we could call `ddf.compute()` to obtain a cuDF object from the dask-cuDF object. In general, we should avoid calling `.compute()` on large dataframes, and restrict ourselves to using it when we have some (relatively) small postprocessed result that we wish to inspect. 
Hence, throughout this notebook we will generally call `.head()` to inspect the first few values of a dask-cuDF dataframe, occasionally calling out places where we use `.compute()` and why.\n", "\n", "*To understand more of the differences between how cuDF and dask-cuDF behave here, visit the [10 Minutes to Dask](https://docs.dask.org/en/stable/10-minutes-to-dask.html) tutorial after this one.*" ] }, { "cell_type": "code", "execution_count": 5, "id": "8904b5ad", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>19</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1</td>\n", " <td>18</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2</td>\n", " <td>17</td>\n", " <td>2</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>3</td>\n", " <td>16</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>4</td>\n", " <td>15</td>\n", " <td>4</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c\n", "0 0 19 0\n", "1 1 18 1\n", "2 2 17 2\n", "3 3 16 3\n", "4 4 15 4" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf = dask_cudf.from_cudf(df, npartitions=2)\n", "ddf.head()" ] }, { "cell_type": "markdown", "id": "0573b7bb", "metadata": {}, "source": [ "Creating a `cudf.DataFrame` from a pandas `DataFrame` and a `dask_cudf.DataFrame` from a `cudf.DataFrame`.\n", "\n", "*Note that best practice for using dask-cuDF is to 
read data directly into a `dask_cudf.DataFrame` with `read_csv` or other builtin I/O routines (discussed below).*" ] }, { "cell_type": "code", "execution_count": 6, "id": "06a42f3a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>0.1</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1</td>\n", " <td>0.2</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2</td>\n", " <td><NA></td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>3</td>\n", " <td>0.3</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b\n", "0 0 0.1\n", "1 1 0.2\n", "2 2 <NA>\n", "3 3 0.3" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pdf = pd.DataFrame({\"a\": [0, 1, 2, 3], \"b\": [0.1, 0.2, None, 0.3]})\n", "gdf = cudf.DataFrame.from_pandas(pdf)\n", "gdf" ] }, { "cell_type": "code", "execution_count": 7, "id": "c67de344", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>0.1</td>\n", " </tr>\n", " 
<tr>\n", " <th>1</th>\n", " <td>1</td>\n", " <td>0.2</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b\n", "0 0 0.1\n", "1 1 0.2" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dask_gdf = dask_cudf.from_cudf(gdf, npartitions=2)\n", "dask_gdf.head(n=2)" ] }, { "cell_type": "markdown", "id": "5820795f", "metadata": {}, "source": [ "## Viewing Data" ] }, { "cell_type": "markdown", "id": "b3008757", "metadata": {}, "source": [ "Viewing the top rows of a GPU dataframe." ] }, { "cell_type": "code", "execution_count": 8, "id": "0c329914", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>19</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1</td>\n", " <td>18</td>\n", " <td>1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c\n", "0 0 19 0\n", "1 1 18 1" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(2)" ] }, { "cell_type": "code", "execution_count": 9, "id": "b989e208", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr 
style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>19</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1</td>\n", " <td>18</td>\n", " <td>1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c\n", "0 0 19 0\n", "1 1 18 1" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.head(2)" ] }, { "cell_type": "markdown", "id": "16798a32", "metadata": {}, "source": [ "Sorting by values." ] }, { "cell_type": "code", "execution_count": 10, "id": "2190856d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>19</th>\n", " <td>19</td>\n", " <td>0</td>\n", " <td>19</td>\n", " </tr>\n", " <tr>\n", " <th>18</th>\n", " <td>18</td>\n", " <td>1</td>\n", " <td>18</td>\n", " </tr>\n", " <tr>\n", " <th>17</th>\n", " <td>17</td>\n", " <td>2</td>\n", " <td>17</td>\n", " </tr>\n", " <tr>\n", " <th>16</th>\n", " <td>16</td>\n", " <td>3</td>\n", " <td>16</td>\n", " </tr>\n", " <tr>\n", " <th>15</th>\n", " <td>15</td>\n", " <td>4</td>\n", " <td>15</td>\n", " </tr>\n", " <tr>\n", " <th>14</th>\n", " <td>14</td>\n", " <td>5</td>\n", " <td>14</td>\n", " </tr>\n", " <tr>\n", " <th>13</th>\n", " <td>13</td>\n", " <td>6</td>\n", " <td>13</td>\n", " </tr>\n", " <tr>\n", " <th>12</th>\n", " <td>12</td>\n", " <td>7</td>\n", " <td>12</td>\n", " </tr>\n", " <tr>\n", " 
<th>11</th>\n", " <td>11</td>\n", " <td>8</td>\n", " <td>11</td>\n", " </tr>\n", " <tr>\n", " <th>10</th>\n", " <td>10</td>\n", " <td>9</td>\n", " <td>10</td>\n", " </tr>\n", " <tr>\n", " <th>9</th>\n", " <td>9</td>\n", " <td>10</td>\n", " <td>9</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>8</td>\n", " <td>11</td>\n", " <td>8</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>7</td>\n", " <td>12</td>\n", " <td>7</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>6</td>\n", " <td>13</td>\n", " <td>6</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>5</td>\n", " <td>14</td>\n", " <td>5</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>4</td>\n", " <td>15</td>\n", " <td>4</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>3</td>\n", " <td>16</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2</td>\n", " <td>17</td>\n", " <td>2</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1</td>\n", " <td>18</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>19</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c\n", "19 19 0 19\n", "18 18 1 18\n", "17 17 2 17\n", "16 16 3 16\n", "15 15 4 15\n", "14 14 5 14\n", "13 13 6 13\n", "12 12 7 12\n", "11 11 8 11\n", "10 10 9 10\n", "9 9 10 9\n", "8 8 11 8\n", "7 7 12 7\n", "6 6 13 6\n", "5 5 14 5\n", "4 4 15 4\n", "3 3 16 3\n", "2 2 17 2\n", "1 1 18 1\n", "0 0 19 0" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sort_values(by=\"b\")" ] }, { "cell_type": "code", "execution_count": 11, "id": "6594bd6f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" 
class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>19</th>\n", " <td>19</td>\n", " <td>0</td>\n", " <td>19</td>\n", " </tr>\n", " <tr>\n", " <th>18</th>\n", " <td>18</td>\n", " <td>1</td>\n", " <td>18</td>\n", " </tr>\n", " <tr>\n", " <th>17</th>\n", " <td>17</td>\n", " <td>2</td>\n", " <td>17</td>\n", " </tr>\n", " <tr>\n", " <th>16</th>\n", " <td>16</td>\n", " <td>3</td>\n", " <td>16</td>\n", " </tr>\n", " <tr>\n", " <th>15</th>\n", " <td>15</td>\n", " <td>4</td>\n", " <td>15</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c\n", "19 19 0 19\n", "18 18 1 18\n", "17 17 2 17\n", "16 16 3 16\n", "15 15 4 15" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.sort_values(by=\"b\").head()" ] }, { "cell_type": "markdown", "id": "3302a647", "metadata": {}, "source": [ "## Selecting a Column" ] }, { "cell_type": "markdown", "id": "aebc72ab", "metadata": {}, "source": [ "Selecting a single column, which initially yields a `cudf.Series` or `dask_cudf.Series`. Calling `compute` results in a `cudf.Series` (equivalent to `df.a`)." 
] }, { "cell_type": "code", "execution_count": 12, "id": "4dafb4ed", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 1\n", "2 2\n", "3 3\n", "4 4\n", "5 5\n", "6 6\n", "7 7\n", "8 8\n", "9 9\n", "10 10\n", "11 11\n", "12 12\n", "13 13\n", "14 14\n", "15 15\n", "16 16\n", "17 17\n", "18 18\n", "19 19\n", "Name: a, dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"a\"]" ] }, { "cell_type": "code", "execution_count": 13, "id": "b38f05fc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 1\n", "2 2\n", "3 3\n", "4 4\n", "Name: a, dtype: int64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf[\"a\"].head()" ] }, { "cell_type": "markdown", "id": "a5160dd1", "metadata": {}, "source": [ "## Selecting Rows by Label" ] }, { "cell_type": "markdown", "id": "51ff2093", "metadata": {}, "source": [ "Selecting rows from index 2 to index 5 from columns 'a' and 'b'." 
] }, { "cell_type": "code", "execution_count": 14, "id": "e8870657", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2</th>\n", " <td>2</td>\n", " <td>17</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>3</td>\n", " <td>16</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>4</td>\n", " <td>15</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>5</td>\n", " <td>14</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b\n", "2 2 17\n", "3 3 16\n", "4 4 15\n", "5 5 14" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[2:5, [\"a\", \"b\"]]" ] }, { "cell_type": "code", "execution_count": 15, "id": "f041e661", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2</th>\n", " <td>2</td>\n", " <td>17</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>3</td>\n", " <td>16</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>4</td>\n", " <td>15</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>5</td>\n", " <td>14</td>\n", " </tr>\n", 
" </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b\n", "2 2 17\n", "3 3 16\n", "4 4 15\n", "5 5 14" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.loc[2:5, [\"a\", \"b\"]].head()" ] }, { "cell_type": "markdown", "id": "d8e07162", "metadata": {}, "source": [ "## Selecting Rows by Position" ] }, { "cell_type": "markdown", "id": "435eb76a", "metadata": {}, "source": [ "Selecting via integers and integer slices, like numpy/pandas. Note that this functionality is not available for Dask-cuDF DataFrames." ] }, { "cell_type": "code", "execution_count": 16, "id": "bc337d5d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a 0\n", "b 19\n", "c 0\n", "Name: 0, dtype: int64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[0]" ] }, { "cell_type": "code", "execution_count": 17, "id": "03671456", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>19</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1</td>\n", " <td>18</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2</td>\n", " <td>17</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b\n", "0 0 19\n", "1 1 18\n", "2 2 17" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[0:3, 0:2]" ] }, { "cell_type": "markdown", "id": "aa935d5b", "metadata": {}, "source": [ "You can also select elements of a `DataFrame` or 
`Series` with direct index access." ] }, { "cell_type": "code", "execution_count": 18, "id": "79883c37", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>3</th>\n", " <td>3</td>\n", " <td>16</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>4</td>\n", " <td>15</td>\n", " <td>4</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c\n", "3 3 16 3\n", "4 4 15 4" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[3:5]" ] }, { "cell_type": "code", "execution_count": 19, "id": "2f761695", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3 <NA>\n", "4 4\n", "dtype: int64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s[3:5]" ] }, { "cell_type": "markdown", "id": "9bf9c0b0", "metadata": {}, "source": [ "## Boolean Indexing" ] }, { "cell_type": "markdown", "id": "9b08b2b9", "metadata": {}, "source": [ "Selecting rows in a `DataFrame` or `Series` by direct Boolean indexing." 
] }, { "cell_type": "code", "execution_count": 20, "id": "1eb08f0d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>19</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1</td>\n", " <td>18</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2</td>\n", " <td>17</td>\n", " <td>2</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>3</td>\n", " <td>16</td>\n", " <td>3</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c\n", "0 0 19 0\n", "1 1 18 1\n", "2 2 17 2\n", "3 3 16 3" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df.b > 15]" ] }, { "cell_type": "code", "execution_count": 21, "id": "324dd036", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>19</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1</td>\n", " <td>18</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " 
<th>2</th>\n", " <td>2</td>\n", " <td>17</td>\n", " <td>2</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c\n", "0 0 19 0\n", "1 1 18 1\n", "2 2 17 2" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf[ddf.b > 15].head(n=3)" ] }, { "cell_type": "markdown", "id": "f95c9ab7", "metadata": {}, "source": [ "Selecting values from a `DataFrame` where a Boolean condition is met, via the `query` API." ] }, { "cell_type": "code", "execution_count": 22, "id": "fa643410", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>16</th>\n", " <td>16</td>\n", " <td>3</td>\n", " <td>16</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c\n", "16 16 3 16" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.query(\"b == 3\")" ] }, { "cell_type": "markdown", "id": "7aa0089f", "metadata": {}, "source": [ "Note here we call `compute()` rather than `head()` on the dask-cuDF dataframe since we are happy that the number of matching rows will be small (and hence it is reasonable to bring the entire result back)." 
] }, { "cell_type": "code", "execution_count": 23, "id": "e2706a02", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>16</th>\n", " <td>16</td>\n", " <td>3</td>\n", " <td>16</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c\n", "16 16 3 16" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.query(\"b == 3\").compute()" ] }, { "cell_type": "markdown", "id": "694dcbad", "metadata": {}, "source": [ "You can also pass local variables to Dask-cuDF queries via the `local_dict` keyword. With standard cuDF, you may either use the `local_dict` keyword or reference the variable directly by prefixing its name with `@`. Supported comparison operators include `>`, `<`, `>=`, `<=`, `==`, and `!=`." 
] }, { "cell_type": "code", "execution_count": 24, "id": "353b0250", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>16</th>\n", " <td>16</td>\n", " <td>3</td>\n", " <td>16</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c\n", "16 16 3 16" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cudf_comparator = 3\n", "df.query(\"b == @cudf_comparator\")" ] }, { "cell_type": "code", "execution_count": 25, "id": "a35c8a5a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>16</th>\n", " <td>16</td>\n", " <td>3</td>\n", " <td>16</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c\n", "16 16 3 16" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dask_cudf_comparator = 3\n", "ddf.query(\"b == @val\", local_dict={\"val\": dask_cudf_comparator}).compute()" ] }, { "cell_type": "markdown", "id": "8e004749", "metadata": {}, "source": [ "Using the 
`isin` method for filtering." ] }, { "cell_type": "code", "execution_count": 26, "id": "20936418", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>19</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>5</td>\n", " <td>14</td>\n", " <td>5</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c\n", "0 0 19 0\n", "5 5 14 5" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df.a.isin([0, 5])]" ] }, { "cell_type": "markdown", "id": "8e456f03", "metadata": {}, "source": [ "## MultiIndex" ] }, { "cell_type": "markdown", "id": "e494bd0b", "metadata": {}, "source": [ "cuDF supports hierarchical indexing of DataFrames using MultiIndex. Grouping hierarchically (see `Grouping` below) automatically produces a DataFrame with a MultiIndex." ] }, { "cell_type": "code", "execution_count": 27, "id": "4ae70724", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "MultiIndex([('a', 1),\n", " ('a', 2),\n", " ('b', 3),\n", " ('b', 4)],\n", " )" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "arrays = [[\"a\", \"a\", \"b\", \"b\"], [1, 2, 3, 4]]\n", "tuples = list(zip(*arrays))\n", "idx = cudf.MultiIndex.from_tuples(tuples)\n", "idx" ] }, { "cell_type": "markdown", "id": "ab232727", "metadata": {}, "source": [ "This index can back either axis of a DataFrame." 
] }, { "cell_type": "code", "execution_count": 28, "id": "cb1d1633", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th></th>\n", " <th>first</th>\n", " <th>second</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th rowspan=\"2\" valign=\"top\">a</th>\n", " <th>1</th>\n", " <td>0.082654</td>\n", " <td>0.967955</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>0.399417</td>\n", " <td>0.441425</td>\n", " </tr>\n", " <tr>\n", " <th rowspan=\"2\" valign=\"top\">b</th>\n", " <th>3</th>\n", " <td>0.784297</td>\n", " <td>0.793582</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>0.070303</td>\n", " <td>0.271711</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " first second\n", "a 1 0.082654 0.967955\n", " 2 0.399417 0.441425\n", "b 3 0.784297 0.793582\n", " 4 0.070303 0.271711" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gdf1 = cudf.DataFrame(\n", " {\"first\": cp.random.rand(4), \"second\": cp.random.rand(4)}\n", ")\n", "gdf1.index = idx\n", "gdf1" ] }, { "cell_type": "code", "execution_count": 29, "id": "73ba31af", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead tr th {\n", " text-align: left;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr>\n", " <th></th>\n", " <th colspan=\"2\" halign=\"left\">a</th>\n", " <th 
colspan=\"2\" halign=\"left\">b</th>\n", " </tr>\n", " <tr>\n", " <th></th>\n", " <th>1</th>\n", " <th>2</th>\n", " <th>3</th>\n", " <th>4</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>first</th>\n", " <td>0.343382</td>\n", " <td>0.003700</td>\n", " <td>0.20043</td>\n", " <td>0.581614</td>\n", " </tr>\n", " <tr>\n", " <th>second</th>\n", " <td>0.907812</td>\n", " <td>0.101512</td>\n", " <td>0.24179</td>\n", " <td>0.224180</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b \n", " 1 2 3 4\n", "first 0.343382 0.003700 0.20043 0.581614\n", "second 0.907812 0.101512 0.24179 0.224180" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gdf2 = cudf.DataFrame(\n", " {\"first\": cp.random.rand(4), \"second\": cp.random.rand(4)}\n", ").T\n", "gdf2.columns = idx\n", "gdf2" ] }, { "cell_type": "markdown", "id": "c5f33160", "metadata": {}, "source": [ "Accessing values of a DataFrame with a MultiIndex, both with `.loc`" ] }, { "cell_type": "code", "execution_count": 30, "id": "1048b7cf", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "first 0.784297\n", "second 0.793582\n", "Name: ('b', 3), dtype: float64" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gdf1.loc[(\"b\", 3)]" ] }, { "cell_type": "markdown", "id": "5123f759", "metadata": {}, "source": [ "And `.iloc`" ] }, { "cell_type": "code", "execution_count": 31, "id": "369d164d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th></th>\n", " <th>first</th>\n", " <th>second</th>\n", " 
</tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th rowspan=\"2\" valign=\"top\">a</th>\n", " <th>1</th>\n", " <td>0.082654</td>\n", " <td>0.967955</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>0.399417</td>\n", " <td>0.441425</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " first second\n", "a 1 0.082654 0.967955\n", " 2 0.399417 0.441425" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gdf1.iloc[0:2]" ] }, { "cell_type": "markdown", "id": "8b3b96e9", "metadata": {}, "source": [ "## Missing Data" ] }, { "cell_type": "markdown", "id": "d12141eb", "metadata": {}, "source": [ "Missing data can be replaced by using the `fillna` method." ] }, { "cell_type": "code", "execution_count": 32, "id": "913a7b5f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "3 999\n", "4 4\n", "dtype: int64" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.fillna(999)" ] }, { "cell_type": "code", "execution_count": 33, "id": "14479a42", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "dtype: int64" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds.fillna(999).head(n=3)" ] }, { "cell_type": "markdown", "id": "d97605e6", "metadata": {}, "source": [ "## Stats" ] }, { "cell_type": "markdown", "id": "f29a5de0", "metadata": {}, "source": [ "Calculating descriptive statistics for a `Series`." ] }, { "cell_type": "code", "execution_count": 34, "id": "b1a1666e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2.5, 1.666666666666666)" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.mean(), s.var()" ] }, { "cell_type": "markdown", "id": "f4879742", "metadata": {}, "source": [ "This serves as a prototypical example of when we might want to call `.compute()`.
The result of computing the mean and variance is a single number in each case, so it is definitely reasonable to look at the entire result!" ] }, { "cell_type": "code", "execution_count": 35, "id": "0cb7a207", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2.5, 1.6666666666666667)" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds.mean().compute(), ds.var().compute()" ] }, { "cell_type": "markdown", "id": "af792a18", "metadata": {}, "source": [ "## Applymap" ] }, { "cell_type": "markdown", "id": "f6094cbe", "metadata": {}, "source": [ "Applying functions to a `Series`. Note that applying user defined functions directly with Dask-cuDF is not yet implemented. For now, you can use [map_partitions](http://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.map_partitions.html) to apply a function to each partition of the distributed dataframe." ] }, { "cell_type": "code", "execution_count": 36, "id": "5b154619", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 10\n", "1 11\n", "2 12\n", "3 13\n", "4 14\n", "5 15\n", "6 16\n", "7 17\n", "8 18\n", "9 19\n", "10 20\n", "11 21\n", "12 22\n", "13 23\n", "14 24\n", "15 25\n", "16 26\n", "17 27\n", "18 28\n", "19 29\n", "Name: a, dtype: int64" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def add_ten(num):\n", " return num + 10\n", "\n", "\n", "df[\"a\"].apply(add_ten)" ] }, { "cell_type": "code", "execution_count": 37, "id": "8da5c3cb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 10\n", "1 11\n", "2 12\n", "3 13\n", "4 14\n", "Name: a, dtype: int64" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf[\"a\"].map_partitions(add_ten).head()" ] }, { "cell_type": "markdown", "id": "4d4fdde1", "metadata": {}, "source": [ "## Histogramming" ] }, { "cell_type": "markdown", "id": "b98a7daf", "metadata": {}, "source": [ "Counting the number of 
occurrences of each unique value of a variable." ] }, { "cell_type": "code", "execution_count": 38, "id": "c7b8ea5d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "15 1\n", "6 1\n", "1 1\n", "14 1\n", "2 1\n", "5 1\n", "11 1\n", "7 1\n", "17 1\n", "13 1\n", "8 1\n", "16 1\n", "0 1\n", "10 1\n", "4 1\n", "9 1\n", "19 1\n", "18 1\n", "3 1\n", "12 1\n", "Name: a, dtype: int32" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.a.value_counts()" ] }, { "cell_type": "code", "execution_count": 39, "id": "cc9d34f6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "15 1\n", "6 1\n", "1 1\n", "14 1\n", "2 1\n", "Name: a, dtype: int64" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.a.value_counts().head()" ] }, { "cell_type": "markdown", "id": "437172ba", "metadata": {}, "source": [ "## String Methods" ] }, { "cell_type": "markdown", "id": "fd3fc4f3", "metadata": {}, "source": [ "Like pandas, cuDF provides string processing methods in the `str` attribute of `Series`. Full documentation of string methods is a work in progress. Please see the [cuDF API documentation](https://docs.rapids.ai/api/cudf/stable/api_docs/series.html#string-handling) for more information."
] }, { "cell_type": "code", "execution_count": 40, "id": "86974041", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 a\n", "1 b\n", "2 c\n", "3 aaba\n", "4 baca\n", "5 <NA>\n", "6 caba\n", "7 dog\n", "8 cat\n", "dtype: object" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = cudf.Series([\"A\", \"B\", \"C\", \"Aaba\", \"Baca\", None, \"CABA\", \"dog\", \"cat\"])\n", "s.str.lower()" ] }, { "cell_type": "code", "execution_count": 41, "id": "c6a61a08", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 a\n", "1 b\n", "2 c\n", "3 aaba\n", "dtype: object" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds = dask_cudf.from_cudf(s, npartitions=2)\n", "ds.str.lower().head(n=4)" ] }, { "cell_type": "markdown", "id": "44fe1243", "metadata": {}, "source": [ "In addition to simple manipulation, we can also match strings using [regular expressions](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.core.column.string.StringMethods.match.html)." ] }, { "cell_type": "code", "execution_count": 42, "id": "51158a24", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 True\n", "4 False\n", "5 <NA>\n", "6 False\n", "7 False\n", "8 True\n", "dtype: bool" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.str.match(\"^[aAc].+\")" ] }, { "cell_type": "code", "execution_count": 43, "id": "4f3e36f5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 True\n", "4 False\n", "dtype: bool" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds.str.match(\"^[aAc].+\").head()" ] }, { "cell_type": "markdown", "id": "5528afa8", "metadata": {}, "source": [ "## Concat" ] }, { "cell_type": "markdown", "id": "e05b1078", "metadata": {}, "source": [ "Concatenating `Series` and `DataFrames` row-wise."
] }, { "cell_type": "code", "execution_count": 44, "id": "6c6d10bc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "3 <NA>\n", "4 5\n", "0 1\n", "1 2\n", "2 3\n", "3 <NA>\n", "4 5\n", "dtype: int64" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = cudf.Series([1, 2, 3, None, 5])\n", "cudf.concat([s, s])" ] }, { "cell_type": "code", "execution_count": 45, "id": "d3e5cf87", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "dtype: int64" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds2 = dask_cudf.from_cudf(s, npartitions=2)\n", "dask_cudf.concat([ds2, ds2]).head(n=3)" ] }, { "cell_type": "markdown", "id": "df087d2f", "metadata": {}, "source": [ "## Join" ] }, { "cell_type": "markdown", "id": "89cf5022", "metadata": {}, "source": [ "Performing SQL style merges. Note that the dataframe order is **not maintained**, but may be restored post-merge by sorting by the index." 
] }, { "cell_type": "code", "execution_count": 46, "id": "075c97a7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>key</th>\n", " <th>vals_a</th>\n", " <th>vals_b</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>a</td>\n", " <td>10.0</td>\n", " <td>100.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>c</td>\n", " <td>12.0</td>\n", " <td>101.0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>e</td>\n", " <td>14.0</td>\n", " <td>102.0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>b</td>\n", " <td>11.0</td>\n", " <td><NA></td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>d</td>\n", " <td>13.0</td>\n", " <td><NA></td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " key vals_a vals_b\n", "0 a 10.0 100.0\n", "1 c 12.0 101.0\n", "2 e 14.0 102.0\n", "3 b 11.0 <NA>\n", "4 d 13.0 <NA>" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_a = cudf.DataFrame()\n", "df_a[\"key\"] = [\"a\", \"b\", \"c\", \"d\", \"e\"]\n", "df_a[\"vals_a\"] = [float(i + 10) for i in range(5)]\n", "\n", "df_b = cudf.DataFrame()\n", "df_b[\"key\"] = [\"a\", \"c\", \"e\"]\n", "df_b[\"vals_b\"] = [float(i + 100) for i in range(3)]\n", "\n", "merged = df_a.merge(df_b, on=[\"key\"], how=\"left\")\n", "merged" ] }, { "cell_type": "code", "execution_count": 47, "id": "b28fc57b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " 
vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>key</th>\n", " <th>vals_a</th>\n", " <th>vals_b</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>c</td>\n", " <td>12.0</td>\n", " <td>101.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>e</td>\n", " <td>14.0</td>\n", " <td>102.0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>b</td>\n", " <td>11.0</td>\n", " <td><NA></td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>d</td>\n", " <td>13.0</td>\n", " <td><NA></td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " key vals_a vals_b\n", "0 c 12.0 101.0\n", "1 e 14.0 102.0\n", "2 b 11.0 <NA>\n", "3 d 13.0 <NA>" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf_a = dask_cudf.from_cudf(df_a, npartitions=2)\n", "ddf_b = dask_cudf.from_cudf(df_b, npartitions=2)\n", "\n", "merged = ddf_a.merge(ddf_b, on=[\"key\"], how=\"left\").head(n=4)\n", "merged" ] }, { "cell_type": "markdown", "id": "e4695c6e", "metadata": {}, "source": [ "## Grouping" ] }, { "cell_type": "markdown", "id": "ecb27b06", "metadata": {}, "source": [ "Like [pandas](https://pandas.pydata.org/docs/user_guide/groupby.html), cuDF and Dask-cuDF support the [Split-Apply-Combine groupby paradigm](https://doi.org/10.18637/jss.v040.i01)." ] }, { "cell_type": "code", "execution_count": 48, "id": "d8db18d9", "metadata": {}, "outputs": [], "source": [ "df[\"agg_col1\"] = [1 if x % 2 == 0 else 0 for x in range(len(df))]\n", "df[\"agg_col2\"] = [1 if x % 3 == 0 else 0 for x in range(len(df))]\n", "\n", "ddf = dask_cudf.from_cudf(df, npartitions=2)" ] }, { "cell_type": "markdown", "id": "8e2f0961", "metadata": {}, "source": [ "Grouping and then applying the `sum` function to the grouped data." 
] }, { "cell_type": "code", "execution_count": 49, "id": "e8a7f1f9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " <th>agg_col2</th>\n", " </tr>\n", " <tr>\n", " <th>agg_col1</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>1</th>\n", " <td>90</td>\n", " <td>100</td>\n", " <td>90</td>\n", " <td>4</td>\n", " </tr>\n", " <tr>\n", " <th>0</th>\n", " <td>100</td>\n", " <td>90</td>\n", " <td>100</td>\n", " <td>3</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c agg_col2\n", "agg_col1 \n", "1 90 100 90 4\n", "0 100 90 100 3" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(\"agg_col1\").sum()" ] }, { "cell_type": "code", "execution_count": 50, "id": "4dd090a1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " <th>agg_col2</th>\n", " </tr>\n", " <tr>\n", " <th>agg_col1</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>1</th>\n", " <td>90</td>\n", " 
<td>100</td>\n", " <td>90</td>\n", " <td>4</td>\n", " </tr>\n", " <tr>\n", " <th>0</th>\n", " <td>100</td>\n", " <td>90</td>\n", " <td>100</td>\n", " <td>3</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c agg_col2\n", "agg_col1 \n", "1 90 100 90 4\n", "0 100 90 100 3" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.groupby(\"agg_col1\").sum().compute()" ] }, { "cell_type": "markdown", "id": "5ff1e8bf", "metadata": {}, "source": [ "Grouping hierarchically then applying the `sum` function to grouped data." ] }, { "cell_type": "code", "execution_count": 51, "id": "4738f0ef", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " </tr>\n", " <tr>\n", " <th>agg_col1</th>\n", " <th>agg_col2</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>1</th>\n", " <th>0</th>\n", " <td>54</td>\n", " <td>60</td>\n", " <td>54</td>\n", " </tr>\n", " <tr>\n", " <th>0</th>\n", " <th>0</th>\n", " <td>73</td>\n", " <td>60</td>\n", " <td>73</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <th>1</th>\n", " <td>36</td>\n", " <td>40</td>\n", " <td>36</td>\n", " </tr>\n", " <tr>\n", " <th>0</th>\n", " <th>1</th>\n", " <td>27</td>\n", " <td>30</td>\n", " <td>27</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c\n", "agg_col1 agg_col2 \n", "1 0 54 60 54\n", "0 0 73 60 73\n", "1 1 36 40 36\n", "0 1 27 30 27" ] }, "execution_count": 51, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "df.groupby([\"agg_col1\", \"agg_col2\"]).sum()" ] }, { "cell_type": "code", "execution_count": 52, "id": "9b07feb1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " </tr>\n", " <tr>\n", " <th>agg_col1</th>\n", " <th>agg_col2</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>1</th>\n", " <th>1</th>\n", " <td>36</td>\n", " <td>40</td>\n", " <td>36</td>\n", " </tr>\n", " <tr>\n", " <th>0</th>\n", " <th>0</th>\n", " <td>73</td>\n", " <td>60</td>\n", " <td>73</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <th>0</th>\n", " <td>54</td>\n", " <td>60</td>\n", " <td>54</td>\n", " </tr>\n", " <tr>\n", " <th>0</th>\n", " <th>1</th>\n", " <td>27</td>\n", " <td>30</td>\n", " <td>27</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c\n", "agg_col1 agg_col2 \n", "1 1 36 40 36\n", "0 0 73 60 73\n", "1 0 54 60 54\n", "0 1 27 30 27" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.groupby([\"agg_col1\", \"agg_col2\"]).sum().compute()" ] }, { "cell_type": "markdown", "id": "443e179b", "metadata": {}, "source": [ "Grouping and applying statistical functions to specific columns, using `agg`." 
] }, { "cell_type": "code", "execution_count": 53, "id": "f196ad8b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " </tr>\n", " <tr>\n", " <th>agg_col1</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>1</th>\n", " <td>18</td>\n", " <td>10.0</td>\n", " <td>90</td>\n", " </tr>\n", " <tr>\n", " <th>0</th>\n", " <td>19</td>\n", " <td>9.0</td>\n", " <td>100</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c\n", "agg_col1 \n", "1 18 10.0 90\n", "0 19 9.0 100" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(\"agg_col1\").agg({\"a\": \"max\", \"b\": \"mean\", \"c\": \"sum\"})" ] }, { "cell_type": "code", "execution_count": 54, "id": "3853483f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " </tr>\n", " <tr>\n", " <th>agg_col1</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>1</th>\n", " <td>18</td>\n", " <td>10.0</td>\n", " <td>90</td>\n", " </tr>\n", " <tr>\n", " <th>0</th>\n", " 
<td>19</td>\n", " <td>9.0</td>\n", " <td>100</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c\n", "agg_col1 \n", "1 18 10.0 90\n", "0 19 9.0 100" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.groupby(\"agg_col1\").agg({\"a\": \"max\", \"b\": \"mean\", \"c\": \"sum\"}).compute()" ] }, { "cell_type": "markdown", "id": "a5bf30e1", "metadata": {}, "source": [ "## Transpose" ] }, { "cell_type": "markdown", "id": "5ac3b004", "metadata": {}, "source": [ "Transposing a dataframe, using either the `transpose` method or `T` property. Currently, all columns must have the same type. Transposing is not currently implemented in Dask-cuDF." ] }, { "cell_type": "code", "execution_count": 55, "id": "c5fbdb50", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>4</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>5</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>3</td>\n", " <td>6</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b\n", "0 1 4\n", "1 2 5\n", "2 3 6" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample = cudf.DataFrame({\"a\": [1, 2, 3], \"b\": [4, 5, 6]})\n", "sample" ] }, { "cell_type": "code", "execution_count": 56, "id": "733ed90c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " 
vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " <th>2</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>a</th>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>b</th>\n", " <td>4</td>\n", " <td>5</td>\n", " <td>6</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1 2\n", "a 1 2 3\n", "b 4 5 6" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample.transpose()" ] }, { "cell_type": "markdown", "id": "e0915c46", "metadata": {}, "source": [ "## Time Series" ] }, { "cell_type": "markdown", "id": "ec7f0b81", "metadata": {}, "source": [ "`DataFrames` support `datetime`-typed columns, which allow users to interact with and filter data based on specific timestamps."
] }, { "cell_type": "code", "execution_count": 57, "id": "a6d45607", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>date</th>\n", " <th>value</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>2018-11-20</td>\n", " <td>0.986051</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2018-11-21</td>\n", " <td>0.232034</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2018-11-22</td>\n", " <td>0.397617</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>2018-11-23</td>\n", " <td>0.103839</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " date value\n", "0 2018-11-20 0.986051\n", "1 2018-11-21 0.232034\n", "2 2018-11-22 0.397617\n", "3 2018-11-23 0.103839" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import datetime as dt\n", "\n", "date_df = cudf.DataFrame()\n", "date_df[\"date\"] = pd.date_range(\"11/20/2018\", periods=72, freq=\"D\")\n", "date_df[\"value\"] = cp.random.sample(len(date_df))\n", "\n", "search_date = dt.datetime.strptime(\"2018-11-23\", \"%Y-%m-%d\")\n", "date_df.query(\"date <= @search_date\")" ] }, { "cell_type": "code", "execution_count": 58, "id": "fbacaae1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr 
style=\"text-align: right;\">\n", " <th></th>\n", " <th>date</th>\n", " <th>value</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>2018-11-20</td>\n", " <td>0.986051</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2018-11-21</td>\n", " <td>0.232034</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2018-11-22</td>\n", " <td>0.397617</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>2018-11-23</td>\n", " <td>0.103839</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " date value\n", "0 2018-11-20 0.986051\n", "1 2018-11-21 0.232034\n", "2 2018-11-22 0.397617\n", "3 2018-11-23 0.103839" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "date_ddf = dask_cudf.from_cudf(date_df, npartitions=2)\n", "date_ddf.query(\n", " \"date <= @search_date\", local_dict={\"search_date\": search_date}\n", ").compute()" ] }, { "cell_type": "markdown", "id": "45f9408b", "metadata": {}, "source": [ "## Categoricals" ] }, { "cell_type": "markdown", "id": "5eb96f98", "metadata": {}, "source": [ "`DataFrames` support categorical columns." 
] }, { "cell_type": "code", "execution_count": 59, "id": "d735b5cb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>id</th>\n", " <th>grade</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>a</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>b</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>3</td>\n", " <td>b</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>4</td>\n", " <td>a</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>5</td>\n", " <td>a</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>6</td>\n", " <td>e</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " id grade\n", "0 1 a\n", "1 2 b\n", "2 3 b\n", "3 4 a\n", "4 5 a\n", "5 6 e" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gdf = cudf.DataFrame(\n", " {\"id\": [1, 2, 3, 4, 5, 6], \"grade\": [\"a\", \"b\", \"b\", \"a\", \"a\", \"e\"]}\n", ")\n", "gdf[\"grade\"] = gdf[\"grade\"].astype(\"category\")\n", "gdf" ] }, { "cell_type": "code", "execution_count": 60, "id": "9d1ff798", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>id</th>\n", " <th>grade</th>\n", " 
</tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>a</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>b</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>3</td>\n", " <td>b</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " id grade\n", "0 1 a\n", "1 2 b\n", "2 3 b" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dgdf = dask_cudf.from_cudf(gdf, npartitions=2)\n", "dgdf.head(n=3)" ] }, { "cell_type": "markdown", "id": "a9c2bcac", "metadata": {}, "source": [ "Accessing the categories of a column. Note that this is currently not supported in Dask-cuDF." ] }, { "cell_type": "code", "execution_count": 61, "id": "a7135eda", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "StringIndex(['a' 'b' 'e'], dtype='object')" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gdf.grade.cat.categories" ] }, { "cell_type": "markdown", "id": "466e1ed2", "metadata": {}, "source": [ "Accessing the underlying code values of each categorical observation." 
] }, { "cell_type": "code", "execution_count": 62, "id": "f00c615a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 1\n", "2 1\n", "3 0\n", "4 0\n", "5 2\n", "dtype: uint8" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gdf.grade.cat.codes" ] }, { "cell_type": "code", "execution_count": 63, "id": "d209512f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 1\n", "2 1\n", "3 0\n", "4 0\n", "5 2\n", "dtype: uint8" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dgdf.grade.cat.codes.compute()" ] }, { "cell_type": "markdown", "id": "1b391a0d", "metadata": {}, "source": [ "## Converting to Pandas" ] }, { "cell_type": "markdown", "id": "cfdd172b", "metadata": {}, "source": [ "Converting a cuDF and Dask-cuDF `DataFrame` to a pandas `DataFrame`." ] }, { "cell_type": "code", "execution_count": 64, "id": "1fcd9c7f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " <th>agg_col1</th>\n", " <th>agg_col2</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>19</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1</td>\n", " <td>18</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2</td>\n", " <td>17</td>\n", " <td>2</td>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>3</td>\n", " <td>16</td>\n", " <td>3</td>\n", " 
<td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>4</td>\n", " <td>15</td>\n", " <td>4</td>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c agg_col1 agg_col2\n", "0 0 19 0 1 1\n", "1 1 18 1 0 0\n", "2 2 17 2 1 0\n", "3 3 16 3 0 1\n", "4 4 15 4 1 0" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head().to_pandas()" ] }, { "cell_type": "markdown", "id": "aa8a445b", "metadata": {}, "source": [ "To convert the first few entries to pandas, we similarly call `.head()` on the dask-cuDF dataframe to obtain a local cuDF dataframe, which we can then convert." ] }, { "cell_type": "code", "execution_count": 65, "id": "786d39d2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " <th>agg_col1</th>\n", " <th>agg_col2</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>19</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1</td>\n", " <td>18</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2</td>\n", " <td>17</td>\n", " <td>2</td>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>3</td>\n", " <td>16</td>\n", " <td>3</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>4</td>\n", " <td>15</td>\n", " <td>4</td>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", 
"</table>\n", "</div>" ], "text/plain": [ " a b c agg_col1 agg_col2\n", "0 0 19 0 1 1\n", "1 1 18 1 0 0\n", "2 2 17 2 1 0\n", "3 3 16 3 0 1\n", "4 4 15 4 1 0" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.head().to_pandas()" ] }, { "cell_type": "markdown", "id": "584c4594", "metadata": {}, "source": [ "In contrast, if we want to convert the entire frame, we need to call `.compute()` on `ddf` to get a local cuDF dataframe, and then call `to_pandas()`, followed by subsequent processing. This workflow is less recommended, since it both puts high memory pressure on a single GPU (the `.compute()` call) and does not take advantage of GPU acceleration for processing (the computation happens in pandas, on the CPU)." ] }, { "cell_type": "code", "execution_count": 66, "id": "93f06cdc", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " <th>agg_col1</th>\n", " <th>agg_col2</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>19</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1</td>\n", " <td>18</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2</td>\n", " <td>17</td>\n", " <td>2</td>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>3</td>\n", " <td>16</td>\n", " <td>3</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>4</td>\n", " <td>15</td>\n", " <td>4</td>\n", " 
<td>1</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c agg_col1 agg_col2\n", "0 0 19 0 1 1\n", "1 1 18 1 0 0\n", "2 2 17 2 1 0\n", "3 3 16 3 0 1\n", "4 4 15 4 1 0" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.compute().to_pandas().head()" ] }, { "cell_type": "markdown", "id": "a104294a", "metadata": {}, "source": [ "## Converting to Numpy" ] }, { "cell_type": "markdown", "id": "c5d3e508", "metadata": {}, "source": [ "Converting a cuDF or Dask-cuDF `DataFrame` to a numpy `ndarray`." ] }, { "cell_type": "code", "execution_count": 67, "id": "2948b577", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 0, 19, 0, 1, 1],\n", " [ 1, 18, 1, 0, 0],\n", " [ 2, 17, 2, 1, 0],\n", " [ 3, 16, 3, 0, 1],\n", " [ 4, 15, 4, 1, 0],\n", " [ 5, 14, 5, 0, 0],\n", " [ 6, 13, 6, 1, 1],\n", " [ 7, 12, 7, 0, 0],\n", " [ 8, 11, 8, 1, 0],\n", " [ 9, 10, 9, 0, 1],\n", " [10, 9, 10, 1, 0],\n", " [11, 8, 11, 0, 0],\n", " [12, 7, 12, 1, 1],\n", " [13, 6, 13, 0, 0],\n", " [14, 5, 14, 1, 0],\n", " [15, 4, 15, 0, 1],\n", " [16, 3, 16, 1, 0],\n", " [17, 2, 17, 0, 0],\n", " [18, 1, 18, 1, 1],\n", " [19, 0, 19, 0, 0]])" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.to_numpy()" ] }, { "cell_type": "code", "execution_count": 68, "id": "1cff6352", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 0, 19, 0, 1, 1],\n", " [ 1, 18, 1, 0, 0],\n", " [ 2, 17, 2, 1, 0],\n", " [ 3, 16, 3, 0, 1],\n", " [ 4, 15, 4, 1, 0],\n", " [ 5, 14, 5, 0, 0],\n", " [ 6, 13, 6, 1, 1],\n", " [ 7, 12, 7, 0, 0],\n", " [ 8, 11, 8, 1, 0],\n", " [ 9, 10, 9, 0, 1],\n", " [10, 9, 10, 1, 0],\n", " [11, 8, 11, 0, 0],\n", " [12, 7, 12, 1, 1],\n", " [13, 6, 13, 0, 0],\n", " [14, 5, 14, 1, 0],\n", " [15, 4, 15, 0, 1],\n", " [16, 3, 16, 1, 0],\n", " [17, 2, 17, 0, 0],\n", " [18, 1, 18, 1, 1],\n", " [19, 0, 19, 0, 0]])" ] }, "execution_count": 68, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.compute().to_numpy()" ] }, { "cell_type": "markdown", "id": "c1f09303", "metadata": {}, "source": [ "Converting a cuDF or Dask-cuDF `Series` to a numpy `ndarray`." ] }, { "cell_type": "code", "execution_count": 69, "id": "997c89ba", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,\n", " 17, 18, 19])" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"a\"].to_numpy()" ] }, { "cell_type": "code", "execution_count": 70, "id": "243df512", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,\n", " 17, 18, 19])" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf[\"a\"].compute().to_numpy()" ] }, { "cell_type": "markdown", "id": "b520acf7", "metadata": {}, "source": [ "## Converting to Arrow" ] }, { "cell_type": "markdown", "id": "050e67e5", "metadata": {}, "source": [ "Converting a cuDF or Dask-cuDF `DataFrame` to a PyArrow `Table`." 
] }, { "cell_type": "code", "execution_count": 71, "id": "0ac9e740", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pyarrow.Table\n", "a: int64\n", "b: int64\n", "c: int64\n", "agg_col1: int64\n", "agg_col2: int64\n", "----\n", "a: [[0,1,2,3,4,...,15,16,17,18,19]]\n", "b: [[19,18,17,16,15,...,4,3,2,1,0]]\n", "c: [[0,1,2,3,4,...,15,16,17,18,19]]\n", "agg_col1: [[1,0,1,0,1,...,0,1,0,1,0]]\n", "agg_col2: [[1,0,0,1,0,...,1,0,0,1,0]]" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.to_arrow()" ] }, { "cell_type": "code", "execution_count": 72, "id": "f3170fc3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pyarrow.Table\n", "a: int64\n", "b: int64\n", "c: int64\n", "agg_col1: int64\n", "agg_col2: int64\n", "----\n", "a: [[0,1,2,3,4]]\n", "b: [[19,18,17,16,15]]\n", "c: [[0,1,2,3,4]]\n", "agg_col1: [[1,0,1,0,1]]\n", "agg_col2: [[1,0,0,1,0]]" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.head().to_arrow()" ] }, { "cell_type": "markdown", "id": "6f0251c6", "metadata": {}, "source": [ "## Reading/Writing CSV Files" ] }, { "cell_type": "markdown", "id": "2d1935c6", "metadata": {}, "source": [ "Writing to a CSV file." ] }, { "cell_type": "code", "execution_count": 73, "id": "36f5039f", "metadata": {}, "outputs": [], "source": [ "if not os.path.exists(\"example_output\"):\n", " os.mkdir(\"example_output\")\n", "\n", "df.to_csv(\"example_output/foo.csv\", index=False)" ] }, { "cell_type": "code", "execution_count": 74, "id": "22c1eb6a", "metadata": {}, "outputs": [], "source": [ "ddf.compute().to_csv(\"example_output/foo_dask.csv\", index=False)" ] }, { "cell_type": "markdown", "id": "320c3968", "metadata": {}, "source": [ "Reading from a CSV file." 
] }, { "cell_type": "code", "execution_count": 75, "id": "c110a80f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " <th>agg_col1</th>\n", " <th>agg_col2</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>19</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1</td>\n", " <td>18</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2</td>\n", " <td>17</td>\n", " <td>2</td>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>3</td>\n", " <td>16</td>\n", " <td>3</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>4</td>\n", " <td>15</td>\n", " <td>4</td>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>5</td>\n", " <td>14</td>\n", " <td>5</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>6</td>\n", " <td>13</td>\n", " <td>6</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>7</td>\n", " <td>12</td>\n", " <td>7</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>8</td>\n", " <td>11</td>\n", " <td>8</td>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>9</th>\n", " <td>9</td>\n", " <td>10</td>\n", " <td>9</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>10</th>\n", " <td>10</td>\n", " <td>9</td>\n", " <td>10</td>\n", " 
<td>1</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>11</th>\n", " <td>11</td>\n", " <td>8</td>\n", " <td>11</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>12</th>\n", " <td>12</td>\n", " <td>7</td>\n", " <td>12</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>13</th>\n", " <td>13</td>\n", " <td>6</td>\n", " <td>13</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>14</th>\n", " <td>14</td>\n", " <td>5</td>\n", " <td>14</td>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>15</th>\n", " <td>15</td>\n", " <td>4</td>\n", " <td>15</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>16</th>\n", " <td>16</td>\n", " <td>3</td>\n", " <td>16</td>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>17</th>\n", " <td>17</td>\n", " <td>2</td>\n", " <td>17</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>18</th>\n", " <td>18</td>\n", " <td>1</td>\n", " <td>18</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>19</th>\n", " <td>19</td>\n", " <td>0</td>\n", " <td>19</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c agg_col1 agg_col2\n", "0 0 19 0 1 1\n", "1 1 18 1 0 0\n", "2 2 17 2 1 0\n", "3 3 16 3 0 1\n", "4 4 15 4 1 0\n", "5 5 14 5 0 0\n", "6 6 13 6 1 1\n", "7 7 12 7 0 0\n", "8 8 11 8 1 0\n", "9 9 10 9 0 1\n", "10 10 9 10 1 0\n", "11 11 8 11 0 0\n", "12 12 7 12 1 1\n", "13 13 6 13 0 0\n", "14 14 5 14 1 0\n", "15 15 4 15 0 1\n", "16 16 3 16 1 0\n", "17 17 2 17 0 0\n", "18 18 1 18 1 1\n", "19 19 0 19 0 0" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = cudf.read_csv(\"example_output/foo.csv\")\n", "df" ] }, { "cell_type": "markdown", "id": "787eae14", "metadata": {}, "source": [ "Note that for the dask-cuDF case, we use `dask_cudf.read_csv` in preference to 
`dask_cudf.from_cudf(cudf.read_csv)` since the former can parallelize across multiple GPUs and handle CSV files larger than would fit in memory on a single GPU." ] }, { "cell_type": "code", "execution_count": 76, "id": "a699dfef", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " <th>agg_col1</th>\n", " <th>agg_col2</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>19</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1</td>\n", " <td>18</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2</td>\n", " <td>17</td>\n", " <td>2</td>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>3</td>\n", " <td>16</td>\n", " <td>3</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>4</td>\n", " <td>15</td>\n", " <td>4</td>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c agg_col1 agg_col2\n", "0 0 19 0 1 1\n", "1 1 18 1 0 0\n", "2 2 17 2 1 0\n", "3 3 16 3 0 1\n", "4 4 15 4 1 0" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf = dask_cudf.read_csv(\"example_output/foo_dask.csv\")\n", "ddf.head()" ] }, { "cell_type": "markdown", "id": "72857b2c", "metadata": {}, "source": [ "Reading all CSV files in a directory into a single `dask_cudf.DataFrame`, using the star wildcard." 
] }, { "cell_type": "code", "execution_count": 77, "id": "825a0c03", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " <th>agg_col1</th>\n", " <th>agg_col2</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>19</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1</td>\n", " <td>18</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2</td>\n", " <td>17</td>\n", " <td>2</td>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>3</td>\n", " <td>16</td>\n", " <td>3</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>4</td>\n", " <td>15</td>\n", " <td>4</td>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c agg_col1 agg_col2\n", "0 0 19 0 1 1\n", "1 1 18 1 0 0\n", "2 2 17 2 1 0\n", "3 3 16 3 0 1\n", "4 4 15 4 1 0" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf = dask_cudf.read_csv(\"example_output/*.csv\")\n", "ddf.head()" ] }, { "cell_type": "markdown", "id": "763c555b", "metadata": {}, "source": [ "## Reading/Writing Parquet Files" ] }, { "cell_type": "markdown", "id": "8766d4ac", "metadata": {}, "source": [ "Writing to parquet files with cuDF's GPU-accelerated parquet writer." 
] }, { "cell_type": "code", "execution_count": 78, "id": "5038b284", "metadata": {}, "outputs": [], "source": [ 
"df.to_parquet(\"example_output/temp_parquet\")" ] }, { "cell_type": "markdown", "id": "b4b49824", "metadata": {}, "source": [ "Reading parquet files with cuDF's GPU-accelerated parquet reader." ] }, { "cell_type": "code", "execution_count": 79, "id": "bb657a69", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " <th>agg_col1</th>\n", " <th>agg_col2</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>19</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1</td>\n", " <td>18</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2</td>\n", " <td>17</td>\n", " <td>2</td>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>3</td>\n", " <td>16</td>\n", " <td>3</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>4</td>\n", " <td>15</td>\n", " <td>4</td>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>5</td>\n", " <td>14</td>\n", " <td>5</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>6</td>\n", " <td>13</td>\n", " <td>6</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>7</td>\n", " <td>12</td>\n", " <td>7</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>8</td>\n", " <td>11</td>\n", " <td>8</td>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " 
<th>9</th>\n", " <td>9</td>\n", " <td>10</td>\n", " <td>9</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>10</th>\n", " <td>10</td>\n", " <td>9</td>\n", " <td>10</td>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>11</th>\n", " <td>11</td>\n", " <td>8</td>\n", " <td>11</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>12</th>\n", " <td>12</td>\n", " <td>7</td>\n", " <td>12</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>13</th>\n", " <td>13</td>\n", " <td>6</td>\n", " <td>13</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>14</th>\n", " <td>14</td>\n", " <td>5</td>\n", " <td>14</td>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>15</th>\n", " <td>15</td>\n", " <td>4</td>\n", " <td>15</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>16</th>\n", " <td>16</td>\n", " <td>3</td>\n", " <td>16</td>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>17</th>\n", " <td>17</td>\n", " <td>2</td>\n", " <td>17</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>18</th>\n", " <td>18</td>\n", " <td>1</td>\n", " <td>18</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>19</th>\n", " <td>19</td>\n", " <td>0</td>\n", " <td>19</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c agg_col1 agg_col2\n", "0 0 19 0 1 1\n", "1 1 18 1 0 0\n", "2 2 17 2 1 0\n", "3 3 16 3 0 1\n", "4 4 15 4 1 0\n", "5 5 14 5 0 0\n", "6 6 13 6 1 1\n", "7 7 12 7 0 0\n", "8 8 11 8 1 0\n", "9 9 10 9 0 1\n", "10 10 9 10 1 0\n", "11 11 8 11 0 0\n", "12 12 7 12 1 1\n", "13 13 6 13 0 0\n", "14 14 5 14 1 0\n", "15 15 4 15 0 1\n", "16 16 3 16 1 0\n", "17 17 2 17 0 0\n", "18 18 1 18 1 1\n", "19 19 0 19 0 0" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = 
cudf.read_parquet(\"example_output/temp_parquet\")\n", "df" ] }, { "cell_type": "markdown", "id": "e9a29874", "metadata": {}, "source": [ "Writing to parquet files from a `dask_cudf.DataFrame` using cuDF's parquet writer under the hood." ] }, { "cell_type": "code", "execution_count": 80, "id": "0c3db7b0", "metadata": {}, "outputs": [], "source": [ "ddf.to_parquet(\"example_output/ddf_parquet_files\")" ] }, { "cell_type": "markdown", "id": "90a49967", "metadata": {}, "source": [ "## Reading/Writing ORC Files" ] }, { "cell_type": "markdown", "id": "de9d03fa", "metadata": {}, "source": [ "Writing ORC files." ] }, { "cell_type": "code", "execution_count": 81, "id": "c387f8f2", "metadata": {}, "outputs": [], "source": [ "df.to_orc(\"example_output/temp_orc\")" ] }, { "cell_type": "markdown", "id": "242c32a2", "metadata": {}, "source": [ "Reading ORC files." ] }, { "cell_type": "code", "execution_count": 82, "id": "d4bab6da", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " <th>agg_col1</th>\n", " <th>agg_col2</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>19</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1</td>\n", " <td>18</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2</td>\n", " <td>17</td>\n", " <td>2</td>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>3</td>\n", " <td>16</td>\n", " <td>3</td>\n", " <td>0</td>\n", " <td>1</td>\n", " 
</tr>\n", " <tr>\n", " <th>4</th>\n", " <td>4</td>\n", " <td>15</td>\n", " <td>4</td>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>5</td>\n", " <td>14</td>\n", " <td>5</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>6</td>\n", " <td>13</td>\n", " <td>6</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>7</td>\n", " <td>12</td>\n", " <td>7</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>8</td>\n", " <td>11</td>\n", " <td>8</td>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>9</th>\n", " <td>9</td>\n", " <td>10</td>\n", " <td>9</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>10</th>\n", " <td>10</td>\n", " <td>9</td>\n", " <td>10</td>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>11</th>\n", " <td>11</td>\n", " <td>8</td>\n", " <td>11</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>12</th>\n", " <td>12</td>\n", " <td>7</td>\n", " <td>12</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>13</th>\n", " <td>13</td>\n", " <td>6</td>\n", " <td>13</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>14</th>\n", " <td>14</td>\n", " <td>5</td>\n", " <td>14</td>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>15</th>\n", " <td>15</td>\n", " <td>4</td>\n", " <td>15</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>16</th>\n", " <td>16</td>\n", " <td>3</td>\n", " <td>16</td>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>17</th>\n", " <td>17</td>\n", " <td>2</td>\n", " <td>17</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>18</th>\n", " <td>18</td>\n", " <td>1</td>\n", " <td>18</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>19</th>\n", " <td>19</td>\n", " <td>0</td>\n", " <td>19</td>\n", 
" <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c agg_col1 agg_col2\n", "0 0 19 0 1 1\n", "1 1 18 1 0 0\n", "2 2 17 2 1 0\n", "3 3 16 3 0 1\n", "4 4 15 4 1 0\n", "5 5 14 5 0 0\n", "6 6 13 6 1 1\n", "7 7 12 7 0 0\n", "8 8 11 8 1 0\n", "9 9 10 9 0 1\n", "10 10 9 10 1 0\n", "11 11 8 11 0 0\n", "12 12 7 12 1 1\n", "13 13 6 13 0 0\n", "14 14 5 14 1 0\n", "15 15 4 15 0 1\n", "16 16 3 16 1 0\n", "17 17 2 17 0 0\n", "18 18 1 18 1 1\n", "19 19 0 19 0 0" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = cudf.read_orc(\"example_output/temp_orc\")\n", "df2" ] }, { "cell_type": "markdown", "id": "c988553d", "metadata": {}, "source": [ "## Dask Performance Tips\n", "\n", "Like Apache Spark, Dask operations are [lazy](https://en.wikipedia.org/wiki/Lazy_evaluation). Instead of being executed immediately, most operations are added to a task graph and the actual evaluation is delayed until the result is needed.\n", "\n", "Sometimes, though, we want to force the execution of operations. Calling `persist` on a Dask collection fully computes it (or actively computes it in the background), persisting the result into memory. When we're using distributed systems, we may want to wait until `persist` is finished before beginning any downstream operations. We can enforce this contract by using `wait`. Wrapping an operation with `wait` will ensure it doesn't begin executing until all necessary upstream operations have finished.\n", "\n", "The snippets below provide basic examples, using `LocalCUDACluster` to create one dask-worker per GPU on the local machine. For more detailed information about `persist` and `wait`, please see the Dask documentation for [persist](https://docs.dask.org/en/latest/api.html#dask.persist) and [wait](https://docs.dask.org/en/latest/futures.html#distributed.wait). Wait relies on the concept of Futures, which is beyond the scope of this tutorial. 
For more information on Futures, see the Dask [Futures](https://docs.dask.org/en/latest/futures.html) documentation. For more information about multi-GPU clusters, please see the [dask-cuda](https://github.com/rapidsai/dask-cuda) library (documentation is in progress)." ] }, { "cell_type": "markdown", "id": "976a8dca", "metadata": {}, "source": [ "First, we set up a GPU cluster. With our `client` set up, Dask-cuDF computation will be distributed across the GPUs in the cluster." ] }, { "cell_type": "code", "execution_count": 83, "id": "39c82511", "metadata": {}, "outputs": [], "source": [ "import time\n", "\n", "from dask.distributed import Client, wait\n", "from dask_cuda import LocalCUDACluster\n", "\n", "cluster = LocalCUDACluster()\n", "client = Client(cluster)" ] }, { "cell_type": "markdown", "id": "819c2d92", "metadata": {}, "source": [ "### Persisting Data\n", "\n", "Next, we create our Dask-cuDF DataFrame and apply a transformation, storing the result as a new column." ] }, { "cell_type": "code", "execution_count": 84, "id": "f5c0ca87", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div><strong>Dask DataFrame Structure:</strong></div>\n", "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " </tr>\n", " <tr>\n", " <th>npartitions=16</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>int64</td>\n", " <td>int64</td>\n", " <td>int64</td>\n", " </tr>\n", " <tr>\n", " <th>625000</th>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " </tr>\n", " <tr>\n", " <th>...</th>\n", " 
<td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " </tr>\n", " <tr>\n", " <th>9375000</th>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " </tr>\n", " <tr>\n", " <th>9999999</th>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>\n", "<div>Dask Name: assign, 4 graph layers</div>" ], "text/plain": [ "<dask_cudf.DataFrame | 64 tasks | 16 npartitions>" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nrows = 10000000\n", "\n", "df2 = cudf.DataFrame({\"a\": cp.arange(nrows), \"b\": cp.arange(nrows)})\n", "ddf2 = dask_cudf.from_cudf(df2, npartitions=16)\n", "ddf2[\"c\"] = ddf2[\"a\"] + 5\n", "ddf2" ] }, { "cell_type": "code", "execution_count": 85, "id": "eec23c4d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mon Nov 14 03:05:08 2022 \n", "+-----------------------------------------------------------------------------+\n", "| NVIDIA-SMI 510.73.08 Driver Version: 510.73.08 CUDA Version: 11.6 |\n", "|-------------------------------+----------------------+----------------------+\n", "| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n", "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n", "| | | MIG M. |\n", "|===============================+======================+======================|\n", "| 0 Tesla V100-SXM2... On | 00000000:06:00.0 Off | 0 |\n", "| N/A 32C P0 55W / 300W | 4538MiB / 32768MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", "| 1 Tesla V100-SXM2... On | 00000000:07:00.0 Off | 0 |\n", "| N/A 32C P0 56W / 300W | 336MiB / 32768MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", "| 2 Tesla V100-SXM2... 
On | 00000000:0A:00.0 Off | 0 |\n", "| N/A 33C P0 55W / 300W | 336MiB / 32768MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", "| 3 Tesla V100-SXM2... On | 00000000:0B:00.0 Off | 0 |\n", "| N/A 31C P0 55W / 300W | 336MiB / 32768MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", "| 4 Tesla V100-SXM2... On | 00000000:85:00.0 Off | 0 |\n", "| N/A 32C P0 54W / 300W | 336MiB / 32768MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", "| 5 Tesla V100-SXM2... On | 00000000:86:00.0 Off | 0 |\n", "| N/A 33C P0 56W / 300W | 336MiB / 32768MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", "| 6 Tesla V100-SXM2... On | 00000000:89:00.0 Off | 0 |\n", "| N/A 35C P0 55W / 300W | 336MiB / 32768MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", "| 7 Tesla V100-SXM2... 
On | 00000000:8A:00.0 Off | 0 |\n", "| N/A 32C P0 54W / 300W | 336MiB / 32768MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", " \n", "+-----------------------------------------------------------------------------+\n", "| Processes: |\n", "| GPU GI CI PID Type Process name GPU Memory |\n", "| ID ID Usage |\n", "|=============================================================================|\n", "| 0 N/A N/A 57132 C .../python 333MiB |\n", "| 1 N/A N/A 57131 C .../python 333MiB |\n", "| 2 N/A N/A 57143 C .../python 333MiB |\n", "| 3 N/A N/A 57124 C .../python 333MiB |\n", "| 4 N/A N/A 57135 C .../python 333MiB |\n", "| 5 N/A N/A 57144 C .../python 333MiB |\n", "| 6 N/A N/A 57126 C .../python 333MiB |\n", "| 7 N/A N/A 57139 C .../python 333MiB |\n", "+-----------------------------------------------------------------------------+\n" ] } ], "source": [ "!nvidia-smi" ] }, { "cell_type": "markdown", "id": "578a1698", "metadata": {}, "source": [ "Because Dask is lazy, the computation has not yet occurred. We can see that there are sixty-four tasks in the task graph and we're using about 330 MB of device memory on each GPU. We can force computation by using `persist`. By forcing execution, the result is now explicitly in memory and our task graph only contains one task per partition (the baseline)." 
] }, { "cell_type": "code", "execution_count": 86, "id": "3de4c0cb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div><strong>Dask DataFrame Structure:</strong></div>\n", "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " </tr>\n", " <tr>\n", " <th>npartitions=16</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>int64</td>\n", " <td>int64</td>\n", " <td>int64</td>\n", " </tr>\n", " <tr>\n", " <th>625000</th>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " </tr>\n", " <tr>\n", " <th>...</th>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " </tr>\n", " <tr>\n", " <th>9375000</th>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " </tr>\n", " <tr>\n", " <th>9999999</th>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>\n", "<div>Dask Name: assign, 1 graph layer</div>" ], "text/plain": [ "<dask_cudf.DataFrame | 16 tasks | 16 npartitions>" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf2 = ddf2.persist()\n", "ddf2" ] }, { "cell_type": "code", "execution_count": 87, "id": "64c9f96c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mon Nov 14 03:05:15 2022 \n", "+-----------------------------------------------------------------------------+\n", "| NVIDIA-SMI 510.73.08 Driver Version: 510.73.08 CUDA Version: 11.6 |\n", "|-------------------------------+----------------------+----------------------+\n", "| GPU Name 
Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n", "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n", "| | | MIG M. |\n", "|===============================+======================+======================|\n", "| 0 Tesla V100-SXM2... On | 00000000:06:00.0 Off | 0 |\n", "| N/A 32C P0 55W / 300W | 4900MiB / 32768MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", "| 1 Tesla V100-SXM2... On | 00000000:07:00.0 Off | 0 |\n", "| N/A 32C P0 56W / 300W | 698MiB / 32768MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", "| 2 Tesla V100-SXM2... On | 00000000:0A:00.0 Off | 0 |\n", "| N/A 33C P0 55W / 300W | 698MiB / 32768MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", "| 3 Tesla V100-SXM2... On | 00000000:0B:00.0 Off | 0 |\n", "| N/A 32C P0 55W / 300W | 698MiB / 32768MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", "| 4 Tesla V100-SXM2... On | 00000000:85:00.0 Off | 0 |\n", "| N/A 32C P0 55W / 300W | 698MiB / 32768MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", "| 5 Tesla V100-SXM2... On | 00000000:86:00.0 Off | 0 |\n", "| N/A 33C P0 56W / 300W | 698MiB / 32768MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", "| 6 Tesla V100-SXM2... On | 00000000:89:00.0 Off | 0 |\n", "| N/A 35C P0 55W / 300W | 698MiB / 32768MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", "| 7 Tesla V100-SXM2... 
On | 00000000:8A:00.0 Off | 0 |\n", "| N/A 32C P0 54W / 300W | 698MiB / 32768MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", " \n", "+-----------------------------------------------------------------------------+\n", "| Processes: |\n", "| GPU GI CI PID Type Process name GPU Memory |\n", "| ID ID Usage |\n", "|=============================================================================|\n", "| 0 N/A N/A 57132 C .../python 695MiB |\n", "| 1 N/A N/A 57131 C .../python 695MiB |\n", "| 2 N/A N/A 57143 C .../python 695MiB |\n", "| 3 N/A N/A 57124 C .../python 695MiB |\n", "| 4 N/A N/A 57135 C .../python 695MiB |\n", "| 5 N/A N/A 57144 C .../python 695MiB |\n", "| 6 N/A N/A 57126 C .../python 695MiB |\n", "| 7 N/A N/A 57139 C .../python 695MiB |\n", "+-----------------------------------------------------------------------------+\n" ] } ], "source": [ "# Sleep to ensure the persist finishes and shows in the memory usage\n", "!sleep 5; nvidia-smi" ] }, { "cell_type": "markdown", "id": "154d699f", "metadata": {}, "source": [ "Because we forced computation, we now have a larger object in distributed GPU memory. Note that actual numbers will differ between systems (for example depending on how many devices are available)." ] }, { "cell_type": "markdown", "id": "f45064d7", "metadata": {}, "source": [ "### Wait\n", "Depending on our workflow or distributed computing setup, we may want to `wait` until all upstream tasks have finished before proceeding with a specific function. This section shows an example of this behavior, adapted from the Dask documentation.\n", "\n", "First, we create a new Dask DataFrame and define a function that we'll map to every partition in the dataframe." 
] }, { "cell_type": "code", "execution_count": 88, "id": "a021a726", "metadata": {}, "outputs": [], "source": [ "import random\n", "\n", "nrows = 10000000\n", "\n", "df1 = cudf.DataFrame({\"a\": cp.arange(nrows), \"b\": cp.arange(nrows)})\n", "ddf1 = dask_cudf.from_cudf(df1, npartitions=100)\n", "\n", "\n", "def func(df):\n", "    time.sleep(random.randint(1, 10))\n", "    return (df + 5) * 3 - 11" ] }, { "cell_type": "markdown", "id": "93a3ee73", "metadata": {}, "source": [ "This function applies a basic transformation to every column in the dataframe, but the time spent in the function varies because the `time.sleep` call adds a random delay of 1 to 10 seconds. We'll run this on every partition of our dataframe using `map_partitions`, which adds the task to our task graph, and store the result. We can then call `persist` to force execution." ] }, { "cell_type": "code", "execution_count": 89, "id": "8f091ada", "metadata": {}, "outputs": [], "source": [ "results_ddf = ddf2.map_partitions(func)\n", "results_ddf = results_ddf.persist()" ] }, { "cell_type": "markdown", "id": "3c22a1e8", "metadata": {}, "source": [ "However, some partitions will finish **much** sooner than others. If we have downstream processes that must not start until all partitions are complete, we can enforce that behavior using `wait`."
] }, { "cell_type": "code", "execution_count": 90, "id": "fea52d0f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DoneAndNotDoneFutures(done={<Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-4a955f5e5fda88923d28b45196632826', 13)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-4a955f5e5fda88923d28b45196632826', 1)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-4a955f5e5fda88923d28b45196632826', 3)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-4a955f5e5fda88923d28b45196632826', 8)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-4a955f5e5fda88923d28b45196632826', 10)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-4a955f5e5fda88923d28b45196632826', 7)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-4a955f5e5fda88923d28b45196632826', 5)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-4a955f5e5fda88923d28b45196632826', 6)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-4a955f5e5fda88923d28b45196632826', 14)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-4a955f5e5fda88923d28b45196632826', 12)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-4a955f5e5fda88923d28b45196632826', 9)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-4a955f5e5fda88923d28b45196632826', 11)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-4a955f5e5fda88923d28b45196632826', 2)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-4a955f5e5fda88923d28b45196632826', 4)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-4a955f5e5fda88923d28b45196632826', 15)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-4a955f5e5fda88923d28b45196632826', 0)>}, not_done=set())" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"wait(results_ddf)" ] }, { "cell_type": "markdown", "id": "db619bec", "metadata": {}, "source": [ "Once `wait` returns, every partition has finished computing and we can safely proceed with our workflow." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" }, "vscode": { "interpreter": { "hash": "8056d08c5310318d9ca4fe60778daf853f02695d9fa19fd0f51ce5f8b089487a" } } }, "nbformat": 4, "nbformat_minor": 5 }