Utilizing OpenAI and PandasAI for Sequence Operations | by Michael B Walker

Incorporate pure language queries and operations into your Python information cleansing workflow.

Crimson panda drawing donated by Karen Walker, the artist.

Most of the sequence operations we have to do in our pandas information cleansing initiatives could be assisted by AI instruments, together with by PandasAI. PandasAI takes benefit of enormous language fashions, comparable to that from OpenAI, to allow pure language queries and operations on information columns. On this put up, we look at how you can use PandasAI to question Sequence values, create new Sequence, set Sequence values conditionally, and reshape our information.

You may set up PandasAI by getting into pip set up pandasai right into a terminal or into Home windows Powershell. Additionally, you will must get a token from openai.com to ship a request to the OpenAI API.

Because the PandasAI library is creating quickly, you may anticipate totally different outcomes relying on the variations of PandasAI and pandas you might be utilizing. On this article, I exploit model 1.4.8 of PandasAI and model 1.5.3 of pandas.

We are going to work with information from the Nationwide Longitudinal Examine of Youth (NLS) performed by the US Bureau of Labor Statistics. The NLS has surveyed the identical cohort of highschool college students for over 25 years, and has helpful information gadgets on instructional outcomes and weeks labored for every of these years, amongst many different variables. It’s out there for public use at nlsinfo.org. (The NLS public releases are lined by the US authorities Open Information Coverage, which allows each non-commercial and industrial use.)

We may also work with COVID-19 information supplied by Our World in Information. That dataset has one row per nation per day with variety of new circumstances and new deaths. This dataset is obtainable for obtain at ourworldindata.org/covid-cases, with a Inventive Commons CC BY 4.0 license. You too can obtain all code and information used on this put up from GitHub.

We begin by importing the OpenAI and SmartDataframe modules from PandasAI. We additionally must instantiate an llm object:

import pandas as pd
from pandasai.llm.openai import OpenAI
from pandasai import SmartDataframe
llm = OpenAI(api_token="Your API Token")

Subsequent, we load the DataFrames we will likely be utilizing and create a SmartDataframe object from the NLS pandas DataFrame:

covidcases = pd.read_csv("information/covidcases.csv")
nls97 = pd.read_csv("information/nls97f.csv")
nls97.set_index("personid", inplace=True)
nls97sdf = SmartDataframe(nls97, config={"llm": llm})

Now we’re able to generate abstract statistics on Sequence from our SmartDataframe. We are able to ask for the typical for a single Sequence, or for a number of Sequence:

nls97sdf.chat("Present common of gpaoverall")2.8184077281812128
nls97sdf.chat("Present common for every weeks labored column")
Common Weeks Labored
0
weeksworked00 26.42
weeksworked01 29.78
weeksworked02 31.83
weeksworked03 33.51
weeksworked04 35.10
weeksworked05 37.34
weeksworked06 38.44
weeksworked07 39.29
weeksworked08 39.33
weeksworked09 37.51
weeksworked10 37.12
weeksworked11 38.06
weeksworked12 38.15
weeksworked13 38.79
weeksworked14 38.73
weeksworked15 39.67
weeksworked16 40.19
weeksworked17 40.37
weeksworked18 40.01
weeksworked19 41.22
weeksworked20 38.35
weeksworked21 36.17
weeksworked22 11.43

We are able to additionally summarize Sequence values by one other Sequence, normally one that’s categorical:

nls97sdf.chat("Present satmath common by gender")Feminine   Male
0  486.65 516.88

We are able to additionally create a brand new Sequence with the chat methodology of SmartDataframe. We don’t want to make use of the precise column names. For instance, PandasAI will determine that we wish the childathome Sequence after we write baby at dwelling:

nls97sdf = nls97sdf.chat("Set childnum to baby at dwelling plus baby not at
dwelling")
nls97sdf[['childnum','childathome','childnotathome']].
pattern(5, random_state=1)childnum  childathome  childnotathome
personid                                       
211230        2.00         2.00            0.00
990746        3.00         3.00            0.00
308169        3.00         1.00            2.00
798458         NaN          NaN             NaN
312009         NaN          NaN             NaN

We are able to use the chat methodology to create Sequence values conditionally:

nls97sdf = nls97sdf.chat("evermarried is 'No' when maritalstatus is
'By no means-married', else 'Sure'")
nls97sdf.groupby(['evermarried','maritalstatus']).dimension()evermarried  maritalstatus
No           By no means-married    2767
Sure          Divorced          669
Married          3068
Separated         148
Widowed            23
dtype: int64

PandasAI is kind of versatile relating to the language you may use right here. For instance, the next gives the identical outcomes:

nls97sdf = nls97sdf.chat("if maritalstatus is 'By no means-married' set
evermarried2 to 'No', in any other case 'Sure'")
nls97sdf.groupby(['evermarried2','maritalstatus']).dimension()evermarried2  maritalstatus
No            By no means-married    2767
Sure           Divorced          669
Married          3068
Separated         148
Widowed            23
dtype: int64

We are able to do calculations throughout various equally named columns:

nls97sdf = nls97sdf.chat("Set weeksworked for every row to the typical of
all weeksworked columns for that row")

This can calculate the typical of all weeksworked00-weeksworked22 columns and assign that to a brand new column referred to as weeksworked.

We are able to simply impute values the place they’re lacking primarily based on abstract statistics:

nls97sdf.gpaenglish.describe()rely   5,798
imply      273
std        74
min         0
25%       227
50%       284
75%       323
max       418
Title: gpaenglish, dtype: float64
nls97sdf = nls97sdf.chat("set lacking gpaenglish to the typical")
nls97sdf.gpaenglish.describe()
rely   8,984
imply      273
std        59
min         0
25%       264
50%       273
75%       298
max       418
Title: gpaenglish, dtype: float64

We are able to additionally use PandasAI to do some reshaping. Recall that the COVID-19 case information has new circumstances for every day for every nation. Let’s say we solely need the primary row of knowledge for every nation. We are able to do this the normal manner with drop_duplicates:

firstcase = covidcases.
sort_values(['location','casedate']).
drop_duplicates(['location'], hold='first')firstcase.set_index('location', inplace=True)
firstcase.form
(231, 67)
firstcase[['iso_code','continent','casedate',
'total_cases','new_cases']].head(2).T
location    Afghanistan     Albania
iso_code            AFG         ALB
continent          Asia      Europe
casedate     2020-03-01  2020-03-15
total_cases        1.00       33.00
new_cases          1.00       33.00

We are able to get the identical outcomes by making a SmartDataframe and utilizing the chat methodology. The pure language I exploit right here is remarkably easy, Present first casedate and placement and different values for every nation:

covidcasessdf = SmartDataframe(covidcases, config={"llm": llm})
firstcasesdf = covidcasessdf.chat("Present first casedate and placement and
different values for every nation.")firstcasesdf.form
(231, 7)
firstcasesdf[['location','continent','casedate',
'total_cases','new_cases']].head(2).T
iso_code               ABW          AFG
location             Aruba  Afghanistan
continent    North America         Asia
casedate        2020-03-22   2020-03-01
total_cases           5.00         1.00
new_cases             5.00         1.00

Discover that PandasAI makes good selections concerning the columns to get. We get the columns we want somewhat than all of them. We may have additionally simply handed the names of the columns we needed to chat. (PandasAI sorted the rows by iso_code, somewhat than by location, which is why the primary row is totally different.)

A lot of the work when utilizing PandasAI is admittedly simply importing the related libraries and instantiating massive language mannequin and SmartDataframe objects. As soon as that’s finished, easy sentences despatched to the chat methodology of the SmartDataframe are adequate to summarize Sequence values and create new Sequence.

PandasAI excels at producing easy statistics from Sequence. We don’t even want to recollect the Sequence identify precisely. Usually the pure language we would use could be extra intuitive than conventional pandas strategies like groupby. The Present satmath common by gender worth handed to chat is an effective instance of that.

Operations on Sequence, together with the creation of a brand new Sequence, can also be fairly easy. We created a complete variety of youngsters Sequence (childnum) by instructing the SmartDataframe so as to add the variety of youngsters residing at dwelling to the variety of youngsters not residing at dwelling. We didn’t even present the literal Sequence names, childathome and childnotathome respectively. PandasAI found out what we meant.

Since we’re passing pure language directions to chat for our Sequence operations, there isn’t a one proper strategy to get what we wish. For instance, we get the identical outcome after we handed evermarried is ‘No’ when maritalstatus is ‘By no means-married’, else ‘Sure’ to chat as we did with if maritalstatus is ‘By no means-married’ set evermarried2 to ‘No’, in any other case ‘Sure’.

We are able to additionally do pretty in depth DataFrame reshaping with easy pure language directions, as within the final command we supplied. We add and different values to the directions to get columns aside from casedate. PandasAI additionally figures out that location is smart because the index.

You may learn extra about how you can use PandasAI and SmartDataframes right here:

Or within the second version of my e book, Python Information Cleansing Cookbook:

Good luck together with your information cleansing and I might love to listen to how issues are going!

Utilizing OpenAI and PandasAI for Sequence Operations | by Michael B Walker | Jun, 2024

Incorporate pure language queries and operations into your Python information cleansing workflow.

Leave a Reply Cancel reply

Latest News

5 Uncommon Platforms That Can Improve The EdTech Expertise

Epic says its EU iOS app retailer is authorised however that Apple needs a change

Safeguarding Healthcare AI: Exposing and Addressing LLM Manipulation Dangers

Harvester pulls 1.5 gallons of ingesting water from arid air per day

AI Century Tech is at the forefront of AI innovation, driving the future with cutting-edge technology and groundbreaking AI solutions.

Quick Link

Top Categories

Sign Up for Our Newsletter

Incorporate pure language queries and operations into your Python information cleansing workflow.

You Might Also Like

Sign Up For Daily Newsletter

Be keep up! Get the latest breaking news delivered straight to your inbox.

Leave a Reply Cancel reply

Latest News

Sign Up for Our Newsletter