How To Delete A Blog From Tumblr
TL;DR
A lot of effort to find a marginally more efficient solution. Difficult to justify the added complexity while sacrificing the simplicity of df.drop(dlst, 1, errors='ignore')
df.reindex_axis(np.setdiff1d(df.columns.values, dlst), 1)
Preamble
Deleting a column is semantically the same as selecting the other columns. I'll show a few additional methods to consider.
I'll also focus on the general solution of deleting multiple columns at once and allowing for the attempt to delete columns not present.
Using these solutions are general and will work for the simple case as well.
Setup
Consider the pd.DataFrame
df
and list to delete dlst
df = pd.DataFrame(dict(zip('ABCDEFGHIJ', range(1, 11))), range(3)) dlst = list('HIJKLM')
df A B C D E F G H I J 0 1 2 3 4 5 6 7 8 9 10 1 1 2 3 4 5 6 7 8 9 10 2 1 2 3 4 5 6 7 8 9 10
dlst ['H', 'I', 'J', 'K', 'L', 'M']
The result should look like:
df.drop(dlst, 1, errors='ignore') A B C D E F G 0 1 2 3 4 5 6 7 1 1 2 3 4 5 6 7 2 1 2 3 4 5 6 7
Since I'm equating deleting a column to selecting the other columns, I'll break it into two types:
- Label selection
- Boolean selection
Label Selection
We start by manufacturing the list/array of labels that represent the columns we want to keep and without the columns we want to delete.
-
df.columns.difference(dlst)
Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')
-
np.setdiff1d(df.columns.values, dlst)
array(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype=object)
-
df.columns.drop(dlst, errors='ignore')
Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')
-
list(set(df.columns.values.tolist()).difference(dlst))
# does not preserve order ['E', 'D', 'B', 'F', 'G', 'A', 'C']
-
[x for x in df.columns.values.tolist() if x not in dlst]
['A', 'B', 'C', 'D', 'E', 'F', 'G']
Columns from Labels
For the sake of comparing the selection process, assume:
cols = [x for x in df.columns.values.tolist() if x not in dlst]
Then we can evaluate
-
df.loc[:, cols]
-
df[cols]
-
df.reindex(columns=cols)
-
df.reindex_axis(cols, 1)
Which all evaluate to:
A B C D E F G 0 1 2 3 4 5 6 7 1 1 2 3 4 5 6 7 2 1 2 3 4 5 6 7
Boolean Slice
We can construct an array/list of booleans for slicing
-
~df.columns.isin(dlst)
-
~np.in1d(df.columns.values, dlst)
-
[x not in dlst for x in df.columns.values.tolist()]
-
(df.columns.values[:, None] != dlst).all(1)
Columns from Boolean
For the sake of comparison
bools = [x not in dlst for x in df.columns.values.tolist()]
-
df.loc[: bools]
Which all evaluate to:
A B C D E F G 0 1 2 3 4 5 6 7 1 1 2 3 4 5 6 7 2 1 2 3 4 5 6 7
Robust Timing
Functions
setdiff1d = lambda df, dlst: np.setdiff1d(df.columns.values, dlst) difference = lambda df, dlst: df.columns.difference(dlst) columndrop = lambda df, dlst: df.columns.drop(dlst, errors='ignore') setdifflst = lambda df, dlst: list(set(df.columns.values.tolist()).difference(dlst)) comprehension = lambda df, dlst: [x for x in df.columns.values.tolist() if x not in dlst] loc = lambda df, cols: df.loc[:, cols] slc = lambda df, cols: df[cols] ridx = lambda df, cols: df.reindex(columns=cols) ridxa = lambda df, cols: df.reindex_axis(cols, 1) isin = lambda df, dlst: ~df.columns.isin(dlst) in1d = lambda df, dlst: ~np.in1d(df.columns.values, dlst) comp = lambda df, dlst: [x not in dlst for x in df.columns.values.tolist()] brod = lambda df, dlst: (df.columns.values[:, None] != dlst).all(1)
Testing
res1 = pd.DataFrame( index=pd.MultiIndex.from_product([ 'loc slc ridx ridxa'.split(), 'setdiff1d difference columndrop setdifflst comprehension'.split(), ], names=['Select', 'Label']), columns=[10, 30, 100, 300, 1000], dtype=float ) res2 = pd.DataFrame( index=pd.MultiIndex.from_product([ 'loc'.split(), 'isin in1d comp brod'.split(), ], names=['Select', 'Label']), columns=[10, 30, 100, 300, 1000], dtype=float ) res = res1.append(res2).sort_index() dres = pd.Series(index=res.columns, name='drop') for j in res.columns: dlst = list(range(j)) cols = list(range(j // 2, j + j // 2)) d = pd.DataFrame(1, range(10), cols) dres.at[j] = timeit('d.drop(dlst, 1, errors="ignore")', 'from __main__ import d, dlst', number=100) for s, l in res.index: stmt = '{}(d, {}(d, dlst))'.format(s, l) setp = 'from __main__ import d, dlst, {}, {}'.format(s, l) res.at[(s, l), j] = timeit(stmt, setp, number=100) rs = res / dres
rs 10 30 100 300 1000 Select Label loc brod 0.747373 0.861979 0.891144 1.284235 3.872157 columndrop 1.193983 1.292843 1.396841 1.484429 1.335733 comp 0.802036 0.732326 1.149397 3.473283 25.565922 comprehension 1.463503 1.568395 1.866441 4.421639 26.552276 difference 1.413010 1.460863 1.587594 1.568571 1.569735 in1d 0.818502 0.844374 0.994093 1.042360 1.076255 isin 1.008874 0.879706 1.021712 1.001119 0.964327 setdiff1d 1.352828 1.274061 1.483380 1.459986 1.466575 setdifflst 1.233332 1.444521 1.714199 1.797241 1.876425 ridx columndrop 0.903013 0.832814 0.949234 0.976366 0.982888 comprehension 0.777445 0.827151 1.108028 3.473164 25.528879 difference 1.086859 1.081396 1.293132 1.173044 1.237613 setdiff1d 0.946009 0.873169 0.900185 0.908194 1.036124 setdifflst 0.732964 0.823218 0.819748 0.990315 1.050910 ridxa columndrop 0.835254 0.774701 0.907105 0.908006 0.932754 comprehension 0.697749 0.762556 1.215225 3.510226 25.041832 difference 1.055099 1.010208 1.122005 1.119575 1.383065 setdiff1d 0.760716 0.725386 0.849949 0.879425 0.946460 setdifflst 0.710008 0.668108 0.778060 0.871766 0.939537 slc columndrop 1.268191 1.521264 2.646687 1.919423 1.981091 comprehension 0.856893 0.870365 1.290730 3.564219 26.208937 difference 1.470095 1.747211 2.886581 2.254690 2.050536 setdiff1d 1.098427 1.133476 1.466029 2.045965 3.123452 setdifflst 0.833700 0.846652 1.013061 1.110352 1.287831
fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharey=True) for i, (n, g) in enumerate([(n, g.xs(n)) for n, g in rs.groupby('Select')]): ax = axes[i // 2, i % 2] g.plot.bar(ax=ax, title=n) ax.legend_.remove() fig.tight_layout()
This is relative to the time it takes to run df.drop(dlst, 1, errors='ignore')
. It seems like after all that effort, we only improve performance modestly.
If fact the best solutions use reindex
or reindex_axis
on the hack list(set(df.columns.values.tolist()).difference(dlst))
. A close second and still very marginally better than drop
is np.setdiff1d
.
rs.idxmin().pipe( lambda x: pd.DataFrame( dict(idx=x.values, val=rs.lookup(x.values, x.index)), x.index ) ) idx val 10 (ridx, setdifflst) 0.653431 30 (ridxa, setdifflst) 0.746143 100 (ridxa, setdifflst) 0.816207 300 (ridx, setdifflst) 0.780157 1000 (ridxa, setdifflst) 0.861622
How To Delete A Blog From Tumblr
Source: https://stackoverflow.com/questions/13411544/delete-a-column-from-a-pandas-dataframe
Posted by: guywithed.blogspot.com
0 Response to "How To Delete A Blog From Tumblr"
Post a Comment