@JanBenisek
Contributor

Solves #53 by improving the speed of the train/selection/validation split

  • I removed the dependency on sklearn's train_test_split because it involved too much splitting and merging of dataframes
  • as a consequence, I dropped the stratify_split option
    • I hope this is temporary; it gives us a quick fix, and hopefully we can reintroduce the option later
  • I kept the tests; they are fine (only a small modification was made)
  • I updated the documentation
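
To illustrate the idea of splitting without train_test_split, here is a minimal sketch of one sklearn-free approach: build a list of split labels in the requested proportions, shuffle it once, and attach it as a column. This is not the code merged in this PR; the function name `label_split`, the `split` column, and the `seed` parameter are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd


def label_split(data, train_prop=0.6, selection_prop=0.2,
                validation_prop=0.2, seed=None):
    """Hypothetical helper: tag each row as train/selection/validation."""
    if not np.isclose(train_prop + selection_prop + validation_prop, 1.0):
        raise ValueError("the three proportions must sum to 1")

    n_rows = len(data)
    n_selection = int(selection_prop * n_rows)
    n_validation = int(validation_prop * n_rows)
    # Rows lost to rounding go to the training set.
    n_train = n_rows - n_selection - n_validation

    labels = np.array(
        ["train"] * n_train
        + ["selection"] * n_selection
        + ["validation"] * n_validation,
        dtype=object,
    )
    # A single in-place shuffle replaces repeated splitting and merging.
    rng = np.random.default_rng(seed)
    rng.shuffle(labels)

    result = data.copy()  # leave the caller's dataframe untouched
    result["split"] = labels
    return result
```

Because the labels are shuffled without looking at the target, a sketch like this cannot stratify, which matches the trade-off described above.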

I tested the solution's efficiency on the earnings dataset (the one we used at Data Science meetup Leuven)

  • current solution:
%timeit train_selection_validation_split(data,
              target_column_name=target_column_name,
              train_prop=0.6,
              selection_prop=0.2,
              validation_prop=0.2)
# 7.29 ms ± 821 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
  • proposed solution:
%timeit train_selection_validation_split(data=data,
              train_prop=0.6,
              selection_prop=0.2,
              validation_prop=0.2)
# 109 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

I also compared memory consumption (using memory_profiler):

  • current solution: 111.1 MiB
  • proposed solution: 77.4 MiB
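
For anyone who wants a quick memory comparison without installing memory_profiler, a rough sketch using the standard library's tracemalloc is below. Note the swap: tracemalloc only tracks allocations made through Python's allocator, not the whole process, so its figures will not match the memory_profiler numbers above. The helper name `peak_memory_mib` is hypothetical.

```python
import tracemalloc


def peak_memory_mib(func, *args, **kwargs):
    """Hypothetical helper: run func and report peak traced memory in MiB."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak_bytes / (1024 * 1024)


# Usage: wrap any callable, for example a split function under test.
result, peak = peak_memory_mib(sorted, list(range(100_000)))
print(f"peak traced memory: {peak:.2f} MiB")
```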

@JanBenisek JanBenisek added this to the v1.0.2 milestone Mar 19, 2021
@JanBenisek JanBenisek linked an issue Mar 19, 2021 that may be closed by this pull request
Contributor

@sborms sborms left a comment

Looks good. Nice speed and memory consumption improvements! The code simplification is also a plus.

@sborms sborms merged commit f5b815d into develop Mar 22, 2021
@sborms sborms deleted the feature/faster_split branch March 22, 2021 13:47
Development

Successfully merging this pull request may close these issues.

Improve speed of train/selection/validation split function