@iacobo iacobo commented Aug 2, 2019

If one sequence contains only a subset of the letters (e.g. only CGT), the old code assigns different integers to its letters than it does for sequences with a different set of letters (e.g. ACG or ACGT), because the encoder is refit on each sequence. This is not a problem for long sequences, which are very unlikely to be missing any of the 4 letters, but it is very likely to produce errors when applied to short k-mers.

i.e. integer_encoder.fit_transform(list('CGT')) and integer_encoder.fit_transform(list('ACGT')) do not map C, G and T to the same integers.

Solution: collect the alphabet of all letters across all sequences first, fit the encoder on it once, then transform each sequence in the loop.
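A minimal, dependency-free sketch of the problem and the proposed fix. It assumes the encoder assigns integers in sorted order of the labels it was fitted on, as scikit-learn's LabelEncoder does; the helper names `fit_alphabet` and `encode` and the example sequences are hypothetical and do not come from the patched code.

```python
def fit_alphabet(sequences):
    """Map every letter seen in any of the sequences to an integer, in sorted order."""
    alphabet = sorted(set().union(*sequences))
    return {letter: index for index, letter in enumerate(alphabet)}


def encode(sequence, mapping):
    """Translate a sequence into a list of integers using a fixed mapping."""
    return [mapping[letter] for letter in sequence]


# Hypothetical input: the first sequence is missing the letter 'A'.
sequences = ["CGT", "ACGT", "ACG"]

# Old behaviour: the mapping is refit on each sequence separately,
# so 'C' becomes 0 in "CGT" but 1 in "ACGT".
per_sequence = [encode(seq, fit_alphabet([seq])) for seq in sequences]

# Proposed behaviour: fit once on the alphabet of all sequences,
# then reuse that single mapping for every sequence.
shared_mapping = fit_alphabet(sequences)
consistent = [encode(seq, shared_mapping) for seq in sequences]

print(per_sequence)  # [[0, 1, 2], [0, 1, 2, 3], [0, 1, 2]]
print(consistent)    # [[1, 2, 3], [0, 1, 2, 3], [0, 1, 2]]
```

With a scikit-learn LabelEncoder the same idea would presumably be a single call to fit on the combined alphabet before the loop, followed by transform (rather than fit_transform) on each sequence inside it.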
