comp.lang.ada
From: Laurent <lutgenl@icloud.com>
Subject: Re: Some advice required [OT]
Date: Tue, 28 Dec 2021 10:19:37 -0800 (PST)
Message-ID: <e906a70c-6550-45b9-9eca-eae4133eba7fn@googlegroups.com>
In-Reply-To: <87mtkk96in.fsf@bsb.me.uk>

On Tuesday, 28 December 2021 at 14:57:22 UTC+1, Ben Bacarisse wrote:
> Laurent <lut...@icloud.com> writes: 
> 
> > On Tuesday, 28 December 2021 at 10:05:50 UTC+1, Laurent wrote: 
> >> On Tuesday, 28 December 2021 at 08:48:32 UTC+1, Laurent wrote: 
> >> > On Tuesday, 28 December 2021 at 01:29:57 UTC+1, Ben Bacarisse wrote: 
> >> > > Laurent <lut...@icloud.com> writes: 
> >> > > 
> >> > > > On Monday, 27 December 2021 at 21:49:18 UTC+1, Ben Bacarisse wrote: 
> >> > > >> Laurent <lut...@icloud.com> writes: 
> >> > > >> 
> >> > > >> > On Monday, 27 December 2021 at 14:14:42 UTC+1, Ben Bacarisse wrote: 
> >> > > >> >> Laurent <lut...@icloud.com> writes: 
> >> > > >> >> 
> >> > > >> >> > On Monday, 27 December 2021 at 12:16:27 UTC+1, Niklas Holsti wrote: 
> >> > > >> >> > 
> >> > > >> >> >> Sorry, but I found your problem description impossible to understand. 
> >> > > >> >> >> Try to describe more clearly the experiment that is done, the structure 
> >> > > >> >> >> of the data the experiment provides (the meaning of the Excel rows and 
> >> > > >> >> >> columns), and the statistic you want to compute. 
> >> > > >> >> > 
> >> > > >> >> > Sorry tried to keep it short, was too short. 
> >> > > >> >> > 
> >> > > >> >> > Columns are the antimicrobial drugs 
> >> > > >> >> > Rows are the microorganism. 
> >> > > >> >> > 
> >> > > >> >> > So every cell contains a result of S, I, R or simply an empty cell 
> >> > > >> >> > 
> >> > > >> >> > S = Sensible 
> >> > > >> >> > I = Intermediate 
> >> > > >> >> > R = Resistant 
> >> > > >> >> > 
> >> > > >> >> > empty cell <S<I<R 
> >> > > >> >> > 
> >> > > >> >> > If a patient has 3 strains of the same microorganism but with 
> >> > > >> >> > different resistance profiles I have to find the most resistant 
> >> > > >> >> > one. Or if they are different I keep them all. 
> >> > > >> >> > 
> >> > > >> >> > I have no idea how to explain what I am doing to the compiler. 
> >> > > >> >> I think when you can explain it to people, you'll be able to code it. I 
> >> > > >> >> am still struggling to understand what you need. 
> >> > > >> >> > Why I would choose result from strain B over the result from strain A. 
> >> > > >> >> > 
> >> > > >> >> > strain A: SSSRSS 
> >> > > >> >> > strain B: SSRRRS 
> >> > > >> >> Let's space it out 
> >> > > >> >> 
> >> > > >> >>           drug 1  drug 2  drug 3  drug 4  drug 5  drug 6
> >> > > >> >> strain A    S       S       S       R       S       S
> >> > > >> >> strain B    S       S       R       R       R       S
> >> > > >> >> 
> >> > > >> >> You want to choose B because it is resistant to more drugs, yes? 
> >> > > >> >> 
> >> > > >> > 
> >> > > >> > Yes indeed 
> >> > > >> > 
> >> > > >> >> I think, from the ordering you give, you need a measure that treats an R 
> >> > > >> >> as "more important" than any "I" which is "more important" than an "S". 
> >> > > >> >> (We will come to empty cells later.) 
> >> > > >> >> 
> >> > > >> >> I think you need to treat the number of Rs, Is and Ss like digits in a 
> >> > > >> >> number. In base 10, the strains score 
> >> > > >> >> 
> >> > > >> >>           R S I
> >> > > >> >> strain A  1 5 0 = 150
> >> > > >> >> strain B  3 3 0 = 330
> >> > > >> >> 
> >> > > >> >> Now, in fact, you don't need to use base 10. The smallest base you can 
> >> > > >> >> use is one more than the maximum number of test results. If there can 
> >> > > >> >> be up to 16 tests (say) the score is 
> >> > > >> >> 
> >> > > >> >> n(R)*17*17 + n(S)*17 + n(I). 
> >> > > >> >> 
> >> > > >> >> If this suits your needs, we can consider empty cells later on. It's 
> >> > > >> >> not at all clear to me how to compare 
> >> > > >> >> 
> >> > > >> >> strain C R____ 
> >> > > >> >> strain D RRSSSS 
> >> > > >> >> 
> >> > > >> >> Strain C is "less resistant" but only because there is not enough 
> >> > > >> >> information. In fact it seems more serious as it is resistant to all 
> >> > > >> >> tested drugs. 
> >> > > >> >> 
> >> > > >> > 
> >> > > >> > Strain C is probably garbage and I would remove it. With a bit of luck I will have the result with the same sample Id which would be complete. 
> >> > > >> > 
> >> > > >> >> And then what about 
> >> > > >> >> 
> >> > > >> >> strain D SR 
> >> > > >> >> strain E RS 
> >> > > >> >> 
> >> > > >> > 
> >> > > >> > Yes those are the cases which are annoying me. 
> >> > > >> > 
> >> > > >> > That's why I came up with the idea of multiplying the value of the result 
> >> > > >> > (S=1, I=2 and R=3) with the position of the value. Tried it with 
> >> > > >> > triplets but there will still be cases where different results will 
> >> > > >> > give the same numeric value. Ignoring empty cells for the moment. 
> >> > > >> > 
> >> > > >> > Strain F: SSR (1*1+2*1+3*3) =12 and Strain G: RRS (1*3+ 2*3+3*1) = 12 
> >> > > >> > will be the same numerical value but they are different resistance 
> >> > > >> > profiles I would in this case keep both. 
> >> > > >> > 
> >> > > >> > How to prevent that from happening. 
> >> > > >> Can you first say why the suggestion I made is not helpful? 
> >> > > >> 
> >> > > >> -- 
> >> > > >> Ben. 
> >> > > > 
> >> > > > You mean that one: 
> >> > > > 
> >> > > >> >> I think you need to treat the number of Rs, Is and Ss like digits in a 
> >> > > >> >> number. In base 10, the strains score 
> >> > > >> >> 
> >> > > >> >>           R S I
> >> > > >> >> strain A  1 5 0 = 150
> >> > > >> >> strain B  3 3 0 = 330
> >> > > >> >> 
> >> > > > 
> >> > > > Different resistance profiles same result: 
> >> > > I don't yet understand the requirements so I am taking it in stages. 
> >> > > The first requirement seemed to be "more or less resistant". To do that 
> >> > > you can use digits in a large enough base but this will make the number 
> >> > > of Rs, Ss and Is paramount. Is that acceptable as a first step? 
> >> > > 
> >> > The requirements are one strain of a certain microorganism/patient 
> >> > The most resistant one or if they have different profiles 
> >> > 
> >> > SRS vs RRS => last one, more Rs 
> >> > 
> >> > SRS vs RSR = both, different profiles 
> >> > > In order to help people to be able to make further suggestions, maybe 
> >> > > you could give the relative ordering you would like to see between the 
> >> > > following sets of profiles. For example, between SSR, SRS and RSS, I 
> >> > > think the order you want is RSS > SRS > SSR. 
> >> > > 
> >> > > 1: SSR, SRS, RSS 
> >> > > 
> >> > > 2: RSI, RIS, SRI, SIR, IRS, ISR 
> >> > > 
> >> > > 3: SSSR, SSRS, SRSS, RSSS 
> >> > > 
> >> > > 4: RRSSS, RSSSR, RIIII, SRIII, RSIII, IIIRS, IIISR 
> >> > > 
> >> > The order of the results is given by the ID of the drug in the extraction tool. 
> >> > I could probably order them by family and hierarchy of potence but 
> >> > would that make a difference? 
> >> > > It's possible you could make do with an extra field (or digits) that 
> >> > > gives some measure of the relative ordering between otherwise similar 
> >> > > sequences. For example, using base 10 (for convenience of arithmetic) 
> >> > > both RRSSI and RSRSI would score 212xx but the last xx would reflect the 
> >> > > positioning of the results in the sequence. There are lots of way to do 
> >> > > this. One way would be use, as you were thinking, some sort of weighted 
> >> > > count. Using S=0, I=1 and R=2 with weights 
> >> > > 
> >> > > 54321 
> >> > > RRSSI scores 2*10000 + 1*1000 + 2*100 + 2*(5+4) + 0*(3+2) + 1*1 = 21219 
> >> > > RSRSI scores 2*10000 + 1*1000 + 2*100 + 2*(5+3) + 0*(4+2) + 1*1 = 21217 
> >> > > 
> >> > So to be sure that I am following: 
> >> > 
> >> > 2*(5+4) = value of R (=2) * position of R(@5 and @4) 
> >> > 2*(5+3) = value of R (=2) * position of R(@5 and @3) 
> >> > 
> >> > 0*(3+2) = value of S (=0) * position of S(@3 and @2) 
> >> > 0*(4+2) = value of S (=0) * position of S(@4 and @2) 
> >> > 
> >> > 1*1 = value of I (=1) * position of I (@1) 
> >> > 
> >> > 2*10000 + 1*1000 + 2*100 Is just used as padding? So 212 could be any other 
> >> > number? 
> >> > 
> >> Eh forget the last sentence, brain fart: I have 2 R's so 2*10000, 1 I so 1*1000 and 2 S's so 2*100 
> >> > But in this example I would have to keep both as drug 5,2 and 1 are common 
> >> > to both results but 4 and 3 are unique. 
> >> > 
> >> > The score would be completely misleading. 
> >> > 
> >> > So if my table has a width of 20 columns the first column would be 
> >> > 10^20, the next 10^19,.... +/- a few 0s off? 
> >> > 
> >> > I would have to implement it and see what I get as result. 
> >> > > If you absolutely must never get duplicate numbers, but you still want 
> >> > > to preserve a strict specified ordering, I think you will have much more 
> >> > > work to do. 
> >> > > 
> >> > > Getting a unique number for each case is trivial (but the ordering will 
> >> > > be wrong) and getting an ordering that rates every R > every S > every I 
> >> > > is also trivial, but there will be lots of duplicates. It's finding the 
> >> > > balance that's going to be hard. 
> >> > > 
> >> > > -- 
> >> > > Ben. 
> >> > I have prepared a cleaned up Excel workbook with only the duplicates which 
> >> > pose problems. The ones I would keep have an orange ID. 
> >> > I could upload it to Github. If that helps understanding the different cases. 
> >> > 
> >> > Thanks for your patience 
> >> > 
> >> > Laurent 
> > 
> > Ben,
> Posts crossed. You should probably ignore my last as it was written 
> before I saw this one.
> > I have implemented your solution but I don't understand the reason why S would have a value of 0? 
> > I then don't need to take care of the S'es because the result will always be 0. Not that it changes a lot 
> > 
> > Because I still couldn't choose the profile of interest only based on the numbers. 
> > 
> > Profile      Ben's solution   Mine
> > R R S S I    212 11           212 1205
> > R S R S I    212 13           212 1405
> > R R R S I    311 17           311 1805
> > R S R R I    311 21           311 1407
> > S R R R I    311 23           311 1607
> > 
> > 311 17 and 311 23 being the most likely but unclear where the 
> > difference might be.
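For reference, the "Ben's Solution" column in the table above seems to be reproduced by the sketch below. It assumes S=0, I=1, R=2 (the values quoted earlier in the thread) and positions counted from the left starting at 1; the position convention is inferred from the numbers in the table, not stated anywhere. The function names are made up for illustration, and the count prefix assumes fewer than 10 results of each kind.

```python
from collections import Counter

# Values quoted earlier in the thread: S=0, I=1, R=2.
VALUE = {"S": 0, "I": 1, "R": 2}


def count_prefix(profile):
    """Counts of R, I and S, in that order, as one digit each (e.g. '212')."""
    counts = Counter(profile)
    return f"{counts['R']}{counts['I']}{counts['S']}"


def position_score(profile):
    """Sum of value * position, positions numbered 1..n from the left (assumed)."""
    return sum(VALUE[r] * pos for pos, r in enumerate(profile, start=1))


def ben_code(profile):
    """Count prefix followed by the weighted position score."""
    return f"{count_prefix(profile)} {position_score(profile)}"


for p in ["RRSSI", "RSRSI", "RRRSI", "RSRRI", "SRRRI"]:
    print(p, ben_code(p))
```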
> This is what is so frustrating for me. What do you mean, most likely? 
> What do you mean by what the difference might be? Can you describe to 
> me, as a human being, which you would choose and tell me how you 
> decided. If you can't do that then all you are doing is trying random 
> schemes until something pops up that looks right for some specific set of 
> data!

311 17 and 311 23 have the most R's and are the most different from each
other: 311 23 has an R in a position where 311 17 doesn't.

311 21 gets deleted because it has no R that the other two don't already have.
212 11 and 212 13 are also deleted: fewer R's, nothing unique.

> > I have adapted my current solution to include the number of R,I,S 
> > weight of the results: S=1, I=2, R=3 
> > weight of the position in the triplet: 1st=1, 2nd=2, 3rd=3 
> > 
> > ie.: R R R => First triplet: 1*3+2*3+3*3 = 18 
> > S I => Second triplet 1*1+2*2 = 05 
> > 
> > RIS count: 311 
> > Append 1st triplet: 311 18 
> > Append 2nd triplet: 311 18 05 
> > 
> > 311 18 05 and 311 16 07 being the most likely with some clues which 
> > triplet is different.
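A minimal sketch of the triplet scheme quoted above, assuming the profile is a plain string of S/I/R with no empty cells, and that the last group may be shorter than three. The function names are made up for illustration.

```python
from collections import Counter

# Result weights from the quoted description: S=1, I=2, R=3.
RESULT_WEIGHT = {"S": 1, "I": 2, "R": 3}


def triplet_scores(profile):
    """Score each group of three results: position in triplet * result weight."""
    scores = []
    for start in range(0, len(profile), 3):
        chunk = profile[start:start + 3]
        scores.append(
            sum(pos * RESULT_WEIGHT[r] for pos, r in enumerate(chunk, start=1))
        )
    return scores


def triplet_code(profile):
    """R/I/S counts followed by the two-digit score of each triplet."""
    counts = Counter(profile)
    head = f"{counts['R']}{counts['I']}{counts['S']}"
    return " ".join([head] + [f"{s:02d}" for s in triplet_scores(profile)])


print(triplet_code("RRRSI"))   # 311 18 05
print(triplet_code("SRRRI"))   # 311 16 07
print(triplet_scores("SSR"), triplet_scores("RRS"))   # [12] [12]
```

The last line shows the collision mentioned in the thread: SSR and RRS both come out as 12, so a triplet score alone cannot tell those two profiles apart.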
> It sounds like you want the result to "give some clues". Why not just 
> return the string of letters? SRRRI tells you everything about the 
> tests. What more could you want? If you want these ordered by number 
> R, I and S counts, put these first always using two digits: 
> 
> "030101SRRRI" 
> 
> This string will sort the important, highly resistant strains to the top 
> and also gives all the information about the individual tests.
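The string key suggested above could be built roughly like this. It is only a sketch: the function name is made up, and it assumes the profile contains nothing but S, I and R.

```python
from collections import Counter


def sort_key(profile):
    """Two-digit counts of R, I and S, followed by the profile itself."""
    counts = Counter(profile)
    return f"{counts['R']:02d}{counts['I']:02d}{counts['S']:02d}{profile}"


profiles = ["RRSSI", "RSRSI", "RRRSI", "RSRRI", "SRRRI"]

# Highest R count first; ties are broken by the letters themselves.
ranked = sorted(profiles, key=sort_key, reverse=True)

print(sort_key("SRRRI"))   # 030101SRRRI
print(ranked)
```

Because the counts are zero-padded, plain string comparison orders the keys the same way a numeric comparison of the counts would, as long as no count exceeds 99.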
> > Am I not somehow introducing a bias by multiplying the value with the position in the triplet? 
> > And then there is still the case where SSR (1*1+2*1+3*3=12) and RRS (1*3+2*3+3*1=12) 
> > will both resolve to the same value.
> Eh? Don't you want more Rs to get high scores? That's what the counts 
> are for.

No, more R's are not always better. If a strain has an R in a place where none
of the others have one, I have to keep that one too, even with a low score.

Because of that stupid requirement:

> >> > The requirements are one strain of a certain microorganism/patient 
> >> > The most resistant one or if they have different profiles 
> >> > 
> >> > SRS vs RRS => last one, more Rs 
> >> > 
> >> > SRS vs RSR = both, different profiles
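Taken pairwise, the two examples in that requirement can be sketched as a filter on the positions of the R's: a profile is dropped when every one of its R's also appears in some other profile that has at least one more. This is only one possible reading, with assumptions: all profiles list the drugs in the same order, I's and empty cells are ignored, profiles with identical R positions count as duplicates, and the function names are made up.

```python
def r_positions(profile):
    """Set of column indices where the result is R."""
    return {i for i, r in enumerate(profile) if r == "R"}


def keep_undominated(profiles):
    """Drop a profile whose R positions are all contained in another one's.

    Profiles with identical R positions are treated as duplicates and only
    the first is kept (an assumption; I's and empty cells are ignored).
    """
    kept = []
    for i, p in enumerate(profiles):
        rp = r_positions(p)
        dominated = False
        for j, q in enumerate(profiles):
            if i == j:
                continue
            rq = r_positions(q)
            # Proper subset: q has every R of p plus at least one more.
            if rp < rq or (rp == rq and j < i):
                dominated = True
                break
        if not dominated:
            kept.append(p)
    return kept


print(keep_undominated(["SRS", "RRS"]))   # ['RRS']
print(keep_undominated(["SRS", "RSR"]))   # ['SRS', 'RSR']
print(keep_undominated(["RRSSI", "RSRSI", "RRRSI", "RSRRI", "SRRRI"]))
```

On the five-profile example from earlier in this message the filter keeps RRRSI, RSRRI and SRRRI, which is one more than the two picked by hand, so the pairwise rule alone does not fully capture the manual selection.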

The extraction tool I use does the stats for me. It's just that the
"client", Dr Dr something, wants it like that, so I have to do this
post-processing, which costs me a lot of time.

Here is the Excel file with my test data:
https://github.com/Chutulu/Bacterio-Statistiques.git

The strains with their IDs in orange are the ones I keep.

Strain b: both have the same number of R's but in different places, i.e.
different drugs, so different behaviour if you prescribe antibiotics.

Strain d: I keep the one with empty cells because it has an R the other one
doesn't have.

So the rule is: most R's and/or uniqueness.

If a doctor knows the name of the microorganism but doesn't have the
results of antimicrobial susceptibility testing, then he can use the data
I have provided to choose an empirical treatment.

So I have to consider all unique strains but be very pessimistic about their
susceptibility (most R's).

It was the Dr Dr client's idea to print them as a plastic pocket table.
Fortunately my name isn't mentioned anywhere.

> > Wouldn't I need some sort of Traveling Salesman Problems algorithm to find the profile 
> > with the highest number of resistances and the highest number of 
> > triplets with high values.
> I don't understand the triplets idea. Sorry. 

No problem. I used some application-specific ideas when creating it.

I think it might be usable.
I just have to figure out how.

If you want to leave the discussion, I can understand. Sometimes it helps me
to talk with someone who has no clue what I am talking about, to see
things more clearly.

My maths skills are very weak, which is why I am an MTA: no maths required.
The most difficult things I have to calculate are dilutions. So that is one
reason for the miscommunication.

The other is the application-specific context of microbiology.
I have been doing that for 18 years.

Staring too long at the results has probably also fried too many
neurons.

> 
> -- 
> Ben.

Thanks for your time and patience

Laurent

Thread overview: 25+ messages
2021-12-27  9:21 Some advice required [OT] Laurent
2021-12-27 11:16 ` Niklas Holsti
2021-12-27 12:29   ` Laurent
2021-12-27 13:14     ` Ben Bacarisse
2021-12-27 18:24       ` Laurent
2021-12-27 19:51         ` Dennis Lee Bieber
2021-12-27 20:49         ` Ben Bacarisse
2021-12-27 22:09           ` Laurent
2021-12-28  0:29             ` Ben Bacarisse
2021-12-28  7:48               ` Laurent
2021-12-28  9:05                 ` Laurent
2021-12-28 12:54                   ` Laurent
2021-12-28 13:57                     ` Ben Bacarisse
2021-12-28 18:19                       ` Laurent [this message]
2021-12-28 13:43                 ` Ben Bacarisse
2021-12-28 16:49                 ` Dennis Lee Bieber
2021-12-29  4:20                   ` Randy Brukardt
2021-12-27 17:41     ` Dennis Lee Bieber
2021-12-27 18:56       ` Niklas Holsti
2021-12-27 19:44         ` Laurent
2021-12-28  2:10     ` Randy Brukardt
2021-12-28  6:02       ` Laurent
2021-12-29  3:58         ` Randy Brukardt
2021-12-27 17:18 ` Simon Wright
2021-12-27 18:30   ` Laurent