From mboxrd@z Thu Jan  1 00:00:00 1970
X-Received: by 2002:a05:620a:4495:: with SMTP id x21mr14074376qkp.633.1640677711877;
        Mon, 27 Dec 2021 23:48:31 -0800 (PST)
X-Received: by 2002:a25:bcc3:: with SMTP id l3mr725717ybm.148.1640677711691;
 Mon, 27 Dec 2021 23:48:31 -0800 (PST)
Path: eternal-september.org!reader02.eternal-september.org!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.ada
Date: Mon, 27 Dec 2021 23:48:31 -0800 (PST)
In-Reply-To: <87y2456071.fsf@bsb.me.uk>
Injection-Info: google-groups.googlegroups.com; posting-host=213.166.55.173; posting-account=sDyr7QoAAAA7hiaifqt-gaKY2K7OZ8RQ
NNTP-Posting-Host: 213.166.55.173
References: <7bede061-4b0f-4029-beb1-1056637e57d6n@googlegroups.com>
 <j2tlk8FneraU1@mid.individual.net> <49538254-21ed-4fd0-8316-1bccc7d3c635n@googlegroups.com>
 <87sfue8a0v.fsf@bsb.me.uk> <31332c61-a370-43a5-bbe0-efe338ee6d8fn@googlegroups.com>
 <87fsqd7oz8.fsf@bsb.me.uk> <e5ab38f4-b456-4e99-b702-be4f88e8b5c1n@googlegroups.com>
 <87y2456071.fsf@bsb.me.uk>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <7f50b560-9d28-4572-a90c-7488fb27582en@googlegroups.com>
Subject: Re: Some advice required [OT]
From: Laurent <lutgenl@icloud.com>
Injection-Date: Tue, 28 Dec 2021 07:48:31 +0000
Content-Type: text/plain; charset="UTF-8"
Xref: reader02.eternal-september.org comp.lang.ada:63292
List-Id: <comp.lang.ada>

On Tuesday, 28 December 2021 at 01:29:57 UTC+1, Ben Bacarisse wrote:
> Laurent <lut...@icloud.com> writes: 
> 
> > On Monday, 27 December 2021 at 21:49:18 UTC+1, Ben Bacarisse wrote: 
> >> Laurent <lut...@icloud.com> writes: 
> >> 
> >> > On Monday, 27 December 2021 at 14:14:42 UTC+1, Ben Bacarisse wrote: 
> >> >> Laurent <lut...@icloud.com> writes: 
> >> >> 
> >> >> > On Monday, 27 December 2021 at 12:16:27 UTC+1, Niklas Holsti wrote: 
> >> >> > 
> >> >> >> Sorry, but I found your problem description impossible to understand. 
> >> >> >> Try to describe more clearly the experiment that is done, the structure 
> >> >> >> of the data the experiment provides (the meaning of the Excel rows and 
> >> >> >> columns), and the statistic you want to compute. 
> >> >> > 
> >> >> > Sorry tried to keep it short, was too short. 
> >> >> > 
> >> >> > Columns are the antimicrobial drugs 
> >> >> > Rows are the microorganism. 
> >> >> > 
> >> >> > So every cell contains a result of S, I, R or simply an empty cell 
> >> >> > 
> >> >> > S = Sensible 
> >> >> > I = Intermediate 
> >> >> > R = Resistant 
> >> >> > 
> >> >> > empty cell <S<I<R 
> >> >> > 
> >> >> > If a patient has 3 strains of the same microorganism but with 
> >> >> > different resistance profiles I have to find the most resistant 
> >> >> > one. Or if they are different I keep them all. 
> >> >> > 
> >> >> > I have no idea how to explain what I am doing to the compiler. 
> >> >> I think when you can explain it to people, you'll be able to code it. I 
> >> >> am still struggling to understand what you need. 
> >> >> > Why I would choose result from strain B over the result from strain A. 
> >> >> > 
> >> >> > strain A: SSSRSS 
> >> >> > strain B: SSRRRS 
> >> >> Let's space it out 
> >> >> 
> >> >> drug 1 drug 2 drug 3 drug 4 drug 5 drug 6 
> >> >> strain A S S S R S S 
> >> >> strain B S S R R R S 
> >> >> 
> >> >> You want to choose B because it has is resistant to more drugs, yes? 
> >> >> 
> >> > 
> >> > Yes indeed 
> >> > 
> >> >> I think, from the ordering you give, you need a measure that treats an R 
> >> >> as "more important" that any "I" which is "more important" than an "S". 
> >> >> (We will come to empty cells later.) 
> >> >> 
> >> >> I think you need to treat the number of Rs, Is and Ss like digits in a 
> >> >> number. In base 10, the strains score 
> >> >> 
> >> >> R S I 
> >> >> strain A 1 5 0 = 150 
> >> >> strain B 3 3 0 = 330 
> >> >> 
> >> >> Now, in fact, you don't need to use base 10. The smallest base you can 
> >> >> use is one more than the maximum number of test results. If there can 
> >> >> be up to 16 tests (say) the score is 
> >> >> 
> >> >> n(R)*17*17 + n(S)*17 + n(I). 
> >> >> 
> >> >> If this suits your needs, we can consider empty cells later on. It's 
> >> >> not at all clear to me how to compare 
> >> >> 
> >> >> strain C R____ 
> >> >> strain D RRSSSS 
> >> >> 
> >> >> Strain C is "less resistant" but only because there is not enough 
> >> >> information. In fact it seems more serious as it is resistant to all 
> >> >> tested drugs. 
> >> >> 
> >> > 
> >> > Strain C is probably garbage and I would remove it. With a bit of luck I will have the result with the same sample Id which would be complete. 
> >> > 
> >> >> And then what about 
> >> >> 
> >> >> strain D SR 
> >> >> strain E RS 
> >> >> 
> >> > 
> >> > Yes those are the cases which are annoying me. 
> >> > 
> >> > That's why I came up with the idea of multiplying the value of the result 
> >> > (S=1, I=2 and R=3) with the position of the value. Tried it with 
> >> > triplets but there will still be cases where different results will 
> >> > give the same numeric value. Ignoring empty cells for the moment. 
> >> > 
> >> > Strain F: SSR (1*1+2*1+3*3) =12 and Strain G: RRS (1*3+ 2*3+3*1) = 12 
> >> > will be the same numerical value but they are different resistance 
> >> > profiles I would in this case keep both. 
> >> > 
> >> > How to prevent that from happening. 
> >> Can you first say why the suggestion I made is not helpful? 
> >> 
> >> -- 
> >> Ben. 
> > 
> > You mean that one: 
> > 
> >> >> I think you need to treat the number of Rs, Is and Ss like digits in a 
> >> >> number. In base 10, the strains score 
> >> >> 
> >> >> R S I 
> >> >> strain A 1 5 0 = 150 
> >> >> strain B 3 3 0 = 330 
> >> >> 
> > 
> > Different resistance profiles same result:
> I don't yet understand the requirements so I am taking it in stages. 
> The first requirement seemed to be "more or less resistant". To do that 
> you can use digits in a large enough base but this will make the number 
> of Rs, Ss and Is paramount. Is that acceptable as a first step? 
> 

The requirement is to keep one strain of a given microorganism per patient:
the most resistant one. If the strains have different profiles, I keep them all.

SRS vs RRS => last one, more Rs

SRS vs RSR => both, different profiles
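
One possible reading of this rule, sketched in Python. It assumes that
"more resistant" means "at least as resistant for every drug, and strictly
more resistant for at least one", and that all profiles have the same length
with "_" standing for an empty cell. The names RANK, dominates and keep are
made up for this illustration.

```python
# Ordering taken from earlier in the thread: empty cell < S < I < R.
RANK = {"_": 0, "S": 1, "I": 2, "R": 3}

def dominates(a, b):
    """True if profile a is >= b for every drug and > b for at least one."""
    pairs = list(zip(a, b))
    return (all(RANK[x] >= RANK[y] for x, y in pairs)
            and any(RANK[x] > RANK[y] for x, y in pairs))

def keep(profiles):
    """Drop exact duplicates, then drop every profile dominated by another."""
    unique = list(dict.fromkeys(profiles))  # preserves the input order
    return [p for p in unique if not any(dominates(q, p) for q in unique)]

print(keep(["SRS", "RRS"]))  # ['RRS']
print(keep(["SRS", "RSR"]))  # ['SRS', 'RSR']
```

Under this assumption no single number is needed: two profiles are only
ranked when one is at least as resistant everywhere, otherwise both stay.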

> In order to help people to be able to make further suggestions, maybe 
> you could give the relative ordering you would like to see between the 
> following sets of profiles. For example, between SSR, SRS and RSS, I 
> think the order you want is RSS > SRS > SSR. 
> 
> 1: SSR, SRS, RSS 
> 
> 2: RSI, RIS, SRI, SIR, IRS, ISR 
> 
> 3: SSSR, SSRS, SRSS, RSSS 
> 
> 4: RRSSS, RSSSR, RIIII, SRIII, RSIII, IIIRS, IIISR 
> 

The order of the results is given by the ID of the drug in the extraction tool.
I could probably order them by family and hierarchy of potency, but
would that make a difference?

> It's possible you could make do with an extra field (or digits) that 
> gives some measure of the relative ordering between otherwise similar 
> sequences. For example, using base 10 (for convenience of arithmetic) 
> both RRSSI and RSRSI would score 212xx but the last xx would reflect the 
> positioning of the results in the sequence. There are lots of way to do 
> this. One way would be use, as you were thinking, some sort of weighted 
> count. Using S=0, I=1 and R=2 with weights 
> 
> 54321 
> RRSSI scores 2*10000 + 1*1000 + 2*100 + 2*(5+4) + 0*(3+2) + 1*1 = 21219 
> RSRSI scores 2*10000 + 1*1000 + 2*100 + 2*(5+3) + 0*(4+2) + 1*1 = 21217 
> 

So, to be sure that I am following:

2*(5+4) = value of R (=2) * position of R(@5 and @4)
2*(5+3) = value of R (=2) * position of R(@5 and @3)

0*(3+2) = value of S (=0) * position of S(@3 and @2)
0*(4+2) = value of S (=0) * position of S(@4 and @2)

1*1 = value of I (=1) * position of I (@1)

Is 2*10000 + 1*1000 + 2*100 just used as padding? So could 212 be any other
number?
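
For reference, a small Python sketch that reproduces the two scores quoted
above. It assumes the leading digits 2, 1, 2 are the counts of R, I and S
results (as in the earlier "digits in a number" suggestion), and that base 10
is only safe while each count is at most 9 and the weighted tail stays below
100. The function name weighted_score is made up.

```python
VALUES = {"S": 0, "I": 1, "R": 2}

def weighted_score(profile):
    """Counts of R, I, S as leading digits, then a position-weighted tail."""
    n = len(profile)
    weights = range(n, 0, -1)  # leftmost result gets the largest weight
    tail = sum(VALUES[r] * w for r, w in zip(profile, weights))
    prefix = (profile.count("R") * 100
              + profile.count("I") * 10
              + profile.count("S"))
    return prefix * 100 + tail  # two decimal digits reserved for the tail

print(weighted_score("RRSSI"))  # 21219
print(weighted_score("RSRSI"))  # 21217
```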

But in this example I would have to keep both, as drugs 5, 2 and 1 have the
same result in both profiles but drugs 4 and 3 differ.

Ranking them by score (21219 vs 21217) would be completely misleading.

So if my table has a width of 20 columns, the first column would be weighted
10^20, the next 10^19, and so on, give or take a few zeros?
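
A rough sketch of how large the numbers might get for a wide table. It
assumes the scheme quoted above (three count fields followed by one
weighted tail), S=0, I=1, R=2, no empty cells, and the smallest field sizes
that cannot overflow: base width+1 for each count, and width*(width+1)+1
for the tail. The function name packed_score is made up.

```python
def packed_score(profile):
    """Pack the R, I, S counts and the weighted tail into one integer."""
    values = {"S": 0, "I": 1, "R": 2}
    n = len(profile)
    count_base = n + 1              # each count lies in 0..n
    tail_modulus = n * (n + 1) + 1  # tail is at most 2 * (1 + 2 + ... + n)
    tail = sum(values[r] * w for r, w in zip(profile, range(n, 0, -1)))
    score = 0
    for count in (profile.count("R"), profile.count("I"), profile.count("S")):
        score = score * count_base + count
    return score * tail_modulus + tail

# Largest possible score for 20 columns: every result is R.
print(packed_score("R" * 20))  # 3713640
```

Under these assumptions the score grows with the number of fields (four),
not with the number of columns, so it stays well within a 32-bit integer
for 20 drugs.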

I would have to implement it and see what results I get.

> If you absolutely must never get duplicate numbers, but you still want 
> to preserve a strict specified ordering, I think you will have much more 
> work to do. 
> 
> Getting a unique number for each case it trivial (but the ordering will 
> be wrong) and getting an ordering that rates every R > every S > every I 
> is also trivial, but there will be lots of duplicates. It's finding the 
> balance that's going to be hard. 
> 
> -- 
> Ben.

I have prepared a cleaned-up Excel workbook containing only the duplicates
that pose problems. The ones I would keep have an orange ID.
I could upload it to GitHub if that helps with understanding the different cases.

Thanks for your patience

Laurent